[StarCluster] SGE slow with 50k+ jobs

Rayson Ho raysonlogin at gmail.com
Mon Mar 31 01:53:29 EDT 2014


Hmm, didn't know that you are running that cluster in your own data center!

1) Do you have any parallel jobs in the system? (ie. any PEs used?)

2) Like Chris mentioned in his email, I think BerkeleyDB would help in
this case. The classic spooling method can be slow when you are
talking about 70K+ jobs and the run time of each job is short.

3) Finally, did you get anything from syslog & /var/log/messages? If
you have 20 slots on a 16GB machine, then qmaster can use up 1-2GB --
in each scheduling cycle the scheduler thread (we used to have a
separate scheduler process) needs to duplicate quite a lot of the data
that the qmaster thread sends it, and then frees it when the scheduling
cycle is done (ie. the peak usage can be 1.5X the normal usage). With
14GB left that is shared by 20 jobs, each job can only use 700MB
max. I believe syslog can give you some hints whether the OOM killer
did kick in or not.
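
As a rough sketch of the arithmetic above and of what to look for in
syslog (the function names and the sample log line are made up for
illustration, and the patterns are only the usual kernel OOM phrases,
not an exhaustive list):

```python
import re

def per_job_memory_mb(total_gb, qmaster_gb, slots):
    """Memory left per slot, in MB, after setting aside qmaster_gb.

    Uses 1 GB = 1000 MB, matching the back-of-the-envelope numbers above.
    """
    return (total_gb - qmaster_gb) * 1000 / slots

# Phrases the Linux kernel commonly logs when the OOM killer fires.
# Assumption: the exact wording varies between kernel versions.
OOM_PATTERN = re.compile(r"oom-killer|Out of memory|Killed process",
                         re.IGNORECASE)

def find_oom_lines(lines):
    """Return the log lines that look like OOM killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]

if __name__ == "__main__":
    # 16GB machine, 2GB set aside for qmaster at peak, 20 slots.
    print(per_job_memory_mb(16, 2, 20))

    # Hypothetical sample lines; on a real system read /var/log/syslog
    # or /var/log/messages instead (may need root).
    sample = [
        "Mar 30 12:00:00 host kernel: Out of memory: Kill process 1234 (sge_qmaster)",
        "Mar 30 12:00:01 host cron[99]: job started",
    ]
    for hit in find_oom_lines(sample):
        print(hit)
```

If sge_qmaster shows up in lines like these, the OOM killer is the
likely reason the qmaster went away.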

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html


On Mon, Mar 31, 2014 at 1:21 AM, Jacob Barhak <jacob.barhak at gmail.com> wrote:
> Thanks Rayson,
>
> The queue instance type is BIP - if you refer to what qmon reports. My machine has 16GB - I should be far from full use even under load, where only about half of the memory is used. No swap is used. All computations are done on one physical machine, not over the cloud, so no network issues should be considered.
>
> The theory that qmaster uses much memory makes sense. htop reports that sge_qmaster uses 5.8% of the 16GB RAM, ~900M-1100M, with ~40k jobs still in the queue. Yet there is plenty of free memory left and no swap is used.
>
> I do not know why qmaster died before, yet your theory that the kernel protection stopped it makes sense if memory use accumulates over time even if no jobs are submitted. Previously, qmaster died after about 2 days, having processed about 10k of the ~70k jobs submitted. It also died when I tried deleting all the jobs left in the queue after the restart. I am now more conservative and run fewer than 50k jobs - I had fewer issues with this number of jobs before, other than slow deletion. It seems 50k+ jobs is where things become much slower and fragile.
>
> If you have any suggestions on how to prevent qmaster dying with my intended usage and how to handle even more jobs I will appreciate those. Ideally I would like to run hundreds of thousands of jobs - in my case each job is a different scenario and the system runs those in parallel and collects results to compare them - a typical model competition where HPC should be the solution of use due to parallelization.
>
> By the way, despite my demands, I must add that SGE is a good tool compared to other queuing systems I worked with. I just think I am pressing it to the point it is near its limits and it needs special tuning. I still recommend SGE over other solutions.
>
>            Jacob
>
> Sent from my iPhone
>
> On Mar 30, 2014, at 10:31 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
>
>> Which instance type is used for the qmaster node? And how much free
>> memory do you have on that instance?
>>
>> For a qmaster with over 50K jobs, you can easily get a qmaster process
>> that uses over 1GB of memory. If your qmaster also runs jobs, then the
>> kernel OOM killer will likely pick the qmaster process as it uses that
>> much memory.
>>
>> Rayson
>>
>> ==================================================
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> http://gridscheduler.sourceforge.net/
>> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>>
>>
>> On Sun, Mar 30, 2014 at 11:23 PM, Jacob Barhak <jacob.barhak at gmail.com> wrote:
>>> Thanks Chris,
>>>
>>> This is helpful.
>>>
>>> Your suggestion to change to on demand scheduling makes sense. If I set
>>> flush_finish_sec to 1 sec I will get faster submission of my faster jobs
>>> that take less than a second. Currently I have 20 slots available so this
>>> should allow processing about 1200 of these in a minute. What I am afraid of
>>> is that the scheduler will get called 20 times - once for each such job.
>>> Since grid engine seg faulted in the last day under pressure I am afraid of
>>> stressing it to this level - so I am inquiring about behavior before I make
>>> this change on a running simulation. Does each event trigger another
>>> instance of grid engine to launch? Or are all 20 events handled at once a
>>> second after the first event about completion is received?
>>>
>>> Also, your suggestions do not address the issue that I find disturbing.
>>> Deleting jobs just takes forever when the machine is loaded with jobs. So if
>>> there is trouble it takes me about an hour just to clean the queue and
>>> restart everything from scratch. If there is an easier way to clear a queue
>>> of jobs it would help. I have a single queue, will deleting it and creating
>>> a new one be faster?
>>>
>>> By the way, my queue type according to qmon is BIP. I am running 20 slots
>>> with 8 CPUs and arch is lx26amd64.
>>>
>>> Can this queue type handle 50k+ jobs efficiently for submission/deletion? I
>>> think the information you provided gets launching queued jobs covered, the
>>> real issue is submission/deletion overhead. And I would like to know from
>>> others who launch that many dependent jobs about their experiences.
>>>
>>> Also, is there a way to tell grid engine to resubmit interrupted jobs under
>>> the same name and job number in case of failure. I have many dependencies
>>> and if a machine breaks during a long run I want grid engine to relaunch the
>>> jobs it was working on in a way that will maintain dependencies defined at
>>> submission time. Currently I had to write a script to recover from
>>> interrupted runs - if grid engine has this capability it could make things
>>> easier. And by the way, I need to submit many little jobs rather than job
>>> arrays partially because of these dependencies.
>>>
>>> I would appreciate learning about other people's experiences with 50k+ jobs.
>>>
>>>          Jacob
>>>
>>>
>>> Sent from my iPhone
>>>
>>> On Mar 30, 2014, at 6:35 AM, Chris Dagdigian <dag at bioteam.net> wrote:
>>>
>>>
>>> The #1 suggestion is to change the default scheduling setup over to the
>>> settings for "on demand scheduling" - you can do that live on a running
>>> cluster I believe. Google can help you with the parameters but it
>>> basically involves switching from the looping/cyclical scheduler
>>> behavior to an event based method where SGE runs a scheduling/dispatch
>>> cycle each time a job enters or leaves the system. This is purpose built
>>> for the "many short jobs" use case.
>>>
>>> The #2 suggestion is to make sure you use berkeleyDB based spooling
>>> (this is one scenario where I'll switch from my preference for classic
>>> spooling) AND make sure that you are spooling to local disk on each node
>>> instead of spooling back to a shared filesystem.  These settings cannot
>>> easily be changed on a live system so in starcluster land you might have
>>> to tweak the source code to come up with a different installation
>>> template that would put these settings in. (Note that I'm not sure what
>>> SC puts in by default so there is a chance that this is already set up
>>> the way you'd want ...)
>>>
>>> Other stuff:
>>>
>>> - SGE Array Jobs are way more efficient than single jobs. Instead of
>>> 70,000 jobs can you submit a single array job that has *70,000 tasks*?
>>> Or 7 jobs containing 10,000 tasks etc. ?
>>>
>>> - Is there any way to batch up or collect your short jobs into bigger
>>> collections so that the average job run time is longer? SGE is not the
>>> best at jobs that run for mere seconds or less - most people would batch
>>> those together to at least get a 60-90 second runtime ...
>>>
>>>
>>>
>>> -Chris
>>>
>>>
>>>
>>> Jacob Barhak wrote:
>>>
>>> Hi to SGE experts,
>>>
>>>
>>> This is less of a StarCluster specific issue. It is more of an SGE issue I
>>> encountered and was hoping someone here can help with.
>>>
>>>
>>> My system runs many many smaller jobs - tens of thousands. When I need a
>>> rushed solution to reach a deadline I use StarCluster. However, if I have
>>> time, I run the simulations on a single 8 core machine that has SGE
>>> installed over Ubuntu 12.04. This machine is new and fast with SSD drive and
>>> freshly installed yet I am encountering issues.
>>>
>>>
>>> 1. When I launch about 70k jobs submitting a single new job to the queue
>>> takes a while - about a second or so, compared to fractions of a second when
>>> the queue is empty.
>>>
>>>
>>> 2. Deleting all the jobs from the queue using qdel -u username takes a long
>>> time. It reports about 24 deletes to the screen every few seconds - at this
>>> rate it will take hours to delete the entire queue. It is still deleting
>>> while I am writing these words. Way too much time.
>>>
>>>
>>> 3. The system was working ok for a few days yet now has trouble with
>>> qmaster. It reports the following:
>>>
>>> error: commlib error: got select error (connection refused)
>>>
>>> unable to send message to qmaster using port 6444 on host 'localhost': got
>>> send error.
>>>
>>> Also qmon reported cannot reach qmaster. I had to restart and suspend and
>>> disable the queue.
>>>
>>>
>>> Note that qstat -j currently reports:
>>>
>>> All queues dropped because of overload or full
>>>
>>>
>>> Note that I configured the schedule interval to 2 seconds, since many of my
>>> jobs are so fast that even 2 seconds is very inefficient for them. Yet some
>>> are longer and memory consuming, so I cannot allow more slots, which would
>>> launch too many jobs.
>>>
>>>
>>> Am I overloading the system with too many jobs? What is the limit on a
>>> single strong machine? How will this scale when I run this on StarCluster?
>>>
>>>
>>> Any advice on how to efficiently handle many jobs, some of which are very
>>> short, will be appreciated. And I hope this interests the audience.
>>>
>>>
>>>       Jacob
>>>
>>>
>>> Sent from my iPhone
>>>
>>> _______________________________________________
>>>
>>> StarCluster mailing list
>>>
>>> StarCluster at mit.edu
>>>
>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>
>>>


