[StarCluster] Easy way to delete more than 100k jobs

Thu Feb 26 07:33:09 EST 2015

Hi Lyn, Hi Rayson,

You were trying to help; however, I described the problem insufficiently.

I just isolated the issue. It seems the problem is not the number of jobs.
It is the use of job dependencies with a wildcard in the job name, passed
through the -hold_jid option.

For example, if you launch 10k jobs named A0000 to A9999 and then 10k jobs
named B0000 to B9999, each depending on "A*", then the system should become
unstable. However, if each job in B depends on "A0000" and several other A
jobs named explicitly, without the wildcard character, then SGE does not
have any issues.
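
To illustrate the difference, here is a minimal sketch that builds an
explicit, comma-separated -hold_jid list instead of passing "A*". The
helper names (explicit_hold_list, submit), the DRY_RUN switch, and the
script name job.sh are hypothetical and only for illustration; the sketch
prints the qsub command rather than submitting anything unless DRY_RUN=0.

```shell
# Sketch only: replace a wildcard dependency ("A*") with explicit job names.
# Helper names, DRY_RUN, and job.sh are assumptions, not part of SGE.

# Print a comma-separated list of names PREFIX0000 .. PREFIX<count-1>.
explicit_hold_list() {
    prefix=$1
    count=$2
    list=""
    i=0
    while [ "$i" -lt "$count" ]; do
        name=$(printf '%s%04d' "$prefix" "$i")
        list="${list:+$list,}$name"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

# With DRY_RUN=1 (the default here) only print the qsub command.
submit() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "qsub $*"
    else
        qsub "$@"
    fi
}

# B0000 waits on three explicitly named A jobs instead of on "A*".
hold=$(explicit_hold_list A 3)
submit -N B0000 -hold_jid "$hold" job.sh
```

The wildcard form would be the same call with -hold_jid "A*". Which A jobs
each B job really needs depends on the workflow; three names are used here
only to keep the example short.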

I assume that when deleting the jobs, SGE tries to recalculate all jobs
depending on the related job. In the example above, for each one of the 10k
A jobs deleted, the system recalculates the list of 10k dependencies for
each of the 10k B jobs. This is a fair assumption considering the tests I
ran. However, I did not look at the SGE code to confirm that my assumption
is correct. Perhaps one of the experts can verify this?
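
As a rough back-of-the-envelope check of why this would hurt, the following
sketch multiplies out the numbers from the example. It rests entirely on
the unverified assumption above and gives an upper bound at best, since the
list of remaining A jobs shrinks as deletions proceed. The variable names
are made up for illustration.

```shell
# Rough cost estimate under the assumption above (not verified in SGE code).
a_jobs=10000        # jobs A0000..A9999 being deleted
b_jobs=10000        # jobs B0000..B9999, each holding on "A*"
deps_per_b=10000    # job names matched by "A*" for each B job

# One recalculation of every B job's dependency list per deleted A job.
total=$((a_jobs * b_jobs * deps_per_b))
echo "$total dependency checks"
```

With explicitly named dependencies, deps_per_b drops to the handful of names
listed per job, which would be consistent with SGE coping in that case.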

In any case, I recommend avoiding wildcard dependencies on job names
unless absolutely necessary. In the future, perhaps qdel * could remove
jobs in dependency-tree order, or recalculate dependencies only once,
after all deletions have taken place.

I hope this will help someone avoid pitfalls in the future.

            Jacob
 On Feb 23, 2015 3:33 PM, "Jacob Barhak" <jacob.barhak at gmail.com> wrote:

> Thanks Lyn,  Thanks Rayson,
>
> For those who may be reading this in the future looking for a solution,
> here is a partial solution.
>
> It does not reduce the time for deleting many jobs, yet it prevents the
> system from crashing multiple times in the attempt to delete the jobs.
>
> Here is what I did:
> while sleep 600; do timeout 480 qdel -u UserName ; done
>
> Just replace 600 with a safe period for the system to recover, 480 with
> the approximate time that the system runs before memory is exhausted, and
> UserName with your user. Those numbers will change from system to
> system.
>
> This will delete a chunk at a time without crashing the system. I am still
> waiting after about 9 hours, yet I did not need to restart the server due
> to SGE crashing.
>
> I hope this solution will help others.
>
>          Jacob
> On Feb 23, 2015 2:03 AM, "Rayson Ho" <raysonlogin at gmail.com> wrote:
>
>> Is your local cluster using classic or BerkeleyDB spooling? If it is
>> classic over NFS, then qdel can be very slow.
>>
>> One quick workaround is to hide the job spooling files manually, just
>> move the spooled jobs from $SGE_ROOT/$SGE_CELL/spool/qmaster/jobs to a
>> private backup directory.
>>
>> Rayson
>>
>> ==================================================
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> http://gridscheduler.sourceforge.net/
>> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>>
>>
>>
>> On Sun, Feb 22, 2015 at 8:31 PM, Jacob Barhak <jacob.barhak at gmail.com>
>> wrote:
>>
>>> Hi to SGE experts,
>>>
>>> This is an SGE question rather than StarCluster related. I am actually
>>> having this issue on a local cluster, and I did raise this issue a while
>>> ago, so sorry for the repetition. If you know of another list that can
>>> help, please direct me there.
>>>
>>> The qdel command does not respond well with a large number of jobs. More
>>> than 100k jobs makes things intolerable.
>>>
>>> It takes a long time and consumes too much memory if trying to delete
>>> all jobs.
>>>
>>> Is there a shortcut someone is aware of to clear the entire queue without
>>> waiting for many hours or the server running out of memory?
>>>
>>> Will removing the StarCluster server and reinstalling it work? If so,
>>> how can I bypass the long configuration? Are there several files that
>>> can do the trick if handled properly?
>>>
>>> I hope someone has a quick solution.
>>>
>>>           Jacob
>>>
>>> _______________________________________________
>>> StarCluster mailing list
>>> StarCluster at mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>
>>>
>>