[StarCluster] Easy way to delete more than 100k jobs
MacMullan, Hugh
hughmac at wharton.upenn.edu
Thu Feb 26 08:25:00 EST 2015
Hi Jacob:
Very interesting! Really good to know, and thanks for reporting back.
And FYI, there's also users at gridengine.org for 'everything GridEngine': https://gridengine.org/mailman/listinfo/users
Cheers,
-Hugh
From: starcluster-bounces at mit.edu On Behalf Of Jacob Barhak
Sent: Thursday, February 26, 2015 7:33 AM
Cc: starcluster
Subject: Re: [StarCluster] Easy way to delete more than 100k jobs
Hi Lyn, Hi Rayson,
You were trying to help; however, I described the problem insufficiently.
I just isolated the issue. It seems the problem is not the number of jobs. It is the use of job dependencies that contain a wildcard in the job name, passed with the -hold_jid option.
For example, if you launch 10k jobs named A0000 to A9999 and then 10k jobs named B0000 to B9999, each depending on "A*", the system can be expected to become unstable. However, if each job in B depends on "A0000" and several other A jobs named explicitly, without the wildcard character, then SGE does not have any issues.
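The two submission patterns described above could be sketched as follows. This is only an illustration: the helper explicit_hold_list, its parameters, and the script name job.sh are hypothetical and do not come from the thread. The qsub lines are left as comments so the snippet runs without an SGE installation.

```shell
# Build a comma-separated list of explicit job names, e.g. A0000,A0001,A0002.
# The function name and its arguments (prefix, first, last) are hypothetical.
explicit_hold_list() {
  local prefix="$1"
  local i="$2"
  local last="$3"
  local list=""
  while [ "$i" -le "$last" ]; do
    # Zero-pad the index to four digits to match names like A0000.
    list="${list}$(printf '%s%04d' "$prefix" "$i"),"
    i=$((i + 1))
  done
  # Drop the trailing comma.
  printf '%s\n' "${list%,}"
}

# Pattern reported as problematic at scale: each B job holds on the wildcard "A*".
#   qsub -N B0000 -hold_jid "A*" job.sh
#
# Pattern reported as fine: each B job holds on a few explicitly named A jobs.
#   qsub -N B0000 -hold_jid "$(explicit_hold_list A 0 2)" job.sh
```

Which A jobs a given B job really needs depends on the workflow; the range 0 to 2 here is arbitrary.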
I assume that when deleting the jobs, SGE tries to recalculate all jobs depending on the related job. In the example above, for each one of the 10k A jobs deleted, the system recalculates the list of 10k dependencies for each of the 10k B jobs. This is a fair assumption considering the tests I made. However, I did not look at the SGE code to confirm my assumption is correct. Perhaps one of the experts can verify this?
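If that assumption holds (it is unverified, as noted), the amount of work can be estimated with simple arithmetic. The function name below is hypothetical, and the count is only a rough model of the behaviour described, not a measurement of what SGE does.

```shell
# Rough work estimate under the unverified assumption above:
#   deletions * dependent jobs * dependencies per dependent job
# recalc_steps is a hypothetical helper, not an SGE command.
recalc_steps() {
  local deletions="$1"
  local dependents="$2"
  local deps_each="$3"
  echo $(( deletions * dependents * deps_each ))
}

# Wildcard case: 10k A jobs deleted, 10k B jobs, each holding on all 10k A jobs.
recalc_steps 10000 10000 10000   # prints 1000000000000

# Explicit case: each B job holds on, say, 3 named A jobs.
recalc_steps 10000 10000 3       # prints 300000000
```

Under this model the wildcard case is several thousand times more work than a handful of explicit dependencies, which would be consistent with the instability observed.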
In any case, I recommend avoiding wildcard dependencies on job names unless absolutely necessary. In the future, perhaps qdel * could remove jobs in dependency-tree order, or recalculate dependencies once after all deletions have taken place.
I hope this will help someone avoid pitfalls in the future.
Jacob
On Feb 23, 2015 3:33 PM, "Jacob Barhak" <jacob.barhak at gmail.com> wrote:
Thanks Lyn, Thanks Rayson,
For those who may be reading this in the future looking for a solution, here is a partial solution.
It does not reduce the time needed to delete many jobs, yet it prevents the system from crashing multiple times in the attempt to delete the jobs.
Here is what I did:
while sleep 600; do timeout 480 qdel -u UserName ; done
Just replace 600 with a period long enough for the system to recover, 480 with the approximate time the system can run before memory is exhausted, and UserName with your user name. Those numbers will vary from system to system.
This will delete a chunk at a time without crashing the system. I am still waiting after about 9 hours, yet I did not need to restart the server due to SGE crashing.
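The one-liner above loops until it is interrupted by hand. A variant that stops by itself once the user has no jobs left might look like the sketch below. The function name delete_in_chunks and its parameters are hypothetical; it assumes that qstat -u prints nothing when the user has no jobs, and that the coreutils timeout command is available.

```shell
# Delete a user's jobs in bounded time slices, resting between slices.
# delete_in_chunks and its arguments are hypothetical names for this sketch.
delete_in_chunks() {
  local user="$1"       # SGE user whose jobs are deleted
  local run_secs="$2"   # how long qdel may run per slice
  local rest_secs="$3"  # recovery pause between slices
  # Keep going while qstat still lists jobs for the user.
  # Note: an unreachable qmaster also yields no output, so the loop stops then too.
  while [ -n "$(qstat -u "$user" 2>/dev/null)" ]; do
    # timeout terminates qdel if it runs longer than run_secs.
    timeout "$run_secs" qdel -u "$user"
    # Give the system time to recover before the next slice.
    sleep "$rest_secs"
  done
}

# Example with the values from the message above:
#   delete_in_chunks UserName 480 600
```

Because the loop also ends when qstat cannot reach the qmaster, it is worth checking with qstat afterwards that the queue is really empty.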
I hope this solution will help others.
Jacob
On Feb 23, 2015 2:03 AM, "Rayson Ho" <raysonlogin at gmail.com> wrote:
Is your local cluster using classic or BerkeleyDB spooling? If it is classic over NFS, then qdel can be very slow.
One quick workaround is to hide the job spooling files manually: just move the spooled jobs from $SGE_ROOT/$SGE_CELL/spool/qmaster/jobs to a private backup directory.
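A minimal sketch of that move, for classic spooling only. The function name hide_spooled_jobs and the backup path in the usage comment are hypothetical. The thread does not say whether the qmaster should be stopped first; doing the move with it stopped is an assumption made here, not something confirmed above.

```shell
# Move everything inside the job spool directory into a backup directory,
# leaving the (now empty) spool directory itself in place.
# hide_spooled_jobs is a hypothetical helper, not part of SGE.
hide_spooled_jobs() {
  local jobs_dir="$1"
  local backup_dir="$2"
  if [ ! -d "$jobs_dir" ]; then
    echo "no such directory: $jobs_dir" >&2
    return 1
  fi
  mkdir -p "$backup_dir" || return 1
  # Move each top-level entry (files and subdirectories) one at a time.
  find "$jobs_dir" -mindepth 1 -maxdepth 1 -exec mv {} "$backup_dir"/ \;
}

# Assumed usage, with the qmaster stopped and a backup path of your choosing:
#   hide_spooled_jobs "$SGE_ROOT/$SGE_CELL/spool/qmaster/jobs" "$HOME/sge_jobs_backup"
```

Keeping the files in a backup directory rather than deleting them leaves room to restore them if something goes wrong.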
Rayson
==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
On Sun, Feb 22, 2015 at 8:31 PM, Jacob Barhak <jacob.barhak at gmail.com> wrote:
Hi to SGE experts,
This is an SGE question rather than a StarCluster one. I am actually having this issue on a local cluster, and I did raise this issue a while ago, so sorry for the repetition. If you know of another list that can help, please direct me there.
The qdel command does not cope well with a large number of jobs. More than 100k jobs makes things intolerable.
It takes a long time and consumes too much memory when trying to delete all jobs.
Is there a shortcut someone is aware of to clear the entire queue without waiting for many hours or the server running out of memory?
Will removing the StarCluster server and reinstalling it work? If so, how can the long configuration be bypassed? Are there several files that can do the trick if handled properly?
I hope someone has a quick solution.
Jacob
_______________________________________________
StarCluster mailing list
StarCluster at mit.edu
http://mailman.mit.edu/mailman/listinfo/starcluster