[StarCluster] SGE slow with 50k+ jobs

Jacob Barhak jacob.barhak at gmail.com
Sun Mar 30 02:37:45 EDT 2014


Hi to SGE experts,

This is not really a StarCluster-specific issue. It is an SGE issue I ran into, and I was hoping someone here could help with it.

My system runs many small jobs, tens of thousands of them. When I need a rushed solution to meet a deadline, I use StarCluster. However, if I have time, I run the simulations on a single 8-core machine that has SGE installed on Ubuntu 12.04. This machine is new, fast, and freshly installed, with an SSD drive, yet I am encountering issues.

1. Once I have launched about 70k jobs, submitting a single new job to the queue takes a while: about a second or so, compared to a fraction of a second when the queue is empty.

2. Deleting all the jobs from the queue using qdel -u username takes a long time. It reports about 24 deletions to the screen every few seconds; at this rate it will take hours to delete the entire queue. It is still deleting as I write these words. That is far too long.

3. The system was working OK for a few days, yet it now has trouble with qmaster. It reports the following:
error: commlib error: got select error (connection refused) 
unable to send message to qmaster using port 6444 on host 'localhost': got send error. 
Also, qmon reported that it cannot reach qmaster. I had to restart, and then suspend and disable the queue.
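
For reference, the "hours" estimate in item 2 can be checked with a quick back-of-the-envelope calculation. This is only a sketch: the 70k job count and the 24 deletions per burst come from the observations above, while reading "every few seconds" as 3 seconds is an assumption, and the function name is made up for illustration.

```python
def estimated_drain_seconds(total_jobs, jobs_per_burst, seconds_per_burst):
    """Time to delete the whole queue if the observed rate stays constant."""
    return total_jobs / jobs_per_burst * seconds_per_burst


# 70k queued jobs, about 24 deletions reported per burst (both observed).
# ASSUMPTION: "every few seconds" is taken to mean 3 seconds per burst.
seconds = estimated_drain_seconds(70_000, 24, 3.0)
print(f"{seconds:.0f} s, roughly {seconds / 3600:.1f} hours")
```

Under that assumption the deletion alone takes well over two hours, and longer if the bursts are further apart.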

Note that qstat -j currently reports:
All queues dropped because of overload or full

Note that I set the schedule interval to 2 seconds. Many of my jobs are so fast that even 2 seconds is very inefficient for them, yet some are longer and memory consuming, so I cannot add more slots without launching too many jobs at once.
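
To illustrate why a 2 second interval hurts the short jobs, here is a rough model in Python. It assumes, as a simplification, that a slot sits idle for one full schedule interval between consecutive jobs; the real idle time depends on when each job finishes relative to the scheduler run. The function name and the sample runtimes are hypothetical.

```python
def slot_utilization(job_seconds, idle_seconds_between_jobs):
    """Fraction of wall-clock time a slot spends running jobs.

    ASSUMPTION: the slot waits a fixed idle period (here, one schedule
    interval) before the scheduler hands it the next job.
    """
    return job_seconds / (job_seconds + idle_seconds_between_jobs)


SCHEDULE_INTERVAL = 2.0  # seconds, as configured above

# Hypothetical job runtimes, chosen only to show the trend.
for runtime in (0.5, 2.0, 60.0):
    share = slot_utilization(runtime, SCHEDULE_INTERVAL)
    print(f"{runtime:5.1f} s job: slot busy {share:.0%} of the time")
```

In this model a half-second job keeps its slot busy only a small fraction of the time, while a one-minute job is barely affected.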

Am I overloading the system with too many jobs? What is the limit on a single strong machine? How will this scale when I run this on StarCluster?

Any advice on how to efficiently handle many jobs, some of which are very short, will be appreciated. And I hope this interests the audience. 
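
One approach that might help, sketched here in Python, is to group many short commands into a single SGE job so that the queue holds hundreds of entries instead of tens of thousands. This is a minimal sketch and not a tested setup: the `./simulate --case N` command line, the batch size of 100, and both function names are hypothetical, and the step of writing each script to disk and passing it to qsub is left out.

```python
def chunk_commands(commands, batch_size):
    """Split a list of shell commands into consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [commands[i:i + batch_size]
            for i in range(0, len(commands), batch_size)]


def render_batch_script(batch):
    """Build the text of one shell script that runs a batch sequentially."""
    lines = ["#!/bin/sh",
             "# One SGE job that runs several short commands back to back."]
    lines.extend(batch)
    return "\n".join(lines) + "\n"


# HYPOTHETICAL command line; substitute the real simulation invocation.
commands = [f"./simulate --case {i}" for i in range(70_000)]
batches = chunk_commands(commands, 100)

print(len(batches), "jobs to submit instead of", len(commands))
print(render_batch_script(batches[0][:3]))
```

The long, memory-consuming jobs would probably need to stay as separate submissions, since packing them together with short ones would make a batch's runtime and memory use hard to predict.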

        Jacob


