[StarCluster] sge node stops running jobs

Ryan Golhar ngsbioinformatics at gmail.com
Tue Oct 29 20:43:39 EDT 2013


Hi all - I've come across a weird problem that I hit every once in a
while, and lately more and more often.  I've created a 30-node spot cluster
using starcluster.  I started a bunch of jobs on all the nodes, and sge
showed all the jobs running.  When I came back an hour or two later and
checked on the cluster, qstat listed only half the nodes as running jobs,
and qhost showed the missing nodes as down.  I can log into those nodes and,
sure enough, sge_execd is not running.  On some of the nodes I can start the
service manually; on others, the entire /opt/sge6 directory is empty.  I have
no idea why this would be the case, especially since the nodes were running
jobs to begin with.  Has anyone else seen this?
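
In case it helps anyone reproduce the check, here is a minimal Python sketch
of how the down nodes could be picked out of qhost output.  It assumes the
default qhost table layout, where the fourth column is LOAD and a host whose
execution daemon is unreachable shows "-" there; other SGE versions may print
extra columns.  The function names down_hosts and list_down_hosts are
hypothetical helpers, not part of StarCluster or SGE.

```python
import subprocess


def down_hosts(qhost_output):
    """Return hostnames whose LOAD column is '-' in qhost's table output.

    Assumption: columns are HOSTNAME ARCH NCPU LOAD ..., so LOAD is index 3.
    """
    down = []
    for line in qhost_output.splitlines():
        fields = line.split()
        # Skip blank lines and the dashed separator (fewer than 4 fields),
        # the header row, and the "global" pseudo-host.
        if len(fields) < 4 or fields[0] in ("HOSTNAME", "global"):
            continue
        if fields[3] == "-":
            down.append(fields[0])
    return down


def list_down_hosts():
    """Run qhost on the master node and return the hosts that look down."""
    result = subprocess.run(["qhost"], capture_output=True, text=True, check=True)
    return down_hosts(result.stdout)
```

Calling list_down_hosts() on the master gives the node names to log into and
inspect (for example, whether sge_execd is running and whether /opt/sge6 is
still populated).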

