[StarCluster] sge node stops running jobs

Wed Oct 30 00:02:25 EDT 2013

I didn't know they were NFS mounted.  All my nodes are m1.xlarge.  They
aren't doing any I/O over NFS, just to local scratch space on the ephemeral
drives.  They do read some program files over NFS, but not much.  If it were
an NFS problem, I would expect to see an NFS error, or 'ls' hanging, but the
directory lists just fine and is empty.  The fact that it's NFS is weird and
doesn't fit the symptoms.  Maybe the mount got removed somehow and 'ls' is
showing the local directory?  Hmmm.  I'll have to investigate further next
time it happens.
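
One way to test the "mount got removed" theory the next time a node goes down is to look at the kernel's mount table instead of relying on 'ls'.  The following is only a minimal sketch, assuming a Linux node that exposes /proc/mounts; the helper names find_mount and describe are made up for illustration and are not part of StarCluster or SGE.

```python
import os


def find_mount(path, mounts_text):
    """Return (device, fstype) if `path` is itself a mount point, else None.

    `mounts_text` is expected in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    target = os.path.normpath(path)
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        device, mount_point, fstype = fields[:3]
        if os.path.normpath(mount_point) == target:
            # A later entry shadows an earlier one on the same mount point.
            found = (device, fstype)
    return found


def describe(path, mounts_text):
    """Summarize in one line whether `path` is still backed by NFS."""
    entry = find_mount(path, mounts_text)
    if entry is None:
        return "%s is NOT a mount point; 'ls' would show the local directory" % path
    device, fstype = entry
    if fstype.startswith("nfs"):
        return "%s is NFS-mounted from %s" % (path, device)
    return "%s is mounted from %s as %s (not NFS)" % (path, device, fstype)


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as f:
        table = f.read()
    # The two paths Rayson lists as NFS mounts by default.
    for p in ("/home", "/opt/sge6"):
        print(describe(p, table))
```

If this reports that /opt/sge6 is not a mount point on a node where the directory looks empty, that would point to a dropped mount rather than a hung or overloaded NFS server.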

On Tue, Oct 29, 2013 at 10:18 PM, Rayson Ho <raysonlogin at gmail.com> wrote:

> Since you mentioned that the entire /opt/sge6 directory is empty, it
> sounds like the NFS server (i.e. the StarCluster master) was not
> available at some point?
>
> What instance type are you using for the master?
>
> Note that by default /home & /opt/sge6 are NFS mounts, and thus if the
> nodes are doing lots of I/O, it can cause issues (NFS doesn't
> scale well, but 30 nodes should be OK unless there really is lots of
> I/O traffic).
>
> # mount
> ...
> master:/home on /home type nfs (rw,user=root,nosuid,nodev,vers=3,addr=10.125.9.29)
> master:/opt/sge6 on /opt/sge6 type nfs (rw,user=root,nosuid,nodev,vers=3,addr=10.125.9.29)
>
> Rayson
>
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>
>
> On Tue, Oct 29, 2013 at 8:43 PM, Ryan Golhar
> <ngsbioinformatics at gmail.com> wrote:
> > Hi all - I came across a weird problem that I experience every once in a
> > while, and recently more and more.  I've created a 30-node spot cluster
> > using starcluster.  I started a bunch of jobs on all the nodes and sge
> > shows all the jobs running.  I come back an hour or two later and check
> > on the cluster, and only half the nodes are listed as running jobs using
> > qstat.  qhost shows the nodes as down.  I can log into the nodes and sure
> > enough, sge_execd is not running.  On some of the nodes I can start the
> > service manually; on others, the entire /opt/sge6 directory is empty.  I
> > have no idea why this would be the case, especially since they were
> > running jobs to begin with.  Has anyone else seen this?
> >
> > _______________________________________________
> > StarCluster mailing list
> > StarCluster at mit.edu
> > http://mailman.mit.edu/mailman/listinfo/starcluster
> >
>