<div dir="ltr">Didn't know they were NFS mounted. All my nodes are m1.xlarge. They aren't doing any I/O over NFS, just local scratch space...ephemeral drives. They do read some program files over NFS, but not much. If it were an NFS problem, I would expect to see an NFS error, or 'ls' hanging, but the directory lists just fine and is empty. The fact that it's NFS is weird and doesn't fit the symptoms. Maybe the mount got removed somehow and 'ls' is showing the local directory? Hmmm. I'll have to investigate further next time it happens.</div>
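<div dir="ltr">A quick way to check that theory next time it happens (a sketch; the /opt/sge6 path and "master" hostname are just the StarCluster defaults mentioned below, so adjust if your config differs):</div>

```shell
#!/bin/sh
# If the NFS mount silently dropped, 'ls /opt/sge6' would show the
# empty local mountpoint underneath. Check what mount(8) reports:
if mount | grep -q ' /opt/sge6 type nfs'; then
    echo "/opt/sge6 is still NFS-mounted"
else
    echo "/opt/sge6 mount is gone - 'ls' is showing the local directory"
fi
```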
<div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Oct 29, 2013 at 10:18 PM, Rayson Ho <span dir="ltr"><<a href="mailto:raysonlogin@gmail.com" target="_blank">raysonlogin@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Since you mentioned that the entire /opt/sge6 directory is empty, it<br>
sounds like the NFS server (i.e. the StarCluster master) was not available<br>
at some point?<br>
<br>
What instance type are you using for the master?<br>
<br>
Note that by default /home & /opt/sge6 are NFS mounts, and thus if the<br>
nodes are doing lots of I/O, it can cause issues (NFS doesn't<br>
scale well, but 30 nodes should be OK unless there really is a lot of<br>
I/O traffic).<br>
<br>
# mount<br>
...<br>
master:/home on /home type nfs<br>
(rw,user=root,nosuid,nodev,vers=3,addr=10.125.9.29)<br>
master:/opt/sge6 on /opt/sge6 type nfs<br>
(rw,user=root,nosuid,nodev,vers=3,addr=10.125.9.29)<br>
<br>
Rayson<br>
<br>
==================================================<br>
Open Grid Scheduler - The Official Open Source Grid Engine<br>
<a href="http://gridscheduler.sourceforge.net/" target="_blank">http://gridscheduler.sourceforge.net/</a><br>
<a href="http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html" target="_blank">http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html</a><br>
<div><div class="h5"><br>
<br>
On Tue, Oct 29, 2013 at 8:43 PM, Ryan Golhar<br>
<<a href="mailto:ngsbioinformatics@gmail.com">ngsbioinformatics@gmail.com</a>> wrote:<br>
> Hi all - I came across a weird problem that I experience every once in a<br>
> while and recently more and more. I've created a 30-node spot cluster using<br>
> starcluster. I started a bunch of jobs on all the nodes and sge shows all<br>
> the jobs running. I come back an hour or two later and check on the cluster<br>
> and only half the nodes are listed as running jobs using qstat. qhost shows<br>
> the nodes as down. I can log into the nodes and sure enough, sge_execd is<br>
> not running. On some of the nodes I can start the service manually, on<br>
> others, the entire /opt/sge6 directory is empty. I have no idea why this<br>
> would be the case, especially since they were running jobs to begin with.<br>
> Has anyone else seen this?<br>
><br>
</div></div>> _______________________________________________<br>
> StarCluster mailing list<br>
> <a href="mailto:StarCluster@mit.edu">StarCluster@mit.edu</a><br>
> <a href="http://mailman.mit.edu/mailman/listinfo/starcluster" target="_blank">http://mailman.mit.edu/mailman/listinfo/starcluster</a><br>
><br>
</blockquote></div><br></div>