<div dir="ltr">Didn't know they were NFS mounted. All my nodes are m1.xlarge. They aren't doing any I/O over NFS, just local scratch space...ephemeral drives. They do read some program files over NFS, but not much. If it were an NFS problem, I would expect to see an NFS error, or 'ls' hanging, but the directory lists just fine and is empty. The fact that it's NFS is weird and doesn't fit the symptoms. Maybe the mount got removed somehow and 'ls' is showing the local directory? Hmmm. I'll have to investigate further next time it happens.</div>
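<div dir="ltr">A quick way to check that theory next time it happens (a sketch; the /opt/sge6 path and "master" hostname are just the StarCluster defaults mentioned below, so adjust if your config differs):</div>

```shell
#!/bin/sh
# If the NFS mount silently dropped, 'ls /opt/sge6' would show the
# empty local mountpoint underneath. Check what mount(8) reports:
if mount | grep -q ' /opt/sge6 type nfs'; then
    echo "/opt/sge6 is still NFS-mounted"
else
    echo "/opt/sge6 mount is gone - 'ls' is showing the local directory"
fi
```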
<div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Oct 29, 2013 at 10:18 PM, Rayson Ho <span dir="ltr"><<a href="mailto:raysonlogin@gmail.com" target="_blank">raysonlogin@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Since you mentioned that the entire /opt/sge6 directory is empty, it<br>
sounds like the NFS server (i.e. the StarCluster master) was not available<br>
at some point?<br>
<br>
What instance type are you using for the master?<br>
<br>
Note that by default /home & /opt/sge6 are NFS mounts, and thus if the<br>
nodes are doing lots of I/O, it can cause issues (NFS doesn't<br>
scale well, but 30 nodes should be OK unless there really is a lot of<br>
I/O traffic).<br>
<br>
# mount<br>
...<br>
master:/home on /home type nfs<br>
(rw,user=root,nosuid,nodev,vers=3,addr=10.125.9.29)<br>
master:/opt/sge6 on /opt/sge6 type nfs<br>
(rw,user=root,nosuid,nodev,vers=3,addr=10.125.9.29)<br>
<br>
Rayson<br>
<br>
==================================================<br>
Open Grid Scheduler - The Official Open Source Grid Engine<br>
<a href="http://gridscheduler.sourceforge.net/" target="_blank">http://gridscheduler.sourceforge.net/</a><br>
<a href="http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html" target="_blank">http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html</a><br>
<div><div class="h5"><br>
<br>
On Tue, Oct 29, 2013 at 8:43 PM, Ryan Golhar<br>
<<a href="mailto:ngsbioinformatics@gmail.com">ngsbioinformatics@gmail.com</a>> wrote:<br>
> Hi all - I came across a weird problem that I experience every once in a<br>
> while and recently more and more. I've created a 30-node spot cluster using<br>
> starcluster. I started a bunch of jobs on all the nodes and sge shows all<br>
> the jobs running. I come back an hour or two later and check on the cluster<br>
> and only half the nodes are listed as running jobs using qstat. qhost shows<br>
> the nodes as down. I can log into the nodes and sure enough, sge_execd is<br>
> not running. On some of the nodes I can start the service manually, on<br>
> others, the entire /opt/sge6 directory is empty. I have no idea why this<br>
> would be the case, especially since they were running jobs to begin with.<br>
> Has anyone else seen this?<br>
><br>
</div></div>> _______________________________________________<br>
> StarCluster mailing list<br>
> <a href="mailto:StarCluster@mit.edu">StarCluster@mit.edu</a><br>
> <a href="http://mailman.mit.edu/mailman/listinfo/starcluster" target="_blank">http://mailman.mit.edu/mailman/listinfo/starcluster</a><br>
><br>
</blockquote></div><br></div>