<div dir="ltr">Hi fellows,<div><br></div><div>Here's the situation:</div><div><br></div><div>I was running a 15-node cluster, using spot instances, for some calculation work. It seems that my bid price was too low for the market and I lost all slave nodes last night.</div>
<div>No problem, since there were no jobs running at that time.</div><div>Today I tried to add the lost nodes back to the cluster... and boom, here comes the bug.</div><div>Well, after a holy restart, no problems anymore.</div>
<div><br></div><div>Can anyone tell me what happened?</div><div><br></div><div>I'm a little worried since this is happening a lot.</div><div><br></div><div>All the best,</div><div><br>Sergio</div><div><br></div><div>---</div>
<div><br></div><div><div>ubuntu@domU-12-31-39-15-11-FA:~$ starcluster addnode -n 14 decomp</div><div>StarCluster - (<a href="http://star.mit.edu/cluster">http://star.mit.edu/cluster</a>) (v. 0.9999)</div><div>Software Tools for Academics and Researchers (STAR)</div>
<div>Please submit bug reports to <a href="mailto:starcluster@mit.edu">starcluster@mit.edu</a></div><div><br></div><div>>>> Launching node(s): node001, node002, node003, node004, node005, node006, node007, node008, node009, node010, node011, node012, node013, node014</div>
<div>SpotInstanceRequest:sir-db61cc34</div><div>SpotInstanceRequest:sir-25ced835</div><div>SpotInstanceRequest:sir-83cb3634</div><div>SpotInstanceRequest:sir-f3012e34</div><div>SpotInstanceRequest:sir-2ab42232</div><div>SpotInstanceRequest:sir-ebcf5434</div>
<div>SpotInstanceRequest:sir-743a2e35</div><div>SpotInstanceRequest:sir-f461de34</div><div>SpotInstanceRequest:sir-78e95e32</div><div>SpotInstanceRequest:sir-0f44ee35</div><div>SpotInstanceRequest:sir-91364835</div><div>SpotInstanceRequest:sir-7dc6b635</div>
<div>SpotInstanceRequest:sir-ae0ac635</div><div>SpotInstanceRequest:sir-46e40a35</div><div>>>> Waiting for node(s) to come up... (updating every 30s)</div><div>>>> Waiting for open spot requests to become active...</div>
<div>14/14 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%</div><div>>>> Waiting for all nodes to be in a 'running' state...</div><div>15/15 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%</div>
<div>>>> Waiting for SSH to come up on all nodes...</div><div>15/15 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%</div><div>>>> Waiting for cluster to come up took 9.552 mins</div>
<div>>>> Running plugin starcluster.clustersetup.DefaultClusterSetup</div><div>>>> Configuring hostnames...</div><div>1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%</div><div>
>>> Configuring /etc/hosts on each node</div><div>15/15 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%</div><div>>>> Configuring NFS exports path(s):</div><div>/home</div><div>>>> Mounting all NFS export path(s) on 1 worker node(s)</div>
<div>1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%</div><div>!!! ERROR - Error occured while running plugin 'starcluster.clustersetup.DefaultClusterSetup':</div>
<div>!!! ERROR - error occurred in job (id=node001): remote command 'source /etc/profile && mount /home' failed with status 32:</div>
<div>mount: master:/home failed, reason given by server: Permission denied</div><div>Traceback (most recent call last):</div><div> File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/threadpool.py", line 31, in run</div>
<div> job.run()</div><div> File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/threadpool.py", line 58, in run</div>
<div> r = self.method(*self.args, **self.kwargs)</div><div> File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/node.py", line 689, in mount_nfs_shares</div>
<div> self.ssh.execute('mount %s' % path)</div><div> File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/sshutils/__init__.py", line 538, in execute</div>
<div> msg, command, exit_status, out_str)</div><div>RemoteCommandFailed: remote command 'source /etc/profile && mount /home' failed with status 32:</div>
<div>mount: master:/home failed, reason given by server: Permission denied</div><div><br></div><div><br></div><div>!!! ERROR - Oops! Looks like you've found a bug in StarCluster</div><div>!!! ERROR - Crash report written to: /home/ubuntu/.starcluster/logs/crash-report-25643.txt</div>
<div>!!! ERROR - Please remove any sensitive data from the crash report</div><div>!!! ERROR - and submit it to <a href="mailto:starcluster@mit.edu">starcluster@mit.edu</a></div><div>*</div><div>* Moved on to a holy restart and everything ran fine</div>
<div>*</div><div>ubuntu@domU-12-31-39-15-11-FA:~$ starcluster restart decomp</div><div>StarCluster - (<a href="http://star.mit.edu/cluster">http://star.mit.edu/cluster</a>) (v. 0.9999)</div><div>Software Tools for Academics and Researchers (STAR)</div>
<div>Please submit bug reports to <a href="mailto:starcluster@mit.edu">starcluster@mit.edu</a></div><div><br></div><div>>>> Running plugin starcluster.plugins.sge.SGEPlugin</div><div>>>> Running plugin starcluster.clustersetup.DefaultClusterSetup</div>
<div>>>> Rebooting cluster...</div><div>>>> Sleeping for 20 seconds...</div><div>>>> Waiting for cluster to come up... (updating every 30s)</div><div>>>> Waiting for all nodes to be in a 'running' state...</div>
<div>15/15 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%</div><div>>>> Waiting for SSH to come up on all nodes...</div><div>15/15 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%</div>
<div>>>> Waiting for cluster to come up took 3.998 mins</div><div>>>> The master node is <a href="http://ec2-23-23-241-73.compute-1.amazonaws.com">ec2-23-23-241-73.compute-1.amazonaws.com</a></div><div>>>> Configuring cluster...</div>
<div>>>> Volume vol-fb714ca1 already attached to master...skipping</div><div>>>> Running plugin starcluster.clustersetup.DefaultClusterSetup</div><div>>>> Configuring hostnames...</div><div>15/15 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%</div>
<div>>>> Mounting EBS volume vol-fb714ca1 on /home...</div><div>>>> Creating cluster user: sgeadmin (uid: 1001, gid: 1001)</div><div>15/15 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%</div>
<div>>>> Configuring scratch space for user(s): sgeadmin</div><div>15/15 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%</div><div>>>> Configuring /etc/hosts on each node</div><div>
15/15 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%</div><div>>>> Starting NFS server on master</div><div>>>> Configuring NFS exports path(s):</div><div>/home</div><div>>>> Mounting all NFS export path(s) on 14 worker node(s)</div>
<div>14/14 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%</div><div>>>> Setting up NFS took 0.083 mins</div><div>>>> Configuring passwordless ssh for root</div><div>>>> Configuring passwordless ssh for sgeadmin</div>
<div>>>> Running plugin starcluster.plugins.sge.SGEPlugin</div><div>>>> Configuring SGE...</div><div>>>> Configuring NFS exports path(s):</div><div>/opt/sge6</div><div>>>> Mounting all NFS export path(s) on 14 worker node(s)</div>
<div>14/14 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%</div><div>>>> Setting up NFS took 0.018 mins</div><div>>>> Removing previous SGE installation...</div><div>>>> Installing Sun Grid Engine...</div>
<div>14/14 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%</div><div>>>> Creating SGE parallel environment 'orte'</div><div>15/15 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%</div>
<div>>>> Adding parallel environment 'orte' to queue 'all.q'</div><div>>>> Configuring cluster took 0.779 mins</div><div>>>> Restarting cluster took 5.162 mins</div></div><div><br>
</div></div>