Thank you, Justin, for your reply.

1. I started with your GPU AMI, "ami-4583572c", as you can see in the first cluster template in my config file (now commented out and unused). That was when I first created the initial volume through the AWS console: it was placed by default in availability zone "us-east-1c" while the AMI's instance was in "us-east-1a", so the first attempt failed because of the zone mismatch. The error message was not clear, and I didn't understand the zone problem until I used the ec2 command-line tools, which stated it plainly; the replies I received on this mailing list over the following days confirmed that this was the reason. Then, when Rayson said I should use the starcluster createvolume command, I deleted the volume, recreated it as the instructions require, and terminated the volumecreator, but I don't think I saw the volume attached once I started sfmcluster; I believe it only appeared after I used the AWS console.
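For what it's worth, my understanding is that the zone-safe sequence would have been something like the following, so that the volume lands in the same availability zone as the cluster instances (the size and zone here are just my values):

$ starcluster createvolume 30 us-east-1a
$ starcluster terminate volumecreator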
I then did my configuration, installations, and downloads, and created a new image, "ami-fae74193", which is now public if you want to have a look. I used the command below, and yes, the volume was attached at the time:
ec2-create-image instanceID --name sfmimage --description 'GPU Cluster Ubuntu with VisualSFM, MeshlabServer, FFMPEG' -K mykeypath/pkfile.pem

Now I am running the second cluster, "mysfmcluster", with my AMI "ami-fae74193". I think I had to detach the volume from the AWS console, and even force-detach it, in order to attach it to the new cluster. I kept both clusters running for a while to test, and I am not sure whether trying to detach while the first cluster was running was the problem, but it took a while. I am also not certain whether I had to terminate the first cluster before attaching the volume to the second, but I remember that I did terminate it first.
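For reference, I believe the detach I did from the console corresponds to something like the ec2 tools command below (vol-69bd4807 is my volume; the --force flag is only what I resorted to when the normal detach hung):

$ ec2-detach-volume vol-69bd4807
$ ec2-detach-volume vol-69bd4807 --force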
2. As mentioned in point 1, I created the first volume through the AWS console, in a different availability zone, and then recreated it with the starcluster commands. In both cases it was 30 GB and unpartitioned, and I didn't see any errors.
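If it helps, the way I would double-check the "unpartitioned" part is to look at the device from the instance once the volume is attached; something like this (the log below shows /dev/sdz, which on the instance may appear as /dev/xvdz) should report that the device contains no partition table:

$ sudo fdisk -l /dev/xvdz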
3. Yes, as you can see in the attached file; the relevant wiring should look roughly like the sketch after point 4 below.

4. I just started the cluster now, and the volume is attached this time, so the earlier failures may have been my own mistakes, or me not waiting long enough for everything to become available. The bad news is that there is another problem. It didn't stop me from running sshmaster afterwards, but the screen output is copied below and, I believe, is in the attached debug.log.
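For completeness, this is roughly the shape I understand the volume wiring should take in the config, as mentioned under point 3 above (the volume name here is a placeholder, not my real one):

[volume sfmdata]
VOLUME_ID = vol-69bd4807
MOUNT_PATH = /home

[cluster mysfmcluster]
# ... other cluster settings ...
VOLUMES = sfmdata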
5. Both files are attached.

Thanks again for your support,

Manal


$ starcluster start mysfmcluster
StarCluster - (http://web.mit.edu/starcluster) (v. 0.93.3)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

>>> Using default cluster template: mysfmcluster
>>> Validating cluster template settings...
>>> Cluster template settings are valid
>>> Starting cluster...
>>> Launching a 1-node cluster...
>>> Launching master node (ami: ami-fae74193, type: cg1.4xlarge)...
>>> Creating security group @sc-mysfmcluster...
>>> Creating placement group @sc-mysfmcluster...
SpotInstanceRequest:sir-eeb33011
>>> Waiting for cluster to come up... (updating every 30s)
>>> Waiting for open spot requests to become active...
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for all nodes to be in a 'running' state...
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for SSH to come up on all nodes...
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for cluster to come up took 8.399 mins
>>> The master node is ec2-23-20-139-233.compute-1.amazonaws.com
>>> Setting up the cluster...
>>> Attaching volume vol-69bd4807 to master node on /dev/sdz ...
>>> Configuring hostnames...
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Mounting EBS volume vol-69bd4807 on /home...
>>> Creating cluster user: None (uid: 1001, gid: 1001)
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Configuring scratch space for user(s): sgeadmin
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Configuring /etc/hosts on each node
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Starting NFS server on master
>>> Setting up NFS took 0.073 mins
>>> Configuring passwordless ssh for root
>>> Shutting down threads...
20/20 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
Traceback (most recent call last):
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/cli.py", line 255, in main
    sc.execute(args)
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/commands/start.py", line 194, in execute
    validate_running=validate_running)
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/cluster.py", line 1414, in start
    return self._start(create=create, create_only=create_only)
  File "<string>", line 2, in _start
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/utils.py", line 87, in wrap_f
    res = func(*arg, **kargs)
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/cluster.py", line 1437, in _start
    self.setup_cluster()
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/cluster.py", line 1446, in setup_cluster
    self._setup_cluster()
  File "<string>", line 2, in _setup_cluster
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/utils.py", line 87, in wrap_f
    res = func(*arg, **kargs)
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/cluster.py", line 1460, in _setup_cluster
    self.cluster_shell, self.volumes)
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/clustersetup.py", line 350, in run
    self._setup_passwordless_ssh()
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/clustersetup.py", line 225, in _setup_passwordless_ssh
    auth_conn_key=True)
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/node.py", line 418, in generate_key_for_user
    key = self.ssh.load_remote_rsa_key(private_key)
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/sshutils/__init__.py", line 210, in load_remote_rsa_key
    key = ssh.RSAKey(file_obj=rfile)
  File "build/bdist.macosx-10.6-universal/egg/ssh/rsakey.py", line 48, in __init__
    self._from_private_key(file_obj, password)
  File "build/bdist.macosx-10.6-universal/egg/ssh/rsakey.py", line 167, in _from_private_key
    data = self._read_private_key('RSA', file_obj, password)
  File "build/bdist.macosx-10.6-universal/egg/ssh/pkey.py", line 323, in _read_private_key
    raise PasswordRequiredException('Private key file is encrypted')
PasswordRequiredException: Private key file is encrypted

!!! ERROR - Oops! Looks like you've found a bug in StarCluster
!!! ERROR - Crash report written to: /Users/manal/.starcluster/logs/crash-report-556.txt
!!! ERROR - Please remove any sensitive data from the crash report
!!! ERROR - and submit it to starcluster@mit.edu
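In case it helps narrow this down: the traceback seems to say that some private key file on the node is passphrase-protected. Since my AMI was created from a running cluster node, I am only guessing that an old encrypted key may have been baked into the image or left on the /home volume. A quick check I could run (the paths are just guesses) would be:

$ starcluster sshmaster mysfmcluster
# encrypted PEM keys carry a "Proc-Type: 4,ENCRYPTED" header
$ grep -l ENCRYPTED /root/.ssh/id_rsa /home/*/.ssh/id_rsa 2>/dev/null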
On 23 May 2012 04:49, Justin Riley <jtriley@mit.edu> wrote:
> Manal,
>
> StarCluster chooses which device to attach external EBS volumes on
> automatically - you do not and should not need to specify this in your
> config. Assuming you use 'createvolume' and update your config correctly
> things should "just work".
>
> You should not have to use the AWS console to attach volumes manually
> and if you're having to do this then I'd like to figure out why so we
> can fix it. This is a core feature of StarCluster and many users are
> using external EBS with StarCluster without issue so I'm extremely
> curious why you're having issues...
>
> With that said I'm having trouble pulling out all of the details I need
> from this long thread so I'll ask direct questions instead:
>
> 1. Which AMI are you using? Did you create the AMI yourself? If so how
> did you go about creating the AMI and did you have any external EBS
> volumes attached while creating the AMI?
>
> 2. How did you create the volume you were having issues mounting with
> StarCluster? StarCluster expects your volume to either be completely
> unpartitioned (format entire device) or only contain a single partition.
> If this isn't the case you should see an error when starting a cluster.
>
> 3. Did you add your volume to your cluster config correctly according to
> the docs? (ie add your volume to the VOLUMES list in your cluster
> config?)
>
> 4. StarCluster should be spitting out errors when creating the cluster
> if it fails to attach/mount/NFS-share any external EBS volumes - did you
> notice any errors? Can you please attach the complete screen output of a
> failed StarCluster run? Also it would be extremely useful if you could send
> me your ~/.starcluster/logs/debug.log for a failed run so that I can
> take a look.
>
> 5. Would you mind sending me a copy of your config with all of the
> sensitive data removed? I just want to make sure you've configured
> things as expected.
>
> Thanks,
>
> ~Justin