[StarCluster] newbie problems

Manal Helal manal.helal at gmail.com
Tue May 22 21:37:21 EDT 2012


Thank you, Justin, for your reply.

1. I started with your GPU AMI "ami-4583572c", as you can see in the
first cluster template in the config file (commented out and no longer
used). At that point I created the initial volume in the AWS console,
and it was placed by default in availability zone "us-east-1c" while
the AMI's instance was in "us-east-1a", so the volume could not attach
the first time because of the zone mismatch. The error message was not
clear; I only understood the zone problem once the ec2 command-line
tools reported it explicitly, and the replies on this list over the
following days confirmed it. When Rayson suggested the starcluster
createvolume command, I deleted the volume, recreated it following the
instructions, and terminated the volumecreator cluster. Even so, I
don't think the volume was attached when I started sfmcluster; as far
as I can tell it only appeared attached after I intervened through the
AWS console. The sequence I followed is sketched below.
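
For reference, this is roughly what I ran (the volume name is only an
illustration; the size and zone are the parts that matter):

$ starcluster createvolume --name=sfmdata 30 us-east-1a
$ starcluster terminate volumecreator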

I then did my configuration, installations, and downloads, and created
a new image, "ami-fae74193", which is now public if you want to have a
look. I used the command below, and yes, the volume was still attached
at the time:

ec2-create-image instanceID --name sfmimage --description 'GPU Cluster
Ubuntu with VisualSFM, MeshlabServer, FFMPEG' -K mykeypath/pkfile.pem
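
To double-check that the image registered, something like the
following should list it under my account (assuming the same API-tools
credentials):

$ ec2-describe-images -o self -K mykeypath/pkfile.pem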

Now I am running the second cluster template, "mysfmcluster", with my
AMI "ami-fae74193". I believe I had to detach the volume from the AWS
console, and even force-detach it, before it would attach to the new
cluster. I kept both clusters running for a while to test, so the
problem may have been trying to detach while the first cluster was
still using the volume; the detach took a long time, and as far as I
remember I had to terminate the first cluster before the volume would
attach to the second. The command-line equivalent is sketched below.
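
For the record, the console actions correspond roughly to these
API-tools commands (the volume ID is taken from the log below; --force
only as a last resort, since forcing a detach while a filesystem is
mounted can corrupt it):

$ ec2-detach-volume vol-69bd4807 -K mykeypath/pkfile.pem
$ ec2-detach-volume vol-69bd4807 --force -K mykeypath/pkfile.pem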

2. As mentioned in point 1, I created the first volume in the AWS
console (in the wrong availability zone), then recreated it with the
starcluster commands. In both cases it was a 30GB unpartitioned
volume, and I didn't see any errors. A quick check is sketched below.
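
If it helps, this is how I would confirm from the master node that the
device is unpartitioned (the device name is my guess from the
"/dev/sdz" attach message in the log; newer kernels typically expose
it as /dev/xvdz):

$ sudo fdisk -l /dev/xvdz
# an unpartitioned device reports "doesn't contain a valid partition table"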

3. Yes, as seen in the attached config file; the relevant section
should look like the excerpt below.
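
(Reconstructed excerpt; the section and volume names are illustrative,
but the volume ID and mount point match the log below:)

[volume sfmdata]
VOLUME_ID = vol-69bd4807
MOUNT_PATH = /home

[cluster mysfmcluster]
...
VOLUMES = sfmdata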

4. I just started the cluster again now, and the volume did attach
this time; the earlier failures may have been my mistake, or me not
waiting long enough for everything to become available. The bad news
is that there is another problem, although it did not stop me from
using sshmaster afterwards. The screen output is copied below, and it
should also be in the attached debug.log.

5. Both files are attached.

Thanks again for your support,

Manal


$ starcluster start mysfmcluster
StarCluster - (http://web.mit.edu/starcluster) (v. 0.93.3)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster at mit.edu

>>> Using default cluster template: mysfmcluster
>>> Validating cluster template settings...
>>> Cluster template settings are valid
>>> Starting cluster...
>>> Launching a 1-node cluster...
>>> Launching master node (ami: ami-fae74193, type: cg1.4xlarge)...
>>> Creating security group @sc-mysfmcluster...
>>> Creating placement group @sc-mysfmcluster...
SpotInstanceRequest:sir-eeb33011
>>> Waiting for cluster to come up... (updating every 30s)
>>> Waiting for open spot requests to become active...
1/1 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
100%
>>> Waiting for all nodes to be in a 'running' state...
1/1 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
100%
>>> Waiting for SSH to come up on all nodes...
1/1 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
100%
>>> Waiting for cluster to come up took 8.399 mins
>>> The master node is ec2-23-20-139-233.compute-1.amazonaws.com
>>> Setting up the cluster...
>>> Attaching volume vol-69bd4807 to master node on /dev/sdz ...
>>> Configuring hostnames...
1/1 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
100%
>>> Mounting EBS volume vol-69bd4807 on /home...
>>> Creating cluster user: None (uid: 1001, gid: 1001)
1/1 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
100%
>>> Configuring scratch space for user(s): sgeadmin
1/1 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
100%
>>> Configuring /etc/hosts on each node
1/1 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
100%
>>> Starting NFS server on master
>>> Setting up NFS took 0.073 mins
>>> Configuring passwordless ssh for root
>>> Shutting down threads...
20/20 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
100%
Traceback (most recent call last):
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/cli.py", line 255, in main
    sc.execute(args)
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/commands/start.py", line 194, in execute
    validate_running=validate_running)
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/cluster.py", line 1414, in start
    return self._start(create=create, create_only=create_only)
  File "<string>", line 2, in _start
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/utils.py", line 87, in wrap_f
    res = func(*arg, **kargs)
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/cluster.py", line 1437, in _start
    self.setup_cluster()
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/cluster.py", line 1446, in setup_cluster
    self._setup_cluster()
  File "<string>", line 2, in _setup_cluster
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/utils.py", line 87, in wrap_f
    res = func(*arg, **kargs)
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/cluster.py", line 1460, in _setup_cluster
    self.cluster_shell, self.volumes)
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/clustersetup.py", line 350, in run
    self._setup_passwordless_ssh()
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/clustersetup.py", line 225, in _setup_passwordless_ssh
    auth_conn_key=True)
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/node.py", line 418, in generate_key_for_user
    key = self.ssh.load_remote_rsa_key(private_key)
  File "/Library/Python/2.6/site-packages/StarCluster-0.93.3-py2.6.egg/starcluster/sshutils/__init__.py", line 210, in load_remote_rsa_key
    key = ssh.RSAKey(file_obj=rfile)
  File "build/bdist.macosx-10.6-universal/egg/ssh/rsakey.py", line 48, in __init__
    self._from_private_key(file_obj, password)
  File "build/bdist.macosx-10.6-universal/egg/ssh/rsakey.py", line 167, in _from_private_key
    data = self._read_private_key('RSA', file_obj, password)
  File "build/bdist.macosx-10.6-universal/egg/ssh/pkey.py", line 323, in _read_private_key
    raise PasswordRequiredException('Private key file is encrypted')
PasswordRequiredException: Private key file is encrypted

!!! ERROR - Oops! Looks like you've found a bug in StarCluster
!!! ERROR - Crash report written to:
/Users/manal/.starcluster/logs/crash-report-556.txt
!!! ERROR - Please remove any sensitive data from the crash report
!!! ERROR - and submit it to starcluster at mit.edu
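
Reading the traceback, the crash happens while StarCluster loads an
existing RSA key from the node and finds one that is passphrase
protected. Since /home is mounted from the EBS volume that has moved
between clusters, one thing I can check is whether an old encrypted
key is sitting there (the paths below are my guesses, not something
the log confirms):

$ starcluster sshmaster mysfmcluster
# grep -l ENCRYPTED /root/.ssh/id_rsa /home/*/.ssh/id_rsa 2>/dev/null

Any file listed is a passphrase-protected PEM key, which StarCluster
cannot load non-interactively.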

On 23 May 2012 04:49, Justin Riley <jtriley at mit.edu> wrote:

> Manal,
>
> StarCluster chooses which device to attach external EBS volumes on
> automatically - you do not and should not need to specify this in your
> config. Assuming you use 'createvolume' and update your config correctly
> things should "just work".
>
> You should not have to use the AWS console to attach volumes manually
> and if you're having to do this then I'd like to figure out why so we
> can fix it. This is a core feature of StarCluster and many users are
> using external EBS with StarCluster without issue so I'm extremely
> curious why you're having issues...
>
> With that said I'm having trouble pulling out all of the details I need
> from this long thread so I'll ask direct questions instead:
>
> 1. Which AMI are you using? Did you create the AMI yourself? If so how
> did you go about creating the AMI and did you have any external EBS
> volumes attached while creating the AMI?
>
> 2. How did you create the volume you were having issues mounting with
> StarCluster? StarCluster expects your volume to either be completely
> unpartitioned (format entire device) or only contain a single partition.
> If this isn't the case you should see an error when starting a cluster.
>
> 3. Did you add your volume to your cluster config correctly according to
> the docs? (ie add your volume to the VOLUMES list in your cluster
> config?)
>
> 4. StarCluster should be spitting out errors when creating the cluster
> if it fails to attach/mount/NFS-share any external EBS volumes - did you
> notice any errors? Can you please attach the complete screen output of a
> failed StarCluster run? Also it would be extremely useful if you could send
> me your ~/.starcluster/logs/debug.log for a failed run so that I can
> take a look.
>
> 5. Would you mind sending me a copy of your config with all of the
> sensitive data removed? I just want to make sure you've configured
> things as expected.
>
> Thanks,
>
> ~Justin
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: config_nocredentials
Type: application/octet-stream
Size: 12183 bytes
Desc: not available
URL: http://mailman.mit.edu/pipermail/starcluster/attachments/20120523/06df000c/attachment-0002.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: debug.log
Type: application/octet-stream
Size: 236404 bytes
Desc: not available
URL: http://mailman.mit.edu/pipermail/starcluster/attachments/20120523/06df000c/attachment-0003.obj

