[Starcluster] can't start cluster

Justin Riley jtriley at MIT.EDU
Thu Jun 17 11:53:26 EDT 2010


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Dean,

Would you mind joining the list? Thanks.

> I assume this is an Amazon resources problem.

Yes, unfortunately it is. I've fixed the GitHub code to display these
messages in a friendlier way.
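
To illustrate what "friendlier" could look like, here is a minimal sketch
that pulls the error code and message out of an EC2 error body like the one
in the log quoted below. It is not the actual StarCluster change; the
function name summarize_ec2_error is hypothetical and the sample body is a
shortened copy of the response from the log.

```python
import xml.etree.ElementTree as ET


def summarize_ec2_error(body):
    """Return a list of (code, message) pairs from an EC2 error response body."""
    root = ET.fromstring(body)
    errors = []
    for err in root.iter("Error"):
        code = err.findtext("Code", default="Unknown")
        message = err.findtext("Message", default="")
        # Collapse line breaks and repeated spaces inside the message.
        errors.append((code, " ".join(message.split())))
    return errors


# Shortened copy of the error body from the quoted log.
sample = (
    '<?xml version="1.0"?>'
    "<Response><Errors><Error>"
    "<Code>InsufficientInstanceCapacity</Code>"
    "<Message>We currently do not have sufficient m1.large capacity "
    "in the Availability Zone you requested (us-east-1a).</Message>"
    "</Error></Errors>"
    "<RequestID>48be0131-84aa-40c7-a054-fb227c6fa183</RequestID>"
    "</Response>"
)

for code, message in summarize_ec2_error(sample):
    print("ERROR: %s: %s" % (code, message))
```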

> Even when I do not specify an AVAILABILITY_ZONE in the StarCluster
> config file the fact that my EBS volume is in us-east-1a is forcing
> EC2 to try to start the cluster in the over-subscribed us-east-1a
> zone.

This is intended. Otherwise, you would not be able to attach your volume
to the cluster at startup. In general, using an EBS volume determines
the zone where instances *have* to be started:

Any instance you wish to attach an EBS volume to must be in the same
availability zone as the volume.

For StarCluster this is the master node. Technically it would be
possible to apply this zone restriction only to the master and launch
the other nodes "wherever", but this would likely make the network
latency between nodes even worse than it already is, given that the EBS
volume would then be NFS-shared across availability zones.

If you do not specify any volumes in a cluster template, StarCluster
will let Amazon decide where to put the instances. You can always force
a particular zone via the AVAILABILITY_ZONE setting in your cluster
template, but if a volume specified in your template does not live in
that zone, you will get an error from StarCluster.
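
The zone rules described above can be sketched as a small function. This is
an illustration only, not StarCluster's implementation: the name
resolve_launch_zone and its arguments are hypothetical, and treating volumes
that live in different zones as an error is an assumption that follows from
the same-zone rule for attaching EBS volumes.

```python
def resolve_launch_zone(requested_zone, volume_zones):
    """Pick the zone to launch in, or raise ValueError on a conflict.

    requested_zone: the AVAILABILITY_ZONE value from the template, or None.
    volume_zones: dict mapping volume name -> zone the volume lives in.
    Returns None when no zone is forced and Amazon is free to choose.
    """
    zones = set(volume_zones.values())
    if len(zones) > 1:
        # Assumption: one node cannot attach volumes from several zones.
        raise ValueError("volumes span multiple zones: %s" % sorted(zones))
    volume_zone = zones.pop() if zones else None
    if requested_zone and volume_zone and requested_zone != volume_zone:
        raise ValueError(
            "AVAILABILITY_ZONE is %s but the volume lives in %s"
            % (requested_zone, volume_zone)
        )
    # A volume's zone wins when no zone was requested explicitly.
    return requested_zone or volume_zone


# No AVAILABILITY_ZONE set, volume in us-east-1a: the volume decides.
print(resolve_launch_zone(None, {"data": "us-east-1a"}))
```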

You have a couple of options here:

1. Comment out the volume section of your cluster template completely
so that the volume(s) are not used. This will allow Amazon to put you
where there's capacity, but you will not have your data.

2. Make a clone of your volume(s) in another zone using EBS snapshots
(ElasticFox is nice for this) and change your volume settings to point
to this new volume. StarCluster will then launch instances in the new
volume's zone.

Personally, if you can't start a cluster consistently, I would consider
option 2.
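
If you would rather script option 2 than use ElasticFox, the snapshot and
re-create steps look roughly like the sketch below. It assumes the boto3
library (not the boto 1.9b shown in the traceback) and an EC2 client passed
in by the caller; the function name clone_volume_to_zone is hypothetical.
Treat it as a starting point rather than tested tooling.

```python
def clone_volume_to_zone(ec2, volume_id, target_zone):
    """Snapshot volume_id and create a copy of it in target_zone.

    ec2 is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    Returns the ID of the newly created volume.
    """
    snapshot = ec2.create_snapshot(
        VolumeId=volume_id,
        Description="clone of %s for %s" % (volume_id, target_zone),
    )
    snapshot_id = snapshot["SnapshotId"]
    # Wait for the snapshot to complete before creating a volume from it.
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])
    volume = ec2.create_volume(
        SnapshotId=snapshot_id,
        AvailabilityZone=target_zone,
    )
    return volume["VolumeId"]
```

You would call this with your own client, volume ID, and a zone that
currently has capacity (us-east-1d in the error message), then point the
volume settings in your config at the returned volume ID.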

> For your information, when I ran "starcluster listclusters" command
> right after receiving the "InsufficientInstanceCapacity" error I got
> the following error message:

This is fixed in the GitHub code.
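
The AttributeError in the quoted traceback arises because the security group
exists but no master instance was found, so master is None when its
launch_time is printed. A minimal sketch of the kind of guard that avoids
this follows; it is an illustration, not the actual change in the GitHub
code, and the function name describe_cluster is hypothetical.

```python
def describe_cluster(name, master):
    """Return the lines to print for one cluster; master may be None."""
    lines = ["%s (security group: @sc-%s)" % (name, name)]
    if master is None:
        # The security group exists but no master instance was launched.
        lines.append("No master node found for this cluster.")
        return lines
    lines.append("Launch time: %s" % master.launch_time)
    return lines


# A cluster whose launch failed before the master came up.
for line in describe_cluster("cidrcluster", None):
    print(line)
```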

Hope that helps,

~Justin

On 06/17/2010 11:23 AM, Dean Snyder wrote:
> I am unable to start up an 8-node m1.large cluster this morning due
> to an "InsufficientInstanceCapacity" error. (See appended log.)
> 
> I assume this is an Amazon resources problem. Even when I do not
> specify an AVAILABILITY_ZONE in the StarCluster config file the fact
> that my EBS volume is in us-east-1a is forcing EC2 to try to start
> the cluster in the over-subscribed us-east-1a zone.
> 
> For your information, when I ran "starcluster listclusters" command
> right after receiving the "InsufficientInstanceCapacity" error I got
> the following error message:
> 
> dean 11:13:37 ~ : starcluster listclusters
> StarCluster - (http://web.mit.edu/starcluster)
> Software Tools for Academics and Researchers (STAR)
> Please submit bug reports to starcluster at mit.edu
> ---------------------------------------------
> cidrcluster (security group: @sc-cidrcluster)
> ---------------------------------------------
> Traceback (most recent call last):
>   File "build/bdist.macosx-10.6-universal/egg/starcluster/cli.py", line 1075, in main
>     sc.execute(args)
>   File "build/bdist.macosx-10.6-universal/egg/starcluster/cli.py", line 432, in execute
>     cluster.list_clusters(cfg)
>   File "build/bdist.macosx-10.6-universal/egg/starcluster/cluster.py", line 148, in list_clusters
>     print 'Launch time: %s' % master.launch_time
> AttributeError: 'NoneType' object has no attribute 'launch_time'
> 
> 
> 
> **********************************************************
> 
> 
> dean 11:10:44 ~ : starcluster start cidrcluster
> StarCluster - (http://web.mit.edu/starcluster)
> Software Tools for Academics and Researchers (STAR)
> Please submit bug reports to starcluster at mit.edu
>
> >>> Using default cluster template: largecluster
> >>> Validating cluster template settings...
> >>> Cluster template settings are valid
> >>> Starting cluster...
> >>> Launching a 8-node cluster...
> >>> Launching master node...
> >>> Master AMI: ami-88967ee1
> >>> Creating security group @sc-cidrcluster...
> Traceback (most recent call last):
>   File "build/bdist.macosx-10.6-universal/egg/starcluster/cli.py", line 1075, in main
>     sc.execute(args)
>   File "build/bdist.macosx-10.6-universal/egg/starcluster/cli.py", line 239, in execute
>     scluster.start(create=not self.opts.no_create)
>   File "build/bdist.macosx-10.6-universal/egg/starcluster/utils.py", line 27, in wrapper
>     res = func(*arg, **kargs)
>   File "build/bdist.macosx-10.6-universal/egg/starcluster/cluster.py", line 679, in start
>     self.create_cluster()
>   File "build/bdist.macosx-10.6-universal/egg/starcluster/cluster.py", line 596, in create_cluster
>     placement=zone)
>   File "build/bdist.macosx-10.6-universal/egg/starcluster/cluster.py", line 575, in run_instances
>     placement=placement)
>   File "build/bdist.macosx-10.6-universal/egg/starcluster/awsutils.py", line 161, in run_instances
>     placement=placement)
>   File "/Library/Python/2.6/site-packages/boto-1.9b-py2.6.egg/boto/ec2/connection.py", line 463, in run_instances
>     return self.get_object('RunInstances', params, Reservation, verb='POST')
>   File "/Library/Python/2.6/site-packages/boto-1.9b-py2.6.egg/boto/connection.py", line 620, in get_object
>     response = self.make_request(action, params, path, verb)
>   File "/Library/Python/2.6/site-packages/boto-1.9b-py2.6.egg/boto/connection.py", line 591, in make_request
>     headers=headers)
>   File "/Library/Python/2.6/site-packages/boto-1.9b-py2.6.egg/boto/connection.py", line 459, in make_request
>     return self._mexe(method, path, data, headers, host, sender)
>   File "/Library/Python/2.6/site-packages/boto-1.9b-py2.6.egg/boto/connection.py", line 435, in _mexe
>     raise BotoServerError(response.status, response.reason, body)
> BotoServerError: BotoServerError: 500 Internal Server Error
> <?xml version="1.0"?>
> <Response><Errors><Error><Code>InsufficientInstanceCapacity</Code>
> <Message>We currently do not have sufficient m1.large capacity
> in the Availability Zone you requested (us-east-1a). Our system will
> be working on provisioning additional capacity. You can currently
> get m1.large capacity by not specifying an Availability Zone in your
> request or choosing us-east-1d.</Message></Error></Errors>
> <RequestID>48be0131-84aa-40c7-a054-fb227c6fa183</RequestID></Response>
>
> Thanks,
>
> Dean A. Snyder
> Senior Programmer/Analyst
> Center for Inherited Disease Research (CIDR)
> Johns Hopkins School of Medicine
> Bayview Research Campus
> 333 Cassell Dr, Triad Bldg, Suite 2000
> Baltimore, MD 21224
> cell:717 668-3048 office:410-550-4629
> www.cidr.jhmi.edu
> 

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.15 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkwaRPYACgkQ4llAkMfDcrlGIACZAaK5IQsAdAnG8Wp9k//QDFR4
UnwAoIQKXvvMQFX0EDKn95EnJX6y8LHg
=Zquq
-----END PGP SIGNATURE-----


