[StarCluster] addnode fails on bid

Yugarshi Mondal ymondal at berkeley.edu
Fri Feb 14 17:55:00 EST 2014


To Starcluster Mailing Archives and future self:

I couldn't diagnose the source of the problem, but there's a hacky
workaround.

Find cluster.py in the installation. I found it here:
/usr/local/lib/python2.7/dist-packages/StarCluster-0.9999-py2.7.egg/starcluster/cluster.py
If it's not there, check the installation path that's printed to the terminal
during the installation (if you can).

Look for the function wait_for_active_spots (it was around line 1360; the
line number differs between 0.9999 and 0.95).
At the top of the function, after the comments, add:
hackStop = raw_input('Hit Return When all nodes are up...')
Save, exit, and recompile cluster.py (better to make a backup copy of the
original first, just in case).
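
For future readers, here is a minimal standalone sketch of what the added
line does. It is not StarCluster's code: original_wait is a hypothetical
stand-in for the existing body of wait_for_active_spots (whose real
signature isn't shown here), and prompt_fn is a parameter I added only so
the pause can be exercised without a keyboard.

```python
# Standalone sketch of the pause-before-wait hack. `original_wait` is a
# hypothetical stand-in for the rest of StarCluster's wait_for_active_spots;
# the real method's signature is not reproduced here.

try:
    read_line = raw_input  # Python 2, which this StarCluster install runs on
except NameError:
    read_line = input      # Python 3, so the sketch also runs on its own


def wait_with_manual_gate(original_wait, prompt_fn=read_line):
    """Block until the operator presses Return, then run the normal wait."""
    # Same prompt as the one-line hack: execution stops here until Return.
    prompt_fn('Hit Return When all nodes are up...')
    # Only after the manual confirmation does the usual waiting logic run.
    return original_wait()
```

In the real file the effect comes from the single raw_input line placed
before the existing function body; the wrapper above only illustrates the
ordering (prompt first, normal wait second).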

This change pauses the program right after starcluster sends the spot
requests. Watch the EC2 console and wait for the spot requests to be filled.
When they are ALL in a running state, hit Enter. Addnode should then
proceed as usual. If you're not adding a node (i.e. it's your initial spot
cluster start), you can hit Enter right away and let the program do the
waiting; only addnode needs to be manually overseen.
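
The "wait until ALL are running" check I do by eye in the console could in
principle be scripted. Below is a rough sketch under stated assumptions:
fetch_states is a hypothetical callable you would have to supply (it should
return the current state string of each new node, e.g. 'pending' or
'running'), and none of these names come from StarCluster. The 30 second
interval just mirrors the "updating every 30s" in the log.

```python
import time


def all_running(states):
    """True only if there is at least one node and every one is 'running'."""
    return len(states) > 0 and all(s == 'running' for s in states)


def wait_until_all_running(fetch_states, interval=30, timeout=1800,
                           sleep=time.sleep):
    """Poll fetch_states() until every node is 'running' or time runs out.

    fetch_states is a hypothetical, caller-supplied function returning a
    list of state strings; it is an assumption, not a StarCluster API.
    Returns True once all nodes are running, False on timeout.
    """
    waited = 0
    while True:
        if all_running(fetch_states()):
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

An empty list counts as "not ready" on purpose, since the failure in this
thread looks like the wait finishing before the new node exists at all.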

Yoshi


On Thu, Feb 13, 2014 at 8:32 PM, Yugarshi Mondal <ymondal at berkeley.edu> wrote:

> Hey Starcluster,
>
> I'm getting the same error as this guy:
> http://star.mit.edu/cluster/mlarchives/1592.html
>
> Briefly:
> When I go to use addnode, a spot request opens on Amazon (I'm starting a
> spot cluster, so addnode bids). But starcluster proceeds to try to install
> ssh without waiting for the node to come up.
>
> >>> Launching node(s): node002
> SpotInstanceRequest:sir-b35acc5e
> >>> Waiting for spot requests to propagate...
> >>> Waiting for node(s) to come up... (updating every 30s)
> >>> Waiting for all nodes to be in a 'running' state...
> 2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> >>> Waiting for SSH to come up on all nodes...
> 2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
> >>> Waiting for cluster to come up took 0.020 mins
> !!! ERROR - node 'node002' does not exist
>
> Moreover, this only happens when addnode tries to bid (either by default
> because I'm running a spot cluster or by inline directive).
>
> I don't know what to try next, though. Do you guys have any ideas where to
> start?
>
> thanks
> Yoshi
>