[StarCluster] Star cluster issues when running MIST over the cloud
Jacob Barhak
jacob.barhak at gmail.com
Tue Jul 23 02:08:48 EDT 2013
Hi Rayson,
First, thank you for replying quickly.
The first issue I mentioned is intermittent and not reproducible: I was
able to easily start a 5 node cluster and a 20 node cluster just now. I
assume it was due to communication problems and will forget about it.
Nevertheless, it would help to have a tool, such as a log, that helps find
out what went wrong and diagnose whether this is a network problem or
something else.
The second issue, however, is reproducible. I just tried again with a 20
node cluster.
This time I posted 2497 jobs to the queue, each about 1 minute long. The
system stopped sending jobs from the queue at about the halfway point.
There were 1310 jobs in the queue when the system stopped sending more jobs.
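
One way to watch whether the queue is still draining is to count jobs by
state from the plain "qstat" listing. The following is only a minimal
sketch: it assumes the default column layout in which the job state (for
example "qw" for pending, "r" for running) is the fifth whitespace-separated
field, and count_job_states is a hypothetical helper name, not part of SGE
or StarCluster.

```python
from collections import Counter


def count_job_states(qstat_output: str) -> Counter:
    """Count jobs per state in the text printed by a plain `qstat` call.

    Assumption: each job row starts with a numeric job ID and carries the
    state code in the fifth whitespace-separated column.
    """
    counts = Counter()
    for line in qstat_output.splitlines():
        fields = line.split()
        # Header and separator rows do not start with a numeric job ID.
        if len(fields) >= 5 and fields[0].isdigit():
            counts[fields[4]] += 1
    return counts
```

Feeding it the captured output of "qstat" every few minutes and comparing
the "qw" and "r" counts would show whether pending jobs are still being
picked up or whether dispatching has stalled.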
When running "qstat -j" the system provided the following answer:
scheduling info: (Collecting of scheduler job information is turned off)
I am not familiar with the error messages, yet it seems I need to enable
something that is turned off. If there is a quick, obvious solution for
this, please let me know what to do; otherwise, are there any other
diagnostic tools I can use?
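
As a rough illustration, the message above can be detected programmatically
so that a monitoring script can flag it. This is a sketch under stated
assumptions: the helper names are hypothetical, and the hint text reflects
my understanding that Grid Engine's scheduler configuration (edited with
"qconf -msconf") has a "schedd_job_info" parameter controlling this
behavior; please verify that against the sched_conf documentation for the
installed version.

```python
import subprocess

# Message text copied from the qstat -j output quoted above.
SCHED_INFO_OFF = "Collecting of scheduler job information is turned off"

# Assumption: schedd_job_info is the scheduler setting behind this message.
HINT = (
    "Scheduler job information is disabled; consider checking the "
    "schedd_job_info setting via 'qconf -msconf'."
)


def scheduler_info_disabled(qstat_j_output: str) -> bool:
    """Return True if the qstat -j text reports that job info is off."""
    # Collapse whitespace so a message wrapped across lines still matches.
    normalized = " ".join(qstat_j_output.split())
    return SCHED_INFO_OFF in normalized


def diagnose_job(job_id: str) -> str:
    """Run `qstat -j <job_id>` and return a hint or the raw output."""
    result = subprocess.run(
        ["qstat", "-j", job_id],
        capture_output=True,
        text=True,
        check=False,
    )
    if scheduler_info_disabled(result.stdout):
        return HINT
    return result.stdout
```

Only scheduler_info_disabled is pure text processing; diagnose_job needs to
run on a host where the qstat command is available, such as the cluster
master.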
Again, thanks for the quick reply and I hope this is an easy fix.
Jacob
On Mon, Jul 22, 2013 at 2:51 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
> On Sun, Jul 21, 2013 at 1:40 AM, Jacob Barhak <jacob.barhak at gmail.com>
> wrote:
> > 1. Sometimes starcluster is unable to properly connect the instances on
> > the start command and cannot mount /home. It happened once when I asked
> > for 5 m1.small machines, and when I terminated this cluster and started
> > again things went fine. Is this intermittent due to cloud traffic or is
> > this a bug? Is there a way for me to check why?
>
> It could be a problem with the actual hardware. Next time you encounter
> this issue, can you ssh into the node, mount /home by hand, and see
> whether you can reproduce it when run interactively?
>
>
> > 2. After launching 20 c1.xlarge machines and running about 2500 jobs,
> > each about 5 minutes long, I encountered a problem after an hour or so.
> > It seems that SGE stopped sending jobs from the queue to the instances.
> > No error was found and the queue showed about 850 pending jobs. This did
> > not change for a while and I could not find any failure with qstat or
> > qhost. No jobs were running on any nodes and I waited a while for these
> > to start without success. I tried the same thing again after a few hours
> > and it seems that the cluster stops sending jobs from the queue after
> > about 1600 jobs have been submitted. This does not happen when SGE is
> > installed on a single Ubuntu machine I have at home. I am trying to
> > figure out what is wrong. Did you impose some limit on the number of
> > jobs? Can this be fixed? I really need to submit many jobs - tens of
> > thousands of jobs in my full runs. This one was relatively small and
> > still did not pass.
>
> Looks like SGE thinks that the nodes are down or in alarm or error
> state? To find out why SGE thinks there are no nodes available, run:
> qstat -j
>
>
>
> > 3. I tried to start Star Cluster with 50 nodes and got an error about
> > exceeding a quota of 20. Is it your quota or Amazon quota? Are there any
> > other restrictions I should be aware of at the beginning? Also, after
> > the system is unable to start the cluster it thinks it is still running
> > and a terminate command is needed before another start can be issued -
> > even though nothing got started.
>
> It's Amazon's quota. 50 is considered small by AWS standards, and they
> can give it to you almost right away... You need to ask AWS to
> give you a higher limit:
> https://aws.amazon.com/contact-us/ec2-request/
>
> Note that last year we requested 10,000 nodes and the whole
> process took less than 1 day:
>
> http://blogs.scalablelogic.com/2012/11/running-10000-node-grid-engine-cluster.html
>
> Rayson
>
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
>
>
> >
> > This all happened on us-west-2 with the help of star cluster 0.93.3 and
> > the Anaconda AMI - ami-a4d64194
> >
> > Here is some more information on what I am doing to help you answer the
> > above.
> >
> > I am running Monte Carlo simulations to simulate chronic disease
> > progression. I am using MIST to run over the cloud:
> >
> >
> > https://github.com/scipy/scipy2013_talks/blob/master/talks/jacob_barhak/readme.md
> >
> > The Reference Model is what I am running using MIST:
> >
> > http://youtu.be/7qxPSgINaD8
> >
> > I am launching many simulations in parallel and it takes me days on a
> > single 8 core machine. The cloud allows me to cut down this time to
> > hours. This is why star cluster is so useful. In the past I did this
> > over other clusters, yet the cloud is still new to me.
> >
> > I will appreciate any recommendations I can get from you to improve the
> > behaviors I am experiencing.
> >
> > --
> > Jacob Barhak Ph.D.
> > http://sites.google.com/site/jacobbarhak/
> >
> >
> > Sent from my iPhone
> >
>