[StarCluster] Star cluster issues when running MIST over the cloud

Rayson Ho raysonlogin at gmail.com
Tue Jul 23 19:44:47 EDT 2013


On Tue, Jul 23, 2013 at 2:08 AM, Jacob Barhak <jacob.barhak at gmail.com> wrote:
> The second issue, however, is reproducible. I just tried again with a
> 20-node cluster.

Yes, it will always be reproducible: by default you can only start 20
on-demand instances in a region. Again, you need to fill out the
"Request to Increase Amazon EC2 Instance Limit" form if you want more
than 20:

https://aws.amazon.com/contact-us/ec2-request/
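If you want to check the current limit from the command line, something
like this should work (a rough sketch -- it assumes you have the AWS
CLI installed and configured; "max-instances" is the account attribute
that holds the per-region on-demand instance limit):

    # Show the on-demand instance limit for this account/region
    aws ec2 describe-account-attributes \
        --attribute-names max-instances \
        --region us-west-2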


> This time I posted 2497 jobs to the queue, each about 1 minute long.
> The system stopped dispatching jobs from the queue at about the
> halfway point; there were 1310 jobs still in the queue when it stopped
> sending more.
> When running "qstat -j" the system provided the following answer:
>
> scheduling info:            (Collecting of scheduler job information is
> turned off)

That's my fault -- I forgot to point out that scheduler job info is
off by default. You need to run "qconf -msconf" and change the
parameter "schedd_job_info" to "true". See:

http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
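For example (a sketch of the procedure; "qconf -msconf" opens the
scheduler configuration in your $EDITOR, so the edit itself is
interactive):

    # Check the current setting
    qconf -ssconf | grep schedd_job_info

    # Open the scheduler configuration in $EDITOR and change the line
    #     schedd_job_info   false
    # to
    #     schedd_job_info   true
    qconf -msconf

    # From then on, a pending job reports why it is not being scheduled:
    qstat -j <job_id>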

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/



>
> I am not familiar with the error messages, yet it seems I need to enable
> something that is turned off. If there is a quick, obvious solution,
> please let me know what to do; otherwise, are there any other diagnostic
> tools I can use?
>
> Again, thanks for the quick reply and I hope this is an easy fix.
>
>            Jacob
>
>
> On Mon, Jul 22, 2013 at 2:51 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
>>
>> On Sun, Jul 21, 2013 at 1:40 AM, Jacob Barhak <jacob.barhak at gmail.com>
>> wrote:
>> > 1. Sometimes StarCluster is unable to properly connect the instances
>> > on the start command and cannot mount /home. It happened once when I
>> > asked for 5 m1.small machines, and when I terminated this cluster and
>> > started again things went fine. Is this intermittent due to cloud
>> > traffic, or is this a bug? Is there a way for me to check why?
>>
>> It can be a problem with the actual hardware. Next time you encounter
>> this issue, can you ssh into the node, mount /home by hand, and see
>> whether you can reproduce it interactively?
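
Something along these lines should do it (a sketch -- it assumes the
default StarCluster NFS layout, where the master exports /home to the
nodes; "mycluster" and "node001" are placeholder names):

    # Log in to the node that failed to mount /home
    starcluster sshnode mycluster node001

    # On the node: check that the master's NFS export is visible
    showmount -e master

    # Try the mount by hand and look at the error, if any
    mount -t nfs master:/home /home
    df -h /home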
>>
>>
>> > 2. After launching 20 c1.xlarge machines and running about 2500 jobs,
>> > each about 5 minutes long, I encountered a problem after an hour or
>> > so. It seems that SGE stopped dispatching jobs from the queue to the
>> > instances. No error was found and the queue showed about 850 pending
>> > jobs. This did not change for a while, and I could not find any
>> > failure with qstat or qhost. No jobs were running on any nodes, and I
>> > waited a while for them to start, without success. I tried the same
>> > thing again after a few hours, and it seems that the cluster stops
>> > dispatching jobs from the queue after about 1600 jobs have been
>> > submitted. This does not happen when SGE is installed on a single
>> > Ubuntu machine I have at home. I am trying to figure out what is
>> > wrong. Did you impose some limit on the number of jobs? Can this be
>> > fixed? I really need to submit many jobs - tens of thousands of jobs
>> > in my full runs. This one was relatively small and still did not pass.
>>
>> It looks like SGE thinks the nodes are down, or in an alarm or error
>> state. To find out why SGE thinks there are no nodes available, run:
>>
>> qstat -j
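
To expand on that a bit: a few more generic SGE checks that can help
narrow down why nothing is being dispatched (not StarCluster-specific):

    # List every queue instance and its state; watch the "states" column
    # for 'a' (alarm), 'E' (error), or 'au' (alarm/unreachable)
    qstat -f

    # Ask SGE to explain queue instances that are in the error state
    qstat -f -explain E

    # Check whether a global job limit is configured (0 means unlimited)
    qconf -sconf | grep -E 'max_jobs|max_u_jobs'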
>>
>>
>>
>> > 3. I tried to start StarCluster with 50 nodes and got an error about
>> > exceeding a quota of 20. Is it your quota or Amazon's quota? Are there
>> > any other restrictions I should be aware of at the beginning? Also,
>> > after the system is unable to start the cluster, it thinks it is still
>> > running, and a terminate command is needed before another start can be
>> > issued - even though nothing got started.
>>
>> It's Amazon's quota. 50 is considered small by AWS standards, and they
>> can usually grant it almost right away... You need to ask AWS for a
>> higher limit:
>> https://aws.amazon.com/contact-us/ec2-request/
>>
>> Note that last year we requested 10,000 nodes and the whole process
>> took less than a day:
>>
>>
>> http://blogs.scalablelogic.com/2012/11/running-10000-node-grid-engine-cluster.html
>>
>> Rayson
>>
>> ==================================================
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> http://gridscheduler.sourceforge.net/
>>
>>
>> >
>> > This all happened on us-west-2 with StarCluster 0.93.3 and the
>> > Anaconda AMI (ami-a4d64194).
>> >
>> > Here is some more information on what I am doing to help you answer the
>> > above.
>> >
>> > I am running Monte Carlo simulations to simulate chronic disease
>> > progression. I am using MIST to run them over the cloud:
>> >
>> >
>> > https://github.com/scipy/scipy2013_talks/blob/master/talks/jacob_barhak/readme.md
>> >
>> > The Reference Model is what I am running using MIST:
>> >
>> > http://youtu.be/7qxPSgINaD8
>> >
>> > I am launching many simulations in parallel, and it takes me days on
>> > a single 8-core machine. The cloud allows me to cut this time down to
>> > hours, which is why StarCluster is so useful. In the past I did this
>> > on other clusters, yet the cloud is still new to me.
>> >
>> > I would appreciate any recommendations you can give to improve the
>> > behavior I am experiencing.

http://commons.wikimedia.org/wiki/User:Raysonho

