[StarCluster] Star cluster issues when running MIST over the cloud

Jacob Barhak jacob.barhak at gmail.com
Wed Jul 24 18:47:35 EDT 2013


Thanks Again Rayson,

Your explanation allowed me to continue. I had to jump through several
hoops, since the Windows cmd terminal does not handle the special characters
well, and working with vi without seeing what you type is painful - perhaps
there is a simple fix for this. Nevertheless, once the SGE configuration
file was fixed I launched the simulations again.
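
For reference, a possible way to avoid the interactive vi session next time -
a sketch only, assuming qconf invokes $EDITOR with the temporary configuration
file as its first argument (I believe this is the usual Grid Engine behavior,
but I have not verified it on the StarCluster AMI):

# Run on the master, e.g. via: starcluster sshmaster mycluster
# A tiny "editor" script that flips the flag; the sleep makes sure the file's
# timestamp changes, since qconf may otherwise decide nothing was modified.
cat > /tmp/set_schedd_job_info.sh << 'EOF'
#!/bin/sh
sleep 1
sed -i -e 's/^schedd_job_info.*/schedd_job_info true/' "$1"
EOF
chmod +x /tmp/set_schedd_job_info.sh

# Let qconf use the script instead of vi:
EDITOR=/tmp/set_schedd_job_info.sh qconf -msconf

# Verify the change:
qconf -ssconf | grep schedd_job_info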

This time the simulations stopped with 1331 jobs waiting in the queue. I ran
qstat -j and it produced the attached output. Basically, it seems that the
system either does not handle the dependencies I have in the simulation for
some reason or it simply overloads. At some point it appears to drop all
queues. At least this is what I understand from the output.

This may explain why this happens with around 1300 jobs left. My simulation
is basically composed of 3 parts; the last 2 parts consist of 1248 + 1 jobs
that depend on the first 1248 jobs. My assumption - based on the output - is
that the dependencies get lost or perhaps overload the system - yet I may be
wrong and this is why I am asking the experts.
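
To make the dependency structure concrete, the chain is roughly equivalent to
the following - a minimal sketch with made-up script names rather than the
actual MIST submission code, using SGE job names and -hold_jid:

# Part 1: 1248 independent jobs, all sharing the job name "part1".
for i in $(seq 1 1248); do
    qsub -N part1 part1_worker.sh $i
done

# Parts 2 and 3: 1248 jobs plus one final job, all held until every
# "part1" job has finished.
for i in $(seq 1 1248); do
    qsub -N part2 -hold_jid part1 part2_worker.sh $i
done
qsub -N part3 -hold_jid part1 collect_results.sh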

Note that this does not happen with smaller simulations such as my test
suite. It also does not happen on an SGE cluster I installed on a single
machine. So is this an SGE issue, or should I report it to this group as a
StarCluster issue?

Also - and this is a different issue - I tried a larger simulation of about
25K jobs. It seems that the system overloads at some point and gives me the
following error:
Unable to run job: rule "default rule (spool dir)" in spooling context
"flatfile spooling" failed writing an object
job 11618 was rejected cause it can't be written: No space left on device.
Exiting.
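
The message points at the qmaster spool filling up. A quick way to confirm
this - a sketch only, assuming the /opt/sge6 layout quoted later in this
thread; adjust the paths if your install differs - is:

# Free space on every filesystem of the master:
starcluster sshmaster mycluster "df -h"

# Size of the SGE cell, including the flatfile spool:
starcluster sshmaster mycluster "du -sh /opt/sge6/default/spool /opt/sge6/default/common"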

In the above simulations the cluster was created using:
starcluster start -s 20 -i c1.xlarge mycluster

I assume that a master with 7GB of memory and 4x420GB of disk space is
sufficient to support the operations I request. Outside the cloud I have a
single physical machine with 16GB of memory and less than 0.5TB of allocated
disk space that handles this entire simulation on its own without going to
the cloud - yet it takes much more time, and therefore the ability to run on
the cloud is important.
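
That said, the assumption is worth double checking, because the directories
that actually get written to heavily (/opt/sge6, /home and the job output
directories) may sit on a small root volume rather than on the 4x420GB of
ephemeral disks. A sketch of the check, with mount points that are only the
usual EC2/StarCluster defaults and not verified against this AMI:

# Show which device and filesystem back each heavily written directory:
starcluster sshmaster mycluster "df -h / /home /opt/sge6 /mnt"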

If anyone can help me fix these issues I would be grateful, since I wish to
release a new version of my code that can run simulations on the cloud at
larger scales. So far I can run simulations only on small-scale clusters on
the cloud because of the above issues.

Again Rayson, thanks for your guidance and hopefully there is a simple
explanation/fix for this.


                 Jacob




On Wed, Jul 24, 2013 at 7:29 AM, Rayson Ho <raysonlogin at gmail.com> wrote:

> "qconf -msconf" will try to open the editor, which is vi. And going
> through  starcluster sshmaster to launch vi may not be the best
> approach.
>
> This is what I just tested and worked for me:
>
> 1) ssh into the master by running:
> starcluster sshmaster mycluster
>
> 2) launch qconf -msconf and change schedd_job_info to true:
> qconf -msconf
>
> Rayson
>
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
>
>
> On Wed, Jul 24, 2013 at 6:05 AM, Jacob Barhak <jacob.barhak at gmail.com>
> wrote:
> > Thanks Again Rayson,
> >
> > Yet even with your generous help I am still stuck. Perhaps you can look at
> > what I am doing and correct me.
> >
> > I tried running the commands you suggested to reconfigure the scheduler and
> > it seems the system hangs on me.
> >
> > Here is what I did:
> >
> > 1. Created a 1 node cluster.
> > 2. Copied the configuration file to my machine using:
> > starcluster get mycluster /opt/sge6/default/common/sched_configuration .
> > 3. Modified line 12 in sched_configuration to read
> > schedd_job_info TRUE
> > 4. Copied the modified sched_configuration file to the default local
> > directory on the cluster using:
> > starcluster put mycluster sched_configuration .
> > 5. run the configuration command you suggested:
> > starcluster sshmaster mycluster "qconf -msconf sched_configuration"
> >
> > The system hangs in the last stage and does not return unless I press
> > control break - even control C does not work. I waited a few minutes then
> > terminated the cluster. I double checked this behavior using the -u root
> > argument when running the commands to ensure root privileges.
> >
> > I am using a Windows 7 machine to issue those commands. I use the PythonXY
> > distribution and installed starcluster using easy_install. I am providing
> > this information to see if there is anything wrong with my system
> > compatibility-wise.
> >
> > I am attaching the configuration file and the transcript.
> >
> > I am getting funny characters and it seems the qconf command I issued does
> > not recognize the change in schedd_job_info.
> >
> > Is there any other way I can find out what the problem is - if you recall,
> > I am trying to figure out why jobs in the queue are not dispatched to run
> > and keep waiting forever. This happens after a few hundred jobs I am sending.
> >
> > I hope you could find time to look at this once more, or perhaps someone
> > else can help.
> >
> >
> >                    Jacob
> >
> >
> >
> >
> > On Tue, Jul 23, 2013 at 6:44 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
> >>
> >> On Tue, Jul 23, 2013 at 2:08 AM, Jacob Barhak <jacob.barhak at gmail.com>
> >> wrote:
> >> > The second issue, however, is reproducible. I just tried again a 20 node
> >> > cluster.
> >>
> >> Yes, it will always be reproducible - you can only start 20 on-demand
> >> instances in a region. Again, you need to fill out the "Request to
> >> Increase Amazon EC2 Instance Limit" form if you want more than 20:
> >>
> >> https://aws.amazon.com/contact-us/ec2-request/
> >>
> >>
> >> > This time I posted 2497 jobs to the queue - each about 1 minute long.
> >> > The system stopped sending jobs to the queue at about the halfway point.
> >> > There were 1310 jobs in the queue when the system stopped sending more
> >> > jobs. When running "qstat -j" the system provided the following answer:
> >> >
> >> > scheduling info:            (Collecting of scheduler job information is
> >> > turned off)
> >>
> >> That's my fault -- I forgot to point out that the scheduler info is
> >> off by default, and you need to run "qconf -msconf", and change the
> >> parameter "schedd_job_info" to true. See:
> >>
> >> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
> >>
> >> Rayson
> >>
> >> ==================================================
> >> Open Grid Scheduler - The Official Open Source Grid Engine
> >> http://gridscheduler.sourceforge.net/
> >>
> >>
> >>
> >> >
> >> > I am not familiar with the error messages, yet it seems I need to enable
> >> > something that is turned off. If there is a quick obvious solution for
> >> > this please let me know what to do; otherwise, are there any other
> >> > diagnostic tools I can use?
> >> >
> >> > Again, thanks for the quick reply and I hope this is an easy fix.
> >> >
> >> >            Jacob
> >> >
> >> >
> >> > On Mon, Jul 22, 2013 at 2:51 PM, Rayson Ho <raysonlogin at gmail.com>
> >> > wrote:
> >> >>
> >> >> On Sun, Jul 21, 2013 at 1:40 AM, Jacob Barhak <jacob.barhak at gmail.com>
> >> >> wrote:
> >> >> > 1. Sometimes starcluster is unable to properly connect the instances
> >> >> > on the start command and cannot mount /home. It happened once when I
> >> >> > asked for 5 m1.small machines, and when I terminated this cluster and
> >> >> > started again things went fine. Is this intermittent due to cloud
> >> >> > traffic or is this a bug? Is there a way for me to check why?
> >> >>
> >> >> Can be a problem with the actual hardware - can you ssh into the node
> >> >> and manually mount /home by hand next time you encounter this issue
> >> >> and see if you can reproduce it when run interactively?
> >> >>
> >> >>
> >> >> > 2. After launching 20 c1.xlarge machines and running about 2500 jobs,
> >> >> > each about 5 minutes long, I encountered a problem after an hour or so.
> >> >> > It seems that SGE stopped sending jobs from the queue to the instances.
> >> >> > No error was found and the queue showed about 850 pending jobs. This
> >> >> > did not change for a while and I could not find any failure with qstat
> >> >> > or qhost. No jobs were running on any nodes and I waited a while for
> >> >> > these to start, without success. I tried the same thing again after a
> >> >> > few hours and it seems that the cluster stops sending jobs from the
> >> >> > queue after about 1600 jobs have been submitted. This does not happen
> >> >> > when SGE is installed on a single Ubuntu machine I have at home. I am
> >> >> > trying to figure out what is wrong. Did you impose some limit on the
> >> >> > number of jobs? Can this be fixed? I really need to submit many jobs -
> >> >> > tens of thousands of jobs in my full runs. This one was relatively
> >> >> > small and still did not pass.
> >> >>
> >> >> Looks like SGE thinks that the nodes are down or in alarm or error
> >> >> state? To find out why SGE thinks there are no nodes available, run:
> >> >> qstat -j
> >> >>
> >> >>
> >> >>
> >> >> > 3. I tried to start StarCluster with 50 nodes and got an error about
> >> >> > exceeding a quota of 20. Is it your quota or Amazon's quota? Are there
> >> >> > any other restrictions I should be aware of at the beginning? Also,
> >> >> > after the system is unable to start the cluster it thinks it is still
> >> >> > running, and a terminate command is needed before another start can be
> >> >> > issued - even though nothing got started.
> >> >>
> >> >> It's Amazon's quota. 50 is considered small by AWS standard, and they
> >> >> can give it to you almost right away... You need to request AWS to
> >> >> give you a higher limit:
> >> >> https://aws.amazon.com/contact-us/ec2-request/
> >> >>
> >> >> Note that last year we requested for 10,000 nodes and the whole
> >> >> process took less than 1 day:
> >> >>
> >> >> http://blogs.scalablelogic.com/2012/11/running-10000-node-grid-engine-cluster.html
> >> >>
> >> >> Rayson
> >> >>
> >> >> ==================================================
> >> >> Open Grid Scheduler - The Official Open Source Grid Engine
> >> >> http://gridscheduler.sourceforge.net/
> >> >>
> >> >>
> >> >> >
> >> >> > This all happened on us-west-2 with the help of StarCluster 0.93.3 and
> >> >> > the Anaconda AMI - ami-a4d64194
> >> >> >
> >> >> > Here is some more information on what I am doing to help you answer the
> >> >> > above.
> >> >> >
> >> >> > I am running Monte Carlo simulations to simulate chronic disease
> >> >> > progression. I am using MIST to run over the cloud:
> >> >> >
> >> >> > https://github.com/scipy/scipy2013_talks/blob/master/talks/jacob_barhak/readme.md
> >> >> >
> >> >> > The Reference Model is what I am running using MIST:
> >> >> >
> >> >> > http://youtu.be/7qxPSgINaD8
> >> >> >
> >> >> > I am launching many simulations in parallel and it takes me days on a
> >> >> > single 8 core machine. The cloud allows me to cut down this time to
> >> >> > hours. This is why StarCluster is so useful. In the past I did this over
> >> >> > other clusters yet the cloud is still new to me.
> >> >> >
> >> >> > I would appreciate any recommendations I can get from you to improve
> >> >> > the behaviors I am experiencing.
> >>
> >> http://commons.wikimedia.org/wiki/User:Raysonho
> >
> >
>
-------------- next part --------------
scheduling info:            queue instance "all.q@node016" dropped because it is temporarily not available
                            queue instance "all.q@node011" dropped because it is temporarily not available
                            queue instance "all.q@node002" dropped because it is temporarily not available
                            queue instance "all.q@node012" dropped because it is temporarily not available
                            queue instance "all.q@master" dropped because it is temporarily not available
                            queue instance "all.q@node017" dropped because it is temporarily not available
                            queue instance "all.q@node010" dropped because it is temporarily not available
                            queue instance "all.q@node003" dropped because it is temporarily not available
                            queue instance "all.q@node013" dropped because it is temporarily not available
                            queue instance "all.q@node019" dropped because it is temporarily not available
                            queue instance "all.q@node004" dropped because it is temporarily not available
                            queue instance "all.q@node005" dropped because it is temporarily not available
                            queue instance "all.q@node009" dropped because it is temporarily not available
                            queue instance "all.q@node014" dropped because it is temporarily not available
                            queue instance "all.q@node006" dropped because it is temporarily not available
                            queue instance "all.q@node018" dropped because it is temporarily not available
                            queue instance "all.q@node015" dropped because it is temporarily not available
                            queue instance "all.q@node007" dropped because it is temporarily not available
                            queue instance "all.q@node001" dropped because it is temporarily not available
                            queue instance "all.q@node008" dropped because it is temporarily not available
                            All queues dropped because of overload or full

Job dropped because of job dependencies
        2372,   2376,   2377,   2378,   2379,   2380,   2381,   2382,
        2383,   2384,   2385,   2386,   2387,   2388,   2389,   2390,
        2391,   2392,   2393,   2394,   2395,   2396,   2397,   2398,
        2399,   2400,   2401,   2402,   2403,   2404,   2405,   2406,
        2407,   2408,   2409,   2410,   2411,   2412,   2413,   2414,
        2415,   2416,   2417,   2418,   2419,   2420,   2421,   2422,
        2423,   2424,   2425,   2426,   2427,   2428,   2429,   2430,
        2431,   2432,   2433,   2434,   2435,   2436,   2437,   2438,
        2439,   2440,   2441,   2442,   2443,   2444,   2445,   2446,
        2447,   2448,   2449,   2450,   2451,   2452,   2453,   2454,
        2455,   2456,   2457,   2458,   2459,   2460,   2461,   2462,
        2463,   2464,   2465,   2466,   2467,   2468,   2469,   2470,
        2471,   2472,   2473,   2474,   2475,   2476,   2477,   2478,
        2479,   2480,   2481,   2483,   2485,   2486,   2487,   2489,
        2491,   2492,   2493,   2494,   2495,   2497

