[StarCluster] Star cluster issues when running MIST over the cloud

Jacob Barhak jacob.barhak at gmail.com
Wed Jul 24 06:05:37 EDT 2013


Thanks again, Rayson,

Yet even with your generous help I am still stuck. Perhaps you can look at
what I am doing and correct me.

I tried running the commands you suggested to reconfigure the scheduler, and
the system hangs on me.

Here is what I did:

1. Created a 1 node cluster.
2. Copied the configuration file to my machine using:
starcluster get mycluster /opt/sge6/default/common/sched_configuration .
3. Modified line 12 in sched_configuration to read
schedd_job_info TRUE
4. Copied the modified sched_configuration file to the default local
directory on the cluster using:
starcluster put mycluster sched_configuration .
5. Ran the configuration command you suggested (a non-interactive variant is
sketched below):
starcluster sshmaster mycluster "qconf -msconf sched_configuration"
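
(For reference, here is a non-interactive variant of step 5 that I have not
tried myself. It assumes this qconf build supports the file-based -Msconf
option and, for the second form, that qconf hands a temporary copy of the
configuration to whatever $EDITOR points at.)

    # Variant A: apply the sched_configuration file uploaded in step 4
    # directly, without opening an editor (assumes qconf supports -Msconf).
    starcluster sshmaster mycluster "qconf -Msconf sched_configuration"

    # Variant B: log in first, then point EDITOR at a tiny sed script so
    # qconf -msconf never waits for interactive input.
    starcluster sshmaster mycluster
    # ...then, on the master node:
    printf '#!/bin/sh\nsed -i "s/^schedd_job_info.*/schedd_job_info true/" "$1"\n' > /tmp/set_schedd_job_info.sh
    chmod +x /tmp/set_schedd_job_info.sh
    EDITOR=/tmp/set_schedd_job_info.sh qconf -msconf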

The system hangs at the last step and does not return unless I press
Ctrl-Break; even Ctrl-C does not work. I waited a few minutes and then
terminated the cluster. I double-checked this behavior by adding the -u root
argument when running the commands, to make sure I had root privileges.

I am issuing those commands from a Windows 7 machine. I use the PythonXY
distribution and installed StarCluster using easy_install. I am providing
this information in case something on my system is causing a compatibility
problem.

I am attaching the configuration file and the transcript.

The transcript contains funny control characters, and it seems the qconf
command I issued did not pick up the change to schedd_job_info.
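
(To double-check whether the change actually took effect, something like the
following should work without opening an editor, since qconf -ssconf just
prints the current scheduler configuration:)

    starcluster sshmaster mycluster "qconf -ssconf | grep schedd_job_info"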

Is there any other way I can find out what the problem is? If you recall, I
am trying to figure out why jobs in the queue are never dispatched and keep
waiting forever. This happens after I have submitted a few hundred jobs.
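
(For reference, the basic SGE checks I am aware of are listed below; the
qmaster log path is my guess based on the /opt/sge6/default layout used
above.)

    starcluster sshmaster mycluster
    # ...then, on the master node:
    qstat -f             # queue-by-queue view; look for error or alarm states
    qstat -j <job_id>    # details for one pending job; the "scheduling info"
                         # section is only filled in once schedd_job_info is true
    qhost                # per-host load and slot counts
    # qmaster log (path assumed from the /opt/sge6/default layout):
    tail -n 50 /opt/sge6/default/spool/qmaster/messages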

I hope you can find time to look at this once more, or perhaps someone else
can help.


                   Jacob




On Tue, Jul 23, 2013 at 6:44 PM, Rayson Ho <raysonlogin at gmail.com> wrote:

> On Tue, Jul 23, 2013 at 2:08 AM, Jacob Barhak <jacob.barhak at gmail.com>
> wrote:
> > The second issue, however, is reproducible. I just tried again with a
> > 20-node cluster.
>
> Yes, it will always be reproducible -- you can only start 20 on-demand
> instances in a region. Again, you need to fill out the "Request to
> Increase Amazon EC2 Instance Limit" form if you want more than 20:
>
> https://aws.amazon.com/contact-us/ec2-request/
>
>
> > This time I posted 2497 jobs to the queue, each about 1 minute long. The
> > system stopped sending jobs to the queue at about the halfway point; there
> > were 1310 jobs in the queue when the system stopped sending more jobs.
> > When running "qstat -j" the system provided the following answer:
> >
> > scheduling info:            (Collecting of scheduler job information is
> > turned off)
>
> That's my fault -- I forgot to point out that the scheduler info is
> off by default, and you need to run "qconf -msconf", and change the
> parameter "schedd_job_info" to true. See:
>
> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html
>
> Rayson
>
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
>
>
>
> >
> > I am not familiar with the error messages, yet it seems I need to enable
> > something that is turned off. If there is a quick, obvious solution for
> > this, please let me know what to do; otherwise, are there any other
> > diagnostic tools I can use?
> >
> > Again, thanks for the quick reply and I hope this is an easy fix.
> >
> >            Jacob
> >
> >
> > On Mon, Jul 22, 2013 at 2:51 PM, Rayson Ho <raysonlogin at gmail.com>
> wrote:
> >>
> >> On Sun, Jul 21, 2013 at 1:40 AM, Jacob Barhak <jacob.barhak at gmail.com>
> >> wrote:
> >> > 1. Sometimes StarCluster is unable to properly connect the instances on
> >> > the start command and cannot mount /home. It happened once when I asked
> >> > for 5 m1.small machines; when I terminated this cluster and started
> >> > again, things went fine. Is this intermittent due to cloud traffic, or
> >> > is this a bug? Is there a way for me to check why?
> >>
> >> It can be a problem with the actual hardware -- can you ssh into the node
> >> and mount /home by hand the next time you encounter this issue, and see
> >> if you can reproduce it when run interactively?
> >>
> >>
> >> > 2. After launching 20 c1.xlarge machines and running about 2500 jobs,
> >> > each about 5 minutes long, I encountered a problem after an hour or so.
> >> > It seems that SGE stopped sending jobs from the queue to the instances.
> >> > No error was found and the queue showed about 850 pending jobs. This did
> >> > not change for a while, and I could not find any failure with qstat or
> >> > qhost. No jobs were running on any nodes, and I waited a while for these
> >> > to start without success. I tried the same thing again after a few hours
> >> > and it seems that the cluster stops sending jobs from the queue after
> >> > about 1600 jobs have been submitted. This does not happen when SGE is
> >> > installed on a single Ubuntu machine I have at home. I am trying to
> >> > figure out what is wrong. Did you impose some limit on the number of
> >> > jobs? Can this be fixed? I really need to submit many jobs - tens of
> >> > thousands of jobs in my full runs. This one was relatively small and
> >> > still did not pass.
> >>
> >> Looks like SGE thinks that the nodes are down or in alarm or error
> >> state? To find out why SGE thinks there are no nodes available, run:
> >> qstat -j
> >>
> >>
> >>
> >> > 3. I tried to start StarCluster with 50 nodes and got an error about
> >> > exceeding a quota of 20. Is it your quota or Amazon's quota? Are there
> >> > any other restrictions I should be aware of at the beginning? Also,
> >> > after the system fails to start the cluster, it thinks the cluster is
> >> > still running, and a terminate command is needed before another start
> >> > can be issued - even though nothing got started.
> >>
> >> It's Amazon's quota. 50 is considered small by AWS standards, and they
> >> can give it to you almost right away... You need to ask AWS to grant
> >> you a higher limit:
> >> https://aws.amazon.com/contact-us/ec2-request/
> >>
> >> Note that last year we requested 10,000 nodes and the whole
> >> process took less than 1 day:
> >>
> >>
> >>
> >> http://blogs.scalablelogic.com/2012/11/running-10000-node-grid-engine-cluster.html
> >>
> >> Rayson
> >>
> >> ==================================================
> >> Open Grid Scheduler - The Official Open Source Grid Engine
> >> http://gridscheduler.sourceforge.net/
> >>
> >>
> >> >
> >> > This all happened on us-west-2 with the help of StarCluster 0.93.3 and
> >> > the Anaconda AMI - ami-a4d64194.
> >> >
> >> > Here is some more information on what I am doing to help you answer the
> >> > above.
> >> >
> >> > I am running Monte Carlo simulations to simulate chronic disease
> >> > progression. I am using MIST to run over the cloud:
> >> >
> >> > https://github.com/scipy/scipy2013_talks/blob/master/talks/jacob_barhak/readme.md
> >> >
> >> > The Reference Model is what I am running using MIST:
> >> >
> >> > http://youtu.be/7qxPSgINaD8
> >> >
> >> > I am launching many simulations in parallel, and it takes me days on a
> >> > single 8-core machine. The cloud allows me to cut this time down to
> >> > hours; this is why StarCluster is so useful. In the past I did this on
> >> > other clusters, yet the cloud is still new to me.
> >> >
> >> > I will appreciate any recommendations you can give to improve the
> >> > behaviors I am experiencing.
>
> http://commons.wikimedia.org/wiki/User:Raysonho
>
-------------- next part --------------
C:\Users\Work>starcluster start -s 1 mycluster
StarCluster - (http://web.mit.edu/starcluster) (v. 0.93.3)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster at mit.edu

>>> Using default cluster template: smallcluster
>>> Validating cluster template settings...
>>> Cluster template settings are valid
>>> Starting cluster...
>>> Launching a 1-node cluster...
>>> Creating security group @sc-mycluster...
>>> Opening tcp port range 8989-8989 for CIDR 0.0.0.0/0
Reservation:r-78811e4f
>>> Waiting for cluster to come up... (updating every 30s)
>>> Waiting for all nodes to be in a 'running' state...
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for SSH to come up on all nodes...
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for cluster to come up took 3.800 mins
>>> The master node is ec2-54-212-100-62.us-west-2.compute.amazonaws.com
>>> Setting up the cluster...
>>> Configuring hostnames...
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Creating cluster user: None (uid: 1003, gid: 1003)
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Configuring scratch space for user(s):
>>> a_user_not_named_disco
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Configuring /etc/hosts on each node
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Starting NFS server on master
>>> Setting up NFS took 0.026 mins
>>> Configuring passwordless ssh for root
>>> Configuring passwordless ssh for a_user_not_named_disco
>>> Shutting down threads...
20/20 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Configuring SGE...
>>> Setting up NFS took 0.000 mins
>>> Removing previous SGE installation...
>>> Installing Sun Grid Engine...
>>> Creating SGE parallel environment 'orte'
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Adding parallel environment 'orte' to queue 'all.q'
>>> Shutting down threads...
20/20 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Running plugin anaconda_plugin
>>> Configuring cluster took 0.946 mins
>>> Starting cluster took 4.789 mins

The cluster is now ready to use. To login to the master node
as root, run:

    $ starcluster sshmaster mycluster

If you're having issues with the cluster you can reboot the
instances and completely reconfigure the cluster from
scratch using:

    $ starcluster restart mycluster

When you're finished using the cluster and wish to terminate
it and stop paying for service:

    $ starcluster terminate mycluster

Alternatively, if the cluster uses EBS instances, you can
use the 'stop' command to shutdown all nodes and put them
into a 'stopped' state preserving the EBS volumes backing
the nodes:

    $ starcluster stop mycluster

WARNING: Any data stored in ephemeral storage (usually /mnt)
will be lost!

You can activate a 'stopped' cluster by passing the -x
option to the 'start' command:

    $ starcluster start -x mycluster

This will start all 'stopped' nodes and reconfigure the
cluster.

C:\Users\Work>starcluster put mycluster sched_configuration .
StarCluster - (http://web.mit.edu/starcluster) (v. 0.93.3)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster at mit.edu

sched_configuration 100% |||||||||||||||||||||||||||| Time: 00:00:00   0.00 B/s

C:\Users\Work>starcluster sshmaster mycluster "qconf -msconf sched_configuration"
StarCluster - (http://web.mit.edu/starcluster) (v. 0.93.3)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster at mit.edu

[terminal escape sequences from the editor session opened by qconf -msconf
removed; the scheduler configuration it displayed was:]

"/tmp/pid-2632-VwyJ3I" 35L, 1453C
algorithm                         default
schedule_interval                 0:0:15
maxujobs                          0
queue_sort_method                 load
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
load_formula                      np_load_avg
schedd_job_info                   false
flush_submit_sec                  0
flush_finish_sec                  0
params                            none
reprioritize_interval             0:0:0
halftime                          168
usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor               5.000000
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.250000
weight_tickets_functional         0
weight_tickets_share              0
share_override_tickets            TRUE
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sched_configuration
Type: application/octet-stream
Size: 968 bytes
Desc: not available
URL: http://mailman.mit.edu/pipermail/starcluster/attachments/20130724/f921e512/attachment-0001.obj

