<div dir="ltr"><div>Thanks Again Rayson,</div><div> </div><div>Your explanation allowed me to continue. I had to jump through several hoops since the windows cmd terminal does not handle the special characters well and working with vi without seeing what you do is terrible - perhaps there is a simple fix for this. Never the less, once the SGE configuration file was fixed I launched the simulations again.</div>
<div> </div><div>This time the simulations stopped with 1331 jobs waiting in the queue. I ran qstat -j and it gave me the attached output. Basically, it seems that the system does not handle the dependencies in my simulation for some reason, or it simply overloads; it appears to stop all queues at some point. At least this is what I understand from the output.</div>
<div> </div><div>This may explain why this happens with around 1300 jobs left. My simulation is composed of 3 parts; the last 2 parts consist of 1248 + 1 jobs that depend on the first 1248 jobs. My assumption, based on the output, is that dependencies get lost or perhaps overload the system - yet I may be wrong, which is why I am asking the experts.</div>
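<div> </div><div>In case it helps the diagnosis, this is roughly how the jobs are chained - the names below are simplified placeholders, not my actual scripts:</div><div>qsub -N part1 run_part1.sh<br>qsub -N part2 -hold_jid part1 run_part2.sh</div><div>My understanding is that for a pending dependent job, "qstat -j JOB_ID" should list the unresolved predecessors under jid_predecessor_list, which may show whether the scheduler still knows about the dependencies.</div>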
<div> </div><div>Note that this does not happen with smaller simulations such as my test suite. It also does not happen on an SGE cluster I installed on a single machine. So is this an SGE issue, or should I report it to this StarCluster group?</div>
<div> </div><div>Also, and this is a different issue, I tried a larger simulation of about 25K jobs. It seems that the system overloads at some point and gives me the following error:</div><div>Unable to run job: rule "default rule (spool dir)" in spooling context "flatfile<br>
spooling" failed writing an object<br>job 11618 was rejected cause it can't be written: No space left on device.<br>Exiting.</div><div> </div><div>In the above simulations the cluster was created using:</div><div>
starcluster start -s 20 -i c1.xlarge mycluster </div><div> </div><div>I assume that a master with 7 GB of memory and 4x420 GB of disk space is sufficient to support the operations I request. Outside the cloud I have a single physical machine with 16 GB of memory and less than 0.5 TB of allocated disk space that handles this entire simulation on its own without going to the cloud - yet it takes much more time, and therefore the ability to run on the cloud is important.</div>
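<div> </div><div>To check whether the master really runs out of space, I suppose I can run something like this - assuming the default SGE spool location under /opt/sge6/default/spool:</div><div>starcluster sshmaster mycluster "df -h /"<br>starcluster sshmaster mycluster "du -sh /opt/sge6/default/spool"</div>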
<div> </div><div>If anyone can help me fix these issues I would be grateful, since I wish to release a new version of my code that can run simulations on the cloud at larger scales. So far I can run simulations only on small clusters on the cloud due to the above issues.</div>
<div> </div><div>Again, Rayson, thanks for your guidance - hopefully there is a simple explanation/fix for this.</div><div> </div><div> </div><div> Jacob </div><div> </div><div> </div></div><div class="gmail_extra">
<br><br><div class="gmail_quote">On Wed, Jul 24, 2013 at 7:29 AM, Rayson Ho <span dir="ltr"><<a href="mailto:raysonlogin@gmail.com" target="_blank">raysonlogin@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
"qconf -msconf" will try to open the editor, which is vi. And going<br>
through starcluster sshmaster to launch vi may not be the best<br>
approach.<br>
<br>
This is what I just tested and worked for me:<br>
<br>
1) ssh into the master by running:<br>
starcluster sshmaster mycluster<br>
<br>
2) launch qconf -msconf and change schedd_job_info to true:<br>
qconf -msconf<br>
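<br>
If you would rather avoid vi completely (which may help with the Windows<br>
terminal issues), there is also a file-based form - I have not tested it<br>
through starcluster, so treat it as a sketch:<br>
qconf -Msconf sched_configuration<br>
(note the capital M). Either way you can verify the change without opening<br>
an editor:<br>
qconf -ssconf | grep schedd_job_info<br>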
<div class="im HOEnZb"><br>
Rayson<br>
<br>
==================================================<br>
Open Grid Scheduler - The Official Open Source Grid Engine<br>
<a href="http://gridscheduler.sourceforge.net/" target="_blank">http://gridscheduler.sourceforge.net/</a><br>
<br>
<br>
</div><div class="HOEnZb"><div class="h5">On Wed, Jul 24, 2013 at 6:05 AM, Jacob Barhak <<a href="mailto:jacob.barhak@gmail.com">jacob.barhak@gmail.com</a>> wrote:<br>
> Thanks Again Rayson,<br>
><br>
> Yet even with your generous help I am still stuck. Perhaps you can look at<br>
> what I am doing and correct me.<br>
><br>
> I tried running the commands you suggested to reconfigure the scheduler and<br>
> it seems the system hangs on me.<br>
><br>
> Here is what I did:<br>
><br>
> 1. Created a 1 node cluster.<br>
> 2. Copied the configuration file to my machine using:<br>
> starcluster get mycluster /opt/sge6/default/common/sched_configuration .<br>
> 3. Modified line 12 in sched_configuration to read<br>
> schedd_job_info TRUE<br>
> 4. Copied the modified sched_configuration file to the default local<br>
> directory on the cluster using:<br>
> starcluster put mycluster sched_configuration .<br>
> 5. run the configuration command you suggested:<br>
> starcluster sshmaster mycluster "qconf -msconf sched_configuration"<br>
><br>
> The system hangs in the last stage and does not return unless I press<br>
> control break - even control C does not work. I waited a few minutes then<br>
> terminated the cluster. I double checked this behavior using the -u root<br>
> argument when running the commands to ensure root privileges.<br>
><br>
> I am using a Windows 7 machine to issue those commands. I use the PythonXY<br>
> distribution and installed starcluster using easy_install. I am providing<br>
> this information to see if there is anything wrong with my system<br>
> compatibility-wise.<br>
><br>
> I am attaching the configuration file and the transcript.<br>
><br>
> I am getting funny characters and it seems the qconf command I issued does<br>
> not recognize the change in schedd_job_info.<br>
><br>
> Is there any other way I can find out what the problem is? If you recall, I<br>
> am trying to figure out why jobs in the queue are not dispatched to run and<br>
> keep waiting forever. This happens after a few hundred jobs have been sent.<br>
><br>
> I hope you could find time to look at this once more, or perhaps someone<br>
> else can help.<br>
><br>
><br>
> Jacob<br>
><br>
><br>
><br>
><br>
> On Tue, Jul 23, 2013 at 6:44 PM, Rayson Ho <<a href="mailto:raysonlogin@gmail.com">raysonlogin@gmail.com</a>> wrote:<br>
>><br>
>> On Tue, Jul 23, 2013 at 2:08 AM, Jacob Barhak <<a href="mailto:jacob.barhak@gmail.com">jacob.barhak@gmail.com</a>><br>
>> wrote:<br>
>> > The second issue, however, is reproducible. I just tried again a 20 node<br>
>> > cluster.<br>
>><br>
>> Yes, it will always be reproducible - by default you can only start 20<br>
>> on-demand instances in a region. Again, you need to fill out the "Request to<br>
>> Increase Amazon EC2 Instance Limit" form if you want more than 20:<br>
>><br>
>> <a href="https://aws.amazon.com/contact-us/ec2-request/" target="_blank">https://aws.amazon.com/contact-us/ec2-request/</a><br>
>><br>
>><br>
>> > This time I posted 2497 jobs to the queue - each about 1 minute long.<br>
>> > The system stopped sending jobs from the queue at about the halfway<br>
>> > point. There were 1310 jobs in the queue when the system stopped<br>
>> > sending more jobs.<br>
>> > When running "qstat -j" the system provided the following answer:<br>
>> ><br>
>> > scheduling info: (Collecting of scheduler job information is<br>
>> > turned off)<br>
>><br>
>> That's my fault -- I forgot to point out that the scheduler info is<br>
>> off by default; you need to run "qconf -msconf" and change the<br>
>> parameter "schedd_job_info" to true. See:<br>
>><br>
>> <a href="http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html" target="_blank">http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html</a><br>
>><br>
>> Rayson<br>
>><br>
>> ==================================================<br>
>> Open Grid Scheduler - The Official Open Source Grid Engine<br>
>> <a href="http://gridscheduler.sourceforge.net/" target="_blank">http://gridscheduler.sourceforge.net/</a><br>
>><br>
>><br>
>><br>
>> ><br>
>> > I am not familiar with the error messages, yet it seems I need to enable<br>
>> > something that is turned off. If there is a quick, obvious solution for<br>
>> > this, please let me know what to do; otherwise, are there any other<br>
>> > diagnostic tools I can use?<br>
>> ><br>
>> > Again, thanks for the quick reply and I hope this is an easy fix.<br>
>> ><br>
>> > Jacob<br>
>> ><br>
>> ><br>
>> > On Mon, Jul 22, 2013 at 2:51 PM, Rayson Ho <<a href="mailto:raysonlogin@gmail.com">raysonlogin@gmail.com</a>><br>
>> > wrote:<br>
>> >><br>
>> >> On Sun, Jul 21, 2013 at 1:40 AM, Jacob Barhak <<a href="mailto:jacob.barhak@gmail.com">jacob.barhak@gmail.com</a>><br>
>> >> wrote:<br>
>> >> > 1. Sometimes starcluster is unable to properly connect the instances<br>
>> >> > on the start command and cannot mount /home. It happened once when I<br>
>> >> > asked for 5 m1.small machines, and when I terminated this cluster and<br>
>> >> > started again things went fine. Is this intermittent due to cloud<br>
>> >> > traffic or is this a bug? Is there a way for me to check why?<br>
>> >><br>
>> >> It can be a problem with the actual hardware - can you ssh into the node<br>
>> >> and mount /home by hand next time you encounter this issue, and see if<br>
>> >> you can reproduce it interactively?<br>
>> >><br>
>> >><br>
>> >> > 2. After launching 20 c1.xlarge machines and running about 2500 jobs,<br>
>> >> > each about 5 minutes long, I encountered a problem after an hour or<br>
>> >> > so. It seems that SGE stopped sending jobs from the queue to the<br>
>> >> > instances. No error was found and the queue showed about 850 pending<br>
>> >> > jobs. This did not change for a while and I could not find any failure<br>
>> >> > with qstat or qhost. No jobs were running on any nodes and I waited a<br>
>> >> > while for these to start without success. I tried the same thing again<br>
>> >> > after a few hours and it seems that the cluster stops sending jobs<br>
>> >> > from the queue after about 1600 jobs have been submitted. This does<br>
>> >> > not happen when SGE is installed on a single Ubuntu machine I have at<br>
>> >> > home. I am trying to figure out what is wrong. Did you impose some<br>
>> >> > limit on the number of jobs? Can this be fixed? I really need to<br>
>> >> > submit many jobs - tens of thousands of jobs in my full runs. This one<br>
>> >> > was relatively small and still did not pass.<br>
>> >><br>
>> >> Looks like SGE thinks that the nodes are down or in alarm or error<br>
>> >> state? To find out why SGE thinks there are no nodes available, run:<br>
>> >> qstat -j<br>
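>> >><br>
>> >> You can also check the queue instance states directly with "qstat -f";<br>
>> >> for example, "qstat -f -explain E" should describe any queues that are<br>
>> >> in an error state.<br>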
>> >><br>
>> >><br>
>> >><br>
>> >> > 3. I tried to start StarCluster with 50 nodes and got an error about<br>
>> >> > exceeding a quota of 20. Is it your quota or Amazon's quota? Are there<br>
>> >> > any other restrictions I should be aware of at the beginning? Also,<br>
>> >> > after the system is unable to start the cluster it thinks it is still<br>
>> >> > running, and a terminate command is needed before another start can be<br>
>> >> > issued - even though nothing got started.<br>
>> >><br>
>> >> It's Amazon's quota. 50 is considered small by AWS standards, and they<br>
>> >> can give it to you almost right away... You need to request AWS to<br>
>> >> give you a higher limit:<br>
>> >> <a href="https://aws.amazon.com/contact-us/ec2-request/" target="_blank">https://aws.amazon.com/contact-us/ec2-request/</a><br>
>> >><br>
>> >> Note that last year we requested 10,000 nodes and the whole<br>
>> >> process took less than 1 day:<br>
>> >><br>
>> >><br>
>> >><br>
>> >> <a href="http://blogs.scalablelogic.com/2012/11/running-10000-node-grid-engine-cluster.html" target="_blank">http://blogs.scalablelogic.com/2012/11/running-10000-node-grid-engine-cluster.html</a><br>
>> >><br>
>> >> Rayson<br>
>> >><br>
>> >> ==================================================<br>
>> >> Open Grid Scheduler - The Official Open Source Grid Engine<br>
>> >> <a href="http://gridscheduler.sourceforge.net/" target="_blank">http://gridscheduler.sourceforge.net/</a><br>
>> >><br>
>> >><br>
>> >> ><br>
>> >> > This all happened on us-west-2 with the help of StarCluster 0.93.3<br>
>> >> > and the Anaconda AMI - ami-a4d64194<br>
>> >> ><br>
>> >> > Here is some more information on what I am doing to help you answer<br>
>> >> > the<br>
>> >> > above.<br>
>> >> ><br>
>> >> > I am running Monte Carlo simulations to simulate chronic disease<br>
>> >> > progression. I am using MIST to run over the cloud:<br>
>> >> ><br>
>> >> ><br>
>> >> ><br>
>> >> > <a href="https://github.com/scipy/scipy2013_talks/blob/master/talks/jacob_barhak/readme.md" target="_blank">https://github.com/scipy/scipy2013_talks/blob/master/talks/jacob_barhak/readme.md</a><br>
>> >> ><br>
>> >> > The Reference Model is what I am running using MIST:<br>
>> >> ><br>
>> >> > <a href="http://youtu.be/7qxPSgINaD8" target="_blank">http://youtu.be/7qxPSgINaD8</a><br>
>> >> ><br>
>> >> > I am launching many simulations in parallel and it takes me days on a<br>
>> >> > single 8-core machine. The cloud allows me to cut this time down to<br>
>> >> > hours. This is why StarCluster is so useful. In the past I did this on<br>
>> >> > other clusters, yet the cloud is still new to me.<br>
>> >> ><br>
>> >> > I would appreciate any recommendations you can give to improve the<br>
>> >> > behaviors I am experiencing.<br>
>><br>
>> <a href="http://commons.wikimedia.org/wiki/User:Raysonho" target="_blank">http://commons.wikimedia.org/wiki/User:Raysonho</a><br>
><br>
><br>
</div></div></blockquote></div><br></div>