<div dir="ltr"><div>Thanks again, Rayson,</div><div> </div><div>Yet even with your generous help, I am still stuck. Perhaps you can look at what I am doing and correct me.</div><div> </div><div>I tried running the commands you suggested to reconfigure the scheduler, and the system seems to hang on me.</div>

<div> </div><div>Here is what I did: </div><div> </div><div>1. Created a 1-node cluster.</div><div>2. Copied the configuration file to my machine using: </div><div> starcluster get mycluster /opt/sge6/default/common/sched_configuration .</div>

<div>3. Modified line 12 in sched_configuration to read</div><div>schedd_job_info TRUE</div><div>4. Copied the modified sched_configuration file to the default local directory on the cluster using:</div><div>starcluster put mycluster sched_configuration .</div>

<div>5. Ran the configuration command you suggested:</div><div>starcluster sshmaster mycluster &quot;qconf -msconf sched_configuration&quot; </div><div> </div><div>The system hangs at this last step and does not return unless I press Ctrl+Break; even Ctrl+C does not work. I waited a few minutes and then terminated the cluster. I double-checked this behavior using the -u root argument when running the commands to ensure root privileges. </div>

<div> </div><div>I am using a Windows 7 machine to issue these commands. I use the PythonXY distribution and installed StarCluster using easy_install. I am providing this information in case there is a compatibility problem with my system.</div>

<div> </div><div>I am attaching the configuration file and the transcript.</div><div> </div><div>I am getting funny characters and it seems the qconf command I issued does not recognize the change in schedd_job_info.</div>

<div> </div><div>Is there any other way I can find out what the problem is? If you recall, I am trying to figure out why jobs in the queue are not dispatched to run and keep waiting forever. This happens after I submit a few hundred jobs.</div>

<div> </div><div>I hope you could find time to look at this once more, or perhaps someone else can help.</div><div> </div><div> </div><div>                   Jacob</div><div> </div><div> </div></div><div class="gmail_extra">

<br><br><div class="gmail_quote">On Tue, Jul 23, 2013 at 6:44 PM, Rayson Ho <span dir="ltr">&lt;<a href="mailto:raysonlogin@gmail.com" target="_blank">raysonlogin@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im">On Tue, Jul 23, 2013 at 2:08 AM, Jacob Barhak &lt;<a href="mailto:jacob.barhak@gmail.com">jacob.barhak@gmail.com</a>&gt; wrote:<br>

&gt; The second issue, however, is reproducible. I just tried again a 20 node<br>

&gt; cluster.<br>

<br>

</div>Yes, it will always be reproducible: you can only start 20 on-demand<br>

instances in a region. Again, you need to fill out the &quot;Request to<br>

Increase Amazon EC2 Instance Limit&quot; form if you want more than 20:<br>

<br>

<a href="https://aws.amazon.com/contact-us/ec2-request/" target="_blank">https://aws.amazon.com/contact-us/ec2-request/</a><br>

<div class="im"><br>

<br>

&gt; This time I posted 2497 jobs to the queue - each about 1 minute long. The<br>

&gt; system stopped dispatching jobs from the queue at about the halfway point. There were 1310 jobs<br>

&gt; in the queue when the system stopped sending more jobs.<br>

&gt; When running &quot;qstat -j&quot; the system provided the following answer:<br>

&gt;<br>

&gt; scheduling info:            (Collecting of scheduler job information is<br>

&gt; turned off)<br>

<br>

</div>That&#39;s my fault -- I forgot to point out that the scheduler info is<br>

off by default, and you need to run &quot;qconf -msconf&quot;, and change the<br>

parameter &quot;schedd_job_info&quot; to true. See:<br>

<br>

<a href="http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html" target="_blank">http://gridscheduler.sourceforge.net/htmlman/htmlman5/sched_conf.html</a><br>

<div><div class="h5"><br>

Rayson<br>

<br>

==================================================<br>

Open Grid Scheduler - The Official Open Source Grid Engine<br>

<a href="http://gridscheduler.sourceforge.net/" target="_blank">http://gridscheduler.sourceforge.net/</a><br>

<br>

<br>

<br>

&gt;<br>

&gt; I am not familiar with the error messages, yet it seems I need to enable<br>

&gt; something that is turned off. If there is a quick, obvious solution for this<br>

&gt; please let me know what to do, otherwise are there any other diagnostics<br>

&gt; tools I can use?<br>

&gt;<br>

&gt; Again, thanks for the quick reply and I hope this is an easy fix.<br>

&gt;<br>

&gt;            Jacob<br>

&gt;<br>

&gt;<br>

&gt; On Mon, Jul 22, 2013 at 2:51 PM, Rayson Ho &lt;<a href="mailto:raysonlogin@gmail.com">raysonlogin@gmail.com</a>&gt; wrote:<br>

&gt;&gt;<br>

&gt;&gt; On Sun, Jul 21, 2013 at 1:40 AM, Jacob Barhak &lt;<a href="mailto:jacob.barhak@gmail.com">jacob.barhak@gmail.com</a>&gt;<br>

&gt;&gt; wrote:<br>

&gt;&gt; &gt; 1. Sometimes StarCluster is unable to properly connect the instances on<br>

&gt;&gt; &gt; the<br>

&gt;&gt; &gt; start command and cannot mount /home. It happened once when I asked for<br>

&gt;&gt; &gt; 5<br>

&gt;&gt; &gt; m1.small machines and when I terminated this cluster and started again<br>

&gt;&gt; &gt; things went fine. Is this intermittent due to cloud traffic or is this a<br>

&gt;&gt; &gt; bug? Is there a way for me to check why?<br>

&gt;&gt;<br>

&gt;&gt; Can be a problem with the actual hardware - can you ssh into the node<br>

&gt;&gt; and mount /home by hand next time you encounter this issue<br>

&gt;&gt; and see if you can reproduce it when run interactively?<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt; &gt; 2.  After launching 20 c1.xlarge machines and running about 2500 jobs,<br>

&gt;&gt; &gt; each<br>

&gt;&gt; &gt; about 5 minutes long, I encountered a problem after an hour or so. It<br>

&gt;&gt; &gt; seems<br>

&gt;&gt; &gt; that SGE stopped sending jobs from the queue to the instances. No<br>

&gt;&gt; &gt; error<br>

&gt;&gt; &gt; was found and the queue showed about 850 pending jobs. This did not<br>

&gt;&gt; &gt; change<br>

&gt;&gt; &gt; for a while and I could not find any failure with qstat or qhost. No<br>

&gt;&gt; &gt; jobs<br>

&gt;&gt; &gt; were running on any nodes and I waited a while for these to start<br>

&gt;&gt; &gt; without<br>

&gt;&gt; &gt; success. I tried the same thing again after a few hours and it seems<br>

&gt;&gt; &gt; that<br>

&gt;&gt; &gt; the cluster stops sending jobs from the queue after about 1600 jobs have<br>

&gt;&gt; &gt; been submitted. This does not happen when SGE is installed on a single<br>

&gt;&gt; &gt; Ubuntu machine I have at home. I am trying to figure out what is wrong.<br>

&gt;&gt; &gt; Did<br>

&gt;&gt; &gt; you impose some limit on the number of jobs? Can this be fixed? I really<br>

&gt;&gt; &gt; need to submit many jobs - tens of thousands of jobs in my full runs. This<br>

&gt;&gt; &gt; one<br>

&gt;&gt; &gt; was relatively small and still did not pass.<br>

&gt;&gt;<br>

&gt;&gt; Looks like SGE thinks that the nodes are down or in alarm or error<br>

&gt;&gt; state? To find out why SGE thinks there are no nodes available, run:<br>

&gt;&gt; qstat -j<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt; &gt; 3. I tried to start Star Cluster with 50 nodes and got an error about<br>

&gt;&gt; &gt; exceeding a quota of 20. Is it your quota or Amazon quota? Are there any<br>

&gt;&gt; &gt; other restrictions I should be aware of at the beginning? Also after the<br>

&gt;&gt; &gt; system is unable to start the cluster it thinks it is still running and a<br>

&gt;&gt; &gt; terminate command is needed before another start can be issued - even<br>

&gt;&gt; &gt; though<br>

&gt;&gt; &gt; nothing got started.<br>

&gt;&gt;<br>

&gt;&gt; It&#39;s Amazon&#39;s quota. 50 is considered small by AWS standards, and they<br>

&gt;&gt; can give it to you almost right away... You need to request AWS to<br>

&gt;&gt; give you a higher limit:<br>

&gt;&gt; <a href="https://aws.amazon.com/contact-us/ec2-request/" target="_blank">https://aws.amazon.com/contact-us/ec2-request/</a><br>

&gt;&gt;<br>

&gt;&gt; Note that last year we requested 10,000 nodes and the whole<br>

&gt;&gt; process took less than 1 day:<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt; <a href="http://blogs.scalablelogic.com/2012/11/running-10000-node-grid-engine-cluster.html" target="_blank">http://blogs.scalablelogic.com/2012/11/running-10000-node-grid-engine-cluster.html</a><br>

&gt;&gt;<br>

&gt;&gt; Rayson<br>

&gt;&gt;<br>

&gt;&gt; ==================================================<br>

&gt;&gt; Open Grid Scheduler - The Official Open Source Grid Engine<br>

&gt;&gt; <a href="http://gridscheduler.sourceforge.net/" target="_blank">http://gridscheduler.sourceforge.net/</a><br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; This all happened on us-west-2 with the help of star cluster 0.93.3 and<br>

&gt;&gt; &gt; the<br>

&gt;&gt; &gt; Anaconda AMI - ami-a4d64194<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; Here is some more information on what I am doing to help you answer the<br>

&gt;&gt; &gt; above.<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; I am running Monte Carlo simulations to simulate chronic disease<br>

&gt;&gt; &gt; progression. I am using MIST to run over the cloud:<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; <a href="https://github.com/scipy/scipy2013_talks/blob/master/talks/jacob_barhak/readme.md" target="_blank">https://github.com/scipy/scipy2013_talks/blob/master/talks/jacob_barhak/readme.md</a><br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; The Reference Model is what I am running using MIST:<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; <a href="http://youtu.be/7qxPSgINaD8" target="_blank">http://youtu.be/7qxPSgINaD8</a><br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; I am launching many simulations in parallel and it takes me days on a<br>

&gt;&gt; &gt; single<br>

&gt;&gt; &gt; 8 core machine. The cloud allows me to cut down this time to hours. This<br>

&gt;&gt; &gt; is<br>

&gt;&gt; &gt; why star cluster is so useful. In the past I did this over other<br>

&gt;&gt; &gt; clusters<br>

&gt;&gt; &gt; yet the cloud is still new to me.<br>

&gt;&gt; &gt;<br>

&gt;&gt; &gt; I will appreciate any recommendations I can get from you to improve the<br>

&gt;&gt; &gt; behaviors I am experiencing.<br>

<br>

</div></div><a href="http://commons.wikimedia.org/wiki/User:Raysonho" target="_blank">http://commons.wikimedia.org/wiki/User:Raysonho</a><br>

</blockquote></div><br></div>