<div class="gmail_quote">On Tue, Oct 26, 2010 at 4:25 PM, Damian Eads <span dir="ltr"><<a href="mailto:eads@soe.ucsc.edu">eads@soe.ucsc.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Hi Alexey,<br>
<br>
Thanks for your questions! :)<br></blockquote><div>Hi Damian,</div><div><br></div><div>Thank you for sharing your experience and practice.</div><div>The thing is that "cloud computing" can shift our usual perception of software and IT services.</div>
<div>For example, some "cloud" programmers have already started to think of AMIs as "shared libraries".</div><div>Revising such established, commonplace viewpoints could change software in a revolutionary way</div>
<div>(much as matrix algebra, in its day, helped to discover new laws).</div><div>So carrying over existing software usage practice without recognizing that the rules of the game have shifted significantly </div>
<div>could be misleading. That is primarily where my question came from.</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<br>
I guess I'll break my silence and mention that I'm an avid user of<br>
combining both Sun Grid Engine and MPI. I sometimes have several<br>
hundred MPI jobs I need to run where the returns diminish if the<br>
number of cores per MPI job is too high. Thus, I limit the number of<br>
cores/job so several MPI jobs run at once. As for creating "as many<br>
clusters as he wants", I've found it is often easier to manage a<br>
single cluster for a problem mainly because when I manage 2-4<br>
clusters, I often make mistakes in replicating volumes where my data<br>
and results are stored. At the very end of the computation, I run<br>
scripts which combine result files generated from all of the jobs. If<br>
they're on different volumes, I need to rsync each of them<br>
individually onto a common volume. By having all of my data on a<br>
single volume, I don't have to think about it. Only when I'm running<br>
a second set of jobs for a completely different project with different<br>
code and data sets will I create a second cluster.<br></blockquote><div>Yes, following your example, I have found my own reasons to use a "queuing system".</div><div>If I needed a CORBA scheduler to couple my MPI functionality, </div>
<div>I would have to run many MPI programs at the same time (preferably on the same cluster, even from a performance standpoint).</div><div>So the best way to do this properly (I mean with automatic load balancing) is a "queuing system".</div>
<div>Therefore, even when a user runs a single but complex task (coupled MPI programs, for example), a "queuing system" would be really useful; a rough sketch of this pattern follows below.</div><div><br></div><div>Thanks everybody, best regards,</div><div>Alexey</div>
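<div><br></div><div>A minimal sketch of that pattern, assuming SGE's qsub is available on the cluster master: each MPI case is submitted as its own job with a fixed slot count, so the scheduler can run several MPI jobs side by side and load-balance them across the nodes. The parallel environment name "orte", the CORES_PER_JOB value, and the run_case.sh wrapper (which would call mpirun on the slots SGE grants) are illustrative assumptions, not details from the thread itself.</div>
<pre>
#!/usr/bin/env python
# Hypothetical sketch: submit many independent MPI jobs to SGE, each
# capped at CORES_PER_JOB slots so several jobs run concurrently.
# Assumptions: a parallel environment named "orte" exists (list yours
# with `qconf -spl`), and run_case.sh is a wrapper script that calls
# mpirun on the slots SGE grants the job.
import subprocess

CORES_PER_JOB = 8   # beyond this point the per-job speedup diminishes

for case_id in range(300):
    subprocess.check_call([
        "qsub", "-cwd",
        "-pe", "orte", str(CORES_PER_JOB),
        "-N", "mpi_case_%d" % case_id,
        "run_case.sh", str(case_id),
    ])
</pre>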
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<br>
Cheers,<br>
<font color="#888888"><br>
Damian<br>
</font><div><div></div><div class="h5"><br>
On Sat, Oct 23, 2010 at 7:19 PM, Alexey PETROV<br>
<<a href="mailto:alexey.petrov.nnov@gmail.com">alexey.petrov.nnov@gmail.com</a>> wrote:<br>
> Dear Justin,<br>
><br>
> Thank you very much for your clear and full answer.<br>
> Yes, I completely agree with you that for loosely coupled tasks, and<br>
> especially when running them in a routine everyday mode, the "queuing system" is an<br>
> excellent solution. My initial harshness on this question was influenced by the<br>
> background I come from, namely MPI. I thought that once a user<br>
> has "on demand" cluster computing nodes and MPI available,<br>
> it eliminates the "queuing system" as a class from "cloud computing",<br>
> because MPI comes with its own task dispatcher and the user can directly acquire<br>
> whatever cluster configuration he needs for his task, however powerful, without<br>
> waiting for the proper resources to become available. Now I see that there<br>
> are a lot of other applications that are better run on a cluster through<br>
> a pre-configured "queuing system", not by hand on a heap of nodes. Thank<br>
> you.<br>
> And could I just confirm, once again: "If a single user needs to run an MPI<br>
> task only from time to time (not on a routine everyday basis), would he gain<br>
> some additional benefit from a "queuing system" in the cloud, or is it better to<br>
> use MPI directly?"<br>
> Thank you in advance, sincerely yours,<br>
> Alexey<br>
> On Sat, Oct 23, 2010 at 6:37 PM, Justin Riley <<a href="mailto:jtriley@mit.edu">jtriley@mit.edu</a>> wrote:<br>
>><br>
>> Alexey,<br>
>><br>
>> The Sun Grid Engine queueing system is useful when you have a lot of tasks<br>
>> to execute and not just one at a time interactively. For example, you might<br>
>> need to convert 300 videos from one format to another. You could either<br>
>><br>
>> 1. Write a script that gets the list of nodes from /etc/hosts and then<br>
>> loops over the jobs and the nodes, ssh'ing commands to be executed on each<br>
>> node. A big problem with this approach is that the task execution and<br>
>> management all depends on this script executing successfully all the way<br>
>> through. What happens if the script fails? You would then lose all task<br>
>> accounting information. Also, what if you suddenly discover you need to do<br>
>> another batch of 300 videos while the previous batch is still processing?<br>
>> Are you going to re-execute your script and overload the cluster? This would<br>
>> definitely slow down all of your jobs. How will you write your script to<br>
>> avoid overloading the cluster in this situation without losing the fact that<br>
>> you want to submit new jobs *now*?<br>
>><br>
>> OR<br>
>><br>
>> 2. Skip needing to get the list of nodes and ssh'ing commands to them and<br>
>> instead just write a loop that sends 300 jobs to the queuing system using<br>
>> "qsub". The queuing system will then do the work to find an available node,<br>
>> execute the job, and store its accounting information (status, start time,<br>
>> end time, which node executed the job, etc.). The queuing system will also<br>
>> handle load balancing your tasks across the cluster so that any one node<br>
>> doesn't get significantly overloaded compared to the other nodes in the<br>
>> cluster. If you suddenly discover you need 300 more videos processed you<br>
>> could simply "qsub" 300 more jobs. These jobs will be 'queued-up' and<br>
>> executed when a node becomes available. This approach reduces your concerns<br>
>> to just executing a task on a node rather than managing multiple jobs and<br>
>> nodes.<br>
>><br>
>> Also it is true that you can create "as many clusters as you want" with<br>
>> cloud computing. However, in many cases it could get *very* expensive<br>
>> launching multiple clusters for every single task or set of tasks. Whether<br>
>> it's more cost effective to launch multiple clusters or just queue a ton of<br>
>> jobs on a single cluster depends highly on the sort of tasks you're<br>
>> executing.<br>
>><br>
>> Of course, just because a queueing system is installed doesn't mean you<br>
>> *have* to use it at all. You can of course run things however you want on<br>
>> the cluster. Hopefully I've made it clear that there are significant<br>
>> advantages to using a queuing system to execute jobs on a cluster rather<br>
>> than a home-brewed script.<br>
>><br>
>> Hope that helps...<br>
>><br>
>> ~Justin<br>
>><br>
>> On 10/22/10 5:02 PM, Alexey PETROV wrote:<br>
>><br>
>> Yes, StarCluster is great.<br>
>> But why do we need to use any "queuing system" at all?<br>
>> Surely, in cloud computing, a user can create as many clusters as he wants,<br>
>> each for his particular tasks.<br>
>> So, why?!<br>
>><br>
</div></div></blockquote></div><br>
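<div><br></div><div>For reference, a minimal sketch of the qsub-based approach Justin describes in the quoted thread above, as opposed to hand-rolling an ssh loop over the nodes in /etc/hosts: one job is submitted per video, and SGE picks a node, balances the load, and keeps the accounting information. The /data/videos path and the ffmpeg command line are illustrative assumptions; -b y (run a binary directly, without a job script) and -cwd (keep output next to the submission directory) are standard SGE qsub options.</div>
<pre>
#!/usr/bin/env python
# Minimal sketch of the "just qsub a loop of jobs" approach: one SGE job
# per video, letting the queuing system pick the node and keep accounting.
# Assumptions: the videos sit on a shared volume at /data/videos and
# ffmpeg is installed on every node -- adjust both for your cluster.
import glob
import subprocess

for video in glob.glob("/data/videos/*.avi"):
    output = video[:-4] + ".mp4"
    subprocess.check_call([
        "qsub", "-b", "y",   # run ffmpeg directly as a binary command
        "-cwd",              # keep stdout/stderr in the submission directory
        "-N", "convert",
        "ffmpeg", "-i", video, output,
    ])
</pre>
<div>Submitting another batch of 300 videos later is then just a matter of re-running the loop; the new jobs queue up behind the running ones instead of overloading the cluster.</div>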