[StarCluster] One simple question

Sat Oct 23 12:37:28 EDT 2010

Alexey,

The Sun Grid Engine queuing system is useful when you have a lot of 
tasks to execute rather than just one at a time interactively. For 
example, you might need to convert 300 videos from one format to 
another. You could either:

1. Write a script that gets the list of nodes from /etc/hosts and then 
loops over the jobs and the nodes, ssh'ing commands to be executed on 
each node. A big problem with this approach is that task execution 
and management all depend on this script executing successfully all the 
way through. What happens if the script fails? You would then lose all 
task accounting information. Also, what if you suddenly discover you 
need to do another batch of 300 videos while the previous batch is still 
processing? Are you going to re-execute your script and overload the 
cluster? This would definitely slow down all of your jobs. How will you 
write your script to avoid overloading the cluster in this situation 
without losing the fact that you want to submit new jobs *now*?

OR

2. Skip getting the list of nodes and ssh'ing commands to them, 
and instead just write a loop that sends 300 jobs to the queuing system 
using "qsub". The queuing system will then do the work to find an 
available node, execute the job, and store its accounting information 
(status, start time, end time, which node executed the job, etc.). The 
queuing system will also handle load balancing your tasks across the 
cluster so that any one node doesn't get significantly overloaded 
compared to the other nodes in the cluster. If you suddenly discover you 
need 300 more videos processed you could simply "qsub" 300 more jobs. 
These jobs will be 'queued-up' and executed when a node becomes 
available. This approach reduces your concerns to just executing a task 
on a node rather than managing multiple jobs and nodes.
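
As a rough illustration of option 2, here is a minimal Python sketch of 
such a submission loop. It rests on several assumptions that are not 
part of the setup described above: ffmpeg is used as the converter, the 
file names (videos/clip_001.avi and so on) and the "converted" output 
directory are hypothetical, and the helper names build_qsub_command and 
submit_all are made up for this example. The qsub options shown 
(-b y, -cwd, -N) are standard Sun Grid Engine flags. The sketch defaults 
to a dry run that only prints the commands, so nothing is submitted 
unless you ask for it.

```python
import shlex
import subprocess
from pathlib import Path


def build_qsub_command(video, out_dir, index):
    """Return the qsub argument list that converts one video (hypothetical helper)."""
    src = Path(video)
    dst = Path(out_dir) / (src.stem + ".mp4")
    return [
        "qsub",
        "-b", "y",                      # the job is a command, not a job script
        "-cwd",                         # run the job from the current directory
        "-N", f"convert_{index:04d}",   # job name shown in qstat
        "ffmpeg", "-i", str(src), str(dst),  # assumed conversion command
    ]


def submit_all(videos, out_dir, dry_run=True):
    """Queue one job per video; with dry_run=True only print the commands."""
    commands = [
        build_qsub_command(video, out_dir, i)
        for i, video in enumerate(videos, start=1)
    ]
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            # qsub returns as soon as the job is queued; SGE picks the node.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical input list: 300 clips named clip_001.avi .. clip_300.avi
    videos = [f"videos/clip_{n:03d}.avi" for n in range(1, 301)]
    submit_all(videos, "converted", dry_run=True)
```

Note that the loop never mentions a node name. Submitting a second 
batch is just another call to submit_all with dry_run=False, and the 
scheduler decides when and where each job runs.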

Also, it is true that you can create "as many clusters as you want" with 
cloud computing. However, in many cases it could get *very* expensive 
launching multiple clusters for every single task or set of tasks. 
Whether it's more cost-effective to launch multiple clusters or just 
queue a ton of jobs on a single cluster depends highly on the sort of 
tasks you're executing.

Of course, just because a queuing system is installed doesn't mean you 
*have* to use it at all. You can run things however you want 
on the cluster. Hopefully I've made it clear that there are significant 
advantages to using a queuing system to execute jobs on a cluster rather 
than a home-brewed script.

Hope that helps...

~Justin

On 10/22/10 5:02 PM, Alexey PETROV wrote:
> Ye, StartCluster is a great.
> But, what for do we need to use whatever "/queuing system"./
> Surely, in cloud computing, user can create as many clusters as he 
> wants, each for his particular tasks.
> So, why?!
