[StarCluster] Starcluster behavior when nodes fail, when master fails

Rayson Ho raysonlogin at gmail.com
Fri Feb 7 14:15:43 EST 2014


On Fri, Feb 7, 2014 at 2:01 PM, Dmitry Serenbrennikov
<dmitry at adchemy.com> wrote:
> - If a node fails, is SGE/Starcluster able to detect this properly?
>
> - What happens to the jobs running on the failed node? Are they retried? Can
> they be configured to be retried? Does this work reliably?

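It depends on the "reschedule_unknown" and "max_unheard" parameters of
the SGE master configuration. "max_unheard" sets how long the qmaster
waits without hearing from a node's execd before it marks that host as
unknown, and "reschedule_unknown" controls whether, and after how long,
jobs on such a host are rescheduled to other hosts: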

http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_conf.html
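
As a rough illustration of checking these two settings, the sketch below
parses text in the "name value" layout that "qconf -sconf" prints and
converts the hh:mm:ss values to seconds. The sample text and the helper
names (parse_hms, failure_settings) are made up for illustration; they
are not captured from a real cluster.

```python
def parse_hms(value):
    """Convert an SGE-style hh:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def failure_settings(conf_text):
    """Pick max_unheard and reschedule_unknown out of qconf -sconf style text."""
    wanted = {"max_unheard", "reschedule_unknown"}
    found = {}
    for line in conf_text.splitlines():
        # Each configuration line is "<name><whitespace><value>".
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[0] in wanted:
            found[fields[0]] = parse_hms(fields[1].strip())
    return found


# Illustrative sample only, not output from a real cluster.
SAMPLE_CONF = """\
#global:
max_unheard                  00:05:00
reschedule_unknown           00:00:00
"""

if __name__ == "__main__":
    print(failure_settings(SAMPLE_CONF))
```

On a live cluster the text would come from running "qconf -sconf" on the
master. A reschedule_unknown value of 00:00:00 means jobs on an
unreachable host are not rescheduled, and only rerunnable or
checkpointing jobs are candidates for rescheduling in the first place.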


> - What happens to SGE jobs if the master node dies?

They will continue to run to completion, but the status will not be
updated as the qmaster is the only host that can update the job
status.


> - Can the cluster be recovered if the master node is restarted? Is it a
> single point of failure?

Yes, it can be recovered, as long as the job status (spool directory) is
intact.


> - If yes, does SGE itself support more redundancy than what is available as
> configured in Starcluster? Some diagrams in this presentation seem to imply
> so http://beowulf.rutgers.edu/info-user/pdf/ge_presentation.pdf

Each SGE cluster can have one or more shadow masters, so SGE can fail
over to another instance:

http://gridscheduler.sourceforge.net/htmlman/htmlman8/sge_shadowd.html
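
One way to see which host currently holds the qmaster role, for example
after a shadow master has taken over, is to read the act_qmaster file
under $SGE_ROOT/$SGE_CELL/common. The sketch below assumes that layout
and that the file holds a single hostname. The function name and the
/opt/sge6 fallback path are illustrative assumptions, not something
stated in this thread.

```python
import os


def current_qmaster(sge_root, sge_cell="default"):
    """Return the hostname recorded in act_qmaster, or None if it is missing."""
    path = os.path.join(sge_root, sge_cell, "common", "act_qmaster")
    try:
        with open(path) as handle:
            # The file is expected to hold one hostname on a single line.
            return handle.read().strip() or None
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    # SGE_ROOT is normally exported by the SGE settings script; the
    # fallback path here is only a guess for illustration.
    root = os.environ.get("SGE_ROOT", "/opt/sge6")
    cell = os.environ.get("SGE_CELL", "default")
    print(current_qmaster(root, cell))
```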


> - If one uses Starcluster without the SGE, what is the behavior when master
> node dies? Can the cluster be recovered from this?

Without SGE, a StarCluster cluster is just a group of instances. (You
can of course use other schedulers like Condor.)

Assuming that you just want a group of instances, then failure of one
does not affect other healthy instances.


> - What if we limit the use of NFS and instead use a separate system for data
> storage which provides its own high availability. Does this improve ability
> of the starcluster to recover from failure of nodes and the master?

I believe you will need to update the EC2 "master" tag (log in to the
AWS web management console; you will see the "master" tag there if you
have a StarCluster cluster running), or else some of the StarCluster
commands (like starcluster sshmaster) won't work.

Also see "Reducing and Eliminating NFS usage by Grid Engine":
http://gridscheduler.sourceforge.net/howto/nfsreduce.html

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html


>
>
> Thanks very much for any information or anecdotes along these lines!
>
> Best regards!
>
> -Dmitry

