[StarCluster] DRMAA jobs failing when load balancer enabled and jobs longer than 60 mins (Lilley, John F.)

François-Michel L'Heureux fmlheureux at datacratic.com
Thu Mar 6 14:13:48 EST 2014


Hi John

I assume DRMAA is a replacement for OGS/SGE?

About DRMAA bailing out, I don't know the product, but your guess is likely
correct: it might crash when nodes go away. There is a somewhat similar
issue with OGS, where we need to clean it up when nodes go away. It doesn't
crash, though.

For your second issue, regarding the execution host, again, I had a similar
issue with OGS. The trick I used was to leave the master node as an
execution host but set its number of slots to 0. OGS is happy because
there is at least one exec host, and the load balancer runs just fine:
when only the master node is online, there are no slots, so it
immediately adds a node whenever jobs come in. I don't know if there is a
concept of slots in DRMAA, or if this version of the load balancer uses it,
but if so, I think you could reproduce my trick.
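
To make the slots trick concrete, here is a minimal Python sketch of how the
master's slot count could be set to 0 with qconf. It assumes the queue is
named all.q and the master's host alias is master (the usual StarCluster
defaults, but check your own cluster). The helper names zero_slots_command
and apply_zero_slots are made up for illustration and are not part of
StarCluster or OGS. By default it only prints the command instead of
running it.

```python
import subprocess


def zero_slots_command(host="master", queue="all.q"):
    """Build the qconf call that sets one queue instance's slots to 0.

    The host and queue defaults are assumptions; adjust them to your cluster.
    """
    return ["qconf", "-mattr", "queue", "slots", "0", f"{queue}@{host}"]


def apply_zero_slots(host="master", queue="all.q", dry_run=True):
    """Return the command line; run it only when dry_run is False."""
    cmd = zero_slots_command(host, queue)
    if not dry_run:
        # Needs to run on a host with OGS/SGE admin rights (e.g. the master).
        subprocess.run(cmd, check=True)
    return " ".join(cmd)


if __name__ == "__main__":
    # Dry run: show what would be executed without touching the scheduler.
    print(apply_zero_slots())
```

Running it with dry_run=False on the master should leave the master
registered as an exec host while preventing any job from being scheduled
on it.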

I hope this helps.

Mich