[StarCluster] DRMAA jobs failing when load balancer enabled and jobs longer than 60 mins (Lilley, John F.)

François-Michel L'Heureux fmlheureux at datacratic.com
Thu Mar 6 14:31:35 EST 2014


Hi!

From experience, I don't think it works for qrsh, though.
Justin also just tried it and told me it doesn't work.

2014-03-06 14:26 GMT-05:00 Rayson Ho <raysonlogin at gmail.com>:

> Hi Mich,
>
> Thanks for sharing the workaround.
>
> The behavior is due to a relatively undocumented feature of DRMAA /
> Grid Engine -- basically, DRMAA jobs in Grid Engine have "-w e" added to
> the job submission request. The -w flag takes the following arguments:
>
>           `e'  error - jobs with invalid requests will be rejected.
>
>           `w'  warning - only a warning will be displayed for invalid
> requests.
>
>           `n'  none - switches off validation; the default for qsub,
> qalter, qrsh, qsh and qlogin.
>
>           `p'  poke - does not submit the job but prints a validation
> report based on a cluster as is with all resource utilizations in
> place.
>
>           `v'  verify - does not submit the job but prints a
> validation report based on an empty cluster.
>
> http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html
>
> Thus with "-w e", if Grid Engine is not happy with the job at
> submission time (e.g., it thinks that it does not have enough nodes to
> run the job), then it will reject the job submission.
>
> The correct way is to override the DRMAA request with "-w n" or "-w w"
> if you are going to use load-balancing.
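>
> For example, with the Python drmaa bindings the override can be passed
> through the job template's native specification. A minimal sketch (the
> job script path here is hypothetical):
>
>     import drmaa
>
>     s = drmaa.Session()
>     s.initialize()
>     jt = s.createJobTemplate()
>     jt.remoteCommand = '/path/to/job.sh'  # hypothetical job script
>     # Relax the submission-time validation that the implicit "-w e"
>     # would otherwise enforce ("-w w" keeps a warning instead):
>     jt.nativeSpecification = '-w n'
>     job_id = s.runJob(jt)
>     print('Submitted job %s' % job_id)
>     s.deleteJobTemplate(jt)
>     s.exit()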
>
> Rayson
>
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>
>
> On Thu, Mar 6, 2014 at 2:13 PM, François-Michel L'Heureux
> <fmlheureux at datacratic.com> wrote:
> > Hi John
> >
> > I assume DRMAA is a replacement for OGS/SGE?
> >
> > About DRMAA bailing out, I don't know the product, but your guess is
> > likely correct: it might crash when nodes go away. There is a somewhat
> > similar issue with OGS, where we need to clean it up when nodes go
> > away. It doesn't crash, though.
> >
> > For your second issue, regarding execution hosts, again, I had a
> > similar issue with OGS. The trick I used is that I left the master
> > node as an execution host, but I set its number of slots to 0. Hence,
> > OGS is happy because there is at least one exec host, and the load
> > balancer runs just fine because when only the master node is online
> > there are no slots, so it immediately adds nodes whenever jobs come
> > in. I don't know if there is a concept of slots in DRMAA, or if this
> > version of the load balancer uses it, but if so, I think you could
> > reproduce my trick.
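> >
> > For reference, the OGS side of that trick is a one-liner. A sketch,
> > assuming the head node's hostname is "master" and the queue is all.q:
> >
> >     $ qconf -aattr queue slots '[master=0]' all.q
> >
> > After that, qconf -sq all.q reports something like
> > "slots 1,[master=0]", i.e. the usual slot count everywhere except
> > zero on the master.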
> >
> > I hope it will help you.
> >
> > Mich
> >
> > _______________________________________________
> > StarCluster mailing list
> > StarCluster at mit.edu
> > http://mailman.mit.edu/mailman/listinfo/starcluster
> >
>