[StarCluster] DRMAA jobs failing when load balancer enabled and jobs longer than 60 mins (Lilley, John F.)

Thu Mar 6 14:26:10 EST 2014

Hi Mich,

Thanks for sharing the workaround.

The behavior is due to a relatively undocumented feature of DRMAA /
Grid Engine: DRMAA jobs in Grid Engine have "-w e" added to
the job submission request. The -w flag takes the following arguments:

          `e'  error - jobs with invalid requests will be rejected.

          `w'  warning - only a warning will be displayed for invalid requests.

          `n'  none - switches off validation; the default for qsub,
               qalter, qrsh, qsh and qlogin.

          `p'  poke - does not submit the job but prints a validation
               report based on a cluster as is with all resource
               utilizations in place.

          `v'  verify - does not submit the job but prints a
               validation report based on an empty cluster.

http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html

Thus with "-w e", if Grid Engine is not happy with the job at
submission time (e.g. it thinks that it does not have enough nodes to
run the job), it will reject the job submission.

The correct way is to override the DRMAA request with "-w n" or "-w w"
if you are going to use load-balancing.
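
As a rough illustration of the override, here is a minimal sketch in Python.
It assumes the drmaa-python bindings and that the DRMAA library passes a
"-w" option given in the job template's nativeSpecification through to
Grid Engine; neither was stated in the original report. The helper names
with_validation_level and submit are hypothetical.

```python
import shlex

# Validation levels accepted by the -w flag, per the qsub man page.
VALID_LEVELS = {"e", "w", "n", "p", "v"}


def with_validation_level(native_spec, level="n"):
    """Return native_spec with any existing '-w <x>' replaced by '-w <level>'.

    Hypothetical helper. It treats every bare '-w' token as the validation
    flag, which is a simplification.
    """
    if level not in VALID_LEVELS:
        raise ValueError("unknown -w level: %r" % (level,))
    tokens = shlex.split(native_spec)
    kept = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "-w":
            i += 2  # drop the flag together with its argument
            continue
        kept.append(tokens[i])
        i += 1
    kept += ["-w", level]
    return " ".join(shlex.quote(t) for t in kept)


def submit(command, args=(), native_spec=""):
    """Submit one job through DRMAA with validation switched off.

    Assumes the third-party drmaa-python package and a working libdrmaa
    from the Grid Engine installation.
    """
    import drmaa  # imported here so the helper above works without it

    session = drmaa.Session()
    session.initialize()
    try:
        jt = session.createJobTemplate()
        jt.remoteCommand = command
        jt.args = list(args)
        # Assumption: "-w n" here overrides the "-w e" that DRMAA adds.
        jt.nativeSpecification = with_validation_level(native_spec, "n")
        job_id = session.runJob(jt)
        session.deleteJobTemplate(jt)
        return job_id
    finally:
        session.exit()
```

Using "w" instead of "n" in the call to with_validation_level would keep
the warning while still letting the job through.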

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html

On Thu, Mar 6, 2014 at 2:13 PM, François-Michel L'Heureux
<fmlheureux at datacratic.com> wrote:
> Hi John
>
> I assume DRMAA is a replacement for OGS/SGE?
>
> About DRMAA bailing out, I don't know the product, but your guess is likely
> correct: it might crash when nodes go away. There is a somewhat similar issue
> with OGS, where we need to clean it up when nodes go away. It doesn't crash,
> though.
>
> For your second issue, regarding the execution host, again, I had a similar
> issue with OGS. The trick I used is that I left the master node as an
> execution host, but I set its number of slots to 0. Hence, OGS is happy
> because there is at least one exec host, and the load balancer runs just fine
> because when only the master node is online, there are no slots, so it
> immediately adds nodes whenever jobs come in. I don't know if there is a
> concept of slots in DRMAA or if this version of the load balancer uses it, but
> if so, I think you could reproduce my trick.
>
> I hope it will help you.
>
> Mich