[StarCluster] DRMAA jobs failing when load balancer enabled and jobs longer than 60 mins (Lilley, John F.)

Lilley, John F. johnbot at caltech.edu
Tue Mar 11 16:55:05 EDT 2014


Hi Rayson and Mich,

Thanks for the feedback. Hardcoding the switches completely resolved the issue! 

John



On Mar 6, 2014, at 11:36 AM, Rayson Ho <raysonlogin at gmail.com> wrote:

> It should only work for DRMAA jobs.
> 
> Rayson
> 
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
> 
> 
> On Thu, Mar 6, 2014 at 2:31 PM, François-Michel L'Heureux
> <fmlheureux at datacratic.com> wrote:
>> Hi!
>> 
>> From experience, I don't think it works for qrsh though.
>> Justin also just tried it and told me it doesn't work.
>> 
>> 
>> 
>> 
>> 2014-03-06 14:26 GMT-05:00 Rayson Ho <raysonlogin at gmail.com>:
>> 
>>> Hi Mich,
>>> 
>>> Thanks for sharing the workaround.
>>> 
>>> The behavior is due to a relatively undocumented feature of DRMAA /
>>> Grid Engine -- basically, DRMAA jobs in Grid Engine have "-w e" added to
>>> the job submission request. The -w flag takes the following arguments:
>>> 
>>>          `e'  error - jobs with invalid requests will be rejected.
>>> 
>>>          `w'  warning - only a warning will be displayed for invalid
>>>               requests.
>>> 
>>>          `n'  none - switches off validation; the default for qsub,
>>>               qalter, qrsh, qsh and qlogin.
>>> 
>>>          `p'  poke - does not submit the job but prints a validation
>>>               report based on a cluster as is, with all resource
>>>               utilizations in place.
>>> 
>>>          `v'  verify - does not submit the job but prints a
>>>               validation report based on an empty cluster.
>>> 
>>> http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html
>>> 
>>> Thus with "-w e", if Grid Engine is not happy with the job at
>>> submission time (e.g. it thinks that it does not have enough nodes to
>>> run the job), then it will reject the job submission.
>>> 
>>> The correct way is to override the DRMAA request with "-w n" or "-w w"
>>> if you are going to use load-balancing.
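
A minimal sketch of that override, assuming the drmaa Python bindings are in
use (the job script path below is a placeholder; your pipeline may set the
native specification elsewhere):

    import drmaa

    # Submit one job with validation relaxed: "-w n" overrides the
    # implicit "-w e" that DRMAA submissions otherwise carry, so the
    # scheduler no longer rejects jobs while the load balancer is
    # still adding nodes.
    with drmaa.Session() as session:
        jt = session.createJobTemplate()
        jt.remoteCommand = "/path/to/job.sh"   # placeholder job script
        jt.nativeSpecification = "-w n"        # or "-w w" for warnings only
        job_id = session.runJob(jt)
        print("submitted job", job_id)
        session.deleteJobTemplate(jt)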
>>> 
>>> Rayson
>>> 
>>> ==================================================
>>> Open Grid Scheduler - The Official Open Source Grid Engine
>>> http://gridscheduler.sourceforge.net/
>>> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>>> 
>>> 
>>> On Thu, Mar 6, 2014 at 2:13 PM, François-Michel L'Heureux
>>> <fmlheureux at datacratic.com> wrote:
>>>> Hi John
>>>> 
>>>> I assume DRMAA is a replacement for OGS/SGE?
>>>> 
>>>> About DRMAA bailing out, I don't know the product, but your guess is
>>>> likely correct: it might crash when nodes go away. There is a somewhat
>>>> similar issue with OGS, where we need to clean it up when nodes go
>>>> away. It doesn't crash, though.
>>>> 
>>>> For your second issue, regarding the execution host, I had a similar
>>>> issue with OGS. The trick I used is that I left the master node as an
>>>> execution host but set its number of slots to 0. Hence, OGS is happy
>>>> because there is at least one exec host, and the load balancer runs
>>>> just fine: when only the master node is online there are no slots, so
>>>> it immediately adds nodes whenever jobs come in. I don't know if there
>>>> is a concept of slots in DRMAA, or whether this version of the load
>>>> balancer uses it, but if so, I think you could reproduce my trick.
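
One possible way to reproduce Mich's slots trick on an OGS/SGE cluster is
via qconf; the host name "master" and the default queue name "all.q" below
are assumptions, and the same slots setting can also be edited interactively
with qconf -mq all.q:

    import subprocess

    # Keep "master" registered as an execution host but give its queue
    # instance zero slots, so any queued job forces the load balancer
    # to add worker nodes instead of running on the master.
    subprocess.check_call(
        ["qconf", "-mattr", "queue", "slots", "0", "all.q@master"]
    )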
>>>> 
>>>> I hope it will help you.
>>>> 
>>>> Mich
>>>> 
>>>> _______________________________________________
>>>> StarCluster mailing list
>>>> StarCluster at mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>> 
>> 
>> 
> 



