[StarCluster] DRMAA jobs failing when load balancer enabled and jobs longer than 60 mins (Lilley, John F.)
Lilley, John F.
johnbot at caltech.edu
Tue Mar 11 16:55:05 EDT 2014
Hi Rayson and Mich,
Thanks for the feedback. Hardcoding the switches completely resolved the issue!
John
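
For anyone finding this thread later, here is a minimal sketch of what hardcoding the switch can look like from Python. It assumes the drmaa-python bindings (the drmaa module) and a working libdrmaa for Grid Engine. The helper names with_validation_level and submit are made up for this illustration and are not part of StarCluster or DRMAA. The idea, following Rayson's explanation quoted below, is to pass "-w n" (or "-w w") through the job template's native specification so that the default "-w e" does not reject the job while the load balancer has scaled the cluster down.

```python
import shlex


def with_validation_level(native_spec, level="n"):
    """Return native_spec with any existing "-w <x>" replaced by "-w <level>".

    Only "n" (no validation) and "w" (warn only) are accepted, since those
    are the two values suggested in this thread for use with load balancing.
    """
    if level not in ("n", "w"):
        raise ValueError("level must be 'n' or 'w'")
    kept = []
    skip_next = False
    for token in shlex.split(native_spec):
        if skip_next:
            # This token is the argument of a "-w" we are dropping.
            skip_next = False
            continue
        if token == "-w":
            skip_next = True
            continue
        kept.append(token)
    return shlex.join(kept + ["-w", level])


def submit(command, args=(), native_spec=""):
    """Submit one job through DRMAA with validation switched off.

    Hypothetical helper: needs drmaa-python and libdrmaa at call time.
    """
    import drmaa  # imported lazily so the helper above works without it

    with drmaa.Session() as session:
        jt = session.createJobTemplate()
        try:
            jt.remoteCommand = command
            jt.args = list(args)
            # Assumption: the native specification takes precedence over
            # the "-w e" that DRMAA adds by default.
            jt.nativeSpecification = with_validation_level(native_spec, "n")
            return session.runJob(jt)
        finally:
            session.deleteJobTemplate(jt)
```

A call such as submit("/bin/sleep", ["60"], "-q all.q") would then send the job with the native specification "-q all.q -w n". The queue name here is only an example.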
On Mar 6, 2014, at 11:36 AM, Rayson Ho <raysonlogin at gmail.com> wrote:
> It should only work for DRMAA jobs.
>
> Rayson
>
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>
>
> On Thu, Mar 6, 2014 at 2:31 PM, François-Michel L'Heureux
> <fmlheureux at datacratic.com> wrote:
>> Hi!
>>
>> From experience, I don't think it works for qrsh though.
>> Justin also just tried it and told me it doesn't work.
>>
>> 2014-03-06 14:26 GMT-05:00 Rayson Ho <raysonlogin at gmail.com>:
>>
>>> Hi Mich,
>>>
>>> Thanks for sharing the workaround.
>>>
>>> The behavior is due to a relatively undocumented feature of DRMAA /
>>> Grid Engine -- basically DRMAA jobs in Grid Engine have "-w e" added to
>>> the job submission request. The -w flag takes the following arguments:
>>>
>>> `e' error - jobs with invalid requests will be rejected.
>>>
>>> `w' warning - only a warning will be displayed for invalid
>>> requests.
>>>
>>> `n' none - switches off validation; the default for qsub,
>>> qalter, qrsh, qsh and qlogin.
>>>
>>> `p' poke - does not submit the job but prints a validation
>>> report based on a cluster as is with all resource utilizations in
>>> place.
>>>
>>> `v' verify - does not submit the job but prints a
>>> validation report based on an empty cluster.
>>>
>>> http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html
>>>
>>> Thus with "-w e", if Grid Engine is not happy with the job at
>>> submission time (e.g. it thinks that it does not have enough nodes to
>>> run the job), then it will reject the job submission.
>>>
>>> The correct way is to override the DRMAA request with "-w n" or "-w w"
>>> if you are going to use load-balancing.
>>>
>>> Rayson
>>>
>>> On Thu, Mar 6, 2014 at 2:13 PM, François-Michel L'Heureux
>>> <fmlheureux at datacratic.com> wrote:
>>>> Hi John
>>>>
>>>> I assume DRMAA is a replacement for OGS/SGE?
>>>>
>>>> About DRMAA bailing out, I don't know the product, but your guess is
>>>> likely correct: it might crash when nodes go away. There is a somewhat
>>>> similar issue with OGS, where we need to clean it up when nodes go away.
>>>> It doesn't crash, though.
>>>>
>>>> For your second issue, regarding the execution host, again, I had a
>>>> similar issue with OGS. The trick I used was to leave the master node as
>>>> an execution host but set its number of slots to 0. OGS is happy
>>>> because there is at least one exec host, and the load balancer runs just
>>>> fine because, when only the master node is online, there are no slots, so
>>>> it immediately adds nodes whenever jobs come in. I don't know if there is
>>>> a concept of slots in DRMAA or if this version of the load balancer uses
>>>> it, but if so, I think you could reproduce my trick.
>>>>
>>>> I hope it will help you.
>>>>
>>>> Mich
>>>>
>>>> _______________________________________________
>>>> StarCluster mailing list
>>>> StarCluster at mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>>
>>
>>
>