[StarCluster] DRMAA jobs failing when load balancer enabled and jobs longer than 60 mins (Lilley, John F.)
Rayson Ho
raysonlogin at gmail.com
Thu Mar 6 14:36:20 EST 2014
It should only work for DRMAA jobs.
Rayson
==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
On Thu, Mar 6, 2014 at 2:31 PM, François-Michel L'Heureux
<fmlheureux at datacratic.com> wrote:
> Hi!
>
> From experience, I don't think it works for qrsh though.
> Justin also just tried it and told me it doesn't work.
>
> 2014-03-06 14:26 GMT-05:00 Rayson Ho <raysonlogin at gmail.com>:
>
>> Hi Mich,
>>
>> Thanks for sharing the workaround.
>>
>> The behavior is due to a relatively undocumented feature of DRMAA /
>> Grid Engine -- basically, DRMAA jobs in Grid Engine have "-w e" added to
>> the job submission request. The -w flag takes the following arguments:
>>
>> `e' error - jobs with invalid requests will be rejected.
>>
>> `w' warning - only a warning will be displayed for invalid
>> requests.
>>
>> `n' none - switches off validation; the default for qsub,
>> qalter, qrsh, qsh and qlogin.
>>
>> `p' poke - does not submit the job but prints a validation
>> report based on a cluster as is with all resource utilizations in
>> place.
>>
>> `v' verify - does not submit the job but prints a
>> validation report based on an empty cluster.
>>
>> http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html
>>
>> Thus with "-w e", if Grid Engine is not happy with the job at
>> submission time (e.g., it thinks that it does not have enough nodes to
>> run the job), then it will reject the job submission.
>>
>> The correct way is to override the DRMAA request with "-w n" or "-w w"
>> if you are going to use load-balancing.
>>
>> Rayson
>>
>>
>> On Thu, Mar 6, 2014 at 2:13 PM, François-Michel L'Heureux
>> <fmlheureux at datacratic.com> wrote:
>> > Hi John
>> >
>> > I assume DRMAA is a replacement for OGS/SGE?
>> >
>> > About DRMAA bailing out, I don't know the product, but your guess is
>> > likely correct: it might crash when nodes go away. There is a somewhat
>> > similar issue with OGS where we need to clean it when nodes go away. It
>> > doesn't crash though.
>> >
>> > For your second issue, regarding execution host, again, I had a similar
>> > issue with OGS. The trick I used is that I left the master node as an
>> > execution host, but I set its number of slots to 0. Hence, OGS is
>> > happy because there is at least one exec host, and the load balancer runs
>> > just fine because when only the master node is online, there are no
>> > slots, so it immediately adds nodes whenever jobs come in. I don't know if
>> > there is a concept of slots in DRMAA or if this version of the
>> > load balancer uses it, but if so, I think you could reproduce my trick.
>> >
>> > I hope it will help you.
>> >
>> > Mich
>> >
>> > _______________________________________________
>> > StarCluster mailing list
>> > StarCluster at mit.edu
>> > http://mailman.mit.edu/mailman/listinfo/starcluster
>> >
>
>
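
For readers who want to try the "-w n" / "-w w" override that Rayson describes in the quoted message above, here is a minimal sketch. It assumes the drmaa-python bindings, and it assumes that Grid Engine's DRMAA library honors a -w option passed through the job template's nativeSpecification string; the thread itself does not say how the override is applied. The helper names (native_spec_with_validation, submit_without_rejection) are hypothetical and only for illustration.

```python
# Sketch only: helper names are hypothetical, and passing "-w" through
# nativeSpecification is an assumption about how the override is applied.

SUBMITTING_LEVELS = {"e", "w", "n"}  # "p" and "v" only validate, they never submit


def native_spec_with_validation(level, base_spec=""):
    """Return a native specification string ending in an explicit -w level."""
    if level not in SUBMITTING_LEVELS:
        raise ValueError("unsupported -w level for submission: %r" % (level,))
    return (base_spec + " -w " + level).strip()


def submit_without_rejection(command, args=(), level="n", base_spec=""):
    """Submit one job through DRMAA with validation relaxed to `level`."""
    # Imported here because drmaa-python needs the Grid Engine libdrmaa
    # shared library at import time, which only exists on cluster hosts.
    import drmaa

    with drmaa.Session() as session:
        jt = session.createJobTemplate()
        try:
            jt.remoteCommand = command
            jt.args = list(args)
            # Replace the implicit "-w e" with a level that lets the job
            # queue even when no suitable exec host exists yet.
            jt.nativeSpecification = native_spec_with_validation(level, base_spec)
            return session.runJob(jt)
        finally:
            session.deleteJobTemplate(jt)
```

On a cluster node this might be called as submit_without_rejection("/bin/sleep", ["3600"]), so that the job waits in the queue until the load balancer adds a node instead of being rejected at submission time.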