[StarCluster] DRMAA jobs failing when load balancer enabled and jobs longer than 60 mins
Rajat Banerjee
rajatb at post.harvard.edu
Thu Mar 6 13:49:18 EST 2014
Hi John,
Could you explain a little more about what a DRMAA job is and what
resources it requires? I found something on Wikipedia, but it doesn't seem
relevant.
I wrote big parts of the load balancer and am guessing that it does not
understand your inter-machine dependencies. Sounds like your job is
somewhat tolerant of hosts dropping off, but we can probably come up with a
better solution.
Best,
Rajat
On Thu, Mar 6, 2014 at 12:11 PM, Lilley, John F. <johnbot at caltech.edu> wrote:
> Hi,
>
>
> I'm running a simple Java-based DRMAA job that runs a sleep command on
> each of the StarCluster compute instances, but am having a problem. If I
> create 10 non-load-balanced nodes and then submit a job from DRMAA that runs
> sleep on each of those nodes in parallel, everything completes fine. If I
> submit the same 10-node DRMAA sleep job with only 1 non-load-balanced node
> available, everything also works fine: the jobs eventually work their way
> through the single node serially and the main DRMAA process is happy.
>
> If I then enable load balancing and submit 10 jobs that sleep 30 minutes each
> from the main DRMAA process, it load balances beautifully by adding 9 nodes,
> all 10 jobs complete, and the main process exits gracefully. However, if I
> submit 10 jobs that sleep for 70 minutes, most of them finish but then the
> DRMAA process bails before all 10 jobs are complete. My guess is that when
> the first sleep jobs start to finish, the load balancer removes the nodes
> they ran on from the available execution hosts, which throws the main DRMAA
> process (which is monitoring the jobs) for a loop.
>
>
> Perhaps there's a way I can make the DRMAA process more tolerant of
> execution hosts being removed from the available pool? Another issue I have
> is that unless I have 1 execution host running all the time the DRMAA
> process refuses to start at all. I'd rather not have to keep an execution
> host running to accept the DRMAA job submissions if possible. I would
> really appreciate hearing any insights the community has on running DRMAA
> jobs in StarCluster, and whether anyone has experienced similar obstacles.
>
>
> Thanks for the help!
> John
>
>
> Output received from DRMAA after an hour when load balancing (jobs over 70
> minutes)
>
> --------------------------------------------------------------------------------------------------------------------
> INFO [2014-03-05 19:59:26,077] [OGSJob.java:211] [main] Waiting for job
> 61...
> Exception in thread "main" java.lang.IllegalStateException
> at
> com.sun.grid.drmaa.JobInfoImpl.getExitStatus(JobInfoImpl.java:75)
> at nextgen.core.job.OGSJob.waitFor(OGSJob.java:213)
> at nextgen.core.job.JobUtils.waitForAll(JobUtils.java:23)
> at tests.DrmaaSleepTest.main(DrmaaSleepTest.java:50)
> INFO [2014-03-05 19:59:37,064] [OGSUtils.java:84] [Thread-0] Ending
> DRMAA session
>
> --------------------------------------------------------------------------------------------------------------------
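
The IllegalStateException in the trace above comes out of getExitStatus(). As far as I know, in the DRMAA Java binding that call is only valid when hasExited() returns true, so a job that was aborted or killed would trip it. Below is a minimal sketch of a guard around that call. The Info interface and the of() helper are hypothetical stand-ins for the subset of org.ggf.drmaa.JobInfo involved, so the example runs without a DRMAA library; the OGSJob.waitFor code itself is not shown in this thread, so this is an assumption about what it does, not a fix for it.

```java
class JobOutcome {
    /** Hypothetical stand-in for the subset of org.ggf.drmaa.JobInfo used here. */
    interface Info {
        boolean hasExited();
        boolean wasAborted();
        boolean hasSignaled();
        int getExitStatus();
        String getTerminatingSignal();
    }

    /** Test helper (not part of DRMAA): builds an Info with fixed answers. */
    static Info of(final boolean exited, final boolean aborted,
                   final boolean signaled, final int status, final String signal) {
        return new Info() {
            public boolean hasExited() { return exited; }
            public boolean wasAborted() { return aborted; }
            public boolean hasSignaled() { return signaled; }
            public int getExitStatus() {
                // Assumption: mirrors the behaviour seen in the stack trace.
                if (!exited) throw new IllegalStateException();
                return status;
            }
            public String getTerminatingSignal() { return signal; }
        };
    }

    /** Describes how a job ended, only asking for the exit status when it is safe to. */
    static String describe(Info info) {
        if (info.wasAborted()) return "aborted";
        if (info.hasExited()) return "exited with status " + info.getExitStatus();
        if (info.hasSignaled()) return "killed by signal " + info.getTerminatingSignal();
        return "finished in an undetermined state";
    }

    public static void main(String[] args) {
        System.out.println(describe(of(true, false, false, 0, null)));
        System.out.println(describe(of(false, true, false, 0, null)));
        System.out.println(describe(of(false, false, true, 0, "SIGKILL")));
    }
}
```

With a check like this, the monitoring process could log which job ended abnormally and decide whether to resubmit it, instead of dying on the first job whose host went away.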
>
> Output received from DRMAA when there are no execution hosts available on
> initial job submission (if I have an execution host available, it submits OK)
>
> --------------------------------------------------------------------------------------------------------------------
> user@master:~/$ java -jar DrmaaSleepTest.jar -m 5 -n 10 -s Sleep.jar
> log4j:ERROR Could not find value for key log4j.appender.R
> log4j:ERROR Could not instantiate appender named "R".
> WARN [2014-03-06 05:09:54,984] [OGSUtils.java:65] [main] Starting a
> DRMAA session.
> WARN [2014-03-06 05:09:54,989] [OGSUtils.java:66] [main] There should
> only be one active DRMAA session at a time.
> INFO [2014-03-06 05:09:55,430] [OGSUtils.java:92] [main] Attached
> shutdown hook to close DRMAA session upon JVM exit.
> Exception in thread "main" org.ggf.drmaa.DeniedByDrmException:
> warning:user your job is not allowed to run in any queue
> error: no suitable queues
> at com.sun.grid.drmaa.SessionImpl.nativeRunJob(Native Method)
> at com.sun.grid.drmaa.SessionImpl.runJob(SessionImpl.java:349)
> at nextgen.core.job.OGSJob.submit(OGSJob.java:188)
> at tests.DrmaaSleepTest.main(DrmaaSleepTest.java:46)
> INFO [2014-03-06 05:09:55,500] [OGSUtils.java:84] [Thread-0] Ending
> DRMAA session
>
> --------------------------------------------------------------------------------------------------------------------
>