[StarCluster] DRMAA jobs failing when load balancer enabled and jobs longer than 60 mins
Chris Dagdigian
dag at bioteam.net
Thu Mar 6 13:56:05 EST 2014
FYI, DRMAA is an API for job submission and control that is supported by
a number of different cluster schedulers, making it an attractive target
for individuals writing portable cluster-aware tools and for commercial
ISVs who need to create software that speaks with HPC resources.
The formal website is: http://www.drmaa.org/
-Chris
> Rajat Banerjee <mailto:rajatb at post.harvard.edu>
> March 6, 2014 1:49 PM
> Hi John,
> Could you explain a little more about what a DRMAA job is and what
> resources it requires? Found something on wikipedia but it doesn't
> seem relevant.
>
> I wrote big parts of the load balancer and am guessing that it does
> not understand your inter-machine dependencies. Sounds like your job
> is somewhat tolerant of hosts dropping off, but we can probably come
> up with a better solution.
>
> Best,
> Rajat
>
>
>
> _______________________________________________
> StarCluster mailing list
> StarCluster at mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster
> Lilley, John F. <mailto:johnbot at caltech.edu>
> March 6, 2014 12:11 PM
> Hi,
>
>
> I’m running a simple Java-based DRMAA job that runs a sleep command
> on each of the StarCluster compute instances, but I am having a problem.
> If I create 10 non-loadbalanced nodes and then submit a job from DRMAA
> that runs sleep on each of those nodes in parallel, everything
> completes fine. If I submit the same 10-node DRMAA sleep job with 1
> non-loadbalanced node available, everything works fine: the jobs
> eventually work their way through the single node serially and the
> main DRMAA process is happy.
>
> If I then enable load balancing and submit 10 jobs that sleep 30 minutes
> each from the main DRMAA process, it load balances beautifully by
> adding 9 nodes, all 10 jobs complete, and the main process exits
> gracefully. However, if I submit 10 jobs that sleep for 70 minutes,
> most of them finish, but then the DRMAA process bails before all 10
> jobs are complete. My guess is that when the first sleep jobs start to
> finish up, the load balancer removes the nodes they ran on from the
> available execution hosts, throwing the main DRMAA process, which is
> monitoring the jobs, for a loop.
>
>
> Perhaps there’s a way I can make the DRMAA process more tolerant of
> execution hosts being removed from the available pool? Another issue I
> have is that unless I have 1 execution host running all the time, the
> DRMAA process refuses to start at all. I’d rather not have to keep an
> execution host running to accept the DRMAA job submissions if
> possible. I would really appreciate hearing any insights the community
> has on running DRMAA jobs in StarCluster, and whether anyone has
> experienced similar obstacles.
>
>
> Thanks for the help!
> John
>
>
> Output received from DRMAA after an hour when load balancing (jobs
> over 70 minutes)
> --------------------------------------------------------------------------------------------------------------------
> INFO [2014-03-05 19:59:26,077] [OGSJob.java:211] [main] Waiting for job 61...
> Exception in thread "main" java.lang.IllegalStateException
>     at com.sun.grid.drmaa.JobInfoImpl.getExitStatus(JobInfoImpl.java:75)
>     at nextgen.core.job.OGSJob.waitFor(OGSJob.java:213)
>     at nextgen.core.job.JobUtils.waitForAll(JobUtils.java:23)
>     at tests.DrmaaSleepTest.main(DrmaaSleepTest.java:50)
> INFO [2014-03-05 19:59:37,064] [OGSUtils.java:84] [Thread-0] Ending DRMAA session
> --------------------------------------------------------------------------------------------------------------------
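
The trace above shows getExitStatus() throwing IllegalStateException. In
the Java DRMAA binding that happens when the job did not exit normally,
that is, when hasExited() is false (for example, the job was signaled or
aborted). Whether the load balancer removing a host is what causes that
here is not established in this thread. A minimal sketch of a guard the
waiting code could apply before reading the exit status follows.
JobOutcome, ExitStatusGuard and fake() are hypothetical names: JobOutcome
mirrors a subset of the org.ggf.drmaa.JobInfo methods so the example
compiles without the DRMAA jar.

```java
// Hypothetical stand-in for the subset of org.ggf.drmaa.JobInfo used below.
interface JobOutcome {
    boolean hasExited();
    boolean hasSignaled();
    boolean wasAborted();
    int getExitStatus();            // throws IllegalStateException unless hasExited()
    String getTerminatingSignal();  // throws IllegalStateException unless hasSignaled()
}

final class ExitStatusGuard {
    // Check how the job ended before asking for the exit status, so a job
    // that was aborted or killed does not crash the monitoring process.
    static String describe(JobOutcome info) {
        if (info.wasAborted()) {
            return "aborted before running";
        }
        if (info.hasExited()) {
            return "exited with status " + info.getExitStatus();
        }
        if (info.hasSignaled()) {
            return "killed by signal " + info.getTerminatingSignal();
        }
        return "finished with unclear conditions";
    }

    // Test double (an assumption for illustration, not part of DRMAA).
    static JobOutcome fake(boolean exited, int status, boolean signaled,
                           String signal, boolean aborted) {
        return new JobOutcome() {
            public boolean hasExited() { return exited; }
            public boolean hasSignaled() { return signaled; }
            public boolean wasAborted() { return aborted; }
            public int getExitStatus() {
                if (!exited) throw new IllegalStateException();
                return status;
            }
            public String getTerminatingSignal() {
                if (!signaled) throw new IllegalStateException();
                return signal;
            }
        };
    }

    public static void main(String[] args) {
        // A job that exited normally, then one with no usable exit status.
        System.out.println(describe(fake(true, 0, false, null, false)));
        System.out.println(describe(fake(false, 0, false, null, false)));
    }
}
```

With a real session, the same checks would be applied to the JobInfo
returned by Session.wait(); a job reported as aborted or signaled could
then be logged or resubmitted instead of ending the whole run.
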
>
> Output received from DRMAA when there are no execution hosts available
> on initial job submission (if an execution host is available, it
> submits OK)
> --------------------------------------------------------------------------------------------------------------------
> user at master:~/$ java -jar DrmaaSleepTest.jar -m 5 -n 10 -s Sleep.jar
> log4j:ERROR Could not find value for key log4j.appender.R
> log4j:ERROR Could not instantiate appender named "R".
> WARN [2014-03-06 05:09:54,984] [OGSUtils.java:65] [main] Starting a DRMAA session.
> WARN [2014-03-06 05:09:54,989] [OGSUtils.java:66] [main] There should only be one active DRMAA session at a time.
> INFO [2014-03-06 05:09:55,430] [OGSUtils.java:92] [main] Attached shutdown hook to close DRMAA session upon JVM exit.
> Exception in thread "main" org.ggf.drmaa.DeniedByDrmException:
> warning:user your job is not allowed to run in any queue
> error: no suitable queues
>     at com.sun.grid.drmaa.SessionImpl.nativeRunJob(Native Method)
>     at com.sun.grid.drmaa.SessionImpl.runJob(SessionImpl.java:349)
>     at nextgen.core.job.OGSJob.submit(OGSJob.java:188)
>     at tests.DrmaaSleepTest.main(DrmaaSleepTest.java:46)
> INFO [2014-03-06 05:09:55,500] [OGSUtils.java:84] [Thread-0] Ending DRMAA session
> --------------------------------------------------------------------------------------------------------------------
>