[StarCluster] Instances are not accepting jobs when the slots are available.

Jin Yu yujin2004 at gmail.com
Thu Jul 17 14:44:36 EDT 2014


Hello,

I just started a cluster of 20 c3.8xlarge instances, each of which has 32
virtual cores. As I understand it, each instance should therefore have 32
slots available for running jobs by default. But after the cluster had been
running for a while, I found that many nodes were not fully loaded.

In the example below, node016 has only 13 jobs running and node017 has only
9, while node018 has 32. I have another ~10000 jobs waiting in the queue, so
the cluster is not running out of jobs.

Can anyone give me a hint as to what is going on here?

Thanks!
Jin


all.q@node016                  BIP   0/13/32        148.35   linux-x64     a
    784 0.55500 job.part.a sgeadmin     r     07/17/2014 11:25:59     1
    982 0.55500 job.part.a sgeadmin     r     07/17/2014 14:43:59     1
   1056 0.55500 job.part.a sgeadmin     r     07/17/2014 16:34:44     1
   1057 0.55500 job.part.a sgeadmin     r     07/17/2014 16:34:44     1
   1058 0.55500 job.part.a sgeadmin     r     07/17/2014 16:34:59     1
   1121 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
   1122 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
   1123 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
   1124 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
   1125 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
   1126 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
   1127 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
   1128 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
---------------------------------------------------------------------------------
all.q@node017                  BIP   0/9/32         83.86    linux-x64     a
    568 0.55500 job.part.a sgeadmin     r     07/17/2014 04:01:14     1
   1001 0.55500 job.part.a sgeadmin     r     07/17/2014 15:07:29     1
   1002 0.55500 job.part.a sgeadmin     r     07/17/2014 15:07:29     1
   1072 0.55500 job.part.a sgeadmin     r     07/17/2014 16:53:29     1
   1116 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:29     1
   1117 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:29     1
   1118 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:44     1
   1119 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:59     1
   1120 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:59     1
---------------------------------------------------------------------------------
all.q@node018                  BIP   0/32/32        346.00   linux-x64     a
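
The queue-instance lines in the listing above can be summarized with a short script. What follows is only a minimal Python sketch, assuming the listing is `qstat -f` output whose columns are queue instance, queue type, reserved/used/total slots, load average, architecture, and state. The helper names `parse_queue_lines` and `underused` are hypothetical and are not part of StarCluster or Grid Engine. In Grid Engine a state of `a` marks a queue instance in load alarm, so the state column may be worth reading alongside the slot counts.

```python
import re

# Matches a queue-instance line as printed by `qstat -f`, for example:
#   all.q@node016   BIP   0/13/32   148.35   linux-x64   a
# Mail archives often render "@" as " at ", so both spellings are accepted.
QUEUE_LINE = re.compile(
    r"^(?P<queue>\S+?)(?:@| at )(?P<host>\S+)\s+"
    r"(?P<qtype>\S+)\s+"
    r"(?P<resv>\d+)/(?P<used>\d+)/(?P<total>\d+)\s+"
    r"(?P<load>\S+)\s+(?P<arch>\S+)"
    r"(?:\s+(?P<states>\S+))?\s*$"
)


def parse_queue_lines(text):
    """Return one dict per queue-instance line; job and separator lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = QUEUE_LINE.match(line)
        if m is None:
            continue
        try:
            load = float(m.group("load"))
        except ValueError:
            load = None  # the load column is not always numeric
        rows.append({
            "queue": m.group("queue"),
            "host": m.group("host"),
            "used": int(m.group("used")),
            "total": int(m.group("total")),
            "load": load,
            "states": m.group("states") or "",
        })
    return rows


def underused(rows):
    """Return the rows whose used slot count is below the total slot count."""
    return [r for r in rows if r["used"] < r["total"]]


if __name__ == "__main__":
    sample = """\
all.q@node016                  BIP   0/13/32        148.35   linux-x64     a
    784 0.55500 job.part.a sgeadmin     r     07/17/2014 11:25:59     1
all.q@node017                  BIP   0/9/32         83.86    linux-x64     a
all.q@node018                  BIP   0/32/32        346.00   linux-x64     a
"""
    for r in underused(parse_queue_lines(sample)):
        free = r["total"] - r["used"]
        print(f"{r['host']}: {r['used']}/{r['total']} slots used, "
              f"{free} free, load {r['load']}, state {r['states']!r}")
```

Saving the output of `qstat -f` to a file and passing its contents to `parse_queue_lines` would give the same summary for all 20 nodes at once.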