[StarCluster] Instances are not accepting jobs when the slots are available.
Jin Yu
yujin2004 at gmail.com
Fri Jul 18 12:01:46 EDT 2014
Rayson,
Thank you for pointing out that the ATLAS library in the StarCluster AMI
creates multiple threads by default for matrix operations. I checked my
running jobs again more carefully using htop and found that each job had
created exactly 32 threads! That is why I see an overload of ~100.
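(For anyone who prefers a one-liner to htop, counting the threads of a
single R process shows the same thing; the PID below is just a placeholder:

    ps -o nlwp= -p <pid_of_R_process>

On a c3.8xlarge this should print something close to the 32 virtual cores
if ATLAS is spawning one thread per core.)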
I have used qalter on all the waiting jobs to impose a hard limit of one
core per job, as sketched below. I will report back if it does not work.
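A rough sketch of the idea, for anyone else who runs into this (it assumes
a qalter that supports the -binding switch, as in SGE 6.2u5 / Open Grid
Scheduler, and uses the sgeadmin user from the qstat output below as an
example):

    # Bind every pending (qw) job to a single core so the BLAS threads
    # cannot fan out over all 32 virtual cores.
    for jid in $(qstat -u sgeadmin -s p | awk 'NR>2 {print $1}'); do
        qalter -binding linear:1 "$jid"
    done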
Thanks!
Jin
On Fri, Jul 18, 2014 at 10:04 AM, Jin Yu <yujin2004 at gmail.com> wrote:
> Ah, I see. This is probably the cause, and I want to validate it. Do you
> know how to configure a job in SGE to force it to use only one core per
> job?
>
> Thanks!
> Jin
>
>
> On Fri, Jul 18, 2014 at 12:09 AM, Rayson Ho <raysonlogin at gmail.com> wrote:
>
>> The BLAS library in the StarCluster AMI is ATLAS, which takes advantage
>> of multicore machines by running each BLAS call using multiple threads.
>>
>> I googled and found that SVD with ATLAS does use more than 1 core:
>> http://blog.revolutionanalytics.com/2010/06/performance-benefits-of-multithreaded-r.html
>>
>> This can explain the behavior you are getting...
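>>
>> For what it's worth, one way to double-check which BLAS/LAPACK that R is
>> linked against on a node (assuming R was built with a shared libR, as in
>> the Ubuntu packages) is something like:
>>
>>     ldd $(R RHOME)/lib/libR.so | grep -iE 'blas|lapack'
>>
>> The resolved paths should show whether the libraries come from ATLAS.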
>>
>> Rayson
>>
>> ==================================================
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> http://gridscheduler.sourceforge.net/
>> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>>
>>
>> On Thu, Jul 17, 2014 at 10:03 PM, Jin Yu <yujin2004 at gmail.com> wrote:
>>
>>> Hi Rayson,
>>>
>>> My local VirtualBox VM also has more than one core, but the CPU usage is
>>> never more than 100% when I run the job locally. I also checked it using
>>> htop; it only has one thread.
>>>
>>> I don't use any multicore or parallel modules in my R code, but I do have
>>> a couple of svd() calls and exception handlers.
>>>
>>> Thanks!
>>> Jin
>>>
>>>
>>> On Thu, Jul 17, 2014 at 5:58 PM, Rayson Ho <raysonlogin at gmail.com>
>>> wrote:
>>>
>>>> What R program or module are you using? There are 32 virtual cores on the
>>>> c3.8xlarge, so if your code creates 1 thread per virtual core by default,
>>>> there will be 3200% CPU usage on the c3.8xlarge. Also, maybe your local
>>>> machine has far fewer cores, and that's why this is not happening there?
>>>>
>>>> Rayson
>>>>
>>>> ==================================================
>>>> Open Grid Scheduler - The Official Open Source Grid Engine
>>>> http://gridscheduler.sourceforge.net/
>>>> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>>>>
>>>>
>>>> On Thu, Jul 17, 2014 at 6:06 PM, Jin Yu <yujin2004 at gmail.com> wrote:
>>>>
>>>>> Here is a follow-up on my investigation of the unusually high CPU/core
>>>>> usage on the EC2 instances.
>>>>>
>>>>> In the last post, I reported two observations: 1. unusually high CPU/core
>>>>> usage by the R processes on the EC2 instances, even though the code is
>>>>> designed to use one core and does so on my local machine; and 2. an
>>>>> unusually high percentage of kernel time in the CPU usage.
>>>>>
>>>>> I looked more closely at the R processes using htop and found that a lot
>>>>> of threads had been created in each of them, and there are tons of
>>>>> sched_yield() system calls in each thread.
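>>>>>
>>>>> For anyone who wants to reproduce the check, attaching strace to one of
>>>>> the R processes and summarizing its system calls is one way to see this
>>>>> (illustrative command only; the PID is a placeholder):
>>>>>
>>>>>     strace -c -f -p <pid_of_R_process>
>>>>>
>>>>> The -c flag prints a per-syscall count summary when strace is stopped
>>>>> with Ctrl-C, and -f follows threads created after attaching.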
>>>>>
>>>>> Does this behavior with StarCluster on EC2 ring a bell for anyone?
>>>>>
>>>>> Thanks!
>>>>> Jin
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jul 17, 2014 at 3:48 PM, Jin Yu <yujin2004 at gmail.com> wrote:
>>>>>
>>>>>> Hi Chris,
>>>>>>
>>>>>> Thanks for your prompt reply and for pointing me to the unusually high
>>>>>> load on the instances! I have found something even more mysterious on
>>>>>> these EC2 instances (c3.8xlarge, to be specific):
>>>>>>
>>>>>> 1. I found that some of my jobs are using as much as 900% CPU, although
>>>>>> these jobs are designed to use only one core and behave that way on my
>>>>>> local machine. This is what leads to the unexpectedly high system load.
>>>>>> Below is an example snapshot of these processes.
>>>>>>
>>>>>> 2. While the 8 running jobs together take about 3000% CPU, which is close
>>>>>> to the full 32 cores, kernel time accounts for up to 70% of that CPU time.
>>>>>>
>>>>>> Are these problems related to the virtualized nature of the EC2
>>>>>> instances? Can you give me a hint on how to investigate them?
>>>>>>
>>>>>> Thanks!
>>>>>> Jin
>>>>>>
>>>>>>
>>>>>>
>>>>>> [image: Inline image 1]
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 17, 2014 at 1:48 PM, Chris Dagdigian <dag at bioteam.net>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi Jin,
>>>>>>>
>>>>>>> The cluster is not accepting jobs into those open slots because your
>>>>>>> compute nodes are reporting alarm state "a" - your first host has a
>>>>>>> reported load average of 148!
>>>>>>>
>>>>>>> Alarm state 'a' means "load threshold alarm level reached"; it basically
>>>>>>> means that the server load is high enough that the nodes are refusing
>>>>>>> new work until the load average goes down.
>>>>>>>
>>>>>>> All of those load alarm thresholds are configurable values within SGE,
>>>>>>> so you can revise them upwards if you want.
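>>>>>>>
>>>>>>> For example (a sketch only; adjust the value to your workload), the
>>>>>>> queue-level threshold is usually np_load_avg=1.75 by default, and you
>>>>>>> can view and raise it with qconf:
>>>>>>>
>>>>>>>     # show the current threshold for all.q
>>>>>>>     qconf -sq all.q | grep load_thresholds
>>>>>>>
>>>>>>>     # raise it, e.g. to 4x the normalized per-core load
>>>>>>>     qconf -mattr queue load_thresholds np_load_avg=4.0 all.q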
>>>>>>>
>>>>>>> Regards,
>>>>>>> Chris
>>>>>>>
>>>>>>>
>>>>>>> Jin Yu wrote:
>>>>>>> > Hello,
>>>>>>> >
>>>>>>> > I just started a cluster of 20 c3.8xlarge instances, which have 32
>>>>>>> > virtual cores each. In my understanding, each instance should have
>>>>>>> > 32 slots available to run jobs by default. But after running for a
>>>>>>> > while, I found that many nodes are not running at full capacity.
>>>>>>> >
>>>>>>> > As an example, you can see below that node016 has only 13 jobs running
>>>>>>> > and node017 has 9 jobs running, while node018 has 32 jobs running. I
>>>>>>> > have another ~10000 jobs waiting in the queue, so it is not a matter
>>>>>>> > of running out of jobs.
>>>>>>> >
>>>>>>> > Can anyone give me a hint as to what is going on here?
>>>>>>> >
>>>>>>> > Thanks!
>>>>>>> > Jin
>>>>>>> >
>>>>>>> >
>>>>>>> > all.q@node016                 BIP   0/13/32        148.35   linux-x64     a
>>>>>>> >     784 0.55500 job.part.a sgeadmin     r     07/17/2014 11:25:59     1
>>>>>>> >     982 0.55500 job.part.a sgeadmin     r     07/17/2014 14:43:59     1
>>>>>>> >    1056 0.55500 job.part.a sgeadmin     r     07/17/2014 16:34:44     1
>>>>>>> >    1057 0.55500 job.part.a sgeadmin     r     07/17/2014 16:34:44     1
>>>>>>> >    1058 0.55500 job.part.a sgeadmin     r     07/17/2014 16:34:59     1
>>>>>>> >    1121 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>>>>> >    1122 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>>>>> >    1123 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>>>>> >    1124 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>>>>> >    1125 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>>>>> >    1126 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>>>>> >    1127 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>>>>> >    1128 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>>>>> > ---------------------------------------------------------------------------------
>>>>>>> > all.q@node017                 BIP   0/9/32          83.86   linux-x64     a
>>>>>>> >     568 0.55500 job.part.a sgeadmin     r     07/17/2014 04:01:14     1
>>>>>>> >    1001 0.55500 job.part.a sgeadmin     r     07/17/2014 15:07:29     1
>>>>>>> >    1002 0.55500 job.part.a sgeadmin     r     07/17/2014 15:07:29     1
>>>>>>> >    1072 0.55500 job.part.a sgeadmin     r     07/17/2014 16:53:29     1
>>>>>>> >    1116 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:29     1
>>>>>>> >    1117 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:29     1
>>>>>>> >    1118 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:44     1
>>>>>>> >    1119 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:59     1
>>>>>>> >    1120 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:59     1
>>>>>>> > ---------------------------------------------------------------------------------
>>>>>>> > all.q@node018                 BIP   0/32/32        346.00   linux-x64     a
>>>>>>> >
>>>>>>> > _______________________________________________
>>>>>>> > StarCluster mailing list
>>>>>>> > StarCluster at mit.edu
>>>>>>> > http://mailman.mit.edu/mailman/listinfo/starcluster
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> StarCluster mailing list
>>>>> StarCluster at mit.edu
>>>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>>>
>>>>>
>>>>
>>>
>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ScreenClip.png
Type: image/png
Size: 12339 bytes
Desc: not available
Url : http://mailman.mit.edu/pipermail/starcluster/attachments/20140718/d20d6286/attachment-0001.png