[StarCluster] Instances are not accepting jobs when the slots are available.

Fri Jul 18 01:09:36 EDT 2014

The BLAS library in the StarCluster AMI is ATLAS, which takes advantage of
multicore machines by running each BLAS call using multiple threads.
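
One way to check whether this is happening on a node is to look at the thread count of each R process. The following is a minimal sketch, assuming a Linux node where /proc/<pid>/status exposes a "Threads:" field; the helper names are made up for illustration and are not part of StarCluster or SGE.

```python
import os
import re


def thread_count(status_text):
    """Return the value of the 'Threads:' field from /proc/<pid>/status text."""
    match = re.search(r"^Threads:\s+(\d+)$", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Threads:' field found")
    return int(match.group(1))


def process_threads(pid):
    """Read /proc/<pid>/status (Linux only) and return the thread count."""
    with open(f"/proc/{pid}/status") as f:
        return thread_count(f.read())


if __name__ == "__main__":
    # Hypothetical usage: replace os.getpid() with the PID of an R process.
    if os.path.exists("/proc/self/status"):
        print(process_threads(os.getpid()))
```

A single-threaded R job should report 1 here; a value close to the number of virtual cores would be consistent with a multithreaded BLAS.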

I googled and found that SVD with ATLAS does use more than 1 core:
http://blog.revolutionanalytics.com/2010/06/performance-benefits-of-multithreaded-r.html

This can explain the behavior you are getting...
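
As a rough check of how this ties in with the alarm state "a" mentioned earlier in the thread, the sketch below normalizes the reported load averages by the 32 virtual cores per node and compares them with a threshold. It assumes the stock Grid Engine queue setting load_thresholds np_load_avg=1.75; the all.q on your cluster may be configured differently (qconf -sq all.q shows the actual value). The function names are hypothetical.

```python
def np_load_avg(load_avg, num_proc):
    """Load average normalized by the number of processors."""
    return load_avg / num_proc


def in_load_alarm(load_avg, num_proc, threshold=1.75):
    """True if the normalized load exceeds the (assumed) queue threshold."""
    return np_load_avg(load_avg, num_proc) > threshold


# (load_avg, running jobs) as reported in the qstat output in this thread.
nodes = {
    "node016": (148.35, 13),
    "node017": (83.86, 9),
    "node018": (346.00, 32),
}

NUM_PROC = 32  # virtual cores on a c3.8xlarge, per the thread

for name, (load, jobs) in nodes.items():
    print(
        name,
        "np_load_avg=%.2f" % np_load_avg(load, NUM_PROC),
        "alarm=%s" % in_load_alarm(load, NUM_PROC),
        "load per job=%.1f" % (load / jobs),
    )
```

Under these assumptions all three nodes are above the threshold, and the load per job is roughly 10, which fits jobs that each run many threads rather than one.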

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html

On Thu, Jul 17, 2014 at 10:03 PM, Jin Yu <yujin2004 at gmail.com> wrote:

> Hi Rayson,
>
> My local VirtualBox also has more than one core, but the CPU usage is
> never more than 100% when I run it locally. I also checked it using htop;
> it only has one thread.
>
> I don't have any multicore or parallel modules in my R code, but I do
> have a couple of svd calls and exception handlers in it.
>
> Thanks!
> Jin
>
>
> On Thu, Jul 17, 2014 at 5:58 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
>
>> What R program or module are you using? There are 32 virtual cores on
>> the C3.8xlarge, so if your code by default creates 1 thread per virtual core,
>> there will be up to 3200% CPU usage on the C3.8xlarge. Also, maybe your
>> local machine has far fewer cores, and that's why this is not happening there?
>>
>> Rayson
>>
>>
>> On Thu, Jul 17, 2014 at 6:06 PM, Jin Yu <yujin2004 at gmail.com> wrote:
>>
>>> Here is a follow-up on my investigation of the unusually high CPU/core
>>> usage in EC2 instances.
>>>
>>> In the last post, I reported two observations: 1. unusually high CPU/core
>>> usage by the R process in EC2 instances, even though it is designed to use
>>> one core and does so on the local machine; and 2. an unusually high
>>> percentage of kernel time in CPU usage.
>>>
>>> I looked further into the R processes using htop and found that a lot of
>>> threads were created in each of them, and there are tons of sched_yield()
>>> system calls in each thread.
>>>
>>> Do these phenomena with StarCluster on EC2 ring a bell for anyone?
>>>
>>> Thanks!
>>> Jin
>>>
>>>
>>>
>>>
>>> On Thu, Jul 17, 2014 at 3:48 PM, Jin Yu <yujin2004 at gmail.com> wrote:
>>>
>>>> Hi Chris,
>>>>
>>>> Thanks for your prompt reply and for pointing me to the unusually high
>>>> load of the instance! I found something more mysterious in the EC2
>>>> instances (C3.8xlarge, to be more specific):
>>>>
>>>> 1. Some of my jobs are using as much as 900% CPU, although these jobs
>>>> are designed to use only one core and behave that way on my local
>>>> machine. This leads to the unexpectedly high load on the system.
>>>> Following is an example snapshot of these processes.
>>>>
>>>> 2. All 8 running jobs together take 3000% CPU, which is close to the
>>>> full capacity of 32 cores. Kernel time takes up to 70% of the CPU time.
>>>>
>>>> Are these problems related to the virtualization nature of the EC2
>>>> instances? Can you give me a hint for investigating them?
>>>>
>>>> Thanks!
>>>> Jin
>>>>
>>>>
>>>>
>>>> [image: Inline image 1]
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jul 17, 2014 at 1:48 PM, Chris Dagdigian <dag at bioteam.net>
>>>> wrote:
>>>>
>>>>>
>>>>> Hi Jin,
>>>>>
>>>>> The cluster is not accepting jobs into those open slots because your
>>>>> compute nodes are reporting alarm state "a"  - your first host has a
>>>>> reported load average of 148!
>>>>>
>>>>> Alarm state 'a' means "load threshold alarm level reached". It basically
>>>>> means that the server load is high enough that the nodes are refusing
>>>>> new work until the load average goes down.
>>>>>
>>>>> All of those load alarm thresholds are configurable values within SGE,
>>>>> so you can revise them upwards if you want.
>>>>>
>>>>> Regards,
>>>>> Chris
>>>>>
>>>>>
>>>>> Jin Yu wrote:
>>>>> > Hello,
>>>>> >
>>>>> > I just started a cluster of 20 c3.8xlarge instances, which have 32
>>>>> > virtual cores each. In my understanding, each instance should have
>>>>> > 32 slots available to run jobs by default. But after running it
>>>>> > for a while, I found that a lot of nodes are not running at full speed.
>>>>> >
>>>>> > In the following example, you can see that node016 has only 13 jobs
>>>>> > running and node017 has 9 jobs running, while node018 has 32 jobs
>>>>> > running. I have another ~10000 jobs waiting in the queue, so it is not
>>>>> > a matter of running out of jobs.
>>>>> >
>>>>> > Can anyone give me a hint what is going on here?
>>>>> >
>>>>> > Thanks!
>>>>> > Jin
>>>>> >
>>>>> >
>>>>> > all.q@node016                  BIP   0/13/32        148.35   linux-x64     a
>>>>> >     784 0.55500 job.part.a sgeadmin     r     07/17/2014 11:25:59     1
>>>>> >     982 0.55500 job.part.a sgeadmin     r     07/17/2014 14:43:59     1
>>>>> >    1056 0.55500 job.part.a sgeadmin     r     07/17/2014 16:34:44     1
>>>>> >    1057 0.55500 job.part.a sgeadmin     r     07/17/2014 16:34:44     1
>>>>> >    1058 0.55500 job.part.a sgeadmin     r     07/17/2014 16:34:59     1
>>>>> >    1121 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>>> >    1122 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>>> >    1123 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>>> >    1124 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>>> >    1125 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>>> >    1126 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>>> >    1127 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>>> >    1128 0.55500 job.part.a sgeadmin     r     07/17/2014 17:22:44     1
>>>>> > ---------------------------------------------------------------------------------
>>>>> > all.q@node017                  BIP   0/9/32         83.86    linux-x64     a
>>>>> >     568 0.55500 job.part.a sgeadmin     r     07/17/2014 04:01:14     1
>>>>> >    1001 0.55500 job.part.a sgeadmin     r     07/17/2014 15:07:29     1
>>>>> >    1002 0.55500 job.part.a sgeadmin     r     07/17/2014 15:07:29     1
>>>>> >    1072 0.55500 job.part.a sgeadmin     r     07/17/2014 16:53:29     1
>>>>> >    1116 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:29     1
>>>>> >    1117 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:29     1
>>>>> >    1118 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:44     1
>>>>> >    1119 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:59     1
>>>>> >    1120 0.55500 job.part.a sgeadmin     r     07/17/2014 17:19:59     1
>>>>> > ---------------------------------------------------------------------------------
>>>>> > all.q@node018                  BIP   0/32/32        346.00   linux-x64     a
>>>>> >
>>>>> > _______________________________________________
>>>>> > StarCluster mailing list
>>>>> > StarCluster at mit.edu
>>>>> > http://mailman.mit.edu/mailman/listinfo/starcluster
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ScreenClip.png
Type: image/png
Size: 12339 bytes
Desc: not available
Url : http://mailman.mit.edu/pipermail/starcluster/attachments/20140718/632e4860/attachment-0001.png