[StarCluster] Instances are not accepting jobs when the slots are available.

Jin Yu yujin2004 at gmail.com
Fri Jul 18 16:39:13 EDT 2014


I just found that OpenBLAS is already the default BLAS library in my
customized AMI, and the problem is solved by simply setting "export
OPENBLAS_NUM_THREADS=1". Thanks Rayson and Chris for your valuable
suggestions. Without your prompt help, I could not have figured out the
issue so quickly.
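
For reference, here is the kind of wrapper script this amounts to (a
minimal sketch; the script and file names are illustrative):

    #!/bin/bash
    # job.sh -- submitted via qsub; pin OpenBLAS to a single thread so
    # each SGE slot runs exactly one compute thread
    export OPENBLAS_NUM_THREADS=1
    Rscript my_analysis.R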

I have submitted 20,000 jobs totaling ~100,000 CPU hours, and I am using
the load balancer to gradually grow the cluster to ~5,000 cores. Let's see
how well SGE and StarCluster handle this.

Thanks!
Jin


On Fri, Jul 18, 2014 at 12:29 PM, Rayson Ho <raysonlogin at gmail.com> wrote:

> The best way to handle this is to set up a PE (parallel environment) in
> Grid Engine, so that the Grid Engine scheduler knows that your job uses
> more than 1 CPU core.
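>
> For example, something along these lines (a sketch; the PE name "smp" is
> illustrative, and the PE details depend on your setup):
>
>     qconf -ap smp            # define the PE (allocation_rule, slots, ...)
>     qconf -mq all.q          # add "smp" to the queue's pe_list
>     qsub -pe smp 4 job.sh    # the scheduler now reserves 4 slots per job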
>
> Also, Grid Engine has the ability to bind a job to a CPU core, but ATLAS
> would still create 32 threads on a c3.8xlarge, because it still sees the
> full hardware configuration. So with core binding the job can only use 1
> CPU, but ATLAS still creates 32 threads, and all 32 threads will be
> fighting for CPU resources. And ATLAS does not have a dynamic way to limit
> the number of threads it creates. IMO, if R works with other BLAS
> libraries, then switching to something like OpenBLAS and telling it to use
> only the number of cores Grid Engine assigns to the job would fix this
> issue. And if you have a large number of R jobs, then a serial
> (single-threaded) BLAS library can be a good choice too, because you can
> then run 32 jobs that each use 1 virtual core of the c3.8xlarge.
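>
> One way to pass that along at submission time (a sketch, using qsub's -v
> option to export an environment variable into the job):
>
>     qsub -v OPENBLAS_NUM_THREADS=1 job.sh
>     # or, inside a PE job script, match the slots Grid Engine granted:
>     export OPENBLAS_NUM_THREADS=$NSLOTS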
>
> Rayson
>
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>
>
> On Fri, Jul 18, 2014 at 12:01 PM, Jin Yu <yujin2004 at gmail.com> wrote:
>
>> Rayson,
>>
>> Thank you for pointing out that ATLAS in the StarCluster AMI can create
>> multiple threads by default for matrix operations. I checked my running
>> jobs again more carefully using htop, and I found that each job created
>> exactly 32 threads! That is why I saw a load of ~100.
>>
>> I have qalter'ed all the waiting jobs with a hard limit of one core per
>> job. I will report back if it does not work.
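>>
>> For the record, roughly what I ran (a sketch; core binding needs Grid
>> Engine 6.2u5 or later, and the job ID is illustrative):
>>
>>     qalter -binding linear:1 1234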
>>
>> Thanks!
>> Jin
>>
>>
>> On Fri, Jul 18, 2014 at 10:04 AM, Jin Yu <yujin2004 at gmail.com> wrote:
>>
>>> Ha, I see. That is probably the cause, and I want to validate it. Do you
>>> know how to configure a job in SGE to force it to use only one core per
>>> job?
>>>
>>> Thanks!
>>> Jin
>>>
>>>
>>> On Fri, Jul 18, 2014 at 12:09 AM, Rayson Ho <raysonlogin at gmail.com>
>>> wrote:
>>>
>>>> The BLAS library in the StarCluster AMI is ATLAS, which takes advantage
>>>> of multicore machines by running each BLAS call using multiple threads.
>>>>
>>>> I googled and found that SVD with ATLAS does use more than 1 core:
>>>> http://blog.revolutionanalytics.com/2010/06/performance-benefits-of-multithreaded-r.html
>>>>
>>>> This can explain the behavior you are getting...
>>>>
>>>> Rayson
>>>>
>>>> ==================================================
>>>> Open Grid Scheduler - The Official Open Source Grid Engine
>>>> http://gridscheduler.sourceforge.net/
>>>> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>>>>
>>>>
>>>> On Thu, Jul 17, 2014 at 10:03 PM, Jin Yu <yujin2004 at gmail.com> wrote:
>>>>
>>>>> Hi Rayson,
>>>>>
>>>>> My local VirtualBox VM also has more than one core, but the CPU usage
>>>>> never goes above 100% when I run the code locally. I also checked it
>>>>> using htop; it has only one thread.
>>>>>
>>>>> I don't use any multicore or parallel packages in my R code, but I do
>>>>> have a couple of svd calls and exception handlers.
>>>>>
>>>>> Thanks!
>>>>> Jin
>>>>>
>>>>>
>>>>> On Thu, Jul 17, 2014 at 5:58 PM, Rayson Ho <raysonlogin at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> What R program or package are you using? There are 32 virtual cores
>>>>>> on the c3.8xlarge, so if your code creates 1 thread per virtual core
>>>>>> by default, there will be 3200% CPU usage on the c3.8xlarge. Also,
>>>>>> maybe your local machine has far fewer cores, and that's why this is
>>>>>> not happening there?
>>>>>>
>>>>>> Rayson
>>>>>>
>>>>>> ==================================================
>>>>>> Open Grid Scheduler - The Official Open Source Grid Engine
>>>>>> http://gridscheduler.sourceforge.net/
>>>>>> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 17, 2014 at 6:06 PM, Jin Yu <yujin2004 at gmail.com> wrote:
>>>>>>
>>>>>>> Here is a follow-up on my investigation of the unusually high
>>>>>>> CPU/core usage on the EC2 instances.
>>>>>>>
>>>>>>> In my last post, I reported two observations: 1. unusually high
>>>>>>> CPU/core usage by an R process on the EC2 instances, even though it
>>>>>>> is designed to use one core and does so on my local machine; and 2.
>>>>>>> an unusually high percentage of kernel time in the CPU usage.
>>>>>>>
>>>>>>> I looked more closely at the R processes using htop and found that
>>>>>>> many threads were created in each of them, and that each thread
>>>>>>> makes tons of sched_yield() system calls.
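>>>>>>>
>>>>>>> In case anyone wants to reproduce the check (a sketch; the PID is
>>>>>>> illustrative):
>>>>>>>
>>>>>>>     ls /proc/1234/task | wc -l              # count the threads
>>>>>>>     strace -f -p 1234 -e trace=sched_yield  # watch the yields live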
>>>>>>>
>>>>>>> Do these symptoms with StarCluster on EC2 ring a bell for anyone?
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Jin
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>