[StarCluster] CG1 plus StarCluster Questions

Thu May 24 11:46:40 EDT 2012

There is also an SGE way of handling it: use the prolog & epilog to put the GPUs in exclusive mode and to set the device permissions.

1. With this setup, two queue instances are needed per CG1 host. A queue instance is a container for job execution (it is not itself a queue), and each one logically owns one GPU board.

2. SGE attaches an extra, external GID to each job's processes for identification (this is how we find out which process belongs to which job), so in the prolog we can set the /dev/nvidiaX device permission to allow only processes in that GID to use the device.

3. In the epilog, reset the permission.

This method was contributed by William Hay on the Grid Engine mailing list.
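
The three steps above could be sketched roughly as follows. This is only a minimal sketch and not William Hay's actual scripts: the function names grant_gpu and release_gpu are hypothetical, the reset mode of 0666 is an assumption, and it assumes the prolog/epilog run with enough privilege to change the device node and that the job's extra GID has already been looked up (how to look it up depends on the Grid Engine setup and is not shown).

```python
import os
import stat


def grant_gpu(device_path, job_gid):
    """Prolog step: restrict the device node to processes in the job's GID."""
    # Leave the owner unchanged (-1) and hand the group to the job's extra GID.
    os.chown(device_path, -1, job_gid)
    # Owner and group may read/write; everyone else is locked out.
    os.chmod(device_path, 0o660)


def release_gpu(device_path, default_mode=0o666):
    """Epilog step: reset the permission so the next job starts clean."""
    # 0o666 is an assumed default; use whatever mode the node normally has.
    os.chmod(device_path, default_mode)


def mode_of(path):
    """Return only the permission bits of a path."""
    return stat.S_IMODE(os.stat(path).st_mode)
```

In a real prolog, device_path would be the /dev/nvidiaX node of the board owned by that queue instance.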

 -Ron

----- Original Message -----
From: Rayson Ho <raysonlogin at gmail.com>
To: Scott Le Grand <varelse2005 at gmail.com>
Cc: "starcluster at mit.edu" <starcluster at mit.edu>
Sent: Thursday, May 24, 2012 12:11 AM
Subject: Re: [StarCluster] CG1 plus StarCluster Questions

Hi Scott,

We just rely on the internal resource accounting of Grid Engine, i.e.
GPU jobs need to explicitly request GPU devices when the users
qsub the job (e.g. qsub -l gpu=1 JobScript.sh), and then Grid Engine
will make sure that at most 2 such jobs are scheduled on a host with
2 GPU devices.
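
To illustrate why this accounting does not depend on when a job first touches the GPU, here is a toy model of a consumable resource. It is not Grid Engine code, and the class and method names are made up for the example. What it shows is that the GPU is debited when the job is dispatched, not when the device becomes busy.

```python
class GpuHost:
    """Toy model of one host with a consumable 'gpu' resource."""

    def __init__(self, name, gpus):
        self.name = name
        self.free = gpus      # consumable capacity still available
        self.jobs = {}        # job id -> number of GPUs held

    def dispatch(self, job_id, gpus=1):
        """Try to place a job that requested `gpus` (like -l gpu=N)."""
        if gpus > self.free:
            return False      # request cannot be met; the job has to wait
        self.free -= gpus     # debit at dispatch time
        self.jobs[job_id] = gpus
        return True

    def finish(self, job_id):
        """Credit the resource back when the job ends."""
        self.free += self.jobs.pop(job_id)


host = GpuHost("cg1-node", gpus=2)
results = [host.dispatch(j) for j in ("job1", "job2", "job3")]
print(results)                 # [True, True, False]
host.finish("job1")
print(host.dispatch("job3"))   # True
```

A job that is still starting up already holds its GPU in this model, so a third job cannot slip in while the device looks idle.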

We also have the cgroups integration in Open Grid Scheduler/Grid
Engine 2011.11 update 1 (i.e. OGS/GE 2011.11u1), but we are not yet
taking advantage of the device whitelisting controller:

http://blogs.scalablelogic.com/2012/05/grid-engine-cgroups-integration.html

http://www.kernel.org/doc/Documentation/cgroups/devices.txt
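
For reference, the device whitelist controller described in the kernel document above takes rules of the form "type major:minor access". The sketch below only builds such a rule string from a device node; it does not write to any cgroup, and the helper name whitelist_rule is hypothetical. For a GPU the path would be a /dev/nvidiaX node; /dev/null is used here only because it exists on any Linux host.

```python
import os
import stat


def whitelist_rule(device_path, access="rwm"):
    """Build a devices.allow / devices.deny rule for one device node."""
    st = os.stat(device_path)
    if stat.S_ISCHR(st.st_mode):
        kind = "c"            # character device
    elif stat.S_ISBLK(st.st_mode):
        kind = "b"            # block device
    else:
        raise ValueError(f"{device_path} is not a device node")
    major, minor = os.major(st.st_rdev), os.minor(st.st_rdev)
    return f"{kind} {major}:{minor} {access}"


print(whitelist_rule("/dev/null"))
```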

Rayson

================================
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/

On Thu, May 24, 2012 at 12:04 AM, Scott Le Grand <varelse2005 at gmail.com> wrote:
> This is really cool but it raises a question: how does the sensor avoid race
> conditions for executables that don't immediately grab the GPU?
>
>
> On Wed, May 23, 2012 at 8:46 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
>>
>> Hi Justin & Scott,
>>
>> I played with a spot CG1 instance last week, and I was able to use
>> both NVML & OpenCL APIs to pull information from the GPU devices.
>>
>> Open Grid Scheduler's GPU load sensor (which uses the NVML API) is for
>> monitoring the health of GPU devices, and it is very similar to the
>> GPU monitoring product from Bright Computing - but note that Bright
>> has a very nice GUI (and we are not planning to compete against
>> Bright, so most likely the Open Grid Scheduler project will not try to
>> implement a GUI front-end for our GPU load sensor).
>>
>>
>> http://www.brightcomputing.com/NVIDIA-GPU-Cluster-Management-Monitoring.php
>>
>>
>> Note that with the information from the GPU load sensor, we can use
>> the StarCluster load balancer to shut down nodes that have unhealthy
>> GPUs, i.e. GPUs that are too hot, have too many ECC errors, etc.
>>
>> However, we don't put the GPUs in exclusive mode yet, which
>> is what Scott has already done. It is on our ToDo list, and with it
>> the Open Grid Scheduler/Grid Engine-GPU integration will be more
>> complete.
>>
>>
>> Anyway, here's how to compile & run the gpu load sensor:
>>
>>  * URL:
>> https://gridscheduler.svn.sourceforge.net/svnroot/gridscheduler/trunk/source/dist/gpu/gpu_sensor.c
>>
>>  * you can compile it in STANDALONE mode by adding -DSTANDALONE
>>
>>  * if you don't run it in standalone mode, you can press ENTER to
>> simulate the internal Grid Engine load sensor environment
>>
>> % cc gpu_sensor.c -o gpu_ls -I/usr/local/cuda/CUDAToolsSDK/NVML/ \
>>      -L/usr/lib/nvidia-current/ -lnvidia-ml
>> % ./gpu_ls
>>
>> begin
>> ip-10-16-21-185:gpu.0.name:Tesla M2050
>> ip-10-16-21-185:gpu.0.busId:0000:00:03.0
>> ip-10-16-21-185:gpu.0.fanspeed:0
>> ip-10-16-21-185:gpu.0.clockspeed:270
>> ip-10-16-21-185:gpu.0.memfree:2811613184
>> ip-10-16-21-185:gpu.0.memused:6369280
>> ip-10-16-21-185:gpu.0.memtotal:2817982464
>> ip-10-16-21-185:gpu.0.utilgpu:0
>> ip-10-16-21-185:gpu.0.utilmem:0
>> ip-10-16-21-185:gpu.0.sbiteccerror:0
>> ip-10-16-21-185:gpu.0.dbiteccerror:0
>> ip-10-16-21-185:gpu.1.name:Tesla M2050
>> ip-10-16-21-185:gpu.1.busId:0000:00:04.0
>> ip-10-16-21-185:gpu.1.fanspeed:0
>> ip-10-16-21-185:gpu.1.clockspeed:270
>> ip-10-16-21-185:gpu.1.memfree:2811613184
>> ip-10-16-21-185:gpu.1.memused:6369280
>> ip-10-16-21-185:gpu.1.memtotal:2817982464
>> ip-10-16-21-185:gpu.1.utilgpu:0
>> ip-10-16-21-185:gpu.1.utilmem:0
>> ip-10-16-21-185:gpu.1.prevhrsbiteccerror:0
>> ip-10-16-21-185:gpu.1.prevhrdbiteccerror:0
>> ip-10-16-21-185:gpu.1.sbiteccerror:0
>> ip-10-16-21-185:gpu.1.dbiteccerror:0
>> end
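
The report above follows a simple layout: a "begin" line, one host:name:value line per metric, and an "end" line. A minimal sketch of a parser for it, assuming exactly that layout (the function name parse_report is made up here, and values are kept as strings):

```python
def parse_report(text):
    """Parse one begin/end load sensor report into {host: {metric: value}}."""
    report = {}
    inside = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "begin":
            inside = True
        elif line == "end":
            break
        elif inside and line:
            # Split on the first two colons only: values such as a PCI
            # bus id (0000:00:03.0) contain colons themselves.
            host, metric, value = line.split(":", 2)
            report.setdefault(host, {})[metric] = value
    return report


sample = """begin
ip-10-16-21-185:gpu.0.name:Tesla M2050
ip-10-16-21-185:gpu.0.busId:0000:00:03.0
ip-10-16-21-185:gpu.0.memfree:2811613184
end
"""
metrics = parse_report(sample)["ip-10-16-21-185"]
print(metrics["gpu.0.busId"])   # 0000:00:03.0
```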
>>
>> And we can also use NVIDIA's SMI:
>>
>> % nvidia-smi
>> Sun May 20 01:21:42 2012
>> +------------------------------------------------------+
>> | NVIDIA-SMI 2.290.10   Driver Version: 290.10         |
>> |-------------------------------+----------------------+----------------------+
>> | Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |
>> | Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |
>> |===============================+======================+======================|
>> | 0.  Tesla M2050               | 0000:00:03.0  Off    |         0          0 |
>> |  N/A    N/A  P1    Off /  Off |   0%    6MB / 2687MB |    0%     Default    |
>> |-------------------------------+----------------------+----------------------|
>> | 1.  Tesla M2050               | 0000:00:04.0  Off    |         0          0 |
>> |  N/A    N/A  P1    Off /  Off |   0%    6MB / 2687MB |    0%     Default    |
>> |-------------------------------+----------------------+----------------------|
>> | Compute processes:                                               GPU Memory |
>> |  GPU  PID     Process name                                       Usage      |
>> |=============================================================================|
>> |  No running compute processes found                                         |
>> +-----------------------------------------------------------------------------+
>>
>> And note that Ganglia also has a plugin for NVML:
>>
>> https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
>>
>>
>> And the complex setup is for internal accounting inside Grid Engine,
>> i.e. it tells GE how many GPU cards there are and how many are in use.
>> We can set up a consumable resource and let Grid Engine do the
>> accounting: we model GPUs like any other consumable resource
>> (e.g. software licenses, disk space), and we can then use the same
>> techniques to manage the GPUs:
>>
>> http://gridscheduler.sourceforge.net/howto/consumable.html
>> http://gridscheduler.sourceforge.net/howto/loadsensor.html
>>
>> Rayson
>>
>> ================================
>> Open Grid Scheduler / Grid Engine
>> http://gridscheduler.sourceforge.net/
>>
>> Scalable Grid Engine Support Program
>> http://www.scalablelogic.com/
>>
>>
>>
>> On Wed, May 23, 2012 at 4:31 PM, Justin Riley <jtriley at mit.edu> wrote:
>> > Just curious, were you able to get the GPU consumable resource/load
>> > sensor to work with the SC HVM/GPU AMI? I will eventually experiment
>> > with this myself when I create new 12.X AMIs soon but it would be
>> > helpful to have a condensed step-by-step overview if you were able to
>> > get things working. No worries if not.
>> >
>> > Thanks!
>> >
>> > ~Justin
>>
>>
>>
>> --
>> ==================================================
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> http://gridscheduler.sourceforge.net/
>
>


_______________________________________________
StarCluster mailing list
StarCluster at mit.edu
http://mailman.mit.edu/mailman/listinfo/starcluster