[StarCluster] CG1 plus StarCluster Questions

Rayson Ho raysonlogin at gmail.com
Wed May 23 23:46:34 EDT 2012


Hi Justin & Scott,

I played with a spot CG1 instance last week, and I was able to use
both NVML & OpenCL APIs to pull information from the GPU devices.

Open Grid Scheduler's GPU load sensor (which uses the NVML API) is for
monitoring the health of GPU devices, and it is very similar to the
GPU monitoring product from Bright Computing - but note that Bright
has a very nice GUI (and we are not planning to compete against
Bright, so most likely the Open Grid Scheduler project will not try to
implement a GUI front-end for our GPU load sensor).

http://www.brightcomputing.com/NVIDIA-GPU-Cluster-Management-Monitoring.php


Note that with the information from the GPU load sensor, we can use
the StarCluster load balancer to shut down nodes that have unhealthy
GPUs - i.e. GPUs that are too hot, have too many ECC errors, etc.

However, we don't put the GPUs in exclusive mode yet. Scott has
already done this, and it is on our ToDo list. Once that is in place,
the Open Grid Scheduler/Grid Engine GPU integration will be more
complete.


Anyway, here's how to compile & run the gpu load sensor:

 * URL: https://gridscheduler.svn.sourceforge.net/svnroot/gridscheduler/trunk/source/dist/gpu/gpu_sensor.c

 * you can compile it in STANDALONE mode by adding -DSTANDALONE

 * if you don't compile it in standalone mode, the sensor waits for
input on stdin, so you can press ENTER to simulate the internal Grid
Engine load sensor environment
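
For context on the ENTER key press: as described in the load sensor
howto, Grid Engine talks to a load sensor over stdin/stdout. It writes
a line to request a report, the sensor answers with host:name:value
lines between "begin" and "end", and the line "quit" tells the sensor
to exit. A minimal sketch of that loop follows. It is an illustration,
not a port of gpu_sensor.c: run_load_sensor, fake_gpu_values and the
host name node001 are made-up names, and the values are hard-coded
instead of coming from NVML.

```python
import io
import socket


def run_load_sensor(stdin, stdout, read_values, host=None):
    """Answer every request line with one begin/end report; stop on 'quit'."""
    host = host or socket.gethostname()
    for line in stdin:
        if line.strip() == "quit":
            break
        # Any other line (including an empty one, i.e. ENTER) requests a report.
        stdout.write("begin\n")
        for name, value in read_values():
            stdout.write(f"{host}:{name}:{value}\n")
        stdout.write("end\n")
        stdout.flush()


def fake_gpu_values():
    # Hypothetical stand-in for the NVML queries made by the real sensor.
    return [("gpu.0.name", "Tesla M2050"), ("gpu.0.utilgpu", 0)]


# Simulate one ENTER followed by "quit", capturing the reply in memory.
requests = io.StringIO("\nquit\n")
reply = io.StringIO()
run_load_sensor(requests, reply, fake_gpu_values, host="node001")
print(reply.getvalue(), end="")
```

Passing the streams in as arguments keeps the loop testable without a
running Grid Engine; a real sensor would be handed sys.stdin and
sys.stdout instead.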

% cc gpu_sensor.c -o gpu_ls -I/usr/local/cuda/CUDAToolsSDK/NVML/ \
    -L/usr/lib/nvidia-current/ -lnvidia-ml
% ./gpu_ls

begin
ip-10-16-21-185:gpu.0.name:Tesla M2050
ip-10-16-21-185:gpu.0.busId:0000:00:03.0
ip-10-16-21-185:gpu.0.fanspeed:0
ip-10-16-21-185:gpu.0.clockspeed:270
ip-10-16-21-185:gpu.0.memfree:2811613184
ip-10-16-21-185:gpu.0.memused:6369280
ip-10-16-21-185:gpu.0.memtotal:2817982464
ip-10-16-21-185:gpu.0.utilgpu:0
ip-10-16-21-185:gpu.0.utilmem:0
ip-10-16-21-185:gpu.0.sbiteccerror:0
ip-10-16-21-185:gpu.0.dbiteccerror:0
ip-10-16-21-185:gpu.1.name:Tesla M2050
ip-10-16-21-185:gpu.1.busId:0000:00:04.0
ip-10-16-21-185:gpu.1.fanspeed:0
ip-10-16-21-185:gpu.1.clockspeed:270
ip-10-16-21-185:gpu.1.memfree:2811613184
ip-10-16-21-185:gpu.1.memused:6369280
ip-10-16-21-185:gpu.1.memtotal:2817982464
ip-10-16-21-185:gpu.1.utilgpu:0
ip-10-16-21-185:gpu.1.utilmem:0
ip-10-16-21-185:gpu.1.prevhrsbiteccerror:0
ip-10-16-21-185:gpu.1.prevhrdbiteccerror:0
ip-10-16-21-185:gpu.1.sbiteccerror:0
ip-10-16-21-185:gpu.1.dbiteccerror:0
end
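
To act on a report like the one above (for example, to decide which
nodes to shut down), the lines first have to be parsed. Here is a
rough sketch, assuming the host:gpu.N.metric:value layout shown above.
The function names and the threshold are made up for illustration, and
the sample is a shortened copy of the report with one value altered
(marked below) so that the check has something to flag. A temperature
check would work the same way, but no temperature metric appears in
the output above.

```python
from collections import defaultdict

# Shortened copy of the report above. The last dbiteccerror value was
# changed from 0 to 3 purely to demonstrate the check (hypothetical data).
SAMPLE_REPORT = """\
begin
ip-10-16-21-185:gpu.0.name:Tesla M2050
ip-10-16-21-185:gpu.0.busId:0000:00:03.0
ip-10-16-21-185:gpu.0.sbiteccerror:0
ip-10-16-21-185:gpu.0.dbiteccerror:0
ip-10-16-21-185:gpu.1.name:Tesla M2050
ip-10-16-21-185:gpu.1.busId:0000:00:04.0
ip-10-16-21-185:gpu.1.sbiteccerror:0
ip-10-16-21-185:gpu.1.dbiteccerror:3
end
"""


def parse_report(text):
    """Return {(host, gpu_index): {metric: value}} for all gpu.N.* lines."""
    gpus = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line or line in ("begin", "end"):
            continue
        # Split on the first two colons only: values such as the bus id
        # (0000:00:03.0) contain colons themselves.
        host, name, value = line.split(":", 2)
        prefix, index, metric = name.split(".", 2)
        if prefix != "gpu":
            continue
        gpus[(host, int(index))][metric] = value
    return dict(gpus)


def unhealthy_gpus(gpus, max_dbit_ecc=0):
    """List (host, gpu_index) pairs whose double-bit ECC count exceeds the limit."""
    bad = []
    for key, metrics in sorted(gpus.items()):
        if int(metrics.get("dbiteccerror", 0)) > max_dbit_ecc:
            bad.append(key)
    return bad


print(unhealthy_gpus(parse_report(SAMPLE_REPORT)))
# prints [('ip-10-16-21-185', 1)]
```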

And we can also use NVIDIA's SMI tool (nvidia-smi):

% nvidia-smi
Sun May 20 01:21:42 2012
+------------------------------------------------------+
| NVIDIA-SMI 2.290.10   Driver Version: 290.10         |
|-------------------------------+----------------------+----------------------+
| Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |
| Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |
|===============================+======================+======================|
| 0.  Tesla M2050               | 0000:00:03.0  Off    |         0          0 |
|  N/A    N/A  P1    Off /  Off |   0%    6MB / 2687MB |    0%     Default    |
|-------------------------------+----------------------+----------------------|
| 1.  Tesla M2050               | 0000:00:04.0  Off    |         0          0 |
|  N/A    N/A  P1    Off /  Off |   0%    6MB / 2687MB |    0%     Default    |
|-------------------------------+----------------------+----------------------|
| Compute processes:                                               GPU Memory |
|  GPU  PID     Process name                                       Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+

And note that Ganglia also has a plugin for NVML:

https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia


The complex setup is for internal accounting inside Grid Engine, i.e.
it tells GE how many GPU cards there are and how many are in use. We
can set up a consumable resource and let Grid Engine do the
accounting: we model GPUs like any other consumable resource (e.g.
software licenses, disk space), and we can then use the same
techniques to manage the GPUs:

http://gridscheduler.sourceforge.net/howto/consumable.html
http://gridscheduler.sourceforge.net/howto/loadsensor.html
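
To make the accounting idea concrete, here is a toy model of what a
consumable resource does: a host advertises a fixed capacity, each job
that starts takes its requested amount, and a job that asks for more
than is free has to wait. This only illustrates the bookkeeping; it is
not how Grid Engine is implemented, and the class name, method names
and job ids are made up. The capacity of 2 matches the two Tesla M2050
cards seen on the CG1 instance.

```python
class ConsumablePool:
    """Toy bookkeeping for one consumable resource on one host."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_use = {}  # job id -> amount currently held

    @property
    def free(self):
        return self.capacity - sum(self.in_use.values())

    def acquire(self, job_id, amount=1):
        """Reserve `amount` units for a job; return False if not enough is free."""
        if job_id in self.in_use:
            raise ValueError(f"job {job_id!r} already holds this resource")
        if amount > self.free:
            return False
        self.in_use[job_id] = amount
        return True

    def release(self, job_id):
        """Give back whatever the job was holding (no-op for unknown jobs)."""
        self.in_use.pop(job_id, None)


gpus = ConsumablePool(capacity=2)    # two GPUs on the node
print(gpus.acquire("job1"))          # True: one GPU taken, one free
print(gpus.acquire("job2", 2))       # False: needs two, only one free
print(gpus.acquire("job3"))          # True: node is now fully used
gpus.release("job1")
print(gpus.free)                     # 1
```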

Rayson

================================
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/



On Wed, May 23, 2012 at 4:31 PM, Justin Riley <jtriley at mit.edu> wrote:
> Just curious, were you able to get the GPU consumable resource/load
> sensor to work with the SC HVM/GPU AMI? I will eventually experiment
> with this myself when I create new 12.X AMIs soon but it would be
> helpful to have a condensed step-by-step overview if you were able to
> get things working. No worries if not.
>
> Thanks!
>
> ~Justin



-- 
==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/

