This is really cool but it raises a question: how does the sensor avoid race conditions for executables that don&#39;t immediately grab the GPU?<br><br><br><br><div class="gmail_quote">On Wed, May 23, 2012 at 8:46 PM, Rayson Ho <span dir="ltr">&lt;<a href="mailto:raysonlogin@gmail.com" target="_blank">raysonlogin@gmail.com</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Justin &amp; Scott,<br>

<br>

I played with a spot CG1 instance last week, and I was able to use<br>

both NVML &amp; OpenCL APIs to pull information from the GPU devices.<br>

<br>

Open Grid Scheduler&#39;s GPU load sensor (which uses the NVML API) is for<br>

monitoring the health of GPU devices, and it is very similar to the<br>

GPU monitoring product from Bright Computing - but note that Bright<br>

has a very nice GUI (and we are not planning to compete against<br>

Bright, so most likely the Open Grid Scheduler project will not try to<br>

implement a GUI front-end for our GPU load sensor).<br>

<br>

<a href="http://www.brightcomputing.com/NVIDIA-GPU-Cluster-Management-Monitoring.php" target="_blank">http://www.brightcomputing.com/NVIDIA-GPU-Cluster-Management-Monitoring.php</a><br>

<br>

<br>

Note that with the information from the GPU load sensor, we can use<br>

the StarCluster load balancer to shut down nodes that have unhealthy<br>

GPUs - i.e. GPUs that are too hot, have too many ECC errors, etc.<br>

<br>
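As a rough sketch of that health check (not StarCluster code): the function below flags GPUs by the ECC counters the load sensor reports as gpu.N.sbiteccerror and gpu.N.dbiteccerror. The function name and both thresholds are assumptions for illustration, and a temperature check would need a temperature metric as well.<br>

```python
def unhealthy_gpus(metrics, max_sbit_ecc=100, max_dbit_ecc=0):
    """Return the sorted GPU indices whose ECC counters exceed the limits.

    metrics maps sensor names such as "gpu.0.dbiteccerror" to their
    string values. Both thresholds are illustrative assumptions.
    """
    limits = {"sbiteccerror": max_sbit_ecc, "dbiteccerror": max_dbit_ecc}
    bad = set()
    for name, value in metrics.items():
        parts = name.split(".")
        # Only names of the form gpu.INDEX.FIELD are of interest here.
        if len(parts) != 3 or parts[0] != "gpu":
            continue
        _, index, field = parts
        if field in limits and int(value) > limits[field]:
            bad.add(int(index))
    return sorted(bad)
```

A load balancer could then remove any node for which this list is non-empty.<br>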

However, we don&#39;t put the GPUs in exclusive mode yet - that is<br>

what Scott has already done, and it is on our ToDo list. With that in<br>

place, the Open Grid Scheduler/Grid Engine-GPU integration will be more<br>

complete.<br>

<br>

<br>

Anyway, here&#39;s how to compile &amp; run the gpu load sensor:<br>

<br>

 * URL: <a href="https://gridscheduler.svn.sourceforge.net/svnroot/gridscheduler/trunk/source/dist/gpu/gpu_sensor.c" target="_blank">https://gridscheduler.svn.sourceforge.net/svnroot/gridscheduler/trunk/source/dist/gpu/gpu_sensor.c</a><br>


<br>

 * you can compile it in STANDALONE mode by adding -DSTANDALONE<br>

<br>

 * if you don&#39;t run it in standalone mode, you can press ENTER to<br>

simulate the internal Grid Engine load sensor environment<br>

<br>
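For reference, the ENTER handshake mentioned above follows the usual Grid Engine load sensor convention: the sensor waits for a line on stdin, answers with one begin/end block, and exits when it reads "quit". A minimal Python sketch of that loop, where run_sensor and the collect callback are hypothetical names (the real gpu_sensor.c gets its values from NVML instead):<br>

```python
import socket


def run_sensor(inp, out, collect, host=None):
    """Answer each input line with one load report until "quit" arrives.

    collect is a hypothetical callback returning (name, value) pairs;
    it stands in for the NVML queries of the real sensor.
    """
    host = host or socket.gethostname()
    for line in inp:
        if line.strip() == "quit":
            break
        out.write("begin\n")
        for name, value in collect():
            # One metric per line, in host:name:value form.
            out.write(f"{host}:{name}:{value}\n")
        out.write("end\n")
        out.flush()
```

Calling run_sensor(sys.stdin, sys.stdout, collect) from a terminal gives the same press-ENTER behaviour.<br>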

% cc -o gpu_ls gpu_sensor.c -I/usr/local/cuda/CUDAToolsSDK/NVML/ \<br>

    -L/usr/lib/nvidia-current/ -lnvidia-ml<br>

% ./gpu_ls<br>

<br>

begin<br>

ip-10-16-21-185:gpu.0.name:Tesla M2050<br>

ip-10-16-21-185:gpu.0.busId:0000:00:03.0<br>

ip-10-16-21-185:gpu.0.fanspeed:0<br>

ip-10-16-21-185:gpu.0.clockspeed:270<br>

ip-10-16-21-185:gpu.0.memfree:2811613184<br>

ip-10-16-21-185:gpu.0.memused:6369280<br>

ip-10-16-21-185:gpu.0.memtotal:2817982464<br>

ip-10-16-21-185:gpu.0.utilgpu:0<br>

ip-10-16-21-185:gpu.0.utilmem:0<br>

ip-10-16-21-185:gpu.0.sbiteccerror:0<br>

ip-10-16-21-185:gpu.0.dbiteccerror:0<br>

ip-10-16-21-185:gpu.1.name:Tesla M2050<br>

ip-10-16-21-185:gpu.1.busId:0000:00:04.0<br>

ip-10-16-21-185:gpu.1.fanspeed:0<br>

ip-10-16-21-185:gpu.1.clockspeed:270<br>

ip-10-16-21-185:gpu.1.memfree:2811613184<br>

ip-10-16-21-185:gpu.1.memused:6369280<br>

ip-10-16-21-185:gpu.1.memtotal:2817982464<br>

ip-10-16-21-185:gpu.1.utilgpu:0<br>

ip-10-16-21-185:gpu.1.utilmem:0<br>

ip-10-16-21-185:gpu.1.prevhrsbiteccerror:0<br>

ip-10-16-21-185:gpu.1.prevhrdbiteccerror:0<br>

ip-10-16-21-185:gpu.1.sbiteccerror:0<br>

ip-10-16-21-185:gpu.1.dbiteccerror:0<br>

end<br>

<br>
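If you want to consume that report from a script, here is a small parsing sketch. parse_report is a hypothetical helper, not part of Open Grid Scheduler; the one subtlety is that values such as the busId contain colons themselves, so the line is split on the first two colons only.<br>

```python
def parse_report(text):
    """Parse begin/end load reports into {host: {name: value}}."""
    report = {}
    inside = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "begin":
            inside = True
        elif line == "end":
            inside = False
        elif inside and line:
            # Split on the first two colons only, so a value like
            # "0000:00:03.0" stays in one piece.
            parts = line.split(":", 2)
            if len(parts) != 3:
                continue  # skip lines that are not host:name:value
            host, name, value = parts
            report.setdefault(host, {})[name] = value
    return report
```

All values are kept as strings; convert with int() where a number is needed.<br>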

And we can also use NVIDIA&#39;s SMI:<br>

<br>

% nvidia-smi<br>

Sun May 20 01:21:42 2012<br>

+------------------------------------------------------+<br>

| NVIDIA-SMI 2.290.10   Driver Version: 290.10         |<br>

|-------------------------------+----------------------+----------------------+<br>

| Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |<br>

| Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |<br>

|===============================+======================+======================|<br>

| 0.  Tesla M2050               | 0000:00:03.0  Off    |         0          0 |<br>

|  N/A    N/A  P1    Off /  Off |   0%    6MB / 2687MB |    0%     Default    |<br>

|-------------------------------+----------------------+----------------------|<br>

| 1.  Tesla M2050               | 0000:00:04.0  Off    |         0          0 |<br>

|  N/A    N/A  P1    Off /  Off |   0%    6MB / 2687MB |    0%     Default    |<br>

|-------------------------------+----------------------+----------------------|<br>

| Compute processes:                                               GPU Memory |<br>

|  GPU  PID     Process name                                       Usage      |<br>

|=============================================================================|<br>

|  No running compute processes found                                         |<br>

+-----------------------------------------------------------------------------+<br>

<br>

And note that Ganglia also has a plugin for NVML:<br>

<br>

<a href="https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia" target="_blank">https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia</a><br>

<br>

<br>

And the complex setup is for internal accounting inside Grid Engine -<br>

i.e. it tells GE how many GPU cards there are and how many are in use.<br>

We can set up a consumable resource and let Grid Engine do the<br>

accounting, i.e. we model GPUs like any other consumable resource<br>

(e.g. software licenses, disk space, etc.) and we can then use the same<br>

techniques to manage the GPUs:<br>

<br>

<a href="http://gridscheduler.sourceforge.net/howto/consumable.html" target="_blank">http://gridscheduler.sourceforge.net/howto/consumable.html</a><br>

<a href="http://gridscheduler.sourceforge.net/howto/loadsensor.html" target="_blank">http://gridscheduler.sourceforge.net/howto/loadsensor.html</a><br>
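To make the accounting idea concrete, here is a toy model of a consumable in Python. It is only an illustration of the bookkeeping, not Grid Engine&#39;s implementation; the class and method names are made up.<br>

```python
class ConsumablePool:
    """Toy model of a per-host consumable such as a GPU count."""

    def __init__(self, total):
        self.total = total
        self.in_use = {}  # job id -> amount held

    @property
    def free(self):
        return self.total - sum(self.in_use.values())

    def try_start(self, job_id, amount=1):
        """Book the resource for a job, or refuse if not enough is free."""
        if job_id in self.in_use or amount > self.free:
            return False
        self.in_use[job_id] = amount
        return True

    def finish(self, job_id):
        """Release whatever the job was holding."""
        self.in_use.pop(job_id, None)
```

With ConsumablePool(2) for a two-GPU node, a third one-GPU job is refused until an earlier job finishes. Note that the count is booked when the job is dispatched, not when the process actually opens the device.<br>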

<div class="im HOEnZb"><br>

Rayson<br>

<br>

================================<br>

Open Grid Scheduler / Grid Engine<br>

<a href="http://gridscheduler.sourceforge.net/" target="_blank">http://gridscheduler.sourceforge.net/</a><br>

<br>

Scalable Grid Engine Support Program<br>

<a href="http://www.scalablelogic.com/" target="_blank">http://www.scalablelogic.com/</a><br>

<br>

<br>

<br>

</div><div class="im HOEnZb">On Wed, May 23, 2012 at 4:31 PM, Justin Riley &lt;<a href="mailto:jtriley@mit.edu">jtriley@mit.edu</a>&gt; wrote:<br>

&gt; Just curious, were you able to get the GPU consumable resource/load<br>

&gt; sensor to work with the SC HVM/GPU AMI? I will eventually experiment<br>

&gt; with this myself when I create new 12.X AMIs soon but it would be<br>

&gt; helpful to have a condensed step-by-step overview if you were able to<br>

&gt; get things working. No worries if not.<br>

&gt;<br>

&gt; Thanks!<br>

&gt;<br>

&gt; ~Justin<br>

<br>

<br>

<br>

</div><div class="HOEnZb"><div class="h5">--<br>

==================================================<br>

Open Grid Scheduler - The Official Open Source Grid Engine<br>

<a href="http://gridscheduler.sourceforge.net/" target="_blank">http://gridscheduler.sourceforge.net/</a><br>

</div></div></blockquote></div><br>