[StarCluster] CG1 plus StarCluster Questions

Scott Le Grand varelse2005 at gmail.com
Thu May 24 00:04:58 EDT 2012


This is really cool but it raises a question: how does the sensor avoid
race conditions for executables that don't immediately grab the GPU?



On Wed, May 23, 2012 at 8:46 PM, Rayson Ho <raysonlogin at gmail.com> wrote:

> Hi Justin & Scott,
>
> I played with a spot CG1 instance last week, and I was able to use
> both NVML & OpenCL APIs to pull information from the GPU devices.
>
> Open Grid Scheduler's GPU load sensor (which uses the NVML API) is for
> monitoring the health of GPU devices, and it is very similar to the
> GPU monitoring product from Bright Computing - but note that Bright
> has a very nice GUI (and we are not planning to compete against
> Bright, so most likely the Open Grid Scheduler project will not try to
> implement a GUI front-end for our GPU load sensor).
>
> http://www.brightcomputing.com/NVIDIA-GPU-Cluster-Management-Monitoring.php
>
>
> Note that with the information from the GPU load sensor, we can use
> the StarCluster load balancer to shut down nodes that have unhealthy
> GPUs, i.e. GPUs that are too hot, have too many ECC errors, etc.
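
A minimal sketch of how such a health check could be expressed, assuming the
sensor values have already been collected into one dictionary per GPU. The
function name `unhealthy_gpus` and both thresholds are hypothetical and are not
part of Open Grid Scheduler or StarCluster; only the ECC counter names are
taken from the sensor output. A temperature check could be added the same way
if the sensor reports one.

```python
def unhealthy_gpus(metrics, max_sbit_ecc=100, max_dbit_ecc=0):
    """Return [(gpu_index, [reasons])] for GPUs that exceed the thresholds.

    metrics: {gpu_index: {"sbiteccerror": int, "dbiteccerror": int}}
    Both thresholds are illustrative defaults, not recommended values.
    """
    flagged = []
    for index in sorted(metrics):
        values = metrics[index]
        reasons = []
        # Missing counters are treated as zero errors.
        if values.get("sbiteccerror", 0) > max_sbit_ecc:
            reasons.append("single-bit ECC errors")
        if values.get("dbiteccerror", 0) > max_dbit_ecc:
            reasons.append("double-bit ECC errors")
        if reasons:
            flagged.append((index, reasons))
    return flagged
```

A node whose result list is non-empty would be a candidate for removal by
whatever policy the load balancer applies.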
>
> However, we don't yet put the GPUs in exclusive mode, which is what
> Scott has already done. It is on our ToDo list, and it would make the
> Open Grid Scheduler/Grid Engine GPU integration more complete.
>
>
> Anyway, here's how to compile & run the gpu load sensor:
>
>  * URL:
> https://gridscheduler.svn.sourceforge.net/svnroot/gridscheduler/trunk/source/dist/gpu/gpu_sensor.c
>
>  * you can compile it in STANDALONE mode by adding -DSTANDALONE
>
>  * if you don't run it in standalone mode, you can press ENTER to
> simulate the internal Grid Engine load sensor environment
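
The request/response cycle that pressing ENTER simulates can be sketched as
follows. This is an illustration in Python, not the code in gpu_sensor.c;
`run_load_sensor` and `collect` are hypothetical names. It assumes the usual
Grid Engine load sensor convention: a newline on stdin requests one report,
the line `quit` (or end of input) stops the sensor, and each report is wrapped
in `begin` and `end`.

```python
import socket
import sys


def run_load_sensor(collect, stdin=sys.stdin, stdout=sys.stdout, host=None):
    """Answer each request line on stdin with one begin/end report.

    collect: callable returning an iterable of (name, value) pairs,
             e.g. [("gpu.0.utilgpu", 0)].
    """
    host = host or socket.gethostname()
    while True:
        line = stdin.readline()
        # An empty string means EOF; "quit" is the explicit stop request.
        if line == "" or line.strip() == "quit":
            break
        stdout.write("begin\n")
        for name, value in collect():
            stdout.write(f"{host}:{name}:{value}\n")
        stdout.write("end\n")
        # Flush so the caller sees the report before sending the next request.
        stdout.flush()
```

Calling `run_load_sensor(lambda: [("gpu.0.utilgpu", 0)])` from a terminal and
pressing ENTER prints one report per keypress, until `quit` is typed.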
>
> % cc gpu_sensor.c -o gpu_ls -I/usr/local/cuda/CUDAToolsSDK/NVML/ -L/usr/lib/nvidia-current/ -lnvidia-ml
> % ./gpu_ls
>
> begin
> ip-10-16-21-185:gpu.0.name:Tesla M2050
> ip-10-16-21-185:gpu.0.busId:0000:00:03.0
> ip-10-16-21-185:gpu.0.fanspeed:0
> ip-10-16-21-185:gpu.0.clockspeed:270
> ip-10-16-21-185:gpu.0.memfree:2811613184
> ip-10-16-21-185:gpu.0.memused:6369280
> ip-10-16-21-185:gpu.0.memtotal:2817982464
> ip-10-16-21-185:gpu.0.utilgpu:0
> ip-10-16-21-185:gpu.0.utilmem:0
> ip-10-16-21-185:gpu.0.sbiteccerror:0
> ip-10-16-21-185:gpu.0.dbiteccerror:0
> ip-10-16-21-185:gpu.1.name:Tesla M2050
> ip-10-16-21-185:gpu.1.busId:0000:00:04.0
> ip-10-16-21-185:gpu.1.fanspeed:0
> ip-10-16-21-185:gpu.1.clockspeed:270
> ip-10-16-21-185:gpu.1.memfree:2811613184
> ip-10-16-21-185:gpu.1.memused:6369280
> ip-10-16-21-185:gpu.1.memtotal:2817982464
> ip-10-16-21-185:gpu.1.utilgpu:0
> ip-10-16-21-185:gpu.1.utilmem:0
> ip-10-16-21-185:gpu.1.prevhrsbiteccerror:0
> ip-10-16-21-185:gpu.1.prevhrdbiteccerror:0
> ip-10-16-21-185:gpu.1.sbiteccerror:0
> ip-10-16-21-185:gpu.1.dbiteccerror:0
> end
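
Each report line above has the form `host:gpu.N.key:value`. A small sketch of
how such a report could be turned into a per-GPU dictionary for further
processing; `parse_sensor_report` is a hypothetical helper, not something
shipped with Open Grid Scheduler. Note that the value itself may contain
colons (the PCI bus ID), so only the first two colons are used as separators.

```python
from collections import defaultdict


def parse_sensor_report(text):
    """Parse load sensor output into {(host, gpu_index): {key: value}}."""
    gpus = defaultdict(dict)
    for raw in text.splitlines():
        line = raw.strip()
        # Skip the begin/end markers, blank lines, and malformed lines.
        if not line or line in ("begin", "end") or line.count(":") < 2:
            continue
        host, metric, value = line.split(":", 2)
        parts = metric.split(".", 2)
        if len(parts) != 3 or parts[0] != "gpu" or not parts[1].isdigit():
            continue
        _, index, key = parts
        # Purely numeric values become ints; names and bus IDs stay strings.
        gpus[(host, int(index))][key] = int(value) if value.isdigit() else value
    return dict(gpus)
```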
>
> And we can also use NVidia's SMI:
>
> % nvidia-smi
> Sun May 20 01:21:42 2012
> +------------------------------------------------------+
> | NVIDIA-SMI 2.290.10   Driver Version: 290.10         |
> |-------------------------------+----------------------+----------------------+
> | Nb.  Name                     | Bus Id        Disp.  | Volatile ECC SB / DB |
> | Fan   Temp   Power Usage /Cap | Memory Usage         | GPU Util. Compute M. |
> |===============================+======================+======================|
> | 0.  Tesla M2050               | 0000:00:03.0  Off    |         0          0 |
> |  N/A    N/A  P1    Off /  Off |   0%    6MB / 2687MB |    0%     Default    |
> |-------------------------------+----------------------+----------------------|
> | 1.  Tesla M2050               | 0000:00:04.0  Off    |         0          0 |
> |  N/A    N/A  P1    Off /  Off |   0%    6MB / 2687MB |    0%     Default    |
> |-------------------------------+----------------------+----------------------|
> | Compute processes:                                               GPU Memory |
> |  GPU  PID     Process name                                       Usage      |
> |=============================================================================|
> |  No running compute processes found                                         |
> +-----------------------------------------------------------------------------+
>
> And note that Ganglia also has a plugin for NVML:
>
> https://github.com/ganglia/gmond_python_modules/tree/master/gpu/nvidia
>
>
> And the complex setup is for internal accounting inside Grid Engine,
> i.e. it tells GE how many GPU cards there are and how many are in use.
> We can set up a consumable resource and let Grid Engine do the
> accounting: we model GPUs like any other consumable resource (e.g.
> software licenses, disk space), and we can then use the same
> techniques to manage the GPUs:
>
> http://gridscheduler.sourceforge.net/howto/consumable.html
> http://gridscheduler.sourceforge.net/howto/loadsensor.html
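
The accounting idea behind a consumable resource can be illustrated with a toy
model. This is only a sketch of the bookkeeping, not Grid Engine's
implementation; the class `Consumable` and its methods are hypothetical names.
A fixed total is configured per host, each job reserves some amount when it is
dispatched, and the amount is returned when the job finishes.

```python
class Consumable:
    """Toy counter for a consumable resource such as GPUs on one host."""

    def __init__(self, total):
        self.total = total
        self.in_use = 0

    @property
    def available(self):
        return self.total - self.in_use

    def try_reserve(self, amount=1):
        """Reserve `amount` units; return False if not enough are free."""
        if amount > self.available:
            return False
        self.in_use += amount
        return True

    def release(self, amount=1):
        """Return units when a job ends, never dropping below zero in use."""
        self.in_use = max(0, self.in_use - amount)
```

In this toy model the count changes at reservation time, so it does not
depend on when a program actually starts using the device.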
>
> Rayson
>
> ================================
> Open Grid Scheduler / Grid Engine
> http://gridscheduler.sourceforge.net/
>
> Scalable Grid Engine Support Program
> http://www.scalablelogic.com/
>
>
>
> On Wed, May 23, 2012 at 4:31 PM, Justin Riley <jtriley at mit.edu> wrote:
> > Just curious, were you able to get the GPU consumable resource/load
> > sensor to work with the SC HVM/GPU AMI? I will eventually experiment
> > with this myself when I create new 12.X AMIs soon but it would be
> > helpful to have a condensed step-by-step overview if you were able to
> > get things working. No worries if not.
> >
> > Thanks!
> >
> > ~Justin
>
>
>
> --
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
>