[Starcluster] Multiple MPI jobs on SunGrid Engine with StarCluster

Justin Riley jtriley at MIT.EDU
Mon Jun 21 12:46:17 EDT 2010


Hi Damian,

So OpenMPI and Sun Grid Engine have fairly decent integration support.

StarCluster by default sets up a parallel environment, called "orte",
that has been configured for OpenMPI integration within SGE and has a
number of slots equal to the total number of processors in the cluster.
You can inspect the SGE parallel environment by running:

myuser@ip-10-194-13-219:~$ qconf -sp orte
pe_name            orte
slots              16
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

This is the default configuration for a two-node, c1.xlarge cluster (16
virtual cores).

Notice the "allocation_rule" setting. This defines how SGE assigns slots
to a job. By default StarCluster configures "round_robin" allocation,
which means that if a job requests 8 slots, for example, SGE will grab a
single slot on the first machine if one is available, move to the next
machine and grab a single slot there, and so on, wrapping around the
cluster again if necessary.

You can also configure the parallel environment to try to localize
slots as much as possible using the "fill_up" allocation rule. With this
rule, if a job requests 8 slots and a single machine has 8 slots
available, that job will run entirely on one machine. If only 5 slots
are available on one host and 3 on another, the job will take all 5
slots on the first host and the remaining 3 on the other.
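
If you want to double-check where SGE actually placed the slots for a
given job, one quick way (a sketch; the job script name is just an
example) is to submit a job against the "orte" parallel environment and
then list every granted slot along with the host it landed on:

    $ qsub -pe orte 8 ./myjobscript.sh experiment-1
    $ qstat -g t    # one line per granted slot, showing the queue instance (host) it was assigned to

With $round_robin the 8 slots should be spread across the hosts; with
$fill_up they should pack onto as few hosts as possible.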

You can switch between round_robin and fill_up modes by using the
following command:

$ qconf -mp orte

This will open up vi (or whatever editor is set in the EDITOR
environment variable) and let you edit the parallel environment
settings. So to change from round_robin to fill_up in the above
example, change the allocation_rule line so the configuration reads:

myuser@ip-10-194-13-219:~$ qconf -sp orte
pe_name            orte
slots              16
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
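
If you'd rather script this change than edit it interactively, qconf can
also load a parallel environment definition from a file. A rough sketch
(assuming you have SGE manager privileges, e.g. as root on the master):

    $ qconf -sp orte > /tmp/orte.pe                                    # dump the current PE definition
    $ sed -i 's/^allocation_rule.*/allocation_rule    $fill_up/' /tmp/orte.pe
    $ qconf -Mp /tmp/orte.pe                                           # reload the modified definition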

So the above explains the parallel environment setup within SGE. It
turns out that if you're using a parallel environment with OpenMPI, you
do not have to pass --byslot/--bynode/-np/-host/etc. options to mpirun,
because SGE handles the round_robin/fill_up placement for you and
automatically tells OpenMPI which hosts and how many processors to use.
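
Under the hood, SGE exports variables such as NSLOTS and PE_HOSTFILE
into the job's environment, and OpenMPI's SGE support reads the
allocation from them. If you want to sanity-check what a job was
actually granted, you can print them at the top of your job script:

    # inside myjobscript.sh, before calling mpirun
    echo "slots granted: $NSLOTS"
    cat $PE_HOSTFILE    # one line per host: hostname, slots granted, queue, processor range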

So, for your use case I would change the commands as follows:
- -----------------------------------------------------------------
    qsub -pe orte 24 ./myjobscript.sh experiment-1
    qsub -pe orte 24 ./myjobscript.sh experiment-2
    qsub -pe orte 24 ./myjobscript.sh experiment-3
    ...
    qsub -pe orte 24 ./myjobscript.sh experiment-100

where ./myjobscript.sh calls mpirun as follows

    mpirun -x PYTHONPATH=/data/prefix/lib/python2.6/site-packages \
           -wd /data/experiments ./myprogram $1
- -----------------------------------------------------------------
NOTE: You can also pass -wd to the qsub command instead of to mpirun,
and along the same lines I believe you can pass the -v option to qsub
rather than -x to mpirun. Neither of these should make a difference; it
just shifts where the -x/-wd concern lives (from MPI to SGE).
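
For reference, here's a sketch of what ./myjobscript.sh could look like
if you move those settings into SGE job directives instead (the paths
are just the ones from your example; adjust as needed):

    #!/bin/bash
    # hypothetical myjobscript.sh -- the "#$" lines are SGE directives,
    # equivalent to passing the same flags to qsub on the command line
    #$ -pe orte 24
    #$ -wd /data/experiments
    #$ -v PYTHONPATH=/data/prefix/lib/python2.6/site-packages

    # with the SGE/OpenMPI integration, mpirun picks up the host list and
    # slot count from the parallel environment, so no -np/-host is needed
    mpirun ./myprogram $1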

I will add a section to the docs about using SGE/OpenMPI integration on
StarCluster based on this email.

> Perhaps if carefully
> used, this will ensure that there is a root MPI process running on the
> master node for every MPI job that's simultaneously running.

Is this a requirement for you to have a root MPI process on the master
node for every MPI job? If you're worried about oversubscribing the
master node with MPI processes, then this SGE/OpenMPI integration should
relieve those concerns. If not, what's the reason for needing a 'root
MPI process' running on the master node for every MPI job?

Hope that helps, let me know if you need me to explain or elaborate on
anything. I'll try my best...

~Justin

On 06/20/2010 07:57 PM, Damian Eads wrote:
> Hi,
> 
> Anyone have experience queueing up multiple MPI jobs on StarCluster?
> Does every MPI job require at least one process running on the root
> node? I'd rather not create several clusters to have multiple MPI jobs
> I would need to replicate my data volume for each cluster created and
> manually ensure the data on the volume replications is consistent.
> 
> For example, suppose I want to queue 100 MPI jobs with each job
> requiring three 8 core instances each (24 cores). If I allocate 18
> c1.xlarge instances (18*8=144 cores), I could queue up the jobs with
> 
>    qsub -pe 24 ./myjobscript.sh experiment-1
>    qsub -pe 24 ./myjobscript.sh experiment-2
>    qsub -pe 24 ./myjobscript.sh experiment-3
>    ...
>    qsub -pe 24 ./myjobscript.sh experiment-100
> 
> where ./myjobscript.sh calls mpirun as follows
> 
>     mpirun -byslot -x PYTHONPATH=/data/prefix/lib/python2.6/site-packages \
>                 -wd /data/experiments \
>                 -host master,node001,node002,node003,node004,node005,...,node018 \
>                 -np 24 ./myprogram $1
> 
> Does anyone know if this will work? I'm concerned that when the first
> job is started, the root node will have all of its cores used. I
> noticed the -byslot option in the manpages, which allocates cores
> across the cluster in a round-robin fashion. Perhaps if carefully
> used, this will ensure that there is a root MPI process running on the
> master node for every MPI job that's simultaneously running.
> 
> If anyone has any experience and can give me a push in the right
> direction, I'd greatly appreciate it.
> 
> Thanks!
> 
> Kind regards,
> 
> Damian



