[StarCluster] Integration of MPICH2 plugin with SGE

Sergio Mafra sergiohmafra at gmail.com
Tue Aug 20 08:38:57 EDT 2013


Hi Hyokun,

Try this in order to submit your job:

$ qsub -N ttt -V -b y -cwd -pe orte 4 mpirun -np 4 executable_name

PS: Note that the process count passed to mpirun (-np) has to match the number of slots requested from the orte parallel environment (-pe orte).
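
One way to keep the two numbers from drifting apart is to let Grid Engine supply the count: it exports NSLOTS, the number of slots granted to the job, into the job's environment. The sketch below writes a job script that reuses the directives from your qsub file and passes NSLOTS to mpirun. The file name ttt_job.sh and the choice of /bin/sh (via -S) are assumptions for illustration, and executable_name is still a placeholder.

```shell
#!/bin/sh
# Sketch: generate a Grid Engine job script in which mpirun takes its
# process count from NSLOTS instead of a hard-coded number.
# ttt_job.sh is a hypothetical file name; executable_name is a placeholder.
cat > ttt_job.sh <<'EOF'
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -V
#$ -pe orte 4
#$ -N ttt
#$ -e ../auto_output/ttt.err
#$ -o ../auto_output/ttt.out
# NSLOTS is set by Grid Engine to the number of slots granted by "-pe".
mpirun -np "$NSLOTS" executable_name > ../auto_logs/ttt.txt
EOF

echo "wrote ttt_job.sh"
```

Submitting it with "qsub ttt_job.sh" should then start as many processes as slots were granted, so only the "-pe orte" line needs editing when the job size changes.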

All best,


Sergio



On Mon, Aug 19, 2013 at 9:01 PM, Hyokun Yun <yun3 at purdue.edu> wrote:

> Sergio,
>
>
>
> Thanks for the pointer! I will try to contact them as well.
> However, my setup is pretty much vanilla StarCluster, except that I installed
> gcc-4.7 and the boost packages, so I consider it more of a StarCluster issue
> than an OGE issue. I believe my problem should be reproducible by others who
> are using the same AMI and the mpich2 plugin.
>
>
>
> Regarding your earlier message: I am using mpich2 1.4.1, and compiled the
> software using it.
>
> Actually, when I ran qconf -mp orte, the allocation rule was set to
> $round_robin by default, not $fill_up.
> I just re-created the cluster to confirm this.
>
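The allocation rule can also be read without opening the editor, since "qconf -sp" prints a parallel environment's settings as name/value pairs. Below is a small sketch that pulls out just that field. The helper name pe_allocation_rule is made up for illustration, and the sample input only imitates the shape of the qconf output (the slots value in it is invented).

```shell
#!/bin/sh
# Hypothetical helper: read "qconf -sp <pe>" style output on stdin and
# print the value of the allocation_rule field.
pe_allocation_rule() {
    awk '$1 == "allocation_rule" { print $2 }'
}

# Illustrative sample input; on a cluster one would instead run:
#   qconf -sp orte | pe_allocation_rule
printf 'pe_name            orte\nslots              32\nallocation_rule    $round_robin\n' \
    | pe_allocation_rule
# prints: $round_robin
```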
> I am using starcluster 0.93.3, with ami ami-52a0c53b (Ubuntu 12.04), on
> cc2.8xlarge machines.
>
>
> Below is what my qsub file looks like:
>
> #!/bin/csh
> #$ -cwd
> #$ -pe orte 4
> #$ -N ttt
> #$ -e ../auto_output/ttt.err
> #$ -o ../auto_output/ttt.out
> mpirun executable_name > ../auto_logs/ttt.txt
>
>
>
> Below is what I get from mpirun --version
>
> HYDRA build details:
>     Version:                                 1.4.1
>     Release Date:                            Wed Aug 24 14:40:04 CDT 2011
>     CC:                              gcc -D_FORTIFY_SOURCE=2
>  -Wl,-Bsymbolic-functions -Wl,-z,relro
>     CXX:                             c++ -D_FORTIFY_SOURCE=2
>  -Wl,-Bsymbolic-functions -Wl,-z,relro
>     F77:                             gfortran  -Wl,-Bsymbolic-functions
> -Wl,-z,relro
>     F90:                             gfortran  -Wl,-Bsymbolic-functions
> -Wl,-z,relro
>     Configure options:                       '--build=x86_64-linux-gnu'
> '--includedir=${prefix}/include' '--mandir=${prefix}/share/man'
> '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var'
> '--libexecdir=${prefix}/lib/mpich2' '--srcdir=.'
> '--disable-maintainer-mode' '--disable-dependency-tracking'
> '--disable-silent-rules' '--enable-shared' '--prefix=/usr' '--enable-fc'
> '--disable-rpath' '--sysconfdir=/etc/mpich2'
> '--includedir=/usr/include/mpich2' '--docdir=/usr/share/doc/mpich2'
> '--with-hwloc-prefix=system' '--enable-checkpointing'
> '--with-hydra-ckpointlib=blcr' 'build_alias=x86_64-linux-gnu'
> 'MPICH2LIB_CFLAGS=-g -O2 -fstack-protector --param=ssp-buffer-size=4
> -Wformat -Wformat-security -g -O2 -fstack-protector
> --param=ssp-buffer-size=4 -Wformat -Wformat-security
> -Werror=format-security -Wall' 'MPICH2LIB_CXXFLAGS=-g -O2 -fstack-protector
> --param=ssp-buffer-size=4 -Wformat -Wformat-security -g -O2
> -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security
> -Werror=format-security -Wall' 'MPICH2LIB_FFLAGS=-g -O2'
> 'MPICH2LIB_FCFLAGS=' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro '
> 'CPPFLAGS=-D_FORTIFY_SOURCE=2 -I/build/buildd/mpich2-1.4.1/src/mpl/include
> -I/build/buildd/mpich2-1.4.1/src/mpl/include
> -I/build/buildd/mpich2-1.4.1/src/openpa/src
> -I/build/buildd/mpich2-1.4.1/src/openpa/src
> -I/build/buildd/mpich2-1.4.1/src/mpid/ch3/include
> -I/build/buildd/mpich2-1.4.1/src/mpid/ch3/include
> -I/build/buildd/mpich2-1.4.1/src/mpid/common/datatype
> -I/build/buildd/mpich2-1.4.1/src/mpid/common/datatype
> -I/build/buildd/mpich2-1.4.1/src/mpid/common/locks
> -I/build/buildd/mpich2-1.4.1/src/mpid/common/locks
> -I/build/buildd/mpich2-1.4.1/src/mpid/ch3/channels/nemesis/include
> -I/build/buildd/mpich2-1.4.1/src/mpid/ch3/channels/nemesis/include
> -I/build/buildd/mpich2-1.4.1/src/mpid/ch3/channels/nemesis/nemesis/include
> -I/build/buildd/mpich2-1.4.1/src/mpid/ch3/channels/nemesis/nemesis/include
> -I/build/buildd/mpich2-1.4.1/src/mpid/ch3/channels/nemesis/nemesis/utils/monitor
> -I/build/buildd/mpich2-1.4.1/src/mpid/ch3/channels/nemesis/nemesis/utils/monitor
> -I/build/buildd/mpich2-1.4.1/src/util/wrappers
> -I/build/buildd/mpich2-1.4.1/src/util/wrappers' 'FFLAGS= -g -O2 -O2'
> 'FC=gfortran' 'CFLAGS= -g -O2 -fstack-protector --param=ssp-buffer-size=4
> -Wformat -Wformat-security -g -O2 -fstack-protector
> --param=ssp-buffer-size=4 -Wformat -Wformat-security
> -Werror=format-security -Wall -O2' 'CXXFLAGS= -g -O2 -fstack-protector
> --param=ssp-buffer-size=4 -Wformat -Wformat-security -g -O2
> -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security
> -Werror=format-security -Wall -O2' '--disable-option-checking' 'CC=gcc'
> 'LIBS=-lrt -lcr -lpthread '
>     Process Manager:                         pmi
>     Launchers available:                      ssh rsh fork slurm ll lsf
> sge manual persist
>     Topology libraries available:              hwloc plpa
>     Resource management kernels available:    user slurm ll lsf sge pbs
>     Checkpointing libraries available:        blcr
>     Demux engines available:                  poll select
>
>
>
> Thanks,
> Hyokun Yun
>
>
> On Mon, Aug 19, 2013 at 11:52 AM, Sergio Mafra <sergiohmafra at gmail.com> wrote:
>
>> Hyokun,
>>
>> Another resource you can take advantage of is this forum dedicated to
>> OGE: http://gridengine.org/blog/2011/01/27/gridengine-users-mailing-list/
>>
>> All best,
>>
>> Sergio
>>
>>
>> On Mon, Aug 19, 2013 at 1:53 AM, Hyokun Yun <yun3 at purdue.edu> wrote:
>>
>>> Dear starcluster users,
>>>
>>>
>>> I am experiencing a problem using the MPICH2 plugin with SGE.
>>>
>>> I am using the following image: ami-52a0c53b, which uses Ubuntu 12.04.
>>>
>>> When I use the mpich2 plugin, it seems that mpich2 and SGE are not tightly
>>> integrated: when I execute my script using qsub, I get the following error
>>> message.
>>>
>>> error: executing task of job 1 failed: execution daemon on host
>>> "node001" didn't accept task
>>> error: executing task of job 1 failed: execution daemon on host
>>> "node002" didn't accept task
>>> error: executing task of job 1 failed: execution daemon on host
>>> "node003" didn't accept task
>>> error: executing task of job 1 failed: execution daemon on host
>>> "node004" didn't accept task
>>>
>>> It runs fine when I simply execute 'mpirun' myself, instead of relying
>>> on SGE.
>>> Also, the same script runs fine when I use OpenMPI instead of
>>> MPICH2.  That's why I suspect it is an MPICH2 & SGE integration issue.
>>>
>>> The problem is that I need multi-thread support, and it is by default
>>> disabled in OpenMPI.  I also prefer to use MPICH2 instead of OpenMPI.
>>>
>>> I was able to reproduce the problem when I restarted the cluster from
>>> scratch.  Would any of you please take a look at the problem by trying the
>>> same image with the MPICH2 plugin?
>>>
>>>
>>> Thanks,
>>> Hyokun Yun
>>>
>>> _______________________________________________
>>> StarCluster mailing list
>>> StarCluster at mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>>
>>>
>>
>
>
> --
> Hyokun Yun ( http://www.stat.purdue.edu/~yun3 )
> Ph.D. Candidate
> Department of Statistics
> Purdue University
>
>