<div dir="ltr">Sergio, <div><br></div><div><br></div><div><br></div><div>Thanks for the pointer! I will try to contact them as well.</div><div>However, my setup is pretty much vanilla StarCluster, except that I installed gcc-4.7 and the boost packages,<br>
so I consider it more of a StarCluster issue than an OGE issue. I believe my problem should be<br>reproducible by others who are using the same AMI and the mpich2 plugin.</div><div><br></div><div><br></div><div><br></div><div>
Regarding your earlier message: I am using mpich2 1.4.1, and I compiled the software with it.</div><div><br></div><div>Actually, when I ran qconf -mp orte, the allocation rule was set to $round_robin by default, not $fill_up.</div><div>
I just re-created the cluster to confirm this.</div><div><br></div><div>I am using StarCluster 0.93.3 with AMI ami-52a0c53b (Ubuntu 12.04) on cc2.8xlarge machines.</div>
<div><br></div><div><br></div><div>Below is what my qsub file looks like:</div><div><br></div><div><div>#!/bin/csh</div><div>#$ -cwd</div><div>#$ -pe orte 4</div><div>#$ -N ttt</div><div>#$ -e ../auto_output/ttt.err</div><div>
#$ -o ../auto_output/ttt.out</div><div>mpirun executable_name > ../auto_logs/ttt.txt</div></div><div><br></div><div><br></div><div><br></div><div>Below is what I get from mpirun --version</div><div><br></div><div><div>
HYDRA build details:</div><div> Version: 1.4.1</div><div> Release Date: Wed Aug 24 14:40:04 CDT 2011</div><div> CC: gcc -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro</div>
<div> CXX: c++ -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro</div><div> F77: gfortran -Wl,-Bsymbolic-functions -Wl,-z,relro</div><div> F90: gfortran -Wl,-Bsymbolic-functions -Wl,-z,relro</div>
<div> Configure options: '--build=x86_64-linux-gnu' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--libexecdir=${prefix}/lib/mpich2' '--srcdir=.' '--disable-maintainer-mode' '--disable-dependency-tracking' '--disable-silent-rules' '--enable-shared' '--prefix=/usr' '--enable-fc' '--disable-rpath' '--sysconfdir=/etc/mpich2' '--includedir=/usr/include/mpich2' '--docdir=/usr/share/doc/mpich2' '--with-hwloc-prefix=system' '--enable-checkpointing' '--with-hydra-ckpointlib=blcr' 'build_alias=x86_64-linux-gnu' 'MPICH2LIB_CFLAGS=-g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -Wall' 'MPICH2LIB_CXXFLAGS=-g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -Wall' 'MPICH2LIB_FFLAGS=-g -O2' 'MPICH2LIB_FCFLAGS=' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro ' 'CPPFLAGS=-D_FORTIFY_SOURCE=2 -I/build/buildd/mpich2-1.4.1/src/mpl/include -I/build/buildd/mpich2-1.4.1/src/mpl/include -I/build/buildd/mpich2-1.4.1/src/openpa/src -I/build/buildd/mpich2-1.4.1/src/openpa/src -I/build/buildd/mpich2-1.4.1/src/mpid/ch3/include -I/build/buildd/mpich2-1.4.1/src/mpid/ch3/include -I/build/buildd/mpich2-1.4.1/src/mpid/common/datatype -I/build/buildd/mpich2-1.4.1/src/mpid/common/datatype -I/build/buildd/mpich2-1.4.1/src/mpid/common/locks -I/build/buildd/mpich2-1.4.1/src/mpid/common/locks -I/build/buildd/mpich2-1.4.1/src/mpid/ch3/channels/nemesis/include -I/build/buildd/mpich2-1.4.1/src/mpid/ch3/channels/nemesis/include -I/build/buildd/mpich2-1.4.1/src/mpid/ch3/channels/nemesis/nemesis/include -I/build/buildd/mpich2-1.4.1/src/mpid/ch3/channels/nemesis/nemesis/include -I/build/buildd/mpich2-1.4.1/src/mpid/ch3/channels/nemesis/nemesis/utils/monitor 
-I/build/buildd/mpich2-1.4.1/src/mpid/ch3/channels/nemesis/nemesis/utils/monitor -I/build/buildd/mpich2-1.4.1/src/util/wrappers -I/build/buildd/mpich2-1.4.1/src/util/wrappers' 'FFLAGS= -g -O2 -O2' 'FC=gfortran' 'CFLAGS= -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -Wall -O2' 'CXXFLAGS= -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -Wall -O2' '--disable-option-checking' 'CC=gcc' 'LIBS=-lrt -lcr -lpthread '</div>
<div> Process Manager: pmi</div><div> Launchers available: ssh rsh fork slurm ll lsf sge manual persist</div><div> Topology libraries available: hwloc plpa</div>
<div> Resource management kernels available: user slurm ll lsf sge pbs</div><div> Checkpointing libraries available: blcr</div><div> Demux engines available: poll select</div></div><div>
<br></div><div><br></div><div><br></div><div>Thanks,</div><div>Hyokun Yun</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Aug 19, 2013 at 11:52 AM, Sergio Mafra <span dir="ltr">&lt;<a href="mailto:sergiohmafra@gmail.com" target="_blank">sergiohmafra@gmail.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hyokun,<div><br></div><div>Another resource you can take advantage of is this forum dedicated to OGE: <a href="http://gridengine.org/blog/2011/01/27/gridengine-users-mailing-list/" target="_blank">http://gridengine.org/blog/2011/01/27/gridengine-users-mailing-list/</a></div>
<div><br></div><div>All best,</div><div><br>Sergio</div></div><div class="gmail_extra"><br><br><div class="gmail_quote"><div class="im">On Mon, Aug 19, 2013 at 1:53 AM, Hyokun Yun <span dir="ltr">&lt;<a href="mailto:yun3@purdue.edu" target="_blank">yun3@purdue.edu</a>&gt;</span> wrote:<br>
</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5"><div dir="ltr"><div>Dear StarCluster users,</div><div><br></div><div><br></div><div>I am experiencing a problem using the MPICH2 plugin with SGE.</div>
<div><br></div><div>I am using the following image: ami-52a0c53b, which uses Ubuntu 12.04.</div>
<div><br></div><div>When I use the mpich2 plugin, it seems that mpich2 and SGE are not tightly integrated: when I submit my script with qsub, I get the following error messages.</div><div><br></div><div>error: executing task of job 1 failed: execution daemon on host "node001" didn't accept task</div>
<div>error: executing task of job 1 failed: execution daemon on host "node002" didn't accept task</div><div>error: executing task of job 1 failed: execution daemon on host "node003" didn't accept task</div>
<div>error: executing task of job 1 failed: execution daemon on host "node004" didn't accept task</div><div><br></div><div>It runs fine when I simply execute 'mpirun' myself, instead of relying on SGE.</div>
<div>Also, the same script runs fine when I use OpenMPI instead of MPICH2. That's why I suspect it is an MPICH2 &amp; SGE integration issue.</div><div><br></div><div>The problem is that I need multi-thread support, which is disabled by default in OpenMPI. I also prefer to use MPICH2 over OpenMPI.</div>
<div><br></div><div>I was able to reproduce the problem when I restarted the cluster from scratch. Would any of you please take a look at the problem by trying the same image with the MPICH2 plugin?</div><div><br></div><div>
<br></div><div>Thanks,</div><div>Hyokun Yun</div>
</div>
<br></div></div><div class="im">_______________________________________________<br>
StarCluster mailing list<br>
<a href="mailto:StarCluster@mit.edu" target="_blank">StarCluster@mit.edu</a><br>
<a href="http://mailman.mit.edu/mailman/listinfo/starcluster" target="_blank">http://mailman.mit.edu/mailman/listinfo/starcluster</a><br>
<br></div></blockquote></div><br></div>
</blockquote></div><br><br clear="all"><div><br></div>-- <br><b>Hyokun Yun </b>( <a href="http://www.stat.purdue.edu/~yun3" target="_blank">http://www.stat.purdue.edu/~yun3</a> )<div><div>Ph.D Candidate</div><div>Department of Statistics</div>
<div>Purdue University</div></div><div><br></div>
</div>