[StarCluster] Addnode SGE problem

Justin Riley jtriley at MIT.EDU
Mon Nov 26 16:36:08 EST 2012


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Thanks for reporting this. I'm working on a fix:

https://github.com/jtriley/StarCluster/issues/179

~Justin

On 11/26/2012 03:11 PM, Ron Chen wrote:
> I have not used that feature before, but if the latest dev version
> does not work, then filing a bug seems to be the right thing to
> do.
> 
> https://github.com/jtriley/StarCluster/issues
> 
> -Ron
> 
> ________________________________
> From: Daniel Polhamus <danp at metrumrg.com>
> To: Ron Chen <ron_chen_123 at yahoo.com>
> Cc: "starcluster at mit.edu" <starcluster at mit.edu>
> Sent: Monday, November 26, 2012 2:44 PM
> Subject: Re: [StarCluster] Addnode SGE problem
> 
> 
> Thanks Ron, that was helpful. It looks like the NFS share of my
> EBS volume isn't being set up correctly when using addnode. In my
> config file, I've set MOUNT_PATH=/data, yet when I use addnode:
> 
> 
> >>> Configuring NFS exports path(s):
> /home
> 
> Sure enough, if I make changes using the master node in /home/danp,
> I can see those changes on node001 whereas there is no /data on
> node001.
> 
> Should I create a bug report for this, or is this message
> sufficient?
> 
> Dan
> 
> 
> 
> On Mon, Nov 26, 2012 at 2:20 PM, Ron Chen <ron_chen_123 at yahoo.com>
> wrote:
> 
>> Do you have the danp user on node001?
>> 
>> Also, you should check the execd's messages file
>> ($SGE_ROOT/default/spool/<host>/messages) to find out why the job
>> caused errors.
>> 
>> http://gridscheduler.sourceforge.net/howto/troubleshooting.html
>> 
>> -Ron
>> 
>> 
>> 
>> ________________________________
>> From: Daniel Polhamus <danp at metrumrg.com>
>> To: starcluster at mit.edu
>> Sent: Monday, November 26, 2012 1:55 PM
>> Subject: [StarCluster] Addnode SGE problem
>> 
>> 
>> 
>> Hi all,
>> 
>> I've run into a problem with "addnode" that I'm having a
>> difficult time diagnosing. Using the development version of
>> StarCluster, when I issue a "starcluster addnode", the nodes added
>> to the resulting cluster are unusable -- they result in SGE
>> errors. Jobs run on the master node, but any nodes I've added
>> are broken. If, however, I start the cluster with multiple nodes,
>> then the resulting nodes are all usable (so it's not a user code
>> issue). I have a hunch that this is due to the fact that we have
>> several users working under the same account (as different AWS
>> IAM users) and we are not all on the same StarCluster version.
>> To be clear, we are all on varying stages of the development
>> version (0.9999). Where do I begin debugging this? The hosts file
>> seems to be set up correctly (see output below).
>> 
>> Thanks, Dan
>> 
>> danp at master:~$ cat /etc/hosts
>> 127.0.0.1 localhost
>> 
>> # The following lines are desirable for IPv6 capable hosts
>> ::1 ip6-localhost ip6-loopback
>> fe00::0 ip6-localnet
>> ff00::0 ip6-mcastprefix
>> ff02::1 ip6-allnodes
>> ff02::2 ip6-allrouters
>> ff02::3 ip6-allhosts
>> 10.196.149.155 master
>> 10.226.219.58 node001
>> 
>> And here's what the errors look like:
>> 
>> danp at master:~$ qstat -f
>> queuename                      qtype resv/used/tot. load_avg arch          states
>> ---------------------------------------------------------------------------------
>> all.q at master                   BIP   0/0/8          1.29     lx24-amd64
>> ---------------------------------------------------------------------------------
>> all.q at node001                  BIP   0/0/8          0.70     lx24-amd64
>> 
>> ############################################################################
>> - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>> ############################################################################
>>       1 0.55500 postList   danp         Eqw   11/26/2012 16:54:38     1
>>       3 0.55500 postList   danp         Eqw   11/26/2012 16:54:39     1
>>       5 0.55500 postList   danp         Eqw   11/26/2012 16:54:39     1
>>       7 0.55500 postList   danp         Eqw   11/26/2012 16:54:39     1
>>       8 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>>       9 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>>      10 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>>      11 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>>      13 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>>      15 0.55500 postList   danp         Eqw   11/26/2012 16:54:41     1
>> 
>> _______________________________________________
>> StarCluster mailing list
>> StarCluster at mit.edu
>> http://mailman.mit.edu/mailman/listinfo/starcluster
>> 
> 
> 

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://www.enigmail.net/

iEYEARECAAYFAlCz4MgACgkQ4llAkMfDcrln6ACfdXPkXoi4jHuWgvw1FyiOodUC
af4AnRTFAER33awFUXBBQctQo53tb+h1
=AuEu
-----END PGP SIGNATURE-----
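
The two symptoms discussed in this thread (jobs stuck in the Eqw state,
and /data missing on node001) can be checked with a small shell sketch
like the one below. It is only an illustration, not part of StarCluster
or SGE: the helper names eqw_jobs and is_nfs_mount are hypothetical, and
the job listing is assumed to use the default qstat column order (job ID,
priority, name, user, state, ...), as in the output quoted above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of StarCluster or SGE.

# Print the IDs of jobs whose state column is "Eqw".
# Reads qstat-style output on stdin; assumes the state is field 5.
eqw_jobs() {
    awk '$5 == "Eqw" { print $1 }'
}

# Succeed if the path in $1 is listed as an NFS mount.
# $2 is a mounts table in /proc/mounts format (defaults to /proc/mounts):
#   device  mountpoint  fstype  options  dump  pass
is_nfs_mount() {
    awk -v p="$1" '$2 == p && $3 ~ /^nfs/ { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# Example use on the master node (needs SGE on the PATH):
#   qstat -u danp | eqw_jobs | while read -r id; do
#       qstat -j "$id" | grep -i 'error reason'
#   done
#
# Example use on node001:
#   is_nfs_mount /data || echo "/data is not NFS-mounted on this node"
```

Once the underlying cause is fixed, the error state of a job can be
cleared with "qmod -cj <job id>" so that the scheduler retries it.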

