[StarCluster] Addnode SGE problem

Daniel Polhamus danp at metrumrg.com
Mon Nov 26 14:44:50 EST 2012


Thanks Ron, that was helpful.  It looks like the NFS share of my EBS volume
isn't being set up correctly when using addnode.  In my config file, I've
set MOUNT_PATH=/data, yet when I use addnode only /home is exported:

>>> Configuring NFS exports path(s):
/home

Sure enough, changes I make in /home/danp on the master node are visible on
node001, whereas /data does not exist on node001 at all.
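
One rough way to narrow down which side is at fault, as a sketch rather than
anything StarCluster itself provides: the two helper functions below
(is_exported and is_mounted are names made up for illustration) check whether
a directory is listed in an exports-format file and whether it appears as a
mount point in a mounts table. This assumes the master's exports live in
/etc/exports and that node001 exposes /proc/mounts, as on a typical Linux AMI.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of StarCluster or SGE.

# Succeed if directory $2 is the first field of some line in the
# exports-format file $1 (e.g. /etc/exports on the master).
is_exported() {
    awk -v dir="$2" '$1 == dir { found = 1 } END { exit !found }' "$1"
}

# Succeed if directory $2 is the mount-point field (second column) of some
# line in the mounts table $1 (e.g. /proc/mounts on a node).
is_mounted() {
    awk -v dir="$2" '$2 == dir { found = 1 } END { exit !found }' "$1"
}

# Example use (run by hand):
#   on master:   is_exported /etc/exports /data || echo "/data is not exported"
#   on node001:  is_mounted /proc/mounts /data  || echo "/data is not mounted"
```

If /data is missing from the master's exports, the problem is on the export
side of addnode; if it is exported but not mounted on node001, the mount step
on the new node is the place to look.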

Should I create a bug report for this, or is this message sufficient?
Dan


On Mon, Nov 26, 2012 at 2:20 PM, Ron Chen <ron_chen_123 at yahoo.com> wrote:

> Do you have the danp user on node001?
>
> Also, you should check the execd's messages file
> ($SGE_ROOT/default/spool/<host>/messages) to find out why the job caused
> errors.
>
> http://gridscheduler.sourceforge.net/howto/troubleshooting.html
>
>  -Ron
>
>
>
> ________________________________
> From: Daniel Polhamus <danp at metrumrg.com>
> To: starcluster at mit.edu
> Sent: Monday, November 26, 2012 1:55 PM
> Subject: [StarCluster] Addnode SGE problem
>
>
> Hi all,
>
> I've run into a problem with "addnode" that I'm having a difficult time
> diagnosing.  Using the development version of StarCluster, when I issue a
> "starcluster addnode", the nodes added to the cluster are unusable and
> result in SGE errors.  Jobs run on the master node, but any nodes I've
> added are broken.  If, however, I start the cluster with multiple nodes,
> then the resulting nodes are all usable (so it's not a user code issue).
> I have a hunch that this is due to the fact that we have several users
> working under the same account (as different AWS IAM users) and we are not
> all on the same StarCluster version.  To be clear, we are all on varying
> stages of the development version (0.9999).  Where do I begin debugging
> this?  The hosts file seems to be set up correctly (see output below).
>
> Thanks,
> Dan
>
> danp@master:~$ cat /etc/hosts
> 127.0.0.1 localhost
>
> # The following lines are desirable for IPv6 capable hosts
> ::1 ip6-localhost ip6-loopback
> fe00::0 ip6-localnet
> ff00::0 ip6-mcastprefix
> ff02::1 ip6-allnodes
> ff02::2 ip6-allrouters
> ff02::3 ip6-allhosts
> 10.196.149.155 master
> 10.226.219.58 node001
>
> And here's what the errors look like:
>
> danp@master:~$ qstat -f
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> all.q@master                   BIP   0/0/8          1.29     lx24-amd64
> ---------------------------------------------------------------------------------
> all.q@node001                  BIP   0/0/8          0.70     lx24-amd64
>
> ############################################################################
>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
>       1 0.55500 postList   danp         Eqw   11/26/2012 16:54:38     1
>       3 0.55500 postList   danp         Eqw   11/26/2012 16:54:39     1
>       5 0.55500 postList   danp         Eqw   11/26/2012 16:54:39     1
>       7 0.55500 postList   danp         Eqw   11/26/2012 16:54:39     1
>       8 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>       9 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>      10 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>      11 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>      13 0.55500 postList   danp         Eqw   11/26/2012 16:54:40     1
>      15 0.55500 postList   danp         Eqw   11/26/2012 16:54:41     1



-- 
Daniel G Polhamus, PhD
Metrum Research Group, LLC
2 Tunxis Rd, Suite 112
Tariffville, CT 06081
(888) 308-7049 ext 403