Okay, I figured out that using ami-999d49f0 for the non-HVM master and ami-4583572c for the HVM nodes makes SGE work. It was my fault for not looking carefully enough at the available public StarCluster images.<div><br><div><br>
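<div>For anyone hitting the same thing, here is a minimal sketch of the relevant part of a cluster template, assuming the two images are selected with StarCluster's MASTER_IMAGE_ID and NODE_IMAGE_ID settings. The template name "mycluster" is a placeholder, and the other required settings (keypair, cluster size, instance types) are left out.</div>
<pre>
[cluster mycluster]
# "mycluster" is a placeholder name; keypair, size and instance types omitted
# non-HVM image for the master
MASTER_IMAGE_ID = ami-999d49f0
# HVM image for the compute nodes
NODE_IMAGE_ID = ami-4583572c
</pre>
<div>The point is only that the two images should be a compatible pair, since the SGE install on the master is what ends up on the nodes.</div>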
<div><br><div class="gmail_quote">On Mon, Aug 27, 2012 at 2:26 PM, Jesse Lu <span dir="ltr"><<a href="mailto:jesselu@stanford.edu" target="_blank">jesselu@stanford.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Sorry for the spam, but here's another follow-up. <div><br></div><div>I found that this only happens when I use a non-HVM EBS AMI for the master, but an HVM EBS AMI for the nodes.</div><div><br></div><div>This is probably because StarCluster copies the SGE install from the master to the nodes, and this doesn't play nice when the nodes are CentOS based but the master is Ubuntu based.</div>
<div><br></div><div>Any ideas for a work-around?<div><div class="h5"><br><div><br><div class="gmail_quote">On Mon, Aug 27, 2012 at 2:07 PM, Jesse Lu <span dir="ltr"><<a href="mailto:jesselu@stanford.edu" target="_blank">jesselu@stanford.edu</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Follow-up,<div><br></div><div>Here are the contents of the installation log file (for grid engine)</div><div><br></div>
<div><div>cat /opt/sge6/default/common/install_logs/execd_install_node001_2012-08-27_14:04:29.log</div>
<div><br></div><div><br></div><div>Your $SGE_ROOT directory: /opt/sge6</div><div><br></div><div><br></div><div>Using cell: >default<</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>
Using local execd spool directory [/opt/sge6/default/spool/exec_spool_local]</div><div><br></div><div>Creating local configuration for host >node001<</div><div>sgeadmin@node001 modified "node001" in configuration list</div>
<div>Local configuration for host >node001< created.</div><div><br></div><div>Host >master< already in submit host list!</div><div>Host >node001< already in submit host list!</div><div><br></div><div><br>
</div><div> starting sge_execd</div><div><br></div><div><br></div><div>No modification because "node001" already exists in "hostlist" of "hostgroup"</div><div>root@node001 modified "@allhosts" in host group list</div>
<div>root@node001 modified "all.q" in cluster queue list</div><div><br></div><div>got select error: Connection refused</div><div>got select error: closing "node001/execd/1"</div><div>Execd on host node001 is not started!</div>
<div><div>
<div><br></div><br><div class="gmail_quote">On Mon, Aug 27, 2012 at 1:37 PM, Jesse Lu <span dir="ltr"><<a href="mailto:jesselu@stanford.edu" target="_blank">jesselu@stanford.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
ami-12b6477b produces the following error on cluster startup:<br><br>!!! ERROR - command 'cd /opt/sge6 && TERM=rxvt ./inst_sge -x -noremote -auto ./ec2_sge.conf' failed with status 1
<div><br></div><div>I'm guessing the sge6 installation is faulty? Can anyone help? Thanks!</div><span><font color="#888888"><div><br></div><div>Jesse</div>
</font></span></blockquote></div><br></div></div></div>
</blockquote></div><br></div></div></div></div>
</blockquote></div><br></div></div></div>