[StarCluster] grid engine not initialize on gpu hvm image

Thu Sep 13 13:29:35 EDT 2012

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Jesse,

Sorry for the delay in responding, but I'm glad you figured out that you
need Ubuntu-based AMIs for both the HVM and non-HVM nodes. That said, keep
in mind that, IIRC, only HVM nodes are on the high-speed network, which
means all traffic between the master and the nodes (e.g. NFS) will be
slower than it would be in an all-HVM cluster.
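
For anyone else reading, a mixed cluster like the one described below could
be written as a cluster template roughly as follows. This is only a sketch:
the template name, KEYNAME, CLUSTER_SIZE and both instance types are assumed
placeholders for illustration. Only the two AMI IDs come from this thread.

```ini
# Sketch of a mixed non-HVM master / HVM node cluster template.
# Both AMIs are Ubuntu-based, so the SGE install copied from the
# master matches the OS on the nodes.
[cluster gpucluster]
# Placeholder: name of a key defined elsewhere in the config
KEYNAME = mykey
# Assumed size (master + 1 node)
CLUSTER_SIZE = 2
# Non-HVM Ubuntu AMI for the master (from this thread)
MASTER_IMAGE_ID = ami-999d49f0
# Assumed non-HVM instance type
MASTER_INSTANCE_TYPE = m1.large
# HVM Ubuntu AMI for the nodes (from this thread)
NODE_IMAGE_ID = ami-4583572c
# Assumed HVM GPU instance type
NODE_INSTANCE_TYPE = cg1.4xlarge
```

For an all-HVM cluster, MASTER_IMAGE_ID and MASTER_INSTANCE_TYPE would
presumably be set to the same HVM values as the nodes.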

~Justin

On 08/27/2012 05:59 PM, Jesse Lu wrote:
> Okay, figured out that using ami-999d49f0 for non-HVM master and
> ami-4583572c for HVM nodes makes SGE work well. It's my fault for 
> not looking at the available public starcluster images carefully
> enough.
> 
> 
> 
> On Mon, Aug 27, 2012 at 2:26 PM, Jesse Lu <jesselu at stanford.edu 
> <mailto:jesselu at stanford.edu>> wrote:
> 
> Sorry for the spam, but here's another follow-up.
> 
> I found that this only happens when I use a non-HVM-EBS AMI for
> the master, but an HVM-EBS AMI for the nodes.
> 
> This is probably because StarCluster copies the sge install from
> the master to the nodes, and this doesn't play nice when the nodes
> are CentOS based but the master is Ubuntu based.
> 
> Any ideas for a work-around?
> 
> 
> On Mon, Aug 27, 2012 at 2:07 PM, Jesse Lu <jesselu at stanford.edu 
> <mailto:jesselu at stanford.edu>> wrote:
> 
> Follow-up,
> 
> Here are the contents of the installation log file (for grid
> engine)
> 
> cat /opt/sge6/default/common/install_logs/execd_install_node001_2012-08-27_14:04:29.log
>
> 
> 
> Your $SGE_ROOT directory: /opt/sge6
> 
> 
> Using cell: >default<
> 
> 
> 
> 
> 
> Using local execd spool directory [/opt/sge6/default/spool/exec_spool_local]
> 
> Creating local configuration for host >node001<
> sgeadmin@node001 modified "node001" in configuration list
> Local configuration for host >node001< created.
> 
> Host >master< already in submit host list!
> Host >node001< already in submit host list!
> 
> 
> starting sge_execd
> 
> 
> No modification because "node001" already exists in "hostlist" of "hostgroup"
> root@node001 modified "@allhosts" in host group list
> root@node001 modified "all.q" in cluster queue list
> 
> got select error: Connection refused
> got select error: closing "node001/execd/1"
> Execd on host node001 is not started!
> 
> 
> On Mon, Aug 27, 2012 at 1:37 PM, Jesse Lu <jesselu at stanford.edu 
> <mailto:jesselu at stanford.edu>> wrote:
> 
> ami-12b6477b produces the following error on cluster startup:
> 
> !!! ERROR - command 'cd /opt/sge6 && TERM=rxvt ./inst_sge -x 
> -noremote -auto ./ec2_sge.conf' failed with status 1
> 
> I'm guessing the sge6 installation is faulty? Can anyone help?
> Thanks!
> 
> Jesse
> 

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAlBSF/4ACgkQ4llAkMfDcrlSwwCbB5lJLmj4GY9rriY9jfxNdqO3
s2UAn13+cEYu9bCqx6jiAP/wuPdetm+D
=Dyis
-----END PGP SIGNATURE-----