[StarCluster] Unable to launch cluster: SGE master install failures

Lyn Gerner schedulerqueen at gmail.com
Wed Jan 22 17:10:00 EST 2014


Hi All,

I am trying to launch a 3-node cluster (using StarCluster 0.94.3) and keep
getting an error during the SGE install on the master, which aborts the
install on the master and on the remaining nodes.

My starcluster config file specifies disable_queue = true and then invokes
the sge plugin with MASTER_IS_EXEC_HOST = False, so all it needs to do is
install and bring up qmaster.
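
For readers following along, here is a minimal sketch of a config with the two settings described above. It is not the poster's actual file. The cluster name "mycluster", the plugin label "sge", and the setup_class path are assumptions for illustration. The fragment is embedded in a short Python script so the values can be checked with the standard configparser module.

```python
import configparser

# Hypothetical StarCluster config fragment matching the description above.
# "mycluster", the plugin label "sge", and the setup_class path are
# assumptions, not copied from the poster's config file.
CONFIG_TEXT = """
[cluster mycluster]
cluster_size = 3
disable_queue = True
plugins = sge

[plugin sge]
setup_class = starcluster.plugins.sge.SGEPlugin
MASTER_IS_EXEC_HOST = False
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# configparser lower-cases option names by default, so the upper-case key
# can be looked up under its lower-case name.
disable_queue = parser.getboolean("cluster mycluster", "disable_queue")
master_is_exec_host = parser.getboolean("plugin sge", "master_is_exec_host")
print(disable_queue, master_is_exec_host)
```

With these two settings the built-in queue setup is skipped and the plugin is expected to install only the qmaster on the master node.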

The qmaster does come up; however, the cluster start keeps timing out with
the following:

>>> Installing Sun Grid Engine...
!!! ERROR - Error occured while running plugin 'sge':
!!! ERROR - remote command 'source /etc/profile && cd /opt/sge6 &&
!!! ERROR - TERM=rxvt ./inst_sge -m -noremote -auto ./ec2_sge.conf'
!!! ERROR - failed with status 1:
!!! ERROR - Reading configuration from file ./ec2_sge.conf
!!! ERROR - [H[2JInstall log can be found in: /opt/sge6/default/common/i
!!! ERROR - nstall_logs/qmaster_install_master_2014-01-22_21:55:08.log

In the install log, the installer waits for the SGE qmaster PID file to
show up, times out after 5 minutes, and tells me to check my autoinstall
config file.
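
The wait-then-time-out behaviour described above can be sketched as a simple polling loop. This is an illustration only, not the code inst_sge actually runs. The function name wait_for_pid_file and the PID file location (qmaster.pid inside the QMASTER_SPOOL_DIR shown in the log) are assumptions.

```python
import os
import time


def wait_for_pid_file(path, timeout=300.0, poll_interval=1.0):
    """Return True once `path` exists and is non-empty, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        # A PID file that exists but is still empty is treated as not ready.
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# Assumed location: qmaster.pid under the qmaster spool directory from the log.
QMASTER_PID_FILE = "/opt/sge6/default/spool/qmaster/qmaster.pid"
```

Calling wait_for_pid_file(QMASTER_PID_FILE, timeout=5) on the master would show whether the file is where such a check expects it, independently of whether the sge_qmaster process itself is running.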

Here are the ps output and the installation log.

root@AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
# ps -ef|grep master
avahi     1038     1  0 20:42 ?        00:00:00 avahi-daemon: running [master.local]
root      1442     1  0 20:43 ?        00:00:00 /usr/libexec/postfix/master
sgeadmin  1629     1  0 20:43 ?        00:00:00 /opt/sge6/bin/linux-x64/sge_qmaster
root     18277  4408  0 21:30 pts/0    00:00:00 /bin/grep --color=auto master

root@AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
# cat qmaster_install_master_2014-01-22_21:22:55.log
Starting qmaster installation!

Installing Grid Engine as admin user >sgeadmin<



Your $SGE_ROOT directory: /opt/sge6

Using SGE_QMASTER_PORT >63231<.

Using SGE_EXECD_PORT >63232<.

Using >default< as CELL_NAME.


Your $SGE_CLUSTER_NAME: starcluster

Using >/opt/sge6/default/spool/qmaster< as QMASTER_SPOOL_DIR.





Obviously this is not a complete Grid Engine distribution or this
is not your $SGE_ROOT directory.

Missing file or directory: start_gui_installer

Your file permissions will not be set. Exit.


Using >true< as IGNORE_FQDN_DEFAULT.
If it's >true<, the domain name will be ignored.


Making directories

Setting spooling method to dynamic
Dumping bootstrapping information
Initializing spooling database


Using >20000-20100< as gid range.
Using >/opt/sge6/default/spool< as EXECD_SPOOL_DIR.
Using >none at none.edu< as ADMIN_MAIL.
Adding default parallel environments (PE)



   starting sge_qmaster
Reached 5min timeout, while waiting for qmaster PID file.
sge_qmaster daemon didn't start. Please check your
autoinstall configuration file! Installation failed!

I get the same error on every attempt, and I am using an unmodified
ec2_sge.conf file.

I'd appreciate any suggestions on how to get past this.

Thanks much,
Lyn