[StarCluster] Unable to launch cluster: SGE master install failures

Rayson Ho raysonlogin at gmail.com
Wed Jan 22 17:29:08 EST 2014


Hmm, is that a public AMI or is it still private? If it is public,
then maybe I can take a look at it when I have time this evening...

If it is private, then just boot it up using the AWS Web Console (so
that no SC code runs), then SSH in, go to the SGE directory, and
find out whether "start_gui_installer" is really missing, because
your log is printing out this line:

  Missing file or directory: start_gui_installer
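
Something along these lines should do for the check once you are logged
in. It is only a sketch: "check_sge_file" is a made-up helper name, and
the /opt/sge6 default is simply the $SGE_ROOT path shown in your log, so
adjust it if your AMI keeps SGE elsewhere.

```shell
# Hypothetical helper (not part of SGE or StarCluster): report whether a
# file named in the installer's "Missing file or directory" message exists
# under the SGE root. SGE_ROOT defaults to /opt/sge6, the path in the log.
check_sge_file() {
    sge_root="${SGE_ROOT:-/opt/sge6}"
    name="$1"
    if [ -e "$sge_root/$name" ]; then
        echo "present: $sge_root/$name"
        return 0
    else
        echo "MISSING: $sge_root/$name"
        return 1
    fi
}
```

Running "check_sge_file start_gui_installer" on the instance would then
tell you whether the file the installer complains about is really absent.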

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html


On Wed, Jan 22, 2014 at 5:25 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
> Rayson, this is a derivative of the Scientific Linux 6.3 AMI that was
> pointed to previously from the StarCluster site.
>
> Thanks,
> Lyn
>
>
> On Wed, Jan 22, 2014 at 12:13 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
>>
>> Which AMI did you use? Seems like it is missing some files...
>>
>> Rayson
>>
>>
>> On Wed, Jan 22, 2014 at 5:10 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
>> > Hi All,
>> >
>> > I am trying to launch a 3-node cluster (using 0.94.3), and keep getting an
>> > error during SGE install on the master, which blows the install of it and
>> > the remaining nodes out of the water.
>> >
>> > My starcluster config file specifies disable_queue = true and then invokes
>> > the sge plugin with MASTER_IS_EXEC_HOST = False, so all it needs to do is
>> > install and bring up qmaster.
>> >
>> > The qmaster does come up; however, the cluster start keeps timing out with
>> > the following:
>> >
>> > >>> Installing Sun Grid Engine...
>> > !!! ERROR - Error occured while running plugin 'sge':
>> > !!! ERROR - remote command 'source /etc/profile && cd /opt/sge6 &&
>> > !!! ERROR - TERM=rxvt ./inst_sge -m -noremote -auto ./ec2_sge.conf'
>> > !!! ERROR - failed with status 1:
>> > !!! ERROR - Reading configuration from file ./ec2_sge.conf
>> > !!! ERROR - Install log can be found in: /opt/sge6/default/common/i
>> > !!! ERROR - nstall_logs/qmaster_install_master_2014-01-22_21:55:08.log
>> >
>> > In the install log, it's waiting for the SGE qmaster pid file to show up,
>> > times out after 5 mins, and tells me to check my autoinstall config file.
>> >
>> > Here are the ps output and the installation log.
>> >
>> > root@AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> > # ps -ef|grep master
>> > avahi     1038     1  0 20:42 ?        00:00:00 avahi-daemon: running [master.local]
>> > root      1442     1  0 20:43 ?        00:00:00 /usr/libexec/postfix/master
>> > sgeadmin  1629     1  0 20:43 ?        00:00:00 /opt/sge6/bin/linux-x64/sge_qmaster
>> > root     18277  4408  0 21:30 pts/0    00:00:00 /bin/grep --color=auto master
>> >
>> > root@AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> > # cat qmaster_install_master_2014-01-22_21:22:55.log
>> > Starting qmaster installation!
>> >
>> > Installing Grid Engine as admin user >sgeadmin<
>> >
>> >
>> >
>> > Your $SGE_ROOT directory: /opt/sge6
>> >
>> > Using SGE_QMASTER_PORT >63231<.
>> >
>> > Using SGE_EXECD_PORT >63232<.
>> >
>> > Using >default< as CELL_NAME.
>> >
>> >
>> > Your $SGE_CLUSTER_NAME: starcluster
>> >
>> > Using >/opt/sge6/default/spool/qmaster< as QMASTER_SPOOL_DIR.
>> >
>> >
>> >
>> >
>> >
>> > Obviously this is not a complete Grid Engine distribution or this
>> > is not your $SGE_ROOT directory.
>> >
>> > Missing file or directory: start_gui_installer
>> >
>> > Your file permissions will not be set. Exit.
>> >
>> >
>> > Using >true< as IGNORE_FQDN_DEFAULT.
>> > If it's >true<, the domain name will be ignored.
>> >
>> >
>> > Making directories
>> >
>> > Setting spooling method to dynamic
>> > Dumping bootstrapping information
>> > Initializing spooling database
>> >
>> >
>> > Using >20000-20100< as gid range.
>> > Using >/opt/sge6/default/spool< as EXECD_SPOOL_DIR.
>> > Using >none@none.edu< as ADMIN_MAIL.
>> > Adding default parallel environments (PE)
>> >
>> >
>> >
>> >    starting sge_qmaster
>> > Reached 5min timeout, while waiting for qmaster PID file.
>> > sge_qmaster daemon didn't start. Please check your
>> > autoinstall configuration file! Installation failed!
>> > "
>> >
>> > It's the same error on every attempt, and I am using an unmodified
>> > ec2_sge.conf file.
>> >
>> > Appreciate any suggestions for how to get over this.
>> >
>> > Thanks much,
>> > Lyn
>> >
>> > _______________________________________________
>> > StarCluster mailing list
>> > StarCluster at mit.edu
>> > http://mailman.mit.edu/mailman/listinfo/starcluster
>> >
>
>

