[StarCluster] EBS volume not mounting on restart

Tue Oct 9 01:41:35 EDT 2012

Dear folks,

I am having a problem when restarting a cluster that mounts an EBS volume
on /data. Here is the part of my config file corresponding to my cluster template:

[cluster issm]
# change this to the name of one of the keypair sections defined above
KEYNAME = ISSMStarCluster
# number of ec2 instances to launch
CLUSTER_SIZE = 2
# create the following user on the cluster
CLUSTER_USER = sgeadmin
# optionally specify shell (defaults to bash)
# (options: tcsh, zsh, csh, bash, ksh)
CLUSTER_SHELL = bash
# AMI to use for cluster nodes. These AMIs are for the us-east-1 region.
# Use the 'listpublic' command to list StarCluster AMIs in other regions
# The base i386 StarCluster AMI is ami-899d49e0
# The base x86_64 StarCluster AMI is ami-999d49f0
# The base HVM StarCluster AMI is ami-4583572c
NODE_IMAGE_ID = ami-4583572c
# instance type for all cluster nodes
# (options: cg1.4xlarge, c1.xlarge, m1.small, c1.medium, m2.xlarge, t1.micro, cc1.4xlarge, m1.medium, cc2.8xlarge, m1.large, m1.xlarge, hi1.4xlarge, m2.4xlarge, m2.2xlarge)
NODE_INSTANCE_TYPE = cc2.8xlarge
# Uncomment to disable installing/configuring a queueing system on the
# cluster (SGE)
#DISABLE_QUEUE=True
# Uncomment to specify a different instance type for the master node (OPTIONAL)
# (defaults to NODE_INSTANCE_TYPE if not specified)
#MASTER_INSTANCE_TYPE = m1.small
# Uncomment to specify a separate AMI to use for the master node. (OPTIONAL)
# (defaults to NODE_IMAGE_ID if not specified)
#MASTER_IMAGE_ID = ami-899d49e0 (OPTIONAL)
# availability zone to launch the cluster in (OPTIONAL)
# (automatically determined based on volumes (if any) or
# selected by Amazon if not specified)
#AVAILABILITY_ZONE = us-east-1c
# list of volumes to attach to the master node (OPTIONAL)
# these volumes, if any, will be NFS shared to the worker nodes
# see "Configuring EBS Volumes" below on how to define volume sections
VOLUMES = issm

# Sections starting with "volume" define your EBS volumes
[volume issm]
VOLUME_ID = vol-7d113b07
MOUNT_PATH = /data
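
As a sanity check on the config itself, here is a minimal sketch in Python that confirms every name listed in VOLUMES has a matching [volume NAME] section. It is not part of StarCluster: the function name missing_volume_sections is hypothetical, and the sketch assumes the config parses with Python's standard configparser module.

```python
import configparser

def missing_volume_sections(config_text, cluster_name):
    """Return names listed in VOLUMES that have no [volume NAME] section."""
    # interpolation=None so a literal '%' in a value cannot raise an error
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    cluster = parser["cluster " + cluster_name]
    # VOLUMES is treated as a comma-separated list; option lookup is case-insensitive
    listed = cluster.get("VOLUMES", fallback="")
    names = [v.strip() for v in listed.split(",") if v.strip()]
    return [n for n in names if not parser.has_section("volume " + n)]

# Trimmed copy of the template shown above
sample = """
[cluster issm]
KEYNAME = ISSMStarCluster
CLUSTER_SIZE = 2
VOLUMES = issm

[volume issm]
VOLUME_ID = vol-7d113b07
MOUNT_PATH = /data
"""

print(missing_volume_sections(sample, "issm"))
```

An empty list means the template and volume sections agree, which is the case for the config above.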

When I first start this cluster (starcluster start issm), everything works perfectly:

 start issm
StarCluster - (http://web.mit.edu/starcluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

>>> Using default cluster template: issm
>>> Validating cluster template settings...
>>> Cluster template settings are valid
>>> Starting cluster...
>>> Launching a 2-node cluster...
>>> Creating security group @sc-issm...
>>> Creating placement group @sc-issm...
Reservation:r-e3538485
>>> Waiting for cluster to come up... (updating every 10s)
>>> Waiting for all nodes to be in a 'running' state...
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for SSH to come up on all nodes...
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for cluster to come up took 2.281 mins
>>> The master node is ec2-107-22-25-149.compute-1.amazonaws.com
>>> Setting up the cluster...
>>> Attaching volume vol-7d113b07 to master node on /dev/sdz ...
>>> Configuring hostnames...
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Mounting EBS volume vol-7d113b07 on /data...
>>> Creating cluster user: None (uid: 1001, gid: 1001)
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Configuring scratch space for user(s): sgeadmin
0/2 |                                                                  |   0%

2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Configuring /etc/hosts on each node
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Starting NFS server on master
>>> Configuring NFS exports path(s):
/home /data
>>> Mounting all NFS export path(s) on 1 worker node(s)
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Setting up NFS took 0.152 mins
>>> Configuring passwordless ssh for root
>>> Configuring passwordless ssh for sgeadmin
>>> Shutting down threads...
20/20 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Configuring SGE...
>>> Configuring NFS exports path(s):
/opt/sge6
>>> Mounting all NFS export path(s) on 1 worker node(s)
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Setting up NFS took 0.102 mins
>>> Installing Sun Grid Engine...
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Creating SGE parallel environment 'orte'
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Adding parallel environment 'orte' to queue 'all.q'
>>> Shutting down threads...
20/20 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Configuring cluster took 1.506 mins
>>> Starting cluster took 3.877 mins

The cluster is now ready to use. To login to the master node
as root, run:

    $ starcluster sshmaster issm

I checked: /data is correctly mounted on my EBS volume and everything is fine.
Here is a df dump:

root@master:/data# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1              8246240   5386292   2441056  69% /
udev                  31263832         4  31263828   1% /dev
tmpfs                 12507188       220  12506968   1% /run
none                      5120         0      5120   0% /run/lock
none                  31267964         0  31267964   0% /run/shm
/dev/xvdb            866917368    205028 822675452   1% /mnt
/dev/xvdz            103212320    192268  97777172   1% /data

The EBS volume I'm mounting is 100 GB in size, so everything checks out.
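
The same check can be scripted. The following is a minimal sketch in Python, assuming the df output has been captured as text; the helper name mounted_device is hypothetical, and the sample is a trimmed copy of the dump above.

```python
def mounted_device(df_output, mount_point):
    """Return the device mounted on mount_point in df output, or None."""
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        # first column is the device, last column is the mount point
        if len(fields) >= 6 and fields[-1] == mount_point:
            return fields[0]
    return None

df_output = """\
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1              8246240   5386292   2441056  69% /
/dev/xvdb            866917368    205028 822675452   1% /mnt
/dev/xvdz            103212320    192268  97777172   1% /data
"""

print(mounted_device(df_output, "/data"))
```

On the first start this reports /dev/xvdz for /data. The 103212320 1K-blocks shown for that device come to roughly 98.4 GiB, which is consistent with a 100 GB volume once filesystem overhead is taken into account.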

Now, if I stop the cluster and start it again using the -x option, the cluster boots
fine, but it does not attach the volume (it won't attempt it at all) and does not even try
to mount /data. It's as though the [volume] section of my config did not exist!

Here is the output of the starcluster start -x issm command:

st start -c issm -x issm
StarCluster - (http://web.mit.edu/starcluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

>>> Validating existing instances...
>>> Validating cluster template settings...
>>> Cluster template settings are valid
>>> Starting cluster...
>>> Starting stopped node: node001
>>> Waiting for cluster to come up... (updating every 10s)
>>> Waiting for all nodes to be in a 'running' state...
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for SSH to come up on all nodes...
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Waiting for cluster to come up took 1.780 mins
>>> The master node is ec2-23-22-242-221.compute-1.amazonaws.com
>>> Setting up the cluster...
>>> Configuring hostnames...
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Creating cluster user: None (uid: 1001, gid: 1001)
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Configuring scratch space for user(s): sgeadmin
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Configuring /etc/hosts on each node
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Starting NFS server on master
>>> Configuring NFS exports path(s):
/home
>>> Mounting all NFS export path(s) on 1 worker node(s)
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Setting up NFS took 0.106 mins
>>> Configuring passwordless ssh for root
>>> Configuring passwordless ssh for sgeadmin
>>> Shutting down threads...
20/20 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Configuring SGE...
>>> Configuring NFS exports path(s):
/opt/sge6
>>> Mounting all NFS export path(s) on 1 worker node(s)
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Setting up NFS took 0.065 mins
>>> Removing previous SGE installation...
>>> Installing Sun Grid Engine...
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Creating SGE parallel environment 'orte'
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Adding parallel environment 'orte' to queue 'all.q'
>>> Shutting down threads...
20/20 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
>>> Configuring cluster took 0.846 mins
>>> Starting cluster took 2.647 mins

The cluster is now ready to use. To login to the master node
as root, run:

    $ starcluster sshmaster issm

As you can see, no attempt was made to attach the EBS volume, and /data was never
mounted! When I log in, there is no EBS volume device for /data either.
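
To make the comparison between the two runs concrete, here is a minimal sketch in Python that lists the volume-related ">>>" steps present in the first log but absent from the restart log. The helper names setup_steps and steps_missing are hypothetical, and the two samples are short excerpts of the logs above rather than the full output.

```python
def setup_steps(log_text):
    """Collect the '>>> ' step lines from StarCluster output, in order."""
    return [line[4:].strip() for line in log_text.splitlines()
            if line.startswith(">>> ")]

def steps_missing(first_log, second_log, keywords=("volume", "EBS")):
    """Keyword-matching steps from the first run that the second run lacks."""
    second = set(setup_steps(second_log))
    return [step for step in setup_steps(first_log)
            if any(k.lower() in step.lower() for k in keywords)
            and step not in second]

first_run = """\
>>> Setting up the cluster...
>>> Attaching volume vol-7d113b07 to master node on /dev/sdz ...
>>> Configuring hostnames...
>>> Mounting EBS volume vol-7d113b07 on /data...
"""

restart = """\
>>> Setting up the cluster...
>>> Configuring hostnames...
"""

for step in steps_missing(first_run, restart):
    print(step)
```

For these excerpts the sketch prints the "Attaching volume" and "Mounting EBS volume" steps, which are exactly the two lines that never appear in the -x run.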

Any help or pointers would be appreciated!

Thanks in advance!

Eric L.

--------------------------------------------------------------------------
Dr. Eric Larour, Software Engineer III,
ISSM Task Manager (http://issm.jpl.nasa.gov)
Mechanical division, Propulsion Thermal and Materials Section, Applied Low Temperature Physics Group.
Jet Propulsion Laboratory.
MS 79-24, 4800 Oak Grove Drive, Pasadena CA 91109.
eric.larour at jpl.nasa.gov
http://issm.jpl.nasa.gov
Tel: 1 818 393 2435.
 --------------------------------------------------------------------------
