[StarCluster] Terminating a cluster does not remove its security group (UNCLASSIFIED)

Thu Dec 20 10:11:43 EST 2012

Comments Inline.

> I'm running a HYCOM job now using the /mnt disk on the master node.  
> The problem is that big files are being written.  One of the files is 
> 57GB, and 24 other files of size 4GB are also written.  The job is 
> running more slowly than when run on an EBS volume, but that may be 
> because the MPI executable is on an EBS volume and is being run from 
> there.

When you say there are 153GB of data written, is that the output, or is 
that the size of the input files you are copying to /mnt/?  The total 
time to copy 153GB from the NFS share to each individual node, plus the 
time to do the actual work, may in fact be longer than the original 
approach of reading from the NFS share.  I wasn't really suggesting that 
any of the input files be relocated or that the MPI binary should be 
moved (I don't think the latter would have much of an impact at all, 
aside from the initial launch).  I was only suggesting that your MPI 
job write its output to /mnt/ and that you then copy those output files 
back to the shared filesystem so you can get the final results.
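
A minimal sketch of that workflow, under some assumptions that are not 
from your setup: the function name is made up, the paths are 
placeholders, and the job is assumed to write its output relative to 
its working directory.

```shell
#!/bin/sh
# Sketch: run a job with its output on node-local disk, then copy the
# results to shared storage. Function name and paths are hypothetical.

run_and_stage_out() {
    local_dir="$1"    # fast node-local scratch, e.g. a directory under /mnt
    shared_dir="$2"   # directory on the shared filesystem for final results
    shift 2           # remaining arguments: the command to run

    mkdir -p "$local_dir" "$shared_dir" || return 1

    # Run the job from the scratch directory so that relative output
    # paths land on the local disk.
    ( cd "$local_dir" && "$@" ) || return 1

    # Copy the results back only after the job has finished successfully.
    cp -Rp "$local_dir"/. "$shared_dir"/
}
```

It would be called as, for example, 
run_and_stage_out /mnt/run1 /sharedWork/run1 mpiexec -n 336 /sharedWork/hycom. 
The copy step only sees the /mnt of the node it runs on, so if ranks on 
other nodes also write files, the same copy has to happen on each of 
those nodes.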

If the input files are relatively static, you could do something like 
this:

1. Create a volume and put all the data on it.
2. Make a snapshot of this volume.
3. For each of the nodes, create a new volume from the snapshot in 
step 2, and mount it.

At this point all the nodes will have all the data on their EBS volume 
at launch and no copying is necessary.
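
The three steps above could be scripted roughly as follows.  This is 
only a sketch: using the AWS CLI is my assumption (any EC2 tool would 
do), and the function name, the device name /dev/sdf, and every ID 
passed in are placeholders.  Mounting the attached device on each node 
is left out.

```shell
#!/bin/sh
# Sketch: one EBS volume per node, all created from a single snapshot.
# Assumes the AWS CLI is installed and configured. The function name,
# the device name, and all IDs are hypothetical placeholders.

volumes_from_snapshot() {
    base_volume="$1"   # volume that already holds the input data (step 1)
    zone="$2"          # availability zone the nodes run in
    shift 2            # remaining arguments: instance IDs of the nodes

    # Step 2: snapshot the base volume and wait until it is complete.
    snapshot=$(aws ec2 create-snapshot --volume-id "$base_volume" \
        --query SnapshotId --output text) || return 1
    aws ec2 wait snapshot-completed --snapshot-ids "$snapshot" || return 1

    # Step 3: create one volume per node from the snapshot and attach it.
    for instance in "$@"; do
        volume=$(aws ec2 create-volume --snapshot-id "$snapshot" \
            --availability-zone "$zone" \
            --query VolumeId --output text) || return 1
        aws ec2 wait volume-available --volume-ids "$volume" || return 1
        aws ec2 attach-volume --volume-id "$volume" \
            --instance-id "$instance" --device /dev/sdf >/dev/null || return 1
        echo "$instance $volume"   # report which volume went to which node
    done
}
```

Called as volumes_from_snapshot <base-volume-id> <zone> <instance-id>..., 
it prints one "instance volume" pair per node.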

When the data is updated, you can either a) update the data, create a 
new snapshot and relaunch the cluster so volumes are generated from the 
new snapshot, or b) use rsync to update the base volume from a new copy 
whenever a node boots.  This will make the node startup a little bit 
slower, but should be pretty fast if only a subset of the data has 
changed.

Another question, about how your MPI program writes its output: what 
size chunks does it write in?  Buffering the output and writing in 
larger chunks instead of a lot of tiny writes is probably going to 
perform better for you.  I use GlusterFS as a shared filesystem in my 
setup, and the key for it is to write in 64KB chunks (as opposed to 
the more common 4KB chunks that things like rsync use by default).
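
One rough way to see whether chunk size matters on a given filesystem 
is to write the same amount of data with dd at two block sizes.  The 
function name and the file path below are placeholders, and this 
measures the filesystem, not what your MPI program actually does.

```shell
#!/bin/sh
# Sketch: write a fixed number of bytes to a file in chunks of a given
# size. Function name and paths are hypothetical.

write_in_chunks() {
    target="$1"        # file to write
    chunk_bytes="$2"   # size of each write, e.g. 4096 or 65536
    total_bytes="$3"   # total to write (rounded down to whole chunks)

    count=$((total_bytes / chunk_bytes))

    # dd issues one write of chunk_bytes per block; it reports the
    # number of blocks written on stderr when it finishes.
    dd if=/dev/zero of="$target" bs="$chunk_bytes" count="$count" || return 1

    # Flush cached data to disk so the write is not merely sitting in memory.
    sync
}
```

For example, compare write_in_chunks /mnt/ddtest.bin 4096 1073741824 
against the same call with 65536.  GNU dd also prints the elapsed time 
and throughput, which makes the comparison easy to read off.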

> I gave up trying to login to each node of the cluster using 
> "starcluster sshnode cluster <node>", mainly because I was getting ssh 
> errors that were preventing me from logging in to several nodes.  The 
> error message was something like "you may be in a man-in-the-middle 
> attack" or something like that.
>
This is normal for ssh.  When you ssh into a machine you've never 
connected to before, ssh can't tell whether the host's key is valid, so 
it warns you before saving the key alongside the hostname you are 
connecting to.  Thereafter, if you connect to a machine with the same 
name, it will warn you if the key doesn't match what you have on file 
(which might be a man-in-the-middle attack).  You can delete the stored 
key by removing the line for that hostname from ~/.ssh/known_hosts 
(running "ssh-keygen -R <hostname>" does the same thing).

In general, I would try to do some benchmarking to identify where the 
bottleneck is.  Is it reading the input or writing the output that is 
slow?

Dustin

> My technique is very kludgy, but this is what I did.  I wrote a bash 
> script to copy four HYCOM input files that each MPI process needs to 
> be able to read to the /mnt disk.  /sharedWork is the name of my 
> standard EBS volume.
>
> root@master > more copyit
> #!/bin/bash
> cd /sharedWork/ABTP/ded/hycom/input
> /bin/cp blkdat.input limits ports.input /mnt
> /bin/cp patch.input_00336 /mnt/patch.input
> ls /mnt
> exit
>
> Then I wrote a small Fortran MPI program that executes this script 
> with a system call.  Since there are 16 MPI processes to an instance 
> (cc2.8xlarge), I only use one MPI process on each node to do the 
> copying.
>
> root@master > more copy.f
> program copy
> include 'mpif.h'
> call MPI_Init (ierror)
> call MPI_Comm_rank (MPI_COMM_WORLD, myid, ier)
> if (mod(myid, 16) .eq. 0) then
>   call system ('/sharedWork/copyit')
> endif
> call MPI_Finalize (ierror)
> stop
> end
>
> I compiled "copy.f" into an executable called "copy" using "mpiifort". 
>  Then in my HYCOM batch script, I run this small MPI job before 
> running the real job:
>
> mpiexec -n 336 /sharedWork/copy
>
> Unfortunately, I think I made an error in that the HYCOM executable is 
> being run from /sharedWork, the EBS volume.
>
> mpiexec -n 336 /sharedWork/hycom
>
> The job is running slowly, more slowly than if all files were on 
> /sharedWork.  For the next run, I think I will modify the "copyit" 
> script to copy the HYCOM executable to /mnt also for each instance and 
> change the mpiexec line to:
>
> mpiexec -n 336 /mnt/hycom
>
> HYCOM needs a lot of input files, and they are all available on /mnt 
> corresponding to the master node (rank 0).  However, most of the input 
> files are read only by rank 0 and only four input files need to be 
> read by all MPI processes.
>
> I'm surprised this works at all.  It is mind-bending for me.  If you 
> have a better way for running on /mnt, let me know your suggestion.
>
> Above all, thank you very much.  I think once the executable is placed 
> on each /mnt disk for each node, maybe the run will go much faster.
>
> Tom Oppe
>
> -----Original Message-----
> From: Dustin Machi [mailto:dmachi at vbi.vt.edu]
> Sent: Thursday, December 20, 2012 5:53 AM
> To: Oppe, Thomas C ERDC-RDE-ITL-MS Contractor
> Cc: Paolo Di Tommaso; starcluster at mit.edu
> Subject: Re: [StarCluster] Terminating a cluster does not remove its 
> security group (UNCLASSIFIED)
>
>> I'm looking at using the /mnt local disk of the "master" node and
>> copying input files to each node's /mnt disk for those files that
>> each MPI process needs to read.  It's very cumbersome.  I have to
>> login to each node.
>
> It is a little cumbersome to do so, but you shouldn't have to login to 
> each node.  Just submit/run an SGE job that copies/rsyncs the NFS 
> volume mounted on each machine to /mnt/.  Is your problem a read 
> problem or a write problem?
>
> Dustin