[StarCluster] Terminating a cluster does not remove its security group (UNCLASSIFIED)

Thu Dec 20 11:50:01 EST 2012

Classification: UNCLASSIFIED
Caveats: FOUO

Dustin,

The run on /mnt just finished, and the result is disappointing.  Here is the same file listing I sent in the previous e-mail for my fastest run, now for this run:

      3798728704 2012-12-20 11:36 archv.0001_001_01.a
      3798728704 2012-12-20 11:47 archv.0001_001_02.a
      3798728704 2012-12-20 11:59 archv.0001_001_03.a
      3798728704 2012-12-20 12:09 archv.0001_001_04.a
      3798728704 2012-12-20 12:21 archv.0001_001_05.a
      3798728704 2012-12-20 12:32 archv.0001_001_06.a
      3798728704 2012-12-20 12:43 archv.0001_001_07.a
      3798728704 2012-12-20 12:54 archv.0001_001_08.a
      3798728704 2012-12-20 13:05 archv.0001_001_09.a
      3798728704 2012-12-20 13:16 archv.0001_001_10.a
      3798728704 2012-12-20 13:27 archv.0001_001_11.a
      3798728704 2012-12-20 13:38 archv.0001_001_12.a
      3798728704 2012-12-20 13:50 archv.0001_001_13.a
      3798728704 2012-12-20 14:01 archv.0001_001_14.a
      3798728704 2012-12-20 14:12 archv.0001_001_15.a
      3798728704 2012-12-20 14:22 archv.0001_001_16.a
      3798728704 2012-12-20 14:33 archv.0001_001_17.a
      3798728704 2012-12-20 14:44 archv.0001_001_18.a
      3798728704 2012-12-20 14:56 archv.0001_001_19.a
      3798728704 2012-12-20 15:07 archv.0001_001_20.a
      3798728704 2012-12-20 15:17 archv.0001_001_21.a
      3798728704 2012-12-20 15:28 archv.0001_001_22.a
      3798728704 2012-12-20 15:39 archv.0001_001_23.a
     57455771648 2012-12-20 16:27 archm.0001_001_12.a
      3798728704 2012-12-20 16:29 archv.0001_002_00.a

Each time step (a simulated hour) now takes 10-12 minutes, and the "archm.*.a" file appears 48 minutes after the timestamp of the "archv.0001_001_23.a" file (15:39 to 16:27).  So, assuming 12 minutes of that is computation, the "archm.*.a" file takes about 36 minutes to write.
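
For reference, the arithmetic behind those numbers can be checked with a short script.  This is only a sketch: the function names are made up for illustration, and the 36 minutes is the estimate from the listing above, not a measured write time.

```shell
#!/bin/bash
# Minutes between two HH:MM timestamps on the same day.
minutes_between() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, s, ":"); split(b, e, ":")
        print (e[1] * 60 + e[2]) - (s[1] * 60 + s[2])
    }'
}

# Effective write rate in MB/s (1 MB = 10^6 bytes) for a file of
# <bytes> bytes written over <minutes> minutes.
write_rate_mb_per_s() {
    local bytes=$1 minutes=$2
    awk -v b="$bytes" -v m="$minutes" 'BEGIN { printf "%.1f\n", b / (m * 60) / 1e6 }'
}

minutes_between 15:39 16:27          # prints 48
write_rate_mb_per_s 57455771648 36   # prints 26.6
```

So if the estimate holds, the 57 GB file is being written at roughly 26.6 MB/s.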

Tom Oppe

-----Original Message-----
From: Dustin Machi [mailto:dmachi at vbi.vt.edu] 
Sent: Thursday, December 20, 2012 9:12 AM
To: Oppe, Thomas C ERDC-RDE-ITL-MS Contractor
Cc: starcluster at mit.edu
Subject: Re: [StarCluster] Terminating a cluster does not remove its security group (UNCLASSIFIED)

Comments Inline.

> I'm running a HYCOM job now using the /mnt disk on the master node.  
> The problem is that big files are being written.  One of the files is 
> 57GB, and 24 other files of size 4GB are also written.  The job is 
> running more slowly than when run on an EBS volume, but that may be 
> because the MPI executable is on an EBS volume and is being run from 
> there.

When you say there are 153 GB of data written, is that the output, or is that the size of the input files you are copying to /mnt/?  The total time to copy 153 GB from the NFS share to each individual node, plus the time to do the actual work, may in fact be longer than the original approach of reading from the NFS share.  I wasn't really suggesting that any of the input files be relocated or that the MPI binary needed to be moved (I don't think the latter would have much of an impact at all, aside from the initial launch).  I was only suggesting that your MPI job write its output to /mnt/ and that you then copy those output files back over to the shared filesystem so you can get the final results.

If the input files are relatively static, you could do something like this:

1. Create a volume and put all the data on it.
2. Make a snapshot of this volume.
3. For each of the nodes, create a new volume from the snapshot in #2, and mount this.

At this point all the nodes will have all the data on their EBS volume at launch and no copying is necessary.

When the data is updated, you can either a) update the data, create a new snapshot, and relaunch the cluster so the volumes are generated from the new snapshot, or b) use rsync to update the base volume from a new copy whenever a node boots.  This will make node startup a little slower, but it should be pretty fast if only a subset of the data has changed.
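
A minimal sketch of option b), assuming the up-to-date copy lives on the shared filesystem and each node has its snapshot-based volume mounted locally.  The function name and the example paths are hypothetical; only the rsync options themselves are standard.

```shell
#!/bin/bash
# Bring <dst> in line with <src>, transferring only files that changed.
#   -a        preserve permissions, times, and directory structure
#   --delete  remove files from <dst> that no longer exist in <src>
# The trailing slashes make rsync copy the contents of <src> into <dst>.
sync_input_data() {
    local src=$1 dst=$2
    mkdir -p "$dst"
    rsync -a --delete "$src"/ "$dst"/
}

# Example call at boot (both paths are placeholders):
# sync_input_data /sharedWork/ABTP/ded/hycom/input /data/hycom/input
```

Note that --delete is destructive on the destination side, so it is worth double-checking the argument order before putting this in a boot script.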

Another question regarding how your MPI program writes its output: what size chunks does it write in?  Buffering the output and writing in larger chunks instead of a lot of tiny writes is probably going to perform better for you.  I use GlusterFS as a shared filesystem in my setup, and the key for it is to write in 64 KB chunks (as opposed to the more common 4 KB chunks that things like rsync use by default).
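
One rough way to see whether chunk size matters on a given filesystem is to write the same amount of data with two block sizes and compare the times.  The sketch below assumes GNU coreutils on Linux (dd with conv=fsync, date with %N); the function name, the 16 MiB test size, and the temporary file are arbitrary choices for illustration, and dd is only a stand-in for what the MPI program actually does.

```shell
#!/bin/bash
# Write <total_bytes> of zeros to <file> in blocks of <block_size> bytes.
# conv=fsync makes dd flush to disk before exiting, so the page cache
# does not hide the cost of the write.
write_in_blocks() {
    local file=$1 bs=$2 total=$3
    dd if=/dev/zero of="$file" bs="$bs" count=$(( total / bs )) conv=fsync 2>/dev/null
}

target=$(mktemp)                 # assumption: replace with a path on the filesystem under test
total=$(( 16 * 1024 * 1024 ))    # 16 MiB; arbitrary test size

for bs in 4096 65536; do
    start=$(date +%s.%N)
    write_in_blocks "$target" "$bs" "$total"
    end=$(date +%s.%N)
    awk -v bs="$bs" -v s="$start" -v e="$end" \
        'BEGIN { printf "block size %6d bytes: %.3f s\n", bs, e - s }'
done
rm -f "$target"
```

On a local disk the two timings may come out close; the difference is more likely to show up on a network filesystem such as NFS or GlusterFS.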

> I gave up trying to login to each node of the cluster using 
> "starcluster sshnode cluster <node>", mainly because I was getting ssh 
> errors that were preventing me from logging in to several nodes.  The 
> error message was something like "you may be in a man-in-the-middle 
> attack" or something like that.
>
This is normal for ssh.  When you ssh into a machine you've never connected to before, ssh has no way of knowing whether the host's key is valid, so it warns you before saving the key for that host.  Thereafter, if you connect to a machine with the same name, it will warn you if the key doesn't match what you have on file (which might indicate a man-in-the-middle attack).  You can delete the stored key from ~/.ssh/known_hosts by removing the line for that hostname.
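
Instead of editing the file by hand, ssh-keygen can remove the entry.  A small sketch, where the wrapper function and the host name "node001" are just placeholders:

```shell
#!/bin/bash
# Remove every stored key for <host> from a known_hosts file.
# ssh-keygen -R edits the file in place and keeps the previous
# contents in <file>.old.
forget_host() {
    local host=$1 file=${2:-$HOME/.ssh/known_hosts}
    ssh-keygen -R "$host" -f "$file"
}

# Example: forget_host node001
```

The next connection to that host will then prompt you to accept its current key.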

In general, I would try to do some benchmarking to identify where the bottleneck is.  Is it reading the input or writing the output that is slow?

Dustin

> My technique is very kludgy, but this is what I did.  I wrote a bash 
> script to copy four HYCOM input files that each MPI process needs to 
> be able to read to the /mnt disk.  /sharedWork is the name of my 
> standard EBS volume.
>
> root at master > more copyit
> #!/bin/bash
> cd /sharedWork/ABTP/ded/hycom/input
> /bin/cp blkdat.input limits ports.input /mnt
> /bin/cp patch.input_00336 /mnt/patch.input
> ls /mnt
> exit
>
> Then I wrote a small Fortran MPI program that executes this script 
> with a system call.  Since there are 16 MPI processes to an instance 
> (cc2.8xlarge), I only use one MPI process on each node to do the 
> copying.
>
> root at master > more copy.f
> program copy
> include 'mpif.h'
> call MPI_Init (ierror)
> call MPI_Comm_rank (MPI_COMM_WORLD, myid, ier)
> if (mod(myid, 16) .eq. 0) then
>   call system ('/sharedWork/copyit')
> endif
> call MPI_Finalize (ierror)
> stop
> end
>
> I compiled "copy.f" into an executable called "copy" using "mpiifort". 
>  Then in my HYCOM batch script, I run this small MPI job before 
> running the real job:
>
> mpiexec -n 336 /sharedWork/copy
>
> Unfortunately, I think I made an error in that the HYCOM executable is 
> being run from /sharedWork, the EBS volume.
>
> mpiexec -n 336 /sharedWork/hycom
>
> The job is running slowly, more slowly than if all files were on 
> /sharedWork.  For the next run, I think I will modify the "copyit"
> script to copy the HYCOM executable to /mnt also for each instance and 
> change the mpiexec line to:
>
> mpiexec -n 336 /mnt/hycom
>
> HYCOM needs a lot of input files, and they are all available on /mnt 
> corresponding to the master node (rank 0).  However, most of the input 
> files are read only by rank 0 and only four input files need to be 
> read by all MPI processes.
>
> I'm surprised this works at all.  It is mind-bending for me.  If you 
> have a better way for running on /mnt, let me know your suggestion.
>
> Above all, thank you very much.  I think once the executable is placed 
> on each /mnt disk for each node, maybe the run will go much faster.
>
> Tom Oppe
>
> -----Original Message-----
> From: Dustin Machi [mailto:dmachi at vbi.vt.edu]
> Sent: Thursday, December 20, 2012 5:53 AM
> To: Oppe, Thomas C ERDC-RDE-ITL-MS Contractor
> Cc: Paolo Di Tommaso; starcluster at mit.edu
> Subject: Re: [StarCluster] Terminating a cluster does not remove its 
> security group (UNCLASSIFIED)
>
>> I'm looking at using the /mnt local disk of the "master" node and 
>> copying input files to each node's /mnt disk for those files that 
>> each MPI process needs to read.  It's very cumbersome.  I have to 
>> login to each node.
>
> It is a little cumbersome to do so, but you shouldn't have to login to 
> each node.  Just submit/run an SGE job that copies/rsyncs the NFS 
> volume mounted on each machine to /mnt/.  Is your problem a read 
> problem or a write problem?
>
> Dustin