[StarCluster] Terminating a cluster does not remove its security group (UNCLASSIFIED)
Oppe, Thomas C ERDC-RDE-ITL-MS Contractor
Thomas.C.Oppe at erdc.dren.mil
Thu Dec 20 11:31:35 EST 2012
Classification: UNCLASSIFIED
Caveats: FOUO
Dustin,
HYCOM does use fairly big input files. An "ls -l" in the /input directory shows a total of 15319088 KB, so the input files together come to 14.61 GB. However, reading the input files at the start of the run goes pretty quickly; it is the writing of several big binary output files that is slow. The model simulates one day, and at every simulated hour a file of 3798728704 bytes is written. At the 23rd hour, in addition to the hourly file, a huge file of 57455771648 bytes is written, followed by one final hourly file of 3798728704 bytes. For my fastest run, using 336 MPI processes on a standard EBS volume, here are the sizes and last-modification times of these large output files:
3798728704 Dec 16 03:49 archv.0001_001_01.a
3798728704 Dec 16 03:58 archv.0001_001_02.a
3798728704 Dec 16 04:07 archv.0001_001_03.a
3798728704 Dec 16 04:17 archv.0001_001_04.a
3798728704 Dec 16 04:25 archv.0001_001_05.a
3798728704 Dec 16 04:34 archv.0001_001_06.a
3798728704 Dec 16 04:48 archv.0001_001_07.a
3798728704 Dec 16 04:56 archv.0001_001_08.a
3798728704 Dec 16 05:04 archv.0001_001_09.a
3798728704 Dec 16 05:13 archv.0001_001_10.a
3798728704 Dec 16 05:21 archv.0001_001_11.a
3798728704 Dec 16 05:29 archv.0001_001_12.a
3798728704 Dec 16 05:38 archv.0001_001_13.a
3798728704 Dec 16 05:46 archv.0001_001_14.a
3798728704 Dec 16 05:54 archv.0001_001_15.a
3798728704 Dec 16 06:03 archv.0001_001_16.a
3798728704 Dec 16 06:11 archv.0001_001_17.a
3798728704 Dec 16 06:19 archv.0001_001_18.a
3798728704 Dec 16 06:28 archv.0001_001_19.a
3798728704 Dec 16 06:36 archv.0001_001_20.a
3798728704 Dec 16 06:44 archv.0001_001_21.a
3798728704 Dec 16 06:53 archv.0001_001_22.a
3798728704 Dec 16 07:01 archv.0001_001_23.a
57455771648 Dec 16 07:32 archm.0001_001_12.a
3798728704 Dec 16 07:36 archv.0001_002_00.a
Notice that each hourly time step, including the write of the "archv.*.a" file, takes 8-10 minutes (except around the 6th hour, where extra calculations are made), and the writes of the first 23 "archv.*.a" files go pretty quickly (15-30 seconds each, as I have timed them). But something disastrous happens at the write of the big "archm.*.a" file, which is only 15.125 times larger than an "archv.*.a" file: it is written 31 minutes after the previous "archv.0001_001_23.a". Assuming 9 minutes of that is the inter-hour computation, the write itself took about 22 minutes, and I have seen it take much longer. This was a serial I/O run, in which all write I/O is done by rank 0. For a parallel I/O run, in which selected MPI processes participate in writing these files using MPI-2 I/O calls, the situation is much worse; I have seen the "archm.*.a" file take more than an hour to write in parallel I/O mode. The HYCOM developer says there is no computation at all between the writing of the "archm.*.a" file and the last "archv.*.a" file, so the 4-minute gap between them is pure write time for that last "archv.*.a" file. Whatever went wrong during the "archm.*.a" write evidently degraded the write performance of the last "archv.*.a" file as well.
Your suggestion of keeping the input files on the EBS volume, and just redirecting the output files to /mnt is good, but I'm not sure how to do that. For HYCOM, the output files always appear in the same directory as the input files. As a benchmarker of HYCOM, I am not really allowed to modify the source code to redirect the output files elsewhere. Hence I had to stand on my head to locate the input files on /mnt. I will ask the HYCOM developer if it is possible to have the input and output files in different directories or filesystems.
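One idea I may try, without touching the HYCOM source: run from a scratch directory on /mnt and symlink the inputs from the EBS volume into it, so the outputs are created on local disk while the reads still come from EBS. A sketch (untested; paths are from my setup):

```shell
#!/bin/bash
# Sketch: run HYCOM from a scratch directory on /mnt so its outputs land
# on local disk, while the inputs stay on the EBS volume as symlinks.
# SRC/RUN defaults are from my setup; adjust as needed.
SRC=${SRC:-/sharedWork/ABTP/ded/hycom/input}
RUN=${RUN:-/mnt/hycom_run}
mkdir -p "$RUN"
# Reads follow the links back to EBS; new output files are created on /mnt.
for f in "$SRC"/*; do
    ln -s "$f" "$RUN/$(basename "$f")"
done
cd "$RUN"
# mpiexec -n 336 /sharedWork/hycom   # launched from here, outputs go to /mnt
```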
Your suggestion of using buffered I/O and increasing the size of the buffer chunks is fine, but again I am not sure how to do that with binary sequential access files. In my SGE script file, I have:
export FORT_BUFFERED='YES'
as an Intel Fortran runtime environment variable, but I think it affects only ASCII files, not the big binary files "archv.*.a" and "archm.*.a". Can I set the buffer size myself? As mentioned above, I suspect the I/O write performance deteriorates during the write of the "archm.*.a" file and then affects the last "archv.*.a" file. Perhaps the buffer chunk size is being reset to a smaller value when the big file is written.
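I may also experiment with the other Intel Fortran runtime variables that, as I read the docs, control the buffer geometry for units opened without explicit BLOCKSIZE=/BUFFERCOUNT= specifiers; whether they actually apply to these unformatted files is exactly what I need to verify. In the SGE script this would look like:

```shell
# Intel Fortran runtime buffering knobs for the SGE script.  The values
# are guesses to experiment with, not recommendations.
export FORT_BUFFERED=yes
export FORT_BLOCKSIZE=1048576    # I/O block size in bytes (1 MB)
export FORT_BUFFERCOUNT=4        # number of blocks buffered per unit
```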
You mentioned using GlusterFS with 64k buffer chunks. I am using either standard EBS volumes or what Amazon calls "provisioned IOPS" EBS volumes. I think they are NFS mounted and shared among the nodes. Do you think GlusterFS would be faster? Can you point me to information about how to ask for a GlusterFS file system of 200-300 GB size and how to specify the buffer chunk size?
Amazon warns that the "cc2.8xlarge" instances cannot take full advantage of the provisioned IOPS EBS volumes and recommends "m2.4xlarge" instances instead, but I think those instance types are slower than "cc2.8xlarge". Perhaps the "m2.4xlarge" instances sit in clusters with a separate network for I/O, so that MPI messages and I/O traffic travel on separate networks.
Thank you for the information. As you can see, I need to know more about different file systems available on AWS, how to set buffer chunk sizes, and most of all, how the big files are being written. All I know is that in serial I/O mode, rank 0 uses collective MPI operations to gather data from all the other MPI processes and then writes the data to one big file. Perhaps some parameter associated with the MPI collective operation needs to be resized.
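If the MPI library's MPI-IO layer is ROMIO-based (MPICH and its derivatives are), one thing I could try without source changes is tuning the collective-buffering parameters through a hints file named by the ROMIO_HINTS environment variable. A sketch, with hint values that are guesses to experiment with:

```shell
# Sketch: supply MPI-IO collective-buffering hints via ROMIO's hints file
# mechanism (assumes a ROMIO-based MPI-IO implementation).  The values are
# starting points for experiments, not recommendations.
cat > "$HOME/romio_hints" <<'EOF'
romio_cb_write enable
cb_buffer_size 16777216
cb_nodes 8
EOF
export ROMIO_HINTS="$HOME/romio_hints"
# mpiexec -n 336 /sharedWork/hycom
```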
Tom Oppe
-----Original Message-----
From: Dustin Machi [mailto:dmachi at vbi.vt.edu]
Sent: Thursday, December 20, 2012 9:12 AM
To: Oppe, Thomas C ERDC-RDE-ITL-MS Contractor
Cc: starcluster at mit.edu
Subject: Re: [StarCluster] Terminating a cluster does not remove its security group (UNCLASSIFIED)
Comments Inline.
> I'm running a HYCOM job now using the /mnt disk on the master node.
> The problem is that big files are being written. One of the files is
> 57GB, and 24 other files of size 4GB are also written. The job is
> running more slowly than when run on an EBS volume, but that may be
> because the MPI executable is on an EBS volume and is being run from
> there.
When you say there are 153 GB of data written, is that the output, or is that the size of the input files you are copying to /mnt/? The total time to copy 153 GB from the NFS share to each individual node, plus the time to do the actual work, may in fact be longer than the original reading from the NFS share. I wasn't really suggesting that any of the input files be relocated or that the MPI binary be moved (I don't think the latter would have much of an impact at all, aside from initially launching). I was only suggesting that your MPI job write its output to /mnt/ and then copy those output files back over to the shared filesystem so you can get the final results.
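Roughly what I mean, as a sketch (paths and the executable are placeholders for your setup):

```shell
#!/bin/bash
# Sketch: give the job a working directory on local /mnt, run there, then
# copy the results back to the shared filesystem when the job finishes.
SCRATCH=${SCRATCH:-/mnt/job_$$}
RESULTS=${RESULTS:-/sharedWork/results}
mkdir -p "$SCRATCH" "$RESULTS"
cd "$SCRATCH"
# mpiexec -n 336 /sharedWork/hycom      # output files land on /mnt here
cp -p archv.*.a archm.*.a "$RESULTS"/   # collect the results afterward
```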
If the input files are relatively static, you could do something like
this:
1. Create a volume and put all the data on it.
2. Make a snapshot of this volume.
3. For each of the nodes, create a new volume from the snapshot in #2, and mount this.
At this point all the nodes will have all the data on their EBS volume at launch and no copying is necessary.
When the data is updated, you can either a) update the data, create a new snapshot and relaunch the cluster so volumes are generated from the new snapshot, or b) Use rsync to update the base volume from a new copy whenever a node boots. This will make the node startup a little bit slower, but should be pretty fast if only a subset of the data has changed.
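With the AWS CLI, the flow in steps 1-3 looks roughly like this (volume/snapshot/instance IDs and the availability zone are placeholders; the ec2-api-tools have equivalent commands). DRY_RUN=echo just prints the commands:

```shell
#!/bin/bash
# Sketch of the snapshot workflow with the AWS CLI.  All IDs and the zone
# are placeholders.  By default the commands are only echoed (DRY_RUN=echo);
# clear DRY_RUN to actually run them with real IDs and credentials.
DRY_RUN=${DRY_RUN:-echo}
$DRY_RUN aws ec2 create-snapshot --volume-id vol-xxxxxxxx \
    --description "hycom input data"
# Then, once per node: create a volume from the snapshot and attach it.
$DRY_RUN aws ec2 create-volume --snapshot-id snap-xxxxxxxx \
    --availability-zone us-east-1a
$DRY_RUN aws ec2 attach-volume --volume-id vol-yyyyyyyy \
    --instance-id i-zzzzzzzz --device /dev/sdf
```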
Another question in regards to the write functionality of your MPI program, is what size chunks does it try to write with. Buffering the output and writing in larger chunks instead of a lot of tiny writes is probably going to perform better for you. I use GlusterFS as a shared filesystem in my setup, and the key for it is to have write in 64k chunks (as opposed to the more common 4kb chunks that things like rsync use by default).
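A quick way to see how much chunk size matters on a given mount is to time the same write with different dd block sizes (TARGET is a placeholder for the mount under test):

```shell
#!/bin/bash
# Probe the effect of write chunk size: write the same 256 MB with 4 KB
# blocks and then with 64 KB blocks, and compare the throughput dd reports.
TARGET=${TARGET:-/tmp}
dd if=/dev/zero of="$TARGET/ddtest.4k"  bs=4k  count=65536 conv=fsync
dd if=/dev/zero of="$TARGET/ddtest.64k" bs=64k count=4096  conv=fsync
rm -f "$TARGET/ddtest.4k" "$TARGET/ddtest.64k"
```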
> I gave up trying to login to each node of the cluster using
> "starcluster sshnode cluster <node>", mainly because I was getting ssh
> errors that were preventing me from logging in to several nodes. The
> error message was something like "you may be in a man-in-the-middle
> attack" or something like that.
>
This is normal for ssh. When you ssh into a machine you've never ssh'd into before, it doesn't know if the host's signature is valid, so it warns you before saving it associated with the host you are connecting to. Thereafter, if you connect to a machine with the same name, it will warn you if the signature doesn't match what you have on file (which might be a man-in-the-middle attack). You can delete the signature in ~/.ssh/known_hosts by removing the line with same hostname.
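Rather than editing known_hosts by hand, ssh-keygen can remove the offending entry for you (HOST is a placeholder for the node's name or IP):

```shell
# Remove the stale host key for one host with ssh-keygen instead of
# hand-editing ~/.ssh/known_hosts.  HOST is a placeholder.
HOST=${HOST:-node001}
mkdir -p ~/.ssh && touch ~/.ssh/known_hosts
ssh-keygen -R "$HOST" -f ~/.ssh/known_hosts || true   # ignore "not found"
```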
In general, I would try to do some benchmarking to identify where the bottleneck is. Is it Input Read or Output Write that is slow?
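For a rough first cut, dd can separate the read figure from the write figure on a given filesystem (TARGET is a placeholder; note the read-back may be served from the page cache, so it is an upper bound unless caches are dropped first, which needs root):

```shell
#!/bin/bash
# Rough read-vs-write probe with dd.  The read-back can come from the page
# cache; for a cold read, drop caches first as root:
#   echo 3 > /proc/sys/vm/drop_caches
TARGET=${TARGET:-/tmp}
dd if=/dev/zero of="$TARGET/probe" bs=1M count=256 conv=fsync   # write test
dd if="$TARGET/probe" of=/dev/null bs=1M                        # read test
rm -f "$TARGET/probe"
```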
Dustin
> My technique is very kludgy, but this is what I did. I wrote a bash
> script to copy four HYCOM input files that each MPI process needs to
> be able to read to the /mnt disk. /sharedWork is the name of my
> standard EBS volume.
>
> root at master > more copyit
> #!/bin/bash
> cd /sharedWork/ABTP/ded/hycom/input
> /bin/cp blkdat.input limits ports.input /mnt
> /bin/cp patch.input_00336 /mnt/patch.input
> ls /mnt
> exit
>
> Then I wrote a small Fortran MPI program that executes this script
> with a system call. Since there are 16 MPI processes to an instance
> (cc2.8xlarge), I only use one MPI process on each node to do the
> copying.
>
> root at master > more copy.f
> program copy
> include 'mpif.h'
> call MPI_Init (ierror)
> call MPI_Comm_rank (MPI_COMM_WORLD, myid, ier)
> if (mod(myid, 16) .eq. 0) then
> call system ('/sharedWork/copyit')
> endif
> call MPI_Finalize (ierror)
> stop
> end
>
> I compiled "copy.f" into an executable called "copy" using "mpiifort".
> Then in my HYCOM batch script, I run this small MPI job before
> running the real job:
>
> mpiexec -n 336 /sharedWork/copy
>
> Unfortunately, I think I made an error in that the HYCOM executable is
> being run from /sharedWork, the EBS volume.
>
> mpiexec -n 336 /sharedWork/hycom
>
> The job is running slowly, more slowly than if all files were on
> /sharedWork. For the next run, I think I will modify the "copyit"
> script to copy the HYCOM executable to /mnt also for each instance and
> change the mpiexec line to:
>
> mpiexec -n 336 /mnt/hycom
>
> HYCOM needs a lot of input files, and they are all available on /mnt
> corresponding to the master node (rank 0). However, most of the input
> files are read only by rank 0 and only four input files need to be
> read by all MPI processes.
>
> I'm surprised this works at all. It is mind-bending for me. If you
> have a better way for running on /mnt, let me know your suggestion.
>
> Above all, thank you very much. I think once the executable is placed
> on each /mnt disk for each node, maybe the run will go much faster.
>
> Tom Oppe
>
> -----Original Message-----
> From: Dustin Machi [mailto:dmachi at vbi.vt.edu]
> Sent: Thursday, December 20, 2012 5:53 AM
> To: Oppe, Thomas C ERDC-RDE-ITL-MS Contractor
> Cc: Paolo Di Tommaso; starcluster at mit.edu
> Subject: Re: [StarCluster] Terminating a cluster does not remove its
> security group (UNCLASSIFIED)
>
>> I'm looking at using the /mnt local disk of the "master" node and
>> copying input files to each node's /mnt disk for those files that
>> each MPI process needs to read. It's very cumbersome. I have to
>> login to each node.
>
> It is a little cumbersome to do so, but you shouldn't have to login to
> each node. Just submit/run an SGE job that copies/rsyncs the NFS
> volume mounted on each machine to /mnt/. Is your problem a read
> problem or a write problem?
>
> Dustin
>