[StarCluster] Fast shared or local storage?
Cory Dolphin
wcdolphin at gmail.com
Fri May 9 14:18:51 EDT 2014
Hi Cedar,
I am relatively new to AWS and StarCluster, but my suggestion would be to
use C3 or R3 instances, launched in the same placement group, and just
download the data from S3 to the cluster's master node.
1. I think your best bet might actually be to download the 150GB to an
instance store and share it over NFS. In my experience, you can download
from S3 at a couple hundred MB/s if the files are large enough and you
open multiple connections. For example, one of my clusters downloads 7
different 2GB files in parallel, each at ~30MB/s (I don't think I am
hitting the limits of the instance). I expect that if your files are
larger, you will see even better performance. At around 300MB/s,
downloading the whole 150GB database would take about 8 minutes. If you
use an SSD-backed instance like the R3 types, the disk I/O will be free
and fast.
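The parallel download described above might be sketched as follows; the bucket name, chunk names, and destination path are placeholders, and the AWS CLI is assumed to be installed:

```shell
#!/bin/sh
# Sketch: fetch several database chunks from S3 in parallel, one
# connection per chunk, so aggregate throughput goes well beyond a
# single stream. Bucket, chunk names, and paths are placeholders.
fetch_db_chunks() {
    bucket="$1"; dest="$2"; shift 2
    mkdir -p "$dest"
    for chunk in "$@"; do
        aws s3 cp "s3://$bucket/$chunk" "$dest/$chunk" &  # background job
    done
    wait    # block until every transfer has finished
}

# e.g. fetch_db_chunks my-blast-bucket /mnt/ephemeral/db nt.00 nt.01 nt.02
```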
2. With any of the cluster compute instances, you get enhanced networking
(http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html)
if you launch within the same placement group. I expect the 10G ethernet
with enhanced networking will provide sufficient I/O performance. If you
are still limited by I/O on NFS, you can also cache files on the workers'
instance storage, if they access files multiple times.
Just my two cents, anyone feel free to correct me.
Good luck, and I'd be interested to hear about what you come up with!
Cory
On Fri, May 9, 2014 at 1:08 PM, Cedar McKay <cmckay at uw.edu> wrote:
> Thanks for the very useful reply. I think I'm going to go with the s3fs
> option and cache to local ephemeral drives. A big blast database is split
> into many parts, and I'm pretty sure that not every file in a blast db is
> read every time, so this way blasting can proceed immediately. The parts
> of the blast database are downloaded from s3 on demand and cached locally.
> If there was much writing, I'd probably be reluctant to use this approach,
> because the s3 eventual consistency model seems to require tolerating
> write failures at the application level. I'll write my results to a shared
> nfs volume.
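An s3fs mount along those lines (read-only, with the cache on an ephemeral drive) could be sketched as below; the bucket name and paths are placeholders, and `use_cache` is the standard s3fs cache option:

```shell
#!/bin/sh
# Sketch: mount the bucket read-only via s3fs, caching fetched objects
# on local ephemeral storage so each DB part downloads from S3 once on
# first access. Bucket and paths are placeholders.
mount_blastdb() {
    bucket="$1"; mnt="$2"; cache="$3"
    mkdir -p "$mnt" "$cache"
    # -o ro: read-only, sidestepping S3's eventual-consistency issues
    # -o use_cache=DIR: keep downloaded objects in a local cache
    s3fs "$bucket" "$mnt" -o ro -o use_cache="$cache"
}

# e.g. mount_blastdb my-blast-bucket /mnt/blastdb /mnt/ephemeral/s3cache
```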
>
> I thought about mpiBlast and will probably explore it, but I read some
> reports that its XML output isn't exactly the same as the official NCBI
> blast output, and may break biopython parsing. I haven't confirmed this,
> and will probably compare the two techniques.
>
> Thanks again!
>
> Cedar
>
>
>
> On May 8, 2014, at 10:56 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
>
> On Thu, May 8, 2014 at 4:46 PM, Cedar McKay <cmckay at uw.edu> wrote:
>
>> My use-case is to blast (briefly: blast is an alignment search tool)
>> against a large ~150GB read-only reference database. I'm struggling to
>> figure out how to give each of my nodes access to this database while
>> maximizing performance. The shared volume need only be read-only, but write
>> would be nice too.
>>
>
> Chris from the bioteam (http://bioteam.net/) knows much more about Blast,
> but I will try to answer the questions from an AWS developer's point of
> view.
>
> (BTW, in case you didn't know... in some cases mpiBlast can give you *super-linear
> speedup* when the input DB is larger than the main memory of each node.)
>
>
>> Does anyone have advice about the best approach?
>> My ideas:
>>
>> - After starting cluster, copy database to ephemeral storage of each
>> node?
>>
> If you put some simple logic in the SGE job script to pull the DB from
> S3 and store it locally the first time blastall runs on the node (i.e.,
> subsequent jobs read from the local copy), then this would give you the
> best performance and the lowest cost.
>
> * Note that SGE can schedule multiple jobs onto the same node, so you will
> need some logic to make sure that only 1 transfer is done.
>
> * Most (but not all) instance types give you over 150GB of ephemeral
> storage that you can read/write without additional cost!
>
> * Note that intra-region S3 to EC2 data transfer is free, but the speed
> was below 80 MB/s last time we benchmarked it (even with instances that
> have 1GbE), so the overhead for the initial transfer will be around 30 mins.
>
> * IMO, this is the easiest as you don't need to set anything else up and
> all you need is a few lines of shell scripting.
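A minimal sketch of those "few lines of shell scripting", including the once-per-node guard mentioned above (bucket and paths are placeholders; flock is assumed available, as on stock Linux):

```shell
#!/bin/sh
# Sketch: run at the top of each SGE job script. The first job on a
# node copies the DB from S3 to ephemeral storage; flock ensures that
# when SGE schedules several jobs onto the same node, only one of them
# performs the transfer. Bucket and paths are placeholders.
DB_DIR=/mnt/ephemeral/blastdb

ensure_db() {
    mkdir -p "$DB_DIR"
    (
        flock 9                       # serialize on the lock file (fd 9)
        if [ ! -f "$DB_DIR/.complete" ]; then
            aws s3 cp --recursive "s3://my-blast-bucket/db/" "$DB_DIR/"
            touch "$DB_DIR/.complete" # marker: later jobs skip the copy
        fi
    ) 9>"$DB_DIR/.lock"
}

# ensure_db && blastall -d "$DB_DIR/nt" ...
```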
>
>
>> - Create separate EBS volumes for each node starting from a snapshot
>> containing my reference database. But I don't see a way to automate this.
>>
> Keep in mind that if you need to read 150GB each time a Blast job runs,
> it would cost you $0.49 for EBS I/O operations alone. Since main memory
> can't cache that much data, you will need to re-read the data from EBS
> on every run.
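The $0.49 figure is consistent with a back-of-envelope check, assuming the then-standard EBS rate of about $0.05 per million I/O requests and reads metered in 16KB chunks (both assumptions, not stated in the original mail):

```shell
#!/bin/sh
# Rough check of the $0.49 estimate. Assumes "150GB" means 150 GiB and
# that a sequential scan is metered as 16KB I/O requests billed at
# ~$0.05 per million (the standard-EBS rate at the time).
requests=$(( 150 * 1024 * 1024 * 1024 / (16 * 1024) ))   # ~9.8M requests
cents=$(( requests * 5 / 1000000 ))                      # $0.05 per 1M
echo "~${requests} I/O requests, ~${cents} cents per full scan"
```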
>
>
>>
>> - glusterfs. I saw reference to a glusterfs starcluster plugin a
>> while back, but it doesn't seem to be in the current list of plugins.
>>
>> IMO, it's too much work if all you need is to read input data.
>
>
>>
>> - s3fs. But is random access within a file poor? Even with caching
>> turned on?
>>
> This may be the second-best option, as I assume you will have lots of
> queued jobs, so the 150GB of input data is read once from S3 and then
> accessed many times locally.
>
>
>> - Stick with default approach (nfs share a volume), but provision the
>> headnode for faster networking? Provisioned IOPS EBS volumes? Any other
>> simple optimizations?
>>
> If you just have a few execution (slave) nodes, it would work too. Just
> create a PIOPS EBS volume, then mount it and NFS-share it by specifying
> the values in the StarCluster config file:
>
>
> http://star.mit.edu/cluster/docs/latest/manual/configuration.html#amazon-ebs-volumes
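The relevant config fragment looks roughly like this (volume ID, mount path, and cluster template name are placeholders; see the page above for the full set of options):

```ini
# In your StarCluster config file; values below are placeholders.
[volume blastdb]
VOLUME_ID = vol-xxxxxxxx
MOUNT_PATH = /mnt/blastdb

# then reference the volume from your cluster template:
[cluster smallcluster]
VOLUMES = blastdb
```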
>
> For a larger number of nodes, the NFS server is still the bottleneck. S3
> is much more scalable than a single NFS master. I would copy the DB from
> S3 to local instance storage, or use s3fs, if I had, say, over 8 (YMMV)
> nodes in the cluster.
>
> Rayson
>
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>
>
>
>>
>>
>>
>> I really appreciate any help.
>> Thanks,
>> Cedar
>>
>> _______________________________________________
>> StarCluster mailing list
>> StarCluster at mit.edu
>> http://mailman.mit.edu/mailman/listinfo/starcluster