[StarCluster] Fast shared or local storage?

Fri May 9 01:56:23 EDT 2014

On Thu, May 8, 2014 at 4:46 PM, Cedar McKay <cmckay at uw.edu> wrote:

> My use-case is to blast (briefly:  blast is an alignment search tool)
>  against a large ~150GB read-only reference database. I'm struggling to
> figure out how to give each of my nodes access to this database while
> maximizing performance. The shared volume need only be read-only, but write
> would be nice too.
>

Chris from BioTeam (http://bioteam.net/) knows much more about BLAST,
but I will try to answer the questions from an AWS developer's point of view.

(BTW, in case you didn't know: in some cases mpiBLAST can give you
*super-linear speedup* when the input DB is larger than the main memory
of each node.)

> Does anyone have advice about the best approach?
> My ideas:
>
>    - After starting cluster, copy database to ephemeral storage of each
>    node?
>
If you put some simple logic in the SGE job script to pull the DB from S3
and store it locally the first time blastall runs on the node (i.e.
subsequent jobs read from the local copy), then this would give you the
best performance and the lowest cost.

* Note that SGE can schedule multiple jobs onto the same node, so you will
need some logic to make sure that only one transfer is done per node.

* Most (but not all) instance types give you over 150GB of ephemeral
storage that you can read/write without additional cost!

* Note that intra-region S3 to EC2 data transfer is free, but the speed was
below 80 MB/s last time we benchmarked it (even with instances that have
1GbE), so the overhead for the initial transfer will be around 30 mins.

* IMO, this is the easiest option, as you don't need to set anything else up
and all you need is a few lines of shell scripting.
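
The "only one transfer per node" logic can be sketched as follows. This is a
minimal illustration, not a tested StarCluster recipe: it is written in Python
rather than shell, and the function name, the lock/marker file names, and the
`fetch` callable are all hypothetical. It assumes the job runs on Linux (for
`fcntl.flock`) and that `local_dir` is on the node's ephemeral storage.

```python
import fcntl
import os


def ensure_local_db(local_dir, fetch):
    """Populate local_dir by calling fetch(local_dir) at most once per node.

    Returns True if this call did the transfer, False if a complete local
    copy was already present. `fetch` is a hypothetical callable that
    downloads the database from S3 into the given directory.
    """
    os.makedirs(local_dir, exist_ok=True)
    lock_path = os.path.join(local_dir, ".fetch.lock")
    done_path = os.path.join(local_dir, ".fetch.done")

    with open(lock_path, "w") as lock_file:
        # Blocks until any other job on this node releases the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if os.path.exists(done_path):
                return False
            fetch(local_dir)
            # The marker is written only after the transfer has finished,
            # so an interrupted transfer is retried by the next job.
            with open(done_path, "w") as marker:
                marker.write("ok\n")
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Each job would call `ensure_local_db()` before running blastall. Jobs that
land on the same node while the first transfer is in progress simply wait on
the lock and then find the marker file.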

>    - Create separate EBS volumes for each node starting from a snapshot
>    containing my reference database. But I don't see a way to automate this.
>
Keep in mind that if you need to read 150GB each time a Blast job runs,
then it would cost you $0.49 for EBS I/O operations alone. Since main
memory can't cache that much data, every job will have to re-read the data
from EBS.
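
For anyone who wants to redo the arithmetic, here is a rough sketch of the two
back-of-envelope numbers in this reply. The 16 KiB I/O size and the $0.05 per
million I/O requests are assumptions that are not stated in this thread; they
were picked because they reproduce the $0.49 figure, so check them against
current EBS pricing. The transfer time uses 80 MB/s, which is an upper bound
on the measured speed, so the real time would be somewhat longer.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2
KIB = 1024

db_bytes = 150 * GIB

# S3 -> EC2 copy at the ~80 MB/s upper bound from the benchmark.
transfer_minutes = db_bytes / (80 * MIB) / 60

# Assumptions (not from the thread): 16 KiB per I/O request and
# $0.05 per million I/O requests.
io_requests = db_bytes / (16 * KIB)
ebs_io_cost = io_requests / 1e6 * 0.05

print(f"transfer: {transfer_minutes:.0f} min")                # transfer: 32 min
print(f"EBS I/O cost per full read: ${ebs_io_cost:.2f}")      # $0.49
```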

>
>    - glusterfs. I saw reference to a glusterfs starcluster plugin a while
>    back, but it doesn't seem to be in the current list of plugins.
>
IMO, it's too much work if all you need is to read input data.

>
>    - s3fs. But is random access within a file poor? Even with caching
>    turned on?
>
This may be the second-best option, as I assume you will have lots of queued
jobs: the 150GB of input data is read once from S3 and is then accessed
many times locally.

>    - Stick with default approach (nfs share a volume), but provision the
>    headnode for faster networking? Provisioned IOPS EBS volumes? Any other
>    simple optimizations?
>
If you just have a few execution (slave) nodes, it would work too. Just
create a PIOPS EBS volume, and then mount it & NFS share it by specifying
the values in the StarCluster config file:

http://star.mit.edu/cluster/docs/latest/manual/configuration.html#amazon-ebs-volumes
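
As a rough illustration, the relevant part of the config could look like the
fragment below. The section names `blastdb` and `smallcluster`, the volume ID,
and the mount path are placeholders, and the PIOPS volume itself has to be
created beforehand; see the page above for the full list of settings.

```ini
[volume blastdb]
# ID of the existing (e.g. Provisioned IOPS) EBS volume; placeholder value
VOLUME_ID = vol-xxxxxxxx
# Where the volume is mounted on the master and NFS-shared to the nodes
MOUNT_PATH = /blastdb

[cluster smallcluster]
# ... other cluster settings ...
VOLUMES = blastdb
```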

For a larger number of nodes, the NFS server becomes the bottleneck. S3 is
much more scalable than a single NFS master. I would copy the DB from S3 to
local instance storage, or use s3fs, if I had, say, more than 8 (YMMV) nodes
in the cluster.

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html

> I really appreciate any help.
> Thanks,
> Cedar