[StarCluster] Fast shared or local storage? (Cedar McKay)

MacMullan, Hugh hughmac at wharton.upenn.edu
Wed May 14 19:14:57 EDT 2014


Super info, thanks Rayson! It's been a number of months since I played with s3fs.

> On May 14, 2014, at 17:35, "Rayson Ho" <raysonlogin at gmail.com> wrote:
> 
> The "64GB limit" in s3fs should only affect writes to S3, as AWS
> limits 10,000 parts in a multipart upload operation, and s3fs by
> default uses 10MB part size. Thus the max filesize is 100GB, but the
> s3fs developers limit it to 64GB (they say it is a nicer number). That
> code was removed from the latest git version, and the limit now is
> 10000 * multipart size (so if multipart_size is set to a larger number
> than 10MB, then the max file size could be larger than 100GB).
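> 
> As a back-of-the-envelope check of that arithmetic, here is a minimal
> Python sketch (the 10,000-part cap and the 10MB default part size are
> the figures quoted above):
> 
>     # S3 allows at most 10,000 parts per multipart upload, so the
>     # largest file s3fs can write is (number of parts) * (part size).
>     MAX_PARTS = 10000
>     default_part_size = 10 * 10**6            # s3fs default: 10MB
>     print(MAX_PARTS * default_part_size / 10**9)   # 100 (GB)
> 
>     bigger_part_size = 64 * 10**6             # e.g. multipart_size set to 64MB
>     print(MAX_PARTS * bigger_part_size / 10**9)    # 640 (GB)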
> 
> On the other hand, S3 supports concurrent reads from S3 via "Range
> GET"s, which is what is used by s3fs. As S3 does not limit the number
> of concurrent GETs, the 100GB limit shouldn't apply to reads from S3.
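> 
> To make the "Range GET" idea concrete, here is a minimal sketch of a
> ranged read in Python with boto3 (the client library, bucket, and key
> are placeholders for illustration only; s3fs issues the equivalent HTTP
> requests internally):
> 
>     import boto3
> 
>     s3 = boto3.client("s3")
>     # Fetch only the first 10MB of a large object; several such requests
>     # can run concurrently against different byte ranges of the same key.
>     resp = s3.get_object(Bucket="my-blastdb-bucket", Key="nr.00.phr",
>                          Range="bytes=0-10485759")
>     chunk = resp["Body"].read()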
> 
> Rayson
> 
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
> 
> 
> 
> 
>> On 5/13/14, MacMullan, Hugh <hughmac at wharton.upenn.edu> wrote:
>> I remember having to modify the s3fs source to handle files greater than ...
>> I think 64GB, FYI, in case that's important. No idea if that's still a
>> problem, but if anyone runs into issues with big files and s3fs, poke the
>> code, or just let me know if you need further details.
>> 
>> -Hugh
>> 
>> On May 13, 2014, at 16:07, "Cedar McKay" <cmckay at uw.edu> wrote:
>> 
>> Great, thanks for all the info, guys. I ended up mounting my read-only
>> databases as an s3fs volume, then designating the ephemeral storage as the
>> cache. Hopefully this will give me the best of all worlds: fast local
>> storage and lazy downloading.
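>> 
>> For anyone curious, a plugin along these lines can look roughly like the
>> following sketch (the bucket name, mount point, and cache path are
>> placeholders; it assumes s3fs is installed on the AMI and S3 credentials
>> are already configured):
>> 
>>     from starcluster.clustersetup import ClusterSetup
>> 
>>     class S3fsBlastDB(ClusterSetup):
>>         """Mount a read-only blast DB bucket with s3fs, caching on ephemeral disk."""
>> 
>>         def _mount(self, node):
>>             node.ssh.execute("mkdir -p /blastdb /mnt/s3fs-cache")
>>             # /mnt is the ephemeral drive on most instance types; use_cache
>>             # copies each file there lazily the first time it is read.
>>             node.ssh.execute("s3fs my-blastdb-bucket /blastdb "
>>                              "-o ro -o use_cache=/mnt/s3fs-cache")
>> 
>>         def run(self, nodes, master, user, user_shell, volumes):
>>             for node in nodes:
>>                 self._mount(node)
>> 
>>         def on_add_node(self, node, nodes, master, user, user_shell, volumes):
>>             self._mount(node)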
>> 
>> I haven't tested much yet, but if I have problems with this setup I'll
>> probably skip the s3fs thing and just load the database straight onto the
>> ephemeral storage as you suggested.
>> 
>> best,
>> Cedar
>> 
>> 
>> 
>> On May 12, 2014, at 2:12 PM, Steve Darnell <darnells at dnastar.com> wrote:
>> 
>> Hi Cedar,
>> 
>> I completely agree with David. We routinely use blast in a software pipeline
>> built on top of EC2. We started by using an NFS share, but we are currently
>> transitioning to ephemeral storage.
>> 
>> Our plan is to put the nr database (and other file-based data libraries) on
>> local SSD ephemeral storage for each node in the cluster. You may want to
>> consider pre-packaging the compressed libraries on a custom StarCluster AMI,
>> then using a plug-in to mount the ephemeral storage and decompress the blast
>> libraries onto it. This avoids the download from S3 each time you start a
>> node, which added 10-20 minutes in our case. Plus, it eliminates one more
>> possible point of failure during cluster initialization. To us, it is worth
>> the extra cost of maintaining a custom AMI and the extra size of the AMI
>> itself.
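>> 
>> A bare-bones sketch of such a plug-in (the archive path and target
>> directory are placeholders; it assumes the compressed libraries are baked
>> into the AMI at /opt/blastdb.tar.gz and that the ephemeral volume is
>> mounted at /mnt):
>> 
>>     from starcluster.clustersetup import ClusterSetup
>> 
>>     class UnpackBlastDB(ClusterSetup):
>>         """Decompress AMI-resident blast libraries onto ephemeral storage."""
>> 
>>         def _unpack(self, node):
>>             # Unpacking onto /mnt keeps the data on fast local disk and
>>             # avoids any download from S3 when the node comes up.
>>             node.ssh.execute("mkdir -p /mnt/blastdb")
>>             node.ssh.execute("tar xzf /opt/blastdb.tar.gz -C /mnt/blastdb")
>> 
>>         def run(self, nodes, master, user, user_shell, volumes):
>>             for node in nodes:
>>                 self._unpack(node)
>> 
>>         def on_add_node(self, node, nodes, master, user, user_shell, volumes):
>>             self._unpack(node)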
>> 
>> Best regards,
>> Steve
>> 
>> From: starcluster-bounces at mit.edu [mailto:starcluster-bounces at mit.edu]
>> On Behalf Of David Stuebe
>> Sent: Friday, May 09, 2014 12:49 PM
>> To: starcluster at mit.edu
>> Subject: Re: [StarCluster] Fast shared or local storage? (Cedar McKay)
>> 
>> 
>> Hi Cedar
>> 
>> Beware of using NFS – it may not be POSIX compliant in ways that seem minor
>> but have caused problems for HDF5 files. I don't know what the blast db file
>> structure is or how they organize their writes, but it can be a problem in
>> some circumstances.
>> 
>> I really like the suggestions of using the ephemeral storage. I suggest you
>> create a plugin that moves the data from S3 to the drive on startup when you
>> add a node. That should be simpler than the on-demand caching, which,
>> although elegant, may take you some time to implement.
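>> 
>> A minimal sketch of that kind of plugin (the bucket and paths are
>> placeholders; it assumes the AWS CLI is available on the AMI):
>> 
>>     from starcluster.clustersetup import ClusterSetup
>> 
>>     class CopyBlastDBFromS3(ClusterSetup):
>>         """Pull the blast DB from S3 onto a node's ephemeral drive."""
>> 
>>         def _fetch(self, node):
>>             node.ssh.execute("mkdir -p /mnt/blastdb")
>>             node.ssh.execute("aws s3 sync s3://my-blastdb-bucket /mnt/blastdb")
>> 
>>         def run(self, nodes, master, user, user_shell, volumes):
>>             for node in nodes:
>>                 self._fetch(node)
>> 
>>         def on_add_node(self, node, nodes, master, user, user_shell, volumes):
>>             self._fetch(node)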
>> 
>> David
>> 
>> 
>> Thanks for the very useful reply. I think I'm going to go with the s3fs
>> option and cache to local ephemeral drives. A big blast database is split
>> into many parts, and I'm pretty sure that not every file in a blast db is
>> read every time, so this way blasting can proceed immediately. The parts of
>> the blast database are downloaded from S3 on demand and cached locally. If
>> there were much writing, I'd probably be reluctant to use this approach,
>> because the S3 eventual-consistency model seems to require tolerance of
>> write failures at the application level. I'll write my results to a shared
>> NFS volume.
>> 
>> I thought about mpiBlast and will probably explore it, but I read some
>> reports that its XML output isn't exactly the same as the official NCBI
>> blast output and may break biopython parsing. I haven't confirmed this, and
>> will probably compare the two approaches.
>> 
>> Thanks again!
>> 
>> Cedar
>> 
>> 


