[StarCluster] Tophat run on a 2-node cluster

Blanchette, Marco MAB at stowers.org
Fri Aug 2 18:23:00 EDT 2013


I'm a tophat user myself and use SGE to submit parallel alignments to our
cluster, so here are a few tips:

1) Do not max out the number of cores per machine for an individual
tophat job. tophat (really the bowtie/bowtie2 it runs internally) seems to
stop scaling linearly once its threads are spread across CPUs. If your
instances run 2 CPUs with 8 cores each, you will see better speed
running 2 jobs with 8 threads each than 1 job running 16 threads (I think it
has to do with how the indexes are stored in memory, but I am not an
engineer, so I'll stop here!).
2) Unless you expressly need it, turn off coverage search (this is where
your tophat job stopped). It uses a huge amount of memory in order to
find islands of coverage; we only use it if we have fewer than 10M reads.
A sketch of how we combine both settings follows this list.
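
To make that concrete, a per-sample invocation with both settings applied
looks roughly like this (the index, annotation and read file names are
placeholders, not paths from your setup):

    # 8 threads on one machine, coverage search disabled
    tophat -p 8 --no-coverage-search \
        -G annotation.gtf \
        -o sample1_tophat_out \
        genome_index sample1.fastq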

If you use --no-coverage-search, you will find that tophat uses at most 3G
of RAM when aligning against the human genome. That way, if you use
cc2.8xlarge instances, you could run 4 alignments dispatched across your
2 nodes and the run should terminate nicely.
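
On the dispatch side, a minimal sketch looks like the loop below. It assumes
an SGE parallel environment that keeps all requested slots on one host
(called "smp" here, but that name, like the file names, is an assumption;
use whatever your cluster defines). Each sample becomes one 8-thread job,
and SGE spreads the four jobs over your two nodes as slots free up.

    # submit one 8-thread tophat job per fastq file
    for fq in sample1.fastq sample2.fastq sample3.fastq sample4.fastq; do
        qsub -cwd -b y -N tophat_${fq%.fastq} -pe smp 8 \
            tophat -p 8 --no-coverage-search -G annotation.gtf \
                   -o ${fq%.fastq}_tophat_out genome_index $fq
    done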

Good luck.


--  Marco Blanchette, Ph.D.
Stowers Institute for Medical Research
1000 East 50th Street
Kansas City MO 64110
www.stowers.org






On 8/2/13 5:06 PM, "Rayson Ho" <raysonlogin at gmail.com> wrote:

>Just found that Jacob already answered some of your questions... I
>just wanted to add a few more things:
>
>- for hardware configurations for each instance type, I found the most
>detailed & easy to read info at the bottom of the page at:
>http://aws.amazon.com/ec2/instance-types/
>
>- if you submitted your application to Grid Engine, then qacct can
>tell you more about the memory usage history.
>
>- lastly, /mnt has over 400GB of free storage, so you can take
>advantage of the free instance ephemeral storage.
>
>Rayson
>
>==================================================
>Open Grid Scheduler - The Official Open Source Grid Engine
>http://gridscheduler.sourceforge.net/
>
>
>On Fri, Aug 2, 2013 at 4:51 PM, Manuel J. Torres <mjtorres.phd at gmail.com>
>wrote:
>> I am trying to run the tophat software to map ~38 GB of RNA-seq reads in fastq
>> format to a reference genome on a 2-node cluster with the following
>> properties:
>> NODE_IMAGE_ID = ami-999d49f0
>> NODE_INSTANCE_TYPE = c1.xlarge
>>
>> Question: How many CPUs are there on this type of cluster?
>>
>> Here is a df -h listing of my cluster:
>> root at master:~# df -h
>> Filesystem            Size  Used Avail Use% Mounted on
>> /dev/xvda1            9.9G  9.9G     0 100% /
>> udev                  3.4G  4.0K  3.4G   1% /dev
>> tmpfs                 1.4G  184K  1.4G   1% /run
>> none                  5.0M     0  5.0M   0% /run/lock
>> none                  3.5G     0  3.5G   0% /run/shm
>> /dev/xvdb1            414G  199M  393G   1% /mnt
>> /dev/xvdz              99G   96G     0 100% /home/large-data
>> /dev/xvdy              20G  5.3G   14G  29% /home/genomic-data
>>
>> I created a third volume for the output that does not appear in this list
>> but is listed in my config file and which I determined I can read and write
>> to. I wrote the output files to this larger empty volume.
>>
>> I can't get tophat to run to completion. It appears to be generating
>> truncated intermediate files. Here is the tophat output:
>>
>> [2013-08-01 17:34:19] Beginning TopHat run (v2.0.9)
>> -----------------------------------------------
>> [2013-08-01 17:34:19] Checking for Bowtie
>>                   Bowtie version:        2.1.0.0
>> [2013-08-01 17:34:21] Checking for Samtools
>>                 Samtools version:        0.1.19.0
>> [2013-08-01 17:34:21] Checking for Bowtie index files (genome)..
>> [2013-08-01 17:34:21] Checking for reference FASTA file
>> [2013-08-01 17:34:21] Generating SAM header for
>> /home/genomic-data/data/Nemve1.allmasked
>>         format:          fastq
>>         quality scale:   phred33 (default)
>> [2013-08-01 17:34:27] Reading known junctions from GTF file
>> [2013-08-01 17:36:56] Preparing reads
>>          left reads: min. length=50, max. length=50, 165174922 kept reads
>> (113024 discarded)
>> [2013-08-01 18:24:07] Building transcriptome data files..
>> [2013-08-01 18:26:43] Building Bowtie index from Nemve1.allmasked.fa
>> [2013-08-01 18:29:01] Mapping left_kept_reads to transcriptome
>> Nemve1.allmasked with Bowtie2
>> [2013-08-02 07:34:40] Resuming TopHat pipeline with unmapped reads
>> [bam_header_read] EOF marker is absent. The input is probably truncated.
>> [bam_header_read] EOF marker is absent. The input is probably truncated.
>> [2013-08-02 07:34:41] Mapping left_kept_reads.m2g_um to genome
>> Nemve1.allmasked with Bowtie2
>> [main_samview] truncated file.
>> [main_samview] truncated file.
>> [bam_header_read] EOF marker is absent. The input is probably truncated.
>> [bam_header_read] invalid BAM binary header (this is not a BAM file).
>> [main_samview] fail to read the header from
>> "/home/results-data/top-results-8-01-2013/topout/tmp/left_kept_reads.m2g_um_unmapped.bam".
>> [2013-08-02 07:34:54] Retrieving sequences for splices
>> [2013-08-02 07:35:16] Indexing splices
>> Warning: Empty fasta file:
>> '/home/results-data/top-results-8-01-2013/topout/tmp/segment_juncs.fa'
>> Warning: All fasta inputs were empty
>> Error: Encountered internal Bowtie 2 exception (#1)
>> Command: /home/genomic-data/bin/bowtie2-2.1.0/bowtie2-build
>> /home/results-data/top-results-8-01-2013/topout/tmp/segment_juncs.fa
>> /home/results-data/top-results-8-01-2013/topout/tmp/segment_juncs
>>         [FAILED]
>> Error: Splice sequence indexing failed with err =1
>>
>> Questions:
>>
>> Am I running out of memory?
>>
>> How much RAM does the AMI have and can I make that larger?
>>
>> No matter what StarCluster configuration I define, I can't seem to make my
>> root directory larger than 10GB, and it appears to be full.
>>
>> Can I make the root directory larger than 10GB?
>>
>> Thanks!
>>
>> --
>> Manuel J Torres, PhD
>> 219 Brannan Street Unit 6G
>> San Francisco, CA 94107
>> VOICE: 415-656-9548
>>
>> _______________________________________________
>> StarCluster mailing list
>> StarCluster at mit.edu
>> http://mailman.mit.edu/mailman/listinfo/starcluster
>>
>_______________________________________________
>StarCluster mailing list
>StarCluster at mit.edu
>http://mailman.mit.edu/mailman/listinfo/starcluster



