[miso-users] Install miso using pip and mixed read length BAMs

Wed Jan 7 04:31:15 EST 2015

Dear Yarden.

Thanks for the swift response. Please see my inline comments below.

On 7/01/2015 3:09 am, Yarden Katz wrote:
> Hi,
>
> See below for comments:
>
> On Jan 5, 2015, at 6:29 AM, Maurits Evers <maurits.evers at ur.de> wrote:
>
>> Dear all.
>>
>> I have been trying to install and run miso on my Mac and have run into a
>> couple of problems. Any help or clarification would be
>> greatly appreciated.
>>
>> 1. I did a global install following the recommended installation method
>> using pip. Everything seems to install fine, and importing misopy and
>> pysplicing from within python works. However, miso, module_availability
>> and test_miso are unknown commands. Chasing the binaries on my machine,
>> I can see that they are located at
>> /opt/local/Library/Frameworks/Python.framework/Versions/2.7/bin. Adding
>> this location to PATH fixes the issue of the unknown miso executables.
>> Do I need to add anything else?
> When you install MISO with a package manager like "pip", the package's executables (scripts like "miso") get placed in a system-specific binary directory, whose location is unfortunately not standardized, and which in your case happens to be /opt/local/Library/Frameworks/Python.framework/Versions/2.7/bin.  It is sometimes ~/.local/bin instead.  That directory has to be in your PATH for the executables to be accessible.  You only need to do this once, and then the executables from all Python packages installed there should be available, so there is no need to do anything else.
>
> A better solution in general is to use pip along with virtualenv, to make a virtual environment that contains all the packages needed for a particular task -- but it's of course not required.
I understand, thanks for the clarification. The documentation recommends 
a global install (rather than a local one using virtualenv), so you 
might want to make a note in the docs if a local installation is preferable.
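
For anyone hitting the same "unknown command" problem, the PATH check discussed above can be sketched in Python. This is a minimal sketch and not part of MISO: the helper name script_dirs_on_path is hypothetical, the user-level "bin" directory is a POSIX-style assumption, and the code only uses the standard-library sysconfig and site modules to report where pip is likely to have placed console scripts and whether those directories are on PATH.

```python
import os
import site
import sysconfig


def script_dirs_on_path(path_value=None):
    """Map each candidate script directory to True if it is listed on PATH.

    path_value defaults to the PATH of the current process; passing a string
    makes the check reproducible.
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    on_path = set(
        os.path.normpath(p) for p in path_value.split(os.pathsep) if p
    )
    candidates = [
        # Script directory of the interpreter that is running this code.
        sysconfig.get_path("scripts"),
        # Script directory used by "pip install --user" (assumed POSIX layout).
        os.path.join(site.getuserbase(), "bin"),
    ]
    return dict((d, os.path.normpath(d) in on_path) for d in candidates)


if __name__ == "__main__":
    for directory, found in script_dirs_on_path().items():
        status = "on PATH" if found else "NOT on PATH"
        print("{0}: {1}".format(directory, status))
```

Run it with the same interpreter that pip installed into. Any directory reported as not on PATH is a candidate to add to PATH in the shell profile.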
>
>> 2. As to testing the install, module_availability runs fine. test_miso
>> returns a "Ran 0 tests in 0.000s". When I try to execute test_miso from
>> within
>> /opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/misopy
>> via python test_miso.py it seems to run the 3 tests mentioned in the
>> documentation, but I end up with errors such as the following
>>
>>      .Testing conversion of SAM to BAM...
>>      Executing: sam_to_bam --convert /opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/misopy/test-data/sam-data/c2c12.Atp2b1.sam /opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/misopy/test-output/sam-output
>>      Converting SAM to BAM...
>>      Traceback (most recent call last):
>>        File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/bin/sam_to_bam", line 9, in <module>
>>          load_entry_point('misopy==0.5.2', 'console_scripts', 'sam_to_bam')()
>>        File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/misopy/sam_to_bam.py", line 63, in main
>>          sam_to_bam(sam_filename, output_dir, header_ref=ref)
>>        File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/misopy/sam_to_bam.py", line 13, in sam_to_bam
>>          os.makedirs(output_dir)
>>        File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/os.py", line 150, in makedirs
>>          makedirs(head, mode)
>>        File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/os.py", line 157, in makedirs
>>          mkdir(name, mode)
>>      OSError: [Errno 13] Permission denied: '/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/misopy/test-output'
>>
>> I don't know why test_miso fails to run properly. Is this something to
>> worry about?
> This is a kink which is partly our fault and partly the terrible and cumbersome way in which Python packages work.  Our test suite needs to create files.  Since you did not use virtualenv (which would create a mini environment in directories where you have write access), your package got installed by pip in a system-wide directory (/opt/local/...).  As a user, it looks like you don't have write access there, so our test suite fails because it needs to create files and it cannot.  I'll work around this in the next release.
Got it, thanks!
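
The failure mode described above (a test suite writing into a read-only install directory) can be illustrated with a short sketch. This is not MISO's code and not the workaround planned for the next release; the helper name writable_output_dir and the temp-directory prefix are hypothetical. It only shows one common pattern: use the preferred directory if it can be created, otherwise fall back to a fresh temporary directory.

```python
import os
import tempfile


def writable_output_dir(preferred_dir):
    """Return preferred_dir if it can be created/written, else a new temp dir."""
    # os.makedirs needs write access to the nearest ancestor that already
    # exists, so walk up the path until something exists.
    probe = os.path.abspath(preferred_dir)
    while not os.path.exists(probe):
        parent = os.path.dirname(probe)
        if parent == probe:  # reached the filesystem root
            break
        probe = parent
    if os.path.isdir(probe) and os.access(probe, os.W_OK | os.X_OK):
        return preferred_dir
    # Fall back to a location the current user can always write to.
    return tempfile.mkdtemp(prefix="miso-test-output-")


if __name__ == "__main__":
    # Path taken from the traceback above; on most systems a normal user
    # cannot write there, so a temporary directory is printed instead.
    site_dir = ("/opt/local/Library/Frameworks/Python.framework/Versions/2.7/"
                "lib/python2.7/site-packages/misopy/test-output")
    print(writable_output_dir(site_dir))
```

Note that os.access only reports what the permission bits allow; the actual os.makedirs call can still fail for other reasons, so real code should also handle OSError.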
>
>> 3. I have paired-end mouse RNA-seq data which I mapped to the mm10
>> reference genome using tophat. The bam file is sorted and indexed, and I
>> indexed successfully the gff annotation file. Upon running miso with
>>
>>      miso --run indexed ../alignment/tophat/WT.bam --settings-filename
>>      miso_settings.txt --output-dir WT/ --paired-end 472 277 --read-len 120
> Your "--paired-end" parameters look very off -- your insert length distribution most likely does not have a mean of 472 and a standard deviation of 277.  The standard deviation looks far too big; are you sure it's not sqrt(277) = ~17?
As a matter of fact, the numbers I provided are the median and median 
absolute deviation rather than the mean and sd. Values were calculated 
using picard tools. Indeed, the variance in fragment size seems 
very large. I checked the values using cufflinks, which gives mean = 284 nt, 
sd = 90 nt. This seems more realistic and consistent with the library 
prep protocol.
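
Since the mix-up here was between two different pairs of summary statistics, a small sketch of how they differ may be useful. The function name insert_size_summary and the example numbers are hypothetical, and this is not how picard or cufflinks compute their estimates. It only shows that mean/sd and median/MAD are separate quantities, and that for roughly normal data the MAD can be rescaled by the standard factor 1.4826 to approximate an sd.

```python
import statistics


def insert_size_summary(insert_sizes):
    """Return mean, sd, median and MAD for a list of insert sizes."""
    sizes = list(insert_sizes)
    median = statistics.median(sizes)
    # Median absolute deviation: median of the distances to the median.
    mad = statistics.median([abs(x - median) for x in sizes])
    return {
        "mean": statistics.mean(sizes),
        "sd": statistics.stdev(sizes),  # sample standard deviation
        "median": median,
        "mad": mad,
        # Valid as an sd estimate only if the data are roughly normal.
        "mad_as_sd": 1.4826 * mad,
    }


if __name__ == "__main__":
    # Made-up sizes with one long outlier, e.g. a pair spanning an intron.
    example = [180, 210, 250, 280, 300, 320, 350, 5000]
    for name, value in sorted(insert_size_summary(example).items()):
        print("{0}: {1:.1f}".format(name, value))
```

A single extreme value inflates the mean and sd far more than the median and MAD, which is one reason the two pairs can disagree strongly on the same data.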
>> I get the warning that miso found mixed length reads within the BAM
>> file. Prior to mapping, reads were adapter-trimmed and quality-filtered
>> so naturally aligned reads will have a read-length distribution. I don't
>> understand what to make of this warning. I would assume that most
>> RNA-seq data consists of different read lengths, due to some form of
>> trimming/filtering of the raw data. I don't understand why miso would
>> require reads to have the same length in order to be able to estimate
>> isoform expression. Could you advise how to proceed? The read length
>> distribution shows reads with lengths between 20 and 120 nt. Running
>> miso for each of the read lengths separately would be possible but
>> tedious, requiring 100 separate runs followed by merging the individual
>> output files.
> It's unfortunately the case that for now MISO requires the reads to be the same length.  In our experience, trimming the adapters can certainly create variability, but a variation between 20 and 120 is far larger than I've seen, and seems extreme.  In most cases, reads hover around a certain length, such that the minimum length is still basically "as good" as the longest length reads.  E.g. if your reads were between 35 and 45 nt, you could just trim the reads to 35 -- so you'd have the exact same number of reads (just shorter), and you wouldn't need multiple runs.  But we will adapt MISO to work with multiple read lengths (this currently requires substantial changes to the code).
>
> What fraction of your reads would you lose if you took only reads that are at least 100 nt long?  Since the adapter is fixed length, I'm assuming most of your trimming is caused by poor base quality.  It seems very extreme to have to trim off over 80% of the read, i.e. going from 120 nt to 20 nt, and it shouldn't happen frequently in a high quality RNA-Seq run.
I don't agree with you on this point. Consider the following situation: 
a broad fragment size distribution (which you get if no additional 
size selection is applied following PCR amplification), a read length of 
120 nt, and paired-end reads. You will sequence (partially) into the 
adapters at the 5'/3' ends, if the gap size is such that length(left 
read)+length(right read)+gap ~= fragment size (You can even have the 
case where the fragment size is smaller than the sum of the read 
lengths). In this case you will end up with a post-trimming read length 
distribution that covers lengths from a minimum (usually 18-20nt) to the 
raw untrimmed read length (in this case 120nt), due to sequencing parts 
(of varying lengths) of the adapter(s). Most high-quality RNA-seq data 
(both small RNA and mRNA protocols) that I have worked with has had a 
similar distribution of read lengths, so it does not seem to be very 
unusual. Of course, you can check using a few sample SRA data sets.
Trimming longer reads down to a fixed length just to get constant read 
lengths sounds like a dubious strategy and, in my opinion, should not be done.
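
To put numbers on this disagreement for a given library, the adapter read-through argument can be sketched as a toy model. Everything here is an assumption for illustration: the function names are hypothetical, fragment sizes are drawn from a normal distribution using the cufflinks estimate quoted earlier (mean 284 nt, sd 90 nt), a read runs into the adapter only when the fragment is shorter than the read length, adapter trimming is perfect, and quality trimming is ignored. It is not a substitute for looking at the real post-trimming length distribution.

```python
import random


def trimmed_read_length(fragment_size, read_length=120, min_length=20):
    """Length of a read after ideal adapter trimming, or None if discarded."""
    # Bases beyond the end of the fragment are adapter and get trimmed off.
    length = min(read_length, fragment_size)
    return length if length >= min_length else None


def fraction_at_least(lengths, threshold):
    """Fraction of retained reads whose length is >= threshold."""
    kept = [n for n in lengths if n is not None]
    if not kept:
        return 0.0
    return sum(1 for n in kept if n >= threshold) / float(len(kept))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs agree
    fragments = [max(1, int(round(rng.gauss(284, 90)))) for _ in range(100000)]
    lengths = [trimmed_read_length(size) for size in fragments]
    print("fraction of retained reads >= 100 nt: {0:.3f}".format(
        fraction_at_least(lengths, 100)))
```

Replacing the simulated fragments with trimmed read lengths taken from the actual BAM file would answer the "what fraction would you lose" question directly.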

For now I will stick with cufflinks to estimate isoform expression, as 
cufflinks does not require reads to have a fixed read length. I don't 
understand the reason for this requirement in miso, nor why this would 
be a sensible thing to do for the problem of estimating isoform 
expression strengths, and I think this is a serious limitation to a very 
promising tool. I will keep an eye out for future releases.

Cheers,
Maurits

> Yarden
>
>> Best regards,
>> Maurits
>>
>>
>> -- 
>> Dr. Maurits Evers
>> Center for Integrative Bioinformatics Vienna
>> Max F. Perutz Laboratories
>> Dr. Bohr Gasse 9
>> A-1030 Vienna, Austria
>>
>>
>>
>> -- 
>> Dr. Maurits Evers
>> Statistical Bioinformatics
>> Institute of Functional Genomics
>> University of Regensburg
>> Josef-Engert-Str. 9 (Biopark I)
>> 93053 Regensburg, Germany
>> _______________________________________________
>> miso-users mailing list
>> miso-users at mit.edu
>> http://mailman.mit.edu/mailman/listinfo/miso-users