[miso-users] sashimi -- proposed code to use all reads

Sol Katzman solkatzman at sbcglobal.net
Thu Jan 4 14:03:08 EST 2018


Dear Sashimi Users,

*Summary*:
I have written some code to upgrade sashimi so that it uses all reads,
instead of filtering out reads with multiple splice junction and/or indels.

*Background*:
It is well documented that sashimi does not use all reads:

"For junction visualization, sashimi_plot currently uses only reads that
  cross a single junction. If a read crosses multiple exon-exon junctions,
  it is currently skipped, although MISO will use such a read in isoform
  estimation if it consistent with the given isoform annotation. Also,
  sashimi_plot currently ignores reads containing insertions or
  deletions and does not visualize sequence mismatches."

This has become more problematic as:
a) read length has increased, so that a higher fraction of  RNAseq reads span multiple junctions
b) exons of interest have decreased in length, so that RNAseq reads covering such exons are
    likely to span multiple junctions.

This can lead to situations where the sashimi plot does not accurately reflect the MISO call
of PSI. In particular, the discarding of reads with multiple junctions leads to a systematic
bias against the *included *isoform (in SE events), since the included form has an
internal exon flanked by 2 introns, while the excluded form has only one intron.

*Old Code*:
The relevant existing code in the MISO 0.5.4 release, using the Python Virtual Environment,
is found  in this file:

$misoVirtualenv/lib/python2.7/site-packages/misopy/sashimi_plot/plot_utils/plot_gene.py

The method "readsToWiggle_pysam()" first filters the reads, then calculates both:
a) the junction counts
b) the (mostly) exonic coverage

*New Code:*
Versions of pysam more recent than were apparently available at the time that sashimi
was originally written, have a function "find_introns()" for counting intron-spanning
reads from a set of aligned reads.
Pysam also has the function "get_reference_positions" for getting the positions in the reference covered by
an aligned reads.

Both of these 2 functions apparently are able to parse arbitrary cigar strings and give the
correct results. So I have used these functions to calculate the junction counts and coverage.

Note: the original code also used "get_reference_positions()", under its original name "positions()".
So it did not need to filter the reads for item (b).

Below find the new code. There are some commented out calls to the old code and print
statements, which for debugging purposes will compare the new and old results. Since the
new code was pretty short, I left it inline instead of using function calls.

I ran this with pysam 0.13.0  I don't know when "find_introns()" was first introduced.
I also did not test MISO with this version of pysam.

I hope others find this useful.

/Sol Katzman
--------------------------------------------------------------------------
In plot_gene.py, the beginning of function plot_density_single() is now:

     samfile = pysam.AlignmentFile(bam_filename, 'rb')
     try:
         # need several copies of iterator to be used with different methods
         subset_reads_copy1 = samfile.fetch(reference=chrom, start=tx_start, end=tx_end)
         subset_reads_copy2 = samfile.fetch(reference=chrom, start=tx_start, end=tx_end)
         subset_reads_copy3 = samfile.fetch(reference=chrom, start=tx_start, end=tx_end)
     except ValueError as e:
         print "Error retrieving reads from %s: %s" %(chrom, str(e))
         print "Are you sure %s appears in your BAM file?" %(chrom)
         print "Aborting plot..."
         return axvar

     # find_introns() returns a dictionary {(start, stop): count}
     # need to add 1 to stop for consistency with old code.
     # then join (start,stop) tuple to a string for consistency with old code.
     # use copy of reads iterator to get introns
     jxns = {}
     introns = samfile.find_introns(read for read in subset_reads_copy1 if "N" in read.cigarstring)
     for key, count in introns.iteritems():
         leftss  = key[0]
         rightss = key[1]+1
         if leftss > tx_start and leftss < tx_end \
                and rightss > tx_start and rightss < tx_end:
             jxn = ":".join(map(str, [leftss, rightss]))
             jxns[jxn] = count


     # use copy of reads iterator to get wiggle
     wiggle = zeros((tx_end - tx_start + 1), dtype='f')
     for read in subset_reads_copy2:
         # fetch that specifies a region should not return unaligned reads,
         # but legacy code had this check just in case?
         if read.cigartuples is None:
             print "Skipping unaligned read (with no CIGAR string) %s" % read.query_name
             continue
         read_align_length = read.query_alignment_length
         for pos in read.get_reference_positions() :
             if pos < tx_start or pos > tx_end:
                 continue
             wig_index = pos-tx_start
             wiggle[wig_index] += 1.0/read_align_length

     # # for debug, use copy of reads iterator to get old versions of wiggle,jxns
     # oldwiggle, oldjxns = readsToWiggle_pysam(subset_reads_copy3, tx_start, tx_end)

     # print "============= oldjxns ================="
     # for key, value in oldjxns.iteritems():
     #    print "oldjxns: %s: %s" % (key, str(value))
     #
     # print "============= jxns ================="
     # for key, value in jxns.iteritems():
     #     print "jxns: %s: %s" % (key, str(value))

     # print "============= oldwiggle  %s entries (step 200) =================" % (len(oldwiggle))
     # for i in range(0, len(oldwiggle),200) :
     #    print "oldwiggle: %s: %s" % (str(i), str(oldwiggle[i]))

     # print "============= wiggle  %s entries (step 200) =================" % (len(wiggle))
     # for i in range(0, len(wiggle),200) :
     #    print "wiggle: %s: %s" % (str(i), str(wiggle[i]))

     wiggle = 1e3 * wiggle / coverage





-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.mit.edu/pipermail/miso-users/attachments/20180104/b37c22e8/attachment.html


More information about the miso-users mailing list