<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=utf-8">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    Dear Sashimi Users,<br>

    <br>

    <b>Summary</b>:<br>

    I have written some code to upgrade sashimi so that it uses all

    reads,<br>

    instead of filtering out reads with multiple splice junction and/or

    indels.<br>

    <br>

    <b>Background</b>:<br>

    It is well documented that sashimi does not use all reads:<br>

    <br>

    "For junction visualization, sashimi_plot currently uses only reads

    that<br>

     cross a single junction. If a read crosses multiple exon-exon

    junctions,<br>

     it is currently skipped, although MISO will use such a read in

    isoform<br>

     estimation if it consistent with the given isoform annotation.

    Also,<br>

     sashimi_plot currently ignores reads containing insertions or<br>

     deletions and does not visualize sequence mismatches."<br>

    <br>

    This has become more problematic as:<br>

    a) read length has increased, so that a higher fraction of  RNAseq

    reads span multiple junctions<br>

    b) exons of interest have decreased in length, so that RNAseq reads

    covering such exons are<br>

       likely to span multiple junctions.<br>

    <br>

    This can lead to situations where the sashimi plot does not

    accurately reflect the MISO call<br>

    of PSI. In particular, the discarding of reads with multiple

    junctions leads to a systematic<br>

    bias against the <b>included </b>isoform (in SE events), since the

    included form has an<br>

    internal exon flanked by 2 introns, while the excluded form has only

    one intron.<br>

    <br>

    <b>Old Code</b>:<br>

    The relevant existing code in the MISO 0.5.4 release, using the

    Python Virtual Environment,<br>

    is found  in this file:<br>

    <br>

     <tt>$misoVirtualenv/lib/python2.7/site-packages/misopy/sashimi_plot/plot_utils/plot_gene.py</tt><tt><br>

    </tt><br>

    The method <tt>"readsToWiggle_pysam()" </tt>first filters the

    reads, then calculates both:<br>

    a) the junction counts<br>

    b) the (mostly) exonic coverage<br>

    <br>

    <b>New Code:</b><br>

    Versions of pysam more recent than were apparently available at the

    time that sashimi<br>

    was originally written, have a function <tt>"find_introns()" </tt>for

    counting intron-spanning<br>

    reads from a set of aligned reads.<br>

    Pysam also has the function <tt>"get_reference_positions"</tt> for

    getting the positions in the reference covered by<br>

    an aligned reads.<br>

    <br>

    Both of these 2 functions apparently are able to parse arbitrary

    cigar strings and give the<br>

    correct results. So I have used these functions to calculate the

    junction counts and coverage.<br>

    <br>

    Note: the original code also used <tt>"get_reference_positions()"</tt>,

    under its original name <tt>"positions()"</tt>.<br>

    So it did not need to filter the reads for item (b).<br>

    <br>

    Below find the new code. There are some commented out calls to the

    old code and print<br>

    statements, which for debugging purposes will compare the new and

    old results. Since the<br>

    new code was pretty short, I left it inline instead of using

    function calls.<br>

    <br>

    I ran this with pysam 0.13.0  I don't know when <tt>"find_introns()"</tt>

    was first introduced.<br>

    I also did not test MISO with this version of pysam.<br>

    <br>

    I hope others find this useful.<br>

    <br>

    /Sol Katzman<br>

    <tt>--------------------------------------------------------------------------<br>

      In plot_gene.py, the beginning of function plot_density_single()

      is now:<br>

      <br>

          samfile = pysam.AlignmentFile(bam_filename, 'rb')<br>

          try:<br>

              # need several copies of iterator to be used with

      different methods<br>

              subset_reads_copy1 = samfile.fetch(reference=chrom,

      start=tx_start, end=tx_end)<br>

              subset_reads_copy2 = samfile.fetch(reference=chrom,

      start=tx_start, end=tx_end)<br>

              subset_reads_copy3 = samfile.fetch(reference=chrom,

      start=tx_start, end=tx_end)<br>

          except ValueError as e:<br>

              print "Error retrieving reads from %s: %s" %(chrom,

      str(e))<br>

              print "Are you sure %s appears in your BAM file?" %(chrom)<br>

              print "Aborting plot..."<br>

              return axvar<br>

          <br>

          # find_introns() returns a dictionary {(start, stop): count}<br>

          # need to add 1 to stop for consistency with old code.<br>

          # then join (start,stop) tuple to a string for consistency

      with old code.<br>

          # use copy of reads iterator to get introns<br>

          jxns = {}<br>

          introns = samfile.find_introns(read for read in

      subset_reads_copy1 if "N" in read.cigarstring)<br>

          for key, count in introns.iteritems():<br>

              leftss  = key[0]<br>

              rightss = key[1]+1<br>

              if leftss &gt; tx_start and leftss &lt; tx_end \<br>

                     and rightss &gt; tx_start and rightss &lt;

      tx_end:                      <br>

                  jxn = ":".join(map(str, [leftss, rightss]))<br>

                  jxns[jxn] = count <br>

      <br>

      <br>

          # use copy of reads iterator to get wiggle<br>

          wiggle = zeros((tx_end - tx_start + 1), dtype='f')<br>

          for read in subset_reads_copy2:<br>

              # fetch that specifies a region should not return

      unaligned reads,<br>

              # but legacy code had this check just in case?<br>

              if read.cigartuples is None:<br>

                  print "Skipping unaligned read (with no CIGAR string)

      %s" % read.query_name<br>

                  continue<br>

              read_align_length = read.query_alignment_length<br>

              for pos in read.get_reference_positions() :<br>

                  if pos &lt; tx_start or pos &gt; tx_end:<br>

                      continue<br>

                  wig_index = pos-tx_start<br>

                  wiggle[wig_index] += 1.0/read_align_length<br>

      <br>

          # # for debug, use copy of reads iterator to get old versions

      of wiggle,jxns<br>

          # oldwiggle, oldjxns = readsToWiggle_pysam(subset_reads_copy3,

      tx_start, tx_end)<br>

      <br>

          # print "============= oldjxns ================="    <br>

          # for key, value in oldjxns.iteritems():<br>

          #    print "oldjxns: %s: %s" % (key, str(value))<br>

          # <br>

          # print "============= jxns ================="            <br>

          # for key, value in jxns.iteritems():<br>

          #     print "jxns: %s: %s" % (key, str(value))<br>

      <br>

          # print "============= oldwiggle  %s entries (step 200)

      =================" % (len(oldwiggle))<br>

          # for i in range(0, len(oldwiggle),200) :<br>

          #    print "oldwiggle: %s: %s" % (str(i), str(oldwiggle[i]))<br>

      <br>

          # print "============= wiggle  %s entries (step 200)

      =================" % (len(wiggle))<br>

          # for i in range(0, len(wiggle),200) :<br>

          #    print "wiggle: %s: %s" % (str(i), str(wiggle[i]))<br>

       <br>

          wiggle = 1e3 * wiggle / coverage<br>

       <br>

      <br>

    </tt><br>

    <br>

    <br>

  </body>

</html>