<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Dear Sashimi Users,<br>
<br>
<b>Summary</b>:<br>
I have written some code to upgrade sashimi so that it uses all
reads,<br>
instead of filtering out reads with multiple splice junction and/or
indels.<br>
<br>
<b>Background</b>:<br>
It is well documented that sashimi does not use all reads:<br>
<br>
"For junction visualization, sashimi_plot currently uses only reads
that<br>
cross a single junction. If a read crosses multiple exon-exon
junctions,<br>
it is currently skipped, although MISO will use such a read in
isoform<br>
estimation if it consistent with the given isoform annotation.
Also,<br>
sashimi_plot currently ignores reads containing insertions or<br>
deletions and does not visualize sequence mismatches."<br>
<br>
This has become more problematic as:<br>
a) read length has increased, so that a higher fraction of RNAseq
reads span multiple junctions<br>
b) exons of interest have decreased in length, so that RNAseq reads
covering such exons are<br>
likely to span multiple junctions.<br>
<br>
This can lead to situations where the sashimi plot does not
accurately reflect the MISO call<br>
of PSI. In particular, the discarding of reads with multiple
junctions leads to a systematic<br>
bias against the <b>included </b>isoform (in SE events), since the
included form has an<br>
internal exon flanked by 2 introns, while the excluded form has only
one intron.<br>
<br>
<b>Old Code</b>:<br>
The relevant existing code in the MISO 0.5.4 release, using the
Python Virtual Environment,<br>
is found in this file:<br>
<br>
<tt>$misoVirtualenv/lib/python2.7/site-packages/misopy/sashimi_plot/plot_utils/plot_gene.py</tt><tt><br>
</tt><br>
The method <tt>"readsToWiggle_pysam()" </tt>first filters the
reads, then calculates both:<br>
a) the junction counts<br>
b) the (mostly) exonic coverage<br>
<br>
<b>New Code:</b><br>
Versions of pysam more recent than were apparently available at the
time that sashimi<br>
was originally written, have a function <tt>"find_introns()" </tt>for
counting intron-spanning<br>
reads from a set of aligned reads.<br>
Pysam also has the function <tt>"get_reference_positions"</tt> for
getting the positions in the reference covered by<br>
an aligned reads.<br>
<br>
Both of these 2 functions apparently are able to parse arbitrary
cigar strings and give the<br>
correct results. So I have used these functions to calculate the
junction counts and coverage.<br>
<br>
Note: the original code also used <tt>"get_reference_positions()"</tt>,
under its original name <tt>"positions()"</tt>.<br>
So it did not need to filter the reads for item (b).<br>
<br>
Below find the new code. There are some commented out calls to the
old code and print<br>
statements, which for debugging purposes will compare the new and
old results. Since the<br>
new code was pretty short, I left it inline instead of using
function calls.<br>
<br>
I ran this with pysam 0.13.0 I don't know when <tt>"find_introns()"</tt>
was first introduced.<br>
I also did not test MISO with this version of pysam.<br>
<br>
I hope others find this useful.<br>
<br>
/Sol Katzman<br>
<tt>--------------------------------------------------------------------------<br>
In plot_gene.py, the beginning of function plot_density_single()
is now:<br>
<br>
samfile = pysam.AlignmentFile(bam_filename, 'rb')<br>
try:<br>
# need several copies of iterator to be used with
different methods<br>
subset_reads_copy1 = samfile.fetch(reference=chrom,
start=tx_start, end=tx_end)<br>
subset_reads_copy2 = samfile.fetch(reference=chrom,
start=tx_start, end=tx_end)<br>
subset_reads_copy3 = samfile.fetch(reference=chrom,
start=tx_start, end=tx_end)<br>
except ValueError as e:<br>
print "Error retrieving reads from %s: %s" %(chrom,
str(e))<br>
print "Are you sure %s appears in your BAM file?" %(chrom)<br>
print "Aborting plot..."<br>
return axvar<br>
<br>
# find_introns() returns a dictionary {(start, stop): count}<br>
# need to add 1 to stop for consistency with old code.<br>
# then join (start,stop) tuple to a string for consistency
with old code.<br>
# use copy of reads iterator to get introns<br>
jxns = {}<br>
introns = samfile.find_introns(read for read in
subset_reads_copy1 if "N" in read.cigarstring)<br>
for key, count in introns.iteritems():<br>
leftss = key[0]<br>
rightss = key[1]+1<br>
if leftss > tx_start and leftss < tx_end \<br>
and rightss > tx_start and rightss <
tx_end: <br>
jxn = ":".join(map(str, [leftss, rightss]))<br>
jxns[jxn] = count <br>
<br>
<br>
# use copy of reads iterator to get wiggle<br>
wiggle = zeros((tx_end - tx_start + 1), dtype='f')<br>
for read in subset_reads_copy2:<br>
# fetch that specifies a region should not return
unaligned reads,<br>
# but legacy code had this check just in case?<br>
if read.cigartuples is None:<br>
print "Skipping unaligned read (with no CIGAR string)
%s" % read.query_name<br>
continue<br>
read_align_length = read.query_alignment_length<br>
for pos in read.get_reference_positions() :<br>
if pos < tx_start or pos > tx_end:<br>
continue<br>
wig_index = pos-tx_start<br>
wiggle[wig_index] += 1.0/read_align_length<br>
<br>
# # for debug, use copy of reads iterator to get old versions
of wiggle,jxns<br>
# oldwiggle, oldjxns = readsToWiggle_pysam(subset_reads_copy3,
tx_start, tx_end)<br>
<br>
# print "============= oldjxns =================" <br>
# for key, value in oldjxns.iteritems():<br>
# print "oldjxns: %s: %s" % (key, str(value))<br>
# <br>
# print "============= jxns =================" <br>
# for key, value in jxns.iteritems():<br>
# print "jxns: %s: %s" % (key, str(value))<br>
<br>
# print "============= oldwiggle %s entries (step 200)
=================" % (len(oldwiggle))<br>
# for i in range(0, len(oldwiggle),200) :<br>
# print "oldwiggle: %s: %s" % (str(i), str(oldwiggle[i]))<br>
<br>
# print "============= wiggle %s entries (step 200)
=================" % (len(wiggle))<br>
# for i in range(0, len(wiggle),200) :<br>
# print "wiggle: %s: %s" % (str(i), str(wiggle[i]))<br>
<br>
wiggle = 1e3 * wiggle / coverage<br>
<br>
<br>
</tt><br>
<br>
<br>
</body>
</html>