[miso-users] Events with many gene annotations

Sol Katzman solkatzman at sbcglobal.net
Wed Aug 28 17:53:41 EDT 2013


Attached are the distributions (and the script used to generate them).

/Sol.

On 8/28/2013 2:51 PM, Sol Katzman wrote:
> Dear Yarden and Tyler,
>
> A while back, I noticed some performance problems processing AFE/ALE events.
>
> I extracted the distributions of the lengths of the "gene" items in the gff3
> (hg19) event definitions. Many AFE and ALE events (1100+ and 500+, respectively) are over 1Mb in length,
> but only a handful (10) of the SE events are.
>
> I will send my stats in a follow-up email.
>
> I think that the events longer than 1Mb are pretty questionable.
>
> /Sol.
>
> On 8/28/2013 2:16 PM, Tyler Funnell wrote:
>> Hi Yarden,
>>
>> Yes that's right. The problem is most noticeable for the ALE/AFE events for the reason you mentioned, but I think the current
>> event to gene mapping could have improper annotations for other event types as well. For example, small genes that exist within
>> the introns in a SE event would be picked up.
>>
>> Cheers,
>> Tyler
>>
>>
>> On Aug 28, 2013, at 2:03 PM, Yarden Katz <yarden at mit.edu> wrote:
>>
>>> Hi Tyler,
>>>
>>> Some of the AFE/ALE annotations, which we are currently reworking, span very large genomic coordinates as you noted.  I
>>> believe these are probably dubious/faulty annotations.  But in any case, as you say, if you overlap the outer-most coordinates
>>> with genes there will potentially be many overlapping genes.
>>>
>>> If I understand correctly, you're proposing to merge the first exon with all genes, then the second exon with genes, and take
>>> the intersection of those?
>>>
>>> Best, --Yarden
>>>
>>> On Aug 27, 2013, at 10:31 PM, Tyler Funnell wrote:
>>>
>>>> Hello,
>>>>
>>>> I've noticed that for some alternative events, there are many gene annotations in the event-to-Ensembl-ID mapping file. For
>>>> example, AFE event 83896@uc002kgt.1@uc002hvt.1 has quite a few. I think this might be because the left-most and right-most
>>>> coordinates for this particular event cover a large section of the chromosome and the gene mappings are based on these
>>>> coordinates. If I'm right, I think a better way would be to get the overlap between genes (or gene exons) and individual event
>>>> exons first, then merge to the event level.
>>>>
>>>> Thank you,
>>>> Tyler
>>>>
>>>>
>>>> _______________________________________________
>>>> miso-users mailing list
>>>> miso-users at mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/miso-users
>>>
>>
>
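
The mapping idea discussed in the thread above (overlap genes with each event exon first, then combine at the event level) can be sketched roughly as follows. This is only an illustration, not MISO's actual annotation code: the Interval type, the function names, and the example coordinates and gene IDs are all hypothetical, and every exon of an event is assumed to lie on the same chromosome. Both ways of combining that came up in the thread are shown (union of the per-exon hits, or their intersection).

```python
from collections import namedtuple

# Hypothetical container; coordinates are 1-based and inclusive, as in GFF3.
Interval = namedtuple("Interval", ["chrom", "start", "end"])


def overlaps(a, b):
    """True if two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def genes_by_span(event_exons, genes):
    """Span-based mapping: overlap genes with the event's outermost coordinates.

    Assumes all exons of the event are on the same chromosome.
    """
    if not event_exons:
        return set()
    span = Interval(
        event_exons[0].chrom,
        min(e.start for e in event_exons),
        max(e.end for e in event_exons),
    )
    return {gid for gid, g in genes.items() if overlaps(span, g)}


def genes_for_event(event_exons, genes, combine="union"):
    """Per-exon mapping: find the genes hitting each exon, then combine."""
    per_exon = [
        {gid for gid, g in genes.items() if overlaps(exon, g)}
        for exon in event_exons
    ]
    if not per_exon:
        return set()
    if combine == "intersection":
        return set.intersection(*per_exon)
    return set.union(*per_exon)


# Made-up example: two event exons far apart, one gene spanning both,
# one small gene inside the intron, one gene touching only the first exon.
exons = [Interval("chr1", 100, 200), Interval("chr1", 900000, 900100)]
genes = {
    "GENE_A": Interval("chr1", 50, 950000),
    "GENE_B": Interval("chr1", 400000, 410000),
    "GENE_C": Interval("chr1", 150, 300),
}

print(sorted(genes_by_span(exons, genes)))
print(sorted(genes_for_event(exons, genes, combine="union")))
print(sorted(genes_for_event(exons, genes, combine="intersection")))
```

In this toy case the span-based mapping picks up the intronic GENE_B, while the per-exon versions do not; the intersection additionally drops GENE_C, which touches only one of the two exons.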
-------------- next part --------------
set genome = hg19
set gffDir = /hive/groups/wet/illumina/geneModels/${genome}/miso_gff/${genome}

set eventList = (AFE ALE A3SS A5SS MXE SE TandemUTR RI)

cd $gffDir

foreach ae ($eventList)
    set gffName = ${ae}.${genome}.gff3
    echo "--------------------------------------------"
    echo "stats for lengths (in kb) of gene items in event gff"
    echo "gffDir:  $gffDir"
    echo "gffName: $gffName"
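    # Keep only the "gene" records, then print (end - start) in kb.
    gawk '($3 == "gene")' $gffDir/$gffName | \
    cut -f 4,5 | \
    gawk '{print ($2 - $1)/1000}'  \
    > temp.${ae}.lengths
    echo ""
    # Summary statistics (ave appears to be the UCSC Kent utility).
    ave temp.${ae}.lengths
    echo ""
    # Text histogram with 100 kb bins (textHistogram, likewise a Kent utility).
    textHistogram -binSize=100 temp.${ae}.lengths
    rm -f  temp.${ae}.lengths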
end
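
For readers without the ave and textHistogram tools, the length extraction in the script above can be approximated in Python. The sketch below is an assumption-laden stand-in, not part of the original attachment: the function names are hypothetical, fields are split on tabs only, and the length is computed as (end - start) / 1000 to match the gawk step. It counts how many "gene" items exceed a threshold (1 Mb by default) and how many come out negative, which is what the negative minimum values in some of the reports below indicate (end smaller than start).

```python
def gene_lengths_kb(lines):
    """Yield (end - start) / 1000 for each 'gene' record in GFF3 lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5 or fields[2] != "gene":
            continue
        yield (int(fields[4]) - int(fields[3])) / 1000.0


def summarize(lengths_kb, threshold_kb=1000.0):
    """Count records, records longer than the threshold, and negative lengths."""
    lengths = list(lengths_kb)
    return {
        "count": len(lengths),
        "over_threshold": sum(1 for x in lengths if x > threshold_kb),
        "negative": sum(1 for x in lengths if x < 0),
    }


# Made-up records for illustration only (not taken from the real GFF3 files).
sample = [
    "chr1\tAFE\tgene\t1000\t2500000\t.\t+\t.\tID=ev1\n",
    "chr1\tAFE\tmRNA\t1000\t2000\t.\t+\t.\tID=ev1.A;Parent=ev1\n",
    "chr2\tAFE\tgene\t5000\t4000\t.\t-\t.\tID=ev2\n",
    "chr3\tAFE\tgene\t100\t161\t.\t+\t.\tID=ev3\n",
]
print(summarize(gene_lengths_kb(sample)))
```

On a real file, something like `with open("AFE.hg19.gff3") as fh: print(summarize(gene_lengths_kb(fh)))` should give the corresponding counts, assuming the file is tab-delimited GFF3.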


--------------------------------------------
stats for lengths (in kb) of gene items in event gff
gffDir:  /hive/groups/wet/illumina/geneModels/hg19/miso_gff/hg19
gffName: AFE.hg19.gff3

Q1 11.718500
median 36.585000
Q3 118.308250
average 2235.596544
min 0.061000
max 242543.000000
count 19720
total 44085963.849000
standard deviation 11335.026775

large values truncated: need 2425 bins or larger binSize than 100
Maximum value 242543.000000
  0 ************************************************************ 14282
100 ********* 2219
200 ***** 1079
300 ** 374
400 * 167
500  117
600  72
700  119
800  46
900  27
1000  14
1100  9
1200  6
1300  9
1400  11
1500  35
1600  7
1700  2
1800  6
1900  5
2000  9
2100  0
2200  1
<minVal or >=2300 ***** 1104
--------------------------------------------
stats for lengths (in kb) of gene items in event gff
gffDir:  /hive/groups/wet/illumina/geneModels/hg19/miso_gff/hg19
gffName: ALE.hg19.gff3

Q1 7.085000
median 19.504000
Q3 60.781000
average 2011.638801
min 0.229000
max 242515.000000
count 10269
total 20657518.843000
standard deviation 9984.357978

large values truncated: need 2425 bins or larger binSize than 100
Maximum value 242515.000000
  0 ************************************************************ 8395
100 ***** 742
200 ** 325
300 * 78
400  48
500  28
600  26
700  32
800  21
900  5
1000  8
1100  1
1200  2
1300  4
1400  8
1500  7
1600  3
1700  0
1800  0
1900  2
2000  7
<minVal or >=2100 **** 527
--------------------------------------------
stats for lengths (in kb) of gene items in event gff
gffDir:  /hive/groups/wet/illumina/geneModels/hg19/miso_gff/hg19
gffName: A3SS.hg19.gff3

Q1 0.873250
median 1.723000
Q3 3.936750
average 13.790749
min -2.440000
max 90101.800000
count 14960
total 206309.608000
standard deviation 737.490719

large values truncated: need 901 bins or larger binSize than 100
Maximum value 90101.000000
  0 ************************************************************ 14769
100  34
200  13
300 * 128
400  8
500  3
600  2
700  0
800  0
900  1
<minVal or >=1000  2
--------------------------------------------
stats for lengths (in kb) of gene items in event gff
gffDir:  /hive/groups/wet/illumina/geneModels/hg19/miso_gff/hg19
gffName: A5SS.hg19.gff3

Q1 0.838000
median 1.730000
Q3 3.620000
average 4.084345
min -22.779000
max 529.531000
count 12807
total 52308.207000
standard deviation 13.678279

  0 ************************************************************ 12778
100  17
200  4
300  2
400  4
500  1
<minVal or >=600  1
--------------------------------------------
stats for lengths (in kb) of gene items in event gff
gffDir:  /hive/groups/wet/illumina/geneModels/hg19/miso_gff/hg19
gffName: MXE.hg19.gff3

Q1 6.725000
median 15.607000
Q3 37.239000
average 38.464499
min 0.475000
max 932.967000
count 2723
total 104738.831000
standard deviation 73.946593

  0 ************************************************************ 2515
100 *** 119
200 * 32
300 * 28
400  13
500  12
600  2
700  0
800  0
900  2
--------------------------------------------
stats for lengths (in kb) of gene items in event gff
gffDir:  /hive/groups/wet/illumina/geneModels/hg19/miso_gff/hg19
gffName: SE.hg19.gff3

Q1 3.207000
median 6.770000
Q3 15.346750
average 20.210842
min -22.779000
max 90101.500000
count 39232
total 792911.753000
standard deviation 643.883559

large values truncated: need 901 bins or larger binSize than 100
Maximum value 90101.000000
  0 ************************************************************ 38473
100 * 574
200  110
300  46
400  7
500  7
600  3
700  0
800  1
900  0
1000  1
<minVal or >=1100  10
--------------------------------------------
stats for lengths (in kb) of gene items in event gff
gffDir:  /hive/groups/wet/illumina/geneModels/hg19/miso_gff/hg19
gffName: TandemUTR.hg19.gff3

Q1 0.679250
median 1.188000
Q3 2.036500
average 1.548208
min 0.109000
max 12.567000
count 2656
total 4112.041000
standard deviation 1.224639

  0 ************************************************************ 2656
--------------------------------------------
stats for lengths (in kb) of gene items in event gff
gffDir:  /hive/groups/wet/illumina/geneModels/hg19/miso_gff/hg19
gffName: RI.hg19.gff3

Q1 0.366000
median 0.555000
Q3 1.052250
average 0.862697
min 0.076000
max 9.694000
count 5986
total 5164.105000
standard deviation 0.822856

  0 ************************************************************ 5986

