[miso-users] Does Miso require exon IDs in the GFF3?

Mon Sep 19 08:13:46 EDT 2011

Hi,

I've created the GFF3 file for MISO using the Python script provided by the 
Galaxy project (source: 
http://toolshed.g2.bx.psu.edu/repos/vipints/fml_gff3togtf/file/ed53dca1c6ff/fml_gff_converter_programs/scripts/gtf_to_gff3_converter.py 
). This is what the first lines of the GFF3 file look like:
##gff-version 3
1    ensGene    gene    116399482    116400440    .    -    .    ID=ENSG00000214204;Name=ENSG00000214204
1    ensGene    mRNA    116399482    116400440    .    -    .    ID=ENST00000452680;Parent=ENSG00000214204
1    ensGene    exon    116399482    116400440    .    -    .    Parent=ENST00000452680
1    ensGene    gene    45996599    45997811    .    +    .    ID=ENSG00000234379;Name=ENSG00000234379
1    ensGene    mRNA    45996599    45997811    .    +    .    ID=ENST00000446155;Parent=ENSG00000234379
1    ensGene    exon    45996599    45996712    .    +    .    Parent=ENST00000446155
1    ensGene    exon    45997019    45997318    .    +    .    Parent=ENST00000446155
1    ensGene    exon    45997628    45997811    .    +    .    Parent=ENST00000446155
1    ensGene    gene    234404329    234408246    .    -    .    ID=ENSG00000236244;Name=ENSG00000236244

Is this valid input for index_gff.py, or do I need to add exon IDs in the last 
column (as in the example in the manual section 
http://genes.mit.edu/burgelab/miso/docs/#gff-based-alternative-events-format )?
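
In case exon IDs do turn out to be needed, here is a minimal sketch of how they could be added to a file like the one above. It assumes the real file is tab-delimited with nine columns, and the ID scheme (parent transcript ID plus ".exonN") is a made-up convention for illustration, not something MISO prescribes. The function name add_exon_ids is likewise hypothetical.

```python
def add_exon_ids(gff_lines):
    """Yield GFF3 lines, giving every exon that lacks an ID a generated one.

    Assumption: columns are tab-separated and the 9th column holds
    semicolon-separated key=value attributes. The generated ID format
    "<Parent>.exon<N>" is only an illustrative convention.
    """
    counts = {}  # number of ID-less exons seen so far, per parent transcript
    for line in gff_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        # Pass through comments, malformed lines, and non-exon features.
        if line.startswith("#") or len(fields) != 9 or fields[2] != "exon":
            yield line
            continue
        attrs = dict(kv.split("=", 1) for kv in fields[8].split(";") if "=" in kv)
        if "ID" not in attrs:
            parent = attrs.get("Parent", "unknown")
            counts[parent] = counts.get(parent, 0) + 1
            fields[8] = "ID=%s.exon%d;%s" % (parent, counts[parent], fields[8])
        yield "\t".join(fields)
```

It could be applied line by line to the converter output, writing the yielded lines to a new file before running index_gff.py on it.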

The reason for my question is that the resulting MISO result files (raw output) 
use 'strange' isoform IDs. Here's an example:
#isoforms=['ENST00000445680@111981260@111981447@-_ENST00000445680@111983760@111983986@-','ENST00000416099@111981262@111981447@-_ENST00000416099@111983314@111983850@-']    
exon_lens=('ENST00000445680@111981260@111981447@-',188),('ENST00000445680@111983760@111983986@-',227),('ENST00000416099@111981262@111981447@-',186),('ENST00000416099@111983314@111983850@-',537)    
iters=5000    burn_in=500    lag=10    percent_accept=96.34    
proposal_type=drift    counts=(0,1):13,(1,0):10,(1,1):2    assigned_counts=0:10,1:15
sampled_psi    log_score
0.5701,0.4299    -173.1684
0.5320,0.4680    -173.3332
0.5257,0.4743    -173.4779

As you can see from the example, MISO added 'strange' (?) suffixes to all my 
isoform IDs, and I don't understand why. However, it seems to be aware of the 
actual number of isoforms, as the sampled_psi column contains the correct number 
of entries.
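
Judging only from the example header, each isoform label looks like a chain of exon entries of the form transcript@start@end@strand joined by "_". Under that assumption (it is inferred from the output, not from the MISO documentation), a label can be split back into the transcript ID and its exon coordinates. The sketch also converts " at " back to "@" in case a mail archive has mangled the labels; parse_isoform_label is a hypothetical helper name.

```python
import re

# One exon entry: <transcript>@<start>@<end>@<strand>, followed by "_" or end of label.
EXON_RE = re.compile(r"([^@]+)@(\d+)@(\d+)@([+-])(?:_|$)")

def parse_isoform_label(label):
    """Return (transcript_id, [(start, end, strand), ...]) for one isoform label.

    Assumption: the label format is the one seen in the example header above.
    """
    # Drop surrounding quotes and undo the archive's "@" -> " at " rewriting.
    clean = label.strip().strip("'\"").replace(" at ", "@")
    matches = EXON_RE.findall(clean)
    if not matches:
        raise ValueError("unrecognized isoform label: %r" % label)
    transcript = matches[0][0]
    exons = [(int(start), int(end), strand) for _, start, end, strand in matches]
    return transcript, exons
```

For the first isoform in the example, this yields ENST00000445680 with two exons whose lengths (end - start + 1) are 188 and 227, which agrees with the exon_lens field.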

Basically, what I would like to do with MISO is an isoform-centric expression 
level analysis (i.e. get counts on a transcript level). But with these weird 
ID suffixes, the summarized results are not really convenient to parse. Example:
ENSG00000215193    0.09,0.61,0.09,0.13,0.08    0.03,0.47,0.02,0.03,0.03    
0.18,0.75,0.19,0.26,0.20    
'ENST00000474897@18560781@18561372@+_ENST00000474897@18562640@18562780@+_ENST00000474897@18604246@18604468@+_ENST00000474897@18606923@18607071@+_ENST00000474897@18609121@18609801@+_ENST00000474897@18613610@18613905@+','ENST00000329627@18560689@18560813@+_ENST00000329627@18561062@18561372@+_ENST00000329627@18562640@18562780@+_ENST00000329627@18566203@18566498@+_ENST00000329627@18567878@18568024@+_ENST00000329627@18570738@18574236@+','ENST00000399746@18561143@18561372@+_ENST00000399746@18562640@18562780@+_ENST00000399746@18566203@18566498@+_ENST00000399746@18567878@18568024@+_ENST00000399746@18570738@18570784@+_ENST00000399746@18570786@18571022@+','ENST00000428061@18561143@18561372@+_ENST00000428061@18562640@18562780@+_ENST00000428061@18566203@18566498@+_ENST00000428061@18570738@18570885@+','ENST00000399744@18560774@18561372@+_ENST00000399744@18562640@18562780@+_ENST00000399744@18566203@18566498@+_ENST00000399744@18567878@18568024@+_ENST00000399744@18570738@18571226@+'    
(0,0,0,1,0):1,(0,1,0,0,0):71,(0,1,0,0,1):1,(0,1,0,1,1):3,(0,1,1,0,1):6,(0,1,1,1,1):3,(1,0,0,0,0):2,(1,0,0,0,1):4,(1,1,0,0,1):1,(1,1,1,1,1):1    
0:4,1:81,2:0,3:1,4:7

MISO correctly calculated one score for each of the 5 transcripts, but it will 
require some table transformation and regexing to reshape this into a table with 
an isoform ID column and a score column. Is there an option to obtain 
such an isoform-centric table representation of the summarized results when 
using "run_miso.py --summarize-samples"?
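
Failing such an option, the reshaping could be sketched roughly as follows. The sketch assumes each summary row is a single tab-separated line whose first five fields are the gene ID, three comma-separated lists of per-isoform values, and the quoted isoform labels, as in the example row. Reading the three lists as an estimate with lower and upper bounds is my interpretation of the example, and summary_row_to_table is a hypothetical helper, not part of MISO.

```python
import re

def summary_row_to_table(row):
    """Turn one summary row into a list of per-isoform tuples.

    Returns [(gene, transcript_id, estimate, low, high), ...].
    Assumptions: fields are tab-separated; the three numeric lists are read
    as estimate / lower bound / upper bound; each isoform label starts with
    the transcript ID followed by "@".
    """
    fields = row.rstrip("\n").split("\t")
    gene, estimates, lows, highs, isoforms = fields[:5]
    # Labels are single-quoted and comma-separated.
    labels = re.findall(r"'([^']*)'", isoforms)
    # Undo the archive's "@" -> " at " rewriting, then keep the leading ID.
    transcripts = [lab.replace(" at ", "@").split("@")[0] for lab in labels]
    columns = [col.split(",") for col in (estimates, lows, highs)]
    return [
        (gene, tid, float(est), float(low), float(high))
        for tid, est, low, high in zip(transcripts, *columns)
    ]
```

Each returned tuple corresponds to one line of the isoform-centric table, so the result can be written out with one transcript per row.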

Best,
Holger Brandl

PS @Yarden: Yes, I've received your reply via the mailing list.

-- 
Dr. Holger Brandl
Bioinformatics Service
Max Planck Institute of Molecular Cell Biology and Genetics
Pfotenhauerstrasse 108
01307 Dresden, Germany

Tel.:   +49/351/210-2738
Fax:    +49 351 210 2000
www:  http://www.mpi-cbg.de