[miso-users] Does Miso require exon IDs in the GFF3?

Thu Sep 22 16:37:26 EDT 2011

Hi Holger,

I looked into your GFF, and what you are seeing is the expected behavior of MISO for this file.  In general, MISO expects every GFF entry type (i.e. "gene", "mRNA" and "exon") to have the ID attribute set.  Strictly speaking, however, the ID is not required for "exon" entries: when it is missing, MISO assigns the exon a default ID.  Your GFF has no exon IDs, so every exon gets a default one.  The format of the default ID is:

parent_transcript@start_coord@end_coord@strand

For example, for the first exon you quoted below, it would be:

ENST00000452680@116399482@116400440@-

This means the exon has parent mRNA ENST00000452680, start coordinate 116399482 and end coordinate 116400440, and lies on the "-" strand.  (In case you are curious, the function that does this under the hood is "_set_default_exon_id" in the GFF parser module, GFF.py.)
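
To make the format concrete, here is a minimal Python sketch that builds and splits such an ID. It assumes the four fields are joined with "@" and nothing else. The function names are made up for illustration; this is not the code of "_set_default_exon_id" itself.

```python
def default_exon_id(parent_transcript, start, end, strand):
    """Build an ID of the form transcript@start@end@strand (assumed format)."""
    return "@".join([parent_transcript, str(start), str(end), strand])


def parse_default_exon_id(exon_id):
    """Split a default exon ID back into its four fields."""
    transcript, start, end, strand = exon_id.split("@")
    return transcript, int(start), int(end), strand


# The first exon from the quoted GFF
exon_id = default_exon_id("ENST00000452680", 116399482, 116400440, "-")
print(exon_id)
print(parse_default_exon_id(exon_id))
```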

MISO assigns each exon an internal ID because it represents each mRNA as a list of exons, e.g. "exonID1_exonID2" would be an isoform made up of the exon named exonID1 and the exon named exonID2.
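
If you only need the transcript behind each isoform name, a rough sketch like the following may be enough. It assumes default exon IDs joined by "_" as described above, and it assumes the transcript IDs themselves contain no "_" or "@" (true for Ensembl IDs, not for e.g. RefSeq ones). The helper names are hypothetical and not part of MISO; the example values are the isoform names and the first sampled_psi row from your raw output.

```python
def transcript_of(isoform_label):
    """Return the transcript ID shared by all exons of one isoform name."""
    exon_ids = isoform_label.strip("'").split("_")
    transcripts = {exon_id.split("@")[0] for exon_id in exon_ids}
    if len(transcripts) != 1:
        raise ValueError("exons from several transcripts: %r" % isoform_label)
    return transcripts.pop()


def isoform_table(isoform_labels, psi_values):
    """Pair each isoform's transcript ID with its Psi estimate, in order."""
    return [(transcript_of(label), psi)
            for label, psi in zip(isoform_labels, psi_values)]


labels = [
    "ENST00000445680@111981260@111981447@-_ENST00000445680@111983760@111983986@-",
    "ENST00000416099@111981262@111981447@-_ENST00000416099@111983314@111983850@-",
]
for transcript, psi in isoform_table(labels, [0.5701, 0.4299]):
    print(transcript, psi)
```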

If you find this default format inconvenient to parse, the simplest solution is to add exon IDs to your GFF.  One easy option is to use the Ensembl exon IDs (the ones starting with ENSE...), since you're already using Ensembl IDs for transcripts and genes.
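
As a rough sketch of adding the IDs, the Python function below prepends an ID attribute to every exon line that lacks one. Real Ensembl exon IDs would need a lookup from your annotation, so the sketch uses a made-up placeholder scheme, "<transcript>.exon<N>", numbered in file order. The function name and the ID scheme are assumptions for illustration only. The placeholder deliberately avoids "_", since exon IDs are joined with "_" to name isoforms.

```python
from collections import defaultdict


def add_exon_ids(gff_lines):
    """Return GFF3 lines with a placeholder ID added to exons that have none."""
    counters = defaultdict(int)  # exons seen so far, per parent transcript
    out = []
    for line in gff_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        # Pass through comments, malformed lines and non-exon features
        if line.startswith("#") or len(fields) != 9 or fields[2] != "exon":
            out.append(line)
            continue
        attrs = dict(pair.split("=", 1)
                     for pair in fields[8].split(";") if "=" in pair)
        if "ID" not in attrs and "Parent" in attrs:
            parent = attrs["Parent"]
            counters[parent] += 1
            # Hypothetical ID scheme: <transcript>.exon<N>, in file order
            fields[8] = "ID=%s.exon%d;%s" % (parent, counters[parent], fields[8])
        out.append("\t".join(fields))
    return out


example = [
    "##gff-version 3",
    "1\tensGene\tmRNA\t45996599\t45997811\t.\t+\t.\tID=ENST00000446155;Parent=ENSG00000234379",
    "1\tensGene\texon\t45996599\t45996712\t.\t+\t.\tParent=ENST00000446155",
    "1\tensGene\texon\t45997019\t45997318\t.\t+\t.\tParent=ENST00000446155",
]
for row in add_exon_ids(example):
    print(row)
```

The GFF would need to be re-indexed after such a change.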

Finally, if you want to see how MISO represents a gene internally after you have indexed your GFF, you can use the "--view-gene" option of run_miso.py, e.g.:

# view gene ENSG00000116285

$ python ~/MISO/run_miso.py --view-gene output/chr1/ENSG00000116285.pickle

This will print an elaborate internal representation of how MISO parsed the gene:

==
Gene ENSG00000116285
 - Gene object:  gene_id: ENSG00000116285
isoforms: [Isoform(gene = ENSG00000116285, g_start = 8064464, g_end = 8086343, len = 479,
 parts = ['ENST00000474874@8064464@8064618@-', 'ENST00000474874@8075555@8075752@-', 'ENST00000474874@8086218@8086343@-']), Isoform(gene = ENSG00000116285, g_start = 8071779, g_end = 8086368, len = 3104,
 parts = ['ENST00000377482@8071779@8074456@-', 'ENST00000377482@8075368@8075444@-', 'ENST00000377482@8075555@8075752@-', 'ENST00000377482@8086218@8086368@-']), Isoform(gene = ENSG00000116285, g_start = 8073799, g_end = 8086356, len = 995,
 parts = ['ENST00000469499@8073799@8074456@-', 'ENST00000469499@8075555@8075752@-', 'ENST00000469499@8086218@8086356@-']), Isoform(gene = ENSG00000116285, g_start = 8073487, g_end = 8075693, len = 2097,
 parts = ['ENST00000467067@8073487@8075444@-', 'ENST00000467067@8075555@8075693@-']), Isoform(gene = ENSG00000116285, g_start = 8074100, g_end = 8086356, len = 787,
 parts = ['ENST00000487559@8074100@8074456@-', 'ENST00000487559@8075368@8075460@-', 'ENST00000487559@8075555@8075752@-', 'ENST00000487559@8086218@8086356@-'])]
==
Isoforms: 
 -  Isoform(gene = ENSG00000116285, g_start = 8064464, g_end = 8086343, len = 479,
 parts = ['ENST00000474874@8064464@8064618@-', 'ENST00000474874@8075555@8075752@-', 'ENST00000474874@8086218@8086343@-'])
 -  Isoform(gene = ENSG00000116285, g_start = 8071779, g_end = 8086368, len = 3104,
 parts = ['ENST00000377482@8071779@8074456@-', 'ENST00000377482@8075368@8075444@-', 'ENST00000377482@8075555@8075752@-', 'ENST00000377482@8086218@8086368@-'])
 -  Isoform(gene = ENSG00000116285, g_start = 8073799, g_end = 8086356, len = 995,
 parts = ['ENST00000469499@8073799@8074456@-', 'ENST00000469499@8075555@8075752@-', 'ENST00000469499@8086218@8086356@-'])
 -  Isoform(gene = ENSG00000116285, g_start = 8073487, g_end = 8075693, len = 2097,
 parts = ['ENST00000467067@8073487@8075444@-', 'ENST00000467067@8075555@8075693@-'])
 -  Isoform(gene = ENSG00000116285, g_start = 8074100, g_end = 8086356, len = 787,
 parts = ['ENST00000487559@8074100@8074456@-', 'ENST00000487559@8075368@8075460@-', 'ENST00000487559@8075555@8075752@-', 'ENST00000487559@8086218@8086356@-'])
==
Exons: 
 -  Exon([8064464, 8064618], id = ENST00000474874@8064464@8064618@-, seq = )(ParentGene = ENSG00000116285)
 -  Exon([8075555, 8075752], id = ENST00000474874@8075555@8075752@-, seq = )(ParentGene = ENSG00000116285)
 -  Exon([8086218, 8086343], id = ENST00000474874@8086218@8086343@-, seq = )(ParentGene = ENSG00000116285)
 -  Exon([8071779, 8074456], id = ENST00000377482@8071779@8074456@-, seq = )(ParentGene = ENSG00000116285)
 -  Exon([8075368, 8075444], id = ENST00000377482@8075368@8075444@-, seq = )(ParentGene = ENSG00000116285)
 -  Exon([8075555, 8075752], id = ENST00000377482@8075555@8075752@-, seq = )(ParentGene = ENSG00000116285)
 -  Exon([8086218, 8086368], id = ENST00000377482@8086218@8086368@-, seq = )(ParentGene = ENSG00000116285)
 -  Exon([8073799, 8074456], id = ENST00000469499@8073799@8074456@-, seq = )(ParentGene = ENSG00000116285)
 -  Exon([8075555, 8075752], id = ENST00000469499@8075555@8075752@-, seq = )(ParentGene = ENSG00000116285)
 -  Exon([8086218, 8086356], id = ENST00000469499@8086218@8086356@-, seq = )(ParentGene = ENSG00000116285)
 -  Exon([8073487, 8075444], id = ENST00000467067@8073487@8075444@-, seq = )(ParentGene = ENSG00000116285)
 -  Exon([8075555, 8075693], id = ENST00000467067@8075555@8075693@-, seq = )(ParentGene = ENSG00000116285)
 -  Exon([8074100, 8074456], id = ENST00000487559@8074100@8074456@-, seq = )(ParentGene = ENSG00000116285)
 -  Exon([8075368, 8075460], id = ENST00000487559@8075368@8075460@-, seq = )(ParentGene = ENSG00000116285)
 -  Exon([8075555, 8075752], id = ENST00000487559@8075555@8075752@-, seq = )(ParentGene = ENSG00000116285)
 -  Exon([8086218, 8086356], id = ENST00000487559@8086218@8086356@-, seq = )(ParentGene = ENSG00000116285)
==

Hope this helps.  Best, --Yarden

On Sep 19, 2011, at 8:13 AM, Holger Brandl wrote:

> Hi,
> 
> I've created the Gff3-file for MISO using the python script provided by the 
> galaxy project (source: 
> http://toolshed.g2.bx.psu.edu/repos/vipints/fml_gff3togtf/file/ed53dca1c6ff/fml_gff_converter_programs/scripts/gtf_to_gff3_converter.py 
> ). This is what the first lines of the GFF3 file look like:
> ##gff-version 3
> 1    ensGene    gene    116399482    116400440    .    -    .    ID=ENSG00000214204;Name=ENSG00000214204
> 1    ensGene    mRNA    116399482    116400440    .    -    .    ID=ENST00000452680;Parent=ENSG00000214204
> 1    ensGene    exon    116399482    116400440    .    -    .    Parent=ENST00000452680
> 1    ensGene    gene    45996599    45997811    .    +    .    ID=ENSG00000234379;Name=ENSG00000234379
> 1    ensGene    mRNA    45996599    45997811    .    +    .    ID=ENST00000446155;Parent=ENSG00000234379
> 1    ensGene    exon    45996599    45996712    .    +    .    Parent=ENST00000446155
> 1    ensGene    exon    45997019    45997318    .    +    .    Parent=ENST00000446155
> 1    ensGene    exon    45997628    45997811    .    +    .    Parent=ENST00000446155
> 1    ensGene    gene    234404329    234408246    .    -    .    ID=ENSG00000236244;Name=ENSG00000236244
> 
> Is this a valid input for index_gff.py or do I need to add exon IDs in the last 
> column (as in the example in the manual section 
> http://genes.mit.edu/burgelab/miso/docs/#gff-based-alternative-events-format ) ?
> 
> The reason for my question is that the resulting miso-result-files (raw output) 
> are using 'strange' isoform IDs. Here's an example:
> #isoforms=['ENST00000445680@111981260@111981447@-_ENST00000445680@111983760@111983986@-','ENST00000416099@111981262@111981447@-_ENST00000416099@111983314@111983850@-']    
> exon_lens=('ENST00000445680@111981260@111981447@-',188),('ENST00000445680@111983760@111983986@-',227),('ENST00000416099@111981262@111981447@-',186),('ENST00000416099@111983314@111983850@-',537)    
> iters=5000    burn_in=500    lag=10    percent_accept=96.34    
> proposal_type=drift    counts=(0,1):13,(1,0):10,(1,1):2    assigned_counts=0:10,1:15
> sampled_psi    log_score
> 0.5701,0.4299    -173.1684
> 0.5320,0.4680    -173.3332
> 0.5257,0.4743    -173.4779
> 
> As you can see from the example Miso added 'strange' (?) suffixes to all my 
> isoform IDs, and I don't understand why. However, it seems to be aware of the 
> actual number of isoforms as the sampled_psi column contains the correct number 
> of entries.
> 
> Basically, what I would like to do with Miso is an isoform-centric expression 
> level analysis (i.e. get counts on a transcript level). But with these weird 
> ID suffixes the summarized results are not really convenient to parse. Example:
> ENSG00000215193    0.09,0.61,0.09,0.13,0.08    0.03,0.47,0.02,0.03,0.03    
> 0.18,0.75,0.19,0.26,0.20    
> 'ENST00000474897@18560781@18561372@+_ENST00000474897@18562640@18562780@+_ENST00000474897@18604246@18604468@+_ENST00000474897@18606923@18607071@+_ENST00000474897@18609121@18609801@+_ENST00000474897@18613610@18613905@+','ENST00000329627@18560689@18560813@+_ENST00000329627@18561062@18561372@+_ENST00000329627@18562640@18562780@+_ENST00000329627@18566203@18566498@+_ENST00000329627@18567878@18568024@+_ENST00000329627@18570738@18574236@+','ENST00000399746@18561143@18561372@+_ENST00000399746@18562640@18562780@+_ENST00000399746@18566203@18566498@+_ENST00000399746@18567878@18568024@+_ENST00000399746@18570738@18570784@+_ENST00000399746@18570786@18571022@+','ENST00000428061@18561143@18561372@+_ENST00000428061@18562640@18562780@+_ENST00000428061@18566203@18566498@+_ENST00000428061@18570738@18570885@+','ENST00000399744@18560774@18561372@+_ENST00000399744@18562640@18562780@+_ENST00000399744@18566203@18566498@+_ENST00000399744@18567878@18568024@+_ENST00000399744@18570738@18571226@+'    
> (0,0,0,1,0):1,(0,1,0,0,0):71,(0,1,0,0,1):1,(0,1,0,1,1):3,(0,1,1,0,1):6,(0,1,1,1,1):3,(1,0,0,0,0):2,(1,0,0,0,1):4,(1,1,0,0,1):1,(1,1,1,1,1):1    
> 0:4,1:81,2:0,3:1,4:7
> 
> Miso correctly calculated one score for each of the 5 transcripts, but it will 
> require some table transformation and regexing to reshape this into a table with 
> an isoform id column and a score column. Is there some hidden option to obtain 
> such a more isoform-centric table representation of the summarized results when 
> using "run_miso.py --summarize-samples"?
> 
> 
> Best,
> Holger Brandl
> 
> 
> PS @Yarden: Yes, I've received your reply via the mailing list.
> 
> -- 
> Dr. Holger Brandl
> Bioinformatics Service
> Max Planck Institute of Molecular Cell Biology and Genetics
> Pfotenhauerstrasse 108
> 01307 Dresden, Germany
> 
> Tel.:   +49/351/210-2738
> Fax:    +49 351 210 2000
> www:  http://www.mpi-cbg.de