[miso-users] Does Miso require exon IDs in the GFF3?
Holger Brandl
brandl at mpi-cbg.de
Mon Sep 19 08:13:46 EDT 2011
Hi,
I've created the Gff3-file for MISO using the python script provided by the
galaxy project (source:
http://toolshed.g2.bx.psu.edu/repos/vipints/fml_gff3togtf/file/ed53dca1c6ff/fml_gff_converter_programs/scripts/gtf_to_gff3_converter.py
). This is what the first lines of the GFF3 file look like:
##gff-version 3
1 ensGene gene 116399482 116400440 . - . ID=ENSG00000214204;Name=ENSG00000214204
1 ensGene mRNA 116399482 116400440 . - . ID=ENST00000452680;Parent=ENSG00000214204
1 ensGene exon 116399482 116400440 . - . Parent=ENST00000452680
1 ensGene gene 45996599 45997811 . + . ID=ENSG00000234379;Name=ENSG00000234379
1 ensGene mRNA 45996599 45997811 . + . ID=ENST00000446155;Parent=ENSG00000234379
1 ensGene exon 45996599 45996712 . + . Parent=ENST00000446155
1 ensGene exon 45997019 45997318 . + . Parent=ENST00000446155
1 ensGene exon 45997628 45997811 . + . Parent=ENST00000446155
1 ensGene gene 234404329 234408246 . - . ID=ENSG00000236244;Name=ENSG00000236244
Is this a valid input for index_gff.py or do I need to add exon IDs in the last
column (as in the example in the manual section
http://genes.mit.edu/burgelab/miso/docs/#gff-based-alternative-events-format ) ?
The reason for my question is that the resulting miso-result-files (raw output)
are using 'strange' isoform IDs. Here's an example:
#isoforms=['ENST00000445680@111981260@111981447@-_ENST00000445680@111983760@111983986@-','ENST00000416099@111981262@111981447@-_ENST00000416099@111983314@111983850@-']
exon_lens=('ENST00000445680@111981260@111981447@-',188),('ENST00000445680@111983760@111983986@-',227),('ENST00000416099@111981262@111981447@-',186),('ENST00000416099@111983314@111983850@-',537)
iters=5000 burn_in=500 lag=10 percent_accept=96.34
proposal_type=drift counts=(0,1):13,(1,0):10,(1,1):2 assigned_counts=0:10,1:15
sampled_psi log_score
0.5701,0.4299 -173.1684
0.5320,0.4680 -173.3332
0.5257,0.4743 -173.4779
As you can see from the example, Miso added 'strange' (?) suffixes to all my
isoform IDs, and I don't understand why. However, it seems to be aware of the
actual number of isoforms, as the sampled_psi column contains the correct number
of entries.
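For illustration, here is a minimal Python sketch that splits one of these isoform labels back into a transcript ID and its exons. It assumes, judging only from the example above, that each label is a list of transcript@start@end@strand parts joined by '_'; this is an inference from the output, not documented MISO behaviour. The function name parse_isoform_label is hypothetical. The sketch also undoes the '@' to ' at ' rewriting that list archives sometimes apply.

```python
import re


def parse_isoform_label(label):
    """Split an isoform label into (transcript_id, [(start, end, strand), ...]).

    Assumed label layout (inferred from the example output, not from MISO docs):
    transcript@start@end@strand parts joined by '_'.
    """
    # Drop surrounding quotes and undo any '@' -> ' at ' archive rewriting.
    label = label.strip().strip("'\"").replace(" at ", "@")

    # Split on '_' only where it directly follows a strand sign, so that
    # transcript IDs containing underscores are left intact.
    parts = re.split(r"(?<=@[+-])_", label)

    transcript_ids = set()
    exons = []
    for part in parts:
        # Split from the right: the last three fields are start, end, strand.
        transcript_id, start, end, strand = part.rsplit("@", 3)
        transcript_ids.add(transcript_id)
        exons.append((int(start), int(end), strand))

    if len(transcript_ids) != 1:
        raise ValueError("label mixes several transcript IDs: %r" % label)
    return transcript_ids.pop(), exons
```

Calling parse_isoform_label on the first label of the example returns the plain ID ENST00000445680 together with its two exons.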
Basically, what I would like to do with Miso is an isoform-centric expression
level analysis (i.e. get counts on a transcript level). But with these weird
ID suffixes the summarized results are not really convenient to parse. Example:
ENSG00000215193 0.09,0.61,0.09,0.13,0.08 0.03,0.47,0.02,0.03,0.03
0.18,0.75,0.19,0.26,0.20
'ENST00000474897@18560781@18561372@+_ENST00000474897@18562640@18562780@+_ENST00000474897@18604246@18604468@+_ENST00000474897@18606923@18607071@+_ENST00000474897@18609121@18609801@+_ENST00000474897@18613610@18613905@+','ENST00000329627@18560689@18560813@+_ENST00000329627@18561062@18561372@+_ENST00000329627@18562640@18562780@+_ENST00000329627@18566203@18566498@+_ENST00000329627@18567878@18568024@+_ENST00000329627@18570738@18574236@+','ENST00000399746@18561143@18561372@+_ENST00000399746@18562640@18562780@+_ENST00000399746@18566203@18566498@+_ENST00000399746@18567878@18568024@+_ENST00000399746@18570738@18570784@+_ENST00000399746@18570786@18571022@+','ENST00000428061@18561143@18561372@+_ENST00000428061@18562640@18562780@+_ENST00000428061@18566203@18566498@+_ENST00000428061@18570738@18570885@+','ENST00000399744@18560774@18561372@+_ENST00000399744@18562640@18562780@+_ENST00000399744@18566203@18566498@+_ENST00000399744@18567878@18568024@+_ENST00000399744@18570738@18571226@+'
(0,0,0,1,0):1,(0,1,0,0,0):71,(0,1,0,0,1):1,(0,1,0,1,1):3,(0,1,1,0,1):6,(0,1,1,1,1):3,(1,0,0,0,0):2,(1,0,0,0,1):4,(1,1,0,0,1):1,(1,1,1,1,1):1
0:4,1:81,2:0,3:1,4:7
Miso correctly calculated one score for each of the 5 transcripts, but it will
require some table transformation and regexing to reshape this into a table with
an isoform id column and a score column. Is there some hidden option to obtain
such a more isoform-centric table representation of the summarized results when
using "run_miso.py --summarize-samples"?
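To make the intended reshaping concrete, here is a rough Python sketch of the transformation. It takes the fields of one summary row as strings (gene ID, the three comma-separated score columns, and the quoted isoform list) and returns one row per isoform. The function name isoform_rows and the reading of the three score columns as mean, lower bound, and upper bound are assumptions made for this sketch; how the fields are split out of the summary file is left to the caller.

```python
def isoform_rows(gene_id, means, ci_low, ci_high, isoforms):
    """Turn one per-gene summary row into one row per isoform.

    All arguments except gene_id are comma-separated strings, as they
    appear in the summary example. Treating the three score columns as
    mean / lower bound / upper bound is an assumption of this sketch.
    """
    # The transcript ID is everything before the first '@' of each label;
    # also undo any '@' -> ' at ' rewriting done by a list archive.
    transcript_ids = [
        label.strip().strip("'\"").replace(" at ", "@").split("@", 1)[0]
        for label in isoforms.split(",")
    ]
    mean_values = [float(x) for x in means.split(",")]
    low_values = [float(x) for x in ci_low.split(",")]
    high_values = [float(x) for x in ci_high.split(",")]

    # Every column must describe the same number of isoforms.
    lengths = {len(transcript_ids), len(mean_values),
               len(low_values), len(high_values)}
    if len(lengths) != 1:
        raise ValueError("columns disagree on the number of isoforms")

    return [
        (gene_id, tid, mean, low, high)
        for tid, mean, low, high in zip(
            transcript_ids, mean_values, low_values, high_values)
    ]
```

Each returned tuple has the form (gene, transcript, score, lower, upper), which can be written out directly as a tab-separated table.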
Best,
Holger Brandl
PS @Yarden: Yes, I've received your reply via the mailing list.
--
Dr. Holger Brandl
Bioinformatics Service
Max Planck Institute of Molecular Cell Biology and Genetics
Pfotenhauerstrasse 108
01307 Dresden, Germany
Tel.: +49/351/210-2738
Fax: +49 351 210 2000
www: http://www.mpi-cbg.de