[miso-users] Compression for raw MISO output files

Yarden Katz yarden at MIT.EDU
Tue Feb 5 16:41:42 EST 2013


Hi all,

There's a new utility "miso_zip" (available only in the GitHub version of MISO) that compresses directories containing raw MISO output files (*.miso), reducing both the number of output files and the space they take up.  Compression is described in detail in the manual (http://genes.mit.edu/burgelab/miso/docs/#compressing-raw-miso-output).  It's easy to use:

  miso_zip.py --compress yourfile.misozip yourdir/

This collapses all the *.miso files into a portable SQLite database (http://www.sqlite.org/) that can be read by any SQLite interface, independent of Python.  The collapsed representation is then compressed using the standard zip utility, so your final file is just a regular zip file that is 4-5x smaller than the original directory.

Users who are interested in accessing the raw *.miso files can either: (1) recover the full original contents of the directory by uncompressing:

  miso_zip.py --uncompress yourfile.misozip uncompressed/

Or, alternatively, (2) unzip the .misozip file with the standard unzip utility and then access the SQLite files inside (named *.miso_db) to retrieve the raw data:

  unzip yourfile.misozip
  # read data from *.miso_db files with SQLite...
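
For anyone who prefers to do option (2) from a script, here is a minimal Python sketch. It assumes only what is stated above: the .misozip file is a regular zip archive holding SQLite files that end in .miso_db. The function name list_miso_db_tables is made up for illustration, and the sketch does not assume anything about the table layout inside the databases; it just asks SQLite which tables exist, which is a reasonable first step before writing real queries.

```python
import sqlite3
import tempfile
import zipfile
from pathlib import Path


def list_miso_db_tables(misozip_path):
    """Map each *.miso_db file in the archive to the table names it contains.

    Hypothetical helper: the table layout is not assumed, only discovered.
    """
    tables_by_db = {}
    with tempfile.TemporaryDirectory() as tmpdir:
        # A .misozip file is a regular zip file, so zipfile can unpack it.
        with zipfile.ZipFile(misozip_path) as archive:
            archive.extractall(tmpdir)
        for db_path in sorted(Path(tmpdir).rglob("*.miso_db")):
            conn = sqlite3.connect(str(db_path))
            try:
                # sqlite_master is SQLite's built-in catalog of schema objects.
                rows = conn.execute(
                    "SELECT name FROM sqlite_master "
                    "WHERE type = 'table' ORDER BY name"
                ).fetchall()
            finally:
                conn.close()
            key = db_path.relative_to(tmpdir).as_posix()
            tables_by_db[key] = [name for (name,) in rows]
    return tables_by_db
```

Calling list_miso_db_tables("yourfile.misozip") returns a dictionary keyed by the database file's path inside the archive; the table names it reports can then be queried with ordinary SELECT statements.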

Option (2) allows users to access the raw data without having hundreds/thousands of *.miso files, while option (1) recovers the original directory as it was prior to compression.  Compression is fairly slow (~80 mins to compress a directory containing ~10-20 RNA-Seq lanes worth of MISO output) but worth it to get rid of so many *.miso files.

Note that the "--compress" feature will *not* delete the original directory that was compressed -- it's up to the user to delete that if they wish once compression is finished, to avoid disasters.
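
Since deletion is left to the user, one cautious way to do it is to check the archive first. The following is a rough Python sketch, not part of miso_zip: the function name remove_dir_if_archive_ok is hypothetical, and the check only confirms that the file is a non-empty zip archive whose members pass the zip CRC test. It does not prove that every *.miso file made it into the archive; uncompressing into a scratch directory and comparing would be a stronger check.

```python
import shutil
import zipfile
from pathlib import Path


def remove_dir_if_archive_ok(misozip_path, original_dir):
    """Delete original_dir only if the archive passes basic integrity checks.

    Hypothetical helper. Returns True if the directory was deleted.
    """
    archive = Path(misozip_path)
    if not archive.is_file() or not zipfile.is_zipfile(str(archive)):
        return False
    with zipfile.ZipFile(str(archive)) as zf:
        # testzip() returns the name of the first corrupt member, or None.
        if not zf.namelist() or zf.testzip() is not None:
            return False
    shutil.rmtree(original_dir)
    return True
```

For example, remove_dir_if_archive_ok("yourfile.misozip", "yourdir/") leaves yourdir/ untouched and returns False if the archive is missing, empty or corrupt.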

Also, MISO used to produce many *.miso_dp files, which contain the delta posterior distribution, when running pairwise comparisons between samples.  These are no longer necessary.  Only expert users might occasionally need them, and they can compute the contents of the files themselves, so the current version on GitHub no longer outputs them.

Finally, if you've generated all the summary (*.summary) and Bayes factor (*.miso_bf) files for your dataset and are confident you won't need to make more comparisons in the future, you can also just delete the raw output of MISO.  The summary and Bayes factor files, which contain the relevant information, are tiny.
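
To see how much there is to gain before deleting anything, a small Python sketch like the one below can count the raw *.miso files under a directory and add up their sizes. The function name raw_miso_usage is hypothetical, and the only assumption is that raw output files end in .miso; the summary and Bayes factor files are not matched by this pattern.

```python
from pathlib import Path


def raw_miso_usage(output_dir):
    """Return (file_count, total_bytes) for *.miso files under output_dir."""
    files = [p for p in Path(output_dir).rglob("*.miso") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)
```

The function only reports; it never deletes anything.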

Compression utilities will become part of the next MISO stable release.  Let me know if you have any questions.  

Best, --Yarden


