[Dspace-general] DjVu file format in DSpace
Andrea Bollini
bollini at cilea.it
Fri May 11 02:34:47 EDT 2007
Hello,
any experience and effort are welcome!
The DSpace community exchanges opinions, comments and experiences on many
different channels (mailing lists, the bug and patch trackers on sourceforge.net).
Your contribution is "a patch": a new feature.
You can submit it at:
http://sourceforge.net/tracker/?group_id=19984&atid=319984
If you have any problems, please let me know and I can do it for you.
I have also forwarded this response to the dspace-tech mailing list, which is
the more appropriate one for this topic.
Thank you again for your contribution, and welcome to the DSpace community.
Best wishes,
Andrea
Иван Пенев wrote:
> On Tue Jul 11 04:41:23 EDT 2006 Jama Poulsen wrote:
>
>> Something else. Has anyone worked with DjVu files and DSpace?
>>
>> Some DjVu links:
>> - http://en.wikipedia.org/wiki/DjVu
>> - http://djvulibre.djvuzone.org
>> - http://www.djvuzone.org/links/ (example archives)
>> - http://www.djvuzone.org
>> - http://any2djvu.djvuzone.org/
>> - http://www.archive.org/details/newrock
>>
>> If not I'd like to discuss this anyway :-)
>>
>>
>
> Dear Jama Poulsen, (and everybody interested in this subject...)
>
> I have recently started to use the DSpace software.
> I am neither a librarian nor an IT specialist, just a student, and
> for now I would only like to manage my own collection of mathematics
> books (collected from various sites on the Internet), most of
> which have been scanned from paper and stored in DjVu format.
> As you know, there is a project on <sourceforge.net>, "djvulibre",
> which provides an open-source implementation of DjVu. The package
> includes a utility, "djvutxt", for extracting the text layer from a
> previously OCR-ed DjVu file. I have just written a MediaFilter class
> that invokes this utility to get the extracted text. For now, it
> works well, but I haven't done many tests with it yet. Nevertheless, I
> would like to share the code with the members of the DSpace community,
> who may eventually want to improve it. I have only entry-level
> Java programming skills, so the code is most likely inefficient and/or
> buggy.
> What I actually did was to put the following lines in the
> [dspace-source]/config/dspace.cfg file:
> plugin.sequence.org.dspace.app.mediafilter.MediaFilter = \
> org.dspace.app.mediafilter.DjVuFilter, \
> ...
> filter.org.dspace.app.mediafilter.DjVuFilter.inputFormats = DjVu
> as well as to add the following element to
> [dspace-source]/config/registries/bitstream-formats.xml:
> <bitstream-type>
> <mimetype>image/vnd.djvu</mimetype>
> <short_description>DjVu</short_description>
> <description>DjVu</description>
> <support_level>1</support_level>
> <internal>false</internal>
> <extension>djvu</extension>
> <extension>djv</extension>
> </bitstream-type>
> and to put the source code DjVuFilter.java in the
> [dspace-source]/src/org/dspace/app/mediafilter directory before running
> "ant fresh_install".
>
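Since the filter depends on djvutxt being a valid command in the environment that runs the media filter, it may be worth checking that before the build. The following is only a minimal sketch, not part of the submitted filter: the class name CommandCheck and the method canLaunch are hypothetical, and it assumes a JDK recent enough to provide ProcessBuilder.

```java
import java.io.IOException;

class CommandCheck {
    /**
     * Returns true if the operating system was able to start the named
     * command. The process is destroyed right away: only the launch is
     * of interest here, not the command's output.
     */
    static boolean canLaunch(String command) {
        try {
            Process p = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .start();
            p.destroy();
            return true;
        } catch (IOException e) {
            // start() throws IOException when the command cannot be found or run.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("djvutxt launchable: " + canLaunch("djvutxt"));
    }
}
```

If this reports false for the account that runs the DSpace filter-media job, the filter will fail at the point where it executes the external command.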
> Here is the source code:
> -------------------------------------DjVuFilter.java-------------------------------------
>
> /*
> DjVuFilter.java
> Version: 0.1
> DSpace version: 1.4.2 beta
> Author: Ivan Penev
> e-mail: inpenev at gmail.com
> */
>
> package org.dspace.app.mediafilter;
>
> import java.io.InputStream;
> import java.io.FileInputStream;
> import java.io.BufferedInputStream;
> import java.io.ByteArrayInputStream;
> import java.io.OutputStream;
> import java.io.FileOutputStream;
> import java.io.BufferedOutputStream;
> import java.io.FileReader;
> import java.io.BufferedReader;
> import java.io.File;
>
> /**
> * This class provides a media filter for processing files of type DjVu.
> * <p>The current implementation uses a program called
> * <code>djvutxt</code>, which extracts the text layer from a previously
> * OCR-ed DjVu file and saves it into a UTF-8 text document. The program
> * is distributed with the <code>djvulibre</code> package, which is freely
> * available under the GPL license from <a
> * href="http://djvu.sourceforge.net/">http://djvu.sourceforge.net/</a>
> * for both Unix and Windows operating systems. Hence, for the media
> * filter to work, <code>djvutxt</code> must be a valid
> * command (in the working environment).</p>
> */
> public class DjVuFilter extends MediaFilter
> {
> /**
> * Get a filename for a newly created filtered bitstream.
> *
> * @param sourceName
> * name of source bitstream
> * @return filename generated by the filter - for example, document.djvu
> * becomes document.djvu.txt
> */
> public String getFilteredName(String sourceName)
> {
> return sourceName + ".txt";
> }
>
> /**
> * Get name of the bundle into which this filter will put its generated bitstreams.
> *
> * @return "TEXT"
> */
> public String getBundleName()
> {
> return "TEXT";
> }
>
> /**
> * Get name of the bitstream format returned by this filter.
> *
> * @return "Text"
> */
> public String getFormatString()
> {
> return "Text";
> }
>
> /**
> * Get a string describing the newly-generated bitstream.
> *
> * @return "Extracted text"
> */
> public String getDescription()
> {
> return "Extracted text";
> }
>
> /**
> * Get a bitstream filled with the extracted text from a DjVu bitstream.
> * <p>The bitstream supplied as a parameter is written to a temporary
> * DjVu file on the file system, and the system command
> * <code>djvutxt</code> is called on the latter to produce a UTF-8 text
> * file containing the extracted text. The file is then copied to a
> * bitstream. Finally, the auxiliary files are removed from the file
> * system, and the generated bitstream is returned as a result.</p>
> * <p>WARNING! Write access to the system temporary directory is needed
> * for this method to operate! No exception handling provided!</p>
> *
> * @param source
> * input stream
> *
> * @return result of filter's transformation, written out to a bitstream
> */
> public InputStream getDestinationStream(InputStream source) throws Exception
> {
> /* Some convenience initializations. */
> final String cmd = "djvutxt";
> /* Uniquely named temporary files avoid clashes between runs; a fixed
> name such as "aux" is also a reserved device name on Windows. */
> File djvuFile = File.createTempFile("djvufilter", ".djvu");
> File txtFile = File.createTempFile("djvufilter", ".txt");
>
> try
> {
> /* Store input bitstream to auxiliary DjVu file. */
> streamToFile(source, djvuFile.getPath());
>
> /* Invoke external command djvutxt with appropriate arguments
> to do the actual job... */
> final String[] cmdArray = {cmd, djvuFile.getPath(), txtFile.getPath()};
> Process p = Runtime.getRuntime().exec(cmdArray);
> /* ...and wait for it to terminate */
> p.waitFor();
>
> /* Copy extracted text from file to an independent bitstream,
> optionally print the text to standard output,
> and return the resulting bitstream. */
> return fileToStream(txtFile, MediaFilterManager.isVerbose);
> }
> finally
> {
> /* Remove auxiliary files, even if a step above failed. */
> djvuFile.delete();
> txtFile.delete();
> }
> }
>
> /**
> * Write given input stream to a file on the file system.
> * <p>WARNING! No exception handling!</p>
> *
> * @param inStream input stream
> * @param fileName name of the file to be generated
> *
> * @return <code>File</code> object associated with the generated file
> *
> * @throws Exception
> */
> private File streamToFile(InputStream inStream, String fileName)
> throws Exception
> {
> /* Data will be read from input stream in chunks of size e.g. 4KB. */
> final int chunkSize = 4096;
> byte[] byteArray = new byte[chunkSize];
>
> /* Open the stream for buffered reading. */
> InputStream bufInStream = new BufferedInputStream(inStream);
>
> /* Create an empty file to store the supplied bitstream (if the file
> already exists, createNewFile() does nothing and the output stream
> below overwrites its contents)... */
> File file = new File(fileName);
> file.createNewFile();
> /* ...and associate a buffered output stream with it. */
> OutputStream bufOutStream = new BufferedOutputStream(new FileOutputStream(file));
>
> /* Copy data from input stream to newly generated file. */
> int readBytes = -1;
> while ((readBytes = bufInStream.read(byteArray, 0, chunkSize)) != -1)
> bufOutStream.write(byteArray, 0, readBytes);
>
> /* Stop transactions to the file system... */
> bufOutStream.close();
> /* ...and return result. */
> return file;
> }
>
> /**
> * Produce input stream from a given file on the file system.
> * <p>WARNING! No exception handling!</p>
> *
> * @param file <code>File</code> object associated with the given file
> * @param verbose if true, also print the file contents to standard output
> *
> * @return input stream containing the data read from file
> *
> * @throws Exception
> */
> private InputStream fileToStream(File file, boolean verbose) throws Exception
> {
> /* Open the stream for reading. */
> InputStream inStream = new FileInputStream(file);
>
> /* Allocate necessary memory for data buffer. */
> byte[] byteArray = new byte[(int)file.length()];
>
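> /* Load file contents into buffer; a single read() call is not
> guaranteed to fill it, so loop until all bytes have arrived. */
> int offset = 0;
> int count;
> while (offset < byteArray.length
> && (count = inStream.read(byteArray, offset, byteArray.length - offset)) != -1)
> offset += count;
>
> /* And immediately close transactions with the file system. */
> inStream.close();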
>
> /* If required to send the retrieved data to standard output... */
> if (verbose)
> {
> /* Open the file again, but this time handle it as a character stream... */
> BufferedReader bufReader = new BufferedReader(new FileReader(file));
> /* ...then print its contents line by line to the standard output... */
> String lineOfText = null;
> while ((lineOfText = bufReader.readLine()) != null)
> System.out.println(lineOfText);
> /* ...and close connection to the file. */
> bufReader.close();
> }
>
> /* Finally, generate and return input stream containing desired data. */
> return new ByteArrayInputStream(byteArray);
> }
> }
>
> --------------------------------End of source code------------------------------------
>
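For comparison, here is a standalone sketch of the same pipeline (copy the stream to a file, run the external command, read the result back) written against java.nio.file and ProcessBuilder. It is not the submitted filter and does not depend on DSpace: the class name DjVuTextSketch and its methods are hypothetical, it assumes Java 7 or later, and it assumes that the external command signals failure through a non-zero exit status, which I have not verified for djvutxt in every case.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

class DjVuTextSketch {
    /** Copies the whole stream into a new, uniquely named temporary file. */
    static File copyToTempFile(InputStream in, String suffix) throws IOException {
        File tmp = File.createTempFile("dspace-djvu-", suffix);
        // createTempFile already created the file, so it must be replaced.
        Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /**
     * Runs "command input.djvu output.txt" and returns the text file's
     * bytes as an in-memory stream. Temporary files are always removed.
     */
    static InputStream extractText(InputStream source, String command)
            throws IOException, InterruptedException {
        File djvuFile = copyToTempFile(source, ".djvu");
        File txtFile = File.createTempFile("dspace-djvu-", ".txt");
        try {
            Process p = new ProcessBuilder(command, djvuFile.getPath(), txtFile.getPath())
                    .redirectErrorStream(true)
                    .start();
            // Drain the command's console output so it cannot block on a full pipe.
            try (InputStream console = p.getInputStream()) {
                byte[] buf = new byte[4096];
                while (console.read(buf) != -1) {
                    // output is discarded
                }
            }
            int status = p.waitFor();
            if (status != 0) {
                // Assumption: a non-zero status means the extraction failed.
                throw new IOException(command + " exited with status " + status);
            }
            return new ByteArrayInputStream(Files.readAllBytes(txtFile.toPath()));
        } finally {
            djvuFile.delete();
            txtFile.delete();
        }
    }
}
```

A call such as DjVuTextSketch.extractText(source, "djvutxt") would correspond to the body of getDestinationStream. The main differences are the unique temporary file names, the exit-status check, and the cleanup in a finally block.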
> Please excuse my poor English and superfluous verbosity!
>
> Best wishes!
>
> Ivan Penev
> _______________________________________________
> Dspace-general mailing list
> Dspace-general at mit.edu
> http://mailman.mit.edu/mailman/listinfo/dspace-general
>
>
>