[Dspace-general] DjVu file format in DSpace

Иван Пенев inpenev at gmail.com
Thu May 10 13:20:34 EDT 2007


On Tue Jul 11 04:41:23 EDT 2006 Jama Poulsen wrote:
>  Something else. Has anyone worked with DjVu files and DSpace?
>
>  Some DjVu links:
>  - http://en.wikipedia.org/wiki/DjVu
>  - http://djvulibre.djvuzone.org
>  - http://www.djvuzone.org/links/ (example archives)
>  - http://www.djvuzone.org
>  - http://any2djvu.djvuzone.org/
>  - http://www.archive.org/details/newrock
>
>  If not I'd like to discuss this anyway :-)
>

   Dear Jama Poulsen (and everybody interested in this subject),

   I have recently started to use the DSpace software.
   I am neither a librarian nor an IT specialist, just a student, and
for now I would only like to manage my own collection of mathematics
books (collected from various sites on the Internet), most of which
have been scanned from paper and stored in DjVu format.
   As you know, there is a project on <sourceforge.net>, "djvulibre",
which provides an open-source implementation of DjVu. The package
includes a utility, "djvutxt", for extracting the text layer from a
previously OCR-ed DjVu file. I have just written a MediaFilter class
that invokes this utility to get the extracted text. So far it works
well, but I haven't tested it much yet. Nevertheless, I would like to
share the code with the members of the DSpace community, who may
eventually want to improve it. I have only entry-level Java
programming skills, so the code is most likely inefficient and/or
buggy.
   What I actually did was to put the following lines in the
[dspace-source]/config/dspace.cfg file:
plugin.sequence.org.dspace.app.mediafilter.MediaFilter = \
    org.dspace.app.mediafilter.DjVuFilter, \
...
filter.org.dspace.app.mediafilter.DjVuFilter.inputFormats = DjVu
as well as to add the following element to
[dspace-source]/config/registries/bitstream-formats.xml:
  <bitstream-type>
	  <mimetype>image/vnd.djvu</mimetype>
	  <short_description>DjVu</short_description>
	  <description>DjVu</description>
	  <support_level>1</support_level>
	  <internal>false</internal>
	  <extension>djvu</extension>
	  <extension>djv</extension>
  </bitstream-type>
and to put the source file DjVuFilter.java in the
[dspace-source]/src/org/dspace/app/mediafilter directory before running
"ant fresh_install".

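The filter depends on "djvutxt" being a valid command in the environment
where DSpace runs, so it may be worth checking this first. The following
is only a minimal sketch, assuming a reasonably recent JDK; the class
DjvutxtCheck and the method isCommandAvailable are hypothetical names and
are not part of DSpace or djvulibre:

```java
import java.io.IOException;

/** Hypothetical helper (not part of DSpace or djvulibre). */
final class DjvutxtCheck
{
    /**
     * Returns true if the operating system was able to start the given
     * command. The command is stopped again right away, so its exit
     * status and output are deliberately ignored.
     */
    static boolean isCommandAvailable(String command)
    {
        try
        {
            /* start() throws IOException if the program cannot be found or executed. */
            Process p = new ProcessBuilder(command).start();
            p.destroy();
            return true;
        }
        catch (IOException e)
        {
            return false;
        }
    }

    public static void main(String[] args)
    {
        System.out.println("djvutxt available: " + isCommandAvailable("djvutxt"));
    }
}
```

If this prints "false", the directory containing djvutxt is probably
missing from the PATH of the user that runs the media filter.
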
Here is the source code:
-------------------------------------DjVuFilter.java-------------------------------------

/*
   DjVuFilter.java
   Version: 0.1
   DSpace version: 1.4.2 beta
   Author: Ivan Penev
   e-mail: inpenev at gmail.com
*/

package org.dspace.app.mediafilter;

import java.io.InputStream;
import java.io.FileInputStream;
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.OutputStream;
import java.io.FileOutputStream;
import java.io.BufferedOutputStream;
import java.io.FileReader;
import java.io.BufferedReader;
import java.io.File;

/**
* This class provides a media filter for processing files of type DjVu.
* <p>The current implementation uses a program called
* <code>djvutxt</code>, which extracts the text layer from a previously
* OCR-ed DjVu file and saves it into a UTF-8 text document. The program
* is distributed with the <code>djvulibre</code> package which is freely
* available under the GPL license from <a
* href="http://djvu.sourceforge.net/">http://djvu.sourceforge.net/</a>
* for both Unix and Windows operating systems. Hence, for the media
* filter to work it is required that <code>djvutxt</code> is a valid
* command (in the working environment).</p>
*/
public class DjVuFilter extends MediaFilter
{
    /**
    * Get a filename for a newly created filtered bitstream.
    *
    * @param sourceName
    *            name of source bitstream
    * @return filename generated by the filter - for example, document.djvu
    *         becomes document.djvu.txt
    */
	public String getFilteredName(String sourceName)
	{
		return sourceName + ".txt";
	}
		
    /**
    * Get name of the bundle in which this filter will put its generated bitstreams.
    *
    * @return "TEXT"
    */
	public String getBundleName()
	{
		return "TEXT";
	}
	
    /**
    * Get name of the bitstream format returned by this filter.
    *
    * @return "Text"
    */	
	public String getFormatString()
	{
		return "Text";
	}
		
    /**
    * Get a string describing the newly-generated bitstream.
    *
    * @return  "Extracted text"
    */	
	public String getDescription()
	{
		return "Extracted text";
	}

    /**
    * Get a bitstream filled with the extracted text from a DjVu bitstream.
    * <p>The bitstream supplied as a parameter is written to an auxiliary
    * DjVu file in the system temporary directory, and the system command
    * <code>djvutxt</code> is called on the latter to produce a UTF-8 text
    * file containing the extracted text. The file is then copied to a
    * bitstream. Finally, the auxiliary files are removed from the file
    * system, and the generated bitstream is returned as a result.</p>
    * <p>WARNING! Write access to the temporary directory is needed for
    * this method to operate! Errors are simply propagated to the caller.</p>
    *
    * @param source
    *            input stream
    *
    * @return result of filter's transformation, written out to a bitstream
    */
	public InputStream getDestinationStream(InputStream source) throws Exception
	{
		/* Name of the external command; it must be found on the PATH. */
		final String cmd = "djvutxt";
		
		/* Create uniquely named auxiliary files. A fixed name such as "aux"
		is reserved on Windows and would collide between concurrent runs. */
		File djvuFile = File.createTempFile("dspace-djvu", ".djvu");
		File txtFile = File.createTempFile("dspace-djvu", ".txt");
		
		try
		{
			/* Store input bitstream in the auxiliary DjVu file. */
			streamToFile(source, djvuFile.getPath());
			
			/* Invoke external command djvutxt with appropriate arguments
			to do the actual job... */
			final String[] cmdArray = {cmd, djvuFile.getPath(), txtFile.getPath()};
			Process p = Runtime.getRuntime().exec(cmdArray);
			/* ...and wait for it to terminate; a non-zero exit status means failure. */
			if (p.waitFor() != 0)
				throw new java.io.IOException(cmd + " failed on " + djvuFile.getPath());
			
			/* Copy extracted text from file to an independent bitstream,
			 optionally print the text to standard output, and return it. */
			return fileToStream(txtFile, MediaFilterManager.isVerbose);
		}
		finally
		{
			/* Remove auxiliary files, even if something went wrong. */
			djvuFile.delete();
			txtFile.delete();
		}
	}
	
    /**
    * Write given input stream to a file on the file system.
    * <p>WARNING! Errors are not handled here but propagated to the caller.</p>
    *
    * @param inStream input stream
    * @param fileName name of the file to be generated
    *
    * @return <code>File</code> object associated with the generated file
    *
    * @throws Exception
    */
	private File streamToFile(InputStream inStream, String fileName) throws Exception
	{
		/* Data will be read from the input stream in chunks of 4 KB. */
		final int chunkSize = 4096;
		byte[] byteArray = new byte[chunkSize];
		
		/* Open the stream for buffered reading. */
		InputStream bufInStream = new BufferedInputStream(inStream);
		
		/* Open the file for buffered writing. It is created if missing,
		 and any existing content is overwritten. */
		File file = new File(fileName);
		OutputStream bufOutStream = new BufferedOutputStream(new FileOutputStream(file));
		
		try
		{
			/* Copy data from input stream to the file. */
			int readBytes;
			while ((readBytes = bufInStream.read(byteArray, 0, chunkSize)) != -1)
				bufOutStream.write(byteArray, 0, readBytes);
		}
		finally
		{
			/* Flush and close the file, even if copying failed. */
			bufOutStream.close();
		}
		
		return file;
	}
	
    /**
    * Produce input stream from a given file on the file system.
    * <p>WARNING! Errors are not handled here but propagated to the caller.</p>
    *
    * @param file <code>File</code> object associated with the given file
    * @param verbose whether to also print the file contents (decoded as
    *            UTF-8) to standard output
    *
    * @return input stream containing the data read from file
    *
    * @throws Exception
    */
	private InputStream fileToStream(File file, boolean verbose) throws Exception
	{
		/* Allocate necessary memory for data buffer. */
		byte[] byteArray = new byte[(int)file.length()];
		
		/* Open the stream for reading. */
		InputStream inStream = new FileInputStream(file);
		
		int total = 0;
		try
		{
			/* Load file contents into buffer. A single read() call is not
			 guaranteed to fill it, so keep reading until it is full or
			 the end of the file is reached. */
			int readBytes = 0;
			while (total < byteArray.length
					&& (readBytes = inStream.read(byteArray, total, byteArray.length - total)) != -1)
				total += readBytes;
		}
		finally
		{
			/* Immediately close the connection to the file system. */
			inStream.close();
		}
		
		/* If required, send the retrieved data to standard output,
		 decoding it as UTF-8 (the encoding djvutxt writes). */
		if (verbose)
			System.out.println(new String(byteArray, 0, total, "UTF-8"));
		
		/* Finally, generate and return input stream containing desired data. */
		return new ByteArrayInputStream(byteArray, 0, total);
	}
}

--------------------------------End of source code------------------------------------
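
When an external command such as djvutxt fails, its exit status and
diagnostic messages are usually the quickest way to find out why. The
following is a rough sketch of how a command could be run with its output
captured, assuming Java 8 or later. The class CommandRunner, its nested
class Result, and the method run are hypothetical names, not part of
DSpace, and decoding the output as UTF-8 is an assumption as well:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper (not part of DSpace or djvulibre). */
final class CommandRunner
{
    /** Exit status and combined stdout/stderr of a finished command. */
    static final class Result
    {
        final int exitCode;
        final String output;

        Result(int exitCode, String output)
        {
            this.exitCode = exitCode;
            this.output = output;
        }
    }

    /** Runs the command, waits for it to finish, and returns what it printed. */
    static Result run(String... command)
    {
        try
        {
            /* Merge stderr into stdout so that a single stream has to be read. */
            Process p = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .start();
            /* The command gets no input. */
            p.getOutputStream().close();

            /* Read everything the command prints, so that it can never
               block on a full pipe buffer. */
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            InputStream in = p.getInputStream();
            int n;
            while ((n = in.read(chunk)) != -1)
                buf.write(chunk, 0, n);

            /* Assumption: the output is decoded as UTF-8. */
            return new Result(p.waitFor(), buf.toString("UTF-8"));
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for command", e);
        }
    }
}
```

A call such as CommandRunner.run("djvutxt", "in.djvu", "out.txt"), with
placeholder file names, would then give access to both the exit status
and whatever djvutxt printed, which could be logged when the status is
not zero.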

Please excuse my poor English and superfluous verbosity!

Best wishes!

Ivan Penev


