[Dspace-general] DjVu file format in DSpace

Andrea Bollini bollini at cilea.it
Fri May 11 02:34:47 EDT 2007


Hello,
any experience and effort is welcome!
The DSpace community exchanges opinions, comments and experiences through
several channels (the mailing lists, and the bug and patch trackers on
sourceforge.net).
Your contribution is a "patch", that is, a new feature.
You can submit it at:
http://sourceforge.net/tracker/?group_id=19984&atid=319984
If you have any problems, please let me know and I can submit it for you.

I have also forwarded this response to the dspace-tech mailing list, which
is the more appropriate list for this topic.
Thank you again for your contribution, and welcome to the DSpace community.
Best wishes,
Andrea

Иван Пенев wrote:
> On Tue Jul 11 04:41:23 EDT 2006 Jama Poulsen wrote:
>   
>>  Something else. Has anyone worked with DjVu files and DSpace?
>>
>>  Some DjVu links:
>>  - http://en.wikipedia.org/wiki/DjVu
>>  - http://djvulibre.djvuzone.org
>>  - http://www.djvuzone.org/links/ (example archives)
>>  - http://www.djvuzone.org
>>  - http://any2djvu.djvuzone.org/
>>  - http://www.archive.org/details/newrock
>>
>>  If not I'd like to discuss this anyway :-)
>>
>>     
>
>    Dear Jama Poulsen (and everybody interested in this subject),
>
>    I have recently started to use the DSpace software.
>    I am neither a librarian nor an IT specialist, just a student, and
> for now I would only like to manage my own collection of mathematics
> books (collected from various sites on the Internet), most of
> which have been scanned from paper and stored in DjVu format.
>    As you know, there is a project on <sourceforge.net>, "djvulibre",
> which provides an open-source implementation of DjVu. The package
> includes a utility, "djvutxt", for extracting the text layer from a
> previously OCR-ed DjVu file. I have just written a MediaFilter class
> that invokes this utility to get the extracted text. For now, it
> works well, but I haven't done many tests with it yet. Nevertheless, I
> would like to share the code with the members of the DSpace community,
> who may want to improve it. I have only entry-level
> Java programming skills, so the code is most likely inefficient and/or
> buggy.
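
For readers who want to see the core idea in isolation, here is a minimal
sketch of calling djvutxt from Java. It assumes that "djvutxt" is on the PATH
and takes an input DjVu file followed by an output text file, as the filter
described in this message does. The class name DjvuTextCommand and its two
methods are hypothetical and not part of DSpace, and the sketch uses
ProcessBuilder, so it needs Java 5 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of DSpace: runs djvutxt on one file. */
public class DjvuTextCommand {

    /** Builds the argument list: djvutxt, input DjVu file, output text file. */
    public static List<String> build(String djvuPath, String txtPath) {
        List<String> command = new ArrayList<String>();
        command.add("djvutxt"); // assumed to be on the PATH
        command.add(djvuPath);
        command.add(txtPath);
        return command;
    }

    /** Runs the command, discards its console output, returns the exit code. */
    public static int run(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout
        Process process = builder.start();
        // Drain the output so the child process cannot block on a full pipe.
        InputStream output = process.getInputStream();
        try {
            byte[] buffer = new byte[4096];
            while (output.read(buffer) != -1) {
                // nothing to do: the console output is intentionally ignored
            }
        } finally {
            output.close();
        }
        return process.waitFor();
    }
}
```

A caller would pass the result of build(...) to run(...) and treat a non-zero
exit code as a failed extraction.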
>    What I actually did was to put the following lines in the
> [dspace-source]/config/dspace.cfg file:
> plugin.sequence.org.dspace.app.mediafilter.MediaFilter = \
>     org.dspace.app.mediafilter.DjVuFilter, \
> ...
> filter.org.dspace.app.mediafilter.DjVuFilter.inputFormats = DjVu
> I also added the following element to
> [dspace-source]/config/registries/bitstream-formats.xml:
>   <bitstream-type>
> 	  <mimetype>image/vnd.djvu</mimetype>
> 	  <short_description>DjVu</short_description>
> 	  <description>DjVu</description>
> 	  <support_level>1</support_level>
> 	  <internal>false</internal>
> 	  <extension>djvu</extension>
> 	  <extension>djv</extension>
>   </bitstream-type>
> Finally, I put the source file DjVuFilter.java in the
> [dspace-source]/src/org/dspace/app/mediafilter directory before running
> "ant fresh_install".
>
> Here is the source code:
> -------------------------------------DjVuFilter.java-------------------------------------
>
> /*
>    DjVuFilter.java
>    Version: 0.1
>    DSpace version: 1.4.2 beta
>    Author: Ivan Penev
>    e-mail: inpenev at gmail.com
> */
>
> package org.dspace.app.mediafilter;
>
> import java.io.InputStream;
> import java.io.FileInputStream;
> import java.io.BufferedInputStream;
> import java.io.ByteArrayInputStream;
> import java.io.OutputStream;
> import java.io.FileOutputStream;
> import java.io.BufferedOutputStream;
> import java.io.FileReader;
> import java.io.BufferedReader;
> import java.io.File;
>
> /**
> * This class provides a media filter for processing files of type DjVu.
> * <p>The current implementation uses a program called
> * <code>djvutxt</code>, which extracts the text layer from a previously
> * OCR-ed DjVu file and saves it into a UTF-8 text document. The program
> * is distributed with the <code>djvulibre</code> package which is freely
> * available under the GPL license from <a
> * href="http://djvu.sourceforge.net/">http://djvu.sourceforge.net/</a>
> * for both Unix and Windows operating systems. Hence, for the media
> * filter to work it is required that <code>djvutxt</code> is a valid
> * command (in the working environment).</p>
> */
> public class DjVuFilter extends MediaFilter
> {
>     /**
>     * Get a filename for a newly created filtered bitstream.
>     *
>     * @param sourceName
>     *            name of source bitstream
>     * @return filename generated by the filter - for example, document.djvu
>     *         becomes document.djvu.txt
>     */
> 	public String getFilteredName(String sourceName)
> 	{
> 		return sourceName + ".txt";
> 	}
> 		
>     /**
>     * Get name of the bundle in which this filter will store its generated bitstreams.
>     *
>     * @return "TEXT"
>     */
> 	public String getBundleName()
> 	{
> 		return "TEXT";
> 	}
> 	
>     /**
>     * Get name of the bitstream format returned by this filter.
>     *
>     * @return "Text"
>     */	
> 	public String getFormatString()
> 	{
> 		return "Text";
> 	}
> 		
>     /**
>     * Get a string describing the newly-generated bitstream.
>     *
>     * @return  "Extracted text"
>     */	
> 	public String getDescription()
> 	{
> 		return "Extracted text";
> 	}
>
>     /**
>     * Get a bitstream filled with the extracted text from a DjVu bitstream.
>     * <p>The bitstream supplied as a parameter is written to a temporary
>     * DjVu file, and the system command <code>djvutxt</code> is called on
>     * it to produce a UTF-8 text file containing the extracted text. That
>     * file is then copied to a bitstream. Finally, the auxiliary files are
>     * removed from the file system, and the generated bitstream is returned
>     * as a result.</p>
>     * <p>WARNING! Write access to the system temporary directory is needed
>     * for this method to operate! Exceptions are passed on to the caller.</p>
>     *
>     * @param source
>     *            input stream
>     *
>     * @return result of filter's transformation, written out to a bitstream
>     */
> 	public InputStream getDestinationStream(InputStream source) throws Exception
> 	{
> 		/* Some convenience initializations. */
> 		final String cmd = "djvutxt";
> 		
> 		/* Uniquely named temporary files, so that concurrent runs cannot
> 		 overwrite each other's data. */
> 		File djvuFile = File.createTempFile("djvufilter", ".djvu");
> 		File txtFile = File.createTempFile("djvufilter", ".txt");
> 		
> 		try
> 		{
> 			/* Store input bitstream to auxiliary DjVu file. */
> 			streamToFile(source, djvuFile.getAbsolutePath());
> 			
> 			/* Invoke external command djvutxt with appropriate arguments
> 			 to do the actual job... */
> 			final String[] cmdArray = {cmd, djvuFile.getAbsolutePath(),
> 				txtFile.getAbsolutePath()};
> 			Process p = Runtime.getRuntime().exec(cmdArray);
> 			/* ...wait for it to terminate, and check that it succeeded. */
> 			if (p.waitFor() != 0)
> 				throw new Exception(cmd + " exited with an error");
> 			
> 			/* Copy extracted text from file to an independent bitstream,
> 			 optionally print the text to standard output, and return
> 			 the resulting bitstream. */
> 			return fileToStream(txtFile, MediaFilterManager.isVerbose);
> 		}
> 		finally
> 		{
> 			/* Remove auxiliary files in every case. */
> 			djvuFile.delete();
> 			txtFile.delete();
> 		}
> 	}
> 	
>     /**
>     * Write given input stream to a file on the file system.
>     * <p>WARNING! No exception handling!</p>
>     *
>     * @param inStream input stream
>     * @param fileName name of the file to be generated
>     *
>     * @return <code>File</code> object associated with the generated file
>     *
>     * @throws Exception
>     */
> 	private File streamToFile(InputStream inStream, String fileName) throws Exception
> 	{
> 		/*  Data will be read from input stream in chunks of size e.g. 4KB. */
> 		final int chunkSize = 4096;
> 		byte[] byteArray = new byte[chunkSize];
> 		
> 		/* Open the stream for buffered reading. */
> 		InputStream bufInStream = new BufferedInputStream(inStream);
> 		
> 		/* Create an empty file (if the file already exists, it will be left
> 		 untouched) to store the supplied bitstream... */
> 		File file = new File(fileName);
> 		file.createNewFile();
> 		/* ...and associate a buffered output stream with it. */
> 		OutputStream bufOutStream = new BufferedOutputStream(new FileOutputStream(file));
> 		
> 		/* Copy data from input stream to newly generated file. */
> 		int readBytes = -1;
> 		while ((readBytes = bufInStream.read(byteArray, 0, chunkSize)) != -1)
> 			bufOutStream.write(byteArray, 0, readBytes);
> 		
> 		/* Stop transactions to the file system... */
> 		bufOutStream.close();
> 		/* ...and return result. */
> 		return file;
> 	}
> 	
>     /**
>     * Produce input stream from a given file on the file system.
>     * <p>WARNING! No exception handling!</p>
>     *
>     * @param file <code>File</code> object associated with the given file
>     * @param verbose whether to also print the file contents to standard output
>     *
>     * @return input stream containing the data read from file
>     *
>     * @throws Exception
>     */
> 	private InputStream fileToStream(File file, boolean verbose) throws Exception
> 	{
> 		/* Open the stream for reading. */
> 		InputStream inStream = new FileInputStream(file);
> 		
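> 		/* Allocate necessary memory for data buffer. */
> 		byte[] byteArray = new byte[(int)file.length()];
> 		
> 		/* Load file contents into buffer. A single read() call may return
> 		 fewer bytes than requested, so loop until the buffer is full. */
> 		int offset = 0;
> 		while (offset < byteArray.length)
> 		{
> 			int readBytes = inStream.read(byteArray, offset, byteArray.length - offset);
> 			if (readBytes == -1)
> 				break;
> 			offset += readBytes;
> 		}
> 		
> 		/* And immediately close the file. */
> 		inStream.close();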
> 		
> 		/* If required to send the retrieved data to standard output... */
> 		if (verbose)
> 		{
> 			/* Open the file again, but this time handle it as a character stream... */
> 			BufferedReader bufReader = new BufferedReader(new FileReader(file));
> 			/* ...then print its contents line by line to the standard output... */
> 			String lineOfText = null;
> 			while ((lineOfText = bufReader.readLine()) != null)
> 				System.out.println(lineOfText);
> 			/* ...and close connection to the file. */
> 			bufReader.close();
> 		}
> 			
> 		/* Finally, generate and return input stream containing desired data. */
> 		return new ByteArrayInputStream(byteArray);
> 	}
> }
>
> --------------------------------End of source code--------------------------------
>
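
Two details are easy to get wrong when moving data between streams and
files: a single read() call may return fewer bytes than were requested, and
fixed file names can collide when two jobs run at the same time. The
following is a minimal sketch of one way to handle both, using only
java.io. The class name TempFileCopy, its method names and the "djvufilter"
file prefix are hypothetical and not part of DSpace.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical helper, not part of DSpace. */
public class TempFileCopy {

    /** Copies everything from in to out in 4 KB chunks until end of stream. */
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
    }

    /** Writes the stream to a uniquely named temporary file. */
    public static File toTempFile(InputStream in, String suffix)
            throws IOException {
        File file = File.createTempFile("djvufilter", suffix);
        OutputStream out = new FileOutputStream(file);
        try {
            copy(in, out);
        } finally {
            out.close();
        }
        return file;
    }

    /** Reads the whole file into memory, however many reads that takes. */
    public static byte[] readAll(File file) throws IOException {
        InputStream in = new FileInputStream(file);
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            copy(in, bytes);
            return bytes.toByteArray();
        } finally {
            in.close();
        }
    }
}
```

The caller remains responsible for deleting the temporary file once it is no
longer needed.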
> Please excuse my poor English and superfluous verbosity!
>
> Best wishes!
>
> Ivan Penev
> _______________________________________________
> Dspace-general mailing list
> Dspace-general at mit.edu
> http://mailman.mit.edu/mailman/listinfo/dspace-general
>
>
>   



