[Dspace-general] Fwd: Filter media and deleted items

Jeffrey Trimble jtrimble at cc.ysu.edu
Wed Nov 25 08:53:36 EST 2009


We've seen this too, for several hundred files.  There are two
ways to deal with it:

1.  Use the "skip list" and skip those in question during filter media.

2.  Re-save the affected PDFs.  We are seeing problems with Acrobat 8
and Acrobat 9: those versions add internal tagging that PDFBox.jar
cannot handle.  So far, we have done a "save as" and changed the
settings to strip nearly all of the internal tagging and other
features from the document.  Next week (after the Thanksgiving
holiday) we will continue experimenting with and studying this issue,
so that we can document it for our staff and they can make sure all
PDFs will extract correctly.
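
For option 1, here is a rough sketch of how the handles for the skip
list could be collected from saved filter-media output.  It is only an
illustration: the function names failed_handles and skip_list are made
up, and it assumes the error report has the same layout as the one
quoted further down.  It also assumes that filter-media in your DSpace
version takes the skip list as comma-separated handles with no spaces;
please check the documentation for your version.

```python
import re

# Matches the "Item Handle:" line of a filter-media error report.
# search() is used so a leading "> " mail-quote prefix does not matter.
HANDLE_LINE = re.compile(r"Item Handle:\s*(\S+)")


def failed_handles(log_text):
    """Return handles of items that failed to filter, in order, no duplicates."""
    handles = []
    in_error_report = False
    for line in log_text.splitlines():
        if "ERROR filtering, skipping bitstream" in line:
            # The next "Item Handle:" line belongs to a failed bitstream.
            in_error_report = True
            continue
        match = HANDLE_LINE.search(line)
        if in_error_report and match:
            handle = match.group(1)
            if handle not in handles:
                handles.append(handle)
            in_error_report = False
    return handles


def skip_list(log_text):
    """Join the failed handles with commas and no spaces."""
    return ",".join(failed_handles(log_text))


# Sample input copied from the error report in the quoted message.
sample = """ERROR filtering, skipping bitstream:

        Item Handle: 10394/1886
        Bundle Name: ORIGINAL
        File Size: 287223
        Checksum: 6de2597a7cabd6ca3a995c355d9301f1 (MD5)
        Asset Store: 1
java.lang.NullPointerException
"""

print(skip_list(sample))  # prints 10394/1886
```

The resulting string can be saved to a file and kept as the skip list
while the PDFs themselves are being fixed.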

--Jeff



Jeffrey Trimble
System Librarian
William F.  Maag Library
Youngstown State University
330.941.2483 (Office)
jtrimble at cc.ysu.edu
http://www.maag.ysu.edu
http://digital.maag.ysu.edu
"I must not fear.  Fear is the mind-killer.
I will permit it to pass over me and through me..."
--Litany against fear....

On Nov 25, 2009, at 5:31 AM, Louw Venter wrote:

> Anyone have any ideas please?
>
>
> >>> On 03 November 2009 at 12:40 PM, "Louw Venter" <Louw.Venter at nwu.ac.za> wrote:
> Hello all,
>
> I made a bit of a mess.
> A while back I uploaded some PDF documents to DSpace and ran filter
> media to extract the text. Recently the creators of the PDF files
> sent me a batch with updated volume numbers etc. to replace the
> existing ones already on the server. So I simply removed the items
> and added new bitstreams.
> Now when I run the filter media process again, the text doesn't get
> extracted. Could this be because the checksums don't match, or
> because the original was located in one assetstore and the new one
> in another?
>
> Thank you in advance for any help in this regard,
>
>
> ERROR filtering, skipping bitstream:
>
>         Item Handle: 10394/1886
>         Bundle Name: ORIGINAL
>         File Size: 287223
>         Checksum: 6de2597a7cabd6ca3a995c355d9301f1 (MD5)
>         Asset Store: 1
> java.lang.NullPointerException
> java.lang.NullPointerException
>         at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
>         at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
>         at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
>         at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>         at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:141)
>         at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:668)
>         at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:570)
>         at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:520)
>         at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:488)
>         at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:427)
>         at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:359)
>
>
> Louw Venter
