[Dspace-general] How to enable full-text searching, posted April 25, 2007

Rowena Wake rwake at ucalgary.ca
Thu Apr 26 18:03:03 EDT 2007


Hi Anny,

I saw your posting about problems with full-text searching of the PDFs
that you upload to collections.
We also had this problem and discovered that, in our case, it was
related to the PDFs themselves. Not all PDFs contain machine-readable
text: a scanned PDF is only page images unless it has been through
Optical Character Recognition (OCR). If a scanned PDF has not been
OCR'd, the media filter cannot extract its text and the file will not
be full-text searchable.
Our media filter is set up to run every night. That, coupled with
trying to ensure that all PDFs we upload are OCR'd, has solved our
problems.
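
A minimal sketch of how such a nightly run might be wrapped in a script
that cron or a Windows scheduled task calls. It is an illustration only:
the /dspace install path and the function names are assumptions, and the
only DSpace piece it relies on is the bin/filter-media command mentioned
in the thread below.

```python
import subprocess
from pathlib import Path


def filter_media_command(dspace_dir):
    """Build the command line for DSpace's media filter.

    dspace_dir is the DSpace install directory (an assumption here;
    substitute your own path).
    """
    return [str(Path(dspace_dir) / "bin" / "filter-media")]


def run_filter_media(dspace_dir="/dspace", runner=subprocess.run):
    """Run bin/filter-media once and return its exit code.

    runner is injectable so the function can be exercised without a
    DSpace installation; by default it is subprocess.run.
    """
    completed = runner(filter_media_command(dspace_dir), check=False)
    return completed.returncode
```

A scheduled job would call run_filter_media() once a night and treat a
non-zero return value as a failure worth logging.
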
Ensuring OCR can be difficult, as we have found out. For example,
settings for scanning straight to PDF with the OCR function do not
guarantee that the text will actually be attached to the PDF when it is
uploaded, documents that are scanned crookedly do not OCR well, and many
PDFs sent to us we need to OCR ourselves. So we shoot for about 80%
accuracy in our text files.
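
One way to check a PDF before uploading it is to see whether any text
can be extracted at all. The sketch below assumes the pdftotext
command-line tool (from Poppler or Xpdf) is installed; it is not what
the DSpace media filter itself uses, so results may differ, and the
function names and the 20-character threshold are arbitrary choices
made for illustration.

```python
import subprocess


def extract_text(pdf_path):
    """Return the text pdftotext can pull out of a PDF.

    Assumes pdftotext is on the PATH; "-" sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def has_text_layer(pdf_path, extractor=extract_text, min_chars=20):
    """Guess whether a PDF has extractable text (e.g. has been OCR'd).

    Counts non-whitespace characters only, so a scan that yields
    nothing but page breaks and blank lines is reported as having
    no text layer.
    """
    text = extractor(pdf_path)
    visible = sum(1 for ch in text if not ch.isspace())
    return visible >= min_chars
```

A file for which has_text_layer() returns False is a candidate for OCR
before upload. A True result says only that some text is present, not
that the OCR is accurate.
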
I hope this relates to your problem and helps.
Rowena

dspace-general-request at mit.edu wrote:

>Today's Topics:
>
>   1. How enable full-text searching.(Newbie) (Anny Bridge)
>   2. Re: How enable full-text searching.(Newbie) (Stuart Lewis [sdl])
>   3. Re: How enable full-text searching.(Newbie) (Anny Bridge)
>   4. Re: How enable full-text searching.(Newbie) (Stuart Lewis [sdl])
>   5. DStat Release 3 (Jaco Fourie)
>
>
>----------------------------------------------------------------------
>
>Message: 1
>Date: Wed, 25 Apr 2007 16:18:08 +0800
>From: "Anny Bridge" <anybridge at gmail.com>
>Subject: [Dspace-general] How enable full-text searching.(Newbie)
>To: dspace-general at mit.edu
>Message-ID:
>	<41255c260704250118q1b95e7cfp6255002769208ea2 at mail.gmail.com>
>Content-Type: text/plain; charset="utf-8"
>
>Hi,
>
>I add a pdf file to a collection. Then try full-text searching, it failed.
>
>Is it necessary to run bin/index-all manually to support full text
>searching? Or is it ok by simply altering the dspace.cfg file?
>
>Thanks in Advance,
>
>Anny
>
>------------------------------
>
>Message: 2
>Date: Wed, 25 Apr 2007 09:31:10 +0100
>From: "Stuart Lewis [sdl]" <sdl at aber.ac.uk>
>Subject: Re: [Dspace-general] How enable full-text searching.(Newbie)
>To: Anny Bridge <anybridge at gmail.com>, <dspace-general at mit.edu>
>Message-ID: <C254D05E.158F1%sdl at aber.ac.uk>
>Content-Type: text/plain;	charset="US-ASCII"
>
>Hi Anny,
>
>>I add a pdf file to a collection. Then try full-text searching, it failed.
>>
>>Is it necessary to run bin/index-all manually to support full text searching?
>>Or is it ok by simply altering the dspace.cfg file?
>
>You need to run bin/filter-media to extract the text from the pdf document.
>
>Thanks,
>
>
>Stuart
>_________________________________________________________________
>
>Datblygydd Cymwysiadau'r We            Web Applications Developer
>Gwasanaethau Gwybodaeth                      Information Services
>Prifysgol Cymru Aberystwyth       University of Wales Aberystwyth
>
>            E-bost / E-mail: Stuart.Lewis at aber.ac.uk
>                 Ffon / Tel: (01970) 622860
>_________________________________________________________________
>
>
>
>------------------------------
>
>Message: 3
>Date: Wed, 25 Apr 2007 16:40:06 +0800
>From: "Anny Bridge" <anybridge at gmail.com>
>Subject: Re: [Dspace-general] How enable full-text searching.(Newbie)
>To: "Stuart Lewis [sdl]" <sdl at aber.ac.uk>
>Cc: dspace-general at mit.edu
>Message-ID:
>	<41255c260704250140t74326aer8db4ad684f2523b9 at mail.gmail.com>
>Content-Type: text/plain; charset="utf-8"
>
>Hi Stuart,
>
>Does it mean I have to run bin/filter-media manually every time I add a
>pdf file?
>
>Is it possible by altering the Media Filter plugins (through PluginManager)
>in the dspace.cfg file?
>
>Thanks for your help.
>
>Anny.
>
>
>------------------------------
>
>Message: 4
>Date: Wed, 25 Apr 2007 09:49:30 +0100
>From: "Stuart Lewis [sdl]" <sdl at aber.ac.uk>
>Subject: Re: [Dspace-general] How enable full-text searching.(Newbie)
>To: Anny Bridge <anybridge at gmail.com>,
>	"dspace-tech at lists.sourceforge.net"
>	<dspace-tech at lists.sourceforge.net>
>Cc: dspace-general at mit.edu
>Message-ID: <C254D4AA.158F9%sdl at aber.ac.uk>
>Content-Type: text/plain;	charset="US-ASCII"
>
>Hi Anny,
>
>>Does it mean I have to run bin/filter-media manually every time I add a
>>pdf file?
>
>People tend to run this periodically, for example once a night. This can be
>enabled via a cron job or scheduled task (unix or windows). See:
>
>http://www.dspace.org/technology/system-docs/install.html#advancedinstall
>
>>Is it possible by altering the Media Filter plugins (through PluginManager) in
>>the dspace.cfg file?
>
>Those settings are used to enable or disable different filters. For example,
>you might decide that you want to enable full text searching of PDF files,
>but not MS Word documents, in which case you can edit the settings. Or, you
>might decide to write a filter to extract text from a different file format,
>and you can add that there.
>
>Thanks,
>
>
>Stuart
>
>P.S. - I have copied this to the dspace-tech (
>https://lists.sourceforge.net/lists/listinfo/dspace-tech) email list as it
>is probably better suited there
>
>------------------------------
>
>Message: 5
>Date: Wed, 25 Apr 2007 11:08:00 +0200
>From: "Jaco Fourie" <JFourie at csir.co.za>
>Subject: [Dspace-general] DStat Release 3
>To: <dspace-general at mit.edu>
>Message-ID: <462F3690020000310001E39A at cs-emo.csir.co.za>
>Content-Type: text/plain; charset="us-ascii"
>
>I get this error when I run the analyser. Is it out of date?
> 
>D:\DSpace\bin>dsrun ac.ed.dspace.stats.LogAnalyser -start 2006-01-01 -end 2006-12-31 -out 2007-aggregation.dat
>Using DSpace installation in: D:\DSpace
>Exception in thread "main" org.postgresql.util.PSQLException: ERROR: relation "dctyperegistry" does not exist
>
>  
>

-- 
Rowena Wake
Institutional Repository Administrator
Libraries and Cultural Resources
University of Calgary 
Phone: (403) 210-6753
Email: rwake at ucalgary.ca
