[Dspace-general] How to enable full-text searching, posted April 25, 2007
Rowena Wake
rwake at ucalgary.ca
Thu Apr 26 18:03:03 EDT 2007
Hi Anny,
I saw your posting about problems with full-text searching of PDFs
that you upload to collections.
We also had this problem and discovered that, in our case, it is
related to the PDFs themselves. Not all PDFs contain a text layer: a
scanned PDF that has not been through Optical Character Recognition
(OCR) holds only page images. If a PDF has not been OCR'd, the media
filter cannot extract its text, and it will not be full-text searchable.
Our media filter is set up to run every night, and that, coupled with
trying to ensure that all PDFs we upload are OCR'd, has solved our problems.
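A nightly run of this kind could be sketched as a small shell wrapper
called from cron. This is an illustration rather than the setup
described above: the /dspace install path, the log location, and the
function name are assumptions. Only bin/filter-media comes from this
thread.

```shell
#!/bin/sh
# Hypothetical wrapper around DSpace's bin/filter-media for a nightly run.
# DSPACE_HOME is an assumed install path; override it to match your system.
DSPACE_HOME="${DSPACE_HOME:-/dspace}"

# Run the media filter and append its output to a log file.
# $1: log file path (the default below is an assumption).
# Returns the exit status of filter-media.
run_media_filter() {
    log="${1:-/tmp/filter-media.log}"
    echo "filter-media run started: $(date)" >> "$log"
    "$DSPACE_HOME/bin/filter-media" >> "$log" 2>&1
}
```

Saved in a script that ends with a call to run_media_filter, this could
be scheduled with a crontab line such as
"0 2 * * * /path/to/nightly-filter.sh" (the path and the 02:00 time are
hypothetical).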
Ensuring OCR can be difficult, as we have found out. For example,
scanning straight to PDF with the OCR function enabled does not
guarantee that the text will actually be attached to the PDF when it is
uploaded; documents that are scanned crookedly do not OCR well; and many
PDFs sent to us we need to OCR ourselves. So we aim for about 80%
accuracy in our text files.
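One way to spot PDFs that still need OCR before uploading is to try
extracting their text and count the words that come out. The sketch
below assumes the pdftotext utility (from Poppler or Xpdf) is installed;
the function names and the 20-word threshold are hypothetical choices,
not part of DSpace.

```shell
#!/bin/sh
# Count the words arriving on standard input and print the number.
count_words() {
    wc -w | tr -d '[:space:]'
}

# Succeed when the PDF in $1 yields at least MIN_WORDS words of text.
# Assumes pdftotext is on the PATH; "-" sends the extracted text to stdout.
# If extraction fails, the word count is 0 and the check fails.
has_text_layer() {
    min_words="${MIN_WORDS:-20}"
    words=$(pdftotext "$1" - 2>/dev/null | count_words)
    [ "${words:-0}" -ge "$min_words" ]
}
```

For example, running
for f in *.pdf; do has_text_layer "$f" || echo "needs OCR: $f"; done
lists candidates in the current directory. A low word count only
suggests a missing text layer; it says nothing about how accurate an
existing OCR layer is.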
I hope this relates to your problem and helps.
Rowena
dspace-general-request at mit.edu wrote:
>Today's Topics:
>
> 1. How enable full-text searching.(Newbie) (Anny Bridge)
> 2. Re: How enable full-text searching.(Newbie) (Stuart Lewis [sdl])
> 3. Re: How enable full-text searching.(Newbie) (Anny Bridge)
> 4. Re: How enable full-text searching.(Newbie) (Stuart Lewis [sdl])
> 5. DStat Release 3 (Jaco Fourie)
>
>
>----------------------------------------------------------------------
>
>Message: 1
>Date: Wed, 25 Apr 2007 16:18:08 +0800
>From: "Anny Bridge" <anybridge at gmail.com>
>Subject: [Dspace-general] How enable full-text searching.(Newbie)
>To: dspace-general at mit.edu
>Message-ID:
> <41255c260704250118q1b95e7cfp6255002769208ea2 at mail.gmail.com>
>Content-Type: text/plain; charset="utf-8"
>
>Hi,
>
>I added a PDF file to a collection, then tried full-text searching, but it failed.
>
>Is it necessary to run bin/index-all manually to support full text
>searching? Or is it ok by simply altering the dspace.cfg file?
>
>Thanks in Advance,
>
>Anny
>
>------------------------------
>
>Message: 2
>Date: Wed, 25 Apr 2007 09:31:10 +0100
>From: "Stuart Lewis [sdl]" <sdl at aber.ac.uk>
>Subject: Re: [Dspace-general] How enable full-text searching.(Newbie)
>To: Anny Bridge <anybridge at gmail.com>, <dspace-general at mit.edu>
>Message-ID: <C254D05E.158F1%sdl at aber.ac.uk>
>Content-Type: text/plain; charset="US-ASCII"
>
>Hi Anny,
>
>
>
>>I add a pdf file to a collection. Then try full-text searching, it failed.
>>
>>Is it necessary to run bin/index-all manually to support full text searching?
>>Or is it ok by simply altering the dspace.cfg file?
>>
>>
>
>You need to run bin/filter-media to extract the text from the pdf document.
>
>Thanks,
>
>
>Stuart
>_________________________________________________________________
>
>Datblygydd Cymwysiadau'r We Web Applications Developer
>Gwasanaethau Gwybodaeth Information Services
>Prifysgol Cymru Aberystwyth University of Wales Aberystwyth
>
> E-bost / E-mail: Stuart.Lewis at aber.ac.uk
> Ffon / Tel: (01970) 622860
>_________________________________________________________________
>
>
>
>------------------------------
>
>Message: 3
>Date: Wed, 25 Apr 2007 16:40:06 +0800
>From: "Anny Bridge" <anybridge at gmail.com>
>Subject: Re: [Dspace-general] How enable full-text searching.(Newbie)
>To: "Stuart Lewis [sdl]" <sdl at aber.ac.uk>
>Cc: dspace-general at mit.edu
>Message-ID:
> <41255c260704250140t74326aer8db4ad684f2523b9 at mail.gmail.com>
>Content-Type: text/plain; charset="utf-8"
>
>Hi Stuart,
>
>Does this mean I have to run bin/filter-media manually every time I add a
>PDF file?
>
>Is it possible by altering the Media Filter plugins (through PluginManager)
>in the dspace.cfg file?
>
>Thanks for your help.
>
>Anny.
>
>
>------------------------------
>
>Message: 4
>Date: Wed, 25 Apr 2007 09:49:30 +0100
>From: "Stuart Lewis [sdl]" <sdl at aber.ac.uk>
>Subject: Re: [Dspace-general] How enable full-text searching.(Newbie)
>To: Anny Bridge <anybridge at gmail.com>,
> "dspace-tech at lists.sourceforge.net"
> <dspace-tech at lists.sourceforge.net>
>Cc: dspace-general at mit.edu
>Message-ID: <C254D4AA.158F9%sdl at aber.ac.uk>
>Content-Type: text/plain; charset="US-ASCII"
>
>Hi Anny,
>
>
>
>>Does this mean I have to run bin/filter-media manually every time I add a
>>PDF file?
>>
>>
>
>People tend to run this periodically, for example once a night. This can be
>set up as a cron job (Unix) or a scheduled task (Windows). See:
>
>http://www.dspace.org/technology/system-docs/install.html#advancedinstall
>
>
>
>>Is it possible by altering the Media Filter plugins (through PluginManager) in
>>the dspace.cfg file?
>>
>>
>
>Those settings are used to enable or disable different filters. For example,
>you might decide that you want to enable full text searching of PDF files,
>but not MS Word documents, in which case you can edit the settings. Or, you
>might decide to write a filter to extract text from a different file format,
>and you can add that there.
>
>Thanks,
>
>
>Stuart
>
>P.S. - I have copied this to the dspace-tech (
>https://lists.sourceforge.net/lists/listinfo/dspace-tech) email list as it
>is probably better suited there
>
>
>
>------------------------------
>
>Message: 5
>Date: Wed, 25 Apr 2007 11:08:00 +0200
>From: "Jaco Fourie" <JFourie at csir.co.za>
>Subject: [Dspace-general] DStat Release 3
>To: <dspace-general at mit.edu>
>Message-ID: <462F3690020000310001E39A at cs-emo.csir.co.za>
>Content-Type: text/plain; charset="us-ascii"
>
>I get this error when I run the analyser. Is it out of date?
>
>D:\DSpace\bin>dsrun ac.ed.dspace.stats.LogAnalyser -start 2006-01-01 -end 2006-12-31 -out 2007-aggregation.dat
>Using DSpace installation in: D:\DSpace
>Exception in thread "main" org.postgresql.util.PSQLException: ERROR: relation "dctyperegistry" does not exist
>
>
>
--
Rowena Wake
Institutional Repository Administrator
Libraries and Cultural Resources
University of Calgary
Phone: (403) 210-6753
Email: rwake at ucalgary.ca