[Dspace-general] hacking dspace: harvests, caches, and services
Eric Lease Morgan
emorgan at nd.edu
Thu Aug 17 12:45:10 EDT 2006
The purpose of this posting is to share with the DSpace community
some of the ways we have been "hacking" DSpace, ETD-db, and DigiTool
for the sake of open access publishing. Many of the things I outlined
a number of months ago are coming to fruition.
DSpace, ETD-db, and DigiTool are institutional repository-like
applications. Each of these applications has its own strengths and
weaknesses, but the biggest problems facing their implementation here
at Notre Dame include:
* they are "applications" not libraries/modules
* they will always have a particular look & feel
* they operate as "information silos"
* they implement specific searching/browsing interfaces
To overcome these issues, as well as in an effort to improve upon the
functionality of an institutional repository, we here at Notre Dame
have hacked (and we mean that in a good way) each of these
applications to create something different. Specifically, we have
exploited OAI-PMH to first harvest and cache the content from each
application, and then we provide services against the cache. Some of
these processes are itemized and described below.
Be forewarned. The things listed below are implemented in a
"sandbox". Response times will be slow, the links will change, and
your mileage will vary.
0. Enhanced Dublin Core - For better or for worse, we created sets of
facet/term combinations (think "subjects") and inserted them into
DSpace fields. When we harvest the content we note its physical
structure, parse it accordingly, and cache it in a specific place. A
good record to see how some of this has been implemented is here:
http://dspace.library.nd.edu:8080/dspace/handle/2305/142?mode=full&submit_simple=Show+full+item+record
1. Harvest & cache - Each application has an OAI-PMH interface. We
use Net::OAI::Harvester to harvest the Dublin Core metadata from each
of the applications and cache the content in a (MyLibrary) database.
As the content is retrieved we do a (tiny) bit of normalization, but
we also supplement the metadata with facet/term combinations
describing where the content came from, what format it is, and some
sort of subject/descriptor.
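As a rough illustration of the harvest step, here is a minimal Python
sketch (not the Perl/Net::OAI::Harvester code described above). It
parses an OAI-PMH ListRecords response, pulls out the Dublin Core
fields, does a tiny bit of whitespace normalization, and attaches
facet/term pairs for source and format. The sample XML, the function
name parse_records, and the facet names "source" and "format" are
assumptions made for the example; fetching the response over HTTP and
writing to the database are left out.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and by Dublin Core.
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical response body; a real one would come from a repository's
# OAI-PMH endpoint (verb=ListRecords, metadataPrefix=oai_dc).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:2305/142</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>  A Sample   Thesis </dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:format>application/pdf</dc:format>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text, source):
    """Return one dict per harvested record, ready to be cached."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI + "record"):
        identifier = rec.findtext(OAI + "header/" + OAI + "identifier")
        fields = {}
        for el in rec.iter():
            if el.tag.startswith(DC) and el.text and el.text.strip():
                name = el.tag[len(DC):]
                # Tiny normalization: collapse runs of whitespace.
                fields.setdefault(name, []).append(" ".join(el.text.split()))
        # Supplement the metadata with facet/term pairs (assumed names).
        facets = [("source", source)]
        facets += [("format", value) for value in fields.get("format", [])]
        records.append({"id": identifier, "dc": fields, "facets": facets})
    return records


records = parse_records(SAMPLE, "dspace")
for record in records:
    print(record["id"], record["dc"].get("title"), record["facets"])
```

Keeping the facet/term pairs separate from the Dublin Core fields
means the same parsing code can serve all three applications; only the
source label changes.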
2. Name authority - For a sub-set of our implementation, the
Excellent Undergraduate Research, we needed to include photographs
and short biographies of students in browsable displays. To
accomplish this we first created facet/term combinations (name
authorities) in our database. These authorities defined a key which
pointed to a directory containing a JPEG image and more detailed
information regarding the author. By combining these facet/terms, the
files in the directory, and the records in DSpace we are able to
create a browsable list of authors complete with pictures and bios.
For example, see:
http://dewey.library.nd.edu/morgan/idr/undergrad/?cmd=authors
3. Standards-based search - Each application provides search, but not
in a "standard" way, nor against each other. By indexing our cache
and providing an SRU interface to the index we can overcome these
problems. Moreover, using such an approach we can swap out our
indexer at will. We began using swish-e as our indexer. We moved to
Plucene (slow), and we will probably switch again. Fun:
http://dewey.library.nd.edu/morgan/idr/sru/client.html
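Because the index sits behind SRU, any client that can build a
searchRetrieve URL can query it, no matter which indexer is running
underneath. A minimal Python sketch of such a request follows. The
base URL, the function name sru_url, and the choice of SRU version 1.1
are assumptions for illustration; the parameter names themselves
(operation, version, query, startRecord, maximumRecords) come from the
SRU specification.

```python
from urllib.parse import urlencode


def sru_url(base_url, cql_query, start=1, maximum=10):
    """Build an SRU searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",          # assumed version
        "query": cql_query,        # CQL, e.g. dc.title = "open access"
        "startRecord": start,
        "maximumRecords": maximum,
    }
    return base_url + "?" + urlencode(params)


# Placeholder endpoint, not the sandbox server.
url = sru_url("http://example.org/sru", 'dc.title = "open access"')
print(url)
```

The response is XML, so the same client keeps working after the
indexer is swapped out.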
4. Syndicating content with "widgets" - The name of the game is, "Put
your content where the users are; don't expect the users to come to
your site." We created a number of one-line JavaScript "widgets"
allowing Web masters to insert content from our institutional
repository into their pages. For example, the Aerospace Engineering
department might want to list the recently "published" theses/
dissertations, or an author might want to do something similar. See:
http://dewey.library.nd.edu/morgan/idr/widgets/widget-03.html
http://dewey.library.nd.edu/morgan/idr/etd/widgets/widget-03.html
http://dewey.library.nd.edu/morgan/idr/etd/?cmd=widgets
5. Pseudo peer-review - Faculty want to be recognized (and cited) by
their peers. Right now the standard for this practice is publication
in peer-reviewed journals and citation counts from ISI. Google is a
trusted resource. ("If you don't believe me, then why do you use it
so frequently?") As more and more content is made available on the
Web things like Google PageRank and links from remote documents may
supplement peer-review and citation counts. Using Google APIs it is
possible to retrieve a PageRank and lists of linking documents, and
this has been implemented against some of the browsable lists in our
ETD collection. See:
http://dewey.library.nd.edu/morgan/idr/etd/?cmd=term&id=25
http://dewey.library.nd.edu/morgan/idr/etd/?cmd=term&id=50
Unfortunately, all of our documents have PageRanks of 0, 3, or 4, and
all of the linking documents come from within our own domain. Over
time I think this may change.
6. Thumbnail browsing - We are using DigiTool to store images for art
history classes. Not only is this content fraught with copyright
issues, but the content is based on pictures, not words. As content
is harvested from DigiTool's (non-standard) OAI-PMH interface we are
able to "calculate" the location of thumbnail images on the remote
server. Consequently we are able to provide (standard) search
interfaces to the content and display pictures of hits as well as
descriptions.
7. Batch loading - One problem with institutional repositories is
getting content. It is hard to acquire and hard to key in. To get
around this problem we searched our bibliographic indexes for things
written by local authors. We saved these records to EndNote, a
bibliographic citation manager. We then exported the EndNote content
as an XML file, parsed the file, and saved the results in directories
importable by DSpace. After importing we were able to cache the
content and provide browsable interfaces against it. Using this
technique it took two people less than one day to import more than
600 citations. For example:
http://dewey.library.nd.edu/morgan/idr/?cmd=term&id=43331
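The parse-and-save step of the batch load might look something like
the following Python sketch. It reads a simplified EndNote XML export
and writes one directory per citation in the layout DSpace's batch
importer reads (a dublin_core.xml file plus a contents file). The
sample input, the function names, the item_001 directory naming, and
the mapping of EndNote fields to Dublin Core elements are all
assumptions for the example, not the script actually used.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Simplified, hypothetical EndNote export. Real exports carry many more
# fields and often wrap text in <style> elements.
ENDNOTE = """<xml><records>
  <record>
    <contributors><authors><author>Doe, Jane</author></authors></contributors>
    <titles><title>Fun with repositories</title></titles>
    <dates><year>2005</year></dates>
  </record>
</records></xml>"""

# EndNote path -> (Dublin Core element, qualifier); an assumed mapping.
FIELD_MAP = [
    ("titles/title", "title", "none"),
    ("contributors/authors/author", "contributor", "author"),
    ("dates/year", "date", "issued"),
]


def text_of(element):
    """Collect text from an element, including any nested <style> tags."""
    return " ".join("".join(element.itertext()).split())


def export_items(endnote_xml, out_dir):
    """Write one importable item directory per EndNote record."""
    root = ET.fromstring(endnote_xml)
    item_dirs = []
    for number, record in enumerate(root.iter("record"), start=1):
        item_dir = os.path.join(out_dir, f"item_{number:03d}")
        os.makedirs(item_dir, exist_ok=True)
        dc = ET.Element("dublin_core")
        for path, element, qualifier in FIELD_MAP:
            for found in record.findall(path):
                node = ET.SubElement(
                    dc, "dcvalue", {"element": element, "qualifier": qualifier}
                )
                node.text = text_of(found)
        ET.ElementTree(dc).write(
            os.path.join(item_dir, "dublin_core.xml"),
            encoding="utf-8",
            xml_declaration=True,
        )
        # Empty contents file: a citation-only item with no attached
        # files (an assumption about how such items are loaded).
        open(os.path.join(item_dir, "contents"), "w").close()
        item_dirs.append(item_dir)
    return item_dirs


with tempfile.TemporaryDirectory() as tmp:
    for item_dir in export_items(ENDNOTE, tmp):
        print(sorted(os.listdir(item_dir)))
```

The resulting directories can then be handed to the DSpace importer,
after which the harvest-and-cache process picks the records up like
any others.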
There are a few other "kewl" features of our implementation, but the
ones outlined above give you the gist of where we are going and what
is possible as long as you have the data. "I don't need the
interface; just give me the data." Our experiments have not been 100
percent successful. We still have some problems with controlled
vocabularies, normalization, scalability, getting content in the
first place, priority setting, and links to the "real" content as
opposed to "splash" screens. Despite these issues, we believe things
are moving forward and in the right direction.
Fun with institutional repositories.
--
Eric