[Dspace-general] hacking dspace: harvests, caches, and services

Eric Lease Morgan emorgan at nd.edu
Thu Aug 17 12:45:10 EDT 2006


The purpose of this posting is to share with the DSpace community
some of the ways we have been "hacking" DSpace, ETD-db, and DigiTool
for the sake of open access publishing. Many of the things I outlined
a number of months ago are coming to fruition.

DSpace, ETD-db, and DigiTool are institutional repository-like
applications. Each of these applications has its own strengths and
weaknesses, but the biggest problems facing their implementation here
at Notre Dame include:

   * they are "applications" not libraries/modules
   * they will always have a particular look & feel
   * they operate as "information silos"
   * they implement specific searching/browsing interfaces

To overcome these issues, as well as in an effort to improve upon the  
functionality of an institutional repository, we here at Notre Dame  
have hacked (and we mean that in a good way) each of these  
applications to create something different. Specifically, we have  
exploited OAI-PMH to first harvest and cache the content from each  
application, and then we provide services against the cache. Some of  
these processes are itemized and described below.

Be forewarned. The things listed below are implemented in a
"sandbox". Response times will be slow, the links will change, and
your mileage will vary.


0. Enhanced Dublin Core - For better or for worse, we created sets of  
facet/term combinations (think "subjects") and inserted them into  
DSpace fields. When we harvest the content we note its physical  
structure, parse it accordingly, and cache it in a specific place. A  
good record to see how some of this has been implemented is here:

   http://dspace.library.nd.edu:8080/dspace/handle/2305/142?mode=full&submit_simple=Show+full+item+record
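
The idea can be sketched in a few lines of Python. The " -- "
delimiter below is an assumption made for illustration; the exact
encoding of the facet/term combinations inside the DSpace fields is
not spelled out in this posting.

```python
# Sketch: split facet/term combinations (think "subjects") out of
# Dublin Core subject values. The " -- " delimiter is assumed here
# for illustration only.

def parse_facet_terms(subjects):
    """Return (facet, term) tuples from qualified subject strings."""
    pairs = []
    for value in subjects:
        if " -- " in value:
            facet, term = value.split(" -- ", 1)
            pairs.append((facet.strip(), term.strip()))
    return pairs

print(parse_facet_terms(["Formats -- Theses", "open access"]))
# prints [('Formats', 'Theses')]; unqualified values are ignored
```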


1. Harvest & cache - Each application has an OAI-PMH interface. We  
use Net::OAI::Harvester to harvest the Dublin Core metadata from each  
of the applications and cache the content in a (MyLibrary) database.  
As the content is retrieved we do a (tiny) bit of normalization, but  
we also supplement the metadata with facet/term combinations  
describing where the content came from, what format it is, and some  
sort of subject/descriptor.
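
In our implementation the harvesting is done with
Net::OAI::Harvester, a Perl module; the Python sketch below shows the
same harvest-and-cache idea using only the standard library. The
sample record and the supplemental facet/term pairs are made up for
illustration.

```python
# Sketch of the harvest-and-cache step: parse an OAI-PMH ListRecords
# response, normalize lightly, and supplement each record with
# facet/term pairs describing provenance. Illustrative only; the
# actual work is done with Net::OAI::Harvester (Perl).
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"
OAI = "{http://www.openarchives.org/OAI/2.0/}"

def parse_list_records(xml_text, extra_facets):
    """Pull Dublin Core out of a ListRecords response and tack on
    facet/term pairs describing where the content came from."""
    root = ET.fromstring(xml_text)
    cached = []
    for record in root.iter(OAI + "record"):
        dc = {}
        for elem in record.iter():
            if elem.tag.startswith(DC) and elem.text:
                field = elem.tag[len(DC):]
                # a (tiny) bit of normalization: collapse whitespace
                dc.setdefault(field, []).append(" ".join(elem.text.split()))
        dc["facets"] = list(extra_facets)
        cached.append(dc)
    return cached

sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
 <ListRecords>
  <record>
   <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
     <dc:title>A  Thesis</dc:title>
     <dc:creator>Morgan, Eric Lease</dc:creator>
    </oai_dc:dc>
   </metadata>
  </record>
 </ListRecords>
</OAI-PMH>"""

records = parse_list_records(sample, [("Sources", "DSpace")])
print(records[0]["title"])  # prints ['A Thesis']
```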


2. Name authority - For a sub-set of our implementation, the  
Excellent Undergraduate Research, we needed to include photographs  
and short biographies of students in browsable displays. To  
accomplish this we first created facet/term combinations (name  
authorities) in our database. These authorities defined a key which  
pointed to a directory containing a JPEG image and more detailed  
information regarding the author. By combining these facet/terms, the  
files in the directory, and the records in DSpace we are able to  
create a browsable list of authors complete with pictures and bios.  
For example, see:

   http://dewey.library.nd.edu/morgan/idr/undergrad/?cmd=authors
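
A minimal sketch of the lookup, assuming a directory-per-author
layout; the file names and paths here are illustrative, not our
actual layout:

```python
# Sketch of the name-authority lookup: each authority term carries a
# key naming a directory that holds a photo and a biography. Paths
# and file names below are assumptions for illustration.
import os

def author_entry(name, key, root="authorities"):
    """Resolve an authority key to its photo and biography files."""
    directory = os.path.join(root, key)
    return {
        "name": name,
        "photo": os.path.join(directory, "photo.jpg"),
        "bio": os.path.join(directory, "bio.txt"),
    }

def browsable_list(authorities):
    """Build the author browse list, sorted by name for display."""
    return sorted((author_entry(n, k) for n, k in authorities),
                  key=lambda e: e["name"])
```

Combining these entries with the matching records harvested from
DSpace yields the browsable list of authors with pictures and bios.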


3. Standards-based search - Each application provides search, but not
in a "standard" way, nor against each other. By indexing our cache
and providing an SRU interface to the index we can overcome these
problems. Moreover, using such an approach we can swap out our
indexer at will. We began using swish-e as our indexer. We moved to
Plucene (slow), and we will probably switch again. Fun:

   http://dewey.library.nd.edu/morgan/idr/sru/client.html
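
SRU is just HTTP plus a few well-defined parameters, which is what
makes swapping indexers behind it possible. A sketch of building a
searchRetrieve request follows; the base URL is a placeholder, since
the posting links only the client page, not the endpoint itself.

```python
# Sketch: build a standard SRU 1.1 searchRetrieve URL for a CQL
# query. The base URL is a placeholder, not our actual endpoint.
from urllib.parse import urlencode

def sru_url(base, query, maximum_records=10):
    """Return a searchRetrieve URL carrying the given CQL query."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": query,
        "maximumRecords": maximum_records,
    }
    return base + "?" + urlencode(params)

print(sru_url("http://example.edu/sru", 'dc.title = "open access"'))
```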


4. Syndicating content with "widgets" - The name of the game is, "Put
your content where the users are; don't expect the users to come to
your site." We created a number of one-line JavaScript "widgets"
allowing Web masters to insert content from our institutional
repository into their pages. For example, the Aerospace Engineering
department might want to list the recently "published" theses and
dissertations, or an author might want to do something similar. See:

   http://dewey.library.nd.edu/morgan/idr/widgets/widget-03.html
   http://dewey.library.nd.edu/morgan/idr/etd/widgets/widget-03.html
   http://dewey.library.nd.edu/morgan/idr/etd/?cmd=widgets
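
One way such a one-line widget can work is for the server to answer a
<script src="..."> request with document.write() calls that paint the
titles into the host page. The Python sketch below generates such a
response; it illustrates the pattern and is not our actual widget
code.

```python
# Sketch: a server-side generator for a one-line JavaScript widget.
# The response body is JavaScript; json.dumps provides safe string
# escaping for the titles. Illustrative only.
import json

def widget_js(titles):
    """Return a JavaScript body that writes a list of recent titles."""
    lines = ["document.write('<ul class=\"idr-widget\">');"]
    for title in titles:
        lines.append("document.write('<li>' + %s + '</li>');" % json.dumps(title))
    lines.append("document.write('</ul>');")
    return "\n".join(lines)
```

A Web master then needs only one line in their page, something like
<script src="http://.../widget.js"></script>, to pull in the list.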


5. Pseudo peer-review - Faculty want to be recognized (and cited) by  
their peers. Right now the standard for this practice is publication  
in peer-reviewed journals and citation counts from ISI. Google is a  
trusted resource. ("If you don't believe me, then why do you use it  
so frequently?") As more and more content is made available on the  
Web things like Google PageRank and links from remote documents may  
supplement peer review and citation counts. Using the Google APIs it is  
possible to retrieve a PageRank and lists of linking documents, and  
this has been implemented against some of the browsable lists in our  
ETD collection. See:

   http://dewey.library.nd.edu/morgan/idr/etd/?cmd=term&id=25
   http://dewey.library.nd.edu/morgan/idr/etd/?cmd=term&id=50

Unfortunately, all of our documents have PageRanks of 0, 3, or 4, and
all of the linking documents come from within our own domain. Over
time I think this may change.


6. Thumbnail browsing - We are using DigiTool to store images for art  
history classes. Not only is this content fraught with copyright  
issues, but the content is based on pictures, not words. As content  
is harvested from DigiTool's (non-standard) OAI-PMH interface we are  
able to "calculate" the location of thumbnail images on the remote  
server. Consequently we are able to provide (standard) search  
interfaces to the content and display pictures of hits as well as  
descriptions.
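
The "calculation" amounts to deriving a URL from each harvested
identifier. The pattern below is invented for illustration; this
posting does not describe DigiTool's actual path scheme.

```python
# Sketch: "calculate" a remote thumbnail location from a harvested
# OAI identifier. Both the base URL and the path pattern are
# assumptions made for illustration.

def thumbnail_url(identifier, base="http://example.edu/digitool/thumbnail"):
    """Derive a thumbnail URL from an identifier such as
    'oai:digitool.example.edu:12345'."""
    pid = identifier.rsplit(":", 1)[-1]
    return "%s/%s.jpg" % (base, pid)
```

With a URL like this in hand, search results can show a picture of
each hit next to its description.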


7. Batch loading - One problem with institutional repositories is  
getting content. It is hard to acquire and hard to key in. To get  
around this problem we searched our bibliographic indexes for things  
written by local authors. We saved these records to EndNote, a  
bibliographic citation manager. We then exported the EndNote content  
as an XML file, parsed the file, and saved the results in directories  
importable by DSpace. After importing we were able to cache the  
content and provide browsable interfaces against it. Using this  
technique it took two people less than one day to import more than  
600 citations. For example:

   http://dewey.library.nd.edu/morgan/idr/?cmd=term&id=43331
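
The last step writes each parsed citation into the directory layout
DSpace's batch item importer expects: one directory per item holding
a dublin_core.xml, plus a contents file listing bitstreams (empty
here, since these are metadata-only records). The citation tuples
below stand in for fields parsed out of the EndNote XML export.

```python
# Sketch: write one item directory in DSpace's simple archive layout
# (dublin_core.xml plus a contents file). The citation structure is
# illustrative; real input would come from the parsed EndNote XML.
import os
from xml.sax.saxutils import escape

def write_import_dir(citation, root, item_number):
    """Write a single importable item directory and return its path."""
    item_dir = os.path.join(root, "item_%03d" % item_number)
    os.makedirs(item_dir, exist_ok=True)
    rows = ["<dublin_core>"]
    for element, value in citation:
        rows.append('  <dcvalue element="%s" qualifier="none">%s</dcvalue>'
                    % (element, escape(value)))
    rows.append("</dublin_core>")
    with open(os.path.join(item_dir, "dublin_core.xml"), "w") as fh:
        fh.write("\n".join(rows))
    # empty contents file: metadata-only records, no bitstreams
    open(os.path.join(item_dir, "contents"), "w").close()
    return item_dir
```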


There are a few other "kewl" features of our implementation, but the  
ones outlined above give you the gist of where we are going and what  
is possible as long as you have the data. "I don't need the  
interface; just give me the data." Our experiments have not been 100  
percent successful. We still have some problems with controlled  
vocabularies, normalization, scalability, getting content in the  
first place, priority setting, and links to the "real" content as  
opposed to "splash" screens. Despite these issues, we believe things
are moving forward and in the right direction.

Fun with institutional repositories.

--
Eric




