[Dspace-general] [Fwd: Preserving structured collections - major DSpace change for Collection object]

Wed Feb 28 11:06:22 EST 2007

Stephane,

Have you looked at the most recent Manakin development? I think you
will find it advantageous to know that it uses a combination of
METS and a home-cooked format called DRI to represent the content of
any rendered DSpace Object. For instance, on our current "Under the
Dome" beta project, there is an example of the content just prior to
transformation to HTML:

https://dome.mit.edu/handle/123456789/4190?XML

and after:

https://dome.mit.edu/handle/123456789/4190

This may give you some traction on the METS side. We've
encountered a similar issue with the idea of FRBR-style VRACore, where
there is a hierarchy of Collections/Works/Images. Let me give you
some detail on our initial solution and what I think are its benefits
and shortcomings.

VRACore basically looks like this:

Collection
       |_ Work
                 |_ Image

We've only used metadata from the Work and Image, and it can be seen
"flattened" into our DC record for an item (like that shown above),
so Item = Image + "Duplicate Work" record. This is nice in that it
initially shoe-horns the RVC/IRIS VRACore metadata into the DSpace
item metadata record, and it allows some search and browse
functionality across the Images. But the "flattening" is also a
downfall: it's difficult to get back to the hierarchical richness of
the original metadata, and it's difficult to get that expressed in the
UI in a way that is acceptable for Search and Browse. Navigate
through the above example a little and you'll soon see that some of
the Search/Browse functionality is "unwieldy" because it's not "Work"
centric, it's Image centric. DSpace is not going to give the sort of
Browse capability that is required to navigate the VRACore model
hierarchy. (So we're working on a solution to that issue...)
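
To make the flattening concrete, here is a minimal sketch in Python. It
is not our import code: the field names (work.id, work.title,
image.title) and the sample values are hypothetical stand-ins, not
VRACore elements. It shows each Image record receiving a copy of the
Work fields, and the regrouping step that a Work-centric browse would
then have to perform.

```python
from collections import defaultdict

def flatten(work, images):
    """Return one flat record per image, each carrying a copy of the work fields."""
    return [{**work, **image} for image in images]

def group_by_work(items, key="work.id"):
    """Rebuild Work-centric groups from flat items (what a browse would need)."""
    groups = defaultdict(list)
    for item in items:
        groups[item[key]].append(item["image.title"])
    return dict(groups)

# Hypothetical sample data, for illustration only.
work = {"work.id": "w1", "work.title": "Example Work"}
images = [{"image.title": "View 1"}, {"image.title": "View 2"}]

items = flatten(work, images)
print(items[0])
print(group_by_work(items))
```

Note that the Work fields are stored once per Image, which is exactly
the duplication described above.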

We are now at a point where we want to actually capture all the
metadata and keep it as rich and hierarchical as VRACore is. So
we've been considering the following options, and you can take them as
a "case study".

1.) It would be wise to keep the hierarchy expressed in one's metadata
as independent as possible from the DSpace Data Model. More
specifically, Communities and Collections are not good for expressing
complex hierarchical relationships at this time, and they would be a
nightmare to manage if that hierarchy grows large. We've experienced
this on DSpace at MIT even without that sort of mapping (and I can only
imagine the horror if we did try to do it). In DSpace, the
"Communities/Collections" model currently does not scale and will
seriously impact the performance of your system if it grows too large.

2.) Our primary question for the RVC collection VRACore metadata
was/is: do we create "DSpace Items" for "Works" independently from
"Images"? The initial decision was "no", and so we've flattened the
metadata into one DSpace Item, duplicating the Work metadata in each
Item. I now think that wasn't a good idea and it would have been
wiser to create Items for VRACore Collections and Works separately
from Images. This would allow for "Types" of DSpace Items and
establish an orthogonal and very independent metadata model layered
across DSpace Items.
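
As a rough illustration of option 2, the sketch below models
Collections, Works and Images as separate typed records joined by
parent links, with inherited metadata resolved at read time instead of
being copied into every Image. This is plain Python, not the DSpace
API: the Record class, its fields and the sample values are all
assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Record:
    identifier: str
    kind: str                      # "collection", "work" or "image"
    metadata: Dict[str, str] = field(default_factory=dict)
    parent: Optional[str] = None   # identifier of the parent record, if any

def ancestors(by_id: Dict[str, Record], identifier: str) -> List[Record]:
    """Follow parent links from a record up to the root, nearest first."""
    chain = []
    parent_id = by_id[identifier].parent
    while parent_id is not None:
        chain.append(by_id[parent_id])
        parent_id = by_id[parent_id].parent
    return chain

def resolved_metadata(by_id: Dict[str, Record], identifier: str) -> Dict[str, str]:
    """Merge ancestor metadata at read time; the record's own fields win."""
    merged: Dict[str, str] = {}
    for record in reversed(ancestors(by_id, identifier)):
        merged.update(record.metadata)
    merged.update(by_id[identifier].metadata)
    return merged

# Hypothetical sample data, for illustration only.
records = [
    Record("c1", "collection", {"collection": "Example Collection"}),
    Record("w1", "work", {"title": "Example Work"}, parent="c1"),
    Record("i1", "image", {"title": "View 1"}, parent="w1"),
]
by_id = {r.identifier: r for r in records}
print([r.kind for r in ancestors(by_id, "i1")])
print(resolved_metadata(by_id, "i1"))
```

The Work metadata lives in one place here, and the hierarchy is carried
by the links between records, not by Communities and Collections.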

So, my recommendations:

A.) Keep your metadata hierarchy "orthogonal" to the DSpace Object
Model's hierarchy (no matter if we are talking about EAD, METS,
VRACore or other hierarchical/network metadata standards). It will
be much more "transportable" and you will be able to retain it in its
entire hierarchical richness.

B.) Capture as much as you can in the native XML format, keeping it
stored in Bitstreams within the Item. Map/flatten as much of that
metadata as you can up into the Item's flat DC-style metadata record,
but think of that record only as the first stage in a "two-stage"
mapping/indexing process that creates the Browse and Search indexes
in DSpace.

C.) Do try to capture the relationships between your XML metadata
files via their referencing/linking strategy. Try to map those into the
dc:related... Qualified DC fields wherever possible.
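
Recommendations B and C can be sketched together as follows. This is
only an illustration under stated assumptions: the XML below is a
made-up stand-in and not real VRACore, the record layout is
hypothetical, and the ("relation", "ispartof") pair is my guess at a
suitable relation-style qualified DC field. Stage one keeps the native
XML whole next to a flat field list; stage two builds a simple lookup
index from those flat fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical native XML; element names are illustrative only.
SAMPLE_XML = """<work id="w1">
  <title>Example Work</title>
  <image id="i1"><title>View 1</title></image>
  <image id="i2"><title>View 2</title></image>
</work>"""

def stage_one(xml_text):
    """Flatten native XML into per-record lists of (element, qualifier, value)."""
    root = ET.fromstring(xml_text)
    work_id = root.get("id")
    records = {
        work_id: {
            "bitstream": xml_text,  # native XML kept whole
            "fields": [("title", None, root.findtext("title"))],
        }
    }
    for image in root.findall("image"):
        records[image.get("id")] = {
            "bitstream": ET.tostring(image, encoding="unicode"),
            "fields": [
                ("title", None, image.findtext("title")),
                ("relation", "ispartof", work_id),  # link to the Work, not a copy of it
            ],
        }
    return records

def stage_two(records):
    """Index (field name, value) pairs to the ids of the records carrying them."""
    index = {}
    for record_id, record in records.items():
        for element, qualifier, value in record["fields"]:
            key = element if qualifier is None else f"{element}.{qualifier}"
            index.setdefault((key, value), []).append(record_id)
    return index

records = stage_one(SAMPLE_XML)
index = stage_two(records)
print(index[("relation.ispartof", "w1")])
```

Because the native XML is retained alongside the flat fields, the flat
record can be regenerated or remapped later without losing the original
hierarchy.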

Cheers,
Mark Diggory

p.s. I need to acknowledge that when I say "we", the majority of
the effort behind this body of work is being accomplished by Carl
Jones and Ann Whiteside here at MIT Libraries.

On Feb 28, 2007, at 9:42 AM, Tellier, Stephane wrote:

> Hi,
>
> We have considered the METS metadata representation, along with its
> structural features, which I think look similar to EAD. We think
> that METS can resolve almost all of our problems, but we didn't
> find any open source engine like DSpace that is built on METS.
>
> One of our possible solutions is to build our own METS engine that
> will sit in "front" of DSpace, being the entry point for any kind of
> search request or OAI-PMH request. Since METS is a structured
> representation of collections using XML files "pointing" to each
> other, forming a kind of hierarchical tree, the idea here would be
> to use XPath to search the XML files, with pointers in those files
> holding handle URLs linked to items (like PDFs) contained in
> DSpace. I know this sounds complicated and it's not my favorite
> solution.
>
> Another solution would consist of modifying the Item object in
> DSpace so that it would be possible to create folders and
> sub-folders in that Item (unless it's already possible and I just
> haven't seen it yet!). This could be very interesting since the
> metadata is related to Items, so it would be possible for us to
> build a "structure" for our periodicals, although it would certainly
> need another change so that a search result could display in
> which of the Item's documents the text was found (including the
> folder path of the document).
>
> We are also considering the possibility of inserting all of the PDFs
> in the same Item, but with an additional XML file that would contain
> the structure information. In the DSpace user interface, that
> additional file would be treated as a "special link" pointing to an
> index HTML file displaying the structure of the documents.
>
> My first email discussed another possible solution: modifying the
> Collection object in DSpace...
>
> That's all for now, but I'm sure that our team will come up
> with other ways. We hope to find something that will spare us from
> making major and complex changes to the DSpace tool, unless it's
> something that could be interesting for other users.
>
> From: joseph greene [mailto:joseph.greene at ucd.ie]
> Sent: Wed 28/02/2007 5:07 AM
> To: Tellier, Stephane
> Subject: Re: [Fwd: [Dspace-general] Preserving structured  
> collections - major DSpace change for Collection object]
>
> Hello Stephane,
>
> Have you considered using EAD as a metadata representation of your
> collections? It may be a slightly unorthodox use of EAD but it is a
> metadata structure that seems to fit your sub-collections very nicely.
> It allows the xlink language in most tags. It may not solve your
> DSpace issues, but could help from a browsing point of view at least.
>
> On that note, EAD can also have subject headings embedded in it at any
> level within the hierarchy, top to bottom. Lucene could do a good job
> of searching this, along with a parser, which could help you reduce
> replication of data.
>
> Interestingly enough, our project has tentatively decided to include
> subject headings from each collection and 'sub-collection' (in our
> case, 'subseries') in every item belonging to that series to do our
> item-level searching -- it sounds a lot like your research.
>
> I look forward to further discussion on this topic on the list.
>
> Joseph Greene
>
> Irish Virtual Research Library and Archive
> http://www.ucd.ie/ivrla
>
>
> ----- Original Message -----
> From: John McDonough <john.mcdonough at ucd.ie>
> Date: Tuesday, February 27, 2007 2:41 pm
> Subject: [Fwd: [Dspace-general] Preserving structured collections -
> major DSpace    change for Collection object]
> To: joseph greene <joseph.greene at ucd.ie>, Adele  
> <adele.cocchiglia at ucd.ie>
>
> > FYI!
> >
> > J
> >
> > -------- Original Message --------
> > Subject:      [Dspace-general] Preserving structured collections -
> > major DSpace change for Collection object
> > Date:         Tue, 27 Feb 2007 09:11:47 -0500
> > From:         Tellier, Stephane <stephane.tellier at cgi.com>
> > To:   DSpace-tech at lists.sourceforge.net
> > CC:   dspace-general at mit.edu
> >
> >
> >
> > Hi all,
> >
> > In our project, in which we have to implement a DSpace solution,
> > we're currently facing a major problem that may also concern other
> > people working in libraries.
> > We need to submit and preserve periodicals in DSpace in a
> > structured form. Example:
> >
> > Times magazine
> >            |_____________1990
> >                                       |__________jan.pdf
> >                                       |__________feb.pdf
> >                                       |__________...
> >                                       |__________dec.pdf
> >            |_____________1991
> >                                       |__________...
> >            |_____________...
> >            |_____________2006
> >                                       |__________...
> >
> > In our library, the main database for metadata is a catalog. An
> > item can have a "note" in this catalog, and this note holds some
> > descriptive metadata.
> > In the example above, the Times magazine collection, while
> > containing many PDF items, would have only 1 note in our catalog.
> > That means, after the transfer from the catalog to DSpace, that the
> > DSpace Collection representing the magazine should ideally be the
> > only object that contains the metadata, because we don't want to
> > repeat that metadata for each of the DSpace Items holding the PDF
> > files in the whole Collection. This is for performance reasons,
> > because we have some collections with thousands of PDFs (like a
> > newspaper more than 100 years old with a PDF for each day).
> >
> > For our team, that means we are currently considering the solution
> > of making a big change to DSpace so that:
> > 1) a collection can have sub-collections (same idea here as
> > Communities);
> > 2) a collection can be mapped to the metadata schema and therefore
> > be considered as an "Item", so that its metadata would be indexed
> > in the same way. The collection would then be searchable through
> > the dc fields (for example). In that case, if we make a search and
> > it gives one of the items holding a PDF as a result (full-text
> > indexed PDF), we would get the dc metadata from its "parent"
> > collection, instead of having it in the item's record.
> >
> > Does anyone here have the same needs, or has anyone begun some work
> > on it? We consider that this could be a very useful add-on for
> > DSpace, covering almost any kind of digital collection. However, we
> > know that this will not be a simple modification...
> >
>

~~~~~~~~~~~~~
Mark R. Diggory - DSpace Systems Manager
MIT Libraries, Systems and Technology Services
Massachusetts Institute of Technology
