[Dspace-general] DSpace 1.2 Feature Descriptions

Tue Oct 21 14:40:46 EDT 2003

As promised, what follows are descriptions of several of the proposed
new features for DSpace 1.2. Please post comments and suggestions to
this list: we welcome your input!

Content Thumbnail Support

In DSpace's item display, graphical content is currently described only
by text.  Users would like to see thumbnails, a more intuitive
representation.  Clicking on the thumbnail would then deliver the
content to the user.  Users may want to supply their own thumbnails, but
the vast majority would probably prefer to have the system generate
thumbnails for them.

There will probably be four methods of adding thumbnails to an item:
the system could generate them automatically during the submission
workflow, a batch process accessing the repository could add thumbnail
bundles to items lacking them, users could submit thumbnails along
with the item (highly unlikely!), or the system could generate
thumbnails dynamically.  To integrate thumbnail generation into a
workflow, DSpace needs to be able to take advantage of all of the tools
available to generate thumbnails, and many are not written in Java.
Many of these tools are also specific to certain types of content - some
are image tools, some are PDF utilities, and some are even scripts tying
multiple tools together.
A 'plug-in' architecture could handle these scenarios well:  a group of
classes wrap whatever tools are needed to generate thumbnails, and
register for certain types of submitted content. As content is
submitted, its type is checked and it is handed off to the
classes that want to process submitted content of that type.
A batch tool could also be created that used the same registry of
content handlers, looking for items without thumbnails, and then
attempting to create thumbnails when possible. 
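
As a rough illustration of the registry idea above, here is a minimal
Java sketch.  The names ContentHandler, HandlerRegistry, register and
dispatch are hypothetical and not part of DSpace.  It assumes content
types are identified by MIME type strings; a real handler would wrap an
external tool rather than return bytes directly.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plug-in interface (illustrative only, not a DSpace API).
interface ContentHandler {
    /** Returns derived content (e.g. thumbnail bytes), or null if none could be produced. */
    byte[] process(byte[] content);
}

// Hypothetical registry mapping a content type to the handlers registered for it.
class HandlerRegistry {
    private final Map<String, List<ContentHandler>> handlersByType = new HashMap<>();

    /** Registers a handler for one content type (assumed here to be a MIME type string). */
    void register(String contentType, ContentHandler handler) {
        handlersByType.computeIfAbsent(contentType, k -> new ArrayList<>()).add(handler);
    }

    /** Hands the content to every handler registered for its type and collects the results. */
    List<byte[]> dispatch(String contentType, byte[] content) {
        List<byte[]> results = new ArrayList<>();
        List<ContentHandler> handlers =
                handlersByType.getOrDefault(contentType, Collections.emptyList());
        for (ContentHandler handler : handlers) {
            byte[] derived = handler.process(content);
            if (derived != null) {
                results.add(derived);
            }
        }
        return results;
    }
}
```

Both the submission workflow and a batch tool could call dispatch on
the same registry, so the set of supported content types would live in
one place.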

Storing thumbnails shouldn't involve too many changes to the system. 
(None, if the thumbnails are generated dynamically.)  DSpace was
architected with alternate views of content in mind, where items can
have multiple bundles, each containing a different representation of the
item's content.  A thumbnail could simply be another bundle within the
item.  The item display page could look for a thumbnail bundle
containing images for each bitstream in the primary bundle, and then
display the thumbnail next to the file name.  (We may need a type field
to distinguish the primary bundle from a PDF, thumbnail, or extracted
text bundle.)
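
To make the typed-bundle lookup concrete, here is a small Java sketch.
BundleType, TypedBundle and DisplayItem are hypothetical names, not
DSpace's actual classes.  It also assumes that a thumbnail is matched to
its primary bitstream by sharing the same file name, which the proposal
does not specify.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical type field used to tell bundles apart.
enum BundleType { PRIMARY, THUMBNAIL, FULL_TEXT }

// Hypothetical bundle: a typed group of named bitstreams.
class TypedBundle {
    final BundleType type;
    final Map<String, byte[]> bitstreams = new LinkedHashMap<>();

    TypedBundle(BundleType type) {
        this.type = type;
    }
}

// Hypothetical item holding several bundles, one per representation.
class DisplayItem {
    final List<TypedBundle> bundles = new ArrayList<>();

    /** Returns the first bundle of the given type, or null if the item has none. */
    TypedBundle firstBundleOfType(BundleType type) {
        for (TypedBundle bundle : bundles) {
            if (bundle.type == type) {
                return bundle;
            }
        }
        return null;
    }

    /** Maps each primary file name to its thumbnail bytes, or to null when none exists. */
    Map<String, byte[]> thumbnailsByFileName() {
        Map<String, byte[]> result = new LinkedHashMap<>();
        TypedBundle primary = firstBundleOfType(BundleType.PRIMARY);
        TypedBundle thumbnails = firstBundleOfType(BundleType.THUMBNAIL);
        if (primary == null) {
            return result;
        }
        for (String name : primary.bitstreams.keySet()) {
            // Assumption: a thumbnail carries the same name as its source bitstream.
            result.put(name, thumbnails == null ? null : thumbnails.bitstreams.get(name));
        }
        return result;
    }
}
```

The display page would then show the thumbnail next to each file name
that maps to a non-null entry, and fall back to text for the rest.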

The thumbnail could become an official part of the item, or a flag could
be used to indicate that the thumbnail is an annotation by the system
rather than a part of the original submission.

Full Text Searching

Currently DSpace users can only search the metadata for items - the text
that may be within the content is not searchable.  Users would like to
search the full text of items within DSpace.  It may also be handy for
users to have access to the extracted text for an item, possibly in the
'full' item display.

Our search engine, Lucene, can easily index the full text of items, so
the real challenge is extracting and storing the full text of
submitted items. This problem is remarkably similar to thumbnail
generation: in both cases the system derives a new representation (text
or a thumbnail image) from an item's content.  DSpace's object model
supports different representations of content with bundles - each
bundle stores a representation.

As with thumbnail generation, there are many tools available to extract
text from content; many are not written in Java, and many are specific
to certain types of content.  Again, a plug-in architecture would handle
representation generation well - as content is submitted, classes
registered for the type of that content are invoked and annotate the
item with a full-text bundle, which would then be recognized and indexed
by the search system.  A plug-in architecture would be handy for
integrating with workflows, or as part of a batch process to be run as
part of regular content 'maintenance.'  Again, these bundles may need to
be typed; in this case a 'full text' bundle type would be a hint for the
indexer that it could index the contents of that bundle.

Also like thumbnails, the extracted text for content becomes an official
part of the item, perhaps with a flag to indicate that it is an
annotation by the system and was not part of the original submission.

Items Shared by Multiple Collections

Currently DSpace assumes that items are part of a single collection. 
Users would like to share items between collections, even generating
'virtual' collections that are groupings of items from other
collections.

DSpace's data model supports mapping items to multiple collections, but
the GUI tools do not.  If an item is shared between collections, then
the question arises over who controls it.  One solution is to assign an
owning collection to an item.  The administrator of the owning
collection can modify the item, and assign viewing permissions - other
collection administrators do not have such control - they can only place
a reference to the item in their collection.  Administrators who could
not access an item themselves would of course not be able to reference
the item in their collection.  Since items in multiple collections are
references and not copies, if an item is for some reason removed or
withdrawn, then the references will also appear to be removed or
withdrawn.  One possible future problem to watch for: collection
administrators may want to attach metadata to, or annotate, these
references.
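
The owning-collection rule can be sketched in a few lines of Java.
SharedItem and RepoCollection are hypothetical names, not DSpace's data
model, and the sketch leaves out viewing permissions entirely.  It only
shows that references are not copies, and that only the owning
collection may change the item.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical item with exactly one owning collection.
class SharedItem {
    final String title;
    final RepoCollection owner;
    boolean withdrawn; // package-private so RepoCollection can update it

    SharedItem(String title, RepoCollection owner) {
        this.title = title;
        this.owner = owner;
    }
}

// Hypothetical collection that holds references to items, never copies.
class RepoCollection {
    final String name;
    final Set<SharedItem> references = new LinkedHashSet<>();

    RepoCollection(String name) {
        this.name = name;
    }

    /** Only the owning collection controls an item. */
    boolean canModify(SharedItem item) {
        return item.owner == this;
    }

    /** Withdraws the item; rejected unless this collection owns it. */
    void withdraw(SharedItem item) {
        if (!canModify(item)) {
            throw new IllegalStateException("only the owning collection may withdraw an item");
        }
        item.withdrawn = true;
    }

    /** Lists referenced items that have not been withdrawn. */
    List<SharedItem> visibleItems() {
        List<SharedItem> visible = new ArrayList<>();
        for (SharedItem item : references) {
            if (!item.withdrawn) {
                visible.add(item);
            }
        }
        return visible;
    }
}
```

Because every collection holds the same SharedItem object, withdrawing
it through the owner makes it disappear from every referencing
collection at once.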