[Dspace-general] Persistent Bitstream IDs - Call For Discussion

Mon Sep 22 15:18:08 EDT 2003

	Bitstream Identifiers in DSpace - Call for Discussion

Part of the planning process for enhancements to DSpace is requirements
analysis - obtaining a more precise understanding of the needs of the
adopter community. We have frequently heard articulated the need for
bitstream-level - rather than item-level - handles or other persistent
identifiers. Recall that in DSpace terminology, an 'item' is a logical
construct consisting of a Dublin Core metadata record, and one or more
digital objects (files), each of which is referred to as a 'bitstream'.
In the current design, when an item has exited the workflow, it is
assigned a handle, but the individual bitstreams are not. The rationale for
this - without delving too deeply into digital preservation debates - is
fairly simple: the item represents some sort of usable content, but its
digital expression may vary over time. For instance, a bitstream may need
to be migrated into a physically distinct bitstream of another format
for preservation purposes (e.g. because the original format has become
obsolete): the bitstream has changed, but the item 'content' has not.
The thing that persists is what is given a persistent identifier.  

So why would one need a persistent identifier for an individual bitstream?
This write-up is an attempt to solicit your feedback on the subject. Any
response is welcome - but best of all would be a 'use-case' description:
a concrete set of practices or problems for which a different bitstream
identifier system would provide a solution.

		Disentangling the Issues

There are actually a few distinct but related issues here, and it is
useful to attempt to tease them apart. First, what counts as an identifier
in this context? Bitstreams (files) in DSpace already possess an identifier
of a sort, viz. the URL appearing on the item display page:

	https://hpds1.mit.edu/retrieve/1042/Porter_debate.pdf

This identifier, composed of the server name, the primary key in the
bitstream database table, and the filename assigned upon submission, has
several noteworthy characteristics:

 (1) it is fixed, in that it always refers to the same bitstream
 (2) like all URLs, the identifier is a location reference:
     you can use it to retrieve the bitstream
 (3) it has limited persistence, viz. as long as server name
     and database tables do not change
 (4) one cannot use it to locate or retrieve the 'parent' item
     (its metadata and related bitstreams)
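
To make the three components concrete, here is a minimal Python sketch that
splits the example URL above into server name, bitstream key, and filename.
The names BitstreamRef and parse_bitstream_url are hypothetical and are not
part of DSpace; the sketch assumes only the /retrieve/<key>/<filename> layout
shown in the example.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class BitstreamRef(NamedTuple):
    """Hypothetical container for the three parts of a bitstream URL."""
    server: str
    bitstream_key: int
    filename: str


def parse_bitstream_url(url: str) -> BitstreamRef:
    """Split a URL of the form https://<server>/retrieve/<key>/<filename>."""
    parts = urlparse(url)
    segments = parts.path.strip("/").split("/")
    if len(segments) != 3 or segments[0] != "retrieve":
        raise ValueError(f"not a bitstream retrieve URL: {url}")
    # The middle segment is the primary key in the bitstream table.
    return BitstreamRef(parts.netloc, int(segments[1]), segments[2])


ref = parse_bitstream_url("https://hpds1.mit.edu/retrieve/1042/Porter_debate.pdf")
```

Note that nothing in the parsed result identifies the parent item, which is
characteristic (4) above.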

For many purposes this identifier may be adequate. For instance, if the need
is to retrieve bitstreams for near-term reuse as learning objects, then the
time-scale is likely to be such that the URL will remain valid. Even if the
repository is re-hosted (the server name changes), standard redirection
techniques would ensure that the URL will resolve to the desired bitstream.
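
As a rough illustration of the re-hosting case, the Python sketch below
computes the target a redirect would point to when only the server name
changes. The function name rehost and the new host dspace.example.org are
assumptions made for this example; a real deployment would normally configure
the redirect in the web server rather than in application code.

```python
from urllib.parse import urlparse


def rehost(url: str, new_server: str) -> str:
    """Return the same URL with only the server name replaced."""
    # The path (bitstream key and filename) is carried over unchanged,
    # so the redirect only works if the keys survive the move.
    return urlparse(url)._replace(netloc=new_server).geturl()


# dspace.example.org is a placeholder for the hypothetical new host.
target = rehost(
    "https://hpds1.mit.edu/retrieve/1042/Porter_debate.pdf",
    "dspace.example.org",
)
```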

Still, the bitstream URL is *fragile* in ways a handle is not:
if the database is exported and re-imported (as was done for the DSpace 1.1
upgrade), there is no guarantee that the bitstream key part of the URL
(therefore its validity) will be preserved. 
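
The following Python sketch illustrates the kind of breakage meant here. It
is a toy model, not a description of how the DSpace export and import actually
work: it simply assumes that a fresh table hands out new sequential keys on
re-import. The function reimport and the second filename are hypothetical.

```python
def reimport(table: dict[int, str]) -> dict[int, str]:
    """Toy model: load the rows into a fresh table that assigns keys 1, 2, 3, ..."""
    filenames = [table[key] for key in sorted(table)]
    return {new_key: name for new_key, name in enumerate(filenames, start=1)}


# Bitstream table before the export: primary key -> filename.
# 1042 is the key from the example URL; the second row is made up.
before = {1042: "Porter_debate.pdf", 1057: "hypothetical_other_file.pdf"}
after = reimport(before)

# The old URL .../retrieve/1042/Porter_debate.pdf embeds key 1042,
# which no longer exists in the re-imported table.
old_url_still_valid = 1042 in after and after[1042] == "Porter_debate.pdf"
```

Under this assumption the file itself is unchanged, yet the old URL no longer
resolves, whereas a handle would be unaffected because it does not embed the
database key.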

This quick dissection suggests some of the issues in play: do you need a
bitstream identifier to be persistent for a longer period, or simply more
'durable'? Does it have to be a handle? Should a bitstream ID also be
required to get you to the item of which it is a part? Does it have
to be a location identifier (like a URL)? Does the identifier system
need to be independently maintained (like the handle system), or not?
What are other required/desirable characteristics of bitstream identifiers?

Please share your insights on the dspace-general list; again, a concrete
use-case is the most valuable driver of design: a specific set of circumstances
and outcomes that any design must satisfy.