[Dspace-general] data sets - metadata

Mon Oct 20 13:20:27 EDT 2003

Hi everyone,

This is a great discussion, and I too am particularly interested in this 
subject.  I'm the Data Services Librarian here at MIT, and have been 
discussing issues particular to data sets with others (here & at other 
institutions).  For one perspective, I did a poster session on this topic 
at this year's meeting of IASSIST (the International Association for Social 
Science Information Service and Technology); see
http://macfadden.mit.edu:9500/presentations/iassist/.

Following are some thoughts on this specific issue.  First, though, are 
there others on the list interested in discussing data set issues on an 
ongoing basis?  I've discussed with others the possibility of establishing 
a separate email list on the topic, if there's enough interest.  Let me 
know if you would find this valuable, or if you'd rather just keep the 
discussion on the general list for now.

Regarding this issue, Suzanne brought up a good standard that we're looking 
at.  However, as was discussed, there are two kinds of metadata we're 
discussing:

1) the metadata used in DSpace to describe the items (we're using Dublin 
Core, but looking into different ways to take advantage of other metadata 
standards, such as DDI, see the SIMILE project at 
http://web.mit.edu/simile/www/)
2) the metadata about the data file in the form of the codebook, that for 
us will be one of the files (in addition to the data file) associated w/a 
given item in the system; ideally for social science data, this would be 
marked up in the DDI XML standard, but it's unclear as of yet if/how this 
would interface w/the main search system
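
To make the distinction between the two kinds concrete, here is a rough sketch in Python. It is purely illustrative and is not DSpace's actual data model or API: the class names, the "role" and "markup" fields, and the sample file names are all made up for the example. It only shows item-level Dublin Core living on the item, with the codebook stored as one file alongside the data file.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Bitstream:
    """One file attached to an item (hypothetical structure)."""
    name: str
    role: str          # assumed labels: "data" or "codebook"
    markup: str = ""   # e.g. "DDI XML" for a marked-up codebook


@dataclass
class Item:
    """An item with item-level Dublin Core plus its files (hypothetical)."""
    dublin_core: Dict[str, List[str]] = field(default_factory=dict)
    bitstreams: List[Bitstream] = field(default_factory=list)

    def codebooks(self) -> List[Bitstream]:
        # Kind 2: metadata about the data file, stored as a file itself.
        return [b for b in self.bitstreams if b.role == "codebook"]

    def searchable_text(self) -> str:
        # Kind 1: only the item-level Dublin Core values are included here;
        # nothing inside the codebook file is read.
        return " ".join(v for values in self.dublin_core.values() for v in values)


item = Item(
    dublin_core={"title": ["Example household survey"], "type": ["Dataset"]},
    bitstreams=[
        Bitstream("survey.dat", "data"),
        Bitstream("survey-codebook.xml", "codebook", "DDI XML"),
    ],
)
```

In this sketch the search text is built from the Dublin Core values alone, which is the open question in (2): the variable-level detail in the codebook is present in the item but invisible to a search that only looks at item-level metadata.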

But, as JQ said, sometimes the metadata is integrated w/the data (I believe 
this happens more often in scientific communities).  JQ made some other 
important points, many of which we're still struggling with, including how 
to have the system understand the internal/external structures of data 
files and how to encourage high-quality documentation of data.  And, as he 
alluded to, treatment of statistical data may differ from that for "raw" 
data sets that still require analysis.  Some of these issues will be managed at a 
system-wide level, while others (such as requiring/standardizing 
codebook-level metadata) are left up to our communities (e.g. departments) 
who deposit to DSpace; so organizational issues are as important to 
understand as are technical ones.

One question, for those of us not familiar with it: what is qDC?  Is it a 
standard for describing scientific data?

Let's keep up the interesting discussion!

Kate McNeill-Harman
Data Services Librarian, MIT

>From: "Suzanne Bell" <sbell at library.rochester.edu>
>To: <dspace-general at mit.edu>
>Subject: RE: [Dspace-general] data sets - metadata
>
>Hello folks-
>
>Would the work of the Data Documentation Initiative be germane to this 
>issue, I wonder? I have to admit I've not had any direct involvement with 
>this project, but I know they've been working very hard on this issue 
>(metadata for datasets) for several years. They have a good website at:
>  http://www.icpsr.umich.edu/DDI/
>(their focus is social science data, not scientific)
>
>  -Suzanne

>>From: "JQ Johnson" <jqj at darkwing.uoregon.edu>
>>To: <dspace-general at mit.edu>
>>Subject: RE: [Dspace-general] data sets - metadata
>>Date: Tue, 14 Oct 2003 09:15:28 -0700
>>
>>I'm very interested in this question also.  Note that the format of the 
>>data in computer file or DSpace format registry terms (Excel file, text 
>>file, Access .mdb file, etc.) may be much less relevant than the internal 
>>format of the data (for instance, a text file might be a comma-separated 
>>spreadsheet dataset or an SPSS dataset or the raw SQL commands needed to 
>>reconstitute a SQL database or ...).
>>
>>Perhaps more important than the raw observational data is the internal 
>>descriptive data that may or may not accompany the data itself -- the 
>>codebook, if you will.  Note that in some data formats this is a separate 
>>file, while in others it may be integrated into the data format.  This is 
>>truly metadata, but it's sure as anything not qDC!
>>
>>Perhaps equally important is the description of the data collection and 
>>data cleaning technique -- the sort of information that is typically 
>>included in the methods section of a paper based on the raw data.  If we 
>>really care about having data sets useful in the future, then it's 
>>important that the data set include links to such a methods 
>>section.  Such information is highly discipline specific; the appropriate 
>>metadata for a gene sequence is rather different from that for a 
>>statistical survey of domestic abuse victims.
>>
>>Many data sets are made available as supplements to published research; 
>>for example, many journals such as _Science_ now provide web based 
>>repositories for appendices to articles they publish.  At a bare minimum, 
>>the DC-style metadata for a data set that corresponds to a published 
>>paper should include a citation for the published paper.
>>
>>Let's start with the qDC.  coverage.* are natural fields to fill in, as is 
>>unqualified format.  I'd say that relation.isbasedon and other relation.* 
>>fields are also critical to include.  I'd also say that a policy that
>>required a human-readable codebook as one of the bitstreams associated 
>>with a dataset (stored as one or more bitstreams in the same item) would 
>>be extremely important.  But then what should the format fields refer 
>>to?  They are item-level, so how do we relate the format.* values to the 
>>particular bitstreams they apply to?  [This seems to be a major weakness 
>>in the current DSpace architecture]
>>
>>Our observation as we've begun to explore statistical data for our 
>>institutional repository is that most researchers are very careless about 
>>documenting their data in ways that would make it useful to other
>>researchers in the future.  We believe that the simple repository is only 
>>a tiny fraction of the real issue, and that the important thing is to 
>>provide advice and formal structures that make it easy for researchers to 
>>document their data collection process.
>>
>>I suspect that trying to answer the question if posed as "data sets" is 
>>going to be a failure, and that we should pose the question in terms of 
>>some particular kind of data set such as statistical sample observations. 
>>"Data set" is so broad that it even includes a set of bibliography 
>>entries (in bibtex, endnote, refer, MARC, or whatever format).  So the 
>>first thing to do is for us to focus on a particular kind of data.
>>
>>JQ Johnson                      Office: 115F Knight Library
>>Academic Education Coordinator  mailto:jqj at darkwing.uoregon.edu
>>1299 University of Oregon       phone: 1-541-346-1746; -3485 fax
>>Eugene, OR  97403-1299          http://darkwing.uoregon.edu/~jqj/
>>
>>-----Original Message-----
>>From: dspace-general-bounces at mit.edu
>>[mailto:dspace-general-bounces at mit.edu]On Behalf Of Gabriela Mircea
>>Sent: Tuesday, October 14, 2003 8:16 AM
>>To: dspace-general at mit.edu
>>Subject: [Dspace-general] data sets - metadata
>>
>>Hi all,
>>
>>We have some data sets that we would like to put into DSpace, but I am 
>>not sure how we should handle the metadata.  Does anyone have data sets 
>>in DSpace, and are you willing to share the way that metadata was organized?
>>We should probably add some more fields. The problem is not how to add 
>>the fields (technical), but what descriptors we should use for data sets. 
>>Are there any standards?
>>
>>Thank you in advance,
>>
>>Gabriela
>
>___________________________________________
>Katherine McNeill-Harman
>Data Services Reference Librarian
>Dewey Library for Management and Social Sciences
>Massachusetts Institute of Technology
>77 Massachusetts Avenue, E53-100
>Cambridge, MA 02139
>mcneillh at mit.edu
>617-253-0787