[Dspace-general] data sets - metadata
Katherine McNeill-Harman
mcneillh at MIT.EDU
Mon Oct 20 13:20:27 EDT 2003
Hi everyone,
This is a great discussion, and I too am particularly interested in this
subject. I'm the Data Services Librarian here at MIT, and have been
discussing issues particular to data sets with others (here & at other
institutions). For one perspective, I did a poster session on this topic
at this year's meeting of IASSIST (the International Association for Social
Science Information Service and Technology), see
http://macfadden.mit.edu:9500/presentations/iassist/.
Following are some thoughts on this specific issue. First, though, are
there others on the list interested in discussing data set issues on an
ongoing basis? I've discussed with others the possibility of establishing
a separate email list on the topic, if there's enough interest. Let me
know if you would find this valuable, or if you'd rather just keep the
discussion on the general list for now.
Regarding this issue, Suzanne brought up a good standard that we're looking
at. However, as was discussed, there are two kinds of metadata we're
discussing:
1) the metadata used in DSpace to describe the items (we're using Dublin
Core, but looking into different ways to take advantage of other metadata
standards, such as DDI, see the SIMILE project at
http://web.mit.edu/simile/www/)
2) the metadata about the data file in the form of the codebook, that for
us will be one of the files (in addition to the data file) associated w/a
given item in the system; ideally for social science data, this would be
marked up in the DDI XML standard, but it's unclear as of yet if/how this
would interface w/the main search system
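
As a rough illustration of the two kinds of metadata above, here is a minimal Python sketch. It is not DSpace's actual data model: the class names, the "role" labels, the field values, and the file names are all hypothetical, chosen only to show item-level Dublin Core sitting alongside a codebook that travels as one of the item's files.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Bitstream:
    """One file attached to an item (hypothetical model, not DSpace's API)."""
    name: str
    role: str        # assumed labels: "data" or "codebook"
    mime_type: str


@dataclass
class Item:
    """An item with (1) item-level Dublin Core and (2) its attached files."""
    dublin_core: Dict[str, List[str]]
    bitstreams: List[Bitstream] = field(default_factory=list)

    def codebooks(self) -> List[Bitstream]:
        # Kind 2: metadata about the data file, stored as a file in the item.
        return [b for b in self.bitstreams if b.role == "codebook"]


item = Item(
    # Kind 1: descriptive metadata used by the repository to describe the item.
    dublin_core={
        "title": ["Example survey data set"],
        "type": ["Dataset"],
    },
    bitstreams=[
        Bitstream("survey.csv", "data", "text/csv"),
        # Ideally DDI XML for social science data; the file name is made up.
        Bitstream("survey-codebook.xml", "codebook", "application/xml"),
    ],
)

print([b.name for b in item.codebooks()])
```

In this sketch the search system would see only `dublin_core`; whether and how the codebook's contents could feed into search is exactly the open question noted in point 2.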
But, as JQ said, sometimes the metadata is integrated w/the data (I believe
this sometimes happens more in scientific communities). JQ made some other
important points, many of which we're still struggling with, including how to
have the system understand the internal/external structures of data files
and how to encourage high-quality documentation of data. And, as he
alluded to, treatment of statistical data may be different from that for "raw"
data sets that still require analysis. Some of these issues will be managed at a
system-wide level, while others (such as requiring/standardizing
codebook-level metadata) are left up to our communities (e.g. departments)
who deposit to DSpace; so organizational issues are as important to
understand as are technical ones.
One question, for those of us not familiar with it: what is qDC? Is it a
standard for describing scientific data?
Let's keep up the interesting discussion!
Kate McNeill-Harman
Data Services Librarian, MIT
>From: "Suzanne Bell" <sbell at library.rochester.edu>
>To: <dspace-general at mit.edu>
>Subject: RE: [Dspace-general] data sets - metadata
>
>Hello folks-
>
>Would the work of the Data Documentation Initiative be germane to this
>issue, I wonder? I have to admit I've not had any direct involvement with
>this project, but I know they've been working very hard on this issue
>(metadata for datasets) for several years. They have a good website at:
> http://www.icpsr.umich.edu/DDI/
>(their focus is social science data, not scientific)
>
> -Suzanne
>>From: "JQ Johnson" <jqj at darkwing.uoregon.edu>
>>To: <dspace-general at mit.edu>
>>Subject: RE: [Dspace-general] data sets - metadata
>>Date: Tue, 14 Oct 2003 09:15:28 -0700
>>
>>I'm very interested in this question also. Note that the format of the
>>data in computer file or DSpace format registry terms (Excel file, text
>>file, Access .mdb file, etc.) may be much less relevant than the internal
>>format of the data (for instance, a text file might be a comma-separated
>>spreadsheet dataset or an SPSS dataset or the raw SQL commands needed to
>>reconstitute a SQL database or ...).
>>
>>Perhaps more important than the raw observational data is the internal
>>descriptive data that may or may not accompany the data itself -- the
>>codebook, if you will. Note that in some data formats this is a separate
>>file, while in others it may be integrated into the data format. This is
>>truly metadata, but it's sure as anything not qDC!
>>
>>Perhaps equally important is the description of the data collection and
>>data cleaning technique -- the sort of information that is typically
>>included in the methods section of a paper based on the raw data. If we
>>really care about having data sets useful in the future, then it's
>>important that the data set include links to such a methods
>>section. Such information is highly discipline specific; the appropriate
>>metadata for a gene sequence is rather different from that for a
>>statistical survey of domestic abuse victims.
>>
>>Many data sets are made available as supplements to published research;
>>for example, many journals such as _Science_ now provide web based
>>repositories for appendices to articles they publish. At a bare minimum,
>>the DC-style metadata for a data set that corresponds to a published
>>paper should include a citation for the published paper.
>>
>>Let's start with the qDC. The coverage.* fields are natural ones to fill
>>in, as is unqualified format. I'd say that relation.isbasedon and other relation.*
>>fields are also critical to include. I'd also say that a policy that
>>required a human-readable codebook as one of the bitstreams associated
>>with a dataset (stored as one or more bitstreams in the same item) would
>>be extremely important. But then what should the format fields refer
>>to? They are item-level, so how do we relate the format.* values to the
>>particular bitstreams they apply to? [This seems to be a major weakness
>>in the current DSpace architecture]
>>
>>Our observation as we've begun to explore statistical data for our
>>institutional repository is that most researchers are very careless about
>>documenting their data in ways that would make it useful to other
>>researchers in the future. We believe that the simple repository is only
>>a tiny fraction of the real issue, and that the important thing is to
>>provide advice and formal structures that make it easy for researchers to
>>document their data collection process.
>>
>>I suspect that trying to answer the question if posed as "data sets" is
>>going to be a failure, and that we should pose the question in terms of
>>some particular kind of data set such as statistical sample observations.
>>"Data set" is so broad that it even includes a set of bibliography
>>entries (in bibtex, endnote, refer, MARC, or whatever format). So the
>>first thing to do is for us to focus on a particular kind of data.
>>
>>JQ Johnson Office: 115F Knight Library
>>Academic Education Coordinator mailto:jqj at darkwing.uoregon.edu
>>1299 University of Oregon phone: 1-541-346-1746; -3485 fax
>>Eugene, OR 97403-1299 http://darkwing.uoregon.edu/~jqj/
>>
>>-----Original Message-----
>>From: dspace-general-bounces at mit.edu
>>[mailto:dspace-general-bounces at mit.edu]On Behalf Of Gabriela Mircea
>>Sent: Tuesday, October 14, 2003 8:16 AM
>>To: dspace-general at mit.edu
>>Subject: [Dspace-general] data sets - metadata
>>
>>Hi all,
>>
>>We have some data sets that we would like to put into DSpace, but I am
>>not sure how we should handle the metadata. Does anyone have data sets
>>in DSpace, and are you willing to share the way that metadata was organized?
>>We should probably add some more fields. The problem is not how to add
>>the fields (technical), but what descriptors should we use for data sets.
>>Are there any standards?
>>
>>Thank you in advance,
>>
>>Gabriela
>>_______________________________________________
>>Dspace-general mailing list
>>Dspace-general at mit.edu
>>http://mailman.mit.edu/mailman/listinfo/dspace-general
>
>___________________________________________
>Katherine McNeill-Harman
>Data Services Reference Librarian
>Dewey Library for Management and Social Sciences
>Massachusetts Institute of Technology
>77 Massachusetts Avenue, E53-100
>Cambridge, MA 02139
>mcneillh at mit.edu
>617-253-0787