[Dspace-general] RE: [Dspace-tech] Google Scholar and OAI
MacKenzie Smith
kenzie at MIT.EDU
Thu Feb 3 17:49:39 EST 2005
At 01:13 PM 2/3/2005 -0500, Tansley, Robert wrote:
>A related question -- are Google assuming that every DSpace instance
>contains scholarly literature?
No. But we are discussing the issue of selecting DSpace sites of
"scholarly" content vs other DSpace-based repositories out there which
might have non-academic content in them. Right now, as far as I can make
out, Google is doing this selection manually by examining each repository.
They realize that model won't scale up, and they are evaluating automated
mechanisms to detect "research papers" in repositories as opposed to other
kinds of content that they don't want to include. As you may have noticed,
they're already excluding image content from Google Scholar that exists in
some of the DSpace repositories they're already harvesting.
>And will/does/should it differentiate between 'production' DSpace
>instances and the numerous 'test' instances, which may or may not stay
>around and contain 'real' content?
A point that I have recently made, when they were complaining about the
Handles in Harvard's repository.
So for now I've asked them to exclude repositories that use the
default/test Handle namespace of 123456789.
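For illustration only, a check for the default namespace could look roughly
like the Python sketch below. This is not what Google actually runs; the
function names, the accepted input forms (bare handle, "hdl:" form, item URL)
and the sample hosts are all assumptions made up for the example.

```python
from urllib.parse import urlparse

# Default/test Handle namespace mentioned in the message above.
DEFAULT_TEST_PREFIX = "123456789"


def handle_prefix(handle):
    """Return the prefix part of a Handle given as a bare handle,
    an "hdl:" string, or a URL (hypothetical helper)."""
    text = handle.strip()
    if "://" in text:
        # Keep only the path of a URL such as http://host/handle/<prefix>/<suffix>
        text = urlparse(text).path
    text = text.lstrip("/")
    if text.lower().startswith("hdl:"):
        text = text[4:]
    parts = [p for p in text.split("/") if p]
    # Assumption: item URLs carry the handle right after a "handle" path segment.
    if "handle" in parts:
        parts = parts[parts.index("handle") + 1:]
    return parts[0] if parts else ""


def uses_test_namespace(handle):
    """True if the handle sits in the default/test namespace."""
    return handle_prefix(handle) == DEFAULT_TEST_PREFIX


if __name__ == "__main__":
    # Sample values are invented for the example.
    for sample in ("hdl:123456789/42",
                   "http://repo.example.org/dspace/handle/123456789/42",
                   "http://hdl.handle.net/2345.6/78"):
        print(sample, uses_test_namespace(sample))
```

A check like this only catches sites still on the default prefix; it says
nothing about a site that has registered a real prefix.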
If a DSpace site has real Handles and open access content then there's not
a whole lot that we or Google could do to inform them of its production
status...
>I assume right now, to identify a DSpace and what's in it, they're using
>some sort of heuristic; but due to people's customisations, diverging
>uses of DSpace and a rapidly-evolving platform, that approach doesn't
>feel like it'll last long.
They're using the registry on the DSpace wiki to find live repositories:
http://wiki.dspace.org/DspaceInstances and I didn't put any on that list
that I *knew* to be in test or pilot mode...
Beyond that, they're selecting stuff out of these repositories based on
their native characteristics (i.e. however Google recognizes a "research
document", broadly defined).
But the UI customizations are preventing them from also harvesting the
associated metadata successfully.
>I think any mechanism we come up with, such as those Andy Powell
>suggested, should also take into account the above issues.
Yep.
MacKenzie