[Dspace-general] RE: [Dspace-tech] Google Scholar and OAI

Thu Feb 3 17:49:39 EST 2005

At 01:13 PM 2/3/2005 -0500, Tansley, Robert wrote:
>A related question -- are Google assuming that every DSpace instance
>contains scholarly literature?

No. But we are discussing the issue of selecting DSpace sites of 
"scholarly" content vs other DSpace-based repositories out there which 
might have non-academic content in them. Right now, as far as I can make 
out, Google is doing this selection manually by examining each repository. 
They realize that model won't scale up, and they are evaluating automated 
mechanisms to detect "research papers" in repositories as opposed to other 
kinds of content that they don't want to include. As you may have noticed, 
they're already excluding image content from Google Scholar that exists in 
some of the DSpace repositories they're already harvesting.

>And will/does/should it differentiate between 'production' DSpace
>instances and the numerous 'test' instances, which may or may not stay
>around and contain 'real' content?

A point that I have recently made, when they were complaining about the 
Handles in Harvard's repository.
So for now I've asked them to exclude repositories that use the 
default/test Handle namespace of 123456789.
If a DSpace site has real Handles and open access content then there's not 
a whole lot that we or Google could do to inform them of its production 
status...
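
For what it's worth, the exclusion rule above could be sketched roughly as 
follows. This is only an illustration, not what Google actually runs: the 
function names are made up here, the sample Handles in the usage note are 
illustrative, and it assumes a harvester already has a list of Handles (or 
Handle URLs) for a given repository.

```python
from urllib.parse import urlparse

# Default/test Handle namespace mentioned above.
DEFAULT_TEST_PREFIX = "123456789"


def handle_prefix(handle: str) -> str:
    """Return the prefix (the part before the first '/') of a Handle.

    Accepts a bare Handle ("prefix/suffix"), an "hdl:" form, or a full
    resolver URL. Hypothetical helper, for illustration only.
    """
    if handle.startswith("hdl:"):
        handle = handle[len("hdl:"):]
    elif "://" in handle:
        # For a resolver URL, the Handle is the URL path.
        handle = urlparse(handle).path.lstrip("/")
    return handle.split("/", 1)[0]


def uses_test_namespace(handles) -> bool:
    """True if any Handle in the repository uses the default/test prefix."""
    return any(handle_prefix(h) == DEFAULT_TEST_PREFIX for h in handles)
```

A harvester could then skip any repository for which 
uses_test_namespace(...) is true, e.g. one exposing "123456789/42", while 
keeping one whose items look like "hdl:1721.1/7".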

>I assume right now, to identify a DSpace and what's in it, they're using
>some sort of heuristic; but due to people's customisations, diverging
>uses of DSpace and a rapidly-evolving platform, that approach doesn't
>feel like it'll last long.

They're using the registry on the DSpace wiki to find live repositories: 
http://wiki.dspace.org/DspaceInstances and I didn't put any on that list 
that I *knew* to be in test or pilot mode...
Beyond that they're selecting stuff out of these repositories based on 
their native characteristics (i.e. however Google recognizes a "research 
document", broadly defined).
But the UI customizations are preventing them from also harvesting the 
associated metadata successfully.

>I think any mechanism we come up with, such as those Andy Powell 
>suggested, should also take into account the above issues.

Yep.

MacKenzie