[Dspace-general] FW: Google harvesting using OAI-PMH

Robert Tansley roberttansley at google.com
Tue Apr 29 14:00:49 EDT 2008


To add a couple of points:

OAI-PMH actually gives *less* information than we want: there is no
predictable way to reach the item full text/content from an OAI-PMH
record.

Also, in addition to support for sitemaps.org Sitemaps, DSpace 1.5 can
generate static HTML pages containing links to item pages. These are
basically pre-generated browse pages for crawlers, which don't require
you to register with Webmaster Tools etc., and carry most of the same
benefits (i.e. minimising crawler load).
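
For anyone who hasn't looked at the sitemaps.org format, here is a rough
sketch in Python of what such a file contains. This is not DSpace's own
generator: the build_sitemap function and the example handle URLs are
made up for illustration, and only the required <loc> element is emitted.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(item_urls):
    """Return a sitemaps.org <urlset> document listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in item_urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical item pages; a real repository would list its own handles.
example_items = [
    "http://dspace.example.com/dspace/handle/123456789/1",
    "http://dspace.example.com/dspace/handle/123456789/2",
]
sitemap_xml = build_sitemap(example_items)
print(sitemap_xml)
```

The point is that a crawler gets nothing but a flat list of item page
URLs, which is all it needs in order to go and index them.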

If you have either the sitemaps.org Sitemaps or the HTML sitemaps set up
(note that a link to the HTML sitemap must be present on your DSpace site,
e.g. from the home page), setting up your robots.txt to prevent crawling of
the dynamic browse UI pages should lighten the load on your server
somewhat.
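
As a sanity check before deploying such a robots.txt, something like the
following Python sketch can confirm that the rules block the browse pages
but still allow item pages. The /dspace/browse path, the example URLs and
the may_crawl helper are assumptions for illustration; check which URLs
your own DSpace UI actually uses for browsing.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the browse path is an assumption, not taken from
# any particular DSpace installation.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dspace/browse
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(url, user_agent="*"):
    """Return True if the rules above allow the given URL to be fetched."""
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs on an example site.
browse_page = "http://dspace.example.com/dspace/browse?type=title"
item_page = "http://dspace.example.com/dspace/handle/123456789/1"

print("browse page allowed:", may_crawl(browse_page))
print("item page allowed:", may_crawl(item_page))
```

Disallow rules match by path prefix, so the single rule covers every URL
whose path starts with /dspace/browse while leaving item pages crawlable.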

Rob

On Mon, Apr 28, 2008 at 2:10 AM, Stuart Lewis <sdl at aber.ac.uk> wrote:
> Hi Leonie,
>
>
>  > Would anyone like to comment from a DSpace perspective, (see below) regarding
>  > current practices. There seems to be a lot of conflicting information around.
>
>  Agreed! From what I've seen there has been a lot of hype about a service
>  used by only 200 people which Google is discontinuing.
>
>
>  > A few months ago there was some discussion regarding Google harvesting
>  > information from DigiTool - so here is a short update.
>  >
>  > The official Google statement stated that Google supports harvesting using
>  > OAI-PMH and as DigiTool can serve as an OAI-PMH provider we assumed that
>  > harvesting by Google is possible. After a few attempts, and a few discussions
>  > with Google (we should thank CJH for making the effort) we got a clear message
>  > that OAI-PMH harvesting is no longer supported and customers should use the
>  > standard site map required by Google (personally, I'm not sure if OAI-PMH was
>  > ever supported).
>
>  *As far as we know*, Google did not use OAI-PMH to harvest items as, for
>  example, OAIster does. They only used OAI-PMH as a way of discovering new web
>  pages in a web site, which they then went and indexed in the traditional way. So
>  sites could provide their OAI-PMH feed to Google to help ensure Google had
>  complete coverage of a site.
>
>  Because all Google really wants is the URLs which they can then index,
>  OAI-PMH is somewhat heavy-weight for this purpose, giving them a lot more
>  information than they really want. This can be seen by their participation
>  in the development of sitemaps (http://sitemaps.org/) which is much more
>  lightweight.
>
>
>  > We use the standard site map and exclude the Browse screens and Suggest a
>  > title, do most other sites do the same?
>
>  Yes - that is common good practice, as there is no point in Google spidering
>  your browse screens if they can get the same information (a list of all the
>  items / collections / communities) in a more efficient way.
>
>  One of the less publicised features of DSpace version 1.5 is the inclusion
>  of support for sitemaps. [dspace]/bin/generate-sitemaps will generate your
>  sitemaps, which are then exposed at http://dspace.example.com/dspace/sitemap
>
>  You'll need to register with Google Webmaster Tools (
>  http://www.google.com/webmasters/tools/) in order to be able to inform
>  Google where the sitemap is located.
>
>  Thanks,
>
>
>  Stuart
>  _________________________________________________________________
>
>  Gwasanaethau Gwybodaeth                      Information Services
>  Prifysgol Aberystwyth                      Aberystwyth University
>
>             E-bost / E-mail: Stuart.Lewis at aber.ac.uk
>                  Ffon / Tel: (01970) 622860
>  _________________________________________________________________
>
>



