[Dspace-general] Google search results bypass metadata records

Thu Jun 21 20:26:21 EDT 2007

Hi Pat!  (Copy to Roger Costello, proposer of the HTTP "Meta-Location" 
header, who may help make this discussion fruitful.)

I very much like your idea: for a given URL, there should be a standard 
"derivation" to get its metadata / context.

In an HTML document, we could use a <LINK> element with a REL link type:
http://www.w3.org/TR/html401/types.html#type-links
But there is no "Metadata" link type! Maybe "Index" could be used.
This draft suggests using REL="META":
http://vancouver-webpages.com/ml/draft-daviel-metadata-link-00.txt
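
To make the <LINK> idea concrete, here is a rough sketch of how a client 
could look for such a link in an HTML page, using only the Python standard 
library. The names MetaLinkFinder and find_metadata_links are mine, and 
treating REL="META" as the marker simply follows the draft above; nothing 
here is an existing DSpace or Google feature.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaLinkFinder(HTMLParser):
    """Collect href values of <link> elements whose rel contains 'meta'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "link":
            return
        attributes = dict(attrs)
        # REL may hold several space-separated link types.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "meta" in rel_values and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def find_metadata_links(html_text, base_url):
    """Return absolute URLs of all REL="META" links found in html_text."""
    finder = MetaLinkFinder()
    finder.feed(html_text)
    return [urljoin(base_url, href) for href in finder.hrefs]
```

Of course this only works for HTML content that we are allowed to modify.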

There is also the PROFILE attribute of the <HEAD> element:
http://www.w3.org/TR/html401/struct/global.html#h-7.4.4.3
Anyway, we mainly store PDF (and not HTML documents) and we usually 
cannot change the content of the documents we receive...

I just checked, and I see no HTTP method or header field that could carry 
this "content-independent" information:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html

Ideally, the "derivation" toward metadata should be done through HTTP 
(which is content-independent).
Maybe one day the Accept header could contain metadata/html (instead of 
text/html) to indicate that the metadata is requested and not the text 
(content).

People at the W3C seem to be discussing this:
http://osdir.com/ml/org.w3c.tag/2003-02/msg00247.html
They propose a new HTTP header: Meta-Location
http://www.xfront.com/dist-reg/distributed-registry.html
So we will see what future versions of HTTP bring...
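
If such a header existed, a client could use it along these lines. This is 
only a sketch: I assume that the value of "Meta-Location" is a URL (possibly 
relative to the document), which is my reading of the proposal and not 
something a current HTTP specification defines. The function name 
metadata_location is hypothetical.

```python
from urllib.parse import urljoin


def metadata_location(document_url, headers):
    """Return the absolute metadata URL announced by a response, or None.

    `headers` is any mapping of header names to values. HTTP header names
    are case-insensitive, so they are compared in lower case.
    Assumption: the (proposed, non-standard) Meta-Location value is a URL.
    """
    for name, value in headers.items():
        if name.lower() == "meta-location" and value.strip():
            # Resolve a relative value against the document's own URL.
            return urljoin(document_url, value.strip())
    return None
```

This works for PDF or Word files as well, since nothing inside the document 
has to change.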

For now, does anybody manage the relationship with Google, so that they 
could add, as part of their DSpace awareness, a "Metadata" link to each 
document hit, linking

https://pacer.ischool.utexas.edu/bitstream/2081/859/1/Arrangement.doc

to

https://pacer.ischool.utexas.edu/handle/2081/859
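
The mapping between those two URLs is mechanical, as this small sketch 
shows. It assumes the URL layout seen in the example above 
(.../bitstream/<prefix>/<suffix>/<sequence>/<filename>); the function name 
handle_url_for_bitstream is mine, not something from DSpace.

```python
from urllib.parse import urlsplit, urlunsplit


def handle_url_for_bitstream(bitstream_url):
    """Map .../bitstream/<prefix>/<suffix>/... to .../handle/<prefix>/<suffix>.

    Returns None when the URL does not follow that layout.
    """
    parts = urlsplit(bitstream_url)
    segments = parts.path.split("/")
    try:
        start = segments.index("bitstream")
    except ValueError:
        return None
    # The two segments after "bitstream" form the handle (prefix/suffix).
    handle = segments[start + 1:start + 3]
    if len(handle) != 2 or not all(handle):
        return None
    # Keep whatever precedes "bitstream" (e.g. a context path) unchanged.
    new_path = "/".join(segments[:start] + ["handle"] + handle)
    return urlunsplit((parts.scheme, parts.netloc, new_path, "", ""))
```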

Another idea would be to redirect any invalid bitstream URL to the 
metadata. Instead of sending:

  Invalid Identifier

The identifier 2081/859/1/Arransqdsqd does not correspond to a valid 
Bitstream in DSpace

we could simply show the metadata page. Then any corruption at the end of 
the URL would still help the user by showing the metadata record.
In particular:

https://pacer.ischool.utexas.edu/bitstream/2081/859/1/
and
https://pacer.ischool.utexas.edu/bitstream/2081/859/
should return the metadata, since it is usual to remove a level from a URL to reach a higher-level table of contents.
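
The fallback could look roughly like the sketch below. It is not DSpace 
code: resolve_bitstream_request and the known_bitstreams set are 
hypothetical stand-ins for the real lookup, and the sketch does not check 
that the recovered handle actually exists.

```python
def resolve_bitstream_request(path, known_bitstreams):
    """Decide how to answer a request for a /bitstream/... path.

    Returns an (action, target) tuple:
      ("serve", path)            when the path names a known bitstream,
      ("redirect", handle_path)  when only the handle part is recoverable,
      ("not_found", None)        when not even a handle can be recovered.
    """
    if path in known_bitstreams:
        return ("serve", path)
    # Drop empty segments so trailing slashes do not matter.
    segments = [s for s in path.split("/") if s]
    if len(segments) >= 3 and segments[0] == "bitstream":
        # Fall back to the metadata page instead of "Invalid Identifier".
        return ("redirect", "/handle/" + segments[1] + "/" + segments[2])
    return ("not_found", None)
```

With this, the corrupted identifier above and both truncated URLs would all 
lead to /handle/2081/859.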

Maybe this (removing a level from a URL) could be a standard "URL-based" way to go from a document to its context.
But, I know, an index is not metadata! Should this be implemented by Google? By DSpace?
Or by other sites, as a default generic method to contextualize a document?

Have a nice evening,

Christophe Dupriez

P.S. Mr. Costello's site, XFront.com, seems packed with interesting XSLT tutorials...

Pat Galloway wrote:
> After having read the original post on this topic I was more than a 
> little concerned because we have restricted materials on our server; but 
> I tested using text in restricted Microsoft Word files and found no 
> hits; the same day, however, I retrieved on this string "Michael 
> Joyce—Arrangement-4/20/2005" which was the first text string in an 
> unrestricted Word document (but not in the DSpace metadata), and got 
> this result using regular Google search:
>
> [DOC] Michael Joyce--Arrangement
> File Format: Microsoft Word - View as HTML
> Michael Joyce—Arrangement-4/20/2005 (updated 05/01/2005). Series I. 
> Works. (Subseries for each title). Series II. Academic Career ...
> https://pacer.ischool.utexas.edu/bitstream/2081/859/1/Arrangement.doc
>
> Obviously if you go here it is impossible to get to the metadata unless 
> you know to sever the last two filepath elements. Clearly this might be 
> a concern for many reasons; but I was in the midst of writing an article 
> for Library Trends on archives and information retrieval, and found it 
> interesting (in the context of archival discussions about "exploding" 
> the authority of the finding aid) that opening DSpace to Web 2.0 permits 
> this broader, deracinated granular access. I would hope that it NOT be 
> made impossible, either on the DSpace or the Google side; just that 
> (since Google is DSpace-aware) the searcher be shown how to get the 
> metadata if indeed it is wanted for the searcher's purposes. Too often 
> we think it's up to us to determine what researchers need and want. Has 
> anyone heard any complaints from them?
>
> Pat Galloway
> School of Information
> University of Texas
