[Dspace-general] [Possible Spam]::Re: Statistics Issues and DSpace

Angelo Miranda amiranda at reitoria.uminho.pt
Wed Feb 28 06:13:47 EST 2007


Hi Bjorn,

We probably need to document this mechanism better in the Statistics
add-on documentation; we will try to do that.
Meanwhile, here is a brief description of how it works. I hope my poor
English doesn't make it too confusing.

The mechanism for excluding crawlers is based on the Apache web access
logs and on the principle that well-behaved web-crawling robots fetch
robots.txt before crawling anything else. We started by implementing the
mechanism as fully automatic, based on that principle: every IP that
accessed robots.txt was excluded. Later we realized that this fully
automatic mechanism wasn't very effective, for two main reasons. First,
large crawlers like Google download robots.txt from one machine and then
crawl from hundreds of other machines; those machines never need to
access robots.txt themselves. Second, with a fully automatic mechanism,
a user or proxy that happened to access robots.txt out of curiosity
could be wrongly excluded.
For these reasons we had to implement a two-step, semi-automatic
mechanism.
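
To make the principle concrete, here is a rough Python sketch of the
detection idea - spotting the IP/agent pairs that fetch robots.txt in an
Apache access log. This is not the add-on's actual code, and it assumes
your logs use the standard "combined" format:

import re

# Apache "combined" log format:
# host ident user [time] "request" status bytes "referer" "agent"
LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" \d+ \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)

def candidate_spiders(log_path):
    """Yield the distinct (ip, agent) pairs that fetched robots.txt."""
    seen = set()
    with open(log_path) as log:
        for line in log:
            m = LINE.match(line)
            if m and m.group('path') == '/robots.txt':
                pair = (m.group('ip'), m.group('agent'))
                if pair not in seen:
                    seen.add(pair)
                    yield pair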

The stats-detect-spiders script detects downloads of the robots.txt file
and populates a temporary table with the IP/agent of each machine that
made the access. Every record in that table must then be reviewed by a
person, who makes the decision of registering it as a spider. This is
done with stats_spider_add(ip, agent). Note that this procedure stores
the agent as well, so from then on all accesses made by agents registered
as spiders (even from machines that never accessed robots.txt) are
counted as crawling. Take the Google example I mentioned: the Googlebot
machine that fetches robots.txt is identified and registered as a spider
(IP + agent) by a person; all the other Google machines share the same
agent name, so their accesses are also counted as crawling.
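
As an illustration of the two steps together (again just a sketch: the
stats_spider_add function below is a stand-in for the add-on's procedure
of the same name, the in-memory sets play the role of its database
tables, and the IPs are made up):

# Confirmed spiders; the real add-on keeps these in database tables.
spider_ips = set()
spider_agents = set()

def stats_spider_add(ip, agent):
    """A person confirms a candidate: remember both the IP and the agent."""
    spider_ips.add(ip)
    spider_agents.add(agent)

def is_crawler(ip, agent):
    # An access counts as crawling if the IP was confirmed directly, or
    # if the agent matches a confirmed spider. The agent match covers the
    # Google case, where hundreds of machines share one agent name.
    return ip in spider_ips or agent in spider_agents

# One Googlebot machine fetched robots.txt and was confirmed by a person...
stats_spider_add('66.249.66.1', 'Googlebot/2.1')
# ...so a different Google machine with the same agent is still excluded.
assert is_crawler('66.249.66.99', 'Googlebot/2.1')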

In the first weeks of running the statistics add-on the manual step may
be a little time consuming, but after an initial period, once all the big
crawlers have been identified, you can schedule stats-detect-spiders once
a week, for instance, and review the temporary table in 15 minutes or
less.
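
For example, a crontab entry like this one would run the detection early
every Sunday (the installation path is an assumption; adjust it to your
own DSpace install):

# run spider detection weekly, Sundays at 02:00
0 2 * * 0 /dspace/bin/stats-detect-spiders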

For people who are starting to run the statistics add-on, we could even
share the stats_agents and IP information we have already identified, so
that they can skip that initial work. At this moment we have about 180
agents identified as crawlers, corresponding to about 5000 machines
(IPs).

I hope this helps.
If you have any questions or doubts, don't hesitate to contact us.

Thank you
Angelo Miranda


-----Original Message-----
From: Bjorn Skobba [mailto:bjorn.skobba at brunel.ac.uk] 
Sent: Wednesday, 28 February 2007 8:53
To: Angelo Miranda
Cc: 'John Murtagh'; dspace-general at mit.edu
Subject: Re: [Possible Spam]::Re: [Dspace-general] Statistics Issues and DSpace

Hi Angelo,

Thanks for your reply.

Is the mechanism for excluding crawlers, etc. the stats-detect-spiders
script?

Can you tell me a little bit about it - is it supposed to be scheduled
from cron (and how often)? How does it work and what does it do?

Many thanks for your help.

Bjorn Skobba
Brunel University

On Tue, 2007-02-27 at 16:45 +0000, Angelo Miranda wrote:
> Hi,
> 
>  
> 
> I am from RepositoriUM team at University of Minho.
> 
> I am not sure I understood your question.
> 
> Are you asking if the statistics add-on excludes the views and
> downloads from crawlers/robots and similar tools?
> 
> If that's the question, the answer is yes. The add-on has a mechanism
> for ignoring those accesses.
> 
>  
> 
> Thank You
> 
> Angelo Miranda
> 
>  
> 
> -----Original Message-----
> From: dspace-general-bounces at mit.edu
> [mailto:dspace-general-bounces at mit.edu] On Behalf Of John Murtagh
> Sent: Tuesday, 27 February 2007 12:08
> To: dspace-general at mit.edu
> Subject: [Dspace-general] Statistics Issues and DSpace
> 
>  
> 
> Hello all,
> 
> The use of statistics for DSpace is an important part of our strategy
> to increase deposits and we wish to protect the integrity of that
> information.
> 
> A quick question about the statistics element of DSpace, namely the
> add-on provided by Universidade do Minho
> <http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/6086.html>
> 
> I wondered if there had been any issues or concerns in the collection
> and processing of downloads and views for items on DSpace. These could
> come in the form of retrieval robots, OAI harvesters, Google, or the
> manipulation of statistics.
> 
> Anyone got any news or info they'd like to share?
> 
> Thanks in advance
> 
> 
> John Murtagh
> ________________________________________________
> 
> 
> Website: http://bura.brunel.ac.uk
> 
> 
> 
> John Murtagh
> Project Manager - Brunel University Research Archive
> Brunel Library
> Kingston Road
> Uxbridge
> UB8 3PH
> 
> Tel: 01895 265417
> Fax: 01895 269741
> E-mail: john.murtagh at brunel.ac.uk
> 
> _______________________________________________
> Dspace-general mailing list
> Dspace-general at mit.edu
> http://mailman.mit.edu/mailman/listinfo/dspace-general
