[Dspace-general] Dspace-general Digest, Vol 61, Issue 22

Eloy Rodrigues eloy at sdum.uminho.pt
Wed Aug 27 05:12:04 EDT 2008


Dear All,

A detailed description of the functionality and architecture of the
statistics Add-on we have developed can be found in the docs folder of the
downloadable file:
http://wiki.dspace.org/static_files/6/68/Stats-addon-2.0.tar.gz

In our production deployment of the Add-on on RepositóriUM, we have
developed some further tools/functionality for automated and semi-automated
detection and exclusion of crawlers (based not only on "well-behaved"
robots, but also on the patterns and behavior of IP addresses, etc.), which
are not available in version 2.0 of the Add-on. 

As we are currently upgrading RepositóriUM to DSpace 1.5, we hope to
release a Stats Add-on 2.1, compatible with DSpace 1.5 and including the
new functionality/tools, in late September or October.

Best Regards,

Eloy Rodrigues
Universidade do Minho - Serviços de Documentação
Campus de Gualtar - 4710-057 Braga 
Telephone: +351 253604150; Fax: +351 253604159
Campus de Azurém - 4800-058 Guimarães
Telephone: +351 253510168; Fax: +351 253510117

 


-----Original Message-----
From: dspace-general-bounces at mit.edu [mailto:dspace-general-bounces at mit.edu]
On Behalf Of dspace-general-request at mit.edu
Sent: Wednesday, 27 August 2008 09:31
To: dspace-general at mit.edu
Subject: Dspace-general Digest, Vol 61, Issue 22

Send Dspace-general mailing list submissions to
	dspace-general at mit.edu

To subscribe or unsubscribe via the World Wide Web, visit
	http://mailman.mit.edu/mailman/listinfo/dspace-general
or, via email, send a message with subject or body 'help' to
	dspace-general-request at mit.edu

You can reach the person managing the list at
	dspace-general-owner at mit.edu

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Dspace-general digest..."


Today's Topics:

   1. Re: Week 2: Statistics (Tim Donohue)
   2. Re: Statistics (Mark H. Wood)
   3. Re: Statistics (Dorothea Salo)
   4. Re: Statistics (Tim Donohue)
   5. Re: Week 2: Statistics (Mark H. Wood)
   6. Re: Week 2: Statistics (Dorothea Salo)
   7. Re: Week 2: Statistics (Christophe Dupriez)


----------------------------------------------------------------------

Message: 1
Date: Tue, 26 Aug 2008 11:09:15 -0500
From: Tim Donohue <tdonohue at illinois.edu>
Subject: Re: [Dspace-general] Week 2: Statistics
To: Dorothea Salo <dsalo at library.wisc.edu>
Cc: dspace-general at mit.edu
Message-ID: <48B42AAB.6010804 at illinois.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Dorothea & all,

Dorothea Salo wrote:
> 2008/8/25 Mark H. Wood <mwood at iupui.edu>:
>> One thing to keep in mind about whole-site statistical tables is that
>> there are already tools to do this for web sites in general, such as
>> AWStats or Webalizer or whatever your favorite may be.  We probably
>> should not spend effort to try to duplicate those.
> 
> Perhaps not, but if this is the direction we want people to go in, we
> probably ought to document how to do it, at least informally on the
> wiki. Does anybody have such a system in place?

For IDEALS (www.ideals.uiuc.edu), we use AWStats to get site-wide 
traffic information.  However, that information is *not* publicly 
accessible.  We only use it for administrative purposes, since most of 
the information AWStats generates for us is *not* useful to 
our users.

So, for example, AWStats can provide us with the following general 
information:
   * Which features of DSpace are being used most frequently (e.g. 
Subject Browse, Community/Collection browse, search, etc.)
   * Which web browsers our users are using
   * # of overall hits in a given month, week, day, or hour
   * Approximate amount of time users spend on our site
   * What external resources people use to get to our site (e.g. Google, 
Blog posts, Library website, etc.)
   * The top searches used to reach our site (in Google, Yahoo, MSN, etc.)

But, AWStats only works at a global level.  So, it *cannot* give us any 
real information at the community, collection, or item level, since it 
doesn't understand DSpace's internal structure and cannot parse DSpace's 
log files (it parses the *web server* log files, rather than DSpace's 
internal logs).
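As a rough illustration of that gap: a DSpace-aware tool would have to map raw web server log lines onto handles itself. Here is a minimal Python sketch; the log line, IP address, and handle in it are invented for illustration, and real DSpace URL layouts vary by version:

```python
import re

# A typical Apache "combined" access-log line (address and handle invented).
LOG_LINE = ('203.0.113.5 - - [26/Aug/2008:11:09:15 -0500] '
            '"GET /handle/2142/123 HTTP/1.1" 200 5120 "-" "Mozilla/5.0"')

# DSpace serves communities, collections, and items under
# /handle/<prefix>/<id>, so a log-based tool must pull that pattern out.
HANDLE_RE = re.compile(r'"GET (/handle/(\d+/\d+))[^"]*"')

def handle_of(line):
    """Return the handle (e.g. '2142/123') hit by a log line, or None."""
    m = HANDLE_RE.search(line)
    return m.group(2) if m else None
```

Even then, the script only sees URLs; turning a handle into a community or collection name still requires a lookup against DSpace's own database.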

So, in the end, AWStats is a worthwhile tool to keep in mind.  However, 
without some major DSpace-specific customizations, it's really more 
of an administrative tool to help you determine *how* users are using 
your site.  It doesn't give any really worthwhile "statistics" in terms of 
file downloads or individual community/collection access counts, which 
are more likely to be useful to your users.

- Tim

-- 
Tim Donohue
Research Programmer, Illinois Digital Environment for
Access to Learning and Scholarship (IDEALS)
University of Illinois at Urbana-Champaign
tdonohue at illinois.edu | (217) 333-4648


------------------------------

Message: 2
Date: Tue, 26 Aug 2008 15:47:20 -0400
From: "Mark H. Wood" <mwood at IUPUI.Edu>
Subject: Re: [Dspace-general] Statistics
To: dspace-general at mit.edu
Message-ID: <20080826194720.GA20164 at IUPUI.Edu>
Content-Type: text/plain; charset="us-ascii"

On Tue, Aug 26, 2008 at 10:07:43AM -0500, Tim Donohue wrote:
> So, although I think it was already mentioned, I'd add as a requirement 
> for a good Statistics Package:
> 
> * Must filter out web-crawlers in a semi-automated fashion!

+1!  Suggestions as to how?

The Rochester mod.s could be augmented to filter out the easiest cases
more simply.  Some well-behaved crawlers can be spotted automatically.
(No, I don't recall how.)  The filter rules could be made more
flexible than just a single type of fixed-size netblocks (if memory
serves).  I've been meaning to work on these at some point, but
haven't yet reached That Point.
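For what it's worth, the more flexible filter rules Mark describes (mixed-size netblocks rather than a single fixed size) could be sketched with Python's ipaddress module. The excluded ranges below are illustrative assumptions, not a vetted crawler list:

```python
import ipaddress

# Hypothetical exclusion rules: netblocks of different sizes can be mixed
# freely, unlike a scheme hard-wired to one prefix length.
EXCLUDED_BLOCKS = [
    ipaddress.ip_network("66.249.64.0/19"),  # illustrative crawler range
    ipaddress.ip_network("192.0.2.0/24"),    # documentation/test block
]

def is_excluded(ip_string):
    """True if the address falls inside any excluded netblock."""
    addr = ipaddress.ip_address(ip_string)
    return any(addr in block for block in EXCLUDED_BLOCKS)
```

A shared service like this could load its block list from a site-local config file, so each repository tunes the rules without patching code.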

Crawler filtering sounds like something that might be abstracted from
the various existing stat. patches and provided as a common service.
We all should invent this wheel only once.

-- 
Mark H. Wood, Lead System Programmer   mwood at IUPUI.Edu
Typically when a software vendor says that a product is "intuitive" he
means the exact opposite.


------------------------------

Message: 3
Date: Tue, 26 Aug 2008 15:09:16 -0500
From: "Dorothea Salo" <dsalo at library.wisc.edu>
Subject: Re: [Dspace-general] Statistics
To: dspace-general at mit.edu
Message-ID:
	<356cf3980808261309j1a9964adif49b5ecefe5b98fe at mail.gmail.com>
Content-Type: text/plain; charset=UTF-8

2008/8/26 Mark H. Wood <mwood at iupui.edu>:
> On Tue, Aug 26, 2008 at 10:07:43AM -0500, Tim Donohue wrote:
>> So, although I think it was already mentioned, I'd add as a requirement
>> for a good Statistics Package:
>>
>> * Must filter out web-crawlers in a semi-automated fashion!
>
> +1!  Suggestions as to how?

The site <http://www.user-agents.org/> maintains a list of
user-agents, classified by type. They have an XML-downloadable version
at <http://www.user-agents.org/allagents.xml>, as well as an RSS-feed
updater. Perhaps polling this would be a useful starting point?
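A sketch of how such polling might start, in Python. The element names (<user-agent>, <String>, <Type>) are assumptions based on the published allagents.xml and should be checked against the live file, which would be fetched with, e.g., urllib; a tiny inline stand-in is used here:

```python
import xml.etree.ElementTree as ET

# Stand-in for http://www.user-agents.org/allagents.xml; element names
# are assumptions and should be verified against the real download.
SAMPLE = """<user-agents>
  <user-agent><String>Googlebot/2.1</String><Type>R</Type></user-agent>
  <user-agent><String>Mozilla/5.0</String><Type>B</Type></user-agent>
</user-agents>"""

def robot_strings(xml_text):
    """Collect user-agent strings whose type code marks them as robots."""
    root = ET.fromstring(xml_text)
    return [ua.findtext("String")
            for ua in root.iter("user-agent")
            if "R" in (ua.findtext("Type") or "")]
```

Re-fetching the file on a schedule and diffing the result against the stored list would give the "semi-automated" updates Tim asked for.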

Dorothea

-- 
Dorothea Salo dsalo at library.wisc.edu
Digital Repository Librarian AIM: mindsatuw
University of Wisconsin
Rm 218, Memorial Library
(608) 262-5493


------------------------------

Message: 4
Date: Tue, 26 Aug 2008 15:29:23 -0500
From: Tim Donohue <tdonohue at illinois.edu>
Subject: Re: [Dspace-general] Statistics
To: Dorothea Salo <dsalo at library.wisc.edu>
Cc: dspace-general at mit.edu
Message-ID: <48B467A3.7080100 at illinois.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed



Dorothea Salo wrote:
> 2008/8/26 Mark H. Wood <mwood at iupui.edu>:
>> On Tue, Aug 26, 2008 at 10:07:43AM -0500, Tim Donohue wrote:
>>> So, although I think it was already mentioned, I'd add as a requirement
>>> for a good Statistics Package:
>>>
>>> * Must filter out web-crawlers in a semi-automated fashion!
>> +1!  Suggestions as to how?
> 
> The site <http://www.user-agents.org/> maintains a list of
> user-agents, classified by type. They have an XML-downloadable version
> at <http://www.user-agents.org/allagents.xml>, as well as an RSS-feed
> updater. Perhaps polling this would be a useful starting point?
> 
> Dorothea
> 

Universidade do Minho's Statistics Add-on for DSpace can do some basic 
automated filtering of web crawlers:

See its list of main features on the DSpace Wiki:

http://wiki.dspace.org/index.php//StatisticsAddOn

(It looks like they detect spiders by how spiders tend to identify 
themselves.  Most "nice" spiders, like Google's, will identify themselves 
in a common fashion, e.g. "Googlebot".)

Frankly, although our statistics for IDEALS are nice-looking... Minho's 
work is much more extensive and offers a greater variety of features 
(from what I've seen/heard of it).  It's just missing our "Top 10 
Downloads" list :)

- Tim



-- 
Tim Donohue
Research Programmer, Illinois Digital Environment for
Access to Learning and Scholarship (IDEALS)
University of Illinois at Urbana-Champaign
tdonohue at illinois.edu | (217) 333-4648


------------------------------

Message: 5
Date: Tue, 26 Aug 2008 16:34:33 -0400
From: "Mark H. Wood" <mwood at IUPUI.Edu>
Subject: Re: [Dspace-general] Week 2: Statistics
To: dspace-general at mit.edu
Message-ID: <20080826203433.GB20164 at IUPUI.Edu>
Content-Type: text/plain; charset="us-ascii"

On Tue, Aug 26, 2008 at 09:44:45AM -0500, Dorothea Salo wrote:
> 2008/8/25 Mark H. Wood <mwood at iupui.edu>:
> > What might be helpful is to provide some views or stored procedures
> > that stat. tools could use to classify observations.  Such tools
> > usually have good facilities for poking around in databases, but could
> > perhaps use help in getting the information they need without having to
> > understand (and track changes to!) the fulness of DSpace's schema.
> 
> Interesting. Where would this leave the average repository manager who
> isn't using Stata, but just wants some numbers to show people?

Well, it depends on which numbers are wanted.  I do think there will
be some reports that are popular enough, and easy enough to get right,
that they should be built in.  The support for external tools would be
aimed at people who do want to use them.  What sort of data would be
useful to the manager who isn't into heavy statistical analysis, which
aren't likely to be provided as built-ins?

Where I'm going is:

o  The realm of reasonable possibilities for statistical analysis and
   presentation of DSpace activity is rather huge;

o  people who understand statistical processing have already figured
   out the hard parts of analysis and presentation;

o  the tail should not be allowed to wag the dog -- we want
   statistics, but that's subordinate to building excellent document
   repository software.  Part of the whole, and important, but in a supporting role.

So I am hoping that we can mostly satisfy most people with relatively
modest built-in statistical support, and take care of the other cases
with modest support for the development of external reporting
mechanisms.  This being a community, I imagine that some will develop
external solutions that they can share.

This is one reason why I think that it should be as easy as possible
for multiple stat. projects to tap into built-in streams of
observations.  Different sites have different needs, and I think we
need to be able to easily play with various ways of doing stat.s.  I'm
not convinced that we are going to understand the need sufficiently
without getting a selection of solutions into the field that can be
easily snapped in and tried by a sizable number of sites.  There are a
number of good attempts now, but it's not easy to install them, and
that limits the amount of experience we can gather.

-- 
Mark H. Wood, Lead System Programmer   mwood at IUPUI.Edu
Typically when a software vendor says that a product is "intuitive" he
means the exact opposite.


------------------------------

Message: 6
Date: Tue, 26 Aug 2008 18:13:14 -0500
From: "Dorothea Salo" <dsalo at library.wisc.edu>
Subject: Re: [Dspace-general] Week 2: Statistics
To: dspace-general at mit.edu
Message-ID:
	<356cf3980808261613n27ea9a5x917b98b833df37dc at mail.gmail.com>
Content-Type: text/plain; charset=UTF-8

2008/8/26 Mark H. Wood <mwood at iupui.edu>:

> Well, it depends on which numbers are wanted.  I do think there will
> be some reports that are popular enough, and easy enough to get right,
> that they should be built in.  The support for external tools would be
> aimed at people who do want to use them.  What sort of data would be
> useful to the manager who isn't into heavy statistical analysis, which
> aren't likely to be provided as built-ins?

Well, I hope that's where the discussion this week has been pointing.
If not, we'll have to find a different way to gather that information.
Looking at existing implementations of statistics (e.g. EPrints, SSRN)
might be a start.

> o  the tail should not be allowed to wag the dog -- we want
>   statistics, but that's subordinate to building excellent document
>   repository software.  Part of, important, but in a supporting role.

This is such an interesting statement that I think I will make it next
week's topic! What *is* excellent document repository software? I have
a feeling that the non-developer community may have a rather different
take on it from most developers... we'll see if I'm right.

> So I am hoping that we can mostly satisfy most people with relatively
> modest built-in statistical support, and take care of the other cases
> with modest support for the development of external reporting
> mechanisms.

I'd be interested to know how the proposals that have been put forward
this week place on a modesty scale. Developers?

> This is one reason why I think that it should be as easy as possible
> for multiple stat. projects to tap into built-in streams of
> observations.  Different sites have different needs, and I think we
> need to be able to easily play with various ways of doing stat.s.

Agreed, but just to toss this out: I foresee a countervailing pressure
in future toward standardized and aggregated statistics across
repositories. I have heard a number of statements to the effect that
faculty are using download counts from disciplinary repositories in
tenure-and-promotion packages. As their work becomes scattered and/or
duplicated across various repositories, they're going to want to
aggregate that information.

>  There are a
> number of good attempts now, but it's not easy to install them and
> that limits the amount of experience we can gather.

+1. This is a problem for more than just statistics!

Dorothea

-- 
Dorothea Salo dsalo at library.wisc.edu
Digital Repository Librarian AIM: mindsatuw
University of Wisconsin
Rm 218, Memorial Library
(608) 262-5493


------------------------------

Message: 7
Date: Wed, 27 Aug 2008 10:37:12 +0200
From: Christophe Dupriez <christophe.dupriez at destin.be>
Subject: Re: [Dspace-general] Week 2: Statistics
To: Dorothea Salo <dsalo at library.wisc.edu>
Cc: dspace <dspace-general at mit.edu>
Message-ID: <48B51238.4010008 at destin.be>
Content-Type: text/plain; charset="iso-8859-1"

Hi Dorothea and participants to this discussion!

I would like to say that statistics are there for different purposes:
1) detect errors (why did nobody look at my site last Sunday?)
2) provide KPIs (Key Performance Indicators): measures that a manager 
follows over the medium term to make organisational decisions
3) investigate new hypotheses before investing in organisational change.

For purpose (3), by its nature, you need to "open" to analysis the detailed 
logs of events and the data stored in DSpace.  Generic programs like 
SAS or report generators are the best way to dig into the data and answer 
new, unforeseen questions.  Everybody in the community will be happy to have 
this "back door" available.

For purpose (2), we need to know which KPIs IR managers need.  I 
will go further: new IRs and their managers would be very happy not to 
reinvent KPIs, and to have good ones already proposed to sustain a 
documented IR development process.  A very big part of DSpace's 
attractiveness is (and this should really be implemented!) that it provides 
"best practices" for IR management (and not only for computing).

For purpose (2), use cases, practices, and measures must be designed 
up front.  This will contribute strongly to the overall specifications of 
DSpace.

For purpose (1), a more formal, bottom-up, data-driven approach may be 
sufficient to install validation tools (like the checksum checker) to 
ensure that DSpace operations are "in line".

So we have no choice: we have to listen to IR managers (please come by!) to 
learn the good practices DSpace must support...

Have a nice day!

Christophe
(peeking on the list when I should not during my holidays!)


Dorothea Salo wrote:
> Greetings, DSpace community,
>
> I want to thank everyone once again for last week's stimulating
> discussion and impressive chat turnout! I have a new question for
> everyone this week, pursuant to some discussion on the lists:
>
> "Statistics" are one of the commonest requests for a new DSpace
> feature. Without further specification, however, it's hard to know
> what data to present, since there are no standards or even clear best
> practices in this area. What statistics do the following groups of
> DSpace users need to see, and in what form are the statistics best
> presented to them?
>
> Depositors
> End-users (defined as "people examining items and downloading
> bitstreams from a DSpace instance;" we may have to refine this further
> in discussion)
> DSpace repository managers (as distinct from systems administrators)
>
> What else should developers keep in mind as they implement this feature?
>
> Because it would be nice to reach a working consensus on this (unlike
> last week's question, which was intended to pull out as broad a
> selection of needs as possible), I think we should start discussing
> immediately. I encourage all respondents to respond TO THE MAILING
> LIST instead of to me.
>
> I will be holding another chat to discuss the weekly question. It will
> take place Wednesday 27 August in the DSpace IRC chatroom, #dspace on
> irc.freenode.net. I apologize to West Coast (USA) community members
> for last week's unconscionably early hour; we'll try 10 am US Central
> (11 am Eastern, 4 pm GMT) this week, and we may go even later next
> week if our European community members can stand it.
>
> For those who don't normally use IRC, there are two easy web gateways.
> One is mibbit.com; the other is specific to our channel and can be
> found at <http://dspace.testathon.net/cgi-bin/irc.cgi>. I encourage
> all of us to become familiar with the channel; it is a source of
> real-time technical information from DSpace developers, as well as a
> community in its own right.
>
> Dorothea
>
>   


------------------------------

_______________________________________________
Dspace-general mailing list
Dspace-general at mit.edu
http://mailman.mit.edu/mailman/listinfo/dspace-general


End of Dspace-general Digest, Vol 61, Issue 22
**********************************************




