[Dspace-general] Week 2: Statistics

Mon Aug 25 10:07:43 EDT 2008

My answers:

> What statistics do the following groups of
> DSpace users need to see, and in what form are the statistics best
> presented to them?
>
> Depositors

At a minimum, I would like depositors to see the number of times an
item's splash page has been visited, and the number of times each
content bitstream (as distinct from e.g. thumbnails) has been
downloaded. I would also like aggregate statistics available for each
author in the system, though I recognize that this creates
authority-control and role-evaluation issues. (For example, if Dr.
Helen Troia is the author of articles in the repository, the editor of
a journal whose backfiles are in the repository, as well as a thesis
advisor for some theses in the thesis collection, the journal and the
theses should NOT count toward her downloads.)
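
As an illustration only, the role filtering described above might look
something like the following Python sketch. The data shapes (a list of
person/item/role triples and a per-item download table) and every name
here are hypothetical; none of this is an existing DSpace API, and it
assumes the authority-control problem of identifying "the same person"
has already been solved.

```python
from collections import defaultdict

def author_download_totals(contributions, downloads_by_item,
                           counted_roles=("author",)):
    """Sum downloads per person, counting only items where the person
    holds one of the counted roles (hypothetical data model).

    contributions: iterable of (person_id, item_id, role) triples
    downloads_by_item: dict mapping item_id -> download count
    """
    totals = defaultdict(int)
    for person_id, item_id, role in contributions:
        # Editor and advisor roles are skipped unless explicitly counted.
        if role in counted_roles:
            totals[person_id] += downloads_by_item.get(item_id, 0)
    return dict(totals)
```

With made-up numbers, a person who is author of an article (40
downloads), editor of a journal issue (500), and advisor on a thesis
(75) would be credited with 40 downloads, not 615.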

HTML items (and similar aggregates, once we can work with them; e.g.
Flash objects) cause trouble for bitstream analysis. To cut through
the jungle, I suggest that only the primary bitstream have its
accesses counted. If possible, it would be nice to count accesses for
all HTML bitstreams, but we can live without that if need be.

I don't believe these statistics need to be real-time; a daily or even
weekly cron-job would suffice. I do believe we need to take into
account when an item was ingested, recognizing that older items will
pile up the downloads over time. In addition to total-aggregates,
then, I would recommend "in the last week," "in the last month," and
"in the last year/since ingest" information. Per-calendar-year
information should be kept and displayed indefinitely, even if the
underlying data are eventually purged, because authors will use this
in tenure-and-promotion packages. A sense of delta would be nice as
well -- depositors would LOVE to know if suddenly an item's downloads
spike.
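
To make the reporting windows above concrete, here is a minimal Python
sketch of the kind of summary a nightly job could compute for one item.
It assumes the job has a plain list of download dates for the item; the
function names, the 7/30/365-day windows, and the "three times the
recent average" spike rule are all assumptions chosen for illustration,
not anything DSpace provides.

```python
from collections import Counter
from datetime import date, timedelta

def summarize_downloads(download_dates, ingest_date, today):
    """Summarize one item's downloads for display to a depositor."""
    def since(days):
        # Count downloads in the half-open window (today - days, today].
        cutoff = today - timedelta(days=days)
        return sum(1 for d in download_dates if cutoff < d <= today)

    return {
        "total": len(download_dates),
        "last_week": since(7),
        "last_month": since(30),
        "last_year": since(365),
        "days_since_ingest": (today - ingest_date).days,
        # Per-calendar-year totals are small enough to keep indefinitely,
        # even after the raw download records are purged.
        "by_calendar_year": dict(Counter(d.year for d in download_dates)),
    }

def is_spike(this_week, previous_weeks, factor=3.0):
    """Flag a week whose count is at least `factor` times the average
    of the earlier weeks (an arbitrary illustrative rule)."""
    if not previous_weeks:
        return False
    baseline = sum(previous_weeks) / len(previous_weeks)
    return baseline > 0 and this_week >= factor * baseline
```

Storing "days_since_ingest" next to the totals lets the display put an
old item's large total in context without hiding it.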

Other desiderata, less important: broad-brush geographic information
(country of origin? Google Maps mashup?) for accesses, per-collection
and per-community access counts (because it NEVER hurts to get a sense
of competition going), and search terms (in DSpace itself or from search
engines) that land people at a particular item.

> End-users (defined as "people examining items and downloading
> bitstreams from a DSpace instance;" we may have to refine this further
> in discussion)

I think end-users can usefully be shown the per-item and per-bitstream
information discussed above. They don't need to see per-author
information -- or at the very least, authors should be able to decide
whether to make this information public. (We do NOT want to embarrass
anyone; that's a serious turnoff for our potential depositors.)

> DSpace repository managers (as distinct from systems administrators)

I get survey after survey asking for activity information on the
repository. I can't answer them. To do so, I need download information
on the whole repository. (Current JSPUI statistics offer an
approximation to this, but I'm very leery of trusting it; I don't
understand how it's calculated, and the numbers seem incredibly off to
me.) I am sometimes asked about growth rate in accesses, so it would
be useful to break this down by year. Some way of normalizing it by
the amount of content in the repository ("downloads-per-item," where
"item" would have to be some kind of average of items-in-repository
over the period examined) would be useful as well.
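
One possible reading of "downloads-per-item," sketched in Python: take
the repository's item count at regular intervals during the period
(say, at the end of each month), average those counts, and divide the
period's downloads by that average. The sampling scheme and the
function name are assumptions made for this example; other averages
would be just as defensible.

```python
def downloads_per_item(downloads_in_period, item_counts):
    """Downloads divided by the average repository size over the period.

    item_counts: repository item totals sampled at regular intervals
    during the period (hypothetical input; e.g. one count per month).
    """
    if not item_counts:
        raise ValueError("need at least one item-count sample")
    average_items = sum(item_counts) / len(item_counts)
    if average_items == 0:
        return 0.0
    return downloads_in_period / average_items
```

For example, 1200 downloads over a quarter in which the repository held
100, 200, and 300 items at the three month-ends averages out to 200
items, or 6 downloads per item.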

(And yes, I absolutely loathe those surveys too, but when they come
from ARL, I don't have the luxury of ignoring them.)

Some "wow" numbers would be useful for marketing purposes. A lot of
what I've already described would do the trick there.

I would also like to be able to track deposits per
collection/community over time; this helps me know where to focus
marketing and collection-development efforts, as well as helping me
report progress to the appropriate administrators. (I run a
system-wide repository, so I have to track deposits by campus; each
campus has its own community.)

> What else should developers keep in mind as they implement this feature?

Search-engine crawlers. Excluding them provides a much more realistic
sense of interest. We need to make clear this is happening, though, or
we will be at a perceived disadvantage relative to repositories that
don't strip out these accesses.
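
A minimal Python sketch of the idea, assuming each logged access
carries a user-agent string: split the accesses into counted and
excluded, and report both numbers so the exclusion is visible. The
marker list is a small illustrative assumption, not a complete or
authoritative crawler list, and substring matching will both miss some
crawlers and occasionally catch a real browser.

```python
# Illustrative substrings only; a real deployment would need a
# maintained list.
CRAWLER_MARKERS = ("bot", "crawler", "spider", "slurp")

def split_crawler_hits(hits):
    """Separate access records into (human, crawler) lists by user agent.

    hits: iterable of dicts with an optional "user_agent" key
    (hypothetical record shape).
    """
    human, crawler = [], []
    for hit in hits:
        agent = hit.get("user_agent", "").lower()
        if any(marker in agent for marker in CRAWLER_MARKERS):
            crawler.append(hit)
        else:
            human.append(hit)
    return human, crawler

def access_summary(hits):
    """Report both figures so readers can see what was stripped out."""
    human, crawler = split_crawler_hits(hits)
    return {"counted": len(human), "excluded_as_crawlers": len(crawler)}
```

Publishing the excluded figure alongside the counted one is one way to
make clear that the lower number reflects filtering, not less interest.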

Dorothea

-- 
Dorothea Salo dsalo at library.wisc.edu
Digital Repository Librarian AIM: mindsatuw
University of Wisconsin
Rm 218, Memorial Library
(608) 262-5493