[Dspace-general] Week 2: Statistics

Mark Diggory mdiggory at MIT.EDU
Wed Aug 27 20:08:01 EDT 2008


There is a degree of cleanup you would want to do to the incoming
data prior to storing it (a rough sketch of these steps follows the list):

1.) Host name and GeoIP resolution on the IP address
2.) Elimination of bots and automated download tools
3.) Elimination of duplicate requests (double-clicks)
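
For illustration, here is a minimal Java sketch of those three steps. The LogEntry record, the bot list, and the double-click window are all hypothetical; a real deployment would use a maintained robots list and a proper GeoIP database rather than the placeholders here.

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical pre-storage cleanup for incoming usage records.
 * The LogEntry fields and the bot list are illustrative only.
 */
public class UsageLogCleaner {

    // Very small illustrative list; a real deployment would use a maintained one.
    private static final String[] BOT_MARKERS = { "bot", "crawler", "spider", "wget", "curl" };

    // Remembers the last time an (ip, resource) pair was seen, for double-click filtering.
    private final Map<String, Long> lastSeen = new HashMap<String, Long>();
    private static final long DOUBLE_CLICK_WINDOW_MS = 10 * 1000L;

    /** Returns a cleaned entry, or null if the entry should be discarded. */
    public LogEntry clean(LogEntry entry) {
        // 2) Drop bots and automated download tools by user-agent.
        String agent = entry.userAgent == null ? "" : entry.userAgent.toLowerCase();
        for (String marker : BOT_MARKERS) {
            if (agent.contains(marker)) {
                return null;
            }
        }

        // 3) Drop duplicate requests (double-clicks) within a short window.
        String key = entry.ip + "|" + entry.resourceId;
        Long previous = lastSeen.get(key);
        if (previous != null && entry.timestamp - previous < DOUBLE_CLICK_WINDOW_MS) {
            return null;
        }
        lastSeen.put(key, entry.timestamp);

        // 1) Resolve the host name up front so reports don't need live DNS.
        try {
            entry.hostName = InetAddress.getByName(entry.ip).getCanonicalHostName();
        } catch (UnknownHostException e) {
            entry.hostName = entry.ip;
        }
        // A GeoIP lookup (against whatever local GeoIP database you prefer) would go here.

        return entry;
    }

    /** Minimal illustrative record of one request. */
    public static class LogEntry {
        public String ip;
        public String userAgent;
        public String resourceId;
        public long timestamp;
        public String hostName;
    }
}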

Likewise, it is very important to obfuscate or otherwise clean the IP data
that gets stored, to eliminate privacy concerns when governments come
knocking at your door. Two possible approaches are sketched below.
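
As a sketch, two common options are truncating the address to its network part or storing a salted one-way hash; identical IPs still map to the same token within one salt period, so session-level counts remain possible. The class below is illustrative only, and the salt handling is a placeholder.

import java.security.MessageDigest;

/**
 * Illustrative IP obfuscation: the stored value cannot be mapped back to a
 * person, but identical IPs hash to the same token within one salt period,
 * so session-level statistics remain possible. Salt handling is a placeholder.
 */
public class IpObfuscator {

    private final String salt;

    public IpObfuscator(String salt) {
        this.salt = salt;
    }

    /** Option 1: keep only the network part (coarse but simple). */
    public static String truncate(String ip) {
        int lastDot = ip.lastIndexOf('.');
        return lastDot > 0 ? ip.substring(0, lastDot) + ".0" : ip;
    }

    /** Option 2: salted one-way hash of the full address. */
    public String hash(String ip) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest((salt + ip).getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException("Could not obfuscate IP", e);
        }
    }
}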

I would recommend a separate database instance to store such data, with a
connection pool configured to favor writing over reading; a configuration
sketch follows.
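
Something along these lines could work, assuming Commons DBCP for the pool (class and property names are illustrative, as is the database name): a second pool tuned for bursts of small inserts, leaving the main DSpace pool tuned for reads and transactional work.

import org.apache.commons.dbcp2.BasicDataSource;

/**
 * Sketch of a second, write-oriented pool for the statistics database,
 * kept separate from the main DSpace pool. Commons DBCP 2 is assumed here
 * purely for illustration; the point is only that the statistics pool is
 * tuned for many short inserts.
 */
public class StatisticsDataSourceFactory {

    public static BasicDataSource createStatsPool() {
        BasicDataSource ds = new BasicDataSource();
        ds.setDriverClassName("org.postgresql.Driver");
        ds.setUrl("jdbc:postgresql://localhost:5432/dspace_stats"); // hypothetical stats DB
        ds.setUsername("dspace_stats");
        ds.setPassword("secret");

        // Favor writers: enough connections for bursts of event inserts,
        // and auto-commit on so each insert is cheap and independent.
        ds.setInitialSize(5);
        ds.setMaxTotal(20);
        ds.setMaxIdle(10);
        ds.setDefaultAutoCommit(Boolean.TRUE);
        return ds;
    }
}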

I would recommend evaluating Reporting Engines and Frameworks to  
arrive at an optimal database configuration independent of DSpace.

We've worked hard on an internal statistics/reporting solution for
DSpace at MIT that uses the DSpace DB, a separate storage database, and
processing across Apache logs. Eventually I'd like to see us move to
a usage-event-driven update process rather than our Apache log
trolling. It currently doesn't sport a UI; it generates spreadsheet
reports.

I think it's important to separate the DSpace database, the statistics
database, the reporting tools, and the user interface needs into
separate but related projects so that they may evolve as supported
add-ons to DSpace. Ultimately this means those of us working on the
DSpace core need to make sure hooks like "Usage Events" are in place
and available for add-ons to attach listeners to (a rough sketch of
such a hook follows the list below). What does this mean for the group:

1.) Endorsing and shepherding those changes to the core code base
into a near-future release.
2.) Evaluating the need for a common notification framework for
both usage and modification events.
3.) Establishing an inclusive roadmap that allows new projects and
team members to participate in the development/release process.
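
To make the "Usage Events" idea concrete, here is a rough, hypothetical sketch of such a hook. These are not actual DSpace interfaces, just an indication of the shape: the core fires a small, neutral event and add-ons register listeners, so statistics code never has to live inside the core.

/**
 * Rough, hypothetical sketch of a "Usage Event" hook: not the actual
 * DSpace API, only an indication of the intended shape.
 */
public interface UsageEventListener {
    void onUsageEvent(UsageEvent event);
}

/** Minimal illustrative event carrying only what a statistics add-on needs. */
class UsageEvent {
    public final String objectId;   // item, bitstream, collection, ...
    public final String objectType;
    public final String obfuscatedIp;
    public final long timestamp;

    public UsageEvent(String objectId, String objectType, String obfuscatedIp, long timestamp) {
        this.objectId = objectId;
        this.objectType = objectType;
        this.obfuscatedIp = obfuscatedIp;
        this.timestamp = timestamp;
    }
}

/** The core would hold a registry like this and fire events from the UIs. */
class UsageEventDispatcher {
    private final java.util.List<UsageEventListener> listeners =
            new java.util.ArrayList<UsageEventListener>();

    public void register(UsageEventListener listener) {
        listeners.add(listener);
    }

    public void fire(UsageEvent event) {
        for (UsageEventListener listener : listeners) {
            listener.onUsageEvent(event);
        }
    }
}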

---

I also have the following concerns with the Minho statistics add-on.

A.) The use of procedural PostgreSQL excludes Oracle users, restricts
portability, and introduces a layer of complexity that requires
maintainers to debug in a layer that is not traditionally customized
in DSpace. I feel this logic needs to live in the Java implementation
rather than in a storage-specific language and execution environment.
This was a major factor in our not using the Minho solution at MIT.

B.) Overlays may be used to deploy on top of the JSPUI/XMLUI, but we
should work toward better pluggability of this functionality.
Specifically, we've seen that the JSPUI's use of JSP tag libraries
isn't ideal or well designed in DSpace. The use of tag libraries
should ideally be replaced with beans/collections and JSTL iterator
tags. The JSPUI should be looking at templates and portlets for
solutions that allow pluggability, rather than direct customization of
JSPs, tag libraries, and servlets by the community.

Tangent: This is why the XMLUI was created, to get away from this bad  
design.

C.) I commend the use of a separate SQL namespace, but would suggest
going further: a completely separate database would allow
write-optimized connections independent of the DSpace DB, whose
connections are better tuned for reading and transactional safety.

D.) The use of the Log4j JDBC appender, while creative, introduces
another layer of complexity that isn't explicit. A pluggable
UsageEvent API may better manage the generation of events in the UI
and direct them to the statistics add-on. This is a lesser concern
because the add-on could be adapted to work as a UsageEvent consumer,
rather than consuming logging events destined for dspace.log directly;
a sketch of such a consumer follows.
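
Continuing the hypothetical sketch from above, a statistics consumer might look roughly like this, writing events straight into its own database instead of parsing log lines. The listener interface, event fields, and table name are the illustrative ones from the earlier sketch, not real DSpace or Minho code.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import javax.sql.DataSource;

/**
 * Sketch of point D: instead of a Log4j JDBC appender parsing log lines,
 * the statistics add-on registers as a UsageEvent consumer and writes
 * events into its own database. Interface, event fields and table name
 * are the hypothetical ones from the earlier sketch.
 */
public class StatisticsEventConsumer implements UsageEventListener {

    private final DataSource statsDataSource; // e.g. the write-tuned pool above

    public StatisticsEventConsumer(DataSource statsDataSource) {
        this.statsDataSource = statsDataSource;
    }

    public void onUsageEvent(UsageEvent event) {
        String sql = "INSERT INTO usage_event (object_id, object_type, client, occurred) "
                   + "VALUES (?, ?, ?, ?)";
        try (Connection conn = statsDataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, event.objectId);
            ps.setString(2, event.objectType);
            ps.setString(3, event.obfuscatedIp);
            ps.setTimestamp(4, new Timestamp(event.timestamp));
            ps.executeUpdate();
        } catch (SQLException e) {
            // Statistics must never break the user-facing request.
            e.printStackTrace();
        }
    }
}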

These comments are meant to be constructive. Speaking for the
community, I don't want to see this work fall by the wayside, and I
want to work to eliminate barriers to its adoption by the community.

I strongly encourage those working on projects within the community
(such as the Minho statistics add-on) to take advantage of the tools
and services we are maintaining, so that you can work in an open
environment where you can seek support and advice directly from the
community of DSpace developers.

We are working on a new Contribution wiki page section to outline
these services and the policies and procedures for working with them.

http://wiki.dspace.org/index.php/DSpaceResources#DSpace_Community_Sandbox

-Mark

On Aug 27, 2008, at 10:57 AM, Randy Stern wrote:

> One useful distinction is to separate to some degree the statistics  
> that we
> may want to calculate from the events/raw data that needs to be  
> recorded by
> the DSpace system as it operates. As long as the events are  
> recorded in the
> database (preferably *not* logged in files), various computations,
> aggregations, reports, and APIs for exposing that data can be  
> generated
> later. So we may want to focus initially on what data to record and  
> plan
> for a statistics data model, database tables, and recording to be  
> built
> into DSpace 2.0.
>
> At 09:46 AM 8/27/2008 -0400, Mark H. Wood wrote:
>> On Tue, Aug 26, 2008 at 06:13:14PM -0500, Dorothea Salo wrote:
>>> 2008/8/26 Mark H. Wood <mwood at iupui.edu>:
>> [snip]
>>> This is such an interesting statement that I think I will make it  
>>> next
>>> week's topic! What *is* excellent document repository software? I  
>>> have
>>> a feeling that the non-developer community may have a rather  
>>> different
>>> take on it from most developers... we'll see if I'm right.
>>
>> I think you are, and I look forward to that discussion!
>>
>>>> This is one reason why I think that it should be as easy as  
>>>> possible
>>>> for multiple stat. projects to tap into built-in streams of
>>>> observations.  Different sites have different needs, and I think we
>>>> need to be able to easily play with various ways of doing stat.s.
>>>
>>> Agreed, but just to toss this out: I foresee a countervailing  
>>> pressure
>>> in future toward standardized and aggregated statistics across
>>> repositories. I have heard a number of statements to the effect that
>>> faculty are using download counts from disciplinary repositories in
>>> tenure-and-promotion packages. As their work becomes scattered  
>>> and/or
>>> duplicated across various repositories, they're going to want to
>>> aggregate that information.
>>
>> Quite so.  I just don't feel that we've yet got to the point at which
>> we understand how to do that well.  A lot of good solutions come  
>> about
>> in this way: an abstract and somewhat indistinct common need is
>> recognized; a number of people all go off in different directions and
>> try things; solutions are compared, borrow from each other, coalesce;
>> finally a now well-understood need finds itself fulfilled with one or
>> two mature implementations.  I feel that we're still deep in the "try
>> things" phase.
>>
>> The degree to which statistics are desired and used suggests that, in
>> addition to traditional reports, we should be thinking in terms of
>> exposing statistical products in machine-readable form.  I have been
>> thinking for some time that we might, with reasonable effort, help to
>> work out a lingua franca for exchanging usage statistics among
>> repositories of various "brands" so that the utility of various  
>> ideas,
>> and the behavior of repository users, might be studied more
>> effectively.  But again, what we can all agree on will very likely be
>> a small subset of what we can individually envision.
>>
>> This really ought to be considered early-on, because if we can  
>> come up
>> with a common theme in the abstract, then machine- and human-readable
>> reporting become side-by-side layers on top of the pool of  
>> statistical
>> data products, and both will be easier to think about if they are
>> merely formatting something already produced.  Likewise the  
>> production
>> of those stat.s will be easier to think about if presentation issues
>> can be separated from the task.
>>
>> I do *not* mean to say here that the statistics that people want now
>> should have to wait indefinitely on some Grand Scheme to do it all.
>> It would be better to organize the development in successive
>> approximations if it looks like taking too long to do it all in one
>> push.  It's probably going to take several years to fully realize
>> satisfactory monitoring and reporting of DSpace usage, but that
>> doesn't mean that we can't provide better and better approximations
>> much sooner.
>>
>> --
>> Mark H. Wood, Lead System Programmer   mwood at IUPUI.Edu
>> Typically when a software vendor says that a product is  
>> "intuitive" he
>> means the exact opposite.
>>
>>
>> _______________________________________________
>> Dspace-general mailing list
>> Dspace-general at mit.edu
>> http://mailman.mit.edu/mailman/listinfo/dspace-general
>
>
> Randy Stern
> Manager of Systems Development
> Harvard University Library Office for Information Systems
> 90 Mount Auburn Street
> Cambridge, MA 02138
> Tel. +1 (617) 495-3724
> Email <randy_stern at harvard.edu>
>
>
> _______________________________________________
> Dspace-general mailing list
> Dspace-general at mit.edu
> http://mailman.mit.edu/mailman/listinfo/dspace-general



