Plan of Attack

I've begun putting in place a series of modules and core changes to DSpace to support the inclusion of external statistics and reporting systems. Much of this work comes from code donations from @MIRE and represents components of @MIRE products which we feel will improve the health of the DSpace ecosystem. More specifically, by exposing and donating these components, we seek to show "how" modularity needs to be modelled and approached, not only in the future DSpace 2.0 but, more immediately, now in 1.6. --Mark Diggory 04:25, 3 July 2009 (EDT)

The breakdown of the projects is as follows:

UsageEvent improvements:

Adjustment of the UsageEvent API to support exposing richer detail about the DSpaceObject in the implementations. Please see:

http://jira.dspace.org/jira/browse/DS-243

GEOIP, SOLR and DNS modules

Creation (and/or publishing into Maven) of support modules for features that may be common to more than one statistics/reporting implementation. Please see:

http://maven.dspace.org/release/org/dspace/dnsjava/dnsjava/2.0.6/

https://scm.dspace.org/svn/repo/modules/dspace-geoip (Self installing library for supporting GeoIP lookup)

https://scm.dspace.org/svn/repo/modules/dspace-solr (Self deploying SOLR webapplication pluggable into dspace distributions)

SOLR Based Statistics Logging and Reporting

https://scm.dspace.org/svn/repo/modules/dspace-solr-stats

(Soon to come: Self deploying Solr Client and Statistics Loggers for above Solr Instance)

https://scm.dspace.org/svn/repo/modules/dspace-solr-reporting

(Soon to come: Self deploying Solr Client and Statistics Report/Query API for above Solr Instance)

DSpace 1.6 ServiceManager Support

Finally, we are working on a slimmed-down version of the DSpace 2.0 ServiceManager that should allow dynamic registration of asynchronous services like StatisticsLogging and StatisticsReporting into DSpace without the need for explicit configuration via dspace.cfg/PluginManager.

https://scm.dspace.org/svn/repo/modules/dspace-services (Soon to come: Service API and Utility classes to support dynamic registration of services)
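To make the idea of dynamic registration concrete, here is a minimal sketch in plain Java. It is not the DSpace ServiceManager API, which is not yet published at the time of writing; the class name EventRegistry and its methods are hypothetical and only illustrate listeners being attached at runtime and notified asynchronously, without any configuration file.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical registry for illustration; NOT the DSpace ServiceManager API. */
final class EventRegistry<E> {
    // Thread-safe list so listeners can be added while events are being fired.
    private final List<Consumer<E>> listeners = new CopyOnWriteArrayList<>();
    // Single worker thread: events are handled off the caller's thread, in order.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    /** Registers a listener at runtime; no configuration entry is needed. */
    void register(Consumer<E> listener) {
        listeners.add(listener);
    }

    /** Queues the event for every registered listener and returns immediately. */
    void fire(E event) {
        for (Consumer<E> listener : listeners) {
            executor.submit(() -> listener.accept(event));
        }
    }

    /** Stops accepting events and waits for queued ones; true if all finished in time. */
    boolean shutdownAndWait(long timeoutMillis) {
        executor.shutdown();
        try {
            return executor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

In this sketch a statistics logger and a statistics reporter would each call register(...) when they start up, and the code that produces usage events would only ever call fire(...), so neither side needs to know about the other.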

Raw descriptions of ideas presented so far (please elaborate)

"Page Views" versus Downloads

COUNTER compliant usage statistics

leverage Google Analytics instead of collecting our own data

offline management reports using a nightly copy of observations

A Case for Google Analytics

I'll address the points already raised (and try to update if more are added), and provide my own opinion on this matter.

It's correct to say that if you simply follow the most basic integration guidelines provided, then Google Analytics will only track HTML page views (and then, only those that have the tracking code integrated!). However, the data collection API that is provided is capable of tracking more than simple page views. It's a relatively simple task to add a JavaScript event to the bitstream download links, so that when a user clicks on them, that download request is tracked.

That won't help the case of a direct link to the bitstream from outside the repository (although, as bitstreams don't have proper persistent identifiers / URLs, that possibly should be discouraged). If you need to track such downloads, then the most obvious means would be to detect referers that are outside of the repository, and deliver a 'your download will start automatically' page in place of the actual bitstream, which would then include the GA tracking code. Alternatively, it's just a service API, and it wouldn't be that hard to construct a call from the Java code directly (although if you bypass the ga.js JavaScript library, you may be exposed to the API changing).
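As a rough illustration of the referer check, the sketch below decides whether a request came from outside the repository. The class ReferrerCheck and the host name in the usage note are hypothetical; nothing here is existing DSpace code, and treating a missing or unparseable header as external is an assumption rather than a requirement.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper for illustration; not part of DSpace. */
final class ReferrerCheck {
    private ReferrerCheck() {}

    /**
     * Returns true when the Referer header is missing, malformed,
     * or names a host other than the repository's own host.
     */
    static boolean isExternal(String refererHeader, String repositoryHost) {
        if (refererHeader == null || refererHeader.trim().isEmpty()) {
            return true; // assumption: no referer means a direct or bookmarked request
        }
        try {
            String host = new URI(refererHeader.trim()).getHost();
            return host == null || !host.equalsIgnoreCase(repositoryHost);
        } catch (URISyntaxException e) {
            return true; // assumption: an unparseable header is treated as external
        }
    }
}
```

A bitstream servlet could call ReferrerCheck.isExternal(request.getHeader("Referer"), "repo.example.org") and, when it returns true, serve the interstitial page carrying the tracking code instead of the file itself. Browsers and proxies may strip the header, so this would over-count external requests rather than under-count them.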

Having dealt with statistics gathering in a variety of ways (direct database logging, offline log analysis) for some moderately high volume sites, I'm aware of how problematic it can be. The volume of data generated is huge, and storing, parsing and reporting on it all have scalability problems, whichever style of statistics gathering you use. Then you have to deal with determining and removing robot / invalid accesses, and with recognising when a user may have double-clicked on a link.
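To give a feel for the clean-up involved, here is a minimal sketch of a filter that drops likely robot requests and collapses repeated clicks. Everything in it is an assumption made for illustration: the class name DownloadFilter, the three user-agent substrings, keying on client IP plus bitstream ID, and counting a missing user agent as a robot. A real deployment would need a maintained robot list and some way to expire old entries.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical filter for illustration; markers and thresholds are not authoritative. */
final class DownloadFilter {
    // Illustrative substrings only; real robot detection needs a maintained list.
    private static final String[] ROBOT_MARKERS = {"bot", "crawler", "spider"};

    private final long windowMillis;
    // Last counted timestamp per (client, bitstream) pair. Grows without bound in this sketch.
    private final Map<String, Long> lastCounted = new HashMap<>();

    DownloadFilter(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Crude user-agent test; a missing user agent is treated as a robot (assumption). */
    static boolean looksLikeRobot(String userAgent) {
        if (userAgent == null) {
            return true;
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        for (String marker : ROBOT_MARKERS) {
            if (ua.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    /** Returns true if this request should be counted as a download. */
    boolean shouldCount(String clientIp, String bitstreamId, String userAgent, long timestampMillis) {
        if (looksLikeRobot(userAgent)) {
            return false;
        }
        String key = clientIp + "|" + bitstreamId;
        Long previous = lastCounted.get(key);
        if (previous != null && timestampMillis - previous < windowMillis) {
            return false; // repeat inside the window: likely a double-click
        }
        lastCounted.put(key, timestampMillis);
        return true;
    }
}
```

With a 10 second window, two requests for the same bitstream from the same address 3 seconds apart would be counted once. Even this toy version has to hold state per client and per file, which hints at the storage cost at real traffic volumes.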

And the flipside to having a(n external) service API for collecting statistics data is that it already provides for tracking events such as AJAX operations, Flash controls and redirects to other sites, and it can deal with local / content caches or content delivery networks (which could be an important point going forward with DuraCloud). These are all things for which you would need to add additional JavaScript calls and local service collection points if you gathered statistics locally.

Historically, there has been a problem with Google Analytics not being able to provide statistics integrated with the repository / to non-registered users, but now that there is an API for retrieving data from GA, that is an issue that can be solved.

My personal view is that we have enough to deal with in terms of delivering repository functionality, and making the repository itself scale, that we really don't need to be dealing with all the problems that come with statistics gathering / generation. Moreover, if you do keep the statistics internally, then the scalability of your repository will be compromised by the requirements to provide statistics.

If you look at the reasons why Google purchased Urchin and provides Analytics for free, and at the company's capability to provide scalable services, its size, stability and general ethos, then I would consider the risks of relying on them to be much lower than the cost of maintaining your own statistics. That said, I would still advocate a framework and collection of data points within the repository for supporting integrated reporting (i.e. downloads on the item page), where that data can be supplied from either Google Analytics or an internal data collection service.

--grahamtriggs 12:03, 29 May 2009 (BST)

The importance of "Hooks" for specific data collection points and use of a statistics logging backend should not be undervalued. We have an opportunity here to gather critical data that can be used to facilitate a richer browse, search and viewing experience. Throwing it off to GA and suggesting it's out of scope of "Repository Concerns" is a bit heavy-handed IMO. --Mark Diggory 14:06, 9 June 2009 (EDT)

Also consider that many IR administrators are probably expecting this addon to work with the local data they've already collected – people who have been requesting statistics have been sitting on years' worth of logs, but do not necessarily have any accurate historical data with GA. (I realise that the logs aren't in the UsageEvent format we want to use now, anyway, but converting them is not a big deal.) I think there would have to be more of these cases than IRs having years of GA data but no logs saved. --Kshepherd 22:02, 9 June 2009 (EDT)

Technical Requirements

Existing 3rd party statistics systems

Please add your statistics reports/tools here – if reports are not publicly accessible, screenshots or video demonstrations would still be great to see.