Date & Time

  • August 9th 15:00 UTC/GMT - 11:00 ET

This call is a Community Forum call: Sharing best practices and challenges in the use of existing DSpace features

Dial-in

We will use the international conference call dial-in. Please follow directions below.

  • U.S.A/Canada toll free: 866-740-1260, participant code: 2257295
  • International toll free: http://www.readytalk.com/intl 
    • Use the above link and input 2257295 and the country you are calling from to get your country's toll-free dial-in number
    • Once on the call, enter participant code 2257295

Agenda

Community Forum Call: DSpace Statistics

Sharing best practices, challenges, and questions

  • DSpace statistics
    • interpreting statistics
    • improving robot filtering & assessing robot traffic
    • exchanging which types of reports are being used for which purposes

 

Preparing for the call

Bring the questions and comments you would like to discuss to the call, or add them to the comments of this meeting page.

If you can join the call, or are willing to comment on the topics submitted via the meeting page, please add your name, institution, and repository URL to the Call Attendees section below.

Meeting notes

History of DSpace statistics

The first DSpace statistics, now often referred to as the DSpace legacy stats, were based on the DSpace logs. This system does not identify traffic originating from bots, let alone filter it out, so using these statistics is strongly discouraged. The lack of robot filtering biases the results and makes them uninterpretable.

The current DSpace usage statistics, introduced in DSpace version 1.6, are based on SOLR.

After that release, further improvements and alternatives to the standard DSpace statistics have been developed on the initiative of several universities, institutions, and third-party service providers.

An alternative to the DSpace statistics is Google Analytics. Although it is an interesting tool for some use cases, it has limitations. First, Analytics is a black box: it is unknown what robot filtering is used, so you have to assume it works properly. Second, Google Analytics doesn't know DSpace's internal structure. It isn't familiar with the hierarchy of repository, communities, collections and items. As a result, Analytics cannot create statistics on an aggregated level (e.g. the total item page views of all items in a collection).
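
The SOLR usage events, by contrast, record which collection an object belongs to, so this kind of roll-up can be expressed as a single query. The following is a minimal sketch rather than DSpace's own code: the base URL is a placeholder, and the field names (type, owningColl, isBot) and the value 2 for items are assumptions about the statistics core that should be checked against your own schema.

```python
from urllib.parse import urlencode

# Placeholder location of the statistics core; adjust to your installation.
SOLR_STATS_URL = "http://localhost:8080/solr/statistics/select"

# Assumed DSpace object type constant for items.
ITEM_TYPE = 2


def collection_item_views_url(collection_id, base_url=SOLR_STATS_URL):
    """Build a SOLR select URL counting item views rolled up to one collection."""
    params = [
        # All usage events on items owned by the given collection.
        ("q", f"type:{ITEM_TYPE} AND owningColl:{collection_id}"),
        # Leave out events that were flagged as robot traffic.
        ("fq", "-isBot:true"),
        # Only the total (numFound) is needed, not the individual events.
        ("rows", "0"),
        ("wt", "json"),
    ]
    return f"{base_url}?{urlencode(params)}"


def total_from_response(response):
    """Read the total number of matching events from a parsed SOLR JSON response."""
    return response["response"]["numFound"]
```

Fetching the URL and passing the parsed JSON to total_from_response would give the aggregated view count for the collection. The collection identifier is passed in unquoted, so it is assumed to be a plain internal ID.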

Another alternative is the third-party add-on Piwik. The DCAT was under the impression that this system might provide skewed statistics. As Piwik uses client-side JavaScript to collect statistics, only downloads made by clicking the DSpace download link are likely to be counted. Chances are high that downloads originating from outside DSpace, for example directly from Google, are not logged.

Future of DSpace statistics

In the new User Interface it would be beneficial to query SOLR through the centralized DSpace API instead of directly through SOLR's REST API. This would make it possible to replace SOLR with another system, should a better data source arise. In the meantime, people developing on the DSpace statistics layer could more easily contribute their work to the community, as that work would also be built upon this central DSpace API.
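
To illustrate the idea of a central statistics API, the sketch below hides the data source behind a small interface. All names in it (UsageStatsSource, InMemoryStatsSource, views_report) are hypothetical and do not exist in DSpace; the point is only that reporting code written against the interface would not need to change if SOLR were swapped for another backend.

```python
from typing import Dict, Iterable, List, Protocol


class UsageStatsSource(Protocol):
    """Hypothetical interface a central statistics API could expose."""

    def count_views(self, object_id: str) -> int:
        ...


class InMemoryStatsSource:
    """Stand-in backend; a SOLR-backed class would offer the same method."""

    def __init__(self, events: Iterable[dict]):
        self._events: List[dict] = list(events)

    def count_views(self, object_id: str) -> int:
        # Count events for this object, skipping those flagged as robots.
        return sum(
            1
            for event in self._events
            if event["id"] == object_id and not event.get("isBot", False)
        )


def views_report(source: UsageStatsSource, object_ids: Iterable[str]) -> Dict[str, int]:
    """Reporting code that depends only on the interface, not on the backend."""
    return {object_id: source.count_views(object_id) for object_id in object_ids}
```

For example, views_report(InMemoryStatsSource(events), ["item-1", "item-2"]) returns one count per identifier, and the same call would work with any other class that implements count_views.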

Performance

Some institutions noticed performance issues caused by the overhead created by SOLR. Harvard University has solved this issue by relying on web server logs. These logs are already being written and therefore do not add load on DSpace. Another solution, from a third-party service provider, was to use Elasticsearch instead of SOLR, which appeared to perform well.

There are some opportunities to reduce the overhead created by SOLR. For example, SOLR does not have to run on the same server as DSpace; it can run on a separate server. Another way of reducing the load is to shard the SOLR core (for example per year). One third-party service uses a SOLR caching mechanism to limit the load SOLR puts on DSpace, so that there should be no noticeable difference in performance.
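
As a rough illustration of per-year sharding, the sketch below builds the value of SOLR's shards request parameter, which lists the cores a distributed query should cover. The host, the core name "statistics" and the "statistics-YEAR" naming of the yearly cores are assumptions made for this example; check the core names actually present in your installation.

```python
def shard_core_name(year, prefix="statistics"):
    """Name of the core assumed to hold one year of usage events."""
    return f"{prefix}-{year}"


def shards_param(years, host="localhost:8080", current_core="statistics"):
    """Build a comma-separated shards value: current core first, then newest year first."""
    cores = [current_core]
    cores += [shard_core_name(year) for year in sorted(years, reverse=True)]
    # SOLR expects host:port/path entries without a URL scheme.
    return ",".join(f"{host}/solr/{core}" for core in cores)
```

Passing the result as the shards parameter of a select request asks SOLR to combine the results from all listed cores, so reports can still span several years after the data has been split.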

Housekeeping announcement

The name of DCAT itself, the 'DSpace Community Advisory Team', sounds rather formal, which may discourage people from joining the conversations. For that reason there will be meetings called 'Community Forum calls'. We hope this name indicates the call is open to the entire community.

Discussion topics for the next DCAT calls are already listed on the DCAT meeting notes page. Next month's topic of interest will be the DSpace standard Data model and DSpace-CRIS.

Call Attendees


31 Comments

  1. I would be interested in other institutions' experience with stats visualization for end users - both on item, collection/community, and repository levels. We would like to improve this aspect of our JSPUI instance.

    E.g. an example from the University of Ottawa http://www.ruor.uottawa.ca/handle/10393/31736/statistics

    1. The University of Sydney runs the Atmire contents and usage analysis module on JSPUI. The public stats on JSPUI still look the same, but XMLUI is also deployed for internal use of the module's administrator functionality.

      https://ses.library.usyd.edu.au/handle/2123/15386

      https://ses.library.usyd.edu.au/xmlui/handle/2123/15386 => show statistical information button at the bottom

  2. For DigitalGeorgetown, we use the following statistics reports

    • DSpace admin statistics reports
    • Google Analytics
    • Our own statistics reporting tools written in PHP.  These reports query the DSpace statistics solr repository

    We often need to gather cumulative statistics across a collection or community.  Neither the DSpace admin statistics reports nor Google Analytics roll up all item and bitstream access to the collection/community level.

    The DSpace admin statistics reports have a mechanism to exclude bot traffic.  We created some additional bot exclusion filters by running faceted searches in our statistics repository.  Google Analytics has its own mechanisms (which are probably superior).  Unfortunately, when we compare numbers across all 3 sources of statistics, the numbers are not always consistent.

    Some of our desired functionality has been captured as Statistics Use Cases on the DSpace wiki.

    Sample SOLR Queries

    These sample queries are written with PHP, but the SOLR query syntax is captured in a URL.

     

    1. Great input, thanks Terry. Would be very interested to learn why you choose to add these additional exclusion filters outside of DSpace, instead of just adding those additional patterns in the config file that DSpace uses itself at https://github.com/DSpace/DSpace/blob/master/dspace/config/spiders/agents/example

      1. When I created this code a few years ago, I could not tell if the spiders/bot process was obsolete, so I implemented the exclusions in my own code.

    2. Thanks v much for sharing this Terry. I've shared a link to your PHP tools and some of the other bits of today's Community Forum chat to my colleagues, cos I found it v interesting and helpful and I think they will too.

  3. Things I'd love to hear about if there is enough interest from others and time in the call.

    Communication and user sentiment: Applying corrections to exposed usage data 

    Unusual spikes in download/pageview traffic sometimes go undetected for weeks, if not months. Communicating to your stakeholders that you are actually reducing the numbers of downloads and pageviews by eliminating newly detected robot traffic is not trivial, and may upset people.

    At the recent Open Repositories conference, a presenter from Bepress made an interesting statement that they are cleaning/reducing stat counts all the time, and that the integrity/reliability of their figures is of much more importance to them than upsetting a person whose counts have suddenly gone down from one day to another. I think that's a bold and admirable angle, but might not be easily applied in all contexts where DSpace is used.

    Would be very interested to hear from people on the call who use tools to regularly clean up stats or eliminate bots in usage data that was already exposed: what kind of communication and disclaimers do you use?

    Updated list of open JIRA stats issues 

    It would be great if we could bundle an overview of open issues with stats, especially if some of these are of high priority to the call attendees. Here are a few of them that I could find:

    (JIRA issue link not rendered) - if you use sharding in JSPUI, your JSPUI stats will only show the last year

    (JIRA issue link not rendered) - the bundled IP lists of bots in DSpace are out of date; scripts that automatically update them are not used everywhere.

    (JIRA issue link not rendered) - JSPUI item display can fail when the usage event fails to register

     

     

    1. Here's another JIRA related to Google Analytics: DS-2899 - Google Analytics Statistics not relating parent comm/coll to bitstream download (status: Received). The issue of downloads showing as 0.

  4. Bram Luyten (Atmire), would it be possible to add info on the Minho add-on that you mentioned? Thank you

    1. Here's the link to the U-Minho statistics add-on:

      https://wiki.duraspace.org/display/DSPACE/StatisticsAddOn

      My colleagues Tiago Guimarães and Jose Carvalho (both back by the end of August) may be able to tell more about this, but the people actually developing it can be reached at repositorium@sdum.uminho.pt

       

      1. Link to (older, but fairly comprehensive) Minho Stats documentation, specifically a table list of stats available in that package. Good starting point for a list of possible stats that may be needed from a repository.

        http://researchrepository.ucd.ie/minhoDocs/docs/implemented.html

        Taken from https://wiki.duraspace.org/download/attachments/19006475/Docs.tar.gz

         

  5. DSpace SOLR Statistics with XMLUI is quite limited out of the box, and the under-the-hood code is unfriendly for adding additional features. We feel that we somewhat trust the data in SOLR, and we run one-time queries against SOLR. We have used Terry's PHP dashboard at times, but would somewhat prefer something that can be included with DSpace out-of-the-box. Or, perhaps, a standalone tool that is a bit more polished, can use either the REST API or the DB together with the SOLR stats, and could be displayed publicly. Perhaps eventually useful for the next UI for DSpace.

    We have added the Elastic Search Statistics user interface in the past, which added a number of interesting features, including under-the-hood technical improvements, but we didn't feel that this received much DSpace community adoption, and have not added much to this since the initial contribution.

    We have run into numerous issues using Google Analytics statistics, and have stopped using that.

     

    1. Also, the most common thing that we hear from users is that they want to know which collections get the most visits / downloads. What are the top downloaded objects in a collection/community? What are the overall top downloads? Where are users coming from? Information like visits to the collection page itself is kind of meaningless.

  6. We use SOLR stats for our DSpace repository.  Our developers have created additional aggregated stats for the entire repository, community, collection, or author.   The stats are displayed by date range or for all time.   Each date range shows the top 10 items in that date range. 

    https://dept.ku.edu/~kuswstat/

     

  7. Export and import of SOLR stats: (JIRA issue link not rendered)

  8. Google Tag Manager looks to be a good way to get download stats. We've been working on that, but here's a more complete summary from Cal Poly Pomona of what can be done: http://journal.code4lib.org/articles/10311

  9. Here's link to Piwik mentioned by Joseph Greene from University College Dublin: https://piwik.org/

  10. A subscription option that would generate per-item or per-collection stats and email them on a regular basis to the submitter/departmental contact - has that been implemented in any of the solutions mentioned and is it of interest?

    1. how are those generated?

      1. When our module is installed, we run processing jobs on the SOLR core to enrich the usage events in the SOLR core with metadata fields from the items.

        At that point, the indexed metadata fields become fields that you can also facet on in SOLR.

        Presentation from Ignace Deroost at OR about this last year:

        http://www.slideshare.net/bramluyten/metadata-based-statistics-for-dspace

    2. This is a commercial add-on, correct?

      1. That's correct, it's our commercial add-on module with customizations added.

        The public display of the standard add-on module can be viewed here:

        https://atmire.com/preview/

        We can also set up demos for the admin parts.

  11. Using Alt-Metrics to show how much an article (doi) has been picked up on the web or shared on social media: https://darchive.mblwhoilibrary.org/handle/1912/8229

    1. Can you share the code for adding Altmetrics?

       

    2. We display Altmetrics in our individual records and it's been very popular, especially with humanists and social scientists.  Here's an example:  http://hdl.handle.net/1808/10882

       

  12. Hello, here is a link to the public face of our Minho stats (version 4 for DSpace 1.8.2), downloads and content types:

    http://researchrepository.ucd.ie/jspui/stats?level=general&type=access&page=down-series&start=01-08-2016&end=16-08-2016&pyear=2016&pmonth=08&anoinicio=2016&anofim=2016&mesinicio=01&mesfim=08

    Click a few times on the different levels (UCD | School or Research Center | item :::: Downloads, rankings | content etc)

    What you can't see is all of the admin functions. See here for a complete list: http://researchrepository.ucd.ie/minhoDocs/docs/implemented.html

     

     

     

  13. Here are my slides from OR2016, which are experiments based on data taken for this paper:

     

    Greene, Joseph : Web robot detection in scholarly Open Access institutional repositories. Library Hi Tech, 34 (3) 2016-07.

    http://hdl.handle.net/10197/7682

     

    Overview of usage statistics in repositories; experimental comparison of EPrints, DSpace and Minho stats add-on for DSpace:

    http://www.slideshare.net/eservice/how-accurate-are-ir-usage-statistics

     

    Detailed look at experimental comparisons of EPrints, DSpace and Minho stats add-on for DSpace and different web robot detection techniques:

    http://www.slideshare.net/eservice/icanhazrobot-improved-robot-detection-for-ir-usage-statistics

     

    There is further information in the notes sections of these Powerpoints – download to view.