Current state as of DSpace 5

Available engines

  • original statistics
    • generated from usage events in dspace.log
    • Events captured: TODO
    • Fields captured: TODO
  • Solr statistics (since DSpace 1.6)
    • uses Solr for storage
    • basic presentation available in the UI; additional commercial modules available; easy to query via HTTP
    • restricted to localhost by default
    • Events captured: TODO
    • Fields captured: type, id, ip, time, epersonid, continent, country, countryCode, city, longitude, latitude, owningComm, owningColl, owningItem, dns, userAgent, isBot, referrer, uid, statistics_type
  • ElasticSearch statistics (since DSpace 3)
    • uses ElasticSearch for storage
    • the goal was to improve performance compared to Solr, where continuous writing of new events had a negative impact on concurrent reads
    • implements its own UI for presenting the data; easy to query via HTTP
    • currently doesn't work for bitstream download events
    • exposes unsecured read/write access to ElasticSearch on port 9200 by default
    • Events captured: Item, Bitstream, Collection, Community view
    • Fields captured: IP, time, DNS/hostname, User Agent, isBot flag, geo information (Continent, Country, Country Code, City, Latitude/Longitude)
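Since the Solr statistics core can be queried over HTTP, a request can be assembled from the fields listed above. The following is a minimal sketch, not a DSpace API: the base URL (host, port and core path) and the meaning of type=2 as "Item" are assumptions to verify against the local installation; `q`, `fq`, `rows`, `facet`, `facet.field` and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Assumed location of the Solr "statistics" core; adjust to the local setup.
SOLR_STATS_URL = "http://localhost:8080/solr/statistics/select"


def build_views_by_country_url(base_url=SOLR_STATS_URL, dso_type=2):
    """Build a Solr select URL that counts non-bot events per countryCode.

    dso_type=2 is assumed here to denote Item events.
    """
    params = [
        ("q", f"type:{dso_type}"),        # restrict to one object type
        ("fq", "isBot:false"),            # leave out events flagged as bots
        ("rows", "0"),                    # only facet counts, no documents
        ("facet", "true"),
        ("facet.field", "countryCode"),   # one count per country code
        ("wt", "json"),
    ]
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_views_by_country_url())
```

Because Solr is restricted to localhost by default, such a request would normally be issued from the DSpace server itself.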

Available formats and tools

  • dspace.log - stores both usage events and general log events; tends to take up a lot of disk space
  • stats-log-converter - tool to filter dspace.log and extract usage events into a statistics.log format
  • stats-log-importer - tool to import the statistics.log format into Solr statistics; useful for a one-time migration from original statistics to Solr statistics; the logs don't contain all the fields that Solr records
  • stats-log-importer-elasticsearch - analogous to stats-log-importer but imports to ElasticSearch statistics

Problems

Persistence

  • neither dspace.log files nor Solr or ElasticSearch indexes are suitable for persistent storage
  • extracting from dspace.log files takes a long time because they don't contain only usage data
  • dspace.log files take up a lot of disk space
  • Solr and ElasticSearch indexes are not meant for reliable persistent storage; Solr even says so: http://wiki.apache.org/solr/HowToReindex#Using_Solr_as_a_Data_Source
  • historically we have treated Solr indexes as a cache that can be rebuilt from persistent data (search, oai indexes)
  • Solr data can be exported, e.g. as CSV, but multivalued fields are a problem: a trivial export/import may not reproduce the data you had before
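The multivalued-field problem can be illustrated without Solr. The sketch below is an illustration only: it flattens list values into one CSV cell with a separator and naively splits them again on import. The `note` field and all values are hypothetical; `owningComm` is taken from the field list above.

```python
import csv
import io


def export_row(record, mv_separator=","):
    """Serialize one record to a CSV line, joining list values into one cell."""
    buf = io.StringIO()
    cells = [mv_separator.join(v) if isinstance(v, list) else v
             for v in record.values()]
    csv.writer(buf).writerow(cells)
    return buf.getvalue()


def import_row(line, fields, multivalued, mv_separator=","):
    """Naively parse the line back, splitting the multivalued cells."""
    cells = next(csv.reader(io.StringIO(line)))
    return {name: cell.split(mv_separator) if name in multivalued else cell
            for name, cell in zip(fields, cells)}


# Hypothetical record: "note" holds a value that contains the separator.
original = {"id": "42", "owningComm": ["1", "7"], "note": ["a,b", "c"]}
line = export_row(original)
restored = import_row(line, list(original), {"owningComm", "note"})

print(restored["owningComm"] == original["owningComm"])  # simple values survive
print(restored["note"] == original["note"])              # embedded separator does not
```

A faithful round trip therefore needs an escaping convention for the separator that both the export and the import agree on.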

Usage events mixed with errors in dspace.log

  • good for debugging (correlated events visible in one place)
  • bad for keeping around
  • you may want to keep access data forever, because there is currently no other persistent storage for it
  • you likely don't want to keep error, info and debug-level information forever
  • filtering is slow

Displaying statistics

  • DSpace doesn't provide extensive display and visualization options out-of-the-box
  • this may be what we want; let others build them

Do we even want DSpace to be responsible for keeping statistics?

  • we already provide a dispatcher/consumer model for events, so it's possible to capture them
  • it's possible to write a consumer that serializes events in any way, including to persistent storage
  • a consumer may be written to export usage events in a standardized format and/or protocol for feeding into specialized systems

Keeping certain data forever may be against certain laws

Possible solutions

Persistence

  • Solr CSV export - this is akin to backup using database dumps: there will always be a period between the last export and now that is not backed up, therefore this should be considered an interim solution
  • event consumer - implement an event consumer that writes to persistent storage, e.g.
    • a file in append mode
    • RDBMS - this means continuous writes to the database, which some users might dislike
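The "file in append mode" option can be sketched as follows. This is a language-neutral illustration written in Python, not a DSpace event consumer: the function name, the file layout and the chosen subset of fields are assumptions, and a real consumer would be plugged into the DSpace event dispatcher.

```python
import csv
import os

# Assumed subset of the captured fields; extend as needed.
FIELDS = ["time", "type", "id", "ip", "userAgent"]


def append_usage_event(path, event, fields=FIELDS):
    """Append one usage event as a CSV row.

    A header row is written only when the file is new or empty, so the
    file stays a valid CSV across restarts. Unknown keys in the event are
    ignored and missing ones are written as empty cells.
    """
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        if is_new:
            writer.writeheader()
        writer.writerow(event)
```

Each event is written as it happens, so nothing is lost between exports, at the cost of one small write per event.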

Storage format

  • CSV - the data maps naturally to a tabular format. The Solr CSV format might be most convenient as this would ensure interoperability with data previously exported from Solr.
  • statistics.log - described above; again, there would be an interoperability advantage: we already have importers for Solr and ES (though they would need to be extended to include missing fields like geo information and User Agent)
  • COUNTER - standardized XML-based format for usage statistics

Related tools

Google Analytics

  • easy to configure, easy to use service, free of charge tier available
  • detail is not unlimited - doesn't let you see individual IP addresses
  • possible problems - third-party service without guarantees (SLA available as a paid option), limit on number of processed events per day, possible limit on how many years back they store data

Piwik

  • collects events using a JavaScript snippet (like Google Analytics)
  • uses RDBMS for storage
  • PHP-based interface, visualizations available

Logstash

  • solution to store and analyze log files (in general, not just for usage data)
  • no development needed on the DSpace side to start using Logstash - it works with any log file
  • good for correlating data from various sources (e.g. error log with an access log; or logs from distinct systems)

Related projects

IRUS (predecessors - PIRUS, PIRUS2) - a JISC project (United Kingdom)

SCEUR (Portugal)