Current state as of DSpace 5

Available engines

 

  • original statistics
    • generated from usage events in dspace.log
    • Events captured: TODO
    • Fields captured: TODO

...

Available formats and tools

...

  • dspace.log stores both usage events and general log events, so it tends to take up a lot of disk space
  • stats-log-converter - tool to filter dspace.log and extract usage events into a statistics.log format
  • stats-log-importer - tool to import the statistics.log format into Solr statistics; useful for a one-time migration from original statistics to Solr statistics; the logs don't contain all the fields that Solr records
  • stats-log-importer-elasticsearch - analogous to stats-log-importer but imports to ElasticSearch statistics

Problems

Persistence

  • dspace.log files and Solr or ElasticSearch indexes are not suitable for persistent storage
  • extracting usage data from dspace.log files takes a long time because they contain more than just usage data
  • dspace.log files take up a lot of disk space
  • Solr and ElasticSearch indexes are not meant for reliable persistent storage; Solr even says so: http://wiki.apache.org/solr/HowToReindex#Using_Solr_as_a_Data_Source
  • historically we have treated Solr indexes as a cache that can be rebuilt from persistent data (search, oai indexes)
  • Solr data can be exported, e.g. as CSV, but multivalued fields are a problem: a trivial export/import round trip may not reproduce the data you had before
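
The multivalued-field problem can be illustrated with a minimal sketch. It does not call Solr; it only mimics a CSV export that joins the values of a multivalued field with a separator. The field name owningComm and the function names are used purely for illustration and are assumptions, not part of any existing tool.

```python
import csv
import io

def export_csv(records, fields, mv_sep=","):
    """Write records to CSV, joining list values with mv_sep (illustrative only)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    for rec in records:
        row = {}
        for f in fields:
            value = rec[f]
            row[f] = mv_sep.join(value) if isinstance(value, list) else value
        writer.writerow(row)
    return buf.getvalue()

def import_csv_naive(text):
    """Trivial import: every column comes back as a single string."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]

def import_csv_split(text, multivalued, mv_sep=","):
    """Import that is told which columns are multivalued and splits them again."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row = dict(row)
        for f in multivalued:
            row[f] = row[f].split(mv_sep) if row[f] else []
        rows.append(row)
    return rows

original = [{"id": "1", "owningComm": ["10", "12"]}]
exported = export_csv(original, ["id", "owningComm"])
naive = import_csv_naive(exported)                      # owningComm is now one string
restored = import_csv_split(exported, ["owningComm"])   # owningComm is a list again
```

The naive import silently turns the two values into one, which is why the importer has to know which fields are multivalued and how they were separated. Values that themselves contain the separator would additionally need escaping, which this sketch leaves out.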

Usage events mixed with errors in dspace.log

 

  • good for debugging (correlated events visible in one place)
  • bad for keeping around
  • you may want to keep access data forever, because we currently don't have persistent storage
  • you likely don't want to keep error, info and debug-level information forever
  • filtering is slow

...

Do we even want keeping statistics to be the responsibility of DSpace?

 

  • we already provide a dispatcher/consumer model for events, so it's possible to capture them
  • it's possible to write a consumer that will do any serialization, including persistent storage types
  • a consumer may be written to export usage events in a standardized format and/or protocol for feeding into specialized systems
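
As a rough sketch of the idea above, the following shows a dispatcher that hands each usage event to registered consumers, one of which appends the event as a CSV row to a stream. This is not the DSpace event API; the names UsageEvent, Dispatcher and CsvAppendConsumer and the chosen event fields are hypothetical and exist only to illustrate the pattern.

```python
import csv
import io
from dataclasses import dataclass, fields

@dataclass
class UsageEvent:
    # Hypothetical minimal set of fields; a real event would carry more.
    time: str
    action: str
    object_id: str
    ip: str

class Dispatcher:
    """Passes every fired event to all registered consumers."""
    def __init__(self):
        self._consumers = []

    def register(self, consumer):
        self._consumers.append(consumer)

    def fire(self, event):
        for consumer in self._consumers:
            consumer(event)

class CsvAppendConsumer:
    """Serializes each event as one CSV row appended to the given stream."""
    def __init__(self, stream):
        self._writer = csv.writer(stream)

    def __call__(self, event):
        self._writer.writerow([getattr(event, f.name) for f in fields(event)])

# Usage: an in-memory stream stands in for a file opened in append mode.
stream = io.StringIO()
dispatcher = Dispatcher()
dispatcher.register(CsvAppendConsumer(stream))
dispatcher.fire(UsageEvent("2015-01-01T00:00:00Z", "view", "123", "192.0.2.1"))
```

Replacing the stream with a file opened in append mode, or the consumer with one that speaks another format or protocol, would not require any change to the dispatcher.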

...

Possible solutions

Persistence

...

    • a file in append mode
    • RDBMS - these are continuous writes, some users might dislike that

Storage format

...

  • CSV - the data maps naturally to a tabular format. The Solr CSV format might be most convenient as this would ensure interoperability with data previously exported from Solr.
  • statistics.log - described above; again, there would be an interoperability advantage, since we already have importers for Solr and ES (though the format would need to be extended to include missing fields such as geo information and User Agent)
  • COUNTER - standardized XML-based format for usage statistics

...