Current state as of DSpace 5
Available engines
- original statistics
- generated from usage events in dspace.log
- Events captured: TODO
- Fields captured: TODO
...
Available formats and tools
...
- dspace.log stores both usage events and general log events, so it tends to take up a lot of disk space
- stats-log-converter - tool to filter dspace.log and extract usage events into a statistics.log format
- stats-log-importer - tool to import statistics.log format into Solr statistics; useful for one-time migration from original statistics to Solr statistics; note that the logs don't contain all the fields that Solr records
- stats-log-importer-elasticsearch - analogous to stats-log-importer but imports to ElasticSearch statistics
Problems
Persistence
- dspace.log files and Solr or ElasticSearch indexes are not suitable for persistent storage
- extracting usage events from dspace.log files takes a long time because the files contain more than just usage data
- dspace.log files take up a lot of disk space
- Solr and ElasticSearch indexes are not meant for reliable persistent storage; Solr even says so: http://wiki.apache.org/solr/HowToReindex#Using_Solr_as_a_Data_Source
- historically we have treated Solr indexes as a cache that can be rebuilt from persistent data (search, oai indexes)
- Solr data can be exported, e.g. in CSV; there's a problem with multivalued fields and a trivial export/import may not yield the same result you had before
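The multivalued-field problem can be illustrated with a minimal sketch. It uses only Python's standard csv module, not Solr itself; the record, the field names and the "," separator are assumptions chosen for illustration. The point is only that a flattened multivalued field does not come back as a list unless the importer is told to split it.

```python
import csv
import io

# Hypothetical usage record with one multivalued field (illustrative names).
record = {"id": "item-1", "owningComm": ["comm-1", "comm-2"]}

def export_row(rec, mv_separator=","):
    """Flatten a record to one CSV line, joining the multivalued field."""
    buf = io.StringIO()
    csv.writer(buf).writerow([rec["id"], mv_separator.join(rec["owningComm"])])
    return buf.getvalue()

def naive_import(line):
    """Read the line back without splitting the multivalued field."""
    row = next(csv.reader(io.StringIO(line)))
    return {"id": row[0], "owningComm": [row[1]]}

def import_with_split(line, mv_separator=","):
    """Read the line back and split the multivalued field explicitly."""
    row = next(csv.reader(io.StringIO(line)))
    return {"id": row[0], "owningComm": row[1].split(mv_separator)}

line = export_row(record)
print(naive_import(line) == record)       # the two values were merged into one
print(import_with_split(line) == record)  # splitting restores the list
```

Even the splitting variant breaks if a value itself contains the separator, which is one reason a trivial round trip cannot be relied on.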
Usage events mixed with errors in dspace.log
...
- good for debugging (correlated events visible in one place)
- bad for keeping around
- you may want to keep access data forever, because we currently don't have any other persistent storage for it
- you likely don't want to keep error, info and debug-level information forever
- filtering is slow
...
Do we even want DSpace to be responsible for keeping statistics?
...
- we already provide a dispatcher/consumer model for events, so it's possible to capture them
- it's possible to write a consumer that will do any serialization, including persistent storage types
- a consumer may be written to export usage events in a standardized format and/or protocol for feeding into specialized systems
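The dispatcher/consumer idea can be sketched in a language-neutral way as follows. This is not the DSpace event API: UsageEvent, Dispatcher, JsonLinesConsumer and all field names are hypothetical, and JSON Lines is just one possible serialization. It only shows that once events are dispatched, the storage format is the consumer's concern.

```python
import io
import json
from dataclasses import dataclass, asdict

@dataclass
class UsageEvent:
    # Hypothetical fields; not the actual DSpace event model.
    action: str
    object_id: str
    timestamp: str

class Dispatcher:
    """Forwards every fired event to all registered consumers."""
    def __init__(self):
        self._consumers = []

    def add_consumer(self, consumer):
        self._consumers.append(consumer)

    def fire(self, event):
        for consumer in self._consumers:
            consumer(event)

class JsonLinesConsumer:
    """Serializes each event as one JSON object per line."""
    def __init__(self, sink):
        self._sink = sink  # any object with a write() method, e.g. a file

    def __call__(self, event):
        self._sink.write(json.dumps(asdict(event), sort_keys=True) + "\n")

# Usage: an in-memory sink stands in for a file or a network client.
sink = io.StringIO()
dispatcher = Dispatcher()
dispatcher.add_consumer(JsonLinesConsumer(sink))
dispatcher.fire(UsageEvent("view", "item-1", "2015-01-01T00:00:00Z"))
print(sink.getvalue(), end="")
```

A second consumer feeding an external statistics system could be registered alongside the first without touching the code that fires the events.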
...
Keeping certain data forever may be against certain laws
- particularly in the EU, with regard to storing IP addresses indefinitely; a solution would be to store only aggregated or anonymized data indefinitely
- http://security.stackexchange.com/questions/52517/data-protection-laws-and-regulation-for-storing-ip-addresses-for-registered-user
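One possible form of anonymization is truncating the address to its network prefix before storing it. The sketch below uses Python's standard ipaddress module; the function name and the prefix lengths (/24 for IPv4, /48 for IPv6) are assumptions, and whether truncation is sufficient under any particular law is not something this example can answer.

```python
import ipaddress

def anonymize_ip(address, v4_prefix=24, v6_prefix=48):
    """Zero the host part of an IP address (prefix lengths are assumptions)."""
    ip = ipaddress.ip_address(address)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False allows host bits to be set in the input address.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)

print(anonymize_ip("192.0.2.123"))            # 192.0.2.0
print(anonymize_ip("2001:db8:1234:5678::1"))  # 2001:db8:1234::
```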
Possible solutions
Persistence
...
- a file in append mode
- RDBMS - these are continuous writes, some users might dislike that
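The "file in append mode" option could look roughly like the sketch below. The column names, the file name and the helper functions are hypothetical; a real implementation would also need to consider locking, rotation and flushing, which are left out here.

```python
import csv
import os
import tempfile

# Hypothetical column set; the real field list is still to be decided.
FIELDS = ["time", "action", "object_id"]

def append_event(path, event):
    """Append one usage event to a CSV file, writing the header only once."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(event)

def read_events(path):
    """Read all stored events back as a list of dictionaries."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Usage: two separate appends end up in the same file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "usage-events.csv")
    append_event(path, {"time": "2015-01-01T00:00:00Z", "action": "view", "object_id": "item-1"})
    append_event(path, {"time": "2015-01-01T00:00:05Z", "action": "view", "object_id": "item-2"})
    rows = read_events(path)

print(len(rows))
```

Because existing lines are never rewritten, such a file is easy to back up incrementally.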
Storage format
...
- CSV - the data maps naturally to a tabular format. The Solr CSV format might be most convenient as this would ensure interoperability with data previously exported from Solr.
- statistics.log - described above; again, there would be an interoperability advantage - we already have importers for Solr and ES (though they would need to be extended to include missing fields like geo information and User Agent)
- COUNTER - standardized XML-based format for usage statistics
...