There are a number of issues with the Statistics Sharding process in DSpace.
Data Corruption Issue (DSpace 6)
- In DSpace 6x, Tomcat does not restart properly after a statistics shard has been created.
- This behavior has been verified at 3 instances (Georgetown, UCLA and ?). Tom Desair has volunteered to try to reproduce this error.
- In DSpace 5x and 6x, the owningComm field is corrupted by the sharding process.
- Tom Desair has created a fix (https://github.com/DSpace/DSpace/pull/1613). Testing this fix uncovered many of the other issues listed here.
While attempting to resolve this issue, a number of long-standing challenges with the sharding process have become evident.
Shard Testing Issues
- The shard process requires statistics records from a prior calendar year to be present.
- Proposal: Ensure that the statistics import/export tools allow for the creation of records from a prior year.
- See "Statistics Import/Export Tool Issues"
- Once the shard process has been run for records from a calendar year, the process cannot be re-run.
- Proposal: Allow the sharding process to append records into an existing shard (rather than failing)
- DSpace 5x PR: https://github.com/DSpace/DSpace/pull/1625
- A DSpace 6x PR cannot be tested until DS-3457 is resolved
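The two constraints above (nothing to shard unless prior-year records exist, and a re-run should append rather than fail) can be sketched with a small in-memory model. This is only an illustration: ShardSketch, its methods, and the use of UTC to determine the calendar year are assumptions of the sketch, not DSpace or Solr APIs.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of yearly statistics shards (not DSpace code).
public class ShardSketch {
    // Stand-in for the per-year shards: calendar year -> record timestamps.
    private final Map<Integer, List<Instant>> shards = new TreeMap<>();

    // Calendar year of a record, evaluated in UTC (an assumption of this sketch).
    public static int yearOf(Instant time) {
        return time.atZone(ZoneOffset.UTC).getYear();
    }

    // Moves every record from a year before currentYear into the shard for
    // that year and returns the records that stay in the main index.
    public List<Instant> shard(List<Instant> mainIndex, int currentYear) {
        List<Instant> remaining = new ArrayList<>();
        for (Instant record : mainIndex) {
            int year = yearOf(record);
            if (year < currentYear) {
                // computeIfAbsent reuses an existing shard, so a second run
                // appends to it instead of failing.
                shards.computeIfAbsent(year, y -> new ArrayList<>()).add(record);
            } else {
                remaining.add(record);
            }
        }
        return remaining;
    }

    // Number of records currently held in the shard for the given year.
    public int shardSize(int year) {
        return shards.getOrDefault(year, Collections.<Instant>emptyList()).size();
    }
}
```

With only current-year records in the main index, shard() creates no shards at all, which is why test data from a prior year is needed. Calling shard() a second time with more records from an already sharded year simply grows that shard.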
Statistics Import/Export Tool Issues
- Make solr-import-statistics, solr-export-statistics, and solr-reindex-statistics easier to use
- Issues
- The import and export tools always assume that the main statistics repo is being processed, making it difficult to process an individual shard.
- The import tool often fails when importing records due to _version_ issues.
- Error messages from these tools are confusing.
- The export and re-index tools often fail when export files already exist.
- Proposed Changes
- Proposal: Do not require the "-i statistics" parameter. Instead, default to "-i statistics" when no "-i" parameter is given.
- Proposal: Make the import process more tolerant during record ingest
- Proposal: Make import/export failure messages more explicit. Include the repository, the export file, and the reason for failure in error and log messages.
- Proposal: Add a command line option allowing export files to be overwritten on export.
- Proposal: Add a command line option allowing export files to be overwritten on re-index
- Pull Requests
- DSpace 5x PR: https://github.com/DSpace/DSpace/pull/1623/files
- DSpace 6x PR: https://github.com/DSpace/DSpace/pull/1624/files
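As a rough illustration of the proposed changes for the import/export tools, the sketch below defaults the index name to "statistics" when no "-i" parameter is present, accepts an overwrite option, and builds a failure message that names the repository, the export file, and the reason. The class and method names and the "--overwrite" flag are hypothetical; only the "-i" parameter and its "statistics" default come from the proposals, and the linked pull requests may implement them differently.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical option handling for an export tool (not the DSpace implementation).
public class ExportOptionsSketch {
    static final String DEFAULT_INDEX = "statistics";

    // Returns the value following "-i", or "statistics" when "-i" is absent.
    public static String indexName(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-i".equals(args[i])) {
                return args[i + 1];
            }
        }
        return DEFAULT_INDEX;
    }

    // "--overwrite" is an assumed flag name used only for this sketch.
    public static boolean overwrite(String[] args) {
        for (String arg : args) {
            if ("--overwrite".equals(arg)) {
                return true;
            }
        }
        return false;
    }

    // Returns null when the export may proceed; otherwise returns a message
    // that names the repository, the export file, and the reason for failure.
    public static String checkExportTarget(String index, Path file, boolean overwrite) {
        if (Files.exists(file) && !overwrite) {
            return "Export of index '" + index + "' to " + file
                    + " failed: the file already exists (pass --overwrite to replace it)";
        }
        return null;
    }
}
```

Under these assumptions, calling indexName with no arguments yields "statistics", while passing "-i" followed by a shard name selects that shard instead.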
- When sharding, the destination repo name is off by one calendar year.
- This issue was found at Georgetown. Tom Desair could not reproduce it.
- solr-reindex-statistics does not work for a shard.