There are a number of issues with the Statistics Sharding process in DSpace.
Data Corruption Issue (DSpace 6)
- In DSpace 6x, Tomcat does not restart properly after a statistics shard has been created.
- This behavior has been verified at 3 instances (Georgetown, UCLA and ?). Tom Desair has volunteered to try to reproduce this error.
- How to test
- Back up your solr directory
- If you have statistics from 2016 or earlier in your statistics repository
- Run stats-util -s to move last year's records into a shard
- If you do not have statistics from 2016 in your repository
- See the instructions related to testing PRs 1623/1624 to force old stats records (from a prior year) into your statistics repository
- Run stats-util -s
- Restart tomcat
- Tomcat will not restart properly. The Tomcat process will not respond to a stop request.
- To resolve this issue
- kill -9 your tomcat process
- delete the statistics shard directories
- restart tomcat
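The backup and cleanup steps above can be sketched as two shell helpers. This is a minimal sketch, not a verified procedure. The function names are hypothetical, the solr directory is whatever path you pass in, and the assumption that shard directories sit beside the main statistics core and are named statistics- followed by a four-digit year is not confirmed on this page. The cleanup helper only lists candidate directories so they can be reviewed before anything is deleted.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; paths and shard naming are assumptions.

# Copy the solr directory to a timestamped sibling before sharding.
backup_solr_dir() {
  local solr_dir="$1"
  local backup="${solr_dir%/}.bak.$(date +%Y%m%d%H%M%S)"
  cp -a "$solr_dir" "$backup" && echo "$backup"
}

# Print statistics shard directories (assumed pattern: statistics-YYYY).
# Nothing is deleted here; review the list first.
list_stat_shards() {
  local solr_dir="$1" d
  for d in "${solr_dir%/}"/statistics-[0-9][0-9][0-9][0-9]; do
    [ -d "$d" ] && echo "$d"
  done
  return 0
}
```

For example, after killing the hung Tomcat process, calling list_stat_shards with your solr directory prints the shard directories to delete before restarting Tomcat.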
- In DSpace 5x and 6x, the owningComm field is corrupted by the sharding process
- Tom Desair has created a fix (https://github.com/DSpace/DSpace/pull/1613). While testing this fix, many of the other issues listed here were discovered.
- Ultimately, we need the ability to repair statistics shards that contain a corrupted owningComm (perhaps via the solr-reindex-statistics command)
While attempting to resolve this issue, a number of long-standing challenges with the sharding process have become evident.
Shard Testing Issues
- The shard process requires statistics records from a prior calendar year to be present.
- Proposal: Ensure that the statistics import/export tools allow for the creation of records from a prior year.
- See "Statistics Import/Export Tool Issues"
- Once the shard process has been run for records from a calendar year, the process cannot be re-run.
- Proposal: Allow the sharding process to append records into an existing shard (rather than failing)
- DSpace 5x PR: https://github.com/DSpace/DSpace/pull/1625
- A DSpace 6x PR cannot be tested until DS-3457 is resolved
- How to test
- run stats-util -s to create shards
- import old records (from a prior year where a shard already exists) into the statistics repository
- run stats-util -s again
- Without this PR, the action should fail because the shard exists
- With this PR the action should succeed
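The re-run test above can be wrapped in a small script so the outcome of the second shard run is reported explicitly. This is only a sketch: the function name is hypothetical, the default location of the dspace launcher is an assumption to be overridden through DSPACE_CMD, and it assumes the prior-year records to import are already present as export files.

```shell
#!/usr/bin/env bash
# Assumed install location; override DSPACE_CMD for your site.
DSPACE_CMD="${DSPACE_CMD:-/dspace/bin/dspace}"

# Shard, import old records, shard again, and report the second result.
test_shard_append() {
  "$DSPACE_CMD" stats-util -s \
    || { echo "first shard run failed"; return 2; }
  "$DSPACE_CMD" solr-import-statistics -i statistics \
    || { echo "import failed"; return 2; }
  if "$DSPACE_CMD" stats-util -s; then
    echo "second shard run succeeded (expected with the PR)"
  else
    echo "second shard run failed (expected without the PR)"
    return 1
  fi
}
```

The function returns 0 when the second run succeeds, 1 when it fails, and 2 when an earlier step fails, so the result can be checked from another script.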
Statistics Import/Export Tool Issues
- Make solr-import-statistics, solr-export-statistics, and solr-reindex-statistics easier to use
- Issues
- The import and export tools always assume that the main statistics repo is being processed, making it difficult to process an individual shard.
- The import tool often fails when attempting to import records due to "_version_" issues.
- Error messages from these tools are confusing.
- The export and re-index tools often fail due to the presence of existing export files.
- Proposed Changes
- Proposal 1: Do not force the inclusion of a "-i statistics" parameter to the function. Rather, set "-i statistics" as a default when no "-i" parameter is found.
- How to test (solr-export-statistics)
- You will need a shard. If you do not have one, see Proposal 2 to facilitate the creation of a shard.
- Clear the solr-export directory
- Run solr-export-statistics -i statistics-xxxx
- Without this PR, you will notice that both statistics and statistics-xxxx are exported
- With this PR, you will notice that only statistics-xxxx is exported
- How to test (solr-import-statistics)
- You will need a shard. If you do not have one, see Proposal 2 to facilitate the creation of a shard.
- Clear the solr-export directory
- Run solr-export-statistics -i statistics-xxxx -i statistics
- Both statistics and statistics-xxxx are exported
- Without this PR
- Run solr-import-statistics -i statistics-xxxx
- You will notice that both statistics and statistics-xxxx are imported (or an import is attempted for both)
- With this PR
- Run solr-import-statistics -i statistics-xxxx -o
- You will notice that only statistics-xxxx is imported
- How to test (solr-reindex-statistics)
- The process is likely similar to above
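To see which repositories an export run actually touched, the export directory can be summarized by index name. This is a rough sketch: the function name is hypothetical, and it assumes every export file follows the pattern index name, then _export_, then year-month, then .csv, which is inferred from the statistics_export_2017-01.csv example on this page and not confirmed for shards.

```shell
#!/usr/bin/env bash
# Print the distinct index names that have export files in a directory.
# Assumes files are named <index>_export_<YYYY-MM>.csv.
exported_indexes() {
  local dir="$1" f name
  for f in "${dir%/}"/*_export_*.csv; do
    [ -e "$f" ] || continue          # glob matched nothing
    name="${f##*/}"                  # strip the directory
    echo "${name%_export_*}"         # keep the part before _export_
  done | sort -u
}
```

If the output lists both statistics and a shard after an export that requested only the shard, the behavior described under "Without this PR" has been reproduced.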
- Proposal 2: Make the import process more tolerant during record ingest
- How to test
- Clear the solr-export directory
- run "solr-export-statistics -i statistics"
- extract the top 3-5 lines from the export file, saving them to a new file that matches the naming convention (for instance, if the export file is statistics_export_2017-01.csv, name the new file statistics_export_2017-02.csv)
- Edit the identifier on each record (it is a uuid, so just replace one character with another alphanumeric character)
- run "solr-import-statistics -i statistics"
- Note that the process fails with a "_version_" error
- Install the PR and run "solr-import-statistics -i statistics -o"
- The records should import successfully
- NOTE: to force records from a prior year, repeat this process modifying the record date to use a prior year
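The file-copying step above can be scripted. The following is a minimal sketch with hypothetical function names. It assumes export files are named like the statistics_export_2017-01.csv example, and it copies the first lines of the file exactly as described without inspecting the CSV contents.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the file naming is an assumption based on the
# statistics_export_2017-01.csv example.

# Given .../<index>_export_YYYY-MM.csv, print the name for the next month.
next_export_name() {
  local base="${1%.csv}"
  local prefix="${base%_export_*}"   # everything before _export_
  local ym="${base##*_export_}"      # YYYY-MM
  local year="${ym%-*}" month="${ym#*-}"
  month=$(( ${month#0} + 1 ))        # strip a leading zero, then increment
  if [ "$month" -gt 12 ]; then
    month=1
    year=$(( year + 1 ))
  fi
  printf '%s_export_%d-%02d.csv\n' "$prefix" "$year" "$month"
}

# Copy the first N lines of an export file into the next month's file.
# Refuses to overwrite an existing file.
make_sample_export() {
  local src="$1" lines="${2:-5}" dest
  dest="$(next_export_name "$src")"
  if [ -e "$dest" ]; then
    echo "refusing to overwrite $dest" >&2
    return 1
  fi
  head -n "$lines" "$src" > "$dest" && echo "$dest"
}
```

The identifiers (and, for prior-year records, the dates) in the new file still have to be edited as described in the steps above before running the import.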
- Proposal 3: Make import/export failure messages more explicit. Include the repository, the export file, and the reason for failure in error and log messages.
- How to test
- Clear the solr-export directory
- run "solr-export-statistics -i statistics"
- run "solr-export-statistics -i statistics"
- The second time this command is run, you will see an error message
- Without the PR, the error message will be unclear
- With this PR, the error message will clearly indicate that the export file cannot be overwritten
- Proposal 4: Add a command line option allowing export files to be overwritten on export.
- How to test
- Clear the solr-export directory
- run "solr-export-statistics -i statistics"
- run "solr-export-statistics -i statistics"
- The process will fail
- run "solr-export-statistics -i statistics -o"
- The export file will be overwritten
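Because existing export files are what cause the second run to fail, it can help to check for them, and to clear them, one index at a time. This is a sketch only: the function names are hypothetical, the export directory is whatever path you pass in, and the file pattern is inferred from the statistics_export_2017-01.csv example.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; file naming <index>_export_<YYYY-MM>.csv is assumed.

# List export files already present for one index.
existing_exports() {
  local dir="$1" index="$2" f
  for f in "${dir%/}/${index}"_export_*.csv; do
    [ -e "$f" ] && echo "$f"
  done
  return 0
}

# Remove the export files for one index only, leaving other indexes alone.
clear_exports() {
  local dir="$1" index="$2" f
  for f in "${dir%/}/${index}"_export_*.csv; do
    [ -e "$f" ] && rm -f -- "$f"
  done
  return 0
}
```

Passing statistics as the index does not match files for a shard such as statistics-xxxx, because the pattern requires _export_ to follow the index name directly.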
- Proposal 5: Add a command line option allowing export files to be overwritten on re-index
- How to test
- Clear the solr-export directory
- run "solr-reindex-statistics -i statistics"
- run "solr-reindex-statistics -i statistics"
- The process will fail due to the existence of an export file
- run "solr-reindex-statistics -i statistics -o"
- The export file will be overwritten
- Pull Requests
- DSpace 5x PR: https://github.com/DSpace/DSpace/pull/1623/files
- DSpace 6x PR: https://github.com/DSpace/DSpace/pull/1624/files
- When sharding, the destination repo name is off by one calendar year
- Note that this issue has been found at Georgetown. Tom Desair could not reproduce this issue.
- solr-reindex-statistics does not work for a shard
Testing