...
There are a number of issues with the Statistics Sharding process in DSpace.
Update 2017-02-08
- JIRA issues resolved during today's DSpace meeting
DS-3457, DS-3436, DS-3464, DS-3437
- Documentation Updated
- DSpace 6x: SOLR Statistics Maintenance
- The warning section on this page should be referenced in the 6.1 and 7.0 release notes
- Additional testing documentation for shards: Testing Solr Shards
- DSpace 5x : SOLR Statistics Maintenance
- The warning section on this page should be referenced in the 5.7 release notes
- PRs needing a merge
- Allow Shard Overwrite
- 5x : https://github.com/DSpace/DSpace/pull/1643
DS-3458
- Prevent multi-value corruption during shard (port of Tom Desair's 6x PR)
Data Corruption Issue (DSpace 6)
...
- The shard process requires statistics records from a prior calendar year to be present.
- Proposal: Ensure that the statistics import/export tools allow for the creation of records from a prior year.
- See "Statistics Import/Export Tool Issues"
- Once the shard process has been run for records from a calendar year, the process cannot be re-run.
- Proposal: Allow the sharding process to append records into an existing shard (rather than failing)
DS-3458
- DSpace 5x PR: https://github.com/DSpace/DSpace/pull/1625
- A DSpace 6x PR cannot be tested until DS-3457 is resolved
- How to test
- run stats-util -s to create shards
- import old records (from a prior year where a shard already exists) into the statistics repository
- run stats-util -s again
- Without this PR, the action should fail because the shard exists
- With this PR the action should succeed
- Pull Requests
- DSpace 5x PR: https://github.com/DSpace/DSpace/pull/1625
- DSpace 6x PR: https://github.com/DSpace/DSpace/pull/1633
- DSpace master PR: https://github.com/DSpace/DSpace/pull/1634
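The re-shard test steps listed above can be sketched as a small dry-run script. This is only an illustration: `DSPACE_BIN`, `DRY_RUN`, `run`, and `reshard_test` are hypothetical names introduced here, and the path to the `dspace` launcher is an assumption that will differ per installation. The commands themselves (`stats-util -s`, `solr-import-statistics -i statistics`) are the ones named on this page.

```shell
#!/bin/sh
# Sketch of the re-shard test. DSPACE_BIN and DRY_RUN are illustrative
# variables, not DSpace settings; adjust the path for your install.
DSPACE_BIN="${DSPACE_BIN:-/dspace/bin/dspace}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reshard_test() {
    # 1. Create shards from prior-year records.
    run "$DSPACE_BIN" stats-util -s
    # 2. Import old records (prior year, shard already exists).
    run "$DSPACE_BIN" solr-import-statistics -i statistics
    # 3. Shard again; expected to fail without the PR, succeed with it.
    run "$DSPACE_BIN" stats-util -s
}
```

Calling `reshard_test` with `DRY_RUN=1` only prints the three commands; set `DRY_RUN=0` on a test instance to execute them.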
Statistics Import/Export Tool Issues
Make solr-import-statistics, solr-export-statistics, and solr-reindex-statistics easier to use.
- Issues
- The import and export tools always assume that the main statistics repo is being processed, making it difficult to process an individual shard.
- The import tool often fails when attempting to import records due to "_version_" errors.
- Error messages from these tools are confusing.
- The export and re-index tools often fail due to the presence of existing export files.
- The reindex process fails on a statistics shard
- originally reported as DS-3464
- The reindex process corrupts multi-value fields like owningComm.
- Shard names are off by one calendar year (depending on your time zone)
- originally reported as DS-3437
- Proposed Changes (Completed)
- Proposal 1: Do not force the inclusion of a "-i statistics" parameter to the function. Rather, set "-i statistics" as a default when no "-i" parameter is found.
- How to test (solr-export-statistics)
- You will need a shard. If you do not have one, see Proposal 2 to facilitate the creation of a shard.
- Clear the solr-export directory
- Run solr-export-statistics -i statistics-xxxx
- Without this PR, you will notice that both statistics and statistics-xxxx are exported
- With this PR, you will notice that only statistics-xxxx is exported
- How to test (solr-import-statistics)
- You will need a shard. If you do not have one, see Proposal 2 to facilitate the creation of a shard.
- Clear the solr-export directory
- Run solr-export-statistics -i statistics-xxxx -i statistics
- Both statistics and statistics-xxxx are exported
- Without this PR
- Run solr-import-statistics -i statistics-xxxx
- You will notice that both statistics and statistics-xxxx are imported (or attempted to be imported)
- With this PR
- Run solr-import-statistics -i statistics-xxxx -of
- You will notice that only statistics-xxxx is imported
- How to test (solr-reindex-statistics)
- See Proposals 5 and 6 for testing instructions
- Proposal 2: Make the import process more tolerant during record ingest
- How to test
- Clear the solr-export directory
- run "solr-export-statistics -i statistics"
- Copy the top 3-5 lines of the export file into a new file that follows the naming convention (for instance, copy lines from statistics_export_2017-01.csv into statistics_export_2017-02.csv)
- Edit the identifier on each record so that it is unique (it is a UUID, so changing a single alphanumeric character is enough)
- run "solr-import-statistics -i statistics"
- Note that the process fails with a "_version_" error
- Install the PR and run "solr-import-statistics -i statistics -o"
- The records should import successfully
- NOTE: to force records from a prior year, repeat this process modifying the record date to use a prior year
- Proposal 3: Make import/export failure messages more explicit. Include the repository, the export file, and the reason for failure in error and log messages.
- How to test
- Clear the solr-export directory
- run "solr-export-statistics -i statistics"
- run "solr-export-statistics -i statistics"
- The second time this command is run, you will see an error message
- Without the PR, the error message will be unclear
- With this PR, the error message will clearly indicate that the export file cannot be overwritten
- Proposal 4: Add a command line option allowing export files to be overwritten on export.
- How to test
- Clear the solr-export directory
- run "solr-export-statistics -i statistics"
- run "solr-export-statistics -i statistics"
- The process will fail
- run "solr-export-statistics -i statistics -of"
- The export file will be overwritten
- Proposal 5: Add a command line option allowing export files to be overwritten on re-index
- How to test
- Clear the solr-export directory
- run "solr-reindex-statistics -i statistics"
- run "solr-reindex-statistics -i statistics"
- The process will fail due to the existence of an export file
- run "solr-reindex-statistics -i statistics -of"
- The export file will be overwritten
- Proposal 6: Set the correct "instanceDir" for statistics shards (since the config files reside in the "statistics" directory)
- How to test
- Clear the solr-export directory
- run "solr-reindex-statistics -i statistics-xxxx"
- Without this PR, the reindex fails on a statistics shard (DS-3464)
- Proposal 7: Correctly re-index multi-value fields such as owningComm
- How to test
- View an item with multiple owning communities in DSpace
- Find the item view record in the Solr Admin console
- Notice that owningComm is an array
- run "solr-reindex-statistics -i statistics"
- Find the item view record in the Solr Admin console
- owningComm should still be an array with multiple values
- Without the fix, owningComm is a string separated by commas
- Proposal 8: Repair multi-value fields in a shard that were corrupted by prior sharding or prior reindex operations
- How to test
- In the Solr Admin Console, look for owningComm fields containing either "," or "\"
- Note the ids or other identifying information for the records
- run "solr-reindex-statistics -i statistics-xxxx"
- Find the records again in the Solr Admin Console
- If problems exist, run
- solr-export-statistics -i statistics-xxxx -f
- for file in *; do sed -E -e "s/[\\]+,/,/g" -i $file; done
- solr-import-statistics -i statistics-xxxx
- The owningComm fields should be an array
- Proposal 9: Consistently use UTC from statistics records to determine shard name
- How to test
- If your server is not in UTC, create a statistics record for a year whose shard does not yet exist
- Run the shard process without the PR
- Note the shard name is off by one year
- Test results may vary based on your time zone relative to UTC
- Repeat the process with the PR in place
- Note that the shard name matches the year of the records
- Pull Requests
- DSpace 5x PR: https://github.com/DSpace/DSpace/pull/1623/files
- DSpace 6x PR: https://github.com/DSpace/DSpace/pull/1624/files
- When sharding, the destination repo name is off by one calendar year
DS-3437
- Note that this issue has been found at Georgetown. Tom Desair could not reproduce this issue.
- DSpace master PR: https://github.com/DSpace/DSpace/pull/1635
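The off-by-one shard name (Proposal 9, DS-3437) comes down to which time zone is used to read the year from a record's timestamp. The sketch below illustrates that effect only; it is not the DSpace code path. `shard_name` is a hypothetical helper, the sample timestamp is made up, and `date -d @SECONDS` is a GNU `date` extension, so this assumes a GNU userland.

```shell
#!/bin/sh
# Sample record time (hypothetical): 1483245000 seconds since the epoch
#   = 2017-01-01T04:30:00Z
#   = 2016-12-31 23:30 in a UTC-5 zone
# shard_name EPOCH_SECONDS TZ_VALUE
# Prints the shard name derived from the record year in the given zone.
shard_name() {
    echo "statistics-$(TZ="$2" date -d "@$1" +%Y)"
}

shard_name 1483245000 EST5   # year read in local time (UTC-5): statistics-2016
shard_name 1483245000 UTC0   # year read in UTC:                statistics-2017
```

The same record lands in a different shard depending on the zone, which is why results of the test above vary with your offset from UTC.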
Manual Repair of Corrupted Export Files
...
...
- Use solr-export-statistics to export a repo
- Run the following to repair records
for file in *; do sed -E -e "s/[\\]+,/,/g" -i $file; done
...
- Run solr-import-statistics to import the fixed records
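For trying the repair on a single file before looping over a whole export directory, the same substitution can be wrapped in a function. This is a minimal sketch: `repair_export_file` is a hypothetical helper name, and it writes through a temporary file instead of `sed -i` so that it does not depend on a particular `sed` flavor.

```shell
#!/bin/sh
# repair_export_file FILE
# Collapses any run of backslashes that precedes a comma into a plain
# comma, using the same expression as the manual repair loop.
repair_export_file() {
    tmp="$(mktemp)" || return 1
    if sed -E -e 's/[\\]+,/,/g' "$1" > "$tmp"; then
        # Copy back with cat so the original file's permissions are kept.
        cat "$tmp" > "$1"
    fi
    rm -f "$tmp"
}
```

To apply it to every export file in the current directory: `for file in *; do repair_export_file "$file"; done`. Keep a copy of the export directory first, since the files are rewritten in place.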
Testing Solr
Testing CSV Export
...