...
- Make solr-import-statistics, solr-export-statistics, and solr-reindex-statistics easier to use
- Jira issue: DS-3456
- Issues
- The import and export tools always assume that the main statistics repo is being processed, making it difficult to process an individual shard.
- The import tool often fails when attempting to import records due to "_version_" issues.
- Error messages from these tools are confusing.
- The export and re-index tools often fail due to the presence of existing export files.
- The reindex process fails on a statistics shard
- originally reported as DS-3464
- The reindex process corrupts multi-value fields like owningComm.
- Proposed Changes (Completed)
- Proposal 1: Do not force the inclusion of a "-i statistics" parameter to the function. Rather, set "-i statistics" as a default when no "-i" parameter is found.
- How to test (solr-export-statistics)
- You will need a shard. If you do not have one, see Proposal 2 to facilitate the creation of a shard.
- Clear the solr-export directory
- Run solr-export-statistics -i statistics-xxxx
- Without this PR, you will notice that both statistics and statistics-xxxx are exported
- With this PR, you will notice that only statistics-xxxx is exported
- How to test (solr-export-statistics with more than one "-i" parameter)
- You will need a shard. If you do not have one, see Proposal 2 to facilitate the creation of a shard.
- Clear the solr-export directory
- Run solr-export-statistics -i statistics-xxxx -i statistics
- Both statistics and statistics-xxxx are exported
- How to test (solr-import-statistics)
- Without this PR
- Run solr-import-statistics -i statistics-xxxx
- You will notice that both statistics and statistics-xxxx are imported (or attempted to be imported)
- With this PR
- Run solr-import-statistics -i statistics-xxxx -o
- You will notice that only statistics-xxxx is imported
- How to test (solr-reindex-statistics)
- See Proposals 5 and 6 for testing instructions
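The defaulting rule in Proposal 1 can be illustrated with a small shell sketch. This is not the DSpace implementation: `resolve_indexes` is a hypothetical helper introduced here only to show the intended behavior, namely that every "-i" value is honored and "statistics" is used only when no "-i" is given.

```shell
# Hypothetical helper (not part of DSpace): print one index name per line.
# Every "-i <name>" pair is collected; "statistics" is the fallback only
# when no "-i" parameter was supplied at all.
resolve_indexes() {
  found=0
  while [ "$#" -gt 0 ]; do
    if [ "$1" = "-i" ] && [ "$#" -ge 2 ]; then
      printf '%s\n' "$2"
      found=1
      shift   # consume the index name
    fi
    shift     # consume the current argument
  done
  if [ "$found" -eq 0 ]; then
    printf '%s\n' "statistics"
  fi
}

resolve_indexes -i statistics-xxxx   # only the shard is listed
resolve_indexes -o                   # no -i given, so the default applies
```

With this rule, "-i statistics-xxxx" selects only the shard, while "-i statistics-xxxx -i statistics" selects both, matching the test expectations above.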
- Proposal 2: Make the import process more tolerant during record ingest
- How to test
- Clear the solr-export directory
- run "solr-export-statistics -i statistics"
- extract the top 3-5 lines from the export file and save them to a new file that matches the naming convention (for instance, take lines from statistics_export_2017-01.csv and save them as statistics_export_2017-02.csv)
- Edit the identifier on each record in the new file (it is a UUID, so just change one alphanumeric character)
- run "solr-import-statistics -i statistics"
- Note that the process fails with a "_version_" error
- Install the PR and run "solr-import-statistics -i statistics -o"
- The records should import successfully
- NOTE: to force records from a prior year, repeat this process modifying the record date to use a prior year
- Proposal 3: Make import/export failure messages more explicit. Include the repository, the export file, and the reason for failure in error and log messages.
- How to test
- Clear the solr-export directory
- run "solr-export-statistics -i statistics"
- run "solr-export-statistics -i statistics"
- The second time this command is run, you will see an error message
- Without the PR, the error message will be unclear
- With this PR, the error message will clearly indicate that the export file cannot be overwritten
- Proposal 4: Add a command line option allowing export files to be overwritten on export.
- How to test
- Clear the solr-export directory
- run "solr-export-statistics -i statistics"
- run "solr-export-statistics -i statistics"
- The process will fail
- run "solr-export-statistics -i statistics -o"
- The export file will be overwritten
- Proposal 5: Add a command line option allowing export files to be overwritten on re-index
- How to test
- Clear the solr-export directory
- run "solr-reindex-statistics -i statistics"
- run "solr-reindex-statistics -i statistics"
- The process will fail due to the existence of an export file
- run "solr-reindex-statistics -i statistics -o"
- The export file will be overwritten
- Proposal 6: Set the correct "instanceDir" for statistics shards (since the config files reside in the "statistics" directory)
- How to test
- Clear the solr-export directory
- run "solr-reindex-statistics -i statistics-xxxx"
- Proposal 7: Correctly re-index multi-value fields such as owningComm
- How to test
- View an item with multiple owning communities in DSpace
- Find the item view record in the Solr Admin console
- Notice that owningComm is an array
- run "solr-reindex-statistics -i statistics"
- Find the item view record in the Solr Admin console
- owningComm should still be an array with multiple values
- Without the fix, owningComm is a string separated by commas
- Proposal 8: Repair multi-value fields in a shard that were corrupted by prior sharding or prior reindex operations
- How to test
- In the Solr Admin Console, look for owningComm fields containing either "," or "\"
- Note the ids or other identifying information for the records
- run "solr-reindex-statistics -i statistics-xxxx"
- Find the records again in the Solr Admin Console
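- If problems exist, run the following (the sed loop rewrites every file in the current directory, so run it from the directory holding the export files)
- solr-export-statistics -i statistics-xxxx -o
- for file in *; do sed -E -i -e 's/[\\]+,/,/g' "$file"; done
- solr-import-statistics -i statistics-xxxx -o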
- The owningComm fields should be an array
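To see what the sed expression in the Proposal 8 repair step does, the sketch below applies the same substitution to a single made-up value. The sample string and the `unescape_commas` function name are assumptions for illustration; real export rows will look different.

```shell
# Same substitution as the repair loop: one or more backslashes directly
# before a comma are dropped, leaving a plain comma.
unescape_commas() {
  sed -E -e 's/[\\]+,/,/g'
}

# Hypothetical corrupted value with one and two backslashes before commas.
printf '%s\n' 'comm-1\,comm-2\\,comm-3' | unescape_commas
# prints: comm-1,comm-2,comm-3
```

Because the pattern requires a comma right after the backslashes, backslashes elsewhere in a line are left untouched.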
- Pull Requests
- DSpace 5x PR: https://github.com/DSpace/DSpace/pull/1623/files
- DSpace 6x PR: https://github.com/DSpace/DSpace/pull/1624/files
- When sharding, the destination repo name is off by one calendar year
- Jira issue: DS-3437
- Note that this issue has been found at Georgetown. Tom Desair could not reproduce this issue.
- Added additional info to the ticket. Record inclusion is based on UTC but the shard name is not.
- TODO: merge reference to getYearUTC() into 1623 and 1624
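The off-by-one year can be reproduced outside DSpace with a small sketch. It assumes GNU date (for the -d option) and uses the POSIX time zone string EST5 (a fixed UTC-5 offset) purely as an example; it only illustrates how one instant can fall in different calendar years in UTC and in local time, and it is not the DSpace sharding code.

```shell
# One instant shortly after midnight UTC on New Year's Day.
instant='2017-01-01 02:00:00 UTC'

# Calendar year of that instant in UTC.
year_utc=$(date -u -d "$instant" +%Y)

# Calendar year of the same instant in a UTC-5 local zone (example only).
year_local=$(TZ=EST5 date -d "$instant" +%Y)

printf 'UTC year: %s, local year: %s\n' "$year_utc" "$year_local"
# prints: UTC year: 2017, local year: 2016
```

If records are selected by UTC year while the shard name is derived from the local year, a server west of UTC can end up naming the shard one year early, which is consistent with the behavior described in the ticket.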
Manual Repair of Corrupted Export Files
...