...
- The shard process requires statistics records from a prior calendar year to be present.
- Proposal: Ensure that the statistics import/export tools allow for the creation of records from a prior year.
- See "Statistics Import/Export Tool Issues"
- Once the shard process has been run for records from a calendar year, the process cannot be re-run.
- Proposal: Allow the sharding process to append records into an existing shard (rather than failing)
- Jira: DS-3458
- How to test
- run stats-util -s to create shards
- import old records (from a prior year where a shard already exists) into the statistics repository
- run stats-util -s again
- Without this PR, the action should fail because the shard exists
- With this PR the action should succeed
- Pull Requests
- DSpace 5x PR: https://github.com/DSpace/DSpace/pull/1625/files
- DSpace 6x PR: https://github.com/DSpace/DSpace/pull/1633
- DSpace master PR: https://github.com/DSpace/DSpace/pull/1634
...
- Issues
- The import and export tools always assume that the main statistics repo is being processed, making it difficult to process an individual shard.
- The import tool often fails when attempting to import records due to "_version_" issues.
- Error messages from these tools are confusing.
- The export and re-index tools often fail due to the presence of existing export files.
- The reindex process fails on a statistics shard
- Originally reported as DS-3464
- The reindex process corrupts multi-value fields like owningComm.
- Shard names are off by one calendar year (depending on your time zone)
- Originally reported as DS-3437
- Proposed Changes (Completed)
- Proposal 1: Do not force the inclusion of a "-i statistics" parameter to the function. Rather, set "-i statistics" as a default when no "-i" parameter is found.
- How to test (solr-export-statistics)
- You will need a shard. If you do not have one, see Proposal 2 to facilitate the creation of a shard.
- Clear the solr-export directory
- Run solr-export-statistics -i statistics-xxxx
- Without this PR, you will notice that both statistics and statistics-xxxx are exported
- With this PR, you will notice that only statistics-xxxx is exported
- How to test (solr-import-statistics)
- You will need a shard. If you do not have one, see Proposal 2 to facilitate the creation of a shard.
- Clear the solr-export directory
- Run solr-export-statistics -i statistics-xxxx -i statistics
- Both statistics and statistics-xxxx are exported
- Without this PR
- Run solr-import-statistics -i statistics-xxxx
- You will notice that both statistics and statistics-xxxx are imported (or attempted to be imported)
- With this PR
- Run solr-import-statistics -i statistics-xxxx -f
- You will notice that only statistics-xxxx is imported
- How to test (solr-reindex-statistics)
- See Proposals 5 and 6 for testing instructions
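The defaulting rule in Proposal 1 can be sketched in a few lines of shell. This is an illustration only: `resolve_indexes` is a hypothetical helper, not part of DSpace, and the real tools implement their option handling in Java. It returns the indexes named by `-i` and falls back to `statistics` only when no `-i` is given.

```shell
# Hypothetical helper (not DSpace code): collect every -i value and
# fall back to "statistics" only when no -i option was supplied.
resolve_indexes() (
  OPTIND=1
  indexes=""
  while getopts ":i:f" opt; do
    case "$opt" in
      i) indexes="${indexes:+$indexes }$OPTARG" ;;
      *) ;;  # -f and anything else are ignored in this sketch
    esac
  done
  echo "${indexes:-statistics}"
)

resolve_indexes                                    # prints: statistics
resolve_indexes -i statistics-xxxx                 # prints: statistics-xxxx
resolve_indexes -i statistics-xxxx -i statistics   # prints: statistics-xxxx statistics
```

The function body runs in a subshell, so the option-parsing variables do not leak into the calling shell.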
- Proposal 2: Make the import process more tolerant during record ingest
- How to test
- Clear the solr-export directory
- run "solr-export-statistics -i statistics"
- Extract the top 3-5 lines from the export file and save them to a new file that matches the naming convention (for instance, if the export file is statistics_export_2017-01.csv, name the new file statistics_export_2017-02.csv)
- Edit the identifier on each record (it is a UUID, so changing a single alphanumeric character is enough)
- run "solr-import-statistics -i statistics"
- Note that the process fails with a "_version_" error
- Install the PR and run "solr-import-statistics -i statistics"
- The records should import successfully
- NOTE: to force records from a prior year, repeat this process modifying the record date to use a prior year
- Proposal 3: Make import/export failure messages more explicit. Include the repository, the export file, and the reason for failure in error and log messages.
- How to test
- Clear the solr-export directory
- run "solr-export-statistics -i statistics"
- run "solr-export-statistics -i statistics"
- The second time this command is run, you will see an error message
- Without the PR, the error message will be unclear
- With this PR, the error message will clearly indicate that the export file cannot be overwritten
- Proposal 4: Add a command line option allowing export files to be overwritten on export.
- How to test
- Clear the solr-export directory
- run "solr-export-statistics -i statistics"
- run "solr-export-statistics -i statistics"
- The process will fail
- run "solr-export-statistics -i statistics -f"
- The export file will be overwritten
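The overwrite guard tested here, combined with the more explicit failure message from Proposal 3, might look roughly like the sketch below. It is a minimal sketch under stated assumptions: `write_export` is a hypothetical function, the message wording is invented for illustration, and the third argument stands in for the `-f` option. None of it is taken from the DSpace source.

```shell
# Hypothetical helper (not DSpace code): refuse to clobber an existing
# export file unless the caller explicitly asks for an overwrite.
write_export() (
  index=$1
  file=$2
  force=${3:-no}   # stands in for the -f option
  if [ -e "$file" ] && [ "$force" != "yes" ]; then
    # Name the repository, the file, and the reason, as Proposal 3 asks.
    echo "Export of '$index' failed: '$file' already exists (use -f to overwrite)" >&2
    exit 1
  fi
  printf 'exported %s\n' "$index" > "$file"
)
```

Calling `write_export statistics some-file.csv` twice fails on the second call, while passing `yes` as the third argument lets the second call replace the file.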
- Proposal 5: Add a command line option allowing export files to be overwritten on re-index
- How to test
- Clear the solr-export directory
- run "solr-reindex-statistics -i statistics"
- run "solr-reindex-statistics -i statistics"
- The process will fail due to the existence of an export file
- run "solr-reindex-statistics -i statistics -f"
- The export file will be overwritten
- Proposal 6: Set the correct "instanceDir" for statistics shards (since the config files reside in the "statistics" directory)
- How to test
- Clear the solr-export directory
- run "solr-reindex-statistics -i statistics-xxxx"
- Proposal 7: Correctly re-index multi-value fields such as owningComm
- How to test
- View an item with multiple owning communities in DSpace
- Find the item view record in the Solr Admin console
- Notice that owningComm is an array
- run "solr-reindex-statistics -i statistics"
- Find the item view record in the Solr Admin console
- owningComm should still be an array with multiple values
- Without the fix, owningComm is a string separated by commas
- Proposal 8: Repair multi-value fields in a shard that were corrupted by prior sharding or prior reindex operations
- How to test
- In the Solr Admin Console, look for owningComm fields containing either "," or "\"
- Note the ids or other identifying information for the records
- run "solr-reindex-statistics -i statistics-xxxx"
- Find the records again in the Solr Admin Console
- If problems exist, run
- solr-export-statistics -i statistics-xxxx -f
- for file in *; do sed -E -e "s/[\\]+,/,/g" -i "$file"; done
- solr-import-statistics -i statistics-xxxx
- The owningComm fields should be an array
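To see what the `sed` expression in the repair step does before running it against real export files, it can be tried on a throwaway file first. The sample value below is made up, and `strip_escaped_commas` is a hypothetical wrapper that writes to standard output instead of editing in place.

```shell
# Strip any run of backslashes that directly precedes a comma
# (equivalent to the expression used in the repair step).
strip_escaped_commas() {
  sed -E -e 's/[\\]+,/,/g' "$1"
}

sample=$(mktemp)
# Made-up owningComm value with over-escaped separators.
printf '%s\n' 'aaa\,bbb\\\,ccc' > "$sample"
strip_escaped_commas "$sample"    # prints: aaa,bbb,ccc
rm -f "$sample"
```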
- Proposal 9: Consistently use UTC from statistics records to determine shard name
- How to test
- If your time zone is not UTC, create a statistics record for a year whose shard does not yet exist
- run shard process without the PR
- Note the shard name is off by one year
- Test results may vary based on your time zone relative to UTC
- Repeat the process with the PR in place
- Note that the shard name matches the year of the records
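The following sketch illustrates why a year derived from local time can differ from the UTC year of the same record. It is not DSpace's sharding code: `year_in_zone` is a hypothetical helper, the timestamp is an arbitrary example, and the sketch assumes a `date` that accepts `-d @SECONDS` (GNU coreutils does; BSD and macOS `date` do not).

```shell
# Year of a Unix timestamp as seen from a given time zone.
year_in_zone() {
  TZ="$1" date -d "@$2" +%Y
}

# 1483236000 is 2017-01-01 02:00:00 UTC.
year_in_zone UTC0 1483236000    # prints: 2017
year_in_zone EST5 1483236000    # prints: 2016 (still 21:00 on 31 December)
```

A record timestamped shortly after midnight UTC on 1 January therefore lands in a different year for any zone west of UTC, which is why results vary with the tester's time zone.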
- Pull Requests
- DSpace 5x PR: https://github.com/DSpace/DSpace/pull/1623/files
- DSpace 6x PR: https://github.com/DSpace/DSpace/pull/1624/files
- DSpace master PR: https://github.com/DSpace/DSpace/pull/1635/files
Manual Repair of Corrupted Export Files
...