There are a number of issues with the Statistics Sharding process in DSpace.

Data Corruption Issue (DSpace 6)

  1. In DSpace 6x, Tomcat does not restart properly after a statistics shard has been created.
    1. This behavior has been verified at 3 instances (Georgetown, UCLA and ?). Tom Desair has volunteered to try to reproduce this error.
  2. In DSpace 5x and 6x, the owningComm field is corrupted by the sharding process.
    1. Tom Desair has created a fix (https://github.com/DSpace/DSpace/pull/1613). While testing this fix, many of the other issues listed here were discovered.

While attempting to resolve this issue, a number of long-standing challenges with the sharding process have become evident.

Shard Testing Issues

  1. The shard process requires statistics records from a prior calendar year to be present.
    1. Proposal: Ensure that the statistics import/export tools allow for the creation of records from a prior year.
      1. See "Statistics Import/Export Tool Issues"
  2. Once the shard process has been run for records from a calendar year, the process cannot be re-run.
    1. Proposal: Allow the sharding process to append records to an existing shard (rather than failing).
      1. DSpace 5x PR: https://github.com/DSpace/DSpace/pull/1625
      2. A DSpace 6x PR cannot be tested until DS-3457 is resolved.
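
The two testing constraints above can be illustrated with a small sketch. It is not the DSpace implementation: the class name ShardPlanSketch, the "statistics-" shard prefix, and the choice to evaluate the calendar year in UTC are all assumptions made for illustration. It shows why a test data set with no prior-year records produces nothing to shard, and what "append rather than fail" looks like when a shard for a given year is already present.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: groups record timestamps into per-year shard names. */
final class ShardPlanSketch {
    // Assumed naming pattern for yearly shards; verify against your installation.
    static final String SHARD_PREFIX = "statistics-";

    /** Returns the assumed shard name for a record, using the UTC calendar year. */
    static String shardFor(Instant recordTime) {
        int year = recordTime.atZone(ZoneOffset.UTC).getYear();
        return SHARD_PREFIX + year;
    }

    /**
     * Plans which records would move to which yearly shard.
     * Records from currentYear (or later) stay in the main core and are omitted,
     * so a data set without prior-year records yields an empty plan.
     */
    static Map<String, List<Instant>> plan(List<Instant> records, int currentYear) {
        Map<String, List<Instant>> plan = new TreeMap<>();
        for (Instant record : records) {
            int year = record.atZone(ZoneOffset.UTC).getYear();
            if (year < currentYear) {
                // computeIfAbsent appends to an existing entry instead of failing
                // when the shard for that year has already been created.
                plan.computeIfAbsent(SHARD_PREFIX + year, k -> new ArrayList<>()).add(record);
            }
        }
        return plan;
    }
}
```

Calling ShardPlanSketch.plan with only current-year timestamps returns an empty map, which mirrors the need for import/export tools that can create prior-year records for testing.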

Statistics Import/Export Tool Issues

  1. Make solr-import-statistics, solr-export-statistics, and solr-reindex-statistics easier to use
    1. Issues
      1. The import and export tools always assume that the main statistics repo is being processed, making it difficult to process an individual shard successfully.
      2. The import tool often fails when attempting to import records due to _version_ issues.
      3. Error messages from these tools are confusing.
      4. The export and re-index tools often fail due to the presence of existing export files.
    2. Proposed Changes
      1. Proposal: Do not require a "-i statistics" parameter. Instead, default to "-i statistics" when no "-i" parameter is given.
      2. Proposal: Make the import process more tolerant during record ingest.
      3. Proposal: Make import/export failure messages more explicit. Include the repository, the export file, and the reason for failure in error and log messages.
      4. Proposal: Add a command line option allowing export files to be overwritten on export.
      5. Proposal: Add a command line option allowing export files to be overwritten on re-index.
    3. Pull Requests
      1. DSpace 5x PR: https://github.com/DSpace/DSpace/pull/1623/files
      2. DSpace 6x PR: https://github.com/DSpace/DSpace/pull/1624/files
  2. When sharding, the destination repo name is off by one calendar year.
    1. Note that this issue has been found at Georgetown. Tom Desair could not reproduce this issue.
  3. solr-reindex-statistics does not work for a shard 
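
The proposed command-line changes for the import/export tools could be sketched as follows. This is a hedged illustration, not the code in the pull requests: the class ExportOptionsSketch, the "--overwrite" flag name, and the message format are hypothetical, and the parsing is hand-written so that the example has no external dependencies. Only the "-i" option and its proposed "statistics" default come from the list above.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: resolves index names and an overwrite flag from arguments. */
final class ExportOptionsSketch {
    // Proposed default when no "-i" parameter is supplied.
    static final String DEFAULT_INDEX = "statistics";

    final List<String> indexNames = new ArrayList<>();
    boolean overwrite = false;

    static ExportOptionsSketch parse(String[] args) {
        ExportOptionsSketch opts = new ExportOptionsSketch();
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-i":
                    if (i + 1 >= args.length) {
                        throw new IllegalArgumentException("-i requires an index name");
                    }
                    opts.indexNames.add(args[++i]);
                    break;
                case "--overwrite": // hypothetical flag name, chosen for this sketch
                    opts.overwrite = true;
                    break;
                default:
                    throw new IllegalArgumentException("Unknown option: " + args[i]);
            }
        }
        // Fall back to the main statistics index instead of forcing "-i statistics".
        if (opts.indexNames.isEmpty()) {
            opts.indexNames.add(DEFAULT_INDEX);
        }
        return opts;
    }

    /** Builds a failure message naming the repository, the export file, and the reason. */
    static String failureMessage(String indexName, String exportFile, String reason) {
        return String.format("Export failed [index=%s, file=%s]: %s", indexName, exportFile, reason);
    }
}
```

With this shape, passing no arguments resolves to the main statistics index, while passing "-i" with a shard name targets that shard alone, which addresses the difficulty of processing an individual shard.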