There are a number of issues with the Statistics Sharding process in DSpace.

Data Corruption Issue (DSpace 6)

  1. In DSpace 6x, Tomcat does not restart properly after a statistics shard has been created.
    1. This behavior has been verified at two instances (Georgetown and ?). Tom Desair has volunteered to try to reproduce this error.
  2. In DSpace 5x and 6x, the owningComm field is corrupted by the sharding process.
    1. Tom Desair has created a fix: https://github.com/DSpace/DSpace/pull/1613  Many of the other issues listed here were discovered while testing this fix.

While attempting to resolve these issues, a number of long-standing challenges with the sharding process became evident.

Shard Testing Issues

  1. The shard process requires statistics records from a prior calendar year to be present.
    1. Proposal: Ensure that the statistics import/export tools allow for the creation of records from a prior year.
    2. Proposal: Allow the sharding process to optionally shard records for the current calendar year.
  2. Once the shard process has been run for records from a calendar year, the process cannot be re-run.
    1. Proposal: Allow the sharding process to append records to an existing shard (rather than failing).
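
The two testing proposals above (optionally sharding the current calendar year, and appending to an existing shard instead of failing) can be sketched as follows. This is a minimal illustration, not DSpace code: the function names, the "statistics-YYYY" shard naming, and the use of a plain dictionary in place of Solr cores are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def plan_shards(record_times, now, include_current_year=False):
    """Group timezone-aware record timestamps by the UTC year they fall in.

    Records from the current year are skipped unless include_current_year
    is set. The 'statistics-<year>' shard name is an assumed convention.
    """
    current_year = now.astimezone(timezone.utc).year
    plan = defaultdict(list)
    for ts in record_times:
        year = ts.astimezone(timezone.utc).year
        if year < current_year or (include_current_year and year == current_year):
            plan[f"statistics-{year}"].append(ts)
    return dict(plan)


def apply_plan(plan, existing_shards):
    """Append planned records to shards, creating a shard only if missing.

    existing_shards is a dict standing in for the set of Solr cores
    (hypothetical); an existing shard is extended rather than rejected.
    """
    for name, records in plan.items():
        existing_shards.setdefault(name, []).extend(records)
    return existing_shards
```

With this shape, a test run can pass include_current_year=True to exercise sharding without needing prior-year records, and re-running against a year that already has a shard extends that shard.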

Statistics Import/Export Tool Issues

  1. The import tool often fails when attempting to import records due to _version issues.
    1. Proposal: Make the import process more tolerant during record ingest
  2. The import and export tools always assume that the main statistics repo is being processed, making it difficult to process an individual shard.
    1. Proposal: Do not require the "-i statistics" parameter. Instead, default to "-i statistics" when no "-i" parameter is given.
      1. DSpace 5x PR: https://github.com/DSpace/DSpace/pull/1623/files
      2. DSpace 6x PR: https://github.com/DSpace/DSpace/pull/1624/files
        1. Note that I cannot successfully test 6x fixes due to DS-3457
    2. Proposal: Make import/export failure messages more explicit. Include the repository, the export file, and the reason for failure in error and log messages.
    3. Proposal: Add a command line option allowing export files to be overwritten on export.
  3. When sharding, the destination repo name is off by one calendar year.
    1. Note that this issue has been found at Georgetown.  Tom Desair could not reproduce this issue.
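
The import/export proposals above (a default for "-i", explicit failure messages, and an overwrite option) can be sketched together. This is an illustrative Python sketch under stated assumptions, not the DSpace tool's code: the "--overwrite" flag, the ExportError class, and both function names are hypothetical; only the "-i statistics" parameter comes from the notes above.

```python
import argparse
import os


class ExportError(Exception):
    """Hypothetical error type carrying an explicit failure message."""


def parse_args(argv):
    """Parse options, defaulting the index to 'statistics' when -i is absent."""
    parser = argparse.ArgumentParser(description="Hypothetical export wrapper")
    parser.add_argument("-i", "--index-name", default="statistics",
                        help="repository (core) to process")
    # Assumed flag name; the proposal only asks for some overwrite option.
    parser.add_argument("--overwrite", action="store_true",
                        help="replace an existing export file")
    return parser.parse_args(argv)


def check_export_target(index_name, export_file, overwrite):
    """Refuse to clobber an existing file unless overwrite is requested.

    The message names the repository, the export file, and the reason.
    """
    if os.path.exists(export_file) and not overwrite:
        raise ExportError(
            f"Export failed: repository={index_name!r}, "
            f"file={export_file!r}, "
            "reason=file already exists (pass --overwrite to replace it)"
        )
```

Setting the default in the parser keeps existing invocations that pass "-i statistics" working unchanged, while allowing an individual shard to be named explicitly.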