There are a number of issues with the Statistics Sharding process in DSpace.

Data Corruption Issue (DSpace 6)

  1. In DSpace 6x, Tomcat

...

  1. does not

...

  • Applies to 
    • dspace-5_x: not an issue
    • dspace-6_x
    • master

...

  • "-i statistics" is always present on the command line, triggering frequent write errors
  • Add more descriptive error messages
  • Change placement of "-i statistics" parameter
  • Applies to
    • dspace-5_x:
    • dspace-6_x:
    • master:
  1. properly restart after a statistics shard has been created.  
    Jira: DS-3457
    1. This behavior has been verified at 2 instances (Georgetown and ?). Tom Desair has volunteered to try to reproduce this error.
  2. In DSpace 5x and 6x, the owningComm field is corrupted by the sharding process 
    Jira: DS-3436
    1. Tom Desair has created a fix (https://github.com/DSpace/DSpace/pull/1613). While testing this fix, many of the other issues listed here were discovered.

While attempting to resolve this issue, a number of long-standing challenges with the sharding process have become evident.

Shard Testing Issues

  1. The shard process requires statistics records from a prior calendar year to be present.
    1. Proposal: Ensure that the statistics import/export tools allow for the creation of records from a prior year.
    2. Proposal: Allow the sharding process to optionally shard records for the current calendar year.
  2. Once the shard process has been run for records from a calendar year, the process cannot be re-run.
    1. Proposal: Allow the sharding process to append records into an existing shard (rather than failing) 
      Jira: DS-3458
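
The two testing proposals above (optionally sharding the current calendar year, and appending to an existing shard rather than failing) can be illustrated with a minimal sketch. This is not the DSpace implementation: the record layout, the "time" field name, the "statistics-YYYY" shard naming, and both function names are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import datetime


def plan_shards(records, current_year, include_current_year=False):
    """Group records by the calendar year of their (assumed) 'time' field.

    Records from the current year are skipped unless include_current_year
    is set, which models the optional behavior proposed above.
    """
    by_year = defaultdict(list)
    for record in records:
        year = record["time"].year
        if year == current_year and not include_current_year:
            continue
        by_year[year].append(record)
    return dict(by_year)


def apply_plan(existing_shards, plan):
    """Append each year's records to its shard, creating the shard if needed.

    existing_shards maps a shard name to a list of records. A shard that
    already exists is extended instead of being treated as an error.
    """
    for year, records in plan.items():
        name = f"statistics-{year}"  # assumed naming scheme
        existing_shards.setdefault(name, []).extend(records)
    return existing_shards


if __name__ == "__main__":
    sample = [
        {"id": 1, "time": datetime(2015, 6, 1)},
        {"id": 2, "time": datetime(2016, 1, 2)},
    ]
    shards = {"statistics-2015": [{"id": 0, "time": datetime(2015, 1, 1)}]}
    apply_plan(shards, plan_shards(sample, current_year=2016, include_current_year=True))
    print({name: len(items) for name, items in shards.items()})
```

With a flag like include_current_year, a test instance would not need records from a prior year, and re-running the process would extend the existing shard rather than abort.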


Statistics Import/Export Tool Issues

  1. The import tool often fails when attempting to import records due to _version_ issues.
    1. Proposal: Make the import process more tolerant during record ingest
  2. The import and export tools always assume that the main statistics repo is being processed, making it difficult to successfully process an individual shard.
    1. Proposal: Do not force the inclusion of a "-i statistics" parameter to the function.  Rather, set "-i statistics" as a default when no "-i" parameter is found. 
      Jira: DS-3456
      1. DSpace 5x PR: https://github.com/DSpace/DSpace/pull/1623/files
      2. DSpace 6x PR: https://github.com/DSpace/DSpace/pull/1624/files
        1. Note that I cannot successfully test 6x fixes due to DS-3457
    2. Proposal: Make import/export failure messages more explicit.  Include the repository, the export file, and the reason for failure in error and log messages. 
      Jira: DS-3456
    3. Proposal: Add a command line option allowing export files to be overwritten on export.
  3. When sharding, the destination repo name is off by one calendar year 
    Jira: DS-3437
    1. Note that this issue has been found at Georgetown.  Tom Desair could not reproduce this issue.
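
The proposals for DS-3456 (treat "-i statistics" as a default rather than forcing it, and name the repository, export file, and reason in failure messages) can be sketched as follows. This is a hypothetical stand-alone parser, not the DSpace command-line code: only the "-i" flag and the "statistics" default come from the notes above, and the function names and message layout are assumptions.

```python
import argparse

DEFAULT_INDEX = "statistics"  # main statistics repo, used only when -i is absent


def build_parser():
    """Build a parser where -i is optional and defaults to the main repo."""
    parser = argparse.ArgumentParser(description="hypothetical export wrapper")
    parser.add_argument(
        "-i",
        dest="index",
        default=DEFAULT_INDEX,
        help="name of the repo or shard to process",
    )
    return parser


def resolve_index(argv):
    """Return the index named on the command line, or the default."""
    return build_parser().parse_args(argv).index


def failure_message(index, export_file, reason):
    """Format a failure message that names the repo, the file, and the cause."""
    return f"Export failed: index={index} file={export_file} reason={reason}"


if __name__ == "__main__":
    print(resolve_index([]))
    print(resolve_index(["-i", "statistics-2015"]))
    print(failure_message("statistics-2015", "export_0.csv", "file already exists"))
```

Because the default is applied only when no "-i" is supplied, a caller can target an individual shard without the value being overridden or duplicated by the wrapper.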

...