There are a number of issues with the Statistics Sharding process in DSpace.

Data Corruption Issue (DSpace 6)

  1. In DSpace 6x, Tomcat does not restart properly after a statistics shard has been created.
    1. This behavior has been verified at 3 instances (Georgetown, UCLA and ?). Tom Desair has volunteered to try to reproduce this error.
    2. How to test
      1. Back up your solr directory
      2. If you have statistics from 2016 or earlier in your statistics repository
        1. Run stats-util -s to move last year's records into a shard
      3. If you do not have statistics from 2016 in your repository
        1. See the instructions related to testing PRs 1623/1624 to force old stats records (from a prior year) into your statistics repository
        2. Run stats-util -s
      4. Restart Tomcat
      5. Tomcat will not restart properly. The Tomcat process will not respond to a stop request.
      6. To resolve this issue
        1. kill -9 your tomcat process
        2. delete the statistics shard directories
        3. restart tomcat
  2. In DSpace 5x and 6x, the owningComm field is corrupted by the sharding process.
    1. Tom Desair has created a fix (https://github.com/DSpace/DSpace/pull/1613). While testing this fix, many of the other issues listed here were discovered.
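
The recovery step "delete the statistics shard directories" can be made safer by listing the candidate directories before removing anything. The following is a minimal sketch, not part of DSpace. It assumes that shard cores live directly under the solr directory and are named "statistics-" followed by a four-digit year; the function name find_shard_dirs and the path in the usage note are hypothetical.

```python
import re
from pathlib import Path

# Assumption: a shard core is a directory named "statistics-<4-digit year>".
SHARD_NAME = re.compile(r"statistics-\d{4}")


def find_shard_dirs(solr_dir):
    """Return the shard directories directly under solr_dir, sorted by name.

    The main "statistics" core and unrelated cores are never returned,
    and nothing is deleted; the caller decides what to do with the list.
    """
    root = Path(solr_dir)
    return sorted(
        p for p in root.iterdir()
        if p.is_dir() and SHARD_NAME.fullmatch(p.name)
    )
```

For example, printing each entry of find_shard_dirs("/dspace/solr") shows what would be removed, so the list can be checked against the backup taken in the first test step.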

While attempting to resolve this issue, a number of long-standing challenges with the sharding process have become evident.

Shard Testing Issues

  1. The shard process requires statistics records from a prior calendar year to be present.
    1. Proposal: Ensure that the statistics import/export tools allow for the creation of records from a prior year.
      1. See "Statistics Import/Export Tool Issues"
  2. Once the shard process has been run for records from a calendar year, the process cannot be re-run.
    1. Proposal: Allow the sharding process to append records into an existing shard (rather than failing) 
      1. DSpace 5x PR: https://github.com/DSpace/DSpace/pull/1625
      2. A DSpace 6x PR cannot be tested until DS-3457 is resolved
      3. How to test
        1. run stats-util -s to create shards
        2. import old records (from a prior year where a shard already exists) into the statistics repository
        3. run stats-util -s again
          1. Without this PR, the action should fail because the shard exists
          2. With this PR the action should succeed
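
The difference between the current behavior and the proposed one can be sketched as a small decision function. This is an illustration only, not the code in the PR. The shard naming ("statistics-" plus the year) and the allow_append switch are assumptions introduced for this example.

```python
def shard_action(year, existing_shards, allow_append):
    """Decide what a shard run does with the records of one calendar year.

    year            -- calendar year of the records being sharded
    existing_shards -- names of shard cores that already exist
    allow_append    -- hypothetical switch: False models the current
                       behavior, True models the proposed behavior
    """
    core = f"statistics-{year}"  # assumed shard naming convention
    if core not in existing_shards:
        return ("create", core)
    if allow_append:
        return ("append", core)
    raise RuntimeError(f"shard {core} already exists")
```

With allow_append set to False, a second run for the same year raises an error, which corresponds to the failure expected without the PR. With it set to True, the second run returns ("append", ...), which corresponds to the expected success.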

Statistics Import/Export Tool Issues

  1. Make solr-import-statistics, solr-export-statistics, and solr-reindex-statistics easier to use
    2. Issues
      1. The import and export tools always assume that the main statistics repo is being processed, making it difficult to process an individual shard.
      2. The import tool often fails when importing records because of _version_ issues.
      3. Error messages from these tools are confusing.
      4. The export and re-index tools often fail when export files already exist.
    3. Proposed Changes
      1. Proposal 1: Do not require a "-i statistics" parameter. Instead, use "-i statistics" as the default when no "-i" parameter is given.
        1. How to test (solr-export-statistics)
          1. You will need a shard. If you do not have one, see Proposal 2 to facilitate the creation of a shard.
          2. Clear the solr-export directory 
          3. Run solr-export-statistics -i statistics-xxxx 
          4. Without this PR, you will notice that both statistics and statistics-xxxx are exported
          5. With this PR, you will notice that only statistics-xxxx is exported
        2. How to test (solr-import-statistics)
          1. You will need a shard. If you do not have one, see Proposal 2 to facilitate the creation of a shard.
          2. Clear the solr-export directory 
          3. Run solr-export-statistics -i statistics-xxxx -i statistics
          4. Both statistics and statistics-xxxx are exported
          5. Without this PR
            1. Run solr-import-statistics -i statistics-xxxx 
            2. You will notice that both statistics and statistics-xxxx are imported (or attempted to be imported)
          6. With this PR
            1. Run solr-import-statistics -i statistics-xxxx -o
            2. You will notice that only statistics-xxxx is imported
        3. How to test (solr-reindex-statistics)
          1. The process is likely similar to the one above.
      2. Proposal 2: Make the import process more tolerant during record ingest
        1. How to test
          1. Clear the solr-export directory
          2. run "solr-export-statistics -i statistics"
          3. Copy the top 3-5 lines of the export file into a new file that follows the naming convention (for instance, if the export file is statistics_export_2017-01.csv, name the new file statistics_export_2017-02.csv)
            1. Edit the identifier on each record (it is a uuid, so replacing one character with another alphanumeric character is enough)
          4. run "solr-import-statistics -i statistics"
            1. Note that the process fails with a "_version_" error
          5. Install the PR and run "solr-import-statistics -i statistics -o" 
            1. The records should import successfully
          6. NOTE: To force records from a prior year into the repository, repeat this process and change each record's date to a prior year
      3. Proposal 3: Make import/export failure messages more explicit.  Include the repository, the export file, and the reason for failure in error and log messages. 
        1. How to test
          1. Clear the solr-export directory
          2. run "solr-export-statistics -i statistics"
          3. run "solr-export-statistics -i statistics"
            1. The second time this command is run, you will see an error message
            2. Without the PR, the error message will be unclear
            3. With this PR, the error message will clearly indicate that the export file cannot be overwritten
      4. Proposal 4: Add a command line option allowing export files to be overwritten on export.
        1. How to test
          1. Clear the solr-export directory
          2. run "solr-export-statistics -i statistics"
          3. run "solr-export-statistics -i statistics"
            1. The process will fail
          4. run "solr-export-statistics -i statistics -o"
            1. The export file will be overwritten
      5. Proposal 5: Add a command line option allowing export files to be overwritten on re-index
        1. How to test
          1. Clear the solr-export directory
          2. run "solr-reindex-statistics -i statistics"
          3. run "solr-reindex-statistics -i statistics"
            1. The process will fail due to the existence of an export file
          4. run "solr-reindex-statistics -i statistics -o"
            1. The export file will be overwritten
    4. Pull Requests
      1. DSpace 5x PR: https://github.com/DSpace/DSpace/pull/1623/files
      2. DSpace 6x PR: https://github.com/DSpace/DSpace/pull/1624/files
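
Proposals 1, 3, and 4 can be illustrated together with a short option-handling sketch. It is written in Python for brevity and is not the DSpace implementation. The "-i" and "-o" options are the ones described above; the function names, the program name, and the wording of the error message are assumptions made for this example.

```python
import argparse
from pathlib import Path


def parse_args(argv):
    """Parse -i (repeatable) and -o; fall back to the main core if -i is absent."""
    parser = argparse.ArgumentParser(prog="export-sketch")
    parser.add_argument("-i", dest="cores", action="append",
                        help="core to process; may be given more than once")
    parser.add_argument("-o", dest="overwrite", action="store_true",
                        help="overwrite existing export files")
    args = parser.parse_args(argv)
    # Proposal 1: apply the default only when no -i was supplied, so that
    # "-i statistics-xxxx" does not also pull in the main statistics core.
    if not args.cores:
        args.cores = ["statistics"]
    return args


def check_export_target(core, export_file, overwrite):
    """Return the export path, or fail with an explicit message (Proposal 3)."""
    path = Path(export_file)
    if path.exists() and not overwrite:
        # Name the core, the file, and the reason for the failure.
        raise FileExistsError(
            f"Cannot export core '{core}': export file {path} already exists; "
            "pass -o to overwrite it"
        )
    return path  # Proposal 4: with -o an existing file may be replaced
```

The default is applied after parsing rather than through the parser's default argument, because an "append" action would otherwise add the user's values to the default list instead of replacing it.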
  2. When sharding, the destination repo name is off by one calendar year 
    1. Note that this issue has been found at Georgetown. Tom Desair could not reproduce this issue.
  3. solr-reindex-statistics does not work for a shard 
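
The test-data preparation described under Proposal 2 (copy a few records, edit each identifier, and optionally move the date to a prior year) can be scripted. The following is a minimal sketch under stated assumptions: the export file is CSV with a header row, the identifier column is named "uid", the date column is named "time", and dates begin with a four-digit year. The function name and these column names are assumptions for illustration and should be checked against a real export file.

```python
import csv
import io


def make_sample(csv_text, rows=3, id_column="uid", time_column="time",
                target_year=None):
    """Return CSV text holding the first `rows` records with altered ids.

    The last character of each identifier is swapped so the record no
    longer matches an existing one. If target_year is given, the leading
    four-digit year of each date is replaced with it.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    for index, row in enumerate(reader):
        if index >= rows:
            break
        uid = row[id_column]
        if uid:
            # Swap the final character for a different one.
            row[id_column] = uid[:-1] + ("1" if uid[-1] == "0" else "0")
        if target_year is not None:
            row[time_column] = f"{target_year:04d}" + row[time_column][4:]
        writer.writerow(row)
    return out.getvalue()
```

The returned text can be saved under the next file name in the naming convention and then imported as described in the test steps.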

Testing

 
