...

  1. The shard process requires statistics records from a prior calendar year to be present.
    1. Proposal: Ensure that the statistics import/export tools allow for the creation of records from a prior year.
      1. See "Statistics Import/Export Tool Issues"
  2. Once the shard process has been run for records from a calendar year, the process cannot be re-run.
    1. Proposal: Allow the sharding process to append records into an existing shard (rather than failing) 
      1. Jira: DS-3458 (DuraSpace JIRA)
      2. How to test
        1. Run stats-util -s to create shards
        2. Import old records (from a prior year where a shard already exists) into the statistics repository
        3. Run stats-util -s again
          1. Without this PR, the action should fail because the shard already exists
          2. With this PR, the action should succeed

    2. Pull Requests
      1. DSpace 5x PR: https://github.com/DSpace/DSpace/pull/1625/files
      2. DSpace 6x PR: https://github.com/DSpace/DSpace/pull/1633
      3. DSpace master PR: https://github.com/DSpace/DSpace/pull/1634
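
The "How to test" steps for DS-3458 can be sketched as a small script. This is only a sketch under assumptions that do not come from this page: the launcher path in DSPACE_BIN, the DRY_RUN switch, and the function names are hypothetical, and step 2 assumes that export files holding the prior-year records are already in the solr-export directory. With DRY_RUN=1 (the default here) the commands are printed rather than executed.

```shell
#!/bin/sh
# Assumed location of the DSpace launcher; adjust for your install.
DSPACE_BIN="${DSPACE_BIN:-/dspace/bin/dspace}"
# DRY_RUN=1 prints each command; any other value executes it.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

shard_append_test() {
    # Step 1: create the yearly shards.
    run "$DSPACE_BIN" stats-util -s
    # Step 2: import prior-year records into the main statistics repository.
    run "$DSPACE_BIN" solr-import-statistics -i statistics
    # Step 3: shard again; with the PR this should append instead of failing.
    run "$DSPACE_BIN" stats-util -s
}

shard_append_test
```

Running it once in dry-run mode shows the three commands in order. Setting DRY_RUN=0 on a test instance executes them for real.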

...

  1. Issues
    1. The import and export tools always assume that the main statistics repo is being processed, making it difficult to successfully process an individual shard.
    2. The import tool often fails when attempting to import records due to _version_ issues.
    3. Error messages from these tools are confusing.
    4. The export and re-index tools often fail due to the presence of existing export files.
    5. The reindex process fails on a statistics shard
      1. Originally reported as DS-3464 (DuraSpace JIRA)
    6. The reindex process corrupts multi-value fields like owningComm.
    7. Shard names are off by one calendar year (depending on your time zone)
      1. Originally reported as DS-3437 (DuraSpace JIRA)
  2. Proposed Changes (Completed)
    1. Proposal 1: Do not force the inclusion of a "-i statistics" parameter in every run of these tools.  Rather, use "-i statistics" as the default only when no "-i" parameter is supplied.
      1. How to test (solr-export-statistics)
        1. You will need a shard.  If you do not have one, see Proposal 2 to facilitate the creation of a shard.
        2. Clear the solr-export directory 
        3. Run solr-export-statistics -i statistics-xxxx 
        4. Without this PR, you will notice that both statistics and statistics-xxxx are exported
        5. With this PR, you will notice that only statistics-xxxx is exported
      2. How to test (solr-import-statistics)
        1. You will need a shard.  If you do not have one, see Proposal 2 to facilitate the creation of a shard.
        2. Clear the solr-export directory 
        3. Run solr-export-statistics -i statistics-xxxx -i statistics
        4. Both statistics and statistics-xxxx are exported
        5. Without this PR:
          1. Run solr-import-statistics -i statistics-xxxx 
          2. You will notice that both statistics and statistics-xxxx are imported (or attempted to be imported)
        6. With this PR:
          1. Run solr-import-statistics -i statistics-xxxx -f
          2. You will notice that only statistics-xxxx is imported
      3. How to test (solr-reindex-statistics)
        1. See Proposals 5 and 6 for testing instructions
    2. Proposal 2: Make the import process more tolerant during record ingest
      1. How to test
        1. Clear the solr-export directory
        2. run "solr-export-statistics -i statistics"
        3. Extract the top 3-5 lines from the export file and save them to a new file that matches the naming convention (for instance, if the export file is statistics_export_2017-01.csv, name the new file statistics_export_2017-02.csv)
          1. Edit the identifier on each record (it is a UUID, so changing a single alphanumeric character is enough to make it unique)
        4. run "solr-import-statistics -i statistics"
          1. Note that the process fails with a "_version_" error
        5. Install the PR and run "solr-import-statistics -i statistics" 
          1. The records should import successfully
        6. NOTE: To create records from a prior year, repeat this process, changing the record date to a prior year
    3. Proposal 3: Make import/export failure messages more explicit.  Include the repository, the export file, and the reason for failure in error and log messages. 
      1. How to test
        1. Clear the solr-export directory
        2. run "solr-export-statistics -i statistics"
        3. run "solr-export-statistics -i statistics"
          1. The second time this command is run, you will see an error message
          2. Without the PR, the error message will be unclear
          3. With this PR, the error message will clearly indicate that the export file cannot be overwritten
    4. Proposal 4: Add a command line option allowing export files to be overwritten on export.
      1. How to test
        1. Clear the solr-export directory
        2. run "solr-export-statistics -i statistics"
        3. run "solr-export-statistics -i statistics"
          1. The process will fail
        4. run "solr-export-statistics -i statistics -f"
          1. The export file will be overwritten
    5. Proposal 5: Add a command line option allowing export files to be overwritten on re-index
      1. How to test
        1. Clear the solr-export directory
        2. run "solr-reindex-statistics -i statistics"
        3. run "solr-reindex-statistics -i statistics"
          1. The process will fail due to the existence of an export file
        4. run "solr-reindex-statistics -i statistics -f"
          1. The export file will be overwritten
    6. Proposal 6: Set the correct "instanceDir" for statistics shards (since the config files reside in the "statistics" directory)
      1. How to test
        1. Clear the solr-export directory
        2. run "solr-reindex-statistics -i statistics-xxxx"
    7. Proposal 7: Correctly re-index multi-value fields such as owningComm 
      1. How to test 
        1. View an item with multiple owning communities in DSpace
        2. Find the item view record in the Solr Admin console
        3. Notice that owningComm is an array
        4. run "solr-reindex-statistics -i statistics"
        5. Find the item view record in the Solr Admin console
        6. owningComm should still be an array with multiple values
          1. Without the fix, owningComm is a string separated by commas
    8. Proposal 8: Repair multi-value fields in a shard that were corrupted by prior sharding or prior reindex operations
      1. How to test
        1. In the Solr Admin Console, look for owningComm fields containing either "," or "\"
          1. Note the IDs or other identifying information for the records
        2. run "solr-reindex-statistics -i statistics-xxxx"
        3. Find the records again in the Solr Admin Console
        4. If problems exist, run
          1. solr-export-statistics -i statistics-xxxx -f
          2. for file in *; do sed -E -e "s/[\\]+,/,/g" -i "$file"; done
          3. solr-import-statistics -i statistics-xxxx
        5. The owningComm fields should be an array
    9. Proposal 9: Consistently use UTC from statistics records to determine shard name
      1. How to test
        1. If your server's time zone is not UTC, create a statistics record dated in a year for which no shard exists
        2. Run the shard process (stats-util -s) without the PR
          1. Note the shard name is off by one year
          2. Test results may vary based on your time zone relative to UTC
        3. Repeat the process with the PR in place
          1. Note that the shard name matches the year of the records
  3. Pull Requests
    1. DSpace 5x PR: https://github.com/DSpace/DSpace/pull/1623/files
    2. DSpace 6x PR: https://github.com/DSpace/DSpace/pull/1624/files
    3. DSpace master PR: https://github.com/DSpace/DSpace/pull/1635/files
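
To see why Proposal 9 matters, the following sketch prints the calendar year of a single timestamp as seen from UTC and from a zone eight hours behind UTC. It illustrates only the time zone effect, not the DSpace code itself. The helper name year_in_tz and the sample timestamp are made up for this example, and the "-d @epoch" form assumes GNU date.

```shell
#!/bin/sh
# Print the calendar year of an epoch timestamp as seen from a given TZ.
# Usage: year_in_tz <TZ string> <epoch seconds>
year_in_tz() {
    TZ="$1" date -d "@$2" +%Y
}

# 1483236000 is 2017-01-01T02:00:00Z, two hours into the new year in UTC.
EPOCH=1483236000

echo "UTC:   $(year_in_tz UTC0 "$EPOCH")"   # prints "UTC:   2017"
echo "UTC-8: $(year_in_tz PST8 "$EPOCH")"   # prints "UTC-8: 2016"
```

A record written at that moment belongs to 2017 in UTC but to 2016 on a server eight hours behind. That one-year difference is the kind of mismatch in shard names that using UTC consistently is meant to avoid.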

 

Manual Repair of Corrupted Export Files
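
Before running the sed substitution from Proposal 8 against real export files, it can be tried on a scratch file. This is a minimal sketch: the sample line is invented and does not reflect the real export file layout, the function name repair_export_file is hypothetical, and "-i" without a suffix assumes GNU sed. The single-quoted expression is meant to match the same text as the double-quoted one in Proposal 8.

```shell
#!/bin/sh
# Replace any run of backslashes directly before a comma with a bare comma,
# editing the file in place (GNU sed).
repair_export_file() {
    sed -E -i -e 's/[\\]+,/,/g' "$1"
}

# Invented sample: commas escaped with one and with two backslashes.
sample=$(mktemp)
printf '%s\n' 'aaa\,bbb\\,ccc' > "$sample"

repair_export_file "$sample"
cat "$sample"    # prints: aaa,bbb,ccc
rm -f "$sample"
```

Working on a copy of the solr-export directory first keeps the original export files available if the result is not what you expect.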

...