Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1. Wiki Markup
    *General Curation Configuration:* First, in your {{\[dspace\]/config/modules/curate.cfg}} you will want to enable & configure the METS-based replication tasks. (NOTE: there is a sample {{curate.cfg}} file provided in {{\[dspace-replicate\]/config/modules/curate.cfg}} which is pre-configured to use METS-based AIPs).
    • Enable the Replication Tasks: In the list of "Task Class implementations" (plugin.named.org.dspace.curate.CurationTask), add the following.
      REMEMBER to add a comma and backslash (", \") after each line (except the final line).
      Code Block
      plugin.named.org.dspace.curate.CurationTask = \
          ... (YOUR EXISTING TASKS) ... , \
          org.dspace.ctask.replicate.EstimateAIPSize = estaipsize, \
          org.dspace.ctask.replicate.ReadOdometer = readodometer, \
          org.dspace.ctask.replicate.TransmitAIP = transmitaip, \
          org.dspace.ctask.replicate.VerifyAIP = verifyaip, \
          org.dspace.ctask.replicate.FetchAIP = fetchaip, \
          org.dspace.ctask.replicate.CompareWithAIP = auditaip, \
          org.dspace.ctask.replicate.RemoveAIP = removeaip, \
          org.dspace.ctask.replicate.METSRestoreFromAIP = restorefromaip, \
          org.dspace.ctask.replicate.METSRestoreFromAIP = replacewithaip, \
          org.dspace.ctask.replicate.METSRestoreFromAIP = restorekeepexisting, \
          org.dspace.ctask.replicate.METSRestoreFromAIP = restoresinglefromaip, \
          org.dspace.ctask.replicate.METSRestoreFromAIP = replacesinglewithaip
      
    • Give Each Task a Human-Friendly Task Name: Under the ui.tasknames setting, give each of the above Tasks a human-friendy name. Here are some recommended values, but you are welcome to tweak them.
      REMEMBER to add a comma and backslash (", \") after each line (except the final line).
      Code Block
      ui.tasknames = \
          ... (YOUR EXISTING TASK NAMES) ... , \
          estaipsize = Estimate Storage Space for AIP(s) Size, \
          readodometer = Read Odometer, \
          transmitaip = Transmit AIP(s) to Storage, \
          verifyaip = Verify AIP(s) exist in Storage, \
          fetchaip = Fetch AIP(s) from Storage, \
          auditaip = Audit/Compare against AIP(s), \
          removeaip = Remove AIP(s) from Storage, \
          restorefromaip = Restore Missing Object(s) from AIP(s), \
          replacewithaip = Replace Existing Object(s) with AIP(s), \
          restorekeepexisting = Restore Missing Object(s) but Keep Existing Objects,\
          restoresinglefromaip = Restore Single Object from AIP, \
          replacesinglewithaip = Replace Single Object with AIP
      
    • Optionally Create a Task Group: Finally, if you'd like to create a Task Group for these tasks, you can create a group named "replicate" and add them all to it. The below is just an example for how you may wish to set the ui.taskgroups and ui.taskgroup.* settings. It creates two Task Groups: (1) a "General Purpose Tasks" group for a few default DSpace Curation Tasks, and (2) a "Replication Suite Tasks" group for all these new Replication tasks.
      Code Block
      # Tasks may be organized into named groups which display together in UI drop-downs
      ui.taskgroups = \
         general = General Purpose Tasks,
         replicate = Replication Suite Tasks
      
      # Group membership is defined using comma-separated lists of task names, one property per group
      ui.taskgroup.general = profileformats, requiredmetadata, checklinks
      ui.taskgroup.replicate = estaipsize, readodometer, transmitaip, verifyaip, fetchaip, auditaip, removeaip, restorefromaip, replacewithaip, restorekeepexisting, restoresinglefromaip, replacesinglewithaip
      
  2. Wiki Markup
    *Replication Suite Configuration*: Next, in your {{\[dspace\]/config/modules/replicate.cfg}} you will want to ensure it is setup to properly use METS-based AIPs.   Under the "AIP Packaging Settings" you'll want the following settings enabled:
    Code Block
    # Package type. Permitted values: 'mets', 'bagit'
    # mets = Generate default DSpace AIPs as described in: https://wiki.duraspace.org/display/DSDOC18/AIP+Backup+and+Restore
    # bagit = Generate AIPs based on the BagIt packaging format: https://wiki.ucop.edu/display/Curation/BagIt
    packer.pkgtype = mets
    
    # Format of package compression. Permitted values: 'zip' or 'tgz'
    # for 'mets' packages, only 'zip' is supported
    packer.archfmt = zip
    
    # Whether or not the name packages with a DSpace type prefix.
    # When 'true', package files are named [type]@[handle].[format] (e.g. ITEM@123456789-1.zip)
    # When 'false', package files are named [handle].[format] (e.g. 123456789-1.zip)
    # Defaults to 'true'. For 'mets' packages, this must be 'true'.
    packer.typeprefix = true
    
  3. Wiki Markup
    *Optionally tweak the AIP Restore/Replace settings:*  Optionally, you can decide to tweak the way AIPs are restored or replaced (using [DSDOC18:AIP Backup and Restore]). These settings normally *should not need to be tweaked*, but are available in the {{\[dspace\]/config/modules/replicate-mets.cfg}} configuration file.  See that configuration file for more details.

...

  1. Wiki Markup
    *General Curation Configuration:* First, in your {{\[dspace\]/config/modules/curate.cfg}} you will want to enable & configure the BagIt-based replication tasks. (NOTE: there is a sample {{curate.cfg}} file provided in {{\[dspace-replicate\]/config/modules/curate.cfg}} which provides example settings).
    • Enable the Replication Tasks: In the list of "Task Class implementations" (plugin.named.org.dspace.curate.CurationTask), add the following.
      REMEMBER to add a comma and backslash (", \") after each line (except the final line).
      Code Block
      plugin.named.org.dspace.curate.CurationTask = \
          ... (YOUR EXISTING TASKS) ... , \
          org.dspace.ctask.replicate.EstimateAIPSize = estaipsize, \
          org.dspace.ctask.replicate.ReadOdometer = readodometer, \
          org.dspace.ctask.replicate.TransmitAIP = transmitaip, \
          org.dspace.ctask.replicate.VerifyAIP = verifyaip, \
          org.dspace.ctask.replicate.FetchAIP = fetchaip, \
          org.dspace.ctask.replicate.CompareWithAIP = auditaip, \
          org.dspace.ctask.replicate.RemoveAIP = removeaip, \
          org.dspace.ctask.replicate.BagItRestoreFromAIP = restorefromaip, \
          org.dspace.ctask.replicate.BagItReplaceWithAIP = replacewithaip
      
    • Give Each Task a Human-Friendly Task Name: Under the ui.tasknames setting, give each of the above Tasks a human-friendy name. Here are some recommended values, but you are welcome to tweak them.
      REMEMBER to add a comma and backslash (", \") after each line (except the final line).
      Code Block
      ui.tasknames = \
          ... (YOUR EXISTING TASK NAMES) ... , \
          estaipsize = Estimate Storage Space for AIP(s) Size, \
          readodometer = Read Odometer, \
          transmitaip = Transmit AIP(s) to Storage, \
          verifyaip = Verify AIP(s) exist in Storage, \
          fetchaip = Fetch AIP(s) from Storage, \
          auditaip = Audit/Compare against AIP(s), \
          removeaip = Remove AIP(s) from Storage, \
          restorefromaip = Restore Missing Object(s) from AIP(s), \
          replacewithaip = Replace Existing Object(s) with AIP(s)
      
    • Optionally Create a Task Group: Finally, if you'd like to create a Task Group for these tasks, you can create a group named "replicate" and add them all to it. The below is just an example for how you may wish to set the ui.taskgroups and ui.taskgroup.* settings. It creates two Task Groups: (1) a "General Purpose Tasks" group for a few default DSpace Curation Tasks, and (2) a "Replication Suite Tasks" group for all these new Replication tasks.
      Code Block
      # Tasks may be organized into named groups which display together in UI drop-downs
      ui.taskgroups = \
         general = General Purpose Tasks,
         replicate = Replication Suite Tasks
      
      # Group membership is defined using comma-separated lists of task names, one property per group
      ui.taskgroup.general = profileformats, requiredmetadata, checklinks
      ui.taskgroup.replicate = estaipsize, readodometer, transmitaip, verifyaip, fetchaip, auditaip, removeaip, restorefromaip, replacewithaip
      
  2. Wiki Markup
    *Replication Suite Configuration*: Next, in your {{\[dspace\]/config/modules/replicate.cfg}} you will want to ensure it is setup to properly use BagIt-based AIPs.   Under the "AIP Packaging Settings" you'll want the following settings enabled:
    Code Block
    # Package type. Permitted values: 'mets', 'bagit'
    # mets = Generate default DSpace AIPs as described in: https://wiki.duraspace.org/display/DSDOC18/AIP+Backup+and+Restore
    # bagit = Generate AIPs based on the BagIt packaging format: https://wiki.ucop.edu/display/Curation/BagIt
    packer.pkgtype = bagit
    

...

We can suppose our data curator has identified a collection of items in her DSpace repository consisting of high-value, born-digital, and unique/irreplaceable (not held elsewhere) content. She prudently wishes to insure against catastrophic local loss of this content by keeping a copy or replica of this collection elsewhere. She'd prefer to replicate all her DSpace content, but realizes that storage costs over long periods has made her administration wary, so decides to begin with this collection.

First Steps - Estimation

Replication Task Used:

Estimate Storage Space for AIP(s) (ID: estaipsize)

In order to budget for replication storage, she needs to know the 'size' of the collection. When she asks her sysadmin, he replies that it is easy to give her figures for the whole asset store, but since collections aren't stored separately, she would have to add up each item's bitstreams in the collection, a rather tedious process. Thus the first task: a reporting tool which operates on natural DSpace objects, rather than storage volumes.

Wiki Markup
To install this task, edit
/dspace
 {{\[dspace\]/config/modules/curate.cfg}} (NB: all curation configuration is 'modular' in the sense that the configuration properties live outside of dspace.cfg, in named files. This means that if a given suite of tasks is unused, it's configuration is never installed). First, add the task to the lists of curation tasks.

Code Block
plugin.named.org.dspace.curate.CurationTask = \
.... other curation tasks
    org.dspace.ctask.replicate.EstimateAIPSize = estaipsize

...

Code Block
ui.tasknames = \
.... other tasks
    estaipsize = Estimate Storage Space for AIPsAIP(s)

Of course, both the name of the task ('estaipsize'), and the language for the UI are up to you. Now the curator can navigate to her collection, select the 'curate' tab, and then from the dropdown list of tasks choose the entry, and perform the task. On the page, the results will display:

...

The estimates from this task are rather crude, in that they do not measure the actual AIPs, but just the bitstreams (so ignore the metadata xml), but should be fine for storage costing and allocating purposes.

Replicating

Replication Task Used:

Transmit AIP(s) to Storage (ID: transmitaip)

Having secured Having secured approval to replicate 'Amazing Images' collection, our curator obviously needs a task to generate the AIP representations of each item in the collection, and transmit these archive files to the replication storage site (which may be service-backed, local, in the cloud, etc, as will be explored below). Adding this task is just like the previous step: editing into curate.cfg the configuration properties. (We won't repeat a description of this process each time, but note that you may always add a task, but elect not to display it in the administrative UI.). This task is 'org.dspace.ctask.replicate.TransmitAIP'.

Wiki Markup
Since we are now working with AIPs, we should examine how they are configured to the tasks. Most configuration specific to the replication task suite is found at
/dspace
 {{\[dspace\]/config/modules/replicate.cfg}}. There are two main properties to set (or accept default values):

Code Block
# Package type. Permitted values: 'mets', 'bagit'
packer.pkgtype = mets
# Format of package compression. Permitted values: 'zip' or 'tgz'
# for 'mets' packages, only zip is supported
packer.archfmt = zip

The default values will create a METS-based AIP in the default DSpace AIP Format, compressed into a 'zip' archive. The other alternative supported by the replication task suite is Library of Congress 'Bagit' packaging, which may compressed either into a 'zip' file or a 'tgz' ('gzipped tar'), a compression standard more common in Unix systems.

Our data curator may elect to perform this task in the admin GUI, or, if the collection is rather large, she may instead 'queue' the task for later execution by using the queueing facility available in the curation system. We should note that the 'transmitAIP' task, like all other replication tasks, operate on whatever DSpace object they are given. Thus, if the object is a collection, the task creates (and transmits, of course) an AIP for the collection object itself (metadata and logo), as well as AIPs for each item in the collection. If the task is given an identifier for a single Item, then only one AIP will be created.

Verifying Replication

Replication Task Used:

Verify AIP(s) exist in Storage (ID: verifyaip)

While the transmitAIP task will report on whether or not it was successful in generating and transmitting AIP(s) to the replication service, our data curator wants the ability (within DSpace, not by using the replication service tools or UIs) to check whenever she likes that the AIP(s) which were transmitted are still there. A simple task 'org.dspace.ctask.replicate.VerifyAIP' can perform this function.

Ensuring Replica Integrity and Accuracy over time

Replication Task Used:

Audit against AIP(s) (ID: auditaip)

The 'Amazing Images' collection is comparatively static, meaning that few new items are likely to be added, and most of the metadata in each item is not routinely changed. However, over longer periods of time, cataloging errors are discovered and corrected, perhaps formats become obsolete and new bitstreams are added. If the curator is fastidious about each change, and performs the 'transmitAIPtransmitaip' task on each item that has changed, then in general the set of AIP replicas will always be 'in sync' with the repository. However, it useful to have the means to ensure that the replicas agree with the repository without having to create and transmit entirely new ones. Thus the task: 'org.dspace.ctask.replicate.CompareWithAIP', which can also be thought of as a simple audit task. When performed on an Item, the task does the following:

  • generates an AIP for the Item DSpace object locally (but does not transmit it)
  • computes a an MD5 checksum on the local AIP
  • requests from the replication storage service a an MD5 checksum for the replica AIP in storage
  • compares the 2 checksums

The task will thus fail only if the checksums differ, which can only happen if some part of the Item DSpace Object (metadata or bitstream) itself differs. If the version of the item that is believed to be authentic is the repository (local) one, then a simple performance of 'transmitAIP' task on the item will restore synchrony. For collections and communities, this task also does an 'extent' comparison, which means that it will determine whether the replica store has an AIP for every item known (locally) to be in the collection or community.

Repairing Damage

Replication Tasks Used:

Restore Missing Objects(s) from AIP(s) (ID: restorefromaip)

 

Replace Existing Object(s) with AIP(s) (ID: replacewithaip)

 

Restore Missing Object(s) but Keep Existing Objects (ID: restorekeepexisting) (only supported for METS-based AIPs)

 

Restore Single Object from AIP (ID: restoresinglefromaip) (only supported for METS-based AIPs)

 

Replace Single Object with AIP (ID: replacesinglewithaip) (only supported for METS-based AIPs)

The AIPs in the replica store represent an insurance policy, and when 'claims' against that policy are filed, they can cover 2 situations: either the repository object is completely missing, and we want to restore it, or it is damaged and we want to repair the damage with data from the replica store AIP. A pair of replication tasks perform these functions: 'org.dspace.ctask.replicate.RecoverFromAIP' will do the following:

...