Replication Task Suite

The Replication Task Suite is a DSpace 1.8 Add-On which provides a set of curation system tasks to assist in performing replication (backup/restore/audit) of DSpace contents to other locations. The DSpace content is packaged in containers known as AIPs (OAIS speak: 'archival information packages'). You can read much more about how AIPs are constituted here: AIP Backup and Restore. This add-on is also built on the DSpace curation system, which is described here: CurationSystem. We will describe a concrete situation facing a repository data curator, and introduce each task as the need arises. We will also describe some of the technical configuration details to enable these tasks.

Early Access Release Available

An "Early Access" release of the Replication Task Suite is available via SVN at: http://scm.dspace.org/svn/repo/modules/dspace-replicate/tags/dspace-replicate-1.0-EA/

This 1.0-EA (Early Access) release may also be installed via Maven. See Maven-based Installation below.

More Information

More information on the Replication Task Suite is available from the following webinars/screencasts:

The Problem Statement & Usage Examples section below also provides some real-life scenarios/examples of where each Replication task may come in handy.

Source Code:
The Replication Task Suite source code is available at: http://scm.dspace.org/svn/repo/modules/dspace-replicate/
In addition, there is an associated JIRA Issue at: https://jira.duraspace.org/browse/DS-876

Prerequisites

Must be installed on a DSpace 1.8.x System

Known Curation System bug in 1.8.0

DSpace 1.8.0 contains a bug in the Curation System which causes a NullPointerException error to be returned when any curation task is run across the entire site (see DS-1077). This bug directly affects the Replication Task Suite. Even when a replication task succeeds, it will still throw a NullPointerException. You can check the DSpace logs to tell whether the task actually succeeded or not. This bug will be resolved in DSpace 1.8.1.
Because of the above bug, we recommend running the Replication Suite on DSpace 1.8.1 or above.

Developers may obtain an early version of the soon-to-be DSpace 1.8.1 release by accessing the 1.8 Bug-fix Branch in the DSpace SVN: http://scm.dspace.org/svn/repo/dspace/branches/dspace-1_8_x/

Because of enhancements to the Curation System in DSpace 1.8.0, the Replication Suite is only compatible with a DSpace 1.8.x System.

User Interface Compatibility Notes

As the Replication Suite is just a suite of Curation System tasks, it may be called (like any Curation Task) from the following locations:

  • From the Command Line
  • From the Admin UI (XMLUI Only)
  • From Approval Workflow
  • From custom Java code

For more information see the Curation System details on Task Invocation.
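
For command-line use, the following is a minimal sketch of how an invocation could be assembled. It assumes the standard `dspace curate` launcher with its `-t` (task name) and `-i` (object handle) options. The install path `/dspace` and the handle `123456789/4` are placeholders, not values taken from this page. The function only prints the command, so it can be reviewed before anything is run.

```shell
# Hypothetical install location; adjust to your own [dspace] directory.
DSPACE_DIR=${DSPACE_DIR:-/dspace}

# Print the command line that would run one curation task on one object.
# $1 = task name as configured in curate.cfg
# $2 = handle of the community, collection or item to operate on
curate_command() {
    printf '%s/bin/dspace curate -t %s -i %s\n' "$DSPACE_DIR" "$1" "$2"
}

# Example: the size-estimation task against a placeholder collection handle.
curate_command estaipsize 123456789/4
```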

Installation

Maven-based Installation is recommended

At this time, we recommend installing the DSpace Replication Suite via the Maven-based Installation described below. This form of installation ensures that the Replication Suite doesn't require re-installation during your next upgrade.

Maven-based Installation

  1. In your DSpace Source directory ([dspace-src]), you will modify two Maven pom.xml files:

    • [dspace-src]/dspace/pom.xml (This POM controls dependencies of command-line scripts. Modifying it will let you run dspace-replicate from the command line)

    • [dspace-src]/dspace/modules/xmlui/pom.xml (This POM controls dependencies of XMLUI. Modifying it will let you run dspace-replicate from XMLUI)

  2. For both of these pom.xml files, add the following <dependency> section at the end of the existing <dependencies> section (just before the closing </dependencies> tag):
    <dependency>
       <groupId>org.dspace</groupId>
       <artifactId>dspace-replicate</artifactId>
       <version>1.0-EA</version>
    </dependency>
    
  3. Once you've finished modifying both pom.xml files, rebuild DSpace by running the following from your [dspace-src]/dspace/ folder:

    mvn clean package
    
  4. Update your existing DSpace 1.8.x installation by running the following from your [dspace-src]/dspace/target/dspace-1.8.x-SNAPSHOT-build/ directory:

    ant update
    

    Alternatively, if you don't want to do a full DSpace update, you can just update your existing binaries & webapps by running the following two commands:

    • ant update_code (Updates the existing [dspace]/lib/ directory)

    • ant update_webapps (Updates the existing [dspace]/webapps/ directory)

  5. Copy the Replication Suite's configuration files to your DSpace configuration directory
  6. Finally, follow the Configuration settings instructions below to configure the Replication Suite based on your usage needs.
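
Because step 2 has to be applied to two separate pom.xml files, it is easy to miss one. The following is a minimal sketch of a check that could be run before rebuilding. It only looks for the `dspace-replicate` artifactId in each file and does not validate the XML. The `DSPACE_SRC` default is a placeholder for your own [dspace-src] directory.

```shell
# Succeeds if the given pom.xml mentions the dspace-replicate artifact.
has_replicate_dependency() {
    grep -q '<artifactId>dspace-replicate</artifactId>' "$1"
}

# Hypothetical source location; adjust to your own [dspace-src] directory.
DSPACE_SRC=${DSPACE_SRC:-/path/to/dspace-src}

# Check both POMs named in step 1 and report on each.
for pom in "$DSPACE_SRC/dspace/pom.xml" "$DSPACE_SRC/dspace/modules/xmlui/pom.xml"; do
    if [ -f "$pom" ] && has_replicate_dependency "$pom"; then
        echo "ok: $pom"
    else
        echo "dependency not found: $pom"
    fi
done
```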

Manual Installation

Temporary until next DSpace Rebuild

This Manual Installation will only work properly until your next DSpace rebuild. The next time you run 'ant update', you will need to copy the Replication Suite JAR files (see below) back over to your DSpace installation. For a more permanent installation option, see the Maven-based Installation option above.

  1. Download the Replication Suite code
  2. Build/Compile the Replication Suite, by running the following from the root directory
    mvn package
  3. Copy the generated JAR files to your DSpace 1.8 installation.
    1. There are a total of 5 JARs that will need to be copied to your [dspace]/lib/

      • [dspace-replicate]/target/dspace-replicate-[version].jar (The Replication Suite Plugin)

      • [dspace-replicate]/target/lib/common-[version].jar (DuraCloud common libraries - required for DuraCloud integration)

      • [dspace-replicate]/target/lib/commons-compress-[version].jar (Apache Commons Compress - prerequisite for Replication Suite plugin)

      • [dspace-replicate]/target/lib/storageprovider-[version].jar (DuraCloud storage provider libraries - required for DuraCloud integration)

      • [dspace-replicate]/target/lib/storeclient-[version].jar (DuraCloud store client libraries - required for DuraCloud integration)

    2. Also copy the above 5 JARs to your XMLUI web application's WEB-INF/lib directory (e.g. [dspace]/webapps/xmlui/WEB-INF/lib/)

  4. Copy the Replication Suite's configuration files to your DSpace configuration directory
    • Replication Suite Configuration File: Copy [dspace-replicate]/config/modules/replicate.cfg to your [dspace]/config/modules/ directory

    • METS-specific AIP Configuration Settings: Copy [dspace-replicate]/config/modules/replicate-mets.cfg to your [dspace]/config/modules/ directory

    • DuraCloud Configuration File: Copy [dspace-replicate]/config/modules/duracloud.cfg to your [dspace]/config/modules/ directory

  5. Finally, follow the Configuration settings instructions below to configure the Replication Suite based on your usage needs.
    • There is a sample curate.cfg file provided in [dspace-replicate]/config/modules/curate.cfg which can be used as a reference. It is pre-configured to use the DSpace AIP Format (METS-based packaging).
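
Since the five JARs in step 3 must be present in two different lib directories, a quick check after copying can catch an omission. The following is a minimal sketch. It matches each JAR by its artifact prefix and assumes the version part of the file name starts with a digit, which is an assumption made for this example and not something stated on this page.

```shell
# Report which Replication Suite JARs are missing from a lib directory.
# Prints one line per missing JAR and returns the number missing (0 = all present).
check_replicate_jars() {
    lib_dir=$1
    missing=0
    for prefix in dspace-replicate common commons-compress storageprovider storeclient; do
        found=no
        # Assumption: the version suffix begins with a digit (e.g. 1.0-EA).
        for jar in "$lib_dir"/"$prefix"-[0-9]*.jar; do
            if [ -e "$jar" ]; then
                found=yes
            fi
        done
        if [ "$found" = no ]; then
            echo "missing: $prefix-[version].jar"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}
```

It could be run once against [dspace]/lib/ and once against [dspace]/webapps/xmlui/WEB-INF/lib/.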

Configuration

Configuration of the Replication Task Suite is based entirely on your local institution's backup, restore and preservation needs.

Before getting started, you may wish to determine the answers to the following questions:

  1. AIP Format Options: Does your institution want to back up using the default DSpace AIP format (METS packaging)? Or would you rather use the new BagIt AIP Format?
  2. Storage Options: Does your institution plan to use the Replication Suite to back up to a local/mounted drive? Or would you like to connect it to a DuraCloud account?
  3. Additional Options: Do you plan to use Checkm manifests for checksum auditing?

AIP Format Options

One of the first questions to ask yourself is the format you wish to utilize for your AIPs.

There are two options:

  1. DSpace AIP Format (METS-based) (default) - This is the same AIP format utilized by the DSpace AIP Backup and Restore feature, so it is 100% compatible with that existing feature. In fact when using this format, the Replication Task Suite just "wraps" calls to the AIP Backup and Restore feature itself.
  2. BagIt AIP Format - This is a new AIP format provided by the Replication Task Suite. It generates AIPs in the BagIt File Packaging Format. Institutions which are already familiar with BagIt or use it elsewhere may find this format preferable.

Configuring usage of DSpace default AIP Format (METS-based)

This section goes through the steps of configuring the Replication Suite to use the default DSpace AIP format, which utilizes METS packaging.

  1. General Curation Configuration: First, in your [dspace]/config/modules/curate.cfg you will want to enable & configure the METS-based replication tasks. (NOTE: there is a sample curate.cfg file provided in [dspace-replicate]/config/modules/curate.cfg which is pre-configured to use METS-based AIPs).

    • Enable the Replication Tasks: In the list of "Task Class implementations" (plugin.named.org.dspace.curate.CurationTask), add the following.
      REMEMBER to add a comma and backslash (", \") after each line (except the final line).
      plugin.named.org.dspace.curate.CurationTask = \
          ... (YOUR EXISTING TASKS) ... , \
          org.dspace.ctask.replicate.EstimateAIPSize = estaipsize, \
          org.dspace.ctask.replicate.ReadOdometer = readodometer, \
          org.dspace.ctask.replicate.TransmitAIP = transmitaip, \
          org.dspace.ctask.replicate.VerifyAIP = verifyaip, \
          org.dspace.ctask.replicate.FetchAIP = fetchaip, \
          org.dspace.ctask.replicate.CompareWithAIP = auditaip, \
          org.dspace.ctask.replicate.RemoveAIP = removeaip, \
          org.dspace.ctask.replicate.METSRestoreFromAIP = restorefromaip, \
          org.dspace.ctask.replicate.METSRestoreFromAIP = replacewithaip, \
          org.dspace.ctask.replicate.METSRestoreFromAIP = restorekeepexisting, \
          org.dspace.ctask.replicate.METSRestoreFromAIP = restoresinglefromaip, \
          org.dspace.ctask.replicate.METSRestoreFromAIP = replacesinglewithaip
      
    • Give Each Task a Human-Friendly Task Name: Under the ui.tasknames setting, give each of the above Tasks a human-friendly name. Here are some recommended values, but you are welcome to tweak them.
      REMEMBER to add a comma and backslash (", \") after each line (except the final line).
      ui.tasknames = \
          ... (YOUR EXISTING TASK NAMES) ... , \
          estaipsize = Estimate Storage Space for AIP(s), \
          readodometer = Read Odometer, \
          transmitaip = Transmit AIP(s) to Storage, \
          verifyaip = Verify AIP(s) exist in Storage, \
          fetchaip = Fetch AIP(s) from Storage, \
          auditaip = Audit against AIP(s), \
          removeaip = Remove AIP(s) from Storage, \
          restorefromaip = Restore Missing Object(s) from AIP(s), \
          replacewithaip = Replace Existing Object(s) with AIP(s), \
          restorekeepexisting = Restore Missing Object(s) but Keep Existing Objects, \
          restoresinglefromaip = Restore Single Object from AIP, \
          replacesinglewithaip = Replace Single Object with AIP
      
    • Optionally Create a Task Group: Finally, if you'd like to create a Task Group for these tasks, you can create a group named "replicate" and add them all to it. The below is just an example for how you may wish to set the ui.taskgroups and ui.taskgroup.* settings. It creates two Task Groups: (1) a "General Purpose Tasks" group for a few default DSpace Curation Tasks, and (2) a "Replication Suite Tasks" group for all these new Replication tasks.
      # Tasks may be organized into named groups which display together in UI drop-downs
      ui.taskgroups = \
         general = General Purpose Tasks, \
         replicate = Replication Suite Tasks
      
      # Group membership is defined using comma-separated lists of task names, one property per group
      ui.taskgroup.general = profileformats, requiredmetadata, checklinks
      ui.taskgroup.replicate = estaipsize, readodometer, transmitaip, verifyaip, fetchaip, auditaip, removeaip, restorefromaip, replacewithaip, restorekeepexisting, restoresinglefromaip, replacesinglewithaip
      
  2. Replication Suite Configuration: Next, in your [dspace]/config/modules/replicate.cfg you will want to ensure it is set up to properly use METS-based AIPs. Under the "AIP Packaging Settings" you'll want the following settings enabled:

    # Package type. Permitted values: 'mets', 'bagit'
    # mets = Generate default DSpace AIPs as described in: https://wiki.duraspace.org/display/DSDOC18/AIP+Backup+and+Restore
    # bagit = Generate AIPs based on the BagIt packaging format: https://wiki.ucop.edu/display/Curation/BagIt
    packer.pkgtype = mets
    
    # Format of package compression. Permitted values: 'zip' or 'tgz'
    # for 'mets' packages, only 'zip' is supported
    packer.archfmt = zip
    
    # Whether or not to name packages with a DSpace type prefix.
    # When 'true', package files are named [type]@[handle].[format] (e.g. ITEM@123456789-1.zip)
    # When 'false', package files are named [handle].[format] (e.g. 123456789-1.zip)
    # Defaults to 'true'. For 'mets' packages, this must be 'true'.
    packer.typeprefix = true
    
  3. Optionally tweak the AIP Restore/Replace settings: Optionally, you can decide to tweak the way AIPs are restored or replaced (using AIP Backup and Restore). These settings normally should not need to be tweaked, but are available in the [dspace]/config/modules/replicate-mets.cfg configuration file. See that configuration file for more details.
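
To make the effect of packer.typeprefix and packer.archfmt on file names concrete, here is a minimal sketch that builds a package name the way the configuration comments above describe it. The replacement of "/" by "-" in the handle is inferred from the documented example (ITEM@123456789-1.zip) and is an assumption about the suite's behaviour, not a confirmed rule. The function name is made up for this example.

```shell
# Build an AIP file name from object type, handle, archive format and the
# packer.typeprefix setting ("true" or "false", defaulting to "true").
aip_file_name() {
    obj_type=$1
    handle=$2
    format=$3
    typeprefix=${4:-true}
    # Assumption: the "/" in a handle is written as "-" in the file name,
    # as the documented example ITEM@123456789-1.zip suggests.
    safe_handle=$(printf '%s' "$handle" | tr '/' '-')
    if [ "$typeprefix" = true ]; then
        printf '%s@%s.%s\n' "$obj_type" "$safe_handle" "$format"
    else
        printf '%s.%s\n' "$safe_handle" "$format"
    fi
}

aip_file_name ITEM 123456789/1 zip          # prints ITEM@123456789-1.zip
aip_file_name ITEM 123456789/1 zip false    # prints 123456789-1.zip
```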

Configuring usage of DSpace BagIt AIP Format

This section goes through the steps of configuring the Replication Suite to use BagIt-based AIPs. For more information on the BagIt packaging format, see: https://wiki.ucop.edu/display/Curation/BagIt

  1. General Curation Configuration: First, in your [dspace]/config/modules/curate.cfg you will want to enable & configure the BagIt-based replication tasks. (NOTE: there is a sample curate.cfg file provided in [dspace-replicate]/config/modules/curate.cfg which provides example settings).

    • Enable the Replication Tasks: In the list of "Task Class implementations" (plugin.named.org.dspace.curate.CurationTask), add the following.
      REMEMBER to add a comma and backslash (", \") after each line (except the final line).
      plugin.named.org.dspace.curate.CurationTask = \
          ... (YOUR EXISTING TASKS) ... , \
          org.dspace.ctask.replicate.EstimateAIPSize = estaipsize, \
          org.dspace.ctask.replicate.ReadOdometer = readodometer, \
          org.dspace.ctask.replicate.TransmitAIP = transmitaip, \
          org.dspace.ctask.replicate.VerifyAIP = verifyaip, \
          org.dspace.ctask.replicate.FetchAIP = fetchaip, \
          org.dspace.ctask.replicate.CompareWithAIP = auditaip, \
          org.dspace.ctask.replicate.RemoveAIP = removeaip, \
          org.dspace.ctask.replicate.BagItRestoreFromAIP = restorefromaip, \
          org.dspace.ctask.replicate.BagItReplaceWithAIP = replacewithaip
      
    • Give Each Task a Human-Friendly Task Name: Under the ui.tasknames setting, give each of the above Tasks a human-friendly name. Here are some recommended values, but you are welcome to tweak them.
      REMEMBER to add a comma and backslash (", \") after each line (except the final line).
      ui.tasknames = \
          ... (YOUR EXISTING TASK NAMES) ... , \
          estaipsize = Estimate Storage Space for AIP(s), \
          readodometer = Read Odometer, \
          transmitaip = Transmit AIP(s) to Storage, \
          verifyaip = Verify AIP(s) exist in Storage, \
          fetchaip = Fetch AIP(s) from Storage, \
          auditaip = Audit/Compare against AIP(s), \
          removeaip = Remove AIP(s) from Storage, \
          restorefromaip = Restore Missing Object(s) from AIP(s), \
          replacewithaip = Replace Existing Object(s) with AIP(s)
      
    • Optionally Create a Task Group: Finally, if you'd like to create a Task Group for these tasks, you can create a group named "replicate" and add them all to it. The below is just an example for how you may wish to set the ui.taskgroups and ui.taskgroup.* settings. It creates two Task Groups: (1) a "General Purpose Tasks" group for a few default DSpace Curation Tasks, and (2) a "Replication Suite Tasks" group for all these new Replication tasks.
      # Tasks may be organized into named groups which display together in UI drop-downs
      ui.taskgroups = \
         general = General Purpose Tasks, \
         replicate = Replication Suite Tasks
      
      # Group membership is defined using comma-separated lists of task names, one property per group
      ui.taskgroup.general = profileformats, requiredmetadata, checklinks
      ui.taskgroup.replicate = estaipsize, readodometer, transmitaip, verifyaip, fetchaip, auditaip, removeaip, restorefromaip, replacewithaip
      
  2. Replication Suite Configuration: Next, in your [dspace]/config/modules/replicate.cfg you will want to ensure it is set up to properly use BagIt-based AIPs. Under the "AIP Packaging Settings" you'll want the following settings enabled:

    # Package type. Permitted values: 'mets', 'bagit'
    # mets = Generate default DSpace AIPs as described in: https://wiki.duraspace.org/display/DSDOC18/AIP+Backup+and+Restore
    # bagit = Generate AIPs based on the BagIt packaging format: https://wiki.ucop.edu/display/Curation/BagIt
    packer.pkgtype = bagit
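
The "comma and backslash" rule for the multi-line settings in curate.cfg is easy to get wrong, and a line that ends in a comma without the backslash silently cuts the value short. The following is a minimal sketch of a check for that one mistake. It is a heuristic written for this page, not part of DSpace, and it does not catch every malformed property.

```shell
# Print any non-comment line of a .cfg file that ends in a comma with no
# trailing backslash. Returns 1 if at least one such line was found.
find_broken_continuations() {
    awk '!/^[ \t]*#/ && /,[ \t]*$/ {
             printf "%s:%d: %s\n", FILENAME, FNR, $0
             bad = 1
         }
         END { exit bad + 0 }' "$1"
}
```

For example, `find_broken_continuations [dspace]/config/modules/curate.cfg` would list the file name, line number and text of each suspicious line.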
    

Storage Options

Where your AIPs will be stored is the next decision to make. There are three options currently available:

  1. Local Storage: Replicate/Backup content to another location (folder) on your local filesystem.
  2. Mountable Storage: Replicate/Backup content to a mounted external filesystem (e.g. NFS-mounted drive).
  3. DuraCloud Storage: Replicate/Backup content to an existing DuraCloud account.

Configuring Local Storage

The local storage option may also be used for a mounted drive / SAN which just appears as though it is a local filesystem folder. However, some mounted drives (e.g. NFS-mounted drives) may need to use the Mountable Storage option instead.

Before configuring a local storage option, please ensure you have enough space available on your local hard drive (or mounted drive/SAN if your local folder is actually remote storage). You can use the "Estimate Storage Space" (estaipsize) task to estimate the amount of new storage space you will need.

To configure local storage, please change the following settings in your [dspace]/config/modules/replicate.cfg configuration file:

  1. Enable Local Storage Plugin: Ensure the Replication Suite is set up to use the 'LocalObjectStore' plugin
    # Replica store implementation class (specify one)
    plugin.single.org.dspace.ctask.replicate.ObjectStore = \
        org.dspace.ctask.replicate.store.LocalObjectStore
    
  2. Configure Local Storage Folder: Configure the location where you want all AIPs to be stored on your local filesystem. This defaults to the [dspace]/repstore folder. However, we recommend changing this to at least a separate hard drive from your existing DSpace installation directory! This ensures that your content will not all be lost in the case of a hard drive failure.

    # Location of local (e.g. local, mountable, sync) object store
    # ignored for non-local stores (e.g. DuraCloud)
    store.dir = ${dspace.dir}/repstore
    
  3. Optionally Configure Subfolder Settings: Optionally, you can configure the sub-folder names (under store.dir) which will be used to store AIPs, checkm manifests (if enabled), etc.
    # The storage group / folder where AIPs are stored/retrieved when AIP based tasks 
    # (e.g. "Transmit AIP", "Recover from AIP") are executed.
    # For Local object stores, this group name corresponds to a subfolder in the 'store.dir'
    # For DuraCloud object stores, this group name corresponds to a DuraCloud Space ID (Space must already exist)
    group.aip.name = aips
    
    # The storage group / folder where Checkm Manifests are stored/retrieved when Checkm Manifest based tasks are executed
    # (org.dspace.ctask.replicate.checkm.*).
    # For Local object stores, this group name corresponds to a subfolder in the 'store.dir'
    # For DuraCloud object stores, this group name corresponds to a DuraCloud Space ID (Space must already exist)
    group.manifest.name = manifests
    
    # The storage group / folder where AIPs are temporarily stored/retrieved when an object deletion occurs
    # and the ReplicationConsumer is enabled (see below). Essentially, this 'delete' group provides a 
    # location where AIPs can be temporarily kept in case the deletion needs to be reverted and the object restored.
    # WARNING: THIS MUST NOT BE SET TO THE SAME VALUE AS 'group.aip.name'. If it is set to the 
    # same value, then your AIP backup processes will be UNSTABLE and restoration may be difficult or impossible.
    # For Local object stores, this group name corresponds to a subfolder in the 'store.dir'
    # For DuraCloud object stores, this group name corresponds to a DuraCloud Space ID (Space must already exist)
    group.delete.name = deletes
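
    The warning above about group.delete.name can be checked mechanically. The following is a minimal sketch, assuming simple single-line "key = value" entries in replicate.cfg. The helper names are made up for this example, and the parser is deliberately naive (no continuation lines, last definition wins).

```shell
# Read a simple single-line "key = value" property from a .cfg file.
# $1 = file, $2 = key. Prints the value, or an empty line if the key is absent.
cfg_get() {
    awk -v key="$2" '
        /^[ \t]*#/ { next }                      # skip comment lines
        index($0, "=") > 0 {
            k = substr($0, 1, index($0, "=") - 1)
            v = substr($0, index($0, "=") + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)       # trim the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)       # trim the value
            if (k == key) val = v
        }
        END { print val }' "$1"
}

# Fail if group.delete.name is set to the same value as group.aip.name.
check_delete_group() {
    aip=$(cfg_get "$1" group.aip.name)
    del=$(cfg_get "$1" group.delete.name)
    if [ -n "$aip" ] && [ "$aip" = "$del" ]; then
        echo "ERROR: group.delete.name must differ from group.aip.name ($aip)"
        return 1
    fi
    echo "OK: aip=$aip delete=$del"
}
```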
    

Configuring Mountable Storage

Before configuring a mounted storage option, please ensure you have enough space available on your external, mounted drive/SAN. You can use the "Estimate Storage Space" (estaipsize) task to estimate the amount of new storage space you will need.

To configure mountable storage, please change the following settings in your [dspace]/config/modules/replicate.cfg configuration file:

  1. Enable Mountable Storage Plugin: Ensure the Replication Suite is set up to use the 'MountableObjectStore' plugin
    # Replica store implementation class (specify one)
    plugin.single.org.dspace.ctask.replicate.ObjectStore = \
        org.dspace.ctask.replicate.store.MountableObjectStore
    
  2. Configure Mounted Folder: Configure the location where you want all AIPs to be stored. The folder should already be mounted on your local filesystem. This defaults to the [dspace]/repstore folder.

    # Location of local (e.g. local, mountable, sync) object store
    # ignored for non-local stores (e.g. DuraCloud)
    store.dir = ${dspace.dir}/repstore
    
  3. Optionally Configure Subfolder Settings: Optionally, you can configure the sub-folder names (under store.dir) which will be used to store AIPs, checkm manifests (if enabled), etc.
    # The storage group / folder where AIPs are stored/retrieved when AIP based tasks 
    # (e.g. "Transmit AIP", "Recover from AIP") are executed.
    # For Local object stores, this group name corresponds to a subfolder in the 'store.dir'
    # For DuraCloud object stores, this group name corresponds to a DuraCloud Space ID (Space must already exist)
    group.aip.name = aips
    
    # The storage group / folder where Checkm Manifests are stored/retrieved when Checkm Manifest based tasks are executed
    # (org.dspace.ctask.replicate.checkm.*).
    # For Local object stores, this group name corresponds to a subfolder in the 'store.dir'
    # For DuraCloud object stores, this group name corresponds to a DuraCloud Space ID (Space must already exist)
    group.manifest.name = manifests
    
    # The storage group / folder where AIPs are temporarily stored/retrieved when an object deletion occurs
    # and the ReplicationConsumer is enabled (see below). Essentially, this 'delete' group provides a 
    # location where AIPs can be temporarily kept in case the deletion needs to be reverted and the object restored.
    # WARNING: THIS MUST NOT BE SET TO THE SAME VALUE AS 'group.aip.name'. If it is set to the 
    # same value, then your AIP backup processes will be UNSTABLE and restoration may be difficult or impossible.
    # For Local object stores, this group name corresponds to a subfolder in the 'store.dir'
    # For DuraCloud object stores, this group name corresponds to a DuraCloud Space ID (Space must already exist)
    group.delete.name = deletes
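
    Before pointing store.dir at a mounted drive, it may help to compare the free space there with the figure reported by the "Estimate Storage Space" task. The following is a minimal sketch using the POSIX `df -Pk` output format. It does not parse the task's report, so the required size has to be supplied by hand in KiB. The function names are made up for this example.

```shell
# Print the space available (in KiB) on the filesystem holding a directory.
available_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeeds if the directory's filesystem has at least the requested KiB free.
# $1 = directory (e.g. your store.dir), $2 = required space in KiB
has_room_for() {
    [ "$(available_kb "$1")" -ge "$2" ]
}
```

    For instance, `has_room_for /mnt/repstore 1048576 && echo "enough space"` would test for 1 GiB free, where /mnt/repstore is a placeholder mount point.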
    

Configuring DuraCloud Storage

DuraCloud Account Settings

In order to configure DuraCloud Storage, you first must have an existing DuraCloud Account. This account's settings should be configured in your [dspace]/config/modules/duracloud.cfg file as follows:

  1. DuraCloud HostName: This is the location of your DuraCloud instance (the URL you tend to access for your account). Just provide the hostname.
    # DuraCloud service location (just the hostname)
    host = demo.duracloud.org
    
  2. DuraCloud Service Port: This is the port that DuraCloud is running on. It is almost always "443" unless you have installed DuraCloud yourself and configured it differently.
    # DuraCloud service port (usually 443 for https)
    port = 443
    
  3. DuraCloud's "DuraStore" path: This is the path to DuraCloud's "DuraStore" service. It is almost always "durastore" unless you have installed DuraCloud yourself and configured it differently.
    context = durastore
    
  4. DuraCloud Username & Password: Finally, fill out your account username & password in these settings. Please note, as this file now contains your DuraCloud account information, we recommend securing it (if possible). Just ensure it is still readable by the system user that DSpace runs as.
    # DuraCloud user name
    username = rep-agent
    # DuraCloud password
    password = passw0rd
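
    One way to follow the advice about securing this file is to restrict it to its owner. The following is a minimal sketch, assuming the file is owned by the system user that DSpace runs as. If it is owned by someone else, this would lock DSpace out, so check ownership first. The function name is made up for this example.

```shell
# Restrict a config file to owner read/write only and print the resulting mode.
secure_cfg() {
    chmod 600 "$1" || return 1
    # The first ten characters of "ls -l" are the file type and permission bits.
    ls -l "$1" | cut -c1-10
}
```

    Running `secure_cfg [dspace]/config/modules/duracloud.cfg` should print -rw------- when it succeeds.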
    
DuraCloud Storage Settings

Now, to configure DuraCloud as your storage location please change the following settings in your [dspace]/config/modules/replicate.cfg configuration file:

  1. Enable DuraCloud Storage Plugin: Ensure the Replication Suite is set up to use the 'DuraCloudObjectStore' plugin
    # Replica store implementation class (specify one)
    plugin.single.org.dspace.ctask.replicate.ObjectStore = \
        org.dspace.ctask.replicate.store.DuraCloudObjectStore
    
  2. Configure DuraCloud Primary Space to use: Your DuraCloud account allows you to separate content into various "Spaces". You'll need to create a new DuraCloud Space that your AIPs will be stored within, and configure that as your group.aip.name (by default it's set to a DuraCloud Space with ID of "aips"). You should also create a new DuraCloud Space that your AIPs will be moved to if they are ever removed, and configure that as your group.delete.name. Optionally, if you are using Checkm manifests, you can also create and configure a group.manifest.name DuraCloud Space
    # The storage group / folder where AIPs are stored/retrieved when AIP based tasks 
    # (e.g. "Transmit AIP", "Recover from AIP") are executed.
    # For Local object stores, this group name corresponds to a subfolder in the 'store.dir'
    # For DuraCloud object stores, this group name corresponds to a DuraCloud Space ID (Space must already exist)
    group.aip.name = aips
    
  3. Optionally, Configure Additional DuraCloud Spaces: If you have chosen to utilize Checkm manifest validation, you will need to create and configure a DuraCloud Space corresponding to the group.manifest.name setting below. Additionally, if you have chosen to enable the Automatic Replication, you will need to create and configure a DuraCloud Space corresponding to the group.delete.name setting below.
    # The storage group / folder where Checkm Manifests are stored/retrieved when Checkm Manifest based tasks are executed
    # (org.dspace.ctask.replicate.checkm.*).
    # For Local object stores, this group name corresponds to a subfolder in the 'store.dir'
    # For DuraCloud object stores, this group name corresponds to a DuraCloud Space ID (Space must already exist)
    group.manifest.name = manifests
    
    # The storage group / folder where AIPs are temporarily stored/retrieved when an object deletion occurs
    # and the ReplicationConsumer is enabled (see below). Essentially, this 'delete' group provides a 
    # location where AIPs can be temporarily kept in case the deletion needs to be reverted and the object restored.
    # WARNING: THIS MUST NOT BE SET TO THE SAME VALUE AS 'group.aip.name'. If it is set to the 
    # same value, then your AIP backup processes will be UNSTABLE and restoration may be difficult or impossible.
    # For Local object stores, this group name corresponds to a subfolder in the 'store.dir'
    # For DuraCloud object stores, this group name corresponds to a DuraCloud Space ID (Space must already exist)
    group.delete.name = deletes
    

Additional Options

Configuring usage of Checkm manifest validation

This section goes through the steps of configuring the usage of Checkm manifest tasks. These tasks provide a capability to store DSpace content checksums externally to DSpace in the Checkm Manifest format. Some institutions may find this to be a useful replacement for the default DSpace Checksum Checker/Validator, which only stores/validates checksums internal to the DSpace system.

However, as this is an optional set of tasks, they are disabled by default. Should you wish to enable these tasks, just do the following:

  1. General Curation Configuration: First, in your [dspace]/config/modules/curate.cfg you will want to enable & configure the Checkm Manifest tasks. (NOTE: there is a sample curate.cfg file provided in [dspace-replicate]/config/modules/curate.cfg which provides example settings).

    • Enable the Checkm Tasks: In the list of "Task Class implementations" (plugin.named.org.dspace.curate.CurationTask), add the following.
      REMEMBER to add a comma and backslash (", \") after each line (except the final line).
      plugin.named.org.dspace.curate.CurationTask = \
          ... (YOUR EXISTING TASKS) ... , \
          org.dspace.ctask.replicate.checkm.TransmitManifest = transmitmanifest, \
          org.dspace.ctask.replicate.checkm.VerifyManifest = verifymanifest, \
          org.dspace.ctask.replicate.checkm.FetchManifest = fetchmanifest, \
          org.dspace.ctask.replicate.checkm.CompareWithManifest = auditmanifest, \
          org.dspace.ctask.replicate.checkm.RemoveManifest = removemanifest
      
    • Give Each Task a Human-Friendly Task Name: Under the ui.tasknames setting, give each of the above Tasks a human-friendly name. Here are some recommended values, but you are welcome to tweak them.
      REMEMBER to add a comma and backslash (", \") after each line (except the final line).
      ui.tasknames = \
          ... (YOUR EXISTING TASK NAMES) ... , \
          transmitmanifest = Transmit Checkm Manifest to Storage, \
          verifymanifest = Verify Checkm Manifest exists in Storage, \
          fetchmanifest = Fetch Checkm Manifest from Storage, \
          auditmanifest = Audit against Checkm Manifest, \
          removemanifest = Remove Checkm Manifest from Storage
      
    • Optionally Create a Task Group: Finally, if you'd like to create a Task Group for these tasks, you can create a group named "checkm" and add them all to it. The below is just an example for how you may wish to set the ui.taskgroups and ui.taskgroup.* settings. It creates two Task Groups: (1) a "General Purpose Tasks" group for a few default DSpace Curation Tasks, and (2) a "Checkm Validation Tasks" group for all these new Checkm tasks.
      # Tasks may be organized into named groups which display together in UI drop-downs
      ui.taskgroups = \
         general = General Purpose Tasks, \
         checkm = Checkm Validation Tasks
      
      # Group membership is defined using comma-separated lists of task names, one property per group
      ui.taskgroup.general = profileformats, requiredmetadata, checklinks
      ui.taskgroup.checkm = transmitmanifest, verifymanifest, fetchmanifest, auditmanifest, removemanifest
      

Problem Statement & Usage Examples

We can suppose our data curator has identified a collection of items in her DSpace repository consisting of high-value, born-digital, and unique/irreplaceable (not held elsewhere) content. She prudently wishes to insure against catastrophic local loss of this content by keeping a copy or replica of this collection elsewhere. She'd prefer to replicate all her DSpace content, but realizes that storage costs over long periods have made her administration wary, so she decides to begin with this collection.

First Steps - Estimation

Replication Task Used:

Estimate Storage Space for AIP(s)

Task ID: estaipsize

In order to budget for replication storage, she needs to know the 'size' of the collection. When she asks her sysadmin, he replies that it is easy to give her figures for the whole asset store, but since collections aren't stored separately, she would have to add up each item's bitstreams in the collection, a rather tedious process. Thus the first task: a reporting tool which operates on natural DSpace objects, rather than storage volumes.

To install this task, edit [dspace]/config/modules/curate.cfg (NB: all curation configuration is 'modular' in the sense that the configuration properties live outside of dspace.cfg, in named files. This means that if a given suite of tasks is unused, its configuration is never installed). First, add the task to the list of curation tasks.

plugin.named.org.dspace.curate.CurationTask = \
    ... (YOUR EXISTING TASKS) ... , \
    org.dspace.ctask.replicate.EstimateAIPSize = estaipsize

Next, in the same file, add this task to the list that appears in the administrative UI:

ui.tasknames = \
    ... (YOUR EXISTING TASK NAMES) ... , \
    estaipsize = Estimate Storage Space for AIP(s)

Of course, both the name of the task ('estaipsize'), and the language for the UI are up to you. Now the curator can navigate to her collection, select the 'curate' tab, and then from the dropdown list of tasks choose the entry, and perform the task. On the page, the results will display:

ID: 123456789/1 (Amazing Images) estimated AIP size: 4 gigabytes

The estimates from this task are rather crude, in that they do not measure the actual AIPs, but just the bitstreams (so they ignore the metadata XML), but they should be fine for storage costing and allocation purposes.
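The idea behind the estimate can be sketched as follows. This is a minimal illustration, not the task's actual implementation: it assumes a hypothetical in-memory structure in which a collection is a list of items and each item is a list of bitstream sizes in bytes. The function names and sample sizes are invented for the example.

```python
def estimate_collection_size(items):
    """Sum bitstream sizes (in bytes) over all items; metadata is ignored."""
    return sum(size for bitstreams in items for size in bitstreams)


def human_readable(num_bytes):
    """Scale a byte count to the largest unit that keeps the value below 1024."""
    units = ["bytes", "kilobytes", "megabytes", "gigabytes", "terabytes"]
    value = float(num_bytes)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


# Hypothetical collection: two items holding three bitstreams in total.
amazing_images = [[2 * 1024**3, 512 * 1024**2], [1536 * 1024**2]]
total = estimate_collection_size(amazing_images)
print(f"estimated AIP size: {human_readable(total)}")
```

Because the sum walks items rather than storage volumes, it works equally well for a single item (a list with one entry) or a whole collection.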

Replicating

Replication Task Used:

Transmit AIP(s) to Storage

Task ID: transmitaip

Having secured approval to replicate the 'Amazing Images' collection, our curator needs a task to generate the AIP representations of each item in the collection, and transmit these archive files to the replication storage site (which may be service-backed, local, in the cloud, etc., as will be explored below). Adding this task is just like the previous step: add the configuration properties to curate.cfg. (We won't repeat a description of this process each time, but note that you may always add a task yet elect not to display it in the administrative UI.) This task is 'org.dspace.ctask.replicate.TransmitAIP'.

Since we are now working with AIPs, we should examine how they are configured to the tasks. Most configuration specific to the replication task suite is found at [dspace]/config/modules/replicate.cfg. There are two main properties to set (or accept default values):

# Package type. Permitted values: 'mets', 'bagit'
packer.pkgtype = mets
# Format of package compression. Permitted values: 'zip' or 'tgz'
# for 'mets' packages, only zip is supported
packer.archfmt = zip

The default values will create a METS-based AIP in the default DSpace AIP Format, compressed into a 'zip' archive. The other alternative supported by the replication task suite is Library of Congress 'BagIt' packaging, which may be compressed either into a 'zip' file or a 'tgz' ('gzipped tar') file, a compression format more common on Unix systems.
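The permitted combinations of these two properties can be checked with a small helper before editing replicate.cfg. This is a sketch only: the function name is hypothetical, and the table of allowed values is taken from the configuration comments shown above rather than from the add-on's code.

```python
# Allowed archive formats per package type, as stated in the replicate.cfg comments.
ALLOWED_FORMATS = {
    "mets": {"zip"},
    "bagit": {"zip", "tgz"},
}


def check_packer_settings(pkgtype, archfmt):
    """Return None if the pair is permitted, otherwise a short error message."""
    if pkgtype not in ALLOWED_FORMATS:
        return f"unknown packer.pkgtype: {pkgtype!r}"
    if archfmt not in ALLOWED_FORMATS[pkgtype]:
        allowed = ", ".join(sorted(ALLOWED_FORMATS[pkgtype]))
        return f"packer.archfmt {archfmt!r} not supported for {pkgtype!r} (use: {allowed})"
    return None


print(check_packer_settings("mets", "zip"))    # default settings
print(check_packer_settings("mets", "tgz"))    # rejected: METS packages are zip only
```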

Our data curator may elect to perform this task in the admin GUI, or, if the collection is rather large, she may instead 'queue' the task for later execution by using the queueing facility available in the curation system. We should note that the 'transmitaip' task, like all other replication tasks, operates on whatever DSpace object it is given. Thus, if the object is a collection, the task creates (and transmits, of course) an AIP for the collection object itself (metadata and logo), as well as AIPs for each item in the collection. If the task is given an identifier for a single Item, then only one AIP will be created.

Verifying Replication

Replication Task Used:

Verify AIP(s) exist in Storage

Task ID: verifyaip

While the transmitAIP task will report on whether or not it was successful in generating and transmitting AIP(s) to the replication service, our data curator wants the ability (within DSpace, not by using the replication service tools or UIs) to check whenever she likes that the AIP(s) which were transmitted are still there. A simple task 'org.dspace.ctask.replicate.VerifyAIP' can perform this function.

Ensuring Replica Integrity and Accuracy over time

Replication Task Used:

Audit against AIP(s)

Task ID: auditaip

The 'Amazing Images' collection is comparatively static, meaning that few new items are likely to be added, and most of the metadata in each item is not routinely changed. However, over longer periods of time, cataloging errors are discovered and corrected, and perhaps formats become obsolete and new bitstreams are added. If the curator is fastidious about each change, and performs the 'transmitaip' task on each item that has changed, then in general the set of AIP replicas will always be 'in sync' with the repository. However, it is useful to have the means to ensure that the replicas agree with the repository without having to create and transmit entirely new ones. Thus the task: 'org.dspace.ctask.replicate.CompareWithAIP', which can also be thought of as a simple audit task. When performed on an Item, the task does the following:

  1. generates an AIP for the DSpace object locally (but does not transmit it)
  2. computes an MD5 checksum on the local AIP
  3. requests from the replication storage service an MD5 checksum for the AIP in storage
  4. compares the 2 checksums

The task will thus fail only if the checksums differ, which can only happen if some part of the DSpace object (metadata or bitstream) itself differs. If the version of the item that is believed to be authentic is the repository (local) one, then simply performing the 'transmitaip' task on the item will restore synchrony. For collections and communities, this task also does an 'extent' comparison, which means that it will determine whether the replica store has an AIP for every item known (locally) to be in the collection or community.
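The checksum and 'extent' comparisons described above can be sketched as follows. This is an illustration under stated assumptions, not the CompareWithAIP implementation: the AIP is represented as raw bytes, the checksum reported by the storage service is passed in as a plain string, and all function names are hypothetical.

```python
import hashlib


def md5_hex(data):
    """Compute the MD5 checksum of an AIP given as bytes."""
    return hashlib.md5(data).hexdigest()


def checksums_agree(local_aip_bytes, remote_checksum):
    """Compare the locally generated AIP against the checksum the store reports."""
    return md5_hex(local_aip_bytes) == remote_checksum.lower()


def missing_from_store(local_item_ids, stored_ids):
    """Extent check: items known locally that have no AIP in the replica store."""
    return sorted(set(local_item_ids) - set(stored_ids))


# Hypothetical data: one AIP payload and two lists of item handles.
local_aip = b"example AIP payload"
print(checksums_agree(local_aip, md5_hex(local_aip)))
print(missing_from_store(["123456789/2", "123456789/3"], ["123456789/2"]))
```

Only checksums travel over the network in this scheme, which is why the audit is cheaper than creating and transmitting new AIPs.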

Repairing Damage

Replication Tasks Used:

Restore Missing Object(s) from AIP(s)

Task ID: restorefromaip

 

Replace Existing Object(s) with AIP(s)

Task ID: replacewithaip

 

Restore Missing Object(s) but Keep Existing Objects (*METS-AIP)

Task ID: restorekeepexisting

 

Restore Single Object from AIP (*METS-AIP)

Task ID: restoresinglefromaip

 

Replace Single Object with AIP (*METS-AIP)

Task ID: replacesinglewithaip

NOTE: Those tasks marked (*METS-AIP) are only supported when using METS-based AIPs

The AIPs in the replica store represent an insurance policy, and when 'claims' against that policy are filed, they can cover 2 situations: either the repository object is completely missing, and we want to restore it, or it is damaged and we want to repair the damage with data from the replica store AIP. A set of replication tasks perform these functions:

Restoring Object(s)

The "Restore" (restorefromaip) task will do the following:

  1. fetch the replica store AIP for the given object identifier
  2. decompress it and create a new DSpace object
  3. install the object into the repository, including restoring its state (withdrawn, embargoed, etc.)
  4. if the object is a collection or community, all child objects (e.g. items) will also have their AIP fetched, decompressed and restored

NOTE: This restorefromaip task will fail if there is already an object in the repository bearing the identifier given.

If you are using METS-based AIPs, two additional restoration tasks are available:

  • Restore Single Object from AIP (restoresinglefromaip)
    • This task acts the same as the default "restorefromaip" task, but it does NOT restore any child objects. So, if it is run on a collection, just the collection itself will be restored (items in that collection will not be restored).
  • Restore Missing Object(s) but Keep Existing Objects (restorekeepexisting)
    • This task acts similarly to the default "restorefromaip" task, but it attempts to skip over any objects which already exist in the repository. In other words, an error is not thrown if an object already exists; instead, that entire object (and all its child objects) is skipped over during processing and left unchanged. This mode is identical to the "Keep Existing" mode of the DSpace AIP Backup and Restore tool.
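The differences between the three restore tasks can be sketched with a toy model. This is an illustration only, not the add-on's code: the replica store is modeled as a hypothetical dictionary mapping each identifier to its child identifiers, the repository as a set of identifiers, and the sketch assumes the default task fails on any existing object it encounters.

```python
def restore(aips, repository, identifier, mode="restorefromaip"):
    """Restore `identifier` into `repository` (a set, mutated in place).

    aips: dict mapping an identifier to the list of its child identifiers.
    Returns the list of identifiers that were restored.
    """
    if identifier in repository:
        if mode == "restorekeepexisting":
            return []  # skip this object and its whole subtree
        raise ValueError(f"{identifier} already exists in the repository")
    repository.add(identifier)
    restored = [identifier]
    if mode != "restoresinglefromaip":  # the 'single' task never descends
        for child in aips.get(identifier, []):
            restored += restore(aips, repository, child, mode)
    return restored


# Hypothetical replica store: one collection with two items.
store = {"collection": ["item1", "item2"]}
print(restore(store, set(), "collection"))
print(restore(store, set(), "collection", mode="restoresinglefromaip"))
print(restore(store, {"item1"}, "collection", mode="restorekeepexisting"))
```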

Replacing Object(s)

The "Replace" (replacewithaip) task expects to replace an existing DSpace object. This task will do the following:

  1. fetch the replica store AIP for the given DSpace Object
  2. decompress it
  3. locate the existing DSpace object to be replaced & clear out all its existing metadata, files, access rights, etc.
  4. replace the existing DSpace object metadata, files, access rights, etc. with the information found in the AIP (thus "overlaying" or replacing all information in the existing object)
  5. if the object is a collection or community, all child objects (e.g. items) will also have their AIP fetched, decompressed and existing objects replaced

NOTE: When using BagIt-based AIPs, this task will fail if the DSpace object is not found or no longer exists. When using METS-based AIPs, this task will instead perform a restoration of any DSpace object that is not found or no longer exists.

If you are using METS-based AIPs, an additional replacement task is available:

  • Replace Single Object with AIP (replacesinglewithaip)
    • This task acts the same as the default "replacewithaip" task, but it does NOT replace any child objects. So, if it is run on a collection, just the collection metadata will be replaced (items existing in that collection will not be replaced).

Cleanup

Replication Task Used:

Remove AIP(s) from Storage

Task ID: removeaip

Ordinarily, a replication arrangement is long-standing: the preservation function cannot be fulfilled unless the replicas (here, the AIPs) are always kept and available. However, some collections (or items within them) may be removed for a variety of reasons: legal challenge, de-accession, etc. When the repository no longer locally wants to hold the object, the replica AIP ceases to have value. The task 'org.dspace.ctask.replicate.RemoveAIP' will delete the replica store AIP for its identifier. As with other replication tasks, if the identifier points to a collection or community, all the AIPs of all the members will also be deleted.

Keeping Score

Replication Task Used:

Read Odometer

Task ID: readodometer

Many storage providers have cost structures that are more complex than simple functions of the total stored bytes: cloud providers in particular have costs associated with the use of the network to upload and download the stored objects. An object that occupies 2 megabytes might cost far more over time than a 1 gigabyte object, if the former is downloaded 1000 times for every time the latter is. The replication system provides a very rudimentary task to help manage and track these factors: 'org.dspace.ctask.replicate.ReadOdometer'. This task simply displays the readings from the replication system that record cumulative use. The statistics are:

  • total number of objects (AIPs, typically) in the replica store
  • total size of all objects
  • total number of bytes downloaded from the store
  • total number of bytes uploaded to the store

These figures can be used as a means of checking and validating service charges from storage providers.
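The 2 megabyte versus 1 gigabyte comparison above can be worked through numerically. The rates below are purely hypothetical (no real provider's pricing is implied), and the cost function is an assumed simple model: a storage charge plus a download charge, both per gigabyte.

```python
MB = 10**6
GB = 10**9

# Hypothetical rates, in cents per gigabyte, for one billing period.
STORAGE_CENTS_PER_GB = 2
TRANSFER_CENTS_PER_GB = 10


def bytes_downloaded(object_size, downloads):
    """Total bytes leaving the store for one object."""
    return object_size * downloads


def period_cost_cents(object_size, downloads):
    """Storage charge plus download charge under the hypothetical rates."""
    stored = object_size * STORAGE_CENTS_PER_GB
    moved = bytes_downloaded(object_size, downloads) * TRANSFER_CENTS_PER_GB
    return (stored + moved) / GB


small = period_cost_cents(2 * MB, 1000)  # 2 MB object, downloaded 1000 times
large = period_cost_cents(1 * GB, 1)     # 1 GB object, downloaded once
print(bytes_downloaded(2 * MB, 1000) // GB, "GB vs", bytes_downloaded(1 * GB, 1) // GB, "GB downloaded")
print(small > large)
```

Under these assumed rates the small object moves twice as many bytes as the large one, which is the kind of effect the odometer's upload and download totals make visible.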

More Information on where Odometer statistics are kept

The odometer statistics are stored in a small text file located at: [base.dir]/odometer, where [base.dir] is the value of the base.dir setting in your [dspace]/config/modules/replicate.cfg configuration file. Should you ever need to reset your odometer, you can do so by moving or removing this existing odometer file.

Automation

While the coordinated use of the tasks described above can provide the basis for a solid replication strategy and practice, there are several processes that could necessitate a fair amount of curatorial work. For example, in the discussion on ensuring integrity of AIPs over time, we remarked that vigilance was required by the curator to transmit new AIPs whenever Items change. It is possible to leverage existing facilities in DSpace to substantially reduce this effort through automation.

The replication code includes a so-called 'event consumer', that can 'listen for' any changes to objects in the repository. Event consumers are documented elsewhere, but all we need to do to activate this consumer is add it to the list of consumers (in dspace.cfg):

#### Event System Configuration ####

# default synchronous dispatcher (same behavior as traditional DSpace)
event.dispatcher.default.class = org.dspace.event.BasicDispatcher
event.dispatcher.default.consumers = search, browse, eperson, harvester, replicate
....
# consumer to manage content replication
event.consumer.replicate.class = org.dspace.ctask.replicate.ReplicateConsumer
event.consumer.replicate.filters = Community|Collection|Item+Install|Modify|Modify_Metadata|Delete

This configuration essentially means: listen for any new, modified or deleted Items, Collections and Communities. If you do not care about Community or Collection AIPs, just remove 'Community' or 'Collection' from the list.
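How such a filter value breaks down can be sketched as below. This is an assumption-laden illustration, not DSpace's event system code: it assumes the format is a '|'-separated list of object types, a '+', then a '|'-separated list of event types, as the example value suggests. The function names are hypothetical.

```python
def parse_filter(value):
    """Split 'Type|Type+Event|Event' into (object types, event types)."""
    object_part, event_part = value.split("+", 1)
    return object_part.split("|"), event_part.split("|")


def is_relevant(value, object_type, event_type):
    """True if the consumer would be notified of this object/event pair."""
    objects, events = parse_filter(value)
    return object_type in objects and event_type in events


FILTER = "Community|Collection|Item+Install|Modify|Modify_Metadata|Delete"
print(parse_filter(FILTER))
print(is_relevant(FILTER, "Item", "Install"))

# Dropping 'Community' from the list stops Community events from matching.
print(is_relevant("Collection|Item+Install|Modify|Modify_Metadata|Delete", "Community", "Modify"))
```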

When the ReplicateConsumer gets a relevant event, it will act on it as follows:

If the event is an addition of a new DSpace object (actually for Items, an 'installation', i.e. when the item exits workflow), then a request for an AIP transmission is queued. The same occurs whenever an object has changed (so-called modify events). When an object is deleted, a 'catalog' of the deletion is transmitted to the replication service. The catalog just lists all the parts of the deletion: for an item, just the handle of the item; for a collection, all the item handles that were in it. This way, if the deletion was mistaken, the catalog can be used to recover all the contents. This represents the default behavior of the consumer. You may configure it in [dspace]/config/modules/replicate.cfg:

###  ReplicateConsumer settings ###
# ReplicateConsumer must be properly declared/configured in dspace.cfg
# All tasks defined will be queued, unless the '+p' suffix is appended, when
# they will be immediately performed. Exercise considerable caution when using
# +p, as lengthy tasks can adversely affect UI or other responsiveness.

# Replicate event consumer tasks upon install/add events.
# A comma separated list of valid task plugin names (with optional '+p' suffix)
consumer.tasks.add = transmitaip

# Replicate event consumer tasks upon modification events.
# A comma separated list of valid task plugin names (with optional '+p' suffix)
consumer.tasks.mod = transmitaip

# Replicate event consumer tasks upon delete/remove events.
# A comma separated list of valid task plugin names (with optional '+p' suffix)
consumer.tasks.del = catalog+p

# Replicate event consumer queue name - where all queued tasks are placed
consumer.queue = replication
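The '+p' suffix convention in the settings above can be sketched as a small parser. This is an illustration of the convention as described in the configuration comments, not the ReplicateConsumer's own parsing code, and the function name is hypothetical.

```python
def parse_task_list(value):
    """Turn 'taska, taskb+p' into [(name, perform_immediately), ...]."""
    tasks = []
    for entry in value.split(","):
        name = entry.strip()
        if not name:
            continue  # tolerate stray commas
        immediate = name.endswith("+p")
        if immediate:
            name = name[:-2]
        tasks.append((name, immediate))
    return tasks


print(parse_task_list("transmitaip"))  # queued for later execution
print(parse_task_list("catalog+p"))    # performed immediately
```

Tasks without the suffix land in the queue named by consumer.queue, while '+p' tasks run at once and should therefore be short.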

Using the event consumer, the curator can essentially operate replication on 'auto-pilot' after the first complete transmission of AIPs.
One important configuration to be aware of is this: by default, the consumer will process all events it receives, regardless of collection. But in our current case, we intend for only the 'Amazing Images' collection to be replicated. To effect this, we must create a file in the directory defined by the base.dir property in [dspace]/config/modules/replicate.cfg:

# Base directory for replication operations
base.dir = ${dspace.dir}/replicate

Create a simple text file called 'include' and put the handle of the collection for 'Amazing Images' in it. You can add as many collections (one per line) as you like. If you replicate all but a few collections, just name the file 'exclude' and list the collection handles you want to exclude.
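The effect of the 'include' and 'exclude' files can be sketched as follows. This is a guess at the logic for illustration, not the consumer's actual code: file contents are passed in as strings, the function names are hypothetical, and the sketch assumes that an 'include' list takes precedence if both are supplied.

```python
def read_handles(text):
    """Parse one handle per line, ignoring blank lines and surrounding spaces."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def should_replicate(collection_handle, include=None, exclude=None):
    """Decide whether events for a collection are processed.

    include / exclude: sets of handles, or None if the file does not exist.
    """
    if include is not None:
        return collection_handle in include
    if exclude is not None:
        return collection_handle not in exclude
    return True  # default: process every collection


# Hypothetical 'include' file naming only the 'Amazing Images' collection.
include = read_handles("123456789/1\n")
print(should_replicate("123456789/1", include=include))
print(should_replicate("123456789/7", include=include))
```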

Replica Storage

For the replication of AIPs to be of any significant value, they must be stored in a safe, persistent, reliable, accessible, and available location. The replication tasks of transmitting, fetching, etc. all rely on the configured storage provider. This and related properties are found in replicate.cfg:

# Replica store implementation class
plugin.single.org.dspace.ctask.replicate.ObjectStore = \
    org.dspace.ctask.replicate.store.LocalObjectStore

# Location of local (e.g. local, mountable, sync) object store
# ignored for non-local stores (e.g. DuraCloud)
store.dir = ${dspace.dir}/repstore

The default configuration simply writes the AIPs to the local directory configured by the 'store.dir' property above. This is not intended to be a production-grade solution, since a failure in the DSpace asset store could likely also affect this storage. It is provided mostly as a way to begin to work with the replication tasks without worrying about finding a storage provider.

For replicating in earnest, a service like DuraCloud is recommended, and what follows are instructions on how to configure a DuraCloud storage provider. Note that this service must be established and provisioned prior to use, and those details may be obtained from DuraSpace:

http://duraspace.org/duracloud.php
