Replication Task Suite

The Replication Task Suite is a DSpace Add-On which provides a set of curation system tasks to assist in performing replication (backup/restore/audit) of DSpace contents to other locations. The DSpace content is packaged in containers known as AIPs (OAIS speak: 'archival information packages'). By default, AIPs are generated in the default DSpace AIP Format (the same format used by the AIP Backup and Restore tool). If desired, there is an option to generate BagIt-based AIPs instead of using the default DSpace AIP format.

This Add-On also integrates DSpace with DuraCloud, for users who wish to back up their content into DuraCloud directly from their DSpace administrative interface.

More Information

More information on the Replication Task Suite is available from the following webinars/screencasts:

The Problem Statement & Usage Examples section below also provides some real-life scenarios/examples of where each Replication task may come in handy.

 

Installation

Supported DSpace Versions

The Replication Task Suite currently supports the following versions of DSpace software:

| Replication Task Suite Version | Supported DSpace Version(s) | Supported Interfaces | Notes |
|---|---|---|---|
| 1.0 | DSpace version 1.8.x | XMLUI and/or commandline | Highly recommended to use either DSpace 1.8.1 or 1.8.2. DSpace 1.8.0 has a known bug where running a Replication Task will always return a NullPointerException - see DS-1077 |
| 3.0-SNAPSHOT | DSpace version 3.x | XMLUI and/or commandline | The "3.0-SNAPSHOT" version of the Replication Task Suite is nearly identical to the "1.0" version. It just includes a minor bug fix to ensure it will run on DSpace 3.0. |

Installation instructions for each version are included below:

User Interface Compatibility Notes

As the Replication Suite is just a suite of Curation System tasks, it may be called (like any Curation Task) from the following locations:

  • From the Command Line
  • From the Admin UI (XMLUI Only)
  • From Item Approval Workflow
  • From custom Java code

For more information see the Curation System details on Task Invocation.
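
As an illustration of the command-line route, the sketch below assembles an invocation of the DSpace curation launcher from Python. It assumes the standard `dspace curate` options described in the Curation System documentation (`-t` for the task name, `-i` for the object identifier, `-r -` to send the report to standard output). The install path `/dspace` and the handle `123456789/4` are placeholders, and the helper names are hypothetical.

```python
import subprocess


def build_curate_command(dspace_dir, task, identifier):
    """Return the argv list for running one curation task on one object.

    Assumes the stock launcher at [dspace]/bin/dspace and its 'curate' command.
    """
    return [
        f"{dspace_dir}/bin/dspace", "curate",
        "-t", task,        # task name as registered in curate.cfg
        "-i", identifier,  # handle of the item, collection or community
        "-r", "-",         # write the task report to standard output
    ]


def run_curate(dspace_dir, task, identifier):
    """Run the task and return its report text; raises if the launcher fails."""
    cmd = build_curate_command(dspace_dir, task, identifier)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Placeholder install directory and handle, for illustration only.
    print(" ".join(build_curate_command("/dspace", "estaipsize", "123456789/4")))
```

The task names passed to `-t` are whatever names you register in curate.cfg (see the Configuration section below).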

Installation on DSpace 1.8.x

Known Curation System bug in 1.8.0

DSpace 1.8.0 contains a bug in the Curation System which causes a NullPointerException error to be returned when any curation task is run across the entire site (see DS-1077). This bug directly affects the Replication Task Suite. Even when a replication task succeeds, it will still throw a NullPointerException. You can check the DSpace logs to tell whether the task actually succeeded or not. This bug was resolved in DSpace 1.8.1.
Because of the above bug, we recommend running the Replication Task Suite on DSpace 1.8.1 or above.

 

  1. In your DSpace Source directory ([dspace-src]), you will modify two Maven pom.xml files:
    • [dspace-src]/dspace/pom.xml (This POM controls dependencies of CommandLine scripts. Modifying it will let you run dspace-replicate from commandline)
    • [dspace-src]/dspace/modules/xmlui/pom.xml (This POM controls dependencies of XMLUI. Modifying it will let you run dspace-replicate from XMLUI)

  2. For each of these pom.xml files, add the following <dependency> section at the end of the existing <dependencies> section (just before the closing </dependencies> tag).

    <dependencies>
        ...

        <!-- Adding this dependency will install the Replication Task Suite Addon -->
        <dependency>
            <groupId>org.dspace</groupId>
            <artifactId>dspace-replicate</artifactId>
            <version>1.0</version>
        </dependency>
    </dependencies>

  3. Once you've finished modifying both pom.xml files, rebuild DSpace by running the following from your [dspace-src]/dspace/ folder:

    mvn clean package
    
  4. Update your existing DSpace 1.8.x installation by running the following from your [dspace-src]/dspace/target/dspace-[version]-build/ directory:

    ant update
    

    Alternatively, if you don't want to do a full DSpace update, you can just update your existing binaries & webapps by running the following two commands:

    • ant update_code (Updates the existing [dspace]/lib/ directory)
    • ant update_webapps (Updates the existing [dspace]/webapps/ directory)
  5. Follow the instructions in the Configuration section below in order to enable & configure the Replication Task Suite Add-On.
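
Before rebuilding, it can be handy to confirm that each edited POM really declares the add-on. The following is a minimal sketch, not part of the Replication Task Suite: it assumes the POM uses the standard Maven namespace and that the dependency sits directly under the top-level <dependencies> section, as in the steps above. The function name and sample POM are hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard Maven POM namespace (assumed to be declared on <project>).
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def has_replicate_dependency(pom_xml, version=None):
    """Return True if the POM text declares org.dspace:dspace-replicate
    under its top-level <dependencies>, optionally at a given version."""
    root = ET.fromstring(pom_xml)
    for dep in root.findall("m:dependencies/m:dependency", POM_NS):
        group = dep.findtext("m:groupId", default="", namespaces=POM_NS).strip()
        artifact = dep.findtext("m:artifactId", default="", namespaces=POM_NS).strip()
        ver = dep.findtext("m:version", default="", namespaces=POM_NS).strip()
        if group == "org.dspace" and artifact == "dspace-replicate":
            return version is None or ver == version
    return False


# Tiny stand-in for a real pom.xml, for illustration only.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <dependencies>
    <dependency>
      <groupId>org.dspace</groupId>
      <artifactId>dspace-replicate</artifactId>
      <version>1.0</version>
    </dependency>
  </dependencies>
</project>"""

if __name__ == "__main__":
    print(has_replicate_dependency(SAMPLE_POM, version="1.0"))
```

To check a real file, read its contents as text and pass them to `has_replicate_dependency`.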

     

Installation on DSpace 3.x

  1. In your DSpace Source directory ([dspace-src]), you will need to modify the following POM file:
    • [dspace-src]/dspace/modules/additions/pom.xml (This POM will ensure that the "dspace-replicate" dependency is made available to commandline and ALL DSpace interfaces)

  2. For this pom.xml file, add the following <dependency> section at the end of the existing <dependencies> section (just before the closing </dependencies> tag).

    <dependencies>
        ...

        <!-- Adding this dependency will install the Replication Task Suite Addon -->
        <dependency>
            <groupId>org.dspace</groupId>
            <artifactId>dspace-replicate</artifactId>
            <version>3.0-SNAPSHOT</version>
        </dependency>
    </dependencies>

    Replication Task Suite version 3.0-SNAPSHOT is nearly identical to 1.0 stable

    The 3.0-SNAPSHOT version of the Replication Task Suite is nearly identical to the 1.0 stable version.  The only changes are very minor bug fixes to allow for the Replication Task Suite to be compatible with the new DSpace 3.x API.  So, even though this is a "-SNAPSHOT" version, you should still find it to be stable.  A "3.0-EA1" (Early Access #1) version will be released in the near future after more extensive testing is performed.

  3. Once you've finished modifying the pom.xml file, rebuild DSpace by running the following from your [dspace-src]/dspace/ folder:

    mvn clean package
    
  4. Update your existing DSpace 3.x installation by running the following from your [dspace-src]/dspace/target/dspace-[version]-build/ directory:

    ant update
    

    Alternatively, if you don't want to do a full DSpace update, you can just update your existing binaries & webapps by running the following two commands:

    • ant update_code (Updates the existing [dspace]/lib/ directory)
    • ant update_webapps (Updates the existing [dspace]/webapps/ directory)
  5. Follow the instructions in the Configuration section below in order to enable & configure the Replication Task Suite Add-On.

Configuration

Configuration of the Replication Task Suite is based entirely on your local institution's backup, restore and preservation needs.

Before getting started, you may wish to determine the answers to the following questions:

  1. AIP Format Options: Does your institution want to back up using the default DSpace AIP format (METS packaging)? Or would you rather utilize the new BagIt AIP Format?
  2. Storage Options: Does your institution plan to use the Replication Suite to back up to a local/mounted drive? Or would you like to connect it to a DuraCloud account?
  3. Additional Options: Do you plan to use Checkm manifests for checksum auditing?

Overview of Task Suite usage

For a higher level introduction to the Replication Task Suite, please see the Problem Statement & Usage Examples section below. It may provide you with a better idea of how you'd like to configure this task suite based on your institutional needs.

AIP Format Options

One of the first questions to ask yourself is the format you wish to utilize for your AIPs.

There are two options:

  1. DSpace AIP Format (METS-based) (default) - This is the same AIP format utilized by the DSpace AIP Backup and Restore feature, so it is 100% compatible with that existing feature. In fact when using this format, the Replication Task Suite just "wraps" calls to the AIP Backup and Restore feature itself.
  2. BagIt AIP Format - This is a new AIP format provided by the Replication Task Suite. It generates AIPs in the BagIt File Packaging Format. Institutions which are already familiar with BagIt or use it elsewhere may find this format preferable.

For more information on the tasks available based on your AIP format choice, please see the Problem Statement & Usage Examples section below. This section also provides good examples of how to use each of the tasks available to you in the Replication Task Suite.

Configuring usage of DSpace default AIP Format (METS-based)

This section goes through the steps of configuring the Replication Suite to use the default DSpace AIP format, which utilizes METS packaging.

  1. General Curation Configuration: First, in your [dspace]/config/modules/curate.cfg you will want to enable & configure the METS-based replication tasks. (NOTE: there is a sample curate.cfg file provided in [dspace-replicate]/config/modules/curate.cfg which is pre-configured to use METS-based AIPs).
    • Enable the Replication Tasks: In the list of "Task Class implementations" (plugin.named.org.dspace.curate.CurationTask), add the following.
      REMEMBER to add a comma and backslash (", \") after each line (except the final line).

      plugin.named.org.dspace.curate.CurationTask = \
          ... (YOUR EXISTING TASKS) ... , \
          org.dspace.ctask.replicate.EstimateAIPSize = estaipsize, \
          org.dspace.ctask.replicate.ReadOdometer = readodometer, \
          org.dspace.ctask.replicate.TransmitAIP = transmitaip, \
          org.dspace.ctask.replicate.VerifyAIP = verifyaip, \
          org.dspace.ctask.replicate.FetchAIP = fetchaip, \
          org.dspace.ctask.replicate.CompareWithAIP = auditaip, \
          org.dspace.ctask.replicate.RemoveAIP = removeaip, \
          org.dspace.ctask.replicate.METSRestoreFromAIP = restorefromaip, \
          org.dspace.ctask.replicate.METSRestoreFromAIP = replacewithaip, \
          org.dspace.ctask.replicate.METSRestoreFromAIP = restorekeepexisting, \
          org.dspace.ctask.replicate.METSRestoreFromAIP = restoresinglefromaip, \
          org.dspace.ctask.replicate.METSRestoreFromAIP = replacesinglewithaip
      
    • Give Each Task a Human-Friendly Task Name: Under the ui.tasknames setting, give each of the above Tasks a human-friendly name. Here are some recommended values, but you are welcome to tweak them.
      REMEMBER to add a comma and backslash (", \") after each line (except the final line).

      ui.tasknames = \
          ... (YOUR EXISTING TASK NAMES) ... , \
          estaipsize = Estimate Storage Space for AIP(s), \
          readodometer = Read Odometer, \
          transmitaip = Transmit AIP(s) to Storage, \
          verifyaip = Verify AIP(s) exist in Storage, \
          fetchaip = Fetch AIP(s) from Storage, \
          auditaip = Audit against AIP(s), \
          removeaip = Remove AIP(s) from Storage, \
          restorefromaip = Restore Missing Object(s) from AIP(s), \
          replacewithaip = Replace Existing Object(s) with AIP(s), \
          restorekeepexisting = Restore Missing Object(s) but Keep Existing Objects, \
          restoresinglefromaip = Restore Single Object from AIP, \
          replacesinglewithaip = Replace Single Object with AIP
      
    • Optionally Create a Task Group: Finally, if you'd like to create a Task Group for these tasks, you can create a group named "replicate" and add them all to it. The below is just an example of how you may wish to set the ui.taskgroups and ui.taskgroup.* settings. It creates two Task Groups: (1) a "General Purpose Tasks" group for a few default DSpace Curation Tasks, and (2) a "Replication Suite Tasks" group for all these new Replication tasks.

      # Tasks may be organized into named groups which display together in UI drop-downs
      ui.taskgroups = \
         general = General Purpose Tasks, \
         replicate = Replication Suite Tasks
      
      # Group membership is defined using comma-separated lists of task names, one property per group
      ui.taskgroup.general = profileformats, requiredmetadata, checklinks
      ui.taskgroup.replicate = estaipsize, readodometer, transmitaip, verifyaip, fetchaip, auditaip, removeaip, restorefromaip, replacewithaip, restorekeepexisting, restoresinglefromaip, replacesinglewithaip
      
  2. Replication Suite Configuration: Next, in your [dspace]/config/modules/replicate.cfg you will want to ensure it is set up to properly use METS-based AIPs. Under the "AIP Packaging Settings" you'll want the following settings enabled:

    # Package type. Permitted values: 'mets', 'bagit'
    # mets = Generate default DSpace AIPs as described in: https://wiki.duraspace.org/display/DSDOC18/AIP+Backup+and+Restore
    # bagit = Generate AIPs based on the BagIt packaging format: https://wiki.ucop.edu/display/Curation/BagIt
    packer.pkgtype = mets
    
    # Format of package compression. Permitted values: 'zip' or 'tgz'
    # for 'mets' packages, only 'zip' is supported
    packer.archfmt = zip
    
    # Whether or not to name packages with a DSpace type prefix.
    # When 'true', package files are named [type]@[handle].[format] (e.g. ITEM@123456789-1.zip)
    # When 'false', package files are named [handle].[format] (e.g. 123456789-1.zip)
    # Defaults to 'true'. For 'mets' packages, this must be 'true'.
    packer.typeprefix = true
    
  3. Optionally tweak the AIP Restore/Replace settings: You can adjust the way AIPs are restored or replaced (using AIP Backup and Restore). These settings normally do not need to be changed, but they are available in the [dspace]/config/modules/replicate-mets.cfg configuration file. See that configuration file for more details.
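
The constraints stated in the "AIP Packaging Settings" comments above can be captured in a few lines. The sketch below is illustrative only and is not the suite's own code: the function names are hypothetical, and the replacement of "/" by "-" in the handle is an assumption inferred from the example file name ITEM@123456789-1.zip.

```python
def package_filename(object_type, handle, archive_format="zip", type_prefix=True):
    """Build an AIP file name following the pattern in the replicate.cfg comments.

    Assumption: the '/' in a handle becomes '-', as the documented example
    'ITEM@123456789-1.zip' suggests for handle 123456789/1.
    """
    base = handle.replace("/", "-")
    if type_prefix:
        base = f"{object_type.upper()}@{base}"
    return f"{base}.{archive_format}"


def packer_setting_problems(pkgtype, archfmt, typeprefix):
    """Return a list of human-readable problems with the three packer.* settings."""
    problems = []
    if pkgtype not in ("mets", "bagit"):
        problems.append(f"packer.pkgtype must be 'mets' or 'bagit', got {pkgtype!r}")
    if archfmt not in ("zip", "tgz"):
        problems.append(f"packer.archfmt must be 'zip' or 'tgz', got {archfmt!r}")
    if pkgtype == "mets":
        # Both restrictions come from the configuration comments.
        if archfmt != "zip":
            problems.append("'mets' packages only support packer.archfmt = zip")
        if not typeprefix:
            problems.append("'mets' packages require packer.typeprefix = true")
    return problems


if __name__ == "__main__":
    print(package_filename("ITEM", "123456789/1"))
    print(packer_setting_problems("mets", "tgz", False))
```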

Configuring usage of DSpace BagIt AIP Format

This section goes through the steps of configuring the Replication Suite to use BagIt-based AIPs. For more information on the BagIt packaging format, see: https://wiki.ucop.edu/display/Curation/BagIt

  1. General Curation Configuration: First, in your [dspace]/config/modules/curate.cfg you will want to enable & configure the BagIt-based replication tasks. (NOTE: there is a sample curate.cfg file provided in [dspace-replicate]/config/modules/curate.cfg which provides example settings).
    • Enable the Replication Tasks: In the list of "Task Class implementations" (plugin.named.org.dspace.curate.CurationTask), add the following.
      REMEMBER to add a comma and backslash (", \") after each line (except the final line).

      plugin.named.org.dspace.curate.CurationTask = \
          ... (YOUR EXISTING TASKS) ... , \
          org.dspace.ctask.replicate.EstimateAIPSize = estaipsize, \
          org.dspace.ctask.replicate.ReadOdometer = readodometer, \
          org.dspace.ctask.replicate.TransmitAIP = transmitaip, \
          org.dspace.ctask.replicate.VerifyAIP = verifyaip, \
          org.dspace.ctask.replicate.FetchAIP = fetchaip, \
          org.dspace.ctask.replicate.CompareWithAIP = auditaip, \
          org.dspace.ctask.replicate.RemoveAIP = removeaip, \
          org.dspace.ctask.replicate.BagItRestoreFromAIP = restorefromaip, \
          org.dspace.ctask.replicate.BagItReplaceWithAIP = replacewithaip
      
    • Give Each Task a Human-Friendly Task Name: Under the ui.tasknames setting, give each of the above Tasks a human-friendly name. Here are some recommended values, but you are welcome to tweak them.
      REMEMBER to add a comma and backslash (", \") after each line (except the final line).

      ui.tasknames = \
          ... (YOUR EXISTING TASK NAMES) ... , \
          estaipsize = Estimate Storage Space for AIP(s), \
          readodometer = Read Odometer, \
          transmitaip = Transmit AIP(s) to Storage, \
          verifyaip = Verify AIP(s) exist in Storage, \
          fetchaip = Fetch AIP(s) from Storage, \
          auditaip = Audit/Compare against AIP(s), \
          removeaip = Remove AIP(s) from Storage, \
          restorefromaip = Restore Missing Object(s) from AIP(s), \
          replacewithaip = Replace Existing Object(s) with AIP(s)
      
    • Optionally Create a Task Group: Finally, if you'd like to create a Task Group for these tasks, you can create a group named "replicate" and add them all to it. The below is just an example of how you may wish to set the ui.taskgroups and ui.taskgroup.* settings. It creates two Task Groups: (1) a "General Purpose Tasks" group for a few default DSpace Curation Tasks, and (2) a "Replication Suite Tasks" group for all these new Replication tasks.

      # Tasks may be organized into named groups which display together in UI drop-downs
      ui.taskgroups = \
         general = General Purpose Tasks, \
         replicate = Replication Suite Tasks
      
      # Group membership is defined using comma-separated lists of task names, one property per group
      ui.taskgroup.general = profileformats, requiredmetadata, checklinks
      ui.taskgroup.replicate = estaipsize, readodometer, transmitaip, verifyaip, fetchaip, auditaip, removeaip, restorefromaip, replacewithaip
      
  2. Replication Suite Configuration: Next, in your [dspace]/config/modules/replicate.cfg you will want to ensure it is set up to properly use BagIt-based AIPs. Under the "AIP Packaging Settings" you'll want the following settings enabled:

    # Package type. Permitted values: 'mets', 'bagit'
    # mets = Generate default DSpace AIPs as described in: https://wiki.duraspace.org/display/DSDOC18/AIP+Backup+and+Restore
    # bagit = Generate AIPs based on the BagIt packaging format: https://wiki.ucop.edu/display/Curation/BagIt
    packer.pkgtype = bagit
    

Storage Options

Where your AIPs will be stored is the next decision to make. There are three options currently available:

  1. Local Storage: Replicate/Backup content to another location (folder) on your local filesystem.
  2. Mountable Storage: Replicate/Backup content to a mounted external filesystem (e.g. NFS-mounted drive).
  3. DuraCloud Storage: Replicate/Backup content to an existing DuraCloud account.

Configuring Local Storage

The local storage option may also be used for a mounted drive / SAN which just appears as though it is a local filesystem folder. However, some mounted drives (e.g. NFS-mounted drives) may need to use the Mountable Storage option instead.

Before configuring a local storage option, please ensure you have enough space available on your local hard drive (or mounted drive/SAN if your local folder is actually remote storage). You can use the "Estimate Storage Space" (estaipsize) task to estimate the amount of new storage space you will need.

To configure local storage, please change the following settings in your [dspace]/config/modules/replicate.cfg configuration file:

  1. Enable Local Storage Plugin: Ensure the Replication suite is set up to use the 'LocalObjectStore' plugin.

    # Replica store implementation class (specify one)
    plugin.single.org.dspace.ctask.replicate.ObjectStore = \
        org.dspace.ctask.replicate.store.LocalObjectStore
    
  2. Configure Local Storage Folder: Configure the location where you want all AIPs to be stored on your local filesystem. This defaults to the [dspace]/repstore folder. However, we recommend changing this to at least a separate hard drive from your existing DSpace installation directory! This ensures that your content will not all be lost in the case of a hard drive failure.

    # Location of local (e.g. local, mountable, sync) object store
    # ignored for non-local stores (e.g. DuraCloud)
    store.dir = ${dspace.dir}/repstore
    
  3. Optionally Configure Subfolder Settings: Optionally, you can configure the sub-folder names (under store.dir) which will be used to store AIPs, checkm manifests (if enabled), etc.

    # The storage group / folder where AIPs are stored/retrieved when AIP based tasks 
    # (e.g. "Transmit AIP", "Recover from AIP") are executed.
    # For Local object stores, this group name corresponds to a subfolder in the 'store.dir'
    # For DuraCloud object stores, this group name corresponds to a DuraCloud Space ID (Space must already exist)
    group.aip.name = aips
    
    # The storage group / folder where Checkm Manifests are stored/retrieved when Checkm Manifest based tasks are executed
    # (org.dspace.ctask.replicate.checkm.*).
    # For Local object stores, this group name corresponds to a subfolder in the 'store.dir'
    # For DuraCloud object stores, this group name corresponds to a DuraCloud Space ID (Space must already exist)
    group.manifest.name = manifests
    
    # The storage group / folder where AIPs are temporarily stored/retrieved when an object deletion occurs
    # and the ReplicationConsumer is enabled (see below). Essentially, this 'delete' group provides a 
    # location where AIPs can be temporarily kept in case the deletion needs to be reverted and the object restored.
    # WARNING: THIS MUST NOT BE SET TO THE SAME VALUE AS 'group.aip.name'. If it is set to the 
    # same value, then your AIP backup processes will be UNSTABLE and restoration may be difficult or impossible.
    # For Local object stores, this group name corresponds to a subfolder in the 'store.dir'
    # For DuraCloud object stores, this group name corresponds to a DuraCloud Space ID (Space must already exist)
    group.delete.name = deletes
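
The warning above about 'group.delete.name' is easy to check mechanically. The following is a minimal sketch under stated assumptions: the settings have already been read into a plain dict keyed by property name, and the helper names are hypothetical. It also shows how each group name maps to a subfolder of 'store.dir' for local or mountable stores.

```python
from pathlib import PurePosixPath

GROUP_KEYS = ("group.aip.name", "group.manifest.name", "group.delete.name")


def storage_group_problems(settings):
    """Return problems with the storage group names in a replicate.cfg-style dict."""
    problems = []
    aip = settings.get("group.aip.name", "").strip()
    delete = settings.get("group.delete.name", "").strip()
    if not aip:
        problems.append("group.aip.name is not set")
    if aip and aip == delete:
        # The configuration comments warn that this makes backups unstable.
        problems.append("group.delete.name must differ from group.aip.name")
    return problems


def group_folders(settings):
    """Map each configured group to its folder under store.dir (local stores only)."""
    store = PurePosixPath(settings["store.dir"])
    return {key: str(store / settings[key]) for key in GROUP_KEYS if key in settings}


if __name__ == "__main__":
    # Example values; '/backup/repstore' is a placeholder path.
    example = {
        "store.dir": "/backup/repstore",
        "group.aip.name": "aips",
        "group.manifest.name": "manifests",
        "group.delete.name": "deletes",
    }
    print(storage_group_problems(example))
    print(group_folders(example))
```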
    

Configuring Mountable Storage

Before configuring a mounted storage option, please ensure you have enough space available on your external, mounted drive/SAN. You can use the "Estimate Storage Space" (estaipsize) task to estimate the amount of new storage space you will need.

To configure mountable storage, please change the following settings in your [dspace]/config/modules/replicate.cfg configuration file:

  1. Enable Mountable Storage Plugin: Ensure the Replication suite is set up to use the 'MountableObjectStore' plugin.

    # Replica store implementation class (specify one)
    plugin.single.org.dspace.ctask.replicate.ObjectStore = \
        org.dspace.ctask.replicate.store.MountableObjectStore
    
  2. Configure Mounted Folder: Configure the location where you want all AIPs to be stored. The folder should already be mounted on your local filesystem. This defaults to the [dspace]/repstore folder.

    # Location of local (e.g. local, mountable, sync) object store
    # ignored for non-local stores (e.g. DuraCloud)
    store.dir = ${dspace.dir}/repstore
    
  3. Optionally Configure Subfolder Settings: Optionally, you can configure the sub-folder names (under store.dir) which will be used to store AIPs, checkm manifests (if enabled), etc.

    # The storage group / folder where AIPs are stored/retrieved when AIP based tasks 
    # (e.g. "Transmit AIP", "Recover from AIP") are executed.
    # For Local object stores, this group name corresponds to a subfolder in the 'store.dir'
    # For DuraCloud object stores, this group name corresponds to a DuraCloud Space ID (Space must already exist)
    group.aip.name = aips
    
    # The storage group / folder where Checkm Manifests are stored/retrieved when Checkm Manifest based tasks are executed
    # (org.dspace.ctask.replicate.checkm.*).
    # For Local object stores, this group name corresponds to a subfolder in the 'store.dir'
    # For DuraCloud object stores, this group name corresponds to a DuraCloud Space ID (Space must already exist)
    group.manifest.name = manifests
    
    # The storage group / folder where AIPs are temporarily stored/retrieved when an object deletion occurs
    # and the ReplicationConsumer is enabled (see below). Essentially, this 'delete' group provides a 
    # location where AIPs can be temporarily kept in case the deletion needs to be reverted and the object restored.
    # WARNING: THIS MUST NOT BE SET TO THE SAME VALUE AS 'group.aip.name'. If it is set to the 
    # same value, then your AIP backup processes will be UNSTABLE and restoration may be difficult or impossible.
    # For Local object stores, this group name corresponds to a subfolder in the 'store.dir'
    # For DuraCloud object stores, this group name corresponds to a DuraCloud Space ID (Space must already exist)
    group.delete.name = deletes
    

Configuring DuraCloud Storage

DuraCloud Account Settings

In order to configure DuraCloud Storage, you first must have an existing DuraCloud Account. This account's settings should be configured in your [dspace]/config/modules/duracloud.cfg file as follows:

  1. DuraCloud HostName: This is the location of your DuraCloud instance (the URL you tend to access for your account). Just provide the hostname.

    # DuraCloud service location (just the hostname)
    host = demo.duracloud.org
    
  2. DuraCloud Service Port: This is the port that DuraCloud is running on. It is almost always "443" unless you have installed DuraCloud yourself and configured it differently.

    # DuraCloud service port (usually 443 for https)
    port = 443
    
  3. DuraCloud's "DuraStore" path: This is the path to DuraCloud's "DuraStore" service. It is almost always "durastore" unless you have installed DuraCloud yourself and configured it differently.

    context = durastore
    
  4. DuraCloud Username & Password: Finally, fill out your account username & password in these settings. Please note, as this file now contains your DuraCloud account information, we recommend securing it (if possible). Just ensure it is still readable by the system user that DSpace runs as.

    # DuraCloud user name
    username = rep-agent
    # DuraCloud password
    password = passw0rd
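
To see how the host, port and context settings relate, the sketch below reads a duracloud.cfg-style text and combines them into a service base URL. This is an illustration under assumptions, not the suite's implementation: the way the three values are joined, the "https" default scheme, and both function names are assumed here.

```python
def parse_simple_cfg(text):
    """Read 'key = value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def durastore_base_url(settings, scheme="https"):
    """Assumed composition: scheme://host:port/context."""
    return f"{scheme}://{settings['host']}:{settings['port']}/{settings['context']}"


# Sample values taken from the settings shown above.
SAMPLE_CFG = """\
# DuraCloud service location (just the hostname)
host = demo.duracloud.org
# DuraCloud service port (usually 443 for https)
port = 443
context = durastore
"""

if __name__ == "__main__":
    print(durastore_base_url(parse_simple_cfg(SAMPLE_CFG)))
```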
    
DuraCloud Storage Settings

Now, to configure DuraCloud as your storage location please change the following settings in your [dspace]/config/modules/replicate.cfg configuration file:

  1. Enable DuraCloud Storage Plugin: Ensure the Replication suite is set up to use the 'DuraCloudObjectStore' plugin.

    # Replica store implementation class (specify one)
    plugin.single.org.dspace.ctask.replicate.ObjectStore = \
        org.dspace.ctask.replicate.store.DuraCloudObjectStore
    
  2. Configure DuraCloud Primary Space to use: Your DuraCloud account allows you to separate content into various "Spaces". You'll need to create a new DuraCloud Space that your AIPs will be stored within, and configure that as your group.aip.name (by default it's set to a DuraCloud Space with ID of "aips"). You should also create a new DuraCloud Space that your AIPs will be moved to if they are ever removed, and configure that as your group.delete.name. Optionally, if you are using Checkm manifests, you can also create and configure a group.manifest.name DuraCloud Space.

    # The storage group / folder where AIPs are stored/retrieved when AIP based tasks 
    # (e.g. "Transmit AIP", "Recover from AIP") are executed.
    # For Local object stores, this group name corresponds to a subfolder in the 'store.dir'
    # For DuraCloud object stores, this group name corresponds to a DuraCloud Space ID (Space must already exist)
    group.aip.name = aips
    
  3. Optionally, Configure Additional DuraCloud Spaces: If you have chosen to utilize Checkm manifest validation, you will need to create and configure a DuraCloud Space corresponding to the group.manifest.name setting below. Additionally, if you have chosen to enable Automatic Replication, you will need to create and configure a DuraCloud Space corresponding to the group.delete.name setting below.

    # The storage group / folder where Checkm Manifests are stored/retrieved when Checkm Manifest based tasks are executed
    # (org.dspace.ctask.replicate.checkm.*).
    # For Local object stores, this group name corresponds to a subfolder in the 'store.dir'
    # For DuraCloud object stores, this group name corresponds to a DuraCloud Space ID (Space must already exist)
    group.manifest.name = manifests
    
    # The storage group / folder where AIPs are temporarily stored/retrieved when an object deletion occurs
    # and the ReplicationConsumer is enabled (see below). Essentially, this 'delete' group provides a 
    # location where AIPs can be temporarily kept in case the deletion needs to be reverted and the object restored.
    # WARNING: THIS MUST NOT BE SET TO THE SAME VALUE AS 'group.aip.name'. If it is set to the 
    # same value, then your AIP backup processes will be UNSTABLE and restoration may be difficult or impossible.
    # For Local object stores, this group name corresponds to a subfolder in the 'store.dir'
    # For DuraCloud object stores, this group name corresponds to a DuraCloud Space ID (Space must already exist)
    group.delete.name = deletes
    

Additional Options

Configuring usage of Checkm manifest validation

This section goes through the steps of configuring the usage of Checkm manifest tasks. These tasks provide a capability to store DSpace content checksums external from DSpace in the Checkm Manifest format. Some institutions may find this to be a useful replacement for the default DSpace Checksum Checker/Validator, which only stores/validates checksums internal to the DSpace system.

However, as this is an optional set of tasks, they are disabled by default. Should you wish to enable these tasks, just do the following:

  1. General Curation Configuration: First, in your [dspace]/config/modules/curate.cfg you will want to enable & configure the Checkm Manifest tasks. (NOTE: there is a sample curate.cfg file provided in [dspace-replicate]/config/modules/curate.cfg which provides example settings).
    • Enable the Checkm Tasks: In the list of "Task Class implementations" (plugin.named.org.dspace.curate.CurationTask), add the following.
      REMEMBER to add a comma and backslash (", \") after each line (except the final line).

      plugin.named.org.dspace.curate.CurationTask = \
          ... (YOUR EXISTING TASKS) ... , \
          org.dspace.ctask.replicate.checkm.TransmitManifest = transmitmanifest, \
          org.dspace.ctask.replicate.checkm.VerifyManifest = verifymanifest, \
          org.dspace.ctask.replicate.checkm.FetchManifest = fetchmanifest, \
          org.dspace.ctask.replicate.checkm.CompareWithManifest = auditmanifest, \
          org.dspace.ctask.replicate.checkm.RemoveManifest = removemanifest
      
    • Give Each Task a Human-Friendly Task Name: Under the ui.tasknames setting, give each of the above Tasks a human-friendly name. Here are some recommended values, but you are welcome to tweak them.
      REMEMBER to add a comma and backslash (", \") after each line (except the final line).

      ui.tasknames = \
          ... (YOUR EXISTING TASK NAMES) ... , \
          transmitmanifest = Transmit Checkm Manifest to Storage, \
          verifymanifest = Verify Checkm Manifest exists in Storage, \
          fetchmanifest = Fetch Checkm Manifest from Storage, \
          auditmanifest = Audit against Checkm Manifest, \
          removemanifest = Remove Checkm Manifest from Storage
      
    • Optionally Create a Task Group: Finally, if you'd like to create a Task Group for these tasks, you can create a group named "checkm" and add them all to it. The below is just an example of how you may wish to set the ui.taskgroups and ui.taskgroup.* settings. It creates two Task Groups: (1) a "General Purpose Tasks" group for a few default DSpace Curation Tasks, and (2) a "Checkm Validation Tasks" group for all these new Checkm tasks.

      # Tasks may be organized into named groups which display together in UI drop-downs
      ui.taskgroups = \
         general = General Purpose Tasks, \
         checkm = Checkm Validation Tasks
      
      # Group membership is defined using comma-separated lists of task names, one property per group
      ui.taskgroup.general = profileformats, requiredmetadata, checklinks
      ui.taskgroup.checkm = transmitmanifest, verifymanifest, fetchmanifest, auditmanifest, removemanifest
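
Because curate.cfg relies on comma-and-backslash continuations, a small consistency check can catch a dropped backslash or a task group that names an unregistered task. The sketch below is an illustration only: it parses a simplified subset of Java properties syntax (not the full format), and all function names are hypothetical.

```python
TASK_KEY = "plugin.named.org.dspace.curate.CurationTask"


def parse_properties(text):
    """Parse 'key = value' pairs with '#' comments and backslash continuations."""
    props = {}
    pending = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not pending and (not line or line.startswith("#")):
            continue
        if line.endswith("\\"):
            # Continuation: keep accumulating the logical line.
            pending += line[:-1].strip() + " "
            continue
        pending += line
        key, _, value = pending.partition("=")
        props[key.strip()] = value.strip()
        pending = ""
    return props


def pairs(value):
    """Split 'a = b, c = d' into [('a', 'b'), ('c', 'd')]."""
    out = []
    for chunk in value.split(","):
        if "=" in chunk:
            left, _, right = chunk.partition("=")
            out.append((left.strip(), right.strip()))
    return out


def undefined_group_members(props):
    """Return task names listed in any ui.taskgroup.* property that are not
    registered in the plugin.named...CurationTask list."""
    registered = {name for _, name in pairs(props.get(TASK_KEY, ""))}
    missing = []
    for key, value in props.items():
        if key.startswith("ui.taskgroup."):
            for name in (n.strip() for n in value.split(",")):
                if name and name not in registered:
                    missing.append(name)
    return missing


# Deliberately incomplete sample: 'auditmanifest' is grouped but never registered.
SAMPLE = r"""
plugin.named.org.dspace.curate.CurationTask = \
    org.dspace.ctask.replicate.checkm.TransmitManifest = transmitmanifest, \
    org.dspace.ctask.replicate.checkm.VerifyManifest = verifymanifest
ui.taskgroups = \
    checkm = Checkm Validation Tasks
ui.taskgroup.checkm = transmitmanifest, verifymanifest, auditmanifest
"""

if __name__ == "__main__":
    print(undefined_group_members(parse_properties(SAMPLE)))
```

With this parser, a missing trailing backslash shows up as an unexpected top-level key (for example a stray `checkm` property), which is a quick signal that a continuation was dropped.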
      

Problem Statement & Usage Examples

We can suppose our data curator has identified a collection of items in her DSpace repository consisting of high-value, born-digital, and unique/irreplaceable (not held elsewhere) content. She prudently wishes to insure against catastrophic local loss of this content by keeping a copy or replica of this collection elsewhere. She'd prefer to replicate all her DSpace content, but realizes that storage costs over long periods have made her administration wary, so she decides to begin with this collection.

First Steps - Estimation

Replication Task Used:

Estimate Storage Space for AIP(s)

Task ID: estaipsize

In order to budget for replication storage, she needs to know the 'size' of the collection. When she asks her sysadmin, he replies that it is easy to give her figures for the whole asset store, but since collections aren't stored separately, she would have to add up each item's bitstreams in the collection, a rather tedious process. Thus the first task: a reporting tool which operates on natural DSpace objects, rather than storage volumes.

To install this task, edit [dspace]/config/modules/curate.cfg (NB: all curation configuration is 'modular' in the sense that the configuration properties live outside of dspace.cfg, in named files. This means that if a given suite of tasks is unused, its configuration is never installed). First, add the task to the list of curation tasks.

plugin.named.org.dspace.curate.CurationTask = \
    ... (YOUR EXISTING TASKS) ... , \
    org.dspace.ctask.replicate.EstimateAIPSize = estaipsize

Next, in the same file, add this task to the list that appears in the administrative UI:

ui.tasknames = \
    ... (YOUR EXISTING TASK NAMES) ... , \
    estaipsize = Estimate Storage Space for AIP(s)

Of course, both the name of the task ('estaipsize') and the wording shown in the UI are up to you. Now the curator can navigate to her collection, select the 'curate' tab, choose the entry from the dropdown list of tasks, and perform the task. The results will display on the page:

ID: 123456789/1 (Amazing Images) estimated AIP size: 4 gigabytes

The estimates from this task are rather crude, in that they do not measure the actual AIPs, but just the bitstreams (so they ignore the metadata XML). They should nevertheless be fine for storage costing and allocation purposes.
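The estimate described above amounts to adding up bitstream sizes across every item in the collection. The following Python sketch illustrates that arithmetic only; the dictionary layout and both function names are assumptions made for illustration and are not part of DSpace or the EstimateAIPSize task.

```python
# Illustrative sketch only: the data layout below is hypothetical and does
# not reflect how DSpace stores items or bitstreams.

def estimate_size(collection):
    """Return the total size in bytes of all bitstreams in all items."""
    return sum(size
               for item in collection["items"]
               for size in item["bitstreams"])


def human_readable(num_bytes):
    """Format a byte count using binary units (1 kilobyte = 1024 bytes)."""
    units = ["bytes", "kilobytes", "megabytes", "gigabytes", "terabytes"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the value below 1024,
        # or at the largest unit available.
        if value < 1024 or unit == units[-1]:
            return f"{value:.0f} {unit}"
        value /= 1024
```

Because metadata is not counted, a real AIP will be somewhat larger than the figure this kind of sum produces.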

Replicating

Replication Task Used:

Transmit AIP(s) to Storage

Task ID: transmitaip

Having secured approval to replicate the 'Amazing Images' collection, our curator obviously needs a task to generate the AIP representations of each item in the collection, and transmit these archive files to the replication storage site (which may be service-backed, local, in the cloud, etc., as will be explored below). Adding this task is just like the previous step: editing the configuration properties into curate.cfg. (We won't repeat a description of this process each time, but note that you may always add a task, but elect not to display it in the administrative UI.) This task is 'org.dspace.ctask.replicate.TransmitAIP'.

Since we are now working with AIPs, we should examine how they are configured for the tasks. Most configuration specific to the replication task suite is found at [dspace]/config/modules/replicate.cfg. There are two main properties to set (or accept default values):

# Package type. Permitted values: 'mets', 'bagit'
packer.pkgtype = mets
# Format of package compression. Permitted values: 'zip' or 'tgz'
# for 'mets' packages, only zip is supported
packer.archfmt = zip

The default values will create a METS-based AIP in the default DSpace AIP Format, compressed into a 'zip' archive. The other alternative supported by the replication task suite is Library of Congress 'BagIt' packaging, which may be compressed either into a 'zip' file or a 'tgz' ('gzipped tar'), a compression format more common on Unix systems.
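The permitted combinations of these two properties can be summarized as a small lookup. The Python sketch below is only an illustration of the rule stated in the configuration comments ('mets' supports only 'zip', 'bagit' supports 'zip' or 'tgz'); the function name is hypothetical and the suite does not necessarily validate its settings this way.

```python
# Permitted combinations, taken from the replicate.cfg comments above.
ALLOWED_FORMATS = {
    "mets": {"zip"},
    "bagit": {"zip", "tgz"},
}


def check_packer_settings(pkgtype, archfmt):
    """Return True if the pair is permitted, otherwise raise ValueError."""
    if pkgtype not in ALLOWED_FORMATS:
        raise ValueError(f"unknown packer.pkgtype: {pkgtype!r}")
    if archfmt not in ALLOWED_FORMATS[pkgtype]:
        raise ValueError(
            f"packer.archfmt {archfmt!r} is not supported for {pkgtype!r} packages")
    return True
```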

Our data curator may elect to perform this task in the admin GUI, or, if the collection is rather large, she may instead 'queue' the task for later execution by using the queueing facility available in the curation system. We should note that the 'transmitaip' task, like all other replication tasks, operates on whatever DSpace object it is given. Thus, if the object is a collection, the task creates (and transmits, of course) an AIP for the collection object itself (metadata and logo), as well as AIPs for each item in the collection. If the task is given an identifier for a single Item, then only one AIP will be created.
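To make the container behavior concrete, here is a minimal Python sketch of which objects receive an AIP when a task is run against a given object. The nested dictionary layout and the function name are assumptions for illustration; the handles are made-up examples.

```python
# Illustrative sketch only: objects are modelled as nested dictionaries,
# which is not how DSpace represents communities, collections or items.

def objects_to_package(obj):
    """Yield the handle of obj, then the handles of all its descendants."""
    yield obj["handle"]
    for child in obj.get("children", []):
        yield from objects_to_package(child)
```

Run against a collection with two items, this yields three handles (the collection plus each item); run against a single item, it yields just that item's handle.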

Verifying Replication

Replication Task Used:

Verify AIP(s) exist in Storage

Task ID: verifyaip

While the transmitAIP task will report on whether or not it was successful in generating and transmitting AIP(s) to the replication service, our data curator wants the ability (within DSpace, not by using the replication service tools or UIs) to check whenever she likes that the AIP(s) which were transmitted are still there. A simple task 'org.dspace.ctask.replicate.VerifyAIP' can perform this function.

Ensuring Replica Integrity and Accuracy over time

Replication Task Used:

Audit against AIP(s)

Task ID: auditaip

The 'Amazing Images' collection is comparatively static, meaning that few new items are likely to be added, and most of the metadata in each item is not routinely changed. However, over longer periods of time, cataloging errors are discovered and corrected, perhaps formats become obsolete and new bitstreams are added. If the curator is fastidious about each change, and performs the 'transmitaip' task on each item that has changed, then in general the set of AIP replicas will always be 'in sync' with the repository. However, it is useful to have the means to ensure that the replicas agree with the repository without having to create and transmit entirely new ones. Thus the task: 'org.dspace.ctask.replicate.CompareWithAIP', which can also be thought of as a simple audit task. When performed on an Item, the task does the following:

  1. generates an AIP for the DSpace object locally (but does not transmit it)
  2. computes an MD5 checksum on the local AIP
  3. requests from the replication storage service an MD5 checksum for the AIP in storage
  4. compares the 2 checksums

The task will thus fail only if the checksums differ, which can only happen if some part of the DSpace Object (metadata or bitstream) itself differs. If the version of the item that is believed to be authentic is the repository (local) one, then simply performing the 'transmitaip' task on the item will restore synchrony. For collections and communities, this task also does an 'extent' comparison, which means that it will determine whether the replica store has an AIP for every item known (locally) to be in the collection or community.
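The checksum and 'extent' comparisons described above can be sketched in a few lines of Python using the standard hashlib module. This is an illustration of the technique only, not the CompareWithAIP implementation: the function names are hypothetical, and the remote checksum is simply passed in as a string rather than requested from a storage service.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(local_aip_path, remote_md5):
    """Return True when the local AIP checksum matches the stored one."""
    return md5_of_file(local_aip_path) == remote_md5


def missing_from_store(local_handles, stored_handles):
    """Extent check: handles known locally that have no AIP in the store."""
    return sorted(set(local_handles) - set(stored_handles))
```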

Repairing Damage

Replication Tasks Used:

Restore Missing Object(s) from AIP(s)

Task ID: restorefromaip

 

Replace Existing Object(s) with AIP(s)

Task ID: replacewithaip

 

Restore Missing Object(s) but Keep Existing Objects (*METS-AIP)

Task ID: restorekeepexisting

 

Restore Single Object from AIP (*METS-AIP)

Task ID: restoresinglefromaip

 

Replace Single Object with AIP (*METS-AIP)

Task ID: replacesinglewithaip

NOTE: Those tasks marked (*METS-AIP) are only supported when using METS-based AIPs

The AIPs in the replica store represent an insurance policy, and when 'claims' against that policy are filed, they can cover two situations: either the repository object is completely missing, and we want to restore it, or it is damaged and we want to repair the damage with data from the replica store AIP. A set of replication tasks performs these functions:

Restoring Object(s)

The "Restore" (restorefromaip) task will do the following:

  1. fetch the replica store AIP for the given object identifier
  2. decompress it and create a new DSpace object
  3. install the object into the repository, including restoring its state (withdrawn, embargoed, etc.)
  4. if the object is a collection or community, all child objects (e.g. items) will also have their AIP fetched, decompressed and restored

NOTE: This restorefromaip task will fail if there is already an object in the repository bearing the identifier given.

If you are using METS-based AIPs, two additional restoration tasks are available:

  • Restore Single Object from AIP (restoresinglefromaip)
    • This task acts the same as the default "restorefromaip" task, but it does NOT restore any child objects. So, if it is run on a collection, just the collection itself will be restored (items in that collection will not be restored).
  • Restore Missing Object(s) but Keep Existing Objects (restorekeepexisting)
    • This task acts similarly to the default "restorefromaip" task, but it attempts to skip over any objects which already exist in the repository. In other words, an error is not thrown if an object already exists; rather, that entire object (and all its child objects) is skipped over during processing and left unchanged. This mode is identical to the "Keep Existing" mode of the DSpace AIP Backup and Restore tool.
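The differences between the three restore tasks can be sketched as one function with two switches. This Python sketch is only a model of the behavior described above: the repository and replica store are plain dictionaries, and the names `restore`, `recursive`, `keep_existing` and `ObjectExistsError` are all hypothetical.

```python
class ObjectExistsError(Exception):
    """Raised when an object with the given handle is already present."""


def restore(repo, store, handle, recursive=True, keep_existing=False):
    """Restore handle from store into repo; return the handles restored.

    recursive=False models the 'single object' variant and
    keep_existing=True models the 'keep existing' variant.
    """
    if handle in repo:
        if keep_existing:
            return []  # skip this object and all of its children
        raise ObjectExistsError(handle)
    aip = store[handle]                  # fetch the AIP
    repo[handle] = dict(aip["content"])  # unpack and install the object
    restored = [handle]
    if recursive:                        # then restore each child object
        for child in aip["children"]:
            restored += restore(repo, store, child, recursive, keep_existing)
    return restored
```

With the defaults, the sketch behaves like "restorefromaip": it restores the whole hierarchy and stops with an error as soon as it meets an object that already exists.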

Replacing Object(s)

The "Replace" (replacewithaip) task expects to replace an existing DSpace object. This task will do the following:

  1. fetch the replica store AIP for the given DSpace Object
  2. decompress it
  3. locate the existing DSpace object to be replaced & clear out all its existing metadata, files, access rights, etc.
  4. replace the existing DSpace object metadata, files, access rights, etc. with the information found in the AIP (thus "overlaying" or replacing all information in the existing object)
  5. if the object is a collection or community, all child objects (e.g. items) will also have their AIP fetched, decompressed and existing objects replaced

NOTE: When using BagIt-based AIPs, this task will fail if the DSpace object is not found or no longer exists. When using METS-based AIPs, this task will instead perform a restoration of any DSpace object that is not found or no longer exists.

If you are using METS-based AIPs, an additional replacement task is available:

  • Replace Single Object with AIP (replacesinglewithaip)
    • This task acts the same as the default "replacewithaip" task, but it does NOT replace any child objects. So, if it is run on a collection, just the collection metadata will be replaced (items existing in that collection will not be replaced).
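The replace behavior, including the difference between BagIt-based and METS-based AIPs when an object is missing, can be modelled in the same simplified way. In this Python sketch the repository and replica store are plain dictionaries, and the names `replace`, `recursive` and `restore_missing` are hypothetical; it illustrates the described steps and is not the suite's code.

```python
def replace(repo, store, handle, recursive=True, restore_missing=True):
    """Overlay repo objects with AIP contents; return the handles processed.

    restore_missing=True models the METS behavior (a missing object is
    restored); restore_missing=False models the BagIt behavior (failure).
    """
    aip = store[handle]                      # fetch and unpack the AIP
    if handle in repo:
        repo[handle].clear()                 # clear out the existing object
        repo[handle].update(aip["content"])  # overlay the AIP contents
    elif restore_missing:
        repo[handle] = dict(aip["content"])  # restore instead of replace
    else:
        raise LookupError(f"{handle} not found in repository")
    processed = [handle]
    if recursive:                            # then process child objects
        for child in aip["children"]:
            processed += replace(repo, store, child, recursive, restore_missing)
    return processed
```

Calling the sketch with `recursive=False` corresponds to the single-object variant: only the named object is overlaid and its children are left alone.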

Cleanup

Replication Task Used:

Remove AIP(s) from Storage

Task ID: removeaip

Ordinarily, a replication arrangement is long standing: the preservation function cannot be fulfilled unless the replicas (here, the AIPs) are always kept and available. However, some collections (or items within them) may be removed for a variety of reasons: legal challenge, de-accession, etc. When the repository no longer wants to hold the object locally, the replica AIP ceases to have value. The task 'org.dspace.ctask.replicate.RemoveAIP' will delete the replica store AIP for its identifier. As with other replication tasks, if the identifier points to a collection or community, all the AIPs of all the members will also be deleted.

Keeping Score

Replication Task Used:

Read Odometer

Task ID: readodometer

Many storage providers have cost structures that are more complex than simple functions of the total stored bytes. Cloud providers in particular have costs associated with the use of the network to upload and download stored objects. An object that occupies 2 megabytes might cost far more over time than a 1 gigabyte object, if the former is downloaded 1000 times for every time the latter is. The replication system provides a very rudimentary task to help manage and track these factors: 'org.dspace.ctask.replicate.ReadOdometer'. This task simply displays the readings from the replication system that record cumulative use. The statistics are:

  • total number of objects (AIPs, typically) in the replica store
  • total size of all objects
  • total number of bytes downloaded from the store
  • total number of bytes uploaded to the store

These figures can be used as a means of checking and validating service charges from storage providers.
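As a rough illustration of how these readings might be turned into a cost check, the Python sketch below multiplies each byte count by a per-gigabyte rate. The function name and the rates are entirely hypothetical; real providers publish their own pricing, which may have a different structure altogether.

```python
BYTES_PER_GB = 1_000_000_000  # decimal gigabyte, chosen here for simplicity


def estimated_charge(stored_bytes, downloaded_bytes, uploaded_bytes,
                     store_rate, download_rate, upload_rate):
    """Estimate a charge from odometer-style readings and per-GB rates."""
    return (stored_bytes / BYTES_PER_GB * store_rate
            + downloaded_bytes / BYTES_PER_GB * download_rate
            + uploaded_bytes / BYTES_PER_GB * upload_rate)
```

With, say, a storage rate lower than the download rate, a 2 megabyte object downloaded 1000 times accounts for 2 gigabytes of transfer and can come out more expensive than a 1 gigabyte object downloaded once, which is the situation described above.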

More Information on where Odometer statistics are kept

The odometer statistics are stored in a small text file located at: [base.dir]/odometer, where [base.dir] is the value of the base.dir setting in your [dspace]/config/modules/replicate.cfg configuration file. Should you ever need to reset your odometer, you can do so by moving or removing this existing odometer file.

Automation (optional)

While the coordinated use of the tasks described above can provide the basis for a solid replication strategy and practice, there are several processes that could necessitate a fair amount of curatorial work. For example, in the discussion on ensuring integrity of AIPs over time, we remarked that vigilance was required by the curator to transmit new AIPs whenever Items change. It is possible to leverage existing facilities in DSpace to substantially reduce this effort through automation.

The replication code includes a so-called 'event consumer' that can 'listen for' any changes to objects in the repository. Event consumers are documented elsewhere, but all we need to do to activate this consumer is add it to the list of consumers (in dspace.cfg):

#### Event System Configuration ####

# default synchronous dispatcher (same behavior as traditional DSpace)
event.dispatcher.default.class = org.dspace.event.BasicDispatcher
event.dispatcher.default.consumers = search, browse, eperson, harvester, replicate
....
# consumer to manage content replication
event.consumer.replicate.class = org.dspace.ctask.replicate.ReplicateConsumer
event.consumer.replicate.filters = Community|Collection|Item+Install|Modify|Modify_Metadata|Delete

This configuration essentially means: listen for any new, modified or deleted Items, Collections and Communities. If you do not care about Community or Collection AIPs, just remove 'Community' or 'Collection' from the list.
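The filter value reads as a list of object types, a '+', and a list of event types, each list separated by '|'. The Python sketch below shows one way to interpret a single filter of the shape shown above; it is an illustration only, the function names are hypothetical, and it does not attempt to cover any other filter syntax the DSpace event system may accept.

```python
def parse_filter(filter_value):
    """Split 'TypeA|TypeB+EventA|EventB' into (object types, event types)."""
    object_part, event_part = filter_value.split("+", 1)
    return set(object_part.split("|")), set(event_part.split("|"))


def matches(filter_value, object_type, event_type):
    """Return True if the filter covers this object type and event type."""
    object_types, event_types = parse_filter(filter_value)
    return object_type in object_types and event_type in event_types
```

Removing 'Community' from the object list, for example, means a modified Community no longer matches while a modified Item still does.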

When the ReplicateConsumer gets a relevant event, it will act on it as follows:

If the event is an addition of a new DSpace object (actually for Items, an 'installation', i.e. when the item exits workflow), then a request for an AIP transmission is queued. The same occurs whenever an object has changed (so-called modify events). When an object is deleted, a 'catalog' of the deletion is transmitted to the replication service. The catalog just lists all the parts of the deletion: if an item, then just the handle of the item; if a collection, then all the item handles that were in it. This way, if the deletion was mistaken, the catalog can be used to recover all the contents. This represents the default behavior of the consumer. You may configure it in [dspace]/config/modules/replicate.cfg:

###  ReplicateConsumer settings ###
# ReplicateConsumer must be properly declared/configured in dspace.cfg
# All tasks defined will be queued, unless the '+p' suffix is appended, when
# they will be immediately performed. Exercise considerable caution when using
# +p, as lengthy tasks can adversely affect UI or other responsiveness.

# Replicate event consumer tasks upon install/add events.
# A comma separated list of valid task plugin names (with optional '+p' suffix)
consumer.tasks.add = transmitaip

# Replicate event consumer tasks upon modification events.
# A comma separated list of valid task plugin names (with optional '+p' suffix)
consumer.tasks.mod = transmitaip

# Replicate event consumer tasks upon delete/remove events.
# A comma separated list of valid task plugin names (with optional '+p' suffix)
consumer.tasks.del = catalog+p

# Replicate event consumer queue name - where all queued tasks are placed
consumer.queue = replication
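
The '+p' suffix convention in these properties can be illustrated with a short Python sketch that splits a property value into task names and a flag saying whether each task is performed immediately or queued. The function name is hypothetical and this is not the ReplicateConsumer's own parsing code.

```python
def parse_task_list(value):
    """Split a consumer.tasks.* value into (task name, perform now) pairs."""
    tasks = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # ignore empty entries such as a trailing comma
        if entry.endswith("+p"):
            tasks.append((entry[:-2], True))   # perform immediately
        else:
            tasks.append((entry, False))       # place on the queue
    return tasks
```

Applied to the defaults above, 'transmitaip' is queued for add and modify events, while 'catalog+p' is performed immediately on delete events.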

Using the event consumer, the curator can essentially operate replication on 'auto-pilot' after the first complete transmission of AIPs.

One important configuration detail to be aware of: by default, the consumer will process all events it receives, regardless of collection. But in our current case, we intend for only the 'Amazing Images' collection to be replicated. To effect this, we must create a file in the directory defined by the following property in [dspace]/config/modules/replicate.cfg:

# Base directory for replication operations
base.dir = ${dspace.dir}/replicate

Create a simple text file called 'include' in that directory and put the handle of the 'Amazing Images' collection in it. You can add as many collections (one per line) as you like. If you replicate all but a few collections, name the file 'exclude' instead and list the collection handles you want to exclude.
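The effect of the 'include' and 'exclude' files can be sketched as a simple membership test. The Python sketch below is an illustration only, with hypothetical function names. It also assumes that an 'include' file takes precedence when both files are present, which the text above does not specify.

```python
from pathlib import Path


def read_handles(path):
    """Return the set of handles listed one per line, or None if no file."""
    if not path.is_file():
        return None
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def should_replicate(base_dir, collection_handle):
    """Decide whether events for a collection would be replicated."""
    base = Path(base_dir)
    include = read_handles(base / "include")
    if include is not None:          # assumption: 'include' is checked first
        return collection_handle in include
    exclude = read_handles(base / "exclude")
    if exclude is not None:
        return collection_handle not in exclude
    return True                      # no filter file: replicate everything
```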

Replica Storage

For the replication of AIPs to be of any significant value, they must be stored in a safe, persistent, reliable, accessible, and available location. The replication tasks of transmitting, fetching, etc. all rely on the storage provider configured. This and related properties are found in replicate.cfg:

# Replica store implementation class
plugin.single.org.dspace.ctask.replicate.ObjectStore = \
    org.dspace.ctask.replicate.store.LocalObjectStore

# Location of local (e.g. local, mountable, sync) object store
# ignored for non-local stores (e.g. DuraCloud)
store.dir = ${dspace.dir}/repstore

The default configuration (LocalObjectStore) simply writes the AIPs to the local directory configured by the 'store.dir' property in replicate.cfg. This is not intended to be a production-grade solution, since a failure in the DSpace asset store could likely also affect this storage. It is provided mostly as a way to begin to work with the replication tasks without worrying about finding a storage provider.
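Conceptually, a directory-backed store only needs to copy AIP files into a directory, report whether one is present, and delete it on request. The Python sketch below models that idea; the class `DirectoryStore` and its method names are hypothetical and do not reproduce the actual ObjectStore interface or the LocalObjectStore implementation.

```python
import shutil
from pathlib import Path


class DirectoryStore:
    """Hypothetical, minimal model of a directory-backed replica store."""

    def __init__(self, store_dir):
        self.root = Path(store_dir)
        self.root.mkdir(parents=True, exist_ok=True)

    def transfer(self, aip_path):
        """Copy an AIP file into the store and return its new location."""
        target = self.root / Path(aip_path).name
        shutil.copy2(aip_path, target)
        return target

    def exists(self, name):
        """Report whether an AIP with this file name is in the store."""
        return (self.root / name).is_file()

    def remove(self, name):
        """Delete an AIP from the store; return False if it was absent."""
        target = self.root / name
        if target.is_file():
            target.unlink()
            return True
        return False
```

A store of this kind shares the fate of the disk it lives on, which is why it is only suitable for getting started.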

For replicating in earnest, a service like DuraCloud is recommended (DuraCloudObjectStore). Such a service has the additional benefits of providing offsite storage/replication while also providing additional preservation management tools. Note that this service must be established and provisioned prior to use. For more information on DuraCloud see: http://www.duracloud.org

Alternatively, the MountableObjectStore option may be used if you wish to keep your AIP storage more "local" (e.g. on a local SAN or storage network). This option acts similarly to the default configuration (in that it writes to the local directory configured by the 'store.dir' property in replicate.cfg). However, the expectation is that the directory is actually a mounted storage drive, so AIPs are written in such a way as to support more complex storage architectures (e.g. an NFS-mounted store).

More information about each of these storage options (and how to configure them) is available in the Storage Options configuration section.

 

Codebase / Development

  1. Download the Replication Suite code from GitHub: https://github.com/DSpace/dspace-replicate
    1. Check out the branch you wish to develop against.  For example, to check out the 1.x branch of the codebase:

      git checkout dspace-replicate-1.x
  2. Build/Compile the Replication Suite, by running the following from the root directory

    mvn package
  3. The code will be compiled into a JAR and all its dependencies will also be copied to your "target" directory
    1. The main dspace-replicate.jar will be compiled to:
      • [dspace-replicate]/target/dspace-replicate-[version].jar (The Replication Suite Plugin)
    2. There will also be a total of 4 dependency JARs that will be copied to:
      • [dspace-replicate]/target/lib/common-[version].jar (DuraCloud common libraries - required for DuraCloud integration)
      • [dspace-replicate]/target/lib/commons-compress-[version].jar (Apache Commons Compress - prerequisite for Replication Suite plugin)
      • [dspace-replicate]/target/lib/storageprovider-[version].jar (DuraCloud storage provider libraries - required for DuraCloud integration)
      • [dspace-replicate]/target/lib/storeclient-[version].jar (DuraCloud store client libraries - required for DuraCloud integration)
    3. Also copy the above 5 JARs to your XMLUI web application's WEB-INF/lib directory (e.g. [dspace]/webapps/xmlui/WEB-INF/lib/)
  4. Once the codebase is compiled, you can install it by following the Installation instructions above.  
    1. Alternatively, you may temporarily copy all 5 JARs (dspace-replicate + dependency JARs) to the following locations for testing purposes only:
      • DSpace "lib" folder (e.g. [dspace]/lib/) - This will make the Replication Task Suite available via the commandline
      • DSpace XMLUI "lib" folder (e.g. [dspace]/webapps/xmlui/WEB-INF/lib/) - This will make the Replication Task Suite available via the XMLUI.
    2. You will also need to follow the Configuration instructions above in order to properly enable & configure the Replication Task Suite.