
...

  1. AIP Format Options: Does your institution want to back up using the default DSpace AIP format (METS packaging)? Or would you rather utilize the new BagIt AIP Format?
  2. Storage Options: Does your institution plan to use the Replication Suite to back up to a local/mounted drive? Or would you like to connect it to a DuraCloud account? (These first two choices map onto the configuration sketched just after this list.)
  3. Automation Options: Do you want to automatically sync your AIP backup store with what is in DSpace? (this is recommended, but not required)
  4. Additional Options: Do you plan to use Checkm manifests for checksum auditing?
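
These decisions ultimately map onto configuration settings for the suite, most of which live in [dspace]/modules/replicate.cfg and dspace.cfg. As a rough sketch of where the first two choices are made (the property and class names here are assumptions based on the configuration shipped with the suite, so verify them against your own replicate.cfg):

Code Block
# (Property/class names assumed from the suite's default replicate.cfg; verify locally)
# AIP packaging format: 'mets' (the default DSpace AIP format) or 'bagit'
packer.pkgtype = mets

# Replica store implementation: LocalObjectStore (or MountableObjectStore) for a
# local/mounted drive, or DuraCloudObjectStore for a DuraCloud account
plugin.single.org.dspace.ctask.replicate.ObjectStore = \
    org.dspace.ctask.replicate.store.LocalObjectStore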

...

  1. Enable DuraCloud Storage Plugin: Ensure the Replication Suite is set up to use the 'DuraCloudObjectStore' plugin

    Code Block
    # Replica store implementation class (specify one)
    plugin.single.org.dspace.ctask.replicate.ObjectStore = \
        org.dspace.ctask.replicate.store.DuraCloudObjectStore
    
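    In addition to selecting the plugin, the connection to your DuraCloud account must be configured. In recent versions these settings live in [dspace]/config/modules/duracloud.cfg; the property names below are taken from that file but should be verified against your own copy, and the values are placeholders only.

    Code Block
    # DuraCloud service location & credentials
    # (property names assumed from the suite's default duracloud.cfg; values are placeholders)
    # The baseurl should point at your account's 'durastore' service
    baseurl = https://your-account.duracloud.org/durastore
    username = your-duracloud-username
    password = your-duracloud-password
    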
  2. Configure DuraCloud Primary Space to use: Your DuraCloud account allows you to separate content into various "Spaces". You'll need to create a new DuraCloud Space in which your AIPs will be stored, and configure it as your group.aip.name (by default this is set to a DuraCloud Space with an ID of "aip_store"). Additional Spaces for deleted objects (group.delete.name) and, optionally, Checkm manifests (group.manifest.name) are covered in the next step.

    Code Block
    # The primary storage group / folder where AIPs are stored/retrieved when AIP based tasks 
    # are executed (e.g. "Transmit AIP", "Recover from AIP")
    group.aip.name = aip_store
    
  3. Optionally, Configure Additional DuraCloud Spaces: If you have chosen to utilize Checkm manifest validation, you will need to create and configure a DuraCloud Space corresponding to the group.manifest.name setting below. Additionally, if you have chosen to enable Automatic Replication (see Automation Options below), you will need to create and configure a DuraCloud Space corresponding to the group.delete.name setting below.

    Code Block
    # The storage group / folder where Checkm Manifests are stored/retrieved when Checkm Manifest 
    # based tasks are executed (org.dspace.ctask.replicate.checkm.*).
    group.manifest.name = manifest_store
    
    # The storage group / folder where AIPs are temporarily stored/retrieved when an object deletion occurs
    # and the ReplicationConsumer is enabled (see below). Essentially, this 'delete' group provides a 
    # location where AIPs can be temporarily kept in case the deletion needs to be reverted and the object restored.
    # WARNING: THIS MUST NOT BE SET TO THE SAME VALUE AS 'group.aip.name'. If it is set to the 
    # same value, then your AIP backup processes will be UNSTABLE and restoration may be difficult or impossible.
    group.delete.name = trash
    
    Info: Using File Prefixes instead of separate DuraCloud Spaces

    If you'd rather keep all your DSpace Files in a single DuraCloud Space, you can tweak your "group.aip.name", "group.manifest.name" and "group.delete.name" settings to specify a file-prefix to use.  For example:

    group.aip.name = dspace_backup/aip_store

    group.manifest.name = dspace_backup/manifest_store

    group.delete.name = dspace_backup/trash

    With the above settings in place, all your DSpace content will be stored in the "dspace_backup" Space within DuraCloud.  AIPs will all be stored with a file-prefix of "aip_store/" (e.g. "aip_store/ITEM@123456789-2.zip").  Manifests will all be stored with a file-prefix of "manifest_store/".  And any deleted objects will be temporarily stored with a file-prefix of "trash/".   This allows you to keep all your content in a single DuraCloud Space while avoiding name conflicts between AIPs, Manifests and deleted files.

 

Automation Options

Performing a backup of DSpace is one thing, but ensuring that backup stays "synchronized" with your changing DSpace content is another.

The Replication Task Suite offers several options to automate replication of content to your backup storage location of choice.

  1. Automatically Sync Changes (via Queue) : Any changes that happen in DSpace (new objects, changed objects, deleted objects) are automatically added to a "queue". This queue can then be processed on a schedule.
  2. Scheduled Site Auditing/Replication : You may also wish to perform a full site audit or backup on a scheduled basis.

Automatically Sync Changes (via Queue)

The Replication Task Suite includes an 'event consumer' that listens for any changes to objects in the repository. The job of this consumer is to ensure that any time an object is added, changed or deleted, it is added to the queue of objects that need to be replicated to your backup storage location.

Activate the Consumer

To enable this consumer, you need to add it to the list of DSpace event consumers (in dspace.cfg). It is recommended to add this new configuration to the end of the existing list of "event.consumer." options in your dspace.cfg file.

  • METS-based AIP Consumer: This consumer will listen for changes to any DSpace Communities, Collections, Items, Groups, or EPeople.  It should be utilized if you have chosen to use METS-based AIPs. See AIP Format Options above for more details.

    Code Block
    #### Event System Configuration ####
    
    ....
    
    # consumer to manage METS AIP content replication
    event.consumer.replicate.class = org.dspace.ctask.replicate.METSReplicateConsumer
    event.consumer.replicate.filters = Community|Collection|Item|Group|EPerson+All
    

     

    • In human terms, this configuration essentially means: listen for all changes to Communities, Collections, Items, Groups and EPeople. If a change is detected, run the "METSReplicateConsumer" (which adds that object to the queue).
  • BagIt-based AIP Consumer: This consumer will ONLY listen for changes to DSpace Communities, Collections and Items, as those are the only types of objects which are stored in BagIt-based AIPs. See AIP Format Options above for more details.

    Code Block
    #### Event System Configuration ####
    
    ....
    
    # consumer to manage BagIt AIP content replication
    event.consumer.replicate.class = org.dspace.ctask.replicate.BagItReplicateConsumer
    event.consumer.replicate.filters = Community|Collection|Item+Install|Modify|Modify_Metadata|Delete
    

     

    • In human terms, this configuration essentially means: listen for any new, modified or deleted Items, Collections and Communities. If you do not care about Community or Collection AIPs, just remove 'Community' or 'Collection' from the list. When one of the specified changes is detected, run the "BagItReplicateConsumer" (which adds that object to the queue).
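
    For example, if you do not need Community or Collection AIPs at all, the filter could be trimmed down to Items only (a sketch based on the filter syntax shown above):

    Code Block
    # Only listen for Item events; Community & Collection changes are ignored
    event.consumer.replicate.filters = Item+Install|Modify|Modify_Metadata|Delete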

You will need to restart DSpace for this new Consumer to be recognized.
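
How you restart DSpace depends on how it is deployed; as a minimal sketch, assuming DSpace runs in a Tomcat instance managed by systemd (the exact service name varies by installation):

Code Block
# (assumes Tomcat under systemd; adjust the service name for your setup)
# Restart the servlet container so dspace.cfg (including the new consumer) is re-read
sudo systemctl restart tomcat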

How the Consumer works

 

 

When the activated ReplicateConsumer detects a change on an object in DSpace, it will do the following:

  • Newly Added Objects: If the event is an addition of a new DSpace object (for items this only occurs once the item exits approval workflow), then a request for an AIP transmission is queued.
  • Changed Objects: The same occurs whenever an object has changed (so-called modify events). The modified object is queued for AIP transmission.
  • Deleted Objects: When an object is deleted, a 'catalog' of the deletion is transmitted to the replication service. The catalog simply lists all the objects that were deleted: for an Item, just the handle of the Item; for a Collection, the handles of all the Items it contained. This way, if the deletion was a mistake, the catalog can be used to recover all the contents. This represents the default behavior of the consumer; however, it can be configured in [dspace]/modules/replicate.cfg (see Configuring the Consumer below).

Configuring the Consumer

The actions of the activated ReplicateConsumer (whether the METSReplicateConsumer or the BagItReplicateConsumer) are configured within [dspace]/modules/replicate.cfg. Below are the default options (which normally need no modification).

 

Code Block
###  ReplicateConsumer settings ###
# ReplicateConsumer must be properly declared/configured in dspace.cfg
# All tasks defined will be queued, unless the '+p' suffix is appended, when
# they will be immediately performed. Exercise considerable caution when using
# +p, as lengthy tasks can adversely affect UI or other responsiveness. 

# Replicate event consumer tasks upon install/add events.
# A comma separated list of valid task plugin names (with optional '+p' suffix)
# By default we transmit a new AIP when a new object is added
consumer.tasks.add = transmitsingleaip

# Replicate event consumer tasks upon modification events.
# A comma separated list of valid task plugin names (with optional '+p' suffix)
# By default we transmit an updated AIP when an object is modified
consumer.tasks.mod = transmitsingleaip

# Replicate event consumer tasks upon delete/remove events.
# A comma separated list of valid task plugin names (with optional '+p' suffix)
# By default we write out a deletion catalog & move the deleted object's AIP
# to the "trash" group in storage (where it can be permanently deleted later)
consumer.tasks.del = catalog+p

# Replicate event consumer queue name - where all queued tasks are placed
# This queue appears under the curate.cfg file's 'taskqueue.dir'
# (default taskqueue location is [dspace]/ctqueues/)
consumer.queue = replication

As you can see in the default configuration above...

  • Both "add" and "modification" events add the "transmitsingleaip" task (which will regenerate & transmit the object AIP to replica storage) to the queue of tasks to perform.  Please ensure you are scheduling this queue to be processed, as detailed in  Processing the Consumer Queue below.
  • The "delete" event triggers a special "catalog" task.  This "catalog" task does the following:
    • First, it creates a plaintext "catalog" file which lists all the objects that were deleted.
    • Second, it moves the AIPs for those deleted objects to the "group.delete.name" storage area (this is essentially putting them in a "trash" folder, where they can be cleaned up later, or potentially restored if the deletion was accidental).
  • By default, the queue used for all replication events is located at [dspace]/ctqueues/replication (this is a plaintext file which simply lists all actions that should be performed the next time the queue is processed).
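
As the comments in replicate.cfg above note, appending a '+p' suffix to a task causes it to be performed immediately rather than queued. For example, to transmit an AIP as soon as an object is added (use with caution, since lengthy tasks can slow the UI), the "add" setting could be changed to:

Code Block
# Perform (rather than queue) AIP transmission immediately upon add events
consumer.tasks.add = transmitsingleaip+p
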
Info: Including or Excluding specific objects from automatic sync

It is also possible to include or exclude specific objects (via their handles) from this automatic sync. This can be done by adding an "include" or "exclude" text file in your "base.dir" (by default: [dspace]/replicate/).

For example, suppose you only want to synchronize a single important Community (with handle "123456789/10"); you can create a text file named "include" and add a single line of text:

  • 123456789/10

Alternatively, if you'd like to synchronize everything except for two unimportant Collections (with handles: "123456789/11" and "123456789/12"), you can create a text file named "exclude" and add two lines of text:

  • 123456789/11
  • 123456789/12

In either the "include" or "exclude" files, you can add as many handles (one per line) as you like. These handles can represent Communities, Collections or Items.

Please note that the "exclude" file takes precedence over the "include" file.  So, if an object handle is listed in both files, that object will be excluded from processing.
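
For instance, a minimal command-line sketch of creating the "exclude" file from the hypothetical example above (one handle per line):

Code Block
# Exclude two Collections from automatic sync
echo "123456789/11" >  [dspace]/replicate/exclude
echo "123456789/12" >> [dspace]/replicate/exclude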

Warning: Don't forget to schedule the Consumer Queue to be processed!

By default, just configuring the Consumer will only generate a queue of tasks in the location specified by the consumer.queue setting in "replicate.cfg". You must ensure that you schedule this queue to be processed for synchronization to actually take place. See the Processing the Consumer Queue section below.

Processing the Consumer Queue

Once you've set up your Consumer and restarted DSpace, you'll start to see a (plain text file) queue of tasks (in the location specified by the consumer.queue setting in "replicate.cfg") that need to be performed in order to synchronize your AIP backup with what is in your DSpace instance. This replication queue is just a normal DSpace Curation System queue, and it can be processed via the command line or a cron job (recommended).

Processing this queue is as simple as scheduling a cron job (or similar) to run on a daily, weekly or monthly basis (how frequently you want this to run is up to you). For example, here's a cron job which will process a queue named "replication" every Saturday at 1:00AM local time and place the results into a "replication-queue.log" file (NOTE: you may need to modify the paths to DSpace below if you wish to use this example):

Code Block
0 1 * * 6 $HOME/dspace/bin/dspace curate -q replication > $HOME/dspace/log/replication-queue.log

In case it is not obvious, you can also process this queue manually via the command line by simply running: [dspace]/bin/dspace curate -q replication

During the processing of the queue, the existing queue file is "locked", but any new changes are logged to a separate queue file.  In other words, changes that happen in the DSpace system during the processing of the queue will still be captured in a separate queue file (which will be processed the next time you execute the above script).

 

 

 

Additional Options

Configuring usage of Checkm manifest validation

...