...

Configuring DuraCloud Storage

How DuraCloud storage works

The Replication Task Suite includes a DuraCloud Storage plugin which utilizes the DuraCloud REST API to send/retrieve content to/from DuraCloud. This allows you to back up & restore DSpace via DuraCloud.

  • Before you can use the DuraCloud Storage plugin, you first must sign up for a DuraCloud account (or sign up for a trial account). 
  • Once you have a DuraCloud account, you can configure the Replication Task Suite to use your DuraCloud Account Settings (as detailed below).
  • In DuraCloud, you will also want to create one (or more) "DuraCloud Spaces" in which to store your DSpace AIPs. You'll then need to configure those space(s) in the DuraCloud Storage Settings of the Replication Task Suite (as detailed below).  A DuraCloud Space represents the location in your DuraCloud account where you want DSpace to store its content.  Having a separate DuraCloud Space for your DSpace content is recommended (though not required), as it allows you to separate your DSpace content from any other content you may wish to store in DuraCloud.
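
Because the plugin talks to DuraCloud entirely over its REST API, you can sanity-check your account and Spaces with any HTTP client before touching DSpace configuration. The following is just an illustrative sketch (assuming the standard "durastore" REST context and the demo hostname used in the examples below; substitute your own hostname and credentials):

Code Block
# List the Spaces available in this DuraCloud account (hypothetical credentials)
curl -u myduraclouduser:passw0rd https://demo.duracloud.org/durastore/spaces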

When you backup (transmit) DSpace content to DuraCloud via the Replication Task Suite, the following general steps occur:

  1. For each DSpace object (Community, Collection, Item), an AIP zip file is generated on the server running DSpace.  The AIP is temporarily stored in the server's [dspace]/replicate/[group.aip.name] directory, where "[group.aip.name]" is the value of the "group.aip.name" setting in your "replicate.cfg" configuration file (see DuraCloud Storage Settings below for more info).  This "group.aip.name" setting also corresponds to the ID of the DuraCloud Space where the AIP will be stored.
  2. Once the AIP is generated, the Replication Task Suite determines whether a file of this same name already exists in the DuraCloud Space.
    1. If this file does not exist in DuraCloud, the locally generated AIP is uploaded to DuraCloud.
    2. If a file of this name already exists, then the Replication Task Suite checks to see if it differs from the locally generated AIP. It does so by comparing the DuraCloud-reported checksum against the locally generated checksum.
      1. If the AIP checksums differ, the locally generated AIP is uploaded to DuraCloud and it replaces the version that was previously in DuraCloud.
      2. If the AIP checksums are identical, then the AIP is skipped. Nothing is uploaded to DuraCloud as the files are identical. This ensures that unnecessary uploads to DuraCloud are avoided.
  3. Once the local copy of the AIP is no longer needed, it is removed from the server's temporary location.
  4. If an upload to DuraCloud occurred, the local "odometer" is incremented to ensure it always details the total amount of content that has been uploaded (see Keeping Score section for more info on the "odometer").
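
These backup steps are normally triggered for you by replication tasks or the automation options described below, but they can also be exercised manually. As a sketch, the "transmitsingleaip" task (referenced in the consumer configuration later on this page) can be run against a single object via the DSpace curation command line (the handle below is a placeholder):

Code Block
# Generate & transmit the AIP for one object (replace the handle with your own)
[dspace]/bin/dspace curate -t transmitsingleaip -i 123456789/10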

When you restore/replace DSpace content from DuraCloud via the Replication Task Suite, the following general steps occur:

  1. For each DSpace object (Community, Collection, Item), that object's AIP is downloaded from DuraCloud to the server running DSpace (the appropriate AIP is located in DuraCloud via its filename).  The AIP is temporarily stored in the server's [dspace]/replicate/[group.aip.name] directory, where "[group.aip.name]" is the value of the "group.aip.name" setting in your "replicate.cfg" configuration file (see DuraCloud Storage Settings below for more info).  This "group.aip.name" setting also corresponds to the ID of the DuraCloud Space where the AIP is stored.
  2. Once the download completes, the local "odometer" is incremented to ensure it always details the total amount of content that has been downloaded (see Keeping Score section for more info on the "odometer").
  3. The AIP is then "unzipped", and the DSpace object is restored/replaced as needed.
  4. Once the local copy of the AIP is no longer needed, it is removed from the server's temporary location.

Whether you are backing up content to DuraCloud or restoring content from DuraCloud, the Replication Task Suite helps to ensure that these tasks are as seamless as possible.  As moving content in/out of the cloud can sometimes result in extra costs, the Replication Task Suite also ensures it avoids unnecessary uploads.  Finally, the Replication Task Suite helps you better estimate what those costs may be by keeping a running total of uploads/downloads in the "odometer".

DuraCloud Account Settings

In order to configure DuraCloud Storage, you first must have an existing DuraCloud Account (or a trial account). This account's settings should be configured in your [dspace]/config/modules/duracloud.cfg file as follows:

  1. DuraCloud HostName: This is the location of your DuraCloud instance (the URL you normally use to access your account). Just provide the hostname.

    Code Block
    # DuraCloud service location (just the hostname)
    host = demo.duracloud.org
    
  2. DuraCloud Service Port: This is the port that DuraCloud is running on. It is almost always "443", unless you have installed DuraCloud yourself and configured it differently.

    Code Block
    # DuraCloud service port (usually 443 for https)
    port = 443
    
  3. DuraCloud's "DuraStore" path: This is the path to DuraCloud's "DuraStore" service. It is almost always "durastore", unless you have installed DuraCloud yourself and configured it differently.

    Code Block
    context = durastore
    
  4. DuraCloud Username & Password: Finally, fill out your account username & password in these settings. Please note, as this file now contains your DuraCloud account information, we recommend securing it if possible (see the example below). Just ensure it is still readable by the system user that DSpace runs as.

    Code Block
    # DuraCloud user name
    username = myduraclouduser
    # DuraCloud password
    password = passw0rd
    
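As noted in step 4, this file now holds your DuraCloud credentials. A minimal sketch of securing it (assuming DSpace runs as a system user named "dspace"; adjust the user and path to your environment):

Code Block
# Make duracloud.cfg readable only by the DSpace system user
chown dspace [dspace]/config/modules/duracloud.cfg
chmod 600 [dspace]/config/modules/duracloud.cfg
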
DuraCloud Storage Settings

Now, to configure DuraCloud as your storage location, please change the following settings in your [dspace]/config/modules/replicate.cfg configuration file:

  1. Enable DuraCloud Storage Plugin: Ensure the Replication Task Suite is set up to use the 'DuraCloudObjectStore' plugin:

    Code Block
    # Replica store implementation class (specify one)
    plugin.single.org.dspace.ctask.replicate.ObjectStore = \
        org.dspace.ctask.replicate.store.DuraCloudObjectStore
    
  2. Configure DuraCloud Primary Space to use: Your DuraCloud account allows you to separate content into various "Spaces". You'll need to create a new DuraCloud Space in which your AIPs will be stored, and configure that as your group.aip.name (by default, it's set to a DuraCloud Space with an ID of "aip_store"). You should also create a new DuraCloud Space that your AIPs will be moved to if they are ever removed, and configure that as your group.delete.name. Optionally, if you are using Checkm manifests, you can also create and configure a group.manifest.name DuraCloud Space (see the next step).

    Code Block
    # The primary storage group / folder where AIPs are stored/retrieved when AIP based tasks 
    # are executed (e.g. "Transmit AIP", "Recover from AIP")
    group.aip.name = aip_store
    
  3. Optionally, Configure Additional DuraCloud Spaces: If you have chosen to utilize Checkm manifest validation, you will need to create and configure a DuraCloud Space corresponding to the group.manifest.name setting below. Additionally, if you have chosen to enable automatic replication (see Automation Options below), you will need to create and configure a DuraCloud Space corresponding to the group.delete.name setting below.

    Code Block
    # The storage group / folder where Checkm Manifests are stored/retrieved when Checkm Manifest 
    # based tasks are executed (org.dspace.ctask.replicate.checkm.*).
    group.manifest.name = manifest_store
    
    # The storage group / folder where AIPs are temporarily stored/retrieved when an object deletion occurs
    # and the ReplicationConsumer is enabled (see below). Essentially, this 'delete' group provides a 
    # location where AIPs can be temporarily kept in case the deletion needs to be reverted and the object restored.
    # WARNING: THIS MUST NOT BE SET TO THE SAME VALUE AS 'group.aip.name'. If it is set to the 
    # same value, then your AIP backup processes will be UNSTABLE and restoration may be difficult or impossible.
    group.delete.name = trash
    
    Info: Using File Prefixes instead of separate DuraCloud Spaces

    If you'd rather keep all your DSpace files in a single DuraCloud Space, you can tweak your "group.aip.name", "group.manifest.name" and "group.delete.name" settings to specify a file-prefix to use.  For example:

    group.aip.name = dspace_backup/aip_store

    group.manifest.name = dspace_backup/manifest_store

    group.delete.name = dspace_backup/trash

    With the above settings in place, all your DSpace content will be stored in the "dspace_backup" Space within DuraCloud.  AIPs will all be stored with a file-prefix of "aip_store/" (e.g. "aip_store/ITEM@123456789-2.zip").  Manifests will all be stored with a file-prefix of "manifest_store/".  And any deleted objects will be temporarily stored with a file-prefix of "trash/".   This allows you to keep all your content in a single DuraCloud Space while avoiding name conflicts between AIPs, Manifests and deleted files.

...

Automation Options

Performing a backup of DSpace is one thing, but ensuring that backup is always "synchronized" with your changing DSpace content is another.

The Replication Task Suite offers several options to automate replication of content to your backup storage location of choice.

  1. Automatically Sync Changes (via Queue) : Any changes that happen in DSpace (new objects, changed objects, deleted objects) are automatically added to a "queue". This queue can then be processed on a schedule.
  2. Scheduled Site Auditing/Replication : You may also wish to perform a full site audit or backup on a scheduled basis.

...

The Replication Task Suite includes an 'event consumer' that can 'listen for' any changes to objects in the repository. The job of this 'consumer' is to ensure that anytime an object is added/changed/deleted, it is added to the queue of objects that need to be replicated to your backup storage location.

Activate the Sync Consumer

In order to enable this synchronization, you will need to add a new consumer to the list of DSpace event consumers (in dspace.cfg).  It is recommended to add this new configuration to the end of the list of existing "event.consumer." options in your dspace.cfg file.

  • METS-based AIP Replicate Consumer: This consumer will listen for changes to any DSpace Communities, Collections, Items, Groups, or EPeople.  It should be utilized if you have chosen to use METS-based AIPs. See AIP Format Options above for more details.

    Code Block
    #### Event System Configuration ####
    
    ....
    
    # consumer to manage METS AIP content replication
    event.consumer.replicate.class = org.dspace.ctask.replicate.METSReplicateConsumer
    event.consumer.replicate.filters = Community|Collection|Item|Group|EPerson+All
    

     

    • In human terms, this configuration essentially means: listen for all changes to Communities, Collections, Items, Groups and EPeople. If a change is detected, run the "METSReplicateConsumer" (which adds that object to the queue).
  • BagIt-based AIP Replicate Consumer: This consumer will ONLY listen for changes to DSpace Communities, Collections and Items, as those are the only types of objects which are stored in BagIt-based AIPs. See AIP Format Options above for more details.

    Code Block
    #### Event System Configuration ####
    
    ....
    
    # consumer to manage BagIt AIP content replication
    event.consumer.replicate.class = org.dspace.ctask.replicate.BagItReplicateConsumer
    event.consumer.replicate.filters = Community|Collection|Item+Install|Modify|Modify_Metadata|Delete
    

     

    • In human terms, this configuration essentially means: listen for any new, modified or deleted Items, Collections and Communities. If you do not care about Community or Collection AIPs, just remove 'Community' or 'Collection' from the list. When one of the specified changes is detected, run the "BagItReplicateConsumer" (which adds that object to the queue).

You will need to restart DSpace for this new Consumer to be recognized.
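
How you restart DSpace depends on your environment. For example, if your DSpace webapps run under a Tomcat instance managed by systemd (an assumption; adjust to your own setup), the restart might look like:

Code Block
# Restart the servlet container so the new event consumer is loaded
sudo systemctl restart tomcat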

How the Sync Consumer works

...

When the activated ReplicateConsumer detects a change on an object in DSpace, it will do the following:

  • Newly Added Objects: If the event is an addition of a new DSpace object (for items this only occurs once the item exits approval workflow), then a request for an AIP transmission is queued.
  • Changed Objects: The same occurs whenever an object has changed (so-called modify events). The modified object is queued for AIP transmission.
  • Deleted Objects: When an object is deleted, a 'catalog' of the deletion is transmitted to the replication service. The catalog simply lists all the objects that were deleted: if an Item, just the handle of that Item; if a Collection, all the Item handles that were in it. This way, if the deletion was mistaken, the catalog can be used to recover all the contents. This represents the default behavior of the consumer; however, you may configure it in [dspace]/config/modules/replicate.cfg, as described in the next section.

Configuring the Sync Consumer

The actions of the activated ReplicateConsumer (i.e. both the METSReplicateConsumer and the BagItReplicateConsumer) are configured within [dspace]/config/modules/replicate.cfg.  Below are the default options (which normally need no modification):

Code Block
###  ReplicateConsumer settings ###
# ReplicateConsumer must be properly declared/configured in dspace.cfg
# All tasks defined will be queued, unless the '+p' suffix is appended, in which
# case they will be performed immediately. Exercise considerable caution when using
# +p, as lengthy tasks can adversely affect UI or other responsiveness.

# Replicate event consumer tasks upon install/add events.
# A comma separated list of valid task plugin names (with optional '+p' suffix)
# By default we transmit a new AIP when a new object is added
consumer.tasks.add = transmitsingleaip

# Replicate event consumer tasks upon modification events.
# A comma separated list of valid task plugin names (with optional '+p' suffix)
# By default we transmit an updated AIP when an object is modified
consumer.tasks.mod = transmitsingleaip

# Replicate event consumer tasks upon delete/remove events.
# A comma separated list of valid task plugin names (with optional '+p' suffix)
# By default we write out a deletion catalog & move the deleted object's AIP
# to the "trash" group in storage (where it can be permanently deleted later)
consumer.tasks.del = catalog+p

# Replicate event consumer queue name - where all queued tasks are placed
# This queue appears under the curate.cfg file's 'taskqueue.dir'
# (default taskqueue location is [dspace]/ctqueues/)
consumer.queue = replication
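
As the comments above note, appending '+p' to a task name causes that task to be performed immediately rather than queued. For example, a hypothetical variation that transmits AIPs the moment an object is added (use with caution, as lengthy tasks can slow the UI) would be:

Code Block
# Hypothetical variation: transmit new AIPs immediately instead of queueing them
consumer.tasks.add = transmitsingleaip+p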

...

  • Both "add" and "modification" events add the "transmitsingleaip" task (which will regenerate & transmit the object's AIP to replica storage) to the queue of tasks to perform.  Please ensure you are scheduling this queue to be processed, as detailed in Processing the Sync Consumer Queue below.
  • The "delete" event triggers a special "catalog" task.  This "catalog" task does the following:
    • First, it creates a plaintext "catalog" file which lists all the objects that were deleted.
    • Second, it moves the AIPs for those deleted objects to the "group.delete.name" storage area (this is essentially putting them in a "trash" folder, where they can be cleaned up later, or potentially restored if the deletion was accidental).
  • By default, the queue used for all replication events is located at: [dspace]/ctqueues/replication  (this is a plaintext file which just lists all actions that should be performed the next time the queue is processed)

Info: Including or Excluding specific objects from automatic sync

It is also possible to include or exclude specific objects (via their handles) from this automatic sync.  This can be done via the addition of an "include" or "exclude" text file in your "base.dir" (by default: [dspace]/replicate/).

For example, suppose you only want to synchronize a single important Community (with handle "123456789/10"): you can create a text file named "include" and add a single line of text:

  • 123456789/10

Alternatively, if you'd like to synchronize everything except for two unimportant Collections (with handles: "123456789/11" and "123456789/12"), you can create a text file named "exclude" and add two lines of text:

  • 123456789/11
  • 123456789/12

In either the "include" or "exclude" files, you can add as many handles (one per line) as you like. These handles can represent Communities, Collections or Items.

Please note that the "exclude" file takes precedence over the "include" file.  So, if an object handle is listed in both files, that object will be excluded from processing.
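
As a concrete sketch (assuming the default base.dir of [dspace]/replicate/ and the example handles above), these files could be created from the command line:

Code Block
# Sync only one important Community
echo "123456789/10" > [dspace]/replicate/include

# ...or sync everything except two Collections
printf "123456789/11\n123456789/12\n" > [dspace]/replicate/exclude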

Warning: Don't forget to schedule the Consumer Queue to be processed!

By default, just configuring the Consumer will only generate a queue of tasks in the location specified by the consumer.queue setting in "replicate.cfg".  You must ensure that you schedule this queue to be processed for the synchronization to be complete.  See the Processing the Sync Consumer Queue section below.

Processing the Sync Consumer Queue

Once you've set up your Consumer & restarted DSpace, you'll start to see a (plain text file) queue of tasks (in the location specified by the consumer.queue setting in "replicate.cfg") that need to be performed in order to synchronize your AIP backup with what is in your DSpace instance.  This replication queue is just a normal DSpace Curation System queue, and it can be processed via the command line or a cron job (recommended).

Processing this queue is as simple as scheduling a cron job (or similar) to run on a daily, weekly or monthly basis (how frequently you want this to run is up to you).  For example, here's a cron job which will process a queue named "replication" every Saturday at 1:00AM local time and place the results into a "replication-queue.log" file (NOTE: you may need to modify the paths to DSpace below if you wish to use this example):

Code Block
0 1 * * 6 $HOME/dspace/bin/dspace curate -q replication > $HOME/dspace/log/replication-queue.log

In case it is not obvious, you can also process this queue manually via the command line by simply running: [dspace]/bin/dspace curate -q replication


During the processing of the queue, the existing queue file is "locked", but any new changes are logged to a separate queue file.  In other words, changes that happen in the DSpace system during the processing of the queue will still be captured in a separate queue file (which will be processed the next time you execute the above script).

Enhancing the Performance of the Queue Processing (optional)

For large or highly active repositories, the "replication" queue may grow rather large as each change in the system adds (at least) one task to the queue.  In addition, if the same object is modified multiple times (e.g. several small tweaks to an item), it will cause duplicate entries to appear in this queue.

In DSpace, by default, duplicate tasks in a Curation System queue will each be processed individually. So, that means if an Item is updated 10 times, it will appear in the queue 10 times, and its AIP will be (re-)generated and (re-)transmitted to storage 10 times when that queue is processed.  (Transmission Note: Some storage platforms, e.g. DuraCloud, provide a way to determine whether a newly generated AIP actually differs from the one in replica storage. So, in the case of DuraCloud storage, the AIP will be re-generated 10 times, but it will only be transmitted to DuraCloud ONCE. The other 9 times, the DuraCloud storage plugin will determine that the checksum of the new AIP is identical to the one in DuraCloud and skip the transmission step.  See the How DuraCloud storage works section above for more info.)


Additional Options

Configuring usage of Checkm manifest validation

...