...

While the coordinated use of the tasks described above can provide the basis for a solid replication strategy and practice, there are several processes that could necessitate a fair amount of curatorial work. For example, in the discussion on ensuring integrity of AIPs over time, we remarked that vigilance was required by the curator to transmit new AIPs whenever Items change. It is possible to leverage existing facilities in DSpace to substantially reduce this effort through automation.

The Replication Task Suite includes a so-called 'event consumer' that can 'listen for' any changes to objects in the repository. Event consumers are documented elsewhere, but all we need to do to activate this consumer is add it to the list of consumers (in dspace.cfg):

Code Block
#### Event System Configuration ####

# default synchronous dispatcher (same behavior as traditional DSpace)
event.dispatcher.default.class = org.dspace.event.BasicDispatcher
event.dispatcher.default.consumers = search, browse, eperson, harvester, replicate
....
# consumer to manage content replication
event.consumer.replicate.class = org.dspace.ctask.replicate.ReplicateConsumer
event.consumer.replicate.filters = Community|Collection|Item+Install|Modify|Modify_Metadata|Delete

This configuration essentially means: listen for any new, modified or deleted Items, Collections and Communities. If you do not care about Community or Collection AIPs, just remove 'Community' or 'Collection' from the list.
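
To make the filter syntax concrete, here is a minimal Python sketch of how such a filter string could be read. It assumes that the part before '+' lists object types and the part after it lists event types, each separated by '|'. The function names are hypothetical, and this is only an illustration of the syntax, not DSpace's own (Java) implementation.

```python
# Illustrative model of the 'filters' syntax shown above.
# Assumption: "<object types>+<event types>", each list separated by '|'.

DEFAULT_FILTER = "Community|Collection|Item+Install|Modify|Modify_Metadata|Delete"
# The same filter with 'Community' and 'Collection' removed, as the text suggests.
ITEM_ONLY = "Item+Install|Modify|Modify_Metadata|Delete"


def parse_filter(spec):
    """Split a filter string into (set of object types, set of event types)."""
    objects, _, events = spec.partition("+")
    return set(objects.split("|")), set(events.split("|"))


def matches(spec, object_type, event_type):
    """Return True if an event of this kind on this object type passes the filter."""
    objects, events = parse_filter(spec)
    return object_type in objects and event_type in events


if __name__ == "__main__":
    for spec in (DEFAULT_FILTER, ITEM_ONLY):
        print(spec, "->", matches(spec, "Collection", "Delete"))
```

Under this reading, a deleted Collection passes the default filter but is ignored once 'Collection' is removed from the object list.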

When the ReplicateConsumer gets a relevant event, it will act on it as follows:

If the event is the addition of a new DSpace object (for Items, strictly an 'installation', i.e. when the item exits workflow), then a request for an AIP transmission is queued. The same occurs whenever an object changes (so-called modify events). When an object is deleted, a 'catalog' of the deletion is transmitted to the replication service. The catalog simply lists everything that was deleted: for an item, just the handle of the item; for a collection, the handles of all the items that were in it. This way, if the deletion was a mistake, the catalog can be used to recover all the contents. This is the default behavior of the consumer. You may configure it in [dspace]/config/modules/replicate.cfg:

Code Block
###  ReplicateConsumer settings ###
# ReplicateConsumer must be properly declared/configured in dspace.cfg
# All tasks defined will be queued, unless the '+p' suffix is appended, when
# they will be immediately performed. Exercise considerable caution when using
# +p, as lengthy tasks can adversely affect UI or other responsiveness.

# Replicate event consumer tasks upon install/add events.
# A comma separated list of valid task plugin names (with optional '+p' suffix)
consumer.tasks.add = transmitaip

# Replicate event consumer tasks upon modification events.
# A comma separated list of valid task plugin names (with optional '+p' suffix)
consumer.tasks.mod = transmitaip

# Replicate event consumer tasks upon delete/remove events.
# A comma separated list of valid task plugin names (with optional '+p' suffix)
consumer.tasks.del = catalog+p

# Replicate event consumer queue name - where all queued tasks are placed
consumer.queue = replication
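
The effect of the '+p' suffix can be illustrated with a short Python sketch. It models only what the comments above describe: each entry in a comma-separated task list is either queued (the default) or performed immediately (when it ends in '+p'). The function name and the 'queue'/'perform' labels are hypothetical and chosen for illustration; the real consumer is written in Java.

```python
# Illustrative model of how a 'consumer.tasks.*' value could be interpreted.
# Assumption: entries are comma separated; a trailing '+p' means "perform now".

# Default values taken from the replicate.cfg excerpt above.
CONSUMER_TASKS = {
    "add": "transmitaip",
    "mod": "transmitaip",
    "del": "catalog+p",
}


def plan_tasks(task_list):
    """Return a list of (task name, mode) pairs, mode being 'queue' or 'perform'."""
    plan = []
    for entry in task_list.split(","):
        name = entry.strip()
        if not name:
            continue  # tolerate empty entries such as a trailing comma
        if name.endswith("+p"):
            plan.append((name[:-2], "perform"))
        else:
            plan.append((name, "queue"))
    return plan


if __name__ == "__main__":
    for kind, tasks in CONSUMER_TASKS.items():
        print(kind, plan_tasks(tasks))
```

With the defaults shown, additions and modifications place 'transmitaip' on the queue, while deletions run 'catalog' immediately.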

One important point to be aware of: by default, the consumer processes every event it receives, regardless of collection. In our current case, however, we intend for only the 'Amazing Images' collection to be replicated. To achieve this, we must create a file in the directory defined by the 'base.dir' property in [dspace]/config/modules/replicate.cfg:

Code Block
# Base directory for replication operations
base.dir = ${dspace.dir}/replicate

Create a simple text file called 'include' in that directory and put the handle of the 'Amazing Images' collection in it. You can list as many collections as you like, one handle per line. If you instead replicate all but a few collections, name the file 'exclude' and list the handles of the collections you want to exclude.
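
The include/exclude rule can be sketched in Python as follows. This is a simplified model under stated assumptions, not the consumer's actual code: the function names are hypothetical, the handles in the usage example are made up, and giving 'include' precedence when both files exist is an assumption that the text above does not address.

```python
# Illustrative model of the 'include' / 'exclude' files in the base directory.
# Assumptions: one collection handle per line; 'include' wins if both files exist.
from pathlib import Path


def read_handles(path):
    """Return the set of non-blank lines (handles) in the file, or an empty set."""
    if not path.is_file():
        return set()
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def should_replicate(base_dir, collection_handle):
    """Decide whether events for the given collection should be processed."""
    include = Path(base_dir) / "include"
    exclude = Path(base_dir) / "exclude"
    if include.is_file():
        return collection_handle in read_handles(include)
    if exclude.is_file():
        return collection_handle not in read_handles(exclude)
    return True  # no filter file: every collection is replicated


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as base:
        # '123456789/2' is a made-up handle standing in for 'Amazing Images'.
        (Path(base) / "include").write_text("123456789/2\n")
        print(should_replicate(base, "123456789/2"))
        print(should_replicate(base, "123456789/3"))
```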

Replica Storage

The storage provider used by the replication tasks, along with related properties, is configured in [dspace]/config/modules/replicate.cfg:

Code Block
# Replica store implementation class
plugin.single.org.dspace.ctask.replicate.ObjectStore = \
    org.dspace.ctask.replicate.store.LocalObjectStore

# Location of local (e.g. local, mountable, sync) object store
# ignored for non-local stores (e.g. DuraCloud)
store.dir = ${dspace.dir}/repstore

When this consumer detects that a change has occurred (e.g. object added/changed/deleted), it will automatically queue specific AIPs to be regenerated.

Using the event consumer, the curator can essentially operate replication in 'auto-pilot' after the first complete transmission of AIPs.  This provides a "set it and forget it" option for your backup solution.

More information about setting up automation is available in the Automation Options configuration section above.

Replica Storage / Backup Location

For the replication of AIPs to be of any significant value, they must be stored in a safe, persistent, reliable, accessible, and available location. The replication tasks (transmitting, fetching, etc.) all rely on the configured storage provider.

The Replication Task Suite provides three storage provider options:

  • System Folder: The default configuration (LocalObjectStore) simply writes the AIPs to the local directory configured by the 'store.dir' property in replicate.cfg. This is not intended to be a production-grade solution, since a failure in the DSpace asset store could likely also affect this storage (unless it is on a separate physical drive). This base option is provided mostly as a way to begin to work with the replication tasks without worrying about finding a storage provider.
  • DuraCloud: For replicating in earnest, a service like DuraCloud is recommended (DuraCloudObjectStore). Such a service has the additional benefits of providing offsite storage/replication while also providing additional preservation management tools. Note that this service must be established and provisioned prior to use. For more information on DuraCloud see: http://www.duracloud.org
  • Mounted Drive: Alternatively, a MountableObjectStore option may be used if you wish to keep your AIP storage more "local" (e.g. on a local SAN or storage network). This option acts similarly to the default configuration (in that it writes to the local directory configured by the 'store.dir' property in replicate.cfg). However, the expectation is that this directory is actually a mounted storage drive, so AIPs are written in such a way as to support more complex storage architectures (e.g. an NFS-mounted store).

More information about each of these storage options (and how to configure them) is available in the Storage Options configuration section above. 

Codebase / Development

The following are notes for developers on how to check out the Replication Task Suite code and build it from source.

...