...

We can suppose our data curator has identified a collection of items in her DSpace repository consisting of high-value, born-digital, and unique/irreplaceable (not held elsewhere) content (called the 'Amazing Images' collection). She prudently wishes to insure against catastrophic local loss of this content by keeping a copy or replica of this collection elsewhere (e.g. either on a backup drive, or even in the cloud via a service like DuraCloud). She'd prefer to replicate all her DSpace content, but realizes that storage costs over long periods have made her administration wary, so she decides to begin with this collection.

...

In order to budget for replication storage, she needs to know the 'size' of the collection. When she asks her sysadmin, he replies that it is easy to give her figures for the whole DSpace asset store, but since collections aren't stored separately, she would have to add up each item's bitstreams in the collection, a rather tedious process. Thus the first task: a reporting tool which operates on natural DSpace objects, rather than storage volumes. The "Estimate Storage Space for AIP(s)" (estaipsize) task will give her this ability.

As this is the first task we are introducing in the Replication Task Suite, let's take a quick look at how each task is configured/enabled. Each enabled task in this Suite is defined in [dspace]/config/modules/curate.cfg (NB: all curation configuration is 'modular' in the sense that the configuration properties live outside of dspace.cfg, in named files. This means that if a given suite of tasks is unused, its configuration is never installed). So, this estaipsize task has its own definition in the file as follows:

Code Block
plugin.named.org.dspace.curate.CurationTask = \
.... other curation tasks
    org.dspace.ctask.replicate.EstimateAIPSize = estaipsize

In that same file, each task is given a human-readable "name", which is what is displayed in the DSpace Administrative UI:

Code Block
ui.tasknames = \
.... other tasks
    estaipsize = Estimate Storage Space for AIP(s)

Of course, both the name of the task ('estaipsize') and the human-readable name can easily be modified in this file. You are free to rename them as you see fit.

Now, getting back to our curator. To utilize this task (or any other task in the Replication Task Suite), she can navigate to her collection and select the 'curate' tab. Then, from the dropdown list of tasks, she chooses the appropriate task and clicks "Perform". On the page, the results will display:

ID: 123456789/1 (Amazing Images) estimated AIP size: 4 gigabytes

We should warn that the estimates from this task are rather crude, in that they do not measure the actual size of all AIPs. Rather, they just total up the bitstream (file) sizes (and do not include metadata.xml files). However, even this crude figure should be adequate for estimating overall storage needs.
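To make the nature of this estimate concrete, here is a minimal Python sketch of the idea: total the bitstream sizes of every item in a collection and ignore metadata and packaging overhead. The data layout and the names `estimate_aip_size` and `human_readable` are hypothetical, chosen only for illustration; this is not the task's actual implementation.

```python
# Hypothetical in-memory model: a collection maps item handles to
# the byte sizes of that item's bitstreams (files).
def estimate_aip_size(collection):
    """Total the bitstream sizes of all items, ignoring metadata and packaging."""
    return sum(sum(sizes) for sizes in collection.values())


def human_readable(num_bytes):
    """Render a byte count using decimal units (1 gigabyte = 10**9 bytes)."""
    units = (("gigabytes", 10**9), ("megabytes", 10**6), ("kilobytes", 10**3))
    for unit, factor in units:
        if num_bytes >= factor:
            return f"{num_bytes / factor:.1f} {unit}"
    return f"{num_bytes} bytes"


# Made-up sizes for two items in the 'Amazing Images' collection.
amazing_images = {
    "123456789/2": [1_500_000_000, 500_000_000],
    "123456789/3": [2_000_000_000],
}
total = estimate_aip_size(amazing_images)
print(f"estimated AIP size: {human_readable(total)}")
```

Because only file sizes are summed, the real AIPs will differ somewhat from this figure, which is why the estimate is best treated as a budgeting aid.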

Replicating

Replication Task Used:

Transmit AIP(s) to Storage

Task ID: transmitaip

Having secured approval to replicate the 'Amazing Images' collection, our curator obviously needs a task to generate the AIP representations of each item in the collection, and transmit these archive files to the replication storage site (which may be service-backed, local, in the cloud, etc., as will be explored below). This is the "Transmit AIP(s) to Storage" (transmitaip) task, implemented by 'org.dspace.ctask.replicate.TransmitAIP'. It is enabled just like the previous task, through its configuration properties in curate.cfg. (We won't repeat a description of this process each time, but note that you may always add a task yet elect not to display it in the administrative UI.)

Since we are now working with AIPs, we should examine how they are configured for the tasks. Most configuration specific to the replication task suite is found at [dspace]/config/modules/replicate.cfg. There are two main properties to set (or accept default values):

...

Our data curator may elect to perform this task in the DSpace Admin UI, or, if the collection is rather large, she may instead 'queue' the task for later execution by using the queueing facility available in the curation system. We should note that the 'transmitaip' task, like all other replication tasks, operates on whatever DSpace object(s) it is given. Thus, if the object is a collection, the task creates (and transmits, of course) an AIP for the collection object itself (metadata and logo), as well as AIPs for each item in the collection. If the task is given an identifier for a single Item, then only one AIP will be created and transmitted.
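The way a task fans out from a container to its members can be sketched in a few lines of Python. The `CHILDREN` table and the `objects_to_package` function are hypothetical stand-ins for the repository hierarchy, not part of DSpace.

```python
# Hypothetical repository model: each container handle lists its children.
CHILDREN = {
    "123456789/1": ["123456789/2", "123456789/3"],  # a collection and its items
}


def objects_to_package(handle, children=CHILDREN):
    """Return the handles that would each get their own AIP, container first."""
    result = [handle]
    for child in children.get(handle, []):
        result.extend(objects_to_package(child, children))
    return result


# A collection yields one AIP for itself plus one per item;
# a single item yields exactly one AIP.
print(objects_to_package("123456789/1"))
print(objects_to_package("123456789/2"))
```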

Verifying Replication

Replication Task Used:

Verify AIP(s) exist in Storage

Task ID: verifyaip

While the 'transmitaip' task will report on whether or not it was successful in generating and transmitting AIP(s) to the replication service, our data curator wants the ability (within DSpace, not by using the replication service tools or UIs) to check whenever she likes that the AIP(s) which were transmitted are still there. A simple task, "Verify AIP(s) exist in Storage" (verifyaip, implemented by 'org.dspace.ctask.replicate.VerifyAIP'), can perform this function.

...

The 'Amazing Images' collection is comparatively static, meaning that few new items are likely to be added, and most of the metadata in each item is not routinely changed. However, over longer periods of time, cataloging errors are discovered and corrected, and perhaps formats become obsolete and new bitstreams are added. If the curator is fastidious about each change, and performs the 'transmitaip' task on each item that has changed, then in general the set of AIP replicas will always be 'in sync' with the repository. However, it is useful to have the means to ensure that the replicas agree with the repository without having to create and transmit entirely new ones. Thus the task "Audit against AIP(s)" (auditaip, implemented by 'org.dspace.ctask.replicate.CompareWithAIP'), which can be thought of as a simple, quick auditing task. When performed on an Item, the task does the following:

...

The task will thus fail only if the checksums differ, which can only happen if some part of the DSpace Object (metadata or bitstream) itself differs. If the version of the item that is believed to be authentic is the repository (local) one, then simply performing the 'transmitaip' task on the item will restore synchrony. For collections and communities, this task also does an 'extent' comparison, which means that it will determine whether the replica store has an AIP for every item known (locally) to be in the collection or community.
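The two comparisons described above (a checksum match per object, and an extent check per container) can be illustrated with a small Python sketch. The function names are hypothetical, and the choice of MD5 as the digest is an assumption made for the example, not a statement about what the task uses.

```python
import hashlib


def checksum(data: bytes) -> str:
    # MD5 is only an example digest here; the actual algorithm is an assumption.
    return hashlib.md5(data).hexdigest()


def audit_item(local_aip: bytes, replica_checksum: str) -> bool:
    """Pass only if a freshly generated local AIP matches the replica's checksum."""
    return checksum(local_aip) == replica_checksum


def missing_from_replica(local_items, replica_items):
    """Extent comparison: local item handles that have no AIP in the replica store."""
    return sorted(set(local_items) - set(replica_items))


aip = b"example AIP bytes"
print(audit_item(aip, checksum(aip)))            # unchanged object passes
print(audit_item(aip + b"!", checksum(aip)))     # any change fails the audit
print(missing_from_replica(["123456789/2", "123456789/3"], ["123456789/2"]))
```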

Repairing Damage

The AIPs in the replica store represent an insurance policy, and when 'claims' against that policy are filed, they can cover two situations:

  • either the repository object is completely missing, and we want to restore it,
  • or it is damaged and we want to repair the damage with data from the replica store AIP.

A set of replication tasks perform these functions, as described below.

Restoring Object(s)

Replication Tasks Used:

Restore Missing Object(s) from AIP(s)

Task ID: restorefromaip

Restore Missing Object(s) but Keep Existing Objects (*METS-AIP Only)

Task ID: restorekeepexisting

Restore Single Object from AIP (*METS-AIP Only)

Task ID: restoresinglefromaip

NOTE: Those tasks marked (*METS-AIP Only) are only supported when using METS-based AIPs.

If the curator should ever find the need to restore a deleted object, a variety of restoration-based tasks are available. The base task is the "Restore Missing Object(s) from AIP(s)" (restorefromaip) task.

The "Restore Missing Object(s) from AIP(s)" (restorefromaip) task will do the following:

  1. fetch the replica store AIP for the given object identifier
  2. decompress it and create a new DSpace object
  3. install the object into the repository, including restoring its state (withdrawn, embargoed, etc.)
  4. if the object is a collection or community, all child objects (e.g. items) will also have their AIP fetched, decompressed and restored

NOTE: This restorefromaip task will fail if there is already an object in the repository bearing the identifier given. In other words, it will report a failure if an object is found to already exist.

If you are using METS-based AIPs, two additional restoration tasks are available:

  • Restore Single Object from AIP (restoresinglefromaip)
    • This task acts the same as the default "restorefromaip" task, but it does NOT restore any child objects. So, if it is run on a collection, just the collection itself will be restored (items in that collection will not be restored).
  • Restore Missing Object(s) but Keep Existing Objects (restorekeepexisting)
    • This task acts similarly to the default "restorefromaip" task, but it attempts to skip over any objects which already exist in the repository. In other words, an error is not thrown if an object already exists; rather, that entire object (and all its child objects) is skipped over during processing and left unchanged. This mode is identical to the "Keep Existing" mode of the DSpace AIP Backup and Restore tool.
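The differences between the three restore tasks come down to what happens when an object already exists and whether children are processed. The following Python sketch models those rules on plain dictionaries; `restore`, `ObjectExistsError` and the data shapes are hypothetical and only mirror the behaviour described above.

```python
class ObjectExistsError(Exception):
    """Raised when a restore targets an identifier that is already present."""


def restore(handle, replica, repository, keep_existing=False, single=False):
    """Restore `handle` from `replica` into `repository`; return restored handles.

    replica maps handle -> (payload, list of child handles);
    repository maps handle -> payload.
    """
    if handle in repository:
        if keep_existing:
            return []  # like restorekeepexisting: skip object and its children
        raise ObjectExistsError(handle)  # like restorefromaip: report a failure
    payload, children = replica[handle]
    repository[handle] = payload
    restored = [handle]
    if not single:  # like restoresinglefromaip when single=True: no children
        for child in children:
            restored += restore(child, replica, repository, keep_existing)
    return restored


replica = {
    "coll": ("collection AIP", ["item1", "item2"]),
    "item1": ("item1 AIP", []),
    "item2": ("item2 AIP", []),
}
print(restore("coll", replica, {}))                      # everything restored
print(restore("coll", replica, {}, single=True))         # collection only
print(restore("coll", replica, {"item1": "kept"}, keep_existing=True))
```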

Replacing Object(s)

Replication Tasks Used:

Replace Existing Object(s) with AIP(s)

Task ID: replacewithaip

 

Replace Single Object with AIP (*METS-AIP Only)

Task ID: replacesinglewithaip

If the curator should ever find a need to replace a corrupted object or revert an existing object back to the version in remote storage, a variety of replacement tasks are available.  The base task is the "Replace Existing Object(s) with AIP(s)" (replacewithaip) task.

The "Replace Existing Object(s) with AIP(s)" (replacewithaip) task expects to replace an existing DSpace object. This task will do the following:

  1. fetch the replica store AIP for the given DSpace Object
  2. decompress it
  3. locate the existing DSpace object to be replaced & clear out all its existing metadata, files, access rights, etc.
  4. replace the existing DSpace object metadata, files, access rights, etc. with the information found in the AIP (thus "overwriting" or replacing all information in the existing object)
  5. if the object is a collection or community, all child objects (e.g. items) will also have their AIP fetched, decompressed and existing objects replaced
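The replace steps above can be sketched in Python as a clear-then-overwrite operation on an existing object. Everything here (`replace_with_aip`, the dictionary shapes, and treating a missing object as an error) is a hypothetical illustration rather than the task's real code.

```python
import copy


def replace_with_aip(handle, replica, repository):
    """Overwrite an existing object (and its children) with its replica AIP content."""
    if handle not in repository:
        # Assumption for this sketch: replacing a non-existent object is an error.
        raise KeyError(f"{handle} does not exist in the repository")
    aip = replica[handle]                      # steps 1-2: fetch and unpack
    obj = repository[handle]
    obj.clear()                                # step 3: clear existing information
    obj.update(copy.deepcopy(aip["content"]))  # step 4: overwrite from the AIP
    replaced = [handle]
    for child in aip.get("children", []):      # step 5: repeat for child objects
        replaced += replace_with_aip(child, replica, repository)
    return replaced


replica = {
    "coll": {"content": {"title": "Amazing Images"}, "children": ["item1"]},
    "item1": {"content": {"title": "Photo", "files": ["a.tif"]}},
}
repository = {
    "coll": {"title": "Amzing Imges", "stray": True},
    "item1": {"title": "Photo", "files": []},
}
print(replace_with_aip("coll", replica, repository))
print(repository["coll"])
```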

...

Ordinarily, a replication arrangement is long-standing: the preservation function cannot be fulfilled unless the replicas (here, the AIPs) are always kept and available. However, some collections (or items within them) may be removed for a variety of reasons: legal challenge, de-accession, etc. When the repository no longer wants to hold the object locally, the replica AIP ceases to have value. The task "Remove AIP(s) from Storage" (removeaip, implemented by 'org.dspace.ctask.replicate.RemoveAIP') will permanently delete the replica store AIP for its identifier. As with other replication tasks, if the identifier points to a collection or community, all the AIPs of all the members will also be permanently deleted.

Keeping Score

Replication Task Used:

Read Odometer

Task ID: readodometer

Many storage providers have cost structures that are more complex than simple functions of the total stored bytes: cloud providers in particular have costs associated with the use of the network to upload and download the stored object. An object that occupies 2 megabytes might cost far more over time than a 1 gigabyte object, if the former is downloaded 1000 times for every time the latter is. The replication system provides a very rudimentary task to help manage and track these factors: "Read Odometer" (readodometer, implemented by 'org.dspace.ctask.replicate.ReadOdometer'). This task simply displays the readings from the replication system that record cumulative use. The statistics are:

...
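The earlier point about transfer costs outweighing storage costs can be checked with a little arithmetic. The Python sketch below uses made-up prices and download counts (all of them assumptions for illustration, not any provider's actual rates) while keeping the 1000-to-1 download ratio from the example.

```python
def monthly_cost(size_gb, downloads_per_month, storage_price, egress_price):
    """Storage charge plus download (egress) charge for one object, per month."""
    stored = size_gb * storage_price
    transferred = size_gb * downloads_per_month * egress_price
    return stored + transferred


STORAGE_PRICE = 0.02  # hypothetical price per GB stored per month
EGRESS_PRICE = 0.10   # hypothetical price per GB downloaded

# The 2 MB object is downloaded 1000 times for each download of the 1 GB object.
small = monthly_cost(0.002, 10_000, STORAGE_PRICE, EGRESS_PRICE)
large = monthly_cost(1.0, 10, STORAGE_PRICE, EGRESS_PRICE)
print(f"2 MB object: {small:.2f} per month; 1 GB object: {large:.2f} per month")
```

Under these assumed figures the small object transfers twice as many bytes as the large one, so its monthly cost is higher even though it occupies a tiny fraction of the storage.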

Create a simple text file called 'include' and put the handle of the collection for 'Amazing Images' in it. You can add as many collections (one per line) as you like. If you replicate all but a few collections, just name the file 'exclude' and list the collection handles you want to exclude.
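The meaning of the two files can be sketched in Python as follows. The functions `read_handles` and `should_replicate` are hypothetical, and how the Replication Task Suite itself reads these files is not shown here; the sketch only illustrates the one-handle-per-line format and the include/exclude logic.

```python
def read_handles(text):
    """Parse a filter file: one collection handle per line, blank lines ignored."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def should_replicate(collection_handle, include=None, exclude=None):
    """Apply an 'include' list if given, otherwise an 'exclude' list."""
    if include is not None:
        return collection_handle in include
    if exclude is not None:
        return collection_handle not in exclude
    return True


# Example contents of an 'include' file listing two collections.
include = read_handles("123456789/1\n123456789/7\n")
print(should_replicate("123456789/1", include=include))
print(should_replicate("123456789/9", include=include))
```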

...

For the replication of AIPs to be of any significant value, they must be stored in a safe, persistent, reliable, accessible, and available location. The replication tasks of transmitting, fetching, etc. all rely on the configured storage provider. This and related properties are found in [dspace]/config/modules/replicate.cfg:

Code Block
# Replica store implementation class
plugin.single.org.dspace.ctask.replicate.ObjectStore = \
    org.dspace.ctask.replicate.store.LocalObjectStore

# Location of local (e.g. local, mountable, sync) object store
# ignored for non-local stores (e.g. DuraCloud)
store.dir = ${dspace.dir}/repstore
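To show what a directory-backed store has to do for the tasks that rely on it, here is a minimal Python sketch. The `LocalStore` class and its method names are hypothetical; it is not the suite's Java `LocalObjectStore`, only an illustration of the operations (transfer, existence check, fetch, remove) that the replication tasks delegate to a store.

```python
import shutil
import tempfile
from pathlib import Path


class LocalStore:
    """Hypothetical directory-backed replica store, for illustration only."""

    def __init__(self, store_dir):
        self.root = Path(store_dir)
        self.root.mkdir(parents=True, exist_ok=True)

    def transfer(self, aip_file, object_id):
        """Copy an AIP file into the store under the given id."""
        shutil.copyfile(aip_file, self.root / object_id)

    def exists(self, object_id):
        """Report whether an AIP with this id is present in the store."""
        return (self.root / object_id).is_file()

    def fetch(self, object_id, destination):
        """Copy a stored AIP back out to a destination path."""
        shutil.copyfile(self.root / object_id, destination)

    def remove(self, object_id):
        """Permanently delete a stored AIP."""
        (self.root / object_id).unlink()


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "aip.zip"
    source.write_bytes(b"example AIP bytes")
    store = LocalStore(Path(tmp) / "repstore")
    store.transfer(source, "123456789-2.zip")
    print(store.exists("123456789-2.zip"))
```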

...

For replicating in earnest, a service like DuraCloud is recommended (DuraCloudObjectStore). Such a service has the additional benefits of providing offsite storage/replication while also providing additional preservation management tools. Note that this service must be established and provisioned prior to use. For more information on DuraCloud see: http://www.duracloud.org

...

More information about each of these storage options (and how to configure them) is available in the Storage Options configuration section above.

 

Codebase / Development

The following are notes for developers on how to check out the Replication Task Suite code and build it from source.

  1. Download the Replication Suite code from GitHub: https://github.com/DSpace/dspace-replicate
    1. Check out the branch you wish to develop against. For example, to check out the 1.x branch of the codebase:

      Code Block
      git checkout dspace-replicate-1.x
  2. Build/Compile the Replication Suite, by running the following from the root directory

    Code Block
    mvn package
  3. The code will be compiled into a JAR and all its dependencies will also be copied to your "target" directory
    1. The main dspace-replicate.jar will be compiled to:
      • [dspace-replicate]/target/dspace-replicate-[version].jar (The Replication Suite Plugin)
    2. There will also be a total of 4 dependency JARs that will be copied to:
      • [dspace-replicate]/target/lib/common-[version].jar (DuraCloud common libraries - required for DuraCloud integration)
      • [dspace-replicate]/target/lib/commons-compress-[version].jar (Apache Commons Compress - prerequisite for Replication Suite plugin)
      • [dspace-replicate]/target/lib/storageprovider-[version].jar (DuraCloud storage provider libraries - required for DuraCloud integration)
      • [dspace-replicate]/target/lib/storeclient-[version].jar (DuraCloud store client libraries - required for DuraCloud integration)
  4. Once the codebase is compiled, you can install it by following the Installation instructions above.  
    1. Alternatively, you may temporarily copy all 5 JARs (dspace-replicate + dependency JARs) to the following locations for testing purposes only:
      • DSpace "lib" folder (e.g. [dspace]/lib/) - This will make the Replication Task Suite available via the commandline
      • DSpace XMLUI "lib" folder (e.g. [dspace]/webapps/xmlui/WEB-INF/lib/) - This will make the Replication Task Suite available via the XMLUI.
    2. You will also need to follow the Configuration instructions above in order to properly enable & configure the Replication Task Suite.