Replication Task Suite

One current application of the curation system is a related set (suite) of tasks to assist in performing replication of DSpace content to other locations. The content is packaged in containers known as AIPs (OAIS speak: 'archival information packages'). You can read much more about how AIPs are constituted here: AipBackupRestore, and as of DSpace 1.7, support for generating AIPs will be included. This discussion also presupposes a little knowledge of the DSpace curation system, which is described here: CurationSystem. We will describe a concrete situation facing a repository data curator, and introduce each task as the need arises. We will also describe some of the technical configuration details to enable these tasks.

Prerequisites

To use the code described here, you will need a build of DSpace that supports both curation and AIPs. See CurationSystem for a link to a code branch that fulfills these requirements. You will also need a 'jar' of the replication task code, which must be placed in /dspace/lib. There must also be the replication configuration file (replicate.cfg) in /dspace/config/modules. Leave all values defaulted for now.

Problem Statement

We can suppose our data curator has identified a collection of items in her DSpace repository consisting of high-value, born-digital, and unique/irreplaceable (not held elsewhere) content. She prudently wishes to insure against catastrophic local loss of this content by keeping a copy or replica of this collection elsewhere. She'd prefer to replicate all her DSpace content, but realizes that storage costs over long periods have made her administration wary, so she decides to begin with this collection.

First Steps - Estimation

In order to budget for replication storage, she needs to know the 'size' of the collection. When she asks her sysadmin, he replies that it is easy to give her figures for the whole asset store, but since collections aren't stored separately, she would have to add up each item's bitstreams in the collection, a rather tedious process. Thus the first task: a reporting tool which operates on natural DSpace objects, rather than storage volumes.

To install this task, edit /dspace/config/modules/curate.cfg (NB: all curation configuration is 'modular' in the sense that the configuration properties live outside of dspace.cfg, in named files. This means that if a given suite of tasks is unused, its configuration is never installed). First, add the task to the list of curation tasks:

plugin.named.org.dspace.curate.CurationTask = \
.... other curation tasks
    org.dspace.ctask.replicate.EstimateAIPSize = estaipsize

Next, in the same file, add this task to the list that appears in the administrative UI:

ui.tasknames = \
.... other tasks
    estaipsize = Estimate Storage for AIPS

Of course, both the name of the task ('estaipsize'), and the language for the UI are up to you. Now the curator can navigate to her collection, select the 'curate' tab, and then from the dropdown list of tasks choose the entry, and perform the task. On the page, the results will display:

ID: 123456789/1 (Amazing Images) estimated AIP size: 4 gigabytes

The estimates from this task are rather crude, in that they do not measure the actual AIPs, but just the bitstreams (and so ignore the metadata XML), but they should be fine for storage costing and allocation purposes.
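The estimate described here amounts to summing bitstream sizes over every item and ignoring packaging overhead. A minimal sketch in Python, assuming a hypothetical in-memory model in which a collection maps item handles to lists of bitstream sizes in bytes (the function names and the data are illustrative, not the actual EstimateAIPSize code):

```python
def estimate_aip_size(collection):
    """Sum bitstream sizes (in bytes) over all items; metadata XML is ignored."""
    return sum(sum(sizes) for sizes in collection.values())


def humanize(num_bytes):
    """Render a byte count in a coarse unit, good enough for budgeting."""
    units = (("gigabytes", 1024 ** 3), ("megabytes", 1024 ** 2), ("kilobytes", 1024))
    for unit, factor in units:
        if num_bytes >= factor:
            return f"{num_bytes / factor:.1f} {unit}"
    return f"{num_bytes} bytes"


# Hypothetical collection: item handle -> bitstream sizes in bytes
amazing_images = {
    "123456789/2": [3 * 1024 ** 3],
    "123456789/3": [512 * 1024 ** 2, 512 * 1024 ** 2],
}

total = estimate_aip_size(amazing_images)
print(f"estimated AIP size: {humanize(total)}")
```

Because only bitstreams are counted, the real AIPs will be somewhat larger than this figure once metadata and packaging are included.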

Replicating

Having secured approval to replicate the 'Amazing Images' collection, our curator needs a task to generate the AIP representations of each item in the collection, and transmit these archive files to the replication storage site (which may be service-backed, local, in the cloud, etc., as will be explored below). Adding this task is just like the previous step: editing the configuration properties into curate.cfg. (We won't repeat a description of this process each time, but note that you may always add a task and elect not to display it in the administrative UI.) This task is 'org.dspace.ctask.replicate.TransmitAIP'.

Since we are now working with AIPs, we should examine how they are configured to the tasks. Most configuration specific to the replication task suite is found at /dspace/config/modules/replicate.cfg. There are two main properties to set (or accept default values):

# Package type. Permitted values: 'mets', 'bagit'
packer.pkgtype = mets
# Format of package compression. Permitted values: 'zip' or 'tgz'
# for 'mets' packages, only zip is supported
packer.archfmt = zip

The default values will create a METS-based AIP, compressed into a 'zip' archive. The other alternative supported by the replication task suite is Library of Congress 'BagIt' packaging, which may be compressed either into a 'zip' file or a 'tgz' ('gzipped tar') file, a compression format more common on Unix systems.
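The permitted combinations of the two properties can be captured in a small lookup table. The following Python sketch only illustrates the rule stated in the configuration comments (METS supports zip only; the other package type supports zip or tgz). The function name and error messages are hypothetical and are not part of the replication code:

```python
# Allowed archive formats per package type, as described in replicate.cfg
ALLOWED_FORMATS = {
    "mets": {"zip"},
    "bagit": {"zip", "tgz"},
}


def check_packer_config(pkgtype="mets", archfmt="zip"):
    """Validate a packer.pkgtype / packer.archfmt pair; defaults mirror the file."""
    if pkgtype not in ALLOWED_FORMATS:
        raise ValueError(f"unknown package type: {pkgtype}")
    if archfmt not in ALLOWED_FORMATS[pkgtype]:
        raise ValueError(f"'{pkgtype}' packages cannot use '{archfmt}' compression")
    return pkgtype, archfmt


print(check_packer_config())                # the defaults
print(check_packer_config("bagit", "tgz"))  # the alternative packaging
```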

Our data curator may elect to perform this task in the admin GUI, or, if the collection is rather large, she may instead 'queue' the task for later execution by using the queueing facility available in the curation system. We should note that the 'transmitAIP' task, like all other replication tasks, operates on whatever DSpace object it is given. Thus, if the object is a collection, the task creates (and transmits, of course) an AIP for the collection object itself (metadata and logo), as well as AIPs for each item in the collection. If the task is given an identifier for a single Item, then only one AIP will be created.
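The way a task fans out from a container to its members can be pictured as a simple depth-first walk. This Python sketch assumes a hypothetical mapping from a container's handle to the handles of its members; it illustrates the behavior described above and is not the curation system's own traversal code:

```python
def objects_to_package(handle, children):
    """Return the given handle followed by all of its descendants, depth first."""
    result = [handle]
    for child in children.get(handle, []):
        result.extend(objects_to_package(child, children))
    return result


# Hypothetical structure: one collection containing two items
children = {
    "123456789/1": ["123456789/2", "123456789/3"],
}

# A collection yields one AIP for itself plus one per item
print(objects_to_package("123456789/1", children))
# A single item yields exactly one AIP
print(objects_to_package("123456789/2", children))
```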

Verifying Replication

While the transmitAIP task will report on whether or not it was successful in generating and transmitting AIP(s) to the replication service, our data curator wants the ability (within DSpace, not by using the replication service tools or UIs) to check whenever she likes that the AIP(s) which were transmitted are still there. A simple task, 'org.dspace.ctask.replicate.VerifyAIP', can perform this function.

Ensuring Replica Integrity and Accuracy over time

The 'Amazing Images' collection is comparatively static, meaning that few new items are likely to be added, and most of the metadata in each item is not routinely changed. However, over longer periods of time, cataloging errors are discovered and corrected, or formats become obsolete and new bitstreams are added. If the curator is fastidious about each change, and performs the 'transmitAIP' task on each item that has changed, then in general the set of AIP replicas will always be 'in sync' with the repository. However, it is useful to have the means to ensure that the replicas agree with the repository without having to create and transmit entirely new ones. Thus the task: 'org.dspace.ctask.replicate.CompareWithAIP', which can also be thought of as a simple audit task. When performed on an Item, the task does the following:

  • generates an AIP for the Item (but does not transmit it)
  • computes a checksum on the local AIP
  • requests from the replication storage service a checksum for the replica AIP
  • compares the 2 checksums

The task will thus fail only if the checksums differ, which can only happen if some part of the Item (metadata or bitstream) itself differs. If the version of the item that is believed to be authentic is the repository (local) one, then a simple performance of 'transmitAIP' task on the item will restore synchrony. For collections and communities, this task also does an 'extent' comparison, which means that it will determine whether the replica store has an AIP for every item known (locally) to be in the collection or community.
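The checksum comparison and the 'extent' comparison can both be sketched in a few lines of Python. The checksum algorithm used here (SHA-256) is an assumption made for illustration, since the text does not say which one the task or the storage service uses, and the function names are hypothetical:

```python
import hashlib


def checksum(data: bytes) -> str:
    """Checksum of an AIP's bytes (SHA-256 is an assumed algorithm)."""
    return hashlib.sha256(data).hexdigest()


def replica_matches(local_aip: bytes, replica_checksum: str) -> bool:
    """True when the locally generated AIP has the same checksum as the replica."""
    return checksum(local_aip) == replica_checksum


def missing_from_replica(local_handles, replica_handles):
    """Extent comparison: handles known locally that have no replica AIP."""
    return sorted(set(local_handles) - set(replica_handles))


local_aip = b"bytes of a freshly generated AIP"
stored = checksum(local_aip)  # stands in for the value the service reports

print(replica_matches(local_aip, stored))               # unchanged item
print(replica_matches(local_aip + b" edited", stored))  # item changed locally
print(missing_from_replica(["1/2", "1/3", "1/4"], ["1/2", "1/4"]))
```

Comparing checksums avoids downloading the replica AIP: only a short digest crosses the network.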

Repairing Damage

The AIPs in the replica store represent an insurance policy, and when 'claims' against that policy are filed, they can cover 2 situations: either the repository object is completely missing, and we want to restore it, or it is damaged and we want to repair the damage with data from the replica store AIP. A pair of replication tasks performs these functions. The first, 'org.dspace.ctask.replicate.RecoverFromAIP', will do the following:

  • fetch the replica store AIP for the identifier given to the task
  • decompress it and create a new DSpace object
  • install the object into the repository, including restoring its state (withdrawn, embargoed, etc)

This task will fail if there is already an object in the repository bearing the identifier given. By contrast, the task 'org.dspace.ctask.replicate.ReplaceWithAIP' (the 'repair' task) expects an existing repository object, and will fail if it does not find one. This task simply 'overlays' the metadata and bitstreams of the AIP version onto the existing record.
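The two tasks differ mainly in their precondition. The Python sketch below models the repository and the replica store as plain dictionaries keyed by identifier, which is a deliberate simplification; the function names and record fields are hypothetical:

```python
def recover_from_aip(repository, replica_store, handle):
    """Restore a missing object; fails if the identifier is already in use."""
    if handle in repository:
        raise RuntimeError(f"{handle} already exists; use the replace task")
    repository[handle] = dict(replica_store[handle])


def replace_with_aip(repository, replica_store, handle):
    """Repair an existing object by overlaying the replica's content on it."""
    if handle not in repository:
        raise RuntimeError(f"{handle} not found; use the recover task")
    repository[handle].update(replica_store[handle])


replica_store = {"123456789/2": {"title": "Sunset", "state": "withdrawn"}}
repository = {}

recover_from_aip(repository, replica_store, "123456789/2")  # object was missing
repository["123456789/2"]["title"] = "Sunst"                 # simulate damage
replace_with_aip(repository, replica_store, "123456789/2")  # repair it
print(repository["123456789/2"])
```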

Cleanup

Ordinarily, a replication arrangement is long standing: the preservation function cannot be fulfilled unless the replicas (here, the AIPs) are always kept and available. However, some collections (or items within them) may be removed for a variety of reasons: legal challenge, de-accession, etc. When the repository no longer wants to hold the object locally, the replica AIP ceases to have value. The task 'org.dspace.ctask.replicate.RemoveAIP' will delete the replica store AIP for its identifier. As with other replication tasks, if the identifier points to a collection or community, all the AIPs of all the members will also be deleted.

Keeping Score

Many storage providers have cost structures that are more complex than simple functions of the total stored bytes: cloud providers in particular have costs associated with the use of the network to upload and download the stored object. An object that occupies 2 megabytes might cost far more over time than a 1 gigabyte object, if the former is downloaded 1000 times for every time the latter is. The replication system provides a very rudimentary task to help manage and track these factors: 'org.dspace.ctask.replicate.ReadOdometer'. This task simply displays the readings from the replication system that record cumulative use. The statistics are:

  • total number of objects (AIPs, typically) in the replica store
  • total size of all objects
  • total number of bytes downloaded from the store
  • total number of bytes uploaded to the store

These figures can be used as a means of checking and validating service charges from storage providers.
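To make the earlier 2 megabyte versus 1 gigabyte comparison concrete, the following Python sketch keeps the four counters listed above and replays that scenario. The class and method names are hypothetical and decimal units are assumed; this is not the ReadOdometer implementation:

```python
from dataclasses import dataclass

MB = 10 ** 6
GB = 10 ** 9


@dataclass
class Odometer:
    """Cumulative usage counters for a replica store (illustrative only)."""
    object_count: int = 0
    stored_bytes: int = 0
    downloaded_bytes: int = 0
    uploaded_bytes: int = 0

    def record_upload(self, size):
        self.object_count += 1
        self.stored_bytes += size
        self.uploaded_bytes += size

    def record_download(self, size):
        self.downloaded_bytes += size


small = Odometer()
small.record_upload(2 * MB)
for _ in range(1000):          # downloaded 1000 times
    small.record_download(2 * MB)

large = Odometer()
large.record_upload(1 * GB)
large.record_download(1 * GB)  # downloaded once

print(small.downloaded_bytes / GB, "GB downloaded for the 2 MB object")
print(large.downloaded_bytes / GB, "GB downloaded for the 1 GB object")
```

The small object stores 500 times fewer bytes but moves twice as many over the network, which is why transfer counters matter alongside the stored size.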

Automation

While the coordinated use of the tasks described above can provide the basis for a solid replication strategy and practice, there are several processes that could necessitate a fair amount of curatorial work. For example, in the discussion on ensuring integrity of AIPs over time, we remarked that vigilance was required by the curator to transmit new AIPs whenever Items change. It is possible to leverage existing facilities in DSpace to substantially reduce this effort through automation.

The replication code includes a so-called 'event consumer', that can 'listen for' any changes to objects in the repository. Event consumers are documented elsewhere, but all we need to do to activate this consumer is add it to the list of consumers (in dspace.cfg):

#### Event System Configuration ####

# default synchronous dispatcher (same behavior as traditional DSpace)
event.dispatcher.default.class = org.dspace.event.BasicDispatcher
event.dispatcher.default.consumers = search, browse, eperson, harvester, replicate
....
# consumer to manage content replication
event.consumer.replicate.class = org.dspace.ctask.replicate.ReplicateConsumer
event.consumer.replicate.filters = Community|Collection|Item+Install|Modify|Modify_Metadata|Delete
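The filter value on the last line can be read as a set of object types, a '+' separator, and a set of event types. That reading is inferred from the example value itself and should be treated as an assumption; the Python sketch below shows how such a filter would select events, using hypothetical function names:

```python
FILTER = "Community|Collection|Item+Install|Modify|Modify_Metadata|Delete"


def parse_filter(spec):
    """Split a filter into (object types, event types), assuming 'objects+events'."""
    object_part, event_part = spec.split("+", 1)
    return set(object_part.split("|")), set(event_part.split("|"))


def matches(spec, object_type, event_type):
    """True when an event on the given object type would reach the consumer."""
    objects, events = parse_filter(spec)
    return object_type in objects and event_type in events


print(matches(FILTER, "Item", "Modify"))       # an item was changed
print(matches(FILTER, "Bitstream", "Modify"))  # object type not listed
print(matches(FILTER, "Item", "Add"))          # event type not listed
```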