
Replication Task Suite

One current application of the curation system is a related set (suite) of tasks to assist in performing replication of DSpace content to other locations. The content is packaged in containers known as AIPs (OAIS speak: 'archival information packages'). You can read much more about how AIPs are constituted here: AipBackupRestore, and as of DSpace 1.7, support for generating AIPs will be included. This discussion also presupposes a little knowledge of the DSpace curation system, which is described here: CurationSystem. We will describe a concrete situation facing a repository data curator, and introduce each task as the need arises. We will also describe some of the technical configuration details to enable these tasks.

Prerequisites

To use the code described here, you will need a build of DSpace that supports both curation and AIPs. See CurationSystem for a link to a code branch that fulfills these requirements. You will also need a 'jar' of the replication task code, which must be placed in /dspace/lib. The replication configuration file (replicate.cfg) must also be present in /dspace/config/modules. Leave all values at their defaults for now.

Problem Statement

Suppose our data curator has identified a collection of items in her DSpace repository consisting of high-value, born-digital, and unique/irreplaceable (not held elsewhere) content. She prudently wishes to insure against catastrophic local loss of this content by keeping a copy or replica of this collection elsewhere. She'd prefer to replicate all her DSpace content, but realizes that storage costs over long periods have made her administration wary, so she decides to begin with this collection.

First Steps - Estimation

In order to budget for replication storage, she needs to know the 'size' of the collection. When she asks her sysadmin, he replies that it is easy to give her figures for the whole asset store, but since collections aren't stored separately, she would have to add up each item's bitstreams in the collection, a rather tedious process. Thus the first task: a reporting tool which operates on natural DSpace objects, rather than storage volumes.

To install this task, edit /dspace/config/modules/curate.cfg. (NB: all curation configuration is 'modular' in the sense that the configuration properties live outside of dspace.cfg, in named files. This means that if a given suite of tasks is unused, its configuration is never installed.) First, add the task to the list of curation tasks:

plugin.named.org.dspace.curate.CurationTask = \
.... other curation tasks
    org.dspace.ctask.replicate.EstimateAIPSize = estaipsize

Next, in the same file, add this task to the list that appears in the administrative UI:

ui.tasknames = \
.... other tasks
    estaipsize = Estimate Storage for AIPS

Of course, both the name of the task ('estaipsize') and the language for the UI are up to you. Now the curator can navigate to her collection, select the 'curate' tab, choose the 'Estimate Storage for AIPS' entry from the dropdown list of tasks, and perform the task. The results will display on the page:

ID: 123456789/1 (Amazing Images) estimated AIP size: 4 gigabytes

The estimates from this task are rather crude, in that they do not measure the actual AIPs, only the bitstreams (and so ignore the metadata XML), but they should be fine for storage costing and allocation purposes.
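The kind of calculation described above can be sketched as follows. This is only an illustration of summing bitstream sizes per item across a collection; it is not the code of org.dspace.ctask.replicate.EstimateAIPSize and does not use the DSpace API. The types Bitstream, Item and CollectionInfo, the handles, the file names and the byte counts are all hypothetical stand-ins, and the sketch assumes Java 16 or later (for records).

```java
import java.util.List;

public class AipSizeEstimate {
    // Hypothetical stand-ins for DSpace objects; these are NOT DSpace API classes.
    public record Bitstream(String name, long sizeBytes) {}
    public record Item(String handle, List<Bitstream> bitstreams) {}
    public record CollectionInfo(String handle, String name, List<Item> items) {}

    // Sum the sizes of every bitstream of every item in the collection.
    // Metadata XML and packaging overhead are ignored, so this is a rough estimate.
    public static long estimateBytes(CollectionInfo collection) {
        long total = 0L;
        for (Item item : collection.items()) {
            for (Bitstream bitstream : item.bitstreams()) {
                total += bitstream.sizeBytes();
            }
        }
        return total;
    }

    // Build a one-line report in the spirit of the task output shown above,
    // reporting raw bytes rather than a rounded unit.
    public static String report(CollectionInfo collection) {
        return "ID: " + collection.handle() + " (" + collection.name()
                + ") estimated AIP size: " + estimateBytes(collection) + " bytes";
    }

    public static void main(String[] args) {
        // Made-up sample data for illustration only.
        Item first = new Item("123456789/2", List.of(
                new Bitstream("image-a.tif", 1000L),
                new Bitstream("image-b.tif", 2500L)));
        Item second = new Item("123456789/3", List.of(
                new Bitstream("image-c.tif", 500L)));
        CollectionInfo images = new CollectionInfo(
                "123456789/1", "Amazing Images", List.of(first, second));

        System.out.println(report(images));
    }
}
```

Working from items and their bitstreams, rather than from the asset store volume, is what allows a figure to be reported for a single collection.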

Replicating

Having secured approval to replicate the 'Amazing Images' collection, our curator needs a task to generate the AIP representation of each item in the collection and transmit these archive files to the replication storage site (which may be service-backed, local, in the cloud, etc., as will be explored below). Adding this task is just like adding the previous one: add the configuration properties to curate.cfg. (We won't repeat a description of this process each time, but note that you may always add a task and elect not to display it in the administrative UI.) This task is 'org.dspace.ctask.replicate.TransmitAIP'.
