...

The 'Amazing Images' collection is comparatively static: few new items are likely to be added, and most of the metadata in each item is not routinely changed. Over longer periods, however, cataloging errors are discovered and corrected, or formats become obsolete and new bitstreams are added. If the curator is fastidious about each change and performs the 'transmitaip' task on every item that has changed, then the set of AIP replicas will generally stay 'in sync' with the repository. Even so, it is useful to have a means of ensuring that the replicas agree with the repository without having to create and transmit entirely new ones. This is the purpose of the task 'org.dspace.ctask.replicate.CompareWithAIP', which can also be thought of as a simple audit task. When performed on an Item, the task does the following:

  1. generates an AIP for the DSpace object locally (but does not transmit it)
  2. computes an MD5 checksum on the local AIP
  3. requests from the replication storage service an MD5 checksum for the AIP in storage
  4. compares the two checksums

The task thus fails only if the checksums differ, which can happen only if some part of the DSpace object (metadata or bitstream) itself differs. If the repository (local) version of the item is believed to be the authentic one, then simply performing the 'transmitaip' task on the item will restore synchrony. For collections and communities, this task also does an 'extent' comparison: it determines whether the replica store has an AIP for every item known (locally) to be in the collection or community.
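The checksum and extent comparisons described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the code of the replication task itself: the function names are hypothetical, the local AIP is assumed to be an ordinary file on disk, and the checksum reported by the storage service is assumed to arrive as a hex string.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_item(local_aip_path, remote_checksum):
    """Steps 2-4: checksum the local AIP and compare it with the stored value.

    remote_checksum is whatever the storage service reported (assumed hex).
    """
    return md5_of_file(local_aip_path) == remote_checksum.lower()


def missing_from_store(local_item_ids, stored_aip_ids):
    """Extent check: identifiers known locally that have no AIP in the store."""
    return sorted(set(local_item_ids) - set(stored_aip_ids))
```

An item passes the audit when `audit_item` returns `True`; a collection or community additionally passes the extent check when `missing_from_store` returns an empty list.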

...

The "Restore" (restorefromaip) task will do the following:

  1. fetch the replica store AIP for the given object identifier

...

  1. decompress it and create a new DSpace object
  2. install the object into the repository, including restoring its state (withdrawn, embargoed, etc.)
  3. if the object is a collection or community, all child objects (e.g. items) will also have their AIP fetched, decompressed and restored

NOTE: This restorefromaip task will fail if there is already an object in the repository bearing the identifier given.

If you are using METS-based AIPs, two additional restoration tasks are available:

  • Restore Single Object from AIP (restoresinglefromaip)
    • This task acts the same as the default "restorefromaip" task, but it does NOT restore any child objects. So, if it is run on a collection, just the collection itself will be restored (items in that collection will not be restored).
  • Restore Missing Object(s) but Keep Existing Objects (restorekeepexisting)
    • This task acts similarly to the default "restorefromaip" task, but it attempts to skip over any objects which already exist in the repository. In other words, an error is not thrown if an object already exists; instead, that entire object (and all its child objects) is skipped during processing and left unchanged. This mode is identical to the "Keep Existing" mode of the DSpace AIP Backup and Restore tool.
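The difference between the three restore tasks can be illustrated with a toy model. Everything here is hypothetical: the repository and replica store are plain dictionaries, `restore` is not a DSpace function, and the handling of an already-existing child during a default restore (an error) is a simplifying assumption. Under those assumptions, "restorefromaip" corresponds to the default arguments, "restoresinglefromaip" to `recursive=False`, and "restorekeepexisting" to `keep_existing=True`.

```python
class ObjectExistsError(Exception):
    """Raised when a restore targets an identifier already in the repository."""


def restore(repository, replica_store, identifier,
            recursive=True, keep_existing=False):
    """Toy model of the restore modes; returns the identifiers restored.

    repository:    dict identifier -> object data (mutated in place)
    replica_store: dict identifier -> {"data": ..., "children": [identifiers]}
    """
    if identifier in repository:
        if keep_existing:
            # Skip this object and, with it, all of its children.
            return []
        raise ObjectExistsError(identifier)
    aip = replica_store[identifier]
    repository[identifier] = aip["data"]
    restored = [identifier]
    if recursive:
        for child in aip["children"]:
            restored += restore(repository, replica_store, child,
                                recursive, keep_existing)
    return restored
```

For example, if a collection is missing locally but one of its two items still exists, calling `restore` with `keep_existing=True` restores the collection and the missing item and leaves the surviving item untouched.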

Replacing Object(s)

By contrast, the task 'org.dspace.ctask.replicate.ReplaceWithAIP' (the 'repair' task) expects an existing repository object and will generally fail if it does not find one (the note on METS-based AIPs further down describes the exception). This task simply 'overlays' the metadata and bitstreams of the AIP version onto the existing record.

...

The "Replace" (replacewithaip) task expects to replace an existing DSpace object. This task will do the following:

  1. fetch the replica store AIP for the given DSpace Object
  2. decompress it
  3. locate the existing DSpace object to be replaced and clear out all its existing metadata, files, access rights, etc.
  4. replace the existing DSpace object metadata, files, access rights, etc. with the information found in the AIP (thus "overlaying" or replacing all information in the existing object)
  5. if the object is a collection or community, all child objects (e.g. items) will also have their AIP fetched, decompressed and existing objects replaced

NOTE: When using BagIt-based AIPs, this task will fail if the DSpace object is not found or no longer exists. When using METS-based AIPs, this task will instead perform a restoration of any DSpace object that is not found or no longer exists.

If you are using METS-based AIPs, an additional replacement task is available:

  • Replace Single Object from AIP (replacesinglewithaip)
    • This task acts the same as the default "replacewithaip" task, but it does NOT replace any child objects. So, if it is run on a collection, just the collection metadata will be replaced (items existing in that collection will not be replaced).
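As a rough illustration of the replace semantics, the following toy model treats the repository and the replica store as dictionaries. The function and its parameters are hypothetical, not part of DSpace: `restore_if_missing=False` stands in for the BagIt behaviour (fail when the object is absent), `restore_if_missing=True` for the METS behaviour (restore the missing object instead), and `recursive=False` for "replacesinglewithaip".

```python
def replace(repository, replica_store, identifier,
            recursive=True, restore_if_missing=False):
    """Toy model of 'replacewithaip'; returns the identifiers replaced.

    repository:    dict identifier -> metadata dict (mutated in place)
    replica_store: dict identifier -> {"data": dict, "children": [identifiers]}
    """
    if identifier not in repository and not restore_if_missing:
        raise LookupError(f"no existing object to replace: {identifier}")
    aip = replica_store[identifier]
    # Assigning a fresh copy both clears the old state and overlays the AIP,
    # so nothing from the previous local object survives (it is not a merge).
    repository[identifier] = dict(aip["data"])
    replaced = [identifier]
    if recursive:
        for child in aip["children"]:
            replaced += replace(repository, replica_store, child,
                                recursive, restore_if_missing)
    return replaced
```

The point of the model is the clear-then-overlay step: a field that exists only in the local object is gone after the replace, because the AIP version wholly supersedes it.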

Cleanup

Replication Task Used: Remove AIP(s) from Storage

Task ID: removeaip

Ordinarily, a replication arrangement is long-standing: the preservation function cannot be fulfilled unless the replicas (here, the AIPs) are always kept and available. However, some collections (or items within them) may be removed for a variety of reasons: legal challenge, de-accession, etc. When the repository no longer wants to hold the object locally, the replica AIP ceases to have value. The task 'org.dspace.ctask.replicate.RemoveAIP' will delete the replica store AIP for its identifier. As with other replication tasks, if the identifier points to a collection or community, the AIPs of all its members will also be deleted.

Keeping Score

Replication Task Used: Read Odometer

Task ID: readodometer

Many storage providers have cost structures that are more complex than simple functions of the total stored bytes; cloud providers in particular charge for use of the network to upload and download stored objects. An object that occupies 2 megabytes might cost far more over time than a 1 gigabyte object if the former is downloaded 1000 times for every download of the latter. The replication system provides a very rudimentary task to help manage and track these factors: 'org.dspace.ctask.replicate.ReadOdometer'. This task simply displays the readings from the replication system that record cumulative use. The statistics are:

  • total number of objects (AIPs, typically) in the replica store
  • total size of all objects
  • total number of bytes downloaded from the store
  • total number of bytes uploaded to the store

These figures can be used as a means of checking and validating service charges from storage providers.
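To make the cost argument concrete, here is a small sketch that turns the four odometer readings into an estimated charge. The `Odometer` class, the `estimate_charge` function, and the per-gigabyte rates are all hypothetical; real providers publish their own rates and usually bill storage per unit of time, which this one-period model ignores.

```python
from dataclasses import dataclass

GB = 10 ** 9  # decimal gigabyte, in bytes


@dataclass
class Odometer:
    """The four cumulative readings listed above."""
    object_count: int
    stored_bytes: int
    downloaded_bytes: int
    uploaded_bytes: int


def estimate_charge(reading, storage_per_gb, download_per_gb, upload_per_gb):
    """Estimate a charge for one billing period from hypothetical rates."""
    return (reading.stored_bytes * storage_per_gb
            + reading.downloaded_bytes * download_per_gb
            + reading.uploaded_bytes * upload_per_gb) / GB


# A 2 MB object downloaded 1000 times versus a 1 GB object downloaded once.
small = Odometer(1, 2 * 10 ** 6, 1000 * 2 * 10 ** 6, 2 * 10 ** 6)
large = Odometer(1, GB, GB, GB)

# Hypothetical rates: storage 0.02, download 0.10, upload free (per GB).
small_cost = estimate_charge(small, 0.02, 0.10, 0.0)
large_cost = estimate_charge(large, 0.02, 0.10, 0.0)
```

Under these assumed rates the small object comes to roughly 0.20 and the large one to 0.12, because 2 GB of download traffic outweighs 1 GB of storage plus 1 GB of download.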

NOTE (more information on where Odometer statistics are kept): The odometer statistics are stored in a small text file located at [base.dir]/odometer, where [base.dir] is the value of the base.dir setting in your [dspace]/config/modules/replicate.cfg configuration file. Should you ever need to reset your odometer, you can do so by moving or removing this existing odometer file.
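If you would rather keep the old readings than delete them, a reset can be done by moving the file aside. The helper below is a hypothetical sketch, not part of DSpace; it assumes only what is stated above, namely that the file is called odometer and lives directly under the configured base.dir, which you pass in yourself.

```python
import shutil
import time
from pathlib import Path


def archive_odometer(base_dir, suffix=None):
    """Move [base.dir]/odometer aside and return the new path.

    Returns None when there is no odometer file to move. The suffix
    defaults to a timestamp so that earlier archives are not overwritten.
    """
    odometer = Path(base_dir) / "odometer"
    if not odometer.exists():
        return None
    suffix = suffix or time.strftime("%Y%m%d-%H%M%S")
    target = odometer.with_name(f"odometer.{suffix}")
    shutil.move(str(odometer), str(target))
    return target
```

Calling `archive_odometer("/path/to/base.dir")` leaves a dated copy next to the original location, which can be kept for reconciling earlier invoices.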

Automation

While the coordinated use of the tasks described above can provide the basis for a solid replication strategy and practice, there are several processes that could necessitate a fair amount of curatorial work. For example, in the discussion on ensuring integrity of AIPs over time, we remarked that vigilance was required by the curator to transmit new AIPs whenever Items change. It is possible to leverage existing facilities in DSpace to substantially reduce this effort through automation.

...