Unsupported Release

This documentation relates to an old, unsupported version of DSpace, version 1.7.x.

As of January 2014, the DSpace 1.7.x platform is no longer supported. We recommend upgrading to a more recent version of DSpace.

DuraCloud Backup & Restore Prototype for DSpace 1.6

Background & Overview

This work comes out of the requirements for DSpace integration with DuraCloud (http://www.duracloud.org). One of those requirements is to be able to "back up" local DSpace contents into the cloud (as a type of offsite backup), and "restore" those contents at a later time.

Essentially, we'd like a way to be able to export the entire hierarchy (i.e. bitstreams, metadata and relationships between Communities/Collections/Items) into a relatively standard format (e.g. METS or similar structured packaging format). This entire hierarchy should also be able to be re-imported into DSpace in the same format, to allow for "roundtripping" of that content (essentially a restore of that content in the same or different DSpace installation).

Perceived benefits to DSpace community:

  • Would allow folks to more easily move entire Communities or Collections between DSpace instances.
  • Would allow for a potentially more consistent backup of this hierarchy (e.g. to DuraCloud, or just to your own local backup system), rather than relying on synchronizing a backup of your DB (metadata/relationships) and assetstore (bitstreams).
  • Would provide a way for people to more easily get their data out of DSpace (whatever the purpose may be).
  • Would provide a relatively standard format for people to migrate entire hierarchies (Communities/Collections) into DSpace (from another system).

Known Issues:

  • Exporting/Importing the Community/Collection/Item hierarchy technically doesn't cover all the "content" held in DSpace. There are also Groups, EPeople and permissions/rights (which would get you closer to a full export/import of all DSpace content). However, concentrating on just the hierarchy of Community/Collection/Item seems like a good first step.

This is related to (and a partial subset of) MIT's AipPrototype: http://jira.dspace.org/jira/browse/DS-465 . However, the AIP prototype currently does not make it very easy to re-import the exported AIPs for Communities or Collections. So, this feature would extend the AIP prototype's current packagers/crosswalks to allow for a full export and import of an entire DSpace hierarchy, or just a set of Communities, Collections or Items.

The current plan is to build off of the subset of the AipPrototype (essentially the packagers, crosswalks and related changes) which begins to allow for this roundtripping of Communities and Collections.

How does this work help DSpace interact with DuraCloud?

In this initial prototype, this work is entirely about exporting DSpace content objects to a location on a local filesystem. So, this work isn't specific to DuraCloud, and could be used with any backup storage system to back up your DSpace contents.

In the initial DuraCloud work, the DuraCloud team is working on a way to "synchronize" DuraCloud with a local file folder. So, DuraCloud can be configured to "watch" a given folder and automatically replicate its contents into the cloud.

Therefore, moving content from DSpace to DuraCloud would currently be a two-step process:

  1. First, export AIPs describing that content from DSpace to a filesystem folder
  2. Second, enable DuraCloud to watch that same filesystem folder and replicate it into the cloud.

Similarly, moving content from DuraCloud back into DSpace would also be a two-step process:

  1. First, you'd tell DuraCloud to replicate the AIPs from the cloud to a folder on your file system
  2. Second, you'd ingest those AIPs back into DSpace

(These backup/restore processes may change as we go forward and investigate more use cases. This is just the initial plan.)
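
As a rough illustration, the first (export) step of the backup process could be wrapped in a small script like the one below. This is only a sketch: the function name, the DSPACE_BIN and DRY_RUN variables, and the watched folder path are assumptions made for this example, and the DuraCloud side (watching the folder) is configured separately and not shown. The packager flags are the ones described under "Exporting AIPs".

```shell
#!/bin/sh
# Sketch: export a DSpace object (and its children) into a folder
# that DuraCloud has been configured to watch.
# Hypothetical names: DSPACE_BIN, DRY_RUN, backup_to_watched_dir.

DSPACE_BIN="${DSPACE_BIN:-/dspace/bin/dspace}"

# usage: backup_to_watched_dir <eperson> <handle> <watched-dir>
backup_to_watched_dir() {
    eperson="$1"
    handle="$2"
    watched_dir="$3"
    # Name the top-level AIP after the handle, with "/" replaced by "-".
    top_aip="$watched_dir/aip-$(printf '%s' "$handle" | tr '/' '-').zip"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it.
        echo "$DSPACE_BIN packager -d -t AIP -e $eperson -i $handle -c $watched_dir $top_aip"
    else
        mkdir -p "$watched_dir" &&
        "$DSPACE_BIN" packager -d -t AIP -e "$eperson" -i "$handle" \
            -c "$watched_dir" "$top_aip"
    fi
}
```

Calling it as DRY_RUN=1 backup_to_watched_dir admin@myu.edu 4321/0 /path/to/watched-folder prints the packager command without running it, which is a handy way to check the paths first.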

Makeup and Definition of AIPs

AIPs are Archival Information Packages.

  • An AIP is a package describing one archival object.
    • The archival object may be an Item, Collection, or Community. Bitstreams are included in an Item's AIP.
    • Each AIP is logically self-contained and can be restored without the rest of the archive. (So you could restore a single Item, Collection or Community.)
    • AIP profile favors completeness and accuracy rather than presenting the semantics of an object in a standard format. It conforms to the quirks of DSpace's internal object model rather than attempting to produce a universally understandable representation of the object.
    • An AIP can serve as a DIP (Dissemination Information Package) or SIP (Submission Information Package), especially when transferring custody of objects to another DSpace implementation.
  • In contrast to SIP or DIP, the AIP should include all available DSpace structural and administrative metadata, and basic provenance information.
  • Restoration of an archive from AIPs is not perfectly complete at this time; it is intended to recover from catastrophic loss of content and metadata, not restore the exact same archive as before. Currently, some information (e.g. access controls, people, groups) would be lost, as they are not stored in the AIPs.

AIPs Structure

Generally speaking, an AIP is a Zip file containing a METS manifest and all related content bitstreams.

Some examples include:

  • Site AIP (Sample: aip0-site.zip)
    • METS contains basic metadata about DSpace Site and persistent IDs referencing all Top Level Communities
  • Community AIP (Sample: COMMUNITY@123456789-1.zip)
    • METS contains all metadata for Community and persistent IDs referencing all members (SubCommunities or Collections). Package may also include a Logo file, if one exists.
  • Collection AIP (Sample: COLLECTION@123456789-2.zip)
    • METS contains all metadata for Collection and persistent IDs referencing all members (Items). Package may also include a Logo file, if one exists.
  • Item AIP (Sample: ITEM@123456789-8.zip)
    • METS contains all metadata for Item and references to all Bundles and Bitstreams. Package also includes all Bitstream files.

Notes:

  • Bitstreams and Bundles are second-class archival objects; they are recorded in the context of an Item.
  • BitstreamFormats are not even second-class; they are described implicitly within Item technical metadata, and reconstructed from that during restoration

What is NOT in AIPs

  • DSpace Groups, EPeople and Policies (access rights) are currently not described in AIPs. However, there is hope to include them in a future version.
  • DSpace Site configurations ([dspace]/config/ directory) or customizations are not described in AIPs
  • DSpace Database model (or customizations therein) is not described in AIPs

Where to get the Code

There is an SVN sandbox area for this work (so that others can help out, if it interests them). If anyone has comments, suggestions or feedback on this idea, or would like to be involved in this project, definitely let me know (or add comments to this wiki page).

 svn co http://scm.dspace.org/svn/repo/sandbox/aip-external-1_6-prototype/ 

What code has really changed?

The majority of the code changes are in two main areas:

  1. org.dspace.content.packager.* - Packager classes
    • DSpaceAIPDisseminator - Disseminates/Exports AIP(s)
    • DSpaceAIPIngester - Ingests exported AIP(s)
    • Changes were also made to refactor / enhance the AbstractMETSDisseminator, AbstractMETSIngester, and METSManifest classes
  2. org.dspace.content.crosswalk.*
    • AIPDIMCrosswalk - Crosswalks DIM metadata for AIPs
    • AIPTechMDCrosswalk - Crosswalks METS TechMD sections for AIPs
    • There were also changes to the MODSDisseminationCrosswalk and XSLTDisseminationCrosswalk to support creating "Site" AIPs

Running the Code

Here's how to get up and running relatively quickly!

Install Prototype

  1. Download the code from the SVN Sandbox (see above).
  2. Build & Install the prototype. This is just a modified version of DSpace 1.6.0 – so, follow the normal DSpace 1.6.0 Installation procedure.
    • If you have a DSpace 1.6.0 instance already running, you can just build the code and point it at your existing DSpace 1.6.0 database & assetstore.

You'll want to have some content (Communities, Collections & Items) to test with!

Exporting AIPs

There are two main "modes" you can run the AIP packager in:

  • Single AIP (default) - Exports just an AIP describing a single DSpace object. So, if you ran it in this default mode for a Collection, you'd just end up with a single Collection AIP (which would not include AIPs for all its child Items)
  • Hierarchy (including child objects) - Exports the requested AIP describing an object, plus the AIP for all child objects. Some examples follow:
    • For a Site - this would export all Communities, Collections & Items within the site into AIP files (in a provided directory)
    • For a Community - this would export that Community and all SubCommunities, Collections and Items into AIP files (in a provided directory)
    • For a Collection - this would export that Collection and all contained Items into AIP files (in a provided directory)
    • For an Item – this just exports the Item into an AIP as normal (as it already contains its Bitstreams/Bundles by default)

Exporting just a single AIP

To export in single AIP mode (default), use this 'packager' command template:

 /dspace/bin/dspace packager -d -t AIP -e <eperson> -i <handle> <file-path>

for example:

 /dspace/bin/dspace packager -d -t AIP -e admin@myu.edu -i 4321/4567 aip4567.zip

The above code will export the object of the given handle (4321/4567) into an AIP file named "aip4567.zip". This will not include any child objects for Communities or Collections.

Exporting AIP Hierarchy

To export an AIP hierarchy, use the -c (or --includeChildren) package parameter.

For example, use this 'packager' command template:

 /dspace/bin/dspace packager -d -t AIP -e <eperson> -i <handle> \
                             -c <child-dir-path> <file-path>

for example:

 /dspace/bin/dspace packager -d -t AIP -e admin@myu.edu -i 4321/4567 \
                             -c /path/to/children-aips/ aip4567.zip

The above code will export the object of the given handle (4321/4567) into an AIP file named "aip4567.zip". In addition it would export all children objects to a directory at the path "/path/to/children-aips/". The child AIPs are all named using the following format:

  • File Name Format: <Obj-Type>@<Handle-with-dashes>.zip
    • e.g. COMMUNITY@123456789-1.zip, COLLECTION@123456789-2.zip, ITEM@123456789-200.zip
    • This general file naming convention ensures that you can easily locate an object to restore by its name (assuming you know its Object Type and Handle).
  • Alternatively, if object doesn't have a Handle, it uses this File Name Format: <Obj-Type>@internal-id-<DSpace-ID>.zip (e.g. ITEM@internal-id-234.zip)
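
To make the naming convention concrete, here is a minimal sketch of it in shell. The function names are hypothetical helpers for this example (they are not part of DSpace), and the sketch assumes that "Handle-with-dashes" simply means replacing the "/" in the Handle with "-", as in the examples above.

```shell
#!/bin/sh
# Sketch of the child-AIP naming convention described above.
# All function names here are hypothetical, not DSpace commands.

# usage: aip_filename <Obj-Type> <handle>
aip_filename() {
    printf '%s@%s.zip\n' "$1" "$(printf '%s' "$2" | tr '/' '-')"
}

# usage: aip_filename_internal <Obj-Type> <DSpace-ID>
# For objects that have no Handle.
aip_filename_internal() {
    printf '%s@internal-id-%s.zip\n' "$1" "$2"
}

# usage: aip_object_type <file-name>
# Prints the part of the file name before the "@".
aip_object_type() {
    name="$(basename "$1")"
    printf '%s\n' "${name%%@*}"
}
```

For example, aip_filename COLLECTION 123456789/2 prints COLLECTION@123456789-2.zip, and aip_object_type /path/to/children-aips/ITEM@internal-id-234.zip prints ITEM.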

Exporting Entire Site

To export an entire DSpace Site, pass the packager the Handle <site-handle-prefix>/0. For example, if your site prefix is "4321", you'd run a command similar to the following:

 /dspace/bin/dspace packager -d -t AIP -e admin@myu.edu -i 4321/0 \
                             -c /path/to/children-aips/ sitewide-aip.zip

Again, this would export the DSpace Site AIP into the file "sitewide-aip.zip", and export AIPs for all Communities, Collections and Items into the "/path/to/children-aips" directory.
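
After a site-wide export, one quick sanity check is to count the exported AIP files per object type and compare the numbers with what you expect from your repository. The sketch below relies only on the file naming convention described earlier; count_aips is a hypothetical helper written for this example, not a DSpace command.

```shell
#!/bin/sh
# Sketch: count exported child AIPs per object type in a directory,
# relying on the <Obj-Type>@... .zip naming convention.
# count_aips is a hypothetical helper, not a DSpace command.

# usage: count_aips <child-dir-path>
count_aips() {
    dir="$1"
    for type in COMMUNITY COLLECTION ITEM; do
        n=0
        for f in "$dir/$type"@*.zip; do
            # Skip the unexpanded pattern when nothing matches.
            [ -e "$f" ] && n=$((n + 1))
        done
        echo "$type $n"
    done
}
```

Running count_aips /path/to/children-aips prints one line per object type, each with the type name followed by the number of matching AIP files.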

Ingesting AIPs

Again, like export, there are two main "modes" you can run the AIP packager in:

  • Single AIP (default) - Ingests just an AIP describing a single DSpace object. So, if you ran it in this default mode for a Collection AIP, you'd just create a DSpace Collection from the AIP (but not ingest any of its child objects)
  • Hierarchy (including child objects) - Ingests the requested AIP describing an object, plus the AIP for all child objects. Some examples follow:
    • For a Site - this would create all Communities, Collections & Items based on the located AIP files
    • For a Community - this would create that Community and all SubCommunities, Collections and Items based on the located AIP files
    • For a Collection - this would create that Collection and all contained Items based on the located AIP files
    • For an Item – this just creates the Item (including all Bitstreams & Bundles) based on the AIP file.

Ingesting just a Single AIP

To ingest a single AIP and create a new DSpace object under a parent of your choice, add the ignoreParent and ignoreHandle package parameters to the command. Also, note that you are running the packager in -s (submit) mode.

NOTE: This only ingests the single AIP specified. It does not ingest all children objects.

 /dspace/bin/dspace packager -s -t AIP -e <eperson> -p <parent-handle> -o ignoreParent=true -o ignoreHandle=true <file-path>

If you leave out these package-parameter options, the AIP package ingester will attempt to install the AIP under the parent handle it had before, and give it back its original Handle. After all, the point of AIPs was to reproduce the exact object that was exported. When you are effectively using the AIP as a SIP, however, you may not want it back under the same parent or handle, so there is a way to override these features.

Ingesting an AIP Hierarchy

To ingest an AIP hierarchy from a directory of AIPs, use the -c (or --includeChildren) package parameter. In addition, as this is not a restore, you'd want to specify the -o ignoreParent=true parameter (ignores Parent Object information contained in the package) and the -o ignoreHandle=true parameter (ignores handle in package, and a new handle is assigned on ingest).

For example, use this 'packager' command template:

 /dspace/bin/dspace packager -s -t AIP -e <eperson> -p <parent-handle> -o ignoreParent=true -o ignoreHandle=true \
                             -c <child-dir-path> <file-path>

for example:

 /dspace/bin/dspace packager -s -t AIP -e admin@myu.edu -p 4321/12 -o ignoreParent=true -o ignoreHandle=true \
                             -c /path/to/children-aips/ aip4567.zip

The above code will ingest the package named "aip4567.zip" as a child of the specified Parent Object (handle="4321/12"). The resulting object is assigned a new Handle (-o ignoreHandle=true). In addition, any child AIPs referenced by "aip4567.zip" in the folder "/path/to/children-aips" are also recursively ingested (a new Handle is also assigned for each child AIP).
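
If you ingest hierarchies often, it may help to wrap the command so that obvious mistakes (a missing AIP file or child directory) are caught before the packager runs. The following is a minimal sketch under stated assumptions: ingest_hierarchy, DSPACE_BIN and DRY_RUN are names invented for this example, and the packager flags are those from the command template above (submit mode, ignoring the stored parent and Handle).

```shell
#!/bin/sh
# Sketch: ingest an AIP hierarchy under a new parent, with basic checks.
# Hypothetical names: DSPACE_BIN, DRY_RUN, ingest_hierarchy.

DSPACE_BIN="${DSPACE_BIN:-/dspace/bin/dspace}"

# usage: ingest_hierarchy <eperson> <parent-handle> <child-dir-path> <file-path>
ingest_hierarchy() {
    eperson="$1"
    parent="$2"
    child_dir="$3"
    aip="$4"

    # Fail early if the inputs are not where we expect them.
    [ -f "$aip" ] || { echo "missing AIP file: $aip" >&2; return 1; }
    [ -d "$child_dir" ] || { echo "missing child directory: $child_dir" >&2; return 1; }

    # Build the packager command as this function's argument list.
    set -- "$DSPACE_BIN" packager -s -t AIP -e "$eperson" -p "$parent" \
        -o ignoreParent=true -o ignoreHandle=true -c "$child_dir" "$aip"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"   # print the command instead of running it
    else
        "$@"
    fi
}
```

With DRY_RUN=1 set, the function only prints the command it would run, so you can review it before touching your DSpace instance.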

Restoring an AIP Hierarchy

NOTE: This doesn't quite work yet! – Tim

Restoring is slightly different from just re-ingesting. When restoring, we want to retain the old Handles within the Hierarchy. So, it's similar to the Ingesting an AIP Hierarchy instructions above, but it doesn't specify the ignoreParent or ignoreHandle parameters (as we obviously want to retain this information).

For example:

 /dspace/bin/dspace packager -s -t AIP -e admin@myu.edu -p 4321/12 \
                             -c /path/to/children-aips/ aip4567.zip

In this case, the package "aip4567.zip" is restored to the DSpace installation with the Handle provided within the package itself. In addition, any child AIPs referenced by "aip4567.zip" in the folder "/path/to/children-aips" are also recursively ingested. They are also restored with the Handles provided with their package.

Ingesting Entire Site

Details Coming Soon! In all likelihood it will take the same parameters as the "Exporting Entire Site" command, except that you'll be running the packager in -s (submit) mode.

Discussion / Use Cases

Please add your own potential use cases or discussion topics

  • MITUseCases - Notes on defining common operations in a replication system.

Questions / Comments?

Questions or comments – either add them inline above, or contact Tim Donohue
