Unsupported Release

This documentation relates to an old, unsupported version of DSpace, version 1.7.x.

As of January 2014, the DSpace 1.7.x platform is no longer supported. We recommend upgrading to a more recent version of DSpace.

DuraCloud Backup & Restore Prototype for DSpace 1.6

Background & Overview

This comes out of a requirement for DSpace integration with DuraCloud (http://www.duraspace.org/duracloud.php). One of these requirements is to be able to essentially "backup" local DSpace contents into the cloud (as a type of offsite backup), and "restore" those contents at a later time.

Essentially, we'd like a way to be able to export the entire hierarchy (i.e. bitstreams, metadata and relationships between Communities/Collections/Items) into a relatively standard format (e.g. METS or similar structured packaging format). This entire hierarchy should also be able to be re-imported into DSpace in the same format, to allow for "roundtripping" of that content (essentially a restore of that content in the same or different DSpace installation).

Perceived benefits to DSpace community:

  • Would allow folks to more easily move entire Communities or Collections between DSpace instances.
  • Would allow for a potentially more consistent backup of this hierarchy (e.g. to DuraCloud, or just to your own local backup system), rather than relying on synchronizing a backup of your DB (metadata/relationships) and assetstore (bitstreams).
  • Would provide a way for people to more easily get their data out of DSpace (whatever the purpose may be).
  • Would provide a relatively standard format for people to migrate entire hierarchies (Communities/Collections) into DSpace (from another system).

Known Issues:

  • Exporting/Importing the Community/Collection/Item hierarchy technically doesn't cover all the "content" held in DSpace. There are also Groups, EPeople and permissions/rights (which would get you closer to a full export/import of all DSpace content). However, concentrating on just the hierarchy of Community/Collection/Item seems like a good first step.

This is related to (and a partial subset of) MIT's AipPrototype: http://jira.dspace.org/jira/browse/DS-465. However, the AIP prototype currently does not make it very easy to re-import the exported AIPs for Communities or Collections. So, this feature would extend the AIP prototype's current packagers/crosswalks to allow for a full export and import of an entire DSpace hierarchy, or just a set of Communities, Collections or Items.

The current plan is to build off of the subset of the AipPrototype (essentially the packagers, crosswalks and related changes) which begins to allow for this roundtripping of Communities and Collections.

Makeup and Definition of AIPs

AIPs are Archival Information Packages.

  • An AIP is a package describing one archival object.
    • The archival object may be an Item, Collection, or Community. Bitstreams are included in an Item's AIP.
    • Each AIP is logically self-contained and can be restored without the rest of the archive. (So you could restore a single Item, Collection or Community.)
    • AIP profile favors completeness and accuracy rather than presenting the semantics of an object in a standard format. It conforms to the quirks of DSpace's internal object model rather than attempting to produce a universally understandable representation of the object.
    • An AIP can serve as a DIP (Dissemination Information Package) or SIP (Submission Information Package), especially when transferring custody of objects to another DSpace implementation.
  • In contrast to SIP or DIP, the AIP should include all available DSpace structural and administrative metadata, and basic provenance information.
  • Restoration of an archive from AIPs is not perfectly complete at this time; it is intended to recover from catastrophic loss of content and metadata, not restore the exact same archive as before. Currently, some information (e.g. access controls, people, groups) would be lost, as they are not stored in the AIPs.

AIPs Structure

Generally speaking, an AIP is a Zip file containing a METS manifest and all related content bitstreams.

Some examples include:

  • Site AIP
    • METS contains basic metadata about DSpace Site and persistent IDs referencing all Top Level Communities
  • Community AIP
    • METS contains all metadata for Community and persistent IDs referencing all members (SubCommunities or Collections). Package may also include a Logo file, if one exists.
  • Collection AIP
    • METS contains all metadata for Collection and persistent IDs referencing all members (Items). Package may also include a Logo file, if one exists.
  • Item AIP
    • METS contains all metadata for Item and references to all Bundles and Bitstreams. Package also includes all Bitstream files.

Notes:

  • Bitstreams and Bundles are second-class archival objects; they are recorded in the context of an Item.
  • BitstreamFormats are not even second-class; they are described implicitly within Item technical metadata, and reconstructed from that during restoration.

What is NOT in AIPs

  • DSpace Groups, EPeople and Policies (access rights) are currently not described in AIPs. However, there is hope to include them in a future version.
  • DSpace Site configurations ([dspace]/config/ directory) or customizations are not described in AIPs.
  • DSpace Database model (or customizations therein) is not described in AIPs.

Where to get the Code

There is an SVN sandbox area for this work (so that others can help out, if it interests them). If anyone has comments, suggestions or feedback on this idea, or would like to be involved in this project, definitely let me know (or add comments to this issue).

 svn co http://scm.dspace.org/svn/repo/sandbox/aip-external-1_6-prototype/ 

Running the Code

Here's how to get up and running relatively quickly!

Install Prototype

  1. Download the code from the SVN Sandbox (see above).
  2. Build & Install the prototype. This is just a modified version of DSpace 1.6.0 – so, follow the normal DSpace 1.6.0 Installation procedure.
    • If you have a DSpace 1.6.0 instance already running, you can just build the code and point it at your existing DSpace 1.6.0 database & assetstore.

You'll want to have some content (Communities, Collections & Items) to test with!

Exporting AIPs

There are two main "modes" you can run the AIP packager in:

  • Single AIP (default) - Exports just an AIP describing a single DSpace object. So, if you ran it in this default mode for a Collection, you'd just end up with a single Collection AIP (which would not include AIPs for all its child Items)
  • Hierarchy (including child objects) - Exports the requested AIP describing an object, plus the AIP for all child objects. Some examples follow:
    • For a Site - this would export all Communities, Collections & Items within the site into AIP files (in a provided directory)
    • For a Community - this would export that Community and all SubCommunities, Collections and Items into AIP files (in a provided directory)
    • For a Collection - this would export that Collection and all contained Items into AIP files (in a provided directory)
    • For an Item – this just exports the Item into an AIP as normal (as it already contains its Bitstreams/Bundles by default)

Exporting just a single AIP

To export in single AIP mode (default), use this 'packager' command template:

 /dspace/bin/dspace packager -d -t AIP -e <eperson> -i <handle> <file-path>

for example:

 /dspace/bin/dspace packager -d -t AIP -e admin@myu.edu -i 4321/4567 aip4567.zip

The above code will export the object of the given handle (4321/4567) into an AIP file named "aip4567.zip". This will not include any child objects for Communities or Collections.

Exporting AIP Hierarchy

To export an AIP hierarchy, use the includeChildren and childDirectory package parameters.

For example, use this 'packager' command template:

 /dspace/bin/dspace packager -d -t AIP -e <eperson> -i <handle> \
                             -o includeChildren=true -o childDirectory=<dir-path> <file-path>

for example:

 /dspace/bin/dspace packager -d -t AIP -e admin@myu.edu -i 4321/4567 \
                             -o includeChildren=true -o childDirectory=/path/to/children-aips/ aip4567.zip

The above code will export the object of the given handle (4321/4567) into an AIP file named "aip4567.zip". In addition, it will export all child objects to a directory at the path "/path/to/children-aips/". The child AIPs are all named using the following format:

  • File Name Format: <Obj-Type>@<Handle-with-dashes>.zip
    • e.g. COMMUNITY@123456789-1.zip, COLLECTION@123456789-2.zip, ITEM@123456789-200.zip
    • This general file naming convention ensures that you can easily locate an object to restore by its name (assuming you know its Object Type and Handle).
  • Alternatively, if an object doesn't have a Handle, it uses this File Name Format: <Obj-Type>@internal-id-<DSpace-ID>.zip
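
The naming convention above can be sketched as a small shell helper, for example to work out which file to look for when restoring one object. This is only an illustration: the function name aip_filename is hypothetical, and it assumes that "Handle-with-dashes" simply means replacing the "/" in the Handle with "-", as the example file names suggest.

```shell
#!/bin/sh
# Hypothetical helper (not part of the prototype): builds the expected
# child AIP file name from an object type and a Handle.
# Assumption: "Handle-with-dashes" = the Handle with "/" replaced by "-".

# aip_filename <obj-type> <handle> [<dspace-id>]
# Pass an empty Handle ("") to get the internal-id form instead.
aip_filename() {
    obj_type=$1
    handle=$2
    internal_id=$3
    if [ -n "$handle" ]; then
        # e.g. 123456789/200 becomes 123456789-200
        printf '%s@%s.zip\n' "$obj_type" "$(printf '%s' "$handle" | tr '/' '-')"
    else
        # Objects without a Handle are named by their internal DSpace ID
        printf '%s@internal-id-%s.zip\n' "$obj_type" "$internal_id"
    fi
}
```

For instance, `aip_filename ITEM 123456789/200` would give the name of the Item AIP to look for in the child directory.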

Exporting entire Site

To export an entire DSpace Site, pass the packager the Handle <site-handle-prefix>/0. For example, if your site prefix is "4321", you'd run a command similar to the following:

 /dspace/bin/dspace packager -d -t AIP -e admin@myu.edu -i 4321/0 \
                             -o includeChildren=true -o childDirectory=/path/to/children-aips/ sitewide-aip.zip

Again, this would export the DSpace Site AIP into the file "sitewide-aip.zip", and export AIPs for all Communities, Collections and Items into the "/path/to/children-aips" directory.
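
After a hierarchy export it can be useful to check roughly what ended up in the child directory. The following is a minimal sketch, not part of the prototype: the function names and the DSPACE_BIN variable are hypothetical, and the counting relies only on the child AIP file naming convention described earlier.

```shell
#!/bin/sh
# Sketch only: DSPACE_BIN and both function names are illustrative assumptions.
# DSPACE_BIN can be set to "echo" to print the command instead of running it.
DSPACE_BIN="${DSPACE_BIN:-/dspace/bin/dspace}"

# export_hierarchy <eperson> <handle> <child-dir> <file-path>
# Runs the hierarchy export using the same options as the command template.
export_hierarchy() {
    "$DSPACE_BIN" packager -d -t AIP -e "$1" -i "$2" \
        -o includeChildren=true -o childDirectory="$3" "$4"
}

# count_child_aips <child-dir>
# Prints one "<Obj-Type> <count>" line for each of COMMUNITY, COLLECTION, ITEM.
count_child_aips() {
    for obj_type in COMMUNITY COLLECTION ITEM; do
        count=0
        for f in "$1"/"$obj_type"@*.zip; do
            # An unmatched glob stays literal, so only count files that exist.
            if [ -e "$f" ]; then
                count=$((count + 1))
            fi
        done
        printf '%s %s\n' "$obj_type" "$count"
    done
}

# Example (site prefix 4321, as in the text):
#   export_hierarchy admin@myu.edu 4321/0 /path/to/children-aips/ sitewide-aip.zip
#   count_child_aips /path/to/children-aips
```

Comparing these counts against the number of Communities, Collections and Items you expect is only a rough sanity check; it does not validate the contents of the AIPs.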

Ingesting AIPs

Again, like export, there are two main "modes" you can run the AIP packager in:

  • Single AIP (default) - Ingests just an AIP describing a single DSpace object. So, if you ran it in this default mode for a Collection AIP, you'd just create a DSpace Collection from the AIP (but not ingest any of its child objects)
  • Hierarchy (including child objects) - Ingests the requested AIP describing an object, plus the AIP for all child objects. Some examples follow:
    • For a Site - this would create all Communities, Collections & Items based on the located AIP files
    • For a Community - this would create that Community and all SubCommunities, Collections and Items based on the located AIP files
    • For a Collection - this would create that Collection and all contained Items based on the located AIP files
    • For an Item - this just creates the Item (including all Bitstreams & Bundles) based on the AIP file.

Ingesting just a Single AIP

To ingest a single AIP and create a new DSpace object under a parent of your choice, add the ignoreParent and ignoreHandle package parameters to the command. Also, note that you are running the packager in -s (submit) mode.

NOTE: This only ingests the single AIP specified. It does not ingest all children objects.

 /dspace/bin/dspace packager -s -t AIP -e <eperson> -p <parent-handle> -o ignoreParent=true -o ignoreHandle=true <file-path>

If you leave out these package-parameter options, the AIP package ingester will attempt to install the AIP under the parent handle it had before, and give it back its original Handle. After all, the point of AIPs was to reproduce the exact object that was exported. When you are effectively using the AIP as a SIP, however, you may not want it back under the same parent or handle, so there is a way to override these features.
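
The two behaviours described above (restore with the original parent and Handle, or ingest as a new object) can be wrapped in one small function. This is a hedged sketch rather than anything shipped with the prototype: the function name, the "as-new" mode keyword and the DSPACE_BIN variable are hypothetical; only the packager options come from the command template above.

```shell
#!/bin/sh
# Sketch only: ingest_single_aip, the "as-new" keyword and DSPACE_BIN are
# illustrative assumptions. Set DSPACE_BIN=echo for a dry run.
DSPACE_BIN="${DSPACE_BIN:-/dspace/bin/dspace}"

# ingest_single_aip <eperson> <parent-handle> <file-path> [as-new]
ingest_single_aip() {
    eperson=$1
    parent=$2
    file=$3
    mode=${4:-restore}
    if [ "$mode" = "as-new" ]; then
        # Treat the AIP like a SIP: new parent, new Handle
        "$DSPACE_BIN" packager -s -t AIP -e "$eperson" -p "$parent" \
            -o ignoreParent=true -o ignoreHandle=true "$file"
    else
        # No package parameters: the ingester tries to reuse the
        # original parent and Handle recorded in the AIP
        "$DSPACE_BIN" packager -s -t AIP -e "$eperson" -p "$parent" "$file"
    fi
}

# Example:
#   ingest_single_aip admin@myu.edu 4321/10 ITEM@123456789-200.zip as-new
```

Running it first with DSPACE_BIN=echo prints the exact packager command, which makes it easy to confirm which mode you are about to use.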

Ingesting an AIP Hierarchy

Details Coming Soon! In all likelihood it will take the same parameters as described under "Exporting AIP Hierarchy", except that you'll be running the packager in -s (submit) mode.

Questions / Comments?

Questions or comments – either add them inline above, or contact Tim Donohue
