Comment: Renaming this page & removing all mentions of 'prototype', as it's no longer a "prototype" since it is now on Trunk and will be released in 1.7

...

Warning

For Developers: This code changes the current org.dspace.content.packager.PackageIngester and org.dspace.content.packager.PackageDisseminator interfaces. If you've written any local, custom Packagers at your institution, they will need to be refactored to use these updated interfaces.

AIP Backup & Restore for DSpace 1.7

Background & Overview

Note

Additional background information is available in the OR10 Presentation entitled Improving DSpace Backups, Restores & Migrations

...

This is related to (and a partial subset of) MIT's AipPrototype. However, the original AIP prototype did not make it very easy to re-import the exported AIPs for Communities or Collections. So, this AIP Backup/Restore feature extends the old AIP prototype's packagers/crosswalks to allow for a full export and import of an entire DSpace hierarchy, or just a set of Communities, Collections or Items.

How does this work help DSpace interact with DuraCloud?

This work is entirely about exporting DSpace content objects to a location on a local filesystem. So, this work isn't tied to DuraCloud; any backup storage system could use it to back up your DSpace contents.

...

(These backup/restore processes may change as we go forward and investigate more use cases. This is just the initial plan.)

Makeup and Definition of AIPs

AIPs are Archival Information Packages.

  • An AIP is a package describing one archival object.
    • An archival object may be an Item, Collection, or Community. Bitstreams are included in an Item's AIP.
    • Each AIP is logically self-contained and can be restored without the rest of the archive (so you could restore a single Item, Collection or Community).
    • AIP profile favors completeness and accuracy rather than presenting the semantics of an object in a standard format. It conforms to the quirks of DSpace's internal object model rather than attempting to produce a universally understandable representation of the object.
    • An AIP can serve as a DIP (Dissemination Information Package) or SIP (Submission Information Package), especially when transferring custody of objects to another DSpace implementation.
  • In contrast to SIP or DIP, the AIP should include all available DSpace structural and administrative metadata, and basic provenance information.
  • Restoration of an archive from AIPs is not perfectly complete at this time; it is intended to recover from catastrophic loss of content and metadata, not restore the exact same archive as before. Currently, some information (e.g. access controls, people, groups) would be lost, as they are not stored in the AIPs.

AIP Structure / Format

Generally speaking, an AIP is a Zip file containing a METS manifest and all related content bitstreams.

For more specific details of AIP format / structure, along with examples, please see DSpaceAIPFormat

Where to get the Code

The latest code is available on DSpace Trunk (and will be released in DSpace 1.7.0)

Code Block
 svn co http://scm.dspace.org/svn/repo/dspace/trunk/ 

What code has really changed?

The majority of the code changes are in two main areas:

  1. org.dspace.content.packager.* - Packager classes
    • PackageIngester interface - Now ingests 'java.io.File' objects instead of InputStreams (to better support recursive imports of Communities/Collections)
    • PackageDisseminator interface - Now exports 'java.io.File' objects instead of OutputStreams (to better support recursive exports of Communities/Collections)
    • DSpaceAIPDisseminator - Disseminates/Exports AIP(s)
    • DSpaceAIPIngester - Ingests exported AIP(s)
    • Changes were also made to refactor / enhance the AbstractMETSDisseminator, AbstractMETSIngester, and METSManifest classes
  2. org.dspace.content.crosswalk.*
    • AIPDIMCrosswalk - Crosswalks DIM metadata for AIPs
    • AIPTechMDCrosswalk - Crosswalks METS TechMD sections for AIPs
    • There were also changes to the MODSDisseminationCrosswalk and XSLTDisseminationCrosswalk to support creating "Site" AIPs

...

Warning

For Developers: Because of the changes to the PackageIngester and PackageDisseminator interfaces, if you've created any local Packagers at your institution, those will need to be refactored.

Running the Code

Exporting AIPs

Export Modes & Options

All AIP Exports are done by using the Dissemination Mode (-d option) of the packager command.

...

  • Single AIP (default, using -d option) - Exports just an AIP describing a single DSpace object. So, if you ran it in this default mode for a Collection, you'd just end up with a single Collection AIP (which would not include AIPs for all its child Items)
  • Hierarchy of AIPs (using the -d --all or -d -a option) - Exports the requested AIP describing an object, plus the AIP for all child objects. Some examples follow:
    • For a Site - this would export all Communities, Collections & Items within the site into AIP files (in a provided directory)
    • For a Community - this would export that Community and all SubCommunities, Collections and Items into AIP files (in a provided directory)
    • For a Collection - this would export that Collection and all contained Items into AIP files (in a provided directory)
    • For an Item – this just exports the Item into an AIP as normal (as it already contains its Bitstreams/Bundles by default)
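The difference between the two export modes can be sketched as a small traversal. This is only an illustration under stated assumptions: the `children` mapping and the object names are hypothetical stand-ins for the DSpace hierarchy, and the function is not part of the packager code.

```python
def objects_to_export(root, children, export_all=False):
    """Return the objects whose AIPs would be written for a given export.

    children is a hypothetical in-memory map of object -> direct child objects.
    export_all mirrors the -a / --all option described above.
    """
    if not export_all:
        # Single AIP mode: only the requested object is exported.
        return [root]
    exported = []
    stack = [root]
    while stack:
        obj = stack.pop()
        exported.append(obj)
        # reversed() keeps children in their listed order when popped.
        stack.extend(reversed(children.get(obj, [])))
    return exported


# Hypothetical hierarchy: one Community, one Collection, two Items.
hierarchy = {
    "community-1": ["collection-2"],
    "collection-2": ["item-200", "item-201"],
}

print(objects_to_export("collection-2", hierarchy))
print(objects_to_export("community-1", hierarchy, export_all=True))
```

In single mode the Collection is exported alone; with `export_all=True` the Community, its Collection and both Items are all listed. An Item gives the same result in both modes, since its Bitstreams and Bundles are already inside its own AIP.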

Exporting just a single AIP

To export in single AIP mode (default), use this 'packager' command template:

...

The above code will export the object of the given handle (4321/4567) into an AIP file named "aip4567.zip". This will not include any child objects for Communities or Collections.

Exporting AIP Hierarchy

To export an AIP hierarchy, use the -a (or --all) package parameter.

...

  • File Name Format: <Obj-Type>@<Handle-with-dashes>.zip
    • e.g. COMMUNITY@123456789-1.zip, COLLECTION@123456789-2.zip, ITEM@123456789-200.zip
    • This general file naming convention ensures that you can easily locate an object to restore by its name (assuming you know its Object Type and Handle).
  • Alternatively, if an object doesn't have a Handle, the File Name Format is: <Obj-Type>@internal-id-<DSpace-ID>.zip (e.g. ITEM@internal-id-234.zip)
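As a rough illustration of the naming convention above, the helper below derives a file name from an object type and Handle. The function name is hypothetical, and replacing "/" with "-" is an assumption inferred from the examples (e.g. 123456789/1 becoming 123456789-1), not taken from the packager source.

```python
def aip_file_name(obj_type, handle=None, internal_id=None):
    """Build an AIP file name following the convention described above.

    obj_type    -- e.g. "COMMUNITY", "COLLECTION" or "ITEM"
    handle      -- the object's Handle, e.g. "123456789/200" (may be None)
    internal_id -- the DSpace internal ID, used only when there is no Handle
    """
    if handle:
        # Assumption: "Handle-with-dashes" means the "/" becomes "-".
        return f"{obj_type}@{handle.replace('/', '-')}.zip"
    if internal_id is None:
        raise ValueError("either a handle or an internal ID is required")
    return f"{obj_type}@internal-id-{internal_id}.zip"


print(aip_file_name("ITEM", handle="123456789/200"))
print(aip_file_name("ITEM", internal_id=234))
```

Going the other way (locating the file for a known object type and Handle) is the same computation, which is what makes the convention convenient for restores.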

Exporting Entire Site

To export an entire DSpace Site, pass the packager the Handle <site-handle-prefix>/0. For example, if your site prefix is "4321", you'd run a command similar to the following:

...

Again, this would export the DSpace Site AIP into the file "sitewide-aip.zip", and export AIPs for all Communities, Collections and Items into the same directory as the Site AIP.

Ingesting / Restoring AIPs

Ingestion Modes & Options

Ingestion of AIPs is a bit more complex than Dissemination, as there are several different "modes" available:

...

  • Single AIP (default) - Ingests just an AIP describing a single DSpace object. So, if you ran it in this default mode for a Collection AIP, you'd just create a DSpace Collection from the AIP (but not ingest any of its child objects)
  • Hierarchy of AIPs (by including the --all or -a option after the mode) - Ingests the requested AIP describing an object, plus the AIP for all child objects. Some examples follow:
    • For a Site - this would ingest all Communities, Collections & Items based on the located AIP files
    • For a Community - this would ingest that Community and all SubCommunities, Collections and Items based on the located AIP files
    • For a Collection - this would ingest that Collection and all contained Items based on the located AIP files
    • For an Item – this just ingests the Item (including all Bitstreams & Bundles) based on the AIP file.

The difference between "Submit" and "Restore/Replace" modes

It's worth understanding the primary differences between a Submission (specified by -s parameter) and a Restore (specified by -r parameter).

...

  • Restore / Replace Mode - restores a new object (as if from a backup)
    • By default, the Handle specified in the AIP is restored
      • However, for restores, you can force a new handle to be generated by specifying -o ignoreHandle=true as one of your parameters. (NOTE: Doesn't work for replace mode as the new object always retains the handle of the replaced object)
    • By default, the object is restored under the Parent specified in the AIP
      • However, for restores, you can force it to restore under a different parent object by using the -p parameter. (NOTE: Doesn't work for replace mode, as the new object always retains the parent of the replaced object)
    • Always skips any Collection workflow approval processes when restoring/replacing an Item in a Collection
    • Never adds a new Deposit License to Items (rather it restores the previous deposit license, as long as it is stored in the AIP)
    • Never adds new DSpace System metadata to Items (rather it just restores the metadata as specified in the AIP)

Submitting AIP(s) to create a new object

Submitting a Single AIP

Note

This option allows you to essentially use an AIP as a SIP (Submission Information Package). The default settings will create a new DSpace object (with a new handle and a new parent object, if specified) from your AIP.

...

If you leave out the -p parameter, the AIP package ingester will attempt to install the AIP under the same parent it had before. As you are also specifying the -s (submit) parameter, the packager will assume you want a new Handle to be assigned (as you are effectively specifying that you are submitting a new object). If you want the object to retain the Handle specified in the AIP, you can specify the -o ignoreHandle=false option to force the packager to not ignore the Handle specified in the AIP.
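The Handle rules described for the submit, restore and replace modes can be summarised in a few lines. This is a sketch of the documented behaviour only; the function name and the returned labels are hypothetical, and the real decision is made inside the AIP ingester.

```python
def handle_source(mode, ignore_handle=None):
    """Say where the ingested object's Handle comes from.

    mode          -- "submit" (-s), "restore" (-r) or "replace"
    ignore_handle -- value of -o ignoreHandle=..., or None if not given
    Returns "new", "aip" or "replaced-object" (labels are illustrative).
    """
    if mode == "replace":
        # The new object always retains the Handle of the replaced object,
        # so ignoreHandle has no effect here.
        return "replaced-object"
    if mode not in ("submit", "restore"):
        raise ValueError(f"unknown mode: {mode}")
    if ignore_handle is None:
        # Submit assumes a new Handle; restore keeps the one in the AIP.
        ignore_handle = (mode == "submit")
    return "new" if ignore_handle else "aip"


for mode, option in [("submit", None), ("submit", False),
                     ("restore", None), ("restore", True)]:
    print(mode, option, "->", handle_source(mode, option))
```

The same pattern applies to the parent object: restore uses the parent recorded in the AIP unless -p is given, while replace always keeps the parent of the replaced object.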

Submitting an AIP Hierarchy

Note

This option allows you to essentially use a set of AIPs as SIPs (Submission Information Packages). The default settings will create a new DSpace object (with a new handle and a new parent object, if specified) from each AIP.

...

The above command will ingest the package named "community-aip.zip" as a top-level community (i.e. the specified parent is "4321/0" which is a Site Handle). Again, the resulting object is assigned a new Handle. In addition, any child AIPs referenced by "community-aip.zip" are also recursively ingested (a new Handle is also assigned for each child AIP).

Restoring/Replacing using AIP(s)

Restoring is slightly different than just submitting. When restoring, we make every attempt to restore the object as it used to be (including its handle, parent object, etc.).

...

Info

Restoring a Single AIP: All of the below examples show how to restore an entire hierarchy of objects (using -a option). To restore a single object, you can use the same commands, but remove the -a option.

Default Restore Mode

By default, the restore mode (-r option) will roll back all changes if any object is found to already exist. The user will be informed which object already exists within their DSpace installation.

...

In the above example, the package "aip4567.zip" is restored to the DSpace installation with the Handle provided within the package itself (and added as a child of the parent object specified within the package itself). In addition, any child AIPs referenced by "aip4567.zip" are also recursively ingested (the -a option specifies to also restore all child AIPs). They are also restored with the Handles & Parent Objects provided with their package. If any object is found to already exist, all changes are rolled back (i.e. nothing is restored to DSpace).

Restore, Keep Existing Mode

When the "Keep Existing" flag (-k option) is specified, the restore will attempt to skip over any objects found to already exist. It will report to the user that the object was found to exist (and was not modified or changed). It will then continue to restore all objects which do not already exist.

...

In the above example, the package "aip4567.zip" is restored to the DSpace installation with the Handle provided within the package itself (and added as a child of the parent object specified within the package itself). In addition, any child AIPs referenced by "aip4567.zip" are also recursively restored (the -a option specifies to also restore all child AIPs). They are also restored with the Handles & Parent Objects provided with their package. If any object is found to already exist, it is skipped over (child objects are also skipped). All non-existing objects are restored.

Force Replace Mode

When the "Force Replace" flag (-f option) is specified, the restore will overwrite any objects found to already exist in DSpace. In other words, existing content is deleted and then replaced by the contents of the AIP(s).

...

If any error occurs, the script attempts to roll back the entire replacement process.
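To compare the three restore behaviours side by side, here is a toy simulation over a set of Handles. It is a sketch under stated assumptions: the repository is modelled as a plain set, the policy names are hypothetical labels for the default, Keep Existing (-k) and Force Replace (-f) behaviours, and no real DSpace code is involved.

```python
def restore(aips, existing, policy="default"):
    """Simulate a hierarchy restore against a set of existing Handles.

    aips     -- (handle, parent_handle) pairs, parents listed before children
    existing -- Handles already present in the repository
    policy   -- "default", "keep" (Keep Existing) or "force" (Force Replace)
    """
    if policy not in ("default", "keep", "force"):
        raise ValueError(f"unknown policy: {policy}")
    repo = set(existing)  # work on a copy so a rollback just discards it
    report = {"restored": [], "skipped": [], "replaced": [], "conflict": None}
    for handle, parent in aips:
        if policy == "keep" and parent in report["skipped"]:
            # Children of a skipped object are skipped as well.
            report["skipped"].append(handle)
        elif handle not in repo:
            repo.add(handle)
            report["restored"].append(handle)
        elif policy == "default":
            # Any existing object aborts the run; nothing is restored.
            return set(existing), {"restored": [], "skipped": [],
                                   "replaced": [], "conflict": handle}
        elif policy == "keep":
            report["skipped"].append(handle)
        else:
            # Force Replace: the existing object is overwritten from the AIP.
            report["replaced"].append(handle)
    return repo, report


# Hypothetical hierarchy: Community 4321/1 > Collection 4321/2 > Item 4321/3,
# where the Collection already exists in the repository.
aips = [("4321/1", "4321/0"), ("4321/2", "4321/1"), ("4321/3", "4321/2")]
for policy in ("default", "keep", "force"):
    print(policy, restore(aips, {"4321/2"}, policy)[1])
```

With the Collection already present, the default policy restores nothing and reports the conflict, Keep Existing restores only the Community, and Force Replace restores or replaces all three objects.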

Restoring Entire Site

Details Coming Soon! In all likelihood it will take the same parameters as the "Exporting entire Site", except that you'll be running the packager in -r (restore) mode.

Configuration in 'dspace.cfg'

The following new configurations relate to AIPs:

AIP Metadata Dissemination Configurations

The following configurations allow you to specify what metadata is stored within each METS-based AIP. In 'dspace.cfg', the general format for each of these settings is:

...

  • aip.disseminate.techMD - Lists the DSpace Crosswalks (by name) which should be called to populate the <techMD> section of the METS file within the AIP (Default: PREMIS)
    • The PREMIS Crosswalk generates PREMIS metadata for the object specified by the AIP
  • aip.disseminate.sourceMD - Lists the DSpace Crosswalks (by name) which should be called to populate the <sourceMD> section of the METS file within the AIP (Default: AIP-TECHMD)
    • The AIP-TECHMD Crosswalk generates technical metadata (in DIM format) for the object specified by the AIP
  • aip.disseminate.digiprovMD - Lists the DSpace Crosswalks (by name) which should be called to populate the <digiprovMD> section of the METS file within the AIP (Default: None)
  • aip.disseminate.rightsMD - Lists the DSpace Crosswalks (by name) which should be called to populate the <rightsMD> section of the METS file within the AIP (Default: DSpaceDepositLicense:DSPACE_DEPLICENSE, CreativeCommonsRDF:DSPACE_CCRDF, CreativeCommonsText:DSPACE_CCTEXT)
    • The DSPACE_DEPLICENSE crosswalk ensures the DSpace Deposit License is referenced/stored in AIP
    • The DSPACE_CCRDF crosswalk ensures any Creative Commons RDF Licenses are referenced/stored in AIP
    • The DSPACE_CCTEXT crosswalk ensures any Creative Commons Textual Licenses are referenced/stored in AIP
  • aip.disseminate.dmd - Lists the DSpace Crosswalks (by name) which should be called to populate the <dmdSec> section of the METS file within the AIP (Default: MODS, DIM)
    • The MODS crosswalk translates the DSpace descriptive metadata (for this object) into MODS. As MODS is a relatively "standard" metadata schema, it may be useful to include a copy of MODS metadata in your AIPs if you should ever want to import them into another (non-DSpace) system.
    • The DIM crosswalk just translates the DSpace internal descriptive metadata into an XML format. This XML format is proprietary to DSpace, but stores the metadata in a format similar to Qualified Dublin Core.
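The default values listed above are comma-separated lists, where some entries have the form `Something:CrosswalkName` and others are a bare crosswalk name. A minimal parsing sketch follows. It assumes the part after the colon is the DSpace Crosswalk name (as the DSPACE_DEPLICENSE, DSPACE_CCRDF and DSPACE_CCTEXT descriptions suggest) and treats the part before it as an opaque label; the function name is hypothetical.

```python
def parse_crosswalk_list(value):
    """Split an aip.disseminate.* value into (label, crosswalk_name) pairs.

    Entries without a colon use the crosswalk name as the label too.
    """
    entries = []
    for raw in value.split(","):
        raw = raw.strip()
        if not raw:
            continue
        label, sep, name = raw.partition(":")
        if not sep:
            # Bare entry such as "MODS": label and crosswalk name coincide.
            name = label
        entries.append((label.strip(), name.strip()))
    return entries


rights_default = ("DSpaceDepositLicense:DSPACE_DEPLICENSE, "
                  "CreativeCommonsRDF:DSPACE_CCRDF, "
                  "CreativeCommonsText:DSPACE_CCTEXT")
print(parse_crosswalk_list(rights_default))
print(parse_crosswalk_list("MODS, DIM"))
```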

AIP Ingestion Metadata Crosswalk Configurations

The following configurations allow you to specify what DSpace Crosswalks are used during the ingestion/restoration of AIPs. These configurations also allow you to ignore areas of the METS file (in the AIP) if you do not want that area to be restored.

...

Note

If unspecified in the above settings, the AIP ingester will automatically use the Crosswalk which is named the same as the @MDTYPE or @OTHERMDTYPE attribute for the metadata section. For example, a metadata section with an @MDTYPE="PREMIS" will be processed by the DSpace Crosswalk named "PREMIS".
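The fallback rule in this note can be sketched as a simple lookup. The `configured` mapping is a hypothetical stand-in for whatever the ingestion crosswalk settings contain, and "MyPremisCrosswalk" is an invented name used only for illustration. The one outside fact relied on is the METS convention that a section with MDTYPE="OTHER" carries its real type in OTHERMDTYPE.

```python
def crosswalk_for(mdtype, othermdtype=None, configured=None):
    """Pick the DSpace Crosswalk name for one METS metadata section.

    mdtype      -- value of the section's @MDTYPE attribute
    othermdtype -- value of @OTHERMDTYPE, if present
    configured  -- hypothetical map of metadata type -> crosswalk name
    """
    configured = configured or {}
    # METS uses MDTYPE="OTHER" plus OTHERMDTYPE for non-standard types.
    type_name = othermdtype if mdtype == "OTHER" and othermdtype else mdtype
    # Fall back to the crosswalk named after the metadata type itself.
    return configured.get(type_name, type_name)


print(crosswalk_for("PREMIS"))
print(crosswalk_for("OTHER", "DIM"))
print(crosswalk_for("PREMIS", configured={"PREMIS": "MyPremisCrosswalk"}))
```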

AIP Ingestion EPerson Configurations

The following setting determines whether the AIP Ingester should create an EPerson (if necessary) when attempting to restore or ingest an Item whose Submitter cannot be located in the system. By default it is set to "false".

  • mets.dspaceAIP.ingest.createSubmitter = false

AIP Configurations To Improve Ingestion Speed while Validating

It is recommended to validate all AIPs on ingestion (when possible). But validation can be extremely slow, as each validation request must first download all referenced Schema documents from various locations on the web (validating a single METS file may require downloading as many as 10 schemas).

...

Code Block
#mets.xsd.mets = http://www.loc.gov/METS/ mets.xsd
#mets.xsd.xlink = http://www.w3.org/1999/xlink xlink.xsd
#mets.xsd.mods = http://www.loc.gov/mods/v3 mods.xsd
#mets.xsd.xml = http://www.w3.org/XML/1998/namespace xml.xsd
#mets.xsd.dc = http://purl.org/dc/elements/1.1/ dc.xsd
#mets.xsd.dcterms = http://purl.org/dc/terms/ dcterms.xsd
#mets.xsd.premis = http://www.loc.gov/standards/premis PREMIS.xsd
#mets.xsd.premisObject = http://www.loc.gov/standards/premis PREMIS-Object.xsd
#mets.xsd.premisEvent = http://www.loc.gov/standards/premis PREMIS-Event.xsd
#mets.xsd.premisAgent = http://www.loc.gov/standards/premis PREMIS-Agent.xsd
#mets.xsd.premisRights = http://www.loc.gov/standards/premis PREMIS-Rights.xsd
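Each of the settings above pairs a schema namespace with a local file name. The sketch below reads such lines into a dictionary, which can be handy for checking that every cached schema file actually exists. The function name is hypothetical, and it deliberately reads commented-out lines too, since that is how the defaults are shown here.

```python
def parse_xsd_settings(text):
    """Map each mets.xsd.<id> setting to a (namespace, local_file) pair."""
    schemas = {}
    for line in text.splitlines():
        # Drop the leading "#" so commented-out defaults are read as well.
        line = line.strip().lstrip("#").strip()
        if not line.startswith("mets.xsd."):
            continue
        key, _, value = line.partition("=")
        parts = value.split()
        if len(parts) != 2:
            continue  # not in "<namespace> <file>" form
        namespace, filename = parts
        schemas[key.strip()[len("mets.xsd."):]] = (namespace, filename)
    return schemas


sample = """
#mets.xsd.mets = http://www.loc.gov/METS/ mets.xsd
#mets.xsd.xlink = http://www.w3.org/1999/xlink xlink.xsd
"""
for schema_id, (namespace, filename) in parse_xsd_settings(sample).items():
    print(schema_id, namespace, filename)
```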

To-Do List – What remains to be done!

Testing Special Cases during Restore/Replace

The below special cases need further testing, especially when performing a "Restore" or "Replace". Mostly, these are just notes for Tim (and other developers), to ensure that all these various "edge" cases can be restored properly (or perhaps not restored properly, if the decision is made that it need not be restored).

As each special case is implemented, we can check off the item in the below list. Special cases which have been fully tested & implemented are marked with a (tick). Feel free to add more special cases to this listing, if we missed anything.

Item Restoration/Replacement

Special Cases

  • (tick) Restore existing Deposit License from AIP – i.e. do not add a new license (or change the license) during restore/replace
  • (tick) Restore existing CC License(s)
  • Restore item mappings to multiple collections (for items which are mapped to several collections)
  • (tick) Restore withdrawal state
  • Restore embargo state
  • Restore permissions & roles (user/group permissions), if possible
  • Options to restore just metadata or just particular bitstreams/bundles?
  • Will not restore items which have not made it into the "archived" state. In other words, at this time, there are no plans to restore items which are still in an approval workflow (WorkflowItems) or items which are unfinished submissions (WorkspaceItems). WorkspaceItems and WorkflowItems are never exported as AIPs.

Collection Restoration/Replacement

Special Cases

  • Restore permissions & roles (user/group permissions), if possible
    • Restore Workflow approval groups
  • (tick) Restore Collection-specific license
  • Restore Collection's Item Template?
  • Restore Collection's content source info? (e.g. OAI-Harvesting Collections versus normal Collections)

Community Restoration/Replacement

Special Cases

  • Restore permissions & roles (user/group permissions), if possible

Admin UI work

As part of the CurationTaskProposal (led by Richard Rodgers & MIT), a new Curation Framework is in the works. This Curation Framework will have a Command Line interface initially. However, the goal for 1.7 is to also have Administrative UI tools which are able to kick off various "curation tools". Among these curation tools will be the ability to export/import AIPs via the Admin UI.

Notes on AIP ingest speed & improving it

Some very basic ingestion speed tests were performed on a set of 26 AIPs (which represented a Community containing a Collection containing 24 Items). These tests found that, by default, the parsing/ingest settings are currently not optimized for speed.

...

  • Default Settings (validates all METS files using external Schemas): took about 1 minute, 12 seconds to ingest all 26 AIPs
  • Locally cached all schemas (with validation turned on): took about 12 seconds to ingest all 26 AIPs
    • You can locally cache all schemas by using the mets.xsd.* settings in dspace.cfg
  • No validation (-o validate=false flag): took about 11 seconds to ingest all 26 AIPs
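To put the three timings above on a common footing, the short calculation below converts them to seconds per AIP and to a speedup relative to the default settings. The figures are the approximate ones reported above (1 minute 12 seconds is taken as 72 seconds); the function and key names are illustrative only.

```python
def summarize(timings_seconds, aip_count):
    """Return {name: (seconds_per_aip, speedup_vs_default)}."""
    baseline = timings_seconds["default"]
    return {
        name: (seconds / aip_count, baseline / seconds)
        for name, seconds in timings_seconds.items()
    }


# Approximate totals from the tests above, for 26 AIPs.
timings = {"default": 72, "cached schemas": 12, "no validation": 11}
for name, (per_aip, speedup) in summarize(timings, 26).items():
    print(f"{name}: {per_aip:.2f} s per AIP, speedup {speedup:.1f}x")
```

Caching the schemas locally gives roughly a six-fold speedup while keeping validation on, and switching validation off entirely saves only about one further second on this test set.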

Discussion / Use Cases

Please add your own potential use cases or discussion topics

...

  • MIT Use Cases - Notes on defining common operations in a replication system.

Questions / Comments?

Questions or comments – either add them inline above, or contact Tim Donohue