{toc:outline=true|style=none}

{note:title=Will be released in 1.7.0}This code is now available on the current DSpace SVN Trunk ([http://scm.dspace.org/svn/repo/dspace/trunk/]).  It will be officially released as part of DSpace 1.7.0.{note}

{warning:title=Warning For Developers}This code changes the current {{org.dspace.content.packager.PackageIngester}} and {{org.dspace.content.packager.PackageDisseminator}} interfaces.  If you've written any local, custom Packagers at your institution, they will need to be refactored to use these updated interfaces.{warning}


h1. AIP Backup & Restore for DSpace 1.7

h2. Background & Overview

{note}Additional background information available in the OR10 Presentation entitled [Improving DSpace Backups, Restores & Migrations|http://www.slideshare.net/tdonohue/improving-dspace-backups-restores-migrations]{note}

This work comes out of a requirement for DSpace integration with DuraCloud ([http://www.duracloud.org]). One of these requirements is to be able to essentially "backup" local DSpace contents into the cloud (as a type of offsite backup), and "restore" those contents at a later time.

...



Essentially, we need a way to be able to export the entire hierarchy (i.e. bitstreams, metadata and relationships between Communities/Collections/Items) into a relatively standard format (e.g. METS or similar structured packaging format). This entire hierarchy should also be able to be re-imported into DSpace in the same format, to allow for "round-tripping" of that content (essentially a restore of that content in the same or different DSpace installation).

...



*Perceived benefits to DSpace community:*
* Would allow folks to more easily move entire Communities or Collections between DSpace instances.

...


* Would allow for a potentially more consistent backup of this hierarchy (e.g. to DuraCloud, or just to your own local backup system), rather than relying on synchronizing a backup of your DB (metadata/relationships) and assetstore (bitstreams).

...


* Would provide a way for people to more easily get their data out of DSpace (whatever the purpose may be).

...


* Would provide a relatively standard format for people to migrate entire hierarchies (Communities/Collections) into DSpace (from another system).

...



This is related to (and a partial subset of) MIT's [AipPrototype]. However, the original AIP prototype did not make it very easy to re-import the exported AIPs for Communities or Collections. So, this AIP Backup/Restore feature extends the old AIP prototype's packagers/crosswalks to allow for a full export and import of an entire DSpace hierarchy, or just a set of Communities, Collections or Items.

...


h3. How does this work help DSpace interact with DuraCloud?

This work is entirely about *exporting* DSpace content objects to a location on a local filesystem.  So, this work is not specific to DuraCloud, and could be used by any backup storage system to back up your DSpace contents.

In the initial DuraCloud work, the DuraCloud team is working on a way to "synchronize" DuraCloud with a local file folder.  So, DuraCloud can be configured to "watch" a given folder and automatically replicate its contents into the cloud.

Therefore, moving content from DSpace to DuraCloud would currently be a two-step process:
# First, export AIPs describing that content from DSpace to a filesystem folder
# Second, enable DuraCloud to watch that same filesystem folder and replicate it into the cloud.

Similarly, moving content from DuraCloud back into DSpace would also be a two-step process:
# First, you'd tell DuraCloud to replicate the AIPs from the cloud to a folder on your file system
# Second, you'd ingest those AIPs back into DSpace

(These backup/restore processes may change as we go forward and investigate more use cases.  This is just the initial plan.)

h2. Makeup and Definition of AIPs

h3. AIPs are Archival Information Packages.

* An AIP is a package describing one archival object.
** The archival object may be an *Item*, *Collection*, or *Community*. Bitstreams are included in an Item's AIP.
** Each AIP is logically self-contained and can be restored without the rest of the archive. (So you could restore a single Item, Collection or Community.)
** AIP profile favors completeness and accuracy rather than presenting the semantics of an object in a standard format.  It conforms to the quirks of DSpace's internal object model rather than attempting to produce a universally understandable representation of the object.
** An AIP _can_ serve as a DIP (Dissemination Information Package) or SIP (Submission Information Package), especially when transferring custody of objects to another DSpace implementation.
* In contrast to SIP or DIP, the AIP should include all available DSpace structural and administrative metadata, and basic provenance information.
* Restoration of an archive from AIPs is not perfectly complete at this time; it is intended to recover from catastrophic loss of content and metadata, _not_ restore the exact same archive as before.  Currently, some information (e.g. access controls, people, groups) would be lost, as they are not stored in the AIPs.

h3. AIP Structure / Format

Generally speaking, an AIP is a Zip file containing a METS manifest and all related content bitstreams.

For more specific details of AIP format / structure, along with examples, please see [DSpaceAIPFormat]

h2. Where to get the Code

The latest code is available on DSpace Trunk (and will be released in DSpace 1.7.0)

{code} svn co http://scm.dspace.org/svn/repo/dspace/trunk/
{code}

h3. What code has really changed?

The majority of the code changes are in two main areas:

# [org.dspace.content.packager.\*|http://fisheye3.atlassian.com/browse/dspace/trunk/dspace-api/src/main/java/org/dspace/content/packager] \- Packager classes
#* {{PackageIngester}} interface - Now ingests 'java.io.File' objects instead of InputStreams (to better support recursive imports of Communities/Collections)
#* {{PackageDisseminator}} interface - Now exports 'java.io.File' objects instead of OutputStreams (to better support recursive exports of Communities/Collections)
#* {{DSpaceAIPDisseminator}} \- Disseminates/Exports AIP(s)
#* {{DSpaceAIPIngester}} \- Ingests exported AIP(s)
#* Changes were also made to refactor / enhance the {{AbstractMETSDisseminator}}, {{AbstractMETSIngester}}, and {{METSManifest}} classes
# [org.dspace.content.crosswalk.\*|http://fisheye3.atlassian.com/browse/dspace/trunk/dspace-api/src/main/java/org/dspace/content/crosswalk]
#* {{AIPDIMCrosswalk}} \- Crosswalks DIM metadata for AIPs
#* {{AIPTechMDCrosswalk}} \- Crosswalks METS TechMD sections for AIPs
#* There were also changes to the {{MODSDisseminationCrosswalk}} and {{XSLTDisseminationCrosswalk}} to support creating "Site" AIPs


{note:title=For More Information}For a full list of code changes (including patches) see: [AipCoreAPIChanges]{note}

{warning:title=Warning For Developers}Because of the changes to the {{PackageIngester}} and {{PackageDisseminator}} interfaces, if you've created any local Packagers at your institution, those will need to be refactored.{warning}


h2. Running the Code

h3. Exporting AIPs

h4. Export Modes & Options

All AIP Exports are done by using the Dissemination Mode ({{\-d}} option) of the {{packager}} command.

There are two types of AIP Dissemination you can perform:
* *Single AIP* (default, using {{\-d}} option) - Exports just an AIP describing a single DSpace object.  So, if you ran it in this default mode for a Collection, you'd just end up with a single Collection AIP (which would not include AIPs for all its child Items)
* *Hierarchy of AIPs* (using the {{\-d \-\-all}} or {{\-d \-a}} option) - Exports the requested AIP describing an object, plus the AIP for all child objects.  Some examples follow:
** For a Site - this would export *all* Communities, Collections & Items within the site into AIP files (in a provided directory)
** For a Community - this would export that Community and all SubCommunities, Collections and Items into AIP files (in a provided directory)
** For a Collection - this would export that Collection and all contained Items into AIP files (in a provided directory)
** For an Item -- this just exports the Item into an AIP as normal (as it already contains its Bitstreams/Bundles by default)


h4. Exporting just a single AIP

To export in single AIP mode (default), use this 'packager' command template:

{code} /dspace/bin/dspace packager -d -t AIP -e <eperson> -i <handle> <file-path>
{code}

For example:

{code} /dspace/bin/dspace packager -d -t AIP -e admin@myu.edu -i 4321/4567 aip4567.zip
{code}

The above command will export the object with the given handle (4321/4567) into an AIP file named "aip4567.zip".  This will *not* include any child objects for Communities or Collections.


h4. Exporting AIP Hierarchy

To export an AIP hierarchy, use the {{\-a}} (or {{\--all}}) package parameter.

For example, use this 'packager' command template:

{code} /dspace/bin/dspace packager -d -a -t AIP -e <eperson> -i <handle> <file-path>
{code}

For example:

{code} /dspace/bin/dspace packager -d -a -t AIP -e admin@myu.edu -i 4321/4567 aip4567.zip
{code}

The above command will export the object with the given handle (4321/4567) into an AIP file named "aip4567.zip".  In addition, it exports all child objects to the same directory as the "aip4567.zip" file.  The child AIP files are all named using the following format:

...


* File Name Format: {{<Obj-Type>@<Handle-with-dashes>.zip}}
** e.g. COMMUNITY@123456789-1.zip, COLLECTION@123456789-2.zip, ITEM@123456789-200.zip
** This general file naming convention ensures that you can easily locate an object to restore by its name (assuming you know its Object Type and Handle).
* Alternatively, if an object doesn't have a Handle, it uses this File Name Format: {{<Obj-Type>@internal-id-<DSpace-ID>.zip}} (e.g. ITEM@internal-id-234.zip)
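
As a rough illustration of the naming convention above, the following shell sketch derives the expected file name for a child AIP. The function names {{aip_file_name}} and {{aip_file_name_by_id}} are hypothetical helpers (not part of DSpace), and the sketch assumes that "Handle-with-dashes" simply means each "/" in the Handle is replaced by "-".

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical and not shipped with DSpace.
# Assumption: "Handle-with-dashes" means every "/" in the Handle becomes "-".

# Build the AIP file name for an object that has a Handle.
aip_file_name() {
    obj_type=$1   # COMMUNITY, COLLECTION or ITEM
    handle=$2     # e.g. 123456789/200
    printf '%s@%s.zip\n' "$obj_type" "$(printf '%s' "$handle" | tr '/' '-')"
}

# Build the AIP file name for an object without a Handle,
# using its internal DSpace ID instead.
aip_file_name_by_id() {
    obj_type=$1
    dspace_id=$2
    printf '%s@internal-id-%s.zip\n' "$obj_type" "$dspace_id"
}
```

A helper like this may be handy when you need to locate the AIP for a known object in an export directory before restoring it.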

...



h5. Exporting Entire Site

...



To export an entire DSpace Site, pass the packager the Handle {{<site-handle-prefix>/0}}.  For example, if your site prefix is "4321", you'd run a command similar to the following:

...



{code} /dspace/bin/dspace packager -d -a -t AIP -e admin@myu.edu -i 4321/0 sitewide-aip.zip
{code}

Again, this would export the DSpace Site AIP into the file "sitewide-aip.zip", and export AIPs for *all* Communities, Collections and Items into the same directory as the Site AIP.
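
If you script site-wide exports (for example, into a folder watched by a backup tool), a small wrapper can assemble the command for you. The sketch below only prints the command rather than running it. The {{DSPACE_BIN}} variable, the {{site_export_cmd}} function and the backup directory are hypothetical; only the packager options themselves come from this page.

```shell
#!/bin/sh
# Sketch only: prints (does not run) the Site export command.
# DSPACE_BIN and site_export_cmd are hypothetical names for illustration.
DSPACE_BIN=${DSPACE_BIN:-/dspace/bin/dspace}

site_export_cmd() {
    prefix=$1      # site handle prefix, e.g. 4321
    eperson=$2     # e.g. admin@myu.edu
    backup_dir=$3  # folder that will receive the Site AIP and all child AIPs
    # The Site Handle is always <site-handle-prefix>/0.
    printf '%s packager -d -a -t AIP -e %s -i %s/0 %s/sitewide-aip.zip\n' \
        "$DSPACE_BIN" "$eperson" "$prefix" "$backup_dir"
}

# Example usage (review the printed command before executing it):
#   site_export_cmd 4321 admin@myu.edu /backups/aips
```

Printing the command first makes it easy to check the Handle and target folder before starting what may be a long-running export.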

...



h3. Ingesting / Restoring AIPs

...



h4. Ingestion Modes & Options

...



Ingestion of AIPs is a bit more complex than Dissemination, as there are several different "modes" available:

...


# Submit/Ingest Mode ({{\-s}} option, default) -- submit AIP(s) to DSpace in order to create a new object(s) (i.e. AIP is treated like a SIP -- Submission Information Package)
# Restore Mode ({{\-r}} option) -- restore pre-existing object(s) in DSpace based on AIP(s).  This also attempts to restore all handles and relationships (parent/child objects).  This is a specialized type of "submit", where the object is created with a known Handle and known relationships.
# Replace Mode ({{\-r \-f}} option) -- replace existing object(s) in DSpace based on AIP(s). This also attempts to restore all handles and relationships (parent/child objects).  This is a specialized type of "restore" where the contents of existing object(s) are replaced by the contents in the AIP(s).  By default, if a normal "restore" finds the object already exists, it will back out (i.e. rollback all changes) and report which object already exists.

...



Again, like export, there are two types of AIP Ingestion you can perform (using any of the above modes):

...


* *Single AIP* (default) - Ingests just an AIP describing a single DSpace object.  So, if you ran it in this default mode for a Collection AIP, you'd just create a DSpace Collection from the AIP (but not ingest any of its child objects)
* *Hierarchy of AIPs* (by including the {{\-\-all}} or {{\-a}} option after the mode) - Ingests the requested AIP describing an object, plus the AIP for all child objects.  Some examples follow:
** For a Site - this would ingest *all* Communities, Collections & Items based on the located AIP files
** For a Community - this would ingest that Community and all SubCommunities, Collections and Items based on the located AIP files
** For a Collection - this would ingest that Collection and all contained Items based on the located AIP files
** For an Item -- this just ingests the Item (including all Bitstreams & Bundles) based on the AIP file.

h5. The difference between "Submit" and "Restore/Replace" modes

...



It's worth understanding the primary differences between a Submission (specified by {{-s}} parameter) and a Restore (specified by {{-r}} parameter).

...



* *Submission Mode* ({{-s}}) - creates a new object (AIP is treated like a SIP)
** By default, a new Handle is always assigned
*** However, you can force it to use the handle specified in the AIP by specifying {{-o ignoreHandle=false}} as one of your parameters
** By default, a new Parent object *must* be specified (using the {{-p}} parameter). This is the location where the new object will be created.
*** However, you can force it to use the parent object specified in the AIP by specifying {{-o ignoreParent=false}} as one of your parameters
** By default, will respect a Collection's Workflow process when you submit an Item to a Collection
*** However, you can specifically _skip_ any workflow approval processes by specifying {{-w}} parameter.
** *Always* adds a new Deposit License to Items
** *Always* adds new DSpace System metadata to Items (includes new 'dc.date.accessioned', 'dc.date.available', 'dc.date.issued' and 'dc.description.provenance' entries)

* *Restore / Replace Mode*  - restores an object (as if from a backup)
** By default, the Handle specified in the AIP is restored
*** However, for restores, you can force a new handle to be generated by specifying {{-o ignoreHandle=true}} as one of your parameters. (NOTE: Doesn't work for _replace_ mode as the new object always retains the handle of the replaced object)
** By default, the object is restored under the Parent specified in the AIP
*** However, for restores, you can force it to restore under a different parent object by using the {{-p}} parameter. (NOTE: Doesn't work for _replace_ mode, as the new object always retains the parent of the replaced object)
** *Always* skips any Collection workflow approval processes when restoring/replacing an Item in a Collection
** *Never* adds a new Deposit License to Items (rather it restores the previous deposit license, as long as it is stored in the AIP)
** *Never* adds new DSpace System metadata to Items (rather it just restores the metadata as specified in the AIP)

h4. Submitting AIP(s) to create a new object

h5. Submitting a Single AIP

{note:title=AIPs treated as SIPs}This option allows you to essentially use an AIP as a SIP (Submission Information Package).  The default settings will create a new DSpace object (with a new handle and a new parent object, if specified) from your AIP.{note}

To ingest a single AIP and create a new DSpace object under a parent of your choice, specify the {{\-p}} (or {{\--parent}}) package parameter to the command.  Also, note that you are running the {{packager}} in {{\-s}} (submit) mode.

_NOTE:_ This only ingests the single AIP specified.  It does *not* ingest all children objects.

{code} /dspace/bin/dspace packager -s -t AIP -e <eperson> -p <parent-handle> <file-path>
{code}

If you leave out the {{\-p}} parameter, the AIP package ingester will attempt to install the AIP under the same parent it had before.  As you are also specifying the {{\-s}} (submit) parameter, the {{packager}} will assume you want a new Handle to be assigned (as you are effectively specifying that you are submitting a *new* object).  If you want the object to retain the Handle specified in the AIP, you can specify the {{\-o ignoreHandle=false}} option to force the packager to _not_ ignore the Handle specified in the AIP.
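
To make the Handle behaviour described above concrete, here is a minimal shell sketch that assembles a single-AIP submit command, optionally adding {{-o ignoreHandle=false}} to retain the Handle stored in the AIP. The {{submit_cmd}} function and its "yes"/"no" argument are hypothetical; the sketch only prints the command and assumes the packager accepts the {{-o}} option in this position.

```shell
#!/bin/sh
# Sketch only: submit_cmd is a hypothetical helper, not part of DSpace.
# It prints the packager command instead of executing it.
submit_cmd() {
    eperson=$1            # e.g. admin@myu.edu
    parent=$2             # Handle of the new parent, e.g. 4321/12
    aip=$3                # path to the AIP file
    keep_handle=${4:-no}  # "yes" keeps the Handle stored in the AIP

    opts=""
    if [ "$keep_handle" = "yes" ]; then
        # Tell the packager NOT to ignore the Handle found in the AIP.
        opts=" -o ignoreHandle=false"
    fi
    printf '/dspace/bin/dspace packager -s -t AIP -e %s -p %s%s %s\n' \
        "$eperson" "$parent" "$opts" "$aip"
}

# Example usage:
#   submit_cmd admin@myu.edu 4321/12 aip4567.zip       # new Handle assigned
#   submit_cmd admin@myu.edu 4321/12 aip4567.zip yes   # Handle from the AIP is kept
```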


h5. Submitting an AIP Hierarchy

{note:title=AIPs treated as SIPs}This option allows you to essentially use a set of AIPs as SIPs (Submission Information Packages).  The default settings will create a new DSpace object (with a new handle and a new parent object, if specified) from each AIP {note}

To ingest an AIP hierarchy from a directory of AIPs, use the {{\-a}} (or {{\--all}}) package parameter.

For example, use this 'packager' command template:

{code} /dspace/bin/dspace packager -s -a -t AIP -e <eperson> -p <parent-handle> <file-path>
{code}

For example:

{code} /dspace/bin/dspace packager -s -a -t AIP -e admin@myu.edu -p 4321/12 aip4567.zip
{code}
The above command will ingest the package named "aip4567.zip" as a child of the specified Parent Object (handle="4321/12").  The resulting object is assigned a new Handle (since {{\-s}} is specified).  In addition, any child AIPs referenced by "aip4567.zip" are also recursively ingested (a new Handle is also assigned for each child AIP).

...



Another example -- *Ingesting a Top-Level Community* (by using the Site Handle, {{<site-handle-prefix>/0}}):

{code} /dspace/bin/dspace packager -s -a -t AIP -e admin@myu.edu -p 4321/0 community-aip.zip
{code}

The above command will ingest the package named "community-aip.zip" as a *top-level community* (i.e. the specified parent is "4321/0" which is a Site Handle).  Again, the resulting object is assigned a new Handle.  In addition, any child AIPs referenced by "community-aip.zip" are also recursively ingested (a new Handle is also assigned for each child AIP).

...



h4. Restoring/Replacing using AIP(s)

...



*Restoring* is slightly different than just *submitting*.  When restoring, we make every attempt to restore the object as it *used to be* (including its handle, parent object, etc.).

...



There are currently three restore modes:

...


# Default Restore Mode ({{\-r}}) = Attempt to restore object (and optionally children). Rollback all changes if any object is found to already exist.
# Restore, Keep Existing Mode ({{\-r \-k}}) =  Attempt to restore object (and optionally children).  If an object is found to already exist, skip over it (and all children objects), and continue to restore all other non-existing objects.
# Force Replace Mode ({{\-r \-f}}) = Restore an object (and optionally children) and *overwrite* any existing objects in DSpace.  Therefore, if an object is found to already exist in DSpace, its contents are replaced by the contents of the AIP. _WARNING: This mode is potentially dangerous as it will permanently destroy any object contents that do not currently exist in the AIP. You may want to perform a secondary backup, unless you are sure you know what you are doing\!_

{info:title=Restoring a Single AIP}All of the below examples show how to restore an entire hierarchy of objects (using {{-a}} option).   To restore a single object, you can use the same commands, but remove the {{-a}} option.{info}
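
The three restore modes differ only in one extra flag, which the following shell sketch tries to make explicit. The {{restore_cmd}} function and the mode names "default", "keep" and "replace" are hypothetical labels for illustration; the sketch prints the hierarchy-restore command (with {{-a}}) rather than running it.

```shell
#!/bin/sh
# Sketch only: restore_cmd and its mode names are hypothetical.
# It prints the packager command for a hierarchy restore; nothing is executed.
restore_cmd() {
    mode=$1      # default | keep | replace
    eperson=$2   # e.g. admin@myu.edu
    aip=$3       # path to the top-level AIP file

    case $mode in
        default) extra="" ;;      # roll back if any object already exists
        keep)    extra=" -k" ;;   # skip objects that already exist
        replace) extra=" -f" ;;   # overwrite existing objects (destructive!)
        *) echo "unknown restore mode: $mode" >&2; return 1 ;;
    esac

    printf '/dspace/bin/dspace packager -r -a%s -t AIP -e %s %s\n' \
        "$extra" "$eperson" "$aip"
}

# Example usage:
#   restore_cmd keep admin@myu.edu aip4567.zip
```

Keeping "replace" as an explicit, separately named mode may help avoid running the destructive {{-f}} variant by accident.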

h5. Default Restore Mode

By default, the restore mode ({{\-r}} option) will roll back all changes if any object is found to already exist.  The user will be informed which object already exists within their DSpace installation.

Use this 'packager' command template:
{code} /dspace/bin/dspace packager -r -a -t AIP -e <eperson> <file-path>
{code}

For example:

{code} /dspace/bin/dspace packager -r -a -t AIP -e admin@myu.edu aip4567.zip
{code}

_Notice that unlike the {{\-s}} option (for submission/ingesting), the {{\-r}} option does not require the Parent Object ({{\-p}} option) to be specified if it can be determined from the package itself._

In the above example, the package "aip4567.zip" is restored to the DSpace installation with the Handle provided within the package itself (and added as a child of the parent object specified within the package itself).  In addition, any child AIPs referenced by "aip4567.zip" are also recursively ingested (the {{-a}} option specifies to also restore all child AIPs).  They are also restored with the Handles & Parent Objects provided with their package.  If any object is found to already exist, all changes are rolled back (i.e. nothing is restored to DSpace)

...



h5. Restore, Keep Existing Mode

...



When the "Keep Existing" flag ({{\-k}} option) is specified, the restore will attempt to skip over any objects found to already exist.  It will report to the user that the object was found to exist (and was not modified or changed).  It will then continue to restore all objects which do not already exist.

...



One special case to note:  If a Collection or Community is found to already exist, its child objects are also skipped over.  So, this mode will not auto-restore items to an existing Collection.

...



Use this 'packager' command template:

...


{code} /dspace/bin/dspace packager -r -a -k -t AIP -e <eperson> <file-path>
{code}

For example:

{code} /dspace/bin/dspace packager -r -a -k -t AIP -e admin@myu.edu aip4567.zip
{code}

In the above example, the package "aip4567.zip" is restored to the DSpace installation with the Handle provided within the package itself (and added as a child of the parent object specified within the package itself).  In addition, any child AIPs referenced by "aip4567.zip" are also recursively restored (the {{-a}} option specifies to also restore all child AIPs).  They are also restored with the Handles & Parent Objects provided with their package.  If any object is found to already exist, it is skipped over (child objects are also skipped).  All non-existing objects are restored.

...



h5. Force Replace Mode

...



When the "Force Replace" flag ({{\-f}} option) is specified, the restore will *overwrite* any objects found to already exist in DSpace.  In other words, existing content is deleted and then replaced by the contents of the AIP(s).


{warning:title=Potential for Data Loss}Because this mode actually *destroys* existing content in DSpace, it is potentially dangerous and may result in data loss\!  It is recommended to always perform a secondary full backup (assetstore files & database) before attempting to replace any existing object(s) in DSpace.{warning}

{warning:title=Full Site Replace Not Recommended}This doesn't 100% work yet for an entire Site\! You've been warned\!\!\! - Tim{warning}

Use this 'packager' command template:
{code} /dspace/bin/dspace packager -r -a -f -t AIP -e <eperson> <file-path>
{code}

For example:

{code} /dspace/bin/dspace packager -r -a -f -t AIP -e admin@myu.edu aip4567.zip
{code}

In the above example, the package "aip4567.zip" is restored to the DSpace installation with the Handle provided within the package itself (and added as a child of the parent object specified within the package itself).  In addition, any child AIPs referenced by "aip4567.zip" are also recursively ingested.  They are also restored with the Handles & Parent Objects provided with their package. _If any object is found to already exist, its contents are replaced by the contents of the appropriate AIP._

If any error occurs, the script attempts to rollback the entire replacement process.

...



h5. Restoring Entire Site

...



_Details Coming Soon\!_ In all likelihood it will take the same parameters as the "Exporting entire Site", except that you'll be running the {{packager}} in {{\-r}} (restore) mode.

...




h2. Configuration in 'dspace.cfg'

...



The following new configurations relate to AIPs:

...



h3. AIP Metadata Dissemination Configurations

...



The following configurations allow you to specify what metadata is stored within each METS-based AIP.  In 'dspace.cfg', the general format for each of these settings is:

...



* {{aip.disseminate.<setting> = <mdType>:<DSpace-crosswalk-name> \[, ...\]}}
** <setting> is the setting name (see below for the full list of valid settings)
** <mdType> is optional. It allows you to specify the value of the @MDTYPE or @OTHERMDTYPE attribute in the corresponding METS element.
** <DSpace-crosswalk-name> is required.  It specifies the name of the DSpace Crosswalk which should be used to generate this metadata.
** Zero or more {{<label-for-METS>:<DSpace-crosswalk-name>}} may be specified for each setting

{info:title=AIP Metadata Recommendations}It is recommended to use *at least* the default settings when generating AIPs.  DSpace can only restore information that is included within an AIP.  Therefore, if you choose to no longer include some information in an AIP, DSpace will no longer be able to restore that information from an AIP backup.{info}

The default settings in 'dspace.cfg' are:

* {{aip.disseminate.techMD}} - Lists the DSpace Crosswalks (by name) which should be called to populate the {{<techMD>}} section of the METS file within the AIP (Default: PREMIS)
** The PREMIS Crosswalk generates PREMIS metadata for the object specified by the AIP
* {{aip.disseminate.sourceMD}} - Lists the DSpace Crosswalks (by name) which should be called to populate the {{<sourceMD>}} section of the METS file within the AIP (Default: AIP-TECHMD)
** The AIP-TECHMD Crosswalk generates technical metadata (in DIM format) for the object specified by the AIP
* {{aip.disseminate.digiprovMD}} - Lists the DSpace Crosswalks (by name) which should be called to populate the {{<digiprovMD>}} section of the METS file within the AIP (Default: _None_)
* {{aip.disseminate.rightsMD}} - Lists the DSpace Crosswalks (by name) which should be called to populate the {{<rightsMD>}} section of the METS file within the AIP (Default: DSpaceDepositLicense:DSPACE_DEPLICENSE, CreativeCommonsRDF:DSPACE_CCRDF, CreativeCommonsText:DSPACE_CCTEXT)
** The DSPACE_DEPLICENSE crosswalk ensures the DSpace Deposit License is referenced/stored in AIP
** The DSPACE_CCRDF crosswalk ensures any Creative Commons RDF Licenses are referenced/stored in AIP
** The DSPACE_CCTEXT crosswalk ensures any Creative Commons Textual Licenses are referenced/stored in AIP
* {{aip.disseminate.dmd}} - Lists the DSpace Crosswalks (by name) which should be called to populate the {{<dmdSec>}} section of the METS file within the AIP (Default: MODS, DIM)
** The MODS crosswalk translates the DSpace descriptive metadata (for this object) into MODS.  As MODS is a relatively "standard" metadata schema, it may be useful to include a copy of MODS metadata in your AIPs if you should ever want to import them into another (non-DSpace) system.
** The DIM crosswalk just translates the DSpace internal descriptive metadata into an XML format.  This XML format is proprietary to DSpace, but stores the metadata in a format similar to Qualified Dublin Core.
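
As a rough illustration, the defaults listed above would correspond to entries like the following in {{dspace.cfg}}, assuming each one follows the general {{aip.disseminate.<setting>}} format described earlier.  This is a sketch assembled from the list above, not a copy of the shipped file; in particular, showing the empty {{digiprovMD}} default as a commented-out line is an assumption:

{code}
# <techMD> section: PREMIS metadata for the object
aip.disseminate.techMD = PREMIS

# <sourceMD> section: DSpace technical metadata (in DIM format)
aip.disseminate.sourceMD = AIP-TECHMD

# <digiprovMD> section: no crosswalks by default
# (assumption: written here as a commented-out setting)
#aip.disseminate.digiprovMD =

# <rightsMD> section: each entry is <mdType>:<DSpace-crosswalk-name>
aip.disseminate.rightsMD = DSpaceDepositLicense:DSPACE_DEPLICENSE, CreativeCommonsRDF:DSPACE_CCRDF, CreativeCommonsText:DSPACE_CCTEXT

# <dmdSec> section: MODS for portability, DIM for DSpace's internal metadata
aip.disseminate.dmd = MODS, DIM
{code}

Note that the {{rightsMD}} entries use the optional {{<mdType>:}} prefix, while the other settings give only the crosswalk name.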

h3. AIP Ingestion Metadata Crosswalk Configurations

The following configurations allow you to specify what DSpace Crosswalks are used during the ingestion/restoration of AIPs.  These configurations also allow you to ignore areas of the METS file (in the AIP) if you do not want that area to be restored.

In {{dspace.cfg}}, the general format for each of these settings is:

* {{mets.dspaceAIP.ingest.crosswalk.<mdType> = <DSpace-crosswalk-name>}}
** <mdType> is the type of metadata as specified in the METS file.  This corresponds to the value of the @MDTYPE attribute (of that metadata section in the METS).  When the @MDTYPE attribute is "OTHER", then the <mdType> corresponds to the @OTHERMDTYPE attribute value.
** <DSpace-crosswalk-name> specifies the name of the DSpace Crosswalk which should be used to ingest this metadata into DSpace.  You can specify the "NULLSTREAM" crosswalk if you specifically want this metadata to be ignored (and skipped over during ingestion).

By default, the settings in {{dspace.cfg}} are:

{code}
mets.dspaceAIP.ingest.crosswalk.DSpaceDepositLicense = NULLSTREAM
mets.dspaceAIP.ingest.crosswalk.CreativeCommonsRDF = NULLSTREAM
mets.dspaceAIP.ingest.crosswalk.CreativeCommonsText = NULLSTREAM
{code}

The above settings tell the ingester to *ignore* any metadata sections which reference DSpace Deposit Licenses or Creative Commons Licenses.  These metadata sections can be safely ignored as long as the "LICENSE" and "CC_LICENSE" bundles are included in AIPs (which is the default setting).  As the Licenses are included in those Bundles, they will already be restored when restoring the bundle contents.

{info:title=More Info on Default Crosswalks used}If unspecified in the above settings, the AIP ingester will automatically use the Crosswalk which is named the same as the @MDTYPE or @OTHERMDTYPE attribute for the metadata section.  For example, a metadata section with an @MDTYPE="PREMIS" will be processed by the DSpace Crosswalk named "PREMIS".{info}
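
For illustration only: if you did not want MODS descriptive metadata to be read back in during a restore, a setting along these lines could be added, following the same pattern as the default settings.  The choice of MODS here is a hypothetical example rather than a recommendation, and it assumes the METS section in question carries @MDTYPE="MODS":

{code}
# Hypothetical example: skip any metadata section whose @MDTYPE is "MODS" during ingestion
mets.dspaceAIP.ingest.crosswalk.MODS = NULLSTREAM
{code}

Anything skipped this way is not restored from the AIP, so this is only sensible when the same information is restored from elsewhere (for example, when the DIM section of the same AIP already carries the descriptive metadata).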

h3. AIP Ingestion EPerson Configurations

The following setting determines whether the AIP Ingester should create an EPerson (if necessary) when attempting to restore or ingest an Item whose Submitter cannot be located in the system.  By default it is set to "false":

* {{mets.dspaceAIP.ingest.createSubmitter = false}}

h3. AIP Configurations To Improve Ingestion Speed while Validating

It is recommended to validate all AIPs on ingestion (when possible).  However, validation can be extremely slow, as each validation request must first download all referenced Schema documents from various locations on the web (a single METS file may require as many as 10 schemas).

In order to speed up validation, you can pull down a local copy of *all* schemas.  Validation will then use this local cache, which can sometimes make it up to 10 times faster.

To use a local cache of XML schemas when validating, use the following settings in 'dspace.cfg'.  The general format is:

* {{mets.xsd.<abbreviation> = <namespace> <local-file-name>}}
** {{<abbreviation>}} is a unique abbreviation (of your choice) for this schema 
** {{<namespace>}} is the Schema namespace
** {{<local-file-name>}} is the full name of the cached schema file (which should reside in your {{\[dspace\]/config/schemas/}} directory)

The default settings are all commented out, but they provide a full listing of all schemas currently used during validation of AIPs.  In order to utilize them, uncomment the settings, download the appropriate schema files, and save them to your {{\[dspace\]/config/schemas/}} directory using the specified file names:

{code}
#mets.xsd.mets = http://www.loc.gov/METS/ mets.xsd
#mets.xsd.xlink = http://www.w3.org/1999/xlink xlink.xsd
#mets.xsd.mods = http://www.loc.gov/mods/v3 mods.xsd
#mets.xsd.xml = http://www.w3.org/XML/1998/namespace xml.xsd
#mets.xsd.dc = http://purl.org/dc/elements/1.1/ dc.xsd
#mets.xsd.dcterms = http://purl.org/dc/terms/ dcterms.xsd
#mets.xsd.premis = http://www.loc.gov/standards/premis PREMIS.xsd
#mets.xsd.premisObject = http://www.loc.gov/standards/premis PREMIS-Object.xsd
#mets.xsd.premisEvent = http://www.loc.gov/standards/premis PREMIS-Event.xsd
#mets.xsd.premisAgent = http://www.loc.gov/standards/premis PREMIS-Agent.xsd
#mets.xsd.premisRights = http://www.loc.gov/standards/premis PREMIS-Rights.xsd

{code}
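
As an example of the steps above, enabling the local cache for just the METS and XLink schemas might look like the following.  This is a sketch which assumes you have already saved copies of the two schema files as {{mets.xsd}} and {{xlink.xsd}} in your {{\[dspace\]/config/schemas/}} directory; it does not cover the remaining schemas:

{code}
# Uncommented: validation reads [dspace]/config/schemas/mets.xsd and xlink.xsd
mets.xsd.mets = http://www.loc.gov/METS/ mets.xsd
mets.xsd.xlink = http://www.w3.org/1999/xlink xlink.xsd
{code}

Since the speed-up depends on caching *all* schemas, any schema left commented out would presumably still be fetched from the web during validation.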

h2. To-Do List -- What remains to be done!



h3. Testing Special Cases during Restore/Replace

The below special cases need further testing, especially when performing a "Restore" or "Replace".  Mostly, these are just notes for Tim (and other developers), to ensure that all these various "edge" cases can be restored properly (or perhaps not restored, if the decision is made that a case need not be restored).

As each special case is implemented, we can check off the item in the below list.  Special cases which have been fully tested & implemented are marked with a (/).  Feel free to add more special cases to this listing, if we missed anything.

h4. Item Restoration/Replacement

*Special Cases*
* (/) Restore existing Deposit License from AIP -- i.e. do not add a new license (or change the license) during restore/replace
* (/) Restore existing CC License(s)
* Restore item mappings to multiple collections (for items which are mapped to several collections)
* (/) Restore withdrawal state
* Restore embargo state
* Restore permissions & roles (user/group permissions), if possible
* Options to restore just metadata or just particular bitstreams/bundles?
* _Will not restore items which have not made it into the "archived" state._ In other words, at this time, there are no plans to restore items which are still in an approval workflow (WorkflowItems) or items which are unfinished submissions (WorkspaceItems).  WorkspaceItems and WorkflowItems are never exported as AIPs.

h4. Collection Restoration/Replacement

*Special Cases*
* Restore permissions & roles (user/group permissions), if possible
** Restore Workflow approval groups
* (/) Restore Collection-specific license
* Restore Collection's Item Template?
* Restore Collection's content source info? (e.g. OAI-Harvesting Collections versus normal Collections)

h4. Community Restoration/Replacement

*Special Cases*
* Restore permissions & roles (user/group permissions), if possible

h3. Admin UI work

As part of the [CurationTaskProposal] (led by Richard Rodgers & MIT), a new Curation Framework is in the works.  This Curation Framework will have a Command Line interface initially.  However, the goal for 1.7 is to also have Administrative UI tools which are able to kick off various "curation tools".  Among these curation tools will be the ability to export/import AIPs via the Admin UI.

h3. Notes on AIP ingest speed & improving it

Some very basic ingestion speed tests were performed on a set of 26 AIPs (which represented a Community containing a Collection containing 24 Items).  These tests found that, by default, the parsing/ingest settings are currently *not* optimized for speed.

Here are the basic (non-scientific) results:
* Default Settings (validates all METS files using external Schemas): took about 1 minute, 12 seconds to ingest all 26 AIPs
* Locally cached all schemas (with validation turned on): took about 12 seconds to ingest all 26 AIPs
** You can locally cache all schemas by using the {{mets.xsd.\*}} settings in {{dspace.cfg}}
* No validation ({{\-o validate=false}} flag): took about 11 seconds to ingest all 26 AIPs


h1. Discussion / Use Cases

Please add your own potential use cases or discussion topics:

* [DuraCloud DSpace Interaction Notes] \- Notes/Discussion on how DSpace and DuraCloud may need to interact more directly.  This page is *specific* to DuraCloud Use Cases.

* [AIP Export Implementation Notes] \- Notes/Discussion on this specific AIP Backup/Restore Implementation (not specific to DuraCloud).

* [duracloudpilot:MIT Use Cases] \- Notes on defining common operations in a replication system.


h1. Questions / Comments?

Questions or comments -- either add them inline above, or contact [Tim Donohue|~tdonohue:HOME]