
PLEDGE AIP Prototype

This page describes a prototype AIP implementation planned as
part of the PLEDGE project.
Since the PLEDGE project needs AIPs only
in order to replicate them under the direction of a policy engine,
it was not necessary to create an AIP-based asset store.

...

  • An AIP is a package describing one archival object.
    • An archival object may be an Item, Collection, or Community. Bitstreams are included in an Item's AIP.
    • Each AIP is logically self-contained and can be restored without the rest of the archive.
    • The AIP profile favors completeness and accuracy over presenting the semantics of an object in a standard format. It conforms to the quirks of DSpace's internal object model rather than attempting to produce a universally understandable representation of the object.
    • An AIP can serve as a DIP, especially when transferring custody of objects to another DSpace implementation, but it is not intended to be a general-purpose DIP (the DSpace METS SIP profile is better for that).
  • The implementation is layered on top of DSpace 1.4 plus the EventSystemPrototype with minimal other changes to the source.
  • Restoration of an archive from AIPs is not perfectly complete; it is intended to recover from catastrophic loss of content and metadata, not to restore the exact same archive as before. Some information (e.g. access controls) would be lost.
  • This prototype does NOT attempt to redefine the asset store in terms of AIPs, as in the AssetStore proposals.

...

  • Although this prototype does not implement AIP-centric storage, it can be leveraged toward that goal.
  • This design does not consider any of the proposals for versioning DSpace objects.
  • See the ObjectUri proposal that establishes a pattern for generating URIs for pieces of the DSpace object model. This AIP implementation relies on it to name Bitstream asset-store files for internal AIPs.
  • The impact on performance was a consideration:
    • Creating or rewriting an AIP manifest is too expensive to do synchronously as part of every data model change.
    • The solution is to let AIP updates lag behind real-time changes by some amount. Possible implementations include:
      1. Use asynchronous consumer (see Event System Prototype) to update AIPs at intervals and/or in a separate JVM.
      2. Periodically run the AIPManager command-line application (e.g. from cron) to update all "stale" AIPs.
  • How best to record descriptive metadata? Crosswalk to a standard like MODS, which is best for archival and external consumption? Save DIM directly for maximum completeness and accuracy of restoration in another DSpace? Include both?
  • Some archival objects have elements that will probably be left out of the AIP prototype, e.g. template Items for Collections.
  • Internal AIPs are susceptible to a race condition: Whenever any "member" elements of an object (e.g. the Bitstreams of an Item, or the Items of a Collection) are permanently deleted from the archive, the AIP for the parent object will contain unresolvable references until it is updated. If the internal AIP – which, remember, depends on the asset store for its components – is ingested at this point, the ingest will fail until the AIP is corrected.
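
The race condition in the last point can be detected before an ingest is attempted. The following is a minimal sketch, not part of the prototype: it assumes the member Handles have already been extracted from the AIP manifest and that the set of Handles currently in the archive is known. The function name find_unresolved_members and the sample Handles are hypothetical.

```python
def find_unresolved_members(member_handles, existing_handles):
    """Return the member Handles named by an AIP that no longer exist in the archive.

    Hypothetical helper: the order of the manifest references is preserved.
    """
    existing = set(existing_handles)
    return [handle for handle in member_handles if handle not in existing]


# Illustrative data only: an Item was deleted after the parent AIP was written.
manifest_members = ["1721.1/100", "1721.1/101", "1721.1/102"]
archive_handles = {"1721.1/100", "1721.1/102"}

stale = find_unresolved_members(manifest_members, archive_handles)
if stale:
    print("AIP must be updated before ingest; unresolvable members:", stale)
```

An empty result means every reference in the manifest still resolves, so the internal AIP should be safe to ingest in this respect.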

AIP Details: METS Usage

  • mets element
    • @PROFILE fixed value="http://www.dspace.org/schema/aip/1.0/mets.xsd" (this is how we identify an AIP manifest)
    • @OBJID URN-format persistent identifier (Handle) if available, or else a unique identifier.
    • @LABEL title if available
    • @TYPE DSpace object type, one of "DSpace ITEM", "DSpace COLLECTION", "DSpace COMMUNITY".
    • @ID is a globally unique identifier, such as dspace67075091976862014717971209717749394363.
  • mets/metsHdr element
    • @CREATEDATE timestamp that AIP was created.
    • @LASTMODDATE last-modified date on Item, or nothing for other objects.
      • <font color="RED">METS defines these attributes as describing the METS document itself; we use them to describe the AIP, which we sometimes think of as the METS document but more often think of as the 'package', i.e. the METS document and all the files. I don't have a problem with the use Larry put forth, but we need to mention it in a profile. I wonder if these dates shouldn't rather be in a techMD section, or maybe both.</font>
    • agent element:
      • @ROLE = "CUSTODIAN",
      • @TYPE = "OTHER",
      • @OTHERTYPE = "DSpace Archive",
      • name = Site handle.
  • mets/dmdSec element
    • object's descriptive metadata crosswalked to MODS (or whatever the METS default is)
      • <font color="RED">See link to RW's Comments Page below for notes on use of MODS</font>
    • object's descriptive metadata in DSpace native DIM intermediate format, to serve as a complete and precise record for restoration or ingestion into another DSpace.
      • <font color="RED">We should require mets/dmdSec@OTHERMDTYPE if @MDTYPE = "OTHER"</font>
    • When the mdWrap @MDTYPE value is OTHER, the element MUST include a value for the @OTHERMDTYPE attribute which names the crosswalk that produced (or interprets) that metadata, e.g. AIP-TECHMD.
  • mets/amdSec element - admin (technical, source, rights, and provenance) metadata for the entire archival object.
    • rightsMD elements of the following TYPEs:
      • DSpaceDepositLicense If the object has a deposit license, it is contained here.
      • CreativeCommonsRDF If the object is an Item with a Creative Commons license expressed in RDF, it is included here.
      • CreativeCommonsText If the object is an Item with a Creative Commons license in plain text, it is included here.
    • sourceMD elements - recorded twice, once in DSpace native format, once in PREMIS:
      <font COLOR="RED">NOTE: PREMIS is only implemented for Bitstreams at the moment, and for the foreseeable future.</font>
      • DSpace native format: MDTYPE="OTHER" OTHERMDTYPE="AIP-TECHMD" (see Crosswalks section below for details)
      • PREMIS expression of this technical metadata for archival object. (To be done later.)
    • digiprovMD
      • When History data is available, includes a section of TYPE="DSpaceHistory" containing an RDF/XML rendition of the history data for the object. For internal AIPs, the history is stored in an external bitstream in the asset store; for self-contained packages it is a file in the package.
  • mets/amdSec elements - technical metadata for each of an Item's Bitstreams, both in PREMIS and DIM formats
  • mets/fileSec element
    • For archival objects of type ITEM:
      • mets/fileSec/fileGrp/file element
        • Set @SIZE to length of the bitstream. There is a redundant value in the techMD but it is more accessible here.
        • Set @MIMETYPE, @CHECKSUM, @CHECKSUMTYPE to corresponding bitstream values. There is redundant info in the techMD.
        • Set @SEQ to bitstream's SequenceID if it has one.
    • For archival objects of types COLLECTION and COMMUNITY:
      • Only if the object has a logo bitstream, there is a fileSec with one fileGrp child of @TYPE="LOGO".
      • The fileGrp contains one file element, representing the logo Bitstream. It has the same file format, checksum, etc. fields as the Item content bitstreams, but does not include metadata section references or a SequenceID.
      • See the main structMap for the reference to this file.
  • mets/structMap - Primary structure map, @LABEL="DSpace Object", @TYPE="LOGICAL"
    • For COLLECTION objects: Top-level div has one child:
      1. div with @TYPE="MEMBERS". For every Item in the Collection, it contains a div with an mptr linking to the Handle of that Item. The mptr has @LOCTYPE="HANDLE", and its @xlink:href value is the raw Handle.
      • If Collection has a Logo bitstream, there is an fptr reference to it in the very first div.
    • For COMMUNITY objects: Top-level div has two children:
      1. div with @TYPE="SUBCOMMUNITIES". For every Sub-Community in the Community it contains a div with an mptr linking to the Handle of that Community. The mptr has @LOCTYPE="HANDLE", and its @xlink:href value is the raw Handle.
      2. div with @TYPE="COLLECTIONS". For every child Collection, it contains a div with an mptr linking to the Handle of that Collection. The mptr has @LOCTYPE="HANDLE", and its @xlink:href value is the raw Handle.
      • If Community has a Logo bitstream, there is an fptr reference to it in the very first div.
    • ITEM objects have the same kind of simple structure map as SIP/DIP: top level div with a div under it for each visible Bitstream.
      • If Item has primary bitstream, put it in first structMap/div/fptr.
  • mets/structMap - Structure Map to indicate object's Parent
    • Contains one div element which has the unique attribute value TYPE="AIP Parent Link" to identify it as the holder of the parent pointer.
      • It contains an mptr element whose xlink:href attribute value is the raw Handle of the parent object, e.g. 1721.1/4321. In order to restore a DSpace archive from internal AIPs in the asset store, the parent of each object must be available at the surface level of the METS document so the object can be instantiated under its correct parent before the metadata (which may also name the parent) is crosswalked.
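
To make the manifest layout above more concrete, here is a minimal sketch in Python that builds only the mets root attributes and the parent-link structMap, then reads the parent Handle back the way a restore would need to. It is an illustration, not the prototype's code. Assumptions: the "hdl:" prefix on @OBJID, @LOCTYPE="HANDLE" on the parent mptr (the text above only states it for member links), and the helper names build_manifest_skeleton and read_parent_handle are all hypothetical. @ID, metsHdr, and the metadata sections are omitted.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
AIP_PROFILE = "http://www.dspace.org/schema/aip/1.0/mets.xsd"


def build_manifest_skeleton(handle, obj_type, parent_handle, title=None):
    """Build a bare AIP manifest: root attributes plus the parent-link structMap."""
    ET.register_namespace("mets", METS_NS)
    ET.register_namespace("xlink", XLINK_NS)
    root = ET.Element(f"{{{METS_NS}}}mets", {
        "PROFILE": AIP_PROFILE,           # identifies the document as an AIP manifest
        "OBJID": f"hdl:{handle}",         # assumption: URN form is "hdl:" + Handle
        "TYPE": f"DSpace {obj_type}",     # e.g. "DSpace ITEM"
    })
    if title:
        root.set("LABEL", title)          # title only if available
    struct_map = ET.SubElement(root, f"{{{METS_NS}}}structMap")
    div = ET.SubElement(struct_map, f"{{{METS_NS}}}div", {"TYPE": "AIP Parent Link"})
    ET.SubElement(div, f"{{{METS_NS}}}mptr", {
        "LOCTYPE": "HANDLE",              # assumption: same LOCTYPE as member links
        f"{{{XLINK_NS}}}href": parent_handle,
    })
    return root


def read_parent_handle(root):
    """Return the raw parent Handle from the 'AIP Parent Link' div, or None."""
    for div in root.iter(f"{{{METS_NS}}}div"):
        if div.get("TYPE") == "AIP Parent Link":
            mptr = div.find(f"{{{METS_NS}}}mptr")
            if mptr is not None:
                return mptr.get(f"{{{XLINK_NS}}}href")
    return None


manifest = build_manifest_skeleton("1721.1/1234", "ITEM", "1721.1/4321", title="Sample Item")
print(ET.tostring(manifest, encoding="unicode"))
print(read_parent_handle(manifest))
```

Because the parent Handle sits in its own structMap, a restore can find it without crosswalking any of the metadata sections first.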

...

The following steps have been tested for a very small archive
and successfully restored the RDBMS tables from internal AIPs in the
asset store. Note that this is a coarse overview and does not
consider error-handling.

Restoration

  1. Run /dspace/bin/cleanup to clear out unused bitstreams from the asset store.
  2. Shut down your servlet container, if necessary.
  3. Remove the search indices: rm /dspace/search/*
  4. If your archive is configured to use History, save the old History by renaming its directory, and create a new, empty History directory,
    e.g. mv history history.old ; mkdir history
  5. Start with an empty database. Either:
    1. Back up the current state of the RDBMS, and destroy it with
      e.g. drop database dspace;
    2. Simply change your DSpace configuration to point to a different database instance, if you have room for another database.
  6. Create a new, empty database:
    createdb -U dspace -E UNICODE dspace
  7. Run the scripts in your install directory to initialize the DB:
    ant setup_database load_registries
  8. Back in the DSpace run directory, create an admin user:
    /dspace/bin/create-administrator
  9. Initialize the search and browse indices:
    /dspace/bin/index-all
  10. In your DSpace configuration, ensure that the AIP restoration application will run with History turned off:
    1. Set up a separate dispatcher for the AIPManager application:
      aipManager.dispatcher = restore
    2. Ensure that the restore Dispatcher does NOT call the History consumer, although it should call the search and browse consumers synchronously:
      event.dispatcher.restore.class = org.dspace.event.BasicDispatcher
      event.dispatcher.restore.consumers = search:sync, browse:sync
  11. Rebuild the Bitstream table:
    /dspace/bin/dsrun org.dspace.administer.RebuildBitstreamTable -r
  12. Rebuild the InternalAIP table:
    /dspace/bin/dsrun org.dspace.administer.AIPManager -c -a -f -v -e admin-user
  13. Restore archive from the internal AIPs:
    /dspace/bin/dsrun org.dspace.administer.AIPManager -r -a -v -e admin-user

At each stage, carefully monitor the output and the DSpace log for indications of errors. You can retry the restore of an internal AIP, or even the whole set of them, if necessary; it automatically skips any objects that already exist.

...

To create an internal AIP, just add the package parameter internal=true to the command.
The resulting "package" will be a METS manifest document, e.g.

...

To ingest an AIP and create a new object under a parent of your choice, add the ignoreParent and ignoreHandle package parameters to the command:

...

Then apply the changes to your DSpace installation directory:

NOTE: The interface of org.dspace.content.packager.PackageIngester has been changed slightly. This will break any existing package ingesters, although the ones in the DSpace core have been fixed. Look at the changes to e.g. org.dspace.content.packager.PDFPackager for an example of how to update your code. The changes are quite minimal.

  1. Unpack the new source Zip file in your install directory with unzip.
  2. For each of the "diff" files, in order, go to your install directory and apply the diff with the command:
    patch -p 0 -l < diff-file
  3. Build and install the code: ant install_code build_wars
  4. Ensure the configuration changes in config/dspace.cfg get propagated to your run-time config file.
  5. Ensure the new files in config/crosswalks are installed in your run-time directory.
  6. Apply the database change by running the SQL code in the file:
    etc/database_schema_14-15.sql
  7. Be sure to install the new WAR file(s) in your servlet container.
  8. Test by updating internal AIPs as shown above.

...

The following configuration keys apply to the AIP packager and management infrastructure.
They may also require certain crosswalk plugins to be configured,
but that is a separate issue that is addressed in the sample DSpace
configuration supplied with the system source.

  • aipManager.dispatcher
    name of the Event Dispatcher for the AIPManager application; when restoring an archive from AIPs, it is best to set this to a dispatcher that calls the search and browse consumers, but not History.
  • aip.packager
    plugin name of the Packager used to ingest and disseminate AIPs; by default it is AIP.
  • mets.dspaceAIP.ingest.crosswalk.mdSecType
    crosswalk plugin (either XML or Stream-oriented) to be called to interpret the given mdSec type. To ignore a section, set it to NULLSTREAM (for stream data) or NIL (for XML).
  • aip.disseminate.mdSecName
    Sets the type name and crosswalks associated with each metadata section under the METS amdSec: sourceMD, techMD, rightsMD, digiprovMD. Value is a comma-separated list of mdSecType:pluginName specifiers. For example:
    aip.disseminate.techMD = PREMIS
  • aip.disseminate.dmd
    Sets the crosswalks and type names of descriptive metadata sections to include; value format is the same as the admin MD sections.
  • aip.ingest.createEperson
    When value is "true", the AIP ingester will create an EPerson if needed so it can set the Submitter of a newly-created Item to the "correct" value. An EPerson created this way cannot log in. Default is false.
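
The value format of the aip.disseminate keys (a comma-separated list of mdSecType:pluginName specifiers) can be illustrated with a short parsing sketch. This is not the prototype's parser. It assumes that a bare name with no colon, as in the PREMIS example above, serves as both the section type and the plugin name. The function name parse_disseminate_value is hypothetical, and the plugin names in the second call are made up for illustration.

```python
def parse_disseminate_value(value):
    """Split an aip.disseminate.* value into (mdSecType, pluginName) pairs.

    Assumption: a specifier without ':' uses the same name for type and plugin.
    """
    specs = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate empty entries such as a trailing comma
        md_type, sep, plugin = part.partition(":")
        md_type = md_type.strip()
        plugin = plugin.strip() if sep else md_type
        specs.append((md_type, plugin))
    return specs


# The example from the text: a single bare specifier.
print(parse_disseminate_value("PREMIS"))

# Two explicit specifiers; the plugin names here are illustrative only.
print(parse_disseminate_value("DSpaceDepositLicense:LicensePlugin, CreativeCommonsRDF:CcRdfPlugin"))
```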

...

...