...

  • need to preserve fedora 3 content, history and audit trail
  • ability to leverage fedora 4 features
  • need to make data accessible and functional in the new environment
  • desire to make migration easier, faster and less error-prone

 

Proposal 1:

Ideas behind migration-utils

Develop a framework for a pluggable migration tool based on processing FOXML. Ideally this utility code could be packaged as a command-line program for unix-like operation, or used as a library within more complex tools such as a camel-based migration utility.

A FOXML-based (rather than API-based) approach has the following advantages:

  • foxml (when exported in the "archive" context, or persisted in the low level store) is a complete representation of the object
  • foxml offers a wide range of compatibility with various versions of Fedora
  • foxml migration doesn't require the fedora 3 repository software to be running
  • there is a large number of existing frameworks for efficiently processing XML
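
As a rough illustration of the points above, the sketch below reads an object's PID and datastream inventory straight from FOXML text with the Python standard library, with no repository running. The sample document is hand-written for this example and the `summarize` function is hypothetical; only the FOXML namespace and the element and attribute names come from the FOXML format itself.

```python
import xml.etree.ElementTree as ET

FOXML_NS = {"foxml": "info:fedora/fedora-system:def/foxml#"}

# Hand-written sample for illustration only; real FOXML carries far more detail.
SAMPLE_FOXML = """
<foxml:digitalObject xmlns:foxml="info:fedora/fedora-system:def/foxml#"
                     VERSION="1.1" PID="demo:1">
  <foxml:datastream ID="DC" STATE="A" CONTROL_GROUP="X" VERSIONABLE="true">
    <foxml:datastreamVersion ID="DC.0" MIMETYPE="text/xml"
                             CREATED="2013-01-01T00:00:00.000Z">
      <foxml:xmlContent><title>Example</title></foxml:xmlContent>
    </foxml:datastreamVersion>
    <foxml:datastreamVersion ID="DC.1" MIMETYPE="text/xml"
                             CREATED="2014-01-01T00:00:00.000Z">
      <foxml:xmlContent><title>Example, revised</title></foxml:xmlContent>
    </foxml:datastreamVersion>
  </foxml:datastream>
</foxml:digitalObject>
"""


def summarize(foxml_text):
    """Return the PID and, per datastream, its id, control group and version ids."""
    root = ET.fromstring(foxml_text.strip())
    datastreams = []
    for ds in root.findall("foxml:datastream", FOXML_NS):
        versions = ds.findall("foxml:datastreamVersion", FOXML_NS)
        datastreams.append({
            "dsid": ds.get("ID"),
            "control_group": ds.get("CONTROL_GROUP"),
            "versions": [v.get("ID") for v in versions],
        })
    return {"pid": root.get("PID"), "datastreams": datastreams}


if __name__ == "__main__":
    print(summarize(SAMPLE_FOXML))
```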

...

  • migration of data that's not in the repository (like configuration, global xacml policies, etc.) will require special handling
  • ability to write and use plugins (special configurations) for mapping complex metadata or fedora 3 constructs into fedora 4 must be made as easy as possible since most institutions will need to write their own or adapt existing ones

 

Pluggable

The main framework will take as its source FOXML from a fedora repository.  This may mean pointing it at the fedora data store directory, or pointing it at a running repository and fetching each record through the export API call.  What happens during the processing of each object must be highly configurable.  For the purpose of this proposal, the term "processing plugin" refers to a piece of code or an algorithm that handles one part of a fedora object.
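
The two kinds of source described above might look something like the following sketch. Both function names are hypothetical. The directory walker makes no assumption about how the store names or nests its files, and the export URL shape (`/objects/{pid}/export?context=archive`) is an assumption based on the fedora 3 REST API that should be checked against the repository version in use.

```python
import os
from urllib.parse import quote


def iter_foxml_files(store_dir):
    """Yield the path of every file under store_dir in a stable, sorted order."""
    for dirpath, dirnames, filenames in os.walk(store_dir):
        dirnames.sort()  # makes the traversal order deterministic
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def export_url(base_url, pid):
    """Build the URL for exporting one object from a running repository.

    Assumption: the repository exposes the fedora 3 REST export call.
    """
    return "{}/objects/{}/export?context=archive".format(
        base_url.rstrip("/"), quote(pid, safe=":")
    )


if __name__ == "__main__":
    print(export_url("http://localhost:8080/fedora", "demo:1"))
```

Either source would hand the framework one FOXML document at a time, so the plugins themselves need not know where the FOXML came from.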

Identifier plugins

An identifier plugin implements your institutional pid migration strategy.  This could range from something as simple as "store the PID as a DC identifier and mint a new fedora 4 id for this item" to something more complex like "escape the existing pid into a fedora 4 path".
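
The two strategies quoted above could be sketched as interchangeable functions. The function names, the returned dictionary shape and the `namespace/id` path layout are all assumptions made for this illustration, not part of any existing migration-utils API.

```python
import uuid
from urllib.parse import quote


def keep_pid_as_dc_identifier(pid, mint=lambda: uuid.uuid4().hex):
    """Mint a new fedora 4 id and record the old PID as a DC identifier."""
    return {"path": "/" + mint(), "dc_identifiers": [pid]}


def escape_pid_into_path(pid):
    """Escape 'namespace:id' into a path of the (assumed) form /namespace/id."""
    namespace, _, local = pid.partition(":")
    path = "/{}/{}".format(quote(namespace, safe=""), quote(local, safe=""))
    return {"path": path, "dc_identifiers": []}


if __name__ == "__main__":
    print(escape_pid_into_path("demo:1"))
    print(keep_pid_as_dc_identifier("demo:1"))
```

Because both functions take a PID and return the same shape, an institution could swap one for the other through configuration.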

Datastream plugins

I envision lots of datastream plugins whose applicability is based on characteristics of the datastream such as control group, mime type, dsid or even the content model that defines it.  Precedence for such rules should be simple and well-defined.

You should be able to express strategies like some of the following examples:

  • the DC datastream should be translated into RDF assertions
  • datastreams called descMetadata should be translated to RDF assertions using a given template
  • the content datastream on objects with the cmodel:images content model should be handled X
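
One simple and well-defined precedence scheme is "first matching rule wins, in configuration order". The sketch below applies that scheme to the example strategies above. The `Datastream` and `Rule` classes, the rule names and the handler labels are hypothetical stand-ins, and first-match ordering is only one possible choice of precedence.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Datastream:
    dsid: str
    mime_type: str
    control_group: str
    content_models: tuple = ()


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[Datastream], bool]
    handler: str  # a label standing in for real handler code


# Order expresses precedence: earlier rules win over later ones.
RULES = [
    Rule("dc-to-rdf", lambda ds: ds.dsid == "DC",
         "translate to RDF assertions"),
    Rule("descMetadata-template", lambda ds: ds.dsid == "descMetadata",
         "translate to RDF assertions using a template"),
    Rule("image-content",
         lambda ds: ds.dsid == "content" and "cmodel:images" in ds.content_models,
         "image-specific handling"),
    Rule("default", lambda ds: True, "copy datastream unchanged"),
]


def select_rule(ds, rules=RULES):
    """Return the first rule whose predicate accepts the datastream, else None."""
    for rule in rules:
        if rule.matches(ds):
            return rule
    return None


if __name__ == "__main__":
    ds = Datastream("content", "image/tiff", "M", ("cmodel:images",))
    print(select_rule(ds).name)
```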

Access Control plugins

  • POLICY datastreams should be mapped as follows

Reporting plugins

Migration Model

There are several ways to expose or abstract fedora 3 objects to a migration utility and it may be that some are better suited for the types of migration necessary.  For example, sequentially accessing content from within a FOXML file may prove to be the least memory-intensive and fastest way to process content, but if information in a datastream (like RELS-EXT) is needed to inform decisions about how to handle earlier content, such benefits may be negated in common use cases.  Furthermore, beyond simply processing objects sequentially, it may prove beneficial to process objects in a certain order, or to provide random-access to other resources within the fedora 3 repository being migrated.
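
The ordering concern can be seen in a small streaming sketch built on Python's `xml.etree.ElementTree.iterparse`. The function name and the sample document are hypothetical. Each datastream version is handed to the caller as soon as it has been read and is then released, which keeps memory use low, but it also means that RELS-EXT is seen only after any datastream serialized before it.

```python
import io
import xml.etree.ElementTree as ET

FOXML = "{info:fedora/fedora-system:def/foxml#}"

# Hand-written sample for illustration only.
SAMPLE = b"""
<foxml:digitalObject xmlns:foxml="info:fedora/fedora-system:def/foxml#" PID="demo:1">
  <foxml:datastream ID="DC" CONTROL_GROUP="X">
    <foxml:datastreamVersion ID="DC.0" MIMETYPE="text/xml"/>
  </foxml:datastream>
  <foxml:datastream ID="RELS-EXT" CONTROL_GROUP="X">
    <foxml:datastreamVersion ID="RELS-EXT.0" MIMETYPE="application/rdf+xml"/>
  </foxml:datastream>
</foxml:digitalObject>
""".strip()


def stream_datastream_versions(source):
    """Yield (dsid, version_id) pairs in serialization order."""
    current_dsid = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == FOXML + "datastream":
            current_dsid = elem.get("ID")
        elif event == "end" and elem.tag == FOXML + "datastreamVersion":
            yield current_dsid, elem.get("ID")
            elem.clear()  # release the content we have already handled


if __name__ == "__main__":
    for pair in stream_datastream_versions(io.BytesIO(SAMPLE)):
        print(pair)
```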

| model | description | considerations | status |
|---|---|---|---|
| sequential access | Streaming the FOXML in the order present on disk to the routines or code that does the migration. | This model minimizes memory usage and likely maximizes processing speed.  It can also more easily force higher-level routines to acknowledge all content, preventing inadvertent lossyness.  Complex higher-level routines would have to accept the data in the serialization order and couldn't easily alter processing based on content serialized deeper in the foxml datastream. | A proof-of-concept implementation has been written. |
| whole object access | Expose access to the whole object in some form.  Pieces (like datastream content) would likely have to be accessed as needed from disk. | This model allows for more flexibility in higher level migration routines as content could be accessed in whatever order at the expense of a higher memory footprint and less coercion to consider every element within the original fedora 3 object. | |
| versioned object abstraction | Fedora 3 allows versioning of datastreams but not object properties, though it does maintain a simple audit record of those object property updates.  In migration scenarios where metadata from multiple sources (different datastreams or object properties) are to be represented in the Fedora 4 properties, it may make sense to have an abstraction representing a version of the fedora 3 object (and all its datastreams) in time. | This model would take care of a huge amount of the heavy lifting associated with version migration and the likely case where the DC datastream, RELS-EXT datastream and object properties would all find themselves represented as fedora 4 properties. | |
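
To make the versioned object abstraction more concrete, the sketch below derives one snapshot of the whole object for every moment at which any datastream changed. The function name and the input shape (a mapping from dsid to a list of creation timestamp and content pairs) are assumptions for illustration. It relies on all timestamps sharing one ISO 8601 UTC format so that string order matches time order, and it leaves out object properties and the audit trail.

```python
def object_versions(datastreams):
    """Return [(timestamp, {dsid: content})], oldest first.

    datastreams maps each dsid to a list of (created, content) pairs, where
    created is an ISO 8601 UTC string in one consistent format.
    """
    timestamps = sorted(
        {created for versions in datastreams.values() for created, _ in versions}
    )
    snapshots = []
    for ts in timestamps:
        state = {}
        for dsid, versions in datastreams.items():
            # Latest version of this datastream created at or before ts.
            visible = [v for v in sorted(versions, key=lambda v: v[0]) if v[0] <= ts]
            if visible:
                state[dsid] = visible[-1][1]
        snapshots.append((ts, state))
    return snapshots


if __name__ == "__main__":
    history = {
        "DC": [("2013-01-01T00:00:00.000Z", "dc v1"),
               ("2014-06-01T00:00:00.000Z", "dc v2")],
        "RELS-EXT": [("2013-09-01T00:00:00.000Z", "rels v1")],
    }
    for ts, state in object_versions(history):
        print(ts, state)
```

Each snapshot could then be written as one fedora 4 version whose properties draw on DC and RELS-EXT together.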

 

 

In addition to processing plugins, the framework allows for no-op plugins that, instead of migrating data, perform some sort of check, gather statistics or otherwise generate a report that may be useful in developing and planning a migration.
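
A no-op plugin of this kind might look like the sketch below, which counts mime types across objects without migrating anything. The class name and its `process_object` and `report` methods are hypothetical; the framework's real plugin interface is not defined in this proposal.

```python
from collections import Counter


class MimeTypeReport:
    """Observes datastreams and tallies mime types; migrates nothing."""

    def __init__(self):
        self.objects_seen = 0
        self.counts = Counter()

    def process_object(self, pid, datastreams):
        """datastreams is an iterable of (dsid, mime_type) pairs."""
        self.objects_seen += 1
        for _dsid, mime_type in datastreams:
            self.counts[mime_type] += 1

    def report(self):
        lines = ["objects seen: {}".format(self.objects_seen)]
        for mime_type, count in sorted(self.counts.items()):
            lines.append("{}: {}".format(mime_type, count))
        return "\n".join(lines)


if __name__ == "__main__":
    plugin = MimeTypeReport()
    plugin.process_object("demo:1", [("DC", "text/xml"), ("content", "image/tiff")])
    plugin.process_object("demo:2", [("DC", "text/xml")])
    print(plugin.report())
```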

Implementation

GitHub