At the start of all workflows in Chronopolis, we have tools to package data into a file layout. Typically this is handled by one of our Intake services, but occasionally it is done in an ad hoc fashion when data needs to be ingested manually. This action is taken once so that the file layout can be distributed throughout the Chronopolis network. Currently we supply only our own packaging library, which targets the BagIt format.

For an OCFL implementation, we need to decide whether to create a wrapper around an existing project or to create our own library.

Requirements

In order to operate nominally, we would need to make sure any library we use can handle the following cases:

  • Creating an initial OCFL Object
  • Updating with a partial OCFL Object
  • Purging of data from an OCFL Object
  • Validating an OCFL Object, a specific version of an OCFL Object, and the metadata for an OCFL Object (e.g. making sure the inventory.json contains all files, the sidecar has the correct checksum, etc)
  • Handling additional metadata (ACE Tokens, application logs)
  • Understanding different addressing schemes
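As a sketch of the metadata validation case above, checking that the inventory sidecar carries the correct checksum might look like the following. This assumes a sha512 sidecar and the spec's `<digest> inventory.json` sidecar format; the function name is ours:

```python
import hashlib
from pathlib import Path

def validate_sidecar(object_root: Path, alg: str = "sha512") -> bool:
    """Check that the sidecar digest matches the inventory.json it covers."""
    inventory = object_root / "inventory.json"
    sidecar = object_root / f"inventory.json.{alg}"
    # Sidecar content is "<digest> inventory.json\n"
    expected = sidecar.read_text().split()[0]
    actual = hashlib.new(alg, inventory.read_bytes()).hexdigest()
    return expected == actual
```

Full validation would layer similar checks for every file listed in the manifest and state blocks.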

OCFL Storage Roots (Decision)

For Chronopolis, we have a decision to make about where the Storage Roots will exist. It is important to keep in mind that a Storage Root can contain zero or more OCFL Objects.

Preservation Root

If the Storage Root were to exist at the same level as our preservation root, we could use the Depositor root as a namespace for OCFL Objects. However, it would require that the entire preservation filesystem be updated if we move to a new version of the OCFL specification. If a Preservation Root is used as the OCFL Storage Root, the Replication service would likely require additional updates in order to check that the OCFL Storage Root exists or to create it.

Depositor Root

Another option is to place a Storage Root at the root of each Depositor's directory. This likely does not make sense because files at this level are not tracked by the ACE Audit, and preservation files at this level are not transferred by the replication process. Therefore, a separate process would need to be created in order to create the Storage Root for a Depositor. A similar issue to the Preservation Root exists here as well: all OCFL Objects under a Depositor Root must conform to the OCFL version specified in that root.

Package Root

Finally, we can place the Storage Root at the package level. While this creates some extra data which needs to be transferred, it provides the opportunity to create the Storage Root at Intake and distribute it through the network. It also provides the opportunity to explore multi-part OCFL packages by creating multiple OCFL Objects under a single Storage Root for a given deposit. In addition, it allows a more staggered approach to OCFL versions in the Chronopolis repository, as new versions of OCFL layouts could be propagated as the work to support them is done.

OTM Considerations

In the OTM specification, a versioning scheme allows a Repository to pass data to a DDP such that it can have multiple versions. In the Deposit DDP Workflow, a version-id is specified as an extra JSON field in order to store this information. To be compatible with the OTM specification, we will need to ensure that we can read and write arbitrary JSON fields in the inventory.json of an OCFL Object.
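Round-tripping arbitrary fields is straightforward if the inventory is handled as a generic mapping rather than a fixed schema. A minimal sketch, assuming the extra field is a simple top-level key (the field name used below is illustrative):

```python
import json
from pathlib import Path

def set_extra_field(inventory_path: Path, key: str, value) -> None:
    """Write an arbitrary top-level field into an OCFL inventory,
    preserving all existing fields, known and unknown."""
    data = json.loads(inventory_path.read_text())
    data[key] = value
    inventory_path.write_text(json.dumps(data, indent=2))
```

The important design point is that unknown keys survive the read-modify-write cycle, so OTM metadata is not dropped when the packager rewrites the inventory.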

The OTM specification also includes a workflow for deleting data from a DDP. Depending on how the Chronopolis team plans on implementing this workflow for the DDP, we may want to include some knowledge about this in the packager.

Homegrown

If we opt to create our own OCFL packaging library, there are many things which we need to be aware of and make decisions on. Some of these are not necessarily technical decisions, but things which we can drive through our own idea of what constitutes best practices for OCFL in Chronopolis.

We should keep the implementation notes from the OCFL team in mind so that details such as version numbering stay consistent with what the specification expects.

OCFL Storage Root

The first piece our OCFL library would need is support for creating and listing OCFL Storage Roots. This is a fairly basic operation: a Storage Root includes the namaste file and some optional files describing the OCFL spec. If the Storage Root is going to be part of the preservation object, then these files would need to be generated and a directory created for each OCFL Object.
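Creating a Storage Root could be as small as the following sketch, assuming OCFL 1.0, whose conformance declaration is a namaste file named `0=ocfl_1.0` containing `ocfl_1.0\n`:

```python
from pathlib import Path

def create_storage_root(root: Path, version: str = "1.0") -> None:
    """Lay down the namaste conformance file for a new OCFL Storage Root."""
    root.mkdir(parents=True, exist_ok=True)
    namaste = root / f"0=ocfl_{version}"
    namaste.write_text(f"ocfl_{version}\n")
    # Optional extras could be added here: a copy of the spec text and a
    # layout description for how Objects are arranged under this root.
```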

OCFL Objects

The main focus of the library should revolve around the OCFL Object. The Object level is better defined with respect to what it should be in Chronopolis, so there is less discussion around what decisions we need to make. For this we need the basics: creation, update, and validation.

It might be good to consider an OCFL Object to be open or closed and work on it as needed. For example, if an OCFL Object is open:

  • files can be added to or modified in the content directory of the current version
  • metadata such as tokens or log files can be added
  • the inventory can be updated

Once an OCFL Object is closed, it should be locked, and the process performing the finalization should write the new inventory sidecar. No operations on a closed OCFL Object should succeed, and the closed state might best be defined by the presence of the sidecar.
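Treating the sidecar as the "closed" marker could look like this sketch, with a guard run before any mutating operation (function names are ours):

```python
from pathlib import Path

def is_closed(object_root: Path, alg: str = "sha512") -> bool:
    """An object is considered closed once the root inventory sidecar exists."""
    return (object_root / f"inventory.json.{alg}").exists()

def ensure_open(object_root: Path) -> None:
    """Guard to call before any mutating operation on the object."""
    if is_closed(object_root):
        raise PermissionError(f"OCFL Object at {object_root} is closed")
```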

Inventory As A Database

As the inventory for an OCFL Object is a JSON file, we will need to be mindful of how that impacts processing. If consuming the JSON file has too much overhead, it could be persisted to a SQLite database temporarily while the OCFL Object is being worked on. This would allow the library to query whether files exist in an OCFL Object and which digest it expects to find, as well as to easily insert new versions and files.

When an OCFL Object is ready to be finalized, this database could be written out as the inventory.json.
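A sketch of the temporary database shape, with one table mirroring the manifest and one mirroring per-version state (the schema is ours, not anything mandated by the spec):

```python
import sqlite3

SCHEMA = """
CREATE TABLE manifest (digest TEXT PRIMARY KEY, content_path TEXT NOT NULL);
CREATE TABLE state (version TEXT, digest TEXT, logical_path TEXT);
"""

def open_inventory_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a scratch database to hold the inventory while an Object is open."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def has_digest(conn: sqlite3.Connection, digest: str) -> bool:
    """Deduplication check: is this content already in the manifest?"""
    row = conn.execute(
        "SELECT 1 FROM manifest WHERE digest = ?", (digest,)
    ).fetchone()
    return row is not None
```

Finalization would walk both tables and serialize them back out as inventory.json.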

Logs Directory

In order to store information about actions taken on an OCFL Object, a logs directory is specified by the OCFL specification. The information we may wish to store within this directory includes:

  • ACE Tokens
  • Pertinent Event Log Information
    • Checksum validation
    • File copying/moving
    • Errors

As these are actions taken in specific versions, we will want to use the created timestamp for a version in order to keep data in the logs directory organized. In addition, each file written to the logs directory should have a sidecar which can be used for validation as the files themselves will not be subject to any validation processes within the OCFL specification.
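Since OCFL validation does not cover the logs directory, each log entry needs its own sidecar. Writing one might look like this sketch (directory and function names are ours):

```python
import hashlib
from pathlib import Path

def write_log(logs_dir: Path, name: str, body: str, alg: str = "sha512") -> None:
    """Write a log file plus a digest sidecar, since the OCFL validation
    processes do not cover the logs directory itself."""
    logs_dir.mkdir(parents=True, exist_ok=True)
    log = logs_dir / name
    log.write_text(body)
    digest = hashlib.new(alg, log.read_bytes()).hexdigest()
    (logs_dir / f"{name}.{alg}").write_text(f"{digest} {name}\n")
```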

Version Creation

When creating a new version of an OCFL Object, we will want to initialize the OCFL Object from the root inventory.json. As payload and version files will never be overwritten, we do not need to include them when a new version is ready to be constructed.

The workflow for a new version should look similar to:

  • Increment the version
    • Bump the head
    • Create a new version stanza
    • Create a new version directory
    • Create a new log-version directory
      • e.g. logs/version-created
  • For each file:
    • Query the manifest for the checksum of the file
      • If the checksum does not exist
        • copy the file to the version content directory
        • add the digest and filename to the manifest
          • note: the filename should be the path from the OCFL Object root, i.e. v1/content/my-file.txt
        • mark for ACE Token creation
    • Create an entry in the state block for the version file
  • Request all marked files for ACE Tokens
    • If embedding tokens: write all ACE Tokens to logs/version-created/ace-token-store with a related sidecar for validation
    • Register ACE Tokens with Ingest Server
  • Finalize any logging
  • Finalize OCFL Object by writing inventory.json and inventory.json.{alg}
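The deduplication and state-building portion of the workflow above might be sketched as follows. ACE Token registration, logging, and sidecar finalization are elided, and all helper names are ours:

```python
import hashlib
import json
import shutil
from pathlib import Path

def next_version(head: str) -> str:
    """Bump vN to vN+1, keeping version numbering consistent with the spec."""
    return f"v{int(head.lstrip('v')) + 1}"

def add_version(object_root: Path, inventory: dict, files: dict) -> list:
    """files maps logical paths to source files on disk.
    Returns the digests marked for ACE Token creation."""
    version = next_version(inventory["head"])
    content = object_root / version / "content"
    content.mkdir(parents=True)
    state, marked = {}, []
    for logical_path, src in files.items():
        digest = hashlib.sha512(src.read_bytes()).hexdigest()
        if digest not in inventory["manifest"]:
            dest = content / logical_path
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)
            # Manifest paths are relative to the OCFL Object root.
            inventory["manifest"][digest] = [f"{version}/content/{logical_path}"]
            marked.append(digest)
        state.setdefault(digest, []).append(logical_path)
    inventory["head"] = version
    inventory["versions"][version] = {"state": state}
    return marked
```

Files whose content already appears in the manifest are never copied again; only their logical path is recorded in the new version's state.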

Addressing Files

The packager itself should not be concerned with creating forward deltas, and as such should not have to understand the semantics of different content addressing schemes. The addressing the packager needs to understand is therefore limited to the following:

  • the manifest contains the content path of the file
    • this will be used for deduplication, so only one entry will exist
  • the state contains the logical path of the file
    • this will be used for understanding what the OCFL Object looks like in the event the data needs to be returned to a Depositor
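For illustration, a hypothetical inventory fragment showing both path types (digest shortened for readability):

```python
# The manifest holds one content path per stored file, while each
# version's state maps that digest to its logical path(s).
inventory = {
    "manifest": {
        "abc123": ["v1/content/my-file.txt"],  # content path; one entry only
    },
    "versions": {
        "v1": {"state": {"abc123": ["my-file.txt"]}},
        "v2": {"state": {"abc123": ["renamed.txt"]}},  # same bytes, new name
    },
}
```

Returning data to a Depositor would walk a version's state and resolve each digest through the manifest to find the bytes on disk.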

The name of an OCFL Object can continue to follow Chronopolis's current naming standards.

Fixity

The OCFL Specification also includes a fixity block which can be used to store the checksums for each file organized by the checksum algorithm. This allows additional verification to be run on each file. For Chronopolis, we will be using the checksum in the manifest block, so this information would become redundant. That does not mean there is no value in it, but it does not have as much value to the needs of the Chronopolis network.

Depending on if the checksum algorithms are aligned between Chronopolis and the service bringing data in (in this case, the OTM Bridge), we may want to store additional checksums here. For example, if the OTM Bridge returns md5 checksums but we wish to use sha512, we can persist the md5 checksums in the fixity block then use sha512 for the manifest and other needs.
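Recording the Bridge's digests alongside our own could be as simple as the following sketch, which follows the fixity block's algorithm-to-digest-to-paths layout without touching the sha512 manifest (the function name is ours):

```python
def record_fixity(inventory: dict, alg: str, digest: str, content_path: str) -> None:
    """Store an additional checksum (e.g. md5 from the OTM Bridge) in the
    fixity block, leaving the manifest's sha512 entries untouched."""
    by_alg = inventory.setdefault("fixity", {}).setdefault(alg, {})
    by_alg.setdefault(digest, []).append(content_path)
```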

Version Deletion/Purging

Discussion needs to occur within the Chronopolis team in order to determine what exactly the workflow should be allowed to do. This outline should be taken as a preliminary look into how deletion can be handled.

As part of the OTM specification, we may also need to handle purging of files from an OCFL Object. For this workflow, we should follow the recommended steps outlined in the implementation notes.

This would be part of a larger workflow, and the steps handled by the packager would be centered around creating a new OCFL Object and inserting versions for each existing version of the object. If a stub object is to be used in place of what was purged, this would be included in order to overwrite any existing data at a Chronopolis node.

External Library

Currently there are a few libraries attempting to implement the OCFL specification.

Many of the decisions we make on applying and creating OCFL Storage Roots and Objects would apply to any existing library as well. If we choose to go this route, we would need to make sure all operations are supported and that we can handle errors gracefully in the event of failure. In the case of the Go library, executing external processes from Java does not always make for the prettiest workflow, but it can be managed. The Java library would integrate easily with our existing codebase and could be forked or modified to suit our workflow needs.

Evaluation

Each of the two libraries should go through an evaluation phase in order to determine:

  • if our workflow needs are met and if not how they can be resolved
  • ease of integration with our codebase
    • including additional development we may wish to consider, e.g. creating an API which exposes the library for Chronopolis workflows
  • ease of contribution to the library

Integration

Once a library is chosen for use, incorporating it into the codebase is the next step. How it gets exposed to other Chronopolis services will be up to the developers implementing the changes, as it can be done in multiple ways.
