In general, responsibility for validating the integrity of files is split between Replication and ACE-AM as follows:

  • Replication: top level metadata (inventory.json, inventory.json.{alg})
  • ACE-AM: payload and metadata

Replication

Because Replication interacts minimally with the data distributed through the network, it should not require as many updates as other services. Depending on the level of validation we wish to perform on data being distributed, we have the opportunity to verify the format of the file layout as well as the data being transferred.

If the correctness of the OCFL Object is to be validated, the Replication service will need to be updated to parse the Object.

Post-Transfer Processes

After a transfer completes, we will need to know additional information about how a collection is packaged in order to validate and complete the replication.

Validation

To validate that an OCFL Object is correct, we will want to make sure that the inventory.json is well formed and that all the entries in the manifest and state blocks correspond to existing files on disk. We would also want to perform the same check for any new version being distributed, using the inventory located under the respective version directory. How to handle orphaned files still needs discussion, but they will be reported by the Audit Manager.
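
The checks described above could be sketched roughly as follows. This is a minimal illustration, not the Replication service's implementation: the function name validate_inventory is hypothetical, and it assumes only the standard OCFL inventory keys (manifest, versions, state).

```python
import json
from pathlib import Path


def validate_inventory(object_root: Path) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        text = (object_root / "inventory.json").read_text(encoding="utf-8")
        inventory = json.loads(text)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"inventory.json is missing or not well formed: {exc}"]
    if not isinstance(inventory, dict):
        return ["inventory.json is not a JSON object"]

    problems = []
    manifest = inventory.get("manifest", {})
    versions = inventory.get("versions", {})

    # Every content path listed in the manifest must exist on disk.
    for paths in manifest.values():
        for content_path in paths:
            if not (object_root / content_path).is_file():
                problems.append(f"manifest entry missing on disk: {content_path}")

    # Every digest referenced by a version's state must appear in the manifest.
    for version, block in versions.items():
        for digest in block.get("state", {}):
            if digest not in manifest:
                problems.append(f"{version}: state digest not in manifest: {digest}")

    return problems
```

The sketch does not look for orphaned files; detecting those would mean walking the directory tree and comparing it against the manifest.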

Hashing

During replication, files are hashed in order to validate successful transfer of content with the Ingest Server. Previously, with BagIt, the tagmanifest-sha256.txt served as the single file from which all hashing operations could be performed.

With the inclusion of OCFL, the inventory.json.{alg} sidecar will take this spot. From this file we can validate the checksum of the inventory.json, which in turn allows us to validate the checksums of all other files in the OCFL Object. Version directories also contain an inventory.json and sidecar; per the OCFL specification, the copies in the most recent version directory are expected to match the root inventory.json and sidecar, while those in earlier version directories describe the Object as of that version. Running a checksum on the most recent copies to ensure that they match the root may be an option, but discussion on extra validation should occur during implementation.
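
As a rough sketch of the first link in that chain, the sidecar check might look like the following. The function name is hypothetical, sha512 is only an assumed default for {alg}, and the sidecar is assumed to hold the hex digest followed by whitespace and the file name, as the OCFL specification describes.

```python
import hashlib
from pathlib import Path


def verify_inventory_sidecar(object_root: Path, alg: str = "sha512") -> bool:
    """Compare the digest in inventory.json.{alg} with the digest of inventory.json."""
    sidecar = object_root / f"inventory.json.{alg}"
    parts = sidecar.read_text(encoding="utf-8").split()
    if not parts:
        return False  # an empty sidecar cannot validate anything
    expected = parts[0].lower()
    # hashlib.new accepts algorithm names such as "sha256" or "sha512".
    actual = hashlib.new(alg, (object_root / "inventory.json").read_bytes()).hexdigest()
    return actual == expected
```

Once the inventory.json is trusted, the digests in its manifest can be used to check every other file in the same way.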

Implementation Notes

We will likely want to store information about the file used for validation in the FileLayout. As we do not expect any variation in the location of these files, except for differences between the file layouts themselves, we should have a single source of truth describing our expectations.
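
One way to picture such a single source of truth is an enumeration that carries the validation file for each layout. The sketch below is purely illustrative: LayoutInfo, the validation_file method, and the sha512 default are assumptions, and the real FileLayout may be shaped quite differently.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class LayoutInfo:
    # Name of the file from which all hashing operations start.
    validation_file: str


class FileLayout(Enum):
    BAGIT = LayoutInfo("tagmanifest-sha256.txt")
    OCFL = LayoutInfo("inventory.json.{alg}")

    def validation_file(self, alg: str = "sha512") -> str:
        """Return the validation file name, filling in {alg} where the layout uses it."""
        return self.value.validation_file.format(alg=alg)
```

With this shape, FileLayout.OCFL.validation_file("sha256") yields inventory.json.sha256, while the BagIt entry ignores the algorithm argument because its file name is fixed.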

ACE Token Upload

If the ACE Token Stores are transferred as part of the preservation object instead of as a separate file, the location used when retrieving them will need to be updated. When uploading files, the ByteStream of a file is retrieved from the Bucket class by passing the relative path of the TokenStore. It is implemented this way so that if we need to use a non-POSIX storage layer, we can hopefully adapt without changing too much code.

In order to support ACE Tokens stored within a package, we would only need to update the path passed to the Bucket. This might be information we store internally about where an ACE Token Store lives, depending on which type of file layout (OCFL, BagIt) is in use. Similar to the information used with respect to the inventory.json, this could live in the FileLayout so that we know where the ACE Token Store is located. Additionally, the head key would need to be retrieved from the inventory.json in order to know which version of the OCFL Object we are retrieving the ACE Token Store for.
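
A minimal sketch of resolving that path from the head key is shown below. The function name and the default template are hypothetical: where the ACE Token Store would actually live inside an OCFL Object has not been decided, so the template is a placeholder only.

```python
import json


def token_store_path(inventory_text: str,
                     template: str = "{head}/content/ace-token-store") -> str:
    """Build the relative path of the ACE Token Store for the head version.

    The default template is a placeholder, not an agreed location.
    """
    head = json.loads(inventory_text)["head"]  # e.g. "v3"; KeyError if absent
    return template.format(head=head)
```

The returned relative path is what would be handed to the Bucket in place of the path used today.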

In the event ACE Token Stores are kept separate from the preservation object, no additional updates would need to be made to this part of the workflow.

ACE Checksum Updates

The checksums for the inventory.json and inventory.json.{alg} will need to be updated in ACE, as both files will be modified during transfer. The mechanisms to handle this in the ACE-AM are outlined below.

ACE Audit Manager

The ACE Audit Manager provides integrity checking on the files within a preservation object by validating the ACE Tokens registered by the Replication service. Each ACE Token requires the checksum for a file to match what was used on the ingestion of the data. When versioned preservation objects are transferred and existing files change, the ACE AM needs functionality to accommodate these modifications; otherwise it will log that the file is corrupt until its checksum matches what it has stored. More can be read about ACE Tokens and the process for validating ACE Tokens in the ACE documentation.

Within the ACE Audit Manager, we will be concerned with the following classes:

  • Collection - contains information about the location of a group of files to be monitored
  • MonitoredItem - contains information about a file being monitored within a Collection (checksum, path, size, etc)
  • AceToken - the ACE Token for a MonitoredItem
  • LogEvent - audit log events which occur on a Collection, MonitoredItem, or AceToken and are held in the database

Updating Files

When transferring a versioned OCFL Object, it is inevitable that the inventory.json and inventory.json.{alg} files will need to be updated. In order to accommodate this, additional functionality will need to be added to the ACE AM application.

Checksum

The ACE AM will need a new API that allows the checksum stored for a given file to be updated. Each update should also be recorded in the ACE AM event log as a FILE_CHECKSUM_UPDATE event showing both the old and the new checksum. Updating the checksum for a file may also need to update the state of the MonitoredItem to MODIFIED.
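
The behaviour such an API would need can be sketched with a small in-memory model. The class fields, the "ACTIVE" starting state, and the update_checksum function are illustrative assumptions and do not mirror the actual ACE AM schema; only the FILE_CHECKSUM_UPDATE event name and the MODIFIED state come from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class LogEvent:
    event_type: str
    path: str
    description: str
    timestamp: datetime


@dataclass
class MonitoredItem:
    path: str
    checksum: str
    state: str = "ACTIVE"  # placeholder state name


def update_checksum(item: MonitoredItem, new_checksum: str,
                    log: list) -> Optional[LogEvent]:
    """Replace the stored checksum and record the old and new values."""
    old = item.checksum
    if old == new_checksum:
        return None  # nothing changed, so nothing is logged
    item.checksum = new_checksum
    item.state = "MODIFIED"
    event = LogEvent("FILE_CHECKSUM_UPDATE", item.path,
                     f"old={old} new={new_checksum}",
                     datetime.now(timezone.utc))
    log.append(event)
    return event
```

Keeping the old checksum in the event preserves an audit trail, so a later audit can tell an intentional update from corruption.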

ACE Token

An API already exists for adding ACE Tokens, and it will allow ACE Tokens to be updated as well. We should modify this process so that updating a token sets the state of the MonitoredItem to something other than INVALID. Using a MODIFIED state may be acceptable, as with the checksum update.

Versioning for MonitoredItems

An optional alternative would be to give the ACE Audit Manager some knowledge of versions in its database. This would require a migration in the form of an ace-{version}.sql file, with fields similar to those in the migrations outlined in the Ingest section. The versioned information which ACE AM would need to track is the file’s digest, size, and token. Additional events, such as FILE_VERSION_CREATE, should be generated and persisted as LogEvents.

This would also require a new API for creating new versions of MonitoredItems, as well as the ability to query a Collection for the latest versions of its MonitoredItems.
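
To make the shape of that versioned data concrete, here is a small in-memory sketch. ItemVersion, VersionedItem, and latest_versions are hypothetical names, the events are reduced to plain tuples, and a real implementation would live in the ACE AM database rather than in Python objects.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ItemVersion:
    number: int
    digest: str
    size: int
    token: str


@dataclass
class VersionedItem:
    path: str
    versions: list = field(default_factory=list)

    def create_version(self, digest: str, size: int, token: str,
                       events: list) -> ItemVersion:
        """Append a new version and record a FILE_VERSION_CREATE event."""
        version = ItemVersion(len(self.versions) + 1, digest, size, token)
        self.versions.append(version)
        events.append(("FILE_VERSION_CREATE", self.path, version.number))
        return version

    def latest(self) -> Optional[ItemVersion]:
        return self.versions[-1] if self.versions else None


def latest_versions(collection: list) -> dict:
    """Map each item's path to its most recent version."""
    return {item.path: item.latest() for item in collection}
```

Earlier versions stay in the list, so an audit can still validate a file against the digest and token it had at any point in its history.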
