In general, when validating the integrity of files the responsibility is split between the Replication and ACE-AM as follows:

  • Replication: top level metadata (inventory.json, inventory.json.{alg})
  • ACE-AM: payload and metadata

todo:

  • Focus on OCFL
  • Revise updated workflows
  • Review ACE-AM Updates + trim if unnecessary
  • s/package_layout/file_layout

Replication

As Replication interacts minimally with the data distributed through the network, it should not require as many updates as other services. We should be aware of which files should be validated for a given packaging format, and run checks accordingly.

If a process will be performed to validate the correctness of the OCFL Object, the Replication service will need to be updated in order to handle parsing of the Object.
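As a sketch of what parsing the Object might involve, the snippet below loads an inventory.json and checks the top-level keys the OCFL specification requires. The field names come from the OCFL spec; the validation shown is illustrative, not exhaustive.

```python
import json

# Top-level keys an OCFL inventory.json must contain per the spec.
REQUIRED_KEYS = {"id", "type", "digestAlgorithm", "head", "manifest", "versions"}

def parse_inventory(text):
    """Parse an inventory.json and perform minimal structural checks."""
    inventory = json.loads(text)
    missing = REQUIRED_KEYS - inventory.keys()
    if missing:
        raise ValueError(f"inventory.json missing keys: {sorted(missing)}")
    # The head must name a version that actually exists in the versions block
    if inventory["head"] not in inventory["versions"]:
        raise ValueError(f"head {inventory['head']!r} not present in versions")
    return inventory
```

A full implementation would also validate the version-number sequence and the digest algorithm, but even this minimal check lets Replication fail fast on a malformed Object.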

Post-Transfer Processes

After a transfer completes, we will need to know additional information about how a collection is packaged in order to validate and complete the replication.

Validation

In order to validate that an OCFL Object is correct, we will want to make sure that the inventory.json is well formed and that all the entries in the state blocks correspond to existing files on disk. In addition, we would want to perform this action for any new version being distributed under the respective version directory. Orphaned files should have some discussion, but will be reported by the Audit Manager.
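The state-block check described above could look roughly like the following: for a given version, resolve each digest in the state through the manifest to its content paths and confirm those files exist on disk. This is a sketch only; a real implementation would also verify the digests themselves.

```python
from pathlib import Path

def validate_version_state(object_root, inventory, version="head"):
    """Check that every file referenced by a version's state block exists
    on disk. Paths are resolved through the manifest, which maps digests
    to content paths relative to the OCFL Object root."""
    if version == "head":
        version = inventory["head"]
    manifest = inventory["manifest"]
    state = inventory["versions"][version]["state"]
    missing = []
    for digest in state:  # state maps digest -> list of logical paths
        for content_path in manifest.get(digest, []):
            if not (Path(object_root) / content_path).is_file():
                missing.append(content_path)
    return missing  # an empty list means the version validated
```

Running this against the head version (or any newly distributed version) gives Replication a cheap existence check before the more expensive checksum work.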

Hashing

During replication, certain files are hashed in order to validate successful transfer of content with the Ingest Server. Previously with BagIt, the tagmanifest-sha256.txt was used in order to have a single file from which all hashing operations could be performed. In addition, we hash the ACE Token Store to ensure that it transferred successfully.

With the inclusion of OCFL, the inventory.json.{alg} sidecar will take this spot. From this file we can validate the checksum of the inventory.json which in turn will allow us to validate the checksums of all other files in the OCFL Object. There is an additional inventory.json and sidecar in each version directory which are expected to match the root inventory.json and sidecar. Running a checksum on these to ensure that they match the root may be an option, but discussion on extra validation should occur during implementation.
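Verifying the sidecar is a small operation: the inventory.json.{alg} file contains a single line of the form `<digest> inventory.json`, so we compare its digest against one computed over the inventory. A minimal sketch, assuming sha512 as the default algorithm:

```python
import hashlib
from pathlib import Path

def verify_sidecar(object_root, alg="sha512"):
    """Verify inventory.json against its inventory.json.{alg} sidecar.
    The sidecar contains '<digest> inventory.json', so the first token
    is the expected digest. Error handling here is minimal."""
    root = Path(object_root)
    expected = (root / f"inventory.json.{alg}").read_text().split()[0]
    actual = hashlib.new(alg, (root / "inventory.json").read_bytes()).hexdigest()
    if actual != expected:
        raise ValueError(f"inventory.json does not match its {alg} sidecar")
    return actual
```

The same routine could be run against each version directory's inventory.json and sidecar to confirm they match the root copies, if that extra validation is adopted during implementation.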

Implementation Notes

We will likely want to store information about the file used for validation in the FileLayout. As we do not expect any variation in the location of these files except for differences between file layouts themselves, we should have a single source of truth describing our expectations.
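The single-source-of-truth idea might look like the following enum. The member and field names here are hypothetical (and the actual service would presumably express this as a Java enum); only the two validation file names come from the text above.

```python
from enum import Enum

class FileLayout(Enum):
    """Single source of truth for which file validates a transfer.
    Values are (filename template, default digest algorithm)."""
    BAGIT = ("tagmanifest-{alg}.txt", "sha256")
    OCFL = ("inventory.json.{alg}", "sha512")

    def __init__(self, template, default_alg):
        self.template = template
        self.default_alg = default_alg

    def validation_file(self, alg=None):
        """Name of the top-level validation file for this layout."""
        return self.template.format(alg=alg or self.default_alg)
```

Classes doing transfer work would then ask `FileLayout.OCFL.validation_file()` rather than hard-coding `"inventory.json.sha512"`, keeping magic values out of the workflow code.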

ACE Token Upload

If the ACE Token Stores are transferred as part of the preservation object instead of as a separate file, the location used when retrieving them will need to be updated. When uploading files, the ByteStream of a file is retrieved from the Bucket class by passing the relative path of the Token Store. This implementation is done so that if we need to use a non-posix storage layer, we can adapt without changing too much code.

In order to support ACE Tokens stored within a package, we would only need to update the path passed to the Bucket. Similar to the information used with respect to the inventory.json, this could live in the FileLayout so that we know where the ACE Token Store is located. Additionally, the head key would need to be retrieved from the inventory.json in order to know which version of the OCFL Object we are retrieving the ACE Token Store for.
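Resolving that path might look like the sketch below. The path templates are assumptions for illustration; only the idea that the OCFL path depends on the head version from the inventory.json comes from the text.

```python
def token_store_path(inventory, layout):
    """Resolve the relative path to pass to the Bucket when retrieving
    the ACE Token Store. The 'content/ace-token-store.txt' convention is
    hypothetical; a BagIt-style layout would use a fixed location."""
    templates = {
        # OCFL: the store lives under the head version's content directory
        "ocfl": "{head}/content/ace-token-store.txt",
        # BagIt / separate-file case: a fixed top-level location
        "bagit": "ace-token-store.txt",
    }
    return templates[layout].format(head=inventory.get("head", ""))
```

In the separate-file case the inventory is never consulted, which matches the note below that no updates are needed when Token Stores are kept outside the preservation object.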

In the event ACE Token Stores are kept separate from the preservation object, no additional updates would need to be made to this part of the workflow.

ACE Checksum Updates

The checksums for the inventory.json and inventory.json.{alg} will need to be updated, as they will both be modified during transfer. The mechanisms to handle this in the ACE-AM are outlined below.

ACE Audit Manager

The ACE Audit Manager provides integrity checking on the files within a preservation object by validating the ACE Tokens registered by the Replication service. Each ACE Token requires the checksum for a file to match what was used on the ingestion of the data. During transfer of versioned preservation objects, there needs to be functionality in order to accommodate these modifications, otherwise the ACE AM will log that the file is corrupt until its checksum matches what it has stored. In addition, since some files will need to have tokens updated, ACE will mark the token as being corrupt until the checksum of the file is updated. While this primarily will affect metadata of packages, we will still want to take steps to resolve these issues.

...

More can be read about ACE Tokens and the process for validating ACE Tokens in the ACE documentation.

Within the ACE Audit Manager, we will be concerned with the following classes:

  • Collection - contains information about the location of a group of files to be monitored
  • MonitoredItem - contains information about a file being monitored within a Collection (checksum, path, size, etc)
  • AceToken - the ACE Token for a MonitoredItem
  • LogEvent - audit log events which occur on a Collection, MonitoredItem, or AceToken and are held in the database

Updating Files

When transferring a versioned OCFL Object, it is inevitable that the inventory.json and inventory.json.{alg} files will need to be updated. In order to accommodate this, there will need to be additional functionality added to the ACE AM application.

Checksum

The ACE AM will need to have an API added in order to allow the mutation of the checksum stored for a given file. This will need additional notifications in the ACE AM's event log showing a FILE_CHECKSUM_UPDATE event along with the old checksum as well as the new. Updating the checksum for a file may also need to update the state of the MonitoredItem to INVALID, which would happen when the new ACE Token is ingested as well.
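The shape of that log entry might be as follows. The FILE_CHECKSUM_UPDATE event type comes from the text above; the field names and structure are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class LogEvent:
    """Sketch of an audit log entry for a checksum mutation.
    Field names are hypothetical, not the ACE AM schema."""
    event_type: str
    path: str
    old_checksum: str
    new_checksum: str
    timestamp: str

def checksum_update_event(path, old, new):
    """Record both the old and new checksum so the mutation is auditable."""
    return LogEvent("FILE_CHECKSUM_UPDATE", path, old, new,
                    datetime.now(timezone.utc).isoformat())
```

Keeping both checksums in the event preserves an audit trail even after the MonitoredItem itself has been overwritten with the new value.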

Content Address Storage Driver

Depending on how much we wish to understand about a package and the mappings of its files, we could provide a StorageDriver for the ACE AM which allows the display of a file's logical address. This would go along with the Content Addressing packaging, which has layouts such as idx_v1.idx that could be parsed by the ACE AM and applied when browsing collections. This isn't strictly necessary and would serve to add a human-readable element when looking at the MonitoredItems of a Collection. In all likelihood it would require a database migration in order to capture the mappings in an efficient way.

ACE Token

An API already exists for adding ACE Tokens, and will allow ACE Tokens to be updated as well. We should modify this process to alter the state of a MonitoredItem to something other than INVALID. Using a MODIFIED state may be acceptable, as with the checksum update.

Versioning for MonitoredItems

An alternative course of action would be to give the ACE Audit Manager some knowledge of versions in its database. This would require a migration in the form of an ace-{version}.sql file, and similar fields to the migrations outlined in the Ingest section. The versioned information which the ACE AM would need to track is the file's digest, size, and token. Additional events should be generated, such as FILE_VERSION_CREATE, which are persisted as a LogEvent.

This would also need an API to be created in order to create new versions of MonitoredItems, and an ability to query on Collections for the latest versions of their MonitoredItems.
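The versioned tracking and latest-version query described above can be sketched with an in-memory SQLite database. The table and column names are hypothetical stand-ins for what an ace-{version}.sql migration might define, carrying the digest, size, and token fields named above.

```python
import sqlite3

# Hypothetical versioned MonitoredItem table; a real migration would
# reference the existing monitored_item table via item_id.
SCHEMA = """
CREATE TABLE monitored_item_version (
    item_id  INTEGER NOT NULL,
    version  INTEGER NOT NULL,
    digest   TEXT    NOT NULL,
    size     INTEGER NOT NULL,
    token    TEXT,
    PRIMARY KEY (item_id, version)
);
"""

# Join each item against its maximum version to get the latest rows.
LATEST = """
SELECT m.item_id, m.digest, m.size
FROM monitored_item_version m
JOIN (SELECT item_id, MAX(version) AS version
      FROM monitored_item_version GROUP BY item_id) latest
  ON m.item_id = latest.item_id AND m.version = latest.version
"""

def latest_versions(conn):
    """Return (item_id, digest, size) for the newest version of each item."""
    return conn.execute(LATEST).fetchall()
```

A Collection-level API could wrap this query so that browsing defaults to the latest state while older versions remain available for audit.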