Repairing is a process that allows node-to-node resolution of corrupt files in Chronopolis. Currently, a node administrator must choose which files to repair and which node to repair from. This prevents unnecessary repairs (e.g. when a filesystem is offline) and leaves room for discussion about the collection before it is repaired.
Download and install the rpm
- CORRUPT notices that data in its Audit Manager (ACE-AM) is showing File Corrupt, indicating that checksums on disk have changed
- The nodes discuss internally which of them holds a valid copy and can repair it
- SSH keys are exchanged so that data transfer can occur for the files which are to be repaired
- CORRUPT logs on to the Ingest server and selects 'Request Repair' in order to create a 'Repair Request'
  - Inputs ACE AM credentials to query for the corrupt collection
  - Selects the Collection
  - Selects the Files to repair and the Node where they will be Repaired
- VALID logs onto the Ingest server and selects 'Fulfill Repair' in order to stage data for the repair
- At this point, both CORRUPT and VALID nodes should start the Repair service
- The Repair service running at VALID will stage data and update the Repair
- The Repair service running at CORRUPT will:
  - Pull data from VALID into a staging area
  - Validate that the transferred data matches the checksums in the ACE AM
  - Overwrite the corrupt files
  - Audit the files in the ACE AM
  - Update the Repair with the result of the audit
- Once complete, the Repair Service at each node can be stopped
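The validate-then-overwrite steps performed at CORRUPT can be sketched roughly as follows. This is a minimal illustration, not the Repair service's actual code: the function names, the directory layout, and the use of SHA-256 digests keyed by relative path are all assumptions made for the example. The expected checksums would in practice come from the ACE AM.

```python
from __future__ import annotations

import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_staged(staging: Path, expected: dict[str, str]) -> tuple[list[str], list[str]]:
    """Split relative paths into (matching, not matching) the expected checksums.

    `expected` is a hypothetical mapping of relative path to hex digest.
    """
    valid, invalid = [], []
    for rel_path, checksum in expected.items():
        staged = staging / rel_path
        if staged.is_file() and sha256_of(staged) == checksum.lower():
            valid.append(rel_path)
        else:
            invalid.append(rel_path)
    return valid, invalid


def replace_corrupt(staging: Path, preservation: Path, expected: dict[str, str]) -> bool:
    """Overwrite files in preservation storage only if every staged file validates."""
    valid, invalid = validate_staged(staging, expected)
    if invalid:
        # Leave preservation storage untouched when any staged file fails validation.
        return False
    for rel_path in valid:
        target = preservation / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(staging / rel_path, target)
    return True
```

The sketch refuses to overwrite anything unless all staged files validate, so a partial or damaged transfer never replaces data in preservation storage. Whether the real service applies an all-or-nothing rule or works file by file is not stated here.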
Repair File Transfer Strategies
During design of the Repair service, it was noted that there are different ways of transferring content between Chronopolis Nodes:
- Direct transfer
  - through rsync
  - through ACE AM
- Indirect transfer through the Ingest Server
During development, provision was made for each type of transfer, but only the direct rsync strategy was implemented. The direct ACE-AM transfer strategy requires additional development in the Audit Manager to support API keys which can be used to access content. The indirect transfer through the Ingest Server was omitted because exchanging SSH keys was not deemed onerous for Chronopolis Nodes.
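As a rough illustration of how one implemented strategy can sit alongside two unimplemented ones, the sketch below models the strategies as an enumeration and builds an rsync-over-SSH pull command for the direct case. The names `TransferStrategy` and `build_pull_command`, and the user, host, and path parameters, are hypothetical and not taken from the Repair service. Only the standard rsync options `-a` (archive mode) and `-e ssh` (remote shell) are used.

```python
from __future__ import annotations

from enum import Enum


class TransferStrategy(Enum):
    """The transfer strategies considered during design (names are illustrative)."""
    DIRECT_RSYNC = "direct transfer through rsync"
    DIRECT_ACE = "direct transfer through ACE AM"
    INDIRECT_INGEST = "indirect transfer through the Ingest Server"


def build_pull_command(
    strategy: TransferStrategy,
    user: str,
    host: str,
    remote_dir: str,
    staging_dir: str,
) -> list[str]:
    """Build the argv CORRUPT could run to pull staged data from VALID.

    Only the direct rsync strategy is handled; the others raise,
    mirroring the fact that they were not implemented.
    """
    if strategy is not TransferStrategy.DIRECT_RSYNC:
        raise NotImplementedError(f"{strategy.value} is not implemented")
    # Trailing slashes make rsync copy the directory contents, not the directory itself.
    source = f"{user}@{host}:{remote_dir.rstrip('/')}/"
    destination = f"{staging_dir.rstrip('/')}/"
    return ["rsync", "-a", "-e", "ssh", source, destination]
```

The resulting list could be passed to `subprocess.run(command, check=True)` once SSH keys have been exchanged between the two nodes.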
Currently the Repair workflow handles repairing corrupt files, but it does not cover other types of failure which can occur in the system. For example, in the past the Audit Manager (ACE-AM) has received invalid checksums from the underlying storage system, which then needed to be updated in order for an audit to pass successfully. We have also seen ACE Token Stores be only partially loaded, which requires re-uploading the ACE Token Store so that we can be sure we are auditing against the ACE Tokens created when the collection was ingested.