Overview
The Bit Integrity Checker is intended to provide a simple, easy-to-use way of verifying that the content stored in your DuraCloud account has maintained bit integrity. The basic idea is that a DuraCloud account administrator provides a listing of content IDs and their expected MD5s. The service then compares this listing against the MD5s found in DuraCloud.
There is a trade-off between cost (in both time and money) and assurance in the trustworthiness of the MD5 provided by DuraCloud. The service is designed to offer three options to address this balance (see "Levels of trust" below). The fastest and cheapest option is to use the MD5 stored in the metadata of the content item. The underlying storage providers assert that this value, which is created on ingest, is also checked when the content item is read, and that a mirrored copy of the content is retrieved if there is a mismatch. As a note, when content is pushed into DuraCloud via the DuraStore REST-API, the user can provide an MD5 that is automatically checked against the one generated by the underlying storage provider; if no MD5 is provided, the DuraCloud application calculates it for this comparison. If the administrator does not trust the assertion of the underlying storage provider, the Bit Integrity Checker also provides the option of reading the content and recalculating the MD5. Finally, if the administrator does not trust this recalculation, the Bit Integrity Checker provides the option of passing in a "salt" character string, which will be appended to the content during the recalculation of the MD5.
The input listing is expected to be found as a user-specified content item within DuraCloud, and the resultant output file will be stored to a user-specified location within DuraCloud. The formats of the input and output files are the same, so a previous run's output may be used as a subsequent run's input.
see July 2010 NDIIPP presentation
Requirements
- User has the option to provide as input the listing of content items with expected MD5s
- User has the option to provide as input the listing of content items without expected MD5s
- Service will determine the system MD5 value for each content item according to the selected "Level of trust" algorithm
- Service will store the determined listing in a user-provided DuraCloud location
- User has the option to specify any two listings of content items / MD5 pairs for comparison
- Service will report on the comparison of provided MD5s against the system MD5 values of content items
- Service will store report output in user-provided DuraCloud location
- Service will run on a compute instance local to the input storeId - future
Design
Levels of trust
The levels of trust let the user choose a balance among cost, time, and assurance:
- Trust in underlying storage providers
- Trust in DuraCloud and opensource software
- Trust in requester of service
The three levels of trust above are addressed by three implementations
- System MD5s are determined by using the stored metadata values
- System MD5s are determined by DuraCloud re-reading the content bytes and re-calculating the MD5s
- System MD5s are determined by DuraCloud re-reading the content bytes appended with a 'salt' and re-calculating the MD5s
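The three implementations above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the `TrustLevel` enum, the `system_md5` function, and the `stored_md5` argument (standing in for the checksum held in the content item's metadata) are names assumed here. Only the behavior of appending the salt to the content comes from the design above.

```python
import hashlib
from enum import Enum
from typing import Optional


class TrustLevel(Enum):  # hypothetical names for the three levels
    STORED = 1       # use the MD5 kept in the item's metadata
    RECALCULATE = 2  # re-read the bytes and recompute the MD5
    SALTED = 3       # recompute the MD5 over the bytes plus a salt


def system_md5(level: TrustLevel, content: bytes, stored_md5: str,
               salt: Optional[str] = None) -> str:
    """Return the 'system MD5' for one content item (illustrative only)."""
    if level is TrustLevel.STORED:
        # Cheapest option: trust the value asserted by the storage provider.
        return stored_md5
    digest = hashlib.md5(content)
    if level is TrustLevel.SALTED:
        if not salt:
            raise ValueError("salt option set but salt not provided")
        # The salt is appended to the content, as described in the design.
        digest.update(salt.encode("utf-8"))
    return digest.hexdigest()
```

Under this sketch, a user who knows the salt can reproduce the level-3 value locally by hashing their own copy of the file followed by the same salt string.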
Operational modes
The following modes address two scenarios: the user wants certainty that MD5s are being generated and checked when requested, or the user trusts the service and wants it to execute with a single command.
- Single-step interaction
- User invokes service with a listing of contentId/MD5 pairs (and other options) to check
- Bit Integrity Checker generates a listing of contentId/MD5 pairs based on the input options and compares it with the expected MD5s from the input
- Service generates result report
- Two-step interaction
- User invokes service with a listing of contentIds (and other options) to check
- Bit Integrity Checker generates a listing of contentId/MD5 pairs based on the input options
- User invokes service with a listing of expected contentId/MD5 pairs to be compared to the generated listing
- Service generates result report
Functionality spec
Single-step interaction
- Service inputs
- spaceId & contentId where input listing is stored
- comma-delimited listing of spaceId, contentId and expected MD5
- each content item separated by newline character
- first line in file will be ignored
- spaceId & contentId where results file should be written
- options (see below)
- Service outputs
- comma-delimited listing of spaceIds, contentIds, expected MD5s, system MD5s, status state
- service status state
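The single-step inputs and outputs above might be handled along these lines. This is a sketch under stated assumptions: `parse_listing` and `check` are hypothetical names, the status labels (`VALID`, `MISMATCH`, `MISSING_MD5`) are placeholders rather than the service's real status values, and the `system` mapping stands in for MD5s the service would obtain from DuraCloud.

```python
import csv
import io
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (spaceId, contentId)
Row = Tuple[str, str, str, str, str]


def parse_listing(text: str) -> Dict[Key, str]:
    """Parse 'spaceId,contentId,md5' lines; the first line is ignored."""
    rows = list(csv.reader(io.StringIO(text)))
    listing: Dict[Key, str] = {}
    for row in rows[1:]:  # skip the first line, per the input format
        if len(row) < 2:
            continue  # blank or malformed line
        md5 = row[2].strip() if len(row) > 2 else ""
        listing[(row[0].strip(), row[1].strip())] = md5
    return listing


def check(expected: Dict[Key, str], system: Dict[Key, str],
          fail_fast: bool = False) -> List[Row]:
    """Build report rows: spaceId, contentId, expected MD5, system MD5, status."""
    report: List[Row] = []
    for (space_id, content_id), exp in expected.items():
        found = system.get((space_id, content_id), "")
        if not exp or not found:
            status = "MISSING_MD5"  # placeholder status label
        elif exp.lower() == found.lower():
            status = "VALID"
        else:
            status = "MISMATCH"
        report.append((space_id, content_id, exp, found, status))
        if fail_fast and status != "VALID":
            break  # stop at the first error or mismatch
    return report
```

The comparison is case-insensitive here because hexadecimal digests are often written in either case; whether the service does the same is not stated in this spec.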
Two-step interaction
- Step 1: Service inputs
- spaceId & contentId where input listing is stored
- comma-delimited listing of spaceId and contentId, without expected MD5
- each content item separated by newline character
- first line in file will be ignored
- spaceId & contentId where results file should be written
- options (see below)
- Step 1: Service outputs
- comma-delimited listing of spaceIds, contentIds, system MD5s
- service status state
- Step 2: Service inputs
- spaceId & contentId where input listing is stored
- comma-delimited listing of spaceId, contentId and expected MD5
- each content item separated by newline character
- first line in file will be ignored
- spaceId & contentId where service-generated listing is stored
- comma-delimited listing of spaceId, contentId and system MD5
- as generated and stored by the service in step 1
- Step 2: Service outputs
- comma-delimited listing of spaceIds, contentIds, expected MD5s, system MD5s, status state
- service status state
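The two steps can be pictured as writing a listing in step 1 and comparing two listings in step 2. The sketch below assumes hypothetical function names (`format_listing`, `compare_listings`), a placeholder header line, and placeholder status labels; it also assumes IDs contain no commas, which a real implementation would need to handle.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (spaceId, contentId)
Row = Tuple[str, str, str, str, str]


def format_listing(listing: Dict[Key, str]) -> str:
    """Step 1 output: a header line, then one 'spaceId,contentId,md5' per item."""
    lines = ["space-id,content-id,md5"]  # placeholder header; readers skip line 1
    for (space_id, content_id), md5 in sorted(listing.items()):
        lines.append(f"{space_id},{content_id},{md5}")
    return "\n".join(lines) + "\n"


def compare_listings(expected: Dict[Key, str],
                     generated: Dict[Key, str]) -> List[Row]:
    """Step 2: compare the user's listing with the service-generated one."""
    report: List[Row] = []
    for key in sorted(expected.keys() | generated.keys()):
        exp, gen = expected.get(key), generated.get(key)
        if exp is None or gen is None:
            status = "UNEQUAL_LISTINGS"  # item present in only one listing
        elif exp == gen:
            status = "MATCH"
        else:
            status = "MISMATCH"
        report.append((key[0], key[1], exp or "", gen or "", status))
    return report
```

Because the output of step 1 uses the same layout as the input, it can be fed back in as the "service-generated listing" for step 2, or kept as the expected listing for a later run.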
Service options
- trust level (stored value, recalculate, salt)
- salt
- arbitrary character string which will be appended to content in generating MD5
- fail-fast boolean
- service will exit when first error/mismatch found if 'true'
- complete space(s) boolean
- indicates if the input listing should be checked against the complete set of items in the space(s)
- storeId of underlying storage provider
- default to primary underlying storage provider
Service exceptions
- Checked
- Contains the following enum
- missing MD5 (expected or found)
- MD5 mismatch
- unequal content listings
- Runtime
- internal error
- salt option set but salt not provided
- service level not supported
- input content item does not exist
- output result content item already exists
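One way to picture the options and exceptions listed above is as a small options record with up-front validation. Everything named below (`CheckerOptions`, `FailureKind`, `CheckedFailure`, the string trust-level values) is an assumption made for illustration; the actual service is not written in Python and its class names are not given in this spec.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    """The checked-exception enum values listed in the spec."""
    MISSING_MD5 = "missing MD5 (expected or found)"
    MD5_MISMATCH = "MD5 mismatch"
    UNEQUAL_LISTINGS = "unequal content listings"


class CheckedFailure(Exception):
    """Hypothetical checked failure carrying one FailureKind."""

    def __init__(self, kind: FailureKind, detail: str = ""):
        super().__init__(f"{kind.value}: {detail}" if detail else kind.value)
        self.kind = kind


TRUST_LEVELS = ("stored", "recalculate", "salt")  # assumed option values


@dataclass(frozen=True)
class CheckerOptions:
    trust_level: str = "stored"
    salt: Optional[str] = None
    fail_fast: bool = False
    complete_spaces: bool = False
    store_id: Optional[str] = None  # None means the primary storage provider

    def validate(self) -> None:
        """Raise the runtime errors from the spec before any work starts."""
        if self.trust_level not in TRUST_LEVELS:
            raise RuntimeError("service level not supported")
        if self.trust_level == "salt" and not self.salt:
            raise RuntimeError("salt option set but salt not provided")
```

Validating the options before reading any content means configuration mistakes surface immediately rather than partway through a long run.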
Options / Mode matrix
The table below shows the possible usage scenarios across the top, and their associated input options.
- One-step: hash from input list
- user initiates Bit Integrity Checker as a single operation
- an input listing of content-ids/hashes is provided
- service generates hashes one-to-one for each item in the input listing
- One-step: hash from complete space
- user initiates Bit Integrity Checker as a single operation
- an input listing of content-ids/hashes is provided
- service generates hashes for all content-items in spaces found in the input listing
- Two-step: hash from input list
- user initiates Bit Integrity Checker as a two-part operation, this being the first part
- an input listing of only content-ids is provided
- service generates hashes one-to-one for each item in the input listing
- Two-step: hash from complete space
- user initiates Bit Integrity Checker as a two-part operation, this being the first part
- user indicates space(s) of target content to hash
- Two-step: compare two lists
- user initiates Bit Integrity Checker as a two-part operation, this being the second part
- two input listings of content-ids/hashes are provided
User input option | One-step: hash from input list | One-step: hash from complete space | Two-step: hash from input list | Two-step: hash from complete space | Two-step: compare two lists |
---|---|---|---|---|---|
hash approach | | | | | |
salt | | | | | |
fail-fast | | | | | |
storage provider id | | | | | |
space of provided listing | | | | | |
object-id of provided listing | | | | | |
space of provided listing-B | | | | | |
object-id of provided listing-B | | | | | |
space(s) of target content | | | | | |
space for output | | | | | |
object-id of result listing | | | | | |
object-id of report | | | | | |
10 Comments
Bill Branan
A few thoughts:
Andrew Woods
Thanks, Bill. Below are responses to each "thought":
Kyle Banerjee
A service that provides bit integrity checking is very useful. However, the way I understand how things will be set up during the pilot, the Sync Tool copies data from your local instance to DuraCloud. This means that bad files can overwrite good ones protected by Fixity Service in DC.
Given that data may be stored both locally and in DC, desirable behavior would be to identify and repair corrupted files wherever they may be. Fixity Service in current state will pull a new copy from another source if DC copy is kaput. Ideally the service could also push a good copy from DC if that other source proved to be corrupted.
Andrew Woods
Thanks for reflecting on the service, Kyle.
Indeed, the SyncTool will be pushing your local content to DC. And if the tool is run when the local file A.txt differs from the DC file A'.txt, then A'.txt will be overwritten by A.txt. If the local file was the corrupt one in this case, then that would be propagated to DC.
For clarification, what the Fixity Service does is limited to reporting. It does not perform any copying of content or repair. The Fixity Service takes a listing of expected contentId/MD5 pairs and compares them to what is found in DC. All discrepancies are flagged, and the user is responsible for taking action based on the report.
It sounds like you would be interested in additional repair functionality. Could you outline how you envision this flow? When running the service, would the user indicate either the input listing of contentId/MD5 pairs or the listing discovered from DC is the one to consider valid? Could it happen that there are content errors on both sides?
Bryan Beecher
Andrew, I want to be sure I understand the proposed service.
To take a trivial example, let's say that I have only a single file (woofwoof.txt) and I'm storing it in three different clouds (A, B, and C) via DuraCloud. If I wanted to check its fixity on a weekly basis then I would provide this service input:
Is that right?
And I can turn one of the knobs such that DuraCloud will fetch the file from each provider, calculate the hash, compare it to the value I passed in, and complain if it does not match?
If this is correct, then this sounds very useful, and would provide similar functionality to a service we're using locally that walks through our archival holdings weekly, calculating the actual hash v. one we have stored in a database.
One minor suggestion... If a typical use-case is that people will replicate their content across multiple cloud providers, storing the same file in each cloud, using DuraCloud as the intermediary, then it might be nice if I could specify the list of cloud providers once, and then a list of (file, hash) pairs. For instance, if I store one million small files, replicating each file in each of N different clouds, then my service input will have N million lines.
Bill Branan
Hi Bryan,
To clarify a bit, there are three pieces of information to be included in the input file: the space ID, the content ID, and the checksum. A "space" in DuraCloud is essentially a top-level folder used to organize content. So the example input file you included would actually be checking the hash for woofwoof.txt in three spaces (A, B, and C) within the same storage provider.
To handle multiple storage providers, the optional store ID parameter would be included on the service call which indicates the storage provider to check against. So to perform the task you have in mind, assuming that the woofwoof.txt file is stored in a space called "dog-sounds", the input file would look like this:
The service would need to be run three times, once each for providers A, B, and C.
The knob that you talk about is what Andrew calls the level-of-trust option, which determines where the checksum is calculated. At level 1 the metadata of the file is retrieved, which includes the checksum, and that is used for comparison. At level 2 the file itself is retrieved and the checksum computed and compared. At level 3 the file is retrieved and appended with a "salt" value (a string of characters), then the checksum is computed.
Andrew, a question for you on that last point: I'm assuming that when running the service using trust level 3 only the salt value, not the checksum, will be provided by the user. The checksum computed using the salt will then be included in the fixity service results file allowing the user to compute and compare the checksum on their local system. Since no actual comparison will be done, the output would be more akin to what the Metadata Export Service would generate. Is this what you had in mind?
Andrew Woods
Hello Bill,
This is a good question, and one which I would appreciate feedback on from the Pilot Partners.
I was thinking that for simplicity, in the trust-level-3 scenario where the MD5 is calculated for salted content items, the behavior of the Fixity Service would be the same as levels 1 & 2. That is to say, the input listing would contain spaceIds, contentIds, and expected-MD5s (that were created with the salt). Due to the transparency of the code and the ability of the user to intentionally include an incorrect MD5 for sanity checking, this may be adequate.
However, for an absolute guarantee that there is no smoke and mirrors under the covers, running the Fixity Service for all three levels of trust could be implemented as a two-part process.
If implementing either the simple approach mentioned first, or the two-step approach, I suggest user interaction with all three levels be consistent. That being the case, at the cost of an additional user step (or automated step if the user sets up local jobs that use the duraservice REST-API), I would advocate for the two-step process.
As an additional note, depending on the technical skills of the user institution, there may be a need for a client-side utility for generating the listing file from the local content items.
Michael Della Bitta
Being able to compare these files with some combination of sort, diff, and pipes is a Good Thing in my book.
Andrew Woods
Thanks, Michael.
For clarity, are you commenting on the nature of the result/report files generated by the Fixity Service? Namely, are you satisfied that they are in a specified, comma-delimited format? Or are you suggesting any additional functionality/formatting that would be helpful in your proverbial book?
Daniel Davis
Are services such as the BBIC supposed to exit/stop when getting to a Job State of COMPLETED?
Does the BBIC have any logic about chunked files or does it simply iterate through the space?
I am curious about the Java logging and how it's configured for services. I see the familiar:
WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration).
WARN Please initialize the log4j system properly.
BTW, a run on the current "Danny" download of 1206 icpsr files appears to have run successfully on an xlarge instance. Lots of logs to understand though.
For BBIC, if you add more spaces after starting the service they do not appear on the reconfiguration dialog for a second run.
The service detail screen does not update itself when the job state changes; you have to refresh manually (at least over the roughly 10 minutes I gave it).