This documentation space is deprecated. Please make all updates to DuraCloud documentation on the live DuraCloud documentation space.

Overview

The Bit Integrity Checker is intended to provide a simple, easy-to-use way of assuring that the content stored in your DuraCloud account has maintained bit integrity. The basic idea is that a DuraCloud account administrator provides a listing of expected content IDs and their associated MD5s. This listing is then used by the service as the basis for comparison against the MD5s found in DuraCloud.

There is a trade-off between cost (both in time and money) and assurance in the trustworthiness of the MD5 provided by DuraCloud. The service is designed to offer three options to address this balance; see "Levels of trust" below. The fastest and cheapest option is to use the MD5 stored in the metadata of the content item. The underlying storage providers assert that this value, which is created on ingest, is also checked when the content item is read, and that a mirrored copy of the content is retrieved if there is a mismatch. As a note, when content is pushed into DuraCloud via the DuraStore REST-API, the user may provide an MD5 to be checked automatically against the one generated by the underlying storage provider; if no MD5 is provided, the DuraCloud application calculates one for this comparison. If the administrator does not trust the assertion of the underlying storage provider, the Bit Integrity Checker also provides the option of reading the content and recalculating the MD5. Finally, if the administrator does not trust this recalculation, the Bit Integrity Checker provides the option of passing in a "salt" character string which is appended to the content during the recalculation of the MD5.

The input listing is expected to be found as a user-specified content item within DuraCloud, and the resultant output file will be stored to a user-specified location within DuraCloud. The formats of the input and output files are the same, so a previous run's output may be used as a subsequent run's input.
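As an illustration, an input listing in the comma-delimited format described under "Functionality spec" below might look like the following (the space and content IDs are arbitrary examples, and the first line is a header that the service ignores):

  space-id,content-id,expected-md5
  dog-sounds,woofwoof.txt,<md5-hash>
  dog-sounds,barkbark.txt,<md5-hash>
  cat-sounds,meow.txt,<md5-hash>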

See the July 2010 NDIIPP presentation.

Requirements

  1. User has the option to provide as input the listing of content items with expected MD5s
  2. User has the option to provide as input the listing of content items without expected MD5s
  3. Service will determine the system MD5 value for each content item according to selected "Level of trust" algorithm
  4. Service will store the determined listing in a user-provided DuraCloud location
  5. User has the option to specify any two listings of content items / MD5 pairs for comparison
  6. Service will report on comparison of provided MD5s against system MD5 values of content items
  7. Service will store report output in user-provided DuraCloud location
  8. Service will run on a compute instance local to the input storeId (future)

Design

Levels of trust

Provides a choice of balance among cost, time, and assurance:

  1. Trust in underlying storage providers
  2. Trust in DuraCloud and opensource software
  3. Trust in requester of service

The three levels of trust above are addressed by three implementations:

  1. System MD5s are determined by using the stored metadata values
  2. System MD5s are determined by DuraCloud re-reading the content bytes and re-calculating the MD5s
  3. System MD5s are determined by DuraCloud re-reading the content bytes appended with a 'salt' and re-calculating the MD5s
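As a minimal sketch (not the service's actual implementation), implementations 2 and 3 both amount to streaming the content bytes through an MD5 digest, with the salt bytes appended at the end in the salted case; implementation 1 simply reads the value already stored in the content item's metadata.

  import java.io.InputStream;
  import java.nio.charset.StandardCharsets;
  import java.security.MessageDigest;

  public class Md5Sketch {

      // Computes the MD5 of a content stream. If salt is non-null, its bytes are
      // appended after the content bytes before the digest is finalized (trust
      // level 3); passing null gives the plain recalculation of trust level 2.
      public static String md5(InputStream content, String salt) throws Exception {
          MessageDigest digest = MessageDigest.getInstance("MD5");
          byte[] buffer = new byte[8192];
          int read;
          while ((read = content.read(buffer)) != -1) {
              digest.update(buffer, 0, read);
          }
          if (salt != null) {
              digest.update(salt.getBytes(StandardCharsets.UTF_8));
          }
          StringBuilder hex = new StringBuilder();
          for (byte b : digest.digest()) {
              hex.append(String.format("%02x", b));
          }
          return hex.toString();
      }
  }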

Operational modes

Two needs must be addressed: giving the user certainty that MD5s are being generated and checked when requested, and allowing the user to trust the service and have it execute with a single command. To support both, the following modes are available.

  1. Single-step interaction
    1. User invokes service with a listing of contentId/MD5 pairs (and other options) to check
    2. Bit Integrity Checker generates a listing of contentId/MD5 pairs based on the input options and compares it against the expected MD5s provided in the input
    3. Service generates result report
  2. Two-step interaction
    1. User invokes service with a listing of contentIds (and other options) to check
    2. Bit Integrity Checker generates a listing of contentId/MD5 pairs based on the input options
    3. User invokes service with a listing of expected contentId/MD5 pairs to be compared to the generated listing
    4. Service generates result report
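In either mode, the comparison step reduces to matching two sets of item/MD5 pairs and flagging the differences. The following is a rough sketch of that logic only; the class, method, and status names are illustrative and are not the service's actual API:

  import java.util.LinkedHashMap;
  import java.util.LinkedHashSet;
  import java.util.Map;
  import java.util.Set;

  public class ListingComparator {

      // Illustrative status values; the real report vocabulary is a service detail.
      public enum Status { OK, MISMATCH, MISSING_EXPECTED_MD5, MISSING_SYSTEM_MD5 }

      // Keys are "spaceId/contentId" strings, values are MD5s.
      public static Map<String, Status> compare(Map<String, String> expected,
                                                Map<String, String> system) {
          Map<String, Status> report = new LinkedHashMap<>();
          Set<String> allItems = new LinkedHashSet<>(expected.keySet());
          allItems.addAll(system.keySet());
          for (String item : allItems) {
              String expectedMd5 = expected.get(item);
              String systemMd5 = system.get(item);
              if (expectedMd5 == null) {
                  report.put(item, Status.MISSING_EXPECTED_MD5);
              } else if (systemMd5 == null) {
                  report.put(item, Status.MISSING_SYSTEM_MD5);
              } else if (expectedMd5.equalsIgnoreCase(systemMd5)) {
                  report.put(item, Status.OK);
              } else {
                  report.put(item, Status.MISMATCH);
              }
          }
          return report;
      }
  }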

Functionality spec

Single-step interaction
  1. Service inputs
    1. spaceId & contentId where input listing is stored
      • comma-delimited listing of spaceId, contentId and expected MD5
      • each content item separated by newline character
      • first line in file will be ignored
    2. spaceId & contentId where results file should be written
    3. options (see below)
  2. Service outputs
    1. comma-delimited listing of spaceIds, contentIds, expected MD5s, system MD5s, status state
    2. service status state
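For illustration, a result listing might look like the following (IDs and hash values are placeholders, and the status vocabulary shown is not prescribed here):

  space-id,content-id,expected-md5,system-md5,status
  dog-sounds,woofwoof.txt,<md5-hash>,<md5-hash>,<status>
  dog-sounds,barkbark.txt,<md5-hash>,<md5-hash>,<status>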
Two-step interaction
  1. Step 1: Service inputs
    1. spaceId & contentId where input listing is stored
      • comma-delimited listing of spaceId and contentId, without expected MD5s
      • each content item separated by newline character
      • first line in file will be ignored
    2. spaceId & contentId where results file should be written
    3. options (see below)
  2. Step 1: Service outputs
    1. comma-delimited listing of spaceIds, contentIds, system MD5s
    2. service status state
  3. Step 2: Service inputs
    1. spaceId & contentId where input listing is stored
      • comma-delimited listing of spaceId, contentId and expected MD5
      • each content item separated by newline character
      • first line in file will be ignored
    2. spaceId & contentId where service-generated listing is stored
      • comma-delimited listing of spaceId, contentId and system MD5
      • as generated and stored by the service in step 1
  4. Service outputs
    1. comma-delimited listing of spaceIds, contentIds, expected MD5s, system MD5s, status state
    2. service status state
Service options
  1. trust level (stored value, recalculate, salt)
  2. salt
    • arbitrary character string which will be appended to content in generating MD5
  3. fail-fast boolean
    • service will exit when first error/mismatch found if 'true'
  4. complete space(s) boolean
    • indicates if the input listing should be checked against the complete set of items in the space(s)
  5. storeId of underlying storage provider
    • default to primary underlying storage provider
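For illustration only, a complete set of option values might be expressed as follows; the option names shown are hypothetical placeholders, not the service's actual configuration keys:

  trust-level=salt                (stored value | recalculate | salt)
  salt=<arbitrary-string>
  fail-fast=true
  complete-space=false
  store-id=<storage-provider-id>  (defaults to the primary provider)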
Service exceptions
  1. Checked
    • Carries one of the following enum values:
    1. missing MD5 (expected or found)
    2. MD5 mis-match
    3. unequal content listings
  2. Runtime
    1. internal error
    2. salt option set but salt not provided
    3. service level not supported
    4. input content item does not exist
    5. output result content item already exists
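A minimal sketch of how the checked exception and its enum might be modeled in Java (the class and enum names are illustrative, not the service's actual types):

  public class FixityCheckedException extends Exception {

      // The failure conditions named above.
      public enum ErrorType { MISSING_MD5, MD5_MISMATCH, UNEQUAL_CONTENT_LISTINGS }

      private final ErrorType type;

      public FixityCheckedException(ErrorType type, String message) {
          super(message);
          this.type = type;
      }

      public ErrorType getType() {
          return type;
      }
  }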
Options / Mode matrix

The table below shows the possible usage scenarios across the top, and their associated input options.

  1. One-step: hash from input list
    • user initiates Bit Integrity Checker as a single operation
    • an input listing of content-ids/hashes is provided
    • service generates hashes one-to-one for each item in the input listing
  2. One-step: hash from complete space
    • user initiates Bit Integrity Checker as a single operation
    • an input listing of content-ids/hashes is provided
    • service generates hashes for all content-items in spaces found in the input listing
  3. Two-step: hash from input list
    • user initiates Bit Integrity Checker as a two-part operation, this being the first step
    • an input listing of only content-ids is provided
    • service generates hashes one-to-one for each item in the input listing
  4. Two-step: hash from complete space
    • user initiates Bit Integrity Checker as a two-part operation, this being the first step
    • user indicates space(s) of target content to hash
  5. Two-step: compare two lists
    • user initiates Bit Integrity Checker as a two-part operation, this being the second step
    • two input listings of content-ids/hashes are provided

(Columns 1-5 correspond to the numbered usage scenarios listed above.)

User input option               | 1 | 2 | 3 | 4 | 5
--------------------------------|---|---|---|---|---
hash approach                   | ✓ | ✓ | ✓ | ✓ |
salt                            | ✓ | ✓ | ✓ | ✓ |
fail-fast                       | ✓ | ✓ |   |   | ✓
storage provider id             | ✓ | ✓ | ✓ | ✓ | ✓
space of provided listing       | ✓ | ✓ | ✓ |   | ✓
object-id of provided listing   | ✓ | ✓ | ✓ |   | ✓
space of provided listing-B     |   |   |   |   | ✓
object-id of provided listing-B |   |   |   |   | ✓
space(s) of target content      |   | ✓ |   | ✓ |
space for output                | ✓ | ✓ | ✓ | ✓ | ✓
object-id of result listing     | ✓ | ✓ | ✓ | ✓ |
object-id of report             | ✓ | ✓ |   |   | ✓


10 Comments

  1. A few thoughts:

    1. I don't see a way to generate the listing of contentIDs and checksums (service output can't be used as service input). It seems like this will be needed to provide a starting point to perform checks in the future. It may be useful to be able to generate this file by space or for all spaces.
    2. It's not clear to me what happens for content items which have been added to a space since the last time a checksum file (used as input here) was generated. Does it get ignored, because it's not in the input list? Does an exception get thrown (unequal content listings perhaps)? Hopefully the file which was added will be included in the output in some way at least to be able to be checked the next time the process is run.
    3. Should this service allow for checking files across spaces? Rather than an input of contentID|MD5, perhaps it should be spaceID|contentID|MD5. This would allow a user to check all of their DuraCloud content in a single call.
    4. It might be easier for humans wanting to review the results if the input/output files used newlines to separate the content files, so rather than:
      content1|MD5|content2|MD5|content3|MD5|content4|MD5
      
      it would be:
      content1|MD5
      content2|MD5
      content3|MD5
      content4|MD5
      
      Another option is to just use the standard properties file setup:
      content1=MD5
      content2=MD5
      content3=MD5
      content4=MD5
      
    5. I'm not sure I see the need for checked exceptions here, considering that the interaction with the service is via http, not via java. If any of the exception cases occur (some would only show up if the fail-fast option were on) then processing is stopped and an error response is sent. This is regardless of whether the exception thrown in the service is checked or runtime.
    6. I still rather like the idea of performing the check in two steps (this provides a record of the contents of a user's datastore as well as providing an input file for the next time the fixity check is run):
      1. Generate a file with the listing of all contentIds and checksums, either for a single space or for all spaces
      2. Compare the input file with the freshly generated file
        • List results in the output file, noting any files missing from either file and any checksum mismatches
    1. Thanks, Bill. Below are responses to each "thought":

      1. A simple new service that has come up in different contexts is a Metadata Export Service. It could be run to create the initial listing (as well as could be extended to export other metadata in bulk fashion).
      2. The intention of the "complete space" boolean was to address this. If the option is set, then the check would be performed from the input listing against all content items in the corresponding space. Yes, "unequal content listings" would be the exception.
      3. Suggestion incorporated.
      4. Yes, newlines between content items were intended but not explicitly stated. Thanks for raising the ambiguity. Also, in order to make the input/output file more compatible with spreadsheet applications, I am thinking comma-delimited is a better option than pipe.
      5. I agree, at the top-level an exception will be thrown by the service causing the error to be reported to the output file specified when the service is started. Including the flavors of exceptions here is a lower-level implementation detail.
      6. My thinking is that the Metadata Export Service will handle the first step you mention, and the second step is what this service (Fixity Service) provides.
  2. A service that provides bit integrity checking is very useful. However, the way I understand how things will be set up during the pilot, the Sync Tool copies data from your local instance to DuraCloud. This means that bad files can overwrite good ones protected by Fixity Service in DC.

    Given that data may be stored both locally and in DC, desirable behavior would be to identify and repair corrupted files wherever they may be. Fixity Service in current state will pull a new copy from another source if DC copy is kaput. Ideally the service could also push a good copy from DC if that other source proved to be corrupted.

    1. Thanks for reflecting on the service, Kyle.

      Indeed, the SyncTool will be pushing your local content to DC. And if the tool is run when the local file A.txt differs from the DC file A'.txt, then A'.txt will be overwritten by A.txt. If the local file was the corrupt one in this case, then that would be propagated to DC.

      For clarification, what the Fixity Service does is limited to reporting. It does not perform any copying of content or repair. The Fixity Service takes a listing of expected contentId/MD5 pairs and compares them to what is found in DC. All discrepancies are flagged, and the user is responsible for taking action based on the report.

      It sounds like you would be interested in additional repair functionality. Could you outline how you envision this flow? When running the service, would the user indicate either the input listing of contentId/MD5 pairs or the listing discovered from DC is the one to consider valid? Could it happen that there are content errors on both sides?

  3. Andrew, I want to be sure I understand the proposed service.

    To take a trivial example, let's say that I have only a single file (woofwoof.txt) and I'm storing it in three different clouds (A, B, and C) via DuraCloud. If I wanted to check its fixity on a weekly basis then I would provide this service input:

    A,woofwoof.txt,<md5-hash>
    B,woofwoof.txt,<md5-hash>
    C,woofwoof.txt,<md5-hash>

    Is that right?

    And I can turn one of the knobs such that DuraCloud will fetch the file from each provider, calculate the hash, compare it to the value I passed in, and complain if it does not match?

    If this is correct, then this sounds very useful, and would provide similar functionality to a service we're using locally that walks through our archival holdings weekly, calculating the actual hash v. one we have stored in a database.

    One minor suggestion...  If a typical use-case is that people will replicate their content across multiple cloud providers, storing the same file in each cloud, using DuraCloud as the intermediary, then it might be nice if I could specify the list of cloud providers once, and then a list of (file, hash) pairs.  For instance, if I store one million small files, replicating each file in each of N different clouds, then my service input will have N million lines.

    1. Hi Bryan,

      To clarify a bit, there are three pieces of information to be included in the input file: the space ID, the content ID, and the checksum. A "space" in DuraCloud is essentially a top-level folder used to organize content. So the example input file you included would actually be checking the hash for woofwoof.txt in three spaces (A, B, and C) within the same storage provider.

      To handle multiple storage providers, the optional store ID parameter would be included on the service call which indicates the storage provider to check against. So to perform the task you have in mind, assuming that the woofwoof.txt file is stored in a space called "dog-sounds", the input file would look like this:

      dog-sounds,woofwoof.txt,<md5-hash>
      

      The service would need to be run three times, once each for providers A, B, and C.

      The knob that you talk about is what Andrew calls the level-of-trust option, which determines where the checksum is calculated. At level 1 the metadata of the file is retrieved, which includes the checksum, and that is used for comparison. At level 2 the file itself is retrieved and the checksum computed and compared. At level 3 the file is retrieved and appended with a "salt" value (a string of characters), then the checksum is computed.

      Andrew, a question for you on that last point: I'm assuming that when running the service using trust level 3 only the salt value, not the checksum, will be provided by the user. The checksum computed using the salt will then be included in the fixity service results file allowing the user to compute and compare the checksum on their local system. Since no actual comparison will be done, the output would be more akin to what the Metadata Export Service would generate. Is this what you had in mind?

      1. Hello Bill,
        This is a good question, and one which I would appreciate feedback on from the Pilot Partners.

        I was thinking that for simplicity, in the trust-level-3 scenario where the MD5 is calculated for salted content items, the behavior of the Fixity Service would be the same as levels 1 & 2. That is to say, the input listing would contain spaceIds, contentIds, and expected-MD5s (that were created with the salt). Due to the transparency of the code and the ability of the user to intentionally include an incorrect MD5 for sanity checking, this may be adequate.

        However, for an absolute guarantee that there is no smoke and mirrors under the covers, running the Fixity Service for all three levels of trust could be implemented as a two-part process.

        1. user uploads spaceId & contentId listing with no MD5s and then has Fixity Service generate an output listing (by any of the three methods: metadata, calculate, or salt)
        2. user uploads spaceId, contentId, expected-MD5 listing and has the Fixity Service perform the comparison

        If implementing either the simple approach mentioned first, or the two-step approach, I suggest user interaction with all three levels be consistent. That being the case, at the cost of an additional user step (or automated step if the user sets up local jobs that use the duraservice REST-API), I would advocate for the two-step process.

        As an additional note, depending on the technical skills of the user institution, there may be a need for a client-side utility for generating the listing file from the local content items.
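        A rough sketch of such a client-side utility (the argument layout, space ID handling, and output format here are assumptions for illustration) could simply walk a local directory, compute each file's MD5, and write out the comma-delimited listing:

        import java.io.InputStream;
        import java.io.PrintWriter;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.Paths;
        import java.security.MessageDigest;
        import java.util.stream.Stream;

        public class ListingGenerator {

            // Usage (illustrative): java ListingGenerator <local-dir> <space-id> <output-file>
            public static void main(String[] args) throws Exception {
                Path localDir = Paths.get(args[0]);
                String spaceId = args[1];
                try (PrintWriter out = new PrintWriter(args[2]);
                     Stream<Path> files = Files.walk(localDir)) {
                    out.println("space-id,content-id,md5"); // header line, ignored by the service
                    files.filter(Files::isRegularFile).forEach(file -> {
                        String contentId = localDir.relativize(file).toString().replace('\\', '/');
                        out.println(spaceId + "," + contentId + "," + md5(file));
                    });
                }
            }

            private static String md5(Path file) {
                try (InputStream in = Files.newInputStream(file)) {
                    MessageDigest digest = MessageDigest.getInstance("MD5");
                    byte[] buffer = new byte[8192];
                    int read;
                    while ((read = in.read(buffer)) != -1) {
                        digest.update(buffer, 0, read);
                    }
                    StringBuilder hex = new StringBuilder();
                    for (byte b : digest.digest()) {
                        hex.append(String.format("%02x", b));
                    }
                    return hex.toString();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        }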

  4. Being able to compare these files with some combination of sort, diff, and pipes is a Good Thing in my book.

    1. Thanks, Michael.
      For clarity, are you commenting on the nature of the result/report files generated by the Fixity Service? Namely, are you satisfied that they are in a specified, comma-delimited format? Or are you suggesting additional functionality/formatting that would be helpful in your proverbial book?

  5. Are services such as the BBIC supposed to exit/stop when getting to a Job State of COMPLETED?
    Does the BBIC have any logic about chunked files or does it simply iterate through the space?
    I am curious about the Java logging and how it's configured for services. I see the familiar:
    WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration).
    WARN Please initialize the log4j system properly.
    BTW, a run on the current "Danny" download of 1206 icpsr files appears to have run successfully on an xlarge instance. Lots of logs to understand, though.
    For BBIC, if you add more spaces after starting the service, they do not appear on the reconfiguration dialog for a second run.
    The service detail screen does not update itself when the job state changes; you have to refresh manually (at least over the period I gave it, about 10 minutes).