Repairing is a process that allows for node-to-node resolution of corrupt files in Chronopolis. Rather than being an automated process, a node administrator must choose which files to repair and which node to repair from. This prevents unnecessary repairs (e.g. due to errors from a filesystem being offline), and allows for discussion and investigation around the collection prior to it being repaired.
Links
- Gitlab: https://gitlab.umiacs.umd.edu/chronopolis/medic
- Builds: http://adaptci01.umiacs.umd.edu/resource/Chron-Medic
- API Documentation
Installation
Download and install the rpm
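For example, on an rpm-based host (the filename below is hypothetical; substitute the rpm downloaded from the Builds link above):

# hypothetical rpm filename - use the actual build you downloaded
rpm -ivh chron-repair-1.5.0-1.noarch.rpm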
Installation Notes
The rpm creates a chronopolis user if one does not exist, and creates the following files/directories:
- /etc/chronopolis
- /etc/chronopolis/repair.yml
- /etc/init.d/chron-repair
- /usr/lib/chronopolis
- /usr/lib/chronopolis/chron-repair.jar
- /var/log/chronopolis
Configuration
The configuration for the repair service is done in repair.yml under /etc/chronopolis.
# Application Configuration for the Chronopolis Repair

# cron timers
## cron.repair: how often to check the Ingest Server repair endpoint
## cron.fulfillment: how often to check the Ingest Server fulfillments endpoint
## see http://www.quartz-scheduler.org/documentation/quartz-2.x/tutorials/crontrigger.html for cron syntax
cron:
  repair: 0 0/1 * * * *
  fulfillment: 0 0 * * * *

# general properties
## repair.stage: staging area to replicate files to before they are moved to preservation storage
## repair.preservation: preservation storage area
repair:
  stage: /export/repair/staging
  preservation: /preservation/bags

# Chronopolis Ingest API configuration
## ingest.endpoint: the url of the ingest server to communicate with
## ingest.username: the username to authenticate as
## ingest.password: the password to authenticate with
ingest:
  endpoint: http://localhost:8000
  username: node
  password: nodepass

# rsync configuration for fulfillments
## rsync.path: used if chrooting users rsyncing - the path under the chroot context
## rsync.stage: a staging area which fulfillments will be copied to
## rsync.server: the fqdn of the server nodes will replicate from
## note that the username for the rsync is determined by whoever requested the repair
rsync:
  path: /export/repair/outgoing
  stage: /export/repair/outgoing/bags
  server: loach.umiacs.umd.edu

# ACE AM configuration
## ace.am: the local ace-am endpoint to connect to
## ace.username: the username to authenticate as
## ace.password: the password to authenticate with
ace:
  am: http://localhost:8080/ace-am/
  username: aceadmin
  password: aceadmin

# spring properties
## spring.profiles.active: the profiles to use when running
##   develop|default: whether to run in development mode or production
##   rsync: fulfill repairs with rsync (as opposed to ace)
## recommended: default, rsync
spring:
  profiles:
    active: default, rsync

# logging properties
## logging.file: the file to write logging statements to
## logging.level: the log level to filter on
logging.file: /var/log/chronopolis/repair.log
logging.level.org.chronopolis: INFO
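The property names above follow Spring Boot conventions, so, assuming the service is a standard Spring Boot application (which the spring and logging properties suggest), individual values can also be overridden when launching the jar directly; a minimal sketch:

# sketch: point the service at the installed config and raise the log level
java -jar /usr/lib/chronopolis/chron-repair.jar \
  --spring.config.location=/etc/chronopolis/repair.yml \
  --logging.level.org.chronopolis=DEBUG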
Running
The Repair Service ships with a SysV style init script and has the basic start/stop/restart options. Customization of the script may be necessary if your java location needs to be specified.
service chron-repair start|stop|restart
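If java is not on the PATH of the user running the service, one customization is to point the init script at an explicit binary, or to launch the jar directly; the JDK location below is hypothetical:

# hypothetical JDK path; the jar location comes from the rpm layout above
/usr/java/jdk1.8.0/bin/java -jar /usr/lib/chronopolis/chron-repair.jar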
Workflow
Note that the workflow involves two nodes: one with CORRUPT data and one with VALID data.
- CORRUPT notices data in its Audit Manager (ACE-AM) is showing File Corrupt, indicating that checksums on disk have changed
- Discussion happens internally about which node holds a valid copy and can repair it
- SSH keys are exchanged so that data transfer can occur for the files which are to be repaired
- CORRUPT logs on to the Ingest server and selects 'Request Repair' in order to create a 'Repair Request'
- Inputs ACE AM credentials to query for the corrupt collection
- Select the Collection
- Select the Files to repair and the Node where they will be Repaired
- VALID logs onto the Ingest server and selects 'Fulfill Repair' in order to stage data for the repair
- At this point, both CORRUPT and VALID nodes should start the Repair service
- The Repair service running at VALID will stage data and update the Repair
- The Repair service running at CORRUPT will
- Pull data from VALID into a staging area (see the rsync sketch after this list)
- Validate that the data transferred and matches the checksums in the ACE AM
- Overwrite the corrupt files
- Audit the files in the ACE AM
- Update the Repair with the result of the audit
- Once complete, the Repair Service at each node can be stopped
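As a rough sketch of the pull performed by the CORRUPT node, a fulfillment link takes the form user@server:path, built from the VALID node's rsync configuration; the collection name and link below are hypothetical:

# hypothetical fulfillment link and staging area; the real link is generated
# by the Ingest Server from the VALID node's rsync configuration
rsync -a ucsd@loach.umiacs.umd.edu:bags/my-collection/ /export/repair/staging/my-collection/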
Repair File Transfer Strategies
During design of the Repair service, it was noted that there are different ways of transferring content between Chronopolis Nodes:
- Direct transfer
- through rsync
- through ACE AM
- Indirect transfer through the Ingest Server
During development, support was added for each type of transfer, but only the direct rsync strategy was fully implemented. The direct ACE-AM transfer strategy requires additional development in the Audit Manager in order to support API Keys which can be used to access content. The indirect transfer through the Ingest Server was omitted, as it was not deemed onerous for Chronopolis Nodes to exchange SSH keys.
Repair Types
Currently the Repair workflow handles repairing corrupt files, but does not cover other types of failure which can occur in the system. For example, in the past we have had issues with the Audit Manager (ACE-AM) having received invalid checksums from the underlying storage system, which then needed to be updated in order for an audit to pass successfully. We have also seen ACE Token Stores be partially loaded, which results in the need to re-upload the ACE Token Store so that we can ensure we are auditing against the ACE Tokens created on ingestion of the collection.
Release Notes
Release 1.5.0
...