Installation

Builds for the master and develop branches of the Repair Service can be found at http://adaptci01.umiacs.umd.edu/resource/medic

The RPM creates a chronopolis user if one does not exist, and installs the following files and directories.

Code Block
language: bash
title: Installed Files
collapse: true
[~] $ rpm -ql ingest-server
/etc/chronopolis
/etc/chronopolis/repair.yml
/etc/init.d/chron-repair
/usr/lib/chronopolis
/usr/lib/chronopolis/chron-repair.jar
/var/log/chronopolis

Configuration

The configuration for the Repair Service is done in repair.yml under /etc/chronopolis.

Code Block
title: repair.yml
linenumbers: true
collapse: true
# cron timers for the scheduled jobs; see http://www.quartz-scheduler.org/documentation/quartz-2.x/tutorials/crontrigger.html for documentation
cron:
  repair: 0 0 * * * *
  fulfillment: 0 0 * * * *

# storage locations for the repair service: where to stage data (for pulling without overwriting),
# and where to write data once it has been validated
repair:
  stage: /data/chronopolis/backup
  preservation: /data/chronopolis/preservation

# ingest server configuration for communication
ingest:
  endpoint: http://localhost:8000
  username: my-user
  password: my-pass

# rsync fulfillment configuration
#   path: the path to substitute in when creating the rsync link, e.g. ucsd@test-server.umiacs.umd.edu:bags/...
#   stage: the storage area to stage bags to
#   server: the server to rsync from, e.g. test-server.umiacs.umd.edu
# note that the username for the rsync is determined by whoever requested the repair
rsync:
  path: bags/
  stage: /export/outgoing/bags
  server: test-server.umiacs.umd.edu

# ace configuration
# am: the local ace-am endpoint to connect to for communication with ace
ace:
  am: http://localhost:8080/ace-am/
  user: ace
  password: ace

# the active profiles to use
# develop|default : if running in development mode or production
# rsync : if fulfilling with rsync or ace
spring.profiles.active: default, rsync

# the location of the main repair log
logging.file: /var/log/chronopolis/repair.log
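Since the service appears to be a standard Spring Boot application (note the spring.profiles.active property above), properties can also be overridden when launching the jar by hand. A minimal sketch; the flags below are Spring Boot conventions rather than anything specific to the Repair Service:

Code Block
language: bash
title: Running with an Explicit Configuration (sketch)
# point the service at an alternate configuration file and set of profiles
java -jar /usr/lib/chronopolis/chron-repair.jar \
    --spring.config.location=/etc/chronopolis/repair.yml \
    --spring.profiles.active=develop,rsync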

Running

The Repair Service ships with a SysV-style init script that has the basic start/stop/restart options. You may need to customize the script if the location of your Java installation must be specified.

  • service chron-repair start|stop|restart
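For example, to restart the service and follow the main repair log (the log location comes from logging.file in repair.yml):

Code Block
language: bash
title: Restart and Watch the Log
service chron-repair restart
tail -f /var/log/chronopolis/repair.log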

The remainder of this page maps out a general process for restoring and repairing content between nodes in Chronopolis, and notes areas where discussion or development is needed.

Chronopolis Repair Design Document

Michael Ritter, UMIACS. October 10, 2016

Background

Within the standard operation of Chronopolis, given the volume of data we ingest, it is likely that we will at some point need to repair data held at a node. In the event a node cannot repair its own data, a process will be in place so that the data can be repaired through the Chronopolis network. This document outlines a basic design proposal for a protocol through which we can repair collections with a combination of manual and automated work.

Considerations

As this design is still a living document, there are open questions as to how everything should be finalized and what impact those decisions will have on the final result.

1. Transfer Strategies

  • Multiple types of transfer are allowed; however, each will need to be implemented.
    • Node to Node: Transfer between replicating nodes using rsync + ssh with no intermediary step
    • Node to Ingest: Push content to the Ingest node, from which a node can repair
    • ACE: Use ACE with https as the transfer mechanism for serving files

2. Should we put a limit on the number of files being repaired in a single request?

  • At the moment this is unbounded, but we may want to look into it in the future

3. Should we include tokens in this process, but leave implementation out for now?

  • The initial version will only handle files; tokens can be added later

Repair Flow

Basic flow: node_i = the node with invalid files; node_v = a node with valid copies

  1. node_i sees invalid files in ACE_i
  2. node_i gathers invalid files and issues a repair request to the ingest server
    1. POST /api/repair
    2. Handled manually
    3. Consider having multiple requests in the event many files are corrupt
  3. node_v sees the repair request
    1. Handled manually, likely from discussion in the chron group
  4. node_v checks ACE_v to see if the files are valid
    1. POST /api/repair/<id>/fulfill if valid
  5. node_v stages content for node_i
    1. P2P: make a link (or links) to the files in a directory for node_i
    2. Ingest: rsync the files up to the ingest server
    3. ACE: create a token for node_i and make that available
  6. node_v notifies the ingest server that content is ready for node_i
    1. POST /api/repair/fulfillment/<id>/ready
  7. node_i replicates staged content
    1. GET /api/repair/fulfillment?to=node_i&status=ready
  8. node_i validates staged content
    1. communicates with the ACE compare API
    2. if not valid, end here
  9. node_i copies staged content to preservation storage
  10. node_i issues an audit of the corrupt files
  11. node_i responds with the result of the audit
    1. if the audit is not successful, a new replication request will need to be made, but we might want to do that by hand
    2. POST /api/repair/fulfillment/<id>/complete
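A sketch of the flow above in terms of raw HTTP calls against the Ingest server; the hostname comes from the sample configuration, while the ids, credentials, and payload values are illustrative only:

Code Block
language: bash
title: Repair Flow over HTTP (sketch)
# 2. node_i requests a repair for its corrupt files
curl -u node_i:pass -X POST http://localhost:8000/api/repair \
     -H 'Content-Type: application/json' \
     -d '{"depositor": "depositor-with-corrupt-collection",
          "collection": "collection-with-corrupt-files",
          "files": ["file_0", "file_1"]}'

# 4. node_v offers to fulfill repair 1 after checking ACE_v
curl -u node_v:pass -X POST http://localhost:8000/api/repair/1/fulfill

# 6. node_v marks fulfillment 3 as staged and ready
curl -u node_v:pass -X POST http://localhost:8000/api/repair/fulfillment/3/ready

# 7. node_i polls for fulfillments that are ready for it
curl -u node_i:pass 'http://localhost:8000/api/repair/fulfillment?to=node_i&status=ready'

# 11. node_i reports the result of its audit
curl -u node_i:pass -X POST http://localhost:8000/api/repair/fulfillment/3/complete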

Turning this into a graph might be useful

Transfer Strategies

Node to Node

Node to Node transfers would require additional setup on our servers, and would likely require a look into how we deal with security around our data access (transferring ssh keys, ensuring access by nodes is read only, etc). A feasible staging process could look like:

1. node_v links data (ln -s) in node_i’s home directory

2. node_i rsyncs data from node_v:/homes/node_i/depositor/repair-for-collection
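As a sketch, those two steps might look like the following on the command line; the directory and file names are illustrative, and -L is used on the pull so that the symlinks staged by node_v are resolved to real files:

Code Block
language: bash
title: Node to Node Staging (sketch)
# on node_v: link the valid files into node_i's home directory
mkdir -p /homes/node_i/depositor/repair-for-collection
ln -s /data/chronopolis/preservation/depositor/collection/file_0 \
      /homes/node_i/depositor/repair-for-collection/file_0

# on node_i: pull the staged files over rsync + ssh into the repair staging area
rsync -aL node_i@node_v.edu:/homes/node_i/depositor/repair-for-collection/ \
      /data/chronopolis/backup/repair-for-collection/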

Node to Ingest

Node to Ingest, while lengthy, would have the least amount of development and setup effort associated with it. Since we will most likely not be repairing terabytes of data at a time, one could say this is "good enough". The staging process for data would look similar to:

1. node_v rsyncs data to the ingest server

2. node_v notifies that the data is ready at /path/to/data on the ingest server

3. node_i rsyncs data from the ingest server at /path/to/data
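Sketched as shell commands, reusing the ingest server address from the credentials example later on this page; the paths are illustrative:

Code Block
language: bash
title: Node to Ingest Staging (sketch)
# on node_v: push the valid files up to the ingest server
rsync -a repair-for-collection/ node_v@chron.ucsd.edu:/path/to/data/

# on node_i: pull the files back down once notified they are ready
rsync -a node_i@chron.ucsd.edu:/path/to/data/ \
      /data/chronopolis/backup/repair-for-collection/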

ACE

Repairing through ACE would require additional development on ACE, as it currently does not have any concept of API keys, but it otherwise provides the same benefits as Node-to-Node repair with some constraints from http itself. Staging would become quite simple, and amount to:

1. node_v marks the collection as allowing outside access (for API keys only?)

2. node_v requests a new temporary API key from ACE

3. node_i downloads from ACE_v using the generated API key
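Since ACE has no API keys today, any example here is purely speculative; the sketch below only illustrates the shape a keyed download over https might take, with the header name and endpoint path invented for illustration:

Code Block
language: bash
title: ACE Download (hypothetical sketch)
# node_i pulls a single file from ACE_v using the temporary key
# (the endpoint path and auth header are placeholders, not a real ACE API)
curl -H 'Authorization: ace-api-key' \
     -o file_0 \
     'https://node_v/ace-am/some/download/path/file_0'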

API Design - Move to Sub Page

The API can be viewed with additional formatting and examples at http://adaptci01.umiacs.umd.edu:8080/

HTTP API

The REST API described here follows standard conventions and is split into two main parts: repair and fulfillment.

Repair API

GET /api/repair/requests?<requested=?,collection=?,depositor=?,offers=?>

GET /api/repair/requests/<id>

POST /api/repair/requests

POST /api/repair/requests/<id>/fulfill

Fulfillment API

GET /api/repair/fulfillments?<to=?,from=?,status=?>

GET /api/repair/fulfillments/<id>

PUT /api/repair/fulfillments/<id>/ready

PUT /api/repair/fulfillments/<id>/complete
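For example, the filtered GETs might be exercised as below, with the Ingest endpoint taken from the sample configuration and the filter values illustrative:

Code Block
language: bash
title: Querying the APIs (sketch)
# all repair requests for a single depositor
curl -u user:pass 'http://localhost:8000/api/repair/requests?depositor=some-depositor'

# all fulfillments headed to node_i which are ready for transfer
curl -u user:pass 'http://localhost:8000/api/repair/fulfillments?to=node_i&status=ready'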

Models - Move to Sub Page

A repair request, sent out by a node that notices it has corrupt files in a collection

Repair Request Model

{
  "depositor": "depositor-with-corrupt-collection",
  "collection": "collection-with-corrupt-files",
  "files": ["file_0", "file_1", ..., "file_n"]
}

A repair structure, returned by the Ingest server after a repair request is received

Repair Model

{
  "id": 1,
  "status": "requested|fulfilling|repaired|failed",
  "requester": "node-with-corrupt-file",
  "depositor": "depositor-with-corrupt-collection",
  "fulfillment": 3,
  "collection": "collection-with-corrupt-files",
  "files": ["file_0", "file_1", ..., "file_n"]
}

A fulfillment for a repair, returned by the Ingest server after a node notifies it can fulfill a repair request. Credentials are only visible to the requesting node and administrators.

Fulfillment Model

{
  "id": 3,
  "to": "node-with-corrupt-file",
  "from": "node-with-valid-file",
  "status": "staging|ready|complete|failed",
  "credentials": { ... },
  "repair": 1
}

Credentials ACE

{
  "type": "ace",
  "api-key": "ace-api-key",
  "url": "https://node_v/ace-am" # ?? Not sure if really needed
}

Credentials Node-to-Node

{
  "type": "node-to-node",
  "url": "node_i@node_v.edu:/homes/node_i/path/to/repair"
}

Credentials Ingest

{
  "type": "ingest",
  "url": "node_i@chron.ucsd.edu:/path/to/repair"
}
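On the repairing node, the url from the node-to-node or ingest credentials is what gets handed to rsync. A minimal sketch, with the destination taken from the stage property in repair.yml and the directory name illustrative:

Code Block
language: bash
title: Using Transfer Credentials (sketch)
# url copied from the fulfillment's credentials block; -L resolves staged symlinks
rsync -aL node_i@node_v.edu:/homes/node_i/path/to/repair/ \
      /data/chronopolis/backup/repair-for-collection/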

-------------------------

Previous iterations:

Repair Design Document, October 2016

...