TODO: Update configuration documentation (+local file scanning)

...

The Ingest Server coordinates the distribution of packages throughout the Chronopolis network and can perform additional services if needed (examples below). It does this through an HTTP API, which both the Intake and Replication services use to interact with the server and to receive updates about the data they are processing.

Links

Installation

Prerequisites

  • PostgreSQL - The Ingest Server connects to a PostgreSQL database to store information about transfers, bags, and tokens
  • Token Staging Area - The Ingest Server needs an area to store tokens during transfer
  • SSH key exchange - Replications are done through rsync, so public keys should be shared between nodes in order to authorize and authenticate access

RPM

Download the rpm for your operating system

Running

The Ingest Server runs as an executable jar. The init script (EL6) or systemd unit (EL7) allows the server to be started and stopped as root:

  • EL6: `service ingest-server start|stop`
  • EL7: `systemctl start|stop ingest-server`

Installation Notes

Installed files for RHEL6

The Ingest service adds a layer between the Intake and Replication services. The main goal is to validate data which comes from bags and create ACE tokens for that data.

Installation

Prereqs

  • PostgreSQL - The ingest service connects to a PostgreSQL database to store information about transfers, bags, and tokens
  • Staging areas - The ingest service needs an area to store tokens during transfer
  • SSH public keys - Needed from each node that will be replicating content

RPM

  • Download the rpm (soon a yum repository maybe?)
  • Use yum to install with `yum install ingest-server-$version.rpm`

...

EL6 Ingest Files:

...

/etc/init.d/ingest-server
/usr/local/chronopolis/ingest
/usr/local/chronopolis/ingest/application.yml
/usr/local/chronopolis/ingest/ingest-server.jar

...

Installed Files for RHEL7

/usr/lib/systemd/system/ingest-server.service
/usr/local/chronopolis/ingest
/usr/local/chronopolis/ingest/application.yml
/usr/local/chronopolis/ingest/ingest-prepare
/usr/local/chronopolis/ingest/ingest-server.jar

On startup, the startup scripts check for the following directories, creating them or applying permissions if they do not match:

  • Logging: /var/log/chronopolis/


User Creation

A 'chronopolis' user is also needed. This user must be able to write to /var/log/chronopolis and to perform read and write tasks in the Token Staging Area. The user is no longer created as part of the rpm; it should be managed separately and configured in the ingest-server startup script.
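
As a quick sanity check after creating the user, something like the following can confirm that the required directories are writable. This is only a sketch: the function name `check_writable` is made up for illustration, and the staging path is a placeholder that should be replaced with the actual Token Staging Area.

```shell
#!/bin/sh
# Report whether the current user can write to each given directory.
# check_writable is a hypothetical helper, not part of the ingest-server rpm.
check_writable() {
    for dir in "$@"; do
        if [ -d "$dir" ] && [ -w "$dir" ]; then
            echo "ok: $dir"
        else
            echo "not writable: $dir"
        fi
    done
}

# /var/log/chronopolis comes from the text above; the staging path is a placeholder
check_writable /var/log/chronopolis /path/to/token/staging
```

Running the script as the service user (for example with `sudo -u chronopolis sh check.sh`, where `check.sh` is whatever name the script was saved under) shows the result from that user's point of view.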

Configuration

The ingest server reads from the /usr/local/chronopolis/ingest/application.yml configuration file:

# Ingest Configuration Properties

# Ingest Cron job properties
# tokens: the rate at which to check for bags which have all tokens and can have a Token Store written
# request: the rate at which to check for bags which need their initial replications created
# tokenize: the rate at which to check for local bags which need tokens created - DEPRECATED in 3.0
ingest.cron:
  tokens: 0 0/10 * * * *
  request: 0 0/10 * * * *
  tokenize: 0 0 * * * *

# Ingest AJP Settings
# enabled: flag to enable an AJP connector
# port: the port for the connector to listen on
ingest.ajp:
  enabled: false
  port: 8009

# The staging area for writing Token Stores. Non-POSIX support is not yet implemented.
## id: The id of the StagingStorage in the Ingest Server
## path: The path to the filesystem on disk
chron.stage.tokens.posix.id: -1
chron.stage.tokens.posix.path: /dev/null

# If Local Tokenization is desired, include properties for the Ingest API user, staging information for Bags, and ACE IMS connection information
# username: The name of the user who created the bags ingest will be tokenizing
ingest.api.username: bag-creator

## id: The id of the StagingStorage which the Ingest Server will read from  - DEPRECATED in 3.0
## path: The path to the filesystem on disk  - DEPRECATED in 3.0
chron.stage.bags.posix.id: -1 
chron.stage.bags.posix.path: /dev/null

## port: the port to connect to the ims with
## waitTime: the time to wait between token requests
## endpoint: the fqdn of the ims
## queueLength: the maximum number of requests to send at once
ace.ims:
  port: 80
  waitTime: 5000
  endpoint: ims.umiacs.umd.edu
  queueLength: 1000

# Database connection
# Initialize should be kept false so that the server does not attempt to run a drop/create on the tables
spring.datasource:
  url: jdbc:postgresql://localhost/ingest
  username: postgres
  password: dbpass
  initialize: false

# Specify the active profile for configuring services as a comma separated list
# production - remove stdout/stderr from printing and run without accepting input
# disable-tokenizer - disable local tokenization services from running
spring.profiles.active: production
spring.pid.file: /var/run/ingest-server.pid

# debug: true
server.port: 8080

# Logging properties
logging.file: ingest.log
logging.level:
  org.springframework: ERROR
  org.hibernate: ERROR
  org.chronopolis: DEBUG

Notes on Configuration

  • An AJP connector can now be configured with the server, meaning SSL can be served through Apache httpd instead of a Java keystore
  • The pid file location should not be changed unless the init files are updated with the corresponding change
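
The cron values in application.yml look like six-field expressions (second, minute, hour, day of month, month, day of week), as used by Spring's scheduler. That the Ingest Server parses them this way is an assumption. Under that assumption, a small helper such as the hypothetical `cron_field_count` below can catch an entry with the wrong number of fields before the server is restarted:

```shell
#!/bin/sh
# Count the whitespace-separated fields in a cron expression.
# cron_field_count is a hypothetical helper for illustration only.
cron_field_count() {
    # awk splits on runs of whitespace; NF is the number of fields on the line
    printf '%s\n' "$1" | awk '{ print NF }'
}

cron_field_count "0 0/10 * * * *"
```

The expression must be quoted when passed in, otherwise the shell expands each `*` to file names.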

Administration

Database Setup

Download the schema from the CI server

  1. Create a PostgreSQL user for the database: `CREATE USER chron WITH PASSWORD 'secret-password';`
  2. Create the ingest database: `CREATE DATABASE ingest;`
  3. Grant the permissions to the new user: `GRANT ALL PRIVILEGES ON DATABASE ingest TO chron;`
  4. Connect to the ingest database as the new user (either through the psql shell or by reconnecting): `psql -U chron ingest`
  5. Source the SQL script in the database: `\i /path/to/schema.sql`

Example PostgreSQL Setup:
psql (8.4.20)
Type "help" for help.
postgres=# CREATE USER chron WITH PASSWORD 'secret-password';
CREATE ROLE
postgres=# CREATE DATABASE ingest;
CREATE DATABASE
postgres=# GRANT ALL PRIVILEGES ON DATABASE ingest to chron;
GRANT
postgres=# \c ingest;
psql (8.4.20)
You are now connected to database "ingest".
ingest=# SET ROLE chron;
SET
ingest=> \i /tmp/schema_pg.sql

Preparing the DB for Schema Migrations

As of version 1.1.0, the database has a schema_version table for handling schema migrations. The table is managed automatically by Flyway, so the server can be upgraded without manually applying patches. Flyway provides a jar file which we can use to prepare the database for migrations; this can be applied to databases from previous versions as well.

  1. Download and untar/unzip the Flyway Command Line Tool
    1. The Ingest Server currently uses Flyway 4.2.0; if possible, the binary for that version should be used
  2. Edit conf/flyway.conf
    1. Some properties follow the same pattern as our application properties (e.g. those for connecting to the database)
    2. Specify the version for which you are creating the baseline, using the MAJOR.MINOR number of the Ingest Server version

      Flyway Configuration Example:
      #
      # Copyright 2010-2015 Axel Fontaine
      #
      ...
      
      # Jdbc url to use to connect to the database
      flyway.url=jdbc:postgresql://localhost/ingest
      
      # Fully qualified classname of the jdbc driver (autodetected by default based on flyway.url)
      # flyway.driver=
      
      # User to use to connect to the database (default: <<null>>)
      flyway.user=chron
      
      # Password to use to connect to the database (default: <<null>>)
      flyway.password=my-postgresql-password
      
      ...
      flyway.baselineVersion=3.0


  3. Use the flyway bash script to update the database

    Flyway Baseline Migration:
    $ ./flyway baseline
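
The MAJOR.MINOR value for flyway.baselineVersion can be derived from a full version string rather than typed by hand. The sketch below assumes the installed Ingest Server version is known in a dotted form such as 3.0.1; the helper name `baseline_version` and the sample version are made up for illustration.

```shell
#!/bin/sh
# Derive the Flyway baseline version (MAJOR.MINOR) from a full version string.
# baseline_version is a hypothetical helper; the version passed in is a sample value.
baseline_version() {
    # keep only the first two dot-separated components, e.g. 3.0.1 -> 3.0
    printf '%s\n' "$1" | cut -d. -f1,2
}

baseline_version "3.0.1"
```

The printed value is what would be placed after `flyway.baselineVersion=` in conf/flyway.conf.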


Local Tokenization
Since 2.3.0

When doing local ingestion of bags through the Ingest Server, it is possible to have the Ingest Server create ACE Tokens for the files in a Bag. This can be enabled through the application.yml configuration file. Note that because staging areas can be shared, the user creating the Bags in the Ingest Server should be unique to the local processes.

Once enabled, rudimentary information about tokenization can be viewed in the web UI at /status/supervisor.

Tokenization Configuration:
# Ingest Tokenizer Settings
# cron: the cron timer for running local-tokenization
# enabled: flag to enable Local tokenization of bags
# username: the 'creator' to check for when depositing bags (defaults to 'admin')
# staging.id: the ID of the StorageRegion to write tokens into
# staging.path: the path to the filesystem on disk
ingest.tokenizer:
  cron: 0 0 * * * *
  enabled: true
  username: admin
  staging.id: -1
  staging.path: /dev/null

Local File Scanning
Since 3.1.0

Bags on a filesystem local to the Ingest Server can have their files and fixities registered. This is done through a task which periodically fires and queries the database for Bags awaiting scanning. Note that because staging areas can be shared, the user creating the Bags in the Ingest Server should be unique to the local processes.

Resetting Passwords
Since 1.4.0

As of version 1.4.0, user passwords are encoded using bcrypt. If a user forgets their password, it must be reset for them. Because email notifications are not set up, this currently has to be done manually. First run the new password through a bcrypt encoder, which can be found online. If you aren't sure how many rounds to use, check the database, as the information is kept as part of the encoding: $2a$08 uses 8 rounds; $2a$10 uses 10 rounds.

Then we connect to the database and issue a simple update:

User update:
 UPDATE users SET password = '$2a$10$hEYYHV/Fri00RRHjWPIAWuH3NxYpPPjbMU5OsJfH1SAenajQqKjhK' WHERE username = 'user_resetting_password';
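
To see how many rounds an existing password uses, the value can be read straight out of a stored hash. This is a small sketch rather than part of the Ingest Server tooling; the helper name `bcrypt_rounds` is made up, and the hash is the sample one from the update statement above.

```shell
#!/bin/sh
# Print the rounds value embedded in a bcrypt hash.
# bcrypt_rounds is a hypothetical helper for illustration only.
bcrypt_rounds() {
    # a bcrypt hash looks like $2a$10$<salt and hash>;
    # splitting on '$' gives an empty first field, then the version, then the rounds
    printf '%s\n' "$1" | cut -d'$' -f3
}

bcrypt_rounds '$2a$10$hEYYHV/Fri00RRHjWPIAWuH3NxYpPPjbMU5OsJfH1SAenajQqKjhK'
```

The hash must be passed in single quotes so the shell does not treat `$2a` and the like as variables.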

Storage Regions
Since 2.0.0

With the release of version 2.0.0, StorageRegions were introduced in order to facilitate distribution of content from many nodes in Chronopolis. A StorageRegion contains information about what type of data is held and the total capacity of the StorageRegion. Currently the capacity serves only as a reference, and can be exceeded if the Ingest Server does not know data has been removed. A note can also be provided to display additional information about a StorageRegion (e.g. 500TB XFS JBOD).

StorageRegions also provide configuration information for creating replications:

ReplicationConfiguration

  • Replication Server: The server which will be connected to for transferring data
  • Replication Path: The path which should be used to point to the files, e.g. /export/bags, or /bags when using a chroot environment
  • Replication Username: The username clients should connect as, or null if they should use their node username
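
To illustrate how these three values might fit together, the sketch below builds an rsync-style source of the form user@host:path, falling back to the node's own username when no Replication Username is set. How the Ingest Server actually assembles replication links is not described here, so the function `replication_source`, its argument order, and the host name are all assumptions.

```shell
#!/bin/sh
# Build an rsync-style source (user@host:path) from ReplicationConfiguration values.
# replication_source and every value passed to it below are hypothetical.
replication_source() {
    repl_server=$1
    repl_path=$2
    repl_username=$3
    node_username=$4
    # an empty Replication Username means clients connect with their node username
    repl_user=${repl_username:-$node_username}
    printf '%s@%s:%s\n' "$repl_user" "$repl_server" "$repl_path"
}

replication_source "ingest.example.org" "/export/bags" "" "node-a"
```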

Optional Data Loading

This information is now outdated; a new method needs to be created for bootstrapping data with encrypted passwords.

There is also a SQL script which contains entries for the nodes and users in Chronopolis as well as their roles. The script is found on Jenkins and can be loaded with `\i /path/to/data.sql`. If it is not used, users and nodes will need to be created manually: an admin user will be created on startup with a default password of 'admin' if no users are found, and further users can then be created through the web UI.

Local Bag Ingestion

When a Bag needs to be ingested by hand, the following steps need to be taken:

  1. Create a BagIt bag if one does not already exist
  2. Create a CSV containing the filenames and fixity values for each file in the Bag
  3. Create a Bag using the Create Bag API method from the Ingest RESTful Server API
  4. Upload the Bag's file listing using the File Ingest API method
  5. Register a Staging entity for the Bag using the Create Staging API method, marking that the Bag is staged and ready to be replicated
    1. Note that it will take time for the Bag to be replicated, as the Ingest Server will first create AceTokens for each file

Example Bag Ingest:
#!/bin/sh
#
# This is an example of how ingesting a bag might look when scripted.
# It will likely be revised before the production release of 3.0.0.
# Requires GNU find (for -printf), awk, sha256sum, gzip, and curl.
#################################################################################

BAGS="test-bag-0"
MANIFEST="manifest-sha256.txt"
TAGMANIFEST="tagmanifest-sha256.txt"

BAG_JSON='{"name":"scripted-bag-0", "depositor": "script-depositor", "size": 1024, "totalFiles": 10}'
STAGING_JSON='{"location": "script-depositor/scripted-bag-0", "validationFile": "/tagmanifest-sha256.txt", "storageRegion": 1, "totalFiles": 10, "size": 1024, "storageUnit": "B"}'
INGEST_USER="ingest"
INGEST_PASSWORD="secret"
INGEST_BAG_CREATE="http://localhost:8080/api/bags"
INGEST_FILE_CREATE="http://localhost:8080/api/bags/{id}/files"
INGEST_STAGING_CREATE="http://localhost:8080/api/bags/{id}/staging/BAG"

# write a csv (and a gzipped copy) listing the name, size, and fixity of each file in a bag
generate_csv() {
    for bag in $BAGS; do
        current_manifest="$bag/$MANIFEST"
        current_tagmanifest="$bag/$TAGMANIFEST"

        awk -v bag="$bag" 'BEGIN { print "FILENAME,SIZE,FIXITY_VALUE,FIXITY_ALGORITHM" }
            {
                # $1 is the digest and $2 the file path; look up the file size in bytes
                size = ""
                cmd = "find \"" bag "/" $2 "\" -printf %s"
                cmd | getline size
                close(cmd)
                printf "\"%s\",%s,%s,SHA-256\n", $2, size, $1
            }' "$current_tagmanifest" "$current_manifest" > "$bag.csv"

        # the tagmanifest does not list itself, so append its entry separately
        tagsum=$(sha256sum "$current_tagmanifest" | cut -c 1-64)
        tagsize=$(find "$current_tagmanifest" -printf '%s')
        echo "\"$TAGMANIFEST\",$tagsize,$tagsum,SHA-256" >> "$bag.csv"
        gzip -c "$bag.csv" > "$bag.csv.gz"

        printf '%s csv: ' "$bag"
        find "$bag.csv" -printf '%s\n'
    done
}

# generate a csv file
generate_csv

# register the bag, files, and staging
curl --user "${INGEST_USER}:${INGEST_PASSWORD}" --header "Content-Type: application/json" --data "${BAG_JSON}" "${INGEST_BAG_CREATE}"

# the bag id needs to be retrieved from the response above and injected into the next 2 calls
printf '\nEnter the id of the bag created above: '
read -r BAG_ID
file_url=$(printf '%s' "$INGEST_FILE_CREATE" | sed "s/{id}/$BAG_ID/")
staging_url=$(printf '%s' "$INGEST_STAGING_CREATE" | sed "s/{id}/$BAG_ID/")

# BAGS holds a single bag in this example, so its csv is the one uploaded
curl --user "${INGEST_USER}:${INGEST_PASSWORD}" -F "file=@${BAGS}.csv;type=text/csv" "$file_url"
curl --user "${INGEST_USER}:${INGEST_PASSWORD}" --header "Content-Type: application/json" --data "${STAGING_JSON}" "$staging_url"
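
The script notes that the bag id has to be retrieved before the file and staging calls can be made. One way this might be done is to read it from the response to the create call. The sketch below assumes that response is a JSON object with a numeric "id" field, which is not confirmed here; the helper name `extract_id` and the sample response are made up for illustration.

```shell
#!/bin/sh
# Pull the first numeric "id" value out of a JSON response.
# extract_id is a hypothetical helper; the response format is an assumption.
extract_id() {
    # match "id": <digits>, keep the first match, then keep only the digits
    printf '%s\n' "$1" \
        | grep -o '"id"[[:space:]]*:[[:space:]]*[0-9][0-9]*' \
        | head -n 1 \
        | grep -o '[0-9][0-9]*$'
}

# sample response, invented for this example
response='{"id": 42, "name": "scripted-bag-0", "depositor": "script-depositor"}'
extract_id "$response"
```

The resulting value can then be substituted for {id} in the file and staging URLs. A JSON-aware tool such as jq would be more robust than pattern matching if it is available on the host.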

Local Tokenization

Note: To disable local tokenization, ingest.tokenizer.enabled must be set to false in application.yml; otherwise the Ingest Server will attempt to create Beans for tokenization and, depending on the configuration, may fail to start.

Open Questions

  • How do we handle error’d bags? (hold, reject, ??)
    • We bag the packages ourselves, so we should get no bags with errors
  • Do we have a record of all the collections and their states as they move through to replication? We need to be able to retrieve this data, including any failures.