Ingest

The Ingest service sits between the Intake and Replication services. Its main goal is to validate the data that arrives in bags and to create ACE tokens for that data.

Installation

Prereqs

PostgreSQL - The ingest service connects to a PostgreSQL database to store information about transfers, bags, and tokens.
Staging areas - The ingest service needs to see which collections have been staged, and needs an area in which to store tokens during transfer.
SSH public keys - Needed from each node that will be replicating content.

RPM

Download the RPM (a yum repository may be provided later).
Install it with yum: `yum install ingest-server-$version.rpm`

The installed files are as follows:

Ingest Files

[~] $ rpm -ql ingest-server
/etc/chronopolis
/etc/chronopolis/application.properties
/etc/init.d/ingest-server
/usr/lib/chronopolis
/usr/lib/chronopolis/ingest-server.jar
/var/log/chronopolis

A 'chronopolis' user is also created as part of the install process and can write to /var/log/chronopolis. This user should also be able to read from and write to the staging areas.

Database Setup

Download the schema from the CI server

Create a PostgreSQL user for the database: `CREATE USER chron WITH PASSWORD 'secret-password';`
Create the ingest database: `CREATE DATABASE ingest;`
Grant permissions on the database to the new user: `GRANT ALL PRIVILEGES ON DATABASE ingest TO chron;`
Connect to the ingest database as the new user, either from within the psql shell or by reconnecting: `psql -U chron ingest`
Load the SQL script into the database: `\i /path/to/schema.sql`

Example PostgreSQL Setup

psql (8.4.20)
Type "help" for help.
postgres=# CREATE USER chron WITH PASSWORD 'secret-password';
CREATE ROLE
postgres=# CREATE DATABASE ingest;
CREATE DATABASE
postgres=# GRANT ALL PRIVILEGES ON DATABASE ingest to chron;
GRANT
postgres=# \c ingest;
psql (8.4.20)
You are now connected to database "ingest".
ingest=# SET ROLE chron;
SET
ingest=> \i /tmp/schema_pg.sql

Preparing the DB for Schema Migrations

As of version 1.1.0, the database has a schema_version table for handling schema migrations. Flyway manages this table automatically, so the server can be upgraded without manually applying patches. Flyway also provides a command-line tool which we can use to prepare the database for migrations; this step can be applied to databases from previous versions as well.

Download and untar/unzip the Flyway Command Line Tool

Edit conf/flyway.conf:
- Set the database connection properties; they follow the same pattern as the connection settings in our application properties.
- Set flyway.baselineVersion to the version you are baselining, using the MAJOR.MINOR number of the ingest server version.

Flyway Configuration Example

#
# Copyright 2010-2015 Axel Fontaine
#
...

# Jdbc url to use to connect to the database
flyway.url=jdbc:postgresql://localhost/ingest

# Fully qualified classname of the jdbc driver (autodetected by default based on flyway.url)
# flyway.driver=

# User to use to connect to the database (default: <<null>>)
flyway.user=chron

# Password to use to connect to the database (default: <<null>>)
flyway.password=my-postgresql-password

...
flyway.baselineVersion=1.5
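
The baseline version is the MAJOR.MINOR part of the ingest server version. As a minimal sketch, assuming the installed version is a dotted string such as 1.5.2, it can be derived as follows. The function name and the sample version are hypothetical and not part of the ingest server or Flyway.

```python
def baseline_version(server_version):
    """Return the MAJOR.MINOR prefix of a dotted version string.

    Hypothetical helper: assumes the version looks like MAJOR.MINOR[.PATCH].
    """
    parts = server_version.strip().split(".")
    if len(parts) < 2 or not (parts[0].isdigit() and parts[1].isdigit()):
        raise ValueError(f"expected MAJOR.MINOR[.PATCH], got {server_version!r}")
    return f"{parts[0]}.{parts[1]}"


# Example with a made-up installed version; the result is the value
# to place after "flyway.baselineVersion=" in conf/flyway.conf.
line = "flyway.baselineVersion=" + baseline_version("1.5.2")
```

Here `line` holds the text to place in conf/flyway.conf for that version.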

Use the flyway bash script to baseline the database:

Flyway Baseline Migration
```
$ ./flyway baseline
```

Optional Data Loading

This information is now outdated; a new method needs to be created for bootstrapping data with encrypted passwords.

There is also a SQL script which contains entries for the nodes and users in Chronopolis, as well as their roles. The script is found on Jenkins and can be loaded with `\i /path/to/data.sql`. If it is not used, users and nodes will need to be created manually. When no users are found at startup, an admin user is created with a default password of 'admin'; further users can then be created through the web UI.

Configuration

The ingest server reads from the /etc/chronopolis/application.properties configuration file:

application.properties

# Sample application.properties

## Staging areas
chron.stage.bags=/export/outgoing/bags
chron.stage.tokens=/export/outgoing/tokens
ingest.replication.server=stage.chronopolis.org
ingest.replication.user=chronopolis

## Database Connection
spring.datasource.url=jdbc:postgresql://localhost/ingest
spring.datasource.username=chron
spring.datasource.password=secret-password

### Needed so that we don't try to load the schema/data
spring.datasource.initialize=false

## Specify that we are running production services
spring.profiles.active=production

## SSL Configuration
# server.port = 8443
# server.ssl.key-store = file:/path/to/keystore.jks
# server.ssl.key-store-password = secret
# server.ssl.key-password = another-secret

# Logging
logging.path=/var/log/chronopolis/
logging.file=/var/log/chronopolis/ingest.log
logging.level.org.springframework=ERROR
logging.level.org.hibernate=ERROR
logging.level.org.chronopolis=DEBUG

# SMTP Configuration

# smtp.host=localhost.localdomain
# smtp.to=chron-support@sdsc.edu
# smtp.from=localhost
# smtp.send=false
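
Before starting the server it can help to confirm that the properties file sets the keys shown in the sample above. The following is a minimal sketch, not part of the ingest server: it handles only simple `key=value` lines and `#` comments rather than the full Java properties syntax, and the list of required keys is an assumption taken from the sample.

```python
# Assumption: these keys are treated as required because they appear
# uncommented in the sample application.properties.
REQUIRED_KEYS = (
    "chron.stage.bags",
    "chron.stage.tokens",
    "ingest.replication.server",
    "ingest.replication.user",
    "spring.datasource.url",
    "spring.datasource.username",
    "spring.datasource.password",
)


def load_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_keys(props, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not props.get(key)]
```

A possible use is to read /etc/chronopolis/application.properties, pass its contents to `load_properties`, and report anything returned by `missing_keys`.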

As of version 2.0, we are moving to a YAML-based configuration. It looks similar to the above, with a few property changes being propagated through the various services so that they all use the same names.

application.yml

# Ingest Configuration Properties

# Ingest Cron job properties
# tokens: the rate at which to check for bags which have all tokens and can have a Token Store written
# request: the rate at which to check for bags which need their initial replications created
ingest.cron: 
  tokens: 0 0/10 * * * *
  request: 0 0/10 * * * *

# Ingest AJP Settings
# enabled: flag to enable an AJP connector
# port: the port for the connector to listen on
ingest.ajp:
  enabled: false
  port: 8009

# The staging area for writing Token Stores. Nonposix support not yet implemented.
## id: The id of the StagingStorage in the Ingest server
## path: The path to the filesystem on disk
chron.stage.tokens.posix.id: -1
chron.stage.tokens.posix.path: /dev/null

# Database connection
# Initialize should be kept false so that the server does not attempt to run a drop/create on the tables
spring.datasource:
  url: jdbc:postgresql://localhost/ingest
  username: postgres
  password: dbpass
  initialize: false

# Specify the active profile for loading various services, normally production
spring.profiles.active: production

# debug: true
server.port: 8000

# Logging properties
logging.file: ingest.log
logging.level:
  org.springframework: ERROR
  org.hibernate: ERROR
  org.chronopolis: DEBUG
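
The ingest.cron values are cron expressions. Assuming they are handed to Spring's scheduler, each expression needs six space-separated fields (second, minute, hour, day of month, month, day of week), one more than a Unix crontab entry. The sketch below only checks the field count and labels the fields; the function name is hypothetical and it does not validate the field values themselves.

```python
SPRING_CRON_FIELDS = (
    "second", "minute", "hour", "day_of_month", "month", "day_of_week",
)


def split_cron(expression):
    """Map each field name to its value, or raise if the count is wrong."""
    fields = expression.split()
    if len(fields) != len(SPRING_CRON_FIELDS):
        raise ValueError(
            f"expected {len(SPRING_CRON_FIELDS)} fields, "
            f"found {len(fields)} in {expression!r}"
        )
    return dict(zip(SPRING_CRON_FIELDS, fields))


# "0 0/10 * * * *": second 0 of every tenth minute, i.e. every ten minutes.
tokens_schedule = split_cron("0 0/10 * * * *")
```

An expression with only five fields makes `split_cron` raise a `ValueError`, which is a cheap way to catch a dropped field before the server starts.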

Notes on Configuration

The ingest.replication properties are used to build the URIs for replication. For example, replicating the collection "Scientific_Data" from the depositor "ucsd-researchers" gives:
- rsync-bag: chronopolis@stage.chronopolis.org:/export/outgoing/bags/ucsd-researchers/Scientific_Data
- rsync-tokens: chronopolis@stage.chronopolis.org:/export/outgoing/tokens/ucsd-researchers/Scientific_Data-tokens-2015-02-11

An AJP connector can now be configured with the server, meaning SSL can be served through Apache httpd instead of a Java keystore.
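
The replication URIs in the example above can be reproduced with a short sketch. The function names are hypothetical, and the layout (user@server:stage/depositor/collection, with a "-tokens-" suffix followed by an ISO date for the token store) is inferred from the two example URIs rather than taken from the server code.

```python
from datetime import date


def bag_uri(user, server, bag_stage, depositor, collection):
    """Build the rsync URI for a staged bag."""
    return f"{user}@{server}:{bag_stage.rstrip('/')}/{depositor}/{collection}"


def token_uri(user, server, token_stage, depositor, collection, created):
    """Build the rsync URI for a token store written on the given date."""
    name = f"{collection}-tokens-{created.isoformat()}"
    return f"{user}@{server}:{token_stage.rstrip('/')}/{depositor}/{name}"


# Values taken from the sample application.properties and the example above.
bag = bag_uri("chronopolis", "stage.chronopolis.org",
              "/export/outgoing/bags", "ucsd-researchers", "Scientific_Data")
tokens = token_uri("chronopolis", "stage.chronopolis.org",
                   "/export/outgoing/tokens", "ucsd-researchers",
                   "Scientific_Data", date(2015, 2, 11))
```

With these inputs, `bag` and `tokens` match the rsync-bag and rsync-tokens values listed above.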

Running

The ingest server runs as an executable jar. The init script lets root start and stop the server: `service ingest-server start`

Administration

Resetting Passwords

As of version 1.4.0, user passwords are encoded using bcrypt. If a user forgets their password, we will need to reset it for them. As we do not have email notifications or anything similar set up, for the moment everything needs to be done manually. First run the new password through a bcrypt encoder, which can be found online. If you aren't sure how many rounds to use, check the database: the round count is kept as part of the encoded value, e.g. $2a$08 uses 8 rounds and $2a$10 uses 10 rounds.

Then we connect to the database and issue a simple update:

User update

 UPDATE users SET password = '$2a$10$hEYYHV/Fri00RRHjWPIAWuH3NxYpPPjbMU5OsJfH1SAenajQqKjhK' WHERE username = 'user_resetting_password';
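
The round count described above can be read directly from an existing password value. This is a minimal sketch with a hypothetical function name; it assumes the stored value starts with the usual bcrypt prefix of a version marker followed by a two-digit round count, as in the examples above.

```python
import re

# Matches prefixes such as "$2a$10$": a version marker, then two digits.
BCRYPT_PREFIX = re.compile(r"^\$2[abxy]?\$(\d{2})\$")


def bcrypt_rounds(encoded):
    """Return the round count embedded in a bcrypt-encoded password."""
    match = BCRYPT_PREFIX.match(encoded)
    if match is None:
        raise ValueError("value does not look like a bcrypt hash")
    return int(match.group(1))


# The value used in the UPDATE statement above.
rounds = bcrypt_rounds(
    "$2a$10$hEYYHV/Fri00RRHjWPIAWuH3NxYpPPjbMU5OsJfH1SAenajQqKjhK"
)
```

Running this against a value already stored in the users table shows which round count to use when encoding the replacement password.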

Open Questions

How do we handle bags with errors? (hold, reject, ??)
- We bag the packages ourselves, so we should get no bags with errors
What about malformed digests?
- The ingest-server stores a token for each valid digest, and ignores all others. Either manifest is not digested until 100% of the files are valid. Requests are not made until tokens have been created for each file in the bag.
Do we have a record of all the collections and their states as they move through to replication? We need to be able to retrieve this data, including any failures.
