Comment: Completed FCRepoDateNormalizer workflow and added all of its actors.

...

  • Rectangularity - this is very interactive, which makes it a poor fit for Kepler
  • Column headings - again, listing problem headings is not an issue, but "allow edits" is too interactive
  • Data quality control - Kepler can certainly create histograms or scatter plots of the data, but it offers no way to select data values and correct them interactively.

Project Components

Kepler Actors

A number of new actors were created that provide data access and accessioning functionality.

The PythonActor was used extensively in this project. It is based on the Jython interpreter, which provides standard Python functionality within a Java application. Because Jython is implemented in Java, it also provides access to any Java class or class library available to the JVM. This makes it a rapid prototyping tool that supports coding in both Java and Python.

Change Log Writer Actor

This actor writes a log file with a summary of changes made during the latest run of the workflow.

...

Source file :

...

ChangeLogWriterActor.py

...

Input ports :

...

  • changes : ObjectToken containing a Python tuple or Java array with 2 items :
    1. the current row number as an integer.
    2. a list of changes made.
  • filename : StringToken containing the fully qualified path for the change log file.

...

Output ports :

...

  • None

The "Change Log Writer" PythonActor is used in the following workflows:

  • FCRepoDateNormalizer
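
The logging step described above can be sketched in plain Python. This is an illustration only, not the contents of ChangeLogWriterActor.py: the function names and the "row N: ..." line format are assumptions, and the code is not wired to the PythonActor port API.

```python
def format_change_entries(changes):
    """Turn a (row_number, list_of_changes) pair into log lines.

    The 'row N: ...' layout is an assumed format for illustration.
    """
    row_number, change_list = changes
    return ["row %d: %s" % (row_number, change) for change in change_list]


def append_to_change_log(filename, changes):
    """Append the entries for one row to the log file; return the line count."""
    lines = format_change_entries(changes)
    with open(filename, "a") as log:
        for line in lines:
            log.write(line + "\n")
    return len(lines)
```

Opening the file in append mode lets the function be called once per row, so the log builds up as tokens arrive rather than being held in memory.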

CSV Datastream Dissemination Actor

This actor performs several functions.

  1. It displays a form dialog that allows the user to enter parameters required to connect to a Fedora Repository.
  2. It connects to a Fedora Repository and extracts a datastream dissemination from the object specified in the form.
  3. It breaks the datastream into rows and sends them, one at a time, to the next actor in the workflow.
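
Step 3 can be sketched as a small generator. This is a minimal illustration that assumes the dissemination has already been retrieved as a text string; the function name `iter_rows` and the choice to skip blank lines are assumptions, and the Fedora connection itself is not shown.

```python
def iter_rows(datastream_text):
    """Yield the non-empty rows of a CSV datastream, one at a time."""
    for line in datastream_text.splitlines():
        if line.strip():  # assumption: blank lines carry no data
            yield line


sample = "date,value\n2009-01-01 00:00:00,4\n\n2009-01-02 00:00:00,7\n"
for row in iter_rows(sample):
    print(row)
```

Splitting on line boundaries is only safe when no quoted CSV field contains an embedded newline.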

...

Screenshot :

...

Image Added

...

Required Packages :

...

  • FCRepoKepler

...

Source file :

...

CSVDatastreamDisseminationActor.py

...

Input ports :

...

  • None

...

Output port :

...

  • dissemination : StringToken containing a single row from the CSV datastream.

The "CSV Datastream Dissemination" PythonActor is used in the following workflows:

  • FCRepoDateNormalizer

Error Log Writer Actor

This actor writes a log file with a summary of errors encountered during the latest run of the workflow.

...

Source file :

...

ErrorLogWriterActor.py

...

Input ports :

...

  • error: ObjectToken containing a Python tuple or Java array with 2 items :
    1. the current row number as an integer.
    2. a list of errors encountered.
  • filename : StringToken containing the fully qualified path for the error log file.

...

Output ports :

...

  • None

The "Error Log Writer" PythonActor is used in the following workflows:

  • FCRepoDateNormalizer

Normalize Date Actor

This actor looks at each column that contains a date value and removes extraneous time data when present. It is based on the RowAnalyzer class in fcrepo.kepler.RowAnalyzer.

...

Required Packages :

...

  • FCRepoKepler

...

Source file :

...

NormalizeDateActor.py

...

Input port :

...

  • input : ObjectToken containing a Python tuple or Java array with 2 items :
    1. the current row number as an integer.
    2. an ordered list of columns in the row.

...

Output port :

...

  • output : ObjectToken containing a Python tuple or Java array with 4 items :
    1. the current row number as an integer.
    2. a tuple/array of values for each column in the row.
    3. a tuple/array of changes made.
    4. a tuple/array of errors encountered.

The "Normalize Date" script also needs to get a list of indexes for the columns that contain dates. This can be done in one of two ways:

  • A string parameter on the PythonActor named 'indexes'
    OR
  • A port named 'indexes' containing a StringToken.
    In either case, the string must contain a comma-separated list of column numbers.
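
A rough sketch of how the 'indexes' string and the input and output tuples fit together is shown below. It is not the RowAnalyzer implementation: the function names are hypothetical, column numbers are assumed to be zero-based, and dates are assumed to be ISO-like strings whose time part follows a space or a 'T'.

```python
def parse_indexes(indexes):
    """Parse a comma-separated string such as '0, 2' into a list of ints."""
    return [int(part) for part in indexes.split(",") if part.strip()]


def strip_time(value):
    """Drop a trailing time component from an ISO-like date string."""
    text = value.strip()
    for marker in ("T", " "):
        if marker in text:
            return text.split(marker, 1)[0]
    return text


def normalize_row(row, indexes):
    """Map (row_number, columns) to (row_number, values, changes, errors)."""
    row_number, columns = row
    values = list(columns)
    changes, errors = [], []
    for index in parse_indexes(indexes):
        if index < 0 or index >= len(values):
            errors.append("column %d does not exist" % index)
            continue
        cleaned = strip_time(values[index])
        if cleaned != values[index]:
            changes.append(
                "column %d: '%s' -> '%s'" % (index, values[index], cleaned))
            values[index] = cleaned
    return (row_number, tuple(values), tuple(changes), tuple(errors))
```

Returning the changes and errors alongside the cleaned values keeps the actor free of file handling, which is left to the log writer actors downstream.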

The "Normalize Date" PythonActor is used in the following workflows:

  • FCRepoDateNormalizer

Output Prep Actor

This actor sorts through the output created by a RowAnalysis script and routes the data to the proper output writer.

...

Source file :

...

OutputPrepActor.py

...

Input port :

...

  • input : ObjectToken containing a Python tuple or Java array with 4 items :
    1. the current row number as an integer.
    2. a tuple/array of values for each column in the row.
    3. a tuple/array of changes made.
    4. a tuple/array of errors encountered.

...

Output ports :

...

  • output : StringToken containing a string representing the output row in a CSV file. It is constructed by concatenating the values in the columns array using a separator character.
  • changes : ObjectToken containing a tuple with 2 items :
    1. the current row number as an integer.
    2. the tuple/array of changes made received on the input port.
  • errors : ObjectToken containing a tuple with 2 items :
    1. the current row number as an integer.
    2. the tuple/array of errors encountered received on the input port.

The "Output Prep" script also requires the PythonActor to have a parameter named 'separator' that contains the character to be used as a separator between columns in the text string constructed for the row.
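
The routing described above amounts to unpacking the 4-item input and building three values, one per output port. The following is a minimal sketch under that reading; `prepare_output` is a hypothetical name, and the code does not use the PythonActor port API.

```python
def prepare_output(result, separator=","):
    """Split a 4-item analysis result into the three output values.

    Returns (output_row, changes_pair, errors_pair), matching the
    'output', 'changes' and 'errors' ports described above.
    """
    row_number, values, changes, errors = result
    output_row = separator.join(str(value) for value in values)
    return (output_row,
            (row_number, tuple(changes)),
            (row_number, tuple(errors)))
```

This sketch does no quoting, so a value that itself contains the separator would produce a malformed CSV row.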

The "Output Prep" PythonActor is used in the following workflows:

  • FCRepoDateNormalizer

Row To Columns Actor

This actor splits a text string representing a 'row' into 'columns' using a separator character such as ','.

...

Source file :

...

RowToColumnsActor.py

...

Input port :

...

  • row: StringToken containing a string representation of a single row in a spreadsheet or other data matrix.

...

Output port :

...

  • columns : ObjectToken containing a Python tuple or Java array with 2 items :
    1. the current row number as an integer.
    2. a tuple containing an ordered list of values for each column in the row.

The "Row To Columns" script also requires the PythonActor to have a parameter named 'separator' that contains the character that was used as a separator between columns in the input text string.
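
The split itself can be sketched in a few lines. The class name `RowSplitter` is hypothetical, and numbering rows from 1 in arrival order is an assumption; the source does not say how the actor derives the row number.

```python
class RowSplitter(object):
    """Split row strings into columns and number the rows as they arrive."""

    def __init__(self, separator=","):
        self.separator = separator
        self.row_number = 0  # assumption: rows are counted from 1

    def split(self, row):
        """Return (row_number, tuple_of_column_values) for one row string."""
        self.row_number += 1
        columns = tuple(row.split(self.separator))
        return (self.row_number, columns)


splitter = RowSplitter(",")
print(splitter.split("2009-01-01 00:00:00,4"))
```

A plain `split` keeps empty columns in place, so column positions stay aligned with the 'indexes' setting used by later actors.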

The "Row To Columns" PythonActor is used in the following workflows:

  • FCRepoDateNormalizer

Kepler Workflows

...

Kepler workflows were developed to illustrate how Kepler might be used as an accessioner's workbench.

FCRepoDateNormalizer

Retrieves a CSV datastream from an object in a Fedora Repository, processes the date column to standardize the format, and saves the results to a local file in CSV format.

...

Screenshot :

...

...

Source file :

...

FCRepoDateNormalizer.xml