DataStaR: Data Staging Repository
"The purpose of DataStaR is to support collaboration and data sharing among researchers during the research process, and to promote publishing or archiving data and high-quality metadata to discipline-specific data centers, and/or to Cornell's own digital repository." (see DataStaR: An Institutional Approach to Research Data Curation)
...
Kepler workflows were developed to illustrate how Kepler might be used as an accessioner's workbench.
Retrieves a CSV datastream from an object in a Fedora Repository, processes date columns to standardize their format and saves the results to a local file in CSV format.
...
...
...
...
Retrieves a CSV datastream from an object in a Fedora Repository, splits columns containing both latitude and longitude coordinates into two separate columns and saves the results to a local file in CSV format.
...
...
...
...
Retrieves a CSV datastream from an object in a Fedora Repository, processes latitude columns to standardize their format and saves the results to a local file in CSV format.
...
...
...
...
Retrieves a CSV datastream from an object in a Fedora Repository, processes longitude columns to standardize their format and saves the results to a local file in CSV format.
...
...
...
...
A number of new actors were created that provide data access and accessioning functionality.
...
This actor writes a log file with a summary of changes made during the latest run of the workflow.
...
...
...
The "Change Log Writer" PythonActor is used in the following workflows:
...
...
...
CSVDatastreamDisseminationActor.py
...
The "CSV Datastream Dissemination" PythonActor is used in the following workflows:
This actor writes a log file with a summary of errors encountered during the latest run of the workflow.
...
...
...
The "Error Log Writer" PythonActor is used in the following workflows:
This actor looks at each column that contains a date value and removes extraneous time data when present.
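The time-stripping step described above might look like the following minimal sketch. It is not the actor's actual code: the function names `strip_time` and `normalize_dates` are hypothetical, the accepted time formats are an assumption, and column indexes are assumed to be zero-based.

```python
import re

# Assumption: the time follows the date after a space or a 'T',
# e.g. "2009-06-15 00:00:00" or "2009-06-15T13:45:00Z".
_TIME_SUFFIX = re.compile(
    r"[ T]\d{1,2}:\d{2}(:\d{2}(\.\d+)?)?"  # time of day, seconds optional
    r"\s*([AaPp][Mm])?"                     # optional AM/PM marker
    r"\s*(Z|[+-]\d{2}:?\d{2})?$"            # optional UTC designator or offset
)


def strip_time(value):
    """Return the date part of a value, dropping any trailing time."""
    return _TIME_SUFFIX.sub("", value.strip())


def normalize_dates(row, indexes):
    """Strip time data from the given (zero-based) columns of one row."""
    cleaned = list(row)
    for i in indexes:
        cleaned[i] = strip_time(cleaned[i])
    return cleaned
```

Values that carry no time component pass through unchanged, so the function can be applied to every date column without checking first.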
...
...
...
...
...
...
...
...
The "Normalize Date" script also needs a list of indexes for the columns that contain dates. This can be acquired in one of two ways:
A string parameter on the PythonActor named 'indexes'
OR
A port named 'indexes' containing a StringToken.
In either case, the string must contain either a comma-separated list of column numbers or a formula describing a regular sequence that can be used to generate the list. The format of the formula is START + INCREMENT * COUNT. For example, the formula 7+4*10 means that the list contains 10 columns, with dates occurring every 4 columns starting at column 7. This would generate the list 7,11,15,19,23,27,31,35,39,43.
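As an illustration of the two accepted forms of the 'indexes' string, the following sketch expands either one into a list of column numbers. The function name `parse_indexes` is hypothetical and the parsing rules are an assumption based only on the description above, not the actor's actual implementation.

```python
def parse_indexes(spec):
    """Expand an 'indexes' string into a list of column numbers.

    Accepts a comma-separated list ("7,11,15") or a formula of the form
    START+INCREMENT*COUNT ("7+4*10").
    """
    spec = spec.replace(" ", "")
    if "+" in spec and "*" in spec:
        start, rest = spec.split("+", 1)
        increment, count = rest.split("*", 1)
        start, increment, count = int(start), int(increment), int(count)
        # COUNT values, beginning at START and stepping by INCREMENT.
        return [start + increment * i for i in range(count)]
    return [int(part) for part in spec.split(",") if part]


print(parse_indexes("7+4*10"))
# prints [7, 11, 15, 19, 23, 27, 31, 35, 39, 43]
```

The same expansion would apply to the 'indexes' string of the other actors that accept this formula.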
...
...
The "Normalize Date" PythonActor is used in the following workflows:
This actor looks at each column that contains a latitude value and ensures that all values are valid and in the same format.
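The source does not state which input formats the actor accepts or which output format it produces, so the following is only a sketch under stated assumptions: input is decimal degrees with an optional N/S suffix, output is signed decimal degrees with a fixed number of decimal places, and invalid values raise `ValueError`. The function name `normalize_latitude` is hypothetical.

```python
def normalize_latitude(value, places=5):
    """Return a latitude as signed decimal degrees with fixed precision.

    Assumed input: decimal degrees, optionally followed by 'N' or 'S'.
    Raises ValueError for non-numeric text or values outside -90..90.
    """
    text = value.strip().upper()
    sign = 1.0
    if text and text[-1] in "NS":
        if text[-1] == "S":
            sign = -1.0  # southern hemisphere becomes negative
        text = text[:-1].strip()
    degrees = sign * float(text)  # float() raises ValueError on bad text
    if not -90.0 <= degrees <= 90.0:
        raise ValueError("latitude out of range: %r" % value)
    return "%.*f" % (places, degrees)
```

A longitude check could follow the same pattern with E/W suffixes and a range of -180 to 180.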
...
...
...
...
...
...
...
...
The "Normalize Latitude" script also needs a list of indexes for the columns that contain latitudes. This can be acquired in one of two ways:
A string parameter on the PythonActor named 'indexes'
OR
A port named 'indexes' containing a StringToken.
In either case, the string must contain either a comma-separated list of column numbers or a formula describing a regular sequence that can be used to generate the list. The format of the formula is START + INCREMENT * COUNT. For example, the formula 7+4*10 means that the list contains 10 columns, with latitudes occurring every 4 columns starting at column 7. This would generate the list 7,11,15,19,23,27,31,35,39,43.
...
...
The "Normalize Latitude" PythonActor is used in the following workflows:
This actor looks at each column that contains a longitude value and ensures that all values are valid and in the same format.
...
...
...
...
NormalizeLongitudeActor.py
...
...
...
...
The "Normalize Longitude" script also needs a list of indexes for the columns that contain longitudes. This can be acquired in one of two ways:
A string parameter on the PythonActor named 'indexes'
OR
A port named 'indexes' containing a StringToken.
In either case, the string must contain either a comma-separated list of column numbers or a formula describing a regular sequence that can be used to generate the list. The format of the formula is START + INCREMENT * COUNT. For example, the formula 7+4*10 means that the list contains 10 columns, with longitudes occurring every 4 columns starting at column 7. This would generate the list 7,11,15,19,23,27,31,35,39,43.
...
...
The "Normalize Longitude" PythonActor is used in the following workflows:
This actor sorts through the output created by a RowAnalysis script and routes the data to the proper output writer.
...
...
...
...
...
...
The "Output Prep" script also needs the character to be used as a separator between columns in the output text string. This can be acquired in one of two ways:
A string parameter on the PythonActor named 'separator'
OR
A port named 'separator' containing a StringToken.
In either case, the string must contain the character to use as the column separator, such as ','.
The "Output Prep" PythonActor is used in the following workflows:
This actor splits a text string representing a 'row' into 'columns' using a separator character such as ','.
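A minimal sketch of this kind of split is shown below. It is not the actor's code: the function name `row_to_columns` is hypothetical, and the use of Python's `csv` module rather than a plain `str.split` is a substitution made here so that separators inside quoted fields are not treated as column breaks.

```python
import csv


def row_to_columns(row, separator=","):
    """Split one text row into a list of column values.

    csv.reader is used so that a quoted field such as "b,c" stays whole.
    """
    return next(csv.reader([row], delimiter=separator))


print(row_to_columns('a,"b,c",d'))
# prints ['a', 'b,c', 'd']
```

If the data never contains quoted fields, `row.split(separator)` would give the same columns with less machinery.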
...
...
...
...
...
...
...
The "Row To Columns" PythonActor is used in the following workflows:
Kepler workflows were developed to illustrate how Kepler might be used as an accessioner's workbench.
Retrieves a CSV datastream from an object in a Fedora Repository, processes the date column to standardize the format and saves the results to a local file in CSV format.
...
...
...
...
This actor looks for columns that contain both latitude and longitude coordinates and splits them into two separate columns. It leaves the latitude in the original column and moves the longitude to a new column immediately next to the latitude.
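The column split described above can be sketched as follows. The function name `split_lat_long` is hypothetical, and two details are assumptions not stated in the source: the two coordinates within a cell are separated by a space, and column indexes are zero-based.

```python
def split_lat_long(row, indexes, delimiter=" "):
    """Split combined lat/long cells into two adjacent columns.

    The latitude stays in the original column; the longitude goes into a
    new column inserted immediately to its right.
    """
    result = list(row)
    # Process right to left so that inserting a column does not shift
    # the positions of columns that have not been handled yet.
    for i in sorted(indexes, reverse=True):
        lat, _, lon = result[i].strip().partition(delimiter)
        result[i] = lat.strip()
        result.insert(i + 1, lon.strip())
    return result


print(split_lat_long(["site1", "42.45 -76.48", "x"], [1]))
# prints ['site1', '42.45', '-76.48', 'x']
```

Because each split adds a column, any later step that refers to columns by number has to account for the shifted positions.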
...
...
...
...
...
...
...
...
The "Split Lat/Long" script also needs a list of indexes for the columns that contain lat/long coordinates. This can be acquired in one of two ways:
A string parameter on the PythonActor named 'indexes'
OR
A port named 'indexes' containing a StringToken.
In either case, the string must contain either a comma-separated list of column numbers or a formula describing a regular sequence that can be used to generate the list. The format of the formula is START + INCREMENT * COUNT. For example, the formula 7+4*10 means that the list contains 10 columns, with coordinates occurring every 4 columns starting at column 7. This would generate the list 7,11,15,19,23,27,31,35,39,43.
...
...
The "Split Lat/Long" PythonActor is used in the following workflows:
...
...