History

The Harvester began life as a specialized ETL tool meant to ease the process of data ingest into VIVO. It has transformed into a general semantic ETL tool.

Introduction

The VIVO Harvester is a library of tools designed to read and transform data from external data sources and ingest it into VIVO or, potentially, any other triplestore or semantic platform. The library was originally developed by the Harvester Team at the University of Florida during the 2009-2011 NIH grant. Development of the Harvester follows a monthly release cycle: new features are built in the first two to three weeks of the cycle, and testing and release take place during the third and fourth weeks. Use the links below to learn more about the individual tools that make up the Harvester.

The VIVO Harvester is currently maintained on GitHub by John Fereira from Cornell as part of VIVO-related projects including AgriVIVO and USDA VIVO.  Other contributions to ongoing Harvester enhancements have been made by Alex Viggio through Symplectic, Ltd.

Source

Architecture and flow

The VIVO Harvester is a collection of small Java tools which are meant to be strung together in various ways to create a harvest process that is custom-tailored to your needs and (importantly) repeatable. This architecture makes the Harvester extremely versatile, but at the same time presents a bit of a learning curve.
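As a rough illustration of what "strung together" means, the sketch below chains two small steps so that the output of one becomes the input of the next. It is written in Python for brevity and does not use the Harvester's Java tools or their interfaces; the `fetch` and `translate` functions, the record fields, and the `example.org` URI are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Step = Callable[[List[Record]], List[Record]]


def fetch(_: List[Record]) -> List[Record]:
    # Hypothetical stand-in for a fetch step: return raw source records.
    return [{"name": " Ada Lovelace ", "source_id": "42"}]


def translate(records: List[Record]) -> List[Record]:
    # Hypothetical stand-in for a translate step: reshape raw records
    # into the form that later steps expect.
    return [
        {
            "uri": "http://example.org/harvest/" + r["source_id"],
            "label": r["name"].strip(),
        }
        for r in records
    ]


def run_pipeline(steps: List[Step]) -> List[Record]:
    # Feed the output of each step into the next, starting from nothing.
    data: List[Record] = []
    for step in steps:
        data = step(data)
    return data


result = run_pipeline([fetch, translate])
```

Because each step only agrees on the shape of the data passed between them, steps can be reordered, dropped, or swapped for a particular harvest, and running the same list of steps again repeats the same process.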

We highly recommend that, before embarking on a data ingest process, you become familiar with the basics of semantic technologies, including RDF and ontologies, and that you download and install the VIVO software. Try entering sample data ranging from people and their affiliations to publications, grants, or awards and honors, then export the RDF from VIVO to see what it looks like. For many people, having an example is much more intuitive than interpreting ontology diagrams or writing RDF directly.

The following vignettes attempt to follow the steps of a "typical" harvest, with a focus primarily on functionality rather than configuration or execution.

Fetch

(Excerpt included from the Fetch page.)

Translate

(Excerpt included from the Translate page.)

Score 

(Excerpt included from the Score page.)

Match

(Excerpt included from the Match page.)

Change Namespace

(Excerpt included from the ChangeNamespace page.)

Update

This step allows for multiple Harvester runs in succession to recognize data that has been modified since the previous run and update accordingly. A "previous harvest model" is created, which on the first run contains all the data imported on that run. On subsequent runs, this is compared with the new data to determine triples that have been removed or added since the last run. This comparison is made by the Diff tool, and the output is an "Additions file" and a "Subtractions file", containing RDF/XML data that should be added and removed, respectively, from VIVO.
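The comparison described above can be pictured as two set differences over triples. The following is a minimal sketch in Python, not the Diff tool's actual implementation: it assumes each model is a plain set of (subject, predicate, object) tuples with no blank nodes, and the URIs, predicates, and the `diff_models` name are invented for illustration.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def diff_models(
    previous: Set[Triple], incoming: Set[Triple]
) -> Tuple[Set[Triple], Set[Triple]]:
    # Triples present now but not in the previous harvest must be added.
    additions = incoming - previous
    # Triples in the previous harvest that have disappeared must be removed.
    subtractions = previous - incoming
    return additions, subtractions


PERSON = "http://example.org/person/1"  # hypothetical URI

previous_harvest = {
    (PERSON, "rdfs:label", "Ada Lovelace"),
    (PERSON, "ex:email", "ada@old.example.org"),
}
new_harvest = {
    (PERSON, "rdfs:label", "Ada Lovelace"),
    (PERSON, "ex:email", "ada@new.example.org"),
}

additions, subtractions = diff_models(previous_harvest, new_harvest)
```

Here the unchanged label appears in neither output, the new email address lands in the additions, and the old one lands in the subtractions. On a first run the previous model is empty, so every incoming triple is an addition.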

Transfer

The data from the Additions file is added to both VIVO and the previous harvest model in two separate calls of the Transfer tool. The data from the Subtractions file is then removed from both VIVO and the previous harvest model in two more Transfer calls.
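The four Transfer calls can be sketched as follows. This is an illustrative Python model, not the Transfer tool itself: each store is assumed to be an in-memory set of triples, and the `transfer` function, its `remove` flag, and all URIs and predicates are hypothetical.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def transfer(target: Set[Triple], triples: Set[Triple], remove: bool = False) -> None:
    # Add the triples to the target store, or remove them when asked.
    if remove:
        target.difference_update(triples)
    else:
        target.update(triples)


PERSON = "http://example.org/person/1"  # hypothetical URI

label = (PERSON, "rdfs:label", "Ada Lovelace")
old_email = (PERSON, "ex:email", "ada@old.example.org")
new_email = (PERSON, "ex:email", "ada@new.example.org")
manual_note = (PERSON, "ex:note", "entered by hand in VIVO")

vivo = {label, old_email, manual_note}   # VIVO also holds non-harvested data
previous_harvest = {label, old_email}    # only what earlier harvests imported

additions = {new_email}
subtractions = {old_email}

# Two calls for the additions, two for the subtractions.
transfer(vivo, additions)
transfer(previous_harvest, additions)
transfer(vivo, subtractions, remove=True)
transfer(previous_harvest, subtractions, remove=True)
```

Keeping the previous harvest model in step with VIVO is what lets the next run compute a correct difference. In this sketch, the triple that was entered by hand never appears in either file, so it is left untouched.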

At this point a harvest is complete.

Next Steps

The Harvester's scripts/ directory includes several sample scripts that have been tested and perform different types of harvests. One of the best ways to get started is to find one that is close to your needs, run it on a test server or virtual machine, and then tweak it until it does what you want.

Read the Harvester User Guide to learn more about using the Harvester.

Harvester Instructions

Harvester User Guide

Pubmed Example Script

IP Example Script

Deployment

Video Walkthrus

Screencasts of example harvester runs: https://sourceforge.net/projects/vivo/files/VIVO%20Harvester/Demonstration/
