Introduction

The VIVO Harvester is a library of tools designed to read data from external sources, transform it, and ingest it into VIVO or potentially any other triplestore or semantic platform. The library was originally developed by the University of Florida Harvester Team under the 2009-2011 NIH grant.

The VIVO Harvester is currently maintained on GitHub by John Fereira of Cornell University as part of VIVO-related projects including AgriVIVO and USDA VIVO. Additional Harvester enhancements have been contributed by Alex Viggio through Symplectic, Ltd.

Source

Architecture and flow

The VIVO Harvester is a collection of small Java tools which are meant to be strung together in various ways to create a harvest process that is custom-tailored to your needs and (importantly) repeatable. This architecture makes the Harvester extremely versatile, but at the same time presents a bit of a learning curve.

We highly recommend that you become familiar with the basics of semantic technologies, including RDF and ontologies, and that you download and install the VIVO software before embarking on a data ingest. Try entering sample data ranging from people and their affiliations to publications, grants, or awards and honors; then export the RDF from VIVO to see what it looks like. For many people, a concrete example is far more intuitive than interpreting ontology diagrams or writing RDF by hand.
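To give a sense of what exported data can look like, here is a minimal, hand-written RDF/XML sketch of a single person record. The URI and label are invented for illustration; a real VIVO export will contain many more ontology-specific types and properties:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <!-- Hypothetical individual URI; VIVO mints its own URIs on ingest. -->
  <rdf:Description rdf:about="http://vivo.example.edu/individual/n1234">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
    <rdfs:label>Doe, Jane</rdfs:label>
  </rdf:Description>
</rdf:RDF>
```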

The following vignettes walk through the steps of a "typical" harvest, focusing primarily on functionality rather than configuration or execution.

Fetch

Translate


Score

Match

Change Namespace

Update

This step allows for multiple Harvester runs in succession to recognize data that has been modified since the previous run and update accordingly. A "previous harvest model" is created, which on the first run contains all the data imported on that run. On subsequent runs, this is compared with the new data to determine triples that have been removed or added since the last run. This comparison is made by the Diff tool, and the output is an "Additions file" and a "Subtractions file", containing RDF/XML data that should be added and removed, respectively, from VIVO.
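Conceptually, the Diff step is a set difference over triples. The following Python sketch is purely illustrative (the Harvester's Diff tool is a Java program operating on RDF models, and all data here is invented); it shows how additions and subtractions fall out of comparing the previous harvest model with newly harvested data:

```python
# Conceptual sketch of the Diff step: triples modeled as (subject, predicate, object) tuples.
# Illustrative Python only -- not the Harvester's actual Java implementation.

previous_model = {
    ("ex:n1", "rdfs:label", "Doe, Jane"),
    ("ex:n1", "ex:phone", "555-0100"),       # stale value, changed upstream
}

new_harvest = {
    ("ex:n1", "rdfs:label", "Doe, Jane"),
    ("ex:n1", "ex:phone", "555-0199"),       # updated value
    ("ex:n2", "rdfs:label", "Roe, Richard"), # newly harvested person
}

additions = new_harvest - previous_model      # triples for the Additions file
subtractions = previous_model - new_harvest   # triples for the Subtractions file

print(sorted(additions))
print(sorted(subtractions))
```

Unchanged triples (the label for ex:n1) appear in neither file, so repeated runs over unchanged data are no-ops.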

Transfer

The data from the Additions file is added both to VIVO and the previous harvest model in two separate calls of the Transfer tool. Then the data from the Subtractions file is removed both from VIVO and the previous harvest model in two more Transfer calls.
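In set terms, the four Transfer calls apply the same additions and subtractions to both VIVO and the previous harvest model, so that after a successful run the two stay in sync. A minimal Python sketch of that bookkeeping (illustrative only, with invented data; not the Harvester's API):

```python
# Illustrative sketch of the four Transfer calls; triples as tuples, stores as sets.
vivo = {("ex:n1", "ex:phone", "555-0100")}
previous_model = {("ex:n1", "ex:phone", "555-0100")}

additions = {("ex:n1", "ex:phone", "555-0199")}     # from the Additions file
subtractions = {("ex:n1", "ex:phone", "555-0100")}  # from the Subtractions file

# Transfer calls 1 and 2: add the Additions data to VIVO and to the previous model.
vivo |= additions
previous_model |= additions

# Transfer calls 3 and 4: remove the Subtractions data from both.
vivo -= subtractions
previous_model -= subtractions

print(vivo == previous_model)  # the two stores now match
```

Because both stores receive identical changes, the previous harvest model is an accurate baseline for the next run's Diff.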

At this point a harvest is complete.

Next Steps

Included in the Harvester's scripts/ directory are several sample scripts that have been tested and perform different types of harvests. One of the best ways to get started is to find one that is close to what you need, run it on a test server or virtual machine, and then tweak it until it does exactly what you want.

Read the Harvester User Guide to learn more about using the harvester.

Harvester Instructions

Harvester User Guide

Scheduling

Walkthrus for example scripts

JDBC Example Script

Pubmed Example Script

IP Example Script

Deployment

Video Walkthrus

Screencasts of example harvester runs: https://sourceforge.net/projects/vivo/files/VIVO%20Harvester/Demonstration/

Case Studies/Examples

University of Florida PubMed Harvest

University of Florida PeopleSoft Harvest

University of Florida Department of Sponsored Research Harvest

Class Documentation

Development Documentation

Milestones