Introduction

The VIVO Harvester is a library of tools designed to read and transform data from external sources and ingest it into VIVO or, potentially, any other triplestore or semantic platform. The library was originally developed by the Harvester Team at the University of Florida during the 2009-2011 NIH grant.

The VIVO Harvester is currently maintained on GitHub by John Fereira from Cornell as part of VIVO-related projects including AgriVIVO and USDA VIVO. Other contributions to ongoing Harvester enhancements have been made by Alex Viggio through Symplectic, Ltd.

Source

Recommended Harvester branch to check out or download from Git

Architecture and flow

The VIVO Harvester is a collection of small Java tools which are meant to be strung together in various ways to create a harvest process that is custom-tailored to your needs and (importantly) repeatable. This architecture makes the Harvester extremely versatile, but at the same time presents a bit of a learning curve.

...

(Excerpt included from the Fetch page.)

Translate

The next step of a typical harvest is translation. The fetched data arrives in its source format and must be converted into VIVO-compatible triples. If the input is XML, this can be done with the XSLTranslator tool and an .xsl file containing XSLT specific to the source format, which produces RDF/XML triples. The config/datamaps/ directory included with the Harvester contains several pre-written XSLT files for frequently needed formats (for example, PubMed). Another standard method is to prepare a SPARQL CONSTRUCT query using the VIVO UI that takes in RDF data and transforms it to fit the VIVO ontology. You can use the SPARQL Translator to run SPARQL CONSTRUCT files against target models.
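As a rough illustration of what translation produces, the sketch below turns a small XML record into subject-predicate-object triples. It uses plain Python in place of XSLTranslator and XSLT, so it shows only the shape of the result, not how the Harvester works internally. The record layout, the placeholder namespace, and the function name are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical placeholder namespace for freshly translated records.
HARVEST_NS = "http://example.org/harvest/pub/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
BIBO_ARTICLE = "http://purl.org/ontology/bibo/AcademicArticle"

# Invented input format, used only for this illustration.
SAMPLE_XML = """
<records>
  <record id="12345">
    <title>An example article</title>
  </record>
</records>
"""


def translate(xml_text):
    """Turn each <record> element into triples keyed by a placeholder URI."""
    triples = set()
    root = ET.fromstring(xml_text)
    for record in root.iter("record"):
        # The source identifier is assumed to be unique, so it can serve
        # as a temporary URI until a later step assigns a final one.
        subject = HARVEST_NS + record.get("id")
        triples.add((subject, RDF_TYPE, BIBO_ARTICLE))
        title = record.findtext("title")
        if title:
            triples.add((subject, RDFS_LABEL, title.strip()))
    return triples
```

Calling `translate(SAMPLE_XML)` yields two triples for the single record: one giving its type and one giving its label.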

Score and Matching

Depending on your data, the next step may be to match incoming data with data already in VIVO. For example, if you have just pulled in publication information from PubMed, you might want to compare the author names with people in your VIVO so that you can link the publications to their authors. This comparison is done by the Score tool, which compares any values you choose between VIVO and the input data and assigns a numeric score to each comparison.

The immediate next step is to call the Match tool, which compares the scores generated by Score to a threshold value. Input entities whose score meets or exceeds the threshold have their URIs changed to the URI of the matching person in VIVO, so that when the data is finally pulled into VIVO the new data is linked to existing data. In this way you can fetch publications for your existing researchers.
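The scoring and matching logic described above can be sketched in a few lines of Python. This is a simplified model, not the Score or Match tools themselves: it assumes every compared field carries equal weight, represents each person as a dictionary of field values, and uses invented function names.

```python
def score(input_people, vivo_people, fields):
    """Count equal field values for every (input, VIVO) pair."""
    scores = {}
    for in_uri, in_props in input_people.items():
        for vivo_uri, vivo_props in vivo_people.items():
            hits = sum(
                1
                for f in fields
                if in_props.get(f) is not None
                and in_props.get(f) == vivo_props.get(f)
            )
            scores[(in_uri, vivo_uri)] = hits
    return scores


def match(scores, threshold):
    """Map each input URI to its best-scoring VIVO URI at or above threshold."""
    best = {}
    for (in_uri, vivo_uri), value in scores.items():
        if value >= threshold and value > best.get(in_uri, (None, -1))[1]:
            best[in_uri] = (vivo_uri, value)
    return {in_uri: vivo_uri for in_uri, (vivo_uri, _) in best.items()}


def rename(triples, mapping):
    """Rewrite matched input URIs to VIVO URIs in subject and object position."""
    return {(mapping.get(s, s), p, mapping.get(o, o)) for s, p, o in triples}
```

An input author who agrees with a VIVO person on both first and last name scores 2. With a threshold of 2, `match` returns a mapping from that author's placeholder URI to the VIVO URI, and `rename` applies the mapping so the publication points at the existing person.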

Namespace Change

Depending on how your data came in and how you generated triples for it, the last step before importing the information into VIVO is to give your data proper URIs. This is done with the ChangeNamespace tool. Prior to this step, URIs may be placeholders provided by the XSLT translation (typically built from aspects of the raw data that are expected to be unique, such as an ISBN) or blank nodes from a SPARQL CONSTRUCT. If you have already generated unique URIs for all of your data from a piece of unique information, you can skip this step. After this step, all data has a proper VIVO URI and is ready for import into VIVO.
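To make the renaming concrete, here is a minimal sketch that swaps placeholder URIs for unused URIs in a target namespace. The sequential `n1`, `n2`, ... naming, the function name, and the set-of-tuples triple representation are assumptions made for illustration; the ChangeNamespace tool may choose new URIs differently.

```python
import itertools


def change_namespace(triples, old_ns, new_ns, used_uris):
    """Replace every URI under old_ns with an unused URI under new_ns."""
    counter = itertools.count(1)
    taken = set(used_uris)  # copy so the caller's set is not modified
    mapping = {}

    def mint():
        # Skip any candidate that already exists in the target store.
        while True:
            candidate = f"{new_ns}n{next(counter)}"
            if candidate not in taken:
                taken.add(candidate)
                return candidate

    def convert(term):
        # Sketch-level shortcut: any string starting with old_ns is
        # treated as a URI, including in object position.
        if isinstance(term, str) and term.startswith(old_ns):
            if term not in mapping:
                mapping[term] = mint()
            return mapping[term]
        return term

    # Sorting keeps the assignment of new URIs reproducible between runs.
    return {(convert(s), p, convert(o)) for s, p, o in sorted(triples)}
```

Each placeholder is mapped once and reused everywhere it appears, so links between harvested entities survive the renaming.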

...

(Excerpt included from the Translate page.)

Score

(Excerpt included from the Score page.)

Match

(Excerpt included from the Match page.)

Change Namespace

(Excerpt included from the ChangeNamespace page.)

Update

This step allows successive Harvester runs to recognize data that has changed since the previous run and to update VIVO accordingly. A "previous harvest model" is created, which on the first run contains all the data imported on that run. On subsequent runs, this model is compared with the new data to determine which triples have been removed or added since the last run. The comparison is made by the Diff tool, and its output is an "Additions file" and a "Subtractions file", containing RDF/XML data that should be added to and removed from VIVO, respectively.
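Conceptually, the comparison is a pair of set differences. The sketch below models each harvest as a Python set of triples; the function name and the tuple representation are assumptions for illustration, and the Diff tool itself works on RDF models and writes RDF/XML files.

```python
def diff(previous, incoming):
    """Return (additions, subtractions) between two sets of triples."""
    additions = incoming - previous      # new in this harvest
    subtractions = previous - incoming   # gone since the last harvest
    return additions, subtractions
```

On the first run the previous harvest model is empty, so every incoming triple lands in the additions and nothing is subtracted.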

Transfer

The data from the Additions file is added to both VIVO and the previous harvest model in two separate calls of the Transfer tool. The data from the Subtractions file is then removed from both VIVO and the previous harvest model in two more Transfer calls.
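The four Transfer calls can be modeled as follows, again treating each model as a Python set of triples. The function names and the `remove` flag are hypothetical and do not reflect the Transfer tool's actual options.

```python
def transfer(model, triples, remove=False):
    """Add triples to a model, or remove them when remove is True."""
    if remove:
        model.difference_update(triples)
    else:
        model.update(triples)


def apply_harvest(vivo, previous, additions, subtractions):
    """Apply one harvest's changes to VIVO and the previous harvest model."""
    transfer(vivo, additions)
    transfer(previous, additions)
    transfer(vivo, subtractions, remove=True)
    transfer(previous, subtractions, remove=True)
```

After these calls the previous harvest model mirrors the latest harvest, while data in VIVO that never came from the Harvester is left untouched.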

...
