Scope note
Much of this documentation remains the same across VIVO releases, but some may not have been fully updated to the most recent release. While we will attempt to identify and alert you in such cases, please be aware that your VIVO may look and act slightly differently from what is represented here.
Introduction
This document provides an overview of the data ingest process for VIVO.
- The early sections describe some typical data sources for VIVO and common challenges, such as handling multiple values and representing information that is true only for certain periods of time.
- Later sections describe different technical and workflow options for loading data into VIVO.
- Some more specific examples are provided, but readers should expect to modify or extend them to reflect their local data needs, the format of their sources, and the depth of technical skill available to them, such as the ability to write or modify XSLT scripts.
Other sources of information
- How to plan data ingest for VIVO — How VIVO differs from a spreadsheet, where VIVO data typically comes from, cleaning data prior to loading, matching against data already in VIVO, and doing further cleanup once it's in VIVO
- Ingest tools: home brew or off the shelf? — Major options including the Harvester, semantic ingest tools such as Karma, and XSLT
- Typical ingest processes — Alternative approaches to ingest and making ingest repeatable
- Challenges for data ingest — Challenges in the data, in workflow, in working incrementally, in modeling, and in migration
- Monitoring for quality
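Cleaning data prior to loading usually means normalizing values so that the same entity arriving from two sources compares equal. As a minimal sketch (the function name and normalization rules here are illustrative, not part of any VIVO tool), author names might be normalized like this before matching against people already in VIVO:

```python
import re
import unicodedata

def normalize_name(raw: str) -> str:
    """Normalize a person name for matching across data sources.

    Strips accents, reorders "Last, First" into "First Last", collapses
    whitespace and periods, and lowercases, so the same person coming
    from two sources compares equal.
    """
    # Decompose accented characters and drop the combining marks
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Reorder "Last, First" into "First Last"
    if "," in text:
        last, _, first = text.partition(",")
        text = f"{first} {last}"
    # Collapse runs of whitespace and periods, then lowercase
    return re.sub(r"[.\s]+", " ", text).strip().lower()
```

With these rules, `normalize_name("Gutiérrez, José A.")` and `normalize_name("jose a gutierrez")` produce the same key; real ingest pipelines typically add source-specific rules on top of a core like this.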
VIVO Harvester
- University of Florida Harvester Team
- University of Florida Harvester Documentation Archive
- Development and Planning
- Typical harvest
- Scheduling
- Problems and Solutions
- Web Of Science visual map
- XSLT Mapping
- Citation microformat to VIVO XSL Information
- HCalendar Microformat to VIVO XSL Information
- HCard Microformat to VIVO XSL Information
- HGrant Microformat to VIVO XSL Information
- HResume Microformat to VIVO XSL Information
- IP to VIVO XSL Information
- OAI_DC to VIVO Information
- PubMed to VIVO XSL Information
- Rel-tag Microformat to VIVO XSL Information
- XPath
- XPathTool
- Harvester Score Ontology
- Harvester Documentation Procedures
- New Harvest Workflow Proposal
- Demonstrations and Examples
- 20110114 UF Harvester Training
- CourseIngest Reproducible Harvest Installation Procedure
- Course Ingest Reproducible Harvest Installation Procedure
- Harvester 1.0 Demo
- Image Ingest
- ImageIngest Reproducible Harvest Installation Procedure
- IP Example Script
- JDBC Example Script
- MeSH Terms to VIVO XSL Information
- MODS Example Script 1.2
- Peoplesoft
- Peoplesoft-Biztalk Reproducible Harvest Installation Procedure
- Peoplesoft Example Script 1.2
- Pubmed Example Script
- Pubmed Example Script (1.1.1)
- Pubmed Example Script 1.2
- Scopus
- UF Data Sources
- UF New Weekly Publications RSS Feed
- University of Florida PeopleSoft Harvest
- Division of Sponsored Research
- DSR Reproducible Harvest Functional Specification
- DSR Reproducible Harvest Installation Procedure
- DSR Reproducible Harvest Project Charter
- DSR Reproducible Harvest Technical Specification
- DSR to VIVO XSLT example
- UF Grant Data to VIVO XSL Information
- University of Florida Department of Sponsored Research Harvest
- University of Florida DSR Grants visual map
- Cornell University Grant Data model
- University of Florida PubMed Harvest
- HR Data to VIVO XSL Information
- Web Of Science Example Script
- Env
- Configuration
- Harvester .tar file
- Harvester Debian package
- Harvester in Eclipse
- Deprecated Harvester Documentation
- Harvester vivo configuration file
- Harvester Source Documentation
- Diff
- Fetch — The first step of a typical harvest is to get your data from your target source; we call this the Fetch. For example, suppose we have a VIVO installation containing researchers at our university, and we want to harvest information from Pubmed (http://www.ncbi.nlm.nih.gov/pubmed/) on publications written by those researchers. In this case we would use the Harvester's PubmedFetch tool to send a query to Pubmed, which returns the results of that query in its own XML format.
- Harvester Architecture Diagram
- Merge
- Qualify
- RecordHandler
- RenameResources
- RunBibutils
- Score — Depending on your data, the next step may be to match incoming data with data already in VIVO. For example, if you have just pulled in publication information from Pubmed, you might want to compare the author names with people in your VIVO so that you can link the publications to their authors. This comparison is done via the Score tool, which compares any values you want between VIVO and the input data and assigns a number to each comparison.
- Smush
- Translate — The next step of a typical harvest is the translation. The fetched data will be in its own format, and this needs to be converted into VIVO-compatible triples. If the input is in an XML format, this can be done using the XSLTranslator tool and a .xsl file containing XSLT code specific to the data format being converted to RDF/XML triples. Several pre-written XSLT files for frequently needed formats (including, for example, Pubmed) are included with the Harvester in the config/datamaps/ directory.
- Transfer
- XMLGrep
- Utilities
- ArgParser
- ChangeNamespace — Depending on how your data came in and how you generated triples for it, the last step before importing the information into VIVO is to give your data proper URIs via the ChangeNamespace tool. Prior to this step, URIs may be placeholders provided by the XSLT translation (typically built from aspects of the raw data that are expected to be unique, such as an ISBN) or blank nodes from a SPARQL Construct. If you have already generated unique URIs for all of your data from a piece of unique information, you can skip this step.
- DatabaseClone
- JenaConnect
- Harvester Tools
- Match — The Match tool looks at the numbers generated by Score and compares them to a threshold value. Input entities whose score meets or exceeds the threshold have their identities changed to the URI of the matching person in VIVO, so that when the data is finally pulled into VIVO the new data is linked to the existing data. In this way you can fetch publications for your existing researchers.
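The Score/Match pattern described above can be sketched in a few lines: assign a similarity number to each comparison, then relink any input entity that meets or exceeds a threshold. This is an illustrative sketch only; the URIs, names, and threshold are made up, and the real Harvester tools operate on RDF models rather than Python dictionaries:

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical existing VIVO people (URI -> label); values are made up.
VIVO_PEOPLE = {
    "http://vivo.example.edu/individual/n1234": "Jane A Smith",
    "http://vivo.example.edu/individual/n5678": "Robert Jones",
}

def score(candidate: str, existing: str) -> float:
    """Assign a number to the comparison, as the Score step does."""
    return SequenceMatcher(None, candidate.lower(), existing.lower()).ratio()

def match(candidate: str, threshold: float = 0.85) -> Optional[str]:
    """Return the URI of the best-scoring VIVO person if that score
    meets or exceeds the threshold, so the incoming record can be
    relinked to the existing individual; otherwise return None."""
    best_uri, best = None, 0.0
    for uri, name in VIVO_PEOPLE.items():
        s = score(candidate, name)
        if s > best:
            best_uri, best = uri, s
    return best_uri if best >= threshold else None
```

For example, an incoming author string "Jane A. Smith" scores high enough to be relinked to the existing URI, while an unrecognized name falls below the threshold and would instead get a new URI in a later step.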
Beyond the basics of data ingest: more about tools and techniques
- Name disambiguation and entity resolution
- Advanced PubMed name matching diagram
- Alternative converters from tabular data to RDF
- Ingest Workflow Language
- VIVO PHP Person Data Library
Data ingest guides and workshop materials
- 2011 Conference Workshop on Extended Ingest by Example
- 2012 VIVO Conference workshop: Survey of VIVO Data Ingest Methods
- VIVO 1.2 Data Ingest Guide
- A Generalizable, XSLT Based RDF Ingest Example
- XSLT Ingest Example: Source Data
- XSLT Ingest Example: Accumulator Classes
- XSLT Ingest Example: The Process
- XSLT Ingest Example: Gather
- XSLT Ingest Example: Count
- XSLT Ingest Example: Make URIs
- XSLT Ingest Example: Create New Persons and Organizations
- XSLT Ingest Example: Fill in URPs and UROs
- XSLT Ingest Example: Create RDF
- XSLT Ingest Example: Final Considerations
- XSLT Ingest Example: Appendix A
- XSLT Ingest Example: Appendix B
- XSLT Ingest Example: Appendix C
- XSLT Ingest Example: Appendix D
- XSLT Ingest Example: Appendix E
- XSLT Ingest Example: Appendix F
- Faculty affiliation ingest example