Introduction

You've looked at VIVO, you've seen VIVO in action at other universities or organizations, you've downloaded and installed the code.  What next? How do you get information about your institution into your VIVO?

The answer may be different everywhere – it depends on a number of factors.

Next – what is different about data in VIVO?

As we've described, it's well worth learning the VIVO editing environment and creating sample data even if you know you will require an automated approach to data ingest and update.

VIVO makes certain assumptions about data based largely on the types of data, relationships, and attributes described in the VIVO ontology.  These assumptions do not always follow traditional row and column data models, primarily because the application almost always allows for arbitrarily repeating values rather than holding strictly to a fixed number of values per record.  Publications may most frequently have fewer than five authors, but in some fields such as experimental physics it's common to see hundreds of authors – not very workable in a one-row-per-publication, one-column-per-author spreadsheet model.

In VIVO, data about people, organizations, events, courses, places, dates, grants, and everything else are stored in one very simple, three-part structure – the RDF statement.  A statement, or triple, has a subject (any entity), a predicate or property, and an object that can be either another related entity or a simple data value such as a number, text string, or date.  While users will see VIVO data expressed in larger aggregations as web pages, internally VIVO is storing its data as RDF statements or triples.  
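As a rough illustration of the three-part structure described above, here is a minimal Python sketch. The `Entity`, `Literal`, and `Statement` classes and the `http://example.org/` URIs are assumptions made for this example; they are not VIVO's internal classes or terms from the VIVO ontology.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Entity:
    """Something identified by a URI: a person, organization, book, etc."""
    uri: str


@dataclass(frozen=True)
class Literal:
    """A simple data value such as a number, text string, or date."""
    value: Union[str, int, float]


@dataclass(frozen=True)
class Statement:
    """One triple: subject, predicate, object."""
    subject: Entity
    predicate: str                 # URI of the property
    obj: Union[Entity, Literal]    # another entity OR a simple value


EX = "http://example.org/"  # hypothetical namespace, for illustration only

goofy = Entity(EX + "character/1")
studio = Entity(EX + "org/7")

statements = [
    # Object is a simple data value.
    Statement(goofy, EX + "name", Literal("Goofy")),
    # Object is another related entity.
    Statement(goofy, EX + "memberOf", studio),
]

for s in statements:
    print(s.subject.uri, s.predicate, s.obj)
```

Everything VIVO knows about an entity is a set of such statements that share the same subject.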

This is not the place to explain everything about RDF – there are many good tutorials available and other sections of this wiki explain the VIVO ontology and the more technical aspects of RDF. For now, just bear in mind that while the data you receive may come to you in one format, much of the work of data ingest involves decomposing that data into simple statements that will then be re-assembled by the VIVO application, guided by the ontology, into a coherent web page or a packet of Linked Open Data.

What data can VIVO accept?

With VIVO, your destination will be RDF, but you may receive the data in a variety of formats. A first stage in planning ingest involves analyzing what data you have access to and mapping on paper how it needs to be transformed for VIVO.

It's probably most common for data to be provided in spreadsheet format, which can be very simple to transform into RDF if each column of every row refers to attributes of the same entity, usually identified by a record identifier. The process becomes more complicated if different cells in the same row of the spreadsheet refer to different entities.

The following spreadsheet would be very easy to load into a VIVO describing cartoon characters:

id | name | height | age
1 | Goofy | 89 cm | 11
2 | Elmer Fudd | 60 cm | 45
3 | Roadrunner | 140 cm | 2

You can readily imagine storing the information about each cartoon character (id, name, height, and age) in one entity for each character.
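The simple case, where every column describes the same entity, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `http://example.org/` URIs and the `rows_to_statements` helper are hypothetical, and a real ingest would map each column to a property from the VIVO ontology rather than reuse the column name.

```python
import csv
import io

# The cartoon character spreadsheet, saved as CSV.
SHEET = """id,name,height,age
1,Goofy,89 cm,11
2,Elmer Fudd,60 cm,45
3,Roadrunner,140 cm,2
"""

BASE = "http://example.org/character/"  # hypothetical entity namespace
PROP = "http://example.org/prop/"       # hypothetical property namespace


def rows_to_statements(text):
    """Turn each row into one entity and each remaining cell into one statement."""
    statements = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = BASE + row["id"]      # the record identifier names the entity
        for column, value in row.items():
            if column == "id":
                continue                # the id is already part of the subject URI
            statements.append((subject, PROP + column, value))
    return statements


statements = rows_to_statements(SHEET)
for subject, predicate, value in statements:
    print(subject, predicate, repr(value))
```

Three rows with three attribute columns each yield nine statements, three per character.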

A spreadsheet of books, however, would be more complicated:

id | title | publication date | author | publisher | pages
497531 | Cartoon Animation | 1967 | Wilcox, George | HB Press | 237
501378 | Animation Techniques | 1989 | Smith, Charlotte and Wilcox, George | Cinema Press | 359
391783 | Digital Animation | 2005 | Ivar, Samuel | Digital Logic, Inc. | 327
34682 | Dairy Barn Automation | 2011 | Wilcox, G.P. | University of Minnesota Press | 403

VIVO stores the book, each author, and the publisher as independent entities related to one another.  This enables information about the book, authors, and publisher to be queried and displayed independently, a key feature of the semantic data model.
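To make the idea of independent but related entities concrete, the following Python sketch decomposes the first row of the book spreadsheet into statements about three entities. The property names, the `http://example.org/` URIs, and the `slug` and `book_statements` helpers are placeholders invented for this example, not terms from the VIVO ontology. Minting a URI from a name, as done here, is a shortcut for illustration only.

```python
import re

EX = "http://example.org/"  # hypothetical namespace, for illustration only

row = {
    "id": "497531",
    "title": "Cartoon Animation",
    "publication date": "1967",
    "author": "Wilcox, George",
    "publisher": "HB Press",
    "pages": "237",
}


def slug(text):
    """Lower-case the text and replace runs of other characters with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def book_statements(row):
    """Decompose one spreadsheet row into statements about three entities."""
    book = EX + "book/" + row["id"]
    # Assumption: URIs minted from names; a real ingest needs stable identifiers.
    author = EX + "person/" + slug(row["author"])
    publisher = EX + "org/" + slug(row["publisher"])
    return [
        (book, EX + "title", row["title"]),
        (book, EX + "publicationDate", row["publication date"]),
        (book, EX + "pages", row["pages"]),
        (book, EX + "author", author),          # object is another entity
        (book, EX + "publisher", publisher),    # object is another entity
        (author, EX + "name", row["author"]),
        (publisher, EX + "name", row["publisher"]),
    ]


stmts = book_statements(row)
subjects = sorted({s for s, _, _ in stmts})
print(subjects)
```

Because the author and publisher have their own URIs, statements about them can be queried without going through the book.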

We have also introduced a common problem with spreadsheets: a cell that contains more than one value.  We need a way to connect the book "Animation Techniques" with two authors, and to indicate that Charlotte Smith is the first author and George Wilcox the second.
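One way to handle such a cell is to split it and record each author's position explicitly. The Python sketch below assumes that authors are separated by the word "and", as in the sample spreadsheet; real data will need more careful parsing. The `split_authors` and `authorship_statements` helpers, the intermediate "authorship" node, and the `http://example.org/` names are all hypothetical choices for this illustration.

```python
EX = "http://example.org/"  # hypothetical namespace, for illustration only


def split_authors(cell):
    """Split a multi-author cell into (rank, name) pairs, preserving order.

    Assumes names are joined by ' and ', as in the sample data.
    """
    names = [n.strip() for n in cell.split(" and ") if n.strip()]
    return list(enumerate(names, start=1))


def authorship_statements(book_id, cell):
    """Link a book to each author through a node that carries the rank."""
    book = EX + "book/" + book_id
    statements = []
    for rank, name in split_authors(cell):
        authorship = f"{book}/authorship/{rank}"
        statements.append((book, EX + "hasAuthorship", authorship))
        statements.append((authorship, EX + "rank", rank))
        statements.append((authorship, EX + "authorName", name))
    return statements


authors = split_authors("Smith, Charlotte and Wilcox, George")
print(authors)

for statement in authorship_statements("501378", "Smith, Charlotte and Wilcox, George"):
    print(statement)
```

Storing the rank on a separate node keeps the author order even though a set of statements has no inherent order.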

This example also points out another challenge in working with data: it's not always clear when values that appear similar actually represent the same entity, whether a person, organization, title, journal, or event.  It would be easy to assume the George Wilcox in the first entry is the same as G.P. Wilcox in the fourth, but they are writing about very different topics. For a small organization, it may be easy to disambiguate authors, but this becomes a major challenge at the scale of a large research university.
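The sketch below shows, in Python, why naive matching is risky. It groups author names by surname and first initial, a deliberately crude key chosen for this illustration, and reports the groups that contain more than one spelling. The `match_key` and `candidate_groups` helpers are hypothetical, and the output is a list of candidates for human review, not a decision that the names refer to the same person.

```python
from collections import defaultdict


def match_key(name):
    """Crude key: (surname, first initial) from a 'Surname, Given' string."""
    surname, _, given = name.partition(",")
    given = given.strip()
    initial = given[0].lower() if given else ""
    return (surname.strip().lower(), initial)


def candidate_groups(names):
    """Group distinct spellings that share a key; keep only ambiguous groups."""
    groups = defaultdict(set)
    for name in names:
        groups[match_key(name)].add(name)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


# Author cells from the book spreadsheet, after splitting multi-author cells.
names = [
    "Wilcox, George",
    "Smith, Charlotte",
    "Wilcox, George",
    "Ivar, Samuel",
    "Wilcox, G.P.",
]

candidates = candidate_groups(names)
print(candidates)
```

Here "Wilcox, George" and "Wilcox, G.P." fall into the same group even though, as noted above, they may well be different people; additional evidence such as subject area or affiliation is needed before merging them.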

Data cleanup and disambiguation are challenges for any system and will be a common theme in documenting VIVO data ingest along with semantic data modeling that is more specific to working with VIVO.
