
For starters

You've looked at VIVO, you've seen VIVO in action at other universities or organizations, you've downloaded and installed the code.  What next? How do you get information about your institution into your VIVO?

The answer may be different everywhere – it depends on a number of factors.

  • How big is your organization? Some smaller ones have implemented VIVO entirely through interactive editing: they enter every person, publication, organizational unit, grant, and event they wish to appear, and they keep up with changes "manually" as well. This approach works well for organizations with under 100 people or so, especially if you have staff or student employees who are good at data entry and enjoy learning more about the people and the research. There's something of an inverse correlation with age: students can be blazingly fast with data entry, employing multiple windows and copying and pasting content. The site takes shape before your eyes, and it's easy to measure progress and, after a bit of practice, predict how long the process will take.
    • This approach may also be a good way to develop a working prototype with local data to use in making your case for a full-scale effort.  The process of data entry is tedious but a very good way to learn the structure inherent in VIVO.
    • We recommend that people new to RDF and ontologies enter representative sample data by hand and then export it in one of the more readable RDF formats such as n3, n-triples, or turtle. This is an excellent way to compare what you see on the screen with the data VIVO will actually produce, and when you know your target, it's easier to decide how best to develop a more automated ingest process (see the short sketch after this list).
  • The interactive approach will obviously not work for large institutions or where staff time or a ready pool of student editors is not available. There are also many advantages to developing more automated means of ingest and updating, including data consistency and the ability to replace data quickly and on a predictable timetable.
  • What are your available data sources? Some organizations have made good institutional data a priority, while others struggle with legacy systems lacking consistent identifiers or common definitions for important categorizations such as distinct types of units or employment positions. You may have to make some inquiries to find the right people to ask about what data are available, and the stakeholders on your VIVO project may need to request access to that data.
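If you do experiment with hand-entered sample data, a few lines of code are enough to inspect an export. The sketch below uses the Python rdflib library (a third-party tool, not part of VIVO itself); the file name is hypothetical, and the format of your export may differ.

    # A minimal sketch, assuming rdflib is installed (pip install rdflib)
    # and that you have saved a VIVO export locally as RDF/XML.
    from rdflib import Graph

    g = Graph()
    g.parse("vivo-export.rdf", format="xml")  # hypothetical file name

    print(len(g), "triples in the export")

    # Re-serialize in Turtle, one of the more readable formats mentioned
    # above, to compare against what you see on screen.
    print(g.serialize(format="turtle"))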

Next – what is different about data in VIVO?

As we've described, it's well worth learning the VIVO editing environment and creating sample data even if you know you will require an automated approach to data ingest and update. VIVO makes certain assumptions about data based largely on the types of data, relationships, and attributes described in the VIVO ontology. These assumptions do not always follow traditional row and column data models, primarily because the application almost always allows for arbitrarily repeating values rather than holding strictly to a fixed number of values per record. Publications most often have fewer than five authors, but in some fields such as experimental physics it's common to see hundreds – not very workable in a one-row-per-publication, one-column-per-author data model.

VIVO also does not adhere to the familiar concept of tables, records, and fields. Data about people, organizations, events, courses, places, dates, grants, and everything else are stored in one very simple, three-part structure: the RDF statement, which has a subject (the entity the statement is about), a predicate or property, and an object that can be either another related entity or a simple data value such as a number, text string, or date. While you may still think of data in larger chunks, and users will see VIVO data expressed as web pages, internally VIVO stores its data as RDF statements or "triples."
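To make the triple structure concrete, here is a minimal sketch using rdflib with a made-up example namespace; the class and property names are illustrative, not actual VIVO ontology terms. Note how a publication with several authors is simply several statements about the same subject, with no fixed column count.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/")  # illustrative namespace only

    g = Graph()
    pub = EX.publication123
    g.add((pub, RDF.type, EX.Publication))  # subject, predicate, object
    g.add((pub, EX.title, Literal("Cartoon Animation")))

    # Repeating values are just additional statements about the same subject.
    for author in (EX.person1, EX.person2, EX.person3):
        g.add((pub, EX.hasAuthor, author))

    # Every fact is one three-part statement.
    for s, p, o in g:
        print(s, p, o)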

This is not the place to explain everything about RDF – there are many good tutorials out there and other sections of this wiki that explain ontologies and the more technical aspects of RDF. For now, just bear in mind that while the data you receive may come to you in one format, much of the work of data ingest involves decomposing that data into simple statements that will then be re-assembled by the VIVO application, guided by the ontology, into a coherent web page or a packet of Linked Open Data.

What data can VIVO cope with?

With VIVO, your destination will be RDF, but you may receive the data in a variety of formats.

It's probably most common for data to be passed around in spreadsheet format, which can be very simple to transform into RDF if each column of every row refers to attributes of the same entity, usually identified by a record identifier. The process becomes more complicated if different cells in the same row of the spreadsheet refer to different entities, at least in the way VIVO's ontology models the world.

An example may be helpful. The following spreadsheet would be very easy to load into a VIVO instance describing cartoon characters:

id   name         height   age
1    Goofy        89 cm    11
2    Elmer Fudd   60 cm    45
3    Roadrunner   140 cm   2

You can readily imagine storing the information about each character (id, name, height, and age) in a single structure.
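As a sketch of what that one-entity-per-row transformation might look like (again in Python with rdflib; the namespace, class, and property names are invented for illustration, and a real ingest would use terms from the VIVO ontology):

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/")  # illustrative only

    # In practice these rows would come from csv.DictReader over a CSV
    # export of the spreadsheet; they are inlined here to stay runnable.
    rows = [
        {"id": "1", "name": "Goofy", "height": "89 cm", "age": "11"},
        {"id": "2", "name": "Elmer Fudd", "height": "60 cm", "age": "45"},
        {"id": "3", "name": "Roadrunner", "height": "140 cm", "age": "2"},
    ]

    g = Graph()
    for row in rows:
        # One entity per row: the id becomes the URI, and the other
        # columns become simple data properties of that same entity.
        character = EX["character" + row["id"]]
        g.add((character, RDF.type, EX.CartoonCharacter))
        g.add((character, EX.name, Literal(row["name"])))
        g.add((character, EX.height, Literal(row["height"])))
        g.add((character, EX.age, Literal(int(row["age"]))))

    print(g.serialize(format="turtle"))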

A spreadsheet of book publication data, however, would be more complicated:

id       title                  publication date   author             publisher             pages
497531   Cartoon Animation      1967               Wilcox, George     HB Press              237
501378   Animation Techniques   1989               Smith, Charlotte   Cinema Press          359
391783   Digital Animation      2005               Ivar, Samuel       Digital Logic, Inc.   327

VIVO needs to store the book, each author, and the publisher as independent entities with bi-directional relationships among them. Even the publication date is stored as a separate entity, in large part because a precision (year, month, or day) needs to be stored along with the date value.
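Here is a hedged sketch of that decomposition for the first book row, again with rdflib and invented ex: terms. The real VIVO ontology actually interposes an Authorship context object between a person and a publication and has its own date and precision classes; the direct properties below are simplifications for illustration.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    EX = Namespace("http://example.org/")  # illustrative, not the VIVO ontology

    g = Graph()
    book = EX.book497531
    author = EX.personWilcoxGeorge
    publisher = EX.orgHBPress
    pub_date = EX.date497531

    # The book, the author, and the publisher are each independent entities.
    g.add((book, RDF.type, EX.Book))
    g.add((book, EX.title, Literal("Cartoon Animation")))
    g.add((book, EX.pageCount, Literal(237)))
    g.add((author, RDF.type, EX.Person))
    g.add((author, EX.name, Literal("Wilcox, George")))
    g.add((publisher, RDF.type, EX.Publisher))
    g.add((publisher, EX.name, Literal("HB Press")))

    # Bi-directional relationships link them together.
    g.add((book, EX.hasAuthor, author))
    g.add((author, EX.authorOf, book))
    g.add((book, EX.publishedBy, publisher))
    g.add((publisher, EX.publisherOf, book))

    # The date is its own entity so a precision can travel with the value.
    g.add((pub_date, EX.dateTime, Literal("1967", datatype=XSD.gYear)))
    g.add((pub_date, EX.precision, Literal("yearPrecision")))
    g.add((book, EX.dateTimeValue, pub_date))

    print(g.serialize(format="turtle"))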

The details are not important now, but bear in mind that much of the work of data ingest centers on breaking data that typically arrives in one composite format into the separate entities and explicit relationships that make up VIVO's data model. That decomposition is what provides the flexibility for query and retrieval that marks one of VIVO's primary advantages over more fixed storage models.

Types of data sources for VIVO