*Deprecated* See https://wiki.duraspace.org/display/VIVODOC/All+Documentation for current documentation

This article applies to versions of VIVO prior to 1.6.

Suppose you have a bunch of RDF data in your main model and now you want to update it with new data from an ingest process. We'll assume you're loading data from a spreadsheet or an XML file, and using SPARQL CONSTRUCTs as the basis of your workflow for converting the raw data.

When you load in the new data, you will have a bunch of new individuals with new URIs (perhaps something ugly that looks like <http://example.org/dataingest/hrdump-2010-01-01/individual123234>). Some or all of these individuals will actually be the same as those already described in your main model under different URIs (perhaps something relatively good-looking like <http://myuniversity.edu/individual/JohnQPublic>). How do we handle this?

You might ask, "What about owl:sameAs?" We can assert owl:sameAs triples to show that two different URIs actually refer to the same individual. Then a reasoner can infer that any statements about URI1 also apply to URI2 and vice versa. This is great if we're working with multiple data sources with current information about the same individual, and we want to support querying for either URI.
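For illustration, this is the kind of query that owl:sameAs entailment supports. The sketch below uses a SPARQL 1.1 property path to follow owl:sameAs links in either direction; the person URI is made up:

```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>
# Pull every statement made about a person, no matter which of the
# owl:sameAs-linked URIs it was asserted with. The * path also
# matches the starting URI itself (zero-length path).
SELECT ?p ?o WHERE {
  ?uri (owl:sameAs|^owl:sameAs)* <http://myuniversity.edu/individual/JohnQPublic> .
  ?uri ?p ?o
}
```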

This, unfortunately, isn't the case here. We need to update the statements asserted about our "real" URI according to the new data, retracting any statements that are no longer current. Additionally, we don't want to keep any statements with our "junk" data ingest URI.

This can be relatively straightforward if we're writing a program that manipulates the RDF graphs directly. If we want to do it as part of a SPARQL workflow, we need to resort to some exotic trickery.

Step one: assert our own custom "sameAs" property

It's convenient to have a temporary set of triples showing what individuals in our new ingest data are the same as individuals in our existing knowledge base. We don't want to use owl:sameAs because that has special semantics we don't want. So we can just make up our own property. The query below shows an example:

hrw:sameAsConstructQuery
    a s:SPARQLCONSTRUCTQuery ;
    s:queryStr "
        PREFIX vivo: <http://vivo.library.cornell.edu/ns/0.1#>
        PREFIX owl:  <http://www.w3.org/2002/07/owl#>
        PREFIX chr:  <http://vivo.cornell.edu/ns/hr/0.9/hr.owl#>
        PREFIX hrw:  <http://vitro.mannlib.cornell.edu/ns/ingest/HRIngestWorkflow#>
        CONSTRUCT {
            ?p hrw:sameAs ?q
        } WHERE {
            {
                ?p chr:emplId ?id .
                ?q chr:emplId ?id
            } UNION {
                ?p vivo:CornellemailnetId ?emailNetid .
                ?q vivo:CornellemailnetId ?emailNetid
            }
            OPTIONAL {
                ?p ?typep <http://vitro.mannlib.cornell.edu/ns/bjl23/hr/rules1#Person>
            }
            OPTIONAL {
                ?q ?typeq <http://vitro.mannlib.cornell.edu/ns/bjl23/hr/rules1#Person>
            }
            FILTER(bound(?typep))
            FILTER(!bound(?typeq))
        }
    " .

The hrw:sameAs property is neither reflexive, symmetric, nor transitive. A triple of the form

 ?a hrw:sameAs ?b

just means that junk ingest URI ?a refers to "real" URI ?b in our knowledge base. If we assert ?a hrw:sameAs ?a or ?b hrw:sameAs ?a we'll get all screwed up. The CONSTRUCT query makes sure the property only gets asserted in one direction by using the OPTIONAL and FILTER blocks. The individuals in the ingest data have this funny ingest-process type <http://vitro.mannlib....rules1#Person>; the individuals in the main database don't have this type. The OPTIONAL and FILTER blocks make sure that only individuals with this funny type can be the subject of hrw:sameAs, and only individuals without it can be the object. Two individuals are determined to be the same if they share the same emplId or the same Cornell email net Id.
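To make the direction concrete, here is a sketch of some input data and the single triple the query above would construct from it (all URIs and the emplId value are invented for the example):

```sparql
@prefix chr: <http://vivo.cornell.edu/ns/hr/0.9/hr.owl#> .
@prefix hrw: <http://vitro.mannlib.cornell.edu/ns/ingest/HRIngestWorkflow#> .

# Ingest data: carries the funny ingest-process type.
<http://example.org/dataingest/individual123234>
    a <http://vitro.mannlib.cornell.edu/ns/bjl23/hr/rules1#Person> ;
    chr:emplId "12345" .

# Main model: same emplId, but no ingest-process type.
<http://myuniversity.edu/individual/JohnQPublic>
    chr:emplId "12345" .

# Constructed result -- asserted in this direction only:
<http://example.org/dataingest/individual123234>
    hrw:sameAs <http://myuniversity.edu/individual/JohnQPublic> .
```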

Step two: construct the new data using the "real" URIs

Phew. OK, so now we have these hrw:sameAs triples showing how the new junk ingest URIs map to the old "real" URIs. Now we need to construct the data we want using the "real" URIs. Below is an example. The specifics aren't important, but notice how we're using hrw:sameAs. Note that we need to be careful not to construct <http://vitro.mannlib....rules1#Person> types into our new data, or we won't be able to make hrw:sameAs triples the next time we do an update. This particular CONSTRUCT does this by specifying regex patterns for the types of data we don't want to CONSTRUCT, which is where all the ugly FILTERs come in. In other situations it is likely easier to specify exactly what data you want to keep.

hrw:ExistingPeoplePreAssertionsConstructQuery
    a s:SPARQLCONSTRUCTQuery ;
    s:queryStr "
        PREFIX owl: <http://www.w3.org/2002/07/owl#>
        PREFIX hr:  <http://vitro.mannlib.cornell.edu/ns/bjl23/hr/1#>
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX hrb: <http://vitro.mannlib.cornell.edu/ns/bjl23/hr/rules1#>
        PREFIX hrw: <http://vitro.mannlib.cornell.edu/ns/ingest/HRIngestWorkflow#>
        CONSTRUCT {
            ?existingPerson ?p ?o .
            ?oo ?pp ?existingPerson
        } WHERE {
            ?s rdf:type hrb:Person .
            ?s hrw:sameAs ?existingPerson .
            ?s ?p ?o .
            OPTIONAL {
                ?oo ?pp ?s .
                FILTER (!regex(str(?pp),\"http://vitro.mannlib.cornell.edu/ns/bjl23/hr/rules1#\"))
                FILTER (!regex(str(?pp),\"http://vitro.mannlib.cornell.edu/ns/bjl23/hr/1#\"))
            }
            FILTER (!regex(str(?p),\"http://www.w3.org/2002/07/owl#\"))
            FILTER (!regex(str(?p),\"rdf-schema\"))
            FILTER (!regex(str(?p),\"sameAs\"))
            FILTER (!regex(str(?p),\"http://vitro.mannlib.cornell.edu/ns/bjl23/hr/1#\"))
            FILTER (!regex(str(?p),\"http://vitro.mannlib.cornell.edu/ns/bjl23/hr/rules1#\"))
            FILTER (!regex(str(?o),\"http://vitro.mannlib.cornell.edu/ns/bjl23/hr/1#\"))
            FILTER (!regex(str(?o),\"http://vitro.mannlib.cornell.edu/ns/bjl23/hr/rules1#\"))
        }
    " .

Step three: Get the RDF model difference in each direction

After we've made the assertions about our updated data, we can subtract the old (pre-update) data from the new data. This forms the set of statements that need to be added to the model to complete the update. Similarly, we can subtract the new data from the old data to get the set of retractions. Note that we need to CONSTRUCT a subgraph of the old data that includes only those statements that are in scope for updating. We don't want to use the entire existing knowledge base, or we'll end up retracting a whole lot of stuff we don't want to retract.
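One way to sketch the subtraction, assuming the scoped old data and the newly constructed data have each been loaded into a named graph (both graph URIs here are hypothetical):

```sparql
# Additions: statements present in the new data but not in the old.
CONSTRUCT { ?s ?p ?o } WHERE {
  GRAPH <http://example.org/ingest/newData> { ?s ?p ?o }
  FILTER NOT EXISTS {
    GRAPH <http://example.org/ingest/oldData> { ?s ?p ?o }
  }
}
```

Swapping the two graph URIs yields the retractions. FILTER NOT EXISTS requires SPARQL 1.1; with older tooling the same subtraction can be done programmatically, for example with Jena's Model.difference().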

For more information, please refer to the Data Ingest guide.