
We don't recommend using a person's name as part of their URI, for the simple reason that their name may change.  In fact, many data architects recommend always using completely randomized, meaningless identifiers within URIs (for the part after the last / in the URI, known as the local name).

When performing extract, transform, and load (ETL) tasks to get data into VIVO for the first time, you will need to create a URI for each new person, organization, or other type of entity ingested.  These URIs can be created by VIVO's ingest processes, or generated by the ETL process itself and loaded into VIVO. The ETL process can create any arbitrary URI as long as the local name begins with a letter, because some RDF processors are not happy with URIs whose local names begin with a number or other symbol.  The ETL process can also create a URI based on an institutional or other identifier, which has the advantage of being predictable and repeatable.  However, you need to be sure that the identifier is unique and will not be reused in the future, for example when a person leaves the institution or an organization identifier is recycled.
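The two approaches to minting URIs described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the namespace `http://vivo.example.edu/individual/`, the function names, and the `person` prefix are hypothetical, and this is not how VIVO's own ingest processes generate URIs.

```python
import re
import uuid

# Hypothetical namespace; substitute your own VIVO installation's namespace.
NAMESPACE = "http://vivo.example.edu/individual/"


def mint_random_uri(namespace=NAMESPACE):
    """Return a URI whose local name is a random, meaningless identifier."""
    # The leading "n" guarantees the local name starts with a letter,
    # since a bare UUID may begin with a digit.
    return f"{namespace}n{uuid.uuid4().hex}"


def mint_uri_from_identifier(identifier, prefix="person", namespace=NAMESPACE):
    """Return a predictable, repeatable URI built from a stable identifier.

    The caller is responsible for making sure the identifier is unique
    and will never be reassigned to a different person or organization.
    The prefix is assumed to begin with a letter.
    """
    # Keep only letters, digits, and underscores in the local name.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "", str(identifier))
    if not cleaned:
        raise ValueError("identifier has no usable characters")
    return f"{namespace}{prefix}{cleaned}"
```

Calling `mint_uri_from_identifier` twice with the same identifier yields the same URI, which is what makes repeated ingests from the same source line up; `mint_random_uri` gives a different URI on every call, so the result must be stored and looked up on later ingests.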

The goal with subsequent ingests, whether of new types of data or of updates to existing sources, is to match new incoming data against the existing URIs and their contents to avoid creating duplicates. This means having some way of checking new data against existing data.

Creating nightly accumulators

At Cornell, we have found it advantageous to run a nightly process that extracts a list of all people and all instances of several other types of entities, along with their URIs and key identifying properties such as name parts, email addresses, and so on.  These lists serve as a source against which to match incoming data, so that we avoid having to query our production VIVO instance every time we encounter a co-author's name, a journal, or an organization name. We call the lists accumulators, and store them in an XML format because our largest source of updates about researcher activities comes from an XML web service.

These accumulator lists help ensure that new data are matched against existing data, reducing, but not eliminating, false positives and false negatives.  We will discuss disambiguation in more detail further along in the process. 
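As a rough illustration of matching incoming data against an accumulator, the sketch below loads a small XML list of people and looks up an existing URI by email address. The XML layout, element names, and function names are assumptions made for this example; the actual Cornell accumulator format is not shown in this document.

```python
import xml.etree.ElementTree as ET

# Hypothetical accumulator layout, invented for illustration only.
ACCUMULATOR_XML = """
<people>
  <person uri="http://vivo.example.edu/individual/n1001">
    <firstName>Ada</firstName>
    <lastName>Lovelace</lastName>
    <email>al123@example.edu</email>
  </person>
  <person uri="http://vivo.example.edu/individual/n1002">
    <firstName>Alan</firstName>
    <lastName>Turing</lastName>
    <email>amt45@example.edu</email>
  </person>
</people>
"""


def build_email_index(xml_text):
    """Map each normalized email address in the accumulator to its URI."""
    root = ET.fromstring(xml_text.strip())
    index = {}
    for person in root.findall("person"):
        email = (person.findtext("email") or "").strip().lower()
        if email:
            index[email] = person.get("uri")
    return index


def match_by_email(index, email):
    """Return the existing URI for an incoming email, or None if unknown."""
    return index.get(email.strip().lower())
```

An exact match on a normalized email address is a deliberately conservative choice: it keeps false positives low, but people listed under a different address will not match and still need the disambiguation steps discussed later.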

 


next topic: Challenges for data ingest