Notes on improving full lifecycle data management in VIVO

UF has been tasked with providing automated and continual ingests from PeopleSoft and grants databases; last fall they encountered a problem when a bad harvest that had created a lot of duplicates was not discovered until about 3 days after it happened. They then had to figure out how to restore the database to its state at that point in time.

...

The same problems can be evident in SPARQL queries – if multiple editors have touched the same data, and one did bad work, you will likely not know the state of your database – whether this is bad data from that editing, or garbage left behind by a botched index rebuild, a data migration, or an ingest process that went awry.

Data can change

If you keep files of all the triples added via an ingest process, in theory one could simply retract the same triples at a future date and undo all the additions. In practice, there are at least 4 problems with this assumption.
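
As a sketch of that in-theory undo with Apache Jena (the file names are hypothetical, and this ignores the problems discussed below):

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFDataMgr;

    public class NaiveUndo {
        public static void main(String[] args) {
            // Hypothetical files: the full store dump and the triples the ingest added.
            Model store = RDFDataMgr.loadModel("vivo-data.nt");
            Model additions = RDFDataMgr.loadModel("ingest-additions.nt");

            // Retract exactly the triples that were added. This only works if none of
            // them have been edited, re-asserted, or deleted since the ingest ran.
            store.remove(additions);

            RDFDataMgr.write(System.out, store, Lang.NTRIPLES);
        }
    }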

...

At Cornell, we have addressed some of the above issues (having experienced all of the above problems) by changing update procedures to issue queries for all the statements about the URIs inserted (say, a new set of grants). A SPARQL CONSTRUCT query can create a new file of RDF to retract that matches what now exists in the store. This approach is itself not foolproof, however, and certainly cannot responsibly be run on a fully automated basis.
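
A rough sketch of that query-then-retract step with Apache Jena, using a hypothetical grant URI (Cornell's actual update procedures are more involved):

    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.RDFDataMgr;

    // Build the retraction set from what the store holds *now* about the resource,
    // rather than from the file of triples that was originally ingested.
    Model store = RDFDataMgr.loadModel("vivo-data.nt");                // hypothetical dump
    String grantUri = "http://vivo.example.edu/individual/grant123";   // hypothetical URI
    String construct =
            "CONSTRUCT { <" + grantUri + "> ?p ?o } " +
            "WHERE     { <" + grantUri + "> ?p ?o }";

    try (QueryExecution qe = QueryExecutionFactory.create(construct, store)) {
        Model retractions = qe.execConstruct();
        // The retraction model can be written to a file for review before removal.
        store.remove(retractions);
    }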

Alternative choices

Separate audit model

A version of VIVO predating our NIH grant (circa 2007 or 2008) had a listener hooked into an older editing system that detected every time a triple was added or removed. The listener created a reified version of the triple in a separate Jena database store, together with a timestamp, an indication of whether the triple was added or retracted, and the user performing the action. We never made much use of it except to see who had recently logged in as a self editor; it would certainly have been possible to build more elaborate reports on these additional, reified statements, but we never had the requirement to do so.

The reified triples were stored in a separate store so the application didn't have to wade through all the extra logging data when doing a query.
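
For illustration, a listener along those lines might look roughly like this with Apache Jena's StatementListener; the audit vocabulary and editor URI below are made up, and the original VIVO implementation certainly differed in detail:

    import java.util.Calendar;
    import org.apache.jena.rdf.listeners.StatementListener;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ReifiedStatement;
    import org.apache.jena.rdf.model.Statement;

    public class AuditListener extends StatementListener {
        // Hypothetical audit vocabulary; not the terms the old VIVO code used.
        private static final String NS = "http://example.org/audit#";

        private final Model auditModel;   // separate store, kept out of the main data
        private final String editorUri;   // the user performing the action

        public AuditListener(Model auditModel, String editorUri) {
            this.auditModel = auditModel;
            this.editorUri = editorUri;
        }

        @Override
        public void addedStatement(Statement s) { record(s, "added"); }

        @Override
        public void removedStatement(Statement s) { record(s, "retracted"); }

        private void record(Statement s, String action) {
            // Reify the triple in the audit model and attach provenance to it.
            ReifiedStatement r = auditModel.createReifiedStatement(s);
            r.addProperty(auditModel.createProperty(NS, "action"), action);
            r.addProperty(auditModel.createProperty(NS, "editor"),
                          auditModel.createResource(editorUri));
            r.addLiteral(auditModel.createProperty(NS, "timestamp"),
                         auditModel.createTypedLiteral(Calendar.getInstance()));
        }
    }

Registering such a listener on the main data model (dataModel.register(new AuditListener(auditModel, editorUri))) would route every addition and retraction into the separate audit store.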

Data separation by graph

Any modern triple store, including SDB, is really a quad store – each collection of triples is a graph, and the graph ID forms the fourth value in each quad, together with the subject, predicate, and object of the triple itself.
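
For example, with Apache Jena an ingest can be loaded into its own named graph in a Dataset; the graph URI and file name here are hypothetical:

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.DatasetFactory;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.RDFDataMgr;

    // Hypothetical graph URI identifying the ingest source.
    String graphUri = "http://vivo.example.edu/graph/peoplesoft";

    Dataset dataset = DatasetFactory.create();                // in-memory quad store
    Model ingestGraph = dataset.getNamedModel(graphUri);      // graph URI is the 4th element of each quad
    RDFDataMgr.read(ingestGraph, "peoplesoft-additions.nt");  // these triples land only in that graph

    // Undoing the whole ingest later is just dropping the graph:
    // dataset.removeNamedModel(graphUri);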

...

A graph identifier could in theory be encoded with information about the source and date of the data associated with the graph.
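
As a sketch, assuming a purely hypothetical naming convention, the graph URI could carry both pieces of information, and a SPARQL query could then filter on the graph name alone:

    // Hypothetical convention: ingest source and harvest date encoded in the graph URI.
    String source = "peoplesoft";
    String harvestDate = "2011-10-14";
    String graphUri = "http://vivo.example.edu/graph/" + source + "/" + harvestDate;

    // Every graph from that source can later be found by matching on the name.
    String query =
            "SELECT ?g (COUNT(*) AS ?triples) " +
            "WHERE { GRAPH ?g { ?s ?p ?o } " +
            "        FILTER STRSTARTS(STR(?g), \"http://vivo.example.edu/graph/" + source + "/\") } " +
            "GROUP BY ?g";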

Graph per user or even triple

The graph approach could be extended to provide one or more graphs per user. In the pathological end case, the graph approach could devolve into a separate graph per triple. This would still likely be more efficient in storage than the full reified statement approach, but we think it might cause some problems for the Jena SDB triple store we use now.

Discussion of reification vs. graphs

We have thought some about using the graph approach to data even in situations where users will be allowed to edit the data. For example, any data changed by a front-end user could be moved from the HR ingest graph to a global "end user modified" graph, and perhaps copied to an "HR ingest modified" graph to help identify for an auditing system what had been modified. A global "end user modified" graph would not be sufficiently granular when there is a use case requiring identification of the date any triple in that graph had been changed, for example, or by whom.
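
A sketch of that bookkeeping, assuming the hypothetical graph URIs below and a front-end edit that replaces a single HR-ingested statement:

    import org.apache.jena.query.Dataset;
    import org.apache.jena.rdf.model.Statement;

    static final String HR_GRAPH       = "http://vivo.example.edu/graph/hr-ingest";
    static final String HR_MOD_GRAPH   = "http://vivo.example.edu/graph/hr-ingest-modified";
    static final String USER_MOD_GRAPH = "http://vivo.example.edu/graph/end-user-modified";

    /** Record that a front-end edit replaced a statement that came from the HR ingest. */
    static void recordUserEdit(Dataset ds, Statement oldStmt, Statement newStmt) {
        ds.getNamedModel(HR_GRAPH).remove(oldStmt);     // no longer asserted as HR data
        ds.getNamedModel(HR_MOD_GRAPH).add(oldStmt);    // kept so an audit can see what was changed
        ds.getNamedModel(USER_MOD_GRAPH).add(newStmt);  // the editor's value, with no per-triple date or editor
    }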

...

Can we get an architecture that does this?

Approaches via the connectivity to the triple store

For version 1.5 we want to make VIVO much more triple store agnostic. Brian Lowe has a simple working proof of concept that talks to a remote SPARQL endpoint anywhere on the web. We could push this problem down so that VIVO is triple store agnostic; the problem then lives in the triple store layer, and Virtuoso and other enterprise-grade triple stores might have features to do the triple-by-triple auditing.
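
For illustration, querying a remote SPARQL endpoint from Jena is already straightforward; the endpoint URL below is made up, and this is not the actual proof-of-concept code:

    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.query.ResultSetFormatter;

    // Hypothetical endpoint; any SPARQL service reachable over HTTP would do.
    String endpoint = "http://example.org/vivo/sparql";
    String query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10";

    try (QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, query)) {
        ResultSet results = qe.execSelect();
        ResultSetFormatter.out(System.out, results);
    }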

...

Next week Jim Blake will talk about extension architectures.

Original agenda for the discussion

Adding data to VIVO is not easy – partly because RDF and semantic tools are unfamiliar, but partly because of fundamental data management challenges facing any large system with multiple sources, especially when some data may be edited after ingest and those changes affect subsequent updates.

...

As VIVO matures at each of our institutions, we are also being asked more questions about reusing data from VIVO in other applications, about reporting using data in VIVO, and tools for archiving, removing, or exporting what can be very large amounts of data. How can we address these challenges appropriately?

Questions from the UF Harvester team

In our discussions of ingesting Person data from PeopleSoft we have a wish list of things we'd love to know about a triple to allow us to perform intelligent actions on any triple as part of an ingest process (a sketch of how these might be attached to a triple follows the list). Some of these are:

  • CreatedBy = What user or person created this triple
  • CreateDate = When was this triple first created
  • LastModBy = What user or person last modified this triple
  • LastModDate = When was this triple last modified
  • Public = Am I allowed to show this triple to the public
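
One way the wish-list items above could be attached to an individual triple is via a reified statement in a provenance store, along the lines of the audit model discussed earlier; the vocabulary and URIs here are entirely hypothetical:

    import java.util.Calendar;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.ReifiedStatement;
    import org.apache.jena.rdf.model.Statement;

    Model data = ModelFactory.createDefaultModel();
    Model prov = ModelFactory.createDefaultModel();   // provenance kept out of the main data
    String NS = "http://example.org/provenance#";     // hypothetical vocabulary

    Statement stmt = data.createStatement(
            data.createResource("http://vivo.example.edu/individual/person42"),
            data.createProperty("http://xmlns.com/foaf/0.1/", "firstName"),
            "Jane");
    data.add(stmt);

    ReifiedStatement r = prov.createReifiedStatement(stmt);
    r.addProperty(prov.createProperty(NS, "createdBy"),
                  prov.createResource("http://vivo.example.edu/harvester"));   // CreatedBy
    r.addLiteral(prov.createProperty(NS, "createDate"),
                 prov.createTypedLiteral(Calendar.getInstance()));             // CreateDate
    r.addLiteral(prov.createProperty(NS, "public"), true);                     // Public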

Other questions to address

As time permits this week, and for future meetings

...