previous topic: Ingest tools: home brew or off the shelf?

Note: these discussions primarily reflect the approaches and workflow that have been used at Cornell. Other approaches are used at other sites.

...

Please update or annotate as appropriate to point out different requirements and/or solutions.

Introduction

Some VIVO sites do not allow manual editing by users. Instead, VIVO reflects data from one or more other systems of record, serving as a point of integration and as a place from which to syndicate the integrated data to other websites or reporting tools. This can simplify data management once the data are in VIVO, but it still very likely requires data alignment unless all the sources of data are internally consistent and share common unique identifiers.

When data in VIVO come from ingested sources but have also been created or augmented by interactive editing, whether by end users editing their own pages (typically called self-editing) or by a limited number of data curators or student hires, there are more complexities to plan for.

  • If data are corrected in VIVO (a misspelled name, for example) but not in the source, the next ingest from that source will likely overwrite the correction and restore the misspelled name.
    • If possible, build workflows to feed changes back to the source for correction, and let the fix propagate to VIVO through the next scheduled ingest of that source.
    • Sometimes a manual edit will still have to be made in VIVO to satisfy an unhappy user, but if the identical change is also made in the original source, the next ingest will not overwrite the corrected version.
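
One way to find corrections that need to be fed back to the source is to compare three snapshots: what the source supplied at the last ingest, what VIVO holds now, and what the source is about to supply. The following is only a minimal sketch under stated assumptions. The dictionary layout, the `Conflict` type, and the `find_overwrite_risks` function are hypothetical names chosen for illustration; they are not part of VIVO or of the Cornell ingest code.

```python
from typing import Dict, List, NamedTuple

# Hypothetical layout: record id -> {field name -> value}
Records = Dict[str, Dict[str, str]]


class Conflict(NamedTuple):
    record_id: str
    field: str
    vivo_value: str    # the value as corrected in VIVO
    source_value: str  # the value the next ingest would write


def find_overwrite_risks(
    last_ingested: Records, current_vivo: Records, incoming: Records
) -> List[Conflict]:
    """List fields that were edited in VIVO and would be overwritten."""
    conflicts = []
    for record_id, vivo_fields in current_vivo.items():
        before = last_ingested.get(record_id, {})
        after = incoming.get(record_id, {})
        for field, vivo_value in vivo_fields.items():
            # Edited in VIVO: the value no longer matches the last ingest.
            edited_in_vivo = field in before and before[field] != vivo_value
            # Still uncorrected upstream: the source disagrees with VIVO.
            source_differs = field in after and after[field] != vivo_value
            if edited_in_vivo and source_differs:
                conflicts.append(
                    Conflict(record_id, field, vivo_value, after[field])
                )
    return conflicts


# Example: the name was fixed in VIVO but not in the source.
last = {"p1": {"name": "Jon Smtih"}}
vivo = {"p1": {"name": "Jon Smith"}}
new = {"p1": {"name": "Jon Smtih"}}
report = find_overwrite_risks(last, vivo, new)
```

Each entry in `report` is a candidate for a correction request to the owner of the source system. Once the source carries the same fix, the record drops out of the report.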

Ideally, ingest processes are made repeatable and incremental so that changes do not require removing and then re-adding large amounts of data. Sometimes, however, a source is only updated annually, or the source system goes through changes that require large batch changes.

  • A separate VIVO instance is used for ingest.
    • This instance is populated from the nightly backup of the production instance.
    • The use of a separate VIVO means that the production instance is not loaded down by the ingest process.
    • Ingest processes run at night.
      • Since ingested data is largely separate from editable data, it is not likely that there would be conflicts, except for the load on the system.
      • Ingest processes compare the new data to the data already in VIVO.
      • They generate the RDF triples that must be added to or removed from VIVO to represent the new data.
      • Because we do not apply these triples immediately, we can inspect them for correctness before committing them.
      • The RDF triples are then applied to the production VIVO system.
        • These processes are ad hoc, and idiosyncratic to Cornell’s data sources and ontology extensions. They are constantly being changed, and are not packaged for release.
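
The compare-and-generate step described above can be sketched with plain set arithmetic. This is an illustration only, not the Cornell code, which is not packaged for release. The `diff_triples` and `apply_changes` functions, the example URIs, and the representation of a triple as a tuple of three strings are all assumptions made for the example. The sketch also assumes that `existing` holds only the triples that originally came from the source being ingested; if manually edited triples were included, they would show up as retractions.

```python
from typing import Iterable, Set, Tuple

# Assumed representation: (subject, predicate, object) as plain strings.
Triple = Tuple[str, str, str]


def diff_triples(
    existing: Iterable[Triple], incoming: Iterable[Triple]
) -> Tuple[Set[Triple], Set[Triple]]:
    """Return (additions, retractions) that turn `existing` into `incoming`."""
    existing_set = set(existing)
    incoming_set = set(incoming)
    additions = incoming_set - existing_set    # new in the source
    retractions = existing_set - incoming_set  # no longer in the source
    return additions, retractions


def apply_changes(
    graph: Iterable[Triple],
    additions: Iterable[Triple],
    retractions: Iterable[Triple],
) -> Set[Triple]:
    """Apply a reviewed change set: remove retractions, then add additions."""
    return (set(graph) - set(retractions)) | set(additions)


# Hypothetical data: a person's label changes between two ingests.
PERSON = "http://vivo.example.edu/individual/n123"
LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"

in_vivo = {(PERSON, TYPE, FOAF_PERSON), (PERSON, LABEL, "Smtih, Jon")}
from_source = {(PERSON, TYPE, FOAF_PERSON), (PERSON, LABEL, "Smith, Jon")}

additions, retractions = diff_triples(in_vivo, from_source)
updated = apply_changes(in_vivo, additions, retractions)
```

Keeping the additions and retractions as two separate sets is what makes the inspection step possible: both can be written out and reviewed before anything touches the production system.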

...

We don't recommend using a person's name as part of their URI, for the simple reason that their name may change. In fact, many data architects recommend always using completely randomized, meaningless identifiers within URIs (for the part after the last / in the URI, known as the local name).
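
A randomized local name can be generated in a few lines. The sketch below is one possible approach, not a description of how VIVO itself mints URIs. The `mint_uri` function and the `http://vivo.example.edu/individual/` namespace are hypothetical, and the leading letter is an assumption made so that the local name never starts with a digit, which matters when the URI is abbreviated in XML-based serializations.

```python
import uuid
from typing import AbstractSet


def mint_uri(namespace: str, existing_uris: AbstractSet[str] = frozenset()) -> str:
    """Return a new URI whose local name carries no meaning."""
    while True:
        # uuid4() is random; .hex gives 32 lowercase hexadecimal characters.
        local_name = "n" + uuid.uuid4().hex
        uri = namespace + local_name
        # A collision is extremely unlikely, but the check is cheap.
        if uri not in existing_uris:
            return uri


# Hypothetical namespace for illustration.
NAMESPACE = "http://vivo.example.edu/individual/"
first = mint_uri(NAMESPACE)
second = mint_uri(NAMESPACE, existing_uris={first})
```

Because nothing in the local name is derived from the person's name, department, or title, the URI can stay the same when any of those change.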


...

next topic: Challenges for data ingest