Background

The Current Workflow

See Typical harvest

The Issue

While this process works, it requires processing every record on every run of the harvester, including a potentially huge portion of records that have not been modified at all. Additionally, this process compares the current harvest with a 'last-harvest' model that must be kept synchronized with any modifications made in the live VIVO model (vitro-kb-2). This either creates a confusing amount of extra work for those manually maintaining the data or causes updates from the source to create anomalous 'extra' triples.

The Solution

The solution is to separate out changed data as early as possible and discard unchanged data, since we know nothing needs to be done with it. By comparing the raw data from the source, we can isolate changes before putting them through the entire harvest process. This could drastically decrease the time that successive runs of the harvester (for updates) take. Additionally, since we are comparing the raw data from the source, we can isolate the type of change as well. New records, deleted records, and modified fields in a record can be isolated and handled in separate, appropriate ways. New records are sent through a process similar to the current harvest workflow. Deleted records are simply matched to VIVO and their data is purged from VIVO. Updated fields are updated in place, which even allows the source's authoritativeness to be taken into consideration.

To implement this concept using the harvester, it would be best to add a few tools to the toolset. By leveraging some of the advantages of our toolset (the concept of Records being unique, comparable objects), we can create these tools fairly easily.

  • Handle different change types separately
    • Field updates can be handled specially since we know what they are (not just a triple subtraction/addition)

      Separate Data Early

  • The fact that our tools already have this concept of a 'Record' from the source is an advantage we can leverage to isolate new data, removed data, and updated data quickly.
  • RecordHandler Tool will be used to isolate New/Removed/Shared Records
    • Subtract LastHarvestRH from CurrentHarvestRH; what remains are New Records (in the current harvest, but not the last one)
    • Subtract CurrentHarvestRH from LastHarvestRH; what remains are Removed Records (in the last harvest, but not the current one)
    • Subtract NewRecordRH and RemoveRecordRH from CurrentHarvestRH; what remains are Shared Records (records that are in both harvests)
  • RecordCompare Tool will be used to compare each record in SharedRH with the corresponding record in LastHarvestRH, and will output records containing the changes for those records that have been updated
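
The subtraction and comparison steps above can be sketched roughly as follows. This is only an illustration in Python, not the Harvester's actual tools or API: it assumes a RecordHandler can be modelled as a dictionary mapping a record ID to that record's fields, and the names partition_records and compare_records are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for a RecordHandler: record ID -> {field name: value}
Records = Dict[str, Dict[str, str]]
FieldChange = Tuple[Optional[str], Optional[str]]  # (old value, new value)


def partition_records(last: Records, current: Records) -> Tuple[Records, Records, Records]:
    """Split two harvests into new, removed, and shared records by record ID."""
    # current minus last: records that only exist in the current harvest
    new = {rid: rec for rid, rec in current.items() if rid not in last}
    # last minus current: records that disappeared from the source
    removed = {rid: rec for rid, rec in last.items() if rid not in current}
    # current minus new (nothing in 'removed' is in current): records in both
    shared = {rid: rec for rid, rec in current.items() if rid in last}
    return new, removed, shared


def compare_records(last: Records, shared: Records) -> Dict[str, Dict[str, FieldChange]]:
    """For each shared record, report only the fields whose values differ."""
    changes: Dict[str, Dict[str, FieldChange]] = {}
    for rid, cur in shared.items():
        old = last[rid]
        diff = {
            field: (old.get(field), cur.get(field))
            for field in sorted(old.keys() | cur.keys())
            if old.get(field) != cur.get(field)
        }
        if diff:  # unchanged records are discarded here
            changes[rid] = diff
    return changes


# Made-up sample data, for illustration only
last_harvest = {
    "p1": {"name": "Ada", "title": "Lecturer"},
    "p2": {"name": "Bob", "title": "Professor"},
    "p3": {"name": "Cy", "title": "Postdoc"},
}
current_harvest = {
    "p1": {"name": "Ada", "title": "Professor"},
    "p3": {"name": "Cy", "title": "Postdoc"},
    "p4": {"name": "Di", "title": "Lecturer"},
}

new, removed, shared = partition_records(last_harvest, current_harvest)
updates = compare_records(last_harvest, shared)
print(sorted(new), sorted(removed), sorted(shared))
print(updates)
```

In this sample only p1 yields an update record, because p3 is identical in both harvests and is dropped before any further processing.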

    Handle Change Types Separately

    New Records

  • Send through an optimized pipeline for records we know are new

    Record Removals

  • Score/Match records to VIVO and remove the corresponding data from VIVO (or modify it to reflect its historical nature, such as past jobs)

    Record Updates

  • Update Tool Specification