Team UQAM and team TIB had a productive first investigative sprint with the objective of learning about Apache Kafka. We would like to start the work of the data ingest task force as soon as possible, to present the outcome of our first investigation while it is fresh, to see whether we are aligned with others, and to get valuable feedback.
"A professor wishes to add the reference to a scientific article, irrespective of whether he chooses ORCID or VIVO, the information he will enter in either of these platforms will be mutually updated"
Goal of using Kafka with VIVO:
VIVO becomes one component in the enterprise, rather than its center
Main idea of Kafka
An event-driven messaging system
Allows for many-to-many producers and consumers (see the consumer sketch below)
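To make this concrete, a minimal Kafka consumer in Java might look like the sketch below; the broker address, group id, and topic name are placeholder assumptions, not values from the sprint:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class VivoEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "vivo-ingest");             // placeholder consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("vivo-events")); // placeholder topic
            while (true) {
                // Poll for new messages; each record carries a key and a value.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```

Each distinct group.id receives its own full copy of the topic, which is what allows any number of producers to feed any number of downstream systems independently.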
Recent sprint
Ingest ORCID data into VIVO
Walkthrough of flow (a code sketch follows the list):
Extract all ORCID_IDs associated with UQAM members
'./orcid_get_all_records.sh'
Convert ORCID JSON into RDF
Transform RDF into VIVO representation
Send to Kafka
Then pass to VIVO
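A condensed sketch of the convert-and-send steps, assuming Apache Jena for the RDF model and string-serialized N-Triples as the Kafka payload; the topic name "vivo-ingest" and the bare FOAF mapping are illustrative assumptions, since the real ORCID-to-VIVO mapping is still being designed:

```java
import java.io.StringWriter;
import java.util.Properties;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrcidToVivoPipeline {

    static final String FOAF = "http://xmlns.com/foaf/0.1/";

    public static void main(String[] args) {
        // Steps 2-3: build an RDF model in the VIVO representation
        // (here just a FOAF person, standing in for the full mapping).
        Model model = ModelFactory.createDefaultModel();
        Resource person = model.createResource("https://orcid.org/0000-0002-1825-0097");
        person.addProperty(RDF.type, model.createResource(FOAF + "Person"));
        person.addProperty(model.createProperty(FOAF, "name"), "Josiah Carberry");

        // Serialize the model to N-Triples so it can travel as a plain string message value.
        StringWriter out = new StringWriter();
        model.write(out, "N-TRIPLES");

        // Step 4: send to Kafka; broker and topic are placeholders.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("vivo-ingest", person.getURI(), out.toString()));
        }
    }
}
```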
Demo
25,171 statements pushed through Kafka
763 users, with names, organizations, and competencies
Summary
The ORCID ontology needs to be refined and clarified.
The mapping between ORCID and VIVO also needs further work.
The structure of the Kafka message has to be designed to respect the add/delete/modify record actions (see the sketch after this list).
Several minor bugs need to be fixed in the scripts.
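As a starting point for that message design, one possible envelope is sketched below; the field names and the N-Triples payload are assumptions, not a settled format:

```java
/** A possible Kafka message envelope for record changes; field names are assumptions. */
public record ChangeMessage(
        Action action,      // which record operation the consumer should apply
        String subjectUri,  // the VIVO individual the change concerns
        String triples) {   // the affected statements, serialized as N-Triples

    public enum Action { ADD, DELETE, MODIFY }
}
```

Keying the Kafka messages by subjectUri would keep all actions on one individual ordered within a single partition.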
Future plans
Building a POC VIVO → Kafka → ORCID
Proving that the architecture can operate in event-driven, real-time mode
Porting the POCs to Java
Redesigning the mapping process, the ORCID ontology structure, and the message structure
TIB
Using Kafka as a consumer of messages produced by VIVO
Tasks
Listener in VIVO to capture internal changes
Producer to send to Kafka
VIVO Kafka-Module
ModelChangedListener and ChangeListener
Kafka start-up listener
HTTP connection
VIVO producer
Spring Boot service
Code will be on GitHub soon
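Until the code is published, here is a minimal sketch of the listener-to-producer bridge, assuming Jena's StatementListener convenience base class; the topic name, key choice, and tab-separated message layout are assumptions, not TIB's actual design:

```java
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.rdf.model.listeners.StatementListener;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Forwards Jena model changes to a Kafka topic (illustrative sketch). */
public class KafkaForwardingListener extends StatementListener {

    private final KafkaProducer<String, String> producer;
    private final String topic;

    public KafkaForwardingListener(KafkaProducer<String, String> producer, String topic) {
        this.producer = producer;
        this.topic = topic;
    }

    @Override
    public void addedStatement(Statement s) {
        send("add", s);
    }

    @Override
    public void removedStatement(Statement s) {
        send("delete", s);
    }

    private void send(String action, Statement s) {
        // Key by subject URI so all changes to one individual land in the same partition.
        String key = s.getSubject().toString();
        String value = action + "\t" + s.asTriple().toString();
        producer.send(new ProducerRecord<>(topic, key, value));
    }
}
```

A model obtained from VIVO's Jena layer could register it via model.register(new KafkaForwardingListener(producer, "vivo-changes")).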
Discussion
Interest in the architecture presented
Allows for integration with any number of source systems
This initiative also allows for outputs from VIVO (not just ingest into it)
Can past initiatives be used in this context?
...such as ORCID-to-VIVO
...such as Dimensions-to-VIVO
Could this support large-scale ingest?
100M+ triples?
Are there Kafka buffer limits or throttling concerns?
Kafka is designed for "big data"
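Throughput is mostly a matter of producer tuning rather than hard limits. Below is an illustrative set of producer settings for bulk ingest; the values are assumptions to benchmark against, not results from the sprint:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class BulkIngestConfig {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 64L * 1024 * 1024); // producer buffer; send() blocks when full
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 128 * 1024);           // batch size per partition, in bytes
        props.put(ProducerConfig.LINGER_MS_CONFIG, 50);                    // wait up to 50 ms to fill a batch
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");          // compress RDF payloads on the wire
        return props;
    }
}
```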
Next steps
TIB: VIVO to other systems by Feb/May
TIB: Other systems to VIVO... timeline is further out