Team UQAM and team TIB had a productive first investigative sprint with the objective of learning about Apache Kafka. We would like to start the work of the data ingest task force as soon as possible, to present the outcome of our first investigation while it is fresh, to see whether we are aligned with others, and to get valuable feedback.
"A professor wishes to add the reference to a scientific article, irrespective of whether he chooses ORCID or VIVO, the information he will enter in either of these platforms will be mutually updated"
Goal of using Kafka with VIVO:
VIVO becomes one component in the enterprise, rather than its center
Main idea of Kafka
An event-driven messaging system
Allows for many-to-many producers and consumers (see the consumer sketch below)
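To make this concrete, a minimal Kafka consumer in Java might look like the sketch below; the broker address, group id, and topic name are placeholder assumptions, not values from the sprint:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class VivoEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "vivo-ingest");             // placeholder consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("vivo-events")); // placeholder topic
            while (true) {
                // Poll for new messages; each record carries a key and a value.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```

Each distinct group.id receives its own full copy of the topic, which is what allows any number of producers to feed any number of downstream systems independently.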
Recent sprint
Ingest ORCID data into VIVO
Walkthrough of flow (a code sketch follows the list):
Extract all ORCID_IDs associated with UQAM members
'./orcid_get_all_records.sh'
Convert ORCID JSON into RDF
Transform RDF into VIVO representation
Send to Kafka
Then pass to VIVO
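A condensed sketch of the convert-and-send steps, assuming Apache Jena for the RDF model and string-serialized N-Triples as the Kafka payload; the topic name "vivo-ingest" and the bare FOAF mapping are illustrative assumptions, since the real ORCID-to-VIVO mapping is still being designed:

```java
import java.io.StringWriter;
import java.util.Properties;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrcidToVivoPipeline {

    static final String FOAF = "http://xmlns.com/foaf/0.1/";

    public static void main(String[] args) {
        // Steps 2-3: build an RDF model in the VIVO representation
        // (here just a FOAF person, standing in for the full mapping).
        Model model = ModelFactory.createDefaultModel();
        Resource person = model.createResource("https://orcid.org/0000-0002-1825-0097");
        person.addProperty(RDF.type, model.createResource(FOAF + "Person"));
        person.addProperty(model.createProperty(FOAF, "name"), "Josiah Carberry");

        // Serialize the model to N-Triples so it can travel as a plain string message value.
        StringWriter out = new StringWriter();
        model.write(out, "N-TRIPLES");

        // Step 4: send to Kafka; broker and topic are placeholders.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("vivo-ingest", person.getURI(), out.toString()));
        }
    }
}
```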
Demo
25,171 statements pushed through Kafka
763 users, with names, organizations, and competencies
Summary
The ORCID ontology needs to be refined and clarified.
The mapping between ORCID and VIVO also needs further work.
The structure of the Kafka message has to be designed to respect the add/delete/modify record actions (see the sketch after this list).
Several minor bugs need to be fixed in the scripts.
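As a starting point for that message design, one possible envelope is sketched below; the field names and the N-Triples payload are assumptions, not a settled format:

```java
/** A possible Kafka message envelope for record changes; field names are assumptions. */
public record ChangeMessage(
        Action action,      // which record operation the consumer should apply
        String subjectUri,  // the VIVO individual the change concerns
        String triples) {   // the affected statements, serialized as N-Triples

    public enum Action { ADD, DELETE, MODIFY }
}
```

Keying the Kafka messages by subjectUri would keep all actions on one individual ordered within a single partition.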
Future plans
Building a POC VIVO → Kafka → ORCID
Proving that the architecture can operate in event-driven, real-time mode
Porting the POCs to Java
Redesigning the mapping process, the ORCID ontology structure, and the message structure
TIB
Using Kafka as a consumer of messages produced by VIVO
Tasks
Listener in VIVO to capture internal changes
Producer to send to Kafka
VIVO Kafka-Module
ModelChangedListener and ChangeListener
Kafka start-up listener
HTTP connection
VIVO producer
Spring Boot service
Code will be on GitHub soon
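Until the code is published, here is a minimal sketch of the listener-to-producer bridge, assuming Jena's StatementListener convenience base class; the topic name, key choice, and tab-separated message layout are assumptions, not TIB's actual design:

```java
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.rdf.model.listeners.StatementListener;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Forwards Jena model changes to a Kafka topic (illustrative sketch). */
public class KafkaForwardingListener extends StatementListener {

    private final KafkaProducer<String, String> producer;
    private final String topic;

    public KafkaForwardingListener(KafkaProducer<String, String> producer, String topic) {
        this.producer = producer;
        this.topic = topic;
    }

    @Override
    public void addedStatement(Statement s) {
        send("add", s);
    }

    @Override
    public void removedStatement(Statement s) {
        send("delete", s);
    }

    private void send(String action, Statement s) {
        // Key by subject URI so all changes to one individual land in the same partition.
        String key = s.getSubject().toString();
        String value = action + "\t" + s.asTriple().toString();
        producer.send(new ProducerRecord<>(topic, key, value));
    }
}
```

A model obtained from VIVO's Jena layer could register it via model.register(new KafkaForwardingListener(producer, "vivo-changes")).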
Discussion
Interest in the architecture presented
Allows for integration with any number of source systems
This initiative also allows for outputs from VIVO (not just ingest into it)
Can past initiatives be used in this context?
...such as ORCID-to-VIVO
...such as Dimensions-to-VIVO
Could this support large-scale ingest?
100M+ triples?
Are there Kafka buffer limits or throttling concerns?
Kafka is designed for "big data"
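Throughput is mostly a matter of producer tuning rather than hard limits. Below is an illustrative set of producer settings for bulk ingest; the values are assumptions to benchmark against, not results from the sprint:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class BulkIngestConfig {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 64L * 1024 * 1024); // producer buffer; send() blocks when full
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 128 * 1024);           // batch size per partition, in bytes
        props.put(ProducerConfig.LINGER_MS_CONFIG, 50);                    // wait up to 50 ms to fill a batch
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");          // compress RDF payloads on the wire
        return props;
    }
}
```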
Next steps
TIB: VIVO to other systems by Feb/May
TIB: Other systems to VIVO... timeline is further out