Reference we can use as a starting point:  Change Management NOTES from LODLAM 2020

Overview

This document describes the types of changes that can occur in authoritative data. For each type, the following information will be explored: a description, discussion, examples, an approach, and an example data stream.

NOTE: At least initially, the example data will be shown in JSON, JSON-LD, or possibly some other format. Final recommendations for format will be in the Deliverable documents.

Types of Changes


New

Description:

This type provides information on a completely new entity.

Discussion:

If the data for the entity is included in the data stream, how would the edges of the graph be defined? For some authorities, the entity may be defined as just the data where the new entity URI is the subject (e.g. ontology is limited to SKOS). For others, data for an entity can include data several layers away from the subject URI of the new entity (e.g. works described in BibFrame). Some may include blank nodes.

The wider the graph, the more chance there is that data expressed in the NEW data would overlap with existing data in the cache.

Option 2:

How to determine relevancy of the new entity? For a full cache, all new entities are relevant. If the cache is not full, then the downstream consumer will need to decide which new entities are relevant. This can be supported by a notification process where an institution requests to be notified when the entity is added. An intermediate 3rd party could provide the notification, since authorities may not have the bandwidth to do this directly.

Service providers may find the change management documents described here useful as a way to provide these notifications.

Examples:


Approach:

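Two possible approaches: Option 1 sends only the URI of the new entity; Option 2 sends the URI together with the full entity data.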

Example Data Stream:

For Option 1:

{
  "type": "NEW",
  "URI": "https://uri.of.new.entity"
}


For Option 2:

{
  "type": "NEW",
  "URI": "https://uri.of.new.entity",
  "entity": { full entity as JSON-LD }
}
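
A minimal sketch of how a downstream consumer might process the two options above. The cache layout, the `apply_new` function, and the `fetch` callback are all assumptions made for illustration; nothing here is a recommended implementation. With Option 1 the consumer has to retrieve the entity itself, while with Option 2 the entity arrives with the entry.

```python
from typing import Callable, Dict

# Hypothetical cache: maps an entity URI to its graph (here, a plain dict).
Cache = Dict[str, dict]


def apply_new(entry: dict, cache: Cache, fetch: Callable[[str], dict]) -> None:
    """Add a NEW entity to the cache.

    Option 2 entries carry the entity inline. Option 1 entries carry only
    the URI, so the entity is retrieved with the caller-supplied `fetch`.
    """
    if entry["type"] != "NEW":
        raise ValueError(f"unexpected change type: {entry['type']}")
    uri = entry["URI"]
    entity = entry["entity"] if "entity" in entry else fetch(uri)
    cache[uri] = entity


def fake_fetch(uri: str) -> dict:
    """Stand-in for dereferencing the URI against the authority."""
    return {"@id": uri, "source": "fetched"}


cache: Cache = {}

# Option 1: URI only, so the entity is fetched.
apply_new({"type": "NEW", "URI": "https://uri.of.new.entity"}, cache, fake_fetch)

# Option 2: entity included in the data stream.
apply_new(
    {
        "type": "NEW",
        "URI": "https://uri.of.other.entity",
        "entity": {"@id": "https://uri.of.other.entity", "source": "inline"},
    },
    cache,
    fake_fetch,
)
```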



Deleted

Description:

This type provides information on an entity that was completely removed from the authority. See also Deprecated.

Discussion:

For a deleted entity, the URI will no longer resolve.

Minimum: URI of the deleted thing.

If the cache stores each entity as a blob graph, then the URI is sufficient.
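
A minimal sketch of why the URI is enough for a blob cache, assuming the cache is a simple mapping from entity URI to its blob. The type name `DELETE` and the `apply_delete` function are hypothetical; this document does not define them.

```python
from typing import Dict


def apply_delete(entry: dict, cache: Dict[str, dict]) -> bool:
    """Remove the blob for a deleted entity.

    Returns True if the entity was in the cache, False otherwise.
    "DELETE" is a hypothetical type name used only for this sketch.
    """
    if entry["type"] != "DELETE":
        raise ValueError(f"unexpected change type: {entry['type']}")
    return cache.pop(entry["URI"], None) is not None


blob_cache = {"https://uri.of.deleted.entity": {"label": "some entity"}}
entry = {"type": "DELETE", "URI": "https://uri.of.deleted.entity"}

first = apply_delete(entry, blob_cache)   # entity was cached and is removed
second = apply_delete(entry, blob_cache)  # nothing left to remove
```

This only works because the whole entity lives under one key. If data about the entity is spread across other graphs, the URI alone does not say which other triples to remove.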

Challenges:

It is better not to delete. It is better to deprecate (see below) and add a "see instead" reference.

Approach:


Example Data Stream:



Deprecated

Description:

This type provides information on an entity that still exists in the authority, but is marked as deprecated, meaning it should no longer be used. For deprecated entities, the URI will continue to resolve. See also Split and Merge.

Discussion:

Deprecation typically happens when:

In all cases, the entity remains in the authority, allowing its URI to still resolve for preservation, for backward compatibility, and to give downstream consumers time to update their references to the entity.

It is common practice for the deprecated entity to include information on what should be used instead.
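
A minimal sketch of how a consumer might use that "use instead" information, assuming it has been collected into a mapping from deprecated URI to replacement URI. The mapping, the `resolve_current` function, and the example URIs are hypothetical.

```python
from typing import Dict, Optional


def resolve_current(uri: str, use_instead: Dict[str, Optional[str]]) -> Optional[str]:
    """Follow "use instead" links until reaching a URI that is not deprecated.

    `use_instead` maps a deprecated URI to its replacement, or to None when
    the authority gives no replacement. Returns None if the chain ends
    without a replacement or loops back on itself.
    """
    seen = set()
    while uri in use_instead:
        if uri in seen:
            return None  # cycle in the replacement chain
        seen.add(uri)
        replacement = use_instead[uri]
        if replacement is None:
            return None  # deprecated with nothing to use instead
        uri = replacement
    return uri


use_instead = {
    "https://example.org/a": "https://example.org/b",
    "https://example.org/b": "https://example.org/c",
    "https://example.org/d": None,
}

current = resolve_current("https://example.org/a", use_instead)
# current == "https://example.org/c"
```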

Are there challenges when there are lots of deprecated entities?

Deprecations are fulfilling a need to...

Are there times when a deprecated term should become deleted?

Implementation

Can you deprecate a part of the graph for an entity?

RDF is atemporal, meaning that once you publish a graph, it exists and will always exist. A graph is true, and it is later succeeded by another graph. How do we identify what is different? Can deprecation be used to identify what has changed? If you have two graphs, version 1 and version 2, the difference between them is the change. A challenge is knowing whether the edges of the graph are different in version 1 vs. version 2.
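
A minimal sketch of "the difference is the change", assuming each version of the graph is available as a set of (subject, predicate, object) triples with no blank nodes. Blank node identifiers are local to a graph, so they cannot be compared this simply. The `diff_graphs` function and the sample triples are illustrative only.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def diff_graphs(old: Set[Triple], new: Set[Triple]) -> Tuple[Set[Triple], Set[Triple]]:
    """Return (removed, added) triples between two versions of a graph."""
    removed = old - new  # in version 1 but not in version 2
    added = new - old    # in version 2 but not in version 1
    return removed, added


uri = "https://uri.of.changing.entity"
version1 = {
    (uri, "rdf:type", "skos:Concept"),
    (uri, "skos:prefLabel", '"old value"@en'),
}
version2 = {
    (uri, "rdf:type", "skos:Concept"),
    (uri, "skos:prefLabel", '"new value"@en'),
}

removed, added = diff_graphs(version1, version2)
```

The comparison is only meaningful if both versions were cut with the same graph edges; if version 2 includes triples several layers further out than version 1, those would show up as additions even though nothing changed upstream.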

LOC views graph as corresponding to a MARC record?  Each update of the system replaces the graph.  This is possible because the graph is a blob for the subject URI.

Example: Series - name-title authority record vs. multiple - author name changed and in perhaps 10 records, the records are updated upstream

A complication arises when the graph for a single URI connects to another URI that has its own definition outside the current URI's graph. If the connected node is a blank node, then this is not an issue.

The challenge with blobs is that if the blob references a URI that has a life outside the blob, then any data for the outside URI will be out of date within the blob. A potential solution is for the blob to hold only URIs and to pull in the data for the outside URIs live.

This is similar to normal forms in database theory: don't repeat data, in order to avoid data redundancy.

Copy cataloging in BibFrame (work, instance, agent, item, etc.): ideally, copy cataloging would be a list of URIs. There are significant practical difficulties.

Can we provide a SPARQL query that would indicate what needs to be updated in the graph?

Another option is to have delete, add, insert type diff commands that can be processed in order to make the graph changes.
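
A minimal sketch of processing such commands in order, assuming the cached graph is a set of triples and each command is an (operation, triple) pair. Because a set of triples has no ordering, "insert" is treated here as a synonym for "add"; that is an assumption of this sketch, as are the function name and command shape.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]
Command = Tuple[str, Triple]


def apply_commands(graph: Set[Triple], commands: Iterable[Command]) -> Set[Triple]:
    """Apply delete/add/insert commands in order and return the new graph."""
    result = set(graph)  # leave the caller's graph untouched
    for op, triple in commands:
        if op in ("add", "insert"):
            result.add(triple)
        elif op == "delete":
            result.discard(triple)  # deleting an absent triple is a no-op
        else:
            raise ValueError(f"unknown command: {op}")
    return result


uri = "https://uri.of.changing.entity"
graph = {(uri, "skos:prefLabel", '"old value"@en')}
commands = [
    ("delete", (uri, "skos:prefLabel", '"old value"@en')),
    ("add", (uri, "skos:prefLabel", '"new value"@en')),
]

updated = apply_commands(graph, commands)
```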

Examples:


Approach:


Example Data Stream:



Canceled

Description:

MeSH uses the terminology that a heading is canceled.  They generally include use_instead to identify the alternative.  This may be the same as Deprecated.



Split

Description:

This type provides information on an entity that was split into two or more separate entities.

Discussion:

This commonly results in a new entity for each part of the split. The original entity becomes deprecated or deleted. In some cases, the original entity for the split continues to exist with a different set of data.

This may be a sub-class of deleted or deprecated since the original entity is typically no longer valid under the original URI.

Examples:


Approach:


Example Data Stream:



Merge

Description:

This type provides information on two or more entities that were merged into a single entity.

Discussion:

This commonly results in a new entity with the data coming from each of the merged entities. The original entities become deprecated or deleted. In some cases, no new entity is created and the entities are instead merged into one of the existing entities.

This may be a sub-class of deleted or deprecated since the original entities are typically no longer valid under the original URIs.
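
A minimal sketch of what a merge could mean for a downstream consumer, assuming the consumer holds a list of entity URIs on a record and has been told which original URIs were merged into which surviving URI. The `remap_references` function, the mapping, and the example URIs are hypothetical.

```python
from typing import Dict, List


def remap_references(references: List[str], merged_into: Dict[str, str]) -> List[str]:
    """Replace URIs of merged entities with the URI of the resulting entity.

    Order is preserved and duplicates created by the merge are dropped.
    """
    result: List[str] = []
    for uri in references:
        target = merged_into.get(uri, uri)  # unchanged if not part of a merge
        if target not in result:
            result.append(target)
    return result


merged_into = {
    "https://example.org/entity/1": "https://example.org/entity/merged",
    "https://example.org/entity/2": "https://example.org/entity/merged",
}
record_references = [
    "https://example.org/entity/1",
    "https://example.org/entity/2",
    "https://example.org/entity/3",
]

remapped = remap_references(record_references, merged_into)
```

A split does not reduce to a lookup like this, since one original URI maps to several candidate entities and choosing among them likely needs review.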

Examples:

Approach:


Example Data Stream:



Changed

Description:

This type provides information on an existing entity with changed data.

Discussion:


Examples:


Approach:


Example Data Stream:



Label Change Only

Description:

This type provides information on an existing entity with changed label data.

Discussion:

This specifically meets the need of applications that cache labels. Should downstream consumers cache labels at all? Several indicate that this is common practice.

Examples of use cases for caching labels in applications:

Discussion on Primary Label:

Discussion on Variant Labels:

Is there an ontology that describes types of changes? Activity Streams accommodates this functionality.


Examples:

Challenges:

Approach:

Minimally need to include: the URI of the entity, the predicate identifying the label, and the new label.

NOTE: This can be represented as a triple.  <URI> <PREDICATE> "new label"@en


To be able to replace a label, also need: the old label that is being replaced.

NOTE: This would remove the triple.  <URI> <PREDICATE> "old label"@en


OPTION 1:  Single type LABEL_CHANGE - all change information is in a single change entry

OPTION 2: Two change entries, a DELETE_LABEL entry for the label being replaced, followed by an ADD_LABEL entry for the new label. Question: Will this provide an adequate indicator to downstream consumers, allowing them to update cached values?

Example Data Stream:

For Option 1:

{
  "type": "LABEL_CHANGE",
  "URI": "https://uri.of.changing.entity",
  "PREDICATE": "skos:prefLabel",
  "NEW_LABEL": { "@value": "new value", "@language": "en" },
  "OLD_LABEL": { "@value": "old value", "@language": "en" }
}


{
  "type": "LABEL_CHANGE",
  "ADD": "https://uri.of.changing.entity",
  "PREDICATE": "skos:prefLabel",
  "NEW_LABEL": { "@value": "new value", "@language": "en" },
  "OLD_LABEL": { "@value": "old value", "@language": "en" }
}


For Option 2:

{
  "type": "DELETE_LABEL",
  "URI": "https://uri.of.changing.entity",
  "PREDICATE": "skos:prefLabel",
  "LABEL": { "@value": "old value", "@language": "en" }
}


{
  "type": "ADD_LABEL",
  "URI": "https://uri.of.changing.entity",
  "PREDICATE": "skos:prefLabel",
  "LABEL": { "@value": "new value", "@language": "en" }
}
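
A minimal sketch comparing how a consumer that caches labels might apply Option 1 and Option 2. It assumes the entries have already been parsed and that each label is held as a (text, language) pair; the cache layout and the `apply_label_entry` function are hypothetical.

```python
from typing import Dict, Set, Tuple

Label = Tuple[str, str]                         # (text, language tag)
LabelCache = Dict[Tuple[str, str], Set[Label]]  # (URI, predicate) -> labels


def apply_label_entry(entry: dict, cache: LabelCache) -> None:
    """Apply one label change entry to the cached labels."""
    labels = cache.setdefault((entry["URI"], entry["PREDICATE"]), set())
    kind = entry["type"]
    if kind == "LABEL_CHANGE":      # Option 1: old and new in one entry
        labels.discard(entry["OLD_LABEL"])
        labels.add(entry["NEW_LABEL"])
    elif kind == "DELETE_LABEL":    # Option 2, first entry
        labels.discard(entry["LABEL"])
    elif kind == "ADD_LABEL":       # Option 2, second entry
        labels.add(entry["LABEL"])
    else:
        raise ValueError(f"unexpected change type: {kind}")


URI = "https://uri.of.changing.entity"
KEY = (URI, "skos:prefLabel")
OLD, NEW = ("old value", "en"), ("new value", "en")

# Option 1: a single LABEL_CHANGE entry.
cache_1: LabelCache = {KEY: {OLD}}
apply_label_entry(
    {"type": "LABEL_CHANGE", "URI": URI, "PREDICATE": "skos:prefLabel",
     "NEW_LABEL": NEW, "OLD_LABEL": OLD},
    cache_1,
)

# Option 2: DELETE_LABEL followed by ADD_LABEL.
cache_2: LabelCache = {KEY: {OLD}}
apply_label_entry(
    {"type": "DELETE_LABEL", "URI": URI, "PREDICATE": "skos:prefLabel", "LABEL": OLD},
    cache_2,
)
labels_between_entries = set(cache_2[KEY])  # empty: no label at this point
apply_label_entry(
    {"type": "ADD_LABEL", "URI": URI, "PREDICATE": "skos:prefLabel", "LABEL": NEW},
    cache_2,
)
```

Both options end in the same cached state. With Option 2 the cache briefly holds no label between the two entries, and nothing in the entries themselves ties the delete to the add, which bears on the question of whether Option 2 gives consumers an adequate indicator.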




Other Considerations and Questions

Are the following handled differently when managing change?


Notification when an initial search fails to match and later a match becomes available.  Same for partial match.


Would the types of changes be different for different types of entities (e.g. MeSH Subjects vs. Names)? 


Could information be expressed as a diff, similar to how Git does diffs?
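
A minimal sketch of what a Git-style diff could look like, assuming each version of an entity is serialized with one triple per line (as in N-Triples). It uses `difflib.unified_diff` from the Python standard library; the `ntriples_diff` function and the sample data are illustrative only.

```python
import difflib
from typing import List


def ntriples_diff(old_lines: List[str], new_lines: List[str]) -> List[str]:
    """Return a unified diff between two one-triple-per-line serializations.

    Lines are sorted first because the order of triples in a serialization
    carries no meaning and would otherwise show up as spurious changes.
    """
    return list(
        difflib.unified_diff(
            sorted(old_lines),
            sorted(new_lines),
            fromfile="version1",
            tofile="version2",
            lineterm="",
        )
    )


SUBJECT = "<https://uri.of.changing.entity>"
PREF_LABEL = "<http://www.w3.org/2004/02/skos/core#prefLabel>"

old_version = [f'{SUBJECT} {PREF_LABEL} "old value"@en .']
new_version = [f'{SUBJECT} {PREF_LABEL} "new value"@en .']

diff_lines = ntriples_diff(old_version, new_version)
for line in diff_lines:
    print(line)
```

Lines prefixed with `-` are triples to remove and lines prefixed with `+` are triples to add. As with any triple comparison, blank nodes would need separate handling because their identifiers are not stable between serializations.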