...

The wider the graph, the more chance there is that data expressed in the NEW data would overlap with existing data in the cache.

Option 2:

  • Feasible, but should be optional because the graph for certain authorities may be computationally expensive
  • Expense may be in terms of how many new documents will be included
  • What is a full entity?  How much of the graph needs to be included?  What serialization should be included?  What ontology should be included if the authority supports multiple ontologies?

How to determine the relevancy of a new entity?  For a full cache, all new entities are relevant.  For a partial cache, the downstream consumer will need to decide which new entities are relevant.  This can be supported by a notification process in which an institution requests to be notified when an entity is added.  An intermediate third party could provide the notification, since authorities may not have the bandwidth to do this directly.

Service providers may find the change management documents described here useful as a way to provide these notifications.

Examples:


Approach:

Two possible approaches:

...

For a deleted entity, the URI will no longer resolve.

Minimum: URI of the deleted thing.

If cache is a blob graph, then URI is sufficient.

Challenges:

  • Are there other outside references to the triple?  Deletion can produce orphaned sub-chains.  If access is always through the subject URI, there may be tolerance for cruft where the outside reference doesn't exist when you try to access it.
  • How far out into the graph would you delete? - following blank nodes; following through other URIs.
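The "how far out" question can be made concrete with a small sketch. It assumes one possible policy: follow blank nodes but stop at other URIs. Triples are plain tuples, blank nodes are strings starting with `_:`, and the `ex:` names and the `delete_entity` helper are hypothetical, used only for illustration.

```python
# Triples are (subject, predicate, object) tuples. Blank nodes are strings
# starting with "_:" (an illustrative convention, not a standard API).

def is_blank(node):
    return isinstance(node, str) and node.startswith("_:")

def delete_entity(triples, uri):
    """Remove `uri` as a subject, plus any blank nodes reachable from it.

    Other URIs are NOT followed, and triples elsewhere that point AT `uri`
    are left in place (they become dangling references).
    """
    to_remove = {uri}
    frontier = [uri]
    while frontier:
        subject = frontier.pop()
        for s, p, o in triples:
            if s == subject and is_blank(o) and o not in to_remove:
                to_remove.add(o)
                frontier.append(o)
    return {t for t in triples if t[0] not in to_remove}

graph = {
    ("ex:person1", "ex:label", "Smith, Jane"),
    ("ex:person1", "ex:birth", "_:b1"),
    ("_:b1", "ex:year", "1950"),
    ("ex:person1", "ex:occupation", "ex:occ9"),
    ("ex:occ9", "ex:label", "Author"),
    ("ex:work7", "ex:creator", "ex:person1"),
}
remaining = delete_entity(graph, "ex:person1")
```

Under this policy the occupation entity survives because it has its own URI, and the triple from `ex:work7` still points at the deleted URI, which is exactly the outside-reference problem noted above. The sketch also assumes a blank node is not shared with another subject.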

Better not to delete.  Better to deprecate (see below) and add a "see instead" reference.

Approach:


Example Data Stream:

...

It is common practice for the deprecated entity to include information on what should be used instead.

Are there challenges when there are lots of deprecated entities?

  • triples do grow as deprecations grow
  • deprecations can be indexed separately or together, but this is driven by decisions made at the authority provider

Deprecations fulfill a need to...

  • maintain URIs over time
  • avoid throwing away any information
  • give a history of the authority
  • aid disambiguation

Are there times when a deprecated term should become deleted?

  • would this be a choice of the consumer?

Implementation

  • entity deleted, but URI is marked deprecated and remains in a registry of URIs
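A registry of this kind could be sketched as below. This is only an illustration: the dictionary layout, the `deprecate` and `resolve` helpers, and the `ex:` URIs are assumptions, not something defined by any authority.

```python
# Hypothetical registry: maps a deprecated URI to its status and, when known,
# the URI that should be used instead.

def deprecate(registry, uri, use_instead=None):
    registry[uri] = {"status": "deprecated", "use_instead": use_instead}

def resolve(registry, uri):
    """Follow 'use instead' pointers until reaching a URI with no replacement."""
    seen = set()
    while uri in registry and registry[uri]["use_instead"] is not None:
        if uri in seen:
            raise ValueError("cycle in 'use instead' chain at " + uri)
        seen.add(uri)
        uri = registry[uri]["use_instead"]
    return uri

registry = {}
deprecate(registry, "ex:term1", use_instead="ex:term2")
deprecate(registry, "ex:term2", use_instead="ex:term3")
deprecate(registry, "ex:term9")  # deprecated with no replacement

current = resolve(registry, "ex:term1")
```

A URI that was deprecated without a replacement resolves to itself, so a consumer would still need to check its status in the registry.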

Can you deprecate a part of the graph for an entity?

  • Example - a specific attribute:  a label is no longer valid or an occupation is no longer valid
  • Example - a link to another entity: a link to a role or agent

RDF is atemporal, meaning that when you publish a graph, it exists and will always exist.  A graph is true and is later succeeded by another graph.  How to identify what is different?  Can deprecation be used to identify what has changed?  If you have two graphs, version 1 and version 2, the difference is the change.  The challenge is knowing which edges of the graph differ between version 1 and version 2.
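As a minimal sketch of "the difference is the change", two versions of a graph can be compared as sets of triples. This assumes the graphs contain no blank nodes (or that blank node labels are stable between versions); otherwise a plain set difference reports spurious changes. The `diff_graphs` helper and the `ex:` names are hypothetical.

```python
def diff_graphs(old, new):
    """Return (removed, added) triples between two versions of a graph."""
    removed = old - new
    added = new - old
    return removed, added

version1 = {
    ("ex:person1", "ex:label", "Smith, Jane, 1950-"),
    ("ex:person1", "ex:occupation", "ex:occ9"),
}
version2 = {
    ("ex:person1", "ex:label", "Smith, Jane, 1950-2020"),
    ("ex:person1", "ex:occupation", "ex:occ9"),
}
removed, added = diff_graphs(version1, version2)
```

Because sets are unordered, the result says which edges differ but nothing about ordering or about which removed triple corresponds to which added one.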

LOC views graph as corresponding to a MARC record?  Each update of the system replaces the graph.  This is possible because the graph is a blob for the subject URI.

Example: Series - name-title authority record vs. multiple - author name changed and in perhaps 10 records, the records are updated upstream

A complication arises when a graph that starts with a single URI connects to another URI that has its own definition outside the current URI's graph.  If the connection is to a blank node, then this is not an issue.

The challenge with blobs is that if the blob references a URI that has a life outside the blob, then any data for the outside URI held within the blob will go out of date.  A potential solution is for the blob to hold only URIs and pull in the data for outside URIs live.

Similar to normal forms in database theory.  Don't repeat data to avoid data redundancy.

Copy cataloging in BibFrame – work, instance, agent, item, etc. – Ideally, copy cataloging would be list of URIs. Significant practical difficulties.

Can we provide a SPARQL query that would indicate what needs to be updated in the graph?

Another option is to have delete, add, insert type diff commands that can be processed in order to make the graph changes.
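The ordered-commands option might look something like the following sketch. The command names `"add"` and `"delete"`, the `apply_changes` helper, and the `ex:` names are illustrative assumptions; no existing change format is implied.

```python
def apply_changes(triples, commands):
    """Apply (operation, triple) commands in order and return the new graph."""
    result = set(triples)
    for op, triple in commands:
        if op == "add":
            result.add(triple)
        elif op == "delete":
            result.discard(triple)  # deleting a missing triple is a no-op
        else:
            raise ValueError("unknown operation: " + op)
    return result

cached = {
    ("ex:person1", "ex:label", "Smith, Jane, 1950-"),
    ("ex:person1", "ex:occupation", "ex:occ9"),
}
commands = [
    ("delete", ("ex:person1", "ex:label", "Smith, Jane, 1950-")),
    ("add", ("ex:person1", "ex:label", "Smith, Jane, 1950-2020")),
]
updated = apply_changes(cached, commands)
```

Processing in order matters when the same triple is added and deleted within one stream; the last command wins.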

Examples:


Approach:


Example Data Stream:

...

  • It is common for there to be multiple variant labels.  In some cases, variants may represent different languages.

  • If DELETE_LABEL is supported, is it ok for a variant label to be deleted and not have a corresponding add?  This seems ok. 

Is there an ontology that describes types of changes?  Activity streams accommodates this functionality.

  • Is the process of defining this a process to create an ontology or similar to ontology creation?
  • Can activity stream documents describe the types of change?
  • Does this increase the level of agreement required?


Examples:

  • LOC is looking at using a feed that provides information on authoritative labels.  This is mostly used for name changes (e.g. person died and the death date is added to the label).

...

  • The label in the graph may not be directly connected to the URI (e.g. <URI><PREDICATE><NEW_LABEL>).  It may go through one or more other URIs or blank nodes with a number of different predicates.
  • What constitutes a label may vary between ontologies.  Common label scenarios:
    • a single primary label
    • multiple primary labels
    • primary and variant labels
    • primary label is not human readable
    • no primary label
  • Roles of consumers
    • Full cache processed fully by machine
    • Partial cache processed fully by machine or may need some human intervention
    • Human processing with a need to quickly see changes that are important from their perspective
  • Ordering
    • Graphs are inherently not ordered; what are the implications of this?
    • Is there a way to express a diff with unordered data?
    • Desire is for an easy visual way to see, process, and focus in on a specific change
    • Perhaps use a notification where user subscribes to particular types of changes or certain data changes
    • There is a W3C wiki page on diffing RDF graphs (https://www.w3.org/2001/sw/wiki/How_to_diff_RDF)
  • Label changes in one language but not another
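The subscription idea raised under Ordering could be sketched as a simple filter over change records. The record layout, the `matching_changes` helper, and all change type names other than DELETE_LABEL are hypothetical.

```python
def matching_changes(changes, subscribed_types, watched_uris=None):
    """Keep only the changes a subscriber asked for.

    `watched_uris=None` means the subscriber wants every URI (full cache);
    a set of URIs models a partial cache.
    """
    return [
        c for c in changes
        if c["type"] in subscribed_types
        and (watched_uris is None or c["uri"] in watched_uris)
    ]

changes = [
    {"type": "NEW_ENTITY", "uri": "ex:person5"},
    {"type": "DELETE_LABEL", "uri": "ex:person1"},
    {"type": "DEPRECATE", "uri": "ex:term1"},
    {"type": "DELETE_LABEL", "uri": "ex:person2"},
]
label_only = matching_changes(changes, {"DELETE_LABEL"})
partial = matching_changes(changes, {"DELETE_LABEL", "DEPRECATE"}, {"ex:person1"})
```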

Approach:

Minimally need to include:

  • URI
  • NEW_LABEL - the new label to use
  • PREDICATE (or some other identifier) - identifies which type of label is being replaced.  Perhaps this should be an LDPATH instead, to address a label that is farther down the graph from the subject URI.

NOTE: This can be represented as a triple.  <URI> <PREDICATE> "new label"@en
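A minimal sketch of such a label change record and how a consumer might apply it follows. It assumes the label hangs directly off the subject URI (not through blank nodes or other URIs), represents a language-tagged literal as a `(text, lang)` tuple, and replaces only labels in the same language. The `LabelChange` class, `apply_label_change` helper, and `ex:` names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LabelChange:
    uri: str          # URI of the entity
    predicate: str    # identifies which type of label is replaced
    new_label: str    # the new label to use
    lang: str = "en"

def apply_label_change(triples, change):
    """Replace same-language labels for (uri, predicate) with the new label."""
    kept = {
        (s, p, o) for (s, p, o) in triples
        if not (
            s == change.uri
            and p == change.predicate
            and isinstance(o, tuple)
            and o[1] == change.lang
        )
    }
    kept.add((change.uri, change.predicate, (change.new_label, change.lang)))
    return kept

cache = {
    ("ex:person1", "ex:prefLabel", ("Smith, Jane, 1950-", "en")),
    ("ex:person1", "ex:prefLabel", ("Smith, Jane (1950-)", "fr")),
}
change = LabelChange("ex:person1", "ex:prefLabel", "Smith, Jane, 1950-2020")
cache_after = apply_label_change(cache, change)
```

The French label is untouched, which matches the case where a label changes in one language but not another.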

...

  • The goal is to have the types be flexible enough to use the same basic data for each type for all authorities and all types of entities.


Could information be expressed as a DIFF similar to how GIT does DIFFs?

  • visual diffs are easy for humans to quickly process and focus in on the area of concern
  • how would this work for computer processing?
  • See more notes in General Discussion on Approaches on Existing Change Management Approaches