Comment: Migrated to Confluence 4.0

<?xml version="1.0" encoding="utf-8"?>
<html>

New History System - old page preserved below

Plan for New History System for DSpace

After examining the state of the current HistoryManager
and the data it produces, I'm inclined to discard
it and write a complete replacement.
Since the PLEDGE project needs
a functional history system, we are motivated to rebuild it
in time for the next major release (1.4).

The next few sections describe my strawman proposal for a new DSpace
History System. The original contents of this page are preserved below,
and contain some different perspectives on the design of a history system.

NOTE: This is an initial sketch of a proposal and is not yet
complete, but I want to put it in the wiki as early as possible to
gather responses, so please add your comments in the "Comments" section
below.

Goals

  • Preserve a fixed, unchangeable record of all significant changes to those data objects in a DSpace repository that are to be preserved.
    • Only consider changes to Item objects.
    • "Changes" include creation, adding or removing bitstreams, changes to Item-level metadata, changes to bitstream-level metadata, withdrawal/reinstatement, and deletion.
    • Associate the person responsible with each change event.
    • Record only such details of each event as are really necessary for provenance, e.g. new metadata values, but not bitstream contents.
  • Give access to the history data, by Item and through free-form queries.
  • Carry the relevant history data with an Item when it is moved from one DSpace repository to another.
  • Migrate whatever data can be reclaimed from the old DSpace history system.

Non-Goals

  • This is not a versioning system, so it does not usually attempt to record the substance of a change (e.g. contents of bitstreams).
  • Do not record any events that do not result in changes to the archive, so the history will not have records of disseminations.
  • Do not record changes that are only relevant within the archive, such as authorization policies, EPersons, Community and Collection hierarchies, etc.
  • Only record history data for objects that are in the archive, which means history is not recorded for Workspace and Workflow objects.

Justification

If the purpose of saving history data is to establish the provenance of
objects archived in DSpace as an aid to future preservationists, it makes
sense to save data only about the objects which are to be preserved.
This excludes transient objects (e.g. WorkspaceItem, WorkflowItem)
and objects that are only meaningful in the context of their home
DSpace archive: Collection, Community, and especially EPerson.
Note that the grouping of Items represented by a Collection can also
be expressed in the Items' metadata, a much more "preservable" manner.

Since preservation probably means transferring (or copying) Items to another
repository or archive at some point in the future, there should be a way
to include the relevant history data with each Item.

UPDATE: Actual History Schema

As of November, 2006, the actual RDF schema emitted by the new history
implementation is simpler and more streamlined than what was first
described here. The following outline of the schema describes what
you should expect to find in the RDF description of each event.

Namespaces

Namespaces in the History schema:

prefix      Namespace URI
rdf:        http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs:       http://www.w3.org/2000/01/rdf-schema#
history:    http://www.dspace.org/history#
dso:        http://www.dspace.org/objectModel#
abc:        http://metadata.net/harmony#
dc:         http://purl.org/dc/elements/1.1/

URIs to Identify Archival Objects

History records need a "normalized" way to refer to all DSpace Objects,
so there is a 1:1 mapping from identifier to object.
We have to be able to look up the History records relevant to the
object later, as well as generate the URI for an object when entering
new records about it.

Since not all archival DSpace objects have Handles, and in fact the
persistent identifier mechanism is likely to change in the near term, we
cannot use Handles for this identifier. That leaves database ID numbers
(along with the object type to discriminate between IDs). Although this
is not "archival" since the IDs might differ if the object is moved to
another archive, the History records from a single archive should be
self-consistent.
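
As a rough illustration of the 1:1 mapping, here is a minimal Python sketch that builds an object URI from the object type and database ID, and parses it back. The "info:dspace/dbid#type_id" layout is copied from the example statement later on this page; the function names and the validation rules are my own assumptions, not part of any DSpace API.

```python
import re
from typing import Tuple

# Assumption: layout taken from the example <info:dspace/dbid#item_814>.
DBID_PREFIX = "info:dspace/dbid#"
_DBID_PATTERN = re.compile(r"^info:dspace/dbid#([a-z]+)_(\d+)$")


def object_uri(object_type: str, db_id: int) -> str:
    """Build the normalized URI for an object from its type and database ID."""
    if not re.fullmatch(r"[A-Za-z]+", object_type):
        raise ValueError("object type must be a plain ASCII name")
    if db_id < 0:
        raise ValueError("database IDs are non-negative")
    return f"{DBID_PREFIX}{object_type.lower()}_{db_id}"


def parse_object_uri(uri: str) -> Tuple[str, int]:
    """Recover (object type, database ID) from a normalized object URI."""
    match = _DBID_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"not a DSpace dbid URI: {uri!r}")
    return match.group(1), int(match.group(2))
```

Because the type name is part of the URI, an Item and a Bitstream that happen to share a database ID still get distinct identifiers.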

XXX needs clarifying.

The History record does need to include the mapping from Handle (or
other persistent identifier) to the normalized object URI; this is
provided by the abc:instanceOf property described below.

See ObjectUri for
a description of the current "best practice" in generating URIs.

Action resources

Every event creates a unique resource representing that action, which has
a uniquely-generated (but otherwise meaningless) URI.
It can have the following properties:

  • rdf:type abc:Action
  • rdf:type history:''action-type'' – value is a URI corresponding to event types in the EventSystemPrototype:
    • history:Add
    • history:Remove
    • history:Create
    • history:Delete
    • history:Modify
    • history:ModifyMetadata
  • One of (abc:creates | abc:destroys | abc:hasPatient) SubjectURI
  • abc:atTime ISO 8601 timestamp (from event)
  • history:inArchive URI of DSpace Archive
  • abc:involves ObjectURI (if there is an Object)
  • abc:hasParticipant EPerson-URI (if available)
  • history:usesTool ExtraLogInfo (if available)
  • history:detail "event.getDetail()" (if available)
  • history:transactionID "event.getTransactionID()" (if available)
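
To make the property list above concrete, here is a minimal Python sketch that assembles the statements for one action as plain (subject, predicate, object) tuples. Several things here are assumptions on my part: the urn:uuid form of the action URI, the rule that Create uses abc:creates, Delete uses abc:destroys and everything else uses abc:hasPatient, and the function and parameter names. It is not the actual implementation.

```python
import uuid
from datetime import datetime, timezone

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ABC = "http://metadata.net/harmony#"
HISTORY = "http://www.dspace.org/history#"

# Assumption: which ABC property links an action to its subject.
SUBJECT_LINK = {"Create": "creates", "Delete": "destroys"}


def action_triples(action_type: str, subject_uri: str, archive_uri: str,
                   when: datetime, object_uri=None, eperson_uri=None):
    """Return (subject, predicate, object) triples describing one action."""
    if when.tzinfo is None:
        raise ValueError("timestamp must carry a timezone")
    # Unique but otherwise meaningless URI (urn:uuid is an assumption).
    action = f"urn:uuid:{uuid.uuid4()}"
    link = SUBJECT_LINK.get(action_type, "hasPatient")
    triples = [
        (action, RDF_TYPE, ABC + "Action"),
        (action, RDF_TYPE, HISTORY + action_type),
        (action, ABC + link, subject_uri),
        # ISO 8601 timestamp, normalized to UTC.
        (action, ABC + "atTime", when.astimezone(timezone.utc).isoformat()),
        (action, HISTORY + "inArchive", archive_uri),
    ]
    # Optional properties are emitted only when the data is available.
    if object_uri is not None:
        triples.append((action, ABC + "involves", object_uri))
    if eperson_uri is not None:
        triples.append((action, ABC + "hasParticipant", eperson_uri))
    return triples
```

Keeping the statements as tuples leaves the choice of RDF serialization (N3, RDF/XML, and so on) to whatever writes them out.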

DSpace Object resources

The DSpace Object which is the "subject" of the event always has properties:

  • rdf:type abc:Actuality
  • rdf:type dso:''ObjectType'' – its DSpace Object type, one of these URIs, which correspond to the DSpace Object type names:
    • dso:Community
    • dso:Collection
    • dso:Item
    • dso:Bundle
    • dso:Bitstream
    • dso:EPerson

The DSpace Object which is the "object" of the event has properties:

  • rdf:type abc:Actuality
  • rdf:type dso:''ObjectType'' – its DSpace Object type.

Optionally, both Object and Subject of the event may have these
properties if the data is available:

  • dc:title literal-title
  • abc:instanceOf Handle-URI

The EPerson Object which is the "participant" of the event has properties
(note that the email address of the EPerson is encoded in the URI):

  • rdf:type dso:EPerson – its DSpace Object type.

Binding between Handle and Object URI

Every DSpace Object with a Handle must record the mapping of that
Handle to an object URI (which might be identified only by the
database identifier in events before the Handle is assigned). This
is done with a statement on the object's DB-based URI relating it to
the Handle URI. For the Handle 1721.1/31429, for example:

<info:dspace/dbid#item_814> abc:instanceOf <info:dspace/handle#1721.1/31429>
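
Here is a small Python sketch of how such binding statements could be written and then indexed, so that events recorded before the Handle was assigned can still be found by Handle. The function names are hypothetical, and the one-statement-per-line text form simply mirrors the example above rather than any particular RDF serialization.

```python
import re
from typing import Dict, Iterable

# Matches a statement in the same shape as the example above.
_BINDING = re.compile(
    r"^<(info:dspace/dbid#[^>]+)>\s+abc:instanceOf\s+"
    r"<info:dspace/handle#([^>]+)>$"
)


def binding_statement(object_uri: str, handle: str) -> str:
    """Render the statement that ties a DB-based object URI to its Handle."""
    return f"<{object_uri}> abc:instanceOf <info:dspace/handle#{handle}>"


def index_by_handle(statements: Iterable[str]) -> Dict[str, str]:
    """Map each Handle to the DB-based object URI it is bound to."""
    index: Dict[str, str] = {}
    for statement in statements:
        match = _BINDING.match(statement.strip())
        if match:  # silently skip statements that are not bindings
            index[match.group(2)] = match.group(1)
    return index
```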

<hr>

(OUTDATED) History Data Model

After examining the goals, the existing system, and similar designs such
as the PREMIS Event model, I've outlined this data model for
representing DSpace object history.

It is based on the current version (3) of
ABC Harmony as an RDF ontology.
The ABC Harmony model is reasonably close to what
I had in mind, and using an existing
ontology lets us leverage its documentation and examples.

The first-class objects are:

  • Item, corresponds to the DSpace Item identified by a certain persistent identifier (i.e. Handle).
  • Event, describes a change to another object.
  • Archive, identifies a single instance of a DSpace installation.
  • EPerson, names an individual agent responsible for an event.

There are also two "second-class" objects. Since these describe aspects
of Items that can change over time, they only appear in the context of
an Event that creates or modifies an Item.

  • Bitstream, describes the aspects of a DSpace Bitstream object relevant to provenance.
  • Metadata, a metadata item attached to an Item in the DSpace object model.

Item

The Item is a holder for the DSpace Item's only immutable property,
its persistent identifier (i.e. Handle). An Item is identified by
its Handle so it has the same identifier on every DSpace archive
where it is present.
It has the RDF type value dspace:Item,
which is a subclass of abc:Actuality.

An Item's URI is its persistent identifier in URN syntax, e.g. hdl:123456789/123.

Archive

An Archive identifies the DSpace instance in which an Event occurs.
It is identified by the World Wide Web URL of its top-level page, e.g. http://dspace.mit.edu/.
It has the RDF type value dspace:Archive, which is a subclass of abc:Actuality.

It also may contain the following Dublin Core properties:

  • dc:title - Descriptive name of the archive, e.g. "Miskatonic University Digital Archive"

NOTE: This is just my naive interpretation of DC.
Metadata mavens and Dublin Core critics are invited to correct my
usage of these DC elements.

EPerson

An EPerson identifies, as precisely as possible, an agent authenticated
to the DSpace archive who was responsible for initiating an Event.
Its information may be of limited value for provenance,
since, in the DSpace architecture, an EPerson is defined only within
the context of an archive and is not given a persistent identifier.

The EPerson is identified by a combination of its home archive and
a cryptographic message digest of its attributes that are unique within
that archive (i.e. the email address). Its RDF type is dspace:EPerson,
which is a subclass of abc:Actuality.
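
One way to picture the "home archive plus digest" identifier is the Python sketch below. The choice of SHA-256, the "/eperson/" path segment, and the lower-casing of the email address are all assumptions made for illustration; this section does not fix any of them.

```python
import hashlib


def eperson_uri(archive_url: str, email: str) -> str:
    """Combine the home archive URL with a digest of the EPerson's email."""
    # Assumption: normalize the address so trivial variants hash the same.
    normalized = email.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    # Assumption: the "/eperson/" segment is a made-up layout.
    return f"{archive_url.rstrip('/')}/eperson/{digest}"
```

The same email address registered at two different archives yields two different URIs, which matches the point that an EPerson is only defined within its home archive.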

It also may contain the following Dublin Core properties:

  • dc:title - Personal name in canonical format, e.g. "Jack Florey".
  • dc:identifier.uri - "mailto" URI containing email address, e.g. "mailto:florey@dspace.org"

NOTE: This is just my naive interpretation of DC.
Metadata mavens and Dublin Core critics are invited to correct my
usage of these DC elements.

Bitstream

Represents a bitstream added to an Item. It includes information that
may be relevant to the item's provenance (to cross-check the contents
of the archive against history later, for example).

A Bitstream only appears as the subject of statements belonging to a
"Situation" that is the result of an Event which created or modified an Item.

  • rdf:type is dspace:Bitstream (a subclass of abc:Actuality).
  • dc:identifier.uri - The SequenceID of the bitstream, e.g. "#1"
  • dc:title - the name attribute of the bitstream, e.g. "thesis.pdf"
  • dc:format - short name of BitstreamFormat, e.g. "Adobe PDF"
  • dc:format.extent - size in bytes of the contents, e.g. "314592"
  • dc:type - type or purpose; i.e. name of the bundle containing bitstream, such as "ORIGINAL".
  • dspace:checksum-algorithm - name of checksum algorithm, e.g. "MD5"
  • dspace:checksum - value of checksum of bitstream contents, e.g. "6df9d97f2e8f9.."
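
The properties above can be derived mechanically from a bitstream's attributes and contents. The following Python sketch shows one way to do that, returning a plain dictionary keyed by property name. The function and parameter names are hypothetical, and MD5 is used only because it is the example algorithm named in the list.

```python
import hashlib
from typing import Dict


def describe_bitstream(sequence_id: int, name: str, format_name: str,
                       bundle: str, content: bytes) -> Dict[str, str]:
    """Collect the provenance-relevant properties of one bitstream."""
    return {
        "rdf:type": "dspace:Bitstream",
        "dc:identifier.uri": f"#{sequence_id}",
        "dc:title": name,
        "dc:format": format_name,
        # Size in bytes, stored as a string like the other literals.
        "dc:format.extent": str(len(content)),
        "dc:type": bundle,
        "dspace:checksum-algorithm": "MD5",
        "dspace:checksum": hashlib.md5(content).hexdigest(),
    }
```

Recomputing the same dictionary from the archive at a later date and comparing the size and checksum is the kind of cross-check mentioned at the start of this section.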

Metadata

This represents one metadata value belonging to an Item.
When an item has multiple values for the same metadata element/qualifier,
they appear as separate nodes in the RDF model, not as multiple values within one node.

Metadata only appears as the subject of statements belonging to a
"Situation" that is the result of an Event which created or modified an Item.

It may have the following properties:

  • rdf:type is dspace:Metadata (a subclass of abc:Actuality).
  • dspace:mdSchema - the metadata schema (default is "dc").
  • dspace:element - name of the Dublin Core-styled field identifier.
  • dspace:qualifier - qualifier of the Dublin Core-styled field identifier.
  • xml:lang - language code, e.g. "en".
  • dspace:value - value of the metadata field.

Since DSpace metadata is traditionally in Qualified Dublin Core fields,
there is a shorthand for listing these. The Metadata value only needs
the appropriate DC or QDC property, whose value is the metadata value,
and the optional xml:lang property, e.g. (in N3 format):

...
abc:contains [ rdf:type dspace:Metadata ;
dc:title "The Little Prince" ] ;
abc:contains [ rdf:type dspace:Metadata ;
xml:lang "fr" ;
dc:title "Le Petit Prince" ] ;
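
For comparison, here is a small Python sketch that produces the shorthand form shown above from the long-form fields (schema, element, qualifier, language, value). The function name, the "element.qualifier" spelling of qualified properties, and the simple string escaping are assumptions for illustration only.

```python
from typing import Optional


def metadata_node(element: str, value: str, qualifier: Optional[str] = None,
                  lang: Optional[str] = None, schema: str = "dc") -> str:
    """Render one metadata value as a shorthand N3 fragment."""
    # Assumption: qualified fields are written as element.qualifier.
    prop = f"{schema}:{element}" + (f".{qualifier}" if qualifier else "")
    # Escape backslashes and double quotes inside the literal.
    literal = value.replace("\\", "\\\\").replace('"', '\\"')
    lines = ["abc:contains [ rdf:type dspace:Metadata ;"]
    if lang:
        lines.append(f'    xml:lang "{lang}" ;')
    lines.append(f'    {prop} "{literal}" ] ;')
    return "\n".join(lines)
```

Each call yields one node, so an Item with two titles produces two separate fragments, in line with the rule that multiple values appear as separate nodes.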

Event

The purpose of the history system is to record "events", so the Event is
its central data structure. The history record is simply a collection
of Events. Each Event represents a change to an Item – although each
transaction on the DSpace server may result in more than one Event being
recorded.

Each event has the following properties:

  • The subject URI consists of the archive's URI followed by a locally-unique identifier for the Event.
  • rdf:type is abc:Event
  • abc:hasParticipant - value is the EPerson object who was the authenticated user responsible for this change.
  • abc:creates | abc:hasResult | abc:destroys - identifies the type of action; the value is the URI of the affected Item.
  • abc:atTime - Timestamp at which the event was logged, e.g. "Tue Jan 24 17:46:49 EST 2006"
  • abc:precedes - Optional; this refers to the abc:Situation that results from this action. It is a blank node that describes the details of what was changed.
  • abc:involves - Value is the URI of the DSpace archive in which this event occurred.

Situation

In the ABC Harmony
model, a Situation describes the "existential" (i.e. time-varying) aspects of
an Actuality at a certain point in time. The alterations to the state
of an Item after an Event make up a Situation that the Event precedes.
(They are connected by the abc:precedes property, which says the Event
precedes the Situation.)

  • rdf:type is abc:Situation
  • abc:contains - value is Bitstream or Metadata added to the Item at this time.
  • abc:removes - value is Bitstream or Metadata deleted from the Item at this time.

NOTE: There is actually no removes property in the ABC Harmony ontology,
but there is nothing else equivalent so we are taking the liberty of
adding it.
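
To show how Events and their Situations fit together, here is a minimal Python sketch that replays the contains/removes lists of an Item's Events in order to reconstruct what the Item currently holds. The class and function names are hypothetical, the Bitstream and Metadata nodes are reduced to plain strings for brevity, and the caller is assumed to pass the Events already sorted by time.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Situation:
    """State changes that result from one Event."""
    contains: List[str] = field(default_factory=list)  # nodes added
    removes: List[str] = field(default_factory=list)   # nodes deleted


@dataclass
class Event:
    uri: str                              # archive URI + locally-unique id
    item_uri: str                         # the affected Item
    precedes: Optional[Situation] = None  # optional details of the change


def replay(events: List[Event], item_uri: str) -> List[str]:
    """Fold the Situations for one Item, in order, into its contents."""
    state: List[str] = []
    for event in events:
        if event.item_uri != item_uri or event.precedes is None:
            continue
        for node in event.precedes.removes:
            if node in state:
                state.remove(node)
        state.extend(event.precedes.contains)
    return state
```

Without the added removes property there would be no way to express the second half of a replacement, which is why the liberty noted above seems necessary.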

</html>