Archived / Obsolete Documentation

Documentation in this space is no longer accurate.

<?xml version="1.0" encoding="utf-8"?>
<html>

New History System - old page preserved below

Plan for New History System for DSpace

After examining the state of the current HistoryManager and the data it
produces, I'm inclined to discard it and write a complete replacement.
Since the PLEDGE project needs a functional history system, we are
motivated to rebuild it in time for the next major release (1.4).

The next few sections describe my strawman proposal for a new DSpace
History System. The original contents of this page are preserved below,
and contain some different perspectives on the design of a history system.

NOTE: This is an initial sketch of a proposal and is not yet complete,
but I want to put it in the wiki as early as possible to gather
responses, so please add your comments in the "Comments" section
below.

Goals

  • Preserve a fixed, unchangeable record of all significant changes to those data objects in a DSpace repository that are to be preserved.
    • Only consider changes to
      Item
      objects.
    • "Changes" include creation, add/remove bitstreams, changes to Item-level metadata, changes to bitstream-level metadata, withdrawal/reinstatement, deletion.
    • Associate the person responsible with each change event.
    • Record only such details of each event as are really necessary for provenance, e.g. new metadata values, but not bitstream contents.
  • Give access to the history data, by Item and through free-form queries.
  • Carry the relevant history data with an Item when it is moved from one DSpace repository to another.
  • Migrate whatever data can be reclaimed from the old DSpace history system.

Non-Goals

  • This is not a versioning system, so it does not usually attempt to record the substance of a change (e.g. contents of bitstreams).
  • Do not record any events that do not result in changes to the archive, so the history will not have records of disseminations.
  • Do not record changes that are only relevant within the archive, such as authorization policies, EPersons, Community and Collection hierarchies, etc.
  • Only record history data for objects that are in the archive, which means history is not recorded for Workspace and Workflow objects.

Justification

If the purpose of saving history data is to establish the provenance of
objects archived in DSpace as an aid to future preservationists, it makes
sense to only save data about the objects which are to be preserved.
This excludes transient objects (e.g. WorkspaceItem, WorkflowItem)
and objects that are only meaningful in the context of their home
DSpace archive: Collection, Community, and especially EPerson.
Note that the grouping of Items represented by a Collection can also
be expressed in the Items' metadata, a much more "preservable" manner.

Since preservation probably means transferring (or copying) Items to another
repository or archive at some point in the future, there should be a way
to include the relevant history data with each Item.

UPDATE: Actual History Schema

As of November, 2006, the actual RDF schema emitted by the new history
implementation is simpler and more streamlined than what was first
described here. The following outline of the schema describes what
you should expect to find in the RDF description of each event.

Namespaces

Namespaces in the History schema

prefix      Namespace URI
rdf:        http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs:       http://www.w3.org/2000/01/rdf-schema#
history:    http://www.dspace.org/history#
dso:        http://www.dspace.org/objectModel#
abc:        http://metadata.net/harmony#
dc:         http://purl.org/dc/elements/1.1/
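
As a rough illustration of how the prefixes in this table are meant to be read, the following Python sketch expands a prefixed name such as abc:atTime into a full URI. The dictionary copies the table; the expand function is a hypothetical helper written for this page, not part of DSpace.

```python
# Prefix table copied from the "Namespaces in the History schema" list.
NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "history": "http://www.dspace.org/history#",
    "dso": "http://www.dspace.org/objectModel#",
    "abc": "http://metadata.net/harmony#",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def expand(prefixed_name):
    """Expand 'prefix:local' into a full URI (hypothetical helper)."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in NAMESPACES:
        raise ValueError(f"unknown or missing prefix in {prefixed_name!r}")
    return NAMESPACES[prefix] + local
```

For instance, expand("dso:Item") yields http://www.dspace.org/objectModel#Item.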

URIs to Identify Archival Objects

History records need a "normalized" way to refer to all DSpace Objects,
so there is a 1:1 mapping from identifier to object.
We have to be able to look up the History records relevant to the
object later, as well as generate the URI for an object when entering
new records about it.

Since not all archival DSpace objects have Handles, and in fact the
persistent identifier mechanism is likely to change in the near term, we
cannot use Handles for this identifier. That leaves database ID numbers
(along with the object type to discriminate between IDs). Although this
is not "archival" since the IDs might differ if the object is moved to
another archive, the History records from a single archive should be
self-consistent.

XXX needs clarifying..

The History record does need to include the mapping from Handle (or
other persistent identifier) to the normalized object URI; this is
provided by the abc:instanceOf property described below.

See ObjectUri for a description of the current "best practice" in
generating URIs.
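
A minimal sketch of the type-plus-database-ID scheme described in this section, assuming the info:dspace/dbid#type_id form that appears in the Handle binding example later on this page. The function name and the lowercasing of the type name are assumptions made for illustration; ObjectUri remains the reference for actual practice.

```python
def object_uri(object_type, db_id):
    """Build a normalized URI from a DSpace object type and database ID."""
    if db_id < 0:
        raise ValueError("database IDs are assumed to be non-negative")
    # The type name discriminates between ID spaces, since an Item and a
    # Bitstream may share the same numeric database ID.
    return f"info:dspace/dbid#{object_type.lower()}_{db_id}"
```

Because the type is part of the URI, two objects of different types with the same database ID still map to distinct identifiers.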

Action resources

Every event creates a unique resource representing that action, which has
a uniquely-generated (but otherwise meaningless) URI.
It can have the following properties:

  • rdf:type abc:Action
  • rdf:type history:action-type
    – action-type is a placeholder; the value is one of these URIs, which correspond to event types in the EventSystemPrototype:
    • history:Add
    • history:Remove
    • history:Create
    • history:Delete
    • history:Modify
    • history:ModifyMetadata
  • One of (abc:creates | abc:destroys | abc:hasPatient) SubjectURI
  • abc:atTime ISO 8601 timestamp (from event)
  • history:inArchive URI of DSpace Archive
  • abc:involves ObjectURI (if there is an Object)
  • abc:hasParticipant EPerson-URI (if available)
  • history:usesTool ExtraLogInfo (if available)
  • history:detail "event.getDetail()" (if available)
  • history:transactionID "event.getTransactionID()" (if available)
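
The property list for an Action can be sketched as a function that returns plain (subject, predicate, object) tuples. This is only an illustration under stated assumptions: the urn:uuid: form of the action URI, the function name, and its parameters are all hypothetical, and the optional history:usesTool, history:detail, and history:transactionID properties are left out for brevity.

```python
import uuid
from datetime import datetime, timezone


def action_statements(action_type, subject_uri, archive_uri,
                      relation="abc:hasPatient", eperson_uri=None,
                      when=None):
    """Return (subject, predicate, object) tuples for one Action resource."""
    if relation not in ("abc:creates", "abc:destroys", "abc:hasPatient"):
        raise ValueError(f"unexpected relation: {relation}")
    # Assumption: a urn:uuid: URI stands in for the "uniquely-generated
    # (but otherwise meaningless)" action URI.
    action = f"urn:uuid:{uuid.uuid4()}"
    when = when or datetime.now(timezone.utc)
    statements = [
        (action, "rdf:type", "abc:Action"),
        (action, "rdf:type", f"history:{action_type}"),
        (action, relation, subject_uri),
        (action, "abc:atTime", when.isoformat()),  # ISO 8601 timestamp
        (action, "history:inArchive", archive_uri),
    ]
    if eperson_uri is not None:  # optional, "if available"
        statements.append((action, "abc:hasParticipant", eperson_uri))
    return statements
```

Every call mints a fresh action URI, so two events of the same type on the same subject remain distinguishable.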

DSpace Object resources

The DSpace Object which is the "subject" of the event always has properties:

  • rdf:type abc:Actuality
  • rdf:type dso:ObjectType
    – ObjectType is a placeholder for its DSpace Object type, one of these URIs, which correspond to the DSpace Object type names:
    • dso:Community
    • dso:Collection
    • dso:Item
    • dso:Bundle
    • dso:Bitstream
    • dso:EPerson

The DSpace Object which is the "object" of the event has properties:

  • rdf:type abc:Actuality
  • rdf:type dso:ObjectType
    – ObjectType is a placeholder for its DSpace Object type.

Optionally, both Object and Subject of the event may have these
properties if the data is available:

  • dc:title literal-title
  • abc:instanceOf Handle-URI

The EPerson Object which is the "participant" of the event has properties:
(Note that the email address of the EPerson is encoded in the URI)

  • rdf:type dso:EPerson
    – its DSpace Object type.

Binding between Handle and Object URI

Every DSpace Object with a Handle must record the mapping of that
Handle to an object URI (which might be identified only by the
database identifier in events before the Handle is assigned). This
is done with a statement on the object's DB-based URI relating it to
the Handle URI. For the Handle 1721.1/31429, for example:

<info:dspace/dbid#item_814> abc:instanceOf <info:dspace/handle#1721.1/31429>
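
The binding statement can be produced and indexed with a couple of small helpers. This is a hedged sketch: both function names are hypothetical, and the info:dspace/handle# prefix is simply copied from the example statement on this page.

```python
def handle_binding(object_uri, handle):
    """Render the abc:instanceOf statement tying an object URI to its Handle."""
    return f"<{object_uri}> abc:instanceOf <info:dspace/handle#{handle}>"


def index_bindings(bindings):
    """Map each Handle URI to its object URI, given (object_uri, handle) pairs."""
    return {f"info:dspace/handle#{handle}": uri for uri, handle in bindings}
```

An index like this is one way to find the events recorded under a database-ID URI when only the Handle is known.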

<hr>

(OUTDATED) History Data Model

After examining the goals, the existing system, and similar designs such
as the PREMIS Event model, I've outlined this data model for
representing DSpace object history.

It is based on the current version (3) of
ABC Harmony as an RDF ontology.
The ABC Harmony model is reasonably close to what
I had in mind, and using an existing
ontology lets us leverage its documentation and examples.

The first-class objects are:

  • Item: corresponds to the DSpace Item identified by a certain persistent identifier (i.e. Handle).
  • Event: describes a change to another object.
  • Archive: identifies a single instance of a DSpace installation.
  • EPerson: names an individual agent responsible for an event.

There are also two "second-class" objects. Since these describe aspects
of Items that can change over time, they only appear in the context of
an Event that creates or modifies an Item.

  • Bitstream: describes the aspects of a DSpace Bitstream object relevant to provenance.
  • Metadata: a metadata item attached to an Item in the DSpace object model.

Item

The Item is a holder for the DSpace Item's only immutable property,
its persistent identifier (i.e. Handle). An Item is identified by
its Handle so it has the same identifier on every DSpace archive
where it is present.
It has the RDF type value dspace:Item, which is a subclass of
abc:Actuality.

An Item's URI is its persistent identifier in URN syntax, e.g.
hdl:123456789/123.

Archive

An Archive identifies the DSpace instance in which an Event occurs.
It is identified by the World Wide Web URL of its top-level page, e.g.
http://dspace.mit.edu/.
It has the RDF type value dspace:Archive, which is a subclass of
abc:Actuality.

It also may contain the following Dublin Core properties:

  • dc:title - Descriptive name of the archive, e.g. "Miskatonic University Digital Archive"

NOTE: This is just my naive interpretation of DC.
Metadata mavens and Dublin Core critics are invited to correct my
usage of these DC elements.

EPerson

An EPerson identifies, as precisely as possible, an agent authenticated
to the DSpace archive who was responsible for initiating an Event.
Its information may be of limited value for provenance,
since, in the DSpace architecture, an EPerson is defined only within
the context of an archive and is not given a persistent identifier.

The EPerson is identified by a combination of its home archive and
a cryptographic message digest of its attributes that are unique within
that archive (i.e. the email address). Its RDF type is dspace:EPerson,
which is a subclass of abc:Actuality.
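
One possible reading of "home archive plus message digest" is sketched below. The page does not name a digest algorithm, a normalization rule, or a URI layout, so SHA-256, lowercasing, and the eperson/ path segment are all assumptions made purely for illustration.

```python
import hashlib


def eperson_uri(archive_url, email):
    """Combine the archive URL with a digest of the EPerson's email address."""
    # Assumptions: SHA-256 as the digest, lowercasing as normalization, and
    # an "eperson/<digest>" path layout. The page specifies none of these.
    digest = hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()
    return f"{archive_url.rstrip('/')}/eperson/{digest}"
```

The same address registered at two archives yields two different URIs, which matches the point that an EPerson is only defined within the context of one archive.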

It also may contain the following Dublin Core properties:

  • dc:title - Personal name in canonical format, e.g. "Jack Florey".
  • dc:identifier.uri - "mailto" URI containing email address, e.g. "mailto:florey@dspace.org"

NOTE: This is just my naive interpretation of DC.
Metadata mavens and Dublin Core critics are invited to correct my
usage of these DC elements.

Bitstream

Represents a bitstream added to an Item. It includes information that
may be relevant to the item's provenance (to cross-check the contents
of the archive against history later, for example).

A Bitstream only appears as the subject of statements belonging to a
"Situation" that is the result of an Event which created or modified an Item.

  • rdf:type is dspace:Bitstream (a subclass of abc:Actuality).
  • dc:identifier.uri - The SequenceID of the bitstream, e.g. "#1"
  • dc:title - the name attribute of the bitstream, e.g. "thesis.pdf"
  • dc:format - short name of BitstreamFormat, e.g. "Adobe PDF"
  • dc:format.extent - size in bytes of the contents, e.g. "314592"
  • dc:type - type or purpose; i.e. name of the bundle containing bitstream, such as "ORIGINAL".
  • dspace:checksum-algorithm - name of checksum algorithm, e.g. "MD5"
  • dspace:checksum - value of checksum of bitstream contents, e.g. "6df9d97f2e8f9.."
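
To make the property list concrete, here is a small sketch that assembles those properties for one bitstream from its raw bytes. The function and parameter names are hypothetical; only the property names and the MD5 example come from the list.

```python
import hashlib


def describe_bitstream(sequence_id, name, fmt, bundle, content):
    """Collect the provenance properties listed for one bitstream."""
    return {
        "rdf:type": "dspace:Bitstream",
        "dc:identifier.uri": f"#{sequence_id}",
        "dc:title": name,
        "dc:format": fmt,
        "dc:format.extent": str(len(content)),  # size in bytes
        "dc:type": bundle,                      # name of the containing bundle
        "dspace:checksum-algorithm": "MD5",
        "dspace:checksum": hashlib.md5(content).hexdigest(),
    }
```

Recomputing the checksum over the stored contents later and comparing it with the recorded value is the kind of cross-check against history that this section mentions.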

Metadata

This represents one metadata value belonging to an Item.
When an item has multiple values for the same metadata element/qualifier,
they appear as separate nodes in the RDF model, not as multiple values within one node.

Metadata only appears as the subject of statements belonging to a
"Situation" that is the result of an Event which created or modified an Item.

It may have the following properties:

  • rdf:type is dspace:Metadata (a subclass of abc:Actuality).
  • dspace:mdSchema - the metadata schema (default is "dc").
  • dspace:element - name of the Dublin Core-styled field identifier.
  • dspace:qualifier - qualifier of the Dublin Core-styled field identifier.
  • xml:lang - language code, e.g. "en".
  • dspace:value - value of the metadata field.

Since DSpace metadata is traditionally in Qualified Dublin Core fields,
there is a shorthand for listing these. The Metadata value only needs
the appropriate DC or QDC property, whose value is the metadata value,
and the optional xml:lang property, e.g. (in N3 format):

...
abc:contains [ rdf:type dspace:Metadata ;
               dc:title "The Little Prince" ] ;
abc:contains [ rdf:type dspace:Metadata ;
               xml:lang "fr" ;
               dc:title "Le Petit Prince" ] ;
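
The shorthand can be generated mechanically, one blank node per metadata value. The helper below is a hypothetical sketch that mirrors the fragment above as a plain string; it does not use an RDF library and is not checked against an N3 parser.

```python
def metadata_n3(field, value, lang=None):
    """Render one metadata value as an abc:contains blank node."""
    # Escape backslashes and double quotes so the literal stays well-formed.
    literal = value.replace("\\", "\\\\").replace('"', '\\"')
    lines = ["abc:contains [ rdf:type dspace:Metadata ;"]
    if lang:
        lines.append(f'    xml:lang "{lang}" ;')
    lines.append(f'    {field} "{literal}" ] ;')
    return "\n".join(lines)
```

Calling it once per value keeps multiple titles as separate nodes, as the Metadata section requires.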

Event

The purpose of the history system is to record "events", so the Event is
its central data structure. The history record is simply a collection
of Events. Each Event represents a change to an Item – although each
transaction on the DSpace server may result in more than one Event being
recorded.

Each event has the following properties:

  • The subject URI consists of the archive's URI followed by a locally-unique identifier for the Event.
  • rdf:type is abc:Event
  • abc:hasParticipant - value is the EPerson object who was the authenticated user responsible for this change.
  • abc:creates | abc:hasResult | abc:destroys - identifies the type of action; the value is the URI of the affected Item.
  • abc:atTime - Timestamp at which the event was logged, e.g. "Tue Jan 24 17:46:49 EST 2006"
  • abc:precedes - Optional; this refers to the abc:Situation that results from this action. It is a blank node that describes the details of what was changed.
  • abc:involves - Value is the URI of the DSpace archive in which this event occurred.
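
As a sketch of the first property, an Event's subject URI could be minted from the archive URI and a local sequence number. The class name, the history/event/ path, and the use of a simple in-memory counter are assumptions; a real implementation would need an identifier that survives restarts.

```python
import itertools


class EventUriFactory:
    """Mint event URIs: the archive URI followed by a locally unique ID."""

    def __init__(self, archive_uri):
        self.archive_uri = archive_uri.rstrip("/")
        self._counter = itertools.count(1)  # assumption: a simple sequence

    def next_uri(self):
        # Assumption: "history/event/<n>" is a made-up path layout.
        return f"{self.archive_uri}/history/event/{next(self._counter)}"
```

Prefixing with the archive URI keeps event identifiers from different DSpace installations from colliding when history records travel with an Item.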

Situation

In the ABC Harmony model, a Situation describes the "existential"
(i.e. time-varying) aspects of an Actuality at a certain point in time.
The alterations to the state of an Item after an Event make up a
Situation that the Event precedes. (They are connected by the
abc:precedes property, which says the Event precedes the Situation.)

  • rdf:type is abc:Situation
  • abc:contains - value is Bitstream or Metadata added to the Item at this time.
  • abc:removes - value is Bitstream or Metadata deleted from the Item at this time.

NOTE: There is actually no removes property in the ABC Harmony ontology,
but there is nothing else equivalent, so we are taking the liberty of
adding it.
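
One way to derive a Situation is to compare an Item's state before and after the Event. The sketch below does this for metadata represented as (field, value) pairs; the function name and the data representation are assumptions, and abc:removes is this page's own extension rather than an ABC Harmony property.

```python
def situation_changes(before, after):
    """Compare two collections of (field, value) pairs for one Item."""
    before, after = set(before), set(after)
    return {
        "abc:contains": sorted(after - before),  # values added by the Event
        "abc:removes": sorted(before - after),   # values deleted by the Event
    }
```

A changed value shows up as one removal plus one addition, which keeps the record limited to what actually differed.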

</html>
