Archived / Obsolete Documentation

Documentation in this space is no longer accurate.

Curation System for DSpace 1.7

This document is a high-level, but developer-focused, introduction to the curation system being proposed for DSpace 1.7. It presumes knowledge of Java and DSpace internals.

Tasks

The goal of the curation system ('CS') is to provide a simple, extensible way to manage
routine content operations on a repository. These operations are known to CS as 'tasks', and they
can operate on any DSpaceObject (i.e. any subclass of DSpaceObject), although
the first incarnation will only understand Communities, Collections, and Items, viz. the core
data model objects. A task may be written for only one type of DSpace object, typically
an item, and in this case it may simply ignore other types (tasks have the ability to
'skip' objects for any reason). The DSpace core distribution ought to provide a number of useful
tasks, but the system is designed to encourage local extension: tasks can be written
for any purpose and placed in any Java package. What sorts of things are appropriate tasks?
Some examples:

  • apply a virus scan to item bitstreams (this will be our example below)
  • profile a collection based on format types - good for identifying format migrations
  • ensure a given set of metadata fields is present in every item, or even that the fields have particular values
  • call a network service to enhance/replace/normalize an item's metadata or content
  • ensure all item bitstreams are readable and their checksums agree with the ingest values

A task can be arbitrary code, but the class implementing it must have 2 properties:

  1. it must provide a no-arg constructor, so it can be loaded by the PluginManager

Thus, all tasks are 'named' plugins, meaning that each must be configured in dspace.cfg as:

plugin.named.org.dspace.curate.CurationTask = \
org.dspace.curate.ProfileFormats = format-profile \
org.dspace.curate.RequiredMetadata = req-metadata \
org.dspace.ctask.replicate.Audit = audit \
org.dspace.ctask.replicate.Estimate = estimate \
org.dspace.ctask.replicate.Generate = generate \
org.dspace.ctask.integrity.Checksum = checksum \
org.dspace.ctask.integrity.ClamScan = vscan

The 'plugin name' (audit, estimate, etc.) is called the task name, and is used instead of the qualified class name
wherever a task must be identified (on the command line, etc.); the CS always dereferences it to the class.
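
The dereferencing step can be pictured with a small, self-contained sketch. This is not the DSpace PluginManager: `TaskNameRegistry` and its methods are hypothetical names invented for illustration, and `java.lang.StringBuilder` stands in for a task class so the code runs without DSpace on the classpath. It only shows why the no-arg constructor matters: the class is known by name alone and is instantiated reflectively.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for task-name lookup; NOT the real DSpace PluginManager.
class TaskNameRegistry {
    // task name -> fully qualified class name, as it might be read from dspace.cfg
    private final Map<String, String> classByName = new LinkedHashMap<>();

    void register(String className, String taskName) {
        classByName.put(taskName, className);
    }

    // Dereference a task name to a fresh instance using the class's no-arg constructor.
    Object resolve(String taskName) {
        String className = classByName.get(taskName);
        if (className == null) {
            throw new IllegalArgumentException("Unknown task name: " + taskName);
        }
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // Covers a missing class, a missing no-arg constructor, or a failing constructor.
            throw new IllegalStateException("Cannot load task " + taskName, e);
        }
    }
}

class TaskNameDemo {
    public static void main(String[] args) {
        TaskNameRegistry registry = new TaskNameRegistry();
        // StringBuilder is only a placeholder for a real task class.
        registry.register("java.lang.StringBuilder", "vscan");
        Object task = registry.resolve("vscan");
        System.out.println(task.getClass().getName()); // prints java.lang.StringBuilder
    }
}
```

A class without a no-arg constructor would fail inside `resolve`, which is the practical reason for the first property.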

  2. it must implement 'org.dspace.curate.CurationTask'

The CurationTask interface is almost a 'tagging' interface, and only requires a few very high-level methods be implemented. The most significant is:

int perform(DSpaceObject dso);

The return value should be a code describing one of 4 conditions:

  • 0 : SUCCESS - the task completed successfully
  • 1 : FAIL - the task failed (it is up to the task to decide what 'counts' as failure - an example might be that the virus scan finds an infected file)
  • 2 : SKIPPED - the task could not be performed on the object, perhaps because it was not applicable
  • -1 : ERROR - the task could not be completed due to an error
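
As an illustration of the four codes, here is a minimal sketch of a toy task. Everything in it is a stand-in so that it compiles on its own: the `DSpaceObject`, `Item`, `Community` and `CurationTask` types below are simplified substitutes for the real DSpace classes, the constant names are assumed from the labels above, and the "scan" is a string match rather than a real virus scanner.

```java
// Simplified stand-ins for DSpace types; the real classes are not reproduced here.
class DSpaceObject {}

class Item extends DSpaceObject {
    final String content; // stands in for bitstream data; null simulates an unreadable bitstream
    Item(String content) { this.content = content; }
}

class Community extends DSpaceObject {}

// Stand-in for org.dspace.curate.CurationTask, reduced to the one method discussed above.
interface CurationTask {
    int SUCCESS = 0;  // constant names are assumptions based on the labels in the list
    int FAIL = 1;
    int SKIPPED = 2;
    int ERROR = -1;

    int perform(DSpaceObject dso);
}

class ToyVirusScan implements CurationTask {
    public ToyVirusScan() {} // no-arg constructor, as required for plugin loading

    @Override
    public int perform(DSpaceObject dso) {
        if (!(dso instanceof Item)) {
            return SKIPPED; // this task only understands items
        }
        try {
            return looksInfected((Item) dso) ? FAIL : SUCCESS;
        } catch (IllegalStateException e) {
            return ERROR; // the scan itself could not be completed
        }
    }

    // Hypothetical check: a real task would hand the bitstreams to a scanner.
    private static boolean looksInfected(Item item) {
        if (item.content == null) {
            throw new IllegalStateException("bitstream unreadable");
        }
        return item.content.contains("EICAR");
    }

    public static void main(String[] args) {
        CurationTask task = new ToyVirusScan();
        System.out.println(task.perform(new Item("plain text")));  // 0
        System.out.println(task.perform(new Item("xx EICAR xx"))); // 1
        System.out.println(task.perform(new Community()));         // 2
        System.out.println(task.perform(new Item(null)));          // -1
    }
}
```

Note the distinction the sketch draws between FAIL and ERROR: an infected file is a result the task was designed to report, whereas an unreadable bitstream means no result could be produced at all.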

If a task extends the AbstractCurationTask class, that is the only method it needs to define.
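
The division of labour between the interface and the abstract base class might look like the sketch below. It is a pattern illustration only: `Dso`, `TaskContract`, `AbstractTaskSketch`, `init` and `getTaskName` are hypothetical names, and the real AbstractCurationTask may provide different housekeeping. The point is that the base class absorbs the other interface methods, leaving the subclass a single method to write.

```java
// Stand-in for a DSpace object; hypothetical and deliberately tiny.
class Dso {
    final String name;
    Dso(String name) { this.name = name; }
}

// Stand-in for the task interface: two assumed housekeeping methods plus perform().
interface TaskContract {
    void init(String taskName); // hypothetical
    String getTaskName();       // hypothetical
    int perform(Dso dso);       // the method discussed above
}

// Plays the role of AbstractCurationTask: supplies the housekeeping, leaves perform() abstract.
abstract class AbstractTaskSketch implements TaskContract {
    private String taskName;

    @Override
    public void init(String taskName) { this.taskName = taskName; }

    @Override
    public String getTaskName() { return taskName; }
}

// A concrete task only has to define perform().
class NonEmptyNameTask extends AbstractTaskSketch {
    @Override
    public int perform(Dso dso) {
        return dso.name.isEmpty() ? 1 : 0; // 1 = FAIL, 0 = SUCCESS
    }
}

class AbstractTaskDemo {
    public static void main(String[] args) {
        NonEmptyNameTask task = new NonEmptyNameTask();
        task.init("name-check");
        System.out.println(task.getTaskName() + ": " + task.perform(new Dso("Thesis"))); // name-check: 0
    }
}
```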
