Curation System for DSpace 1.7
This document is a high-level - but developer-focused - introduction to the curation system being proposed for DSpace 1.7. It presumes knowledge of Java and DSpace internals.
Tasks
The goal of the curation system ('CS') is to provide a simple, extensible way to manage
routine content operations on a repository. These operations are known to CS as 'tasks', and they
can operate on any DSpaceObject (i.e. any subclass of DSpaceObject) - although
the first incarnation will only understand Communities, Collections, and Items - viz. core
data model objects. A task may in practice work on only one type of DSpace object - typically
an item - and in this case it may simply ignore other data types (tasks have the ability to
'skip' objects for any reason). The DSpace core distribution ought to provide a number of useful
tasks, but the system is designed to encourage local extension - tasks can be written
for any purpose, and placed in any Java package. What sorts of things are appropriate tasks?
Some examples:
- apply a virus scan to item bitstreams (this will be our example below)
- profile a collection based on format types - good for identifying format migrations
- ensure a given set of metadata fields is present in every item, or even that they have particular values
- call a network service to enhance/replace/normalize an item's metadata or content
- ensure all item bitstreams are readable and their checksums agree with the ingest values
A task can be arbitrary code, but the class implementing it must have two properties:
- it must provide a no-arg constructor, so it can be loaded by the PluginManager
Thus, all tasks are 'named' plugins, meaning that each must be configured in dspace.cfg as:
plugin.named.org.dspace.curate.CurationTask = \
org.dspace.curate.ProfileFormats = format-profile \
org.dspace.curate.RequiredMetadata = req-metadata \
org.dspace.ctask.replicate.Audit = audit \
org.dspace.ctask.replicate.Estimate = estimate \
org.dspace.ctask.replicate.Generate = generate \
org.dspace.ctask.integrity.Checksum = checksum \
org.dspace.ctask.integrity.ClamScan = vscan
The 'plugin name' (audit, estimate, etc.) is called the task name, and is used instead of the qualified class name
wherever it is needed (on the command line, etc.) - the CS always dereferences it.
- it must implement the 'org.dspace.curate.CurationTask' interface
The CurationTask interface is almost a 'tagging' interface, and only requires a few very high-level methods be implemented. The most significant is:
int perform(DSpaceObject dso);
The return value should be a code describing one of 4 conditions:
-  0 : SUCCESS - the task completed successfully
-  1 : FAIL - the task failed (it is up to the task to decide what 'counts' as failure - an example might be that the virus scan finds an infected file)
-  2 : SKIPPED - the task could not be performed on the object, perhaps because it was not applicable
- -1 : ERROR - the task could not be completed due to an error
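Code that calls perform may find it easier to work with named values than with raw integers. The following is a minimal sketch of one way to do that; the TaskStatus enum and its fromCode method are hypothetical names made up for this illustration and are not part of the proposed CS API. Only the four code values come from the list above.

```java
/**
 * Hypothetical wrapper for the four status codes listed above.
 * The enum and method names are illustrative only, not DSpace API.
 */
enum TaskStatus {
    SUCCESS(0), FAIL(1), SKIPPED(2), ERROR(-1);

    private final int code;

    TaskStatus(int code) {
        this.code = code;
    }

    /** Returns the raw integer a task would return from perform(). */
    int code() {
        return code;
    }

    /** Maps a raw perform() return value to a status; unknown values are rejected. */
    static TaskStatus fromCode(int code) {
        for (TaskStatus s : values()) {
            if (s.code == code) {
                return s;
            }
        }
        throw new IllegalArgumentException("Unknown status code: " + code);
    }
}
```

Rejecting unknown values, rather than mapping them to ERROR, is a design choice of this sketch: it keeps a misbehaving task from being silently reported as an ordinary error.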
A task that extends the AbstractCurationTask class needs to define only the perform method.
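To make the two requirements concrete, here is a rough, self-contained sketch of a task in the spirit of the 'required metadata' example. It does not compile against DSpace: DSpaceObject, Item, Collection and CurationTask below are simplified stand-ins declared locally, and RequiredTitle, the "req-title" task name, the TASKS table and the load helper are all hypothetical. The load helper only imitates the idea of dereferencing a task name to a class with a no-arg constructor; it is not the PluginManager.

```java
import java.util.HashMap;
import java.util.Map;

final class TaskSketch {
    // Status codes as listed in the text. ERROR is unused here; a real task
    // would return it when, for example, an I/O problem prevents the check.
    static final int SUCCESS = 0, FAIL = 1, SKIPPED = 2, ERROR = -1;

    // Simplified stand-ins for the DSpace types (assumptions, not the real classes).
    abstract static class DSpaceObject {}
    static final class Collection extends DSpaceObject {}
    static final class Item extends DSpaceObject {
        final Map<String, String> metadata = new HashMap<>();
    }

    // Stand-in for org.dspace.curate.CurationTask, reduced to the one method shown above.
    interface CurationTask {
        int perform(DSpaceObject dso);
    }

    // Hypothetical task: fails any item that has no non-blank dc.title value.
    static final class RequiredTitle implements CurationTask {
        public RequiredTitle() {}  // property 1: no-arg constructor

        @Override
        public int perform(DSpaceObject dso) {  // property 2: implements the interface
            if (!(dso instanceof Item)) {
                return SKIPPED;  // not applicable to communities or collections
            }
            String title = ((Item) dso).metadata.get("dc.title");
            return (title != null && !title.trim().isEmpty()) ? SUCCESS : FAIL;
        }
    }

    // Hypothetical task-name table, playing the role of the dspace.cfg entries.
    static final Map<String, Class<? extends CurationTask>> TASKS = new HashMap<>();
    static {
        TASKS.put("req-title", RequiredTitle.class);
    }

    // Resolves a task name to a fresh instance via the no-arg constructor.
    static CurationTask load(String taskName) throws ReflectiveOperationException {
        Class<? extends CurationTask> cls = TASKS.get(taskName);
        if (cls == null) {
            throw new IllegalArgumentException("No such task: " + taskName);
        }
        return cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        CurationTask task = load("req-title");
        Item item = new Item();
        System.out.println(task.perform(item));              // prints 1 (FAIL: no title)
        item.metadata.put("dc.title", "A title");
        System.out.println(task.perform(item));              // prints 0 (SUCCESS)
        System.out.println(task.perform(new Collection()));  // prints 2 (SKIPPED)
    }
}
```

The reflective call in load is the reason the no-arg constructor matters: a loader that knows only a class name has no arguments to pass.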