Archived / Obsolete Documentation

Documentation in this space is no longer accurate.

New Features for the Curation System

Introduced in DSpace 1.7, and expanded in 1.8, the Curation System (CS) is still a comparatively new denizen in the DSpace ecosystem. As more tasks and 'suites' are produced, we are learning a lot about what additional functionality the framework could offer to support more powerful, flexible, and easily implemented tasks. This page is intended to be a place to collect these insights, as well as designs that address these needs. Many new features are already being developed, and we welcome participation in their evolution.

Object Selectors

In CS, the unit of curation is a DSpaceObject (which may be an Item, Collection, or Community). Thus the API offers these basic methods (on the Curator class):

public void curate(DSpaceObject object) throws IOException;

public void curate(Context c, String id) throws IOException;

A task may restrict its scope to a particular type or subset of objects (typically only items, not containers) by filtering the objects it is given in its business logic. Often, however, we wish to perform a task on a set of objects that does not correspond to any natural container, and filtering is then of no help. For example, we may wish to perform a task on all recently installed items, whatever their collection. We can of course write custom code that pulls the necessary items and feeds them one by one to a curator, but such code is not very portable or reusable: we could not, for example, easily use the same code in both a command-line context and a UI context, as we have come to expect with CS.

This is the primary motivation for a new feature of the curation API known as object selectors. 'ObjectSelector' is a new interface (which essentially just exposes a DSpaceObject Iterator) that is directly supported by the curation API:

public void curate(ObjectSelector selector) throws IOException;

public void queue(ObjectSelector selector, String queueId) throws IOException;
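
To make the idea concrete, here is a minimal, self-contained sketch of a selector-driven curation loop. It is not the DSpace code: DsoStub, SelectorSketch, ListSelector, and CuratorSketch are hypothetical stand-ins, and the assumption that the interface simply extends Iterator is taken from the description above rather than from the actual API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Stand-in for DSpaceObject so the sketch compiles on its own (hypothetical).
class DsoStub {
    final String handle;
    DsoStub(String handle) { this.handle = handle; }
}

// Assumed shape of the interface: nothing more than an iterator of objects.
interface SelectorSketch extends Iterator<DsoStub> { }

// A trivial selector that delivers a fixed list of objects.
class ListSelector implements SelectorSketch {
    private final Iterator<DsoStub> delegate;
    ListSelector(List<DsoStub> objects) { this.delegate = objects.iterator(); }
    @Override public boolean hasNext() { return delegate.hasNext(); }
    @Override public DsoStub next() { return delegate.next(); }
}

// Stand-in for the Curator: the "task" here just notes each handle it sees.
class CuratorSketch {
    final List<String> curated = new ArrayList<>();

    void curate(DsoStub object) { curated.add(object.handle); }

    // Drains the selector, curating every object it delivers.
    void curate(SelectorSketch selector) {
        while (selector.hasNext()) {
            curate(selector.next());
        }
    }
}
```

Because the curator only sees an iterator, the same selector could in principle be handed to a command-line tool or a UI without either knowing how the objects were chosen.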

The curator will perform the configured tasks on all the DSpaceObjects delivered by the selector, and the selector can deliver any set of objects it wishes. Because ObjectSelector is an interface, CS users may write and deploy their own custom selector implementations, but we propose to offer a few general-purpose implementations that will be bundled with the curation system. Currently these are:

SearchSelector

This selector invokes the DSpace native (Lucene) search APIs to obtain sets of objects. In this way, one can easily perform curation tasks on any set of search results. For ease of reuse, the search query string can be stored in a configuration file, and each such configuration can be given a different name. This technique, known as 'named selectors', allows for easy integration in other CS tools. For example (in the command-line tool via the DSpace launcher):

[dspace]/bin/dspace curate -o nanotechnology -t textextract

The argument to the '-o' (object selector) option is the name of a selector, which we can imagine is a search for all the items whose title contains 'nanotechnology'.

It should be noted that SearchSelector can also be used for 'non-canned' searches: we could expose a search box in a web page, have the user type in a search string and configure a search selector to use this 'live' query.
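
The 'named selector' idea amounts to a lookup from a selector name to a stored query string. The sketch below illustrates only that lookup. The class NamedSelectorConfig, the 'name = query' property layout, and the sample query text are all assumptions made for illustration; the page does not specify the real configuration format.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical registry of named selectors: selector name -> stored search query.
class NamedSelectorConfig {
    private final Properties queries = new Properties();

    // Parses "name = query" lines; this layout is an assumption, not the DSpace format.
    NamedSelectorConfig(String configText) {
        try {
            queries.load(new StringReader(configText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the stored query for a selector name, failing loudly if it is unknown.
    String queryFor(String selectorName) {
        String query = queries.getProperty(selectorName);
        if (query == null) {
            throw new IllegalArgumentException("No selector named: " + selectorName);
        }
        return query;
    }
}
```

With such a registry, a tool given '-o nanotechnology' would resolve the name to its query and hand that query to the search selector, while a 'live' query from a search box would skip the lookup entirely.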

QuerySelector

This selector queries the database to obtain its objects. In essence, the selector transforms a very simplified user-supplied query string into the SQL necessary to perform the database query. An example can illustrate:

in_archive = '1' AND last_modified > ${today - 7} AND dc.contributor.author = 'Jones'

This query would retrieve all items authored by Jones that were installed within the last week. The actual SQL is more complex, since joins with the metadata tables are required. For the curious, the syntax of the query language is given below (in Extended Backus-Naur Form):

(* Query syntax EBNF *)
  query = expr , { "AND" , expr } ;
  expr = ( field name | metadata name ) , oper , value ;
  field name = characters , { "_" , characters } ;
  metadata name = characters , "." , characters , [ "." , characters ] ;
  oper = "=" | "<>" | ">" | "<" | ">=" | "<=" | "BETWEEN" | "LIKE" | "IN" ;
  value = literal | variable ;
  literal = "'" , characters , { whitespace , characters } , "'" ;
  variable = "${" , varname , [ ( "+" | "-" ) , number ] , "}" ;
  varname = "today" | handle ;
(* end syntax EBNF *)
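
As an illustration of the 'variable' production, the sketch below substitutes ${today}, ${today + N}, and ${today - N} with quoted dates before a query would be translated to SQL. Several points are assumptions rather than documented behaviour: that the number counts days (suggested by the "last week" example), that the result is a quoted ISO date, and that optional spaces are allowed inside the braces. The class and method names are hypothetical, and handle variables are left untouched.

```java
import java.time.LocalDate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that expands date variables in a query string.
class QueryVariables {
    // Matches ${today}, ${today + N} or ${today - N}, with optional spaces.
    private static final Pattern TODAY =
        Pattern.compile("\\$\\{\\s*today\\s*(?:([+-])\\s*(\\d+)\\s*)?\\}");

    // Replaces each date variable with a quoted ISO date relative to 'today'.
    static String resolve(String query, LocalDate today) {
        Matcher m = TODAY.matcher(query);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            LocalDate date = today;
            if (m.group(1) != null) {
                long days = Long.parseLong(m.group(2)); // assumed unit: days
                date = m.group(1).equals("+") ? today.plusDays(days) : today.minusDays(days);
            }
            m.appendReplacement(out, Matcher.quoteReplacement("'" + date + "'"));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Passing the current date in as a parameter, rather than reading the clock inside the method, keeps the substitution easy to test.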

Task Recording

Most routine task executions have no lasting or special significance, but some may merit keeping track of. For example, a scan of the Library of Congress page http://id.loc.gov/vocabulary/preservationEvents.html reveals that many preservation events of significance map to currently offered curation tasks. A facility for tracking important tasks may therefore be desirable. CS does emit ordinary DSpace logging messages, but these are interleaved with all other application logging data, so are not suitable for this sort of historical record.

Instead we propose a very simple, but flexible way to monitor task execution, based on a new annotation type for tasks:

@Record
public class ImportantTask extends AbstractCurationTask
...

The presence of this annotation signifies to the CS that when a task of this type is performed, the outcome should be recorded somewhere, provided that recording has been activated in the DSpace curation setup. There will be no error (or run-time penalty) if recording has not been activated. The 'outcome' here means the following data elements:

  • time stamp of task performance
  • id (handle) of object
  • EPerson name invoking task (if specified)
  • logical task name
  • task status code
  • task result string (if set)
  • task 'type' (explained below)
  • task 'value' (also explained below)

'Recorded' here means only that a class implementing the 'Recorder' interface has been configured. What constitutes 'recording' is up to the implementation, but could include:

  • logging to a local file
  • writing to a database
  • posting to a message queue
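
A rough sketch of what such a hook might look like follows. The page names a 'Recorder' interface but does not give its methods, so the single record(...) method, its parameter order, and both type names here are assumptions; the parameters simply mirror the eight data elements listed above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shape of the recording hook; the real method signature is not given.
interface RecorderSketch {
    void record(long timestamp, String objectId, String epersonName,
                String taskName, int statusCode, String result,
                String type, String value);
}

// Keeps each outcome in memory as one tab-separated line. A file-based
// recorder could append the same line to a journal instead.
class MemoryRecorder implements RecorderSketch {
    final List<String> lines = new ArrayList<>();

    @Override
    public void record(long timestamp, String objectId, String epersonName,
                       String taskName, int statusCode, String result,
                       String type, String value) {
        lines.add(String.join("\t",
            Long.toString(timestamp), objectId, orEmpty(epersonName), taskName,
            Integer.toString(statusCode), orEmpty(result), orEmpty(type), orEmpty(value)));
    }

    // Optional elements (EPerson, result, type, value) may be absent.
    private static String orEmpty(String s) { return s == null ? "" : s; }
}
```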

The point is to have a 'hook' into the curation runtime where these records can be captured. The release will likely include a simple local file log/journal recorder as a basic starter implementation. A number of default values may be overridden if needed:

@Record(statusCodes={1, 2})

@Record(type="PREMIS",
        value="Replication")

By default, the recording logic will be invoked regardless of the statusCode returned by the task, but in the first case above we limit it to errors and skips. Type and value are very useful when a task's work can be expressed in a controlled vocabulary: it would be easy to generate records as, for example, RDF statements with the id as subject and type and value referring to ontology-defined terms. A given task may have multiple such descriptions (in different domains, for example). Since simple annotations cannot be repeated, we must use a 'container' annotation:

@Records({
   @Record(type="PREMIS", value="Replication"),
   @Record(type="LOC", value="duplication")
})
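
For readers unfamiliar with annotation processing at run time, the sketch below shows one way these annotations could be declared and read reflectively. It is an illustration under stated assumptions, not the CS implementation: the element defaults, the convention that an empty statusCodes array means "record every status", and the helper names recordsOf and shouldRecord are all hypothetical. The annotations are nested in an outer class only to keep the example self-contained.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

class RecordAnnotationSketch {

    // RUNTIME retention is what lets a curation runtime see the annotation.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Record {
        int[] statusCodes() default {};   // assumed: empty means every status
        String type() default "";
        String value() default "";
    }

    // Container annotation holding several descriptions of one task.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Records {
        Record[] value();
    }

    @Records({
        @Record(type = "PREMIS", value = "Replication"),
        @Record(type = "LOC", value = "duplication")
    })
    static class ReplicateTask { }

    @Record(statusCodes = {1, 2})
    static class CheckTask { }

    static class PlainTask { }

    // Collects the descriptions on a task class, single or in a container.
    static Record[] recordsOf(Class<?> taskClass) {
        Records container = taskClass.getAnnotation(Records.class);
        if (container != null) {
            return container.value();
        }
        Record single = taskClass.getAnnotation(Record.class);
        return single == null ? new Record[0] : new Record[] { single };
    }

    // Decides whether an outcome with this status code should be recorded.
    static boolean shouldRecord(Record record, int statusCode) {
        int[] codes = record.statusCodes();
        if (codes.length == 0) {
            return true;
        }
        for (int code : codes) {
            if (code == statusCode) {
                return true;
            }
        }
        return false;
    }
}
```

A task class with neither annotation yields an empty array, which matches the stated goal that unannotated tasks incur no recording work.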