...

Introduced in DSpace 1.7, and expanded in 1.8, the Curation System (CS) is still a comparatively new denizen in the DSpace ecosystem. As more tasks and 'suites' are produced, we are learning a lot about what additional functionality the framework could offer to support more powerful, flexible, and easily implemented tasks. This page is intended to be a place to collect these insights, as well as designs that address these needs. Many new features are already being developed, and we welcome participation in their evolution.

Queue Filtering

CS supports asynchronous operation by allowing curation requests to be written to a persistent queue for later processing. CS simply empties the queue on demand and processes each request. While formally correct - in the sense that every queued request is processed - various optimizations and efficiency gains may be possible through more active management of the queue. To take one simple case: suppose an expensive operation that needs to be performed only once appears twice or more on a queue. Could we not 'weed' the queue of such duplicates and still achieve the desired result? To support such 'intelligent' queue management, one must recognize that not all strategies will work for all types of queues in all circumstances. Any solution must therefore be extensible (in the logic that manages the queue) and optional (so that it can be applied, or not, to any given queue).

Therefore we propose a new interface, with a single method:

Code Block
import java.util.Iterator;
import java.util.Set;

public interface TaskQueueFilter {
   Iterator<TaskQueueEntry> filter(Set<TaskQueueEntry> entries);
}

The filter() method is designed to accept a set of TaskQueueEntries - which is what the TaskQueue 'dequeue()' method returns - and return a (possibly) modified set retrievable through an iterator. The iterator is important (as opposed to just a new set), since it allows (but does not require) the filter to impose an order on the entries. Filters will be applied when 'CurationCli' is invoked (we can add a new, optional, '-f filter' command-line switch) on a particular queue, so flexibility is secured by the ability to set different (or no) filters on different queues. It may be possible to 'chain' filters, but these use-cases would need further definition.
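As an illustration of the 'weeding' case, here is a minimal sketch of a filter that drops duplicate requests and imposes an order through the returned iterator. It is only a sketch under stated assumptions: the TaskQueueEntry class below is a hypothetical stand-in reduced to three fields (the real entry type may differ), and DuplicateFilter is an invented name, not part of CS.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the real queue entry type, reduced to three fields.
final class TaskQueueEntry {
    final String objectId;
    final String taskName;
    final long submitTime;

    TaskQueueEntry(String objectId, String taskName, long submitTime) {
        this.objectId = objectId;
        this.taskName = taskName;
        this.submitTime = submitTime;
    }
}

// The proposed interface, repeated so that the sketch compiles on its own.
interface TaskQueueFilter {
    Iterator<TaskQueueEntry> filter(Set<TaskQueueEntry> entries);
}

// Keeps the earliest request for each (object, task) pair and returns
// the survivors in submission order.
final class DuplicateFilter implements TaskQueueFilter {
    @Override
    public Iterator<TaskQueueEntry> filter(Set<TaskQueueEntry> entries) {
        List<TaskQueueEntry> ordered = new ArrayList<>(entries);
        ordered.sort(Comparator.comparingLong((TaskQueueEntry e) -> e.submitTime));
        Set<String> seen = new HashSet<>();
        List<TaskQueueEntry> kept = new ArrayList<>();
        for (TaskQueueEntry entry : ordered) {
            // Set.add returns false when the key has been seen before
            if (seen.add(entry.objectId + "|" + entry.taskName)) {
                kept.add(entry);
            }
        }
        return kept.iterator();
    }
}
```

Returning an iterator over a list, rather than a new set, is what lets this filter guarantee oldest-first processing.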

Programs

A common need is to coordinate the activities of multiple tasks against particular object sets: we may wish to ensure one task is performed before another, or only conditionally performed, possibly based on the 'outcome' of another task. CS currently has no ability to specify or enforce these constraints: in fact, it explicitly disavows any such guarantee. In this situation:

Code Block
Curator curator = new Curator();
curator.addTask("task1");
curator.addTask("task2");
curator.curate(myDso);

the curator makes no promises that 'task1' will run before 'task2' - the order could in fact be reversed. Nor does a task have any way of 'discovering' whether another task has run, so coordination can't be managed in the task logic itself. There are sound reasons why simple ordering is not supported: there are too many 'contingencies' that it cannot cope with. For example, suppose that in the above case 'task1' encounters an error and never properly runs - then task2's assumptions would be mistaken.

A more full-featured and robust mechanism than simple ordering is needed: thus the proposal to add task 'programs'. A program is a set of instructions about how and whether to run sets of tasks. The CS will be responsible for 'compiling' and running these programs, and a 'program' will have the exact same semantics as an atomic task. Namely:

  • It will return a status code with the same value set as tasks
  • It will optionally return a 'result' string
  • It will have a locally-bound logical name
  • It will be possible to invoke a program wherever a task can be - in admin UI, workflow, batch, etc

What would a task program look like - i.e. what is the program syntax, etc.?  Here is a straw-man example:

Code Block
# Task Program Example
# MIT Libraries - January 2013
first-task
if not @SUCCESS
  report "problem out of the gate"
  return @ERROR:"first-task did not succeed"
end
second-task
if @FAIL
   cleanup-task
elif @ERROR
   report "error on second task"
elif @SKIP
   another-task
   if @SUCCESS
      return cleanup-task
   end
else
   cleanup-task
end
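To make the control flow of the straw-man concrete, here is a hand translation of it into Java. Everything in it is an assumption made for illustration: the ExampleProgram class, the map of logical task names to functions, the status values (chosen to mirror the Curator status constants), and in particular the choice to return the status of the last task run when the program reaches its end without an explicit 'return'. A real implementation would compile the program text rather than hard-code it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

final class ExampleProgram {
    // Assumed status values, mirroring the Curator constants
    static final int ERROR = -1, SUCCESS = 0, FAIL = 1, SKIP = 2;

    // Logical task name -> function from object id to status code (hypothetical)
    private final Map<String, ToIntFunction<String>> tasks;
    final List<String> reports = new ArrayList<>();
    String result; // optional 'result' string, as for an atomic task

    ExampleProgram(Map<String, ToIntFunction<String>> tasks) {
        this.tasks = tasks;
    }

    private int run(String taskName, String objectId) {
        return tasks.get(taskName).applyAsInt(objectId);
    }

    // Same contract as a task: takes an object, returns a status code.
    int perform(String objectId) {
        int status = run("first-task", objectId);
        if (status != SUCCESS) {
            reports.add("problem out of the gate");
            result = "first-task did not succeed";
            return ERROR;
        }
        status = run("second-task", objectId);
        if (status == FAIL) {
            status = run("cleanup-task", objectId);
        } else if (status == ERROR) {
            reports.add("error on second task");
        } else if (status == SKIP) {
            status = run("another-task", objectId);
            if (status == SUCCESS) {
                return run("cleanup-task", objectId);
            }
        } else {
            status = run("cleanup-task", objectId);
        }
        // Assumption: with no explicit 'return', report the last status seen
        return status;
    }
}
```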

 

Object Selectors

In CS, the unit of curation is a DSpaceObject (which may be an Item, Collection, or Community). Thus the API offers these basic methods (on the Curator class):

Code Block

public void curate(DSpaceObject object) throws IOException;

public void curate(Context c, String id) throws IOException;

...

This is the primary motivation for a new feature of the curation API known as object selectors. 'ObjectSelector' is a new interface (which essentially just exposes a DSpaceObject Iterator), that is directly supported by the curation API:

Code Block
public void curate(ObjectSelector selector) throws IOException;

public void queue(ObjectSelector selector, String queueId) throws IOException;
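For a feel of how small a selector can be, here is a minimal sketch of one backed by a fixed list. The exact shape of ObjectSelector is not given on this page, so the interface below (a plain Iterator) is an assumption, StubObject is a hypothetical stand-in for DSpaceObject, and ListSelector and curateAll are invented names.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for DSpaceObject, carrying only a handle.
final class StubObject {
    final String handle;

    StubObject(String handle) {
        this.handle = handle;
    }
}

// Assumed shape of the interface: an iterator over the objects to curate.
interface ObjectSelector extends Iterator<StubObject> { }

// Delivers a fixed, caller-supplied list of objects.
final class ListSelector implements ObjectSelector {
    private final Iterator<StubObject> delegate;

    ListSelector(List<StubObject> objects) {
        // Copy so that later changes to the caller's list do not affect iteration
        this.delegate = new ArrayList<>(objects).iterator();
    }

    @Override
    public boolean hasNext() {
        return delegate.hasNext();
    }

    @Override
    public StubObject next() {
        return delegate.next();
    }
}

final class SelectorDemo {
    // Stands in for a selector-driven curate call: applies the task to
    // every object the selector delivers and counts them.
    static int curateAll(ObjectSelector selector, Consumer<StubObject> task) {
        int count = 0;
        while (selector.hasNext()) {
            task.accept(selector.next());
            count++;
        }
        return count;
    }
}
```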

The curator will perform the configured tasks on all the DSpaceObjects delivered by the selector, and the selector can deliver any set of objects it wishes. Because ObjectSelector is an interface, CS users may write and deploy their own custom selector implementations, but we propose to offer a few general-purpose implementations that will be bundled with the curation system. Currently these are:

...

This selector invokes the DSpace native (Lucene) search APIs to obtain sets of objects. In this way, one can easily perform curation tasks on any set of search results. For ease of reuse, the search query string can be stored in a configuration file, and each such configuration can be given a different name. This technique, known as 'named selectors', allows for easy integration in other CS tools. For example (in the command-line tool via the DSpace launcher):

Code Block

[dspace]/bin/dspace curate -o nanotechnology -t textextract

...

This selector queries the database to obtain its objects. In essence, the selector transforms a very simplified user-supplied query string into the SQL necessary to perform the database query. An example can illustrate:

Code Block

in_archive = '1' AND last_modified > ${today - 7} AND dc.contributor.author = 'Jones'

This query would retrieve all archived items authored by Jones that were modified within the last week. The actual SQL is more complex, since joins with the metadata tables are required. For the curious, the syntax of the query language is given below (in Extended Backus-Naur Form):

Code Block

(* Query syntax EBNF *)
  query = expr , { "AND" , expr } ;
  expr = ( field name | metadata name ) , oper , value ;
  field name = characters , { "_" , characters } ;
  metadata name = characters , "." , characters , [ "." , characters ] ;
  oper = "=" | "<>" | ">" | "<" | ">=" | "<=" | "BETWEEN" | "LIKE" | "IN" ;
  value = literal | variable ;
  literal = "'" , characters , { whitespace , characters } , "'" ;
  variable = "${" , varname , [ ( "+" | "-" ) , number ] , "}" ;
  varname = "today" | handle ;
(* end syntax EBNF *)
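The 'variable' production is the least conventional part of the grammar, so here is a minimal sketch of how the 'today' variable might be expanded before the query is turned into SQL. It rests on several assumptions: that the number is a count of days (consistent with the 'last week' example above), that dates are rendered as quoted ISO dates, and that whitespace is tolerated inside the braces as in the example. QueryVariables and expand are invented names, and handle variables are not covered.

```java
import java.time.LocalDate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QueryVariables {
    // Matches ${today}, ${today - 7}, ${today + 30}; whitespace is optional
    private static final Pattern TODAY =
        Pattern.compile("\\$\\{\\s*today\\s*(?:([+-])\\s*(\\d+))?\\s*\\}");

    // Replaces each 'today' variable with a quoted ISO date literal.
    // 'today' is passed in so that the result is reproducible.
    static String expand(String query, LocalDate today) {
        Matcher m = TODAY.matcher(query);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            LocalDate date = today;
            if (m.group(1) != null) {
                long days = Long.parseLong(m.group(2)); // assumed unit: days
                date = "+".equals(m.group(1)) ? today.plusDays(days)
                                              : today.minusDays(days);
            }
            m.appendReplacement(out, Matcher.quoteReplacement("'" + date + "'"));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With 15 January 2013 as 'today', the fragment "last_modified > ${today - 7}" expands to "last_modified > '2013-01-08'".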

...

Instead we propose a very simple but flexible way to monitor task execution, based on a new annotation type for tasks:

Code Block

@Record
public class ImportantTask extends AbstractCurationTask
...

The presence of this annotation signifies to the CS that when a task of this type is performed, the outcome should be recorded somewhere, if recording has been activated in the DSpace curation setup. There will be no error (or run-time penalty) if recording has not been activated. The 'outcome' here means the following data elements:

  • time stamp of task performance
  • id (handle) of object
  • EPerson name invoking task (if specified)
  • logical task name
  • task status code
  • task result string (if set)
  • task 'type' (explained below)
  • task 'value' (also explained below)

'Recorded' here means only that a class implementing the 'Recorder' interface has been configured. What constitutes 'recording' is up to the implementation, but could include:

  • logging to a local file
  • writing to a database
  • posting to a message queue
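The signature of the 'Recorder' interface is not spelled out here, so the following is only a guess at its shape: a single method receiving the data elements listed above, plus a trivial in-memory implementation that formats each outcome as a tab-separated line. The method name, parameter order, and the InMemoryRecorder class are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shape of the recording hook; the real interface may differ.
interface Recorder {
    void record(long timestamp, String objectId, String epersonName,
                String taskName, int status, String result,
                String type, String value);
}

// Keeps one tab-separated line per outcome. A file-based journal could
// append the same lines to a log instead of a list.
final class InMemoryRecorder implements Recorder {
    final List<String> lines = new ArrayList<>();

    @Override
    public void record(long timestamp, String objectId, String epersonName,
                       String taskName, int status, String result,
                       String type, String value) {
        lines.add(String.join("\t",
            Long.toString(timestamp), objectId, orDash(epersonName), taskName,
            Integer.toString(status), orDash(result), orDash(type), orDash(value)));
    }

    // Optional elements (EPerson, result string, type, value) may be absent
    private static String orDash(String s) {
        return (s == null || s.isEmpty()) ? "-" : s;
    }
}
```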

The point is to have a 'hook' into the curation runtime where these records can be captured. The release will likely include a simple local file log/journal recorder as a basic starter implementation. A number of default values may be overridden if needed:

Code Block
@Record(statusCodes={1, 2})

@Record(type="PREMIS",
        value="Replication")

By default, the recording logic will be invoked regardless of the status code returned by the task, but in the first case above we limit it to failures and skips (status codes 1 and 2). Type and value are very useful when a task's work can be expressed in a controlled vocabulary: it would be easy to generate records as, e.g., RDF statements with the id as subject, and type and value referring to ontology-defined terms. A given task may have multiple such descriptions (e.g. in different domains). Since simple annotations cannot be repeated, we must use a 'container' annotation:

Code Block
@Records({
   @Record(type="PREMIS", value="Replication"),
   @Record(type="LOC", value="duplication")
})
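For readers who want to experiment, here is a minimal sketch of how the two annotation types might be declared and read back at run time. The element names (statusCodes, type, value) come from the examples above; the defaults, the convention that an empty statusCodes array means 'record every status', and the RecordInspector helper are assumptions.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

@Retention(RetentionPolicy.RUNTIME) // must be visible to reflection
@Target(ElementType.TYPE)
@interface Record {
    int[] statusCodes() default {}; // assumed: empty means every status code
    String type() default "";
    String value() default "";
}

// Container annotation holding several @Record descriptions
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Records {
    Record[] value();
}

@Records({
    @Record(type = "PREMIS", value = "Replication"),
    @Record(type = "LOC", value = "duplication", statusCodes = {1, 2})
})
final class ImportantTask { }

final class RecordInspector {
    // Collects a lone @Record and any @Record nested in @Records.
    static List<Record> recordsFor(Class<?> taskClass) {
        List<Record> found = new ArrayList<>();
        Record single = taskClass.getAnnotation(Record.class);
        if (single != null) {
            found.add(single);
        }
        Records container = taskClass.getAnnotation(Records.class);
        if (container != null) {
            found.addAll(Arrays.asList(container.value()));
        }
        return found;
    }

    // True if the annotation asks for this status code to be recorded.
    static boolean shouldRecord(Record record, int status) {
        if (record.statusCodes().length == 0) {
            return true;
        }
        for (int code : record.statusCodes()) {
            if (code == status) {
                return true;
            }
        }
        return false;
    }
}
```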

Resource Management

One of the core design objectives of CS was to make tasks as simple to implement as possible: in practice this meant keeping the API 'footprint' (number of methods that a task has to code) very small. In fact, it really consists of only two methods, init and perform (the latter in two overloaded forms):

Code Block
void init(Curator curator, String taskId) throws IOException;

int perform(DSpaceObject dso) throws IOException;

int perform(Context ctx, String id) throws IOException;

where the third signature can usually be implemented in terms of the second. One consequence of this is a lack of what one would consider full lifecycle semantics. That is, there is no method by which a task could 'clean itself up' after use. This can entail a few gyrations - or at any rate a certain task design discipline - in some circumstances. Let us take a concrete example: a task that needs to write some data to a stream for each object it receives. The simplest apparent way to code this is:

Code Block
public class StreamTaskTake1 implements CurationTask
{
   private OutputStream out;

   public void init(Curator curator, String taskId) throws IOException
   {
       // opened once, in append mode
       out = new FileOutputStream("somewhere", true);
   }

   public int perform(DSpaceObject dso) throws IOException
   {
       // ...
       out.write(dso.getHandle().getBytes());
       // ...
   }
}

but of course this isn't very satisfactory, since the task never closes the stream it opened. The task has no apparent way of determining when it is called for the last time, so there isn't an obvious way around this. (There are in fact several ways - e.g. the task can annotate itself as @Distributive and have complete control over how it is called, but this can add substantial complexity). So we are usually led to a solution like this:

Code Block
public class StreamTaskTake2 implements CurationTask
{
   private OutputStream out;

   public void init(Curator curator, String taskId) throws IOException
   {
   }

   public int perform(DSpaceObject dso) throws IOException
   {
       // ...
       out = new FileOutputStream("somewhere", true);
       try
       {
           out.write(dso.getHandle().getBytes());
       }
       finally
       {
           // release the file descriptor even if the write fails
           out.close();
       }
       // ...
   }
}

This version is formally correct, and indeed exhibits the quite desirable trait of not holding a file descriptor when not in use. But we might chafe at the fairly inefficient IO involved: if this task is invoked on a collection of 1000 items, the file is re-opened 1000 times. Thus the idea of curator resource management: suppose we could simply ask the curation system to manage the issue? Like so:

Code Block
public class StreamTaskTake3 implements CurationTask
{
   private OutputStream out;

   public void init(Curator curator, String taskId) throws IOException
   {
       out = new FileOutputStream("somewhere", true);
       // let the curator worry about closing the stream
       curator.enrollResource(out, "close");
   }

   public int perform(DSpaceObject dso) throws IOException
   {
       // ...
       out.write(dso.getHandle().getBytes());
       // ...
   }
}

That is, the enrollResource method asks the CS to ensure that 'out.close()' is called on the stream once the curator has finished its work. The "close" argument is called the policy, and it is the job of CS to enforce it. So far we have only looked at 'close' and 'flush' as policies, but it is not difficult to imagine others.
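One way the curator side of this could work is sketched below: a small registry that remembers each enrolled resource with its policy and applies the policies when curation ends. This is a sketch under assumptions, not the CS implementation: ResourceRegistry and releaseAll are invented names, policies are mapped onto the standard Closeable and Flushable interfaces, and resources are released in reverse order of enrollment.

```java
import java.io.Closeable;
import java.io.Flushable;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

final class ResourceRegistry {
    private static final class Enrolled {
        final Object resource;
        final String policy;

        Enrolled(Object resource, String policy) {
            this.resource = resource;
            this.policy = policy;
        }
    }

    private final Deque<Enrolled> enrolled = new ArrayDeque<>();

    // Rejects a policy the resource cannot honour at enrollment time,
    // so mistakes surface in init() rather than at the end of a long run.
    void enrollResource(Object resource, String policy) {
        boolean supported =
            ("close".equals(policy) && resource instanceof Closeable)
            || ("flush".equals(policy) && resource instanceof Flushable);
        if (!supported) {
            throw new IllegalArgumentException("Cannot apply policy: " + policy);
        }
        enrolled.push(new Enrolled(resource, policy));
    }

    // Called once when the curator has finished its work. Applies every
    // policy, then rethrows the first failure, if there was one.
    void releaseAll() throws IOException {
        IOException first = null;
        while (!enrolled.isEmpty()) {
            Enrolled e = enrolled.pop();
            try {
                if ("close".equals(e.policy)) {
                    ((Closeable) e.resource).close();
                } else {
                    ((Flushable) e.resource).flush();
                }
            } catch (IOException ex) {
                if (first == null) {
                    first = ex;
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }
}
```

Continuing after a failed close, rather than stopping at the first exception, means one misbehaving task cannot leave the other tasks' streams open.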