Info

This documentation is a guide to creating Curation Tasks programmatically. For more information on configuring Curation Tasks, see the Curation System section of the documentation.

Writing your own tasks

A task is just a Java class that can contain arbitrary code, but it must have two properties:

...

Second, it must implement the interface "org.dspace.curate.CurationTask".

The CurationTask interface is almost a "tagging" interface, and only requires a few very high-level methods be implemented. The most significant is:

...

  • 0 : SUCCESS the task completed successfully
  • 1 : FAIL the task failed (it is up to the task to decide what 'counts' as failure - an example might be that the virus scan finds an infected file)
  • 2 : SKIPPED the task could not be performed on the object, perhaps because it was not applicable
  • -1 : ERROR the task could not be completed due to an error

If a task extends the AbstractCurationTask class, that is the only method it needs to define.

Invoking tasks in arbitrary user code

If these pre-defined ways are not sufficient, you can of course manage curation directly in your code. You would use the CS helper classes. For example:

Code Block
languagejava
Collection coll = (Collection)HandleManager.resolveToObject(context, "123456789/4");
Curator curator = new Curator();
curator.addTask("vscan").curate(coll);
System.out.println("Result: " + curator.getResult("vscan"));

would do approximately what the command line invocation did. The method "curate" simply performs all the tasks configured (you can add multiple tasks to a curator).

Asynchronous (Deferred) Operation

Because some tasks may consume a fair amount of time, it may not be desirable to run them in an interactive context. CS provides a simple API and means to defer task execution, by a queuing system. Thus, using the previous example:

Code Block
languagejava
Curator curator = new Curator();
curator.addTask("vscan").queue(context, "monthly", "123456789/4");

would place a request on a named queue "monthly" to virus scan the collection. To read (and process) the queue, we could for example:

Code Block
[dspace]/bin/dspace curate -q monthly

use the command-line tool, but we could also read the queue programmatically. Any number of queues can be defined and used as needed.
In the administrative UI curation "widget", it is possible both to perform a task immediately and to place it on a queue for later processing.


Task Output and Reporting

CS makes few assumptions about what the 'outcome' of a task may be (if any): it could, for example, produce a report to a temporary file, or it could modify DSpace content silently. But the CS runtime does provide a few pieces of information whenever a task is performed:

Status Code

This was mentioned above. It is returned to CS by any of a task's perform methods. The complete list of values, defined in Curator, is:

...

value  symbol          meaning
-3     CURATE_NOTASK   CS could not find the requested task
-2     CURATE_UNSET    task did not return a status code because it has not yet run
-1     CURATE_ERROR    task could not be performed
0      CURATE_SUCCESS  task performed successfully
1      CURATE_FAIL     task performed, but failed
2      CURATE_SKIP     task not performed due to object not being eligible


In the administrative UI, this code is translated into the word or phrase configured by the ui.statusmessages property (discussed in Curation System) for display.
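Calling code that receives a raw status value may want to turn it back into its symbol for logging. The following is a minimal, self-contained sketch of such a lookup. It does not use the DSpace classes: the enum name CurationStatus and the methods fromCode and describe are hypothetical, and only the numeric values, symbols and meanings are taken from the list above.

```java
import java.util.Arrays;
import java.util.Optional;

/** Hypothetical stand-in that mirrors the status values listed for Curator. */
enum CurationStatus {
    CURATE_NOTASK(-3, "CS could not find the requested task"),
    CURATE_UNSET(-2, "task did not return a status code because it has not yet run"),
    CURATE_ERROR(-1, "task could not be performed"),
    CURATE_SUCCESS(0, "task performed successfully"),
    CURATE_FAIL(1, "task performed, but failed"),
    CURATE_SKIP(2, "task not performed due to object not being eligible");

    final int code;
    final String meaning;

    CurationStatus(int code, String meaning) {
        this.code = code;
        this.meaning = meaning;
    }

    /** Finds the status whose numeric value matches, if any. */
    static Optional<CurationStatus> fromCode(int code) {
        return Arrays.stream(values()).filter(s -> s.code == code).findFirst();
    }

    /** Builds a log-friendly description, with a fallback for unlisted values. */
    static String describe(int code) {
        return fromCode(code)
                .map(s -> s.name() + ": " + s.meaning)
                .orElse("unknown status " + code);
    }
}
```

With this sketch, a caller could pass the value returned by curator.getStatus("vscan") to CurationStatus.describe and log the resulting string.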

Result String

The task may set a string indicating details of the outcome. This result is displayed in the "curation widget" described above:

Code Block
languagejava
// Another example of a result string: "Virus 12312 detected on Bitstream 4 of 1234567789/3"
curator.setResult("Item " + item.getID() + " was painted " + color);

CS does not interpret or assign result strings; the task does it. A task may choose not to assign a result, but the "best practice" for tasks is to assign one whenever possible. Code which invokes Curator.getResult() may use the result string for display or any other purpose.

Reporting Stream

For very fine-grained information, a task may write to a reporting stream. This stream is sent to standard out, so is only available when running a task from the command line. Unlike the result string, there is no limit to the amount of data that may be pushed to this stream.

Code Block
languagejava
curator.report("Lorem ipsum dolor sit amet,\n");
curator.report("consectetur adipiscing elit,\n");
curator.report("sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.\n");

Accessing task output in calling code

The status code, reporting stream, and the result string are accessed (or set) by methods on the Curator object:

Code Block
languagejava
Curator curator = new Curator();
curator.setReporter(new OutputStreamWriter(System.out));
curator.addTask("vscan").curate(coll);
int status = curator.getStatus("vscan");
String result = curator.getResult("vscan");

Task Properties

DSpace 1.8 introduces a new "idiom" for tasks that require configuration data. It is available to any task whose implementation extends AbstractCurationTask, but is completely optional. There are a number of problems that task properties are designed to solve, but to make the discussion concrete we will start with a particular one: the problem of hard-coded configuration file names. A task that relies on configuration data will typically encode a fixed reference to a configuration file name. For example, the virus scan task reads a file called "clamav.cfg", which lives in [dspace]/config/modules. And thus in the implementation one would find:

Code Block
languagejava
host = configurationService.getProperty("clamav.service.host");

and similar. But tasks are supposed to be written by anyone in the community and shared around (without prior coordination), so if another task uses the same configuration file name, there is a name collision here that can't be easily fixed, since the reference is hard-coded in each task. In this case, if we wanted to use both at a given site, we would have to alter the source of one of them - which introduces needless code localization and maintenance.

Task properties give us a simple solution. Here is how it works: suppose that both colliding tasks instead use this method provided by AbstractCurationTask in their task implementation code (e.g. in the virus scanner):

Code Block
languagejava
host = taskProperty("clamav.service.host");

Note that the name of the configuration file is not mentioned at all, just the name of the property whose value we want. At runtime, the curation system resolves this call to a set of configuration properties, and it uses the name under which the task has been configured as the prefix of the properties. So, for example, if both were installed (in, say, curate.cfg) as:

Code Block
org.dspace.ctask.general.ClamAv = vscan,
org.community.ctask.ConflictTask = virusscan,
....

then "taskProperty("foo")" will resolve to the property named vscan.foo when called from the ClamAv task, but to virusscan.foo when called from ConflictTask's code. Note that "vscan" etc. are locally assigned names, so we can always prevent the "collisions" mentioned, and we make the tasks much more portable, since we remove the "hard-coding" of config names.

Task code may configure itself using ConfigurationService in the normal manner, or by the use of "task properties".  See Curation System - Task Properties for discussion of the issues for which task properties were invented.  Any code which extends AbstractCurationTask has access to its configured task properties.
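The prefix resolution described above can be illustrated with a small, self-contained model. This is only a sketch of the idea and not the DSpace implementation: the class TaskPropertyResolver is hypothetical, and a plain Map stands in for the site configuration.

```java
import java.util.Map;

/** Hypothetical model of task-name-prefixed lookup; not the DSpace implementation. */
class TaskPropertyResolver {
    private final Map<String, String> config; // stands in for the site configuration
    private final String taskName;            // locally assigned name, e.g. "vscan"

    TaskPropertyResolver(Map<String, String> config, String taskName) {
        this.config = config;
        this.taskName = taskName;
    }

    /** Resolves "foo" to the value stored under "<taskName>.foo", or null if absent. */
    String taskProperty(String name) {
        return config.get(taskName + "." + name);
    }
}
```

Given a configuration containing both vscan.foo and virusscan.foo, a resolver created with the name "vscan" returns the first value and one created with "virusscan" returns the second, although both callers ask only for "foo".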

The entire "API" for task properties is:

Code Block
languagejava
public String taskProperty(String name);
public int taskIntProperty(String name, int defaultValue);
public long taskLongProperty(String name, long defaultValue);
public boolean taskBooleanProperty(String name, boolean defaultValue);

Another use of task properties is to support multiple task profiles. Suppose we have a task that we want to operate in one of two modes. A good example would be a mediafilter task that produces a thumbnail. We can either create one if it doesn't exist, or run with "-force" which will create one regardless. Suppose this behavior was controlled by a property in a config file. If we configured the task as "thumbnail", then we would have in (perhaps) [dspace]/config/modules/thumbnail.cfg:

Code Block
...other properties...
thumbnail.thumbnail.maxheight = 80
thumbnail.thumbnail.maxwidth = 80
thumbnail.forceupdate=false

Then, following the pattern above, the thumbnail generating task code would look like:

Code Block
languagejava
if (taskBooleanProperty("forceupdate", false)) {
    // do something
}

But an obvious use case would be to run force mode and non-force mode from the admin UI on different occasions. To do this, one would have to stop Tomcat, change the property value in the config file, and restart. However, task properties rescue us here. All we need to do is go into the config/modules directory and create a new file, perhaps called thumbnail.force.cfg. In this file, we put the properties:

Code Block
thumbnail.force.thumbnail.maxheight = 80
thumbnail.force.thumbnail.maxwidth = 80
thumbnail.force.forceupdate=true

Then we add a new task (really just a new name, no new code) in curate.cfg:

Code Block
org.dspace.ctask.general.ThumbnailTask = thumbnail
org.dspace.ctask.general.ThumbnailTask = thumbnail.force

Consider what happens: when we perform the task "thumbnail" (using taskProperties), it uses the thumbnail.* properties and operates in "non-force" profile (since the value is false), but when we run the task "thumbnail.force" the curation system uses the thumbnail.force.* properties. Notice that we did all this via local configuration - we have not had to touch the source code at all to obtain as many "profiles" as we would like.
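To make the two profiles concrete, here is a minimal sketch of one task class that behaves differently depending only on the name it was configured under. It is self-contained and makes several assumptions: the class ThumbnailProfileSketch and its planFor method are hypothetical, java.util.Properties stands in for the DSpace configuration, and the local taskBooleanProperty merely imitates the prefixing behavior described above.

```java
import java.util.Properties;

/** Hypothetical sketch: one task class, two configured names (profiles). */
class ThumbnailProfileSketch {
    private final Properties config; // stands in for the DSpace configuration
    private final String taskName;   // "thumbnail" or "thumbnail.force"

    ThumbnailProfileSketch(Properties config, String taskName) {
        this.config = config;
        this.taskName = taskName;
    }

    /** Looks up "<taskName>.<name>", falling back to defaultValue when it is missing. */
    boolean taskBooleanProperty(String name, boolean defaultValue) {
        String value = config.getProperty(taskName + "." + name);
        return value == null ? defaultValue : Boolean.parseBoolean(value.trim());
    }

    /** Decides what to do for an item, given whether it already has a thumbnail. */
    String planFor(boolean thumbnailExists) {
        boolean force = taskBooleanProperty("forceupdate", false);
        return (thumbnailExists && !force) ? "skip" : "generate";
    }
}
```

The same code yields "skip" for an existing thumbnail when constructed with the name "thumbnail" and "generate" when constructed with "thumbnail.force", provided the two forceupdate properties are set as in the configuration shown earlier.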

Task Annotations

CS looks for, and will use, certain Java annotations in the task class definition that can help it invoke tasks more intelligently. An example may explain best. Since tasks operate on DSOs that can either be simple (Items) or containers (Collections and Communities), there is a fundamental ambiguity in how a task is invoked: if the DSO is a collection, should CS invoke the task on each member of the collection, or does the task "know" how to do that itself? The decision is made by looking for the @Distributive annotation: if present, CS assumes that the task will manage the details, otherwise CS will walk the collection and invoke the task on each member. The Java class would be defined:

...

Only a few annotation types have been defined so far, but as the number of tasks grows, we can look for common behavior that can be signaled by annotation. For example, there is a @Mutative type that tells CS that the task may alter (mutate) the object it is working on.
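The dispatch decision described for @Distributive can be sketched with plain Java reflection. The example below is self-contained and deliberately avoids the DSpace classes: the annotation declared here is a local stand-in with the same name, and SketchTask, WholeCollectionTask, PerItemTask and AnnotationDispatchSketch are hypothetical names used only for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;

/** Local stand-in for the marker annotation; it must be retained at runtime to be visible. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Distributive {}

/** Hypothetical minimal task contract for this sketch only. */
interface SketchTask {
    void perform(String objectName);
}

/** Manages container members itself, so it is marked @Distributive. */
@Distributive
class WholeCollectionTask implements SketchTask {
    final List<String> seen = new ArrayList<>();
    public void perform(String objectName) { seen.add(objectName); }
}

/** Unmarked task: the caller is expected to walk the container for it. */
class PerItemTask implements SketchTask {
    final List<String> seen = new ArrayList<>();
    public void perform(String objectName) { seen.add(objectName); }
}

class AnnotationDispatchSketch {
    /** Invokes the task once on the container if it is distributive, else once per member. */
    static void curate(SketchTask task, String container, List<String> members) {
        if (task.getClass().isAnnotationPresent(Distributive.class)) {
            task.perform(container);
        } else {
            for (String member : members) {
                task.perform(member);
            }
        }
    }
}
```

The point of the design is that the task author states the behavior declaratively on the class, and the caller needs no per-task special cases.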

Scripted Tasks

...

DSpace 1.8 introduced limited (and somewhat experimental) support for deploying and running tasks written in languages other than Java. Since version 6, Java has provided a standard way (API) to invoke so-called scripting or dynamic language code that runs on the Java virtual machine (JVM). Scripted tasks are those written in a language accessible from this API. The exact number of supported languages will vary over time, and the degree of maturity of each language, or its suitability for curation tasks, will also vary significantly. However, preliminary work indicates that Ruby (using the JRuby runtime) and Groovy may prove viable task languages.

Support for scripted tasks does not include any DSpace pre-installation of the scripting language itself - this must be done according to the instructions provided by the language maintainers, and typically only requires a few additional jars on the DSpace classpath. Once one or more languages have been installed into the DSpace deployment, task support is fairly straightforward. One new property must be defined in [dspace]/config/modules/curate.cfg:

Code Block
curate.script.dir = ${dspace.dir}/scripts

This merely defines the directory location (usually relative to the deployment base) where task script files should be kept. This directory will contain a "catalog" of scripted tasks named task.catalog that contains information needed to run scripted tasks. Each task has a 'descriptor' property with value syntax:

<engine>|<relFilePath>|<implClassCtor>

An example property for a link checking task written in Ruby might be:

Code Block
linkchecker = ruby|rubytask.rb|LinkChecker.new

This descriptor means that a "ruby" script engine will be created, a script file named "rubytask.rb" in the directory <script.dir> will be loaded, and the resolver will expect that evaluating "LinkChecker.new" provides a correct implementation object. Note that the task must be configured in all other ways just like Java tasks (in ui.tasknames, ui.taskgroups, etc).

Script files may embed their descriptors to facilitate deployment. To accomplish this, a script must include the descriptor string, with syntax $td=<descriptor>, somewhere on a comment line. For example:

Code Block
# My descriptor $td=ruby|rubytask.rb|LinkChecker.new

For reasons of portability, the <relFilePath> component may be omitted in this context. Thus, "$td=ruby||LinkChecker.new" will be expanded to a descriptor with the name of the embedding file.

See Curation System - Scripted Tasks for information on configuring and running scripted tasks.
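The descriptor syntax and the embedded form lend themselves to a short parsing sketch. The code below is an illustration under stated assumptions, not the resolver that DSpace uses: the class TaskDescriptor and its methods are hypothetical, and it assumes that an embedded descriptor runs from "$td=" to the end of the comment line.

```java
/** Hypothetical parser for the "<engine>|<relFilePath>|<implClassCtor>" syntax. */
class TaskDescriptor {
    final String engine;
    final String relFilePath;
    final String implClassCtor;

    private TaskDescriptor(String engine, String relFilePath, String implClassCtor) {
        this.engine = engine;
        this.relFilePath = relFilePath;
        this.implClassCtor = implClassCtor;
    }

    /** Parses a descriptor value; the -1 limit keeps an empty middle field. */
    static TaskDescriptor parse(String descriptor) {
        String[] parts = descriptor.trim().split("\\|", -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected 3 '|'-separated fields: " + descriptor);
        }
        return new TaskDescriptor(parts[0], parts[1], parts[2]);
    }

    /**
     * Extracts an embedded "$td=" descriptor from a comment line, or returns null
     * if the line has none. An omitted path is replaced by the embedding file name.
     * Assumption: the descriptor extends to the end of the line.
     */
    static TaskDescriptor fromCommentLine(String line, String embeddingFile) {
        int start = line.indexOf("$td=");
        if (start < 0) {
            return null;
        }
        TaskDescriptor d = parse(line.substring(start + "$td=".length()));
        String path = d.relFilePath.isEmpty() ? embeddingFile : d.relFilePath;
        return new TaskDescriptor(d.engine, path, d.implClassCtor);
    }
}
```

For example, parsing the comment line "# My descriptor $td=ruby||LinkChecker.new" found in a file named rubytask.rb yields the engine "ruby", the path "rubytask.rb" and the constructor expression "LinkChecker.new".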

Interface

Scripted tasks must implement a slightly different interface than the CurationTask interface used for Java tasks. The appropriate interface for scripting tasks is ScriptedTask and has the following methods:

Code Block
languagejava
public void init(Curator curator, String taskId) throws IOException;
public int performDso(DSpaceObject dso) throws IOException;
public int performId(Context ctx, String id) throws IOException;

The difference is that ScriptedTask has separate perform methods for DSO and identifier, because some scripting languages (e.g. Ruby) don't support method overloading.

...