Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

As of release 1.7, DSpace supports running curation tasks, which are described in this section. DSpace 1.7 and subsequent distributions will bundle (include) several useful tasks, but the system also is designed to allow new tasks to be added between releases, both general purpose tasks that come from the community, and locally written and deployed tasks.

Table of Contents
minLevel2
outlinetrue
stylenone

Changes in 1.8

  • New package: The default curation task package is now org.dspace.ctask. The tasks supplied with DSpace releases are now under org.dspace.ctask.general
  • New tasks in DSpace release: Some additional curation tasks have been supplied with DSpace 1.8, including a link checker and a translator
  • UI task groups: Ability to assign tasks to groups whose members display together in the Administrative UI
  • Task properties: Support for a site-portable system for configuration and profiling of tasks using configuration files
  • New framework services: Support for context management during curation operations
  • Scripted tasks: New (experimental) support for authoring and executing tasks in languages other than Java

...

For CS to run a task, the code for the task must of course be included with other deployed code (to [dspace]/lib, WAR, etc) but it must also be declared and given a name. This is done via a configuration property in [dspace]/config/modules/curate.cfg as follows:

Code Block

plugin.named.org.dspace.curate.CurationTask = \
org.dspace.ctask.general.NoOpCurationTask = noop, \
org.dspace.ctask.general.ProfileFormats = profileformats, \
org.dspace.ctask.general.RequiredMetadata = requiredmetadata, \
org.dspace.ctask.general.ClamScan = vscan, \
org.dspace.ctask.general.MicrosoftTranslator = translate, \
org.dspace.ctask.general.MetadataValueLinkChecker = checklinks

...

The complete list of arguments:

Code Block

-t taskname: name of task to perform
-T filename: name of file containing list of tasknames
-e epersonID: (email address) will be superuser if unspecified
-i identifier: Id of object to curate. May be (1) a handle (2) a workflow Id or (3) 'all' to operate on the whole repository
-q queue: name of queue to process - -i and -q are mutually exclusive
-l limit: maximum number of objects in Context cache. If absent, unlimited objects may be added.
-s scope: declare a scope for database transactions. Scope must be: (1) 'open' (default value) (2) 'curation' or (3) 'object'
-v emit verbose output
-r - emit reporting to standard out

...

Each of the above pages exposes a drop-down list of configured tasks, with a button to 'perform' the task, or queue it for later operation (see section below). Not all activated tasks need appear in the Curate tab - you filter them by means of a configuration property. This property also permits you to assign to the task a more user-friendly name than the PluginManager taskname. The property resides in [dspace]/config/modules/curate.cfg:

Code Block

ui.tasknames = \
     profileformats = Profile Bitstream Formats, \
     requiredmetadata = Check for Required Metadata

When a task is selected from the drop-down list and performed, the tab displays both a phrase interpreting the 'status code' of the task execution, and the 'result' message if any has been defined. When the task has been queued, an acknowledgement appears instead. You may configure the words used for status codes in curate.cfg (for clarity, language localization, etc):

Code Block

ui.statusmessages = \
    -3 = Unknown Task, \
    -2 = No Status Set, \
    -1 = Error, \
     0 = Success, \
     1 = Fail, \
     2 = Skip, \
     other = Invalid Status

...

The configuration of groups follows the same simple pattern as tasks, using properties in [dspace]/config/modules/curate.cfg. The group is assigned a simple logical name, but also a localizable name that appears in the UI. For example

Code Block

# ui.taskgroups contains the list of defined groups, together with a pretty name for UI display
ui.taskgroups = \
  replication = Backup and Restoration Tasks, \
  integrity = Metadata Integrity Tasks, \
  .....
# each group membership list is a separate property, whose value is comma-separated list of logical task names
ui.taskgroup.integrity = profileformats, requiredmetadata
....

...

CS provides the ability to attach any number of tasks to standard DSpace workflows. Using a configuration file [dspace]/config/workflow-curation.xml, you can declaratively (without coding) wire tasks to any step in a workflow. An example:

Code Block
languagehtml/xml

<taskset-map>
    <mapping collection-handle="default" taskset="cautious" />
</taskset-map>
<tasksets>
    <taskset name="cautious">
        <flowstep name="step1">
            <task name="vscan">
                <workflow>reject</workflow>
                <notify on="fail">$flowgroup</notify>
                <notify on="fail">$colladmin</notify>
                <notify on="error">$siteadmin</notify>
            </task>
        </flowstep>
    </taskset>
</tasksets>

This markup would cause a virus scan to occur during step one of workflow for any collection, and automatically reject any submissions with infected files. It would further notify (via email) both the reviewers (step 1 group), and the collection administrators, if either of these are defined. If it could not perform the scan, the site administrator would be notified.

...

Tasks wired in this way are normally performed as soon as the workflow step is entered, and the outcome action (defined by the 'workflow' element) immediately follows. It is also possible to delay the performance of the task - which will ensure a responsive system - by queuing the task instead of directly performing it:

Code Block
languagehtml/xml

...
    <taskset name="cautious">
        <flowstep name="step1" queue="workflow">
...

...

If these pre-defined ways are not sufficient, you can of course manage curation directly in your code. You would use the CS helper classes. For example:

Code Block
languagejava

Collection coll = (Collection)HandleManager.resolveToObject(context, "123456789/4");
Curator curator = new Curator();
curator.addTask("vscan").curate(coll);
System.out.println("Result: " + curator.getResult("vscan"));

...

Because some tasks may consume a fair amount of time, it may not be desirable to run them in an interactive context. CS provides a simple API and means to defer task execution, by a queuing system. Thus, using the previous example:

Code Block
languagejava

Curator curator = new Curator();
     curator.addTask("vscan").queue(context, "monthly", "123456789/4");

...

This was mentioned above. This is returned to CS whenever a task is called. The complete list of values:

Code Block

-3 NOTASK - CS could not find the requested task
  -2 UNSET  - task did not return a status code because it has not yet run
  -1 ERROR - task could not be performed
   0 SUCCESS - task performed successfully
   1 FAIL - task performed, but failed
   2 SKIP - task not performed due to object not being eligible

...

The task may define a string indicating details of the outcome. This result is displayed, in the 'curation widget' described above:

Code Block

"Virus 12312 detected on Bitstream 4 of 1234567789/3"

...

The status code, and the result string are accessed (or set) by methods on the Curation object:

Code Block
languagejava

Curator curator = new Curator();
     curator.addTask("vscan").curate(coll);
     int status = curator.getStatus("vscan");
     String result - curator.getResult("vscan");

...

DSpace 1.8 introduces a new 'idiom' for tasks that require configuration data. It is available to any task whose implementation extends AbstractCurationTask, but is completely optional. There are a number of problems that task properties are designed to solve, but to make the discussion concrete we will start with a particular one: the problem of hard-coded configuration file names. A task that relies on configuration data will typically encode a fixed reference to a configuration file name. For example, the virus scan task reads a file called 'clamav.cfg', which lives in [dspace]/config/modules. And thus in the implementation one would find:

Code Block
languagejava

host = ConfigurationManager.getProperty("clamav", "service.host");

...

Task properties gives us a simple solution. Here is how it works: suppose that both colliding tasks instead use this method provided by AbstractCurationTask in their task implementation code (e.g. in virus scanner):

Code Block
languagejava

host = taskProperty("service.host");

Note that there is no name of the configuration file even mentioned, just the property name whose value we want. At runtime, the curation system resolves this call to a configuration file, and it uses the name the task has been configured as as the name of the config file. So, for example, if both were installed (in curate.cfg) as:

Code Block

org.dspace.ctask.general.ClamAv = vscan,
org.community.ctask.ConflictTask = virusscan,
....

...

The entire 'API' for task properties is:

Code Block
languagejava

public String taskProperty(String name);
public int taskIntProperty(String name, int defaultValue);
public long taskLongProperty(String name, long defaultValue);
public boolean taskBooleanProperty(String name, boolean default);

Another use of task properties is to support multiple task profiles. Suppose we have a task that we want to operate in one of two modes. A good example would be a mediafilter task that produces a thumbnail. We can either create one if it doesn't exist, or run with '-force' which will create one regardless. Suppose this behavior was controlled by a property in a config file. If we configured the task as 'thumbnail', then we would have in [dspace]/config/modules/thumbnail.cfg:

Code Block

...other properties...
thumbnail.maxheight = 80
thumbnail.maxwidth = 80
forceupdate=false

Then, following the pattern above, the thumbnail generating task code would look like:

Code Block
languagejava

if (taskBooleanProperty("forceupdate")) {
    // do something
}

But an obvious use-case would be to want to run force mode and non-force mode from the admin UI on different occasions. To do this, one would have to stop Tomcat, change the property value in the config file, and restart, etc However, we can use task properties to elegantly rescue us here. All we need to do is go into the config/modules directory, and create a new file called: thumbnail.force.cfg. In this file, we put only one property:

Code Block

forceupdate=true

Then we add a new task (really just a new name, no new code) in curate.cfg:

Code Block

org.dspace.ctask.general.ThumbnailTask = thumbnail,
org.dspace.ctask.general.ThumbnailTask = thumbnail.force

...

CS looks for, and will use, certain java annotations in the task Class definition that can help it invoke tasks more intelligently. An example may explain best. Since tasks operate on DSOs that can either be simple (Items) or containers (Collections, and Communities), there is a fundamental problem or ambiguity in how a task is invoked: if the DSO is a collection, should the CS invoke the task on each member of the collection, or does the task 'know' how to do that itself? The decision is made by looking for the @Distributive annotation: if present, CS assumes that the task will manage the details, otherwise CS will walk the collection, and invoke the task on each member. The java class would be defined:

Code Block
languagejava

@Distributive
public class MyTask implements CurationTask

A related issue concerns how non-distributive tasks report their status and results: the status will normally reflect only the last invocation of the task in the container, so important outcomes could be lost. If a task declares itself @Suspendable, however, the CS will cease processing when it encounters a FAIL status. When used in the UI, for example, this would mean that if our virus scan is running over a collection, it would stop and return status (and result) to the scene on the first infected item it encounters. You can even tune @Supendable tasks more precisely by annotating what invocations you want to suspend on. For example:

Code Block
languagejava

@Suspendable(invoked=Curator.Invoked.INTERACTIVE)
public class MyTask implements CurationTask

...

Support for scripted tasks does not include any DSpace pre-installation of the scripting language itself - this must be done according to the instructions provided by the language maintainers, and typically only requires a few additional jars on the DSpace classpath. Once one or more languages have been installed into the DSpace deployment, task support is fairly straightforward. One new property must be defined in [dspace]/config/modules/curate.cfg:

Code Block

script.dir = ${dspace.dir}/scripts

...

Script files may embed their descriptors to facilitate deployment. To accomplish this, a script must include the descriptor string with syntax:
$td=<descriptor> somewhere on a comment line. For example:

Code Block

# My descriptor $td=ruby|rubytask.rb|LinkChecker.new

...

Scripted tasks must implement a slightly different interface than the CurationTask interface used for Java tasks. The appropriate interface for scripting tasks is ScriptedTask and has the following methods:

Code Block
languagejava

public void init(Curator curator, String taskId) throws IOException;
public int performDso(DSpaceObject dso) throws IOException;
public int performId(Context ctx, String id) throws IOException;

...

The task with the taskname 'formatprofiler' (in the admin UI it is labeled "Profile Bitstream Formats") examines all the bitstreams in an item and produces a table ("profile") which is assigned to the result string. It is activated by default, and is configured to display in the administrative UI. The result string has the layout:

Code Block

10 (K) Portable Network Graphics
5  (S) Plain Text

where the left column is the count of bitstreams of the named format and the letter in parentheses is an abbreviation of the repository-assigned support level for that format:

Code Block

U  Unsupported
K  Known
S  Supported

...

  • Add the plugin to the comma separated list of curation tasks.
Code Block

### Task Class implementations
plugin.named.org.dspace.curate.CurationTask = \
org.dspace.ctask.general.ProfileFormats = profileformats, \
org.dspace.ctask.general.RequiredMetadata = requiredmetadata, \
org.dspace.ctask.general.ClamScan = vscan
  • Optionally, add the vscan friendly name to the configuration to enable it in the administrative it in the administrative user interface.
Code Block

ui.tasknames = \
profileformats = Profile Bitstream Formats, \
requiredmetadata = Check for Required Metadata, \
vscan = Scan for Viruses
  • In [dspace]/config/modules, edit configuration file clamav.cfg:
Code Block

service.host = 127.0.0.1
Change if not running on the same host as your DSpace installation.
service.port = 3310
Change if not using standard ClamAV port
socket.timeout = 120
Change if longer timeout needed
scan.failfast = false
Change only if items have large numbers of bitstreams
  • Finally, if desired virus scanning can be enabled as part of the submission process upload file step. In [dspace]/config/modules, edit configuration file submission-curation.cfg:
Code Block

virus-scan = true

Task Operation from the Administrative user interface

...

If desired virus scanning can be enabled as part of the submission process upload file step. In [dspace]/config/modules, edit configuration file submission-curation.cfg:

Code Block

virus-scan = true

Task Operation from the curation command line client

To output the results to the console:

Code Block

[dspace]/bin/dspace curate -t vscan -i <handle of container or item dso> -r -

Or capture the results in a file:

Code Block

[dspace]/bin/dspace curate -t vscan -i <handle of container or item dso> -r - > /<path...>/<name>

...

An example configuration file can be found in [dspace]/config/modules/translator.cfg.

Code Block

#---------------------------------------------------------------#
#----------TRANSLATOR CURATION TASK CONFIGURATIONS--------------#
#---------------------------------------------------------------#
# Configuration properties used solely by MicrosoftTranslator   #
# Curation Task (uses Microsoft Translation API v2)             #
#---------------------------------------------------------------#
## Translation field settings
##
## Authoritative language field
## This will be read to determine the original language an item was submitted in
## Default: dc.language

translate.field.language = dc.language

## Metadata fields you wish to have translated
#
translate.field.targets = dc.description.abstract, dc.title, dc.type

## Translation language settings
##
## If the language field configured in translate.field.language is not present
## in the record, set translate.language.default to a default source language
## or leave blank to use autodetection
#
translate.language.default = en

## Target languages for translation
#
translate.language.targets = de, fr

## Translation API settings
##
## Your Bing API v2 key and/or Google "Simple API Access" Key
## (note to Google users: your v1 API key will not work with Translate v2,
## you will need to visit https://code.google.com/apis/console and activate
## a Simple API Access key)
##
## You do not need to enter a key for both services.
#
translate.api.key.microsoft = YOUR_MICROSOFT_API_KEY_GOES_HERE
translate.api.key.google = YOUR_GOOGLE_API_KEY_GOES_HERE