...

The goal of the curation system ("CS") is to provide a simple, extensible way to manage routine content operations on a repository. These operations are known to CS as "tasks", and they can operate on any DSpaceObject (i.e. any subclass of DSpaceObject), which means the entire Site, Communities, Collections, and Items: the core data model objects. Tasks may elect to work on only one type of DSpace object, typically an Item, and in this case they may simply ignore other data types (tasks have the ability to "skip" objects for any reason). The DSpace core distribution will provide a number of useful tasks, but the system is designed to encourage local extension: tasks can be written for any purpose and placed in any Java package. This gives DSpace sites the ability to customize the behavior of their repository without having to alter, and therefore manage synchronization with, the DSpace source code. What sorts of activities are appropriate for tasks?

...

For each activated task, a key-value pair is added. The key is the fully qualified class name and the value is the task name used elsewhere to configure the use of the task, as will be seen below. Note that the curate.cfg configuration file, while in the config directory, is located under "modules". The intent is that tasks, as well as any configuration they require, will be optional "add-ons" to the basic system configuration. Adding or removing tasks has no impact on dspace.cfg.
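As a rough illustration of how such key-value pairs could be consumed, the sketch below parses a comma-separated list of `class = taskname` entries into a lookup from task name to class name. The class `TaskNameParser` and its method are hypothetical and are not part of DSpace; the actual loading is handled by the DSpace configuration machinery.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, NOT a DSpace class: parses "class = taskname" entries.
final class TaskNameParser {
    // Returns a map from logical task name to fully qualified class name.
    static Map<String, String> parse(String value) {
        Map<String, String> byName = new LinkedHashMap<>();
        for (String entry : value.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray or trailing commas
            }
            String[] parts = trimmed.split("=", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected 'class = taskname': " + trimmed);
            }
            // Key is the class name, value is the task name; index by task name.
            byName.put(parts[1].trim(), parts[0].trim());
        }
        return byName;
    }
}
```

Indexing by task name reflects how the name is used elsewhere in the configuration: other settings refer to the task by its short name, never by its class.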

...

Second, it must implement the interface "org.dspace.curate.CurationTask".

The CurationTask interface is almost a "tagging" interface, and only requires that a few very high-level methods be implemented. The most significant is:

Code Block
languagejava
int perform(DSpaceObject dso);

...

Tasks are invoked using CS framework classes that manage a few details (to be described below), and this invocation can occur wherever needed, but CS offers great versatility "out of the box":

On the command line

A simple tool, "CurationCli", provides access to CS via the command line. This tool bears the name "curate" in the DSpace launcher. For example, to perform a virus check on collection "4":

Code Block
[dspace]/bin/dspace curate -t vscan -i 123456789/4

...

In the XMLUI, there are several ways to execute configured Curation Tasks:

  1. From the "Curate" tab that appears on each "Edit Community/Collection/Item" page: this tab allows an Administrator, Community Administrator or Collection Administrator to run a Curation Task on that particular Community, Collection or Item. When running a task on a Community or Collection, that task will also execute on all its child objects, unless the Task itself states otherwise (e.g. running a task on a Collection will also run it across all Items within that Collection).
    • NOTE: Community Administrators and Collection Administrators can only run Curation Tasks on the Community or Collection which they administer, along with any child objects of that Community or Collection. For example, a Collection Administrator can run a task on that specific Collection, or on any of the Items within that Collection.
  2. From the Administrator's "Curation Tasks" page: This option is only available to DSpace Administrators, and appears in the Administrative side-menu. This page allows an Administrator to run a Curation Task across a single object, or all objects within the entire DSpace site.
    • In order to run a task from this interface, you must enter the handle of the DSpace object. To run a task site-wide, you can use the handle: [your-handle-prefix]/0

...

When a task is selected from the drop-down list and performed, the tab displays both a phrase interpreting the "status code" of the task execution, and the "result" message if any has been defined. When the task has been queued, an acknowledgement appears instead. You may configure the words used for status codes in curate.cfg (for clarity, language localization, etc.):

...

The configuration of groups follows the same simple pattern as tasks, using properties in [dspace]/config/modules/curate.cfg. The group is assigned a simple logical name, but also a localizable name that appears in the UI. For example:

Code Block
# ui.taskgroups contains the list of defined groups, together with a pretty name for UI display
ui.taskgroups = \
  replication = Backup and Restoration Tasks, \
  integrity = Metadata Integrity Tasks, \
  .....
# each group membership list is a separate property, whose value is comma-separated list of logical task names
ui.taskgroup.integrity = profileformats, requiredmetadata
....

...

This attribute (which must always follow the "name" attribute in the flowstep element) will cause all tasks associated with the step to be placed on the queue named "workflow" (or any queue you wish to use, of course), and further has the effect of suspending the workflow. When the queue is emptied (meaning all tasks in it have been performed), the workflow is restarted. Each workflow step may be separately configured,

...

would do approximately what the command line invocation did. The method "curate" just performs all the tasks configured (you can add multiple tasks to a curator).

...

use the command-line tool, but we could also read the queue programmatically. Any number of queues can be defined and used as needed.
In the administrative UI curation "widget", there is the ability both to perform a task and to place it on a queue for later processing.

...

Few assumptions are made by CS about what the 'outcome' of a task may be (if any): it could, for example, produce a report to a temporary file, or it could modify DSpace content silently. But the CS runtime does provide a few pieces of information whenever a task is performed:

...

As mentioned above, a status code is returned to CS whenever a task is called. The complete list of values:

Code Block
  -3 NOTASK  - CS could not find the requested task
  -2 UNSET   - task did not return a status code because it has not yet run
  -1 ERROR   - task could not be performed
   0 SUCCESS - task performed successfully
   1 FAIL    - task performed, but failed
   2 SKIP    - task not performed due to object not being eligible
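
Calling code often wants to turn a numeric status into something readable. The sketch below mirrors the values listed above in a small enum with a reverse lookup. The enum `CurationStatus` is a hypothetical illustration and is not a DSpace class; DSpace itself reports the status as a plain `int`.

```java
// Hypothetical enum mirroring the status codes listed above; NOT a DSpace class.
enum CurationStatus {
    NOTASK(-3), UNSET(-2), ERROR(-1), SUCCESS(0), FAIL(1), SKIP(2);

    final int code;

    CurationStatus(int code) {
        this.code = code;
    }

    // Maps a numeric status code back to its symbolic name.
    static CurationStatus fromCode(int code) {
        for (CurationStatus status : values()) {
            if (status.code == code) {
                return status;
            }
        }
        throw new IllegalArgumentException("Unknown status code: " + code);
    }
}
```

For example, `CurationStatus.fromCode(2)` yields `SKIP`, which a caller could then map to a localized phrase.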

...

The task may define a string indicating details of the outcome. This result is displayed in the "curation widget" described above:

Code Block
"Virus 12312 detected on Bitstream 4 of 1234567789/3"

...

Code Block
languagejava
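// Run the "vscan" task against a collection, then read back its outcome
Curator curator = new Curator();
curator.addTask("vscan").curate(coll);
// Numeric status code, as listed above
int status = curator.getStatus("vscan");
// Result string set by the task, if any
String result = curator.getResult("vscan");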

...

DSpace 1.8 introduces a new "idiom" for tasks that require configuration data. It is available to any task whose implementation extends AbstractCurationTask, but is completely optional. There are a number of problems that task properties are designed to solve, but to make the discussion concrete we will start with a particular one: the problem of hard-coded configuration file names. A task that relies on configuration data will typically encode a fixed reference to a configuration file name. For example, the virus scan task reads a file called "clamav.cfg", which lives in [dspace]/config/modules. And thus in the implementation one would find:

...

Code Block
org.dspace.ctask.general.ClamAv = vscan,
org.community.ctask.ConflictTask = virusscan,
....

then "taskProperty()" will resolve to [dspace]/config/modules/vscan.cfg when called from the ClamAv task, but to [dspace]/config/modules/virusscan.cfg when called from ConflictTask's code. Note that 'vscan' etc. are locally assigned names, so we can always prevent the 'collisions' mentioned, and we make the tasks much more portable, since we remove the 'hard-coding' of config names.
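
The name-to-file mapping described here can be sketched in a few lines. The helper `TaskConfigPaths` is hypothetical and only illustrates the convention that a task name maps to a file of the same name under config/modules; it is not how DSpace implements the lookup internally.

```java
import java.nio.file.Path;

// Hypothetical helper, NOT a DSpace class: maps a task name to its config file.
final class TaskConfigPaths {
    // dspaceDir stands in for the [dspace] installation directory.
    static Path configFileFor(Path dspaceDir, String taskName) {
        return dspaceDir.resolve("config").resolve("modules").resolve(taskName + ".cfg");
    }
}
```

Because the file name is derived from the locally assigned task name, two tasks only share a configuration file if a site gives them the same name.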

...

Another use of task properties is to support multiple task profiles. Suppose we have a task that we want to operate in one of two modes. A good example would be a mediafilter task that produces a thumbnail. We can either create one if it doesn't exist, or run with '-force' which will create one regardless. Suppose this behavior was controlled by a property in a config file. If we configured the task as "thumbnail", then we would have in [dspace]/config/modules/thumbnail.cfg:

...

Consider what happens: when we perform the task "thumbnail" (using taskProperties), it reads the config file thumbnail.cfg and operates in the "non-force" profile (since the value is false), but when we run the task "thumbnail.force" the curation system first reads thumbnail.cfg, then reads thumbnail.force.cfg, which overrides the value of the "forceupdate" property. Notice that we did all this via local configuration - we have not had to touch the source code at all to obtain as many "profiles" as we would like.
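
The override behavior can be modeled with plain `java.util.Properties`: loading a second file into the same object replaces any keys it shares with the first. The sketch below uses in-memory strings in place of thumbnail.cfg and thumbnail.force.cfg; the class `TaskProfiles` is hypothetical and is not DSpace's own loading code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, NOT a DSpace class: layers config texts in order.
final class TaskProfiles {
    // Later texts override keys defined by earlier ones.
    static Properties layered(String... configTexts) {
        Properties merged = new Properties();
        for (String text : configTexts) {
            try {
                merged.load(new StringReader(text));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return merged;
    }
}
```

Under these assumptions, `TaskProfiles.layered("forceupdate = false", "forceupdate = true")` gives a property set in which `forceupdate` is `true`, while loading only the first text leaves it `false`.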

Task Annotations

...

This descriptor means that a "ruby" script engine will be created, a script file named "rubytask.rb" in the directory <script.dir> will be loaded, and the resolver will expect that evaluating "LinkChecker.new" provides a correct implementation object. Note that the task must be configured in all other ways just like Java tasks (in ui.tasknames, ui.taskgroups, etc.).

...

For reasons of portability, the <relFilePath> component may be omitted in this context. Thus, "$td=ruby||LinkChecker.new" will be expanded to a descriptor with the name of the embedding file.
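
The expansion can be pictured as follows. This sketch assumes the descriptor has three `|`-separated components (engine, relative file path, implementation expression) introduced by a `$td=` prefix, as the example suggests; the class `DescriptorExpander` and the exact shape of its output are assumptions made for illustration, not the resolver's actual code.

```java
// Hypothetical helper, NOT a DSpace class: fills in an omitted file component.
final class DescriptorExpander {
    private static final String PREFIX = "$td=";

    // Returns "engine|file|impl", using embeddingFileName when the file part is empty.
    static String expand(String descriptor, String embeddingFileName) {
        if (!descriptor.startsWith(PREFIX)) {
            throw new IllegalArgumentException("Descriptor must start with " + PREFIX);
        }
        // Limit -1 keeps the empty middle component instead of dropping it.
        String[] parts = descriptor.substring(PREFIX.length()).split("\\|", -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected engine|file|impl: " + descriptor);
        }
        String file = parts[1].isEmpty() ? embeddingFileName : parts[1];
        return parts[0] + "|" + file + "|" + parts[2];
    }
}
```

A descriptor that already names its file passes through with that name unchanged, so the same routine handles both the full and the abbreviated form.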

...

The 'requiredmetadata' task examines item metadata and determines whether the fields that the web submission configuration (input-forms.xml) marks as required are present. It sets the result string to indicate either that all required fields are present, or to list the metadata elements that are required but missing. When the task is performed on an item, it will display the result for that item. When performed on a collection or community, the task will be performed on each item, and will display the last item result. If all items in the community or collection have all required fields, that will be the result for the last item in the collection. If the task fails for any item (i.e. the item does not have all required fields), the process is halted. This way the results for the 'failed' items are not lost.
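
The core check amounts to a set difference between the required fields and the fields an item actually has. The sketch below shows that logic in isolation; the class `RequiredFieldCheck`, the result wording, and the field names in the usage note are hypothetical, and the real task reads both lists from DSpace rather than taking them as arguments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, NOT the DSpace task: reports required-but-missing fields.
final class RequiredFieldCheck {
    // Returns the required fields that are absent, in the order they were required.
    static List<String> missing(List<String> required, Set<String> present) {
        List<String> absent = new ArrayList<>();
        for (String field : required) {
            if (!present.contains(field)) {
                absent.add(field);
            }
        }
        return absent;
    }

    // Builds a result string in the spirit of the description above (wording is illustrative).
    static String result(List<String> required, Set<String> present) {
        List<String> absent = missing(required, present);
        return absent.isEmpty()
                ? "All required fields present"
                : "Missing required fields: " + String.join(", ", absent);
    }
}
```

For an item that has only `dc.title` when `dc.title` and `dc.date.issued` are required, `missing` returns a one-element list containing `dc.date.issued`.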

Virus Scan

The "vscan" task performs a virus scan on the bitstreams of items using the ClamAV software product.
Clam AntiVirus is an open source (GPL) anti-virus toolkit for UNIX. A port for Windows is also available. The virus scanning curation task interacts with the ClamAV virus scanning service to scan the bitstreams contained in items, reporting on any infections. Like other curation tasks, it can be run against a container or item, in the GUI or from the command line. ClamAV should be installed according to the documentation at http://www.clamav.net. It should not be installed in the DSpace installation directory. You may install it on the same machine as your DSpace installation, or on another machine which has been configured properly.

...