<?xml version="1.0" encoding="utf-8"?>
<html>
<p>author: Grace Carpenter
date: August 2006</p>
<p>TechMDExtractor is a command-line tool for running Jhove on
the DSpace asset store. It determines whether each
bitstream is a valid and/or well-formed instance of the format it
purports to be. If an identifier is specified, processing is
limited to the given Community, Collection, or Item. If verbose
processing is specified, all of the extracted technical metadata is
sent to standard output.</p>
About the Design
<p>In order to make Jhove work with DSpace, I had to create two classes
that wrap two of the main Jhove classes. These classes,
org.dspace.app.techmdextractor.jhove.DSJhoveBase and
org.dspace.app.techmdextractor.jhove.DSConfigHandler, essentially
re-write the code in the corresponding Jhove classes
(edu.harvard.hul.ois.jhove.JhoveBase and
edu.harvard.hul.ois.jhove.ConfigHandler, respectively). DSJhoveBase
initializes the Jhove modules, and also provides the main
entry points for DSpace to parse bitstreams. DSConfigHandler has
code to parse the DSpace-specific elements of the Jhove
configuration file (jhove.conf).</p>
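<p>As a rough illustration of the kind of parsing described above, the
following sketch uses a plain SAX handler to map each module class to the
DSpace format name that follows it. It is not the actual DSConfigHandler
code. The class name FormatNameSketch, the sample configuration, and the
xmlns:dspace URI are assumptions made for the example; only the element
names are taken from this document.</p>

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; this is not the real DSConfigHandler.
class FormatNameSketch {

    /** Records module class -> DSpace format short description. */
    static class Handler extends DefaultHandler {
        final Map<String, String> formatByModuleClass = new LinkedHashMap<>();
        private final StringBuilder text = new StringBuilder();
        private boolean inModule;     // true while inside a <module> element
        private String currentClass;  // <class> value of the open <module>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("module".equals(qName)) {
                inModule = true;
            }
            text.setLength(0); // start collecting text for the new element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String value = text.toString().trim();
            if (inModule && "class".equals(qName)) {
                currentClass = value;
            } else if ("dspace:format-name".equals(qName) && currentClass != null) {
                formatByModuleClass.put(currentClass, value);
            } else if ("module".equals(qName)) {
                inModule = false;
                currentClass = null;
            }
            text.setLength(0);
        }
    }

    static Map<String, String> parse(String xml) {
        Handler handler = new Handler();
        try {
            // The default factory is not namespace-aware, so qName keeps the "dspace:" prefix.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse configuration", e);
        }
        return handler.formatByModuleClass;
    }

    public static void main(String[] args) {
        // Sample input; the namespace URI is made up for this example.
        String sample =
            "<jhoveConfig xmlns:dspace='http://example.org/dspace-jhove'>"
          + "<module><class>edu.harvard.hul.ois.jhove.module.PdfModule</class>"
          + "<dspace:format-name>Adobe PDF</dspace:format-name></module>"
          + "</jhoveConfig>";
        System.out.println(parse(sample));
    }
}
```

<p>The handler only pairs a format name with a class that appears inside
the same module element, so class elements elsewhere in the file are
ignored.</p>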
Configuring TechMDExtractor
<ol>
<li><p>Apply the dspace-preingest patch to your DSpace installation,
and follow the instructions for configuring it. (TechMDExtractor has
build-time dependencies on the Pre-ingest project because it implements
two of its interfaces:
org.dspace.workflow.PreIngestFilter and
org.dspace.workflow.FilterResult.)</p></li>
<li><p>Check out the TechMDExtractor project from CVS.</p></li>
<li><p>Modify the file
TechMDExtractor/config/jhove.conf
to
reflect the specifics of your DSpace installation. In particular,
the things that <strong>must</strong>
be modified are:</p>
<ul>
<li>the &lt;tempDirectory&gt; element must name a directory that the
Jhove executable has permission to write to</li>
<li>the &lt;dspace:format-name&gt; element that follows each
&lt;module&gt;/&lt;class&gt; element
must contain the short description of the format as it appears in your
bitstreamformatregistry table. The supplied
jhove.conf
contains the default short
descriptions for DSpace formats, so you don't have to worry about this
if you haven't edited the bitstreamformatregistry table.</li>
</ul>
</li>
<li><p>If you wish, configure logging for the non-DSpace-specific code
in Jhove by editing
TechMDExtractor/config/jhoveLogging.properties
.</p>
<p>Note that the Jhove code uses two different logging APIs:
java.util.logging for most of Jhove, and log4j for the DSpace-specific
initialization and top-level execution code. For debugging set-up problems,
you should be able to get most of the information you need from the
regular DSpace logs. To debug format-specific parsing issues,
modify the file
TechMDExtractor/config/jhoveLogging.properties
, which
is copied into your
<i>[dspace]</i>/config/
directory
at build time.</p>
</li>
<li><p>From the TechMDExtractor directory, type
ant install
. After
the build process has completed, verify that the following jars
are in your
<i>[dspace]</i>/lib
directory:</p>
<ul>
<li>
tmdExtractor.jar
</li>
<li>
jhove.jar
</li>
<li>
jhove-handler.jar
</li>
<li>
jhove-module.jar
</li>
</ul>
<p>Running
ant install
should also place the above jars
in your
<i>[dspace-source]</i>/lib
directory, for
use in the Workflow Pre-ingest step.</p>
<p>The files
TechMDExtractor/config/jhove.conf
and
TechMDExtractor/config/jhoveLogging.properties
should have been
copied into your
<i>[dspace]</i>/config
directory.</p>
</li>
<li><p>Don't forget that the dspace.cfg file in your
<i>[dspace]</i>/config
directory must be modified,
as specified in the Workflow Pre-ingest instructions.</p>
<p>Note that the Jhove initialization code
(in
org.dspace.app.techmdextractor.jhove.JhoveExtractor
) also
checks for the configuration variable
jhove.sax.class
. This
is because I always get errors when parsing the Jhove configuration
file, although they don't cause the code to fail. See the "Known Issues"
section of this document for more information.</p>
</li>
</ol>
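<p>The post-install checks in the steps above can be sketched as a small
stand-alone program. This is a hypothetical helper, not part of
TechMDExtractor: the class name InstallCheckSketch and its command-line
argument are assumptions, while the file names are the ones listed in the
steps above.</p>

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of TechMDExtractor.
class InstallCheckSketch {

    // Files that the steps above say "ant install" should leave under [dspace].
    static final String[] EXPECTED = {
        "lib/tmdExtractor.jar",
        "lib/jhove.jar",
        "lib/jhove-handler.jar",
        "lib/jhove-module.jar",
        "config/jhove.conf",
        "config/jhoveLogging.properties"
    };

    /** Returns the expected files that are absent under the given [dspace] directory. */
    static List<String> missingFiles(Path dspaceDir) {
        List<String> missing = new ArrayList<>();
        for (String relative : EXPECTED) {
            if (!Files.isRegularFile(dspaceDir.resolve(relative))) {
                missing.add(relative);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java InstallCheckSketch <dspace-dir>");
            System.exit(2);
        }
        List<String> missing = missingFiles(Paths.get(args[0]));
        if (missing.isEmpty()) {
            System.out.println("All expected files are present.");
        } else {
            missing.forEach(name -> System.out.println("Missing: " + name));
            System.exit(1);
        }
    }
}
```

<p>Pass the path of your <i>[dspace]</i> directory as the only argument.
The same check could be pointed at <i>[dspace-source]</i> for the jars,
although the two configuration files are not expected there.</p>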
Running TechMDExtractor
<p>From the
<i>[dspace]</i>/bin
directory, type</p>
dsrun org.dspace.app.techmdextractor.ExtractorManager -h
<p>You'll get a list of command-line options for running the program. Note
that the code for the TechMDExtractor borrows heavily from (OK, is stolen from)
the MediaFilter code, so many of the options are similar.</p>
Files Changed
<ul>
<li>config/dspace.cfg</li>
</ul>
Files Added
<p>The source code is available online in CVS at
http://libaxis1.mit.edu/viewcvs/sandbox/TechMDExtractor/</p>
<ul>
<li>config/jhove.conf</li>
<li>config/jhoveLogging.properties</li>
<li>src/org/dspace/app/techmdextractor/jhove/DSConfigHandler.java</li>
<li>src/org/dspace/app/techmdextractor/jhove/DSJhoveBase.java</li>
<li>src/org/dspace/app/techmdextractor/jhove/JhoveExtractor.java</li>
<li>src/org/dspace/app/techmdextractor/jhove/JhoveFilterResult.java</li>
<li>src/org/dspace/app/techmdextractor/jhove/JhovePreIngestFilter.java</li>
<li>src/org/dspace/app/techmdextractor/jhove/JhoveTechMD.java</li>
<li>src/org/dspace/app/techmdextractor/jhove/ExtractorManager.java</li>
<li>src/org/dspace/app/techmdextractor/jhove/TechMDExtractorException.java</li>
<li>build.xml</li>
</ul>
Known Issues
<ul><li><p>SAXParser problem: the SAX parser complains when it
parses the jhove.conf file. The messages I get are:</p>
<p>[Warning] jhove.conf:6:39: SchemaLocation: schemaLocation value = 'http://hul.harvard.edu/oi/xml/xsd/jhove/1.3/jhoveConfig.xsd' must have even number of URI's.</p>
<p>[Error] jhove.conf:6:39: cvc-elt.1: Cannot find the declaration of element 'jhoveConfig'</p>
<p>If you use Jhove 'out of the box', you won't receive these errors.
I believe that stand-alone Jhove uses the default Java SAX
parser (Crimson?), whereas DSpace uses Xerces. It seems that
the two parsers need to be configured differently. I
don't think the error messages are a problem for the
config file, but I'm not sure how
this affects the parsing of XML documents submitted to Jhove.
I started to experiment with this, and the TechMDExtractor
code checks the dspace.cfg file to see whether a
parser is specified (
jhove.sax.class=<i>sax parser name</i>
).
This needs further investigation.</p>
</li>
</ul>
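<p>One possible reading of the warning above: in XML Schema, an
xsi:schemaLocation attribute holds whitespace-separated pairs of namespace
URI and schema location, so a value with a single URI has an odd token
count. If no schema is loaded as a result, that would also account for the
cvc-elt.1 error that follows. This is an assumption about the cause, not a
confirmed diagnosis. The sketch below only demonstrates the pairing rule;
SchemaLocationSketch and the example.org namespace are hypothetical.</p>

```java
// Hypothetical helper for illustration; not part of Jhove or TechMDExtractor.
class SchemaLocationSketch {

    /**
     * Returns true when the value consists of whitespace-separated
     * (namespace URI, schema location) pairs, that is, a non-zero,
     * even number of tokens.
     */
    static boolean hasPairedUris(String schemaLocation) {
        String trimmed = schemaLocation.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        return trimmed.split("\\s+").length % 2 == 0;
    }

    public static void main(String[] args) {
        // The value quoted in the warning message: a single URI.
        String fromWarning = "http://hul.harvard.edu/oi/xml/xsd/jhove/1.3/jhoveConfig.xsd";
        System.out.println(hasPairedUris(fromWarning));

        // An illustrative pair; the namespace URI here is an assumption.
        System.out.println(hasPairedUris("http://example.org/ns " + fromWarning));
    }
}
```

<p>If this reading is right, the check fails for the value in the warning
and passes once a namespace URI precedes the schema location.</p>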
</html>