Comment: Remove XPDF section

...

  • Unknown
  • License
    Deleting a format will cause any existing bitstreams of this format to be reverted to the unknown bitstream format.

XPDF Filter

This is an alternative suite of MediaFilter plugins that offers faster and more reliable text extraction from PDF Bitstreams, as well as thumbnail image generation. It replaces the built-in default PDF MediaFilter.

If this filter is so much better, why isn't it the default? The answer is that it relies on external executable programs which must be obtained and installed for your server platform. This would add too much complexity to the installation process, so it is left out as an optional "extra" step.

Installation Overview

Here are the steps required to install and configure the filters:

  1. Install the xpdf tools for your platform, from the downloads at http://www.foolabs.com/xpdf
  2. Acquire the Sun Java Advanced Imaging Tools and create a local Maven package.
  3. Edit the DSpace configuration properties to add the locations of the xpdf executables and reconfigure the MediaFilter plugins.
  4. Build and install DSpace, adding -Pxpdf-mediafilter-support to the Maven invocation.

Install XPDF Tools

First, download the XPDF suite found at: http://www.foolabs.com/xpdf and install it on your server. The executables can be located anywhere, but make a note of the full path to each command.

You may be able to download a binary distribution for your platform, which simplifies installation. Xpdf is readily available for Linux, Solaris, MacOSX, Windows, NetBSD, HP-UX, AIX, and OpenVMS, and is reported to work on OS/2 and many other systems.

The only tools you really need are:

  • pdfinfo - displays properties and Info dict
  • pdftotext - extracts text from PDF
  • pdftoppm - images PDF for thumbnails
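Since the configuration step later needs the full path to each command, a small shell sketch such as the following can report where the three tools are installed. It assumes the tools are on your PATH; the function name locate_xpdf_tools is only illustrative and is not part of xpdf or DSpace.

```shell
#!/bin/sh
# Sketch: report the full path of each xpdf tool the filters need.
# Assumption: the tools are reachable through PATH. locate_xpdf_tools is a
# hypothetical helper name, not something shipped with xpdf or DSpace.
locate_xpdf_tools() {
    for tool in pdfinfo pdftotext pdftoppm; do
        if tool_path=$(command -v "$tool"); then
            printf '%s = %s\n' "$tool" "$tool_path"
        else
            printf '%s = NOT FOUND\n' "$tool"
        fi
    done
}

locate_xpdf_tools
```

Any tool reported as NOT FOUND is either not installed or lives in a directory that is not on PATH, in which case you need to note its location by hand.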

Fetch and install jai_imageio JAR

Fetch and install the Java Advanced Imaging Image I/O Tools.

For AIX, Sun support says the following: "JAI has native acceleration for the above but it also works in pure Java mode. So as long as you have an appropriate JDK for AIX (1.3 or later, I believe), you should be able to use it. You can download any of them, extract just the jars, and put those in your $CLASSPATH."

Download the jai_imageio library version 1.0_01 or 1.1 found at: https://jai-imageio.dev.java.net/binary-builds.html#Stable_builds .

For these filters you do NOT have to worry about the native code, just the JAR, so choose a download for any platform.

Code Block
curl -O http://download.java.net/media/jai-imageio/builds/release/1.1/jai_imageio-1_1-lib-linux-i586.tar.gz
tar xzf jai_imageio-1_1-lib-linux-i586.tar.gz

The preceding example leaves the JAR in jai_imageio-1_1/lib/jai_imageio.jar. Now install it in your local Maven repository, changing the path after -Dfile= if necessary, e.g.:

Code Block
mvn install:install-file                       \
          -Dfile=jai_imageio-1_1/lib/jai_imageio.jar  \
          -DgroupId=com.sun.media                     \
          -DartifactId=jai_imageio                    \
          -Dversion=1.0_01                            \
          -Dpackaging=jar                             \
          -DgeneratePom=true

You may also have to repeat this procedure for the jai_core.jar library if it is not available in any of the public Maven repositories. Once you have acquired the JAR, this command installs it locally:

Code Block
mvn install:install-file -Dfile=jai_core-1.1.2_01.jar  \
    -DgroupId=javax.media -DartifactId=jai_core -Dversion=1.1.2_01 -Dpackaging=jar -DgeneratePom=true
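To confirm that both installs landed where Maven will look for them, the sketch below computes the expected file location from the groupId, artifactId, and version used in the commands above. It assumes the standard local repository layout under $HOME/.m2/repository; the M2_REPO override variable and the function name local_artifact_path are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Sketch: print where a Maven local repository stores a given JAR.
# Assumptions: the default repository location ($HOME/.m2/repository) unless
# the hypothetical M2_REPO variable is set; local_artifact_path is an
# illustrative helper, not a Maven command.
local_artifact_path() {
    group_id=$1
    artifact_id=$2
    version=$3
    repo=${M2_REPO:-$HOME/.m2/repository}
    # Dots in the groupId become directory separators.
    group_dir=$(printf '%s' "$group_id" | tr '.' '/')
    printf '%s/%s/%s/%s/%s-%s.jar\n' \
        "$repo" "$group_dir" "$artifact_id" "$version" "$artifact_id" "$version"
}

# Check the two artifacts installed by the commands above.
for coords in "com.sun.media jai_imageio 1.0_01" "javax.media jai_core 1.1.2_01"; do
    # Deliberately unquoted so the three fields split into three arguments.
    jar=$(local_artifact_path $coords)
    if [ -f "$jar" ]; then
        echo "installed: $jar"
    else
        echo "not found: $jar"
    fi
done
```

If your settings.xml points the local repository somewhere else, adjust the repository root accordingly.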

Edit DSpace Configuration

First, be sure that thumbnail.maxwidth has a value and that it matches the size you want for preview images in the UI. NOTE: this code ignores thumbnail.maxheight, but it is best to set it too so that the other thumbnail filters make square images. For example:

Code Block
# maximum width and height of generated thumbnails
thumbnail.maxwidth = 80
thumbnail.maxheight = 80

Now, add the absolute paths to the XPDF tools you installed. In this example they are installed under /usr/local/bin (a logical place on Linux and MacOSX), but they may be anywhere.

Code Block
xpdf.path.pdftotext = /usr/local/bin/pdftotext
xpdf.path.pdftoppm = /usr/local/bin/pdftoppm
xpdf.path.pdfinfo = /usr/local/bin/pdfinfo
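A mistyped path here only shows up later, when the filters run, so it can be worth checking the three properties right after editing. The following is a minimal sketch, assuming each property sits on a single line in the form shown above; check_xpdf_paths and the dspace.cfg default location are illustrative names, not part of DSpace.

```shell
#!/bin/sh
# Sketch: verify that each xpdf.path.* property names an executable file.
# Assumptions: one "key = value" pair per line, as in the example above;
# check_xpdf_paths is a hypothetical helper, not a DSpace command.
check_xpdf_paths() {
    cfg=$1
    rc=0
    for key in pdftotext pdftoppm pdfinfo; do
        # Strip the key and "=" from the matching line, trim trailing blanks,
        # and keep the last definition if the key appears more than once.
        value=$(sed -n -e "s/^[[:space:]]*xpdf\.path\.$key[[:space:]]*=[[:space:]]*//p" "$cfg" |
                sed -e 's/[[:space:]]*$//' | tail -n 1)
        if [ -n "$value" ] && [ -x "$value" ]; then
            echo "ok: xpdf.path.$key = $value"
        else
            echo "problem: xpdf.path.$key = ${value:-<unset>}"
            rc=1
        fi
    done
    return $rc
}

cfg_file=dspace.cfg   # hypothetical location; point this at your own config
if [ -f "$cfg_file" ]; then
    check_xpdf_paths "$cfg_file" || echo "fix the paths reported above"
fi
```

The function returns a non-zero status when any path is unset or not executable, so it can also be used as a guard in a deployment script.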

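Change the MediaFilter plugin configuration to remove the old org.dspace.app.mediafilter.PDFFilter and add the new filters. In the example below, the new entries are PDF Thumbnail in filter.plugins and the XPDF2Text and XPDF2Thumbnail lines in the plugin map: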

Code Block
filter.plugins = \
        PDF Text Extractor, \
        PDF Thumbnail, \
        HTML Text Extractor, \
        Word Text Extractor, \
        JPEG Thumbnail

plugin.named.org.dspace.app.mediafilter.FormatFilter = \
        org.dspace.app.mediafilter.XPDF2Text = PDF Text Extractor, \
        org.dspace.app.mediafilter.XPDF2Thumbnail = PDF Thumbnail, \
        org.dspace.app.mediafilter.HTMLFilter = HTML Text Extractor, \
        org.dspace.app.mediafilter.WordFilter = Word Text Extractor, \
        org.dspace.app.mediafilter.JPEGFilter = JPEG Thumbnail, \
        org.dspace.app.mediafilter.BrandedPreviewJPEGFilter = Branded Preview JPEG

Then add the input format configuration properties for each of the new filters, e.g.:

Code Block
filter.org.dspace.app.mediafilter.XPDF2Thumbnail.inputFormats = Adobe PDF
filter.org.dspace.app.mediafilter.XPDF2Text.inputFormats = Adobe PDF

Finally, if you want PDF thumbnail images, don't forget to add that filter name to the filter.plugins property, e.g.:

Code Block
filter.plugins = PDF Thumbnail, PDF Text Extractor, ...
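Because a filter name listed in filter.plugins must match a name defined in the plugin.named.org.dspace.app.mediafilter.FormatFilter map, a quick cross-check can catch typos before you rebuild. The sketch below is one possible way to do that with standard tools; it assumes backslash line continuations and comma-separated values as in the examples above, and all function names and the dspace.cfg default are illustrative.

```shell
#!/bin/sh
# Sketch: list filter names enabled in filter.plugins that have no entry in
# the FormatFilter plugin map. All helper names here are hypothetical.

# Join backslash-continued lines into single logical lines.
join_continuations() {
    awk '{
        if (sub(/\\[ \t]*$/, "")) { buf = buf $0 }
        else { print buf $0; buf = "" }
    } END { if (buf != "") print buf }' "$1"
}

# Print the names enabled in filter.plugins, one per line.
enabled_filters() {
    join_continuations "$1" |
        sed -n 's/^[[:space:]]*filter\.plugins[[:space:]]*=//p' |
        tr ',' '\n' |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Print the names defined in the FormatFilter map ("class = name"), one per line.
defined_filters() {
    join_continuations "$1" |
        sed -n 's/^[[:space:]]*plugin\.named\.org\.dspace\.app\.mediafilter\.FormatFilter[[:space:]]*=//p' |
        tr ',' '\n' |
        sed -e 's/^[^=]*=//' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Print every enabled name with no definition (no output means all are defined).
undefined_filters() {
    enabled_filters "$1" | while IFS= read -r name; do
        defined_filters "$1" | grep -Fxq "$name" || echo "$name"
    done
}

cfg_file=dspace.cfg   # hypothetical location; point this at your own config
if [ -f "$cfg_file" ]; then
    undefined_filters "$cfg_file"
fi
```

A name that is defined but not enabled, such as Branded Preview JPEG in the example above, is not reported, since leaving a defined filter disabled is a valid configuration.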

Build and Install

Follow your usual DSpace installation/update procedure, but add -Pxpdf-mediafilter-support to the Maven invocation:

Code Block
mvn -Pxpdf-mediafilter-support package
ant -Dconfig=[dspace]/config/dspace.cfg update

Configuring Usage Instrumentation Plugins

...