
...

Below is a listing of all currently available Media Filters, and what they actually do:

Name

Java Class

Function

Default input formats

Enabled by Default?

Text Extractor
(7.3 or above)
org.dspace.app.mediafilter.TikaTextExtractionFilter

As of 7.3, all text extraction for full-text indexing takes place in a single filter. This filter uses Apache Tika, which supports a wide variety of formats (e.g. Microsoft products, PDF, HTML, Text, etc.). Additional formats may be configured from the Tika supported formats list at https://tika.apache.org/2.3.0/formats.html

Adobe PDF,
Microsoft formats (Word, PPT, Excel), CSV, HTML, RTF, Text, OpenDocument formats (Text, Spreadsheet, Presentation)

yes

PDF Text Extractor
(7.2 or below)

org.dspace.app.mediafilter.PDFFilter

extracts the full text of Adobe PDF documents (only if text-based or OCRed) for full text indexing. (Uses the Apache PDFBox tool)

Adobe PDF

yes

HTML Text Extractor
(7.2 or below)

org.dspace.app.mediafilter.HTMLFilter

extracts the full text of HTML documents for full text indexing. (Uses Swing's HTML Parser)

HTML, Text

yes

Word Text Extractor
(7.2 or below)

org.dspace.app.mediafilter.PoiWordFilter

extracts the full text of Microsoft Word and Microsoft Word XML documents for full text indexing. (Uses the "Apache POI" tools.)

Microsoft Word, Microsoft Word XML

yes

Excel Text Extractor
(7.2 or below)

org.dspace.app.mediafilter.ExcelFilter

extracts the full text of Microsoft Excel documents for full text indexing. (Uses the "Apache POI" tools.)

Microsoft Excel, Microsoft Excel XML

yes

PowerPoint Text Extractor
(7.2 or below)

org.dspace.app.mediafilter.PowerPointFilter

extracts the full text of slides and notes in Microsoft PowerPoint and PowerPoint XML documents for full text indexing. (Uses the Apache POI tools.)

Microsoft Powerpoint, Microsoft Powerpoint XML

yes

PDFBox JPEG Thumbnail

org.dspace.app.mediafilter.PDFBoxThumbnail

creates thumbnail images of the first page of PDF files

Adobe PDF

yes

JPEG Thumbnail

org.dspace.app.mediafilter.JPEGFilter

creates thumbnail images of GIF, JPEG and PNG files

BMP, GIF, JPEG, image/png

yes

Branded Preview JPEG

org.dspace.app.mediafilter.BrandedPreviewJPEGFilter

creates a branded preview image for GIF, JPEG and PNG files

BMP, GIF, JPEG, image/png

no

ImageMagick Image Thumbnail Generator

org.dspace.app.mediafilter.ImageMagickImageThumbnailFilter

Uses ImageMagick to generate thumbnails for image bitstreams. Requires installation of ImageMagick on your server. See ImageMagick Media Filters.

BMP, GIF, image/png, JPG, TIFF, JPEG, JPEG 2000

no

ImageMagick PDF Thumbnail Generator

org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter

Uses ImageMagick and Ghostscript to generate thumbnails for PDF bitstreams. Requires installation of ImageMagick and Ghostscript on your server. See ImageMagick Media Filters.

Adobe PDF

no

Please note that the filter-media script will automatically update the DSpace search index by default.

...

The media filter plugin configuration filter.plugins in dspace.cfg contains a list of all enabled media/format filter plugins (see Configuring Media Filters for more information). By modifying the value of filter.plugins you can disable or enable MediaFilter plugins.  The filter.plugins setting can be set multiple times to enable multiple filters.  Each filter must be enabled via its name (see "Name" column in the table above).

Code Block
# Enable the default Text Extractor (for 7.3 or above)
filter.plugins = Text Extractor

# Enable the JPEG thumbnail creator
filter.plugins = JPEG Thumbnail

# Enable the PDF thumbnail creator
filter.plugins = PDFBox JPEG Thumbnail
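
The repeated filter.plugins lines above behave like a multi-valued setting: each occurrence adds one filter name to the enabled list. The following is a minimal Java sketch of that idea, not DSpace's own configuration loader. The class FilterPluginsSketch and its method are hypothetical names introduced for illustration, and the parsing is deliberately simplified (for example, it ignores any list-splitting rules a real configuration library applies).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: collects every value given for the
// repeated "filter.plugins" key, in the order the lines appear.
final class FilterPluginsSketch {

    static List<String> enabledFilters(String config) {
        List<String> names = new ArrayList<>();
        for (String line : config.split("\\R")) {
            String trimmed = line.trim();
            // Skip blank lines and comment lines
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            int eq = trimmed.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String key = trimmed.substring(0, eq).trim();
            if (key.equals("filter.plugins")) {
                // Each occurrence contributes one filter name
                names.add(trimmed.substring(eq + 1).trim());
            }
        }
        return names;
    }
}
```

Unlike java.util.Properties, which keeps only the last value for a repeated key, this sketch keeps all of them, which is the behavior the setting relies on.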


Executing (via Command Line)

...

  • Help : [dspace]/bin/dspace filter-media -h
    • Display help message describing all command-line options.
  • Force mode : [dspace]/bin/dspace filter-media -f
    • Apply filters to ALL bitstreams, even if they've already been filtered. If they've already been filtered, the previously filtered content is overwritten.
  • Identifier mode : [dspace]/bin/dspace filter-media -i 123456789/2
    • Restrict processing to the community, collection, or item named by the identifier - by default, all bitstreams of all items in the repository are processed. The identifier must be a Handle, not a DB key. This option may be combined with any other option.
  • Maximum mode : [dspace]/bin/dspace filter-media -m 1000
    • Suspend operation after the specified maximum number of items have been processed - by default, no limit exists. This option may be combined with any other option.
  • Plugin mode : [dspace]/bin/dspace filter-media -p "PDF Text Extractor","Word Text Extractor"
    • Apply ONLY the filter plugin(s) listed (separated by commas). By default all named filters listed in the filter.plugins field of dspace.cfg are applied. This option may be combined with any other option. WARNING: multiple plugin names must be separated by a comma (i.e. ',') and NOT a comma followed by a space (i.e. ', ').
  • Skip mode : [dspace]/bin/dspace filter-media -s 123456789/9,123456789/100
    • SKIP the listed identifiers (separated by commas) during processing. The identifiers must be Handles (not DB Keys). They may refer to items, collections or communities which should be skipped. This option may be combined with any other option. WARNING: multiple identifiers must be separated by a comma (i.e. ',') and NOT a comma followed by a space (i.e. ', ').
    • NOTE: If you have a large number of identifiers to skip, you may maintain this list, one identifier per line, within a separate file (e.g. filter-skiplist.txt). Use the following format to call the program. Please note that $( ) is shell command substitution; do not wrap it in single quotation marks.
      • [dspace]/bin/dspace filter-media -s $(paste -sd, - < filter-skiplist.txt)
  • Verbose mode : [dspace]/bin/dspace filter-media -v
    • Print all extracted text and other filter details to STDOUT.
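
If the skip list is assembled by a program rather than in the shell, the same comma-joined argument can be built in code. The sketch below is one way to do that in Java, under the assumption that the identifiers have already been read as one Handle per line. SkipListSketch and its methods are hypothetical names and are not part of DSpace.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: turns a one-Handle-per-line skip list into the
// single comma-separated value that the -s option expects.
final class SkipListSketch {

    static String toSkipArgument(List<String> lines) {
        return lines.stream()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                // Join with a bare comma: no space after the comma
                .collect(Collectors.joining(","));
    }

    // Builds an argument list for the filter-media script. The dspaceDir
    // parameter stands in for the [dspace] installation directory.
    static List<String> command(String dspaceDir, List<String> lines) {
        return List.of(dspaceDir + "/bin/dspace", "filter-media", "-s", toSkipArgument(lines));
    }
}
```

Trimming each line and dropping empty ones avoids the stray spaces and empty entries that the WARNING above cautions against.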

Creating Custom MediaFilters

Adding your own filters is done by creating a class which implements the org.dspace.app.mediafilter.FormatFilter interface. See the Creating a new Media/Format Filter topic and comments in the source file FormatFilter.java for more information. In theory, filters could be implemented in any programming language (C, Perl, etc.). However, they need to be invoked by the Java code in the Media Filter class that you create.
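
To illustrate the last point, here is a minimal sketch of how Java code might hand a bitstream's bytes to an external program and collect its output. It uses only the standard ProcessBuilder API and is not taken from the DSpace source. ExternalToolSketch and runExternal are hypothetical names, and a real filter would still need to implement the FormatFilter interface and wrap this call appropriately.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: feeds input bytes to an external command on its
// standard input and returns whatever the command writes to standard output.
final class ExternalToolSketch {

    static byte[] runExternal(List<String> command, byte[] input) {
        Path inputFile = null;
        try {
            // Stage the input in a temporary file so the child process can
            // read it without this thread having to pump stdin.
            inputFile = Files.createTempFile("filter-input", ".tmp");
            Files.write(inputFile, input);

            Process process = new ProcessBuilder(command)
                    .redirectInput(inputFile.toFile())
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();

            byte[] output = process.getInputStream().readAllBytes();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("External tool exited with code " + exitCode);
            }
            return output;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for external tool", e);
        } finally {
            if (inputFile != null) {
                try {
                    Files.deleteIfExists(inputFile);
                } catch (IOException ignored) {
                    // Best-effort cleanup of the temporary file
                }
            }
        }
    }
}
```

For example, runExternal(List.of("cat"), bytes) returns the input unchanged on a system where the cat command is available.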

...

Creating a simple Media Filter

...

Code Block
// Get the "outputFormat" configuration from dspace.cfg
String outputFormat = ConfigurationManager.getProperty(MediaFilterManager.FILTER_PREFIX + "." + MyComplexMediaFilter.class.getName() + "." + this.getPluginInstanceName() + ".outputFormat");
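
The lookup above concatenates four parts into one property key. As a rough sketch of the resulting key shape, the standalone Java snippet below rebuilds the same string without any DSpace classes. It assumes that MediaFilterManager.FILTER_PREFIX has the value "filter"; that value, the class FilterConfigKeySketch, and the instance name used in the example are illustrative assumptions, not confirmed by this page.

```java
// Hypothetical illustration of how the configuration key is assembled.
final class FilterConfigKeySketch {

    // Assumption: stands in for MediaFilterManager.FILTER_PREFIX
    static final String FILTER_PREFIX = "filter";

    static String outputFormatKey(Class<?> filterClass, String pluginInstanceName) {
        // <prefix>.<fully qualified class name>.<plugin instance name>.outputFormat
        return FILTER_PREFIX + "." + filterClass.getName() + "." + pluginInstanceName + ".outputFormat";
    }
}
```

Under those assumptions, calling outputFormatKey(String.class, "MyInstance") yields "filter.java.lang.String.MyInstance.outputFormat", so each named instance of a filter class reads its own setting.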


Configuration parameters

Property

textextractor.max-chars (only in 7.3 or above)

Example Value

textextractor.max-chars = 100000

Informational Note

By default, the "Text Extractor" only extracts the first 100,000 characters of text for full-text indexing. This setting allows you to increase or decrease that default. Set to -1 for no maximum. Keep in mind that larger values (or -1) are more likely to cause OutOfMemoryError failures when extracting text from very large files. In those scenarios, consider enabling "textextractor.use-temp-file" below to better control memory usage.

Property

textextractor.use-temp-file (only in 7.3 or above)

Example Value

textextractor.use-temp-file = false

Informational Note

By default, the "Text Extractor" performs all text extraction in memory (i.e. textextractor.use-temp-file=false). This ensures text extraction runs quickly, but it risks OutOfMemoryError failures if you either increase "textextractor.max-chars" or simply don't have much available memory on the server. In those scenarios, you can set "textextractor.use-temp-file=true" to tell the text extraction process to extract all text using a temporary file. This decreases the memory usage of the text extraction process, but it runs slightly slower.

Property

filter.org.dspace.app.mediafilter.publicPermission

Example Value

filter.org.dspace.app.mediafilter.publicPermission = JPEGFilter

Informational Note

By default, media filter derivatives and thumbnails inherit the permissions of the parent bitstream. You can override this if you want to make derivative or thumbnail content publicly accessible, typically the thumbnails of objects in the browse list. List the names of the MediaFilters whose output should receive publicly accessible permissions. Any media filters not listed will instead inherit the permissions of the parent bitstream.
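
The semantics of textextractor.max-chars (a positive cap, or -1 for no maximum) can be sketched in a few lines of Java. This is an illustration of the limit only, not DSpace's actual extraction code. MaxCharsSketch and readUpTo are hypothetical names.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical illustration: read at most maxChars characters from a
// reader, treating a negative maxChars as "no maximum".
final class MaxCharsSketch {

    static String readUpTo(Reader reader, int maxChars) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[8192];
        try {
            while (maxChars < 0 || text.length() < maxChars) {
                // Never request more characters than the remaining allowance
                int wanted = maxChars < 0
                        ? buffer.length
                        : Math.min(buffer.length, maxChars - text.length());
                int read = reader.read(buffer, 0, wanted);
                if (read == -1) {
                    break; // end of input
                }
                text.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

Because the whole result is held in a StringBuilder, this sketch also shows why a large or unlimited cap increases memory pressure when extraction happens in memory.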

...