...
Below is a listing of all currently available Media Filters, and what they actually do:
Name | Java Class | Function | Default input formats | Enabled by Default?
---|---|---|---|---
PDF Text Extractor | | extracts the full text of Adobe PDF documents (only if text-based or OCRed) for full text indexing. (Uses the Apache PDFBox tool.) | Adobe PDF | yes
HTML Text Extractor | | extracts the full text of HTML documents for full text indexing. (Uses Swing's HTML Parser.) | HTML, Text | yes
Word Text Extractor | org.dspace.app.mediafilter.WordFilter | extracts the full text of Microsoft Word or Plain Text documents for full text indexing. (Uses the "Microsoft Word Text Mining" tools.) See also PoiWordFilter, below. | Microsoft Word | yes
Word Text Extractor | | extracts the full text of Microsoft Word and Microsoft Word XML documents for full text indexing. (Uses the "Apache POI" tools.) Disabled by default. Uncomment PoiWordFilter and comment WordFilter in dspace.cfg if you wish to use this one. | Microsoft Word, Microsoft Word XML | no
Excel Text Extractor | org.dspace.app.mediafilter.ExcelFilter | extracts the full text of Microsoft Excel documents for full text indexing. (Uses the "Apache POI" tools.) | Microsoft Excel, Microsoft Excel XML | yes
PowerPoint Text Extractor | | extracts the full text of slides and notes in Microsoft PowerPoint and PowerPoint XML documents for full text indexing. (Uses the Apache POI tools.) | Microsoft Powerpoint, Microsoft Powerpoint XML | yes
PDFBox JPEG Thumbnail | org.dspace.app.mediafilter.PDFBoxThumbnail | creates thumbnail images of the first page of PDF files | Adobe PDF | yes
JPEG Thumbnail | | creates thumbnail images of GIF, JPEG and PNG files | BMP, GIF, JPEG, image/png | yes
Branded Preview JPEG | | creates a branded preview image for GIF, JPEG and PNG files | BMP, GIF, JPEG, image/png | no
ImageMagick Image Thumbnail Generator | | Uses ImageMagick to generate thumbnails for image bitstreams. Requires installation of ImageMagick on your server. See ImageMagick Media Filters. | BMP, GIF, image/png, JPG, TIFF, JPEG, JPEG 2000 | no
ImageMagick PDF Thumbnail Generator | org.dspace.app.mediafilter.ImageMagickPdfThumbnailFilter | Uses ImageMagick and Ghostscript to generate thumbnails for PDF bitstreams. Requires installation of ImageMagick and Ghostscript on your server. See ImageMagick Media Filters. | Adobe PDF | no
Please note that the filter-media script will automatically update the DSpace search index by default (see Legacy methods for re-indexing content). This is the recommended way to run the script. Should you wish to disable the index update, you can pass the -n flag (see Executing (via Command Line) below).
Enabling/Disabling MediaFilters
...
- Help :
[dspace]/bin/dspace filter-media -h
- Display help message describing all command-line options.
- Force mode :
[dspace]/bin/dspace filter-media -f
- Apply filters to ALL bitstreams, even if they've already been filtered. If they've already been filtered, the previously filtered content is overwritten.
- Identifier mode :
[dspace]/bin/dspace filter-media -i 123456789/2
- Restrict processing to the community, collection, or item named by the identifier - by default, all bitstreams of all items in the repository are processed. The identifier must be a Handle, not a DB key. This option may be combined with any other option.
- Maximum mode :
[dspace]/bin/dspace filter-media -m 1000
- Suspend operation after the specified maximum number of items have been processed - by default, no limit exists. This option may be combined with any other option.
- No-index mode :
[dspace]/bin/dspace filter-media -n
- Suppress index creation - by default, a new search index is created for full-text searching. This option suppresses index creation if you intend to run index-update elsewhere.
- Plugin mode :
[dspace]/bin/dspace filter-media -p "PDF Text Extractor","Word Text Extractor"
- Apply ONLY the filter plugin(s) listed (separated by commas). By default all named filters listed in the filter.plugins field of dspace.cfg are applied. This option may be combined with any other option. WARNING: multiple plugin names must be separated by a comma (i.e. ',') and NOT a comma followed by a space (i.e. ', ').
- Skip mode :
[dspace]/bin/dspace filter-media -s 123456789/9,123456789/100
- SKIP the listed identifiers (separated by commas) during processing. The identifiers must be Handles (not DB Keys). They may refer to items, collections or communities which should be skipped. This option may be combined with any other option. WARNING: multiple identifiers must be separated by a comma (i.e. ',') and NOT a comma followed by a space (i.e. ', ').
- NOTE: If you have a large number of identifiers to skip, you may maintain this comma-separated list within a separate file (e.g. filter-skiplist.txt) and call the program in the following format. Note that the command uses the grave accent or backtick (`), not the single quotation mark (').
[dspace]/bin/dspace filter-media -s `less filter-skiplist.txt`
- Verbose mode :
[dspace]/bin/dspace filter-media -v
- Verbose mode - print all extracted text and other filter details to STDOUT.
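The -s option above expects Handles separated by commas with no spaces, which is easy to get wrong in a hand-edited file. As a minimal sketch, assuming the skip list is instead kept with one Handle per line, the helper below joins those lines into the required form. The function names and the DSPACE_DIR variable are hypothetical and not part of DSpace; the wrapper only prints the command rather than running it.

```shell
#!/bin/sh
# Join the non-blank lines of a file into a single comma-separated string
# with no spaces after the commas, the form that filter-media -s expects.
join_with_commas() {
    # grep -v drops blank lines; paste -s joins what is left using ",".
    grep -v '^[[:space:]]*$' "$1" | paste -sd, -
}

# Hypothetical wrapper: print (do not run) a filter-media command that skips
# every Handle listed, one per line, in the given file.
# DSPACE_DIR is an assumed variable standing in for [dspace].
print_skip_command() {
    skiplist=$(join_with_commas "$1")
    printf '%s/bin/dspace filter-media -s %s\n' "${DSPACE_DIR:-[dspace]}" "$skiplist"
}
```

Calling print_skip_command filter-skiplist.txt shows the full command line, so the list can be checked before the command is actually run.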
Adding your own filters is done by creating a class which implements the org.dspace.app.mediafilter.FormatFilter interface. See the Creating a new Media/Format Filter topic and the comments in the source file FormatFilter.java for more information. In theory, filters could be implemented in any programming language (C, Perl, etc.). However, they need to be invoked by the Java code in the Media Filter class that you create.
...
Property | filter.org.dspace.app.mediafilter.publicPermission |
---|---|
Example Value | filter.org.dspace.app.mediafilter.publicPermission = JPEGFilter, XPDF2Thumbnail |
Informational Note | By default, mediafilter derivatives / thumbnails inherit the permissions of the parent bitstream. You can override this if you want derivative / thumbnail content to be publicly accessible, typically the thumbnails of objects shown in the browse list. List the names of the MediaFilters whose output should receive publicly accessible permissions. Any media filters not listed will instead inherit the permissions of the parent bitstream. |
...