Tesseract is an Optical Character Recognition (OCR) program that Islandora uses to extract text from images into files that can then be attached to an object as datastreams. It supports the hOCR standard, and when invoked, Islandora uses it to create both hOCR and raw OCR output. Tesseract supports multiple languages, and the languages you have installed are recognized by the Islandora OCR module. Tesseract is widely regarded as one of the most accurate open source OCR engines available. It reads binary, grey, or colour images and outputs text. A TIFF reader for uncompressed TIFF images is also included.
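As a rough illustration of the two outputs described above, the following Python sketch builds the command lines for a plain-text run and an hOCR run of the `tesseract` command-line tool. It is not how Islandora itself calls Tesseract. The helper names `build_ocr_commands` and `run_ocr`, the default language `eng`, and the file names in the usage note are assumptions made for the example.

```python
import subprocess


def build_ocr_commands(image, out_base, lang="eng"):
    """Return two tesseract command lines: plain text first, then hOCR.

    `image` is the input file, `out_base` is the output path without an
    extension, and `lang` is the code of an installed language pack.
    """
    image, out_base = str(image), str(out_base)
    # Default run: tesseract writes the recognized text to <out_base>.txt.
    plain = ["tesseract", image, out_base, "-l", lang]
    # Appending the stock "hocr" config name switches the output to hOCR.
    hocr = plain + ["hocr"]
    return plain, hocr


def run_ocr(image, out_base, lang="eng"):
    """Run both commands in turn; raises CalledProcessError on failure."""
    for command in build_ocr_commands(image, out_base, lang):
        subprocess.run(command, check=True)
```

Calling `run_ocr("page.tif", "page")` with a hypothetical `page.tif` should leave `page.txt` next to an hOCR file. The extension of the hOCR file depends on the Tesseract version, so check what your build produces.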
- Autotools (Make, etc.)
- Leptonica image processing library
For Linux installations: your distribution's package manager may well offer Tesseract in one of its repositories, but it is extremely unlikely to be the correct version. For the Islandora OCR module to create OCR derivatives, Tesseract 3.02.02 or higher is required. At the time of writing, this is the latest stable version. This means that you will likely have to compile it from source.
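Before deciding whether to compile from source, it can help to check which version, if any, is already installed. The sketch below runs `tesseract --version` and compares the result against the 3.02.02 minimum stated above. The function names are illustrative. The code also assumes the first line of the banner looks like `tesseract 3.02.02`, so both stdout and stderr are captured in case a given release prints it to either stream.

```python
import re
import subprocess

MINIMUM = (3, 2, 2)  # Tesseract 3.02.02, the minimum stated above


def parse_version(banner):
    """Extract the version from a `tesseract --version` banner as a tuple."""
    match = re.search(r"tesseract\s+v?(\d+(?:\.\d+)*)", banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version, minimum=MINIMUM):
    """True when a parsed version tuple is at least the required minimum."""
    return version is not None and version >= minimum


def installed_version(binary="tesseract"):
    """Return the installed version tuple, or None if it cannot be found."""
    try:
        result = subprocess.run(
            [binary, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except OSError:
        # The binary is missing or not executable.
        return None
    return parse_version(result.stdout + result.stderr)


if __name__ == "__main__":
    found = installed_version()
    if meets_minimum(found):
        print("Tesseract is recent enough:", found)
    else:
        print("Tesseract is missing or too old:", found)
```

If the check reports a missing or older version, the packaged build is not suitable and compiling from source is the likely next step.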