Many core MediaFilter operations are not unique to institutional repositories. Text extraction, for example, is widely practiced by applications that retrieve web content and need to index it. DSpace can therefore build on existing work where appropriate. We are evaluating the Apache Tika framework in this light. Tika is part of the larger ecosystem that grew up around Lucene, Nutch, Hadoop, Solr, etc., and is concerned with content analysis and data extraction from documents. It has been integrated into Jackrabbit (the reference implementation of the Java Content Repository JSR), for example, and into other digital asset management systems. Adopting it could help address the 'High Code Maintenance' issue: the Tika community would shoulder the burden of keeping the underlying components current.

Even with the current Tika 1.0 release, we could greatly expand text extraction in MediaFilter: in addition to PDF, Word, HTML, and PowerPoint, there are Tika parsers for XML, OpenDocument, audio, video, EPUB, and many others.
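
To make the consolidation idea concrete, here is a minimal sketch of a single filter that delegates to one format-agnostic extractor instead of carrying per-format logic. It uses only the JDK (Java 9 or later) so it can be run without Tika on the classpath. All names in it (TextExtractor, DelegatingTextFilter, utf8StandIn) are hypothetical and are not part of DSpace or Tika. The stand-in extractor simply decodes UTF-8; a Tika-backed variant would presumably call the Tika facade's parseToString(InputStream) at the marked point, which we have not yet tried against the MediaFilter code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class DelegatingTextFilterSketch {

    /** Hypothetical seam: anything that can turn a bitstream into plain text. */
    interface TextExtractor {
        String extract(InputStream in) throws IOException;
    }

    /** Hypothetical filter holding one extractor rather than per-format parsing code. */
    static final class DelegatingTextFilter {
        private final TextExtractor extractor;
        private final int maxChars;

        DelegatingTextFilter(TextExtractor extractor, int maxChars) {
            this.extractor = extractor;
            this.maxChars = maxChars;
        }

        /** Extracts text from the source and truncates it to at most maxChars characters. */
        String filter(InputStream source) {
            try {
                String text = extractor.extract(source);
                return text.length() > maxChars ? text.substring(0, maxChars) : text;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /**
     * Stand-in extractor (assumption): decodes the stream as UTF-8.
     * A Tika-backed extractor would instead call new Tika().parseToString(in)
     * here and would also have to handle the checked TikaException.
     */
    static TextExtractor utf8StandIn() {
        return in -> new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        DelegatingTextFilter filter = new DelegatingTextFilter(utf8StandIn(), 5);
        InputStream source =
                new ByteArrayInputStream("hello world".getBytes(StandardCharsets.UTF_8));
        System.out.println(filter.filter(source));
    }
}
```

The point of the seam is that supporting a new format would then mean upgrading the extractor behind it, not writing and maintaining another filter class.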