Skip to main content

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

All information submitted is secure.

  • Close [x]

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

  • Close [x]

Understanding information content with Apache Tika

Oleg Tikhonov, Software Engineer, IBM  
Photo of Oleg Tikhonov
Originally from Kazakhstan, Oleg graduated from Ariel University Center of Samaria in Israel. He has worked in the software industry for more than 5 years. At IBM, Oleg develops automated software test solutions. In his spare time, when not exploring the details of new J2EE technologies and how to stretch their capabilities, he likes diving with friends and spending time with his beautiful wife, Vera.
Chris Mattmann (chris.a.mattmann@jpl.nasa.gov), Senior Computer Scientist, Professor, NASA JPL/USC
Chris Mattmann
Chris Mattmann is a senior computer scientist at NASA's Jet Propulsion Laboratory, working on instrument and science data systems on Earth science missions and informatics tasks. Mattmann has a joint appointment as an adjunct assistant professor in the University of Southern California (USC) Computer Science Department, where he teaches a course on software architecture to local and remote graduate students over USC's distance education network. His research interests are primarily software architecture and large-scale data-intensive systems. He received his Ph.D. in computer science from USC. He is a member of the IEEE.

Summary:  With the increasingly widespread use of computers and the pervasiveness the modern Internet has attained, huge amounts of information in many languages are becoming available. Automatic information processing and retrieval is urgently needed to understand content across cultures, languages, and continents. A recent Apache software project, Tika, is becoming an important tool toward realizing content understanding.

Date:  15 Jun 2010
Level:  Intermediate PDF:  A4 and Letter (60 KB | 19 pages)Get Adobe® Reader®

Activity:  52755 views
Comments:  

Introduction

In this tutorial, we introduce the Apache Tika framework and explain its concepts (e.g., N-gram, parsing, mime detection, and content analysis) via illustrative examples that should be applicable to not only seasoned software developers but to beginners to content analysis and programming as well. We assume you have a working knowledge of the Java™ programming language and plenty of content to analyze.

Throughout this tutorial, you will learn:

  • Apache Tika's API, most relevant modules, and related functions
  • Apache Nutch (one of the progenitors of Tika) and its NgramProfiler and LanguageIdentifier classes, which have recently been ported to Tika
  • cpdetector, the code page detector project, and its functionality

What is Apache Tika?

As Apache Tika's site suggests, Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

The parser interface

The org.apache.tika.parser.Parser interface is the key component of Apache Tika. It hides the complexity of different file formats and parsing libraries while providing a simple and powerful mechanism for client applications to extract structured text content and metadata from all sorts of documents. All this is achieved with a single method:

void parse(InputStream stream, ContentHandler handler, Metadata metadata)
    throws IOException, SAXException, TikaException;
                

The parse method takes the document to be parsed and related metadata as input, and outputs the results as XHTML SAX events and extra metadata. The main criteria that led to this design are shown in Table 1.


Table 1. Criteria for Tika parsing design
Criterion Explanation
Streamed parsing The interface should require neither the client application nor the parser implementation to keep the full document content in memory or spooled to disk. This allows even huge documents to be parsed without excessive resource requirements.
Structured content A parser implementation should be able to include structural information (headings, links, etc.) in the extracted content. A client application can use this information, for example, to better judge the relevance of different parts of the parsed document.
Input metadata A client application should be able to include metadata like the file name or declared content type with the document to be parsed. The parser implementation can use this information to better guide the parsing process.
Output metadata A parser implementation should be able to return document metadata in addition to document content. Many document formats contain metadata, such as the name of the author, that may be useful to client applications.

These criteria are reflected in the arguments of the parse method.

Document InputStream

The first argument is an InputStream for reading the document to be parsed.

If this document stream cannot be read, parsing stops and the thrown IOException is passed up to the client application. If the stream can be read but not parsed (if the document is corrupted, for example), the parser throws a TikaException.

The parser implementation will consume this stream, but will not close it. Closing the stream is the responsibility of the client application that opened it initially. Listing 1 shows the recommended pattern for using streams with the parse method.


Listing 1. Recommended pattern for using streams with the parse method
InputStream stream = ...;      // open the stream
try {
    parser.parse(stream, ...); // parse the stream
} finally {
    stream.close();            // close the stream
}
                

XHTML SAX events

The parsed content of the document stream is returned to the client application as a sequence of XHTML SAX events. XHTML is used to express structured content of the document, and SAX events enable streamed processing. Note that the XHTML format is used here only to convey structural information, not to render the documents for browsing.

The XHTML SAX events produced by the parser implementation are sent to a ContentHandler instance given to the parse method. If the content handler fails to process an event, parsing stops and the thrown SAXException is passed up to the client application.

Listing 2 shows the overall structure of the generated event stream (with indenting added for clarity).


Listing 2. Structure of the generated event stream

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>...</title>
  </head>
  <body>
    ...
  </body>
</html>
                

Parser implementations typically use the XHTMLContentHandler utility class to generate the XHTML output. Dealing with the raw SAX events can be complex, so Apache Tika (since V0.2) comes with several utility classes that can be used to process and convert the event stream to other representations.

For example, the BodyContentHandler class can be used to extract just the body part of the XHTML output and feed it as SAX events to another content handler or as characters to an output stream, a writer, or simply a string. The following code snippet parses a document from the standard input stream and outputs the extracted text content to standard output:

ContentHandler handler = new BodyContentHandler(System.out);
parser.parse(System.in, handler, ...);
                

Another useful class is ParsingReader that uses a background thread to parse the document and returns the extracted text content as a character stream.


Listing 3. Example of the ParsingReader

InputStream stream = ...; // the document to be parsed
Reader reader = new ParsingReader(parser, stream, ...);
try {
 ...; // read the document text using the reader
} finally {
 reader.close(); // the document stream is closed automatically
}
                

Document metadata

The final argument to the parse method is used to pass document metadata in and out of the parser. Document metadata is expressed as a metadata object.

Table 2 lists some of the more interesting metadata properties.


Table 2. Metadata properties
Property Description
Metadata.RESOURCE_NAME_KEY The name of the file or resource that contains the document — A client application can set this property to allow the parser to use file name heuristics to determine the format of the document. The parser implementation may set this property if the file format contains the canonical name of the file (the GZIP format has a slot for the file name, for example).
Metadata.CONTENT_TYPE The declared content type of the document — A client application can set this property based on, such as an HTTP Content-Type header. The declared content type may help the parser to correctly interpret the document. The parser implementation sets this property to the content type according to which the document was parsed.
Metadata.TITLE The title of the document — The parser implementation sets this property if the document format contains an explicit title field.
Metadata.AUTHOR The name of the author of the document — The parser implementation sets this property if the document format contains an explicit author field.

Note that metadata handling is still being discussed by the Apache Tika development team, and it is likely that there will be some (backwards-incompatible) changes in metadata handling before Tika V1.0.

Parser implementations

Apache Tika comes with a number of parser classes for parsing various document formats, as shown in Table 3.


Table 3. Tika parser classes
Format Description
Microsoft® Excel® (application/vnd.ms-excel) Excel spreadsheet support is available in all versions of Tika and is based on the HSSF library from POI.
Microsoft Word® (application/msword) Word document support is available in all versions of Tika and is based on the HWPF library from POI.
Microsoft PowerPoint® (application/vnd.ms-powerpoint) PowerPoint presentation support is available in all versions of Tika and is based on the HSLF library from POI.
Microsoft Visio® (application/vnd.visio) Visio diagram support was added in Tika V0.2 and is based on the HDGF library from POI.
Microsoft Outlook® (application/vnd.ms-outlook) Outlook message support was added in Tika V0.2 and is based on the HSMF library from POI.
GZIP compression (application/x-gzip) GZIP support was added in Tika V0.2 and is based on the GZIPInputStream class in the Java 5 class library.
bzip2 compression (application/x-bzip) bzip2 support was added in Tika V0.2 and is based on bzip2 parsing code from Apache Ant, which was originally based on work by Keiron Liddle from Aftex Software.
MP3 audio (audio/mpeg) The parsing of ID3v1 tags from MP3 files was added in Tika V0.2. If found, the following metadata is extracted and set:
  • TITLE Title
  • SUBJECT Subject
MIDI audio (audio/midi) Tika uses the MIDI support in javax.audio.midi to parse MIDI sequence files. Many karaoke file formats are based on MIDI and contain lyrics as embedded text tracks that Tika knows how to extract.
Wave audio (audio/basic) Tika supports sampled wave audio (.wav files, etc.) using the javax.audio.sampled package. Only sampling metadata is extracted.
Extensible Markup Language (XML) (application/xml) Tika uses the javax.xml classes to parse XML files.
HyperText Markup Language (HTML) (text/html) Tika uses the CyberNeko library to parse HTML files.
Images (image/*) Tika uses the javax.imageio classes to extract metadata from image files.
Java class files The parsing of Java class files is based on the ASM library and work by Dave Brosius in JCR-1522.
Java Archive Files The parsing of JAR files is performed using a combination of the ZIP and Java class file parsers.
OpenDocument (application/vnd.oasis.opendocument.*) Tika uses the built-in ZIP and XML features in the Java language to parse the OpenDocument document types used most notably by OpenOffice V2.0 and higher. The older OpenOffice V1.0 formats are also supported, although they are currently not auto-detected as well as the newer formats.
Plain text (text/plain) Tika uses the International Components for Unicode Java library (ICU4J) to parse plain text.
Portable Document Format (PDF) (application/pdf) Tika uses the PDFBox library to parse PDF documents.
Rich Text Format (RTF) (application/rtf) Tika uses Java's built-in Swing library to parse RTF documents.
TAR (application/x-tar) Tika uses an adapted version of the TAR parsing code from Apache Ant to parse TAR files. The TAR code is based on work by Timothy Gerard Endres.
ZIP (application/zip) Tika uses Java's built-in ZIP classes to parse ZIP files.

You can also extend Apache Tika with your own parsers, and any contributions to Tika are welcome. The goal of Tika is to reuse existing parser libraries like Apache PDFBox or Apache POI as much as possible, so most of the parser classes in Tika are adapters to such external libraries.

Apache Tika also contains some general-purpose parser implementations that are not targeted at any specific document formats. The most notable of these is the AutoDetectParser class that encapsulates all Tika functionality into a single parser that can handle any type of document. This parser will automatically determine the type of the incoming document based on various heuristics and will then parse the document accordingly.

Now it's time for hands-on activities. Here are the classes we will develop throughout our tutorial:

  1. BudgetScramble — Shows how to use Apache Tika metadata to determine which document has been changed recently and when.
  2. TikaMetadata — Shows how to get all Apache Tika metadata of a specific document, even if there is no data (just to display all metadata types).
  3. TikaMimeType — Shows how to use Apache Tika's mimetypes to detect the mimetype of a particular document.
  4. TikaExtractText — Shows Apache Tika's text-extraction capabilities and saves extracted text as an appropriate file.
  5. LanguageDetector — Introduces the Nutch language's identification ability to identify the language of particular content.
  6. Summary — Sums up Tika features, such as MimeType, content charset detection, and metadata. In addition, it introduces cpdetector functionality to determine a file's charset encoding. Finally, it shows Nutch's language identification in process.

Requirements

  • Ant V1.7 or higher
  • Java V1.6 SE or higher

1 of 8 | Next

Comments



Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Open source
ArticleID=495090
TutorialTitle=Understanding information content with Apache Tika
publish-date=06152010
author1-email=olegt@il.ibm.com
author1-email-cc=
author2-email=chris.a.mattmann@jpl.nasa.gov
author2-email-cc=cmwalden@us.ibm.com

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere).

My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Try IBM PureSystems. No charge.