Text Analysis in InfoSphere Warehouse, Part 3: Develop and integrate custom UIMA text analysis engines

Gain business insights from unstructured data

In the first two articles of this series, you learned about IBM® InfoSphere® Warehouse text analysis capabilities, how to use regular expressions and dictionaries to extract information from text, and how to publish the results with a Cognos report. This article describes how to use the Unstructured Information Management Architecture (UIMA) framework to create a custom text annotator and use it in InfoSphere Warehouse. The ability of InfoSphere Warehouse to use UIMA-based annotators in analytic flows is a powerful feature. You can write custom annotators that can extract almost any information from text. You can also use UIMA-based annotators that are provided by IBM, other companies, and many universities. For example, you can find UIMA annotators that tokenize words and extract concepts such as persons or sentiments.

Stefan Abraham (stefana@de.ibm.com), Software Engineer, IBM

Stefan Abraham is a Software Engineer at the IBM Research & Development Lab Boeblingen, Germany. He works on text analysis components and on data mining related UI components in InfoSphere Warehouse.



Simone Daum (sdaum@de.ibm.com), Software Engineer, IBM

Simone Daum is a Software Engineer at the IBM Research & Development Lab Boeblingen, Germany. She works on tooling for data preparation for data mining and on text analysis in InfoSphere Warehouse.



Benjamin G. Leonhardi (bleon@de.ibm.com), Software Engineer, IBM

Benjamin Leonhardi is a software engineer for InfoSphere Warehouse data mining at the IBM Research & Development Lab in Boeblingen, Germany. He works on mining visualization, text mining, and mining reporting solutions.



27 August 2009


Introduction

As described in the previous articles of this series, InfoSphere Warehouse provides tooling for dictionary and regular expression-based text analysis. These are arguably the two most common approaches to extracting information from text, but there are many cases that these two approaches do not cover.

InfoSphere Warehouse text analytics is built on the UIMA framework, which is an open framework that enables development of extendible analytic applications. UIMA supports development of analytic applications from multiple modules. For example, one module could extract or annotate a person from the text, while another tries to find relationships between these annotations. The connection, inputs, and outputs are defined by UIMA through a hierarchical type system. The UIMA framework is available in different programming languages, with Java® being the most commonly used. With InfoSphere Warehouse, you can import UIMA-based Java Analysis Engines to significantly extend your text analytic capabilities. These custom annotators can be used in analytic flows with the TextAnalyzer operator. Similar to the Dictionary and Regular Expression operators, the found entities are mapped to database table columns and can be further processed and used like other structured information.

This article describes the basic concepts of the UIMA framework and gives you the theoretical and practical information you need to create your own UIMA analysis engine and use it in InfoSphere Warehouse. This includes an explanation of how to set up the Eclipse-based UIMA framework development environment and how to write your own simple type system and annotator description as interfaces of the analysis engine. The article then shows you how to write the Java code that does the actual analysis of the documents. After finishing development, you learn how to export your code as a Processing Engine ARchive (PEAR) file, which is the UIMA standard for packaging and deploying annotators. Finally, the article explains how to import the PEAR file into InfoSphere Warehouse and use it in an analytic flow to extract information from text. This complete cycle is explained using a simple example that analyzes text and returns all words that contain five or more vowels.

Writing custom annotators can be a useful way to solve a specific business problem. Additionally, the Apache UIMA home page and some universities have freely available analysis engine repositories that offer existing solutions you can use in InfoSphere Warehouse. There are also many companies that sell existing solutions. So by using the open standard approach of UIMA, you gain the advantage of possibly being able to leverage the work that others may have already done to solve similar problems.

Some of the most interesting and active fields related to text analysis are scientific research and sentiment detection. In sentiment detection, the analysis engine tries to determine how positive or negative someone feels about a given product or service. A practical example of sentiment detection would be to monitor forums, blogs, and review Web sites after a product launch to determine public reaction. Or you could monitor similar resources to measure how a new marketing campaign has changed the perception of a product. While very interesting, these types of applications represent only a subset of the available analysis engines that can be used in InfoSphere Warehouse to extract information from the unstructured information of your warehouse.


UIMA concepts

As described in the previous section, InfoSphere Warehouse provides text analysis features that are based on the Unstructured Information Management Architecture (UIMA). You can use built-in text analysis functions, namely dictionary-based and regular expression-based named entity extraction, as explained in the previous articles of this series. You can also use custom Apache UIMA-compliant text analysis components to implement more advanced text analysis solutions. Components are available through IBM service engagements, IBM Research, other companies, and universities. Of course, you can also create them yourself using the UIMA SDK.

What is UIMA

UIMA is a component architecture and a software framework implementation for the analysis of unstructured content such as text, video, and audio. The purpose for this framework is to have a common platform for unstructured analysis that can provide reusable analysis components that reduce duplication of development effort.

The architecture of UIMA allows you to easily plug in custom analysis components and combine them with others. Your UIMA applications do not need to know the details of how the analysis components work together to create the results. The task of integrating and orchestrating multiple analysis components is handled by the UIMA framework.

A UIMA application might analyze plain text and identify entities such as persons, places, and organizations; or it might identify relationships, such as works-for or located-at. Applications are typically decomposed into components. For example, "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names etc.)".

There might be dependencies between components. For example, a language specific segmentation must be done before the sentence boundary detection for this language can be started. Each component is self-contained and can be combined with other components. Each component, which is written in Java or C++, implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The UIMA framework manages the components and the data flow between them.

Analysis engines, annotators, and the Common Analysis Structure

Analysis engines are the central building blocks within UIMA. An analysis engine contains one or more annotators or other analysis engines. Each annotator implements one specific text analysis function. This recursive packaging allows you to build complex analysis engines out of simpler ones. Each annotator stores its results in typed feature structures, which are simply data structures that have a type and a set of (attribute, value) pairs.

An annotation is a particular type of feature structure that is attached to a region of the artifact being analyzed. For example, annotations might be attached to a span of text in a document. In this case, the annotations contain a specific start and end position within the document. This means they easily lend themselves to specifying information extraction results. For example, in the following text a Company annotation would cover positions 19 to 21:

UIMA started as an IBM initiative, but has gone open source in 2008
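
As a minimal sketch of how such a span relates to the text, the following plain Java snippet locates the company name and prints its offsets. It uses no UIMA classes, and the SpanDemo class and findSpan method are hypothetical names chosen for illustration. Note that UIMA stores the End value as the offset just after the last covered character, so the inclusive range 19 to 21 corresponds to Begin 19 and End 22.

```java
// Illustrative only: models an annotation span with plain Java, no UIMA classes.
class SpanDemo {

    // Returns {begin, end} for the first occurrence of term, or null if absent.
    // end is exclusive, matching String.substring(begin, end).
    static int[] findSpan(String text, String term) {
        int begin = text.indexOf(term);
        if (begin < 0) {
            return null;
        }
        return new int[] { begin, begin + term.length() };
    }

    public static void main(String[] args) {
        String text = "UIMA started as an IBM initiative, but has gone open source in 2008";
        int[] span = findSpan(text, "IBM");
        System.out.println("begin=" + span[0] + ", end=" + span[1]
                + ", covered=" + text.substring(span[0], span[1]));
    }
}
```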

All feature structures, including annotations, are represented in the UIMA Common Analysis Structure (CAS), which is the central data structure through which all UIMA components communicate.

Figure 1 shows an analysis engine containing annotators for named entity recognition, grammatical parsing, and relationship detection. Note that the Relationship Annotator can detect relationships without looking at the actual document text, but by analyzing preexisting concepts and grammar annotations in the CAS.

Figure 1. A text analysis engine containing annotators for grammatical parsing, named entity detection, and relationship detection
Mapping shows how annotators combine in a text analysis engine to extract CeoOf relationship from a string of text
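
The idea that annotators communicate only through a shared analysis structure can be sketched in plain Java. The toy model below is not the UIMA API: MiniCas, Ann, and Annotator are hypothetical stand-ins, the two phrase annotators use hard-coded strings instead of real entity recognition, and the relationship rule (a Person that precedes a Company) is a deliberately naive assumption. What it illustrates is that the last annotator reads only the annotations left by earlier ones and never inspects the document text.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of annotators sharing one analysis structure; not the UIMA API.
class PipelineSketch {

    // A typed span over the document text (end is exclusive).
    static class Ann {
        final String type;
        final int begin;
        final int end;

        Ann(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    // Stand-in for the CAS: the document text plus all annotations created so far.
    static class MiniCas {
        final String text;
        final List<Ann> annotations = new ArrayList<>();

        MiniCas(String text) {
            this.text = text;
        }

        // Returns the first annotation of the given type, or null if there is none.
        Ann first(String type) {
            for (Ann a : annotations) {
                if (a.type.equals(type)) {
                    return a;
                }
            }
            return null;
        }
    }

    interface Annotator {
        void process(MiniCas cas);
    }

    // Annotates the first occurrence of a fixed phrase with the given type.
    static Annotator phrase(String type, String literal) {
        return cas -> {
            int begin = cas.text.indexOf(literal);
            if (begin >= 0) {
                cas.annotations.add(new Ann(type, begin, begin + literal.length()));
            }
        };
    }

    // Reads only existing annotations, never cas.text.
    static final Annotator CEO_OF = cas -> {
        Ann person = cas.first("Person");
        Ann company = cas.first("Company");
        if (person != null && company != null && person.end <= company.begin) {
            cas.annotations.add(new Ann("CeoOf", person.begin, company.end));
        }
    };

    // Runs the three annotators in order on one document.
    static MiniCas run(String text) {
        MiniCas cas = new MiniCas(text);
        Annotator[] pipeline = {
            phrase("Person", "Fred Center"),
            phrase("Company", "Center Micros"),
            CEO_OF
        };
        for (Annotator annotator : pipeline) {
            annotator.process(cas);
        }
        return cas;
    }

    public static void main(String[] args) {
        MiniCas cas = run("Fred Center is the CEO of Center Micros");
        for (Ann a : cas.annotations) {
            System.out.println(a.type + " [" + a.begin + ", " + a.end + ") "
                    + cas.text.substring(a.begin, a.end));
        }
    }
}
```

Because the relationship step depends on the Person and Company annotations, it has to run after the annotators that create them, which is the kind of ordering dependency the framework manages for real analysis engines.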

The UIMA type system

The UIMA type system defines the various types of objects that might be found in documents and that can be extracted by analysis engines. For example, Person might be a type.

Types include features. For example, Age and Occupation might be defined as features of the Person type. Examples of other types might be Organization, Company, Money, Product, or NounPhrase.

Type systems are domain-specific and application-specific. You can organize types into taxonomies. For example, Company might be defined as a subtype of Organization, or NounPhrase might be a subtype of ParseNode.

A general and common type that is often used in text analysis to derive additional types is the Annotation type, which is provided by the UIMA framework. You can use the Annotation type to label regions in documents. The Annotation type includes the features Begin and End. The values of these features delimit a span. For example, in the following text string (the same string being analyzed in Figure 1) the annotation Person starts at position 0 and ends at position 10:

Fred Center is the CEO of Center Micros

The first step in developing an annotator is to define the CAS Feature Structure types that it works on. This is done in an XML file called a Type System Descriptor. UIMA defines the built-in types TOP, which is the root of the type system (analogous to Object in Java), Annotation, which is described above, and others. UIMA also defines basic range types for features such as Boolean, Integer, and Double, as well as arrays of these primitive types.
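
As a rough analogy, a type system with features and subtypes resembles a small Java class hierarchy. The sketch below is hand-written plain Java and not what UIMA generates: real types are declared in the Type System Descriptor. The class names mirror the examples in this section, and the age value is made up for illustration. The sketch uses exclusive end offsets, as String.substring does, so the Person span that ends at position 10 is written with end 11.

```java
// Hand-written analogy for a small type hierarchy; real UIMA types are
// declared in a Type System Descriptor, not coded like this.
class TypeSystemSketch {

    // Plays the role of the built-in Annotation type: a span with Begin and End.
    static class Annotation {
        final int begin;
        final int end; // exclusive

        Annotation(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    // Person adds two features on top of the inherited Begin and End.
    static class Person extends Annotation {
        final int age;
        final String occupation;

        Person(int begin, int end, int age, String occupation) {
            super(begin, end);
            this.age = age;
            this.occupation = occupation;
        }
    }

    static class Organization extends Annotation {
        Organization(int begin, int end) {
            super(begin, end);
        }
    }

    // Company is a subtype of Organization.
    static class Company extends Organization {
        Company(int begin, int end) {
            super(begin, end);
        }
    }

    // Works for any subtype because every annotation carries a span.
    static String coveredText(String document, Annotation a) {
        return document.substring(a.begin, a.end);
    }

    public static void main(String[] args) {
        String doc = "Fred Center is the CEO of Center Micros";
        Person person = new Person(0, 11, 52, "CEO"); // age 52 is a made-up value
        Organization org = new Company(26, 39);       // a Company is also an Organization
        System.out.println(coveredText(doc, person) + " / " + coveredText(doc, org));
    }
}
```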

Processing Engine ARchives (PEAR) files

After you have developed and successfully tested an analysis engine, you can package it to be deployed to another application as a pre-configured (text) analysis component. The annotator packaging format in UIMA is called PEAR, which stands for Processing Engine ARchive. The PEAR format contains all necessary information to run the wrapped annotator component. For details about the PEAR packaging format, refer to the PEAR Reference chapter of the UIMA reference documentation.

Using the Text Analyzer operator in InfoSphere Warehouse Design Studio, you can import any Apache UIMA compliant text analysis engine to annotate concepts in unstructured text. Use the Analysis Engine Import wizard to import these pre-configured PEAR files. If an analysis engine was created using a prior UIMA version such as IBM UIMA, it needs to be migrated first. The migration process is described in the next section.


UIMA and InfoSphere Warehouse

As mentioned in the previous sections, you can import pre-configured UIMA analysis engines (PEAR files) into InfoSphere Warehouse. This lets you extend the text analysis capabilities of InfoSphere Warehouse to meet your specific requirements.

This section provides an overview of how the text analysis results created by the analysis engines are represented in the database. It also explains how you can use InfoSphere Warehouse to import and execute text analysis engines that were created with prior versions of UIMA.

Mapping of analysis results to columns in the database

In InfoSphere Warehouse the analysis of unstructured data requires that the data exists in the database as a column with a character data type (CHAR, VARCHAR, or CLOB). UIMA handles the given text column content of each table row as one text document.

The produced analysis results, each derived from the Annotation type, are stored in the CAS. A CAS Consumer is used to write the contents of selected results into the database using JDBC. Each selected feature results in a new column.

To specify which analysis results should be written to the database, you configure the properties of the Text Analyzer operator, or more precisely, the Analysis Results tab. Use this tab to specify which annotation type contains the analysis result (for example, the type Company).

As described earlier, each annotation includes built-in features such as Begin and End, and may also have custom features such as Full_Legal_Name or CEO for the type Company. The selected annotation type specifies which result you are interested in, and the selected features of this annotation type specify the detailed part of interest. Each feature results in one database column.

The resulting database table does not need to contain columns created just from the analysis. It can contain columns from the input tables as well. This might, for example, enable you to relate the results to the original text later.

This process is summarized in Figure 2. As depicted in the figure, the CAS Consumer is configured in the Analysis Result tab of the Text Analyzer operator to write the content of the Begin, End, and Full_Legal_Name features to the resulting table.

Figure 2. UIMA flow to analyze a text column and write analysis result into a table
UIMA flow showing that analysis of sample text string results in International Business Machines being written as Full_Legal_Name in the resulting table.
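
The mapping from selected features to result columns can be pictured with a short plain Java sketch. In the product, this work is done by the CAS Consumer inside InfoSphere Warehouse, so the ColumnMappingSketch class, the toRows method, and the sample feature values below are assumptions made purely for illustration. Each annotation becomes one row, and each selected feature becomes one column of that row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the feature-to-column mapping; the real work is done by the
// CAS Consumer inside InfoSphere Warehouse, not by user code like this.
class ColumnMappingSketch {

    // Builds one result row per annotation, with one column per selected feature.
    static List<Map<String, Object>> toRows(List<Map<String, Object>> annotations,
                                            List<String> selectedFeatures) {
        List<Map<String, Object>> rows = new ArrayList<>();
        for (Map<String, Object> annotation : annotations) {
            // LinkedHashMap keeps the columns in the order they were selected.
            Map<String, Object> row = new LinkedHashMap<>();
            for (String feature : selectedFeatures) {
                row.put(feature, annotation.get(feature));
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // One Company annotation; the feature values are made-up sample data.
        Map<String, Object> company = new LinkedHashMap<>();
        company.put("Begin", 19);
        company.put("End", 22);
        company.put("Full_Legal_Name", "International Business Machines");
        company.put("CEO", "not selected in this example");

        List<Map<String, Object>> annotations = new ArrayList<>();
        annotations.add(company);

        // Only the three selected features end up as columns; CEO is left out.
        List<Map<String, Object>> rows =
                toRows(annotations, Arrays.asList("Begin", "End", "Full_Legal_Name"));
        System.out.println(rows);
    }
}
```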

Problems that are hard to diagnose can occur if the results of the UIMA annotator are larger than the result column they are mapped to. For example, if the column has a type of VARCHAR(256), results that are longer than 256 characters cause the flow to fail with an SQL error. An easy way to identify this problem is to temporarily use CLOB columns, which have far larger size limits, as target columns. If this solves the problem, your annotator is probably returning unexpectedly long values. It may be useful to ensure in the annotator code that the created annotations do not exceed a specific length. Another problem may occur if no DB2 table space that can hold rows of the required length is available.
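
One way to guard against overly long values is to clamp string features before they are set on an annotation. The helper below is a minimal sketch under stated assumptions: LengthGuard and clamp are hypothetical names, and the limit of 256 simply mirrors the VARCHAR(256) example. It counts Java characters, so if the database measures column length in bytes, text with multi-byte characters may need a stricter limit.

```java
// Hypothetical helper for keeping feature values within a target column length.
class LengthGuard {

    // Returns value unchanged if it fits, otherwise its first maxLength characters.
    // A null value is passed through so the caller can decide how to handle it.
    static String clamp(String value, int maxLength) {
        if (value == null || value.length() <= maxLength) {
            return value;
        }
        return value.substring(0, maxLength);
    }

    public static void main(String[] args) {
        StringBuilder longValue = new StringBuilder();
        for (int i = 0; i < 300; i++) {
            longValue.append('x');
        }
        String safe = clamp(longValue.toString(), 256);
        System.out.println("length after clamping: " + safe.length());
    }
}
```

In an annotator, such a helper would be called just before a string feature setter, so that every value written to the CAS already fits the column it is mapped to.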

Migrating IBM UIMA-based analysis engines to Apache

InfoSphere Warehouse V9.5.1 and later does not support IBM UIMA-based packaged text analysis components. However, you can still use these components within these versions by following one of the procedures described below:

  • Use the IBM UIMA to Apache UIMA conversion tooling to migrate the source code of PEAR files. For more information, see the Migrating from IBM UIMA to Apache UIMA section on the UIMA Web site.
  • Use the IBM UIMA Adapter Wrapper for Apache UIMA to run your IBM UIMA-based PEAR file in an Apache UIMA-based runtime environment. You can use the IBM UIMA Adapter package for IBM UIMA-based annotators when no source code is available that you can transform with the IBM UIMA to Apache UIMA conversion tooling. To process these IBM UIMA annotators, use the IBM UIMA Adapter package in the new Apache UIMA runtime. For more information, visit the UIMA page on alphaWorks.

Identify problems in an annotator by switching on UIMA logging in InfoSphere Warehouse

InfoSphere Warehouse forwards error messages that occur in your annotator code to the trace of the executed analytic flow. If you enable content tracing for a process, a data flow, or a mining flow, UIMA logs of level CONFIG and above are routed to the InfoSphere Warehouse log.

On some occasions it may be necessary to debug the problems of your annotator by retrieving UIMA log messages of a finer level. This covers all problems that occur in the UIMA code of your annotator and problems that occur in the annotator code of a custom annotator. See the InfoSphere Warehouse documentation for instructions on how to see the UIMA logs of a mining or data flow and how to change the UIMA trace level to get more information.
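
The effect of the trace level can be illustrated with the standard java.util.logging package, whose level names (CONFIG, FINE, and so on) match the UIMA log levels. This is only a sketch of the filtering behavior, assuming a logging setup based on java.util.logging; it does not use the UIMA logger that an annotator obtains from its context, and the LogLevelSketch class and the log messages are made up. With a threshold of CONFIG the FINE message is dropped, and lowering the threshold to FINE makes it visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Demonstrates level filtering with java.util.logging; not the UIMA logger itself.
class LogLevelSketch {

    // Logs one CONFIG and one FINE message and returns those that pass the threshold.
    static List<String> messagesAt(Level threshold) {
        final List<String> seen = new ArrayList<>();
        Logger logger = Logger.getAnonymousLogger();
        logger.setUseParentHandlers(false); // keep the demo output off the console
        logger.setLevel(threshold);

        Handler collector = new Handler() {
            @Override
            public void publish(LogRecord record) {
                seen.add(record.getLevel() + ": " + record.getMessage());
            }

            @Override
            public void flush() {
            }

            @Override
            public void close() {
            }
        };
        collector.setLevel(Level.ALL); // let the logger's own level do the filtering
        logger.addHandler(collector);

        logger.log(Level.CONFIG, "vowel threshold set to 4");
        logger.log(Level.FINE, "word 'text' has 1 vowel");

        logger.removeHandler(collector);
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("threshold CONFIG: " + messagesAt(Level.CONFIG));
        System.out.println("threshold FINE:   " + messagesAt(Level.FINE));
    }
}
```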


An example of creating a UIMA analysis engine and integrating it into InfoSphere Warehouse Design Studio

This example shows you how to write a simple UIMA analysis engine using the UIMA SDK and then integrate it into a text analysis flow in InfoSphere Warehouse Design Studio. The analysis engine splits document text into words (expecting white-space separated text) and creates an annotation for all words that contain more than four vowels. Although there is no business use case for such an analysis engine, it provides a simple way to demonstrate how easy it is to create a UIMA analysis engine.

The example uses the Eclipse Tooling for UIMA, which is provided on Apache.org, to create the analysis engine. This choice was made because it provides easy-to-use editors and tooling for analysis engine development.

After the analysis engine is ready, you package it as a PEAR file, which is the format that InfoSphere Warehouse expects for the analysis engine import. You then import the analysis engine in Design Studio and create a flow using it on sample text documents that are located in a database table. For these sample text documents, the table CIA.FACTBOOK in the DWESAMP database that is shipped with InfoSphere Warehouse is used.

Installing UIMA

To install UIMA, download the UIMA Java Framework & SDK (version 2.2.2) from the UIMA download page and extract it to your file system. You also need to install the UIMA Eclipse plug-ins as described in these step-by-step instructions in the Apache UIMA and SDK Setup documentation.

Creating and testing the UIMA analysis engine

A UIMA analysis engine consists of a Java class for the contained annotator and additional XML files. The Java class extends a basic annotator implementation provided by the UIMA SDK. The additional XML files describe the types of annotations that are created (or consumed) by the annotator along with other additional information. The UIMA Eclipse plug-ins simplify the creation of the XML files. Use the following steps to create the analysis engine in your Eclipse environment.

Creating a Java project with a UIMA nature

  1. In your Eclipse environment, select File -> New -> Java Project.
  2. Enter a project name. For example, com.ibm.developerworks.textanalysis.partIII.
  3. Next, you add the UIMA nature to the project. This creates a specific folder structure for the project and enables additional actions on it. To add the nature, right-click on the project and select Add UIMA Nature. In the following steps, you only make use of the desc folder that is created.
  4. Finally, you have to add the UIMA Java libraries to the build path of the project. To do this, right-click on the project and select Build Path -> Add External Archives. Include all the jar files contained in the lib folder of your extracted UIMA package.

Creating the type system

The type system contains information about annotation types and their features. The sample annotator you are building creates annotations for words with a minimum number of vowels, so the type system should contain a VowelCountAnnotation type. A feature of this type should be the number of vowels that the word contains.

To create the type system, follow these steps:

  1. Right-click the desc folder of your project and select New -> Other....
  2. Select UIMA -> Type System Descriptor File and enter VowelCountTypeSystem.xml as the file name.
  3. Go to the Type System tab at the bottom of the editor.
  4. Click Add Type and enter com.ibm.developerworks.VowelCountAnnotation as the type name. This type inherits from the UIMA Annotation type.
  5. Select the VowelCountAnnotation type and click Add ... to add the feature for the number of vowels. Enter numberVowels as the feature name.
  6. Click the Browse button and select uima.cas.Integer as the range type for the feature. Figure 3 shows the complete type system.
    Figure 3. The type system
    Screen shot of the Types (or Classes) section with tree showing numberVowels feature under VowelCountAnnotation type.
  7. Save the type system. Notice that the UIMA framework already generated the Java classes VowelCountAnnotation_Type.java and VowelCountAnnotation.java in the com.ibm.developerworks package of your source folder. You use these classes in the annotator code in the next step. If the Java classes are not generated, click the JCasGen button in the type system editor.

Implementing the Java annotator class

The implementation of the annotator resides in a Java class that extends the basic annotator implementation named JCasAnnotator_ImplBase, which is provided by the UIMA framework.

  1. Right-click on the com.ibm.developerworks package and select New -> Class.
  2. Select VowelCounter as the class name and org.apache.uima.analysis_component.JCasAnnotator_ImplBase as the superclass.
  3. For this simple annotator, you just need to override the process method, which receives a JCas, the Java class for the Common Analysis Structure. This object contains the document text along with meta information about the text, such as the document language. It also contains the annotations that are created during the analysis run. For example, if you are running multiple annotators in a row, one annotator can access the annotations that were created by previously running annotators. The annotator for this exercise does not read any annotations; it only creates the VowelCountAnnotation annotations.

Listing 1 contains the complete set of code for the VowelCounter class. Look at the process method. First it reads the text of the document with getDocumentText(). Then it replaces line breaks, underscores, and hyphens with blanks. This is because the following loop splits the text into words by blanks. A real linguistic analysis certainly has to be more sophisticated, but for demonstration purposes this simple approach is sufficient.

The getNumberVowels() method counts the number of vowels in each word. If the word contains more than four vowels, an annotation is created. The code simply uses the VowelCountAnnotation class that has been automatically created with the type system along with getters and setters for the numberVowels feature and the default features for the Begin and End of the annotation. Lastly, annotation.addToIndexes() adds the annotation to the CAS annotation index.

Listing 1. VowelCounter.java
package com.ibm.developerworks;

import java.util.ArrayList;
import java.util.Collections;

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;

public class VowelCounter extends JCasAnnotator_ImplBase {

   // the vowels threshold. Words that have more vowels than this will be annotated
   int vowelThreshold = 4;

   @Override
   public void process(JCas cas) throws AnalysisEngineProcessException {
      // get document text
      String doc = cas.getDocumentText();

      // replace line breaks, underscores and hyphens by blanks
      doc = doc.replaceAll("\n", " ");
      doc = doc.replaceAll("_", " ");
      doc = doc.replaceAll("-", " ");
		
      // iterate over the words and count vowels
      int currentPos = 0;
      while (currentPos < doc.length()) {
         int posNextDelimiter = doc.indexOf(' ', currentPos);
         int end = posNextDelimiter;
         if (posNextDelimiter == -1) {
            end = doc.length();
         }

         int numberVowels = getNumberVowels(doc.substring(currentPos, end));
         if (numberVowels > vowelThreshold) {
            this.addAnnotation(currentPos, end, numberVowels, cas);
         }
         currentPos = end + 1;
      }
   }

   protected void addAnnotation(int begin, int end, int numberVowels, JCas cas) {
      VowelCountAnnotation annotation = new VowelCountAnnotation(cas);
      annotation.setBegin(begin);
      annotation.setEnd(end);
      annotation.setNumberVowels(numberVowels);
      annotation.addToIndexes();
   }

   /**
    * Returns the number of vowels a, e, i, o, u, A, E, I, O, U in the word
    * 
    * @param word
    *           the input word
    * @return the number of vowels in the word
    */
   protected int getNumberVowels(String word) {
      if ((word == null) || (word.length() == 0))
         return 0;

      // initialize the list of vowels
      // (this can also be moved to the initialize()-method of the annotator
      // for performance optimization)
      ArrayList<Character> vowels = new ArrayList<Character>();
      Collections.addAll(vowels, 'a', 'e', 'i', 'o', 'u', 'A', 'E', 'I', 'O', 'U');

      int numberOfVowels = 0;
      for (int i = 0; i < word.length(); i++) {
         if (vowels.contains(word.charAt(i))) {
            numberOfVowels++;
         }
      }
      return numberOfVowels;
   }
}
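
To try out the splitting and counting logic without a UIMA runtime, the word loop from Listing 1 can be restated as a standalone class. The sketch below is an illustration, not part of the analysis engine: VowelLogicSketch, findWords, and countVowels are hypothetical names, and the sample sentence is made up. It returns the begin offset, end offset, and vowel count for each word that exceeds the threshold. Because every delimiter is replaced by exactly one blank, the text keeps its length and the offsets stay valid for the original document.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone restatement of the word loop from Listing 1, without UIMA classes,
// so the splitting and counting logic can be tried out on its own.
class VowelLogicSketch {

    // Counts the vowels a, e, i, o, u in upper or lower case.
    static int countVowels(String word) {
        int count = 0;
        for (int i = 0; i < word.length(); i++) {
            if ("aeiouAEIOU".indexOf(word.charAt(i)) >= 0) {
                count++;
            }
        }
        return count;
    }

    // Returns {begin, end, vowelCount} for each word with more than threshold vowels.
    static List<int[]> findWords(String text, int threshold) {
        // One-for-one character replacements keep all offsets valid for the original text.
        String doc = text.replace('\n', ' ').replace('_', ' ').replace('-', ' ');
        List<int[]> hits = new ArrayList<>();
        int currentPos = 0;
        while (currentPos < doc.length()) {
            int next = doc.indexOf(' ', currentPos);
            int end = (next == -1) ? doc.length() : next;
            int vowels = countVowels(doc.substring(currentPos, end));
            if (vowels > threshold) {
                hits.add(new int[] { currentPos, end, vowels });
            }
            currentPos = end + 1;
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "The international warehouse_team";
        for (int[] hit : findWords(text, 4)) {
            System.out.println(text.substring(hit[0], hit[1]) + " -> " + hit[2] + " vowels");
        }
    }
}
```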

Creating the descriptor file

To complete the analysis engine, you need to create a descriptor file that links the Java class and the type system and contains other configuration options. To create the descriptor, follow these steps:

  1. Right-click the desc folder of your project and select New -> Other....
  2. Select UIMA -> Analysis Engine Descriptor File and enter VowelCounter.xml as the file name.
  3. Go to the Overview tab of the editor and select the VowelCounter class as Java class file.
  4. On the Type System tab in the Imported Type Systems section, click Add to import your type system VowelCountTypeSystem.xml by location.
  5. To define which kind of annotations are output of the annotator, go to the Capabilities tab and click Add Type.
  6. In the row for the VowelCountAnnotation, click on the Output column to declare this Annotation type as the output type of the annotator, as shown in Figure 4.
    Figure 4. The capabilities tab
    Screen shot of the Component Capabilities section with VowelCountAnnotation type selected as Output.
  7. Save the descriptor file.

Testing the annotator

During development of the annotator, you might want to test it out on some sample text. You can do this easily with the Document Analyzer that is provided with the UIMA SDK. To run the analyzer, follow these steps:

  1. Right-click the Java project and select Run As -> Java Application.
  2. Select org.apache.uima.tools.docanalyzer.DocumentAnalyzer as the application.
  3. In the dialog, point Location of the Analysis Engine XML Descriptor to your VowelCounter.xml descriptor file and click the Interactive button.
  4. Type some sample text into the text field and click Analyze. Figure 5 shows an example of how annotations appear in the document analyzer.
    Figure 5. Annotations in the document analyzer
    Sample text in the document analyzer. The word warehouse in the sample text is highlighted as a VowelCountAnnotation.

Exporting the analysis engine

To use the analysis engine in the InfoSphere Warehouse Design Studio, follow these steps to package it as a PEAR file:

  1. Right-click on the Java project and select Generate PEAR file.
  2. In the wizard, point Component Descriptor to your VowelCounter.xml descriptor file and click Next two times.
  3. On the PEAR file page, enter the path where the PEAR file should be saved to. For example, c:\VowelCounter.pear.
  4. Click Finish. Now the analysis engine is ready to be imported and used in the Design Studio.

Importing and using the analysis engine in the InfoSphere Warehouse Design Studio

You can import analysis engines that are in the PEAR file format into a Data Warehousing project and use them in an analysis flow with the Text Analyzer operator.

Creating a Data Warehousing project in the Design Studio

  1. From the Design Studio menu, select File -> New -> Data Warehouse Project.
  2. Enter a project name in the wizard.
  3. Click Finish.

Importing the analysis engine

  1. Right-click the Analysis Engines folder (under Text Analysis) in your project and select New -> Analysis Engine.
  2. Point the Analysis Engine to the PEAR file that you previously exported to the file system.
  3. Click Finish.

Creating the data flow

  1. Right-click the Data Flows folder in your Data Warehousing project and select New -> Data Flow.
  2. In the wizard, enter a name for the data flow. For example, Count Vowels.
  3. Specify that you want to work against a database and click Next.
  4. Select the DWESAMP database and click Finish.

Defining the data flow

The following steps describe how to define the data flow so that it reads text from the CIA.FACTBOOK table in the DWESAMP sample database, runs the analysis engine on the text, and writes the annotations to a new database table.

The source table is read with a Table Source operator. The analysis engine is run with a Text Analyzer operator. The last operator in the flow is a Table Target operator pointing to a table where you want to write the annotations.

  1. From the Sources and Targets section of the palette, select a Table Source operator and drag it onto the editor canvas.
  2. In the Source database table browse dialog, expand the CIA schema and select the FACTBOOK table.
  3. Click OK.
  4. Click Finish.
  5. From the Text Operators section, drag a Text Analyzer operator onto the canvas. The Properties view of the operator is opened below the canvas.
  6. On the canvas, use a drag operation to connect the output port of the Table Source operator with the input port of the Text Analyzer operator.
  7. Select the Text Analyzer operator on the canvas and define the properties in the Properties view as follows:
    1. On the Analysis Engine page, select the TEXT input-text column and set the analysis engine to the imported engine named VowelCounter.
    2. On the Analysis Results page, select VowelCountAnnotation as the annotation type. You can see the numberVowels, Begin, and End features that are already selected as result columns. The Design Studio automatically adds a column for the Covered Text. The column contains the text that the annotation covers from Begin to End. This is simply the text that was marked for the annotation in the text analyzer.
    3. On the Output columns page, select the COUNTRY column from the list of available columns and move it to the list of output columns on the right. Now you can relate the annotation with the key COUNTRY of the text document that it was found in.
  8. To create the table to contain the annotations, right-click the output port of the Text Analyzer operator and select Create suitable table....
  9. Enter VOWELCOUNTS for the table name and CIA for the schema.
  10. Click Finish.
  11. Finally, save the data flow. For example, you can save by clicking into the editor area and pressing the Ctrl+S keys.

The complete data flow is shown below in Figure 6. You can execute the data flow by selecting DB2 UDB Data Flow -> Execute.

Figure 6. The data flow
The data flow in the InfoSphere Warehouse Design Studio. Table Source connected to Text Analyzer connected to Table Target.

You can view the sample annotations by right-clicking the table target operator and selecting Sample Contents of the Database Table. Figure 7 shows the first annotations that were found for Afghanistan.

Figure 7. Sample annotations
Sample annotations. All samples are from COUNTRY Afghanistan and show all words found with five or more vowels.

Conclusion

In this article you have learned how to develop your own UIMA analysis engine and use it in InfoSphere Warehouse. Unstructured Information Management Architecture (UIMA) is an open software framework that supports development and deployment of analytical applications. This article described the necessary UIMA concepts to develop a custom analysis engine. You have learned how to set up the UIMA development environment and how to create your own annotator and use it in InfoSphere Warehouse to extract structured information from text input. The ability to use UIMA-compatible analysis engines greatly expands the analytic capabilities of InfoSphere Warehouse. It makes it possible to develop custom analysis engines for projects requiring specific types of analysis. It also makes it possible to use third-party assets, free or commercial, to do sophisticated analysis such as sentiment detection on your Warehouse data.

This article is the third and last of this three-part series about the use of InfoSphere Warehouse for text analysis. The first article explained the architecture of InfoSphere Warehouse text analysis and how to use the integrated tooling for regular expressions to extract information from text data. The second article gave an overview of the support for dictionary-based analysis, which enables you to build your own dictionaries and use them for analysis. The second article also showed how to easily distribute the extracted information through your company with IBM Cognos reporting tools. This last article concludes the series by describing how to write your own custom UIMA annotators and how to use them in InfoSphere Warehouse.

As described in the first article, unstructured information represents a large and fast growing part of the available business information in most companies. InfoSphere Warehouse text analysis is built on open standards like UIMA and provides powerful and rich tooling that you can use to unlock the information contained in this unstructured data. You can then use BI reporting tools such as Cognos to easily distribute extracted information throughout your company.
