Unstructured Information Management Architecture SDK

A Java SDK that supports the implementation, composition, and deployment of applications working with unstructured information

See these frequently asked questions about UIMA.

What is UIMA?

UIMA stands for Unstructured Information Management Architecture. It is a component software architecture for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies. UIMA processing occurs through a series of modules called analysis engines. The result of analysis is an assignment of semantics to the elements of unstructured data, for example, the indication that the phrase "Washington" refers to a person's name or that it refers to a place. UIMA supports the rendering of these results in conventional structures (for example, relational databases or search engine indices), where the content of the original unstructured information may efficiently be accessed according to its inferred semantics. UIMA is specifically designed to support the developer in the creation, integration, deployment, and sharing of components across platforms and among dispersed teams with different skills working to develop advanced analytics.

What's the difference between UIMA and the UIMA SDK?

UIMA is an architecture that specifies component interfaces, design patterns, data representations, and development roles. The UIMA Software Development Kit (SDK) is a software system that includes a run-time framework, APIs, and tools for implementing, composing, packaging, and deploying UIMA components. It comes with a semantic search engine for indexing and querying over the results of analysis. The UIMA run-time framework allows developers to plug in their components and applications and run them on different platforms and according to different deployment options that range from tightly-coupled (running in the same process space) to loosely-coupled (distributed across different processes or machines for greater scale, flexibility, and recoverability).

How does UIMA relate to IBM products?

UIMA text analysis engines and annotators are already used within several IBM products, including IBM's new enterprise search product, WebSphere Information Integrator OmniFind Edition, and IBM's WebSphere Portal product. All new text analysis technology that is being put into IBM products is based on UIMA components.

Can I build my UIM application on top of UIMA?

Yes. The UIMA license does not restrict its usage to specific scenarios, and we are of course very interested in your feedback, which will help us make UIMA the right platform for building UIM applications. Please note, however, that we currently offer support on a "best we can do" basis. If you are interested in a more formal support agreement, or if you would like to include UIMA in a commercial solution, please contact IBM for additional options.

What is an annotation?

An annotation is a label, typically represented as a string of characters, associated with a region of a document. An example is the label "Person" associated with the span of text "George Washington". We say that "Person" annotates "George Washington" in the sentence "George Washington was the first president of the United States". The association of the label "Person" with a particular span of text is an annotation.

Annotations are not limited to text. A label may annotate a region of an image or a segment of audio. The same concepts apply.
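
The idea of a label attached to a span can be sketched in plain Java. The SpanAnnotation class below is a hypothetical illustration and not a UIMA SDK class; it only assumes that a text span is described by a begin offset and an exclusive end offset.

```java
// Hypothetical illustration only; not part of the UIMA SDK.
final class SpanAnnotation {
    final String label; // for example "Person"
    final int begin;    // offset of the first character covered
    final int end;      // offset one past the last character covered

    SpanAnnotation(String label, int begin, int end) {
        if (begin < 0 || end < begin) {
            throw new IllegalArgumentException("invalid span");
        }
        this.label = label;
        this.begin = begin;
        this.end = end;
    }

    // Returns the region of the document that this label annotates.
    String coveredText(String document) {
        return document.substring(begin, end);
    }
}

class AnnotationDemo {
    public static void main(String[] args) {
        String doc = "George Washington was the first president of the United States";
        SpanAnnotation person = new SpanAnnotation("Person", 0, 17);
        System.out.println(person.label + " annotates \"" + person.coveredText(doc) + "\"");
    }
}
```

For an image or audio artifact, the two offsets would be replaced by whatever describes a region in that medium, such as pixel coordinates or time stamps.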

What is the CAS?

CAS stands for Common Analysis Structure. It provides cooperating UIMA components with a common representation and mechanism for shared access to the artifact being analyzed (for example, a document, audio file, or video stream) and the current analysis results.

What does the CAS contain?

The CAS is a data structure for which UIMA provides multiple interfaces. It contains and provides the analysis writer with access to the following:

Does the CAS contain only annotations?

No. The CAS contains the artifact being analyzed and the analysis results. Analysis results are those statements recorded by analysis engines in the CAS. The most common form of analysis result is the addition of an annotation. But an analysis engine may write any structure that conforms to the CAS's type system into the CAS. These may not be annotations but may be other things, such as links between annotations and properties of objects associated with annotations.

Is the CAS merely XML?

No; in fact there are many possible representations of the CAS. If all of the analysis engines are running in the same process, an efficient, in-memory data object is used. If a CAS must be sent to an analysis engine on a remote machine, it can be done via an XML or a binary serialization of the CAS. UIMA specifies an XML representation of the CAS.

What is a type system?

Think of a type system as a schema for the CAS. It defines the types of objects and their properties (or features) that may be instantiated in a CAS. A CAS conforms to a particular type system. UIMA components declare their input and output with respect to a type system. Type systems include the definitions of types, their properties, and single-inheritance hierarchy of types.
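
A single-inheritance type hierarchy can be sketched in a few lines of Java. The SimpleTypeSystem class and the type names used here ("Top", "Annotation", "com.myorg.Person") are hypothetical and chosen for illustration; this is not the UIMA type system API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a single-inheritance type hierarchy; not the UIMA API.
final class SimpleTypeSystem {
    // Maps each type name to its single parent (null for the root type).
    private final Map<String, String> parentOf = new HashMap<>();

    void addType(String name, String parent) {
        if (parent != null && !parentOf.containsKey(parent)) {
            throw new IllegalArgumentException("unknown parent type: " + parent);
        }
        parentOf.put(name, parent);
    }

    // True if 'type' is 'ancestor' itself or inherits from it.
    boolean isSubtypeOf(String type, String ancestor) {
        for (String t = type; t != null; t = parentOf.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }
}

class TypeSystemDemo {
    public static void main(String[] args) {
        SimpleTypeSystem ts = new SimpleTypeSystem();
        ts.addType("Top", null);
        ts.addType("Annotation", "Top");
        ts.addType("com.myorg.Person", "Annotation");
        System.out.println(ts.isSubtypeOf("com.myorg.Person", "Annotation"));
        System.out.println(ts.isSubtypeOf("Annotation", "com.myorg.Person"));
    }
}
```

Because each type has exactly one parent, checking whether a component's declared input type accepts a given object is a simple walk up the parent chain.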

What's the difference between an annotator and an analysis engine?

In the terminology of UIMA, an annotator is simply some code that analyzes documents and puts out annotations on the content of the documents. The UIMA framework takes the annotator, together with metadata describing such things as the input requirements and output of the annotator, and produces an analysis engine. Analysis engines contain the framework-provided infrastructure that allows them to be easily combined with other analysis engines in different flows and according to different deployment options (collocated or as Web services, for example).

Are UIMA analysis engines Web services?

Not necessarily. However, deploying an analysis engine as a Web service is one of the deployment options supported by the UIMA framework.

How do you scale a UIMA application?

The UIMA framework allows components such as analysis engines and CAS consumers to be easily deployed as services or in other containers and managed by systems middleware designed to be scaled. UIMA applications tend to scale out naturally across documents, allowing many documents to be analyzed in parallel.

What does it mean to embed UIMA in systems middleware?

An example of an embedding would be the deployment of a UIMA analysis engine as an Enterprise Java Bean inside an application server such as IBM WebSphere. Such an embedding allows the deployer to take advantage of the features and tools provided by WebSphere for achieving scalability, service management, recoverability, etc. UIMA is independent of any particular systems middleware, so analysis engines could be deployed on other types of middleware as well.

Must analysis engines be "stateless"?

Technically, no. But analysis engine developers are encouraged not to maintain state between documents that would prevent their engine from working as advertised if switched into a different flow or onto a different document collection.

UIMA defines another type of component, the CAS Consumer, which is intended to maintain state across documents and is typically associated with some resource such as a database or search engine that aggregates analysis results across an entire collection.

Is engine meta-data compatible with Web services and UDDI?

All UIMA component implementations are associated with an XML descriptor that represents captured metadata describing various properties about the component in order to support discovery, reuse, validation, automatic composition, and development tooling. In principle, UIMA component metadata is compatible with Web services and UDDI. However, the UIMA framework currently uses its own XML representation for this metadata. It would not be difficult to convert between UIMA's XML representation and the WSDL and UDDI standards.

How is the CPM different from a CPE?

The UIMA framework includes a Collection Processing Manager or CPM for managing the execution of a workflow of UIMA components orchestrated to analyze a large collection of documents. The UIMA developer does not implement or describe a CPM. It is a built-in part of the framework. It is a piece of infrastructure code that handles CAS transport, instance management, batching, check-pointing, statistics collection, and failure recovery in the execution of this collection processing workflow.

A Collection Processing Engine (CPE) is a component that the UIMA developer creates by specifying a CPE descriptor. A CPE descriptor points to a series of UIMA components, including a Collection Reader, CAS Initializer, Analysis Engine(s), and CAS Consumers. These components organized in a particular flow define a collection analysis job that acquires documents from a source collection, initializes CASs with document content, performs document analysis, and then produces collection level results (for example, search engine index, database, and so on). The CPM is the execution engine for a CPE.
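
The division of labor between the components and the execution engine can be sketched with plain Java stand-ins. Every type below (MiniCas, MiniEngine, MiniConsumer, MiniCpm) is hypothetical and greatly simplified; the real CPM also handles batching, check-pointing, statistics, and failure recovery, none of which is shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// All types here are hypothetical stand-ins, not UIMA SDK interfaces.
final class MiniCas {
    final String documentText;
    final List<String> results = new ArrayList<>();

    MiniCas(String documentText) {
        this.documentText = documentText;
    }
}

interface MiniEngine {
    void process(MiniCas cas);
}

interface MiniConsumer {
    void consume(MiniCas cas);
}

final class MiniCpm {
    // Reads each document, runs the engines in order, then hands the CAS to the consumer.
    static int run(Iterator<String> reader, List<MiniEngine> engines, MiniConsumer consumer) {
        int processed = 0;
        while (reader.hasNext()) {
            MiniCas cas = new MiniCas(reader.next()); // stands in for CAS initialization
            for (MiniEngine engine : engines) {
                engine.process(cas);
            }
            consumer.consume(cas);
            processed++;
        }
        return processed;
    }
}

class MiniCpmDemo {
    public static void main(String[] args) {
        List<MiniEngine> engines = new ArrayList<>();
        engines.add(cas -> cas.results.add("length=" + cas.documentText.length()));
        int n = MiniCpm.run(Arrays.asList("first doc", "second doc").iterator(),
                engines, cas -> System.out.println(cas.results));
        System.out.println("documents processed: " + n);
    }
}
```

In this sketch the list of engines and the consumer play the role of what a CPE descriptor names, while MiniCpm.run plays the role of the framework-provided execution engine.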

Is an XML Fragment Query supposed to be valid XML?

Not exactly. The XML Fragment query syntax used by the semantic search engine that is shipped with UIMA uses basic XML syntax as an intuitive way to describe hierarchical patterns of annotations that may occur in a CAS. It deviates from valid XML in a few ways in order to support queries over "overlapping" or "cross-over" annotations.

Does UIMA support modalities other than text?

The UIMA architecture supports the development, discovery, composition, and deployment of multi-modal analytics including text, audio, and video. However, this release of the SDK includes only documentation and programming examples for text analysis.

How does UIMA compare to other similar work?

A number of different frameworks for NLP have preceded UIMA. Two of them were developed at IBM Research and represent UIMA's early roots. For details, please see the UIMA article that appears in the IBM Systems Journal Vol. 43, No. 3.

UIMA has advanced the state of the art along a number of dimensions including support for distributed deployments in different middleware environments; easy framework embedding in different software product platforms (key for commercial applications); broader architectural coverage with its collection processing architecture; support for multiple modalities; support for efficient integration across programming languages; support for a modern software engineering discipline calling out different roles in the use of UIMA to develop applications; the extensive use of descriptive component metadata to support development tools; and component discovery and composition. (Please note that not all these features are available in this release of the SDK.)

The output in the viewer window appears to be missing the carriage-return, line-feed characters.

We've observed this problem with earlier releases of Java. Try running the SDK with the supplied IBM Java 1.4.2.

The printed version of the UIMA SDK user's guide has funny characters. What can I do?

We've observed that some printers print this PDF better if you select (on Windows) the Advanced button that appears on the Print window, and then change the Font and Resource Policy from Send by Range to Send at Start.

On Linux, the Java system seems to stop at random places and is unresponsive to any commands.

We've seen this behavior on some machines with hyperthreading enabled, on earlier versions of Linux. This problem disappeared when we upgraded to the current levels of the threading libraries.

The documentation says the UIMA.LOG file will be created in the "default directory." Where is this directory?

It is usually the directory you were in when you invoked UIMA. If you are running from Eclipse, it may be in the project you had selected when you did a "Run," or it may be the directory where the eclipse.exe file is.

JCasGen says it's generating in the default package, but then I see an exception being generated. What happened?

The CAS types in the UIMA SDK must have a CAS name space. You can't have a type named "MyType" -- it must have a name such as "com.myorg.MyType". The part of the name before the last period is the name space and is used in JCasGen to specify the package name of the generated files.
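
The naming rule can be sketched as a small helper. The TypeNames class is hypothetical and only mimics the rule described above; it is not JCasGen's actual code.

```java
// Hypothetical helper that mimics the naming rule; not JCasGen's implementation.
final class TypeNames {
    // Returns the name space: everything before the last period.
    static String packageOf(String typeName) {
        int lastDot = typeName.lastIndexOf('.');
        if (lastDot <= 0) {
            throw new IllegalArgumentException("type name has no name space: " + typeName);
        }
        return typeName.substring(0, lastDot);
    }

    // Returns the short type name: everything after the last period.
    static String simpleNameOf(String typeName) {
        return typeName.substring(typeName.lastIndexOf('.') + 1);
    }
}

class TypeNamesDemo {
    public static void main(String[] args) {
        String type = "com.myorg.MyType";
        System.out.println("package: " + TypeNames.packageOf(type));
        System.out.println("class:   " + TypeNames.simpleNameOf(type));
    }
}
```

A name such as "MyType" has no period, so there is no name space to use as a package, which is the situation that leads to the exception.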

The logging goes to the Console if I use a GUI application in Java 1.5. This didn't happen in earlier versions of Java and doesn't happen if I don't use a GUI.

Logging is controlled by a configuration file, which can be specified by passing a command line argument when Java starts up to set a Java system property. Java's implementation is to read this specification once when the logger is first initialized. As of Java 1.5, we've noticed that some GUI classes in Java are using logging. So if this system property is not set, and the GUI classes' use of the logger is the first to occur, the Java default applies, which is to log messages to the System.err output stream. If, later, the UIMA framework sees that no system property was set, it sets the property, but this has no effect because the logging configuration is already initialized.

To work around this problem, specify the logging configuration file explicitly on the Java invocation command line, or arrange to set the system property before any logging happens. For more information about logging, see the Logging section in Chapter 4 of the UIMA SDK documentation.
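
The second workaround can be sketched as follows. This assumes the framework logs through java.util.logging, whose configuration file is named by the standard JDK system property java.util.logging.config.file; the LoggingSetup class and the file name Logger.properties are hypothetical.

```java
// Hypothetical sketch; assumes logging goes through java.util.logging.
class LoggingSetup {
    // Standard JDK property that java.util.logging reads when it first initializes.
    static final String CONFIG_PROPERTY = "java.util.logging.config.file";

    // Sets the property only if the command line did not already provide it.
    static String ensureLoggingConfig(String defaultPath) {
        if (System.getProperty(CONFIG_PROPERTY) == null) {
            System.setProperty(CONFIG_PROPERTY, defaultPath);
        }
        return System.getProperty(CONFIG_PROPERTY);
    }

    public static void main(String[] args) {
        // Call this first, before any GUI class or logger is touched.
        String path = ensureLoggingConfig("Logger.properties"); // hypothetical file name
        System.out.println("logging configuration file: " + path);
        // ... create the GUI and start the application here ...
    }
}
```

The same property can be set without code by passing -Djava.util.logging.config.file=<path> on the Java command line.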

I can't see any Run menu item. What can I do?

Try switching to the Java perspective by selecting the following menu choices: Window -> Open Perspective -> Java.

When I invoke Run, instead of running, it shows a menu with "Do you want to Save"?

Eclipse checks to see if you've edited any files but not saved them, and if so, it will bring up this menu to give you the opportunity to save the files before running. The run action will happen after you decide whether you want to save the file(s) and take the appropriate action.

The Component Description Editor looks funny -- not as in the documentation.

This may be due to the editor not having enough room to be displayed. Try making the window larger. Try also double-clicking on the title tab for this editor at the top. This action should expand the window to the full Eclipse window. (You can return to the previous window configuration by double-clicking on the title tab again.)

The UIMA.LOG file is in my project directory; why don't I see it in the Package Explorer view of Eclipse?

To see it, select the project and press F5 or right-click and select Refresh. Eclipse caches a view of the file system; it must be occasionally told when things have changed in the file system and that it should refresh its views.

When using UIMA in WebSphere Information Integrator OmniFind, how can I modify the pear timeout value?

In some cases, it is necessary to modify the standard timeout configuration setting for a custom annotator. For example, if an annotator performs very complex text analysis, the default timeout value of 30 seconds may be too low. To change the timeout value, edit the custom annotator settings in EsCpeDescriptor.xml, which look like the following snippet:

<casProcessor deployment="remote" name="MyCustomAnnotator">
  <descriptor>
    <include href="/home/esadmin/config/col1.parserdriver/specifiers/EsSocketService.xml"/>
  </descriptor>
  <filter/>
  <errorHandling>
    <errorRateThreshold action="continue" value="0/100"/>
    <maxConsecutiveRestarts action="terminate" value="3"/>
    <timeout max="30000"/>
  </errorHandling>
  <checkpoint batch="1"/>
  <deploymentParameters>
    <parameter name="transport" type="string" value="com.ibm.es.control.casprocessor.server.CasProcessorSocketTransport"/>
  </deploymentParameters>
</casProcessor>


The timeout value is specified in milliseconds in the error handling section of the casProcessor. If the annotator does not return within this time, a timeout event is triggered, so increase the value if the annotator needs longer. After increasing the timeout value for the custom annotator, it is also necessary to increase the timeout value for the CPM output queue. This setting is also in EsCpeDescriptor.xml, at the end of the file, in the following tag:

<outputQueue dequeueTimeout="100000" .../>


Increase this timeout value by the same factor used for the custom annotator.
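
As a worked example of scaling both values by the same factor, the sketch below starts from the two values shown in the snippets above (30000 and 100000 milliseconds). The factor of 3 and the TimeoutScaling class are hypothetical.

```java
// Hypothetical helper for computing the two new timeout values.
final class TimeoutScaling {
    // Multiplies a timeout in milliseconds by the factor, failing on overflow.
    static long scale(long timeoutMillis, int factor) {
        if (factor < 1) {
            throw new IllegalArgumentException("factor must be at least 1");
        }
        return Math.multiplyExact(timeoutMillis, (long) factor);
    }

    public static void main(String[] args) {
        int factor = 3; // hypothetical choice
        System.out.println("<timeout max=\"" + scale(30_000L, factor) + "\"/>");
        System.out.println("dequeueTimeout=\"" + scale(100_000L, factor) + "\"");
    }
}
```

With a factor of 3, the annotator timeout becomes 90000 ms and the output queue timeout becomes 300000 ms.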

When using UIMA in WebSphere Information Integrator OmniFind, how can I change the Java heap size for my custom annotator?

The pear file, including the custom annotators that are associated with a collection, runs in a collection-specific, fenced box. The fenced box is a separate process called cas processor. To change the JVM heap size for that process, modify the following configuration file:

NodeRoot/master_config/colID_config.ini

Within the file, search for an expression such as:

sessionN.type=casprocessor

to get the session number for the current collection's cas processor. Once you have the session number, change the heap size in the following setting:

sessionN.max_heap=size in MB

The default heap size is 200 MB. Be careful when increasing the heap size. For additional help, see the memory recommendations in the OmniFind installation guide.
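
The lookup described above can be sketched in Java. The sketch assumes that N in sessionN is a decimal number and that each setting sits on its own line; the CasProcessorSession class and the sample lines are hypothetical, and the real file may contain other content.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes N in "sessionN" is a decimal number.
final class CasProcessorSession {
    private static final Pattern TYPE_LINE =
            Pattern.compile("^session(\\d+)\\.type\\s*=\\s*casprocessor$");

    // Returns the session number of the cas processor, or -1 if no line matches.
    static int findSessionNumber(List<String> iniLines) {
        for (String line : iniLines) {
            Matcher m = TYPE_LINE.matcher(line.trim());
            if (m.matches()) {
                return Integer.parseInt(m.group(1));
            }
        }
        return -1;
    }

    // Builds the setting line for the given session and heap size in MB.
    static String maxHeapSetting(int session, int heapMb) {
        return "session" + session + ".max_heap=" + heapMb;
    }

    public static void main(String[] args) {
        // Hypothetical file content for illustration.
        List<String> lines = Arrays.asList("session1.type=other", "session2.type=casprocessor");
        int session = findSessionNumber(lines);
        System.out.println(maxHeapSetting(session, 400));
    }
}
```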

When using UIMA in WebSphere Information Integrator OmniFind, how can I see my custom annotator log messages in the OmniFind logs?

All custom annotator log messages are written to the OmniFind parser service's audit log file, located at NodeRoot/logs/audit/parserservice_audit_currentDate.log. Within OmniFind there are three different log levels: Error, Warning, and Informational. The OmniFind log level for audit log files is set to Informational and cannot be changed to another value. Within the UIMA logging architecture, there are seven possible log levels (Error, Warning, Info, Config, Fine, Finer, and Finest); some can be additionally mapped to the OmniFind log levels. The default level mapping is as shown below:

OmniFind log level    UIMA log level
Error                 Error
Warning               Warning
Informational         Info
(not mapped)          Config, Fine, Finer, Finest
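
The default mapping can also be written as a small lookup. The enum and class names below are hypothetical and are not OmniFind or UIMA classes; only the level names and their default pairing come from the table.

```java
import java.util.Optional;

// Hypothetical sketch of the default mapping; not an OmniFind or UIMA class.
final class LogLevelMapping {
    enum UimaLevel { ERROR, WARNING, INFO, CONFIG, FINE, FINER, FINEST }

    enum OmniFindLevel { ERROR, WARNING, INFORMATIONAL }

    // Returns the OmniFind level for a UIMA level, or empty if it is not mapped by default.
    static Optional<OmniFindLevel> defaultMapping(UimaLevel level) {
        switch (level) {
            case ERROR:
                return Optional.of(OmniFindLevel.ERROR);
            case WARNING:
                return Optional.of(OmniFindLevel.WARNING);
            case INFO:
                return Optional.of(OmniFindLevel.INFORMATIONAL);
            default:
                return Optional.empty(); // Config, Fine, Finer, Finest
        }
    }

    public static void main(String[] args) {
        for (UimaLevel level : UimaLevel.values()) {
            System.out.println(level + " -> " + defaultMapping(level));
        }
    }
}
```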

Note that the mapping for Error and Warning messages cannot be changed. By default, only the custom annotator log messages with the levels Info, Warning, and Error are written to the log file. This default behavior can be replaced with a special log level mapping for log levels below Info, as follows:

For UIMA in WebSphere Information Integrator OmniFind: My annotator works on XML tags. It works in the SDK, but not in OmniFind. What's wrong?

The OmniFind XML parser models all XML tags as CAS annotations. They are removed from the actual document content. If you need to access XML information in your annotator, there are two ways of doing this, which can be combined: