Unstructured information management (UIM) applications are software systems
that analyze unstructured information (text, audio, video, images, and so
on) to discover, organize, and deliver relevant knowledge to the user. In
analyzing unstructured information, UIM applications make use of a variety
of analysis technologies, including statistical and rule-based Natural
Language Processing (NLP), Information Retrieval (IR), machine learning, and
ontologies. IBM's Unstructured Information Management Architecture (UIMA) is
an architectural and software framework that supports creation, discovery,
composition, and deployment of a broad range of analysis capabilities and
the linking of them to structured information services, such as databases or
search engines. The UIMA framework provides a run-time environment in which
developers can plug in and run their UIMA component implementations, along
with other independently-developed components, and with which they can build
and deploy UIM applications. The framework is not specific to any IDE or
platform.
This technology, the UIMA SDK (Software Development Kit), is an all-Java(TM)
implementation of the UIMA framework, and it supports the implementation,
description, composition, and deployment of UIMA components and
applications. It also provides developers with an Eclipse-based development
environment that includes a set of tools and utilities for using UIMA.
One large, but not the only, application area of text analysis is improving
text search. By detecting important terms and topics within documents,
semantic search engines provide the capability to search for concepts and
relationships instead of keywords. IBM's enterprise search solutions, IBM
OmniFind Enterprise Edition or IBM DB2 Warehouse Edition, have such semantic
search capabilities. They allow UIMA annotators to be plugged into the
OmniFind or DB2 Warehouse processing flow, enabling semantic search to be
performed on the extracted concepts. Another large application area is
information extraction. The text-analysis functions of IBM DB2 Warehouse
Edition focus on information extraction that creates structured data out of
unstructured data. DB2 Warehouse Edition allows UIMA annotators to be
plugged into a Mining flow, enabling the extraction of information that can
then be analyzed together with structured information by using business
intelligence tools. Because UIMA is used and developed by both IBM research
and development teams, the UIMA SDK is available from two locations:
The UIMA SDK on alphaWorks is the "early adopter" version of the SDK. It
is intended for users who don't use OmniFind or DB2 Warehouse, or who want
to use features of UIMA that may not be supported by OmniFind or DB2
Warehouse. The alphaWorks SDK is also a test bed to gather feedback on new
features of the UIMA SDK. Its versions may evolve more rapidly, and are
not tied to specific OmniFind or DB2 Warehouse releases. The SDK is
supported on a "best can do" basis, by way of the alphaWorks forum. The
Java source code for core components of the alphaWorks SDK is available at
SourceForge.
The UIMA SDK on developerWorks is the "OmniFind-compatible" and "DB2
Warehouse-compatible" version of the SDK. It is intended for users who
want to develop and deploy semantic search solutions with IBM OmniFind
Enterprise Edition or solutions that take advantage of OmniFind's
capabilities for enterprise-scale document crawling and extraction. It is
also intended for users who want to develop and deploy text-analysis
projects with IBM DB2 Warehouse Edition. The developerWorks SDK is tested
for compatibility with a specific OmniFind and DB2 Warehouse version
and will be updated to keep in sync with new OmniFind and DB2
Warehouse releases. As the SDK evolves, prior versions will still be
available on developerWorks, to ensure that each supported OmniFind and
DB2 Warehouse version has a corresponding SDK. For customers who have
an OmniFind or DB2 Warehouse license, this SDK is supported by way of
the IBM support channels and also through the developerWorks forum.
UIMA is an architecture in which basic building blocks called Analysis
Engines (AEs) are composed in order to analyze a document. At the heart of
AEs are the analysis algorithms that do all the work to analyze documents
and record analysis results (for example, detecting person names). These
algorithms are packaged within components that are called Annotators. AEs
are the stackable containers for annotators and other analysis engines.
How Annotators represent and share their results is an important part of
the UIMA architecture. To enable composition and reuse, UIMA defines a
Common Analysis Structure (CAS) precisely for these purposes. The CAS is an
object-based container that manages and stores typed objects having
properties and values. Object types may be related to each other in a
single-inheritance hierarchy. Annotators are given a CAS having the subject
of analysis (the document), in addition to any previously created objects
(from annotators earlier in the pipeline), and they add their own objects to
the CAS. The CAS serves as a common data object, shared among the annotators
that are assembled for an application.
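To make the relationship between the CAS, annotators, and aggregate analysis
engines more concrete, the following is a minimal Java sketch of the idea. It
does not use the UIMA SDK API: every class name here (ToyCas, ToyAnnotation,
ToyAnnotator, ToyAggregate, and the two example annotators) is hypothetical,
the CAS is reduced to a document string plus a flat list of typed spans with
no type inheritance, and the person-name rule (two capitalized words
separated by one space) is deliberately naive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy model for illustration only; these are NOT UIMA SDK classes.
class ToyCasDemo {
    // Runs a two-step pipeline and returns the text covered by each PersonName.
    static List<String> findPersonNames(String text) {
        ToyCas cas = new ToyCas(text);
        ToyAnnotator pipeline = new ToyAggregate(Arrays.<ToyAnnotator>asList(
                new CapitalizedWordAnnotator(), new PersonNameAnnotator()));
        pipeline.process(cas);
        List<String> names = new ArrayList<>();
        for (ToyAnnotation a : cas.select("PersonName")) {
            names.add(cas.coveredText(a));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(findPersonNames(
                "The report says Ada Lovelace met with Charles Babbage in London."));
    }
}

// A typed result object: a type name plus a span of the document text.
final class ToyAnnotation {
    final String type;
    final int begin; // start offset, inclusive
    final int end;   // end offset, exclusive
    ToyAnnotation(String type, int begin, int end) {
        this.type = type; this.begin = begin; this.end = end;
    }
}

// Shared container: the subject of analysis plus all objects added so far.
final class ToyCas {
    private final String documentText;
    private final List<ToyAnnotation> annotations = new ArrayList<>();
    ToyCas(String documentText) { this.documentText = documentText; }
    String getDocumentText() { return documentText; }
    void add(ToyAnnotation a) { annotations.add(a); }
    String coveredText(ToyAnnotation a) { return documentText.substring(a.begin, a.end); }
    // Returns a copy, so callers may add to the CAS while iterating the result.
    List<ToyAnnotation> select(String type) {
        List<ToyAnnotation> out = new ArrayList<>();
        for (ToyAnnotation a : annotations) {
            if (a.type.equals(type)) out.add(a);
        }
        return out;
    }
}

interface ToyAnnotator { void process(ToyCas cas); }

// First step: marks every capitalized word in the document.
final class CapitalizedWordAnnotator implements ToyAnnotator {
    private static final Pattern WORD = Pattern.compile("\\b[A-Z][a-z]+\\b");
    public void process(ToyCas cas) {
        Matcher m = WORD.matcher(cas.getDocumentText());
        while (m.find()) cas.add(new ToyAnnotation("CapitalizedWord", m.start(), m.end()));
    }
}

// Second step: builds on the earlier results already stored in the CAS.
final class PersonNameAnnotator implements ToyAnnotator {
    public void process(ToyCas cas) {
        List<ToyAnnotation> words = cas.select("CapitalizedWord");
        for (int i = 0; i + 1 < words.size(); i++) {
            ToyAnnotation first = words.get(i), second = words.get(i + 1);
            boolean adjacent = second.begin - first.end == 1
                    && cas.getDocumentText().charAt(first.end) == ' ';
            if (adjacent) cas.add(new ToyAnnotation("PersonName", first.begin, second.end));
        }
    }
}

// Container that runs its delegates in order; it is itself a ToyAnnotator,
// so aggregates can contain other aggregates.
final class ToyAggregate implements ToyAnnotator {
    private final List<ToyAnnotator> delegates;
    ToyAggregate(List<ToyAnnotator> delegates) { this.delegates = new ArrayList<>(delegates); }
    public void process(ToyCas cas) {
        for (ToyAnnotator a : delegates) a.process(cas);
    }
}
```

In this sketch the second annotator does not search for capitalized words
itself; it reads the objects that the first annotator left in the shared
container and adds its own, which is the kind of reuse the CAS is meant to
enable. Because the hypothetical ToyAggregate is itself a ToyAnnotator,
aggregates can be nested inside other aggregates.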
Many UIM applications analyze entire collections of documents. UIMA
supports this analysis through its Collection Processing Architecture. This
part of the architecture allows specification of a "source-to-sink" flow
from a collection reader through a set of analysis engines and then to a set
of CAS Consumers. The collection reader's job is to connect to and iterate
through a source collection, acquiring documents and initializing CASes for
analysis. After the analysis engines have added their information to the
CAS, CAS consumers do the final CAS processing, for example, sending the CAS
contents to a search engine or extracting elements of interest and
populating a relational database. A Semantic Search engine is included in
the UIMA SDK; it allows the developer to experiment with indexing
analysis results, which enables semantic searches using the
annotations in the CAS.
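The "source-to-sink" flow can be sketched in plain Java as follows. This is
only an illustration under stated assumptions, not the UIMA Collection
Processing API: the names MiniCas, ToyCollectionReader, ToyEngine,
ToyCasConsumer, and ToyCollectionProcessor are hypothetical, the collection
is an in-memory map of document IDs to text, and the consumer builds a small
inverted index as a stand-in for sending CAS contents to a search engine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy model for illustration only; these are NOT UIMA SDK classes.
class ToyCollectionDemo {
    // Runs reader -> engines -> consumers and returns the index built at the sink.
    static Map<String, TreeSet<String>> indexCollection(Map<String, String> docsById) {
        IndexingConsumer indexer = new IndexingConsumer();
        ToyCollectionProcessor.run(
                new InMemoryReader(docsById),
                Arrays.<ToyEngine>asList(new TermEngine()),
                Arrays.<ToyCasConsumer>asList(indexer));
        return indexer.index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("d1", "UIMA analyzes text");
        docs.put("d2", "Text search");
        System.out.println(indexCollection(docs));
    }
}

// One CAS per document: the text plus whatever the engines add to it.
final class MiniCas {
    final String docId;
    final String text;
    final List<String> terms = new ArrayList<>();
    MiniCas(String docId, String text) { this.docId = docId; this.text = text; }
}

interface ToyCollectionReader { boolean hasNext(); MiniCas next(); }
interface ToyEngine { void process(MiniCas cas); }
interface ToyCasConsumer { void consume(MiniCas cas); }

// Source: iterates over the collection and initializes one CAS per document.
final class InMemoryReader implements ToyCollectionReader {
    private final Iterator<Map.Entry<String, String>> it;
    InMemoryReader(Map<String, String> docsById) { this.it = docsById.entrySet().iterator(); }
    public boolean hasNext() { return it.hasNext(); }
    public MiniCas next() {
        Map.Entry<String, String> e = it.next();
        return new MiniCas(e.getKey(), e.getValue());
    }
}

// Analysis step: adds lowercased word terms to the CAS.
final class TermEngine implements ToyEngine {
    public void process(MiniCas cas) {
        for (String t : cas.text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) cas.terms.add(t);
        }
    }
}

// Sink: final processing of each CAS, here a term -> document IDs index.
final class IndexingConsumer implements ToyCasConsumer {
    final Map<String, TreeSet<String>> index = new TreeMap<>();
    public void consume(MiniCas cas) {
        for (String term : cas.terms) {
            index.computeIfAbsent(term, k -> new TreeSet<>()).add(cas.docId);
        }
    }
}

final class ToyCollectionProcessor {
    // Each CAS passes through every engine, then is handed to every consumer.
    static void run(ToyCollectionReader reader, List<ToyEngine> engines,
                    List<ToyCasConsumer> consumers) {
        while (reader.hasNext()) {
            MiniCas cas = reader.next();
            for (ToyEngine e : engines) e.process(cas);
            for (ToyCasConsumer c : consumers) c.consume(cas);
        }
    }
}
```

The reader, engines, and consumers only meet through the CAS object that is
passed along, so in this sketch any of the three stages can be swapped (for
example, a consumer that writes rows to a database) without touching the
others.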
IBM has started an OASIS working group to create an open standard for UIMA
applications. The purpose of this working group is the creation of standards
to ensure interoperability between different UIM applications and thus
create an open ecosystem of unstructured analysis platforms and
applications. Participation in the working group is open to all OASIS
members. Concurrently, IBM has donated the source code to the Apache
Software Foundation, and Apache has accepted UIMA as an Incubator project.
All new UIMA development will happen on Apache, and new Apache UIMA
releases will be made available there. It will be some time before the first
release is available from Apache. We will continue to support previous
versions of UIMA through developerWorks.
XMI support has been added. There are two new chapters in the user's
guide describing this support. As part of this change, additional
information can now be specified in type system feature descriptions for
features whose types are arrays or lists, including the type of the
elements of these collections.
A new utility to merge two or more PEAR files has been added, and is
described in the user's guide.
Please see the release notes for details on other enhancements and bug
fixes.
The UIMA SDK is being developed by teams from IBM Research and IBM Software
Group. It is a world-wide effort, with significant participation from the
following IBM sites: