UIMA stands for Unstructured Information Management Architecture. It is a
component software architecture for the development, discovery, composition,
and deployment of multi-modal analytics for the analysis of unstructured
information and its integration with search technologies. UIMA processing
occurs through a series of modules called analysis engines. The result of
analysis is an assignment of semantics to the elements of unstructured data,
for example, the indication that the phrase "Washington" refers to a
person's name or that it refers to a place. UIMA supports the rendering of
these results in conventional structures (for example, relational databases
or search engine indices), where the content of the original unstructured
information may efficiently be accessed according to its inferred semantics.
UIMA is specifically designed to support the developer in the creation,
integration, deployment, and sharing of components across platforms and
among dispersed teams with different skills working to develop advanced
analytics.
UIMA is an architecture that specifies
component interfaces, design patterns, data representations, and development
roles. The UIMA Software Development Kit (SDK) is a software system that
includes a run-time framework, APIs, and tools for implementing, composing,
packaging, and deploying UIMA components. It comes with a semantic search
engine for indexing and querying over the results of analysis. The UIMA
run-time framework allows developers to plug in their components and
applications and run them on different platforms and according to different
deployment options that range from tightly-coupled (running in the same
process space) to loosely-coupled (distributed across different processes or
machines for greater scale, flexibility, and recoverability).
UIMA text analysis engines and annotators are already used within several
IBM products, including IBM's new enterprise search product,
WebSphere Information Integrator OmniFind Edition,
and IBM's WebSphere Portal product. All new text analysis technology that is
being put into IBM products is based on UIMA components.
Yes. The UIMA license does not restrict its usage to specific scenarios,
and we are of course very interested in your feedback, which will help us
make UIMA the right platform for building UIM applications. Please note,
however, that we currently offer support on a "best we can do" basis. If you
are interested in a more formal support agreement, or if you would like to
include UIMA in a commercial solution, please contact IBM for additional
options.
An annotation is a label, typically represented as a string of characters,
associated with a region of a document. An example is the label "Person"
associated with the span of text "George Washington". We say that "Person"
annotates "George Washington" in the sentence "George Washington was the
first president of the United States". The association of the label "Person"
with a particular span of text is an annotation.
Annotations are not limited to text. A label may annotate a region of an
image or a segment of audio. The same concepts apply.
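The idea can be sketched in a few lines of Python. This is only an illustration and not the UIMA API: the Annotation class and its label, begin, and end fields are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical stand-in for an annotation: a label over a character span."""
    label: str
    begin: int  # offset of the first character of the span
    end: int    # offset one past the last character of the span

    def covered_text(self, document: str) -> str:
        """Return the region of the document that this label annotates."""
        return document[self.begin:self.end]


document = "George Washington was the first president of the United States"
span = "George Washington"
begin = document.index(span)

# "Person" annotates the span "George Washington".
person = Annotation("Person", begin, begin + len(span))
covered = person.covered_text(document)
```

For an image or audio artifact, the begin and end offsets would be replaced by whatever identifies a region in that modality.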
The CAS stands for Common Analysis Structure. It provides cooperating UIMA
components with a common representation and mechanism for shared access to
the artifact being analyzed (for example, a document, audio file, video
stream, etc.) and the current analysis results.
No. The CAS contains the artifact being analyzed and the analysis results.
Analysis results are those statements recorded by analysis engines in the
CAS. The most common form of analysis result is the addition of an
annotation. But an analysis engine may write any structure that conforms to
the CAS's type system into the CAS. These may not be annotations but may be
other things, such as links between annotations and properties of objects
associated with annotations.
No; in fact there are many possible representations of the CAS. If all of
the analysis engines are running in the same process, an efficient,
in-memory data object is used. If a CAS must be sent to an analysis engine
on a remote machine, it can be done via an XML or a binary serialization of
the CAS. UIMA specifies an XML representation of the CAS.
Think of a type system as a schema for the CAS. It defines the types of
objects and their properties (or features) that may be instantiated in a
CAS. A CAS conforms to a particular type system. UIMA components declare
their input and output with respect to a type system. Type systems include
the definitions of types, their properties, and single-inheritance hierarchy
of types.
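As a rough sketch of the schema idea, the Python below models a type system as a table of types, each with one parent and a set of features. The dictionary layout, the type com.myorg.Person, and the helper names are assumptions made for this illustration; they are not how the UIMA framework represents type systems.

```python
# Hypothetical, simplified type system: every type has at most one parent
# (single inheritance) and declares a set of feature names.
TYPE_SYSTEM = {
    "Annotation": {"parent": None, "features": {"begin", "end"}},
    "com.myorg.Person": {"parent": "Annotation", "features": {"name"}},
}


def all_features(type_name, type_system):
    """Collect the features declared on a type and on all of its ancestors."""
    features = set()
    while type_name is not None:
        entry = type_system[type_name]
        features |= entry["features"]
        type_name = entry["parent"]
    return features


def conforms(type_name, values, type_system):
    """True if the type is defined and every feature used is declared or inherited."""
    if type_name not in type_system:
        return False
    return set(values) <= all_features(type_name, type_system)


ok = conforms("com.myorg.Person", {"begin": 0, "end": 17, "name": "George Washington"}, TYPE_SYSTEM)
bad = conforms("com.myorg.Person", {"begin": 0, "end": 17, "age": 67}, TYPE_SYSTEM)
```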
In the terminology of UIMA, an annotator is simply some code that analyzes
documents and puts out annotations on the content of the documents. The UIMA
framework takes the annotator, together with metadata describing such things
as the input requirements and output of the annotator, and produces an
analysis engine. Analysis engines contain the framework-provided
infrastructure that allows them to be easily combined with other analysis
engines in different flows and according to different deployment options
(collocated or as Web services, for example).
The UIMA framework allows components such as analysis engines and CAS
consumers to be easily deployed as services or in other containers and
managed by systems middleware designed to be scaled. UIMA applications tend
to scale out naturally across documents, allowing many documents to be
analyzed in parallel.
An example of an embedding would be the deployment of a UIMA analysis
engine as an Enterprise Java Bean inside an application server such as IBM
WebSphere. Such an embedding allows the deployer to take advantage of the
features and tools provided by WebSphere for achieving scalability, service
management, recoverability, etc. UIMA is independent of any particular
systems middleware, so analysis engines could be deployed on other types of
middleware as well.
Technically, no. But analysis engine developers are encouraged not to
maintain state between documents that would prevent their engine from
working as advertised if switched into a different flow or onto a different
document collection.
UIMA defines another type of component, the CAS Consumer, which is intended
to maintain state across documents and is typically associated with some
resource such as a database or search engine that aggregates analysis
results across an entire collection.
All UIMA component implementations are associated with an XML descriptor
that represents captured metadata describing various properties about the
component in order to support discovery, reuse, validation, automatic
composition, and development tooling. In principle, UIMA component metadata
is compatible with Web services and UDDI. However, the UIMA framework
currently uses its own XML representation for this metadata. It would not be
difficult to convert between UIMA's XML representation and the WSDL and UDDI
standards.
The UIMA framework includes a Collection Processing Manager or CPM for
managing the execution of a workflow of UIMA components orchestrated to
analyze a large collection of documents. The UIMA developer does not
implement or describe a CPM. It is a built-in part of the framework. It is a
piece of infrastructure code that handles CAS transport, instance
management, batching, check-pointing, statistics collection, and failure
recovery in the execution of this collection processing workflow.
A Collection Processing Engine (CPE) is a component that the UIMA developer
creates by specifying a CPE descriptor. A CPE descriptor points to a series
of UIMA components, including a Collection Reader, CAS Initializer, Analysis
Engine(s), and CAS Consumers. These components organized in a particular
flow define a collection analysis job that acquires documents from a source
collection, initializes CASs with document content, performs document
analysis, and then produces collection level results (for example, search
engine index, database, and so on). The CPM is the execution engine for a
CPE.
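The flow that a CPE describes can be pictured with a toy pipeline in Python. Everything here is a hypothetical stand-in (the function names, the dictionary used as a CAS, and the capitalized-word annotator); in UIMA the components are named in a CPE descriptor and the CPM runs them. The sketch shows only the division of labor: the analysis engine keeps no state between documents, while the CAS consumer aggregates results across the collection.

```python
from collections import Counter


def collection_reader():
    """Yield raw documents from a (here, in-memory) source collection."""
    yield from ["Washington visited Boston", "Boston is a city"]


def cas_initializer(text):
    """Create a toy CAS: the artifact plus an empty list of analysis results."""
    return {"text": text, "annotations": []}


def capitalized_word_annotator(cas):
    """Stateless toy analysis engine: annotate every capitalized word."""
    offset = 0
    for word in cas["text"].split(" "):
        if word[:1].isupper():
            cas["annotations"].append(("Capitalized", offset, offset + len(word)))
        offset += len(word) + 1  # skip the word and the following space
    return cas


class CountingConsumer:
    """Toy CAS consumer: keeps collection-level state across documents."""

    def __init__(self):
        self.counts = Counter()

    def process(self, cas):
        for _label, begin, end in cas["annotations"]:
            self.counts[cas["text"][begin:end]] += 1


def run(reader, initializer, engines, consumer):
    """Stand-in for the execution engine: push each document through the flow."""
    for document in reader():
        cas = initializer(document)
        for engine in engines:
            cas = engine(cas)
        consumer.process(cas)
    return consumer


consumer = run(collection_reader, cas_initializer, [capitalized_word_annotator], CountingConsumer())
```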
Not exactly. The XML Fragment query syntax used by the semantic search
engine that is shipped with UIMA uses basic XML syntax as an intuitive way
to describe hierarchical patterns of annotations that may occur in a CAS. It
deviates from valid XML in a few ways in order to support queries over
"overlapping" or "cross-over" annotations.
The UIMA architecture supports the development, discovery, composition, and
deployment of multi-modal analytics including text, audio, and video.
However, this release of the SDK includes only documentation and programming
examples for text analysis.
A number of different frameworks for NLP have preceded UIMA. Two of them
were developed at IBM Research and represent UIMA's early roots. For
details, please see the UIMA article that appears in the
IBM Systems Journal Vol. 43, No. 3.
UIMA has advanced the state of the art along a number of dimensions
including support for distributed deployments in different middleware
environments; easy framework embedding in different software product
platforms (key for commercial applications); broader architectural coverage
with its collection processing architecture; support for
multiple-modalities; support for efficient integration across programming
languages; support for a modern software engineering discipline calling out
different roles in the use of UIMA to develop applications; the extensive
use of descriptive component metadata to support development tools; and
component discovery and composition. (Please note that not all these
features are available in this release of the SDK.)
We've observed that some printers print this PDF better if you select (on
Windows), the Advanced button that appears on the Print window, and then
change the Font and Resource Policy: from Send by Range to
Send at Start.
We've seen this behavior on some machines with hyperthreading enabled, on
earlier versions of Linux. This problem disappeared when we upgraded to the
current levels of the threading libraries.
It is usually the directory you were in when you invoked UIMA. If you are
running from Eclipse, it may be in the project you had selected when you did
a "Run," or it may be the directory where the eclipse.exe file is.
The CAS types in the UIMA SDK must have a CAS name space. You can't have a
type named "MyType" -- it must have a name such as "com.myorg.MyType". The
part of the name before the last period is the name space and is used in
JCasGen to specify the package name of the generated files.
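The naming rule can be checked with a small helper. This is a sketch in Python rather than anything shipped with the SDK, and split_type_name is a hypothetical name.

```python
def split_type_name(full_name: str):
    """Split a CAS type name at its last period into (name space, short name).

    Raises ValueError when the name has no name space, such as "MyType".
    """
    namespace, separator, short_name = full_name.rpartition(".")
    if not separator or not namespace or not short_name:
        raise ValueError(f"type name {full_name!r} must have a name space")
    return namespace, short_name


namespace, short_name = split_type_name("com.myorg.MyType")

try:
    split_type_name("MyType")
    rejected = False
except ValueError:
    rejected = True
```

For "com.myorg.MyType", the name space "com.myorg" is the part that would become the package name of the generated files.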
Logging is controlled by a configuration file, which can be specified by
passing a command line argument when Java starts up to set a Java system
property. Java's implementation is to read this specification once when the
logger is first initialized. As of Java 1.5, we've noticed that some GUI
classes in Java are using logging. So if this system parameter is not set,
and the GUI classes' use of the logger is the first to occur, there will be
a Java default, which is to log messages to the System.err output stream. If,
later, the UIMA framework sees that no system property was set, it sets the
property -- but it has no effect because the logging configuration is
already initialized.
To work around this problem, specify the logging configuration file
explicitly on the Java invocation command line, or arrange to set the system
property before any logging happens. For more information about logging, see
the Logging section in Chapter 4 of the UIMA SDK documentation.
Eclipse checks to see if you've edited any files but not saved them, and if
so, it will bring up this menu to give you the opportunity to save the files
before running. The run action will happen after you decide whether you want
to save the file(s) and take the appropriate action.
This may be due to the editor not having enough room to be displayed. Try
making the window larger. Try also double-clicking on the title tab for this
editor at the top. This action should expand the window to the full Eclipse
window. (You can return to the previous window configuration by
double-clicking on the title tab again).
To see it, select the project and press F5 or right-click and select
Refresh. Eclipse caches a view of the file system; it must
occasionally be told when things have changed in the file system so that it
can refresh its views.
In some cases, it is necessary to modify the standard timeout
configuration setting for a custom annotator. For example, if an annotator
performs very complex text analysis, the default timeout value of
30 seconds may be too low. To change the timeout value, edit the custom
annotator settings in the EsCpeDescriptor.xml file.
The timeout value is specified in milliseconds in the error handling
section of the casProcessor. If the annotator does not return within this
time, a timeout event is triggered, so increase the value to give the
annotator more time. After
increasing the timeout value for the custom annotator, it is also necessary
to increase the timeout value for the CPM output queue. The necessary
setting is also in the EsCpeDescriptor.xml at the end of the file. The tag
is called
<outputQueue dequeueTimeout="100000" .../>
Increase this timeout value by the same factor used for the custom
annotator.
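The Python sketch below scales both values by the same factor. The descriptor fragment in it is a cut-down, hypothetical stand-in for EsCpeDescriptor.xml: only the outputQueue tag and its dequeueTimeout attribute come from the text above, while the annotator name MyCustomAnnotator, the element nesting, and the timeout element with a max attribute are assumptions to check against the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed descriptor; the real file contains much more.
DESCRIPTOR = """
<cpeDescription>
  <casProcessors>
    <casProcessor name="MyCustomAnnotator">
      <errorHandling>
        <timeout max="30000"/>
      </errorHandling>
    </casProcessor>
  </casProcessors>
  <cpeConfig>
    <outputQueue dequeueTimeout="100000"/>
  </cpeConfig>
</cpeDescription>
"""


def scale_timeouts(xml_text, annotator_name, factor):
    """Multiply the annotator timeout and the output queue timeout by factor."""
    root = ET.fromstring(xml_text)
    for processor in root.iter("casProcessor"):
        if processor.get("name") == annotator_name:
            timeout = processor.find("errorHandling/timeout")
            timeout.set("max", str(int(timeout.get("max")) * factor))
    queue = next(root.iter("outputQueue"))
    queue.set("dequeueTimeout", str(int(queue.get("dequeueTimeout")) * factor))
    return root


updated = scale_timeouts(DESCRIPTOR, "MyCustomAnnotator", 4)
new_timeout = updated.find("casProcessors/casProcessor/errorHandling/timeout").get("max")
new_dequeue = updated.find("cpeConfig/outputQueue").get("dequeueTimeout")
```

Editing the file by hand works just as well; what matters is that both values grow by the same factor.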
The PEAR file, including the custom annotators that are associated with a
collection, runs in a collection-specific, fenced box. The fenced box
is a separate process called the cas processor. To change the JVM heap
size for that process, modify the following configuration file:
NodeRoot/master_config/colID_config.ini
Within the file, search for an expression such as
sessionN.type=casprocessor
to get the session number for the current collection's cas processor. Once
you have the session number, change the heap size in the following setting:
sessionN.max_heap=size in MB
The default heap size is 200 MB. Be careful when increasing the heap size.
For additional help, see the memory recommendations in the OmniFind
installation guide.
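The two steps (find the session number, then change its max_heap setting) can be sketched in Python as follows. The sample file content is invented for illustration, including the session1 entries; only the sessionN.type=casprocessor and sessionN.max_heap keys come from the description above, and set_casprocessor_heap is a hypothetical helper.

```python
import re


def set_casprocessor_heap(ini_text: str, heap_mb: int) -> str:
    """Return ini_text with max_heap of the cas processor session set to heap_mb."""
    # Step 1: find the session whose type is casprocessor.
    match = re.search(r"^(session\d+)\.type=casprocessor\s*$", ini_text, re.MULTILINE)
    if match is None:
        raise ValueError("no cas processor session found")
    session = match.group(1)

    # Step 2: rewrite that session's max_heap line (value in MB).
    heap_line = re.compile(rf"^{re.escape(session)}\.max_heap=.*$", re.MULTILINE)
    if not heap_line.search(ini_text):
        raise ValueError(f"no max_heap setting found for {session}")
    return heap_line.sub(f"{session}.max_heap={heap_mb}", ini_text)


# Invented sample content, not copied from a real colID_config.ini.
SAMPLE = """session1.type=parser
session1.max_heap=400
session2.type=casprocessor
session2.max_heap=200
"""

updated = set_casprocessor_heap(SAMPLE, 300)
```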
All custom annotator log messages are written to the OmniFind parser
service's audit log file, located at
NodeRoot/logs/audit/parserservice_audit_currentDate.log. Within OmniFind
there are three different log levels: Error, Warning, and Informational. The
OmniFind log level for audit log files is set to Informational and cannot be
changed to another value. Within the UIMA logging architecture, there are
seven possible log levels (Error, Warning, Info, Config, Fine, Finer, and
Finest); some can be additionally mapped to the OmniFind log levels. The
default level mapping is as shown below:
OmniFind log level    UIMA log level
Error                 Error
Warning               Warning
Informational         Info
(not mapped)          Config, Fine, Finer, Finest
Note that the mapping for Error and Warning messages cannot be changed. By
default, only the custom annotator log messages with the levels Info,
Warning, and Error are written to the log file. This default behavior can be
replaced with a special log level mapping for log levels below Info, as
follows:
Modify the tokenizer.properties config file in the directory
EsNodeRoot/master_config/parserservice/.
Inside this file, look for a level configuration setting such as
trevi.tokenizer.jedii.InformationalLevelMapping=Info.
In order to see more than UIMA annotator Info messages in the log
file, replace this log level value with the desired UIMA log level. For
example, use
trevi.tokenizer.jedii.InformationalLevelMapping=Finest in order
to see all UIMA annotator log messages in the OmniFind audit log.
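One way to read the mapping is as a threshold: Error and Warning always pass through unchanged, and the InformationalLevelMapping value sets the lowest UIMA level that is still written as an Informational message. The Python sketch below encodes that reading. It is an interpretation of the behavior described here, not OmniFind code, and omnifind_level is a hypothetical function.

```python
# UIMA log levels as listed above, from least to most severe.
UIMA_LEVELS = ["Finest", "Finer", "Fine", "Config", "Info", "Warning", "Error"]


def omnifind_level(uima_level, informational_mapping="Info"):
    """Return the OmniFind level for a UIMA message, or None if it is not written."""
    if uima_level in ("Error", "Warning"):
        return uima_level  # fixed mappings that cannot be changed
    if UIMA_LEVELS.index(uima_level) >= UIMA_LEVELS.index(informational_mapping):
        return "Informational"
    return None


default_fine = omnifind_level("Fine")              # default mapping (Info)
verbose_fine = omnifind_level("Fine", "Finest")    # InformationalLevelMapping=Finest
```

Under the default mapping a Fine message is dropped; with the mapping set to Finest it appears in the audit log as an Informational message.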
The OmniFind XML parser models all XML tags as CAS annotations. They are
removed from the actual document content. If you need to access XML
information in your annotator, there are two ways of doing this, which can
be combined:
Option 1: If you enable native XML search on the parse panel of your collection,
OmniFind will create an Annotation of type com.ibm.es.tt.MarkupTag for
each XML tag found in a document. This annotation contains all the
information of the original XML tag, namely, its attributes and their
content. Moreover, OmniFind will automatically index these XML tags
under the name in which they appear in the XML file, so you can use them
for semantic searching right away. Your annotators could access these
annotations in their processing instead of relying on the XML tags. They
would need to iterate over com.ibm.es.tt.MarkupTag and look at the name
feature of the annotation in order to find out which XML tag it
represented originally.
Option 2: You can specify a so-called "XML to CAS" mapping file. In this file,
you specify which XML tags should be mapped to which CAS types. OmniFind
will automatically create annotations for these XML tags. This would
make it even easier for your annotators to access certain XML tags than
in Option 1. For example, if one annotator is interested only in content
within <technicianComments>, you could specify a mapping
from this tag to a type com.yourco.TechnicianComment. Then your
annotator needs to iterate only over annotations of this type. In the case
of "XML to CAS" mapping, OmniFind doesn't index the XML tags
automatically. If you still want to search for, say,
<technicianComments>, you have two options:
Additionally enable native XML search
In your CAS2Index mapping, add a rule that maps
com.yourco.TechnicianComment to the span technicianComment.
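The difference between the two access patterns can be sketched in Python. The Annot class and the helper functions are hypothetical stand-ins for CAS annotations and iterators, and the partNumber tag is invented; only the type names com.ibm.es.tt.MarkupTag and com.yourco.TechnicianComment and the name feature come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Annot:
    """Hypothetical stand-in for a CAS annotation."""
    type_name: str
    begin: int
    end: int
    features: dict = field(default_factory=dict)


def markup_tags_named(annotations, tag_name):
    """Native XML search: scan all MarkupTag annotations and check the name feature."""
    return [a for a in annotations
            if a.type_name == "com.ibm.es.tt.MarkupTag"
            and a.features.get("name") == tag_name]


def of_type(annotations, type_name):
    """XML to CAS mapping: iterate only over annotations of the mapped type."""
    return [a for a in annotations if a.type_name == type_name]


annotations = [
    Annot("com.ibm.es.tt.MarkupTag", 0, 24, {"name": "technicianComments"}),
    Annot("com.ibm.es.tt.MarkupTag", 25, 40, {"name": "partNumber"}),
    Annot("com.yourco.TechnicianComment", 0, 24),
]

by_tag = markup_tags_named(annotations, "technicianComments")
by_type = of_type(annotations, "com.yourco.TechnicianComment")
```

Both calls find the same region of the document; the mapped type just saves the annotator from inspecting every markup tag.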