IBM Content Analytics with Enterprise Search, Version 3.0.0

Basic concepts used in text analysis processing

Basic concepts that are used in text analysis processing include annotators, analysis results, feature structure, type, type system, annotation, and common analysis structure.

Annotators contain the logic that analyzes a document and discovers and records descriptive data about the document as a whole (referred to as document metadata) and about parts of the document. This descriptive data is referred to as analysis results. An analysis result can annotate any contiguous substring (also referred to as a span) of the text document. Ideally, the analysis results correspond to the information that you want to search for.

A feature structure is the underlying data structure that represents an analysis result. A feature structure is an attribute-value structure. Each feature structure has a type, and every type has a specified set of valid features (also called attributes or properties), much like a Java class. Each feature has a range type that indicates the type of value that the feature must have, such as String. All annotators in the Unstructured Information Management Architecture (UIMA) store data in feature structures.

For example, the text span "James Matthew Bloggs" might be spanned by an annotation of type Person with the features personName, age, nationality and profession.

The type system defines the types of objects (feature structures) that may be discovered in a document. The type system defines all possible feature structures in terms of types and features (attributes), much like a class hierarchy in Java. You can define any number of different types in a type system. A type system is domain and application specific.
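The relationship between types, features, range types, and feature structures can be sketched in plain Java. This sketch is illustrative only and does not use the UIMA API: the class names TypeSystemSketch, Type, and FeatureStructure are hypothetical, and the range types that are chosen for the Person features (for example, Integer for age) are assumptions.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative model only; these classes are not part of the UIMA API.
class TypeSystemSketch {

    /** A type names its valid features and the range (value class) of each. */
    static final class Type {
        final String name;
        final Map<String, Class<?>> features = new LinkedHashMap<>();

        Type(String name) {
            this.name = name;
        }

        /** Declares a feature and its range type; returns this for chaining. */
        Type feature(String featureName, Class<?> range) {
            features.put(featureName, range);
            return this;
        }
    }

    /** A feature structure is an attribute-value structure of exactly one type. */
    static final class FeatureStructure {
        final Type type;
        private final Map<String, Object> values = new HashMap<>();

        FeatureStructure(Type type) {
            this.type = type;
        }

        /** Stores a value only if the type declares the feature and the range matches. */
        void set(String featureName, Object value) {
            Class<?> range = type.features.get(featureName);
            if (range == null) {
                throw new IllegalArgumentException(
                    "Type " + type.name + " has no feature " + featureName);
            }
            if (!range.isInstance(value)) {
                throw new IllegalArgumentException(
                    "Feature " + featureName + " requires a " + range.getSimpleName());
            }
            values.put(featureName, value);
        }

        /** Returns the stored value, or null if the feature was never set. */
        Object get(String featureName) {
            return values.get(featureName);
        }
    }

    /** Builds the Person type from the earlier example; the range types are assumed. */
    static Type personType() {
        return new Type("Person")
            .feature("personName", String.class)
            .feature("age", Integer.class)
            .feature("nationality", String.class)
            .feature("profession", String.class);
    }

    public static void main(String[] args) {
        FeatureStructure person = new FeatureStructure(personType());
        person.set("personName", "James Matthew Bloggs");
        person.set("age", 42); // hypothetical value, for illustration only
        System.out.println(person.get("personName"));
    }
}
```

In this sketch, setting a feature that the type does not declare, or setting a value outside the declared range, is rejected. That check is the role that the type system plays for feature structures.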

Most of the text analysis annotators produce their analysis results in the form of annotations. Annotations are a special kind of feature structure that is designated for linguistic analysis processing. An annotation spans or covers a piece of input text and is defined in terms of its beginning and end positions in the input text.

For example, for the text "100.55 US Dollars", an annotator that recognizes monetary expressions creates an annotation of type monetaryExpression that covers the text and has the feature currencySymbol set to "$".
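The following plain Java sketch shows how such an annotation can be represented by its begin and end positions in the input text. It is a simplified model, not the UIMA API: the class name MonetaryAnnotatorSketch, the regular expression, and the sample sentence are assumptions that are made for illustration, and a real annotator would recognize many more forms of monetary expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative model only; these classes are not part of the UIMA API.
class MonetaryAnnotatorSketch {

    /** An annotation: a typed span [begin, end) over the input text plus one feature. */
    static final class Annotation {
        final String type;
        final int begin;            // offset of the first covered character
        final int end;              // offset just past the last covered character
        final String currencySymbol;

        Annotation(String type, int begin, int end, String currencySymbol) {
            this.type = type;
            this.begin = begin;
            this.end = end;
            this.currencySymbol = currencySymbol;
        }

        /** Returns the piece of input text that this annotation covers. */
        String coveredText(String documentText) {
            return documentText.substring(begin, end);
        }
    }

    // Assumed pattern: a decimal number followed by the literal words "US Dollars".
    private static final Pattern US_DOLLARS =
        Pattern.compile("\\d+(?:\\.\\d+)? US Dollars");

    /** Scans the text and records one annotation for each match. */
    static List<Annotation> annotate(String documentText) {
        List<Annotation> results = new ArrayList<>();
        Matcher matcher = US_DOLLARS.matcher(documentText);
        while (matcher.find()) {
            results.add(new Annotation(
                "monetaryExpression", matcher.start(), matcher.end(), "$"));
        }
        return results;
    }

    public static void main(String[] args) {
        String text = "The invoice total is 100.55 US Dollars.";
        for (Annotation a : annotate(text)) {
            System.out.println(a.type + " [" + a.begin + ", " + a.end + ") "
                + a.coveredText(text) + " currencySymbol=" + a.currencySymbol);
        }
    }
}
```

Because the annotation stores only offsets, the covered text is always recovered from the document itself and is never duplicated in the analysis result.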

All feature structures are represented in a central data structure called the common analysis structure (CAS). All data exchange between annotators is handled by using the common analysis structure.

The common analysis structure contains the following objects:

The text document
The type system description that indicates the types, subtypes, and their features
Analysis results that describe the document or regions of the document
An index repository that supports access to and iteration over the analysis results
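The four objects in the list above can be pictured as one container, as in the following plain Java sketch. This sketch is a deliberately reduced model and not the UIMA implementation: the class name CasSketch and its methods are hypothetical, the type system description is reduced to a set of type names, and the index repository is reduced to one list per type that is ordered by begin position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative model only; this class is not part of the UIMA API.
class CasSketch {

    /** One analysis result: a typed region [begin, end) of the document. */
    static final class Result {
        final String type;
        final int begin;
        final int end;

        Result(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    private final String documentText;                 // the text document
    private final Set<String> declaredTypes;           // stands in for the type system description
    private final Map<String, List<Result>> indexByType = new HashMap<>(); // index repository

    CasSketch(String documentText, Set<String> declaredTypes) {
        this.documentText = documentText;
        this.declaredTypes = new HashSet<>(declaredTypes);
    }

    /** Adds a result to the index after checking its type and its offsets. */
    void add(Result result) {
        if (!declaredTypes.contains(result.type)) {
            throw new IllegalArgumentException("Undeclared type: " + result.type);
        }
        if (result.begin < 0 || result.begin > result.end
                || result.end > documentText.length()) {
            throw new IllegalArgumentException("Span is outside the document");
        }
        List<Result> list = indexByType.computeIfAbsent(result.type, k -> new ArrayList<>());
        list.add(result);
        list.sort(Comparator.comparingInt((Result r) -> r.begin));
    }

    /** Returns the results of one type in document order, for iteration. */
    List<Result> select(String type) {
        return Collections.unmodifiableList(
            indexByType.getOrDefault(type, Collections.emptyList()));
    }

    String coveredText(Result result) {
        return documentText.substring(result.begin, result.end);
    }

    public static void main(String[] args) {
        String text = "James Matthew Bloggs paid 100.55 US Dollars.";
        CasSketch cas = new CasSketch(text,
            new HashSet<>(Arrays.asList("Person", "monetaryExpression")));

        String name = "James Matthew Bloggs";
        String amount = "100.55 US Dollars";
        int amountBegin = text.indexOf(amount);
        cas.add(new Result("monetaryExpression", amountBegin, amountBegin + amount.length()));
        cas.add(new Result("Person", 0, name.length()));

        for (Result r : cas.select("Person")) {
            System.out.println(r.type + ": " + cas.coveredText(r));
        }
    }
}
```

Every consumer reads and writes through the same container. This arrangement is what allows all data exchange to be handled by using the common analysis structure.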
