There is an increasing focus on using information technology to get more value from unstructured information within organizations. The new Unstructured Information Management Architecture (UIMA) framework that was recently introduced by IBM (see Resources) makes it easier to develop and deploy systems that analyze unstructured media objects, like documents, in order to provide functions such as semantic search and text mining. Text mining is data mining applied to information extracted from text. In what follows, a very simple text mining application is described in some detail.
The text mining application described in this article, Preston, analyzes documents to find mentions of people's names, and uses data mining to find groups of people who are often mentioned together. While this technique is just one of many that are useful in text mining, it serves to illustrate the main features of such an application, and provides a concrete example with which to introduce the use of UIMA. It also illustrates how UIMA can be combined with structured databases and data mining. This article is for people who are interested in learning how the worlds of unstructured and structured information can be brought together using the new UIMA technology.
Figure 1 shows an overview of Preston. It analyzes documents stored as text fields in tables in a DB2 database. Components in the UIMA framework read the documents from the database, analyze them for mentions of names in a certain format, and write the results to another database, the Extracted Information Database (EIDB). The components are developed and deployed using tools in the UIMA SDK that is available from developerWorks (see Resources). Post-analysis processing of the information in the EIDB prepares it for data mining, which is done using DB2 Intelligent Miner. The whole application can easily be run on a laptop computer.
Figure 1. An overview of the Preston text mining application described in this article.
The documents used as examples in this article are the trivia that form part of the biographical information about actors and other personalities in the Internet Movie Database (IMDB) (see Resources). For illustration, I built a DB2 structured database from a subset of the IMDB content, and included the trivia as text fields in this database.
The UIMA components bridge from the unstructured data fields in the source data to structured extracted data. Different components read documents from the source database, analyze the text to find mentions of names, and store the results in a new database, the Extracted Information Database or EIDB.
The documents are read from the source database by SQLReader, which implements UIMA's CollectionReader interface and is developed using the SDK. When the UIMA framework calls SQLReader's initialize method, it connects to the database using JDBC™ and issues a SQL SELECT that returns the data it needs, such as the text strings, in a ResultSet object that SQLReader stores. The framework then uses the iterator-like methods of the CollectionReader interface, such as getNext(), to actually get the text and metadata of each document. These are returned to the framework in a UIMA-defined data object called the Common Analysis Structure, or CAS. Actually, since we are processing text documents, the data object is a text CAS, or TCAS, but for simplicity this article blurs this distinction and only talks about a CAS. The framework supplies an empty CAS when it calls getNext. SQLReader populates the CAS with data from the current row in the ResultSet. A sketch of the code required is shown in Listing 1. It shows how both the document text, from the TRIVIA column in the input table, and some metadata such as the document's URI, are put into the CAS. SQLReader also has to implement a hasNext() method (not shown) to complete the iterator interface.
Listing 1. Initializing the CAS in SQLReader's getNext method. Error checking has been omitted for clarity.
    Connection conn;
    ResultSet rs;
    // Not shown: code to set up the Connection and to
    // populate the ResultSet from the input database

    public void getNext(CAS cas) {
        // Not shown: code to check that the ResultSet contains more data

        // Get the document text and put it into the CAS
        String content = rs.getString("TRIVIA"); // get document text
        JCas jcas = cas.getJCas();
        jcas.setDocumentText(content); // set document text

        // Construct a URI for this document
        String id = rs.getString(idColName); // get primary key
        String url = conn.getMetaData().getURL(); // database URL
        String uri = url + "/" + tableName + "/" + idColName + "#" + id;

        // Set the URI into a SourceDocumentInfo
        SourceDocumentInfo docInfo = new SourceDocumentInfo(jcas);
        docInfo.setURI(uri); // set uri feature value
        docInfo.addToIndexes();

        // Advance to next row in the ResultSet
        nextRow();
    }
SQLReader is made known to the UIMA framework using an XML descriptor file. Each UIMA component has such a file, which can be created using tools in the SDK or by hand. The descriptor points to the implementation of the component, in this case a class file, and also includes any configuration information that the component needs. In the case of SQLReader, the descriptor contains information such as the URL of the source database, and the userid/password needed for logon. These are read at initialization time using methods that UIMA provides.
Another very important piece of information in a descriptor is a reference to the typesystem that the component uses. The CAS stores data as typed feature structures, and the typesystem defines the types and their inter-relations. Figure 2 shows the typesystem used in Preston. A typesystem is defined using SDK tools, which also automatically create Java ™ classes that correspond to the types in the typesystem. SourceDocumentInfo in Listing 1 is an example of such a class. Its URI attribute is used to hold the document URI that is created by SQLReader. (This URI will be copied out from the CAS into the EIDB at the end of the UIMA processing, as you will see later.)
Figure 2. The UIMA typesystem used by Preston. The names of built-in UIMA types are shown in italics. Arrows indicate inheritance.
Once the framework has obtained a CAS from SQLReader, it passes it to a text analysis engine (TAE) to do the actual analysis. A TAE can be quite complex, consisting of several components, including other TAEs. In Preston, however, the TAE contains only one component, NameReferenceAnnotator, which implements the Annotator interface defined by UIMA. An Annotator is the basic text analysis component. Its job is to use the information in the CAS that it is given, in particular the document text, to find some new data that it then adds to the CAS. NameReferenceAnnotator uses a regular expression to find names in the specific format that the IMDB documents use when they mention people who are in the IMDB (see Figure 3). A name in single quotes, followed by "(qv)", is easily spotted by a regular expression. The only complication comes from names that themselves contain one or more apostrophes. The figure also illustrates how IMDB disambiguates names, in this case John Barrymore, which are shared by more than one person in its database. This will be useful at a later step.
Figure 3. An example of a trivia document from the IMDB, illustrating the special format used for names in the source data.
Son of actor 'John Barrymore (I)' (qv) and actress 'Dolores Costello' (qv).
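The name format in Figure 3 can be matched with a pattern along the following lines. This is a minimal sketch using only java.util.regex, not the expression that Preston actually reads from its descriptor, which the article does not show. The class name QvNameFinder and the rule that an apostrophe inside a name must sit between two word characters (as in O'Toole) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of UIMA or of Preston's source. */
final class QvNameFinder {

    // The opening quote must not follow a word character, so the apostrophe
    // in "He's" cannot start a name. Inside the name, an apostrophe is only
    // accepted between two (ASCII) word characters, as in O'Toole. The name
    // ends at the quote that is immediately followed by " (qv)".
    static final Pattern QV_NAME = Pattern.compile(
            "(?<!\\w)'((?:[^']|(?<=\\w)'(?=\\w))+)' \\(qv\\)");

    /** Returns the quoted names in order of appearance, without the quotes. */
    static List<String> findNames(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = QV_NAME.matcher(text);
        while (m.find()) {
            // m.start(1) and m.end(1) give the character offsets of the name,
            // which is what an annotator would store as begin and end.
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String doc = "Son of actor 'John Barrymore (I)' (qv) and actress "
                + "'Dolores Costello' (qv).";
        System.out.println(findNames(doc));
    }
}
```

The disambiguating suffix such as "(I)" is kept as part of the captured name, since the article relies on it later to tell apart people who share a name.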
The most important methods of the Annotator interface are initialize and process. When initialize is called by the framework, the NameReferenceAnnotator reads the regular expression as a string from its descriptor and compiles it. Then, when process is called, it finds matches of the regular expression in the document text that it gets from the CAS. As each match is found, it is stored in the CAS as instances of types from the typesystem of Figure 2. Each name is stored as a NameReference object, which contains the name found by the regular expression as a string. The beginning and ending character positions of the match are stored in the begin and end integer features that a NameReference inherits from the built-in Annotation type.
NameReference also includes a reference to a DocumentEntity. The function of this feature structure is to store information about each entity (person) who is mentioned in a document. If an entity is mentioned several times, each of the mentions references the same document entity. One factor that makes Preston simple is that in the IMDB data, all mentions of the same person have exactly the same form. This makes it easy to identify the appropriate DocumentEntity for each mention. If Preston had to be extended to handle other kinds of input data, it might have to recognize different forms of the same name. For example, if a mention like "Mr Barrymore" occurred in a (hypothetical) longer version of the document in Figure 3, it would have to be recognized as a reference to the same entity as the mention of "John Barrymore (I)". The processing needed to make this connection is called in-document co-reference. In Preston, it is minimal because the IMDB data is so clean.
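Because every mention of a person has exactly the same form, in-document co-reference can be reduced to a lookup by exact string. The sketch below shows that idea with plain Java objects. The classes Entity, Mention, and Linker are hypothetical stand-ins for the generated NameReference and DocumentEntity classes, not the UIMA JCas types themselves, and the offsets in main are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java stand-ins for Preston's types; hypothetical, for illustration only. */
final class InDocCoref {

    /** Stand-in for DocumentEntity: one per distinct person in a document. */
    static final class Entity {
        final String name;
        Entity(String name) { this.name = name; }
    }

    /** Stand-in for NameReference: one per occurrence of a name. */
    static final class Mention {
        final String name;
        final int begin;
        final int end;
        final Entity entity;
        Mention(String name, int begin, int end, Entity entity) {
            this.name = name;
            this.begin = begin;
            this.end = end;
            this.entity = entity;
        }
    }

    /** Holds the entities of ONE document; use a fresh Linker per document. */
    static final class Linker {
        private final Map<String, Entity> byName = new LinkedHashMap<>();

        /** Reuses the entity for a name seen before, otherwise creates one. */
        Mention link(String name, int begin, int end) {
            Entity entity = byName.computeIfAbsent(name, Entity::new);
            return new Mention(name, begin, end, entity);
        }

        int entityCount() { return byName.size(); }
    }

    public static void main(String[] args) {
        Linker linker = new Linker();
        Mention first = linker.link("Bill Cosby", 10, 20);   // made-up offsets
        Mention second = linker.link("Bill Cosby", 40, 50);  // made-up offsets
        System.out.println(first.entity == second.entity);   // both share one entity
        System.out.println(linker.entityCount());
    }
}
```

Handling variants such as "Mr Barrymore" would require replacing the exact-match key with some form of name normalization or matching, which is outside the scope of this sketch.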
Creating the extracted information database
In order to do data mining on the information discovered in the document collection by the NameReferenceAnnotator, the mention and document entity information in all the CASs has to be written to a structured database. This is done at the end of the document processing stream (see Figure 1). Components that receive each CAS at the end of processing are called CAS consumers, for which UIMA provides the CasConsumer interface. A UIMA processing pipeline can have several CAS consumers, each of which receives each CAS in turn as it exits from the Text Analysis Engine. Preston uses two CAS consumers. One, called cas2jdbc, writes the data from each CAS to tables in a relational database (DB2), while the other, EidbManager, ignores the CASs it is given but sets up the database at the beginning of each run, and does the post-processing of all the information once all the documents have been analyzed.
Figure 4. The structure of the Extracted Information Database (EIDB).
The data model used for the EIDB is shown in Figure 4. The MENTIONS table holds the individual mentions of names that the NameReferenceAnnotator detects, and the DOCENT table holds the document entities. Sample data from these tables is shown in Figure 5. The other tables in the EIDB are discussed below. Though this simple schema is good enough for the present purpose, it could be made more efficient. For example, the document URIs are long strings that consist of an invariant part and a part that is document-specific. The invariant parts could be moved into a separate table to save space.
The database setup that EidbManager does when its initialize method is called consists of creating the four tables shown in the schema of Figure 4, using SQL statements that are read from its descriptor file.
The CAS consumer cas2jdbc, which Preston uses to populate the MENTIONS and DOCENT tables, is part of WebSphere® Information Integrator OmniFind Edition V8.3. It is a general-purpose component that writes data from a text CAS into tables in a relational database. The mapping from the UIMA typesystem to the relational schema is controlled by an XML configuration file. Part of the configuration for cas2jdbc in Preston is shown in Listing 2, which shows how two of the columns of the MENTIONS table are populated from features of NameReference instances in the CAS. Complete details of the construction of a mapping file are given in the documentation for cas2jdbc.
Figure 5, below, shows the rows in the MENTIONS and DOCENT tables of the EIDB that are derived from the document "He was married to 'Cicely Tyson' (qv) by 'Andrew Young (IV)' (qv) in the home of 'Bill Cosby' (qv). 'Bill Cosby' (qv) was the best man, and gave away the bride." Note that there are two mentions of Bill Cosby, but only one document entity. Keys have been shortened for clarity.
Figure 5. Rows in the MENTIONS and DOCENT tables
The code fragment in Listing 2, below, shows how the span column of the MENTIONS table is populated from the name feature of a NameReference annotation, and how the docent_id column is populated from the entity feature, using a unique ID that is created for each feature structure in the CAS by cas2jdbc.
Listing 2. Part of the configuration file for the CasConsumer cas2jdbc in Preston.
    <explicitMappingRule applyToSubtypes="false">
      <type>com.ibm.fisc.preston.NameReference</type>
      <table>MENTIONS</table>
      <featureMappings>
        <featureMapping>
          <feature>name</feature>
          <length>1024</length>
          <column>SPAN</column>
        </featureMapping>
        <featureMapping>
          <feature>entity/com.ibm.fisc.preston.DocumentEntity:uniqueId()</feature>
          <column>DOCENT_ID</column>
        </featureMapping>
      </featureMappings>
    </explicitMappingRule>
After the last document has been processed, the MENTIONS and DOCENT tables in the EIDB hold information about all the name mentions that were found. But a given individual can be mentioned in several documents. The term instance is used to refer to a single entity that is mentioned in one or more documents. The INSTANCES table in the EIDB records information about instances, and the DE_INST table maintains the links from each document entity to the corresponding instance. The task of determining which entities from different documents are in fact the same instance is called cross-document co-reference. In Preston, the processing for cross-document co-reference is done when the collectionProcessComplete method of the EidbManager CAS consumer is called by the framework.
The task is relatively easy in Preston because names in the IMDB are always mentioned in exactly the same way, so it is easy to determine which entities in different documents should be linked to the same instance. In production applications, cross-document co-reference can be quite complex, and is in fact an area of current research. In Preston the processing just consists of issuing a couple of SQL commands that create an entry in the INSTANCES table for each unique name in the DOCENT table, and that create the corresponding rows in the DE_INST table. The completed Extracted Information Database is ready to be used for data mining.
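The grouping that those SQL commands perform can be pictured in memory as follows. This is only an illustrative Java sketch of the logic, not the SQL that EidbManager issues (the article does not show it). The class CrossDocCoref, the integer instance IDs, and the identifiers used in main are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory analogue of filling INSTANCES and DE_INST. */
final class CrossDocCoref {

    /** One row of a DE_INST-like table: document entity -> instance. */
    static final class Link {
        final String docentId;
        final int instanceId;
        Link(String docentId, int instanceId) {
            this.docentId = docentId;
            this.instanceId = instanceId;
        }
    }

    /**
     * Each element of docents is {docentId, name}. The instances map plays
     * the role of the INSTANCES table (name -> instance id) and is filled in
     * as a side effect; the returned list plays the role of DE_INST.
     */
    static List<Link> resolve(List<String[]> docents, Map<String, Integer> instances) {
        List<Link> links = new ArrayList<>();
        for (String[] docent : docents) {
            String name = docent[1];
            Integer id = instances.get(name);
            if (id == null) {
                id = instances.size() + 1;   // next unused instance id
                instances.put(name, id);
            }
            links.add(new Link(docent[0], id));
        }
        return links;
    }

    public static void main(String[] args) {
        Map<String, Integer> instances = new LinkedHashMap<>();
        List<Link> links = resolve(Arrays.asList(
                new String[] {"doc1-e1", "Bill Cosby"},
                new String[] {"doc1-e2", "Cicely Tyson"},
                new String[] {"doc2-e1", "Bill Cosby"}), instances);
        for (Link link : links) {
            System.out.println(link.docentId + " -> instance " + link.instanceId);
        }
    }
}
```

Exact string equality is enough here only because the IMDB disambiguates shared names with suffixes such as "(I)".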
We use data mining of the data in the EIDB to discover groups of people who are strongly connected with each other. The evidence for a connection between two people is that they are mentioned in the same document, i.e., that they are co-mentioned. We could include other evidence for a connection by either including additional structured data, such as database tables that show which people worked together on movies, or by deeper text analysis. Additional text analysis would enable us to find other relations between people, based on statements in the text. By adding more annotators to find these relations, and more types in the typesystem to store them in the CAS, we could create database tables containing entity-relation-entity triples, also called subject-predicate-object triples. To accommodate this future possibility, the co-mention data in the EIDB is cast into a triples-oriented schema by defining a view on the database that has this structure. The schema for this view, called UIMA_RELATIONS, is shown in Table 1.
| Column name | Description |
|---|---|
| subject_type | The type of the subject entity, e.g. NameReference. |
| subject_uri | A unique identifier for the subject entity, in the form of a URI. |
| predicate_type | The type of the predicate, e.g. Has_name. |
| object_type | The type of the object, such as Document or String. |
| object_name | Either the URI of the object entity, if for example its type is Document, or the string value of the object if its type is String. |
| evidence_uri | A URI that can be used by an application to retrieve the evidence for the relation, e.g. the URI of a document. |
This kind of schema, known as a vertical schema, has two main advantages. First, it is very flexible, because new relations can easily be inserted by using different values in the predicate_type column. Second, it makes the relations and their semantics explicit, whereas in a standard database schema many relations are implicit in the design of the schema. A vertical schema is also closer to semantic Web standards like RDF. By making a view, rather than an explicit table, we avoid the main disadvantage of a vertical schema, namely that many queries require it to be joined with itself, which can be expensive.
The UIMA_RELATIONS view is created as the union of two SQL selects. One select creates rows for the "Mentioned_in" predicate, and the other for the "Has_name" predicate. The first of these connects people and documents. It pulls person instances from the INSTANCES table in the EIDB, and through joins with other tables finds the documents in which the person instances are mentioned. The evidence URI is the document URI. The second SQL select creates rows for the "Has_name" predicate, which connects person instances to their name strings. Since all the information needed for this predicate is in the INSTANCES table, an evidence URI is constructed to point to the relevant row in this table.
The data mining in Preston, mining for associations, requires the definition of another view, the MINING_VIEW, whose format is defined by the requirements of the DB2 Intelligent Miner tool described below. It is constructed from the UIMA_RELATIONS view by joining it to itself. The mining view contains only two columns, as shown in Table 2. The first column is a human-readable identifier for an entity, in this case the name of a person. The second column is a unique ID for the "transaction" in which the person occurs. In our case this is the URI of a document in which the person is mentioned.
| Column name | Description |
|---|---|
| name | A string that describes an entity, for example the name of an instance of a person |
| transaction_id | A unique identifier for the transaction in which the entity appears, for example the URI of a document. |
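The self-join that produces the mining view can be illustrated outside the database. The following Java sketch pairs each "Mentioned_in" triple with the "Has_name" triple that has the same subject URI, which yields (name, transaction_id) rows of the kind described in Table 2. It is an assumption about how the join works, based on the description above, and not the view definition used in Preston. The class names and the URIs in main are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory analogue of deriving MINING_VIEW from UIMA_RELATIONS. */
final class MiningViewSketch {

    /** A reduced UIMA_RELATIONS row: only the columns the join needs. */
    static final class Triple {
        final String subjectUri;
        final String predicateType;
        final String objectName;
        Triple(String subjectUri, String predicateType, String objectName) {
            this.subjectUri = subjectUri;
            this.predicateType = predicateType;
            this.objectName = objectName;
        }
    }

    /** Returns rows of the form {name, transaction_id}. */
    static List<String[]> miningRows(List<Triple> relations) {
        // First pass: subject URI -> name, from the Has_name triples.
        Map<String, String> nameOf = new HashMap<>();
        for (Triple t : relations) {
            if ("Has_name".equals(t.predicateType)) {
                nameOf.put(t.subjectUri, t.objectName);
            }
        }
        // Second pass: join each Mentioned_in triple to its name.
        List<String[]> rows = new ArrayList<>();
        for (Triple t : relations) {
            if ("Mentioned_in".equals(t.predicateType)) {
                String name = nameOf.get(t.subjectUri);
                if (name != null) {
                    rows.add(new String[] {name, t.objectName});
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Triple> relations = Arrays.asList(
                new Triple("inst/1", "Has_name", "Bill Cosby"),
                new Triple("inst/1", "Mentioned_in", "doc/7"),
                new Triple("inst/2", "Has_name", "Cicely Tyson"),
                new Triple("inst/2", "Mentioned_in", "doc/7"));
        for (String[] row : miningRows(relations)) {
            System.out.println(row[0] + " | " + row[1]);
        }
    }
}
```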
The significance of the transaction ID can be easily seen if we consider the original motivation of association mining: market basket analysis. If we think of a market basket (for example, a shopping cart in a supermarket) as the "transaction," and its identifier as the transaction ID, then the original use of associations mining was to find correlations between the presence of two or more items in the cart. In Preston, the document is the cart, and the people mentioned in it are the items in the cart. If we had other relations, in particular binary ones of the form person-relation-person, then the relation instance would be the cart, the subject and object of the relation would be the items in the cart, and the transaction identifier would be the identifier of the relation instance.
The output of association mining is a set of rules of the form
entity1, entity2 => entity3
meaning that the presence of both entity1 and entity2 in a transaction implies the presence of entity3 with a certain probability. This example is a rule of length 3. In Preston the rules we seek link pairs of people, like
personA => personB,
which has length 2. The strength of this rule measures how often personA and personB appear together in a document. Several measures of the strength are calculated by the association mining algorithm.
We use DB2 Intelligent Miner for the associations mining. Once installed in DB2, this product can be invoked by calls to stored procedures within SQL statements. The invocation shown in Listing 3 uses an "easy mining procedure" that is supplied with Intelligent Miner. In this invocation PRESTON is the name of the model that is created, MINING_VIEW is the view being mined, and TRANSACTION_ID is the column of that view that identifies the transaction. The next two numerical parameters set two thresholds for the strength of the rules generated, namely that the minimum support is 0.01% and the minimum confidence is 1%. The last parameter specifies that the maximum rule length is 2, as just discussed. Support and confidence are measures of the strength of an association rule. Support is the proportion of the transactions for which the rule is true, that is, the proportion of documents that mention both personA and personB, while confidence measures the proportion of the documents containing personA that also mention personB.
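To make the two measures concrete, the sketch below computes them for a rule personA => personB over a tiny, made-up set of transactions. It uses the standard definitions given above and returns fractions rather than the percentages that the Intelligent Miner thresholds are expressed in. The class RuleStrength and the data are illustrative assumptions, and this is not how DB2 Intelligent Miner is implemented.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper that evaluates one length-2 association rule. */
final class RuleStrength {

    /** Builds one transaction: the set of names mentioned in one document. */
    static Set<String> doc(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    /** Number of documents that mention every one of the given names. */
    static long countContaining(List<Set<String>> docs, String... names) {
        long count = 0;
        for (Set<String> d : docs) {
            if (d.containsAll(Arrays.asList(names))) {
                count++;
            }
        }
        return count;
    }

    /** Fraction of all documents that mention both a and b. */
    static double support(List<Set<String>> docs, String a, String b) {
        if (docs.isEmpty()) {
            return 0.0;
        }
        return (double) countContaining(docs, a, b) / docs.size();
    }

    /** Fraction of the documents mentioning a that also mention b. */
    static double confidence(List<Set<String>> docs, String a, String b) {
        long withA = countContaining(docs, a);
        return withA == 0 ? 0.0 : (double) countContaining(docs, a, b) / withA;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = Arrays.asList(
                doc("personA", "personB"),
                doc("personA", "personB", "personC"),
                doc("personA"),
                doc("personC"));
        System.out.println("support    A=>B: " + support(docs, "personA", "personB"));
        System.out.println("confidence A=>B: " + confidence(docs, "personA", "personB"));
        System.out.println("confidence B=>A: " + confidence(docs, "personB", "personA"));
    }
}
```

In this data, two of the four documents mention both people, so the support is 0.5 in either direction. The confidence is not symmetric: two of the three documents that mention personA also mention personB, while every document that mentions personB also mentions personA.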
One way to think about co-mention relations is that they define a network, or graph, of people who are linked if they have been mentioned together in at least one document. This network is implicit in the mining view. One of several useful features of DB2 Intelligent Miner is that it finds strongly connected sub-graphs within this network. The people in the sub-graph have been mentioned together frequently. An example is shown in Figure 6, which was drawn by DB2 Intelligent Miner Visualization. We see that some well-known associations in real life can be detected by data mining the co-mention data in the IMDB trivia documents. The color coding here is that orange signifies stronger associations than white, which are in turn stronger than blue. This sub-graph represents, of course, the Beatles and people who have been strongly associated with them.
Listing 3. The invocation in SQL of the Easy Mining Procedure for associations mining. BuildRuleModel is a stored procedure supplied with DB2 Intelligent Miner.
    CALL IDMMX.BuildRuleModel( 'PRESTON', 'MINING_VIEW', 'TRANSACTION_ID', 0.01, 1, 2)
Figure 6. A strongly connected sub-graph discovered by DB2 Intelligent Miner within the network of co-mention relations that were found by text analysis.
This article has described a simple application, Preston, that finds mentions of people in documents using text analysis in UIMA, builds a database from the data it finds, and invokes data mining for associations to find strongly connected sub-graphs in the network of co-mention relations. While this application is very simple, it illustrates the main features of using UIMA to bridge between the worlds of unstructured and structured data. One possible extension of this application would be to recognize more kinds of entities, and relations between the entities, by using more sophisticated text analysis. Annotators or text analysis engines from different sources can be easily plugged into the UIMA framework. IBM has already announced that several business partners are making UIMA-compatible text analysis components available. UIMA-compatible open-source components are also available from the GATE project at the University of Sheffield (see Resources).
Another extension would be to deploy the application not on the UIMA framework implementation available in the SDK but on the supported IBM product: WebSphere Information Integrator OmniFind Edition. OmniFind supports UIMA, and adds support for gathering documents from many different kinds of databases, and also for integrating text analysis and text search to provide semantic text search. In this case, be sure to use the version of the SDK from developerWorks that is compatible with OmniFind.
The UIMA framework is continuing to evolve, driven by work in IBM Research. While this article has focused on text analysis, UIMA can also be used to analyze other kinds of unstructured information such as audio and images.
The author wishes to thank Graham Bent of the IBM Hursley Laboratory for introducing him to the combination of DB2 Intelligent Miner with text analysis, and the Internet Movie Database for permission to use their content.
Learn
- "Semantic search in WebSphere Information Integrator OmniFind Edition: The case for semantic search" (developerWorks, August 2005) describes how UIMA can be used to enhance search applications.
- The Internet Movie Database is at http://www.imdb.com. Much of the content of the IMDB is available for download.
- Learn more about the GATE project.
- Stay current with developerWorks technical events and Webcasts.
Get products and technologies
- The UIMA SDK is available for download.
- Information about the DB2 Intelligent Miner product is available at http://www.ibm.com/software/data/iminer.
- Build your next development project with IBM trial software, available for download directly from developerWorks.
Discuss
- Participate in developerWorks blogs and get involved in the developerWorks community.

Alan Marwick is a technical lead for knowledge management in the IBM Federal Innovation Solutions Center. He joined IBM in 1985, and worked in the IBM T.J. Watson Research Laboratory on materials science, digital libraries, and applications of text analysis before taking his present position. He has a Ph.D. in physics from the University of Sussex in the UK.