
Text Mining for associations using UIMA and DB2 Intelligent Miner

Bridging from unstructured to structured information

Alan Marwick (marwick@us.ibm.com), Knowledge Management technical lead, IBM, Software Group
Alan Marwick is a technical lead for knowledge management in the IBM Federal Innovation Solutions Center. He joined IBM in 1985, and worked in the IBM T.J. Watson Research Laboratory on materials science, digital libraries, and applications of text analysis before taking his present position. He has a Ph.D. in physics from the University of Sussex in the UK.

Summary:  Get more value from your unstructured information. Explore how a simple text-mining application using a text analysis engine built with the UIMA SDK finds mentions of people in documents. Then another UIMA component writes the results to tables in a DB2® database. Using this data, strong associations between people who are frequently mentioned together in the documents are detected using DB2 Intelligent Miner.

Date:  02 Feb 2006
Level:  Introductory

Introduction

There is an increasing focus on using information technology to get more value from unstructured information within organizations. The new Unstructured Information Management Architecture (UIMA) framework that was recently introduced by IBM (see Resources) makes it easier to develop and deploy systems that analyze unstructured media objects, like documents, in order to provide functions such as semantic search and text mining. Text mining is data mining applied to information extracted from text. In what follows, a very simple text mining application is described in some detail.

Overview

The text mining application described in this article, Preston, analyzes documents to find mentions of people's names, and uses data mining to find groups of people who are often mentioned together. While this technique is just one of many that are useful in text mining, it serves to illustrate the main features of such an application, and provides a concrete example with which to introduce the use of UIMA. It also illustrates how UIMA can be combined with structured databases and data mining. This article is for people who are interested in learning how the worlds of unstructured and structured information can be brought together using the new UIMA technology.

Figure 1 shows an overview of Preston. It analyzes documents stored as text fields in tables in a DB2 database. Components in the UIMA framework read the documents from the database, analyze them for mentions of names in a certain format, and write the results to another database, the Extracted Information Database (EIDB). The components are developed and deployed using tools in the UIMA SDK that is available from developerWorks (see Resources). Post-analysis processing of the information in the EIDB prepares it for data mining, which is done using DB2 Intelligent Miner. The whole application can easily be run on a laptop computer.


Figure 1. An overview of the Preston text mining application described in this article.

The documents used as examples in this article are the trivia that form part of the biographical information about actors and other personalities in the Internet Movie Database (IMDB) (see Resources). For illustration, I built a DB2 structured database from a subset of the IMDB content, and included the trivia as text fields in this database.

Text analysis with UIMA

The UIMA components bridge from the unstructured data fields in the source data to structured extracted data. Different components read documents from the source database, analyze the text to find mentions of names, and store the results in a new database, the Extracted Information Database or EIDB.

The documents are read from the source database by SQLReader, which implements UIMA's CollectionReader interface and is developed using the SDK. When the UIMA framework calls SQLReader's initialize method, it connects to the database using JDBC™ and issues a SQL SELECT that returns the data it needs, such as the text strings, in a ResultSet object that SQLReader stores. The framework then uses the iterator-like methods of the CollectionReader interface, such as getNext(), to actually get the text and metadata of each document. These are returned to the framework in a UIMA-defined data object called the Common Analysis Structure, or CAS. Actually, since we are processing text documents, the data object is a text CAS, or TCAS, but for simplicity this article blurs this distinction and only talks about a CAS. The framework supplies an empty CAS when it calls getNext. SQLReader populates the CAS with data from the current row in the ResultSet. A sketch of the code required is shown in Listing 1. It shows how both the document text, from the TRIVIA column in the input table, and some metadata such as the document's URI, are put into the CAS. SQLReader also has to implement a hasNext() method (not shown) to complete the iterator interface.


Listing 1. Initializing the CAS in SQLReader's getNext method. Error checking has been omitted for clarity.
		
		Connection conn;
		ResultSet rs;

		// Not shown: code to set up the Connection and to 
		// populate the ResultSet from the input database
		
		public void getNext( CAS cas) {

		   // Not shown: code to check that the ResultSet contains more data
		   
		   // Get the document text and put it into the CAS
		   String content = rs.getString( "TRIVIA");  // get document text from the ResultSet declared above
		   JCas jcas = cas.getJCas();
		   jcas.setDocumentText( content);            // set document text
					
		   // Construct a URI for this document
		   String id = rs.getString( idColName);      // get primary key
		   String url = conn.getMetaData().getURL();  // database URL
		   String uri = url + "/" + tableName + "/" + idColName + "#" + id;

		   // set URI into a SourceDocumentInfo
		   SourceDocumentInfo docInfo = new SourceDocumentInfo( jcas);
		   docInfo.setURI( uri);                      // set uri feature value
		   docInfo.addToIndexes();

		   // Advance to next row in the ResultSet
		   nextRow();
		} 
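
The iterator-style contract between the framework and SQLReader can also be sketched without UIMA or JDBC on the classpath. The following is only an illustrative stand-in, not Preston's code: the class RowReaderSketch, its Row and Document helper types, and the sample database URL, table name, and column name are all hypothetical. In-memory rows take the place of the ResultSet and a plain object takes the place of the CAS. Only the URI layout is taken from Listing 1.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for SQLReader: in-memory rows replace the JDBC
// ResultSet, and a plain Document object replaces the CAS.
class RowReaderSketch {

    // One input row: the primary key plus the text of the TRIVIA column.
    static final class Row {
        final String id;
        final String trivia;
        Row(String id, String trivia) { this.id = id; this.trivia = trivia; }
    }

    // What getNext() hands back: the document text plus its URI metadata.
    static final class Document {
        final String text;
        final String uri;
        Document(String text, String uri) { this.text = text; this.uri = uri; }
    }

    private final Iterator<Row> rows;
    private final String dbUrl;
    private final String tableName;
    private final String idColName;

    RowReaderSketch(List<Row> rows, String dbUrl, String tableName, String idColName) {
        this.rows = rows.iterator();
        this.dbUrl = dbUrl;
        this.tableName = tableName;
        this.idColName = idColName;
    }

    // Same layout as in Listing 1: <database URL>/<table>/<id column>#<id>
    static String buildUri(String dbUrl, String tableName, String idColName, String id) {
        return dbUrl + "/" + tableName + "/" + idColName + "#" + id;
    }

    // The framework asks this before every call to getNext().
    boolean hasNext() {
        return rows.hasNext();
    }

    // Returns the current row as a document and advances to the next row.
    Document getNext() {
        Row row = rows.next(); // throws NoSuchElementException when exhausted
        return new Document(row.trivia, buildUri(dbUrl, tableName, idColName, row.id));
    }

    public static void main(String[] args) {
        // Sample rows and connection details are made up for illustration.
        List<Row> input = Arrays.asList(
            new Row("1", "Son of actor 'John Barrymore (I)' (qv)."),
            new Row("2", "He was married to 'Cicely Tyson' (qv)."));
        RowReaderSketch reader =
            new RowReaderSketch(input, "jdbc:db2:SAMPLEDB", "BIOGRAPHIES", "ID");
        while (reader.hasNext()) {
            Document doc = reader.getNext();
            System.out.println(doc.uri + " -> " + doc.text);
        }
    }
}
```

Because the URI combines the database URL, the table, the key column, and the key value, an application that later reads the EIDB has enough information to locate the source row again.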
	

SQLReader is made known to the UIMA framework using an XML descriptor file. Each UIMA component has such a file, which can be created using tools in the SDK or by hand. The descriptor points to the implementation of the component, in this case a class file, and also includes any configuration information that the component needs. In the case of SQLReader, the descriptor contains information such as the URL of the source database, and the userid/password needed for logon. These are read at initialization time using methods that UIMA provides.

Another very important piece of information in a descriptor is a reference to the typesystem that the component uses. The CAS stores data as typed feature structures, and the typesystem defines the types and their inter-relations. Figure 2 shows the typesystem used in Preston. A typesystem is defined using SDK tools, which also automatically create Java ™ classes that correspond to the types in the typesystem. SourceDocumentInfo in Listing 1 is an example of such a class. Its URI attribute is used to hold the document URI that is created by SQLReader. (This URI will be copied out from the CAS into the EIDB at the end of the UIMA processing, as you will see later.)


Figure 2. The UIMA typesystem used by Preston. The names of built-in UIMA types are shown in italics. Arrows indicate inheritance.

Once the framework has obtained a CAS from SQLReader, it passes it to a text analysis engine (TAE) to do the actual analysis. A TAE can be quite complex, consisting of several components, including other TAEs. In Preston, however, the TAE contains only one component, NameReferenceAnnotator, that implements the Annotator interface defined by UIMA. An Annotator is the basic text analysis component. Its job is to use the information in the CAS that it is given, in particular the document text, to find some new data that it then adds to the CAS. NameReferenceAnnotator uses a regular expression to find names in the specific format that is used in the IMDB documents when people in the IMDB are mentioned in text (see Figure 3). A name in single quotes, followed by "(qv)", is easily spotted by a regular expression. The only complication comes from names that themselves contain one or more apostrophes. The figure also illustrates how IMDB disambiguates a name that is shared by more than one person in its database, in this case John Barrymore, by appending a numeral in parentheses. This will be useful at a later step.


Figure 3. An example of a trivia document from the IMDB, illustrating the special format used for names in the source data.
		 
Son of actor 'John Barrymore (I)' (qv) and actress 'Dolores Costello' (qv).
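
The article does not reproduce the regular expression that NameReferenceAnnotator reads from its descriptor, so the following is only a sketch of one pattern that handles the format of Figure 3. The class name QvNameFinder, the Mention helper type, and the pattern itself are assumptions. The pattern treats an apostrophe that is directly followed by a letter as part of the name, and ignores an opening quote that directly follows a letter or digit. The second sample sentence in main is invented to exercise the apostrophe case. The offsets are assumed to cover the name without its surrounding quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the matching step; not the pattern used in Preston.
class QvNameFinder {

    // A found name with its character span in the document text.
    static final class Mention {
        final String name;
        final int begin;
        final int end;
        Mention(String name, int begin, int end) {
            this.name = name;
            this.begin = begin;
            this.end = end;
        }
    }

    // Opening quote not preceded by a letter or digit, then the name (an inner
    // apostrophe is allowed only when a letter follows it), then "' (qv)".
    static final Pattern QV_NAME = Pattern.compile(
        "(?<![\\p{L}\\p{N}])'((?:[^']|'(?=\\p{L}))+?)' \\(qv\\)");

    // Returns every name in the text together with its begin and end offsets.
    static List<Mention> findMentions(String text) {
        List<Mention> mentions = new ArrayList<>();
        Matcher m = QV_NAME.matcher(text);
        while (m.find()) {
            mentions.add(new Mention(m.group(1), m.start(1), m.end(1)));
        }
        return mentions;
    }

    public static void main(String[] args) {
        String figure3 = "Son of actor 'John Barrymore (I)' (qv) and actress "
            + "'Dolores Costello' (qv).";
        String invented = "He's a friend of 'Peter O'Toole' (qv).";
        for (String text : new String[] { figure3, invented }) {
            for (Mention mention : findMentions(text)) {
                System.out.println(mention.name + " [" + mention.begin + ", " + mention.end + ")");
            }
        }
    }
}
```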
	

The most important methods of the Annotator interface are initialize and process. When initialize is called by the framework, NameReferenceAnnotator reads the regular expression as a string from its descriptor and compiles it. Then, when process is called, it finds matches of the regular expression in the document text that it gets from the CAS. As each match is found, it is stored in the CAS as instances of types from the typesystem of Figure 2. Each name is stored as a NameReference object that contains the name found by the regular expression as a string. Its beginning and ending character positions are stored in the begin and end integer features that NameReference inherits from the built-in Annotation type.

NameReference also includes a reference to a DocumentEntity. The function of this feature structure is to store information about each entity (person) who is mentioned in a document. If an entity is mentioned several times, each of the mentions references the same document entity. One factor that makes Preston simple is that in the IMDB data, all mentions of the same person have exactly the same form. This makes it easy to identify the appropriate DocumentEntity for each mention. If Preston were extended to handle other kinds of input data, it might have to recognize different forms of the same name. For example, if a mention like "Mr Barrymore" occurred in a (hypothetical) longer version of the document in Figure 3, it would have to be recognized as a reference to the same entity as the mention of "John Barrymore (I)". The processing needed to make this connection is called in-document co-reference. In Preston, it is minimal because the IMDB data is so clean.
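
Because every mention of a person has exactly the same form, in-document co-reference in this setting reduces to grouping mentions by their name string. The sketch below illustrates that idea in plain Java under that assumption. DocumentEntitySketch and groupMentions are hypothetical names, and a map from name to mention positions stands in for the DocumentEntity feature structures that the annotator would create in the CAS.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one "document entity" per distinct name string.
class DocumentEntitySketch {

    // Maps each distinct name to the positions (0-based, in mention order)
    // of the mentions that refer to it. Each map key plays the role of one
    // document entity; each list element plays the role of one mention.
    static Map<String, List<Integer>> groupMentions(List<String> mentionNames) {
        Map<String, List<Integer>> entities = new LinkedHashMap<>();
        for (int i = 0; i < mentionNames.size(); i++) {
            String name = mentionNames.get(i);
            List<Integer> positions = entities.get(name);
            if (positions == null) {
                positions = new ArrayList<>();
                entities.put(name, positions); // first mention creates the entity
            }
            positions.add(i);                  // later mentions reuse it
        }
        return entities;
    }

    public static void main(String[] args) {
        // Names in the order in which they occur in a document that
        // mentions one person twice.
        List<String> mentions = Arrays.asList(
            "Cicely Tyson", "Andrew Young (IV)", "Bill Cosby", "Bill Cosby");
        Map<String, List<Integer>> entities = groupMentions(mentions);
        for (Map.Entry<String, List<Integer>> e : entities.entrySet()) {
            System.out.println(e.getKey() + " <- mentions " + e.getValue());
        }
    }
}
```

With less regular input, the exact string comparison would have to be replaced by logic that decides when two different strings name the same person.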

Creating the extracted information database

In order to do data mining on the information discovered in the document collection by the NameReferenceAnnotator, the mention and document entity information in all the CASs has to be written to a structured database. This is done at the end of the document processing stream (see Figure 1). Components that receive each CAS at the end of processing are called CAS consumers, for which UIMA provides the CasConsumer interface. A UIMA processing pipeline can have several CAS consumers, each of which receives each CAS in turn as it exits from the Text Analysis Engine. Preston uses two CAS consumers. One, called cas2jdbc, writes the data from each CAS to tables in a relational database (DB2), while the other, EidbManager, ignores the CASs it is given but sets up the database at the beginning of each run, and does the post-processing of all the information once all the documents have been analyzed.


Figure 4. The structure of the Extracted Information Database (EIDB).

The data model used for the EIDB is shown in Figure 4. The MENTIONS table holds the individual mentions of names that the NameReferenceAnnotator detects, and the DOCENT table holds the document entities. Sample data from these tables is shown in Figure 5. The other tables in the EIDB are discussed below. Though this simple schema is good enough for the present purpose, it could be made more efficient. For example, the document URIs are long strings that consist of an invariant part and a part that is document-specific. The invariant parts could be moved into a separate table to save space.

The database setup that EidbManager does when its initialize method is called consists of creating the four tables shown in the schema of Figure 4, using SQL statements that are read from its descriptor file.

The CAS consumer cas2jdbc, which Preston uses to populate the MENTIONS and DOCENT tables, is part of WebSphere® Information Integrator OmniFind Edition V8.3. It is a general-purpose component that writes data from a text CAS into tables in a relational database. The mapping from the UIMA typesystem to the relational schema is controlled by an XML configuration file. Part of the configuration for cas2jdbc in Preston is shown in Listing 2, which shows how two of the columns of the MENTIONS table are populated from features of NameReference instances in the CAS. Complete details of the construction of a mapping file are given in the documentation for cas2jdbc.

Figure 5, below, shows the rows in the MENTIONS and DOCENT tables of the EIDB that are derived from the document "He was married to 'Cicely Tyson' (qv) by 'Andrew Young (IV)' (qv) in the home of 'Bill Cosby' (qv). 'Bill Cosby' (qv) was the best man, and gave away the bride." Note that there are two mentions of Bill Cosby, but only one document entity. Keys have been shortened for clarity.


Figure 5. Rows in the MENTIONS and DOCENT tables

The code fragment in Listing 2, below, shows how the span column of the MENTIONS table is populated from the name feature of a NameReference annotation, and how the docent_id column is populated from the entity feature, using a unique ID that is created for each feature structure in the CAS by cas2jdbc.


Listing 2. Part of the configuration file for the CasConsumer cas2jdbc in Preston.
		
<explicitMappingRule applyToSubtypes="false">
	<type>com.ibm.fisc.preston.NameReference</type>
	<table>MENTIONS</table>
	<featureMappings>
		<featureMapping>
			<feature>name</feature>
			<length>1024</length>
			<column>SPAN</column>
		</featureMapping>
		<featureMapping>
			<feature>
				entity/com.ibm.fisc.preston.DocumentEntity:uniqueId()
			</feature>
			<column>DOCENT_ID</column>
		</featureMapping>
	</featureMappings>
</explicitMappingRule>
	

After the last document has been processed, the MENTIONS and DOCENT tables in the EIDB hold information about all the name mentions that were found. But a given individual can be mentioned in several documents. The term instance is used to refer to a single entity that is mentioned in one or more documents. The INSTANCES table in the EIDB records information about instances, and the DE_INST table maintains the links from each document entity to the corresponding instance. The task of determining which entities from different documents are in fact the same instance is called cross-document co-reference.

In Preston, the processing for cross-document co-reference is done when the collectionProcessComplete method of the EidbManager CAS consumer is called by the framework. The task is relatively easy in Preston because names in the IMDB are always mentioned in exactly the same way, so it is very easy to figure out which entities in different documents should be linked to the same instance. In production applications, cross-document co-reference can be quite complex, and is in fact an area of current research. In Preston the processing just consists of issuing a couple of SQL commands that create an entry in the INSTANCES table for each unique name in the DOCENT table, and that create the corresponding rows in the DE_INST table. The completed Extracted Information Database is ready to be used for data mining.
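
The SQL commands that EidbManager issues are not shown in the article. As a rough illustration of what they achieve, the following in-memory sketch derives the content of the two tables from the document entities. The class CrossDocCorefSketch, the integer instance IDs, and the sample DOCENT keys are hypothetical. Maps stand in for the INSTANCES and DE_INST tables, and exact name equality is assumed to be a sufficient test, as it is for the IMDB data.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory analogue of the cross-document co-reference step.
class CrossDocCorefSketch {

    // Stand-in for INSTANCES: one entry per unique name (name -> instance id).
    final Map<String, Integer> instances = new LinkedHashMap<>();

    // Stand-in for DE_INST: document entity id -> instance id.
    final Map<String, Integer> deInst = new LinkedHashMap<>();

    // docentNames maps each document entity id to its name string.
    static CrossDocCorefSketch link(Map<String, String> docentNames) {
        CrossDocCorefSketch result = new CrossDocCorefSketch();
        for (Map.Entry<String, String> docent : docentNames.entrySet()) {
            String name = docent.getValue();
            Integer instanceId = result.instances.get(name);
            if (instanceId == null) {
                // First time this name is seen: create a new instance.
                instanceId = result.instances.size() + 1;
                result.instances.put(name, instanceId);
            }
            // Link the document entity to its instance.
            result.deInst.put(docent.getKey(), instanceId);
        }
        return result;
    }

    public static void main(String[] args) {
        // Document entities from two documents; the keys are made up.
        Map<String, String> docents = new LinkedHashMap<>();
        docents.put("doc1-e1", "Cicely Tyson");
        docents.put("doc1-e2", "Bill Cosby");
        docents.put("doc2-e1", "Bill Cosby");
        CrossDocCorefSketch eidb = link(docents);
        System.out.println("INSTANCES: " + eidb.instances);
        System.out.println("DE_INST:   " + eidb.deInst);
    }
}
```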

Data mining for associations

We use data mining of the data in the EIDB to discover groups of people who are strongly connected with each other. The evidence for a connection between two people is that they are mentioned in the same document, i.e., that they are co-mentioned. We could include other evidence for a connection by either including additional structured data, such as database tables that show which people worked together on movies, or by deeper text analysis. Additional text analysis would enable us to find other relations between people, based on statements in the text. By adding more annotators to find these relations, and more types in the typesystem to store them in the CAS, we could create database tables containing entity-relation-entity triples, also called subject-predicate-object triples. To accommodate this future possibility, the co-mention data in the EIDB is cast into a triples-oriented schema by defining a view on the database that has this structure. The schema for this view, called UIMA_RELATIONS, is shown in Table 1.

Table 1. The schema of the UIMA_RELATIONS view. All columns are of type VARCHAR.
Column name       Description
subject_type      The type of the subject entity, e.g. NameReference.
subject_uri       A unique identifier for the subject entity, in the form of a URI.
predicate_type    The type of the predicate, e.g. Has_name.
object_type       The type of the object, such as Document or String.
object_name       Either the URI of the object entity, if for example its type is Document, or the string value of the object if its type is String.
evidence_uri      A URI that can be used by an application to retrieve the evidence for the relation, e.g. the URI of a document.

This kind of schema, known as a vertical schema, has two main advantages. First, it is very flexible, because new relations can easily be inserted by using different values in the predicate_type column. Second, it makes the relations and their semantics explicit, whereas in a standard database schema many relations are implicit in the design of the schema. A vertical schema is also closer to semantic Web standards like RDF. By making a view, rather than an explicit table, we avoid the main disadvantage of a vertical schema, namely that many queries require it to be joined with itself, which can be expensive.

The UIMA_RELATIONS view is created as the union of two SQL selects. One select creates rows for the "Mentioned_in" predicate, and the other for the "Has_name" predicate. The first of these connects people and documents. It pulls person instances from the INSTANCES table in the EIDB, and through joins with other tables finds the documents in which the person instances are mentioned. The evidence URI is the document URI. The second SQL select creates rows for the "Has_name" predicate, which connects person instances to their name strings. Since all the information needed for this predicate is in the INSTANCES table, an evidence URI is constructed to point to the relevant row in this table.

The data mining in Preston, mining for associations, requires the definition of another view, the MINING_VIEW, whose format is defined by the requirements of the DB2 Intelligent Miner tool described below. It is constructed from the UIMA_RELATIONS view by joining it to itself. The mining view contains only two columns, as shown in Table 2. The first column is a human-readable identifier for an entity, in this case the name of a person. The second column is a unique ID for the "transaction" in which the person occurs. In our case this is the URI of a document in which the person is mentioned.

Table 2. The schema of the view MINING_VIEW. Both columns are of type VARCHAR.
Column name       Description
name              A string that describes an entity, for example the name of an instance of a person.
transaction_id    A unique identifier for the transaction in which the entity appears, for example the URI of a document.
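
The self-join that produces the mining view can be pictured with a small in-memory analogue. The sketch below is not the SQL used in Preston, which the article does not list. MiningViewSketch, the Relation helper type, and the sample URIs are hypothetical. It pairs each "Mentioned_in" row with the "Has_name" row that has the same subject URI, and emits one (name, transaction_id) row per pair.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory analogue of joining UIMA_RELATIONS to itself.
class MiningViewSketch {

    // A reduced row of UIMA_RELATIONS: only the columns the join needs.
    static final class Relation {
        final String subjectUri;
        final String predicateType;
        final String objectName;
        Relation(String subjectUri, String predicateType, String objectName) {
            this.subjectUri = subjectUri;
            this.predicateType = predicateType;
            this.objectName = objectName;
        }
    }

    // Returns rows of the form { name, transaction_id }.
    static List<String[]> miningView(List<Relation> relations) {
        // "Has_name" side of the join: subject URI -> name string.
        Map<String, String> nameOf = new HashMap<>();
        for (Relation r : relations) {
            if ("Has_name".equals(r.predicateType)) {
                nameOf.put(r.subjectUri, r.objectName);
            }
        }
        // "Mentioned_in" side: the object is the document URI, which
        // serves as the transaction ID.
        List<String[]> rows = new ArrayList<>();
        for (Relation r : relations) {
            if ("Mentioned_in".equals(r.predicateType)) {
                String name = nameOf.get(r.subjectUri);
                if (name != null) {
                    rows.add(new String[] { name, r.objectName });
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // The URIs below are invented placeholders.
        List<Relation> relations = Arrays.asList(
            new Relation("eidb:INSTANCES#1", "Has_name", "Bill Cosby"),
            new Relation("eidb:INSTANCES#2", "Has_name", "Cicely Tyson"),
            new Relation("eidb:INSTANCES#1", "Mentioned_in", "src:TRIVIA#42"),
            new Relation("eidb:INSTANCES#2", "Mentioned_in", "src:TRIVIA#42"));
        for (String[] row : miningView(relations)) {
            System.out.println(row[0] + " | " + row[1]);
        }
    }
}
```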

The significance of the transaction ID can be easily seen if we consider the original motivation of association mining: market basket analysis. If we think of a market basket (for example, a shopping cart in a supermarket) as the "transaction," and its identifier as the transaction ID, then the original use of association mining was to find correlations between the presence of two or more items in the cart. In Preston, the document is the cart, and the people mentioned in it are the items in the cart. If we had other relations, in particular binary ones of the form person-relation-person, then the relation instance would be the cart, the subject and object of the relation would be the items in the cart, and the transaction identifier would be the identifier of the relation instance.

The output of association mining is a set of rules of the form

entity1, entity2 => entity3

meaning that the presence of both entity1 and entity2 in a transaction implies the presence of entity3 with a certain probability. This example is a rule of length 3. In Preston the rules we seek link pairs of people, like

personA => personB,

which has length 2. The strength of this rule measures how often personA and personB appear together in a document. Several measures of the strength are calculated by the association mining algorithm.

We use DB2 Intelligent Miner for the association mining. Once installed in DB2, this product can be invoked by calls to stored procedures within SQL statements. The invocation shown in Listing 3 uses an "easy mining procedure" that is supplied with Intelligent Miner. In this invocation PRESTON is the name of the model that is created, MINING_VIEW is the view being mined, and TRANSACTION_ID is the column of that view that identifies the transaction. The next two numerical parameters set two thresholds for the strength of the rules generated, namely that the minimum support is 0.01% and the minimum confidence is 1%. The last parameter specifies that the maximum rule length is 2, as just discussed. Support and confidence are measures of the strength of an association rule. For the rule personA => personB, support is the proportion of all transactions (documents) in which the rule holds, that is, in which both people are mentioned, while confidence is the proportion of the documents containing personA that also mention personB.
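
To make the two measures concrete, here is a small worked sketch in plain Java. It is not how DB2 Intelligent Miner computes its results internally. RuleStrengthSketch and its methods are hypothetical names, and the four sample transactions are invented. Both methods return fractions between 0 and 1, whereas the thresholds quoted above are expressed as percentages.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical worked example of support and confidence for a rule a => b.
class RuleStrengthSketch {

    static Set<String> setOf(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    // Number of transactions (documents) that contain every given name.
    static long countContaining(List<Set<String>> transactions, Set<String> names) {
        long count = 0;
        for (Set<String> transaction : transactions) {
            if (transaction.containsAll(names)) {
                count++;
            }
        }
        return count;
    }

    // Fraction of all transactions that mention both a and b.
    static double support(List<Set<String>> transactions, String a, String b) {
        return (double) countContaining(transactions, setOf(a, b)) / transactions.size();
    }

    // Fraction of the transactions mentioning a that also mention b.
    // The result is NaN if no transaction mentions a.
    static double confidence(List<Set<String>> transactions, String a, String b) {
        return (double) countContaining(transactions, setOf(a, b))
            / countContaining(transactions, setOf(a));
    }

    public static void main(String[] args) {
        // Four invented documents, each reduced to the set of people it mentions.
        List<Set<String>> documents = Arrays.asList(
            setOf("personA", "personB"),
            setOf("personA", "personB", "personC"),
            setOf("personA"),
            setOf("personC"));
        // Both people appear together in 2 of 4 documents: support 0.5.
        System.out.println(support(documents, "personA", "personB"));
        // personA appears in 3 documents, 2 of which mention personB: 2/3.
        System.out.println(confidence(documents, "personA", "personB"));
    }
}
```

Note that support is symmetric in the two people, while confidence is not: in the sample data, every document that mentions personB also mentions personA, so the confidence of personB => personA is 1.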

One way to think about co-mention relations is that they define a network, or graph, of people who are linked if they have been mentioned together in at least one document. This network is implicit in the mining view. One of several useful features of DB2 Intelligent Miner is that it finds strongly connected sub-graphs within this network. The people in the sub-graph have been mentioned together frequently. An example is shown in Figure 6, which was drawn by DB2 Intelligent Miner Visualization. We see that some well-known associations in real life can be detected by data mining the co-mention data in the IMDB trivia documents. The color coding here is that orange signifies stronger associations than white, which are in turn stronger than blue. This sub-graph represents, of course, the Beatles and people who have been strongly associated with them.
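
The co-mention network that is implicit in the mining view can be made explicit with a few lines of code. The sketch below only builds the weighted edges of that network. It does not attempt to reproduce how DB2 Intelligent Miner finds strongly connected sub-graphs. CoMentionGraphSketch, edgeCounts, and the "name | name" edge labels are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: build the co-mention network as weighted edges.
class CoMentionGraphSketch {

    // Each document is given as the list of names mentioned in it. The
    // result maps an edge label "X | Y" (names in sorted order) to the
    // number of documents that mention both X and Y.
    static Map<String, Integer> edgeCounts(List<List<String>> documents) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> document : documents) {
            // Remove repeated mentions and sort, so that each pair is
            // counted once per document under a single label.
            List<String> people = new ArrayList<>(new TreeSet<>(document));
            for (int i = 0; i < people.size(); i++) {
                for (int j = i + 1; j < people.size(); j++) {
                    String edge = people.get(i) + " | " + people.get(j);
                    counts.merge(edge, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // One document that mentions three people, one of them twice.
        List<List<String>> documents = new ArrayList<>();
        documents.add(Arrays.asList(
            "Cicely Tyson", "Andrew Young (IV)", "Bill Cosby", "Bill Cosby"));
        for (Map.Entry<String, Integer> edge : edgeCounts(documents).entrySet()) {
            System.out.println(edge.getKey() + " : " + edge.getValue());
        }
    }
}
```

Edges with high counts correspond to pairs of people who are mentioned together often, which is the kind of evidence that the mined rules quantify through support and confidence.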


Listing 3. The invocation in SQL of the Easy Mining Procedure for association mining. BuildRuleModel is a stored procedure supplied with DB2 Intelligent Miner.
		 CALL IDMMX.BuildRuleModel( 'PRESTON', 'MINING_VIEW',
		'TRANSACTION_ID', 0.01, 1, 2)
	


Figure 6. A strongly connected sub-graph discovered by DB2 Intelligent Miner within the network of co-mention relations that were found by text analysis.

Future directions

This article has described a simple application, Preston, that finds mentions of people in documents using text analysis in UIMA, builds a database from the data it finds, and invokes data mining for associations to find strongly connected sub-graphs in the network of co-mention relations. While this application is very simple, it illustrates the main features of using UIMA to bridge between the worlds of unstructured and structured data. One possible extension of this application would be to recognize more kinds of entities, and relations between the entities, by using more sophisticated text analysis. Annotators or text analysis engines from different sources can be easily plugged into the UIMA framework. IBM has already announced that several business partners are making UIMA-compatible text analysis components available. UIMA-compatible open-source components are also available from the GATE project at the University of Sheffield (see Resources).

Another extension would be to deploy the application not on the UIMA framework implementation available in the SDK but on the supported IBM product, WebSphere Information Integrator OmniFind Edition. OmniFind supports UIMA, and adds support for gathering documents from many different kinds of databases, and also for integrating text analysis and text search to provide semantic text search. In this case, be sure to use the version of the SDK from developerWorks that is compatible with OmniFind.

The UIMA framework is continuing to evolve, driven by work in IBM Research. While this article has focused on text analysis, UIMA can also be used to analyze other kinds of unstructured information such as audio and images.

Acknowledgements

The author wishes to thank Graham Bent of the IBM Hursley Laboratory for introducing him to the combination of DB2 Intelligent Miner with text analysis, and the Internet Movie Database for permission to use their content.


Resources

Learn

Get products and technologies

Discuss

