For those unfamiliar with text analysis, it may be helpful to begin with a brief overview of the technology and the typical use cases that benefit from text analytics. For a more in-depth discussion of text analytics, particularly as manifested in the Unstructured Information Management Architecture (UIMA), see the developerWorks and Apache.org articles listed in the Resources section.
Text analysis is the process of enabling computers to extract meaning from text. It is often implemented as a series of iterative processes that range from simple language detection, parsing, and tokenizing to more complex processes that can recognize the sentiment of what is being expressed. UIMA offers a standardized input and output format for these processes, enabling the plug-and-play of modules in a variety of combinations and from a variety of vendors.
The output of text analysis consists of the original text along with additional metadata about the text. At that point, a variety of applications can make use of that enhanced metadata, including business intelligence applications, search applications, enterprise content management systems, and text mining applications (see Figure 1).
Figure 1. Text analysis enhances a wide variety of applications
OmniFind Analytics Edition provides the ability to interactively explore and mine the results of text analysis, as well as structured data that is typically associated with unstructured text. For those of you familiar with business intelligence applications, you can think of it as content-centric business intelligence, in that it aggregates the results of text analysis to detect frequencies, correlations, and trends. Typical use cases include:
- Analysis of customer contact information (e-mails, chats, problem tickets, contact center notes) for insight into quality or satisfaction issues
- Analysis of blogs and wikis for reputation monitoring
- Analysis of internal e-mail for compliance violations or for expertise location
Figure 2 is an overview of the flow of content and data throughout the system. First, the original text data must be in a format that is understandable by OmniFind Analytics Edition, which is an XML format called Analysis Text Markup Language (ATML). OmniFind Analytics Edition can automatically convert comma-separated (.csv) files to ATML.
As you can see in Figure 2, there are both structured and unstructured components in the documents. You must specify which are the text fields that you want to run through text analysis (natural language processing).
Figure 2. OmniFind Analytics Edition Architecture
As the text goes through natural language processing, rules and dictionaries are applied. Default categories, rules, and dictionaries are included out of the box, but these can be customized to recognize information more specific to the domain being analyzed. The output of natural language processing is another XML file format called Mining Markup Language (MIML), as shown in Figure 3.
Figure 3. MIML files are the output of text analysis
Note: You can bypass the ATML step completely by crawling content using OmniFind Enterprise Edition and by running the OmniFind Analytics Edition text analysis on that crawled content inside of OmniFind Enterprise Edition's UIMA pipeline. The output of that analysis is a MIML file.
In the normal flow, the MIML is input into the indexing step and a text index is created. This index is specifically designed to allow for very high speed retrieval of requests for aggregation, correlation, and so on, as users interact with the mining interface. The MIML files can be found in the file system of your OmniFind Analytics Edition installation, in the directory <OAE_INSTALL_DIR>/databases/<database name>/db/miml/. You will learn more about MIML files in the next section.
The mining module is the interface that allows users to explore the index and discover patterns and relationships. Users can use familiar search paradigms, including keyword/semantic search and faceted navigation to explore the information, which can be viewed in a variety of ways such as top-n frequency, time series, and 2D correlation heat maps. As users drill into the information, at any point they can go directly to a view that contains the original text of the document.
The mining interface is also available as a Java API for custom environments.
As mentioned above, the typical flow is for the analysis results (MIML) to be indexed so that users can interactively explore the content using the interactive user interface. However, for some use cases there is a need to extend existing business performance reporting with key text features. For example, a health care payer may use the mining interface to discover occurrences of particular diseases or symptoms in notes taken by nurses who interact with patients on the phone. They may want to export that information to a data warehouse where they can link that information with existing structured data on costs or death rates and do regular reporting on that relationship as a key performance indicator.
This article describes some options for integrating the results of text analysis, the MIML files, into a relational schema for use by normal business reporting tools, such as Cognos.
A MIML file is a well-formed XML document whose root element is <miml>, as shown in Figure 4. At the second level of the document you find a number of <doc> elements that are children of the root element. Each <doc> element contains the output of the natural language processing for one particular input document, such as an e-mail, problem ticket, CRM record, or other plain text document. Since you often analyze many such documents, the MIML file typically has many <doc> elements.
Figure 4. Anatomy of a MIML file
As you see in Figure 4, each document is represented by an ID number, a title, the original content of the document ("input"), and three sets of features:
- standard features
- keyword features
- textlist features
The standard features represent general structured information from each document, such as date, origin, subject, or country. There is one <kw> element for each such data item. The val attribute of the <kw> element contains the actual value, and the cat attribute contains its category, such as "date" or "country".
The keyword features describe the occurrences of words and phrases in the unstructured text content. There is one <kw> element for each distinct word or phrase, each with a value and a category, such as "general noun", "proper noun", "verb", or "phrase". For each occurrence of such a word there is a separate <occ> element that contains the occurrence ID and the begin and end positions in the original text.
The text list features contain the positions of the informational data within the document content.
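Putting these pieces together, a MIML file might look like the following sketch. The <miml>, <doc>, <kw>, and <occ> elements and the val and cat attributes are described above; the wrapper element names (such as standard_features and keyword_features) and all attribute values are illustrative assumptions based on Figure 4, not a verbatim reproduction of the format.

```xml
<miml>
  <doc id="1">
    <title>Problem ticket 4711</title>
    <input>Customer reports a dental claim issue in Tennessee ...</input>
    <!-- standard features: one <kw> per structured data item -->
    <standard_features>
      <kw cat="date" val="2008-02-14"/>
      <kw cat="country" val="USA"/>
    </standard_features>
    <!-- keyword features: one <kw> per distinct word or phrase,
         with one <occ> per occurrence in the text -->
    <keyword_features>
      <kw cat="general noun" val="claim">
        <occ id="1" begin="27" end="32"/>
      </kw>
      <kw cat="proper noun" val="Tennessee">
        <occ id="2" begin="42" end="51"/>
      </kw>
    </keyword_features>
  </doc>
  <!-- typically many more <doc> elements follow -->
</miml>
```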
Because MIML files are well-formed XML documents, you can use the pureXML capabilities in DB2 to process the MIML data for reporting and integration with existing relational data. The two main approaches for doing this are shredding (to relational tables) and storing (as XML).
Shredding the MIML data into a relational target schema lets you apply the full power of SQL queries and business intelligence tools to the data. DB2 offers two methods for shredding XML data into relational tables:
- The first is called an "annotated schema shred", also known as "XML decomposition". With this approach, you add annotations to an XML schema to define how XML element and attribute values are mapped to relational columns in one or multiple tables. As you will see below, IBM Data Studio provides convenient GUI support to define this mapping visually.
- The second shredding method in DB2 uses the SQL/XML function XMLTABLE, which takes XML data as input and produces element and attribute values in a relational output format. In addition to the examples in this article, refer to the Resources section for other developerWorks articles that describe these DB2 pureXML features in more detail.
The alternative to shredding is to store and index the MIML data in a column of type XML in a DB2 table. This avoids conversion into relational format but requires SQL/XML or XQuery to query the data. The pros and cons of both approaches are discussed in this article.
XML shredding requires a relational target schema. There are various ways in which you can define a set of relational tables to represent the MIML data in relational format. Each has different advantages and disadvantages, related to shredding performance, query performance, and the ease of reporting and analytics. The three different relational schemas are discussed in this section.
Option 1: Simple four-table schema
The first relational schema is a very straightforward relational representation of the MIML data (Figure 5). It is somewhat de-normalized and allows for very easy and efficient shredding with either the XMLTABLE function or annotated schemas.
Figure 5. Simple four-table schema for MIML data in DB2
The DOCUMENT table contains the basic information of each document: the document ID, title, and the document content ("input"). Remember that a MIML document has the following:
- Standard features with standard keywords
- Keyword features that describe keyword occurrences
- Text list features
You see in Figure 5 that one table is used for each of these types of features. The standard features are stored in the STANDARD_KW table, keywords with their occurrences in the KEYWORD_KW_OCC table, and the text list features in the TEXTLIST_TEXT table. Every feature table contains a DOC_ID column as a reference to the DOCUMENT table, in which DOC_ID is unique so that the foreign key constraints can be satisfied. Figure 6 illustrates the logical mapping of a MIML file to the four relational tables.
Figure 6. Visualizing the mapping from MIML to the four-table schema
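As a sketch, the four tables might be defined as follows. The table names come from Figure 5; the column names and data type lengths are assumptions for illustration and should be adjusted to your actual data:

```sql
CREATE TABLE document (
  doc_id    INTEGER       NOT NULL PRIMARY KEY,
  title     VARCHAR(256),
  input     CLOB(1M)                      -- original document content
);

CREATE TABLE standard_kw (
  doc_id    INTEGER       NOT NULL REFERENCES document,
  cat       VARCHAR(64),                  -- e.g. 'date', 'country'
  val       VARCHAR(256)
);

CREATE TABLE keyword_kw_occ (
  doc_id    INTEGER       NOT NULL REFERENCES document,
  cat       VARCHAR(64),                  -- e.g. 'general noun', 'verb'
  val       VARCHAR(256),                 -- the keyword itself
  occ_id    INTEGER,
  begin_pos INTEGER,                      -- offsets into the input text
  end_pos   INTEGER
);

CREATE TABLE textlist_text (
  doc_id    INTEGER       NOT NULL REFERENCES document,
  begin_pos INTEGER,
  end_pos   INTEGER
);
```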
Since this schema is not fully normalized, there is some data redundancy. For example, each document contains several standard keywords, such as date and subject description. These keywords are stored repeatedly in the STANDARD_KW table for every occurrence of a given keyword. Additionally, keyword categories (such as noun, verb, and proper noun) are repeated in the KEYWORD_KW_OCC table for every keyword.
Option 2: Star schema
Figure 7 shows another possible schema for MIML data. This is a star schema consisting of a fact table and four dimension tables. The fact table DOCUMENT_ANALYSIS contains one row for each occurrence of a keyword in a document. The dimension tables describe each analysis entry in the fact table: the actual date, the keyword values and categories, as well as the corresponding document title and content. This schema is more normalized than the four-table option, so repeated storage of identical keywords and categories is avoided.
Figure 7. A star schema for MIML (initial version)
You also notice that this schema does not represent all the data items from the original MIML file. The actual position of each keyword occurrence has been omitted, as well as some of the standard features (except date) and text list features. Although they could be included in the schema, it is assumed here that they are not relevant for subsequent reporting purposes.
The DATE dimension allows you to analyze the data according to time periods, such as week, month, and quarter. Also note that the keyword categories are stored in the dimension table CATEGORY, where each category points to its parent category within the same table. Therefore, it is possible to query not only for individual categories (noun, verb) but also for category groups (word, phrase).
The advantages of this schema are:
- It is more normalized, thereby avoiding data redundancy and saving storage space
- It may be more suitable for reporting purposes and provides higher "queryability" than the simple four-table schema
The disadvantage of this schema is that shredding into it is considerably harder. You cannot shred into this schema directly using DB2's decomposition routines or the XMLTABLE function, because the schema requires the generation and proper use of foreign keys, which are key values not included in the original MIML. However, once the MIML data has been shredded into the simple four-table schema, it is easy to use SQL statements to move the data into the star schema (see the script found in the Download section).
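The actual script is in the Download section; as a heavily simplified sketch, the move from the four-table schema into the star schema could be done with INSERT ... SELECT statements along the following lines. All table, column, and sequence names here (cat_seq, date_dim, and so on) are hypothetical and do not necessarily match the downloadable script:

```sql
-- Populate the CATEGORY dimension with the distinct categories,
-- drawing surrogate keys from a sequence:
INSERT INTO category (cat_id, cat_name)
  SELECT NEXT VALUE FOR cat_seq, t.cat
  FROM (SELECT DISTINCT cat FROM keyword_kw_occ) AS t;

-- Populate the fact table by joining the shredded data
-- back to the freshly keyed dimension tables:
INSERT INTO document_analysis (doc_id, date_id, keyword_val, cat_id)
  SELECT k.doc_id, dt.date_id, k.val, c.cat_id
  FROM keyword_kw_occ k
  JOIN category c    ON c.cat_name = k.cat
  JOIN standard_kw s ON s.doc_id = k.doc_id AND s.cat = 'date'
  JOIN date_dim dt   ON dt.full_date = DATE(s.val);
```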
Option 3: Refined star schema
The star schema can be further refined to reduce the rows in the DOCUMENT_ANALYSIS fact table. In the initial star schema above, the fact table contains one row for each keyword occurrence. The refined schema introduces a bridge table between the actual document analysis and the keywords with their categories (Figure 8). This means that the refined fact table now has one row per document. This further reduces some data redundancy. For example, the foreign key to the date table is not repeated for every keyword of a document. Logically, each fact table row now represents a document, not a keyword occurrence.
Figure 8. A refined star schema for MIML
The refined star schema consists of the fact table DOCUMENT_ANALYSIS, the default dimension tables DOCUMENT and DATE, the bridge table UCAT_BRIDGE that points to the keywords, and several optional dimension tables, such as CATEGORY_SUBJECTDESC. The optional dimensions stem from the standard keyword categories in the MIML data. You can choose to have one such dimension table for each standard category, such as subject-description, country, and organization. The decision to include or exclude certain standard keyword categories in the star schema depends on the specific reporting requirements.
The UCAT table contains all keywords and their categories. In this refined schema, these categories are represented as levels. Each category, such as .tkm_en_base_word.noun.proper, is split into its levels tkm_en_base_word, noun, and proper. Each level is stored in a separate column. Hence, it is again possible also to query for category groups such as word or phrase.
The advantages of this refined star schema are as follows:
- It is further normalized and eliminates even more data redundancy
- Fact table rows now represent documents instead of keyword occurrences within the document. This can be more suitable and intuitive, depending on the actual reporting requirements.
The disadvantage of this schema is a slightly increased complexity with a many-to-many relationship between the fact table and the UCAT table. Shredding into this schema is therefore more challenging, but can be achieved with a SQL stored procedure (see the script found in the Download section). With the usual caveats about how plans can change, IBM intends to include a shredding tool in a future release of OmniFind Analytics Edition that would shred the MIML to this refined star schema.
This section uses DB2's annotated schema shredding feature to map MIML data to the simple four-table schema. IBM Data Studio includes a GUI tool to define mappings from XML schemas to relational schemas (Figure 9). In a nutshell, you open the XML schema file, establish a database connection, and then use the mouse to draw connections from XML elements and attributes to relational columns.
More specifically, you start by right-clicking the XML schema file in IBM Data Studio. From the context menu you choose Open with..., and then Annotated XSD Mapping Editor. This opens a three-part view, as shown in Figure 9. On the left side, you see the structure of all the elements and attributes defined by the XML schema. On the right side, you choose the tables you want to map to. The frame in the middle contains the mappings. To create a mapping, simply click an XML node in the left panel and then a column of one of the tables. Data Studio shows the mapping as a connecting line between the two end points.
Figure 9. Mapping XML to relational schemas with IBM Data Studio
In the background, IBM Data Studio inserts actual annotations into the XML schema file, which all start with the prefix db2-xdb, as shown in Figure 10. You see that each attribute is annotated with its path and a target column in a target table ("rowset").
Figure 10. XML schema with annotations
To register the newly annotated XML schema, use IBM Data Studio or execute the command shown in Listing 1 through the DB2 command line processor.
Listing 1. Register XML Schema
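The original listing is not reproduced here; in a sketch of the DB2 command line processor syntax, registering an annotated schema and enabling it for decomposition might look like the following. The schema URI, file name, and relational identifier are placeholders:

```sql
-- Register the annotated XML schema and enable it for decomposition
REGISTER XMLSCHEMA 'http://example.com/miml' FROM 'miml.xsd'
  AS miml.mimlschema COMPLETE ENABLE DECOMPOSITION
```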
After the annotated schema is registered, you can shred the MIML document into the relational target tables. If the MIML document exists as a file in the file system, such as miml_4711.xml, you can perform the decomposition with the command shown in Listing 2.
Listing 2. Decompose XML document regarding XML schema
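A sketch of the corresponding command line processor command, assuming the placeholder file and schema names from the previous example:

```sql
-- Decompose the file against the registered, decomposition-enabled schema
DECOMPOSE XML DOCUMENT miml_4711.xml XMLSCHEMA miml.mimlschema
```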
Alternatively, you can invoke the shredding from an application through a stored procedure call. The parameter marker represents the MIML document that should be shredded. For further information, see "Shred XML documents using DB2 pureXML" (developerWorks, Jan 2008).
Listing 3. Decompose XML document regarding XML schema using stored procedure
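A call from an application might look roughly like the following sketch; the parameter marker carries the MIML document. The exact parameter list of SYSPROC.XDBDECOMPXML varies by DB2 release, so treat the trailing arguments here as placeholders and consult the referenced article for the authoritative signature:

```sql
-- rschema, schema name, document (parameter marker), document ID,
-- validation flag, plus reserved arguments
CALL SYSPROC.XDBDECOMPXML('MIML', 'MIMLSCHEMA', ?, 'miml_4711',
                          0, NULL, NULL, 0)
```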
As an alternative to annotated schemas, you can use the SQL function XMLTABLE to shred the MIML documents. In that case, an XML schema is not required. The XMLTABLE function reads an XML document and returns a relational result set. The mapping of XML nodes to relational columns is implicit in the XMLTABLE function. In Listing 4, you see that one SQL insert statement is used for each of the four tables in the simple relational schema. Each insert statement reads the MIML document from a staging table called MIML that contains a single column, XMLDOC, of type XML. Alternatively, you can pass the MIML file from an application through a parameter marker into the insert statement. These techniques are described in more detail in the article "XMLTABLE by example, Part 2: Common scenarios for using XMLTABLE with DB2" (developerWorks, Sept 2007).
Listing 4. XMLTABLE statements to shred MIML into the simple four-table schema
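As a sketch of one of the four insert statements, here is the one for the DOCUMENT table; the other three follow the same pattern with paths into the feature elements. The XPath steps (title, input, the id attribute) are assumptions based on the MIML anatomy in Figure 4:

```sql
-- One INSERT per target table; shown here for DOCUMENT only.
INSERT INTO document (doc_id, title, input)
  SELECT t.doc_id, t.title, t.input
  FROM miml m,
       XMLTABLE('$d/miml/doc' PASSING m.xmldoc AS "d"
         COLUMNS
           doc_id INTEGER       PATH '@id',
           title  VARCHAR(256)  PATH 'title',
           input  CLOB(1M)      PATH 'input'
       ) AS t;
```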
The advantage of the insert statements with the XMLTABLE function over the annotated schema shred is generally better performance when the shredding is simple and the number of target tables is small, which is the case for MIML data. Also, if you are familiar with SQL and/or XPath, you can easily change or customize these insert statements to meet specific application requirements. You may find that easier than customizing the schema annotations shown in the previous section.
Shred into the star schemas
The star schemas require the generation and proper use of primary and foreign keys for the fact and dimension tables. Neither the annotated schema shred nor the XMLTABLE function can generate these keys. However, this can be resolved by embedding the INSERT statements above in an SQL stored procedure with a little extra logic. A sample stored procedure that shreds MIML data into the refined star schema (Figure 8) is attached to this article. It uses sequences in DB2 to generate unique identifiers; a sequence is a database object that generates a series of unique numbers. As categories or dates are shredded, the stored procedure checks whether they already exist in the dimension table. If so, the existing ID is read from the dimension table and used as the foreign key for the new entries in the fact table. Otherwise, a new key value is taken from the sequence and used both for a new entry in the dimension table and as a foreign key in the fact table.
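The check-then-generate logic can be sketched in SQL PL as follows; the sequence, table, and variable names (date_seq, date_dim, v_date) are hypothetical and do not necessarily match the downloadable stored procedure:

```sql
CREATE SEQUENCE date_seq AS INTEGER START WITH 1 INCREMENT BY 1;

-- Inside the stored procedure, for each date value being shredded:
-- reuse the existing key if the dimension row is already there,
-- otherwise draw a new key from the sequence and insert the row.
SET v_date_id = (SELECT date_id FROM date_dim WHERE full_date = v_date);
IF v_date_id IS NULL THEN
  SET v_date_id = NEXT VALUE FOR date_seq;
  INSERT INTO date_dim (date_id, full_date) VALUES (v_date_id, v_date);
END IF;
-- v_date_id is then used as the foreign key in the fact table row
```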
As an alternative to shredding, you can simply use a table with a column of type XML to store, index, and query the MIML data:
Listing 5. Create table with column of type XML
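The table definition for this approach is a one-liner; the table and column names MIML and XMLDOC match the staging table referenced in the XMLTABLE discussion above:

```sql
CREATE TABLE miml (xmldoc XML)
```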
This has various advantages:
- No need to design or optimize a relational schema
- No need to define a mapping from XML to relational tables
- XML insert performance is much higher than shredding performance
The potential disadvantage is that you cannot use plain SQL to query the MIML data in DB2. Instead, you use SQL/XML or XQuery, as shown in the next section. This is perfectly fine for queries up to a certain degree of complexity. If you need to perform highly complex queries and very heavy analytical processing, plain SQL over shredded data can provide better performance. Also, MIML data shredded into a relational schema is more easily accessible for most BI tools. However, using the XMLTABLE function, it is possible to create relational views of XML data. This is discussed in detail in "XMLTABLE by example, Part 2: Common scenarios for using XMLTABLE with DB2" (developerWorks, Sept 2007).
If a single MIML file contains information about thousands of text documents, then pureXML delivers much better query performance if you split such a large MIML file into smaller XML fragments, that is, along the path /miml/doc, to produce one XML document per original text document. The INSERT statement shown in Listing 6 does exactly that.
Listing 6. Split XML document into smaller subdocuments
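A sketch of such a split: each /miml/doc element is extracted with XMLTABLE and stored as its own document in a second table. The target table name miml_docs is an assumption for illustration:

```sql
-- Store each /miml/doc element as its own XML document
INSERT INTO miml_docs (xmldoc)
  SELECT t.doc
  FROM miml m,
       XMLTABLE('$d/miml/doc' PASSING m.xmldoc AS "d"
         COLUMNS doc XML PATH '.') AS t;
```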
To further improve the performance of XQuery or SQL/XML queries, XML indexes should be defined. Listing 7 shows how to index all the keyword values in the MIML XML format.
Listing 7. Create XML index on specific XPath
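A sketch of such an index over the split documents; the XMLPATTERN path assumes the element names from the MIML anatomy sketch earlier (keyword_features is not confirmed by the article text), and the VARCHAR length is an illustrative choice:

```sql
-- Index all keyword values for fast XQuery/SQL/XML predicates
CREATE INDEX idx_kw_val ON miml_docs (xmldoc)
  GENERATE KEY USING XMLPATTERN '/doc/keyword_features/kw/@val'
  AS SQL VARCHAR(64);
```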
The way you write queries against the MIML data depends on how you choose to represent the MIML data in DB2: simple relational schema, star schema, or a single table with an XML column. Consider a simple query as an example: say you want to find the titles of all documents from February 2008 that contain the keywords "dental" and "Tennessee." Listing 8 shows one possible way to write this query in SQL against the simple four-table schema. Listing 9 shows the same query for the refined star schema.
Listing 8. Sample query on simple four-table schema
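One possible formulation against the four-table schema sketched earlier; the assumption that the date is a standard keyword stored as a string beginning with '2008-02' is illustrative:

```sql
SELECT d.title
FROM document d
JOIN standard_kw s
  ON  s.doc_id = d.doc_id
  AND s.cat = 'date'
  AND s.val LIKE '2008-02%'
WHERE EXISTS (SELECT 1 FROM keyword_kw_occ k
              WHERE k.doc_id = d.doc_id AND k.val = 'dental')
  AND EXISTS (SELECT 1 FROM keyword_kw_occ k
              WHERE k.doc_id = d.doc_id AND k.val = 'Tennessee');
```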
Listing 9. Sample query on refined star schema
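Against the refined star schema, the same question routes through the bridge table. The table and column names below (date_dim, ucat_bridge, bridge_id, and so on) are hypothetical placeholders for the schema in Figure 8:

```sql
SELECT DISTINCT d.title
FROM document_analysis f
JOIN document d  ON d.doc_id  = f.doc_id
JOIN date_dim dt ON dt.date_id = f.date_id
WHERE dt.year = 2008 AND dt.month = 2
  AND EXISTS (SELECT 1
              FROM ucat_bridge b JOIN ucat u ON u.ucat_id = b.ucat_id
              WHERE b.bridge_id = f.bridge_id AND u.keyword = 'dental')
  AND EXISTS (SELECT 1
              FROM ucat_bridge b JOIN ucat u ON u.ucat_id = b.ucat_id
              WHERE b.bridge_id = f.bridge_id AND u.keyword = 'Tennessee');
```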
Listing 10. Sample query on pureXML storage using XQuery
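In XQuery against the split pureXML storage, the query might be sketched as follows; the element names under /doc again follow the assumed MIML anatomy from earlier:

```sql
XQUERY
for $d in db2-fn:xmlcolumn('MIML_DOCS.XMLDOC')/doc
where $d/standard_features/kw[@cat = "date"]
                             [fn:starts-with(@val, "2008-02")]
  and $d/keyword_features/kw[@val = "dental"]
  and $d/keyword_features/kw[@val = "Tennessee"]
return $d/title
```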
Listing 11. Sample query on pureXML storage using SQL/XML
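The same query in SQL/XML uses XMLEXISTS predicates and XMLQUERY in the select list; as before, the paths below assume the illustrative MIML element names:

```sql
SELECT XMLQUERY('$d/doc/title' PASSING xmldoc AS "d")
FROM miml_docs
WHERE XMLEXISTS('$d/doc[standard_features/kw[@cat = "date"]
                          [fn:starts-with(@val, "2008-02")]]'
                PASSING xmldoc AS "d")
  AND XMLEXISTS('$d/doc[keyword_features/kw/@val = "dental"]'
                PASSING xmldoc AS "d")
  AND XMLEXISTS('$d/doc[keyword_features/kw/@val = "Tennessee"]'
                PASSING xmldoc AS "d");
```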
As described earlier, you have several options to choose from when integrating OmniFind Analytics Edition text analysis results (MIML) with DB2. First, you must decide whether you want to shred the data to a relational schema or store it as an XML data type. If you decide to shred MIML to relational tables, there are different options for the relational target schema as well as for the actual shredding method. Table 1 and Table 2 provide a summary and comparison of the most relevant options.
Table 1 compares the simple four-table schema with the refined star schema and the pureXML storage. The first two require shredding, while the pureXML approach does not. This has a significant impact on the time it takes to ingest a large amount of MIML data. Tests were performed with a big MIML file that contained information for about 50,000 input documents. pureXML storage without shredding is the fastest way to insert this data into DB2. The insert time for this option is what is used as the baseline for the other options. Shredding the MIML data with a reasonably well-optimized stored procedure into the refined star schema took approximately 2.25 times longer than inserting the MIML data into an XML column. A prototype of this stored procedure can be found in the file located in the Download section.
Using the simple XMLTABLE statements (Listing 4) to shred MIML into the simple four-table schema also took about 2.25 times longer than using pureXML storage. Using DB2's decomposition routines (annotated schema shred) took about twice as long as the XMLTABLE shredding into the same target schema, which is 4.5 times longer than inserts into a pureXML column.
In this comparison, it is important to note that the approach with the refined star schema converts only a subset of the MIML data into relational format. For example, all the detailed information about keyword occurrences (the MIML element <occ>) has been skipped. In the sample data, this accounts for 45 percent of the data volume. Also note that the performance behavior may vary depending on the specific hardware and software configuration used.
The choice between relational and XML storage affects not only insert performance, but also the querying and indexing of the MIML data. If you take the hit and shred the MIML data to relational tables, then you can use plain SQL with relational indexes to analyze the data. XML storage implies that you use XQuery or SQL/XML to query the MIML data. For queries of low or medium complexity, including the ones in the Query MIML Data in DB2 section, the query performance is good. For highly complex queries, plain SQL over shredded data may be easier to write and can provide better performance. Also, MIML data shredded into a relational schema is more easily accessible for most BI tools.
Table 1. Comparison of the different target schemas
| | Simple four-table schema | Refined star schema | pureXML column |
|---|---|---|---|
| Shred options | direct | indirect (use stored procedure) | not required |
| Relative time to ingest MIML data | ~2.25x with XMLTABLE; ~4.5x with annotated schema shred | ~2.25x with stored procedure | 1x (baseline) |
| Data mapping | all data | just document title, text, keywords, date, and selected standard features | all data (stored as-is) |
| Available query languages | SQL | SQL | SQL/XML, XQuery; SQL if you define relational views over the XML data |
| Index types | indexes on relational columns | indexes on relational columns | XML indexes |
Table 2 summarizes and compares the usage of annotated XML schemas with the XMLTABLE function for shredding. With XMLTABLE, the shredding is implicitly defined in the parameters of the XMLTABLE function. The benefit of the annotated schema shredding is that IBM Data Studio offers a graphical wizard to define the mapping from XML to relational tables very conveniently (Figure 9).
Table 2. Shredding with annotated XML schemas versus the XMLTABLE function
| | Annotated XML schemas | XMLTABLE function |
|---|---|---|
| Define mapping by | adding annotations to the XML schema (for example, through the Data Studio user interface) | setting parameters of the XMLTABLE function |
| XML schema required | Yes | No |
| Routines or functions for shredding | xdbDecompXML() stored procedures | INSERT statements with the XMLTABLE function |
| Maintenance | change the XML schema annotations | change the XMLTABLE statements |
This article reviewed the text analysis capabilities of IBM OmniFind Analytics Edition, including an analysis of the XML format of the text analysis results, MIML. It then examined different approaches that can help you extend the value of OmniFind Analytics Edition text analysis by storing analysis results from the MIML file into DB2 to enable standard business intelligence operations and reporting using the full power of SQL or SQL/XML. Here is a summary of the decision process you should follow when deciding which method to use:
- Shred to relational or store as XML?
- If you want to use standard SQL and BI reporting tools, pick shred. Go to step 2.
- If you are OK with using SQL/XML and XQuery, and if you want highest possible insert performance with minimal schema and insert complexity, store as XML. You're done.
- If you have decided to shred, pick a relational target schema that best fits the query and reporting requirements.
- Consider the simple four-table schema if schema simplicity and ease of shredding with out-of-the-box tools and functions are important to you, and if you don't need to optimize for minimal storage space consumption.
- Consider the star schema or refined star schema for better "queryability" with BI and reporting tools. You may also prefer the star schemas if you want to minimize data redundancy and storage consumption and are willing to spend more system resources on shredding.
- Choose the simpler star schema if you prefer a model where each fact table row represents a keyword occurrence in a document.
- Choose the refined star schema if you prefer a model where each fact table row represents one document.
With a clear understanding of the advantages and disadvantages of each approach, you can choose the method that works best for your analysis and reporting needs.
For their help with this article and the relational star schemas, we'd like to thank Balunaini Prasad from the IBM India Lab, Sreeram Balakrishnan from the IBM Silicon Valley Lab, and Iwao Inagaki from the IBM Yamato Lab in Japan.
Download: DDLs, SQL/XML scripts, and stored procedure: TextAnalytics_to_DB2.zip (6 KB, HTTP)
- This download contains a set of sample DDLs and SQL/XML scripts as well as a prototype SQL stored procedure that shreds MIML data into a star schema.
- "OmniFind Analytics Edition: An advanced text analysis and mining platform": Get an in-depth look at the product.
- "XML Support in DB2 Universal Database": Learn more about this important topic.
- "XMLTABLE by example, Part 2: Common scenarios for using XMLTABLE with DB2" (developerWorks, Sept 2007): Explore the power of manipulating XML data with the XMLTABLE function.
- "Shred XML documents using DB2 pureXML" (developerWorks, Jan 2008): Find further discussion and examples of mapping XML data to relational tables in DB2.
- "pureXML in DB2 9: Which way to query your XML data?" (developerWorks, June 2006): Gain information about how to query your XML data.
- DB2 pureXML wiki: Stay up to speed on DB2's XML technology.
- "Apache UIMA": Get a more in-depth discussion of text analytics, particularly as manifested in the Unstructured Information Management Architecture (UIMA).
- developerWorks Information Management zone: Learn more about DB2. Find technical documentation, how-to articles, education, downloads, product information, and more.
- Stay current with developerWorks technical events and webcasts.
Get products and technologies
- OmniFind Analytics Edition product Web site: Find a wide range of information on IBM OmniFind Analytics Edition.
- Check out the IBM Data Studio product web site.
- Build your next development project with trial software, available for download directly from developerWorks.
- Participate in the discussion forum.
- Participate in developerWorks blogs and get involved in the developerWorks community.
Matthias Nicola is a Senior Software Engineer for DB2 pureXML™ at IBM's Silicon Valley Lab. His work focuses on all aspects of XML in DB2, including XQuery, SQL/XML, storage, indexing and performance. Matthias also works closely with customers and business partners, assisting them in the design, implementation, and optimization of XML solutions. Prior to joining IBM, Matthias worked on data warehousing performance for Informix Software. He received his doctorate in computer science from the Technical University of Aachen, Germany.
Martin Sommerlandt is on an 18-month internship at the IBM Silicon Valley Lab. He studies Computer Science at University of Technology Dresden, Germany, and joined IBM in 2007 as an intern to work on DB2 pureXML™ solutions. His interests include XQuery, SQL/XML, XML performance, and customized XML solutions. Martin is an IBM certified DB2 Application Developer and IBM certified DB2 Database Administrator. He has also worked as a tutor for DB2 administration and programming.