From text analytics to data warehousing

Process output from OmniFind Analytics Edition with DB2 and pureXML

Text analytics enables you to extract more business value from unstructured data such as emails, customer relationship management (CRM) records, office documents, or any text-based data. IBM® OmniFind™ Analytics Edition provides rich text analysis capabilities and interactive visualization to enable you to find patterns and trends hidden in large quantities of unstructured information. The text analysis results from OmniFind Analytics Edition are in XML-format and can also be stored, indexed, and queried in a DB2 database. This allows you to incorporate your text analysis results into existing business applications and reporting tools by using regular SQL or SQL/XML queries. This article provides an overview of text analytics with OmniFind Analytics Edition and describes several ways of bringing its analysis results into DB2, in relational or pureXML™ format.

Matthias Nicola (mnicola@us.ibm.com), DB2 pureXML, IBM Silicon Valley Laboratory

Author photo: Matthias NicolaMatthias Nicola is a Senior Software Engineer for DB2 pureXML™ at IBM's Silicon Valley Lab. His work focuses on all aspects of XML in DB2, including XQuery, SQL/XML, storage, indexing and performance. Matthias also works closely with customers and business partners, assisting them in the design, implementation, and optimization of XML solutions. Prior to joining IBM, Matthias worked on data warehousing performance for Informix Software. He received his doctorate in computer science from the Technical University of Aachen, Germany.


developerWorks Contributing author
        level

Martin Sommerlandt, DB2 pureXML, IBM

Author Photo: Martin SommerlandtMartin Sommerlandt is on an 18-month internship at the IBM Silicon Valley Lab. He studies Computer Science at University of Technology Dresden, Germany, and joined IBM in 2007 as an intern to work on DB2 pureXML™ solutions. His interests include XQuery, SQL/XML, XML performance, and customized XML solutions. Martin is an IBM certified DB2 Application Developer and IBM certified DB2 Database Administrator. He has also worked as a tutor for DB2 administration and programming.



Kathy Zeidenstein (krzeide@us.ibm.com), IBM Content Discovery, IBM

Author Photo: Kathy ZeidensteinKathy Zeidenstein has worked at IBM for a bazillion years. She currently works on the IBM Data Studio enablement team. Before taking on this role, she was the product marketing manager for IBM OmniFind Analytics Edition.



24 April 2008

Also available in Chinese Japanese

Overview of text analysis

It may be helpful for those unfamiliar with text analysis to provide a brief overview of the technology and typical use cases that benefit from application of text analytics. For those who want a more in-depth discussion of text analytics, particularly as manifested in the Unstructured Information Management Architecture (UIMA), there are several articles on developerWorks that go into more detail as well as on Apache.org (see the Resources section)

Text analysis is the process of enabling computers to extract meaning from text. Text analysis is often implemented as a series of iterative processes that range from simple language detection, parsing, and tokenizing, to more complex processes that can recognize the sentiment of what is being expressed. UIMA offers a standardized input and output format for these different processes to enable the plug-and-play of modules in a variety of combinations and from a variety of vendors.

The output of text analysis consists of the original text along with additional metadata about the text. At that point, there are a variety of applications that can make use of that enhanced metadata, including business intelligence applications, search applications, enterprise content management systems, and text mining applications (see Figure 1) .

Figure 1. Text analysis enhances a wide variety of applications
Text analysis enhances a wide variety of applications

Overview of OmniFind Analytics Edition

OmniFind Analytics Edition provides the ability to interactively explore and mine the results of text analysis, as well as structured data that is typically associated with unstructured text. For those of you familiar with business intelligence applications, you can think of it as content-centric business intelligence, in that it aggregates the results of text analysis to detect frequencies, correlations, and trends. Typical use cases include:

  • Analysis of customer contact information (e-mails, chats, problem tickets, contact center notes) for insight into quality or satisfaction issues
  • Analysis of blogs and wikis for reputation monitoring
  • Analysis of internal e-mail for compliance violations or for expertise location

Architecture

Figure 2 is an overview of the flow of content and data throughout the system. First, the original text data must be in a format that is understandable by OmniFind Analytics Edition, which is an XML format called Analysis Text Markup Language (ATML). OmniFind Analytics Edition can automatically convert comma separated files (.csv) to ATML.

As you can see in Figure 2, there are both structured and unstructured components in the documents. You must specify which are the text fields that you want to run through text analysis (natural language processing).

Figure 2. OmniFind Analytics Edition Architecture
OmniFind Analytics Edition Architecture

As the text goes through natural language processing, rules and dictionaries are applied. Default categories, rules, and dictionaries are included out of the box, but these can be customized to recognize information more specific to the domain being analyzed. The output of natural language processing is another XML file format called Mining Markup Language (MIML), as shown in Figure 3.

Figure 3. MIML files are the output of text analysis
MIML files are the output of text analysis

Note: You can bypass the ATML step completely by crawling content using OmniFind Enterprise Edition and by running the OmniFind Analytics Edition text analysis on that crawled content inside of OmniFind Enterprise Edition's UIMA pipeline. The output of that analysis is a MIML file.

In the normal flow, the MIML is input into the indexing step and a text index is created. This index is specifically designed to allow for very high speed retrieval of requests for aggregation, correlation, and so on, as users interact with the mining interface. The MIML files can be found in the file system of your OmniFind Analytics Edition installation, in the directory <OAE_INSTALL_DIR>/databases/<database name>/db/miml/. You will learn more about MIML files in the next section.

The mining module is the interface that allows users to explore the index and discover patterns and relationships. Users can use familiar search paradigms, including keyword/semantic search and faceted navigation to explore the information, which can be viewed in a variety of ways such as top-n frequency, time series, and 2D correlation heat maps. As users drill into the information, at any point they can go directly to a view that contains the original text of the document.

The mining interface is also available as a Java API for custom environments.

Extend business reporting with text analysis

As mentioned above, the typical flow is for the analysis results (MIML) to be indexed so that users can interactively explore the content using the interactive user interface. However, for some use cases there is a need to extend existing business performance reporting with key text features. For example, a health care payer may use the mining interface to discover occurrences of particular diseases or symptoms in notes taken by nurses who interact with patients on the phone. They may want to export that information to a data warehouse where they can link that information with existing structured data on costs or death rates and do regular reporting on that relationship as a key performance indicator.

This article describes some options for integrating the results of text analysis, the MIML files, into a relational schema for use by normal business reporting tools, such as Cognos.


Anatomy of a MIML file

A MIML file is a well-formed XML document whose root element is <miml>, as shown in Figure 4. At the second level of the document you find a number of <doc> elements that are children of the root element. Each <doc> element contains the output of the natural language processing for one particular input document such as an email, problem ticket, CRM record, or other plain text document. Since you often analyze many such documents, the MIML file typically has many <doc> elements.

Figure 4. Anatomy of a MIML file
Anatomy of a MIML file

As you see in Figure 4, each document is represented by an ID number, a title, the original content of the document ("input"), and three sets of features:

  • standard features
  • keyword features
  • textlist features.

The standard features represent general structured information from each document, such as date, origin, subject, or country. There is one <kw> element for each such data item. The val attribute of the <kw> element contains the actual value, the cat attribute contains its category, such as "date" or "country".

The keyword features describe the occurrences of words and phrases in the unstructured text content. There is one <kw> element for each distinct word or phrase, each with a value and a category, such as "general noun", "proper noun", "verb", or "phrase". For each occurrence of such word there is a separate <occ> element that contains the occurrence ID and the begin and end position in the original text.

The text list features contain the positions of the informational data within the document content.


Process MIML data in DB2

Because MIML files are well-formed XML documents, you can use the pureXML capabilities in DB2 to process the MIML data for reporting and integration with existing relational data. The two main approaches for doing this are shredding (to relational tables) and storing (as XML).

By shredding the MIML data into a relational target schema, the full power of SQL queries and business intelligence tools can be used to analyze the data. DB2 offers two methods for the shredding of XML data into relational table:

  • The first is Called an "annotated schema shred", also known as "XML decomposition". With this approach, you add annotations to an XML schema to define how XML element and attribute values are mapped to relational columns in one or multiple tables. You will see below that IBM Data Studio provides convenient GUI support to define this mapping visually.
  • The second shredding method in DB2 uses the SQL/XML function XMLTABLE, which typically takes XML data as input and produces element and attribute values in a relational output format. In addition to the examples in this article, refer to the Resources section for other developerWorks articles that describe these DB2 pureXML features in more detail.

The alternative to shredding is to store and index the MIML data in a column of type XML in a DB2 table. This avoids conversion into relational format but requires SQL/XML or XQuery to query the data. The pros and cons of both approaches are discussed in this article.

Relational schema options for MIML

XML shredding requires a relational target schema. There are various ways in which you can define a set of relational tables to represent the MIML data in relational format. Each has different advantages and disadvantages, related to shredding performance, query performance, and the ease of reporting and analytics. The three different relational schemas are discussed in this section.

Option 1: Simple four-table schema
The first relational schema is a very straightforward relational representation of the MIML data (Figure 5). It is somewhat de-normalized and allows for very easy and efficient shredding with either the XMLTABLE function or annotated schemas.

Figure 5. Simple four-table schema for MIML data in DB2
Simple four-table schema for MIML data in DB2

The DOCUMENT table contains the basic information of each document: the document ID, title, and the document content ("input"). Remember that a MIML document has the following:

  • Standard features with standard keywords
  • Keyword features that describe keyword occurrences
  • Text list features

You see in Figure 5 that one table is used for each of these types of features. The standard features are stored in the STANDARD_KW table, keywords with their occurrences in the KEYWORD_KW_OCC table, and the text list features in the TEXTLIST_TEXT table. Every feature table contains the DOC_ID as a reference to the DOCUMENT table. All DOC_IDs are unique to meet the foreign key constraints. Figure 6 illustrates the logical mapping of a MIML file to the four relational tables.

Figure 6. Visualizing the mapping from MIML to the four-table schema
Visualizing the mapping from MIML to the four-table schema

Since this schema is not fully normalized, there is some data redundancy. For example, each document contains several standard keywords, such as date and subject description. These keywords are stored repeatedly in the STANDARD_KW keyword table for every occurrence of a given keyword. Additionally, keyword categories (such as noun, verb, and proper verb) are repeated in the KEYWORD_KW_OCC table for every keyword.

Option 2: Star schema
Figure 7 shows another possible schema for MIML data. This is a star schema consisting of a fact table and four dimension tables. The fact table DOCUMENT_ANALYSIS contains one row for each occurrence of a keyword in a document. The dimension tables describe each analysis entry in the fact table: the actual date, the keyword values and categories as well as corresponding document title and content. This is a more normalized schema than the four-table schema option, so that repeated storage of identical keywords and categories is avoided.

Figure 7. A star schema for MIML (initial version)
A star schema for MIML (initial version)

You also notice that this schema does not represent all the data items from the original MIML file. The actual position of each keyword occurrence has been omitted, as well as some of the standard features (except date) and text list features. Although they could be included in the schema, it is assumed here that they are not relevant for subsequent reporting purposes.

The DATE dimension allows you to analyze the data according to time periods, such as week, month, and quarter. Also note that the keyword categories are stored in the dimension table CATEGORY, where each category points to its parent category within this table. Therefore, it is not only possible to query for individual categories (noun, verb), but also category groups (word, phrase).

The advantages of this schema are:

  • It is more normalized thereby avoiding data redundancy and saving storage space
  • It may be more suitable for reporting purposing and it provides higher "queryability" than the simple four-table schema.

The disadvantage of this schema is that shredding into it is a lot harder. You cannot shred into this schema directly using DB2's decomp routine or XMLTABLE function. This is because this relational schema requires the generation and proper use of foreign keys, which are key values that are not included in the original MIML. However, if MIML data has been shredded into the simple four-table schema, then it is easy to use SQL MERGE statements to move the data into the star schema (see the script found in the Download section).

Option 3: Refined star schema
The star schema can be further refined to reduce the rows in the DOCUMENT_ANALYSIS fact table. In the initial star schema above, the fact table contains one row for each keyword occurrence. The refined schema introduces a bridge table between the actual document analysis and the keywords with their categories (Figure 8). This means that the refined fact table now has one row per document. This further reduces some data redundancy. For example, the foreign key to the date table is not repeated for every keyword of a document. Logically, each fact table row now represents a document, not a keyword occurrence.

Figure 8. A refined star schema for MIML
A refined star schema for MIML

The refined star schema consists of the fact table DOCUMENT_ANALYSIS, the default dimension tables DOCUMENT and DATE, the bridge table UCAT_BRIDGE that points to the keywords, and several optional dimension tables, such as CATEGORY_SUBJECTDESC. The optional dimensions stem from the standard keyword categories in the MIML data. You can choose to have one such dimension table for each standard category, such as subject-description, country, and organization. The decision to include or exclude certain standard keyword categories in the star schema depends on the specific reporting requirements.

The UCAT table contains all keywords and their categories. In this refined schema, these categories are represented as levels. Each category, such as .tkm_en_base_word.noun.proper, is split into its levels tkm_en_base_word, noun, and proper. Each level is stored in a separate column. Hence, it is again possible also to query for category groups such as word or phrase.

The advantages of this refined star schema are as follows:

  • It is further normalized and eliminates even more data redundancy
  • Fact table rows now represent documents instead of keyword occurrences within the document. This can be more suitable and intuitive, depending on the actual reporting requirements.

The disadvantage of this schema is a slightly increased complexity with a many-to-many relationship between the fact table and the UCAT table. Shredding into this schema is therefore more challenging, but can be achieved with a SQL stored procedure (see the script found in the Download section). With the usual caveats about how plans can change, IBM intends to include a shredding tool in a future release of OmniFind Analytics Edition that would shred the MIML to this refined star schema.

Shred MIML with annotated XML schemas

This section uses DB2's annotated schema shredding feature to map MIML data to the simple four-table schema. IBM Data Studio includes a GUI tool to define mappings from XML schemas to relational schemas (Figure 9). In a nutshell, you open the XML schema file, establish a database connection, and then use the mouse to draw connections from XML elements and attributes to relational columns.

More specifically, you start by right-clicking on the XML schema file in IBM Data Studio. From the context menu you choose Open with…, and then Annotated XSD Mapping Editor. This opens a three-part view, as shown in Figure 9. On the left side, you see the structure of all the elements and attributes defined by the XML schema. On the right side, you choose the tables you want to map to. The frame in the middle contains the mappings. To create a mapping, simply click on an XML node in the left panel and on any column of one of the tables. Data Studio shows the mapping as a connecting line between the two end points.

Figure 9. Mapping XML to relational schemas with IBM Data Studio
Mapping XML to relational schemas with IBM Data Studio

In the background, IBM Data Studio inserts actual annotations into the XML schema file, which all start with the prefix db2-xdb, as shown in Figure 10. You see that each attribute is annotated with its path and a target column in a target table ("rowset").

Figure 10. XML schema with annotations
XML schema with annotations

To register the newly annotated XML schema, use IBM Data Studio or execute the command shown in Listing 1 through the DB2 command line processor.

Listing 1. Register XML Schema

REGISTER XMLSCHEMA
'http://www.example.org' FROM 'annotatedSchema.xsd' AS mimlschema COMPLETE ENABLE DECOMPOSITION;

After the annotated schema is registered, you can shred the MIML document into the relational target tables. If the MIML document exists as a file in the file system, such as miml_4711.xml, you can perform the DECOMPOSE statement:

Listing 2. Decompose XML document regarding XML schema

DECOMPOSE XML DOCUMENT
'miml_4711.xml' XMLSCHEMA mimlschema;

Alternatively, you can invoke the shredding from an application through a stored procedure call. The parameter marker represents the MIML document that should be shredded. For further information, see "Shred XML documents using DB2 pureXML" (developerWorks, Jan 2008).

Listing 3. Decompose XML document regarding XML schema using stored procedure

CALL
xdbDecompXML('sqlschema', 'mimlschema', ?, 'documentID', 0, NULL, NULL, NULL);

Shred MIML with the XMLTABLE function

As an alternative to annotated schema, you can use the SQL function XMLTABLE to shred the MIML documents. In that case, an XML schema is not required. The XMLTABLE function reads an XML document and returns a relational result set. The mapping of XML nodes to relational columns is implicit in the XMLTABLE function. In Listing 4, you see that one SQL insert statement is used for each of the four tables in the simple relational schema. Each insert statement reads the MIML document from a staging table called MIML that contains a single column, XMLDOC, of type XML. Alternatively, you can pass the MIML file from an application through a parameter marker into the insert statement. These techniques are described in more detail in the article "XMLTABLE by example, Part 2: Common scenarios for using XMLTABLE with DB2" (developerWorks Sept 2007).

Listing 4. XMLTABLE statements to shred MIML into the simple four-table schema

INSERT INTO
doc SELECT X.doc_id, X.title, X.input FROM MIML, XMLTABLE('$XMLDOC/doc' COLUMNS doc_id VARCHAR(15) PATH '@id', title CLOB(10K) PATH 'title', input CLOB(1M) PATH 'input') X; INSERT INTO standard_kw SELECT X.doc_id, X.cat, X.val FROM MIML, XMLTABLE('$XMLDOC/doc/feature[@type="standard"]/kw' COLUMNS doc_id VARCHAR(15) PATH '../../@id', cat VARCHAR(50) PATH '@cat', val VARCHAR(4096) PATH '@val') X; INSERT INTO keyword_kw_occ SELECT X.doc_id, X.cat, X.val, X.id, X.begin, X.end FROM MIML, XMLTABLE('$XMLDOC/doc/feature[@type="keyword"]/kw/occ' COLUMNS doc_id VARCHAR(15) PATH '../../../@id', cat VARCHAR(50) PATH '../@cat', val VARCHAR(4096) PATH '../@val', begin INTEGER PATH '@begin', end INTEGER PATH '@end', id INTEGER PATH '@id') X; INSERT INTO textlist_text SELECT X.doc_id, X.name, X.begin, X.end FROM MIML, XMLTABLE('$XMLDOC/doc/feature[@type="textlist"]/text' COLUMNS doc_id VARCHAR(15) PATH '../../@id', name VARCHAR(15) PATH '@name', begin INTEGER PATH '@begin', end INTEGER PATH '@end') X;

The advantage of the insert statements with the XMLTABLE function over the annotated schema shred is generally better performance if the shredding is simple and the number of target tables is small. This is case for MIML data. Also, if you are familiar with SQL and/or XPath, you can easily change or customize these insert statements if there are specific application requirements. You may find that easier than customizing the schema annotations shown in the previous section.

Shred into the star schemas
The star schemas require the generation and proper use of primary or foreign keys for the fact and dimension tables. Neither the annotated schema shred nor the XMLTABLE function can generate these keys. However, they can be resolved if you embed the INSERT statements above in an SQL stored procedure with a little extra logic. A sample stored procedure that shreds MIML data into the refined star schema (Figure 8) is attached to this article. It uses so-called sequences in DB2 to generate unique identifiers. A sequence is a database object that generates a sequence of unique numbers. As categories or dates are shredded, the stored procedure checks if they already exist in the dimension table. If so, the existing ID is read from the dimension table and used as the foreign key for the new entries in the fact table. Otherwise, a new key value is taken from the sequence and used for both, a new entry in the dimension table and as a foreign key in the fact table.

Store MIML in XML columns

As an alternative to shredding, you can simply use a table with a column of type XML to store, index, and query the MIML data:

Listing 5. Create table with column of type XML

CREATE TABLE
miml (document XML);

This has various advantages:

  • No need to design or optimize a relational schema
  • No need to define a mapping from XML to relational tables
  • XML insert performance is much higher than shredding performance

The potential disadvantage is that you cannot use plain SQL to query the MIML data in DB2. You would use SQL/XML or XQuery, as shown in the next section. This is perfectly fine for queries up to a certain degree of complexity. If you need to perform highly complex queries and very heavy analytical processing, plain SQL over shredded data can provide better performance. Also, MIML data shredded into a relational schema is more easily accessible for most BI tools. However, using the XMLTABLE function, it is possible to create relational views of XML data. This is discussed in detail in "XMLTABLE by example, Part 2: Common scenarios for using XMLTABLE with DB2."

If a single MIML file contains information about thousands of text documents, then pureXML delivers much better query performance if you split such a large MIML file into smaller XML fragments, meaning along the path /miml/doc to produce one XML document per original text document. The INSERT statement shown in Listing 6 does exactly that.

Listing 6. Split XML document into smaller subdocuments

INSERT INTO
subdocuments (document) SELECT T.xmldoc FROM miml, XMLTABLE('$DOCUMENT/miml/doc' COLUMNS xmldoc XML PATH 'document{.}') T;

To further improve the performance of XQuery or SQL/XML queries, XML indexes should be defined. Listing 7 shows how to index all the keyword values in the MIML XML format.

Listing 7. Create XML index on specific XPath

CREATE INDEX
idx_keyword ON subdocuments (document) GENERATE KEY USING XMLPATTERN '/doc/feature/kw/@val' AS SQL VARCHAR(4096);

Query MIML data in DB2

The way you write queries against the MIML data depends on how you choose to represent the MIML data in DB2: simple relational schema, star schema, or a single table with an XML column. Now take a really simple query as an example. Say you want to find the titles of all documents from February 2008 that contain the keywords "dental" and "Tennessee." Listing 8 shows one possible way to write this query in SQL against the simple four-table schema. Listing 9 shows the same query for the refined star schema.

Listing 8. Sample query on simple four-table schema

SELECT
title FROM document, standard_kw s, keyword_kw_occ kw1, keyword_kw_occ kw2 WHERE id = s.doc_id AND id = kw1.doc_id AND id = kw2.doc_id AND (s.cat = '.date' AND s.val BETWEEN '20080201' AND '20080229') AND kw1.val = 'dental' AND kw2.val = 'Tennessee';
Listing 9. Sample query on refined star schema

SELECT
doc.title FROM document_analysis da, document doc, date ucat_bridge b1, ucat cat1, ucat_bridge b2, ucat cat2 WHERE da.document = doc.id AND da.date = date.id AND date.month = 2 AND date.year = 2008 AND da.ucat = b1.id AND b1.annotid = cat1.annotid AND cat1.keyword = 'dental' AND da.ucat = b2.id AND b2.annotid = cat2.annotid AND cat2.keyword = 'Tennessee';

If you choose not to shred the MIML data but to store it in an XML column, the same query would be expressed in XQuery (Listing 10) or SQL/XML (Listing 11).

Listing 10. Sample query on pureXML storage using XQuery

XQUERY
for $i in db2-fn:xmlcolumn('MIML.DOCUMENT')/doc where $i/feature[@type="standard"]/kw[@cat=".date" and @val ge "20080201" and @val le "20080229"] and $i/feature[@type="keyword" and kw/@val = "dental" and kw/@val = "Tennessee"] return $i/title;
Listing 11. Sample query on pureXML storage using SQL/XML

SELECT XMLQUERY
('$DOCUMENT/doc/title') FROM miml WHERE XMLEXISTS ('$DOCUMENT/doc/feature[@type="standard"]/kw[@cat=".date" and @val ge "20080201" and @val le "20080229"]') AND XMLEXISTS ('$DOCUMENT/doc/feature[@type="keyword" and kw/@val = "dental" and kw/@val = "Tennessee"]');

Compare the options

As described earlier, you have several options to choose from when integrating OmniFind Analytics Edition text analysis results (MIML) with DB2. First, you must decide whether you want to shred the data to a relational schema or store it as an XML data type. If you decide to shred MIML to relational tables, there are different options for the relational target schema as well as for the actual shredding method. Table 1 and Table 2 provide a summary and comparison of the most relevant options.

Table 1 compares the simple four-table schema with the refined star schema and the pureXML storage. The first two require shredding, while the pureXML approach does not. This has a significant impact on the time it takes to ingest a large amount of MIML data. Tests were performed with a big MIML file that contained information for about 50,000 input documents. pureXML storage without shredding is the fastest way to insert this data into DB2. The insert time for this option is what is used as the baseline for the other options. Shredding the MIML data with a reasonably well-optimized stored procedure into the refined star schema took approximately 2.25 times longer than inserting the MIML data into an XML column. A prototype of this stored procedure can be found in the file located in the Download section.

Using the simple XMLTABLE statements (Listing 4) to shred MIML into the simple four-table schema also took about 2.25 times longer than using pureXML storage. Using DB2's decomp routines (annotated schema shred) ran about two times longer than the XMLTABLE shredding into the same target schema, which is 4.5 times longer than inserts into a pureXML column.

In this comparison, it is important to note that the approach with the refined star schema converts only a subset of the MIML data into relational format. For example, all the detailed information about keyword occurrences (MIML element <occ>) has been skipped. In the sample data, this accounts for 45 percent of the data volume. Also note that the performance behavior may vary depending on the specific hardware and software configuration used.

The choice between relational and XML storage does not only affect insert performance, but also the querying and indexing of the MIML data. If you take the hit and shred the MIML data to relational table, then you can use plain SQL with relational indexes to analyze the data. XML storage implies that you use XQuery or SQL/XML to query the MIML data. For queries of low or medium complexity, including the ones in the Query MIML Data in DB2 section, the query performance is good. For highly complex queries, plain SQL over shredded data may be easier to write and can provide better performance. Also, MIML data shredded into a relational is more easily accessible for most BI tools.

Table 1. Comparison of the different target schemas
. Simple four-table schemaRefined star schemapureXML column
StoragerelationalrelationalXML
Shred optionsdirectindirect (use stored procedure)not required
Relative time to ingest MIML dataDECOMP routinesXMLTABLE statementsStored Procedure using XMLTABLEINSERT statement
450%225%225%100%
Data mappingall datajust document title, text,
keywords, date, and
optional additional
standard keywords
all data
Available query languagesSQLSQLSQL/XML, XQuery;
SQL if you define relational
views over the XML data
Index typesindexes on relational columnsindexes on relational columnsXML indexes

Table 2 summarizes and compares the usage of annotated XML schemas with the XMLTABLE function for shredding. With XMLTABLE, the shredding is implicitly defined in the parameters of the XMLTABLE function. The benefit of the annotated schema shredding is that IBM Data Studio offers a graphical wizard to define the mapping from XML to relational tables very conveniently (Figure 9).

Table 2. Shredding with annotated XML schemas versus the XMLTABLE function
. Annotated XML schemasXMLTABLE function
Define mapping byAdding annotations in the XML Schema
(for example, through the Data Studio user interface)
Setting parameters of the function
XML schema requiredYesNo
Routines or functions for shreddingDECOMPOSE command or
xdbDecompXML() stored procedures
XMLTABLE function in SQL/XML
MaintenanceChange the XML Schema annotationsChange the XMLTABLE statements

Summary

This article reviewed the text analysis capabilities of IBM OmniFind Analytics Edition, including an analysis of the XML format of the text analysis results, MIML. It then examined different approaches that can help you extend the value of OmniFind Analytics Edition text analysis by storing analysis results from the MIML file into DB2 to enable standard business intelligence operations and reporting using the full power of SQL or SQL/XML. Here is a summary of the decision process you should follow when deciding which method to use:

  1. Shred to relational or store as XML?
    1. If you want to use standard SQL and BI reporting tools, pick shred. Go to step 2.
    2. If you are OK with using SQL/XML and XQuery, and if you want highest possible insert performance with minimal schema and insert complexity, store as XML. You're done.
  2. If you have decided to shred, pick a relational target schema that best fits the query and reporting requirements.
    1. Consider the simple four-table schema if schema simplicity, ease of shredding using out of the box tools and functions are important to you, and if you don't need to optimize for minimal storage space consumption.
    2. Consider the star schema or refined star schema for better "queryability" with BI and reporting tools. You may also prefer the star schemas if you want to minimize data redundancy and storage consumption and are willing to spend more system resources on shredding
      1. Choose the simpler star schema if you prefer a model where each fact table rows represents a keyword in a document.
      2. Choose the refined star schema if you prefer a model where each fact table row represents one document.

By a clear understanding of the advantages and disadvantages of each approach, you can choose which method works best for your analysis and reporting needs.

Acknowledgements

For their help with this article and the relational star schemas, we'd like to thank Balunaini Prasad from the IBM India Lab, Sreeram Balakrishnan from the IBM Silicon Valley Lab, and Iwao Inagaki from the IBM Yamato Lab in Japan.


Download

DescriptionNameSize
DDLs, SQL/XML scripts, and stored procedure1TextAnalytics_to_DB2.zip6KB

Note

  1. This download contains a set of sample DDLs and SQL/XML scripts as well as a prototype SQL stored procedure that shreds MIML data into a star schema.

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=302629
ArticleTitle=From text analytics to data warehousing
publish-date=04242008