 | Level: Intermediate Matthias Nicola (mnicola@us.ibm.com), DB2 pureXML, IBM Silicon Valley Laboratory Martin Sommerlandt (msommer@us.ibm.com), DB2 pureXML, IBM Kathy Zeidenstein (krzeide@us.ibm.com), IBM Content Discovery, IBM
24 Apr 2008 Text analytics enables you to extract more business value from
unstructured data such as emails, customer relationship management (CRM) records,
office documents, or any text-based data. IBM® OmniFind™ Analytics Edition provides rich text analysis capabilities and interactive visualization to enable you to find patterns and trends hidden in large quantities of unstructured information. The text analysis results from OmniFind Analytics Edition are in XML-format and can also be stored, indexed, and queried in a DB2 database. This allows you to incorporate your text analysis results into existing business applications and reporting tools by using regular SQL or SQL/XML queries. This article provides an overview of text analytics with OmniFind Analytics Edition and describes several ways of bringing its analysis results into DB2, in relational or pureXML™ format.
Overview of text analysis
It may be helpful for those unfamiliar with text analysis to provide a brief
overview of the technology and typical use cases that benefit from application of
text analytics. For those who want a more in-depth discussion of text analytics,
particularly as manifested in the Unstructured Information Management Architecture
(UIMA), there are several articles on developerWorks that go into more detail as well
as on Apache.org (see the Resources section)
Text analysis is the process of enabling computers to extract meaning from text.
Text analysis is often implemented as a series of iterative processes that range from
simple language detection, parsing, and tokenizing, to more complex processes that
can recognize the sentiment of what is being expressed. UIMA offers a standardized
input and output format for these different processes to enable the plug-and-play of modules in a variety of combinations and from a variety of vendors.
The output of text analysis consists of the original text along with additional metadata about the text. At that point, there are a variety of applications that can make use of that enhanced metadata, including business intelligence applications, search applications, enterprise content management systems, and text mining applications (see Figure 1) .
Figure 1. Text analysis enhances a wide variety of applications
Overview of OmniFind Analytics Edition
OmniFind Analytics Edition provides the ability to interactively explore and mine
the results of text analysis, as well as structured data that is typically associated
with unstructured text. For those of you familiar with business intelligence applications, you can think of it as content-centric business intelligence, in that it aggregates the results of text analysis to detect frequencies, correlations, and trends. Typical use cases include:
- Analysis of customer contact information (e-mails, chats, problem tickets, contact center notes) for insight into quality or satisfaction issues
- Analysis of blogs and wikis for reputation monitoring
- Analysis of internal e-mail for compliance violations or for expertise location
Architecture
Figure 2 is an overview of the flow of content and data throughout the system. First, the original text data must be in a format that is understandable by OmniFind Analytics Edition, which is an XML format called Analysis Text Markup Language (ATML). OmniFind Analytics Edition can automatically convert comma separated files (.csv) to ATML.
As you can see in Figure 2, there are both structured and unstructured components in the documents. You must specify which are the text fields that you want to run through text analysis (natural language processing).
Figure 2. OmniFind Analytics Edition Architecture
As the text goes through natural language processing, rules and dictionaries are applied. Default categories, rules, and dictionaries are included out of the box, but these can be customized to recognize information more specific to the domain being analyzed. The output of natural language processing is another XML file format called Mining Markup Language (MIML), as shown in Figure 3.
Figure 3. MIML files are the output of text analysis
Note: You can bypass the ATML step completely by crawling content using OmniFind Enterprise Edition and by running the OmniFind Analytics Edition text analysis on that crawled content inside of OmniFind Enterprise Edition's UIMA pipeline. The output of that analysis is a MIML file.
In the normal flow, the MIML is input into the indexing step and a text index is created. This index is specifically designed to allow for very high speed retrieval of requests for aggregation, correlation, and so on, as users interact with the mining interface. The MIML files can be found in the file system of your OmniFind Analytics Edition installation, in the directory <OAE_INSTALL_DIR>/databases/<database name>/db/miml/. You will learn more about MIML files in the next section.
The mining module is the interface that allows users to explore the index and discover patterns and relationships. Users can use familiar search paradigms, including keyword/semantic search and faceted navigation to explore the information, which can be viewed in a variety of ways such as top-n frequency, time series, and 2D correlation heat maps. As users drill into the information, at any point they can go directly to a view that contains the original text of the document.
The mining interface is also available as a Java API for custom environments.
Extend business reporting with text analysis
As mentioned above, the typical flow is for the analysis results (MIML) to be
indexed so that users can interactively explore the content using the interactive
user interface. However, for some use cases there is a need to extend existing business performance reporting with key text features. For example, a health care payer may use the mining interface to discover occurrences of particular diseases or symptoms in notes taken by nurses who interact with patients on the phone. They may want to export that information to a data warehouse where they can link that information with existing structured data on costs or death rates and do regular reporting on that relationship as a key performance indicator.
This article describes some options for integrating the results of text analysis,
the MIML files, into a relational schema for use by normal business reporting tools, such as Cognos.
Anatomy of a MIML file
A MIML file is a well-formed XML document whose root element is <miml>, as shown in Figure 4. At the
second level of the document you find a number of <doc> elements that are children of the root element. Each <doc> element contains the output of the natural language processing for one particular input document such as an email, problem ticket, CRM record, or other plain text document. Since you often analyze many such documents, the MIML file typically has many <doc> elements.
Figure 4. Anatomy of a MIML file
As you see in Figure 4, each document is represented by an ID number, a title, the original content of the document ("input"), and three sets of features:
- standard features
- keyword features
- textlist features.
The standard features represent general structured information from each document, such as date, origin, subject, or country. There is one <kw> element for each such data item. The val attribute of the <kw> element contains the actual value, the cat attribute contains its category, such as "date" or "country".
The keyword features describe the occurrences of words and phrases in the
unstructured text content. There is one <kw> element
for each distinct word or phrase, each with a value and a category, such as
"general noun", "proper noun", "verb", or
"phrase". For each occurrence of such word there is a separate <occ> element that contains the occurrence ID and the begin and end position in the original text.
The text list features contain the positions of the informational data within the document content.
Process MIML data in DB2
Because MIML files are well-formed XML documents, you can use the pureXML capabilities in DB2 to process the MIML data for reporting and integration with existing relational data. The two main approaches for doing this are shredding (to relational tables) and storing (as XML).
By shredding the MIML data into a relational target schema, the full power of SQL
queries and business intelligence tools can be used to analyze the data. DB2 offers
two methods for the shredding of XML data into relational table:
- The first is Called an "annotated schema shred", also known as "XML
decomposition". With this approach, you add annotations to an XML schema to
define how XML element and attribute values are mapped to relational columns in one
or multiple tables. You will see below that IBM Data Studio provides convenient GUI support to define this mapping visually.
- The second shredding method in DB2 uses the SQL/XML function
XMLTABLE, which typically takes XML data as input and produces
element and attribute values in a relational output format. In addition to the
examples in this article, refer to the Resources section for other developerWorks articles that describe these DB2 pureXML features in more detail.
The alternative to shredding is to store and index the MIML data in a column of type
XML in a DB2 table. This avoids conversion into relational format but requires
SQL/XML or XQuery to query the data. The pros and cons of both approaches are
discussed in this article.
Relational schema options for MIML
XML shredding requires a relational target schema. There are various ways in which
you can define a set of relational tables to represent the MIML data in relational
format. Each has different advantages and disadvantages, related to shredding
performance, query performance, and the ease of reporting and analytics. The three
different relational schemas are discussed in this section.
Option 1: Simple four-table schema
The first relational schema is a very straightforward relational representation of the MIML data (Figure 5). It is somewhat de-normalized and allows for very easy and efficient shredding with either the XMLTABLE function or annotated schemas.
Figure 5. Simple four-table schema for MIML data in DB2
The DOCUMENT table contains the basic information of each
document: the document ID, title, and the document content ("input"). Remember that a
MIML document has the following:
- Standard features with standard keywords
- Keyword features that describe keyword occurrences
- Text list features
You see in Figure 5 that one table is used for each of these types of features. The standard features are stored in the STANDARD_KW table, keywords with their occurrences in the KEYWORD_KW_OCC table, and the text list features in the TEXTLIST_TEXT table. Every feature table contains the DOC_ID as a reference to the DOCUMENT table. All DOC_IDs are unique to meet the foreign key constraints. Figure 6 illustrates the logical mapping of a MIML file to the four relational tables.
Figure 6. Visualizing the mapping from MIML to the four-table schema
Since this schema is not fully normalized, there is some data redundancy. For
example, each document contains several standard keywords, such as date and subject
description. These keywords are stored repeatedly in the STANDARD_KW keyword table for every occurrence of a given
keyword. Additionally, keyword categories (such as noun, verb, and proper verb) are repeated in the KEYWORD_KW_OCC table for every keyword.
Option 2: Star schema
Figure 7 shows another possible schema for MIML data. This is a
star schema consisting of a fact table and four dimension tables. The fact table
DOCUMENT_ANALYSIS contains one row for each occurrence of a keyword in a document.
The dimension tables describe each analysis entry in the fact table: the actual date,
the keyword values and categories as well as corresponding document title and
content. This is a more normalized schema than the four-table schema option, so that
repeated storage of identical keywords and categories is avoided.
Figure 7. A star schema for MIML (initial version)
You also notice that this schema does not represent all the data items from the
original MIML file. The actual position of each keyword occurrence has been omitted,
as well as some of the standard features (except date) and text list features.
Although they could be included in the schema, it is assumed here that they are not relevant for subsequent reporting purposes.
The DATE dimension allows you to analyze the data
according to time periods, such as week, month, and quarter. Also note that the keyword categories are stored in the dimension table CATEGORY, where each category points to its parent category within this table. Therefore, it is not only possible to query for individual categories (noun, verb), but also category groups (word, phrase).
The advantages of this schema are:
- It is more normalized thereby avoiding data redundancy and saving storage space
- It may be more suitable for reporting purposing and it provides higher
"queryability" than the simple four-table schema.
The disadvantage of this schema is that shredding into it is a lot harder. You
cannot shred into this schema directly using DB2's decomp routine or XMLTABLE function. This is because this relational schema requires
the generation and proper use of foreign keys, which are key values that are not
included in the original MIML. However, if MIML data has been shredded into the
simple four-table schema, then it is easy to use SQL MERGE
statements to move the data into the star schema (see the script found in the Download section).
Option 3: Refined star schema
The star schema can be further refined to reduce the rows in the DOCUMENT_ANALYSIS fact table. In the initial star schema above, the fact table contains one row for each keyword occurrence. The refined schema introduces a bridge table between the actual document analysis and the keywords with their categories (Figure 8). This means that the refined fact table now has one row per document. This further reduces some data redundancy. For example, the foreign key to the date table is not repeated for every keyword of a document. Logically, each fact table row now represents a document, not a keyword occurrence.
Figure 8. A refined star schema for MIML
The refined star schema consists of the fact table DOCUMENT_ANALYSIS, the default
dimension tables DOCUMENT and DATE, the bridge table UCAT_BRIDGE that points to the
keywords, and several optional dimension tables, such as CATEGORY_SUBJECTDESC.
The optional dimensions stem from the standard keyword categories in the MIML data.
You can choose to have one such dimension table for each standard category, such as
subject-description, country, and organization. The decision to include or exclude certain standard keyword categories in the star schema depends on the specific reporting requirements.
The UCAT table contains all keywords and their categories. In this refined schema,
these categories are represented as levels. Each category, such as
.tkm_en_base_word.noun.proper, is split into its levels tkm_en_base_word, noun, and proper. Each level is stored in a separate column. Hence, it is again possible also to query for category groups such as word or phrase.
The advantages of this refined star schema are as follows:
- It is further normalized and eliminates even more data redundancy
- Fact table rows now represent documents instead of keyword occurrences within the document. This can be more suitable and intuitive, depending on the actual reporting requirements.
The disadvantage of this schema is a slightly increased complexity with a
many-to-many relationship between the fact table and the UCAT table. Shredding into this schema is therefore more challenging, but can be achieved with a SQL stored procedure (see the script found in the Download section). With the usual caveats about how plans can change, IBM intends to include a shredding tool in a future release of OmniFind Analytics Edition that would shred the MIML to this refined star schema.
Shred MIML with annotated XML schemas
This section uses DB2's annotated schema shredding feature to map MIML data to the
simple four-table schema. IBM Data Studio includes a GUI tool to define mappings from
XML schemas to relational schemas (Figure 9). In a nutshell,
you open the XML schema file, establish a database connection, and then use the mouse to draw connections from XML elements and attributes to relational columns.
More specifically, you start by right-clicking on the XML schema file in IBM Data
Studio. From the context menu you choose Open with…, and then Annotated XSD
Mapping Editor. This opens a three-part view, as shown in Figure 9. On the left side, you see the structure of all the elements
and attributes defined by the XML schema. On the right side, you choose the tables you want to map to. The frame in the middle contains the mappings. To create a mapping, simply click on an XML node in the left panel and on any column of one of the tables. Data Studio shows the mapping as a connecting line between the two end points.
Figure 9. Mapping XML to relational schemas with IBM Data Studio
In the background, IBM Data Studio inserts actual annotations into the XML schema
file, which all start with the prefix db2-xdb, as shown in Figure 10. You see that each attribute is annotated with its path and a target column in a target table ("rowset").
Figure 10. XML schema with annotations
To register the newly annotated XML schema, use IBM Data Studio or execute the
command shown in Listing 1 through the DB2 command line processor.
Listing 1. Register XML
Schema
REGISTER XMLSCHEMA 'http://www.example.org'
FROM 'annotatedSchema.xsd' AS mimlschema
COMPLETE ENABLE DECOMPOSITION; |
After the annotated schema is registered, you can shred the MIML document into the relational target tables. If the MIML document exists as a file in the file system, such as miml_4711.xml, you can perform the DECOMPOSE statement:
Listing 2. Decompose XML document regarding XML schema
DECOMPOSE XML DOCUMENT 'miml_4711.xml' XMLSCHEMA mimlschema; |
Alternatively, you can invoke the shredding from an application through a stored
procedure call. The parameter marker represents the MIML document that should be
shredded. For further information, see "Shred XML documents using DB2 pureXML" (developerWorks, Jan 2008).
Listing 3. Decompose XML document regarding XML schema using stored procedure
CALL xdbDecompXML('sqlschema', 'mimlschema', ?, 'documentID',
0, NULL, NULL, NULL); |
Shred MIML with the XMLTABLE function
As an alternative to annotated schema, you can use the SQL function XMLTABLE to shred the MIML documents. In that case, an XML schema is
not required. The XMLTABLE function reads an XML document
and returns a relational result set. The mapping of XML nodes to relational columns
is implicit in the XMLTABLE function. In Listing 4, you see that one SQL insert statement is used for each of
the four tables in the simple relational schema. Each insert statement reads the MIML
document from a staging table called MIML that contains a single column, XMLDOC, of
type XML. Alternatively, you can pass the MIML file from an application through a
parameter marker into the insert statement. These techniques are described in more
detail in the article "XMLTABLE by example, Part 2: Common scenarios for using XMLTABLE with
DB2" (developerWorks Sept 2007).
Listing 4. XMLTABLE statements to shred MIML
into the simple four-table schema
INSERT INTO doc
SELECT X.doc_id, X.title, X.input
FROM MIML,
XMLTABLE('$XMLDOC/doc'
COLUMNS doc_id VARCHAR(15) PATH '@id',
title CLOB(10K) PATH 'title',
input CLOB(1M) PATH 'input') X;
INSERT INTO standard_kw
SELECT X.doc_id, X.cat, X.val
FROM MIML,
XMLTABLE('$XMLDOC/doc/feature[@type="standard"]/kw'
COLUMNS doc_id VARCHAR(15) PATH '../../@id',
cat VARCHAR(50) PATH '@cat',
val VARCHAR(4096) PATH '@val') X;
INSERT INTO keyword_kw_occ
SELECT X.doc_id, X.cat, X.val, X.id, X.begin, X.end
FROM MIML,
XMLTABLE('$XMLDOC/doc/feature[@type="keyword"]/kw/occ'
COLUMNS doc_id VARCHAR(15) PATH '../../../@id',
cat VARCHAR(50) PATH '../@cat',
val VARCHAR(4096) PATH '../@val',
begin INTEGER PATH '@begin',
end INTEGER PATH '@end',
id INTEGER PATH '@id') X;
INSERT INTO textlist_text
SELECT X.doc_id, X.name, X.begin, X.end
FROM MIML,
XMLTABLE('$XMLDOC/doc/feature[@type="textlist"]/text'
COLUMNS doc_id VARCHAR(15) PATH '../../@id',
name VARCHAR(15) PATH '@name',
begin INTEGER PATH '@begin',
end INTEGER PATH '@end') X; |
The advantage of the insert statements with the XMLTABLE function over the annotated schema shred is generally better performance if the shredding is simple and the number of target tables is small. This is case for MIML data. Also, if you are familiar with SQL and/or XPath, you can easily change or customize these insert statements if there are specific application requirements. You may find that easier than customizing the schema annotations shown in the previous section.
Shred into the star schemas
The star schemas require the generation and proper use of primary or foreign keys for the
fact and dimension tables. Neither the annotated schema shred nor the XMLTABLE function can generate these keys. However, they can be
resolved if you embed the INSERT statements above in an
SQL stored procedure with a little extra logic. A sample stored procedure that shreds
MIML data into the refined star schema (Figure 8) is attached to this article. It uses so-called sequences in DB2 to generate unique identifiers. A sequence is a database object that generates a sequence of unique numbers. As categories or dates are shredded, the stored procedure checks if they already exist in the dimension table. If so, the existing ID is read from the dimension table and used as the foreign key for the new entries in the fact table. Otherwise, a new key value is taken from the sequence and used for both, a new entry in the dimension table and as a foreign key in the fact table.
Store MIML in XML columns
As an alternative to shredding, you can simply use a table with a column of type XML
to store, index, and query the MIML data:
Listing 5. Create table with column of type XML
CREATE TABLE miml (document XML); |
This has various advantages:
- No need to design or optimize a relational schema
- No need to define a mapping from XML to relational tables
- XML insert performance is much higher than shredding performance
The potential disadvantage is that you cannot use plain SQL to query the MIML data
in DB2. You would use SQL/XML or XQuery, as shown in the next section. This is
perfectly fine for queries up to a certain degree of complexity. If you need to
perform highly complex queries and very heavy analytical processing, plain SQL over
shredded data can provide better performance. Also, MIML data shredded into a
relational schema is more easily accessible for most BI tools. However, using the
XMLTABLE function, it is possible to create relational views of XML data. This is discussed in detail in "XMLTABLE by example, Part 2: Common scenarios for using XMLTABLE with
DB2."
If a single MIML file contains information about thousands of text documents, then
pureXML delivers much better query performance if you split such a large MIML file
into smaller XML fragments, meaning along the path /miml/doc to produce one XML document per original text document. The INSERT statement shown in Listing 6 does exactly that.
Listing 6. Split XML document into smaller subdocuments
INSERT INTO subdocuments (document)
SELECT T.xmldoc FROM miml,
XMLTABLE('$DOCUMENT/miml/doc'
COLUMNS xmldoc XML PATH 'document{.}') T; |
To further improve the performance of XQuery or SQL/XML queries, XML indexes should be defined. Listing 7 shows how to index all the keyword values in the MIML XML format.
Listing 7. Create XML index on specific XPath
CREATE INDEX idx_keyword ON subdocuments (document)
GENERATE KEY
USING XMLPATTERN '/doc/feature/kw/@val' AS SQL VARCHAR(4096); |
Query MIML data in DB2
The way you write queries against the MIML data depends on how you choose to
represent the MIML data in DB2: simple relational schema, star schema, or a single
table with an XML column. Now take a really simple query as an example. Say you want
to find the titles of all documents from February 2008 that contain the keywords
"dental" and "Tennessee." Listing 8 shows one possible way to
write this query in SQL against the simple four-table schema. Listing 9 shows the same query for the refined star schema.
Listing 8. Sample query on simple four-table schema
SELECT title
FROM document, standard_kw s, keyword_kw_occ kw1, keyword_kw_occ kw2
WHERE id = s.doc_id
AND id = kw1.doc_id
AND id = kw2.doc_id
AND (s.cat = '.date' AND s.val BETWEEN '20080201' AND '20080229')
AND kw1.val = 'dental'
AND kw2.val = 'Tennessee'; |
Listing 9. Sample query on refined star schema
SELECT doc.title
FROM document_analysis da, document doc, date
ucat_bridge b1, ucat cat1, ucat_bridge b2, ucat cat2
WHERE da.document = doc.id
AND da.date = date.id AND date.month = 2 AND date.year = 2008
AND da.ucat = b1.id
AND b1.annotid = cat1.annotid AND cat1.keyword = 'dental'
AND da.ucat = b2.id
AND b2.annotid = cat2.annotid AND cat2.keyword = 'Tennessee'; |
If you choose not to shred the MIML data but to store it in an XML column, the same query would be expressed in XQuery (Listing 10) or SQL/XML (Listing 11).
Listing 10. Sample query on pureXML storage using XQuery
XQUERY
for $i in db2-fn:xmlcolumn('MIML.DOCUMENT')/doc
where $i/feature[@type="standard"]/kw[@cat=".date" and @val ge "20080201"
and @val le "20080229"]
and
$i/feature[@type="keyword" and kw/@val = "dental" and kw/@val = "Tennessee"]
return $i/title; |
Listing 11. Sample query on pureXML storage using SQL/XML
SELECT XMLQUERY ('$DOCUMENT/doc/title')
FROM miml
WHERE XMLEXISTS ('$DOCUMENT/doc/feature[@type="standard"]/kw[@cat=".date" and
@val ge "20080201" and @val le "20080229"]')
AND XMLEXISTS ('$DOCUMENT/doc/feature[@type="keyword" and
kw/@val = "dental" and kw/@val = "Tennessee"]'); |
Compare the options
As described earlier, you have several options to choose from when integrating OmniFind Analytics Edition text analysis results (MIML) with DB2. First, you must decide whether you want to shred the data to a relational schema or store it as an XML data type. If you decide to shred MIML to relational tables, there are different options for the relational target schema as well as for the actual shredding method. Table 1 and Table 2 provide a summary and comparison of the most relevant options.
Table 1 compares the simple four-table schema with the refined
star schema and the pureXML storage. The first two require shredding, while the
pureXML approach does not. This has a significant impact on the time it takes to
ingest a large amount of MIML data. Tests were performed with a big MIML file that
contained information for about 50,000 input documents. pureXML storage without
shredding is the fastest way to insert this data into DB2. The insert time for this
option is what is used as the baseline for the other options. Shredding the MIML data
with a reasonably well-optimized stored procedure into the refined star schema took
approximately 2.25 times longer than inserting the MIML data into an XML column. A
prototype of this stored procedure can be found in the file located in the Download section.
Using the simple XMLTABLE statements (Listing 4) to shred MIML into the simple four-table schema also took about 2.25
times longer than using pureXML storage. Using DB2's decomp routines (annotated
schema shred) ran about two times longer than the XMLTABLE shredding into the same target schema, which is 4.5 times longer than inserts into a pureXML column.
In this comparison, it is important to note that the approach with the refined star
schema converts only a subset of the MIML data into relational format. For example,
all the detailed information about keyword occurrences (MIML element <occ>) has been skipped. In the sample data, this accounts for
45 percent of the data volume. Also note that the performance behavior may vary depending on the specific hardware and software configuration used.
The choice between relational and XML storage does not only affect insert
performance, but also the querying and indexing of the MIML data. If you take the hit
and shred the MIML data to relational table, then you can use plain SQL with
relational indexes to analyze the data. XML storage implies that you use XQuery or
SQL/XML to query the MIML data. For queries of low or medium complexity, including
the ones in the Query MIML Data in DB2 section, the query performance is good. For highly complex queries,
plain SQL over shredded data may be easier to write and can provide better
performance. Also, MIML data shredded into a relational is more easily accessible for most BI tools.
Table 1. Comparison of the different target schemas
| .
| Simple four-table schema | Refined star schema | pureXML column |
|---|
|
Storage
| relational | relational | XML |
|---|
|
Shred options
| direct | indirect (use stored procedure) | not required |
|---|
|
Relative time to ingest MIML data
|
DECOMP routines |
XMLTABLE statements | Stored Procedure using XMLTABLE
|
INSERT statement |
|---|
| 450% | 225% | 225% | 100% | |
Data mapping
| all data | just document title, text, keywords, date, and optional additional standard keywords | all data |
|---|
|
Available query languages
| SQL | SQL | SQL/XML, XQuery; SQL if you define relational views over the XML data |
|---|
|
Index types
| indexes on relational columns | indexes on relational columns | XML indexes |
|---|
Table 2 summarizes and compares the usage of annotated XML
schemas with the XMLTABLE function for shredding. With
XMLTABLE, the shredding is implicitly defined in the parameters of the XMLTABLE function. The benefit of the annotated schema shredding is that IBM Data Studio offers a graphical wizard to define the mapping from XML to relational tables very conveniently (Figure 9).
Table 2. Shredding with annotated XML schemas
versus the XMLTABLE function
| .
| Annotated XML schemas | XMLTABLE function |
|---|
|
Define mapping by
| Adding annotations in the XML Schema (for example, through the Data Studio user interface) | Setting parameters of the function |
|---|
|
XML schema required
| Yes | No |
|---|
|
Routines or functions for shredding
|
DECOMPOSE command or xdbDecompXML() stored procedures |
XMLTABLE function in SQL/XML |
|---|
|
Maintenance
| Change the XML Schema annotations | Change the XMLTABLE statements |
|---|
Summary
This article reviewed the text analysis capabilities of IBM OmniFind Analytics
Edition, including an analysis of the XML format of the text analysis results, MIML.
It then examined different approaches that can help you extend the value of OmniFind Analytics Edition text analysis by storing analysis results from the MIML file into DB2 to enable standard business intelligence operations and reporting using the full power of SQL or SQL/XML. Here is a summary of the decision process you should follow when deciding which method to use:
- Shred to relational or store as XML?
- If you want to use standard SQL and BI reporting tools, pick shred. Go to step 2.
- If you are OK with using SQL/XML and XQuery, and if you want highest possible insert performance with minimal schema and insert complexity, store as XML. You're done.
- If you have decided to shred, pick a relational target schema that best fits the query and reporting requirements.
- Consider the simple four-table schema if schema simplicity, ease of shredding using out of the box tools and functions are important to you, and if you don't need to optimize for minimal storage space consumption.
- Consider the star schema or refined star schema for better "queryability" with BI and reporting tools. You may also prefer the star schemas if you want to minimize data redundancy and storage consumption and are willing to spend more system resources on shredding
- Choose the simpler star schema if you prefer a model where each fact table
rows represents a keyword in a document.
- Choose the refined star schema if you prefer a model where each fact table row represents one document.
By a clear understanding of the advantages and disadvantages of each approach, you
can choose which method works best for your analysis and reporting needs.
Acknowledgements
For their help with this article and the relational star schemas, we'd like to thank Balunaini Prasad from the IBM India Lab, Sreeram Balakrishnan from the IBM Silicon Valley Lab, and Iwao Inagaki from the IBM Yamato Lab in Japan.
Download | Description | Name | Size | Download method |
|---|
| DDLs, SQL/XML scripts, and stored procedure1 | TextAnalytics_to_DB2.zip | 6KB | HTTP |
|---|
Note - This download contains a set of sample DDLs and SQL/XML scripts as well as a prototype SQL stored procedure that shreds MIML data into a star schema.
Resources Learn
-
"IBM
OmniFind Analytics Edition: An advanced text analysis and mining platform": Get an
in-depth look at the product.
- "XML Support in DB2
Universal Database": Learn more about this important topic.
- "XMLTABLE by example, Part 2: Common scenarios for
using XMLTABLE with DB2" (developerWorks, Sept 2007): Explore the power of manipulating XML data with the
XMLTABLE function.
- "Shred XML documents using DB2 pureXML":
(developerWorks, Jan 2008): Find further discussion and examples of mapping XML data to relational tables in DB2.
- "pureXML in DB2 9: Which way to query your XML
data?" (developerWorks, June 2006): Gain information about how to query your XML data.
-
DB2 pureXML
wiki: Stay up to speed on DB2's XML technology.
- "Apache UIMA": Get a more in-depth discussion of text analytics, particularly as manifested in the Unstructured Information Management Architecture (UIMA).
-
developerWorks Information Management zone: Learn more about DB2. Find technical documentation, how-to articles, education, downloads, product information, and more.
-
Stay current with developerWorks
technical events and webcasts.
Get products and technologies
Discuss
About the authors  | 
|  | Matthias Nicola is a Senior Software Engineer for DB2 pureXML™ at IBM's Silicon Valley Lab. His work focuses on all aspects of XML in DB2, including XQuery, SQL/XML, storage, indexing and performance. Matthias also works closely with customers and business partners, assisting them in the design, implementation, and optimization of XML solutions. Prior to joining IBM, Matthias worked on data warehousing performance for Informix Software. He received his doctorate in computer science from the Technical University of Aachen, Germany. |
 | 
|  | Martin Sommerlandt is on an 18-month internship at the IBM Silicon Valley Lab. He studies Computer Science at University of Technology Dresden, Germany, and joined IBM in 2007 as an intern to work on DB2 pureXML™ solutions. His interests include XQuery, SQL/XML, XML performance, and customized XML solutions. Martin is an IBM certified DB2 Application Developer and IBM certified DB2 Database Administrator. He has also worked as a tutor for DB2 administration and programming. |
 | 
|  | Kathy Zeidenstein has worked at IBM for a bazillion years. She currently works on the IBM Data Studio enablement team. Before taking on this role, she was the product marketing manager for IBM OmniFind Analytics Edition. |
Rate this page
|  |