This article shows how to combine the XQuery functionality of DB2 with the data mining results that are stored in XML using InfoSphere Warehouse Data Mining.
InfoSphere Warehouse Data Mining provides algorithms for all the common data mining tasks, such as the tasks in the following list and the example applications of the tasks:
- Clustering, including finding important customer segments or
- Association rules, including cross-selling or store-layout improvement
- Sequential patterns, including selective marketing or prediction of part
- Classification, including churn analysis or medical diagnostics
- Regression, including insurance claim prediction or risk management
The results of all those data mining tasks are called data mining models, and they are stored in a table using the PMML standard. This standard is based on XML to allow easy exchange between different data mining providers. Such a model can then be applied to new data, which is called scoring.
For example, you might have the requirement to classify new customers regarding their risk of default. A PMML model needs to store all important information about the mining results. Intelligent Miner provides a set of SQL extractor functions to get the most important information from the model.
If you need to extract more or different information, this article describes how this could be done. This article also shows you the power of DB2 XQuery and how to use it in DB2 user-defined functions (UDFs) or DB2 stored procedures (STPs).
In DB2 9, pureXML introduced the XML datatype and made it possible to query and work with XML documents in the DB2 database very efficiently.
There are two query languages that can be used to extract information from XML documents. The XPath language is used to navigate through the XML tree and to select nodes according to given restraints. XQuery is a superset of XPath, and it adds the For, Let, Order by, Where, and Return expressions (also known as FLWOR expressions [pronounced flower]) to enable more complex SQL-style queries on XML documents. You can also transform XML documents with XQuery in a similar way as with XSLT.
To use pureXML, the database needs to be encoded in UTF-8. This is the default for databases since DB2 9.5 was released. An XML column can be created like every other column. The code in Listing 1 creates a small test table with an ID column and an XML column.
Listing 1. Creating a small test table
db2 "CREATE TABLE XMLTEST (ID int, XMLCOLUMN XML)"
To insert XML into the table, use the INSERT statement, as shown in Listing 2. The XML string in the command is parsed into the XML column. Be sure to use correct escaping for quotes.
Listing 2. Inserting XML
db2 "INSERT INTO XMLTEST VALUES (1, '<Person ... />')"
You will use the sample XML data shown in Listing 3 later in this
Person XML fragment is saved in a separate row of the XML column.
Listing 3. Inserting XML
<Person> <Name>John Smith</Name> <Tel publish="false">555-8659</Tel> </Person> ... <Person> <Name>Mark Muster</Name> <Tel publish="true">555-7765</Tel> </Person>
XQuery expressions can be used in SQL expressions. To use XQuery
expressions outside of
SQL, they must be preceded by
If XQuery is used outside of a SQL query, the input column must be
specified with the command
db2-fn:xmlcolumn with a parameter for the
column that contains the XML source, as shown in Listing 4.
Listing 4. Using XQuery expressions outside SQL
db2 "XQUERY db2-fn:xmlcolumn('XMLTEST.XML')"
The most important concept when working with XML documents is the XML path, which provides a way for addressing nodes and elements in the XML tree. The XPath expression in Listing 5 returns the NAME nodes of the people in the document by providing a path from the root node of the XML document.
Listing 5. Returning name nodes
XQUERY db2-fn:xmlcolumn('XMLTEST.XML')/Person/Name ... <Name>John Smith</Name> <Name>Mark Muster</Name> 2 record(s) selected.
XQuery provides sophisticated SQL-style expressions to query multiple XML documents. The example in Listing 6 uses the FOR and WHERE expressions to query for people whose telephone numbers are not published. Attributes of XML nodes can be accessed with the @ qualifier. Note that the expression used to check for the publish attribute is in brackets.
Listing 6. Querying for people's phone numbers
XQUERY for $x IN db2-fn:xmlcolumn('XMLTEST.XML')/Person/Tel where $x[@publish='false'] return $x ... <Tel publish="false">555-8659</Tel> 1 record(s) selected.
While XQuery can be used to extract XML nodes and values from an XML document stored in a DB2 table, you need a way to extract information as a structured table that can be read by tools that are unable to read XML. The method used for this in DB2 is the XMLTABLE expression. This expression creates a table from an XQuery expression. XMLTABLE is used in the FROM statement of a SELECT query. The first argument specifies the XQuery expression that returns the XML fragments. The COLUMNS argument specifies the table output columns, including column name, type, and the relative XQuery expression for each XML fragment.
Listing 7. Using the XMLTABLE expression
select * from XMLTABLE ('db2-fn:xmlcolumn("XMLTEST.XML")/Person' COLUMNS name VARCHAR(20) PATH 'Name', tel VARCHAR(20) PATH 'Tel', published VARCHAR(5) PATH 'Tel/@publish' ) AS X; ... NAME TEL PUBLISHED -------------------- -------------------- --------- John Smith 555-8659 false Mark Muster 555-7765 true 2 record(s) selected.
Using this type of DB2 select statement, you can extract all information in XML documents and pass them on as SQL table data. For more information, see Resources.
Predictive Model Markup Language (PMML) is an XML markup language to describe data mining models and to enable the exchange of mining models between different data mining providers. Exchanging the model enables other programs to score against these data mining models without creating the models themselves. See Resources for the PMML specification published on the Data Mining Group Website.
A PMML model typically follows this structure:
- A PMML root note, which shows the version of the PMML used
- A header, which provides general information about the model,
including name and date
- A MiningBuildTask, which gives information about the configuration used to
build the model
- A DataDictionary, which offers information about the fields used to create the model, including the name, type, and values
The model information depends on the used model. Listing 8 shows the PMML structure for a clustering model.
Listing 8. PMML structure for a clustering model
<?xml version="1.0" encoding="UTF-8" ?> <PMML version="3.0" xmlns="http://www.dmg.org/PMML-3_0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.dmg.org/PMML-3_0 http://www.dmg.org/v3-0/pmml-3-0.xsd"> <Header copyright="Copyright IBM Corp. 2002, 2007 All Rights Reserved"> <Application name="IBM DB2 Intelligent Miner" version="9.5"/> <Timestamp>2008-11-23 23:03:09</Timestamp> </Header> <MiningBuildTask> ... <DataDictionary numberOfFields="8"> ... <ClusteringModel modelName="BANK.CUSTOMERS_SEGMENTS" modelClass="centerBased" functionName="clustering" algorithmName="Kohonen" numberOfClusters="5" x-execTime="0.070000171661377" x-similarityThreshold="0.5" x-quality="0.780511795527719"> ...
The next example uses a clustering model based on the Kohonen algorithm. PMML supports center-based and distribution-based clustering models. Both have the element ClusteringModel as a top-level element. A list of clusters and a measure to calculate the distance or similarity for new records to the different clusters is included. Listing 9 shows a center-based Kohonen cluster model.
Listing 9. Using a center-based Kohonen cluster model
<ClusteringModel modelName="BANK.CUSTOMERS_SEGMENTS" modelClass="centerBased" functionName="clustering" algorithmName="Kohonen" numberOfClusters="5" x-execTime="0.070000171661377" x-similarityThreshold="0.5" x-quality="0.780511795527719"> + <Extension name="Matrix"> + <Extension name="X-Correlations"> + <MiningSchema> + <ModelStats> <ComparisonMeasure kind="distance"> <squaredEuclidean/> </ComparisonMeasure> <ClusteringField field="AGE" compareFunction="absDiff" x-quality="0.291742616514616"/> <ClusteringField field="AVERAGE_BALANCE" compareFunction="absDiff" x-quality="0.223507903713586"/> <ClusteringField field="GENDER" compareFunction="absDiff" x-quality="0.67909000081883"/> <ClusteringField field="ID" compareFunction="absDiff" x-quality="0.32252228279477"/> <ClusteringField field="MARITAL_STATUS" compareFunction="absDiff" x-quality="0.593752740606671"/> <ClusteringField field="NBR_YEARS" compareFunction="absDiff" x-quality="0.34454192375915"/> <ClusteringField field="NOT_USED" compareFunction="absDiff" x-quality="0.198193321261319"/> <ClusteringField field="SAVING_ACCOUNT" compareFunction="absDiff" x-quality="0.663013938830898"/> + <CenterFields> <Cluster name="1" x-quality="0.786176824083955"> ...
All attributes that start with
vendor-specific and can be
ignored. See Resources for more details and examples of PMML on the Data
Mining Group Web site.
Intelligent Miner is integrated in DB2 using UDFs and STPs. You can use either a set of SQL statements or the InfoSphere Warehouse DesignStudio graphical interface to create a data mining model. The graphical interface is the more convenient approach. The PMML models are stored in DB2 tables, which can be used either to score new data or to extract information.
For example, imagine a scenario where you have data from bank customers from which you want to use clustering to determine the different customer segments. Ensure that all the customers' relevant information is combined in one table or view, as shown in Listing 10.
Listing 10. Seeing bank customer data
db2 select * from bank.customers fetch first 3 rows only ID AGE GENDER MARITAL_STATUS NBR_YEARS SAVING_ACCOUNT AVERAGE_BALANCE NOT_USED ------ --- ------- -------------- --------- ------------- -------------- -------- 120921 34 m s 3 n -23 n 223423 55 m m 12 y 123 n 937212 25 m s 2 y 388 n 3 record(s) selected.
To create the clustering model on the sample data, use the IDMMX.BuildClusModel stored procedure. The parameters for the IDMMX.BuildClusModel stored procedure are:
- Name of the model
- Name of the table or view
- Specific data mining parameters
Listing 11 shows the command to create the model.
Listing 11. Creating the model
db2 "call IDMMX.BuildClusModel('BANK.CUSTOMERS_SEGMENTS', 'BANK.CUSTOMERS', 'DM_setAlgorithm(''Kohonen'')')"
When the model is created, either use the Intelligent Miner Visualizer to look at the model in a graphical way or use the existing Intelligent Miner extract functions to get some of the information through SQL. For example, you could extract an overview of the cluster model that shows the number of clusters, the used fields, and the quality of the model. Listing 12 shows the SQL to use.
Listing 12. Using an SQL to extract an overview of the cluster model
SELECT "ID" AS "CLUSTERID", CAST ("NAME" AS CHAR(20)) AS "CLUSTERNAME", "SIZE", CAST ("HOMOGENEITY" AS DEC(5,2)) AS "HOMOGENEITY" FROM TABLE (IDMMX.DM_getClusters(( SELECT "MODEL" FROM IDMMX."CLUSTERMODELS" WHERE "MODELNAME"='BANK.CUSTOMERS_SEGMENTS'))) AS "CLUSTERS" ORDER BY SIZE DESC;
Listing 13 shows the results of the SQL in Listing 12.
Listing 13. The resulting overview of the cluster model
MODELNAME NUMBER_CLUSTERS NUMBER_FIELDS MODEL_QUALITY -------------------- --------------- ------------- ------------- BANK.CUSTOMERS_SEGME 7 7 0.81 1 record(s) selected.
Most information can be extracted with the available data mining functions, but some information cannot be extracted. You can use XQuery functions with Intelligent Miner functions to read those hidden parts directly from the PMML model.
Intelligent Miner does not currently allow extracting details of the cluster supplementary fields. A field can either be active or supplementary, depending on how valuable it is for calculating the clusters. For example, fields that are primary keys in DB2 that always have different values are automatically moved to supplementary fields because they cannot provide insights to the clusters.
You can find the hidden information within the PMML model in the XML node MiningField. If the attribute usageType has the value Supplementary, it indicates a supplementary field. An active field is the default usage type, therefore it is not indicated in the PMML file.
Listing 14 shows a PMML MiningSchema example.
Listing 14. Coding a PMML MiningSchema
<MiningSchema> <MiningField name="AGE"/> <MiningField name="AVERAGE_BALANCE"/> ... <MiningField name="SAVING_ACCOUNT"/> <MiningField name="ID" usageType="supplementary"/> </MiningSchema>
Listing 15 shows how a UDF in DB2 could be implemented to extract all the cluster fields and to show the usage type of those fields.
Listing 15. Implementing a UDF in DB2
CREATE FUNCTION getClusterMiningFields (mname varchar(128)) RETURNS TABLE (name varchar(20), usagetype varchar(20)) LANGUAGE SQL BEGIN ATOMIC RETURN select X.name, CASE WHEN X.usageType IS NULL THEN 'active' ELSE X.usageType END from XMLTable('$DATA//*:MiningField' PASSING XMLPARSE(document ( select IDMMX.DM_expClusModel(cm.MODEL) from IDMMX.CLUSTERMODELS cm where MODELNAME=mname)) as "DATA" columns name varchar(20) path '@name', usageType varchar(20) path '@usageType') as X; END@
Look in more detail at the different parts of the function. The UDF is declared as a table function, which means it returns a temporary table and not just a single value. Because it is written in SQL, the LANGUAGE is SQL. For all other create function options, the defaults are used, which is often a good starting point.
The body of the UDF is implemented between the BEGIN and END tokens. All result columns are specified in the select part of the SQL/XQuery statement. In the example, the columns are X.name and X.usageType, which are output columns of the XMLTABLE expression. The case statement for the X.usageType indicates that the active XML attribute is not written into the PMML document.
XMLPARSE is needed, because the PMML document itself is not stored directly in an XML column. The document needs to be exported from a BLOB column using the IDMMX.DM_expClusModel function, and then the document should be copied into an XML value using the XMLPARSE function.
Once the document is available as an XML value, you can process it with XQuery, as shown in the native XQuery example in Listing 15.
To call the create function statement, put the statements into a file and call it with the syntax showing in Listing 16. Note that the conventional end-of-statement character is not used because the character is required within the UDF.
Listing 16. Calling the file containing create function statements
db2 -td@ -vf createClusterFieldUDF.db2
When you call the UDF within a
select statement, you get the
information about the fields, including which fields are active or
supplementary, as shown in Listing 17.
Listing 17. Reading the information about the fields
db2 select t1.name, t1.usagetype from table (getClusterMiningFields('BankCustomerSegments')) t1 NAME USAGETYPE -------------------- -------------------- AGE active NBR_YEARS_CLI active GENDER active MARITAL_STATUS active PROFESSION active SAVINGS_ACCOUNT active HAS_LIFE_INSURANCE active INT_CREDITCARD active ONLINE_ACCESS active JOINED_ACCOUNTS active BANKCARD active CLIENT_ID supplementary AVERAGE_BALANCE active NO_CRED_TRANS active NO_DEBIT_TRANS active 15 record(s) selected.
The same functionality could also be put into an STP. The syntax to create the STP looks different, but the code itself is exactly the same, as showing in Listing 18.
Listing 18. Using an STP
CREATE PROCEDURE getClusterMiningFields ( in mname varchar(128) ) LANGUAGE SQL RESULT SETS 1 BEGIN DECLARE cursor1 CURSOR WITH RETURN FOR select X.name, CASE WHEN X.usageType IS NULL THEN 'active' ELSE X.usageType END from XMLTable('$DATA//*:MiningField' PASSING XMLPARSE(document ( select IDMMX.DM_expClusModel(cm.MODEL) from IDMMX.CLUSTERMODELS cm where MODELNAME=mname)) as "DATA" columns name varchar(20) path '@name', usageType varchar(20) path '@usageType') as X; OPEN cursor1; END@
You can call the STP from the DB2 command line window using the call statement. The output, as shown in Listing 19, is the same as you saw in the UDF implementation.
Listing 19. Results of calling the STP from the DB2 command line window
db2 call getClusterMiningFields('BankCustomerSegments') Result set 1 -------------- NAME 2 -------------------- -------------------- AGE active NBR_YEARS_CLI active GENDER active MARITAL_STATUS active PROFESSION active SAVINGS_ACCOUNT active HAS_LIFE_INSURANCE active INT_CREDITCARD active ONLINE_ACCESS active JOINED_ACCOUNTS active BANKCARD active CLIENT_ID supplementary AVERAGE_BALANCE active NO_CRED_TRANS active NO_DEBIT_TRANS active 15 record(s) selected. Return Status = 0
InfoSphere Warehouse DesignStudio is available with integrated tooling for the development of Java® and SQL STPs and UDFs. This article's second example creates a UDF that reads model information from an Association Model. This information is available in the Association Visualizer, but it has no corresponding SQL extractor function. The tooling helps with the creation, deployment, and testing of the UDF.
To use the tooling:
- Open InfoSphere Warehouse DesignStudio.
- Switch to the Data perspective by selecting Window > Open
Perspective > Other.
- If the Data perspective is not in the list, select Show all, and
select Data from the list. You should now see a Data Project Explorer in
the left corner.
- Create a new Data Development Project by selecting File > New > Data Development
- Name the new project PMMLExtraction.
- Select a database that is enabled for mining and that contains an Association model. For this example, select the sample database DWESAMP, and click Finish. You should now have a new Data Development project that has subfolders for UDFs, STPs, and other database objects, as shown in Figure 1.
Figure 1. Data perspective of DesignStudio
Click here for a larger view of Figure 1.
To create an SQL user-defined function:
- Create a new UDF by selecting New > User-Defined Function, and
right-clicking the User-Defined Functions folder.
- Change the name of the UDF to AssocInformationExtractor, and make
sure that the Language is SQL.
- Click Finish. This creates a very basic UDF. Now you can add the logic you want.
Association models contain a number of global statistics, including the number of transactions and the number of item sets. You can view these statistics in the Association Visualizer of InfoSphere Warehouse in the Statistics tab, as shown in Figure 2.
Figure 2. Global statistics in Association Visualizer
This information is saved in the PMML model as attributes of the Association model node, as shown in Listing 20.
Listing 20. Association model attributes
<PMML ... <AssociationModel modelName="MBA.RULES" functionName="associationRules" algorithmName="SIDE" numberOfTransactions="34734" maxNumberOfItemsPerTA="242" avgNumberOfItemsPerTA="2.87349571025508" minimumSupport="0.001" minimumConfidence="0.1" lengthLimit="4" numberOfItems="25" numberOfItemsets="62" numberOfRules="74" x-quality="1">
The UDF itself is very similar to what was done in this article's first example. It gets the name of the model as a parameter, and it returns a table with one column for each statistic. Association models are grouped together with Sequence models as Rule models, which are saved in the IDMMX.RULEMODELS table. Because there is only one AssociationModel node in the PMML model, its xPath can be written as $DATA//*:AssociationModel, which searches for an AssociationModel node with any parent nodes. Listing 21 shows the complete UDF code to enter into the Source view of the editor in the DesignStudio.
Listing 21. UDF code for the Source view
CREATE FUNCTION ASSOCINFORMATIONEXTRACTOR (mname varchar(128)) RETURNS TABLE (numberTransactions int, averageNumberItemsPerTransaction real, maximumNumberItemsPerTransaction int, numberItemSets int, numberSingletonItemSets int, minimumRuleSupport real, minimumRuleConfidence real, maximumRuleLength int) LANGUAGE SQL MODIFIES SQL DATA NO EXTERNAL ACTION DETERMINISTIC BEGIN ATOMIC RETURN select X.numberTransactions, X.averageNumberItemsPerTransaction, X.maximumNumberItemsPerTransaction, X.numberItemSets, X.numberSingletonItemSets, X.minimumRuleSupport, X.minimumRuleConfidence, X.maximumRuleLength from XMLTable('$DATA//*:AssociationModel' PASSING XMLPARSE(document (select IDMMX.DM_expRuleModel(cm.MODEL) FROM IDMMX.RULEMODELS cm where MODELNAME=mname)) as "DATA" columns numberTransactions int path '@numberOfTransactions', averageNumberItemsPerTransaction real path '@avgNumberOfItemsPerTA', maximumNumberItemsPerTransaction int path '@maxNumberOfItemsPerTA', numberItemSets int path '@numberOfItemsets', numberSingletonItemSets int path '@numberOfItems', minimumRuleSupport real path '@minimumSupport', minimumRuleConfidence real path '@minimumConfidence', maximumRuleLength int path '@lengthLimit' ) as X ; END
To deploy this UDF to the database:
- Right-click the UDF in Data Project Explorer, and select
- Enter the schema for the UDF, such as the example DB2XML. The other default
settings deploy this to the current database, drop UDFs with a
similar name, and build the code for Java UDFs.
- Click Finish to deploy the UDF. You see a
Deploy successfulmessage in the Data Output view at the lower right.
- Use the procedure from the command line, or run it
from the DesignStudio by right-clicking Run.
- On the Specify Parameter Values window, enter the model name of an existing Association model. For the InfoSphere Warehouse example, enter MBA.RULES. In the data output view, you can see the resulting set with global statistics of the model, as shown in Figure 3.
Figure 3. Global statistics result set
You can also use the InfoSphere Warehouse Data tooling to develop and debug UDFs and STPs.
The article showed how easy it is to write your own PMML extract functions using XQuery combined with Intelligent Miner and to pass the results as relational data into your SQL. No programming language is needed, and the process of creating the UDFs is easier because it is supported by graphical tools.
"Overview of InfoSphere Warehouse and Cognos integration architecture"
(developerWorks, Oct 2008) for a good refresher.
- Get the online
resources you need to advance your skills on InfoSphere Warehouse.
- Refer to the Data Mining Group Website
for the published PMML specification.
- Learn more about Information Management at the developerWorks Information Management
zone. Find technical documentation,
how-to articles, education, downloads, product information, and
- Stay current with
developerWorks technical events and webcasts.
Get products and technologies
- Download a free
trial version of
IBM InfoSphere Warehouse 9.7 Enterprise
- Build your next
development project with
IBM trial software,
available for download directly from developerWorks.
Michael Haide is team lead for InfoSphere Warehouse Data Mining at the IBM Research and Development Lab at Boeblingen, Germany. He works on data mining kernel and database integration. He also worked several years in DB2 Net Search Extender as team lead and architect.