Integrate IBM InfoSphere Warehouse data mining with IBM Cognos reporting, Part 2: Deviation detection with InfoSphere Warehouse and Cognos

Disseminate your data mining results effectively

In the previous installment of this series you learned how to visualize simple data mining results in IBM® Cognos®. In this article, learn some advanced techniques, such as drill-down and the extraction of structured information from data mining models with Cognos. Using the included business scenario and running example, understand the data mining task of deviation detection, that is, the task of identifying unnatural data records. See how to find such records with IBM InfoSphere™ Warehouse data mining, and create interactive reports that allow interactive exploration.

Share:

Benjamin G. Leonhardi (bleon@de.ibm.com), Software Engineer, IBM

Author Photo: Benjamin LeonhardiBenjamin Leonhardi is a software engineer for InfoSphere Warehouse data mining at the IBM Research & Development Lab in Boeblingen, Germany. He works on mining visualization, text mining, and mining reporting solutions.



Christoph Sieb (christoph.sieb@de.ibm.com), Senior Software Engineer, IBM

Author Photo: Christoph SiebChristoph Sieb is a senior software engineer at the IBM Research & Development Lab in Boeblingen, Germany. Christoph is responsible for data mining algorithms and framework-related issues of InfoSphere Warehouse data mining. He has more than seven years of experience as a developer and consultant in the area of business intelligence and data mining.



Dr. Michael J. Wurst (mwurst@de.ibm.com), Senior Software Engineer, IBM

Author Photo: Michael WurstMichael Wurst, Ph.D. is a senior software engineer at the IBM Research & Development Lab in Boeblingen, Germany. He holds a Ph.D. in computer science and is responsible for data mining algorithms and tooling in InfoSphere Warehouse. Prior to joining IBM, Michael worked as a co-developer, architect, and consultant for the RapidMiner data mining software.



26 November 2008

Also available in Chinese Russian Vietnamese

Introduction

Finding out about suspicious behavior in time is an important task in many current IT applications. Imagine, for instance, credit card transactions. If a user shows highly unusual buying behavior (for example, shopping at a discount store regularly and then buying expensive jewelry), then it could be a good idea to check the corresponding transactions for traces of fraud. Still, deviations can be used in other scenarios than just to detect fraud or malicious manipulation. Human resource departments use deviation detection to find employees or applicants that do not fit into the usual categories and might be overlooked when applying fixed rules to identify high potentials.

Data records that deviate from the global distribution of data records are called outliers. Outlier handling is usually not a fully automated task. Rather, data mining is used to point out data records that deserve a closer look by a human analyst or expert who must then decide whether to take action or not. A sophisticated user interface and interaction model is therefore a prerequisite of successful outlier handling. Cognos is well suited for this task. Actually, a report similar to the one created in the first article of this series can be used to visualize outliers. To leverage the full potential of Cognos for displaying outliers, however, you need to employ some more advanced features. First, see show how to use "drill-through" to create interactive Cognos reports and how to link reports. This will help to summarize information and still allow for a fast access to relevant outlier data records. Second, learn how additional information can be extracted from the data mining models that helps the human expert to understand the nature of an outlier.

The running example in this article is an application that helps employees of a bank identify customers that show unusual behavior. This can be used to avoid fraud or to detect customers that deserve special attention. In the following section, get an overview of deviation detection and see how InfoSphere Warehouse can be used to find outliers in large data sets. The subsequent sections cover the basics of drill-through and extracting information from mining models, and show how both of these techniques can be used to make deviation detection results easier to understand and to utilize.


Deviation detection with InfoSphere Warehouse

What is deviation detection?

Deviation detection is defined as the task of finding unusual data records in large datasets. These records are called outliers. How to define "unusual" exactly is subject to academic debate and may also depend on the domain where deviation detection is applied. On a general level, deviation detection aims to find data records with properties that do not follow the general statistical distribution of the majority of the data records. Depending on the application area, deviations may be:

  1. Incorrect data (for example, if a person has an age of 300, this is probably an incorrect entry in the database)
  2. Unusual behavior of underlying processes (for example, credit card transactions that do not follow the usual patterns)

Correspondingly, deviation detection can be used for different tasks. If you assume that your dataset contains incorrect data, then you can apply deviation detection for data cleansing, thereby finding incorrect entries in your database. In the second case, the data is correct, but indicates that some of the processes that are reflected in the data show unusual behavior. This can be used for fraud detection, which is, together with data cleansing, the second major application of deviation detection. As pointed out above, unusual behavior is not necessarily fraud. It can, for instance, also indicate new emerging patterns, say "elderly customers that make excessive use of online auctions". Detecting such emerging patterns early on, allows companies to offer new products or services early on, giving them an important advantage over competitors. A similar application can be found in the financial sector. Deviation detection is used to find promising investments that do not fit the usual patterns and therefore went unrecognized so far. In all of these cases, a human analyst must check the outliers to see whether data values are incorrect or whether some action must be taken to avoid fraud or to leverage some so far unrecognized opportunities. In the following, learn how InfoSphere Warehouse detects outliers and how you can apply deviation detection to your data. The remainder of this article discusses how to visualize outliers interactively in Cognos.

Deviation detection in InfoSphere Warehouse

In recent years, many different methods for detecting deviations have been proposed. InfoSphere Warehouse uses a particularly powerful method for deviation detection that is based on data clustering. Clustering denotes a data mining technique that groups data records into clusters of pair-wise similar records by their properties. Let's take a look at Figure 1. Each point in the diagram represents a single customer. In this simple case, customers are only described by age and average balance. InfoSphere Warehouse uses a statistical clustering algorithm to group customers that are similar in those two dimensions into clusters. As you can see, some of the clusters are larger and more central than others (Cluster 1, as opposed to Cluster 3). InfoSphere Warehouse combines several properties to assign to each cluster a "deviation" degree. The higher this degree, the more likely the records in the cluster can be considered outliers.

Figure 1. Deviation detection by clustering
Deviation detection by clustering

Thus, there is not a sharp distinction between outlier and non-outlier. Rather, the user must specify a threshold for outliers. All clusters with an outlier degree above this threshold are marked as outlier clusters and all their members as outliers. This threshold can be set in two ways. First, if you only have a limited number of experts that are able to check outliers, you simply use the data records that belong to clusters with the highest deviation degree. If you are looking for promising companies in which to invest, you would probably start with clusters that have the highest deviation degree and then work your way down, as long as you have the resources. Second, the threshold can be fixed. This would be the case in alerting scenarios that require that action is taken if any new data record is assigned to a cluster with a deviation degree above the given threshold. InfoSphere Warehouse allows you to do both by simply assigning a cluster id and a corresponding deviation degree to each data record. You can either filter the records, or sort them to obtain the outliers you would like to review or that you must check. The following section provides a step-by-step example of how to find outliers with InfoSphere Warehouse and how to assign deviation degrees to individual data records.

A practical example

In the following example, deviation detection is applied to entries about bank customers. Sample data of the corresponding table is shown in Figure 2. The table BANK.BANKCUSTOMERS is shipped with the samples of InfoSphere Warehouse.

Figure 2. Sample data from the BANK.BANKCUSTOMERS table
Sample data from the BANK.BANKCUSTOMERS table

To detect outliers in this table:

  1. Create a new mining flow, as shown in the last article.
  2. Also, as shown in the last article, you must drag a table source operator to the editor.
  3. Double-click on the operator to open it, specify BANK.BANKCUSTOMERS as the source database table, and click OK to confirm.
  4. Now, drag a "Find Deviations" operator onto the canvas next to the table source, and connect the output pin of the table source with the input pin of "Find deviations". Change the model name of the cluster model that is generated by the "Find Deviations" operator to IDMMX.OUTLIERMODEL. Double-click the operator to open the properties, and change the model name on the second wizard page.
  5. Finally, create a suitable target table by selecting "Create Suitable Table..." from the right-click context menu of the output port of the "Find Deviations" operator. On the first wizard page, select the schema BANK and enter the table name CUSTOMERS_OL. Click on Finish. A "Table Target" operator will be connected to the flow. If you want to run the flow multiple times, it is sensible to select the check box "Delete Previous Content" in the properties of the "Table Target" operator.

This flow simply loads the customer table, passes it to the deviation detection algorithm, and writes the results into a new table (see Figure 3). An extract of the results can be seen in Figure 4.

Figure 3. The mining flow used to detect deviation.
The mining flow used to detect deviations

As you can see, there are two additional columns, namely DEV_DEGREE and CLUSTER_ID. The first denotes how strongly the record is considered an outlier. The cluster id is the id of the "outlier" cluster to which the record belongs. Additional information about these clusters can be extracted by analyzing the cluster model that was internally created by the find deviation operator. (This article discusses this more later.)

Figure 4. Sample data from the result table, BANK.CUSTOMER_OL
Sample data from the result table, BANK.CUSTOMER_OL

Creating an interactive Cognos report for deviation detection

In this section, learn how to create a Cognos report that allows you to interactively inspect outliers. As a point of departure, you can use a similar report as in the first article of this series. Instead of a table of patients with an indicator on who of them should be proposed for a medical check-up, you now have a list of customers with an indicator on which customers should be checked for fraud or potential. While this approach works well for a smaller number of customers, a list with several thousand entries is not very convenient. Also, an analyst may like to see what makes a particular customer an "outlier". You need some additional information to provide such an explanation. Therefore, let's extend the simple approach in two ways:

  • Group customers by profession and provide an overview on how many outliers can be found in each profession category. Using this kind of grouping is a good way to deal with large amounts of information. One could, for instance, imagine that each category is reviewed and analyzed by a specific employee responsible for this category. Instead of profession, other aspects would have be possible as well (for example, location).
  • Enrich each outlier record by information on why the given record is considered to be an outlier. As mentioned above, each record is assigned to a cluster, and all members of a cluster share the same deviation degree. You can use the properties of a cluster to describe outliers. If a cluster contains, for instance, mostly very young people with high average balance, then this is probably a good explanation of why this cluster was identified as an outlier.

The following section first shows how to enrich outliers with additional information. Then, you will create an interactive report that groups customers by their profession and allows you to interactively select outliers in a specific category by using the "drill-through" feature of Cognos.

Extracting additional information from mining models

The table CUSTOMER_OL contains relevant information about outliers. As mentioned above, each record is assigned to a cluster. The "Find Deviations" operator creates a cluster model in the background that stores detailed information about these clusters. This information is stored in the PMML (Predictive Model Markup Language) format in the database. It contains information on:

  • The distribution of values in a cluster
  • Number of records in a cluster
  • Importance of variables for each cluster
  • Homogeneity of clusters
  • More

You can use stored procedures that come with InfoSphere Warehouse to extract this information into result sets that can be consumed into Cognos. Such result sets can be regarded as "views" that are not explicitly created in the database but are created on the fly by the stored procedure.

If you want to extract textual information about clusters, you can use the following command:

SELECT ID, DESCRIPTION FROM TABLE(IDMMX.DM_GETCLUSTERS((SELECT MODEL FROM IDMMX.CLUSTERMODELS WHERE MODELNAME='IDMMX.OUTLIERMODEL'))) AS CT

This will give you a table that contains the following columns:

  • ID: The id of the cluster (corresponding to the ID in the CUSTOMER_OL table)
  • DESCRIPTION: A textual description of the cluster

Such result sets can be consumed by Cognos as regular views or tables. The only thing you have to be aware of is that these stored procedures are not included in DB2 but added by InfoSphere Warehouse. Let's come back to this in a moment. Figure 5 summarizes the two ways of how to get information from InfoSphere to Cognos: as simple database views/tables, or as extracts from mining models using stored procedures. Such stored procedures do not only exist for cluster models but also for many other mining models. For a full list of available model extraction functions, take a look at the InfoSphere Warehouse documentation (see Resources). In the following we will show how you can use the Cognos framework manager to combine both kinds of information.

Figure 5. Two ways of accessing mining-related information in Cognos
Two ways of accessing mining-related information in Cognos

Import and combine mining results in the Cognos Framework Manager

For your report, you need two query subjects in the Cognos project that you then join to obtain a textual description for each outlier:

  • A simple query subject that accesses the outlier table BANK.CUSTOMER_OL created in the first section. This query subject contains the customer records together with the deviation degree and the cluster id.
  • A query subject that uses a stored procedure to access the cluster information of the clustering model created by the mining algorithm. As described above, the cluster information contains, among others, a short text description of the cluster (which is, in this case, a description of all outlier records in this cluster).

First, you have to create a Cognos Framework Manager project that is connected to the sample database DWESAMP of InfoSphere Warehouse and has the BANK.CUSTOMERS_OL table, which was created previously. Detailed instructions on how to achieve this is provided in the first article of this series. It is good practice to create a query subject in the PresentationView namespace that contains the information you need from the database to have an abstraction layer over the query subjects, which are created from SQL statements. This also enables you to change the column names to more descriptive texts and add additional columns. You want an outlier-flag query item that indicates whether a record is considered an outlier. Compute this from the deviation degree query item DEV_DEGREE.

To create the outlier table query subject used by the report:

  1. Create a new query subject "OutlierTable" in the PresentationView namespace from the model (existing query subjects and query items).
  2. Add all query items from the CUSTOMER_OL query subject and change the names to more descriptive labels.
  3. Add a new query item with the following expression definition:

    IF ([PresentationView].[OutlierTable].[Deviation factor] > 1000) then (1) else (0)

    This query item will be "1" if the deviation is higher than 1000; else it will be "0". The deviation factor 1000 has been chosen as a sensible cut-off factor for our dataset. It may be useful to make this deviation sensitivity parameterized to return more or less records as outliers.

    Your query subject definition should look similar to Figure 6:
    Figure 6. The query subject definition of the outlier table
    The query subject definition of the outlier table

The second query item is a table with information about the clusters of the clustering model that has been created during the find deviation run. This tabular description of the clustering model can be extracted in InfoSphere Warehouse Data Mining with the user defined function IDMMX.DM_GETCLUSTERS that returns a table of the clusters in the model together with a short text description about the field distribution in this cluster. Clustering models are saved as CLOBS in the IDMMX.CLUSTERMODELS table together with a "MODELNAME" column that can be used to select the correct model. The user-defined function needs to be wrapped into a SELECT statement to be used by Cognos. Since the InfoSphere Warehouse Mining table functions are not standard DB2 functions, some Cognos options need to be changed before this query subject can be created.

To create the cluster description query subject from the DB2 table function:

  1. Select the database DWESAMP in the Data Sources folder of the Project Viewer and change the "Query Processing" property in the properties view to "Limited Locale". This enables query subjects from SQL that are not known by Cognos.
  2. Create a new query subject "OutlierClusters" in the PresentationView namespace and choose to model the query subject from a data source. Modeling it from a stored procedure will only support the stored procedures known to Cognos.
  3. On the "Select a data source" page, select DWESAMP and clear the Run database query subject wizard check box. The query subject wizard only works for standard SQL. Click on Finish.
  4. After the query subject has been created, the Query Subject Definition wizard will open. Enter the SQL code to return the clusters from the model, where IDMMX.OUTLIERMODEL is the name of the clustering model generated during the "find deviations" run.
    SELECT * FROM TABLE(IDMMX.DM_GETCLUSTERS((SELECT MODEL 
    FROM IDMMX.CLUSTERMODELS WHERE MODELNAME='IDMMX.OUTLIERMODEL'))) AS CT
  5. Since this is no Cognos conform SQL, the SQL type of the query needs to be set to "Native", which tells Cognos to pass the SQL to the database, instead of interpreting it. To change this setting, open the "Query Information" tab of the query subjects properties. Select "Options", and change the "SQL type" under the "SQL settings" tab to "Native".
  6. A "Test Sample" run should return a table with the clusters of the model, as shown in Figure 7:
    Figure 7. The test results of the outlier cluster query subject
    The test results of the outlier cluster query subject

Using a stored procedure as query subject input has the advantage that no unnecessary tables or views are created in the database. More importantly, the stored procedure will be executed during report generation. This makes it possible to dynamically execute mining computations at report generation time. This subject is covered more in the upcoming articles of this series.

To create reports that have joined tables of the OutlierTable and the OutlierClusters query subjects, you have to create a relationship between the scored cluster id of the outlier records and the cluster id of the cluster table.

To create a relationship between the OutlierTable and the OutlierClusters query subjects:

  1. Select Create Relationship from the context menu of the OutlierTable.
  2. For the left query subject, select the cluster id of the OutlierTable and set the cardinality to 1..n since there are many records belonging to the same cluster.
  3. For the right query subject, add the OutlierClusters query subject, select the ID column, and set the cardinality to 1..1 since there is one row for each cluster.
  4. Select OK.
    Figure 8. The relationship between the OutlierTable and the OutlierClusters query subjects
    The relationship between the OutlierTable and the OutlierClusters query subjects

You have now created the query subjects you need for the Cognos report and can deploy an "OutliersPackage" containing the PresentationView of the project to the Cognos Content Store. Creation and deployment of the package are achieved just as described in the previous article of this series.

Figure 9. The created resources in the Framework Manager
The created resources in the Framework Manager

Linking Cognos reports by drill-through

The original concept of "drill-through" refers to the task of navigating from aggregate values to individual records. This is a common task in OLAP applications. In Cognos, this concept is used in a broader sense as any kind of "linking" reports together. Thus drill-through definitions serve a similar aim as hyper links in HTML. Linking reports would not be a very powerful feature by itself. What makes drill-through definitions a powerful tool is the use of parameters. Each report in Cognos may contain parameters that can, for instance, be used to create parameterizable queries. If these parameters are not set, the user is prompted for them. In a drill-through definition, these parameters can be defined as part of the hyperlink (just as in HTTP GET requests). The values for these parameters can be drawn from the context of the link. The following section exemplifies this concept using two reports that are linked.

Create an outlier report with Cognos report studio

In this section, create a project with two report pages based on the deployed OutliersPackage:

  • The main page that shows an overview of the customer records grouped by profession with the respective number of outliers for each profession
  • A report page that shows the actual deviating records for a given profession

Both reports will be linked so that its possible to drill through to the deviating records from the main overview page.

Since the overview page contains links to the details page, let's create it first. The outlier details page contains the customer records that are flagged as outliers, which belong to the profession selected in the main report page. To create this interaction, you need to add a parameter for the profession to the report and use it in a filter on the records. You also need to add a condition to the filter that only returns records that have a Outlier Flag query item of "1" (all records with deviation degree over the threshold).

To create the "Outlier Details" report page, do the following:

  1. Create a new Report in Cognos Report Studio using the OutliersPackage.
  2. Add a list object to the report.
  3. Add the OutlierTable query subject to the list by dragging it from the "Insertable Objects" view.
  4. Add the DESCRIPTION query item of the OutlierClusters query subject to the list.
  5. Add a filter to the list by selecting the list and pressing the "Filters" icon on the toolbar or Data > Filters from the menu.
  6. In the wizard, add a "Details filter" with the "Add" icon.
  7. In the "Detail Filter Expression" page, add the following code to the "Expression Definition".
    ([PresentationView].[OutlierTable].[Outlier flag] = 1) 
    AND ([PresentationView].[OutlierTable].[Profession] = ?Profession?)

    Cognos will automatically detect the "Profession" keyword that is surrounded by "?" as a parameter and add it to the parameter list of the report.
  8. Confirm the filter by pressing OK.
    Figure 10. The filter for the Outlier Details page
    The filter for the Outlier Details page
  9. Change the report header text and column headers to your liking, and save your report as OutlierDetails.
    Figure 11. The Cognos Report Studio with the Outlier Details report
    The Cognos Report Studio with the Outlier Details report

The overview page contains a list of all professions with the number of customer records and the number of outliers per profession. The number of records will be computed by using Cognos aggregation functions.

To create the OutlierOverview page:

  1. Create a new report in Cognos Report Studio using the OutliersPackage.
  2. Add a list object to the report.
  3. From the OutlierTable query subject in the "Insertable Objects" view, drag the query items: Profession, Customer ID, and Outlier flag to the list.
  4. The Customer ID column is used to show the number of customer records per profession. To compute this, you need to change the aggregate function of the column to "Count" in the properties view of the column.
  5. The OutlierFlag column is used to show the number of deviating records per profession. Since the column contains a "1" for every deviating record and "0" for non-deviating records, you need to add the values of the column. Change the aggregate function of the column to Total.
  6. Change the "Data Item" > "Name" and "Data Item" > "Label" properties of the "Customer ID" and "Outlier flag" columns to Number of customers and Number of outliers.
  7. To enable a drill through to the OutlierDetails Report, select the Number of Outliers column (not the header, but the column below it), and choose Drill-Through Definitions from the right-click context menu.
  8. Add a new drill-through definition".
  9. In the Target Report tab of the properties, select the OutlierDetails report as target report.
  10. Select Run the report as action.
  11. Select the Open in new window check box.
  12. Add a new linked parameter with the Edit button below the parameters list.
  13. In the Parameters dialog, select Pass data item value as the method to link the parameter to a value of the row in the list.
  14. Choose the Profession query item as data item source, and select OK.
    Figure 12. The drill-through definition parameters.
    The drill-through definition parameters
  15. Press OK to save the drill-through definition.
    Figure 13. The drill-through definitions page
    The drill-through definitions page
  16. Change the header text, and save the report as OutlierOverview.
  17. You can now run your report from the Run > Run HTML menu or from Cognos Connect.
  18. Click on a link from the number of outliers column to see the detailed records.
    Figure 14. The outlier overview page
    The outlier overview page

This simple project consists of just two reports. By using a "drill-through" definition that contains a parameter defined by the context of the link, you do not need to create a link for each profession individually. The upcoming articles of this series show how to do more powerful things with drill-through definitions, such as invoking data mining dynamically.


Conclusion

In this article, you learned about deviation detection and how to perform deviation detection with InfoSphere Warehouse. Deviation detection is a highly interactive task, and outliers must usually be checked manually to see whether they indicate fraud, errors in the data, or some interesting opportunity. Cognos is very well-suited to support the task of interactive outlier analysis. In addition to the simple techniques used in the first article of this series, you learned two additional techniques. First, additional information can be extracted from the mining models that is not yet contained in the scored table by using stored procedures. Second, Cognos reports can be linked by using the "drill-through" feature, enabling interactive reports. In the upcoming articles of this series, learn how to go even further by invoking data mining dynamically from Cognos.

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Big data and analytics on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Big data and analytics, Information Management
ArticleID=354837
ArticleTitle=Integrate IBM InfoSphere Warehouse data mining with IBM Cognos reporting, Part 2: Deviation detection with InfoSphere Warehouse and Cognos
publish-date=11262008