Finding out about suspicious behavior in time is an important task in many current IT applications. Imagine, for instance, credit card transactions. If a user shows highly unusual buying behavior (for example, shopping at a discount store regularly and then buying expensive jewelry), then it could be a good idea to check the corresponding transactions for traces of fraud. Still, deviations can be used in other scenarios than just to detect fraud or malicious manipulation. Human resource departments use deviation detection to find employees or applicants that do not fit into the usual categories and might be overlooked when applying fixed rules to identify high potentials.
Data records that deviate from the global distribution of data records are called outliers. Outlier handling is usually not a fully automated task. Rather, data mining is used to point out data records that deserve a closer look by a human analyst or expert who must then decide whether to take action or not. A sophisticated user interface and interaction model is therefore a prerequisite of successful outlier handling. Cognos is well suited for this task. Actually, a report similar to the one created in the first article of this series can be used to visualize outliers. To leverage the full potential of Cognos for displaying outliers, however, you need to employ some more advanced features. First, see show how to use "drill-through" to create interactive Cognos reports and how to link reports. This will help to summarize information and still allow for a fast access to relevant outlier data records. Second, learn how additional information can be extracted from the data mining models that helps the human expert to understand the nature of an outlier.
The running example in this article is an application that helps employees of a bank identify customers that show unusual behavior. This can be used to avoid fraud or to detect customers that deserve special attention. In the following section, get an overview of deviation detection and see how InfoSphere Warehouse can be used to find outliers in large data sets. The subsequent sections cover the basics of drill-through and extracting information from mining models, and show how both of these techniques can be used to make deviation detection results easier to understand and to utilize.
Deviation detection with InfoSphere Warehouse
What is deviation detection?
Deviation detection is defined as the task of finding unusual data records in large datasets. These records are called outliers. How to define "unusual" exactly is subject to academic debate and may also depend on the domain where deviation detection is applied. On a general level, deviation detection aims to find data records with properties that do not follow the general statistical distribution of the majority of the data records. Depending on the application area, deviations may be:
- Incorrect data (for example, if a person has an age of 300, this is probably an incorrect entry in the database)
- Unusual behavior of underlying processes (for example, credit card transactions that do not follow the usual patterns)
Correspondingly, deviation detection can be used for different tasks. If you assume that your dataset contains incorrect data, then you can apply deviation detection for data cleansing, thereby finding incorrect entries in your database. In the second case, the data is correct, but indicates that some of the processes that are reflected in the data show unusual behavior. This can be used for fraud detection, which is, together with data cleansing, the second major application of deviation detection. As pointed out above, unusual behavior is not necessarily fraud. It can, for instance, also indicate new emerging patterns, say "elderly customers that make excessive use of online auctions". Detecting such emerging patterns early on, allows companies to offer new products or services early on, giving them an important advantage over competitors. A similar application can be found in the financial sector. Deviation detection is used to find promising investments that do not fit the usual patterns and therefore went unrecognized so far. In all of these cases, a human analyst must check the outliers to see whether data values are incorrect or whether some action must be taken to avoid fraud or to leverage some so far unrecognized opportunities. In the following, learn how InfoSphere Warehouse detects outliers and how you can apply deviation detection to your data. The remainder of this article discusses how to visualize outliers interactively in Cognos.
Deviation detection in InfoSphere Warehouse
In recent years, many different methods for detecting deviations have been proposed. InfoSphere Warehouse uses a particularly powerful method for deviation detection that is based on data clustering. Clustering denotes a data mining technique that groups data records into clusters of pair-wise similar records by their properties. Let's take a look at Figure 1. Each point in the diagram represents a single customer. In this simple case, customers are only described by age and average balance. InfoSphere Warehouse uses a statistical clustering algorithm to group customers that are similar in those two dimensions into clusters. As you can see, some of the clusters are larger and more central than others (Cluster 1, as opposed to Cluster 3). InfoSphere Warehouse combines several properties to assign to each cluster a "deviation" degree. The higher this degree, the more likely the records in the cluster can be considered outliers.
Figure 1. Deviation detection by clustering
Thus, there is not a sharp distinction between outlier and non-outlier. Rather, the user must specify a threshold for outliers. All clusters with an outlier degree above this threshold are marked as outlier clusters and all their members as outliers. This threshold can be set in two ways. First, if you only have a limited number of experts that are able to check outliers, you simply use the data records that belong to clusters with the highest deviation degree. If you are looking for promising companies in which to invest, you would probably start with clusters that have the highest deviation degree and then work your way down, as long as you have the resources. Second, the threshold can be fixed. This would be the case in alerting scenarios that require that action is taken if any new data record is assigned to a cluster with a deviation degree above the given threshold. InfoSphere Warehouse allows you to do both by simply assigning a cluster id and a corresponding deviation degree to each data record. You can either filter the records, or sort them to obtain the outliers you would like to review or that you must check. The following section provides a step-by-step example of how to find outliers with InfoSphere Warehouse and how to assign deviation degrees to individual data records.
A practical example
In the following example, deviation detection is applied to entries about bank customers. Sample data of the corresponding table is shown in Figure 2. The table BANK.BANKCUSTOMERS is shipped with the samples of InfoSphere Warehouse.
Figure 2. Sample data from the BANK.BANKCUSTOMERS table
To detect outliers in this table:
- Create a new mining flow, as shown in the last article.
- Also, as shown in the last article, you must drag a table source operator to the editor.
- Double-click on the operator to open it, specify
BANK.BANKCUSTOMERSas the source database table, and click OK to confirm.
- Now, drag a "Find Deviations" operator onto the canvas next to the
table source, and connect the output pin of the table source with the
input pin of "Find deviations". Change the model name of the cluster
model that is generated by the "Find Deviations" operator to
IDMMX.OUTLIERMODEL. Double-click the operator to open the properties, and change the model name on the second wizard page.
- Finally, create a suitable target table by selecting "Create Suitable
Table..." from the right-click context menu of the output port of the
"Find Deviations" operator. On the first wizard page, select the
schema BANK and enter the table name
CUSTOMERS_OL. Click on Finish. A "Table Target" operator will be connected to the flow. If you want to run the flow multiple times, it is sensible to select the check box "Delete Previous Content" in the properties of the "Table Target" operator.
Figure 3. The mining flow used to detect deviation.
As you can see, there are two additional columns, namely DEV_DEGREE and CLUSTER_ID. The first denotes how strongly the record is considered an outlier. The cluster id is the id of the "outlier" cluster to which the record belongs. Additional information about these clusters can be extracted by analyzing the cluster model that was internally created by the find deviation operator. (This article discusses this more later.)
Figure 4. Sample data from the result table, BANK.CUSTOMER_OL
Creating an interactive Cognos report for deviation detection
In this section, learn how to create a Cognos report that allows you to interactively inspect outliers. As a point of departure, you can use a similar report as in the first article of this series. Instead of a table of patients with an indicator on who of them should be proposed for a medical check-up, you now have a list of customers with an indicator on which customers should be checked for fraud or potential. While this approach works well for a smaller number of customers, a list with several thousand entries is not very convenient. Also, an analyst may like to see what makes a particular customer an "outlier". You need some additional information to provide such an explanation. Therefore, let's extend the simple approach in two ways:
- Group customers by profession and provide an overview on how many outliers can be found in each profession category. Using this kind of grouping is a good way to deal with large amounts of information. One could, for instance, imagine that each category is reviewed and analyzed by a specific employee responsible for this category. Instead of profession, other aspects would have be possible as well (for example, location).
- Enrich each outlier record by information on why the given record is considered to be an outlier. As mentioned above, each record is assigned to a cluster, and all members of a cluster share the same deviation degree. You can use the properties of a cluster to describe outliers. If a cluster contains, for instance, mostly very young people with high average balance, then this is probably a good explanation of why this cluster was identified as an outlier.
The following section first shows how to enrich outliers with additional information. Then, you will create an interactive report that groups customers by their profession and allows you to interactively select outliers in a specific category by using the "drill-through" feature of Cognos.
Extracting additional information from mining models
The table CUSTOMER_OL contains relevant information about outliers. As mentioned above, each record is assigned to a cluster. The "Find Deviations" operator creates a cluster model in the background that stores detailed information about these clusters. This information is stored in the PMML (Predictive Model Markup Language) format in the database. It contains information on:
- The distribution of values in a cluster
- Number of records in a cluster
- Importance of variables for each cluster
- Homogeneity of clusters
You can use stored procedures that come with InfoSphere Warehouse to extract this information into result sets that can be consumed into Cognos. Such result sets can be regarded as "views" that are not explicitly created in the database but are created on the fly by the stored procedure.
If you want to extract textual information about clusters, you can use the following command:
SELECT ID, DESCRIPTION FROM TABLE(IDMMX.DM_GETCLUSTERS((SELECT MODEL FROM IDMMX.CLUSTERMODELS WHERE MODELNAME='IDMMX.OUTLIERMODEL'))) AS CT
This will give you a table that contains the following columns:
- ID: The id of the cluster (corresponding to the ID in the CUSTOMER_OL table)
- DESCRIPTION: A textual description of the cluster
Such result sets can be consumed by Cognos as regular views or tables. The only thing you have to be aware of is that these stored procedures are not included in DB2 but added by InfoSphere Warehouse. Let's come back to this in a moment. Figure 5 summarizes the two ways of how to get information from InfoSphere to Cognos: as simple database views/tables, or as extracts from mining models using stored procedures. Such stored procedures do not only exist for cluster models but also for many other mining models. For a full list of available model extraction functions, take a look at the InfoSphere Warehouse documentation (see Resources). In the following we will show how you can use the Cognos framework manager to combine both kinds of information.
Figure 5. Two ways of accessing mining-related information in Cognos
Import and combine mining results in the Cognos Framework Manager
For your report, you need two query subjects in the Cognos project that you then join to obtain a textual description for each outlier:
- A simple query subject that accesses the outlier table BANK.CUSTOMER_OL created in the first section. This query subject contains the customer records together with the deviation degree and the cluster id.
- A query subject that uses a stored procedure to access the cluster information of the clustering model created by the mining algorithm. As described above, the cluster information contains, among others, a short text description of the cluster (which is, in this case, a description of all outlier records in this cluster).
First, you have to create a Cognos Framework Manager project that is connected to the sample database DWESAMP of InfoSphere Warehouse and has the BANK.CUSTOMERS_OL table, which was created previously. Detailed instructions on how to achieve this is provided in the first article of this series. It is good practice to create a query subject in the PresentationView namespace that contains the information you need from the database to have an abstraction layer over the query subjects, which are created from SQL statements. This also enables you to change the column names to more descriptive texts and add additional columns. You want an outlier-flag query item that indicates whether a record is considered an outlier. Compute this from the deviation degree query item DEV_DEGREE.
To create the outlier table query subject used by the report:
- Create a new query subject "OutlierTable" in the PresentationView namespace from the model (existing query subjects and query items).
- Add all query items from the CUSTOMER_OL query subject and change the names to more descriptive labels.
- Add a new query item with the following expression definition:
IF ([PresentationView].[OutlierTable].[Deviation factor] > 1000) then (1) else (0)
This query item will be "1" if the deviation is higher than 1000; else it will be "0". The deviation factor 1000 has been chosen as a sensible cut-off factor for our dataset. It may be useful to make this deviation sensitivity parameterized to return more or less records as outliers.
Your query subject definition should look similar to Figure 6:
Figure 6. The query subject definition of the outlier table
The second query item is a table with information about the clusters of
the clustering model that has been created during the find deviation run.
This tabular description of the clustering model can be extracted in
InfoSphere Warehouse Data Mining with the user defined function
IDMMX.DM_GETCLUSTERS that returns a table of
the clusters in the model together with a short text description about the
field distribution in this cluster. Clustering models are saved as CLOBS
in the IDMMX.CLUSTERMODELS table together with a "MODELNAME" column that
can be used to select the correct model. The user-defined function needs
to be wrapped into a
SELECT statement to be
used by Cognos. Since the InfoSphere Warehouse Mining table functions are
not standard DB2 functions, some Cognos options need to be changed before
this query subject can be created.
To create the cluster description query subject from the DB2 table function:
- Select the database DWESAMP in the Data Sources folder of the Project Viewer and change the "Query Processing" property in the properties view to "Limited Locale". This enables query subjects from SQL that are not known by Cognos.
- Create a new query subject "OutlierClusters" in the PresentationView namespace and choose to model the query subject from a data source. Modeling it from a stored procedure will only support the stored procedures known to Cognos.
- On the "Select a data source" page, select DWESAMP and clear the Run database query subject wizard check box. The query subject wizard only works for standard SQL. Click on Finish.
- After the query subject has been created, the Query Subject Definition
wizard will open. Enter the SQL code to return the clusters from the
model, where IDMMX.OUTLIERMODEL is the name of the clustering model
generated during the "find deviations" run.
SELECT * FROM TABLE(IDMMX.DM_GETCLUSTERS((SELECT MODEL FROM IDMMX.CLUSTERMODELS WHERE MODELNAME='IDMMX.OUTLIERMODEL'))) AS CT
- Since this is no Cognos conform SQL, the SQL type of the query needs to be set to "Native", which tells Cognos to pass the SQL to the database, instead of interpreting it. To change this setting, open the "Query Information" tab of the query subjects properties. Select "Options", and change the "SQL type" under the "SQL settings" tab to "Native".
- A "Test Sample" run should return a table with the clusters of the
model, as shown in Figure 7:
Figure 7. The test results of the outlier cluster query subject
Using a stored procedure as query subject input has the advantage that no unnecessary tables or views are created in the database. More importantly, the stored procedure will be executed during report generation. This makes it possible to dynamically execute mining computations at report generation time. This subject is covered more in the upcoming articles of this series.
To create reports that have joined tables of the OutlierTable and the OutlierClusters query subjects, you have to create a relationship between the scored cluster id of the outlier records and the cluster id of the cluster table.
To create a relationship between the OutlierTable and the OutlierClusters query subjects:
- Select Create Relationship from the context menu of the OutlierTable.
- For the left query subject, select the cluster id of the OutlierTable and set the cardinality to 1..n since there are many records belonging to the same cluster.
- For the right query subject, add the OutlierClusters query subject, select the ID column, and set the cardinality to 1..1 since there is one row for each cluster.
- Select OK.
Figure 8. The relationship between the OutlierTable and the OutlierClusters query subjects
You have now created the query subjects you need for the Cognos report and can deploy an "OutliersPackage" containing the PresentationView of the project to the Cognos Content Store. Creation and deployment of the package are achieved just as described in the previous article of this series.
Figure 9. The created resources in the Framework Manager
Linking Cognos reports by drill-through
The original concept of "drill-through" refers to the task of navigating
from aggregate values to individual records. This is a common task in OLAP
applications. In Cognos, this concept is used in a broader sense as any
kind of "linking" reports together. Thus drill-through definitions serve a
similar aim as hyper links in HTML. Linking reports would not be a very
powerful feature by itself. What makes drill-through definitions a
powerful tool is the use of parameters. Each report in Cognos may contain
parameters that can, for instance, be used to create parameterizable
queries. If these parameters are not set, the user is prompted for them.
In a drill-through definition, these parameters can be defined as part of
the hyperlink (just as in
HTTP GET requests).
The values for these parameters can be drawn from the context of the link.
The following section exemplifies this concept using two reports that are
Create an outlier report with Cognos report studio
In this section, create a project with two report pages based on the deployed OutliersPackage:
- The main page that shows an overview of the customer records grouped by profession with the respective number of outliers for each profession
- A report page that shows the actual deviating records for a given profession
Both reports will be linked so that its possible to drill through to the deviating records from the main overview page.
Since the overview page contains links to the details page, let's create it first. The outlier details page contains the customer records that are flagged as outliers, which belong to the profession selected in the main report page. To create this interaction, you need to add a parameter for the profession to the report and use it in a filter on the records. You also need to add a condition to the filter that only returns records that have a Outlier Flag query item of "1" (all records with deviation degree over the threshold).
To create the "Outlier Details" report page, do the following:
- Create a new Report in Cognos Report Studio using the OutliersPackage.
- Add a list object to the report.
- Add the OutlierTable query subject to the list by dragging it from the "Insertable Objects" view.
- Add the DESCRIPTION query item of the OutlierClusters query subject to the list.
- Add a filter to the list by selecting the list and pressing the "Filters" icon on the toolbar or Data > Filters from the menu.
- In the wizard, add a "Details filter" with the "Add" icon.
- In the "Detail Filter Expression" page, add the following code to the
([PresentationView].[OutlierTable].[Outlier flag] = 1) AND ([PresentationView].[OutlierTable].[Profession] = ?Profession?)
Cognos will automatically detect the "Profession" keyword that is surrounded by "?" as a parameter and add it to the parameter list of the report.
- Confirm the filter by pressing OK.
Figure 10. The filter for the Outlier Details page
- Change the report header text and column headers to your liking, and
save your report as
Figure 11. The Cognos Report Studio with the Outlier Details report
The overview page contains a list of all professions with the number of customer records and the number of outliers per profession. The number of records will be computed by using Cognos aggregation functions.
To create the OutlierOverview page:
- Create a new report in Cognos Report Studio using the OutliersPackage.
- Add a list object to the report.
- From the OutlierTable query subject in the "Insertable Objects" view, drag the query items: Profession, Customer ID, and Outlier flag to the list.
- The Customer ID column is used to show the number of customer records per profession. To compute this, you need to change the aggregate function of the column to "Count" in the properties view of the column.
- The OutlierFlag column is used to show the number of deviating records per profession. Since the column contains a "1" for every deviating record and "0" for non-deviating records, you need to add the values of the column. Change the aggregate function of the column to Total.
- Change the "Data Item" > "Name" and "Data Item" > "Label" properties of the "Customer ID" and "Outlier flag" columns to Number of customers and Number of outliers.
- To enable a drill through to the OutlierDetails Report, select the Number of Outliers column (not the header, but the column below it), and choose Drill-Through Definitions from the right-click context menu.
- Add a new drill-through definition".
- In the Target Report tab of the properties, select the OutlierDetails report as target report.
- Select Run the report as action.
- Select the Open in new window check box.
- Add a new linked parameter with the Edit button below the parameters list.
- In the Parameters dialog, select Pass data item value as the method to link the parameter to a value of the row in the list.
- Choose the Profession query item as data item source, and
Figure 12. The drill-through definition parameters.
- Press OK to save the drill-through definition.
Figure 13. The drill-through definitions page
- Change the header text, and save the report as
- You can now run your report from the Run > Run HTML menu or from Cognos Connect.
- Click on a link from the number of outliers column to see the detailed
Figure 14. The outlier overview page
This simple project consists of just two reports. By using a "drill-through" definition that contains a parameter defined by the context of the link, you do not need to create a link for each profession individually. The upcoming articles of this series show how to do more powerful things with drill-through definitions, such as invoking data mining dynamically.
In this article, you learned about deviation detection and how to perform deviation detection with InfoSphere Warehouse. Deviation detection is a highly interactive task, and outliers must usually be checked manually to see whether they indicate fraud, errors in the data, or some interesting opportunity. Cognos is very well-suited to support the task of interactive outlier analysis. In addition to the simple techniques used in the first article of this series, you learned two additional techniques. First, additional information can be extracted from the mining models that is not yet contained in the scored table by using stored procedures. Second, Cognos reports can be linked by using the "drill-through" feature, enabling interactive reports. In the upcoming articles of this series, learn how to go even further by invoking data mining dynamically from Cognos.
- Integrate InfoSphere Warehouse data mining with IBM Cognos reporting, Part 1: Overview of InfoSphere Warehouse and Cognos integration architecture: Get an introduction to the basic integration architecture involved in integrating InfoSphere Warehouse data mining with IBM Cognos reporting. Examine a technical case study to gain a basic understanding of how to achieve the integration.
- Integrate InfoSphere Warehouse data mining with IBM Cognos reporting, Part 3: Invoke mining dynamically from Cognos using a market basket analysis example: Learn about the task of association rule mining and how to achieve it in InfoSphere Warehouse. Then, see how to invoke data mining dynamically from Cognos. Finally, read an example of how this mechanism could be used in real time.
- Integrate InfoSphere Warehouse data mining with IBM Cognos reporting, Part 4: Customer segmentation with InfoSphere Warehouse and Cognos: Learn the value of customer segmentation as a marketing tool. A detailed example uses Cognos to build an interactive report with drill-through capabilities and data distribution visualization.
- "Dynamic Warehousing: Data Mining Made Easy": Gain a better understanding of data mining and the data mining algorithms in InfoSphere Warehouse.
- IBM InfoSphere Warehouse: Find out more about InfoSphere Warehouse.
- IBM InfoSphere Warehouse Information Center: Find out about technical details and parameters of InfoSphere Warehouse.
- Cognos: Find out more about IBM Cognos.
- Data Mining Group: Find out more about the Predictive Model Markup Language (PMML).
- developerWorks Information Management zone: Learn more about Information Management. Find technical documentation, how-to articles, education, downloads, product information, and more.
- Stay current with developerWorks technical events and webcasts.
- Technology bookstore: Browse for books on these and other technical topics.
Get products and technologies
- Download a free trial version of IBM InfoSphere Warehouse 9.7 Enterprise Edition.
- Build your next development project with IBM trial software, available for download directly from developerWorks.