Data mining enables experts, analysts and users to gain insight into patterns in large data collections and to incorporate them in every day business processes. Historically, data mining has been a task that could only be accomplished by experts in statistics and data analysis. The results of data mining, on the other hand, are often relevant to different users across the company.
Consider the following scenario. You gather data about your customers that covers demographic aspects (such as age, profession, and place of residence) as well as past transactions (such as items sold and contracts). Your marketing department wants to develop new offerings that are tailored to specific groups of customers with similar properties. How to find such typical groups of users? Data clustering offers a solution to this problem. It automatically groups datasets according to their properties or features. An analyst can then review these groups and interactively refine them until they meet his or her needs. A marketing expert would, for instance, find that there is a small but economically important group of customers that is of above average age and does not use Internet banking. Based on this information, special offers could be made to these customers. A crucial step in the analysis process is to enable users to understand the results of the data clustering step. Expert analysts are usually not experts on low-level databases programming.
How can the results of data mining be delivered to analysts and employees who need it? How can the results be presented in a way that reflects the business process in which the user is involved? How can security requirements be met, such that each user sees only what she is supposed to see? To provide satisfying answers to those questions, a shift in perspective is necessary, away from statistical analysis, to the actual end user and the business processes he or she is involved in. InfoSphere Warehouse is the strong basement for company-wide data warehousing. InfoSphere Warehouse provides data mining functionality directly in the underlying DB2 database where the data resides. The complete functionality can be accessed by an easy to use, Eclipse-based tooling platform and deployment tasks can be triggered directly within the tooling.
IBM Cognos is one of the market leaders in reporting solutions. Reporting plays an important role in many companies, as it helps to consolidate and visualize relevant information in different ways for different target audiences. The presented results are usually derived by applying basic arithmetic operations to the information stored in a data warehouse (for example, average sales per month). Advanced analytics, like in the example above, are beyond the scope and capabilities of a reporting framework. This concludes to combine both by using the scalable, advanced analytics capabilities of InfoSphere Warehouse together with the established, sophisticated reporting capabilities of IBM Cognos. The following sections show how IBM InfoSphere Warehouse Data Mining and IBM Cognos can be combined in a flexible way to achieve this goal. This integration does not require any fancy programming or setup; it can be all accomplished using just plain DB2 SQL.
The following sections briefly present the basic architecture of InfoSphere Warehouse and IBM Cognos. These are followed by high-level ideas of how both can be integrated. Finally, this integration is demonstrated on a simple, step-by-step example from the health care sector. The upcoming articles of this series describe more advanced technical and conceptual aspects of this integration.
InfoSphere Warehouse is based on DB2 for data storage. It is available with database a partitioning feature (DPF) for scalable, secure, and high-performance storage of data warehousing data, combining the advantages of online transaction processing (OLTP) databases with the storage needs of big data warehouses. InfoSphere Warehouse is available with a variety of tools for warehouse management and analysis. The analytical components are
- Cubing services
- Data mining
- Text analytics
InfoSphere Warehouse Design Studio is the Eclipse-based tooling platform used to design workload rules, data transformation flows, and analytical flows for data mining and text analytics. These flows can then be deployed to the InfoSphere Warehouse administration console to be scheduled and run according to needs. InfoSphere Warehouse data mining is built with DB2 stored procedures and user-defined functions for high-performance in-database execution, taking advantage of DB2 as an execution environment. It can be accessed either through its SQL interface or through the mining flows of InfoSphere Warehouse.
Data mining is the task of finding useful information in large datasets. A typical scenario that will be used as running example within this article is health care. Today it is possible to gather large amounts of health care data, describing the state of many different patients over the years. An important use of such data is to find indicators for latent diseases early on. To this end, you can collect data of patients that, for instance, have a heart disease and analyze what factors, such as blood pressure and weight, are correlated with it. This data must be collected and transformed to enable it for data mining. More specifically, you need to obtain a table that contains exactly one row per patient and several columns that represent information about the patient. Also, there should be a special column that indicates whether this patient is actually diseased. A possible scheme would look like Figure 1:
Figure 1. The schema of the heart disease database table (part of InfoSphere Warehouse examples)
Based on this kind of data, stored procedures within InfoSphere Warehouse can build a model that predicts, for any new patient, whether he or she is likely to have a latent heart disease. This process is called "modeling." The resulting XML data mining model is stored in the database and can be accessed through SQL/XQuery. It can be used for two purposes. First, you can extract information from this model to gain insight into which are the important indicators of heart diseases and other statistical information. Second, the model can be automatically applied to new data records. Thus, for any records that do not contain any value for the heart disease column yet, you can add this value automatically, together with a value that represents the confidence that this prediction is correct. This process is called "scoring." Figure 2 shows this schematically:
Figure 2. Based on historical data, a prediction model is built that can then be applied to new data
InfoSphere Warehouse contains highly efficient implementations of almost all current data mining algorithms. Data mining is invoked by first writing the data into a table. Then a model is created, that can later be applied to records for which the target value is not yet known, yielding a prediction and a confidence value for this prediction. All data mining functions are invoked just as usual SQL commands, as you will see below. This enables an easy integration in almost all settings, such as Web services.
IBM Cognos 8 Business Intelligence provides a full set of business intelligence (BI) capabilities and is based on a flexible service-oriented architecture (SOA). The main capabilities are reporting, analysis, dashboards, and scorecarding.
Reporting is used to aggregate relevant data about business processes and distribute it to the people who benefit the most from the particular information. In the Data mining environment, this means to transfer results to the people who benefit the most from the generated business insights.
The components of Cognos 8 that are needed to create and publish business reports are:
- Cognos Connection: The Web portal and entry point to the functionality of Cognos 8. From here the user can manage, organize, and view existing content like reports or dashboards. The Web-based authoring studios, like Report Studio, can be started from Cognos Connection to create new resources or to edit existing ones. It is also used to administer the Cognos server, for example, to change access restrictions, import content, or change the list of people a report is sent to.
- Cognos Framework Manager: The modeling tool for the data sources that are used in Cognos 8. Data from most databases and other sources, like Web services, can be accessed, for example, in Cognos reports by describing a metadata layer in Framework Manager. Database tables, views, and stored procedures can be added to a Cognos package as query subjects. The Framework Manager is used to import and consolidate the information that is available in different data sources in the company so the information can be used in a coherent way in Cognos 8 BI tools like Report Studio. It is important to note that the data itself is kept in the data source and accessed as the report is created.
- Cognos Report Studio: One of the Web-based authoring studios of Cognos 8 BI. It is used to create and edit sophisticated reports about the data that was described in Framework Manager. It provides powerful functionality like drill-down, prompts, and a comprehensive charting library.
To make this data available in a report, a couple of steps are necessary:
- In Framework Manager, the data modeler creates the Cognos metadata that describes the data in the database from a business perspective (including relations among tables, business names of values, and so on).
- After the metadata has been modeled, it is deployed as a package into the Cognos 8 content store. From there, it can be accessed through Cognos Connection and the authoring studios.
- The report designer enters Cognos Connection and creates a new report on the deployed package. After the report is created, the target group and distribution form (such as email or Web portal) can be administered.
The ability to create reports from relational datasources is the key to the integration of InfoSphere Warehouse mining and IBM Cognos.
The content of Cognos reports consists of a results set delivered by a relational data source, as described above. The content of a particular report is determined by a (dynamic) SQL query to one or more data sources. You can exploit this basic communication pattern to integrate mining and Cognos reporting in the following ways:
- Cognos can be used to display scored tables, possibly together with confidence information.
- Cognos can be used to display model information. This information is extracted from the actual XML models by table extractor functions or by XQuery queries.
- Cognos can invoke data mining and scoring dynamically by calling SQL
stored procedures. This allows for the following:
- Call mining with different settings provided by the user in the reporting interface
- Call mining on different subsets of the data (for example, to create recursive drill-down reports)
- Score records dynamically based on user inputs
Figure 3 summarizes the invocation scheme used in all of these cases.
Figure 3. Basic integration architecture of IBM InfoSphere data mining and IBM Cognos reporting
Mining is invoked by a stored procedure call and creates an XML mining model in the database. This model can be scored on new data or model information can be extracted into a table. These tables can be consumed by Cognos. The user can interactively invoke mining by calling a corresponding stored procedure on the database from a Cognos report.
This integration offers many benefits:
- It is very simple and requires only SQL knowledge and no additional programming
- Mining models are stored in the database and can be accessed in a secure, efficient way from Cognos
- Using stored procedures, the whole mining process can be triggered and controlled from Cognos
The next section is a step-by-step example for this kind of integration, which is also the most simple one: scoring records in the database and displaying the results in a Cognos report. The extraction of model information and the dynamic invocation of mining from Cognos reports will be covered in the follow-up articles of this series.
This example analyzes patient data from a hospital. The hospital's heart department has master records of their patients together with several measures like heart rate, blood pressure, cholesterol, and so on. Patients are checked for four different heart diseases. The patient records contain a column that indicates whether they have one of the four heart diseases (y = yes) or not (n = no). The corresponding heart disease database table is depicted in figure 1. This table can be found in the examples shipped with InfoSphere Warehouse.
The analytical goal is to predict for new patients the risk that they suffer any of the four heart diseases. If the risk is high, immediate check-ups should be done.
The idea is to enable risk management even if no dedicated check for one of those heart diseases was made, but the measures are available anyway from earlier check-ups in other areas.
First, create a prediction model based on the HEART table that allows you to predict heart disease risks for patients.
Create a Data Warehouse Project:
Right click the Project Explorer, and select New > Data Warehouse Project (as shown in Figure 4 below).
Figure 4. Create a Data Warehouse project
In the following wizard, type the project name, for example: AdvancedAnalytics. Then click Finish.
Create an empty mining flow:
- Expand the newly-created project.
- Right click on the folder Mining Flows, and select New > Mining Flow.
- In the upcoming wizard, type in the mining flow name Heart Disease Risk.
- In this example you will work against the database. Therefore, leave the default, and click Next.
- Select the DWESAMP database, and click Finish.
Create the mining flow:
The Mining Flow editor opens. On the right side of the mining editor you can see a palette with operators (see Figure 5). With these operators, you can build up a mining flow by dragging and dropping them on the editor canvas.
Figure 5. Mining flow inside Design Studio
To create the mining model that predicts disease risks for patients, follow these steps:
- In the palette, under the Sources and Targets section, select a Table Source operator and drag it onto the editor.
- In the table selection dialog, expand the HEALTHCARE schema and select the HEART table, then click Finish.
- In the palette, under the Preprocessing Operators section, select a Random Split operator and drag it onto the editor.
- Connect the Output port of the first operator to the Input port of the second one by a simple dragging action.
- Select the Random Split operator.
- On the Properties tab below the mining editor, set the Percentage of test data property to 20. This means we will use 20 percent of the data to validate our model later. Thus, you have to split the data before building the prediction model.
- In the palette, under the Mining Operators section, select a Predictor operator and drag it onto the editor.
- Connect the split operator’s Training Output port with the Predictor Input Port.
- Select the Predictor operator
- On the Properties tab below the mining editor, select the Mining Settings tab on the left side.
- In the Target column selection list, select DISEASED as the column you want to predict.
- Design Studio recognizes automatically that you want to predict a nominal value column and automatically offers you those settings (in the same tab) available for this purpose. In the Algorithm selection list, select Naïve Bayes.
- Above the Mining Settings properties tab, select the Model Name tab. Leave the prefix as is but change the model name to HeartDiseasePrediction.
- Also, under the Mining section of the palette, select the Tester operator and drag it onto the editor.
- Connect the Predictor’s Model output port with the Tester’s Model input port and the Random Split’s Test output port with the Tester’s input port.
- Save the mining flow, for example, by clicking on the editor area and pressing Ctrl+S.
Now, the mining flow is ready to execute.
Execute the mining flow:
You can execute a whole mining flow editor or just dedicated paths of the flow by right-click an operator and selecting Run to this step…. In this scenario, you right-click on the Tester operator, select Run to this step…, and click Finish. This flow generates a model that predicts the risk of heart disease and stores it in the database. It is trained on 80 percent of the data and is then tested on the remaining data points. This allows you to estimate how well the model would perform on new data. Just right-click on the Test Result port of the Tester operator. You can also take a look at the model itself. To this end, right-click on the Model port of the Predictor operator.
Scoring means to apply a previously learned model to new data. The new data has no classification (in this case, no checks on heart disease have been made) and the scoring process assigns a prediction to each new record according to the mining model.
Create a new mining flow for scoring:
Perform the same steps as in Create a new mining flow, except that you should give it another name, such as Classify New Patients.
Create the scoring flow:
To create the scoring flow that classifies new patients do the following:
- In the palette, under the Sources and Targets section, select a Model Source operator and drag it onto the editor.
- In the mining model selection dialog, expand the classification models and select your AdvancedAnalytics.HeartDiseasePrediction model.
- Under the Sources and Targets section of the palette, drag a Table Source operator onto the editor.
- In the table selection dialog, expand the HEALTHCARE schema and select the HEART table. This table could be loaded into the warehouse each night, or in real-time, whenever a new patient has arrived at the hospital, and once the necessary measures are available.
- From under the Mining Operators section of the palette, drag a Scorer operator onto the editor.
- Connect the model and table source operator to the scorer.
- The scorer appends two columns with the classification (y and
n) and another column specifying the certainty of this
classification. To store this scored table, you must create a suitable
table with these columns as a Table Target operator.
To do so, right-click on the output port of the scorer and select Create Suitable Table….
- In the upcoming dialog, just specify the name of the new table: NEW_PATIENT_CLASSIFICATION, and the schema where the table will be created: HEALTHCARE. Click Finish.
- Save the scoring flow.
- Finally, execute the flow by right-clicking on the table target operator and run to this step.
After executing the flow, in the lower view appears a sample from the scored table (see Figure 6). If you scroll to the right, you see two columns that have been appended by the scorer.
Figure 6. Scoring flow with sample results
The first appended column, PREDICTED_CLASS, contains the prediction. It states whether the mining model classifies the current patient as someone who will suffer the disease (y) or not (n). The second appended column, CLASS_CONFIDENCE, contains numbers between 0 and 1 and states how certain the scorer (according to the model) is about the prediction. A value of 1 corresponds to "100 percent sure that the classification is correct."
Before you can create a report using mining results, you need to define which resources (databases, tables, or views) should be used. Framework Manager also allows you to enrich data sources by defining connections and new columns (with expressions).
This simple example creates a Cognos Framework Manager project. Define the scored result table and publish the meta data.
Open Framework Manager and create a new project:
- On the desktop, you can launch the Cognos Framework Manager.
- After startup has finished, click Create a new project….
- In the upcoming dialog, specify the name as HeartMetaData, and click OK.
- Select English as the language, and press OK.
- As you want to create metadata from a DB2 table, you select a common Data Sources and click Next.
- Select the DWESAMP database, and click Next.
- Now you are asked to select the data object you want to import as metadata. Expand the HEALTHCARE schema and the table folder, and check the NEW_PATIENT_CLASSIFICATION table you created during the execution of the scoring flow. Click Next.
- The settings on this page are OK, so click Import, and then click Finish.
Figure 7. Cognos 8 Framework Manager
On the left side in the Project Viewer, you can explore the newly created project. Now, define an additional column that combines the insight of the two prediction columns created by the scoring flow. The idea is to provide the mining insight in an understandable way to the doctors.
To do this, perform the following steps:
- Expand the DWESAMP. You see the table created in the scoring flow called NEW_PATIENT_CLASSIFICATION.
- Double-click the table.
- The Query Subject Definition dialog appears. Go to the Calculations
tab (see figure 8).
Figure 8. Query Subject Definition dialog
- Click Add to create a new column defined from a calculation. The Calculation Definition dialog appears.
- In the name field, enter Check and into the Expression
definition text field, type the following expression (the expression
can also be found on the desktop in the file Calculation
IF ( [DWESAMP].[NEW_PATIENT_CLASSIFICATION].[PREDICTED_CLASS] = 'y') THEN (IF ([DWESAMP].[NEW_PATIENT_CLASSIFICATION].[CLASS_CONFIDENCE]>0.83) THEN('necessary') ELSE('maybe')) ELSE (IF([DWESAMP].[NEW_PATIENT_CLASSIFICATION].[CLASS_CONFIDENCE] > 0.9) THEN('not necessary') ELSE('maybe'))
The above expression creates three new classes from the PREDICTED_CLASS label (y and n) and the CLASS_CONFIDENCE. The expression puts those patients having a predicted class of y with a confidence higher than 83 percent to the high risk patients (meaning a checkup is necessary). Patients that are classified with n and a confidence of this class higher than 90 percent are put in the group of low-risk patients (meaning it is not necessary to do a checkup). All other patients are classified as medium risk (a checkup maybe necessary). Figure 9 is an example of the Calculation Definition dialog.
Figure 9. Calculation Definition dialog
- Click OK to close the Calculation Definition dialog.
- You can test the expression by going to the Test tab and clicking Test Sample.
- Click OK to close the Query Subject Definition dialog.
Create a package and publish it:
- To make the recently created metadata available, go to the Project Viewer and right-click the Packages folder, then select Create > Package.
- In the name field, type Heart, and click Next.
- Click Next, as nothing has to be done.
- In the following dialog add DB2 to the Available function sets list, and click Finish.
- In the following dialog, click Yes to continue with the publishing process.
- In the Select publish location part of the Publish Wizard, click Next, as nothing must be changed.
- Do not specify security settings in this example. Thus, click Next.
- Then press “Publish"
- In the dialog mentioning that you have successfully published your package, click Finish.
Cognos Report Studio is a fully Web-based application. You can start Report Studio by double-clicking the Internet Explorer icon on the desktop (upper left corner).
You have to perform the following steps to create a simple mining report:
- On the Welcome screen, click Create professional reports.
- In the Select a package explorer, click the Heart package you published in the Framework Manager.
- On the Welcome dialog, click Create a new report or template.
- Choose the blank template, and click OK.
- In the Insertable Objects pane, you can choose between different tabs.
The source (Data sources) tab is selected. Before you add data to the
report, you need a layout structure. You can get layout structures
from the Toolbox tab, so go there now.
Figure 10. Report Studio
- From the list of objects displayed, drag a Text Item onto the report editor area (see Figure 10).
- In the text box, enter the name Patient checkup recommendation report.
- Locate the List item in the toolbox and drag it onto the editor area.
- Switch back to the Source tab (Data sources) and drag the NEW_PATIENT_CLASSIFICATION table into the list in the editor area.
- You see the classified patient table, and if you scroll to the right you also see the appended columns. As you only require the appended column Check, created in Framework Manager, remove the other two <PREDICTED_CLASS> and <CLASS_CONFIDENCE>. Select both column headers (hold down Ctrl), right-click on the selected headers and choose Delete from the context menu. For the current example, you also remove the columns <DISEASED>, <KEYWORDS>, and <MEDICAL_HISTORY>.
- Now, you want to highlight patients who do not need (not necessary) a checkup with green, those who require (necessary) a checkup with red, and the others (maybe) will not be highlighted. To do so, right-click the Check column (Note: Click the column body, not the header) and select Style > Conditional styles….
- On the Conditional Styles dialog, click the icon in the lower left corner to create a new conditional style.
- On the next dialog, the Check column is already selected. From the Type of conditional style drop-down list, select String, and click OK.
- On the next dialog, click the icon in the lower left corner and select Select Multiple Values…. Report Studio retrieves the possible values directly from the database (Note: it could happen, that Report Studio asks for the DB2 user ID and password).
- First add the value necessary to the selected list, and click OK.
- Repeat steps 14 and 15 to add the value not necessary.
- For the value not necessary, select Excellent from the Style drop-down list.
- For the value necessary, select Poor from the Style drop-down list.
- Click OK twice.
- Change the style of the heading by selecting it and then clicking the Font pop-up menu on the Properties pane.
- Click the Run report button to view the resulting report.
Figure 11 shows the final report. The right-most column shows the risk classification.
Figure 11. Healthcare heart disease risk report
This article described the basic architecture that enables a simple integration of InfoSphere data mining into Cognos reporting. It pointed out that such an integration can have a significant impact on the acceptance of data mining, as the consumer of the results must not be aware of any of the details of the mining process. A running example from the healthcare sector showed how a simple integration can be achieved with only minimal development effort.
Still, there are many more possibilities. The follow-up articles will discuss more advanced topics, such as drill-through, the use of metrics in the framework manager, and the dynamic invocation of data mining from Cognos reports.
Integrate InfoSphere Warehouse
data mining with IBM Cognos reporting, Part 2: Deviation detection
with InfoSphere Warehouse and Cognos: Learn how to identify
unnatural data records with InfoSphere Warehouse data mining, and create
interactive reports that allow interactive exploration.
Integrate InfoSphere Warehouse
data mining with IBM Cognos reporting, Part 3: Invoke mining
dynamically from Cognos using a market basket analysis example:
Learn about the task of association rule mining and how to achieve it in
InfoSphere Warehouse. Then, see how to invoke data mining dynamically from
Cognos. Finally, read an example of how this mechanism could be used in
Integrate InfoSphere Warehouse
data mining with IBM Cognos reporting, Part 4: Customer segmentation
with InfoSphere Warehouse and Cognos: Learn the value of customer
segmentation as a marketing tool. A detailed example uses Cognos to build
an interactive report with drill-through capabilities and data
- The HEART dataset used in the
step-by-step example is taken from the UCI Repository.
Warehousing: Data Mining Made Easy": Gain a better understanding
of data mining and the data mining algorithms in InfoSphere Warehouse.
InfoSphere Warehouse: Find out more about InfoSphere Warehouse.
IBM InfoSphere Warehouse Information Center: Find out about
technical details and parameters of InfoSphere Warehouse.
Cognos: Find out more about IBM
- Data Mining
Group: Find out more about the Predictive Model Markup Language
- developerWorks Information
Management zone: Learn more about Information Management. Find
technical documentation, how-to articles, education, downloads, product
information, and more.
- Stay current with developerWorks technical events
- Technology bookstore: Browse for books on these and other
Get products and technologies
- Download a free
trial version of IBM InfoSphere Warehouse 9.7
- Build your next
development project with IBM trial software, available
for download directly from developerWorks.
Benjamin Leonhardi is a software engineer for InfoSphere Warehouse data mining at the IBM Research & Development Lab in Boeblingen, Germany. He works on mining visualization, text mining, and mining reporting solutions.
Christoph Sieb is a senior software engineer at the IBM Research & Development Lab in Boeblingen, Germany. Christoph is responsible for data mining algorithms and framework-related issues of InfoSphere Warehouse data mining. He has more than seven years of experience as a developer and consultant in the area of business intelligence and data mining.
Michael Wurst, Ph.D. is a senior software engineer at the IBM Research & Development Lab in Boeblingen, Germany. He holds a Ph.D. in computer science and is responsible for data mining algorithms and tooling in InfoSphere Warehouse. Prior to joining IBM, Michael worked as a co-developer, architect, and consultant for the RapidMiner data mining software.
John Rollins, Ph.D., P.E., is the technical leader for data mining in the IBM Americas Software Group, where he has been involved in technical sales support, field enablement, intellectual capital development, and consulting for the past 10 years. Prior to joining IBM in 1996, he was an engineering consultant, professor, and researcher.
Jo Ramos is a business intelligence (BI) specialist for the BI Best Practices team in Dallas, Texas. He has eight years of experience with customer BI projects, providing support in consulting, architecture design, data modeling, and implementation of data warehouses and analytical solutions. Jo has worked with the majority of the IBM and Information Management products portfolio, specializing in analytical solutions. He holds a Bachelor of Science in Economics from the FURB University in Santa Catarina, Brazil.