Data mining — Business goals and business examples

Enterprises hold a huge amount of unstructured information. This information includes call-center notes, free-text survey fields, problem reports, repair reports, insurance claims, patient records, email correspondence with customers, and product reviews.

Text is usually grouped under a category such as Unstructured data, together with video tapes, images, audio tapes, and other information that is stored with little or no subdivision into fields. Although you can also apply the concepts of the Unstructured Information Management Architecture (UIMA) to audio tapes or video tapes, unstructured information in the InfoSphere™ Warehouse typically refers to text strings such as call-center notes, customer-satisfaction surveys, or problem reports.

With existing data warehousing tools, you cannot efficiently use this information to create insight. While structured data yields considerable insight, the insight in unstructured data is lost because it is difficult to identify in the vast amount of unstructured data that organizations use to run their business.

With information retrieval, you can locate individual documents that deal with a specific problem. With business intelligence, you can aggregate information to detect patterns and trends.

The goal of text analysis is to transform unstructured information into a structure that can be analyzed in the InfoSphere Warehouse together with existing structured information by using data warehousing tools, for example, reporting tools, tools for multidimensional analysis, or data mining tools. You can create this structure by determining concepts that are included in the text, and by extracting these concepts into relational tables.
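
The following Python sketch illustrates the idea of extracting concepts from text into a relational table. It is not the InfoSphere Warehouse implementation: the concept dictionary, the table name note_concepts, its columns, and the sample notes are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import re
import sqlite3

# Hypothetical concept dictionary: concept name -> phrases that signal it.
CONCEPTS = {
    "new phone": ["new phone", "new handset"],
    "bad rate": ["bad rate", "too expensive"],
}

def extract_concepts(text):
    """Return the sorted list of concepts whose phrases occur in the text."""
    lowered = text.lower()
    found = set()
    for concept, phrases in CONCEPTS.items():
        for phrase in phrases:
            # Match whole words only, so "bad rate" does not match "bad rated".
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
                found.add(concept)
    return sorted(found)

def load_concepts(notes):
    """Store one (note_id, concept) row per extracted concept."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE note_concepts (note_id INTEGER, concept TEXT)")
    for note_id, text in notes:
        for concept in extract_concepts(text):
            conn.execute(
                "INSERT INTO note_concepts VALUES (?, ?)", (note_id, concept)
            )
    conn.commit()
    return conn

# Made-up sample notes, for illustration only.
notes = [
    (1, "Customer asks for a new phone and says the plan is too expensive."),
    (2, "Complains about bad rate for overseas calls."),
    (3, "Address change."),
]
conn = load_concepts(notes)

# Once the concepts are rows in a table, ordinary SQL can aggregate them.
rows = conn.execute(
    "SELECT concept, COUNT(*) FROM note_concepts GROUP BY concept ORDER BY concept"
).fetchall()
```

Because each concept becomes an ordinary row keyed by the note identifier, the table can be joined with existing structured tables in the usual way.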

The main scenarios for text analysis are as follows:

Creating simple reports: For example, you might want to identify the top 10 customer-satisfaction problems that are recorded in a text field of a customer survey.
Creating multi-dimensional reports: For example, in retail, you might want to derive the new OLAP dimension Return reasons from text analysis and combine it with existing dimensions like Time, Geography, or Product.
Generating additional input fields for data mining to improve the predictive power of data mining models: For example, you might want to use symptoms that are included in patient records to improve the predictive power of a data mining model that so far predicts the patients who need treatment only based on structured information like patient age or blood pressure.
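
As a rough illustration of the multidimensional scenario, the following Python sketch counts returns per combination of dimensions. It assumes that a Return reasons value has already been derived from free text for each record; the sample rows and the cross_tab helper are hypothetical.

```python
from collections import Counter

# Made-up rows: (month, product, return_reason). The return reason is assumed
# to have been derived from free text by an earlier text-analysis step.
returns = [
    ("2024-01", "Shoes", "wrong size"),
    ("2024-01", "Shoes", "wrong size"),
    ("2024-01", "Jacket", "damaged"),
    ("2024-02", "Shoes", "damaged"),
    ("2024-02", "Jacket", "wrong size"),
]

def cross_tab(rows, dims):
    """Count rows per combination of the chosen dimension positions."""
    return Counter(tuple(row[i] for i in dims) for row in rows)

# Time x Return reasons
by_month_reason = cross_tab(returns, dims=(0, 2))
# Product x Return reasons
by_product_reason = cross_tab(returns, dims=(1, 2))
```

The derived dimension behaves like any other dimension here: it can be combined with Time or Product simply by choosing different positions in dims.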

Business examples

You can apply text analysis or a combined analysis of structured and unstructured data in all industries and across many business functions. The following examples illustrate the concepts:

Cross-business: analyzing customer-satisfaction surveys

Typically, customer-satisfaction surveys include structured fields and text fields. Structured fields might be Yes or No questions, multiple-choice questions, or ratings on a scale from 1 to 5. In addition to the structured fields, surveys usually include text fields that allow customers to express their views in their own words. These comments are of high value because they include a wealth of information. However, it can be tedious, time-consuming, and costly to identify and categorize them. You can streamline this process by using text analysis, for example:

Find the most frequently occurring terms or concepts in free-text fields
Categorize these terms into broader categories
Create simple reports, for example, find the top 10 problems, or multidimensional reports, for example, identify trends of specific topics over time
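
The three steps above can be sketched in Python as follows. This is a minimal illustration that assumes a hand-made stop-word list and a hypothetical mapping from terms to categories; a real text-analysis step would use richer linguistic processing.

```python
import re
from collections import Counter

# Hypothetical mapping from individual terms to broader categories.
CATEGORY_OF = {
    "delivery": "Logistics",
    "shipping": "Logistics",
    "price": "Pricing",
    "expensive": "Pricing",
    "support": "Service",
}

# Hypothetical stop-word list: frequent words that carry no topic.
STOP_WORDS = {"the", "was", "is", "too", "and", "a", "for", "very"}

def term_counts(comments):
    """Count how often each non-stop-word term occurs across all comments."""
    counts = Counter()
    for comment in comments:
        words = re.findall(r"[a-z]+", comment.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts

def category_counts(counts):
    """Roll term counts up into the broader categories."""
    rolled = Counter()
    for term, n in counts.items():
        if term in CATEGORY_OF:
            rolled[CATEGORY_OF[term]] += n
    return rolled

# Made-up survey comments, for illustration only.
comments = [
    "Delivery was late and shipping is too expensive",
    "The price is too high",
    "Support was very slow and the delivery was late",
]
terms = term_counts(comments)
# Simple report: the most frequent categories, at most 10.
top_categories = category_counts(terms).most_common(10)
```

Terms that are not in the mapping, such as "late" in this sample, are counted but not assigned to a category, which makes them candidates for extending the mapping.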

Voice of the customer: customer retention

In the telecommunication industry, a cell-phone provider might want to combine text analysis with the data-mining technique Predictive modeling to reinforce customer retention.

Winning new customers is expensive. Therefore, it is important that existing customers are not lured away by competitors. With the data-mining technique Predictive modeling, you can predict each customer's propensity to cancel the contract.

Predictive modeling is based on available data about each customer and on historic cases of customers who have left your company.

In a traditional data-mining model, only structured data about customers is used. For example:

Demographic data: Demographic data might include age, gender, income, number of children.
Transactional data: Transactional data might include payment type, number of overseas calls, number of long-distance calls, number of local calls last month or last 3 months.

With text analysis, you can extract the most important concepts that are mentioned by customers during a call to the call center. For example, the following concepts might be recorded by call-center staff in their notes:

New phone
Bad rate

By using the information that can be extracted from call-center notes as additional input to the data-mining prediction-model, you can considerably improve the predictive power of the data-mining model.
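
The following Python sketch shows one way in which extracted concepts might be turned into additional input fields. It assumes that each concept becomes a 0/1 indicator column next to the structured fields; the field names, the build_features helper, and the sample customer are hypothetical, and the prediction model itself is not shown.

```python
# Hypothetical concept vocabulary taken from call-center notes.
CONCEPT_FIELDS = ["new phone", "bad rate"]

def build_features(customer, concepts):
    """Merge structured fields with one 0/1 indicator per text concept."""
    row = dict(customer)  # copy, so the structured record is left unchanged
    for concept in CONCEPT_FIELDS:
        field = "mentions_" + concept.replace(" ", "_")
        row[field] = 1 if concept in concepts else 0
    return row

# Made-up structured record and extracted concepts for one customer.
customer = {"age": 34, "payment_type": "invoice", "overseas_calls": 12}
features = build_features(customer, concepts={"bad rate"})
```

The resulting row has the same shape for every customer, so it can be passed to a prediction model in the same way as purely structured input.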

Manufacturing: causal analysis of repair reports

A car manufacturer might want to analyze repair reports from repair shops to understand the root causes or the repair sequences that lead to frequent failures. The business goals are to improve failing parts and to receive early warning indicators to avoid costly product recalls.

Repair shops that are accredited dealers for a specific brand might provide most of the repair reports in a structured form, for example, part types or standard codes for standard services. This information is already used today to find sequential patterns of part failures. However, nonstandard cases are reported in free-text fields. Extracting part types and problem types from these free-text fields and including them in the causal and sequence analysis improves the manufacturer's ability to detect quality problems earlier and to avoid costly product recalls.

Life science: causal analysis for coronary heart diseases

Medical scientists might want to evaluate a study on the risk of patients who suffer from heart diseases. For each patient, structured information such as blood pressure, cholesterol, or age was collected. Additionally, unstructured textual information was collected for each patient. The unstructured information includes the medical history with data about the lifestyle and medical symptoms of the patient. This information might include relevant risk factors for heart diseases.

With text analysis, the following keywords might be detected in the records of the patients:

smoker, physical inactivity, alcoholism, obesity

You can considerably improve association models and classification models by including in these models the keywords that are retrieved by using text analysis. For example, based on the analysis of structured and unstructured data, 40% of the patients might be eligible to be exempted from further intensive and expensive medical supervision and control. This result cannot be achieved if you use structured information only.
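
To show how text keywords can take part in an association model, the following Python sketch computes the two standard measures of an association rule, support and confidence, over records that mix structured flags with extracted keywords. The records are made up for illustration and do not come from any study; the rule_metrics helper is hypothetical.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    total = len(records)
    if total == 0:
        return 0.0, 0.0
    # Records that contain every item of the antecedent.
    with_antecedent = [r for r in records if antecedent <= r]
    # Of those, the records that also contain every item of the consequent.
    with_both = [r for r in with_antecedent if consequent <= r]
    support = len(with_both) / total
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

# Made-up patient records: each set mixes a structured flag with text keywords.
records = [
    {"high blood pressure", "smoker", "heart disease"},
    {"smoker", "obesity", "heart disease"},
    {"smoker"},
    {"physical inactivity"},
]
support, confidence = rule_metrics(records, {"smoker"}, {"heart disease"})
```

Without the keyword smoker, which exists only in the text, this rule could not be evaluated at all, which is the sense in which text analysis extends the model.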