What is data mining?
Learn about data mining, which combines statistics and artificial intelligence to analyze large data sets to discover useful information

Data mining, also known as knowledge discovery in data (KDD), is the process of uncovering patterns and other valuable information from large data sets. Given the evolution of data warehousing technology and the growth of big data, adoption of data mining techniques has accelerated rapidly over the last couple of decades, helping companies transform their raw data into useful knowledge. However, although the technology continuously evolves to handle data at scale, leaders still face challenges with scalability and automation.

Data mining has improved organizational decision-making through insightful data analyses. The data mining techniques that underpin these analyses serve two main purposes: they can describe the target dataset, or they can predict outcomes through the use of machine learning algorithms. These methods are used to organize and filter data, surfacing the most interesting information, from fraud detection to user behaviors, bottlenecks, and even security breaches.

When combined with data analytics and visualization tools, like Apache Spark, delving into the world of data mining has never been easier and extracting relevant insights has never been faster. Advances within artificial intelligence only continue to expedite adoption across industries. 


Data mining process

The data mining process involves a number of steps, from data collection to visualization, to extract valuable information from large data sets. As mentioned above, data mining techniques are used to generate descriptions and predictions about a target data set. Data scientists describe data through their observations of patterns, associations, and correlations. They also classify and cluster data using classification and clustering methods, and they identify outliers for use cases such as spam detection.

Data mining usually consists of four main steps: setting objectives, data gathering and preparation, applying data mining algorithms, and evaluating results.

1. Set the business objectives: This can be the hardest part of the data mining process, and many organizations spend too little time on this important step. Data scientists and business stakeholders need to work together to define the business problem, which helps inform the data questions and parameters for a given project. Analysts may also need to do additional research to understand the business context appropriately.

2. Data preparation: Once the scope of the problem is defined, it is easier for data scientists to identify which set of data will help answer the pertinent questions to the business. Once they collect the relevant data, the data will be cleaned, removing any noise, such as duplicates, missing values, and outliers. Depending on the dataset, an additional step may be taken to reduce the number of dimensions as too many features can slow down any subsequent computation. Data scientists will look to retain the most important predictors to ensure optimal accuracy within any models.

3. Model building and pattern mining: Depending on the type of analysis, data scientists may investigate any interesting data relationships, such as sequential patterns, association rules, or correlations. While high frequency patterns have broader applications, sometimes the deviations in the data can be more interesting, highlighting areas of potential fraud.

Deep learning algorithms may also be applied to classify or cluster a data set depending on the available data. If the input data is labelled (i.e. supervised learning), a classification model may be used to categorize data, or alternatively, a regression may be applied to predict the likelihood of a particular assignment. If the dataset isn’t labelled (i.e. unsupervised learning), the individual data points in the training set are compared with one another to discover underlying similarities, clustering them based on those characteristics.

4. Evaluation of results and implementation of knowledge: Once the data is aggregated, the results need to be evaluated and interpreted. Finalized results should be valid, novel, useful, and understandable. When these criteria are met, organizations can use this knowledge to implement new strategies, achieving their intended objectives.
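The four steps above can be sketched end to end on a toy dataset. Everything here is an illustrative assumption rather than a prescribed method: the records are invented, and the "model" is just a simple flagging rule based on a multiple of the mean.

```python
# Step 1: objective -- flag transactions that look anomalous (potential fraud).
raw = [
    {"id": 1, "amount": 25.0},
    {"id": 1, "amount": 25.0},   # duplicate record
    {"id": 2, "amount": None},   # missing value
    {"id": 3, "amount": 40.0},
    {"id": 4, "amount": 980.0},
]

# Step 2: data preparation -- remove duplicates and rows with missing values.
seen, cleaned = set(), []
for row in raw:
    key = (row["id"], row["amount"])
    if row["amount"] is not None and key not in seen:
        seen.add(key)
        cleaned.append(row)

# Step 3: pattern mining -- a trivial stand-in for a model: amounts far
# above the mean are treated as deviations worth investigating.
mean = sum(r["amount"] for r in cleaned) / len(cleaned)
flagged = [r["id"] for r in cleaned if r["amount"] > 2 * mean]

# Step 4: evaluation -- inspect what the rule surfaced before acting on it.
print(len(cleaned))   # 3 unique, complete rows remain
print(flagged)        # [4]
```

In practice each step is far richer (feature selection, trained models, validation against business metrics), but the shape of the workflow is the same.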

Data mining techniques

Data mining works by using various algorithms and techniques to turn large volumes of data into useful information. Here are some of the most common ones:

Association rules: An association rule is a rule-based method for finding relationships between variables in a given dataset. These methods are frequently used for market basket analysis, allowing companies to better understand relationships between different products. Understanding consumption habits of customers enables businesses to develop better cross-selling strategies and recommendation engines.
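The two core association-rule metrics, support and confidence, can be computed directly on a handful of toy market baskets. The item names and baskets below are made up for illustration:

```python
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

# Rule {bread} -> {milk}: how often milk appears in baskets that contain bread.
print(support({"bread"}))               # 0.75
print(confidence({"bread"}, {"milk"}))  # 2/3, about 0.667
```

Algorithms such as Apriori build on exactly these quantities, pruning itemsets whose support falls below a chosen threshold.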

Neural networks: Primarily leveraged for deep learning algorithms, neural networks process training data by mimicking the interconnectivity of the human brain through layers of nodes. Each node is made up of inputs, weights, a bias (or threshold), and an output. If the output value exceeds a given threshold, the node “fires” or activates, passing data to the next layer in the network. Neural networks learn this mapping function through supervised learning, adjusting their weights via gradient descent to minimize a loss function. When the loss is at or near zero, we can be confident that the model yields accurate answers.
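A single node behaves as described: a weighted sum of inputs plus a bias, passed through an activation function. The weights, inputs, and 0.5 firing threshold below are arbitrary illustrative values:

```python
import math

def neuron(inputs, weights, bias):
    """One node: weighted sum of inputs plus bias, through a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes z into (0, 1)

out = neuron([1.0, 0.5], [0.4, -0.2], bias=0.1)
# z = 1.0*0.4 + 0.5*(-0.2) + 0.1 = 0.4, so the output is sigmoid(0.4) ~ 0.599.
# With a firing threshold of 0.5, this node activates and passes its
# signal to the next layer.
print(out > 0.5)  # True
```

Training consists of nudging those weights and biases, layer by layer, in the direction that reduces the loss, which is the gradient descent step the paragraph refers to.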

Decision tree: This data mining technique uses classification or regression methods to classify or predict potential outcomes based on a set of decisions. As the name suggests, it uses a tree-like visualization to represent the potential outcomes of these decisions.
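A learned tree ultimately reduces to nested feature tests ending in leaf predictions. The loan-approval scenario and its thresholds below are invented for illustration; real trees learn their splits from data (for example by minimizing Gini impurity):

```python
def classify_loan(income, debt_ratio):
    """A hand-written two-level decision tree: internal nodes test a
    feature, leaves return a predicted class."""
    if income >= 50_000:
        if debt_ratio < 0.4:
            return "approve"
        return "review"
    return "deny"

print(classify_loan(income=60_000, debt_ratio=0.2))  # approve
print(classify_loan(income=30_000, debt_ratio=0.1))  # deny
```

The tree-like visualization the paragraph mentions is just this branching structure drawn top-down, with one path from the root to each leaf.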

K-nearest neighbor (KNN): K-nearest neighbor, also known as the KNN algorithm, is a non-parametric algorithm that classifies data points based on their proximity and association to other available data. It assumes that similar data points lie near each other. The algorithm calculates the distance between data points, usually the Euclidean distance, and then assigns a category based on the most frequent category (for classification) or the average value (for regression) among the nearest neighbors.
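The classification variant fits in a few lines: measure Euclidean distance to every labeled point, take the k closest, and vote. The 2-D toy points and labels are invented for illustration:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of ((x, y), label) pairs. Returns the majority label
    among the k training points nearest to `query`."""
    dists = sorted((math.dist(point, query), label) for point, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
         ((5, 5), "B"), ((6, 5), "B")]
print(knn_predict(train, (0.5, 0.5)))  # "A" -- its three nearest neighbors are all A
```

Being non-parametric, KNN builds no model at training time; all the work happens at prediction, which is why large training sets make it slow without indexing structures.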

Data mining applications

Data mining techniques are widely adopted among business intelligence and data analytics teams, helping them extract knowledge for their organization and industry. Some data mining use cases include:

Sales and marketing

Companies collect a massive amount of data about their customers and prospects. By observing consumer demographics and online user behavior, companies can use data to optimize their marketing campaigns, improving segmentation, cross-sell offers, and customer loyalty programs, yielding higher ROI on marketing efforts. Predictive analyses can also help teams to set expectations with their stakeholders, providing yield estimates from any increases or decreases in marketing investment.


Education

Educational institutions have started to collect data to understand their student populations as well as which environments are conducive to success. As courses continue to move to online platforms, institutions can use a variety of dimensions and metrics to observe and evaluate performance, such as keystrokes, student profiles, classes, universities, and time spent.

Operational optimization

Process mining leverages data mining techniques to reduce costs across operational functions, enabling organizations to run more efficiently. This practice has helped to identify costly bottlenecks and improve decision-making among business leaders.

Fraud detection

While frequently occurring patterns in data can provide teams with valuable insight, observing data anomalies is also beneficial, assisting companies in detecting fraud. While this is a well-known use case within banking and other financial institutions, SaaS-based companies have also started to adopt these practices to eliminate fake user accounts from their datasets.
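One common way to surface the anomalies mentioned above is a z-score check: observations more than a chosen number of standard deviations from the mean get flagged for review. The transaction amounts and the 2-standard-deviation cutoff below are illustrative assumptions:

```python
import statistics

amounts = [20, 22, 19, 25, 21, 23, 500]  # toy transaction amounts

mean = statistics.mean(amounts)
stdev = statistics.stdev(amounts)

# Flag any amount whose z-score exceeds 2, i.e. it sits more than
# two standard deviations from the mean.
flagged = [a for a in amounts if abs(a - mean) / stdev > 2]
print(flagged)  # [500]
```

Production fraud systems use far more robust detectors (isolation forests, autoencoders, rules over behavioral features), but the underlying idea is the same: model what is normal and investigate what deviates.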

Related solutions
Enterprise search platform

Find critical answers and insights from your business data using AI-powered enterprise search technology

Explore IBM Watson Discovery
Data Warehouse

A fully managed, elastic cloud data warehouse built for high-performance analytics and AI

Explore IBM Db2 Warehouse on Cloud
IBM Watson® Studio

Build and scale trusted AI on any cloud. Automate the AI lifecycle for ModelOps.

Learn about IBM Watson® Studio
Take the next step

Partner with IBM to get started on your latest data mining project. IBM Watson Discovery digs through your data in real time to reveal hidden patterns, trends, and relationships between different pieces of content. Use data mining techniques to gain insights into customer and user behavior, analyze trends in social media and e-commerce, find the root causes of problems and more. There is untapped business value in your hidden insights.

Get started with IBM Watson Discovery today