Published: 9 June 2024
Contributors: Joshua Noble
The Apriori algorithm is an unsupervised machine learning algorithm used for association rule learning. Association rule learning is a data mining technique that identifies frequent patterns, connections and dependencies among different groups of items, called itemsets, in data. Some common use cases are disease prediction and recommendation systems, such as market basket analysis for e-commerce platforms.
Introduced in 1994 by Rakesh Agrawal and Ramakrishnan Srikant (link resides outside ibm.com), the name 'Apriori' acknowledges the prior knowledge of frequent itemsets that the algorithm uses in computation. The algorithm runs iterations over the data to identify k-itemsets, meaning k items that frequently occur together. It then uses the k-itemsets to identify the (k+1)-itemsets. The Apriori algorithm relies on the insight that adding items to a frequently purchased group can only make it less frequent, not more. The process relies on the Apriori property, which states that if an itemset appears frequently in a dataset, all its subsets must also be frequent. Conversely, if an itemset is identified as infrequent, then all its supersets are considered infrequent.
The Apriori algorithm is applicable to all kinds of datasets, especially those generated by transactional databases, and it's often used for market basket analysis to support recommendation systems. For example, when using an e-commerce platform that sells clothes and shoes, a shopper searches for shoes and decides to add a pair of formal black shoes to their shopping cart. The shopper then notices that the interface recommends other items to purchase, like socks. One of the ways this recommendation system works is to learn customer purchasing patterns and associate items that have a history of being purchased together.
One of the biggest advantages of using the Apriori algorithm is its simplicity and adaptability. However, Apriori algorithms are not as efficient when handling large datasets. The multi-iteration process of itemset candidate generation can become computationally expensive and memory intensive. Apriori is often combined with other techniques to mitigate these issues.
Apriori implementations are available for many popular programming languages, including Python, Java and R, making it simple to integrate frequent itemset mining and quality association rule generation into existing applications or systems.
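In Python, for example, the open source mlxtend library provides an Apriori implementation along with helpers for encoding transactions and deriving association rules. The following is a minimal sketch of a typical end-to-end call; the transaction data is illustrative:

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori, association_rules

    # Each inner list is one transaction: the items purchased together
    transactions = [
        ["shoes", "socks", "belt"],
        ["shoes", "socks"],
        ["shoes", "shirt"],
        ["socks", "shirt", "belt"],
        ["shoes", "socks", "shirt"],
    ]

    # One-hot encode the transactions into a boolean DataFrame
    te = TransactionEncoder()
    df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

    # Mine frequent itemsets with a minimum support threshold of 40%
    frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)

    # Derive association rules that meet a minimum confidence of 60%
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
    print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])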
Each of the key steps in the Apriori algorithm works through itemsets and their possible supersets, searching for the most frequent ones from which to create the association rules.
Step 1: Frequent itemsets generation
The algorithm first identifies the unique items, sometimes referred to as 1-itemsets, in the dataset along with their frequencies. Then, it combines the items that appear together with a probability above a specified threshold into candidate itemsets and filters out the infrequent itemsets to reduce the compute cost in further steps. This process, known as frequent itemset mining, looks only for itemsets that occur with meaningful frequency.
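A minimal sketch of this first pass in plain Python, assuming transactions are represented as lists of item names (the data here is illustrative):

    from collections import Counter

    transactions = [
        ["apples", "bananas"],
        ["apples", "milk"],
        ["bananas", "milk"],
        ["apples", "bananas", "milk"],
    ]
    min_support = 0.5  # an itemset must appear in at least half of the transactions

    # Count how many transactions contain each individual item (the 1-itemsets)
    item_counts = Counter(item for transaction in transactions for item in set(transaction))

    # Keep only the items whose support meets the minimum threshold
    n = len(transactions)
    frequent_1_itemsets = {
        frozenset([item]) for item, count in item_counts.items() if count / n >= min_support
    }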
Step 2: Expand and then prune itemsets
Using the Apriori property, the algorithm combines frequent itemsets further to form larger itemsets. The larger itemset combinations with a lower probability are pruned. This further reduces the search space and makes the computation more efficient.
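One way to sketch this join-and-prune step: candidate (k+1)-itemsets are formed by taking the union of pairs of frequent k-itemsets, and any candidate with an infrequent k-item subset is discarded under the Apriori property. The helper below is a hypothetical illustration, not a tuned implementation:

    from itertools import combinations

    def generate_candidates(frequent_k_itemsets, k):
        """Join frequent k-itemsets into (k+1)-item candidates, then prune
        any candidate that has an infrequent k-item subset."""
        candidates = set()
        for a, b in combinations(frequent_k_itemsets, 2):
            union = a | b
            if len(union) == k + 1:
                # Apriori property: every k-item subset of a frequent
                # (k+1)-itemset must itself be frequent
                if all(frozenset(sub) in frequent_k_itemsets
                       for sub in combinations(union, k)):
                    candidates.add(union)
        return candidates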
Step 3: Repeat steps 1 and 2
The algorithm repeats steps 1 and 2 until all frequent itemsets meeting the defined threshold probability are generated exhaustively. Each iteration generates more complex and comprehensive associations in the itemsets.
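Putting the pieces together, a bare-bones version of the full loop might look like the following sketch, which reuses the transactions list and the generate_candidates helper from the snippets above and rescans the data once per iteration to count candidate support:

    def apriori_frequent_itemsets(transactions, min_support):
        # Return every itemset whose support meets min_support
        n = len(transactions)
        transaction_sets = [set(t) for t in transactions]

        # Step 1: frequent 1-itemsets
        item_counts = Counter(item for t in transaction_sets for item in t)
        current = {frozenset([i]) for i, c in item_counts.items() if c / n >= min_support}

        all_frequent = set(current)
        k = 1
        while current:
            # Step 2: expand to (k+1)-item candidates and prune by the Apriori property
            candidates = generate_candidates(current, k)
            # Count each surviving candidate's support with one scan of the data
            counts = {c: sum(1 for t in transaction_sets if c <= t) for c in candidates}
            current = {c for c, cnt in counts.items() if cnt / n >= min_support}
            all_frequent |= current
            k += 1
        return all_frequent

    print(apriori_frequent_itemsets(transactions, min_support=0.5))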
Once Apriori has created the itemsets, the strength of the generated associations and relationships can be investigated.
The Apriori algorithm uses the support, confidence, and lift metrics to define its operating criteria and improve performance efficiency.
Support is defined as the ratio of the number of times an item occurs in the transactions to the total number of transactions. This metric thus defines the probability of the occurrence of each individual item in the transactions. The same logic can be extended to itemsets.
$$S(I_A) = \frac{Occ(I_A)}{N}$$

where $I_A$ is item A, $Occ(I_A)$ is the number of occurrences of item A, $N$ is the total number of transactions, and $S(I_A)$ is the support of item A.
For example, in a retail store, 250 out of 2000 transactions over a day might include a purchase of apples. Using the formula:

$$S(I_{apples}) = \frac{250}{2000} = 0.125$$

This result implies there is a 12.5% chance that a given transaction that day included apples.
You can indicate a required minimum support threshold when applying the Apriori algorithm. This means that any item or itemset with support less than the specified minimum support will be considered infrequent.
The confidence metric identifies the probability of items or itemsets occurring in a transaction together. For example, if there are two items in a rule, the existence of one item is assumed to lead to the other. The first item or itemset is the antecedent, and the second is the consequent. The confidence is thus defined as the ratio of the number of transactions having both the antecedent and the consequent, to the number of transactions having the antecedent. This scenario is represented as:

$$C(A,B) = \frac{Occ(A \cup B)}{Occ(A)}$$

where $A$ is the antecedent, $B$ is the consequent, $Occ(A \cup B)$ is the number of transactions containing both, and $C(A,B)$ is the confidence that A leads to B.
Extending the preceding example, assume that there are 150 transactions where apples and bananas were purchased together. The confidence is calculated as:

$$C(apples, bananas) = \frac{150}{250} = 0.6$$

This result indicates a 60% chance that an apple purchase leads to a banana purchase. Similarly, assuming a total of 500 transactions that include bananas, the confidence that a banana purchase leads to an apple purchase is calculated as:

$$C(bananas, apples) = \frac{150}{500} = 0.3$$
Here, there is just a 30% chance that a banana purchase leads to an apple purchase.
While confidence is a good measure of likelihood, it is not a guarantee of a clear association between items. The value of confidence can be high for other reasons, for example simply because the consequent is a popular item on its own. For this reason, a minimum confidence threshold is applied to filter out weakly probable associations while mining association rules.
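As a quick sketch, these confidence values fall out directly from the transaction counts in the running example:

    occ_apples = 250    # transactions containing apples
    occ_bananas = 500   # transactions containing bananas
    occ_both = 150      # transactions containing both apples and bananas

    # Confidence is the conditional probability of the consequent
    # given the antecedent: Occ(A and B) / Occ(A)
    conf_apples_to_bananas = occ_both / occ_apples    # 0.6
    conf_bananas_to_apples = occ_both / occ_bananas   # 0.3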
Lift is the factor by which the likelihood of item A leading to item B exceeds the baseline likelihood of item B on its own. This metric quantifies the strength of association between A and B. It can help indicate whether there is a real relationship between the items in the itemset or whether they are grouped together by coincidence. A lift well above 1 suggests a genuine association, while a lift close to 1 suggests the items co-occur about as often as chance would predict:

$$L(A,B) = \frac{C(A,B)}{S(B)}$$

where $L(A,B)$ is the lift for item A leading to item B, $C(A,B)$ is the confidence that item A leads to item B, and $S(B)$ is the support for item B.

For the preceding example, with the support of bananas at 500/2000 = 0.25, we can see that:

$$L(apples, bananas) = \frac{0.6}{0.25} = 2.4$$

The lift value of 2.4 indicates that apples and bananas are purchased together 2.4 times more often than would be expected if the two purchases were independent. Because lift is symmetric, evaluating the rule in the other direction yields the same value:

$$L(bananas, apples) = \frac{0.3}{0.125} = 2.4$$

A lift value close to 1, by contrast, would indicate that a purchase of one item leading to a purchase of the other might be just a coincidence.
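The same counts verify the lift calculation and show its symmetry:

    total = 2000
    support_apples = 250 / total     # 0.125
    support_bananas = 500 / total    # 0.25

    # Lift divides the rule's confidence by the consequent's support
    lift_apples_to_bananas = 0.6 / support_bananas   # 2.4
    lift_bananas_to_apples = 0.3 / support_apples    # 2.4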
In many cases, applying a brute-force approach (link resides outside ibm.com) to calculate the support and confidence thresholds for every rule and then prune rules that don't meet a threshold can be computationally prohibitive. To make applying the Apriori algorithm more efficient, it is often combined with other association rule mining techniques. Two of the most common are the FP-Growth algorithm (link resides outside ibm.com) and its variant FP-Max, both of which reduce memory and computation constraints. The Apriori algorithm can also be combined with decision trees, where the Apriori algorithm identifies the frequent itemsets and the decision tree technique helps identify the association rules.
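In practice, this swap can be a one-line change. The mlxtend package used in the earlier sketch also ships FP-Growth and FP-Max implementations with the same interface as its apriori function, so the frequent itemset mining step can be switched without touching the rest of the pipeline (df is the one-hot encoded DataFrame from that sketch):

    from mlxtend.frequent_patterns import fpgrowth, fpmax

    # Drop-in replacements for apriori(): both accept the same
    # one-hot encoded DataFrame and min_support threshold
    frequent_itemsets = fpgrowth(df, min_support=0.4, use_colnames=True)
    maximal_itemsets = fpmax(df, min_support=0.4, use_colnames=True)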
Another popular variant of the Apriori algorithm is Dynamic Itemset Counting (DIC) (link resides outside ibm.com), which starts counting potential itemsets early, without waiting for all the transactions to be recorded. DIC divides the dataset into smaller segments and processes each segment separately. This segmentation enables early stopping when the algorithm is not able to identify any frequent itemsets, and the partitioning of the data also helps reduce the compute cost significantly.
Apriori algorithms can also be useful in unsupervised learning-based artificial intelligence applications, such as clustering, when the data supports it. The algorithm helps identify relationships and associations between seemingly independent entities, grouping them into possible clusters.
Discovering and grouping itemsets has multiple applications, and the Apriori algorithm is sometimes referred to as the first thing data miners try because of its versatility. We'll look at some of the common use cases in different industries.
One of the most common applications of the Apriori algorithm is performing market basket analysis. Retailers analyze customer purchase history and optimize the way stores are laid out by placing frequently purchased items near each other or on the same shelf. E-commerce platforms use Apriori algorithms to study product-based relationships based on user preferences and purchase pattern mining analysis to create efficient customer recommendation systems. The same kind of analysis can be used to optimize the purchase of services, for example, choosing training courses from a catalog, or recommending other types of coverage when selecting insurance.
The Apriori algorithm can be used to find strong association rules between symptoms and diseases to improve the efficiency of diagnosis and devise targeted treatment plans. For example, it can help predict which patients are likely to develop diabetes (link resides outside ibm.com) or reveal the role that diet and lifestyle play in disease (link resides outside ibm.com). It can also help identify factors associated with adverse drug reactions.
Apriori algorithms are also applicable in nontransactional databases. Data analysts will often use Apriori for web usage mining, to analyze clickstream data, and to interpret user behavior.
Another common application of the Apriori algorithm is to identify fraudulent patterns in financial transactions. Identifying specific purchase patterns as being possibly fraudulent allows a financial institution to act quickly to suspend transactions or contact an account holder.