Market Basket Analysis

Association Rule Mining

Ever wondered why a Walmart store places baby diapers and beer together? Or why milk, eggs and butter are always kept close to one another? It is no coincidence that gum and candy sit side by side in any convenience store. Traditionally, such placement strategies were based on the intuition of marketing and business professionals with years of experience in the field.

With the introduction of machine learning methods, stores today can decide which items to place together on shelves. The same methods are used to learn the purchase patterns of customers on online shopping websites; for example, the "frequently bought together" feature on Amazon.com.

This practice is called market basket analysis due to its widespread use in retail stores. The result of a market basket analysis is a set of association rules that specify relationships among purchased items.

For example: {butter, milk} -> {bread}.

This states that customers who buy butter and milk together are also likely to buy bread. The braces ({}) denote a set, or itemset. Rules like the one above are learned from subsets of itemsets; this particular rule would have been derived from the itemset {butter, milk, bread}.
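
To make the subset relationship concrete, here is a minimal sketch (in Python rather than the R used for the analysis in this post) that enumerates every candidate rule derivable from a single itemset by splitting it into an antecedent and a consequent:

```python
from itertools import combinations

def rules_from_itemset(itemset):
    """Enumerate all candidate rules A -> B that partition the itemset."""
    items = sorted(itemset)
    rules = []
    # Every non-empty proper subset can serve as the antecedent (left-hand
    # side); the remaining items form the consequent (right-hand side).
    for r in range(1, len(items)):
        for antecedent in combinations(items, r):
            consequent = tuple(i for i in items if i not in antecedent)
            rules.append((set(antecedent), set(consequent)))
    return rules

for lhs, rhs in rules_from_itemset({"butter", "milk", "bread"}):
    print(lhs, "->", rhs)
```

A 3-itemset yields six candidate rules, one of which is {butter, milk} -> {bread}; in practice only the candidates that pass the interest thresholds are reported.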

Association rule mining is used for unsupervised knowledge acquisition in large databases. Because the learning is unsupervised, the training data need not be labeled. This also means there is no objective way to measure the performance of the technique, other than checking the mined rules against data that is already present.

Beyond retail, this technique can be applied to problems such as:

  • Analyzing cancer data by finding recurring DNA patterns
  • Analyzing the customer behaviour patterns that eventually lead to customers dropping their mobile/broadband services.

Transactional datasets are highly complex due to the large number of transactions and features (the items that are bought). The number of potential itemsets grows exponentially with the number of items: for k items, there are 2^k possible itemsets. For transaction data containing 300 items, that is 2^300 itemsets! Instead of evaluating each and every itemset, a better algorithm ignores combinations that rarely occur in practice. In a shop that sells both dairy items and musical instruments, the itemset {guitar, cheese} would be highly uncommon!
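
The blow-up is easy to check directly. This small sketch counts the non-empty itemsets (2^k - 1, or 2^k if the empty set is included) for a few values of k:

```python
def num_itemsets(k):
    """Number of possible itemsets over k items, excluding the empty set."""
    return 2 ** k - 1

for k in (3, 10, 100, 300):
    print(k, num_itemsets(k))
# For k = 300 the count is a 91-digit number, far too many to enumerate.
```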

The most widely used approach to this problem is the Apriori algorithm. The name comes from the fact that the algorithm uses a simple prior belief about frequent itemsets, which reduces the association rule search space: all subsets of a frequent itemset must also be frequent. Equivalently, any itemset with an infrequent subset can be pruned without ever being counted.
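
A compact sketch of this level-wise search (plain Python, not the R implementation used for the analysis in this post; the toy transactions are made up) shows how the prior belief prunes candidates:

```python
from itertools import combinations

def apriori_frequent(transactions, min_support):
    """Level-wise frequent-itemset mining using the Apriori property:
    a k-itemset can only be frequent if all its (k-1)-subsets are frequent."""
    n = len(transactions)
    # Level 1: frequent individual items.
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= min_support}
    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = sum(s <= t for t in transactions) / n
        k += 1
        # Join step: build k-item candidates from frequent (k-1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop any candidate with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support}
    return frequent

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
freq = apriori_frequent(transactions, min_support=0.5)
```

On this toy data every candidate survives, but on real transaction data the prune step discards the vast majority of the 2^k possibilities before they are ever counted.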

The number of reported rules can be reduced by applying thresholds to two interest measures: support and confidence. The support of an itemset measures how frequently it occurs in the data. The confidence of a rule X -> Y is the support of the itemset containing both X and Y divided by the support of X.

Support(X) = count(X) / N

Confidence(X -> Y) = Support(X, Y) / Support(X)
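
These two formulas translate directly into code. A minimal Python sketch (the toy transactions below are invented for illustration, not taken from the grocery dataset):

```python
def support(itemset, transactions):
    """Support(X) = count(X) / N: fraction of transactions containing X."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Confidence(X -> Y) = Support(X, Y) / Support(X)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

transactions = [
    {"butter", "milk", "bread"},
    {"butter", "milk"},
    {"milk", "bread"},
    {"butter", "milk", "bread"},
]
print(support({"butter", "milk"}, transactions))                # 3/4 = 0.75
print(confidence({"butter", "milk"}, {"bread"}, transactions))  # 0.5 / 0.75
```

Here {butter, milk} appears in three of four baskets (support 0.75), and two of those three also contain bread, giving the rule {butter, milk} -> {bread} a confidence of about 0.67.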

Below, I extract association rules from a grocery purchase dataset available in the UCI Machine Learning Repository, using the R statistical programming language. The code can be found here.

[Figure: support levels for the frequently occurring items]

[Figure: parallel coordinates ("paracoord") plot for 40 rules]
