27  Association Rules

And now for something completely different.

27.1 Introduction

Like other unsupervised learning methods, Association Rule Mining (ARM) tries to learn aspects of the joint density \(p(\textbf{X})\) of the data. It asks specifically: in regions of \(p(\textbf{X})\) where the density is high, are certain values of \(X_j\) associated with values of another variable \(X_k\) more frequently than one would expect under a completely random allocation? In other words, are there associations between the values that tell us something interesting about the joint distribution of the data?

The most frequent form of ARM is market basket analysis; the name refers to the items a customer places in a shopping basket. \(X_1, \cdots, X_p\) are the possible items offered for purchase, with \(X_j \in \{0,1\}\). \(X_{ij}\) takes on the value \(1\) if the \(i\)th customer places the \(j\)th item in their basket, and \(0\) otherwise. Market basket analysis is concerned with the properties of the \((n \times p)\) matrix \(\textbf{X}\) of zeros and ones, trying to understand which items are purchased together—the famous example is beer and diapers.

Market basket analysis.

Some applications of ARM are

  • Customer segmentation: grouping customers based on their buying behavior
  • Retail allocation: which items should be on shelves and how they should be grouped
  • Recommender systems: cross-marketing of products
  • Survey analysis: which responses go together?
  • Precursor analytics: Borne (2021) mentions this interesting application of ARM, where data are time stamped. The analysis seeks time lags at which the association strength peaks. Precursor analysis has found that solar storm events show the strongest geomagnetic effects 2–3 hours after the storm and that video cameras are bought 3 months after the purchase of a video (VHS) player.

The approach to association rule mining is essentially mode hunting, looking for areas of the joint distribution with high densities and finding associations of items that are more frequent than expected if shopping baskets were filled at random. This is conceptually simple: it is just a matter of counting.

Example: Beers and Diapers

Borne (2021) cites beer–diaper association as the classic example of market basket analysis, although the particular example might be more legend than reality.

Suppose that a retail store has 600,000 transactions (shopping baskets). 7,500 (1.25%) of the shopping baskets contain diapers, 60,000 (10%) of the baskets contain beer. So we know that a basket is 8 times more likely to contain beer than diapers. In order to analyze the association of beer and diapers we need to know how many of the baskets contain both. Suppose that number is 6000 (1%).

If beer and diaper purchases were independent, then 10% of the diaper purchases would be accompanied by beer purchases, or, equivalently, 1.25% of beer buyers would pick up diapers. Either way, under independence of the two items we would expect \(600,000 \times 0.1 \times 0.0125 = 750\) baskets with both beer and diapers. The actual number of baskets containing both is much higher.

The ratio between what was observed and what was expected under independence is called the lift factor: \[ L = \frac{6000}{750} = 8 \]
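This calculation can be reproduced in a few lines of base R (the variable names below are our own, chosen for illustration):

```r
# Beer-and-diapers example: expected co-occurrence under independence
n        <- 600000   # total transactions
n_beer   <- 60000    # baskets with beer (10%)
n_diaper <- 7500     # baskets with diapers (1.25%)
n_both   <- 6000     # baskets with both (1%)

# expected number of joint baskets if the items were independent
expected <- n * (n_beer / n) * (n_diaper / n)

# lift: observed joint count relative to the expectation
lift <- n_both / expected
c(expected = expected, lift = lift)   # expected ~ 750, lift ~ 8
```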

Association rule mining is about finding associations that have a large lift factor. But that is not the only criterion by which associations are judged. A lift factor of 10 is very interesting when the items appear frequently in baskets, but not that interesting if they are purchased very rarely. In the former case one can focus promotional pricing on those items together whereas in the latter case this might not yield a lot of bang for the buck (literally).

Finding meaningful associations can be challenging, since we are dealing with a \(p\)-dimensional distribution and \(p\) can be very large. Market basket analysis of Amazon.com involves more than 12 million products. The data are very sparse; many items are never purchased together, and there are insufficient data to estimate proportions reliably. Also, we would not be interested only in associations between two items; what about sets of three or four or more items? We need a structured approach that simplifies the procedure—this leads us down the path of item sets.

27.2 Association Rules and Item Sets

Items are collected in item sets, for example {butter, bread} is a two-item set and {milk} is a single-item set. Association rules display relationships among item sets, for example

{bread, butter} \(\Rightarrow\) {jelly}

The left hand side (lhs) of the rule is called the head or the antecedent; the right hand side (rhs) is called the body or the consequent of the rule. The rule is read as follows: if the head (lhs) occurs then the body (rhs) also occurs.

The most important metrics by which to judge an association rule are its support, confidence, and lift.

Support

Suppose we have \(n\) transactions and a rule \(A \Rightarrow B\), where A and B are item sets. Let \(\#A\) denote the number of times A occurs in the database and \(\#(A,B)\) the number of times the sets A and B occur together. For example, if A is a three-item set, then \(\#A\) is the number of records where all three items are present. If B is a single-item set, then \(\#(A,B)\) is the number of records where all four items are present.

The support of the rule is the relative frequency with which head and body appear: \[ S(A \Rightarrow B) = \frac{\#(A,B)}{n} \]

Confidence

The confidence of the rule is the estimate of the conditional probability to find the body (the consequent) of the rule in transactions given that they contain the head (the antecedent):

\[ C(A \Rightarrow B) = \frac{\#(A,B)}{\#A} \] It is a measure of predictability of the association between A and B.

Support measures how frequently an item set appears relative to the entire database, whereas confidence measures the strength of the connection between item sets. A support threshold is often used to focus on those item sets that are interesting because they are frequent. For example, a support threshold of 0.2 implies that only item sets that appear in at least 20% of the transactions are considered. The confidence, on the other hand, tells us how likely it is that B occurs every time A occurs.

It is possible for confidence to be high while support is low. For example, {caviar} \(\Rightarrow\) {vodka} is a high-confidence rule, because given that caviar is purchased it is likely to be accompanied by a vodka purchase. It is a low-support rule because caviar is purchased infrequently. An algorithm that sets a support threshold might miss association rules with low support and high confidence.

A rule can have 100% confidence if the head requires the body—that is, every time the head occurs the body also occurs. For example, if opening a savings account requires having a checking account, then \[ C(\text{savings} \Rightarrow \text{checking}) = \frac{\#(\text{checking and savings})}{\#\text{savings}}=1 \] Knowing that a customer has a savings account, you can say with certainty that they also have a checking account. The body (checking account) is completely predictable from the head (savings account).

Lift

The lift of an association rule measures how much more likely the association is compared to independence of the items: \[ L(A \Rightarrow B) = n \frac{\#(A,B)}{\#A \#B} = \frac{C(A\Rightarrow B)}{S(B)} \] The lift factor is an estimate of the ratio of the joint probabilities \[ \frac{\Pr(A,B)}{\Pr(A)\Pr(B)} \]

  • When \(L(A \Rightarrow B) = 1\), A and B are not associated with each other (independent).

  • When \(L(A\Rightarrow B) > 1\), then A and B occur more frequently together than under a random selection of items—A and B have a positive association.

  • When \(L(A \Rightarrow B) < 1\), then A and B occur less frequently together than expected under a random selection of items—A and B have a negative association.

Example

A supermarket has 80 shopping baskets, 60 of them contain bread, 50 contain peanut butter. 40 baskets contain both bread and peanut butter.

  • Support for {bread} = 60/80 = 0.75
  • Support for {peanut butter} = 50/80 = 0.625
  • Support for {bread, peanut butter} = 40/80 = 0.5
  • Confidence for rule {bread} \(\Rightarrow\) {peanut butter} = 40/60 = 0.667
  • Lift for rule {bread} \(\Rightarrow\) {peanut butter} = 80 x 40 / (60 x 50) = 1.067
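These metrics are easy to compute from raw counts. The following sketch defines a small helper (rule_metrics() is our own, not an arules function) and verifies the bread/peanut butter numbers:

```r
# Support, confidence, and lift from raw counts (our own helper, not arules)
rule_metrics <- function(n, nA, nB, nAB) {
  c(support    = nAB / n,              # S(A => B)
    confidence = nAB / nA,             # C(A => B)
    lift       = n * nAB / (nA * nB))  # L(A => B) = C(A => B) / S(B)
}

# bread => peanut butter: n = 80 baskets, #A = 60, #B = 50, #(A,B) = 40
round(rule_metrics(n = 80, nA = 60, nB = 50, nAB = 40), 3)
#    support confidence       lift
#      0.500      0.667      1.067
```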

Another example calculation follows before we apply an efficient algorithm to find rules in a large database.

Example

Suppose we have indicator variables \(X_1, \cdots, X_5\) for five items and the following baskets

Basket \(X_1\) \(X_2\) \(X_3\) \(X_4\) \(X_5\)
1 1 1 1 0 0
2 1 0 1 1 0
3 0 1 1 1 0
4 1 0 0 1 1
5 0 1 1 0 1

The next table shows support, confidence, and lift for a series of rules

Rule Support Confidence Lift
\(X_1 \Rightarrow X_4\) 2/5 2/3 10/9
\(X_3 \Rightarrow X_1\) 2/5 2/4 5/6
\(X_1 \Rightarrow X_3\) 2/5 2/3 5/6
\((X_2,X_3) \Rightarrow X_4\) 1/5 1/3 5/9
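The entries of this table can be verified directly in base R; the rule() helper below is our own sketch, not a library function:

```r
# The five baskets as a binary matrix
X <- matrix(c(1,1,1,0,0,
              1,0,1,1,0,
              0,1,1,1,0,
              1,0,0,1,1,
              0,1,1,0,1),
            nrow = 5, byrow = TRUE,
            dimnames = list(NULL, paste0("X", 1:5)))

# support, confidence, and lift of A => B for column-name sets A and B
rule <- function(A, B) {
  n   <- nrow(X)
  inA <- apply(X[, A, drop = FALSE] == 1, 1, all)  # baskets containing all of A
  inB <- apply(X[, B, drop = FALSE] == 1, 1, all)  # baskets containing all of B
  c(support    = sum(inA & inB) / n,
    confidence = sum(inA & inB) / sum(inA),
    lift       = n * sum(inA & inB) / (sum(inA) * sum(inB)))
}

rule("X1", "X4")           # 2/5, 2/3, 10/9
rule(c("X2", "X3"), "X4")  # 1/5, 1/3, 5/9
```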

27.3 Apriori Algorithm

The math to compute support, confidence, and lift of rules is simple; it just means counting occurrences in a database of transactions. What makes ARM tricky is the large number of rules. For \(p\) items, there are \(2^p-1\) non-empty item sets. An algorithmic approach is needed to find interesting rules, especially when \(p\) and \(n\) are large.

Also, we want to query the database with queries such as

Find all rules that have “bagels” in the antecedent. Find all rules that have “sausage” in the antecedent and “mustard” in the consequent.
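In the arules package, queries of this kind can be expressed with subset() and the %in% operator on the lhs and rhs of a mined rule set. A sketch, substituting items that occur in the Groceries data used later in this chapter:

```r
library(arules)
data(Groceries)

# mine a rule set first (same thresholds used later in this chapter)
rules <- apriori(Groceries,
                 parameter = list(support = 0.001, confidence = 0.5),
                 control   = list(verbose = FALSE))

# all rules with "yogurt" in the antecedent (lhs) ...
yogurt_rules <- subset(rules, subset = lhs %in% "yogurt")

# ... and, of those, rules with "whole milk" in the consequent (rhs)
inspect(head(subset(yogurt_rules, subset = rhs %in% "whole milk"), 3))
```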

Caution

When analyzing large databases of baskets, you will find a large number of spurious associations. Those occur when items appear together more frequently than expected under independence, but only by chance. Suppose we apply a 5% significance rule, declaring associations "significant" if they appear beyond the 95th percentile of the frequency distribution. A market basket analysis with 1,000,000 two-item sets in which there are no associations whatsoever will still declare 50,000 of them "significant," although none are relevant.

The Apriori algorithm of Agrawal, Imieliński, and Swami (1993) is based on setting a support threshold \(s\) and a confidence threshold \(c\). Multiple passes over the data are made. In the first pass the support of single-item sets is computed and any items with \(S(A) < s\) are discarded. In the second pass the support of two-item sets is computed and any sets with \(S(A,B) < s\) are again discarded. This continues until all candidate item sets in a pass fall below the support threshold \(s\).

Based on the item sets determined so far, only those association rules are displayed for which the confidence exceeds the threshold \(c\).

The Apriori algorithm makes it feasible to find relevant association rules in very large databases. It works because \(S(A,B) \le \min(S(A),S(B))\), in other words the support of any sets found on a subsequent pass cannot exceed the support of the subsets of the previous pass. However, the algorithm also has some disadvantages:

  • It can create large candidate sets of rules and thus lead to many spurious rules.
  • It requires a query engine to filter and summarize candidate item sets.
  • It passes through the database multiple times, once for each cardinality of item sets.
  • It is unlikely to find rules with high confidence and low support because items are filtered first based on the support threshold—only item sets that pass this threshold are considered in computing confidence, which is a measure comparing item sets.
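The level-wise idea behind the frequent-itemset phase can be sketched in a few lines of base R. This is a toy implementation for illustration only, nothing like the optimized arules engine:

```r
# Toy level-wise search for frequent item sets (illustration only).
# X is a binary basket-by-item matrix; s is the support threshold.
frequent_itemsets <- function(X, s) {
  n <- nrow(X)
  support <- function(set) mean(apply(X[, set, drop = FALSE] == 1, 1, all))

  # pass 1: single items meeting the support threshold
  frequent <- as.list(colnames(X)[colSums(X) / n >= s])
  result   <- frequent

  # pass k+1: build candidates only from items surviving pass k,
  # exploiting S(A,B) <= min(S(A), S(B))
  while (length(frequent) > 1) {
    keep <- sort(unique(unlist(frequent)))
    k    <- length(frequent[[1]]) + 1
    if (length(keep) < k) break
    candidates <- combn(keep, k, simplify = FALSE)
    frequent   <- Filter(function(set) support(set) >= s, candidates)
    result     <- c(result, frequent)
  }
  result
}

# Example with the five-basket matrix from the previous section
X <- matrix(c(1,1,1,0,0, 1,0,1,1,0, 0,1,1,1,0, 1,0,0,1,1, 0,1,1,0,1),
            nrow = 5, byrow = TRUE, dimnames = list(NULL, paste0("X", 1:5)))
frequent_itemsets(X, s = 0.6)   # X1, X2, X3, X4, and {X2, X3}
```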

Example Analysis with arules

Example: Grocery Shopping

The apriori function in the arules package in R performs association rule mining based on the Apriori algorithm. The input data to the function is an object of class transactions, which is based on a sparse matrix representation of item occurrences.

We analyze here the Groceries data set that is part of the arules package. This data set contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The data set contains 9835 transactions and the items are aggregated into 169 categories.

library(arules)
data(Groceries)

summary(Groceries)
transactions as itemMatrix in sparse format with
 9835 rows (elements/itemsets/transactions) and
 169 columns (items) and a density of 0.02609146 

most frequent items:
      whole milk other vegetables       rolls/buns             soda 
            2513             1903             1809             1715 
          yogurt          (Other) 
            1372            34055 

element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
  17   18   19   20   21   22   23   24   26   27   28   29   32 
  29   14   14    9   11    4    6    1    1    1    1    3    1 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.000   3.000   4.409   6.000  32.000 

includes extended item information - examples:
       labels  level2           level1
1 frankfurter sausage meat and sausage
2     sausage sausage meat and sausage
3  liver loaf sausage meat and sausage

The data are very sparse, which is typical for these applications. Only 2.6% of the 9835 x 169 possible item–transaction combinations are non-zero. The most frequent item is whole milk; its support is 2513/9835 = 0.255516. There are 2159 transactions containing a single item, 1643 containing two items, and one transaction containing 32 items. The median transaction size is 3.

You can query the raw data for the transactions with the inspect function:

inspect(Groceries[1:3])
    items                 
[1] {citrus fruit,        
     semi-finished bread, 
     margarine,           
     ready soups}         
[2] {tropical fruit,      
     yogurt,              
     coffee}              
[3] {whole milk}          

The first transaction involves four items, the second transaction three items, and the third transaction was a purchase of whole milk alone.

The itemFrequencyPlot function helps to inspect the frequency distribution of the items in the database. The following call requests the 20 items with the highest absolute frequency.

itemFrequencyPlot(Groceries,topN=20,type="absolute",cex.names=0.75)

By default, the apriori function mines for rules with a minimum support of \(s = 0.1\) and a minimum confidence of \(c = 0.8\). The following call changes these values. Based on the summary above, it should find only three rules in the database, based on single-item sets for whole milk, other vegetables, and rolls/buns.

rules <- apriori(Groceries, 
                 parameter=list(support=0.18,conf=0))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
          0    0.1    1 none FALSE            TRUE       5    0.18      1
 maxlen target  ext
     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 1770 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [3 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [3 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
inspect(rules)
    lhs    rhs                support   confidence coverage lift count
[1] {}  => {rolls/buns}       0.1839349 0.1839349  1        1    1809 
[2] {}  => {other vegetables} 0.1934926 0.1934926  1        1    1903 
[3] {}  => {whole milk}       0.2555160 0.2555160  1        1    2513 

As expected, only three rules are found by the algorithm since only three single items exceed the support threshold of 0.18. It would have been possible to find rules based on two-item sets constructed from whole milk, other vegetables, or rolls/buns on the second pass, but none of those combinations had support > 0.18.

The head of the rules is displayed as lhs and the body as rhs, for the left hand side and right hand side of the rule, respectively. The coverage is the support of the lhs, which shows that the empty set {} is to be interpreted as all transactions. That also explains why the confidence of the single-item rules equals their support.

Changing the value for support to a smaller number generates more rules.

rules <- apriori(Groceries, 
                 parameter=list(support=0.001,conf=0.8),
                 control = list(verbose=F))
summary(rules)
set of 410 rules

rule length distribution (lhs + rhs):sizes
  3   4   5   6 
 29 229 140  12 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.000   4.000   4.000   4.329   5.000   6.000 

summary of quality measures:
    support           confidence        coverage             lift       
 Min.   :0.001017   Min.   :0.8000   Min.   :0.001017   Min.   : 3.131  
 1st Qu.:0.001017   1st Qu.:0.8333   1st Qu.:0.001220   1st Qu.: 3.312  
 Median :0.001220   Median :0.8462   Median :0.001322   Median : 3.588  
 Mean   :0.001247   Mean   :0.8663   Mean   :0.001449   Mean   : 3.951  
 3rd Qu.:0.001322   3rd Qu.:0.9091   3rd Qu.:0.001627   3rd Qu.: 4.341  
 Max.   :0.003152   Max.   :1.0000   Max.   :0.003559   Max.   :11.235  
     count      
 Min.   :10.00  
 1st Qu.:10.00  
 Median :12.00  
 Mean   :12.27  
 3rd Qu.:13.00  
 Max.   :31.00  

mining info:
      data ntransactions support confidence
 Groceries          9835   0.001        0.8
                                                                                                  call
 apriori(data = Groceries, parameter = list(support = 0.001, conf = 0.8), control = list(verbose = F))

This query generated 410 rules, four-item rules most frequent among them. Their confidence ranges from 0.8 to 1.0, their lift ranges from 3.131 to 11.235.

The top 10 rules by decreasing lift are

options(digits=5)
knitr::kable(DATAFRAME(rules[1:10]))
LHS RHS support confidence coverage lift count
{liquor,red/blush wine} {bottled beer} 0.00193 0.90476 0.00214 11.2353 19
{curd,cereals} {whole milk} 0.00102 0.90909 0.00112 3.5579 10
{yogurt,cereals} {whole milk} 0.00173 0.80952 0.00214 3.1682 17
{butter,jam} {whole milk} 0.00102 0.83333 0.00122 3.2614 10
{soups,bottled beer} {whole milk} 0.00112 0.91667 0.00122 3.5875 11
{napkins,house keeping products} {whole milk} 0.00132 0.81250 0.00163 3.1798 13
{whipped/sour cream,house keeping products} {whole milk} 0.00122 0.92308 0.00132 3.6126 12
{pastry,sweet spreads} {whole milk} 0.00102 0.90909 0.00112 3.5579 10
{turkey,curd} {other vegetables} 0.00122 0.80000 0.00153 4.1345 12
{rice,sugar} {whole milk} 0.00122 1.00000 0.00122 3.9137 12

The association with the highest lift factor is the purchase of beer when the basket contains liquor and wine. The confidence of this rule is very high, making a beer purchase highly predictable from the presence of liquor and wine.

The top 10 rules by decreasing confidence are

rules <- sort(rules, by="confidence", decreasing=TRUE)
knitr::kable(DATAFRAME(rules[1:10]))
LHS RHS support confidence coverage lift count
10 {rice,sugar} {whole milk} 0.00122 1 0.00122 3.9137 12
16 {canned fish,hygiene articles} {whole milk} 0.00112 1 0.00112 3.9137 11
33 {root vegetables,butter,rice} {whole milk} 0.00102 1 0.00102 3.9137 10
60 {root vegetables,whipped/sour cream,flour} {whole milk} 0.00173 1 0.00173 3.9137 17
62 {butter,soft cheese,domestic eggs} {whole milk} 0.00102 1 0.00102 3.9137 10
68 {citrus fruit,root vegetables,soft cheese} {other vegetables} 0.00102 1 0.00102 5.1682 10
126 {pip fruit,butter,hygiene articles} {whole milk} 0.00102 1 0.00102 3.9137 10
133 {root vegetables,whipped/sour cream,hygiene articles} {whole milk} 0.00102 1 0.00102 3.9137 10
135 {pip fruit,root vegetables,hygiene articles} {whole milk} 0.00102 1 0.00102 3.9137 10
143 {cream cheese ,domestic eggs,sugar} {whole milk} 0.00112 1 0.00112 3.9137 11

You can limit the maximum length of rules generated with the maxlen= parameter. Setting the maximum length to 3 reduces the number of generated rules to 29.

# setting max length of rules to 3
rules <- apriori(Groceries, 
                 parameter = list(supp=0.001, 
                                  conf=0.8,
                                  maxlen=3),
                 control = list(verbose=F))
summary(rules)
set of 29 rules

rule length distribution (lhs + rhs):sizes
 3 
29 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      3       3       3       3       3       3 

summary of quality measures:
    support          confidence       coverage            lift      
 Min.   :0.00102   Min.   :0.800   Min.   :0.00112   Min.   : 3.13  
 1st Qu.:0.00112   1st Qu.:0.812   1st Qu.:0.00122   1st Qu.: 3.26  
 Median :0.00122   Median :0.846   Median :0.00153   Median : 3.61  
 Mean   :0.00147   Mean   :0.861   Mean   :0.00173   Mean   : 4.00  
 3rd Qu.:0.00173   3rd Qu.:0.909   3rd Qu.:0.00214   3rd Qu.: 4.20  
 Max.   :0.00254   Max.   :1.000   Max.   :0.00315   Max.   :11.24  
     count     
 Min.   :10.0  
 1st Qu.:11.0  
 Median :12.0  
 Mean   :14.5  
 3rd Qu.:17.0  
 Max.   :25.0  

mining info:
      data ntransactions support confidence
 Groceries          9835   0.001        0.8
                                                                                                           call
 apriori(data = Groceries, parameter = list(supp = 0.001, conf = 0.8, maxlen = 3), control = list(verbose = F))

An application of the query engine is to target items. For example, if folks are buying whole milk, what are they also buying? Report the 5 associations with the highest lift:

rules <- apriori(data=Groceries,
                 parameter=list(supp=0.001,conf=0.15,maxlen=3), 
                 appearance=list(lhs="whole milk",default="rhs"),
                 control = list(verbose=F))
rules <- sort(rules, decreasing=TRUE,by="lift")
knitr::kable(DATAFRAME(rules[1:5]))
LHS RHS support confidence coverage lift count
5 {whole milk} {root vegetables} 0.04891 0.19140 0.25552 1.7560 481
4 {whole milk} {tropical fruit} 0.04230 0.16554 0.25552 1.5776 416
7 {whole milk} {yogurt} 0.05602 0.21926 0.25552 1.5717 551
9 {whole milk} {other vegetables} 0.07483 0.29288 0.25552 1.5136 736
8 {whole milk} {rolls/buns} 0.05663 0.22165 0.25552 1.2050 557

arulesViz::plot provides specialized plotting methods for association rules. method="scatterplot" requires two quality measures and shades the rules by a third measure (lift is the default). method="matrix" arranges the rules by antecedent (head, lhs) and consequent (body, rhs).

library(arulesViz)
rules <- apriori(Groceries, 
                 parameter=list(support=0.001,conf=0.5),
                 control = list(verbose=F))
plot(rules,
     method="scatterplot",
     measure=c("support","confidence"),
     shading="lift")