The document discusses association rule mining. It defines frequent itemsets as itemsets whose support is greater than or equal to a minimum support threshold, and association rules as implications of the form X → Y, where X and Y are disjoint itemsets, evaluated with support and confidence. Mining proceeds in two steps: frequent itemset generation followed by rule generation. The Apriori algorithm is introduced for frequent itemset generation; it prunes the exponential search space using the anti-monotone property of support.
Association Rule Mining
• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Example of Association Rules:
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset
• An itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g., σ({Milk, Bread, Diaper}) = 2
• Support
– Fraction of transactions that contain an itemset
– E.g., s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than or
equal to a minsup threshold
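To make these definitions concrete, here is a minimal Python sketch (the data layout and function names are my own choices, not from the slides) that computes the support count σ and the support s of an itemset over the five market-basket transactions above.

```python
# Minimal sketch: support count (sigma) and support (s) of an itemset
# over the five market-basket transactions above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions containing X."""
    return support_count(itemset, transactions) / len(transactions)

X = {"Milk", "Bread", "Diaper"}
print(support_count(X, transactions))   # 2
print(support(X, transactions))         # 0.4  (i.e., 2/5)
```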
Definition: Association Rule
Association Rule
– An implication expression of the form X → Y (X tends to Y), where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅
– Example: {Milk, Diaper} → {Beer}
Rule Evaluation Metrics
– Support (s)
Fraction of transactions that contain both X and Y:
s = σ(X ∪ Y) / |T|
– Confidence (c)
Measures how often items in Y appear in transactions that contain X:
c = σ(X ∪ Y) / σ(X)
Example: for the rule {Milk, Diaper} → {Beer},
s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
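The two metrics can be computed directly from the transaction data. The sketch below (helper names are assumptions of this illustration) evaluates the rule {Milk, Diaper} → {Beer} and reproduces s = 0.4 and c ≈ 0.67.

```python
# Sketch: support and confidence of the rule {Milk, Diaper} -> {Beer}.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(X, Y):
    # s(X -> Y) = sigma(X u Y) / |T|,   c(X -> Y) = sigma(X u Y) / sigma(X)
    both = sigma(X | Y)
    return both / len(transactions), both / sigma(X)

s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"})
print(f"s = {s:.2f}, c = {c:.2f}")   # s = 0.40, c = 0.67
```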
Why Use Support and Confidence?
• Support is an important measure because a rule that has
very low support may occur simply by chance.
• A low support rule is also likely to be uninteresting from a
business perspective because it may not be profitable to
promote items that customers seldom buy together.
• Confidence, on the other hand, measures the reliability of
the inference made by a rule.
• For a given rule X → Y, the higher the confidence, the more
likely it is for Y to be present in transactions that contain X.
• Confidence also provides an estimate of the conditional
probability of Y given X.
Association Rule Mining Task
• Given a set of transactions T, the goal of association
rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
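A literal rendering of the brute-force procedure makes the cost visible. The sketch below (thresholds and helper names are assumed for the demo, not taken from the slides) enumerates every candidate rule X → Y over the six items of the running example, computes s and c for each, and then prunes; it touches all 602 candidate rules.

```python
# Sketch of the brute-force approach: enumerate every candidate rule X -> Y,
# compute s and c for each, then prune by minsup / minconf.
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))
minsup, minconf = 0.4, 0.6   # assumed thresholds for the demo

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

rules, candidates = [], 0
for k in range(2, len(items) + 1):          # every itemset Z with |Z| >= 2
    for Z in combinations(items, k):
        Z = frozenset(Z)
        for i in range(1, len(Z)):          # every binary split of Z into X -> Y
            for X in combinations(Z, i):
                candidates += 1
                X = frozenset(X)
                s = sigma(Z) / len(transactions)
                c = sigma(Z) / sigma(X) if sigma(X) else 0.0
                if s >= minsup and c >= minconf:
                    rules.append((set(X), set(Z - X), s, c))

print(candidates)    # 602 candidate rules examined for these d = 6 items
print(len(rules))    # rules that pass both thresholds
```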
Mining Association Rules
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset (a small sketch of this step follows below)
• Frequent itemset generation is still computationally expensive
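Step 2 (rule generation) can be sketched as follows, assuming the frequent itemset is already known: every binary partition of the itemset is a candidate rule, and only partitions meeting an assumed confidence threshold are kept. Run on {Milk, Diaper, Beer}, it reproduces the high-confidence rules listed earlier. The data layout and helper names are my own choices.

```python
# Sketch: rule generation from one frequent itemset by binary partitioning.
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

def rules_from_itemset(Z, minconf):
    Z = frozenset(Z)
    s = sigma(Z) / len(transactions)          # identical support for every partition
    for i in range(1, len(Z)):
        for X in combinations(sorted(Z), i):
            X = frozenset(X)
            c = sigma(Z) / sigma(X)
            if c >= minconf:
                yield set(X), set(Z - X), s, c

# Reproduces the high-confidence rules (c >= 0.6) of the earlier example.
for X, Y, s, c in rules_from_itemset({"Milk", "Diaper", "Beer"}, minconf=0.6):
    print(f"{X} -> {Y}  (s={s:.2f}, c={c:.2f})")
```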
Frequent Itemset Generation
Itemset lattice for five items {A, B, C, D, E}:
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
Given d items, there are 2^d possible candidate itemsets
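As a quick check of this count (a tiny sketch, not part of the slides), enumerating every subset of d = 5 items with itertools yields 2^5 = 32 candidate itemsets, including the empty set.

```python
# Tiny check: the itemset lattice over d items has 2^d members (empty set included).
from itertools import chain, combinations

items = ["A", "B", "C", "D", "E"]
lattice = list(chain.from_iterable(combinations(items, k)
                                   for k in range(len(items) + 1)))
print(len(lattice), 2 ** len(items))   # 32 32
```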
Frequent Itemset Generation
• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
– Match each transaction against every candidate
– Complexity ~ O(NMw) ⇒ expensive since M = 2^d !!!
[Figure: each of the N transactions (of width w) is matched against the list of M candidate itemsets]
Computational Complexity
• Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:
123 1
1
1 1
dd
d
k
kd
j
j
kd
k
d
R
If d=6, R = 602 rules
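The closed form can be sanity-checked numerically; this short sketch evaluates the double sum directly and compares it with 3^d - 2^(d+1) + 1 for d = 6.

```python
# Check: the double sum over rule splits equals 3^d - 2^(d+1) + 1.
from math import comb

d = 6
R = sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
        for k in range(1, d))
print(R, 3 ** d - 2 ** (d + 1) + 1)   # 602 602
```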
Frequent Itemset Generation Strategies
• Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M
• Reduce the number of transactions (N)
– Reduce size of N as the size of itemset increases
– Used by DHP and vertical-based mining algorithms
• Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or
transactions
– No need to match every candidate against every
transaction
Monotonicity Property
• A measure s is monotone (or upward closed) if
∀X, Y : (X ⊆ Y) ⇒ s(X) ≤ s(Y),
which means that if X is a subset of Y, then s(X) must not exceed s(Y).
• On the other hand, s is anti-monotone (or downward closed) if
∀X, Y : (X ⊆ Y) ⇒ s(Y) ≤ s(X),
which means that if X is a subset of Y, then s(Y) must not exceed s(X).
• Any measure that possesses an anti-monotone property can be incorporated directly into the mining algorithm to effectively prune the exponential search space of candidate itemsets.
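The anti-monotone property of support can be verified empirically on the market-basket data: the sketch below (helper names are my own) checks every pair of itemsets with X ⊂ Y and finds no case where s(Y) exceeds s(X).

```python
# Sketch: exhaustively confirm on the example data that X ⊆ Y implies s(Y) <= s(X).
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))

def s(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

all_itemsets = [frozenset(c) for k in range(1, len(items) + 1)
                for c in combinations(items, k)]
violations = [(X, Y) for X in all_itemsets for Y in all_itemsets
              if X < Y and s(Y) > s(X)]
print(len(violations))   # 0: adding items to an itemset never increases its support
```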
Reducing Number of Candidates
• Apriori principle:
– If an itemset is frequent, then all of its subsets must also be
frequent
– Support of an itemset never exceeds the support of its
subsets
– This is known as the anti-monotone property of support
Illustrating Apriori Principle
Minimum Support = 3

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset           Count
{Bread, Milk}     3
{Bread, Beer}     2
{Bread, Diaper}   3
{Milk, Beer}      2
{Milk, Diaper}    3
{Beer, Diaper}    3

Triplets (3-itemsets):
Itemset                  Count
{Bread, Milk, Diaper}    2

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidate itemsets.
With support-based pruning: 6 + 6 + 1 = 13 candidate itemsets.
Apriori Algorithm
• Method:
– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k
frequent itemsets
• Prune candidate itemsets containing subsets of length k that
are infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those
that are frequent
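A compact Python sketch of this loop is given below. The candidate-generation strategy (joining frequent k-itemsets that share their first k-1 items, then pruning by the Apriori principle) and all names are choices of this sketch rather than prescriptions from the slides; it is run on the market-basket data with a minimum support count of 3, as in the illustration above.

```python
# Minimal sketch of the Apriori loop above, run on the market-basket data
# with a minimum support count of 3 (as in the pruning illustration).
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
min_count = 3

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def generate_candidates(frequent_k, k):
    """Join frequent k-itemsets sharing their first k-1 items, then apply
    the Apriori prune: drop candidates with an infrequent k-subset."""
    as_tuples = sorted(tuple(sorted(f)) for f in frequent_k)
    frequent_set = {frozenset(f) for f in as_tuples}
    candidates = set()
    for i in range(len(as_tuples)):
        for j in range(i + 1, len(as_tuples)):
            a, b = as_tuples[i], as_tuples[j]
            if a[:k - 1] == b[:k - 1]:                      # join condition
                cand = frozenset(a) | frozenset(b)
                if all(frozenset(sub) in frequent_set       # prune condition
                       for sub in combinations(cand, k)):
                    candidates.add(cand)
    return candidates

items = set().union(*transactions)
frequent = {frozenset({i}) for i in items
            if support_count(frozenset({i})) >= min_count}
k = 1
while frequent:
    print(k, sorted(sorted(f) for f in frequent))
    candidates = generate_candidates(frequent, k)
    k += 1
    frequent = {c for c in candidates if support_count(c) >= min_count}
# Output: the frequent 1- and 2-itemsets; the lone 3-itemset candidate
# {Bread, Milk, Diaper} has support count 2 (< 3) and is eliminated.
```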
Apriori Algorithm – Example 2
• Consider the following transaction database:
• min_sup=2
Apriori Algorithm – Example 2 – Iteration 1
• In the first iteration of the algorithm, each item is a
member of the set of candidate 1-itemsets, C1. The
algorithm simply scans all of the transactions to count
the number of occurrences of each item.
Apriori Algorithm – Example 2 – Iteration 2
• To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate a candidate set of 2-itemsets, C2.
• C2 consists of $\binom{|L_1|}{2}$ 2-itemsets.
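For this step, a small sketch (assuming, as in this example, that all five items I1 through I5 are frequent and form L1) shows that the join L1 ⋈ L1 simply produces every unordered pair of frequent items.

```python
# Sketch of the join L1 ⋈ L1: C2 is every unordered pair of frequent items,
# so |C2| = C(|L1|, 2).  L1 = {I1, ..., I5} is assumed here.
from itertools import combinations
from math import comb

L1 = ["I1", "I2", "I3", "I4", "I5"]
C2 = [set(pair) for pair in combinations(L1, 2)]
print(len(C2), comb(len(L1), 2))   # 10 10
```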
Apriori Algorithm – Example 2 – Iteration 3
• The algorithm generates the set of candidate 3-itemsets, C3, from L2 ⋈ L2.
Apriori Algorithm – Example 2 – Iteration 4
• The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4.
• Although the join results in {{I1, I2, I3, I5}}, itemset {I1, I2, I3, I5} is pruned because its subset {I2, I3, I5} is not frequent.
• Thus, C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets.
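The prune step at work here can be sketched as follows. The contents of L3 are taken from this example (its two 3-itemsets must share the prefix {I1, I2} for the join to produce {I1, I2, I3, I5}); the helper name is my own.

```python
# Sketch of the Apriori prune step in iteration 4.
from itertools import combinations

# The two frequent 3-itemsets in this example (they share the prefix {I1, I2}).
L3 = {frozenset({"I1", "I2", "I3"}), frozenset({"I1", "I2", "I5"})}

def has_infrequent_subset(candidate, frequent_prev):
    """True if some (k-1)-subset of the k-itemset candidate is not frequent."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(candidate, k - 1))

candidate = frozenset({"I1", "I2", "I3", "I5"})       # result of L3 ⋈ L3
print(has_infrequent_subset(candidate, L3))            # True: {I2, I3, I5} is not in L3
# The candidate is pruned, so C4 is empty and the algorithm terminates.
```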
Reducing Number of Comparisons
• Candidate counting:
– Scan the database of transactions to determine the support
of each candidate itemset
– To reduce the number of comparisons, store the candidates
in a hash structure
• Instead of matching each transaction against every candidate,
match it against candidates contained in the hashed buckets
[Figure: the N transactions are matched against a hash structure of k buckets holding the candidate itemsets]
Generate Hash Tree
[Figure: hash tree over the 15 candidate 3-itemsets listed below; the hash function sends items 1, 4, 7 to one branch, items 2, 5, 8 to a second, and items 3, 6, 9 to a third]
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
• Hash function
• Max leaf size: max number of itemsets stored in a leaf node (if number of candidate
itemsets exceeds max leaf size, split the node)
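A minimal sketch of the construction is shown below. The hash function (item mod 3, which groups items 1, 4, 7 / 2, 5, 8 / 3, 6, 9 exactly as in the figure) and the maximum leaf size of 3 are assumptions of this sketch, not requirements from the slides. Leaves that overflow are split into child buckets keyed on the next item of the itemset.

```python
# Sketch: hash tree over the 15 candidate 3-itemsets, with assumed parameters.
CANDIDATES = [
    (1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
    (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
    (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8),
]
MAX_LEAF = 3      # assumed max leaf size
NUM_BUCKETS = 3   # h(item) = item % 3 groups items {1,4,7}, {2,5,8}, {3,6,9}

def insert(node, itemset, depth=0):
    """Insert an itemset; a leaf (list) that overflows becomes an internal
    node (dict of buckets) that hashes on the item at position `depth`."""
    if isinstance(node, list):                       # leaf node
        node.append(itemset)
        if len(node) > MAX_LEAF and depth < len(itemset):
            children = {b: [] for b in range(NUM_BUCKETS)}
            for it in node:
                route(children, it, depth)
            return children                          # leaf has been split
        return node
    route(node, itemset, depth)                      # internal node
    return node

def route(children, itemset, depth):
    b = itemset[depth] % NUM_BUCKETS
    children[b] = insert(children[b], itemset, depth + 1)

tree = []                                            # start from a single leaf
for c in CANDIDATES:
    tree = insert(tree, c)
print(tree)
```

With such a tree in place, a transaction is matched only against the candidates stored in the leaves its own items can reach, rather than against all 15 candidates.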