The document discusses association rule mining. It defines frequent itemsets as itemsets whose support is greater than or equal to a minimum support threshold, and association rules as implications of the form X → Y, where X and Y are disjoint itemsets, evaluated with support and confidence. Mining proceeds in two steps: frequent itemset generation followed by rule generation. The Apriori algorithm is introduced for frequent itemset generation; it prunes the exponential search space using the anti-monotone property of support.
Association Rule Mining
• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Example of Association Rules:
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset
• An itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g., σ({Milk, Bread, Diaper}) = 2
• Support
– Fraction of transactions that contain an itemset
– E.g., s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than or
equal to a minsup threshold
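To make these definitions concrete, here is a minimal Python sketch (the data layout and function names are my own choices, not from the slides) that computes the support count σ and the support s of an itemset over the five market-basket transactions above.

```python
# Minimal sketch: support count (sigma) and support (s) of an itemset
# over the five market-basket transactions above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions containing X."""
    return support_count(itemset, transactions) / len(transactions)

X = {"Milk", "Bread", "Diaper"}
print(support_count(X, transactions))   # 2
print(support(X, transactions))         # 0.4  (i.e., 2/5)
```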
Definition: Association Rule
Association Rule
– An implication expression of the form X → Y (X tends to Y), where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅
– Example: {Milk, Diaper} → {Beer}
Rule Evaluation Metrics
– Support (s)
Fraction of transactions that contain both X and Y:
s = σ(X ∪ Y) / |T|
– Confidence (c)
Measures how often items in Y appear in transactions that contain X:
c = σ(X ∪ Y) / σ(X)
Example: for the rule {Milk, Diaper} → {Beer},
s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
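The two metrics can be computed directly from the transaction data. The sketch below (helper names are assumptions of this illustration) evaluates the rule {Milk, Diaper} → {Beer} and reproduces s = 0.4 and c ≈ 0.67.

```python
# Sketch: support and confidence of the rule {Milk, Diaper} -> {Beer}.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(X, Y):
    # s(X -> Y) = sigma(X u Y) / |T|,   c(X -> Y) = sigma(X u Y) / sigma(X)
    both = sigma(X | Y)
    return both / len(transactions), both / sigma(X)

s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"})
print(f"s = {s:.2f}, c = {c:.2f}")   # s = 0.40, c = 0.67
```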
Why Use Support and Confidence?
• Support is an important measure because a rule that has
very low support may occur simply by chance.
• A low support rule is also likely to be uninteresting from a
business perspective because it may not be profitable to
promote items that customers seldom buy together.
• Confidence, on the other hand, measures the reliability of
the inference made by a rule.
• For a given rule X → Y, the higher the confidence, the more
likely it is for Y to be present in transactions that contain X.
• Confidence also provides an estimate of the conditional
probability of Y given X.
Association Rule Mining Task
• Given a set of transactions T, the goal of association
rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
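A literal rendering of the brute-force procedure makes the cost visible. The sketch below (thresholds and helper names are assumed for the demo, not taken from the slides) enumerates every candidate rule X → Y over the six items of the running example, computes s and c for each, and then prunes; it touches all 602 candidate rules.

```python
# Sketch of the brute-force approach: enumerate every candidate rule X -> Y,
# compute s and c for each, then prune by minsup / minconf.
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))
minsup, minconf = 0.4, 0.6   # assumed thresholds for the demo

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

rules, candidates = [], 0
for k in range(2, len(items) + 1):          # every itemset Z with |Z| >= 2
    for Z in combinations(items, k):
        Z = frozenset(Z)
        for i in range(1, len(Z)):          # every binary split of Z into X -> Y
            for X in combinations(Z, i):
                candidates += 1
                X = frozenset(X)
                s = sigma(Z) / len(transactions)
                c = sigma(Z) / sigma(X) if sigma(X) else 0.0
                if s >= minsup and c >= minconf:
                    rules.append((set(X), set(Z - X), s, c))

print(candidates)    # 602 candidate rules examined for these d = 6 items
print(len(rules))    # rules that pass both thresholds
```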
Mining Association Rules
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset (a small sketch of this step follows below)
• Frequent itemset generation is still computationally expensive
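Step 2 (rule generation) can be sketched as follows, assuming the frequent itemset is already known: every binary partition of the itemset is a candidate rule, and only partitions meeting an assumed confidence threshold are kept. Run on {Milk, Diaper, Beer}, it reproduces the high-confidence rules listed earlier. The data layout and helper names are my own choices.

```python
# Sketch: rule generation from one frequent itemset by binary partitioning.
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

def rules_from_itemset(Z, minconf):
    Z = frozenset(Z)
    s = sigma(Z) / len(transactions)          # identical support for every partition
    for i in range(1, len(Z)):
        for X in combinations(sorted(Z), i):
            X = frozenset(X)
            c = sigma(Z) / sigma(X)
            if c >= minconf:
                yield set(X), set(Z - X), s, c

# Reproduces the high-confidence rules (c >= 0.6) of the earlier example.
for X, Y, s, c in rules_from_itemset({"Milk", "Diaper", "Beer"}, minconf=0.6):
    print(f"{X} -> {Y}  (s={s:.2f}, c={c:.2f})")
```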
Frequent Itemset Generation
Itemset lattice for five items {A, B, C, D, E}:
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
Given d items, there are 2^d possible candidate itemsets
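As a quick check of this count (a tiny sketch, not part of the slides), enumerating every subset of d = 5 items with itertools yields 2^5 = 32 candidate itemsets, including the empty set.

```python
# Tiny check: the itemset lattice over d items has 2^d members (empty set included).
from itertools import chain, combinations

items = ["A", "B", "C", "D", "E"]
lattice = list(chain.from_iterable(combinations(items, k)
                                   for k in range(len(items) + 1)))
print(len(lattice), 2 ** len(items))   # 32 32
```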
Frequent Itemset Generation
• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
– Match each transaction against every candidate
– Complexity ~ O(NMw) ⇒ expensive since M = 2^d !!!
[Figure: each of the N transactions (of width w) is matched against the list of M candidate itemsets]
Computational Complexity
• Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:
123 1
1
1 1
dd
d
k
kd
j
j
kd
k
d
R
If d=6, R = 602 rules
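The closed form can be sanity-checked numerically; this short sketch evaluates the double sum directly and compares it with 3^d - 2^(d+1) + 1 for d = 6.

```python
# Check: the double sum over rule splits equals 3^d - 2^(d+1) + 1.
from math import comb

d = 6
R = sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
        for k in range(1, d))
print(R, 3 ** d - 2 ** (d + 1) + 1)   # 602 602
```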
Frequent Itemset Generation Strategies
• Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M
• Reduce the number of transactions (N)
– Reduce size of N as the size of itemset increases
– Used by DHP and vertical-based mining algorithms
• Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or
transactions
– No need to match every candidate against every
transaction
Monotonicity Property
• A measure s is monotone (or upward closed) if
∀X, Y : (X ⊆ Y) ⇒ s(X) ≤ s(Y),
which means that if X is a subset of Y, then s(X) must not exceed s(Y).
• On the other hand, s is anti-monotone (or downward closed) if
∀X, Y : (X ⊆ Y) ⇒ s(Y) ≤ s(X),
which means that if X is a subset of Y, then s(Y) must not exceed s(X).
• Any measure that possesses an anti-monotone property can be incorporated directly into the mining algorithm to effectively prune the exponential search space of candidate itemsets.
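The anti-monotone property of support can be verified empirically on the market-basket data: the sketch below (helper names are my own) checks every pair of itemsets with X ⊂ Y and finds no case where s(Y) exceeds s(X).

```python
# Sketch: exhaustively confirm on the example data that X ⊆ Y implies s(Y) <= s(X).
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))

def s(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

all_itemsets = [frozenset(c) for k in range(1, len(items) + 1)
                for c in combinations(items, k)]
violations = [(X, Y) for X in all_itemsets for Y in all_itemsets
              if X < Y and s(Y) > s(X)]
print(len(violations))   # 0: adding items to an itemset never increases its support
```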
Reducing Number of Candidates
• Apriori principle:
– If an itemset is frequent, then all of its subsets must also be
frequent
– Support of an itemset never exceeds the support of its
subsets
– This is known as the anti-monotone property of support
Illustrating Apriori Principle
Minimum Support = 3

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset           Count
{Bread, Milk}     3
{Bread, Beer}     2
{Bread, Diaper}   3
{Milk, Beer}      2
{Milk, Diaper}    3
{Beer, Diaper}    3

Triplets (3-itemsets):
Itemset                  Count
{Bread, Milk, Diaper}    2

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidate itemsets.
With support-based pruning: 6 + 6 + 1 = 13 candidate itemsets.
Apriori Algorithm
• Method:
– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k
frequent itemsets
• Prune candidate itemsets containing subsets of length k that
are infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those
that are frequent
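A compact Python sketch of this loop is given below. The candidate-generation strategy (joining frequent k-itemsets that share their first k-1 items, then pruning by the Apriori principle) and all names are choices of this sketch rather than prescriptions from the slides; it is run on the market-basket data with a minimum support count of 3, as in the illustration above.

```python
# Minimal sketch of the Apriori loop above, run on the market-basket data
# with a minimum support count of 3 (as in the pruning illustration).
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
min_count = 3

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def generate_candidates(frequent_k, k):
    """Join frequent k-itemsets sharing their first k-1 items, then apply
    the Apriori prune: drop candidates with an infrequent k-subset."""
    as_tuples = sorted(tuple(sorted(f)) for f in frequent_k)
    frequent_set = {frozenset(f) for f in as_tuples}
    candidates = set()
    for i in range(len(as_tuples)):
        for j in range(i + 1, len(as_tuples)):
            a, b = as_tuples[i], as_tuples[j]
            if a[:k - 1] == b[:k - 1]:                      # join condition
                cand = frozenset(a) | frozenset(b)
                if all(frozenset(sub) in frequent_set       # prune condition
                       for sub in combinations(cand, k)):
                    candidates.add(cand)
    return candidates

items = set().union(*transactions)
frequent = {frozenset({i}) for i in items
            if support_count(frozenset({i})) >= min_count}
k = 1
while frequent:
    print(k, sorted(sorted(f) for f in frequent))
    candidates = generate_candidates(frequent, k)
    k += 1
    frequent = {c for c in candidates if support_count(c) >= min_count}
# Output: the frequent 1- and 2-itemsets; the lone 3-itemset candidate
# {Bread, Milk, Diaper} has support count 2 (< 3) and is eliminated.
```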
Apriori Algorithm – Example 2
• Consider the following transaction database:
• min_sup=2
Apriori Algorithm – Example 2 – Iteration 1
• In the first iteration of the algorithm, each item is a
member of the set of candidate 1-itemsets, C1. The
algorithm simply scans all of the transactions to count
the number of occurrences of each item.
Apriori Algorithm – Example 2 – Iteration 2
• To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate a candidate set of 2-itemsets, C2.
• C2 consists of $\binom{|L_1|}{2}$ 2-itemsets.
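For this step, a small sketch (assuming, as in this example, that all five items I1 through I5 are frequent and form L1) shows that the join L1 ⋈ L1 simply produces every unordered pair of frequent items.

```python
# Sketch of the join L1 ⋈ L1: C2 is every unordered pair of frequent items,
# so |C2| = C(|L1|, 2).  L1 = {I1, ..., I5} is assumed here.
from itertools import combinations
from math import comb

L1 = ["I1", "I2", "I3", "I4", "I5"]
C2 = [set(pair) for pair in combinations(L1, 2)]
print(len(C2), comb(len(L1), 2))   # 10 10
```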
Apriori Algorithm – Example 2 – Iteration 3
• The algorithm generates the set of candidate 3-itemsets, C3, from L2 ⋈ L2.
Apriori Algorithm – Example 2 – Iteration 4
• The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4.
• Although the join results in {{I1, I2, I3, I5}}, itemset {I1, I2, I3, I5} is pruned because its subset {I2, I3, I5} is not frequent.
• Thus, C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets.
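The prune step at work here can be sketched as follows. The contents of L3 are taken from this example (its two 3-itemsets must share the prefix {I1, I2} for the join to produce {I1, I2, I3, I5}); the helper name is my own.

```python
# Sketch of the Apriori prune step in iteration 4.
from itertools import combinations

# The two frequent 3-itemsets in this example (they share the prefix {I1, I2}).
L3 = {frozenset({"I1", "I2", "I3"}), frozenset({"I1", "I2", "I5"})}

def has_infrequent_subset(candidate, frequent_prev):
    """True if some (k-1)-subset of the k-itemset candidate is not frequent."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(candidate, k - 1))

candidate = frozenset({"I1", "I2", "I3", "I5"})       # result of L3 ⋈ L3
print(has_infrequent_subset(candidate, L3))            # True: {I2, I3, I5} is not in L3
# The candidate is pruned, so C4 is empty and the algorithm terminates.
```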
Reducing Number of Comparisons
• Candidate counting:
– Scan the database of transactions to determine the support
of each candidate itemset
– To reduce the number of comparisons, store the candidates
in a hash structure
• Instead of matching each transaction against every candidate,
match it against candidates contained in the hashed buckets
[Figure: the N transactions are matched against a hash structure of k buckets holding the candidate itemsets]
Generate Hash Tree
[Figure: hash tree over the 15 candidate 3-itemsets listed below; the hash function sends items 1, 4, 7 to one branch, items 2, 5, 8 to a second, and items 3, 6, 9 to a third]
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
• Hash function
• Max leaf size: max number of itemsets stored in a leaf node (if number of candidate
itemsets exceeds max leaf size, split the node)
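A minimal sketch of the construction is shown below. The hash function (item mod 3, which groups items 1, 4, 7 / 2, 5, 8 / 3, 6, 9 exactly as in the figure) and the maximum leaf size of 3 are assumptions of this sketch, not requirements from the slides. Leaves that overflow are split into child buckets keyed on the next item of the itemset.

```python
# Sketch: hash tree over the 15 candidate 3-itemsets, with assumed parameters.
CANDIDATES = [
    (1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
    (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
    (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8),
]
MAX_LEAF = 3      # assumed max leaf size
NUM_BUCKETS = 3   # h(item) = item % 3 groups items {1,4,7}, {2,5,8}, {3,6,9}

def insert(node, itemset, depth=0):
    """Insert an itemset; a leaf (list) that overflows becomes an internal
    node (dict of buckets) that hashes on the item at position `depth`."""
    if isinstance(node, list):                       # leaf node
        node.append(itemset)
        if len(node) > MAX_LEAF and depth < len(itemset):
            children = {b: [] for b in range(NUM_BUCKETS)}
            for it in node:
                route(children, it, depth)
            return children                          # leaf has been split
        return node
    route(node, itemset, depth)                      # internal node
    return node

def route(children, itemset, depth):
    b = itemset[depth] % NUM_BUCKETS
    children[b] = insert(children[b], itemset, depth + 1)

tree = []                                            # start from a single leaf
for c in CANDIDATES:
    tree = insert(tree, c)
print(tree)
```

With such a tree in place, a transaction is matched only against the candidates stored in the leaves its own items can reach, rather than against all 15 candidates.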