DATA MINING -
LECTURE 2
Dr. Mahmoud Mounir
mahmoud.mounir@cis.asu.edu.eg
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 6 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
2
Chapter 6: Mining Frequent Patterns, Associations
and Correlations: Basic Concepts and Methods
◼ Basic Concepts
◼ Frequent Itemset Mining Methods
◼ Which Patterns Are Interesting?—Pattern
Evaluation Methods
◼ Summary
3
What Is Frequent Pattern Analysis?
◼ Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
◼ First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
◼ Motivation: Finding inherent regularities in data
◼ What products were often purchased together?—Juice and diapers?!
◼ What are the subsequent purchases after buying a PC?
◼ What kinds of DNA are sensitive to this new drug?
◼ Can we automatically classify web documents?
◼ Applications
◼ Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
4
Why Is Freq. Pattern Mining Important?
◼ Freq. pattern: An intrinsic and important property of
datasets
◼ Foundation for many essential data mining tasks
◼ Association, correlation, and causality analysis
◼ Sequential, structural (e.g., sub-graph) patterns
◼ Pattern analysis in spatiotemporal, multimedia, time-
series, and stream data
◼ Classification: discriminative, frequent pattern analysis
◼ Cluster analysis: frequent pattern-based clustering
◼ Data warehousing: iceberg cube and cube-gradient
◼ Semantic data compression: fascicles
◼ Broad applications
5
Basic Concepts: Frequent Patterns
Tid   Items bought
10    Juice, Nuts, Diaper
20    Juice, Coffee, Diaper
30    Juice, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk
[Figure: Venn diagram of customers who buy juice, diapers, or both]
◼ itemset: A set of one or more items
◼ k-itemset X = {x1, …, xk}
◼ (absolute) support, or support count, of X: frequency or number of occurrences of an itemset X
◼ (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
◼ An itemset X is frequent if X’s support is no less than a minsup threshold
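To make these definitions concrete, a minimal Python sketch (illustrative, not from the original slides) computing absolute and relative support over the table above:

# The five transactions from the table above
transactions = [
    {"Juice", "Nuts", "Diaper"},
    {"Juice", "Coffee", "Diaper"},
    {"Juice", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support_count(X, transactions):
    """Absolute support: number of transactions containing itemset X."""
    return sum(1 for t in transactions if X <= t)

def support(X, transactions):
    """Relative support: fraction of transactions containing X."""
    return support_count(X, transactions) / len(transactions)

X = {"Juice", "Diaper"}
print(support_count(X, transactions))   # 3
print(support(X, transactions))         # 0.6, frequent for any minsup <= 60%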
6
Basic Concepts: Association Rules
Tid   Items bought
10    Juice, Nuts, Diaper
20    Juice, Coffee, Diaper
30    Juice, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk
[Figure: Venn diagram of customers who buy juice, diapers, or both]
◼ Find all the rules X → Y with minimum support and confidence
◼ support, s: probability that a transaction contains X ∪ Y
◼ confidence, c: conditional probability that a transaction having X also contains Y
◼ Let minsup = 50%, minconf = 50%
◼ Freq. Pat.: {Juice}:3, {Nuts}:3, {Diaper}:4, {Eggs}:3, {Juice, Diaper}:3
◼ Association rules: (many more!)
◼ Juice → Diaper (60%, 100%)
◼ Diaper → Juice (60%, 75%)
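Continuing the previous sketch (same transactions list and support_count helper), the two example rules can be verified directly; the confidence of X → Y is support_count(X ∪ Y) / support_count(X):

def confidence(X, Y, transactions):
    """conf(X -> Y) = P(Y | X) = sup(X ∪ Y) / sup(X)."""
    return support_count(X | Y, transactions) / support_count(X, transactions)

# Juice -> Diaper: support 3/5 = 60%, confidence 3/3 = 100%
print(confidence({"Juice"}, {"Diaper"}, transactions))   # 1.0
# Diaper -> Juice: support 3/5 = 60%, confidence 3/4 = 75%
print(confidence({"Diaper"}, {"Juice"}, transactions))   # 0.75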
7
Closed Patterns and Max-Patterns
◼ An itemset X is closed if X is frequent and there exists no super-itemset Y ⊃ X with the same support as X
◼ An itemset X is a max-pattern if X is frequent and there exists no frequent super-itemset Y ⊃ X
◼ Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
◼ Min_sup = 1.
◼ What is the set of closed itemsets?
◼ <a1, …, a100>: 1
◼ <a1, …, a50>: 2
◼ What is the set of max-patterns?
◼ <a1, …, a100>: 1
◼ What is the set of all frequent patterns?
◼ All nonempty subsets of {a1, …, a100}: 2^100 − 1 patterns, far too many to enumerate!
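For intuition, a brute-force sketch (my own illustration, on a tiny 3-item analog of the exercise DB) that computes closed and max-patterns directly from the definitions; the exponential enumeration it relies on is exactly what makes this infeasible at 100 items:

from itertools import combinations

db = [frozenset("abc"), frozenset("ab")]   # tiny analog of the exercise DB
min_sup = 1
items = sorted(set().union(*db))

def sup(X):
    """Absolute support of itemset X in db."""
    return sum(1 for t in db if X <= t)

frequent = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if sup(frozenset(c)) >= min_sup]
closed = [X for X in frequent
          if not any(X < Y and sup(Y) == sup(X) for Y in frequent)]
maximal = [X for X in frequent if not any(X < Y for Y in frequent)]
print([sorted(X) for X in closed])    # [['a', 'b'], ['a', 'b', 'c']]
print([sorted(X) for X in maximal])   # [['a', 'b', 'c']]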
8
Computational Complexity of Frequent Itemset
Mining
◼ How many itemsets may potentially be generated in the worst case?
◼ The number of frequent itemsets to be generated is sensitive to the
minsup threshold
◼ When minsup is low, there exist potentially an exponential number of
frequent itemsets
◼ The worst case: M^N, where M = # distinct items and N = max length of
transactions
◼ The worst-case complexity vs. the expected probability
◼ Ex. Suppose Walmart has 10^4 kinds of products
◼ The chance to pick up one product: 10^−4
◼ The chance to pick up a particular set of 10 products: ~10^−40
◼ What is the chance this particular set of 10 products is frequent
(appearing 10^3 times) in 10^9 transactions? The expected number of
occurrences is only 10^9 × 10^−40 = 10^−31, i.e., essentially zero.
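A back-of-envelope check of that estimate, under the slide's implicit assumption that products appear in transactions independently (a sketch, not part of the original slides):

p_item = 10 ** -4               # chance a transaction contains one given product
p_set10 = p_item ** 10          # ~1e-40 for one particular 10-product set
expected = 10 ** 9 * p_set10    # expected occurrences in 10^9 transactions
print(expected)                 # 1e-31, far below the 10^3 needed to be frequent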
9
Chapter 6: Mining Frequent Patterns, Associations
and Correlations: Basic Concepts and Methods
◼ Basic Concepts
◼ Frequent Itemset Mining Methods
◼ Which Patterns Are Interesting?—Pattern
Evaluation Methods
◼ Summary
10
Scalable Frequent Itemset Mining Methods
◼ Apriori: A Candidate Generation-and-Test
Approach
◼ Improving the Efficiency of Apriori
◼ FPGrowth: A Frequent Pattern-Growth Approach
◼ ECLAT: Frequent Pattern Mining with Vertical
Data Format
11
The Downward Closure Property and Scalable
Mining Methods
◼ The downward closure property of frequent patterns
◼ Any subset of a frequent itemset must be frequent
◼ If {juice, diaper, nuts} is frequent, so is {juice,
diaper}
◼ i.e., every transaction having {juice, diaper, nuts} also
contains {juice, diaper}
◼ Scalable mining methods: Three major approaches
◼ Apriori (Agrawal & Srikant@VLDB’94)
◼ Freq. pattern growth (FPgrowth—Han, Pei & Yin
@SIGMOD’00)
◼ Vertical data format approach (Charm—Zaki & Hsiao
@SDM’02)
12
Apriori: A Candidate Generation & Test Approach
◼ Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
(Agrawal & Srikant @VLDB’94, Mannila, et al. @KDD’94)
◼ Method:
◼ Initially, scan DB once to get frequent 1-itemset
◼ Generate length (k+1) candidate itemsets from length k
frequent itemsets
◼ Test the candidates against DB
◼ Terminate when no frequent or candidate set can be
generated
13
The Apriori Algorithm—An Example
Sup_min = 2

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (drop {D}, sup < 2): {A}:2, {B}:3, {C}:3, {E}:3

C2 (self-join L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan → C2 counts: {A, B}:1, {A, C}:2, {A, E}:1, {B, C}:2, {B, E}:3, {C, E}:2
L2 (drop {A, B}, {A, E}): {A, C}:2, {B, C}:2, {B, E}:3, {C, E}:2

C3 (self-join L2): {B, C, E}
3rd scan → L3: {B, C, E}:2
14
The Apriori Algorithm (Pseudo-Code)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent 1-itemsets};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1
        that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
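A compact Python rendering of this pseudocode, offered as a sketch (helper names such as apriori are my own, not from the slides); running it on the TDB example above reproduces L1, L2, and L3:

from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # 1st scan: count single items to build L1
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {X: c for X, c in counts.items() if c >= min_sup}
    result, k = dict(Lk), 1
    while Lk:
        prev = set(Lk)
        # Self-join Lk, then prune candidates that have an
        # infrequent k-subset (downward closure)
        cand = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        cand = {c for c in cand
                if all(frozenset(s) in prev for s in combinations(c, k))}
        # One scan of the database counts all surviving candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in cand}
        Lk = {X: n for X, n in counts.items() if n >= min_sup}
        result.update(Lk)
        k += 1
    return result

# TDB from the example slide; min_sup = 2 yields L3 = {B, C, E}: 2
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for X, n in sorted(apriori(tdb, 2).items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print(sorted(X), n)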
15
Implementation of Apriori
◼ How to generate candidates?
◼ Step 1: self-joining Lk
◼ Step 2: pruning
◼ Example of candidate generation (see the code sketch below)
◼ L3={abc, abd, acd, ace, bcd}
◼ Self-joining: L3*L3
◼ abcd from abc and abd
◼ acde from acd and ace
◼ Pruning:
◼ acde is removed because ade is not in L3
◼ C4 = {abcd}
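The join and prune steps of this example as standalone code (a sketch of my own; itemsets are sorted tuples, which makes the "agree on the first k−1 items" join condition explicit):

from itertools import combinations

def gen_candidates(Lk):
    """Self-join Lk (tuples agreeing on the first k-1 items), then prune."""
    k = len(next(iter(Lk)))
    members = set(Lk)
    joined = {a + (b[-1],) for a in Lk for b in Lk
              if a[:-1] == b[:-1] and a[-1] < b[-1]}
    # Prune: every k-subset of a surviving candidate must be frequent
    return {c for c in joined if all(s in members for s in combinations(c, k))}

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
print(gen_candidates(L3))  # {('a', 'b', 'c', 'd')}; acde pruned, ade not in L3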
16
Scalable Frequent Itemset Mining Methods
◼ Apriori: A Candidate Generation-and-Test Approach
◼ Improving the Efficiency of Apriori
◼ FPGrowth: A Frequent Pattern-Growth Approach
◼ ECLAT: Frequent Pattern Mining with Vertical Data Format
◼ Mining Closed Frequent Patterns and Max-Patterns
17
The Apriori Algorithm—An Example
Min Support = 2, Min Confidence = 70%
18
The Apriori Algorithm—An Example
19
The Apriori Algorithm—An Example
◼ Generating Association Rules (sketched in code below)
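Since the worked tables for this slide are not reproduced in the text, here is how the rule-generation step looks in code, extending the apriori() sketch above: for each frequent itemset F, emit every rule X → F − X whose confidence sup(F)/sup(X) meets minconf (70% here, per the slide; the TDB data is reused purely for illustration):

from itertools import combinations

def gen_rules(freq, min_conf):
    """From {itemset: support count}, emit rules X -> Y with conf >= min_conf."""
    rules = []
    for F, sup_F in freq.items():
        for r in range(1, len(F)):          # split F into X and Y = F - X
            for X in map(frozenset, combinations(F, r)):
                conf = sup_F / freq[X]      # conf(X -> Y) = sup(F) / sup(X)
                if conf >= min_conf:
                    rules.append((set(X), set(F - X), conf))
    return rules

for X, Y, conf in gen_rules(apriori(tdb, 2), 0.7):
    print(X, "->", Y, f"(conf = {conf:.0%})")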
20
Summary
◼ Basic concepts: association rules, support-
confidence framework, closed and max-patterns
◼ Scalable frequent pattern mining methods
◼ Apriori (Candidate generation & test)
21
Ref: Basic Concepts of Frequent Pattern Mining
◼ (Association Rules) R. Agrawal, T. Imielinski, and A. Swami. Mining
association rules between sets of items in large databases. SIGMOD'93
◼ (Max-pattern) R. J. Bayardo. Efficiently mining long patterns from
databases. SIGMOD'98
◼ (Closed-pattern) N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal.
Discovering frequent closed itemsets for association rules. ICDT'99
◼ (Sequential pattern) R. Agrawal and R. Srikant. Mining sequential patterns.
ICDE'95
22
Ref: Apriori and Its Improvements
◼ R. Agrawal and R. Srikant. Fast algorithms for mining association rules.
VLDB'94
◼ H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering
association rules. KDD'94
◼ A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining
association rules in large databases. VLDB'95
◼ J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for
mining association rules. SIGMOD'95
◼ H. Toivonen. Sampling large databases for association rules. VLDB'96
◼ S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and
implication rules for market basket analysis. SIGMOD'97
◼ S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining
with relational database systems: Alternatives and implications. SIGMOD'98
23
Ref: Depth-First, Projection-Based FP Mining
◼ R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation
of frequent itemsets. J. Parallel and Distributed Computing, 2002.
◼ G. Grahne and J. Zhu, Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc.
FIMI'03
◼ B. Goethals and M. Zaki. An introduction to workshop on frequent itemset mining
implementations. Proc. ICDM’03 Int. Workshop on Frequent Itemset Mining
Implementations (FIMI’03), Melbourne, FL, Nov. 2003
◼ J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation.
SIGMOD'00
◼ J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic
Projection. KDD'02
◼ J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed Patterns without
Minimum Support. ICDM'02
◼ J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best Strategies for Mining
Frequent Closed Itemsets. KDD'03
24
Ref: Vertical Format and Row Enumeration Methods
◼ M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for
discovery of association rules. DAMI, 1997.
◼ M. J. Zaki and C. J. Hsiao. CHARM: An Efficient Algorithm for Closed Itemset
Mining, SDM'02.
◼ C. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A Dual-Pruning
Algorithm for Itemsets with Constraints. KDD’02.
◼ F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. Zaki , CARPENTER: Finding
Closed Patterns in Long Biological Datasets. KDD'03.
◼ H. Liu, J. Han, D. Xin, and Z. Shao, Mining Interesting Patterns from Very High
Dimensional Data: A Top-Down Row Enumeration Approach, SDM'06.
25
Ref: Mining Correlations and Interesting Rules
◼ S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing
association rules to correlations. SIGMOD'97.
◼ M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding
interesting rules from large sets of discovered association rules. CIKM'94.
◼ R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and Measures of Interest.
Kluwer Academic, 2001.
◼ C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining
causal structures. VLDB'98.
◼ P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure
for Association Patterns. KDD'02.
◼ E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE’03.
◼ T. Wu, Y. Chen, and J. Han, “Re-Examination of Interestingness Measures in Pattern
Mining: A Unified Framework", Data Mining and Knowledge Discovery, 21(3):371-
397, 2010
26