Introduction to Data Mining Methods
Data Mining: Data Mining Methods
with Dr. Qin Lv
Learning objectives: Identify the core functionalities of data modeling in the data mining pipeline. Apply the Apriori algorithm for frequent itemset mining.
Data Mining: Four Views
Ø Application, Knowledge, Technique, Data

Data Mining Pipeline
Ø Data → data understanding → data preprocessing → data warehousing → data modeling → pattern evaluation → Knowledge
Ø The Application and Technique views span the whole pipeline
Ø Frequent pattern analysis
Ø Classification, prediction
Ø Clustering
Ø Anomaly detection
Ø Trend and evolution analysis
Frequent Pattern
Analysis
Ø Frequent itemset
Ø Frequent sequence
Ø Frequent structure
Ø Association rules
Ø Correlation analysis
Classification
Ø Pre-defined classes
Ø Need training data
Ø Build model to distinguish classes

Prediction
Ø Numerical prediction (continuous value)
• E.g., weather
• E.g., stock price
• E.g., traffic
Clustering
Ø No predefined classes
Ø Intra-cluster similarity
Ø Inter-cluster dissimilarity
Anomaly Detection
Ø Anomaly/outlier
• Differ from the “norm”
• E.g., error, noise
• E.g., fraud
• E.g., extreme events
[Figure: temperature time series (Kelvin vs. date, 1998–2005) highlighting unusual time-series snippets and level shifting]
Trend and Evolution Analysis
Ø Changes over time
• Overall trend
• Periodic patterns
• Anomalies
Data Mining Methods
Ø Frequent pattern analysis
Ø Classification
Ø Clustering
Ø Outlier analysis
Market Basket Analysis
Ø List of transactions
• Each Ti contains multiple items
Ø (Frequent) itemset
• X = {x1, x2, …, xk}
Ø (Minimum) support
• Probability of Ti containing X

Tid  Items
1    A, B, C, E
2    A, D, E
3    B, C, E
4    B, C, D, E
5    B, D, E
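The support definition above can be sketched in a few lines of Python. The transaction data is copied from the table on this slide; the helper name `support` is my own, not from the lecture.

```python
# Support of an itemset X: fraction of transactions Ti that contain X.
# Transactions taken from the slide's market-basket table.
transactions = [
    {"A", "B", "C", "E"},   # Tid 1
    {"A", "D", "E"},        # Tid 2
    {"B", "C", "E"},        # Tid 3
    {"B", "C", "D", "E"},   # Tid 4
    {"B", "D", "E"},        # Tid 5
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)  # <= is subset test
    return hits / len(transactions)

print(support({"B", "E"}, transactions))  # 4 of 5 transactions -> 0.8
```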
Frequent Pattern Mining
Ø Brute force approach is infeasible (e.g., 100 items ⇒ 2^100 − 1 possible itemsets)
Ø Closed pattern X: no super-pattern Y ⊃ X w/ the same support
Ø Max-pattern X: no frequent super-pattern Y ⊃ X
Closed & Max Pattern Example
Ø DB = {<a1, ..., a100>, <a1, ..., a50>}, min_sup = 0.5
Ø Frequent patterns? All item combinations (2^100 − 1 itemsets)
Ø Closed patterns?
• <a1, ..., a100>: support 1; <a1, ..., a50>: support 2
Ø Max-pattern?
• <a1, ..., a100>: support 1
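A scaled-down analogue of this example can be checked by brute force. The sketch below (my own code, not from the lecture) uses <a1, ..., a4> and <a1, ..., a2> in place of <a1, ..., a100> and <a1, ..., a50>, with the same min_sup = 0.5:

```python
from itertools import combinations

# Scaled-down analogue of the slide's example: <a1..a4> and <a1..a2>
# instead of <a1..a100> and <a1..a50>; min_sup = 0.5 (support count >= 1).
transactions = [frozenset(f"a{i}" for i in range(1, 5)),
                frozenset(f"a{i}" for i in range(1, 3))]
min_count = 1

# Brute force: count every nonempty item combination.
items = sorted(set().union(*transactions))
freq = {}  # frequent itemset -> support count
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = frozenset(combo)
        count = sum(1 for t in transactions if s <= t)
        if count >= min_count:
            freq[s] = count

# Closed: no proper super-pattern with the same support.
closed = [s for s in freq
          if not any(s < t and freq[t] == freq[s] for t in freq)]
# Max: no proper frequent super-pattern at all.
maximal = [s for s in freq if not any(s < t for t in freq)]

print(len(freq))                     # 2^4 - 1 = 15 frequent itemsets
print(sorted(map(sorted, closed)))   # [['a1','a2'], ['a1','a2','a3','a4']]
print(sorted(map(sorted, maximal)))  # [['a1','a2','a3','a4']]
```

As on the slide, only the two full "transactions" are closed, and only the longest one is maximal, even though every combination is frequent.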
Apriori Algorithm
Ø Apriori pruning: if X is infrequent, then none of its supersets can be frequent
Ø Procedure
• Scan dataset to get frequent 1-itemsets
• Generate candidate (k+1)-itemsets from frequent k-itemsets
• Scan dataset to remove infrequent candidate (k+1)-itemsets
• Stop when no more frequent or candidate itemsets can be generated
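The procedure above can be sketched as follows. This is an illustrative implementation (function name and data layout are my own); it joins any two frequent k-itemsets whose union has k+1 items, a simpler join than the prefix self-join described later under Important Details.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Apriori frequent-itemset mining (illustrative sketch).

    transactions: list of sets of items; min_count: minimum support count.
    Returns a dict mapping frozenset -> support count.
    """
    # Scan once for frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_count}
    freq = dict(Lk)

    while Lk:
        # Generate candidate (k+1)-itemsets from frequent k-itemsets.
        prev = list(Lk)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == len(prev[i]) + 1:
                    # Apriori pruning: every k-subset must itself be frequent.
                    if all(frozenset(sub) in Lk
                           for sub in combinations(union, len(prev[i]))):
                        candidates.add(union)
        # Scan dataset to drop infrequent candidates.
        Lk = {}
        for cand in candidates:
            c = sum(1 for t in transactions if cand <= t)
            if c >= min_count:
                Lk[cand] = c
        freq.update(Lk)   # loop stops when no candidates survive
    return freq

transactions = [{"A", "B", "C", "E"}, {"A", "D", "E"}, {"B", "C", "E"},
                {"B", "C", "D", "E"}, {"B", "D", "E"}]
result = apriori(transactions, min_count=3)  # min_sup = 0.6 on 5 transactions
print(result[frozenset({"B", "C", "E"})])    # 3
```

Run on the slide's five transactions with min_sup = 0.6, this reproduces the example that follows: {A} is pruned at level 1, {B, D} and {C, D} at level 2, and {B, C, E} is the only frequent 3-itemset.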
Apriori Algorithm Example
Ø min_sup = 0.6 (support count ≥ 3 out of 5 transactions)

Tid  Items
1    A, B, C, E
2    A, D, E
3    B, C, E
4    B, C, D, E
5    B, D, E

Frequent 1-itemsets:
Itemset  #
{A}      2  (infrequent, pruned)
{B}      4
{C}      3
{D}      3
{E}      5

Candidate 2-itemsets:
Itemset  #
{B, C}   3
{B, D}   2  (infrequent, pruned)
{B, E}   4
{C, D}   1  (infrequent, pruned)
{C, E}   3
{D, E}   3

Frequent 3-itemsets:
Itemset    #
{B, C, E}  3

Ø What about {B, D, E} or {C, D, E}? They contain the infrequent subsets {B, D} and {C, D}, so Apriori prunes them without counting.
Important Details
Ø Self-joining of frequent k-itemsets => candidate (k+1)-itemsets
• Only join two k-itemsets if their first (k−1) items are the same
Ø Pruning: remove a candidate if any of its k-subsets is not frequent
Ø Example: L3 = {abc, abd, acd, ace, bcd}
• abc and abd => abcd; all 3-subsets (abc, abd, acd, bcd) are in L3 => valid candidate
• acd and ace => acde, but ade is not in L3 => pruned
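The prefix self-join and pruning steps above can be sketched directly (my own code; itemsets are represented as strings of sorted items, as in the L3 example):

```python
from itertools import combinations

def gen_candidates(Lk):
    """Generate candidate (k+1)-itemsets from frequent k-itemsets.

    Itemsets are strings of sorted items (e.g. "abc"). Join two k-itemsets
    only when their first k-1 items match, then prune any candidate that
    has an infrequent k-subset (the Apriori property).
    """
    Lk = sorted(Lk)
    k = len(Lk[0])
    frequent = set(Lk)
    candidates = []
    for i in range(len(Lk)):
        for j in range(i + 1, len(Lk)):
            if Lk[i][:k - 1] == Lk[j][:k - 1]:   # same (k-1)-prefix
                cand = Lk[i] + Lk[j][k - 1:]     # e.g. abc + abd -> abcd
                # Pruning: every k-subset of cand must be in Lk.
                if all("".join(sub) in frequent
                       for sub in combinations(cand, k)):
                    candidates.append(cand)
    return candidates

L3 = ["abc", "abd", "acd", "ace", "bcd"]
print(gen_candidates(L3))  # ['abcd'] -- acde is pruned since 'ade' is not in L3
```

This reproduces the slide's example: abc and abd join to abcd (all 3-subsets frequent), while acd and ace join to acde, which is pruned because ade is missing from L3.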