DBM630: Data Mining and Data Warehousing
MS.IT. Rangsit University
Semester 2/2011


Lecture 5
Association Rule Mining

by Kritsada Sriphaew (sriphaew.k AT gmail.com)

1
Topics
 Association rule mining
 Mining single-dimensional association rules
 Mining multilevel association rules
 Other measurements: interest and conviction
 Association rule mining to correlation analysis




 2                          Data Warehousing and Data Mining by Kritsada Sriphaew
What is Association Mining?
      Association rule mining:
          Finding frequent patterns, associations, correlations, or causal
           structures among sets of items or objects in transaction
           databases, relational databases, and other information
           repositories.
      Applications:
          Basket data analysis, cross-marketing, catalog design,
           clustering, classification, etc.
   Ex.: Rule form: “Body ⇒ Head [support, confidence]”, i.e.,
        “Antecedent ⇒ Consequent [support, confidence]”
        buys(x, “diapers”) ⇒ buys(x, “beers”)                  [0.5%, 60%]
        major(x, “CS”) ^ takes(x, “DB”) ⇒ grade(x, “A”)        [1%, 75%]
   3                                  Data Warehousing and Data Mining by Kritsada Sriphaew
   3                                  Data Warehousing and Data Mining by Kritsada Sriphaew
A typical example of association rule mining is
market basket analysis.




4                      Data Warehousing and Data Mining by Kritsada Sriphaew
Rule Measures: Support/Confidence
       Find all the rules “Antecedent(s) ⇒ Consequent(s)” with minimum support and confidence
           support, s: probability that a transaction contains {A ∪ C}
           confidence, c: conditional probability that a transaction having A also contains C
       Let min. sup. = 50% and min. conf. = 50%:
           A ⇒ C (s=50%, c=66.7%)
           C ⇒ A (s=50%, c=100%)
           • Support = 50% means that 50% of all transactions under analysis show that
             A and C are purchased together
           • Confidence = 66.7% means that 66.7% of the customers who purchased A also bought C
       Typically association rules are considered interesting if they satisfy both
        a minimum support threshold and a minimum confidence threshold
       Such thresholds can be set by users or domain experts

       Transactional database
           Transaction ID      Items Bought
           2000                A, B, C
           1000                A, C
           4000                A, D
           5000                B, E, F

    5                                             Data Warehousing and Data Mining by Kritsada Sriphaew
Rule Measures: Support/Confidence
        TransID      Items Bought          Rule: A ⇒ C
        T001         A, B, C
        T002         A, C                  support(A ⇒ C)    = P(A ∪ C)
        T003         A, D                  confidence(A ⇒ C) = P(C|A) = P(A ∪ C) / P(A)
        T004         B, E, F

Frequencies: A = 3, B = 2, C = 2, AB = 1, AC = 2, BC = 1, ABC = 1
               •   A ⇒ C      (2/4 = 50%, 2/3 = 66.7%)
               •   C ⇒ A      (2/4 = 50%, 2/2 = 100%)
               •   A ⇒ B      (1/4 = 25%, 1/3 = 33.3%)
               •   B ⇒ A      (1/4 = 25%, 1/2 = 50%)
               •   A, B ⇒ C   (1/4 = 25%, 1/1 = 100%)
               •   A, C ⇒ B   (1/4 = 25%, 1/2 = 50%)
               •   B, C ⇒ A   (1/4 = 25%, 1/1 = 100%)

(Figure: Venn diagram of customers who buy beer (A), who buy diapers (C), and who buy both (A & C).)

    6                                           Data Warehousing and Data Mining by Kritsada Sriphaew
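
To make the two measures concrete, here is a minimal Python sketch (the set encoding of transactions and the helper names are ours, not from the slides) that computes support and confidence over the four-transaction toy database above:

    # Toy database from this slide: four transactions over items A-F.
    transactions = [
        {"A", "B", "C"},   # 2000
        {"A", "C"},        # 1000
        {"A", "D"},        # 4000
        {"B", "E", "F"},   # 5000
    ]

    def support(itemset, db):
        """Fraction of transactions that contain every item of `itemset`."""
        itemset = set(itemset)
        return sum(itemset <= t for t in db) / len(db)

    def confidence(antecedent, consequent, db):
        """conf(A => C) = support(A and C together) / support(A)."""
        return support(set(antecedent) | set(consequent), db) / support(antecedent, db)

    print(support({"A", "C"}, transactions))       # 0.5   -> s(A => C) = 50%
    print(confidence({"A"}, {"C"}, transactions))  # 0.667 -> c(A => C) = 66.7%
    print(confidence({"C"}, {"A"}, transactions))  # 1.0   -> c(C => A) = 100%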
Association Rule: Support/Confidence for
  Relational Tables
         In the case where each transaction is a row in a relational table
         Find: all rules that correlate the presence of one set of
          attributes with that of another set of attributes

    outlook    temp.  humidity  windy  Sponsor  play-time  play
    sunny      hot    high      True   Sony     85         Y
    sunny      hot    high      False  HP       90         Y
    overcast   hot    normal    True   Ford     63         Y
    rainy      mild   high      True   Ford     5          N
    rainy      cool   low       False  HP       56         Y
    sunny      hot    low       True   Sony     25         N
    rainy      cool   normal    True   Nokia    5          N
    overcast   mild   high      True   Honda    86         Y
    rainy      mild   low       False  Ford     78         Y
    overcast   hot    high      True   Sony     74         Y

  • If temperature = hot then humidity = high   (s=3/10, c=3/5)
  • If windy=true and play=Y then humidity=high and outlook=overcast   (s=2/10, c=2/4)
  • If windy=true and play=Y and humidity=high then outlook=overcast   (s=2/10, c=2/3)

      7                                      Data Warehousing and Data Mining by Kritsada Sriphaew
Association Rule Mining: Types
       Boolean vs. quantitative associations (Based on the types of
        values handled) (Single vs. multiple Dim.)
        SQLServer ^ DMBooks ⇒ DBMiner                                   [0.2%, 60%]
        buys(x, “SQLServer”) ^ buys(x, “DMBook”) ⇒ buys(x, “DBMiner”)   [0.2%, 60%]
        age(x, “30..39”) ^ income(x, “42..48K”) ⇒ buys(x, “PC”)         [1%, 75%]

       Single level vs. multilevel analysis
           What brands of beers are associated with what brands of diapers?
       Various extensions
           Maxpatterns and closed itemsets

    8                                     Data Warehousing and Data Mining by Kritsada Sriphaew
An Example (single dimensional Boolean
 association Rule Mining)
        For rule A  C:                                    Min. support 50%
            support = support({A, C}) = 50%                Min. confidence 50%
            confidence = support({A, C})/support({A}) = 66.7%
        The Apriori principle:
            Any subset of a frequent itemset must be
             frequent
Transaction ID           Items Bought              Frequent Itemset Support
    2000                 A,B,C                     {A}                 75%
    1000                 A,C                       {B}                 50%
    4000                 A,D                       {C}                 50%
    5000                 B,E,F                     {A,C}               50%

     9                                     Data Warehousing and Data Mining by Kritsada Sriphaew
Two Steps in Mining Association Rules
   A subset of a frequent itemset must also be a frequent
    itemset
      i.e., if {AB} is a frequent itemset, both {A} and {B} must
       be a frequent itemset
   Iteratively find frequent itemsets with cardinality from 1 to
    k (k-itemset)
Step1: Find the frequent itemsets: the sets of items that
         have minimum support

Step2: Use the frequent itemsets to generate association
       rules

 10                             Data Warehousing and Data Mining by Kritsada Sriphaew
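
Step 2 is typically implemented by splitting each frequent itemset into every non-empty antecedent/consequent pair and keeping the splits that pass the confidence threshold. A hedged Python sketch, assuming a dictionary that maps each frequent itemset to its support (the kind of output the Apriori sketch further below produces):

    from itertools import combinations

    def generate_rules(supports, min_conf):
        """supports: {frozenset(itemset): support}; returns (antecedent, consequent, confidence) triples."""
        rules = []
        for itemset, sup in supports.items():
            if len(itemset) < 2:
                continue
            for r in range(1, len(itemset)):
                for ante in combinations(itemset, r):
                    ante = frozenset(ante)
                    conf = sup / supports[ante]   # subsets of a frequent itemset are frequent,
                    if conf >= min_conf:          # so their supports are available
                        rules.append((set(ante), set(itemset - ante), conf))
        return rules

    # Supports from the earlier toy database (min. sup. 50%):
    supports = {frozenset({"A"}): 0.75, frozenset({"C"}): 0.5, frozenset({"A", "C"}): 0.5}
    print(generate_rules(supports, min_conf=0.5))
    # two rules: {A} => {C} with confidence 0.667, and {C} => {A} with confidence 1.0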
Find the frequent itemsets

The Apriori Algorithm
    Join Step: Ck is generated by joining Lk-1 with itself
    Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset
     of a frequent k-itemset
    Pseudo-code:
         Ck: candidate itemsets of size k
         Lk: frequent itemsets of size k

         L1 = {frequent 1-itemsets};
         for (k = 1; Lk != ∅; k++) do begin
             Ck+1 = candidates generated from Lk;
             for each transaction t in database do
                 increment the count of all candidates in Ck+1
                 that are contained in t;
             Lk+1 = candidates in Ck+1 with min_support;
         end
         return ∪k Lk;
    11                                Data Warehousing and Data Mining by Kritsada Sriphaew
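
The pseudo-code can be turned into a compact level-wise implementation; the sketch below (Python, our own naming) self-joins the previous level, prunes with the Apriori property, and makes one database scan per level:

    from itertools import combinations

    def apriori(transactions, min_support):
        """Return {frozenset: support} for all itemsets whose support (a fraction) meets min_support."""
        n = len(transactions)
        counts = {}
        for t in transactions:                       # first scan: count 1-itemsets
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
        result, k = dict(frequent), 1
        while frequent:
            prev = list(frequent)
            candidates = set()
            for i in range(len(prev)):               # join step
                for j in range(i + 1, len(prev)):
                    union = prev[i] | prev[j]
                    if len(union) == k + 1 and all(  # prune step (Apriori property)
                        frozenset(sub) in frequent for sub in combinations(union, k)
                    ):
                        candidates.add(union)
            counts = {c: 0 for c in candidates}
            for t in transactions:                   # one scan per level to count candidates
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            frequent = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
            result.update(frequent)
            k += 1
        return result

    db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]   # the database of the next slide
    print(apriori(db, min_support=0.5))                 # includes frozenset({2, 3, 5}): 0.5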
The Apriori Algorithm — Example
     Database D
     TID    Items
     100    1 3 4
     200    2 3 5
     300    1 2 3 5
     400    2 5

     Scan D for counts of 1-itemsets:
     C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3        L1: {1}:2, {2}:3, {3}:3, {5}:3

     Join L1 with itself:
     C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
     Scan D for counts:
     C2: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
     L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

     Join L2 with itself and prune:
     C3: {2 3 5}
     Scan D for counts:
     L3: {2 3 5}:2
12
How to Generate Candidates?
   Suppose the items in Lk-1 are listed in an order
Step 1: self-joining Lk-1
   INSERT INTO Ck
   SELECT p.item1, p.item2, …, p.itemk-1, q.itemk-1
   FROM   Lk-1 p, Lk-1 q
   WHERE  p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
   forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
         if (s is not in Lk-1) then delete c from Ck
13
Example of Generating Candidates
             L3 = {abc, abd, acd, ace, bcd}

Self-joining: L3 × L3
    abc + abd → abcd
    acd + ace → acde

Pruning:
    acde is removed because ade is not in L3

C4 = {abcd}

 14                   Data Warehousing and Data Mining by Kritsada Sriphaew
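
The same join and prune steps, written directly after the SQL-style pseudo-code of the previous slide and applied to this L3 example (itemsets as sorted tuples; the helper name is ours):

    def gen_candidates(L_prev):
        """Ordered self-join plus prune; L_prev holds (k-1)-itemsets as sorted tuples."""
        L_set = set(L_prev)
        candidates = []
        for p in L_prev:
            for q in L_prev:
                # Join: first k-2 items equal, last item of p smaller than last item of q.
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    c = p + (q[-1],)
                    # Prune: every (k-1)-subset of c must itself be in L(k-1).
                    if all(c[:i] + c[i + 1:] in L_set for i in range(len(c))):
                        candidates.append(c)
        return candidates

    L3 = [tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")]
    print(gen_candidates(L3))   # [('a', 'b', 'c', 'd')] -- acde is pruned because ade is not in L3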
How to Count Supports of Candidates?
   Why is counting supports of candidates a problem?
       The total number of candidates can be very huge
       One transaction may contain many candidates
   Method:
       Candidate itemsets are stored in a hash-tree
       Leaf node of hash-tree contains a list of itemsets and
        counts
       Interior node contains a hash table
       Subset function: finds all the candidates contained in a
        transaction
15                                 Data Warehousing and Data Mining by Kritsada Sriphaew
Subset Function
     Subset function: finds all the candidates contained in a transaction.
      (1) Generate a hash tree over the candidate itemsets.
      (2) Hash each item of the transaction to locate and count the candidates it contains.

     C2 itemsets: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

     Database
        TID      Items
        100      1 3 4
        200      2 3 5
        300      1 2 3 5
        400      2 5

     (Figure: the hash tree built over C2, with the counts that accumulate at its leaves
      as each transaction is hashed through it.)

     16                             Data Warehousing and Data Mining by Kritsada Sriphaew
Is Apriori Fast Enough? — Performance
Bottlenecks
    The core of the Apriori algorithm:
        Use frequent (k – 1)-itemsets to generate candidate frequent
         k-itemsets
        Use database scan and pattern matching to collect counts for
         the candidate itemsets
    The bottleneck of Apriori: candidate generation
        Huge candidate sets:
          10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets
          To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100},
           one needs to generate 2^100 ≈ 10^30 candidates.
        Multiple scans of database:
          Needs (n+1) scans, where n is the length of the longest pattern

    17                               Data Warehousing and Data Mining by Kritsada Sriphaew
Mining Frequent Patterns Without Candidate
Generation
   Compress a large database into a compact,
    Frequent-Pattern tree (FP-tree) structure
       highly condensed, but complete for frequent pattern
        mining
       avoid costly database scans
   Develop an efficient, FP-tree-based frequent pattern
    mining method
       A divide-and-conquer methodology: decompose mining
        tasks into smaller ones
       Avoid candidate generation: sub-database test only!
18                                Data Warehousing and Data Mining by Kritsada Sriphaew
Construct FP-tree from Transaction DB
     TID    Items bought                  (ordered) frequent items
     100    {f, a, c, d, g, i, m, p}      {f, c, a, m, p}
     200    {a, b, c, f, l, m, o}         {f, c, a, b, m}            min_support = 0.5
     300    {b, f, h, j, o}               {f, b}
     400    {b, c, k, s, p}               {c, b, p}
     500    {a, f, c, e, l, p, m, n}      {f, c, a, m, p}

Steps:
1.    Scan DB once, find frequent 1-itemsets (single item patterns)
2.    Order frequent items in frequency descending order
3.    Scan DB again, construct FP-tree

Header Table: f:4, c:4, a:3, b:3, m:3, p:3

(Figure: the resulting FP-tree. From the root {}: f:4 → c:3 → a:3 → m:2 → p:2,
 with side branches a:3 → b:1 → m:1 and f:4 → b:1, plus a second branch
 c:1 → b:1 → p:1; header-table links connect the nodes of each item.)

       19                                   Data Warehousing and Data Mining by Kritsada Sriphaew
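
A minimal sketch of the three construction steps (Python; the node layout, the list-based header table, and the alphabetical tie-break between equally frequent items are our own choices, so the exact node order may differ slightly from the figure):

    from collections import Counter, defaultdict

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count, self.children = 0, {}

    def build_fp_tree(transactions, min_count):
        counts = Counter(item for t in transactions for item in t)     # step 1: first scan
        freq = {i: c for i, c in counts.items() if c >= min_count}
        root, header = FPNode(None, None), defaultdict(list)           # header: item -> its nodes
        for t in transactions:                                         # steps 2-3: second scan
            # keep only frequent items, ordered by descending frequency
            items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
            node = root
            for item in items:
                if item not in node.children:
                    node.children[item] = FPNode(item, node)
                    header[item].append(node.children[item])
                node = node.children[item]
                node.count += 1
        return root, header, freq

    db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
    root, header, freq = build_fp_tree(db, min_count=3)   # min_support 0.5 of 5 transactions
    print(sorted(freq.items()))                           # the six frequent items a, b, c, f, m, p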
Mining Frequent Patterns using FP-tree
    General idea (divide-and-conquer)
        Recursively grow frequent pattern path using the FP-tree
    Method
        For each item, construct its conditional pattern-base, and then its
         conditional FP-tree
        Repeat the process on each newly created conditional FP-tree
        Until the resulting FP-tree is empty, or it contains only one path (single
         path will generate all the combinations of its sub-paths, each of which is a
         frequent pattern)
    Benefit: Completeness & Compactness
        Completeness: never breaks a long pattern of any transaction and
         preserves complete information for frequent pattern mining
        Compactness: reduces irrelevant information (infrequent items are gone),
         orders in frequency descending ordering (more frequent items are likely to
         be shared), and smaller than the original database.

    20                                     Data Warehousing and Data Mining by Kritsada Sriphaew
Step 1: From FP-tree to Conditional Pattern Base
    Starting at the frequent header table in the FP-tree
    Traverse the FP-tree by following the link of each frequent item
    Accumulate all of transformed prefix paths of that item to form a
     conditional pattern base

       Header Table: f:4, c:4, a:3, b:3, m:3, p:3
       (Figure: the FP-tree from the previous slide, with its header-table links.)

       Conditional pattern bases
       item      cond. pattern base
       c         f:3
       a         fc:3
       b         fca:1, f:1, c:1
       m         fca:2, fcab:1
       p         fcam:2, cb:1

  21                           Knowledge Management and Discovery © Kritsada Sriphaew
Step 2: Construct Conditional FP-tree
    For each pattern base
        Accumulate the count for each item in the base
        Construct the FP-tree for the frequent items of the pattern base

    m-conditional pattern base:  fca:2, fcab:1
    m-conditional FP-tree:       {} → f:3 → c:3 → a:3
    (b is dropped from the conditional tree because its count in the base, 1, is below min_support)

    All frequent patterns concerning m:
        m, fm, cm, am, fcm, fam, cam, fcam

    22
Mining Frequent Patterns by
(Creating Conditional Pattern-Bases)

 Item    Conditional pattern-base       Conditional FP-tree
   p     {(fcam:2), (cb:1)}             {(c:3)}|p
   m     {(fca:2), (fcab:1)}            {(f:3, c:3, a:3)}|m
   b     {(fca:1), (f:1), (c:1)}        Empty
   a     {(fc:3)}                       {(f:3, c:3)}|a
   c     {(f:3)}                        {(f:3)}|c
   f     Empty                          Empty

23                         Data Warehousing and Data Mining by Kritsada Sriphaew
Step 3: Recursively mine the conditional FP-tree

    m-conditional FP-tree:      {} → f:3 → c:3 → a:3
    am-conditional FP-tree:     {} → f:3 → c:3
    cm-conditional FP-tree:     {} → f:3
    cam-conditional FP-tree:    {} → f:3

    24                  Data Warehousing and Data Mining by Kritsada Sriphaew
Single FP-tree Path Generation
    Suppose an FP-tree T has a single path P
    The complete set of frequent patterns of T can be generated by
     enumerating all the combinations of the sub-paths of P

    Example: the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3,
     so all frequent patterns concerning m are
        m, fm, cm, am, fcm, fam, cam, fcam

    25
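
Reading the patterns off a single path amounts to enumerating the combinations of its items and attaching the suffix item; a tiny sketch, assuming the m-conditional path f:3, c:3, a:3 (helper name ours):

    from itertools import combinations

    def patterns_from_single_path(path, suffix):
        """path: [(item, count), ...] along the single path; suffix: the conditioning item."""
        patterns = {}
        for r in range(1, len(path) + 1):
            for combo in combinations(path, r):
                items = frozenset(i for i, _ in combo) | {suffix}
                patterns[items] = min(c for _, c in combo)   # support = smallest count on the combo
        return patterns

    print(patterns_from_single_path([("f", 3), ("c", 3), ("a", 3)], "m"))
    # fm, cm, am, fcm, fam, cam, fcam, each with count 3 (m itself comes from the suffix's own count)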
FP-growth vs. Apriori: Scalability With the
Support Threshold

    (Figure: run time in seconds, 0 to 100, plotted against the support threshold, 0 to 3%,
     on data set T25I20D10K, comparing D1 FP-growth runtime with D1 Apriori runtime.)

26                                 Data Warehousing and Data Mining by Kritsada Sriphaew
CHARM - Mining Closed Association Rules
    Instead of horizontal DB format, vertical format is used.
    Instead of traditional frequent itemsets, closed frequent
     itemsets are mined.
           Horizontal DB                           Vertical DB
         Transaction   Items                  Items Transaction
         1             ABDE                   A          1345
         2             BCE                    B          123456
         3             ABDE                   C          2456
         4             ABCE                   D          1356
         5             ABCDE                  E          12345
         6             BCD
    27                           Data Warehousing and Data Mining by Kritsada Sriphaew
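
Producing the vertical layout takes one pass over the horizontal database; itemset supports then become tidset intersections, which is exactly what CHARM exploits. A small Python sketch under those assumptions (names ours):

    from collections import defaultdict

    horizontal = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

    def to_vertical(db):
        """item -> set of transaction ids (tidset) containing that item."""
        vertical = defaultdict(set)
        for tid, items in db.items():
            for item in items:
                vertical[item].add(tid)
        return vertical

    v = to_vertical(horizontal)
    print(sorted(v["A"]))                      # [1, 3, 4, 5]
    # The support of an itemset is the size of the intersection of its items' tidsets:
    print(len(v["A"] & v["B"] & v["E"]) / 6)   # ABE occurs in tids 1, 3, 4, 5 -> about 0.67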
CHARM – Frequent Itemsets and Their Supports
   An example database (vertical DB) and its frequent itemsets, min. support = 0.5

          Items   Trans.
          A       1345
          B       123456
          C       2456
          D       1356
          E       12345

          Support     Itemsets
          1.00        B
          0.83        BE, E
          0.67        A, C, D, AB, AE, BC, BD, ABE
          0.50        AD, CE, DE, ABD, ADE, BCE, BDE, ABDE

 28                         Data Warehousing and Data Mining by Kritsada Sriphaew
CHARM - Closed Itemsets
   Closed frequent itemsets and their corresponding
    frequent itemsets
     Closed
     Itemsets    Tidsets      Sup. Freq. Itemsets
     B           123456       1.00        B
     BE          12345        0.83        BE, E
     ABE         1345         0.67        ABE, AB, AE, A
     BD          1356         0.67        BD, D
     BC          2456         0.67        BC, C
     ABDE        135          0.50        ABDE, ABD, ADE,
                                          BDE, AD, DE
     BCE         245          0.50        CE, BCE
29                          Data Warehousing and Data Mining by Kritsada Sriphaew
The CHARM Algorithm

 CHARM (D ⊆ I × T, minsup):
 1. Nodes = { Ij × t(Ij) : Ij ∈ I and |t(Ij)| ≥ minsup }
 2. CHARM-EXTEND(Nodes, C)

 CHARM-EXTEND (Nodes, C):
 3. for each Xi × t(Xi) in Nodes
 4.    NewN = ∅ and X = Xi
 5.    for each Xj × t(Xj) in Nodes, with f(j) > f(i)
 6.        X’ = X ∪ Xj and Y = t(Xi) ∩ t(Xj)
 7.        CHARM-PROPERTY(Nodes, NewN)
 8.    if NewN ≠ ∅ then CHARM-EXTEND(NewN)
 9.    C = C ∪ {X} // if X is not subsumed

 CHARM-PROPERTY (Nodes, NewN):
 1. if (|Y| ≥ minsup) then
 2.    if t(Xi) = t(Xj) then           // Property 1
 3.        Remove Xj from Nodes
 4.        Replace all Xi with X’
 5.    else if t(Xi) ⊂ t(Xj) then      // Property 2
 6.        Replace all Xi with X’
 7.    else if t(Xi) ⊃ t(Xj) then      // Property 3
 8.        Remove Xj from Nodes
 9.        Add X’ × Y to NewN
 10.   else if t(Xi) ≠ t(Xj) then      // Property 4
 11.       Add X’ × Y to NewN

 (Figure: the CHARM search tree over itemset × tidset pairs, starting from ∅:
  A×1345, B×123456, C×2456, D×1356, E×12345; extending A gives AB×1345 and ABE×1345;
  further nodes include ABC×45, ABD×135, ABDE×135, BC×2456, BD×1356, BE×12345,
  BCD×56, BCE×245, BDE×135.)

 30                                                        Data Warehousing and Data Mining by Kritsada Sriphaew
Presentation of Association Rules (Table Form)




31                    Data Warehousing and Data Mining by Kritsada Sriphaew
Visualization of Association Rule Using
Plane Graph




32
Visualization of Association Rule Using Rule
Graph




33
Mining multilevel association rules from transactional databases

Multiple-Level Association Rules
    Items often form a hierarchy.
    Items at the lower level are expected to have lower support.
    Rules regarding itemsets at the appropriate levels could be quite useful.
    Transaction databases can be encoded based on dimensions and levels.
    We can explore shared multi-level mining.

      TID     ITEMS
      T1      {1121, 1122, 1212}
      T2      {1222, 1121, 1122, 1213}
      T3      {1124, 1213}
      T4      {1111, 1211, 1232, 1221, 1223}

    Encoded item hierarchy:
      Food (1)
        Milk (11):  Skim (111), 2% (112)
            2% milk brands: Fraser (1121), Sunset (1124)
        Bread (12): Wheat (121), White (122)
            Wheat brand: Wonder (1213); White brand: Wonder (1222)

     34                                Data Warehousing and Data Mining by Kritsada Sriphaew
Mining Multi-Level Associations
   A top-down, progressive deepening approach:
       First find high-level strong rules:
               milk ⇒ bread                           [20%, 60%]
       Then find their lower-level “weaker” rules:
               2% milk ⇒ wheat bread                  [6%, 50%]

   Variations in mining multiple-level association rules:
       Level-crossed association rules:
           2% milk ⇒ Wonder wheat bread               [3%, 60%]
       Association rules with multiple, alternative hierarchies:
           2% milk ⇒ Wonder bread                     [8%, 72%]
35                                  Data Warehousing and Data Mining by Kritsada Sriphaew
Multi-level Association: Redundancy Filtering
 Some rules may be redundant due to “ancestor”
  relationships between items.
 Example
       milk ⇒ wheat bread                    [s=8%, c=70%]
       2% milk ⇒ wheat bread                 [s=2%, c=72%]
 We say the first rule is an ancestor of the second
  rule.
 A rule is redundant if its support is close to the
  “expected” value, based on the rule’s ancestor.
 36                        Data Warehousing and Data Mining by Kritsada Sriphaew
Multi-Level Mining: Progressive Deepening
    A top-down, progressive deepening approach:
        First mine high-level frequent items:
          milk (15%), bread (10%)
        Then mine their lower-level “weaker” frequent itemsets:
          2% milk (5%), wheat bread (4%)
    Different min_support threshold across multi-levels lead
     to different algorithms:
        If adopting the same min_support across multi-levels
           then toss t if any of t’s ancestors is infrequent.
        If adopting reduced min_support at lower levels
           then examine only those descendants whose ancestor’s
            support is frequent/non-negligible.

    37                               Data Warehousing and Data Mining by Kritsada Sriphaew
Problem of Confidence
    Example: (Aggarwal & Yu, PODS98)
        Among 5000 students
            3000 play basketball
            3750 eat cereal
            2000 both play basketball and eat cereal
        play basketball ⇒ eat cereal [40%, 66.7%] is misleading because the overall
         percentage of students eating cereal is 75%, which is higher than 66.7%.
        play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although
         it has lower support and confidence


                               basketball not basketball sum(row)
                    cereal          2000           1750     3750
                    not cereal      1000            250     1250
                    sum(col.)       3000           2000     5000
    38                                           Data Warehousing and Data Mining by Kritsada Sriphaew
Interest/Lift/Correlation
    Interest (or lift, correlation):   lift(A, B) = P(A ∪ B) / (P(A) P(B))
        takes both P(A) and P(B) into consideration
        P(A ∪ B) = P(A) P(B) if A and B are independent events
        A and B are negatively correlated if the value is less than 1;
         otherwise A and B are positively correlated

                     basketball  not basketball  sum(row)
          cereal           2000            1750      3750
          not cereal       1000             250      1250
          sum(col.)        3000            2000      5000

        Lift(play basketball ⇒ eat cereal)     = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
        Lift(play basketball ⇒ not eat cereal) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33

    39                                    Data Warehousing and Data Mining by Kritsada Sriphaew
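
A quick way to re-derive these lift values from the contingency table (plain Python arithmetic):

    n = 5000
    p_basketball, p_cereal, p_both = 3000 / n, 3750 / n, 2000 / n

    lift_cereal = p_both / (p_basketball * p_cereal)              # 0.4 / 0.45
    lift_not_cereal = (1000 / n) / (p_basketball * (1250 / n))    # 0.2 / 0.15
    print(round(lift_cereal, 3), round(lift_not_cereal, 3))       # 0.889 1.333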
Conviction
    Conviction (Brin, 1997)

        conviction(A ⇒ B) = (1 - support(B)) / (1 - confidence(A ⇒ B))

        0 ≤ conv(A ⇒ B) ≤ ∞
        A and B are statistically independent if and only if conv(A ⇒ B) = 1
        0 < conv(A ⇒ B) < 1 if and only if P(B|A) < P(B):
         B is negatively correlated with A.
        1 < conv(A ⇒ B) < ∞ if and only if P(B|A) > P(B):
         B is positively correlated with A.

                     basketball  not basketball  sum(row)
          cereal           2000            1750      3750
          not cereal       1000             250      1250
          sum(col.)        3000            2000      5000

         conviction(play basketball ⇒ eat cereal)     = (1 - 3750/5000) / (1 - 0.667) ≈ 0.75
         conviction(play basketball ⇒ not eat cereal) = (1 - 1250/5000) / (1 - 0.333) ≈ 1.125
    40                                        Data Warehousing and Data Mining by Kritsada Sriphaew
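
The same table pushed through the conviction formula leads to the same qualitative conclusion as lift: the cereal rule scores below 1 and the not-cereal rule above 1.

    def conviction(sup_consequent, conf_rule):
        return (1 - sup_consequent) / (1 - conf_rule)

    # play basketball => eat cereal: sup(cereal) = 3750/5000, conf = 2000/3000
    print(round(conviction(3750 / 5000, 2000 / 3000), 3))   # 0.75  (< 1: negatively correlated)
    # play basketball => not eat cereal: sup(not cereal) = 1250/5000, conf = 1000/3000
    print(round(conviction(1250 / 5000, 1000 / 3000), 3))   # 1.125 (> 1: positively correlated)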
From Association Mining to Correlation Analysis

    Ex. Strong rules are not necessarily interesting
         Of 10,000 transactions:
             • 6,000 customer transactions include computer games
             • 7,500 customer transactions include videos
             • 4,000 customer transactions include both computer games and videos

         • Suppose that a data mining program for discovering association rules is run
           on the data, using min_sup of 30% and min_conf of 60%
         • The following association rule is discovered:

         buys(X, “computer games”) ⇒ buys(X, “videos”)      [s=40%, c=66%]
                                          s = 4000/10000,  c = 4000/6000
    41
A misleading “strong” association rule

         buys(X, “computer games”)  buys(X, “videos”)
                     [support=40%, confidence=66%]

    This rule is misleading because the probability of purchasing video is
     75% (>66%)
    In fact, computer games and videos are negatively associated because
     the purchase of one of these items actually decreases the likelihood of
     purchasing the other. Therefore, we could easily make unwise business
     decisions based on this rule



    42
                                      Data Warehousing and Data Mining by Kritsada Sriphaew
From Association Analysis to Correlation
Analysis
    To help filter out misleading “strong” association
    Correlation rules
        A  B [support, confidence, correlation]
    Lift is a simple correlation measure that is given as follows
        The occurrence of itemset A is independent of the occurrence of itemset B if
         P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated
        lift(A, B) = P(A ∪ B) / (P(A)P(B)) = P(B|A) / P(B) = conf(A ⇒ B) / sup(B)
        If lift(A, B) < 1, then the occurrence of A is negatively correlated with the
         occurrence of B
        If lift(A, B) > 1, then A and B are positively correlated, meaning that the
         occurrence of one implies the occurrence of the other.


    43
                                               Data Warehousing and Data Mining by Kritsada Sriphaew
From Association Analysis to Correlation
Analysis (Cont.)
    Ex. Correlation analysis using lift

         buys(X, “computer games”)  buys(X, “videos”)
                     [support=40%, confidence=66%]

        The lift of this rule is
         P{game,video} / (P{game} × P{video}) = 0.40/(0.6 ×0.75) = 0.89
        There is a negative correlation between the occurrence of {game} and {video}



    Ex. Is the following rule misleading?
        Buy walnuts ⇒ Buy milk   [1%, 80%]
        if 85% of customers buy milk

    44
                                               Data Warehousing and Data Mining by Kritsada Sriphaew
Homework
   Given a transactional database, a LOG file that records each user's visits to web pages
    over a period of time, find reliable association rules. Assume that you are the data
    analyst and may set the minimum support and minimum confidence yourself; explain the
    reasons for the values you choose, and check whether the resulting rules are misleading
    and, if so, how you would fix them.
                       TID                          List of items
                      T001          P1, P2, P3, P4
                      T002          P3, P6
                      T003          P2, P5, P1
                      T004          P5, P4, P3,P6
                      T005          P1, P3, P4, P2


                     (Figure: item hierarchy over items P1 to P6.)
    45
Feb 26, 2011 (14:00)
   Quiz I
       Star-net Query (Multidimensional Table)
       Data Cube Computation (Memory Calculation)
       Data Preprocessing (Normalization, Smoothing by binning)
       Association Rule Mining




46                               Data Warehousing and Data Mining by Kritsada Sriphaew
