CS501: DATABASE AND DATA MINING
Mining Frequent Patterns
WHAT IS FREQUENT PATTERN
ANALYSIS?
 Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
 Motivation: Finding inherent regularities in data
 What products were often purchased together?— Beer and diapers?!
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
 Can we automatically classify web documents?
 Applications
 Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
WHY IS FREQ. PATTERN MINING
IMPORTANT?
 Discloses an intrinsic and important property of data sets
 Forms the foundation for many essential data mining tasks
 Association, correlation, and causality analysis
 Sequential, structural (e.g., sub-graph) patterns
 Pattern analysis in spatiotemporal, multimedia, time-
series, and stream data
 Classification: associative classification
 Cluster analysis: frequent pattern-based clustering
 Data warehousing: iceberg cube and cube-gradient
 Semantic data compression: fascicles
 Broad applications
BASIC CONCEPTS: FREQUENT
PATTERNS AND ASSOCIATION RULES
 Itemset X = {x1, …, xk}
 Find all the rules X ⇒ Y with minimum support and confidence
 support, s: probability that a transaction contains X ∪ Y
 confidence, c: conditional probability that a transaction having X also contains Y

[Figure: Venn diagram of customers who buy beer, buy diapers, or buy both]

Transaction-id | Items bought
10 | A, B, D
20 | A, C, D
30 | A, D, E
40 | B, E, F
50 | B, C, D, E, F

Let supmin = 50%, confmin = 50%
Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A ⇒ D (60%, 100%)
D ⇒ A (60%, 75%)
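To make support and confidence concrete, here is a minimal Python sketch (not part of the original slides; the function names and layout are illustrative) that reproduces the numbers in the example above:

transactions = [
    {"A", "B", "D"},
    {"A", "C", "D"},
    {"A", "D", "E"},
    {"B", "E", "F"},
    {"B", "C", "D", "E", "F"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Conditional probability that a transaction containing `lhs` also contains `rhs`."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"A", "D"}, transactions))        # 0.6  -> 60% support for A => D
print(confidence({"A"}, {"D"}, transactions))   # 1.0  -> 100% confidence for A => D
print(confidence({"D"}, {"A"}, transactions))   # 0.75 -> 75% confidence for D => A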
CLOSED PATTERNS AND MAX-
PATTERNS
 A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains
C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
 Solution: Mine closed patterns and max-patterns instead
 An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
 An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
 Closed pattern is a lossless compression of freq. patterns
Reducing the # of patterns and rules
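The two definitions translate directly into a check over the frequent itemsets and their supports. The following Python sketch is illustrative only (the helper name and toy supports are assumptions, not from the slides):

def closed_and_max(freq):                      # freq: dict {frozenset: support count}
    closed, maximal = [], []
    for X, sup in freq.items():
        supersets = [Y for Y in freq if X < Y]             # proper frequent supersets of X
        if not any(freq[Y] == sup for Y in supersets):
            closed.append(X)                                # no superset with the same support
        if not supersets:
            maximal.append(X)                               # no frequent superset at all
    return closed, maximal

freq = {frozenset("A"): 3, frozenset("D"): 4, frozenset("AD"): 3}
print(closed_and_max(freq))
# {D} and {A,D} are closed; only {A,D} is maximal; {A} is neither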
SCALABLE METHODS FOR MINING
FREQUENT PATTERNS
 The downward closure property of frequent patterns
Any subset of a frequent itemset must be frequent
If {beer, diaper, nuts} is frequent, so is {beer,
diaper}
i.e., every transaction having {beer, diaper, nuts} also
contains {beer, diaper}
 Scalable mining methods: Three major approaches
Apriori
Freq. pattern growth
Vertical data format approach
APRIORI: A CANDIDATE GENERATION-AND-TEST
APPROACH
 Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
 Method:
Initially, scan DB once to get frequent 1-itemset
Generate length (k+1) candidate itemsets from length
k frequent itemsets
Test the candidates against DB
Terminate when no frequent or candidate set can be
generated
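The method described above is a level-wise generate-and-test loop. Below is a minimal Python sketch of that loop (names are illustrative; the toy database is the one used in the example on the next slide, with a minimum support count of 2):

from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]
    # Scan DB once to get candidate 1-itemsets
    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    while current:
        # Test the candidates against the DB
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        # Generate length-(k+1) candidates from length-k frequent itemsets (self-join),
        # keeping only those whose k-subsets are all frequent (Apriori pruning)
        current = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        current = {c for c in current
                   if all(frozenset(s) in level for s in combinations(c, len(c) - 1))}
    return frequent

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(db, 2))   # includes {B,C,E} with support 2, as in the example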
THE APRIORI ALGORITHM—AN
EXAMPLE
Supmin = 2

Database TDB
Tid | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (frequent 1-itemsets): {A}:2, {B}:3, {C}:3, {E}:3

C2 (candidates from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 with counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2 (frequent 2-itemsets): {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (candidate from L2): {B,C,E}
3rd scan → L3 (frequent 3-itemsets): {B,C,E}:2
IMPORTANT DETAILS OF APRIORI
 How to generate candidates?
 Step 1: self-joining Lk
 Step 2: pruning
 How to count supports of candidates?
 Example of Candidate-generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3
 C4={abcd}
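A compact Python sketch of the self-join and prune steps on the example above (the function name is an assumption; itemsets are represented as sorted tuples):

from itertools import combinations

def gen_candidates(Lk):
    """Self-join Lk with itself, then prune candidates with an infrequent k-subset."""
    k = len(next(iter(Lk)))
    Lk = set(Lk)
    joined = set()
    for a in Lk:
        for b in Lk:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:      # join on the first k-1 items
                joined.add(a + (b[-1],))
    # Prune: every k-subset of a candidate must itself be in Lk
    return {c for c in joined if all(s in Lk for s in combinations(c, k))}

L3 = {tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")}
print(gen_candidates(L3))   # {('a','b','c','d')} — acde is pruned because ade is not in L3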
HOW TO COUNT SUPPORTS OF
CANDIDATES?
 Why is counting supports of candidates a problem?
The total number of candidates can be very huge
 One transaction may contain many candidates
 Method:
Candidate itemsets are stored in a hash-tree
Leaf node of hash-tree contains a list of itemsets
and counts
Interior node contains a hash table
Subset function: finds all the candidates contained
in a transaction
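The hash-tree itself is too involved for a short sketch, but the naive version below shows what the subset function has to compute; the hash-tree on the slide replaces the inner loop with a tree walk so that only a few candidates are checked per transaction (names are assumptions):

def count_supports(candidates, transactions):
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:            # the "subset function": candidates contained in t
            if c <= t:
                counts[c] += 1
    return counts

print(count_supports({frozenset("AC"), frozenset("BE")},
                     [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]))
# {frozenset({'A','C'}): 2, frozenset({'B','E'}): 3}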
BOTTLENECK OF FREQUENT-
PATTERN MINING
 Multiple database scans are costly
 Mining long patterns needs many passes of scanning
and generates lots of candidates
 To find frequent itemset i1 i2 … i100
 # of scans: 100
 # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 !
 Bottleneck: candidate-generation-and-test
 Can we avoid candidate generation?
FREQUENT ITEMSET GENERATION
 Apriori: uses a generate-and-test approach –
generates candidate itemsets and tests if they are
frequent
 Generation of candidate itemsets is expensive (in both space and time)
 Support counting is expensive
 Subset checking (computationally expensive)
 Multiple Database scans (I/O)
 FP-Growth: allows frequent itemset discovery without candidate itemset generation. Two-step approach:
 Step 1: Build a compact data structure called the FP-tree
 Built using 2 passes over the data-set.
 Step 2: Extract frequent itemsets directly from the FP-tree
STEP 1: FP-TREE CONSTRUCTION
 FP-Tree is constructed using 2 passes over the
data-set:
Pass 1:
 Scan data and find support for each item.
 Discard infrequent items.
 Sort frequent items in decreasing order based on
their support.
Use this order when building the FP-Tree, so
common prefixes can be shared.
STEP 1: FP-TREE CONSTRUCTION
Pass 2:
Nodes correspond to items and have a counter.
1. FP-Growth reads one transaction at a time and maps it to a path.
2. A fixed item order is used, so paths can overlap when transactions share items (when they have the same prefix). In this case, counters are incremented.
3. Pointers are maintained between nodes containing the same item, creating singly linked lists (dotted lines).
 The more paths that overlap, the higher the compression. The FP-tree may then fit in memory.
4. Frequent itemsets are then extracted from the FP-tree.
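The two passes translate into a short routine. The sketch below is a simplified illustration only (the class and variable names are assumptions, and ties in the item ordering are broken arbitrarily):

from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

def build_fptree(transactions, min_sup):
    # Pass 1: count item supports, discard infrequent items
    support = defaultdict(int)
    for t in transactions:
        for i in t:
            support[i] += 1
    order = {i: s for i, s in support.items() if s >= min_sup}

    # Pass 2: insert each transaction as a path, sharing common prefixes
    root = Node(None, None)
    header = defaultdict(list)          # item -> list of its nodes (the dotted linked list)
    for t in transactions:
        items = sorted((i for i in t if i in order), key=lambda i: -order[i])
        node = root
        for i in items:
            child = node.children.get(i)
            if child is None:
                child = Node(i, node)
                node.children[i] = child
                header[i].append(child)
            else:
                child.count += 1        # shared prefix: just bump the counter
            node = child
    return root, header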
STEP 1: FP-TREE CONSTRUCTION (EXAMPLE)
[Figure: step-by-step construction of the FP-tree from an example data set]
FP-TREE SIZE
 The FP-Tree usually has a smaller size than the
uncompressed data - typically many transactions
share items (and hence prefixes).
 Best case scenario: all transactions contain the same set of
items.
 1 path in the FP-tree
 Worst case scenario: every transaction has a unique set of
items (no items in common)
 Size of the FP-tree is at least as large as the original data.
 Storage requirements for the FP-tree are higher - need to store
the pointers between the nodes and the counters.
 The size of the FP-tree depends on how the items are
ordered
 Ordering by decreasing support is typically used but
it does not always lead to the smallest tree (it's a
heuristic).
STEP 2: FREQUENT ITEMSET
GENERATION
 FP-Growth extracts frequent itemsets from the
FP-tree.
 Bottom-up algorithm - from the leaves towards
the root
 Divide and conquer: first look for frequent
itemsets ending in e, then de, etc. . . then d, then
cd, etc. . .
 First, extract prefix path sub-trees ending in an
item(set). (hint: use the linked lists)
PREFIX PATH SUB-TREES
(EXAMPLE)
[Figure: prefix path sub-trees extracted from the FP-tree]
STEP 2: FREQUENT ITEMSET
GENERATION
 Each prefix path sub-tree is processed
recursively to extract the frequent
itemsets. Solutions are then merged.
 E.g. the prefix path sub-tree for e will be used to extract frequent itemsets ending in e, then in de, ce, be and ae, then in cde, bde, ade, etc.
 Divide and conquer approach
CONDITIONAL FP-TREE
 The FP-Tree that would be built if we only consider
transactions containing a particular itemset (and then
removing that itemset from all transactions).
 Example: FP-Tree conditional on e.
EXAMPLE
Let minSup = 2 and extract all frequent itemsets
containing e.
 1. Obtain the prefix path sub-tree for e:
EXAMPLE
 2. Check if e is a frequent item by adding the counts along the linked list (dotted line). If so, extract it.
 Yes, count = 3, so {e} is extracted as a frequent itemset.
 3. As e is frequent, find frequent itemsets ending in e, i.e. de, ce, be and ae.
EXAMPLE
 4. Use the conditional FP-tree for e to find frequent itemsets ending in de, ce and ae
 Note that be is not considered as b is not in the
conditional FP-tree for e.
 For each of them (e.g. de), find the prefix paths
from the conditional tree for e, extract frequent
itemsets, generate conditional FP-tree, etc...
(recursive)
EXAMPLE
 Example: e -> de -> ade ({d,e}, {a,d,e} are found to be
frequent)
 Example: e -> ce ({c,e} is found to be frequent)
RESULT
Frequent itemsets found (ordered by suffix and the order in which they are found):
[Figure: table of the resulting frequent itemsets]
DISCUSSION
 Advantages of FP-Growth
 only 2 passes over data-set
 “compresses” data-set
 no candidate generation
 much faster than Apriori
 Disadvantages of FP-Growth
 FP-Tree may not fit in memory!!
 FP-Tree is expensive to build
ECLAT: ANOTHER METHOD FOR FREQUENT
ITEMSET GENERATION
 ECLAT: for each item, store a list of transaction ids
(tids); vertical data layout
[Figure: vertical data layout - one TID-list per item]
ECLAT: ANOTHER METHOD FOR FREQUENT
ITEMSET GENERATION
 Determine support of any k-itemset by intersecting tid-
lists of two of its (k-1) subsets.
 Advantage: very fast support counting
 Disadvantage: intermediate tid-lists may become too
large for memory
Example: t(A) = {1, 4, 5, 6, 7, 8, 9}, t(B) = {1, 2, 5, 7, 8, 10}
t(A) ∧ t(B) → t(AB) = {1, 5, 7, 8}
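A minimal ECLAT-style sketch of this intersection in Python (the variable and function names are illustrative; tid-lists are plain sets):

tidlists = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
}

def tidlist(itemset, tidlists):
    """Tid-list of an itemset = intersection of its items' tid-lists;
    the support count is simply the size of that intersection."""
    tids = None
    for item in itemset:
        tids = tidlists[item] if tids is None else tids & tidlists[item]
    return tids

print(sorted(tidlist({"A", "B"}, tidlists)))   # [1, 5, 7, 8]
print(len(tidlist({"A", "B"}, tidlists)))      # support count of AB = 4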
INTERESTINGNESS MEASURE: CORRELATIONS
(LIFT)
 play basketball ⇒ eat cereal [40%, 66.7%] is misleading
 The overall % of students eating cereal is 75% > 66.7%.
 play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate,
although with lower support and confidence
           | Basketball | Not basketball | Sum (row)
Cereal     | 2000       | 1750           | 3750
Not cereal | 1000       | 250            | 1250
Sum (col.) | 3000       | 2000           | 5000
LIFT
 Measure of dependent/correlated events: lift
 Lift = 1 , A & B are independent
 Lift > 1, A & B are positively correlated
 Lift<1, A & B are negatively correlated
lift = P(A ∪ B) / (P(A) × P(B))

lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
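The same numbers worked out in a short Python sketch (variable names are illustrative; probabilities are read off the contingency table above):

n = 5000
p_b = 3000 / n               # P(basketball)
p_c = 3750 / n               # P(cereal)
p_b_and_c = 2000 / n         # P(basketball and cereal)
p_b_and_not_c = 1000 / n     # P(basketball and not cereal)

lift_b_c = p_b_and_c / (p_b * p_c)                  # ≈ 0.89 -> negatively correlated
lift_b_not_c = p_b_and_not_c / (p_b * (1 - p_c))    # ≈ 1.33 -> positively correlated
print(round(lift_b_c, 2), round(lift_b_not_c, 2))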