Mining Frequent Patterns
□ In this topic,
❑ Basic concepts
■ Market Basket Analysis: A Motivating Example
■ Frequent Itemsets, Closed Itemsets, and Association Rules
❑ Frequent Itemset Mining Methods:
■ Apriori Algorithm
■ Generating Association Rules from Frequent Itemsets
■ Improving the Efficiency of Apriori
■ A Pattern-Growth Approach for Mining Frequent Itemsets
■ Mining Frequent Itemsets Using Vertical Data Format
■ Mining Closed and Max Patterns
❑ Which Patterns Are Interesting?—Pattern Evaluation Methods
Mining Frequent Patterns, Associations,
and Correlations
3
□ Imagine that you are a sales manager at AllElectronics, and you are talking to a
customer who recently bought a PC and a digital camera from the store.
□ What should you recommend to her next?
□ Information about which products are frequently purchased by your customers
following their purchases of a PC and a digital camera in sequence would be
very helpful in making your recommendation.
□ Frequent patterns and association rules are the knowledge that you want to mine
in such a scenario.
Mining Frequent Patterns, Associations,
and Correlations
4
□ Frequent pattern mining searches for recurring relationships in a given data set.
□ Basic concepts of frequent pattern mining for the discovery of interesting associations and correlations between itemsets in transactional and relational databases:
❑ An example of market basket analysis, the earliest form of frequent pattern mining for association rules
❑ Basic concepts of mining frequent patterns and associations
6.1.1 Market Basket Analysis:
A Motivating Example
6
□ Frequent itemset mining leads to the discovery of associations and correlations among
items in large transactional or relational data sets.
□ With massive amounts of data continuously being collected and stored, many industries
are becoming interested in mining such patterns from their databases.
□ Such patterns help in many business decision-making processes, such as catalog design, cross-marketing, and customer shopping behavior analysis.
□ A typical example of frequent itemset mining is market basket analysis: analyzing customer buying habits by finding associations between the different items that customers place in their “shopping baskets.” Which items are frequently purchased together by customers?
[Figure: shopping baskets of several customers (e.g., sugar, eggs), examined by a market analyst to find which items are frequently purchased together.]
Market Basket Analysis:
A Motivating Example
7
□ Suppose, as manager of an AllElectronics branch, you would like to learn more about
the buying habits of your customers.
□ Specifically, you wonder, “Which groups or sets of items are customers likely to purchase
on a given trip to the store?”
□ Each item has a Boolean variable representing the presence or absence of that item.
□ Each basket can then be represented by a Boolean vector of values assigned to these
variables.
□ The Boolean vectors can be analyzed for buying patterns that reflect items that are
frequently associated or purchased together. These patterns can be represented in
the form of association rules.
□ Ex: The observation that customers who purchase computers also tend to buy antivirus software at the same time can be represented by the following association rule:
computer ⇒ antivirus software [support = 2%, confidence = 60%]
Market Basket Analysis:
A Motivating Example
8
□ Let I = {I1, I2, …, Im} be the set of all items. Let D, the task-relevant data, be a set of database transactions where each transaction T is a nonempty itemset such that T ⊆ I.
□ Let A be a set of items. A transaction T is said to contain A if A ⊆ T.
support(A ⇒ B) = P(A ∪ B)
confidence(A ⇒ B) = P(B|A) = support(A ∪ B) / support(A) = support_count(A ∪ B) / support_count(A)
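The two definitions above can be checked directly on a toy transaction list. A minimal sketch in Python, assuming illustrative item names and data (not the AllElectronics database):

```python
transactions = [
    {'computer', 'antivirus software'},
    {'computer', 'printer'},
    {'computer', 'antivirus software', 'printer'},
    {'printer'},
    {'computer'},
]

def support_count(itemset):
    """Number of transactions in D that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t)

A, B = {'computer'}, {'antivirus software'}
support = support_count(A | B) / len(transactions)      # P(A ∪ B)
confidence = support_count(A | B) / support_count(A)    # P(B|A)
print(f"computer => antivirus software "
      f"[support = {support:.0%}, confidence = {confidence:.0%}]")
```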
Frequent Itemsets, Closed Itemsets,
and Association Rules
11
□ A major challenge in mining frequent itemsets from a large data set is the fact that
such mining often generates a huge number of itemsets satisfying the
minimum support (min sup) threshold, especially when min sup is set low.
□ This is because if an itemset is frequent, each of its subsets is frequent as well.
□ A long itemset will contain a combinatorial number of shorter, frequent sub-
itemsets.
Closed Patterns and Max-Patterns
14
Closed Patterns and Max-Patterns
15
□ Solution: Mine closed patterns and max-patterns instead
□ An itemset X is closed in a data set D if there exists no proper super-itemset Y such
that Y has the same support count as X in D.
□ TDB1 = {<a1, …, a100>, <a1, …, a50>} and min_sup = 1.
□ How many closed patterns does TDB1 contain?
❑ Closed itemsets: C = {{a1, a2, …, a100} : 1; {a1, a2, …, a50} : 2}
■ {a2, a45} : 2, since {a2, a45} is a sub-itemset of the itemset {a1, a2, …, a50} : 2;
■ {a8, a55} : 1, since {a8, a55} is not a sub-itemset of the previous itemset but of the itemset {a1, a2, …, a100} : 1.
□ A closed pattern is a lossless compression of frequent patterns.
❑ It reduces the number of patterns but does not lose the support information!
□ An itemset X is a closed frequent itemset in a data set D if X is both closed and frequent in D.
Closed Patterns and Max-Patterns
16
□ An itemset X is a maximal frequent itemset (or max-itemset) in a data set D if X
is frequent, and there exists no super-itemset Y such that X ⊂Y and Y is
frequent in D.
❑ For TDB1 with min_sup = 1, the only maximal frequent itemset is <a1, …, a100> : 1.
■ From the maximal frequent itemset, we can only assert that both itemsets ({a2, a45} and {a8, a55}) are frequent, but we cannot assert their actual support counts.
■ Max-pattern is a lossy compression!
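Given all frequent itemsets with their support counts (for example, the output of the Apriori sketch that appears later in these notes), the closed and maximal ones can be filtered out directly from the definitions above. A minimal sketch; the function name and input format are assumptions:

```python
def closed_and_maximal(frequent):
    """frequent: {frozenset(itemset): support_count} -> (closed, maximal) dicts."""
    closed, maximal = {}, {}
    for x, sup_x in frequent.items():
        # closed: no proper superset with exactly the same support count
        if not any(x < y and sup_y == sup_x for y, sup_y in frequent.items()):
            closed[x] = sup_x
        # maximal: no proper superset that is frequent at all
        if not any(x < y for y in frequent):
            maximal[x] = sup_x
    return closed, maximal
```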
Closed Patterns and Max-Patterns
17
□ When we have many items, the total number of itemsets grows exponentially with the number of items in the database: |itemsets| = 2^n. This immediately causes both space and time complexity problems in finding frequent patterns.
□ For example, even a small database with five transaction records and six items already admits 2^6 = 64 possible itemsets.
□ To deal with this problem, we now define a series of sets.
Sub-pattern (proper subset) &
Super-pattern (proper superset)
18
□ To find efficient algorithms for mining frequent patterns, we will start with two basic concepts about patterns.
□ A pattern is an itemset, so the order of the items is not important.
□ What will be useful are the properties of subsets and supersets of frequent patterns.
□ Let us start with subsets: any proper subset of a pattern is a sub-pattern.
□ Ex: If a pattern X contains 3 items, X = {a, b, c}, then any proper subset, such as {a}, {b}, {c}, {a, b}, {a, c}, or {b, c}, is a sub-pattern.
□ The other concept is, of course, the super-pattern. A super-pattern of a pattern is a proper superset of the pattern.
□ Here are some super-patterns of the pattern {a, b, c}: abcd, abce, abcde, …
Closed Pattern & Max Pattern
19
□ In this topic, methods for mining the simplest form of frequent patterns:
❑ Apriori, the basic algorithm for finding frequent itemsets
❑ Generating strong association rules from frequent itemsets
❑ Several variations of the Apriori algorithm for improved efficiency and scalability
❑ Frequent pattern-growth (FP-growth) methods for mining frequent itemsets
❑ Methods for mining frequent itemsets from the vertical data format
Apriori Algorithm: Finding Frequent Itemsets
by Confined Candidate Generation
22
□ Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
□ if an itemset I does not satisfy the minimum support threshold, min sup, then I is
not frequent, that is, P(I) < min sup.
□ If an item A is added to the itemset I, then the resulting itemset (i.e., I ∪ A) cannot
occur more frequently than I . Therefore, I ∪ A is not frequent either, that
is, P(I∪A)<min sup.
□ This property belongs to a special category of properties called antimonotonicity
in the sense that if a set cannot pass a test, all of its supersets will fail the same
test as well. It is called antimonotonicity because the property is
monotonic in the context of failing a test.
Apriori Algorithm:
24
❑ Join: Ck is generated by joining Lk−1 with itself.
❑ Prune: Ck is a superset of Lk; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. By the Apriori property, any candidate with an infrequent (k−1)-subset can be removed from Ck.
Apriori Algorithm:
25
min_sup = 2
Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E
Another Example
34
C2 (candidate 2-itemsets): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
C2 after the 2nd scan (support counts):
Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2
L2 (frequent 2-itemsets):
Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2
C3 (candidate 3-itemsets): {B, C, E}
L3 after the 3rd scan (frequent 3-itemsets):
Itemset      sup
{B, C, E}    2
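The level-wise walk-through above can be reproduced with a short Apriori sketch. This is an illustrative implementation, not the deck's pseudocode; the function and variable names are assumptions:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets whose union has exactly k items
        prev = list(current)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k:
                    # Prune step (Apriori property): every (k-1)-subset must be frequent
                    if all(frozenset(sub) in current
                           for sub in combinations(union, k - 1)):
                        candidates.add(union)
        # Scan D to count the surviving candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(current)
        k += 1
    return frequent

# Reproduces the walk-through above (min_sup = 2)
tdb = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
for itemset, sup in sorted(apriori(tdb, 2).items(),
                           key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), sup)
```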
Improving the Efficiency of Apriori
35
□ Hash-Based Technique:
□ A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1.
□ Ex: Generate L1 (the frequent 1-itemsets) and create hash table H2 for the candidate 2-itemsets, using hash function h(x, y) = ((order of x) × 10 + (order of y)) mod 7:

bucket address    0          1          2          3          4          5          6
bucket count      2          2          4          2          2          4          4
bucket contents   {I1, I4}   {I1, I5}   {I2, I3}   {I2, I4}   {I2, I5}   {I1, I2}   {I1, I3}
                  {I3, I5}   {I1, I5}   {I2, I3}   {I2, I4}   {I2, I5}   {I1, I2}   {I1, I3}
                                        {I2, I3}                         {I1, I2}   {I1, I3}
                                        {I2, I3}                         {I1, I2}   {I1, I3}
□ A 2-itemset with a corresponding bucket count in the hash table that is below the support threshold cannot be frequent and thus should be removed from the candidate set.
□ Such a hash-based technique may substantially reduce the number of candidate k-itemsets examined (especially the candidate 2-itemsets).
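A minimal sketch of the hash-based pruning idea described above. The nine transactions are the Table 6.1 data that appears later in the deck, and the hash function matches the slide; the code structure itself is an illustrative assumption:

```python
from itertools import combinations

transactions = [
    {'I1', 'I2', 'I5'}, {'I2', 'I4'}, {'I2', 'I3'},
    {'I1', 'I2', 'I4'}, {'I1', 'I3'}, {'I2', 'I3'},
    {'I1', 'I3'}, {'I1', 'I2', 'I3', 'I5'}, {'I1', 'I2', 'I3'},
]

def order(item):
    return int(item[1:])                      # order of item I_k is k

def h(x, y):
    return (order(x) * 10 + order(y)) % 7     # h(x, y) from the slide

# While scanning D for L1, hash every 2-itemset of every transaction into a bucket
bucket_count = [0] * 7
for t in transactions:
    for x, y in combinations(sorted(t, key=order), 2):
        bucket_count[h(x, y)] += 1

print(bucket_count)   # -> [2, 2, 4, 2, 2, 4, 4], matching the hash table above

# A candidate 2-itemset whose bucket count is below min_sup cannot be frequent
min_sup = 2
def can_be_frequent(x, y):
    return bucket_count[h(x, y)] >= min_sup
```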
Improving the Efficiency of Apriori
36
❑ Partitioning — a two-phase technique (a code sketch follows this list):
■ In phase I, the algorithm divides the transactions of D into n nonoverlapping partitions.
■ If the minimum relative support threshold for transactions in D is min_sup, then the minimum support count for a partition is min_sup × the number of transactions in that partition.
■ For each partition, all the local frequent itemsets (i.e., the itemsets frequent within the partition) are found.
■ The collection of frequent itemsets from all partitions forms the global candidate itemsets with respect to D.
■ In phase II, a second scan of D is conducted, in which the actual support of each candidate is assessed to determine the global frequent itemsets.
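A compact sketch of the two phases described above. The fixed-size partitioning and the local miner passed in as `apriori` (e.g., the earlier Apriori sketch) are illustrative assumptions:

```python
def partition_mine(transactions, min_sup_rel, n_partitions, apriori):
    """apriori(partition, min_count) is any local frequent-itemset miner
    returning {frozenset: count}, e.g. the Apriori sketch shown earlier."""
    size = max(1, len(transactions) // n_partitions)
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    # Phase I: local frequent itemsets per partition; the local threshold
    # scales with the partition size, as described above
    global_candidates = set()
    for part in parts:
        local_min_count = max(1, int(min_sup_rel * len(part)))
        global_candidates |= set(apriori(part, local_min_count))

    # Phase II: one more full scan of D to count the global candidates exactly
    min_count = min_sup_rel * len(transactions)
    support = {c: sum(1 for t in transactions if c <= frozenset(t))
               for c in global_candidates}
    return {c: s for c, s in support.items() if s >= min_count}
```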
Improving the Efficiency of Apriori
38
Algorithm: FP-growth. Mine frequent itemsets using an FP-tree by pattern-fragment growth.
Input:
D, a transaction database;
min_sup, the minimum support count threshold.
Output: The complete set of frequent patterns.
Ex: After the first scan of D, the frequent 1-itemsets in support-count-descending order are L = {K : 5, E : 4, M : 3, O : 3, Y : 3}.
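A minimal sketch of the FP-tree construction step that FP-growth relies on: each transaction is filtered to its frequent items and inserted in the frequency-descending order L, sharing prefixes. The Node class and function names are assumptions, and the header-table/node-link bookkeeping that full FP-growth needs is omitted:

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent, self.children = item, 1, parent, {}

def build_fp_tree(transactions, min_sup):
    # First scan of D: count item supports and fix the frequency-descending order L
    freq = {}
    for t in transactions:
        for item in t:
            freq[item] = freq.get(item, 0) + 1
    L = [i for i, c in sorted(freq.items(), key=lambda x: -x[1]) if c >= min_sup]

    # Second scan: filter and re-order each transaction by L, then insert it into
    # the tree, incrementing counts along shared prefixes
    root = Node(None, None)
    for t in transactions:
        node = root
        for item in [i for i in L if i in t]:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
            node = node.children[item]
    return root, L
```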
Build Conditional Frequent Pattern Tree
52
Generate Frequent Pattern rules
53
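A minimal sketch of generating strong association rules from frequent itemsets, using confidence(s ⇒ l − s) = support_count(l) / support_count(s) as defined earlier. The input format (e.g., the dict returned by the Apriori sketch above) and the min_conf threshold are assumptions:

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """frequent: {frozenset(itemset): support_count}.
    Returns (antecedent, consequent, confidence) triples.
    Relies on every nonempty subset of a frequent itemset also being present."""
    rules = []
    for l, sup_l in frequent.items():
        if len(l) < 2:
            continue
        for size in range(1, len(l)):
            for s in map(frozenset, combinations(l, size)):
                conf = sup_l / frequent[s]   # confidence(s => l - s)
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules
```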
Mining Frequent Itemsets Using the Vertical
Data Format
54
Table 6.3 The Vertical Data Format of the Transaction
Data Set D of Table 6.1
itemset   TID set
I1        {T100, T400, T500, T700, T800, T900}
I2        {T100, T200, T300, T400, T600, T800, T900}
I3        {T300, T500, T600, T700, T800, T900}
I4        {T200, T400}
I5        {T100, T800}
□ Completeness
❑ Preserves complete information for frequent pattern mining
❑ Never breaks a long pattern of any transaction
□ Compactness
❑ Reduces irrelevant info—infrequent items are gone
❑ Items are in frequency-descending order: the more frequently occurring, the more likely to be shared
❑ Never larger than the original database (not counting node-links and the count fields)
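With the vertical data format, the support of any itemset is simply the size of the intersection of its items' TID sets. A small sketch using exactly the Table 6.3 TID sets; min_sup = 2 is assumed for illustration:

```python
from itertools import combinations

tidsets = {
    'I1': {'T100', 'T400', 'T500', 'T700', 'T800', 'T900'},
    'I2': {'T100', 'T200', 'T300', 'T400', 'T600', 'T800', 'T900'},
    'I3': {'T300', 'T500', 'T600', 'T700', 'T800', 'T900'},
    'I4': {'T200', 'T400'},
    'I5': {'T100', 'T800'},
}
min_sup = 2

# 2-itemsets: intersect the TID sets of the two items; keep those with >= min_sup TIDs
freq2 = {}
for a, b in combinations(sorted(tidsets), 2):
    tids = tidsets[a] & tidsets[b]
    if len(tids) >= min_sup:
        freq2[(a, b)] = tids

print({k: len(v) for k, v in freq2.items()})
# e.g. ('I1', 'I2') has TID set {T100, T400, T800, T900}, so support count 4
```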
Mining Closed and Max Patterns
56
□ Item merging:
❑ If every transaction containing a frequent itemset X also contains an itemset Y but not any proper superset of Y, then X ∪ Y forms a frequent closed itemset and there is no need to search for any itemset containing X but no Y.
□ Sub-itemset pruning:
❑ If a frequent itemset X is a proper subset of an already found frequent closed itemset Y and support_count(X) = support_count(Y), then X and all of X's descendants in the set enumeration tree cannot be frequent closed itemsets and thus can be pruned.
Mining Closed and Max Patterns
57
□ Item skipping:
❑ In the depth-first mining of closed itemsets, at each level there will be a prefix itemset X associated with a header table and a projected database.
❑ If a local frequent item p has the same support in several header tables at different levels, we can safely prune p from the header tables at higher levels.
Another Example
83
TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3
Another Example
84
Summary
89
91
□ Null-invariant measures:
❑ all_confidence
❑ max_confidence
❑ Kulczynski
❑ cosine
Other measures
110
□ All-confidence:
□ Given two itemsets, A and B, the all_confidence measure of A and B is defined as
all_conf(A, B) = sup(A ∪ B) / max{sup(A), sup(B)} = min{P(A|B), P(B|A)}
□ where max{sup(A), sup(B)} is the maximum support of the Itemsets A and B.
□ Thus, all conf(A,B) is also the minimum confidence of the two association rules
related to A and B, namely, “A ⇒ B” and “B ⇒ A.”
□ Max_confidence:
□ Given two itemsets, A and B, the max_confidence measure of A and B is defined as
max_conf(A, B) = max{P(A|B), P(B|A)}
□ The max_conf measure is the maximum confidence of the two association rules, “A ⇒ B” and “B ⇒ A.”
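Both measures translate directly into code. A small sketch in terms of relative supports, where sup_a, sup_b, and sup_ab stand for P(A), P(B), and P(A ∪ B) (the names are assumptions):

```python
def all_conf(sup_a, sup_b, sup_ab):
    # all_conf(A, B) = sup(A ∪ B) / max{sup(A), sup(B)} = min{P(A|B), P(B|A)}
    return sup_ab / max(sup_a, sup_b)

def max_conf(sup_a, sup_b, sup_ab):
    # max_conf(A, B) = max{P(A|B), P(B|A)}
    return max(sup_ab / sup_a, sup_ab / sup_b)
```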
Other Measures
111
□ Kulczynski measure:
□ Given two itemsets, A and B, the Kulczynski measure of A and B (abbreviated as Kulc) is defined as
Kulc(A, B) = (1/2)(P(A|B) + P(B|A))
□ The average of two conditional probabilities: the probability of itemset B given
itemset A, and the probability of itemset A given itemset B.
□ Cosine measure:
□ Finally, given two itemsets, A and B, the cosine measure of A and B is defined as
cosine(A, B) = sup(A ∪ B) / √(sup(A) × sup(B)) = √(P(A|B) × P(B|A))
□ A related measure, the imbalance ratio (IR), assesses the imbalance of the two itemsets A and B in rule implications:
IR(A, B) = |sup(A) − sup(B)| / (sup(A) + sup(B) − sup(A ∪ B))
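A companion sketch for the Kulczynski, cosine, and imbalance-ratio formulas above, with the same assumed arguments sup_a, sup_b, sup_ab for P(A), P(B), P(A ∪ B):

```python
import math

def kulc(sup_a, sup_b, sup_ab):
    # Kulc(A, B) = (P(A|B) + P(B|A)) / 2
    return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)

def cosine(sup_a, sup_b, sup_ab):
    # cosine(A, B) = sup(A ∪ B) / sqrt(sup(A) * sup(B))
    return sup_ab / math.sqrt(sup_a * sup_b)

def imbalance_ratio(sup_a, sup_b, sup_ab):
    # IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A ∪ B))
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)
```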
Comparison of Null-invariant Measures
118
Example
119
Lift & Chi-Square
120
Null Invariant Measures
121
Unit - II
122
□ Single-dimensional rules:
buys(X, “milk”) ⇒ buys(X, “bread”)
□ Multi-dimensional rules: ≥ 2 dimensions or predicates
❑ Inter-dimension assoc. rules (no repeated predicates):
age(X, “19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
(age, income, buys)
Quantitative Association Rules Based on
Statistical Inference Theory
❑ Since it is unlikely that one buys a Ford Expedition (an SUV) and a Toyota Prius (a hybrid car) together, Ford Expedition and Toyota Prius are likely negatively correlated patterns.
□ Negatively correlated patterns that are infrequent tend to be more interesting
than those that are frequent
Defining Negative Correlated Patterns (I)
131
□ Definition 1 (support-based):
❑ If itemsets X and Y are both frequent but rarely occur together, i.e.,
sup(X ∪ Y) < sup(X) × sup(Y)
❑ then X and Y are negatively correlated
□ Problem: A store sold two needle packages, A and B, 100 times each, but only one transaction contained both A and B.
❑ When there are in total 200 transactions, we have
s(A ∪ B) = 0.005, s(A) × s(B) = 0.25, so s(A ∪ B) < s(A) × s(B)
❑ When there are 10^5 transactions, we have
s(A ∪ B) = 1/10^5, s(A) × s(B) = 1/10^3 × 1/10^3 = 1/10^6, so s(A ∪ B) > s(A) × s(B)
❑ Where is the problem? — Null transactions, i.e., the support-based definition is not null-invariant!
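The needle-package example can be replayed numerically: the support-based test flips when null transactions are added, while a null-invariant measure such as Kulczynski stays the same. A small sketch using the slide's numbers (100 sales each, 1 joint sale):

```python
def support_test(n_total, n_a=100, n_b=100, n_ab=1):
    s_a, s_b, s_ab = n_a / n_total, n_b / n_total, n_ab / n_total
    kulc = 0.5 * (s_ab / s_a + s_ab / s_b)          # null-invariant measure
    return s_ab < s_a * s_b, kulc                   # (support-based verdict, Kulc)

print(support_test(200))     # (True, 0.01)  -> "negatively correlated" at 200 transactions
print(support_test(10**5))   # (False, 0.01) -> verdict flips at 10^5 transactions; Kulc unchanged
```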
Defining Negative Correlated Patterns (II)
132