
Unit - II

Mining Frequent Patterns, Associations, Correlations

(Contents: Text book 2 - Chapter 4)


Mining Frequent Patterns, Associations,
and Correlations
2

□ In this topic,
❑ Basic concepts
■ Market Basket Analysis: A Motivating Example
■ Frequent Itemsets, Closed Itemsets, and Association Rules
❑ Frequent Itemset Mining Methods:
■ Apriori Algorithm
■ Generating Association Rules from Frequent Itemsets
■ Improving the Efficiency of Apriori
■ A Pattern-Growth Approach for Mining Frequent Itemsets
■ Mining Frequent Itemsets Using Vertical Data Format
■ Mining Closed and Max Patterns
❑ Which Patterns Are Interesting?—Pattern Evaluation Methods
Mining Frequent Patterns, Associations,
and Correlations
3

□ Imagine that you are a sales manager at AllElectronics, and you are talking to a
customer who recently bought a PC and a digital camera from the store.
□ What should you recommend to her next?
□ Information about which products are frequently purchased by your customers
following their purchases of a PC and a digital camera in sequence would be
very helpful in making your recommendation.
□ Frequent patterns and association rules are the knowledge that you want to mine
in such a scenario.
Mining Frequent Patterns, Associations,
and Correlations
4

□ Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures) that appear frequently in a data set.
□ For example, a set of items, such as milk and bread, that appear frequently
together in a transaction data set is a frequent itemset.
□ A subsequence, such as buying first a PC, then a digital camera, and then a memory
card, if it occurs frequently in a shopping history database, is a
(frequent) sequential pattern.
□ A substructure can refer to different structural forms, such as subgraphs, subtrees,
or sublattices, which may be combined with itemsets or subsequences. If
a substructure occurs frequently, it is called a (frequent) structured pattern.
□ Finding frequent patterns plays an essential role in mining associations,
correlations, and many other interesting relationships among data.
□ Moreover, it helps in data classification, clustering, and other data mining tasks.
Basic Concepts
5

□ Frequent pattern mining searches for recurring relationships in a given data set.
□ This topic introduces the basic concepts of frequent pattern mining for the discovery of interesting associations and correlations between itemsets in transactional and relational databases.
❑ Example of market basket analysis, the earliest form of frequent pattern mining for association rules
❑ Basic concepts of mining frequent patterns and associations
6.1.1 Market Basket Analysis:
A Motivating Example
6
□ Frequent itemset mining leads to the discovery of associations and correlations among
items in large transactional or relational data sets.
□ With massive amounts of data continuously being collected and stored, many industries
are becoming interested in mining such patterns from their databases.
□ Help in many business decision-making processes such as catalog design, cross-
marketing, and customer shopping behavior analysis.
□ A typical example of frequent itemset mining is market basket analysis - analyzing customer buying habits by finding associations between the different items that customers place in their "shopping baskets". Which items are frequently purchased together by customers?

[Figure: Shopping baskets of Customer 1 through Customer n, containing items such as bread, milk, cereal, sugar, eggs, and butter; the market analyst asks which items are frequently purchased together.]
Market Basket Analysis:
A Motivating Example
7

□ Suppose, as manager of an AllElectronics branch, you would like to learn more about
the buying habits of your customers.
□ Specifically, you wonder, “Which groups or sets of items are customers likely to purchase
on a given trip to the store?”
□ Each item has a Boolean variable representing the presence or absence of that item.

□ Each basket can then be represented by a Boolean vector of values assigned to these
variables.
□ The Boolean vectors can be analyzed for buying patterns that reflect items that are
frequently associated or purchased together. These patterns can be represented in
the form of association rules.
□ Ex: The observation that customers who purchase computers also tend to buy antivirus software at the same time is represented by the following association rule:
computer ⇒ antivirus software [support = 2%, confidence = 60%]
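To make the Boolean-vector view described above concrete, here is a minimal, self-contained Python sketch; the item universe and transactions are invented for illustration and are not from the slides.

```python
# A minimal sketch of the Boolean-vector view of market baskets.
# The item universe and transactions below are made-up examples.
items = ["computer", "antivirus_software", "printer", "camera"]

transactions = [
    {"computer", "antivirus_software"},
    {"computer", "printer"},
    {"computer", "antivirus_software", "camera"},
    {"camera"},
]

# Each basket becomes a Boolean vector over the item universe.
boolean_vectors = [[int(item in t) for item in items] for t in transactions]
for vec in boolean_vectors:
    print(vec)   # e.g. [1, 1, 0, 0] for the first basket

# Co-occurrence count used later for the support of {computer, antivirus_software}.
both = sum(1 for t in transactions if {"computer", "antivirus_software"} <= t)
print(both, "of", len(transactions), "transactions contain both items")
```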
Market Basket Analysis:
A Motivating Example
8

□ Rule support and confidence are two measures of rule interestingness.


□ support (2%) - 2% of all the transactions under analysis show that computer and
antivirus software are purchased together.
□ confidence (60%) - 60% of the customers who purchased a computer also bought
the software.
□ Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.
□ These thresholds can be set by users or domain experts.
□ Rules that satisfy both minsup and minconf are called strong rules.
□ Additional analysis can be performed to discover interesting
statistical correlations between associated items.
Frequent Itemsets, Closed Itemsets, and
Association Rules
9

□ Let I = {I1, I2,..., Im} be an itemset.

□ Let D, the task-relevant data, be a set of database transactions where each transaction
T is a nonempty itemset such that T ⊆ I.
□ Let A be a set of items. A transaction T is said to contain A if A ⊆ T.

□ An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, A ≠ ∅, B ≠ ∅, and A ∩ B = ∅.
□ The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B. This is taken to be the probability P(A ∪ B).
□ The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B. This is taken to be the conditional probability P(B|A).

support(A ⇒ B) = P(A ∪ B)

confidence(A ⇒ B) = P(B|A)


Frequent Itemsets, Closed Itemsets,
and Association Rules
10

□ A set of items is referred to as an itemset.


□ An itemset that contains k items is a k-itemset. Ex: The set {computer, antivirus
software} is a 2-itemset.
□ The occurrence frequency of an itemset is the number of transactions that contain
the itemset. (frequency, support count, or count of the itemset).
support(A⇒B)=P(A∪B)
□ The above itemset support is sometimes referred to as relative support, whereas the
occurrence frequency is called the absolute support.
□ If the relative support of an itemset I satisfies a prespecified minimum support
threshold (i.e., the absolute support of I satisfies the corresponding
minimum support count threshold), then I is a frequent itemset.
□ The set of frequent k-itemsets is commonly denoted by Lk.

confidence(A ⇒ B) = P(B|A) = support(A ∪ B) / support(A) = support_count(A ∪ B) / support_count(A)
Frequent Itemsets, Closed Itemsets,
and Association Rules
11

□ In general, association rule mining can be viewed as a two-step process:


❑ Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min_sup.
❑ Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence.
Example
12
□ Find all the rules X ➔ Y with minimum support and confidence
❑ support, s: probability that a transaction contains X ∪ Y
❑ confidence, c: conditional probability that a transaction having X also contains Y

Tid   Items bought
10    cerelac, Nuts, Diaper
20    cerelac, Coffee, Diaper
30    cerelac, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

Let minsup = 50%, minconf = 50%.
Ex: cerelac_count = 3, support = 3/5 = 0.6 (60%)
Frequent patterns: cerelac:3, Nuts:3, Diaper:4, Eggs:3, {cerelac, Diaper}:3

[Figure: Venn diagram of customers who buy cerelac, customers who buy diapers, and customers who buy both.]

Confidence (cerelac ⇒ Diaper) = sup(cerelac ∪ Diaper) / sup(cerelac) = (3/5)/(3/5) = 1 (100%)
Confidence (Diaper ⇒ cerelac) = sup(Diaper ∪ cerelac) / sup(Diaper) = (3/5)/(4/5) = 0.75 (75%)

□ Association rules (many more exist):
❑ cerelac ➔ Diaper (60%, 100%)
❑ Diaper ➔ cerelac (60%, 75%)
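The support and confidence figures in this example can be verified with a few lines of Python; this is only an illustrative sketch of the definitions, not the mining algorithm introduced later.

```python
# A small sketch verifying the support/confidence numbers of the example above.
transactions = {
    10: {"cerelac", "Nuts", "Diaper"},
    20: {"cerelac", "Coffee", "Diaper"},
    30: {"cerelac", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}

def support(itemset):
    """Relative support: fraction of transactions containing every item in itemset."""
    return sum(1 for t in transactions.values() if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """confidence(A => B) = support(A u B) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"cerelac", "Diaper"}))              # 0.6  -> 60% support
print(confidence({"cerelac"}, {"Diaper"}))         # 1.0  -> 100% confidence
print(confidence({"Diaper"}, {"cerelac"}))         # 0.75 -> 75% confidence
```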
Closed Patterns and Max-Patterns
13

□ A major challenge in mining frequent itemsets from a large data set is the fact that
such mining often generates a huge number of itemsets satisfying the
minimum support (min sup) threshold, especially when min sup is set low.
□ This is because if an itemset is frequent, each of its subsets is frequent as well.
□ A long itemset will contain a combinatorial number of shorter, frequent sub-
itemsets.
Closed Patterns and Max-Patterns
14
Closed Patterns and Max-Patterns
15
□ Solution: Mine closed patterns and max-patterns instead
□ An itemset X is closed in a data set D if there exists no proper super-itemset Y such
that Y has the same support count as X in D.
□ TDB1 = {⟨a1, …, a100⟩, ⟨a1, …, a50⟩} and min_sup = 1.
□ How many closed patterns does TDB1 contain?
❑ Closed itemsets: C = {{a1, a2, ..., a100} : 1; {a1, a2, ..., a50} : 2}
❑ The support of any frequent sub-itemset can be derived from C, e.g.:
■ {a2, a45} : 2, since {a2, a45} is a sub-itemset of the closed itemset {a1, a2, ..., a50} : 2;
■ {a8, a55} : 1, since {a8, a55} is not a sub-itemset of the previous itemset but is a sub-itemset of the closed itemset {a1, a2, ..., a100} : 1.
□ A closed pattern is a lossless compression of frequent patterns.
❑ It reduces the number of patterns but does not lose the support information!
□ An itemset X is a closed frequent itemset in set D if X is both closed and frequent
in D.
Closed Patterns and Max-Patterns
16
□ An itemset X is a maximal frequent itemset (or max-itemset) in a data set D if X
is frequent, and there exists no super-itemset Y such that X ⊂Y and Y is
frequent in D.

❑ For TDB1, the only maximal frequent itemset is {a1, …, a100} : 1
■ From the maximal frequent itemset, we can only assert that both itemsets
({a2, a45} and {a8, a55}) are frequent, but we cannot assert
their actual support counts.
■ Max-pattern is a lossy compression!
Closed Patterns and Max-Patterns
17

□ When we have many items, the total number of possible itemsets grows exponentially with the number of items in the database: |itemsets| = 2^n. This immediately causes both space and time complexity problems in finding frequent patterns.
□ For example, a small database with five transaction records and six items can already have 2^6 = 64 possible itemsets.
□ To deal with this problem, we now define a series of sets.
□ To deal with this problem, we now define a series of sets.
Sub-pattern (proper subset) &
Super-pattern (proper superset)
18

□ To find efficient algorithms for mining frequent patterns, we will start with two
basic concepts of patterns.
□ A pattern is an itemset, therefore the order of the items is not important.
□ What will be useful are the properties of subsets and supersets of frequent patterns.
□ Let us start with subsets: any proper subset of a pattern is a sub-pattern.
□ Ex: If a pattern X contains 3 items, X = {a, b, c}, then any proper subset, such as {a}, {b}, {c}, {a, b}, {a, c}, or {b, c}, is a sub-pattern of X.
□ Another concept is of course the super-pattern. A super-pattern of a pattern is a proper superset of the pattern.
□ Here are some example super-patterns of the pattern {a, b, c}: abcd, abce, abcde, …
Close Pattern & Max Pattern
19

□ A closed pattern is a frequent pattern. So it meets the minimum support criteria.


□ In addition to that, all super-patterns of a closed pattern are less frequent than the
closed pattern.
□ Ex: 3 items: a, b, c. Suppose the minimum support count is 2.
□ Suppose a pattern ab has support count 2 and a pattern abc has support count 2. Is the pattern ab a closed pattern?
□ No. Pattern ab is a frequent pattern, but it has a super-pattern (abc) that is NOT less frequent than ab.
□ Ex: 3 items: x, y, z.
□ Suppose a pattern xy has support count 3 and a pattern xyz has support count 2. Is the pattern xy a closed pattern?
□ Yes. Pattern xy is a frequent pattern, and its only super-pattern xyz is less frequent than xy.
□ Therefore, xy is a closed pattern.
Close Pattern & Max Pattern
20

□ A max pattern is a frequent pattern, so it also meets the minimum support criterion, like a closed pattern.
□ In addition, and unlike a closed pattern, all super-patterns of a max pattern are NOT frequent patterns.
□ Ex: 3 items: a, b, c. Suppose the minimum support count is 2.
□ Suppose a pattern ab has support count 3 and a pattern abc has support count 2. Is the pattern ab a max pattern?
□ No. Pattern ab is a frequent pattern, but it has a super-pattern (abc) that is also a frequent pattern. So pattern ab is NOT a max pattern.
□ Ex: 3 items: x, y, z. Suppose the minimum support count is 2.
□ Suppose a pattern xy has support count 3 and a pattern xyz has support count 1. Is the pattern xy a max pattern?
□ Yes. Pattern xy is a frequent pattern and its only super-pattern xyz is NOT a frequent pattern. Therefore, xy is a max pattern.
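The closed/max distinction above can be expressed directly in code. The sketch below is illustrative: it assumes the support counts of the relevant patterns are already known (the toy counts echo the x, y, z example) and simply applies the two definitions.

```python
# A minimal sketch (illustrative, not from the slides) that tests whether a pattern is
# closed or maximal, given known support counts of its supersets.
min_sup = 2
# Hypothetical support counts echoing the x, y, z example above.
supports = {
    frozenset("xy"): 3,
    frozenset("xz"): 2,
    frozenset("yz"): 2,
    frozenset("xyz"): 2,
}

def is_closed(pattern):
    """Closed: no proper superset has the same support count (assumes pattern is frequent)."""
    p = frozenset(pattern)
    return all(supports[q] < supports[p] for q in supports if p < q)

def is_max(pattern):
    """Maximal: no proper superset is frequent (assumes pattern is frequent)."""
    p = frozenset(pattern)
    return all(supports[q] < min_sup for q in supports if p < q)

print(is_closed("xy"), is_max("xy"))    # True, False  (xyz is frequent but less frequent)
print(is_closed("xyz"), is_max("xyz"))  # True, True   (no superset recorded)
```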
Frequent Itemset Mining Methods
21

□ In this topic, methods for mining the simplest form of frequent patterns:
❑ Apriori, the basic algorithm for finding frequent itemsets
❑ Generating strong association rules from frequent itemsets
❑ Several variations of the Apriori algorithm for improved efficiency and scalability
❑ Frequent pattern-growth (FP-growth) methods for mining frequent itemsets
❑ Methods for mining frequent itemsets from the vertical data format
Apriori Algorithm: Finding Frequent Itemsets
by Confined Candidate Generation
22

□ Apriori is a seminal algorithm for mining frequent itemsets for Boolean


association rules
□ Apriori employs an iterative approach known as a level-wise search, where k-
itemsets are used to explore (k + 1)-itemsets.
□ First, the set of frequent 1-itemsets is found by scanning the database to
accumulate the count for each item, and collecting those items that
satisfy minimum support.
□ The resulting set is denoted by L1. Next, L1 is used to find L2, the set of frequent 2-
itemsets, which is used to find L3, and so on, until no more frequent k-itemsets
can be found.
□ The finding of each Lk requires one full scan of the database.
□ To improve the efficiency of the level-wise generation of frequent itemsets, an
important property called the Apriori property is used to reduce the
search space.
Apriori Algorithm:
23

□ Apriori property: All nonempty subsets of a frequent itemset must also be frequent.

□ if an itemset I does not satisfy the minimum support threshold, min sup, then I is
not frequent, that is, P(I) < min sup.
□ If an item A is added to the itemset I, then the resulting itemset (i.e., I ∪ A) cannot
occur more frequently than I . Therefore, I ∪ A is not frequent either, that
is, P(I∪A)<min sup.
□ This property belongs to a special category of properties called antimonotonicity
in the sense that if a set cannot pass a test, all of its supersets will fail the same
test as well. It is called antimonotonicity because the property is
monotonic in the context of failing a test.
Apriori Algorithm:
24

□ “How is the Apriori property used in the algorithm?”


□ To understand this, let us look at how Lk−1 is used to find Lk for k ≥ 2.
□ A two-step process

❑ Join: To find Lk, a set of candidate k-itemsets is generated by joining Lk−1 with itself. This set of candidates is denoted Ck.
❑ Prune: Ck is a superset of Lk; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. Candidates having any (k−1)-subset that is not in Lk−1 are removed, by the Apriori property.
Apriori Algorithm:
25

□ Apriori pruning principle: If there is any itemset that is infrequent, its supersets should not be generated/tested!
□ Method (a minimal Python sketch of this level-wise loop is given below):
❑ Initially, scan the DB once to get the frequent 1-itemsets
❑ Generate length-(k+1) candidate itemsets from length-k frequent itemsets
❑ Test the candidates against the DB
❑ Terminate when no frequent or candidate set can be generated
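The level-wise method just outlined can be written compactly. The following is a minimal, illustrative Python sketch of Apriori; the function and variable names and the toy database are assumptions, not the formal pseudocode referenced on the next slides.

```python
# A minimal, illustrative Apriori sketch of the level-wise method above.
from itertools import combinations

def apriori(transactions, min_sup_count):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    # Frequent 1-itemsets (first DB scan).
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_sup_count}
    frequent = dict(Lk)
    k = 2
    while Lk:
        # Join step: candidates of size k from frequent (k-1)-itemsets.
        prev = list(Lk)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset (Apriori property).
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Count the surviving candidates against the DB (one scan per level).
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        Lk = {c: n for c, n in counts.items() if n >= min_sup_count}
        frequent.update(Lk)
        k += 1
    return frequent

# Toy run on the TDB example used later in this unit (minimum support count = 2).
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, count in sorted(apriori(tdb, 2).items(),
                             key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), count)
```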
The Apriori Algorithm (Pseudo-Code)
26
The Apriori Algorithm (Pseudo-Code)
Apriori: Example
28

Transactional Data for an AllElectronics


Branch
TID     List of item IDs
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3
Example
29
Example
30
Apriori: Example
31
Generating Association Rules from Frequent
Itemsets
32

□ The data contain the frequent itemset X = {I1, I2, I5} (transactional data for the AllElectronics branch as in the table shown earlier).
□ What are the association rules that can be generated from X?
□ The nonempty subsets of X are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}.
□ The resulting association rules are as shown below, each listed with its confidence:
❑ {I1, I2} ⇒ I5, confidence = 2/4 = 50%
❑ {I1, I5} ⇒ I2, confidence = 2/2 = 100%
❑ {I2, I5} ⇒ I1, confidence = 2/2 = 100%
❑ I1 ⇒ {I2, I5}, confidence = 2/6 = 33%
❑ I2 ⇒ {I1, I5}, confidence = 2/7 = 29%
❑ I5 ⇒ {I1, I2}, confidence = 2/2 = 100%
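The same rules can be generated programmatically. This sketch assumes the support counts read from the AllElectronics table above and a hypothetical minimum confidence of 70%; it illustrates the procedure rather than a full rule-generation module.

```python
# A minimal sketch of rule generation from one frequent itemset, using support counts
# read off the AllElectronics example above.
from itertools import combinations

support_count = {
    frozenset({"I1", "I2", "I5"}): 2,
    frozenset({"I1", "I2"}): 4,
    frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2,
    frozenset({"I1"}): 6,
    frozenset({"I2"}): 7,
    frozenset({"I5"}): 2,
}

X = frozenset({"I1", "I2", "I5"})
min_conf = 0.70   # assumed threshold for illustration

# For every nonempty proper subset S of X, emit the rule S => (X - S) with its confidence.
for r in range(1, len(X)):
    for subset in combinations(sorted(X), r):
        s = frozenset(subset)
        conf = support_count[X] / support_count[s]
        tag = "strong" if conf >= min_conf else "weak"
        print(f"{set(s)} => {set(X - s)}  confidence = {conf:.0%} ({tag})")
```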
Another Example
33

Supmin = 2

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E
Another Example
34

Supmin = 2, Database TDB as on the previous slide.

1st scan - C1 (candidate 1-itemsets with support counts): {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (frequent 1-itemsets): {A}:2, {B}:3, {C}:3, {E}:3

C2 (candidate 2-itemsets generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan - C2 support counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2 (frequent 2-itemsets): {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (candidate 3-itemsets): {B,C,E}
3rd scan - C3 support count: {B,C,E}:2
L3 (frequent 3-itemsets): {B,C,E}:2
Improving the Efficiency of Apriori
35

□ Hash-Based Technique:
□ A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1.
□ Ex: Generate L1 (the frequent 1-itemsets) and, while scanning, hash every 2-itemset of each transaction into a hash table H2 using the hash function h(x, y) = ((order of x) × 10 + (order of y)) mod 7:

bucket address   0         1         2         3         4         5         6
bucket count     2         2         4         2         2         4         4
bucket contents  {I1,I4}   {I1,I5}   {I2,I3}   {I2,I4}   {I2,I5}   {I1,I2}   {I1,I3}
                 {I3,I5}   {I1,I5}   {I2,I3}   {I2,I4}   {I2,I5}   {I1,I2}   {I1,I3}
                                     {I2,I3}                       {I1,I2}   {I1,I3}
                                     {I2,I3}                       {I1,I2}   {I1,I3}

□ A 2-itemset with a corresponding bucket count in the hash table that is below the support threshold cannot be frequent and thus should be removed from the candidate set.
□ Such a hash-based technique may substantially reduce the number of candidate k-itemsets examined (especially when k = 2).
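A rough sketch of the bucket-counting idea, using the AllElectronics transactions and the hash function shown above (the item ordering I1 < I2 < ... < I5 is assumed):

```python
# A minimal sketch of hashing the 2-itemsets of each transaction into buckets
# during the first scan, as described above.
from itertools import combinations

order = {"I1": 1, "I2": 2, "I3": 3, "I4": 4, "I5": 5}

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"}, {"I1", "I3"},
    {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def h(x, y):
    # h(x, y) = ((order of x) * 10 + (order of y)) mod 7
    return (order[x] * 10 + order[y]) % 7

bucket_count = [0] * 7
for t in transactions:
    for x, y in combinations(sorted(t, key=order.get), 2):
        bucket_count[h(x, y)] += 1

print(bucket_count)   # expected [2, 2, 4, 2, 2, 4, 4] for this data
# Any 2-itemset hashing to a bucket whose count is below min_sup cannot be frequent.
```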
Improving the Efficiency of Apriori
36

□ Transaction reduction (reducing the number of transactions scanned in future iterations):
❑ A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k + 1)-itemsets, so it can be removed from further consideration.
□ Partitioning (partitioning the data to find candidate itemsets):
❑ A partitioning technique can be used that requires just two database scans to mine the frequent itemsets.

Phase I (1 scan): Divide the transactions of D into n partitions and find the frequent itemsets local to each partition.
Phase II (1 scan): Combine all local frequent itemsets to form the global candidate itemsets, then find the global frequent itemsets among these candidates in D.
Improving the Efficiency of Apriori
37

❑ Two Phases:
■ In phase I, the algorithm divides the transactions of D into n nonoverlapping
partitions.
■ If the minimum relative support threshold for transactions in D is min sup,
then the minimum support count for a partition is min_sup × the
number of transactions in that partition.
■ For each partition, all the local frequent itemsets (i.e., the itemsets frequent
within the partition) are found.
■ The collection of frequent itemsets from all partitions forms the global
candidate itemsets with respect to D
■ In phase II, a second scan of D is conducted in which the actual support of
each candidate is assessed to determine the global frequent itemsets.
Improving the Efficiency of Apriori
38

□ Sampling (mining on a subset of the given data):


❑ The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D.
❑ In this way, we trade off some degree of accuracy against efficiency.
□ Dynamic itemset counting (adding candidate itemsets at different points during a scan):
❑ A dynamic itemset counting technique was proposed in which the database is partitioned into blocks marked by start points.
❑ In this variation, new candidate itemsets can be added at any start point, unlike in Apriori, which determines new candidate itemsets only immediately before each complete database scan.
A Pattern-Growth Approach for Mining Frequent
Itemsets
39

□ In many cases the Apriori candidate generate-and-test method significantly reduces the size of candidate sets, leading to good performance gain.
□ However, it can suffer from two nontrivial costs:
❑ It may still need to generate a huge number of candidate sets.
■ For example, if there are 10^4 frequent 1-itemsets, the Apriori algorithm will need to generate more than 10^7 candidate 2-itemsets.
❑ It may need to repeatedly scan the whole database and check a large set of candidates by pattern matching.
■ It is costly to go over each transaction in the database to determine the support of the candidate itemsets.
A Pattern-Growth Approach for Mining
Frequent Itemsets
40

□ Bottlenecks of the Apriori approach
❑ Breadth-first (i.e., level-wise) search
❑ Candidate generation and test
■ Often generates a huge number of candidates
□ The FP-Growth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD '00)
❑ Depth-first search
❑ Avoids explicit candidate generation
□ Major philosophy: Grow long patterns from short ones using local frequent items only
❑ "abc" is a frequent pattern
❑ Get all transactions having "abc", i.e., project the DB on abc: DB|abc
❑ "d" is a local frequent item in DB|abc ➔ abcd is a frequent pattern
A Pattern-Growth Approach for Mining
Frequent Itemsets
41

Algorithm: FP growth. Mine frequent itemsets using an FP-tree by pattern fragment growth.
Input:

D, a transaction database;
min sup, the minimum support count threshold.

Output: The complete set of frequent patterns.


Method:

1. The FP-tree is constructed in the following


steps:
(a) Scan the transaction database D once. Collect F, the set of frequent items, and their support
counts. Sort F in support count descending order as L, the list of frequent items.
(b) Create the root of an FP-tree, and label it as “null.” For each transaction Trans in D do the
following.
Select and sort the frequent items in Trans according to the order of L. Let the sorted frequent item list in Trans be [p|P], where p is the first element and P is the remaining list. Call insert_tree([p|P], T), which is performed as follows. If T has a child N such that N.item-name = p.item-name, then increment N's count by 1; else create a new node N, let its count be 1, link its parent to T, and link its node-link to the nodes with the same item-name via the node-link structure. If P is nonempty, call insert_tree(P, N) recursively.
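A compact, illustrative rendering of steps (a) and (b) in Python follows; the class and variable names are assumptions, and the toy transactions are the ordered item sets used in the worked example on the next slides.

```python
# A minimal FP-tree construction sketch following the insert_tree description above.
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

def build_fp_tree(transactions, min_sup_count):
    # Step (a): one scan to collect frequent items, sorted by descending support count.
    counts = Counter(item for t in transactions for item in t)
    L = [i for i, c in counts.most_common() if c >= min_sup_count]
    root = Node(None, None)                       # the "null" root
    header = {i: [] for i in L}                   # node-links per item
    # Step (b): insert each transaction, keeping only frequent items in L-order.
    for t in transactions:
        node = root
        for item in [i for i in L if i in t]:
            if item in node.children:             # shared prefix: increment count
                node.children[item].count += 1
            else:                                 # new branch: create node, record node-link
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header

# Toy usage with the ordered item sets from the following slides (min support count 3).
tx = [["K","E","M","O","Y"], ["K","E","O","Y"], ["K","E","M"], ["K","M","Y"], ["K","E","O"]]
root, header = build_fp_tree(tx, 3)
print({item: len(nodes) for item, nodes in header.items()})  # tree nodes created per item
```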
A Pattern-Growth Approach for Mining
Frequent Itemsets
42
Example
43
Support_count of itemsets
44

Frequent Patterns are


L = {K : 5, E : 4, M : 3, O : 3, Y : 3}
Ordered-Item set
45

Frequent Patterns are


L = {K : 5, E : 4, M : 3, O : 3, Y : 3}
a) Inserting the set {K, E, M, O, Y}:
46
b) Inserting the set {K, E, O, Y}:
47
c) Inserting the set {K, E, M}:
48
d) Inserting the set {K, M, Y}:
49
e) Inserting the set {K, E, O}:
50
Compute Conditional Pattern Base
51

L = {K : 5, E : 4, M : 3, O : 3, Y : 3}
Build the Conditional Frequent Pattern Tree
52
Generate Frequent Pattern rules
53
Mining Frequent Itemsets Using the Vertical
Data Format
54
Table 6.3 The Vertical Data Format of the Transaction Data Set D of Table 6.1

itemset   TID set
I1        {T100, T400, T500, T700, T800, T900}
I2        {T100, T200, T300, T400, T600, T800, T900}
I3        {T300, T500, T600, T700, T800, T900}
I4        {T200, T400}
I5        {T100, T800}

Table 6.4 2-Itemsets in Vertical Data Format

itemset     TID set
{I1, I2}    {T100, T400, T800, T900}
{I1, I3}    {T500, T700, T800, T900}
{I1, I4}    {T400}
{I1, I5}    {T100, T800}
{I2, I3}    {T300, T600, T800, T900}
{I2, I4}    {T200, T400}
{I2, I5}    {T100, T800}
{I3, I5}    {T800}

Table 6.5 3-Itemsets in Vertical Data Format

itemset         TID set
{I1, I2, I3}    {T800, T900}
{I1, I2, I5}    {T100, T800}
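Mining in the vertical format proceeds by intersecting TID sets; the sketch below reproduces Table 6.4's 2-itemsets from Table 6.3 (a minimum support count of 2 is assumed here, as in the earlier Apriori example).

```python
# A minimal sketch of mining with the vertical data format: 2-itemset TID sets are obtained
# by intersecting the TID sets of single items (values transcribed from Table 6.3).
from itertools import combinations

tidsets = {
    "I1": {"T100", "T400", "T500", "T700", "T800", "T900"},
    "I2": {"T100", "T200", "T300", "T400", "T600", "T800", "T900"},
    "I3": {"T300", "T500", "T600", "T700", "T800", "T900"},
    "I4": {"T200", "T400"},
    "I5": {"T100", "T800"},
}
min_sup_count = 2   # assumed threshold for this example

# Intersect TID sets to get 2-itemsets; the support count is the size of the intersection.
two_itemsets = {}
for a, b in combinations(sorted(tidsets), 2):
    tids = tidsets[a] & tidsets[b]
    if len(tids) >= min_sup_count:
        two_itemsets[(a, b)] = tids

for pair, tids in sorted(two_itemsets.items()):
    print(pair, sorted(tids))   # e.g. ('I1', 'I2') ['T100', 'T400', 'T800', 'T900']
```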
Benefits of the FP-tree Structure
55

□ Completeness
❑ Preserve complete information for frequent pattern mining
❑ Never break a long pattern of any transaction
□ Compactness
❑ Reduce irrelevant info—infrequent items are gone
❑ Items in frequency descending order: the more frequently occurring, the more likely to be shared
❑ Never be larger than the original database (not counting node-links and the count field)
Mining Closed and Max Patterns
56

□ Item merging:
❑ If every transaction containing a frequent itemset X also contains an itemset Y but not any proper superset of Y, then X ∪ Y forms a frequent closed itemset and there is no need to search for any itemset containing X but not Y.
□ Sub-itemset pruning:
❑ If a frequent itemset X is a proper subset of an already found frequent closed itemset Y and support_count(X) = support_count(Y), then X and all of X's descendants in the set enumeration tree cannot be frequent closed itemsets and thus can be pruned.
Mining Closed and Max Patterns
57

□ Item skipping:
❑ In the depth-first mining of closed itemsets, at each level, there will be a prefix itemset X associated with a header table and a projected database.
❑ If a local frequent item p has the same support in several header tables at different levels, we can safely prune p from the header tables at higher levels.
Another Example
83

TID     List of item IDs
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3
Another Example
84
Summary
89

□ Basic concepts: association rules, support–confidence framework, closed and max-patterns
□ Scalable frequent pattern mining methods
❑ Apriori (candidate generation & test)
❑ Projection-based (FP-growth)
❑ Vertical data format approach

Which Patterns Are Interesting?—
Pattern Evaluation Methods
90

□ Most association rule mining algorithms employ a support–confidence framework.


□ Although minimum support and confidence thresholds help weed out or exclude
the exploration of a good number of uninteresting rules, many of the
rules generated are still not interesting to the users.
□ Unfortunately, this is especially true when mining at low support thresholds or
mining for long patterns.
□ This has been a major bottleneck for successful application of association rule
mining.
Strong Rules Are Not Necessarily Interesting

91

□ Whether or not a rule is interesting can be assessed either subjectively or objectively.
□ Ultimately, subjective judgment rests with the user and may differ from one person to another.
□ However, objective measures can be used as one step toward the goal of weeding out uninteresting rules that would otherwise be presented to the user.
□ “How can we tell which strong association rules are really interesting?”
Strong Rules Are Not Necessarily
Interesting
92

□ Example: A misleading “strong” association rule.


❑ Suppose we are interested in analyzing transactions at AllElectronics with respect to the purchase of computer games and videos.
❑ Let game refer to the transactions containing computer games, and video refer to those containing videos.
❑ Of the 10,000 transactions analyzed, the data show that 6000 of the customer transactions included computer games, while 7500 included videos, and 4000 included both computer games and videos.
❑ Suppose a minimum support of 30% and a minimum confidence of 60% are required for association rules.
❑ The rule game ⇒ video is a strong association rule and would therefore be reported, since its support value of 4000/10,000 = 40% and confidence value of 4000/6000 = 66% satisfy the minimum support and minimum confidence thresholds, respectively.
Strong Rules Are Not Necessarily
Interesting
93

□ The following association rule is discovered:
buys(X, "computer games") ⇒ buys(X, "videos") [support = 40%, confidence = 66%]
□ However, Rule is misleading because the probability of purchasing videos is 75%,
which is even larger than 66%.
□ In fact, computer games and videos are negatively associated because the purchase of
one of these items actually decreases the likelihood of purchasing the other.
□ Without fully understanding this phenomenon, we could easily make unwise business
decisions based on above Rule.
□ Example also illustrates that the confidence of a rule A ⇒ B can be deceiving.
□ It does not measure the real strength (or lack of strength) of the correlation and
implication between A and B.
□ Hence, alternatives to the support–confidence framework can be useful in mining
interesting data relationships.
From Association Analysis to Correlation
Analysis
94

□ Support and confidence measures are insufficient at filtering out uninteresting


association rules.
□ A correlation measure can be used to augment the support–confidence framework
for association rules.
□ This leads to correlation rules of the form
A ⇒ B [support, confidence, correlation]
□ Many correlation measures are available:
❑ Lift measure
❑ χ2 measure
Lift Measure
95

□ Lift is a simple correlation measure:
❑ The occurrence of itemset A is independent of the occurrence of itemset B if
P(A ∪ B) = P(A)P(B)
❑ Otherwise, itemsets A and B are dependent and correlated as events.
□ The lift between the occurrence of A and B can be measured by computing
lift(A, B) = P(A ∪ B) / (P(A)P(B))
□ Lift(A, B) < 1: A is negatively correlated with B
□ Lift(A, B) > 1: A and B are positively correlated
□ Lift(A, B) = 1: A and B are independent; there is no correlation
□ In other words, the lift of the association (or correlation) rule A ⇒ B can be written as
lift(A, B) = P(B|A) / P(B) = conf(A ⇒ B) / sup(B)
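Applying the lift formula to the computer games/videos example from the previous slides gives a value below 1, confirming the negative correlation; a minimal sketch:

```python
# Lift for the computer games / videos example above
# (6000 game, 7500 video, 4000 joint transactions out of 10,000).
n = 10_000
p_game = 6000 / n
p_video = 7500 / n
p_both = 4000 / n

lift = p_both / (p_game * p_video)
print(round(lift, 3))   # ~0.889 < 1, so game and video are negatively correlated,
                        # even though the rule game => video is "strong" (40%, 66%).
```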
Lift: Example
96
χ2 measure
97
Another Example
98
Association
99
Lift
100
Chi-Square
101
Which is best? lift and χ2
102
Which is best? lift and χ2
103
Which is best? lift and χ2
104
Which is best? lift and χ2
105
Which is best? lift and χ2
106
Which is best? lift and χ2
107
Which is best? lift and χ2
108
A Comparison of Pattern
Evaluation Measures
109

□ Instead of using the simple support–confidence framework, other measures, such as lift and χ2, can disclose more intrinsic patterns.
❑ How effective are these measures?
❑ Should we also consider other alternatives?
□ Other measures:
❑ all_confidence
❑ max_confidence
❑ Kulczynski
❑ cosine
Other measures
110

□ All_confidence:
□ Given two itemsets, A and B, the all_confidence measure of A and B is defined as
all_conf(A, B) = sup(A ∪ B) / max{sup(A), sup(B)} = min{P(A|B), P(B|A)}
□ where max{sup(A), sup(B)} is the maximum support of the itemsets A and B.
□ Thus, all_conf(A, B) is also the minimum confidence of the two association rules related to A and B, namely, "A ⇒ B" and "B ⇒ A."
□ Max_confidence:
□ Given two itemsets, A and B, the max_confidence measure of A and B is defined as
max_conf(A, B) = max{P(A|B), P(B|A)}
□ The max_conf measure is the maximum confidence of the two association rules, "A ⇒ B" and "B ⇒ A."
Other Measures
111

□ Kulczynski measure:
□ Given two itemsets, A and B, the Kulczynski measure of A and B (abbreviated as Kulc) is defined as
Kulc(A, B) = (1/2)(P(A|B) + P(B|A))
□ It is the average of two conditional probabilities: the probability of itemset B given itemset A, and the probability of itemset A given itemset B.
□ Cosine measure:
□ Finally, given two itemsets, A and B, the cosine measure of A and B is defined as
cosine(A, B) = P(A ∪ B) / √(P(A) P(B)) = sup(A ∪ B) / √(sup(A) × sup(B))
□ The cosine measure can be viewed as a harmonized lift measure:
❑ The two formulae are similar except that for cosine, the square root is taken on the product of the probabilities of A and B.
Comparison of Null-invariant Measures
112
Comparison of Null-invariant Measures
113
Comparison of Null-invariant Measures
114
Comparison of Null-invariant Measures
115
Comparison of Null-invariant Measures
116
Comparison of Null-invariant Measures
117

IR(A, B) = |sup(A) − sup(B)| / (sup(A) + sup(B) − sup(A ∪ B))
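All of the measures above (all_confidence, max_confidence, Kulczynski, cosine) and the imbalance ratio IR depend only on the counts of transactions containing A, containing B, and containing both, which is why they are null-invariant; a small illustrative sketch with hypothetical input counts:

```python
# Null-invariant measures computed from three occurrence counts:
# n_a = transactions containing A, n_b = containing B, n_ab = containing both.
from math import sqrt

def measures(n_a, n_b, n_ab):
    all_conf = n_ab / max(n_a, n_b)
    max_conf = n_ab / min(n_a, n_b)                  # = max(P(A|B), P(B|A))
    kulc     = 0.5 * (n_ab / n_a + n_ab / n_b)
    cosine   = n_ab / sqrt(n_a * n_b)
    ir       = abs(n_a - n_b) / (n_a + n_b - n_ab)   # imbalance ratio
    return all_conf, max_conf, kulc, cosine, ir

# None of these involve the total number of transactions, so adding null transactions
# (containing neither A nor B) leaves every value unchanged.
print([round(v, 3) for v in measures(n_a=1000, n_b=1000, n_ab=100)])
```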
Comparison of Null-invariant Measures
118
Example
119
Lift & Chi-Square
120
Null Invariant Measures
121
Unit - II
122

Mining Frequent Patterns, Associations, Correlations

(Contents: Text book 2 - Chapter 7)


Advanced Frequent Pattern Mining
123

□ Pattern Mining in Multi-Level, Multi-Dimensional Space
❑ Mining Multi-Level Associations
❑ Mining Multi-Dimensional Associations
❑ Mining Quantitative Association Rules
❑ Mining Rare Patterns and Negative Patterns
□ Summary
Pattern Mining in Multi-Level,
Multi-Dimensional Space
124

□ This section focuses on methods for mining in multilevel, multidimensional space.
❑ Multilevel associations - involve concepts at different abstraction levels.
❑ Multidimensional associations - involve more than one dimension or predicate (e.g., rules that relate what a customer buys to his or her age).
❑ Quantitative association rules - involve numeric attributes that have an implicit ordering among values (e.g., age).
❑ Rare patterns - suggest interesting although rare item combinations.
❑ Negative patterns - show negative correlations between items.
Mining Multiple-Level Association Rules

□ effective methods for mining patterns at multiple abstraction levels


□ Items often form hierarchies
□ Flexible support settings
❑ Items at the lower levels are expected to have lower support
□ Exploration of shared multi-level mining

Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%
Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%
Example hierarchy: Milk [support = 10%] at level 1; 2% Milk [support = 6%] and Skim Milk [support = 4%] at level 2.
Multi-level Association: Flexible Support and
Redundancy filtering

□ Flexible min-support thresholds: Some items are more valuable but less frequent
❑ Use non-uniform, group-based min-support
❑ E.g., {diamond, watch, camera}: 0.05%; {bread, milk}: 5%; …
□ Redundancy filtering: Some rules may be redundant due to "ancestor" relationships between items
❑ milk ⇒ wheat bread [support = 8%, confidence = 70%]
❑ 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
❑ The first rule is an ancestor of the second rule
□ A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor
Mining Multi-Dimensional Association

□ Single-dimensional rules:
buys(X, "milk") ⇒ buys(X, "bread")
□ Multi-dimensional rules: ≥ 2 dimensions or predicates
❑ Inter-dimension assoc. rules (no repeated predicates)
age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
❑ Hybrid-dimension assoc. rules (repeated predicates)
age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
□ Categorical attributes: finite number of possible values, no ordering among values—data cube approach
□ Quantitative attributes: numeric, implicit ordering among values—discretization, clustering, and gradient approaches
Mining Quantitative Associations

Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
1. Static discretization based on predefined concept hierarchies (data cube methods), e.g., using the cuboid lattice (age), (income), (buys), (age, income), (age, buys), (income, buys), (age, income, buys)
2. Dynamic discretization based on data distribution
3. Clustering: distance-based association
❑ One-dimensional clustering, then association
4. Deviation analysis, e.g.,
Sex = female ⇒ Wage: mean = $7/hr (overall mean = $9)
Quantitative Association Rules Based on
Statistical Inference Theory

□ Finding extraordinary and therefore interesting phenomena, e.g.,
(Sex = female) ⇒ Wage: mean = $7/hr (overall mean = $9)
❑ LHS: a subset of the population
❑ RHS: an extraordinary behavior of this subset
□ The rule is accepted only if a statistical test (e.g., Z-test) confirms the inference with high confidence
□ Subrule: highlights the extraordinary behavior of a subset of the population of the super rule
❑ E.g., (Sex = female) ∧ (South = yes) ⇒ mean wage = $6.3/hr
□ Two forms of rules
❑ Categorical ⇒ quantitative rules, or quantitative ⇒ quantitative rules
❑ E.g., Education in [14-18] (yrs) ⇒ mean wage = $11.64/hr
□ Open problem: efficient methods for LHS containing two or more quantitative attributes
Negative and Rare Patterns

□ Rare patterns: very low support but interesting
❑ E.g., buying Rolex watches
❑ Mining: set individual-based or special group-based support thresholds for valuable items
□ Negative patterns
❑ Since it is unlikely that one buys a Ford Expedition (an SUV) and a Toyota Prius (a hybrid car) together, Ford Expedition and Toyota Prius are likely negatively correlated patterns
□ Negatively correlated patterns that are infrequent tend to be more interesting than those that are frequent
Defining Negative Correlated Patterns (I)
131

□ Definition 1 (support-based)
❑ If itemsets X and Y are both frequent but rarely occur together, i.e.,
sup(X ∪ Y) < sup(X) × sup(Y)
❑ then X and Y are negatively correlated
□ Problem: A store sold two needle packages A and B 100 times each, but only one transaction contained both A and B.
❑ When there are in total 200 transactions, we have
s(A ∪ B) = 0.005, s(A) × s(B) = 0.25, so s(A ∪ B) < s(A) × s(B)
❑ When there are 10^5 transactions, we have
s(A ∪ B) = 1/10^5, s(A) × s(B) = 1/10^3 × 1/10^3, so s(A ∪ B) > s(A) × s(B)
❑ Where is the problem? — Null transactions: the support-based definition is not null-invariant!
Defining Negative Correlated Patterns (II)
132

□ Definition 2 (negative itemset-based)
❑ X is a negative itemset if (1) X = Ā ∪ B, where B is a set of positive items and Ā is a set of negative items, |Ā| ≥ 1, and (2) s(X) ≥ μ
❑ Itemset X is negatively correlated if …
□ This definition suffers from a similar null-invariance problem
□ Definition 3 (Kulczynski measure-based): If itemsets X and Y are frequent but (P(X|Y) + P(Y|X))/2 < ε, where ε is a negative pattern threshold, then X and Y are negatively correlated.
□ Ex: For the same needle package problem, no matter whether there are 200 or 10^5 transactions, if ε = 0.01 we have
(P(A|B) + P(B|A))/2 = (0.01 + 0.01)/2 < ε
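A small sketch contrasting Definition 1 with the Kulczynski-based Definition 3 on the needle-package numbers above; the counts (100, 100, 1) are taken from the example, while the function names are illustrative:

```python
# Needle-package example: Definition 1 flips its verdict as null transactions are added,
# while the Kulczynski measure stays the same (null-invariance).
def kulc(n_a, n_b, n_ab):
    """Kulc(A, B) = (P(A|B) + P(B|A)) / 2, computed from occurrence counts."""
    return 0.5 * (n_ab / n_b + n_ab / n_a)

def support_product_test(n_a, n_b, n_ab, n_total):
    """Definition 1: flags negative correlation if s(A u B) < s(A) * s(B)."""
    return (n_ab / n_total) < (n_a / n_total) * (n_b / n_total)

for n_total in (200, 100_000):
    neg_by_def1 = support_product_test(100, 100, 1, n_total)
    print(f"N={n_total}: Definition 1 flags negative correlation: {neg_by_def1}, "
          f"Kulc = {kulc(100, 100, 1)}")
# Definition 1 changes its answer with the number of null transactions; Kulc stays at 0.01.
```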
Dr. R. Elakkiya, AP-SoC, SASTRA Deemed University
