Introduction to Frequent Pattern Mining
1. What is Frequent Pattern Mining?
Definition:
Frequent Pattern Mining is a key technique in data mining used to identify recurring patterns,
associations, or correlations in large datasets. A frequent pattern is a pattern (such as itemsets,
subsequences, or substructures) that appears frequently in a dataset. The most common
frequent patterns are frequent itemsets, where an itemset is a collection of one or more items.
Importance:
Market Basket Analysis: Identifying frequent item combinations in shopping carts
helps retailers understand customer purchasing behavior, which can lead to better
product placement, promotions, and recommendations.
Data Compression: Frequent patterns can help in compressing data by identifying and
summarizing repetitive data elements.
Anomaly Detection: Outliers can be found by identifying items that do not conform to
frequent patterns.
Recommendation Systems: By identifying patterns of frequently purchased items
together, recommendation systems suggest products to users based on their past
behavior.
2. Applications of Frequent Pattern Mining
1. Market Basket Analysis:
o In retail, analyzing frequent itemsets helps businesses understand which
products are commonly bought together. For example, if customers frequently
buy bread and butter together, retailers might offer discounts on this
combination to boost sales or stock these items together.
2. Web Mining:
o In web usage mining, frequent pattern mining can help discover patterns in user
behavior by analyzing frequently accessed web pages or sequences of clicks.
This information can be used for website optimization or targeted
advertisements.
3. Bioinformatics:
o In bioinformatics, frequent pattern mining is used to identify frequent gene
sequences, protein structures, or biological networks that occur across datasets,
which can provide insights into biological functions and processes.
4. Text Mining:
o In text mining, frequent pattern mining helps find common phrases or terms in
large text corpora, aiding in topic modeling, sentiment analysis, or information
retrieval.
3. Basic Concepts in Frequent Pattern Mining
a. Itemsets: Definition of an Itemset
An itemset is a collection of one or more items. For example, in a grocery store dataset,
an itemset could be {milk, bread}, where "milk" and "bread" are individual items. An
itemset can contain any number of items.
k-itemset refers to an itemset containing k items. For instance, {milk, bread, butter} is
a 3-itemset.
b. Frequent Itemsets: Definition and Criteria
Frequent itemsets are those itemsets that occur at least as frequently as a predefined
minimum support threshold in the dataset.
For example, if the minimum support is set to 0.3 (30%), an itemset is considered
frequent if it appears in at least 30% of all transactions.
The frequency or support of an itemset is typically expressed as a proportion of the total
number of transactions. The general goal is to discover itemsets that meet or exceed the support
threshold.
c. Support, Confidence, and Lift
Support (s):
o Support indicates the frequency of occurrence of an itemset in a dataset. It is
defined as the proportion of transactions in which the itemset appears.
o Formula: Support(A) = (Number of transactions containing A) / (Total number of transactions)
o For example, if {milk, bread} appears in 20 out of 100 transactions, the support
is 0.2 (or 20%).
Confidence (c):
o Confidence measures the likelihood of occurrence of item B in transactions
where item A is present. It is used for evaluating association rules.
o Formula: Confidence(A → B) = Support(A ∪ B) / Support(A)
o For example, if {milk} appears in 50 transactions and {milk, bread} appears in
30 transactions, the confidence of the rule {milk} → {bread} is 30/50=0.6 or
60%.
Lift:
o Lift measures how much more likely item B is to be purchased when item A is
present compared to if item B were purchased independently. It is the ratio of
the observed support to the expected support, assuming A and B are
independent.
o Formula: Lift(A → B) = Support(A ∪ B) / (Support(A) × Support(B)) = Confidence(A → B) / Support(B)
o A lift value greater than 1 indicates a positive association between A and B (they
occur together more often than expected under independence), a value of 1 indicates
independence, and a value less than 1 indicates a negative association.
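To make these three measures concrete, here is a minimal Python sketch (illustrative only, using a small made-up transaction list rather than data from these notes) that computes support, confidence, and lift for a rule such as {milk} → {bread}:

```python
# Minimal sketch: support, confidence, and lift over a toy transaction list.
transactions = [
    {"milk", "bread"}, {"milk", "bread", "butter"}, {"milk"},
    {"bread", "butter"}, {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support(A ∪ B) / Support(A) for the rule A -> B."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence(A -> B) / Support(B); a value of 1.0 means A and B are independent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

A, B = {"milk"}, {"bread"}
print(support(A | B, transactions))    # 0.6
print(confidence(A, B, transactions))  # 0.75
print(lift(A, B, transactions))        # 0.9375
```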
4. Association Rule Mining
Definition:
Association rule mining is a method used to find interesting relationships (rules) between items
in large datasets. An association rule is a statement of the form A → B, which implies that when
item A occurs in a transaction, item B is likely to occur as well.
a. Concepts of Association Rules
Antecedent (A): The item(s) on the left side of the rule.
Consequent (B): The item(s) on the right side of the rule.
For example, the rule {milk} → {bread} means that if milk is purchased, there is a high
likelihood that bread will also be purchased.
b. Strong Rules vs Weak Rules
Strong rules have high support and confidence values, indicating a strong correlation
between the items.
Weak rules have low support or confidence, indicating a less significant relationship.
c. Confidence and Support Thresholds
Minimum support threshold is a user-defined limit to determine how frequently an
itemset must appear in the dataset to be considered.
Minimum confidence threshold is used to filter rules that are considered strong
enough. A rule with confidence below this threshold is discarded.
For example, if the minimum support is set to 0.3 and the minimum confidence is set to 0.7,
only rules that appear in at least 30% of transactions and have at least 70% confidence will be
considered.
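As an illustration of this filtering, the sketch below (a toy example with a five-transaction dataset and one-item rules only; the helper name sup is an assumption, not a standard API) keeps only the rules whose support and confidence meet the chosen thresholds:

```python
# Sketch: keep only "strong" rules with support >= min_sup and confidence >= min_conf.
from itertools import combinations

transactions = [
    {"milk", "bread", "butter"}, {"bread", "butter", "beer"},
    {"milk", "diapers", "beer"}, {"milk", "bread", "butter"},
    {"milk", "bread", "diapers"},
]
min_sup, min_conf = 0.4, 0.7

def sup(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

strong_rules = []
items = sorted(set().union(*transactions))
for a, b in combinations(items, 2):               # only 1-item -> 1-item rules here
    for antecedent, consequent in (({a}, {b}), ({b}, {a})):
        s = sup(antecedent | consequent)
        c = s / sup(antecedent)
        if s >= min_sup and c >= min_conf:
            strong_rules.append((antecedent, consequent, s, c))

for ant, con, s, c in strong_rules:
    print(f"{ant} -> {con}: support={s:.2f}, confidence={c:.2f}")
# e.g. {'bread'} -> {'butter'} passes (support 0.60, confidence 0.75),
# while {'milk'} -> {'butter'} is discarded (confidence 0.50 < 0.70).
```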
Apriori Algorithm
Introduction to the Apriori Algorithm
What is the Apriori Algorithm?
The Apriori Algorithm is a classic algorithm in frequent pattern mining used to find frequent
itemsets in a transactional dataset. It is widely used for association rule mining, especially in
market basket analysis, to identify associations between items purchased together.
Basic Idea:
Apriori uses a bottom-up approach, where it starts by identifying individual items (1-itemsets)
that satisfy the minimum support threshold. It then generates larger itemsets by combining
these smaller frequent itemsets. The key characteristic of Apriori is that it uses the Apriori
property to reduce the search space.
Apriori Property: Downward Closure Property
The Apriori property is based on the concept that if an itemset is frequent, all of its subsets
must also be frequent. Conversely, if an itemset is infrequent, none of its supersets can be
frequent. This property significantly reduces the number of itemsets that need to be examined.
Example:
Consider a dataset where the itemset {Milk, Bread, Butter} is found to be infrequent. By the
Apriori property, we can immediately infer that {Milk, Bread, Butter, Cheese} is also
infrequent without calculating its support, as the larger itemset contains the infrequent {Milk,
Bread, Butter}.
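The pruning implied by this property reduces to a simple subset check; the sketch below is illustrative (the helper name has_infrequent_subset and the example L3 are assumptions, not from these notes):

```python
# Sketch of the downward-closure check: a candidate k-itemset can be pruned
# immediately if any of its (k-1)-subsets is not frequent.
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """True if some (k-1)-subset of `candidate` is missing from L_{k-1}."""
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(candidate, len(candidate) - 1))

# Suppose {Milk, Bread, Butter} was found infrequent, so it is absent from a (hypothetical) L3:
L3 = {frozenset({"Milk", "Bread", "Diapers"})}
candidate = frozenset({"Milk", "Bread", "Butter", "Cheese"})
print(has_infrequent_subset(candidate, L3))   # True -> prune without counting its support
```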
Key Steps in the Apriori Algorithm
The Apriori algorithm works in iterations or passes. In each pass, it generates candidate
itemsets of size k from the frequent itemsets of size k-1. It then prunes the infrequent itemsets
using the support threshold.
Steps of the Apriori Algorithm
Step 1: Generating Candidate Itemsets
Initial Step (L1 generation):
Start by identifying all 1-itemsets whose support meets the minimum support threshold.
o For example, in a transactional dataset, count the number of transactions in which
each individual item appears.
Generating Higher Itemsets (L2, L3, etc.):
After generating the frequent 1-itemsets (L1), generate candidate 2-itemsets (C2) by
joining L1 itemsets; the candidates that meet the support threshold form L2. Continue this
process: candidate k-itemsets (Ck) are built by joining frequent (k-1)-itemsets, and the
frequent ones form Lk.
Step 2: Pruning Using Support Threshold
Calculate the support for each candidate itemset.
Prune itemsets that do not meet the minimum support threshold.
o For example, if the minimum support is 40%, an itemset must appear in at least 40%
of the transactions to be considered frequent.
Step 3: Joining Candidate Itemsets
After pruning, the remaining frequent itemsets are used to generate higher-order
candidate itemsets by joining itemsets.
Join step:
To generate candidate k-itemsets from frequent (k-1)-itemsets, two itemsets are combined
if their first (k-2) items (in a fixed item order) are identical.
For example, to generate 3-itemsets from {Milk, Bread} and {Milk, Butter}, we get
{Milk, Bread, Butter}.
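A minimal sketch of this join condition follows (the function name apriori_join and the fixed item order are illustrative assumptions):

```python
# Sketch of the Apriori join step: two frequent (k-1)-itemsets are joined
# only when their first k-2 items (under a fixed item order) agree.
ORDER = ["Milk", "Bread", "Butter", "Diapers", "Beer"]   # any fixed order works

def apriori_join(frequent_prev, k):
    prev = [sorted(s, key=ORDER.index) for s in frequent_prev]
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            if prev[i][:k - 2] == prev[j][:k - 2]:        # shared (k-2)-prefix
                candidates.add(frozenset(prev[i]) | frozenset(prev[j]))
    return candidates

L2 = [frozenset(p) for p in (("Milk", "Bread"), ("Milk", "Butter"), ("Bread", "Butter"))]
print(apriori_join(L2, 3))   # the single candidate {Milk, Bread, Butter}
```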
Numerical Example
Example to understand how the Apriori algorithm works:
Transaction Dataset:
Transaction ID Items Purchased
T1 {Milk, Bread, Butter}
T2 {Bread, Butter, Beer}
T3 {Milk, Diapers, Beer}
T4 {Milk, Bread, Butter}
T5 {Milk, Bread, Diapers}
Minimum Support Threshold = 2 (40%)
Step-by-Step Execution of Apriori Algorithm
Step 1: Generate 1-Itemsets (L1)
Count the support for each individual item:
Item Support Count
Milk 4
Bread 4
Butter 3
Diapers 2
Beer 2
Since all the itemsets meet the minimum support threshold of 2, they are frequent.
Thus, we have the following 1-itemsets (L1):
L1= {Milk, Bread, Butter, Diapers, Beer}
Step 2: Generate Candidate 2-Itemsets (C2) by joining frequent 1-itemsets
Join L1 to form candidate 2-itemsets:
Candidate Itemset Support Count
{Milk, Bread} 3
{Milk, Butter} 2
{Milk, Diapers} 2
{Milk, Beer} 1
{Bread, Butter} 3
{Bread, Diapers} 1
{Bread, Beer} 1
{Butter, Diapers} 0
{Butter, Beer} 1
{Diapers, Beer} 1
Prune candidates that do not meet the minimum support threshold (support >= 2):
o {Milk, Beer}, {Bread, Diapers}, {Bread, Beer}, {Butter, Diapers}, {Butter, Beer},
and {Diapers, Beer} are pruned.
The frequent 2-itemsets (L2) are:
L2 = {{Milk, Bread}, {Milk, Butter}, {Milk, Diapers}, {Bread, Butter}}
Step 3: Generate Candidate 3-Itemsets (C3) by joining frequent 2-itemsets
Join L2 to form candidate 3-itemsets:
Candidate Itemset Support Count
{Milk, Bread, Butter} 2
{Milk, Bread, Diapers} 1
{Milk, Butter, Diapers} 0
(The last two could also be pruned without counting their support, since their subsets
{Bread, Diapers} and {Butter, Diapers} are not frequent.)
Only {Milk, Bread, Butter} meets the minimum support threshold, so the frequent 3-itemsets
(L3) are:
L3 = {{Milk, Bread, Butter}}
Step 4: Generate Candidate 4-Itemsets (C4)
No 4-itemsets can be generated because there is only one frequent 3-itemset left to join,
so the algorithm terminates.
Thus, the frequent itemsets are:
Frequent itemsets: {Milk}, {Bread}, {Butter}, {Diapers}, {Beer}, {Milk, Bread}, {Milk, Butter},
{Milk, Diapers}, {Bread, Butter}, {Milk, Bread, Butter}
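The following compact Python sketch (an illustrative implementation, not a reference one) runs the same join/prune/count loop on the five transactions above with a minimum support count of 2 and reproduces the frequent itemsets just listed:

```python
# Compact, illustrative Apriori run on the five-transaction example (min count = 2).
from itertools import combinations

transactions = [
    {"Milk", "Bread", "Butter"}, {"Bread", "Butter", "Beer"},
    {"Milk", "Diapers", "Beer"}, {"Milk", "Bread", "Butter"},
    {"Milk", "Bread", "Diapers"},
]
min_count = 2

def frequent(candidates):
    """Keep candidates whose support count meets the threshold."""
    counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
    return {c: n for c, n in counts.items() if n >= min_count}

# L1: frequent single items
items = set().union(*transactions)
Lk = frequent({frozenset({i}) for i in items})
all_frequent = dict(Lk)

k = 2
while Lk:
    # Join step: unions of two frequent (k-1)-itemsets that yield a k-itemset
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
    # Prune step: drop candidates with any infrequent (k-1)-subset
    candidates = {c for c in candidates
                  if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
    Lk = frequent(candidates)
    all_frequent.update(Lk)
    k += 1

for itemset, count in sorted(all_frequent.items(), key=lambda kv: (len(kv[0]), -kv[1])):
    print(set(itemset), count)   # ten frequent itemsets, matching the list above
```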
Closed Frequent Itemsets
A Closed Frequent Itemset is a frequent itemset for which there is no superset with the same
support. In other words, it is frequent and adding any additional items to it would decrease its
support.
Example:
Consider the following transactions:
Transaction ID Items
1 Bread, Milk, Beer
2 Bread, Diapers, Milk
3 Beer, Diapers, Eggs
4 Bread, Milk, Diapers
5 Bread, Milk, Eggs, Beer
1. Step 1: Find Frequent Itemsets
Apply an algorithm like Apriori to find frequent itemsets with a minimum support threshold
(e.g., 2).
Frequent 1-itemsets:
{Bread}: 4
{Milk}: 4
{Beer}: 3
{Diapers}: 3
{Eggs}: 2
Frequent 2-itemsets:
{Bread, Milk}: 4
{Bread, Beer}: 2
{Bread, Diapers}: 2
{Milk, Beer}: 2
{Milk, Diapers}: 2
{Beer, Eggs}: 2
Frequent 3-itemsets:
{Bread, Milk, Beer}: 2
{Bread, Milk, Diapers}: 2
2. Step 2: Identify Closed Frequent Itemsets
For each frequent itemset, check whether some proper superset has the same support.
o {Bread}: Support = 4 (superset {Bread, Milk} also has support 4, so not closed)
o {Milk}: Support = 4 (superset {Bread, Milk} also has support 4, so not closed)
o {Beer}: Support = 3 (no superset has support 3, so closed)
o {Diapers}: Support = 3 (no superset has support 3, so closed)
o {Eggs}: Support = 2 (superset {Beer, Eggs} has support 2, so not closed)
o {Bread, Milk}: Support = 4 (no superset has support 4, so closed)
o {Bread, Beer}: Support = 2 (superset {Bread, Milk, Beer} has support 2, so not closed)
o {Bread, Diapers}: Support = 2 (superset {Bread, Milk, Diapers} has support 2, so not closed)
o {Milk, Beer}: Support = 2 (superset {Bread, Milk, Beer} has support 2, so not closed)
o {Milk, Diapers}: Support = 2 (superset {Bread, Milk, Diapers} has support 2, so not closed)
o {Beer, Eggs}: Support = 2 (no superset has support 2, so closed)
o {Bread, Milk, Beer}: Support = 2 (no superset has support 2, so closed)
o {Bread, Milk, Diapers}: Support = 2 (no superset has support 2, so closed)
Closed Frequent Itemsets:
o {Beer}
o {Diapers}
o {Bread, Milk}
o {Beer, Eggs}
o {Bread, Milk, Beer}
o {Bread, Milk, Diapers}
Maximal Frequent Itemsets
A Maximal Frequent Itemset is a frequent itemset that is not a subset of any other frequent
itemset; in other words, it cannot be extended by any item and remain frequent.
Example:
Using the same transactions:
1. Step 1: Find Frequent Itemsets
We already have the frequent itemsets from the previous example.
2. Step 2: Identify Maximal Frequent Itemsets
For each frequent itemset, check whether it is a subset of any larger frequent itemset.
o {Bread, Milk, Beer}: Maximal (not a subset of any larger frequent itemset)
o {Bread, Milk, Diapers}: Maximal (not a subset of any larger frequent itemset)
o {Beer, Eggs}: Maximal (not a subset of any larger frequent itemset)
o {Bread, Milk}: Not maximal (subset of {Bread, Milk, Beer} and {Bread, Milk, Diapers})
o {Bread, Beer}: Not maximal (subset of {Bread, Milk, Beer})
o {Milk, Diapers}: Not maximal (subset of {Bread, Milk, Diapers})
o The remaining 1- and 2-itemsets are likewise subsets of larger frequent itemsets, so they
are not maximal.
Maximal Frequent Itemsets:
o {Bread, Milk, Beer}
o {Bread, Milk, Diapers}
o {Beer, Eggs}
Comparison
Closed Frequent Itemsets: They capture itemsets that are frequent and do not have any
superset with the same frequency. They provide a compact representation of frequent itemsets,
useful in reducing the number of itemsets to be analyzed.
Maximal Frequent Itemsets: They are the largest frequent itemsets that are not subsets of any
other frequent itemset. They help in understanding the largest sets of items that frequently
appear together, which is useful for identifying broad patterns.
Applications
Closed Frequent Itemsets are used in summarizing and reducing the number of itemsets,
thus making the data more manageable for further analysis.
Maximal Frequent Itemsets are used in pattern discovery where understanding the largest
combinations of items is crucial, such as in market basket analysis for identifying overall
purchasing patterns.
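Both notions can be checked mechanically once the frequent itemsets and their supports are known. The sketch below is illustrative (the helper names is_closed and is_maximal are assumptions), using the supports from the worked example above:

```python
# Sketch: flag closed and maximal itemsets, given frequent itemsets and supports.
freq = {
    frozenset(s): c for s, c in [
        (("Bread",), 4), (("Milk",), 4), (("Beer",), 3), (("Diapers",), 3), (("Eggs",), 2),
        (("Bread", "Milk"), 4), (("Bread", "Beer"), 2), (("Bread", "Diapers"), 2),
        (("Milk", "Beer"), 2), (("Milk", "Diapers"), 2), (("Beer", "Eggs"), 2),
        (("Bread", "Milk", "Beer"), 2), (("Bread", "Milk", "Diapers"), 2),
    ]
}

def is_closed(itemset):
    """Closed: no proper frequent superset has the same support."""
    return not any(itemset < other and freq[other] == freq[itemset] for other in freq)

def is_maximal(itemset):
    """Maximal: no proper frequent superset at all."""
    return not any(itemset < other for other in freq)

print("closed:", [set(s) for s in freq if is_closed(s)])
# {Beer}, {Diapers}, {Bread, Milk}, {Beer, Eggs}, {Bread, Milk, Beer}, {Bread, Milk, Diapers}
print("maximal:", [set(s) for s in freq if is_maximal(s)])
# {Beer, Eggs}, {Bread, Milk, Beer}, {Bread, Milk, Diapers}
```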
Limitations of the Apriori Algorithm
Although the Apriori algorithm is simple and widely used, it has some limitations:
a. Computationally expensive
Multiple Scans of the Dataset:
Apriori scans the dataset multiple times (once for each pass k) to count the support for
candidate itemsets. This can be computationally expensive, especially for large
datasets.
b. Cost of Candidate Generation
Explosion of Candidate Itemsets:
As the size of the itemsets increases, the number of candidate itemsets grows
exponentially.
For example, going from L1 to the candidate 2-itemsets there are C(n, 2) = n(n − 1)/2
combinations, where n is the number of frequent 1-itemsets. For larger itemsets, this
results in a significant number of candidates to evaluate.
The binomial coefficient C(n, k) (read as "n choose k") represents the number of ways
to choose k items from n items without regard to the order of selection.
The binomial coefficient is calculated using the following formula:
C(n, k) = n! / (k! (n − k)!)
where:
n! (n factorial) is the product of all positive integers up to n.
k! (k factorial) is the product of all positive integers up to k.
(n − k)! is the factorial of the difference between n and k.
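A quick way to get a feel for this growth is to evaluate a few binomial coefficients directly (math.comb is available in Python 3.8+; the numbers below are just illustrative choices of n and k):

```python
# How fast candidate counts grow with the number of frequent 1-itemsets n.
import math

n = 1000                    # frequent 1-itemsets
print(math.comb(n, 2))      # 499500 candidate 2-itemsets
print(math.comb(n, 3))      # 166167000 candidate 3-itemsets
print(math.comb(100, 5))    # 75287520 ways to choose 5 items out of 100
```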
c. Memory Consumption
Storage of Candidates:
The Apriori algorithm requires significant memory to store candidate itemsets and
support counts. As the itemsets become larger, this memory requirement increases.
Challenges in Frequent Pattern Mining
Frequent Pattern Mining (FPM) is essential for discovering patterns in large datasets, such as
finding associations between items in a transaction database. Despite its importance, several
challenges arise when working with real-world data, particularly related to scalability,
efficiency, and handling noise and sparsity.
2. Scalability and Efficiency in Frequent Pattern Mining
a. Handling Large Datasets
Challenge:
As datasets grow in size, mining frequent patterns becomes computationally expensive. Many
practical datasets consist of millions of transactions, making it necessary to devise techniques
that scale well.
Why this is a problem:
Large datasets require multiple scans, which can be costly in terms of time and
computational resources.
Storing and processing large amounts of data increases the demand for memory and
storage.
Example:
Consider a retail store that tracks millions of transactions. A traditional algorithm like Apriori
will need to scan this large dataset multiple times, leading to high processing time.
For example, a dataset with 1 million transactions where each transaction contains 100 items
will require multiple scans to identify frequent patterns, causing significant overhead.
Solution Approaches:
1. FP-Growth: This algorithm avoids multiple database scans by building a compressed
FP-Tree, reducing the need to store and process the entire dataset.
2. Sampling techniques: Instead of processing the entire dataset, algorithms can use
sampled data to approximate frequent patterns.
3. Parallelization: Distribute the computation across multiple processors to handle large
datasets in parallel, improving the time efficiency.
b. Reducing Search Space
Challenge:
The search space for frequent pattern mining grows exponentially as the number of items in
the dataset increases. Exploring all possible itemsets becomes computationally infeasible as
the number of transactions and items grows.
Why this is a problem:
Each transaction can contain many items, and mining all possible combinations
(subsets) results in an exponential number of candidate itemsets. For example, a
dataset with 100 items can have 2^100 potential itemsets.
Example:
In a retail environment with 100 items, the number of possible itemsets is 2^100, which results in
1,267,650,600,228,229,401,496,703,205,376 combinations! Clearly, examining each
combination would be computationally impractical.
Solution Approaches:
1. Apriori Property (Downward Closure): This property helps reduce the number of
candidate itemsets by ensuring that if an itemset is frequent, all its subsets must also be
frequent. By pruning infrequent itemsets early, the search space is reduced.
2. FP-Growth: Instead of generating candidate itemsets, FP-Growth mines frequent
patterns directly from a compressed representation of the dataset (FP-Tree),
significantly reducing the search space.
3. Hash-based methods: These methods map itemsets to buckets, reducing the need to
scan every itemset combination.
3. Search Space Complexity in Frequent Pattern Mining
a. Exponential Growth of Potential Itemsets
Challenge:
As the number of items increases, the number of potential itemsets (subsets of items) grows
exponentially. This makes it hard to compute all possible frequent patterns in a reasonable
amount of time.
Why this is a problem:
As the number of transactions and items increases, the number of possible itemsets
grows exponentially, making it computationally intractable to enumerate and count them
all for large datasets.
More items mean more potential combinations to evaluate, resulting in an exponential
increase in the search space.
Example:
A dataset with only 20 items has 2^20 = 1,048,576 potential subsets (itemsets). Even with just 20
items, it becomes a challenge to mine patterns if the search space is not pruned effectively.
Solution Approaches:
1. Frequent Pattern Trees (FP-Trees): FP-Growth compresses the dataset into a tree
structure, which reduces the number of combinations that need to be considered.
2. Partitioning methods: Partition the data into smaller subsets, allowing the algorithm
to focus on localized data before combining results.
3. Closed frequent itemsets: Instead of mining all frequent itemsets, focus on closed
frequent itemsets, i.e., frequent itemsets that cannot be extended by any item without
reducing their support. This reduces the number of itemsets significantly.
b. Impact of Exponential Search Space on Computation
The exponential growth in search space directly impacts:
Time Complexity: More itemsets mean more time to calculate the support of each
itemset.
Memory Consumption: Large search spaces require more memory to store candidate
itemsets.
4. Data Sparsity and Noise
a. Data Sparsity
Challenge:
In many real-world applications, datasets are often sparse, meaning that most transactions
contain only a small subset of all available items. This creates difficulties in identifying
significant frequent patterns because the majority of item combinations will have very low
support.
Why this is a problem:
Sparse data leads to many itemsets with low support, which are often considered noise
or insignificant.
In sparse datasets, there may be few patterns that can be considered frequent, reducing
the utility of the mining process.
Example:
In a dataset of 100 items, if each transaction contains only 3-5 items, the overlap between
transactions is minimal, leading to a sparse dataset. Identifying common patterns from such
sparse data can be challenging as the majority of item combinations will appear in very few
transactions.
Solution Approaches:
1. Threshold tuning: Adjust the minimum support threshold to find itemsets that are
frequent enough, even in sparse data.
2. Data aggregation: Combine similar transactions or items into more general categories
to reduce sparsity and increase the chances of finding meaningful patterns.
3. Specialized algorithms for sparse data: algorithms have been designed to handle
sparse data more efficiently by focusing on high-frequency items or using dimensionality
reduction techniques.
b. Handling Noise
Challenge:
Noise refers to random or irrelevant data that can obscure the patterns in a dataset. Noisy
transactions can create false patterns or mask important frequent patterns.
Why this is a problem:
Noise can cause algorithms to discover irrelevant or spurious patterns, which do not
generalize well to the rest of the data.
It can also obscure genuine patterns by adding random transactions that distort
frequency counts.
Example:
In a transactional dataset for market basket analysis, if some customers randomly purchase
items that don’t typically go together, these noisy transactions could create false associations.
Solution Approaches:
1. Preprocessing to Remove Noise: Techniques like data cleaning and outlier detection
can be applied to remove noise before mining.
2. Robust Algorithms: Use algorithms that are designed to handle noise by focusing on
strong associations and discarding weak patterns.
3. Statistical measures: Use additional metrics, such as confidence and lift, to evaluate
the significance of patterns, helping to reduce the impact of noise.
Examples of Challenges in Real-World Scenarios
Example 1: E-commerce Transactions
An online store has millions of transactions and thousands of products. Mining frequent
patterns is essential for recommending products, but:
Scalability becomes a concern due to the sheer volume of data.
Data sparsity is evident because most customers buy only a few products at a time.
Solutions:
Use FP-Growth to reduce the number of database scans.
Apply data reduction techniques to group similar items.
Example 2: Telecommunication Data
A telecommunications company records call details for millions of users, generating an
enormous amount of data daily. The company wants to find patterns in user behavior (e.g.,
frequently called numbers), but:
Exponential search space arises as the number of users and call patterns grows.
Noise from irrelevant calls (spam, wrong numbers) makes it difficult to mine
meaningful patterns.
Solutions:
Apply sampling to reduce the dataset size.
Use preprocessing to remove noise and irrelevant data.
FP-Growth (Frequent Pattern Growth)
Motivation
The FP-Growth (Frequent Pattern Growth) algorithm was developed as an alternative to the
Apriori Algorithm to address its limitations. The main drawbacks of Apriori are:
1. Multiple scans of the dataset: Apriori needs to scan the dataset for every iteration (for
k-itemsets).
2. Candidate Generation: Apriori generates a large number of candidate itemsets,
especially for large datasets, which leads to increased computational complexity.
Why FP-Growth?
FP-Growth addresses these inefficiencies by eliminating the need for candidate generation.
Instead of generating candidate itemsets and scanning the database multiple times, FP-Growth
builds a compressed representation of the database in the form of a Frequent Pattern Tree
(FP-Tree) and extracts frequent itemsets directly from this structure.
Basic Concepts
a. FP-Tree Structure
The FP-Tree (Frequent Pattern Tree) is a compact data structure that stores the frequent
patterns in a dataset without the need for multiple scans or candidate generation.
Definition:
An FP-Tree is a prefix-tree structure that represents the transactions in the dataset. Each
node in the tree corresponds to an item, and the path from the root to the leaf represents
a transaction. The count of each node represents the frequency of that item in the
transactions that share the prefix.
Key Characteristics:
o Compact: It compresses the dataset by storing only the frequent items,
significantly reducing memory usage.
o Shared Paths: Transactions with common items share paths, further reducing
redundancy.
b. Dividing the Database into Frequent Patterns
The FP-Growth algorithm divides the database into a set of conditional databases
based on the prefix paths in the FP-Tree. It then recursively mines these conditional
databases to find frequent patterns.
Key Features:
Divide-and-conquer approach: Breaks the problem into smaller sub-problems by
constructing conditional FP-Trees.
Memory-efficient: Uses a compact FP-Tree to store the dataset, reducing database
scans.
No candidate generation: Unlike Apriori, FP-Growth does not generate candidate
itemsets explicitly, improving efficiency.
Steps of the FP-Growth Algorithm
The FP-Growth algorithm has two main phases:
Phase 1: Construction of the FP-Tree
The FP-Tree is built using two passes over the dataset:
Pass 1: Identify frequent items
o In the first pass, the algorithm scans the dataset and calculates the support count of
each item.
o Items that do not meet the minimum support threshold are discarded. The
remaining items are called frequent items.
Pass 2: Build the FP-Tree
o In the second pass, the dataset is scanned again; this time the items within each
transaction are sorted in descending order of their support.
o For each transaction, a path is constructed in the FP-Tree where each node
corresponds to a frequent item, and the node count is incremented based on the
item's frequency in the transaction.
Phase 2: Mining Frequent Patterns from the FP-Tree
Once the FP-Tree is constructed, frequent patterns are mined by growing patterns from the tree:
1. Recursive Pattern Growth
o The algorithm mines the FP-Tree by recursively exploring all possible paths. For
each item, the algorithm builds a conditional FP-Tree that consists of all
transactions that contain the item. The process continues until all frequent patterns
are identified.
2. Conditional FP-Tree
o A conditional FP-Tree is a subtree constructed by considering only transactions that
include a specific frequent item. This allows the algorithm to focus on the subsets of
transactions that are most relevant for identifying frequent patterns involving that
item.
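A minimal sketch of the two-pass construction is shown below; the FPNode class, its field names, and the tie-break list are illustrative assumptions, not a standard API (a full implementation would also keep a header table with node-links for mining). Run on the dataset of the example that follows, it reproduces the tree built step by step there:

```python
# Minimal FP-Tree construction sketch (two passes over the transactions).
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}                       # item -> FPNode

TIE_BREAK = ["Milk", "Bread", "Butter", "Diapers", "Beer"]   # matches the walkthrough's order

def build_fptree(transactions, min_count):
    # Pass 1: count item supports and keep only frequent items.
    counts = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in counts.items() if c >= min_count}
    root = FPNode(None)
    # Pass 2: insert each transaction, items sorted by descending support.
    for t in transactions:
        items = sorted((i for i in t if i in freq),
                       key=lambda i: (-freq[i], TIE_BREAK.index(i)))
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root, freq

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item} ({child.count})")
        show(child, depth + 1)

transactions = [
    {"Milk", "Bread", "Butter"}, {"Bread", "Butter", "Beer"},
    {"Milk", "Diapers", "Beer"}, {"Milk", "Bread", "Butter"},
    {"Milk", "Bread", "Diapers"},
]
root, _ = build_fptree(transactions, min_count=2)
show(root)   # prints the same tree that the example below builds by hand
```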
Example
Transaction Dataset : Consider the following example dataset, where each row represents a
transaction:
Transaction ID Items Purchased
T1 {Milk, Bread, Butter}
T2 {Bread, Butter, Beer}
T3 {Milk, Diapers, Beer}
T4 {Milk, Bread, Butter}
T5 {Milk, Bread, Diapers}
Minimum Support Threshold = 2
Step-by-Step Construction of the FP-Tree
Step 1: Identify Frequent Items (Pass 1)
Support Count for Each Item:
Item Support Count
Milk 4
Bread 4
Butter 3
Diapers 2
Beer 2
All items meet the minimum support threshold, so we retain all items. Now we proceed to
construct the FP-Tree.
Step 2: Build the FP-Tree (Pass 2)
Sort items in each transaction by frequency. The sorted transactions are:
Transaction ID Sorted Items
T1 {Milk, Bread, Butter}
T2 {Bread, Butter, Beer}
T3 {Milk, Diapers, Beer}
T4 {Milk, Bread, Butter}
T5 {Milk, Bread, Diapers}
Construct the FP-Tree by inserting these sorted transactions one by one. Here’s how
the tree evolves:
Insert T1: {Milk, Bread, Butter}
FP-Tree:
(null) → Milk (1) → Bread (1) → Butter (1)
Insert T2: {Bread, Butter, Beer}
FP-Tree:
(null) → Milk (1) → Bread (1) → Butter (1)
→ Bread (1) → Butter (1) → Beer (1)
Insert T3: {Milk, Diapers, Beer}
FP-Tree:
(null) → Milk (2) → Bread (1) → Butter (1)
→ Diapers (1) → Beer (1)
→ Bread (1) → Butter (1) → Beer (1)
Insert T4: {Milk, Bread, Butter}
FP-Tree:
(null) → Milk (3) → Bread (2) → Butter (2)
→ Diapers (1) → Beer (1)
→ Bread (1) → Butter (1) → Beer (1)
Insert T5: {Milk, Bread, Diapers}
Final FP-Tree:
(null) → Milk (4) → Bread (3) → Butter (2)
→ Diapers (1)
→ Diapers (1) → Beer (1)
→ Bread (1) → Butter (1) → Beer (1)
(Reading the layout: under Milk (4), Bread (3) has the children Butter (2) and Diapers (1),
and Milk also has a separate Diapers (1) → Beer (1) branch; the Bread (1) → Butter (1) →
Beer (1) branch hangs directly off the root, since T2 does not contain Milk.)
Step 3: Extract Frequent Itemsets from the FP-Tree
a) Mining the FP-Tree:
To mine the FP-Tree, we begin with the least frequent items (i.e., those at the bottom of the
tree) and construct their conditional pattern bases and conditional FP-Trees (a short code
sketch of these conditional pattern bases follows this walkthrough).
1. Beer’s Conditional Pattern Base:
o The two paths containing "Beer" are:
{Milk, Diapers} (support = 1)
{Bread, Butter} (support = 1)
Conditional Pattern Base for Beer:
{Milk, Diapers}: 1
{Bread, Butter}: 1
Conditional FP-Tree for Beer:
Every item in Beer's conditional pattern base (Milk, Diapers, Bread, Butter) appears only
once, so none reaches the minimum support of 2 and the conditional FP-Tree for Beer is
empty.
Frequent Itemsets Containing Beer:
o {Beer}: 2
(Combinations such as {Bread, Butter, Beer} or {Milk, Diapers, Beer} have support 1 and
are therefore discarded.)
2. Diapers’ Conditional Pattern Base:
o The two paths containing "Diapers" are:
{Milk} (support = 1)
{Milk, Bread} (support = 1)
Conditional Pattern Base for Diapers:
{Milk}: 1
{Milk, Bread}: 1
Conditional FP-Tree for Diapers:
(null) → Milk (2)
(Bread appears only once in the conditional pattern base, so it is dropped.)
Frequent Itemsets Containing Diapers:
o {Diapers}: 2
o {Milk, Diapers}: 2
({Milk, Bread, Diapers} has support 1 and is therefore discarded.)
3. Butter’s Conditional Pattern Base:
o The two paths containing "Butter" are:
{Milk, Bread} (support = 2)
{Bread} (support = 1)
Conditional Pattern Base for Butter:
{Milk, Bread}: 2
{Bread}: 1
Conditional FP-Tree for Butter:
(null) → Bread (3) → Milk (2)
Frequent Itemsets Containing Butter:
o {Butter}: 3
o {Bread, Butter}: 3
o {Milk, Butter}: 2
o {Milk, Bread, Butter}: 2
4. Bread’s Conditional Pattern Base:
o The single path containing "Bread" is:
{Milk} (support = 3)
Conditional Pattern Base for Bread:
{Milk}: 3
Conditional FP-Tree for Bread:
(null) → Milk (3)
Frequent Itemsets Containing Bread:
o {Bread}: 4
o {Milk, Bread}: 3
5. Milk’s Conditional Pattern Base:
o No path is left for "Milk."
Frequent Itemsets Containing Milk:
o {Milk}: 4
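Because each conditional pattern base is simply the set of prefixes that precede an item in the support-ordered transactions, it can also be derived without walking node-links. The simplified sketch below (illustrative only; a real FP-Growth implementation follows node-links in the tree instead) reproduces the bases used above:

```python
# Simplified sketch: conditional pattern bases taken directly from the
# support-ordered transactions (the prefix before each item).
from collections import Counter, defaultdict

transactions = [
    ["Milk", "Bread", "Butter"], ["Bread", "Butter", "Beer"],
    ["Milk", "Diapers", "Beer"], ["Milk", "Bread", "Butter"],
    ["Milk", "Bread", "Diapers"],
]   # already sorted by item frequency, as in Pass 2

cond_base = defaultdict(Counter)
for t in transactions:
    for pos, item in enumerate(t):
        prefix = tuple(t[:pos])
        if prefix:                       # an empty prefix carries no pattern
            cond_base[item][prefix] += 1

for item in ("Beer", "Diapers", "Butter", "Bread"):
    print(item, dict(cond_base[item]))
# Beer    {('Bread', 'Butter'): 1, ('Milk', 'Diapers'): 1}
# Diapers {('Milk',): 1, ('Milk', 'Bread'): 1}
# Butter  {('Milk', 'Bread'): 2, ('Bread',): 1}
# Bread   {('Milk',): 3}
```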
Step 4: Summary of Frequent Itemsets
Based on the conditional pattern bases and the FP-Trees, we derive the following
frequent itemsets (with their support counts):
1. Single Items:
o {Milk}: 4
o {Bread}: 4
o {Butter}: 3
o {Diapers}: 2
o {Beer}: 2
2. Two-Itemsets:
o {Milk, Bread}: 3
o {Bread, Butter}: 3
o {Milk, Butter}: 2
o {Milk, Diapers}: 2
3. Three-Itemsets:
o {Milk, Bread, Butter}: 2
(Combinations such as {Bread, Beer}, {Butter, Beer}, {Milk, Diapers, Beer} and
{Bread, Butter, Beer} have support 1 and are therefore not frequent.)
Final Frequent Itemsets:
{Milk}: 4
{Bread}: 4
{Butter}: 3
{Diapers}: 2
{Beer}: 2
{Milk, Bread}: 3
{Bread, Butter}: 3
{Milk, Butter}: 2
{Milk, Diapers}: 2
{Milk, Bread, Butter}: 2
These are the frequent itemsets mined using the FP-Growth algorithm.
Advantages of FP-Growth
1. Efficiency: FP-Growth reduces the computational cost by avoiding the generation of
candidate itemsets (a major bottleneck in Apriori).
2. Compact Representation: The FP-Tree provides a compact representation of the
dataset, which is efficient for both memory and processing.
3. No Repeated Scanning: The FP-Growth algorithm scans the dataset twice, as opposed
to Apriori, which requires multiple database scans.
Disadvantages of FP-Growth
1. Tree Construction Complexity: Building the FP-Tree can be complex and may require
substantial memory for very large datasets.
2. Conditional FP-Trees: Mining from conditional FP-Trees can involve recursive calls
and additional processing overhead.
Mining Frequent Itemsets Using Vertical Data Format
Mining frequent itemsets is a crucial task in association rule learning, commonly used in
market basket analysis, web mining, and other data mining applications. Most algorithms
(like Apriori) use a horizontal data format, where each transaction contains a list of items.
However, the vertical data format provides an alternative approach that is often more
efficient in some cases. This guide explains the concept and process of mining frequent
itemsets using the vertical data format.
What is Vertical Data Format?
In the horizontal data format, each row represents a transaction, and each column
corresponds to an item (or product) in the transaction. However, in the vertical data
format, each row corresponds to an item, and the row contains the list of transactions in
which that item appears.
Example:
Consider the following transaction dataset (horizontal format):
Transaction ID Items Purchased
T1 {Milk, Bread, Butter}
T2 {Bread, Butter, Beer}
T3 {Milk, Diapers, Beer}
T4 {Milk, Bread, Butter}
T5 {Milk, Bread, Diapers}
In the vertical format, this dataset is represented as follows:
Item Transaction IDs
Milk {T1, T3, T4, T5}
Bread {T1, T2, T4, T5}
Butter {T1, T2, T4}
Diapers {T3, T5}
Beer {T2, T3}
In the vertical format, each item is associated with the set of transactions in which it appears.
This transaction ID set (TID set) is crucial for mining frequent itemsets using algorithms like
ECLAT (Equivalence Class Clustering and Bottom-Up Lattice Traversal).
Steps in Mining Frequent Itemset Using Vertical Data Format
Mining frequent itemsets using the vertical format mainly involves set intersection operations.
The key advantage is that it simplifies candidate generation and counting by using intersection
instead of joining or scanning the entire dataset.
The process can be broken down into the following steps:
Step 1: Transform the Dataset into Vertical Format
The first step is to convert the transaction dataset from the horizontal format (where each
transaction contains a list of items) into the vertical format. This involves creating a list of
transaction IDs (TIDs) for each item in the dataset.
Example:
For the transactions:
T1: {Milk, Bread, Butter}
T2: {Bread, Butter, Beer}
T3: {Milk, Diapers, Beer}
T4: {Milk, Bread, Butter}
T5: {Milk, Bread, Diapers}
The vertical representation is:
Item Transaction IDs (TIDs)
Milk {T1, T3, T4, T5}
Bread {T1, T2, T4, T5}
Butter {T1, T2, T4}
Diapers {T3, T5}
Beer {T2, T3}
Step 2: Find Frequent 1-Itemsets
Once the dataset is in the vertical format, the next step is to find the frequent 1-itemsets. This
involves counting the number of transactions for each item (i.e., the length of the TID set). If
the size of the TID set for an item meets the minimum support threshold, the item is considered
frequent.
Example:
Let the minimum support threshold be 2. We calculate the support for each item based on the
TID sets:
Item Transaction IDs (TIDs) Support Count
Milk {T1, T3, T4, T5} 4
Bread {T1, T2, T4, T5} 4
Butter {T1, T2, T4} 3
Diapers {T3, T5} 2
Beer {T2, T3} 2
Since all items meet the minimum support threshold of 2, they are all frequent 1-itemsets.
Step 3: Generate Candidate k-Itemsets and Count Support
For k ≥ 2, frequent k-itemsets are generated by intersecting the TID sets of frequent
(k-1)-itemsets. This set intersection operation allows us to efficiently compute the
support for the new k-itemsets.
Example: Generating 2-Itemsets
1. {Milk, Bread}:
o TID sets: {Milk} = {T1, T3, T4, T5}, {Bread} = {T1, T2, T4, T5}
o Intersection: {T1, T4, T5}
o Support: 3
2. {Milk, Butter}:
o TID sets: {Milk} = {T1, T3, T4, T5}, {Butter} = {T1, T2, T4}
o Intersection: {T1, T4}
o Support: 2
3. {Milk, Diapers}:
o TID sets: {Milk} = {T1, T3, T4, T5}, {Diapers} = {T3, T5}
o Intersection: {T3, T5}
o Support: 2
4. {Bread, Butter}:
o TID sets: {Bread} = {T1, T2, T4, T5}, {Butter} = {T1, T2, T4}
o Intersection: {T1, T2, T4}
o Support: 3
5. {Bread, Diapers}:
o TID sets: {Bread} = {T1, T2, T4, T5}, {Diapers} = {T3, T5}
o Intersection: {T5}
o Support: 1 (discarded, as it does not meet the minimum support)
6. {Butter, Beer}:
o TID sets: {Butter} = {T1, T2, T4}, {Beer} = {T2, T3}
o Intersection: {T2}
o Support: 1 (discarded)
Thus, the frequent 2-itemsets are:
Itemset Support Count
{Milk, Bread} 3
{Milk, Butter} 2
{Milk, Diapers} 2
{Bread, Butter} 3
The remaining steps proceed exactly as in Apriori or FP-Growth: frequent k-itemsets are
extended recursively by further TID-set intersections until no new frequent itemsets are
found.
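A small ECLAT-style sketch of this recursion is shown below (illustrative; the function name eclat and the depth-first structure are assumptions, not a specific library's API). It mines all frequent itemsets from the vertical representation above by TID-set intersection:

```python
# ECLAT-style sketch: frequent itemsets via TID-set intersections (min count = 2).
tid = {
    "Milk":    {"T1", "T3", "T4", "T5"},
    "Bread":   {"T1", "T2", "T4", "T5"},
    "Butter":  {"T1", "T2", "T4"},
    "Diapers": {"T3", "T5"},
    "Beer":    {"T2", "T3"},
}
min_count = 2

def eclat(prefix, items, out):
    """Depth-first extension: intersect TID sets to grow `prefix`."""
    for i, (item, tids) in enumerate(items):
        new_prefix, new_tids = prefix + [item], tids
        out[frozenset(new_prefix)] = len(new_tids)
        # Candidate extensions: remaining items whose intersection stays frequent.
        extensions = []
        for other, other_tids in items[i + 1:]:
            common = new_tids & other_tids
            if len(common) >= min_count:
                extensions.append((other, common))
        eclat(new_prefix, extensions, out)

result = {}
eclat([], [(i, t) for i, t in tid.items() if len(t) >= min_count], result)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), -kv[1])):
    print(set(itemset), count)
# Yields the same ten frequent itemsets found by Apriori and FP-Growth above,
# including {Milk, Bread, Butter} with support 2.
```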
Advantages of Vertical Data Format
1. Efficient Support Counting: The vertical format allows for efficient support counting
by performing simple set intersection operations on the TID lists.
2. Fewer Database Scans: After transforming the dataset into the vertical format, the need
for repeated database scans is eliminated.
3. Compact Representation: The vertical format can often be more compact and easier
to process, especially when there are many items with low support.
Disadvantages of Vertical Data Format
1. Memory Overhead: Storing transaction ID lists for each item may require more
memory, particularly when the dataset is large or transactions contain many items.
2. Inefficient for Dense Datasets: The vertical format may become inefficient when
datasets are dense (i.e., when many items appear in most transactions), leading to large
TID lists and costly intersections.