Frequent pattern mining
Frequent pattern mining is the process of identifying recurring
relationships or patterns in a set of data. Imagine you work at a store and want to know what
products customers tend to buy together. For example, if a customer buys a PC and then a digital
camera, they might frequently also buy a memory card.
In this scenario:
Frequent patterns are combinations of items, sequences, or structures that appear repeatedly in the
data. For instance, a common purchase pattern might be "PC → digital camera → memory card."
Frequent itemsets are groups of items that appear together often, like "milk and bread" in a grocery
store.
Frequent sequential patterns refer to items bought in a specific order over time, like a sequence of
"PC, then digital camera, then memory card."
Frequent structured patterns involve more complex relationships, like parts of a larger structure or
network.
These patterns are useful for making recommendations, identifying trends, and improving marketing
strategies.
TWO KEY POINTS
1. Market Basket Analysis: a common technique in data mining. The idea is to analyze
shopping patterns by finding items that people frequently buy together.
For example, imagine customers at a grocery store. When one customer buys milk, bread, and
cereal, another buys milk, sugar, and bread, and a third buys milk, bread, and butter. Market basket
analysis would look at these "shopping baskets" and find patterns, like noticing that people who buy
milk often also buy bread.
Businesses use this information to make decisions about things like product placement, promotions,
or recommendations. If a store knows that milk and bread are often bought together, it might place
them closer to each other or offer a discount on one when the other is purchased.
In short, market basket analysis helps stores understand and predict customer buying behavior by
identifying items that are often bought together.
A well-known example of market basket analysis is how Amazon and other e-commerce
websites make product recommendations. When a customer views or adds an item to their
shopping cart, Amazon shows a
"Frequently Bought Together" or "Customers Who Bought This Also Bought" section. This feature
uses market basket analysis to suggest related products based on patterns observed in other
customers' purchases.
For instance, if many customers who buy a laptop also purchase a laptop bag and a wireless mouse,
Amazon will suggest these items to anyone buying a laptop. This recommendation system is
designed to increase the chance of customers buying additional products, ultimately boosting sales.
Another example is in grocery stores. Market basket analysis might reveal that customers who buy
diapers often buy baby wipes and maybe even a soft drink. This insight could lead the store to place
these items closer together, run joint promotions, or bundle them as a "baby essentials" pack,
making it easier and more appealing for shoppers to buy multiple related items in one trip.
2. Association Rules: association rules in data mining help us find relationships between items
that frequently appear together in a set of transactions, like shopping carts.
Here are the key ideas in simple terms:
Itemset: This is a group of items that appear together in a transaction. For example, {milk, bread,
butter} could be an itemset if these items are often bought together.
Association Rule: This is a rule that shows a relationship between two itemsets. For
example, a rule might be “If a customer buys milk, they are also likely to buy bread.” We write this as
milk → bread, meaning milk "implies" bread.
Support: Support measures how often an item or a set of items appears in the entire set of
transactions. For example, if 20% of all transactions include both milk and bread, then the support of
{milk, bread} is 20%. This tells us how common an itemset is.
Confidence: Confidence measures how often the rule holds: of the transactions that contain
the left-hand side, it is the fraction that also contain the right-hand side. For example, if 80% of the
transactions that include milk also include bread, the confidence of the rule milk → bread is 80%.
This tells us how reliable the rule is (see the short sketch after this list).
Minimum Support and Confidence: To decide which rules are useful, we often set
minimum thresholds for support and confidence. Only rules that meet or exceed these thresholds
are considered strong or interesting enough to use.
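To make support and confidence concrete, here is a minimal Python sketch; the transaction data is invented for illustration:

```python
# Toy transaction database (invented for illustration).
transactions = [
    {"milk", "bread", "cereal"},
    {"milk", "sugar", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print(support({"milk", "bread"}, transactions))      # 3 of 5 baskets -> 0.6
print(confidence({"milk"}, {"bread"}, transactions)) # 3 of 4 milk baskets -> ~0.75
```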
In summary, association rules help businesses see which products are often bought together and
how strong those relationships are, which they can use for recommendations, promotions, and
improving customer experience.
Efficient and Scalable Frequent Itemset Mining
Efficient and Scalable Frequent Itemset Mining refers to methods used to quickly and effectively find
sets of items that frequently appear together in large datasets. One of the most well-known
algorithms for this is the Apriori algorithm.
What is Frequent Itemset Mining?
Frequent itemset mining is the process of discovering sets of items that appear together often in a
dataset. This technique is commonly used in market basket analysis, where we try to find
combinations of products that are frequently bought together. For example, in a grocery store, we
might want to know if customers who buy bread are also likely to buy milk.
1. Apriori Algorithm:
The Apriori algorithm is a popular method for finding these frequent itemsets.
It was one of the first algorithms developed for this purpose and is relatively
straightforward to understand. Here’s how it works in simple terms:
Apriori Principle: The core idea behind Apriori is that if an itemset (a set of
items) is frequent, then all smaller subsets of that itemset must also be
frequent. For example, if the combination of {bread, milk, eggs} is frequent in
transactions, then {bread, milk} and {milk, eggs} should also be frequent. This
principle helps reduce the number of combinations we need to check.
How Apriori Works:
Step 1 - Find Frequent 1-Itemsets: The algorithm first scans the
database to find individual items that are frequent (appear in
transactions at least a certain number of times). Let’s say we set a
threshold of 20% for an item to be considered frequent. If bread
appears in 25% of transactions, it qualifies as a frequent item.
Step 2 - Generate Candidate 2-Itemsets: The algorithm then pairs
these frequent 1-itemsets to create candidate 2-itemsets (sets of
two items).
Step 3 - Filter 2-Itemsets: The algorithm scans the database again
to count each candidate's support. Any 2-itemset that meets the
frequency threshold is kept, while those that don't are discarded.
Step 4 - Generate Candidate 3-Itemsets: The algorithm continues to
combine items to create larger itemsets (e.g., 3-itemsets, 4-itemsets)
and filters them by frequency, stopping when no more frequent
itemsets can be generated.
This process of building up larger sets from smaller ones and
discarding those that don't meet the threshold is what makes the
Apriori algorithm efficient.
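As a rough illustration of this level-wise loop, here is a small self-contained Python sketch (the basket data is invented, and a production implementation would use more careful candidate generation and pruning):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search: frequent k-itemsets are grown from frequent (k-1)-itemsets."""
    transactions = [set(t) for t in transactions]
    n = len(transactions)

    def keep_frequent(candidates):
        # Scan the database and keep candidates meeting the support threshold.
        return {c for c in candidates
                if sum(1 for t in transactions if c <= t) / n >= min_support}

    # Step 1: frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    level = keep_frequent({frozenset([i]) for i in items})
    result = set(level)

    k = 2
    while level:
        # Steps 2-4: join (k-1)-itemsets into k-candidates, prune, filter, repeat.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Apriori principle: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = keep_frequent(candidates)
        result |= level
        k += 1
    return result

baskets = [{"milk", "bread", "cereal"},
           {"milk", "sugar", "bread"},
           {"milk", "bread", "butter"}]
print(apriori(baskets, min_support=0.6))  # includes frozenset({'milk', 'bread'})
```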
Why is Apriori Efficient and Scalable?
The Apriori algorithm is efficient because it reduces the number of
itemsets it needs to check. By using the Apriori principle, it doesn’t
have to look at every possible combination of items in the data
(which would be computationally expensive). Instead, it only focuses
on combinations that could potentially be frequent based on the
results of previous steps.
Additionally, Apriori can handle large datasets, making it scalable.
However, for very large datasets, there are more advanced
algorithms (like FP-Growth) that can be even more efficient.
Other Real-Life Applications of Apriori
E-commerce Recommendations: Online stores like Amazon use
frequent itemset mining to suggest "Frequently Bought Together"
items. For example, if customers often buy a phone case with a new
smartphone, the site may recommend the case when someone views
the phone.
Fraud Detection in Banking: Banks use frequent itemset mining to
detect common patterns in fraudulent transactions. If certain
transaction sequences are often linked with fraud, the bank can set
up alerts for similar patterns.
Healthcare Analysis: Hospitals can analyze frequent itemsets of
symptoms or medications that are often prescribed together. This
helps them understand common treatment patterns and could aid in
diagnosis.
Summary
The Apriori algorithm is an efficient method for finding frequently
occurring item combinations in large datasets by building up from
individual items to larger sets. Its ability to reduce the number of
checks needed makes it practical for large-scale applications, helping
businesses in retail and e-commerce optimize their operations and
make personalized recommendations.
2. FP-Growth Algorithm
The FP-Growth algorithm (Frequent Pattern Growth) is an efficient algorithm for finding frequent
itemsets in large datasets. It’s considered a faster and more efficient alternative to the Apriori
algorithm because it doesn’t require multiple scans of the database or generation of many candidate
itemsets. Let’s go through how it works and a real-life example to make it easier to understand.
What is the FP-Growth Algorithm?
FP-Growth, short for Frequent Pattern Growth, is used in frequent itemset mining to find patterns or
sets of items that appear frequently together. Instead of generating all possible item combinations
(like Apriori does), FP-Growth uses a more compact data structure called an FP-tree (Frequent
Pattern Tree) to store the database, which allows it to find frequent itemsets more efficiently.
How Does the FP-Growth Algorithm Work?
Build the FP-Tree:
The algorithm first scans the database to find the frequency of each item (called support).
It discards items that don’t meet the minimum support threshold (a set percentage that determines
whether an item is “frequent”).
Next, it arranges the remaining frequent items in a specific order based on their frequency.
The algorithm then builds an FP-tree, which is a compressed representation of the transactions, by
grouping similar items and sharing branches where possible.
Extract Frequent Patterns:
Once the FP-tree is built, the algorithm extracts frequent itemsets from it.
It starts from the bottom of the tree and looks for patterns by identifying paths that share common
items.
Using these paths, it generates frequent itemsets without needing to scan the entire database
multiple times.
By using this method, FP-Growth avoids the time-consuming process of generating candidate
itemsets like Apriori, making it faster and more suitable for large datasets.
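In practice the FP-tree is rarely coded by hand. As a sketch, assuming the third-party mlxtend library is installed, frequent itemsets can be mined with its fpgrowth function (the basket data below is invented):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Invented transaction data for illustration.
baskets = [["bread", "milk", "eggs"],
           ["bread", "milk"],
           ["milk", "eggs"],
           ["bread", "milk", "eggs", "butter"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(baskets), columns=te.columns_)

# Mine all itemsets appearing in at least 50% of transactions.
print(fpgrowth(df, min_support=0.5, use_colnames=True))
```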
Example: Retail Market Basket Analysis
Let’s look at how a grocery store might use FP-Growth for market basket analysis:
Scenario: A large grocery store wants to identify frequent item combinations in their transactions to
improve product placement and create targeted promotions. For example, they want to know if
items like "bread," "milk," and "eggs" are often purchased together.
Using FP-Growth:
Step 1 - Build the FP-Tree: The store’s transaction data is fed into the FP-Growth algorithm, which
first counts how often each item appears (e.g., bread appears in 50% of transactions, milk in 40%,
eggs in 30%). Items that don’t meet a certain threshold (e.g., 10%) are ignored.
The frequent items are then arranged by frequency, and an FP-tree is created. For example, if
"bread" and "milk" are frequently bought together, they might share a path in the tree, representing
this combination in a compressed way.
Step 2 - Extract Patterns: The algorithm then extracts patterns from the FP-tree. It finds that "bread"
and "milk" are frequently bought together, as are "milk" and "eggs." It might also find a 3-itemset
like {bread, milk, eggs} if customers often buy all three items together.
Results and Actions:
Product Placement: The store could place bread, milk, and eggs closer together to encourage
customers to pick up all three.
Bundling and Promotions: The store could create a "breakfast bundle" promotion where customers
get a discount if they buy bread, milk, and eggs together.
Personalized Offers: The store could send coupons for eggs to customers who frequently buy bread
and milk, increasing the likelihood that they’ll buy all three items on their next visit.
Why is FP-Growth Better Than Apriori?
FP-Growth is often faster and more efficient than Apriori because:
It avoids generating candidate itemsets. Instead of trying every possible combination, FP-Growth
uses the FP-tree to directly find frequent itemsets.
It reduces the number of database scans. FP-Growth needs only two passes over the database
to build the FP-tree, after which it works with this compressed version of the data, whereas
Apriori re-scans the entire database for every candidate level, which is time-consuming for
large datasets.
Other Real-Life Applications of FP-Growth
E-commerce Recommendations: Online stores can use FP-Growth to find combinations of items that
are frequently bought together, like a laptop with a laptop bag and mouse. The site can then
recommend these items as a bundle to customers browsing laptops.
Healthcare and Diagnosis Patterns: Hospitals can use FP-Growth to identify common combinations
of symptoms or treatments. For instance, if certain symptoms often occur together in patients with
a specific condition, doctors can be alerted to consider that diagnosis when those symptoms are
present.
Telecommunications: Telecom companies can use FP-Growth to identify frequently used service
packages. If customers often purchase a specific combination of internet, mobile, and streaming
services, the company can create bundled packages for these services.
Summary
The FP-Growth algorithm is a fast, efficient method for finding frequently occurring patterns in large
datasets by using an FP-tree structure to compress data and avoid unnecessary calculations. It’s
particularly useful in areas like retail, healthcare, and e-commerce, where understanding common
item combinations helps improve sales, personalize recommendations, and make better
business decisions.
Mining frequent itemsets using the vertical data format in data mining
In data mining, we analyze transactions (like shopping carts) to find frequent itemsets (groups of
items that are often bought together).
Horizontal format: Data is stored where each row represents a transaction, and the items in the
transaction are listed together (e.g., Transaction 1: {milk, bread, butter}).
Vertical format: Data is stored where each item is associated with a list of transaction IDs (TIDs) that
contain it. For example:
Milk: {T1, T3, T4}
Bread: {T1, T2, T4}
Butter: {T1, T4}
The vertical format is useful because it directly tells us where an item appears and makes it easier to
intersect transaction IDs to find frequent itemsets.
Real-Life Example
Imagine a grocery store's data:
In the horizontal format:
T1: {milk, bread, butter}
T2: {bread, eggs}
T3: {milk, eggs}
T4: {milk, bread, butter, eggs}
In the vertical format:
Milk: {T1, T3, T4}
Bread: {T1, T2, T4}
Butter: {T1, T4}
Eggs: {T2, T3, T4}
Using the vertical format, you can quickly find intersections:
To see how often {milk, bread} appears, intersect their TID lists:
Milk: {T1, T3, T4}
Bread: {T1, T2, T4}
Intersection: {T1, T4} → {milk, bread} appears in 2 transactions.
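This intersection step translates directly into code. Here is a minimal Eclat-style sketch in Python using the same toy database:

```python
# Vertical format of the example database: item -> set of TIDs.
tidlists = {
    "milk":   {"T1", "T3", "T4"},
    "bread":  {"T1", "T2", "T4"},
    "butter": {"T1", "T4"},
    "eggs":   {"T2", "T3", "T4"},
}

def support_count(items, tidlists):
    """Intersect the TID lists of all items; the size of the result is the support."""
    tids = set.intersection(*(tidlists[i] for i in items))
    return len(tids), tids

count, tids = support_count(["milk", "bread"], tidlists)
print(count, sorted(tids))  # 2 ['T1', 'T4']
```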
Why Use Vertical Format?
It’s efficient for mining frequent patterns because finding intersections is faster than scanning all
rows repeatedly (as in horizontal format).
Vertical format simplifies the process when working with algorithms like Eclat, which focuses on TID
intersections.
In real life, this approach helps businesses quickly analyze customer transactions and find
combinations of products that sell well together, enabling better product placement
or bundle offers.
Mining Closed and Max Patterns
Closed and max patterns refine frequent itemset mining; the concepts are explained below in
simple terms, with a real-life example.
Key Terms:
Frequent Itemset:
A set of items (like products) that appear together in transactions with enough frequency, meeting a
minimum support threshold.
Closed Frequent Itemset:
A frequent itemset is closed if there is no superset (larger set containing it) that has the same
support count.
It’s a way to reduce redundancy while preserving all the necessary information about frequent
patterns.
Max Frequent Itemset:
A frequent itemset is maximal if it has no frequent superset (no larger frequent itemset contains it).
Max patterns are the most compact representation but may lose some details about subsets.
Why Use Closed and Max Patterns?
When the number of frequent itemsets grows very large (e.g., when there are many items), it can
become overwhelming and computationally expensive to analyze all frequent patterns. By focusing
on closed or max patterns, we can reduce the number of patterns without losing too much critical
information.
Real-Life Example:
Scenario: Analyzing shopping behavior in a grocery store.
Transactions:
T1: {milk, bread, butter}
T2: {milk, bread}
T3: {bread, butter}
T4: {milk, bread, butter}
Minimum Support = 2.
Frequent Itemsets:
{milk}: Appears in 3 transactions.
{bread}: Appears in 4 transactions.
{butter}: Appears in 3 transactions.
{milk, bread}: Appears in 3 transactions.
{bread, butter}: Appears in 3 transactions.
{milk, butter}: Appears in 2 transactions.
{milk, bread, butter}: Appears in 2 transactions.
Closed Frequent Itemsets:
{milk, bread, butter}: Closed because it has no frequent superset at all.
{milk, bread}: Closed because its only superset, {milk, bread, butter}, has lower support (2 vs. 3).
{bread, butter}: Closed for the same reason as above.
{bread}: Closed because both of its supersets, {milk, bread} and {bread, butter}, have lower support (3 vs. 4).
(In contrast, {milk}, {butter}, and {milk, butter} are not closed, because each has a superset with exactly the same support.)
Max Frequent Itemset:
{milk, bread, butter}: Maximal because no larger frequent itemsets exist.
Application:
Imagine a store owner analyzing buying patterns. They don't need all frequent itemsets—just the
closed ones to see meaningful groupings (e.g., "milk and bread are bought together frequently, but
adding butter doesn't happen in all such cases"). If they only want the broadest patterns, they can
use the max itemsets (e.g., "the largest group of items people often buy together is milk, bread, and
butter").
This helps in reducing data complexity while still allowing for decisions like bundling
items for discounts.
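The closed and maximal sets above can be double-checked with a short brute-force sketch in Python; this enumerates every itemset, so it is only workable for toy data like this:

```python
from itertools import combinations

transactions = [{"milk", "bread", "butter"},
                {"milk", "bread"},
                {"bread", "butter"},
                {"milk", "bread", "butter"}]
min_support = 2

# Enumerate every itemset and keep the frequent ones with their counts.
items = sorted({i for t in transactions for i in t})
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = frozenset(combo)
        count = sum(1 for t in transactions if s <= t)
        if count >= min_support:
            frequent[s] = count

# Closed: no proper superset has the same support count.
closed = {s for s, c in frequent.items()
          if not any(s < t and frequent[t] == c for t in frequent)}
# Maximal: no proper superset is frequent at all.
maximal = {s for s in frequent if not any(s < t for t in frequent)}

print(closed)   # {bread}, {milk, bread}, {bread, butter}, {milk, bread, butter}
print(maximal)  # {milk, bread, butter}
```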
Key Terms
Frequent Itemset: A group of items that occur together in transactions with a frequency above a
minimum threshold.
Closed Frequent Itemset: A frequent itemset where no proper superset has the same support
(occurrence count).
Pruning: Techniques to eliminate unnecessary itemsets, speeding up the mining process.
Pruning Strategies Explained
1. Item Merging
If every transaction containing a frequent itemset X also contains an itemset Y, and the union of X
and Y forms a closed frequent itemset, then:
There’s no need to search for itemsets that include X but not Y.
Example:
Consider the transactions:
T1: {1, 2, 11}
T2: {1, 2, 11, 13}
Frequent itemsets include {1, 2} and {1, 2, 11}. Every transaction containing {1, 2} also contains
{11}, so both itemsets have the same support (2).
{1, 2, 11} is a closed itemset: its only superset, {1, 2, 11, 13}, appears in just one transaction, so
adding item 13 would reduce the support.
By item merging, there is no need to search for itemsets that contain {1, 2} but not {11}; see the
small check below.
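As a small illustration, the item merging test itself is a one-line check; the helper name below is invented:

```python
def can_merge(x, y, transactions):
    """Item merging test: every transaction containing X also contains Y."""
    return all(y <= t for t in transactions if x <= t)

db = [{1, 2, 11}, {1, 2, 11, 13}]
# True: item 11 can be merged into {1, 2}; no need to search for
# itemsets that contain {1, 2} but not 11.
print(can_merge({1, 2}, {11}, db))
```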
2. Sub-Itemset Pruning
If a frequent itemset X is a proper subset of an already-found closed frequent itemset Y, and both
have the same support, then:
X and all of its descendants in the search tree (larger itemsets grown from X) can be ignored,
because none of them can be closed.
Example:
Transactions:
T1: {a1, a2, ..., a100}
T2: {a1, a2, ..., a50}
Minimum support = 2.
Frequent itemsets:
{a1}: Appears in both transactions (support = 2).
{a1, a2}: Appears in both transactions (support = 2).
Pruning:
Once the closed itemset {a1, a2, ..., a50} (support = 2) has been found, {a2} is a proper subset of
it with the same support (2). So {a2} and its descendants (itemsets grown from {a2}) do not need
further exploration.
3. Item Skipping
During depth-first search, if a locally frequent item has the same support in the header tables at
several different levels, it can be pruned from the higher-level header tables.
Example:
Continuing the two transactions above, a2 has support 2 in the global header table and also
support 2 in the header table of a1's projected database. Because the supports match, a2 can be
pruned from the higher-level header table once a1's branch has been mined.
4. Superset and Subset Checking
To confirm whether a frequent itemset is closed, check:
Superset Checking: whether any already-found closed itemset with the same support contains the
current itemset; if so, the current itemset is not closed.
Subset Checking: whether the current itemset is contained in any already-found closed itemset,
which avoids recording redundant itemsets.
Example:
Found itemsets:
{a, b} (support = 3)
{a, b, c} (support = 3)
Superset Checking: {a, b, c} is a superset of {a, b} with the same support. So, {a, b} is not closed.
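A minimal sketch of superset checking against the closed itemsets found so far (the function name and data are invented for illustration):

```python
def is_closed_so_far(itemset, count, found_closed):
    """Superset checking: `itemset` cannot be closed if some already-found
    closed itemset is a proper superset with the same support count.
    `found_closed` maps frozenset -> support count."""
    return not any(itemset < other and c == count
                   for other, c in found_closed.items())

found_closed = {frozenset({"a", "b", "c"}): 3}
print(is_closed_so_far(frozenset({"a", "b"}), 3, found_closed))  # False: absorbed
print(is_closed_so_far(frozenset({"a", "b"}), 4, found_closed))  # True: supports differ
```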
Simplified Real-Life Analogy
Imagine you're grouping customers by shared product purchases:
Item Merging: If everyone buying "milk and bread" also buys "butter," focus on the group "milk,
bread, and butter."
Sub-Itemset Pruning: If the group "milk and bread" behaves the same as "milk, bread, and butter,"
there's no need to analyze "milk and bread" separately.
Item Skipping: If "milk" and "bread" show the same trends, skip redundant checks.
Superset Checking: Ensure you're only tracking unique complete combinations, not overlapping
subsets.
These optimizations save time by eliminating unnecessary checks while preserving the
important patterns.
Advanced Pattern Mining: Pattern Mining in Multilevel,
Multidimensional space
Pattern mining in multilevel, multidimensional space refers to
analyzing data at various levels of abstraction (from general to
specific) and across multiple attributes or dimensions (such as
age, location, or product type).
1. What is Multilevel Pattern Mining?
In many real-world scenarios, data is organized in a hierarchical
manner.
Multilevel mining allows us to discover patterns at different levels of
detail.
High-level patterns are broader and more general (e.g., "Electronics
are popular").
Low-level patterns are more detailed and specific (e.g., "Dell laptops
are frequently purchased").
Example
Imagine a retail store selling electronics. The data about purchased
items is arranged in a hierarchical structure (concept hierarchy), as
shown below:
Hierarchy Levels (based on Figure 7.2):
Level 0 (Root Level): General category for all products, e.g.,
Electronics.
Level 1: Broad categories like Computers, Software, and Printers.
Level 2: More specific subcategories like Laptops, Desktops, Office
Software.
Level 3: Brand-specific items, e.g., Dell Desktop Computers,
Microsoft Office Software.
Level 4: Most detailed level, including raw data (e.g., specific product
IDs).
Why Multilevel Mining?
General patterns at higher levels (e.g., "Electronics are popular")
might seem obvious.
Specific patterns at lower levels (e.g., "Dell laptops with Microsoft
Office are often bought together") can uncover hidden insights.
By analyzing patterns across levels, businesses can make better
decisions, like customizing promotions.
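One way to picture multilevel mining is to "roll up" raw items through the concept hierarchy and re-count support at each level. A minimal Python sketch, with an invented hierarchy and invented baskets:

```python
# Hypothetical concept hierarchy: specific item -> parent category.
to_subcategory = {
    "Dell laptop":      "laptop",
    "HP laptop":        "laptop",
    "Microsoft Office": "office software",
}
to_category = {"laptop": "computer", "office software": "software"}

baskets = [
    {"Dell laptop", "Microsoft Office"},
    {"HP laptop", "Microsoft Office"},
    {"Dell laptop"},
]

def roll_up(basket, mapping):
    """Replace each item with its parent in the hierarchy (if it has one)."""
    return {mapping.get(item, item) for item in basket}

# Brand level: {Dell laptop, Microsoft Office} is in only 1 of 3 baskets.
level2 = [roll_up(b, to_subcategory) for b in baskets]
print(sum({"laptop", "office software"} <= b for b in level2))  # 2 of 3 baskets

level1 = [roll_up(b, to_category) for b in level2]
print(sum({"computer", "software"} <= b for b in level1))       # also 2 of 3
```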
2. What is Multidimensional Pattern Mining?
This looks at multiple dimensions of data, such as:
What a customer buys (products).
Who the customer is (age, gender, etc.).
Where the purchase happened (location).
Real-Life Example:
Consider an electronics store analyzing sales data:
At a high level: "Electronics are frequently purchased by all
customers."
At a medium level: "Laptops and office software are popular among
working professionals."
At a detailed level: "Customers aged 25-35 often buy Dell laptops
bundled with Microsoft Office from New York stores."
By mining patterns across these levels and dimensions, businesses
can:
Develop tailored marketing strategies.
Improve inventory management.
Identify customer preferences at different levels of detail.
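As a rough sketch, mining across dimensions amounts to counting patterns over tuples of attribute values rather than items alone; the records below are invented:

```python
from collections import Counter

# Each record combines several dimensions: (age_group, city, product).
records = [
    ("25-35", "New York", "Dell laptop"),
    ("25-35", "New York", "Dell laptop"),
    ("25-35", "Boston",   "HP laptop"),
    ("36-50", "New York", "printer"),
]

# Frequent tuples become candidate multidimensional rules, e.g.
# age(25-35) AND location(New York) => buys(Dell laptop).
for pattern, count in Counter(records).most_common():
    print(pattern, count / len(records))
```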
Key Takeaway
Multilevel and multidimensional pattern mining helps uncover
general trends as well as specific insights from complex datasets by
analyzing different levels of abstraction and multiple dimensions.
This flexibility is particularly useful in industries like retail,
finance, and healthcare.