
Fundamentals of Data Science

UNIT-3
Mining Frequent Patterns
Introduction:
Frequent pattern mining in data mining is the process of identifying patterns or associations within a
dataset that occur frequently. This is typically done by analysing large datasets to find items or sets of
items that appear together frequently. It encompasses recognising collections of components that occur
together frequently in a transactional or relational database.
Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures) that appear frequently in a
data set. For example, a set of items, such as milk and bread, that appear frequently together in a
transaction data set is a frequent itemset. A subsequence, such as buying first a PC, then a digital camera,
and then a memory card, if it occurs frequently in a shopping history database, is a (frequent) sequential
pattern. A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices,
which may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a
(frequent) structured pattern. Finding frequent patterns plays an essential role in mining associations,
correlations, and many other interesting relationships among data. Moreover, it helps in data
classification, clustering, and other data mining tasks. Thus, frequent pattern mining has become an
important data mining task and a focused theme in data mining research.

Example of market basket analysis, the earliest form of frequent pattern mining for association rules.

Definition of Frequent Patterns: Frequent patterns refer to combinations of items, sequences, or
substructures that occur frequently in a dataset. For example, in a retail dataset, a frequent pattern
could be the association between certain products that are often purchased together, like bread and
butter.

Mining frequent patterns in data science involves identifying recurring associations or relationships
within a dataset.

Basic Concepts in Frequent Pattern Mining

The technique of frequent pattern mining is built upon a number of fundamental ideas. The analysis is
based on transaction databases, which contain records or transactions that represent collections of items.
Items within these transactions are grouped together as itemsets.

An itemset is a collection of one or more items that are considered as a single entity. Each item within an
itemset is typically an element or attribute associated with data. Itemsets play a crucial role in the analysis
of datasets to identify patterns, associations, or relationships among items.

There are two main types of Itemsets:

1. Frequent Itemset: A frequent itemset is an itemset that appears in a dataset with a frequency greater
than or equal to a specified minimum support threshold. The support of an itemset is the proportion of
transactions in which the itemset occurs.

2. Association Rule: An association rule is a relationship between two itemsets, often represented in the
form of "if X, then Y." The two parts of the rule are called the antecedent (X) and the consequent (Y).
The strength of the association rule is measured by metrics such as confidence and lift.

Support

Support is calculated as the number of transactions in which an item occurs divided by the total number
of transactions:

support(A) = (number of transactions in which A occurs) / (total number of transactions)

EX:- support(pen) = transactions related to pen / total transactions = 500/5000 = 10 percent

* Confidence

Confidence indicates whether products sell on their own or through combined sales. It is calculated as
the combined transactions divided by the antecedent's individual transactions:

Confidence(A => B) = P(B|A) = sup(A ∪ B) / sup(A)

EX:- confidence = combined transactions / individual transactions = 100/500 = 20 percent

* Lift

Lift gives the ratio between the observed co-occurrence of two items and what would be expected if the
items sold independently:

Lift(A -> B) = support(A ∪ B) / (support(A) × support(B))

EX:- Lift = confidence percent / support percent = 20/10 = 2

When the lift value is below 1, the combination is bought together less often than expected by chance.
In this case, the lift of 2 shows that the probability of buying both items together is high when compared
to the transactions for the individual items sold.
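The three measures can be sketched in plain Python. The toy transaction data below is an illustrative assumption (it does not reproduce the pen example's exact counts):

```python
# Hypothetical toy transaction data, for illustration only.
transactions = [
    {"pen", "book"}, {"pen", "pencil"}, {"pen", "book"},
    {"book", "eraser"}, {"pen", "book", "eraser"},
]

def support(itemset, txs):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(set(itemset) <= t for t in txs) / len(txs)

def confidence(antecedent, consequent, txs):
    """sup(A ∪ B) / sup(A): how often B appears when A does."""
    return support(set(antecedent) | set(consequent), txs) / support(antecedent, txs)

def lift(antecedent, consequent, txs):
    """sup(A ∪ B) / (sup(A) × sup(B)); values > 1 indicate positive correlation."""
    both = support(set(antecedent) | set(consequent), txs)
    return both / (support(antecedent, txs) * support(consequent, txs))

print(support({"pen"}, transactions))               # 4/5 = 0.8
print(confidence({"pen"}, {"book"}, transactions))  # (3/5)/(4/5) = 0.75
```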

Market Basket Analysis (MBA)

Frequent itemset mining leads to the discovery of associations and correlations among items in large
transactional or relational data sets. With massive amounts of data continuously being collected and
stored, many industries are becoming interested in mining such patterns from their databases.

The discovery of interesting correlation relationships among huge amounts of business transaction
records can help in many business decision-making processes such as catalog design, cross-
marketing, and customer shopping behaviour analysis.

A typical example of frequent itemset mining is market basket analysis. This process analyzes
customer buying habits by finding associations between the different items that customers place in
their “shopping baskets” (Figure). The discovery of these associations can help retailers develop
marketing strategies by gaining insight into which items are frequently purchased together by
customers.

For instance, if customers are buying milk, how likely are they to also buy bread (and what kind of
bread) on the same trip to the supermarket? This information can lead to increased sales by helping
retailers do selective marketing and plan their shelf space.

Examples of Market Basket Analysis applications include Retail, Telecom, BFSI, Medicine, etc.

How does Market Basket Analysis Work?

Market Basket Analysis is modeled on association rule mining, i.e., the IF {}, THEN {} construct. For
example, IF a customer buys bread, THEN he is likely to buy butter as well.

Association rules are usually represented as:

Example:- {Bread} -> {Butter}

Some terminologies to familiarize yourself with Market Basket Analysis are:

◆Antecedent:- Items or 'itemsets' found within the data are antecedents. In simpler words, it's the IF
component, written on the left-hand side. In the above example, bread is the antecedent.

◆Consequent:- A consequent is an item or set of items found in combination with the antecedent. It's the
THEN component, written on the right-hand side. In the above example, butter is the consequent.

Working of Market Basket Analysis


The steps involved are:
(i) Transaction Data Collection: Data on customer transactions, such as receipts or online order
histories, are gathered. Each transaction should contain a list of items purchased by a customer.

(ii) Creation of a Transaction Database: The data is organized into a transactional database, where
each row represents a unique transaction, and the columns represent the items purchased.

(iii) Generation of Itemsets: All possible combinations of items that appear together in transactions are
identified. These combinations are known as itemsets.

(iv) Calculation of Support: The support for each itemset, which is the proportion of transactions that
contain the itemset, is calculated. Support is a measure of how frequently a particular combination of
items occurs.
(v) Setting a Minimum Support Threshold: A minimum support threshold is defined to filter out itemsets
with low occurrence, focusing on the most relevant associations.

(vi) Generation of Association Rules: Based on the frequent itemsets, association rules that express
relationships between items are generated. These rules typically have a format like "If (antecedent)
Then (consequent)" with a certain confidence.

(vii) Calculation of Confidence: Confidence measures how often the rule is correct. It is calculated as
the support for the combined itemset divided by the support for the antecedent.

(viii) Setting a Minimum Confidence Threshold: A minimum confidence threshold is established to
select the most meaningful and actionable rules.

(ix) Interpretation and Action: The generated rules are analyzed to understand the associations between
products. Strategies such as product placement, bundling, or targeted marketing are implemented based
on the discovered patterns.
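The nine steps above can be sketched end to end in Python. All item names and thresholds below are illustrative assumptions, and a brute-force enumeration of itemsets stands in for the more efficient Apriori search covered later in this unit:

```python
from itertools import combinations

# Step (i)-(ii): hypothetical transaction database (one set per transaction).
transactions = [
    {"bread", "butter", "milk"}, {"bread", "butter"},
    {"bread", "milk"}, {"butter", "jam"}, {"bread", "butter", "jam"},
]
min_support, min_confidence = 0.4, 0.6  # steps (v) and (viii)

def support(items):
    return sum(set(items) <= t for t in transactions) / len(transactions)

# Steps (iii)-(v): enumerate itemsets and keep those meeting minimum support.
all_items = sorted(set().union(*transactions))
frequent = [set(c)
            for k in range(1, len(all_items) + 1)
            for c in combinations(all_items, k)
            if support(c) >= min_support]

# Steps (vi)-(viii): split each frequent itemset into antecedent -> consequent
# and keep rules whose confidence meets the threshold.
rules = []
for itemset in frequent:
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for antecedent in map(set, combinations(sorted(itemset), r)):
            consequent = itemset - antecedent
            conf = support(itemset) / support(antecedent)
            if conf >= min_confidence:
                rules.append((antecedent, consequent, conf))

# Step (ix): inspect the surviving rules.
for a, c, conf in rules:
    print(f"{sorted(a)} -> {sorted(c)} (confidence {conf:.2f})")
```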

Types of Market Basket Analysis

Market Basket Analysis techniques can be categorised based on how the available data is utilised. The
following are the types of market basket analysis in data mining:

1) Descriptive market basket analysis:-

This type only derives insights from past data and is the most frequently used approach. The analysis here
does not make any predictions but rates the association between products using statistical techniques. For
those familiar with the basics of Data Analysis, this type of modeling is known as unsupervised learning.

2) Predictive market basket analysis:-

This type uses supervised learning models like classification and regression. It essentially aims to mimic
the market to analyse what causes what to happen. Essentially, it considers items purchased in a sequence
to determine cross-selling. For example, buying an extended warranty is more likely to follow the
purchase of an iPhone. While it isn't as widely used as a descriptive MBA, it is still a very valuable tool
for marketers.

3) Differential market basket analysis:-

This type of analysis is beneficial for competitor analysis. It compares purchase history between stores,
between seasons, between two time periods, between different days of the week, etc., to find interesting
patterns in consumer behaviour. For example, it can help determine why some users prefer to purchase
the same product at the same price on Amazon vs Flipkart. The answer can be that the Amazon reseller
has more warehouses and can deliver faster, or maybe something more profound like user experience.

Benefits / Advantages of (MBA)

1) Increasing market share:- Once a company hits peak growth, it becomes challenging to determine
new ways of increasing market share. Market Basket Analysis can be used to put together demographic
and gentrification data to determine the location of a new store or geo-targeted ads.

2) Behaviour analysis:- Understanding customer behaviour patterns is a cornerstone of marketing.
MBA can be used anywhere from a simple catalogue design to UI/UX.
3) Optimisation of in-store operations:- MBA is not only helpful in determining what goes on the
shelves but also behind the store. Geographical patterns play a key role in determining the popularity
or strength of certain products, and therefore, MBA has been increasingly used to optimize inventory
for each store or warehouse.

4) Campaigns and promotions:- MBA is used not only to determine which products sell together but
also which products form keystones in the product line.

5) Recommendations:- OTT platforms like Netflix and Amazon Prime benefit from MBA by
understanding what kinds of movies people tend to watch together frequently.

6) Increases sales and return on investment.
7) Boosts consumer engagement.
8) Increases client satisfaction.
9) Improves marketing initiatives and strategies.

Disadvantages of MBA:-

1) It identifies hypotheses which then need to be tested.

2) The impact of any action taken must still be measured.

3) It is difficult to identify product groupings.

4) A large number of real transactions is needed to do an effective basket analysis.

5) The analysis can capture results that were due to the success of a previous marketing campaign.

6) Possible measure of information.

7) It depends on the accuracy of the data.

Apriori Algorithm:

Finding Frequent Itemsets Using Candidate Generation: The Apriori Algorithm

• Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining
frequent itemsets for Boolean association rules.
• It mines single-dimensional Boolean association rules from transaction databases.
• The name of the algorithm is based on the fact that the algorithm uses prior knowledge
of frequent itemset properties.
• Apriori employs an iterative approach known as a level-wise search, where k-itemsets
are used to explore (k+1)-itemsets.
• First, the set of frequent 1-itemsets is found by scanning the database to accumulate the
count for each item, and collecting those items that satisfy minimum support. The resulting set
is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find
L3, and so on, until no more frequent k-itemsets can be found.

• The finding of each Lk requires one full scan of the database.


• A two-step process is followed in Apriori, consisting of join and prune actions.

Steps to solve Apriori Algorithm

1. Define the minimum support threshold.
2. Generate candidate itemsets C1, C2, ....
3. Count the support of each candidate itemset.
4. Prune the candidate itemsets: remove the itemsets that do not meet the minimum support threshold.
5. Generate the lists of frequent itemsets L1, L2, ....
6. Repeat steps 3-5 until no more frequent itemsets can be generated.
7. Generate association rules.
8. Evaluate the strong association rules: those with confidence >= the minimum confidence threshold.
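Steps 1-6 can be sketched as a minimal level-wise Apriori in Python. The grocery data below is a hypothetical example, and rule generation (steps 7-8) is omitted for brevity:

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise search: frequent k-itemsets (Lk) generate (k+1)-candidates."""
    # Find L1 with one scan of the database.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_count}
    frequent = set(Lk)
    k = 2
    while Lk:
        # Join step: merge pairs of frequent (k-1)-itemsets into k-candidates.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset
        # (the Apriori property).
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Count support with one full scan; keep candidates meeting min support.
        Lk = {c for c in candidates
              if sum(c <= t for t in transactions) >= min_count}
        frequent |= Lk
        k += 1
    return frequent

# Hypothetical transactions with a minimum support count of 3.
txs = [{"milk", "bread"}, {"milk", "bread", "butter"}, {"bread", "butter"},
       {"milk", "butter"}, {"milk", "bread", "butter"}]
result = apriori(txs, 3)
print(sorted(sorted(s) for s in result))
```

With this data, all three single items and all three pairs meet the threshold, but the triple occurs only twice and is pruned.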

Problem:

The minimum support threshold is 2 and the minimum confidence threshold is 60% (c = 60%). Find the
frequent itemsets and generate association rules on this.

Frequent Itemset (I) = {Hot Dogs, Coke, Chips}

Association rules,

• [Hot Dogs^Coke]=>[Chips] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Hot Dogs^Coke) = 2/2*100=100% //Selected

• [Hot Dogs^Chips]=>[Coke] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Hot Dogs^Chips) = 2/2*100=100% //Selected

• [Coke^Chips]=>[Hot Dogs] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Coke^Chips) = 2/3*100=66.67% //Selected

• [Hot Dogs]=>[Coke^Chips] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Hot Dogs) = 2/4*100=50% //Rejected

• [Coke]=>[Hot Dogs^Chips] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Coke) = 2/3*100=66.67% //Selected

• [Chips]=>[Hot Dogs^Coke] //confidence = sup(Hot Dogs^Coke^Chips)/sup(Chips) = 2/4*100=50% //Rejected

There are four strong results (minimum confidence greater than 60%)

Algorithm:

Advantages of Apriori algorithm


1. Efficient discovery of patterns: Association rule mining algorithms are efficient at discovering patterns in large
datasets, making them useful for tasks such as market basket analysis and recommendation systems.

2. Easy to interpret: The results of association rule mining are easy to understand and interpret, making it possible
to explain the patterns found in the data.

3. Can be used in a wide range of applications: Association rule mining can be used in a wide range of
applications such as retail, finance, and healthcare, which can help to improve decision-making and increase
revenue.

4. Handling large datasets: These algorithms can handle large datasets with many items and transactions, which
makes them suitable for big-data scenarios.

Disadvantages of Apriori algorithm


1. Large number of generated rules: Association rule mining can generate a large number of rules, many of which
may be irrelevant or uninteresting, which can make it difficult to identify the most important patterns.

2. Limited in detecting complex relationships: Association rule mining is limited in its ability to detect complex
relationships between items, and it only considers the co-occurrence of items in the same transaction.

3. Can be computationally expensive: As the number of items and transactions increases, the number of candidate
item sets also increases, which can make the algorithm computationally expensive.

4. Need to define the minimum support and confidence threshold: The minimum support and confidence
threshold must be set before the association rule mining process, which can be difficult and requires a good
understanding of the data.

Improving the Efficiency of Apriori :

Improving the efficiency of the Apriori algorithm is essential for handling large datasets and reducing computational
complexity. Here are several techniques and strategies to enhance the efficiency of the Apriori algorithm:

1. Transaction Reduction or Eliminate infrequent items from transactions: Before applying the Apriori
algorithm, remove items that do not meet the minimum support threshold. This reduces the size of transactions and
speeds up the subsequent steps.

2. Use of Data Structures: Utilize efficient data structures like hash tables or trees to store and manipulate itemsets.
This can significantly speed up the process of checking subset relationships and support counting.

3. Transaction Pruning: Discard transactions that do not contain any frequent items. If a transaction doesn't have a
single item that meets the minimum support, it cannot contribute to the discovery of frequent itemsets.

4. Caching Intermediate Results: Cache and reuse intermediate results during the candidate generation phase. If
the support of an itemset is calculated multiple times, store the result for reuse instead of recalculating it.

5. Dynamic Itemset Counting: Keep track of the count of each candidate itemset dynamically during the pass
through the dataset. This avoids the need for a separate pass to count support, reducing the number of scans through
the data.

6. Efficient Candidate Generation: Optimize the generation of candidate itemsets. Techniques such as pruning
based on frequent (k-1)-itemsets and avoiding duplicate generation can reduce the number of candidates considered.

7. Parallelisation: Parallelize the computation of support for different itemsets or transactions. This is particularly
effective when dealing with large datasets and multiple processors or machines are available.

8. Apriori Property: Leverage the Apriori property, which states that if an itemset is infrequent, all its supersets
will also be infrequent. This property can be used to prune the search space and avoid unnecessary calculations.
9. Bitwise Operations: Represent itemsets and transactions as bit vectors and use bitwise operations for set
intersection and union. This can lead to more efficient support counting.
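Point 9 can be sketched as follows (toy data assumed): each item maps to an integer whose bit i is set when transaction i contains the item, so support counting becomes a bitwise AND plus a bit count:

```python
# Hypothetical transactions, indexed 0..3.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}]

# One bit vector per item: bit i is set iff transaction i contains the item.
bitvec = {}
for i, t in enumerate(transactions):
    for item in t:
        bitvec[item] = bitvec.get(item, 0) | (1 << i)

def support_count(itemset):
    """Intersect the item bit vectors with AND, then count the set bits."""
    v = (1 << len(transactions)) - 1  # start with all transactions
    for item in itemset:
        v &= bitvec[item]
    return bin(v).count("1")

print(support_count({"a", "b"}))  # transactions 0 and 2 -> 2
```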

10. Sampling: Use sampling techniques to analyze a subset of the dataset instead of the entire dataset. While this
may not guarantee the discovery of all frequent itemsets, it can provide approximate results with significantly less
computational cost.

11. Memory Efficiency: Optimize memory usage by using data structures that consume less memory, especially for
large datasets.

Frequent Pattern Growth Algorithm


The two primary drawbacks of the Apriori Algorithm are:
1. At each step, candidate sets have to be built.
2. To build the candidate sets, the algorithm has to repeatedly scan the database.

These two properties inevitably make the algorithm slower. To overcome these redundant steps, a new
association-rule mining algorithm was developed, named the Frequent Pattern Growth (FP-Growth) Algorithm. It
overcomes the disadvantages of the Apriori algorithm by storing all the transactions in a trie data structure. The
FP-Growth algorithm was proposed by Han et al. in 2000.
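As a sketch of the trie idea (not the full algorithm, which also maintains a header table of node links for the mining phase), the FP-tree can be built with exactly two passes over the data; the example data is hypothetical:

```python
from collections import Counter

class FPNode:
    """A node in the FP-tree: an item, its count, and links to children."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count item frequencies; keep only frequent items.
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_count}
    root = FPNode(None)
    # Pass 2: insert each transaction with items sorted by descending frequency,
    # so common prefixes share a single path and the tree stays compact.
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root

tree = build_fp_tree([{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"b"}], 2)
print(tree.children["b"].count)  # all four transactions share the "b" prefix -> 4
```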

Advantages of FP Growth Algorithm


o This algorithm needs to scan the database only twice (two passes over the dataset).
o The pairing of items is not done in this algorithm, which makes it faster than Apriori.
o No candidate generation is required.
o It is efficient and scalable for mining both long and short frequent patterns.

Disadvantages of FP-Growth Algorithm


o The FP-tree is more difficult to build than Apriori's candidate sets.
o The FP-tree is expensive to build.
o The algorithm may not fit in main memory when the database is large.

FP Growth vs Apriori algorithm

Problem: Consider the following data; let the minimum support be 3.

Mining Multilevel Association Rules from Transaction Databases:

Types of Multilevel Association Rule: There are two types of association rules in multilevel, namely:
Intra-level Association Rule: These rules recognize patterns or relationships within a hierarchy level in data items.
The rule involves data items that are at the same level or belong to the same category in the hierarchy.

Inter-level Association Rule: In contrast to the intra-level association rule, the inter-level association rule finds
patterns across different levels of hierarchies. The rule involves data items from different levels or belonging to
different categories in the hierarchy.

Approaches For Mining Multilevel Association Rules

1. Uniform Minimum Support:
• The same minimum support threshold is used when mining at each level of abstraction. When a
uniform minimum support threshold is used, the search procedure is simplified. The method is also
simple in that users are required to specify only one minimum support threshold.
• The uniform support approach, however, has some difficulties. It is unlikely that items at lower
levels of abstraction will occur as frequently as those at higher levels of abstraction.
• If the minimum support threshold is set too high, it could miss some meaningful associations
occurring at low abstraction levels. If the threshold is set too low, it may generate many uninteresting
associations occurring at high abstraction levels.
2. Reduced Minimum Support:
• Each level of abstraction has its own minimum support threshold.
• The deeper the level of abstraction, the smaller the corresponding threshold. For
example, if the minimum support thresholds for levels 1 and 2 are 5% and 3%,
respectively, then "computer," "laptop computer," and "desktop computer"
are all considered frequent.

3. Group-Based Minimum Support:


• Because users or experts often have insight as to which groups are more important than others,
it is sometimes more desirable to set up user-specific, item-based, or group-based minimum support
thresholds when mining multilevel rules.
• For example, a user could set up the minimum support thresholds based on product
price, or on items of interest, such as by setting particularly low support thresholds for
laptop computers and flash drives in order to pay particular attention to the association
patterns containing items in these categories.
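The reduced-support idea can be sketched as follows; the hierarchy, thresholds, and transactions are all illustrative assumptions:

```python
# Leaf item -> level-1 category (hypothetical hierarchy).
hierarchy = {
    "laptop computer": "computer", "desktop computer": "computer",
    "flash drive": "accessory", "mouse": "accessory",
}
# Deeper levels get smaller thresholds (reduced minimum support).
min_support = {1: 0.05, 2: 0.03}

transactions = [
    {"laptop computer"}, {"desktop computer", "mouse"},
    {"laptop computer", "flash drive"}, {"mouse"},
]

def level_support(concept, level):
    """Support of a category (level 1) or a leaf item (level 2)."""
    if level == 1:  # roll each transaction up to its categories first
        txs = [{hierarchy[i] for i in t} for t in transactions]
    else:
        txs = transactions
    return sum(concept in t for t in txs) / len(txs)

def is_frequent(concept, level):
    """Apply the threshold that belongs to the concept's own level."""
    return level_support(concept, level) >= min_support[level]

print(level_support("computer", 1))         # 3/4 = 0.75
print(level_support("laptop computer", 2))  # 2/4 = 0.5
```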

*********************************************************************
