
9. Frequent Pattern Mining with Associations

Lesson Introduction

Frequent patterns are patterns (such as itemsets, subsequences, or substructures) that appear in a data
set frequently. For example, a set of items, such as milk and bread, that appear frequently together in a
transaction data set is a frequent itemset. A subsequence, such as buying first a PC, then a digital camera,
and then a memory card, if it occurs frequently in a shopping history database, is a (frequent) sequential
pattern. A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices,
which may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a
(frequent) structured pattern. Finding such frequent patterns plays an essential role in mining associations,
correlations, and many other interesting relationships among data. Moreover, it helps in data
classification, clustering, and other data mining tasks as well. Thus, frequent pattern mining has become
an important data mining task and a focused theme in data mining research. In this lesson, we introduce
the concepts of frequent patterns and associations and study how they can be mined efficiently. This
lesson is dedicated to methods of frequent itemset mining.

Learning Outcomes:
After successful completion of this lesson, students will be able to find frequent itemsets in large
datasets using candidate generation.

Lesson Outline:
• What is a frequent itemset?
• How can we find frequent itemsets from large amounts of data?
• What is frequent pattern mining with candidate generation?

9.1 Motivating Scenario: Market Basket Analysis

Frequent itemset mining leads to the discovery of associations and correlations among items in large
transactional or relational data sets. With massive amounts of data continuously being collected and
stored, many industries are becoming interested in mining such patterns from their databases. The
discovery of interesting correlation relationships among huge amounts of business transaction records
can help in many business decision-making processes, such as catalog design, cross-marketing, and
customer shopping behavior analysis.

A typical example of frequent itemset mining is market basket analysis. This process analyzes customer
buying habits by finding associations between the different items that customers place in their “shopping
baskets” (Figure 9-1). The discovery of such associations can help retailers develop marketing strategies
by gaining insight into which items are frequently purchased together by customers. For instance, if
customers are buying milk, how likely are they to also buy bread (and what kind of bread) on the same
trip to the supermarket? Such information can lead to increased sales by helping retailers do selective
marketing and plan their shelf space.

Figure 9-1: Market basket analysis.

Let’s look at an example of how market basket analysis can be useful.

Suppose, as manager of a shopping mall, you would like to learn more about the buying habits of your
customers. Specifically, you wonder, “Which groups or sets of items are customers likely to purchase on
a given trip to the store?” To answer your question, market basket analysis may be performed on the
retail data of customer transactions at your store. You can then use the results to plan marketing or

advertising strategies, or in the design of a new catalog. For instance, market basket analysis may help
you design different store layouts. In one strategy, items that are frequently purchased together can be
placed in proximity in order to further encourage the sale of such items together. If customers who
purchase computers also tend to buy antivirus software at the same time, then placing the hardware
display close to the software display may help increase the sales of both items. Market basket analysis
can also help retailers plan which items to put on sale at reduced prices. If customers tend to purchase
computers and printers together, then having a sale on printers may encourage the sale of printers as well
as computers.
These patterns can be represented in the form of association rules. For example, the information that
customers who purchase computers also tend to buy antivirus software at the same time is represented
in the association rule below:

computer ⇒ antivirus software [support = 2%; confidence = 60%]

Rule support and confidence are two measures of rule interestingness. They respectively reflect the
usefulness and certainty of discovered rules. A support of 2% for the rule above means that
2% of all the transactions under analysis show that computer and antivirus software are purchased
together. A confidence of 60% means that 60% of the customers who purchased a computer also bought
the software. Typically, association rules are considered interesting if they satisfy both a minimum support
threshold and a minimum confidence threshold. Such thresholds can be set by users or domain experts.
Additional analysis can be performed to uncover interesting statistical correlations between associated
items.
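To make these two measures concrete, here is a minimal Python sketch (not part of the lesson itself; the transactions and item names are illustrative) that computes support and confidence directly from a list of transactions:

# Illustrative toy data, not from the lesson.
transactions = [
    {"computer", "antivirus software"},
    {"computer", "printer"},
    {"computer", "antivirus software", "printer"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # support(A ∪ B) / support(A) for the rule A ⇒ B.
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

a, b = {"computer"}, {"antivirus software"}
print(support(a | b, transactions))    # 0.5   -> rule support = 50%
print(confidence(a, b, transactions))  # 0.667 -> confidence ≈ 67%

Note that support is measured against all transactions, while confidence is conditional on the antecedent appearing.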

In general, association rule mining can be viewed as a two-step process:

• Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as
a predetermined minimum support count, min_sup.
• Generate strong association rules from the frequent itemsets: By definition, these rules must
satisfy minimum support and minimum confidence.

A major challenge in mining frequent itemsets from a large data set is the fact that such mining often
generates a huge number of itemsets satisfying the minimum support (min_sup) threshold, especially
when min_sup is set low. This is because if an itemset is frequent, each of its subsets is frequent as well.

9.2 Finding Frequent Itemsets Using Candidate Generation: The Apriori Algorithm

Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets
for Boolean association rules. The name of the algorithm is based on the fact that the algorithm uses prior
knowledge of frequent itemset properties, as we shall see in the following. Apriori employs an iterative
approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First, the set
of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and
collecting those items that satisfy minimum support. The resulting set is denoted L1. Next, L1 is used to find
L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets
can be found. The finding of each Lk requires one full scan of the database. To improve the efficiency of
the level-wise generation of frequent itemsets, a valuable property called the Apriori property, presented
below, is used to reduce the search space. We will first describe this property, and then show an example
illustrating its use.
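Before turning to the property itself, here is an illustrative Python sketch of the whole level-wise search (a sketch, not the lesson's own pseudocode). Transactions are assumed to be sets of items and min_sup an absolute support count; the prune step anticipates the Apriori property stated next:

from itertools import combinations

def apriori(transactions, min_sup):
    # First scan: count 1-itemsets and keep those meeting min_sup (this is L1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Join step: merge frequent k-itemsets into (k+1)-candidates.
        # (The classical algorithm joins only itemsets sharing a (k-1)-prefix;
        # this sketch simply unions all pairs, which is correct but slower.)
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step (Apriori property): drop any candidate having an
        # infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # One full scan of the database per level to count the survivors.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        Lk = {s: n for s, n in counts.items() if n >= min_sup}
        frequent.update(Lk)
        k += 1
    return frequent

# Illustrative call (e.g., on the Table 9-1 data with min_sup = 2):
# freq = apriori([{"I1","I2","I5"}, {"I2","I4"}, ...], min_sup=2)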

Apriori property: All nonempty subsets of a frequent itemset must also be frequent.

The Apriori property is based on the following observation. By definition, if an itemset I does not satisfy
the minimum support threshold, min_sup, then I is not frequent; that is, P(I) < min_sup. If an item A is
added to the itemset I, then the resulting itemset I ∪ A cannot occur more frequently than I. Therefore,
I ∪ A is not frequent either; that is, P(I ∪ A) < min_sup.
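To see this anti-monotonicity numerically, here is a tiny illustrative check (the transactions are made up for this sketch): adding an item to an itemset can never increase its support count.

transactions = [{"milk", "bread"}, {"milk"}, {"milk", "bread", "eggs"}]

def count(itemset):
    # Number of transactions containing every item in itemset.
    return sum(itemset <= t for t in transactions)

assert count({"milk", "bread"}) <= count({"milk"})                   # 2 <= 3
assert count({"milk", "bread", "eggs"}) <= count({"milk", "bread"})  # 1 <= 2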

Let us look at a simple example, shown in Table 9-1. There are nine transactions in this database; that is,
|D| = 9. We use Figure 9-2 to illustrate the Apriori algorithm for finding frequent itemsets in D.

Table 9-1: Transactional dataset

TID     Items
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3

• In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets,
C1. The algorithm scans all of the transactions in order to count the number of occurrences of
each item.
• Suppose that the minimum support count required is 2, that is, min_sup = 2. (Here, we are
referring to absolute support because we are using a support count. The corresponding relative
support is 2/9 = 22%.) The set of frequent 1-itemsets, L1, can then be determined. It consists of
the candidate 1-itemsets satisfying minimum support. In the example, all of the candidates in C1
satisfy minimum support.
• To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 × L1 to generate a
candidate set of 2-itemsets, C2. Note that no candidates are removed from C2 during the prune
step because each subset of the candidates is also frequent.
• Next, the transactions in D are scanned, and the support count of each candidate itemset in C2 is
accumulated, as shown in the middle table of the second row in Figure 9-2.
• The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets
in C2 having minimum support.
• The generation of the set of candidate 3-itemsets, C3, is detailed in Figure 9-2 and sketched in
code after this list. From the join step, we first get C3 = L2 × L2 = { {I1, I2, I3}, {I1, I2, I5},
{I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5} }. Based on the Apriori property that all subsets of a
frequent itemset must also be frequent, we can determine that the latter four candidates cannot
possibly be frequent. We therefore remove them from C3, thereby saving the effort of
unnecessarily obtaining their counts during the subsequent scan of D to determine L3. Note that,
given a candidate k-itemset, we only need to check whether its (k-1)-subsets are frequent, since
the Apriori algorithm uses a level-wise search strategy. The resulting pruned version of C3 is shown
in the first table of the bottom row of Figure 9-2.
• The transactions in D are scanned in order to determine L3, consisting of those candidate 3-
itemsets in C3 having minimum support (Figure 9-2).
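The join-and-prune step just traced can be sketched in Python as follows. The L2 itemsets are taken from this worked example; the helper name joinable is illustrative, not a standard API:

from itertools import combinations

# L2 from the worked example (support counts omitted for brevity).
L2 = {frozenset(p) for p in [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
                             ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]}

def joinable(a, b):
    # Classical Apriori join: sorted itemsets sharing all but their last item.
    a, b = sorted(a), sorted(b)
    return a[:-1] == b[:-1] and a[-1] < b[-1]

# Join step: produces the six candidates listed above.
C3 = {a | b for a in L2 for b in L2 if joinable(a, b)}

# Prune step (Apriori property): keep only candidates whose 2-subsets
# are all themselves in L2.
C3 = {c for c in C3 if all(frozenset(s) in L2 for s in combinations(c, 2))}

print(sorted(sorted(c) for c in C3))
# [['I1', 'I2', 'I3'], ['I1', 'I2', 'I5']]

Only {I1, I2, I3} and {I1, I2, I5} survive the prune, so only these two candidates need their counts accumulated in the final scan.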

Figure 9-2: Example for the Apriori algorithm.



The algorithm uses L3 × L3 to generate a candidate set of 4-itemsets, C4. Although the join results in
{ {I1, I2, I3, I5} }, this itemset is pruned because its subset {I2, I3, I5} is not frequent. Thus, C4 = ∅, and the
algorithm terminates, having found all of the frequent itemsets.

Suppose the data contain the frequent itemset l = {I1, I2, I5}. What are the association rules that can be
generated from l? The nonempty proper subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}. The
resulting association rules are as shown below, each listed with its confidence:

I1 ∧ I2 ⇒ I5, confidence = 2/4 = 50%

I1 ∧ I5 ⇒ I2, confidence = 2/2 = 100%

I2 ∧ I5 ⇒ I1, confidence = 2/2 = 100%

I1 ⇒ I2 ∧ I5, confidence = 2/6 = 33%

I2 ⇒ I1 ∧ I5, confidence = 2/7 = 29%

I5 ⇒ I1 ∧ I2, confidence = 2/2 = 100%

If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules above are
output, because these are the only ones generated that are strong. Note that, unlike conventional
classification rules, association rules can contain more than one conjunct in the right-hand side of the rule.
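This rule-generation step can be sketched in Python. The support counts below are the ones from this worked example (|D| = 9); the dictionary encoding and variable names are illustrative:

from itertools import combinations

# Support counts for l = {I1, I2, I5} and its nonempty subsets.
count = {frozenset(k): v for k, v in [
    (("I1",), 6), (("I2",), 7), (("I5",), 2),
    (("I1", "I2"), 4), (("I1", "I5"), 2), (("I2", "I5"), 2),
    (("I1", "I2", "I5"), 2),
]}

l = frozenset(("I1", "I2", "I5"))
min_conf = 0.70

# For each nonempty proper subset s of l, form the rule s => (l - s);
# its confidence is support_count(l) / support_count(s).
for r in range(1, len(l)):
    for s in map(frozenset, combinations(sorted(l), r)):
        conf = count[l] / count[s]
        verdict = "strong" if conf >= min_conf else "weak"
        print(sorted(s), "=>", sorted(l - s), f"confidence = {conf:.0%} ({verdict})")

Running this prints all six rules above and marks exactly the three 100%-confidence rules as strong under the 70% threshold.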
