
UNIT 9 MINING FREQUENT PATTERNS AND ASSOCIATIONS
Structure
9.0 Introduction
9.1 Objectives
9.2 Market Basket Analysis
9.3 Classification of Frequent Pattern Mining
9.4 Association Rule Mining and Related Concepts
9.5 Apriori Algorithm
9.6 Mining Multilevel Association Rules
9.7 Approaches for Mining Multilevel Association Rules
9.8 Mining Multidimensional Association Rules From Relational
Databases And Data Warehouses
9.9 Mining Quantitative Association Rules
9.10 From Association Mining To Correlation Analysis
9.11 Summary
9.12 Solutions / Answers
9.13 Further Readings

9.0 INTRODUCTION
In the earlier unit, we studied data preprocessing, data cleaning, data reduction, and other related concepts.
Data mining technology has emerged as a means for identifying patterns and trends
from large quantities of data. Data mining, also known as Knowledge Discovery
in Databases, has been defined as the nontrivial extraction of implicit, previously
unknown, and potentially useful information from data. Data mining is used to
extract structured knowledge automatically from large data sets. The information that is ‘mined’ is expressed as a model of the semantic structure of the dataset, and this model facilitates the prediction or classification of new data.
Descriptive mining and predictive mining are the two categories of data mining tasks. Descriptive mining depicts the essential characteristics or general properties of the data in the database; it involves tasks such as clustering, association, and sequential mining. Predictive mining deduces patterns from the data so that predictions can be made; it involves tasks such as classification, regression, and deviation detection.
Mining frequent itemsets from transaction databases is a fundamental task for several forms of knowledge discovery, such as association rules, sequential patterns, and classification. The subsets that occur frequently in a collection of sets of items are known as frequent itemsets, and they are typically used to generate association rules. The objective of frequent itemset mining is the identification of items that co-occur above a user-given frequency threshold in the transaction database.
In this unit we will study frequent itemset generation, association rule generation, the Apriori algorithm, and related topics.

9.1 OBJECTIVES
After completing this unit, you will be able to
• Discuss the elementary concepts of Frequent patterns
• Understand the concepts of Association Rule Mining, Basket Analysis,
Frequent pattern mining
• Understand and implement Apriori Algorithm for finding frequent item sets
using candidate generation
• Demonstrate mining multilevel association rules
• Discuss various approaches for mining multilevel and multidimensional
association rules.

9.2 MARKET BASKET ANALYSIS


Frequent patterns are patterns (such as itemsets, subsequences, or substructures)
that appear in a data set frequently. For example, a set of items, such as milk and
bread, that appear frequently together in a transaction data set is a frequent itemset.
Similarly, a mobile phone and its cover appearing together frequently in a transaction data set form a frequent itemset.
A subsequence, such as buying first a PC, then a digital camera, and then a memory
card, if it occurs frequently in a shopping history database, is a (frequent) sequential
pattern. Similarly, a new mobile phone, a screen guard, and an SD memory card bought in that order may form a frequent sequential pattern in a shopping history database.
A substructure can refer to different structural forms, such as subgraphs, subtrees, or
sublattices, which may be combined with itemsets or subsequences. If a substructure
occurs frequently, it is called a (frequent) structured pattern. Finding such frequent
patterns plays an essential role in mining associations, correlations, and many other
interesting relationships among data. Moreover, it helps in data classification,
clustering, and other data mining tasks. Thus, frequent pattern mining has become
an important data mining task and a focused theme in data mining. Frequent pattern
mining searches for recurring relationships in a given data set.
A typical example of frequent itemset mining is market basket analysis. This
process analyzes customer buying habits by finding associations between the
different items that customers place in their “shopping baskets” (Figure 1). The
discovery of these associations can help retailers develop marketing strategies by
gaining insight into which items are frequently purchased together by customers.
For instance, if customers are buying milk, how likely are they to also buy bread
(and what kind of bread) on the same trip to the supermarket? This information can
lead to increased sales by helping retailers do selective marketing and plan their
shelf space.


Figure 1: Market Basket Analysis

“Which groups or sets of items are customers likely to purchase on a given trip to the store?” To answer this question, market basket analysis may be performed on the retail data of customer transactions at your store. You can then use the results to plan marketing or advertising strategies, or to design a new catalog.
Example:
If customers who purchase computers also tend to buy anti-virus software at the
same time, then placing the hardware display close to the software display may
help increase the sales of both items. In an alternative strategy, placing hardware
and software at opposite ends of the store may entice customers who purchase
such items to pick up other items along the way. For instance, after deciding on
an expensive computer, a customer may observe security systems for sale while
heading toward the software display to purchase antivirus software and may decide
to purchase a home security system as well. Market basket analysis can also help
retailers plan which items to put on sale at reduced prices. If customers tend to
purchase computers and printers together, then having a sale on printers may
encourage the sale of printers as well as computers.
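To make this concrete, here is a minimal Python sketch that counts how often pairs of items land in the same basket; the baskets themselves are illustrative toy data, not taken from this unit.

from itertools import combinations
from collections import Counter

# Toy shopping baskets, purely illustrative.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"computer", "antivirus software"},
    {"milk", "bread"},
]

# Count how often each pair of items occurs in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))  # [(('bread', 'milk'), 3)]

Pairs with high counts are exactly the frequent 2-itemsets that the algorithms in this unit search for systematically.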

9.3 CLASSIFICATION OF FREQUENT PATTERN MINING


A frequent pattern is a pattern that appears frequently in a data set. By identifying frequent patterns we can observe strongly correlated items and easily identify shared characteristics and associations among them. Frequent pattern mining also leads to further analysis such as clustering, classification, and other data mining tasks. Frequent pattern mining can be classified in various ways, based on the following criteria:
(i) Based on the completeness of patterns to be mined:
• We can mine the complete set of frequent itemsets, the closed frequent itemsets, and the maximal frequent itemsets, given a minimum support threshold.
• We can also mine constrained frequent itemsets, approximate frequent itemsets, near-match frequent itemsets, top-k frequent itemsets, and so on.
(ii) Based on the levels of abstraction involved in the rule set:
Some methods for association rule mining can find rules at differing levels of
abstraction.
For example, suppose that a set of association rules mined includes the following
rules where X is a variable representing a customer:
buys(X, computer ) =>buys(X, HP printer)  (1)
buys(X, laptop computer ) =>buys(X, HP printer)  (2)
In rules (1) and (2), the items bought are referenced at different levels of abstraction (for example, computer is a higher-level abstraction of laptop computer).
(iii) Based on the number of data dimensions involved in the rule
• If the items or attributes in an association rule reference only one dimension,
then it is a single-dimensional association rule.
buys(X, computer )=>buys(X, antivirus software)
• If a rule references two or more dimensions, such as the dimensions age,
income, and buys, then it is a multidimensional association rule. The
following rule is an example of a multidimensional rule:
age(X, “30…39”) ^ income(X, “42K…48K”) => buys(X, high-resolution TV)
(iv) Based on the types of values handled in the rule:
• If a rule involves associations between the presence or absence of items, it
is a Boolean association rule.
• If a rule describes associations between quantitative items or attributes,
then it is a quantitative association rule.
(v) Based on the kinds of rules to be mined:
• Frequent pattern analysis can generate various kinds of rules and other
interesting relationships.
• Association rule mining can generate a large number of rules, many of
which are redundant or do not indicate a correlation relationship among
item sets.
• The discovered associations can be further analyzed to uncover statistical
correlations, leading to correlation rules.
(vi) Based on the kinds of patterns to be mined:
• Many kinds of frequent patterns can be mined from different kinds of data
sets.
• Sequential pattern mining searches for frequent subsequences in a sequence data set, where a sequence records an ordering of events.
• For example, with sequential pattern mining, we can study the order in which items are frequently purchased. For instance, customers may tend to first buy a PC, followed by a digital camera, and then a memory card.
• Structured pattern mining searches for frequent substructures in a structured
data set.
• Single items are the simplest form of structure.
• Each element of an itemset may contain a subsequence, a subtree, and so on.
• Therefore, structured pattern mining can be considered as the most general
form of frequent pattern mining.

9.4 ASSOCIATION RULE MINING AND RELATED CONCEPTS
An association rule consists of a set of items, the rule body, leading to another item (or set of items), the rule head. The association rule relates the rule body to the rule head, and carries the following characteristics:
• statistical information about the frequency of occurrence,
• reliability, and
• the importance of the relation.
An association rule has the form X => Y, where X is called the antecedent (left-hand side, LHS) and Y the consequent (right-hand side, RHS) of the rule. Every association rule X => Y is characterized by its support and confidence.
9.4.1 Association Rule Mining
One of the popular descriptive data mining techniques is Association rule mining
(ARM), owing to its extensive use in marketing and retail communities in addition
to many other diverse fields. Mining association rules is particularly useful for
discovering relationships among items from large databases. “Market basket analysis” (presented in Section 9.2 above), which studies the habits of customers, is the source of motivation behind ARM. The extraction of interesting correlations, frequent patterns, associations, or causal structures among sets of
items in the transaction databases or other data repositories is the main objective
of ARM. As the target of discovery is not pre-determined, it is possible to identify
all association rules that exist in the database. This feature of the association rules
can be said as its major strength. The development of marketing and placement
strategies in addition to the preparation of logistics for inventory management can
be greatly assisted by the discovery of association rules. The alignment of the data
mining process and algorithms with the extensive economic objectives of the tasks
supported by data mining is essential so as to permit the additional impact of data
mining on business applications. The ultimate economic utility obtained as the
outcome of the data mining product has the impact of all the diverse stages of the
data mining processes. It is important to consider the economic utility of acquiring
data, extracting a model, and applying the acquired knowledge. The evaluation
of the decisions made on the basis of the learned knowledge is influenced by the
economic utility. Economic measures, for example profitability and return on investment, have replaced simple assessment measures such as predictive accuracy.
The association rule mining process consists of two phases. In the first phase, all frequent itemsets are found in the data, that is, all itemsets whose support meets a minimum support threshold. In the second phase, association rules are generated from those frequent itemsets, keeping only the rules whose confidence meets a minimum confidence threshold.
The first phase must identify all frequent itemsets in the original data set. “Frequent” means that an itemset's frequency of occurrence reaches a certain level relative to all records; this frequency is called the support of the itemset. Taking a 2-itemset {A, B} as an example, we compute the fraction of transactions containing both A and B; if this support is greater than or equal to a user-set minimum support threshold, {A, B} is called a frequent itemset. A k-itemset that satisfies minimum support is called a frequent k-itemset, usually written Large k or Frequent k (Lk for short). The algorithm generates Large k+1 candidates from the Large k itemsets until no longer frequent itemsets can be found.
The second phase of association rule mining generates the association rules from the frequent itemsets found in the first phase. From each frequent k-itemset, candidate rules are formed, and a rule is kept only if its confidence meets a minimum confidence threshold. For example, the confidence of the rule A => B generated from the frequent 2-itemset {A, B} is supp({A, B}) / supp({A}); if this value is greater than or equal to the minimum confidence, A => B is reported as an association rule.
Association rule mining is generally applicable when the attributes in a record take discrete values. If the original attributes in the database are continuous, they should be discretized appropriately before mining, that is, each value is mapped to a value range. Data discretization is an important preprocessing step for data mining, and how reasonably it is performed directly affects the results of association rule mining.
Association Rule Mining Problem Definition
The problem of association rule mining is defined as:
Let I = {i1, i2, i3, …, in} be a set of binary attributes called items.
Let D = {t1, t2, t3, …, tm} be a set of transactions called the database.
Each transaction in D has a unique transaction ID and contains a subset of the items
in I.
A rule is defined as an implication of the form X => Y
Where X, Y ⊆ I and X ∩ Y = ϕ
The itemsets X and Y are called the antecedent (left-hand side, LHS) and the consequent (right-hand side, RHS) of the rule, respectively.
Given the transaction database D, where each transaction t is a non-empty subset of I and corresponds to a unique identifier TID (transaction ID), the support of an association rule X => Y in D is the percentage of transactions in D containing both X and Y, that is, the probability P(X ∪ Y). Its confidence is the percentage of transactions in D containing X that also contain Y, that is, the conditional probability P(Y | X). If both the minimum support threshold and the minimum confidence threshold are met, the association rule is considered interesting. These thresholds are set manually for the mining task at hand.
9.4.2 Some Important Concepts
Itemset
A set of items together is called an itemset.
For example, bread and butter, or a laptop and antivirus software, are itemsets.
k-Itemset
An itemset is just a collection or set of items; a k-itemset is a collection of k items. For example, a 2-itemset can be {egg, milk} or {milk, bread}; similarly, a 3-itemset can be {egg, milk, bread}.
Frequent Itemset
An itemset is said to be a frequent itemset when it meets a minimum support threshold. For example, if the minimum support is 70% and the support of the 2-itemset {milk, cheese} is 60%, then {milk, cheese} is not a frequent itemset.
The minimum support threshold is something that has to be decided by a domain expert or subject-matter expert (SME).
Support Count
Support count is the number of transactions, out of all the transactions under consideration, in which an itemset appears. It can also be expressed as a percentage. A support of 65% for {milk, bread} means that milk and bread appeared together in 65 out of 100 transactions.
Mathematically, the support of a rule can be denoted as:
support(A => B) = P(A ∪ B)
This means that the support of the rule “A and B occur together” is equal to the probability of A union B, i.e., the fraction of transactions containing both A and B.
Support
A transaction supports an association rule if the transaction contains the rule
body and the rule head. The rule support is the ratio of transactions supporting
the association rule and the total number of transactions within your database of
transactions.
In the running example database of five transactions (reconstructed in the sketch at the end of this subsection), the itemset {milk, bread, butter} has a support of 1/5 = 0.2, since it occurs in 20% of all transactions (1 out of 5 transactions).
Confidence
The confidence of an association rule is its strength or reliability. The confidence is
defined as the percentage of transactions supporting the rule out of all transactions
supporting the rule body. A transaction supports the rule body if it contains all the items of the rule body. The confidence of a rule is defined as:
conf(X => Y) = supp(X ∪ Y) / supp(X)
For example, the rule {butter, bread} => {milk} has a confidence of 0.2/0.2 = 1.0 in the example database, which means that the rule is correct for 100% of the transactions containing butter and bread (100% of the times a customer buys butter and bread, milk is bought as well). Confidence can be interpreted as an estimate of the probability P(Y | X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.
Lift
The lift value of an association rule is the factor by which the confidence exceeds the expected confidence. It is determined by dividing the confidence of the rule by the support of the rule head:
lift(X => Y) = supp(X ∪ Y) / (supp(X) * supp(Y))
that is, the ratio of the observed support to that expected if X and Y were independent.
The rule {milk, bread} => {butter} has a lift of 0.2 / (0.4 × 0.4) = 1.25.
Conviction
The conviction of a rule is defined as:
conv(X => Y) = (1 − supp(Y)) / (1 − conf(X => Y))
The rule {milk, bread} => {butter} has a conviction of (1 − 0.4) / (1 − 0.5) = 1.2, and can be interpreted as the ratio of the expected frequency that X occurs without Y if X and Y were independent (that is, the frequency with which the rule would make an incorrect prediction) to the observed frequency of incorrect predictions.
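The five-transaction example database quoted above is not reproduced in this unit, so the following Python sketch uses one possible database that is consistent with all of the numbers given (an assumption, not the original table) and computes the four measures:

# One possible five-transaction database consistent with the
# numbers quoted in the text (assumed; the original table is
# not reproduced in this unit).
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"butter"},
    {"milk"},
    {"bread"},
]

def support(itemset):
    # Fraction of transactions containing every item in `itemset`.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # conf(X => Y) = supp(X ∪ Y) / supp(X)
    return support(set(lhs) | set(rhs)) / support(lhs)

def lift(lhs, rhs):
    # lift(X => Y) = supp(X ∪ Y) / (supp(X) * supp(Y))
    return support(set(lhs) | set(rhs)) / (support(lhs) * support(rhs))

def conviction(lhs, rhs):
    # conv(X => Y) = (1 - supp(Y)) / (1 - conf(X => Y))
    return (1 - support(rhs)) / (1 - confidence(lhs, rhs))

print(support({"milk", "bread", "butter"}))       # 0.2
print(confidence({"butter", "bread"}, {"milk"}))  # 1.0
print(lift({"milk", "bread"}, {"butter"}))        # 1.25
print(conviction({"milk", "bread"}, {"butter"}))  # 1.2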
9.4.3 Applications of Association Rule Mining
Association rule mining is used in different areas, as it is very advantageous. Some
of the fields that have adopted association rule mining are discussed below:
Market Basket Analysis: One of the best and most typical examples of association rule mining is market basket analysis, for example in supermarket stores. Managers of these stores want to increase the interest of existing customers and to attract new ones. Such stores maintain large databases holding a huge number of transactional records, so managers may be interested to know whether some items are consistently purchased together. The main aim is to analyze the buying behavior of the customers. Association rule mining is used to generate rules that reveal the frequently occurring itemsets in a store. For example, if most customers who buy bread also buy milk, it will be beneficial for managers to place milk near the bread. Thus, association rules help in designing the layout of a store. A market basket can be defined as the combination of different items purchased together by a customer in a single transaction.
CRM of the Credit Card Business: Customer Relationship Management (CRM) is a system used by a bank to manage its relationships with customers. Banks usually study the behavior of their customers to find out their likings and interests. In this way, banks can increase the engagement between credit card customers and the bank. Here, association rules help the banks to know their customers better and to provide them good quality services.
Medical Diagnosis: Association rules are also used in the field of medical sciences, where they help physicians treat patients. Association rules can be used to estimate the probability of co-occurrence between symptoms and diseases. Since a disease needs a proper explanation, diagnosis in itself is not an easy task; relating newly observed symptoms to an existing disease, and then finding the relationships between the symptoms, helps the physicians. For example, a physician examining a patient requires all the information about the patient before taking a well-founded decision. Here association rules help, as association rule mining is one of the most important research techniques of data mining.
Census Data: Association rule mining has great potential on census data. Association rules help in supporting good public policy and in business development. A census generates huge amounts of statistical information related to the economy and the population, which can be used in planning public services and businesses. In public services, this includes health, education, transport, and so on; in business, it can guide the siting of new malls, factories, and banks.
Protein Sequencing: Proteins are basic constituents of the cells of any organism. Many DNA technologies, with different tools, are available for the fast determination of DNA sequences. Proteins are sequences made up of about 20 types of amino acids, and every protein has a three-dimensional structure that depends on its amino acid sequence; protein function therefore depends heavily on that sequence. Association rules can be generated between the different amino acids of a protein, which helps in understanding protein composition.
You might have realized that finding frequent itemsets from hundreds of thousands of transactions that may involve hundreds of items is not an easy task. We need a smart technique to compute frequent itemsets efficiently, and the Apriori algorithm is one such technique, which we will discuss in the next section.

9.5 THE APRIORI ALGORITHM


Apriori is one of the most common algorithms in data mining. It is used to identify the most frequently occurring itemsets and meaningful associations in a dataset; for example, the products placed in shopping baskets by a shop's customers may all be used as inputs to the algorithm. Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules.
• The name of the algorithm is based on the fact that the algorithm uses prior
knowledge of frequent itemset properties.
• Apriori employs an iterative approach known as a level-wise search, where
k-item sets are used to explore (k+1) itemsets.
• First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item and collecting those items that satisfy minimum support. The resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found.
• The finding of each Lk requires one full scan of the database.
• A two-step process is followed in Apriori, consisting of a join action and a prune action.
The pseudo code of Apriori Algorithm is as follows:
Apriori Algorithm: Find frequent itemsets using an iterative level-wise approach
based on candidate generation
Input: D, a database of transactions;
min_sup, the minimum support count threshold.
Output: L, frequent itemsets in D.
Method:
(1) L1 = find_frequent_1_itemsets(D); // find the set L1 of frequent 1-itemsets
(2) for (k = 2; Lk-1 ≠ ∅; k++) { // generate candidates and prune
(3) Ck = apriori_gen(Lk-1);
(4) for each transaction t ∈ D { // scan D for candidate counts
(5) Ct = subset(Ck, t); // get the subsets of t that are candidates
(6) for each candidate c ∈ Ct
(7) c.count++; // increment support count
(8) }
(9) Lk = {c ∈ Ck | c.count ≥ min_sup} // keep the candidates that meet minimum support
(10) }
(11) return L = ∪k Lk; // all frequent itemsets
Step 1: Join
Procedure apriori_gen(Lk-1: frequent (k-1)-itemsets)
1) for each itemset l1 ∈ Lk-1
2) for each itemset l2 ∈ Lk-1
3) if (l1[1] = l2[1]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) then {
4) c = l1 ⨝ l2; // join step: generate a candidate by joining l1 and l2
5) if has_infrequent_subset(c, Lk-1) then
6) delete c; // prune step: delete candidates with an infrequent subset
7) else add c to Ck;
8) }
9) return Ck;

Step 2: Prune
Procedure has_infrequent_subset(c: candidate k-itemset; Lk-1: frequent (k-1)-itemsets); // use prior knowledge
1) for each (k-1)-subset s of c
2) if s ∉ Lk-1 then
3) return TRUE;
4) return FALSE;
Example:
TID List of item IDs
T100 I1, I2, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3
T700 I1, I3
T800 I1, I2, I3, I5
T900 I1, I2, I3
There are nine transactions in this database, that is, |D| = 9.
Steps:
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min_sup = 2. The set of frequent 1-itemsets, L1, can then be determined; it consists of the candidate 1-itemsets satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⨝ L1 to generate a candidate set of 2-itemsets, C2. No candidates are removed from C2 during the prune step, because each subset of the candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each
candidate itemset in C2 is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
6. The set of candidate 3-itemsets, C3, is generated next. From the join step, we first get C3 = L2 ⨝ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent, so they are pruned.
7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
8. The algorithm uses L3 ⨝ L3 to generate a candidate set of 4-itemsets, C4; the only candidate, {I1, I2, I3, I5}, is pruned because its subset {I1, I3, I5} is not frequent, and the algorithm terminates.
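The walkthrough above can be reproduced with a compact, runnable Python sketch. Note one simplification: instead of the lexicographic join of the pseudocode, candidates are generated as unions of pairs of frequent (k-1)-itemsets; after the prune step the resulting candidate sets are the same.

from itertools import combinations

# The nine transactions from the example table above.
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def apriori(transactions, min_sup=2):
    # Returns a dict mapping each frequent itemset to its support count.
    counts = {}
    for t in transactions:                      # first full scan: C1
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_sup}   # L1
    frequent = dict(Lk)
    k = 2
    while Lk:
        # Join: unite pairs of frequent (k-1)-itemsets whose union has size k.
        prev = list(Lk)
        candidates = {prev[i] | prev[j]
                      for i in range(len(prev))
                      for j in range(i + 1, len(prev))
                      if len(prev[i] | prev[j]) == k}
        # Prune: drop candidates having any infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk
                             for s in combinations(c, k - 1))}
        counts = {c: 0 for c in candidates}
        for t in transactions:                  # one full scan per level
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(Lk)
        k += 1
    return frequent

for itemset, count in sorted(apriori(transactions).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)

The largest frequent itemsets printed are {I1, I2, I3} and {I1, I2, I5}, each with support count 2, matching the walkthrough above.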


Figure 2: Working of Apriori Algorithm

9.6 MINING MULTILEVEL ASSOCIATION RULES


For many applications, it is difficult to find strong associations among data items at
low or primitive levels of abstraction due to the sparsity of data at those levels. Strong
associations discovered at high levels of abstraction may represent commonsense
knowledge. Therefore, data mining systems should provide capabilities for mining
association rules at multiple levels of abstraction, with sufficient flexibility for
easy traversal among different abstraction spaces. Association rules generated from
mining data at multiple levels of abstraction are called multiple-level or multilevel
association rules.
Multilevel association rules can be mined efficiently using concept hierarchies
under a support-confidence framework. In general, a top-down strategy is
employed, where counts are accumulated for the calculation of frequent itemsets
at each concept level, starting at the concept level 1 and working downward in the
hierarchy toward the more specific concept levels, until no more frequent itemsets
can be found. A concept hierarchy defines a sequence of mappings from a set of
low-level concepts to higher level, more general concepts. Data can be generalized
by replacing low-level concepts within the data by their higher-level concepts, or
ancestors, from a concept hierarchy.
Consider, for example, a concept hierarchy for computer items with five levels, referred to as levels 0 to 4, starting with level 0 at the root node all:
• Level 1 includes computer, software, printer & camera, and computer accessory.
• Level 2 includes laptop computer, desktop computer, office software, and antivirus software.
• Level 3 includes IBM desktop computer, ..., Microsoft office software, and so on.
• Level 4 is the most specific abstraction level of this hierarchy.
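A minimal Python sketch of the generalization step, assuming a toy two-level hierarchy (the item names are illustrative), shows how climbing the hierarchy can turn pairs that are individually infrequent at a low level into a frequent pair at a higher level:

# Toy two-level concept hierarchy (illustrative assumption).
hierarchy = {
    "laptop computer": "computer",
    "desktop computer": "computer",
    "office software": "software",
    "antivirus software": "software",
}

transactions = [
    {"laptop computer", "antivirus software"},
    {"desktop computer", "office software"},
    {"laptop computer", "office software"},
]

def generalize(transaction, hierarchy):
    # Replace each item by its parent concept, if it has one.
    return {hierarchy.get(item, item) for item in transaction}

level1 = [generalize(t, hierarchy) for t in transactions]
# At level 1 the pair {computer, software} occurs in all three
# transactions, although no single level-2 pair occurs more than once.
print(sum({"computer", "software"} <= t for t in level1))  # 3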

9.7 APPROACHES FOR MINING MULTILEVEL ASSOCIATION RULES
Following are the approaches for mining multilevel association rules:
9.7.1 Uniform Minimum Support
• The same minimum support threshold is used when mining at each level of
abstraction.
• When a uniform minimum support threshold is used, the search procedure
is simplified.
• The method is also simple in that users are required to specify only one
minimum support threshold.
• The uniform support approach, however, has some difficulties. It is unlikely
that items at lower levels of abstraction will occur as frequently as those at
higher levels of abstraction.
• If the minimum support threshold is set too high, it could miss some
meaningful associations occurring at low abstraction levels. If the threshold
is set too low, it may generate many uninteresting associations occurring at
high abstraction levels.

Figure 3: Multilevel Mining with Uniform Support

9.7.2 Reduced Minimum Support


• Each level of abstraction has its own minimum support threshold.
• The deeper the level of abstraction, the smaller the corresponding threshold is.
• For example, the minimum support thresholds for levels 1 and 2 may be 5% and 3%, respectively. In this way, computer, laptop computer, and desktop computer are all considered frequent.


Figure 4: Multilevel Mining with Reduced Minimum Support

9.7.3 Group-Based Minimum Support


• Because users or experts often have insight as to which groups are more important than others, it is sometimes more desirable to set up user-specific, item-based, or group-based minimum support thresholds when mining multilevel rules.
• For example, a user could set up the minimum support thresholds based
on product price, or on items of interest, such as by setting particularly
low support thresholds for laptop computers and flash drives in order to
pay particular attention to the association patterns containing items in these
categories.

9.8 MINING MULTIDIMENSIONAL ASSOCIATION RULES FROM RELATIONAL DATABASES AND DATA WAREHOUSES
A single-dimensional or intradimensional association rule contains a single distinct predicate (e.g., buys) with multiple occurrences, i.e., the predicate occurs more than once within the rule:
buys(X, digital camera) => buys(X, HP printer)
Association rules that involve two or more dimensions or predicates can be referred to as multidimensional association rules:
age(X, “20…29”) ^ occupation(X, “student”) => buys(X, “laptop”)
The above rule contains three predicates (age, occupation, and buys), each of which occurs only once in the rule; hence, we say that it has no repeated predicates. Multidimensional association rules with no repeated predicates are called interdimensional association rules. We can also mine multidimensional association rules with repeated predicates, which contain multiple occurrences of some predicates. These rules are called hybrid-dimensional association rules. An example of such a rule is the following, where the predicate buys is repeated:
age(X, “20…29”) ^ buys(X, laptop) => buys(X, HP printer)

9.9 MINING QUANTITATIVE ASSOCIATION RULES


Quantitative association rules are multidimensional association rules in which the numeric attributes are dynamically discretized during the mining process so as to satisfy some mining criteria, such as maximizing the confidence or compactness of the rules mined.
In this section, we focus specifically on how to mine quantitative association
rules having two quantitative attributes on the left-hand side of the rule and one
categorical attribute on the right-hand side of the rule. That is
Aquan1 ^ Aquan2 => Acat
where Aquan1 and Aquan2 are tests on quantitative attribute intervals, and Acat tests a categorical attribute from the task-relevant data.
Such rules have been referred to as two-dimensional quantitative association rules,
because they contain two quantitative dimensions.
For instance, suppose you are curious about the association relationship between
pairs of quantitative attributes, like customer age and income, and the type of
television (such as high-definition TV, i.e., HDTV) that customers like to buy.
An example of such a 2-D quantitative association rule is
age(X, 30…39)^income(X, 42K…48K)=>buys(X, HDTV)
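In practice, the intervals are chosen dynamically by the mining algorithm to satisfy the chosen criteria. As a simpler stand-in, the sketch below applies static fixed-width binning to illustrative customer records (the records and bin widths are assumptions) to produce rule predicates of the above form:

# Illustrative customer records (assumed data).
customers = [
    {"age": 34, "income": 45000, "buys": "HDTV"},
    {"age": 37, "income": 44000, "buys": "HDTV"},
    {"age": 52, "income": 80000, "buys": "projector"},
]

def bin_value(value, width):
    # Map a number to the label of its fixed-width interval.
    low = (value // width) * width
    return f"{low}...{low + width - 1}"

for c in customers:
    lhs = (f"age(X, {bin_value(c['age'], 10)})",
           f"income(X, {bin_value(c['income'], 7000)})")
    print(" ^ ".join(lhs), f"=> buys(X, {c['buys']})")
# e.g. age(X, 30...39) ^ income(X, 42000...48999) => buys(X, HDTV)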

9.10 FROM ASSOCIATION MINING TO CORRELATION ANALYSIS
A correlation measure can be used to augment the support-confidence framework
for association rules. This leads to correlation rules of the form
A=>B [support, confidence, correlation]
That is, a correlation rule is measured not only by its support and confidence but also
by the correlation between itemsets A and B. There are many different correlation
measures from which to choose. In this section, we study various correlation
measures to determine which would be good for mining large data sets.
Lift is a simple correlation measure that is given as follows. The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated as events. This definition can easily be extended to more than two itemsets. The lift between the occurrence of A and B can be measured by computing:
lift(A, B) = P(A ∪ B) / (P(A)P(B))
• If lift(A, B) is less than 1, then the occurrence of A is negatively correlated with the occurrence of B.
• If the resulting value is greater than 1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other.
• If the resulting value is equal to 1, then A and B are independent and there is no correlation between them.
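Reusing the numbers from the {milk, bread} => {butter} example of Section 9.4.2, a short sketch computes lift and classifies the correlation:

def lift(p_a, p_b, p_ab):
    # lift(A, B) = P(A ∪ B) / (P(A) * P(B))
    return p_ab / (p_a * p_b)

# P({milk, bread}) = 0.4, P({butter}) = 0.4, P(both) = 0.2
value = lift(p_a=0.4, p_b=0.4, p_ab=0.2)
if value > 1:
    print(f"lift = {value:.2f}: positively correlated")
elif value < 1:
    print(f"lift = {value:.2f}: negatively correlated")
else:
    print("lift = 1: independent")
# Prints: lift = 1.25: positively correlated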
Check Your Progress 1:
1. Discuss Association Rule and Association Rule Mining. What are the applications of Association Rule Mining?
……………………………………………………………………………
……………………………………………………………………………
…...…………………………………………………………………………
2. Explain support, confidence and lift with an example.
……………………………………………………………………………
……………………………………………………………………………
…...…………………………………………………………………………
3. List the advantages and disadvantages of Apriori Algorithm.
……………………………………………………………………………
……………………………………………………………………………
…...…………………………………………………………………………
4. Explore the methods to improve Apriori efficiency.
……………………………………………………………………………
……………………………………………………………………………
…...…………………………………………………………………………

9.11 SUMMARY
In this unit we studied concepts such as itemsets, frequent itemsets, market basket analysis, frequent itemset mining, association rules, association rule mining, and the Apriori algorithm, along with some advanced concepts.

9.12 SOLUTIONS/ANSWERS
Check Your Progress 1:
1. Association rule finds interesting association or correlation relationships
among a large set of data items which is used for decision-making processes.
Association rules analyzes buying patterns that are frequently associated or
purchased together.
Association rule mining finds interesting associations and correlation
relationships among large sets of data items. Association rules show
attribute value conditions that occur frequently together in a given data set.
A typical example of association rule mining is Market Basket Analysis.
Data is collected using bar-code scanners in supermarkets. Such market basket
databases consist of a large number of transaction records. Each record lists all
items bought by a customer on a single purchase transaction. Managers would
be interested to know if certain groups of items are consistently purchased
together. They could use this data for adjusting store layouts (placing items
optimally with respect to each other), for cross-selling, for promotions, for
catalog design, and to identify customer segments based on buying patterns.
Association rules provide information of this type in the form of if-then
statements. These rules are computed from the data and, unlike the if-then
rules of logic, association rules are probabilistic in nature.
In addition to the antecedent (if) and the consequent (then), an association rule has two numbers that express the degree of uncertainty about the rule. In association analysis, the antecedent and consequent are sets of items (called itemsets) that are disjoint (do not have any items in common).
The applications of Association Rule Mining are Basket data analysis, cross-
marketing, catalog design, loss-leader analysis, clustering, classification,
etc.
2. Support: The support is simply the number of transactions that include all items in the antecedent and consequent parts of the rule. The support is sometimes expressed as a percentage of the total number of records in the database. Equivalently, support s is the percentage of transactions in D that contain A ∪ B:
Support(A => B) = P(A ∪ B)
Confidence: Confidence is the ratio of the number of transactions that include all items in the consequent as well as the antecedent (i.e., the support) to the number of transactions that include all items in the antecedent. Equivalently, confidence c is the percentage of transactions in D containing A that also contain B:
Confidence(A => B) = P(B | A)
For example, if a supermarket database has 100,000 point-of-sale
transactions, out of which 2,000 include both items A and B, and 800 of these
include item C, the association rule "If A and B are purchased, then C is
purchased on the same trip," has a support of 800 transactions (alternatively
0.8% = 800/100,000), and a confidence of 40% (=800/2,000). One way
to think of support is that it is the probability that a randomly selected
transaction from the database will contain all items in the antecedent and
the consequent, whereas the confidence is the conditional probability that a
randomly selected transaction will include all the items in the consequent,
given that the transaction includes all the items in the antecedent.
Lift is one more parameter of interest in association analysis. Lift is the ratio of confidence to expected confidence. Using the above example, expected confidence here means “the confidence if buying A and B did not enhance the probability of buying C”: it is the number of transactions that include the consequent divided by the total number of transactions. Suppose the total number of transactions containing C is 5,000. The expected confidence is then 5,000/100,000 = 5%, and for the supermarket example the lift = confidence / expected confidence = 40%/5% = 8. Hence, lift is a value that tells us how much the probability of the then part (consequent) increases given the if part (antecedent).
A lift ratio larger than 1.0 implies that the relationship between the antecedent and the consequent is more significant than would be expected if the two sets were independent. The larger the lift ratio, the more significant the association.
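The arithmetic of the supermarket example can be checked in a few lines of Python:

# Checking the numbers of the supermarket example above.
total = 100000   # point-of-sale transactions
ab = 2000        # transactions containing both A and B
abc = 800        # of those, transactions also containing C
c = 5000         # transactions containing C

support = abc / total             # 0.008 -> 0.8%
confidence = abc / ab             # 0.40  -> 40%
expected_confidence = c / total   # 0.05  -> 5%
lift = confidence / expected_confidence
print(support, confidence, lift)  # 0.008 0.4 8.0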
3. Advantages of the Apriori algorithm are:
• It is easy to implement.
• It implements a level-wise search.
• The join and prune steps are easy to implement on large itemsets in large databases.
Disadvantages of the Apriori algorithm are:
• A full scan of the whole database is required on each pass.
• It needs a large search space, and the cost of I/O is increased.
• It requires heavy computation if the itemsets are very large and the minimum support is kept very low.
4. Following are some of the methods to improve Apriori efficiency:
Hash-Based Technique: This method uses a hash-based structure called a
hash table for generating the k-itemsets and its corresponding count. It uses
a hash function for generating the table.
Transaction Reduction: This method reduces the number of transactions scanned in successive iterations. Transactions that do not contain any frequent items are marked or removed.
Partitioning: This method requires only two database scans to mine the
frequent itemsets. It says that for any itemset to be potentially frequent in
the database, it should be frequent in at least one of the partitions of the
database.
Sampling: This method picks a random sample S from database D and then searches for frequent itemsets in S. Some globally frequent itemsets may be missed; this risk can be reduced by lowering min_sup when mining the sample.
Dynamic Itemset Counting: This technique can add new candidate itemsets
at any marked start point of the database during the scanning of the database.
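As one illustration, the transaction-reduction idea can be sketched as a filter applied between passes; the reasoning is that a transaction containing no frequent k-itemset cannot contain any frequent (k+1)-itemset either:

def reduce_transactions(transactions, frequent_k):
    # Keep only transactions that still contain at least one frequent
    # k-itemset; the rest cannot contribute to any longer itemset.
    return [t for t in transactions
            if any(s <= t for s in frequent_k)]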

9.13 FURTHER READINGS


1. Data Mining: Concepts and Techniques, 3rd Edition, Jiawei Han, Micheline
Kamber, Jian Pei, Elsevier, 2012.
2. Data Mining, Charu C. Aggarwal, Springer, 2015.
3. Data Mining and Data Warehousing – Principles and Practical Techniques,
Parteek Bhatia, Cambridge University Press, 2019.
4. Introduction to Data Mining, Pang Ning Tan, Michael Steinbach, Anuj
Karpatne, Vipin Kumar, Pearson, 2018.
5. Data Mining Techniques and Applications: An Introduction, Hongbo Du,
Cengage Learning, 2013.
6. Data Mining : Vikram Pudi and P. Radha Krishna, Oxford, 2009.
7. Data Mining and Analysis – Fundamental Concepts and Algorithms, Mohammed J. Zaki, Wagner Meira, Jr., Oxford, 2014.
