MAHARANA PRATAP GROUP OF INSTITUTIONS
KOTHI MANDHANA, KANPUR
(Approved by AICTE, New Delhi and Affiliated to Dr. AKTU, Lucknow)
Digital Notes
[Department of Electronics Engineering]
Subject Name : Data Analytics
Subject Code : KCS-051
Course : B. Tech
Branch : CSE
Semester : V
Prepared by : Mr. Anand Prakash Dwivedi
Unit – 4
Frequent Itemsets and Clustering:
Mining Frequent Patterns, Association and Correlations: Basic Concepts
and Methods
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.)
that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of
frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
What products were often purchased together?— Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
Freq. pattern: An intrinsic and important property of datasets
Foundation for many essential data mining tasks
Association, correlation, and causality analysis
Sequential, structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia, time-series, and stream
data
Classification: discriminative, frequent pattern analysis
Cluster analysis: frequent pattern-based clustering
Data warehousing: iceberg cube and cube-gradient
Semantic data compression: fascicles
Broad applications
Introduction to Market Basket Analysis
• Market Basket Analysis (Association Analysis) is a mathematical modeling
technique based upon the theory that if you buy a certain group of items, you
are likely to buy another group of items.
• It is used to analyze customer purchasing behavior and helps increase sales and
maintain inventory by focusing on point-of-sale transaction data.
• Consider a shopping cart filled with several items.
• Market basket analysis tries to answer the following questions:
• Who makes purchases?
• What items tend to be purchased together?
• obvious: steak-potatoes; beer-pretzels
• What items are purchased sequentially?
• obvious: house-furniture; car-tires
• In what order do customers purchase items?
• What items tend to be purchased by season?
• It is also about what customers do not purchase, and why.
• If customers purchase baking powder, but no flour, what are they
baking?
• If customers purchase a mobile phone, but no case, are you missing
an opportunity?
• Categorize customer purchase behavior
• identify actionable information
• purchase profiles
• profitability of each purchase profile
• use for marketing
• layout or catalogs
• select products for promotion
• space allocation, product placement
Market Basket Benefits
• Selection of promotions, merchandising strategy
sensitive to price: Italian entrees, pizza, pies, Oriental entrees,
orange juice
• Uncover consumer spending patterns
correlations: orange juice & waffles
• Joint promotional opportunities
• A database of customer transactions
• Each transaction is a set of items
• Example: Transaction with TID 111 contains items {Pen, Ink, Milk, Juice}
TID CID Date Item Qty
111 201 5/1/99 Pen 2
111 201 5/1/99 Ink 1
111 201 5/1/99 Milk 3
111 201 5/1/99 Juice 6
112 105 6/3/99 Pen 1
112 105 6/3/99 Ink 1
112 105 6/3/99 Milk 1
113 106 6/5/99 Pen 1
113 106 6/5/99 Milk 1
114 201 7/1/99 Pen 2
114 201 7/1/99 Ink 2
114 201 7/1/99 Juice 4
Co-occurrences
• 80% of all customers purchase items X, Y and Z together.
Association rules
• 60% of all customers who purchase X and Y also buy Z.
Sequential patterns
• 60% of customers who first buy X also purchase Y within three weeks.
Association rule mining
• Proposed by Agrawal et al. in 1993.
• It is an important data mining model studied extensively by the database and
data mining community.
• Assume all data are categorical.
• No good algorithm for numeric data.
• Initially used for Market Basket Analysis to find how items purchased by
customers are related.
Bread → Milk [sup = 5%, conf = 100%]
• Motivation: finding regularities in data
• What products were often purchased together? — Beer and diapers
• What are the subsequent purchases after buying a PC?
• What kinds of DNA are sensitive to this new drug?
• Can we automatically classify web documents?
• Given a set of transactions, find rules that will predict the occurrence of an item
based on the occurrences of other items in the transaction.
• Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Example of Association Rules
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}
Basket Data
Retail organizations, e.g., supermarkets, collect and store massive amounts of sales data,
called basket data.
A record consists of
transaction date
items bought
Or, basket data may consist of items bought by a customer over a period.
• Items frequently purchased together:
Bread → PeanutButter
Example Association Rule
90% of transactions that purchase bread and butter also purchase milk
“IF” part = antecedent
“THEN” part = consequent
“Item set” = the items (e.g., products) comprising the antecedent or consequent
• Antecedent and consequent are disjoint (i.e., have no items in common)
Antecedent: bread and butter
Consequent: milk
Confidence factor: 90%
Transaction data: supermarket data
• Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
• Concepts:
• An item: an item/article in a basket
• I: the set of all items sold in the store
• A transaction: items purchased in a basket; it may have TID (transaction
ID)
• A transactional dataset: A set of transactions
Definition: Frequent Itemset
• Itemset
• A collection of one or more items
• Example: {Milk, Bread, Diaper}
• k-itemset
• An itemset that contains k items
• Support count (σ)
• Frequency of occurrence of an itemset
• E.g. σ({Milk, Bread, Diaper}) = 2
• Support
• Fraction of transactions that contain an itemset
• E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
• An itemset whose support is greater than or equal to a minsup threshold
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
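The support counts in this example can be reproduced with a small, self-contained Python sketch. This is our own illustration, not part of the original notes; the function names and the minsup value are assumptions:

from itertools import combinations

# The five market-basket transactions from the table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    # sigma(X): number of transactions that contain every item of X.
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets(transactions, k, minsup):
    # All k-itemsets whose support (fraction of transactions) is >= minsup.
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for candidate in combinations(items, k):
        s = support_count(set(candidate), transactions) / n
        if s >= minsup:
            result[candidate] = s
    return result

print(support_count({"Milk", "Bread", "Diaper"}, transactions))   # 2
print(frequent_itemsets(transactions, 3, minsup=0.4))
# includes ('Bread', 'Diaper', 'Milk') with support 2/5 = 0.4, among others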
The model: data
• I = {i1, i2, …, im}: a set of items.
• Transaction t:
• t is a set of items, and t ⊆ I.
• Transaction Database T: a set of transactions T = {t1, t2, …, tn}.
• I: itemset
{cucumber, parsley, onion, tomato, salt, bread, olives, cheese, butter}
• T: set of transactions
1 {{cucumber, parsley, onion, tomato, salt, bread},
2 {tomato, cucumber, parsley},
3 {tomato, cucumber, olives, onion, parsley},
4 {tomato, cucumber, onion, bread},
5 {tomato, salt, onion},
6 {bread, cheese}
7 {tomato, cheese, cucumber}
8 {bread, butter}}
The model: Association rules
• A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.
• An association rule is an implication of the form:
X → Y, where X, Y ⊂ I, and X ∩ Y = ∅
• An itemset is a set of items.
• E.g., X = {milk, bread, cereal} is an itemset.
• A k-itemset is an itemset with k items.
• E.g., {milk, bread, cereal} is a 3-itemset
Rule strength measures
• Support: The rule holds with support sup in T (the transaction data set) if sup%
of transactions contain X ∪ Y.
• sup = Pr(X ∪ Y), the probability that a transaction contains X ∪ Y
(Percentage of transactions that contain X ∪ Y)
• Confidence: The rule holds in T with confidence conf if conf% of transactions that
contain X also contain Y.
• conf = Pr(Y | X), the conditional probability that a transaction having X also contains Y
(Ratio of the number of transactions that contain X ∪ Y to the number that contain X)
• An association rule is a pattern that states when X occurs, Y occurs with certain
probability.
Support and Confidence
• Support count: The support count of an itemset X, denoted by X.count, in a data
set T is the number of transactions in T that contain X. Assume T has n
transactions.
Then,
support = (X ∪ Y).count / n
confidence = (X ∪ Y).count / X.count
Goal: Find all rules that satisfy the user-specified minimum support (minsup)
and minimum confidence (minconf).
Association Rule
An implication expression of the form X → Y, where X and Y are itemsets
Example:
{Milk, Diaper} → {Beer}
Rule Evaluation Metrics
Support (s)
Fraction of transactions that contain both X and Y
Confidence (c)
Measures how often items in Y appear in transactions that contain X
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Example
{Milk, Diaper} → {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
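The numbers above can be checked with a short Python sketch (an illustrative example of ours; the helper name sigma is an assumption, not notation from the notes):

# Transactions with TIDs 1-5 from the table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    # Support count: number of transactions containing the itemset.
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
support = sigma(X | Y) / len(transactions)   # 2 / 5 = 0.4
confidence = sigma(X | Y) / sigma(X)         # 2 / 3 ≈ 0.67
print(f"s = {support:.2f}, c = {confidence:.2f}")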
Can minimum support and minimum confidence be automatically determined in
mining association rules?
• For the minimum support, it all depends on the dataset. Usually, you may start
with a high value, and then decrease the value until you find one that generates
enough patterns.
• For the minimum confidence, it is a little easier because it represents the
confidence that you want in the rules. So usually, something like 60% is used. But it
also depends on the data.
• In terms of performance, when minsup is higher you will find fewer patterns and
the algorithm runs faster. For minconf, when it is set higher, there will be fewer
patterns, but it may not be faster because many algorithms don't use minconf to
prune the search space. So obviously, setting these parameters also depends on
how many rules you want.
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
if an itemset is frequent, each of its subsets is frequent as well.
This property belongs to a special category of properties called
antimonotonicity in the sense that if a set cannot pass a test, all of its
supersets will fail the same test as well.
2. Rule Generation
– Generate high confidence rules from each frequent itemset, where
each rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is still computationally expensive
Frequent Itemset Generation
• An itemset X is closed in a data set D if there exists no proper super-itemset Y*
such that Y has the same support count as X in D.
*(Y is a proper super-itemset of X if X is a proper sub-itemset of Y, that is, if X ⊂ Y.
In other words, every item of X is contained in Y but there is at least one item of Y
that is not in X.)
• An itemset X is a closed frequent itemset in set D if X is both closed and frequent
in D.
An itemset X is a maximal frequent itemset (or max-itemset) in a data set D if X is
frequent, and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in D.
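The difference between closed and maximal frequent itemsets can be illustrated with a brute-force sketch such as the one below (a toy example of ours; the transactions and the minsup count are assumed purely for illustration):

from itertools import combinations

# Toy data (assumed for illustration); minsup is an absolute support count.
transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"c"}]
MINSUP = 2

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

items = sorted(set().union(*transactions))

# Brute-force enumeration of all frequent itemsets (fine for tiny examples).
frequent = {
    frozenset(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if support(set(c)) >= MINSUP
}

# Closed: no proper superset has the same support count.
closed = {
    X for X in frequent
    if not any(X < Y and support(Y) == support(X) for Y in frequent)
}

# Maximal: no proper superset is frequent at all.
maximal = {X for X in frequent if not any(X < Y for Y in frequent)}

print("frequent:", [sorted(x) for x in frequent])
print("closed:  ", [sorted(x) for x in closed])
print("maximal: ", [sorted(x) for x in maximal])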
Frequent Itemset Generation Strategies
• Reduce the number of candidates (M)
• Complete search: M = 2^d
• Use pruning techniques to reduce M
• Reduce the number of transactions (N)
• Reduce size of N as the size of itemset increases
• Used by DHP (Direct Hashing & Pruning) and vertical-based mining
algorithms
• Reduce the number of comparisons (NM)
• Use efficient data structures to store the candidates or transactions
No need to match every candidate against every transaction
Many mining algorithms
• There are a large number of them!!
• They use different strategies and data structures.
• Their resulting sets of rules are all the same.
• Given a transaction data set T, and a minimum support and a minimum
confidence, the set of association rules existing in T is uniquely determined.
• Any algorithm should find the same set of rules although their computational
efficiencies and memory requirements may be different.
• We study only one: the Apriori Algorithm
The Apriori algorithm
• The algorithm uses a level-wise search, where k-itemsets are used to explore
(k+1)-itemsets
• In this algorithm, frequent subsets are extended one item at a time (this step is
known as candidate generation process)
• Then groups of candidates are tested against the data.
• It identifies the frequent individual items in the database and extends them to
larger and larger item sets as long as those itemsets appear sufficiently often in
the database.
• Apriori algorithm determines frequent itemsets that can be used to determine
association rules which highlight general trends in the database.
• The Apriori algorithm takes advantage of the fact that any subset of a frequent
itemset is also a frequent itemset.
• i.e., if {l1,l2} is a frequent itemset, then {l1} and {l2} should be frequent
itemsets.
• The algorithm can therefore, reduce the number of candidates being considered
by only exploring the itemsets whose support count is greater than the minimum
support count.
• Any itemset that has an infrequent subset can be pruned.
How do we do that?
• So we build a Candidate list of k-itemsets and then extract a Frequent list of k-
itemsets using the support count
• After that, we use the Frequent list of k-itemsets in determining the Candidate
and Frequent lists of (k+1)-itemsets.
• We use Pruning to do that
• We repeat until we have an empty Candidate or Frequent list of k-itemsets
• Then we return the list of (k-1)-itemsets.
KEY CONCEPTS
• Frequent Itemsets: All the itemsets which satisfy the minimum support
(denoted by Li for the ith itemset).
• Apriori Property: Any subset of frequent itemset must be frequent.
• Join Operation: To find Lk , a set of candidate k-itemsets is generated by joining Lk-1
with itself.
The Apriori Algorithm : Pseudo Code
Goal: find all itemsets I s.t. supp(I) > minsupp
• For each item X check if supp(X) > minsupp then retain I1 = {X}
• K=1
• Repeat
• For every itemset Ik, generate all itemsets Ik+1 s.t. Ik ⊂ Ik+1
• Scan all transactions and compute supp(Ik+1) for all itemsets Ik+1
• Drop itemsets Ik+1 with support < minsupp
• Until no new frequent itemsets are found
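A runnable sketch of this level-wise search is given below. It is an illustrative implementation under our own naming (apriori, with minsup as an absolute count), not the exact pseudo code above, but it follows the same steps: join the frequent k-itemsets to form candidate (k+1)-itemsets, prune candidates that have an infrequent subset, count supports in one scan, and stop when no new frequent itemsets are found.

from itertools import combinations
from collections import defaultdict

def apriori(transactions, minsup):
    # Returns {frozenset(itemset): support_count} for all frequent itemsets.
    # transactions: list of sets of items; minsup: absolute support count.
    transactions = [frozenset(t) for t in transactions]

    # Level 1: frequent individual items.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    current = {i: c for i, c in counts.items() if c >= minsup}
    frequent = dict(current)

    k = 1
    while current:
        # Join step: combine frequent k-itemsets into candidate (k+1)-itemsets.
        prev = list(current)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1:
                    # Prune step: every k-subset must itself be frequent.
                    if all(frozenset(s) in current
                           for s in combinations(union, k)):
                        candidates.add(union)

        # Scan the transactions once to count candidate supports.
        counts = defaultdict(int)
        for t in transactions:
            for cand in candidates:
                if cand <= t:
                    counts[cand] += 1

        current = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(current)
        k += 1

    return frequent

# Example with the five market-basket transactions used earlier (minsup = 3).
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
for itemset, count in sorted(apriori(baskets, 3).items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), count)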
Association Rules
Finally, construct all rules X → Y s.t.
• X ∪ Y has high support
• Supp(X ∪ Y) / Supp(X) > min-confidence
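A minimal sketch of this rule-construction step (assuming the apriori() function from the previous sketch, or any mapping from frequent itemsets to their support counts) could look like this:

from itertools import combinations

def generate_rules(frequent, minconf):
    # frequent: {frozenset: support_count}; yields (X, Y, confidence).
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        # Every non-empty proper subset X of the itemset is a candidate antecedent.
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                X = frozenset(antecedent)
                Y = itemset - X
                conf = count / frequent[X]      # Supp(X ∪ Y) / Supp(X)
                if conf >= minconf:
                    yield X, Y, conf

# Usage, continuing from the apriori() sketch above:
# for X, Y, conf in generate_rules(apriori(baskets, 3), minconf=0.75):
#     print(sorted(X), "->", sorted(Y), f"conf={conf:.2f}")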
LIMITATIONS
• Apriori algorithm can be very slow and the bottleneck is candidate generation.
• For example, if the transaction DB has 10^4 frequent 1-itemsets, it will generate more than
10^7 candidate 2-itemsets even after employing the downward closure.
• To compute those with support more than minsup, the database needs to be scanned at
every level. It needs (n + 1) scans, where n is the length of the longest pattern.
METHODS TO IMPROVE APRIORI’S EFFICIENCY
• Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count
is below the threshold cannot be frequent
• Transaction reduction: A transaction that does not contain any frequent k-itemset is
useless in subsequent scans
• Partitioning: Any itemset that is potentially frequent in DB must be frequent in at
least one of the partitions of DB.
• Sampling: mining on a subset of given data, lower support threshold + a method to
determine the completeness
• Dynamic itemset counting: add new candidate itemsets only when all of their subsets
are estimated to be frequent.
What is Cluster Analysis?
•Cluster: a collection of data objects
•Similar to one another within the same cluster
•Dissimilar to the objects in other clusters
•Cluster analysis
•Grouping a set of data objects into clusters
•Clustering is unsupervised classification: no predefined classes
•Typical applications
•As a stand-alone tool to get insight into data distribution
•As a preprocessing step for other algorithms
•Given a collection of objects, put objects into groups based on similarity.
•Used for “discovery-based” science, to find unexpected patterns in data.
•Also called “unsupervised learning” or “data mining”
•Inherently an ill-defined problem
General Applications of Clustering
Pattern Recognition
•Spatial Data Analysis
•create thematic maps in GIS by clustering feature spaces
•detect spatial clusters and explain them in spatial data mining
•Image Processing
•Economic Science (especially market research)
•WWW
•Document classification
•Cluster Weblog data to discover groups of similar access patterns
Requirements of Clustering in Data Mining
•Scalability
•Ability to deal with different types of attributes
•Discovery of clusters with arbitrary shape
•Minimal requirements for domain knowledge to determine input parameters
•Able to deal with noise and outliers
•Insensitive to order of input records
•High dimensionality
•Incorporation of user-specified constraints
•Interpretability and usability
Similarity and Dissimilarity Measures
•In clustering techniques, similarity (or dissimilarity) is an important measurement.
•Informally, similarity between two objects (e.g., two images, two documents, two
records, etc.) is a numerical measure of the degree to which two objects are alike.
•Dissimilarity, on the other hand, is the opposite measure: the degree to which two
objects are different.
•Both similarity and dissimilarity are also termed proximity.
•Usually, similarity and dissimilarity are non-negative numbers and may range from
zero (completely dissimilar, no similarity) up to some finite or infinite value
(completely similar, no dissimilarity).
Note:
•Frequently, the term distance is used as a synonym for dissimilarity
•In fact, it is used to refer as a special case of dissimilarity.
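As a concrete illustration (our own minimal sketch, not from the notes), the Euclidean distance is a common dissimilarity measure for numeric records, and a similarity in (0, 1] can be derived from it, for example via 1 / (1 + d):

import math

def euclidean_distance(x, y):
    # Dissimilarity between two numeric records of equal length.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def similarity(x, y):
    # One common way to turn a distance into a similarity in (0, 1].
    return 1.0 / (1.0 + euclidean_distance(x, y))

p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean_distance(p, q))  # 5.0
print(similarity(p, q))          # ~0.167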
Measure the Quality of Clustering
• Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance
function, which is typically metric: d(i, j)
• There is a separate “quality” function that measures the “goodness” of a cluster.
• The definitions of distance functions are usually very different for interval-scaled,
boolean, categorical, ordinal and ratio variables.
• Weights should be associated with different variables based on applications and
data semantics.
• It is hard to define “similar enough” or “good enough”
• the answer is typically highly subjective.
Major Clustering Approaches
• Partitioning algorithms: Construct various partitions and then evaluate them by
some criterion
• Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or
objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: A model is hypothesized for each of the clusters and the idea is to
find the best fit of the data to the given model
• Important distinction between partitional and hierarchical sets of
clusters
• Partitional Clustering
• A division of data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset
• Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree
Partitioning Algorithms: Basic Concept
• Partitioning method: Construct a partition of a database D of n objects
into a set of k clusters
• Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion
• Global optimal: exhaustively enumerate all partitions
• Heuristic methods: k-means and k-medoids algorithms
• k-means (MacQueen’67): Each cluster is represented by the
center of the cluster
• k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
K-Means Clustering
•Simple Clustering: K-means
•Given k, the k-means algorithm consists of four steps:
(Basic version works with numeric data only)
1) Select initial centroids at random: pick a number (K) of cluster centers (centroids) at
random
2) Assign every item to its nearest cluster center (e.g. using Euclidean distance)
3) Move each cluster center to the mean of its assigned items
4) Repeat steps 2, 3 until convergence (change in cluster assignments less than a
threshold)
K-means Algorithm
•Initialization
•Arbitrarily choose k objects as the initial cluster centers (centroids)
•Iteration until no change
•For each object Oi
•Calculate the distances between Oi and the k centroids
•(Re)assign Oi to the cluster whose centroid is the closest to Oi
•Update the cluster centroids based on the current assignment
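A short, self-contained sketch of these four steps is shown below (the data set, the value of k, and the convergence test are illustrative assumptions, not part of the notes):

import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    # Basic k-means on a list of numeric tuples; returns (centroids, labels).
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: random initial centroids

    for _ in range(max_iters):
        # Step 2: assign every point to its nearest centroid (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep empty clusters in place
        # Step 4: stop when the centroids (and hence assignments) no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids

    return centroids, labels

data = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
centers, labels = kmeans(data, k=2)
print(centers)
print(labels)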
Weaknesses of K-Mean Clustering
1. When the number of data points is small, the initial grouping will determine the
clusters significantly.
2. The number of clusters, K, must be determined beforehand. A disadvantage is that it
does not yield the same result with each run, since the resulting clusters depend on the
initial random assignments.
3. We never know the real clusters: using the same data, if they are input in a different
order they may produce different clusters, particularly when the number of data points is
small.
4. It is sensitive to the initial conditions. Different initial conditions may produce
different clustering results, and the algorithm may be trapped in a local optimum.
Applications of K-Mean Clustering
• It is relatively efficient and fast. It computes the result in O(tkn), where n is the number
of objects or points, k is the number of clusters and t is the number of iterations.
• k-means clustering can be applied to machine learning or data mining
• Used on acoustic data in speech understanding to convert waveforms into one of k
categories (known as Vector Quantization or Image Segmentation).
• Also used for choosing color palettes on old fashioned graphical display devices and
Image Quantization.
CONCLUSION
• K-means algorithm is useful for undirected knowledge discovery and is relatively
simple.
• K-means has found widespread usage in a lot of fields, ranging from unsupervised
learning of neural networks, pattern recognition, classification analysis, artificial
intelligence, image processing, and machine vision, to many others.
CLIQUE (Clustering In QUEst)
• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
• Automatically identifying subspaces of a high dimensional data space that allow
better clustering than original space
• CLIQUE can be considered as both density-based and grid-based
– It partitions each dimension into the same number of equal-length intervals
– It partitions an m-dimensional data space into non-overlapping rectangular
units
– A unit is dense if the fraction of total data points contained in the unit
exceeds the input model parameter
– A cluster is a maximal set of connected dense units within a subspace
CLIQUE: The Major Steps
• Partition the data space and find the number of points that lie inside each cell of the
partition.
• Identify the subspaces that contain clusters using the Apriori principle
• Identify clusters
– Determine dense units in all subspaces of interest
– Determine connected dense units in all subspaces of interest.
• Generate minimal description for the clusters
– Determine maximal regions that cover a cluster of connected dense units for
each cluster
– Determination of minimal cover for each cluster
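The grid-and-density idea behind the first two steps can be sketched for the simple 2-D case. The code below is a simplified illustration of dense-unit detection only, not the full CLIQUE subspace search; the points, the number of intervals per dimension, and the density threshold are assumptions:

from collections import Counter

def dense_units(points, intervals=4, threshold=0.25):
    # Partition the bounding box of 2-D points into intervals x intervals
    # rectangular units and return the units whose fraction of the total
    # points exceeds the density threshold.
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    x_step = (x_max - x_min) / intervals or 1.0
    y_step = (y_max - y_min) / intervals or 1.0

    counts = Counter()
    for x, y in points:
        # Index of the grid cell (clamped so the maximum value stays in range).
        i = min(int((x - x_min) / x_step), intervals - 1)
        j = min(int((y - y_min) / y_step), intervals - 1)
        counts[(i, j)] += 1

    n = len(points)
    return {cell for cell, c in counts.items() if c / n > threshold}

pts = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.9, 0.95), (0.85, 0.9), (0.5, 0.5)]
print(dense_units(pts))   # {(0, 0), (3, 3)} for these points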
Strength and Weakness of CLIQUE
• Strength
– automatically finds subspaces of the highest dimensionality such that high
density clusters exist in those subspaces
– insensitive to the order of records in input and does not presume some
canonical data distribution
– scales linearly with the size of input and has good scalability as the number
of dimensions in the data increases
• Weakness
– The accuracy of the clustering result may be degraded at the expense of
simplicity of the method
Frequent Pattern-Based Approach
• Clustering high-dimensional space (e.g., clustering text documents, microarray data)
– Projected subspace-clustering: which dimensions to be projected on?
• CLIQUE, ProClus
– Feature extraction: costly and may not be effective?
– Using frequent patterns as “features”
• “Frequent” are inherent features
• Mining freq. patterns may not be so expensive