
Module 5: Syllabus

Association Rules Mining: Concepts, Apriori and FP-Growth Algorithm.


Cluster Analysis: Introduction, Concepts, Types of data in cluster analysis, Categorization of clustering
methods.
Partitioning method: K-Means and K-Medoid Clustering

Association Rules Mining: Concepts


Association rule mining finds interesting association or correlation relationships among a
large set of data items. With massive amounts of data continuously being collected and stored in
databases, many industries are becoming interested in mining association rules from their databases. For
example, the discovery of interesting association relationships among huge amounts of business
transaction records can help catalog design, cross-marketing, loss-leader analysis, and other business
decision-making processes.

A typical example of association rule mining is market basket analysis. This process analyzes
customer buying habits by finding associations between the different items that customers place in their
“shopping baskets" (Figure 6.1). The discovery of such associations can help retailers develop marketing
strategies by gaining insight into which items are frequently purchased together by customers. For
instance, if customers are buying milk, how likely are they to also buy bread (and what kind of bread) on
the same trip to the supermarket? Such information can lead to increased sales by helping retailers to do
selective marketing and plan their shelf space. For instance, placing milk and bread within close proximity
may further encourage the sale of these items together within single visits to the store.

Basic concepts
1. Transaction database
 Let I = {I1, I2, ..., Im} be the set of all items.
 Let D, the task-relevant data, be a set of database transactions, where each transaction ti is a
nonempty itemset such that ti ⊆ I.
 Transaction database, T = {t1, t2, ..., tn}
 A transaction database T contains "n" transactions.
 Each transaction is associated with an identifier, called a TID.

Fig: Sample Transaction database
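As a concrete illustration, a transaction database of this form can be written directly in code. The sketch below (in Python; the TIDs and items reproduce the nine-transaction AllElectronics example used in the worked problems later in this module, reconstructed from the support counts given there) represents D as a mapping from TIDs to itemsets:

# A transaction database T = {t1, ..., tn}: each TID maps to a nonempty
# itemset drawn from I. These nine transactions reproduce the
# AllElectronics example used in the worked problems below.
D = {
    "T100": {"I1", "I2", "I5"},
    "T200": {"I2", "I4"},
    "T300": {"I2", "I3"},
    "T400": {"I1", "I2", "I4"},
    "T500": {"I1", "I3"},
    "T600": {"I2", "I3"},
    "T700": {"I1", "I3"},
    "T800": {"I1", "I2", "I3", "I5"},
    "T900": {"I1", "I2", "I3"},
}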

2. Association rule
An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X∩Y = φ.

 Meaning that whenever X appears in a transaction, Y also tends to appear.


 X and Y may each be a single item or a set of items.
 Example: {Milk} ⇒ {Bread}, {Computer, CD Drive} ⇒ {CD}
 Usually we prefer association rules that satisfy the required support and confidence.

3. Support, Confidence and Lift


 Support and confidence are used to measure the interestingness of a rule.
 Support(X) is the number of transactions in which X appears divided by the total number of
transactions.

Support(X) = (No. of times X appears) / (total number of transactions) = P(X)

Support(X,Y) = (No. of times X and Y appear together) / (total number of transactions) = P(X∩Y)

 Confidence of the association rule X ⇒ Y is defined as the ratio of support of X and Y together to the
support of X

Confidence ( X ⇒ Y)= ( support of X and Y together)/( support of X)= P(X∩Y)/ P(X)= P(Y|X)

 Lift is used to measure the power of association between items that are purchased together.

Lift(X ⇒ Y) = P(X∩Y) / (P(X) · P(Y)) = P(Y|X) / P(Y)

4. Frequent Items
Items that frequently appear in transactions are called frequent items. Usually we select
frequent items based on a minimum support count.
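The three measures translate directly into code. A minimal sketch (the function names are illustrative; it reuses the transaction dictionary D sketched earlier):

def support(db, X):
    # P(X): fraction of transactions containing every item of X.
    return sum(1 for t in db.values() if X <= t) / len(db)

def confidence(db, X, Y):
    # P(Y|X) = support(X and Y together) / support(X).
    return support(db, X | Y) / support(db, X)

def lift(db, X, Y):
    # confidence(X => Y) / P(Y); lift > 1 suggests positive association.
    return confidence(db, X, Y) / support(db, Y)

# Example: confidence(D, {"I1", "I5"}, {"I2"}) returns 1.0, the
# 100%-confidence rule found in the worked example below.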

Association Rules Mining: Algorithms


In general, association rule mining can be viewed as a two-step process:

Step 1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as
frequently as a predetermined minimum support count.

Step 2. Generate strong association rules from the frequent itemsets: By definition, these rules
must satisfy minimum support and minimum confidence.

1. Apriori Algorithm
The Apriori algorithm consists of two parts:

Part 1: Find all frequent itemsets: itemsets that meet the minimum support.

Part 2: Generate strong association rules from the frequent itemsets (association rules that meet
the minimum confidence).

PART 1: Finding frequent itemsets


To improve the efficiency of the level-wise generation of frequent itemsets, an important property
called the Apriori property is used to reduce the search space.

The Apriori property. All nonempty subsets of a frequent itemset must also be frequent.
This property belongs to a special category of properties called anti-monotone in the sense that if a
set cannot pass a test, all of its supersets will fail the same test as well.

Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to
explore (k+1)-itemsets.

First, the set of frequent 1-itemsets is found. This set is denoted L1. L1 is used to find L2, the
frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found.
The finding of each Lk requires one full scan of the database.

A two-step process is followed to find frequent itemsets, consisting of join and prune actions.
1. The join step: To find Lk, a set of candidate k-itemsets, denoted Ck, is generated by joining Lk−1
with itself.
2. The prune step: Ck is a superset of Lk , that is, its members may or may not be frequent, but all of the
frequent k-itemsets are included in Ck . A database scan to determine the count of each candidate in
Ck would result in the determination of Lk (i.e., all candidates having a count no less than the
minimum support count are frequent by definition, and therefore belong to Lk). Ck , however, can be
huge, and so this could involve heavy computation. To reduce the size of Ck , the Apriori property is
used as follows. Any (k − 1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
Hence, if any (k − 1)-subset of a candidate k-itemset is not in Lk−1, then the candidate cannot be
frequent either and so can be removed from Ck .
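Putting the join and prune steps together, the level-wise search can be sketched as follows (a minimal, unoptimized illustration, assuming transactions are given as Python sets and min_sup is an absolute support count; the names are illustrative, not the textbook's pseudocode):

from itertools import combinations

def apriori(transactions, min_sup):
    # L1: frequent 1-itemsets, found with one scan of the database.
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_sup}
    frequent = set(Lk)
    k = 2
    while Lk:
        # Join step: Lk-1 joined with itself gives candidate k-itemsets Ck.
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step (Apriori property): drop any candidate with an
        # infrequent (k-1)-subset.
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # One full database scan to count the surviving candidates.
        Lk = {c for c in Ck
              if sum(c <= t for t in transactions) >= min_sup}
        frequent |= Lk
        k += 1
    return frequent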

PART 2: Generating Association Rules from Frequent Itemsets


Once the frequent itemsets from the transactions in a database D have been found, it is straightforward to
generate strong association rules from them (where strong association rules satisfy both minimum
support and minimum confidence). This can be done using the equation for confidence.

Confidence ( X ⇒ Y)= ( support of X and Y together)/( support of X)= P(X∩Y)/ P(X)= P(Y|X)
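In code, this amounts to enumerating every nonempty proper subset s of each frequent itemset l and keeping the rules s ⇒ (l − s) whose confidence meets the threshold. A minimal sketch (support_count is assumed to map itemsets to the counts gathered during Part 1; the names are illustrative):

from itertools import combinations

def rules_from_itemset(l, support_count, min_conf):
    # l is a frequent itemset (frozenset); support_count maps each
    # frequent itemset to its count. Every nonempty proper subset s of
    # l is frequent (Apriori property), so its count is available.
    rules = []
    for r in range(1, len(l)):
        for s in combinations(l, r):
            s = frozenset(s)
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules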

Example Problem 1:

Based on the AllElectronics transaction database, D, of Table 6.1, find the association rules present in
the database, considering a minimum support of 20% and a minimum confidence of 75%.

Solution:

There are nine transactions in this database, that is, n = 9.

Set of Items, I={I1,I2,I3,I4,I5}

Part 1: Frequent Item Set generation


Step 1: Each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the
transactions to count the number of occurrences of each item.

Step 2: The set of frequent 1-itemsets, L1, can then be determined. The minimum support count required is
2, that is, min_sup = 2 (here, the support is 2/9 ≈ 22%). All items in C1 qualify, so L1 = C1.
Step 3: To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate a
candidate set of 2-itemsets, C2.

Step 4: Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is
calculated.

Step 5: The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in
C2 having minimum support.

Step 6: Generation of the set of candidate 3-itemsets, C3. From the join step, we first get C3 = L2 ⋈
L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}

C3
Items
{I1, I2, I3}
{I1, I2, I5}
{I1, I3, I5}
{I2, I3, I4}
{I2, I3, I5}
{I2, I4, I5}

Step 7: The transactions in D are scanned to determine the support count of each candidate in C3.


C3
Items            Sup. count
{I1, I2, I3}     2
{I1, I2, I5}     2
{I1, I3, I5}     1
{I2, I3, I4}     0
{I2, I3, I5}     1
{I2, I4, I5}     0
Step 8: Find the frequent 3-itemsets, L3 (the candidate 3-itemsets in C3 having minimum support):
L3 = {{I1, I2, I3}, {I1, I2, I5}}.

Step 9: Generation of the set of candidate 4-itemsets, C4. From the join step, we first get C4 = L3 ⋈ L3.

C4
Items              Sup. count
{I1, I2, I3, I5}   1

Step 10: Find the frequent 4-itemsets, L4 (the candidate 4-itemsets in C4 having minimum support).

L4 is empty (a NULL list), so frequent itemset generation stops.
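For a database this small, the whole level-wise result can be checked by brute force. The following self-contained sketch enumerates every k-itemset directly rather than joining and pruning (suitable only for tiny examples; the transactions are reconstructed from the support counts above):

from itertools import combinations

# The nine transactions, reconstructed from the support counts above.
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
items = sorted({i for t in D for i in t})
min_sup = 2  # 20% of 9 transactions, rounded up

for k in range(1, len(items) + 1):
    Lk = {}
    for c in combinations(items, k):
        count = sum(set(c) <= t for t in D)
        if count >= min_sup:
            Lk[c] = count
    if not Lk:
        break  # L4 is empty, so the search stops after L3
    print(f"L{k}:", Lk)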

PART 2: Generating Association Rules from Frequent Item sets

 Minimum confidence required: 75%


 Confidence ( X ⇒ Y)= ( support of X and Y together)/( support of X)= P(X∩Y)/ P(X)= P(Y|X)

Consider L3={{I1, I2, I3},{I1, I2, I5}}

Association rules formed from the frequent itemset {I1, I2, I3}:

I1^I2⇒ I3, confidence = 2/4 = 50%


I1^I3⇒ I2, confidence = 2/4 = 50%
I2^I3⇒ I1, confidence = 2/4 = 50%
I1⇒ I2^I3, confidence = 2/6 = 33%
I2⇒ I1^I3, confidence = 2/7 = 29%
I3⇒ I1^I2, confidence = 2/6 = 33%
Association rules formed from the frequent itemset {I1, I2, I5}:

I1^I5⇒ I2, confidence = 2/2 = 100%
I2^I5⇒ I1, confidence = 2/2 = 100%
I5⇒ I1^I2, confidence = 2/2 = 100%
I1^I2⇒ I5, confidence = 2/4 = 50%
I1⇒ I2^I5, confidence = 2/6 = 33%
I2⇒ I1^I5, confidence = 2/7 = 29%

The association rules that satisfy the minimum confidence of 75% are I1^I5⇒ I2, I2^I5⇒ I1, and
I5⇒ I1^I2; these are the strong rules output for this database.
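These confidence checks can be reproduced with a few lines of code. A sketch using the support counts found in Part 1 (hard-coded here for illustration):

from itertools import combinations

# Support counts found in Part 1 (hard-coded for illustration).
sup = {frozenset(s): c for s, c in [
    ({"I1"}, 6), ({"I2"}, 7), ({"I5"}, 2), ({"I1", "I2"}, 4),
    ({"I1", "I5"}, 2), ({"I2", "I5"}, 2), ({"I1", "I2", "I5"}, 2)]}
l = frozenset({"I1", "I2", "I5"})

for r in range(1, len(l)):
    for s in combinations(l, r):
        s = frozenset(s)
        conf = sup[l] / sup[s]
        if conf >= 0.75:  # minimum confidence of 75%
            print(set(s), "=>", set(l - s), f"confidence = {conf:.0%}")
# Prints the three strong rules: {I1,I5}=>{I2}, {I2,I5}=>{I1}, {I5}=>{I1,I2}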

Major drawbacks of Apriori Algorithm

1. The number of candidate itemsets grows quickly, resulting in a large candidate set. For example, if
there are 10^4 frequent 1-itemsets (L1), the Apriori algorithm will need to generate more than 10^7
candidate 2-itemsets (C2).
2. The Apriori algorithm requires many scans of the database.

FP-growth Association Rule Mining Algorithm


“Can we design a method that mines the complete set of frequent itemsets without such a costly
candidate generation process?”

An interesting method in this attempt is called frequent pattern growth, or simply FP-growth, which
adopts a divide-and-conquer strategy as follows.

First, it compresses the database representing frequent items into a frequent pattern tree, or FP-tree,
which retains the itemset association information. It then divides the compressed database into a set of
conditional databases (a special kind of projected database), each associated with one frequent item or
“pattern fragment,” and mines each database separately. For each “pattern fragment,” only its associated
data sets need to be examined. Therefore, this approach may substantially reduce the size of the data sets
to be searched, along with the “growth” of patterns being examined.

Note: The FP-growth algorithm finds the set of frequent itemsets without a candidate generation process.
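The compression step can be sketched in code. The following is a minimal illustration of FP-tree construction only (mining the conditional pattern bases is omitted); the class and variable names are illustrative, not the textbook's pseudocode. Note that it makes exactly two passes over the dataset, one to count item frequencies and one to insert the reordered transactions:

from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # Pass 1: count item frequencies; keep only frequent items,
    # ordered by descending support.
    counts = defaultdict(int)
    for t in transactions:
        for i in t:
            counts[i] += 1
    order = {i: c for i, c in sorted(counts.items(), key=lambda x: -x[1])
             if c >= min_sup}
    # Pass 2: insert each transaction, reordered by descending support,
    # so that shared prefixes compress into shared tree paths.
    root = Node(None, None)
    header = defaultdict(list)  # item -> its nodes, used for mining
    for t in transactions:
        node = root
        for i in sorted((j for j in t if j in order), key=lambda j: -order[j]):
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += 1
    return root, header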

Advantages of the FP-growth algorithm:

1. Faster than the Apriori algorithm

2. No candidate generation

3. Only two passes over the dataset

Disadvantages of the FP-growth algorithm:

1. The FP-tree may not fit in memory

2. The FP-tree is expensive to build



Algorithm: FP-growth. Mine frequent itemsets using an FP-tree by pattern fragment growth.

Input: D, a transaction database; min_sup, the minimum support count threshold.

Output: The complete set of frequent patterns.


Example Problem:

Based on the AllElectronics transaction database, D, of Table 6.1, find the frequent patterns present in
the database using the FP-growth algorithm, considering a minimum support of 20% and a confidence of
75%.
Solution:

Fig: FP-tree corresponding to the given dataset

Fig: Frequent patterns generated from the given dataset
What is Cluster Analysis?

 Cluster: A collection of data objects


o similar (or related) to one another within the same group
o dissimilar (or unrelated) to the objects in other groups
 Cluster analysis (or clustering, data segmentation, …)
o Finding similarities between data according to the characteristics found in the data and
grouping similar data objects into clusters
 Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by
examples: supervised)

 Typical applications of Cluster analysis


o As a stand-alone tool to get insight into data distribution
o As a preprocessing step for other algorithms

Clustering for Data Understanding and Applications

 Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
 Information retrieval: document clustering
 Land use: Identification of areas of similar land use in an earth observation database
 Marketing: Help marketers discover distinct groups in their customer bases, and then use this
knowledge to develop targeted marketing programs
 City-planning: Identifying groups of houses according to their house type, value, and
geographical location
 Earthquake studies: Observed earthquake epicenters should be clustered along continent
faults
 Climate: understanding the Earth's climate; finding patterns in atmospheric and ocean data
 Economic science: market research

Clustering as a Preprocessing Tool (Utility)

 Summarization:
o Preprocessing for regression, PCA, classification, and association analysis
 Compression:
o Image processing: vector quantization
 Finding K-nearest Neighbors
o Localizing search to one or a small number of clusters
 Outlier detection
o Outliers are often viewed as those “far away” from any cluster

Quality: What Is Good Clustering?

 A good clustering method will produce high quality clusters


o high intra-class similarity: cohesive within clusters
o low inter-class similarity: distinctive between clusters

 The quality of a clustering method depends on


o the similarity measure used by the method
o its implementation, and
o its ability to discover some or all of the hidden patterns

Major Clustering Approaches


In general, the major clustering methods can be classified into the following categories.
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Methods
 Frequent Pattern-Based Methods
 User-Guided or Constraint-Based Methods

 Partitioning approach:
 Construct various partitions and then evaluate them by some criterion, e.g., minimizing the
sum of square errors
 Given a database of n objects, a partitioning method constructs k partitions of the data objects,
where each partition represents a cluster.
 That is, it clusters the data into k groups, which together satisfy the following requirements:
(1) each group must contain at least one object, and
(2) each object must belong to exactly one group
 Typical methods: k-means, k-medoids, CLARANS
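A minimal k-means sketch follows (illustrative, not an optimized implementation): points are assigned to their nearest center, centers move to the mean of their cluster, and the two steps repeat until the centers stop changing.

import random

def kmeans(points, k, iters=100, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)  # initial centers: random points
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: sum((a - b) ** 2
                    for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        # Update step: each center moves to the mean of its cluster.
        new = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl
               else centers[j] for j, cl in enumerate(clusters)]
        if new == centers:
            break  # converged: no center moved
        centers = new
    return centers, clusters

# Example: kmeans([(1, 1), (1.5, 2), (8, 8), (9, 8.5)], k=2)
# recovers the two obvious groups.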

 Hierarchical approach:

o A hierarchical method creates a hierarchical decomposition of the given set of data objects.

o A hierarchical method can be classified as being either agglomerative or divisive

o The agglomerative approach, also called the bottom-up approach, starts with each object
forming a separate group. It successively merges the objects or groups that are close to one
another, until all of the groups are merged into one (the topmost level of the hierarchy), or
until a termination condition holds.

o The divisive approach, also called the top-down approach, starts with all of the objects in the
same cluster. In each successive iteration, a cluster is split up into smaller clusters, until
eventually each object is in one cluster, or until a termination condition holds.

o Typical methods: Diana, Agnes, BIRCH, CAMELEON
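As a usage illustration, agglomerative clustering is available in SciPy. The snippet below (a sketch assuming SciPy is installed; the sample points are made up) builds the full merge hierarchy with Ward linkage and then cuts it into at most three clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.5, 2], [8, 8], [9, 8.5], [25, 30]])
Z = linkage(X, method="ward")  # bottom-up merge hierarchy (agglomerative)
labels = fcluster(Z, t=3, criterion="maxclust")  # cut into at most 3 clusters
print(labels)  # e.g. [1 1 2 2 3]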

 Density-based approach:

o Most partitioning methods cluster objects based on the distance between objects. Such
methods can find only spherical-shaped clusters and encounter difficulty at discovering
clusters of arbitrary shapes.

o Density-based clustering methods have been developed based on the notion of density.

o Based on connectivity and density functions

o Typical methods: DBSCAN, OPTICS, DenClue
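As a usage illustration, DBSCAN is available in scikit-learn. The snippet below (a sketch assuming scikit-learn is installed; the sample points are made up) finds two dense groups and labels the isolated point as noise:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1.2, 1.1], [1.1, 0.9],   # dense group 1
              [8, 8], [8.1, 8.2],               # dense group 2
              [50, 50]])                        # isolated point
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # e.g. [ 0  0  0  1  1 -1]; -1 marks noise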

 Grid-based approach:
o Grid-based methods quantize the object space into a finite number of cells that form a
grid structure.

o All of the clustering operations are performed on the grid structure (i.e., on the quantized
space).

o The main advantage of this approach is its fast processing time, which is typically
independent of the number of data objects and dependent only on the number of cells in
each dimension in the quantized space.

o Typical methods: STING, WaveCluster, CLIQUE

 Model-based:

o Model-based methods hypothesize a model for each of the clusters and find the best
fit of the data to the given model.

o Typical methods: Expectation-Maximization (EM), SOM, COBWEB

 Frequent pattern-based:

o Based on the analysis of frequent patterns

o Typical methods: p-Cluster

 User-guided or constraint-based:

o Clustering by considering user-specified or application-specific constraints

o Typical methods: COD (obstacles), constrained clustering
