
23AD1901 - DATAWAREHOUSING AND DATA MINING V SEM (2025 -2026)

UNIT III DATA MINING - FREQUENT PATTERN ANALYSIS


Mining Frequent Patterns, Associations and Correlations – Mining Methods – Pattern Evaluation Methods –
Pattern Mining in Multilevel, Multidimensional Space – Constraint-Based Frequent Pattern Mining –
Classification Using Frequent Patterns.

Mining Frequent Patterns


Basic Concepts:
Frequent pattern mining searches for recurring relationships in a given data set. This section
introduces the basic concepts of frequent pattern mining for the discovery of interesting associations
and correlations between item sets in transactional and relational databases.
Apriori is the basic method that mines the complete set of frequent itemsets with candidate generation. Frequent
patterns are patterns (e.g., itemsets, subsequences, or substructures) that appear frequently in a data set.
For example, a set of items, such as milk and bread, that appear frequently together in a transaction
data set is a frequent itemset.
A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs
frequently in a shopping history database, is a (frequent) sequential pattern. A substructure can refer
to different structural forms, such as subgraphs, subtrees, or sublattices, which may be combined with
item sets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured
pattern.
Finding frequent patterns plays an essential role in mining associations, correlations, and many other
interesting relationships among data.
Moreover, it helps in data classification, clustering, and other data mining tasks. Thus, frequent pattern
mining has become an important data mining task and a focused theme in data mining research.

Apriori property & The Apriori Algorithm.

Apriori property

• All nonempty subsets of a frequent item set must also be frequent.


– If an itemset I does not satisfy the minimum support threshold, min_sup, then I is not
frequent, i.e., support(I) < min_sup.
– If an item A is added to the itemset I, then the resulting itemset (I ∪ A) cannot occur
more frequently than I, so (I ∪ A) is not frequent either.


• Monotonic functions are functions that move in only one direction.


• This property is called anti-monotone: if a set cannot pass a test, all of its supersets will fail the same test as well.
• The property is called anti-monotone because it is monotonic in the context of failing the test (see the sketch below).
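To make the anti-monotone property concrete, the following minimal Python sketch (the helper names are illustrative, not from these notes) checks whether a candidate k-itemset can be pruned because one of its (k−1)-subsets is not frequent:

```python
from itertools import combinations

def can_prune(candidate, frequent_prev):
    """Return True if the candidate k-itemset violates the Apriori property,
    i.e., at least one of its (k-1)-subsets is not in the frequent set."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(candidate, k - 1))

# Example: {A, B} and {A, C} are frequent, {B, C} is not,
# so the candidate {A, B, C} can be pruned without counting its support.
frequent_2 = {frozenset({"A", "B"}), frozenset({"A", "C"})}
print(can_prune(("A", "B", "C"), frequent_2))   # True
```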

Association Mining
• Association rule mining:
– Finding frequent patterns, associations, correlations, or causal structures among sets of items
or objects in transaction databases, relational databases, and other information repositories.
• Applications:
– Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering,
classification, etc.
• Examples.
– Rule form: “Body ⇒ Head [support, confidence]”.
– buys(x, “diapers”) ⇒ buys(x, “beers”) [0.5%, 60%]
– major(x, “CS”) ∧ takes(x, “DB”) ⇒ grade(x, “A”) [1%, 75%]

Association Rule: Basic Concepts


• Given: (1) database of transactions, (2) each transaction is a list of items (purchased
by a customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items
– E.g., 98% of people who purchase tires and auto accessories also get automotive
services done

• Applications

– Maintenance Agreement (what should the store do to boost sales of Maintenance Agreements?)
– Home Electronics (what other products should the store stock up on?)
– Attached mailing in direct marketing
– Detecting “ping-pong”ing of patients, faulty “collisions”

Rule Measures: Support and Confidence

• Find all rules X ∧ Y ⇒ Z with minimum confidence and support
– support, s, probability that a transaction contains {X, Y, Z}


– confidence, c, conditional probability that a transaction having {X, Y} also contains Z

Let minimum support be 50% and minimum confidence be 50%; then for the transactions below we have

– A ⇒ C (50%, 66.6%)
– C ⇒ A (50%, 100%)

Transaction ID    Items Bought
2000              A, B, C
1000              A, C
4000              A, D
5000              B, E, F
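As a rough illustration of how these numbers are obtained, the sketch below recomputes the support and confidence of A ⇒ C and C ⇒ A from the four transactions in the table (the helper names are illustrative only):

```python
# Transactions from the table above.
transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs | lhs) = support(lhs union rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A", "C"}))          # 0.5    -> 50% support for A => C
print(confidence({"A"}, {"C"}))     # 0.666  -> 66.6% confidence for A => C
print(confidence({"C"}, {"A"}))     # 1.0    -> 100% confidence for C => A
```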

Association Rule Mining: A Road Map

• Boolean vs. quantitative associations (Based on the types of values handled)


– buys(x, “SQLServer”) ∧ buys(x, “DMBook”) ⇒ buys(x, “DBMiner”) [0.2%, 60%]
– age(x, “30..39”) ∧ income(x, “42..48K”) ⇒ buys(x, “PC”) [1%, 75%]
• Single-dimensional vs. multidimensional associations (see the examples above)
• Single level vs. multiple-level analysis
– What brands of beers are associated with what brands of diapers?
• Various extensions
– Correlation, causality analysis
• Association does not necessarily imply correlation or causality
– Maxpatterns and closed itemsets
– Constraints enforced
• E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?

Market – Basket analysis

A market basket is a collection of items purchased by a customer in a single transaction, which is a


well-defined business activity. For example, a customer's visits to a grocery store or an online purchase from
a virtual store on the Web are typical customer transactions. Retailers accumulate huge collections of
transactions by recording business activities over time. One common analysis run against a transactions
database is to find sets of items, or itemsets, that appear together in many transactions. A business can use
knowledge of these patterns to improve the placement of these items in the store or the layout of mail-order
catalog pages and Web pages. An itemset containing i items is called an i-itemset. The percentage of
transactions that contain an itemset is called the itemset's support. For an itemset to be interesting, its support
must be higher than a user-specified minimum. Such itemsets are said to be frequent.

Figure : Market basket analysis.

Rule support and confidence are two measures of rule interestingness. They respectively reflect the usefulness
and certainty of discovered rules. A support of 2% for an association rule means that 2% of all the transactions
under analysis show that computer and financial management software are purchased together. A confidence
of 60% means that 60% of the customers who purchased a computer also bought the software. Typically,
association rules are considered interesting if they satisfy both a minimum support threshold and a minimum
confidence threshold.

The Apriori Algorithm

• Join Step: Ck is generated by joining Lk−1 with itself


• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k- itemset
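The join and prune steps can be put together in a compact (and deliberately unoptimized) Apriori sketch; the function names here are illustrative, not from the notes:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support_count} for all frequent itemsets.
    `transactions` is a list of sets; `min_support` is an absolute count."""
    transactions = [frozenset(t) for t in transactions]

    def count_and_filter(candidates):
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    # L1: frequent 1-itemsets.
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = count_and_filter(items)
    result, k = dict(frequent), 2

    while frequent:
        # Join step: Ck is generated by joining Lk-1 with itself.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        frequent = count_and_filter(candidates)
        result.update(frequent)
        k += 1
    return result

db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(apriori(db, min_support=2))   # {A}, {B}, {C}, and {A, C}
```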


Example: (figure showing a step-by-step Apriori trace on a sample transaction database)


FP-growth: the method that mines the complete set of frequent itemsets without candidate generation.

• Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure


– highly condensed, but complete for frequent pattern mining
– avoid costly database scans
• Develop an efficient, FP-tree-based frequent pattern mining method
– A divide-and-conquer methodology: decompose mining tasks into smaller ones
– Avoid candidate generation: sub-database test only!

Construct FP-tree from a Transaction DB

TID    Items bought                   (Ordered) frequent items
100    {f, a, c, d, g, i, m, p}       {f, c, a, m, p}
200    {a, b, c, f, l, m, o}          {f, c, a, b, m}
300    {b, f, h, j, o}                {f, b}
400    {b, c, k, s, p}                {c, b, p}
500    {a, f, c, e, l, p, m, n}       {f, c, a, m, p}

min_support = 0.5

Steps:

1. Scan DB once, find frequent 1-itemset (single item pattern)


2. Order frequent items in frequency descending order
3. Scan DB again, construct FP-tree.


Header Table

Item frequency head

f 4
c 4
a 3
b 3
m 3
p 3

Benefits of the FP-tree Structure

• Completeness:
– never breaks a long pattern of any transaction
– preserves complete information for frequent pattern mining
• Compactness
– reduce irrelevant information—infrequent items are gone
– frequency descending ordering: more frequent items are more likely to be shared
– never larger than the original database (not counting node-links and counts)
– Example: For Connect-4 DB, compression ratio could be over 100

Mining Frequent Patterns Using FP-tree

• General idea (divide-and-conquer)


– Recursively grow frequent pattern path using the FP-tree
• Method
– For each item, construct its conditional pattern-base, and then its conditional FP- tree
– Repeat the process on each newly created conditional FP-tree
– Until the resulting FP-tree is empty, or it contains only one path (single path will generate
all the combinations of its sub-paths, each of which is a frequent pattern)


Major Steps to Mine FP-tree

1) Construct conditional pattern base for each node in the FP-tree


2) Construct conditional FP-tree from each conditional pattern-base
3) Recursively mine conditional FP-trees and grow frequent patterns obtained so far
• If the conditional FP-tree contains a single path, simply enumerate all the patterns

Principles of Frequent Pattern Growth

• Pattern growth property


– Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is frequent in B.


• “abcdef ” is a frequent pattern, if and only if
– “abcde ” is a frequent pattern, and
– “f ” is frequent in the set of transactions containing “abcde ”

Why Is Frequent Pattern Growth Fast?

• Our performance study shows


– FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection
• Reasoning
– No candidate generation, no candidate test
– Use compact data structure
– Eliminate repeated database scan
Basic operation is counting and FP-tree building

Mining multilevel association rules from transactional databases.

• Items often form hierarchy.


• Items at the lower level are expected to have lower support.

• Rules regarding item sets at appropriate levels could be quite useful.

• Transaction database can be encoded based on dimensions and levels

• We can explore shared multi-level mining


Figure: A concept hierarchy for items.
Food
 ├─ Milk: Skim, 2%      (brand level, e.g., Fraser)
 └─ Bread: Wheat, White (brand level, e.g., Sunset)


TID    Items
T1     {111, 121, 211, 221}
T2     {111, 211, 222, 323}
T3     {112, 122, 221, 411}
T4     {111, 121}
T5     {111, 122, 211, 221, 413}

Mining Multi-Level Associations

• A top_down, progressive deepening approach:


– First find high-level strong rules:
milk ⇒ bread [20%, 60%].

– Then find their lower-level “weaker” rules:


2% milk ⇒ wheat bread [6%, 50%].

• Variations of mining multiple-level association rules:

– Level-crossed association rules:
  2% milk ⇒ Wonder wheat bread

– Association rules with multiple, alternative hierarchies:
  2% milk ⇒ Wonder bread

Multi-level Association: Uniform Support vs. Reduced Support

• Uniform Support: the same minimum support for all levels


– + One minimum support threshold. No need to examine itemsets containing any item
whose ancestors do not have minimum support.
– − Lower-level items do not occur as frequently. If the support threshold is
• too high ⇒ miss low-level associations
• too low ⇒ generate too many high-level associations
• Reduced Support: reduced minimum support at lower levels
– There are 4 search strategies:
• Level-by-level independent


• Level-cross filtering by k-itemset


• Level-cross filtering by single item
• Controlled level-cross filtering by single item

Multi-level Association: Redundancy Filtering

• Some rules may be redundant due to “ancestor” relationships between items.


• Example
– milk ⇒ wheat bread [support = 8%, confidence = 70%]
– 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
• We say the first rule is an ancestor of the second rule.
• A rule is redundant if its support is close to the “expected” value, based on the rule’s ancestor

Multi-Level Mining: Progressive Deepening

• A top-down, progressive deepening approach:


– First mine high-level frequent items:
milk (15%), bread(10%)

– Then mine their lower-level “weaker” frequent


itemsets: 2% milk (5%), wheat bread (4%)

• Different min_support thresholds across multiple levels lead to different algorithms:


– If adopting the same min_support across multi-levels then toss t if any of t’s ancestors is
infrequent.
– If adopting reduced min_support at lower levels, then examine only those
descendants whose ancestors' support is frequent/non-negligible.

Correlation in detail.

• Interest (correlation, lift)


– takes both P(A) and P(B) into consideration
– lift(A, B) = P(A ∧ B) / (P(A) P(B)); if A and B are independent events, P(A ∧ B) = P(A) P(B) and the lift is 1
– A and B are negatively correlated if the value is less than 1; otherwise A and B are positively
correlated


X:  1 1 1 1 0 0 0 0
Y:  1 1 0 0 0 0 0 0
Z:  0 1 1 1 1 1 1 1

Itemset    Support    Interest (lift)
{X, Y}     25%        2
{X, Z}     37.50%     0.9
{Y, Z}     12.50%     0.57
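The interest (lift) values in the table can be reproduced from the eight rows of X, Y, Z with a few lines of Python (a minimal sketch using the lift formula above):

```python
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

def lift(a, b):
    """lift(A, B) = P(A and B) / (P(A) * P(B)); 1 means independence."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x and y for x, y in zip(a, b)) / n
    return p_ab / (p_a * p_b)

print(round(lift(X, Y), 2))   # 2.0  -> positively correlated
print(round(lift(X, Z), 2))   # 0.86 -> mildly negatively correlated (~0.9)
print(round(lift(Y, Z), 2))   # 0.57 -> negatively correlated
```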
χ² Correlation

• The χ² (chi-square) statistic measures correlation between categorical attributes:

χ² = Σ (observed − expected)² / expected

             game          not game      sum (row)
video        4000 (4500)   3500 (3000)   7500
not video    2000 (1500)   500 (1000)    2500
sum (col.)   6000          4000          10000

(Expected counts are shown in parentheses.)


• expected(i, j) = count(row i) × count(column j) / N
• χ² = (4000 − 4500)²/4500 + (3500 − 3000)²/3000 + (2000 − 1500)²/1500 + (500 − 1000)²/1000 = 555.6
• Since the χ² value is large and the observed count for (game, video) is less than the expected count, game and video are negatively correlated.
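The χ² value above can be checked with a short script (a sketch of the same arithmetic, not a general statistics routine):

```python
# Observed 2x2 contingency table: rows = (video, not video), cols = (game, not game).
observed = [[4000, 3500],
            [2000,  500]]

row_tot = [sum(r) for r in observed]        # 7500, 2500
col_tot = [sum(c) for c in zip(*observed)]  # 6000, 4000
n = sum(row_tot)                            # 10000

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_tot[i] * col_tot[j] / n      # e.g. 7500 * 6000 / 10000 = 4500
        chi2 += (observed[i][j] - expected) ** 2 / expected

print(round(chi2, 1))   # 555.6 -- large, so game and video are correlated;
                        # observed (game, video) = 4000 < expected 4500 => negative correlation
```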

Numeric correlation

• Correlation concept in statistics


– Used to study the relationship existing between 2 or more numeric variables
– A correlation is a measure of the linear relationship between variables, e.g., the number of hours spent studying for a class and the grade received
– Outcomes:
• positively related
• Not related
• negatively related
– Statistical relationships
• Covariance
• Correlation coefficient
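For numeric variables, covariance and the Pearson correlation coefficient can be computed directly from their definitions; a small sketch with made-up study-hours/grade data (the values are illustrative only):

```python
import math

hours  = [2, 4, 6, 8, 10]      # hours spent studying (illustrative)
grades = [55, 60, 70, 80, 85]  # grade received (illustrative)

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    # Pearson correlation coefficient: cov(X, Y) / (std(X) * std(Y)); range [-1, 1].
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

print(round(covariance(hours, grades), 2))
print(round(correlation(hours, grades), 3))   # close to +1 -> positively related
```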

Constraint-Based Mining in detail.

• Interactive, exploratory mining of gigabytes of data?


– Could it be real? — Making good use of constraints!
• What kinds of constraints can be used in mining?
– Knowledge type constraint: classification, association, etc.
– Data constraint: SQL-like queries
• Find product pairs sold together in Vancouver in Dec.’98.
– Dimension/level constraints:
• in relevance to region, price, brand, customer category.
– Rule constraints
• small sales (price < $10) triggers big sales (sum > $200).


– Interestingness constraints:
• strong rules (min_support ≥ 3%, min_confidence ≥ 60%).

Rule Constraints in Association Mining

• Two kinds of rule constraints:


– Rule form constraints: meta-rule guided mining.
• P(x, y) ∧ Q(x, w) ⇒ takes(x, “database systems”).
– Rule (content) constraint: constraint-based query optimization (Ng, et al.,
SIGMOD’98).
• sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000
• 1-variable vs. 2-variable constraints (Lakshmanan, et al. SIGMOD’99):
– 1-var: A constraint confining only one side (L/R) of the rule, e.g., as shown above.
– 2-var: A constraint confining both sides (L and R).
• sum(LHS) < min(RHS) ^ max(RHS) < 5* sum(LHS)

Constraint-Based Association Query

• Database: (1) trans (TID, Itemset ), (2) itemInfo (Item, Type, Price)
• A constrained asso. query (CAQ) is in the form of {(S1, S2 )|C },
– where C is a set of constraints on S1, S2 including frequency constraint
• A classification of (single-variable) constraints:
– Class constraint: S ⊆ A, e.g., S ⊆ Item
– Domain constraint:
• S θ v, where θ ∈ {=, ≠, <, ≤, >, ≥}, e.g., S.Price < 100
• v θ S, where θ is ∈ or ∉, e.g., snacks ∉ S.Type
• V θ S, or S θ V, where θ ∈ {⊆, ⊂, ⊄, =, ≠}, e.g., {snacks, sodas} ⊆ S.Type
– Aggregation constraint: agg(S) θ v, where agg ∈ {min, max, sum, count, avg} and θ ∈ {=, ≠, <, ≤, >, ≥},
e.g., count(S1.Type) = 1, avg(S2.Price) ≥ 100

Constrained Association Query Optimization Problem

• Given a CAQ = { (S1, S2) | C }, the algorithm should be :


– sound: It only finds frequent sets that satisfy the given constraints C
– complete: all frequent sets that satisfy the given constraints C are found
• A naïve solution:
– Apply Apriori to find all frequent sets, and then test them for constraint
satisfaction one by one.
• Our approach:
– Comprehensively analyze the properties of constraints and try to push them as deeply as
possible inside the frequent set computation.
Categories of Constraints.

1. Anti-monotone and Monotone Constraints


• A constraint Ca is anti-monotone iff for any pattern S not satisfying Ca, none of the super-
patterns of S can satisfy Ca
• A constraint Cm is monotone iff. for any pattern S satisfying Cm, every super-pattern of S also
satisfies it

2. Succinct Constraint
• A subset of items Is is a succinct set if it can be expressed as σp(I) for some selection
predicate p, where σ is the selection operator

• SP ⊆ 2^I is a succinct power set if there is a fixed number of succinct sets I1, …, Ik ⊆ I,
s.t. SP can be expressed in terms of the strict power sets of I1, …, Ik using union and minus
• A constraint Cs is succinct provided SATCs(I) is a succinct power set


3. Convertible Constraint
• Suppose all items in patterns are listed in a total order R

• A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies that
each suffix of S w.r.t. R also satisfies C
• A constraint C is convertible monotone iff a pattern S satisfying the constraint implies that
each pattern of which S is a suffix w.r.t. R also satisfies C.

Property of Constraints: Anti-Monotone

• Anti-monotonicity: If a set S violates the constraint, any superset of S violates the constraint.
• Examples:
– sum(S.Price) ≤ v is anti-monotone
– sum(S.Price) ≥ v is not anti-monotone
– sum(S.Price) = v is partly anti-monotone
• Application:
Push “sum(S.Price) ≤ 1000” deeply into iterative frequent set computation.
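A sketch of how an anti-monotone constraint such as sum(S.Price) ≤ v can be pushed inside candidate generation: any candidate that already violates the constraint is dropped before its support is ever counted (the item prices here are hypothetical):

```python
# Hypothetical price table for illustration.
price = {"A": 700, "B": 250, "C": 150, "D": 900}
MAX_SUM = 1000   # constraint: sum(S.Price) <= 1000 (anti-monotone)

def violates(itemset):
    """Anti-monotone check: if the price sum already exceeds MAX_SUM,
    every superset will too, so the itemset can be pruned immediately."""
    return sum(price[i] for i in itemset) > MAX_SUM

candidates = [{"A", "B"}, {"A", "D"}, {"B", "C"}]
surviving = [c for c in candidates if not violates(c)]
print(surviving)   # {A, B} (950) and {B, C} (400) survive; {A, D} (1600) is pruned
```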

Example of Convertible Constraints: avg(S) ≥ v

• Let R be the value descending order over the set of items


– E.g. I={9, 8, 6, 4, 3, 1}
• avg(S) ≥ v is convertible monotone w.r.t. R
– If S is a suffix of S1, then avg(S1) ≥ avg(S)
• {8, 4, 3} is a suffix of {9, 8, 4, 3}
• avg({9, 8, 4, 3}) = 6 ≥ avg({8, 4, 3}) = 5
– If S satisfies avg(S) ≥ v, so does S1
• {8, 4, 3} satisfies the constraint avg(S) ≥ 4, so does {9, 8, 4, 3}


Property of Constraints: Succinctness

• Succinctness:
– For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C
– Given A1, the set of size-1 itemsets satisfying C, any set S satisfying C is based on A1,
i.e., S contains a subset belonging to A1

• Example:
– sum(S.Price) ≥ v is not succinct
– min(S.Price) ≤ v is succinct

• Optimization:
– If C is succinct, then C is pre-counting prunable. The satisfaction of the constraint
alone is not affected by the iterative support counting.
Classification and Prediction

 Classification:

– predicts categorical class labels


– classifies data (constructs a model) based on the training set and the values (class
labels) in a classifying attribute and uses it in classifying new data
• Prediction

– models continuous-valued functions, i.e., predicts unknown or missing values

• Typical applications

– Credit approval
– Target marketing
– Medical diagnosis
– Fraud detection

Classification—A Two-Step Process

• Model construction: describing a set of predetermined classes


– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
– The set of tuples used for model construction: training set
– The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of a test sample is compared with the classified result from the model
• Accuracy rate is the percentage of test set samples that are correctly classified
by the model
• Test set is independent of training set, otherwise over-fitting will occur
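The two-step process (model construction on a training set, then accuracy estimation on an independent test set) can be illustrated with scikit-learn; this is a generic sketch, assuming scikit-learn is available, and the data here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic labeled data standing in for tuples with a class label attribute.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Keep the test set independent of the training set to avoid over-fitting estimates.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 1: model construction from the training set.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: model usage -- compare known test labels with the classified results.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```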

Process (1): Model Construction


Process (2): Using the Model in Prediction

Supervised vs. Unsupervised Learning


 Supervised learning (classification)
 Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of training data are unknown
 Given a set of measurements, observations, etc. with the aim of establishing
the existence of classes or clusters in the data

Classification using Frequent Patterns:


Frequent patterns show interesting relationships between attribute–value pairs that occur frequently in a
given data set.
For example, we may find that the attribute–value pairs age = youth and credit = OK occur in 20% of data
tuples describing All Electronics customers who buy a computer.
We can think of each attribute–value pair as an item, so the search for these frequent patterns is known as
frequent pattern mining or frequent itemset mining.
The general idea is that we can search for strong associations between frequent patterns (conjunctions of
attribute–value pairs) and class labels.


This section explores discriminative frequent pattern–based classification, where frequent patterns serve as combined
features, which are considered in addition to single features when building a classification model. Because
frequent patterns explore highly confident associations among multiple attributes, frequent pattern–based
classification may overcome some constraints introduced by decision tree induction, which considers only
one attribute at a time.
Associative Classification
Association rules are mined in a two-step process consisting of frequent itemset mining followed by rule
generation.
The first step searches for patterns of attribute–value pairs that occur repeatedly in a data set, where each
attribute–value pair is considered an item.
The resulting attribute–value pairs form frequent itemsets (also referred to as frequent patterns).
The second step analyzes the frequent item sets to generate association rules.
All association rules must satisfy certain criteria regarding their “accuracy” (or confidence) and the proportion
of the data set that they actually represent (referred to as support).
For example, the following is an association rule mined from a data set, D, shown with its confidence
and support:

age = youth ∧ credit = OK ⇒ buys computer = yes [support = 20%, confidence = 93%],

where ∧ represents a logical “AND.” We will say more about confidence and support later.
More formally, let D be a data set of tuples. Each tuple in D is described by n attributes, A1, A2,..., An, and
a class label attribute, Aclass. All continuous attributes are discretized and treated as categorical (or nominal)
attributes. An item, p, is an attribute– value pair of the form (Ai, v), where Ai is an attribute taking a value, v.
A data tuple X = (x1, x2,..., xn) satisfies an item, p = (Ai, v), if and only if xi = v, where xi is the value of
the ith attribute of X. Association rules can have any number of items in the rule antecedent (left side)
and any number of items in the rule consequent (right side). However, when mining association rules for
use in classification, we are only interested in association rules of the form p1 ∧ p2 ∧ … ∧ pl ⇒ Aclass = C,
where the rule antecedent is a conjunction of items, p1, p2, …, pl (l ≤ n), associated with a class label, C.
For a given rule, R, the percentage of tuples in D satisfying the rule antecedent that also have the class
label C is called the confidence of R.

From a classification point of view, this is akin to rule accuracy.


For example, a confidence of 93% for Rule (9.21) means that 93% of the customers in D who are young and
have an OK credit rating belong to the class buys computer = yes. The percentage of tuples in D satisfying
the rule antecedent and having class label C is called the support of R. A support of 20% for Rule (9.21)
means that 20% of the customers in D are young, have an OK credit rating, and belong to the class buys
computer = yes.
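To tie the definitions of rule support and confidence back to data, the sketch below counts them for the rule age = youth ∧ credit = OK ⇒ buys computer = yes over a handful of made-up tuples (the tuples and the resulting numbers are illustrative, not the book's data set D):

```python
# Each tuple is a dict of attribute-value pairs plus a class label.
D = [
    {"age": "youth",  "credit": "OK",   "buys_computer": "yes"},
    {"age": "youth",  "credit": "OK",   "buys_computer": "yes"},
    {"age": "youth",  "credit": "fair", "buys_computer": "no"},
    {"age": "senior", "credit": "OK",   "buys_computer": "no"},
    {"age": "youth",  "credit": "OK",   "buys_computer": "no"},
]

antecedent = {"age": "youth", "credit": "OK"}   # conjunction of items
consequent = ("buys_computer", "yes")           # class label

matches_lhs = [t for t in D if all(t[a] == v for a, v in antecedent.items())]
matches_rule = [t for t in matches_lhs if t[consequent[0]] == consequent[1]]

support = len(matches_rule) / len(D)               # fraction of D satisfying LHS and class
confidence = len(matches_rule) / len(matches_lhs)  # fraction of LHS-matching tuples with the class
print(f"support = {support:.0%}, confidence = {confidence:.0%}")   # 40%, 67% here
```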
In general, associative classification consists of the following steps:

1. Mine the data for frequent item sets, that is, find commonly occurring attribute–value pairs in the
data.
2. Analyze the frequent item sets to generate association rules per class, which satisfy confidence
and support criteria.
3. Organize the rules to form a rule-based classifier.
One of the earliest and simplest algorithms for associative classification is CBA (Classification Based
on Associations). CBA uses an iterative approach to frequent itemset mining.
In general, the number of passes made is equal to the length of the longest rule found. The complete set of
rules satisfying minimum confidence and minimum support thresholds are found and then analyzed for
inclusion in the classifier.
inclusion in the classifier.
CBA uses a heuristic method to construct the classifier, where the rules are ordered according to decreasing
precedence based on their confidence and support.
If a set of rules has the same antecedent, then the rule with the highest confidence is selected to represent the set.
When classifying a new tuple, the first rule satisfying the tuple is used to classify it.
The classifier also contains a default rule, having lowest precedence, which specifies a default class for any
new tuple that is not satisfied by any other rule in the classifier. In this way, the set of rules making up the
classifier form a decision list. In general, CBA was empirically found to be more accurate than C4.5 on a
good number of data sets.
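The CBA ordering and classification strategy (rules sorted by decreasing confidence, then support, with a default rule at the end) can be sketched as follows; the rule representation is a simplification made up for illustration:

```python
# Each rule: (antecedent as attribute-value pairs, class, confidence, support).
rules = [
    ({"age": "youth", "credit": "OK"}, "yes", 0.93, 0.20),
    ({"credit": "fair"},               "no",  0.80, 0.10),
    ({"age": "senior"},                "no",  0.70, 0.15),
]
DEFAULT_CLASS = "yes"   # default rule with lowest precedence

# Order rules by decreasing precedence: confidence first, then support.
decision_list = sorted(rules, key=lambda r: (r[2], r[3]), reverse=True)

def classify(tuple_x):
    """Use the first rule whose antecedent the tuple satisfies; otherwise the default."""
    for antecedent, label, _, _ in decision_list:
        if all(tuple_x.get(a) == v for a, v in antecedent.items()):
            return label
    return DEFAULT_CLASS

print(classify({"age": "youth", "credit": "OK"}))     # "yes" (first rule fires)
print(classify({"age": "middle", "credit": "poor"}))  # "yes" (default rule)
```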

CMAR (Classification based on Multiple Association Rules)


It differs from CBA in its strategy for frequent itemset mining and its construction of the classifier. It also
employs several rule pruning strategies with the help of a tree structure for efficient storage and retrieval of
rules. CMAR adopts a variant of the FP-growth algorithm to find the complete set of rules satisfying the
minimum confidence and minimum support thresholds. FP-growth was described in Section 6.2.4. FP-growth
uses a tree structure, called an FP-tree, to register all the frequent itemset information contained in
the given data set, D. This requires only two scans of D. The frequent itemsets are then mined from the
FP-tree. CMAR uses an enhanced FP-tree that maintains the distribution of class labels among tuples satisfying


each frequent itemset. In this way, it is able to combine rule generation together with frequent itemset mining
in a single step.
Rule pruning strategies are triggered whenever a rule is inserted into the tree. For example, given two rules,
R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2), then R2 is
pruned. The rationale is that highly specialized rules with low confidence can be pruned if a more generalized
version with higher confidence exists.
CMAR also prunes rules for which the rule antecedent and class are not positively correlated,
based on a χ² test of statistical significance.

Discriminative Frequent Pattern–Based Classification:


From work on associative classification, we see that frequent patterns reflect strong associations between
attribute–value pairs (or items) in data and are useful for classification. “But just how discriminative are
frequent patterns for classification?” Frequent patterns represent feature combinations. Let’s compare the
discriminative power of frequent patterns and single features.
The discrimination power of some frequent patterns is higher than that of single features. Frequent patterns
map data to a higher dimensional space.
They capture more underlying semantics of the data, and thus can hold greater expressive power than
single features.
“Why not consider frequent patterns as combined features, in addition to single features when building
a classification model?” This notion is the basis of frequent pattern– based classification—the learning
of a classification model in the feature space of single attributes as well as frequent patterns. In this
way, we transfer the original feature space to a larger space. This will likely increase the chance of
including important features.
The general framework for discriminative frequent pattern–based classification is as follows.

1. Feature generation: The data, D, are partitioned according to class label. Use frequent itemset
mining to discover frequent patterns in each partition, satisfying minimum support. The collection
of frequent patterns, F, makes up the feature candidates.

2. Feature selection: Apply feature selection to F, resulting in FS, the set of selected (more
discriminating) frequent patterns. Information gain, Fisher score, or other evaluation measures
can be used for this step. Relevancy checking can also be
incorporated into this step to weed out redundant patterns. The data set D is transformed to D′, where
the feature space now includes the single features as well as the selected frequent patterns, FS.


3. Learning of classification model: A classifier is built on the data set D′. Any learning algorithm can
be used as the classification model.
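A sketch of the three-step framework: generate frequent patterns per class partition as candidate features, keep the more discriminating ones, and train any classifier on the extended feature space. The per-step helpers here are deliberately simplified assumptions (plain support counting and a crude class-skew filter standing in for information gain or Fisher score), not the algorithms from the literature:

```python
from itertools import combinations
from collections import Counter

# Tiny transactional data: each tuple is (set of items, class label). Illustrative only.
data = [
    ({"a", "b", "c"}, "pos"), ({"a", "b"}, "pos"),      ({"a", "b", "d"}, "pos"),
    ({"c", "d"}, "neg"),      ({"b", "c", "d"}, "neg"), ({"c", "d", "e"}, "neg"),
]
MIN_SUPPORT = 2   # absolute count within a class partition

# Step 1: feature generation -- frequent patterns (here just 2-itemsets) per class partition.
candidates = set()
for label in {"pos", "neg"}:
    part = [items for items, l in data if l == label]
    counts = Counter(frozenset(p) for items in part for p in combinations(sorted(items), 2))
    candidates |= {p for p, c in counts.items() if c >= MIN_SUPPORT}

# Step 2: feature selection -- keep patterns whose presence is skewed toward one class
# (a crude stand-in for an information-gain or Fisher-score filter).
def skew(pattern):
    labels = [l for items, l in data if pattern <= items]
    return max(Counter(labels).values()) / len(labels) if labels else 0.0

selected = [p for p in candidates if skew(p) >= 0.8]

# Step 3: transform D to D' -- single features plus selected pattern features,
# ready for any learning algorithm.
all_items = sorted({i for items, _ in data for i in items})
def to_vector(items):
    return [int(i in items) for i in all_items] + [int(p <= items) for p in selected]

print("selected patterns:", [sorted(p) for p in selected])
print("first transformed tuple:", to_vector(data[0][0]))
```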
