DWDM Unit III Notes
Association Mining
• Association rule mining:
– Finding frequent patterns, associations, correlations, or causal structures among sets of items
or objects in transaction databases, relational databases, and other information repositories.
• Applications:
– Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering,
classification, etc.
• Examples:
– Rule form: “Body ⇒ Head [support, confidence]”
– buys(x, “diapers”) ⇒ buys(x, “beers”) [0.5%, 60%]
– major(x, “CS”) ∧ takes(x, “DB”) ⇒ grade(x, “A”) [1%, 75%]
• Find all the rules X & Y ⇒ Z with minimum support and minimum confidence
– support, s: probability that a transaction contains {X ∪ Y ∪ Z}
– confidence, c: conditional probability that a transaction containing {X ∪ Y} also contains Z
Example: for the transactions below, with min. support 50% and min. confidence 50%:

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

– A ⇒ C (support 50%, confidence 66.6%): {A, C} occurs in 2 of the 4 transactions, and 2 of the 3 transactions containing A also contain C
– C ⇒ A (support 50%, confidence 100%): both transactions containing C also contain A
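As a minimal Python sketch (the itemsets queried below are chosen only to reproduce the figures above), support and confidence can be computed by simple counting over the table:

# Minimal sketch: computing support and confidence from the example table above.
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset):
    # Fraction of transactions containing every item in `itemset`.
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(lhs, rhs):
    # support(lhs ∪ rhs) / support(lhs)
    return support(lhs | rhs) / support(lhs)

print(support({"A", "C"}))          # 0.5      -> 50%
print(confidence({"A"}, {"C"}))     # 0.666... -> 66.6% for A => C
print(confidence({"C"}, {"A"}))     # 1.0      -> 100% for C => A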
Page 3
23AD1901 - DATAWAREHOUSING AND DATA MINING V SEM (2025 -2026)
The proportion of transactions that contain an itemset is called the itemset's support. For an itemset to be interesting, its support must be higher than a user-specified minimum. Such itemsets are said to be frequent.
Rule support and confidence are two measures of rule interestingness. They respectively reflect the usefulness and certainty of discovered rules. A support of 2% for the association rule buys(X, “computer”) ⇒ buys(X, “financial management software”) means that 2% of all the transactions under analysis show that computer and financial management software are purchased together. A confidence of 60% means that 60% of the customers who purchased a computer also bought the software. Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.
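In symbols, consistent with the description above:

\[
\mathrm{support}(A \Rightarrow B) = P(A \cup B), \qquad
\mathrm{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)}
\]

Here A ∪ B denotes the itemset containing all items of A and B.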
FP-growth: a method that mines the complete set of frequent itemsets without candidate generation.
Construct an FP-tree from the transaction DB. Steps:
1. Scan the DB once to find the frequent 1-itemsets (single-item patterns).
2. Order frequent items in descending order of frequency (the f-list).
3. Scan the DB again and construct the FP-tree by inserting each transaction's frequent items in f-list order.
Header Table
Item   Frequency
f      4
c      4
a      3
b      3
m      3
p      3
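A minimal sketch of the construction just described; the FPNode class and function names are illustrative, and the pattern-growth mining phase is not shown:

from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Pass 1: count item occurrences and keep only the frequent items.
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_support}
    # f-list: frequent items ranked in descending frequency order.
    rank = {i: r for r, i in enumerate(sorted(frequent, key=lambda i: -counts[i]))}

    root = FPNode(None, None)
    header = defaultdict(list)              # header table: item -> node-links
    # Pass 2: insert each transaction's frequent items in f-list order, sharing prefixes.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in frequent), key=rank.get):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header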
• Completeness:
– never breaks a long pattern of any transaction
– preserves complete information for frequent pattern mining
• Compactness
– reduces irrelevant information: infrequent items are removed
– frequency-descending ordering: more frequent items are more likely to be shared
– the FP-tree is never larger than the original database (not counting node-links and item counts)
– Example: For Connect-4 DB, compression ratio could be over 100
[Figure: item concept hierarchy for multilevel association mining, e.g., Milk with brand Fraser and Bread with brand Sunset]
TID   Items
T1    {111, 121, 211, 221}
T2    {111, 211, 222, 323}
T3    {112, 122, 221, 411}
T4    {111, 121}
T5    {111, 122, 211, 221, 413}
Correlation in detail.
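One standard correlation measure for association rules is lift (stated here for reference; it is not derived in these notes):

\[
\mathrm{lift}(A, B) = \frac{P(A \cup B)}{P(A)\,P(B)} = \frac{\mathrm{confidence}(A \Rightarrow B)}{\mathrm{support}(B)}
\]

A lift greater than 1 indicates that A and B are positively correlated, a lift less than 1 indicates negative correlation, and a lift of exactly 1 indicates independence.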
Numeric correlation
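As an illustration of what numeric correlation measures, a minimal sketch of Pearson's correlation coefficient (the attribute values below are made up):

import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two numeric attributes.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical attribute values, e.g. age vs. spending:
print(pearson([23, 35, 47, 52, 61], [200, 320, 410, 500, 580]))  # close to +1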
– Interestingness constraints:
• strong rules (min_support ≥ 3%, min_confidence ≥ 60%).
• Database: (1) trans(TID, Itemset), (2) itemInfo(Item, Type, Price)
• A constrained association query (CAQ) is in the form {(S1, S2) | C},
– where C is a set of constraints on S1, S2, including a frequency constraint
• A classification of (single-variable) constraints:
– Class constraint: S ⊆ A, e.g. S ⊆ Item
– Domain constraint:
• S θ v, θ ∈ {=, ≠, <, ≤, >, ≥}, e.g. S.Price < 100
• v θ S, θ is ∈ or ∉, e.g. snacks ∉ S.Type
• V θ S, or S θ V, θ ∈ {⊆, ⊂, ⊄, =, ≠}
2. Succinct Constraint
• A subset of items Is ⊆ I is a succinct set if it can be expressed as σp(I) for some selection
predicate p, where σ is the selection operator
3. Convertible Constraint
• Suppose all items in patterns are listed in a total order R
• A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies that
each suffix of S w.r.t. R also satisfies C
• A constraint C is convertible monotone iff a pattern S satisfying the constraint implies that
each pattern of which S is a suffix w.r.t. R also satisfies C.
• Anti-monotonicity: If a set S violates the constraint, any superset of S violates the constraint.
• Examples:
– sum(S.Price) ≤ v is anti-monotone
– sum(S.Price) ≥ v is not anti-monotone
– sum(S.Price) = v is partly anti-monotone
• Application:
Push “sum(S.Price) ≤ 1000” deeply into iterative frequent set computation.
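A minimal sketch of pushing an anti-monotone constraint such as sum(S.Price) ≤ 1000 into candidate pruning; the prices table and candidate itemsets below are hypothetical:

prices = {"tv": 800, "camera": 300, "cable": 20, "battery": 15}  # hypothetical item prices

def satisfies_sum_constraint(itemset, limit=1000):
    return sum(prices[i] for i in itemset) <= limit

def prune_candidates(candidates, limit=1000):
    # Anti-monotone: if sum(S.Price) > limit, every superset of S also violates the
    # constraint, so S can be dropped before any further support counting.
    return [s for s in candidates if satisfies_sum_constraint(s, limit)]

print(prune_candidates([{"tv", "camera"}, {"tv", "cable"}, {"camera", "battery"}]))
# {"tv", "camera"} (sum 1100) is pruned; the other two candidates survive.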
• Succinctness:
– For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C
– Given A1, the set of size-1 itemsets satisfying C, any set S satisfying C is based on A1,
i.e., S contains a subset belonging to A1
• Example:
– sum(S.Price) ≥ v is not succinct
– min(S.Price) ≤ v is succinct
• Optimization:
– If C is succinct, then C is pre-counting prunable. The satisfaction of the constraint
alone is not affected by the iterative support counting.
Classification and Prediction
Classification:
• Typical applications
– Credit approval
– Target marketing
– Medical diagnosis
– Fraud detection
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class
label attribute
– The set of tuples used for model construction: training set
– The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of a test sample is compared with the classified result from the model
• Accuracy rate is the percentage of test set samples that are correctly classified by the model
• Test set is independent of training set, otherwise over-fitting will occur
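A minimal sketch of this accuracy estimate; the classifier and the labelled test samples are placeholders:

def accuracy(classifier, test_set):
    # Percentage of test samples whose predicted label matches the known label.
    correct = sum(1 for features, label in test_set if classifier(features) == label)
    return 100.0 * correct / len(test_set)

def toy_classifier(features):
    # Placeholder standing in for a model learned from an independent training set.
    return "yes" if features["income"] > 50 else "no"

test_set = [({"income": 80}, "yes"), ({"income": 30}, "no"), ({"income": 60}, "no")]
print(accuracy(toy_classifier, test_set))   # ~66.7: 2 of 3 test samples classified correctly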
Discriminative frequent pattern–based classification uses frequent patterns as combined
features, which are considered in addition to single features when building a classification model. Because
frequent patterns explore highly confident associations among multiple attributes, frequent pattern–based
classification may overcome some constraints introduced by decision tree induction, which considers only
one attribute at a time.
Associative Classification
Association rules are mined in a two-step process consisting of frequent itemset mining followed by rule
generation.
The first step searches for patterns of attribute–value pairs that occur repeatedly in a data set, where each
attribute–value pair is considered an item.
The resulting attribute–value pairs form frequent item sets (also referred to as frequent patterns).
The second step analyzes the frequent item sets to generate association rules.
All association rules must satisfy certain criteria regarding their “accuracy” (or confidence) and the proportion
of the data set that they actually represent (referred to as support).
For example, the following is an association rule mined from a data set, D, shown with its confidence
and support:

age = youth ∧ credit_rating = OK ⇒ buys_computer = yes [support = 20%, confidence = 93%]

A confidence of 93% means that 93% of the customers in D who are young and
have an OK credit rating belong to the class buys computer = yes. The percentage of tuples in D satisfying
the rule antecedent and having class label C is called the support of R. A support of 20% for this rule
means that 20% of the customers in D are young, have an OK credit rating, and belong to the class buys
computer = yes.
In general, associative classification consists of the following steps:
1. Mine the data for frequent item sets, that is, find commonly occurring attribute–value pairs in the
data.
2. Analyze the frequent item sets to generate association rules per class, which satisfy confidence
and support criteria.
3. Organize the rules to form a rule-based classifier.
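A minimal sketch of steps 1 and 2 under simplifying assumptions: the tiny data table, the thresholds, and the function names are hypothetical, and candidate itemsets are enumerated by brute force rather than with Apriori- or FP-growth-style mining:

from itertools import combinations
from collections import Counter

# Hypothetical training tuples: (set of attribute-value pairs, class label).
data = [
    ({"age=youth", "credit=OK"}, "yes"),
    ({"age=youth", "credit=OK"}, "yes"),
    ({"age=youth", "credit=fair"}, "no"),
    ({"age=senior", "credit=OK"}, "yes"),
    ({"age=senior", "credit=fair"}, "no"),
]
MIN_SUP, MIN_CONF = 0.2, 0.7

def mine_class_association_rules(data, min_sup, min_conf):
    n = len(data)
    rules = []
    # Step 1: enumerate candidate itemsets of attribute-value pairs (brute force here).
    items = {i for tup, _ in data for i in tup}
    for size in (1, 2):
        for itemset in combinations(sorted(items), size):
            s = set(itemset)
            covered = [(tup, c) for tup, c in data if s <= tup]
            if len(covered) / n < min_sup:
                continue
            # Step 2: one candidate rule per class; keep it if support and confidence hold.
            for cls, cnt in Counter(c for _, c in covered).items():
                sup, conf = cnt / n, cnt / len(covered)
                if sup >= min_sup and conf >= min_conf:
                    rules.append((s, cls, sup, conf))
    return rules

for antecedent, cls, sup, conf in mine_class_association_rules(data, MIN_SUP, MIN_CONF):
    print(antecedent, "=>", cls, f"[sup={sup:.0%}, conf={conf:.0%}]")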
One of the earliest and simplest algorithms for associative classification is CBA (Classification Based
on Associations). CBA uses an iterative approach to frequent itemset mining.
In general, the number of passes made is equal to the length of the longest rule found. The complete set of
rules satisfying minimum confidence and minimum support thresholds is found and then analyzed for
inclusion in the classifier.
CBA uses a heuristic method to construct the classifier, where the rules are ordered according to decreasing
precedence based on their confidence and support.
If a set of rules has the same antecedent, then the rule with the highest confidence is selected to represent the set.
When classifying a new tuple, the first rule satisfying the tuple is used to classify it.
The classifier also contains a default rule, having lowest precedence, which specifies a default class for any
new tuple that is not satisfied by any other rule in the classifier. In this way, the set of rules making up the
classifier form a decision list. In general, CBA was empirically found to be more accurate than C4.5 on a
good number of data sets.
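A minimal sketch of how a CBA-style classifier could order rules and apply the first matching rule; the rule entries are hypothetical, and the further tie-breaking CBA applies beyond confidence and support (e.g., rule generation order) is omitted:

def build_cba_classifier(rules, default_class):
    # Order rules by decreasing confidence, then decreasing support (CBA precedence).
    ordered = sorted(rules, key=lambda r: (-r["conf"], -r["sup"]))

    def classify(tuple_items):
        for rule in ordered:
            if rule["antecedent"] <= tuple_items:   # first rule satisfied by the tuple wins
                return rule["class"]
        return default_class                        # default rule: lowest precedence
    return classify

# Hypothetical class association rules (antecedent, class, support, confidence):
rules = [
    {"antecedent": {"age=youth", "credit=OK"}, "class": "yes", "sup": 0.20, "conf": 0.93},
    {"antecedent": {"credit=fair"},            "class": "no",  "sup": 0.25, "conf": 0.80},
]
classify = build_cba_classifier(rules, default_class="no")
print(classify({"age=youth", "credit=OK"}))     # "yes" (highest-precedence matching rule)
print(classify({"age=senior", "credit=poor"}))  # no rule matches -> default class "no"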
CMAR (Classification based on Multiple Association Rules) uses an enhanced FP-tree that maintains the
distribution of class labels among tuples satisfying each frequent itemset. In this way, it is able to combine
rule generation together with frequent itemset mining in a single step.
Rule pruning Strategies are triggered whenever a rule is inserted into the tree. For example, given two rules,
R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2), then R2 is
pruned. The rationale is that highly specialized rules with low confidence can be pruned if a more generalized
version with higher confidence exists.
CMAR also prunes rules for which the rule antecedent and class are not positively correlated,
based on a χ² test of statistical significance.
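A minimal sketch of the first pruning strategy (a more general rule with at least equal confidence prunes the specialized rule); the χ² correlation check is only noted in a comment, and the rule entries are hypothetical:

def prune_specialized_rules(rules):
    # Drop R2 if some R1 has a more general antecedent (a subset) and conf(R1) >= conf(R2).
    kept = []
    for r2 in rules:
        dominated = any(
            r1 is not r2
            and r1["antecedent"] < r2["antecedent"]     # R1 strictly more general than R2
            and r1["conf"] >= r2["conf"]
            for r1 in rules
        )
        if not dominated:
            kept.append(r2)
        # CMAR additionally drops rules whose antecedent and class are not positively
        # correlated according to a chi-square significance test (not shown here).
    return kept

rules = [
    {"antecedent": {"credit=OK"},              "class": "yes", "conf": 0.90},
    {"antecedent": {"credit=OK", "age=youth"}, "class": "yes", "conf": 0.85},  # gets pruned
]
print(prune_specialized_rules(rules))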
In general, frequent pattern–based classification builds the classifier in the following three steps:
1. Feature generation: The data, D, are partitioned according to class label. Use frequent itemset
mining to discover frequent patterns in each partition, satisfying minimum support. The collection
of frequent patterns, F, makes up the feature candidates.
2. Feature selection: Apply feature selection to F, resulting in FS, the set of selected (more
discriminating) frequent patterns. Information gain, Fisher score, or other evaluation measures
can be used for this step. Relevancy checking can also be incorporated into this step to weed out
redundant patterns. The data set D is transformed to D′, where the feature space now includes
the single features as well as the selected frequent patterns, FS.
3. Learning of classification model: A classifier is built on the data set D′. Any learning algorithm can
be used as the classification model.
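A minimal sketch of the three-step framework under simplifying assumptions: frequent patterns are enumerated by brute force over the whole data set rather than per class partition, selection uses information gain over binary pattern features, and the final learner is only a placeholder (any classifier could be trained on the transformed data D′):

from itertools import combinations
from collections import Counter
from math import log2

# Hypothetical transactions with class labels.
data = [({"a", "b", "c"}, "pos"), ({"a", "b"}, "pos"), ({"a", "d"}, "neg"), ({"b", "d"}, "neg")]
MIN_SUP = 2  # absolute minimum support over the whole data set (simplification)

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Step 1: feature generation -- frequent patterns as candidate features F.
items = sorted({i for t, _ in data for i in t})
candidates = [set(c) for size in (1, 2) for c in combinations(items, size)]
F = [p for p in candidates if sum(p <= t for t, _ in data) >= MIN_SUP]

# Step 2: feature selection -- rank patterns by information gain, keep the top k (FS).
def info_gain(pattern):
    labels = [c for _, c in data]
    with_p = [c for t, c in data if pattern <= t]
    without_p = [c for t, c in data if not pattern <= t]
    split = sum(len(part) / len(data) * entropy(part) for part in (with_p, without_p) if part)
    return entropy(labels) - split

FS = sorted(F, key=info_gain, reverse=True)[:3]

# Transform D to D': each tuple becomes a binary vector over the selected patterns.
D_prime = [([int(p <= t) for p in FS], c) for t, c in data]

# Step 3: learning -- any classifier can be trained on D'; as a placeholder, predict the
# majority class among training tuples that agree on the top-ranked pattern feature.
def toy_classifier(features):
    matches = [c for (f, c) in D_prime if f[0] == features[0]]
    return Counter(matches).most_common(1)[0][0]

print("selected patterns:", FS)
print("prediction for", D_prime[0][0], "->", toy_classifier(D_prime[0][0]))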