BTech CE-2 6th Semester
Dr. Neelam Duhan
Associate Professor, CE Department
J. C Bose University of Science & Technology, YMCA,
Faridabad
Classification vs. Regression: Some Example Scenarios
Given input/training data: weather-related data such as humidity, temperature, etc., for past days.
Target: to predict the temperature on future days (a regression task, since the output is a continuous value).
Classification vs. Regression: Some Example Scenarios
Given input data: academic details/profiles of some existing students.
Target: to predict the result of a new student (a classification task when the result is a categorical label, e.g. pass/fail).
Classification vs. Regression
Classification vs. Regression
◼ Classification
◼ predicts categorical class labels (discrete or nominal)
◼ classifies data (constructs a model) based on the training set and
the values (class labels) in a classifying attribute and uses it in
classifying new data
◼ Prediction/Regression
◼ models continuous-valued functions, i.e., predicts unknown or
missing values
◼ Typical applications
◼ Credit approval
◼ Target marketing
◼ Medical diagnosis
◼ Fraud detection
Classification vs. Regression (figure)
Classification—A Two-Step Process
◼ Model construction: describing a set of predetermined classes
◼ Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
◼ The set of tuples used for model construction is the training set
◼ The model is represented as classification rules, decision trees,
or mathematical formulae
◼ Model usage: for classifying future or unknown objects
◼ Estimate accuracy of the model
◼ The known label of each test sample is compared with the class
predicted by the model
◼ Accuracy rate is the percentage of test set samples that are
correctly classified by the model
◼ The test set is independent of the training set; otherwise the
accuracy estimate will be overly optimistic (over-fitting)
◼ If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
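To make the accuracy estimate concrete, here is a minimal Python sketch (the function name and the tiny rule-based classifier are illustrative, not part of the slides; the test tuples are the tenured data used on the next slides):

```python
# Minimal sketch: estimating classifier accuracy on an independent test set.

def accuracy(predict, test_tuples, true_labels):
    """Fraction of test tuples whose predicted label matches the known label."""
    correct = sum(1 for x, y in zip(test_tuples, true_labels) if predict(x) == y)
    return correct / len(true_labels)

# A hand-written rule-based classifier: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict(t):
    return 'yes' if t['rank'] == 'professor' or t['years'] > 6 else 'no'

test = [{'rank': 'assistant prof', 'years': 2}, {'rank': 'associate prof', 'years': 7},
        {'rank': 'professor', 'years': 5}, {'rank': 'assistant prof', 'years': 7}]
labels = ['no', 'no', 'yes', 'yes']

print(accuracy(predict, test, labels))  # 0.75 -- Merlisa (associate prof, 7 years) is misclassified
```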
Process (1): Model Construction
The training data, fed to a classification algorithm, produces the classifier (model).

Training data:
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (model) learned from the training data:
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Process (2): Using the Model in Prediction
Classifier (model/rules): IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Testing data (known labels, used to estimate accuracy):
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4) → Tenured? The model predicts ‘yes’, since rank = ‘professor’.
Example of Model
Fitting
◼ Overfitting is the case where the overall cost (training error) is very
small, but the generalization of the model is unreliable. This happens
when the model learns “too much” from the training data set.
◼ Underfitting is the case where the model has “not learned enough”
from the training data, resulting in low generalization and unreliable
predictions.
◼ A model that generalizes well is neither underfit nor overfit; it is a
robust, or good-fit, model.
Another example of fitting
Supervised Vs. Unsupervised Learning
◼ Supervised learning (classification)
◼ Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
◼ New data is classified based on the training set
◼ Unsupervised learning (clustering)
◼ The class labels of the training data are unknown
◼ Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
Issues: Data Preparation
◼ Data cleaning
◼ Preprocess data in order to reduce noise and handle
missing values
◼ Relevance analysis (feature selection)
◼ Remove the irrelevant or redundant attributes
◼ Data transformation
◼ Generalize and/or normalize data
Issues: Evaluating Classification Methods
◼ Accuracy
◼ classifier accuracy: predicting class label
◼ predictor accuracy: how accurately the value of the predicted
attribute is estimated
◼ Speed
◼ time to construct the model (training time)
◼ time to use the model (classification/prediction time)
◼ Robustness: handling noise and missing values
◼ Scalability: efficiency in disk-resident databases
◼ Interpretability
◼ understanding and insight provided by the model
◼ Other measures, e.g., goodness of rules, such as decision
tree size or compactness of classification rules
Classification by
Decision Tree Induction
Decision Tree
◼ A decision tree is a flowchart-like structure in which each
internal node represents a “test” on an attribute (e.g.
whether a coin flip comes up heads or tails), each branch
represents the outcome of the test, and each leaf node
represents a class label (decision taken after computing all
attributes).
◼ The paths from root to leaf represent classification rules.
◼ Well-known decision tree induction algorithms include ID3, C4.5 and
CART (Classification and Regression Trees).
Common terms used with Decision trees
◼ Root Node: represents the entire population or sample; it is further
divided into two or more homogeneous sets.
◼ Splitting: the process of dividing a node into two or more sub-nodes.
◼ Internal/Decision Node: a sub-node that splits into further sub-nodes.
◼ Leaf/Terminal Node: a node that does not split further.
◼ Pruning: removing the sub-nodes of a decision node; the opposite of
splitting.
◼ Branch/Sub-Tree: a subsection of the entire tree.
◼ Parent and Child Node: a node that is divided into sub-nodes is called
the parent of those sub-nodes; the sub-nodes are its children.
Decision Tree Induction: Training Dataset
▪ Independent attributes/input attributes: age, income, student, credit_rating
▪ Dependent attribute/ class label attribute/ output attribute: buys_computer
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Output: A Decision Tree for “buys_computer”
age?
  <=30   → student?
             no  → No
             yes → Yes
  31..40 → Yes
  >40    → credit_rating?
             excellent → No
             fair      → Yes
Some other examples: Rain prediction
Some more examples: Fitness prediction
Algorithm for Decision Tree Induction
◼ Basic algorithm (a greedy algorithm)
◼ Tree is constructed in a top-down recursive divide-and-conquer
manner
◼ At start, all the training examples are at the root
◼ Attributes are categorical (if continuous-valued, they are
discretized in advance)
◼ Data is partitioned recursively based on selected attributes
◼ Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
◼ Conditions for stopping partitioning
◼ All samples for a given node belong to the same class
◼ There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
◼ There are no samples left
Attribute Selection Measure:
Information Gain (ID3/C4.5)
◼ Let D be the data set and m the number of classes of the class label attribute
◼ Let p_i be the probability that an arbitrary tuple in D belongs to
class C_i, i.e. p_i = |C_{i,D}| / |D|
◼ Expected information (entropy) needed to classify a tuple in D:
      Info(D) = I(D) = − Σ_{i=1..m} p_i · log₂(p_i)
◼ Information needed (after using attribute A to split D into v
partitions) to classify D by A:
      Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) · I(D_j)
◼ Information gained by branching on attribute A:
      Gain(A) = Info(D) − Info_A(D)
◼ Select the attribute with the highest information gain
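A minimal Python sketch of these three measures (the function names are my own; the data set is assumed to be a list of dicts with a designated class-label attribute):

```python
from math import log2
from collections import Counter

def entropy(tuples, class_attr):
    """Info(D): expected information needed to classify a tuple in D."""
    counts = Counter(t[class_attr] for t in tuples)
    total = len(tuples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_after_split(tuples, attr, class_attr):
    """Info_A(D): weighted entropy of the partitions induced by attribute A."""
    total = len(tuples)
    partitions = Counter(t[attr] for t in tuples)
    return sum((n / total) * entropy([t for t in tuples if t[attr] == v], class_attr)
               for v, n in partitions.items())

def info_gain(tuples, attr, class_attr):
    """Gain(A) = Info(D) - Info_A(D)."""
    return entropy(tuples, class_attr) - info_after_split(tuples, attr, class_attr)
```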
Entropy and Information Gain
◼ Entropy is the measure of impurity, disorder or uncertainty in a bunch
of examples.
◼ What an Entropy basically does?
Entropy controls how a Decision Tree decides to split the data. It
actually effects how a Decision Tree draws its boundaries.
◼ Information gain (IG) measures how much “information” a feature
gives us about the class.
◼ Why it matter ?
◼ Information gain is the main key that is used by Decision Tree
Algorithms to construct a Decision Tree.
◼ Decision Trees algorithm will always tries to maximize Information
gain.
◼ An attribute with highest Information gain will tested/split first.
Attribute Selection: Information Gain
(Training data D: the 14-tuple buys_computer dataset shown earlier, with 9 tuples of class “yes” and 5 of class “no”.)
Class 1: buys_computer = “yes”    Class 2: buys_computer = “no”

Info(D) = I(9,5) = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) = 0.940

Info needed for each value of age:
age      Y   N   I(Y,N)
<=30     2   3   0.971
31…40    4   0   0
>40      3   2   0.971

Info_age(D) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694
Here (5/14)·I(2,3) means that “age <= 30” covers 5 of the 14 samples, with 2 YES and 3 NO.

Gain(age) = Info(D) − Info_age(D) = 0.246

Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048.
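Continuing the sketch above, the same figures can be reproduced on the 14-tuple buys_computer data (encoded here as a list of dicts, with credit_rating shortened to credit; the printed values match the slide up to rounding):

```python
D = [
    {'age': '<=30',   'income': 'high',   'student': 'no',  'credit': 'fair',      'buys': 'no'},
    {'age': '<=30',   'income': 'high',   'student': 'no',  'credit': 'excellent', 'buys': 'no'},
    {'age': '31..40', 'income': 'high',   'student': 'no',  'credit': 'fair',      'buys': 'yes'},
    {'age': '>40',    'income': 'medium', 'student': 'no',  'credit': 'fair',      'buys': 'yes'},
    {'age': '>40',    'income': 'low',    'student': 'yes', 'credit': 'fair',      'buys': 'yes'},
    {'age': '>40',    'income': 'low',    'student': 'yes', 'credit': 'excellent', 'buys': 'no'},
    {'age': '31..40', 'income': 'low',    'student': 'yes', 'credit': 'excellent', 'buys': 'yes'},
    {'age': '<=30',   'income': 'medium', 'student': 'no',  'credit': 'fair',      'buys': 'no'},
    {'age': '<=30',   'income': 'low',    'student': 'yes', 'credit': 'fair',      'buys': 'yes'},
    {'age': '>40',    'income': 'medium', 'student': 'yes', 'credit': 'fair',      'buys': 'yes'},
    {'age': '<=30',   'income': 'medium', 'student': 'yes', 'credit': 'excellent', 'buys': 'yes'},
    {'age': '31..40', 'income': 'medium', 'student': 'no',  'credit': 'excellent', 'buys': 'yes'},
    {'age': '31..40', 'income': 'high',   'student': 'yes', 'credit': 'fair',      'buys': 'yes'},
    {'age': '>40',    'income': 'medium', 'student': 'no',  'credit': 'excellent', 'buys': 'no'},
]

print(round(entropy(D, 'buys'), 3))              # 0.94  (the slide's 0.940)
for a in ('age', 'income', 'student', 'credit'):
    print(a, round(info_gain(D, a, 'buys'), 3))
# age 0.247, income 0.029, student 0.152, credit 0.048
# (the slide's 0.246 and 0.151 come from rounding intermediate values)
```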
How data is partitioned
◼ The highest information gain is for attribute age, so it becomes the root,
with one branch per value (<=30, 31..40, >40), and the training data are
partitioned accordingly.

Branch age = 31..40 (all tuples are YES, so create a leaf node ‘YES’):
income   student   credit_rating   buys_computer
high     no        fair            yes
low      yes       excellent       yes
medium   no        excellent       yes
high     yes       fair            yes

Branch age <= 30 (a mix of YES and NO, so repeat the process to find the
information gain of income, student and credit_rating on this subset):
income   student   credit_rating   buys_computer
high     no        fair            no
high     no        excellent       no
medium   no        fair            no
low      yes       fair            yes
medium   yes       excellent       yes

Branch age > 40 (also a mix of YES and NO, so repeat the process on this subset):
income   student   credit_rating   buys_computer
medium   no        fair            yes
low      yes       fair            yes
low      yes       excellent       no
medium   yes       fair            yes
medium   no        excellent       no
How data is partitioned…
◼ In each branch, select the attribute with the highest information gain
(a compact code sketch of this recursion is given after this slide).

Branch age <= 30: among income, student and credit_rating, the highest
information gain is for student, so student becomes the decision node of
this branch.
  student = no (all tuples are NO, so create a leaf node ‘NO’):
  income   credit_rating   buys_computer
  high     fair            no
  high     excellent       no
  medium   fair            no
  student = yes (all tuples are YES, so create a leaf node ‘YES’):
  income   credit_rating   buys_computer
  low      fair            yes
  medium   excellent       yes

Branch age > 40: still a mix of YES and NO, so repeat the process to find
the information gain of income, student and credit_rating on this subset:
income   student   credit_rating   buys_computer
medium   no        fair            yes
low      yes       fair            yes
low      yes       excellent       no
medium   yes       fair            yes
medium   no        excellent       no
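The recursive partitioning shown on the last two slides can be written compactly as a toy ID3-style builder in Python. This is a simplified sketch, not the full algorithm of the earlier slide: it assumes categorical attributes, repeats the entropy/gain helpers so the snippet stands alone, and uses the dataset D from the earlier sketch for the demonstration call.

```python
from math import log2
from collections import Counter

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def gain(rows, attr, target):
    split = Counter(r[attr] for r in rows)
    rem = sum(n / len(rows) * entropy([r for r in rows if r[attr] == v], target)
              for v, n in split.items())
    return entropy(rows, target) - rem

def build_tree(rows, attrs, target):
    """Toy ID3-style induction: stop on pure nodes or exhausted attributes."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                  # all samples in one class -> leaf
        return labels[0]
    if not attrs:                              # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a, target))   # highest information gain
    tree = {best: {}}
    for value in set(r[best] for r in rows):   # one branch per attribute value seen
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = build_tree(subset, [a for a in attrs if a != best], target)
    return tree

tree = build_tree(D, ['age', 'income', 'student', 'credit'], 'buys')
print(tree)
# -> {'age': {'31..40': 'yes',
#             '<=30': {'student': {'no': 'no', 'yes': 'yes'}},
#             '>40':  {'credit': {'excellent': 'no', 'fair': 'yes'}}}}   (up to dict ordering)
```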
Final Decision Tree
age?
  <=30   → student?
             no  → No
             yes → Yes
  31..40 → Yes
  >40    → credit_rating?
             excellent → No
             fair      → Yes
Another solved Example
Overfitting and Tree Pruning
◼ Overfitting: An induced tree may overfit the training data
◼ Too many branches, some may reflect anomalies due to noise or
outliers
◼ Poor accuracy for unseen samples
◼ Two approaches to avoid overfitting
◼ Prepruning: Halt tree construction early—do not split a node if this
would result in the goodness measure falling below a threshold
◼ Difficult to choose an appropriate threshold
◼ Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
◼ Use a set of data different from the training data to decide
which is the “best pruned tree”
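As one concrete illustration of postpruning (a library technique, not necessarily the exact method intended above), scikit-learn can grow a full tree and then apply cost-complexity pruning, with a held-out split used to pick the "best pruned tree". A sketch on a standard dataset (iris is used only so the snippet runs end to end):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)          # any labelled dataset would do
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate pruning strengths (ccp_alpha values) derived from the fully grown tree
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
path = clf.cost_complexity_pruning_path(X_train, y_train)

# Fit one progressively pruned tree per alpha, keep the best one on the held-out split
pruned_trees = [DecisionTreeClassifier(criterion="entropy", ccp_alpha=a, random_state=0)
                .fit(X_train, y_train) for a in path.ccp_alphas]
best = max(pruned_trees, key=lambda t: t.score(X_val, y_val))
print(best.get_n_leaves(), round(best.score(X_val, y_val), 3))
```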
Rule-based Classification
Using IF-THEN Rules for Classification
◼ Represent the knowledge in the form of IF-THEN rules
R: IF age <=30 AND student = yes THEN buys_computer = yes
◼ Rule antecedent/precondition vs. rule consequent
◼ Assessment of a rule R is done by: coverage and accuracy
◼ n_covers = # of tuples covered by R (i.e. tuples matching the LHS of the rule)
◼ n_correct = # of tuples correctly classified by R (i.e. matching both LHS and RHS)
coverage(R) = n_covers / |D|      (D: the training data set)
accuracy(R) = n_correct / n_covers
◼ If more than one rule is triggered, need conflict resolution
◼ Size ordering: assign the highest priority to the triggering rule that has
the “toughest” requirements (i.e., the most attribute tests)
◼ Class-based ordering: decreasing order of prevalence or misclassification
cost per class
◼ Rule-based ordering (decision list): rules are organized into one long
priority list, according to some measure of rule quality or by experts
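A small sketch of coverage and accuracy for the example rule R above, assuming the buys_computer data D encoded as the list of dicts used in the earlier sketches (attribute names shortened there):

```python
def covers(antecedent, t):
    """True if tuple t matches the rule antecedent (LHS)."""
    return all(t[attr] == val for attr, val in antecedent.items())

def coverage_and_accuracy(antecedent, consequent, data, class_attr):
    covered = [t for t in data if covers(antecedent, t)]
    correct = [t for t in covered if t[class_attr] == consequent]
    return len(covered) / len(data), len(correct) / len(covered)

# R: IF age <= 30 AND student = yes THEN buys_computer = yes
cov, acc = coverage_and_accuracy({'age': '<=30', 'student': 'yes'}, 'yes', D, 'buys')
print(cov, acc)   # coverage 2/14 ≈ 0.143, accuracy 2/2 = 1.0 on this training set
```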
Rule Extraction from a Decision Tree
◼ Rules are easier to understand than large
trees
◼ One rule is created for each path from the
root to a leaf
◼ Each attribute-value pair along a path forms
a conjunction: the leaf holds the class
prediction
◼ Rules are mutually exclusive and exhaustive
◼ Example: Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
Thank you