BTech CE-2 6th Semester
Dr. Neelam Duhan
Associate Professor, CE Department
J. C Bose University of Science & Technology, YMCA,
Faridabad
Classification vs. Regression: Some Example Scenarios
Given input/training data: weather-related data such as humidity, temperature, etc., for past days.
Target: to predict the temperature on future days (a regression task, since the output is a continuous value).
Classification vs. Regression: Some Example Scenarios
Given input data: academic details/profiles of some existing students.
Target: to predict the result of a new student (a classification task when the result is a categorical label, e.g. pass/fail).
Classification vs. Regression
Classification vs. Regression
◼ Classification
◼ predicts categorical class labels (discrete or nominal)
◼ classifies data (constructs a model) based on the training set and
the values (class labels) in a classifying attribute and uses it in
classifying new data
◼ Prediction/Regression
◼ models continuous-valued functions, i.e., predicts unknown or
missing values
◼ Typical applications
◼ Credit approval
◼ Target marketing
◼ Medical diagnosis
◼ Fraud detection
Classification vs. Regression (figure)
Classification—A Two-Step Process
◼ Model construction: describing a set of predetermined classes
◼ Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
◼ The set of tuples used for model construction is the training set
◼ The model is represented as classification rules, decision trees,
or mathematical formulae
◼ Model usage: for classifying future or unknown objects
◼ Estimate accuracy of the model
◼ The known label of each test sample is compared with the class
predicted by the model
◼ Accuracy rate is the percentage of test set samples that are
correctly classified by the model
◼ The test set is independent of the training set; otherwise the
accuracy estimate will be overly optimistic (over-fitting)
◼ If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
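To make the accuracy estimate concrete, here is a minimal Python sketch (the function name and the tiny rule-based classifier are illustrative, not part of the slides; the test tuples are the tenured data used on the next slides):

```python
# Minimal sketch: estimating classifier accuracy on an independent test set.

def accuracy(predict, test_tuples, true_labels):
    """Fraction of test tuples whose predicted label matches the known label."""
    correct = sum(1 for x, y in zip(test_tuples, true_labels) if predict(x) == y)
    return correct / len(true_labels)

# A hand-written rule-based classifier: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict(t):
    return 'yes' if t['rank'] == 'professor' or t['years'] > 6 else 'no'

test = [{'rank': 'assistant prof', 'years': 2}, {'rank': 'associate prof', 'years': 7},
        {'rank': 'professor', 'years': 5}, {'rank': 'assistant prof', 'years': 7}]
labels = ['no', 'no', 'yes', 'yes']

print(accuracy(predict, test, labels))  # 0.75 -- Merlisa (associate prof, 7 years) is misclassified
```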
Process (1): Model Construction
The training data, fed to a classification algorithm, produces the classifier (model).

Training data:
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (model) learned from the training data:
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Process (2): Using the Model in Prediction
Classifier (model/rules): IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Testing data (known labels, used to estimate accuracy):
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4) → Tenured? The model predicts ‘yes’, since rank = ‘professor’.
Example of Model
Fitting
◼ Overfitting is the case where the overall cost (training error) is very
small, but the generalization of the model is unreliable. This happens
when the model learns “too much” from the training data set.
◼ Underfitting is the case where the model has “not learned enough”
from the training data, resulting in low generalization and unreliable
predictions.
◼ A model that generalizes well is neither underfit nor overfit; it is a
robust, or good-fit, model.
Another example of fitting
Supervised Vs. Unsupervised Learning
◼ Supervised learning (classification)
◼ Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
◼ New data is classified based on the training set
◼ Unsupervised learning (clustering)
◼ The class labels of the training data are unknown
◼ Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
Issues: Data Preparation
◼ Data cleaning
◼ Preprocess data in order to reduce noise and handle
missing values
◼ Relevance analysis (feature selection)
◼ Remove the irrelevant or redundant attributes
◼ Data transformation
◼ Generalize and/or normalize data
Issues: Evaluating Classification Methods
◼ Accuracy
◼ classifier accuracy: predicting class label
◼ predictor accuracy: how accurately the value of the predicted
attribute is estimated
◼ Speed
◼ time to construct the model (training time)
◼ time to use the model (classification/prediction time)
◼ Robustness: handling noise and missing values
◼ Scalability: efficiency in disk-resident databases
◼ Interpretability
◼ understanding and insight provided by the model
◼ Other measures, e.g., goodness of rules, such as decision
tree size or compactness of classification rules
Classification by
Decision Tree Induction
Decision Tree
◼ A decision tree is a flowchart-like structure in which each
internal node represents a “test” on an attribute (e.g.
whether a coin flip comes up heads or tails), each branch
represents the outcome of the test, and each leaf node
represents a class label (decision taken after computing all
attributes).
◼ The paths from root to leaf represent classification rules.
◼ Well-known decision tree induction algorithms include ID3, C4.5 and
CART (Classification and Regression Trees).
Common terms used with Decision trees
◼ Root Node: represents the entire population or sample; it is further
divided into two or more homogeneous sets.
◼ Splitting: the process of dividing a node into two or more sub-nodes.
◼ Internal/Decision Node: a sub-node that splits into further sub-nodes.
◼ Leaf/Terminal Node: a node that does not split further.
◼ Pruning: removing the sub-nodes of a decision node; the opposite of
splitting.
◼ Branch/Sub-Tree: a subsection of the entire tree.
◼ Parent and Child Node: a node that is divided into sub-nodes is called
the parent of those sub-nodes; the sub-nodes are its children.
Decision Tree Induction: Training Dataset
▪ Independent attributes/input attributes: age, income, student, credit_rating
▪ Dependent attribute/ class label attribute/ output attribute: buys_computer
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Output: A Decision Tree for “buys_computer”
age?
  <=30   → student?
             no  → No
             yes → Yes
  31..40 → Yes
  >40    → credit_rating?
             excellent → No
             fair      → Yes
Some other examples: Rain prediction
Some more examples: Fitness prediction
Algorithm for Decision Tree Induction
◼ Basic algorithm (a greedy algorithm)
◼ Tree is constructed in a top-down recursive divide-and-conquer
manner
◼ At start, all the training examples are at the root
◼ Attributes are categorical (if continuous-valued, they are
discretized in advance)
◼ Data is partitioned recursively based on selected attributes
◼ Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
◼ Conditions for stopping partitioning
◼ All samples for a given node belong to the same class
◼ There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
◼ There are no samples left
Attribute Selection Measure:
Information Gain (ID3/C4.5)
◼ Let D be the data set and m the number of classes of the class label attribute
◼ Let p_i be the probability that an arbitrary tuple in D belongs to
class C_i, i.e. p_i = |C_{i,D}| / |D|
◼ Expected information (entropy) needed to classify a tuple in D:
      Info(D) = I(D) = − Σ_{i=1..m} p_i · log₂(p_i)
◼ Information needed (after using attribute A to split D into v
partitions) to classify D by A:
      Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) · I(D_j)
◼ Information gained by branching on attribute A:
      Gain(A) = Info(D) − Info_A(D)
◼ Select the attribute with the highest information gain
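A minimal Python sketch of these three measures (the function names are my own; the data set is assumed to be a list of dicts with a designated class-label attribute):

```python
from math import log2
from collections import Counter

def entropy(tuples, class_attr):
    """Info(D): expected information needed to classify a tuple in D."""
    counts = Counter(t[class_attr] for t in tuples)
    total = len(tuples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_after_split(tuples, attr, class_attr):
    """Info_A(D): weighted entropy of the partitions induced by attribute A."""
    total = len(tuples)
    partitions = Counter(t[attr] for t in tuples)
    return sum((n / total) * entropy([t for t in tuples if t[attr] == v], class_attr)
               for v, n in partitions.items())

def info_gain(tuples, attr, class_attr):
    """Gain(A) = Info(D) - Info_A(D)."""
    return entropy(tuples, class_attr) - info_after_split(tuples, attr, class_attr)
```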
Entropy and Information Gain
◼ Entropy is the measure of impurity, disorder or uncertainty in a bunch
of examples.
◼ What an Entropy basically does?
Entropy controls how a Decision Tree decides to split the data. It
actually effects how a Decision Tree draws its boundaries.
◼ Information gain (IG) measures how much “information” a feature
gives us about the class.
◼ Why it matter ?
◼ Information gain is the main key that is used by Decision Tree
Algorithms to construct a Decision Tree.
◼ Decision Trees algorithm will always tries to maximize Information
gain.
◼ An attribute with highest Information gain will tested/split first.
Attribute Selection: Information Gain
(Training data D: the 14-tuple buys_computer dataset shown earlier, with 9 tuples of class “yes” and 5 of class “no”.)
Class 1: buys_computer = “yes”    Class 2: buys_computer = “no”

Info(D) = I(9,5) = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) = 0.940

Info needed for each value of age:
age      Y   N   I(Y,N)
<=30     2   3   0.971
31…40    4   0   0
>40      3   2   0.971

Info_age(D) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694
Here (5/14)·I(2,3) means that “age <= 30” covers 5 of the 14 samples, with 2 YES and 3 NO.

Gain(age) = Info(D) − Info_age(D) = 0.246

Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048.
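Continuing the sketch above, the same figures can be reproduced on the 14-tuple buys_computer data (encoded here as a list of dicts, with credit_rating shortened to credit; the printed values match the slide up to rounding):

```python
D = [
    {'age': '<=30',   'income': 'high',   'student': 'no',  'credit': 'fair',      'buys': 'no'},
    {'age': '<=30',   'income': 'high',   'student': 'no',  'credit': 'excellent', 'buys': 'no'},
    {'age': '31..40', 'income': 'high',   'student': 'no',  'credit': 'fair',      'buys': 'yes'},
    {'age': '>40',    'income': 'medium', 'student': 'no',  'credit': 'fair',      'buys': 'yes'},
    {'age': '>40',    'income': 'low',    'student': 'yes', 'credit': 'fair',      'buys': 'yes'},
    {'age': '>40',    'income': 'low',    'student': 'yes', 'credit': 'excellent', 'buys': 'no'},
    {'age': '31..40', 'income': 'low',    'student': 'yes', 'credit': 'excellent', 'buys': 'yes'},
    {'age': '<=30',   'income': 'medium', 'student': 'no',  'credit': 'fair',      'buys': 'no'},
    {'age': '<=30',   'income': 'low',    'student': 'yes', 'credit': 'fair',      'buys': 'yes'},
    {'age': '>40',    'income': 'medium', 'student': 'yes', 'credit': 'fair',      'buys': 'yes'},
    {'age': '<=30',   'income': 'medium', 'student': 'yes', 'credit': 'excellent', 'buys': 'yes'},
    {'age': '31..40', 'income': 'medium', 'student': 'no',  'credit': 'excellent', 'buys': 'yes'},
    {'age': '31..40', 'income': 'high',   'student': 'yes', 'credit': 'fair',      'buys': 'yes'},
    {'age': '>40',    'income': 'medium', 'student': 'no',  'credit': 'excellent', 'buys': 'no'},
]

print(round(entropy(D, 'buys'), 3))              # 0.94  (the slide's 0.940)
for a in ('age', 'income', 'student', 'credit'):
    print(a, round(info_gain(D, a, 'buys'), 3))
# age 0.247, income 0.029, student 0.152, credit 0.048
# (the slide's 0.246 and 0.151 come from rounding intermediate values)
```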
How data is partitioned
◼ The highest information gain is for attribute age, so it becomes the root,
with one branch per value (<=30, 31..40, >40), and the training data are
partitioned accordingly.

Branch age = 31..40 (all tuples are YES, so create a leaf node ‘YES’):
income   student   credit_rating   buys_computer
high     no        fair            yes
low      yes       excellent       yes
medium   no        excellent       yes
high     yes       fair            yes

Branch age <= 30 (a mix of YES and NO, so repeat the process to find the
information gain of income, student and credit_rating on this subset):
income   student   credit_rating   buys_computer
high     no        fair            no
high     no        excellent       no
medium   no        fair            no
low      yes       fair            yes
medium   yes       excellent       yes

Branch age > 40 (also a mix of YES and NO, so repeat the process on this subset):
income   student   credit_rating   buys_computer
medium   no        fair            yes
low      yes       fair            yes
low      yes       excellent       no
medium   yes       fair            yes
medium   no        excellent       no
How data is partitioned…
◼ In each branch, select the attribute with the highest information gain
(a compact code sketch of this recursion is given after this slide).

Branch age <= 30: among income, student and credit_rating, the highest
information gain is for student, so student becomes the decision node of
this branch.
  student = no (all tuples are NO, so create a leaf node ‘NO’):
  income   credit_rating   buys_computer
  high     fair            no
  high     excellent       no
  medium   fair            no
  student = yes (all tuples are YES, so create a leaf node ‘YES’):
  income   credit_rating   buys_computer
  low      fair            yes
  medium   excellent       yes

Branch age > 40: still a mix of YES and NO, so repeat the process to find
the information gain of income, student and credit_rating on this subset:
income   student   credit_rating   buys_computer
medium   no        fair            yes
low      yes       fair            yes
low      yes       excellent       no
medium   yes       fair            yes
medium   no        excellent       no
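The recursive partitioning shown on the last two slides can be written compactly as a toy ID3-style builder in Python. This is a simplified sketch, not the full algorithm of the earlier slide: it assumes categorical attributes, repeats the entropy/gain helpers so the snippet stands alone, and uses the dataset D from the earlier sketch for the demonstration call.

```python
from math import log2
from collections import Counter

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def gain(rows, attr, target):
    split = Counter(r[attr] for r in rows)
    rem = sum(n / len(rows) * entropy([r for r in rows if r[attr] == v], target)
              for v, n in split.items())
    return entropy(rows, target) - rem

def build_tree(rows, attrs, target):
    """Toy ID3-style induction: stop on pure nodes or exhausted attributes."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                  # all samples in one class -> leaf
        return labels[0]
    if not attrs:                              # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a, target))   # highest information gain
    tree = {best: {}}
    for value in set(r[best] for r in rows):   # one branch per attribute value seen
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = build_tree(subset, [a for a in attrs if a != best], target)
    return tree

tree = build_tree(D, ['age', 'income', 'student', 'credit'], 'buys')
print(tree)
# -> {'age': {'31..40': 'yes',
#             '<=30': {'student': {'no': 'no', 'yes': 'yes'}},
#             '>40':  {'credit': {'excellent': 'no', 'fair': 'yes'}}}}   (up to dict ordering)
```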
Final Decision Tree
age?
  <=30   → student?
             no  → No
             yes → Yes
  31..40 → Yes
  >40    → credit_rating?
             excellent → No
             fair      → Yes
Another solved Example
Overfitting and Tree Pruning
◼ Overfitting: An induced tree may overfit the training data
◼ Too many branches, some may reflect anomalies due to noise or
outliers
◼ Poor accuracy for unseen samples
◼ Two approaches to avoid overfitting
◼ Prepruning: Halt tree construction early—do not split a node if this
would result in the goodness measure falling below a threshold
◼ Difficult to choose an appropriate threshold
◼ Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
◼ Use a set of data different from the training data to decide
which is the “best pruned tree”
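As one concrete illustration of postpruning (a library technique, not necessarily the exact method intended above), scikit-learn can grow a full tree and then apply cost-complexity pruning, with a held-out split used to pick the "best pruned tree". A sketch on a standard dataset (iris is used only so the snippet runs end to end):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)          # any labelled dataset would do
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate pruning strengths (ccp_alpha values) derived from the fully grown tree
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
path = clf.cost_complexity_pruning_path(X_train, y_train)

# Fit one progressively pruned tree per alpha, keep the best one on the held-out split
pruned_trees = [DecisionTreeClassifier(criterion="entropy", ccp_alpha=a, random_state=0)
                .fit(X_train, y_train) for a in path.ccp_alphas]
best = max(pruned_trees, key=lambda t: t.score(X_val, y_val))
print(best.get_n_leaves(), round(best.score(X_val, y_val), 3))
```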
Rule-based Classification
Using IF-THEN Rules for Classification
◼ Represent the knowledge in the form of IF-THEN rules
R: IF age <=30 AND student = yes THEN buys_computer = yes
◼ Rule antecedent/precondition vs. rule consequent
◼ Assessment of a rule R is done by: coverage and accuracy
◼ n_covers = # of tuples covered by R (i.e. tuples matching the LHS of the rule)
◼ n_correct = # of tuples correctly classified by R (i.e. matching both LHS and RHS)
coverage(R) = n_covers / |D|      (D: the training data set)
accuracy(R) = n_correct / n_covers
◼ If more than one rule is triggered, need conflict resolution
◼ Size ordering: assign the highest priority to the triggering rule that has
the “toughest” requirements (i.e., the most attribute tests)
◼ Class-based ordering: decreasing order of prevalence or misclassification
cost per class
◼ Rule-based ordering (decision list): rules are organized into one long
priority list, according to some measure of rule quality or by experts
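A small sketch of coverage and accuracy for the example rule R above, assuming the buys_computer data D encoded as the list of dicts used in the earlier sketches (attribute names shortened there):

```python
def covers(antecedent, t):
    """True if tuple t matches the rule antecedent (LHS)."""
    return all(t[attr] == val for attr, val in antecedent.items())

def coverage_and_accuracy(antecedent, consequent, data, class_attr):
    covered = [t for t in data if covers(antecedent, t)]
    correct = [t for t in covered if t[class_attr] == consequent]
    return len(covered) / len(data), len(correct) / len(covered)

# R: IF age <= 30 AND student = yes THEN buys_computer = yes
cov, acc = coverage_and_accuracy({'age': '<=30', 'student': 'yes'}, 'yes', D, 'buys')
print(cov, acc)   # coverage 2/14 ≈ 0.143, accuracy 2/2 = 1.0 on this training set
```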
Rule Extraction from a Decision Tree
◼ Rules are easier to understand than large
trees
◼ One rule is created for each path from the
root to a leaf
◼ Each attribute-value pair along a path forms
a conjunction: the leaf holds the class
prediction
◼ Rules are mutually exclusive and exhaustive
◼ Example: Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
Thank you