Classification and Prediction

 What is classification? What is prediction?
 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian classification
 Prediction
 Linear regression
 Non-linear regression
Classification vs. Prediction
 Classification:
  predicts categorical class labels (discrete or nominal)
  constructs a model from the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data
 Prediction:
  models continuous-valued functions, i.e., predicts unknown or missing values
 Typical applications
  credit approval
  target marketing
  medical diagnosis
  treatment effectiveness analysis
Classification—A Two-Step Process
 Model construction: describing a set of predetermined classes
  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  The set of tuples used for model construction is the training set
  The model is represented as classification rules, decision trees, or mathematical formulae
 Model usage: classifying future or unknown objects
  Estimate the accuracy of the model
   The known label of each test sample is compared with the classified result from the model
   Accuracy rate is the percentage of test set samples that are correctly classified by the model
   The test set is independent of the training set; otherwise over-fitting will occur
  If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Classification Process (1): Model Construction

Training data is fed to a classification algorithm, which outputs the classifier (model):

NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Learned model: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
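The learned rule above can be checked against the training tuples with a short sketch (plain Python; the tuple layout simply mirrors the table):

```python
# Training tuples from the table: (name, rank, years, tenured)
training_data = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor",      2, "yes"),
    ("Jim",  "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def classify(rank, years):
    """Learned model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# The rule reproduces every training label (100% training accuracy)
correct = sum(classify(rank, years) == tenured
              for _, rank, years, tenured in training_data)
print(correct / len(training_data))  # 1.0
```

Note that Dave (6 years) is correctly rejected: the rule requires strictly more than 6 years.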
Classification Process (2): Use the Model in Prediction

The classifier is first evaluated on testing data, then applied to unseen data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
Supervised vs. Unsupervised Learning
 Supervised learning (classification)
  Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  New data is classified based on the training set
 Unsupervised learning (clustering)
  The class labels of the training data are unknown
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
August 5, 2025 Data Mining: Concepts and Techniques 6
Issues Regarding Classification and
Prediction (1): Data Preparation

 Data cleaning
 Preprocess data in order to reduce noise and
handle missing values
 Relevance analysis (feature selection)
 Remove the irrelevant or redundant attributes
 Data transformation
 Generalize and/or normalize data



Issues Regarding Classification and Prediction (2): Evaluating Classification Methods
 Predictive accuracy
 Speed and scalability
  time to construct the model
  time to use the model
 Robustness
  handling noise and missing values
 Scalability
  efficiency in disk-resident databases
 Interpretability
  understanding and insight provided by the model
 Goodness of rules
  decision tree size
  compactness of classification rules
Bayesian Classification: Why?
 Probabilistic learning: Calculate explicit probabilities
for hypothesis, among the most practical
approaches to certain types of learning problems
 Incremental: Each training example can
incrementally increase/decrease the probability that
a hypothesis is correct. Prior knowledge can be
combined with observed data.
 Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities
 Standard: Even when Bayesian methods are
computationally intractable, they can provide a
standard of optimal decision making against which
other methods can be measured
Bayesian Theorem: Basics
 Let X be a data sample whose class label is unknown
 Let H be a hypothesis that X belongs to class C
 For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
 P(H): prior probability of hypothesis H (i.e., the initial probability before we observe any data; reflects the background knowledge)
 P(X): probability that the sample data is observed
 P(X|H): probability of observing the sample X, given that the hypothesis holds
Bayesian Theorem
 Given training data X, the posterior probability of a hypothesis H, P(H|X), follows the Bayes theorem:

P(H|X) = P(X|H) P(H) / P(X)

 Informally, this can be written as
posterior = likelihood × prior / evidence
 MAP (maximum a posteriori) hypothesis:

h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h)

 Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost
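Bayes' theorem can be sketched numerically (the prior, likelihood, and evidence values below are invented for illustration, not taken from the slides):

```python
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood_x_given_h * prior_h / evidence_x

# Hypothetical numbers: P(H) = 0.3, P(X|H) = 0.8, P(X) = 0.5
p = posterior(0.3, 0.8, 0.5)
print(p)  # 0.48
```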
Naïve Bayes Classifier
 A simplifying assumption: attributes are conditionally independent:

P(X|Ci) = ∏_{k=1}^{n} P(x_k|Ci)

 The probability of observing, say, two elements y1 and y2 together, given the current class C, is the product of the probabilities of each element taken separately, given the same class: P([y1,y2]|C) = P(y1|C) × P(y2|C)
 No dependence relation between attributes
 Greatly reduces the computation cost: only the class distribution needs to be counted
 Once the probability P(X|Ci) is known, assign X to the class with maximum P(X|Ci) × P(Ci)
Training dataset

Classes: C1: buys_computer = ‘yes’; C2: buys_computer = ‘no’
Data sample: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age    income  student credit_rating buys_computer
<=30   high    no      fair          no
<=30   high    no      excellent     no
31…40  high    no      fair          yes
>40    medium  no      fair          yes
>40    low     yes     fair          yes
>40    low     yes     excellent     no
31…40  low     yes     excellent     yes
<=30   medium  no      fair          no
<=30   low     yes     fair          yes
>40    medium  yes     fair          yes
<=30   medium  yes     excellent     yes
31…40  medium  no      excellent     yes
31…40  high    yes     fair          yes
>40    medium  no      excellent     no


Naïve Bayesian Classifier: Example
 Compute P(X|Ci) for each class
P(age=“<=30” | buys_computer=“yes”) = 2/9 = 0.222
P(age=“<=30” | buys_computer=“no”) = 3/5 = 0.6
P(income=“medium” | buys_computer=“yes”) = 4/9 = 0.444
P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4
P(student=“yes” | buys_computer=“yes”) = 6/9 = 0.667
P(student=“yes” | buys_computer=“no”) = 1/5 = 0.2
P(credit_rating=“fair” | buys_computer=“yes”) = 6/9 = 0.667
P(credit_rating=“fair” | buys_computer=“no”) = 2/5 = 0.4

X = (age<=30, income=medium, student=yes, credit_rating=fair)

P(X|Ci): P(X|buys_computer=“yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X|buys_computer=“no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X|Ci) × P(Ci): P(X|buys_computer=“yes”) × P(buys_computer=“yes”) = 0.044 × 9/14 = 0.028
P(X|buys_computer=“no”) × P(buys_computer=“no”) = 0.019 × 5/14 = 0.007

X belongs to class “buys_computer = yes”
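The computation on this slide can be reproduced directly from the 14-tuple table with a minimal counting sketch in Python:

```python
# The 14 training tuples from the slide: (age, income, student, credit_rating, buys)
data = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]

def score(x, cls):
    """Naive Bayes score P(X|Ci) * P(Ci), estimated by counting."""
    rows = [r for r in data if r[-1] == cls]
    p = len(rows) / len(data)              # prior P(Ci) = 9/14 or 5/14
    for k, value in enumerate(x):          # product of P(x_k|Ci)
        p *= sum(1 for r in rows if r[k] == value) / len(rows)
    return p

x = ("<=30", "medium", "yes", "fair")
s_yes, s_no = score(x, "yes"), score(x, "no")
print(round(s_yes, 3), round(s_no, 3))  # 0.028 0.007
```

The larger score wins, so X is assigned to buys_computer = “yes”, matching the hand computation.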
Naïve Bayesian Classifier: Comments
 Advantages:
  Easy to implement
  Good results obtained in most of the cases
 Disadvantages:
  Assumption of class conditional independence, therefore loss of accuracy
  Practically, dependencies exist among variables
   E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
   Dependencies among these cannot be modeled by the Naïve Bayesian Classifier
 How to deal with these dependencies?
  Bayesian Belief Networks
Avoid Overfitting in Classification
 Overfitting: an induced tree may overfit the training data
  Too many branches, some of which may reflect anomalies due to noise or outliers
  Poor accuracy for unseen samples
 Two approaches to avoid overfitting
  Prepruning: halt tree construction early—do not split a node if this would result in the goodness measure falling below a threshold
   Difficult to choose an appropriate threshold
  Postpruning: remove branches from a “fully grown” tree—get a sequence of progressively pruned trees
   Use a set of data different from the training data to decide which is the “best pruned tree”
The k-Nearest Neighbor Algorithm
 All instances correspond to points in the n-D space
 The nearest neighbors are defined in terms of Euclidean distance
 The target function could be discrete- or real-valued
 For discrete-valued target functions, k-NN returns the most common value among the k training examples nearest to x_q
 Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
[Figure: a 2-D scatter of “+” and “−” training points around the query point x_q, illustrating the 1-NN decision surface]
Discussion on the k-NN Algorithm
 The k-NN algorithm for continuous-valued target functions
  Calculate the mean value of the k nearest neighbors
 Distance-weighted nearest neighbor algorithm
  Weight the contribution of each of the k neighbors according to their distance to the query point x_q, giving greater weight to closer neighbors:

w ≡ 1 / d(x_q, x_i)²

  Similarly for real-valued target functions
 Robust to noisy data by averaging over the k nearest neighbors
 Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes
  To overcome it, stretch the axes or eliminate the least relevant attributes
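A minimal sketch of the distance-weighted variant described above, for a discrete-valued target (plain Python; the sample points are invented for illustration):

```python
import math

def knn_predict(query, examples, k=3):
    """Distance-weighted k-NN for a discrete-valued target.

    examples: list of (point, label); points are equal-length tuples.
    Each of the k nearest neighbors votes with weight w = 1 / d(x_q, x_i)^2.
    """
    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    nearest = sorted(examples, key=lambda e: dist(query, e[0]))[:k]
    votes = {}
    for point, label in nearest:
        d = dist(query, point)
        if d == 0:                      # exact match: its label dominates
            return label
        votes[label] = votes.get(label, 0.0) + 1.0 / d ** 2
    return max(votes, key=votes.get)

# Hypothetical 2-D training points: a "-" cluster and a "+" cluster
examples = [((0, 0), "-"), ((1, 0), "-"), ((0, 1), "-"),
            ((5, 5), "+"), ((6, 5), "+"), ((5, 6), "+")]
print(knn_predict((1, 1), examples, k=3))  # -
```

For a real-valued target, `max` over votes would be replaced by the distance-weighted mean of the neighbors' values.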
Classifier Evaluation Metrics: Confusion Matrix

Confusion Matrix:
Actual class \ Predicted class   C1                    ¬C1
C1                               True Positives (TP)   False Negatives (FN)
¬C1                              False Positives (FP)  True Negatives (TN)

Example of Confusion Matrix:
Actual class \ Predicted class   buys_computer = yes   buys_computer = no   Total
buys_computer = yes              6954                  46                   7000
buys_computer = no               412                   2588                 3000
Total                            7366                  2634                 10000

Given m classes, an entry CM_{i,j} in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j
 May have extra rows/columns to provide totals
Accuracy, Error Rate, Sensitivity and Specificity

A \ P   C    ¬C
C       TP   FN   P
¬C      FP   TN   N
        P’   N’   All

 Class Imbalance Problem: one class may be rare, e.g. fraud or HIV-positive; a significant majority of the negative class and a minority of the positive class
 Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN) / All
 Error rate: 1 − accuracy, or
  Error rate = (FP + FN) / All
 Sensitivity: true positive recognition rate
  Sensitivity = TP / P
 Specificity: true negative recognition rate
  Specificity = TN / N
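Applying these formulas to the example confusion matrix from the previous slide:

```python
# Counts from the example confusion matrix above
TP, FN = 6954, 46    # actual buys_computer = yes (P = 7000)
FP, TN = 412, 2588   # actual buys_computer = no  (N = 3000)
P, N = TP + FN, FP + TN
ALL = P + N          # 10000 test tuples in total

accuracy    = (TP + TN) / ALL
error_rate  = (FP + FN) / ALL
sensitivity = TP / P
specificity = TN / N

print(round(accuracy, 4))     # 0.9542
print(round(error_rate, 4))   # 0.0458
print(round(sensitivity, 4))  # 0.9934
print(round(specificity, 4))  # 0.8627
```

Note how the class imbalance shows up: overall accuracy is high, yet specificity on the minority “no” class is noticeably lower than sensitivity.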