KEMBAR78
Classification (Part II) | PDF | Statistical Classification | Cluster Analysis
0% found this document useful (0 votes)
22 views162 pages

Classification (Part II)

Uploaded by

20051694
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
22 views162 pages

Classification (Part II)

Uploaded by

20051694
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 162

Classification

Classification and Clustering

• Classification and clustering are two methods of pattern identification with some
similarity and disimilarity.

• Classification uses predefined classes in which objects are assigned.

• Clustering identifies similarities between objects, which it groups according to


those characteristics in common and which differentiate them from other groups
of objects. These groups are known as "clusters"
Classification and Clustering: Example
Classification
• Classification is the Data mining technique used to predict group membership for
data instances.
• There are two ways to assign a new value to a given class
Crispy Classification
Probabilistic Classification
Crispy Classification: Given an input, the classifier exactly returns its class label.
Probabilistic Classification: Given an input, the classifier returns its probabilities to
belong to each class. This is useful when some mistakes can be more costly than
others
• Give only data > 90%
• Assign the object to the class with the highest probability
• Assign the object to the class only if its probability >40%

For example, in a banking application, the customer who applies for a loan may be classified as a
Clustering
• Clustering is a technique of organising a group of data into classes and clusters
where the objects reside inside a cluster will have high similarity and the objects
of two clusters would be dissimilar to each other.
• In clustering, the similarity as well as distance between two objects is measured
by the similarity function. Generally, Intra-cluster distance are less than inter-
cluster distance.
Classification Vs Clustering
Parameter Classification Clustering
Type Used for supervised learning Used for unsupervised learning
Basic process of classifying the input instances grouping the instances based on their
based on their corresponding class labels similarity without the help of class
labels
Need it has labels, so there is need of training there is no need of training and testing
and testing dataset for verifying the dataset
created model
Complexity more complex less complex
Example Logistic regression, Naive Bayes k-means clustering algorithm, Fuzzy c-
Algorithms classifier, Support vector machines, KNN, means clustering algorithm, Gaussian
Decision Tree etc. (EM) clustering algorithm, etc.
Application Detection of unsolicited email Investigation of the social networks
Recognition of the face Segmentation of an image
Approval of a Bank Loan Recommendation Engines
Supervised vs. Unsupervised Learning
Supervised and Unsupervised learning are the two techniques of ML. Both are
used in different scenarios and with different dataset.
Supervised vs. Unsupervised Learning
Supervised learning (classification) –
 In Supervised learning models are trained using labeled data.
 In supervised learning, models need to find the mapping function to map the input variable (X) with the
output variable (Y).
Y= f(X)
 Supervised learning needs supervision to train the model, which is similar to as a student learns things in
the presence of a teacher.
 Supervised learning can be used for two types of problems: Classification and Regression.

Unsupervised learning (clustering) –


 In Unsupervised learning patterns are inferred from the unlabeled input data.
 The goal of unsupervised learning is to find the structure and patterns from the input data.
 Unsupervised learning does not need any supervision. Instead, it finds patterns from the data by its own
 Unsupervised learning can be used for two types of problems: Clustering and Association.
How Supervised Learning Works?
How Un-supervised Learning Works?
Trainning Dataset Vs. Test Dataset
• The training data is the biggest subset (>=60%) of the original dataset, which
is used to train or fit the model. Firstly, the training data is fed to the
algorithms, which lets them learn how to make predictions for the given task.
The type of training data that we provide to the model is highly responsible
for the model's accuracy and prediction ability. It means that the better the
quality of the training data, the better will be the performance of the model.

• The test dataset is another well-organized subset (20%-40%) of original data,


which is independent of the training dataset. However, it has some similar
types of features and class probability distribution for each type of scenario
and uses it as a benchmark for model evaluation once the model training is
completed.
Classification and Prediction

What is classification and What is prediction?

Issues regarding classification and prediction


What is classification and What is prediction?
Classification:
Classification is to identify the category or the class label of a new observation.
First a set of training data (input data with their class label) is given to the algorithm, the
algorithm derives a model or classifier (that can be decision tree, mathematical formula, or a
neural network). When a new unlabeled data is given to the classifier model, it should find the
class to which it belongs.
Prediction:
Prediction is to compute/predict the numeric output value of a new observation.
Same as classification, a set of training data (input data with their numeric output value) is given
to the algorithm, the algorithm derives a model or predictor (that can be regression). When a
new unlabeled data is given to the predictor model, it should find the continuous numeric output.
prediction
classification
Classification Learning: Definition

• Given a collection of records (training set)


–Each record contains a set of attributes, one of the attributes is the class

• Find a model for the class attribute as a function of the values of the other
attributes

• Use training set to construct the model and use test set to estimate the
accuracy of the model

• Goal: previously unseen records should be assigned a class as accurately as


possible
Illustrating Classification Learning
Classification - A Two-Step Process
Model construction: describing a set of predetermined classes
 Building the Classifier or Model
 Each tuple/sample is assumed to belong to a predefined class, as defined by
the class label attribute
 The set of tuples used for model construction: training set
 The model is represented as

Model usage: for classifying future or unknown objects


Using Classifier for Classification estimate accuracy of the model
The known label of test sample is compared with the classified result from the
model
Accuracy rate is the percentage of test set that are correctly classified by the model
Test set is independent of training set, otherwise over-fitting will occur
Classification - A Two-Step Process:

The data classification process: Learning:Training data


are analyzed by a classification algorithm. Here, the class
label attribute is loan_decision, and the learned model or
classifier is represented in the form of classification rules.
Classification - A Two-Step Process:

Classifiaction: Test data are used to estimate the accuracy of the classification rules. If the
accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.
Issues regarding classification and prediction:
(1) Data Preparation

• Preprocess data in order to reduce noise and handle missing


values

• Remove the irrelevant or redundant attributes

• Generalize and/or normalize data

19
Issues regarding classification and prediction:
(2) Evaluating Classification Methods
• Predictive accuracy
• Speed and scalability
time to construct the model
time to use the model
• Robustness
handling noise and missing values
• Scalability
efficiency in disk-resident databases
• Interpretability
understanding and insight provded by the model
• Goodness of rules
decision tree size
compactness of classification rules 20
Classification Technique
• Decision Trees
• Bayesian Classifiers
• Neural Networks
• K-Nearest Neighbour
• Support Vector Machines
• Linear Regression
• Logistic Regression
• Fuzzzy Algorithm
• Genetic Algorithm
Classification by Decision Tree Induction
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
– The topmost node in the tree is the root node.
• Decision tree generation consists of two phases
–Tree construction
 At start, all the training examples are at the root
 Partition examples recursively based on selected attributes
–Tree pruning
Identify and remove branches that reflect noise or outliers
– Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
Decision Tree for Cheat or not Cheat
Decision Tree for Playing Tennis
Play-Tennis Data
DAY OUTLOOK TEMP HUMIDITY WIND PLAY
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Decision Tree for Conjunction
Decision Tree for Disjunction
Decision Tree for XOR
Decision Tree for Disjunction of Conjuction
Three Possible Partition Scheme
Atribute Selection Measure
• Attribute selection measure finds the best attributes in sequence which can
give a smallest decision tree. It provide a splitting rule that determine how
the tuple at a give node can be best split.
• On the basis of sample dataset/traning dataset attribute selection measure
provide a ranking for each attibute.
• The attribute with best rank/score for the measure is choosen as the
splitting attribute for the given tuples.
• Three popular attribute selection measure:
• Information Gain
• Gain Ratio
• Gini Index
Information Gain Method

• Entropy/Information is the probabilistic measure of


uncertanity/variance/randomness present in the
dataset/attribute.
• Information Gain is the decrease/increase in Entropy value
when the node is split. It is the measure of a reduction of
uncertainty/variance.
• Based on the computed values of Entropy and Information Gain
teh best attribute at the respective level is choosen
Information Gain Method

• m: number of Classes or labels


• pi: Probability of Class i

• v: no. of variations of attribute A


• pi: Probability of Class i

• A is the particular Attribute


Which Attribute to Select
Class:
RID Age Income Student Credit_Rating Buys
Computer
1 Youth High No Fair No Youth
AGE
2 Youth High No Excellent No
3 middle_aged High No Fair Yes Senior
4 senior Medium No Fair Yes
Middle_aged
5 senior Low Yes Fair Yes
6 senior Low Yes Excellent No
7 middle_aged Low Yes Excellent Yes NO YES YES
8 Youth Medium No Fair No NO YES YES
9 Youth Low Yes Fair Yes NO YES NO
10 senior Medium Yes Fair Yes YES YES YES
11 Youth Medium Yes Excellent Yes YES NO
12 middle_aged Medium No Excellent Yes
13 middle_aged High Yes Fair Yes
14 senior Medium No Excellent No
Which Attribute to Select
Class:
RID Age Income Student Credit_Rating Buys
Computer
Income
1 Youth High No Fair No High
2 Youth High No Excellent No
3 middle_aged High No Fair Yes Low
4 senior Medium No Fair Yes
5 senior Low Yes Fair Yes
Medium
6 senior Low Yes Excellent No
7 middle_aged Low Yes Excellent Yes NO YES YES
8 Youth Medium No Fair No NO NO NO
9 Youth Low Yes Fair Yes YES YES YES
10 senior Medium Yes Fair Yes YES YES YES
11 Youth Medium Yes Excellent Yes YES
12 middle_aged Medium No Excellent Yes NO
13 middle_aged High Yes Fair Yes
14 senior Medium No Excellent No
Which Attribute to Select
Class:
RID Age Income Student Credit_Rating Buys
Computer
Student
1 Youth High No Fair No NO
2 Youth High No Excellent No
3 middle_aged High No Fair Yes YES
4 senior Medium No Fair Yes
5 senior Low Yes Fair Yes
6 senior Low Yes Excellent No
7 middle_aged Low Yes Excellent Yes NO YES
8 Youth Medium No Fair No NO NO
9 Youth Low Yes Fair Yes YES YES
10 senior Medium Yes Fair Yes YES YES
11 Youth Medium Yes Excellent Yes NO YES
12 middle_aged Medium No Excellent Yes YES YES
13 middle_aged High Yes Fair Yes NO YES
14 senior Medium No Excellent No
Which Attribute to Select
Class:
RID Age Income Student Credit_Rating Buys
Computer Credit_
1 Youth High No Fair No Rating
2 Youth High No Excellent No Fair
3 middle_aged High No Fair Yes Excellent
4 senior Medium No Fair Yes
5 senior Low Yes Fair Yes
6 senior Low Yes Excellent No
7 middle_aged Low Yes Excellent Yes NO NO
8 Youth Medium No Fair No YES NO
9 Youth Low Yes Fair Yes YES YES
10 senior Medium Yes Fair Yes YES YES
11 Youth Medium Yes Excellent Yes NO YES
12 middle_aged Medium No Excellent Yes YES NO
13 middle_aged High Yes Fair Yes YES
14 senior Medium No Excellent No YES
Information Gain Approach

• m: number of Classes or labels=2


• pi: Probability of Class i

Class YES of D Class NO of D


Information Gain Approach

• m: number of Classes or labels=2


• pi: Probability of Class i

Class Variation in Class Variation in Class Variation in


Youth Middle_aged Senior
Information Gain Approach
Step 2: Iteratively find the Best Atrribute for each level of the Decision ree.

Iteration 1: Find Best Attribute for For 0th level Cont...

Similarly find the gain of attribute “income”, “student”, “credit_rating”

b. Gain of income = 0.029


c. Gain of student = 0.151
d. Gain of credit_rating = 0.048

Age has highest information gain from rest of the attribute, and hence is seleceted
as splitting attribute.
Level 0 Decision Tree using 1st Splitting Attribute
Age
• Tupple Falling on age=middle_aged
all belong to the same class “YES”
and hence a leaf can be created for
that branch.

• continue the tree construction with


finding the best gain for data-partion
age=youth and age=senior
Final Decision Tree
Example of Majority Voting
Extracting Classification Rule from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to leaf.
• Each attribute-value pair along a path forms a conjuction
• The leaf node holds the class prediction

Example:
• If (Age == Youth) AND (Student == No) then buys_computer=”No”
• If (Age == Youth) AND (Student == No) then buys_computer=”Yes”
• If (Age == middle_aged) then buys_computer=”Yes”
• If( Age == senior) AND credit_rating == “Excellent” then buys_computer=”No”
• If(Age == senior) AND credit_rating == “fair” then buys_computer=”Yes”
Drawback of Information gain as attribute selection measure
• The information gain measure is biased towards tests with many outcomes.
i.e. it prefers to select attribute having a large number of values.
• For instance, for the this dataset information gain
prefers to split on attribute “Table income” that
would result in a large number of pure partitions
but each one containg one tuple.
Infoincome(D)=0 [variation = 0]
=> Gainincome(D) = Info(D)-Infoincome(D)=Info(D)
• Hence, Information gain w.r.t. income is maximal.
Solution:
C4.5, a successor of ID3
Gain Ratio
• C4.5, a successor of ID3, uses an extension to information gain known as gain ratio,
which attempts to overcome this bias.
• It attempts a kind of normalization to information gain using a “split information” value
defined analogously with Info(D) as follows.

• If there are “v” partitions in attribute A then SlpitInfo finds the number of tuples having
that outcome w.r.t. total number of tuple.
• Now, to compute the best splitting attirbute the criteria “GainRatio” will be used instead
of “Gain” which intern uses “SplitInfo”.

• Conclusion: Gain Ratio applies a kind of normalization to information gain using


“split information” value. the attribute with max gain ratio is selected as splitting
Example of Gain Ratio
In the process of selecting best partition attribute nstead of finding gain of each attribute,
now find the gain ratio of each attribute and then find the max gain ration
Class:
Credit_
RID Age Income Student Buys Let computed Gain of attribute income = 0.029
Rating
Computer
Attribute income is having tree partitions:
1 - High - - No
2 - High - - No
Low, Medium, High
3 - High - - Yes Low -> 4 tuples
4 - Medium - - Yes Medium -> 6 tuples
5 - Low - - Yes
6 - Low - - No
High -> 4 tuples
7 - Low - - Yes
8 - Medium - - No
9 - Low - - Yes
10 - Medium - - Yes
LOW MEDIUM HIGH
11 - Medium - - Yes
12 - Medium - - Yes
13 - High - - Yes
14 - Medium - - No
Gini Index
Gini Index
Classifier -II
Naive Bayesian Classifier
Classification by Naive Bayesian Classifier
• Bayesian Classifier is a statistical classifier based on Bayes Theorem which can
predict the probability that a given tuple belongs to a particular class.

• Naive Bayesian Classifier is a simple Bayesian classifier that exhibits high


accuracy and speed compared to Decision tree & Neural Network classifier.

• Naive Assumption: [Class-Conditional Independence]: The effect of an attribute


value on a particular class is independent of other attribute value.

[Simplify the computation] => Naive


Introduction to Bayesian Classification

• Statistical method for classification

• Supervised Learning method.

• Assumes an underlying probabilistic model, the Bayes theorem

• Can solve problems involving both categorical and continuous valued


attributes.

• Named after Thomas Bayes, who proposed the Bayes Theorem


Probability Basics: Prior Probability

A prior probability is the probability that an observation will fall into a group
before one collect the data.

Example:
1. John conducts a single coin toss. What is the a priori probability of landing a head?
A priori probability of landing a head = 1 / 2 = 50%.

2. A six-sided fair dice is rolled. What is the a priori probability of rolling a 2, 4, or 6, in


a dice roll?
The number of desired outcomes is 3 (rolling a 2, 4, or 6), and there are 6 outcomes in
total.
A priori probability of rolling a 2, 4, or 6 = 3 / 6 = 50%.
Probability Basics: Posterior Probability
A posterio probability / Conditional probability is the probability of an event occurring
given that another event has already occurred.
For example, we might be interested in finding the probability of some event “A”
occurring after we account for some event “B” that has just occurred.
The formula to calculate a posterior probability of A occurring given that B occurred:
Bayes’ Theorem

Where, A, B = Events
P(B|A):- The probability of B occuring given that A has happened
P(A) & P(B):- The prior probability of event A & event B
Probability Basics: Example Calculating Posterior Probability
Probability Basics: Posterior Probability contd...
For an intuitive understanding of this probability, suppose the following grid represents this
forest with 100 trees. Exactly 20 of the trees are Oak trees (20%) and 18 (90%) of them are
healthy. The other 80 (80%) trees are Maple and 40 (50%) of them are healthy.
(O = Oak, M = Maple, Green = Healthy, Red = Unhealthy)

Out of all exactly 58 of them are healthy and 18 of these healthy ones are Oak trees. Thus, if
a healthy tree has selected then the probability that it’s an Oak tree is 18/58 = 0.3103.
Bayes’ Theorem [useful in finding posterior probability]
An Example of Bayes’ Theorem

Example: Let customers are only


described by attributes
age, income and credit_rating

 X: 35 years old customer with an income 40K and fair Credit_rating


 H: Hypothesis that the customer will buy a computer
Likelihood Prior

Determine:-
Normalization Constant
An Example of Bayes’ Theorem contd...
Naive Bayes Classifier
Naive Bayes Classifier

As P(X) is Constant, Hence only maximize P(X|Ci) × P(Ci)


Problem
Class:
For the given training
RID Age Income Student Credit_Rating Buys
Computer
dataset let the new
1 Youth High No Fair No
customer X value as
follows. Find the
2 Youth High No Excellent No
probability that customer
3 middle_aged High No Fair Yes
will buy computer or not
4 senior Medium No Fair Yes
or
5 senior Low Yes Fair Yes
6 senior Low Yes Excellent No Find the query X belongs
7 middle_aged Low Yes Excellent Yes
to which class?
8 Youth Medium No Fair No
9 Youth Low Yes Fair Yes X = (Age=Youth
10 senior Medium Yes Fair Yes
income = medium
11 Youth Medium Yes Excellent Yes
12 middle_aged Medium No Excellent Yes student = Yes
13 middle_aged High Yes Fair Yes credit_rating = Fair)
14 senior Medium No Excellent No
Problem
Problem
Step-3: Find the class that maximizes P(X|Ci) × P(Ci)

Max{ P(X|buys_Comp=Yes) × p(buys_computer = Yes),


P(X|buys_Comp=No) × p(buys_computer = No)
= {0.44 × 0.643, 0.019 × 0.357}
= {0.028, 0.007} = 0.028 {Class “Yes” Maximizes the conditional probability of X}

Hence, the Naive Bayesian classifier predicts buys_comp is yes for given tuple X
Problem
For the given training
dataset let the new
customer X value as
follows. Find the
probability that Golf will
be played or not?
Find the query X belongs
to which class?

X = (Outlook=Rainy
Temp = Hot
Humidity = High
Windy = False)
Partial Solution
Advantage
• Simple, fast and easy to implement.Needs less training data. Highly scalable.
• Handles continuous and discrete data.
• Easily updatable if training dataset increases
Uses
• Text Classification
• Spam Filtering
• Hybrid Recommender
• Online Application
Disavantages
• Assumption: Class conditional independence, therefore loss of accuracy
• Practically, dependecies exist among variables.
Example: Hospitals, patients profile, family history, didease symptoms etc
Major Issues
• Violations of Independence Assumption: It assumes that the effect of an attribute values
on a given class is independent of the values of the other attribute.
Solution: Bayesian belief network classifier allows the representation of
dependencies among subsets of attributes
• Zero Conditional Probability problem: If a given class and feature value never occur
together in the training set then the frequency based probability estimate will be zero. This
is problematic since it will wipe out all information in the other probabilities when they are
multiplied.
Therefore often it is desirable to incorporate a small sample correction in all possible
option. Or consider the occurrence/probability as 1 instead of zero which can be
negligible in the consideration of maximum.
K-Nearest Neighbor
(Lazy Learners or Learning from your Neighbour)
Eager Learners Vs. Lazy Learners
Eager Learner Lazy Learner
• Eager Learner approach of classification • Lazy Learner approach stores the received
constructs a generalized classification training tuples and wait to the last minute of
model just after receiving the training tuples receiving the new/test tuple. Only when it
receives the test tuple, it constructs the
(much before receiving new/test tuple), and
generalized classification model based on
be ready and eager to classify the new similarity among test tuple and training
unseen tuple. tuples.
• More work when training tuple is received • Less work (only store) when training tuple is
and less work when making a classification. received and more work when making a
• Less Expensive but not support incremental classification
learning • More Expensive but support incremental
learning
• e.g. Decision Tree, Bayesian Classifier,
Backpropagation, Association Rule Mining • e.g. K-Nearest Neighbor, case-based learning
K-Nearest Neighbor Classifier

Nearest based classifiers are based on learning by analogy i.e. by comparing a


given test tuples with training tuples that are similar to it.
• Let the training tuples are described by n attributes. All instances of training
tuples are correspond to points in the n-dimentional space and accordingly each
tuple represents a point in an n-dimensional space
• A k-nearest-neighbor classifier searches the pattern space for the k training
tuples that are closest to the unknown tuple.
Prediction for test data is done on the basis of its neighbour
Properties of KNN

 How many neighbors should be consider? i.e. what is k?


 How do we measure distance?
 Should all points be weighted equally, or some points have more influence than others?
 How many neighbours should be consider? (what is k)
• At K=1, the KNN model tends to closely follow the training data and thus shows a high
training score and low test score, indicating over-fitting.

• The problem can be solved by tuning the value of k-neighbours parameter. As we increase
the number of neighbours k, the model starts to generalize well, but increasing the value too
much would again drop the performance.

• Therefore, it’s important to find an optimal value of K, such that the model is able to
classify well on the test data set with minimum error.

• Setting K value to the square root of number of training tuple or half of the square root
of number of training tuple can lead to better result.
• For an even number of classes (e.g. 2 or 4) it is a good idea to choose a K value with an odd
number to avoid a tie. And the inverse, use an even number for K when you have an odd
number of classes
How to Measure Distance

From these Euclidean distance is more preferable as it gives equal important


to each feature
Example: 1
• Let given is the data from the questionaires survey with two attributes (X1: Acid
durability and X2: Strength) to classify whether a special paper tissue is good or not.
Classify the new paper tissue with features: X1=3 and X2=7.

X1: X2: Y: Class


Acid Durability Strength
7 7 Bad
7 4 Bad
3 4 Good
1 4 Good
Example-1: Solution
Step 1: Intialize and Define K
K=1 Overfitting Problem
Number of tuples is even, hence consider K as odd.
Let K = 3 for this problem
Step 2: Compute distance between input sample and training samples using Euclidian
distance measure.

X1: X2: Y: Distance from


Acid Strength Class (3, 7)
Durability
7 7 Bad sqrt((7-3)2 + (7-7)2)=4
7 4 Bad sqrt((7-3)2 + (4-7)2)=5
3 4 Good sqrt((3-3)2 + (4-7)2)=3
1 4 Good sqrt((1-3)2 + (4-7)2)=3.6
Example-1: Solution
Step 3: Determine K nearest neighbor those are at minimum distance and rank them.

X1: X2: Y: Class Distance Rank


Acid Strength from
Durability (3, 7)
7 7 Bad 4 3
7 4 Bad 5 4
3 4 Good 3 1
1 4 Good 3.6 2

Step 4: Apply Simple Mojority


There are 2 “Good” and 1 “Bad”. Thus the new paper tissue will be classified as Good due to
simple majority
Example 2
Given is the customer detail as customer’s age, income, credit card no. and their
corresponding class. Find the class label for the new tuple/sample:
Customer: RRR, Age: 37, Income: 50K, Cr Card: 2, Class- ?

Customer Age Income No. of Credit Class


Cards
XXX 35 35K 3 No
YYY 22 50K 2 Yes
ZZZ 63 200K 1 No
PPP 59 170K 1 No
QQQ 25 40K 4 Yes
RRR 37 50K 2 ????
Example-2: Solution
Find the Euclidian Distance of new tuple from each training data

Customer Age Income No. of Credit Class Distance


Cards
XXX 35 35K 3 No sqrt((35-37)2+(35-50)2+(3-2)2)=15.16
+YYY 22 50K 2 Yes sqrt((22-37)2+(50-50)2+(2-2)2)=15

ZZZ 63 200K 1 No sqrt((63-37)2+(200-50)2+(1-2)2)=152.2

PPP 59 170K 1 No sqrt((59-37)2+(170-50)2+(1-2)2)=122

QQQ 25 40K 4 Yes sqrt((25-37)2+(40-50)2+(4-2)2)=15.74

RRR 37 50K 2 ????

Majority of {Yes, Yes, No} is Yes. Hence Class of new tuple is Yes
Should all Attributes be weighted equally,
or
Some Attributes have more influence than others?
Some attributes (like income, loan) outweighs some smaller ranged attributes (like binary or
categorical or numeric like age) and thus, the result obtained can be a biased one.
Normalization is a solution to it: Max-Min Normalization generally used
Solution : Normalization
KNN Classifier Algorithm
Advantage:
• No training period: There is no discriminative function or model build at the traing stage
• As there is no training period, any new data can be added seamlessly without impacting
accuracy of the result.
• Easy to implement
• Best used where probability is unknown
• Useful for non-linear data as no assumption is required

Disadvantage:
• Computationally expensive and more space complexity
• Output depends on choosen K value whic can reduce accuracy for some K value
• Does not work well with large data set and high dimension
• Sensitive to noisy, missing and outlier data
• Need normlization
Application
• Used in classification and regression
• Used in pattern recognition
• Used in gene expression, protein-protein prediction, get 3D structure
of the protein
• Used to measure document similarity
Clustering
 Clustering is the task of dividing the population or data points into a number of groups such that data
points in the same groups are more similar to other data points in the same group and dissimilar to the
data points in other groups.
 It is basically a collection of objects on the basis of similarity and dissimilarity between them.
 Following is an example of finding clusters of population based on their income and debt.

96
Clustering Example
Imagine of a number of objects in a basket. Each item has a distinct set of features (size, shape, color, etc.).
Now the task is to group each of the objects in the basket. A natural first question to ask is, ‘on what basis
these objects should be grouped?’ Perhaps size, shape, or color.

97
Clustering Example cont…
The data points clustered together can be classified into one single group. The clusters can be distinguished,
and can identify that there are 3 clusters.

98
Clustering Types
Clustering can be divided into two subgroups:
 Hard Clustering: each data point either belongs to a cluster completely or not.
 Soft Clustering: Instead of putting each data point into a separate cluster, a probability or likelihood of
that data point to be in those clusters is assigned.

99
Clustering Application
 Marketing: It can be used to characterize & discover customer segments for marketing purposes.
 Biology: It can be used for classification among different species of plants and animals.
 Libraries: It is used in clustering different books on the basis of topics and information.
 Insurance: It is used to acknowledge the customers, their policies and identifying the frauds.
 City Planning: It is used to make groups of houses and to study their values based on their geographical
locations and other factors present.
 Earthquake studies: By learning the earthquake affected areas, the dangerous zones can be
determined.
 Healthcare: It can be used in identifying and classifying the cancerous gene.
 Search Engine: It is the backbone behind the search engines. Search engines try to group similar
objects in one cluster and the dissimilar objects far from each other.
 Education: It can be used to monitor the students' academic performance. Based on the students' score
they are grouped into different-different clusters, where each clusters denoting the different level of
performance.

100
Requirements of Clustering in Data Mining

 Scalability − Highly scalable clustering algorithms are needed to deal with large databases.
 Ability to deal with different kinds of attributes - Algorithms should be capable to be applied on any
kind of data such as interval-based (numerical) data, categorical, and binary data.
 Discovery of clusters with attribute shape - The clustering algorithm should be capable of detecting
clusters of arbitrary shape. They should not be bounded to only distance measures that tend to find
spherical cluster of small sizes.
 High dimensionality - The clustering algorithm should not only be able to handle low-dimensional data
but also the high dimensional space.
 Ability to deal with noisy data − Dataset contain noisy, missing or erroneous data. Some algorithms are
sensitive to such data and may lead to poor quality clusters.
 Interpretability − The clustering results should be interpretable, comprehensible, and usable.

101
Types of clustering algorithms
Since the task of clustering is subjective, the means that can be used for achieving this goal are plenty. Every
methodology follows a different set of rules for defining the ‘similarity’ among data points. The popular types are:

 Connectivity models  Distribution models


 Centroid models  Density models

Connectivity models: These models are based on the notion that the data points closer in data space exhibit more
similarity to each other than the data points lying farther away. These models can follow two approaches. In the
first approach, they start with classifying all data points into separate clusters & then aggregating them as the
distance decreases. In the second approach, all data points are classified as a single cluster and then partitioned
as the distance increases. Also, the choice of distance function is subjective. These models are very easy to interpret
but lacks scalability for handling big datasets. Examples of these models are hierarchical clustering algorithm and its
variants.
Centroid models: These are iterative clustering algorithms in which the notion of similarity is derived by the
closeness of a data point to the centroid of the clusters. K-Means clustering algorithm is a popular algorithm that
falls into this category. In these models, the no. of clusters required at the end have to be mentioned beforehand, which
makes it important to have prior knowledge of the dataset.

102
Types of clustering algorithms cont…
Distribution models: These clustering models are based on the notion of how probable is it that all data points in the
cluster belong to the same distribution. These models often suffer from overfitting. A popular example of these models is
Expectation-maximization algorithm which uses multivariate normal distributions.

Density Models: These models search the data space for areas of varied density of data points in the data space. It
isolates various different density regions and assign the data points within these regions in the same cluster. Popular
examples of density models are DBSCAN and OPTICS.

Summary
 Connectivity based: Create a hierarchical decomposition of the set of data using some criterion.
 Centroid based: Construct various partitions and then evaluate them by some criterion.
 Distribution based: A model is hypothesized for each of the clusters and the idea is to find the best fit of that
model to each other.
 Density based: Based on connectivity and density functions.

103
Hierarchical Clustering
Let’s say we have the below points and we want to cluster them into groups.

We can assign each of these points to a separate cluster.

Now, based on the similarity of these clusters, the most similar clusters combined together and this process is
repeated until only a single cluster is left.

We are essentially building a hierarchy of clusters. That’s why this algorithm is called hierarchical clustering.

104
Hierarchical Clustering Example
Suppose a faculty wants to divide the students into different groups. The faculty has the marks scored by each student
in an assignment and based on these marks, he/she wants to segment them into groups. There’s no fixed target as to
how many groups to have. Since the faculty does not know what type of students should be assigned to which group, it
cannot be solved as a supervised learning problem. So, hierarchical clustering is applied to segment the students into
different groups. Let’s take a sample of 5 students.

Roll No Mark Creating a Proximity Matrix


1 10 First, a proximity matrix to be created which tell the distance between each of
these points (marks). Since the distance is calculated of each point from each
2 7
of the other points, a square matrix of shape n X n (where n is the number of
3 28 observations) is obtained. Let’s make the 5 x 5 proximity matrix for the
4 20 example.
5 35

105
Hierarchical Clustering Example cont…
Proximity Matrix
Roll No 1 2 3 4 5
1 0 3 18 10 25
2 3 0 21 13 28
3 18 21 0 8 7
4 10 13 8 0 15
5 25 28 7 15 0

The diagonal elements of this matrix is always 0 as the distance of a point with itself is always 0. The Euclidean distance
formula is used to calculate the rest of the distances. So, to calculate the distance between
Point 1 and 2: √(10-7)^2 = √9 = 3
Point 1 and 3: √(10-28)^2 = √324 = 18 and so on…
Similarly, all the distances are calculated and the proximity matrix is filled.

106
Steps to Perform Hierarchical Clustering

Step 1: First, all the points to an individual cluster is assigned. Different colors here represent different clusters. Hence,
5 different clusters for the 5 points in the data.

Step 2: Next, look at the smallest distance in the proximity matrix and merge the points with the smallest distance.
Then the proximity matrix is updated.

Roll No 1 2 3 4 5
1 0 3 18 10 25
2 3 0 21 13 28
3 18 21 0 8 7
4 10 13 8 0 15
5 25 28 7 15 0
Here, the smallest distance is 3 and hence point 1 and 2 is merged.

107
Steps to Perform Hierarchical Clustering cont…

Let’s look at the updated clusters and accordingly update the proximity matrix. Here, we have taken the maximum of
the two marks (7, 10) to replace the marks for this cluster. Instead of the maximum, the minimum value or the average
values can also be considered.

Roll No Mark
(1, 2) 10
3 28
4 20
5 35

Now, the proximity matrix for these clusters is calculated again.

108
Steps to Perform Hierarchical Clustering cont…
Revised Proximity Matrix

Roll No 1, 2 3 4 5
1, 2 0 18 10 25
3 18 0 8 7
4 10 8 0 15
5 25 7 15 0

Step 3: Step 2 is repeated until only a single cluster is left. So, look at the minimum distance in the proximity matrix
and then merge the closest pair of clusters. We will get the merged clusters after repeating these steps:

109
Steps to Perform Hierarchical Clustering cont…

We started with 5 clusters and finally have a single cluster.

110
Types of Hierarchical Clustering
There are mainly two types of hierarchical clustering:
 Agglomerative hierarchical clustering
 Divisive Hierarchical clustering
Agglomerative Hierarchical Clustering
 Each point is assigned to an individual cluster in this technique. Suppose there are 4 data points, so each of these
points would be assigned to a cluster and hence there would be 4 clusters in the beginning.
 Then, at each iteration, closest pair of clusters are merged and this step is repeated until only a single cluster is left.

111
Types of Hierarchical Clustering cont…

Divisive Hierarchical clustering


 Divisive hierarchical clustering works in the opposite way. Instead of starting with n clusters (in case of n
observations), we start with a single cluster and assign all the points to that cluster. So, it doesn’t matter if we have
10 or 1000 data points. All these points will belong to the same cluster at the beginning.
 Now, at each iteration, farthest point in the cluster is split and this process is repeated until each cluster only
contains a single point.

112
Dendrogram
 To get the number of clusters for hierarchical clustering, we make use of the concept called a Dendrogram.
 A dendrogram is a tree-like diagram that records the sequences of merges or splits.
 Let’s get back to faculty-student example. Whenever we merge two clusters, a dendrogram record the distance
between these clusters and represent it in graph form.

We have the samples of the dataset on the x-axis


and the distance on the y-axis. Whenever two
clusters are merged, we will join them in this
dendrogram and the height of the join will be the
distance between these points.

113
Dendrogram cont…
We started by merging sample 1 and 2 and the distance
between these two samples was 3. Let’s plot this in the
dendrogram.

Here, we can see that we have merged sample 1 and 2. The vertical line represents the distance between these
samples.
114
Dendrogram cont…
Similarly, we plot all the steps where we merged the clusters and finally, we get a dendrogram like this:

We can clearly visualize the steps of hierarchical clustering. More the distance of the vertical lines in the
dendrogram, more the distance between those clusters.
115
Dendrogram cont…
Now, we can set a threshold distance and draw a horizontal line (Generally, the threshold is set in such a way that it cuts
the tallest vertical line). Let’s set this threshold as 12 and draw a horizontal line:

The number of clusters will be the number of vertical lines which are being intersected by the line drawn using the
threshold. In the above example, since the red line intersects 2 vertical lines, we will have 2 clusters. One cluster will
have a sample (1,2,4) and the other will have a sample (3,5).
116
Hierarchical Clustering closeness of two clusters

The decision of merging two clusters is taken on the basis of closeness of these clusters. There are multiple
metrics for deciding the closeness of two clusters and primarily are:
 Euclidean distance  Manhattan distance
 Squared Euclidean distance  Maximum distance
 Mahalanobis distance

117
K- Means Clustering Algorithm
 It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each
dataset belongs only one group that has similar properties.
 It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of distances between the data point and their corresponding clusters. A
centroid is a data point (imaginary or real) at the center of a cluster.
 The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of clusters, and repeats
the process until it does not find the best clusters. The value of k should be predetermined in this algorithm.
 The k-means clustering algorithm mainly performs two tasks:
 Determines the best value for k center points or centroids by an iterative process.
 Assigns each data point to its closest k-center. Those data points which are near to the particular k-
center, create a cluster.
 Hence each cluster has datapoints with some commonalities, and it is away from other clusters.

118
K- Means Algorithm cont…

The below diagram explains the working of the K-means Clustering Algorithm:

119
How does the K-Means Algorithm Work?

1. Begin
2. Step-1: Select the number K to decide the number of clusters.
3. Step-2: Select random K points or centroids. (It can be other from the input dataset).
4. Step-3: Assign each data point to their closest centroid, which will form the predefined K clusters.
5. Step-4: Calculate the variance and place a new centroid of each cluster.
6. Step-5: Repeat the third steps, which means reassign each datapoint to the new closest centroid of each
cluster.
7. Step-6: If any reassignment occurs, then go to step-4 else go to step-7.
8. Step-7: The model is ready.
9. End

120
Working of K-Means Algorithm
Suppose we have two variables x and y. The x-y axis scatter plot of these two variables is given below:

121
Working of K-Means Algorithm cont…
 Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into different clusters. It
means here we will try to group these datasets into two different clusters.
 We need to choose some random K points or centroid to form the cluster. These points can be either the
points from the dataset or any other point. So, here we are selecting the below two points as K points, which
are not the part of dataset. Consider the below image:

122
Working of K-Means Algorithm cont…
 Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute it by
calculating the distance between two points. So, we will draw a median between both the centroids. Consider
the below image:

123
Working of K-Means Algorithm cont…
 From the image, it is clear that points left side of the line is near to the K1 or blue centroid, and points to the
right of the line are close to the orange centroid i.e. K2. Let's color them as blue and yellow for clear
visualization.

124
Working of K-Means Algorithm cont…
 As we need to find the closest cluster, so we will repeat the process by choosing a new centroid. To choose
the new centroids, we will compute the center of gravity of these centroids, and will find new centroids as
below:

125
Working of K-Means Algorithm cont…
 Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same process of finding
a median line. The median will be like below image:

From the above image, we can see, one yellow point is on the left side of the line, and two blue points are right to
the line. So, these three points will be assigned to new centroids.
126
Working of K-Means Algorithm cont…

 As reassignment has taken place, so we will again go to the step-4, which is finding new centroids or K-
points.

127
Working of K-Means Algorithm cont…
 We will repeat the process by finding the center of gravity of centroids, so the new centroids will be as shown
in the below image:

128
Working of K-Means Algorithm cont…
 As we got the new centroids so again will draw the median line and reassign the data points. So, the image
will be:

129
Working of K-Means Algorithm cont…
 We can see in the previous image; there are no dissimilar data points on either side of the line, which means
our model is formed. Consider the below image:

130
Working of K-Means Algorithm cont…
 As our model is ready, so we can now remove the assumed centroids, and the two final clusters will be as
shown in the below image:

131
K-means Algoritms
• Initialization
• Arbitrarily choose k objects as the initial cluster
centers (centroids)
• Iteration until no change
• For each object Oi
• Calculate the distances between Oi and the k
centroids
• (Re)assign Oi to the cluster whose centroid is the
closest to Oi
• Update the cluster centroids based on current
assignment
Illustrating K-Means
K-means example, step 1
K-means example, step 2
K-means example, step 3
K-means example, step 4
K-means example, step 4a
K-means example, step 4b
K-means example, step 5
K-Means Clustering Example cont..
(Calculate the distances of all data points from 3 centroids)
Sr.
No.
1
2
3
4
• Compare the 3 distances of individual 5
data points from the corresponding 6
centroids and find the minimumm 7
distance. 8
• The data point will belong to the 9
cluster to which it is closer
10
Clusters
141
G-1 = {1, 2, 4} G-2 = {7, 10} G-3 = {3, 5, 6, 8, 9}
K-Means Clustering Example cont..

Sr.
No.
1
Iteration-2 2
3
4
5
6
7
8
9
10
New Cluster
142
G-1 = {1, 2, 4} G-2 = {7, 9, 10} G-3 = {3, 5, 6, 8}
K-Means Clustering Example cont..

Sr.
No.
1
2
Iteration-3
3
4
5
6
7
8
9
10
New Cluster
143
G-1 = {1, 2, 4} G-2 = {5, 7, 9, 10} G-3 = {3, 6, 8}
K-Means Clustering Example cont..
Sr.
Iteration-4 No.
1
2
3
4
5
6
7
8
9
10

New Cluster
144
G-1 = {1, 2} G-2 = {5, 7, 9, 10} G-3 = {3, 4, 6, 8}
K-Means Clustering Example cont..
Sr.
Iteration-5 No.
1
2
3
4
5
6
7
8
9
10

New Cluster
G-1 = {1, 2} G-2 = {5, 7, 9, 10} G-3 = {3, 4, 6, 8}
145
Example:
• Apply K-mean clustering for the
following data sets for two
clusters. Tabulate all the
assignments.
Step-2: New Centroid
Step-3: Update the cluster centroid
Step-4: Similarly process for next data set
Step-7: Update the cluster centroid

Chart Title
90

80

70

60

50

40

30

20

10

0
165 170 175 180 185 190
Support Vector Machines
Support Vector Machine (SVM) is an algorithm which can be used for both classification or regression challenges.
However, it is mostly used in two-group classification problems. In the SVM algorithm, each data item is plotted as a
point in n-dimensional space (where n is number of features) with the value of each feature being the value of a
particular coordinate. Then, classification is performed by finding the hyperplane that differentiates the two classes.

What is hyperplane?
A hyperplane is a generalization of a plane.
 In one dimension, it is called a point.
 In two dimensions, it is a line.
 In three dimensions, it is a plane.
 In more dimensions one can call it an hyperplane.
The following figure represents datapoint in one dimension and the point L is a separating hyperplane.

151
School of Computer Engineering
Support Vector Machine

School of Computer Engineering 152


Support Vector Machines cont…
How does SVM works?
 Let’s imagine we have two tags: red and blue, and our data has two features: x
and y. We want a classifier that, given a pair of (x, y) coordinates, outputs if it’s
either red or blue. We plot our already labeled training data on a plane.

 A support vector machine takes these data points and outputs the hyperplane
(which in two dimensions it’s simply a line) that best separates the tags. This
line is the decision boundary: anything that falls to one side of it we will
classify as blue, and anything that falls to the other as red.
 In 2D, the best hyper-plane is simply a line.

153
School of Computer Engineering
Support Vector Machines cont…
But, what exactly is the best hyperplane? For SVM, it’s the one that maximizes the margins from both tags. In other
words: the hyperplane (in 2D, it's a line) whose distance to the nearest element of each tag is the largest.

154
School of Computer Engineering
Support Vector Machines cont…
(Scenario-1) Identification of the right hyperplane: Here, we have three hyperplanes (A, B and C). Now, the job is to
identify the right hyperplane to classify star and circle. Remember a thumb rule - identify the right hyperplane: “Select
the hyperplane which segregates the two classes better”. In this scenario, hyperplane “B” has excellently performed this
job.

155
School of Computer Engineering
Support Vector Machines cont…
(Scenario-2) Identify the right hyperplane: Here, we have three hyperplanes (A, B and C) and all are segregating the
classes well. Now, How can we identify the right hyperplane?

Here, maximizing the distances between nearest data point (either class) and hyperplane will help to decide the right
hyperplane. This distance is called as margin.

156
School of Computer Engineering
Support Vector Machines cont…
Below, you can see that the margin for hyperplane C is high as compared to both A and B. Hence, we name the right
hyperplane as C. Another lightning reason for selecting the hyperplane with higher margin is robustness. If we select a
hyperplane having low margin then there is high chance of misclassification.

157
School of Computer Engineering
Support Vector Machines cont…
(Scenario-3) Identify the right hyperplane: Use the rules as discussed in previous section to identify the right
hyperplane.

Some of you may have selected the hyperplane B as it has higher margin compared to A. But, here is the catch, SVM
selects the hyperplane which classifies the classes accurately prior to maximizing margin. Here, hyperplane B has a
classification error and A has classified all correctly. Therefore, the right hyperplane is A.

158
School of Computer Engineering
Support Vector Machines cont…
(Scenario-4) Below, we are unable to segregate the two classes using a straight line, as one of the stars lies in the
territory of other(circle) class as an outlier.

One star at other end is like an outlier for star class. The SVM algorithm has a feature to ignore outliers and find the
hyperplane that has the maximum margin. Hence, we can say, SVM classification is robust to outliers.

159
School of Computer Engineering
Support Vector Machines cont…
(Scenario-5) Find the hyperplane to segregate to classes: In the
scenario, we can’t have linear hyperplane between the two classes,
so how does SVM classify these two classes? Till now, we have only
looked at the linear hyperplane.

SVM can solve this problem. Easily! It solves this problem by


introducing additional feature. Here, we can add a new feature
z=x2+y2. Now, plotting the data points on axis x and z:

In above plot, points to consider are:


 All values for z would be positive as z is the squared sum of both x and y.
 In the original plot, red circles appear close to the origin of x and y axes, leading to lower value of z and
star relatively away from the origin result to higher value of z.
160
School of Computer Engineering
Support Vector Machines cont…
 In the SVM classifier, it is easy to have a linear hyperplane between
these two classes. But, another burning question which arises is, should
we need to add this feature manually to have a hyperplane. No, the SVM
algorithm has a technique called the kernel trick. The SVM kernel takes
low dimensional input space and transforms it to a higher dimensional
space i.e. it converts not separable problem to separable problem. It is
mostly useful in non-linear separation problem. Simply put, it does
some extremely complex data transformations, then finds out the
process to separate the data based on the labels or outputs you’ve
defined.
 When we look at the hyperplane in original input space it looks like a
circle.

161
School of Computer Engineering
Types of SVM
 Linear SVM: Linear SVM is used for data that are linearly separable i.e. for a dataset that can be
categorized into two categories by utilizing a single straight line. Such data points are termed as linearly
separable data, and the classifier is used described as a linear SVM classifier.
 Non-linear SVM: Non-Linear SVM is used for data that are non-linearly separable data i.e. a straight line
cannot be used to classify the dataset. For this, we use something known as a kernel trick that sets data
points in a higher dimension where they can be separated using planes or other mathematical functions.
Such data points are termed as non-linear data, and the classifier used is termed as a non-linear SVM
classifier.

162
School of Computer Engineering

You might also like