Unit-2
Supervised Learning
Syllabus
• Supervised Learning:
• Learning a Class from Examples, Learning Multiple Classes, Regression, Model Selection and Generalization,
Dimensions of Supervised Machine Learning Algorithm,
• Non-Parametric Methods: Histogram Estimator, Kernel Estimator, K-Nearest Neighbor Estimator
Learning a class from examples
• Class learning is finding a description that is shared by all the
positive examples.
• For example, given a car that we have not seen before by
checking with the description learned, we will be able to say
whether it is a family car or not.
Learning a Class from Examples
• Class C of a “family car”
– Prediction: Is car x a family car?
– Knowledge extraction: What do people expect from a family car?
• Output:
Positive (+) and negative (–) examples
• Input representation:
x1: price, x2 : engine power
4
Training set X
X = {x t ,r t }tN=1
1 if x is positive
r =
0 if x is negative
x 1
x=
x 2
5
Class C
(p1 price p2 ) AND (e1 engine power e2 )
6
Hypothesis class H
1 if h classifies x as positive
h (x ) =
0 if h classifies x as negative
Error of h on H
(( ) )
N
E (h | X ) = 1 h x r
t t
t =1
7
Learning Multiple classes
• In the example discussed earlier, positive examples belonging to
the family car and the negative examples belongs to all other
cars.
• This is a two class problem.
• In the general case, we have K classes denoted by Ci, where i=1
…k, and an input instance belongs to one and exactly one of
them.
• An example in next slide with instances from 3 classes: family car,
sports car and luxury sedan car.
Multiple Classes, Ci i=1,...,K
K number of classes and t number of test data
X = {x t ,r t }tN=1
1 if x t
Ci
ri =
t
0 if x t
C j , j i
Train hypotheses
hi(x), i =1,...,K:
1 if x t
Ci
( )
hi x =
t
0 if x C j , j i
t
9
Regression
• Regression models are used to predict a continuous value.
• Predicting prices of a house given the features of house
like size, price etc is one of the common examples of
Regression. It is a supervised technique.
• Classification applied on discrete values and regression on
continuous values.
• Example: age vs. height., temperature of a city, house price
Regression vs classification
Supervised vs. unsupervised learning
Underfitting vs. Overfitting
Model Selection and Generalization
• What is a model ? ML algorithm trained with training data is
usually referred as Hypothesis or Model.
• For example: Linear Regression algorithm
Other Examples: Decision Tree, Neural Network
Model Selection?
• It is the process of selecting the optimal model from the set of candidate
models, for a given data.
• Data used in learning:
– Training Set (Usually 60%)
– Validation set – Cross Validation (20%) : Validated on different models on the same
training set
– Test data (20%) : Unseen data
• Model Selection is finding the optimal model which minimizes both bias and
variance.
• Bias is the error during training and Variance is the error during testing
Inductive bias
• The set of assumption we make to have learning possible is called
inductive bias of the learning algorithm.
• Model selection is also about choosing the right inductive bias.
Generalization
• Generalisation is how well a model trained on the training set
predicts the right output for new instances.
• A model should be selected having best generalisation. It should
avoid two problems
– Underfitting
– Overfitting
In underfitting – simple model is selected and hence
bias is very high
Underfitting – overfitting – just right model-
Generalization
• As the model selected the bias (training error) is very high.
• Also variance testing error) is also very high.
• Overfitting is trying to fit each and every data/samples.
• Bias is low.
• The model is complex
• Variance is expected to be very high.
• In just right model, it is close all the data/samples, due to this minim
bias and variance is also low : This is called generalization
Dimensions of Supervised Machine Learning
Algorithm
Consider we have N samples and K classes
Non-Parametric Estimation
• Statistical Estimation:
– Population is the entire data and a smaller set of data from the
population is called Sample. (Election result analysis – Exit poll )
– Usually it is very difficult to estimate on the entire data, it will be
performed on a sample data.
• Classification of Statistical Estimation:
– Parametric : It uses mean and SD.
– Non-Parametric : It does not uses mean and standard deviation.
Non-Parametric Methods: Histogram Estimator
• Histogram is a convenient way to describe the data.
• To form a histogram, the data from a single class are grouped into
intervals.
• Over each interval rectangle is drawn, with height proportional to
number of data points falling in that interval. In the example
interval is chosen to have width of two units.
Kernel and Window Estimators
• Histogram is a good representation for discrete data. It will show the spikes
for each bin.
• But may not suite for continuous data. Then we will be using Kernel
(function) for each of the data points. And the total density is estimated by
the kernel density function.
• It is useful for applications like audio density estimation.
• This approximation to a continuous density estimation is not useful in
decision making.
• Each Delta function is replaced by Kernel Functions such as rectangles,
triangles or normal density functions which have been scaled so that their
combined area should be equal to one.
Kernel Density function
-4 to -2 = 1, -2 to 0 = 2, 0 to -2 = 1 -2 to -4 =0, -4 to -6=1 and -6 to -8=1
Height = 1/6*2 = 0.08 (first case) and so on
KERNEL DENSITY ESTIMATION
Similarity and Dissimilarity
Distance are used to measure similarity
There are many ways to measure the distance s between two instances
Distance or similarity measures are essential in solving many pattern recognition problems
such as classification and clustering. Various distance/similarity measures are available in the
literature to compare two data distributions.
As the names suggest, a similarity measures how close two distributions are.
For algorithms like the k-nearest neighbor and k-means, it is essential to measure the
distance between the data points.
• In KNN we calculate the distance between points to find the nearest neighbor.
• In K-Means we find the distance between points to group data points into clusters based
on similarity.
• It is vital to choose the right distance measure as it impacts the results of our algorithm.
Euclidean Distance
• We are most likely to use Euclidean distance when calculating the distance between two rows
of data that have numerical values, such a floating point or integer values.
• If columns have values with differing scales, it is common to normalize or standardize the
numerical values across all columns prior to calculating the Euclidean distance. Otherwise,
columns that have large values will dominate the distance measure.
n
dist = ( pk − qk )2
k =1
• Where n is the number of dimensions (attributes) and pk and qk are, respectively, the kth
attributes (components) or data objects p and q.
• Euclidean distance is also known as the L2 norm of a vector.
Compute the Euclidean Distance between the following data set
• D1= [10, 20, 15, 10, 5]
• D2= [12, 24, 18, 8, 7]
Apply Pythagoras theorem for Euclidean distance
Manhattan distance:
Manhattan distance is a metric in which the distance between two points is the sum
of the absolute differences of their Cartesian coordinates. In a simple way of saying it
is the total sum of the difference between the x-coordinates and y-coordinates.
Formula: In a plane with p1 at (x1, y1) and p2 at (x2, y2)
• The Manhattan distance is related to the L1 vector norm
• In general ManhattanDistance = sum for i to N sum of |v1[i] – v2[i]|
Compute the Manhattan distance for the following
• D1 = [10, 20, 15, 10, 5]
• D2 = [12, 24, 18, 8, 7]
Manhattan distance:
is also popularly called city block distance
Euclidean distance is like flying
distance
Manhattan distance is like
travelling by car
Minkowski Distance
• It calculates the distance between two real-valued vectors.
• It is a generalization of the Euclidean and Manhattan distance measures and
adds a parameter, called the “order” or “r“, that allows different distance
measures to be calculated.
• The Minkowski distance measure is calculated as follows:
1
n
dist = ( | pk − qk r r
| )
k =1
Where r is a parameter, n is the number of dimensions (attributes) and pk
and qk are, respectively, the kth attributes (components) or data objects p
and q.
Minkowski is called generalization of Manhattan and Euclidean:
Manhattan Distance is called L1 Norm and
Euclidean distance is called L2
Minkowski is called Lp where P can be 1 or 2
Cosine Similarity
(widely used in recommendation system and NLP)
–If A and B are two document vectors.
–Cosine similarity ranges between (-1 to +1)
– -1 indicates not at all close and +1 indicates it is very close in similarity
–In cosine similarity data objects are treated as vectors.
– It is measured by the cosine of the angle between two vectors and determines
whether two vectors are pointing in roughly the same direction. It is often used
to
measure document similarity in text analysis.
–Cosine Distance = 1- Cosine Similarity
cos(A, B) = 1: exactly the same
0: orthogonal
−1: exactly opposite
Formula for Cosine Similarity
• The cosine similarity between two vectors is measured in ‘θ’.
• If θ = 0°, the ‘x’ and ‘y’ vectors overlap, thus proving they are
similar.
• If θ = 90°, the ‘x’ and ‘y’ vectors are dissimilar.
• If two points are on the same plane or same vector
• In the example given P1 and P2 are on the same vector,
and hence the angle between them is 0, so COS(0) =1 indicates
they are of high similarity
• In this example two points P1 and P2
are separated by 45 degree, and hence
Cosine similarity is COS(45) = 0.53.
In this example P1 and P2 are separated by
90 degree, and hence the Cosine
similarity is COS(90)= 0
If P1 and P2 are on the opposite side
• If P1 and P2 are on the opposite side then the angle between
them is 180 degree and hence the COS(180)= -1
• If it is 270, then again it will be 0, and 360 or 0 it will be 1.
Cosine Similarity
Advantages of Cosine Similarity
• The cosine similarity is beneficial because even if the two similar
data objects are far apart by the Euclidean distance because of
the size, they could still have a smaller angle between them.
Smaller the angle, higher the similarity.
• When plotted on a multi-dimensional space, the cosine similarity
captures the orientation (the angle) of the data objects and not
the magnitude.
Example1 for computing cosine distance
Consider an example to find the similarity between two vectors – ‘x’ and ‘y’, using Cosine
Similarity. (if angle can not be estimated directly)
The ‘x’ vector has values, x = { 3, 2, 0, 5 }
The ‘y’ vector has values, y = { 1, 0, 0, 0 }
The formula for calculating the cosine similarity is : Cos(x, y) = x . y / ||x|| * ||y||
x . y = 3*1 + 2*0 + 0*0 + 5*0 = 3
||x|| = √ (3)^2 + (2)^2 + (0)^2 + (5)^2 = 6.16
||y|| = √ (1)^2 + (0)^2 + (0)^2 + (0)^2 = 1
∴ Cos(x, y) = 3 / (6.16 * 1) = 0.49
Example2 for computing cosine distance
d1 = 3 2 0 5 0 0 0 2 0 0 ; d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2= 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0) 0.5 = (42) 0.5 = 6.481
(square root of sum of squares of all the elements)
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2) 0.5 = (6) 0.5 = 2.245
So cosine similarity = cos( d1, d2 ) = (d1 • d2)/ (||d1|| *||d2|| )
= (5/(6.481*2.245)) = 0.3150
Cosine distance (or it can be called dis-similarity)
= 1-cos(d1,d2) = 1-0.3436 = 0.6564
Find Cosine distance between
D1 = [5 3 8 1 9 6 0 4 2 1] D2 = [1 0 3 6 4 5 2 0 0 1]
When to use Cosine Similarity
• Cosine similarity looks at the angle between two vectors, euclidian
similarity at the distance between two points. Hence it is very popular
for NLP applications.
• Let's say you are in an e-commerce setting and you want to compare
users for product recommendations:
• User 1 bought 1x eggs, 1x flour and 1x sugar.
• User 2 bought 100x eggs, 100x flour and 100x sugar
• User 3 bought 1x eggs, 1x Vodka and 1x Red Bull
• By cosine similarity, user 1 and user 2 are more similar. By euclidean
similarity, user 3 is more similar to user 1.
JACCARD SIMILARITY AND DISTANCE:
In Jaccard similarity instead of vectors, we will be using sets.
It is used to find the similarity between two sets.
Jaccard similarity is defined as the intersection of sets divided by
their union. (count)
Jaccard similarity between two sets A and B is
A simple example using set notation: How similar are these two sets?
A = {0,1,2,5,6}
B = {0,2,3,4,5,7,9}
J(A,B) = {0,2,5}/{0,1,2,3,4,5,6,7,9} = 3/9 = 0.33
Jaccard Similarity is given by :
Overlapping vs Total items.
• Jaccard Similarity value ranges between 0 to 1
• 1 indicates highest similarity
• 0 indicates no similarity
Application of Jaccard Similarity
• Language processing is one example where jaccard similarity is
used.
• In this example it is 4/12 = 0.33
Jaccard Similarity is popularly used for ML model
performance analysis
• In this example, table is designed against
Actual vs predicted.
This gives an idea how our algorithm is working
• In the example is shows the overlapping +ve vs
Total positives including actual and predicted
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well known
properties.
1. d(p, q) 0 for all p and q and d(p, q) = 0 only if
p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) d(p, q) + d(q, r) for all points p, q, and r.
(Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between points (data objects), p and q.
• A distance that satisfies these properties is a metric, and a space is
called a metric space
Distance Metrics Continued
• Dist (x,y) >= 0
• Dist (x,y) = Dist (y,x) are Symmetric
• Detours can not Shorten Distance
Dist(x,z) <= Dist(x,y) + Dist (y,z)
z
X X y
y z
Euclidean Distance
2 p1 point x y
p3 p4 p1 0 2
1 p2 2 0
p2 p3 3 1
0 p4 5 1
0 1 2 3 4 5 6
p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
Distance Matrix
Minkowski Distance
3
L1 p1 p2 p3 p4
p1 0 4 4 6
2 p1 p2 4 0 2 4
p3 p4
1
p3 4 2 0 2
p2 p4 6 4 2 0
0
0 1 2 3 4 5 6 L2 p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
point x y
L p1 p2 p3 p4
p1 0 2
p1 0 2 3 5
p2 2 0 p2 2 0 1 3
p3 3 1 p3 3 1 0 2
p4 5 1 p4 5 3 2 0
Distance Matrix
Summary of Distance Metrics
• Manhattan Distance • Euclidean Distance
|X1-X2| + |Y1-Y2| • 𝑥1 − 𝑥2 2 + √ 𝑦1 − 𝑦2 2
Nearest Neighbors Classifiers
• Basic idea:
– If it walks like a duck, quacks like a duck, then it’s probably a duck
Compute Distance
Test Record
Training Records Choose k of the “nearest”
records
K-Nearest Neighbors (KNN) : ML algorithm
• Simple, but a very powerful classification algorithm
• Classifies based on a similarity measure
• This algorithm does not build a model
• Does not “learn” until the test example is submitted for classification
• Whenever we have a new data to classify, we find its K-nearest neighbors from
the training data
• Classified by “MAJORITY VOTES” for its neighbor classes
• Assigned to the most common class amongst its K-Nearest Neighbors
(by measuring “distant” between data)
• In practice, k is usually chosen to be odd, so as to avoid ties
• The k = 1 rule is generally called the “nearest-neighbor classification” rule
K-Nearest Neighbors (KNN)
• K-Nearest Neighbor is one of the simplest Machine Learning algorithms based
on Supervised Learning technique.
• K-NN algorithm assumes the similarity between the new case/data/Pattern
and available cases and put the new case into the category that is most similar
to the available categories.
• K-NN algorithm can be used for Regression as well as for Classification but
mostly it is used for the Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
• K-NN algorithm stores all the available data and classifies a new data point
based on the similarity. This means when new data appears then it can be
easily classified into a well suite category by using K- NN algorithm.
• It is also called a lazy learner algorithm because it does not learn from the
training set immediately instead it stores the dataset and at the time of
classification, it performs an action on the dataset.
• KNN algorithm at the training phase just stores the dataset and when it gets
new data, then it classifies that data into a category that is much similar to the
new data.
Illustrative Example for KNN
Collected data over the past few years(training data)
Considering K=1, based on nearest neighbor find the test
data class- It belongs to class of africa
Now we have used K=3, and 2 are showing it is close to
North/South America and hence the new data or data under
testing belongs to that class.
In this case K=3… but still not a correct value to
classify…Hence select a new value of K
Algorithm
• Step-1: Select the number K of the neighbors
• Step-2: Calculate the Euclidean distance to all the data points in
training.
• Step-3: Take the K nearest neighbors as per the calculated
Euclidean distance.
• Step-4: Among these k neighbors, apply voting algorithm
• Step-5: Assign the new data points to that category for which the
number of the neighbor is maximum.
• Step-6: Our model is ready.
Consider the following data set of a pharmaceutical company with assigned class labels,
using K nearest neighbour method classify a new unknown sample using k =3 and k = 2.
Points X1 (Acid Durability ) X2(strength) Y=Classification
P1 7 7 BAD
P2 7 4 BAD
P3 3 4 GOOD
P4 1 4 GOOD
New pattern with X1=3, and X2=7 Identify the Class?
Points X1(Acid Durability) X2(Strength) Y(Classification)
P1 7 7 BAD
P2 7 4 BAD
P3 3 4 GOOD
P4 1 4 GOOD
P5 3 7 ?
KNN
P1 P2 P3 P4
(7,7) (7,4) (3,4) (1,4)
Euclidean
Distance of
P5(3,7) from
Sqrt((7-3) 2 + (7-7)2 ) = 16 Sqrt((7-3) 2 + (4-7)2 ) = 25 Sqrt((3-3) 2 + (4-7)2 ) Sqrt((1-3) 2 + (4-7)2 )
=4 =5 = 9 = 13
=3 = 3.60
P1 P2 P3 P4
Euclide (7,7) (7,4) (3,4) (1,4)
an
Distanc
e of Sqrt((7-3) 2 + Sqrt((7-3) 2 + Sqrt((3-3) 2 Sqrt((1-3) 2
P5(3,7) (7-7)2 ) (4-7)2 ) = 25 + (4-7)2 ) + (4-7)2 )
from = 16 =5 = 9 = 13
=4 =3 = 3.60
Class BAD BAD GOOD GOOD
Height (in cms) Weight (in kgs) T Shirt Size
158 58 M New customer named 'Mary’ has height
158 59 M 161cm and weight 61kg.
158 63 M
160 59 M
Suggest the T shirt Size with K=3,5
160 60 M
163 60 M
using Euclidean Distance and
163 61 M also Manhattan Distance
160 64 L
163 64 L
165 61 L
165 62 L
165 65 L
168 62 L
168 63 L
168 66 L
170 63 L
170 64 L
170 68 L
There is a Car manufacturer company that has
manufactured a new SUV car. The company wants to give
the ads to the users who are interested in buying that SUV.
So for this problem, we have a dataset that contains
multiple user's information through the social network. The
dataset contains lots of information but the Estimated
Salary and Age we will consider for the independent
variable and the Purchased variable is for the dependent
variable. Dataset is as shown in the table. Using K =5
classify the new sample
• There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
• A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of
outliers in the model.
• Large values for K are good, but it may find some difficulties.
• Advantages of KNN Algorithm:
• It is simple to implement.
• It is robust to the noisy training data
• It can be more effective if the training data is large.
• Disadvantages of KNN Algorithm:
• Always needs to determine the value of K which may be complex some time.
• The computation cost is high because of calculating the distance between the data
points for all the training samples.
Another example: solve
• Because the distance function used to find the k nearest
neighbors is not linear, so it usually won't lead to a linear
decision boundary.
End of Unit - 2