Chapter 4
Similarity-based
Learning
“Anyone who stops learning is old, whether at twenty or eighty.”
— Henry Ford
Similarity-based Learning is a supervised learning technique that predicts the class label of a test
instance by gauging the similarity of this test instance with training instances. Similarity-based
learning refers to a family of instance-based learning methods which are used to solve both classification
and regression problems. Instance-based learning makes predictions by computing distances
or similarities between the test instance and a specific set of training instances local to the test
instance in an incremental process. In contrast to other learning mechanisms, it considers only
the nearest instance or instances to predict the class of unseen instances. This learning methodology
can improve classification performance since each prediction uses only a specific set of instances,
as an incremental learning task. Similarity-based classification is useful in various fields such as
image processing, text classification, pattern recognition, bioinformatics, data mining, information
retrieval, natural language processing, etc. A practical application of this learning is
predicting daily stock index price changes. This chapter provides an insight into how different
similarity-based models predict the class of a new instance.
Learning Objectives
* Understand the fundamentals of Instance-based learning
* Know about the concepts of Nearest-Neighbor Learning using the algorithm called
k-Nearest-Neighbors (k-NN)
* Learn about Weighted k-Nearest-Neighbor classifier that chooses the neighbors by
using the weighted distance
* Gain knowledge about Nearest Centroid classifier, a simple alternative to k-NN
classifiers
* Understand Locally Weighted Regression (LWR) that approximates the linear
functions of all k neighbors to minimize the error during prediction
4.1 INTRODUCTION TO SIMILARITY OR INSTANCE-BASED
LEARNING
Similarity-based classifiers use similarity measures to locate the nearest neighbors and classify a test
instance, which works in contrast with other learning mechanisms such as decision trees or neural
networks. Similarity-based learning is also called Instance-based learning or Just-in-time learning,
since it does not build an abstract model of the training instances and performs lazy learning when
classifying a new instance. This learning mechanism simply stores all data and uses it only when it
needs to classify an unseen instance. The advantage of using this learning is that processing occurs
only when a request to classify a new instance is given. This methodology is particularly useful
when the whole dataset is not available in the beginning but is collected in an incremental manner.
The drawback of this learning is that it requires a large memory to store the data since a global
abstract model is not constructed initially with the training data. Classification of instances is done
based on the measure of similarity in the form of distance functions over data instances. Several
distance metrics are used to estimate the similarity or dissimilarity between instances required for
clustering, nearest neighbor classification, anomaly detection, and so on. Popular distance metrics
used are Hamming distance, Euclidean distance, Manhattan distance, Minkowski distance, Cosine
similarity, Mahalanobis distance, Pearson's correlation or correlation similarity, Mean squared
difference, Jaccard coefficient, Tanimoto coefficient, etc.
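A few of the distance metrics named above can be sketched in plain Python as follows. This is only an illustrative sketch: the function names and the sample points `p` and `q` are assumptions made for this example and do not come from the text.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two numeric feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    # Generalizes Manhattan (p = 1) and Euclidean (p = 2) distance.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def hamming(a, b):
    # Number of positions at which two categorical vectors differ.
    return sum(x != y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors (1 means same direction).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical sample points, chosen only for illustration.
p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))   # 5.0
print(manhattan(p, q))   # 7.0
print(hamming(("a", "b", "c"), ("a", "x", "c")))  # 1
```

Which metric is appropriate depends on the attribute types: the numeric metrics suit continuous attributes, while Hamming distance suits categorical ones.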
Generally, Similarity-based classification problems formulate the features of test instance and
training instances in Euclidean space to learn the similarity or dissimilarity between instances.
4.1.1 Differences Between Instance- and Model-based Learning
An instance is an entity or an example in the training dataset. It is described by a set of features or
attributes. One attribute describes the class label or category of an instance. Instance-based methods
learn or predict the class label of a test instance only when a new instance is given for classification,
and until then they delay the processing of the training dataset.
They are also referred to as lazy learning methods since they do not generalize any model from the
training dataset but just keep the training dataset as a knowledge base until a new instance is
given. In contrast, model-based learning, generally referred to as eager learning, tries to generalize the
training data to a model before receiving test instances. Model-based machine learning describes
all assumptions about the problem domain in the form of a model. These algorithms basically learn
in two phases, called training phase and testing phase. In training phase, a model is built from the
training dataset and is used to classify a test instance during the testing phase. Some examples of
models constructed are decision trees, neural networks and Support Vector Machines (SVM), etc.
The differences between Instance-based Learning and Model-based Learning are listed in
Table 4.1.
Table 4.1: Differences between Instance-based Learning and Model-based Learning
Instance-based Learning | Model-based Learning
Lazy Learners | Eager Learners
Processing of training instances is done only during testing phase | Processing of training instances is done during training phase
No model is built with the training instances before it receives a test instance | Generalizes a model with the training instances before it receives a test instance
Predicts the class of the test instance directly from the training data | Predicts the class of the test instance from the model built
Slow in testing phase | Fast in testing phase
Learns by making many local approximations | Learns by creating global approximation
Instance-based learning also comes under the category of memory-based models which normally
compare the given test instance with the trained instances that are stored in memory. Memory-
based models classify a test instance by checking the similarity with the training instances.
Some examples of Instance-based learning algorithms are:
1. k-Nearest Neighbor (k-NN)
2. Variants of Nearest Neighbor learning
3. Locally Weighted Regression
4. Learning Vector Quantization (LVQ)
5. Self-Organizing Map (SOM)
6. Radial Basis Function (RBF) networks
In this chapter, we will discuss certain instance-based learning algorithms such as
k-Nearest Neighbor (k-NN), Variants of Nearest Neighbor learning, and Locally Weighted
Regression learning.
Self-Organizing Map (SOM) and Radial Basis Function (RBF) networks are discussed along
with the concepts of artificial neural networks in Chapter 10, since they can be understood
only after an understanding of neural networks.
These instance-based methods have serious limitations about the range of feature values taken.
Moreover, they are sensitive to irrelevant and correlated features leading to misclassification of
instances.
4.2 NEAREST-NEIGHBOR LEARNING
A natural approach to similarity-based classification is
k-Nearest-Neighbors (k-NN), which is a non-parametric
method used for both classification and regression
problems. It is a simple and powerful
algorithm that predicts the category of the test instance
according to the ‘k’ training samples which are closest to
the test instance and assigns it to the category which
has the largest probability. A visual representation
of this learning is shown in Figure 4.1. There are two
classes of objects called $C_1$ and $C_2$ in the given figure.
When given a test instance T, the category of this test
instance is determined by looking at the classes of its k = 3
nearest neighbors. Thus, the class of T is predicted as the
class to which the majority of those three neighbors belong.
Figure 4.1: Visual Representation of k-Nearest Neighbor Learning
The algorithm relies on the assumption that similar objects are close to each other in the feature
space. k-NN performs instance-based learning, which just stores the training data instances and
learns from them case by case. The model is also ‘memory-based’ as it uses the training data at the
time when predictions need to be made. It is a lazy learning algorithm since no prediction model is built
earlier with the training instances and classification happens only after getting the test instance.
The algorithm classifies a new instance by determining the ‘k’ most similar instances (i.e., the
k nearest neighbors) and summarizing the output of those ‘k’ instances. If the target variable is
discrete then it is a classification problem, so it selects the most common class value among the ‘k’
instances by a majority vote. However, if the target variable is continuous then it is a regression
problem, and hence the mean output variable of the ‘k’ instances is the output of the test instance.
A popular distance measure such as Euclidean distance is used in k-NN to determine
the ‘k’ instances which are most similar to the test instance. The value of ‘k’ is best determined by tuning:
trying different ‘k’ values and choosing the ‘k’ which classifies test instances most accurately.
Algorithm: k-NN
Inputs: Training dataset T, distance metric d, test instance t, the number of nearest neighbors k
Output: Predicted class or category
Prediction: For test instance t,
1. For each instance i in T, compute the distance between the test instance t and
instance i using a distance metric (Euclidean distance).
[Continuous attributes - Euclidean distance between two points in the plane with
coordinates $(x_1, y_1)$ and $(x_2, y_2)$ is given as $\text{dist}((x_1, y_1), (x_2, y_2)) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$]
[Categorical attributes (Binary) - Hamming Distance: If the values of the two instances
are the same, the distance d will be equal to 0; otherwise d = 1.]
2. Sort the distances in ascending order and select the first k nearest training data
instances to the test instance.
3. Predict the class of the test instance by majority voting (if the target attribute is discrete
valued) or mean (if the target attribute is continuous valued) of the k selected nearest
instances.
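The three steps of the algorithm can be sketched in Python as shown here. This is a minimal sketch rather than a reference implementation: the function name `knn_predict`, the `regression` flag and the toy data are assumptions introduced for illustration, and only continuous attributes with Euclidean distance are handled.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, t, k, regression=False):
    # Step 1: distance between the test instance t and every training instance.
    distances = [(math.dist(x, t), target) for x, target in zip(train_X, train_y)]
    # Step 2: sort in ascending order of distance and keep the k nearest.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    targets = [target for _, target in nearest]
    # Step 3: mean for a continuous target, majority vote for a discrete one.
    if regression:
        return sum(targets) / len(targets)
    return Counter(targets).most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups of points.
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = ["C1", "C1", "C1", "C2", "C2", "C2"]
train_values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_predict(train_X, train_y, (2, 2), k=3))                        # C1
print(knn_predict(train_X, train_values, (2, 2), k=3, regression=True))  # 2.0
```

Because no model is built in advance, every call scans the whole training set, which is why this family of methods is slow in the testing phase.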
Consider the student performance training dataset of 8 data instances shown in
Table 4.2 which describes the performance of individual students in a course and their CGPA
obtained in the previous semesters. The independent attributes are CGPA, Assessment and
Project. The target variable is ‘Result’ which is a discrete valued variable that takes two values
‘Pass’ or ‘Fail’. Based on the performance of a student, classify whether a student will pass or fail
in that course.
Table 4.2: Training Dataset T
S.No. | CGPA | Assessment | Project Submitted | Result
1. | 9.2 | 85 | 8 | Pass
2. | 8 | 80 | 7 | Pass
3. | 8.5 | 81 | 8 | Pass
4. | 6 | 45 | 5 | Fail
5. | 6.5 | 50 | 4 | Fail
6. | 8.2 | 72 | 7 | Pass
7. | 5.8 | 38 | 5 | Fail
8. | 8.9 | 91 | 9 | Pass
Solution: Given a test instance (6.1, 40, 5) and a set of categories {Pass, Fail}, also called classes,
we need to use the training set to classify the test instance using Euclidean distance.
The task of classification is to assign a category or class to an arbitrary instance.
Assign k = 3.
Step 1: Calculate the Euclidean distance between the test instance (6.1, 40, 5) and each of the
training instances as shown in Table 4.3.
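Table 4.3: Euclidean Distance
S.No. | CGPA | Assessment | Project Submitted | Result | Euclidean Distance
1. | 9.2 | 85 | 8 | Pass | $\sqrt{(9.2 - 6.1)^2 + (85 - 40)^2 + (8 - 5)^2} = 45.2063$
2. | 8 | 80 | 7 | Pass | $\sqrt{(8 - 6.1)^2 + (80 - 40)^2 + (7 - 5)^2} = 40.09501$
3. | 8.5 | 81 | 8 | Pass | $\sqrt{(8.5 - 6.1)^2 + (81 - 40)^2 + (8 - 5)^2} = 41.17961$
4. | 6 | 45 | 5 | Fail | $\sqrt{(6 - 6.1)^2 + (45 - 40)^2 + (5 - 5)^2} = 5.001$
5. | 6.5 | 50 | 4 | Fail | $\sqrt{(6.5 - 6.1)^2 + (50 - 40)^2 + (4 - 5)^2} = 10.05783$
6. | 8.2 | 72 | 7 | Pass | $\sqrt{(8.2 - 6.1)^2 + (72 - 40)^2 + (7 - 5)^2} = 32.13114$
7. | 5.8 | 38 | 5 | Fail | $\sqrt{(5.8 - 6.1)^2 + (38 - 40)^2 + (5 - 5)^2} = 2.022375$
8. | 8.9 | 91 | 9 | Pass | $\sqrt{(8.9 - 6.1)^2 + (91 - 40)^2 + (9 - 5)^2} = 51.23319$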
Step 2: Sort the distances in the ascending order and select the first 3 nearest training data instances
to the test instance. The selected nearest neighbors are shown in Table 4.4.
Table 4.4: Nearest Neighbors
Instance | Euclidean Distance | Class
4 | 5.001 | Fail
5 | 10.05783 | Fail
7 | 2.022375 | Fail
Here, we take the 3 nearest neighbors as instances 4, 5 and 7 with smallest distances.
Step 3: Predict the class of the test instance by majority voting.
The class for the test instance is predicted as ‘Fail’.
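The hand calculation in this example can be checked with a short script. The sketch below assumes the eight training instances exactly as they appear in the distance expressions of Step 1; the variable names are chosen for illustration only.

```python
import math
from collections import Counter

# (CGPA, Assessment, Project Submitted) -> Result, as used in the example.
training = [
    ((9.2, 85, 8), "Pass"),
    ((8.0, 80, 7), "Pass"),
    ((8.5, 81, 8), "Pass"),
    ((6.0, 45, 5), "Fail"),
    ((6.5, 50, 4), "Fail"),
    ((8.2, 72, 7), "Pass"),
    ((5.8, 38, 5), "Fail"),
    ((8.9, 91, 9), "Pass"),
]
test_instance = (6.1, 40, 5)
k = 3

# Step 1 and Step 2: compute all distances, then sort them in ascending order.
ranked = sorted(
    (math.dist(features, test_instance), number, result)
    for number, (features, result) in enumerate(training, start=1)
)
nearest = ranked[:k]
for distance, number, result in nearest:
    print(number, round(distance, 4), result)

# Step 3: majority vote among the k nearest instances.
prediction = Counter(result for _, _, result in nearest).most_common(1)[0][0]
print("Predicted class:", prediction)
```

The three nearest instances come out as 7, 4 and 5, all labelled ‘Fail’, which agrees with the prediction above.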
Data normalization/standardization is required when data (features) have different ranges or a
wide range of possible values; it transforms all features to a specific range before distances are
computed. This is done to eliminate the dominance of one feature over another (i.e., to give all
features equal chances). For example, if one feature has values in the range [0-1] and another
feature has values in the range [0-100], then even a small relative variation in the second feature
will influence the distance more than the first feature.
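One common way to bring all features to the same range is min-max normalization, sketched below. The text does not prescribe a particular scheme, so the choice of min-max scaling, the function name `min_max_normalize` and the sample rows are all assumptions made for illustration.

```python
def min_max_normalize(rows):
    # Rescale every feature (column) to the range [0, 1].
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        tuple(
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        )
        for row in rows
    ]

# Hypothetical data: first feature in [0-1], second feature in [0-100].
rows = [(0.2, 10), (0.5, 60), (1.0, 100)]
scaled = min_max_normalize(rows)
for row in scaled:
    print(row)
```

In practice the minimum and maximum computed from the training data would also be applied to the test instance, so that both are measured on the same scale.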
k-NN classifier performance is strongly affected by three factors: the number of nearest
neighbors (i.e., the selection of k), the distance metric and the decision rule.
If the k value selected is small then it may result in overfitting or a less stable model, and if it is big then it
may include many irrelevant points from other classes. The choice of the distance metric also
plays a major role and depends on the type of the independent attributes in the training dataset.
The k-NN classification algorithm best suits lower dimensional data, as in a high-dimensional
space the nearest neighbors may not be very close at all.
4.3 WEIGHTED K-NEAREST-NEIGHBOR ALGORITHM
The Weighted k-NN is an extension of k-NN. It chooses the neighbors by using the weighted
distance. The k-Nearest Neighbor (k-NN) algorithm has some serious limitations as its performance
is solely dependent on choosing the k nearest neighbors, the distance metric used and the decision
rule. The principal idea of Weighted k-NN is that the k closest neighbors to the test instance
are assigned a higher weight in the decision as compared to neighbors that are farther away from
the test instance. The idea is that weights are inversely proportional to distances.
The selected k nearest neighbors can be assigned uniform weights, which means all the
instances in each neighborhood are weighted equally or weights can be assigned by the inverse of
their distance. In the second case, closer neighbors of a query point will have a greater influence
than neighbors which are further away.
Algorithm: Weighted k-NN
Inputs: Training dataset ‘T’, Distance metric ‘d’, Weighting function w(i), Test instance ‘t’, the
number of nearest neighbors ‘k’
Output: Predicted class or category
Prediction: For test instance t,
1. For each instance ‘i’ in Training dataset T, compute the distance between the test instance
t and instance ‘i’ using a distance metric (Euclidean distance).
[Continuous attributes - Euclidean distance between two points in the plane with
coordinates $(x_1, y_1)$ and $(x_2, y_2)$ is given as $\text{dist}((x_1, y_1), (x_2, y_2)) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$]
[Categorical attributes (Binary) - Hamming Distance: If the values of two instances are
the same, the distance d will be equal to 0. Otherwise d = 1.]
2. Sort the distances in ascending order and select the first ‘k’ nearest training data
instances to the test instance.
3. Predict the class of the test instance by weighted voting technique (Weighting function
w(i)) for the k selected nearest instances:
* Compute the inverse of each distance of the ‘k’ selected nearest instances.
* Find the sum of the inverses.
* Compute the weight by dividing each inverse distance by the sum. (Each weight is a
vote for its associated class).
* Add the weights of the same class.
* Predict the class by choosing the class with the maximum vote.
Consider the same training dataset given in Table 4.2. Use Weighted k-NN and
determine the class.
Solution:
Step 1: Given a test instance (7.6, 60, 8) and a set of classes {Pass, Fail}, use the training dataset to
classify the test instance using Euclidean distance and weighting function.
Assign k = 3. The distance calculation is shown in Table 4.5.
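Table 4.5: Euclidean Distance
S.No. | CGPA | Assessment | Project Submitted | Result | Euclidean Distance
1. | 9.2 | 85 | 8 | Pass | $\sqrt{(9.2 - 7.6)^2 + (85 - 60)^2 + (8 - 8)^2} = 25.05115$
2. | 8 | 80 | 7 | Pass | $\sqrt{(8 - 7.6)^2 + (80 - 60)^2 + (7 - 8)^2} = 20.02898$
3. | 8.5 | 81 | 8 | Pass | $\sqrt{(8.5 - 7.6)^2 + (81 - 60)^2 + (8 - 8)^2} = 21.01928$
4. | 6 | 45 | 5 | Fail | $\sqrt{(6 - 7.6)^2 + (45 - 60)^2 + (5 - 8)^2} = 15.38051$
5. | 6.5 | 50 | 4 | Fail | $\sqrt{(6.5 - 7.6)^2 + (50 - 60)^2 + (4 - 8)^2} = 10.82636$
6. | 8.2 | 72 | 7 | Pass | $\sqrt{(8.2 - 7.6)^2 + (72 - 60)^2 + (7 - 8)^2} = 12.05653$
7. | 5.8 | 38 | 5 | Fail | $\sqrt{(5.8 - 7.6)^2 + (38 - 60)^2 + (5 - 8)^2} = 22.27644$
8. | 8.9 | 91 | 9 | Pass | $\sqrt{(8.9 - 7.6)^2 + (91 - 60)^2 + (9 - 8)^2} = 31.04336$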
Step 2: Sort the distances in ascending order and select the first 3 nearest training data instances
to the test instance. The selected nearest neighbors are shown in Table 4.6.
Table 4.6: Nearest Neighbors
Instance | Euclidean Distance | Class
4 | 15.38051 | Fail
5 | 10.82636 | Fail
6 | 12.05653 | Pass
Step 3: Predict the class of the test instance by weighted voting technique from the 3 selected
nearest instances.
* Compute the inverse of each distance of the 3 selected nearest instances as shown in
Table 4.7.
Table 4.7: Inverse Distance
Instance | Euclidean Distance | Inverse Distance | Class
4 | 15.38051 | 0.06502 | Fail
5 | 10.82636 | 0.09237 | Fail
6 | 12.05653 | 0.08294 | Pass
* Find the sum of the inverses.
Sum = 0.06502 + 0.09237 + 0.08294 = 0.24033
* Compute the weight by dividing each inverse distance by the sum as shown in
Table 4.8.
Table 4.8: Weight Calculation
Instance | Euclidean Distance | Inverse Distance | Weight = Inverse Distance/Sum | Class
4 | 15.38051 | 0.06502 | 0.270545 | Fail
5 | 10.82636 | 0.09237 | 0.384347 | Fail
6 | 12.05653 | 0.08294 | 0.345109 | Pass
* Add the weights of the same class.
Fail = 0.270545 + 0.384347 = 0.654892
Pass = 0.345109
* Predict the class by choosing the class with the maximum vote.
The class is predicted as ‘Fail’.
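The weighted voting steps can be sketched in Python as follows. The function name `weighted_knn_predict` is an assumption made for this illustration, the training values are those appearing in the distance expressions of the example, and the sketch assumes that no training instance coincides exactly with the test instance (a zero distance would make the inverse undefined).

```python
import math
from collections import defaultdict

def weighted_knn_predict(training, t, k):
    # Steps 1 and 2: compute distances and keep the k nearest instances.
    nearest = sorted((math.dist(x, t), label) for x, label in training)[:k]
    # Step 3: inverse distances, normalized so that the weights sum to 1.
    inverses = [(1.0 / distance, label) for distance, label in nearest]
    total = sum(inverse for inverse, _ in inverses)
    votes = defaultdict(float)
    for inverse, label in inverses:
        votes[label] += inverse / total
    # The class with the largest total weight wins the vote.
    return max(votes, key=votes.get), dict(votes)

training = [
    ((9.2, 85, 8), "Pass"), ((8.0, 80, 7), "Pass"), ((8.5, 81, 8), "Pass"),
    ((6.0, 45, 5), "Fail"), ((6.5, 50, 4), "Fail"), ((8.2, 72, 7), "Pass"),
    ((5.8, 38, 5), "Fail"), ((8.9, 91, 9), "Pass"),
]
prediction, votes = weighted_knn_predict(training, (7.6, 60, 8), k=3)
print(prediction)
print({label: round(weight, 4) for label, weight in votes.items()})
```

The ‘Fail’ weight comes out close to 0.655 and the ‘Pass’ weight close to 0.345; small differences from the hand calculation arise because the tables round the inverse distances to five decimal places.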
4.4 NEAREST CENTROID CLASSIFIER
A simple alternative to k-NN classifiers for similarity-based classification is the Nearest Centroid
Classifier. It is a simple classifier, also called the Mean Difference classifier. The idea of this
classifier is to assign a test instance to the class whose centroid/mean is closest to that instance.
Algorithm: Nearest Centroid Classifier
Inputs: Training dataset T, Distance metric d, Test instance t
Output: Predicted class or category
1. Compute the mean/centroid of each class.
2. Compute the distance between the test instance and mean/centroid of each class
(Euclidean Distance).
3. Predict the class by choosing the class with the smallest distance.
Example: Consider the sample data shown in Table 4.9 with two features x and y. The target
classes are ‘A’ or ‘B’. Predict the class using Nearest Centroid Classifier.
Table 4.9: Sample Data
x | y | Class
3 | 1 | A
5 | 2 | A
4 | 3 | A
7 | 6 | B
6 | 7 | B
8 | 5 | B
Solution:
Step 1: Compute the mean/centroid of each class. In this example there are two classes called ‘A’
and ‘B’.
Centroid of class ‘A’ = (3 + 5 + 4, 1 + 2 + 3)/3 = (12, 6)/3 = (4, 2)
Centroid of class ‘B’ = (7 + 6 + 8, 6 + 7 + 5)/3 = (21, 18)/3 = (7, 6)
Now given a test instance (6, 5), we can predict the class.
Step 2: Calculate the Euclidean distance between the test instance (6, 5) and each of the centroids.
$\text{Euc\_Dist}[(6, 5); (4, 2)] = \sqrt{(6 - 4)^2 + (5 - 2)^2} = \sqrt{13} = 3.6$
$\text{Euc\_Dist}[(6, 5); (7, 6)] = \sqrt{(6 - 7)^2 + (5 - 6)^2} = \sqrt{2} = 1.414$
The test instance has a smaller distance to the centroid of class B. Hence, the class of this test instance is predicted
as ‘B’.
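A short Python sketch of the same procedure is given here. The function names are assumptions made for illustration; the points are the x and y values used in the centroid sums of Step 1 (the centroids depend only on the per-class sums, not on how the x and y values are paired).

```python
import math
from collections import defaultdict

def class_centroids(points, labels):
    # Step 1: group the points by class and average each coordinate.
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return {
        label: tuple(sum(coordinate) / len(members) for coordinate in zip(*members))
        for label, members in groups.items()
    }

def nearest_centroid_predict(centroids, t):
    # Steps 2 and 3: choose the class whose centroid is closest to t.
    return min(centroids, key=lambda label: math.dist(centroids[label], t))

points = [(3, 1), (5, 2), (4, 3), (7, 6), (6, 7), (8, 5)]
labels = ["A", "A", "A", "B", "B", "B"]

centroids = class_centroids(points, labels)
prediction = nearest_centroid_predict(centroids, (6, 5))
print(centroids)    # {'A': (4.0, 2.0), 'B': (7.0, 6.0)}
print(prediction)   # B
```

Unlike k-NN, only one centroid per class has to be kept after this step, so prediction requires just one distance computation per class.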
4.5 LOCALLY WEIGHTED REGRESSION (LWR)
Locally Weighted Regression (LWR) is a non-parametric supervised learning algorithm that
performs local regression by combining a regression model with the nearest neighbors model. LWR
is also referred to as a memory-based method as it requires the training data at prediction time but
uses only the training data instances locally around the point of interest. Using the nearest neighbors
algorithm, we find the instances that are closest to a test instance and fit a linear function to each of
those ‘k’ nearest instances in the local regression model. The key idea is to approximate
the linear functions of all ‘k’ neighbors so as to minimize the error, such that the prediction line
is no longer linear but is rather a curve.
Ordinary linear regression finds a linear relationship between the input x and the output y.
Given training dataset T,
the hypothesis function $h_\beta(x)$, the predicted target output, is a linear function where $\beta_0$ is the
intercept and $\beta_1$ is the coefficient of x.
It is given in Eq. (4.1) as,
$h_\beta(x) = \beta_0 + \beta_1 x$ (4.1)
The cost function is such that it minimizes the error difference between the predicted value
$h_\beta(x)$ and the true value y, and it is given as in Eq. (4.2).
$J(\beta) = \frac{1}{2} \sum_{i=1}^{m} (h_\beta(x_i) - y_i)^2$ (4.2)
where ‘m’ is the number of instances in the training dataset.
Now the cost function is modified for locally weighted linear regression by including the weights
only for the nearest neighbor points. Hence, the cost function is given as in Eq. (4.3).
$J(\beta) = \frac{1}{2} \sum_{i=1}^{m} w_i (h_\beta(x_i) - y_i)^2$ (4.3)
where $w_i$ is the weight associated with each $x_i$.
The weight function used is a Gaussian kernel that gives a higher value for instances that are
close to the test instance, and for instances far away, it tends to zero but never equals zero.
$w_i$ is computed in Eq. (4.4) as,
$w_i = e^{-\frac{(x_i - x)^2}{2\tau^2}}$ (4.4)
where τ is called the bandwidth parameter and controls the rate at which $w_i$ reduces to zero with
the distance of $x_i$ from the test instance x.
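The kernel weight and the weighted cost can be sketched in Python as below. The sketch assumes the common Gaussian form $w_i = e^{-(x_i - x)^2 / (2\tau^2)}$ for the kernel and a single input feature; the function names and the bandwidth values are chosen only for illustration.

```python
import math

def gaussian_weight(x_i, x, tau):
    # Weight of training input x_i relative to the test input x.
    # A smaller bandwidth tau makes the weight fall off faster with distance.
    return math.exp(-((x_i - x) ** 2) / (2 * tau ** 2))

def weighted_cost(beta0, beta1, xs, ys, weights):
    # Locally weighted squared-error cost for the line beta0 + beta1 * x.
    return 0.5 * sum(
        w * ((beta0 + beta1 * x_i) - y_i) ** 2
        for x_i, y_i, w in zip(xs, ys, weights)
    )

# Effect of the bandwidth on the weight of a point at distance 1 from the query.
for tau in (0.4, 1.0, 5.0):
    print(tau, round(gaussian_weight(1.0, 2.0, tau), 4))

# A point that coincides with the query always receives the maximum weight.
print(gaussian_weight(2.0, 2.0, tau=0.4))   # 1.0
```

With a small bandwidth only points very near the query contribute to the cost, while a large bandwidth makes the weights nearly equal, so the fit approaches ordinary linear regression.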
Consider a simple example with four instances shown in Table 4.10 and apply
locally weighted regression.
Table 4.10: Sample Table
S.No. | Salary (in lakhs) | Expenditure (in thousands)
1. | 5 | 25
2. | 1 | 5
3. | 2 | 7
4. | 1 | 8
Solution: Using the linear regression model, assume we have computed the parameters:
$\beta_0 = 4.72$, $\beta_1 = 0.62$
Given a test instance with x = 2, the predicted y’ is:
$y' = \beta_0 + \beta_1 x = 4.72 + 0.62 \times 2 = 5.96$
Applying the nearest neighbor model, we choose the k = 3 closest instances.
Table 4.11 shows the Euclidean distance calculation for the training instances.
Table 4.11: Euclidean Distance Calculation
S.No. | x = Salary | y = Expenditure | Euclidean Distance
1. | 5 | 25 | $\sqrt{(5 - 2)^2} = 3$
2. | 1 | 5 | $\sqrt{(1 - 2)^2} = 1$
3. | 2 | 7 | $\sqrt{(2 - 2)^2} = 0$
4. | 1 | 8 | $\sqrt{(1 - 2)^2} = 1$
Instances 2, 3 and 4 are closer, with smaller distances.
The mean value = (5 + 7 + 8)/3 = 20/3 = 6.67.
Using Eq. (4.4), compute the weights for the closest instances using the Gaussian kernel,
$w_i = e^{-\frac{(x_i - x)^2}{2\tau^2}}$
Hence the weights of the closest instances are computed as follows.
Weight of Instance 2 is:
$w_2 = e^{-\frac{(x_2 - x)^2}{2\tau^2}} = e^{-\frac{(1 - 2)^2}{2\tau^2}} = 0.043$
Weight of Instance 3 is:
$w_3 = e^{-\frac{(x_3 - x)^2}{2\tau^2}} = e^{-\frac{(2 - 2)^2}{2\tau^2}} = e^{0} = 1$ [$w_3$ is closer and hence gets a higher weight value]
Weight of Instance 4 is:
$w_4 = e^{-\frac{(x_4 - x)^2}{2\tau^2}} = e^{-\frac{(1 - 2)^2}{2\tau^2}} = 0.043$
The predicted output for the three closer instances is given as follows:
The predicted output of Instance 2 is:
$y_2' = h_\beta(x_2) = \beta_0 + \beta_1 x_2 = 4.72 + 0.62 \times 1 = 5.34$
The predicted output of Instance 3 is:
$y_3' = h_\beta(x_3) = \beta_0 + \beta_1 x_3 = 4.72 + 0.62 \times 2 = 5.96$
The predicted output of Instance 4 is:
$y_4' = h_\beta(x_4) = \beta_0 + \beta_1 x_4 = 4.72 + 0.62 \times 1 = 5.34$
The error value is calculated as:
$J(\beta) = \frac{1}{2} \sum_{i} w_i (h_\beta(x_i) - y_i)^2 = \frac{1}{2} (0.043(5.34 - 5)^2 + 1(5.96 - 7)^2 + 0.043(5.34 - 8)^2) = 0.6954$
Now, we need to adjust this cost function to minimize the error difference and get optimal β
parameters.
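The final minimization step can be illustrated with a small Python sketch. It assumes a single input feature and uses the closed-form weighted least squares solution for a straight line; this is one possible way to minimize the weighted cost and is not necessarily the procedure the text has in mind (an iterative method such as gradient descent would also work). The function names are illustrative only, and the weights are the rounded values from the example.

```python
def weighted_cost(beta0, beta1, xs, ys, weights):
    # Locally weighted squared-error cost for the line beta0 + beta1 * x.
    return 0.5 * sum(
        w * ((beta0 + beta1 * x) - y) ** 2 for x, y, w in zip(xs, ys, weights)
    )

def weighted_line_fit(xs, ys, weights):
    # Closed-form minimizer of the weighted cost for one input feature.
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / total
    s_xy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    s_xx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# The three nearest instances (2, 3 and 4) with the weights from the example.
xs, ys, weights = [1, 2, 1], [5, 7, 8], [0.043, 1.0, 0.043]

start_cost = weighted_cost(4.72, 0.62, xs, ys, weights)
beta0, beta1 = weighted_line_fit(xs, ys, weights)
best_cost = weighted_cost(beta0, beta1, xs, ys, weights)
prediction = beta0 + beta1 * 2   # local prediction at the test input x = 2

print(round(start_cost, 4), round(best_cost, 4))
print(round(beta0, 4), round(beta1, 4), round(prediction, 4))
```

The locally fitted line gives a lower weighted cost than the globally computed parameters, and its prediction at x = 2 is pulled towards the nearest instance, which has the largest weight.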
Summary
1. Similarity-based learning is a supervised learning technique that predicts the class label of a test
instance by measuring the similarity of this test instance with training instances.
2. Similarity-based classification problems formulate the features of the test instance and training instances
in Euclidean space to learn the similarity or dissimilarity between instances.
3. A natural approach to similarity-based classification is k-Nearest-Neighbors (k-NN), which is a
non-parametric method used for both classification and regression problems.
4. k-NN predicts the category of the test instance according to the ‘k’ training samples which are closest
to the test instance and assigns it to the category which has the largest category probability.
5. Data normalization/standardization is required when data (features) have different ranges or a wide
range of possible values; it transforms all features to a specific range before distances are computed.
6. k-NN best suits lower dimensional data, as in a high-dimensional space the nearest neighbors
may not be very close at all.
7. Weighted k-NN assigns a higher weight in the decision to the k closest neighbors to the test instance
than to neighbors that are farther away from the test instance.
8. Nearest Centroid Classifier is a simple classifier, also called the Mean Difference classifier, that classifies
a test instance to the class whose centroid/mean is closest to that instance.
9. Locally Weighted Regression (LWR) is a non-parametric supervised learning algorithm that performs
local regression by combining a regression model with the nearest neighbors model.
Key Terms
* Instance - An entity or an example in the training dataset.
* Instance-based Methods - Learn or predict the class label of a test instance only when a new instance is
given for classification.
* Model-based Machine Learning - Describes all assumptions about the problem domain in the form
of a model.
* Lazy Learning - Methods that do not generalize any model from the training dataset but just keep the
training dataset as a knowledge base until a new instance is given.
* Eager Learning - Methods that generalize the training data to a model before receiving test instances.
* Memory-based Models - Classify a test instance by the similarity with the training instances stored
in the memory.
Review Questions
1. What do you understand by similarity-based learning?
2. Compare and contrast instance-based learning and model-based learning.
3. Why are instance-based learners called lazy learners?
4. Differentiate between lazy learning and eager learning.
5. Why is the k-NN method called a memory-based method?
6. Why is data normalization/standardization required in k-NN?
7. What are the benefits and limitations of the k-NN algorithm?
8. Consider the following training dataset of 10 data instances shown in Table 4.12 which describes
the award performance of individual students based on GPA and No. of projects done. The target
variable is ‘Award’ which is a discrete valued variable that takes 2 values ‘Yes’ or ‘No’.
Table 4.12: Training Dataset
S.No. | GPA | No. of projects done | Award
1. | 9.5 | | Yes
2. | 8.0 | | Yes
3. | 7.2 | | No
4. | 6.5 | | Yes
5. | 9.5 | | Yes
6. | 3.2 | | No
7. | 6.6 | | No
8. | 5.4 | | No
9. | 8.9 | | Yes
10. | 7.2 | 4 | Yes
Given a test instance (GPA - 7.8, No. of projects done - 4), use the training set to classify the test
instance. Choose k = 3. Use:
* k-Nearest Neighbor classifier
* Weighted k-Nearest Neighbor classifier
* Nearest Centroid Classifier
9. A COVID care centre decides to develop a case-based reasoning system to predict whether a person
will test positive or negative based on the symptoms. The table below shows the number of possible
symptoms and the results of the previous cases. The training dataset contains the following instances
as shown in Table 4.13 below.
Table 4.13: Sample Set of Instances
1. | Yes
2. | Yes | No | Yes | No | No | Yes | No | No | No
3. | No | No | No | No | No | No | No | No | No
4. | Yes | Yes | Yes | No | No | No | No | No | Yes
5. | Yes | Yes | Yes | Yes | No | No | Yes | Yes | Yes
6. | Yes | Yes | Yes | Yes | No | Yes | No | No | No
7. | Yes | Yes | Yes | Yes | No | No | Yes | Yes | No
8. | Yes | Yes | Yes | Yes | No | No | No | No | No
9. | Yes | Yes | Yes | Yes | No | No | No | No | No
10. | No | No | No | No | No | No | No | No | No
* Determine k = number of nearest neighbors to get a better prediction result.
* Increase the ‘k’ value and check the prediction. Is it good or bad to have a smaller or larger ‘k’ value?
* Apply a proper similarity measure [Asymmetric binary features] and predict the test result of
the instance [Fever = Yes, Dry Cough = Yes, Tiredness = Yes, Sore Throat = Yes, Diarrhea = No,
Headache = No, Loss of Taste or Smell = No, Shortness of Breath = No, Chest Pain = No].
10. What is meant by locally weighted regression?
Crossword
* The entity or an example in the dataset is also known as an ______.
* KNN is a parametric based method. (Yes/No)
* Instance-based learning are memory-based methods. (Yes/No)
* Nearest centroid classifier is also known as ______ difference classifier.
* Weighted KNN algorithm assigns a higher weight for the closest neighbors. (Yes/No)
* Locally weighted regression is a non-parametric method. (Yes/No)
* Euclidean distance is the most popular method for finding neighbors. (Yes/No)
* Majority voting is used to determine the class among neighbors in KNN algorithm. (Yes/No)
* KNN is an example of ______ based learning.
* Model-based methods are also known as ______ learning.
Chapter 5
Regression Analysis
“Regression analysis is the hydrogen bomb of the statistics arsenal.”
~ Charles Wheelan, Naked Statistics: Stripping the
Dread from the Data
Regression analysis is a supervised learning method for predicting continuous variables.
The difference between classification and regression analysis is that regression methods are used
to predict quantitative variables or continuous numbers, unlike classification methods, which predict
categorical variables or labels.
It is used to model linear or non-linear relationships among variables of the given dataset.
This chapter deals with an introduction to regression and its various types.
Learning Objectives
* Understand the basics of regression analysis
* Introduce concepts of correlation and causation
* Learn about linear regression and its validation techniques
* Discuss multiple linear regression
* Introduce logistic regression
* Study the concept of regularization
* Study popular regression methods like Ridge, Lasso, and Elastic Net
5.1 INTRODUCTION TO REGRESSION
Regression analysis is the premier method of supervised learning. It is one of the most popular
and oldest supervised learning techniques. Given a training dataset D containing N training points
$(x_i, y_i)$, where i = 1...N, regression analysis is used to model the relationship between one or more
independent variables $x_i$ and a dependent variable $y_i$. The relationship between the dependent and
independent variables can be represented as a function as follows:
$y = f(x)$ (5.1)
Here, the feature variable x is also known as an explanatory variable, exploratory variable,
a predictor variable, an independent variable, a covariate, or a domain point. y is a dependent
variable. Dependent variables are also called labels, target variables, or response variables.
Regression analysis determines the change in the response variable when one explanatory variable
is varied while all other parameters are kept constant. This is used to determine the relationship each
of the explanatory variables exhibits. Thus, regression analysis is used for prediction and forecasting.
Regression is used to predict continuous variables or quantitative variables such as price and revenue.
Thus, the primary concern of regression analysis is to find answers to questions such as:
1. What is the relationship between the variables?
2. What is the strength of the relationships?
3. What is the nature of the relationship such as linear or non-linear?
4. What is the relevance of the attributes?
5. What is the contribution of each attribute?
There are many applications of regression analysis. Some of the applications of regressions
include predicting:
1. Sales of goods or services
2. Value of bonds in portfolio management
3. Premiums of insurance companies
4. Yield of crops in agriculture
5. Prices of real estate
5.2 INTRODUCTION TO LINEARITY, CORRELATION, AND CAUSATION
The quality of the regression analysis is determined by factors such as correlation and causation.
Regression and Correlation
Correlation between two variables can be visualized effectively using a scatter plot, which is
a plot between explanatory variables and response variables. It is a 2D graph showing the
relationship between two variables. The x-axis of the scatter plot represents the independent, input or
predictor variable and the y-axis represents the output, dependent or predicted variable.
The scatter plot is useful in exploring data. Some scatter plots are shown in Figure 5.1.
The Pearson correlation coefficient is the most common test for determining whether there
is an association between two variables. The correlation coefficient is denoted by r. Correlation
is discussed in Chapter 2 of this book. The positive, negative, and random correlations are
given in Figure 5.1. In positive correlation, an increase in one variable is associated with an increase
in the other variable. In negative correlation, an increase in one variable is associated with a decrease
in the other, while in random correlation, no relationship exists between the variables.
Figure 5.1: Examples of (a) Positive Correlation (b) Negative Correlation
(c) Random Points with No Correlation
While correlation is about relationships among variables, say x and y, regression is about
predicting one variable given another variable.
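The Pearson correlation coefficient r mentioned above can be computed directly from its definition. The following is a minimal sketch; the helper name `pearson_r` and the two small datasets are hypothetical and chosen only to show a positive and a negative correlation.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations from the mean of x
    dy = y - y.mean()   # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 6]     # tends to rise with x
y_down = [6, 4, 5, 4, 2]   # tends to fall with x

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(f"r (rising data)  = {r_up:.3f}")
print(f"r (falling data) = {r_down:.3f}")
```

A value of r near +1 or −1 indicates a strong linear association, while a value near 0 corresponds to the random pattern of Figure 5.1 (c).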
Regression and Causation
Causation is about a causal relationship between variables, say x and y. Causation means knowing
whether x causes y to happen or vice versa. "x causes y" is often denoted as x implies y. Correlation
and regression relationships are not the same as a causal relationship. For example, the correlation
between economic background and marks scored does not imply that economic background
causes high marks. Similarly, the relationship between a rise in temperature and higher sales of cool drinks
is not purely a causal relation. Even though high temperature is a cause of cool drink
sales, the sales depend on other factors too.
Linearity and Non-linearity Relationships
A linear relationship between the variables means that the relationship between the dependent
and independent variables can be visualized as a straight line. A line of the form y = ax + b
can be fitted to the data points to indicate the relationship between x and y. By linearity, it is
meant that as one variable increases, the corresponding variable also changes in a linear manner.
A linear relationship is shown in Figure 5.2 (a). A non-linear relationship exists in functions such as
the exponential function and the power function, and it is shown in Figures 5.2 (b) and 5.2 (c). Here, the x-axis
is given by the x data and the y-axis is given by the y data.
Figure 5.2: (a) Example of a Linear Relationship of the Form y = ax + b (b) and (c) Examples of
Non-linear Relationships (Exponential and Power Functions)
Functions such as the exponential function (for example, $y = ae^{bx}$) and the power function (for example, $y = ax^b$) are
non-linear relationships between the dependent and independent variables that cannot be fitted by
a line. This is shown in Figures 5.2 (b) and (c).
Types of Regression Methods
The classification of regression methods is shown in Figure 5.3.
Regression methods:
    Linear regression methods: single linear regression, multiple linear regression
    Non-linear regression: polynomial regression
    Logistic regression
Figure 5.3: Types of Regression Methods
Linear Regression It is a type of regression where a line is fitted to the given data to find
the linear relationship between one independent variable and one dependent variable.
Multiple Regression It is a type of regression where a linear model is fitted to find the linear
relationship between two or more independent variables and one dependent variable.
Polynomial Regression It is a type of non-linear regression method for describing relationships
among variables, where an Nth degree polynomial is used to model the relationship between
one independent variable and one dependent variable. Polynomial multiple regression is used to
model two or more independent variables and one dependent variable.
Logistic Regression It is used for predicting categorical variables from one or more
independent variables and one dependent variable. This is also known as a binary classifier.
Lasso and Ridge Regression Methods These are special variants of the regression method where
regularization is used to limit the number and size of the coefficients of the independent
variables.
Limitations of Regression Method
1. Outliers - Outliers are abnormal data points. They can bias the outcome of the regression model, as outliers
pull the regression line towards themselves.
2. Number of cases - The ratio of cases (samples) to independent variables should be at least 20 : 1.
For every explanatory variable, there should be at least 20 samples. At least five samples per variable are
required in extreme cases.
3. Missing data - Missing data in the training data can make the model unfit for the sampled data.
4. Multicollinearity - If explanatory variables are highly correlated (0.9 and above), the regression
is vulnerable to bias. Singularity corresponds to a perfect correlation of 1. The remedy is to remove
explanatory variables that exhibit such high correlation. If there is a tie, then the tolerance
(1 − R squared) is used to eliminate variables that have the greatest value.
5.3 INTRODUCTION TO LINEAR REGRESSION
In the simplest form, the linear regression model can be created by fitting a line among the scattered
data points. The line is of the form given in Eq. (5.2).
$y = a_0 + a_1 \times x + e$    (5.2)
Here, $a_0$ is the intercept, which represents the bias, and $a_1$ represents the slope of the line. These
are called regression coefficients. $e$ is the error in prediction.
The assumptions of linear regression are listed as follows:
1. The observations (y) are random and are mutually independent.
2. The difference between the predicted and true values is called an error. The errors are also
mutually independent and share the same distribution, such as a normal distribution with zero
mean and constant variance.
3. The distribution of the error term is independent of the joint distribution of the explanatory
variables.
4. The unknown parameters of the regression model are constants.
The idea of linear regression is based on the Ordinary Least Squares (OLS) approach, which
is also known as the least squares method. In this method, the data points are modelled
using a straight line. An arbitrarily drawn line is not an optimal line. In Figure 5.4, three data
points and their errors ($e_1$, $e_2$, $e_3$) are shown. The vertical distance between each point and the
line (predicted by the approximate line equation $y = a_0 + a_1 x$) is called an error. These individual
errors are added to compute the total error of the predicted line. This is called the sum of residuals.
The squares of the individual errors can also be computed and added to give the sum of squared
errors. The line with the lowest sum of squared errors is called the line of best fit.
Figure 5.4: Data Points and their Errors
In other words, OLS is an optimization technique where the difference between the data
points and the line is minimized.
Mathematically, based on Eq. (5.2), the line equations for the points $(x_1, x_2, \ldots, x_n)$ are:
$y_1 = (a_0 + a_1 x_1) + e_1$
$y_2 = (a_0 + a_1 x_2) + e_2$
$\vdots$
$y_n = (a_0 + a_1 x_n) + e_n$    (5.3)
In general, the error for the i-th equation of the set in Eq. (5.3) is given as:
$e_i = y_i - (a_0 + a_1 x_i)$    (5.4)
Here, the terms $(e_1, e_2, \ldots, e_n)$ are the errors associated with the data points and denote the difference
between the true value of the observation and the point on the line. These are also called residuals.
The residuals can be positive, negative or zero.
A regression line is the line of best fit for which the sum of the squares of residuals is minimum.
The minimization can be done as minimization of the sum of the individual errors by finding the parameters $a_0$
and $a_1$ such that:
$E = \sum_{i=1}^{n} e_i = \sum_{i=1}^{n} \left( y_i - (a_0 + a_1 x_i) \right)$    (5.5)
Or as the minimization of the sum of absolute values of the individual errors:
$E = \sum_{i=1}^{n} |e_i| = \sum_{i=1}^{n} \left| y_i - (a_0 + a_1 x_i) \right|$    (5.6)
Or as the minimization of the sum of the squares of the individual errors:
$E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - (a_0 + a_1 x_i) \right)^2$    (5.7)
The sum of the squares of the individual errors is often preferred because the individual errors (positive and
negative) do not cancel each other out and every squared term is non-negative. Squaring also makes the sum grow
sharply even for a small change in the error. Therefore, this criterion is preferred for linear regression.
Therefore, linear regression is modelled as a minimization function as follows:
$J(a_0, a_1) = \sum_{i=1}^{n} [y_i - f(x_i)]^2 = \sum_{i=1}^{n} \left( y_i - (a_0 + a_1 x_i) \right)^2$    (5.8)
Here, $J(a_0, a_1)$ is the criterion function of the parameters $a_0$ and $a_1$. This needs to be minimized.
This is done by differentiating with respect to each parameter and setting the result to zero. This yields the coefficient values of
$a_0$ and $a_1$. The estimates of $a_1$ and $a_0$ are given as follows:
$a_1 = \dfrac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - (\bar{x})^2}$    (5.9)
And the value of $a_0$ is given as follows:
$a_0 = \bar{y} - a_1 \times \bar{x}$    (5.10)
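The differentiation step can be written out as follows. This is a sketch of the standard derivation, assuming the bar notation denotes sample means (for example, $\overline{xy} = \frac{1}{n}\sum x_i y_i$).

```latex
\begin{align*}
\frac{\partial J}{\partial a_0} &= -2 \sum_{i=1}^{n} \bigl( y_i - a_0 - a_1 x_i \bigr) = 0
  &&\Rightarrow\quad a_0 = \bar{y} - a_1 \bar{x} \\
\frac{\partial J}{\partial a_1} &= -2 \sum_{i=1}^{n} x_i \bigl( y_i - a_0 - a_1 x_i \bigr) = 0
  &&\Rightarrow\quad \overline{xy} - a_0 \bar{x} - a_1 \overline{x^2} = 0 \\
\intertext{Substituting $a_0 = \bar{y} - a_1 \bar{x}$ into the second condition gives}
\overline{xy} - \bar{x}\,\bar{y} &= a_1 \bigl( \overline{x^2} - (\bar{x})^2 \bigr)
  &&\Rightarrow\quad a_1 = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - (\bar{x})^2}
\end{align*}
```

The first condition also shows that the fitted line always passes through the point of means $(\bar{x}, \bar{y})$.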
Let us consider a simple problem to illustrate the usage of the above concept.
Example 5.1: Let us consider an example where five weeks' sales data (in thousands) is given
as shown below in Table 5.1. Apply the linear regression technique to predict the 7th and 12th week sales.
Table 5.1: Sample Data
x_i (Week)    y_i (Sales in Thousands)
1             1.2
2             1.8
3             2.6
4             3.2
5             3.8
Solution: Here, there are five samples, so i ranges from 1 to 5. The computation table is shown below
(Table 5.2).
Table 5.2: Computation Table
x_i         y_i           (x_i)^2      x_i × y_i
1           1.2           1            1.2
2           1.8           4            3.6
3           2.6           9            7.8
4           3.2           16           12.8
5           3.8           25           19
Sum = 15    Sum = 12.6    Sum = 55     Sum = 44.4
Average of (x_i) = 15/5 = 3
Average of (y_i) = 12.6/5 = 2.52
Average of (x_i^2) = 55/5 = 11
Average of (x_i × y_i) = 44.4/5 = 8.88
Let us compute the slope and intercept now using Eqs. (5.9) and (5.10) as:
$a_1 = \dfrac{8.88 - 3(2.52)}{11 - 3^2} = 0.66$
$a_0 = 2.52 - 0.66 \times 3 = 0.54$
The fitted line is shown in Figure 5.5.
Figure 5.5: Linear Regression Model Constructed (x-axis: independent variable X; y-axis: dependent variable Y; regression line y = 0.66x + 0.54)
Let us model the relationship as $y = a_0 + a_1 \times x$. Therefore, the fitted line for the above data is:
$y = 0.54 + 0.66 \times x$
The predicted 7th week sale (when x = 7) would be y = 0.54 + 0.66 × 7 = 5.16, and the predicted 12th week sale
(when x = 12) would be y = 0.54 + 0.66 × 12 = 8.46. All sales are in thousands.
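The calculation above can be checked with a few lines of code. This is a minimal sketch, assuming the Table 5.1 sales values are 1.2, 1.8, 2.6, 3.2 and 3.8 (in thousands), which is what the sums in the computation table imply; the helper name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Return (a0, a1) for y = a0 + a1 * x using Eqs. (5.9) and (5.10)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_x2 = sum(x * x for x in xs) / n
    a1 = (mean_xy - mean_x * mean_y) / (mean_x2 - mean_x ** 2)  # slope, Eq. (5.9)
    a0 = mean_y - a1 * mean_x                                   # intercept, Eq. (5.10)
    return a0, a1

weeks = [1, 2, 3, 4, 5]
sales = [1.2, 1.8, 2.6, 3.2, 3.8]   # in thousands (assumed Table 5.1 values)

a0, a1 = fit_line(weeks, sales)
pred_7 = a0 + a1 * 7
pred_12 = a0 + a1 * 12
print(f"a0 = {a0:.2f}, a1 = {a1:.2f}")
print(f"week 7: {pred_7:.2f}, week 12: {pred_12:.2f}")
```

Working with means rather than raw sums mirrors the computation table, so each intermediate value can be compared against the table by hand.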
Linear Regression in Matrix Form
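Matrix notation can be used for representing the values of the independent and dependent variables.
This is illustrated through Example 5.2.
Eq. (5.3) can be written in matrix form as follows:
$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}$    (5.11)
This can be written as:
Y = Xa + e, where X is an n × 2 matrix, Y is an n × 1 vector, a is a 2 × 1 column vector and e is an
n × 1 column vector.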
Example 5.2: Find the linear regression of the data of week and product sales (in thousands) given
in Table 5.3. Use linear regression in matrix form.
Table 5.3: Sample Data for Regression
x_i (Week)    y_i (Product Sales in Thousands)
1             1
2             3
3             4
4             8
Solution: Here, the independent variable X is given as:
$x^T = [1 \;\; 2 \;\; 3 \;\; 4]$
And the dependent variable is given as follows:
$y^T = [1 \;\; 3 \;\; 4 \;\; 8]$
The data can be given in matrix form as follows:
$X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}$ and $Y = \begin{bmatrix} 1 \\ 3 \\ 4 \\ 8 \end{bmatrix}$
The first column of X is set to 1 and is used for the bias (intercept) term.
The regression is given as:
$a = (X^T X)^{-1} X^T Y$
The computation order of this equation is shown step by step as:
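1. Computation of $X^T X = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix} = \begin{bmatrix} 4 & 10 \\ 10 & 30 \end{bmatrix}$
2. Computation of the matrix inverse $(X^T X)^{-1} = \begin{bmatrix} 4 & 10 \\ 10 & 30 \end{bmatrix}^{-1} = \begin{bmatrix} 1.5 & -0.5 \\ -0.5 & 0.2 \end{bmatrix}$
3. Computation of $(X^T X)^{-1} X^T = \begin{bmatrix} 1.5 & -0.5 \\ -0.5 & 0.2 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \end{bmatrix} = \begin{bmatrix} 1 & 0.5 & 0 & -0.5 \\ -0.3 & -0.1 & 0.1 & 0.3 \end{bmatrix}$
4. Finally, $\left( (X^T X)^{-1} X^T \right) Y = \begin{bmatrix} 1 & 0.5 & 0 & -0.5 \\ -0.3 & -0.1 & 0.1 & 0.3 \end{bmatrix} \begin{bmatrix} 1 \\ 3 \\ 4 \\ 8 \end{bmatrix} = \begin{bmatrix} -1.5 \\ 2.2 \end{bmatrix}$, where −1.5 is the intercept and 2.2 is the slope.
Thus, the substitution of values in Eq. (5.11) using the previous steps yields the fitted line as
y = 2.2x − 1.5.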
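The four steps of the matrix computation map directly onto array operations. The sketch below uses NumPy (an assumption; the text does not prescribe a library) and also compares the result with `np.linalg.lstsq`, which solves the same least squares problem without forming the inverse explicitly.

```python
import numpy as np

# Design matrix: a column of 1s for the intercept, then the week number.
X = np.array([[1, 1],
              [1, 2],
              [1, 3],
              [1, 4]], dtype=float)
Y = np.array([1, 3, 4, 8], dtype=float)

XtX = X.T @ X                  # step 1
XtX_inv = np.linalg.inv(XtX)   # step 2
pseudo = XtX_inv @ X.T         # step 3
a = pseudo @ Y                 # step 4: [intercept, slope]

# Cross-check with a library least squares solver.
a_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print("X^T X =\n", XtX)
print("(X^T X)^-1 =\n", XtX_inv)
print("intercept, slope =", a)
```

For larger or nearly collinear problems, a solver such as `lstsq` is generally preferred over an explicit inverse for numerical stability; the explicit form is used here only to follow the steps of the example.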
5.4 VALIDATION OF REGRESSION METHODS
The regression model should be evaluated using some metrics for checking the correctness.
The following metrics are used to validate the results of regression.
Standard Error
Residuals or errors are the differences between the actual value ($y$) and the predicted value ($\hat{y}$).
If the residuals have a normal distribution with a mean of zero, this is desirable. The standard error
is a measure of variability in finding the coefficients. It is preferable that the error be less than the
coefficient estimate. The standard deviation of the residuals is called the residual standard error. If it is
zero, then it means that the model fits the data perfectly.
Mean Absolute Error (MAE)
MAE is the mean of the absolute values of the residuals, that is, of the differences between the estimated or predicted target values and
the actual target outcomes. It can be mathematically defined as follows:
$\text{MAE} = \dfrac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$    (5.12)
Here, $\hat{y}_i$ is the estimated or predicted target output, $y_i$ is the actual target output, and n is
the number of samples used for regression analysis.
Mean Squared Error (MSE)
It is the mean of the squares of the residuals. This value is always non-negative, and values closer to 0 are better. This is given
mathematically as:
$\text{MSE} = \dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$    (5.13)
Root Mean Square Error (RMSE)
The square root of the MSE is called RMSE. This is given as:
$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$    (5.14)
Relative MSE
Relative MSE is the ratio of the prediction error of the model to that of the trivial predictor that always outputs the average.
A value of zero indicates that the model is perfect, and a useful model has a value between 0 and 1. If the
value is more than 1, then the created model is not a good one. This is given as follows:
$\text{RelMSE} = \dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$    (5.15)
Coefficient of Variation
Coefficient of variation is unitless and is given as:
$\text{CV} = \dfrac{\text{RMSE}}{\bar{y}}$    (5.16)
Example 5.3: Consider the following training set (Table 5.4) for predicting the sales of the items.
Table 5.4: Training Item Table
Item    Actual Value (y_i)
I_1     80
I_2     90
I_3     100
I_4     110
I_5     120
Consider two fresh items I_6 and I_7, whose actual values are 80 and 75, respectively. A regression
model predicts the values of the items I_6 and I_7 as 75 and 85, respectively. Find MAE, MSE, RMSE,
RelMSE and CV.
Solution: The actual and predicted values of the test items are given in Table 5.5 as:
Table 5.5: Test Item Table
Item    Actual Value (y_i)    Predicted Value (ŷ_i)
I_6     80                    75
I_7     75                    85
Mean Absolute Error (MAE) using Eq. (5.12) is given as:
$\text{MAE} = \dfrac{1}{2} \times \left( |80 - 75| + |75 - 85| \right) = \dfrac{15}{2} = 7.5$
Mean Squared Error (MSE) using Eq. (5.13) is given as:
$\text{MSE} = \dfrac{1}{2} \times \left( |80 - 75|^2 + |75 - 85|^2 \right) = \dfrac{125}{2} = 62.5$
Root Mean Square Error using Eq. (5.14) is given as:
$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{62.5} = 7.91$
For finding RelMSE and CV, the training table should be used to find the average of y.
The average of y is $\dfrac{80 + 90 + 100 + 110 + 120}{5} = 100$.
RelMSE using Eq. (5.15) can be computed as:
$\text{RelMSE} = \dfrac{(80 - 75)^2 + (75 - 85)^2}{(80 - 100)^2 + (75 - 100)^2} = \dfrac{125}{1025} = 0.1219$
CV can be computed using Eq. (5.16) as:
$\text{CV} = \dfrac{7.91}{100} \approx 0.08$
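The metrics of this example can be collected in one small function. This is a sketch under two assumptions that should be read as such: the coefficient of variation is taken as RMSE divided by the training mean, and the relative MSE uses the training mean as the trivial predictor. The function name `regression_metrics` is hypothetical.

```python
import math

def regression_metrics(actual, predicted, train_mean):
    """Return MAE, MSE, RMSE, RelMSE and CV for test items."""
    n = len(actual)
    errors = [y - p for y, p in zip(actual, predicted)]
    sse = sum(e * e for e in errors)                    # sum of squared errors
    mae = sum(abs(e) for e in errors) / n               # Eq. (5.12)
    mse = sse / n                                       # Eq. (5.13)
    rmse = math.sqrt(mse)                               # Eq. (5.14)
    rel_mse = sse / sum((y - train_mean) ** 2 for y in actual)  # Eq. (5.15)
    cv = rmse / train_mean                              # assumed form of Eq. (5.16)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "RelMSE": rel_mse, "CV": cv}

train_values = [80, 90, 100, 110, 120]          # Table 5.4
train_mean = sum(train_values) / len(train_values)

actual = [80, 75]       # test items, Table 5.5
predicted = [75, 85]

metrics = regression_metrics(actual, predicted, train_mean)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

MAE and RMSE are in the units of the target, whereas RelMSE and CV are unitless, which makes the latter two easier to compare across datasets.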
Coefficient of Determination
To understand the coefficient of determination, one needs to understand the total variation
in regression analysis. The sum of the squares of the differences between the y-value
of each data pair and the average of y is called the total variation. The following variations can
be defined.
The explained variation is given as:
$\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$    (5.17)
The unexplained variation is given as:
$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$    (5.18)
Thus, the total variation is equal to the sum of the explained variation and the unexplained variation.
The coefficient of determination $r^2$ is the ratio of the explained and total variations.
$r^2 = \dfrac{\text{Explained variation}}{\text{Total variation}}$    (5.19)
It is a measure of how well future samples are likely to be predicted by the regression model.
Its value ranges from −∞ to 1, where 1 is the optimum value. It also signifies the proportion
of variance explained. Here, r is the correlation coefficient. If r = 0.95, then $r^2$ is given as 0.95 × 0.95 = 0.9025.
This means that about 90% of the variation in y can be explained by the relationship between x and y. The remaining
10% is unexplained and may be due to various reasons such as noise, chance, or error.
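The decomposition of the total variation can be checked numerically. The sketch below reuses the data of Table 5.3 (weeks 1 to 4, sales 1, 3, 4, 8) and its least squares line y = 2.2x − 1.5; the variable names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4], dtype=float)
y = np.array([1, 3, 4, 8], dtype=float)

y_hat = 2.2 * x - 1.5          # fitted line for this data
y_bar = y.mean()

explained = ((y_hat - y_bar) ** 2).sum()     # Eq. (5.17)
unexplained = ((y - y_hat) ** 2).sum()       # Eq. (5.18)
total = ((y - y_bar) ** 2).sum()

r_squared = explained / total                # Eq. (5.19)
r = np.corrcoef(x, y)[0, 1]                  # Pearson correlation coefficient

print(f"explained = {explained:.2f}, unexplained = {unexplained:.2f}, total = {total:.2f}")
print(f"r^2 = {r_squared:.4f}, r = {r:.4f}")
```

The identity "explained + unexplained = total" holds here because the line is the least squares fit with an intercept; for an arbitrary line it need not hold.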
Standard Error Estimate
Standard error estimate is another useful measure of regression. It is the standard deviation of the
observed values about the predicted values. This is given as:
$\text{Standard error estimate} = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}}$    (5.20)
Here, as usual, $y_i$ is the observed value and $\hat{y}_i$ is the predicted value. Here, n is the number
of samples.
Example 5.4: Let us consider the data given in Table 5.6 with actual and predicted values.
Find the standard error estimate.
Solution: The observed and predicted values are given below in Table 5.6.
Table 5.6: Sample Data
x_i    Observed y_i    Predicted ŷ_i    (y_i − ŷ_i)²
1      1.5             1.46             (1.5 − 1.46)² = 0.0016
2      2.9             2.02             (2.9 − 2.02)² = 0.7744
3      2.7             2.58             (2.7 − 2.58)² = 0.0144
4      3.1             3.14             (3.1 − 3.14)² = 0.0016
The sum of $(y_i - \hat{y}_i)^2$ for all i = 1, 2, 3 and 4 (i.e., number of samples n = 4) is 0.792. The standard
error estimate as given in Eq. (5.20) is:
$\sqrt{\dfrac{0.792}{4 - 2}} = \sqrt{0.396} = 0.629$
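The same computation can be sketched in code. The function name `standard_error_estimate` is hypothetical; the divisor n − 2 follows Eq. (5.20).

```python
import math

def standard_error_estimate(observed, predicted):
    """Square root of the squared-error sum divided by (n - 2), per Eq. (5.20)."""
    n = len(observed)
    sse = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return math.sqrt(sse / (n - 2))

observed = [1.5, 2.9, 2.7, 3.1]       # Table 5.6
predicted = [1.46, 2.02, 2.58, 3.14]

sse = sum((y - p) ** 2 for y, p in zip(observed, predicted))
see = standard_error_estimate(observed, predicted)
print(f"sum of squared errors = {sse:.4f}")
print(f"standard error estimate = {see:.3f}")
```

The divisor n − 2 reflects the two coefficients (intercept and slope) estimated from the data, so the function requires at least three samples.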
5.5 MULTIPLE LINEAR REGRESSION
A multiple regression model involves multiple predictors or independent variables and one
dependent variable. This is an extension of the linear regression problem. The basic assumptions
of multiple linear regression are that the independent variables are not highly correlated
and hence the multicollinearity problem does not exist. Also, it is assumed that the residuals are
normally distributed.
For example, the multiple regression of two variables $x_1$ and $x_2$ is given as follows:
$y = f(x_1, x_2) = a_0 + a_1 x_1 + a_2 x_2$    (5.21)
In general, this is given for n independent variables as:
$y = f(x_1, x_2, \ldots, x_n) = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_n x_n + \varepsilon$    (5.22)
Here, $(x_1, x_2, \ldots, x_n)$ are the predictor variables, y is the dependent variable, $(a_0, a_1, \ldots, a_n)$ are
the coefficients of the regression equation and $\varepsilon$ is the error term. This is illustrated through
Example 5.5.
Example 5.5: Apply multiple regression for the values given in Table 5.7 where weekly sales along
with sales for products $x_1$ and $x_2$ are provided. Use the matrix approach for finding multiple regression.
Table 5.7: Sample Data
x_1 (Product One Sales)    x_2 (Product Two Sales)    y (Output Weekly Sales in Thousands)
1                          4                          1
2                          5                          6
3                          8                          8
4                          2                          12
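Solution: Here, the matrices for Y and X are given as follows:
$X = \begin{bmatrix} 1 & 1 & 4 \\ 1 & 2 & 5 \\ 1 & 3 & 8 \\ 1 & 4 & 2 \end{bmatrix}$ and $Y = \begin{bmatrix} 1 \\ 6 \\ 8 \\ 12 \end{bmatrix}$
The coefficients of the multiple regression equation are given as $a = \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix}$.
The regression coefficients for multiple regression are calculated in the same way as for linear regression:
$\hat{a} = (X^T X)^{-1} X^T Y$    (5.23)
Using Eq. (5.23), and substituting the values (similar to Example 5.2), one gets $\hat{a}$ as:
$\hat{a} = \left( \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \\ 4 & 5 & 8 & 2 \end{bmatrix} \begin{bmatrix} 1 & 1 & 4 \\ 1 & 2 & 5 \\ 1 & 3 & 8 \\ 1 & 4 & 2 \end{bmatrix} \right)^{-1} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \\ 4 & 5 & 8 & 2 \end{bmatrix} \begin{bmatrix} 1 \\ 6 \\ 8 \\ 12 \end{bmatrix} = \begin{bmatrix} -1.69 \\ 3.48 \\ -0.05 \end{bmatrix}$
Here, the coefficients are $a_0 = -1.69$, $a_1 = 3.48$ and $a_2 = -0.05$. Hence, the constructed model is:
$y = -1.69 + 3.48 x_1 - 0.05 x_2$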
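The matrix computation of this example can be reproduced as below. This is a sketch using NumPy (an assumption of this illustration); instead of inverting $X^T X$, it solves the equivalent linear system $(X^T X)\,a = X^T Y$, which gives the same coefficients.

```python
import numpy as np

# Columns: bias term, product one sales (x1), product two sales (x2).
X = np.array([[1, 1, 4],
              [1, 2, 5],
              [1, 3, 8],
              [1, 4, 2]], dtype=float)
Y = np.array([1, 6, 8, 12], dtype=float)   # weekly sales in thousands

# Solve the normal equations (X^T X) a = X^T Y.
a = np.linalg.solve(X.T @ X, X.T @ Y)
a0, a1, a2 = a
print(f"a0 = {a0:.4f}, a1 = {a1:.4f}, a2 = {a2:.4f}")

# Fitted values for the four training rows.
y_fit = X @ a
print("fitted values:", np.round(y_fit, 2))
```

The printed coefficients agree with the values quoted in the example up to rounding in the second decimal place.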
5.6 POLYNOMIAL REGRESSION
If the relationship between the independent and dependent variables is not linear, then linear
regression cannot be used as it will result in large errors. The problem of non-linear regression can
be solved by two methods:
1. Transformation of non-linear data to linear data, so that the linear regression can handle
the data
2. Using polynomial regression
Transformations
The first method is called transformation. The trick is to convert non-linear data to linear data that
can be handled using the linear regression method. Let us consider an exponential function $y = ae^{bx}$.
The transformation can be done by applying the log function to both sides to get:
$\ln y = bx + \ln a$    (5.24)
Similarly, a power function of the form $y = ax^b$ can be transformed by applying the log function
on both sides as follows:
$\log_{10} y = b \log_{10} x + \log_{10} a$    (5.25)
Once the transformation is carried out, linear regression can be performed and after the results
are obtained, the inverse functions can be applied to get the desired result.
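The transformation approach can be illustrated with a short sketch. The data below is hypothetical: it is generated exactly from $y = 2e^{0.5x}$ so that the recovered parameters can be compared with the known ones. The helper `fit_line` is an illustrative implementation of simple linear regression.

```python
import math

def fit_line(xs, ys):
    """Least squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_x2 = sum(x * x for x in xs) / n
    slope = (mean_xy - mean_x * mean_y) / (mean_x2 - mean_x ** 2)
    return mean_y - slope * mean_x, slope

# Hypothetical data following y = a * exp(b * x) with a = 2 and b = 0.5.
xs = [1, 2, 3, 4, 5]
ys = [2 * math.exp(0.5 * x) for x in xs]

# Step 1: transform. ln y = b * x + ln a is linear in x (Eq. 5.24).
log_ys = [math.log(y) for y in ys]

# Step 2: ordinary linear regression on the transformed data.
intercept, slope = fit_line(xs, log_ys)

# Step 3: apply the inverse function to recover the original parameters.
a = math.exp(intercept)
b = slope
print(f"a = {a:.4f}, b = {b:.4f}")
```

With noisy data the recovered values would only approximate the true parameters, because least squares is then applied to the errors of ln y rather than to those of y.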
Polynomial Regression
It can handle non-linear relationships among variables by using an nth degree polynomial. Instead
of applying transforms, polynomial regression can be directly used to deal with different levels of
curvilinearity.
Polynomial regression provides a non-linear curve such as quadratic and cubic. For example,
the second-degree transformation (called quadratic transformation) is given as $y = a_0 + a_1 x + a_2 x^2$ and
the third-degree polynomial (called cubic transformation) is given as $y = a_0 + a_1 x + a_2 x^2 + a_3 x^3$.
Generally, polynomials of maximum degree 4 are used, as higher order polynomials take some
strange shapes and make the curve more flexible. This leads to a situation of overfitting and hence is
avoided.
Let us consider a polynomial of 2nd degree. Given points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the
objective is to fit a polynomial of degree 2. The polynomial of degree 2 is given as:
$y = a_0 + a_1 x + a_2 x^2$    (5.26)
such that the error $E = \sum_{i=1}^{n} [y_i - (a_0 + a_1 x_i + a_2 x_i^2)]^2$ is minimized. The coefficients $a_0, a_1, a_2$ of
Eq. (5.26) can be obtained by taking the partial derivatives with respect to each of the coefficients,
$\dfrac{\partial E}{\partial a_0}, \dfrac{\partial E}{\partial a_1}, \dfrac{\partial E}{\partial a_2}$, and setting them to zero. This results in 2 + 1 equations given as follows:
$n a_0 + \left( \sum x_i \right) a_1 + \left( \sum x_i^2 \right) a_2 = \sum y_i$
$\left( \sum x_i \right) a_0 + \left( \sum x_i^2 \right) a_1 + \left( \sum x_i^3 \right) a_2 = \sum x_i y_i$
$\left( \sum x_i^2 \right) a_0 + \left( \sum x_i^3 \right) a_1 + \left( \sum x_i^4 \right) a_2 = \sum x_i^2 y_i$    (5.27)
The best-fit curve is the one that minimizes the error between the curve and the data points. Arranging the
coefficients of the above equations in matrix form results in:
$\begin{bmatrix} n & \sum x_i & \sum x_i^2 \\ \sum x_i & \sum x_i^2 & \sum x_i^3 \\ \sum x_i^2 & \sum x_i^3 & \sum x_i^4 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \\ \sum x_i^2 y_i \end{bmatrix}$    (5.28)
This is of the form Xa = B. One can solve this equation for a as:
$a = X^{-1} B$    (5.29)
Example 5.6: Consider the data provided in Table 5.8 and fit it using the second-order
polynomial.
Table 5.8: Sample Data
x    y
1    1
2    4
3    9
4    15
Solution: For applying polynomial regression, computation is done as shown in Table 5.9.
Here, the order is 2 and the sample i ranges from 1 to 4.
Table 5.9: Computation Table
x_i    y_i    x_i y_i    x_i^2    x_i^2 y_i    x_i^3    x_i^4
1      1      1          1        1            1        1
2      4      8          4        16           8        16
3      9      27         9        81           27       81
4      15     60         16       240          64       256
Σx_i = 10, Σy_i = 29, Σx_i y_i = 96, Σx_i^2 = 30, Σx_i^2 y_i = 338, Σx_i^3 = 100, Σx_i^4 = 354
It can be noted that N = 4, $\sum y_i = 29$, $\sum x_i y_i = 96$, $\sum x_i^2 y_i = 338$. When the order is 2, the matrix
using Eq. (5.28) is given as follows:
$\begin{bmatrix} 4 & 10 & 30 \\ 10 & 30 & 100 \\ 30 & 100 & 354 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} 29 \\ 96 \\ 338 \end{bmatrix}$
Therefore, using Eq. (5.29), one can get the coefficients as:
$\begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} 4 & 10 & 30 \\ 10 & 30 & 100 \\ 30 & 100 & 354 \end{bmatrix}^{-1} \times \begin{bmatrix} 29 \\ 96 \\ 338 \end{bmatrix} = \begin{bmatrix} -0.75 \\ 0.95 \\ 0.75 \end{bmatrix}$
This leads to the regression equation using Eq. (5.26) as:
$y = -0.75 + 0.95x + 0.75x^2$
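The normal equations of Eq. (5.28) can be assembled and solved programmatically. The sketch below uses NumPy (an assumption of this illustration) with the data of Table 5.8, and cross-checks the result against `np.polyfit`; the function name `fit_quadratic` is hypothetical.

```python
import numpy as np

def fit_quadratic(x, y):
    """Solve the 3 x 3 normal equations of Eq. (5.28) for [a0, a1, a2]."""
    n = len(x)
    A = np.array([
        [n,              x.sum(),        (x ** 2).sum()],
        [x.sum(),        (x ** 2).sum(), (x ** 3).sum()],
        [(x ** 2).sum(), (x ** 3).sum(), (x ** 4).sum()],
    ])
    B = np.array([y.sum(), (x * y).sum(), (x ** 2 * y).sum()])
    return np.linalg.solve(A, B)

x = np.array([1, 2, 3, 4], dtype=float)
y = np.array([1, 4, 9, 15], dtype=float)

coeffs = fit_quadratic(x, y)              # [a0, a1, a2]
print("a0, a1, a2 =", np.round(coeffs, 2))

# np.polyfit returns the highest-degree coefficient first, so reverse it.
check = np.polyfit(x, y, 2)[::-1]
print("polyfit    =", np.round(check, 2))
```

Solving the system with `np.linalg.solve` is equivalent to Eq. (5.29) but avoids computing the inverse explicitly.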
5.7 LOGISTIC REGRESSION
Linear regression predicts a numerical response but is not suitable for predicting categorical
variables. When categorical variables are involved, it is called a classification problem. Logistic
regression is suitable for binary classification problems. Here, the output is often a categorical
variable. For example, the following scenarios are instances of predicting categorical variables.
1. Is the mail spam or not spam? The answer is yes or no. Thus, the categorical dependent
variable is a binary response of yes or no.
2. Whether the student should be admitted or not is based on entrance examination marks.
Here, the categorical variable response is admitted or not.
3. Whether the student passes or fails is based on the marks secured.
Thus, logistic regression is used as a binary classifier and works by predicting the probability
of the categorical variable. In general, it takes one or more features x and predicts the response y.
If the probability is predicted via linear regression, it is given as:
$p(x) = a_0 + a_1 x$
Hence, logistic regression tries to model the probability of the particular response variable.
In an email classification problem, say normal email or spam, if the probability of the response
variable is 0.7, then there is a 70% possibility of a normal mail.
The value generated by linear regression is in the range −∞ to +∞, whereas the probability of the
response variable ranges between 0 and 1. Hence, there must be a mapping function to map
the range −∞ to +∞ onto the range 0-1. The core of the mapping function in the logistic regression method is the
sigmoidal function. A sigmoidal function is an 'S' shaped function that yields values between 0 and 1.
It is referred to here as the logit function (it is more commonly called the logistic or sigmoid function). This is mathematically represented as:
$\text{logit}(x) = \dfrac{1}{1 + e^{-x}}$    (5.30)
Here, x is the independent variable and e is the Euler number. The purpose of this function
is to map any real number into the range between 0 and 1.
Logistic regression can be viewed as an extension of linear regression, but the only
difference is that the output of linear regression can be an extremely high number. This needs
to be mapped into the range 0-1, as probability can have values only in the range 0-1. This
problem is solved using log odds or logit functions. What is the difference between odds and
probability? Odds and probability (or likelihood) are two sides of a coin and represent uncertainty.
The odds are defined as the ratio of the probability of an event to the probability of
the event not happening. This is given as:
$\text{odds} = \dfrac{\text{probability of an event}}{\text{probability of a non-event}} = \dfrac{p}{1 - p}$    (5.31)
Log-odds can be taken for the odds, resulting in:
$\log \left( \dfrac{p(x)}{1 - p(x)} \right) = a_0 + a_1 x$    (5.32)
Here, $\log(\cdot)$ of the odds is the logit function or log odds function. One can solve for p(x) by taking the
inverse of the above function as:
$p(x) = \dfrac{\exp(a_0 + a_1 x)}{1 + \exp(a_0 + a_1 x)}$    (5.33)
This is the same sigmoidal function. It always gives a value in the range 0-1. Dividing the
numerator and denominator by the numerator, one gets:
$p(x) = \dfrac{1}{1 + \exp(-a_0 - a_1 x)}$    (5.34)
One can rearrange this by taking the minus sign outside to get the following logistic
function:
$p(x) = \dfrac{1}{1 + \exp(-(a_0 + a_1 x))}$    (5.35)
Here, x is the explanatory or predictor variable, e is the Euler number, and $a_0$, $a_1$ are the
regression coefficients. The coefficients $a_0$, $a_1$ can be learned and the predictor predicts the class from p(x)
using the threshold function as:
$y = \begin{cases} 1, & \text{if } p(x) > 0.5 \\ 0, & \text{otherwise} \end{cases}$    (5.36)
Example 5.7: Let us assume a binomial logistic regression problem where the classes are pass and
fail. The student dataset has entrance marks based on the historic data of those who were selected
or not selected. Based on the logistic regression, the values of the learnt parameters are $a_0 = 1$ and
$a_1 = 8$. Assuming marks of x = 60, compute the resultant class.
Solution: The values of the regression coefficients are $a_0 = 1$ and $a_1 = 8$, and it is given that x = 60.
Based on the regression coefficients, z can be computed as:
$z = a_0 + a_1 x = 1 + 8 \times 60 = 481$
One can fit this in the sigmoidal function using Eq. (5.30) to get the probability as:
$p = \dfrac{1}{1 + \exp(-481)} \approx 1$
If we assume the threshold value as 0.5, then it is observed that the probability is greater than 0.5; therefore, the
candidate with marks 60 is selected.
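The logistic function of Eq. (5.35) and the threshold rule of Eq. (5.36) can be sketched as follows. For the parameters of the example ($a_0 = 1$, $a_1 = 8$, x = 60), z = 481 is so large that evaluating Eq. (5.35) gives a probability of essentially 1. The function names `sigmoid` and `classify` are hypothetical, and the two-branch form of `sigmoid` is one common way to avoid overflow for large negative inputs.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # safe: z is negative here
    return ez / (1.0 + ez)

def classify(x, a0, a1, threshold=0.5):
    """Return (probability, class label) using Eqs. (5.35) and (5.36)."""
    p = sigmoid(a0 + a1 * x)
    return p, (1 if p > threshold else 0)

# Parameters and input of the example above.
p, label = classify(60, a0=1, a1=8)
print(f"z = {1 + 8 * 60}, p = {p}, class = {label}")

# The function maps any real number into the range between 0 and 1.
for z in (-5, 0, 5):
    print(z, round(sigmoid(z), 4))
```

In practice the input feature is often rescaled before fitting so that z stays in a range where the probability is not saturated at 0 or 1.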
To determine the relationship between the dependent and independent variables, the parameters need
to be obtained. In logistic regression, the parameters are obtained through maximum likelihood
estimation (MLE) using the training data. The aim is to learn the values of the parameters of the logistic
model (the a's) by minimizing the error in the probability predicted by the model.
There can be many different sets of coefficients available. The optimal value of the parameters
is obtained by using the MLE function, which yields the set of coefficients for which the probability of
getting the observed data is maximum.
If $\pi$ is the probability of success of the outcome and $1 - \pi$ is the probability of failure, then the likelihood
function is given as:
$L(a_0, a_1) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}$    (5.37)
To determine the value of the parameters, the log of the likelihood function is taken. Techniques
like Newton's method can be used to maximize the log-likelihood function.
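Maximizing the log-likelihood with Newton's method can be sketched as below for a single feature. This is an illustrative implementation under stated assumptions, not a reference one: the Bernoulli form of the likelihood is assumed, the small dataset is hypothetical, and its classes overlap on purpose so that the maximum likelihood estimate is finite.

```python
import numpy as np

def log_likelihood(a, X, y):
    """Log of the Bernoulli likelihood with pi_i = sigmoid(a0 + a1 * x_i)."""
    z = X @ a
    # y*log(pi) + (1-y)*log(1-pi) simplifies to y*z - log(1 + exp(z)).
    return float(np.sum(y * z - np.logaddexp(0.0, z)))

def fit_logistic_newton(x, y, iterations=10):
    """Newton's method for the coefficients [a0, a1]."""
    X = np.column_stack([np.ones_like(x), x])   # column of 1s for the intercept
    a = np.zeros(2)
    history = [log_likelihood(a, X, y)]
    for _ in range(iterations):
        p = 1.0 / (1.0 + np.exp(-(X @ a)))       # predicted probabilities
        gradient = X.T @ (y - p)                 # first derivative of log L
        hessian = -(X.T * (p * (1.0 - p))) @ X   # second derivative of log L
        a = a - np.linalg.solve(hessian, gradient)
        history.append(log_likelihood(a, X, y))
    return a, history

# Hypothetical marks (x) and outcomes (y); 1 = selected, 0 = not selected.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

a, history = fit_logistic_newton(x, y)
print("a0, a1 =", a)
print("log-likelihood:", history[0], "->", history[-1])
```

If the two classes were perfectly separable by a threshold on x, the likelihood would have no finite maximum and the coefficients would grow without bound, which is why the example data overlaps.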
Logistic regression is suitable for binary classification. The idea can be extended to multiple
classes, which is called multinomial logistic regression. Let us assume that there are three classes 1, 2 and 3.
Then, multinomial logistic regression creates three binary classification problems: class 1 versus not class
1, class 2 versus not class 2, and finally class 3 versus not class 3. The three problems are used together,
and the class with the maximum probability relative to the others is chosen as the appropriate class.
Logistic regression is a simple and efficient method for binary classification. The model can
be easily interpreted too. The disadvantages of logistic regression are that multinomial logistic
regression cannot handle many attributes and that it can handle only linear features. Also, if the
attributes have a multicollinearity problem, the logistic method does not work effectively.