UNIT- II
Maximum likelihood estimation - Least squares, Robust linear regression, Ridge
regression, Bayesian linear regression, Linear models for classification:
Discriminant functions, Probabilistic generative models, Probabilistic
discriminative models, Laplace approximation, Bayesian logistic regression,
Kernel functions, Using kernels in GLMs, Kernel trick, SVMs.
Linear Regression
In Machine Learning,
Linear Regression is a supervised machine learning algorithm.
It tries to find out the best linear relationship that describes the data you have.
It assumes that there exists a linear relationship between a dependent variable and independent variable(s).
The value of the dependent variable of a linear regression model is continuous, i.e. a real number.
Representing Linear Regression Model-
Linear regression model represents the linear relationship between a dependent variable and independent variable(s) via
a sloped straight line.
The sloped straight line representing the linear relationship that fits the given data best is called the regression line.
Types of Linear Regression
Based on the number of independent variables, there are two types of linear regression-
1. Simple Linear Regression-
In simple linear regression, the dependent variable depends only on a single independent variable.
For simple linear regression, the form of the model is-
Y = β0 + β1X
Here,
Y is a dependent variable.
X is an independent variable.
β0 and β1 are the regression coefficients.
β0 is the intercept or the bias that fixes the offset to a line.
β1 is the slope or weight that specifies the factor by which X has an impact on Y.
Types of Linear Regression
2. Multiple Linear Regression-
In multiple linear regression, the dependent variable depends on more than one independent variable.
For multiple linear regression, the form of the model is-
Y = β0 + β1X1 + β2X2 + β3X3 + …… + βnXn
Here,
Y is a dependent variable.
X1, X2, …., Xn are independent variables.
β0, β1,…, βn are the regression coefficients.
βj (1 ≤ j ≤ n) is the slope or weight that specifies the factor by which Xj has an impact on Y.
Assumptions of Linear Regression
Maximum Likelihood Estimation
• Maximum Likelihood Estimation (MLE) is a statistical method used to
estimate the parameters of a probability distribution that best describe a
given dataset.
• To analyze the data provided, we need to identify the distribution from
which we have obtained data.
• Next, use data to find the parameters of our distribution. A parameter is a
numerical characteristic of a distribution.
• Example distributions and their parameters:
• Normal distribution - mean (µ) and variance (σ²)
• Binomial distribution - number of trials (n) and probability of success (p)
• Gamma distribution - shape (k) and scale (θ)
• Exponential distribution - rate (λ), the inverse of the mean
• These parameters are vital for understanding the size, shape, spread, and
other properties of a distribution.
• Since the data that we have is mostly randomly generated, we often don’t
know the true values of the parameters characterizing our distribution.
Maximum Likelihood Estimation
• An estimator is a function of the data that gives approximate values of the
parameters.
• Example: the sample-mean estimator, a simple and frequently used estimator.
• Since the numerical characteristics of the distribution vary with the parameter, it is
not easy to estimate the parameter θ of the distribution directly.
• Maximum likelihood estimation is an estimation procedure that gives an entire class
of estimators, called maximum likelihood estimators or MLEs.
Likelihood Function
When to Use Log-Likelihood:
•When Dealing with Large Datasets: The likelihood function can become extremely
small as more data points are considered, leading to computational difficulties. The
log-likelihood avoids this by converting multiplication into addition.
•Simplifying Derivatives: When performing MLE, you often need to take derivatives
to find the maximum. The log-likelihood simplifies this process, as the logarithm of
a product becomes a sum of logarithms, making differentiation easier.
The log-likelihood is a transformed version of the likelihood function that is more
mathematically and computationally convenient for optimization in machine
learning models.
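As a minimal illustration of why the log-likelihood is preferred numerically, the sketch below (assuming a normal distribution and SciPy; the sample values and parameters are made up for illustration) compares the raw likelihood product with the log-likelihood sum:

# Minimal sketch: likelihood vs. log-likelihood for a normal distribution.
# The sample values and parameters below are illustrative assumptions.
import numpy as np
from scipy.stats import norm

data = np.array([2.1, 1.9, 2.5, 2.0, 2.3])   # hypothetical observations
mu, sigma = 2.0, 0.5                          # candidate parameter values

likelihood = np.prod(norm.pdf(data, mu, sigma))        # product of densities (can underflow)
log_likelihood = np.sum(norm.logpdf(data, mu, sigma))  # sum of log-densities (numerically stable)

print(likelihood, log_likelihood)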
MLE – for Linear Regression Model
In the context of linear regression, MLE is used to estimate the coefficients (parameters) of
the regression model.
Maximizing Log-Likelihood
Example - Maximum likelihood estimation
Let's work through an example of using Maximum Likelihood Estimation (MLE) in
linear regression.
• Example : We have a dataset with the following observations:
X Y
1 2
2 3
3 5
4 4
2 6
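A minimal sketch of the MLE fit for this data is given below; under the usual Gaussian-noise assumption the maximum-likelihood coefficients coincide with the least-squares ones, so the code just applies the closed-form formulas to the five (X, Y) pairs listed above:

# Sketch: MLE for simple linear regression on the example data above.
# Under Gaussian noise, maximizing the log-likelihood is equivalent to least squares.
import numpy as np

X = np.array([1, 2, 3, 4, 2], dtype=float)   # X values from the table above
Y = np.array([2, 3, 5, 4, 6], dtype=float)   # Y values from the table above

# Closed-form MLE / least-squares estimates of slope and intercept
beta1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0 = Y.mean() - beta1 * X.mean()

# MLE of the noise variance (residual sum of squares divided by n)
residuals = Y - (beta0 + beta1 * X)
sigma2 = np.mean(residuals ** 2)

print(beta0, beta1, sigma2)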
LINEAR REGRESSION PROBLEM
Problem:
You are given the following data representing the
advertising budget (in thousands of dollars) and the
corresponding sales (in thousands of units) for a
product:
Advertising Budget: X=[2,4,6,8,10]
Sales: Y=[4,7,9,11,15]
You want to find a linear relationship between the
advertising budget and sales.
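A short sketch of the least-squares fit for this problem, using the closed-form slope and intercept formulas (the prediction at a budget of 12 is purely illustrative):

# Sketch: simple linear regression for the advertising-budget example above.
import numpy as np

X = np.array([2, 4, 6, 8, 10], dtype=float)   # advertising budget (thousands of dollars)
Y = np.array([4, 7, 9, 11, 15], dtype=float)  # sales (thousands of units)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)  # slope
b0 = Y.mean() - b1 * X.mean()                                               # intercept

print("Y = %.2f + %.2f * X" % (b0, b1))
print("Illustrative prediction for a budget of 12:", b0 + b1 * 12)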
This simple linear regression example illustrates how
you can establish a linear relationship between two
variables and use that relationship for prediction. In
real-world applications, additional steps and
considerations may be necessary, including data
preprocessing, feature selection, model validation,
and more.
LINEAR REGRESSION PROBLEM
Extra Problem-
A small fitness center wants to understand the relationship between the number of hours spent in the gym and
weight loss (in pounds) for its members over a one-month period. They collect the following data:
Hours in Gym (X): [5,10,15,20,25]
Weight Loss (Y): [3,6,9,12,15]
The fitness center wants to use this information to predict the weight loss for a member who spends 18 hours in
the gym.
Extra Problem-Solution
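A short sketch of the calculation (the data are exactly proportional, so the fitted line is Y = 0.6X and the predicted weight loss for 18 hours is 10.8 pounds):

# Sketch: least-squares fit for the gym example and prediction at 18 hours.
import numpy as np

X = np.array([5, 10, 15, 20, 25], dtype=float)  # hours in gym
Y = np.array([3, 6, 9, 12, 15], dtype=float)    # weight loss (pounds)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

# The slope works out to 0.6 and the intercept is (numerically) 0.
print("b1 =", b1, "b0 =", b0)
print("Predicted weight loss at 18 hours: %.1f pounds" % (b0 + b1 * 18))  # 10.8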
Steps applied in Linear Regression
Modeling
1. Missing value and outlier treatment
2. Correlation check of independent variables
3. Random split of the data into train and test sets
4. Fit the model on train data
5. Evaluate model on test data
PROGRAM
#Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#read the dataset
df = pd.read_csv("/content/kc_house_data.csv")
#visualize the correlations using a heatmap
plt.figure()
sns.heatmap(df.corr(), cmap='coolwarm')
plt.show()
#select the required parameters
area = df['sqft_living']
price = df['price']
x = np.array(area).reshape(-1, 1)
y = np.array(price)
#import LinearRegression and split the data into training and testing datasets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)
#fit the model on the training dataset
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train)
#calculate intercept and coefficient
print(model.intercept_)
print(model.coef_)
pred = model.predict(x_test)
predictions = pred.reshape(-1, 1)
#calculate mean squared error and RMSE to evaluate model performance
from sklearn.metrics import mean_squared_error
print('MSE:', mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(mean_squared_error(y_test, predictions)))
OUTPUT
[-48536.69005829]
[[284.14771038]]
MSE: 62014619472.34492
RMSE: 249027.34683633628
Robust Linear Regression
• Robust linear regression is designed to be less sensitive to outliers compared to
traditional linear regression.
• Traditional linear regression minimizes the sum of squared residuals, which can be
heavily influenced by outliers.
• Robust linear regression uses different techniques to mitigate the effect of outliers
and produce a more reliable model.
• Linear Regression: Suitable when the data meets the assumptions, especially when
there are no significant outliers and the relationship is linear.
• Robust Linear Regression: Appropriate when there are outliers or when the
assumptions of linear regression are violated, making it more reliable for real-world
data that may not adhere perfectly to theoretical assumptions.
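As a brief, hedged sketch of one common approach (the slides may use a different estimator), scikit-learn's HuberRegressor down-weights large residuals and can be compared against ordinary least squares on data containing an outlier:

# Sketch: ordinary vs. robust linear regression on data with one outlier.
# The toy data below are illustrative assumptions, not from the slides.
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.RandomState(0)
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 0.5, 20)
y[5] = 80.0                          # inject a single large outlier

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)   # robust: uses the Huber loss

print("OLS slope:  ", ols.coef_[0])
print("Huber slope:", huber.coef_[0])   # typically much closer to the true slope of 2 than OLS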
Ridge Regression (L2 Regularization)
Bayesian Linear Regression
The prior should contain no information from the likelihood (it encodes beliefs held before seeing the data).
The evidence term is just a normalization constant that ensures the posterior integrates to one.
Bayesian Linear Regression
Likelihood (P(Data | Hypothesis)):
•The likelihood represents how probable the observed data is, assuming a particular hypothesis or
model is true. It is a function of the parameters of the model, and it quantifies how well the model
explains the observed data.
•Example: In a classification problem, the likelihood would measure how probable the observed labels
are given the predicted labels from your model.
Prior (P(Hypothesis)):
•The prior is your initial belief about the hypothesis before observing any data. It reflects your
knowledge or assumptions about the model parameters based on previous information or intuition.
Evidence (P(Data)):
•The evidence (also called the marginal likelihood) is the probability of the observed data under all
possible hypotheses. It normalizes the posterior distribution so that it sums to one. Evidence is used in
Bayesian inference but is often hard to compute directly.
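A minimal sketch of Bayesian linear regression with a Gaussian prior and known noise variance (the conjugate closed form; the prior precision alpha, noise precision beta and toy data below are illustrative assumptions):

# Sketch: Bayesian linear regression with a Gaussian prior N(0, alpha^-1 I)
# and Gaussian noise of precision beta; the posterior is also Gaussian.
import numpy as np

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(20, 1))
y = 0.5 + 2.0 * X.ravel() + rng.normal(0, 0.2, 20)   # toy data (illustrative)

Phi = np.hstack([np.ones((20, 1)), X])   # design matrix with a bias column
alpha, beta = 2.0, 25.0                  # prior precision, noise precision (assumed)

# Posterior covariance and mean (standard conjugate-Gaussian result)
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ y

print("posterior mean of [intercept, slope]:", m_N)
print("posterior covariance:\n", S_N)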
Discriminant Functions
A discriminant function is a function used in classification tasks to assign a given
input to one of several possible classes. It is designed to make decisions based
on the values of input features by computing a score for each class. The class
with the highest score is the one to which the input is assigned.
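As a small illustrative sketch of this idea (a hand-written linear discriminant with made-up weights, not a trained model), each class gets a score g_k(x) = w_kᵀx + b_k and the input is assigned to the class with the highest score:

# Sketch: assigning an input to the class with the largest discriminant score.
# The weights and biases below are illustrative assumptions.
import numpy as np

W = np.array([[ 1.0, -0.5],    # weight vector for class 0
              [-0.2,  0.8],    # weight vector for class 1
              [ 0.3,  0.3]])   # weight vector for class 2
b = np.array([0.0, 0.1, -0.2]) # bias (offset) for each class

x = np.array([0.4, 1.2])       # an example input feature vector

scores = W @ x + b             # one discriminant score g_k(x) per class
predicted_class = int(np.argmax(scores))
print(scores, "-> class", predicted_class)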
Probabilistic Generative Models
Laplace Approximation
• Laplace approximation is a technique in machine learning and statistics used to
approximate integrals, particularly when dealing with Bayesian inference. It's
often applied when the exact calculation of posterior distributions is intractable.
The method relies on approximating the posterior distribution with a Gaussian
distribution centered around the mode of the posterior. This is achieved by taking
the second-order Taylor expansion of the log-posterior distribution around its
mode.
• Key Steps in Laplace Approximation:
1. Find the mode of the posterior distribution: identify the maximum a posteriori (MAP)
estimate, which is the mode of the posterior distribution. This can be done using
optimization techniques.
2. Approximate the posterior with a Gaussian: use a second-order Taylor expansion of the
log-posterior distribution around the MAP estimate. This results in a quadratic
approximation, which corresponds to a Gaussian distribution.
Laplace Approximation
3.Calculate the Hessian Matrix:
The covariance of the approximating Gaussian is the inverse of the Hessian matrix of the negative log-
posterior evaluated at the mode. The Hessian captures the curvature of the posterior distribution.
4. Obtain the Approximation:
With the mode and covariance matrix, the posterior is approximated as a multivariate normal distribution.
Applications in Machine Learning:
• Bayesian Neural Networks (BNNs): Laplace approximation can be used to approximate the posterior
distribution of the weights in BNNs, leading to uncertainty quantification in predictions.
• Model Selection: In Bayesian model comparison, Laplace approximation helps compute marginal
likelihoods, which can be used for model selection.
• Gaussian Processes: It is used to approximate non-Gaussian likelihoods in Gaussian process models,
especially in classification problems.
Laplace approximation is particularly useful when dealing with high-dimensional models, where exact inference
becomes computationally expensive.
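A minimal 1-D sketch of these steps (the target density below is an illustrative assumption; SciPy finds the mode numerically and a finite-difference second derivative stands in for the Hessian):

# Sketch: 1-D Laplace approximation of an unnormalized density f(z).
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative unnormalized log-density (a non-Gaussian example)
def log_f(z):
    return -0.5 * z**2 + 0.3 * np.sin(3 * z)

# Step 1: find the mode z0 by maximizing log f (minimize its negative)
res = minimize_scalar(lambda z: -log_f(z))
z0 = res.x

# Steps 2-3: curvature at the mode via a finite-difference second derivative
h = 1e-4
A = -(log_f(z0 + h) - 2 * log_f(z0) + log_f(z0 - h)) / h**2   # precision A = -(d2/dz2) log f at z0

# Step 4: Gaussian approximation q(z) = N(z | z0, A^-1)
print("mode z0 =", z0, " precision A =", A, " variance =", 1.0 / A)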
Laplace Approximation
The Laplace Approximation aims to find a Gaussian approximation to a probability
density defined over a set of continuous variables.
Consider first the case of a single continuous variable z, and suppose the
distribution p(z) is defined by p(z) = (1/Z) f(z), where Z = ∫ f(z) dz is the
normalization coefficient, which need not be known.
Laplace Approximation…
Gaussian approximation will only be well defined if its precision A > 0, in other words the stationary
point z0 must be a local maximum, so that the second derivative of f(z) at the point z0 is negative.
Laplace Approximation…
where A = −∇∇ ln f(z) evaluated at z = z0 is the Hessian of the negative log of f, and ∇ is the gradient operator. Taking the exponential of both sides we obtain
f(z) ≈ f(z0) exp{ −(1/2) (z − z0)ᵀ A (z − z0) }
The distribution q(z) is proportional to f(z) and the appropriate normalization coefficient can be
found by inspection, using the standard result for a normalized multivariate Gaussian, giving
q(z) = (|A|^(1/2) / (2π)^(M/2)) exp{ −(1/2) (z − z0)ᵀ A (z − z0) } = N(z | z0, A⁻¹)
where M is the dimensionality of z.
Laplace Approximation…
One major weakness of the Laplace approximation is that, since it is based on a Gaussian
distribution, it is only directly applicable to real variables.
In other cases, it may be possible to apply the Laplace approximation to a transformation of the
variable. For instance, if 0 < τ < ∞ then we can consider a Laplace approximation of ln τ.
The most serious limitation of the Laplace framework, however, is that it is based purely on the
aspects of the true distribution at a specific value of the variable, and so can fail to capture
important global properties.
Bayesian Logistic Regression
Logistic Regression
Logistic regression is one of the most popular machine learning algorithms for binary
classification. This is because it is a simple algorithm that performs very well on a wide
range of problems.
Logistic Function
The logistic function is defined as:
P = 1 / (1 + e^-x)
Where e is the numerical constant Euler's number and x is the input we plug into the
function.
Logistic Regression
Let’s plug in a series of numbers from -5 to +5 and see how the logistic function transforms them:
Large negative numbers result in values close to zero.
Large positive numbers result in values close to one.
0 is transformed to 0.5, the midpoint of the new range.
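A quick sketch reproducing this transformation for integers from -5 to +5:

# Sketch: the logistic function applied to inputs from -5 to +5.
import numpy as np

x = np.arange(-5, 6)
p = 1.0 / (1.0 + np.exp(-x))

for xi, pi in zip(x, p):
    print("%2d -> %.4f" % (xi, pi))   # -5 -> ~0.0067, 0 -> 0.5, 5 -> ~0.9933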
Logistic Regression
Consider the following example: An organization wants to
determine an employee’s salary increase based on their
performance.
For this purpose, a linear regression algorithm will help them
decide. Plotting a regression line by considering the employee’s
performance as the independent variable, and the salary increase as
the dependent variable will make their task easier.
Now, what if the organization wants to know whether an employee
would get a promotion or not based on their performance? The above
linear graph won’t be suitable in this case. As such, we clip the line at
zero and one, and convert it into a sigmoid curve (S curve).
Based on the threshold values, the organization can decide whether an
employee will get a salary increase or not.
Difference between Linear and Logistic Regression
• Linear regression is used to solve regression problems; logistic regression is used to solve classification problems.
• In linear regression the response variable is continuous in nature; in logistic regression it is categorical.
• Linear regression helps estimate the dependent variable when the independent variable changes; logistic regression helps calculate the probability of a particular event taking place.
• Linear regression fits a straight line; logistic regression fits an S-curve (the sigmoid).
Logistic Regression
This dataset has two input variables (X1 and X2), both real-valued, and one output variable (Y).
The output variable has two values, making the problem a binary classification problem.
For this dataset, logistic regression has three coefficients, just like linear regression, for
example: output = b0 + b1*x1 + b2*x2.
Unlike linear regression, the output is transformed into a probability using the
logistic function: p = 1 / (1 + e^(-output)).
Logistic Regression
Calculate p(x) for each record
p = 1 / (1 + e^(-L))
or
p= e^L/ (1 + e^L)
Logit L = b0 + b1*x1 + b2*x2
b0- intercept
b1- first regression coefficient (learning)
b2-second regression coefficient (learning)
x1-first predictor variable(input)
x2-second predictor variable (input)
Calculate Logit for each record based on the above formula
The job of the learning algorithm is to discover the best values for the coefficients (b0, b1 and b2) based on the training data.
If the predicted probability is > 0.5, the prediction is class 1; otherwise it is class 0 (the default class).
Logistic Regression –Calculate
Prediction
Assign 0.0 to each coefficient and calculate the predicted probability for the first training instance.
b0 = 0.0
b1 = 0.0
b2 = 0.0
The first training instance is: x1=2.7810836, x2=2.550537003
Using the above equation we can plug in all of these numbers and calculate a prediction:
prediction = 1 / (1 + e^(-(b0 + b1*x1 + b2*x2)))
prediction = 1 / (1 + e^(-(0.0 + 0.0*2.7810836 + 0.0*2.550537003)))
prediction = 0.5
Logistic Regression –Calculate New
Coefficients
The new coefficient values can be calculated using
b1 = Σ(x1 - x1bar)(x2 - x2bar) / Σ(x1 - x1bar)²
b2 = Σ(x1 - x1bar)(x2 - x2bar) / Σ(x2 - x2bar)²
b0 = x2bar - b1 * x1bar
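A short sketch applying these formulas to the ten (x1, x2) pairs used in the worked problem that follows; it reproduces the coefficient values shown in that table:

# Sketch: computing b0, b1, b2 with the formulas above, using the (x1, x2)
# data from the worked problem that follows.
import numpy as np

x1 = np.array([2.7810836, 1.465489372, 3.396561688, 1.38807019, 3.06407232,
               7.627531214, 5.332441248, 6.922596716, 8.675418651, 7.673756466])
x2 = np.array([2.550537003, 2.362125076, 4.400293529, 1.850220317, 3.005305973,
               2.759262235, 2.088626775, 1.77106367, -0.242068655, 3.508563011])

d1, d2 = x1 - x1.mean(), x2 - x2.mean()
b1 = np.sum(d1 * d2) / np.sum(d1 ** 2)   # ~ -0.1429
b2 = np.sum(d1 * d2) / np.sum(d2 ** 2)   # ~ -0.7172
b0 = x2.mean() - b1 * x1.mean()          # ~  3.0961

print(b0, b1, b2)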
Logistic Regression –Advantages
Advantages:
– Makes no assumptions about distributions of classes in feature space
– Easily extended to multiple classes (multinomial regression)
– Natural probabilistic view of class predictions
– Quick to train
– Very fast at classifying unknown records
– Good accuracy for many simple data sets
Logistic Regression –Applications
•Using the logistic regression algorithm, banks can predict whether a customer would
default on loans or not
•To predict the weather conditions of a certain place (sunny, windy, rainy, humid, etc.)
•Ecommerce companies can identify buyers if they are likely to purchase a certain product
•Companies can predict whether they will gain or lose money in the next quarter, year, or
month based on their current performance
•To classify objects based on their features and attributes
Logistic Regression –Problem
Columns: x1, x2, x1 - x1bar, x2 - x2bar, (x1 - x1bar)(x2 - x2bar), (x1 - x1bar)², (x2 - x2bar)², Logit L = b0 + b1*x1 + b2*x2, P = 1/(1 + exp(-L)), Y NEW, Given Y, Accuracy
1: 2.7810836, 2.550537003, -2.051618547, 0.14514411, -0.297780347, 4.20913866, 0.021066813, 0.869350519, 0.704610536, 1, 0, 0
2: 1.465489372, 2.362125076, -3.367212775, -0.043267817, 0.145691948, 11.33812187, 0.001872104, 1.192505447, 0.767188862, 1, 0, 0
3: 3.396561688, 4.400293529, -1.436140459, 1.994900636, -2.864957513, 2.062499417, 3.979628546, -0.54526968, 0.366962574, 0, 0, 1
4: 1.38807019, 1.850220317, -3.444631957, -0.555172576, 1.912365198, 11.86548932, 0.30821659, 1.570711242, 0.827884978, 1, 0, 0
5: 3.06407232, 3.005305973, -1.768629827, 0.59991308, -1.061024166, 3.128051463, 0.359895703, 0.502742829, 0.623103689, 1, 0, 0
6: 7.627531214, 2.759262235, 2.794829068, 0.353869342, 0.989004322, 7.811069517, 0.125223511, 0.026996984, 0.506748836, 1, 1, 1
7: 5.332441248, 2.088626775, 0.499739102, -0.316766118, -0.158300415, 0.24973917, 0.100340774, 0.835995, 0.697621047, 1, 1, 1
8: 6.922596716, 1.77106367, 2.08989457, -0.634329223, -1.325681199, 4.367659312, 0.402373564, 0.836487968, 0.697725026, 1, 1, 1
9: 8.675418651, -0.242068655, 3.842716505, -2.647461548, -10.17344419, 14.76647013, 7.00905265, 2.029804644, 0.88389103, 1, 1, 1
10: 7.673756466, 3.508563011, 2.84105432, 1.103170118, 3.134166228, 8.071589646, 1.216984308, -0.517012363, 0.373551108, 0, 1, 0
Sums: Σx1 = 48.32702147, Σx2 = 24.05392893, Σ(x1 - x1bar)(x2 - x2bar) = -9.699960133, Σ(x1 - x1bar)² = 67.8698285, Σ(x2 - x2bar)² = 13.52465456, ΣL = 6.802312589; accuracy = 50%
Means: x1bar = 4.832702147, x2bar = 2.405392893
b1 = Σ(x1 - x1bar)(x2 - x2bar) / Σ(x1 - x1bar)² = -0.142920063
b2 = Σ(x1 - x1bar)(x2 - x2bar) / Σ(x2 - x2bar)² = -0.717205758
b0 = x2bar - b1*x1bar = 3.096082986
Logistic Regression –Additional Problem
Columns: x1, x2, x1 - x1bar, x2 - x2bar, (x1 - x1bar)(x2 - x2bar), (x1 - x1bar)², (x2 - x2bar)², Logit L = b0 + b1*x1 + b2*x2, P = 1/(1 + exp(-L)), Y NEW, Given Y, Accuracy
1: 68, 166, 27.1, 77.2, 2092.12, 734.41, 5959.84, 216.2132671, 1, 1, 1, 1
2: 70, 178, 29.1, 89.2, 2595.72, 846.81, 7956.64, 225.5147413, 1, 1, 1, 1
3: 72, 170, 31.1, 81.2, 2525.32, 967.21, 6593.44, 226.7086169, 1, 1, 1, 1
4: 66, 124, 25.1, 35.2, 883.52, 630.01, 1239.04, 194.750395, 1, 1, 0, 0
5: 66, 115, 25.1, 26.2, 657.62, 630.01, 686.44, 191.1019757, 1, 1, 0, 0
6: 67, 135, 26.1, 46.2, 1205.82, 681.21, 2134.44, 201.4280318, 1, 1, 0, 0
Sums: Σx1 = 409, Σx2 = 888, Σ(x1 - x1bar) = 163.6, Σ(x2 - x2bar) = 355.2, Σ(x1 - x1bar)(x2 - x2bar) = 9960.12, Σ(x1 - x1bar)² = 4489.66, Σ(x2 - x2bar)² = 24569.84, ΣL = 1255.717028; accuracy = 50%
Means: x1bar = 40.9, x2bar = 88.8
b1 = Σ(x1 - x1bar)(x2 - x2bar) / Σ(x1 - x1bar)² = 2.218457522
b2 = Σ(x1 - x1bar)(x2 - x2bar) / Σ(x2 - x2bar)² = 0.405379929
b0 = x2bar - b1*x1bar = -1.934912666
Applying steps in logistic regression modeling
The following steps are applied in logistic regression modeling in industry:
1. Exclusion criteria and good-bad definition finalization
2. Initial data preparation and univariate analysis
3. Derived/dummy variable creation
4. Fine classing and coarse classing
5. Fitting the logistic model on the training data
6. Evaluating the model on test data
Logistic Regression
# import the necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load the breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)
# split the train and test dataset
X_train, X_test,y_train, y_test = train_test_split(X, y,test_size=0.20, random_state=23)
# LogisticRegression
clf = LogisticRegression(random_state=0)
clf.fit(X_train, y_train)
# Prediction
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Logistic Regression model accuracy (in %):", acc*100)
Output:
Logistic Regression model accuracy (in %): 95.6140350877193
Support Vector Machine
SVM Algorithm
• SVM stands for support vector machine, and although it can solve
both classification and regression problems, it is mainly used for
classification problems in machine learning (ML).
• SVM's purpose is to predict the classification of a query sample by
relying on labeled input data which are separated into two classes by a margin.
• Specifically, the data is transformed into a higher dimension, and a
support vector classifier is used as a threshold (or hyperplane) to
separate the two classes with minimum error.
SVM Algorithm
• Dimensions:
• In simple terms, a dimension of
something is a particular aspect of
it. Examples: width, depth and
height are dimensions.
• A line has one dimension, a square
(considering its edges) has two
dimensions, and a cube has three
dimensions.
• Planes and Hyperplane:
• In one dimension, a hyperplane is
called a point.
• In two dimensions, it is a line.
• In three dimensions, it is a plane
and in more dimensions we call it a
hyperplane.
SVM Algorithm
• Terminologies in SVM:
• The points closest to the hyperplane are called the support vectors, and the
distance of these vectors from the hyperplane is called the margin.
• The basic intuition is that the farther the support vectors are from the hyperplane,
the higher the probability of correctly classifying the points in their respective
regions or classes.
• Support vectors are critical in determining the hyperplane, because if their
position changes, the hyperplane's position is altered.
• Technically, this hyperplane can also be called the margin-maximizing
hyperplane.
Identify the right hyper-plane (Scenario-1):
• Here, we have three hyper-planes (A, B, and C). Now, identify the right hyper-
plane to classify stars and circles.
• You need to remember a thumb rule to identify the right hyper-plane: "Select
the hyper-plane which segregates the two classes better". In this scenario,
hyper-plane "B" has performed this job excellently.
Identify the right hyper-plane
(Scenario-2)
• Here, we have three hyper-planes (A, B, and C) and all are segregating the classes
well. Now, How can we identify the right hyper-plane?
• Here, maximizing the distances between nearest data point (either class) and hyper-
plane will help us to decide the right hyper-plane. This distance is called as Margin
You can see that the margin for hyper-plane C is
high compared to both A and B. Hence, we
choose C as the right hyper-plane. Another
compelling reason for selecting the hyper-plane with
the higher margin is robustness: if we select a hyper-
plane having a low margin, then there is a high
chance of misclassification.
Identify the right hyper-plane
(Scenario-3):
• Hint: Use the rules as discussed in previous section to identify the right
hyper-plane.
• Some of you may have selected the hyper-plane B as it has higher margin
compared to A. But, here is the catch, SVM selects the hyper-plane which
classifies the classes accurately prior to maximizing margin. Here, hyper-
plane B has a classification error and A has classified all correctly. Therefore,
the right hyper-plane is A.
Can we classify two classes (Scenario-4)?
• Below, I am unable to segregate the two classes using a straight line, as
one of the stars lies in the territory of the other (circle) class as an outlier.
Find the hyper-plane to segregate to classes (Scenario-5):
• In the scenario below, we can’t have linear hyper-plane between the
two classes, so how does SVM classify these two classes? Till now, we
have only looked at the linear hyper-plane.
SVM
• SVM can solve this problem. Easily! It solves this problem by
introducing additional feature. Here, we will add a new feature
z=x^2+y^2. Now, let’s plot the data points on axis x and z:
1-Dimensional Data Transformation
• We cannot classify this data using a support vector classifier, whatever the cost value is.
• Another way of handling the data, called the kernel trick, using the kernel
function to work with non-linearly separable data.
• A polynomial kernel with degree 2 has been applied in transforming the data
from 1-dimensional to 2-dimensional data.
• The degree of the polynomial kernel is a tuning parameter
• The practitioner needs to tune them with various values to check
where higher accuracies are possible with the model
2-Dimensional Data Transformation
• In the 2-dimensional case, the kernel trick is applied as below with the
polynomial kernel with degree 2.
• It seems that observations have been classified successfully using a
linear plane after projecting the data into higher dimensions
Kernel Trick
• SVM algorithms use a set of mathematical functions that are defined as
the kernel. The function of kernel is to take data as input and transform it
into the required form.
• Firstly, a kernel takes the data from its original space and implicitly maps it
to a higher-dimensional space. This is crucial when dealing with data that
is not linearly separable in its original form.
• Instead of performing computationally expensive high-dimensional
calculations, the kernel function calculates the relationships or similarities
between pairs of data points as if they were in this higher-dimensional
space.
Kernel Trick
• The kernel trick allows us to operate in the original feature space without
computing the coordinates of the data in a higher-dimensional space.
• Example: let x and y be two data points in 3 dimensions, and assume that we
need to map x and y to a 9-dimensional space. We would need to compute the
full mapping for each point to get the final result, which is just a scalar. The
computational complexity, in this case, is O(n²).
• However, if we use the kernel function, denoted k(x, y), instead of doing the
complicated computations in the 9-dimensional space, we reach the same result
within the 3-dimensional space by calculating the dot product of x-transpose
and y. The computational complexity, in this case, is O(n).
• What the kernel trick does for us is to offer a more efficient and less expensive
way to transform data into higher dimensions.
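A small sketch of this idea (a degree-2 polynomial kernel on 3-dimensional points; the mapping φ and the vectors are illustrative assumptions): computing φ(x)·φ(y) explicitly in 9 dimensions gives the same number as the kernel (xᵀy)² computed in 3 dimensions.

# Sketch: explicit degree-2 feature map vs. the kernel trick k(x, y) = (x.y)^2.
import numpy as np

def phi(v):
    # Explicit 9-dimensional feature map for a 3-dimensional vector:
    # all pairwise products v_i * v_j.
    return np.array([v[i] * v[j] for i in range(3) for j in range(3)])

x = np.array([1.0, 2.0, 3.0])   # illustrative points
y = np.array([4.0, 5.0, 6.0])

explicit = phi(x) @ phi(y)      # dot product in the 9-dimensional space
kernel = (x @ y) ** 2           # same value computed in the original space

print(explicit, kernel)         # both 1024.0 for these vectors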
Kernel Trick
• Numerical example of how a kernel function works:
Feature (x): -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6
x²: 36, 25, 16, 9, 4, 1, 0, 1, 4, 9, 16, 25, 36
Fig 1: Linearly inseparable data in one dimension
Fig 2: Applying the kernel method to represent the data in two dimensions
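A brief sketch of this mapping (the class labels are illustrative assumptions, chosen so that the inner points form one class and the outer points the other):

# Sketch: mapping 1-D points x to 2-D points (x, x^2) so a line can separate them.
import numpy as np

x = np.arange(-6, 7)                    # the feature values from the table above
X2d = np.column_stack([x, x ** 2])      # kernel-style mapping to two dimensions

labels = (np.abs(x) <= 2).astype(int)   # assumed labels: inner points vs. outer points

# In the (x, x^2) plane, the horizontal line x^2 = 6.5 now separates the two classes.
for point, label in zip(X2d, labels):
    print(point, "class", label)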
Kernel Functions
• Let κ(x, x’) ≥ 0 be some measure of similarity between objects x, x’ ∈ X ,
where X is some abstract space; we will call κ a kernel function.
• We define a kernel function to be a real-valued function of two arguments, κ(x,
x') ∈ R, for x, x' ∈ X. Typically the function is symmetric (i.e., κ(x, x') = κ(x',
x)) and non-negative (i.e., κ(x, x') ≥ 0).
• Linear Kernel:
• Letting φ(x) = x, we get the linear kernel, defined by just the dot product between
the two object vectors: κ(x, x') = xᵀx'.
• This is useful if the original data is already high dimensional, and if the original
features are individually informative,
• e.g., a bag of words representation where the vocabulary size is large, or the
expression level of many genes.
• In such a case, the decision boundary is likely to be representable as a linear
combination of the original features, so it is not necessary to work in some
other feature space
Kernel Functions
• Mercer Kernel:
• Let X = {x1, . . . , xn} be a finite set of n samples from the space 𝒳. The Gram matrix of
X is defined as the n × n matrix K with entries Kij = κ(xi, xj).
• If the matrix K is positive definite for every such X ⊆ 𝒳, κ is called a Mercer kernel, or a
positive definite kernel.
• Mercer's Theorem: If the Gram matrix is positive definite, we can compute
an eigenvector decomposition of it as K = UᵀΛU,
• where Λ is a diagonal matrix of eigenvalues λi > 0. Now consider an element
of K: kij = (Λ^(1/2) U:,i)ᵀ (Λ^(1/2) U:,j).
• Let us define φ(xi) = Λ^(1/2) U:,i. Then we can write kij = φ(xi)ᵀ φ(xj).
• Thus entries in the kernel matrix can be computed by performing an inner
product of some feature vectors that are implicitly defined by the eigenvectors
U. In general, if the kernel is Mercer, then there exists a function φ mapping x
∈ 𝒳 to R^D such that κ(x, x') = φ(x)ᵀ φ(x').
Kernel Functions
• Polynomial Kernel:
• It represents the similarity of vectors in the training set of data in a feature
space over polynomials of the original variables used in the kernel:
κ(x, x') = (γ xᵀx' + r)^M, where r > 0, γ > 0 and M is the degree.
• Sigmoid Kernel:
• An example of a kernel that is not a Mercer kernel is the so-called sigmoid
kernel, defined by κ(x, x') = tanh(γ xᵀx' + r).
• This function is equivalent to a two-layer perceptron model of the neural
network, where tanh is used as the activation function for the artificial neurons.
Kernel Functions
• RBF Kernel:
• The Gaussian or RBF kernel is defined by κ(x, x') = exp(−‖x − x'‖² / (2σ²)).
• Here σ² is the variance and our hyperparameter, and ‖x − x'‖ is the Euclidean
(L2-norm) distance between the two points x and x'.
• When using a Gaussian kernel in an SVM, the decision boundary is a
nonlinear surface that can capture complex nonlinear relationships
between the input features.
• The width of the Gaussian function, controlled by the gamma parameter
(inversely related to σ²), determines the degree of nonlinearity in the decision boundary.
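A small sketch computing the RBF kernel directly and using it inside an SVM (the toy data are illustrative assumptions; uses scikit-learn's SVC with kernel='rbf'):

# Sketch: the RBF kernel value for two points, and an RBF-kernel SVM on toy data.
import numpy as np
from sklearn.svm import SVC

def rbf(x, x2, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - x2) ** 2) / (2 * sigma ** 2))

print(rbf(np.array([0.0, 0.0]), np.array([1.0, 1.0])))   # kernel similarity of two points

# Toy, non-linearly separable data: class 1 inside a disc, class 0 outside it.
rng = np.random.RandomState(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (np.sum(X ** 2, axis=1) < 1.0).astype(int)

clf = SVC(kernel='rbf', gamma='scale', C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))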
Kernel Functions
• String Kernels:
• If we are interested in matching all substrings (for example) instead of
representing an object as a bag of words, we can use a string kernel.
• Let A denote an alphabet, e.g., {a, ..., z}, and let A* = A ∪ A² ∪ · · · ∪ A^m, where
m is the length of the longest string we would like to match. Then a basis
function φ(x) maps a string x to a vector of length |A*|, where element j is the
number of times we observe the j-th substring of A* in string x, for j = 1 : |A*|.
• The string kernel measures the similarity of two strings x and x' as
κ(x, x') = Σ over s ∈ A* of w_s φ_s(x) φ_s(x'),
• where φ_s(x) denotes the number of occurrences of substring s in string x and w_s ≥ 0 is a weight.
Kernel Functions
• Matern Kernel:
• The Matern kernel, which is commonly used in Gaussian process regression, has the
following form:
κ(r) = (2^(1−ν) / Γ(ν)) (√(2ν) r / ℓ)^ν K_ν(√(2ν) r / ℓ),
• where r = ‖x − x'‖, ν > 0 and ℓ > 0 are parameters, and K_ν is a modified Bessel
function.
• Fisher Kernel:
• We can construct a kernel based on a chosen generative model using the
concept of a Fisher kernel. The idea is that this kernel represents the
distance in likelihood space between different objects for a fitted
generative model. A Fisher kernel is defined as κ(x, x') = g(x)ᵀ F⁻¹ g(x'),
• where g(x) = ∇θ log p(x|θ) evaluated at the fitted parameters (the score vector)
and F is the Fisher information matrix.
Using Kernels inside GLMs
• Kernel Machines:
• We define a kernel machine to be a GLM where the input feature vector
has the form φ(x) = [κ(x, μ1), . . . , κ(x, μK)],
• where the μk ∈ 𝒳 are a set of K centroids. If κ is an RBF kernel, this is called
an RBF network.
• We will call the above equation a kernelised feature vector.
• We can use the kernelised feature vector for logistic regression by defining
p(y|x, θ) = Ber(wᵀφ(x)).
• This provides a simple way to define a non-linear decision boundary.
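A compact sketch of a kernel machine of this kind (RBF features followed by ordinary logistic regression; the centroids, bandwidth and toy xor-like data are illustrative assumptions):

# Sketch: kernelised feature vector phi(x) = [k(x, mu_1), ..., k(x, mu_K)]
# fed into a plain logistic regression (an RBF network for classification).
import numpy as np
from sklearn.linear_model import LogisticRegression

def rbf_features(X, centroids, sigma=0.5):
    # One RBF similarity per centroid for each input point.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # xor-like labels (not linearly separable)

centroids = np.array([[0.5, 0.5], [-0.5, 0.5], [0.5, -0.5], [-0.5, -0.5]])  # 4 prototypes
Phi = rbf_features(X, centroids)

clf = LogisticRegression().fit(Phi, y)
print("training accuracy:", clf.score(Phi, y))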
Using Kernels inside GLMs
• Example:
• Consider data coming from the exclusive-or (xor) function. This is a
binary-valued function of two binary inputs. Its truth table is shown in
Figure 14.2(a). In Figure 14.2(b), we show some data labeled by the
xor function. We see that we cannot separate the data even using a degree-10
polynomial.
• However, using an RBF kernel and just 4 prototypes easily solves the
problem as shown in Figure 14.2(c)
Using Kernels inside GLMs
• Example:
• We can also use the kernelized feature
vector inside a linear regression model
by defining p(y|x, θ) = N (wTφ(x), σ2).
• For example, Figure 14.3 shows a 1d
data set fit with K = 10 uniformly
spaced RBF prototypes, but with the
bandwidth ranging from small to
large.
• Small values lead to very wiggly
functions, since the predicted function
value will only be non-zero for points x
that are close to one of the prototypes
μk.
• If the bandwidth is very large, the
design matrix reduces to a constant
matrix of 1's, since each point is
equally close to every prototype, so the
predicted function is essentially flat.
SVM Algorithm
• Math Behind SVM
• Consider a binary classification problem
with two classes, labeled as +1 and -1. We
have a training dataset consisting of input
feature vectors X and their corresponding
class labels Y.
• The equation for the linear hyperplane can
be written as wᵀx + b = 0.
• Here the output y indicates whether the point is
in the positive class or the negative class, w is the
vector of the plane's parameters (the coefficients
of x, where x is the input data), and b represents
the intercept of the hyperplane.
• Anything above the decision boundary should have label +1.
• Similarly, anything below the decision boundary should have label -1.
• If a point lies exactly on the decision boundary, then the output of the
classifier would be zero.
SVM Algorithm
• Why the output of the equation is either
positive or negative.
• Consider the problem where the decision
boundary passes through the origin and
hence intercept is zero and its slope is +1.
• A single data point on each side of the
hyperplane represents both the positive and
negative classes.
• Substituting the values in the equation of the
hyperplane:
Any point below the hyperplane will
always be positive, and above the
hyperplane will be negative.
SVM Algorithm
• We would like to choose a hyperplane that maximizes the margin between
classes. The graph below shows what good margins and bad margins are.
SVM Algorithm
• The margin has to be maximized to find the optimal decision boundary.
• Consider the negative support vector as point x1 and the positive support
vector as point x2, so that wᵀx1 + b = -1 and wᵀx2 + b = +1.
• The margin is essentially the separation between x1 and x2; subtracting one
equation from the other gives wᵀ(x2 - x1) = 2.
• Since w is a vector, it cannot be divided directly like a scalar value. The
equivalent is to divide both sides by the length of w, that is, its norm ‖w‖,
which gives a margin of 2/‖w‖.
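The same derivation in compact form (a standard argument, assuming the two marginal hyperplanes are wᵀx + b = +1 and wᵀx + b = -1):

\begin{aligned}
w^{\top}x_2 + b &= +1, \qquad w^{\top}x_1 + b = -1 \\
\Rightarrow\ w^{\top}(x_2 - x_1) &= 2 \\
\Rightarrow\ \text{margin} &= \frac{w^{\top}(x_2 - x_1)}{\lVert w \rVert} = \frac{2}{\lVert w \rVert}
\end{aligned}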
SVM Algorithm
• Now that we have arrived at the equation for the margin, it is considered the
optimization function that needs to be maximized using optimization
algorithms like gradient descent.
• Optimization algorithms work best when finding a minimum; hence, to ease the
problem, we minimize the reciprocal of the margin, i.e. the norm of w over 2,
as the optimization objective.
• Hard Margin
• The maximum-margin hyperplane or the hard margin hyperplane is a
hyperplane that properly separates the data points of different categories
without any misclassifications.
SVM Algorithm
• Soft Margin
• It is also possible that the SVM model can
have some percentage of error, meaning,
misclassification of new data
• This has to be integrated into our
optimization function by adding a penalty
term: c times the summation of the
distances of the misclassified data points
from their marginal hyperplane, where c
controls how heavily these error points
are penalized.
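A short sketch of the effect of the soft-margin penalty C using scikit-learn's SVC (the overlapping toy data are illustrative assumptions; a small C tolerates more margin violations, a large C penalizes them heavily):

# Sketch: soft-margin SVM - varying C trades margin width against training errors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-1, 1.2, size=(50, 2)),    # class 0 (overlaps class 1)
               rng.normal(+1, 1.2, size=(50, 2))])   # class 1
y = np.array([0] * 50 + [1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print("C =", C,
          "support vectors:", len(clf.support_),
          "training accuracy:", clf.score(X, y))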
Types of SVM
• Linear SVM:
• Linear SVMs use a linear decision
boundary to separate the data points of
different classes.
• When the data can be precisely linearly
separated, linear SVMs are very suitable.
This means that a single straight line (in
2D) or a hyperplane (in higher
dimensions) can entirely divide the data
points into their respective classes.
• Non-Linear SVM:
• Non-Linear SVM can be used to classify
data when it cannot be separated into two
classes by a straight line (in the case of
2D).
• By using kernel functions, nonlinear SVMs
can handle nonlinearly separable data.
Hinge Loss in SVM
• Hinge loss is a function popularly used in
support vector machine algorithms to measure
the distance of data points from the decision
boundary. This helps approximate the
possibility of incorrect predictions and
evaluate the model's performance.
• Mathematically, Hinge loss for a data point can
be represented as :
• L(y, f(x)) = max(0, 1 - y * f(x))
• Here, y is the actual class (-1 or 1) and f(x) is the output of the classifier for the
data point.
• Figure legend: Case 1 - correct classification with margin y*f(x) ≥ 1 (blue, zero loss);
Case 2 - correct classification but margin < 1 (faded blue, small loss);
Case 3 - incorrect classification (red, larger loss).
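A tiny sketch evaluating this loss for the three cases in the figure (the example classifier scores are illustrative assumptions):

# Sketch: hinge loss L(y, f(x)) = max(0, 1 - y * f(x)) for a few example cases.
import numpy as np

def hinge_loss(y, fx):
    return np.maximum(0.0, 1.0 - y * fx)

print(hinge_loss(+1,  2.5))   # correct, outside margin  -> 0.0
print(hinge_loss(+1,  0.4))   # correct, inside margin   -> 0.6
print(hinge_loss(+1, -0.7))   # misclassified            -> 1.7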