UNIT- II
Maximum likelihood estimation - Least squares, Robust linear regression, Ridge
regression, Bayesian linear regression, Linear models for classification:
Discriminant functions, Probabilistic generative models, Probabilistic
discriminative models, Laplace approximation, Bayesian logistic regression,
Kernel functions, Using kernels in GLMs, Kernel trick, SVMs.
Linear Regression
In Machine Learning,
Linear Regression is a supervised machine learning algorithm.
It tries to find out the best linear relationship that describes the data you have.
It assumes that there exists a linear relationship between a dependent variable and independent variable(s).
The value of the dependent variable of a linear regression model is continuous, i.e. a real number.
Representing Linear Regression Model-
Linear regression model represents the linear relationship between a dependent variable and independent variable(s) via
a sloped straight line.
The sloped straight line representing the linear relationship that fits the given data best is called the regression line.
Types of Linear Regression
Based on the number of independent variables, there are two types of linear regression-
1. Simple Linear Regression-
In simple linear regression, the dependent variable depends only on a single independent variable.
For simple linear regression, the form of the model is-
Y = β0 + β1X
Here,
Y is a dependent variable.
X is an independent variable.
β0 and β1 are the regression coefficients.
β0 is the intercept or the bias that fixes the offset to a line.
β1 is the slope or weight that specifies the factor by which X has an impact on Y.
Types of Linear Regression
2. Multiple Linear Regression-
In multiple linear regression, the dependent variable depends on more than one independent variable.
For multiple linear regression, the form of the model is-
Y = β0 + β1X1 + β2X2 + β3X3 + …… + βnXn
Here,
Y is a dependent variable.
X1, X2, …., Xn are independent variables.
β0, β1,…, βn are the regression coefficients.
βj (1 ≤ j ≤ n) is the slope or weight that specifies the factor by which Xj has an impact on Y.
Assumptions of Linear Regression
Maximum Likelihood Estimation
• Maximum Likelihood Estimation (MLE) is a statistical method used to
estimate the parameters of a probability distribution that best describe a
given dataset.
• To analyze the data provided, we need to identify the distribution from
which we have obtained data.
• Next, use data to find the parameters of our distribution. A parameter is a
numerical characteristic of a distribution.
• Example distributions and their parameters:
• Normal distribution - mean (µ) and variance (σ²)
• Binomial distribution - number of trials (n) and probability of success (p)
• Gamma distribution - shape (k) and scale (θ)
• Exponential distribution - rate (λ), the inverse of the mean
• These parameters are vital for understanding the size, shape, spread, and
other properties of a distribution.
• Since the data that we have is mostly randomly generated, we often don’t
know the true values of the parameters characterizing our distribution.
Maximum Likelihood Estimation
• An estimator is a function of the data that gives approximate values of the
parameters.
• Example: the sample-mean estimator, a simple and frequently used estimator.
• Since the numerical characteristics of the distribution vary with the parameter, it is
not easy to estimate the parameter θ of the distribution directly.
• Maximum likelihood estimation is an estimation procedure that gives an entire class
of estimators, called maximum likelihood estimators or MLEs.
Likelihood Function
When to Use Log-Likelihood:
•When Dealing with Large Datasets: The likelihood function can become extremely
small as more data points are considered, leading to computational difficulties. The
log-likelihood avoids this by converting multiplication into addition.
•Simplifying Derivatives: When performing MLE, you often need to take derivatives
to find the maximum. The log-likelihood simplifies this process, as the logarithm of
a product becomes a sum of logarithms, making differentiation easier.
The log-likelihood is a transformed version of the likelihood function that is more
mathematically and computationally convenient for optimization in machine
learning models.
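As a minimal illustration of why the log-likelihood is preferred numerically, the sketch below (assuming a normal distribution and SciPy; the sample values and parameters are made up for illustration) compares the raw likelihood product with the log-likelihood sum:

# Minimal sketch: likelihood vs. log-likelihood for a normal distribution.
# The sample values and parameters below are illustrative assumptions.
import numpy as np
from scipy.stats import norm

data = np.array([2.1, 1.9, 2.5, 2.0, 2.3])   # hypothetical observations
mu, sigma = 2.0, 0.5                          # candidate parameter values

likelihood = np.prod(norm.pdf(data, mu, sigma))        # product of densities (can underflow)
log_likelihood = np.sum(norm.logpdf(data, mu, sigma))  # sum of log-densities (numerically stable)

print(likelihood, log_likelihood)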
MLE – for Linear Regression Model
In the context of linear regression, MLE is used to estimate the coefficients (parameters) of
the regression model.
Maximizing Log-Likelihood
Example - Maximum likelihood estimation
Let's work through an example of using Maximum Likelihood Estimation (MLE) in
linear regression.
• Example : We have a dataset with the following observations:
X Y
1 2
2 3
3 5
4 4
2 6
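A minimal sketch of the MLE fit for this data is given below; under the usual Gaussian-noise assumption the maximum-likelihood coefficients coincide with the least-squares ones, so the code just applies the closed-form formulas to the five (X, Y) pairs listed above:

# Sketch: MLE for simple linear regression on the example data above.
# Under Gaussian noise, maximizing the log-likelihood is equivalent to least squares.
import numpy as np

X = np.array([1, 2, 3, 4, 2], dtype=float)   # X values from the table above
Y = np.array([2, 3, 5, 4, 6], dtype=float)   # Y values from the table above

# Closed-form MLE / least-squares estimates of slope and intercept
beta1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0 = Y.mean() - beta1 * X.mean()

# MLE of the noise variance (residual sum of squares divided by n)
residuals = Y - (beta0 + beta1 * X)
sigma2 = np.mean(residuals ** 2)

print(beta0, beta1, sigma2)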
LINEAR REGRESSION PROBLEM
Problem:
You are given the following data representing the
advertising budget (in thousands of dollars) and the
corresponding sales (in thousands of units) for a
product:
Advertising Budget: X=[2,4,6,8,10]
Sales: Y=[4,7,9,11,15]
You want to find a linear relationship between the
advertising budget and sales.
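A short sketch of the least-squares fit for this problem, using the closed-form slope and intercept formulas (the prediction at a budget of 12 is purely illustrative):

# Sketch: simple linear regression for the advertising-budget example above.
import numpy as np

X = np.array([2, 4, 6, 8, 10], dtype=float)   # advertising budget (thousands of dollars)
Y = np.array([4, 7, 9, 11, 15], dtype=float)  # sales (thousands of units)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)  # slope
b0 = Y.mean() - b1 * X.mean()                                               # intercept

print("Y = %.2f + %.2f * X" % (b0, b1))
print("Illustrative prediction for a budget of 12:", b0 + b1 * 12)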
This simple linear regression example illustrates how
you can establish a linear relationship between two
variables and use that relationship for prediction. In
real-world applications, additional steps and
considerations may be necessary, including data
preprocessing, feature selection, model validation,
and more.
LINEAR REGRESSION PROBLEM
Extra Problem-
A small fitness center wants to understand the relationship between the number of hours spent in the gym and
weight loss (in pounds) for its members over a one-month period. They collect the following data:
Hours in Gym (X): [5,10,15,20,25]
Weight Loss (Y): [3,6,9,12,15]
The fitness center wants to use this information to predict the weight loss for a member who spends 18 hours in
the gym.
Extra Problem-Solution
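A short sketch of the calculation (the data are exactly proportional, so the fitted line is Y = 0.6X and the predicted weight loss for 18 hours is 10.8 pounds):

# Sketch: least-squares fit for the gym example and prediction at 18 hours.
import numpy as np

X = np.array([5, 10, 15, 20, 25], dtype=float)  # hours in gym
Y = np.array([3, 6, 9, 12, 15], dtype=float)    # weight loss (pounds)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

# The slope works out to 0.6 and the intercept is (numerically) 0.
print("b1 =", b1, "b0 =", b0)
print("Predicted weight loss at 18 hours: %.1f pounds" % (b0 + b1 * 18))  # 10.8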
Steps applied in Linear Regression
Modeling
1. Missing value and outlier treatment
2. Correlation check of independent variables
3. Random split of the data into train and test sets
4. Fit the model on train data
5. Evaluate model on test data
PROGRAM
#Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#read the dataset
df = pd.read_csv("/content/kc_house_data.csv")
#visualize the correlations using a heatmap
plt.figure()
sns.heatmap(df.corr(), cmap='coolwarm')
plt.show()
#select the required parameters
area = df['sqft_living']
price = df['price']
x = np.array(area).reshape(-1, 1)
y = np.array(price)
#import LinearRegression and split the data into training and testing datasets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)
#fit the model on the training dataset
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train)
#calculate intercept and coefficient
print(model.intercept_)
print(model.coef_)
pred = model.predict(x_test)
predictions = pred.reshape(-1, 1)
#calculate mean squared error and RMSE to evaluate model performance
from sklearn.metrics import mean_squared_error
print('MSE:', mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(mean_squared_error(y_test, predictions)))
OUTPUT
[-48536.69005829]
[[284.14771038]]
MSE: 62014619472.34492
RMSE: 249027.34683633628
Robust Linear Regression
• Robust linear regression is designed to be less sensitive to outliers compared to
traditional linear regression.
• Traditional linear regression minimizes the sum of squared residuals, which can be
heavily influenced by outliers.
• Robust linear regression uses different techniques to mitigate the effect of outliers
and produce a more reliable model.
• Linear Regression: Suitable when the data meets the assumptions, especially when
there are no significant outliers and the relationship is linear.
• Robust Linear Regression: Appropriate when there are outliers or when the
assumptions of linear regression are violated, making it more reliable for real-world
data that may not adhere perfectly to theoretical assumptions.
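As a brief, hedged sketch of one common approach (the slides may use a different estimator), scikit-learn's HuberRegressor down-weights large residuals and can be compared against ordinary least squares on data containing an outlier:

# Sketch: ordinary vs. robust linear regression on data with one outlier.
# The toy data below are illustrative assumptions, not from the slides.
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.RandomState(0)
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 0.5, 20)
y[5] = 80.0                          # inject a single large outlier

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)   # robust: uses the Huber loss

print("OLS slope:  ", ols.coef_[0])
print("Huber slope:", huber.coef_[0])   # typically much closer to the true slope of 2 than OLS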
Ridge Regression (L2 Regularization)
Bayesian Linear Regression
The prior should contain no information from the likelihood (it encodes beliefs held before seeing the data).
The evidence term is just a normalization constant that ensures the posterior integrates to one.
Bayesian Linear Regression
Likelihood (P(Data | Hypothesis)):
•The likelihood represents how probable the observed data is, assuming a particular hypothesis or
model is true. It is a function of the parameters of the model, and it quantifies how well the model
explains the observed data.
•Example: In a classification problem, the likelihood would measure how probable the observed labels
are given the predicted labels from your model.
Prior (P(Hypothesis)):
•The prior is your initial belief about the hypothesis before observing any data. It reflects your
knowledge or assumptions about the model parameters based on previous information or intuition.
Evidence (P(Data)):
•The evidence (also called the marginal likelihood) is the probability of the observed data under all
possible hypotheses. It normalizes the posterior distribution so that it sums to one. Evidence is used in
Bayesian inference but is often hard to compute directly.
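A minimal sketch of Bayesian linear regression with a Gaussian prior and known noise variance (the conjugate closed form; the prior precision alpha, noise precision beta and toy data below are illustrative assumptions):

# Sketch: Bayesian linear regression with a Gaussian prior N(0, alpha^-1 I)
# and Gaussian noise of precision beta; the posterior is also Gaussian.
import numpy as np

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(20, 1))
y = 0.5 + 2.0 * X.ravel() + rng.normal(0, 0.2, 20)   # toy data (illustrative)

Phi = np.hstack([np.ones((20, 1)), X])   # design matrix with a bias column
alpha, beta = 2.0, 25.0                  # prior precision, noise precision (assumed)

# Posterior covariance and mean (standard conjugate-Gaussian result)
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ y

print("posterior mean of [intercept, slope]:", m_N)
print("posterior covariance:\n", S_N)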
Discriminant Functions
A discriminant function is a function used in classification tasks to assign a given
input to one of several possible classes. It is designed to make decisions based
on the values of input features by computing a score for each class. The class
with the highest score is the one to which the input is assigned.
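As a small illustrative sketch of this idea (a hand-written linear discriminant with made-up weights, not a trained model), each class gets a score g_k(x) = w_kᵀx + b_k and the input is assigned to the class with the highest score:

# Sketch: assigning an input to the class with the largest discriminant score.
# The weights and biases below are illustrative assumptions.
import numpy as np

W = np.array([[ 1.0, -0.5],    # weight vector for class 0
              [-0.2,  0.8],    # weight vector for class 1
              [ 0.3,  0.3]])   # weight vector for class 2
b = np.array([0.0, 0.1, -0.2]) # bias (offset) for each class

x = np.array([0.4, 1.2])       # an example input feature vector

scores = W @ x + b             # one discriminant score g_k(x) per class
predicted_class = int(np.argmax(scores))
print(scores, "-> class", predicted_class)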
Probabilistic Generative Models
Laplace Approximation
• Laplace approximation is a technique in machine learning and statistics used to
approximate integrals, particularly when dealing with Bayesian inference. It's
often applied when the exact calculation of posterior distributions is intractable.
The method relies on approximating the posterior distribution with a Gaussian
distribution centered around the mode of the posterior. This is achieved by taking
the second-order Taylor expansion of the log-posterior distribution around its
mode.
• Key Steps in Laplace Approximation:
1. Find the mode of the posterior distribution: identify the maximum a posteriori (MAP)
estimate, which is the mode of the posterior distribution. This can be done using
optimization techniques.
2. Approximate the posterior with a Gaussian: use a second-order Taylor expansion of the
log-posterior distribution around the MAP estimate. This results in a quadratic
approximation, which corresponds to a Gaussian distribution.
Laplace Approximation
3.Calculate the Hessian Matrix:
The covariance of the approximating Gaussian is the inverse of the Hessian matrix of the negative log-
posterior evaluated at the mode. The Hessian captures the curvature of the posterior distribution.
4. Obtain the Approximation:
With the mode and covariance matrix, the posterior is approximated as a multivariate normal distribution.
Applications in Machine Learning:
• Bayesian Neural Networks (BNNs): Laplace approximation can be used to approximate the posterior
distribution of the weights in BNNs, leading to uncertainty quantification in predictions.
• Model Selection: In Bayesian model comparison, Laplace approximation helps compute marginal
likelihoods, which can be used for model selection.
• Gaussian Processes: It is used to approximate non-Gaussian likelihoods in Gaussian process models,
especially in classification problems.
Laplace approximation is particularly useful when dealing with high-dimensional models, where exact inference
becomes computationally expensive.
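A minimal 1-D sketch of these steps (the target density below is an illustrative assumption; SciPy finds the mode numerically and a finite-difference second derivative stands in for the Hessian):

# Sketch: 1-D Laplace approximation of an unnormalized density f(z).
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative unnormalized log-density (a non-Gaussian example)
def log_f(z):
    return -0.5 * z**2 + 0.3 * np.sin(3 * z)

# Step 1: find the mode z0 by maximizing log f (minimize its negative)
res = minimize_scalar(lambda z: -log_f(z))
z0 = res.x

# Steps 2-3: curvature at the mode via a finite-difference second derivative
h = 1e-4
A = -(log_f(z0 + h) - 2 * log_f(z0) + log_f(z0 - h)) / h**2   # precision A = -(d2/dz2) log f at z0

# Step 4: Gaussian approximation q(z) = N(z | z0, A^-1)
print("mode z0 =", z0, " precision A =", A, " variance =", 1.0 / A)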
Laplace Approximation
The Laplace Approximation aims to find a Gaussian approximation to a probability
density defined over a set of continuous variables.
Consider first the case of a single continuous variable z, and suppose the
distribution p(z) is defined by p(z) = (1/Z) f(z), where Z = ∫ f(z) dz is the
normalization coefficient, which need not be known.
Laplace Approximation…
Gaussian approximation will only be well defined if its precision A > 0, in other words the stationary
point z0 must be a local maximum, so that the second derivative of f(z) at the point z0 is negative.
Laplace Approximation…
where A = −∇∇ ln f(z) evaluated at z = z0 is the Hessian of the negative log of f, and ∇ is the gradient operator. Taking the exponential of both sides we obtain
f(z) ≈ f(z0) exp{ −(1/2) (z − z0)ᵀ A (z − z0) }
The distribution q(z) is proportional to f(z) and the appropriate normalization coefficient can be
found by inspection, using the standard result for a normalized multivariate Gaussian, giving
q(z) = (|A|^(1/2) / (2π)^(M/2)) exp{ −(1/2) (z − z0)ᵀ A (z − z0) } = N(z | z0, A⁻¹)
where M is the dimensionality of z.
Laplace Approximation…
One major weakness of the Laplace approximation is that, since it is based on a Gaussian
distribution, it is only directly applicable to real variables.
In other cases, it may be possible to apply the Laplace approximation to a transformation of the
variable. For instance, if 0 < τ < ∞ then we can consider a Laplace approximation of ln τ.
The most serious limitation of the Laplace framework, however, is that it is based purely on the
aspects of the true distribution at a specific value of the variable, and so can fail to capture
important global properties.
Bayesian Logistic Regression
Logistic Regression
Logistic regression is one of the most popular machine learning algorithms for binary
classification. This is because it is a simple algorithm that performs very well on a wide
range of problems.
Logistic Function
The logistic function is defined as:
P = 1 / (1 + e^-x)
Where e is the numerical constant Euler's number and x is the input we plug into the
function.
Logistic Regression
Let’s plug in a series of numbers from -5 to +5 and see how the logistic function transforms them:
Large negative numbers result in values close to zero.
Large positive numbers result in values close to one.
0 is transformed to 0.5, the midpoint of the new range.
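A quick sketch reproducing this transformation for integers from -5 to +5:

# Sketch: the logistic function applied to inputs from -5 to +5.
import numpy as np

x = np.arange(-5, 6)
p = 1.0 / (1.0 + np.exp(-x))

for xi, pi in zip(x, p):
    print("%2d -> %.4f" % (xi, pi))   # -5 -> ~0.0067, 0 -> 0.5, 5 -> ~0.9933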
Logistic Regression
Consider the following example: An organization wants to
determine an employee’s salary increase based on their
performance.
For this purpose, a linear regression algorithm will help them
decide. Plotting a regression line by considering the employee’s
performance as the independent variable, and the salary increase as
the dependent variable will make their task easier.
Now, what if the organization wants to know whether an employee
would get a promotion or not based on their performance? The above
linear graph won’t be suitable in this case. As such, we clip the line at
zero and one, and convert it into a sigmoid curve (S curve).
Based on the threshold values, the organization can decide whether an
employee will get a salary increase or not.
Difference between Linear and Logistic Regression
• Linear regression is used to solve regression problems; logistic regression is used to solve classification problems.
• In linear regression the response variable is continuous in nature; in logistic regression it is categorical.
• Linear regression helps estimate the dependent variable when the independent variable changes; logistic regression helps calculate the probability of a particular event taking place.
• Linear regression fits a straight line; logistic regression fits an S-curve (the sigmoid).
Logistic Regression
This dataset has two input variables (X1 and X2), both real-valued, and one output variable (Y).
The output variable has two values, making the problem a binary classification problem.
For this dataset, logistic regression has three coefficients, just like linear regression, for
example: output = b0 + b1*x1 + b2*x2.
Unlike linear regression, the output is transformed into a probability using the
logistic function: p = 1 / (1 + e^(-output)).
Logistic Regression
Calculate p(x) for each record
p = 1 / (1 + e^(-L))
or
p= e^L/ (1 + e^L)
Logit L = b0 + b1*x1 + b2*x2
b0- intercept
b1- first regression coefficient (learning)
b2-second regression coefficient (learning)
x1-first predictor variable(input)
x2-second predictor variable (input)
Calculate Logit for each record based on the above formula
The job of the learning algorithm is to discover the best values for the coefficients (b0, b1 and b2) based on the training data.
If the predicted probability is > 0.5, the prediction is class 1; otherwise it is class 0 (the default class).
Logistic Regression –Calculate
Prediction
Assign 0.0 to each coefficient and calculate the predicted probability for the first training instance.
b0 = 0.0
b1 = 0.0
b2 = 0.0
The first training instance is: x1=2.7810836, x2=2.550537003
Using the above equation we can plug in all of these numbers and calculate a prediction:
prediction = 1 / (1 + e^(-(b0 + b1*x1 + b2*x2)))
prediction = 1 / (1 + e^(-(0.0 + 0.0*2.7810836 + 0.0*2.550537003)))
prediction = 0.5
Logistic Regression –Calculate New
Coefficients
The new coefficient values can be calculated using
b1 = Σ(x1 - x1bar)(x2 - x2bar) / Σ(x1 - x1bar)²
b2 = Σ(x1 - x1bar)(x2 - x2bar) / Σ(x2 - x2bar)²
b0 = x2bar - b1 * x1bar
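A short sketch applying these formulas to the ten (x1, x2) pairs used in the worked problem that follows; it reproduces the coefficient values shown in that table:

# Sketch: computing b0, b1, b2 with the formulas above, using the (x1, x2)
# data from the worked problem that follows.
import numpy as np

x1 = np.array([2.7810836, 1.465489372, 3.396561688, 1.38807019, 3.06407232,
               7.627531214, 5.332441248, 6.922596716, 8.675418651, 7.673756466])
x2 = np.array([2.550537003, 2.362125076, 4.400293529, 1.850220317, 3.005305973,
               2.759262235, 2.088626775, 1.77106367, -0.242068655, 3.508563011])

d1, d2 = x1 - x1.mean(), x2 - x2.mean()
b1 = np.sum(d1 * d2) / np.sum(d1 ** 2)   # ~ -0.1429
b2 = np.sum(d1 * d2) / np.sum(d2 ** 2)   # ~ -0.7172
b0 = x2.mean() - b1 * x1.mean()          # ~  3.0961

print(b0, b1, b2)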
Logistic Regression –Advantages
Advantages:
– Makes no assumptions about distributions of classes in feature space
– Easily extended to multiple classes (multinomial regression)
– Natural probabilistic view of class predictions
– Quick to train
– Very fast at classifying unknown records
– Good accuracy for many simple data sets
Logistic Regression –Applications
•Using the logistic regression algorithm, banks can predict whether a customer would
default on loans or not
•To predict the weather conditions of a certain place (sunny, windy, rainy, humid, etc.)
•Ecommerce companies can identify buyers if they are likely to purchase a certain product
•Companies can predict whether they will gain or lose money in the next quarter, year, or
month based on their current performance
•To classify objects based on their features and attributes
Logistic Regression –Problem
Columns: x1, x2, x1 - x1bar, x2 - x2bar, (x1 - x1bar)(x2 - x2bar), (x1 - x1bar)², (x2 - x2bar)², Logit L = b0 + b1*x1 + b2*x2, P = 1/(1 + exp(-L)), Y NEW, Given Y, Accuracy
1: 2.7810836, 2.550537003, -2.051618547, 0.14514411, -0.297780347, 4.20913866, 0.021066813, 0.869350519, 0.704610536, 1, 0, 0
2: 1.465489372, 2.362125076, -3.367212775, -0.043267817, 0.145691948, 11.33812187, 0.001872104, 1.192505447, 0.767188862, 1, 0, 0
3: 3.396561688, 4.400293529, -1.436140459, 1.994900636, -2.864957513, 2.062499417, 3.979628546, -0.54526968, 0.366962574, 0, 0, 1
4: 1.38807019, 1.850220317, -3.444631957, -0.555172576, 1.912365198, 11.86548932, 0.30821659, 1.570711242, 0.827884978, 1, 0, 0
5: 3.06407232, 3.005305973, -1.768629827, 0.59991308, -1.061024166, 3.128051463, 0.359895703, 0.502742829, 0.623103689, 1, 0, 0
6: 7.627531214, 2.759262235, 2.794829068, 0.353869342, 0.989004322, 7.811069517, 0.125223511, 0.026996984, 0.506748836, 1, 1, 1
7: 5.332441248, 2.088626775, 0.499739102, -0.316766118, -0.158300415, 0.24973917, 0.100340774, 0.835995, 0.697621047, 1, 1, 1
8: 6.922596716, 1.77106367, 2.08989457, -0.634329223, -1.325681199, 4.367659312, 0.402373564, 0.836487968, 0.697725026, 1, 1, 1
9: 8.675418651, -0.242068655, 3.842716505, -2.647461548, -10.17344419, 14.76647013, 7.00905265, 2.029804644, 0.88389103, 1, 1, 1
10: 7.673756466, 3.508563011, 2.84105432, 1.103170118, 3.134166228, 8.071589646, 1.216984308, -0.517012363, 0.373551108, 0, 1, 0
Sums: Σx1 = 48.32702147, Σx2 = 24.05392893, Σ(x1 - x1bar)(x2 - x2bar) = -9.699960133, Σ(x1 - x1bar)² = 67.8698285, Σ(x2 - x2bar)² = 13.52465456, ΣL = 6.802312589; accuracy = 50%
Means: x1bar = 4.832702147, x2bar = 2.405392893
b1 = Σ(x1 - x1bar)(x2 - x2bar) / Σ(x1 - x1bar)² = -0.142920063
b2 = Σ(x1 - x1bar)(x2 - x2bar) / Σ(x2 - x2bar)² = -0.717205758
b0 = x2bar - b1*x1bar = 3.096082986
Logistic Regression –Additional Problem
Columns: x1, x2, x1 - x1bar, x2 - x2bar, (x1 - x1bar)(x2 - x2bar), (x1 - x1bar)², (x2 - x2bar)², Logit L = b0 + b1*x1 + b2*x2, P = 1/(1 + exp(-L)), Y NEW, Given Y, Accuracy
1: 68, 166, 27.1, 77.2, 2092.12, 734.41, 5959.84, 216.2132671, 1, 1, 1, 1
2: 70, 178, 29.1, 89.2, 2595.72, 846.81, 7956.64, 225.5147413, 1, 1, 1, 1
3: 72, 170, 31.1, 81.2, 2525.32, 967.21, 6593.44, 226.7086169, 1, 1, 1, 1
4: 66, 124, 25.1, 35.2, 883.52, 630.01, 1239.04, 194.750395, 1, 1, 0, 0
5: 66, 115, 25.1, 26.2, 657.62, 630.01, 686.44, 191.1019757, 1, 1, 0, 0
6: 67, 135, 26.1, 46.2, 1205.82, 681.21, 2134.44, 201.4280318, 1, 1, 0, 0
Sums: Σx1 = 409, Σx2 = 888, Σ(x1 - x1bar) = 163.6, Σ(x2 - x2bar) = 355.2, Σ(x1 - x1bar)(x2 - x2bar) = 9960.12, Σ(x1 - x1bar)² = 4489.66, Σ(x2 - x2bar)² = 24569.84, ΣL = 1255.717028; accuracy = 50%
Means: x1bar = 40.9, x2bar = 88.8
b1 = Σ(x1 - x1bar)(x2 - x2bar) / Σ(x1 - x1bar)² = 2.218457522
b2 = Σ(x1 - x1bar)(x2 - x2bar) / Σ(x2 - x2bar)² = 0.405379929
b0 = x2bar - b1*x1bar = -1.934912666
Applying steps in logistic regression modeling
The following steps are applied in logistic regression modeling in industry:
1. Exclusion criteria and good-bad definition finalization
2. Initial data preparation and univariate analysis
3. Derived/dummy variable creation
4. Fine classing and coarse classing
5. Fitting the logistic model on the training data
6. Evaluating the model on test data
Logistic Regression
# import the necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load the breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)
# split the train and test dataset
X_train, X_test,y_train, y_test = train_test_split(X, y,test_size=0.20, random_state=23)
# LogisticRegression
clf = LogisticRegression(random_state=0)
clf.fit(X_train, y_train)
# Prediction
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Logistic Regression model accuracy (in %):", acc*100)
Output:
Logistic Regression model accuracy (in %): 95.6140350877193
Support Vector Machine
SVM Algorithm
• SVM stands for support vector machine, and although it can solve
both classification and regression problems, it is mainly used for
classification problems in machine learning (ML).
• SVM's purpose is to predict the classification of a query sample by
relying on labeled input data which are separated into two classes by a margin.
• Specifically, the data is transformed into a higher dimension, and a
support vector classifier is used as a threshold (or hyperplane) to
separate the two classes with minimum error.
SVM Algorithm
• Dimensions:
• In simple terms, a dimension of
something is a particular aspect of
it. Examples: width, depth and
height are dimensions.
• A line has one dimension, a square
(considering its edges) has two
dimensions, and a cube has three
dimensions.
• Planes and Hyperplane:
• In one dimension, a hyperplane is
called a point.
• In two dimensions, it is a line.
• In three dimensions, it is a plane
and in more dimensions we call it a
hyperplane.
SVM Algorithm
• Terminologies in SVM:
• The points closest to the hyperplane are called the support vectors, and the
distance of these vectors from the hyperplane is called the margin.
• The basic intuition is that the farther the support vectors are from the hyperplane,
the higher the probability of correctly classifying the points in their respective
regions or classes.
• Support vectors are critical in determining the hyperplane, because if their
position changes, the hyperplane's position is altered.
• Technically, this hyperplane can also be called the margin-maximizing
hyperplane.
Identify the right hyper-plane (Scenario-1):
• Here, we have three hyper-planes (A, B, and C). Now, identify the right hyper-
plane to classify stars and circles.
• You need to remember a thumb rule to identify the right hyper-plane: "Select
the hyper-plane which segregates the two classes better". In this scenario,
hyper-plane "B" has performed this job excellently.
Identify the right hyper-plane
(Scenario-2)
• Here, we have three hyper-planes (A, B, and C) and all are segregating the classes
well. Now, How can we identify the right hyper-plane?
• Here, maximizing the distances between nearest data point (either class) and hyper-
plane will help us to decide the right hyper-plane. This distance is called as Margin
You can see that the margin for hyper-plane C is
high compared to both A and B. Hence, we
choose C as the right hyper-plane. Another
compelling reason for selecting the hyper-plane with
the higher margin is robustness: if we select a hyper-
plane having a low margin, then there is a high
chance of misclassification.
Identify the right hyper-plane
(Scenario-3):
• Hint: Use the rules as discussed in previous section to identify the right
hyper-plane.
• Some of you may have selected the hyper-plane B as it has higher margin
compared to A. But, here is the catch, SVM selects the hyper-plane which
classifies the classes accurately prior to maximizing margin. Here, hyper-
plane B has a classification error and A has classified all correctly. Therefore,
the right hyper-plane is A.
Can we classify two classes (Scenario-4)?
• Below, I am unable to segregate the two classes using a straight line, as
one of the stars lies in the territory of the other (circle) class as an outlier.
Find the hyper-plane to segregate to classes (Scenario-5):
• In the scenario below, we can’t have linear hyper-plane between the
two classes, so how does SVM classify these two classes? Till now, we
have only looked at the linear hyper-plane.
SVM
• SVM can solve this problem. Easily! It solves this problem by
introducing additional feature. Here, we will add a new feature
z=x^2+y^2. Now, let’s plot the data points on axis x and z:
1-Dimensional Data Transformation
• We cannot classify this data using a support vector classifier, whatever the cost value is.
• Another way of handling the data, called the kernel trick, using the kernel
function to work with non-linearly separable data.
• A polynomial kernel with degree 2 has been applied in transforming the data
from 1-dimensional to 2-dimensional data.
• The degree of the polynomial kernel is a tuning parameter
• The practitioner needs to tune them with various values to check
where higher accuracies are possible with the model
2-Dimensional Data Transformation
• In the 2-dimensional case, the kernel trick is applied as below with the
polynomial kernel with degree 2.
• It seems that observations have been classified successfully using a
linear plane after projecting the data into higher dimensions
Kernel Trick
• SVM algorithms use a set of mathematical functions that are defined as
the kernel. The function of kernel is to take data as input and transform it
into the required form.
• Firstly, a kernel takes the data from its original space and implicitly maps it
to a higher-dimensional space. This is crucial when dealing with data that
is not linearly separable in its original form.
• Instead of performing computationally expensive high-dimensional
calculations, the kernel function calculates the relationships or similarities
between pairs of data points as if they were in this higher-dimensional
space.
Kernel Trick
• The kernel trick allows us to operate in the original feature space without
computing the coordinates of the data in a higher-dimensional space.
• Example: let x and y be two data points in 3 dimensions, and assume that we
need to map x and y to a 9-dimensional space. We would need to compute the
full mapping for each point to get the final result, which is just a scalar. The
computational complexity, in this case, is O(n²).
• However, if we use the kernel function, denoted k(x, y), instead of doing the
complicated computations in the 9-dimensional space, we reach the same result
within the 3-dimensional space by calculating the dot product of x-transpose
and y. The computational complexity, in this case, is O(n).
• What the kernel trick does for us is to offer a more efficient and less expensive
way to transform data into higher dimensions.
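A small sketch of this idea (a degree-2 polynomial kernel on 3-dimensional points; the mapping φ and the vectors are illustrative assumptions): computing φ(x)·φ(y) explicitly in 9 dimensions gives the same number as the kernel (xᵀy)² computed in 3 dimensions.

# Sketch: explicit degree-2 feature map vs. the kernel trick k(x, y) = (x.y)^2.
import numpy as np

def phi(v):
    # Explicit 9-dimensional feature map for a 3-dimensional vector:
    # all pairwise products v_i * v_j.
    return np.array([v[i] * v[j] for i in range(3) for j in range(3)])

x = np.array([1.0, 2.0, 3.0])   # illustrative points
y = np.array([4.0, 5.0, 6.0])

explicit = phi(x) @ phi(y)      # dot product in the 9-dimensional space
kernel = (x @ y) ** 2           # same value computed in the original space

print(explicit, kernel)         # both 1024.0 for these vectors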
Kernel Trick
• Numerical example of how a kernel function works:
Feature (x): -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6
x²: 36, 25, 16, 9, 4, 1, 0, 1, 4, 9, 16, 25, 36
Fig 1: Linearly inseparable data in one dimension
Fig 2: Applying the kernel method to represent the data in two dimensions
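A brief sketch of this mapping (the class labels are illustrative assumptions, chosen so that the inner points form one class and the outer points the other):

# Sketch: mapping 1-D points x to 2-D points (x, x^2) so a line can separate them.
import numpy as np

x = np.arange(-6, 7)                    # the feature values from the table above
X2d = np.column_stack([x, x ** 2])      # kernel-style mapping to two dimensions

labels = (np.abs(x) <= 2).astype(int)   # assumed labels: inner points vs. outer points

# In the (x, x^2) plane, the horizontal line x^2 = 6.5 now separates the two classes.
for point, label in zip(X2d, labels):
    print(point, "class", label)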
Kernel Functions
• Let κ(x, x’) ≥ 0 be some measure of similarity between objects x, x’ ∈ X ,
where X is some abstract space; we will call κ a kernel function.
• We define a kernel function to be a real-valued function of two arguments, κ(x,
x') ∈ R, for x, x' ∈ X. Typically the function is symmetric (i.e., κ(x, x') = κ(x',
x)) and non-negative (i.e., κ(x, x') ≥ 0).
• Linear Kernel:
• Letting φ(x) = x, we get the linear kernel, defined by just the dot product between
the two object vectors: κ(x, x') = xᵀx'.
• This is useful if the original data is already high dimensional, and if the original
features are individually informative,
• e.g., a bag of words representation where the vocabulary size is large, or the
expression level of many genes.
• In such a case, the decision boundary is likely to be representable as a linear
combination of the original features, so it is not necessary to work in some
other feature space
Kernel Functions
• Mercer Kernel:
• Let X = {x1, . . . , xn} be a finite set of n samples from the space 𝒳. The Gram matrix of
X is defined as the n × n matrix K with entries Kij = κ(xi, xj).
• If the matrix K is positive definite for every such X ⊆ 𝒳, κ is called a Mercer kernel, or a
positive definite kernel.
• Mercer's Theorem: If the Gram matrix is positive definite, we can compute
an eigenvector decomposition of it as K = UᵀΛU,
• where Λ is a diagonal matrix of eigenvalues λi > 0. Now consider an element
of K: kij = (Λ^(1/2) U:,i)ᵀ (Λ^(1/2) U:,j).
• Let us define φ(xi) = Λ^(1/2) U:,i. Then we can write kij = φ(xi)ᵀ φ(xj).
• Thus entries in the kernel matrix can be computed by performing an inner
product of some feature vectors that are implicitly defined by the eigenvectors
U. In general, if the kernel is Mercer, then there exists a function φ mapping x
∈ 𝒳 to R^D such that κ(x, x') = φ(x)ᵀ φ(x').
Kernel Functions
• Polynomial Kernel:
• It represents the similarity of vectors in the training set of data in a feature
space over polynomials of the original variables used in the kernel:
κ(x, x') = (γ xᵀx' + r)^M, where r > 0, γ > 0 and M is the degree.
• Sigmoid Kernel:
• An example of a kernel that is not a Mercer kernel is the so-called sigmoid
kernel, defined by κ(x, x') = tanh(γ xᵀx' + r).
• This function is equivalent to a two-layer perceptron model of the neural
network, where tanh is used as the activation function for the artificial neurons.
Kernel Functions
• RBF Kernel:
• The Gaussian or RBF kernel is defined by κ(x, x') = exp(−‖x − x'‖² / (2σ²)).
• Here σ² is the variance and our hyperparameter, and ‖x − x'‖ is the Euclidean
(L2-norm) distance between the two points x and x'.
• When using a Gaussian kernel in an SVM, the decision boundary is a
nonlinear surface that can capture complex nonlinear relationships
between the input features.
• The width of the Gaussian function, controlled by the gamma parameter
(inversely related to σ²), determines the degree of nonlinearity in the decision boundary.
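A small sketch computing the RBF kernel directly and using it inside an SVM (the toy data are illustrative assumptions; uses scikit-learn's SVC with kernel='rbf'):

# Sketch: the RBF kernel value for two points, and an RBF-kernel SVM on toy data.
import numpy as np
from sklearn.svm import SVC

def rbf(x, x2, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - x2) ** 2) / (2 * sigma ** 2))

print(rbf(np.array([0.0, 0.0]), np.array([1.0, 1.0])))   # kernel similarity of two points

# Toy, non-linearly separable data: class 1 inside a disc, class 0 outside it.
rng = np.random.RandomState(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (np.sum(X ** 2, axis=1) < 1.0).astype(int)

clf = SVC(kernel='rbf', gamma='scale', C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))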
Kernel Functions
• String Kernels:
• If we are interested in matching all substrings (for example) instead of
representing an object as a bag of words, we can use a string kernel.
• Let A denote an alphabet, e.g., {a, ..., z}, and let A* = A ∪ A² ∪ · · · ∪ A^m, where
m is the length of the longest string we would like to match. Then a basis
function φ(x) maps a string x to a vector of length |A*|, where element j is the
number of times we observe the j-th substring of A* in string x, for j = 1 : |A*|.
• The string kernel measures the similarity of two strings x and x' as
κ(x, x') = Σ over s ∈ A* of w_s φ_s(x) φ_s(x'),
• where φ_s(x) denotes the number of occurrences of substring s in string x and w_s ≥ 0 is a weight.
Kernel Functions
• Matern Kernel:
• The Matern kernel, which is commonly used in Gaussian process regression, has the
following form:
κ(r) = (2^(1−ν) / Γ(ν)) (√(2ν) r / ℓ)^ν K_ν(√(2ν) r / ℓ),
• where r = ‖x − x'‖, ν > 0 and ℓ > 0 are parameters, and K_ν is a modified Bessel
function.
• Fisher Kernel:
• We can construct a kernel based on a chosen generative model using the
concept of a Fisher kernel. The idea is that this kernel represents the
distance in likelihood space between different objects for a fitted
generative model. A Fisher kernel is defined as κ(x, x') = g(x)ᵀ F⁻¹ g(x'),
• where g(x) = ∇θ log p(x|θ) evaluated at the fitted parameters (the score vector)
and F is the Fisher information matrix.
Using Kernels inside GLMs
• Kernel Machines:
• We define a kernel machine to be a GLM where the input feature vector
has the form φ(x) = [κ(x, μ1), . . . , κ(x, μK)],
• where the μk ∈ 𝒳 are a set of K centroids. If κ is an RBF kernel, this is called
an RBF network.
• We will call the above equation a kernelised feature vector.
• We can use the kernelised feature vector for logistic regression by defining
p(y|x, θ) = Ber(wᵀφ(x)).
• This provides a simple way to define a non-linear decision boundary.
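A compact sketch of a kernel machine of this kind (RBF features followed by ordinary logistic regression; the centroids, bandwidth and toy xor-like data are illustrative assumptions):

# Sketch: kernelised feature vector phi(x) = [k(x, mu_1), ..., k(x, mu_K)]
# fed into a plain logistic regression (an RBF network for classification).
import numpy as np
from sklearn.linear_model import LogisticRegression

def rbf_features(X, centroids, sigma=0.5):
    # One RBF similarity per centroid for each input point.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # xor-like labels (not linearly separable)

centroids = np.array([[0.5, 0.5], [-0.5, 0.5], [0.5, -0.5], [-0.5, -0.5]])  # 4 prototypes
Phi = rbf_features(X, centroids)

clf = LogisticRegression().fit(Phi, y)
print("training accuracy:", clf.score(Phi, y))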
Using Kernels inside GLMs
• Example:
• Consider data coming from the exclusive-or (xor) function. This is a
binary-valued function of two binary inputs. Its truth table is shown in
Figure 14.2(a). In Figure 14.2(b), we show some data labeled by the
xor function. We see that we cannot separate the data even using a degree-10
polynomial.
• However, using an RBF kernel and just 4 prototypes easily solves the
problem as shown in Figure 14.2(c)
Using Kernels inside GLMs
• Example:
• We can also use the kernelized feature
vector inside a linear regression model
by defining p(y|x, θ) = N (wTφ(x), σ2).
• For example, Figure 14.3 shows a 1d
data set fit with K = 10 uniformly
spaced RBF prototypes, but with the
bandwidth ranging from small to
large.
• Small values lead to very wiggly
functions, since the predicted function
value will only be non-zero for points x
that are close to one of the prototypes
μk.
• If the bandwidth is very large, the
design matrix reduces to a constant
matrix of 1's, since each point is
equally close to every prototype, so the
predicted function is essentially flat.
SVM Algorithm
• Math Behind SVM
• Consider a binary classification problem
with two classes, labeled as +1 and -1. We
have a training dataset consisting of input
feature vectors X and their corresponding
class labels Y.
• The equation for the linear hyperplane can
be written as wᵀx + b = 0.
• Here the output y indicates whether the point is
in the positive class or the negative class, w is the
vector of the plane's parameters (the coefficients
of x, where x is the input data), and b represents
the intercept of the hyperplane.
• Anything above the decision boundary should have label +1.
• Similarly, anything below the decision boundary should have label -1.
• If a point lies exactly on the decision boundary, then the output of the
classifier would be zero.
SVM Algorithm
• Why the output of the equation is either
positive or negative.
• Consider the problem where the decision
boundary passes through the origin and
hence intercept is zero and its slope is +1.
• A single data point on each side of the
hyperplane represents both the positive and
negative classes.
• Substituting the values in the equation of the
hyperplane:
Any point below the hyperplane will
always be positive, and above the
hyperplane will be negative.
SVM Algorithm
• We would like to choose a hyperplane that maximizes the margin between
classes. The graph below shows what good margins and bad margins are.
SVM Algorithm
• The margin has to be maximized to find the optimal decision boundary.
• Consider the negative support vector as point x1 and the positive support
vector as point x2, so that wᵀx1 + b = -1 and wᵀx2 + b = +1.
• The margin is essentially the separation between x1 and x2; subtracting one
equation from the other gives wᵀ(x2 - x1) = 2.
• Since w is a vector, it cannot be divided directly like a scalar value. The
equivalent is to divide both sides by the length of w, that is, its norm ‖w‖,
which gives a margin of 2/‖w‖.
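The same derivation in compact form (a standard argument, assuming the two marginal hyperplanes are wᵀx + b = +1 and wᵀx + b = -1):

\begin{aligned}
w^{\top}x_2 + b &= +1, \qquad w^{\top}x_1 + b = -1 \\
\Rightarrow\ w^{\top}(x_2 - x_1) &= 2 \\
\Rightarrow\ \text{margin} &= \frac{w^{\top}(x_2 - x_1)}{\lVert w \rVert} = \frac{2}{\lVert w \rVert}
\end{aligned}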
SVM Algorithm
• Now that we have arrived at the equation for the margin, it is considered the
optimization function that needs to be maximized using optimization
algorithms like gradient descent.
• Optimization algorithms work best when finding a minimum; hence, to ease the
problem, we minimize the reciprocal of the margin, i.e. the norm of w over 2,
as the optimization objective.
• Hard Margin
• The maximum-margin hyperplane or the hard margin hyperplane is a
hyperplane that properly separates the data points of different categories
without any misclassifications.
SVM Algorithm
• Soft Margin
• It is also possible that the SVM model can
have some percentage of error, meaning,
misclassification of new data
• This has to be integrated into our
optimization function by adding a penalty
term: c times the summation of the
distances of the misclassified data points
from their marginal hyperplane, where c
controls how heavily these error points
are penalized.
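A short sketch of the effect of the soft-margin penalty C using scikit-learn's SVC (the overlapping toy data are illustrative assumptions; a small C tolerates more margin violations, a large C penalizes them heavily):

# Sketch: soft-margin SVM - varying C trades margin width against training errors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-1, 1.2, size=(50, 2)),    # class 0 (overlaps class 1)
               rng.normal(+1, 1.2, size=(50, 2))])   # class 1
y = np.array([0] * 50 + [1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print("C =", C,
          "support vectors:", len(clf.support_),
          "training accuracy:", clf.score(X, y))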
Types of SVM
• Linear SVM:
• Linear SVMs use a linear decision
boundary to separate the data points of
different classes.
• When the data can be precisely linearly
separated, linear SVMs are very suitable.
This means that a single straight line (in
2D) or a hyperplane (in higher
dimensions) can entirely divide the data
points into their respective classes.
• Non-Linear SVM:
• Non-Linear SVM can be used to classify
data when it cannot be separated into two
classes by a straight line (in the case of
2D).
• By using kernel functions, nonlinear SVMs
can handle nonlinearly separable data.
Hinge Loss in SVM
• Hinge loss is a function popularly used in
support vector machine algorithms to measure
the distance of data points from the decision
boundary. This helps approximate the
possibility of incorrect predictions and
evaluate the model's performance.
• Mathematically, Hinge loss for a data point can
be represented as :
• L(y, f(x)) = max(0, 1 - y * f(x))
• Here, y is the actual class (-1 or 1) and f(x) is the output of the classifier for the
data point.
• Figure legend: Case 1 - correct classification with margin y*f(x) ≥ 1 (blue, zero loss);
Case 2 - correct classification but margin < 1 (faded blue, small loss);
Case 3 - incorrect classification (red, larger loss).
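A tiny sketch evaluating this loss for the three cases in the figure (the example classifier scores are illustrative assumptions):

# Sketch: hinge loss L(y, f(x)) = max(0, 1 - y * f(x)) for a few example cases.
import numpy as np

def hinge_loss(y, fx):
    return np.maximum(0.0, 1.0 - y * fx)

print(hinge_loss(+1,  2.5))   # correct, outside margin  -> 0.0
print(hinge_loss(+1,  0.4))   # correct, inside margin   -> 0.6
print(hinge_loss(+1, -0.7))   # misclassified            -> 1.7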