Unit-2 MLT

The document discusses regression analysis, emphasizing its importance in predicting the value of a variable based on others, with generalized linear models being the most commonly used technique. It outlines the regression process, common reasons for conducting regression, and the formulation of linear models. Additionally, it introduces Bayes theorem and concept learning, detailing the brute-force MAP learning algorithm and its assumptions regarding hypothesis probabilities.

UNIT-2
1. Regression

Prepared By: Deepti Singh


Regression Modelling
• Regression Analysis is widely used because there are many statistical
problems that can be framed as finding out how to predict the value of a
variable from the values of other variables.
Generalized Linear Models:
• The fitting of generalized linear models is currently the most frequently
applied statistical technique.
- It is used to describe the relationship between the mean, sometimes
called the trend, of one variable and the values taken by several other
variables.
• Modelling this type of relationship is sometimes called Regression.
Regression
• Regression: It is the process of determining how a variable y is related
to one or more other variables x1, x2, x3, …, xn.
Common reasons for doing a regression:
• The output is expensive to measure, but the inputs are not, and so
cheap predictions of the output are sought.
• The values of the inputs are known earlier than the output.
• We can control the values of the inputs; if we believe there is a causal
link between the inputs and the output, we want to know what values of
the inputs should be chosen to obtain a particular target value for the
output.
• If we believe there is a causal link between the inputs and the output, we
may wish to identify which inputs are related to the output.
• The most widely used form of regression model is the GLM.
• The linear model is usually written as:
Cont…
• Yj = β0 + β1x1j + β2x2j + … + βnxnj

For a single input variable this reduces to:
Y = β0 + β1x
y = dependent variable
x = independent variable
β0 = constant term or intercept
β1 = slope or coefficient of x
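As a minimal sketch of fitting such a linear model, the following Python snippet estimates β0 and β1 by ordinary least squares on a small made-up dataset (the data and variable names are illustrative, not from the slides):

```python
import numpy as np

# Illustrative data: x = independent variable, y = dependent variable.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Build the design matrix [1, x] so the first coefficient is the intercept beta0.
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: solve min ||X @ beta - y||^2 for beta = (beta0, beta1).
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
beta0, beta1 = beta
print(f"Y = {beta0:.3f} + {beta1:.3f} x")
```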
Example: Logistic regression for loan repayment
• Consider training data giving the yearly savings (in lakhs) of customers together with a
binary outcome y (1 = loan repaid / non-defaulter, 0 = defaulter).
• The logistic model takes

   P(y = 1 | x; β0, β1) = h(x) = 1 / (1 + exp(-(β0 + β1x)))

so that h(x) is the estimated probability of repayment and 1 - h(x) the probability of default.
• The unknown parameters (β0, β1) are chosen to maximize the likelihood of the observed labels:

   L(β0, β1) = Π over i of h(xi)^yi (1 - h(xi))^(1 - yi)

• Maximizing this likelihood (equivalently, its logarithm) gives the maximum likelihood
estimates of β0 and β1.
• Plotting savings (lakhs) versus the probability of loan repayment, with the fitted logistic
regression curve, regression analysis using the maximum likelihood estimate gives the
following output: the coefficients are β0 = -4.07778 and β1 = 1.5046.

[Figure: scatter of the repayment indicator and the fitted logistic curve; y-axis is the
probability of loan repayment (0.00–1.00), x-axis is savings in lakhs (1–5).]

Figure 8.3 Probability of loan repayment with yearly savings.


• These coefficients are entered into the logistic regression equation to estimate the
probability of being a non-defaulter:

   Probability of being a non-defaulter = 1 / (1 + exp(-(1.5046 × Savings - 4.07778)))

• For example, for a customer with 2 lakhs savings per year, the estimated probability of
being a non-defaulter is 1 / (1 + exp(-(1.5046 × 2 - 4.0777))) ≈ 0.26.
• For a customer with 4 lakhs savings per year, the estimated probability of being a
non-defaulter is 1 / (1 + exp(-(1.5046 × 4 - 4.0777))) ≈ 0.87.
• The predicted and actual values of defaulter/non-defaulter for the whole dataset are shown
in Table 1.2. With a cutoff of 0.5, a fitted value above 0.5 is predicted as 1 (non-defaulter)
and a fitted value below 0.5 as 0 (defaulter).
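The fitted probabilities can be reproduced with a short Python sketch; the coefficients are the ones reported above, while the function name and cutoff handling are our own illustration:

```python
import math

BETA0 = -4.07778   # intercept from the maximum likelihood fit
BETA1 = 1.5046     # coefficient of yearly savings (lakhs)

def prob_non_defaulter(savings):
    """Logistic model: P(non-defaulter | savings) = 1 / (1 + exp(-(b0 + b1*savings)))."""
    return 1.0 / (1.0 + math.exp(-(BETA0 + BETA1 * savings)))

for s in (2.0, 4.0):
    p = prob_non_defaulter(s)
    # Predicted class with a 0.5 cutoff: 1 = non-defaulter, 0 = defaulter.
    print(f"savings = {s} lakhs -> P(non-defaulter) = {p:.4f}, predicted = {int(p > 0.5)}")
```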

Table 1.2 Predicted and actual values of defaulter/non-defaulter

Savings (lakhs)   Defaulter   Fitted Value    Predicted Value
0.5               0           0.034710025     0
0.75              0           0.04977197      0
1                 0           0.070889852     0
1.25              0           0.100024715     0
1.5               0           0.139337907     0
1.75              0           0.190826302     0
1.75              1           0.190826302     0
2                 0           0.255688447     0
2.25              1           0.333510508     0
2.5               0           0.421602115     0
2.75              1           0.514983013     1
3                 0           0.607329347     1
3.25              1           0.692588758     1
3.5               0           0.766454783     1
4                 1           0.874429026     1
4.25              1           0.910262967     1
4.5               1           0.936612324     1
4.75              1           0.955602124     1
5                 1           0.969090667     1
5.5               1           0.985190        1
From Table 1.2 and the 0.5 cutoff, the four possible outcomes are:
1. TP (True Positive): the model predicts 1 (non-defaulter) and the actual value is also 1.
2. TN (True Negative): the model predicts 0 (defaulter) and the actual value is also 0.
3. FP (False Positive): the model predicts 1 but the actual value is 0.
4. FN (False Negative): the model predicts 0 but the actual value is 1.
• The matrix, also called the confusion matrix, shown in Table 8.3, helps classify the values that were
correctly predicted using the model built.
• These classifications are used to calculate accuracy, precision (or positive predictive value), recall (or
sensitivity), specificity, and negative predictive value. These are a few metrics used to evaluate the accuracy
of the logistic model based on the confusion matrix, shown in Table 8.4.

Table 8.4 Metrics to evaluate logistic regression for defaulter/non-defaulter prediction

Accuracy                                   (TP + TN)/Total Number of Observations
Precision or positive predictive value     TP/(TP + FP)
Negative predictive rate                   TN/(TN + FN)
Sensitivity                                TP/(TP + FN)
Specificity                                TN/(TN + FP)

Let us consider the following example of confusion matrix:

Predicted 0 Predicted 1
Actual 0 8 (TN) 2(FP)
Actual 1 2(FN) 8(TP)

Accuracy = (TP + TN)/N = 16/20 = 0.8

Precision = TP/(TP + FP) = 8/10 = 0.8

Negative predictive rate = TN/(TN + FN) = 8/10 = 0.8

Sensitivity = TP/(TP + FN) = 8/10 = 0.8

Specificity = TN/(TN + FP) = 8/10 = 0.8
In this sample dataset the accuracy obtained is 80%. This is how accuracy and precision are computed on a
real-time dataset.
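A small sketch that computes the metrics of Table 8.4 from confusion-matrix counts; the function name and dictionary layout are ours, and the counts are taken from the example above:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics from Table 8.4 given confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),              # positive predictive value
        "negative_predictive_rate": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),             # recall
        "specificity": tn / (tn + fp),
    }

# Counts from the example: Actual 0 -> 8 TN, 2 FP; Actual 1 -> 2 FN, 8 TP.
print(classification_metrics(tp=8, tn=8, fp=2, fn=2))   # every metric is 0.8
```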
UNIT-2
Bayes Theorem & Concept Learning
Bayes Optimal Classifier
Naïve Bayes Classifier
Bayesian Belief Network
EM Algorithm

Prepared By: Deepti Singh


Bayes Theorem & Concept Learning
1. Brute-Force Bayes Concept Learning
• We can design a concept learning algorithm that outputs the maximum a posteriori (MAP)
hypothesis. The BRUTE-FORCE MAP LEARNING algorithm, based on Bayes theorem, is as
follows:
1. For each hypothesis h in H, calculate the posterior probability

   P(h|D) = P(D|h) P(h) / P(D)

2. Output the hypothesis hMAP with the highest posterior probability

   hMAP = argmax over h in H of P(h|D)

• This algorithm is impractical for large hypothesis spaces, because it requires computing the
posterior probability of every hypothesis in H.


• To specify a learning problem for the BRUTE-FORCE MAP LEARNING algorithm, we must
specify what values are to be used for P(h) and for P(D|h).
• We may choose the probability distributions P(h) and P(D|h) in any way we wish,
to describe our prior knowledge about the learning task.
• Here let us choose them to be consistent with the following assumptions:
1. The training data D is noise free (i.e., di = c(xi)).
2. The target concept c is contained in the hypothesis space H.
3. We have no a priori reason to believe that any hypothesis is more probable
than any other.
• Given these assumptions, what values should we specify for P(h)?
- Given no prior knowledge that one hypothesis is more likely than another, it is
reasonable to assign the same prior probability to every hypothesis h in H.
- Because we assume the target concept is contained in H, we should require that
these prior probabilities sum to 1.

   P(h) = 1/|H| for all h in H   ….. Eq. 1
• What choice shall we make for P(D|h)? P(D|h) is the probability of observing the
target values D = (d1 . . . dm) for the fixed set of instances (x1 . . . xm), given a world
in which hypothesis h holds.
- Since we assume noise-free training data, the probability of observing
classification di given h is just 1 if di = h(xi) and 0 if di ≠ h(xi). Therefore,

   P(D|h) = 1 if di = h(xi) for every di in D, and 0 otherwise   ….. Eq. 2

• In other words, the probability of data D given hypothesis h is 1 if D is consistent with
h, and 0 otherwise.
• Given these choices for P(h) and for P(D|h) we now have a fully-defined problem
for the above BRUTE-FORCE MAP LEARNING algorithm.
-> Let us consider the first step of this algorithm, which uses Bayes theorem to
compute the posterior probability P(h|D) of each hypothesis h given the
observed training data D.

Using Bayes theorem, we have:

   P(h|D) = P(D|h) P(h) / P(D)

First consider the case where h is inconsistent with the training data D. Since
Equation (2) defines P(D|h) to be 0 when h is inconsistent with D, we have:

   P(h|D) = (0 · P(h)) / P(D) = 0

The posterior probability of a hypothesis inconsistent with D is zero.

Now consider the case where h is consistent with D. Since Equation (2) defines
P(D|h) to be 1 when h is consistent with D, we have:

   P(h|D) = (1 · 1/|H|) / P(D) = (1/|H|) / (|VSH,D| / |H|) = 1 / |VSH,D|

where VSH,D is the version space of H with respect to D (the subset of hypotheses in H
consistent with D), and P(D) = |VSH,D| / |H| under the uniform prior.
• To summarize, Bayes theorem implies that the posterior probability P(h|D) under
our assumed P(h) and P(D|h) is:

   P(h|D) = 1 / |VSH,D| if h is consistent with D, and P(h|D) = 0 otherwise.
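A minimal sketch of BRUTE-FORCE MAP LEARNING under the assumptions above (uniform prior, noise-free data); the toy hypothesis space of threshold functions is our own illustration, not part of the slides:

```python
# Toy hypothesis space: h_t(x) = 1 if x >= t else 0, for a few thresholds t.
thresholds = [0.0, 1.0, 2.0, 3.0, 4.0]
H = [lambda x, t=t: int(x >= t) for t in thresholds]

# Noise-free training data: (x_i, d_i) pairs generated by the target concept "x >= 2".
D = [(0.5, 0), (1.5, 0), (2.5, 1), (3.5, 1)]

prior = 1.0 / len(H)                       # P(h) = 1/|H| (uniform prior, Eq. 1)

def likelihood(h, data):
    """P(D|h): 1 if h is consistent with every example, else 0 (Eq. 2)."""
    return 1.0 if all(h(x) == d for x, d in data) else 0.0

# Unnormalized posteriors P(D|h) * P(h); dividing by P(D) would not change the argmax.
posteriors = [likelihood(h, D) * prior for h in H]
best = max(range(len(H)), key=lambda i: posteriors[i])
print("hMAP: threshold =", thresholds[best])   # only the threshold-2.0 hypothesis is consistent
```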
2. MAP Hypotheses and Consistent Learners
• We will say that a learning algorithm is a consistent learner if it outputs a
hypothesis that commits zero errors over the training examples.
• Given the previous analysis, we can conclude that every consistent
learner outputs a MAP hypothesis, if we assume a uniform
prior probability distribution over H (i.e., P(hi) = P(hj) for all i, j), and if
we assume deterministic, noise-free training data (i.e., P(D|h) = 1 if D
and h are consistent, and 0 otherwise).
Example:
• FIND-S looks through the hypothesis space, starting with the most
specific hypotheses and moving to more general ones.
• It stops when it finds the most specific hypothesis that fits all the
data (called a consistent hypothesis).
• Even though FIND-S doesn't use probability calculations, its chosen
hypothesis turns out to match the MAP hypothesis if the probability
distributions P(h) (prior probability) and P(D|h) (likelihood)
favor more specific explanations.
BAYES OPTIMAL CLASSIFIER
• "what is the most probable hypothesis given the training data?' In fact, the
question that is often of most significance is the closely related question "what
is the most probable classification of the new instance given the training data?
• Although it may seem that this second question can be answered by simply
applying the MAP hypothesis to the new instance, in fact it is possible to do
better.
• the most probable classification of the new instance is obtained by combining the
predictions of all hypotheses, weighted by their posterior probabilities. If the
possible classification of the new example can take on any value v, from some set
V, then the probability P(vj|D) that the correct classification for the new instance
is vj, is just :
• The optimal classification of the new instance is the value vj, for which P (vj|D) is
maximum.

• Any system that classifies new instances according to the above equation is
called a Bayes optimal classifier, or Bayes optimal learner.
• No other classification method using the same hypothesis space and same prior
knowledge can outperform this method on average.
• This method maximizes the probability that the new instance is classified
correctly, given the available data, hypothesis space, and prior probabilities over
the hypotheses.
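A minimal sketch of the Bayes optimal classification rule; the three hypotheses, their posterior probabilities, and their predictions are made-up illustrative values:

```python
from collections import defaultdict

# Posterior probabilities P(h|D) for three hypotheses (illustrative values).
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}

# Each hypothesis deterministically predicts a class for the new instance,
# so P(v|h) is 1 for the predicted class and 0 otherwise.
prediction = {"h1": "+", "h2": "-", "h3": "-"}

# P(v|D) = sum over h of P(v|h) * P(h|D)
class_prob = defaultdict(float)
for h, p_h in posterior.items():
    class_prob[prediction[h]] += p_h

v_opt = max(class_prob, key=class_prob.get)
print(dict(class_prob), "->", v_opt)   # {'+': 0.4, '-': 0.6} -> '-'
```

Note that even though h1 is the MAP hypothesis here, the combined classification is "-", which illustrates why weighting all hypotheses can do better than applying the MAP hypothesis alone.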
Naive Bayes Classifier
• One highly practical Bayesian learning method is the naive Bayes learner, often
called the naive Bayes classifier. In some domains its performance has been
shown to be comparable to that of neural network and decision tree learning.

• The naive Bayes classifier applies to learning tasks where each instance x is
described by a conjunction of attribute values and where the target function f (x)
can take on any value from some finite set V.

• A set of training examples of the target function is provided, and a new instance
is presented, described by the tuple of attribute values (a1, a2.. .an). The learner is
asked to predict the target value, or classification, for this new instance.
• The Bayesian approach to classifying the new instance is to assign the most probable
target value, vMAP, given the attribute values (a1, a2, . . ., an) that describe the instance.
Using Bayes theorem (and dropping the denominator, which does not depend on vj):

• vMAP = argmax over vj in V of P(vj | a1, a2, . . ., an)
       = argmax over vj in V of P(a1, a2, . . ., an | vj) P(vj)   ….Eq 1
• The naive Bayes classifier is based on the simplifying assumption that the attribute
values are conditionally independent given the target value. In other words, the
assumption is that given the target value of the instance, the probability of observing the
conjunction a1, a2, . . ., an is just the product of the probabilities for the individual attributes:

   P(a1, a2, . . ., an | vj) = Π over i of P(ai | vj)

• Substituting this into Equation (1), we have the approach used by the naive Bayes
classifier:

   vNB = argmax over vj in V of P(vj) Π over i of P(ai | vj)

• where vNB denotes the target value output by the naive Bayes classifier.

• An Illustrative Example: Let us apply the naive Bayes classifier to a concept
learning problem, i.e., classifying days according to whether someone will play
tennis.
• The below table provides a set of 14 training examples of the target concept
PlayTennis, where each day is described by the attributes Outlook, Temperature,
Humidity, and Wind
• Here we use the naive Bayes classifier and the training data from this table to
classify the following novel instance:
• < Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong >
• Our task is to predict the target value (yes or no) of the target concept PlayTennis
for this new instance
   vNB = argmax over vj in {yes, no} of P(vj) P(Outlook = sunny | vj) P(Temperature = cool | vj)
         P(Humidity = high | vj) P(Wind = strong | vj)   ….eqn1

• To calculate vNB we now require 10 probabilities that can be estimated from the
training data. First, the probabilities of the different target values can easily be
estimated based on their frequencies over the 14 training examples:
P(PlayTennis = yes) = 9/14 = .64
P(PlayTennis = no) = 5/14 = .36
• Similarly, we can estimate the conditional probabilities. For example, those for Wind = strong are
P(Wind = strong | PlayTennis = yes) = 3/9 = .33
P(Wind = strong | PlayTennis = no) = 3/5 = .60

Temperature Y N Outlook Y N
hot 2/9 2/5 Sunny 2/9 3/5
mild 4/9 2/5 Overcast 4/9 0
cool 3/9 1/5 Rain 3/9 2/5

Windy Y N Humidity Y N
Strong 3/9 3/5 High 3/9 4/5
Weak 6/9 2/5 Normal 6/9 1/5
• Using these probability estimates and similar estimates for the remaining
attribute values, we calculate vNB according to Equation (1) as follows:

P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 9/14 × 2/9 × 3/9 × 3/9 × 3/9 = .0053
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 5/14 × 3/5 × 1/5 × 4/5 × 3/5 = .0206

Thus, the naive Bayes classifier assigns the target value PlayTennis = no to this
new instance, based on the probability estimates learned from the training data.
• By normalizing the above quantities to sum to one, we can calculate the
conditional probability that the target value is no, given the observed attribute
values: .0206 / (.0206 + .0053) = .795.
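The calculation above can be reproduced with a short sketch; the probability estimates are taken from the tables above, while the dictionary layout and function name are ours:

```python
# Prior and conditional probability estimates from the 14 PlayTennis examples.
prior = {"yes": 9/14, "no": 5/14}
cond = {
    "yes": {"Outlook=sunny": 2/9, "Temperature=cool": 3/9,
            "Humidity=high": 3/9, "Wind=strong": 3/9},
    "no":  {"Outlook=sunny": 3/5, "Temperature=cool": 1/5,
            "Humidity=high": 4/5, "Wind=strong": 3/5},
}

def naive_bayes(instance):
    """v_NB = argmax over v of P(v) * prod_i P(a_i | v)."""
    scores = {}
    for v in prior:
        score = prior[v]
        for attr in instance:
            score *= cond[v][attr]
        scores[v] = score
    return max(scores, key=scores.get), scores

new_instance = ["Outlook=sunny", "Temperature=cool", "Humidity=high", "Wind=strong"]
label, scores = naive_bayes(new_instance)
print(scores)                  # {'yes': ~0.0053, 'no': ~0.0206}
print("PlayTennis =", label)   # no
```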
• ESTIMATING PROBABILITIES
• We have estimated probabilities by the fraction of times the event is observed to occur over
the total number of opportunities.
e.g.: we estimated P(Wind = strong | PlayTennis = no) by the fraction nc/n,
where n = 5 is the total number of training examples for which PlayTennis = no, and nc = 3 is
the number of these for which Wind = strong.
• While this observed fraction provides a good estimate of the probability in many cases, it
provides poor estimates when nc is very small.
• To avoid this difficulty we can adopt a Bayesian approach to estimating the probability, using
the m-estimate defined as follows:

   m-estimate of probability = (nc + m·p) / (n + m)

where p = prior estimate of the probability, and
m is a constant called the equivalent sample size.
Bayesian Belief Network
• It is a graphical model that represents the probabilistic relationships among
variables.
• It is used to handle uncertainty and make predictions or decisions based on
probabilities.
• Graphical Representation: Variables are represented as nodes in a directed
acyclic graph (DAG), and their dependencies are shown as edges.
• Conditional Probabilities: Each node’s probability depends on its parent
nodes, expressed as P(Variable | Parent).
• Probabilistic Model: Built from probability distributions, BBNs apply
probability theory for tasks like prediction and anomaly detection.
• The naive Bayesian classifier makes the assumption of class conditional
independence, i.e., given the class label of a tuple, the values of the attributes
are assumed to be conditionally independent of one another. This simplifies
computation.
• When this assumption holds true, the naïve Bayesian classifier is the most
accurate in comparison with other classifiers. In practice, however, dependencies
can exist between variables. Bayesian belief networks define joint conditional
probability distributions.
• They enable class conditional independencies to be represented among subsets
of variables. They provide a graphical structure of causal relationships, on which
learning can be performed. Trained Bayesian belief networks are used for
classification. Bayesian belief networks are also called belief networks, Bayesian
networks, and probabilistic networks.
• A belief network is defined by two components: a directed acyclic
graph and a set of conditional probability tables. Every node in the directed
acyclic graph represents a random variable. The variables can be discrete- or
continuous-valued.
• In general, a Bayesian network represents the joint probability distribution by
specifying a set of conditional independence assumptions (represented by a
directed acyclic graph), together with sets of local conditional probabilities.
• The joint probability for any desired assignment of values (y1, . . . , yn) to the tuple
of network variables (Y1 . . . Yn) can be computed by the formula:

   P(y1, . . . , yn) = Π over i = 1 to n of P(yi | Parents(Yi))

• where Parents(Yi) denotes the set of immediate predecessors of Yi in the
network. Note that the values of P(yi | Parents(Yi)) are precisely the values stored in the
conditional probability table associated with node Yi.
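As a minimal sketch of this factorization, consider a hypothetical two-node network Rain → WetGrass with made-up conditional probability tables (the network and numbers are illustrative only, not from the slides):

```python
# Hypothetical CPTs for a tiny network: Rain -> WetGrass.
p_rain = {True: 0.2, False: 0.8}                       # P(Rain)
p_wet_given_rain = {True: {True: 0.9, False: 0.1},     # P(WetGrass | Rain)
                    False: {True: 0.2, False: 0.8}}

def joint(rain, wet):
    """P(rain, wet) = P(rain) * P(wet | Parents(WetGrass)), per the factorization formula."""
    return p_rain[rain] * p_wet_given_rain[rain][wet]

print(joint(True, True))   # 0.2 * 0.9 = 0.18
# The joint distribution sums to 1 over all assignments of the network variables.
print(sum(joint(r, w) for r in (True, False) for w in (True, False)))   # 1.0
```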
Example
Expectation-Maximization Algorithm (EM)
• In real world applications of machine learning it is common that there are many
relevant features available for learning but only a small subset of them are
observable.
• The EM algorithm can be used for the latent variables (variables that are not
directly observable and are actually inferred from the values of the other
observed variables.)
• It has a wide range of applications, but it is likely best recognized in machine
learning for its usage in unsupervised learning tasks such as density estimation
and expectation maximization clustering.
EM Algorithm
• Initially, a set of initial values of parameters are considered. A set of incomplete
observed data is given to the system with the assumptions that the observed data
comes from a specific model.
• The second step, known as the Expectation or E-step, is used to estimate or guess the
values of the missing or incomplete data using the observed data. The E-step also
updates the estimates of the latent variables.
• The third step is known as the Maximization or M-step, in which we use the
complete data from the second step to update the parameter values (i.e., update the
hypothesis).
• The fourth or final step is to determine whether or not the values of latent
variables are converging. If it returns “yes,” end the procedure; otherwise, restart
from step 2 until convergence occurs.
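A minimal sketch of these steps for a simple case, estimating the two means of a 1-D Gaussian mixture with equal weights and unit variances; the data and initial guesses are made up, and a fixed iteration count stands in for the convergence check:

```python
import math
import random

random.seed(0)
# Observed data from two hypothetical clusters; which cluster generated each point is latent.
data = [random.gauss(0.0, 1.0) for _ in range(50)] + [random.gauss(5.0, 1.0) for _ in range(50)]

mu = [-1.0, 6.0]          # Step 1: initial parameter values (guesses for the two means)
for _ in range(30):
    # E-step: estimate responsibilities P(cluster k | x) for each point,
    # assuming equal mixing weights and unit variance.
    resp = []
    for x in data:
        w = [math.exp(-0.5 * (x - m) ** 2) for m in mu]
        s = sum(w)
        resp.append([wk / s for wk in w])
    # M-step: update the parameters (cluster means) using the responsibilities.
    for k in range(2):
        total = sum(r[k] for r in resp)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / total
    # Step 4 (convergence check) is omitted here; a fixed number of iterations is used instead.

print("Estimated means:", [round(m, 2) for m in mu])   # roughly [0, 5]
```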
EM Algorithm
Uses:
• It can be used to fill the missing data in a sample.
• It can be used as the basis of unsupervised learning of clusters.
• Used for the purpose of estimating parameters of Hidden Markov
Model (HMM).
• Discovers the values of latent variables.
Advantages:
Disadvantages:
Support Vector Machine
• -Introduction
- Types of Support Vector Kernel
- Hyperplane
- Properties of SVM & Issues.
Support Vector Machine

• Support Vector Machines are Supervised Learning algorithms that were introduced in 1992.
- They became popular because of their success in handwritten digit recognition.
- Experimentally, it was shown that SVMs have a low error rate:
- a 1.1% test error rate for SVM (the same as neural networks).
- They can be employed for both classification and regression purposes.
- An SVM tries to map the input space into an output space using a non-linear mapping
function Φ such that the problem, or the data points, become linearly separable in the
output space.
- When the points become linearly separable, the SVM discovers the optimal separating
hyperplane.

• The goal of SVM is to find the optimal hyperplane which maximizes the margin of the
training data.
SVM: Types
1. Linear SVM:
• We want to find the best hyperplane (i.e. decision boundary) linearly separating our classes. Our
boundary will have equation: wTx + b = 0.
• Anything above the decision boundary should have label 1. i.e.,
• wTxi + b > 0 will have corresponding yi = 1.
• Similarly, anything below the decision boundary should have label -1. i.e.,
wTxi + b < 0 will have corresponding yi = -1.
2. Non-Linear SVM: The dataset cannot be classified into two classes by using
a straight line.
• Non-linear classification is carried out using the kernel concept.
• A non-linear SVM applies a kernel function to map the data into a space that has
higher dimensions (see the sketch below).
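A brief sketch contrasting the two types using scikit-learn (assumed to be available) on made-up 2-D datasets; the linearly separable blobs and the ring-shaped data are illustrative only:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Made-up 2-D data: two separated blobs (linearly separable) for the linear SVM.
X_lin = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
y_lin = np.array([0] * 20 + [1] * 20)
linear_svm = SVC(kernel="linear").fit(X_lin, y_lin)        # learns w, b of w.T x + b = 0

# Made-up non-linear data: an inner cluster surrounded by an outer ring.
angles = rng.uniform(0, 2 * np.pi, 40)
X_nonlin = np.vstack([rng.normal(0, 0.3, (40, 2)),
                      np.c_[3 * np.cos(angles), 3 * np.sin(angles)]])
y_nonlin = np.array([0] * 40 + [1] * 40)
rbf_svm = SVC(kernel="rbf", gamma=1.0).fit(X_nonlin, y_nonlin)  # kernel trick handles non-linearity

print("linear SVM training accuracy:", linear_svm.score(X_lin, y_lin))
print("RBF SVM training accuracy:", rbf_svm.score(X_nonlin, y_nonlin))
```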
Kernel Trick in SVM
• In Machine Learning, the data can be text, image or video.
• The kernel trick maps the low-dimensional input space into a
higher-dimensional space.
• We need to extract features from these data for the classification purpose.
• In the real world, many classification models are complex and mostly require non-
linear decision boundaries.
• E.g., the mapping function Φ: R² → R³ is used to transform 2-D data into
3-D data. It is given as follows:
Φ(x, y) = (x², √2xy, y²)
Kernel Trick for 2nd Degree polynomial Mapping.
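A small sketch verifying, on two made-up vectors, that this explicit mapping gives the same inner product as the second-degree polynomial kernel K(a, b) = (aᵀb)² computed directly in the original 2-D space:

```python
import numpy as np

def phi(v):
    """Explicit feature map from R^2 to R^3 for the 2nd-degree polynomial kernel."""
    x, y = v
    return np.array([x**2, np.sqrt(2) * x * y, y**2])

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

# Inner product computed in the higher-dimensional feature space ...
explicit = phi(a) @ phi(b)
# ... equals the kernel evaluated in the original space (no explicit mapping needed).
kernel = (a @ b) ** 2
print(explicit, kernel)   # both 16.0
```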
Types of Kernel
• Linear Kernel
• Polynomial Kernel
• Homogeneous Kernel
• Inhomogeneous Kernel
• Gaussian Kernel or Radial-Basis Function (RBF) Kernel
• Sigmoid Kernel
• Etc.
1. Linear Kernel
• Linear kernels are of the type:

   K(x, y) = xᵀy

where x and y are two vectors.

• Therefore, the linear kernel is simply the dot product of the two input vectors.
2. Polynomial Kernel
• Polynomial kernels are of the type:

   K(x, y) = (xᵀy)^q

• This is called a homogeneous kernel.
• Here q is the degree of the polynomial.
• If q = 2, it is called a quadratic kernel.
• The inhomogeneous kernel is given as:

   K(x, y) = (xᵀy + c)^q

• Here, c is a constant and q is the degree of the polynomial.
• If c = 0 and q = 1, the polynomial kernel is reduced to a linear kernel.
• The value of q should be optimal, as a higher degree may lead to overfitting.
Example:
Gaussian Kernels
• RBF or Gaussian kernels are extremely useful in SVM.
• They are given as follows:

   K(x, y) = exp(-γ ||x - y||²)

• Here, the parameter γ (gamma) plays a very important role. If γ is small, then
the RBF kernel behaves similarly to a linear SVM.
• If γ is large, then the kernel is influenced by more support vectors.
• The RBF kernel corresponds to a dot product in R∞, and because of this it is
highly effective in separating the classes.
Example:
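As an illustrative sketch of the Gaussian kernel and the role of γ, the vectors and γ values below are made up:

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    """Gaussian / RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y = np.array([2.0, 3.0])   # squared distance ||x - y||^2 = 2

for gamma in (0.01, 1.0, 10.0):
    # Small gamma: kernel stays close to 1 even for distant points (smooth, near-linear behaviour).
    # Large gamma: kernel decays quickly, so each support vector influences only nearby points.
    print(f"gamma = {gamma}: K(x, y) = {rbf_kernel(x, y, gamma):.4f}")
```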
Sigmoid Kernel
• The sigmoid kernel is of the type K(x, y) = tanh(γ xᵀy + c), where γ and c are kernel parameters.
Hyperplane
• A hyperplane in Support Vector Machine (SVM) is a decision boundary that
separates different classes in the data
Its Purpose is:
• To classify data points into different categories.
• To maximize the margin between classes for optimal separation.
• SVM can work with any number of dimensions:
• In 1-D space, a hyperplane is a point; in 2-D space, it is a line; in 3-D space, it is a
plane; in higher dimensions, it is called a hyperplane.
• The optimal hyperplane is the one that maximizes the margin between two
classes.
• SVM uses support vectors (data points closest to the hyperplane) to define this
boundary.
Cont…

w⊤x + b = 0

• w = weight vector perpendicular to the hyperplane


• x = feature vector
• b = bias
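A tiny sketch of classifying points with such a hyperplane; the weight vector, bias, and points are made up for illustration:

```python
import numpy as np

w = np.array([1.0, -1.0])   # weight vector perpendicular to the hyperplane
b = -0.5                    # bias

def classify(x):
    """Label +1 if w.x + b > 0 (above the boundary), otherwise -1."""
    return 1 if w @ x + b > 0 else -1

for point in (np.array([2.0, 0.5]), np.array([0.0, 1.0])):
    print(point, "->", classify(point))   # +1 and -1 respectively
```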
Properties of SVM
• Margin Maximization: SVM aims to find the hyperplane that best separates the
classes in the feature space by maximizing the margin, which is the distance
between the hyperplane and the nearest data points (support vectors).
• Support Vectors: Only the data points closest to the decision boundary (support
vectors) influence the position of the hyperplane. This makes SVM robust to
outliers.
• Kernel Trick: SVM can handle non-linear data by applying kernel functions (e.g.,
linear, polynomial, radial basis function (RBF)) to transform the input features
into a higher-dimensional space where the data becomes linearly separable.
• Dual Formulation: SVM uses a dual optimization problem, which allows it to
operate efficiently, especially when the number of features exceeds the number of
data points.
Cont…

• Regularization Parameter (C): The parameter C controls the trade-off between
achieving a low error on the training data and maintaining a large margin. A
smaller C encourages a wider margin, while a larger C prioritizes fitting the data.

• Scalability: While SVMs work well with small- to medium-sized datasets, their
training time can grow significantly with large datasets, as the complexity depends
on the number of support vectors.
Issues in SVM

• Scalability: SVMs struggle with large datasets because their training time can be
computationally expensive, especially if the number of support vectors grows.
• Choice of Kernel: Deciding on the right kernel function and its parameters (like
the degree for polynomial kernels or gamma for RBF kernels) can be tricky and
often requires experimentation.
• Sensitivity to Parameters: SVM performance heavily depends on
hyperparameters (e.g., the regularization parameter C and kernel parameters). Poor
tuning can lead to suboptimal results.
Cont…
• Non-Probabilistic Outputs: SVMs don't directly provide probabilistic outputs.
This means that the model provides a definitive classification or decision (e.g.,
"Class A" or "Class B") rather than a probability score indicating the likelihood of
belonging to a particular class.
• Difficulty with Noisy Data: When the data is noisy or classes overlap, SVM can
struggle to find a clear decision boundary, affecting its performance.
