Unit-2 MLT
UNIT-2
1. Regression
Y = β0 + β1x
y = dependent variable
x = independent variable
β0 = constant term or intercept
β1 = x's slope or coefficient
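As a quick illustration of the model above, here is a minimal sketch (the x and y values are made up purely for illustration) that estimates β0 and β1 by ordinary least squares using NumPy:

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares estimates of the intercept (beta0) and slope (beta1)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

print(f"Y = {beta0:.3f} + {beta1:.3f} x")
```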
2. Logistic Regression
For a binary outcome the linear model is replaced by the logistic model, which gives the probability of the positive class:
p(x) = e^(β0 + β1x) / (1 + e^(β0 + β1x))
The coefficients β0 and β1 are estimated by maximum likelihood, i.e., by choosing the values that maximize the likelihood function
l(β0, β1) = Π p(xi)^yi (1 - p(xi))^(1 - yi)
taken over all training examples (xi, yi), where yi ∈ {0, 1}.
[Training data: savings in lakhs with the corresponding defaulter label (0/1); see Table 1.2 for the full values.]
The figure below plots savings (lakhs) versus the probability of loan repayment, with the logistic regression curve.
Regression analysis using the maximum likelihood estimate gives the following output: the coefficients are β0 = -4.07778 and β1 = 1.5046.
[Figure: logistic regression curve of probability of loan repayment (y-axis, 0.00 to 1.00) versus savings in lakhs (x-axis, 1 to 5).]
Table 1.2 Predicted and actual values of defaulter/non-defaulter

Savings (lakhs)   Defaulter   Fitted Value   Predicted Value
0.5               0           0.034710025    0
0.75              0           0.049771971    0
1                 0           0.070889852    0
1.25              0           0.100024715    0
1.5               0           0.139337907    0
1.75              0           0.190826302    0
1.75              1           0.190826302    0
2                 0           0.255688447    0
2.25              1           0.333510508    0
2.5               0           0.421602115    0
2.75              1           0.514983013    1
3                 0           0.607329347    1
3.25              1           0.692588758    1
3.5               0           0.766454783    1
4                 1           0.874429026    1
4.25              1           0.910262967    1
4.5               1           0.936612324    1
4.75              1           0.955602124    1
5                 1           0.969090667    1
5.5               1           0.985190       1
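For reference, a small sketch that reproduces the fitted-value column of Table 1.2 from the quoted coefficients, assuming the standard logistic (sigmoid) form and a 0.5 classification threshold (the threshold is implied by the Predicted column rather than stated explicitly):

```python
import numpy as np

# Coefficients reported above
beta0, beta1 = -4.07778, 1.5046

savings = np.array([0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 2.5,
                    2.75, 3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5])

# Logistic response: fitted probability of default for each savings value
fitted = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * savings)))

# Predicted class using a 0.5 threshold, as in Table 1.2
predicted = (fitted >= 0.5).astype(int)

for s, p, c in zip(savings, fitted, predicted):
    print(f"{s:>4}: fitted = {p:.4f}, predicted = {c}")
```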
The confusion matrix consists of four quantities:
• TN (True Negative): the actual value is 0 and the model also predicts 0.
• FP (False Positive): the actual value is 0 but the model predicts 1.
• FN (False Negative): the actual value is 1 but the model predicts 0.
• TP (True Positive): the actual value is 1 and the model also predicts 1.
• The error matrix, also called the confusion matrix, shown in Table 8.3, helps classify the values that were correctly predicted using the model built.
• These predicted classifications are used to calculate accuracy, precision (or positive predictive value), recall (or sensitivity), specificity, and negative predictive value. These are a few metrics used to identify the accuracy of the logistic model based on the confusion matrix shown in Table 8.4.
           Predicted 0   Predicted 1
Actual 0   8 (TN)        2 (FP)
Actual 1   2 (FN)        8 (TP)
Accuracy = (TP + TN) / N = 16/20 = 0.8
Precision = TP / (TP + FP) = 8/10 = 0.8
Negative predictive rate = TN / (TN + FN) = 8/10 = 0.8
Sensitivity = TP / (TP + FN) = 8/10 = 0.8
Specificity = TN / (TN + FP) = 8/10 = 0.8
In this sample dataset the accuracy obtained is 80%. This is how accuracy and precision are computed on a real-world dataset.
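The same metrics can be computed directly from the confusion-matrix counts; a minimal sketch using the values in Table 8.4:

```python
# Confusion-matrix counts from Table 8.4
TP, TN, FP, FN = 8, 8, 2, 2
N = TP + TN + FP + FN  # 20

accuracy    = (TP + TN) / N    # 16/20 = 0.8
precision   = TP / (TP + FP)   # 8/10  = 0.8
npv         = TN / (TN + FN)   # negative predictive rate, 8/10 = 0.8
sensitivity = TP / (TP + FN)   # recall, 8/10 = 0.8
specificity = TN / (TN + FP)   # 8/10  = 0.8

print(accuracy, precision, npv, sensitivity, specificity)
```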
UNIT-2
Bayes Theorem & Concept Learning
Bayes Optimal Classifier
Naïve Bayes Classifier
Bayesian Belief Network
EM Algorithm
P(D|h) = 1 if di = h(xi) for all di in D, and 0 otherwise    ….. Eq.2
It means the probability of data D given hypothesis h is 1 if D is consistent with
h, and 0 otherwise.
• Given these choices for P(h) and for P(D|h) we now have a fully defined problem
for the above BRUTE-FORCE MAP LEARNING algorithm.
• Let us consider the first step of this algorithm, which uses Bayes theorem to
compute the posterior probability P(h|D) of each hypothesis h given the
observed training data D.
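A toy sketch of this first step (the hypothesis space, prior, and data below are invented for illustration): every hypothesis consistent with D gets likelihood 1, inconsistent ones get 0, and normalizing yields the posterior P(h|D):

```python
# Hypothetical hypothesis space: each hypothesis maps an instance x to 0/1
hypotheses = {
    "h1": lambda x: int(x > 2),
    "h2": lambda x: int(x > 3),
    "h3": lambda x: int(x > 4),
}
D = [(3.5, 1), (5.0, 1), (1.0, 0)]   # (instance, target) pairs, made up

prior = {h: 1.0 / len(hypotheses) for h in hypotheses}   # uniform P(h)

# P(D|h) = 1 if h is consistent with every example in D, else 0 (Eq. 2)
def likelihood(h):
    return 1.0 if all(hypotheses[h](x) == d for x, d in D) else 0.0

unnormalised = {h: likelihood(h) * prior[h] for h in hypotheses}
evidence = sum(unnormalised.values())                           # P(D)
posterior = {h: v / evidence for h, v in unnormalised.items()}  # P(h|D)

print(posterior)   # consistent hypotheses share the posterior mass equally
```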
Bayes Optimal Classifier
• Any system that classifies new instances according to the equation
vOB = argmax vj∈V Σ hi∈H P(vj | hi) P(hi | D)
is called a Bayes optimal classifier, or Bayes optimal learner.
• No other classification method using the same hypothesis space and same prior
knowledge can outperform this method on average.
• This method maximizes the probability that the new instance is classified
correctly, given the available data, hypothesis space, and prior probabilities over
the hypotheses.
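A small sketch of the Bayes optimal rule above, using assumed posteriors P(h|D) and per-hypothesis predictions (chosen so that the optimal class differs from the single most probable hypothesis):

```python
# Assumed posteriors and predictions, for illustration only
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}     # P(h|D)
prediction = {"h1": "+", "h2": "-", "h3": "-"}    # how each h classifies the new instance

# P(v|D) = sum over hypotheses of P(v|h) * P(h|D)
votes = {}
for h, p in posterior.items():
    votes[prediction[h]] = votes.get(prediction[h], 0.0) + p

v_opt = max(votes, key=votes.get)
print(votes, "->", v_opt)   # {'+': 0.4, '-': 0.6} -> '-'
```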
Naive Bayes Classifier
• One highly practical Bayesian learning method is the naive Bayes learner, often
called the naive Bayes classifier. In some domains its performance has been
shown to be comparable to that of neural network and decision tree learning.
• The naive Bayes classifier applies to learning tasks where each instance x is
described by a conjunction of attribute values and where the target function f (x)
can take on any value from some finite set V.
• A set of training examples of the target function is provided, and a new instance
is presented, described by the tuple of attribute values (a1, a2.. .an). The learner is
asked to predict the target value, or classification, for this new instance.
• The Bayesian approach to classifying the new instance is to assign the most probable
target value, VMAP, given the attribute values (a1, a2.. .an) that describe the instance.
• vMAP = argmax vj∈V P(vj | a1, a2, ..., an) = argmax vj∈V P(a1, a2, ..., an | vj) P(vj)    ….Eq 1
• The naive Bayes classifier is based on the simplifying assumption that the attribute
values are conditionally independent given the target value. In other words, the
assumption is that given the target value of the instance, the probability of observing the
conjunction a1, a2, ..., an is just the product of the probabilities for the individual attributes:
P(a1, a2, ..., an | vj) = Πi P(ai | vj)
• Substituting this into Equation (1), we have the approach used by the naive Bayes
classifier:
vNB = argmax vj∈V P(vj) Πi P(ai | vj)
• where vNB denotes the target value output by the naive Bayes classifier.
Temperature Y N Outlook Y N
hot 2/9 2/5 Sunny 2/9 3/5
mild 4/9 2/5 Overcast 4/9 0
cool 3/9 1/5 Rain 3/9 2/5
Windy Y N Humidity Y N
Strong 3/9 3/5 High 3/9 4/5
Weak 6/9 2/5 Normal 6/9 1/5
• Using these probability estimates (together with P(PlayTennis = yes) = 9/14 and
P(PlayTennis = no) = 5/14) and similar estimates for the remaining attribute values,
we calculate vNB for the new instance (Outlook = sunny, Temperature = cool,
Humidity = high, Wind = strong) according to Equation (1) as follows:
P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = (9/14 * 2/9 * 3/9 * 3/9 * 3/9) = 0.0053
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = (5/14 * 3/5 * 1/5 * 4/5 * 3/5) = 0.0206
Thus, the naive Bayes classifier assigns the target value PlayTennis = no to this
new instance, based on the probability estimates learned from the training data.
• By normalizing the above quantities to sum to one we can calculate the
conditional probability that the target value is no, given the observed attribute
values.
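The same calculation expressed as a short sketch (the probability estimates are the ones tabulated above):

```python
# Naive Bayes computation for the new instance
# (Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong)
p_yes = 9/14 * 2/9 * 3/9 * 3/9 * 3/9      # P(yes) * product of P(ai|yes)
p_no  = 5/14 * 3/5 * 1/5 * 4/5 * 3/5      # P(no)  * product of P(ai|no)

v_nb = "yes" if p_yes > p_no else "no"
print(p_yes, p_no, v_nb)                   # ~0.0053, ~0.0206 -> "no"

# Normalising the two quantities gives the conditional probability of "no"
print(p_no / (p_yes + p_no))               # ~0.795
```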
• ESTIMATING PROBABILITIES
• We have estimated probabilities by the fraction of times the event is observed to occur over
the total number of opportunities.
e.g. : we estimated P(Wind = strong| Play Tennis = no) by the fraction
nc/n
where n = 5 is the total number of training examples for which PlayTennis = no, and nc = 3 is
the number of these for which Wind = strong.
• While this observed fraction provides a good estimate of the probability in many cases, it
provides poor estimates when nc is very small.
• To avoid this difficulty we can adopt a Bayesian approach to estimating the probability, using
the m-estimate defined as follows:
(nc + m·p) / (n + m)
where p is a prior estimate of the probability we wish to determine, and
m is a constant called the equivalent sample size.
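A worked sketch of the m-estimate for the Wind example above; the prior p = 0.5 (two equally likely values for Wind) and m = 2 are choices made only for illustration:

```python
# n = 5 negative examples (PlayTennis = no), nc = 3 of them have Wind = strong.
# p is the assumed prior estimate and m the assumed equivalent sample size.
n, nc, p, m = 5, 3, 0.5, 2

m_estimate = (nc + m * p) / (n + m)
print(m_estimate)   # (3 + 1) / 7 ~= 0.571, versus the raw estimate 3/5 = 0.6
```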
Bayesian Belief Network
• It is a graphical model that represents the probabilistic relationships among
variables.
• It is used to handle uncertainty and make predictions or decisions based on
probabilities.
• Graphical Representation: Variables are represented as nodes in a directed
acyclic graph (DAG), and their dependencies are shown as edges.
• Conditional Probabilities: Each node's probability depends on its parent
nodes, expressed as P(Variable | Parent).
• Probabilistic Model: Built from probability distributions, BBNs apply
probability theory for tasks like prediction and anomaly detection.
• The naive Bayesian classifier makes the assumption of class conditional
independence, i.e., given the class label of a tuple, the values of the attributes
are assumed to be conditionally independent of one another. This simplifies
computation.
• When the assumption holds true, the naive Bayesian classifier is the most efficient in
comparison with other classifiers. Bayesian belief networks specify joint conditional
probability distributions.
• They allow class conditional independencies to be represented between subsets
of variables. They provide a graphical structure of causal relationships, on which
learning can be performed. Trained Bayesian belief networks can be used for
classification. Bayesian belief networks are also called belief networks, Bayesian
networks, and probabilistic networks.
• A belief network is defined by two components: a directed acyclic
graph and a set of conditional probability tables. Each node in the directed
acyclic graph represents a random variable. The variables can be discrete- or
continuous-valued.
• In general, a Bayesian network represents the joint probability distribution by
specifying a set of conditional independence assumptions (represented by a
directed acyclic graph), together with sets of local conditional probabilities.
• The joint probability for any desired assignment of values (y1, . . . , yn) to the tuple
of network variables (Y1, . . . , Yn) can be computed by the formula:
P(y1, . . . , yn) = Πi P(yi | Parents(Yi))
where Parents(Yi) denotes the set of immediate predecessors of Yi in the network.
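A minimal sketch of this product rule on a hypothetical two-node network (Rain as parent of WetGrass); the conditional probability table values are invented for illustration:

```python
# P(Rain, WetGrass) = P(Rain) * P(WetGrass | Rain)
P_rain = {True: 0.2, False: 0.8}
P_wet_given_rain = {True: {True: 0.9, False: 0.1},
                    False: {True: 0.2, False: 0.8}}   # P(WetGrass | Rain)

def joint(rain, wet):
    # Product of each variable's probability given its parents
    return P_rain[rain] * P_wet_given_rain[rain][wet]

print(joint(True, True))   # 0.2 * 0.9 = 0.18
```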
• Support Vector Machines are supervised learning algorithms that were introduced in 1992.
- They became popular because of their success in handwritten digit recognition.
- Experimentally, it was shown that SVMs have a low error rate.
- A 1.1% test error rate was obtained for SVM (same as neural networks).
- They can be employed for both classification and regression purposes.
- SVM tries to map an input space into an output space using a non-linear mapping
function Φ such that the data points become linearly separable in the
output space.
- When the points become linearly separable, SVM discovers the optimal separating
hyperplane.
• The goal of SVM is to find the optimal hyperplane which maximizes the margin of the
training data.
SVM: Types
1. Linear SVM:
• We want to find the best hyperplane (i.e. decision boundary) linearly separating our classes. Our
boundary will have equation: wTx + b = 0.
• Anything above the decision boundary should have label 1. i.e.,
• wTxi + b > 0 will have corresponding yi = 1.
• Similarly, anything below the decision boundary should have label -1. i.e.,
wTxi + b < 0 will have corresponding yi = -1.
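A minimal sketch of this decision rule for an already-trained linear SVM; the weight vector w and bias b below are assumed values rather than learned ones:

```python
import numpy as np

# Assumed hyperplane parameters, for illustration only
w = np.array([2.0, -1.0])
b = -0.5

def classify(x):
    # Label +1 above the hyperplane (w.x + b > 0), -1 below it
    return 1 if np.dot(w, x) + b > 0 else -1

print(classify(np.array([1.0, 0.5])))   # w.x + b = 1.0  -> +1
print(classify(np.array([0.0, 1.0])))   # w.x + b = -1.5 -> -1
```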
2. Non-Linear SVM: The dataset cannot be classified into two classes by using
a straight line.
• Non-linear classification is carried out using the kernel concept.
• Non-linear SVM applies a kernel function to map the data into a space that has
high dimensions.
Kernel Trick in SVM
• In Machine Learning, the data can be text, image or video.
• The function of the kernel trick is to map the low-dimensional input space
into a higher-dimensional space.
• We need to extract features from these data for classification purposes.
• In the real world, many classification models are complex and mostly require non-linear
decision boundaries.
• E.g. a mapping function Φ: R² → R³ used to transform 2-D data into
3-D data is given as follows:
Φ(x, y) = (x², √2·x·y, y²)
Kernel Trick for 2nd Degree polynomial Mapping.
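A short sketch verifying the kernel trick for this mapping: the dot product of the mapped vectors Φ(a)·Φ(b) equals (a·b)², so the degree-2 polynomial kernel can be evaluated without ever computing Φ explicitly:

```python
import numpy as np

def phi(v):
    # Explicit mapping Φ(x, y) = (x², √2·x·y, y²)
    x, y = v
    return np.array([x**2, np.sqrt(2) * x * y, y**2])

a = np.array([1.0, 2.0])
b = np.array([3.0, 1.0])

print(np.dot(phi(a), phi(b)))   # 25.0 (dot product in the mapped 3-D space)
print(np.dot(a, b) ** 2)        # 25.0 (kernel trick, computed in 2-D)
```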
Types of Kernel
• Linear Kernel
• Polynomial Kernel
• Homogeneous Kernel
• Inhomogeneous Kernel
• Gaussian Kernel or Radial-Basis Function (RBF) Kernel
• Sigmoid Kernel
• Etc.
1. Linear Kernel
• Linear kernels are of the type:
w⊤x + b = 0
Limitations of SVM
• Scalability: SVMs struggle with large datasets because their training time can be
computationally expensive, especially if the number of support vectors grows.
• Choice of Kernel: Deciding on the right kernel function and its parameters (like
the degree for polynomial kernels or gamma for RBF kernels) can be tricky and
often requires experimentation.
• Sensitivity to Parameters: SVM performance heavily depends on
hyperparameters (e.g., the regularization parameter C and kernel parameters). Poor
tuning can lead to suboptimal results (see the sketch at the end of this section).
Cont…
• Non-Probabilistic Outputs: SVMs don't directly provide probabilistic outputs.
This means that the model provides a definitive classification or decision (e.g.,
"Class A" or "Class B") rather than a probability score indicating the likelihood of
belonging to a particular class.
• Difficulty with Noisy Data: When the data is noisy or the classes overlap, SVM can
struggle to find a clear decision boundary, affecting its performance.
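Because kernel choice and hyperparameters such as C and gamma usually require experimentation, a common remedy is a cross-validated grid search; a minimal sketch using scikit-learn (the parameter grid and dataset are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Search over the regularization parameter C and the RBF kernel width gamma
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```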