UNIT II
Supervised Learning
Syllabus
Linear Regression Models : Least squares, single & multiple variables, Bayesian linear regression,
Gradient descent, Linear Classification Models : Discriminant function - Perceptron algorithm,
Probabilistic discriminative model - Logistic regression, Probabilistic generative model - Naive
Bayes, Maximum margin classifier - Support vector machine, Decision Tree, Random Forests
Contents
2.1. Regression
2.2. Linear Classification Models
2.3 Probabilistic Generative Model
2.4 Maximum Margin Classifier : Support Vector Machine
2.5 Decision Tree
2.6 Random Forests
2.7 Two Marks Questions with Answers
2.1 Regression
• Regression finds correlations between dependent and independent variables.
• If the desired output consists of one or more continuous variables, then the task is called regression.
• Therefore, regression algorithms help predict continuous variables such as house prices, market trends, weather patterns, oil and gas prices etc.
• Fig. 2.1.1 shows regression (the dependent variable plotted against the independent variable).
• When the targets in a dataset are real numbers, the machine learning task is known as regression and each sample in the dataset has a real-valued output or target.
* Regression analysis is a set of statistical methods used for the estimation of
relationships between a dependent variable and one or more independent
variables. It can be utilized to assess the strength of the relationship between
variables and for modelling the future relationship between them.
+ The two basic types of regression are linear regression and multiple linear
regression.
Linear Regression Models
. Linear regression is a statistical method that allows us to summarize and study
relationships between two continuous (quantitative) variables.
• The objective of a linear regression model is to find a relationship between one or more input variables and a target variable.
1. One variable, denoted x, is regarded as the predictor, explanatory or independent variable.
2. The other variable, denoted y, is regarded as the response, outcome or dependent variable.
• Regression models predict a continuous variable, such as the sales made on a day or the temperature of a city. Let's imagine that we fit a line with the training points that we have. If we want to add another data point, but to fit it, we need to change the existing model.
• This will happen with each data point that we add to the model; hence, linear regression isn't good for classification models.
• The regression line gives the average relationship between the two variables in mathematical form.
• For two variables X and Y, there are always two lines of regression.
• Regression line of X on Y : Gives the best estimate for the value of X for any specific given value of Y :
X = a + bY
where a = X-intercept
b = slope of the line
X = dependent variable
Y = independent variable
• Regression line of Y on X : Gives the best estimate for the value of Y for any specific given value of X :
Y = a + bX
where a = Y-intercept
b = slope of the line
Y = dependent variable
X = independent variable
• By using the least squares method (a procedure that minimizes the vertical deviations of plotted points surrounding a straight line) we are able to construct a best fitting straight line to the scatter diagram points and then formulate a regression equation in the form of :
ŷ = a + bx
ŷ = ȳ + b(x − x̄)
(Fig. 2.1.2 : the linear model f(x, w) as a weighted sum of the input vector components plus a bias term.)
• Regression analysis is the art and science of fitting straight lines to patterns of data. In a linear
regression model, the variable of interest ("dependent" variable) is predicted from k other variables ("independent" variables) using a linear equation. If Y denotes the dependent variable and X1, ..., Xk are the independent variables, then the assumption is that the value of Y at time t in the data sample is determined by the linear equation :
Yt = β0 + β1 X1t + β2 X2t + ... + βk Xkt + εt
where the betas are constants and the epsilons are independent and identically distributed normal random variables with mean zero.
• At each split point, the "error" between the predicted value and the actual values is squared to get a "Sum of Squared Errors (SSE)". The split point errors across the variables are compared and the variable/point yielding the lowest SSE is chosen as the root node/split point. This process is recursively continued.
« Error function measures how much our predictions deviate from the desired
answers.
Mean-squared error : Jn = (1/n) Σ_{i=1..n} (yi − f(xi))²
Advantages :
a. Training a linear regression model is usually much faster than methods such as
neural networks.
b. Linear regression models are simple and require minimum memory to implement.
c. By examining the magnitude and sign of the regression coefficients you can infer
how predictor variables affect the target outcome.
Least Squares
+ The method of least squares is about estimating parameters by minimizing the
squared discrepancies between observed data, on the one hand, and their expected
values on the other.
• Considering an arbitrary straight line, ŷ = b0 + b1x, to be fitted through these data points, the question is : "Which line is the most representative ?" (Fig. 2.1.3 shows such a line together with the residual (error) yi − ŷi for each data point.)
• What are the values of b0 and b1 such that the resulting line "best" fits the data points ?
• But what goodness-of-fit criterion should we use to determine the best among all possible combinations of b0 and b1 ?
• The Least Squares (LS) criterion : choose b0 and b1 such that the sum of the squares of the errors is minimum. The least-squares solution yields y(x) whose elements sum to 1, but does not ensure the outputs to be in the range [0, 1].
• How do we draw such a line based on the observed data points ? Suppose an imaginary line of y = a + bx.
• Imagine a vertical distance between the line and a data point, E = Y − E(Y), where E(Y) = a + bX.
• This error is the deviation of the data point from the imaginary line, the regression line (Fig. 2.1.4). Then what are the best values of a and b ? The a and b that minimize the sum of such errors.
• Deviation does not have good properties for computation. Then why do we use squares of deviation ? Let us get a and b that can minimize the sum of squared deviations rather than the sum of deviations. This method is called least squares.
• The least squares method minimizes the sum of squares of errors. Such a and b are called least squares estimators, i.e., estimators of the parameters α and β.
• The process of getting parameter estimators (e.g., a and b) is called estimation. The least squares method is the estimation method of Ordinary Least Squares (OLS).
Disadvantages of least squares :
1. Lacks robustness to outliers.
2. Certain datasets are unsuitable for least squares classification.
3. The decision boundary corresponds to the maximum likelihood (ML) solution.
Example : Fit a straight line to the points in the table. Compute m and b by least squares.

Points | x | y
A | 3.00 | 4.50
B | 4.25 | 4.25
C | 5.50 | 5.50
D | 8.00 | 5.50

Solution : Represent in matrix form, A X = L + V :

A = [3.00 1; 4.25 1; 5.50 1; 8.00 1],  X = [m; b],  L = [4.50; 4.25; 5.50; 5.50],  V = [vA; vB; vC; vD]

X = (AᵀA)⁻¹ AᵀL = [121.3125 20.7500; 20.7500 4.0000]⁻¹ [105.8125; 19.7500] = [0.246; 3.663]

V = AX − L = [−0.10; 0.46; −0.48; 0.13]
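The fit can be reproduced numerically. The following is a minimal Python sketch (assuming NumPy is available) that solves the normal equations for the data in this example; the array values are taken directly from the table.

import numpy as np

# Design matrix [x, 1] and observation vector from the table above
A = np.array([[3.00, 1.0],
              [4.25, 1.0],
              [5.50, 1.0],
              [8.00, 1.0]])
L = np.array([4.50, 4.25, 5.50, 5.50])

# Normal equations : X = (A^T A)^(-1) A^T L
X = np.linalg.solve(A.T @ A, A.T @ L)
m, b = X
print(m, b)        # approx. 0.246 and 3.663

# Residuals V = AX - L
V = A @ X - L
print(V)           # approx. [-0.10, 0.46, -0.48, 0.13]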
Multiple Regression
+ Regression analysis is used to predict the value of one or more responses from a
set of predictors. It can also be used to estimate the linear association between the
predictors and responses. Predictors can be continuous or categorical or a mixture
of both.
• If multiple independent variables affect the response variable, then the analysis calls for a model different from that used for the single predictor variable. In a situation where more than one independent factor (variable) affects the outcome of a process, a multiple regression model is used. This is referred to as a multiple linear regression model or multivariate least squares fitting.
• Let Z1, ..., Zr be a set of r predictors believed to be related to a response variable Y. The linear regression model for the jth sample unit has the form
Yj = β0 + β1 Zj1 + β2 Zj2 + ... + βr Zjr + εj
where εj is a random error and βi, i = 0, 1, ..., r are unknown regression coefficients.
• With n independent observations, we can write one model for each sample unit, so that in matrix notation
Y = Zβ + ε
where Y is n × 1, Z is n × (r + 1), β is (r + 1) × 1 and ε is n × 1.
• In order to estimate β, we take a least squares approach that is analogous to what we did in the simple linear regression case.
• In matrix form, we can arrange the data in the following form :
Z = [1 x11 x12 ... x1K; 1 x21 x22 ... x2K; ... ; 1 xN1 xN2 ... xNK],  y = [y1; y2; ... ; yN],  β̂ = [β̂0; β̂1; ... ; β̂K]
where β̂j are the estimates of the regression coefficients.
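As an illustration (the data below are synthetic, not from the text), the least squares estimate β̂ = (ZᵀZ)⁻¹Zᵀy for a multiple regression can be computed in Python with NumPy :

import numpy as np

rng = np.random.default_rng(0)
n, r = 50, 3                                   # n samples, r predictors (hypothetical)
X = rng.normal(size=(n, r))                    # predictor values
true_beta = np.array([2.0, 0.5, -1.0, 3.0])    # assumed intercept and coefficients
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=n)

# Z is n x (r+1) : a leading column of ones supplies the intercept term
Z = np.column_stack([np.ones(n), X])

# Least squares estimate of the regression coefficients
beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(beta_hat)                                # close to true_beta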
Difference between Simple Regression and Multiple Regression
Simple regression | Multiple regression
One dependent variable Y predicted from one independent variable X | One dependent variable Y predicted from a set of independent variables (X1, X2, ..., Xk)
One regression coefficient | One regression coefficient for each independent variable
r² : Proportion of variation in dependent variable Y predictable from X | R² : Proportion of variation in dependent variable Y predictable by the set of independent variables (X's)
Bayesian Linear Regression
• Bayesian linear regression allows a useful mechanism to deal with insufficient data, or poorly distributed data. It allows the user to put a prior on the coefficients and on the noise so that in the absence of data, the priors can take over. A prior is a distribution on a parameter.
• If we could flip the coin an infinite number of times, inferring its bias would be easy by the law of large numbers. However, what if we could only flip the coin a handful of times ? Would we guess that a coin is biased if we saw three heads in three flips, an event that happens one out of eight times with unbiased coins ? The MLE would overfit these data, inferring a coin bias of p = 1.
• A Bayesian approach avoids overfitting by quantifying our prior knowledge that most coins are unbiased, so that the prior on the bias parameter is peaked around one-half. The data must overwhelm this prior belief before we change our mind about coins.
• Bayesian methods allow us to estimate model parameters, to construct forecasts and to conduct model comparisons. Bayesian learning algorithms calculate explicit probabilities for hypotheses.
• Bayesian classifiers use a simple idea : the training data are utilized to calculate an observed probability of each class based on feature values.
• When a Bayesian classifier is used for unclassified data, it uses the observed probabilities to predict the most likely class for the new features.
« Each observed training example can incrementally decrease or increase the
estimated probability that a hypothesis is correct.
+ Prior knowledge can be combined with observed data to determine the final
probability of a hypothesis. In Bayesian learning, prior knowledge is provided by
asserting a prior probability for each candidate hypotheses and a probability
distribution over observed data for each possible hypothesis.
* Bayesian methods can accommodate hypotheses that make probabilistic
predictions. New instances can be classified by combining the predictions of
multiple hypotheses, weighted by their probabilities.
• Even in cases where Bayesian methods prove computationally intractable, they can provide a standard of optimal decision making against which other practical methods can be measured.
* Uses of Bayesian classifiers are as follows :
1. Used in text-based classification f i
¢ or finding spam or junk mail filter
2. Medical diagnosis. sean
3. Network security such as detecting illegal intrusion,
• The basic procedure for implementing Bayesian linear regression is :
i) Specify priors for the model parameters.
ii) Create a model mapping the training inputs to the training outputs.
iii) Have a Markov Chain Monte Carlo (MCMC) algorithm draw samples from the posterior distributions for the parameters.
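As a hedged sketch (not the exact procedure above), Bayesian linear regression with a Gaussian prior on the weights has a closed-form posterior, so MCMC can be skipped for illustration; the prior precision alpha and noise precision beta below are assumed hyperparameters, not values from the text.

import numpy as np

def bayesian_linear_regression(x, y, alpha=1.0, beta=25.0):
    # Prior w ~ N(0, alpha^(-1) I), Gaussian noise with precision beta (assumed values)
    Phi = np.column_stack([np.ones(len(x)), x])        # design matrix with a bias column
    S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)   # posterior covariance
    m_N = beta * S_N @ Phi.T @ y                       # posterior mean of the weights
    return m_N, S_N

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=20)
y = 0.5 + 2.0 * x + rng.normal(scale=0.2, size=20)     # synthetic data
m_N, S_N = bayesian_linear_regression(x, y)
print(m_N)        # posterior mean of [intercept, slope], roughly [0.5, 2.0]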
Gradient Descent
• Goal : Solving nonlinear minimization problems through derivative information.
• First and second derivatives of the objective function or the constraints play an important role in optimization. The first order derivatives are called the gradient and the second order derivatives are called the Hessian matrix.
• Derivative based optimization is also called nonlinear. It is capable of determining search directions according to an objective function's derivative information.
• Derivative based optimization methods are used for :
1. Optimization of nonlinear neuro-fuzzy models
2. Neural network learning
3, Regression analysis in nonlinear models
Basic descent methods are as follows :
1. Steepest descent
2. Newton-Raphson method
Gradient Descent :
• Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point.
• Gradient descent is popular for very large-scale optimization problems because it is easy to implement, can handle black box functions, and each iteration is cheap.
• Given a differentiable scalar field f(x) and an initial guess x1, gradient descent iteratively moves the guess toward lower values of f by taking steps in the direction of the negative gradient −∇f(x).
• Locally, the negated gradient is the steepest descent direction, i.e., the direction that x would need to move in order to decrease f the fastest. The algorithm typically converges to a local minimum, but may rarely reach a saddle point, or not move at all if x1 lies at a local maximum.
• The gradient will give the slope of the curve at that x and its direction will point to an increase in the function. So we change x in the opposite direction to lower the function value :
x_{k+1} = x_k − λ ∇f(x_k)
The λ > 0 is a small number that forces the algorithm to make small jumps.
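A minimal sketch of this update rule in Python (the quadratic objective and the step size λ = 0.1 are illustrative assumptions) :

import numpy as np

def gradient_descent(grad_f, x0, lam=0.1, n_iter=100):
    # Repeatedly step against the gradient : x <- x - lam * grad_f(x)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - lam * grad_f(x)
    return x

# Example : f(x) = (x1 - 3)^2 + (x2 + 1)^2, so grad f(x) = [2(x1 - 3), 2(x2 + 1)]
grad_f = lambda x: np.array([2 * (x[0] - 3.0), 2 * (x[1] + 1.0)])
print(gradient_descent(grad_f, x0=[0.0, 0.0]))    # converges toward the minimum [3, -1]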
Limitations of Gradient Descent :
• Gradient descent is relatively slow close to the minimum : technically, its asymptotic rate of convergence is inferior to many other methods.
• For poorly conditioned convex problems, gradient descent increasingly 'zigzags' as the gradients point nearly orthogonally to the shortest direction to a minimum point.
Steepest Descent :
• Steepest descent is also known as the gradient method.
• This method is based on a first order Taylor series approximation of the objective function. This method is also called the saddle point method. Fig. 2.1.5 shows the steepest descent method.
• The Steepest Descent is the simplest of the gradient methods. The choice of direction is where f decreases most quickly, which is in the direction opposite to ∇f(x). The search starts at an arbitrary point x0 and then goes down the gradient until it reaches close to the solution.
• The method of steepest descent is the discrete analogue of gradient descent, but it uses a local minimization rather than computing a gradient. It is typically able to converge in few steps but it is unable to escape local minima or plateaus in the objective function.
• The gradient is everywhere perpendicular to the contour lines. After each line minimization the new gradient is always orthogonal to the previous step direction.
Consequently, the iterates tend to zig-zag down the valley in a very inefficient
manner.
• The method of Steepest Descent is simple, easy to apply, and each iteration is fast. It is also very stable; if the minimum points exist, the method is guaranteed to locate them after at least an infinite number of iterations.
2.2 Linear Classification Models
+ A classification algorithm (Classifier) that makes its classification based on a linear
predictor function combining a set of weights with the feature vector.
• A linear classifier makes its classification decision based on the value of a linear combination of the characteristics. Imagine that the linear classifier will merge into its weights all the characteristics that define a particular class.
+ Linear classifiers can represent a lot of things, but they can't represent everything.
The classic example of what they can't represent is the XOR function.
Discriminant Function
• Linear Discriminant Analysis (LDA) is the most commonly used dimensionality reduction technique in supervised learning. Basically, it is a preprocessing step for pattern classification and machine learning applications. LDA is a powerful algorithm that can be used to determine the best separation between two or more classes.
+ LDA is a supervised learning algorithm, which means that it requires a labelled
training set of data points in order to learn the linear discriminant function.
* The main purpose of LDA is to find the line or plane that best separates data
points belonging to different classes. The key idea behind LDA is that the decision
boundary should be chosen such that it maximizes the distance between the
means of the two classes while simultaneously minimizing the variance within
each class's data or within-class scatter. This criterion is known as the Fisher
criterion,
* LDA is one of the most widely used machine learning algorithms due to its
accuracy and flexibility, LDA can be used for a variety of tasks such as
classification, dimensionality reduction, and feature selection.
• Suppose we have two classes and we need to classify them efficiently. LDA transforms the data so that the classes are well separated, as shown in Fig. 2.2.1 (Before LDA / After LDA).
• The LDA algorithm works based on the following steps :
a) The first step is to calculate the means and standard deviation of each feature.
b) The within-class scatter matrix and between-class scatter matrix are calculated.
c) These matrices are then used to calculate the eigenvectors and eigenvalues.
d) LDA chooses the k eigenvectors with the largest eigenvalues to form a transformation matrix.
e) LDA uses this transformation matrix to transform the data into a new space with k dimensions.
f) Once the transformation matrix transforms the data into the new space with k dimensions, LDA can then be used for classification or dimensionality reduction.
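A minimal sketch of these steps using a library implementation (scikit-learn assumed available; the two-class data are synthetic) :

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two synthetic Gaussian classes (hypothetical data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 4)),
               rng.normal(2.0, 1.0, size=(50, 4))])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis(n_components=1)    # k = 1 discriminant direction for 2 classes
X_new = lda.fit_transform(X, y)                     # project the data onto the LDA axis
print(X_new.shape)                                  # (100, 1)
print(lda.predict(X[:5]))                           # class predictions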
Benefits of using LDA :
a) LDA is used for classification problems.
b) LDA is a powerful tool for dimensionality reduction.
c) LDA is not susceptible to the "curse of dimensionality" like many other machine learning algorithms.

Logistic Regression
• Logistic regression is a form of regression analysis in which the outcome variable is dichotomous. It is a statistical method used to model dichotomous or binary outcomes using predictor variables.
• Logistic component : Instead of modelling the outcome, Y, directly, the method models the log odds (Y) using the logistic function.
• Regression component : Methods used to quantify association between an outcome and predictor variables. It could be used to build predictive models as a function of predictors.
• Simple logistic regression is logistic regression with 1 predictor variable.
• Logistic regression :
ln[P / (1 − P)] = β0 + β1 X1 + β2 X2 + ... + βk Xk + ε
With logistic regression, the response variable is an indicator of some
characteristic, that is, a 0/1 variable. Logistic regression is used to determine
whether other measurements are related to the presence of some characteristic, for
example, whether certain blood measures are predictive of having a disease.
If analysis of covariance can be said to be a t test adjusted for other variables, then
logistic regression can be thought of
as a chi-square test for homogeneity
of proportions adjusted for other
variables. While the response
variable in a logistic regression is a
0/1 variable, the logistic regression
equation, which is a linear equation,
does not predict the 0/1 variable
itself.
• Fig. 2.2.2 shows the sigmoid curve for logistic regression (compared with the linear model).
• The linear and logistic probability models are :
Linear regression : p = a0 + a1 X1 + a2 X2 + ... + ak Xk
Logistic regression : ln[p / (1 − p)] = b0 + b1 X1 + b2 X2 + ... + bk Xk
• The linear model assumes that the probability p is a linear function of the regressors, while the logistic model assumes that the natural log of the odds p/(1 − p) is a linear function of the regressors.
* The major advantage of the linear model is its interpretability. In the linear model,
if a1 is 0.05, that means that a one-unit increase in X1 is associated with a 5 %
point increase in the probability that Y is 1.
• The logistic model is less interpretable. In the logistic model, if b1 is 0.05, that means that a one-unit increase in X1 is associated with a 0.05 increase in the log odds that Y is 1. And what does that mean ? I've never met anyone with any intuition for log odds.
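A brief sketch in Python (scikit-learn assumed; the data are synthetic) showing that a fitted logistic regression's log odds are a linear function of the predictor :

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * X[:, 0])))     # assumed true P(Y = 1 | X)
y = (rng.uniform(size=200) < p_true).astype(int)

model = LogisticRegression().fit(X, y)
print(model.intercept_, model.coef_)                      # roughly b0 = 0.5, b1 = 2.0

probs = model.predict_proba(X[:3])[:, 1]                  # P(Y = 1 | X)
print(np.log(probs / (1 - probs)))                        # log odds = b0 + b1 * X for these rows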
2.3 Probabilistic Generative Model
• Generative models are a class of statistical models that generate new data instances. These models are used in unsupervised machine learning to perform tasks such as probability and likelihood estimation, modelling data points, and distinguishing between classes using these probabilities.
• Generative models rely on the Bayes theorem to find the joint probability. Generative models describe how data is generated using probabilistic models. They predict P(y|x), the probability of y given x, by calculating P(x, y), the probability of x and y.
Naive Bayes
* Naive Bayes classifiers are a family of simple probabilistic classifiers based on
applying Bayes’ theorem with strong independence assumptions between the
features. It is highly scalable, requiring a number of parameters linear in the
number of variables (features /predictors) in a learning problem.
+ A Naive Bayes Classifier is a program which predicts a class value given a set of
attributes.
* For each known class value,
1. Calculate probabilities for each attribute, conditional on the class value.
2. Use the product rule to obtain a joint conditional probability for the attributes.
3. Use Bayes rule to derive conditional probabilities for the class variable.
• Once this has been done for all class values, output the class with the highest probability.
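A small pure-Python sketch of this procedure on a made-up categorical dataset (feature values and labels are illustrative only) :

from collections import Counter, defaultdict

# Toy training data : (outlook, windy) -> play
data = [(("sunny", "no"), "yes"), (("sunny", "yes"), "no"),
        (("rain", "no"), "yes"), (("rain", "yes"), "no"),
        (("sunny", "no"), "yes")]

classes = Counter(label for _, label in data)          # class counts for the priors
cond = defaultdict(Counter)                            # (class, attribute index) -> value counts
for features, label in data:
    for i, value in enumerate(features):
        cond[(label, i)][value] += 1

def predict(features):
    scores = {}
    for c, n_c in classes.items():
        p = n_c / len(data)                            # prior P(class)
        for i, value in enumerate(features):
            p *= cond[(c, i)][value] / n_c             # P(attribute = value | class), product rule
        scores[c] = p                                  # proportional to the posterior (Bayes rule)
    return max(scores, key=scores.get)

print(predict(("rain", "no")))                         # -> "yes" for this toy data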
Conditional Probability
• Let A and B be two events such that P(A) > 0. We denote by P(B|A) the probability of B given that A has occurred. Since A is known to have occurred, it becomes the new sample space replacing the original S. From this, the definition is :
P(B|A) = P(A ∩ B) / P(A)
OR
P(A ∩ B) = P(A) P(B|A)
• The notation P(B|A) is read "the probability of event B given event A". It is the probability of an event B given the occurrence of the event A.
• We say that the probability that both A and B occur is equal to the probability that A occurs times the probability that B occurs given that A has occurred. We call P(B|A) the conditional probability of B given A, i.e., the probability that B will occur given that A has occurred.
• Similarly, the conditional probability of an event A given B is :
P(A|B) = P(A ∩ B) / P(B)
• The probability P(A|B) simply reflects the fact that the probability of an event A may depend on a second event B. If A and B are mutually exclusive, A ∩ B = ∅ and P(A|B) = 0.
• Another way to look at the conditional probability formula is :
P(Second | First) = P(First choice and second choice) / P(First choice)
Conditional probability is a defined quantity and cannot be proven.
The key to solving conditional probability problems is to :
1. Define the events.
2. Express the given information and question in probability notation.
3. Apply the formula.
Joint Probability
* A joint probability is a probability that measures the likelihood that two or more
events will happen concurrently.
* If there are two independent events A and B, the probability that A and B will
occur is found by multiplying the two probabilities. Thus for two events A and B,
the special rule of multiplication shown symbolically is :
P(A and B) = P(A) P(B).
• The general rule of multiplication is used to find the joint probability that two events will occur. Symbolically, the general rule of multiplication is :
P(A and B) = P(A) P(B|A)
• The probability P(A ∩ B) is called the joint probability for two events A and B which intersect in the sample space. A Venn diagram readily shows that
P(A ∩ B) = P(A) + P(B) − P(A ∪ B)
Equivalently :
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) ≤ P(A) + P(B)
• The probability of the union of two events never exceeds the sum of the event probabilities.
* A tree diagram is very useful for portraying conditional and joint Probabilities, 4
tree diagram portrays outcomes that are mutually exclusive.
Bayes Theorem
• Bayes' theorem is a method to revise the probability of an event given additional information. Bayes' theorem calculates a conditional probability called a posterior or revised probability.
• Bayes' theorem is a result in probability theory that relates conditional probabilities. If A and B denote two events, P(A|B) denotes the conditional probability of A occurring, given that B occurs. The two conditional probabilities P(A|B) and P(B|A) are in general different.
• Bayes theorem gives a relation between P(A|B) and P(B|A).
• A prior probability is an initial probability value originally obtained before any additional information is obtained.
• A posterior probability is a probability value that has been revised by using additional information that is later obtained.
• Suppose that B1, B2, ..., Bn partition the outcomes of an experiment and that A is another event. For any number k, with 1 ≤ k ≤ n, we have the formula :
P(Bk | A) = P(A | Bk) P(Bk) / Σ_{i=1..n} P(A | Bi) P(Bi)
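A quick numeric check of this formula in Python (the priors and likelihoods below are made-up values, not from the text) :

# Partition B1, B2, B3 with assumed priors P(Bi) and likelihoods P(A | Bi)
priors = [0.5, 0.3, 0.2]
likelihoods = [0.1, 0.4, 0.7]

evidence = sum(p * l for p, l in zip(priors, likelihoods))              # P(A) by total probability
posteriors = [p * l / evidence for p, l in zip(priors, likelihoods)]    # P(Bk | A)
print(posteriors, sum(posteriors))                                      # posteriors sum to 1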
Difference between Generative and Discriminative Models

Generative model | Discriminative model
Generative models can generate new data instances. | Discriminative models discriminate between different kinds of data instances.
Generative model revolves around the distribution of a dataset to return a probability for a given example. | Discriminative model makes predictions based on conditional probability and is either used for classification or regression.
Generative models capture the joint probability p(X, Y), or just p(X) if there are no labels. | Discriminative models capture the conditional probability p(Y | X).
A generative model includes the distribution of the data itself, and tells you how likely a given example is. | A discriminative model ignores the question of whether a given instance is likely, and just tells you how likely a label is to apply to the instance.
Generative models are used in unsupervised machine learning to perform tasks such as probability and likelihood estimation. | The discriminative model is used particularly for supervised machine learning. Example : Logistic regression, SVMs.

2.4 Maximum Margin Classifier : Support Vector Machine
• Support Vector Machines (SVMs) are a set of supervised learning methods which learn from the dataset and are used for classification. SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis.
• An SVM is a kind of large-margin classifier : it is a vector space based machine learning method where the goal is to find a decision boundary between two classes that is maximally far from any point in the training data. (Fig. 2.4.1 : Two class problem; Fig. 2.4.2 : Bad decision boundary of SVM)
• Given a set of training examples, each marked as belonging to one of two classes, an SVM algorithm
builds a model that predicts whether a new example falls into one class or the other. Simply speaking, we can think of an SVM model as representing the examples as points in space, mapped so that the examples of the separate classes are divided by a gap that is as wide as possible.
• New examples are then mapped into the same space and classified as belonging to a class based on which side of the gap they fall on.
Two Class Problems :
• Many decision boundaries can separate these two classes. Which one should we choose ?
• The Perceptron learning rule can be used to find any decision boundary between class 1 and class 2.
The line that maximizes the minimum margin is a good bet. The model class of
“hyper-planes with a margin of m" has a low VC dimension if m is big.
« This maximum-margin separator is determined by a subset of the data points,
Data points in this subset are called "support vectors". It will be useful
computationally if only a small fraction of the data points are support vectors,
because we use the support vectors to decide which side of the separator a test
case is on.
Example of Bad Decision Boundaries
• SVMs are primarily two-class classifiers with the distinct characteristic that they aim to find the optimal hyperplane such that the expected generalization error is minimized. Instead of directly minimizing the empirical risk calculated from the training data, SVMs perform structural risk minimization to achieve good generalization.
• Because we don't know the distribution P, we minimize the empirical risk over a training dataset drawn from P. This general learning technique is called empirical risk minimization.
• Fig. 2.4.3 shows empirical risk (confidence and empirical risk plotted against the complexity of the function set).
1. Good decision boundaries are those that maximize the margin => B1 is better than B2.
2. They maximize the margin of the decision boundary using quadratic optimization techniques which find the optimal hyperplane.
3. Ability to handle large feature spaces.
4. Overfitting can be controlled by the soft margin approach.
5. When used in practice, SVM approaches frequently map the examples to a higher dimensional space and find margin-maximal hyperplanes in the mapped space, obtaining decision boundaries which are not hyperplanes in the original space.
• The most popular versions of SVMs use non-linear kernel functions and map the attribute space into a higher dimensional space to facilitate finding "good" linear decision boundaries in the modified space.
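A minimal scikit-learn sketch of a kernelized SVM (the data and parameter values are illustrative assumptions, not from the text); the RBF kernel performs the implicit high-dimensional mapping and C is the soft-margin cost :

import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data with a circular (non-linear) boundary
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.n_support_)       # number of support vectors per class
print(clf.score(X, y))      # training accuracy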
SVM Applications
• SVM has been used successfully in many real-world problems :
1. Text (and hypertext) categorization
2. Image classification
3. Bioinformatics (protein classification, cancer classification)
4. Hand-written character recognition
5. Determination of SPAM email.
Limitations of SVM
1. It is sensitive to noise.
2. The biggest limitation of SVM lies in the choice of the kernel.
3. Another limitation is speed and size.
4. The optimal design for multiclass SVM classifiers is also a research area.
Soft Margin SVM
• For the very high dimensional problems common in text classification, sometimes the data are linearly separable. But in the general case they are not, and even if they are, we might prefer a solution that better separates the bulk of the data while ignoring a few weird noise documents.
• What if the training set is not linearly separable ? Slack variables can be added to allow misclassification of difficult or noisy examples, resulting in a margin called soft.
• A soft-margin allows a few variables to cross into the margin or over the hyperplane, allowing misclassification.
• We penalize the crossover by looking at the number and distance of the misclassifications. This is a trade off between the hyperplane violations and the margin size. The slack variables are bounded by some set cost. The farther they are from the soft margin, the less influence they have on the prediction.
• All observations have an associated slack variable :
1. Slack variable = 0 then all points on the margin.
2. Slack variable > 0 then a point in the margin or on the wrong side of the hyperplane.
3. C is the tradeoff between the slack variable penalty and the margin.
Comparison of SVM and Neural Networks

Support Vector Machine | Neural Network
Kernel maps to a very-high dimensional space | Hidden layers map to lower dimensional spaces
Search space has a unique minimum | Search space has multiple local minima
Training is extremely efficient | Classification extremely efficient
Very good accuracy in typical domains | Very good accuracy in typical domains
Kernel and cost are the two parameters to select | Requires number of hidden units and layers

Example : For the data points and classifier shown in the figure, find the support vectors (if any), slack variables on the correct side of the classifier (if any) and slack variables on the wrong side of the classifier (if any). Mention which point will have maximum penalty and why.

Solution :
• Data points 1 and 5 will have maximum penalty.
• Margin (m) is the gap between the data points and the classifier boundary. The margin is the minimum distance of any sample to the decision boundary. If the hyperplane is in the canonical form, the margin can be measured by the length of the weight vector.
• Maximal margin classifier : A classifier in the family F that maximizes the margin. Maximizing the margin is good according to intuition and PAC theory. It implies that only support vectors matter; other training examples are ignorable.
• What if the training set is not linearly separable ? Slack variables can be added to allow misclassification of difficult or noisy examples, resulting in a margin called soft. A soft-margin allows a few variables to cross into the margin or over the hyperplane, allowing misclassification.
• We penalize the crossover by looking at the number and distance of the misclassifications. This is a trade off between the hyperplane violations and the margin size. The slack variables are bounded by some set cost. The farther they are from the soft margin, the less influence they have on the prediction.
• All observations have an associated slack variable :
Slack variable = 0 then all points on the margin.
Slack variable > 0 then a point in the margin or on the wrong side of the hyperplane.
C is the tradeoff between the slack variable penalty and the margin.

2.5 Decision Tree
A decision tree is a simple representation for classifying examples. Decision tree
learning is one of the most successful techniques for supervised classification
learning.
In decision analysis, a decision tree can be used to visually and explicitly represent
decisions and decision making. As the name goes, it uses a tree-like model of
decisions,
Learned trees can also be represented as sets of if-then rules to improve human
readability.
• A decision tree has two kinds of nodes :
1. Each leaf node has a class label, determined by majority vote of training
examples reaching that leaf.
TECHNICAL PUBLICATIONS® - an up-thrust for knowledgeSupervised
Loam
24 ng
5 out accordin,
Machine Learning tion on features. It branche: B to the
is a que’
internal node
2 at imating discrete-value
imating di d targa,
5 a method for approx
answers. tin
d by a decision tree.
Decision tree learn
functions. The learnes
.d decision tree can
one of
ing i
i i resentes
.d function is rep!
also be re-represented as a set of if-then rules,
«A leame the most widely used and practical methods j,,
Decision tree Iearning is
ive il ce. ;
inductive inferen es
f learning disj
isy data and capable o}
« It is robust to noisy
learning method searches a completely expressive hypothesis
n tree le
© Decisio
Decision Tree Representation
Goal ; Build a decision tree for classifying examples as positive or negative
instances of a concept a
Supervised learning, batch processing of training examples, using a preference
bias.
A decision tree is a tree where
a. Each non-leaf node has associated with it an attribute (feature).
b. Each leaf node has associated with it a classification (+ or -).
c. Each arc has associated with it one of the possible values of the attribute at the
node from which the arc is directed.
Internal node denotes a test on an attribute. Branch represents an outcome of the
test. Leaf nodes represent class labels or class. distribution.
• A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions. Decision trees can easily be converted to classification rules.
Decision Tree Algorithm
• To generate a decision tree from the training tuples of data partition D.
Input :
1. Data partition D
2. Attribute list
3. Attribute selection method
Algorithm :
1. Create a node N;
2. If tuples in D are all of the same class C then
3. Return node N as a leaf node labeled with the class C;
4. If attribute list is empty then return N as a leaf node labeled with the majority class in D;
5. Apply attribute selection method(D, attribute list) to find the "best" splitting criterion;
6. Label node N with splitting criterion;
7. If splitting attribute is discrete-valued and multiway splits allowed
8. Then attribute list ← attribute list − splitting attribute;
9. For (each outcome j of splitting criterion)
10. Let Dj be the set of data tuples in D satisfying outcome j;
11. If Dj is empty then attach a leaf labeled with the majority class in D to node N;
12. Else attach the node returned by Generate decision tree(Dj, attribute list) to node N;
13. End of for loop
14. Return N;
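For comparison, a hedged scikit-learn sketch (a library implementation, not the pseudocode above); here criterion="entropy" plays the role of the attribute selection method and max_depth is an assumed limit on tree growth :

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(export_text(tree))        # the learned splits written as if-then style rules
print(tree.predict(X[:5]))      # class labels read off the leaf nodes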
+ Decision tree generation consists of two phases : Tree construction and pruning
+ In tree construction phase, all the training examples are at the root. Partition
examples recursively based on selected attributes.
«+ In tree pruning phase, the identification and removal of branches that reflect noise
or outliers.
+ There are various paradigms that are used for learning binary classifiers which
include :
1. Decision Trees 2. Neural Networks
3. Bayesian Classification 4. Support Vector Machines
Fig. 2.5.1 Decision tree
Example : Using the following feature tree, write decision rules for the majority class.
Solution :
• Left side : A feature tree combining two Boolean features. Each internal node or split is labelled with a feature, and each edge emanating from a split is labelled with a feature value. Each leaf therefore corresponds to a unique combination of feature values. Also indicated in each leaf is the class distribution derived from the training set.
• Right side : A feature tree partitions the instance space into rectangular regions, one for each leaf. (Fig. 2.5.3; split feature : 'Viagra')
• The leaves of the tree in the above figure could be labelled, from left to right, by employing a simple decision rule called majority class; for instance, a leaf with class distribution spam : 20, ham : 5 is labelled spam.
• Left side : A feature tree with training set class distribution in the leaves.
• Right side : A decision tree obtained using the majority class decision rule.
Appropriate Problems for Decision Tree Learning
• Decision tree learning is generally best suited to problems with the following characteristics :
1, Instances are represented by attribute-value pairs. Fixed set of attributes, and
the attributes take a small number of disjoint possible values.
2. The target function has discrete output values. Decision tree learning is
appropriate for a boolean classification, but it easily extends to learning
functions with more than two possible output values.
3. Disjunctive descriptions may be required. Decision trees naturally represent
disjunctive expressions.
4. The training data may contain errors. Decision tree learning methods are
; robust to errors, both errors in classifications of the training examples and
errors in the attribute values that describe these examples.
5. The training data may contain missing attribute values. Decision tree methods
can be used even when some training examples have unknown values.
6. Decision tree learning has been applied to problems such as learning to
classify.
Advantages and Disadvantages of Decision Tree
Advantages :
1. Rules are simple and easy to understand.
2. Decision trees can handle both nominal and numerical attributes.
3. Decision trees are capable of handling datasets that may have errors.
4. Decision trees are capable of handling datasets that may have missing values.
5. Decision trees are considered to be a nonparametric method.
6. Decision trees are self-explanatory.
Disadvantages :
1. Most of the algorithms require that the target attribute will have only discrete values.
2. Some problems are difficult to solve, like XOR.
3. Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
4. Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
2.6 Random Forests
• Random forest is a well-known supervised machine learning algorithm that is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to enhance the overall performance of the model. It may be used for both classification and regression problems in ML.
• As the name indicates, "Random forest is a classifier that incorporates a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority votes of predictions, it predicts the final output.
• A greater number of trees in the forest results in better accuracy and prevents the problem of overfitting.
How Does the Random Forest Algorithm Work ?
• Random forest works in two phases : the first is to create the random forest by combining N decision trees, and the second is to make predictions for each tree created in the first phase.
• The working process may be explained in the following steps and diagram :
Step - 1 : Select random K data points from the training set.
Step - 2 : Build the decision trees associated with the selected data points (subsets).
Step - 3 : Choose the number N of decision trees we want to build.
Step - 4 : Repeat steps 1 and 2.
Step - 5 : For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the majority of votes.
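A minimal scikit-learn sketch of these steps (dataset and parameter values are illustrative assumptions); n_estimators corresponds to the number N of decision trees, and prediction is by majority vote :

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(forest.predict(X_test[:5]))        # majority-vote predictions for new points
print(forest.score(X_test, y_test))      # accuracy on held-out data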
• The working of the algorithm may be better understood by the example below.
• Example : Suppose there is a dataset that contains multiple fruit images. This dataset is given to the random forest classifier. The dataset is divided into subsets and given to every decision tree. During the training phase, each decision tree produces a prediction result, and when a new data point occurs, then based on the majority of results, the random forest classifier predicts the final decision. Consider the image below.
Fig. 2.6.1 Example of random forest
Applications of Random Forest
• There are mainly four sectors where random forest is normally used :
1. Banking : The banking sector mainly uses this algorithm for the identification of loan risk.
2. Medicine : With the help of this algorithm, disease trends and risks of the disease can be identified.
3. Land use : We can identify the areas of similar land use with the help of this algorithm.
4. Marketing : Marketing trends can be identified by using this algorithm.
Advantages of Random Forest
• Random forest is capable of performing both classification and regression tasks.
• It is capable of handling large datasets with high dimensionality.
• It enhances the accuracy of the model and prevents the overfitting problem.
Disadvantages of Random Forest
• Although random forest can be used for both classification and regression tasks, it is not more suitable for regression tasks.
2.7 Two Marks Questions with Answers
Q.1 What do you mean by least square method ?
Ans. : Least squares is a statistical method used to determine a line of best fit by minimizing the sum of squares created by a mathematical function. A "square" is determined by squaring the distance between a data point and the regression line or mean value of the data set.
Q.2 What is linear Discriminant function ?
‘Ans. : LDA is a supervised learning algorithm, which means that it requires a labelled
training set of data points in order to learn the Linear Discriminant function.
Q.3 What is a support vector in SVM ?
‘Ans. : Support vectors are data points that are closer to the hyperplane and influence
the position and orientation of the hyperplane. Using these support vectors, we
maximize the margin of the classifier.
Q.4 What is Support Vector Machines ?
‘Ans. : A Support Vector Machine (SVM) is a supervised machine learning model that
uses classification algorithms for two-group classification problems. After giving an
SVM model sets of labeled training data for each category, they're able to categorize
new text.
Q.5 Define logistic regression.
Ans. : Logistic regression is a supervised learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.
Q.6 List out the types of machine learning.
Ans. : Types of machine learning are supervised, semi-supervised, unsupervised and reinforcement learning.
Q.7 What is Random forest ?
oa + Random forest is an ensemble learning technique that combines multiple
cision trees, implementing the bagging method and results in a robust model with
low variance.
Q.8 What are the five popular algorithms of machine learning ?
Ans. : Popular algorithms are Decision Trees, Neural Networks (back propagation), Probabilistic networks, Nearest Neighbor and Support Vector Machines.
TECHNICAL PUBLICATIONS® - an up-thrust for knowledgemachine Learning 2-31 Supervised Learning
Q.9 What is the function of 'Supervised Learning' ?
Ans. : Functions of 'Supervised Learning' are classification, speech recognition, regression, predicting time series and annotating strings.
Q.10 What are the advantages of Naive Bayes ?
Ans. : A Naive Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data. Its main disadvantage is that it can't learn interactions between features.
Q.11 What is regression ?
Ans. : Regression is a method to determine the statistical relationship between a
dependent variable and one or more independent variables.
Q.12 Explain linear and non-linear regression models.
Ans. : In linear regression models, the dependence of the response on the regressors is
defined by a linear function, which makes their statistical analysis mathematically
tractable. On the other hand, in nonlinear regression models, this dependence is
defined by a nonlinear function, hence the mathematical difficulty in their analysis.
Q.13 What is regression analysis used for ?
Ans. : Regression analysis is a form of predictive modelling technique which
investigates the relationship between a dependent (target) and independent variable (s)
(predictor). This technique is used for forecasting, time series modelling and finding the
causal effect relationship between the variables.
Q.14 List two properties of logistic regression.
Ans. :
1. The dependent variable in logistic regression follows Bernoulli Distribution.
2. Estimation is done through maximum likelihood.
Q15 What is the goal of logistic regression ?
Ans. The goal of logistic regression is to correctly predict the category of outcome for
individual cases using the most parsimonious model. To accomplish this goal, a model
is created that includes all predictor variables that are useful in predicting the response
Variable.
Q.16 Define supervised learning.
Ans. : Supervised learning is learning in which the network is trained by providing it with input
and matching output patterns. These input-output pairs are usually provided by an
external teacher.