07 - Linear Models For Classification
— Steve Oudot
Outline
• Reminder about supervised classification
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Supervised learning (classification)
Input: n observations + responses (x1 , y1 ), · · · , (xn , yn ) ∈ X × Y
Example: X = {images}, Y = {labels}
Classification: Y is discrete
Statistical framework
Hyp: the observations are drawn from a random variable X, the responses from a random variable Y:
▶ xi ∼ X i.i.d., with values in X = Rd
▶ yi ∼ Y i.i.d., with values in Y = {1, · · · , κ} (classification)
→ X, Y are the marginals of the joint distribution Pr(X, Y); the joint distribution encodes the complexity of the problem (in fact the complexity is encoded in the conditional distribution of Y given X)
[Figure: Y (values 1, · · · , κ) plotted against X in two scenarios]
▶ X, Y perfectly dependent ⇒ ∃ perfect predictor
▶ X, Y imperfectly dependent ⇒ ∄ perfect predictor
Best prediction at x (Bayes classifier):

    argmin_{y∈{1,··· ,κ}} ( 1 − P(Y = y | X = x) )  =  argmax_{y∈{1,··· ,κ}} P(Y = y | X = x)

(the best prediction at x maximizes the posterior probability P(Y | X); the first expression, a minimization of the expected 0-1 loss, is where the choice of the 0-1 loss comes into play)
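To make the definition concrete, here is a minimal numerical sketch (not from the slides) of the Bayes classifier on a toy model where the joint distribution is known: two classes with assumed Gaussian class-conditional densities on R and equal priors.

```python
# Toy Bayes classifier: argmax_y P(Y = y | X = x), with an assumed generative model.
import numpy as np

priors = np.array([0.5, 0.5])                          # P(Y = y), y in {1, 2}
means, stds = np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def bayes_predict(x):
    """argmax_y P(Y = y | X = x), computed via Bayes' rule up to normalization."""
    posteriors = gauss_pdf(x, means, stds) * priors    # proportional to P(Y = y | X = x)
    return int(np.argmax(posteriors)) + 1              # labels are 1-based

print(bayes_predict(-0.3), bayes_predict(0.7))         # -> 1 2
```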
Advantages and drawbacks of k-NN in practice
• high flexibility
• easiness of implementation:
  ▶ only a few lines of code for NN-search via linear scan
  ▶ the same applies for more advanced sublinear methods, using the right libraries (e.g. ANN, LSH)
• algorithmic cost of prediction:
  ▶ linear scan in Θ(nd)
  ▶ sublinear methods become (close to) linear in high dimensions
• slow convergence in high dimensions (curse of dimensionality)
By contrast, the linear methods of this lecture offer easy and efficient prediction (a dot product), put the algorithmic cost on pre-training, and have no hyper-parameter.
Outline
• Reminder about supervised classification
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Linear methods for classification
Response variable Y is discrete
▶ consider the fibers of the predictor f: f⁻¹({1}), · · · , f⁻¹({κ})
▶ a linear classifier is one that produces linear decision boundaries (cf. the definition of linear methods in the lecture on linear regression)
[Figure: colored dots are the input observations with their responses; the colored areas are the fibers f⁻¹({1}), f⁻¹({2}) of the predictor — left: linear, right: nonlinear]

2 types of approaches:
▶ model the posterior probability (discriminant function δy) for each class y, then classify by taking argmax_{y∈{1,··· ,κ}} δy(x)
  (analogously to the Bayes classifier, one tries to model the posterior probability P(Y = y | X = x), with one discriminant function per class)
  e.g. linear / logistic regression, LDA
▶ explicitly construct the decision boundaries separating the classes
  e.g. SVM (later in this lecture)
[Figure: result of LDA on the Iris dataset — left: the dataset with the decision boundaries and the fibers f⁻¹({1}), f⁻¹({2}), f⁻¹({3}); right: the discriminant functions δ1, δ2, δ3]
Outline
• Reminder about supervised classification
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Linear regression for classification
Use a linear model for the discriminant functions (yields linear decision boundaries).

Fit the model by least squares:   B̂ := argmin_B  Σ_{i=1}^n ∥ Z(yi) − [ 1 xi^T ] B ∥₂²

▶ row indicator vector: Z(Y) := [ 1_{Y=1} ··· 1_{Y=κ} ] ∈ R^{1×κ}
  (the corresponding random variable in our statistical model; it is fully dependent on Y)
Equivalently, in matrix form:   B̂ := argmin_B ∥ Z − X B ∥_F²
  (Frobenius norm: ∥M∥_F² := Σ_{i,j} M_{ij}² — it acts as the ℓ2-norm on the entries of the matrix)
▶ indicator response matrix:   Z := [ Z(y1) ; ··· ; Z(yn) ] ∈ R^{n×κ}
  (our new input of responses: a binary matrix with exactly one nonzero entry per row)
Unique minimum (assuming X has full column rank):   B̂ = (X^T X)⁻¹ X^T Z
  (otherwise, regularize by some ℓp penalty (e.g. ℓ2) as in ordinary linear regression)
Fitted discriminant functions:   δ̂(x) := [ δ̂1(x) ··· δ̂κ(x) ] = [ 1 x^T ] B̂
  (each column of B̂ is a fitted parameter vector β̂y, hence the y-th entry of δ̂(x))

Running experiment: error rate ≈ 33%
  (surprising, as the classes are indeed linearly separable — compare with the Bayes error rate)
[Figure: decision regions, and a cross-section along the diagonal showing δ̂1, δ̂2, δ̂3 — the plot explains the phenomenon: the discriminant function of class 2 never dominates]
Linear regression for classification
What is happening:
▶ the fitted discriminant functions sum up to 1 (in the centered model, i.e. assuming the set of observations is centered in Rd, mean = 0)
▶ as the cross-section shows, δ̂2 is then dominated everywhere by δ̂1 or δ̂3, so class 2 is never predicted
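For concreteness, a minimal numpy sketch (function names are mine, not the slides') of the indicator-matrix least-squares fit and the argmax classification rule described above:

```python
# Linear regression on class indicators: B_hat = argmin_B ||Z - X1 B||_F^2,
# then classify by argmax of the fitted discriminant functions delta_hat(x) = [1 x^T] B_hat.
import numpy as np

def fit_indicator_regression(X, y, n_classes):
    """X: (n, d) observations, y: (n,) labels in {1, ..., kappa}."""
    n = X.shape[0]
    X1 = np.hstack([np.ones((n, 1)), X])             # prepend the intercept column
    Z = np.eye(n_classes)[y - 1]                      # (n, kappa) indicator matrix
    B_hat, *_ = np.linalg.lstsq(X1, Z, rcond=None)    # least-squares fit, column by column
    return B_hat                                      # shape (d+1, kappa)

def predict(B_hat, X):
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    deltas = X1 @ B_hat                               # fitted discriminant functions
    return np.argmax(deltas, axis=1) + 1              # labels are 1-based
```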
Outline
• Reminder about supervised classification
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Logistic regression for binary classification
Generalized linear model for the discriminant functions:

    δ1(x) := σ ([ 1 x^T ] β1)   where σ(t) := exp(t) / (1 + exp(t)) = 1 / (1 + exp(−t))   (logistic sigmoid function)

▶ forces δ1(x) ∈ [0, 1]
▶ δ2(x) := 1 − δ1(x) = σ (− [ 1 x^T ] β1)

Properties of the regression and of the associated classifier (f̂(x) is defined, once again, as the argmax of the discriminant functions):
▶ σ⁻¹(u) = ln ( u / (1 − u) )   (logit function)
▶ σ(t) + σ(−t) = 1
  ("sigmoid" means "S-shaped" and bounded — the logistic sigmoid is but one example of sigmoid)
▶ P(Y = 1 | X = x) = δ1(x) = σ ([ 1 x^T ] β1)
▶ the model makes the posterior probability ratio log-linear:

    ln [ P(Y = 1 | X = x) / P(Y = 2 | X = x) ] = ln [ σ ([ 1 x^T ] β1) / σ (− [ 1 x^T ] β1) ] = ln [ (1 + exp ([ 1 x^T ] β1)) / (1 + exp (− [ 1 x^T ] β1)) ] = [ 1 x^T ] β1

  (factorize the numerator by exp ([ 1 x^T ] β1) to get the last equality; this means that, fundamentally, the discriminant functions are not fit independently of each other: fitting δ1 determines δ2)
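A quick numerical check of the two sigmoid properties listed above (a throwaway sketch, not part of the slides):

```python
# Check sigma(t) + sigma(-t) = 1 and that the logit ln(u / (1 - u)) inverts sigma.
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def logit(u):
    return np.log(u / (1.0 - u))

t = np.linspace(-5, 5, 11)
assert np.allclose(sigma(t) + sigma(-t), 1.0)   # sigma(t) + sigma(-t) = 1
assert np.allclose(logit(sigma(t)), t)          # logit is the inverse of sigma
```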
Logistic regression for binary classification
Model fitting by maximum likelihood (maximum likelihood estimation for parametrized models was already seen in a previous lecture):

    L ((yi)_{i=1}^n ; (xi)_{i=1}^n , β1) = Π_{i=1}^n P(Y = yi | X = xi ; β1)   (independence)

    ln L ((yi)_{i=1}^n ; (xi)_{i=1}^n , β1) = Σ_{i=1}^n ln P(Y = yi | X = xi ; β1)

Change of variable: Z := Y mod 2, with values in {0, 1};  ∀i, zi := yi mod 2 ∈ {0, 1}
  (this is just to have a response taking values in {0, 1} instead of {1, 2}, which is more convenient)

The estimator is the maximizer of the log-likelihood, renamed ℓ(β1) for simplicity:

    β̂1 := argmax_{β1} ℓ(β1)   where   ℓ(β1) := ln L ((yi)_{i=1}^n ; (xi)_{i=1}^n , β1) = Σ_{i=1}^n [ zi [ 1 xi^T ] β1 − ln (1 + exp ([ 1 xi^T ] β1)) ]

  (note that this functional has the same shape as the one based on minimizing the empirical risk)

Gradient and Hessian (writing xi¹ := [ 1 xi^T ]^T for the augmented observation):

    ∇ ℓ(β1) = Σ_{i=1}^n ( zi − σ ([ 1 xi^T ] β1) ) xi¹

    ∇² ℓ(β1) = − Σ_{i=1}^n σ ([ 1 xi^T ] β1) ( 1 − σ ([ 1 xi^T ] β1) ) xi¹ [ 1 xi^T ]

  (for the Hessian we use the fact that σ′ = σ (1 − σ))
▶ the Hessian is the negated sum of positive semi-definite matrices with non-negative coefficients ⇒ it is negative semi-definite ⇒ ℓ(β1) is a concave function
▶ setting ∇ ℓ(β1) = 0 gives d + 1 non-linear equations in β1, hence the need for an iterative solver
Logistic regression for binary classification
Newton–Raphson's method (basically gradient ascent, with the step prescribed by the Hessian matrix):

    repeat:   β̂1 ←− β̂1 − [ ∇² ℓ(β̂1) ]⁻¹ ∇ ℓ(β̂1)    // assuming non-singular Hessian

Thm: if a maximum β̄1 is s.t. ∇² ℓ(β̄1) is non-singular, then β̄1 is the unique maximum and, for an initial β̂1 close enough to β̄1, the convergence to β̄1 is quadratic.
  (a non-singular Hessian at a maximum is negative definite, hence ℓ is strictly concave there and β̄1 is the unique maximum; close enough to β̄1, the Hessian at β̂1 is non-singular as well, hence the algorithm proceeds)
Logistic regression for binary classification
Degenerate cases (singular Hessian): on degenerate examples (such as our running one), applying the vanilla logistic regression may run into trouble ⇒ regularize the log-likelihood by an ℓp penalty.
▶ case p = 2 (Tikhonov; called "ridge" in the lecture on regression because it led to ridge linear regression): subtract λ ∥β1∥² from ℓ(β1), which changes the gradient and Hessian into

    ∇ ℓ(β1) = Σ_{i=1}^n ( zi − σ ([ 1 xi^T ] β1) ) xi¹ − 2λ β1

    ∇² ℓ(β1) = − Σ_{i=1}^n σ ([ 1 xi^T ] β1) ( 1 − σ ([ 1 xi^T ] β1) ) xi¹ [ 1 xi^T ] − 2λ I_{d+1}
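The change to the previous Newton sketch is mechanical; here is a hedged one-function sketch of the regularized step (the parameter name lam and the function name are mine):

```python
# One ridge-regularized Newton step: same as before, plus the -2*lam*beta1 term in the
# gradient and the -2*lam*I term in the Hessian (lam > 0), which makes the Hessian
# negative definite and hence always non-singular.
import numpy as np

def newton_step_ridge(X1, z, beta1, lam):
    p = 1.0 / (1.0 + np.exp(-(X1 @ beta1)))
    grad = X1.T @ (z - p) - 2.0 * lam * beta1
    hess = -(X1 * (p * (1 - p))[:, None]).T @ X1 - 2.0 * lam * np.eye(X1.shape[1])
    return beta1 - np.linalg.solve(hess, grad)
```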
Outline
• Reminder about supervised classification
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Multi-class logistic regression
Log-linear model for the posterior probability ratios (as in the binary case, we choose a reference class, say y = κ, and regress the other classes against it):

    ln [ P(Y = y | X = x) / P(Y = κ | X = x) ] = [ 1 x^T ] βy    for y = 1, · · · , κ − 1

▶ parameter matrix B := [ β1 ··· βκ−1 ] ∈ R^{(d+1)×(κ−1)}

Resulting discriminant functions (a generalized sigmoid, known as the softmax):

    δy(x) := P(Y = y | X = x) = exp ([ 1 x^T ] βy) / ( 1 + Σ_{z<κ} exp ([ 1 x^T ] βz) ) ∈ [0, 1]    for y = 1, · · · , κ − 1

    δκ(x) := P(Y = κ | X = x) = 1 / ( 1 + Σ_{z<κ} exp ([ 1 x^T ] βz) ) = 1 − Σ_{y<κ} δy(x)

  (this forces, again, the discriminant functions δy(x) to sum up to 1)

▶ estimate B by maximum likelihood and Newton–Raphson's algorithm
  (the expressions for the objective function and for the iteration steps are more complicated than in the binary case)
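A minimal sketch of these multi-class discriminant functions, with class κ as the reference class (fitting B by maximum likelihood is omitted; names are mine, not the slides'):

```python
# Softmax discriminant functions with reference class kappa:
# B is the (d+1) x (kappa-1) parameter matrix [beta_1 ... beta_{kappa-1}].
import numpy as np

def discriminants(B, x):
    """Return [delta_1(x), ..., delta_kappa(x)] for a single observation x."""
    x1 = np.concatenate([[1.0], x])                    # augmented observation [1 x^T]
    scores = np.exp(x1 @ B)                            # exp([1 x^T] beta_y), y < kappa
    denom = 1.0 + scores.sum()
    return np.concatenate([scores, [1.0]]) / denom     # sums to 1 by construction

def predict(B, x):
    return int(np.argmax(discriminants(B, x))) + 1     # labels are 1-based
```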
Multi-class logistic regression
Back to our running experiment: [Figure: decision regions obtained by multi-class logistic regression]
Outline
• Reminder about supervised classification
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Support Vector Machines (SVM)
Principle: explicitly construct the 'best' hyperplanes separating the various classes.
▶ the hyperplanes that maximize the margins (closest distances to data points)

Hyperplane equation: x^T β − β0 = 0
▶ parameters: β ∈ Rd \ {0}, β0 ∈ R
▶ β is normal to the hyperplane
▶ β0 / ∥β∥ is the shift from the origin along β (2-norm)
▶ convention: fix 1/∥β∥ to be the margin
  (there is one degree of freedom in the hyperplane equation, since its solution set is invariant under rescaling of (β, β0))
  ⇒ the slab boundaries have equations x^T β − β0 = ±1
  (this is another consequence of our convention that 1/∥β∥ is set to be the margin)
[Figure: the hyperplane x^T β − β0 = 0, its normal β, the offset β0/∥β∥ from the origin, and the slab boundaries x^T β − β0 = ±1 at distance 1/∥β∥]
Support Vector Machines (SVM)
Optimization problem (linearly separable classes, labels yi ∈ {−1, 1}):

    β̂, β̂0 := argmin_{β,β0} ∥β∥²   (maximize margin)
    subject to:   xi^T β − β0 ≥ 1   ∀i s.t. yi = 1
                  xi^T β − β0 ≤ −1  ∀i s.t. yi = −1
    (leave data points outside the slab, on the correct side)

  (the constraint yi (xi^T β − β0) ≥ 1 ∀i = 1, · · · , n is equivalent to the previous ones)

▶ quadratic programming problem (with a positive definite quadratic form)
  (such problems have quadratic objective functions and linear constraints — equalities or inequalities)
▶ classifier: f̂(x) = sign (x^T β̂ − β̂0)
  (the class/label of a query point x is determined by the side of the fitted hyperplane on which it lies)
[Figure: labeled data points (xi, 1) and (xi, −1), the maximum-margin hyperplane x^T β̂ − β̂0 = 0, and the slab boundaries x^T β̂ − β̂0 = ±1 at distance 1/∥β̂∥]
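A minimal sketch of this maximum-margin classifier using scikit-learn: SVC solves the soft-margin problem, and a very large C makes constraint violations so costly that, on linearly separable data, the solution essentially coincides with the hard-margin one (an approximation, not the exact hard-margin QP; the toy data are made up).

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.5], [3.0, 3.0], [4.0, 3.5]])
y = np.array([-1, -1, 1, 1])                            # two linearly separable classes

clf = SVC(kernel="linear", C=1e10).fit(X, y)
beta, beta0 = clf.coef_.ravel(), -clf.intercept_[0]     # hyperplane x^T beta - beta0 = 0
print("margin =", 1.0 / np.linalg.norm(beta))           # 1/||beta|| per the slides' convention
print(clf.predict([[0.5, 0.2], [3.5, 3.2]]))            # sign(x^T beta - beta0)
```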
Outline
• Reminder about supervised classification
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
When classes are not linearly separable
In such cases, the previous problem with hard constraints has no solution ⇒ we must relax it.

Hinge loss:   max { 0, 1 − yi (xi^T β − β0) }
  (it is zero for observations lying on the correct side of the slab, i.e. satisfying the previous constraints)

Relaxed optimization (soft margin):

    β̂, β̂0 := argmin_{β,β0}  (1/n) Σ_{i=1}^n max { 0, 1 − yi (xi^T β − β0) }  +  λ ∥β∥²
                                       (minimize mean loss)                    (maximize margin)

▶ λ > 0 is a mixing (trade-off) parameter
  (minimizing the mean loss means trying to satisfy the constraints as best as possible; this loss term competes with the margin term)
▶ when the classes are linearly separable, the problem with hard constraints is recovered by taking λ > 0 small enough
  (indeed, for λ small enough, the second term in the functional becomes negligible)
[Figure: a slab that cannot separate the data; the dashed segments from points (xi, ±1) on the wrong side to the slab boundary have length ∝ loss]
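A minimal sketch (not from the slides) minimizing this relaxed objective by subgradient descent; in practice one would rather use a QP solver or scikit-learn's LinearSVC, but the sketch makes the two competing terms explicit.

```python
# Subgradient descent on (1/n) sum_i max{0, 1 - y_i (x_i^T beta - beta0)} + lam * ||beta||^2.
import numpy as np

def soft_margin_svm(X, y, lam=0.01, lr=0.01, n_iter=2000):
    """X: (n, d) observations, y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    beta, beta0 = np.zeros(d), 0.0
    for _ in range(n_iter):
        margins = y * (X @ beta - beta0)
        active = margins < 1                              # points with nonzero hinge loss
        # subgradient of the mean hinge loss plus the regularization term
        g_beta = -(y[active, None] * X[active]).sum(axis=0) / n + 2 * lam * beta
        g_beta0 = y[active].sum() / n
        beta, beta0 = beta - lr * g_beta, beta0 - lr * g_beta0
    return beta, beta0
```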
When classes are not linearly separable
Conversion to a quadratic program:
▶ slack variables:   ξi := max { 0, 1 − yi (xi^T β − β0) }   (ξi measures the loss on the i-th constraint)
▶ substitution:

    β̂, β̂0, (ξ̂i)_{i=1}^n := argmin_{β, β0, (ξi)_{i=1}^n}  (1/n) Σ_{i=1}^n ξi + λ ∥β∥²
    subject to:   ∀i   ξi ≥ 0   and   yi (xi^T β − β0) ≥ 1 − ξi

  (this is a quadratic program in the unknowns β, β0, (ξi)_{i=1}^n, with a positive-definite quadratic term)
  (we only infringe the definition of ξi in order to remove the max from the constraints: we merely require ξi to be at least the hinge loss; at the optimum the inequality turns into an equality, ⇝ ξ̂i = max { 0, 1 − yi (xi^T β̂ − β̂0) } ∀i, because each ξi can be optimized independently of the others once β, β0 have been fixed)
When classes are not linearly separable
Interpretation of the slacks at the optimum:
▶ ξ̂i > 0: xi is on the wrong side of the slab boundary
  (indeed, in that case we have 1 − yi (xi^T β̂ − β̂0) = ξ̂i > 0)
▶ ξ̂i = 0:
  • yi (xi^T β̂ − β̂0) = 1: xi is on the slab boundary
  • yi (xi^T β̂ − β̂0) > 1: xi is on the correct side
  (the vectors that count to define the slab — the support vectors — are those achieving the equality yi (xi^T β̂ − β̂0) = 1)
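A short sketch (assumptions and names mine) recovering the slack values and the three cases above from a soft-margin linear SVM fitted with scikit-learn:

```python
# xi_i = max{0, 1 - y_i (x_i^T beta - beta0)}, computed from the fitted decision function.
import numpy as np
from sklearn.svm import SVC

def slack_report(X, y, C=1.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)          # y in {-1, +1}
    scores = clf.decision_function(X)                  # x_i^T beta - beta0
    xi = np.maximum(0.0, 1.0 - y * scores)
    on_boundary = np.isclose(y * scores, 1.0)          # xi = 0, on the slab boundary
    violators = xi > 1e-8                              # wrong side of the slab boundary
    return xi, on_boundary, violators
```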
Outline
• Reminder about supervised classification
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Multi-class SVM
Principle: convert the multi-class problem into multiple binary problems.
  (this is because SVM is essentially tied to binary classification)

▶ One-vs-all:
  • train 1 classifier (β̂^y, β̂0^y) for each class y = 1, · · · , κ, to discriminate class y (assigned label 1) from the rest of the data (assigned label −1)
  • assign each new observation x ∈ Rd to the class argmax_{y=1,··· ,κ} x^T β̂^y − β̂0^y
    (the class whose corresponding classifier gives the highest score wins the bet for x)

▶ One-vs-one:
  • train 1 classifier (β̂^{y,y′}, β̂0^{y,y′}) for each pair of classes y ≠ y′ ∈ {1, · · · , κ}, to discriminate y from y′
    (thus there are κ(κ−1)/2 binary classifiers, i.e. a quadratic quantity in the number of classes; here, discrimination is among the subpopulation spanned by the observations with labels y or y′)
Outline
• Reminder about supervised classification
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Kernel SVM
[Figure: SVM run directly in the data space — the linear classifier in the data space performs poorly on this example]

Map the data to a feature space via Φ : x ↦ [ x x² ] and run the SVM there:
[Figure: the same data after applying Φ; the classes become linearly separable, and the SVM in feature space separates them]
Kernel SVM
Quadratic program (hard margin / no slack — the soft-margin case is similar):

    argmin_{β,β0} ∥β∥²   subject to   yi (Φ(xi)^T β − β0) ≥ 1   ∀i = 1, · · · , n

  (note that β now lives in the feature space, which may be infinite-dimensional)

Representer Thm  ⟹  β̂ = Σ_{i=1}^n αi yi Φ(xi) = Σ_{i=1}^n αi yi k(xi, ·)

▶ substituting this linear combination for β yields a new quadratic program in the unknowns α, β0:

    argmin_{α,β0}  Σ_{i=1}^n Σ_{j=1}^n αi yi k(xi, xj) αj yj    subj. to    yi ( Σ_{j=1}^n αj yj k(xi, xj) − β0 ) ≥ 1   ∀i
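A minimal sketch contrasting the explicit feature map Φ(x) = [ x x² ] from the figure with the kernel trick, on 1-D toy data made up for illustration (the linear SVM in feature space and the degree-2 polynomial-kernel SVM play the same role):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
x = rng.uniform(-1.5, 1.5, size=200)
y = np.where(np.abs(x) > 0.75, 1, -1)          # not linearly separable in x alone

# (a) explicit feature map: linear SVM in the feature space [x, x^2]
phi = np.column_stack([x, x**2])
svm_explicit = SVC(kernel="linear").fit(phi, y)

# (b) kernel trick: degree-2 polynomial kernel, no explicit mapping needed
svm_kernel = SVC(kernel="poly", degree=2).fit(x[:, None], y)

print(svm_explicit.score(phi, y), svm_kernel.score(x[:, None], y))
```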
Experimental results
n = 100 + 100 (mixture of 2 Gaussians), d = 2

▶ SVM with deg.-4 polynomial kernel: error rate ≈ 24.5%
▶ SVM with Gaussian kernel: error rate ≈ 21.8%
  (the dashed lines are the slab's boundaries, as before; the decision boundary of the kernel SVM is compared to that of the Bayes classifier; the performance of the Gaussian kernel is particularly good here)

▶ logistic regression with deg.-4 polynomial kernel: error rate ≈ 26.3%
▶ logistic regression with Gaussian kernel: error rate ≈ 22.1%
  (the logistic regression in feature space is regularized; again, the performance of the Gaussian kernel is particularly good)
What you should know
• Two types of linear approaches for classification
• Logistic regression:
- generalized linear model
- fitting by likelihood maximization & Newton-Raphson’s method,
convergence guarantees
- degenerate cases & regularization
- extension to multi-class