Topic 3: INTRO TO CLASSIFICATION
STAT 37710/CAAM 37710/CMSC 35400 Machine Learning
Risi Kondor, The University of Chicago
[Hastie, Tibshirani & Friedman]
Classification
Two-class classification in R^n is the prototypical supervised ML problem.
• Input space: X (in the simplest case, X = R^n)
• Output space: Y = {−1, +1} (in the k-class case, Y = {1, . . . , k})
• Training set: S = {(x_1, y_1), (x_2, y_2), . . . , (x_m, y_m)}
• Goal: find a good hypothesis f : X → Y.
We will look at four simple classifiers:
1. Gaussian Discriminant Analysis
2. Logistic regression
3. k –nearest neighbors
4. The perceptron
1. Gaussian discriminant analysis
Model based classification
1. Assume that x and y come from a joint distribution p(x, y) (similar to
mixture models for clustering).
2. Define a marginal distribution for y (Bernoulli):
$$p_0(y = +1) = \pi, \qquad p_0(y = -1) = 1 - \pi.$$
3. Define the conditionals for x given y:
$$p(x \mid y = +1) = p_{+1}(x), \qquad p(x \mid y = -1) = p_{-1}(x).$$
4. Estimate the parameters of $p_{+1}$, $p_{-1}$, and $p_0$ from the training data.
5. Classify future examples according to
$$\hat{y} = \begin{cases} +1 & \text{if } p(y = +1 \mid x) \ge p(y = -1 \mid x) \\ -1 & \text{if } p(y = +1 \mid x) < p(y = -1 \mid x), \end{cases}$$
where the posterior is given by Bayes' rule: $p(y \mid x) \propto p_y(x)\, p_0(y)$.
This is called the Bayes optimal classifier.
Gaussian Discriminant Analysis
Assume that X = R^n and that $p_{+1}(x)$ and $p_{-1}(x)$ are both multivariate
normal (Gaussian) densities:
$$\mathcal{N}(x;\, \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\, |\Sigma|^{1/2}}\; e^{-(x-\mu)^\top \Sigma^{-1} (x-\mu)/2},$$
where $\mu \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$ is a symmetric positive definite matrix. Here
$$\mathbb{E}(x) = \mu, \qquad \mathrm{Cov}(x_i, x_j) = \Sigma_{i,j}.$$
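To make the formula concrete, here is a minimal NumPy sketch that evaluates this density (the function name and the use of solve() instead of an explicit matrix inverse are my choices, not from the slides):

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Evaluate the multivariate normal density N(x; mu, Sigma)."""
    n = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu); solve() avoids forming Sigma^{-1}.
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm_const = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-quad / 2) / norm_const
```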
Gaussian Discriminant Analysis
$$p(x, y) = \pi^{(1+y)/2}\, (1 - \pi)^{(1-y)/2}\; \frac{1}{(2\pi)^{n/2}\, |\Sigma_y|^{1/2}} \exp\!\left( -\frac{(x - \mu_y)^\top \Sigma_y^{-1} (x - \mu_y)}{2} \right)$$
How do we learn (estimate) the parameters {µ+1 , Σ+1 , µ−1 , Σ−1 , π} ?
Easier than clustering because y is not latent, so no need for EM.
Likelihood for GDA
$$L(\pi, \mu_{+1}, \mu_{-1}, \Sigma_{+1}, \Sigma_{-1}) = \prod_{i=1}^{m} \pi^{(1+y_i)/2}\, (1 - \pi)^{(1-y_i)/2}\; \frac{1}{(2\pi)^{n/2}\, |\Sigma_{y_i}|^{1/2}} \exp\!\left( -(x_i - \mu_{y_i})^\top \Sigma_{y_i}^{-1} (x_i - \mu_{y_i})/2 \right)$$
Log-likelihood for GDA
$$\ell(\pi, \mu_{+1}, \mu_{-1}, \Sigma_{+1}, \Sigma_{-1}) = \mathrm{cst} + \sum_{i=1}^{m} \left[ \frac{1+y_i}{2} \log \pi + \frac{1-y_i}{2} \log(1-\pi) \right]$$
$$+ \sum_{i\,:\, y_i = +1} \left[ -\frac{1}{2} \log |\Sigma_{+1}| - \frac{(x_i - \mu_{+1})^\top \Sigma_{+1}^{-1} (x_i - \mu_{+1})}{2} \right] + \sum_{i\,:\, y_i = -1} \left[ -\frac{1}{2} \log |\Sigma_{-1}| - \frac{(x_i - \mu_{-1})^\top \Sigma_{-1}^{-1} (x_i - \mu_{-1})}{2} \right]$$
Observation: the part of the likelihood relating to the y = +1 Gaussian completely
separates from the part relating to the y = −1 Gaussian, which, in turn, separates
from the Bernoulli part. So these three parts can be maximized separately.
MLE for two-Gaussians model
• MLE for π:
$$\hat{\pi} = \frac{n_{+1}}{m} = \frac{\text{number of training examples with } y_i = +1}{\text{total number of training examples}}$$
• MLE for $(\mu_{+1}, \Sigma_{+1})$ and $(\mu_{-1}, \Sigma_{-1})$:
$$\hat{\mu}_{+1} = \frac{1}{n_{+1}} \sum_{i\,:\, y_i = +1} x_i \qquad \hat{\Sigma}_{+1} = \frac{1}{n_{+1}} \sum_{i\,:\, y_i = +1} (x_i - \hat{\mu}_{+1})(x_i - \hat{\mu}_{+1})^\top$$
$$\hat{\mu}_{-1} = \frac{1}{n_{-1}} \sum_{i\,:\, y_i = -1} x_i \qquad \hat{\Sigma}_{-1} = \frac{1}{n_{-1}} \sum_{i\,:\, y_i = -1} (x_i - \hat{\mu}_{-1})(x_i - \hat{\mu}_{-1})^\top$$
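To see all the pieces together, here is a minimal NumPy sketch of GDA training and prediction (the function names and dictionary layout are mine; the normalizing constant $(2\pi)^{n/2}$ is dropped because it is common to both classes):

```python
import numpy as np

def fit_gda(X, y):
    """MLE for GDA: X is (m, n), y is (m,) with entries in {-1, +1}."""
    params = {"pi": np.mean(y == +1)}
    for c in (+1, -1):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        diff = Xc - mu
        params[c] = (mu, diff.T @ diff / len(Xc))   # (mu_c, Sigma_c)
    return params

def predict_gda(params, x):
    """Classify x by comparing the joint densities p(x, y = +1) and p(x, y = -1)."""
    def log_joint(c, prior):
        mu, Sigma = params[c]
        diff = x - mu
        quad = diff @ np.linalg.solve(Sigma, diff)
        _, logdet = np.linalg.slogdet(Sigma)        # numerically stable log |Sigma|
        return np.log(prior) - 0.5 * (logdet + quad)
    return +1 if log_joint(+1, params["pi"]) >= log_joint(-1, 1 - params["pi"]) else -1
```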
2. Logistic Regression
Logistic regression
• Assume that the conditional distribution of y is
$$P(y = +1 \mid x) = h(x), \qquad P(y = -1 \mid x) = 1 - h(x).$$
In other words, $y \mid x \sim \mathrm{Bernoulli}(h(x))$.
• Set h(x) to be a nice function $\mathbb{R}^n \to [0, 1]$, in particular the logistic
function (the sigmoid function of θ · x):
$$h(x) = \frac{1}{1 + e^{-\theta \cdot x}}.$$
This is similar to how in model based regression we took $y \mid x \sim \mathcal{N}(f(x), \sigma^2)$.
The sigmoid function
The sigmoid function $g(z) = \frac{1}{1 + e^{-z}}$ has the nice property that
$$g'(z) = \frac{d}{dz}\, \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = g(z)\, \frac{(1 + e^{-z}) - 1}{1 + e^{-z}} = g(z)\,(1 - g(z)).$$
MLE for logistic regression
Defining for convenience $u_i = \frac{1 + y_i}{2}$, the likelihood is
$$L(\theta) = \prod_{i=1}^{m} h(x_i)^{u_i}\, (1 - h(x_i))^{1 - u_i}.$$
The log-likelihood is
$$\ell(\theta) = \sum_{i=1}^{m} \Big[\, u_i \log h(x_i) + (1 - u_i) \log(1 - h(x_i)) \,\Big].$$
Unlike in GDA, this cannot be maximized in closed form. → Try stochastic
gradient descent.
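The gradient needed for this is worth writing out once (a standard one-line derivation using the sigmoid identity g′ = g(1 − g) from the previous slide; it justifies the SGD update a few slides below):
$$\frac{\partial \ell}{\partial \theta} = \sum_{i=1}^{m} \left[ \frac{u_i}{h(x_i)} - \frac{1 - u_i}{1 - h(x_i)} \right] h(x_i)\,(1 - h(x_i))\, x_i = \sum_{i=1}^{m} \big( u_i - h(x_i) \big)\, x_i.$$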
Gradient descent
Given a loss function J, we can minimize it using gradient descent:
θ ← 0
until (convergence) {
    θ ← θ − α ∇J(θ)
}
α is a parameter called the learning rate.
Gradient descent is the "poor man's algorithm" of optimization: it works in
almost any setting, but sometimes not very well, and choosing α can be tricky.
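A minimal Python sketch of this loop (the quadratic test objective, tolerance, and step size are illustrative choices, not from the slides):

```python
import numpy as np

def gradient_descent(grad_J, theta0, alpha=0.1, tol=1e-8, max_iters=10_000):
    """Generic gradient descent: repeat theta <- theta - alpha * grad J(theta)."""
    theta = theta0
    for _ in range(max_iters):
        step = alpha * grad_J(theta)
        theta = theta - step
        if np.linalg.norm(step) < tol:   # crude convergence test
            break
    return theta

# Example: minimize J(theta) = ||theta - 3||^2, whose gradient is 2 (theta - 3).
theta_star = gradient_descent(lambda th: 2 * (th - 3.0), np.zeros(2))
print(theta_star)   # approximately [3., 3.]
```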
Recap: the gradient
The gradient of a function $J : \mathbb{R}^p \to \mathbb{R}$ is the vector ∇J(u) with
$$[\nabla J(u)]_i = \frac{\partial}{\partial u_i} J(u).$$
The gradient always points in the steepest uphill direction (the slide's figure
shows the negative gradient, i.e., the descent direction). At a minimum/maximum
(or saddle point) ∇J = 0.
The learning rate α
Setting the learning rate can be quite tricky. Should we decrease the learning
rate as we get closer to the optimum?
Gradient descent
Gradient descent trajectories can be more complicated than you would
expect.
SGD for logistic regression
θ ← 0
until (convergence) {
    for (i = 1 to m) {
        θ ← θ − α (h(x_i) − u_i) x_i
    }
}
Stochastic gradient descent performs gradient descent with respect to the loss
(negative log-likelihood) of individual examples, or of small randomly selected
sets of examples called minibatches.
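A runnable sketch of this loop (the epoch count and random visiting order are my choices; labels are mapped to $u_i = (1 + y_i)/2$ as defined above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, alpha=0.1, epochs=100, seed=0):
    """SGD on the negative log-likelihood; X is (m, n), y has entries in {-1, +1}."""
    rng = np.random.default_rng(seed)
    u = (1 + y) / 2                        # u_i in {0, 1}
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):  # visit the examples in random order
            h = sigmoid(theta @ X[i])
            theta -= alpha * (h - u[i]) * X[i]
    return theta
```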
3. k –nearest neighbors
k -nearest neighbors
[Figure: decision boundaries of the 1-nearest neighbor vs. 15-nearest neighbor classifiers]
The hypothesis returned by the k-NN classifier is
$$\hat{f}(x) = \mathrm{sgn}\!\left( \sum_{i \in \mathrm{NN}_k(x)} y_i \right),$$
where $\mathrm{NN}_k(x) = \{i_1, \ldots, i_k\}$ are the indices of the k points in S
closest to x. (When k is even and $\sum_{i \in \mathrm{NN}_k(x)} y_i = 0$, break ties
arbitrarily.)
What effect does k have on the behavior of k -NN?
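A minimal sketch of the classifier (Euclidean distance and brute-force search are my choices here; practical implementations use spatial data structures such as k-d trees):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=15):
    """k-NN: sign of the label sum over the k training points nearest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    vote = y_train[nearest].sum()
    return +1 if vote >= 0 else -1               # ties broken arbitrarily (here: +1)
```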
4. The Perceptron
Linear classifiers
The Perceptron belongs to the wide class of linear classifiers.
Input space: X = R^n
Output space: Y = {−1, +1}
Hypothesis (affine hyperplane):
f (x) = sgn(w · x + b)
• w is the normal to the
separating hyperplane
• b is the bias
Empirical error:
$$E_{\mathrm{train}}(f) = \frac{1}{m} \sum_{i=1}^{m} \ell_{0/1}(f(x_i), y_i),$$
where $\ell_{0/1}(\hat{y}, y) = 0$ if $\hat{y} = y$, and $\ell_{0/1}(\hat{y}, y) = 1$ otherwise.
Of all possible hyperplanes that separate the data which one is best?
Inspiration: Artificial Neural Networks
$$o_j = \varphi\!\left( \theta_j + \sum_i w_{i,j}\, x_i \right)$$
An old idea going back to [McCulloch & Pitts, 1943].
The perceptron (Rosenblatt, 1958)
Consider the following simplified scenario:
• ‖x_i‖ = 1 for all i.
• Assume that all datapoints are labeled by the ground truth concept
$$h_{\mathrm{true}}(x) = \begin{cases} 1 & \text{if } v \cdot x \ge 0 \\ -1 & \text{otherwise,} \end{cases}$$
where v is some fixed vector with, w.l.o.g., ‖v‖ = 1.
• Since the separator of h_true passes through the origin, set b = 0 for all h.
• We want to solve the problem in its online form.
The perceptron (Rosenblatt, 1958)
w ← 0 ;
t ← 1 ;
while (true) {
    if w · x_t ≥ 0 predict ŷ_t = 1 ; else predict ŷ_t = −1 ;
    if ((ŷ_t = −1) and (y_t = 1)) let w ← w + x_t ;
    if ((ŷ_t = 1) and (y_t = −1)) let w ← w − x_t ;
    t ← t + 1 ;
}
Note: In this form, the Perceptron is an online learning algorithm.
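A compact sketch of the same loop over a finite stream of examples (the array-based representation is mine; the unit-norm assumption from the setup is the caller's responsibility):

```python
import numpy as np

def perceptron(X, y):
    """Online perceptron with b = 0; X is (m, n), y has entries in {-1, +1}."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for x_t, y_t in zip(X, y):
        y_hat = 1 if w @ x_t >= 0 else -1
        if y_hat != y_t:
            w += y_t * x_t    # one line covers both update rules: +x_t on a false
            mistakes += 1     # negative (y_t = 1), -x_t on a false positive (y_t = -1)
    return w, mistakes
```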
The margin
The projection of x onto the line orthogonal to the decision boundary
w · x + b = 0 is
$$Px = \frac{w \cdot x}{\|w\|}.$$
On the boundary,
$$w \cdot x = -b \quad \Longrightarrow \quad Px = \frac{-b}{\|w\|}.$$
The margins of a correctly classified positive resp. negative example are
$$\delta_{(x,+1)} = Px - \frac{-b}{\|w\|}, \qquad \delta_{(x,-1)} = \frac{-b}{\|w\|} - Px,$$
so the general formula is $\delta_{(x,y)} = y\, (w \cdot x + b) / \|w\|$.
The perceptron convergence thm
Suppose that the perceptron is run on a sequence of examples with
‖x_t‖ = 1 and y_t (v · x_t) ≥ 0 for some v with ‖v‖ = 1. In addition,
assume that the data obeys |v · x_t| ≥ δ for all x_t (a margin condition).
Claim 1: After M mistakes, w · v ≥ δM.
Initially, w · v = 0. After a false negative, w ← w + x_t, and
(w + x_t) · v ≥ (w · v) + δ, so w · v increases by at least δ. Similarly
for a false positive, where w ← w − x_t and −v · x_t ≥ δ.
The perceptron convergence thm
Claim 2: After M mistakes, ‖w‖² ≤ M.
Initially, ‖w‖ = 0. On a false negative, w ← w + x_t, and
$$\|w + x_t\|^2 = (w + x_t) \cdot (w + x_t) = \|w\|^2 + 2\, w \cdot x_t + \|x_t\|^2.$$
The third term on the RHS is 1, while the second term must be negative,
since we predicted ŷ_t = −1 precisely because w · x_t < 0. Therefore ‖w‖²
can increase by at most 1. Similarly for false positives.
The perceptron convergence thm
Theorem. If the perceptron is run on a sequence of examples with
|v · x_t| ≥ δ (plus our other assumptions), then it will make at most 1/δ²
mistakes.
Proof. After M mistakes, combining the two claims,
$$\delta M \;\le\; w \cdot v \;\le\; \|w\| \;\le\; \sqrt{M}$$
(the middle inequality is Cauchy–Schwarz with ‖v‖ = 1), so √M ≤ 1/δ, i.e.,
M ≤ 1/δ². ■
Affine hyperplanes
Question: What if we don't want the separator to go through the origin?
Add an extra feature with [x_t]_{n+1} = 1 for all t (see the sketch below):
$$x_1 w_1 + x_2 w_2 + \cdots + x_n w_n \ge \theta \;\iff\; x_1 w_1 + x_2 w_2 + \cdots + x_n w_n + x_{n+1} w_{n+1} \ge 0$$
with $w_{n+1} = -\theta$.
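In code the trick is a one-liner; a sketch with hypothetical array names:

```python
import numpy as np

X = np.random.randn(100, 5)                        # hypothetical data, shape (m, n)
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # append the constant feature x_{n+1} = 1
# A weight vector of length n+1 learned on X_aug encodes an affine separator on X,
# with the last coordinate playing the role of -theta.
```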
Question: How can we make linear threshold functions more expressive?
Towards multiclass
Rewrite the perceptron as a competition between w+ and w− (labels are
written as {0, 1} here):
w+ ← (0, 0, . . . , 0) ;
w− ← (0, 0, . . . , 0) ;
t ← 1 ;
while (true) {
    if w+ · x_t ≥ w− · x_t predict ŷ_t = 1 ; else predict ŷ_t = 0 ;
    if ((ŷ_t = 0) and (y_t = 1)) let w+ ← w+ + x_t ;
    if ((ŷ_t = 1) and (y_t = 0)) let w− ← w− + x_t ;
    t ← t + 1 ;
}
Towards multiclass
If y_t ∈ {1, 2, . . . , κ}, maintain κ different weight vectors
w_1, w_2, . . . , w_κ and predict ŷ_t = argmax_c (w_c · x_t).
On erroneously classifying example x_t as class i instead of class j:
• w_i ← w_i − x_t/2
• w_j ← w_j + x_t/2
[Collins & Duffy, 2002]
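A sketch of this κ-class version (the argmax prediction rule is as above; 0-based labels and the data layout are my adaptation):

```python
import numpy as np

def multiclass_perceptron(X, y, kappa):
    """Multiclass perceptron; X is (m, n), y holds labels in {0, ..., kappa - 1}."""
    W = np.zeros((kappa, X.shape[1]))    # one weight vector per class
    for x_t, y_t in zip(X, y):
        y_hat = int(np.argmax(W @ x_t))  # predict the highest-scoring class
        if y_hat != y_t:
            W[y_hat] -= x_t / 2          # demote the wrongly predicted class
            W[y_t] += x_t / 2            # promote the true class
    return W
```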
General concepts of supervised learning
The general framework
• There is an unknown distribution p on X × Y.
• The training set S = {(x_1, y_1), (x_2, y_2), . . . , (x_m, y_m)} is an IID
sample from p.
• The hypothesis is a function $\hat{f} : X \to Y$.
• The learning algorithm is a map $A : S \mapsto \hat{f}$.
• The hypothesis space F is the space of functions accessible to A.
We want $\hat{f}$ to be good on future examples drawn from p, where "good" is
defined in terms of a loss function $\ell : Y \times Y \to \mathbb{R}_+$.
How can we ever be sure of finding a good $\hat{f}$, especially when F is
huge? This is what Statistical Learning Theory is about.
Three notions of error
• The training error or empirical error:
$$E_{\mathrm{emp}}[\hat{f}] = \frac{1}{m} \sum_{i=1}^{m} \ell(\hat{f}(x_i), y_i)$$
This is what we can actually measure, and what many learning algorithms
try to minimize. → Empirical Risk Minimization (ERM)
• The testing error, computed on a separate set of m′ examples $(x'_i, y'_i)$:
$$E_{\mathrm{test}}[\hat{f}] = \frac{1}{m'} \sum_{i=1}^{m'} \ell(\hat{f}(x'_i), y'_i)$$
This is what the algorithm is going to be evaluated on.
• The true error:
$$E_{\mathrm{true}}[\hat{f}] = \mathbb{E}_{(x,y)\sim p}\; \ell(\hat{f}(x), y).$$
This is what an ideal algorithm would minimize, but it cannot be measured,
since p is unknown.
Overfitting
In general, because $\hat{f}$ is explicitly chosen to minimize $E_{\mathrm{emp}}[\hat{f}]$, the empirical
error will tend to be an overoptimistic estimate of the true error, i.e., typically
$E_{\mathrm{true}}[\hat{f}] > E_{\mathrm{emp}}[\hat{f}]$ and $E_{\mathrm{test}}[\hat{f}] > E_{\mathrm{emp}}[\hat{f}]$. This is called overfitting.
Question: How can we avoid overfitting? Restrict the hypothesis space! But by
how much?
Overfitting and underfitting
This is the biggest thing to be paranoid about in all branches of ML.
The overfitting/underfitting tradeoff
The hypothesis space should be big enough, but not too big. Alternatively, we
need to add a regularizer (cf. inverse problems in applied math).
Regularized Risk Minimization (RRM)
The most general framework for finding the right balance between overfitting
and underfitting in supervised learning is RRM:
$$\hat{f} = \underset{f \in \mathcal{F}}{\arg\min}\; \underbrace{\frac{1}{m} \sum_{i=1}^{m} \ell(f(x_i), y_i)}_{\text{training error}} \;+\; \lambda\, \underbrace{\Omega(f)}_{\text{regularizer}}$$
Terminology:
1. $\mathcal{F}$ : hypothesis space
2. $\ell(\hat{y}, y)$ : loss function
3. $\Omega : \mathcal{F} \to \mathbb{R}_+$ : regularization functional
The right compromise is found by tuning the regularization parameter λ.
[Tikhonov regularization]
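As a concrete instance, a sketch of an RRM objective for logistic regression with $\Omega(f) = \|\theta\|^2$ (the standard L2/Tikhonov choice; the function name and the numerical guard are mine):

```python
import numpy as np

def regularized_risk(theta, X, y, lam):
    """RRM objective: mean logistic loss plus lam * ||theta||^2."""
    u = (1 + y) / 2                      # labels mapped to {0, 1}
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    eps = 1e-12                          # guard against log(0)
    loss = -(u * np.log(h + eps) + (1 - u) * np.log(1 - h + eps)).mean()
    return loss + lam * theta @ theta
```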
Statistical Learning Theory
Concentration of measure
The reason that we have any hope of learning a good $\hat{f}$ at all is that for any
fixed f ∈ F, over the choice of training/testing sets,
$$\mathbb{E}(E_{\mathrm{emp}}[f]) = \mathbb{E}(E_{\mathrm{test}}[f]) = E_{\mathrm{true}}[f].$$
Moreover, as the sizes of the training/testing sets increase, these quantities
become more and more concentrated around $E_{\mathrm{true}}[f]$.
However, a very unlucky choice of training set can always mess us up.
Generalization Bounds
Statistical Learning Theory proves bounds of the form
$$P\big[\, E_{\mathrm{true}}[\hat{f}] > E_{\mathrm{emp}}[\hat{f}] + \epsilon \,\big] < \delta,$$
where ε is a complicated function depending on δ, the size of the function
class F, and the nature of the regularization. → Probably
Approximately Correct (PAC) bounds [Valiant, 1984]
Example: if the Vapnik–Chervonenkis dimension of F is d, then
$$P\left[\, E_{\mathrm{true}}[\hat{f}] > E_{\mathrm{emp}}[\hat{f}] + \sqrt{\frac{d\left( \log \frac{2m}{d} + 1 \right) + \log(4/\delta)}{m}} \,\right] \le \delta.$$
In reality, even the best such bounds are ridiculously loose.
Cross validation
Holdout set
The generalization bounds from statistical learning theory are framed entirely
in terms of the training set, and therefore they reveal fundamental properties
of the learning algorithm. Unfortunately, they are almost always incredibly
loose.
The more practical way to estimate E_true is to split the training data into:
• The true training set, used to train the algorithm
• A holdout set {(x_1, y_1), (x_2, y_2), . . . , (x_p, y_p)}, which is only used to
compute the holdout error
$$E_{\mathrm{h.o.}}[\hat{f}] = \frac{1}{p} \sum_{i=1}^{p} \ell(\hat{f}(x_i), y_i).$$
Because the holdout examples play no role in training, the holdout error is an
unbiased estimator of $E_{\mathrm{true}}[\hat{f}]$, i.e., $\mathbb{E}[E_{\mathrm{h.o.}}[\hat{f}]] = E_{\mathrm{true}}[\hat{f}]$.
Cross Validation
k-fold cross validation uses the holdout idea to set internal parameters θ
of learning algorithms (e.g., k in k-NN):
• Set θ to some value.
• Split the training set into k roughly equal parts S_1, S_2, . . . , S_k.
• For each i, train the algorithm on S_1, . . . , S_{i−1}, S_{i+1}, . . . , S_k to get $\hat{f}_i$.
• Compute the holdout error of each $\hat{f}_i$ on the corresponding S_i.
• Average these errors to get an estimate of $E_{\mathrm{true}}(\hat{f}_\theta)$.
• Iterate over a range of θ values, and finally set θ to the value that
minimizes the cross-validation error.
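A sketch of the inner loop (the fold construction and the generic train/error callables are illustrative):

```python
import numpy as np

def cross_val_error(train_fn, error_fn, X, y, k=5, seed=0):
    """k-fold CV: average the holdout error of the k models, one per left-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        f_hat = train_fn(X[train_idx], y[train_idx])              # train on all folds but S_i
        errors.append(error_fn(f_hat, X[folds[i]], y[folds[i]]))  # holdout error on S_i
    return float(np.mean(errors))

# Usage: evaluate cross_val_error over a grid of theta values (e.g., k in k-NN)
# and keep the theta with the smallest cross-validation error.
```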
Review of algorithms so far
              | Clustering           | Classification
------------- | -------------------- | -------------------------------
Algorithmic   | k-means              | k-nearest neighbors, perceptron
Probabilistic | mixture of Gaussians | GDA, logistic regression