01-Linear Regression-Part 2

The document discusses generalization in machine learning, focusing on the ability of models to perform well on unseen data and the concepts of training and test errors. It explains overfitting and underfitting, the bias-variance tradeoff, and the importance of regularization to prevent overfitting. Additionally, it introduces probabilistic regression and maximum likelihood estimation as methods for modeling relationships between inputs and outputs with uncertainty.

Machine Learning (CE 40477)


Fall 2024

Ali Sharifi-Zarchi

CE Department
Sharif University of Technology

September 30, 2024


1 Generalization

2 Probabilistic regression

3 References


Generalization Overview

Main Idea: The ability of a model to perform well on unseen data


• Training Set: D = {(x_i, y_i)}_{i=1}^{n}
• Test Set: New data not seen during training
• Cost Function: Measures how well the model fits the data

J(w) = \sum_{i=1}^{n} \left( y^{(i)} - h_w(x^{(i)}) \right)^2

• Objective: Minimize the cost function on unseen data (generalization error)


Expected Test Error

Definition: Expected performance on unseen data


• Test data sampled from the same distribution p(x, y)

J(w) = Ep(x,y) [(y − hw (x))2 ]

• Approximate it with the empirical average Ĵ(w) computed on a test set that is separate from the training sample


• Generalization error is the gap between training and test performance.
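As a concrete illustration, the expectation above can be approximated by Monte Carlo sampling; the following is a minimal sketch, assuming a toy joint distribution p(x, y) and a fixed linear hypothesis (all specific values are illustrative assumptions, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x, w0=0.0, w1=2.0):
    """A fixed linear hypothesis h_w(x) = w0 + w1 * x."""
    return w0 + w1 * x

def sample_pxy(n):
    """Toy joint distribution p(x, y): y = 2x + 1 plus Gaussian noise."""
    x = rng.uniform(0, 1, n)
    y = 2 * x + 1 + rng.normal(0, 0.1, n)
    return x, y

# J(w) = E_{p(x,y)}[(y - h_w(x))^2], approximated with a large fresh sample
x, y = sample_pxy(100_000)
print("estimated expected test error:", np.mean((y - h(x)) ** 2))
```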


Training vs Test Error

Key Concept: Training error measures fit on known data, test error on unseen data
• Training (empirical) error:

J_{train}(w) = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - h_w(x^{(i)}) \right)^2

• Test error:
J_{test}(w) = \frac{1}{m} \sum_{i=1}^{m} \left( y_{test}^{(i)} - h_w(x_{test}^{(i)}) \right)^2
• Goal: Minimize the test error (generalization).
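The two error formulas translate directly into code; a minimal sketch under assumed toy data (the noisy sinusoid and the deliberately flexible degree-9 fit are assumptions chosen only to make the train/test gap visible):

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(y, y_hat):
    """Empirical mean squared error (1/n) * sum_i (y_i - y_hat_i)^2."""
    return np.mean((y - y_hat) ** 2)

# toy data: noisy sinusoid, split into a training and a test portion
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 30)
x_tr, y_tr = x[:20], y[:20]
x_te, y_te = x[20:], y[20:]

w = np.polyfit(x_tr, y_tr, deg=9)                    # fit only on the training set
print("J_train:", mse(y_tr, np.polyval(w, x_tr)))
print("J_test :", mse(y_te, np.polyval(w, x_te)))    # typically much larger
```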


Overfitting Definition

Concept: A model fits the training data well but performs poorly on the test set

Jtrain (w) ≪ Jtest (w)

• Causes: Model too complex, high variance


• Consequence: Captures noise in training data, fails on unseen data


Underfitting Definition

Concept: The model is too simple and cannot capture the structure of the data

Jtrain (w) ≈ Jtest (w) ≫ 0

• Causes: Model lacks complexity, high bias


• Consequence: Poor fit on both training and test data


Generalization: polynomial regression

[Figure: polynomial fits of degree 1 and degree 3]

Example adapted from slides of Dr. Soleymani, ML course, Sharif University of technology.

Overfitting: polynomial regression

[Figure: polynomial fits of degree 5 and degree 7]

Example adapted from slides of Dr. Soleymani, ML course, Sharif University of technology.

Polynomial regression with various degrees: example

Figures adapted from Machine Learning and Pattern Recognition, Bishop



Polynomial regression with various degrees: example (cont.)

Figures adapted from Machine Learning and Pattern Recognition, Bishop



Root mean squared error

E_{RMS} = \sqrt{ \frac{ \sum_{i=1}^{n} \left( y^{(i)} - f(x^{(i)}; w) \right)^2 }{ n } }

Figures adapted from Machine Learning and Pattern Recognition, Bishop
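The kind of experiment behind such figures can be reproduced approximately: compute E_RMS on a small training set and on a held-out set for increasing polynomial degree. A sketch in the spirit of Bishop's example (the sinusoidal toy data and the degree range 0-9 are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def e_rms(y, y_hat):
    """Root-mean-squared error: sqrt( sum_i (y_i - y_hat_i)^2 / n )."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

# small training set, larger held-out set, both from the same noisy sinusoid
x_tr = rng.uniform(0, 1, 10);  y_tr = np.sin(2 * np.pi * x_tr) + rng.normal(0, 0.3, 10)
x_te = rng.uniform(0, 1, 100); y_te = np.sin(2 * np.pi * x_te) + rng.normal(0, 0.3, 100)

for deg in range(10):
    w = np.polyfit(x_tr, y_tr, deg)
    tr = e_rms(y_tr, np.polyval(w, x_tr))   # decreases as the degree grows
    te = e_rms(y_te, np.polyval(w, x_te))   # starts rising once the model overfits
    print(f"degree {deg}: train {tr:.3f}  test {te:.3f}")
```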



Bias-Variance Decomposition

Generalization error decomposition (very important):

E[(y − h_w(x))^2] = (Bias)^2 + Variance + Noise

• Bias: Error due to simplifying assumptions in the model

Bias(x) = E[hw (x)] − f (x)

• Variance: Sensitivity of the model to training data

Variance(x) = E[(hw (x) − E[hw (x)])2 ]

• Noise: Irreducible error from the inherent randomness in data
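The three terms can be estimated empirically by refitting the model on many resampled training sets and evaluating at a fixed query point; a minimal sketch, assuming a toy ground-truth function, noise level, and model degree (all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)          # assumed ground-truth function
sigma = 0.3                                  # noise std, so Noise = sigma**2
x0, deg, n, trials = 0.5, 3, 15, 2000        # query point, model degree, set size, repeats

preds = np.empty(trials)
for t in range(trials):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)       # fresh training set for each trial
    w = np.polyfit(x, y, deg)
    preds[t] = np.polyval(w, x0)             # h_w(x0) for this training set

bias2 = (preds.mean() - f(x0)) ** 2          # (E[h_w(x0)] - f(x0))^2
variance = preds.var()                       # E[(h_w(x0) - E[h_w(x0)])^2]
print(bias2, variance, sigma ** 2)           # their sum approximates E[(y - h_w(x0))^2]
```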


Bias-Variance Decomposition Proof

Assume f(x) is the ground truth and y is a noisy observation, y = f(x) + ϵ, where ϵ ∼ N(0, σ²). We start with the definition of the expected squared error:

E_{data}\left[ \left( \hat{f}(x) - y \right)^2 \right] = E_{data}\left[ \left( \hat{f}(x) - f(x) - \epsilon \right)^2 \right]
= E\left[ \left( \hat{f}(x) - f(x) \right)^2 - 2\epsilon \left( \hat{f}(x) - f(x) \right) + \epsilon^2 \right]

Since we assume the noise ϵ has zero mean and variance σ², we have E[ϵ] = 0 and thus:

E[\epsilon^2] = \sigma^2

Since E[ϵ] = 0 and ϵ is independent of (f̂(x) − f(x)), the cross term vanishes:

E\left[ -2\epsilon \left( \hat{f}(x) - f(x) \right) \right] = 0

Bias-Variance Decomposition Proof (cont.)

Now, we decompose the squared difference (f̂(x) − f(x))² by adding and subtracting E[f̂(x)]:

E\left[ \left( \hat{f}(x) - f(x) \right)^2 \right] = E\left[ \left( \hat{f}(x) - E[\hat{f}(x)] + E[\hat{f}(x)] - f(x) \right)^2 \right]

Expanding this further:

= \underbrace{E\left[ \left( \hat{f}(x) - E[\hat{f}(x)] \right)^2 \right]}_{\text{variance}} + \underbrace{\left( E[\hat{f}(x)] - f(x) \right)^2}_{\text{bias}^2} + 2\, E\left[ \left( \hat{f}(x) - E[\hat{f}(x)] \right) \left( E[\hat{f}(x)] - f(x) \right) \right]

Since E[f̂(x)] − f(x) is a constant with respect to the expectation over training sets, the cross term factors out and vanishes:

E\left[ \left( \hat{f}(x) - E[\hat{f}(x)] \right) \left( E[\hat{f}(x)] - f(x) \right) \right] = \left( E[\hat{f}(x)] - f(x) \right) \left( E[\hat{f}(x)] - E[\hat{f}(x)] \right) = 0


Bias-Variance Decomposition Proof (cont.)

Thus, the expected squared error becomes:


E_{data}\left[ \left( \hat{f}(x) - y \right)^2 \right] = Variance + Bias^2 + \sigma^2

where:

• Variance is E\left[ \left( \hat{f}(x) - E[\hat{f}(x)] \right)^2 \right]

• Bias is E[\hat{f}(x)] - f(x)

• Noise is \sigma^2


High Bias in Simple Models

Explanation: Simple models, such as linear regression, often underfit

h_w(x) = w_0 + w_1 x

• Bias remains large even with infinite data

Bias2 ≫ Variance

• Leads to large generalization error


High Variance in Complex Models

Explanation: Complex models tend to overfit

h_w(x) = w_0 + w_1 x + w_2 x^2 + · · · + w_m x^m

• Variance dominates when the model is too complex

Variance ≫ Bias

• Fits noise, leading to high test error


Bias-Variance Tradeoff

Tradeoff: Balancing between bias and variance is key for optimal performance
• Low complexity: High bias, low variance
• High complexity: Low bias, high variance


Regularization

Purpose: Prevent overfitting by penalizing large weights

Jλ (w) = J(w) + λR(w)

• Common regularizers: L1 and L2 norms


• λ controls the balance between fit and simplicity
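For the L2 (ridge) case, the penalized least-squares problem has a closed-form solution; a minimal sketch (the toy data, degree-9 polynomial features, and λ values are illustrative assumptions, and for simplicity all weights including the bias are penalized, matching the λwᵀw form used on the next slide):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - Xw||^2 + lam * ||w||^2  ->  w = (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 10)
X = np.vander(x, N=10, increasing=True)      # degree-9 polynomial features

for lam in (0.0, 1e-3, 1.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda={lam:g}  max|w_j|={np.max(np.abs(w)):.2f}")  # larger lambda shrinks weights
```

Printing the weight magnitudes for several λ values mirrors the effect shown in the weight table a few slides later: the unregularized fit has huge coefficients, and they shrink as λ grows.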


Effect of Regularization Parameter λ

Balancing Fit and Complexity:


J_\lambda(w) = J(w) + \lambda \sum_{j=1}^{m} w_j^2 = J(w) + \lambda w^T w

• Large λ: Forces smaller weights, reduces complexity, increases bias, decreases variance
• Small λ: Allows larger weights, increases complexity, reduces bias, increases variance


Effect of Regularization parameter λ

J_\lambda(w) = \sum_{n=1}^{N} \left( t^{(n)} - f(x^{(n)}; w) \right)^2 + \lambda w^T w

f(x^{(n)}; w) = w_0 + w_1 x + \dots + w_9 x^9

Figures adapted from Machine Learning and Pattern Recognition, Bishop



Effect of regularization on weights

Table adapted from Machine Learning and Pattern Recognition, Bishop



Regularization parameter

• λ controls the effective complexity of the model
• and hence the degree of overfitting

Figures adapted from Machine Learning and Pattern Recognition, Bishop




Introduction to Regression (Probabilistic Perspective)

• Objective: Model the relationship between input x and output y.


• Uncertainty: Output y has an associated uncertainty modeled by a probability
distribution.
• Example:
y = f (x; w) + ϵ , ϵ ∼ N (0, σ2 )
• The goal is to learn f (x; w) to predict y.
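A tiny sketch of this generative view: sample y as a deterministic function of x plus Gaussian noise (the particular f and σ below are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, w=(1.0, 2.0)):
    """Assumed underlying function f(x; w) = w0 + w1 * x."""
    return w[0] + w[1] * x

sigma = 0.5
x = rng.uniform(0, 5, 8)
y = f(x) + rng.normal(0, sigma, size=x.shape)   # y = f(x; w) + eps,  eps ~ N(0, sigma^2)
print(np.c_[x, y])                              # the observed (x, y) pairs carry noise
```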


Curve Fitting with Noise

• In real-world scenarios, observed output y is noisy.


• Model: True output plus noise

y = f (x; w) + ϵ

• Noise represents unknown or unmodeled factors.


• Example: Predicting house prices based on features with inherent unpredictability.


Expected Value of Output

• Best Estimate: The conditional expectation of y given x.

E[y|x] = f (x; w)

• Goal: Learn a function f (x; w) that represents the average behavior of the data.
• Key Point: The model captures the mean of the target variable given input x.


Maximum Likelihood Estimation (MLE)

• MLE: A method to estimate parameters that maximize the likelihood of the data.
• Given data D = {(x_i, y_i)}_{i=1}^{n}, MLE maximizes:

L(D; w, \sigma^2) = \prod_{i=1}^{n} p(y_i \mid x_i, w, \sigma^2)

• MLE finds parameters w and σ2 that best explain the data.


Maximum Likelihood Estimation (cont.)

• Instead of maximizing the likelihood, it is often easier to maximize the log-likelihood:

\log L(D; w, \sigma^2) = \sum_{i=1}^{n} \log p(y_i \mid x_i, w, \sigma^2)
• This works because log is a monotonically increasing function, so it preserves the location of the maximum of f(x).
• It is also easier to differentiate a sum of terms than a product.


Univariate Linear Function Example

• Assuming Gaussian noise with parameters (0, σ²), the probability of observing the real output value y is:

p(y \mid x, w, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y - f(x; w))^2}{2\sigma^2} \right)

• For a simple linear model f (x; w) = w0 + w1 x we have:

p(y \mid x, w, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y - w_0 - w_1 x)^2}{2\sigma^2} \right)

• Key Observation: Points far from the fitted line will have a low likelihood value.
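A short sketch evaluating this density for two observations, one close to and one far from an assumed line (the line parameters, noise variance, and points are illustrative):

```python
import numpy as np

def gauss_lik(y, x, w0, w1, sigma2):
    """p(y | x, w, sigma^2) for the linear model f(x; w) = w0 + w1 * x."""
    resid = y - (w0 + w1 * x)
    return np.exp(-resid ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

w0, w1, sigma2 = 1.0, 2.0, 0.25
print(gauss_lik(3.1, 1.0, w0, w1, sigma2))   # close to the line y = 1 + 2x -> high likelihood
print(gauss_lik(6.0, 1.0, w0, w1, sigma2))   # far from the line -> likelihood near zero
```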


Log-Likelihood and Sum of Squares

• Using log-likelihood we have:

\log L(D; w, \sigma^2) = -n \log \sigma - \frac{n}{2} \log(2\pi) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y^{(i)} - f(x^{(i)}; w) \right)^2

• Since MLE optimizes over the parameters w, terms that do not depend on w are constants and can be dropped:

\log L(D; w, \sigma^2) \sim -\sum_{i=1}^{n} \left( y^{(i)} - f(x^{(i)}; w) \right)^2

• Equivalence: Maximizing the log-likelihood is equivalent to minimizing the Sum of Squared Errors (SSE):

J(w) = \sum_{i=1}^{n} \left( y^{(i)} - f(x^{(i)}; w) \right)^2
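The equivalence can also be checked numerically: sweep a parameter, and the value maximizing the Gaussian log-likelihood coincides with the value minimizing the SSE. A sketch under assumed toy data (the slope grid, noise level, and generating model are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = 3.0 * x + rng.normal(0, 0.2, 50)          # data from y = 3x + noise
sigma2 = 0.04

w1_grid = np.linspace(2.0, 4.0, 401)
sse = np.array([np.sum((y - w1 * x) ** 2) for w1 in w1_grid])
# Gaussian log-likelihood: constant term minus SSE / (2 sigma^2)
loglik = -len(x) * 0.5 * np.log(2 * np.pi * sigma2) - sse / (2 * sigma2)

# both criteria pick the same slope on the grid
print("argmin SSE   :", w1_grid[np.argmin(sse)])
print("argmax loglik:", w1_grid[np.argmax(loglik)])
```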


Estimating σ2

• The maximum likelihood estimate of the noise variance σ2 :

\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f(x^{(i)}; \hat{w}) \right)^2

• Interpretation: Mean squared error of the predictions.


• Note: σ2 reflects the noise level in the observations.
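A short sketch of this estimate: fit w by least squares (the MLE of w under Gaussian noise), then take the mean squared residual as σ̂² (the toy data and true noise level are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
true_sigma = 0.5
x = rng.uniform(0, 5, 500)
y = 1.0 + 2.0 * x + rng.normal(0, true_sigma, 500)

w1, w0 = np.polyfit(x, y, 1)                  # least-squares line (highest power first)
residuals = y - (w0 + w1 * x)
sigma2_hat = np.mean(residuals ** 2)          # MLE of the noise variance
print(sigma2_hat, true_sigma ** 2)            # close for large n
```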



Contributions

• These slides are authored by:


• Arshia Gharooni

• Mahan Bayhaghi


[1] C. M. Bishop, Pattern Recognition and Machine Learning.
Information Science and Statistics, New York, NY: Springer, 1 ed., Aug. 2006.
[2] M. Soleymani Baghshah, “Machine learning.” Lecture slides.
[3] A. Ng and T. Ma, CS229 Lecture Notes.
[4] T. Mitchell, Machine Learning.
McGraw-Hill series in computer science, New York, NY: McGraw-Hill Professional,
Mar. 1997.
[5] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, Learning From Data: A Short
Course.
New York, NY: AMLBook, 2012.
[6] S. Goel, H. Bansal, S. Bhatia, R. A. Rossi, V. Vinay, and A. Grover, “CyCLIP: Cyclic
Contrastive Language-Image Pretraining,” ArXiv, vol. abs/2205.14459, May 2022.
