Machine Learning (CE 40477)
Fall 2024
Ali Sharifi-Zarchi
CE Department
Sharif University of Technology
September 30, 2024
1 Generalization
2 Probabilistic regression
3 References
Generalization Overview
Main Idea: The ability of a model to perform well on unseen data
• Training Set: D = {(x_i, y_i)}_{i=1}^n
• Test Set: New data not seen during training
• Cost Function: Measures how well the model fits data
J(w) = Σ_{i=1}^{n} (y^(i) − h_w(x^(i)))²
• Objective: Minimize the cost function on unseen data (generalization error)
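To make the cost concrete, here is a minimal sketch (not part of the original slides) that evaluates J(w) for a linear hypothesis h_w(x) = w_0 + w_1 x; the toy data values are assumptions chosen purely for illustration.

```python
import numpy as np

def cost(w, x, y):
    """Sum-of-squared-errors cost J(w) for the hypothesis h_w(x) = w[0] + w[1] * x."""
    predictions = w[0] + w[1] * x            # h_w(x^(i)) for every training point
    return np.sum((y - predictions) ** 2)    # J(w) = sum of squared residuals

# Toy training set (values assumed for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])

print(cost(np.array([0.0, 1.0]), x, y))      # cost of the line y = x (small)
print(cost(np.array([0.0, 0.0]), x, y))      # cost of the constant-zero line (larger)
```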
Expected Test Error
Definition: Expected performance on unseen data
• Test data sampled from the same distribution p(x, y)
J(w) = E_{p(x,y)}[(y − h_w(x))²]
• Approximate it with the test-set estimate Ĵ(w), computed on samples not used during training
• Generalization error is the gap between training and test performance.
Training vs Test Error
Key Concept: Training error measures fit on known data, test error on unseen data
• Training (empirical) error:
J_train(w) = (1/n) Σ_{i=1}^{n} (y^(i) − h_w(x^(i)))²
• Test error:
J_test(w) = (1/m) Σ_{i=1}^{m} (y_test^(i) − h_w(x_test^(i)))²
• Goal: Minimize the test error (generalization).
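As a hedged illustration of the train/test gap, the sketch below fits a deliberately over-complex polynomial and compares J_train and J_test. The sine-plus-noise data generator, sample sizes, and degree 9 are assumptions for illustration, not part of the course material.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=30)  # noisy observations

# Hold out the last third of the samples as a test set
x_train, y_train = x[:20], y[:20]
x_test,  y_test  = x[20:], y[20:]

w = np.polyfit(x_train, y_train, deg=9)      # a deliberately complex model

def mse(w, x, y):
    """Mean squared error of the polynomial with coefficients w."""
    return np.mean((y - np.polyval(w, x)) ** 2)

print("J_train:", mse(w, x_train, y_train))  # typically very small
print("J_test :", mse(w, x_test, y_test))    # typically much larger: overfitting
```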
Overfitting Definition
Concept: A model fits the training data well but performs poorly on the test set
J_train(w) ≪ J_test(w)
• Causes: Model too complex, high variance
• Consequence: Captures noise in training data, fails on unseen data
Underfitting Definition
Concept: The model is too simple and cannot capture the structure of the data
J_train(w) ≈ J_test(w) ≫ 0
• Causes: Model lacks complexity, high bias
• Consequence: Poor fit on both training and test data
Generalization: polynomial regression
[Figures: polynomial fits of degree 1 and degree 3]
Example adapted from slides of Dr. Soleymani, ML course, Sharif University of Technology.
Overfitting: polynomial regression
[Figures: polynomial fits of degree 5 and degree 7]
Example adapted from slides of Dr. Soleymani, ML course, Sharif University of Technology.
Polynomial regression with various degrees: example
Figures adapted from Pattern Recognition and Machine Learning, Bishop
Polynomial regression with various degrees: example (cont.)
Figures adapted from Pattern Recognition and Machine Learning, Bishop
Root mean squared error
E_RMS = √( (1/n) Σ_{i=1}^{n} (y^(i) − f(x^(i); w))² )
Figures adapted from Pattern Recognition and Machine Learning, Bishop
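The sketch below reproduces the spirit of such a degree sweep: fit polynomials of several degrees to a small training set and report E_RMS on training and test data. The sine target, noise level, and sample sizes are assumptions for illustration, not the exact setup behind Bishop's figures.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    """Sample n noisy observations of an assumed target sin(2*pi*x)."""
    x = rng.uniform(0, 1, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=n)

x_train, y_train = make_data(10)
x_test,  y_test  = make_data(100)

def e_rms(w, x, y):
    """Root-mean-squared error of the polynomial with coefficients w."""
    return np.sqrt(np.mean((y - np.polyval(w, x)) ** 2))

for degree in (0, 1, 3, 9):
    w = np.polyfit(x_train, y_train, degree)
    print(f"degree {degree}: train {e_rms(w, x_train, y_train):.3f}, "
          f"test {e_rms(w, x_test, y_test):.3f}")
```

Training E_RMS typically keeps shrinking with degree, while test E_RMS starts growing again once the model overfits.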
Bias-Variance Decomposition
Generalization error decomposition:
E[(y − h_w(x))²] = Bias² + Variance + Noise

This decomposition is a very important identity for understanding generalization.
• Bias: Error due to simplifying assumptions in the model
Bias(x) = E[h_w(x)] − f(x)
• Variance: Sensitivity of the model to training data
Variance(x) = E[(h_w(x) − E[h_w(x)])²]
• Noise: Irreducible error from the inherent randomness in data
Bias-Variance Decomposition Proof
Assume f(x) is the ground truth and the observation y = f(x) + ϵ is noisy, where ϵ ∼ N(0, σ²). We start with the definition of the expected squared error:

E_data[(f̂(x) − y)²] = E_data[(f̂(x) − f(x) − ϵ)²]
                    = E[(f̂(x) − f(x))² − 2ϵ(f̂(x) − f(x)) + ϵ²]

Since we assume the noise ϵ has zero mean and variance σ², we have E[ϵ] = 0 and therefore

E[ϵ²] = σ²

Since E[ϵ] = 0 and ϵ is independent of (f̂(x) − f(x)), the cross term vanishes:

E[−2ϵ(f̂(x) − f(x))] = 0
Bias-Variance Decomposition Proof (cont.)
Now, we decompose the squared difference (f̂(x) − f(x))² by adding and subtracting E[f̂(x)]:

E[(f̂(x) − f(x))²] = E[(f̂(x) − E[f̂(x)] + E[f̂(x)] − f(x))²]

Expanding this further:

= E[(f̂(x) − E[f̂(x)])²] + (E[f̂(x)] − f(x))² + 2 E[(f̂(x) − E[f̂(x)])(E[f̂(x)] − f(x))]
  (variance)              (bias²)              (cross term)

The cross term vanishes because E[f̂(x)] − f(x) is a constant with respect to the training data, and

E[f̂(x) − E[f̂(x)]] = E[f̂(x)] − E[f̂(x)] = 0
Bias-Variance Decomposition Proof (cont.)
Thus, the expected squared error becomes:
E_data[(f̂(x) − y)²] = Variance + Bias² + σ²

where:

• Variance = E[(f̂(x) − E[f̂(x)])²]
• Bias = E[f̂(x)] − f(x)
• Noise = σ²
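A Monte-Carlo sketch of this decomposition: repeatedly draw training sets from an assumed generative model, refit the model, and estimate bias² and variance of the prediction at a fixed query point. The target function, noise level, dataset size, and polynomial degree below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * np.pi * x)     # assumed ground-truth f(x)
sigma = 0.2                             # assumed noise standard deviation
x0 = 0.25                               # query point at which the error is decomposed
degree = 3                              # model complexity being probed

predictions = []
for _ in range(2000):                   # E_data[.] approximated over many training sets
    x = rng.uniform(0, 1, size=20)
    y = f(x) + rng.normal(0, sigma, size=20)
    w = np.polyfit(x, y, degree)
    predictions.append(np.polyval(w, x0))   # f_hat(x0) for this training set

predictions = np.array(predictions)
bias2 = (predictions.mean() - f(x0)) ** 2   # (E[f_hat(x0)] - f(x0))^2
variance = predictions.var()                # E[(f_hat(x0) - E[f_hat(x0)])^2]
print("bias^2:", bias2, "variance:", variance, "noise:", sigma ** 2)
```

Repeating the experiment with a lower or higher degree shows bias² and variance trading off against each other.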
High Bias in Simple Models
Explanation: Simple models, such as linear regression, often underfit
h_w(x) = w_0 + w_1 x

• Bias remains large even with infinite data

Bias² ≫ Variance
• Leads to large generalization error
High Variance in Complex Models
Explanation: Complex models tend to overfit
h_w(x) = w_0 + w_1 x + w_2 x² + · · · + w_m x^m
• Variance dominates when the model is too complex
Variance ≫ Bias
• Fits noise, leading to high test error
Bias-Variance Tradeoff
Tradeoff: Balancing between bias and variance is key for optimal performance
• Low complexity: High bias, low variance
• High complexity: Low bias, high variance
Regularization
Purpose: Prevent overfitting by penalizing large weights
J_λ(w) = J(w) + λ R(w)
• Common regularizers: L1 and L2 norms
• λ controls the balance between fit and simplicity
Effect of Regularization Parameter λ
Balancing Fit and Complexity:
J_λ(w) = J(w) + λ Σ_{j=1}^{m} w_j² = J(w) + λ wᵀw
• Large λ: Forces smaller weights, reduces complexity, increases bias, decreases variance
• Small λ: Allows larger weights, increases complexity, reduces bias, increases variance
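The effect of λ can be checked numerically. The sketch below fits a degree-9 polynomial with an L2 penalty via the regularized normal equations and sweeps λ; the data generator, the tiny-but-nonzero smallest λ, and the choice to penalize all weights including w_0 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
x_train = rng.uniform(0, 1, size=10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=10)
x_test = rng.uniform(0, 1, size=100)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, size=100)

def design(x, degree=9):
    """Polynomial design matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def fit_ridge(x, y, lam, degree=9):
    """Solve (X^T X + lam * I) w = X^T y (all weights penalized, for simplicity)."""
    X = design(x, degree)
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def mse(w, x, y, degree=9):
    return np.mean((y - design(x, degree) @ w) ** 2)

for lam in (1e-9, 1e-6, 1e-3, 1.0):      # 1e-9 approximates the unregularized fit
    w = fit_ridge(x_train, y_train, lam)
    print(f"lambda={lam:g}: ||w||={np.linalg.norm(w):10.2f}, "
          f"train MSE={mse(w, x_train, y_train):.4f}, test MSE={mse(w, x_test, y_test):.4f}")
```

As λ grows, the weight norm shrinks and the train error rises, while the test error usually first improves and then degrades.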
Effect of Regularization Parameter λ (cont.)
J_λ(w) = Σ_{i=1}^{n} (t^(i) − f(x^(i); w))² + λ wᵀw

f(x; w) = w_0 + w_1 x + · · · + w_9 x^9

Figures adapted from Pattern Recognition and Machine Learning, Bishop
Effect of regularization on weights
Table adapted from Pattern Recognition and Machine Learning, Bishop
Regularization parameter
• λ controls the effective complexity of the model
• hence the degree of overfitting
Figures adapted from Pattern Recognition and Machine Learning, Bishop
Introduction to Regression (Probabilistic Perspective)
• Objective: Model the relationship between input x and output y.
• Uncertainty: Output y has an associated uncertainty modeled by a probability distribution.
• Example:
y = f(x; w) + ϵ,   ϵ ∼ N(0, σ²)
• The goal is to learn f (x; w) to predict y.
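A small sketch of this generative view (all parameter values are assumptions): for each input x, the model defines a distribution over y, from which noisy observations are drawn.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x, w):
    """Assumed deterministic part of the model: a simple linear function."""
    return w[0] + w[1] * x

w_true, sigma = np.array([1.0, 2.0]), 0.3       # illustrative parameters
x = np.linspace(0, 1, 5)

# Each observed y is a draw from N(f(x; w), sigma^2)
y = f(x, w_true) + rng.normal(0, sigma, size=x.shape)
print(np.column_stack([x, f(x, w_true), y]))    # input, mean prediction, noisy sample
```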
Curve Fitting with Noise
• In real-world scenarios, observed output y is noisy.
• Model: True output plus noise
y = f (x; w) + ϵ
• Noise represents unknown or unmodeled factors.
• Example: Predicting house prices based on features with inherent unpredictability.
Expected Value of Output
• Best Estimate: The conditional expectation of y given x.
E[y|x] = f (x; w)
• Goal: Learn a function f (x; w) that represents the average behavior of the data.
• Key Point: The model captures the mean of the target variable given input x.
Maximum Likelihood Estimation (MLE)
• MLE: A method to estimate parameters that maximize the likelihood of the data.
• Given data D = {(x_i, y_i)}_{i=1}^n, MLE maximizes:
L(D; w, σ²) = ∏_{i=1}^{n} p(y_i | x_i, w, σ²)
• MLE finds parameters w and σ² that best explain the data.
Maximum Likelihood Estimation (cont.)
• Instead of maximizing the likelihood, it is often easier to maximize the log-likelihood:
log L(D; w, σ²) = Σ_{i=1}^{n} log p(y_i | x_i, w, σ²)
• This works because log is a strictly increasing function, so it preserves the location of the maximum.
• It is also easier to differentiate a sum of terms than a product.
Univariate Linear Function Example
• Assuming Gaussian noise ϵ ∼ N(0, σ²), the probability of observing output value y is:

p(y | x, w, σ²) = 1/√(2πσ²) · exp( −(y − f(x; w))² / (2σ²) )

• For a simple linear model f(x; w) = w_0 + w_1 x we have:

p(y | x, w, σ²) = 1/√(2πσ²) · exp( −(y − w_0 − w_1 x)² / (2σ²) )
• Key Observation: Points far from the fitted line will have a low likelihood value.
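To see this key observation numerically, the sketch below evaluates the Gaussian likelihood p(y | x, w, σ²) for one point close to an assumed fitted line and one far from it; the line y = x and the noise level are assumptions.

```python
import numpy as np

def gaussian_likelihood(y, x, w, sigma):
    """p(y | x, w, sigma^2) for the linear model f(x; w) = w[0] + w[1] * x."""
    mean = w[0] + w[1] * x
    return np.exp(-(y - mean) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

w, sigma = np.array([0.0, 1.0]), 0.5            # assumed line y = x and noise level

print(gaussian_likelihood(1.1, 1.0, w, sigma))  # close to the line: high likelihood
print(gaussian_likelihood(3.0, 1.0, w, sigma))  # far from the line: low likelihood
```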
Log-Likelihood and Sum of Squares
• Using log-likelihood we have:
log L(D; w, σ²) = −n log σ − (n/2) log(2π) − (1/(2σ²)) Σ_{i=1}^{n} (y^(i) − f(x^(i); w))²
• Since MLE optimizes over the parameters w (for fixed σ), terms that do not depend on w can be dropped:
log L(D; w, σ²) ∼ − Σ_{i=1}^{n} (y^(i) − f(x^(i); w))²
• Equivalence: Maximizing the log-likelihood is equivalent to minimizing the Sum of Squared Errors (SSE):

J(w) = Σ_{i=1}^{n} (y^(i) − f(x^(i); w))²
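A quick numerical check of this equivalence, with an assumed one-parameter model y ≈ w·x and a known σ: sweeping over w, the negative log-likelihood and the SSE reach their minima at the same value.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, size=50)
y = 2.0 * x + rng.normal(0, 0.3, size=50)       # assumed data: y = 2x + noise

sigma = 0.3                                     # treated as known for this check
w_grid = np.linspace(0.0, 4.0, 401)

sse = np.array([np.sum((y - w * x) ** 2) for w in w_grid])
nll = len(x) * np.log(sigma) + len(x) / 2 * np.log(2 * np.pi) + sse / (2 * sigma ** 2)

# Both criteria are minimized by the same w
print("argmin SSE :", w_grid[np.argmin(sse)])
print("argmin NLL :", w_grid[np.argmin(nll)])
```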
Estimating σ²
• The maximum likelihood estimate of the noise variance σ²:

σ̂² = (1/n) Σ_{i=1}^{n} (y^(i) − f(x^(i); ŵ))²
• Interpretation: Mean squared error of the predictions.
• Note: σ² reflects the noise level in the observations.
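Continuing the assumed one-parameter example from above, the MLE noise variance is simply the mean squared residual of the fitted model.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, size=200)
y = 2.0 * x + rng.normal(0, 0.3, size=200)      # assumed ground truth: slope 2, sigma 0.3

w_hat = np.sum(x * y) / np.sum(x * x)           # least-squares fit of y ~ w * x
sigma2_hat = np.mean((y - w_hat * x) ** 2)      # MLE of the noise variance

print("w_hat:", w_hat, "sigma^2_hat:", sigma2_hat)  # sigma^2_hat should be near 0.09
```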
Contributions
• These slides are authored by:
• Arshia Gharooni
• Mahan Bayhaghi
[1] C. M. Bishop, Pattern Recognition and Machine Learning. Information Science and Statistics, New York, NY: Springer, 1st ed., Aug. 2006.
[2] M. Soleymani Baghshah, "Machine Learning." Lecture slides, Sharif University of Technology.
[3] A. Ng and T. Ma, CS229 Lecture Notes.
[4] T. Mitchell, Machine Learning. McGraw-Hill Series in Computer Science, New York, NY: McGraw-Hill Professional, Mar. 1997.
[5] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, Learning From Data: A Short Course. New York, NY: AMLBook, 2012.
[6] S. Goel, H. Bansal, S. Bhatia, R. A. Rossi, V. Vinay, and A. Grover, "CyCLIP: Cyclic Contrastive Language-Image Pretraining," arXiv, vol. abs/2205.14459, May 2022.