Week11 - Regularization and Optimization
Week 11
1
Recap: K-Means
• Inputs:
• Initialization:
• Outputs:
• Stopping criterion
2
Recap: HAC
• Inputs:
• Dendrogram
3
Example: HAC
points #1 #2 #3 #4
x 1.9 1.8 2.3 2.3
y 1.0 0.9 1.6 2.1
      #1    #2    #3    #4
#1    0
#2    0.14  0
#3    0.72  0.86  0
#4    1.17  1.3   0.5   0
Average Distance:
• D(1+2, 3) = Average {D(1,3), D(2,3)} = 0.5 * D(1,3) + 0.5 * D(2,3) = 0.36 + 0.43 = 0.79
• D(1+2, 4) = Average {D(1,4), D(2,4)} = 0.5 * D(1,4) + 0.5 * D(2,4) = 0.59 + 0.65 = 1.24
          #1 + #2  #3    #4
#1 + #2   0
#3        0.79     0
#4        1.24     0.5   0
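The distance matrix and the average-distance update can be checked with a short script. This is a minimal sketch, assuming Euclidean distance between the points (which reproduces the matrix entries); the helper names `euclidean` and `average_link` are illustrative, not from the slides.

```python
import math

# Points from the example: id -> (x, y)
points = {1: (1.9, 1.0), 2: (1.8, 0.9), 3: (2.3, 1.6), 4: (2.3, 2.1)}

def euclidean(p, q):
    """Euclidean distance between two 2-D points (assumed metric)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Pairwise distances, keyed by (i, j) with i < j
dist = {(i, j): euclidean(points[i], points[j])
        for i in points for j in points if i < j}

# HAC merges the closest pair first
closest = min(dist, key=dist.get)

def average_link(cluster, other):
    """Average distance from every member of `cluster` to the point `other`."""
    return sum(dist[tuple(sorted((m, other)))] for m in cluster) / len(cluster)

d12_3 = average_link((1, 2), 3)
d12_4 = average_link((1, 2), 4)
print(closest, round(d12_3, 2), round(d12_4, 2))
```

The closest pair is (#1, #2), which is why those two points are merged before the distances to #3 and #4 are updated.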
8
Example: HAC
      #1    #2    #3    #4
#1    0
#2    0.14  0
#3    0.72  0.86  0
#4    1.17  1.3   0.5   0
MIN Distance:
• D(1+2, 3) = MIN {D(1,3), D(2,3)} = MIN {0.72, 0.86} = 0.72
• D(1+2, 4) = MIN {D(1,4), D(2,4)} = MIN {1.17, 1.3} = 1.17
          #1 + #2  #3    #4
#1 + #2   0
#3        0.72     0
#4        1.17     0.5   0
9
Example: HAC
      #1    #2    #3    #4
#1    0
#2    0.14  0
#3    0.72  0.86  0
#4    1.17  1.3   0.5   0
MAX Distance:
• D(1+2, 3) = MAX {D(1,3), D(2,3)} = MAX {0.72, 0.86} = 0.86
• D(1+2, 4) = MAX {D(1,4), D(2,4)} = MAX {1.17, 1.3} = 1.3
          #1 + #2  #3    #4
#1 + #2   0
#3        0.86     0
#4        1.3      0.5   0
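The three update rules differ only in how the point-to-point distances between two clusters are combined. The sketch below uses the rounded matrix entries from the example; the names `LINKAGES` and `cluster_distance` are hypothetical helpers introduced for illustration.

```python
# Rounded pairwise distances from the example matrix, keyed by (i, j) with i < j
D = {(1, 2): 0.14, (1, 3): 0.72, (2, 3): 0.86,
     (1, 4): 1.17, (2, 4): 1.3, (3, 4): 0.5}

def d(i, j):
    """Look up the distance between points i and j in either order."""
    return D[(min(i, j), max(i, j))]

# Each linkage rule reduces a list of point-to-point distances to one number
LINKAGES = {
    "average": lambda values: sum(values) / len(values),
    "min": min,
    "max": max,
}

def cluster_distance(a, b, linkage):
    """Distance between clusters a and b (tuples of point ids) under a linkage rule."""
    return LINKAGES[linkage]([d(i, j) for i in a for j in b])

for name in LINKAGES:
    print(name,
          cluster_distance((1, 2), (3,), name),
          cluster_distance((1, 2), (4,), name))
```

Averaging the rounded entries for D(1+2, 4) gives 1.235, which the worked example reports as 1.24.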
10
Recap: Linear Regression
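$$\boldsymbol{w}^T \boldsymbol{x} + b = \begin{bmatrix} \boldsymbol{w}^T \mid b \end{bmatrix} \begin{bmatrix} \boldsymbol{x} \\ 1 \end{bmatrix}$$
$$\boldsymbol{w}^T \leftarrow \begin{bmatrix} \boldsymbol{w}^T \mid b \end{bmatrix}, \qquad \boldsymbol{x} \leftarrow \begin{bmatrix} \boldsymbol{x} \\ 1 \end{bmatrix}$$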
12
Recap: Linear Regression
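$$L(f_{\boldsymbol{w}}) = \frac{1}{N} \sum_{i=1}^{N} (\boldsymbol{w}^T \boldsymbol{x}_i - y_i)^2 = \frac{1}{N} \lVert \boldsymbol{X}\boldsymbol{w} - \boldsymbol{y} \rVert_2^2$$
• Matrix $\boldsymbol{X} \in \mathbb{R}^{N \times d}$
• Vector $\boldsymbol{y} \in \mathbb{R}^{N}$
• Vector $\boldsymbol{w} \in \mathbb{R}^{d}$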
13
Recap: Linear Regression
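$$\boldsymbol{w} = (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}$$
Example:
x   y
0   1
1   2
2   3
Step 1: Use the modified $\boldsymbol{x} \leftarrow \begin{bmatrix} \boldsymbol{x} \\ 1 \end{bmatrix}$ for all data samples
x        y
(0, 1)   1
(1, 1)   2
(2, 1)   3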
15
Example: Linear Regression
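$$\boldsymbol{X} = \begin{bmatrix} 0 & 1 \\ 1 & 1 \\ 2 & 1 \end{bmatrix}, \qquad \boldsymbol{y} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$$
Each row of $\boldsymbol{X}$ is a modified sample $\boldsymbol{x}_i$; each entry of $\boldsymbol{y}$ is its target $y_i$:
x        y
(0, 1)   1
(1, 1)   2
(2, 1)   3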
16
Example: Linear Regression
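$$\boldsymbol{w} = (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}$$
$$\boldsymbol{X}^T \boldsymbol{X} = \begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 1 \\ 2 & 1 \end{bmatrix} = \begin{bmatrix} 5 & 3 \\ 3 & 3 \end{bmatrix}, \qquad \boldsymbol{X}^T \boldsymbol{y} = \begin{bmatrix} 0 & 1 & 2 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} = \begin{bmatrix} 8 \\ 6 \end{bmatrix}$$
$$\boldsymbol{w} = \begin{bmatrix} 5 & 3 \\ 3 & 3 \end{bmatrix}^{-1} \begin{bmatrix} 8 \\ 6 \end{bmatrix} = \begin{bmatrix} 3/6 & -3/6 \\ -3/6 & 5/6 \end{bmatrix} \begin{bmatrix} 8 \\ 6 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$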
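The closed-form computation can be reproduced in a few lines. This is a minimal sketch for the one-feature case only, assuming each sample is augmented to (x, 1) as in the example; `fit_line` is a hypothetical helper name, and the 2x2 inverse is written out by hand instead of calling a linear-algebra library.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w*x + b via the normal equations (X^T X) w = X^T y."""
    n = len(xs)
    # Entries of X^T X = [[sxx, sx], [sx, n]] when the rows of X are (x_i, 1)
    sxx = sum(x * x for x in xs)
    sx = sum(xs)
    # Entries of X^T y = [sxy, sy]
    sxy = sum(x * y for x, y in zip(xs, ys))
    sy = sum(ys)
    det = sxx * n - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular; the closed form does not apply")
    # Inverse of a 2x2 matrix: (1/det) * [[n, -sx], [-sx, sxx]]
    w = (n * sxy - sx * sy) / det
    b = (-sx * sxy + sxx * sy) / det
    return w, b

xs, ys = [0, 1, 2], [1, 2, 3]
w, b = fit_line(xs, ys)
# Mean squared error of the fitted line on the training samples
loss = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(w, b, loss)
```

For the three samples in the example the fit is exact, so the training loss is zero.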
17
Regularization and Optimization
18
Outline
19
Carry-on Questions
20
Supervised vs Unsupervised Learning
• Classification is supervised:
• Clustering is unsupervised:
21
Types of Supervisions
Semi-supervised (labels for a small portion of the training data)
22
Revisit Supervised Image Classification
[Figure: input images with desired outputs apple, pear, tomato, cow, dog, horse]
23
Framework of Supervised Learning
• Training time: Training Samples → Features → Training (with Training Labels) → Learned model
• Testing time: Features → Learned model → Prediction
24
The Basic Supervised Learning Framework
𝑦 = 𝑓𝜃 (𝒙)
25
Learning Effectiveness
• Potential Problems
26
Where does the learning error come from?
• Challenges:
1. We never know the exact model mapping from the inputs to outputs.
• What we do in practice:
• E.g., we minimize the prediction error averaged over the training dataset.
31
What are the learning errors?
• Expected Error
For a future test sample drawn at random from the underlying distribution, the expected error is the probability that 𝑓𝜃(·) misclassifies it.
34
Bias and Variance
• Expected Error:
• For an item drawn at random from the underlying distribution, the probability that 𝑓𝜃(·) misclassifies it.
35
Bias and Variance
• Bias:
• Type of error that occurs due to wrong / inaccurate assumptions made in the
learning algorithm.
• Variance:
• Type of error that occurs due to a model's sensitivity to small fluctuations in the
training set.
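For squared-error regression, the two error types combine in a standard decomposition. The block below is a sketch under assumptions not stated on the slides: the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, $\hat{f}$ is the model learned from a random training set, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The bias term measures how far the average learned model is from the true mapping; the variance term measures how much the learned model changes from one training set to another.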
39
Basics on statistical learning theory (optional)
42
Basics on statistical learning theory
• We cannot know exactly how well an algorithm will work in practice (the
true "risk", a measure of effectiveness).
43
Basics on statistical learning theory
• The expected (true) risk measures how well ℎ(𝑥) approximates 𝑦.
44
Basics on statistical learning theory
• Expected Risk:
• Empirical Risk:
• Instead of integrating, we average the distance between $y^{(i)}$ and the
predicted $h(x^{(i)})$: all samples have equal weight.
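The two risks are usually written as follows. This is the textbook form, not recovered from the slides: the loss $\ell$, the joint distribution $P(x, y)$, and the sample count $N$ are notational assumptions.

```latex
R(h) = \int \ell\big(h(x), y\big)\, dP(x, y)
\qquad\qquad
R_{\mathrm{emp}}(h) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(h(x^{(i)}), y^{(i)}\big)
```

The empirical risk replaces the unknown distribution with the $N$ training samples, each weighted by $1/N$.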
45
Basics on statistical learning theory
• Expected Risk:
• Empirical Risk:
• Limitations of learning the function ℎ(𝑥):
[Figure: the best possible ℎ(·), ℎ(·) with limitation 1, and ℎ(·) with limitations 1 + 2]
47
Basics on statistical learning theory
Good Model!
52
Overfitting vs Underfitting
• Simple Model: high bias
• Complex Model: high variance
53
Overfitting vs Underfitting
• Underfitting: high bias and low variance
• Overfitting: low bias and high variance
56
Overfitting vs Underfitting
• Overfitting: large gap between training and test errors
• Underfitting: small gap between training and test errors
59
Overfitting vs Underfitting
• Bias-Variance Tradeoff:
• The bias is error from erroneous assumptions in the learning algorithm. High
bias can cause an algorithm to miss the relevant relations between features
and target outputs (e.g., model is too simple -> underfitting).
• The variance is error from sensitivity to small fluctuations in the training set.
High variance can cause an algorithm to model the random noise in the
training data, rather than the intended outputs (e.g., model is too
complicated -> overfitting).
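The tradeoff can be seen on a toy problem by comparing a model that is too simple with one that is too flexible. Everything in this sketch is an assumption chosen for illustration: the target y = x^2, the noise level, the sample sizes, and the two models (a constant predictor and a 1-nearest-neighbor predictor).

```python
import random

random.seed(0)

def make_data(n):
    """Noisy samples of a hypothetical target y = x^2 on [-1, 1]."""
    xs = [random.uniform(-1, 1) for _ in range(n)]
    return [(x, x * x + random.gauss(0, 0.1)) for x in xs]

def mse(predict, data):
    """Mean squared error of a predictor on a list of (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

train, test = make_data(30), make_data(200)

# Too simple: always predict the training mean (high bias, low variance)
mean_y = sum(y for _, y in train) / len(train)

def constant(x):
    return mean_y

# Too flexible: copy the label of the nearest training point (low bias, high variance)
def nearest(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

for name, model in [("constant", constant), ("1-NN", nearest)]:
    print(name, "train:", mse(model, train), "test:", mse(model, test))
```

The 1-NN model reproduces every training label, noise included, so its training error is zero while its test error is not; the constant model ignores the shape of the target and has nonzero error on both sets.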
60
Optimize and Regularize Learning
• Important statistics:
• Training parameters:
1. Learning Rate
2. Model Regularization
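The effect of both training parameters can be illustrated on a one-dimensional toy objective. This sketch assumes plain gradient descent and an L2 penalty; the objective (w - 3)^2 + lam * w^2, the function name `gradient_descent`, and the specific values of `lr` and `lam` are hypothetical choices for illustration.

```python
def gradient_descent(lr, lam, steps=200, w=0.0):
    """Minimize the toy objective (w - 3)^2 + lam * w^2 by gradient descent.

    lr  : learning rate (step size)
    lam : strength of the L2 regularization term
    """
    for _ in range(steps):
        grad = 2.0 * (w - 3.0) + 2.0 * lam * w
        w -= lr * grad
    return w

print(gradient_descent(lr=0.1, lam=0.0))   # small step, no regularization
print(gradient_descent(lr=0.1, lam=0.5))   # regularization pulls w toward 0
print(gradient_descent(lr=1.1, lam=0.0))   # step too large: iterates blow up
```

Without regularization the minimizer is w = 3; with lam = 0.5 it moves to 3 / (1 + lam) = 2. A learning rate that is too large makes each update overshoot by more than the previous error, so the iterates diverge.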
62
Diagnosing learning rates
Regularization to prevent overfitting
• Dropout: During training, a random subset of layer outputs is ignored, or
“dropped out”.
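Dropout on a single layer's outputs can be sketched as follows. The rescaling of kept outputs by 1 / (1 - p_drop), often called inverted dropout, is an assumption of this sketch and is not stated on the slide; the function name and parameters are hypothetical.

```python
import random

def dropout(activations, p_drop, training=True, rng=random):
    """Zero each activation with probability p_drop during training.

    Kept activations are scaled by 1 / (1 - p_drop) so the expected value
    of each output is unchanged (assumed "inverted dropout" convention).
    At test time the activations pass through untouched.
    """
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() >= p_drop else 0.0 for a in activations]

layer_out = [0.2, 1.5, 0.7, 0.9]
print(dropout(layer_out, p_drop=0.5, rng=random.Random(0)))
print(dropout(layer_out, p_drop=0.5, training=False))
```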
64
Regularization to prevent overfitting
• Early Stopping: Evaluate the model on the validation set every few training iterations,
and stop when the validation error reaches its minimum.
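In practice the minimum is only recognizable after the validation error has started to rise again, so a common approach is to wait a fixed number of checks before stopping. The sketch below assumes such a `patience` parameter, which is not part of the slide; the function name and the loss values are made up for illustration.

```python
def early_stopping(val_losses, patience):
    """Return (best_index, stop_index) for a sequence of validation losses.

    Training stops at the first check that is `patience` checks past the
    best validation loss seen so far; the model from best_index is kept.
    """
    best_idx = 0
    for i, loss in enumerate(val_losses):
        if loss < val_losses[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx, i
    return best_idx, len(val_losses) - 1

# Hypothetical validation losses measured every few training iterations
losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.7, 0.8]
best, stopped = early_stopping(losses, patience=3)
print(best, stopped)
```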
65
Regularization to prevent overfitting
• Early Stopping: Do not train the network until the training error is as low as possible;
stop at the minimum validation loss instead.
66
Regularization to prevent overfitting
• Dropout: During training, a random subset of layer outputs is ignored, or
“dropped out”.
• Early Stopping: Evaluate the model on the validation set every few training iterations,
and stop when the validation error reaches its minimum.
• Weight Sharing: Instead of training each neuron independently, we can force their
parameters to be the same. Example: Recurrent Neural Networks (RNNs).
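Weight sharing in an RNN means the same parameters are applied at every time step. The following is a minimal scalar sketch, assuming a tanh recurrence; `rnn_forward` and its parameter names are hypothetical.

```python
import math

def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a scalar RNN over a sequence, reusing (w_x, w_h, b) at every step."""
    h = h0
    states = []
    for x in inputs:
        # The same three parameters are used regardless of the time step
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

states = rnn_forward([1.0, 0.5, -1.0], w_x=0.5, w_h=0.8, b=0.0)
print(states)
```

The number of parameters stays at three no matter how long the input sequence is, which limits model capacity compared with learning separate weights per step.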
67
Regularization to prevent overfitting
• Data Augmentation: Modify the available data in a realistic but randomized way to
increase the variety of data seen during training.
68
Data augmentation
• Introduce transformations not adequately sampled in the training data
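A randomized transformation can be sketched on a tiny image stored as a list of rows. The choice of transformations (horizontal flip and sideways shift with zero padding) and the name `augment` are assumptions for illustration; which transformations count as realistic depends on the task.

```python
import random

def augment(image, rng, max_shift=1):
    """Return a randomly flipped and horizontally shifted copy of `image`.

    `image` is a list of equal-length rows of pixel values; pixels shifted
    in from outside the image are filled with 0. The input is not modified.
    """
    out = [row[:] for row in image]
    if rng.random() < 0.5:
        out = [row[::-1] for row in out]           # horizontal flip
    shift = rng.randint(-max_shift, max_shift)
    if shift > 0:                                   # shift right, pad on the left
        out = [[0] * shift + row[:-shift] for row in out]
    elif shift < 0:                                 # shift left, pad on the right
        out = [row[-shift:] + [0] * (-shift) for row in out]
    return out

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
rng = random.Random(0)
for _ in range(3):
    print(augment(image, rng))
```

Calling `augment` anew each time a sample is drawn gives the model a slightly different view of the same image in every epoch.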
70
Regularization to prevent overfitting
• Dimensionality Reduction
[Figure panels: k = 200, k = 50, k = 2]
73
What we have learned
• Types of learning
• Examples of each learning type
74
Carry-on Questions
75