Lecture 3
Training Models
Chapter 4
CSC 484 / 584, DA 515
Chapter 4 Training (Main Points)
Regression Models
Linear Regression: optimization problem
Batch
SGD
Mini-Batch
Polynomial, Exponential, Logarithmic, …
Generalized Regression
Regularization (L1, L2, or Elastic)
-------------
Logistic Regression
Softmax Regression
Linear Regression
Goal: find the best-fitting line
The optimal weights: minimize the squared errors (least squares)
Solution 1.1: Analytical Solution
1.1 Use NumPy to compute the inverse and the matrix products:
# use np to solve it
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
--------------------------------------------
y = Xθ
Xᵀy = XᵀXθ
(XᵀX)⁻¹Xᵀy = (XᵀX)⁻¹(XᵀX)θ   =>   θ = (XᵀX)⁻¹Xᵀy
Prediction: ŷ = X_new θ
Prediction with found weights
Solution 1.2: using sk-learn
1.2 use sk-learn
from sklearn.linear_model import LinearRegression
# 3 steps: model/fitting/result
lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_
# predict
lin_reg.predict(X_new)
Matrix inverse vs. SVD
Matrix inversion can be problematic:
XᵀX is not always invertible (singular**, e.g., when features are redundant)
XᵀX is an (n + 1) × (n + 1) matrix (where n is the number of features); the computational complexity of inverting such a matrix is typically about O(n^2.4) to O(n^3)
A very large dataset may not fit into memory.
** A square matrix that does not have a matrix inverse. A matrix is singular iff its determinant is 0.
Solution 2: SVD (Singular Value Decomposition)
Decompose the training set matrix X into the product of three matrices: X = U Σ Vᵀ
Good:
works even if m < n (the number of samples m is smaller than the number of features n), where XᵀX is singular
if m >> n, SVD is more efficient than matrix inversion, especially for large datasets
Scikit-Learn uses SVD to find the solution (the weights W, i.e., θ, in our case)
The SVD approach used by Scikit-Learn’s LinearRegression class is about O(n^2).
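A minimal NumPy sketch of the SVD route, assuming the X_b (design matrix with a bias column) and y arrays from the earlier NumPy example are in scope; np.linalg.pinv computes the Moore-Penrose pseudoinverse via SVD, and np.linalg.lstsq is an SVD-based least-squares solver:

import numpy as np

# Works even when X_b.T @ X_b is singular (redundant features, m < n):
theta_best_svd = np.linalg.pinv(X_b).dot(y)

# Equivalent SVD-based least-squares solve:
theta_best_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X_b, y, rcond=None)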
Singular Value Decomposition
A is the data matrix, m × n (i.e., the X we have used):
https://slideplayer.com/slide/5189063/
Singular Values
Solution 3. Minimize the Error
Minimize the cost function, starting from random initial weights
Solution 3: Gradient Descent
Two key parameters
Direction: negative gradient
Step length: learning rate (η)
Gradient Descent
Local minimum and plateau
Data Scaling
In 2-D:
features 1 and 2 have the same scale (on the left)
feature 1 has much smaller values than feature 2 (on the right)
Gradient Descent converges much faster when all features have a similar scale, so scale the data first (see the sketch below).
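A short Scikit-Learn sketch of feature scaling with StandardScaler; the toy X matrix here is made up for illustration:

from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 1000.0]])      # feature 1 is much smaller than feature 2

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and std 1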
Batch Gradient Descent-1
We have m data samples and each has n features (j = 1, …, n).
Partial derivative along one feature direction j, over all samples:
∂MSE(θ)/∂θ_j = (2/m) Σ_{i=1..m} (θᵀx^(i) − y^(i)) · x_j^(i)
Gradient vector over all features and all samples:
∇_θ MSE(θ) = (2/m) Xᵀ(Xθ − y)
Batch Gradient Descent-2
Move to the next step: update the weights, θ := θ − η ∇_θ MSE(θ)
eta = 0.05          # learning rate (could later be multiplied by a decay factor)
n_iterations = 1000
m = 100             # number of training samples
theta = np.random.randn(2, 1)   # random initialization

for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)   # gradient of the MSE
    theta = theta - eta * gradients                   # weight update
Batch Gradient Descent: Learning Rates
Use: all samples and all features
maximum 1000 steps (or until converged)
Different learning rates:
Batch Gradient Descent: summary
MSE cost function:
is convex
its slope does not change abruptly
Batch Gradient Descent:
with a fixed learning rate it will eventually converge to the optimal solution
but you may have to wait a while: it uses the whole training set to compute the gradients at every step, so it is very slow when the training set is large.
Stochastic Gradient Descent
Stochastic Gradient Descent:
uses a random instance from the training set at every step
computes the gradients based only on that single instance
the cost function will bounce up and down, decreasing only on average. Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around, never settling down.
SGD: single sample each time
Problem:
the cost function bounces up and down, decreasing only on average
it ends up very close to the minimum, but never settles exactly on it
Benefit:
good for irregular cost functions
can escape from local optima
SK-learn: SGD method
SGDRegressor
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=1000,   # at most 1000 epochs
                       tol=1e-3,        # convergence tolerance
                       penalty=None,    # no regularization
                       eta0=0.1)        # initial learning rate
sgd_reg.fit(X, y.ravel())
Best Choice: mini-batch
Mini-batch Gradient Descent:
computes the gradients on small random sets of instances called mini-batches (see the sketch below)
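A minimal NumPy sketch of mini-batch Gradient Descent, assuming X_b and y from the earlier slides are in scope; the batch size and learning rate are illustrative choices:

import numpy as np

n_epochs = 50
batch_size = 20
eta = 0.05                       # learning rate (illustrative)
m = len(X_b)                     # assumes X_b and y from the earlier slides

theta = np.random.randn(2, 1)    # random initialization
for epoch in range(n_epochs):
    shuffled = np.random.permutation(m)          # reshuffle every epoch
    X_shuf, y_shuf = X_b[shuffled], y[shuffled]
    for start in range(0, m, batch_size):
        xi = X_shuf[start:start + batch_size]    # one mini-batch
        yi = y_shuf[start:start + batch_size]
        gradients = 2 / len(xi) * xi.T.dot(xi.dot(theta) - yi)
        theta = theta - eta * gradients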
Simulated annealing for the learning rate
Even though SGD jumps around, it can still have a hard time finding the global optimum.
Method: gradually decrease the learning rate (a learning schedule), as sketched below.
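A minimal sketch of SGD with a gradually decreasing learning rate, assuming X_b and y from the earlier slides; t0 and t1 are illustrative schedule hyperparameters:

import numpy as np

n_epochs = 50
t0, t1 = 5, 50                        # illustrative schedule hyperparameters
m = len(X_b)                          # assumes X_b and y from earlier slides

def learning_schedule(t):
    return t0 / (t + t1)              # learning rate shrinks as training goes on

theta = np.random.randn(2, 1)         # random initialization
for epoch in range(n_epochs):
    for i in range(m):
        idx = np.random.randint(m)            # pick one random instance
        xi = X_b[idx:idx + 1]
        yi = y[idx:idx + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients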
Comparison of algorithms for Linear Regression
Challenges
Directions:
Gradient Descent => variants that use more than one gradient (e.g., momentum-based methods; see the link below)
https://www.ruder.io/optimizing-gradient-descent/
Step length:
learning rate => decay
adaptive
Linear vs. Non-linear problem
Linear problem:
analytical solution
SVD: Singular Value Decomposition
Gradient Descent
Non-linear problem:
Polynomial
Exponential
Or others
Polynomial Regression
Polynomial: Y = w0 + w1·X + w2·X^2 + w3·X^3
For d = 3, we can define new features: X1 = X, X2 = X^2, X3 = X^3
Now we can still use multilinear regression:
Y = w0 + w1·X1 + w2·X2 + w3·X3
More generally:
Y = w0 + w1·X1 + w2·X2 + w3·X3 + … + wd·Xd
HOMEWORK 2: given data, find the best degree d (a starting sketch follows below)
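A short Scikit-Learn sketch of polynomial regression with d = 3; the toy data here is made up for illustration:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# illustrative data: y is roughly cubic in X plus noise
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X**3 + X**2 + 2 + np.random.randn(100, 1)

d = 3                                              # polynomial degree to try
poly = PolynomialFeatures(degree=d, include_bias=False)
X_poly = poly.fit_transform(X)                     # columns: X, X^2, X^3

lin_reg = LinearRegression()                       # still a (multi)linear model
lin_reg.fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)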
Polynomial Curve
Higher order: Overfitting data
Linear: underfitting
High-degree Polynomial: d = ?
d = 300: overfitting
The Bias/Variance Trade-off
ERROR = the sum of three very different errors:
Bias: due to wrong assumptions (e.g., the wrong model)
Variance: due to variations in the training data
Irreducible error: due to noise and outliers in the data
Trade-off:
Less complexity: increases the model's bias and reduces its variance
More complexity: increases the model's variance and reduces its bias
Learning Curve
Learning curve
Training vs. validation error => still underfitting? (training error < validation error)
Homework 1 uses only part of the data
A sketch for computing such a curve follows below.
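One way to compute a learning curve (training vs. validation RMSE as the training set grows); this is a sketch assuming a model and data X, y are available, and the helper name, split size, and metric are illustrative choices:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

def learning_curve_errors(model, X, y):
    # Sketch: returns train/validation RMSE for growing training set sizes.
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        train_errors.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))
        val_errors.append(mean_squared_error(y_val, model.predict(X_val)))
    return np.sqrt(train_errors), np.sqrt(val_errors)

# usage (illustrative): train_rmse, val_rmse = learning_curve_errors(LinearRegression(), X, y)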
Early Stopping: over epochs
Train the model for many epochs; after each epoch, check the validation error and keep the best model so far (a sketch follows):
if val_error < minimum_val_error: save the current model
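A minimal early-stopping sketch with SGDRegressor, using warm_start so each fit() call continues training for one more epoch; it assumes X_train, y_train, X_val, y_val are already prepared (and scaled), and the hyperparameter values are illustrative:

from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       penalty=None, eta0=0.0005)   # one epoch per fit() call

minimum_val_error = float("inf")
best_epoch, best_model = None, None
for epoch in range(1000):
    sgd_reg.fit(X_train, y_train.ravel())            # continues where it left off
    val_error = mean_squared_error(y_val, sgd_reg.predict(X_val))
    if val_error < minimum_val_error:                # new best so far
        minimum_val_error = val_error
        best_epoch, best_model = epoch, deepcopy(sgd_reg)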
Plateau Check
One step:
|Error_(n+1) − Error_n| / Error_n < tolerance
e.g., |Error_101 − Error_100| / Error_100 = |650 000 − 650 100| / 650 100 ≈ 0.00015 < tolerance = 0.01
Multiple steps: require the relative change to stay below the tolerance for several consecutive epochs (see the sketch below).
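A small helper sketch for the plateau check above; the function name, default tolerance, and window size are illustrative:

def hit_plateau(errors, tolerance=0.01, window=5):
    # Illustrative helper: True if the relative change of the cost has stayed
    # below `tolerance` for the last `window` steps.
    if len(errors) < window + 1:
        return False
    for prev, curr in zip(errors[-window - 1:-1], errors[-window:]):
        if abs(curr - prev) / prev >= tolerance:
            return False                  # still changing noticeably
    return True

# One-step example from the slide:
print(abs(650_000 - 650_100) / 650_100)   # ~0.00015 < 0.01, a plateau-like step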
Generalized Regression
You can imagine:
exponential regression
logarithmic regression
Generalized Regression:
link(Y) = Z = WX
For example: y = 2^(WX)
log2(y) = WX
If we use z = log2(y), then we have z = WX, a linear model (see the sketch below).
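A tiny sketch of the log2 link idea: transform y, fit an ordinary linear model on z = log2(y), then invert the link to predict; the data here is made up for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# illustrative data roughly following y = 2^(1.5*x + 0.3) with mild noise
X = np.random.rand(200, 1) * 4
y = 2 ** (1.5 * X[:, 0] + 0.3) * (1 + 0.05 * np.random.randn(200))

z = np.log2(y)                      # apply the link: z = log2(y) = W·x
lin_reg = LinearRegression().fit(X, z)

z_pred = lin_reg.predict(X)         # predict in the linear (link) space
y_pred = 2 ** z_pred                # invert the link to get y back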
Regularized Linear Models
Ridge
LASSO: Least Absolute Shrinkage and Selection Operator Regression
Elastic Net
Ridge
Larger α => shrinks the weights (θ) more: Ridge adds an ℓ2 penalty on the weights, scaled by α, to the MSE cost.
Increasing α leads to flatter (i.e., less extreme, more reasonable) predictions, thus reducing the model's variance but increasing its bias.
(Figure: a plain linear model on the left, a degree-10 polynomial model on the right, for several values of α.)
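A short Ridge sketch in Scikit-Learn, assuming the X, y, and X_new arrays from the earlier slides; the α values are illustrative:

from sklearn.linear_model import Ridge, SGDRegressor

ridge_reg = Ridge(alpha=1.0)              # larger alpha => smaller weights
ridge_reg.fit(X, y)
ridge_reg.predict(X_new)

# Ridge-style regularization via SGD: use an l2 penalty
sgd_reg = SGDRegressor(penalty="l2", alpha=0.1, max_iter=1000, tol=1e-3)
sgd_reg.fit(X, y.ravel())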
LASSO
Different α => eliminates some weights entirely (sets them to zero): Lasso adds an ℓ1 penalty on the weights.
(Figure: a plain linear model on the left, a degree-10 polynomial model on the right, for several values of α.)
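A short Lasso sketch in Scikit-Learn, again assuming the X, y, and X_new arrays from the earlier slides; α is an illustrative value:

from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=0.1)        # larger alpha => more weights pushed to 0
lasso_reg.fit(X, y)
print(lasso_reg.coef_)              # sparse: unimportant features get weight 0
lasso_reg.predict(X_new)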
Ridge vs. LASSO
Lasso Regression (ℓ1):
tends to eliminate the weights of the least important features (i.e., set them to zero)
in other words, Lasso Regression automatically performs feature selection and outputs a sparse model (i.e., with few nonzero feature weights)
Ridge Regression (ℓ2):
reduces overfitting by shrinking the weights
Optimization
Example: Cancer data with 30 features
Elastic Net
Combines both penalties, with mix ratio r:
when r = 0, it is equivalent to Ridge
when r = 1, it is equivalent to Lasso Regression
Ridge is a good default
prefer Lasso (or Elastic Net) if you suspect only a few features are useful (a sketch follows below)
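A short Elastic Net sketch in Scikit-Learn, assuming the X, y, and X_new arrays from the earlier slides; α and the mix ratio are illustrative values:

from sklearn.linear_model import ElasticNet

# l1_ratio is the mix ratio r: 0 => pure Ridge (l2), 1 => pure Lasso (l1)
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
elastic_net.predict(X_new)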
Summary
Models:
Linear Regression
Multilinear Regression
Generalized Linear Regression
Optimization
Cost: minimize mean(ERROR^2), i.e., the MSE
Analytical solution
SGD, Batch, or mini-batch (how many samples at once?)
Training stop: evaluate => plateau: how many epochs?
Regularization: L2 for overcoming overfitting,
L1 for feature selection
END
• Read book Chapter 4
• Practice code from this Chapter
• Do your homework HW2
• Next lecture: Decision Tree (ch. 6)