Lecture 3
Training Models
Chapter 4
CSC 484 / 584, DA 515
Chapter 4 Training (Main Points)
Regression Models
Linear Regression: optimization problem
Batch
SGD
Mini-Batch
Polynomial, Exponential, Logarithmic, …
Generalized Regression
Regularization (L1, L2, or Elastic)
-------------
Logistic Regression
Softmax Regression
Linear Regression
Goal: find the best-fitting line
The optimal weights: minimize the squared errors (least squares)
Solution 1.1: Analytical Solution
1.1 Use NumPy to compute the inverse and the matrix products:
# use np to solve it
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
--------------------------------------------
y = Xθ
Xᵀy = XᵀXθ
(XᵀX)⁻¹Xᵀy = (XᵀX)⁻¹(XᵀX)θ   =>   θ = (XᵀX)⁻¹Xᵀy
Prediction: ŷ = X_new θ
Prediction with found weights
Solution 1.2: using sk-learn
1.2 use sk-learn
from sklearn.linear_model import LinearRegression
# 3 steps: model/fitting/result
lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_
# predict
lin_reg.predict(X_new)
Matrix inverse vs. SVD
Matrix inversion can be problematic:
XᵀX is not always invertible (singular**, e.g., when features are redundant)
XᵀX is an (n + 1) × (n + 1) matrix (where n is the number of features); the computational complexity of inverting such a matrix is typically about O(n^2.4) to O(n^3)
A very large dataset may not fit into memory.
** A square matrix that does not have a matrix inverse. A matrix is singular iff its determinant is 0.
Solution 2: SVD (Singular Value Decomposition)
Decompose the training set matrix X into the product of three matrices: X = U Σ Vᵀ
Good:
works even if m < n (the number of samples m is smaller than the number of features n), where XᵀX is singular
if m >> n, SVD is more efficient than matrix inversion, especially for large datasets
Scikit-Learn uses SVD to find the solution (the weights W, i.e., θ, in our case)
The SVD approach used by Scikit-Learn’s LinearRegression class is about O(n^2).
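A minimal NumPy sketch of the SVD route, assuming the X_b (design matrix with a bias column) and y arrays from the earlier NumPy example are in scope; np.linalg.pinv computes the Moore-Penrose pseudoinverse via SVD, and np.linalg.lstsq is an SVD-based least-squares solver:

import numpy as np

# Works even when X_b.T @ X_b is singular (redundant features, m < n):
theta_best_svd = np.linalg.pinv(X_b).dot(y)

# Equivalent SVD-based least-squares solve:
theta_best_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X_b, y, rcond=None)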
Singular Value Decomposition
A is the data matrix, m × n (i.e., the X we have used):
https://slideplayer.com/slide/5189063/
Singular Values
Solution 3. Minimize the Error
Minimize the cost function, starting from random initial weights
Solution 3: Gradient Descent
Two key parameters
Direction: negative gradient
Step length: learning rate (η)
Gradient Descent
Local minimum and plateau
Data Scaling
In 2-D:
features 1 and 2 have the same scale (on the left)
feature 1 has much smaller values than feature 2 (on the right)
Gradient Descent converges much faster when all features have a similar scale, so scale the data first (see the sketch below).
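A short Scikit-Learn sketch of feature scaling with StandardScaler; the toy X matrix here is made up for illustration:

from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 1000.0]])      # feature 1 is much smaller than feature 2

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and std 1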
Batch Gradient Descent-1
We have m data samples and each has n features (j = 1, …, n).
Partial derivative along one feature direction j, over all samples:
∂MSE(θ)/∂θ_j = (2/m) Σ_{i=1..m} (θᵀx^(i) − y^(i)) · x_j^(i)
Gradient vector over all features and all samples:
∇_θ MSE(θ) = (2/m) Xᵀ(Xθ − y)
Batch Gradient Descent-2
Move to the next step: update the weights, θ := θ − η ∇_θ MSE(θ)
eta = 0.05          # learning rate (could later be multiplied by a decay factor)
n_iterations = 1000
m = 100             # number of training samples
theta = np.random.randn(2, 1)   # random initialization

for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)   # gradient of the MSE
    theta = theta - eta * gradients                   # weight update
Batch Gradient Descent: Learning Rates
Use: all samples and all features
maximum 1000 steps (or until converged)
Different learning rates:
Batch Gradient Descent: summary
MSE cost function:
is convex
its slope does not change abruptly
Batch Gradient Descent:
with a fixed learning rate it will eventually converge to the optimal solution
but you may have to wait a while: it uses the whole training set to compute the gradients at every step, so it is very slow when the training set is large.
Stochastic Gradient Descent
Stochastic Gradient Descent:
uses a random instance from the training set at every step
computes the gradients based only on that single instance
the cost function will bounce up and down, decreasing only on average. Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around, never settling down.
SGD: single sample each time
Problem:
the cost function bounces up and down, decreasing only on average
it ends up very close to the minimum, but never settles exactly on it
Benefit:
good for irregular cost functions
can escape from local optima
SK-learn: SGD method
SGDRegressor
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=1000,   # at most 1000 epochs
                       tol=1e-3,        # convergence tolerance
                       penalty=None,    # no regularization
                       eta0=0.1)        # initial learning rate
sgd_reg.fit(X, y.ravel())
Best Choice: mini-batch
Mini-batch Gradient Descent:
computes the gradients on small random sets of instances called mini-batches (see the sketch below)
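A minimal NumPy sketch of mini-batch Gradient Descent, assuming X_b and y from the earlier slides are in scope; the batch size and learning rate are illustrative choices:

import numpy as np

n_epochs = 50
batch_size = 20
eta = 0.05                       # learning rate (illustrative)
m = len(X_b)                     # assumes X_b and y from the earlier slides

theta = np.random.randn(2, 1)    # random initialization
for epoch in range(n_epochs):
    shuffled = np.random.permutation(m)          # reshuffle every epoch
    X_shuf, y_shuf = X_b[shuffled], y[shuffled]
    for start in range(0, m, batch_size):
        xi = X_shuf[start:start + batch_size]    # one mini-batch
        yi = y_shuf[start:start + batch_size]
        gradients = 2 / len(xi) * xi.T.dot(xi.dot(theta) - yi)
        theta = theta - eta * gradients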
Simulated annealing for the learning rate
Even though SGD jumps around, it can still have a hard time finding the global optimum.
Method: gradually decrease the learning rate (a learning schedule), as sketched below.
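A minimal sketch of SGD with a gradually decreasing learning rate, assuming X_b and y from the earlier slides; t0 and t1 are illustrative schedule hyperparameters:

import numpy as np

n_epochs = 50
t0, t1 = 5, 50                        # illustrative schedule hyperparameters
m = len(X_b)                          # assumes X_b and y from earlier slides

def learning_schedule(t):
    return t0 / (t + t1)              # learning rate shrinks as training goes on

theta = np.random.randn(2, 1)         # random initialization
for epoch in range(n_epochs):
    for i in range(m):
        idx = np.random.randint(m)            # pick one random instance
        xi = X_b[idx:idx + 1]
        yi = y[idx:idx + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients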
Comparison of algorithms for Linear Regression
Challenges
Directions:
Gradient Descent => variants that use more than one gradient (e.g., momentum-based methods; see the link below)
https://www.ruder.io/optimizing-gradient-descent/
Step length:
learning rate => decay
adaptive
Linear vs. Non-linear problem
Linear problem:
analytical solution
SVD: Singular Value Decomposition
Gradient Descent
Non-linear problem:
Polynomial
Exponential
Or others
Polynomial Regression
Polynomial: Y = w0 + w1·X + w2·X^2 + w3·X^3
For d = 3, we can define new features: X1 = X, X2 = X^2, X3 = X^3
Now we can still use multilinear regression:
Y = w0 + w1·X1 + w2·X2 + w3·X3
More generally:
Y = w0 + w1·X1 + w2·X2 + w3·X3 + … + wd·Xd
HOMEWORK 2: given data, find the best degree d (a starting sketch follows below)
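A short Scikit-Learn sketch of polynomial regression with d = 3; the toy data here is made up for illustration:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# illustrative data: y is roughly cubic in X plus noise
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X**3 + X**2 + 2 + np.random.randn(100, 1)

d = 3                                              # polynomial degree to try
poly = PolynomialFeatures(degree=d, include_bias=False)
X_poly = poly.fit_transform(X)                     # columns: X, X^2, X^3

lin_reg = LinearRegression()                       # still a (multi)linear model
lin_reg.fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)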
Polynomial Curve
Higher order: Overfitting data
Linear: underfitting
High-degree Polynomial: d = ?
d = 300: overfitting
The Bias/Variance Trade-off
ERROR = the sum of three very different errors:
Bias: due to wrong assumptions (e.g., the wrong model)
Variance: due to variations in the training data
Irreducible error: due to noise and outliers in the data
Trade-off:
Less complexity: increases the model's bias and reduces its variance
More complexity: increases the model's variance and reduces its bias
Learning Curve
Learning curve
Training vs. validation error => still underfitting? (training error < validation error)
Homework 1 uses only part of the data
A sketch for computing such a curve follows below.
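One way to compute a learning curve (training vs. validation RMSE as the training set grows); this is a sketch assuming a model and data X, y are available, and the helper name, split size, and metric are illustrative choices:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

def learning_curve_errors(model, X, y):
    # Sketch: returns train/validation RMSE for growing training set sizes.
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        train_errors.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))
        val_errors.append(mean_squared_error(y_val, model.predict(X_val)))
    return np.sqrt(train_errors), np.sqrt(val_errors)

# usage (illustrative): train_rmse, val_rmse = learning_curve_errors(LinearRegression(), X, y)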
Early Stopping: over epochs
Train the model for many epochs; after each epoch, check the validation error and keep the best model so far (a sketch follows):
if val_error < minimum_val_error: save the current model
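A minimal early-stopping sketch with SGDRegressor, using warm_start so each fit() call continues training for one more epoch; it assumes X_train, y_train, X_val, y_val are already prepared (and scaled), and the hyperparameter values are illustrative:

from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       penalty=None, eta0=0.0005)   # one epoch per fit() call

minimum_val_error = float("inf")
best_epoch, best_model = None, None
for epoch in range(1000):
    sgd_reg.fit(X_train, y_train.ravel())            # continues where it left off
    val_error = mean_squared_error(y_val, sgd_reg.predict(X_val))
    if val_error < minimum_val_error:                # new best so far
        minimum_val_error = val_error
        best_epoch, best_model = epoch, deepcopy(sgd_reg)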
Plateau Check
One step:
|Error_(n+1) − Error_n| / Error_n < tolerance
e.g., |Error_101 − Error_100| / Error_100 = |650 000 − 650 100| / 650 100 ≈ 0.00015 < tolerance = 0.01
Multiple steps: require the relative change to stay below the tolerance for several consecutive epochs (see the sketch below).
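A small helper sketch for the plateau check above; the function name, default tolerance, and window size are illustrative:

def hit_plateau(errors, tolerance=0.01, window=5):
    # Illustrative helper: True if the relative change of the cost has stayed
    # below `tolerance` for the last `window` steps.
    if len(errors) < window + 1:
        return False
    for prev, curr in zip(errors[-window - 1:-1], errors[-window:]):
        if abs(curr - prev) / prev >= tolerance:
            return False                  # still changing noticeably
    return True

# One-step example from the slide:
print(abs(650_000 - 650_100) / 650_100)   # ~0.00015 < 0.01, a plateau-like step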
Generalized Regression
You can imagine:
exponential regression
logarithmic regression
Generalized Regression:
link(Y) = Z = WX
For example: y = 2^(WX)
log2(y) = WX
If we use z = log2(y), then we have z = WX, a linear model (see the sketch below).
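A tiny sketch of the log2 link idea: transform y, fit an ordinary linear model on z = log2(y), then invert the link to predict; the data here is made up for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# illustrative data roughly following y = 2^(1.5*x + 0.3) with mild noise
X = np.random.rand(200, 1) * 4
y = 2 ** (1.5 * X[:, 0] + 0.3) * (1 + 0.05 * np.random.randn(200))

z = np.log2(y)                      # apply the link: z = log2(y) = W·x
lin_reg = LinearRegression().fit(X, z)

z_pred = lin_reg.predict(X)         # predict in the linear (link) space
y_pred = 2 ** z_pred                # invert the link to get y back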
Regularized Linear Models
Ridge
LASSO: Least Absolute Shrinkage and Selection Operator Regression
Elastic Net
Ridge
Larger α => shrinks the weights (θ) more: Ridge adds an ℓ2 penalty on the weights, scaled by α, to the MSE cost.
Increasing α leads to flatter (i.e., less extreme, more reasonable) predictions, thus reducing the model's variance but increasing its bias.
(Figure: a plain linear model on the left, a degree-10 polynomial model on the right, for several values of α.)
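A short Ridge sketch in Scikit-Learn, assuming the X, y, and X_new arrays from the earlier slides; the α values are illustrative:

from sklearn.linear_model import Ridge, SGDRegressor

ridge_reg = Ridge(alpha=1.0)              # larger alpha => smaller weights
ridge_reg.fit(X, y)
ridge_reg.predict(X_new)

# Ridge-style regularization via SGD: use an l2 penalty
sgd_reg = SGDRegressor(penalty="l2", alpha=0.1, max_iter=1000, tol=1e-3)
sgd_reg.fit(X, y.ravel())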
LASSO
Different α => eliminates some weights entirely (sets them to zero): Lasso adds an ℓ1 penalty on the weights.
(Figure: a plain linear model on the left, a degree-10 polynomial model on the right, for several values of α.)
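A short Lasso sketch in Scikit-Learn, again assuming the X, y, and X_new arrays from the earlier slides; α is an illustrative value:

from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=0.1)        # larger alpha => more weights pushed to 0
lasso_reg.fit(X, y)
print(lasso_reg.coef_)              # sparse: unimportant features get weight 0
lasso_reg.predict(X_new)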
Ridge vs. LASSO
Lasso Regression (ℓ1):
tends to eliminate the weights of the least important features (i.e., set them to zero)
in other words, Lasso Regression automatically performs feature selection and outputs a sparse model (i.e., with few nonzero feature weights)
Ridge Regression (ℓ2):
reduces overfitting by shrinking the weights
Optimization
Example: Cancer data with 30 features
Elastic Net
Combines both penalties, with mix ratio r:
when r = 0, it is equivalent to Ridge
when r = 1, it is equivalent to Lasso Regression
Ridge is a good default
prefer Lasso (or Elastic Net) if you suspect only a few features are useful (a sketch follows below)
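A short Elastic Net sketch in Scikit-Learn, assuming the X, y, and X_new arrays from the earlier slides; α and the mix ratio are illustrative values:

from sklearn.linear_model import ElasticNet

# l1_ratio is the mix ratio r: 0 => pure Ridge (l2), 1 => pure Lasso (l1)
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
elastic_net.predict(X_new)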
Summary
Models:
Linear Regression
Multilinear Regression
Generalized Linear Regression
Optimization
Cost: minimize mean(ERROR^2), i.e., the MSE
Analytical solution
SGD, Batch, or mini-batch (how many samples at once?)
Training stop: evaluate => plateau: how many epochs?
Regularization: L2 for overcoming overfitting,
L1 for feature selection
END
• Read book Chapter 4
• Practice code from this Chapter
• Do your homework HW2
• Next lecture: Decision Tree (ch. 6)