CSC 411: Lecture 02: Linear Regression
Richard Zemel, Raquel Urtasun and Sanja Fidler
University of Toronto
(Most plots in this lecture are from Bishop’s book)
Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression 1 / 22
Problems for Today
What should I watch this Friday?
Goal: Predict movie rating automatically!
Goal: How many followers will I get?
Goal: Predict the price of the house
Regression
What do all these problems have in common?
- Continuous outputs, we’ll call these $t$
  (e.g., a rating: a real number between 0 and 10, number of followers, house
  price)
Predicting continuous outputs is called regression
What do I need in order to predict these outputs?
- Features (inputs), we’ll call these $x$ (or $\mathbf{x}$ if vectors)
- Training examples, many $x^{(i)}$ for which $t^{(i)}$ is known (e.g., many
  movies for which we know the rating)
- A model, a function that represents the relationship between $x$ and $t$
- A loss or a cost or an objective function, which tells us how well our
  model approximates the training examples
- Optimization, a way of finding the parameters of our model that
  minimizes the loss function
Today: Linear Regression
Linear regression
- continuous outputs
- simple model (linear)
Introduce key concepts:
- loss functions
- generalization
- optimization
- model complexity
- regularization
Simple 1-D regression
Circles are data points (i.e., training examples) that are given to us
The data points are uniform in x, but may be displaced in y
$$t(x) = f(x) + \epsilon$$
with $\epsilon$ some noise
In green is the "true" curve that we don’t know
Goal: We want to fit a curve to these points
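The setup above can be sketched in a few lines of Python. This is only an illustration: the choice $f(x) = \sin(2\pi x)$, the Gaussian noise, and the names `make_noisy_data` and `noise_std` are assumptions made here, not something the slide specifies.

```python
import math
import random

def make_noisy_data(n, noise_std=0.3, seed=0):
    """Return n training pairs with t = f(x) + epsilon."""
    rng = random.Random(seed)
    xs = [i / (n - 1) for i in range(n)]   # inputs on a uniform grid in [0, 1]
    # Assumed stand-in for the unknown "true" curve: f(x) = sin(2*pi*x).
    # epsilon is drawn from a zero-mean Gaussian (also an assumption).
    ts = [math.sin(2 * math.pi * x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ts

xs, ts = make_noisy_data(10)
```

The learner only ever sees `xs` and `ts`; the function inside `make_noisy_data` plays the role of the green curve.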
Simple 1-D regression
Key Questions:
- How do we parametrize the model?
- What loss (objective) function should we use to judge the fit?
- How do we optimize fit to unseen test data (generalization)?
Example: Boston Housing data
Estimate median house price in a neighborhood based on neighborhood
statistics
Look at first possible attribute (feature): per capita crime rate
Use this to predict house prices in other neighborhoods
Is this a good input (attribute) to predict house prices?
Represent the Data
Data is described as pairs $D = \{(x^{(1)}, t^{(1)}), \cdots, (x^{(N)}, t^{(N)})\}$
- $x \in \mathbb{R}$ is the input feature (per capita crime rate)
- $t \in \mathbb{R}$ is the target output (median house price)
- the superscript $(i)$ simply indexes the training examples (we have $N$ in this case)
Here $t$ is continuous, so this is a regression problem
Model outputs $y$, an estimate of $t$
$$y(x) = w_0 + w_1 x$$
What type of model did we choose?
Divide the dataset into training and testing examples
- Use the training examples to construct a hypothesis, or function
  approximator, that maps $x$ to predicted $y$
- Evaluate the hypothesis on the test set
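A minimal sketch of the train/test protocol described above, assuming a random shuffle and a 25% test fraction (both are choices made here for illustration). The data values are made up and are not taken from the Boston Housing set; `train_test_split`, `predict`, and `mean_squared_error` are hypothetical helper names.

```python
import random

def train_test_split(pairs, test_fraction=0.25, seed=0):
    """Shuffle (x, t) pairs and split them into training and test lists."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]

def predict(w0, w1, x):
    """Linear hypothesis y(x) = w0 + w1 * x."""
    return w0 + w1 * x

def mean_squared_error(w0, w1, pairs):
    """Average squared difference between predictions y and targets t."""
    return sum((t - predict(w0, w1, x)) ** 2 for x, t in pairs) / len(pairs)

# Made-up (crime rate, median price) pairs, for illustration only.
data = [(0.1, 30.0), (0.5, 27.0), (1.0, 24.0), (2.0, 21.0),
        (4.0, 18.0), (6.0, 16.0), (8.0, 13.0), (10.0, 12.0)]
train, test = train_test_split(data)
```

Any candidate `(w0, w1)` would be chosen using `train` only and then scored with `mean_squared_error(w0, w1, test)`.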
Noise
A simple model typically does not exactly fit the data
- lack of fit can be considered noise
Sources of noise:
- Imprecision in data attributes (input noise, e.g., noise in per-capita
  crime)
- Errors in data targets (mislabeling, e.g., noise in house prices)
- Additional attributes that affect the target values but are not captured
  by the data attributes (latent variables). In the example, what else could affect
  house prices?
- Model may be too simple to account for data targets
Least-Squares Regression
Define a model
$$y(x) = \text{function}(x, \mathbf{w})$$
Linear: $y(x) = w_0 + w_1 x$
Standard loss/cost/objective function measures the squared error between $y$
and the true value $t$
$$\ell(\mathbf{w}) = \sum_{n=1}^{N} \left[t^{(n)} - y(x^{(n)})\right]^2$$
Linear model:
$$\ell(\mathbf{w}) = \sum_{n=1}^{N} \left[t^{(n)} - (w_0 + w_1 x^{(n)})\right]^2$$
For a particular hypothesis ($y(x)$ defined by a choice of $\mathbf{w}$, drawn in red),
what does the loss represent geometrically?
The loss for the red hypothesis is the sum of the squared vertical errors
(squared lengths of green vertical lines)
How do we obtain weights $\mathbf{w} = (w_0, w_1)$? Find $\mathbf{w}$ that minimizes loss $\ell(\mathbf{w})$
For the linear model, what kind of a function is $\ell(\mathbf{w})$?
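A small sketch of the squared-error loss for the linear model, on made-up data. For this model the loss is a quadratic (and convex) function of the weights; the last lines illustrate this with a single midpoint check rather than a proof. The function name `squared_error_loss` and the data are assumptions for illustration.

```python
def squared_error_loss(w0, w1, xs, ts):
    """Sum over examples of the squared vertical error t - (w0 + w1 * x)."""
    return sum((t - (w0 + w1 * x)) ** 2 for x, t in zip(xs, ts))

xs = [0.0, 1.0, 2.0]
ts = [1.0, 3.0, 5.0]   # made-up targets lying exactly on t = 1 + 2x

loss_at_truth = squared_error_loss(1.0, 2.0, xs, ts)   # every residual is 0
loss_off = squared_error_loss(0.0, 2.0, xs, ts)        # every residual is 1

# Convexity check at one midpoint: loss(midpoint) <= average of endpoint losses.
loss_a = squared_error_loss(0.0, 0.0, xs, ts)
loss_b = squared_error_loss(2.0, 4.0, xs, ts)
loss_mid = squared_error_loss(1.0, 2.0, xs, ts)
```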
Optimizing the Objective
One straightforward method: gradient descent
- initialize $\mathbf{w}$ (e.g., randomly)
- repeatedly update $\mathbf{w}$ based on the gradient
$$\mathbf{w} \leftarrow \mathbf{w} - \lambda \frac{\partial \ell}{\partial \mathbf{w}}$$
$\lambda$ is the learning rate
For a single training case, this gives the LMS update rule (Least Mean
Squares):
$$\mathbf{w} \leftarrow \mathbf{w} + 2\lambda \underbrace{\left(t^{(n)} - y(x^{(n)})\right)}_{\text{error}} x^{(n)}$$
Note: As error approaches zero, so does the update ($\mathbf{w}$ stops changing)
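The LMS rule can be checked component by component for the 1-D linear model. In the sketch below, $\ell_n$ is notation introduced here for the loss on the single training case $n$, and the input multiplying the error in the vector update is read as the augmented input $(1, x^{(n)})^T$, so that the $w_0$ component is multiplied by 1.

```latex
\begin{align*}
\ell_n(\mathbf{w}) &= \left[t^{(n)} - (w_0 + w_1 x^{(n)})\right]^2 \\
\frac{\partial \ell_n}{\partial w_0} &= -2\left(t^{(n)} - y(x^{(n)})\right) \\
\frac{\partial \ell_n}{\partial w_1} &= -2\left(t^{(n)} - y(x^{(n)})\right) x^{(n)} \\
w_0 &\leftarrow w_0 + 2\lambda\left(t^{(n)} - y(x^{(n)})\right) \\
w_1 &\leftarrow w_1 + 2\lambda\left(t^{(n)} - y(x^{(n)})\right) x^{(n)}
\end{align*}
```

The sign flip from $-\lambda\,\partial\ell/\partial\mathbf{w}$ to $+2\lambda(\ldots)$ comes from the minus sign inside the squared term.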
Optimizing Across Training Set
Two ways to generalize this for all examples in training set:
1. Batch updates: sum or average updates across every example n, then
change the parameter values
$$\mathbf{w} \leftarrow \mathbf{w} + 2\lambda \sum_{n=1}^{N} \left(t^{(n)} - y(x^{(n)})\right) x^{(n)}$$
2. Stochastic/online updates: update the parameters for each training
case in turn, according to its own gradients
Algorithm 1 Stochastic gradient descent
  Randomly shuffle examples in the training set
  for i = 1 to N do
    Update: $\mathbf{w} \leftarrow \mathbf{w} + 2\lambda \left(t^{(i)} - y(x^{(i)})\right) x^{(i)}$ (update for a linear model)
  end for
- Underlying assumption: sample is independent and identically
  distributed (i.i.d.)
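Both update schemes can be sketched for the 1-D linear model as follows. The data are made up and noise-free, and the learning rate `lr` (the slide's $\lambda$), the iteration counts, and the zero initialization are assumptions picked so that this toy example converges; they are not recommendations.

```python
import random

def predict(w, x):
    """Linear model y(x) = w0 + w1 * x with w = [w0, w1]."""
    return w[0] + w[1] * x

def batch_step(w, data, lr):
    """One batch update: sum the per-example terms, then change w once."""
    g0 = sum(t - predict(w, x) for x, t in data)          # w0 component (input 1)
    g1 = sum((t - predict(w, x)) * x for x, t in data)    # w1 component (input x)
    return [w[0] + 2 * lr * g0, w[1] + 2 * lr * g1]

def sgd_epoch(w, data, lr, rng):
    """One pass of stochastic updates over a shuffled copy of the data."""
    order = data[:]
    rng.shuffle(order)
    for x, t in order:
        err = t - predict(w, x)
        w = [w[0] + 2 * lr * err, w[1] + 2 * lr * err * x]
    return w

# Made-up, noise-free data on t = 1 + 2x.
data = [(i / 4, 1.0 + 2.0 * (i / 4)) for i in range(5)]

w_batch = [0.0, 0.0]
for _ in range(2000):
    w_batch = batch_step(w_batch, data, lr=0.05)

rng = random.Random(0)
w_sgd = [0.0, 0.0]
for _ in range(2000):
    w_sgd = sgd_epoch(w_sgd, data, lr=0.05, rng=rng)
```

With noisy data the stochastic variant would generally hover around the minimizer instead of settling exactly, unless the learning rate is decreased over time.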
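Analytical Solution?
For some objectives we can also find the optimal solution analytically
This is the case for linear least-squares regression
How?
Compute the derivatives of the objective with respect to $\mathbf{w}$ and set them to 0
Define:
$$\mathbf{t} = [t^{(1)}, t^{(2)}, \ldots, t^{(N)}]^T$$
$$\mathbf{X} = \begin{bmatrix} 1 & x^{(1)} \\ 1 & x^{(2)} \\ \vdots & \vdots \\ 1 & x^{(N)} \end{bmatrix}$$
Then:
$$\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{t}$$
(work it out!)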
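One way to work it out, sketched in matrix form. It relies on writing the sum-of-squares loss as a squared norm and assumes that $\mathbf{X}^T \mathbf{X}$ is invertible (for the 1-D model this holds as long as the $x^{(n)}$ are not all equal).

```latex
\begin{align*}
\ell(\mathbf{w}) &= (\mathbf{t} - \mathbf{X}\mathbf{w})^T (\mathbf{t} - \mathbf{X}\mathbf{w}) \\
\frac{\partial \ell}{\partial \mathbf{w}} &= -2\,\mathbf{X}^T (\mathbf{t} - \mathbf{X}\mathbf{w}) = \mathbf{0} \\
\mathbf{X}^T \mathbf{X}\,\mathbf{w} &= \mathbf{X}^T \mathbf{t} \\
\mathbf{w} &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{t}
\end{align*}
```

The third line is the system usually called the normal equations.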
Multi-dimensional Inputs
One method of extending the model is to consider other input dimensions
$$y(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2$$
In the Boston housing example, we can look at the number of rooms
Linear Regression with Multi-dimensional Inputs
Imagine now we want to predict the median house price from these
multi-dimensional observations
Each house is a data point $n$, with observations indexed by $j$:
$$\mathbf{x}^{(n)} = \left(x_1^{(n)}, \cdots, x_j^{(n)}, \cdots, x_d^{(n)}\right)$$
We can incorporate the bias $w_0$ into $\mathbf{w}$, by using $x_0 = 1$, then
$$y(\mathbf{x}) = w_0 + \sum_{j=1}^{d} w_j x_j = \mathbf{w}^T \mathbf{x}$$
We can then solve for $\mathbf{w} = (w_0, w_1, \cdots, w_d)$. How?
We can use gradient descent to solve for each coefficient, or compute $\mathbf{w}$
analytically (how does the solution change?)
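For the analytical route, the formula keeps the same form; only $\mathbf{X}$ gains one column per input dimension (plus the column of ones for $x_0 = 1$). Below is a standard-library-only sketch that solves $\mathbf{X}^T \mathbf{X}\,\mathbf{w} = \mathbf{X}^T \mathbf{t}$ by Gauss-Jordan elimination. The helper names and the data are made up for illustration, and the code assumes $\mathbf{X}^T \mathbf{X}$ is non-singular. In practice a linear-algebra library would normally be used instead.

```python
def solve_linear_system(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting.

    Assumes A is square and non-singular.
    """
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]   # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_least_squares(inputs, targets):
    """Return w = (w0, ..., wd) solving the normal equations X^T X w = X^T t."""
    X = [[1.0] + list(x) for x in inputs]   # prepend x0 = 1 for the bias w0
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xtt = [sum(row[i] * t for row, t in zip(X, targets)) for i in range(k)]
    return solve_linear_system(XtX, Xtt)

# Made-up, noise-free data on t = 1 + 2*x1 - 3*x2.
inputs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
targets = [1.0 + 2.0 * x1 - 3.0 * x2 for x1, x2 in inputs]
w = fit_least_squares(inputs, targets)
```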
More Powerful Models?
What if our linear model is not good? How can we create a more
complicated model?
Fitting a Polynomial
We can create a more complicated model by defining input variables that are
combinations of components of $\mathbf{x}$
Example: an $M$-th order polynomial function of a one-dimensional feature $x$:
$$y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{M} w_j x^j$$
where $x^j$ is the $j$-th power of $x$
We can use the same approach to optimize for the weights $\mathbf{w}$
How do we do that?
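One way to see it: the polynomial is nonlinear in $x$ but still linear in the weights, so the powers of $x$ can be treated as the input features of an ordinary linear regression. A minimal sketch, with hypothetical helper names and made-up weights:

```python
def polynomial_features(x, M):
    """Map scalar x to [x^0, x^1, ..., x^M]; x^0 = 1 carries the bias w0."""
    return [x ** j for j in range(M + 1)]

def predict_polynomial(w, x):
    """y(x, w) = sum_j w_j * x^j, which is linear in the weights w."""
    phi = polynomial_features(x, len(w) - 1)
    return sum(w_j * phi_j for w_j, phi_j in zip(w, phi))

features = polynomial_features(2.0, 3)
y = predict_polynomial([1.0, 0.0, -1.0, 0.5], 2.0)   # 1 + 0*2 - 1*4 + 0.5*8
```

Stacking `polynomial_features(x, M)` for every training input gives a design matrix with $M + 1$ columns, to which the same gradient-descent or analytical solution applies.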
Which Fit is Best?
from Bishop
Generalization
Generalization = model’s ability to predict the held out data
What is happening?
Our model with M = 9 overfits the data (it also models the noise)
Not a problem if we have lots of training examples
Let’s look at the estimated weights for various M in the case of fewer
examples
The weights are becoming huge to compensate for the noise
One way of dealing with this is to encourage the weights to be small (this
way no input dimension will have too much influence on prediction). This is
called regularization
Regularized Least Squares
Increasing the input features this way can complicate the model considerably
Goal: select the appropriate model complexity automatically
Standard approach: regularization
$$\tilde{\ell}(\mathbf{w}) = \sum_{n=1}^{N} \left[t^{(n)} - (w_0 + w_1 x^{(n)})\right]^2 + \alpha \mathbf{w}^T \mathbf{w}$$
Intuition: Since we are minimizing the loss, the second term will encourage
smaller values in $\mathbf{w}$
Penalizing the squared weights in this way is known as ridge regression in
statistics
Leads to a modified update rule for gradient descent:
$$\mathbf{w} \leftarrow \mathbf{w} + 2\lambda \left[\sum_{n=1}^{N} \left(t^{(n)} - y(x^{(n)})\right) x^{(n)} - \alpha \mathbf{w}\right]$$
Also has an analytical solution: $\mathbf{w} = (\mathbf{X}^T \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^T \mathbf{t}$ (verify!)
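For the 1-D model the regularized analytical solution only involves a 2x2 system, so it can be sketched without a linear-algebra library. The function name `fit_ridge_1d` and the data are made up for illustration. As in the formula above, the penalty here also applies to $w_0$; many practical implementations leave the bias unpenalized.

```python
def fit_ridge_1d(xs, ts, alpha):
    """Solve (X^T X + alpha I) w = X^T t for the 1-D model y = w0 + w1 * x."""
    a = len(xs) + alpha                      # top-left entry: N + alpha
    b = sum(xs)                              # off-diagonal entry: sum of x
    d = sum(x * x for x in xs) + alpha       # bottom-right entry: sum of x^2 + alpha
    r0 = sum(ts)                             # first entry of X^T t
    r1 = sum(x * t for x, t in zip(xs, ts))  # second entry of X^T t
    det = a * d - b * b
    # Inverse of the symmetric 2x2 matrix [[a, b], [b, d]] applied to (r0, r1).
    return [(d * r0 - b * r1) / det, (a * r1 - b * r0) / det]

xs = [0.0, 1.0, 2.0]
ts = [1.0, 3.0, 5.0]                         # made-up data on t = 1 + 2x

w_plain = fit_ridge_1d(xs, ts, alpha=0.0)    # ordinary least squares
w_ridge = fit_ridge_1d(xs, ts, alpha=1.0)    # weights pulled toward zero

def squared_norm(w):
    return sum(w_j * w_j for w_j in w)
```

Comparing `squared_norm(w_ridge)` with `squared_norm(w_plain)` shows the shrinkage that the penalty term encourages.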
Regularized least squares
Better generalization
Choose α carefully
1-D regression illustrates key concepts
Data fits: is a linear model best (model selection)?
- Simple models may not capture all the important variations (signal) in
  the data: underfit
- More complex models may overfit the training data (fit not only the
  signal but also the noise in the data), especially if there is not enough data to
  constrain the model
One method of assessing fit: test generalization = model’s ability to predict
the held out data
Optimization is essential: stochastic and batch iterative approaches; analytic
when available
So...
Which movie will you watch?