Linear Regression
Courtesy: Richard Zemel, Raquel Urtasun, and Sanja Fidler
University of Toronto
(Most plots in this lecture are from Bishop’s book)
Problems for Today
What should I watch this Friday?
Goal: Predict movie rating automatically!
Goal: Predict the price of the house
Regression
What do all these problems have in common?
- Continuous outputs, we'll call these t
  (e.g., a rating: a real number between 0 and 10, # of followers, house price)
Predicting continuous outputs is called regression
What do I need in order to predict these outputs?
- Features (inputs), we'll call these x (or x if vectors)
- Training examples, many x^(i) for which t^(i) is known (e.g., many movies for which we know the rating)
- A model, a function that represents the relationship between x and t
- A loss or a cost or an objective function, which tells us how well our model approximates the training examples
- Optimization, a way of finding the parameters of our model that minimize the loss function
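To make these ingredients concrete, here is a minimal Python sketch (not part of the original slides) that puts them together for a 1-D problem: synthetic inputs and targets stand in for real data, the model is y(x) = w0 + w1*x, the loss is the squared error, and plain gradient descent is used as one possible optimizer. The data, learning rate, and iteration count are all illustrative assumptions.

```python
import numpy as np

# Training examples: inputs x and continuous targets t
# (synthetic stand-ins for real features/outputs such as crime rate and house price).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)                  # features (inputs)
t = 2.0 + 3.0 * x + rng.normal(0, 0.1, 50)      # targets, with some noise

# Model: a linear function y(x) = w0 + w1 * x with parameters w = (w0, w1).
def predict(w, x):
    return w[0] + w[1] * x

# Loss: squared error between the predictions y and the true targets t.
def loss(w, x, t):
    return np.sum((t - predict(w, x)) ** 2)

# Optimization: plain gradient descent on the loss (one simple choice of optimizer).
w = np.zeros(2)
lr = 0.005
for _ in range(5000):
    residual = t - predict(w, x)
    grad = np.array([-2 * np.sum(residual), -2 * np.sum(residual * x)])
    w -= lr * grad

print(w, loss(w, x, t))   # w should end up close to the generating values (2.0, 3.0)
```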
Simple 1-D regression
Circles are data points (i.e., training examples) that are given to us
The data points are uniform in x, but may be displaced in y:
  t(x) = f(x) + ε
with ε some noise
In green is the "true" curve that we don't know
Goal: We want to fit a curve to these points
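As a concrete illustration (not from the slides), the sketch below generates data of exactly this form. It assumes a sinusoid as the unknown curve f(x), as in Bishop's running example, and Gaussian noise for ε; both are assumptions made only for the plot.

```python
import numpy as np
import matplotlib.pyplot as plt

# The "true" curve f(x); a sinusoid is assumed here, as in Bishop's example.
def f(x):
    return np.sin(2 * np.pi * x)

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)                   # uniformly spaced inputs
eps = rng.normal(0, 0.25, size=x.shape)     # noise (Gaussian is an assumption)
t = f(x) + eps                              # observed targets: t(x) = f(x) + noise

# Circles: the training data we are given; green line: the true curve we don't know.
xs = np.linspace(0, 1, 200)
plt.plot(xs, f(xs), "g-", label="true curve f(x)")
plt.plot(x, t, "bo", label="training examples")
plt.legend()
plt.show()
```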
Simple 1-D regression
Key Questions:
- How do we parametrize the model?
- What loss (objective) function should we use to judge the fit?
- How do we optimize fit to unseen test data (generalization)?
Example: Boston Housing data
Estimate the median house price in a neighborhood based on neighborhood statistics
Look at the first possible attribute (feature): per capita crime rate
Use this to predict house prices in other neighborhoods
Is this a good input (attribute) for predicting house prices?
Represent the Data
Data is described as pairs D = {(x^(1), t^(1)), ..., (x^(N), t^(N))}
- x ∈ R is the input feature (per capita crime rate)
- t ∈ R is the target output (median house price)
- (i) simply indexes the training examples (we have N in this case)
Here t is continuous, so this is a regression problem
Model outputs y, an estimate of t:
  y(x) = w0 + w1 x
What type of model did we choose?
Divide the dataset into training and test examples
- Use the training examples to construct a hypothesis, or function approximator, that maps x to a predicted y
- Evaluate the hypothesis on the test set
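A small sketch of this workflow, with made-up numbers standing in for the crime-rate/price pairs (the actual Boston values are not reproduced here): split the pairs into training and test sets, fit the linear hypothesis y(x) = w0 + w1*x on the training portion (np.polyfit is used here as a stand-in for the fitting procedure developed on the following slides), and evaluate it on the held-out test examples.

```python
import numpy as np

# Stand-in data: pairs (x^(i), t^(i)); x would be the per-capita crime rate and
# t the median house price, but synthetic values are used for illustration.
rng = np.random.default_rng(2)
N = 100
x = rng.uniform(0, 10, size=N)
t = 30.0 - 1.5 * x + rng.normal(0, 2.0, size=N)

# Divide the dataset into training and test examples.
idx = rng.permutation(N)
train, test = idx[:80], idx[80:]

# Construct the hypothesis y(x) = w0 + w1*x from the training examples
# (np.polyfit returns the coefficients highest degree first).
w1, w0 = np.polyfit(x[train], t[train], deg=1)

# Evaluate the hypothesis on the test set with the sum of squared errors.
y_test = w0 + w1 * x[test]
test_sse = np.sum((t[test] - y_test) ** 2)
print(f"w0={w0:.2f}, w1={w1:.2f}, test SSE={test_sse:.2f}")
```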
Noise
A simple model typically does not exactly fit the data
- the lack of fit can be considered noise
Sources of noise:
- Imprecision in data attributes (input noise, e.g., noise in per-capita crime)
- Errors in data targets (mislabeling, e.g., noise in house prices)
- Additional attributes, not captured by the data attributes, that affect the target values (latent variables). In the example, what else could affect house prices?
- The model may be too simple to account for the data targets
Least-Squares Regression
Define a model
  y(x) = function(x, w)
Linear: y(x) = w0 + w1 x
Standard loss/cost/objective function measures the squared error between y and the true value t:
  ℓ(w) = Σ_{n=1}^N [t^(n) − y(x^(n))]^2
For the linear model:
  ℓ(w) = Σ_{n=1}^N [t^(n) − (w0 + w1 x^(n))]^2
For a particular hypothesis (y(x) defined by a choice of w, drawn in red), what does the loss represent geometrically?
How do we obtain the weights w = (w0, w1)? Find the w that minimizes the loss ℓ(w)
For the linear model, what kind of a function is ℓ(w)?
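The slide leaves the minimization as a question; as a hedged sketch of one standard answer: for the linear model, ℓ(w) is a quadratic (hence convex) function of w, so setting its partial derivatives to zero and solving the resulting linear system (the normal equations) gives the minimizing weights. The toy data below is made up purely for illustration.

```python
import numpy as np

# Toy pairs (x^(n), t^(n)), made up for illustration.
rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=30)
t = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=30)

# The least-squares loss for the linear model y(x) = w0 + w1 * x:
#   loss(w) = sum_n [t^(n) - (w0 + w1 * x^(n))]^2
def loss(w0, w1):
    return np.sum((t - (w0 + w1 * x)) ** 2)

# loss is quadratic (convex) in (w0, w1), so its unique minimizer is found by
# setting the partial derivatives to zero and solving the resulting linear
# system, the normal equations (X^T X) w = X^T t.
X = np.column_stack([np.ones_like(x), x])   # design matrix: a column of ones for w0, then x
w0, w1 = np.linalg.solve(X.T @ X, X.T @ t)

print(w0, w1, loss(w0, w1))
```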