Linear regression
Gertraud Malsiner-Walli
Readings: ISLR Chapter 3
Outline
Visualization of multivariate relationships
Linear regression
Simple linear regression
Multiple linear regression
Model selection
Further topics
Advertising data set
▶ Goal: provide a marketing plan for a company that will
improve sales for a particular product.
▶ The dataset contains information about the sales of a product
in 200 different markets together with advertising budgets in
each of these markets for different media channels: TV, Radio
and Newspaper.
▶ The Sales are in thousands of units.
▶ The budgets TV, Radio and Newspaper are in thousands of dollars.
▶ Additionally, the variable NewspaperCat is a categorization of the newspaper budget into three categories: below 25,000, between 25,000 and 50,000, and above 50,000 dollars.
Data matrix
## TV Radio Newspaper Sales NewspaperCat
## 1 230.1 37.8 69.2 22.1 (50,150]
## 2 44.5 39.3 45.1 10.4 (25,50]
## 3 17.2 45.9 69.3 9.3 (50,150]
## 4 151.5 41.3 58.5 18.5 (50,150]
## 5 180.8 10.8 58.4 12.9 (50,150]
## 6 8.7 48.9 75.0 7.2 (50,150]
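Such a data frame can be constructed in R along these lines; a minimal sketch, assuming the data are available as a CSV file (the file name Advertising.csv and the capitalized column names are assumptions based on the slides):

# Read the advertising data (file name is an assumption)
Advertising <- read.csv("Advertising.csv")

# Categorize the newspaper budget as in the data matrix above
Advertising$NewspaperCat <- cut(Advertising$Newspaper,
                                breaks = c(0, 25, 50, 150))
head(Advertising)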
Visualization of multivariate relationships
Scatterplots
[Three scatterplots: Sales against the TV, Radio, and Newspaper budgets]
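A sketch of R code that produces such scatterplots, assuming the Advertising data frame from above:

# One scatterplot of Sales against each advertising budget
par(mfrow = c(1, 3))
plot(Sales ~ TV, data = Advertising)
plot(Sales ~ Radio, data = Advertising)
plot(Sales ~ Newspaper, data = Advertising)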
Parallel boxplots
[Parallel boxplots of Sales for the three NewspaperCat categories (0,25], (25,50], (50,150]]
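A corresponding sketch in R, again assuming the Advertising data frame:

# Parallel boxplots of Sales by newspaper budget category
boxplot(Sales ~ NewspaperCat, data = Advertising)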
Linear regression
Linear regression
▶ Linear regression assumes that the dependence of a response variable Y on covariates X1, X2, . . . , Xp is linear.
▶ Even though true relationships are rarely linear in reality, this simple approach is extremely useful both conceptually and practically.
▶ The response variable Y is also called the target variable, outcome, or dependent variable.
▶ The covariates X1, . . . , Xp are also referred to as explanatory variables, independent variables, predictors or features.
Linear regression for the advertising data
Relevant questions:
▶ Is there a relationship between advertising budgets and sales?
▶ How strong is the relationship between advertising budget and
sales?
▶ Is the relationship linear?
▶ Which media contribute to sales?
▶ How accurately can we predict future sales?
Simple linear regression
Simple linear regression
▶ Goal: predict a quantitative variable Y on the basis of a single
predictor X by assuming an approximately linear relationship.
▶ We assume:
Y = β0 + β1 X + ϵ,   E(ϵ|X) = 0,   (1)
where β0 + β1 X is the linear function with unknown parameters (coefficients) β0 and β1, and ϵ is the error term.
▶ Given some estimates β̂0 and β̂1 , we can predict future values
of Y by using the regression line:
ŷ = β̂0 + β̂1 x ,
where ŷ indicates a prediction of Y given X = x .
How do we obtain the best regression line?
▶ The goal is to find a line that lies "close" to the points in the scatterplot.
▶ Let
ŷi = β̂0 + β̂1 xi
be the prediction or fitted value for Y based on xi .
Then
ei = yi − ŷi
is called the i-th residual.
[Scatterplot of Sales against TV with the fitted regression line]
Least squares criterion
▶ One possibility to estimate the coefficients is to minimize the
sum of squared residuals:
SSR = Σ ei² = Σ (yi − ŷi)² = Σ (yi − (β̂0 + β̂1 xi))²,
where the sums run over i = 1, . . . , n.
▶ The minimizing values are called the ordinary least squares
(OLS) estimates and are given by:
β̂1 = rx,y · sy / sx,    β̂0 = ȳ − β̂1 x̄,
where x̄ and ȳ denote the means of X and Y,
sx and sy their standard deviations,
and rx,y the correlation coefficient between X and Y.
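These closed-form expressions are easy to evaluate in R; a minimal sketch using the Advertising data (column names as assumed above):

x <- Advertising$TV
y <- Advertising$Sales

# OLS estimates from the closed-form expressions
b1 <- cor(x, y) * sd(y) / sd(x)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)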
Least squares criterion for the advertising data
▶ We estimate the parameters for the simple regression where TV
budget is used as X and Sales is used as Y :
Ŝales = 7.03259 + 0.04754 · TV   (2)
▶ Interpretation of the coefficients
▶ β̂0 = 7.03259, the intercept, is the average value of Sales when the TV budget is equal to 0.
▶ β̂1 = 0.04754, the slope, is the marginal effect of TV on Sales: the expected value of Sales increases by 0.04754 × 1000 ≈ 48 units for every additional 1000 dollars spent on TV advertising.
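In R, the model in Equation (2) is obtained with lm(); a sketch, assuming the Advertising data frame:

fit <- lm(Sales ~ TV, data = Advertising)
coef(fit)  # intercept approx. 7.033, slope approx. 0.0475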
Goodness-of-fit (I)
▶ The goodness-of-fit of a linear regression can be assessed by the residual standard error (RSE) and the R² statistic.
▶ RSE is an estimate of the standard deviation of the error terms ϵ:
RSE = √( (1/(n − 2)) Σ (yi − ŷi)² ) = √( SSR/(n − 2) ).   (3)
⇒ measures (roughly) the average amount by which the
response will deviate from the regression line.
▶ The RSE for the simple regression Sales ∼ TV in Equation (2) is 3.26, meaning that actual sales deviate from the regression line by about 3260 units on average.
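A sketch of how the RSE in Equation (3) can be computed from the fitted model fit above; summary() reports the same value:

n   <- nobs(fit)
SSR <- sum(residuals(fit)^2)
sqrt(SSR / (n - 2))   # RSE from Equation (3)
summary(fit)$sigma    # same value, as reported by R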
Goodness-of-fit (II)
▶ The coefficient of determination R²:
R² = (TSS − SSR) / TSS = 1 − SSR/TSS,
where TSS = Σ (yi − ȳ)² is the total sum of squares.
▶ It lies between 0 and 1 and measures the amount of variation
in Y that can be explained by the regression model.
▶ In the social sciences, low R² values in regression analysis are not uncommon.
Goodness of fit for the advertising data
▶ The R² for the simple regression Sales ∼ TV is 0.6119.
▶ The R² for the simple regression Sales ∼ Radio is 0.332.
▶ The R² for the simple regression Sales ∼ Newspaper is 0.05212.
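These values can be read off the fitted models in R; a sketch, assuming the Advertising data frame:

# R-squared of each simple regression
summary(lm(Sales ~ TV, data = Advertising))$r.squared
summary(lm(Sales ~ Radio, data = Advertising))$r.squared
summary(lm(Sales ~ Newspaper, data = Advertising))$r.squared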
Statistical properties of OLS estimators:
Unbiasedness
▶ Question: Are the estimates β̂0, β̂1 equal to the true values?
▶ Answer: Yes, on average, provided certain assumptions on the error term hold.
Statistical properties of OLS estimators: standard
error
▶ Question: How large is the difference between the OLS
estimate and the true value?
▶ The standard error (SE) of an estimator reflects how the
estimator varies under repeated sampling.
▶ Good news: We can estimate the SE of the regression
coefficients under certain assumptions on the error terms:
SE(β̂1)² = σ² / Σ (xi − x̄)²,    SE(β̂0)² = σ² (1/n + x̄² / Σ (xi − x̄)²),
where σ² is the variance of the error terms, which is typically unknown and can be estimated by the RSE in Equation (3).
Confidence intervals and hypothesis testing
▶ The standard errors can then be used to obtain confidence
intervals:
β̂1 ± 2 · SE(β̂1),    β̂0 ± 2 · SE(β̂0).
▶ The standard errors can be used in hypothesis testing. The
most common test involves the following hypotheses:
H0 : β1 = 0, i.e. there is no linear relationship between X and Y
HA : β1 ̸= 0, i.e. there is a linear relationship between X and Y
▶ The test statistic of this hypothesis test is given by
t = β̂1 /SE (β̂1 ) and follows a t-distribution under the null
hypothesis.
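Both the confidence intervals and the t-tests are available from a fitted model in R; a sketch, using fit from above:

# Confidence intervals for the coefficients
confint(fit, level = 0.95)

# t-statistics and p-values for H0: beta_j = 0
summary(fit)$coefficients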
Incorporating non-linearities
▶ It is rather easy to incorporate many nonlinearities into the simple regression model by appropriately defining the dependent variable Y and the independent variable X.
Incorporating non-linearities - Examples
✓ linear:    Y = e^β0 · X^β1 · ϵ̃   ⇒   log(Y) = β0 + β1 log(X) + ϵ
✓ linear:    Y = β0 + β1 √X + ϵ
✗ nonlinear: Y = 1/(β0 + β1 X) + ϵ
Visual inspection of residuals for Sales ~ TV
[Residual plot: residuals against fitted values for the model Sales ~ TV]
The plot should look random with constant variance, but this is not the case here.
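A residual plot of this kind can be drawn as follows; a sketch, using fit from above:

# Residuals against fitted values for Sales ~ TV
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0)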
Incorporating non-linearities in the model Sales ~ TV
▶ The linearity assumption is violated at the left end of the plot.
▶ Solution: Estimate the model log(Sales) ∼ log(TV ).
▶ Interpretation: a one percent increase in TV generates a β1 percent change in Sales.
▶ The residual plot has improved, and so has R².
[Residual plot: residuals against fitted values for the model log(Sales) ~ log(TV)]
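A sketch of the log-log fit and its residual plot in R:

fit_log <- lm(log(Sales) ~ log(TV), data = Advertising)
plot(fitted(fit_log), residuals(fit_log),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0)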
Multiple linear regression
Multiple linear regression
▶ The multiple linear regression model is:
Y = β0 + β1 X1 + β2 X2 + . . . + βp Xp + ϵ.
▶ In the advertising example, the model becomes:
Sales = β0 + β1 · TV + β2 · radio + β3 · newspaper + ϵ.
▶ We interpret βj as the average effect on Y of a one unit
increase in Xj , holding all other predictors fixed (i.e., ceteris
paribus).
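In R, the model for the advertising example is fit as follows; a sketch, assuming the Advertising data frame:

fit_all <- lm(Sales ~ TV + Radio + Newspaper, data = Advertising)
summary(fit_all)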
More on the interpretation of the coefficients
▶ If the predictors are uncorrelated, the coefficient β̂1 of the simple linear regression Y ∼ X1 equals the coefficient β̂1 of the multiple linear regression.
▶ Correlation amongst the predictors causes the following issues:
▶ The variances of all coefficient estimates increase.
▶ Because correlated predictors usually change together, interpretations become hazardous.
Estimation and prediction
▶ Parameters are estimated using the least squares method, where the sum of squared residuals is minimized:
SSR = Σ (yi − ŷi)² = Σ (yi − β̂0 − β̂1 xi1 − · · · − β̂p xip)²,
with the sums over i = 1, . . . , n.
▶ This minimization problem is implemented in all statistical
software programs.
▶ Given estimates β̂0, β̂1, . . . , β̂p, we can predict future values of Y by using the regression model:
ŷ = β̂0 + β̂1 x1 + . . . + β̂p xp.
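Given a fitted model, predict() evaluates this equation for new data; a sketch using fit_all from above (the budget values are made up for illustration):

# Predicted Sales for a hypothetical market
new_market <- data.frame(TV = 100, Radio = 20, Newspaper = 30)
predict(fit_all, newdata = new_market)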
Figure 1: Linear regression with two covariates
Goodness of fit
▶ The coefficient of determination R² is, as in the simple linear regression case, equal to:
R² = (TSS − SSR) / TSS.
▶ Note: R² never decreases, and usually increases, when another independent variable is added to a regression.
▶ This makes it a poor tool for deciding whether one variable or
several variables should be added to a model.
▶ Alternative: the adjusted R²:
R²adj = 1 − (1 − R²)(n − 1) / (n − p − 1).
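A sketch of the adjusted R² computed by hand and as reported by summary(), using fit_all from above:

r2 <- summary(fit_all)$r.squared
n  <- nobs(fit_all)
p  <- 3  # number of regressors
1 - (1 - r2) * (n - 1) / (n - p - 1)   # adjusted R-squared by hand
summary(fit_all)$adj.r.squared         # same value from summary()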
Hypothesis Testing - the t-test
▶ Question: Which of the predictors are useful in predicting the
response?
▶ To answer this question we can construct hypothesis tests for
each regression coefficient βj :
H0 : βj = 0
HA : βj ̸= 0.
▶ Under certain assumptions and additionally by assuming the
errors are normally distributed, we have the following result:
β̂j / SE(β̂j) ∼ t(n−p−1), a t-distribution with n − p − 1 degrees of freedom.
▶ The ratio β̂j /SE (β̂j ) is called the t-statistic.
▶ p-values can be computed based on the distribution under the
null-hypothesis.
Model selection
Model comparison: in-sample and out-of-sample approaches
Comparison between models can be done based on
▶ in-sample (likelihood-based) measures such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC);
▶ out-of-sample measures (by measuring the prediction
performance on a test data set).
▶ Typically the MSE (mean squared error) or the RMSE (root
MSE) is employed as a goodness of fit measure both in-sample
and out-of-sample:
MSE = (1/n) Σ (yi − ŷi)²,    RMSE = √MSE,
with the sum over i = 1, . . . , n.
Information Criteria: AIC and BIC
▶ AIC, BIC: model fit + penalty = n · log(MSE) + const(n) + m · (p + 1)
▶ MSE: mean squared error
▶ p: number of regressors
▶ m = 2: AIC (Akaike Information Criterion)
▶ m = log(n): BIC (Bayesian Information Criterion)
▶ Choose the model that minimizes the criterion!
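In R, both criteria can be computed for fitted lm objects; a sketch comparing two candidate models (note that R's AIC() is likelihood-based, which agrees with the formula above up to the const(n) term):

fit1 <- lm(Sales ~ TV + Radio, data = Advertising)
fit2 <- lm(Sales ~ TV + Radio + Newspaper, data = Advertising)
AIC(fit1, fit2)  # smaller is better
BIC(fit1, fit2)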
Variable selection: Irrelevant and omitted variables
▶ Adding irrelevant variables to a model does not affect
unbiasedness but increases the variance of the estimators.
▶ Excluding a relevant variable will typically bias the estimates.
Deciding on the important variables
▶ The most direct approach is called best subset regression: we fit an OLS regression for each possible combination of the p predictors and then choose between them based on some criterion that balances MSE with model size.
▶ There are 2^p models under consideration, so computation quickly becomes infeasible.
▶ Alternative: automated approaches that search through a subset of all models, such as stepwise regression.
Forward selection
▶ Begin with the null model: a model that contains an intercept but no predictors.
▶ Fit p simple linear regressions with each regressor and add the
variable that results in the lowest AIC to the null model.
▶ Continue adding variables as long as the AIC decreases. Stop if adding any variable would increase the AIC.
Backward selection
▶ Start with all variables in the model.
▶ Remove the variable whose exclusion yields the smallest AIC.
▶ Continue removing variables as long as the AIC decreases. Stop if removing any included variable would increase the AIC.
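Both procedures are implemented in R's step() function, which uses the AIC by default; a sketch, assuming the Advertising data frame:

null_model <- lm(Sales ~ 1, data = Advertising)
full_model <- lm(Sales ~ TV + Radio + Newspaper, data = Advertising)

# Forward selection: start from the null model, add variables
step(null_model, direction = "forward", scope = formula(full_model))

# Backward selection: start from the full model, remove variables
step(full_model, direction = "backward")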
Further topics
Qualitative predictors (1)
▶ Categorical covariates are coded as dummy variables (which
take on 0 or 1 values) when used in a regression setting.
▶ If a categorical variable has K levels, K − 1 dummy variables
are needed to represent the variable.
▶ In our example, NewspaperCat has three levels, thus we need
two dummy variables:
D1 = 1 if the newspaper budget is in the range (25, 50], and 0 otherwise;
D2 = 1 if the newspaper budget is in the range (50, 150], and 0 otherwise.
▶ Note that we have not created any dummy variable for the first
level (0, 25] as this is considered to be the baseline.
Qualitative predictors (2)
▶ The model Sales ~ NewspaperCat becomes:
Ŝales = β0 + β1 D1 + β2 D2 =
β0         if NewspaperCat = (0, 25]
β0 + β1    if NewspaperCat = (25, 50]
β0 + β2    if NewspaperCat = (50, 150]
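In R, a factor such as NewspaperCat is expanded into such dummy variables automatically, with the first level as the baseline; a sketch:

fit_cat <- lm(Sales ~ NewspaperCat, data = Advertising)
coef(fit_cat)  # intercept = baseline (0,25]; two dummy coefficients

# Inspect the dummy coding R generated
head(model.matrix(fit_cat))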