
Regression analysis

July 31, 2024

1 Introduction

1.1 The model

The origin of time series analysis is the linear regression model, which is concerned with the interdependence of time series observations. We want to explain the behaviour of a certain variable with the aid of a number of explanatory variables. Linear regression expresses a dependent variable as a linear function of independent variables, possibly random, and an error term. We consider the following model:

$$Y_i = \beta_0 + \sum_{j=1}^{k} \beta_j X_{j,i} + \varepsilon_i,$$

where $Y_i$ is known as the regressand, dependent variable or simply the left-hand-side variable. The $k$ variables $X_{1,i}, \ldots, X_{k,i}$ are known as the regressors, independent variables or right-hand-side variables. The coefficients $\beta_1, \beta_2, \ldots, \beta_k$ are the regression coefficients, $\varepsilon_i$ is known as the innovation, shock or error, and $i = 1, 2, \ldots, n$ indexes the observations. This representation makes the relationship between $Y_i$ and the regressors explicit.

- First, we consider the case of a single explanatory variable $x$: the simple linear regression model ($k = 1$).

- Then, we extend the model to include more variables: the multiple regression model ($k > 1$).

Regression analysis uses two principal types of data: cross-sectional and time-series. A cross-sectional regression involves many observations of $X$ and $Y$ for the same time period. These observations could come from different companies, asset classes, investment funds, countries, or other entities, depending on the regression model. For example, a cross-sectional model might use data from many companies to test whether predicted EPS growth explains differences in price-to-earnings ratios during a specific time period. Note that if we use cross-sectional observations in a regression, we usually denote the observations as $i = 1, 2, \ldots, n$. A time-series regression, in contrast, uses many observations from different time periods for the same company, asset class, investment fund, country, or other entity, depending on the regression model. For example, a time-series model might use monthly data from many years to test whether a country's inflation rate determines its short-term interest rates. If we use time-series data in a regression, we usually denote the observations as $t = 1, 2, \ldots, T$.

2 Simple linear regression model

2.1 Assumptions

The four key assumptions of a linear regression model are:

- Linearity: the relationship between the dependent variable and the independent variable is linear.

- Homoskedasticity: the variance of the regression residuals is the same for all the observations.

- Independence: the observations are independent of one another. This implies that the regression residuals are uncorrelated across observations.

- Normality: the regression residuals are normally distributed.

2.1.1 Linearity

We assume that the true underlying relationship between the dependent and the independent variables is linear; if not, the model will produce invalid results. For example, $Y_i = b_0 e^{b_1 X_i} + \epsilon_i$ is nonlinear in $b_1$, so we should not apply the linear regression model to it. The independent variable, $X$, is also assumed not to be random; otherwise there would be no meaningful linear relationship between the dependent and independent variables to estimate. The residuals of a fitted model should appear random, i.e. no pattern should be present when the residuals are plotted against the independent variable.
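
As a quick visual check of this assumption, one can plot the residuals of a fitted simple regression against the independent variable and look for systematic patterns. The following is a minimal sketch using NumPy and Matplotlib; the simulated data and variable names are illustrative and not taken from these notes.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Simulated data: a truly linear relationship plus noise (illustrative only)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)

# Fit y = b0 + b1 * x by ordinary least squares
b1, b0 = np.polyfit(x, y, 1)          # np.polyfit returns highest degree first
residuals = y - (b0 + b1 * x)

# Residuals plotted against the independent variable should show no pattern
plt.scatter(x, residuals)
plt.axhline(0, color="black", linewidth=0.8)
plt.xlabel("x")
plt.ylabel("residual")
plt.title("Residuals vs independent variable")
plt.show()
```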

2.1.2 Homoskedasticity

Observations exhibit homoskedasticity if the variance of the residuals is the same for all observations:

$$E[\epsilon_i^2] = \sigma_\epsilon^2, \qquad i = 1, \ldots, n.$$

If the residuals are not homoskedastic, we refer to this as heteroskedasticity.

2.1.3 Independence

In a linear regression model, we assume that the observations are uncorrelated with one another, implying that they are independent. If this assumption is violated, the residuals will be correlated. Independence is also necessary in order to correctly estimate the variances of the estimated parameters $b_0$ and $b_1$ that we use in hypothesis tests of the intercept and slope. Therefore, we need to examine the residuals of a regression model both visually and statistically.

2.1.4 Normality

This assumption requires that the residuals are normally distributed. It does not mean that the dependent and independent variables must be normally distributed; it only means that the residuals from the model are normally distributed. It is good practice, though, to examine the distribution of the variables in order to identify outliers, since an outlier can substantially influence the fitted line, so that the estimated model fits poorly for most of the other observations.

2.2 The model

Consider two random variables $X$ and $Y$, and assume that we have a sample of size $n$ on each variable. The sample correlation coefficient is defined as

$$r_{xy} = \frac{S_{xy}}{S_x S_y},$$

with

$$S_x = \sqrt{\frac{\sum_{i=1}^n x_i^2 - \frac{\left(\sum_{i=1}^n x_i\right)^2}{n}}{n-1}}, \qquad
S_y = \sqrt{\frac{\sum_{i=1}^n y_i^2 - \frac{\left(\sum_{i=1}^n y_i\right)^2}{n}}{n-1}},$$

and

$$S_{xy} = \frac{1}{n-1} \left( \sum_{i=1}^n x_i y_i - \frac{\sum_{i=1}^n x_i \sum_{i=1}^n y_i}{n} \right).$$

The correlation coefficient is an indicator of the existence of a linear relationship between the two variables.

- If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship).

- If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship).

- A coefficient close to zero indicates no straight-line relationship.

If the relationship between $y$ and $x$ is linear, then the variables are connected by the regression line

$$y = \beta_0 + \beta_1 x + \epsilon.$$

The estimated regression model is

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x,$$

where $\hat{y}$ is the estimated or predicted value of $y$ for a given value of $x$.
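
The sample quantities above can be computed directly. The following is a minimal NumPy sketch with simulated data (the data and variable names are illustrative); it also checks the result against np.corrcoef.

```python
import numpy as np

rng = np.random.default_rng(8)

# Simulated paired sample (illustrative only)
n = 30
x = rng.normal(0, 1, n)
y = 0.6 * x + rng.normal(0, 1, n)

# Sample standard deviations and covariance as defined above
s_x = np.sqrt((np.sum(x**2) - np.sum(x)**2 / n) / (n - 1))
s_y = np.sqrt((np.sum(y**2) - np.sum(y)**2 / n) / (n - 1))
s_xy = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (n - 1)

r = s_xy / (s_x * s_y)
print(r, np.corrcoef(x, y)[0, 1])      # the two values should agree
```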

2.3 Estimates of the regression coefficients

The values of $\beta_0$ and $\beta_1$ are estimated using the least squares method. This method minimizes the sum of squared differences between the regression line and the data points, i.e., it minimizes $\sum_{i=1}^n (y_i - \hat{y}_i)^2$, where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value at the same $x$, obtained using $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.

Proposition 2.1 The estimates of the regression coefficients are

$$\hat{\beta}_1 = \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}{\sum_i x_i^2 - n\bar{x}^2} = \frac{S_{xy}}{S_x^2},$$

and

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$

Proof 2.1 Proof in class.

Proposition 2.2 An estimator of $\sigma^2$ is given by

$$\hat{\sigma}^2 = s^2 = \frac{\sum_{i=1}^n e_i^2}{n-2} = \frac{SSE}{n-2}.$$

Proof 2.2 Proof in class.
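
A minimal NumPy sketch of Propositions 2.1 and 2.2 with simulated data (the data and variable names are illustrative, not part of these notes):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated sample (illustrative only)
n = 50
x = rng.uniform(0, 10, n)
y = 1.0 + 0.8 * x + rng.normal(0, 1.5, n)

# Proposition 2.1: least squares estimates
beta1_hat = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
beta0_hat = y.mean() - beta1_hat * x.mean()

# Proposition 2.2: estimator of sigma^2
e = y - (beta0_hat + beta1_hat * x)        # residuals
sse = np.sum(e**2)
s2 = sse / (n - 2)

print(beta0_hat, beta1_hat, s2)
```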

2.4 Properties of the estimators

The estimators of $\beta_0$, $\beta_1$, and $\sigma^2$ have the following properties.

Proposition 2.3 (Gauss-Markov Theorem) Assuming that the four linear regression assumptions hold, the OLS estimators of $\beta_0$ and $\beta_1$ are BLUE: best linear unbiased estimators.

The coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimates (i.e., random). Hence, it is possible to show that

$$V[\hat{\beta}_0] = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2} \right),$$

$$V[\hat{\beta}_1] = \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2}.$$

Substituting the unbiased estimator $s^2$ for $\sigma^2$, the estimated variances of the regression coefficients are

$$\hat{V}[\hat{\beta}_0] = s^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2} \right),$$

$$\hat{V}[\hat{\beta}_1] = \frac{s^2}{\sum_i (x_i - \bar{x})^2}.$$

The variance estimator $s^2$ is unbiased and its variance is given by

$$V[s^2] = \frac{2\sigma^4}{n-2}.$$
Confidence intervals for the regression coefficients are given by

$$\hat{\beta}_0 - t_{n-2,\alpha/2}\,\widehat{SE}[\hat{\beta}_0] \le \beta_0 \le \hat{\beta}_0 + t_{n-2,\alpha/2}\,\widehat{SE}[\hat{\beta}_0],$$

$$\hat{\beta}_1 - t_{n-2,\alpha/2}\,\widehat{SE}[\hat{\beta}_1] \le \beta_1 \le \hat{\beta}_1 + t_{n-2,\alpha/2}\,\widehat{SE}[\hat{\beta}_1].$$

A confidence interval for the regression variance is

$$\frac{SSE}{\chi^2_{n-2,\,1-\alpha/2}} \le \sigma^2 \le \frac{SSE}{\chi^2_{n-2,\,\alpha/2}}.$$
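
A sketch of the estimated standard errors and the confidence intervals above; the sample is regenerated so the snippet stands alone, SciPy is assumed to be available for the t and chi-square quantiles, and the simulated data are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated sample (illustrative only), fitted as in the previous sketch
n = 50
x = rng.uniform(0, 10, n)
y = 1.0 + 0.8 * x + rng.normal(0, 1.5, n)
beta1_hat = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
beta0_hat = y.mean() - beta1_hat * x.mean()
e = y - (beta0_hat + beta1_hat * x)
sse = np.sum(e**2)
s2 = sse / (n - 2)

# Estimated variances and standard errors of the coefficients
sxx = np.sum((x - x.mean())**2)
se_b0 = np.sqrt(s2 * (1.0 / n + x.mean()**2 / sxx))
se_b1 = np.sqrt(s2 / sxx)

# 95% confidence intervals for the coefficients
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
ci_b0 = (beta0_hat - t_crit * se_b0, beta0_hat + t_crit * se_b0)
ci_b1 = (beta1_hat - t_crit * se_b1, beta1_hat + t_crit * se_b1)

# Confidence interval for the regression variance
ci_sigma2 = (sse / stats.chi2.ppf(1 - alpha / 2, df=n - 2),
             sse / stats.chi2.ppf(alpha / 2, df=n - 2))

print(ci_b0, ci_b1, ci_sigma2)
```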

3 Multiple linear regression model

3.1 The model

Recall the multiple linear regression model

$$Y_i = \beta_0 + \sum_{j=1}^{k} \beta_j X_{j,i} + \varepsilon_i,$$

for $i = 1, \ldots, n$. This is equivalent to

$$\begin{aligned}
y_1 &= \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_k x_{1k} + \varepsilon_1 \\
y_2 &= \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_k x_{2k} + \varepsilon_2 \\
&\;\;\vdots \\
y_n &= \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_k x_{nk} + \varepsilon_n.
\end{aligned}$$

It is more convenient to use matrix notation and write the model as

$$y = X\beta + \varepsilon,$$

where

$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}, \qquad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix},$$

and

$$X = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1k} \\
1 & x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nk}
\end{pmatrix}.$$

3.2 Estimators

Estimation is based on ordinary least squares (OLS), which yields the following estimator of $\beta$.

Proposition 3.1 The OLS estimator of $\beta$ is unbiased and is given by

$$\hat{\beta} = \left(X^\top X\right)^{-1} X^\top y,$$

with

$$V[\hat{\beta}] = \sigma^2 \left(X^\top X\right)^{-1}.$$

Proof 3.1 In class.

The fitted values are given by

$$\hat{y} = X\hat{\beta} = \underbrace{X \left(X^\top X\right)^{-1} X^\top}_{H}\, y = Hy.$$

The residuals may also be expressed in terms of the matrix $H$:

$$e = (I - H)y = (I - H)\varepsilon.$$

The matrix $H$ has the following properties:

$$H^\top = H \qquad \text{and} \qquad H^2 = H.$$

We can show the following:

- The sum of the residuals is 0, i.e., $\sum_{i=1}^n e_i = 0$.

- The residuals and the fitted values are uncorrelated.

- An unbiased estimator of $\sigma^2$ is $s^2 = \dfrac{e^\top e}{n-k-1}$.

- The variance-covariance matrix of the estimator $\hat{\beta}$ is $V[\hat{\beta}] = \sigma^2 \left(X^\top X\right)^{-1}$.
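
A minimal NumPy sketch of the matrix formulas above, using a simulated design matrix (the dimensions and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated design: n observations, k regressors plus an intercept column (illustrative only)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta_true = np.array([1.0, 0.5, -0.3, 2.0])
y = X @ beta_true + rng.normal(0, 1, n)

# OLS estimator: beta_hat = (X'X)^{-1} X'y (solving the normal equations avoids an explicit inverse)
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)

# Hat matrix, fitted values and residuals
H = X @ np.linalg.solve(XtX, X.T)
y_hat = H @ y
e = y - y_hat

# Unbiased estimator of sigma^2 and variance-covariance matrix of beta_hat
s2 = (e @ e) / (n - k - 1)
V_beta_hat = s2 * np.linalg.inv(XtX)

print(beta_hat)
print(np.isclose(e.sum(), 0.0))        # the residuals sum to (numerically) zero
```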

4 Assessing the model

There are several goodness-of-fit measures:

- the coefficient of determination $R^2$,

- the F-statistic for the test of fit, and

- the standard error of the regression.

4.1 Coefficient of determination

This measure is also referred to as R-squared or $R^2$ and is the percentage of the variation of the dependent variable that is explained by the independent variable:

$$R^2 = \frac{\text{Sum of squares regression}}{\text{Sum of squares total}} = \frac{\sum_{i=1}^n \left(\hat{Y}_i - \bar{Y}\right)^2}{\sum_{i=1}^n \left(Y_i - \bar{Y}\right)^2}.$$

By construction, the coefficient of determination ranges from 0% to 100%. In simple linear regression, the square of the pairwise correlation is equal to the coefficient of determination:

$$r^2 = R^2.$$

The coefficient of determination is a descriptive measure, so in order to determine whether our regression model is statistically meaningful, we will need to construct an F-distributed test statistic.
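
A small sketch of this computation with simulated data (illustrative only), which also verifies that $R^2 = r^2$ in simple linear regression:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated simple regression (illustrative only)
n = 60
x = rng.uniform(0, 5, n)
y = 3.0 + 1.2 * x + rng.normal(0, 1, n)
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# Coefficient of determination: SSR / SST
ssr = np.sum((y_hat - y.mean())**2)
sst = np.sum((y - y.mean())**2)
r_squared = ssr / sst

# In simple linear regression, R^2 equals the squared pairwise correlation
r = np.corrcoef(x, y)[0, 1]
print(r_squared, r**2)        # the two values should agree (up to rounding)
```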

4.2 F-statistic

We use an F-distributed test statistic to compare two variances. In regression analysis, we can use an F-distributed test statistic to test whether the slope coefficients in a regression are all equal to zero, against the alternative hypothesis that at least one slope is not equal to zero:

$$H_0: \beta_1 = \beta_2 = \ldots = \beta_k = 0$$
$$H_a: \text{at least one } \beta_j \text{ is not equal to zero.}$$

For simple linear regression, these hypotheses simplify to

$$H_0: \beta_1 = 0$$
$$H_a: \beta_1 \neq 0.$$

The F-distributed test statistic is constructed by using the sum of squares regression and the sum of squares error, each adjusted for degrees of freedom; in other words, it is the ratio of two variances. We divide the sum of squares regression by the number of independent variables, represented by $k$. In the case of simple linear regression, $k = 1$, so we arrive at the mean square regression (MSR), which is the same as the sum of squares regression:

$$MSR = \frac{\text{Sum of squares regression}}{k},$$

which can be rewritten as $MSR = \sum_{i=1}^n \left(\hat{Y}_i - \bar{Y}\right)^2$ for simple linear regression. The mean square error (MSE) is the sum of squares error divided by its degrees of freedom, which is $n - k - 1$. In simple linear regression, $n - k - 1$ becomes $n - 2$:

$$MSE = \frac{\text{Sum of squares error}}{n-k-1} = \frac{\sum_{i=1}^n \left(Y_i - \hat{Y}_i\right)^2}{n-2}.$$

Therefore, the F-distributed test statistic is

$$F = \frac{\dfrac{\text{Sum of squares regression}}{k}}{\dfrac{\text{Sum of squares error}}{n-k-1}} = \frac{MSR}{MSE},$$

which is distributed with 1 and $n - 2$ degrees of freedom in simple linear regression. The F-statistic in regression analysis is one-sided, with the rejection region on the right side, because we are interested in whether the variation in $Y$ explained (the numerator) is larger than the variation in $Y$ unexplained (the denominator). The sums of squares from a regression model are often presented in an analysis of variance (ANOVA) table. An example of an ANOVA table is below:

Source       Sum of Squares   Degrees of Freedom   Mean Square   F-Statistic
Regression   191.625          1                    191.625       16.0104
Error        47.875           4                    11.96875
Total        239.50           5
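
The F-statistic in the ANOVA table can be reproduced directly from the reported sums of squares; the sketch below takes the numbers from the table above and assumes SciPy is available for the p-value.

```python
from scipy import stats

# Numbers taken from the ANOVA table above
ss_regression, ss_error = 191.625, 47.875
k, n = 1, 6                      # one slope, six observations (degrees of freedom 1 and 4)

msr = ss_regression / k          # 191.625
mse = ss_error / (n - k - 1)     # 11.96875
f_stat = msr / mse               # about 16.0104

# One-sided test: compare with the critical value, or compute the p-value
p_value = stats.f.sf(f_stat, dfn=k, dfd=n - k - 1)
print(f_stat, p_value)
```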

4.3 Standard Error of the Regression

The standard error of the estimate ($s_e$) is also known as the standard error of the regression or the root mean square error. The $s_e$ is a measure of the distance between the observed values of the dependent variable and those predicted from the estimated regression; the smaller the $s_e$, the better the fit of the model. The $s_e$, along with the coefficient of determination and the F-statistic, is a measure of the goodness of fit of the estimated regression line. Unlike the coefficient of determination and the F-statistic, which are relative measures of fit, the standard error of the estimate is an absolute measure of the distance of the observed dependent variable from the regression line. Thus, the $s_e$ is an important statistic used to evaluate a regression model and is used in calculating prediction intervals and performing tests on the coefficients. The standard error of the estimate is

$$s_e = \sqrt{MSE}.$$
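
Using the MSE from the ANOVA table above, a quick check of the standard error of the estimate:

```python
import math

# Standard error of the estimate from the ANOVA table above: s_e = sqrt(MSE)
mse = 47.875 / 4                 # 11.96875
s_e = math.sqrt(mse)             # about 3.46, in the units of the dependent variable
print(s_e)
```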

5 Hypothesis Testing

We might want to perform hypothesis testing of the linear regression coefficients to determine the
significance of the coefficients. We will look at examples of testing:

- Slope coefficient(s)

- Intercept

- Independent variable is an indicator variable.

The choice of significance level in hypothesis testing is always a matter of judgment. Analysts often choose the 0.05 level of significance, which indicates a 5% chance of rejecting the null hypothesis when, in fact, it is true (a Type I error, or false positive). Of course, decreasing the level of significance from 0.05 to 0.01 decreases the probability of a Type I error, but it also increases the probability of a Type II error: failing to reject the null hypothesis when, in fact, it is false (a false negative). The p-value is the smallest level of significance at which the null hypothesis can be rejected. The smaller the p-value, the smaller the chance of making a Type I error (i.e., rejecting a true null hypothesis), and hence the stronger the evidence against the null hypothesis. For example, if the p-value is 0.005, we reject the null hypothesis that the true parameter is equal to zero at the 0.5% significance level (99.5% confidence). In most software packages, the p-values provided for regression coefficients are for a test of the null hypothesis that the true parameter is equal to zero against the alternative that the parameter is not equal to zero.
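
A minimal sketch of such a coefficient test for the slope of a simple regression, using the standard error formulas from Section 2.4; the simulated data are illustrative and SciPy is assumed for the t distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Simulated simple regression (illustrative only)
n = 40
x = rng.uniform(0, 10, n)
y = 0.5 + 0.9 * x + rng.normal(0, 2, n)

b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)
s2 = np.sum(e**2) / (n - 2)
se_b1 = np.sqrt(s2 / np.sum((x - x.mean())**2))

# Test H0: beta_1 = 0 against Ha: beta_1 != 0
t_stat = b1 / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(t_stat, p_value)         # reject H0 at the 5% level if p_value < 0.05
```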

6 Functional forms for simple linear regression

Not every set of independent and dependent variables has a linear relation. In fact, we often see
non-linear relationships in economic and financial data.

There are several functional forms that can be used to transform the data to enable their use in linear regression. These transformations include using the log (i.e., natural logarithm) of the dependent variable, the log of the independent variable, the reciprocal of the independent variable, the square of the independent variable, or the differencing of the independent variable. We illustrate and discuss three often-used functional forms, each of which involves a log transformation (a brief sketch of these transformations follows the list):

1. the log-lin model, in which the dependent variable is logarithmic but the independent variable
is linear;

2. the lin-log model, in which the dependent variable is linear but the independent variable is
logarithmic; and

3. the log-log model, in which both the dependent and independent variables are in logarithmic form.
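
A brief sketch of the three transformations; the simulated data are illustrative, and each model is fitted by ordinary least squares on the transformed variables.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated positive-valued variables (illustrative only)
n = 80
x = rng.uniform(1, 10, n)
y = np.exp(0.3 + 0.2 * x + rng.normal(0, 0.1, n))

# Log-lin model: ln(y) regressed on x
b1_loglin, b0_loglin = np.polyfit(x, np.log(y), 1)

# Lin-log model: y regressed on ln(x)
b1_linlog, b0_linlog = np.polyfit(np.log(x), y, 1)

# Log-log model: ln(y) regressed on ln(x); the slope can be read as an elasticity
b1_loglog, b0_loglog = np.polyfit(np.log(x), np.log(y), 1)

print(b1_loglin, b1_linlog, b1_loglog)
```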

7 CAPM and multifactor model

7.1 CAPM

The Capital Asset Pricing Model (CAPM) considers the equilibrium relation between the expected return of an asset or portfolio $\mu_i = E[y^i]$, the risk-free return $r_f$, and the expected return of the market portfolio $\mu_m = E[y^m]$. Based on various assumptions (e.g. quadratic utility or normality of returns), the CAPM states that

$$\mu_i - r_f = \beta_i (\mu_m - r_f).$$

This relation is also known as the security market line (SML). When we replace the expected returns in the CAPM with observed returns, we obtain the so-called market model

$$y_t^i = \alpha_i + \beta_i y_t^m + \epsilon_t^i.$$

If we write the regression equation in terms of (observed) excess returns $x_t^i = y_t^i - r_f$ and $x_t^m = y_t^m - r_f$, we obtain

$$x_t^i = \beta_i x_t^m + \epsilon_t^i.$$

A testable implication of the CAPM is that the constant term in a simple linear regression using excess returns should be equal to zero.

Example 7.1 Estimation of the CAPM for US industry indices compiled by French.
Using the excess returns on the market, the consumer goods portfolio, and the hi-tech portfolio, we estimate the CAPM. The data are available at https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html.
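
A sketch of such a market-model regression; the excess returns below are simulated stand-ins for the French data rather than actual downloads, and statsmodels is assumed to be available.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)

# Simulated monthly excess returns (illustrative stand-ins for the French data)
T = 120
x_market = rng.normal(0.005, 0.04, T)                  # market excess return
x_asset = 0.9 * x_market + rng.normal(0, 0.02, T)      # industry portfolio excess return

# Market model with a constant: the CAPM implies the intercept (alpha) is zero
X = sm.add_constant(x_market)
results = sm.OLS(x_asset, X).fit()
print(results.params)          # [alpha_hat, beta_hat]
print(results.pvalues[0])      # p-value of the test that alpha = 0
```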

7.2 Multifactor model

The CAPM has been frequently challenged by empirical evidence indicating significant risk premia associated with factors other than the market portfolio. According to the Arbitrage Pricing Theory (APT) of Ross (1976), there exist several risk factors that are common to a set of assets. These risk factors (and not only the market risk) capture the systematic risk component. We consider one version of the multi-factor model using the so-called Fama-French benchmark factors SMB (small minus big) and HML (high minus low).

The factor SMB measures the difference in returns between portfolios of small and large stocks, and is intended to capture the so-called size effect. The factor HML measures the difference in returns between value stocks (having a high book value relative to their market value) and growth stocks (with a low book-to-market ratio).

Example 7.2 Estimation of the three-factor model for US industry indices compiled by French.
Using the excess returns on the market, the consumer goods portfolio, and the hi-tech portfolio, we estimate the three-factor model with the SMB and HML factors. The data are available at https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html.
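
A sketch of the three-factor regression; again the factor series are simulated stand-ins for the actual Fama-French data, and statsmodels is assumed.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)

# Simulated monthly factor returns (illustrative stand-ins for the Fama-French factors)
T = 120
mkt = rng.normal(0.005, 0.04, T)       # market excess return
smb = rng.normal(0.002, 0.03, T)       # small minus big
hml = rng.normal(0.003, 0.03, T)       # high minus low
excess_ret = 1.0 * mkt + 0.3 * smb - 0.2 * hml + rng.normal(0, 0.02, T)

# Three-factor regression: excess return on market, SMB and HML (with a constant)
X = sm.add_constant(np.column_stack([mkt, smb, hml]))
results = sm.OLS(excess_ret, X).fit()
print(results.params)                   # [alpha, beta_mkt, beta_smb, beta_hml]
print(results.rsquared)
```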

