Linear Regression
1. Introduction and least squares estimates: simple linear regression models [1]
Regression analysis is a method for investigating the functional relationship among variables.
# Importing the data
production <- read.table("C:/Users/Prashobhan/OneDrive/Desktop/Regression notes/production.txt", header=TRUE)
attach(production)
#Figure 1
par(mfrow=c(1,1))
plot(production$RunSize,production$RunTime,xlab="Run Size", ylab="Run Time")
[1] The first part of this note is an excerpt from the second chapter, "Simple Linear Regression", of the textbook A Modern Approach to Regression with R by Simon J. Sheather.
Figure 1. A scatter plot of the production data
Table 1. Production data
Case RunTime RunSize
1 195 175
2 215 189
3 243 344
4 162 88
5 185 114
6 231 338
7 234 271
8 166 173
9 253 284
10 196 277
11 220 337
12 168 58
13 207 146
14 225 277
15 169 123
16 215 227
17 147 63
18 230 337
19 208 146
20 172 68
When data are collected in pairs, the standard notation used to designate this is:
(X1, Y1), (X2, Y2), . . . , (Xn, Yn)
where X1 denotes the first value of the so-called X -variable and Y1 denotes the first value of
the so-called Y -variable. The X -variable is called the explanatory or predictor variable ,
while the Y -variable is called the response variable or the dependent variable . The X -
variable often has a different status to the Y –variable in that:
• It can be thought of as a potential predictor of the Y -variable
• Its value can sometimes be chosen by the person undertaking the study
Simple linear regression is typically used to model the relationship between two variables Y
and X so that given a specific value of X, that is, X = X*, we can predict the value of Y.
Mathematically, the regression of a random variable Y on a random variable X is
E(Y | X = X*),
the expected value of Y when X takes the specific value X*. For example, if X = Day of the
week and Y = Sales at a given company, then the regression of Y on X represents the mean (or
average) sales on a given day.
The regression of Y on X is linear if
E(Y | X ) = β0 + β1 X [ Yi= β0 + β1 Xi ] (1)
where the unknown parameters β0 and β1 determine the intercept and the slope of a specific
straight line, respectively. Suppose that Y1, Y2, …, Yn are independent realizations of the
random variable Y that are observed at the values X1, X2, …, Xn of a random variable X. If
the regression of Y on X is linear, then for i = 1, 2, …, n
Yi = E(Y | X = Xi) + ei = β0 + β1 Xi + ei (2)
where ei is the random error in Yi and is such that
E(e | X) = 0 (3)
The random error term is there since there will almost certainly be some variation in Y due
strictly to random phenomena that cannot be predicted or explained. In other words, all
unexplained variation is called random error. Thus, the random error term does not depend
on X, nor does it contain any information about Y (otherwise it would be a systematic error).
We shall begin by assuming that
Var(Y | X) = σ² (4)
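The model in (2) to (4) can be made concrete by simulating from it. The sketch below is only an illustration: the parameter values, the sample size, the uniform predictor and the use of normal errors are all assumptions made for the simulation and are not taken from the production data.

```r
# Simulate data from the model Y = beta0 + beta1 * X + e.
# Every numeric value below is an illustrative assumption.
set.seed(1)
n     <- 5000
beta0 <- 150     # assumed population intercept
beta1 <- 0.25    # assumed population slope
sigma <- 16      # assumed error standard deviation

x <- runif(n, min = 50, max = 350)     # predictor values
e <- rnorm(n, mean = 0, sd = sigma)    # random errors, generated independently of x
y <- beta0 + beta1 * x + e             # responses

sim_fit <- lm(y ~ x)
coef(sim_fit)   # should be close to beta0 and beta1
mean(e)         # should be close to 0
var(e)          # should be close to sigma^2
```

Because the errors are generated independently of x with mean zero, the fitted line recovers the assumed intercept and slope up to sampling error.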
2. Estimating the population slope and intercept
Suppose for example that X = height and Y = weight of a randomly selected individual from
some population, then for a straight-line regression model the mean weight of individuals of a
given height would be a linear function of that height. In practice, we usually have a sample of
data instead of the whole population. The slope β1 and intercept β0 are unknown, since these
are the values for the whole population. Thus, we wish to use the given data to estimate the
slope and the intercept. This can be achieved by finding the equation of the line which “best”
fits our data, that is, choose b0 and b1 such that Ŷi = b0 + b1 Xi is as “close” as possible to Yi .
Here the notation Ŷi is used to denote the value of the line of best fit in order to distinguish it
from the observed values of Y, that is, Yi. We shall refer to Ŷi as the i th predicted value or the
fitted value of Yi .
3. Residuals
In practice, we wish to minimize the difference between the actual value of Y ( Yi ) and the
predicted value of Y ( Ŷi ). This difference is called the residual, êi , that is, êi = Yi– Ŷi .
Figure 2 shows a hypothetical situation based on six data points. Marked on this plot is a line
of best fit , Ŷi along with the residuals.
Figure.2 A scatter plot of data with a line of best fit and the residuals identified
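To see how residuals measure the closeness of a line to the data, the sketch below computes them for two candidate lines on the production data of Table 1. The helper name rss_for_line and the two candidate lines are hypothetical choices made only for illustration; the sum of squared residuals it returns is the criterion used in the next part.

```r
# Production data from Table 1
RunTime <- c(195, 215, 243, 162, 185, 231, 234, 166, 253, 196,
             220, 168, 207, 225, 169, 215, 147, 230, 208, 172)
RunSize <- c(175, 189, 344, 88, 114, 338, 271, 173, 284, 277,
             337, 58, 146, 277, 123, 227, 63, 337, 146, 68)

# Sum of squared residuals for a candidate line with intercept b0 and slope b1
# (hypothetical helper, not part of the original notes)
rss_for_line <- function(b0, b1) {
  fitted_values <- b0 + b1 * RunSize         # predicted values Y-hat_i
  residuals_i   <- RunTime - fitted_values   # residuals e-hat_i = Y_i - Y-hat_i
  sum(residuals_i^2)
}

# Two arbitrary candidate lines, chosen only for illustration
rss_a <- rss_for_line(150, 0.25)
rss_b <- rss_for_line(150, 0.50)
c(rss_a = rss_a, rss_b = rss_b)   # the smaller value belongs to the closer line
```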
Least squares line of best fit
A very popular method of choosing b0 and b1 is called the method of least squares. As the name
suggests b0 and b1 are chosen to minimize the sum of squared residuals (or residual sum of
squares [RSS]),
\[ RSS = \sum_{i=1}^{n} \hat{e}_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2 \]
(At a minimum, the first-order partial derivatives are equal to zero and the second-order
derivatives are greater than zero.)
For RSS to be a minimum with respect to b0 and b1 we require
\[ \frac{\partial RSS}{\partial b_0} = -2 \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i) = 0 \]
and
\[ \frac{\partial RSS}{\partial b_1} = -2 \sum_{i=1}^{n} X_i (Y_i - b_0 - b_1 X_i) = 0 \]
(Since −2 ≠ 0, the sums themselves must be equal to zero, so we get
Σ (Yi − b0 − b1 Xi) = 0 and Σ Xi (Yi − b0 − b1 Xi) = 0, with both sums running over i = 1, …, n.)
Rearranging terms in these last two equations gives
\[ \sum_{i=1}^{n} Y_i = n b_0 + b_1 \sum_{i=1}^{n} X_i \]   Normal Equation (1)
and
\[ \sum_{i=1}^{n} X_i Y_i = b_0 \sum_{i=1}^{n} X_i + b_1 \sum_{i=1}^{n} X_i^2 \]   Normal Equation (2)
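These last two equations are called the normal equations. Solving them for b0 and b1 gives
the so-called least squares estimates of the intercept,
\[ b_0 = \frac{\sum_{i=1}^{n} Y_i \sum_{i=1}^{n} X_i^2 - \sum_{i=1}^{n} X_i Y_i \sum_{i=1}^{n} X_i}{n \sum_{i=1}^{n} X_i^2 - \left( \sum_{i=1}^{n} X_i \right)^2} = \bar{Y} - b_1 \bar{X} \]
and the slope,
\[ b_1 = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n \sum_{i=1}^{n} X_i^2 - \left( \sum_{i=1}^{n} X_i \right)^2} = \frac{\sum_{i=1}^{n} X_i Y_i - n \bar{X} \bar{Y}}{\sum_{i=1}^{n} X_i^2 - n \bar{X}^2} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{SXY}{SXX} \]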
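As a check on these formulas, the following sketch evaluates the least squares estimates directly on the production data of Table 1 and compares them with lm(). The variable names are our own choices for illustration.

```r
# Production data from Table 1
RunTime <- c(195, 215, 243, 162, 185, 231, 234, 166, 253, 196,
             220, 168, 207, 225, 169, 215, 147, 230, 208, 172)
RunSize <- c(175, 189, 344, 88, 114, 338, 271, 173, 284, 277,
             337, 58, 146, 277, 123, 227, 63, 337, 146, 68)
n <- length(RunSize)

# Centered form: b1 = SXY / SXX and b0 = Ybar - b1 * Xbar
SXY <- sum((RunSize - mean(RunSize)) * (RunTime - mean(RunTime)))
SXX <- sum((RunSize - mean(RunSize))^2)
b1  <- SXY / SXX
b0  <- mean(RunTime) - b1 * mean(RunSize)

# Raw-sum form of the slope, as obtained from the normal equations
b1_raw <- (n * sum(RunSize * RunTime) - sum(RunSize) * sum(RunTime)) /
          (n * sum(RunSize^2) - sum(RunSize)^2)

c(b0 = b0, b1 = b1, b1_raw = b1_raw)

# Compare with R's built-in least squares fit
fit <- lm(RunTime ~ RunSize)
all.equal(unname(coef(fit)), c(b0, b1))
```

Both forms of the slope agree, and the estimates match the coefficients that lm() reports.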
--------------------
Application of Cramer’s rule for solving linear system of equations
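Write the normal equations in a compact matrix form
\[ \begin{bmatrix} \sum_{1}^{n} Y_i \\ \sum_{1}^{n} X_i Y_i \end{bmatrix} = \begin{bmatrix} n & \sum_{1}^{n} X_i \\ \sum_{1}^{n} X_i & \sum_{1}^{n} X_i^2 \end{bmatrix} \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix} \]
Call these matrices A, B and β respectively, so that A = Bβ.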
Step-1
Find the determinant of B, denoted |B|:
\[ |B| = \left( n \times \sum_{1}^{n} X_i^2 \right) - \left( \sum_{1}^{n} X_i \times \sum_{1}^{n} X_i \right) = n \sum_{1}^{n} X_i^2 - \left( \sum_{1}^{n} X_i \right)^2 \]
Step-2
Replace the first column of B with A,
\[ \begin{bmatrix} \sum_{1}^{n} Y_i & \sum_{1}^{n} X_i \\ \sum_{1}^{n} X_i Y_i & \sum_{1}^{n} X_i^2 \end{bmatrix} \]
and find its determinant. Let us call it |B|0:
\[ |B|_0 = \sum_{1}^{n} Y_i \sum_{1}^{n} X_i^2 - \sum_{1}^{n} X_i Y_i \sum_{1}^{n} X_i \]
Step-3
Replace the second column of B with A,
\[ \begin{bmatrix} n & \sum_{1}^{n} Y_i \\ \sum_{1}^{n} X_i & \sum_{1}^{n} X_i Y_i \end{bmatrix} \]
and find its determinant. Let us call it |B|1:
\[ |B|_1 = n \sum_{1}^{n} X_i Y_i - \sum_{1}^{n} X_i \sum_{1}^{n} Y_i \]
Step-4
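Now we can find β̂0 and β̂1 easily by dividing |B|0 and |B|1 by |B|:
\[ \hat{\beta}_0 = \frac{|B|_0}{|B|} = \frac{\sum_{1}^{n} Y_i \sum_{1}^{n} X_i^2 - \sum_{1}^{n} X_i Y_i \sum_{1}^{n} X_i}{n \sum_{1}^{n} X_i^2 - \left( \sum_{1}^{n} X_i \right)^2} \]
\[ \hat{\beta}_1 = \frac{|B|_1}{|B|} = \frac{n \sum_{1}^{n} X_i Y_i - \sum_{1}^{n} X_i \sum_{1}^{n} Y_i}{n \sum_{1}^{n} X_i^2 - \left( \sum_{1}^{n} X_i \right)^2} \]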
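The four steps of Cramer's rule can be sketched in R on the production data of Table 1. The object names A, B, B0 and B1 follow the notation above; the comparison with solve() is added only as a check.

```r
# Production data from Table 1
RunTime <- c(195, 215, 243, 162, 185, 231, 234, 166, 253, 196,
             220, 168, 207, 225, 169, 215, 147, 230, 208, 172)
RunSize <- c(175, 189, 344, 88, 114, 338, 271, 173, 284, 277,
             337, 58, 146, 277, 123, 227, 63, 337, 146, 68)
n <- length(RunSize)

# Normal equations in matrix form: A = B %*% beta
A <- c(sum(RunTime), sum(RunSize * RunTime))
B <- matrix(c(n,            sum(RunSize),
              sum(RunSize), sum(RunSize^2)), nrow = 2, byrow = TRUE)

# Steps 2 and 3: replace the first and the second column of B with A
B0 <- B; B0[, 1] <- A
B1 <- B; B1[, 2] <- A

# Steps 1 and 4: ratios of determinants
beta0_hat <- det(B0) / det(B)
beta1_hat <- det(B1) / det(B)
c(beta0_hat = beta0_hat, beta1_hat = beta1_hat)

# Check against a direct solution of the linear system
solve(B, A)
```

For a 2 × 2 system Cramer's rule is convenient by hand; for larger systems a direct solver such as solve() is the usual choice.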
---------------------------
4. Regression output from R
#R output
m1 <- lm(RunTime~RunSize)
summary(m1)
The least squares estimates for the production data were calculated using R, giving the following
results:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 149.74770 8.32815 17.98 6.00e-13 ***
RunSize 0.25924 0.03714 6.98 1.61e-06 ***
---
Signif. codes:0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
Residual standard error: 16.25 on 18 degrees of freedom
Multiple R-Squared: 0.7302, Adjusted R-squared: 0.7152
F-statistic: 48.72 on 1 and 18 DF, p-value: 1.615e-06
5. The least squares line of best fit for the production data
#Figure 3
plot(production$RunSize,production$RunTime,xlab="Run Size", ylab="Run Time")
abline(lsfit(production$RunSize,production$RunTime))
Figure 3 shows a scatter plot of the production data with the least squares line of best fit. The
equation of the least squares line of best fit is
Y = 149.7 + 0.26X.
Let us look at the results that we have obtained from the line of best fit in Figure 3. The intercept
in Figure 3 is 149.7, which is where the line of best fit crosses the run time axis. The slope of
the line in Figure 3 is 0.26. Thus, we say that each additional unit to be produced is predicted
to add 0.26 minutes to the run time. The intercept in the model has the following interpretation:
for any production run, the average set up time is 149.7 minutes.
Figure 3 A plot of the production data with the least squares line of best fit
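The interpretation of the slope can be illustrated by predicting the run time at a few run sizes. The run sizes 100, 200 and 300 below are hypothetical values chosen only for illustration.

```r
# Production data from Table 1
RunTime <- c(195, 215, 243, 162, 185, 231, 234, 166, 253, 196,
             220, 168, 207, 225, 169, 215, 147, 230, 208, 172)
RunSize <- c(175, 189, 344, 88, 114, 338, 271, 173, 284, 277,
             337, 58, 146, 277, 123, 227, 63, 337, 146, 68)

fit <- lm(RunTime ~ RunSize)

# Hypothetical run sizes at which to predict the run time
new_runs <- data.frame(RunSize = c(100, 200, 300))
pred <- predict(fit, newdata = new_runs)
pred

# Predicted increase in run time per additional unit produced
slope_from_pred <- unname(diff(pred)) / 100
slope_from_pred   # equals the fitted slope, about 0.26 minutes per unit
```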
6. Estimating the variance of the random error term
Consider the linear regression model with constant variance given by (1) and (2). In this case,
Yi = β0 + β1 Xi + ei (i = 1, 2, ..., n)
where the random error ei has mean 0 and variance σ². We wish to estimate σ² = Var(e). Notice
that
ei = Yi – (β0 + β1 Xi) = Yi – unknown regression line at Xi.
Since β0 and β1 are unknown all we can do is estimate these errors by replacing β0 and β1 by
their respective least squares estimates β̂0 and β̂1 giving the residuals êi = Yi− (β̂0 + β̂1Xi) = Yi
− estimated regression line at Xi .
These residuals can be used to estimate σ². In fact, it can be shown that
\[ S^2 = \frac{RSS}{n - 2} = \frac{1}{n - 2} \sum_{i=1}^{n} \hat{e}_i^2 \]
is an unbiased estimate of σ².
Two points to note are:
1. The mean of the residuals is zero, since Σ êi = 0. This is Normal Equation (1) evaluated at the least squares estimates, which minimize RSS = Σ êi².
2. The divisor in S² is n – 2 since we have estimated two parameters, namely β0 and β1 .
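Both points can be checked numerically on the production data of Table 1. This is a minimal sketch; the variable names are our own.

```r
# Production data from Table 1
RunTime <- c(195, 215, 243, 162, 185, 231, 234, 166, 253, 196,
             220, 168, 207, 225, 169, 215, 147, 230, 208, 172)
RunSize <- c(175, 189, 344, 88, 114, 338, 271, 173, 284, 277,
             337, 58, 146, 277, 123, 227, 63, 337, 146, 68)

fit   <- lm(RunTime ~ RunSize)
n     <- length(RunTime)
e_hat <- RunTime - fitted(fit)   # residuals e-hat_i = Y_i - Y-hat_i

sum(e_hat)                       # zero up to rounding error

RSS <- sum(e_hat^2)              # residual sum of squares
S2  <- RSS / (n - 2)             # unbiased estimate of sigma^2
S   <- sqrt(S2)                  # residual standard error
c(RSS = RSS, S2 = S2, S = S)

summary(fit)$sigma               # R's residual standard error, for comparison
```

The value of S agrees with the "Residual standard error: 16.25 on 18 degrees of freedom" line of the regression output.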
7. Inferences about the slope and the intercept
In this section, we shall develop methods for finding confidence intervals and for performing
hypothesis tests about the slope and the intercept of the regression line.
8. Assumptions necessary in order to make inferences about the regression model
Throughout this section we shall make the following assumptions:
1. Y is related to X by the simple linear regression model
Yi = β0 + β1 Xi + ei (i = 1, 2, ..., n), i.e., E(Y | X = Xi) = β0 + β1 Xi
2. The errors e1, e2, ..., en are independent of each other
3. The errors e1, e2, ..., en have a common variance σ²
4. The errors are normally distributed with a mean of 0 and variance σ², that is, e | X ~ N(0, σ²)
In addition, since the regression model is conditional on X we can assume that the values of
the predictor variable, X1, X2, …, Xn are known fixed constants.
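Under the above assumptions,
\[ E(\hat{\beta}_1 \mid X) = \beta_1 \]
\[ Var(\hat{\beta}_1 \mid X) = \frac{\sigma^2}{SXX} \]
\[ \hat{\beta}_1 \mid X \sim N\left( \beta_1, \frac{\sigma^2}{SXX} \right) \]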
Note that the variance of the least squares slope estimate decreases as SXX increases (i.e., as
the variability in the X ’s increases). This is an important fact to note if the experimenter has
control over the choice of the values of the X variable.
Standardizing gives
\[ Z = \frac{\hat{\beta}_1 - \beta_1}{\sigma / \sqrt{SXX}} \sim N(0, 1) \]
If σ were known then we could use Z to test hypotheses and find confidence intervals for β1 .
When σ is unknown (as is usually the case), replacing σ by S, the
standard deviation of the residuals, results in
\[ T = \frac{\hat{\beta}_1 - \beta_1}{S / \sqrt{SXX}} = \frac{\hat{\beta}_1 - \beta_1}{se(\hat{\beta}_1)} \]
where se(β̂1) = S / √SXX is the estimated standard error (se) of β̂1. In the production example
the X-variable is RunSize and so se(β̂1) = 0.03714. It can be shown that under the above
assumptions T has a t-distribution with n – 2 degrees of freedom, that is
\[ T = \frac{\hat{\beta}_1 - \beta_1}{se(\hat{\beta}_1)} \sim t_{n-2} \]
Notice that the degrees of freedom satisfy the following formula
degrees of freedom = sample size – number of mean parameters estimated.
In this case we are estimating two such parameters, namely, β0 and β1.
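The estimated standard error of the slope can be computed from its definition and compared with the value R reports. This is a sketch on the production data of Table 1 with variable names of our own choosing.

```r
# Production data from Table 1
RunTime <- c(195, 215, 243, 162, 185, 231, 234, 166, 253, 196,
             220, 168, 207, 225, 169, 215, 147, 230, 208, 172)
RunSize <- c(175, 189, 344, 88, 114, 338, 271, 173, 284, 277,
             337, 58, 146, 277, 123, 227, 63, 337, 146, 68)

fit <- lm(RunTime ~ RunSize)
n   <- length(RunTime)

S     <- sqrt(sum(residuals(fit)^2) / (n - 2))   # estimate of sigma
SXX   <- sum((RunSize - mean(RunSize))^2)        # spread of the X values
se_b1 <- S / sqrt(SXX)                           # se of the slope estimate
se_b1

# Standard error reported in the coefficient table, for comparison
se_b1_lm <- summary(fit)$coefficients["RunSize", "Std. Error"]
se_b1_lm
```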
For testing the hypothesis H0: β1 = β1⁰ the test statistic is
\[ T = \frac{\hat{\beta}_1 - \beta_1^0}{se(\hat{\beta}_1)} \sim t_{n-2} \quad \text{when } H_0 \text{ is true} \]
R provides the value of T and the p-value associated with testing H0: β1 = 0 against HA: β1 ≠ 0
(i.e., for the choice β1⁰ = 0).
# Critical value t(0.025, 18): the 97.5th percentile of the t-distribution with 18 df
tval <- qt(1-0.05/2,18)
tval
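The test of H0: β1 = 0 can be reproduced step by step as follows. This is a sketch on the production data of Table 1; the 5% level is an assumption made for the illustration.

```r
# Production data from Table 1
RunTime <- c(195, 215, 243, 162, 185, 231, 234, 166, 253, 196,
             220, 168, 207, 225, 169, 215, 147, 230, 208, 172)
RunSize <- c(175, 189, 344, 88, 114, 338, 271, 173, 284, 277,
             337, 58, 146, 277, 123, 227, 63, 337, 146, 68)

fit   <- lm(RunTime ~ RunSize)
est   <- summary(fit)$coefficients
b1    <- est["RunSize", "Estimate"]
se_b1 <- est["RunSize", "Std. Error"]
df    <- df.residual(fit)                  # n - 2 = 18

T_stat  <- (b1 - 0) / se_b1                # test statistic under H0: beta1 = 0
p_value <- 2 * pt(abs(T_stat), df, lower.tail = FALSE)   # two-sided p-value
t_crit  <- qt(1 - 0.05 / 2, df)            # critical value at the 5% level

c(T = T_stat, p_value = p_value, t_crit = t_crit)
abs(T_stat) > t_crit                       # TRUE means H0 is rejected at the 5% level
```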
In the production example the X-variable is RunSize and T = 6.98, which results in a p-value
less than 0.0001. A 100(1–α)% confidence interval for β1, the slope of the regression line,
is given by
\[ \left( \hat{\beta}_1 - t(\alpha/2, n-2)\, se(\hat{\beta}_1),\; \hat{\beta}_1 + t(\alpha/2, n-2)\, se(\hat{\beta}_1) \right) \]
where t(α/2, n – 2) is the 100(1 – α/2)th quantile of the t-distribution with n – 2 degrees of
freedom.
In the production example the X-variable is RunSize and β̂1 = 0.25924, se(β̂1) = 0.03714,
t(0.025, 20 – 2 = 18) = 2.1009. Thus a 95% confidence interval for β1 is given by
(0.25924 ± 2.1009 × 0.03714) = (0.25924 ± 0.07803) = (0.181, 0.337)
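The same interval can be computed in R from its three ingredients and compared with confint(). This is a sketch on the production data of Table 1; the object names are our own.

```r
# Production data from Table 1
RunTime <- c(195, 215, 243, 162, 185, 231, 234, 166, 253, 196,
             220, 168, 207, 225, 169, 215, 147, 230, 208, 172)
RunSize <- c(175, 189, 344, 88, 114, 338, 271, 173, 284, 277,
             337, 58, 146, 277, 123, 227, 63, 337, 146, 68)

fit   <- lm(RunTime ~ RunSize)
est   <- summary(fit)$coefficients
b1    <- est["RunSize", "Estimate"]
se_b1 <- est["RunSize", "Std. Error"]

alpha  <- 0.05
t_crit <- qt(1 - alpha / 2, df.residual(fit))   # t(0.025, 18)

# 95% confidence interval for the slope: estimate -/+ t * se
ci_b1 <- b1 + c(-1, 1) * t_crit * se_b1
round(ci_b1, 3)

# Built-in interval, for comparison
confint(fit, "RunSize", level = 0.95)
```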
9. Inferences about the intercept of the regression line
Recall that the least squares estimate of β0 is given by
\[ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \]
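Under the assumptions given previously it can be shown that
\[ E(\hat{\beta}_0 \mid X) = \beta_0 \]
\[ Var(\hat{\beta}_0 \mid X) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{X}^2}{SXX} \right) \]
\[ \hat{\beta}_0 \mid X \sim N\left( \beta_0, \sigma^2 \left( \frac{1}{n} + \frac{\bar{X}^2}{SXX} \right) \right) \]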
Standardizing gives
\[ Z = \frac{\hat{\beta}_0 - \beta_0}{\sigma \sqrt{\frac{1}{n} + \frac{\bar{X}^2}{SXX}}} \sim N(0, 1) \]
If σ were known then we could use Z to test hypotheses and find confidence intervals for β0 .
When σ is unknown (as is usually the case), replacing σ by S results in
\[ T = \frac{\hat{\beta}_0 - \beta_0}{S \sqrt{\frac{1}{n} + \frac{\bar{X}^2}{SXX}}} = \frac{\hat{\beta}_0 - \beta_0}{se(\hat{\beta}_0)} \sim t_{n-2} \]
where se(β̂0) = S √(1/n + X̄²/SXX) is the estimated standard error of β̂0, which is given directly
by R. In the production example the intercept is called Intercept and so se(β̂0) = 8.32815.
For testing the hypothesis H0: β0 = β0⁰ the test statistic is
\[ T = \frac{\hat{\beta}_0 - \beta_0^0}{se(\hat{\beta}_0)} \sim t_{n-2} \quad \text{when } H_0 \text{ is true} \]
R provides the value of T and the p-value associated with testing H0: β0 = 0 against HA: β0 ≠ 0.
In the production example the intercept is called Intercept and
T = 17.98, which results in a p-value < 0.0001. A 100(1–α)% confidence interval for β0, the
intercept of the regression line, is given by
\[ \left( \hat{\beta}_0 - t(\alpha/2, n-2)\, se(\hat{\beta}_0),\; \hat{\beta}_0 + t(\alpha/2, n-2)\, se(\hat{\beta}_0) \right) \]
where t(α / 2, n – 2) is the 100(1– α / 2)th quantile of the t -distribution with n – 2 degrees of
freedom.
In the production example β̂0 = 149.7477, se(β̂0) = 8.32815, t(0.025, 20 – 2 = 18) = 2.1009.
Thus a 95% confidence interval for β0 is given by
(149.7477 ± 2.1009 × 8.32815) = (149.7477 ± 17.497) = (132.3, 167.2)
Regression Output from R: 95% confidence intervals
#95% confidence intervals
round(confint(m1,level=0.95),3)
2.5% 97.5%
(Intercept) 132.251 167.244
RunSize 0.181 0.337
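The standard error, test statistic and confidence interval for the intercept can also be computed from the formulas of this section. This is a sketch on the production data of Table 1; the object names are our own.

```r
# Production data from Table 1
RunTime <- c(195, 215, 243, 162, 185, 231, 234, 166, 253, 196,
             220, 168, 207, 225, 169, 215, 147, 230, 208, 172)
RunSize <- c(175, 189, 344, 88, 114, 338, 271, 173, 284, 277,
             337, 58, 146, 277, 123, 227, 63, 337, 146, 68)

fit <- lm(RunTime ~ RunSize)
n   <- length(RunTime)
b0  <- unname(coef(fit)[1])

S     <- sqrt(sum(residuals(fit)^2) / (n - 2))
SXX   <- sum((RunSize - mean(RunSize))^2)
se_b0 <- S * sqrt(1 / n + mean(RunSize)^2 / SXX)   # se of the intercept estimate

T_stat <- (b0 - 0) / se_b0                          # test of H0: beta0 = 0
t_crit <- qt(1 - 0.05 / 2, n - 2)
ci_b0  <- b0 + c(-1, 1) * t_crit * se_b0            # 95% confidence interval

c(se_b0 = se_b0, T = T_stat)
round(ci_b0, 3)
```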
10. Analysis of variance
There is a linear association between Y and X if
Y = β0 + β1 X + e
and β1 ≠ 0. If we knew that β1 ≠ 0 then we would predict Y by
Ŷ = β̂0 + β̂1 X
On the other hand, if we knew that β1 = 0 then we predict Y by
Ŷ = Ȳ
To test whether there is a linear association between Y and X we have to test
H0 : β1 = 0 against HA : β1 ≠ 0 .
We can perform this test using the following t-statistic
\[ T = \frac{\hat{\beta}_1 - 0}{se(\hat{\beta}_1)} \sim t_{n-2} \quad \text{when } H_0 \text{ is true} \]
We next look at a different test statistic which can be used when there is more than one predictor
variable, that is, in multiple regression. First, we introduce some terminology.
Define the total corrected sum of squares of the Y's by
\[ SST = SYY = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \]
Recall that the residual sum of squares is given by
\[ RSS = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \]
Define the regression sum of squares (i.e., sum of squares explained by the regression model)
by
\[ SSreg = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 \]
It is clear that SSreg is close to zero if for each i, Ŷi is close to Ȳ while SSreg is large
if Ŷi differs from Ȳ for most values of X.
We next look at the hypothetical situation in Figure 4 with just a single data point ( Xi , Yi )
shown along with the least squares regression line and the mean of y based on all n data points.
It is apparent from Figure 4 that Yi-Ȳ = (Ŷi-Ȳ) + (Yi-Ŷi)
Further, it can be shown that
SST = SSreg + RSS
Total sample variability = Variability explained by the model + Unexplained (or error)
variability
Figure. 4 Graphical depiction that yi-ȳ = (ŷi-ȳ) + (yi-ŷi)
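The decomposition SST = SSreg + RSS can be verified numerically on the production data of Table 1. This is a minimal sketch with variable names of our own choosing.

```r
# Production data from Table 1
RunTime <- c(195, 215, 243, 162, 185, 231, 234, 166, 253, 196,
             220, 168, 207, 225, 169, 215, 147, 230, 208, 172)
RunSize <- c(175, 189, 344, 88, 114, 338, 271, 173, 284, 277,
             337, 58, 146, 277, 123, 227, 63, 337, 146, 68)

fit   <- lm(RunTime ~ RunSize)
Y_hat <- fitted(fit)
Y_bar <- mean(RunTime)

SST   <- sum((RunTime - Y_bar)^2)   # total corrected sum of squares
SSreg <- sum((Y_hat - Y_bar)^2)     # explained by the regression
RSS   <- sum((RunTime - Y_hat)^2)   # unexplained (residual)

c(SST = SST, SSreg = SSreg, RSS = RSS)
all.equal(SST, SSreg + RSS)         # the decomposition holds
```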
To test
H0 : β1 = 0 against HA : β1 ≠ 0 .
we can use the test statistic
\[ F = \frac{SSreg / 1}{RSS / (n - 2)} \]
since RSS has ( n – 2) degrees of freedom and SSreg has 1 degree of freedom.
Under the assumption that e1, e2 ,..., en are independent and normally distributed with mean 0
and variance σ2 , it can be shown that F has an F distribution with 1 and n – 2 degrees of
freedom when H0 is true, that is,
\[ F = \frac{SSreg / 1}{RSS / (n - 2)} \sim F_{1, n-2} \quad \text{when } H_0 \text{ is true} \]
Form of test: reject H0 at level α if F> Fα,1,n-2 .
The usual way of setting out this test is to use an Analysis of variance table.
Source of variation   Degrees of freedom (df)   Sum of squares (SS)   Mean square (MS)   F
Regression            1                         SSreg                 SSreg/1            F = (SSreg/1) / (RSS/(n – 2))
Residual              n – 2                     RSS                   RSS/(n – 2)
Total                 n – 1                     SST
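The entries of the analysis of variance table and the F-test can be computed directly, as in the sketch below on the production data of Table 1. The 5% level is an assumption made for the illustration.

```r
# Production data from Table 1
RunTime <- c(195, 215, 243, 162, 185, 231, 234, 166, 253, 196,
             220, 168, 207, 225, 169, 215, 147, 230, 208, 172)
RunSize <- c(175, 189, 344, 88, 114, 338, 271, 173, 284, 277,
             337, 58, 146, 277, 123, 227, 63, 337, 146, 68)

fit <- lm(RunTime ~ RunSize)
n   <- length(RunTime)

SSreg <- sum((fitted(fit) - mean(RunTime))^2)
RSS   <- sum(residuals(fit)^2)

MSreg  <- SSreg / 1          # mean square for regression (1 df)
MSres  <- RSS / (n - 2)      # mean square for residuals (n - 2 df)
F_stat <- MSreg / MSres

p_value <- pf(F_stat, df1 = 1, df2 = n - 2, lower.tail = FALSE)
F_crit  <- qf(1 - 0.05, df1 = 1, df2 = n - 2)   # reject H0 if F_stat > F_crit

c(F = F_stat, p_value = p_value, F_crit = F_crit)
```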
Notes:
1. It can be shown that in the case of simple linear regression
\[ T = \frac{\hat{\beta}_1 - 0}{se(\hat{\beta}_1)} \sim t_{n-2} \quad \text{and} \quad F = \frac{SSreg / 1}{RSS / (n - 2)} \sim F_{1, n-2} \]
are related via F = T².
2. R², the coefficient of determination of the regression line, is defined as the proportion of the
total sample variability in the Y's explained by the regression model, that is,
\[ R^2 = \frac{SSreg}{SST} = 1 - \frac{RSS}{SST} \]
The reason this quantity is called R2 is that it is equal to the square of the correlation between
Y and X . It is arguably one of the most commonly misused statistics.
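Both notes can be checked on the production data of Table 1. The sketch below reads T, F and R² from summary() and compares them with the relationships stated above.

```r
# Production data from Table 1
RunTime <- c(195, 215, 243, 162, 185, 231, 234, 166, 253, 196,
             220, 168, 207, 225, 169, 215, 147, 230, 208, 172)
RunSize <- c(175, 189, 344, 88, 114, 338, 271, 173, 284, 277,
             337, 58, 146, 277, 123, 227, 63, 337, 146, 68)

fit <- lm(RunTime ~ RunSize)
s   <- summary(fit)

T_stat <- s$coefficients["RunSize", "t value"]
F_stat <- s$fstatistic[["value"]]
R2     <- s$r.squared

all.equal(F_stat, T_stat^2)               # Note 1: F = T^2
all.equal(R2, cor(RunSize, RunTime)^2)    # Note 2: R^2 = squared correlation
round(R2, 4)
```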
11. Regression output from R
#R output
anova(m1)
detach(production)
Analysis of Variance Table
Response: RunTime
Df Sum Sq Mean Sq F value Pr(>F)
RunSize 1 12868.4 12868.4 48.717 1.615e-06 ***
Residuals 18 4754.6 264.1
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
Notice that the observed F-value of 48.717 is just the square of the observed t-value 6.98,
which can be found in the coefficient table of the regression output given earlier. We shall see
that Analysis of Variance overcomes the problems associated with multiple t-tests which occur
when there are many predictor variables.
#-------------------------------------------------------------------------------------
Changing the notation to make the discussion more compact
Consider the linear regression model written in matrix form as
Y=Xβ+e
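with Var(e) = σ² I, where I is the (n × n) identity matrix, Y and e are (n × 1) vectors, β is a
((p + 1) × 1) vector and X is an n × (p + 1) matrix, given by
\[ Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad e = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix} \]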
The least squares estimates are given by
\[ \hat{\beta} = (X'X)^{-1} X'Y \]
and they are unbiased, since
\[ E(\hat{\beta} \mid X) = E\left( (X'X)^{-1} X'Y \mid X \right) = (X'X)^{-1} X' E(Y \mid X) = (X'X)^{-1} X'X\beta = \beta \]
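For the production data of Table 1 (one predictor, so p = 1) the matrix formula can be evaluated directly. This is a minimal sketch; solving the linear system (X'X) β̂ = X'Y with solve() is used here instead of forming the inverse explicitly, which is a design choice of this example rather than part of the derivation.

```r
# Production data from Table 1
RunTime <- c(195, 215, 243, 162, 185, 231, 234, 166, 253, 196,
             220, 168, 207, 225, 169, 215, 147, 230, 208, 172)
RunSize <- c(175, 189, 344, 88, 114, 338, 271, 173, 284, 277,
             337, 58, 146, 277, 123, 227, 63, 337, 146, 68)

X <- cbind(1, RunSize)   # n x (p + 1) design matrix with a column of ones
Y <- RunTime             # n x 1 response vector

# beta-hat = (X'X)^{-1} X'Y, computed by solving (X'X) beta = X'Y
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)
drop(beta_hat)

# Compare with the coefficients from lm()
fit <- lm(RunTime ~ RunSize)
all.equal(as.numeric(beta_hat), unname(coef(fit)))
```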