Statistics 191: Introduction to Applied Statistics Jonathan Taylor Department of Statistics Stanford University
Statistics 191: Introduction to Applied Statistics
Simple Linear Regression: Diagnostics
Jonathan Taylor Department of Statistics Stanford University
January 28, 2009
Outline
Diagnostics for simple regression
Goodness of fit of regression: analysis of variance.
F-statistics.
Residuals.
Diagnostic plots.
Geometry of Least Squares
Goodness of fit
Sums of squares
$$
\begin{aligned}
SSE &= \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^n (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2 \\
SSR &= \sum_{i=1}^n (\bar{Y} - \hat{Y}_i)^2 = \sum_{i=1}^n (\bar{Y} - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2 \\
SST &= \sum_{i=1}^n (Y_i - \bar{Y})^2 = SSE + SSR \\
R^2 &= \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = \widehat{Cor}(X, Y)^2
\end{aligned}
$$
Basic idea: if $R^2$ is large, a lot of the variability in $Y$ is explained by $X$.
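The decomposition can be checked numerically. The slides point to R code that is not reproduced here, so the following is only a minimal Python sketch using the standard library; the helper names `fit_line` and `sums_of_squares` and the data values are made up for illustration.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for the line y = b0 + b1 * x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def sums_of_squares(x, y):
    """Return (SSE, SSR, SST) for the simple linear regression of y on x."""
    b0, b1 = fit_line(x, y)
    ybar = sum(y) / len(y)
    fitted = [b0 + b1 * xi for xi in x]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ssr = sum((ybar - fi) ** 2 for fi in fitted)
    sst = sum((yi - ybar) ** 2 for yi in y)
    return sse, ssr, sst


# Illustrative (made-up) data, not taken from the course.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

sse, ssr, sst = sums_of_squares(x, y)
r_squared = ssr / sst

print("SSE + SSR:", sse + ssr, "SST:", sst)
print("R^2:", r_squared)
```

Up to floating-point rounding, `sse + ssr` agrees with `sst`, and `ssr / sst` agrees with `1 - sse / sst`.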
Total sum of squares
R code
Error sum of squares
R code
Regression sum of squares
R code
F-statistics
What is an F-statistic?
An F-statistic is a ratio of sample variances (mean squares): it has a numerator, $N$, and a denominator, $D$, that are independent. Let
$$
N \sim \frac{\chi^2_{\text{num}}}{df_{\text{num}}}, \qquad D \sim \frac{\chi^2_{\text{den}}}{df_{\text{den}}}
$$
and define
$$
F = \frac{N}{D}.
$$
We say $F$ has an F distribution with parameters $df_{\text{num}}, df_{\text{den}}$ and write $F \sim F_{df_{\text{num}}, df_{\text{den}}}$.
F-statistic in simple linear regression
Goodness of fit F-statistic
The ratio
$$
F = \frac{SSR/1}{SSE/(n-2)} = \frac{MSR}{MSE}
$$
can be thought of as a ratio of variances. In fact, under $H_0: \beta_1 = 0$, $F \sim F_{1,n-2}$ because
$$
SSR = \|\hat{Y} - \bar{Y} \cdot 1\|^2, \qquad SSE = \|Y - \hat{Y}\|^2
$$
and from our picture, these vectors are orthogonal.
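The orthogonality claim and the resulting F-statistic can be verified on a small example. This is a hedged Python sketch, not the course's R code; `fit_line` and the data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for the line y = b0 + b1 * x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


# Illustrative (made-up) data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

b0, b1 = fit_line(x, y)
ybar = sum(y) / n
fitted = [b0 + b1 * xi for xi in x]

explained = [fi - ybar for fi in fitted]           # entries of Y-hat minus Y-bar * 1
residual = [yi - fi for yi, fi in zip(y, fitted)]  # entries of Y minus Y-hat

# The inner product of the two vectors should vanish (up to rounding).
dot = sum(a * b for a, b in zip(explained, residual))

ssr = sum(a * a for a in explained)
sse = sum(b * b for b in residual)
f_stat = (ssr / 1) / (sse / (n - 2))

print("inner product:", dot)
print("F:", f_stat)
```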
F and t statistics
Relation between F and t
If $T \sim t_\nu$, then
$$
T^2 \sim \frac{N(0,1)^2}{\chi^2_\nu/\nu} \sim \frac{\chi^2_1/1}{\chi^2_\nu/\nu}.
$$
In other words, the square of a t-statistic is an F-statistic. Because it is always positive, an F-statistic has no direction ($\pm$) associated with it. In fact (see R code),
$$
F = \frac{MSR}{MSE} = \frac{\hat{\beta}_1^2}{SE(\hat{\beta}_1)^2}.
$$
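The identity $F = t^2$ can be confirmed numerically. The sketch below is an illustration in Python rather than the R code the slide refers to; the data are made up, with a negative slope chosen so that the t-statistic is negative while F stays positive. It uses the standard formula $SE(\hat{\beta}_1) = \sqrt{MSE / \sum_i (X_i - \bar{X})^2}$.

```python
import math

# Illustrative (made-up) data with a decreasing trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [9.8, 8.1, 6.2, 3.9, 2.2, 0.1]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = sxy / sxx          # slope estimate
b0 = ybar - b1 * xbar   # intercept estimate
fitted = [b0 + b1 * xi for xi in x]

sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ssr = sum((fi - ybar) ** 2 for fi in fitted)
mse = sse / (n - 2)

se_b1 = math.sqrt(mse / sxx)   # standard error of the slope
t_stat = b1 / se_b1            # carries the sign of the slope
f_stat = (ssr / 1) / mse       # sign information is lost

print("t:", t_stat, "t^2:", t_stat ** 2, "F:", f_stat)
```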
F-statistics in regression models
Interpretation of an F-statistic
In regression, the numerator is usually a difference in goodness of fit of two (nested) models.
The denominator is $\hat{\sigma}^2$, an estimate of $\sigma^2$.
Our example today: the bigger model is the simple linear regression model, the smaller is the model with constant mean (one sample model).
If the F is large, it says that the bigger model explains a lot more variability in $Y$ (relative to $\sigma^2$) than the smaller one.
F-test in simple linear regression
Example in more detail
Full (bigger) model:
$$
FM: \quad Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
$$
Reduced (smaller) model:
$$
RM: \quad Y_i = \beta_0 + \varepsilon_i
$$
The F-statistic has the form
$$
F = \frac{(SSE(RM) - SSE(FM))/(df_{RM} - df_{FM})}{SSE(FM)/df_{FM}}.
$$
Reject $H_0$: RM is correct, if $F > F_{1-\alpha, 1, n-2}$.
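As a rough illustration of the nested-model form, the Python sketch below computes the statistic from the two error sums of squares and their degrees of freedom. The function name `f_statistic` and the data are hypothetical. The critical value $F_{1-\alpha,1,n-2}$ is not computed here; it would come from a table or a statistics package.

```python
def f_statistic(sse_rm, df_rm, sse_fm, df_fm):
    """F-statistic comparing a reduced model (RM) with a full model (FM)."""
    return ((sse_rm - sse_fm) / (df_rm - df_fm)) / (sse_fm / df_fm)


# Illustrative (made-up) data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar

# Reduced model: constant mean, so every fitted value is ybar (n - 1 error df).
sse_rm = sum((yi - ybar) ** 2 for yi in y)
# Full model: simple linear regression (n - 2 error df).
sse_fm = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

f_nested = f_statistic(sse_rm, n - 1, sse_fm, n - 2)
print("F from nested models:", f_nested)
```

Because SSE(RM) equals SST in this example, the numerator is SSR / 1 and the result coincides with MSR / MSE.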
Diagnostics
What are the assumptions?
$$
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
$$
Errors $\varepsilon_i$ are assumed independent $N(0, \sigma^2)$.
Diagnostics
What can go wrong?
Regression function can be wrong: maybe regression function should be quadratic (see R code).
Model for the errors may be incorrect:
  may not be normally distributed.
  may not be independent.
  may not have the same variance.
Detecting problems is more art than science, i.e. we cannot test for all possible problems in a regression model.
Basic idea of diagnostic measures: if the model is correct then the residuals $e_i = Y_i - \hat{Y}_i$, $1 \le i \le n$, should look like a sample of (not quite independent) $N(0, \sigma^2)$ random variables.
A bad simple regression model
R code
Diagnostic plots
Problems in the regression function
True regression function may have higher-order non-linear terms, polynomial or otherwise.
Can sometimes be diagnosed by looking at a plot of $X$ vs. residuals $e$. If there is any visible trend in this plot, one may consider adding more terms to the model to capture this trend (this makes the model a multiple regression model).
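To see what such a trend looks like, the following Python sketch fits a straight line to data that are exactly quadratic and lists the residuals in order of X. It is an illustrative stand-in for the residual plot, not the course's R code; the data are made up.

```python
# Illustrative data: the true regression function is quadratic, y = x^2.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [xi ** 2 for xi in x]
n = len(x)

# Fit the (misspecified) simple linear regression by least squares.
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# A text stand-in for the plot of X vs. residuals.
for xi, ei in zip(x, residuals):
    print(f"x = {xi:4.1f}   e = {ei:6.2f}")
```

The residuals are positive at both ends of the X range and negative in the middle: a U-shaped trend that a correctly specified model would not leave behind.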
Plot of residuals vs. X
R code
Quadratic model
R code
Plot of residuals vs. X (quadratic)
R code
Problems with the errors
Possible problems & diagnostic checks
Errors may not be normally distributed or may not have the same variance: qqnorm can help with this.
Variance may not be constant. Can also be addressed in a plot of $X$ vs. $e$: a fan shape or other trend indicates non-constant variance.
Outliers: points where the model really does not fit! Possibly mistakes in data transcription, lab errors, who knows? Should be recognized and (hopefully) explained.
Non-normality
qqnorm
If $e_i$, $1 \le i \le n$, were really a sample of $N(0, \sigma^2)$, then their sample quantiles should be close to the theoretical quantiles of the $N(0, \sigma^2)$ distribution.
Plot: $e_{(i)}$ vs. $E(\varepsilon_{(i)})$, $1 \le i \le n$,
where $e_{(i)}$ is the $i$-th smallest residual (order statistic) and $E(\varepsilon_{(i)})$ is the expected value for independent $\varepsilon_i$'s $\sim N(0, \sigma^2)$.
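The coordinates of such a plot can be sketched in Python as follows. This is an approximation under stated assumptions, not a reimplementation of R's `qqnorm`: the expected order statistics are approximated by standard normal quantiles at the plotting positions $(i - 0.5)/n$, and using $N(0,1)$ instead of $N(0, \sigma^2)$ only rescales the horizontal axis. The function name `qq_points` and the residual values are hypothetical.

```python
from statistics import NormalDist


def qq_points(residuals):
    """Pair each ordered residual with an approximate expected normal order statistic."""
    n = len(residuals)
    ordered = sorted(residuals)  # e_(1) <= ... <= e_(n)
    std_normal = NormalDist()    # N(0, 1)
    # Plotting positions (i - 0.5) / n approximate the expected order statistics.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Illustrative (made-up) residuals.
residuals = [0.3, -1.2, 0.8, -0.1, 1.5, -0.6, 0.2]
points = qq_points(residuals)

for q, e in points:
    print(f"theoretical = {q:6.3f}   sample = {e:5.2f}")
```

If the residuals are roughly normal, the printed pairs lie close to a straight line when plotted.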
QQplot of residuals
R code
QQplot of residuals (quadratic)
R code
Outlier and nonconstant variance
R code
Outlier and nonconstant variance
R code