Chapter 6 Variable Selection and
Model Building
Department of Statistics
DDU
9.1 Introduction
9.1.1 The Model-Building Problem
• Ensure that the functional form of the model is
correct and that the underlying assumptions are
not violated.
• A pool of candidate regressors
• Variable selection problem
• Two conflicting objectives:
– Include as many regressors as possible: the
information content in these factors can
influence the predicted value of y
– Include as few regressors as possible: the
variance of the prediction increases as the
number of the regressors increases
• “Best” regression equation???
• Several algorithms can be used for variable
selection, but these procedures frequently specify
different subsets of the candidate regressors as
best.
• An idealized setting:
– The correct functional forms of regressors are
known.
– No outliers or influential observations
• Residual analysis
• Iterative approach:
1. A variable selection strategy
2. Check the correct functional forms, outliers
and influential observations
• None of the variable selection procedures are
guaranteed to produce the best regression
equation for a given data set.
9.1.2 Consequences of Model Misspecification
• The full model
• The subset model
• Motivation for variable selection:
– Deleting variables from the model can improve
the precision of the parameter estimates and
reduce the variance of the predicted response.
– Deleting variables from the model introduces
bias into the estimates.
– However, if the deleted variables have small
effects, the MSE of the biased estimates will be
less than the variance of the unbiased estimates.
9.1.3 Criteria for Evaluating Subset Regression
Models
• Coefficient of Multiple Determination:
– Aitkin (1974): $R^2$-adequate subset: the regressor
variables in the subset produce $R^2 > R_0^2$
• Uses of Regression and Model Evaluation Criteria
– Data description: minimize $SS_{Res}$ while using as
few regressors as possible
– Prediction and estimation: minimize the mean
square error of prediction; use the PRESS statistic
– Parameter estimation: Chapter 10
– Control: minimize the standard errors of the
regression coefficients.
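The PRESS statistic named in the list above is the sum of squared leave-one-out prediction errors. It can be obtained from a single fit as the sum of (e_i / (1 - h_ii))^2, where e_i are the ordinary residuals and h_ii the diagonal elements of the hat matrix. The sketch below is a minimal illustration assuming NumPy and synthetic data; the function names `press_statistic` and `press_by_refitting` are hypothetical and not from the slides.

```python
import numpy as np

def press_statistic(design, y):
    """PRESS = sum_i (e_i / (1 - h_ii))^2, computed from one least-squares fit."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    # Diagonal of the hat matrix H = X (X'X)^{-1} X' via a thin QR factorization
    q, _ = np.linalg.qr(design)
    h = (q ** 2).sum(axis=1)
    return float(((resid / (1.0 - h)) ** 2).sum())

def press_by_refitting(design, y):
    """Same quantity by brute force: delete each observation, refit, predict it."""
    total = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        beta, *_ = np.linalg.lstsq(design[keep], y[keep], rcond=None)
        total += float(y[i] - design[i] @ beta) ** 2
    return total

# Synthetic data, assumed purely for illustration
rng = np.random.default_rng(0)
x = rng.normal(size=(25, 2))
design = np.column_stack([np.ones(25), x])   # intercept plus two regressors
y = 1.0 + 2.0 * x[:, 0] + rng.normal(scale=0.5, size=25)

press = press_statistic(design, y)
press_check = press_by_refitting(design, y)
print(press, press_check)
```

The two values should agree up to rounding, which is why PRESS can be computed for many candidate subsets without refitting n times per subset.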
9.2 Computational Techniques
for Variable Selection
9.2.1 All Possible Regressions
• Fit all possible regression equations, and then
select the best one by some suitable criterion.
• Assume the model includes the intercept term
• If there are K candidate regressors, there are $2^K$
total equations to be estimated and examined.
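The enumeration described above can be sketched as follows. This is a minimal illustration assuming NumPy, synthetic data, and $R^2$ as the criterion; the helper names `r_squared` and `all_possible_regressions` are hypothetical. Every subset model keeps the intercept, so K candidate regressors give $2^K$ fits.

```python
from itertools import combinations
import numpy as np

def r_squared(X, y, subset):
    """R^2 of the least-squares fit of y on an intercept plus the columns in subset."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def all_possible_regressions(X, y):
    """Fit every subset of the K candidate regressors (intercept always included)."""
    K = X.shape[1]
    results = []
    for p in range(K + 1):                      # subset size 0, 1, ..., K
        for subset in combinations(range(K), p):
            results.append((subset, r_squared(X, y, subset)))
    return results

# Synthetic data, assumed purely for illustration (K = 4 candidates)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=30)

results = all_possible_regressions(X, y)
for subset, r2 in results:
    print(subset, round(r2, 3))
```

Because $R^2$ never decreases when a regressor is added, the full model always has the largest $R^2$; the listing is meant to show where adding further regressors stops giving a useful gain.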
Example 9.1 The Hald Cement Data
• $R_p^2$ criterion:
9.2.2 Stepwise Regression Methods
• Three broad categories:
1. Forward selection
2. Backward elimination
3. Stepwise regression
Backward elimination
– Start with a model that contains all K candidate
regressors.
– Compute the partial F-statistic for each
regressor; if the smallest partial F-statistic is
less than $F_{OUT}$, drop that regressor.
– Refit and repeat; stop when every partial
F-statistic exceeds $F_{OUT}$.
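The backward elimination steps above can be sketched as follows. This is a minimal illustration assuming NumPy and synthetic data; the cutoff value 4.0 and the helper names `ss_res`, `partial_f`, and `backward_elimination` are assumptions for the example, not values or names from the slides. The partial F-statistic for a regressor is the increase in the residual sum of squares when it is removed, divided by the residual mean square of the current model.

```python
import numpy as np

def ss_res(X, y, subset):
    """Residual sum of squares for an intercept plus the columns in subset."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

def partial_f(X, y, subset, j):
    """Partial F for regressor j given the other regressors in subset (j in subset)."""
    n = len(y)
    full = ss_res(X, y, subset)
    reduced = ss_res(X, y, [k for k in subset if k != j])
    ms_res = full / (n - len(subset) - 1)       # residual mean square, current model
    return (reduced - full) / ms_res

def backward_elimination(X, y, f_out=4.0):
    """Drop the regressor with the smallest partial F while it is below f_out."""
    subset = list(range(X.shape[1]))            # start with all K candidates
    while subset:
        f_stats = {j: partial_f(X, y, subset, j) for j in subset}
        worst = min(f_stats, key=f_stats.get)
        if f_stats[worst] >= f_out:
            break                               # every remaining regressor passes
        subset.remove(worst)
    return subset

# Synthetic data, assumed purely for illustration: only columns 0 and 2 matter
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=40)

selected = backward_elimination(X, y)
print(selected)
```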
Stepwise Regression
• A modification of forward selection.
• A regressor added at an earlier step may become
redundant once other regressors have entered; such
a variable is dropped from the model.
• Two cutoff values: $F_{OUT}$ and $F_{IN}$
• Usually choose $F_{IN} > F_{OUT}$: more difficult to add a
regressor than to delete one.
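One way to sketch the stepwise procedure is shown below. It is a minimal illustration assuming NumPy and synthetic data: each pass first tries to add the candidate with the largest partial F (if it exceeds the entry cutoff), then re-checks every regressor in the model and drops the one with the smallest partial F if it falls below the removal cutoff. The cutoff values 4.0 and 3.9, the `max_steps` guard, and all function names are assumptions for the example, not taken from the slides.

```python
import numpy as np

def ss_res(X, y, subset):
    """Residual sum of squares for an intercept plus the columns in subset."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

def partial_f(X, y, subset, j):
    """Partial F for regressor j given the other regressors in subset (j in subset)."""
    n = len(y)
    full = ss_res(X, y, subset)
    reduced = ss_res(X, y, [k for k in subset if k != j])
    return (reduced - full) / (full / (n - len(subset) - 1))

def stepwise(X, y, f_in=4.0, f_out=3.9, max_steps=100):
    """Forward selection with a removal check after every pass."""
    K = X.shape[1]
    subset = []
    for _ in range(max_steps):                  # guard against endless cycling
        changed = False
        candidates = [j for j in range(K) if j not in subset]
        if candidates:
            # Partial F each candidate would have if it entered the current model
            f_add = {j: partial_f(X, y, subset + [j], j) for j in candidates}
            best = max(f_add, key=f_add.get)
            if f_add[best] > f_in:
                subset.append(best)
                changed = True
        if subset:
            # Re-assess every regressor already in the model
            f_drop = {j: partial_f(X, y, subset, j) for j in subset}
            worst = min(f_drop, key=f_drop.get)
            if f_drop[worst] < f_out:
                subset.remove(worst)
                changed = True
        if not changed:
            break
    return sorted(subset)

# Synthetic data, assumed purely for illustration: only columns 0 and 2 matter
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=40)

chosen = stepwise(X, y)
print(chosen)
```

Setting the entry cutoff at or above the removal cutoff means a regressor that has just entered cannot be removed in the same pass, since its partial F in the enlarged model equals its entry F.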