Data Mining
Model Selection
Bob Stine
Dept of Statistics, Wharton School
University of Pennsylvania
From Last Time
Review from prior class
Calibration
Missing data procedures
Missing at random vs. informative missing
Problems of greedy model selection
Problems with stepwise regression.
So then why be greedy?
Questions
Missing data procedure: Why not impute?
Adding an indicator is fast and suited to problems with many missing values.
Imputation more suited to small, well-specified models.
E.g., suppose every X has missing values. How many imputation
models do you need to build, and which cases should you use?
Topics for Today
Over-fitting
Model promises more than it delivers
Model selection procedures
Subset selection
Regularization (aka, shrinkage)
Averaging
Cross-validation
Model Validation
Narrow interpretation
A predictive model is valid if its predictions have the properties advertised by the model
Calibrated, right on average (mean)
Correct uncertainty, at least the variance
Must know the process that selected the model
Cannot validate a model from a static,
published perspective
Stepwise model for S&P 500 looks okay, but...
Model Validation
Fails miserably (as it should) when used to
predict future returns
Predictors are simply random noise
Greedy selection overfits, finding coincidental
patterns
[Plot: RMSE of the stepwise model on training vs. test data]
Over-Fitting
Critical problem in data mining
Caused by an excess of potential explanatory
variables (predictors)
Claimed error rate steadily falls with the size of the model
Over-confident (over-fitting): the model claims to predict new cases better than it will.
Challenge
Select predictors that produce a model that minimizes the prediction error without over-fitting.
Multiplicity
Why is overfitting common?
Classical model comparison
Test statistic, like the usual t-statistic
Special case of likelihood ratio test
Designed for testing one a priori hypothesis
Reject if |t| > 2, p-value < 0.05
Problem of multiple testing (multiplicity)
What is the chance that the largest of p z-statistics is greater than 2?

p                    1      5      25     100
P(max |z| > 1.96)    0.05   0.23   0.72   0.99
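A quick check in R (illustrative, not from the slides): assuming the p tests are independent, this probability is 1 - 0.95^p.

  # chance that the largest of p independent |z|-statistics exceeds 1.96
  p <- c(1, 5, 25, 100)
  round(1 - (1 - 0.05)^p, 2)    # 0.05 0.23 0.72 0.99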
Model Selection
Approaches
Find predictive model without overfitting
Three broad methods
Subset selection
Greedy L0 methods like forward stepwise
Penalized likelihood (AIC, BIC, RIC)
Shrinkage
Regularized: L1 (lasso) and L2 (ridge regression)
Bayesian connections, shrink toward prior
Model averaging
Don't pick one; rather, average several (next week)
Subset Solution
Bonferroni procedure
If testing p hypotheses, then test each at level α/p rather than testing each at level α.
Pr(error in p tests) = Pr(E1 or E2 or ... or Ep)
   ≤ Σ Pr(error in ith test)
If each test is run at level α/p, then
   Pr(error in p tests) ≤ p (α/p) = α
Not very popular; it is easy to see why
Loss of power
The cost of data-driven hypothesis testing:

p               5     25    100    100000
Bonferroni z    2.6   3.1   3.5    5.0
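These cutoffs can be checked in R (a sketch assuming two-sided tests at overall level α = 0.05):

  # Bonferroni z: run each of p tests at level 0.05/p, two-sided
  p <- c(5, 25, 100, 100000)
  round(qnorm(1 - 0.05 / (2 * p)), 1)    # 2.6 3.1 3.5 5.0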
Discussion
Bonferroni is pretty tight
Inequality is almost equality if tests are independent and the threshold α/p is small
Flexible
Don't have to test every H0 at the same level
Allocate more of α to the interesting tests
E.g., split α = 0.05 between the p linear terms and all of the interactions
Process matters
Look at the stock market model from the prior class
Many predictors in the model pass Bonferroni!
The selection process produces a biased estimate of error
Use Bonferroni from the start, not at the end
Popular Alternative Rules
Model selection criteria
AIC (Akaike information criterion, Cp)
BIC (Bayesian information criterion, SIC)
RIC (risk inflation criterion)
Designed to solve different problems
Equivalent to varying p-to-enter threshold
AIC, Cp: Accept a variable if z² > 2
   Equivalent to putting p-to-enter ≈ 0.16
BIC: Accept if z² > log n
   Aims to identify the true model
RIC: Accept if z² > 2 log p
   Approximately the Bonferroni rule
The more you consider, the stiffer the penalty
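A quick R comparison of the implied |z| cutoffs (the values of n and p below are illustrative, not from the slides):

  sqrt(2)                      # AIC/Cp cutoff, about 1.41
  2 * (1 - pnorm(sqrt(2)))     # implied p-to-enter, about 0.16
  n <- 1000; sqrt(log(n))      # BIC cutoff, about 2.63 for n = 1000
  p <- 100;  sqrt(2 * log(p))  # RIC cutoff, about 3.03 for p = 100 (the Bonferroni-like rule)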
Penalized Likelihood
Alternative characterization of criteria
Maximum likelihood in LS regression
Find model that minimizes -2 log likelihood
Problem: always adds more variables (maximizes R²)
Penalized methods
Add predictors so long as
   -2 log likelihood + λ (model size)
decreases
Criteria vary in the choice of λ
   λ = 2 for AIC, log n for BIC, 2 log p for RIC
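In R, step() carries out this penalized search; a minimal sketch on simulated data (the argument k is the penalty λ: 2 gives AIC, log n gives BIC):

  set.seed(1)
  n <- 200
  x <- data.frame(matrix(rnorm(n * 10), n, 10))   # ten candidate predictors
  y <- 2 * x$X1 - x$X2 + rnorm(n)                 # only two matter
  null <- lm(y ~ 1, data = x)
  full <- lm(y ~ ., data = x)
  aic_fit <- step(null, scope = formula(full), direction = "forward", k = 2, trace = 0)
  bic_fit <- step(null, scope = formula(full), direction = "forward", k = log(n), trace = 0)
  length(coef(aic_fit)); length(coef(bic_fit))    # BIC's stiffer penalty keeps fewer terms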
Example
Osteo example (JMP output)
Results
Add variables so long as BIC decreases
Fits extra, then reverts back to the best
AIC vs BIC
AIC: less penalty, larger model
What happens if you try either with the stock market model?
Shrinkage Solution
Saturated model
Rather than pick a subset, consider models that
contain all possible features
p = # possible Xs
A good start (and maybe all you need) if p << n
Shrinkage allows fitting all if p > n
Shrinkage maximizes a penalized likelihood
RSS is analogous to -2 log likelihood
Penalize by the size of the coefficients (= regularization)
The fit has to improve by enough (a decrease in RSS) to compensate for the size of the coefficients
Ridge regression: min RSS + λ2 b'b
Lasso regression: min RSS + λ1 Σ |bj|
λ = regularization parameter, a tuning parameter that must be chosen
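A minimal sketch in R with the glmnet package (simulated data; alpha = 0 gives the ridge penalty, alpha = 1 gives the lasso):

  library(glmnet)
  set.seed(1)
  x <- matrix(rnorm(100 * 20), 100, 20)
  y <- drop(x[, 1:3] %*% c(3, 2, 1)) + rnorm(100)
  ridge <- glmnet(x, y, alpha = 0)   # L2 penalty: shrinks all 20 coefficients toward zero
  lasso <- glmnet(x, y, alpha = 1)   # L1 penalty: sets many coefficients exactly to zero
  plot(lasso, xvar = "lambda")       # coefficient paths as the penalty varies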
Lasso vs Ridge Regression
L1 (lasso): min RSS subject to Σ |bj| ≤ c
L2 (ridge): min RSS subject to Σ bj² ≤ c
Corners of the L1 constraint region produce selection
Interpret λ as a Lagrange multiplier.
Cross-Validation Solution
Common sense alternative to criteria
Apply the model to new data
Estimate the hidden curve in the over-fitting plot (the actual error rate)
No free lunches
Trade-off
More data for testing means less for fitting:
Good estimate of the fit of a poorly estimated model.
Poor estimate of the fit of a well estimated model.
Highly variable
Results depend on which group was excluded for testing
Multi-fold cross-validation has become common
Optimistic
Only place I know of where the test data are a random sample from the same population
Multi-fold: leave out different subsets (e.g., folds 1-5) in turn; a sketch follows below
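A minimal sketch of 5-fold CV in R (simulated data, a plain linear model for clarity):

  set.seed(1)
  n <- 150
  dat  <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))
  fold <- sample(rep(1:5, length.out = n))             # assign each case to a fold
  cv_mse <- sapply(1:5, function(k) {
    fit  <- lm(y ~ x1 + x2, data = dat[fold != k, ])   # fit on the other folds
    pred <- predict(fit, newdata = dat[fold == k, ])   # predict the held-out fold
    mean((dat$y[fold == k] - pred)^2)
  })
  cv_mse; mean(cv_mse)   # note the spread across folds as well as the average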
Variability of CV
Example
Compare simple and complex osteo models
Need to fit both to the same CV samples; not so easy in JMP
Evaluate one model
Method of validation
Exclude some of the cases
Fit the model to others
Predict the held-back cases
Repeat, allowing missing data to affect results
Compare out-of-sample errors to model claims
Is assessment correct?
Under what conditions?
Osteo Example
CV 50 times, split sample
Variability
[Plot: SD of prediction errors in the test half vs. SD of residuals in the training half across the 50 splits; in some splits the test cases look worse than claimed, in others better]
If you only did one CV sample, you might think the model would be 20% better or 15% worse than claimed!
CV in Data Mining
DM methods often require a three-way CV
Training sample to fit model
Tuning sample to pick special constants
Test sample to see how well the final model does (a sketch of this split appears below)
Methods without tuning sample have advantage
Use all of the data to pick the model, without having
to reserve a portion for the choice of constants
Example: method that has honest p-values, akin to
regression model with Bonferroni
Caution
Software is not always clear about how the CV is done
Be sure the CV includes the choice of the form of the model
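A minimal sketch of the three-way split described above (the proportions and labels are illustrative):

  set.seed(1)
  n <- 300
  role <- sample(rep(c("train", "tune", "test"), times = c(150, 75, 75)))
  # fit candidate models on the cases with role == "train",
  # pick tuning constants by their error on role == "tune",
  # report performance once, using only role == "test"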
Lasso
Regularized regression model
Find the regression that minimizes
   Residual SS + λ Σ |βj|
where λ is a tuning constant
Bayesian: double-exponential prior on β
Scaling issues
What happens if the predictors are not on a common scale? (see the sketch after this list)
Shrinkage
Shrink the estimated parameters toward zero
The penalty determines the amount of shrinkage
Larger penalty (λ), fewer variable effects in the model
Equivalent to constrained optimization
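A quick R sketch of the scaling issue (simulated data): because the penalty is on |β|, rescaling a predictor changes how hard it is shrunk unless the software standardizes first (glmnet does this by default, standardize = TRUE).

  library(glmnet)
  set.seed(1)
  x <- matrix(rnorm(100 * 5), 100, 5)
  y <- x[, 1] + rnorm(100)
  x2 <- x; x2[, 1] <- x2[, 1] / 100                       # same predictor, different units
  coef(glmnet(x,  y, lambda = 0.1, standardize = FALSE))  # the first predictor enters
  coef(glmnet(x2, y, lambda = 0.1, standardize = FALSE))  # the rescaled predictor is penalized away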
Lasso Example
How to set the tuning parameter λ?
Empirical: vary λ to see how the fit changes
Cross-validation, typically 10-fold CV
Large values of λ lead to very sparse models
Shrinks everything all the way back to zero
Small values of λ produce dense models
CV compares prediction errors for the choices of λ
Implementations
Generalized regression in JMP Pro
glmnet package in R (See James et al, Ch 6)
More naked software than JMP or Stata
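A minimal sketch of the glmnet route (simulated data; see James et al., Ch. 6, and the Glmnet Vignette for the full scripts):

  library(glmnet)
  set.seed(1)
  x <- matrix(rnorm(200 * 50), 200, 50)
  y <- drop(x[, 1:5] %*% rep(1, 5)) + rnorm(200)
  cvfit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)   # 10-fold CV over a grid of lambda values
  plot(cvfit)                                        # CV error as the penalty varies
  coef(cvfit, s = "lambda.min")   # lambda with the smallest CV error (denser model)
  coef(cvfit, s = "lambda.1se")   # sparser model within one SE of the minimum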
Lasso Example
Fit L1 regression (lasso)
Plot estimated coefficients as the penalty relaxes
Implemented in JMP as generalized regression
[Coefficient path plot for the osteo model: where to stop adding features?]
Lasso Example in R
Follow the script from James et al.
See on-line document Glmnet Vignette
Similar output
Less formatting, but more accessible details
Repeated 10-fold CV
Discussion of CV
Use in model selection vs model validation
Shrinkage methods use CV to pick model
Validation reserves data to test final model
Comments on use in validation
Cannot do selection and validation at same time
Flexible: models do not have to be nested
Optimistic
Splits in CV are samples from one population
Real test in practice often collected later than training data
Population drift
Populations often change over time; CV considers a snapshot
Alternatives?
Bootstrap methods
Take-Aways
Overfitting
Increased model complexity often claims to produce a better fit, but the fit to new data in fact gets worse
Model selection methods
Criteria such as AIC or p-value thresholds
Shrinkage methods such as lasso
Cross validation
Multiple roles: validation vs model selection
Flexible and intuitive, but highly variable
Some questions to ponder...
If you fit a regression model with 10
coefficients, what's the chance that one is
statistically significant by chance alone?
How can you avoid this problem?
If you have a coefficient in your model that
has a t-statistic near 2, what is going to happen to its
significance if you apply split-sample CV?
Why is cross-validation used to pick lasso
models?
Is further CV needed to validate a lasso fit?
Next Time
Thursday
Newberry Lab
Hands-on time with JMP, R, and data
Fit models to the ANES data
You can come to class, but I won't be here!
Friday
July 4th holiday