Data Mining
Model Selection
Bob Stine
Dept of Statistics, Wharton School
University of Pennsylvania
From Last Time
Review from prior class
Calibration
Missing data procedures
Missing at random vs. informative missing
Problems of greedy model selection
Problems with stepwise regression.
So then why be greedy?
Questions
Missing data procedure: Why not impute?
Adding an indicator is fast and suited to problems with many missing values.
Imputation more suited to small, well-specified models.
E.g., suppose every X has missing values. How many imputation
models do you need to build, and which cases should you use?
Topics for Today
Over-fitting
Model promises more than it delivers
Model selection procedures
Subset selection
Regularization (aka, shrinkage)
Averaging
Cross-validation
Model Validation
Narrow interpretation
A predictive model is valid if its predictions have the properties advertised by the model
Calibrated, right on average (mean)
Correct uncertainty, at least the variance
Must know the process that selected the model
Cannot validate a model from a static,
published perspective
Stepwise model for S&P 500 looks okay, but...
Model Validation
Fails miserably (as it should) when used to
predict future returns
Predictors are simply random noise
Greedy selection overfits, finding coincidental
patterns
[Plot: RMSE of the stepwise model on training vs. test data]
Over-Fitting
Critical problem in data mining
Caused by an excess of potential explanatory
variables (predictors)
Claimed error rate steadily falls with the size of the model
Over-confident (over-fitting): the model claims to predict new cases better than it will.
Challenge
Select predictors that produce a model that minimizes the prediction error without over-fitting.
Multiplicity
Why is overfitting common?
Classical model comparison
Test statistic, like the usual t-statistic
Special case of likelihood ratio test
Designed for testing one a priori hypothesis
Reject if |t| > 2, p-value < 0.05
Problem of multiple testing (multiplicity)
What is the chance that the largest of p z-statistics is greater than 2?

p                    1      5      25     100
P(max |z| > 1.96)    0.05   0.23   0.72   0.99
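A quick check in R (illustrative, not from the slides): assuming the p tests are independent, this probability is 1 - 0.95^p.

  # chance that the largest of p independent |z|-statistics exceeds 1.96
  p <- c(1, 5, 25, 100)
  round(1 - (1 - 0.05)^p, 2)    # 0.05 0.23 0.72 0.99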
Model Selection
Approaches
Find predictive model without overfitting
Three broad methods
Subset selection
Greedy L0 methods like forward stepwise
Penalized likelihood (AIC, BIC, RIC)
Shrinkage
Regularized: L1 (lasso) and L2 (ridge regression)
Bayesian connections, shrink toward prior
Model averaging
Don't pick one; rather, average several (next week)
Subset Solution
Bonferroni procedure
If testing p hypotheses, then test each at level α/p rather than testing each at level α.
Pr(error in p tests) = Pr(E1 or E2 or ... or Ep)
   ≤ Σ Pr(error in ith test)
If each test is run at level α/p, then
   Pr(error in p tests) ≤ p (α/p) = α
Not very popular; it is easy to see why
Loss of power
The cost of data-driven hypothesis testing:

p               5     25    100    100000
Bonferroni z    2.6   3.1   3.5    5.0
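These cutoffs can be checked in R (a sketch assuming two-sided tests at overall level α = 0.05):

  # Bonferroni z: run each of p tests at level 0.05/p, two-sided
  p <- c(5, 25, 100, 100000)
  round(qnorm(1 - 0.05 / (2 * p)), 1)    # 2.6 3.1 3.5 5.0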
Discussion
Bonferroni is pretty tight
Inequality is almost equality if tests are independent and the threshold α/p is small
Flexible
Don't have to test every H0 at the same level
Allocate more of α to the interesting tests
E.g., split α = 0.05 between the p linear terms and all of the interactions
Process matters
Look at the stock market model from the prior class
Many predictors in the model pass Bonferroni!
The selection process produces a biased estimate of error
Use Bonferroni from the start, not at the end
Popular Alternative Rules
Model selection criteria
AIC (Akaike information criterion, Cp)
BIC (Bayesian information criterion, SIC)
RIC (risk inflation criterion)
Designed to solve different problems
Equivalent to varying p-to-enter threshold
AIC, Cp: Accept a variable if z² > 2
   Equivalent to putting p-to-enter ≈ 0.16
BIC: Accept if z² > log n
   Aims to identify the true model
RIC: Accept if z² > 2 log p
   Approximately the Bonferroni rule
The more you consider, the stiffer the penalty
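A quick R comparison of the implied |z| cutoffs (the values of n and p below are illustrative, not from the slides):

  sqrt(2)                      # AIC/Cp cutoff, about 1.41
  2 * (1 - pnorm(sqrt(2)))     # implied p-to-enter, about 0.16
  n <- 1000; sqrt(log(n))      # BIC cutoff, about 2.63 for n = 1000
  p <- 100;  sqrt(2 * log(p))  # RIC cutoff, about 3.03 for p = 100 (the Bonferroni-like rule)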
Penalized Likelihood
Alternative characterization of criteria
Maximum likelihood in LS regression
Find model that minimizes -2 log likelihood
Problem: always adds more variables (maximizes R²)
Penalized methods
Add predictors so long as
   -2 log likelihood + λ (model size)
decreases
Criteria vary in the choice of λ
   λ = 2 for AIC, log n for BIC, 2 log p for RIC
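In R, step() carries out this penalized search; a minimal sketch on simulated data (the argument k is the penalty λ: 2 gives AIC, log n gives BIC):

  set.seed(1)
  n <- 200
  x <- data.frame(matrix(rnorm(n * 10), n, 10))   # ten candidate predictors
  y <- 2 * x$X1 - x$X2 + rnorm(n)                 # only two matter
  null <- lm(y ~ 1, data = x)
  full <- lm(y ~ ., data = x)
  aic_fit <- step(null, scope = formula(full), direction = "forward", k = 2, trace = 0)
  bic_fit <- step(null, scope = formula(full), direction = "forward", k = log(n), trace = 0)
  length(coef(aic_fit)); length(coef(bic_fit))    # BIC's stiffer penalty keeps fewer terms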
Example
Osteo example (JMP output)
Results
Add variables so long as BIC decreases
Fits extra, then reverts back to the best
AIC vs BIC
AIC: less penalty, larger model
What happens if you try either with the stock market model?
Shrinkage Solution
Saturated model
Rather than pick a subset, consider models that
contain all possible features
p = # possible Xs
A good start (and maybe all you need) if p << n
Shrinkage allows fitting all if p > n
Shrinkage maximizes a penalized likelihood
RSS is analogous to -2 log likelihood
Penalize by the size of the coefficients (= regularization)
The fit has to improve by enough (a decrease in RSS) to compensate for the size of the coefficients
Ridge regression: min RSS + λ2 b'b
Lasso regression: min RSS + λ1 Σ |bj|
λ = regularization parameter, a tuning parameter that must be chosen
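A minimal sketch in R with the glmnet package (simulated data; alpha = 0 gives the ridge penalty, alpha = 1 gives the lasso):

  library(glmnet)
  set.seed(1)
  x <- matrix(rnorm(100 * 20), 100, 20)
  y <- drop(x[, 1:3] %*% c(3, 2, 1)) + rnorm(100)
  ridge <- glmnet(x, y, alpha = 0)   # L2 penalty: shrinks all 20 coefficients toward zero
  lasso <- glmnet(x, y, alpha = 1)   # L1 penalty: sets many coefficients exactly to zero
  plot(lasso, xvar = "lambda")       # coefficient paths as the penalty varies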
Lasso vs Ridge Regression
L1 (lasso): min RSS subject to Σ |bj| ≤ c
L2 (ridge): min RSS subject to Σ bj² ≤ c
Corners of the L1 constraint region produce selection
Interpret λ as a Lagrange multiplier.
Cross-Validation Solution
Common sense alternative to criteria
Apply the model to new data
Estimate the hidden curve in the over-fitting plot (the actual error rate)
No free lunches
Trade-off
More data for testing means less for fitting:
Good estimate of the fit of a poorly estimated model.
Poor estimate of the fit of a well estimated model.
Highly variable
Results depend on which group was excluded for testing
Multi-fold cross-validation has become common
Optimistic
Only place I know of where the test data are a random sample from the same population
Multi-fold: leave out different subsets (e.g., folds 1-5) in turn; a sketch follows below
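A minimal sketch of 5-fold CV in R (simulated data, a plain linear model for clarity):

  set.seed(1)
  n <- 150
  dat  <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))
  fold <- sample(rep(1:5, length.out = n))             # assign each case to a fold
  cv_mse <- sapply(1:5, function(k) {
    fit  <- lm(y ~ x1 + x2, data = dat[fold != k, ])   # fit on the other folds
    pred <- predict(fit, newdata = dat[fold == k, ])   # predict the held-out fold
    mean((dat$y[fold == k] - pred)^2)
  })
  cv_mse; mean(cv_mse)   # note the spread across folds as well as the average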
Variability of CV
Example
Compare simple and complex osteo models
Need to fit both to the same CV samples; not so easy in JMP
Evaluate one model
Method of validation
Exclude some of the cases
Fit the model to others
Predict the held-back cases
Repeat, allowing missing data to affect results
Compare out-of-sample errors to model claims
Is assessment correct?
Under what conditions?
Osteo Example
CV 50 times, split sample
Variability
[Plot: SD of prediction errors in the test half vs. SD of residuals in the training half across the 50 splits; in some splits the test cases look worse than claimed, in others better]
If you only did one CV sample, you might think the model would be 20% better or 15% worse than claimed!
CV in Data Mining
DM methods often require a three-way CV
Training sample to fit model
Tuning sample to pick special constants
Test sample to see how well the final model does (a sketch of this split appears below)
Methods without tuning sample have advantage
Use all of the data to pick the model, without having
to reserve a portion for the choice of constants
Example: method that has honest p-values, akin to
regression model with Bonferroni
Caution
Software is not always clear about how the CV is done
Be sure the CV includes the choice of the form of the model
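A minimal sketch of the three-way split described above (the proportions and labels are illustrative):

  set.seed(1)
  n <- 300
  role <- sample(rep(c("train", "tune", "test"), times = c(150, 75, 75)))
  # fit candidate models on the cases with role == "train",
  # pick tuning constants by their error on role == "tune",
  # report performance once, using only role == "test"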
Lasso
Regularized regression model
Find the regression that minimizes
   Residual SS + λ Σ |βj|
where λ is a tuning constant
Bayesian: double-exponential prior on β
Scaling issues
What happens if the predictors are not on a common scale? (see the sketch after this list)
Shrinkage
Shrink the estimated parameters toward zero
The penalty determines the amount of shrinkage
Larger penalty (λ), fewer variable effects in the model
Equivalent to constrained optimization
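A quick R sketch of the scaling issue (simulated data): because the penalty is on |β|, rescaling a predictor changes how hard it is shrunk unless the software standardizes first (glmnet does this by default, standardize = TRUE).

  library(glmnet)
  set.seed(1)
  x <- matrix(rnorm(100 * 5), 100, 5)
  y <- x[, 1] + rnorm(100)
  x2 <- x; x2[, 1] <- x2[, 1] / 100                       # same predictor, different units
  coef(glmnet(x,  y, lambda = 0.1, standardize = FALSE))  # the first predictor enters
  coef(glmnet(x2, y, lambda = 0.1, standardize = FALSE))  # the rescaled predictor is penalized away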
Lasso Example
How to set the tuning parameter λ?
Empirical: vary λ to see how the fit changes
Cross-validation, typically 10-fold CV
Large values of λ lead to very sparse models
Shrinks everything all the way back to zero
Small values of λ produce dense models
CV compares prediction errors for the choices of λ
Implementations
Generalized regression in JMP Pro
glmnet package in R (See James et al, Ch 6)
More naked software than JMP or Stata
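A minimal sketch of the glmnet route (simulated data; see James et al., Ch. 6, and the Glmnet Vignette for the full scripts):

  library(glmnet)
  set.seed(1)
  x <- matrix(rnorm(200 * 50), 200, 50)
  y <- drop(x[, 1:5] %*% rep(1, 5)) + rnorm(200)
  cvfit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)   # 10-fold CV over a grid of lambda values
  plot(cvfit)                                        # CV error as the penalty varies
  coef(cvfit, s = "lambda.min")   # lambda with the smallest CV error (denser model)
  coef(cvfit, s = "lambda.1se")   # sparser model within one SE of the minimum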
Lasso Example
Fit L1 regression (lasso)
Plot estimated coefficients as the penalty relaxes
Implemented in JMP as generalized regression
[Coefficient path plot for the osteo model: where to stop adding features?]
Lasso Example in R
Follow the script from James et al.
See on-line document Glmnet Vignette
Similar output
Less formatting, but more accessible details
Repeated 10-fold CV
Discussion of CV
Use in model selection vs model validation
Shrinkage methods use CV to pick model
Validation reserves data to test final model
Comments on use in validation
Cannot do selection and validation at same time
Flexible: models do not have to be nested
Optimistic
Splits in CV are samples from one population
Real test in practice often collected later than training data
Population drift
Populations often change over time; CV considers a snapshot
Alternatives?
Bootstrap methods
Take-Aways
Overfitting
Increased model complexity often claims to produce a better fit, but the fit to new data in fact gets worse
Model selection methods
Criteria such as AIC or p-value thresholds
Shrinkage methods such as lasso
Cross validation
Multiple roles: validation vs model selection
Flexible and intuitive, but highly variable
Some questions to ponder...
If you fit a regression model with 10
coefficients, what's the chance that one is
statistically significant by chance alone?
How can you avoid this problem?
If you have a coefficient in your model that
has a t-statistic near 2, what is going to happen to its
significance if you apply split-sample CV?
Why is cross-validation used to pick lasso
models?
Is further CV needed to validate a lasso fit?
Next Time
Thursday
Newberry Lab
Hands-on time with JMP, R, and data
Fit models to the ANES data
You can come to class, but I won't be here!
Friday
July 4th holiday