2/13/2014

Ridge Regression, LASSO and Elastic Net
A talk given at the NYC Open Data meetup; find more at www.nycopendata.com
Yunting Sun
Google Inc

Overview
· Linear Regression
· Ordinary Least Square
· Ridge Regression
· LASSO
· Elastic Net
· Examples
· Exercises
Note: make sure that you have installed the elasticnet package

library(MASS)
library(elasticnet)


Linear Regression
n observations, each with one response variable and p predictors:

$Y = (y_1, \ldots, y_n)$ is $n \times 1$, $X = (X_1, \ldots, X_p)$ is $n \times p$, and $x = (x_1, \ldots, x_p)$

· We want to find a linear combination of predictors $y = \beta^T x$
- to describe the actual relationship between $y$ and $(x_1, \ldots, x_p)$
- to use $(x_1, \ldots, x_p)$ to predict $y$
· Examples
- find the relationship between pressure and the boiling point of water
- use GDP to predict the interest rate (the accuracy of the prediction is important but the
actual relationship may not matter)


Quality of an estimator
Suppose $y = \beta^T x + \epsilon$ with $\epsilon \sim N(0, \sigma^2)$.

· Prediction error at $x_0$: the expected squared difference between the actual response and the
model prediction,

$EPE(x_0) = E[(y - \hat{\beta}^T x_0)^2 \mid x = x_0]$

· This decomposes as

$EPE(x_0) = \sigma^2 + [\mathrm{Bias}(\hat{\beta}^T x_0)]^2 + \mathrm{Var}(\hat{\beta}^T x_0)$

where $\mathrm{Bias}(\hat{\beta}^T x_0) = E(\hat{\beta}^T x_0) - \beta^T x_0$.

· The second and third terms make up the mean squared error of $\hat{\beta}^T x_0$ in estimating $\beta^T x_0$.
· How to estimate the prediction error?


K-fold Cross Validation
· Split dataset into K groups
- leave one group out as test set
- use the rest K-1 groups as training set to train the model
- estimate prediction error of the model from the test set
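The split-train-score loop above can be sketched directly. Here is a minimal Python/NumPy version (Python is used for these illustrative sketches; the deck's own examples below are in R, and `kfold_cv_mse`, `fit`, and `predict` are hypothetical names, not part of any package):

```python
import numpy as np

def kfold_cv_mse(X, y, fit, predict, K=5, seed=0):
    """Estimate prediction error by K-fold cross validation:
    each fold is held out once as the test set, the model is trained
    on the remaining K-1 folds, and the fold errors are averaged."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])
        resid = y[test] - predict(model, X[test])
        errors.append(np.mean(resid ** 2))
    return float(np.mean(errors))

# usage: cross-validate ordinary least squares on toy data
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
ols_fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
ols_predict = lambda b, X: X @ b
cv_err = kfold_cv_mse(X, y, ols_fit, ols_predict, K=5)
```

With noise standard deviation 0.1, the cross-validated MSE lands near the irreducible error of 0.01.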


K-fold Cross Validation

Let $E_i$ be the prediction error for the $i$th test group; the average prediction error is

$E = \frac{1}{K} \sum_{i=1}^{K} E_i$


Quality of an estimator
· Mean squared error of an estimator $\hat{\beta}$:

$MSE(\hat{\beta}) = E[(\hat{\beta} - \beta)^2] = \mathrm{Bias}^2(\hat{\beta}) + \mathrm{Var}(\hat{\beta})$

· A biased estimator may achieve smaller MSE than an unbiased estimator
· MSE is the useful criterion when our goal is to understand the relationship instead of prediction
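A quick Monte Carlo check of this claim, as a Python/NumPy sketch (the setup, shrinking a sample mean halfway toward zero, is an illustration chosen here and not an example from the talk):

```python
import numpy as np

# Estimate mu from n Gaussian observations, many times over.
rng = np.random.default_rng(0)
mu, sigma, n, n_sim = 0.1, 1.0, 20, 20000
samples = rng.normal(mu, sigma, size=(n_sim, n))

xbar = samples.mean(axis=1)   # unbiased estimator of mu
shrunk = 0.5 * xbar           # biased: shrunk halfway toward zero

# MSE = Bias^2 + Var in both cases:
mse_unbiased = np.mean((xbar - mu) ** 2)   # ~ sigma^2/n = 0.05
mse_shrunk = np.mean((shrunk - mu) ** 2)   # ~ (mu/2)^2 + sigma^2/(4n)
```

When the true value is small relative to the noise, the shrunken (biased) estimator has the smaller MSE; this is exactly the regime ridge regression and LASSO exploit.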


Least Squares Estimator (LSE)
· Model:

$Y_{n \times 1} = X_{n \times p} \beta_{p \times 1} + \epsilon_{n \times 1}$, i.e. $y_i = \sum_{j=1}^{p} x_{ij} \beta_j + \epsilon_i$, $i = 1, \ldots, n$, with $\epsilon_i$ i.i.d. $N(0, \sigma^2)$

· Minimize the Residual Sum of Squares (RSS):

$\hat{\beta} = \arg\min_{\beta} (Y - X\beta)^T (Y - X\beta) = (X^T X)^{-1} X^T Y$

· The solution is uniquely well defined when $n > p$ and $X^T X$ is invertible
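The closed form can be verified numerically. This Python/NumPy sketch (an added illustration, not code from the talk) solves the normal equations $(X^T X)\hat{\beta} = X^T Y$ directly rather than forming the inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.0, 0.5])
y = X @ beta + rng.normal(size=n)

# beta_hat = (X^T X)^{-1} X^T Y, computed by solving the normal
# equations instead of explicitly inverting X^T X
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With n = 200 well-conditioned Gaussian predictors, `beta_hat` lands close to the true coefficients (2, -1, 0.5).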


Pros
· unbiased: $E(\hat{\beta}) = \beta$
· LSE has the minimum MSE among unbiased linear estimators, though a biased estimator may
have smaller MSE than LSE
· explicit form
· computation is $O(np^2)$
· confidence intervals and significance tests for the coefficients


Cons
$\mathrm{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$

· Multicollinearity leads to high variance of the estimator
- exact or approximate linear relationships among predictors
- $(X^T X)^{-1}$ tends to have large entries
· Requires $n > p$, i.e., the number of observations must be larger than the number of predictors
· Estimated prediction error:

$E_{x_0} EPE(x_0) = \sigma^2 (p/n) + \sigma^2$

so the prediction error increases linearly as a function of $p$
· Hard to interpret when the number of predictors is large; we need a smaller subset that exhibits
the strongest effects
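The claim that $(X^T X)^{-1}$ acquires large entries under near-collinearity is easy to check numerically. A Python/NumPy sketch (an added illustration mirroring the $x_3 \approx x_1 + x_2$ construction used in the R examples later in the deck):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
z = rng.normal(size=n)
x1 = z + rng.normal(size=n)
x2 = z + rng.normal(size=n)
x3 = x1 + x2 + 0.05 * rng.normal(size=n)  # nearly a linear combination

X_indep = rng.normal(size=(n, 3))         # well-conditioned design
X_coll = np.column_stack([x1, x2, x3])    # nearly collinear design

# Var(beta_hat) is proportional to (X^T X)^{-1}, so large entries
# here mean high-variance coefficient estimates.
max_indep = np.abs(np.linalg.inv(X_indep.T @ X_indep)).max()
max_coll = np.abs(np.linalg.inv(X_coll.T @ X_coll)).max()
```

The near-collinear design inflates the largest entry of $(X^T X)^{-1}$ by orders of magnitude relative to the independent design.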


Example: Leukemia classification
· Leukemia Data, Golub et al., Science 1999
· There are 38 training samples and 34 test samples with a total of $p = 7129$ genes (p >> n)
· $X_{ij}$ is the gene expression value for sample $i$ and gene $j$
· Sample $i$ either has tumor type AML or ALL
· We want to select genes relevant to tumor type
- eliminate the trivial genes
- grouped selection, as many genes are highly correlated
· LSE does not work here!


Solution: regularization
· Instead of minimizing RSS, minimize

$RSS + \lambda \times (\text{penalty on the parameters})$

· Trade bias for smaller variance: the estimator is biased when $\lambda \neq 0$
· Continuous variable selection (unlike AIC, BIC, subset selection)
· $\lambda$ can be chosen by cross validation


Ridge Regression
$\hat{\beta}_{ridge} = \arg\min_{\beta} \{ \|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \}$

$\hat{\beta}_{ridge} = (X^T X + \lambda I)^{-1} X^T Y$

Pros:
· works when p >> n
· handles multicollinearity
· biased but smaller variance and smaller MSE (Mean Squared Error)
· explicit solution
Cons:
· shrinks coefficients toward zero but cannot produce a parsimonious model
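The explicit solution is one line of linear algebra. A Python/NumPy sketch (illustrative, not from the talk; `ridge` is a hypothetical helper, and with $\lambda = 0$ it reduces to OLS):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.ones(5) + rng.normal(size=100)

b_ols = ridge(X, y, 0.0)     # lam = 0 recovers ordinary least squares
b_ridge = ridge(X, y, 50.0)  # lam > 0 shrinks the coefficients
```

Increasing $\lambda$ shrinks the whole coefficient vector toward zero without ever setting individual entries exactly to zero, which is the "no parsimonious model" con above.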


Grouped Selection
· If two predictors are highly correlated, their estimated coefficients will be similar.
· If some variables are exactly identical, they will have the same coefficients.
Ridge is good for grouped selection but not good for eliminating trivial genes.


Example: Ridge Regression (Collinearity)
· multicollinearity: $x_3 = x_1 + x_2$
· show that ridge regression beats OLS in the multicollinear case

library(MASS)
n = 500
z = rnorm(n, 0, 1)
y = z + 0.2 * rnorm(n, 0, 1)
x1 = z + rnorm(n, 0, 1)
x2 = z + rnorm(n, 0, 1)
x3 = x1 + x2
d = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)


OLS
# OLS fails to calculate a coefficient for x3
ols.model = lm(y ~ . - 1, d)
coef(ols.model)

##     x1     x2     x3
## 0.3053 0.3187     NA


Ridge Regression
# choose tuning parameter
ridge.model = lm.ridge(y ~ . - 1, d, lambda = seq(0, 10, 0.1))
lambda.opt = ridge.model$lambda[which.min(ridge.model$GCV)]
# ridge regression (shrink coefficients)
coef(lm.ridge(y ~ . - 1, d, lambda = lambda.opt))

##     x1     x2     x3
## 0.1771 0.1902 0.1258


Approximately multicollinear
· show that ridge regression corrects the coefficient signs and reduces the mean squared error

x3 = x1 + x2 + 0.05 * rnorm(n, 0, 1)
d = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)
d.train = d[1:400, ]
d.test = d[401:500, ]


OLS
ols.train = lm(y ~ . - 1, d.train)
coef(ols.train)

##      x1      x2     x3
## -0.3764 -0.3522 0.6839

# prediction errors
sum((d.test$y - predict(ols.train, newdata = d.test))^2)

## [1] 37.53


Ridge Regression
# choose tuning parameter for ridge regression
ridge.train = lm.ridge(y ~ . - 1, d.train, lambda = seq(0, 10, 0.1))
lambda.opt = ridge.train$lambda[which.min(ridge.train$GCV)]
ridge.model = lm.ridge(y ~ . - 1, d.train, lambda = lambda.opt)
coef(ridge.model)  # correct signs

##     x1     x2     x3
## 0.1713 0.1936 0.1340

coefs = coef(ridge.model)
sum((d.test$y - as.matrix(d.test[, -1]) %*% matrix(coefs, 3, 1))^2)

## [1] 36.87


LASSO
$\hat{\beta}_{lasso} = \arg\min_{\beta} \{ \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1 \}$

Or equivalently

$\min_{\beta} \|Y - X\beta\|_2^2 \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t$

Pros
· allows p >> n
· enforces sparsity in the parameters
· as $\lambda$ goes to 0 ($t$ goes to $\infty$), $\hat{\beta}_{lasso}$ approaches the OLS solution; as
$\lambda$ goes to $\infty$ ($t$ goes to 0), $\hat{\beta}_{lasso} = 0$
· quadratic programming problem; the lars solution requires $O(np^2)$ computation
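For the special case of an orthonormal design ($X^T X = I$) the lasso solution is just soft-thresholding of the OLS coefficients, which makes the sparsity mechanism concrete. That special-case formula is a standard fact, but the Python/NumPy example below is an illustration added here, not code from the talk:

```python
import numpy as np

def soft_threshold(b, thresh):
    """Shrink entries toward zero; entries below the threshold become exactly 0."""
    return np.sign(b) * np.maximum(np.abs(b) - thresh, 0.0)

# With X^T X = I, minimizing ||y - X b||^2 + lam * ||b||_1 coordinate-wise
# gives b_j = sign(b_ols_j) * max(|b_ols_j| - lam/2, 0).
rng = np.random.default_rng(0)
n, p = 50, 5
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))  # columns are orthonormal
beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0])
y = Q @ beta + 0.1 * rng.normal(size=n)

b_ols = Q.T @ y                           # OLS under an orthonormal design
b_lasso = soft_threshold(b_ols, 1.0 / 2)  # lasso with lam = 1
```

The threshold sets the near-zero coefficients exactly to zero while only shrinking the large ones, which is why the lasso, unlike ridge, yields a parsimonious model.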


Cons
· if a group of predictors is highly correlated, LASSO tends to pick only one of them and shrink
the others to zero
· cannot do grouped selection; it tends to select a single variable from each group
LASSO is good for eliminating trivial genes but not good for grouped selection.


LARS algorithm of Efron et al (2004)
· stepwise variable selection (Least angle regression and shrinkage)
· less greedy version of traditional forward selection methods
· solve the entire lasso solution path efficiently
· same order of computational effort, $O(np^2)$, as a single OLS fit


LARS Path
$\min_{\beta} \|Y - X\beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \le s \cdot \|\hat{\beta}_{OLS}\|_1, \quad s \in [0, 1]$


Parsimonious model

library(MASS)
n = 20
# beta is sparse
beta = matrix(c(3, 1.5, 0, 0, 2, 0, 0, 0), 8, 1)
p = length(beta)
rho = 0.3
corr = matrix(0, p, p)
for (i in seq(p)) {
    for (j in seq(p)) {
        corr[i, j] = rho^abs(i - j)
    }
}
X = mvrnorm(n, mu = rep(0, p), Sigma = corr)
y = X %*% beta + 3 * rnorm(n, 0, 1)
d = as.data.frame(cbind(y, X))
colnames(d) = c("y", paste0("x", seq(p)))


OLS
n.sim = 100
mse = rep(0, n.sim)
for (i in seq(n.sim)) {
    X = mvrnorm(n, mu = rep(0, p), Sigma = corr)
    y = X %*% beta + 3 * rnorm(n, 0, 1)
    d = as.data.frame(cbind(y, X))
    colnames(d) = c("y", paste0("x", seq(p)))
    # fit OLS without intercept
    ols.model = lm(y ~ . - 1, d)
    mse[i] = sum((coef(ols.model) - beta)^2)
}
median(mse)

## [1] 6.32


Ridge Regression
n.sim = 100
mse = rep(0, n.sim)
for (i in seq(n.sim)) {
    X = mvrnorm(n, mu = rep(0, p), Sigma = corr)
    y = X %*% beta + 3 * rnorm(n, 0, 1)
    d = as.data.frame(cbind(y, X))
    colnames(d) = c("y", paste0("x", seq(p)))
    ridge.cv = lm.ridge(y ~ . - 1, d, lambda = seq(0, 10, 0.1))
    lambda.opt = ridge.cv$lambda[which.min(ridge.cv$GCV)]
    # fit ridge regression without intercept
    ridge.model = lm.ridge(y ~ . - 1, d, lambda = lambda.opt)
    mse[i] = sum((coef(ridge.model) - beta)^2)
}
median(mse)

## [1] 4.074


LASSO
library(elasticnet)
n.sim = 100
mse = rep(0, n.sim)
for (i in seq(n.sim)) {
    X = mvrnorm(n, mu = rep(0, p), Sigma = corr)
    y = X %*% beta + 3 * rnorm(n, 0, 1)
    obj.cv = cv.enet(X, y, lambda = 0, s = seq(0.1, 1, length = 100), plot.it = FALSE,
        mode = "fraction", trace = FALSE, max.steps = 80)
    s.opt = obj.cv$s[which.min(obj.cv$cv)]
    lasso.model = enet(X, y, lambda = 0, intercept = FALSE)
    coefs = predict(lasso.model, s = s.opt, type = "coefficients", mode = "fraction")
    mse[i] = sum((coefs$coefficients - beta)^2)
}
median(mse)

## [1] 3.393


Elastic Net
$\hat{\beta}_{enet} = \arg\min_{\beta} \{ (Y - X\beta)^T (Y - X\beta) + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1 \}$

Pros
· enforces sparsity
· no limitation on the number of selected variables
· encourages a grouping effect in the presence of highly correlated predictors
Cons
· the naive elastic net suffers from double shrinkage
Correction

$\hat{\beta}_{enet} = (1 + \lambda_2) \hat{\beta}_{naive}$
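One way to see where the naive estimator comes from is the standard augmented-data reduction: stacking $\sqrt{\lambda_2}\, I$ under $X$ (and zeros under $Y$) turns the ridge penalty into extra least-squares rows, so the elastic net becomes a lasso on the augmented data. The Python/NumPy sketch below checks the $\lambda_1 = 0$ corner of this equivalence (an added check, not code from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam2 = 60, 4, 5.0
X = rng.normal(size=(n, p))
y = X @ np.ones(p) + rng.normal(size=n)

# Augmented data: sqrt(lam2)*I stacked under X, zeros stacked under y.
# Plain least squares on (X_aug, y_aug) then carries the ridge penalty,
# and adding an L1 penalty on top gives the (naive) elastic net.
X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])

b_aug = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]            # lam1 = 0 case
b_ridge = np.linalg.solve(X.T @ X + lam2 * np.eye(p), X.T @ y)  # ridge closed form
```

Because the lasso step is applied after this ridge-style augmentation, the coefficients are shrunk twice, which is what the $(1 + \lambda_2)$ rescaling corrects.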


LASSO vs Elastic Net
Construct a data set with grouped effects to show that Elastic Net outperforms LASSO in
grouped selection.

· Two independent "hidden" factors $z_1$ and $z_2$
· response: $y = z_1 + 0.1 \cdot z_2 + N(0, 1)$
· 6 predictors fall into two groups, $X = (x_1, x_2, \ldots, x_6)$: $x_1, x_2, x_3$ are tied to the
dominant factor $z_1$, and $x_4, x_5, x_6$ are tied to the minor factor $z_2$, which we would like
to shrink to zero
· Correlated grouped covariates:

$x_1 = z_1 + \epsilon_1$, $x_2 = -z_1 + \epsilon_2$, $x_3 = z_1 + \epsilon_3$
$x_4 = z_2 + \epsilon_4$, $x_5 = -z_2 + \epsilon_5$, $x_6 = z_2 + \epsilon_6$


Simulated data
N = 100
z1 = runif(N, min = 0, max = 20)
z2 = runif(N, min = 0, max = 20)
y = z1 + 0.1 * z2 + rnorm(N)
X = cbind(z1 %*% matrix(c(1, -1, 1), 1, 3), z2 %*% matrix(c(1, -1, 1), 1, 3))
X = X + matrix(rnorm(N * 6), N, 6)


LASSO path
library(elasticnet)
obj.lasso = enet(X, y, lambda = 0)
plot(obj.lasso, use.color = TRUE)


Elastic Net
library(elasticnet)
obj.enet = enet(X, y, lambda = 0.5)
plot(obj.enet, use.color = TRUE)


How to choose tuning parameter
For each $\lambda$ in a sequence, find the $s$ that minimizes the CV prediction error, and then
choose the $\lambda$ that minimizes the CV prediction error.

library(elasticnet)
obj.cv = cv.enet(X, y, lambda = 0.5, s = seq(0, 1, length = 100), mode = "fraction",
    trace = FALSE, max.steps = 80)


Prostate Cancer Example
· Predictors are eight clinical measures
· Training set with 67 observations
· Test set with 30 observations
· Model fitting and tuning parameter selection by tenfold CV on the training set
· Compare model performance by prediction mean-squared error on the test data


Compare models

· moderate correlation among predictors; the highest pairwise correlation is 0.76
· elastic net beats LASSO, and ridge regression beats OLS


Summary
· Ridge Regression:
- good for multicollinearity, grouped selection
- not good for variable selection
· LASSO
- good for variable selection
- not good for grouped selection of strongly correlated predictors
· Elastic Net
- combines the strengths of Ridge Regression and LASSO
· Regularization
- trades bias for variance reduction
- better prediction accuracy


Reference
Most of the material covered in these slides is adapted from
· Paper: Regularization and variable selection via the elastic net
· Slide: http://www.stanford.edu/~hastie/TALKS/enet_talk.pdf
· The Elements of Statistical Learning


Exercise 1: simulated data
beta = matrix(c(rep(3, 15), rep(0, 25)), 40, 1)
sigma = 15
n = 500
z1 = matrix(rnorm(n, 0, 1), n, 1)
z2 = matrix(rnorm(n, 0, 1), n, 1)
z3 = matrix(rnorm(n, 0, 1), n, 1)
X1 = z1 %*% matrix(rep(1, 5), 1, 5) + 0.01 * matrix(rnorm(n * 5), n, 5)
X2 = z2 %*% matrix(rep(1, 5), 1, 5) + 0.01 * matrix(rnorm(n * 5), n, 5)
X3 = z3 %*% matrix(rep(1, 5), 1, 5) + 0.01 * matrix(rnorm(n * 5), n, 5)
X4 = matrix(rnorm(n * 25, 0, 1), n, 25)
X = cbind(X1, X2, X3, X4)
Y = X %*% beta + sigma * rnorm(n, 0, 1)
Y.train = Y[1:400]
X.train = X[1:400, ]
Y.test = Y[401:500]
X.test = X[401:500, ]


Questions:
· Fit OLS, LASSO, Ridge regression and elastic net to the training data and calculate the
prediction error from the test data
· Simulate the data set 100 times and compare the median mean-squared errors for those
models


Exercise 2: Diabetes
· x a matrix with 10 columns
· y a numeric vector (442 rows)
· x2 a matrix with 64 columns
library(elasticnet)
data(diabetes)
colnames(diabetes)

## [1] "x"  "y"  "x2"


Questions
· Fit LASSO and Elastic Net to the data with optimal tuning parameter chosen by cross
validation.
· Compare solution paths for the two methods
