2/13/2014

Ridge Regression, LASSO and Elastic Net
A talk given at the NYC Open Data meetup; find more at www.nycopendata.com
Yunting Sun
Google Inc

Overview
· Linear Regression
· Ordinary Least Square
· Ridge Regression
· LASSO
· Elastic Net
· Examples
· Exercises
Note: make sure that you have installed the elasticnet package

library(MASS)
library(elasticnet)


Linear Regression
n observations, each with one response variable and p predictors:

$Y = (y_1, \ldots, y_n)$ is $n \times 1$, $X = (X_1, \ldots, X_p)$ is $n \times p$, and $x = (x_1, \ldots, x_p)$

· We want to find a linear combination of predictors $y = \beta^T x$
- to describe the actual relationship between $y$ and $(x_1, \ldots, x_p)$
- to use $(x_1, \ldots, x_p)$ to predict $y$
· Examples
- find the relationship between pressure and the boiling point of water
- use GDP to predict the interest rate (the accuracy of the prediction is important but the
actual relationship may not matter)


Quality of an estimator
Suppose $y = \beta^T x + \epsilon$ with $\epsilon \sim N(0, \sigma^2)$.

· Prediction error at $x_0$: the expected squared difference between the actual response and the
model prediction,

$EPE(x_0) = E[(y - \hat{\beta}^T x_0)^2 \mid x = x_0]$

· This decomposes as

$EPE(x_0) = \sigma^2 + [\mathrm{Bias}(\hat{\beta}^T x_0)]^2 + \mathrm{Var}(\hat{\beta}^T x_0)$

where $\mathrm{Bias}(\hat{\beta}^T x_0) = E(\hat{\beta}^T x_0) - \beta^T x_0$.

· The second and third terms make up the mean squared error of $\hat{\beta}^T x_0$ in estimating $\beta^T x_0$.
· How to estimate the prediction error?


K-fold Cross Validation
· Split dataset into K groups
- leave one group out as test set
- use the rest K-1 groups as training set to train the model
- estimate prediction error of the model from the test set
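The split-train-score loop above can be sketched directly. Here is a minimal Python/NumPy version (Python is used for these illustrative sketches; the deck's own examples below are in R, and `kfold_cv_mse`, `fit`, and `predict` are hypothetical names, not part of any package):

```python
import numpy as np

def kfold_cv_mse(X, y, fit, predict, K=5, seed=0):
    """Estimate prediction error by K-fold cross validation:
    each fold is held out once as the test set, the model is trained
    on the remaining K-1 folds, and the fold errors are averaged."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])
        resid = y[test] - predict(model, X[test])
        errors.append(np.mean(resid ** 2))
    return float(np.mean(errors))

# usage: cross-validate ordinary least squares on toy data
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
ols_fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
ols_predict = lambda b, X: X @ b
cv_err = kfold_cv_mse(X, y, ols_fit, ols_predict, K=5)
```

With noise standard deviation 0.1, the cross-validated MSE lands near the irreducible error of 0.01.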


K-fold Cross Validation

Let $E_i$ be the prediction error for the $i$th test group; the average prediction error is

$E = \frac{1}{K} \sum_{i=1}^{K} E_i$


Quality of an estimator
· Mean squared error of an estimator $\hat{\beta}$:

$MSE(\hat{\beta}) = E[(\hat{\beta} - \beta)^2] = \mathrm{Bias}^2(\hat{\beta}) + \mathrm{Var}(\hat{\beta})$

· A biased estimator may achieve smaller MSE than an unbiased estimator
· MSE is the useful criterion when our goal is to understand the relationship instead of prediction
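A quick Monte Carlo check of this claim, as a Python/NumPy sketch (the setup, shrinking a sample mean halfway toward zero, is an illustration chosen here and not an example from the talk):

```python
import numpy as np

# Estimate mu from n Gaussian observations, many times over.
rng = np.random.default_rng(0)
mu, sigma, n, n_sim = 0.1, 1.0, 20, 20000
samples = rng.normal(mu, sigma, size=(n_sim, n))

xbar = samples.mean(axis=1)   # unbiased estimator of mu
shrunk = 0.5 * xbar           # biased: shrunk halfway toward zero

# MSE = Bias^2 + Var in both cases:
mse_unbiased = np.mean((xbar - mu) ** 2)   # ~ sigma^2/n = 0.05
mse_shrunk = np.mean((shrunk - mu) ** 2)   # ~ (mu/2)^2 + sigma^2/(4n)
```

When the true value is small relative to the noise, the shrunken (biased) estimator has the smaller MSE; this is exactly the regime ridge regression and LASSO exploit.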


Least Squares Estimator (LSE)
· Model:

$Y_{n \times 1} = X_{n \times p} \beta_{p \times 1} + \epsilon_{n \times 1}$, i.e. $y_i = \sum_{j=1}^{p} x_{ij} \beta_j + \epsilon_i$, $i = 1, \ldots, n$, with $\epsilon_i$ i.i.d. $N(0, \sigma^2)$

· Minimize the Residual Sum of Squares (RSS):

$\hat{\beta} = \arg\min_{\beta} (Y - X\beta)^T (Y - X\beta) = (X^T X)^{-1} X^T Y$

· The solution is uniquely well defined when $n > p$ and $X^T X$ is invertible
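The closed form can be verified numerically. This Python/NumPy sketch (an added illustration, not code from the talk) solves the normal equations $(X^T X)\hat{\beta} = X^T Y$ directly rather than forming the inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.0, 0.5])
y = X @ beta + rng.normal(size=n)

# beta_hat = (X^T X)^{-1} X^T Y, computed by solving the normal
# equations instead of explicitly inverting X^T X
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With n = 200 well-conditioned Gaussian predictors, `beta_hat` lands close to the true coefficients (2, -1, 0.5).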


Pros
· unbiased: $E(\hat{\beta}) = \beta$
· LSE has the minimum MSE among unbiased linear estimators, though a biased estimator may
have smaller MSE than LSE
· explicit form
· computation is $O(np^2)$
· confidence intervals and significance tests for the coefficients


Cons
$\mathrm{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$

· Multicollinearity leads to high variance of the estimator
- exact or approximate linear relationships among predictors
- $(X^T X)^{-1}$ tends to have large entries
· Requires $n > p$, i.e., the number of observations must be larger than the number of predictors
· Estimated prediction error:

$E_{x_0} EPE(x_0) = \sigma^2 (p/n) + \sigma^2$

so the prediction error increases linearly as a function of $p$
· Hard to interpret when the number of predictors is large; we need a smaller subset that exhibits
the strongest effects
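The claim that $(X^T X)^{-1}$ acquires large entries under near-collinearity is easy to check numerically. A Python/NumPy sketch (an added illustration mirroring the $x_3 \approx x_1 + x_2$ construction used in the R examples later in the deck):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
z = rng.normal(size=n)
x1 = z + rng.normal(size=n)
x2 = z + rng.normal(size=n)
x3 = x1 + x2 + 0.05 * rng.normal(size=n)  # nearly a linear combination

X_indep = rng.normal(size=(n, 3))         # well-conditioned design
X_coll = np.column_stack([x1, x2, x3])    # nearly collinear design

# Var(beta_hat) is proportional to (X^T X)^{-1}, so large entries
# here mean high-variance coefficient estimates.
max_indep = np.abs(np.linalg.inv(X_indep.T @ X_indep)).max()
max_coll = np.abs(np.linalg.inv(X_coll.T @ X_coll)).max()
```

The near-collinear design inflates the largest entry of $(X^T X)^{-1}$ by orders of magnitude relative to the independent design.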


Example: Leukemia classification
· Leukemia Data, Golub et al., Science 1999
· There are 38 training samples and 34 test samples with a total of $p = 7129$ genes (p >> n)
· $X_{ij}$ is the gene expression value for sample $i$ and gene $j$
· Sample $i$ either has tumor type AML or ALL
· We want to select genes relevant to tumor type
- eliminate the trivial genes
- grouped selection, as many genes are highly correlated
· LSE does not work here!


Solution: regularization
· Instead of minimizing RSS, minimize

$RSS + \lambda \times (\text{penalty on the parameters})$

· Trade bias for smaller variance: the estimator is biased when $\lambda \neq 0$
· Continuous variable selection (unlike AIC, BIC, subset selection)
· $\lambda$ can be chosen by cross validation


Ridge Regression
$\hat{\beta}_{ridge} = \arg\min_{\beta} \{ \|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \}$

$\hat{\beta}_{ridge} = (X^T X + \lambda I)^{-1} X^T Y$

Pros:
· works when p >> n
· handles multicollinearity
· biased but smaller variance and smaller MSE (Mean Squared Error)
· explicit solution
Cons:
· shrinks coefficients toward zero but cannot produce a parsimonious model
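The explicit solution is one line of linear algebra. A Python/NumPy sketch (illustrative, not from the talk; `ridge` is a hypothetical helper, and with $\lambda = 0$ it reduces to OLS):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.ones(5) + rng.normal(size=100)

b_ols = ridge(X, y, 0.0)     # lam = 0 recovers ordinary least squares
b_ridge = ridge(X, y, 50.0)  # lam > 0 shrinks the coefficients
```

Increasing $\lambda$ shrinks the whole coefficient vector toward zero without ever setting individual entries exactly to zero, which is the "no parsimonious model" con above.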


Grouped Selection
· If two predictors are highly correlated, their estimated coefficients will be similar.
· If some variables are exactly identical, they will have the same coefficients.
Ridge is good for grouped selection but not good for eliminating trivial genes.


Example: Ridge Regression (Collinearity)
· multicollinearity: $x_3 = x_1 + x_2$
· show that ridge regression beats OLS in the multicollinear case

library(MASS)
n = 500
z = rnorm(n, 0, 1)
y = z + 0.2 * rnorm(n, 0, 1)
x1 = z + rnorm(n, 0, 1)
x2 = z + rnorm(n, 0, 1)
x3 = x1 + x2
d = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)


OLS
# OLS fails to calculate a coefficient for x3
ols.model = lm(y ~ . - 1, d)
coef(ols.model)

##     x1     x2     x3
## 0.3053 0.3187     NA


Ridge Regression
# choose tuning parameter
ridge.model = lm.ridge(y ~ . - 1, d, lambda = seq(0, 10, 0.1))
lambda.opt = ridge.model$lambda[which.min(ridge.model$GCV)]
# ridge regression (shrink coefficients)
coef(lm.ridge(y ~ . - 1, d, lambda = lambda.opt))

##     x1     x2     x3
## 0.1771 0.1902 0.1258


Approximately multicollinear
· show that ridge regression corrects the coefficient signs and reduces the mean squared error

x3 = x1 + x2 + 0.05 * rnorm(n, 0, 1)
d = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)
d.train = d[1:400, ]
d.test = d[401:500, ]


OLS
ols.train = lm(y ~ . - 1, d.train)
coef(ols.train)

##      x1      x2     x3
## -0.3764 -0.3522 0.6839

# prediction errors
sum((d.test$y - predict(ols.train, newdata = d.test))^2)

## [1] 37.53


Ridge Regression
# choose tuning parameter for ridge regression
ridge.train = lm.ridge(y ~ . - 1, d.train, lambda = seq(0, 10, 0.1))
lambda.opt = ridge.train$lambda[which.min(ridge.train$GCV)]
ridge.model = lm.ridge(y ~ . - 1, d.train, lambda = lambda.opt)
coef(ridge.model)  # correct signs

##     x1     x2     x3
## 0.1713 0.1936 0.1340

coefs = coef(ridge.model)
sum((d.test$y - as.matrix(d.test[, -1]) %*% matrix(coefs, 3, 1))^2)

## [1] 36.87


LASSO
$\hat{\beta}_{lasso} = \arg\min_{\beta} \{ \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1 \}$

Or equivalently

$\min_{\beta} \|Y - X\beta\|_2^2 \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t$

Pros
· allows p >> n
· enforces sparsity in the parameters
· as $\lambda$ goes to 0 ($t$ goes to $\infty$), $\hat{\beta}_{lasso}$ approaches the OLS solution; as
$\lambda$ goes to $\infty$ ($t$ goes to 0), $\hat{\beta}_{lasso} = 0$
· quadratic programming problem; the lars solution requires $O(np^2)$ computation
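For the special case of an orthonormal design ($X^T X = I$) the lasso solution is just soft-thresholding of the OLS coefficients, which makes the sparsity mechanism concrete. That special-case formula is a standard fact, but the Python/NumPy example below is an illustration added here, not code from the talk:

```python
import numpy as np

def soft_threshold(b, thresh):
    """Shrink entries toward zero; entries below the threshold become exactly 0."""
    return np.sign(b) * np.maximum(np.abs(b) - thresh, 0.0)

# With X^T X = I, minimizing ||y - X b||^2 + lam * ||b||_1 coordinate-wise
# gives b_j = sign(b_ols_j) * max(|b_ols_j| - lam/2, 0).
rng = np.random.default_rng(0)
n, p = 50, 5
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))  # columns are orthonormal
beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0])
y = Q @ beta + 0.1 * rng.normal(size=n)

b_ols = Q.T @ y                           # OLS under an orthonormal design
b_lasso = soft_threshold(b_ols, 1.0 / 2)  # lasso with lam = 1
```

The threshold sets the near-zero coefficients exactly to zero while only shrinking the large ones, which is why the lasso, unlike ridge, yields a parsimonious model.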


Cons
· if a group of predictors is highly correlated, LASSO tends to pick only one of them and shrink
the others to zero
· cannot do grouped selection; it tends to select a single variable from each group
LASSO is good for eliminating trivial genes but not good for grouped selection.


LARS algorithm of Efron et al (2004)
· stepwise variable selection (Least angle regression and shrinkage)
· less greedy version of traditional forward selection methods
· solve the entire lasso solution path efficiently
· same order of computational effort, $O(np^2)$, as a single OLS fit


LARS Path
$\min_{\beta} \|Y - X\beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \le s \cdot \|\hat{\beta}_{OLS}\|_1, \quad s \in [0, 1]$


Parsimonious model

library(MASS)
n = 20
# beta is sparse
beta = matrix(c(3, 1.5, 0, 0, 2, 0, 0, 0), 8, 1)
p = length(beta)
rho = 0.3
corr = matrix(0, p, p)
for (i in seq(p)) {
    for (j in seq(p)) {
        corr[i, j] = rho^abs(i - j)
    }
}
X = mvrnorm(n, mu = rep(0, p), Sigma = corr)
y = X %*% beta + 3 * rnorm(n, 0, 1)
d = as.data.frame(cbind(y, X))
colnames(d) = c("y", paste0("x", seq(p)))


OLS
n.sim = 100
mse = rep(0, n.sim)
for (i in seq(n.sim)) {
    X = mvrnorm(n, mu = rep(0, p), Sigma = corr)
    y = X %*% beta + 3 * rnorm(n, 0, 1)
    d = as.data.frame(cbind(y, X))
    colnames(d) = c("y", paste0("x", seq(p)))
    # fit OLS without intercept
    ols.model = lm(y ~ . - 1, d)
    mse[i] = sum((coef(ols.model) - beta)^2)
}
median(mse)

## [1] 6.32


Ridge Regression
n.sim = 100
mse = rep(0, n.sim)
for (i in seq(n.sim)) {
    X = mvrnorm(n, mu = rep(0, p), Sigma = corr)
    y = X %*% beta + 3 * rnorm(n, 0, 1)
    d = as.data.frame(cbind(y, X))
    colnames(d) = c("y", paste0("x", seq(p)))
    ridge.cv = lm.ridge(y ~ . - 1, d, lambda = seq(0, 10, 0.1))
    lambda.opt = ridge.cv$lambda[which.min(ridge.cv$GCV)]
    # fit ridge regression without intercept
    ridge.model = lm.ridge(y ~ . - 1, d, lambda = lambda.opt)
    mse[i] = sum((coef(ridge.model) - beta)^2)
}
median(mse)

## [1] 4.074


LASSO
library(elasticnet)
n.sim = 100
mse = rep(0, n.sim)
for (i in seq(n.sim)) {
    X = mvrnorm(n, mu = rep(0, p), Sigma = corr)
    y = X %*% beta + 3 * rnorm(n, 0, 1)
    obj.cv = cv.enet(X, y, lambda = 0, s = seq(0.1, 1, length = 100), plot.it = FALSE,
        mode = "fraction", trace = FALSE, max.steps = 80)
    s.opt = obj.cv$s[which.min(obj.cv$cv)]
    lasso.model = enet(X, y, lambda = 0, intercept = FALSE)
    coefs = predict(lasso.model, s = s.opt, type = "coefficients", mode = "fraction")
    mse[i] = sum((coefs$coefficients - beta)^2)
}
median(mse)

## [1] 3.393


Elastic Net
$\hat{\beta}_{enet} = \arg\min_{\beta} \{ (Y - X\beta)^T (Y - X\beta) + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1 \}$

Pros
· enforces sparsity
· no limitation on the number of selected variables
· encourages a grouping effect in the presence of highly correlated predictors
Cons
· the naive elastic net suffers from double shrinkage
Correction

$\hat{\beta}_{enet} = (1 + \lambda_2) \hat{\beta}_{naive}$
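One way to see where the naive estimator comes from is the standard augmented-data reduction: stacking $\sqrt{\lambda_2}\, I$ under $X$ (and zeros under $Y$) turns the ridge penalty into extra least-squares rows, so the elastic net becomes a lasso on the augmented data. The Python/NumPy sketch below checks the $\lambda_1 = 0$ corner of this equivalence (an added check, not code from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam2 = 60, 4, 5.0
X = rng.normal(size=(n, p))
y = X @ np.ones(p) + rng.normal(size=n)

# Augmented data: sqrt(lam2)*I stacked under X, zeros stacked under y.
# Plain least squares on (X_aug, y_aug) then carries the ridge penalty,
# and adding an L1 penalty on top gives the (naive) elastic net.
X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])

b_aug = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]            # lam1 = 0 case
b_ridge = np.linalg.solve(X.T @ X + lam2 * np.eye(p), X.T @ y)  # ridge closed form
```

Because the lasso step is applied after this ridge-style augmentation, the coefficients are shrunk twice, which is what the $(1 + \lambda_2)$ rescaling corrects.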


LASSO vs Elastic Net
Construct a data set with grouped effects to show that Elastic Net outperforms LASSO in
grouped selection.

· Two independent "hidden" factors $z_1$ and $z_2$
· response: $y = z_1 + 0.1 \cdot z_2 + N(0, 1)$
· 6 predictors fall into two groups, $X = (x_1, x_2, \ldots, x_6)$: $x_1, x_2, x_3$ are tied to the
dominant factor $z_1$, and $x_4, x_5, x_6$ are tied to the minor factor $z_2$, which we would like
to shrink to zero
· Correlated grouped covariates:

$x_1 = z_1 + \epsilon_1$, $x_2 = -z_1 + \epsilon_2$, $x_3 = z_1 + \epsilon_3$
$x_4 = z_2 + \epsilon_4$, $x_5 = -z_2 + \epsilon_5$, $x_6 = z_2 + \epsilon_6$


Simulated data
N = 100
z1 = runif(N, min = 0, max = 20)
z2 = runif(N, min = 0, max = 20)
y = z1 + 0.1 * z2 + rnorm(N)
X = cbind(z1 %*% matrix(c(1, -1, 1), 1, 3), z2 %*% matrix(c(1, -1, 1), 1, 3))
X = X + matrix(rnorm(N * 6), N, 6)


LASSO path
library(elasticnet)
obj.lasso = enet(X, y, lambda = 0)
plot(obj.lasso, use.color = TRUE)


Elastic Net
library(elasticnet)
obj.enet = enet(X, y, lambda = 0.5)
plot(obj.enet, use.color = TRUE)


How to choose tuning parameter
For each $\lambda$ in a sequence, find the $s$ that minimizes the CV prediction error, and then
choose the $\lambda$ that minimizes the CV prediction error.

library(elasticnet)
obj.cv = cv.enet(X, y, lambda = 0.5, s = seq(0, 1, length = 100), mode = "fraction",
    trace = FALSE, max.steps = 80)


Prostate Cancer Example
· Predictors are eight clinical measures
· Training set with 67 observations
· Test set with 30 observations
· Model fitting and tuning parameter selection by tenfold CV on the training set
· Compare model performance by prediction mean-squared error on the test data


Compare models

· moderate correlation among predictors; the highest pairwise correlation is 0.76
· elastic net beats LASSO, and ridge regression beats OLS


Summary
· Ridge Regression:
- good for multicollinearity, grouped selection
- not good for variable selection
· LASSO
- good for variable selection
- not good for grouped selection of strongly correlated predictors
· Elastic Net
- combines the strengths of Ridge Regression and LASSO
· Regularization
- trades bias for variance reduction
- better prediction accuracy


Reference
Most of the material covered in these slides is adapted from
· Paper: Regularization and variable selection via the elastic net
· Slide: http://www.stanford.edu/~hastie/TALKS/enet_talk.pdf
· The Elements of Statistical Learning


Exercise 1: simulated data
beta = matrix(c(rep(3, 15), rep(0, 25)), 40, 1)
sigma = 15
n = 500
z1 = matrix(rnorm(n, 0, 1), n, 1)
z2 = matrix(rnorm(n, 0, 1), n, 1)
z3 = matrix(rnorm(n, 0, 1), n, 1)
X1 = z1 %*% matrix(rep(1, 5), 1, 5) + 0.01 * matrix(rnorm(n * 5), n, 5)
X2 = z2 %*% matrix(rep(1, 5), 1, 5) + 0.01 * matrix(rnorm(n * 5), n, 5)
X3 = z3 %*% matrix(rep(1, 5), 1, 5) + 0.01 * matrix(rnorm(n * 5), n, 5)
X4 = matrix(rnorm(n * 25, 0, 1), n, 25)
X = cbind(X1, X2, X3, X4)
Y = X %*% beta + sigma * rnorm(n, 0, 1)
Y.train = Y[1:400]
X.train = X[1:400, ]
Y.test = Y[401:500]
X.test = X[401:500, ]


Questions:
· Fit OLS, LASSO, Ridge regression and elastic net to the training data and calculate the
prediction error from the test data
· Simulate the data set 100 times and compare the median mean-squared errors for those
models


Exercise 2: Diabetes
· x a matrix with 10 columns
· y a numeric vector (442 rows)
· x2 a matrix with 64 columns
library(elasticnet)
data(diabetes)
colnames(diabetes)

## [1] "x"  "y"  "x2"


Questions
· Fit LASSO and Elastic Net to the data with optimal tuning parameter chosen by cross
validation.
· Compare solution paths for the two methods
