SPECTRAL REGULARIZATION
Francesca Odone and Lorenzo Rosasco
June 6, 2011
Spectral Regularization
ABOUT THIS CLASS
GOAL To discuss how a class of regularization methods
originally designed for solving ill-posed inverse
problems, give rise to regularized learning algorithms.
These algorithms are kernel methods that can be easily
implemented and have a common derivation, but
different computational and theoretical properties.
Spectral Regularization
BASIC NOTATION
Training set S = (x
1
, y
1
), ..., (x
n
, y
n
).
X is the n by d input matrix.
Y = (y
1
, . . . , y
n
) is the output vector.
k denotes the kernel function , K the n by n kernel matrix with
entries K
ij
= k(x
i
, x
j
) and 1 the RKHS with kernel k.
RLS estimator solves
min
f H
1
n
n
i =1
(y
i
f (x
i
))
2
+|f |
2
H
.
Spectral Regularization
REGULARIZED LEAST SQUARES (RLS)
RLS implements a Tikhovon minimization problem with a square loss
min
f H
1
n
n
i =1
(f (x
i
) y
i
)
2
+[[f [[
2
k
REPRESENTER THEOREM
We have seen that RKHS allow us to write the RLS estimator in the
form
f
S
(x) =
n
i =1
c
i
k(x, x
i
)
Spectral Regularization
REGULARIZED LEAST SQUARES (RLS)
Using basic properties of RKHS we can rewrite the RLS functional as
min
cR
n
[[Y Kc[[
2
2
+c
t
Kc
Setting the derivative with respect to c to zero we see that
c must satisfy
(K +nI)c = Y
Spectral Regularization
EMPIRICAL RISK MINIMIZATION
Similarly we can prove that the solution of empirical risk minimization
min
f H
1
n
n
i =1
(y
i
f (x
i
))
2
can be written as
f
S
(x) =
n
i =1
c
i
k(x, x
i
)
where the coefcients satisfy
Kc = Y.
Spectral Regularization
THE ROLE OF REGULARIZATION
We observed that adding a penalization term can be interpreted as
way to to control smoothness and avoid overtting
min
f H
1
n
n
i =1
(y
i
f (x
i
))
2
min
f H
1
n
n
i =1
(y
i
f (x
i
))
2
+|f |
2
H
.
Spectral Regularization
THE ROLE OF REGULARIZATION
Now we can observe that adding a penalty has an effect from a
numerical point of view:
Kc = Y (K + nI)c = Y
it stabilizes a possibly ill-conditioned matrix inversion problem.
This is the point of view of regularization for (ill-posed) inverse
problems.
Spectral Regularization
ILL-POSED INVERSE PROBLEMS
Hadamard introduced the denition of ill-posedness. Ill-posed
problems are typically inverse problems.
If g G and f F, with G, F Hilbert spaces, a linear, continuous
operator L, consider the equation
g = Lf .
The direct problem is is to compute g given f ; the inverse problem is
to compute f given the data g.
The inverse problem of nding f is well-posed when
the solution exists,
is unique and
is stable, that is depends continuously on the initial data g.
Otherwise the problem is ill-posed.
Spectral Regularization
ILL-POSED INVERSE PROBLEMS
Hadamard introduced the denition of ill-posedness. Ill-posed
problems are typically inverse problems.
If g G and f F, with G, F Hilbert spaces, a linear, continuous
operator L, consider the equation
g = Lf .
The direct problem is is to compute g given f ; the inverse problem is
to compute f given the data g.
The inverse problem of nding f is well-posed when
the solution exists,
is unique and
is stable, that is depends continuously on the initial data g.
Otherwise the problem is ill-posed.
Spectral Regularization
ILL-POSED INVERSE PROBLEMS
Hadamard introduced the denition of ill-posedness. Ill-posed
problems are typically inverse problems.
If g G and f F, with G, F Hilbert spaces, a linear, continuous
operator L, consider the equation
g = Lf .
The direct problem is is to compute g given f ; the inverse problem is
to compute f given the data g.
The inverse problem of nding f is well-posed when
the solution exists,
is unique and
is stable, that is depends continuously on the initial data g.
Otherwise the problem is ill-posed.
Spectral Regularization
LINEAR SYSTEM FOR ERM
In the nite dimensional case the main problem is numerical stability.
For example, in the learning setting the kernel matrix can be
decomposed as K = QQ
T
, with = diag(
1
, . . . ,
n
),
1
2
...
n
0 and q
1
, . . . , q
n
are the corresponding
eigenvectors.
Then
c = K
1
Y = Q
1
Q
T
Y =
n
i =1
1
i
q
i
, Yq
i
.
In correspondence of small eigenvalues, small perturbations of the
data can cause large changes in the solution. The problem is
ill-conditioned.
Spectral Regularization
LINEAR SYSTEM FOR ERM
In the nite dimensional case the main problem is numerical stability.
For example, in the learning setting the kernel matrix can be
decomposed as K = QQ
T
, with = diag(
1
, . . . ,
n
),
1
2
...
n
0 and q
1
, . . . , q
n
are the corresponding
eigenvectors.
Then
c = K
1
Y = Q
1
Q
T
Y =
n
i =1
1
i
q
i
, Yq
i
.
In correspondence of small eigenvalues, small perturbations of the
data can cause large changes in the solution. The problem is
ill-conditioned.
Spectral Regularization
REGULARIZATION AS A FILTER
For Tikhonov regularization
c = (K + nI)
1
Y
= Q( + nI)
1
Q
T
Y
=
n
i =1
1
i
+ n
q
i
, Yq
i
.
Regularization lters out the undesired components.
For n, then
1
i
+n
1
i
.
For n, then
1
i
+n
1
n
.
Spectral Regularization
MATRIX FUNCTION
Note that we can look at a scalar function G
() as a function on the
kernel matrix.
Using the eigen-decomposition of K we can dene
G
(K) = QG
()Q
T
,
meaning
G
(K)Y =
n
i =1
G
(
i
)q
i
, Yq
i
.
For Tikhonov
G
() =
1
+ n
.
Spectral Regularization
REGULARIZATION IN INVERSE PROBLEMS
In the inverse problems literature many algorithms are known
besides Tikhonov regularization.
Each algorithm is dened by a suitable lter function G
.
This class of algorithms is known collectively as spectral
regularization.
Algorithms are not necessarily based on penalized empirical risk
minimization.
Spectral Regularization
ALGORITHMS
Gradient Descent or Landweber Iteration or L2 Boosting
-method, accelerated Landweber.
Iterated Tikhonov
Truncated Singular Value Decomposition (TSVD) Principal
Component Regression (PCR)
The spectral ltering perspective leads to a unied framework.
Spectral Regularization
PROPERTIES OF SPECTRAL FILTERS
Not every scalar function denes a regularization scheme.
Roughly speaking a good lter function must have the following
properties:
as goes to 0, G
() 1/ so that
G
(K) K
1
.
controls the magnitude of the (smaller) eigenvalues of G
(K).
Spectral Regularization
SPECTRAL REGULARIZATION FOR LEARNING
We can dene a class of Kernel Methods as follows.
SPECTRAL REGULARIZATION
We look for estimators
f
S
(X) =
n
i =1
c
i
k(x, x
i
)
where
c = G
(K)Y.
Spectral Regularization
GRADIENT DESCENT
Consider the (Landweber) iteration:
GRADIENT DESCENT
set c
0
= 0
for i= 1, . . . , t 1
c
i
= c
i 1
+(Y Kc
i 1
)
If the largest eigenvalue of K is smaller than n the above iteration
converges if we choose the step-size = 2/n.
The above iteration can be seen as the minimization of the empirical
risk
1
n
|Y Kc|
2
2
via gradient descent.
Spectral Regularization
GRADIENT DESCENT AS SPECTRAL FILTERING
Note that c
0
= 0, c
1
= Y,
c
2
= Y +(I K)Y
c
3
= Y +(I K)Y +(Y K(Y +(I K)Y))
= Y +(I K)Y +(I 2K +
2
K
2
)Y
One can prove by induction that the solution at the t th iteration is
given by
c =
t 1
i =0
(I K)
i
Y.
The lter function is
G
() =
t 1
i =0
(I )
i
.
Spectral Regularization
LANDWEBER ITERATION
Note that
i 0
x
i
= 1/(1 x), also holds replacing x with the a
matrix. If we consider the kernel matrix (or rather I K) we get
K
1
=
i =0
(I K)
i
t 1
i =0
(I K)
i
.
The lter function of Landweber iteration corresponds to a truncated
power expansion of K
1
.
Spectral Regularization
EARLY STOPPING
The regularization parameter is the number of iteration. Roughly
speaking t 1/.
Large values of t correspond to minimization of the empirical risk
and tend to overt.
Small values of t tends to oversmooth, recall we start from c = 0.
Early stopping of the iteration has a regularization effect.
Spectral Regularization
GRADIENT DESCENT AT WORK
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
2
1.5
1
0.5
0
0.5
1
1.5
2
X
Y
Spectral Regularization
GRADIENT DESCENT AT WORK
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
2
1.5
1
0.5
0
0.5
1
1.5
2
X
Y
Spectral Regularization
GRADIENT DESCENT AT WORK
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
2
1.5
1
0.5
0
0.5
1
1.5
2
X
Y
Spectral Regularization
CONNECTION TO L
2
BOOSTING
Landweber iteration (or gradient descent) has been rediscovered in
statistics with name of L
2
Boosting.
BOOSTING
Then name Boosting denotes a large class of methods building
estimators as linear (convex) combinations of weak learners.
Many boosting algorithms can be seen as gradient descent
minimization of the empirical risk on the linear span of some
basis function.
For Landweber iteration the weak learners are k(x
i
, ), i = 1, . . . , n.
Spectral Regularization
-METHOD
One can consider an accelerated gradient descent where the
The method is implemented by the following iteration.
GRADIENT DESCENT
set c
0
= 0
1
= (4 + 2)/(4 + 1)
c
1
= c
0
+
1
n
(Y Kc
0
)
for i= 2, . . . , t 1
c
i
= c
i 1
+ u
i
(c
i 1
c
i 2
) +
i
n
(Y Kc
i 1
)
u
i
=
(i 1)(2i 3)(2i +21)
(i +21)(2i +41)(2i +23)
i
= 4
(2i +21)(i +1)
(i +21)(2i +41)
We need
t iterations to get the same solution that gradient descent
would get after t iterations.
Spectral Regularization
TRUNCATED SINGULAR VALUE DECOMPOSITION
This method is one of the oldest regularization techniques and is also
called spectral cut-off.
TSVD
Given the eigen-decomposition K = QQ
t
, a regularized inverse
of the kernel matrix is built discarding all the eigenvalues before
the prescribed threshold n.
It is described by the lter function G
() = 1/ if /n and 0
otherwise.
Spectral Regularization
DIMENSIONALITY REDUCTION AND GENERALIZATION
Interestingly enough, one can show that TSVD is equivalent to the
following procedure:
(unsupervised) projection of the data using (kernel) PCA.
Empirical risk minimization on projected data without any
regularization.
The only free parameter is the number of components we retain for
the projection.
Spectral Regularization
DIMENSIONALITY REDUCTION AND GENERALIZATION
(CONT.)
Projection Regularizes!
Doing KPCA and then RLS is redundant.
If data are centered Spectral regularization (also Tikhonov) can see
as ltered projection on the principal components.
Spectral Regularization
COMMENTS ON COMPLEXITY AND PARAMETER CHOICE
Iterative methods perform matrix vector multiplication O(n
2
) at
each iteration and the regularization parameter is the number of
iteration itself.
There is not a closed form for leave one out error.
Parameter tuning is different from method to method.
Compared to RLS in iterative and projected methods the
regularization parameter is naturally discrete.
TSVD has a natural range for the search of the regularization
parameter.
For TSVD the regularization parameter can be interpreted in terms
of dimensionality reduction.
Spectral Regularization
FILTERING, REGULARIZARTION AND LEARNING
The idea of using regularization from inverse problems in statistics
(see Wahba) and machine learning (see Poggio and Girosi) is now
well known.
Ideas coming from inverse problems regarded mostly the use of
Tikhonov regularization.
The notion of lter function was studied in machine learning and gave
a connection between function approximation in signal processing
and approximation theory. The work of Poggio and Girosi enlighted
the relation between neural network, radial basis function and
regularization.
Filtering was typically used to dene a penalty for Tikhonov
regularization, in the following it is used to dene algorithms different
though similar to Tikhonov regularization.
Spectral Regularization
FINAL REMARKS
Many different principles lead to regularization: penalized
minimization, iterative optimization, projection. The common
intuition is that they enforce stability of the solution.
All the methods are implicitly based on the use of square loss.
For other loss function different notion of stability can be used.
Spectral Regularization
APPENDICES
Appendix 1: Other examples of Filters: accelerated Landweber
and Iterated Tikhonov.
Appendix 2: TSVD and PCA.
Appendix 3: Some thoughts about Generalization of Spectral
Methods.
Spectral Regularization
APPENDIX 1 :-METHOD
The so called -method or accelerated Landweber iteration can be
thought as an accelerated version of gradient descent.
The lter function is G
t
() = p
t
() with p
t
a polynomial of degree
t 1.
The regularization parameter (think of 1/) is
t (rather than t ): fewer
iterations are needed to attain a solution.
Spectral Regularization
-METHOD (CONT.)
The method is implemented by the following iteration.
GRADIENT DESCENT
set c
0
= 0
1
= (4 + 2)/(4 + 1)
c
1
= c
0
+
1
n
(Y Kc
0
)
for i= 2, . . . , t 1
c
i
= c
i 1
+ u
i
(c
i 1
c
i 2
) +
i
n
(Y Kc
i 1
)
u
i
=
(i 1)(2i 3)(2i +21)
(i +21)(2i +41)(2i +23)
i
= 4
(2i +21)(i +1)
(i +21)(2i +41)
Spectral Regularization
ITERATED TIKHONOV
The following method can be seen a combination of Tikhonov
regularization and gradient descent.
GRADIENT DESCENT
set c
0
= 0
for i= 0, . . . , t 1
(K + nI)c
i
= Y + nc
i 1
The lter function is:
G
() =
( +)
t
t
( +)
t
.
Spectral Regularization
ITERATED TIKHONOV (CONT.)
Both the number of iteration and can be seen as regularization
parameters.
It can be used to enforce more smoothness on the solution.
Tikhonov regularization suffers from a saturation effect: it cannot
exploit the regularity of the solution beyond a certain critical value.
Spectral Regularization
APPENDIX 2: TSVD AND CONNECTION TO PCA
Principal component Analysis is a well known dimensionality
reduction technique often used as preprocessing in learning.
PCA
Assuming centered data, X
T
X is the covariance matrix and its
eigenvectors (V
j
)
d
j =1
are the principal components.
PCA amounts to map each example x
i
in
x
i
= (x
T
i
V
1
, . . . , x
T
i
V
m
)
where m < minn, d.
notation: x
T
i
is the transpose of the rst row (example) of X.
Spectral Regularization
PCA (CONT.)
The above algorithm can be written using only the linear kernel matrix
XX
T
and its eigenvectors (U
i
)
n
i =1
.
The eigenvalues of XX
T
and X
T
X are the same and
V
j
=
1
i
X
T
U
j
.
Then
x
i
= (
1
i
n
j =1
U
1
j
x
T
i
x
j
), . . . ,
1
n
n
j =1
U
m
j
x
T
i
x
j
).
Note that x
T
i
x
j
= k(x
i
, x
j
).
Spectral Regularization
KERNEL PCA
We can perform a non linear principal component analysis, namely
KPCA, by choosing non linear kernel functions.
Using K = QQ
T
we can rewrite the projection in vector notation.
If we let
M
= diag(
1
, ,
m
, 0, , 0) then the projected data
matrix
X is
X = KQ
1/2
m
Spectral Regularization
PRINCIPAL COMPONENT REGRESSION
ERM on the projected data
min
R
m
2
n
,
is equivalent to perform truncated singular values decomposition on
the original problem.
Representer Theorem tells us that
x
i
=
n
j =1
x
T
j
x
i
c
j
with
c = (
X
X
T
)
1
Y.
Spectral Regularization
DIMENSIONALITY REDUCTION AND GENERALIZATION
Using
X = KQ
1/2
m
we get
X
X
T
= QQ
T
Q
1/2
m
1/2
m
Q
T
QQ
T
= Q
m
Q
T
.
so that
c = Q
1
m
Q
T
Y = G
(K)Y,
where G
is the lter function of TSVD.
The two procedure are equivalent. The regularization parameter is
the eigenvalue threshold in one case and the number of components
kept in the other case.
Spectral Regularization
APPENDIX 3: WHY SHOULD THESE METHODS LEARN?
we have seen that
G
(K) K
1
if 0
anyway usually, we DONT want to solve
Kc = Y
since it would simply correspond to an over-tting solution
STABILITY VS GENERALIZATION
how can we show that stability ensures generalization?
Spectral Regularization
POPULATION CASE
It is useful to consider what happens if we know the true distribution.
INTEGRAL OPERATOR
for n large enough
1
n
K L
k
f (s) =
X
k(x, s)f (x)p(x)dx
THE IDEAL PROBLEM
for n large enough we have
Kc = Y L
k
f = L
k
f
where f
is the regression (target) function dened by
f
(x) =
Y
yp(y[x)dy
Spectral Regularization
REGULARIZATION IN THE POPULATION CASE
it can be shown that which is the least squares problem associated to
L
k
f = L
k
f
.
tikhonov regularization in this case is simply
or equivalently
f
= (L
k
f +I)
1
L
k
f
Spectral Regularization
FOURIER DECOMPOSITION OF THE REGRESSION
FUNCTION
FOURIER DECOMPOSITION OF f
AND f
if we diagonalize L
k
to get the eigensystem (t
i
,
i
)
i
we can write
f
i
f
,
i
i
perturbations affect high order components.
tikhonov regularization can be written as
f
i
t
i
t
i
+
f
,
i
i
SAMPLING IS A PERTURBATION
stabilizing the problem with respect to random discretization
(sampling) we can recover f
Spectral Regularization