Kernels and the Kernel Trick
Martin Hofmann
Reading Club "Support Vector Machines"
Optimization Problem
maximize:

W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i \cdot x_j \rangle

subject to \alpha_i \geq 0, \; i = 1, \dots, m, \quad and \quad \sum_{i=1}^{m} \alpha_i y_i = 0
data not linearly separable in input space
map into some feature space where the data is linearly separable
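As a concrete reference, here is a minimal numpy sketch of evaluating the dual objective W(\alpha) above; the names dual_objective, alpha, X, y are illustrative, not from the slides.

```python
import numpy as np

def dual_objective(alpha, X, y):
    """Evaluate W(alpha) for data X (m x d) and labels y in {-1, +1}."""
    G = X @ X.T                  # inner products <x_i . x_j>
    return alpha.sum() - 0.5 * np.sum(np.outer(alpha, alpha) * np.outer(y, y) * G)
```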
Mapping Example
map data points into feature space with some function \Phi
e.g.:

\Phi: \mathbb{R}^2 \to \mathbb{R}^3
(x_1, x_2) \mapsto (z_1, z_2, z_3) := (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)

hyperplane \langle w \cdot z \rangle = 0, as a function of x:

w_1 x_1^2 + w_2 \sqrt{2}\, x_1 x_2 + w_3 x_2^2 = 0
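A minimal numpy sketch of this example (phi and w are illustrative names): a hyperplane in feature space evaluates to a quadratic function of the original coordinates.

```python
import numpy as np

def phi(x):
    """Feature map (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

w = np.array([1.0, -0.5, 2.0])   # some weight vector in feature space (illustrative)
x = np.array([3.0, -1.0])

linear_in_feature_space = w @ phi(x)
quadratic_in_input_space = w[0] * x[0]**2 + w[1] * np.sqrt(2) * x[0] * x[1] + w[2] * x[1]**2
assert np.isclose(linear_in_feature_space, quadratic_in_input_space)
```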
Kernel Trick
solve maximisation problem using mapped data points
W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle \Phi(x_i) \cdot \Phi(x_j) \rangle
Dual Representation of Hyperplane (from the primal Lagrangian):

f(x) = \langle w \cdot x \rangle + b = \sum_{i=1}^{m} \alpha_i y_i \langle x_i \cdot x \rangle + b, \quad with \; w = \sum_{i=1}^{m} \alpha_i y_i x_i

weight vector represented only by the data points
only inner products of data points are necessary, no coordinates
kernel function K(x_i, x_j) = \langle \Phi(x_i) \cdot \Phi(x_j) \rangle
the explicit mapping \Phi is not necessary any more
possible to operate in any n-dimensional feature space
complexity independent of the feature-space dimension (a minimal sketch of the resulting decision function follows below)
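A minimal sketch of the kernelized decision function; alpha and b are assumed to come from an already solved dual problem, and all names are illustrative.

```python
import numpy as np

def decision_function(x, X, y, alpha, b, K):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, using only kernel evaluations."""
    return sum(a_i * y_i * K(x_i, x) for a_i, y_i, x_i in zip(alpha, y, X)) + b

# example kernel: the quadratic kernel from the next slide, K(x, z) = <x . z>^2
quadratic = lambda x, z: np.dot(x, z) ** 2
```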
Example Kernel Trick
\vec{x} = (x_1, x_2), \quad \vec{z} = (z_1, z_2), \quad K(x, z) = \langle \vec{x} \cdot \vec{z} \rangle^2

K(x, z) = \langle \vec{x} \cdot \vec{z} \rangle^2
= (x_1 z_1 + x_2 z_2)^2
= x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2
= \langle (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \cdot (z_1^2, \sqrt{2}\, z_1 z_2, z_2^2) \rangle
= \langle \Phi(\vec{x}) \cdot \Phi(\vec{z}) \rangle

mapping function fused in K
implicit \Phi(\vec{x}) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)
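A quick numeric check of this identity (a minimal sketch; phi is the implicit mapping named above):

```python
import numpy as np

phi = lambda v: np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_direct = np.dot(x, z) ** 2        # kernel evaluated in input space
k_mapped = np.dot(phi(x), phi(z))   # inner product of explicitly mapped points
assert np.isclose(k_direct, k_mapped)
```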
Typical Kernels
Polynomial Kernel

K(x, z) = (\langle x \cdot z \rangle + c)^d, \quad for \; d > 0

Radial Basis Function (Gaussian Kernel)

K(x, z) = e^{-\frac{\|x - z\|^2}{2\sigma^2}}, \quad with \; \|x\| := \sqrt{\langle x \cdot x \rangle}

(Sigmoid Kernel)

K(x, z) = \tanh(\kappa \langle x \cdot z \rangle + \theta)

Inverse multi-quadric

K(x, z) = \frac{1}{\sqrt{\|x - z\|^2 / (2\sigma^2) + c^2}}
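Minimal numpy implementations of these kernels; the parameter names (c, d, sigma, kappa, theta) follow the formulas above and the defaults are purely illustrative.

```python
import numpy as np

def polynomial(x, z, c=1.0, d=2):
    return (np.dot(x, z) + c) ** d

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigmoid(x, z, kappa=1.0, theta=0.0):
    return np.tanh(kappa * np.dot(x, z) + theta)

def inverse_multiquadric(x, z, sigma=1.0, c=1.0):
    return 1.0 / np.sqrt(np.sum((x - z) ** 2) / (2 * sigma ** 2) + c ** 2)
```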
Typical Kernels Cont.
Kernels for Sets A, A':

K_s(A, A') = \sum_{i=1}^{N} \sum_{j=1}^{N'} k(x_i, x'_j)

where k(x_i, x'_j) is a kernel on the elements of A and A' (see the sketch below)
Kernels for strings (Spectral Kernels) and trees
no one-size-fits-all kernel
in practice: model search and cross-validation
a low-degree polynomial or an RBF kernel is a good initial try
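A minimal sketch of the set kernel above, summing a base kernel over all element pairs (set_kernel and base are illustrative names):

```python
import numpy as np

def set_kernel(A, B, base):
    """K_s(A, B) = sum over all pairs (a in A, b in B) of base(a, b)."""
    return sum(base(a, b) for a in A for b in B)

# usage: compare two sets of 2-d points with a quadratic base kernel
quadratic = lambda a, b: np.dot(a, b) ** 2
A = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
B = [np.array([1.0, 1.0])]
print(set_kernel(A, B, quadratic))   # 1.0 + 1.0 = 2.0
```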
Kernel Properties
Symmetry

K(x, z) = \langle \Phi(x) \cdot \Phi(z) \rangle = \langle \Phi(z) \cdot \Phi(x) \rangle = K(z, x)

Cauchy-Schwarz Inequality

K(x, z)^2 = \langle \Phi(x) \cdot \Phi(z) \rangle^2 \leq \|\Phi(x)\|^2 \|\Phi(z)\|^2 = \langle \Phi(x) \cdot \Phi(x) \rangle \langle \Phi(z) \cdot \Phi(z) \rangle = K(x, x) K(z, z)
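A quick numeric sanity check of both properties, here for the Gaussian kernel (a minimal sketch):

```python
import numpy as np

rbf = lambda x, z, sigma=1.0: np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([-0.5, 0.3])

assert np.isclose(rbf(x, z), rbf(z, x))            # symmetry
assert rbf(x, z) ** 2 <= rbf(x, x) * rbf(z, z)     # Cauchy-Schwarz
```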
Making Kernels from Kernels
create complex Kernels by combining simpler ones
Closure Properties:
K(x, z) = c \cdot K_1(x, z)
K(x, z) = c + K_1(x, z)
K(x, z) = K_1(x, z) + K_2(x, z)
K(x, z) = K_1(x, z) \cdot K_2(x, z)
K(x, z) = f(x) f(z)

if K_1 and K_2 are kernels, f : X \to \mathbb{R}, and c > 0
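These closure properties translate directly into small kernel combinators; a minimal sketch (function names are illustrative):

```python
def scale(k1, c):
    """c * K1(x, z), for c > 0."""
    return lambda x, z: c * k1(x, z)

def shift(k1, c):
    """c + K1(x, z), for c > 0."""
    return lambda x, z: c + k1(x, z)

def add(k1, k2):
    """K1(x, z) + K2(x, z)."""
    return lambda x, z: k1(x, z) + k2(x, z)

def multiply(k1, k2):
    """K1(x, z) * K2(x, z)."""
    return lambda x, z: k1(x, z) * k2(x, z)

def from_function(f):
    """f(x) * f(z) for a real-valued function f."""
    return lambda x, z: f(x) * f(z)
```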
Gram Matrix
Kernel function as similarity measure between input objects
Gram Matrix (Similarity/Kernel Matrix) represents similarities between
input vectors
let V = \{\vec{v}_1, \dots, \vec{v}_n\} be a set of input vectors; then the Gram Matrix K is defined as:

K = \begin{pmatrix}
\langle \Phi(\vec{v}_1) \cdot \Phi(\vec{v}_1) \rangle & \cdots & \langle \Phi(\vec{v}_1) \cdot \Phi(\vec{v}_n) \rangle \\
\vdots & \ddots & \vdots \\
\langle \Phi(\vec{v}_n) \cdot \Phi(\vec{v}_1) \rangle & \cdots & \langle \Phi(\vec{v}_n) \cdot \Phi(\vec{v}_n) \rangle
\end{pmatrix}
K is symmetric and positive semi-definite (non-negative eigenvalues)
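A minimal sketch that computes a Gram Matrix and checks both properties (gram_matrix is an illustrative name):

```python
import numpy as np

def gram_matrix(V, kernel):
    """K[i, j] = kernel(v_i, v_j) for all pairs of input vectors in V."""
    n = len(V)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(V[i], V[j])
    return K

rbf = lambda x, z, sigma=1.0: np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
V = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 2.0])]

K = gram_matrix(V, rbf)
assert np.allclose(K, K.T)                      # symmetric
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)  # positive semi-definite
```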
Mercer's Theorem
assume:
finite input space X = \{x_1, \dots, x_n\}
symmetric function K(x, z) on X
Gram Matrix K = (K(x_i, x_j))_{i,j=1}^{n}

since K is symmetric, there exists an orthogonal matrix V s.t. K = V \Lambda V'
\Lambda diagonal, containing the eigenvalues \lambda_t of K,
and the eigenvectors v_t = (v_{ti})_{i=1}^{n} as columns of V

all eigenvalues are non-negative; let the feature mapping be

\Phi: x_i \mapsto \left( \sqrt{\lambda_t}\, v_{ti} \right)_{t=1}^{n} \in \mathbb{R}^n, \quad i = 1, \dots, n

then

\langle \Phi(x_i) \cdot \Phi(x_j) \rangle = \sum_{t=1}^{n} \lambda_t v_{ti} v_{tj} = (V \Lambda V')_{ij} = K_{ij} = K(x_i, x_j)
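A minimal numpy sketch of this construction: eigendecompose a Gram Matrix and rebuild it from the explicit feature vectors (the matrix K here is an arbitrary illustrative spsd matrix):

```python
import numpy as np

K = np.array([[2.0, 1.0, 0.0],      # any symmetric, positive semi-definite matrix
              [1.0, 2.0, 1.0],      # serves as a Gram Matrix on a finite input space
              [0.0, 1.0, 2.0]])

lam, V = np.linalg.eigh(K)          # K = V diag(lam) V'
assert np.all(lam >= -1e-10)        # non-negative eigenvalues

Phi = V * np.sqrt(np.clip(lam, 0, None))   # row i is the feature vector Phi(x_i)
assert np.allclose(Phi @ Phi.T, K)         # <Phi(x_i) . Phi(x_j)> = K_ij
```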
Mercer's Theorem Cont.
every Gram Matrix is symmetric and positive semi-definite
every spsd (symmetric, positive semi-definite) matrix can be regarded as a Kernel Matrix, i.e. as an inner product matrix in some space
a diagonal matrix satisfies Mercer's criteria, but is not a good Gram Matrix
self-similarity dominates between-sample similarity
it represents mutually orthogonal samples
generalization exists for infinite input spaces
eigenvectors of the data in feature space can be used to detect directions of maximum variance
kernel principal component analysis (kernel PCA), sketched below
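A minimal kernel-PCA sketch using the usual centering of the Gram Matrix (all names illustrative):

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Project the n samples behind the Gram Matrix K onto the top
    principal directions in feature space."""
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one   # center the kernel matrix
    lam, V = np.linalg.eigh(Kc)                  # eigenvalues in ascending order
    lam, V = lam[::-1], V[:, ::-1]               # sort descending
    lam = np.clip(lam[:n_components], 1e-12, None)
    return V[:, :n_components] * np.sqrt(lam)    # one row of projections per sample
```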
Summary
a kernel calculates the dot product of mapped data points without an explicit mapping function
it is represented by a symmetric, positive semi-definite Gram Matrix
the Gram Matrix fuses information about the data and the kernel
standard kernels are chosen in practice via cross-validation
every similarity matrix satisfying Mercer's criteria can be used as a kernel
ongoing research on estimating the Kernel Matrix from the available data