Lecture notes: kernel and operator-theoretic methods in machine learning
G. Blanchard, Université Paris-Saclay
April 11, 2024

Warning: these lecture notes are an incomplete work in progress. They likely contain
many typos, inconsistencies and more serious errors! If you happen to stumble upon this
document, and notice some errors, feel free to contact me.

Contents
1 Introduction (and conventions used in these notes)
1.1 Motivation: regression in Hilbert space
1.2 Notation and convention index

2 Tools of operator theory and functional calculus
2.1 Basics on Hilbert spaces
2.2 Bounded operators on Hilbert space
2.3 Compact operators on a Hilbert space and spectral theorems
2.4 Functional calculus for compact self-adjoint operators
2.5 Hilbert-Schmidt operators
2.6 Schatten p-classes
2.7 Trace-class operators
2.8 Some operator inequalities

3 Tools from concentration of measure
3.1 Random variables in Banach space
3.2 Hoeffding's inequality in Hilbert space
3.2.1 (*) Extension to Banach space
3.3 Bernstein's inequality in smooth Banach space
3.3.1 Discussion
3.4 Bernstein's inequality in operator norm

4 Spectral regularization methods
4.1 Setting
4.2 Probabilistic inequalities
4.3 Analysis of spectral regularization methods
4.4 Examples

5 Reproducing kernel methods
5.1 Reproducing kernel Hilbert spaces
5.2 Kernel operators in reproducing kernel Hilbert spaces
5.3 Spectral regularization in a rkHs regression setting
5.4 Kernel mean embeddings of distributions

6 Acceleration methods
6.1 Parallelizing: divide and average
6.2 Nyström methods

1 Introduction (and conventions used in these notes)
1.1 Motivation: regression in Hilbert space
TODO

1.2 Notation and convention index


$H$ : separable Hilbert space
$B(H, H')$ : space of bounded linear operators from $H$ to $H'$; $B(H) := B(H, H)$
$\|A\|_{op}$ : operator norm of a bounded operator $A$, $\|A\|_{op} := \sup_{x \in H, \|x\| = 1} \|Ax\|$
$HS(H)$ : space of Hilbert-Schmidt operators on $H$
$K(H, H')$ : space of compact linear operators from $H$ to $H'$; $K(H) := K(H, H)$
$\langle \cdot, \cdot \rangle_2$ : if $A, B \in HS(H)$, $\langle A, B \rangle_2 = \mathrm{Tr}(AB^*)$
$\|A\|_2$ : Hilbert-Schmidt norm of $A$ if $A \in HS(H)$, $\|A\|_2^2 = \langle A, A \rangle_2$
$B_p(H)$ : Schatten p-class of $H$
$\|A\|_p$ : Schatten p-norm of $A$ if $A \in B_p(H)$

2 Tools of operator theory and functional calculus
2.1 Basics on Hilbert spaces
[Source: [7, Chapter 1]]
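Definition 2.1. A real, resp. complex, Hilbert space H is an R-, resp. C-vector space with an
inner product ⟨·, ·⟩ on H × H, which is complete for the metric induced by that inner product's
norm. We recall here the defining properties of an inner product:
• $\langle u, v \rangle \in \mathbb{R}$, resp. $\mathbb{C}$;

• $\langle v, u \rangle = \overline{\langle u, v \rangle}$;

• $\langle \alpha u + \beta u', v \rangle = \alpha \langle u, v \rangle + \beta \langle u', v \rangle$;

• (consequence of the two previous items) $\langle u, \alpha v + \beta v' \rangle = \bar{\alpha} \langle u, v \rangle + \bar{\beta} \langle u, v' \rangle$;

• $\langle u, u \rangle \geq 0$;

• $\langle u, u \rangle = 0 \Rightarrow u = 0$.

The norm induced by the inner product is $\|u\| = \langle u, u \rangle^{1/2}$. The inner product satisfies the
Cauchy-Schwarz inequality:
$$|\langle u, v \rangle| \leq \|u\| \|v\|.$$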
Proposition/Definition 2.2.
• If ⟨u, v⟩ = 0 we denote u ⊥ v.

• For a subset A ⊆ H we denote A⊥ = {u ∈ H : ⟨u, a⟩ = 0 for all a ∈ A}.

• (A⊥ )⊥ = [A], where [A] is the closure of the linear span of A.

• If A is a closed linear subspace of H, then A ⊕ A⊥ = H, i.e. for any u ∈ H there
exists a unique decomposition u = u∥ + u⊥ , where u∥ ∈ A and u⊥ ∈ A⊥ .
The component u∥ is called the orthogonal projection of u onto A.
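As a finite-dimensional illustration of the decomposition u = u∥ + u⊥, the following sketch projects a vector of R^5 onto a two-dimensional subspace. The dimensions, the random data and the variable names are arbitrary choices made for this example, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Columns of M span a 2-dimensional subspace A of R^5 (illustrative data).
M = rng.standard_normal((5, 2))
Q, _ = np.linalg.qr(M)          # orthonormal basis of A (reduced QR)

u = rng.standard_normal(5)
u_par = Q @ (Q.T @ u)           # component of u in A
u_perp = u - u_par              # component of u in the orthogonal complement

# u_perp should be orthogonal to every spanning vector of A.
residual = np.abs(M.T @ u_perp).max()
```

Here `residual` is zero up to rounding, and the squared norms of `u_par` and `u_perp` add up to the squared norm of `u`.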
We will need the following definition:
Proposition/Definition 2.3. If $(u_i, i \in I)$ is a family of elements of R, C, or a vector
space, we will say that it is Hilbert-summable with sum h if:
1. $u_i = 0$ for all i except on a finite or countable subset of indices $\tilde{I} \subseteq I$;
2. The sum $\sum_{i \in \tilde{I}} u_i$ converges to h regardless of the order of summation (unconditional
convergence).

The first point allows us to formally define uncountable Hilbert sums; the second indicates that
the notion of summability is quite strong. For real or complex-valued sums, it is equivalent
to the corresponding sum being absolutely convergent; for sums of Hilbert space elements
however, it does not imply that $\sum_{i \in \tilde{I}} \|u_i\|$ is convergent.

In the rest of the section all infinite sums will be Hilbert-sums.

Proposition/Definition 2.4.

• An orthonormal set (ek )k∈I is a family of unit norm, pairwise orthogonal vectors.

• A (Hilbert) basis is a maximal orthonormal set.

• For an orthonormal set $(e_k)_{k \in I}$, it holds for any h ∈ H:

$$\sum_{k \in I} |\langle h, e_k \rangle|^2 \leq \|h\|^2$$

(Bessel's inequality).

• Any orthonormal set can be completed to a basis containing it.

• All bases of H have the same cardinality; if H is separable, then any basis of H is
countable (or finite).

Proposition 2.5. If E = (ek )k∈I is a Hilbert basis of H, we have the following properties:

• E ⊥ = {0}.

• [E] = H.

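• For any h ∈ H, it holds $\sum_{k \in I} \langle h, e_k \rangle e_k = h$.

• For any h, g ∈ H, it holds $\langle h, g \rangle = \sum_{k \in I} \langle h, e_k \rangle \overline{\langle g, e_k \rangle}$.

• (Consequence of the previous point) For any h ∈ H, it holds $\|h\|^2 = \sum_{k \in I} |\langle h, e_k \rangle|^2$
(Parseval's identity).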

(All sums above are Hilbert sums.)
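The properties of Proposition 2.5 can be checked numerically in C^n, where a Hilbert basis is given by the columns of a unitary matrix. This is only a sanity-check sketch: the dimension and the random vectors are assumptions of the example, and the inner product is taken linear in its first argument as in Definition 2.1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Random orthonormal basis of C^n: the columns of a unitary matrix.
Z = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
E, _ = np.linalg.qr(Z)

h = rng.standard_normal(n) + 1j * rng.standard_normal(n)
g = rng.standard_normal(n) + 1j * rng.standard_normal(n)

# Coefficients <h, e_k> and <g, e_k>.
ch = E.conj().T @ h
cg = E.conj().T @ g

h_rebuilt = E @ ch                           # sum_k <h, e_k> e_k
inner_hg = np.vdot(g, h)                     # <h, g> = sum_i h_i conj(g_i)
inner_from_coeffs = np.sum(ch * cg.conj())   # sum_k <h, e_k> conj(<g, e_k>)
parseval_sum = np.sum(np.abs(ch) ** 2)       # sum_k |<h, e_k>|^2
```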

2.2 Bounded operators on Hilbert space
[Source: [7, Chapter 2]]

Proposition/Definition 2.6. For a linear operator A between Hilbert spaces H and H′ ,


the following are equivalent:

• A is continuous.

• A is bounded on the unit ball: $\sup_{h \in H, \|h\| \leq 1} \|Ah\| < \infty$.

In this case we define the operator norm

$$\|A\|_{op} := \sup_{h \in H, \|h\| \leq 1} \|Ah\|;$$

it is a norm on the space of bounded/continuous operators, denoted B(H, H′ ).


The space of bounded operators is complete for the metric induced by the operator norm.
It holds for any h ∈ H:
∥Ah∥ ≤ ∥A∥op ∥h∥.

It holds for bounded operators A, B s.t. the output space of B is the input space of A:

∥AB∥op ≤ ∥A∥op ∥B∥op .

Proposition/Definition 2.7. For an operator A ∈ B(H, H′ ) there exists a unique A∗ ∈
B(H′ , H), called the adjoint of A, such that

∀u ∈ H, v ∈ H′ : ⟨Au, v⟩ = ⟨u, A∗ v⟩.

If A ∈ B(H), it is called self-adjoint if A∗ = A.

Proposition 2.8. We have the following properties for bounded operators A, B (with
appropriate compatibility for input/output spaces of operators):

• $(\alpha A + B)^* = \bar{\alpha} A^* + B^*$.

• (AB)∗ = B ∗ A∗ .

• (Consequence of the previous point) AA∗ and A∗ A are self-adjoint.

• (A∗ )∗ = A.

• If A is invertible in B(H), then so is A∗ and (A−1 )∗ = (A∗ )−1 .


• $\|A\|_{op} = \|A^*\|_{op} = \|AA^*\|_{op}^{1/2} = \|A^*A\|_{op}^{1/2}$.

Proof. We prove only the last statement: for any h ∈ H it holds

$$\|Ah\|^2 = \langle Ah, Ah \rangle = \langle A^*Ah, h \rangle \leq \|A^*Ah\| \|h\| \leq \|A^*A\|_{op} \|h\|^2;$$

this implies $\|A\|_{op}^2 \leq \|A^*A\|_{op} \leq \|A^*\|_{op} \|A\|_{op}$, and further $\|A\|_{op} \leq \|A^*\|_{op}$. But since $(A^*)^* = A$,
we also obtain $\|A^*\|_{op} \leq \|A\|_{op}$. Thus we have equalities everywhere.
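For matrices, the operator norm is the largest singular value, so the norm identities above can be checked directly. The sketch below uses real random matrices of arbitrary sizes (an assumption of the example), for which the adjoint is the transpose.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))   # operator from R^6 to R^4
B = rng.standard_normal((6, 3))   # operator from R^3 to R^6

def op_norm(M):
    # Operator (spectral) norm: largest singular value of M.
    return np.linalg.norm(M, 2)

norm_A = op_norm(A)
norm_At = op_norm(A.T)            # adjoint of a real matrix is its transpose
norm_AAt = op_norm(A @ A.T)
norm_AtA = op_norm(A.T @ A)
```

One expects `norm_A`, `norm_At`, `sqrt(norm_AAt)` and `sqrt(norm_AtA)` to agree up to rounding, and `op_norm(A @ B)` to be at most `norm_A * op_norm(B)`.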

Proposition/Definition 2.9 (Rank-one and finite rank operators). Given two elements
u ∈ H, v ∈ H′ , we denote v ⊗ u∗ the linear mapping

v ⊗ u∗ : x ∈ H ↦ ⟨x, u⟩v.

It is a rank one operator. It holds (v ⊗ u∗ )∗ = u ⊗ v ∗ and ∥v ⊗ u∗ ∥op = ∥u∥∥v∥.
The (finite) linear combinations of rank one operators form the space of finite rank operators.
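In coordinates, the rank-one operator v ⊗ u∗ is the outer product of v with the conjugate of u. The following sketch checks the three claims of the proposition on random complex vectors; the dimensions and data are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(5) + 1j * rng.standard_normal(5)   # u in H = C^5
v = rng.standard_normal(3) + 1j * rng.standard_normal(3)   # v in H' = C^3
x = rng.standard_normal(5) + 1j * rng.standard_normal(5)

# Matrix of v ⊗ u*: x -> <x, u> v, with <x, u> = sum_i x_i conj(u_i).
R = np.outer(v, u.conj())

image_of_x = R @ x
expected_image = np.vdot(u, x) * v      # np.vdot conjugates its first argument
adjoint = R.conj().T                    # should be the matrix of u ⊗ v*
```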

2.3 Compact operators on a Hilbert space and spectral theorems


Proposition/Definition 2.10. An operator A ∈ B(H, H′ ) is compact if the image of the
unit ball of H by A is relatively compact in H′ .
The set of compact operators from H to H′ is a closed linear subspace of B(H, H′ ) denoted
K(H, H′ ) (in the literature it is sometimes denoted B0 instead of K).

For a proof of the closedness (implying completeness) of K(H, H′ ), see [7, Prop 4.2].

Proposition 2.11. If A ∈ K(H, H′ ) and B ∈ B(H′ , H′′ ) then BA ∈ K(H, H′′ ).
If A ∈ K(H′ , H′′ ) and B ∈ B(H, H′ ) then AB ∈ K(H, H′′ ).
Thus K(H) is an ideal of B(H).

Proof: straightforward from the definition.

Proposition 2.12. If A ∈ K(H, H′ ), then A∗ ∈ K(H′ , H).


Furthermore, A ∈ K(H, H′ ) iff A is the limit of a sequence of finite-rank operators, con-
verging in operator norm.

Proof: see [5, Thm. VI.4] or [7, Thm. II.4.4].


Compact operators behave very much like finite matrices, and we will mostly deal
with such operators (or with operators of the form I + K with K compact, which are called
"compact perturbations of identity").
The following result is the cornerstone of the theory of compact operators on a Hilbert
space.

Theorem 2.13 (Spectral theorem for compact self-adjoint operators). Let A ∈ K(H) be
a compact, self-adjoint operator.
Then there exists a finite or countably infinite orthonormal family $(e_k)_{k \in I}$ of eigenvectors
of A and a family of corresponding real nonzero eigenvalues $(\lambda_k)_{k \in I}$ (here I = ⟦n⟧ for some
n ∈ N or I = N; possibly I = ∅ when A = 0) with $\lambda_k \to 0$ if I = N, such that
$$A = \sum_{k \in I} \lambda_k e_k \otimes e_k^*, \tag{2.1}$$
that is to say:
$$\forall u \in H \quad Au = \sum_{k \in I} \lambda_k \langle u, e_k \rangle e_k, \tag{2.2}$$
and the series in (2.1) converges in operator norm. It can be assumed, if needed, that the
sequence $(\lambda_k)_{k \in I}$ is ordered by decreasing absolute value (which we will just call "ordered"
for short).
If I = N, the set $\sigma(A) := \{\lambda_k, k \in I\} \cup \{0\} \subset \mathbb{R}$ is the spectrum of A and has 0 as only
accumulation value. If I is finite, we define $\sigma(A) := \{\lambda_k, k \in I\} \cup \{0\}$ if 0 is an eigenvalue
of A and $\sigma(A) := \{\lambda_k, k \in I\}$ otherwise (the latter case can only happen if H is
finite-dimensional).
If we group indices k corresponding to the same eigenvalue λ and denote $P_\lambda := \sum_{k : \lambda_k = \lambda} e_k \otimes e_k^*$
(necessarily a finite sum) then $P_\lambda$ is the orthogonal projector on the eigenspace associated
to λ; when rewritten in the form
$$A = \sum_{\lambda \in \sigma(A) \setminus \{0\}} \lambda P_\lambda, \tag{2.3}$$
the decomposition is unique (we will call this the "canonical form" of the decomposition).
Correspondingly every representation of the form (2.1) has the same ordered sequence
$(\lambda_k)_{k \in I}$ (every nonzero eigenvalue of A is present in this sequence with its degree of
multiplicity).
If H is separable, we can complete $(e_k)_{k \in I}$ to a finite or countable Hilbert basis of H. Defining
$\lambda_\ell = 0$ for the added vectors of the completed basis, relations (2.1)-(2.2)-(2.3) still hold
for this completed basis (for (2.3), the sum is then over σ(A) and we include $P_0$, the orthogonal
projector on the null space of A). We will call this the completed eigendecomposition
of A.
This completion operation can also be made if H is nonseparable, but then we have to
complete the orthonormal family $(e_k)_{k \in I}$ to an uncountable Hilbert basis.
For a proof, see e.g. [5, Section 6.4] or [7, Section II.5]. Note that there is a more
general theory for the spectral decomposition of noncompact self-adjoint and even normal
(commuting with their adjoint) operators, for which the sum is replaced by an integral in
a suitable sense (see e.g. [7, Chapter IX]), but we won't consider it here.
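In finite dimension every operator is compact, and the spectral theorem reduces to the diagonalization of a symmetric matrix. The sketch below (random symmetric matrix, illustrative sizes) rebuilds A from its eigenpairs ordered by decreasing absolute value, and checks that the operator norm of the remainder after m terms is the largest remaining |λ_k|, which is what drives convergence in operator norm.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.standard_normal((5, 5))
A = (S + S.T) / 2                      # real symmetric, hence self-adjoint

lam, E = np.linalg.eigh(A)             # eigenvalues ascending, eigenvectors in columns
order = np.argsort(-np.abs(lam))       # "ordered": decreasing absolute value
lam, E = lam[order], E[:, order]

# A = sum_k lam_k e_k ⊗ e_k*
A_rebuilt = sum(l * np.outer(e, e) for l, e in zip(lam, E.T))

# Operator norm of the remainder after keeping the first m terms.
tail_norms = [np.linalg.norm(A - (E[:, :m] * lam[:m]) @ E[:, :m].T, 2)
              for m in range(1, 5)]
```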
Proposition 2.14. If A is a compact, self-adjoint operator, it is positive if and only if its
spectrum is nonnegative.

Proof. Using the eigendecomposition (2.2) we see that $\lambda_k = \langle Ae_k, e_k \rangle \geq 0$ if A is positive.
Conversely, if u ∈ H then $\langle Au, u \rangle = \sum_{k \in I} \lambda_k |\langle u, e_k \rangle|^2$ is nonnegative if $\lambda_k \geq 0$ for all
k ∈ I.

The following consequence is important:
Theorem 2.15 (Singular value decomposition of a compact operator). Let A ∈ B(H, H′ ).
Then A is compact if and only if there exists a finite or countably infinite set I (I = ⟦n⟧ for some
n ∈ N or I = N) and:
(1) a positive sequence $(\sigma_k)_{k \in I}$, converging to 0 if I = N;

(2) an orthonormal system $(e_k)_{k \in I}$ of H;

(3) an orthonormal system $(f_k)_{k \in I}$ of H′ ;

such that
$$A = \sum_{k \in I} \sigma_k f_k \otimes e_k^*, \tag{2.4}$$
that is to say:
$$\forall u \in H \quad Au = \sum_{k \in I} \sigma_k \langle u, e_k \rangle f_k, \tag{2.5}$$
and the series in (2.4) converges in operator norm.
The family $\{\sigma_k, k \in I\} \cup \{0\}$ is called the set of singular values of A, denoted sv(A), and it holds
$\mathrm{sv}(A) = \{\sqrt{\lambda}, \lambda \in \sigma(A^*A)\}$, $\|A\|_{op} = \max_k \sigma_k$, and $\mathrm{sv}(A^*) = \mathrm{sv}(A)$. Conversely, for
any $(\sigma_k, e_k, f_k)_{k \in I}$ satisfying conditions (1)-(2)-(3), the series in (2.4) converges in operator
norm and defines a compact operator.
The fact that the theorem is "iff" shows that (2.4) is a complete characterization of
compact operators between Hilbert spaces.
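For a matrix, the decomposition (2.4) is the usual (reduced) singular value decomposition. The following sketch, on a random real 4 × 6 matrix chosen for illustration, checks the reconstruction (2.4), the relation between sv(A) and the spectrum of A∗A, and the identities for the operator norm and for sv(A∗).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))        # operator from H = R^6 to H' = R^4

F, sigma, Et = np.linalg.svd(A, full_matrices=False)
# Columns of F are the f_k, rows of Et are the e_k, sigma is decreasing.

# A = sum_k sigma_k f_k ⊗ e_k*
A_rebuilt = sum(s * np.outer(f, e) for s, f, e in zip(sigma, F.T, Et))

# The largest eigenvalues of A*A are the squared singular values
# (the remaining eigenvalues of the 6 x 6 matrix A*A are zero).
eig_AtA = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1][:len(sigma)]

sigma_adjoint = np.linalg.svd(A.T, compute_uv=False)
```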
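Proof. Assume that A ∈ K(H, H′ ); hence A∗ ∈ K(H′ , H). It is easily seen that the
composition of compact operators is compact, hence A∗ A ∈ K(H) is compact and
self-adjoint. We can apply Theorem 2.13 to find $(\lambda_k, e_k)_{k \in I}$ an eigendecomposition of A∗ A.
Obviously A∗ A ⪰ 0, so $\lambda_k > 0$ for all k ∈ I. Let us denote $\sigma_k := \sqrt{\lambda_k}$, and $f_k := \sigma_k^{-1} Ae_k$,
k ∈ I.
Let us first prove that $(f_k)_{k \in I}$ is an orthonormal system of H′ :

$$\begin{aligned}
\langle f_k, f_\ell \rangle &= \langle \sigma_k^{-1} Ae_k, \sigma_\ell^{-1} Ae_\ell \rangle \\
&= \sigma_k^{-1} \sigma_\ell^{-1} \langle e_k, A^*Ae_\ell \rangle \\
&= \sigma_k^{-1} \sigma_\ell^{-1} 1\{k = \ell\} \sigma_k^2 \\
&= 1\{k = \ell\}.
\end{aligned}$$

We now establish that A = 0 on the orthogonal of $\{e_k, k \in I\}$. Namely, if u ∈ H is
orthogonal to all $e_k$, k ∈ I, then A∗ Au = 0 by the eigendecomposition of A∗ A, hence
$\langle u, A^*Au \rangle = \|Au\|^2 = 0$. Writing any v ∈ H as the sum of its orthogonal projection on
$[e_k, k \in I]$ and of a remainder orthogonal to all $e_k$, and using the continuity of A, it follows that:
$$Av = \sum_{k \in I} \langle v, e_k \rangle Ae_k = \sum_{k \in I} \sigma_k \langle v, e_k \rangle f_k.$$
This establishes (2.5) i.e. convergence in the weak sense.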
We will now check that conversely, if $(\sigma_k, e_k, f_k)_{k \in I}$ satisfy (1)-(2)-(3), then the sum
in (2.5) is well defined. Let $M := \max_{k \in I} \sigma_k$. For any u ∈ H, since $(e_k)_{k \in I}$ is an orthonormal
system we have $\sum_{k \in I} |\langle u, e_k \rangle|^2 \leq \|u\|^2$ (Bessel's inequality). Hence if $a_k := \sigma_k \langle u, e_k \rangle$ we
have $|a_k|^2 \leq M^2 |\langle u, e_k \rangle|^2$, and further $\sum_{k \in I} |a_k|^2 \leq M^2 \|u\|^2$, hence the sum $\sum_{k \in I} a_k f_k$ is a
well-defined element of H′ since $(f_k)_{k \in I}$ is an orthonormal system. Linearity of this sum wrt.
u is straightforward, and we also have as a byproduct that the resulting linear operator
from H to H′ has operator norm bounded by M. This also establishes the convergence
of (2.4) in operator norm, since by the same token, for any subset I′ of I, it holds
$$\Big\| \sum_{k \in I'} \sigma_k f_k \otimes e_k^* \Big\|_{op} \leq \max_{k \in I'} \sigma_k,$$
and since $\sigma_k \to 0$, applying this bound to the tails of the series implies the convergence in
operator norm of (2.4).

2.4 Functional calculus for compact self-adjoint operators


The following construction allows one to apply a (bounded) real function to a compact
self-adjoint operator:
Definition 2.16. Let A ∈ K(H) be a compact self-adjoint operator, and f be a bounded
function σ(A) → R or C (a real or complex function of real variable). Then if $(\lambda_k, e_k)_{k \in I}$
is a completed eigendecomposition of A, we define
$$f(A) := \sum_{k \in I} f(\lambda_k) e_k \otimes e_k^* \in B(H), \tag{2.6}$$
where the above series converges for the weak topology, that is to say
$$\forall u \in H \quad f(A)u = \sum_{k \in I} f(\lambda_k) \langle u, e_k \rangle e_k. \tag{2.7}$$

Proof. We have to prove the claim that the series (2.7) converges, and that the operator
defined this way is in B(H), so that this definition makes sense. The argument is the same
as the one used in the proof of Thm. 2.15: for any u ∈ H, we have $\sum_{k \in I} |\langle u, e_k \rangle|^2 \leq \|u\|^2$
(Bessel's inequality). Hence if $a_k := f(\lambda_k) \langle u, e_k \rangle$ we have $|a_k|^2 \leq M^2 |\langle u, e_k \rangle|^2$, where
$M := \sup_{k \in I} |f(\lambda_k)|$, and further $\sum_{k \in I} |a_k|^2 \leq M^2 \|u\|^2$, hence the sum $\sum_{k \in I} a_k e_k$ is a well-defined
element of H. Furthermore it shows that the operator defined this way belongs to B(H),
with operator norm bounded by M.
Finally, we check that the definition (2.6) does not depend on the chosen eigendecomposition,
because it can be rewritten as
$$f(A) = \sum_{\lambda \in \sigma(A)} f(\lambda) P_\lambda,$$
and the spectral decomposition of A in canonical form is unique.

Note that we do not have convergence in operator norm in (2.6) in general, since $(f(\lambda_k))_{k \in I}$
does not converge to 0 in general (in fact, Theorem 2.15 indicates that convergence in
operator norm is equivalent to $(f(\lambda_k))_{k \in I}$ converging to 0, and to f(A) being compact). For example,
if f is the function identically equal to 1, f(A) is the identity and, when H is infinite-dimensional,
the convergence does not hold in operator norm.
We have the following properties:
Proposition 2.17. Let A ∈ K(H) be a compact self-adjoint operator, and f, g be bounded
functions σ(A) → R. Then:
(a) For any λ, µ ∈ C : (λf + µg)(A) = λf (A) + µg(A);
(b) (f g)(A) = f (A)g(A), implying in particular f (A)g(A) = g(A)f (A);
(c) $\|f(A)\|_{op} = \sup_{t \in \sigma(A)} |f(t)|$;
(d) If f is the constant function equal to 1, then f (A) = I;
(e) If f is the identity function, then f (A) = A.
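For a symmetric matrix, the definition (2.6) amounts to applying f to the eigenvalues while keeping the eigenvectors. The sketch below implements this with a hypothetical helper `apply_function` (a name introduced for this example only) and checks properties (b) to (e) on a random matrix with two arbitrary bounded functions.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.standard_normal((5, 5))
A = (S + S.T) / 2
lam, E = np.linalg.eigh(A)             # completed eigendecomposition of A

def apply_function(f):
    # f(A) = sum_k f(lam_k) e_k ⊗ e_k*  (columns of E are the e_k)
    return (E * f(lam)) @ E.T

f = np.cos
g = lambda t: 1.0 / (1.0 + t ** 2)

fA, gA = apply_function(f), apply_function(g)
fgA = apply_function(lambda t: f(t) * g(t))
one_A = apply_function(np.ones_like)   # constant function equal to 1
id_A = apply_function(lambda t: t)     # identity function
```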
The following proposition gives a useful trick:
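Proposition 2.18 (Shift formula). Let A ∈ K(H, H′ ) and $g : \{\lambda^2 \mid \lambda \in \mathrm{sv}(A)\} \to \mathbb{R}$ be a
bounded function. Then it holds
$$g(AA^*)A = Ag(A^*A) \quad \text{and} \quad A^*g(AA^*) = g(A^*A)A^*. \tag{2.8}$$
Proof. Let $(\sigma_k, e_k, f_k)_{k \in I}$ be an SVD of A, i.e. $A = \sum_{k \in I} \sigma_k f_k \otimes e_k^*$. Then $A^* = \sum_{k \in I} \sigma_k e_k \otimes f_k^*$
and $AA^* = \sum_{k \in I} \sigma_k^2 f_k \otimes f_k^*$, which is an eigendecomposition of AA∗ , hence
$$g(AA^*)A = \Big( \sum_{k \in I} g(\sigma_k^2) f_k \otimes f_k^* \Big) \Big( \sum_{\ell \in I} \sigma_\ell f_\ell \otimes e_\ell^* \Big) = \sum_{k \in I} \sigma_k g(\sigma_k^2) f_k \otimes e_k^*;$$
it can be checked that Ag(A∗ A) leads to the same formula. The other part of the claim
is proved similarly.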
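A familiar instance of the shift formula is obtained with g(t) = 1/(t + c) for a constant c > 0, for which g(AA∗) and g(A∗A) are plain matrix inverses. The sketch below checks (2.8) in that case; the choice of g, the constant `reg` and the matrix sizes are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))        # operator from R^6 to R^4
reg = 0.1                              # illustrative positive constant c

# g(t) = 1 / (t + reg), applied to AA* and A*A through matrix inverses.
g_AAt = np.linalg.inv(A @ A.T + reg * np.eye(4))
g_AtA = np.linalg.inv(A.T @ A + reg * np.eye(6))

lhs = g_AAt @ A                        # g(AA*) A
rhs = A @ g_AtA                        # A g(A*A)
lhs_adj = A.T @ g_AAt                  # A* g(AA*)
rhs_adj = g_AtA @ A.T                  # g(A*A) A*
```

The practical interest is that one may invert whichever of the two Gram matrices is smaller.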

2.5 Hilbert-Schmidt operators


Source: [8, Chap.3]
Proposition/Definition 2.19. If $(e_k)_{k \in I}$ and $(f_\ell)_{\ell \in J}$ are Hilbert bases of H, H′ and
A ∈ B(H, H′ ), it holds
$$\sum_{k \in I} \|Ae_k\|^2 = \sum_{\ell \in J} \|A^*f_\ell\|^2 = \sum_{(k,\ell) \in I \times J} |\langle Ae_k, f_\ell \rangle|^2, \tag{2.9}$$
meaning that if any of the sums is convergent the others are as well, and their value is
independent of the choice of basis. If these sums are convergent, the operator A is called a
Hilbert-Schmidt operator.

Proof. By Parseval's identity we have $\|Ae_k\|^2 = \sum_{\ell \in J} |\langle Ae_k, f_\ell \rangle|^2 = \sum_{\ell \in J} |\langle e_k, A^*f_\ell \rangle|^2$.
Summing over k ∈ I and using Fubini's relation yields the two equalities.
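For matrices, the common value in (2.9) is the squared Frobenius norm. The sketch below evaluates the three sums in randomly drawn orthonormal bases; `random_orthonormal_basis` is a helper introduced for this example, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))        # operator from R^6 to R^4

def random_orthonormal_basis(n):
    # Orthonormal basis of R^n, stored as the columns of an orthogonal matrix.
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q

E = random_orthonormal_basis(6)        # basis (e_k) of H
F = random_orthonormal_basis(4)        # basis (f_l) of H'

sum_Ae = np.sum(np.linalg.norm(A @ E, axis=0) ** 2)      # sum_k ||A e_k||^2
sum_Atf = np.sum(np.linalg.norm(A.T @ F, axis=0) ** 2)   # sum_l ||A* f_l||^2
sum_coeffs = np.sum((F.T @ A @ E) ** 2)                  # sum_{k,l} <A e_k, f_l>^2
frobenius_sq = np.linalg.norm(A, "fro") ** 2
```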
Proposition/Definition 2.20. The set of Hilbert-Schmidt operators from H to H′ ,
denoted HS(H, H′ ) (or sometimes B2 (H, H′ ) in the literature), is a linear subspace of
K(H, H′ ), and a Hilbert space, once endowed with the Hilbertian product
$$\langle A, B \rangle_2 := \sum_{k} \langle Ae_k, Be_k \rangle = \sum_{\ell} \langle A^*f_\ell, B^*f_\ell \rangle = \sum_{(k,\ell) \in I \times J} \langle Ae_k, f_\ell \rangle \overline{\langle Be_k, f_\ell \rangle}, \tag{2.10}$$
where $(e_k)_{k \in I}$, $(f_\ell)_{\ell \in J}$ are any orthonormal bases of H, H′ .
This definition does not depend on the choice of bases.
The associated Hilbert norm satisfies

$$\|A\|_{op} \leq \|A\|_2.$$

Proof. Let us fix $(e_k)_{k \in I}$, $(f_\ell)_{\ell \in J}$ orthonormal bases of H, H′ . It is easy to check that
HS(H, H′ ) is a vector space, using the definition and the fact that

$$|\langle (A + B)e_k, f_\ell \rangle|^2 \leq \big( |\langle Ae_k, f_\ell \rangle| + |\langle Be_k, f_\ell \rangle| \big)^2 \leq 2 \big( |\langle Ae_k, f_\ell \rangle|^2 + |\langle Be_k, f_\ell \rangle|^2 \big).$$

Similarly, the sums in (2.10) are absolutely convergent if A and B are Hilbert-Schmidt due
to
$$|\langle Ae_k, f_\ell \rangle \langle Be_k, f_\ell \rangle| \leq \frac{1}{2} \big( |\langle Ae_k, f_\ell \rangle|^2 + |\langle Be_k, f_\ell \rangle|^2 \big)$$
for the last sum, and to the Cauchy-Schwarz inequality (to apply twice) for the two first
sums.
It is straightforward to check that it is sesquilinear. Furthermore $\langle A, A \rangle_2 = \sum_{k,\ell} |\langle Ae_k, f_\ell \rangle|^2 = \sum_k \|Ae_k\|^2$
is 0 iff A = 0. It is thus a Hilbertian product and induces a norm on HS(H, H′ ).
As a consequence, the Hilbertian product can be obtained by the polarization formula
$\langle A, B \rangle_2 = \frac{1}{4} \big( \|A + B\|_2^2 - \|A - B\|_2^2 \big)$ (for a R-Hilbert space) or
$\langle A, B \rangle_2 = \frac{1}{4} \big( \|A + B\|_2^2 - \|A - B\|_2^2 + i\|A + iB\|_2^2 - i\|A - iB\|_2^2 \big)$ (for a C-Hilbert space).

We have seen from Proposition 2.19 that the corresponding formula (2.9) does not depend
on the choice of basis, therefore neither does (2.10).
If u is a unit vector, we can complete it to an orthonormal basis $(u_k)_{k \in I}$, and we have
$$\|Au\|^2 \leq \sum_{k \in I} \|Au_k\|^2 = \|A\|_2^2,$$
which implies $\|A\|_{op} \leq \|A\|_2$ by taking the supremum in u.


From this, we deduce that any operator A ∈ HS(H, H′ ) can be arbitrarily approximated
by a finite rank operator in HS-norm: for fixed ε > 0 given, let $I_\varepsilon$ be a finite subset of I
such that $\sum_{k \in I \setminus I_\varepsilon} \|Ae_k\|^2 \leq \varepsilon$. Define the operator $B_\varepsilon = AP_\varepsilon$, where $P_\varepsilon$ is the orthogonal
projector on $[e_k, k \in I_\varepsilon]$. Then $B_\varepsilon$ is finite-rank and $\|B_\varepsilon - A\|_2^2 = \sum_{k \in I \setminus I_\varepsilon} \|Ae_k\|^2 \leq \varepsilon$.
Since the operator norm is dominated by the HS-norm, this proves in particular that any
Hilbert-Schmidt operator can be arbitrarily approximated in operator norm by finite rank
operators, and hence is compact.
It remains to justify that HS(H, H′ ) is complete for its norm. Because the Hilbert-Schmidt
norm dominates the operator norm, a Cauchy sequence $A_n$ in HS norm is Cauchy in
operator norm and converges in operator norm towards some limit $A_\infty$, since B(H, H′ ) is
complete. Thus $\|A_n e_k\|^2$ converges pointwise to $\|A_\infty e_k\|^2$ for every k. By Fatou's Lemma,
this implies that $A_\infty$ is Hilbert-Schmidt and that $\|A_n - A_\infty\|_2^2 \to 0$.
Proposition 2.21. We have the following properties:
• For any u, v ∈ H and w, x ∈ H′ , and A ∈ HS(H, H′ ), it holds

$$\langle A, w \otimes u^* \rangle_2 = \langle Au, w \rangle;$$

$$\langle w \otimes u^*, x \otimes v^* \rangle_2 = \langle w, x \rangle \langle u, v \rangle.$$

• If $(e_k)_{k \in I}$ and $(f_\ell)_{\ell \in J}$ are bases of H, H′ , the family of rank-one operators
$(f_\ell \otimes e_k^*)_{(k,\ell) \in I \times J}$ forms an orthonormal basis of HS(H, H′ ).

• If A is Hilbert-Schmidt (hence compact), and $(\sigma_k, e_k, f_k)_{k \in I}$ is a svd of A, then

$$\|A\|_2^2 = \sum_{k \in I} \sigma_k^2; \tag{2.11}$$

conversely if A is compact and the sum in (2.11) converges, then A is Hilbert-Schmidt.


Proof. For the first point, without loss of generality we can assume ∥w∥ = ∥u∥ = 1, and
we can choose Hilbert bases of H, H′ with u, resp. w as their first element. Using these
bases to develop the product $\langle A, w \otimes u^* \rangle_2$ as per (2.10) yields the first point, which can be
easily specialized to the case $A = x \otimes v^*$.
It results that if $(e_k)_{k \in I}$ and $(f_\ell)_{\ell \in J}$ are bases of H, H′ , then
$\langle f_j \otimes e_i^*, f_\ell \otimes e_k^* \rangle_2 = 1\{i = k\} 1\{j = \ell\}$, so we have an orthonormal family. If it was not a
basis, we could find a nonzero HS operator A in their orthogonal, but then by using the
formula (2.10) and the previous point we would have $\|A\|_2 = 0$, a contradiction.
Finally, for the last point we can use the characterization (2.9) of the squared Hilbert-Schmidt norm and
apply it to the two bases entering into the svd $(\sigma_k, e_k, f_k)_{k \in I}$ of A.
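The identities of Proposition 2.21 are easy to check for real matrices, where the Hilbert-Schmidt inner product is the entrywise (Frobenius) inner product. The sketch below uses the expression ⟨S, T⟩2 = Tr(ST∗) from the notation index; the helper name `hs_inner` and the random data are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))        # operator from H = R^6 to H' = R^4
u, v = rng.standard_normal(6), rng.standard_normal(6)
w, x = rng.standard_normal(4), rng.standard_normal(4)

def hs_inner(S, T):
    # <S, T>_2 = Tr(S T*); for real matrices T* is the transpose of T.
    return np.trace(S @ T.T)

lhs_1 = hs_inner(A, np.outer(w, u))                # <A, w ⊗ u*>_2
lhs_2 = hs_inner(np.outer(w, u), np.outer(x, v))   # <w ⊗ u*, x ⊗ v*>_2
sigma = np.linalg.svd(A, compute_uv=False)         # singular values of A
```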

2.6 Schatten p-classes


Using functional calculus, if A is a compact, self-adjoint operator such that A ⪰ 0, then
the square root $A^{1/2}$ of A is well-defined (and satisfies the expected relation $(A^{1/2})^2 = A$).
Definition 2.22. If A ∈ K(H, H′ ) is a compact operator, we define $|A| := (A^*A)^{1/2}$.
Proposition 2.23. If A ∈ K(H, H′ ), we have the following properties:
• It holds sv(A) = σ(|A|).

• If $A = \sum_{k \in I} \sigma_k f_k \otimes e_k^*$ is an svd of A, then
$$|A| = \sum_{k \in I} \sigma_k e_k \otimes e_k^*.$$

• For any u ∈ H,
$$\|Au\| = \| |A| u \|.$$

• With the notation of the svd above, if W : H → H′ is the operator defined as
$W e_k = f_k$, k ∈ I and W u = 0 for $u \in [e_k, k \in I]^\perp$, then A = W |A| and W is a partial
isometry (i.e. an isometry from the orthogonal of its null space onto its image). This
is called the polar decomposition of A.

Proof: left to the reader.
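Although the proof is left to the reader, the statements are easy to check numerically from an svd. The sketch below builds |A| and the partial isometry W for a random real matrix (sizes chosen for illustration) following the formulas of Proposition 2.23.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))        # operator from H = R^6 to H' = R^4

F, sigma, Et = np.linalg.svd(A, full_matrices=False)
E = Et.T                               # columns e_k in H; columns of F are the f_k

abs_A = (E * sigma) @ E.T              # |A| = sum_k sigma_k e_k ⊗ e_k*
W = F @ E.T                            # W e_k = f_k, W = 0 on the orthogonal of the e_k

u = rng.standard_normal(6)
norm_Au = np.linalg.norm(A @ u)
norm_absA_u = np.linalg.norm(abs_A @ u)

# Nonzero part of the spectrum of |A|, in decreasing order.
top_eigs = np.sort(np.linalg.eigvalsh(abs_A))[::-1][:len(sigma)]
```

Note that `W.T @ W` is the orthogonal projector onto the span of the e_k, which is the matrix form of the partial isometry property.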


From now on, we will restrict our attention to endomorphisms of H i.e. elements of
B(H) (instead of operators from H to H′ ). Some of the definitions below can be extended
to B(H, H′ ) but some are specific to endomorphisms (for instance the notion of trace).

Proposition/Definition 2.24 (Schatten p-class). Let p ∈ [1, ∞). An operator A ∈ K(H)
is said to be in the Schatten p-class if any of the following equivalent properties is satisfied:
1. $|A|^{p/2} \in HS(H)$.

2. $\sum_{k \in I} \sigma_k^p < \infty$, where $(\sigma_k)_{k \in I}$ are the singular values of A (with multiplicity).

3. There exists a basis $(e_k)_{k \in I}$ of H such that $\sum_{k \in I} \langle |A|^p e_k, e_k \rangle < \infty$.

4. For any basis $(e_k)_{k \in I}$ of H, it holds $\sum_{k \in I} \langle |A|^p e_k, e_k \rangle < \infty$ and the value of this
quantity is independent of the choice of basis.

5. It holds
$$\sup_{(e_k)_{k \in I}, (f_k)_{k \in I}} \sum_{k} |\langle Ae_k, f_k \rangle|^p < \infty,$$
where the supremum is over $(e_k)_{k \in I}$ and $(f_k)_{k \in I}$ bases of H.

The Schatten p-class of operators is denoted $B_p(H)$.
It is a linear subspace of K(H). The quantity appearing in the points 2-3-4-5 above
is the same, denoted $\|A\|_p^p$. $\|\cdot\|_p$ is a norm on $B_p(H)$ called Schatten p-norm. If $A \in B_p(H)$,
then $A^* \in B_p(H)$ with $\|A^*\|_p = \|A\|_p$, and it holds for any q ≥ p that $A \in B_q(H)$ with
$\|A\|_p \geq \|A\|_q \geq \|A\|_{op}$; $B_p(H)$ is complete for this norm.

Proof. The equivalence of points 1 to 4 (and equality of the quantities defined therein) is
a direct consequence of Propositions 2.19, 2.20, 2.21 for Hilbert-Schmidt operators, and of
2.23, remarking that since |A| is self-adjoint, $\| |A|^{p/2} e \|^2 = \langle |A|^p e, e \rangle$.
Concerning point 5, note that 5 ⇒ 2 by choosing the "input and output" bases of a singular
value decomposition of A, so that $|\langle Ae_k, f_k \rangle|^p = \sigma_k^p$ for all k.

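We now show the converse. Let $(e_k)_{k \in I}$, $(f_k)_{k \in I}$ be the bases of a singular value
decomposition of A as above, and $(\tilde{e}_k)_{k \in I}$, $(\tilde{f}_k)_{k \in I}$ arbitrary bases of H. We have
$$|\langle A\tilde{e}_k, \tilde{f}_k \rangle| = \Big| \sum_{\ell} \sigma_\ell \langle \tilde{e}_k, e_\ell \rangle \langle f_\ell, \tilde{f}_k \rangle \Big| \leq \sum_{\ell} \sigma_\ell W_{k,\ell},$$
where
$$W_{k,\ell} := |\langle \tilde{e}_k, e_\ell \rangle| \, |\langle f_\ell, \tilde{f}_k \rangle|.$$
Observe that by the Cauchy-Schwarz inequality, it holds
$$0 \leq W_{k,\bullet} := \sum_{\ell} W_{k,\ell} \leq \Big( \sum_{\ell} |\langle \tilde{e}_k, e_\ell \rangle|^2 \Big)^{1/2} \Big( \sum_{\ell} |\langle f_\ell, \tilde{f}_k \rangle|^2 \Big)^{1/2} \leq \|\tilde{e}_k\| \|\tilde{f}_k\| = 1,$$
and similarly $0 \leq W_{\bullet,\ell} := \sum_k W_{k,\ell} \leq 1$. Therefore, by Jensen's inequality:
$$\begin{aligned}
\sum_{k} |\langle A\tilde{e}_k, \tilde{f}_k \rangle|^p &\leq \sum_{k} \Big( \sum_{\ell} \sigma_\ell W_{k,\ell} \Big)^p \\
&\leq \sum_{k} \sum_{\ell} \sigma_\ell^p W_{k,\ell} W_{k,\bullet}^{p-1} \\
&\leq \sum_{\ell} \sigma_\ell^p W_{\bullet,\ell} \\
&\leq \sum_{\ell} \sigma_\ell^p.
\end{aligned}$$

This argument also shows that the quantities appearing in points 2 and 5 are identical.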
The variational characterization of point 5 allows us to establish the triangle inequality
(note that the other characterizations are not so nice for this since they involve |A|, not
A, and we are not sure what to do with |A + B|), namely for any bases $(e_k)_{k \in I}$, $(f_k)_{k \in I}$:
$$\begin{aligned}
\Big( \sum_{k} |\langle (A + B)e_k, f_k \rangle|^p \Big)^{1/p} &\leq \Big( \sum_{k} \big( |\langle Ae_k, f_k \rangle| + |\langle Be_k, f_k \rangle| \big)^p \Big)^{1/p} \\
&\leq \Big( \sum_{k} |\langle Ae_k, f_k \rangle|^p \Big)^{1/p} + \Big( \sum_{k} |\langle Be_k, f_k \rangle|^p \Big)^{1/p} \\
&\leq \|A\|_p + \|B\|_p.
\end{aligned}$$
The announced norm inequalities follow directly from point 2, and the completeness
property follows from a similar argument as in the proof of Proposition 2.20, using one of the
above characterizations and Fatou's lemma.
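In finite dimension, point 2 of the definition gives a direct way to compute Schatten norms from singular values. The sketch below (random 5 × 5 matrices, an arbitrary list of exponents) checks the monotonicity in p, the comparison with the operator norm, and the triangle inequality; `schatten_norm` is a helper name chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
B = rng.standard_normal((5, 5))

def schatten_norm(M, p):
    # l^p norm of the singular values (point 2 of the definition).
    s = np.linalg.svd(M, compute_uv=False)
    return np.sum(s ** p) ** (1.0 / p)

ps = [1, 1.5, 2, 3, 6]
norms_A = [schatten_norm(A, p) for p in ps]
op_norm_A = np.linalg.norm(A, 2)

# Slack in the triangle inequality for each p; should be nonnegative.
triangle_gaps = [schatten_norm(A, p) + schatten_norm(B, p) - schatten_norm(A + B, p)
                 for p in ps]
```

For p = 2 and p = 1 the values agree with NumPy's Frobenius (`"fro"`) and nuclear (`"nuc"`) matrix norms.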

2.7 Trace-class operators


Proposition/Definition 2.25. For any $A \in B_1(H)$ and any basis $(u_\ell)_{\ell \in I}$, the sum
$\sum_{\ell \in I} \langle Au_\ell, u_\ell \rangle$ is a Hilbert sum (it converges absolutely) and its value is independent of the
choice of basis.
This quantity is called trace of A and denoted Tr(A).
For this reason an operator in $B_1(H)$ is also called "trace-class" and $\|A\|_1$ sometimes "trace
norm" (note that $\|A\|_1 \neq \mathrm{Tr}(A)$ in general, though!)
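Proof. As usual we start with a svd of A, $A = \sum_{k \in I} \sigma_k f_k \otimes e_k^*$. Then for any orthonormal
basis $(u_\ell)_{\ell \in I}$, it holds
$$\begin{aligned}
\sum_{\ell \in I} |\langle Au_\ell, u_\ell \rangle| &= \sum_{\ell \in I} \Big| \sum_{k \in I} \sigma_k \langle u_\ell, e_k \rangle \langle f_k, u_\ell \rangle \Big| \\
&\leq \sum_{\ell \in I} \sum_{k \in I} \sigma_k |\langle u_\ell, e_k \rangle \langle f_k, u_\ell \rangle| \\
&= \sum_{k \in I} \sigma_k \sum_{\ell \in I} |\langle u_\ell, e_k \rangle \langle f_k, u_\ell \rangle| \\
&\leq \sum_{k \in I} \sigma_k \Big( \sum_{\ell \in I} |\langle u_\ell, e_k \rangle|^2 \Big)^{1/2} \Big( \sum_{\ell \in I} |\langle f_k, u_\ell \rangle|^2 \Big)^{1/2} \\
&= \sum_{k \in I} \sigma_k = \|A\|_1 < \infty,
\end{aligned}$$
thus all the sums involved in the above chain of inequalities converge (absolutely).
Since the first sum converges absolutely, we can write
$$\begin{aligned}
\sum_{\ell \in I} \langle Au_\ell, u_\ell \rangle &= \sum_{\ell \in I} \sum_{k \in I} \sigma_k \langle u_\ell, e_k \rangle \langle f_k, u_\ell \rangle \\
&= \sum_{k \in I} \sigma_k \sum_{\ell \in I} \langle f_k, u_\ell \rangle \overline{\langle e_k, u_\ell \rangle} \\
&= \sum_{k \in I} \sigma_k \langle f_k, e_k \rangle;
\end{aligned}$$
hence the value of the sum is independent of the choice of basis.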


Proposition 2.26. The following properties hold:
1. The trace is a linear functional on B1 (H).

2. If A ∈ B1 (H) then |Tr(A)| ≤ ∥A∥1 .

3. If $A \in B_1(H)$ then $\mathrm{Tr}(A^*) = \overline{\mathrm{Tr}(A)}$.

4. If $A \in B_1(H)$ is self-adjoint, then $\mathrm{Tr}(A) = \sum_{k \in J} \lambda_k$, where $(\lambda_k)_{k \in J}$ are the
eigenvalues of A (with multiplicity).

5. (Consequence of the previous point) if $A \in B_1(H)$ is self-adjoint positive, then
Tr(A) ≥ 0.

6. If $A \in B_p(H)$ for p ∈ [1, ∞), then $\|A\|_p = \mathrm{Tr}(|A|^p)^{1/p}$.

7. If A, B are Hilbert-Schmidt operators, then Tr(AB) = Tr(BA) and ⟨A, B⟩2 = Tr(B ∗ A).

8. If A ∈ B1 (H) and B ∈ B(H) then AB and BA are both trace-class and Tr(AB) =
Tr(BA).

Proof. We only prove the last two points, as the previous ones are straightforward (possibly
reusing the explicit expression found in the proof of Proposition 2.25).
If A, B are Hilbert-Schmidt operators, we simply identify from (2.10) that for any basis
$(e_k)_{k \in I}$:
$$\langle A, B \rangle_2 = \sum_{k} \langle Ae_k, Be_k \rangle = \sum_{k} \langle B^*Ae_k, e_k \rangle = \mathrm{Tr}(B^*A)$$
(we know that the left-hand side is a Hilbert sum, so the right-hand side is too, implying
$B^*A \in B_1(H)$).
Using (2.10) again, we have
$$\mathrm{Tr}(AB) = \langle B, A^* \rangle_2 = \sum_{k,\ell} \langle Be_k, e_\ell \rangle \overline{\langle A^*e_k, e_\ell \rangle} = \sum_{k,\ell} \langle Be_k, e_\ell \rangle \langle Ae_\ell, e_k \rangle; \tag{2.12}$$
we first check that the double sum is absolutely convergent:
$$\sum_{k,\ell} |\langle Be_k, e_\ell \rangle \langle Ae_\ell, e_k \rangle| \leq \Big( \sum_{k,\ell} |\langle Be_k, e_\ell \rangle|^2 \Big)^{1/2} \Big( \sum_{k,\ell} |\langle Ae_\ell, e_k \rangle|^2 \Big)^{1/2} \leq \|A\|_2 \|B\|_2,$$
where we have used characterization (2.9) of the HS norm. Since the double sum in
expression (2.12) is symmetrical in A and B, it holds Tr(AB) = Tr(BA).
If $A \in B_1(H)$ and T ∈ B(H), we can write A as a product A = BC of two
Hilbert-Schmidt operators B, C (this is clear from a svd of A). Furthermore, since T is bounded,
CT and T B are also Hilbert-Schmidt (use Definition 2.19 of a Hilbert-Schmidt operator,
or Proposition 2.30 in the next section), therefore using the previous point

$$\mathrm{Tr}(AT) = \mathrm{Tr}(B(CT)) = \mathrm{Tr}((CT)B) = \mathrm{Tr}(C(TB)) = \mathrm{Tr}((TB)C) = \mathrm{Tr}(TA).$$
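The basic properties of the trace can be checked on matrices. The sketch below (real random 5 × 5 matrices, an illustrative setup) computes the trace in a random orthonormal basis, compares it with the trace norm, and checks the cyclicity Tr(AB) = Tr(BA) together with the identity ⟨A, B⟩2 = Tr(B∗A).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
B = rng.standard_normal((5, 5))

# Trace computed in a random orthonormal basis (u_l): sum_l <A u_l, u_l>.
U, _ = np.linalg.qr(rng.standard_normal((5, 5)))
trace_in_U = np.sum(U * (A @ U))

trace_norm = np.linalg.norm(A, "nuc")   # ||A||_1, the sum of the singular values
hs_inner = np.sum(A * B)                # <A, B>_2 for real matrices (entrywise sum)
trace_AB = np.trace(A @ B)
trace_BA = np.trace(B @ A)
```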

We end this section with a definition that will be useful later.

Definition 2.27. If $A \in B_1(H)$, A ≠ 0, we call intrinsic dimension of A the quantity

$$\mathrm{intdim}(A) := \frac{\mathrm{Tr}(A)}{\|A\|_{op}}.$$

(We also define intdim(0) = 0.)

The interpretation of this quantity is that it measures over how many dimensions the
spectrum of A is "mainly concentrated". Here are a few properties to get some intuition
on this quantity.
Proposition 2.28.
• If A is finite-rank, A ≠ 0, then 1 ≤ intdim(A) ≤ rank(A).
• If A is an orthogonal projector onto a finite-dimensional subspace E, then intdim(A) =
dim(E).
• If $\|A\|_{op} = 1$ and the singular values of A satisfy $\sigma_k(A) \leq k^{-\alpha}$, α > 1, then
$\mathrm{intdim}(A) \leq \frac{\alpha}{\alpha - 1}$.
Proof. For the last point, use the sum-integral comparison
$$\mathrm{Tr}(A) \leq \sum_{k \geq 1} k^{-\alpha} \leq 1 + \int_1^\infty x^{-\alpha} \, dx = \frac{\alpha}{\alpha - 1}.$$
2.8 Some operator inequalities


We begin with a different variational characterization of the Schatten p-norm.
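Proposition 2.29. Let A ∈ K(H). If p ∈ [2, ∞), then
$$\|A\|_p^p = \sup_{(e_k)_{k \in I}} \sum_{k} \|Ae_k\|^p = \sup_{(e_k)_{k \in I}, (f_\ell)_{\ell \in I}} \sum_{k,\ell} |\langle Ae_k, f_\ell \rangle|^p;$$
if p ∈ [1, 2], then
$$\|A\|_p^p = \inf_{(e_k)_{k \in I}} \sum_{k} \|Ae_k\|^p = \inf_{(e_k)_{k \in I}, (f_\ell)_{\ell \in I}} \sum_{k,\ell} |\langle Ae_k, f_\ell \rangle|^p;$$
in each case the sup or inf is over (orthonormal) bases of H.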


Proof. Let $(\sigma_k, e_k, f_k)_{k \in I}$ be a completed svd of A. Note that $\sum_k \|Ae_k\|^p = \sum_k \sigma_k^p = \|A\|_p^p$.
On the other hand, for any basis $(u_k)_{k \in I}$ of H, put $W_{k,\ell} := \langle u_k, e_\ell \rangle$, it holds:
$$\begin{aligned}
\sum_{k} \|Au_k\|^p &= \sum_{k} \Big( \Big\| \sum_{\ell} \sigma_\ell W_{k,\ell} f_\ell \Big\|^2 \Big)^{p/2} \\
&= \sum_{k} \Big( \sum_{\ell} \sigma_\ell^2 |W_{k,\ell}|^2 \Big)^{p/2} \\
&\lesseqgtr \sum_{k,\ell} \sigma_\ell^p |W_{k,\ell}|^2 \\
&= \sum_{\ell} \sigma_\ell^p,
\end{aligned}$$
where we have used $\sum_\ell |W_{k,\ell}|^2 = \|u_k\|^2 = 1$, $\sum_k |W_{k,\ell}|^2 = \|e_\ell\|^2 = 1$, and Jensen's inequality
for the convex (if p ≥ 2) resp. concave (if 1 ≤ p ≤ 2) function $x \mapsto x^{p/2}$ (with the inequality
in different directions according to the case; it is an equality if p = 2).
On the other hand, for any bases $(u_k)_{k \in I}$, $(v_\ell)_{\ell \in I}$ of H, it holds
$$\sum_{k} \|Au_k\|^p = \sum_{k} \Big( \sum_{\ell} |\langle Au_k, v_\ell \rangle|^2 \Big)^{p/2} \gtreqless \sum_{k,\ell} |\langle Au_k, v_\ell \rangle|^p,$$
where we have used super- (if p ≥ 2) resp. sub-additivity (if p ≤ 2) of the function $x \mapsto x^{p/2}$ (note
that this is in the opposite direction as the previous display). Note again that if we take the
input/output bases of the svd of A we find again $\|A\|_p^p$.
Proposition 2.30. If A ∈ B(H) and B ∈ Bp (H) (p ∈ [1, ∞]), then AB ∈ Bp (H) and
∥AB∥p ≤ ∥A∥op ∥B∥p .
Similarly, BA ∈ Bp (H) and ∥BA∥p ≤ ∥A∥op ∥B∥p .
(So Bp (H) is an ideal of B(H).)
Proof. We assume p < ∞ as the case p = ∞ (i.e. ∥.∥∞ = ∥.∥op ) was handled before.
It holds for any orthonormal basis $(e_k)_{k\ge1}$ of H:
\[
\sum_k \|ABe_k\|^p \le \sum_k \big(\|A\|_{op}\|Be_k\|\big)^p = \|A\|_{op}^p \sum_k \|Be_k\|^p .
\]
We now use the variational characterization of Proposition 2.29: if p ≤ 2 we take an infimum over the basis on the left, then on the right-hand side; if p ≥ 2 we take a supremum over the basis on the right, then on the left-hand side. In all cases we conclude that AB ∈ Bp(H) and
∥AB∥p ≤ ∥A∥op ∥B∥p .
For the operator BA: we have
∥BA∥p = ∥A∗ B ∗ ∥p ≤ ∥A∗ ∥op ∥B ∗ ∥p = ∥A∥op ∥B∥p ;
note that the equality ∥A∥p = ∥A∗ ∥p comes from sv(A) = sv(A∗ ).
Proposition 2.31 (Hölder’s inequality for operators). Let A ∈ Bp (H) and B ∈ Bq (H)
with p−1 + q −1 = 1. Then AB ∈ B1 (H), BA ∈ B1 (H), Tr(AB) = Tr(BA) and
|Tr(AB)| ≤ ∥A∥p ∥B∥q . (2.13)
Proof. Assume that 1 ≤ q ≤ 2 ≤ p < ∞ and consider an svd of B, $B = \sum_k \sigma_k v_k \otimes u_k^*$. We can write (justifying the absolute convergence of the involved sums by a similar token for the sum of absolute values, and the last bound)
\[
\begin{aligned}
\sum_k \langle ABu_k, u_k\rangle &= \sum_k \sigma_k \langle Av_k, u_k\rangle \\
&\le \Big( \sum_k \sigma_k^q \Big)^{\frac1q} \Big( \sum_k |\langle Av_k, u_k\rangle|^p \Big)^{\frac1p} \\
&\le \|B\|_q \|A\|_p ,
\end{aligned}
\]
where we have used the (standard) Hölder's inequality for the first inequality, and point 5 of Proposition 2.24 for the second.
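Proving that Tr(AB) = Tr(BA) in that context is annoying. If $A = \sum_k \mu_k f_k \otimes e_k^*$ is an svd of A, we can start as above, writing
\[
\begin{aligned}
\sum_k \langle ABu_k, u_k\rangle &= \sum_k \sigma_k \langle Av_k, u_k\rangle \\
&= \sum_k \sum_\ell \sigma_k \mu_\ell \langle v_k, e_\ell\rangle \langle f_\ell, u_k\rangle . \qquad (2.14)
\end{aligned}
\]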

It seems that the obtained expression is symmetric in the role of A, B and that we are
done? Unfortunately, for this argument to be correct we have to establish that the double
sum over k, ℓ is absolutely convergent (we know that for any fixed k, the sum over ℓ is
absolutely convergent; that is not enough to establish that the double sum is.) Let us

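denote $W_{k,\ell} := |\langle v_k, e_\ell\rangle|$ and $W'_{k,\ell} := |\langle f_\ell, u_k\rangle|$. Assume 1 < p ≤ 2 ≤ q < ∞. We want to establish the convergence of
\[
\begin{aligned}
\sum_{k,\ell} \sigma_k \mu_\ell |\langle v_k, e_\ell\rangle| |\langle f_\ell, u_k\rangle| &= \sum_{k,\ell} \sigma_k \mu_\ell W_{k,\ell} W'_{k,\ell} \\
&= \sum_{k,\ell} \big(\sigma_k W_{k,\ell}^{\frac2q}\big)\big(\mu_\ell W_{k,\ell}^{1-\frac2q} W'_{k,\ell}\big) \\
&\le \Big( \sum_{k,\ell} \sigma_k^q W_{k,\ell}^2 \Big)^{\frac1q} \Big( \sum_{k,\ell} \mu_\ell^p W_{k,\ell}^{p(1-\frac2q)} (W'_{k,\ell})^p \Big)^{\frac1p} \\
&= \Big( \sum_k \sigma_k^q \Big)^{\frac1q} \Big( \sum_\ell \mu_\ell^p \sum_k W_{k,\ell}^{p(1-\frac2q)} (W'_{k,\ell})^p \Big)^{\frac1p} ,
\end{aligned}
\]
where we have used Hölder's inequality, then $\sum_\ell W_{k,\ell}^2 = 1$. Finally, applying Hölder's inequality again (with conjugate exponents $\frac{2}{2-p}$ and $\frac2p$, and noting that $p(1-\frac2q) = 2-p$ since $p^{-1}+q^{-1}=1$):
\[
\begin{aligned}
\sum_k W_{k,\ell}^{p(1-\frac2q)} (W'_{k,\ell})^p &\le \Big( \sum_k W_{k,\ell}^{\frac{p}{1-p/2}(1-\frac2q)} \Big)^{1-\frac p2} \Big( \sum_k (W'_{k,\ell})^2 \Big)^{\frac p2} \\
&= \Big( \sum_k W_{k,\ell}^2 \Big)^{1-\frac p2} \Big( \sum_k (W'_{k,\ell})^2 \Big)^{\frac p2} = 1 .
\end{aligned}
\]
Thus (2.14) is absolutely convergent (we ended up also re-proving the trace-Hölder inequalities established before, in a more complicated way...)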
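For matrices, the ideal property of Proposition 2.30 and the trace Hölder inequality (2.13) are easy to check numerically. The following sketch uses hypothetical random matrices and a hypothetical helper `schatten_norm`; it is an illustration in finite dimension, not a substitute for the argument above.

```python
import numpy as np

def schatten_norm(A, p):
    """Schatten p-norm computed from the singular values (p = inf gives the operator norm)."""
    sv = np.linalg.svd(A, compute_uv=False)
    if np.isinf(p):
        return float(sv.max())
    return float(np.sum(sv ** p) ** (1.0 / p))

rng = np.random.default_rng(1)
d = 5
A = rng.standard_normal((d, d))
B = rng.standard_normal((d, d))
p, q = 3.0, 1.5  # conjugate exponents: 1/3 + 2/3 = 1

trace_ab = float(np.trace(A @ B))
trace_ba = float(np.trace(B @ A))
holder_bound = schatten_norm(A, p) * schatten_norm(B, q)

# Ideal property: ||AB||_q <= ||A||_op ||B||_q
ideal_lhs = schatten_norm(A @ B, q)
ideal_rhs = schatten_norm(A, np.inf) * schatten_norm(B, q)
```

In this finite-dimensional setting `trace_ab` and `trace_ba` coincide up to rounding, and both inequalities hold with some slack.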
Proposition 2.32. Let A, B be two selfadjoint Hilbert-Schmidt operators and f : R → R
an L-Lipschitz function.
Then
∥f (A) − f (B)∥2 ≤ L∥A − B∥2 . (2.15)

Important remark: one can wonder if (2.15) holds for other norms. It is not the
case. In particular, it does not hold in general for the operator norm ∥.∥op : functions
satisfying (2.15) for the operator norm are called ”operator Lipschitz”, and not every
real-valued Lipschitz function is operator Lipschitz, even with a different constant.
Proof. Let $(e_k, \lambda_k)_{k\ge1}$ and $(f_\ell, \mu_\ell)_{\ell\ge1}$ be eigendecompositions of A and B, respectively.
Observe that in general, for an operator M ∈ HS(H), since both $(e_k)_{k\ge1}$ and $(f_\ell)_{\ell\ge1}$ are Hilbert bases it holds
\[
\|M\|_2^2 = \sum_k \|Me_k\|^2 = \sum_{k,\ell} |\langle Me_k, f_\ell\rangle|^2 .
\]
We apply this formula to M = A − B to get
\[
\begin{aligned}
\|A-B\|_2^2 &= \sum_{k,\ell} |\langle (A-B)e_k, f_\ell\rangle|^2 \\
&= \sum_{k,\ell} |\langle Ae_k, f_\ell\rangle - \langle e_k, Bf_\ell\rangle|^2 \\
&= \sum_{k,\ell} |\lambda_k - \mu_\ell|^2 |\langle e_k, f_\ell\rangle|^2 ,
\end{aligned}
\]
where we have used that B is selfadjoint in the second equality.
By the same token,
\[
\|f(A)-f(B)\|_2^2 = \sum_{k,\ell} |f(\lambda_k) - f(\mu_\ell)|^2 |\langle e_k, f_\ell\rangle|^2 .
\]
It is now obvious that we can use the Lipschitz property of f for each (k, ℓ) term to reach the claim.
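Inequality (2.15) can be illustrated on symmetric matrices with the 1-Lipschitz function f(x) = |x|. The sketch below is a finite-dimensional check under assumptions: `apply_spectral` is a hypothetical helper implementing functional calculus through an eigendecomposition, and the matrices are random.

```python
import numpy as np

def apply_spectral(f, A):
    """Functional calculus f(A) for a real symmetric matrix A."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.T

rng = np.random.default_rng(2)
d = 6
G, H = rng.standard_normal((d, d)), rng.standard_normal((d, d))
A = (G + G.T) / 2
B = (H + H.T) / 2

f = np.abs          # Lipschitz with constant L = 1
lipschitz_const = 1.0

lhs = float(np.linalg.norm(apply_spectral(f, A) - apply_spectral(f, B), "fro"))
rhs = lipschitz_const * float(np.linalg.norm(A - B, "fro"))
```

The Frobenius norm plays the role of the Hilbert-Schmidt norm here, and `lhs` does not exceed `rhs`.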

3 Tools from concentration of measure
3.1 Random variables in Banach space
Proposition/Definition 3.1. Let B be a separable Banach space, with its Borel σ-
algebra. A random variable X from a base probability space (Ω, F, P ) to B is Bochner
integrable if E[∥X∥] < ∞. In this case there is a well-defined expectation E[X] ∈ B
satisfying the following properties:
• ∥E[X]∥ ≤ E[∥X∥];
• Simple linearity: if X, Y are Bochner-integrable then E[λX + Y ] = λE[X] + E[Y ];
• Operator linearity: for any bounded linear operator A from B to a separable Banach
space B ′ it holds that AX is Bochner-integrable in B ′ and

E[AX] = AE[X].

Proposition 3.2 (Positivity of expectation for operators). Let A be a Bochner-integrable


random variable taking values in B(H) and such that A is a.s. self-adjoint positive. Then
E[A] is selfadjoint positive.
Proof. Since A 7→ A∗ is a bounded linear operator on B(H), A∗ is Bochner integrable as
soon as A is, and it holds E[A]∗ = E[A∗ ] = E[A] when A is a.s. self-adjoint, thus E[A] is
self-adjoint. Furthermore, if A is a.s. positive, then for any u ∈ H:

⟨u, E[A]u⟩ = E[⟨u, Au⟩] ≥ 0,

hence E[A] is positive.


Proposition/Definition 3.3. Let X be a random variable taking values in a separable Hilbert space H, and assume $E[\|X\|^2] < \infty$. Then X ⊗ X∗ is Bochner-integrable in B(H) and we call Σ = E[X ⊗ X∗] the second moment operator of X. It satisfies for all u, v in H:

E[⟨X, u⟩⟨X, v⟩] = ⟨v, Σu⟩.

Proof. Recall that ∥u ⊗ v ∗ ∥op = ∥u∥∥v∥. Thus ∥X ⊗ X ∗ ∥op = ∥X∥2 is integrable by


assumption, therefore X ⊗ X ∗ is Bochner-integrable in the Banach space B(H).
Therefore, by operator linearity of the Bochner integral, it holds for any u ∈ H that
X ⊗ X ∗ u = ⟨u, X⟩X is Bochner-integrable with E[⟨u, X⟩X] = E[(X ⊗ X ∗ )u] = Σu, and
further for any v ∈ H that ⟨u, X⟩⟨v, X⟩ = ⟨v, ⟨u, X⟩X⟩ so that

E[⟨X, u⟩⟨X, v⟩] = E[⟨v, ⟨u, X⟩X⟩] = ⟨v, E[⟨u, X⟩X]⟩ = ⟨v, Σu⟩.
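The defining identity of the second moment operator holds exactly for an empirical distribution, which gives a simple sanity check. In the sketch below the sample, the dimension and the vectors `u`, `v` are hypothetical; the expectation is taken with respect to the uniform distribution on the rows of `X`.

```python
import numpy as np

rng = np.random.default_rng(3)
N, dim = 500, 4
# Rows of X are the sample points; coordinates have different scales.
X = rng.standard_normal((N, dim)) * np.array([2.0, 1.0, 0.5, 0.1])

# Second moment operator of the empirical distribution: average of x x^T.
Sigma = X.T @ X / N

u = rng.standard_normal(dim)
v = rng.standard_normal(dim)

lhs = float(np.mean((X @ u) * (X @ v)))   # E[<X,u><X,v>]
rhs = float(v @ Sigma @ u)                # <v, Sigma u>
min_eig = float(np.linalg.eigvalsh(Sigma).min())
```

Both sides agree up to rounding, and `min_eig` is nonnegative, in line with Proposition 3.2.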

For Bochner integrable random variables, there is a well-defined notion of conditional


expectation as in the real case and it satisfies the above properties as well.

3.2 Hoeffding’s inequality in Hilbert space
We first recall the Azuma-McDiarmid concentration theorem for “stable” functions of
independent random variables, also called Bounded difference inequality.
Theorem 3.4 (Azuma-McDiarmid). Let X be a measurable space, and f : X n → R a
measurable function such that
∀i ∈ {1, . . . , n}, ∀(x1 , . . . , xn ) ∈ X n , ∀x′i ∈ X :
|f (x1 , . . . , xi , . . . , xn ) − f (x1 , . . . , x′i , . . . , xn )| ≤ 2ci , (Stab)
for some positive constants (c1 , . . . , cn ).
Let (X1, . . . , Xn) be an independent family of random variables taking values in X (not necessarily identically distributed), then f(X1, . . . , Xn) is a sub-Gaussian variable with parameter $\sum_{i=1}^n c_i^2$, so that in particular
\[
P\big[f(X_1,\dots,X_n) > E[f(X_1,\dots,X_n)] + t\big] \le \exp\Big( -\frac{t^2}{2\sum_{i=1}^n c_i^2} \Big). \qquad (3.1)
\]
(In particular, if all constants $c_i$ are equal to c, the bound is $\exp(-t^2/(2nc^2))$.)
For a proof, see e.g. [13, Section 3.4] or [2, Section 6.2] or [4, Section 6.1].
Theorem 3.5. Let X1 , . . . , Xn be i.i.d., Bochner-integrable random variables taking values
in a Hilbert space H, and having expectation 0.
Assume ∥Xi∥ ≤ B a.s. for some constant B.
Then if $S_n := \sum_{i=1}^n X_i$, it holds for any δ ∈ (0, 1):
\[
P\Big[ \Big\| \frac1n S_n \Big\| \ge \frac{B}{\sqrt n}\Big( 1 + \sqrt{2\log(\delta^{-1})} \Big) \Big] \le \delta .
\]
Proof. By the assumption ∥Xi∥ ≤ B, we can assume that the variables Xi in fact take their values in the ball of H centered at the origin and of radius B. As a first step, we note that the function $F(X_1,\dots,X_n) := \|\frac1n S_n\|$ satisfies the condition (Stab) (with $c_i = B/n$ for all i) on this ball, by the triangle inequality: namely, if we replace $X_i$ by $X_i'$ in the sum $S_n$, denoting it $S_n^{(i)}$, it holds
\[
\Big| \Big\| \frac1n S_n \Big\| - \Big\| \frac1n S_n^{(i)} \Big\| \Big| \le \frac1n \big\| S_n - S_n^{(i)} \big\| = \frac1n \|X_i - X_i'\| \le \frac{2B}{n} .
\]
Applying the Azuma-McDiarmid inequality, we get
\[
P\Big[ \Big\| \frac1n S_n \Big\| > \frac1n E[\|S_n\|] + t \Big] \le \exp\Big( -\frac{nt^2}{2B^2} \Big). \qquad (3.2)
\]
In a Hilbert space, we have moreover due to Jensen's inequality:
\[
E[\|S_n\|] \le E\big[\|S_n\|^2\big]^{\frac12} = E\Big[ \sum_{i,j=1}^n \langle X_i, X_j\rangle \Big]^{\frac12} = \sqrt n\, E\big[\|X_1\|^2\big]^{\frac12} \le B\sqrt n ,
\]
since $E[\langle X_i, X_j\rangle] = 0$ if i ≠ j, using independence and E[Xi] = 0. Combining this with (3.2) and taking $t = B\sqrt{2\log(\delta^{-1})/n}$ yields the claim.
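A small Monte Carlo experiment can illustrate Theorem 3.5 in a finite-dimensional Hilbert space. The sketch below assumes the Xi are uniform on the sphere of radius B in R^dim (a hypothetical choice that is centered and bounded); the sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n, dim, trials = 50, 20, 2000
B, delta = 1.0, 0.1

# Centered, bounded vectors: uniform on the sphere of radius B.
G = rng.standard_normal((trials, n, dim))
X = B * G / np.linalg.norm(G, axis=-1, keepdims=True)

# Norm of the empirical mean (1/n) S_n in each independent trial.
mean_norms = np.linalg.norm(X.mean(axis=1), axis=1)

threshold = B / np.sqrt(n) * (1.0 + np.sqrt(2.0 * np.log(1.0 / delta)))
violation_rate = float(np.mean(mean_norms >= threshold))
```

The observed `violation_rate` should be at most `delta`; in this example it is typically far smaller, since the bound does not exploit the specific distribution.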

3.2.1 (*) Extension to Banach space
Observe that equation (3.2) also holds more generally in a Banach space, since we have only
used the triangle inequality for the bounded difference concentration inequality. Only the
upper bound on E[∥Sn ∥] used the Hilbertian structure. For Banach spaces, a convenient
notion is that of type:

Definition 3.6. A Banach space B is said to be of (Rademacher) type p ∈ [1, 2] if there exists a constant C > 0 such that
\[
E_\varepsilon\Big[ \Big\| \sum_{i=1}^n \varepsilon_i x_i \Big\|^p \Big] \le C^p \sum_{i=1}^n \|x_i\|^p ,
\]
for all finite sequences (x1, . . . , xn) of elements of B, where ε1, ε2, . . . is an infinite sequence of i.i.d. Rademacher random variables (random signs).
The best constant C so that the above holds is denoted Tp(B).

Lemma 3.7. Let X1, . . . , Xn be i.i.d., Bochner-integrable random variables taking values in a Banach space B, and having expectation 0.
Assume that B is a Banach space of type p ∈ [1, 2]. Then if $S_n := \sum_{i=1}^n X_i$, it holds
\[
E[\|S_n\|] \le 2\,T_p(B)\, n^{\frac1p}\, E[\|X_1\|^p]^{\frac1p} .
\]

Proof. Denote $(X_1', \dots, X_n')$ an independent copy of $(X_1, \dots, X_n)$. Put C = Tp(B). Then we have
\[
\begin{aligned}
E[\|S_n\|] &= E\Big[ \Big\| \sum_{i=1}^n (X_i - E[X_i']) \Big\| \Big] \\
&\le E\Big[ \Big\| \sum_{i=1}^n (X_i - X_i') \Big\| \Big] && (3.3) \\
&= E\Big[ \Big\| \sum_{i=1}^n \varepsilon_i (X_i - X_i') \Big\| \Big] && (3.4) \\
&\le 2E\Big[ \Big\| \sum_{i=1}^n \varepsilon_i X_i \Big\| \Big] && (3.5) \\
&\le 2E\Big[ \Big\| \sum_{i=1}^n \varepsilon_i X_i \Big\|^p \Big]^{\frac1p} && (3.6) \\
&\le 2C \Big( \sum_{i=1}^n E[\|X_i\|^p] \Big)^{\frac1p} && (3.7) \\
&= 2C n^{\frac1p} E[\|X_1\|^p]^{\frac1p} , && (3.8)
\end{aligned}
\]
where (3.4) is due to invariance of the distribution of the vector $(X_i - X_i')_{1\le i\le n}$ by sign-flipping, (3.5) is the triangle inequality, (3.6) is Jensen's inequality, and (3.7) is the definition of Rademacher type p. Concerning (3.3), we use the first property of Proposition 3.1 for the conditional expectation E[.|Xi].
Once we plug this estimate into (3.2), we see that for Banach spaces of type p < 2, the situation is radically different than in Hilbert spaces, because the above bound on the expectation of $\|n^{-1}S_n\|$ is of higher order $O(n^{-(1-1/p)})$ than the deviation term of order $O(n^{-\frac12})$.
Still, we have the following facts:

• an Lp space for p ∈ [1, ∞) is of type min(p, 2);

• for p ∈ [1, ∞), the Schatten p-class Sp(H) of operators is of the same type as Lp [25].

Hence, as far as the order in n is concerned, we have a control of the same order (up to constants) as in a Hilbert space for Lp spaces and Sp spaces for 2 ≤ p < ∞. Unfortunately, p = ∞ is radically different as L∞ spaces are of type 1. Note that the above arguments are useless in Banach spaces of type 1 (they do not give a better result than the triangle inequality; any Banach space is at least of type 1).
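To get some intuition on the notion of type, one can look at the canonical basis vectors of R^n with the ℓ^r norm: for any choice of signs, the signed sum has ℓ^r norm n^{1/r}. The sketch below (with a hypothetical helper `type_ratio`) computes the smallest constant C compatible with the type-p inequality for this particular family; it only gives a lower bound on the type constant, not its value.

```python
import numpy as np

def type_ratio(n, r, p, rng, n_signs=64):
    """(E||sum_i eps_i e_i||_r^p / sum_i ||e_i||_r^p)^(1/p) for the canonical basis of R^n."""
    eps = rng.choice([-1.0, 1.0], size=(n_signs, n))
    norms = np.linalg.norm(eps, ord=r, axis=1)   # each row is sum_i eps_i e_i
    return float((np.mean(norms ** p) / n) ** (1.0 / p))

rng = np.random.default_rng(5)
sizes = (4, 16, 64)
ratios_l1 = [type_ratio(n, 1, 2, rng) for n in sizes]  # grows like sqrt(n)
ratios_l2 = [type_ratio(n, 2, 2, rng) for n in sizes]  # stays equal to 1
```

The growth of `ratios_l1` with n shows that no finite constant works for p = 2 in ℓ^1, whereas the ratio is constant in the Hilbert case ℓ^2.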

3.3 Bernstein’s inequality in smooth Banach space


Source: [20].

Lemma 3.8. Let (Xi )i∈JnK be a sequence of Bochner integrable, independent random vari-
ables with values in a separable Banach space B with E[Xi ] = 0 for all i.
Let F : B → R be a function with the following properties:

a) For all x, u ∈ B, u ≠ 0, the function $F_{x,u} : t \in [0,1] \mapsto F(x+tu)$ is twice differentiable.

b) For all x ∈ B, there exists an element $dF_x \in B'$ such that $(F'_{x,u})_{|t=0} = dF_x(u)$ for all u ≠ 0.

c) For all x, u ∈ B, u ≠ 0, it holds $F''_{x,u}(t) \le H(u,t)F(x)$ for some function H(u, t) ≥ 0.

Then denoting $S_n := \sum_{k=1}^n X_k$, it holds (a.s.)
\[
E_n[F(S_n)] \le F(S_{n-1}) \Big( 1 + E\Big[ \int_0^1 H(X_n, t)(1-t)\,dt \Big] \Big),
\]
where $E_n$ denotes expectation conditionally to (X1, . . . , Xn−1).

Proof. Using the assumptions and Taylor's expansion with integral remainder we get
\[
\begin{aligned}
F(S_n) = F(S_{n-1} + X_n) &= F_{S_{n-1},X_n}(1) \\
&= F(S_{n-1}) + dF_{S_{n-1}}(X_n) + \int_0^1 (1-t) F''_{S_{n-1},X_n}(t)\,dt \\
&\le F(S_{n-1}) + dF_{S_{n-1}}(X_n) + F(S_{n-1}) \int_0^1 (1-t) H(X_n,t)\,dt .
\end{aligned}
\]
We now take expectation $E_n$ with respect to $X_n$ conditionally to $S_{n-1}$; due to independence it holds
\[
E_n\big[ dF_{S_{n-1}}(X_n) \big] = dF_{S_{n-1}}(E_n[X_n]) = 0 ,
\]
and we obtain the claim.

Theorem 3.9 (Pinelis). Let $(X_i)_{i\in[\![n]\!]}$ be a sequence of Bochner integrable, independent random variables with values in a separable Banach space B with E[Xi] = 0 for all i.
Denote $S_n = \sum_{i=1}^n X_i$.
Assume Ψ : B → R+ is a function satisfying the following:

1. Ψ(0) = 0 and Ψ is twice Fréchet differentiable;

2. $\|d\Psi_x\|_{op} \le 1$ for all x ∈ B;

3. For a constant D ≥ 1, it holds $\|d^2(\Psi^2)_x\|_{op} \le D^2$ for all x ∈ B.

Then for any λ > 0 such that $E[\exp(\lambda\|X_k\|)] < \infty$ for all k, it holds for all t > 0:
\[
P[\Psi(S_n) > t] \le 2\exp\Big( -\lambda t + D^2 \sum_{k=1}^n A_k \Big), \qquad (3.9)
\]
where $A_k := E[\pi(\lambda\|X_k\|)]$, with $\pi(u) = e^u - 1 - u$.

Proof. Since we want to bound deviations from 0 rather than from the expectation E[Ψ(Sn)], we introduce the symmetrized random variable RΨ(Sn), where R is a Rademacher variable (random sign) independent of Sn.
We start with the usual Chernoff bound: for any λ ≥ 0 and t ≥ 0, and recalling Ψ(.) ≥ 0, it holds
\[
P[\Psi(S_n) > t] = 2P[R\Psi(S_n) > t] \le 2\,\frac{E[\exp(\lambda R\Psi(S_n))]}{\exp(\lambda t)} = 2\,\frac{E[\cosh(\lambda\Psi(S_n))]}{\exp(\lambda t)} ,
\]
where we recall $\cosh(u) = \frac12(e^u + e^{-u})$.


We now want to apply the principle of the previous lemma for the function F (u) =
cosh(λΨ(u)), so Fx,u (t) = cosh(λ(Ψ(x + tu))) = cosh(λG(t)), where G(t) = Ψ(x + tu).
Assumption (a) of Lemma 3.8 is satisfied.

It comes
\[
F'_{x,u}(t) = \lambda\sinh(\lambda G(t))\,G'(t) = \lambda\sinh(\lambda G(t))\,[d\Psi_{x+tu}](u),
\]
which shows that Assumption (b) of Lemma 3.8 is satisfied (with $dF_x = \lambda\sinh(\lambda\Psi(x))\,d\Psi_x$); and
\[
F''_{x,u}(t) = \lambda\sinh(\lambda G(t))\,G''(t) + \lambda^2\cosh(\lambda G(t))\,(G'(t))^2 .
\]
To upper bound the first term, we bound it by 0 if G(t) and G′′(t) are of opposite signs, otherwise we use |sinh(u)| ≤ u cosh(u) for u ≥ 0, thus
\[
F''_{x,u}(t) \le
\begin{cases}
\big(G(t)G''(t) + (G'(t))^2\big)\lambda^2\cosh(\lambda G(t)) & \text{if } G(t)G''(t) \ge 0; \\
(G'(t))^2\lambda^2\cosh(\lambda G(t)) & \text{if } G(t)G''(t) < 0.
\end{cases}
\]
In the first case, we have
\[
G(t)G''(t) + (G'(t))^2 = \frac12 (G^2)''(t) = \frac12 \big(d^2(\Psi^2)_{x+tu}\big)(u,u) \le D^2\|u\|^2 ,
\]
by Assumption 3, and in the second case, by Assumption 2:
\[
(G'(t))^2 = (d\Psi_{x+tu}u)^2 \le \|u\|^2 \le D^2\|u\|^2 ,
\]
since D ≥ 1.
Note that Assumption 2 implies that Ψ is Lipschitz; we use this to get the bound
\[
\begin{aligned}
\cosh(\lambda G(t)) &= \cosh(\lambda\Psi(x+tu)) \\
&\le \cosh(\lambda\Psi(x) + \lambda t\|u\|) \\
&\le \cosh(\lambda\Psi(x))\exp(\lambda t\|u\|) \\
&= F(x)\exp(\lambda t\|u\|),
\end{aligned}
\]
where we have used cosh(a + b) ≤ cosh(a) exp(b) for b ≥ 0. Gathering the previous estimates we obtain (in all cases)
\[
F''_{x,u}(t) \le D^2\lambda^2\|u\|^2\exp(\lambda t\|u\|)F(x),
\]
i.e. Assumption (c) of Lemma 3.8 is satisfied with $H(u,t) = D^2\lambda^2\|u\|^2\exp(\lambda t\|u\|)$. To apply the lemma all that is left is to evaluate
\[
\int_0^1 H(u,t)(1-t)\,dt = D^2\lambda^2\|u\|^2 \int_0^1 (1-t)\exp(\lambda t\|u\|)\,dt = D^2\big(\exp(\lambda\|u\|) - 1 - \lambda\|u\|\big).
\]
0 0

Applying Lemma 3.8 recursively (starting from n backwards to 1), and using Ψ(0) = 0 for the last step, we thus get, with $A_k := E[\exp(\lambda\|X_k\|) - 1 - \lambda\|X_k\|]$:
\[
E[\cosh(\lambda\Psi(S_n))] \le \prod_{k=1}^n (1 + D^2 A_k) \le \exp\Big( D^2 \sum_{k=1}^n A_k \Big),
\]
leading to the announced conclusion.
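The integral evaluated at the end of the proof reduces to the scalar identity $a^2\int_0^1 (1-t)e^{at}\,dt = e^a - 1 - a$ with a = λ∥u∥. A quick numerical check, using a plain midpoint rule (the helper names and the grid size are arbitrary choices):

```python
import numpy as np

def remainder_integral(a, m=200_000):
    """Midpoint-rule approximation of the integral of (1 - t) exp(a t) over [0, 1]."""
    t = (np.arange(m) + 0.5) / m
    return float(np.mean((1.0 - t) * np.exp(a * t)))

def pi_fn(a):
    """pi(a) = exp(a) - 1 - a, as in Theorem 3.9."""
    return float(np.expm1(a) - a)

a_values = (0.3, 1.0, 2.5)
lhs_values = [a ** 2 * remainder_integral(a) for a in a_values]
rhs_values = [pi_fn(a) for a in a_values]
```

The two lists agree up to the discretization error of the quadrature.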

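Corollary 3.10. Under the same assumptions as Theorem 3.9, if for all i = 1, . . . , n, and for all integers k ≥ 2:
\[
E\big[\|X_i\|^k\big] \le \frac{k!}{2D^2}\,\sigma^2 M^{k-2}, \qquad (3.10)
\]
for some constants M, σ > 0, then for all t ≥ 0:
\[
P\Big[ \Psi\Big( \frac1n S_n \Big) > t \Big] \le 2\exp\Big( -n\,\frac{\sigma^2}{M^2}\, h_1\Big( \frac{Mt}{\sigma^2} \Big) \Big),
\]
where $h_1(u) := 1 + u - \sqrt{1+2u}$.
As a consequence, for any x ≥ 0:
\[
P\Big[ \Psi\Big( \frac1n S_n \Big) > \sigma\sqrt{\frac{2x}{n}} + \frac{Mx}{n} \Big] \le 2\exp(-x).
\]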

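Proof. Since $\pi(u) = e^u - 1 - u = \sum_{j\ge2} \frac{u^j}{j!}$, it holds (using Xi/n instead of Xi to account for the sum normalization):
\[
A_i = E[\pi(\lambda\|X_i\|/n)] = \sum_{k\ge2} \frac{\lambda^k}{k!\,n^k} E\big[\|X_i\|^k\big] \le \sum_{k\ge2} \frac{\lambda^k n^{-k}\sigma^2 M^{k-2}}{2D^2} = \frac{1}{D^2}\,\frac{\sigma^2 n^{-2}\lambda^2}{2(1 - \lambda M n^{-1})},
\]
for λ ∈ [0, n/M). Plugging this into (3.9) yields
\[
P\Big[ \Psi\Big( \frac1n S_n \Big) > t \Big] \le 2\exp\Big( -\lambda t + \frac{\sigma^2\lambda^2/n}{2(1 - \lambda M/n)} \Big),
\]
and elementary computations show that
\[
\sup_{\lambda\in(0,1/c)} \Big( \lambda t - \frac{\lambda^2 v}{2(1 - c\lambda)} \Big) = \frac{v}{c^2}\, h_1\Big( \frac{ct}{v} \Big),
\]
which, applied with v = σ²/n and c = M/n, leads to the first claim. It can be checked that $h_1^{-1}(u) = u + \sqrt{2u}$, leading to the second claim.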
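The two "elementary computations" used in this proof can be checked numerically. The sketch below compares the closed form of the supremum with a brute-force maximization over a grid, and verifies the stated inverse of h1; the values of t, v, c and the grid size are arbitrary illustrative choices.

```python
import numpy as np

def h1(u):
    return 1.0 + u - np.sqrt(1.0 + 2.0 * u)

def h1_inv(u):
    return u + np.sqrt(2.0 * u)

def sup_numeric(t, v, c, m=200_001):
    """Grid maximization of lam*t - lam^2 v / (2(1 - c lam)) over lam in [0, 1/c)."""
    lam = np.linspace(0.0, 1.0 / c, m, endpoint=False)
    vals = lam * t - lam ** 2 * v / (2.0 * (1.0 - c * lam))
    return float(vals.max())

t, v, c = 0.7, 1.3, 0.5
sup_grid = sup_numeric(t, v, c)
sup_closed = float(v / c ** 2 * h1(c * t / v))

u_grid = np.linspace(0.0, 10.0, 101)
inverse_error = float(np.max(np.abs(h1(h1_inv(u_grid)) - u_grid)))
```

The grid value never exceeds the closed form and matches it closely, and `inverse_error` is at rounding level.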
The primary application of Pinelis’ concentration inequality is for Hilbert and (certain)
Banach norms. However norms are not differentiable everywhere (in particular not at the
origin). For this, we need the following result in order to slightly weaken the assumption
on Ψ:
Definition 3.11. If B is a Banach space, we call a function Ψ : B → R+ (2, D)-smooth (for some constant D ≥ 1) if it satisfies Ψ(0) = 0 and for any x, u ∈ B:
\[
|\Psi(x+u) - \Psi(x)| \le \|u\|; \qquad (3.11)
\]
\[
\Psi^2(x+u) - 2\Psi^2(x) + \Psi^2(x-u) \le 2D\|u\|^2 . \qquad (3.12)
\]

Proposition 3.12. Theorem 3.9 and Corollary 3.10 also hold for any (2, D)-smooth func-
tion Ψ.
Proof. We only sketch the proof. As a first step, a centered random variable X taking values in a separable Banach space B can be approximated by a sequence $X_k$ such that $X_k$ converges to X in probability, $X_k$ is centered and takes only a finite number of values. This can be seen as follows: put ε = 1/k, there exists a compact $K_\varepsilon$ such that $P[X \notin K_\varepsilon] \le \varepsilon$ by tightness of a probability measure on a separable Banach space. Cover $K_\varepsilon$ by a finite number of closed balls of radius ε. Define $X_\varepsilon = E[X|\mathcal F_\varepsilon]$, where $\mathcal F_\varepsilon$ is the (finite) sigma-algebra generated by those balls. Thus $X_\varepsilon$ only takes a finite number of values, $E[X_\varepsilon] = E[X] = 0$, and $X_\varepsilon \to X$ in probability because
\[
P[\|X - X_\varepsilon\| > 2\varepsilon] \le P[X \notin K_\varepsilon] + P[X \in K_\varepsilon;\ \|X - X_\varepsilon\| > 2\varepsilon] \le \varepsilon .
\]
The second event has probability 0 because on $K_\varepsilon$, $X_\varepsilon$ is the conditional average of X on a partition piece of diameter less than 2ε, so $\|X - E[X|\mathcal F_\varepsilon]\| \le 2\varepsilon$.
As a second step, we establish that on a finite-dimensional Banach space $\tilde B$ (the one generated by the finitely many values of the approximant $X_k$ for fixed k), there exists a sequence $\tilde\Psi_n$ of functions $\tilde B \to R_+$ such that $\tilde\Psi_n(x) \to \Psi(x)$ for all $x \in \tilde B$, and the functions $\tilde\Psi_n$ satisfy (independently of n) conditions 1-2-3 of Theorem 3.9. Namely, let N be a centered Gaussian variable in $\tilde B$ (for instance, take any (finite) basis of $\tilde B$ and a standard normal with respect to that basis), then define
\[
\Psi_\varepsilon(x) = E\big[\Psi^2(x - \varepsilon N)\big]^{\frac12} .
\]
Then by properties of finite-dimensional convolution, since the Gaussian density is $C^\infty$, $\Psi_\varepsilon$ is $C^\infty$, $\Psi_\varepsilon(x) \to \Psi(x)$ as ε → 0, and we have:
\[
\begin{aligned}
\big| d(\Psi_\varepsilon^2)_x(v) \big| &\le \limsup_{t\to0} \frac1t E\big[ \big|\Psi^2(x + tv - \varepsilon N) - \Psi^2(x - \varepsilon N)\big| \big] \\
&= \limsup_{t\to0} \frac1t E\big[ |\Psi(x + tv - \varepsilon N) - \Psi(x - \varepsilon N)|\,(\Psi(x + tv - \varepsilon N) + \Psi(x - \varepsilon N)) \big] \\
&\le \|v\| \limsup_{t\to0} E[\Psi(x + tv - \varepsilon N) + \Psi(x - \varepsilon N)] \\
&= 2\|v\|\,E[\Psi(x - \varepsilon N)],
\end{aligned}
\]
where we have used (3.11) for the second inequality. Thus:
\[
|d(\Psi_\varepsilon)_x(v)| = \frac{\big| d(\Psi_\varepsilon^2)_x(v) \big|}{2\Psi_\varepsilon(x)} \le \|v\|\,\frac{E[\Psi(x - \varepsilon N)]}{E[\Psi^2(x - \varepsilon N)]^{\frac12}} \le \|v\| .
\]
This proves that $\|(d\Psi_\varepsilon)_x\|_{op} \le 1$ for any x.
Furthermore, since we know that $\Psi_\varepsilon$ is $C^\infty$, (3.12) implies by standard (Taylor expansion) arguments that $(d^2(\Psi_\varepsilon^2)_x)(v,v) \le D\|v\|^2$, i.e. $\|d^2(\Psi_\varepsilon^2)_x\|_{op} \le D$ for any x. Thus points 1-2-3 of Theorem 3.9 hold for $\tilde\Psi_n = \Psi_{1/n} - \Psi_{1/n}(0)$, and since $\tilde\Psi_n \to \Psi$ pointwise the conclusion of the theorem holds for Ψ as well.

A particular case of interest is for bounded random vectors. If X is a centered, bounded random vector in Hilbert space, with ∥X∥ ≤ M and E[X ⊗ X∗] = Σ its covariance operator, then
\[
E\big[\|X\|^k\big] \le M^{k-2} E\big[\|X\|^2\big] = M^{k-2} E[\mathrm{Tr}(X \otimes X^*)] = M^{k-2}\,\mathrm{Tr}(\Sigma),
\]
so (3.10) is satisfied with σ² = Tr(Σ) (and M = M; and D = 1 for a Hilbert space).


As in the previous section, it is legitimate to ask: apart from Hilbert spaces, which known spaces are (2, D)-smooth? We have the following facts:

• any Lp space over a measure space is (2, p − 1)-smooth for p ≥ 2 [20] (in particular
the ℓp norm on finite vectors or on sequences);

• using a closely related (and more standard) notion of smoothness (for a connection
between the two notions see e.g. [11], eq. (9) ), it was established by [25] that the
p-Schatten classes share the same ”modulus of smoothness” (up to constant factors)
as an Lp space, for 1 ≤ p < ∞.

3.3.1 Discussion
TODO

3.4 Bernstein’s inequality in operator norm


Sources: [26, 17]
As we have noted from the previous section, the vector-Hoeffding’s inequality and
Pinelis’ inequality do not apply to L∞ spaces, nor to B(H) with the operator norm. Because
the latter space is fundamental, we now turn to concentration results on B(H). We will
be adapting the ”Matrix Bernstein inequalities” from [26] (see also M. Lerasle’s notes
[13], Chap. 4) to the operator setting. Note that we will only concentrate on self-adjoint
operators in this section.
There are two issues with extending the arguments of Matrix Bernstein concentration
in infinite-dimensional space:

• they rely on inequalities relating traces and functional calculus for matrices (”Trace
inequalities”), in particular Lieb’s inequality. One has to be somewhat careful that
the traces exist for operators and that the arguments can be carried over.

• the most basic inequalities involve the matrix dimension, which won't be possible for operators over an infinite-dimensional Hilbert space. For this reason we will look at refined concentration inequalities using the intrinsic dimension rather than the ambient dimension.

Theorem 3.13 (Matrix Bernstein's inequality with intrinsic dimension). Let X1, . . . , Xn be independent random matrices of the same size with E[Xk] = 0 and ∥Xk∥op ≤ L for all k. Denote $S_n = \sum_{i=1}^n X_i$.
Assume there are two matrices V1, V2 satisfying
\[
V_1 \succeq E[S_n S_n^*] = \sum_{i=1}^n E[X_i X_i^*];
\]
\[
V_2 \succeq E[S_n^* S_n] = \sum_{i=1}^n E[X_i^* X_i];
\]
(where for Hermitian matrices, A ⪰ B ⇔ (A − B) positive semidefinite).
Assume $\max(\|V_1\|_{op}, \|V_2\|_{op}) \le \sigma^2$ and let $d := \frac{\mathrm{Tr}\,V_1 + \mathrm{Tr}\,V_2}{\sigma^2}$.
Then for t ≥ σ + L/3, it holds
\[
P\big[\|S_n\|_{op} > t\big] \le 8d\,\exp\Big( -\frac{t^2/2}{\sigma^2 + Lt/3} \Big).
\]
Let us establish that this result extends to operators.

Corollary 3.14 (Operator Bernstein's inequality). Let X1, . . . , Xn be independent, random self-adjoint elements of B(H). Assume ∥Xk∥op ≤ L for all k (thus $X_k$, $X_k^2$ are Bochner-integrable in B(H)) and that E[Xk] = 0.
Denote $S_n = \sum_{i=1}^n X_i$.
Assume that there exist two positive self-adjoint trace-class operators V1, V2 such that
\[
V_1 \succeq E[S_n S_n^*] = \sum_{i=1}^n E[X_i X_i^*]; \qquad (3.13)
\]
\[
V_2 \succeq E[S_n^* S_n] = \sum_{i=1}^n E[X_i^* X_i] \qquad (3.14)
\]
(where for self-adjoint operators A ⪰ B ⇔ (A − B) positive operator).
Assume $\max(\|V_1\|_{op}, \|V_2\|_{op}) \le \sigma^2$ and let $d := \frac{\mathrm{Tr}\,V_1 + \mathrm{Tr}\,V_2}{\sigma^2}$.
Then for t ≥ σ + L/3, it holds
\[
P\big[\|S_n\|_{op} > t\big] \le 8d\,\exp\Big( -\frac{t^2/2}{\sigma^2 + Lt/3} \Big).
\]
Consequently, for any δ ∈ (0, 1), it holds with probability at least 1 − δ:
\[
\Big\| \frac1n S_n \Big\|_{op} \le \sqrt{\frac{2\|V\|_{op}\beta}{m}} + \frac{2L\beta}{3n}, \qquad \beta = \log\big(8d\delta^{-1}\big).
\]
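The tail bound can be illustrated by simulation in the matrix case. The sketch below makes several assumptions that are not in the text: the summands are $X_i = \varepsilon_i A_i$ with fixed symmetric matrices $A_i$ of operator norm 1 and independent random signs $\varepsilon_i$, so that $E[X_i^2] = A_i^2$ and one can take $V_1 = V_2 = \sum_i A_i^2$; the deviation level t = 4σ is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(6)
dim, n, trials = 5, 200, 2000
L = 1.0

def random_symmetric_unit(rng, dim):
    """Random symmetric matrix rescaled to operator norm 1."""
    G = rng.standard_normal((dim, dim))
    A = (G + G.T) / 2
    return A / np.linalg.norm(A, 2)

A_list = np.stack([random_symmetric_unit(rng, dim) for _ in range(n)])

# V = sum_i A_i^2 = sum_i E[X_i^2]; here V1 = V2 = V.
V = np.einsum("nij,njk->ik", A_list, A_list)
sigma2 = float(np.linalg.norm(V, 2))
d = 2.0 * float(np.trace(V)) / sigma2

t = 4.0 * np.sqrt(sigma2)                    # satisfies t >= sigma + L/3 here
bound = 8.0 * d * np.exp(-(t ** 2 / 2.0) / (sigma2 + L * t / 3.0))

# Monte Carlo estimate of P[||S_n||_op > t].
eps = rng.choice([-1.0, 1.0], size=(trials, n))
S = np.einsum("tn,nij->tij", eps, A_list)
op_norms = np.linalg.norm(S, ord=2, axis=(1, 2))
empirical = float(np.mean(op_norms > t))
```

The empirical frequency stays below `bound`; the gap gives an idea of how conservative the inequality is on this example.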

Remark: Condition (3.13) implies by positivity of expectation that, in fact, Xi must
be Hilbert-Schmidt operators. We have not required this explicitly in the assumption, to
avoid splitting hairs about what space the Xi s are Bochner integrable in. The condition
∥Xk ∥op ≤ L is enough to ensure Bochner integrability in B(H), but Bochner integrability
in HS(H) is not formally required (note that the Xi s may be unbounded in HS(H)).
We use the same device as in the proof of Proposition 3.12, which we rewrite for
convenience as a separate lemma:

Lemma 3.15. Let X be a Bochner-integrable random variable taking values in the space K(H, H′) of compact operators from H to H′, with E[X] = 0. Then there exists a sequence of random variables $X^{(k)}$ converging in probability to X for ∥.∥op, i.e.
\[
\forall t > 0: \quad \lim_{k\to\infty} P\big[ \| X^{(k)} - X \|_{op} \ge t \big] = 0,
\]
and such that:

• $E[X^{(k)}] = 0$;

• $X^{(k)}$ only takes a finite number of different values in a subspace $K^{(k)}$ of operators, such that there exist two finite-dimensional subspaces $E^{(k)}$, $F^{(k)}$ of H and H′ with
\[
\forall A \in K^{(k)}: \quad \mathrm{Ker}(A) \subseteq (E^{(k)})^\perp; \quad \mathrm{Ran}(A) \subseteq F^{(k)};
\]
(in other words $P_{E^{(k)}} A P_{F^{(k)}} = A$ for all $A \in K^{(k)}$, where $P_{E^{(k)}}$, $P_{F^{(k)}}$ are the orthogonal projections onto $E^{(k)}$, $F^{(k)}$).

• If ∥X∥op ≤ M a.s. for some constant M, then $\|X^{(k)}\|_{op} \le M$ a.s. as well.

• If $XX^*$ and $X^*X$ are Bochner integrable, then $E[X^{(k)}(X^{(k)})^*] \preceq P_{E^{(k)}} E[XX^*] P_{E^{(k)}}$, and $E[(X^{(k)})^* X^{(k)}] \preceq P_{F^{(k)}} E[X^*X] P_{F^{(k)}}$.

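Proof. For a fixed k, put ε = 1/k. There exists a compact $K_\varepsilon$ such that $P[X \notin K_\varepsilon] \le \varepsilon$ by tightness of a probability measure on a separable Banach space. Cover $K_\varepsilon$ by a finite number of closed balls of radius ε. Define $X_\varepsilon = E[X|\mathcal F_\varepsilon]$, where $\mathcal F_\varepsilon$ is the (finite) sigma-algebra generated by those balls. Thus $X_\varepsilon$ only takes a finite number of values, and $E[X_\varepsilon] = E[X] = 0$ by the properties of conditional expectation.
Let $(A_1, \dots, A_{N_\varepsilon})$ be the finite set of values taken by $X_\varepsilon$. Since these are compact operators, there exist $(B_1, \dots, B_{N_\varepsilon})$ such that $B_i$ is finite-rank and $\|B_i - A_i\|_{op} \le \varepsilon$ for all $i \in \{1, \dots, N_\varepsilon\}$. Let $E_\varepsilon = \big( \bigcap_{i\le N_\varepsilon} \mathrm{Ker}(B_i) \big)^\perp$ and $F_\varepsilon = \mathrm{span}[\mathrm{Ran}(B_i), i \le N_\varepsilon]$. Then because $\mathrm{Ker}(B_i)$ is of finite codimension and $\mathrm{Ran}(B_i)$ is of finite dimension, $E_\varepsilon$ and $F_\varepsilon$ are of finite dimension; and we have $P_{E_\varepsilon} B_i P_{F_\varepsilon} = B_i$ for all $i \le N_\varepsilon$.
Define now $\tilde X_\varepsilon = P_{E_\varepsilon} X_\varepsilon P_{F_\varepsilon}$. Then $P_{E_\varepsilon} \tilde X_\varepsilon P_{F_\varepsilon} = \tilde X_\varepsilon$, $E[\tilde X_\varepsilon] = 0$ by linearity, and it holds
\[
P\big[ \|\tilde X_\varepsilon - X\|_{op} > 4\varepsilon \big] \le P\big[ \|\tilde X_\varepsilon - X_\varepsilon\|_{op} > 2\varepsilon \big] + P\big[ \|X_\varepsilon - X\|_{op} > 2\varepsilon \big]. \qquad (3.15)
\]
Concerning the first term, we have
\[
\begin{aligned}
\|\tilde X_\varepsilon - X_\varepsilon\|_{op} &\le \sup_{i\le N_\varepsilon} \|P_{E_\varepsilon} A_i P_{F_\varepsilon} - A_i\|_{op} \\
&\le \sup_{i\le N_\varepsilon} \big( \|P_{E_\varepsilon} A_i P_{F_\varepsilon} - B_i\|_{op} + \|B_i - A_i\|_{op} \big) \\
&\le \varepsilon + \sup_{i\le N_\varepsilon} \|P_{E_\varepsilon}(A_i - B_i)P_{F_\varepsilon}\|_{op} \\
&\le \varepsilon + \sup_{i\le N_\varepsilon} \|A_i - B_i\|_{op} \\
&\le 2\varepsilon,
\end{aligned}
\]
hence the first probability is zero. Concerning the second term in (3.15):
\[
P\big[ \|X_\varepsilon - X\|_{op} > 2\varepsilon \big] \le P\big[ X \in K_\varepsilon;\ \|X_\varepsilon - X\|_{op} > 2\varepsilon \big] + P[X \notin K_\varepsilon] \le \varepsilon .
\]
The first event above has probability 0 because on $K_\varepsilon$, $X_\varepsilon$ is the conditional average of X on a partition piece of diameter less than 2ε, so $\|X - E[X|\mathcal F_\varepsilon]\| \le 2\varepsilon$. This proves that $X^{(k)} := \tilde X_{1/k}$ converges in probability to X for ∥.∥op.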
Let us turn to the additional claims on boundedness and second moment. Since $X_\varepsilon$ is defined as a conditional expectation of X, it inherits the boundedness property $\|X_\varepsilon\|_{op} \le M$, and it holds $\|\tilde X_\varepsilon\|_{op} = \|P_{E_\varepsilon} X_\varepsilon P_{F_\varepsilon}\|_{op} \le \|X_\varepsilon\|_{op} \le M$. Concerning the variance, first note that mimicking the usual argument for vector-valued variables, in general for an operator-valued random variable Z such that $ZZ^*$ is Bochner integrable, it holds
\[
E[ZZ^*] - E[Z]E[Z]^* = E\big[ (Z - E[Z])(Z - E[Z])^* \big] \succeq 0,
\]
so $E[ZZ^*] \succeq E[Z]E[Z]^*$; and this also holds for conditional expectations. Because $X_\varepsilon$ is a conditional expectation of X, we therefore have $E[X_\varepsilon X_\varepsilon^*] \preceq E[XX^*]$. Finally, we have $\tilde X_\varepsilon = P X_\varepsilon P'$ for two orthogonal projectors P, P′.
In general, $A^*PA \preceq A^*A$. Namely, it holds for any u:
\[
\langle A^*PPAu, u\rangle = \|PAu\|^2 \le \|Au\|^2 = \langle A^*Au, u\rangle .
\]
Similarly, it is easy to check that if A ⪯ B, then PAP ⪯ PBP. Hence $X_\varepsilon P' X_\varepsilon^* \preceq X_\varepsilon X_\varepsilon^*$, implying $E[\tilde X_\varepsilon \tilde X_\varepsilon^*] \preceq P\,E[XX^*]\,P$; the argument is similar to establish $E[\tilde X_\varepsilon^* \tilde X_\varepsilon] \preceq P' E[X^*X] P'$.
Proof of Corollary 3.14. We use the construction of Lemma 3.15 for X1, . . . , Xn, resulting in approximants $X_1^{(k)}, \dots, X_n^{(k)}$. Since each $X_i^{(k)}$ only depends on $X_i$, the independence of the $X_i$s carries over to independence of the $X_i^{(k)}$s (for fixed k). Furthermore, while in principle the finite dimensional spaces $K^{(k)}$, $E^{(k)}$, $F^{(k)}$ in the construction of Lemma 3.15 depend on i, obviously we can assume that they are common to all indices by replacing them by the respective linear span of their union over i = 1, . . . , n. By the same token we can actually assume $E^{(k)} = F^{(k)}$; let us denote $P^{(k)}$ the orthogonal projector on $E^{(k)}$. Furthermore Lemma 3.15 guarantees $\|X_i^{(k)}\|_{op} \le L$ and
\[
\sum_{i=1}^n E\big[ (X_i^{(k)})^2 \big] \preceq P^{(k)} \Big( \sum_{i=1}^n E\big[X_i^2\big] \Big) P^{(k)} \preceq P^{(k)} V P^{(k)} =: V^{(k)} .
\]
Observe that $\mathrm{Tr}(V^{(k)}) = \mathrm{Tr}(P^{(k)}V) \le \|P^{(k)}\|_{op}\,\mathrm{Tr}|V| = \mathrm{Tr}\,V$ since V is positive; and $\|V^{(k)}\|_{op} = \|P^{(k)} V P^{(k)}\|_{op} \le \|V\|_{op} \le \sigma^2$. To summarize, for fixed k the approximant variables $X_1^{(k)}, \dots, X_n^{(k)}$ are independent, self-adjoint operators that are null on the orthogonal of the finite-dimensional subspace $E^{(k)}$, hence can be conceived as finite-dimensional Hermitian matrices acting on $E^{(k)}$; this is also the case for $V^{(k)}$. We can thus apply Theorem 3.13 to these variables, resulting in
\[
P\big[ \|S_n^{(k)}\|_{op} > t \big] \le 8d\,\exp\Big( -\frac{t^2/2}{\sigma^2 + Lt/3} \Big),
\]
where $S_n^{(k)} = \sum_{i=1}^n X_i^{(k)}$. Since $X_i^{(k)}$ converges in probability to $X_i$, so does $S_n^{(k)}$ to $S_n$, yielding the claim.

To prove Theorem 3.13, we concentrate on the Hermitian (= self-adjoint) case; for an extension to the general case see [26]. We want to use a "matrix version" of Chernoff's method, but a critical point is that it does not hold that exp(A + B) = exp(A) exp(B) in general (unless A and B commute). One can think of it the following way: exp(A + B) = exp(B + A), but it would seem strange that exp(A) and exp(B) commute if A and B don't (and indeed, this is not true). We will actually be interested in the trace of the exponential, and it turns out that Tr exp(A + B) ≤ Tr(exp(A) exp(B)) (Golden-Thompson inequality), but unfortunately this does not extend to more than 2 matrices.
It turns out that a convenient central tool for the development of the "matrix Chernoff method" is the following:
Theorem 3.16 (Lieb's inequality). Let H be a Hermitian matrix with dimension d, then the function
\[
A \mapsto \mathrm{Tr}\exp(H + \log(A)) \qquad (3.16)
\]
is concave on the cone $C_d$ of d × d positive-definite matrices to R.
For a proof, see e.g. [26]. From this, we deduce the principal device underlying the matrix Chernoff method:
Proposition 3.17. Let X1, . . . , Xn be independent random Hermitian matrices of the same dimension. Then it holds for any real λ:
\[
E\Big[ \mathrm{Tr}\exp\Big( \lambda \sum_{i=1}^n X_i \Big) \Big] \le \mathrm{Tr}\exp\Big( \sum_{i=1}^n \log E[\exp\lambda X_i] \Big).
\]
Proof. We take successively expectation with respect to X1, . . . , Xn. Assume that after k − 1 steps, we have established
\[
E\Big[ \mathrm{Tr}\exp\Big( \lambda \sum_{i=1}^n X_i \Big) \Big| X_k, \dots, X_n \Big] \le \mathrm{Tr}\exp\Big( \sum_{i=1}^{k-1} \xi_i + \lambda \sum_{i=k}^n X_i \Big), \qquad (3.17)
\]
where $\xi_i := \log E[\exp\lambda X_i]$. Putting $H_k = \sum_{i=1}^{k-1} \xi_i + \lambda \sum_{i=k+1}^n X_i$ and $A_k = \exp\lambda X_k$, we use Jensen's inequality and the concavity property of the function defined in (3.16) to obtain, when taking expectation with respect to $X_k$:
\[
E[\mathrm{Tr}\exp(H_k + \log(\exp\lambda X_k)) \,|\, X_{k+1}, \dots, X_n] \le \mathrm{Tr}\exp(H_k + \log(E[\exp\lambda X_k]));
\]
combining with (3.17), and replacing the value of $H_k$ we get (3.17) for k ← (k + 1). We conclude by a straightforward recursion.
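The non-commutativity issue and the Golden-Thompson inequality mentioned before Theorem 3.16 are easy to observe numerically. The sketch below uses hypothetical random symmetric matrices and a helper `expm_sym` that computes the exponential of a symmetric matrix through its eigendecomposition.

```python
import numpy as np

def expm_sym(A):
    """Matrix exponential of a real symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.exp(w)) @ V.T

rng = np.random.default_rng(7)
d = 4
G, H = rng.standard_normal((d, d)), rng.standard_normal((d, d))
A = (G + G.T) / 2
B = (H + H.T) / 2

exp_sum = expm_sym(A + B)
exp_prod = expm_sym(A) @ expm_sym(B)

# exp(A + B) differs from exp(A) exp(B) when A and B do not commute ...
noncommutative_gap = float(np.linalg.norm(exp_sum - exp_prod))
# ... but the traces are ordered (Golden-Thompson).
trace_sum = float(np.trace(exp_sum))
trace_prod = float(np.trace(exp_prod))
```

For generic A and B the gap is clearly nonzero, while `trace_sum` does not exceed `trace_prod`.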
In order to establish Theorem 3.13, we will need the following properties on Hermitian
matrix functional calculus (defined in the same way as operator functional calculus, see
Section 2.4):
Proposition 3.18. Let A, B be Hermitian matrices. We denote A ⪯ B, resp. A ≺ B iff
B − A is positive semidefinite, resp. positive definite.
1. If f is nondecreasing on the union of the spectra of A and B, and A ⪯ B, then
Tr f (A) ≤ Tr f (B).

2. If f ≤ g on the spectrum of A, then f (A) ⪯ g(A).

3. If 0 ≺ A ⪯ B, then log(A) ⪯ log(B).


Proof. For the first point, use the Courant-Fischer "max-min" theorem: for a Hermitian matrix A, denote $\lambda_i(A)$ the i-th largest eigenvalue of A (counted with multiplicity), then it holds
\[
\lambda_i(A) = \max_{V:\,\dim(V)=i}\ \min_{u\in V,\,\|u\|=1} \langle u, Au\rangle,
\]
where the maximum runs over linear subspaces of dimension i. From this it follows that
\[
\begin{aligned}
A \preceq B &\Rightarrow \forall i\ \ \lambda_i(A) \le \lambda_i(B) \\
&\Rightarrow \forall i\ \ \lambda_i(f(A)) = f(\lambda_i(A)) \le f(\lambda_i(B)) = \lambda_i(f(B)) \\
&\Rightarrow \mathrm{Tr}(f(A)) \le \mathrm{Tr}(f(B))
\end{aligned}
\]
(note that we have used the fact that f is nondecreasing to justify each of the relations in the second implication).
For the second point, let $(\lambda_i, e_i)$ be an eigendecomposition of A, then we have for any i: $\langle e_i, f(A)e_i\rangle = f(\lambda_i) \le g(\lambda_i) = \langle e_i, g(A)e_i\rangle$. It follows f(A) ⪯ g(A).
The third point is non-trivial: it is not true that a monotone function is "operator monotone", in general. But it is true for log. See [26] for a short proof.
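The difference between trace monotonicity (point 1) and operator monotonicity (point 3) can be seen on small examples. In the sketch below the random matrices and helper names are hypothetical; the 2 × 2 pair at the end is a standard example showing that x ↦ x² is monotone on the positive reals but not operator monotone.

```python
import numpy as np

def apply_spectral(f, A):
    """Functional calculus f(A) for a real symmetric matrix A."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.T

def is_psd(M, tol=1e-9):
    """Check positive semidefiniteness up to a numerical tolerance."""
    return bool(np.linalg.eigvalsh((M + M.T) / 2).min() >= -tol)

rng = np.random.default_rng(8)
d = 4
G, H = rng.standard_normal((d, d)), rng.standard_normal((d, d))
A = G @ G.T + 0.1 * np.eye(d)   # positive definite
B = A + H @ H.T                 # A <= B in the semidefinite order

# Point 1: trace monotonicity for the nondecreasing function exp.
trace_monotone = bool(np.trace(apply_spectral(np.exp, A)) <= np.trace(apply_spectral(np.exp, B)))
# Point 3: log is operator monotone.
log_monotone = is_psd(apply_spectral(np.log, B) - apply_spectral(np.log, A))

# The square function is not operator monotone.
A2 = np.array([[1.0, 1.0], [1.0, 1.0]])
B2 = np.array([[2.0, 1.0], [1.0, 1.0]])
order_holds = is_psd(B2 - A2)
square_monotone = is_psd(B2 @ B2 - A2 @ A2)
```

Here `order_holds` is true but `square_monotone` is false, whereas the checks for exp (in trace) and log (in the operator order) succeed.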

Lemma 3.19. If X is a random Hermitian matrix such that its spectrum is upper bounded by L and E[X] = 0, then, for λ ∈ [0, 3/L):
\[
\log E[\exp\lambda X] \preceq \frac{\lambda^2/2}{1 - \lambda L/3}\, E\big[X^2\big].
\]
Proof. We start with defining ψ(x) := exp(x) − x − 1 and f(x) := ψ(x)/x². By inspection of its series expansion, it holds that f(x) is nondecreasing and that $f(x) \le g(x) := \frac{1}{2(1 - x/3)}$ for 0 ≤ x < 3. Thus for x ≤ L and λ ∈ [0, 3/L):
\[
\psi(\lambda x) \le \lambda^2 x^2 g(\lambda L).
\]
Using point 2 of Proposition 3.18, it follows that for a Hermitian matrix X such that its spectrum is upper bounded by L, it holds
\[
\psi(\lambda X) \preceq \lambda^2 X^2 g(\lambda L).
\]
Taking the expectation (and using positivity of expectation) yields
\[
E[\psi(\lambda X)] = E[\exp(\lambda X)] - I \preceq \frac{\lambda^2/2}{1 - \lambda L/3}\, E\big[X^2\big].
\]
Using operator-monotonicity of log (Point 3 of Proposition 3.18), then log(1 + u) ≤ u and Point 2 of the proposition again, we get
\[
\log E[\exp(\lambda X)] \preceq \log\Big( I + \frac{\lambda^2/2}{1 - \lambda L/3}\, E\big[X^2\big] \Big) \preceq \frac{\lambda^2/2}{1 - \lambda L/3}\, E\big[X^2\big].
\]
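The scalar facts used in this proof, and the resulting matrix inequality, can be checked on a simple instance. The sketch below assumes X = εA with ε a random sign and A a fixed symmetric matrix of operator norm L = 1 (a hypothetical example), so that E[exp(λX)] = cosh(λA) and E[X²] = A² can be computed exactly.

```python
import numpy as np

def f(x):
    """f(x) = (exp(x) - x - 1) / x^2, for x != 0."""
    return (np.expm1(x) - x) / x ** 2

def g(x):
    """g(x) = 1 / (2 (1 - x/3)), for x < 3."""
    return 1.0 / (2.0 * (1.0 - x / 3.0))

x = np.linspace(0.01, 2.99, 1000)
scalar_bound_ok = bool(np.all(f(x) <= g(x)))
f_nondecreasing = bool(np.all(np.diff(f(x)) >= 0.0))

# Matrix instance: X = eps * A with a Rademacher sign eps.
rng = np.random.default_rng(9)
d, L, lam = 4, 1.0, 1.5            # lam lies in [0, 3/L)
G = rng.standard_normal((d, d))
A = (G + G.T) / 2
A = L * A / np.linalg.norm(A, 2)

w, V = np.linalg.eigh(A)
log_mgf = (V * np.log(np.cosh(lam * w))) @ V.T       # log E[exp(lam X)]
upper = (lam ** 2 / 2.0) / (1.0 - lam * L / 3.0) * (A @ A)
gap_min_eig = float(np.linalg.eigvalsh(upper - log_mgf).min())
```

A nonnegative `gap_min_eig` (up to rounding) means that the bound of the lemma holds in the semidefinite order for this instance.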

We can now turn to the proof of Theorem 3.13.


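Proof of Theorem 3.13. Applying Lemma 3.19 repeatedly, we get for λ ∈ [0, 3/L):
\[
\sum_{i=1}^n \log E[\exp\lambda X_i] \preceq \frac{\lambda^2/2}{1 - \lambda L/3} \sum_{i=1}^n E\big[X_i^2\big] \preceq g(\lambda)V,
\]
where $g(\lambda) = \frac{\lambda^2/2}{1 - \lambda L/3}$. Using point 1 of Proposition 3.18, it comes
\[
\mathrm{Tr}\exp\Big( \sum_{i=1}^n \log E[\exp\lambda X_i] \Big) \le \mathrm{Tr}\exp(g(\lambda)V).
\]
Combining the above with Proposition 3.17, we obtain a bound on the "matrix Laplace transform"
\[
\mathrm{Tr}\,E[\exp\lambda S_n] \le \mathrm{Tr}\exp(g(\lambda)V). \qquad (3.18)
\]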

First, we relate the left-hand side of (3.18) to the probability of deviation of λmax (Sn ),
where λmax denotes largest eigenvalue. In general, if ψ is a nondecreasing function from R
to R+ , it holds:

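\[ \mathbb{P}[\lambda_{\max}(S_n) > t] \le \mathbb{P}[\psi(\lambda_{\max}(S_n)) \ge \psi(t)] \le \frac{\mathbb{E}[\psi(\lambda_{\max}(S_n))]}{\psi(t)} \le \frac{\operatorname{Tr} \mathbb{E}[\psi(S_n)]}{\psi(t)}. \]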

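We apply this to the function ψ(t) = exp(λt) − λt − 1 (with λ > 0), use E[Sn] = 0 and
combine with (3.18) to obtain
\[ \mathbb{P}[\lambda_{\max}(S_n) > t] \le \frac{\operatorname{Tr} \mathbb{E}[\exp(\lambda S_n) - I]}{\exp(\lambda t) - \lambda t - 1} \le \frac{\operatorname{Tr}(\exp(g(\lambda)V) - I)}{\exp(\lambda t) - \lambda t - 1}. \tag{3.19} \]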

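Second, we further bound the above right-hand side. For this, denote $\tilde\psi(t) = \exp(t) - 1$; we
note that since $\tilde\psi$ is convex with $\tilde\psi(0) = 0$, it holds $\tilde\psi(t) \le \frac{t}{M}\tilde\psi(M)$ for $t \in [0, M]$. Since $\|V\| \le \sigma^2$, we deduce
by points 1-2 of Proposition 3.18 that
\[ \operatorname{Tr}(\exp(g(\lambda)V) - I) \le \frac{\operatorname{Tr}(g(\lambda)V)}{g(\lambda)\sigma^2}\, \tilde\psi(\sigma^2 g(\lambda)) \le d \exp(\sigma^2 g(\lambda)). \tag{3.20} \]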

Thus, combining with (3.19), we finally arrive at the estimate

\[ \mathbb{P}[\lambda_{\max}(S_n) > t] \le d\, \frac{\exp(\sigma^2 g(\lambda))}{\exp(\lambda t) - \lambda t - 1}. \tag{3.21} \]

The rest is just somewhat tedious estimates. We rewrite the expression in the above bound
as
\[ \frac{\exp(\sigma^2 g(\lambda))}{\exp(\lambda t) - \lambda t - 1} = \frac{\exp(\lambda t)}{\exp(\lambda t) - \lambda t - 1} \exp(\sigma^2 g(\lambda) - \lambda t) \le \Big(1 + \frac{3}{\lambda^2 t^2}\Big) \exp(\sigma^2 g(\lambda) - \lambda t), \]
where we used $\frac{e^u}{e^u - u - 1} \le 1 + \frac{3}{u^2}$ for u > 0, the proof of which is left to the reader. Next, we
choose $\lambda = t/(\sigma^2 + Lt/3)$. If t ≥ σ + L/3, it can be checked (again, left to the reader) that
with this choice of λ it holds $1 + \frac{3}{\lambda^2 t^2} \le 4$, and substituting the value of λ into g(λ) in
(3.21) yields
\[ \mathbb{P}[\lambda_{\max}(S_n) > t] \le 4d \exp\Big(-\frac{t^2/2}{\sigma^2 + Lt/3}\Big); \]
we conclude by a union bound with a similar control for $\mathbb{P}[\lambda_{\max}(-S_n) > t]$.
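The two computations left to the reader in the last step can be sanity-checked numerically. The sketch below, in plain Python, evaluates the exponent and the prefactor at the choice λ = t/(σ² + Lt/3); the numerical values of σ², L, d and t are arbitrary illustrative assumptions, and the function names are not from the notes.

```python
import math

def g(lam, L):
    """g(lam) = (lam^2 / 2) / (1 - lam L / 3), valid for lam < 3 / L."""
    return (lam ** 2 / 2.0) / (1.0 - lam * L / 3.0)

def bernstein_lambda(t, sigma2, L):
    """The choice lam = t / (sigma^2 + L t / 3) made in the proof."""
    return t / (sigma2 + L * t / 3.0)

def tail_bound(t, d, sigma2, L):
    """Final bound 4 d exp(-(t^2 / 2) / (sigma^2 + L t / 3))."""
    return 4.0 * d * math.exp(-(t ** 2 / 2.0) / (sigma2 + L * t / 3.0))

# Illustrative values (assumptions), with t >= sigma + L/3 as required.
sigma2, L, d = 2.0, 0.5, 10
t = math.sqrt(sigma2) + L / 3.0 + 1.0

lam = bernstein_lambda(t, sigma2, L)
exponent = sigma2 * g(lam, L) - lam * t            # sigma^2 g(lam) - lam t
prefactor = 1.0 + 3.0 / (lam * t) ** 2             # should be at most 4
# Intermediate bound (3.21) evaluated at this lam:
bound_321 = d * math.exp(sigma2 * g(lam, L)) / (math.exp(lam * t) - lam * t - 1.0)
```

With this λ one has 1 − λL/3 = σ²/(σ² + Lt/3), which is why the exponent collapses to −(t²/2)/(σ² + Lt/3).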

4 Spectral regularization methods
Many sources on this theme, including [1, 6, 3, 16, 15, 21]. . .

4.1 Setting
In this chapter, we will consider a problem of linear regression with random design where
the covariate X lies in a Hilbert space, of the form

Y = ⟨f ∗ , X⟩ + ξ, (4.1)

where f* is an unknown element of a Hilbert space H, and X is a random variable taking
values in H.
This model is relevant in particular for Functional Data Analysis (FDA) and reproducing
kernel Hilbert space (rkHs) methods. We will come back to the latter particular setting
later, but for now we will focus on the abstract model (4.1) and assume we ”observe” the
data in Hilbert space, putting computational feasibility aside.
We will consider the following simple distributional assumptions:

Assumption 4.1 (Distribution assumptions).

• We observe n i.i.d. data points (Xi, Yi)i∈⟦n⟧ following the model (4.1). The unknown joint
distribution on H × R is denoted P. Its marginal on H (X-marginal) is denoted ρ.
• The covariate is bounded: ∥X∥ ≤ κ (ρ-a.s.).
• The noise ξ satisfies E[ξ|X] = 0.
• The output variable Y is bounded: |Y| ≤ M (P-a.s.).

Note that the latter point implies |⟨X, f*⟩| ≤ M as well and thus |ξ| ≤ 2M, P-a.s.

These assumptions are quite restrictive and can be significantly weakened, as is done in the
literature, at the price of a more refined analysis. We will study this setting here for simplicity.

In the analysis to come, a lot of constants will depend on the parameters appearing in
the assumptions (such as κ and M above; there will be more later). To avoid a cumbersome
tracking of the effect of the constants, we will often use the notation C▲ to denote a number
implicitly depending on “less important” parameters in the assumptions. For this section
C▲ will be a positive number only depending on (κ, M). Note that the value of C▲ might
change in different contexts and even change from line to line!
We first need to introduce some notation which will be the infinite-dimensional analogue
of quantities appearing in traditional linear regression. For this we will need to introduce
the Hilbert space L²(H, ρ), whose norm and scalar product we will denote by ∥·∥ρ and
⟨·, ·⟩ρ respectively.

Proposition/Definition 4.2. Let ρ be a distribution on the Hilbert space H such that
$\mathbb{E}[\|X\|^2] < \infty$ (this is implied in particular by Assumption 4.1). Denote S the “population
evaluation” operator
\[ S : H \to L^2(H, \rho), \quad f \mapsto \langle f, \cdot\rangle = [x \mapsto \langle f, x\rangle]. \]
This is a Hilbert-Schmidt operator, and its adjoint is given by
\[ S^* : L^2(H, \rho) \to H, \quad g \mapsto \mathbb{E}[g(X)X]. \]
Finally, it holds $S^*S = \mathbb{E}[X \otimes X^*] = \Sigma$, and Σ is a trace-class operator.
Proof. Let $(e_i)_{i\in I}$ be an orthonormal basis of H. Then we have
\[ \sum_i \|Se_i\|_\rho^2 = \sum_i \mathbb{E}\big[|\langle e_i, X\rangle|^2\big] = \mathbb{E}\big[\|X\|^2\big] < \infty, \]
establishing that S is Hilbert-Schmidt. To determine its adjoint, we notice
\[ \langle Sf, g\rangle_\rho = \mathbb{E}[\langle f, X\rangle g(X)] = \mathbb{E}[\langle f, Xg(X)\rangle] = \langle f, \mathbb{E}[Xg(X)]\rangle, \]
establishing the announced formula for S*. Observe that $\mathbb{E}[\|Xg(X)\|] \le \mathbb{E}[\|X\|^2]^{1/2}\, \mathbb{E}[|g(X)|^2]^{1/2} < \infty$,
since g is an element of L²(H, ρ); this proves that Xg(X) is Bochner-integrable.
Finally, Σ = S*S is a trace-class operator as product of two Hilbert-Schmidt operators,
and we identify
\[ S^*Sf = \mathbb{E}[X (Sf)(X)] = \mathbb{E}[X\langle X, f\rangle] = \mathbb{E}[X \otimes X^*] f, \]
establishing the announced formula. Observe that X ⊗ X* is Bochner-integrable as an
element of the Banach space of trace-class operators B1(H), since $\|X \otimes X^*\|_1 = \|X\|^2$,
which is integrable.
We need empirical analogues of the above operators. For this we mimic the above
construction with $L^2(H, \widehat\rho)$ instead of $L^2(H, \rho)$, where $\widehat\rho$ is the empirical distribution associated
to the sample (X1, . . . , Xn). We actually identify $L^2(H, \widehat\rho)$ with $\mathbb{R}^n$ with the scalar product
$\langle u, v\rangle_n := \frac{1}{n} \sum_{i \in \llbracket n \rrbracket} u_i v_i$. (This identification is not quite correct if some of the values Xi
repeat.)
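Proposition/Definition 4.3. Conditional to (X1, . . . , Xn), denote $\widehat S$ the “sample
evaluation” operator
\[ \widehat S : H \to (\mathbb{R}^n, \langle\cdot,\cdot\rangle_n), \quad f \mapsto (\langle f, X_1\rangle, \ldots, \langle f, X_n\rangle). \]
Its adjoint is given by
\[ \widehat S^* : (\mathbb{R}^n, \langle\cdot,\cdot\rangle_n) \to H, \quad (a_1, \ldots, a_n) \mapsto \frac{1}{n} \sum_{i \in \llbracket n \rrbracket} a_i X_i. \]
Finally, it holds $\widehat S^* \widehat S = \frac{1}{n} \sum_{i \in \llbracket n \rrbracket} X_i \otimes X_i^* = \widehat\Sigma$.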

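Proof. Just note
\[ \langle \widehat S f, u\rangle_n = \frac{1}{n} \sum_{i \in \llbracket n \rrbracket} \langle f, X_i\rangle u_i = \Big\langle f, \frac{1}{n} \sum_{i \in \llbracket n \rrbracket} u_i X_i \Big\rangle, \]
and
\[ \widehat S^* \widehat S f = \frac{1}{n} \sum_{i \in \llbracket n \rrbracket} X_i \langle f, X_i\rangle = \Big(\frac{1}{n} \sum_{i \in \llbracket n \rrbracket} X_i \otimes X_i^*\Big) f. \]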
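In finite dimension the sample evaluation operator and its adjoint are plain matrix operations, and the two identities of the proof can be verified directly. The sketch below is a minimal illustration under the assumption H = R^d with the Euclidean inner product; all function names are hypothetical.

```python
import random

def inner(u, v):
    """Euclidean inner product, standing in for the inner product of H."""
    return sum(a * b for a, b in zip(u, v))

def inner_n(u, v):
    """Normalized inner product <u, v>_n = (1/n) sum_i u_i v_i on R^n."""
    return inner(u, v) / len(u)

def sample_eval(X, f):
    """S_hat f = (<f, X_1>, ..., <f, X_n>)."""
    return [inner(f, x) for x in X]

def sample_eval_adjoint(X, a):
    """S_hat^* a = (1/n) sum_i a_i X_i."""
    n, d = len(X), len(X[0])
    return [sum(a[i] * X[i][j] for i in range(n)) / n for j in range(d)]

def empirical_second_moment(X):
    """Sigma_hat = (1/n) sum_i X_i X_i^T as a d x d nested list."""
    n, d = len(X), len(X[0])
    return [[sum(x[j] * x[k] for x in X) / n for k in range(d)] for j in range(d)]

rng = random.Random(0)
n, d = 6, 3
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
f = [rng.gauss(0.0, 1.0) for _ in range(d)]
u = [rng.gauss(0.0, 1.0) for _ in range(n)]

lhs = inner_n(sample_eval(X, f), u)           # <S_hat f, u>_n
rhs = inner(f, sample_eval_adjoint(X, u))     # <f, S_hat^* u>
SstarS_f = sample_eval_adjoint(X, sample_eval(X, f))
Sigma_f = [inner(row, f) for row in empirical_second_moment(X)]
```

The factor 1/n sits in the adjoint because the target space carries the normalized inner product ⟨·, ·⟩n rather than the standard one.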


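Finally, let us define the excess risk we want to control for an estimator $\widehat f$ of f*. We
will focus here on the quadratic prediction risk $R(\widehat f) = \mathbb{E}\big[(\langle \widehat f, X\rangle - Y)^2\big]$, where the
expectation is over a new, independent example (X, Y) drawn from P. Consequently, the
excess risk with respect to the optimal prediction f* can be rewritten as
\[
\begin{aligned}
R(\widehat f) - R(f^*) &= \mathbb{E}\big[(\langle \widehat f, X\rangle - Y)^2 - (\langle f^*, X\rangle - Y)^2\big] \\
&= \mathbb{E}\big[(\langle \widehat f - f^*, X\rangle - \xi)^2 - \xi^2\big] \\
&= \mathbb{E}\big[\langle \widehat f - f^*, X\rangle^2\big] \\
&= \big\langle S(\widehat f - f^*), S(\widehat f - f^*)\big\rangle_\rho \\
&= \big\langle \widehat f - f^*, \Sigma(\widehat f - f^*)\big\rangle_H \\
&= \big\|\Sigma^{1/2}(\widehat f - f^*)\big\|_H^2.
\end{aligned}
\tag{4.2}
\]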

4.2 Probabilistic inequalities


In analogy with the finite-dimensional ordinary least squares estimator
\[ \widehat\theta = \Big(\frac{1}{n} X^T X\Big)^{-1} \frac{1}{n} X^T Y = \widehat\Sigma^{-1} \frac{1}{n} X^T Y \]
(where X is the (n, d) design matrix whose rows
are the Xi, and Y is the column vector (Y1, . . . , Yn)^T), note that the operator $\widehat S$ is the
Hilbert space analogue of X, so that we will consider estimates of the form
\[ \widehat f_\lambda = F_\lambda(\widehat\Sigma)\, \widehat S^* Y, \]
where Fλ(.) is a suitable “regularized inverse” driven by a regularization parameter λ > 0.
Under model (4.1), we have $Y = \widehat S f^* + \xi$, where ξ = (ξ1, . . . , ξn), so
\[ \widehat f_\lambda = F_\lambda(\widehat\Sigma)\, \widehat S^*(\widehat S f^* + \xi) = F_\lambda(\widehat\Sigma)\widehat\Sigma f^* + F_\lambda(\widehat\Sigma)(\widehat S^* \xi). \tag{4.3} \]
In view of the above, two quantities we wish to have control on are $\widehat\Sigma$ and $\widehat S^*\xi$.
Let us start with an application of the simple Hoeffding’s inequality:

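Proposition 4.4. Under Assumption 4.1, for δ ∈ (0, 1) denote $L_\delta := 1 + \log \delta^{-1}$; it holds
with probability 1 − δ:
\[ \big\|\widehat S^* \xi\big\| \le \frac{4M\kappa \sqrt{L_\delta}}{\sqrt{n}}. \tag{4.4} \]
Also, with probability 1 − δ:
\[ \big\|\widehat\Sigma - \Sigma\big\|_{op} \le \big\|\widehat\Sigma - \Sigma\big\|_2 \le \frac{2\kappa^2 L_\delta}{\sqrt{n}}. \tag{4.5} \]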

Proof. Note that $\widehat S^*\xi = \frac{1}{n} \sum_{i=1}^n \xi_i X_i$, and we have ∥ξi Xi∥ ≤ 2Mκ. Applying Hoeffding’s
inequality in Hilbert space yields the first claim.
For the second, we recall that $\widehat\Sigma = \frac{1}{n} \sum_{i=1}^n X_i \otimes X_i^*$, and $\|X_i \otimes X_i^*\|_2 = \|X_i\|^2 \le \kappa^2$.
Applying Hoeffding’s inequality in the Hilbert space HS(H) yields the second claim.
We turn to applications of Bernstein’s inequality. To better exploit it, we will consider
a ”warped” version of the quantities of interest. The following quantity will play an
important role:
Definition 4.5. In the context of Assumption 4.1, introduce and denote for λ > 0:
\[ \Sigma_\lambda := \Sigma + \lambda I, \]
and
\[ N(\lambda) := \operatorname{Tr}\big(\Sigma \Sigma_\lambda^{-1}\big) = \sum_{k \ge 1} \frac{\lambda_k}{\lambda_k + \lambda}, \tag{4.6} \]
where (λk)k≥1 is the sequence of eigenvalues of Σ (with multiplicity).
(Observe that N(λ) is well-defined for any λ > 0 since Σ is trace-class.)
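To get a feeling for the quantity N(λ), often called the effective dimension, one can evaluate (4.6) on a concrete spectrum. The sketch below assumes, purely for illustration, polynomially decaying eigenvalues λk = c k^(−α) truncated to finitely many terms; the values of c, α and the truncation level are arbitrary choices.

```python
def effective_dimension(eigenvalues, lam):
    """N(lam) = sum_k lambda_k / (lambda_k + lam) for a finite list of eigenvalues."""
    return sum(mu / (mu + lam) for mu in eigenvalues)

# Illustrative spectrum (assumption): lambda_k = c * k^(-alpha), truncated.
alpha, c = 2.0, 1.0
eigenvalues = [c * k ** (-alpha) for k in range(1, 100001)]
trace = sum(eigenvalues)

lams = [1e-1, 1e-2, 1e-3]
N_values = [effective_dimension(eigenvalues, lam) for lam in lams]
# Rescaled values N(lam) * lam^(1/alpha); they stay bounded as lam decreases.
ratios = [N * lam ** (1.0 / alpha) for N, lam in zip(N_values, lams)]
```

For this spectrum N(λ) grows like λ^(−1/α) as λ decreases, which is much smaller than the crude bound Tr(Σ)/λ.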
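Proposition 4.6. Under Assumption 4.1, for δ ∈ (0, 1), λ > 0, denoting $L_\delta = 1 + \log \delta^{-1}$,
each of the following events holds with probability 1 − δ:
\[ \big\|\Sigma_\lambda^{-1/2} \widehat S^* \xi\big\| \le 2\sqrt{2} M \sqrt{\frac{L_\delta N(\lambda)}{n}} + \frac{2M\kappa L_\delta}{n\sqrt{\lambda}} \le C_\blacktriangle \Big(\sqrt{\frac{L_\delta N(\lambda)}{n}} + \frac{L_\delta}{n\sqrt{\lambda}}\Big); \tag{4.7} \]
\[ \big\|\Sigma_\lambda^{-1/2} \big(\widehat\Sigma - \Sigma\big)\big\|_2 \le \kappa \sqrt{\frac{2N(\lambda) L_\delta}{n}} + \frac{2 L_\delta \kappa^2}{n\sqrt{\lambda}} \le C_\blacktriangle \Big(\sqrt{\frac{L_\delta N(\lambda)}{n}} + \frac{L_\delta}{n\sqrt{\lambda}}\Big); \tag{4.8} \]
and provided λ ≤ ∥Σ∥op:
\[
\begin{aligned}
\big\|\Sigma_\lambda^{-1/2} \big(\widehat\Sigma - \Sigma\big) \Sigma_\lambda^{-1/2}\big\|_{op}
&\le \kappa \sqrt{\frac{2(\log(8N(\lambda)) + L_\delta)}{\lambda n}} + \frac{2(\log(8N(\lambda)) + L_\delta)}{3\lambda n} \\
&\le C_\blacktriangle \sqrt{\frac{\log(2N(\lambda)) + L_\delta}{\lambda n}}.
\end{aligned}
\tag{4.9}
\]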

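Proof. For the first two cases, we’ll apply Pinelis’ inequality in a Hilbert space (Cor. 3.10
with Ψ(x) = ∥x∥ in a Hilbert space). For the first one, note that
\[ \Sigma_\lambda^{-1/2} \widehat S^* \xi = \frac{1}{n} \sum_{i \in \llbracket n \rrbracket} Z_i, \qquad Z_i := \Sigma_\lambda^{-1/2} X_i \xi_i. \]
Since $\|\Sigma_\lambda^{-1/2}\|_{op} \le \lambda^{-1/2}$, we have $\|Z_i\| \le 2M\kappa/\sqrt{\lambda}$. Moreover,
\[
\begin{aligned}
\mathbb{E}\big[\|Z_i\|^2\big] &\le 4M^2\, \mathbb{E}\big[\|\Sigma_\lambda^{-1/2} X_i\|^2\big] \\
&= 4M^2\, \mathbb{E}\big[\operatorname{Tr}\big(\Sigma_\lambda^{-1/2} X_i \otimes (\Sigma_\lambda^{-1/2} X_i)^*\big)\big] \\
&= 4M^2\, \mathbb{E}\big[\operatorname{Tr}\big(\Sigma_\lambda^{-1/2} (X_i \otimes X_i^*) \Sigma_\lambda^{-1/2}\big)\big] \\
&= 4M^2 \operatorname{Tr}\big(\Sigma_\lambda^{-1/2}\, \mathbb{E}[X_i \otimes X_i^*]\, \Sigma_\lambda^{-1/2}\big) \\
&= 4M^2 N(\lambda).
\end{aligned}
\]
For the second one, we have $\Sigma_\lambda^{-1/2}(\widehat\Sigma - \Sigma) = \frac{1}{n} \sum_{i \in \llbracket n \rrbracket} A_i$, with $A_i = \Sigma_\lambda^{-1/2}\big((X_i \otimes X_i^*) - \mathbb{E}[X_i \otimes X_i^*]\big)$.
It holds
\[ \|A_i\|_2 \le 2 \big\|\Sigma_\lambda^{-1/2}(X_i \otimes X_i^*)\big\|_2 \le 2 \big\|\Sigma_\lambda^{-1/2}\big\|_{op} \|X_i \otimes X_i^*\|_2 \le 2\lambda^{-1/2}\kappa^2; \]
and, due to $\mathbb{E}[\|Z - \mathbb{E}[Z]\|^2] = \mathbb{E}[\|Z\|^2] - \|\mathbb{E}[Z]\|^2 \le \mathbb{E}[\|Z\|^2]$ for a Hilbert norm:
\[
\begin{aligned}
\mathbb{E}\big[\|A_i\|_2^2\big] &\le \mathbb{E}\big[\big\|\Sigma_\lambda^{-1/2}(X_i \otimes X_i^*)\big\|_2^2\big] \\
&= \mathbb{E}\big[\operatorname{Tr}\big((X_i \otimes X_i^*) \Sigma_\lambda^{-1} (X_i \otimes X_i^*)\big)\big] \\
&\le \mathbb{E}\big[\|X_i \otimes X_i^*\|_{op} \operatorname{Tr}\big(\Sigma_\lambda^{-1}(X_i \otimes X_i^*)\big)\big] \\
&\le \kappa^2 N(\lambda).
\end{aligned}
\]
For the last claim, we will apply the operator Bernstein’s inequality (Theorem 3.14). The
estimates are similar to the above; now we consider a sum of i.i.d. self-adjoint random
operators having the form $B_i := \Sigma_\lambda^{-1/2}\big(X_i \otimes X_i^* - \mathbb{E}[X_i \otimes X_i^*]\big)\Sigma_\lambda^{-1/2}$. It holds $\|B_i\|_{op} \le 2\kappa^2/\lambda$,
and due to $\mathbb{E}[(M - \mathbb{E}(M))^2] \preceq \mathbb{E}[M^2]$ for a self-adjoint operator-valued random variable
M (s.t. M² is Bochner-integrable), it holds
\[ \mathbb{E}\big[B_i^2\big] \preceq \Sigma_\lambda^{-1/2}\, \mathbb{E}\big[(X_i \otimes X_i^*) \Sigma_\lambda^{-1} (X_i \otimes X_i^*)\big]\, \Sigma_\lambda^{-1/2}. \]
Observe that in general, (u ⊗ v)M(w ⊗ x) = ⟨v, Mw⟩ u ⊗ x. Therefore
\[ (X_i \otimes X_i^*) \Sigma_\lambda^{-1} (X_i \otimes X_i^*) = \big\langle X_i, \Sigma_\lambda^{-1} X_i\big\rangle\, X_i \otimes X_i^* \preceq \frac{\kappa^2}{\lambda} (X_i \otimes X_i^*). \]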
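Using this in the previous display, and positivity of expectation, we obtain
\[ \mathbb{E}\big[B_i^2\big] \preceq \frac{\kappa^2}{\lambda}\, \Sigma_\lambda^{-1/2} \Sigma \Sigma_\lambda^{-1/2} =: V. \]
It holds
\[ \|V\|_{op} \le \frac{\kappa^2}{\lambda} \big\|\Sigma_\lambda^{-1/2} \Sigma \Sigma_\lambda^{-1/2}\big\|_{op} \le \frac{\kappa^2}{\lambda}. \]
Furthermore, since intrinsic dimension is invariant by rescaling, we have
\[ \operatorname{intdim}(V) = \frac{\operatorname{Tr}\big(\Sigma_\lambda^{-1/2} \Sigma \Sigma_\lambda^{-1/2}\big)}{\big\|\Sigma_\lambda^{-1/2} \Sigma \Sigma_\lambda^{-1/2}\big\|_{op}} \le 2N(\lambda), \]
where for the last inequality we used that if λ1 = ∥Σ∥op, then $\|\Sigma_\lambda^{-1/2} \Sigma \Sigma_\lambda^{-1/2}\|_{op} = \lambda_1/(\lambda_1 + \lambda) \ge 1/2$
provided λ ≤ λ1.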
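The following corollary of (4.9) is extremely important and useful.
Corollary 4.7. Under Assumption 4.1, for λ ∈ (0, ∥Σ∥op), and δ ∈ (0, 1), provided
\[ n \ge C_\blacktriangle A\, \frac{\log(2N(\lambda)) + L_\delta}{\lambda} \tag{4.10} \]
for some A ≥ 1, then with probability at least (1 − δ) it holds simultaneously
\[ \big\|\Sigma_\lambda^{-1/2} \widehat\Sigma_\lambda^{1/2}\big\|_{op}^2 \le 1 + \frac{C_\blacktriangle}{\sqrt{A}}; \tag{4.11} \]
\[ \big\|\widehat\Sigma_\lambda^{-1/2} \Sigma_\lambda^{1/2}\big\|_{op}^2 \le 1 + \frac{C_\blacktriangle}{\sqrt{A}}. \tag{4.12} \]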
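Proof. Provided C▲ is chosen large enough in condition (4.10), we have from (4.9) (with
probability at least 1 − δ):
\[ \big\|\Sigma_\lambda^{-1/2} \big(\widehat\Sigma - \Sigma\big) \Sigma_\lambda^{-1/2}\big\|_{op} \le \frac{C'_\blacktriangle}{\sqrt{A}} \]
(we can assume $C'_\blacktriangle \le \frac{1}{2}$ provided C▲ is chosen large enough in condition (4.10)).
This immediately implies the first claim, since
\[ \big\|\Sigma_\lambda^{-1/2} \widehat\Sigma_\lambda^{1/2}\big\|_{op}^2 = \big\|\Sigma_\lambda^{-1/2} \widehat\Sigma_\lambda \Sigma_\lambda^{-1/2}\big\|_{op} = \big\|\Sigma_\lambda^{-1/2} (\widehat\Sigma - \Sigma) \Sigma_\lambda^{-1/2} + I\big\|_{op} \le 1 + \frac{C'_\blacktriangle}{\sqrt{A}}. \]
For the second claim, we have
\[ \big\|\widehat\Sigma_\lambda^{-1/2} \Sigma_\lambda^{1/2}\big\|_{op}^2 = \big\|\Sigma_\lambda^{1/2} \widehat\Sigma_\lambda^{-1} \Sigma_\lambda^{1/2}\big\|_{op} = \Big\|\big(I - \Sigma_\lambda^{-1/2} (\Sigma - \widehat\Sigma) \Sigma_\lambda^{-1/2}\big)^{-1}\Big\|_{op} \le \Big(1 - \frac{C'_\blacktriangle}{\sqrt{A}}\Big)^{-1} \le 1 + \frac{C''_\blacktriangle}{\sqrt{A}}, \]
where we have used $C'_\blacktriangle/\sqrt{A} \le \frac{1}{2}$. The second equality above is due to (for two invertible
self-adjoint operators A, B)
\[ I - A^{-1/2}(A - B)A^{-1/2} = A^{-1/2} B A^{-1/2} = \big(A^{1/2} B^{-1} A^{1/2}\big)^{-1}. \]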

4.3 Analysis of spectral regularization methods
As announced earlier, we will study estimates of the form
\[ \widehat f_\lambda = F_\lambda(\widehat\Sigma)\, \widehat S^* Y, \tag{4.13} \]
where Fλ : R+ → R+ is a “regularized inverse” function depending on a regularization
parameter λ > 0.
We will study the statistical properties of this type of algorithms under somewhat
”generic” conditions for the family Fλ . These conditions are meant to allow for a large
variety of different methods and algorithms in practice. We defer precise examples to a
later section.

Assumption 4.8. The family of functions Fλ : [0, κ²] → R+, defined for λ ∈ (0, κ²], is
said to be a regularization (or filter) function of qualification q > 0 if there exist positive
constants D, E, such that for all λ ∈ (0, κ²] and t ∈ [0, κ²], it holds:
\[ F_\lambda(t) \le E \min(\lambda^{-1}, t^{-1}); \tag{4.14} \]
\[ |1 - tF_\lambda(t)| \le D \Big(\frac{\lambda}{t}\Big)^q. \tag{4.15} \]
The following useful estimates follow directly:
Lemma 4.9. Under Assumption 4.8, the following holds true for all λ ∈ (0, κ²] and t ∈
[0, κ²]:
\[ \text{for all } \beta \in [0, 1]: \quad F_\lambda(t)\, t^\beta \le E \lambda^{\beta - 1}; \tag{4.16} \]
\[ \text{for all } \gamma \in [0, q]: \quad |1 - tF_\lambda(t)|\, t^\gamma \le D' \lambda^\gamma, \tag{4.17} \]
where D′ = max(D, 1 + E).
Proof. For any λ ∈ (0, κ²] and t ∈ [0, κ²], and β ∈ [0, 1], it holds using (4.14):
\[ F_\lambda(t)\, t^\beta = (F_\lambda(t))^{1-\beta} (tF_\lambda(t))^\beta \le (E/\lambda)^{1-\beta} E^\beta = E \lambda^{\beta - 1}. \]
Furthermore, for any γ ∈ [0, q], using (4.14) (which gives |1 − tFλ(t)| ≤ 1 + E) and (4.15):
\[ |1 - tF_\lambda(t)|\, t^\gamma = \big(|1 - tF_\lambda(t)|\, t^q\big)^{\gamma/q}\, |1 - tF_\lambda(t)|^{1 - \gamma/q} \le D^{\gamma/q} (1 + E)^{1 - \gamma/q} \lambda^\gamma \le D' \lambda^\gamma. \]

The second type of assumption we will make concerns the ”regularity” of the target
function f ∗ , expressed in the ”scale” of the second moment operator Σ.

Assumption 4.10. Under the notation of Assumption 4.1, we say that the target f ∗ has a
Hölder source regularity condition of order r ≥ 0 if it can be written under the form
f ∗ = Σr g0 , (4.18)
for some g0 ∈ H.

Observe that since Σ is not invertible, the image of Σ is not H, and thus this condition
is not trivial; not every element of H has a non-trivial (r > 0) source regularity condition.
The higher r, the more restrictive the condition, and hence the higher “regularity” the
target function has. Note that there exist more general source conditions in the literature
that use functions of Σ different from (fractional) powers, but the Hölder source condition
(using powers of Σ) is the most classical, and is the only type we will consider.
From now on, the “generic number” C▲ will be allowed to depend on κ, M, the constants
(D, E) from Assumption 4.8 as well as (r, ∥g0∥) from Assumption 4.10. In fact, it is
probably more enlightening to say that C▲ indicates a factor that does not depend on n, λ
or δ. Remember that the value of C▲ might change from one line to the other.

Proposition 4.11. Suppose granted Assumption 4.1, regularization function Assumption 4.8
with qualification q, and Hölder source Assumption 4.10 of order r, such that
$q \ge r + \frac{1}{2}$.
For any λ ∈ (0, ∥Σ∥op], δ ∈ (0, 1) and n such that condition (4.10) holds, it holds with
probability at least 1 − δ:
\[ \big\|\Sigma^{1/2}(\widehat f_\lambda - f^*)\big\|_H \le C_\blacktriangle \Big(\sqrt{\frac{N(\lambda)}{n}} + \lambda^{r + \frac{1}{2}}\Big) \sqrt{L_\delta}. \tag{4.19} \]

Corollary 4.12. Under the same assumptions as Proposition 4.11, assume additionally
that the ordered eigenvalues of Σ satisfy
\[ \lambda_k(\Sigma) \le c k^{-\alpha}, \tag{4.20} \]
for some constants c > 0 and α > 1. Put $\beta := \alpha\big(r + \frac{1}{2}\big)$. Then, choosing the regularization
constant
\[ \lambda_n = n^{-\frac{\alpha}{2\beta + 1}}, \tag{4.21} \]
for fixed δ ∈ (0, 1), for n big enough it holds with probability at least 1 − δ:
\[ R(\widehat f_{\lambda_n}) - R(f^*) \le C_\blacktriangle\, n^{-\frac{2\beta}{2\beta + 1}}, \tag{4.22} \]
where C▲ depends on (c, α) in addition to the constants appearing in the other assumptions.

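Proof. Let us derive a rough estimate of the effective dimension N(λ) in this case. Denote
$k^*_\lambda := \lceil (c/\lambda)^{1/\alpha} \rceil$, so that the assumption (4.20) implies $\lambda_k \le \lambda$ for all $k \ge k^*_\lambda$. We bound
$\lambda_k/(\lambda_k + \lambda)$ by 1 for $k \le k^*_\lambda$ and by $\lambda_k/\lambda$ for $k > k^*_\lambda$. Thus
\[
\begin{aligned}
N(\lambda) = \sum_{k \ge 1} \frac{\lambda_k}{\lambda + \lambda_k}
&= \sum_{1 \le k \le k^*_\lambda} \frac{\lambda_k}{\lambda_k + \lambda} + \sum_{k > k^*_\lambda} \frac{\lambda_k}{\lambda_k + \lambda} \\
&\le k^*_\lambda + \lambda^{-1} c \sum_{k > k^*_\lambda} k^{-\alpha} \\
&\le k^*_\lambda + c \lambda^{-1} \int_{t \ge k^*_\lambda} t^{-\alpha}\, dt \\
&= k^*_\lambda + \frac{c}{\alpha - 1} \lambda^{-1} (k^*_\lambda)^{1 - \alpha} \\
&\le C_\blacktriangle \lambda^{-1/\alpha},
\end{aligned}
\]
where the last step uses $(c/\lambda)^{1/\alpha} \le k^*_\lambda \le (c/\lambda)^{1/\alpha} + 1$ and the fact that λ ≤ κ² is bounded.
It can then be checked that the choice (4.21) for λn, which balances the two terms for the
obtained risk bound (4.19), leads to (4.22).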
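The balancing step can be written out as follows. This is only a sketch of the computation, assuming the estimate N(λ) ≤ C λ^(−1/α) for the effective dimension and ignoring constants (the symbol ≍ means equality up to constant factors).

```latex
% Balancing the two terms of (4.19), assuming N(\lambda) \le C \lambda^{-1/\alpha}:
\begin{align*}
\sqrt{\frac{\lambda^{-1/\alpha}}{n}} \asymp \lambda^{r+\frac12}
 &\iff \frac{\lambda^{-1/\alpha}}{n} \asymp \lambda^{2r+1}
 \iff \lambda^{2r+1+\frac{1}{\alpha}} \asymp n^{-1} \\
 &\iff \lambda \asymp n^{-\frac{\alpha}{\alpha(2r+1)+1}} = n^{-\frac{\alpha}{2\beta+1}},
 \qquad \beta = \alpha\big(r+\tfrac12\big).
\end{align*}
% The excess risk is the squared norm in (4.2), so at this choice of \lambda:
\[
\lambda_n^{2r+1} = n^{-\frac{\alpha(2r+1)}{2\beta+1}} = n^{-\frac{2\beta}{2\beta+1}} .
\]
```

With this choice, λn n = n^((2αr+1)/(2β+1)) grows polynomially in n while log N(λn) grows only logarithmically, which is why condition (4.10) holds for n big enough.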
Proof of Prop. 4.11. We will assume for this proof that the probabilistic inequalities of
Proposition 4.6 are satisfied, as well as those of Corollary 4.7. By the assumptions made,
the required conditions for Corollary 4.7, namely λ ≤ ∥Σ∥op ≤ κ² and (4.10), are satisfied.
Note that we also implicitly use a union bound to get simultaneously the controls of
Proposition 4.6, but this amounts to replacing Lδ by $L_{\delta/c} \le C_\blacktriangle L_\delta$ for a finite number
c of events to apply the union bound over, and this can be included in the numerical
constants.
We recall the starting decomposition (4.3) coming from model (4.1) and the definition of
the estimator:
\[ \widehat f_\lambda = F_\lambda(\widehat\Sigma)\, \widehat S^*(\widehat S f^* + \xi) = F_\lambda(\widehat\Sigma)\widehat\Sigma f^* + F_\lambda(\widehat\Sigma)(\widehat S^* \xi), \tag{4.23} \]
hence the quantity we want to analyze for the control of the excess risk (4.2) is
\[ \Sigma^{1/2}(\widehat f_\lambda - f^*) = \Sigma^{1/2}\big(F_\lambda(\widehat\Sigma)\widehat\Sigma - I\big) f^* + \Sigma^{1/2} F_\lambda(\widehat\Sigma)(\widehat S^* \xi). \tag{4.24} \]

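We will control the two terms above, starting with the second, “noise”, term. It holds
\[ \big\|\Sigma^{1/2} F_\lambda(\widehat\Sigma)(\widehat S^* \xi)\big\| \le \big\|\Sigma^{1/2} \Sigma_\lambda^{-1/2}\big\|_{op} \big\|\Sigma_\lambda^{1/2} \widehat\Sigma_\lambda^{-1/2}\big\|_{op} \big\|\widehat\Sigma_\lambda^{1/2} F_\lambda(\widehat\Sigma) \widehat\Sigma_\lambda^{1/2}\big\|_{op} \big\|\widehat\Sigma_\lambda^{-1/2} \Sigma_\lambda^{1/2}\big\|_{op} \big\|\Sigma_\lambda^{-1/2} \widehat S^* \xi\big\|. \]
The first factor is bounded by 1. The second and fourth factors are bounded by a number
C▲ (with high probability) due to Corollary 4.7. The last factor is bounded using (4.7).
As for the third factor, it holds
\[ \big\|\widehat\Sigma_\lambda^{1/2} F_\lambda(\widehat\Sigma) \widehat\Sigma_\lambda^{1/2}\big\|_{op} \le \sup_{t \in [0, \kappa^2]} F_\lambda(t)(t + \lambda) \le 2E, \]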

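using (4.14). In the end, we get
\[ \big\|\Sigma^{1/2} F_\lambda(\widehat\Sigma)(\widehat S^* \xi)\big\| \le C_\blacktriangle \Big(\sqrt{\frac{L_\delta N(\lambda)}{n}} + \frac{L_\delta}{n\sqrt{\lambda}}\Big) \le C'_\blacktriangle \sqrt{\frac{L_\delta N(\lambda)}{n}}. \tag{4.25} \]
For the last inequality, we used condition (4.10) which implies (noting that λ ≤ ∥Σ∥op
implies $N(\lambda) \ge \frac{1}{2}$):
\[ n \ge C_\blacktriangle \frac{\log(2N(\lambda)) + L_\delta}{\lambda} \ge C_\blacktriangle \frac{L_\delta}{\lambda}, \]
so that
\[ \frac{L_\delta}{n\sqrt{\lambda}} \le C_\blacktriangle \frac{\sqrt{L_\delta}}{\sqrt{n}} \le C'_\blacktriangle \sqrt{\frac{L_\delta N(\lambda)}{n}}. \]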
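Let us turn to the first, “approximation”, term in (4.24). We use the assumed source
condition (4.18) and start similarly as above; we denote $R_\lambda(t) := tF_\lambda(t) - 1$:
\[ \big\|\Sigma^{1/2}\big(F_\lambda(\widehat\Sigma)\widehat\Sigma - I\big) f^*\big\| \le \big\|\Sigma^{1/2} \Sigma_\lambda^{-1/2}\big\|_{op} \big\|\Sigma_\lambda^{1/2} \widehat\Sigma_\lambda^{-1/2}\big\|_{op} \big\|\widehat\Sigma_\lambda^{1/2} R_\lambda(\widehat\Sigma) \Sigma^r\big\|_{op} \|g_0\| \le C_\blacktriangle \big\|\widehat\Sigma_\lambda^{1/2} R_\lambda(\widehat\Sigma) \Sigma^r\big\|_{op}. \tag{4.26} \]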
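We will distinguish two cases: first, if $r \le \frac{1}{2}$, we will use the Cordes inequality $\|A^s B^s\|_{op} \le \|AB\|_{op}^s$
if A, B are self-adjoint positive and s ∈ [0, 1] to obtain
\[ \big\|\widehat\Sigma_\lambda^{1/2} R_\lambda(\widehat\Sigma) \Sigma^r\big\|_{op} \le \big\|\widehat\Sigma_\lambda^{1/2} R_\lambda(\widehat\Sigma) \widehat\Sigma_\lambda^{r}\big\|_{op}\, \big\|\widehat\Sigma_\lambda^{-1/2} \Sigma_\lambda^{1/2}\big\|_{op}^{2r}\, \big\|\Sigma_\lambda^{-r} \Sigma^r\big\|_{op}. \]
The last factor is bounded by 1 as before, the second by a number C▲ (with high probability)
due to Corollary 4.7, and the first by
\[
\begin{aligned}
\big\|\widehat\Sigma_\lambda^{1/2} R_\lambda(\widehat\Sigma) \widehat\Sigma_\lambda^{r}\big\|_{op}
&\le \sup_{t \in [0, \kappa^2]} |R_\lambda(t)| (t + \lambda)^{r + \frac{1}{2}} \\
&\le \sup_{t \in [0, \kappa^2]} |R_\lambda(t)| \big(t^{r + \frac{1}{2}} + \lambda^{r + \frac{1}{2}}\big) \\
&\le 2D' \lambda^{r + \frac{1}{2}},
\end{aligned}
\]
where we have used $r + \frac{1}{2} \le 1$ for the second inequality, and property (4.17) for the last
(under the assumption $r + \frac{1}{2} \le q$).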
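If $r \ge \frac{1}{2}$, we modify the argument:
\[
\begin{aligned}
\big\|\widehat\Sigma_\lambda^{1/2} R_\lambda(\widehat\Sigma) \Sigma^r\big\|_{op}
&\le \big\|\widehat\Sigma_\lambda^{1/2} R_\lambda(\widehat\Sigma) \Sigma_\lambda^{r}\big\|_{op}\, \underbrace{\big\|\Sigma_\lambda^{-r} \Sigma^r\big\|_{op}}_{\le 1} \\
&\le \big\|\widehat\Sigma_\lambda^{1/2} R_\lambda(\widehat\Sigma) \widehat\Sigma_\lambda^{r}\big\|_{op} + \big\|\widehat\Sigma_\lambda^{1/2} R_\lambda(\widehat\Sigma)\big\|_{op}\, \big\|\Sigma_\lambda^{r} - \widehat\Sigma_\lambda^{r}\big\|_{op} \\
&\le 2D' \lambda^{r + \frac{1}{2}} + 2D' \lambda^{\frac{1}{2}}\, \frac{C_\blacktriangle\, r\, (2\kappa)^{2(r-1)_+} (2\kappa^2) L_\delta}{\lambda^{(1-r)_+} \sqrt{n}}.
\end{aligned}
\]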

To justify the last inequality, we use the HS-norm control (4.5) together with the Hilbert-Schmidt
Lipschitz perturbation inequality (2.15), for the function $\varphi_r : x \mapsto x^r$ on the
interval [λ, 2κ²] containing the spectrum of both $\Sigma_\lambda$ and $\widehat\Sigma_\lambda$ (remember we assume λ ≤ κ²).
On this interval the function $\varphi_r$ is $r\lambda^{r-1}$-Lipschitz if r ≤ 1, and $r(2\kappa^2)^{r-1}$-Lipschitz if r ≥ 1.
Summing up the last computations into (4.26) and wrapping various factors into the generic
constant, we get
\[ \big\|\Sigma^{1/2}\big(F_\lambda(\widehat\Sigma)\widehat\Sigma - I\big) f^*\big\| \le C_\blacktriangle \Big(\lambda^{r + \frac{1}{2}} + \mathbf{1}\big\{r \ge \tfrac{1}{2}\big\} \frac{\sqrt{\lambda}\sqrt{L_\delta}}{\lambda^{(1-r)_+} \sqrt{n}}\Big). \tag{4.27} \]
Plugging (4.25) and (4.27) into (4.24), we thus obtain the risk bound holding with high
probability (using Lδ ≥ 1 to pull it out as a factor up to changes in the front constant):
\[ \big\|\Sigma^{1/2}(\widehat f_\lambda - f^*)\big\|_H \le C_\blacktriangle \Big(\sqrt{\frac{N(\lambda)}{n}} + \lambda^{r + \frac{1}{2}} + \mathbf{1}\big\{r \ge \tfrac{1}{2}\big\} \frac{\sqrt{\lambda}}{\lambda^{(1-r)_+} \sqrt{n}}\Big) \sqrt{L_\delta}. \tag{4.28} \]
Let us finally clean up the above expression by noticing that the third term is upper
bounded by the first up to a C▲-factor, since, for $r \ge \frac{1}{2}$:
\[ \frac{\sqrt{\lambda}}{\lambda^{(1-r)_+}} = \lambda^{\min(\frac{1}{2},\, r - \frac{1}{2})} \le \max(1, \kappa) \le \max(1, \kappa) \sqrt{2N(\lambda)}. \]
This finally implies the announced estimate (4.19).

4.4 Examples
In this section we give a few examples of classical regularization functions and check that
they satisfy the conditions of Assumption 4.8. Most of these examples come from the
theory of inverse problems [9].
Spectral cut-off. The spectral cut-off (or truncated singular value decomposition,
TSVD) regularization function is given by Fλ(t) = 1{t ≥ λ}/t. In words, once applied to
a self-adjoint operator, this regularization function projects this operator onto the sum of
eigenspaces for eigenvalues at least λ, and takes its Moore-Penrose pseudoinverse. It is
immediate to check that $F_\lambda(t) \le t^{-1}$, $F_\lambda(t) \le \lambda^{-1}$, and $|1 - tF_\lambda(t)| = \mathbf{1}\{t < \lambda\} \le (\lambda/t)^q$,
for any q ≥ 0. Therefore, the conditions of Assumption 4.8 are satisfied for E = D = 1
and any q > 0; we can say that this regularization has “infinite qualification”.
Having infinite qualification sounds like a very desirable property, since it can adapt to
an arbitrarily regular source condition. However, eigendecomposition truncation is difficult
in practice since it requires computing the eigendecomposition of $\widehat\Sigma$. Furthermore, in practice
somewhat more “smooth” regularization functions turn out to have better behavior.
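Ridge regression/Tikhonov regularization. The ridge regression regularizer, also
known as Tikhonov regularization, is given by $F_\lambda(t) = (t + \lambda)^{-1}$. It is easy to check that:
\[ \frac{1}{\lambda + t} \le \min\Big(\frac{1}{\lambda}, \frac{1}{t}\Big); \qquad 1 - \frac{t}{\lambda + t} = \frac{\lambda}{\lambda + t} \le \frac{\lambda}{t}, \]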

hence the conditions of Assumption 4.8 are satisfied for E = D = 1 and qualification q = 1.
On the other hand, it can be checked that qualification higher than 1 does not hold.
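The conditions (4.14) and (4.15) are simple enough to test on a grid. The sketch below does so for the spectral cut-off and ridge filters; it is an illustration rather than a proof, and the helper `check_filter`, the grid and the choice κ² = 1 are assumptions made for the example.

```python
def cutoff(lam, t):
    """Spectral cut-off filter F_lam(t) = 1{t >= lam} / t."""
    return 1.0 / t if t >= lam else 0.0

def ridge(lam, t):
    """Ridge (Tikhonov) filter F_lam(t) = 1 / (t + lam)."""
    return 1.0 / (t + lam)

def check_filter(F, q, E, D, kappa2=1.0, m=60):
    """Test (4.14) and (4.15) for all (lam, t) on a grid of (0, kappa2]^2."""
    pts = [kappa2 * k / m for k in range(1, m + 1)]
    for lam in pts:
        for t in pts:
            if F(lam, t) > E * min(1.0 / lam, 1.0 / t) + 1e-12:
                return False
            if abs(1.0 - t * F(lam, t)) > D * (lam / t) ** q + 1e-12:
                return False
    return True

ridge_q1 = check_filter(ridge, q=1.0, E=1.0, D=1.0)    # qualification 1 holds
ridge_q2 = check_filter(ridge, q=2.0, E=1.0, D=1.0)    # fails with D = 1
cutoff_q5 = check_filter(cutoff, q=5.0, E=1.0, D=1.0)  # any q works for cut-off
```

The failure for q = 2 reflects the general fact: for ridge, |1 − tFλ(t)| (t/λ)² = t²/(λ(λ + t)), which is unbounded as λ → 0 for fixed t > 0, so no constant D can work.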
Iterated ridge regression/Tikhonov. To compensate for the limited qualification
of the standard ridge regression, it can be proposed to iterate it by applying it (with
the same regularization parameter λ) recursively to the residuals. The following formulas
can be easily shown by recursion for m-times iteration:
\[ F^{(m)}_\lambda(t) = \sum_{i=1}^m \frac{\lambda^{i-1}}{(\lambda + t)^i} = \frac{1}{t}\Big(1 - \frac{\lambda^m}{(\lambda + t)^m}\Big), \]
\[ \text{(residuals)} \quad R^{(m)}_\lambda(t) = 1 - tF^{(m)}_\lambda(t) = \frac{\lambda^m}{(\lambda + t)^m}. \]
It is easy to check that:
\[ \sum_{i=1}^m \frac{\lambda^{i-1}}{(\lambda + t)^i} \le \frac{m}{\lambda + t} \le m \min\Big(\frac{1}{\lambda}, \frac{1}{t}\Big); \qquad \Big(\frac{\lambda}{\lambda + t}\Big)^m \le \Big(\frac{\lambda}{t}\Big)^m, \]
hence the conditions of Assumption 4.8 are satisfied for E = m, D = 1 and qualification
q = m.
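The recursion "apply ridge to the current residual" and the two closed forms above can be compared numerically. The sketch below is a minimal check in plain Python; the function names and the test points are illustrative assumptions.

```python
def iterated_ridge_recursive(m, lam, t):
    """Apply the ridge filter m times, each time to the residual 1 - t F(t)."""
    F = 0.0
    for _ in range(m):
        F = F + (1.0 - t * F) / (t + lam)
    return F

def iterated_ridge_sum(m, lam, t):
    """Sum formula: sum_{i=1}^m lam^(i-1) / (lam + t)^i."""
    return sum(lam ** (i - 1) / (lam + t) ** i for i in range(1, m + 1))

def iterated_ridge_closed(m, lam, t):
    """Closed form: (1 - (lam / (lam + t))^m) / t, for t > 0."""
    return (1.0 - (lam / (lam + t)) ** m) / t

cases = [(m, lam, t) for m in (1, 2, 5) for lam in (0.01, 0.3) for t in (0.05, 0.5, 1.0)]
max_gap = max(
    max(abs(iterated_ridge_recursive(*c) - iterated_ridge_sum(*c)),
        abs(iterated_ridge_sum(*c) - iterated_ridge_closed(*c)))
    for c in cases
)
# Conditions (4.14) and (4.15) with E = m, D = 1, q = m on the same points.
bounds_ok = all(
    iterated_ridge_closed(m, lam, t) <= m * min(1.0 / lam, 1.0 / t) + 1e-12
    and abs(1.0 - t * iterated_ridge_closed(m, lam, t)) <= (lam / t) ** m + 1e-12
    for (m, lam, t) in cases
)
```

Each pass multiplies the residual by λ/(λ + t), which is where the m-th power in the residual formula comes from.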
Gradient descent/Landweber iteration. Consider the gradient method based on
the quadratic loss objective function
\[ \widehat L(f) = \frac{1}{2n} \sum_{i=1}^n (\langle f, X_i\rangle - Y_i)^2 = \frac{1}{2} \big\|\widehat S f - Y\big\|_n^2, \]
with the gradient
\[ \nabla_f \widehat L = \widehat S^*(\widehat S f - Y) = \widehat\Sigma f - \widehat S^* Y. \]
Thus, if the estimate after k gradient iterations (with fixed stepsize η) has the form $\widehat f_k = F_k(\widehat\Sigma)\widehat S^* Y$,
the next gradient step adds the increment $\eta\big(I - \widehat\Sigma F_k(\widehat\Sigma)\big)\widehat S^* Y$. Therefore, the k-th step of
gradient descent takes the form of a regularization function Fk(t) satisfying the recursion
\[ F_{k+1}(t) = F_k(t) + \eta(1 - tF_k(t)) = \eta + (1 - \eta t)F_k(t); \]
after unfolding the recursion (starting from F0 = 0) we get
\[ F_k(t) = \eta \sum_{\ell = 0}^{k-1} (1 - \eta t)^\ell = \frac{1}{t}\big(1 - (1 - \eta t)^k\big). \]
If t ∈ [0, κ²] and η ∈ (0, κ⁻²) then ηt < 1 and we have Fk(t) ≤ 1/t and, using $(1 - x)^k \ge 1 - kx$
for x ≤ 1 by convexity,
\[ F_k(t) \le \frac{1}{t}\, k\eta t = k\eta. \]

Thus, if we define the equivalent regularization parameter $\lambda_k := (\eta k)^{-1}$, the first part of
Assumption 4.8 is satisfied (for E = 1) and it holds for any q > 0:
\[ |1 - tF_k(t)| = (1 - \eta t)^k \le \exp(-k\eta t) \le c_q \Big(\frac{1}{k\eta t}\Big)^q = c_q \Big(\frac{\lambda_k}{t}\Big)^q, \]
where $c_q = (q/e)^q$. For the last inequality we used the elementary fact that $\max_{u > 0}\big(\log u - \frac{u}{q}\big) = \log q - 1$,
so that $\exp(-u) \le (q/e)^q u^{-q}$. Hence the second part of Assumption 4.8 is satisfied
for any q > 0 with D = cq.
To summarize, gradient descent (for the quadratic risk) with fixed stepsize acts as a
regularization if it is stopped early at step k, provided the stopping time is chosen in accordance
with the target function’s source regularity.
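The correspondence between early-stopped gradient descent and a spectral filter can be checked on scalars. The sketch below iterates the recursion for Fk, compares it with the closed form, and tests the two bounds with λk = 1/(ηk); the stepsize η = 0.5, the normalization κ² = 1 and the grids are illustrative assumptions.

```python
import math

def gd_filter_recursive(k, eta, t):
    """Iterate F_{j+1} = F_j + eta (1 - t F_j) starting from F_0 = 0."""
    F = 0.0
    for _ in range(k):
        F = F + eta * (1.0 - t * F)
    return F

def gd_filter_closed(k, eta, t):
    """Closed form F_k(t) = (1 - (1 - eta t)^k) / t, for t > 0."""
    return (1.0 - (1.0 - eta * t) ** k) / t

eta = 0.5                                  # requires eta < 1 / kappa^2 with kappa^2 = 1
ts = [j / 50.0 for j in range(1, 51)]      # grid of (0, 1]
ks = [1, 10, 100]

max_gap = max(abs(gd_filter_recursive(k, eta, t) - gd_filter_closed(k, eta, t))
              for k in ks for t in ts)

def bounds_hold(k, t, q):
    lam_k = 1.0 / (eta * k)                # equivalent regularization parameter
    F = gd_filter_closed(k, eta, t)
    c_q = (q / math.e) ** q
    first = F <= min(1.0 / lam_k, 1.0 / t) + 1e-12            # (4.14) with E = 1
    second = abs(1.0 - t * F) <= c_q * (lam_k / t) ** q + 1e-12  # (4.15) with D = c_q
    return first and second

all_ok = all(bounds_hold(k, t, q) for k in ks for t in ts for q in (0.5, 1.0, 3.0))
```

More iterations correspond to a smaller λk, that is, to less regularization, which matches the usual interpretation of early stopping.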

5 Reproducing kernel methods
5.1 Reproducing kernel Hilbert spaces
We quickly review some important facts about the reproducing kernel Hilbert space
methodology in data science. Kernel methods were initially considered in the context of spline
methods in the 1970s [27], where what we now call “kernel ridge regression” was already
introduced; they enjoyed an important resurgence in the 2000s due to their versatility for
applications of machine learning methods, in particular Support Vector Machines. While they
have been outperformed by modern Deep Learning methods, they remain an important tool to
understand and analyze machine learning methods (see e.g. works on the “Neural Tangent
Kernel” [12]).
A (reproducing) kernel Hilbert space (rkHs) can be defined in several equivalent ways
(see [2], Chap. 5, for instance).
Definition 5.1 (kernel Hilbert space, abstract version). Given a base space X , a kernel
Hilbert space over X is a R- or C-Hilbert space together with a “feature” mapping Φ :
X → H. The associated kernel is the function k defined as
k : X × X → R or C : (x, y) 7→ ⟨Φ(x), Φ(y)⟩. (5.1)
Proposition/Definition 5.2 (rkHS, functional space version). Given a base space X, a
reproducing kernel Hilbert space over X is an R- or C-Hilbert space H whose elements are R-
or C-valued functions on X, together with a “feature” mapping Φ : X → H such that the
following property holds:
∀f ∈ H, x ∈ X : f (x) = ⟨f, Φ(x)⟩. (5.2)
The associated kernel is the function k defined as (5.1).
As a consequence of (5.2), it can be checked that Φ must in fact be the mapping
∀x ∈ X : Φ(x) = k(x, ·) = (y 7→ k(x, y)), (5.3)
implying in particular that all functions k(x, ·) must belong to H.

The proof of (5.3) is left to the reader. The property (5.2) is called ”self-reproducing
property” because once applied to the functions Φ(x) = k(x, ·) and Φ(y) = k(y, ·) it yields
⟨k(x, ·), k(y, ·)⟩ = k(x, y).
In this sense the term “reproducing” only makes sense in the functional space version, thus
when referring to a rkHs one always implicitly assumes the functional space version.
The following equivalent definition of the functional space version is sometimes useful:
Property 5.3. Let H be an R- or C-Hilbert space whose elements are R- or C-valued
functions on X, and such that for any x ∈ X, the (linear) evaluation functional at point
x: f ∈ H ↦ f(x) is continuous. Then H is a rkHs (functional space version).

Proof. By Riesz’ theorem, since for any x ∈ X the evaluation mapping f ∈ H 7→ f (x) is
continuous, there exists an element Φ(x) such that f (x) = ⟨f, Φ(x)⟩, i.e. (5.2) is satisfied.

Obviously, a rkHS (functional space version) is a rkHS (abstract version). However,


the functional space version is canonical in the sense given by the following theorem.
Theorem 5.4 (Characterization theorem). If H is a rkHS over X, the associated kernel
function k is of positive semidefinite (psd) type, meaning that for any n ∈ N, (x1, . . . , xn) ∈
Xⁿ, the “kernel Gram matrix” G given by Gij = k(xi, xj) is Hermitian (= self-adjoint) and
positive semidefinite.
Conversely, if k is a kernel function of psd type on X, then there exists a rkHS (functional
version) Hk on X with kernel k. It is the completion of the pre-Hilbert space of
functions Hpre = Span{k(x, ·), x ∈ X} with the inner product
\[ \Big\langle \sum_i \alpha_i k(x_i, \cdot), \sum_j \beta_j k(x_j, \cdot) \Big\rangle = \sum_{i,j} \alpha_i \bar\beta_j k(x_i, x_j). \]

This Hilbert space Hk is the canonical (functional space) rkHS associated with the psd
kernel k. For any rkHs (abstract version) H◦ on X with feature mapping Φ◦ and kernel k,
the mapping ξ : u 7→ (x 7→ ⟨u, Φ◦ (x)⟩◦ ) maps H◦ onto Hk and satisfies ξ ◦ Φ◦ (x) = k(x, ·)
i.e. ξ ◦ Φ◦ is the canonical feature mapping from X to Hk .
For a proof see e.g. [24, 2].
Observe that if H is a Hilbert space, we can see its dual H∗ as a rkHs on H, with
the canonical mapping f ∈ H 7→ f ∗ ∈ H∗ and the ”linear” kernel k(f, g) = ⟨f, g⟩. This
somewhat convoluted way to present a Hilbert space is merely to remark that the case
of the linear regression model with covariate in a Hilbert space (4.1) considered in the
previous chapter can be cast into the framework considered here.
To come back to our purposes, given a rkHs H with kernel k and feature mapping Φ
on X, and data (Xi, Yi)i∈⟦n⟧ taking values in X × R, we intend to apply the results of
the previous chapter to the mapped data (X̃i, Yi) ∈ H × R, where X̃ = Φ(X).
The following property is immediate and useful, and relates boundedness of the kernel
to boundedness of the Hilbert-valued covariate X̃:

Lemma 5.5. If k is a psd kernel on a space X, and if sup_{x∈X} k(x, x) = κ² < ∞, then
for any rkHS with kernel k over X with feature mapping Φ, it holds ∥Φ(x)∥ ≤ κ for any
x ∈ X. Furthermore, it holds |k(x, y)| ≤ κ² for any (x, y) ∈ X².
The proof is almost tautological and left to the reader (for the second part, use the
Cauchy-Schwarz inequality).
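The psd property of Theorem 5.4 and the bound of Lemma 5.5 can be observed on a small example. The sketch below uses a Gaussian kernel on the real line, which is an assumption made for illustration (the notes do not single out a kernel here); the bandwidth, the sample points and the function names are arbitrary.

```python
import math
import random

def gauss_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel on R; here sup_x k(x, x) = 1, so kappa = 1."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def gram_matrix(points, kernel):
    """Kernel Gram matrix G_ij = k(x_i, x_j)."""
    return [[kernel(x, y) for y in points] for x in points]

def quadratic_form(G, a):
    """a^T G a = sum_{i,j} a_i a_j k(x_i, x_j)."""
    n = len(a)
    return sum(a[i] * G[i][j] * a[j] for i in range(n) for j in range(n))

rng = random.Random(1)
n = 8
points = [rng.uniform(-3.0, 3.0) for _ in range(n)]
G = gram_matrix(points, gauss_kernel)

symmetric = all(G[i][j] == G[j][i] for i in range(n) for j in range(n))
# psd type: the quadratic form is nonnegative for every coefficient vector.
forms = [quadratic_form(G, [rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(200)]
kappa2 = max(G[i][i] for i in range(n))
bounded = all(abs(G[i][j]) <= kappa2 for i in range(n) for j in range(n))
```

Random coefficient vectors cannot certify positive semidefiniteness, they can only fail to find a counterexample; an eigenvalue computation would be the more thorough check.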

5.2 Kernel operators in reproducing kernel Hilbert spaces


As a first step, we generalize the model and different operators introduced in the previous
chapter in the framework of data mapped into a (functional space) rkHS.

Proposition 5.6. Let H be a rkHs over X with kernel k and canonical feature mapping
Φ(x) = k(x, ·). Let ρ be a probability distribution over X such that Eρ[k(X, X)] < ∞ (for
instance, this is the case under the assumption sup_{x∈X} k(x, x) = κ² < ∞).
Then we generalize the different operators appearing in Propositions 4.2 and 4.3 as
follows:

• The population evaluation operator S maps an element f ∈ H to itself, as an element
of L²(X, ρ). (In this sense it is a “change of geometry” operator.) It is a Hilbert-Schmidt
operator.

• The adjoint S* of S is given by the kernel integral operator
\[ f \in L^2(\mathcal{X}, \rho) \mapsto \Big[t \mapsto \int_{\mathcal{X}} k(x, t) f(x)\, d\rho(x) = \mathbb{E}_\rho[f(X) k(X, t)]\Big] \in H. \tag{5.4} \]

• The operator $S^*S = \mathbb{E}[k(X, \cdot) \otimes k(X, \cdot)^*]$, which is a trace-class operator, is also given
by the kernel integral operator (5.4), but as an operator from H to H; the operator
SS* is again given by the same formula, but as an operator from L²(X, ρ) to itself.

• The sample evaluation operator $\widehat S$ is given by
\[ \widehat S : H \to (\mathbb{R}^n, \langle\cdot,\cdot\rangle_n), \quad f \mapsto (f(X_1), \ldots, f(X_n)). \]

• Its adjoint is given by
\[ \widehat S^* : (\mathbb{R}^n, \langle\cdot,\cdot\rangle_n) \to H, \quad (a_1, \ldots, a_n) \mapsto \frac{1}{n} \sum_{i \in \llbracket n \rrbracket} a_i k(X_i, \cdot). \tag{5.5} \]

• It holds $\widehat S^* \widehat S = \widehat\Sigma = \frac{1}{n} \sum_{i=1}^n k(X_i, \cdot) \otimes k(X_i, \cdot)^*$, while $\widehat S \widehat S^* : \mathbb{R}^n \to \mathbb{R}^n$ is the normalized
Gram kernel matrix $\widehat G$ given by
\[ \widehat G_{i,j} = \frac{1}{n} k(X_i, X_j), \qquad (i, j) \in \llbracket n \rrbracket^2. \]

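Proof. For the first point, we must check that S is a well-defined, Hilbert-Schmidt operator.
First, under the assumption, any element f ∈ H is (as a function over X) square-integrable
with respect to ρ, since by the reproducing property and Cauchy-Schwarz:
\[ \mathbb{E}_\rho\big[|f(X)|^2\big] = \mathbb{E}_\rho\big[|\langle f, k(X, \cdot)\rangle_H|^2\big] \le \mathbb{E}_\rho\big[\|f\|_H^2 \|k(X, \cdot)\|_H^2\big] = \|f\|_H^2\, \mathbb{E}_\rho[k(X, X)] < \infty. \]
If $(e_i)_{i \in I}$ is an orthonormal basis of H, then
\[ \sum_{i \in I} \|Se_i\|_{L^2(\mathcal{X}, \rho)}^2 = \sum_{i \in I} \mathbb{E}_\rho\big[|e_i(X)|^2\big] = \sum_{i \in I} \mathbb{E}_\rho\big[|\langle e_i, k(X, \cdot)\rangle_H|^2\big] = \mathbb{E}_\rho\big[\|k(X, \cdot)\|^2\big] = \mathbb{E}_\rho[k(X, X)] < \infty, \]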

establishing that S is Hilbert-Schmidt.
For the second point, we first check that for f ∈ L²(X, ρ), the variable Z(X) =
f(X)k(X, ·) is Bochner-integrable in H (it does take its values in H, since for fixed X
it is a multiple of k(X, ·) ∈ H). We have by the Cauchy-Schwarz inequality:
\[ \mathbb{E}_\rho[\|f(X)k(X, \cdot)\|] = \mathbb{E}_\rho[|f(X)|\, \|k(X, \cdot)\|] \le \big(\mathbb{E}_\rho\big[|f(X)|^2\big]\, \mathbb{E}_\rho\big[\|k(X, \cdot)\|^2\big]\big)^{1/2} = \|f\|_{L^2(\mathcal{X}, \rho)}\, \mathbb{E}_\rho[k(X, X)]^{1/2} < \infty, \]
establishing that Z(X) is Bochner-integrable; we can then write, for f ∈ H and g ∈ L²(X, ρ):
\[ \langle Sf, g\rangle_{L^2(\mathcal{X}, \rho)} = \mathbb{E}_\rho[f(X)g(X)] = \mathbb{E}_\rho[\langle f, k(X, \cdot)\rangle_H\, g(X)] = \langle f, \mathbb{E}_\rho[k(X, \cdot)g(X)]\rangle_H, \]
leading to the announced formula for S*. The rest is left to the reader.
Again, the “abstract” setting of the previous chapter can be recovered if we assume
directly random data taking values in a Hilbert space H, linear kernel and the rkHs given
by the dual H*.
Conversely, if we have a rkHs over X with a feature mapping X̃ = Φ(X), we can
“forget” the original covariate space X and its rkHS and see the problem only in
terms of X̃ and the setting of the previous chapter. From the point of view of the statistical
analysis, it does not change the arguments. But the rkHS view is richer in the sense that it
describes the model in terms of functions of the original covariate X, which is more concrete
than its mapped version X̃.
In particular, the linear regression in Hilbert space model (4.1) becomes, in the rkHs
setting and in terms of the original covariate X ∈ X :

Y = f ∗ (X) + ξ, (5.6)

where f ∗ is an element of the rkHs H.

5.3 Spectral regularization in a rkHs regression setting


Let us consider the model (5.6), which is (4.1) "incarnated" in a rkHs setting. Remember that
in the last chapter we analyzed spectral regularization methods of the form (4.13):

$$\widehat{f}_\lambda = F_\lambda(\widehat{\Sigma})\, \widehat{S}^* Y. \tag{5.7}$$

Computing this "abstract" estimator seems impossible in practice since we apparently need
to manipulate infinite-dimensional vectors and operators in a Hilbert space. However,
thanks to the shift formula (2.8) we can rewrite the above as

$$\widehat{f}_\lambda = F_\lambda(\widehat{S}^*\widehat{S})\, \widehat{S}^* Y = \widehat{S}^* F_\lambda(\widehat{S}\widehat{S}^*)\, Y = \widehat{S}^* F_\lambda(\widehat{G})\, Y. \tag{5.8}$$

The benefit of this rewriting is that since $\widehat{G}$ is a finite n × n matrix, and Y an n-dimensional
vector, we can (at least in principle) numerically compute the n-vector of
coefficients $\widehat{\alpha} = F_\lambda(\widehat{G})\, Y$. Using (5.5), the estimated function is then

$$\widehat{f}_\lambda = \frac{1}{n} \sum_{i \in \llbracket n \rrbracket} \widehat{\alpha}_i\, k(X_i, \cdot).$$

Hence, given the vector of coefficients $\widehat{\alpha}$ and continued access to the training points $(X_i)$,
we can easily compute the prediction $\widehat{f}_\lambda(x)$ at any test point x.
To sum up, we can use the equivalent representation (5.8) for actual numerical computation
of the estimated function at any point, while we use the "abstract" representation (5.7)
for the statistical analysis of the estimator, to which all the arguments from
the previous chapter apply.
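As an illustration of this computation, here is a minimal sketch in Python (NumPy) of the coefficient vector $\widehat{\alpha} = F_\lambda(\widehat{G})Y$, obtained through an eigendecomposition of $\widehat{G}$, and of the resulting prediction rule. The Gaussian kernel, its bandwidth and the function names (`gaussian_kernel`, `fit_coefficients`, `predict`) are assumptions made for the example and do not come from these notes; the ridge filter $F_\lambda(t) = 1/(t+\lambda)$ and the spectral cut-off filter are given as two possible choices of $F_\lambda$.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=0.5):
    """Gaussian kernel matrix between rows of A and rows of B (illustrative choice)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def ridge_filter(t, lam):
    """Ridge (Tikhonov) filter F_lambda(t) = 1 / (t + lambda)."""
    return 1.0 / (t + lam)

def spectral_cutoff_filter(t, lam):
    """Spectral cut-off filter: 1/t on eigenvalues >= lambda, 0 otherwise."""
    out = np.zeros_like(t)
    keep = t >= lam
    out[keep] = 1.0 / t[keep]
    return out

def fit_coefficients(X, Y, lam, filt=ridge_filter, kernel=gaussian_kernel):
    """Return alpha = F_lambda(G_hat) Y, with G_hat = K / n the normalized Gram matrix."""
    n = X.shape[0]
    G_hat = kernel(X, X) / n
    evals, evecs = np.linalg.eigh(G_hat)      # G_hat is symmetric
    evals = np.clip(evals, 0.0, None)         # remove tiny negative round-off
    return evecs @ (filt(evals, lam) * (evecs.T @ Y))

def predict(X_train, alpha, X_test, kernel=gaussian_kernel):
    """Evaluate f(x) = (1/n) sum_i alpha_i k(X_i, x) at the test points."""
    n = X_train.shape[0]
    return kernel(X_test, X_train) @ alpha / n
```

With the ridge filter, `fit_coefficients` should agree (up to round-off) with a direct solve of $(\widehat{G} + \lambda I)\alpha = Y$. The eigendecomposition costs of the order of $n^3$ flops, which is the kind of bottleneck discussed in Section 6.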
A worked example. As an additional step towards interpreting the statistical
results in this setting, let us look at the kind of "regularity" that the source assumption 4.10
entails. We will be looking at a deliberately simplified illustrative example.
Let X be the interval [0, 2π] (seen as the unit circle) and assume the covariate distribution
ρ is the uniform distribution on X . Furthermore, assume we consider a kernel k on
X of the form k(x, y) = F (x − y), where F is a function X → R. We assume F can be
written as a Fourier series

$$F(t) = a_0 + \sum_{k \ge 1} 2 a_k \cos(kt),$$

with positive, summable coefficients $(a_k)_{k\in\mathbb{N}}$ (the factor 2 for k ≥ 1 is introduced to simplify
things later on).
Let us first justify that k is a spd kernel. We have

$$k(x,y) = F(x-y) = a_0 + \sum_{k\ge1} 2a_k\big(\cos(kx)\cos(ky) + \sin(kx)\sin(ky)\big) = \langle \Phi(x), \Phi(y)\rangle_{\ell^2},$$

where $\Phi(x) = \big(\sqrt{a_0}, \sqrt{2a_1}\cos(x), \sqrt{2a_1}\sin(x), \ldots, \sqrt{2a_k}\cos(kx), \sqrt{2a_k}\sin(kx), \ldots\big) \in \ell^2(\mathbb{N})$.
Through this explicit representation in the (real) Hilbert space $\ell^2(\mathbb{N})$, via Definition 5.1 it
is indeed checked that k is a spd kernel on X .
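The feature representation can be checked numerically on a truncated version of the series. The sketch below is only an illustration: the truncation level, the particular coefficients and the function names `fourier_kernel` and `feature_map` are assumptions of the example. The coordinates of the feature vector are grouped (all cosines, then all sines) rather than interleaved as in the text, which does not change inner products.

```python
import numpy as np

def fourier_kernel(x, y, a):
    """k(x, y) = a[0] + 2 * sum_{k>=1} a[k] * cos(k (x - y)), truncated at len(a) - 1."""
    k = np.arange(1, len(a))
    diff = np.subtract.outer(x, y)                  # shape (len(x), len(y))
    return a[0] + 2.0 * (np.cos(diff[..., None] * k) @ a[1:])

def feature_map(x, a):
    """Truncated Phi(x): sqrt(a0), then sqrt(2 a_k) cos(kx), then sqrt(2 a_k) sin(kx)."""
    k = np.arange(1, len(a))
    scale = np.sqrt(2.0 * a[1:])
    kx = np.outer(x, k)
    const = np.full((len(x), 1), np.sqrt(a[0]))
    return np.hstack([const, scale * np.cos(kx), scale * np.sin(kx)])
```

For any sample of points `x`, the matrix `fourier_kernel(x, x, a)` should coincide with `feature_map(x, a) @ feature_map(x, a).T`, and is therefore positive semi-definite.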
Moreover, via Proposition 5.6 and in particular formula (5.4), we see that if f is a
function on X with Fourier expansion

$$f(x) = f_0 + \sum_{k\ge1}\big(f_k^c\cos(kx) + f_k^s\sin(kx)\big), \tag{5.9}$$

then

$$S^*f(x) = a_0 f_0 + \sum_{k\ge1}\big(a_k f_k^c\cos(kx) + a_k f_k^s\sin(kx)\big).$$

In particular, we see that $S^*(x \mapsto \cos(kx)) = a_k\,(x\mapsto\cos(kx))$, thus all trigonometric
functions are elements of H, and are in fact eigenfunctions of the operator Σ = S ∗ S with
corresponding eigenvalues $(\lambda_k = a_k)_{k\ge0}$.
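The eigenfunction property can be verified by numerical integration. The following sketch approximates $g \mapsto \mathbb{E}_\rho[k(x, Y)g(Y)]$ for ρ uniform on $[0, 2\pi)$ with an equispaced quadrature rule (exact for trigonometric polynomials of degree below the number of nodes). The truncated coefficient sequence, the number of nodes and the function names are assumptions of the example.

```python
import numpy as np

def F(t, a):
    """Truncated kernel profile F(t) = a[0] + 2 * sum_{k>=1} a[k] cos(k t)."""
    k = np.arange(1, len(a))
    return a[0] + 2.0 * (np.cos(np.multiply.outer(t, k)) @ a[1:])

def integral_operator(g, a, x, n_grid=512):
    """Approximate x -> E_rho[k(x, Y) g(Y)] for rho uniform on [0, 2*pi)."""
    y = 2.0 * np.pi * np.arange(n_grid) / n_grid    # equispaced quadrature nodes
    return F(np.subtract.outer(x, y), a) @ g(y) / n_grid
```

Applying `integral_operator` to `lambda y: np.cos(j * y)` should return `a[j] * np.cos(j * x)`, which also shows why the factor 2 in the definition of F is convenient: it compensates the value 1/2 of the squared $L^2(\rho)$ norm of $\cos(j\cdot)$.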

Due to the fact that $S(S^*S)^{-1/2}$ is a partial isometry (this holds in general, as seen
from the singular value decomposition), whose range is dense in $L^2([0,2\pi])$, we can deduce
that if f is a function with Fourier decomposition (5.9), then f is an element of H iff
$\sum_{k\ge1} a_k^{-1}\big((f_k^c)^2 + (f_k^s)^2\big) < \infty$, and in fact we have

$$\|f\|_{\mathcal H}^2 = a_0^{-1} f_0^2 + \frac12\sum_{k\ge1} a_k^{-1}\big((f_k^c)^2+(f_k^s)^2\big),$$

where the factor 1/2 comes from $\|\cos(k\cdot)\|^2_{L^2(\rho)} = \|\sin(k\cdot)\|^2_{L^2(\rho)} = \tfrac12$ for k ≥ 1.

From this, we understand precisely the nature of the functions in the rkHs, which have
roughly speaking half the regularity of the kernel function F . For instance, if ak ∝ k −α
with α > 1, then the kernel function is α times differentiable (possibly in a fractional
sense), while the rkHs is made of functions that are α/2 times differentiable.
By the same type of arguments, we find that if f is in the range of $\Sigma^r$ then

$$\|\Sigma^{-r} f\|_{\mathcal H}^2 = a_0^{-(1+2r)} f_0^2 + \frac12\sum_{k\ge1} a_k^{-(1+2r)}\big((f_k^c)^2+(f_k^s)^2\big),$$

so the functions satisfying the Hölder source condition of order r are exactly those such that
$\sum_{k\ge1} a_k^{-(1+2r)}\big((f_k^c)^2+(f_k^s)^2\big) < \infty$. Again, if $a_k \propto k^{-\alpha}$, a function satisfies a source condition
of order r iff (up to minor caveats) it is $\beta = \alpha(r+\tfrac12)$ times (fractionally) differentiable
(also called β-Hölder). It is notable that only this "intrinsic" regularity parameter
of the target function drives the convergence rate of the statistical analysis (4.22).
In other words, if we used a different kernel function with a different spectral decay of
order α′ of the associated kernel integral operator, we would get the same convergence rate
for target functions of Hölder regularity β because they would satisfy a different source
regularity condition of order r′ for this kernel, leading to the same convergence rate only
depending on β. (In fact the convergence rate appearing in (4.22) can be shown to be
statistically minimax optimal for functions of that intrinsic regularity). This argument
holds provided α′ < 2β (since we need r′ ≥ 0) and the qualification of the methods is large
enough to cover source regularity r′ .
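As a small worked computation, under the assumptions of this example only (power decay $a_k \propto k^{-\alpha}$ for the first kernel and $\propto k^{-\alpha'}$ for the second), the source order $r'$ seen by the second kernel follows from writing the same intrinsic regularity β in two ways:

```latex
\beta = \alpha\left(r + \tfrac12\right) = \alpha'\left(r' + \tfrac12\right)
\quad\Longrightarrow\quad
r' = \frac{\beta}{\alpha'} - \frac12 ,
\qquad
r' \ge 0 \iff \alpha' \le 2\beta .
```

For instance, a target function with β = 3 has source order r = 1 for a kernel with α = 2, and source order r′ = 1/4 for a kernel with α′ = 4; the boundary case α′ = 2β corresponds to r′ = 0.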
A conclusion from this example is that it could be preferable to choose a less regular
kernel and use a regularization method with large qualification, because it can adapt to
target functions more regular than the kernel via the source condition, while the converse
is not guaranteed: it is not clear if using a smooth kernel can adapt to irregular functions
(of smoothness less than half of that of the kernel). There are actually results in that
direction, but they are more difficult and require more assumptions.

5.4 Kernel mean embeddings of distributions


TODO! Source eg. [18].

6 Acceleration methods
In this section we will consider different approaches to speed up the numerical computa-
tion of procedures seen previously, such as the spectral regularization procedures (4.13).
Usually, this is done at the price of some approximation, and it is of interest to analyze if
this can be done while keeping the statistical guarantee on the obtained estimator. Note
that a particularly important computational bottleneck is the computation of the regularized
inverse $F_\lambda(\widehat{G})$. Even if for some regularization methods it consists of forward matrix
multiplications, we still have to manipulate an n × n matrix, and when the data size n is
large, this can be problematic or time-consuming. We would like to find ways to alleviate this
point in particular.

6.1 Parallelizing: divide and average


References: [28, 19, 15] (among many others)
A common situation is to have a large number m of computers available. Distributing
across machines is easy when operations have to be done in parallel and on different parts
of the data. However, it is not obvious a priori how to identify a parallelization opportunity
in the estimator (4.13) and in particular, as mentioned above, concerning the computation
of $F_\lambda(\widehat{G})$.
An apparently naive approach is simply to divide the data into m blocks of approximately
equal size N = n/m (we will assume that n is a multiple of m to simplify, but
obviously this is not required), compute independently the estimators $\widehat{f}_\lambda^{(1)}, \ldots, \widehat{f}_\lambda^{(m)}$ on
the separate blocks of the data on the m separate machines, and take a simple average of
their outputs. Not only is the computational burden parallelized across machines, but the
computations for each $\widehat{f}_\lambda^{(k)}$ can be simpler since they rely on less data. For example, if
computing exactly the estimator $\widehat{f}_\lambda$ has a superlinear cost (in floating point operations,
flops) $O(n^\gamma)$ with γ ≥ 1, then the total cost for the divide-and-average approach is only
$O(m(n/m)^\gamma)$, giving an operation complexity gain factor of $O(m^{\gamma-1})$ (and a time complexity
gain factor of $O(m^\gamma)$ since the machines run in parallel).
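The divide-and-average scheme can be sketched as follows for kernel ridge regression, i.e. the filter $F_\lambda(t) = 1/(t+\lambda)$. This is a minimal single-process illustration: the Gaussian kernel, the contiguous split of the sample and the function names are assumptions of the example, and the loop over blocks stands in for what would run on m separate machines.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=0.5):
    """Gaussian kernel matrix between rows of A and rows of B (illustrative choice)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def krr_fit(X, Y, lam, kernel=gaussian_kernel):
    """Local estimator: alpha = (G_hat + lam I)^{-1} Y with G_hat = K / (local sample size)."""
    n = X.shape[0]
    return np.linalg.solve(kernel(X, X) / n + lam * np.eye(n), Y)

def krr_predict(X_train, alpha, X_test, kernel=gaussian_kernel):
    """f(x) = (1/n_local) sum_i alpha_i k(X_i, x)."""
    return kernel(X_test, X_train) @ alpha / X_train.shape[0]

def divide_and_average_predict(X, Y, lam, m, X_test, kernel=gaussian_kernel):
    """Split the sample into m blocks, fit each block independently, average the predictions."""
    blocks = np.array_split(np.arange(X.shape[0]), m)
    preds = [krr_predict(X[b], krr_fit(X[b], Y[b], lam, kernel), X_test, kernel)
             for b in blocks]
    return np.mean(preds, axis=0)
```

Each local solve involves an (n/m) × (n/m) system only, and the same regularization parameter λ is used on every block.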
Of course, the important question is whether this approach preserves the statistical
convergence rate shown in Section 4. Surprisingly enough, it is actually the case (within
certain limits). The following is an analogue of Proposition 4.11 in the distributed setting.

Proposition 6.1. Suppose granted Assumption 4.1, regularization function Assumption 4.8
with qualification q, and Hölder source Assumption 4.18 of order r, such that $q \ge r + \tfrac12$.
For any λ ∈ [0, ∥Σ∥op ], consider the "distribute-and-average" estimator

$$\widetilde{f}_\lambda = \frac1m \sum_{k\in\llbracket m\rrbracket} \widehat{f}_\lambda^{(k)},$$

where the $\widehat{f}_\lambda^{(k)}$ are given by (4.13) applied to each of the m subsamples of size n/m.

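Assume that it holds

$$m \le C_\blacktriangle\, n\, \frac{\lambda}{\log(2\mathcal{N}(\lambda))}, \tag{6.1}$$

and, if $r > \tfrac12$:

$$m \le \mathcal{N}(\lambda)\, \lambda^{-\min(1, 2r-1)}. \tag{6.2}$$

Then, for n large enough, it holds with probability at least 1 − 2/n:

$$\Big\|\Sigma^{1/2}\big(\widetilde{f}_\lambda - f^*\big)\Big\|_{\mathcal H} \le C_\blacktriangle \left(\sqrt{\frac{\mathcal{N}(\lambda)}{n}} + \lambda^{r+\frac12}\right)\log n. \tag{6.3}$$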

This means that we can attain the same statistical convergence rates as in Corollary 4.12
(up to a logarithmic factor in n) in the distributed setting, provided that, for the choice
of regularization parameter given by (4.21), the constraints (6.1) and (6.2) are satisfied.
Concretely, in the same setting as in Corollary 4.12, the first constraint reads as

$$m \le C_\blacktriangle\, n^{1-\frac{\alpha}{2\beta+1}},$$

and the second constraint as (for $r\in[\tfrac12,1]$, i.e. β ∈ [α, 3α/2]):

$$m \le C_\blacktriangle\, n^{1-\frac{1+\alpha}{2\beta+1}},$$

and for r ≥ 1, i.e. β ≥ 3α/2:

$$m \le C_\blacktriangle\, n^{1-\frac{2(\beta-\alpha)}{2\beta+1}}.$$

While the complete interpretation is not obvious (also the above bounds are not claimed
to be optimal), one can check that there is always a nontrivial possibility to distribute, i.e.
m can be chosen as a nontrivial power of n, within some limits.
Proof. The key is to start again from the decomposition (4.24), which we restate here
for each of the estimators $\widehat{f}_\lambda^{(k)}$ (indicating with the index (k) that the empirical quantities
only depend on subsample k):

$$\Sigma^{1/2}\big(\widehat{f}_\lambda^{(k)} - f^*\big) = \Sigma^{1/2}\big(F_\lambda(\widehat{\Sigma}_{(k)})\widehat{\Sigma}_{(k)} - I\big) f^* + \Sigma^{1/2} F_\lambda(\widehat{\Sigma}_{(k)})\big(\widehat{S}^*_{(k)}\xi_{(k)}\big) = (I)_{(k)} + (II)_{(k)}. \tag{6.4}$$

We note that the "bias" term (I)(k) has non-zero expectation which is the same across
machines, while the "noise" term (II)(k) has expectation zero and has independent realizations
across machines; we can therefore hope for a noise reduction effect for term (II)(k)
when averaging across machines. On the other hand, this effect won’t apply to term (I)(k) ,
which should therefore be uniformly small across machines.
We will need to reiterate the arguments used in the proof of Proposition 4.11 separately
on each of the m machines, each dealing with an independent data set of size N . For this,
as before we will assume that the probabilistic inequalities of Proposition 4.6 are satisfied,

as well as those of Corollary 4.7, simultaneously for all machines and data subsets. We
therefore use a union bound over machines, which amounts to saying that the statements
we make will hold with probability 1 − mδ rather than 1 − δ. (Thus, for statements with
probability 1 − δ we will need to replace $L_\delta$ by $L_{\delta/m} \le C_\blacktriangle L_\delta + \log(m)$, which we will do at
the end.)
Furthermore, remember that the requirements for these inequalities to hold are λ ∈ (0, ∥Σ∥op )
(which will be satisfied for n large enough, so we won’t discuss it in more detail), and more
importantly, condition (4.10) for a data sample of size N :

$$N \ge C_\blacktriangle A\, \frac{\log(2\mathcal{N}(\lambda)) + L_\delta}{\lambda}. \tag{6.5}$$
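Let us start with term (I)(k) ; we recall the control obtained in (4.27) for a data sample
of size N :

$$\big\|(I)_{(k)}\big\| = \Big\|\Sigma^{1/2}\big(F_\lambda(\widehat{\Sigma}_{(k)})\widehat{\Sigma}_{(k)} - I\big) f^*\Big\| \le C_\blacktriangle\left(\lambda^{r+\frac12} + \mathbf{1}\{r \ge \tfrac12\}\, \frac{\sqrt{\lambda}\sqrt{L_\delta}}{\lambda^{(1-r)_+}\sqrt{N}}\right), \tag{6.6}$$

which will hold (with probability 1 − mδ) for all machines simultaneously (for the $\widehat{\Sigma}_{(k)}$
corresponding to their respective data subsamples, $k \in \llbracket m \rrbracket$).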
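For the "noise" term (II)(k) , under the same condition (6.5), for any individual $k\in\llbracket m\rrbracket$
(indicating the subsample) we have with probability 1 − δ the control (4.25), which we
rewrite here:

$$\big\|(II)_{(k)}\big\| = \Big\|\Sigma^{1/2}F_\lambda(\widehat{\Sigma}_{(k)})\big(\widehat{S}^*_{(k)}\xi_{(k)}\big)\Big\| \le C_\blacktriangle\left(\sqrt{\frac{L_\delta\,\mathcal{N}(\lambda)}{N}} + \frac{L_\delta}{\sqrt{\lambda}\,N}\right) \le C'_\blacktriangle \sqrt{\frac{L_\delta\,\mathcal{N}(\lambda)}{N}}. \tag{6.7}$$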

Let us denote the Hilbert-valued random variables

$$U_{(k)} := (II)_{(k)} = \Sigma^{1/2}F_\lambda(\widehat{\Sigma}_{(k)})\big(\widehat{S}^*_{(k)}\xi_{(k)}\big), \qquad k\in\llbracket m\rrbracket.$$

Note that the variables $U_{(k)}$ are independent (since $U_{(k)}$ only depends on the subsample k,
and these subsamples are independent), and bounded in norm by $B := C'_\blacktriangle\sqrt{L_\delta\,\mathcal{N}(\lambda)/N}$ with
high probability 1 − δ taken individually, or simultaneously with probability 1 − mδ. We will
therefore apply Hoeffding’s inequality, using the following trick: consider the "truncated"
random variables:

$$\widetilde{U}_{(k)} := \begin{cases} U_{(k)}, & \text{if } \|U_{(k)}\| \le B;\\ 0, & \text{if } \|U_{(k)}\| > B.\end{cases}$$

We will apply Hoeffding’s inequality to the family $(\widetilde{U}_{(k)})_{k\in\llbracket m\rrbracket}$ and argue that with high
probability their sum coincides with that of the $(U_{(k)})_{k\in\llbracket m\rrbracket}$.
First, an annoying point is that the modified variables $\widetilde{U}_{(k)}$ are not guaranteed to be
centered like the $U_{(k)}$ are. Let us therefore roughly upper bound this discrepancy: since


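$\sup_{t\in[0,\kappa^2]} |F_\lambda(t)| \le E/\lambda$ and $\widehat{S}^*_{(k)}\xi_{(k)}$ is an average of variables bounded in norm by 2M κ
(see e.g. proof of Proposition 4.4), the following rough upper bound holds (always):

$$\big\|U_{(k)}\big\| \le \frac{C_\blacktriangle}{\lambda},$$

and therefore

$$\begin{aligned}
\Big\|\mathbb{E}\big[\widetilde{U}_{(k)}\big]\Big\| = \Big\|\mathbb{E}\big[\widetilde{U}_{(k)} - U_{(k)}\big]\Big\| &= \Big\|\mathbb{E}\big[(\widetilde{U}_{(k)} - U_{(k)})\,\mathbf{1}\{\widetilde{U}_{(k)} \ne U_{(k)}\}\big]\Big\| \\
&\le \frac{C_\blacktriangle}{\lambda}\,\mathbb{P}\big[\widetilde{U}_{(k)} \ne U_{(k)}\big] \\
&= \frac{C_\blacktriangle}{\lambda}\,\mathbb{P}\big[\|U_{(k)}\| > B\big] \\
&\le C_\blacktriangle\frac{\delta}{\lambda}.
\end{aligned}$$

We will assume in the sequel the condition

$$\delta \le \frac{\lambda}{\sqrt{n}}, \tag{6.8}$$

which implies in particular from the above that $\|\mathbb{E}[\widetilde{U}_{(k)}]\| \le C_\blacktriangle B$ (using the definition of
B to check this).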
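Applying the vectorial Hoeffding’s inequality to the centered variables $\overline{U}_{(k)} := \widetilde{U}_{(k)} - \mathbb{E}[\widetilde{U}_{(k)}]$,
$k\in\llbracket m\rrbracket$, independent and bounded in norm by $C_\blacktriangle B$, therefore yields that for any
η ∈ (0, 1), with probability at least 1 − η it holds

$$\Big\|\frac1m\sum_{k\in\llbracket m\rrbracket}\overline{U}_{(k)}\Big\| \le \frac{C_\blacktriangle B\sqrt{L_\eta}}{\sqrt m},$$

entailing

$$\Big\|\frac1m\sum_{k\in\llbracket m\rrbracket}\widetilde{U}_{(k)}\Big\| \le C_\blacktriangle\left(\frac{\delta}{\lambda} + \frac{C_\blacktriangle B\sqrt{L_\eta}}{\sqrt m}\right) \le C_\blacktriangle\sqrt{\frac{\mathcal{N}(\lambda)\,L_\delta L_\eta}{n}},$$

where we have used the definition of B, condition (6.8), and n = N m.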
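Finally, the latter bound also holds for $\frac1m\sum_{k\in\llbracket m\rrbracket}U_{(k)}$ with probability 1 − η − mδ,
since it holds $\widetilde{U}_{(k)} = U_{(k)}$ for all $k\in\llbracket m\rrbracket$ with probability at least 1 − mδ.
Summing up: plugging in this estimate as well as the estimate (6.6) into (6.4), we
obtain

$$\Big\|\Sigma^{1/2}\Big(\frac1m\sum_{k\in\llbracket m\rrbracket}\widehat{f}_\lambda^{(k)} - f^*\Big)\Big\| \le C_\blacktriangle\left(\lambda^{r+\frac12} + \mathbf{1}\{r\ge\tfrac12\}\frac{\sqrt\lambda}{\lambda^{(1-r)_+}\sqrt N} + \sqrt{\frac{\mathcal{N}(\lambda)}{n}}\right)\sqrt{L_\delta L_\eta}, \tag{6.9}$$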

with probability 1 − η − mδ, and provided that conditions (6.5) and (6.8) hold. This is to
be compared to the bound (4.28) that we obtained for the "single machine" analysis.
Let us analyze these conditions: first, condition (6.5) implies in particular $\lambda \ge C_\blacktriangle/N \ge C_\blacktriangle/n$.
Hence, (6.8) is ensured if (6.5) is and if $\delta \le C_\blacktriangle n^{-3/2}$. This is reasonable since it
will only result in a logarithmic factor (coming from $L_\delta$). We choose henceforth $\delta = n^{-2}$,
so that mδ ≤ 1/n; and $\eta = n^{-1}$.
As for condition (6.5) itself, due to N = n/m it is implied by the sufficient condition
on the number of subsamples/machines m:

$$m \le C_\blacktriangle\, n\, \frac{\lambda}{\log(2\mathcal{N}(\lambda))}.$$

Finally, as in the analysis of (4.28), we would like to be able to wrap the second term into
the third one. This is the case (i.e. the second term is smaller), again using N = n/m, if

$$m \le \mathcal{N}(\lambda)\,\lambda^{-\min(1,2r-1)};$$

remember that this constraint is only relevant if $r \ge \tfrac12$.
6.2 Nyström methods


Sources: [10, 23, 14] (among many others)
The Nyström approximation method consists, roughly speaking, in approximating the
kernel covariance and/or the kernel Gram matrix by a lower rank matrix obtained by
subsampling points.
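We present it here for kernel ridge regression (KRR). Remember that the kernel ridge
regression algorithm using a rkHs H with kernel k over the input space X can be seen as
the solution of the following minimization problem:

$$\widehat{f}_\lambda = \operatorname*{Arg\,Min}_{f\in\mathcal H}\Big(\frac1n\sum_{i\in\llbracket n\rrbracket}(f(X_i)-Y_i)^2 + \lambda\|f\|_{\mathcal H}^2\Big) = \operatorname*{Arg\,Min}_{f\in\mathcal H}\Big(\big\|\widehat{S}f - Y\big\|_n^2 + \lambda\|f\|_{\mathcal H}^2\Big), \tag{6.10}$$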

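whose explicit solution is

$$\widehat{f}_\lambda = (\widehat{S}^*\widehat{S} + \lambda I)^{-1}\widehat{S}^*Y = \widehat{S}^*(\widehat{S}\widehat{S}^* + \lambda I)^{-1}Y;$$

we recall that the latter form means that the solution can be written as

$$\widehat{f}_\lambda = \widehat{S}^*\widehat{\alpha} = \frac1n\sum_{i\in\llbracket n\rrbracket}\widehat{\alpha}_i\,k(X_i,\cdot), \qquad\text{with } \widehat{\alpha} = (\widehat{G}+\lambda I)^{-1}Y,$$

which is also known as a form of the so-called representer theorem (see for instance [2] for
details).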
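The equality of the two forms of the KRR solution can be checked numerically in a finite-dimensional case. The sketch below assumes a linear kernel on $\mathbb{R}^d$ (so that the rkHs is identified with $\mathbb{R}^d$ and $k(x,x') = \langle x, x'\rangle$), synthetic Gaussian data, and the normalization $\widehat{S}^*y = \frac1n\sum_i y_i k(X_i,\cdot)$ under which $\widehat{S}\widehat{S}^*$ is the normalized Gram matrix; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.standard_normal((n, d))      # rows play the role of the feature vectors
Y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Primal form: (S*S + lam I)^{-1} S* Y, with S* y = X^T y / n.
Sigma_hat = X.T @ X / n
w_primal = np.linalg.solve(Sigma_hat + lam * np.eye(d), X.T @ Y / n)

# Dual form: S* (S S* + lam I)^{-1} Y, with S S* = G_hat = X X^T / n.
G_hat = X @ X.T / n
alpha = np.linalg.solve(G_hat + lam * np.eye(n), Y)
w_dual = X.T @ alpha / n

# Gradient of the objective (1/n) ||X w - Y||^2 + lam ||w||^2 at the primal solution.
grad = 2.0 * X.T @ (X @ w_primal - Y) / n + 2.0 * lam * w_primal
```

Here `w_primal` and `w_dual` should coincide up to round-off, and `grad` should vanish, confirming that both expressions describe the minimizer of the objective (6.10) in this simple case. The primal form requires a d × d solve, the dual form an n × n solve.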

The idea of Nyström-based methods is to approximate the above expansion of $\widehat{f}_\lambda$ by a
reduced expansion on a subset of points of size m, i.e. of the form

$$\widetilde{f}_\lambda = \sum_{i\in I}\widetilde{\alpha}_i\,k(X_i,\cdot), \tag{6.11}$$

for a subset of indices I with |I| = m.
Since the original KRR estimator is obtained by minimization of objective (6.10) over
f ∈ H, or, equivalently because of the above representation, over $f\in\mathcal H_n = \mathrm{Span}\{k(X_i,\cdot),\ i\in\llbracket n\rrbracket\}$,
the natural approach when considering elements with the reduced expansion (6.11), i.e.
$f\in\mathcal H_I = \mathrm{Span}\{k(X_i,\cdot),\ i\in I\}$, is to minimize the same objective under this constraint.
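Proposition 6.2. Let $I\subseteq\llbracket n\rrbracket$ be a subset of indices and $\mathcal H_I = \mathrm{Span}\{k(X_i,\cdot),\ i\in I\}$.
Then the solution of

$$\operatorname*{Arg\,Min}_{f\in\mathcal H_I}\Big(\big\|\widehat{S}f-Y\big\|_n^2 + \lambda\|f\|_{\mathcal H}^2\Big) \tag{6.12}$$

is given by

$$\widetilde{f}_\lambda = \frac1n\sum_{i\in I}\widetilde{\alpha}_i\,k(X_i,\cdot), \qquad\text{with } \widetilde{\alpha} = \big(\widehat{G}_{I,\llbracket n\rrbracket}\widehat{G}_{I,\llbracket n\rrbracket}^t + \lambda\widehat{G}_{I,I}\big)^{-1}\widehat{G}_{I,\llbracket n\rrbracket}Y, \tag{6.13}$$

where $\widehat{G}_{I,J}$ denotes the submatrix of the normalized Gram kernel matrix $\widehat{G}$ corresponding to index
sets I and J.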
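Proof. We can write explicitly, for $f = \sum_{i\in I}\alpha_i k(X_i,\cdot)$, that the vector of evaluations of f
at points $(X_1,\ldots,X_n)$ is given by $n\widehat{G}_{\llbracket n\rrbracket,I}\alpha$ (the factor n because $\widehat{G}$ is the normalized Gram
matrix). Furthermore, by properties of a rkHs, it holds $\|f\|_{\mathcal H}^2 = \sum_{i,j\in I}\alpha_i\alpha_j k(X_i,X_j) = n\alpha^t\widehat{G}_{I,I}\alpha$.
Thus, (6.12) is rewritten as the minimization of

$$\big\|n\widehat{G}_{\llbracket n\rrbracket,I}\alpha - Y\big\|_n^2 + n\lambda\alpha^t\widehat{G}_{I,I}\alpha = \alpha^t\big(n\widehat{G}_{I,\llbracket n\rrbracket}\widehat{G}_{I,\llbracket n\rrbracket}^t + n\lambda\widehat{G}_{I,I}\big)\alpha - 2\alpha^t\widehat{G}_{I,\llbracket n\rrbracket}Y + \|Y\|_n^2.$$

Standard formulas for quadratic optimization and some bookkeeping yield (6.13).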
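Formula (6.13) can be turned into a short numerical routine. The sketch below is an illustration under assumptions that are not part of the notes: a Gaussian kernel, the function names, and the use of a least-squares solver instead of an explicit inverse (the m × m system can be badly conditioned when selected points are close to each other).

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=0.5):
    """Gaussian kernel matrix between rows of A and rows of B (illustrative choice)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def nystrom_krr_fit(X, Y, lam, idx, kernel=gaussian_kernel):
    """Coefficients of (6.13): solve (G_In G_In^T + lam G_II) alpha = G_In Y, with G = K / n."""
    n = X.shape[0]
    G_In = kernel(X[idx], X) / n      # (m, n) block: the only kernel evaluations needed
    G_II = G_In[:, idx]               # (m, m) block
    A = G_In @ G_In.T + lam * G_II
    return np.linalg.lstsq(A, G_In @ Y, rcond=None)[0]

def nystrom_krr_predict(X, alpha, idx, X_test, kernel=gaussian_kernel):
    """f(x) = (1/n) sum_{i in I} alpha_i k(X_i, x)."""
    return kernel(X_test, X[idx]) @ alpha / X.shape[0]
```

As a sanity check, taking `idx` equal to all indices should reproduce the predictions of the full KRR estimator at the training points. Only m·n kernel evaluations and one m × m linear solve are needed.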
Note the interesting fact that to compute the Nyström approximate solution, it is not
necessary to compute the full kernel Gram matrix $\widehat{G}$, only the submatrix $\widehat{G}_{I,\llbracket n\rrbracket}$. Furthermore,
the costly step of matrix inversion only concerns an (m, m) matrix instead of an (n, n)
one, thus significantly reducing computation.
Several strategies can be proposed for selection of the subset I:
• Uniform sampling (with or without replacement) of m indices within $\llbracket n\rrbracket$;
• Leverage score sampling, where the indices are sampled with weights proportional to
so-called leverage scores,

$$\ell_\lambda(i) := \Big(\widehat{G}\big(\widehat{G}+\lambda I\big)^{-1}\Big)_{ii}.$$
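For illustration, the exact leverage scores and the associated sampling step can be sketched as follows. The function names are hypothetical, and sampling without replacement is one possible choice made for the example. Note that this exact computation solves an n × n linear system, so it is only meant to make the definition concrete.

```python
import numpy as np

def ridge_leverage_scores(G_hat, lam):
    """Diagonal of G_hat (G_hat + lam I)^{-1} for a symmetric psd matrix G_hat."""
    n = G_hat.shape[0]
    # (G_hat + lam I)^{-1} G_hat equals G_hat (G_hat + lam I)^{-1} since the two factors commute.
    return np.diag(np.linalg.solve(G_hat + lam * np.eye(n), G_hat)).copy()

def sample_indices(scores, m, rng):
    """Draw m distinct indices with probabilities proportional to the scores."""
    p = np.clip(scores, 0.0, None)    # guard against tiny negative round-off
    p = p / p.sum()
    return rng.choice(len(scores), size=m, replace=False, p=p)
```

The scores sum to $\mathrm{Tr}\big(\widehat{G}(\widehat{G}+\lambda I)^{-1}\big)$, and each score lies in [0, 1).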

A theoretical analysis (see [23, 14]) shows that under some lower bound on the subsample
size m depending on the problem parameters (source condition, intrinsic dimension) but
still allowing m ≪ n, the statistical convergence rate obtained in Corollary 4.12 can be
preserved for the Nyström approximated estimator $\widetilde{f}_\lambda$. The use of leverage score
sampling allows further reduction of the subsample size m; however, there is a chicken-and-egg
problem in that exact computation of these scores itself in principle requires inversion
of the (n, n) matrix we were trying to avoid! To alleviate this, several approaches to
approximate the leverage scores have been proposed, see for instance [22].

References
[1] Frank Bauer, Sergei Pereverzev, and Lorenzo Rosasco. On regularization algorithms
in learning theory. Journal of complexity, 23(1):52–72, 2007.
[2] Gilles Blanchard. Mathematics for artificial intelligence I. (M1 Lecture notes), 2022.
[3] Gilles Blanchard and Nicole Mücke. Optimal rates for regularization of statistical
inverse learning problems. Foundations of Computational Mathematics, 18(4):971–
1013, 2018.
[4] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities:
a nonasymptotic theory of independence. Oxford University Press, 2013.
[5] Haim Brezis. Functional Analysis, Sobolev Spaces and Partial Differential Equations.
Universitext. Springer Nature, New York, NY, 2010.
[6] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-
squares algorithm. Foundations of Computational Mathematics, 7:331–368, 2007.
[7] John B. Conway. A Course in Functional Analysis, volume 96 of Graduate Texts in
Mathematics. Springer, 1985. [Available in electronic form at the LMO library].
[8] John B. Conway. A Course in Operator Theory, volume 21 of Graduate Studies
in Mathematics. American Mathematical Society, 2000. [Available in electronic
form at the LMO library].
[9] H.W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Kluwer
Academic Publishers, 1996.
[10] Alex Gittens and Michael Mahoney. Revisiting the Nyström method for improved
large-scale machine learning. In International Conference on Machine Learning, pages
567–575. PMLR, 2013.
[11] Milen Ivanov and Stanimir Troyanski. Uniformly smooth renorming of Banach
spaces with modulus of convexity of power type 2. Journal of Functional Analysis,
237(2):373–390, 2006.
[12] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Conver-
gence and generalization in neural networks. Advances in neural information process-
ing systems, 31, 2018.
[13] Matthieu Lerasle. Lectures on high-dimensional probability. (M2 Lecture notes), 2023.
[14] Jian Li, Yong Liu, and Weiping Wang. Optimal convergence rates for agnostic Nyström
kernel learning. In International Conference on Machine Learning, pages 19811–19836.
PMLR, 2023.

[15] Junhong Lin and Volkan Cevher. Optimal convergence for distributed learning with
stochastic gradient methods and spectral algorithms. Journal of Machine Learning
Research, 21(147):1–63, 2020.
[16] Junhong Lin, Alessandro Rudi, Lorenzo Rosasco, and Volkan Cevher. Optimal rates
for spectral algorithms with least-squares regression over Hilbert spaces. Applied and
Computational Harmonic Analysis, 48(3):868–890, 2020.
[17] Stanislav Minsker. On some extensions of Bernstein’s inequality for self-adjoint oper-
ators. Statistics & Probability Letters, 127:111–119, 2017.
[18] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard
Schölkopf. Kernel mean embedding of distributions: a review and beyond. Foun-
dations and Trends in Machine Learning, 10, 2017.
[19] Nicole Mücke and Gilles Blanchard. Parallelizing spectrally regularized kernel algo-
rithms. Journal of Machine Learning Research, 19(30):1–29, 2018.
[20] Iosif Pinelis. Optimum bounds for the distributions of martingales in Banach spaces.
The Annals of Probability, 22(4):1679–1706, 1994.
[21] Abhishake Rastogi and Sivananthan Sampath. Optimal rates for the regularized learn-
ing algorithms under general source condition. Frontiers in Applied Mathematics and
Statistics, 3:3, 2017.
[22] Alessandro Rudi, Daniele Calandriello, Luigi Carratino, and Lorenzo Rosasco. On
fast leverage score sampling and optimal learning. Advances in Neural Information
Processing Systems, 31, 2018.
[23] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström
computational regularization. Advances in neural information processing systems, 28,
2015.
[24] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
[25] Nicole Tomczak-Jaegermann. The moduli of smoothness and convexity and the
Rademacher averages of the trace classes Sp (1 ≤ p < ∞). Studia Mathematica,
50(2):163–182, 1974.
[26] Joel A. Tropp. An introduction to matrix concentration inequalities. Foundations and
Trends in Machine Learning, 8(1-2):1–230, 2015.
[27] G. Wahba. Spline Models for Observational Data, volume 59. SIAM CBMS-NSF
Series in Applied Mathematics, 1990.
[28] Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge
regression. In Conference on learning theory, pages 592–617. PMLR, 2013.
