Mathematics For Machine Learning
Foreword
2       Linear Algebra
2.1     Systems of Linear Equations
2.2     Matrices
2.3     Solving Systems of Linear Equations
2.4     Vector Spaces
2.5     Linear Independence
2.6     Basis and Rank
2.7     Linear Mappings
2.8     Affine Spaces
2.9     Further Reading
        Exercises
3       Analytic Geometry
3.1     Norms
3.2     Inner Products
3.3     Lengths and Distances
3.4     Angles and Orthogonality
3.5     Orthonormal Basis
3.6     Orthogonal Complement
3.7     Inner Product of Functions
3.8     Orthogonal Projections
3.9     Rotations
3.10    Further Reading
        Exercises
4       Matrix Decompositions
4.1     Determinant and Trace
This material is published by Cambridge University Press as Mathematics for Machine Learning by
Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view
and download for personal use only. Not for re-distribution, re-sale, or use in derivative works.
© by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021. https://mml-book.com.
References
Index
                                      Foreword
covered in high school mathematics and physics. For example, the reader
should have seen derivatives and integrals before, and geometric vectors
in two or three dimensions. Starting from there, we generalize these con-
cepts. Therefore, the target audience of the book includes undergraduate
university students, evening learners and learners participating in online
machine learning courses.
   In analogy to music, there are three types of interaction that people
have with machine learning:
   Astute Listener The democratization of machine learning by the pro-
vision of open-source software, online tutorials and cloud-based tools al-
lows users to not worry about the specifics of pipelines. Users can focus on
extracting insights from data using off-the-shelf tools. This enables non-
tech-savvy domain experts to benefit from machine learning. This is sim-
ilar to listening to music; the user is able to choose and discern between
different types of machine learning, and benefits from it. More experi-
enced users are like music critics, asking important questions about the
application of machine learning in society such as ethics, fairness, and pri-
vacy of the individual. We hope that this book provides a foundation for
thinking about the certification and risk management of machine learning
systems, and allows them to use their domain expertise to build better
machine learning systems.
   Experienced Artist Skilled practitioners of machine learning can plug
and play different tools and libraries into an analysis pipeline. The stereo-
typical practitioner would be a data scientist or engineer who understands
machine learning interfaces and their use cases, and is able to perform
wonderful feats of prediction from data. This is similar to a virtuoso play-
ing music, where highly skilled practitioners can bring existing instru-
ments to life and bring enjoyment to their audience. Using the mathe-
matics presented here as a primer, practitioners would be able to under-
stand the benefits and limits of their favorite method, and to extend and
generalize existing machine learning algorithms. We hope that this book
provides the impetus for more rigorous and principled development of
machine learning methods.
   Fledgling Composer As machine learning is applied to new domains,
developers of machine learning need to develop new methods and extend
existing algorithms. They are often researchers who need to understand
the mathematical basis of machine learning and uncover relationships be-
tween different tasks. This is similar to composers of music who, within
the rules and structure of musical theory, create new and amazing pieces.
We hope this book provides a high-level overview of other technical books
for people who want to become composers of machine learning. There is
a great need in society for new researchers who are able to propose and
explore novel approaches for attacking the many challenges of learning
from data.
                               Acknowledgments
We are grateful to many people who looked at early drafts of the book
and suffered through painful expositions of concepts. We tried to imple-
ment their ideas that we did not vehemently disagree with. We would
like to especially acknowledge Christfried Webers for his careful reading
of many parts of the book, and his detailed suggestions on structure and
presentation. Many friends and colleagues have also been kind enough
to provide their time and energy on different versions of each chapter.
We have been lucky to benefit from the generosity of the online commu-
nity, who have suggested improvements via https://github.com, which
greatly improved the book.
   The following people have found bugs, proposed clarifications and sug-
gested relevant literature, either via https://github.com or personal
communication. Their names are sorted alphabetically.
  Contributors through GitHub, whose real names were not listed on their
GitHub profile, are:
   We are also very grateful to Parameswaran Raman and the many anonymous
reviewers, organized by Cambridge University Press, who read one or more
chapters of earlier versions of the manuscript, and provided constructive
criticism that led to considerable improvements. A special mention goes to
Dinesh Singh Negi, our LaTeX support, for detailed and prompt advice about
LaTeX-related issues. Last but not least, we are very grateful to our editor
Lauren Cowles, who has been patiently guiding us through the gestation
process of this book.
                                 Table of Symbols
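    Symbol                                                          Typical meaning
    $a, b, c, \alpha, \beta, \gamma$                                Scalars are lowercase
    $\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{z}$                Vectors are bold lowercase
    $\boldsymbol{A}, \boldsymbol{B}, \boldsymbol{C}$                Matrices are bold uppercase
    $\boldsymbol{x}^\top, \boldsymbol{A}^\top$                      Transpose of a vector or matrix
    $\boldsymbol{A}^{-1}$                                           Inverse of a matrix
    $\langle \boldsymbol{x}, \boldsymbol{y} \rangle$                Inner product of x and y
    $\boldsymbol{x}^\top \boldsymbol{y}$                            Dot product of x and y
    $B = (\boldsymbol{b}_1, \boldsymbol{b}_2, \boldsymbol{b}_3)$    (Ordered) tuple
    $\boldsymbol{B} = [\boldsymbol{b}_1, \boldsymbol{b}_2, \boldsymbol{b}_3]$   Matrix of column vectors stacked horizontally
    $\mathcal{B} = \{\boldsymbol{b}_1, \boldsymbol{b}_2, \boldsymbol{b}_3\}$    Set of vectors (unordered)
    $\mathbb{Z}, \mathbb{N}$                                        Integers and natural numbers, respectively
    $\mathbb{R}, \mathbb{C}$                                        Real and complex numbers, respectively
    $\mathbb{R}^n$                                                  n-dimensional vector space of real numbers
    $\forall x$                                                     Universal quantifier: for all x
    $\exists x$                                                     Existential quantifier: there exists x
    $a := b$                                                        a is defined as b
    $a =: b$                                                        b is defined as a
    $a \propto b$                                                   a is proportional to b, i.e., a = constant · b
    $g \circ f$                                                     Function composition: "g after f"
    $\iff$                                                          If and only if
    $\implies$                                                      Implies
    $\mathcal{A}, \mathcal{C}$                                      Sets
    $a \in \mathcal{A}$                                             a is an element of set A
    $\emptyset$                                                     Empty set
    $\mathcal{A} \setminus \mathcal{B}$                             A without B: the set of elements in A but not in B
    $D$                                                             Number of dimensions; indexed by d = 1, ..., D
    $N$                                                             Number of data points; indexed by n = 1, ..., N
    $\boldsymbol{I}_m$                                              Identity matrix of size m × m
    $\boldsymbol{0}_{m,n}$                                          Matrix of zeros of size m × n
    $\boldsymbol{1}_{m,n}$                                          Matrix of ones of size m × n
    $\boldsymbol{e}_i$                                              Standard/canonical vector (where i is the component that is 1)
    $\dim$                                                          Dimensionality of vector space
    $\operatorname{rk}(\boldsymbol{A})$                             Rank of matrix A
    $\operatorname{Im}(\Phi)$                                       Image of linear mapping Φ
    $\ker(\Phi)$                                                    Kernel (null space) of a linear mapping Φ
    $\operatorname{span}[\boldsymbol{b}_1]$                         Span (generating set) of b_1
    $\operatorname{tr}(\boldsymbol{A})$                             Trace of A
    $\det(\boldsymbol{A})$                                          Determinant of A
    $|\cdot|$                                                       Absolute value or determinant (depending on context)
    $\|\cdot\|$                                                     Norm; Euclidean, unless specified
    $\lambda$                                                       Eigenvalue or Lagrange multiplier
    $E_\lambda$                                                     Eigenspace corresponding to eigenvalue λ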
                                         Part I
Mathematical Foundations
1 Introduction and Motivation
   Since machine learning is inherently data driven, data is at the core
of machine learning. The goal of machine learning is to design
general-purpose methodologies to extract valuable patterns from data, ideally
without much domain-specific expertise. For example, given a large corpus
of documents (e.g., books in many libraries), machine learning methods
can be used to automatically find relevant topics that are shared across
documents (Hoffman et al., 2010). To achieve this goal, we design models
that are typically related to the process that generates data, similar to
the dataset we are given. For example, in a regression setting, the model
would describe a function that maps inputs to real-valued outputs. To
paraphrase Mitchell (1997): A model is said to learn from data if its
performance on a given task improves after the data is taken into account.
The goal is to find good models that generalize well to yet unseen data,
which we may care about in the future. Learning can be understood as a
way to automatically find patterns and structure in data by optimizing the
parameters of the model.
  While machine learning has seen many success stories, and software is
readily available to design and train rich and flexible machine learning
systems, we believe that the mathematical foundations of machine learn-
ing are important in order to understand fundamental principles upon
which more complicated machine learning systems are built. Understand-
ing these principles can facilitate creating new machine learning solutions,
understanding and debugging existing approaches, and learning about the
inherent assumptions and limitations of the methodologies we are work-
ing with.
[Figure: the machine learning problems Regression, Dimensionality Reduction, Density Estimation, and Classification, shown above the mathematical topics Vector Calculus, Probability & Distributions, Optimization, Linear Algebra, Analytic Geometry, and Matrix Decomposition.]
                    between the two parts of the book to link mathematical concepts with
                    machine learning algorithms.
                       Of course there are more than two ways to read this book. Most readers
                    learn using a combination of top-down and bottom-up approaches, some-
                    times building up basic mathematical skills before attempting more com-
                    plex concepts, but also choosing topics based on applications of machine
                    learning.
2 Linear Algebra
Figure 2.1 Different types of vectors. Vectors can be surprising objects, including (a) geometric vectors and (b) polynomials.
Figure 2.2 A mind map of the concepts introduced in this chapter, along with where they are used in other parts of the book.
[Concepts shown: Vector, Matrix, Vector space, Group, Linear independence, Basis, System of linear equations, Linear/affine mapping, Matrix inverse, Gaussian elimination; linked chapters: Chapter 3 (Analytic geometry), Chapter 5 (Vector calculus), Chapter 10 (Dimensionality reduction), Chapter 12 (Classification).]
resources are Gilbert Strang’s Linear Algebra course at MIT and the Linear
Algebra Series by 3Blue1Brown.
   Linear algebra plays an important role in machine learning and gen-
eral mathematics. The concepts introduced in this chapter are further ex-
panded to include the idea of geometry in Chapter 3. In Chapter 5, we
will discuss vector calculus, where a principled knowledge of matrix op-
erations is essential. In Chapter 10, we will use projections (to be intro-
duced in Section 3.8) for dimensionality reduction with principal compo-
nent analysis (PCA). In Chapter 9, we will discuss linear regression, where
linear algebra plays a central role for solving least-squares problems.
Example 2.1
  A company produces products $N_1, \dots, N_n$ for which resources
$R_1, \dots, R_m$ are required. To produce a unit of product $N_j$, $a_{ij}$ units of
resource $R_i$ are needed, where $i = 1, \dots, m$ and $j = 1, \dots, n$.
  The objective is to find an optimal production plan, i.e., a plan of how
many units $x_j$ of product $N_j$ should be produced if a total of $b_i$ units of
resource $R_i$ are available and (ideally) no resources are left over.
  If we produce $x_1, \dots, x_n$ units of the corresponding products, we need
a total of
$$a_{i1} x_1 + \cdots + a_{in} x_n \tag{2.2}$$
many units of resource $R_i$. An optimal production plan $(x_1, \dots, x_n) \in \mathbb{R}^n$,
therefore, has to satisfy the following system of equations:
$$\begin{aligned}
a_{11} x_1 + \cdots + a_{1n} x_n &= b_1 \\
&\;\;\vdots \\
a_{m1} x_1 + \cdots + a_{mn} x_n &= b_m
\end{aligned}, \tag{2.3}$$
where $a_{ij} \in \mathbb{R}$ and $b_i \in \mathbb{R}$.

   Equation (2.3) is the general form of a system of linear equations, and
$x_1, \dots, x_n$ are the unknowns of this system. Every n-tuple $(x_1, \dots, x_n) \in \mathbb{R}^n$
that satisfies (2.3) is a solution of the linear equation system.
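As a small sketch of what "satisfies (2.3)" means in practice, the snippet below checks a candidate production plan numerically. The coefficients, the available resources, and the helper name `is_solution` are hypothetical and chosen only for illustration; they do not come from the text. Row `i` of the array `A` holds $a_{i1}, \dots, a_{in}$, so `A @ x` evaluates the left-hand side of every equation in (2.3) at once.

```python
import numpy as np

# Hypothetical numbers (not from the text):
# A[i, j] = units of resource R_i needed per unit of product N_j.
A = np.array([[2.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])
b = np.array([8.0, 9.0, 5.0])  # available units of each resource

def is_solution(A, b, x):
    """Return True if plan x uses exactly b_i units of every resource R_i."""
    # Compare a_i1 x_1 + ... + a_in x_n with b_i for all i, up to rounding.
    return bool(np.allclose(A @ x, b))

plan = np.array([3.0, 2.0])
print(is_solution(A, b, plan))                  # plan uses all resources exactly
print(is_solution(A, b, np.array([4.0, 0.0])))  # this plan leaves resources over
```

A tolerance-based comparison is used instead of `==` because floating-point arithmetic rarely reproduces the right-hand side exactly.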
Example 2.2
The system of linear equations
$$\begin{aligned}
x_1 + x_2 + x_3 &= 3 \quad (1) \\
x_1 - x_2 + 2x_3 &= 2 \quad (2) \\
2x_1 + 3x_3 &= 1 \quad (3)
\end{aligned} \tag{2.4}$$
has no solution: Adding the first two equations yields $2x_1 + 3x_3 = 5$, which
contradicts the third equation (3).
  Let us have a look at the system of linear equations
$$\begin{aligned}
x_1 + x_2 + x_3 &= 3 \quad (1) \\
x_1 - x_2 + 2x_3 &= 2 \quad (2) \\
x_2 + x_3 &= 2 \quad (3)
\end{aligned}. \tag{2.5}$$
From the first and third equation, it follows that $x_1 = 1$. From (1)+(2),
we get $2x_1 + 3x_3 = 5$, i.e., $x_3 = 1$. From (3), we then get that $x_2 = 1$.
Therefore, $(1, 1, 1)$ is the only possible and unique solution (verify that
$(1, 1, 1)$ is a solution by plugging in).
  As a third example, we consider
$$\begin{aligned}
x_1 + x_2 + x_3 &= 3 \quad (1) \\
x_1 - x_2 + 2x_3 &= 2 \quad (2) \\
2x_1 + 3x_3 &= 5 \quad (3)
\end{aligned}. \tag{2.6}$$
Since (1)+(2)=(3), we can omit the third equation (redundancy). From
(1) and (2), we get $2x_1 = 5 - 3x_3$ and $2x_2 = 1 + x_3$. We define $x_3 = a \in \mathbb{R}$
as a free variable, such that any triplet
$$\left( \tfrac{5}{2} - \tfrac{3}{2}a,\ \tfrac{1}{2} + \tfrac{1}{2}a,\ a \right), \quad a \in \mathbb{R} \tag{2.7}$$
$$\begin{aligned}
4x_1 + 4x_2 &= 5 \\
2x_1 - 4x_2 &= 1
\end{aligned} \tag{2.8}$$
where the solution space is the point $(x_1, x_2) = (1, \tfrac{1}{4})$. Similarly, for three
variables, each linear equation determines a plane in three-dimensional
space. When we intersect these planes, i.e., satisfy all linear equations at
the same time, we can obtain a solution set that is a plane, a line, a point
or empty (when the planes have no common intersection).                        ♦
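The three outcomes in Example 2.2 (no solution, exactly one, infinitely many) and the point of intersection in (2.8) can be checked numerically. The sketch below uses a rank comparison between the coefficient matrix and the augmented matrix; this criterion is not the elimination argument used in the text, but a standard alternative substituted here for brevity, and the function name `classify` is an assumption made for this example.

```python
import numpy as np

def classify(A, b):
    """Classify A x = b as 'none', 'unique', or 'infinite' by comparing ranks."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    rank_A = np.linalg.matrix_rank(A)
    rank_Ab = np.linalg.matrix_rank(np.column_stack([A, b]))
    if rank_A < rank_Ab:
        return "none"  # b is not reachable: the equations contradict each other
    # Consistent system: unique iff there is no free variable left.
    return "unique" if rank_A == A.shape[1] else "infinite"

A_24 = [[1, 1, 1], [1, -1, 2], [2, 0, 3]]  # coefficients of (2.4) and (2.6)
A_25 = [[1, 1, 1], [1, -1, 2], [0, 1, 1]]  # coefficients of (2.5)

print(classify(A_24, [3, 2, 1]))  # system (2.4)
print(classify(A_25, [3, 2, 2]))  # system (2.5)
print(classify(A_24, [3, 2, 5]))  # system (2.6)

# Unique solutions can be computed directly.
x_25 = np.linalg.solve(np.array(A_25, dtype=float), np.array([3.0, 2.0, 2.0]))
x_28 = np.linalg.solve(np.array([[4.0, 4.0], [2.0, -4.0]]), np.array([5.0, 1.0]))
print(x_25, x_28)

# Every member of the family (2.7) solves system (2.6).
for a in (-1.0, 0.0, 2.0):
    x = np.array([2.5 - 1.5 * a, 0.5 + 0.5 * a, a])
    assert np.allclose(np.array(A_24) @ x, [3, 2, 5])
```

`np.linalg.solve` only applies to square, invertible coefficient matrices, which is why it is called for (2.5) and (2.8) but not for (2.4) or (2.6).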
   For a systematic approach to solving systems of linear equations, we
will introduce a useful compact notation. We collect the coefficients $a_{ij}$
into vectors and collect the vectors into matrices. In other words, we write
the system from (2.3) in the following form:
$$\begin{bmatrix} a_{11} \\ \vdots \\ a_{m1} \end{bmatrix} x_1
+ \begin{bmatrix} a_{12} \\ \vdots \\ a_{m2} \end{bmatrix} x_2
+ \cdots
+ \begin{bmatrix} a_{1n} \\ \vdots \\ a_{mn} \end{bmatrix} x_n
= \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix} \tag{2.9}$$
$$\iff
\begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix}
\begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}
= \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix}. \tag{2.10}$$
In the following, we will have a close look at these matrices and define
computation rules. We will return to solving linear equations in Section 2.3.
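The equivalence of the column form (2.9) and the matrix form (2.10) can be seen on a concrete instance. The sketch below reuses the coefficients and the solution of system (2.5); the variable names are chosen for this illustration only.

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, 2.0],
              [0.0, 1.0, 1.0]])  # coefficients of system (2.5)
x = np.array([1.0, 1.0, 1.0])    # its solution
b = np.array([3.0, 2.0, 2.0])    # its right-hand side

# Form (2.9): each unknown x_j scales the j-th column of A; the results are summed.
column_form = sum(x[j] * A[:, j] for j in range(A.shape[1]))

# Form (2.10): the same computation written as a matrix-vector product.
matrix_form = A @ x

print(column_form, matrix_form)
```

Both expressions produce the right-hand side `b`, which is what it means for `x` to solve the system.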
                                                                 2.2 Matrices
                             Matrices play a central role in linear algebra. They can be used to com-
                             pactly represent systems of linear equations, but they also represent linear
                             functions (linear mappings) as we will see later in Section 2.7. Before we
                             discuss some of these interesting topics, let us first define what a matrix
                             is and what kind of operations we can do with matrices. We will see more
                             properties of matrices in Chapter 4.
Definition 2.1 (Matrix). With $m, n \in \mathbb{N}$ a real-valued $(m, n)$ matrix $\boldsymbol{A}$ is
an $m \cdot n$-tuple of elements $a_{ij}$, $i = 1, \dots, m$, $j = 1, \dots, n$, which is ordered
according to a rectangular scheme consisting of $m$ rows and $n$ columns:
$$\boldsymbol{A} = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}, \quad a_{ij} \in \mathbb{R}. \tag{2.11}$$
By convention $(1, n)$-matrices are called rows and $(m, 1)$-matrices are called
columns. These special matrices are also called row/column vectors.

  $\mathbb{R}^{m \times n}$ is the set of all real-valued $(m, n)$-matrices. $\boldsymbol{A} \in \mathbb{R}^{m \times n}$ can be
equivalently represented as $\boldsymbol{a} \in \mathbb{R}^{mn}$ by stacking all $n$ columns of the
matrix into a long vector; see Figure 2.4.

Figure 2.4 By stacking its columns, a matrix A can be represented as a long vector a.
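A minimal sketch of this column stacking in NumPy follows. The example matrix is made up for illustration. NumPy stores arrays row by row by default, so the column-major order has to be requested explicitly with `order="F"`.

```python
import numpy as np

A = np.arange(1, 9).reshape(4, 2)  # a 4 x 2 matrix with columns [1,3,5,7] and [2,4,6,8]

# Stack the columns of A on top of each other (column-major order).
a = A.reshape(-1, order="F")
print(a)  # first column followed by second column

# The reshape is reversible as long as the same order is used.
A_back = a.reshape(4, 2, order="F")
```

Without `order="F"`, `A.reshape(-1)` would concatenate the rows instead of the columns, which is a different long vector.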
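2.2.1 Matrix Addition and Multiplication
The sum of two matrices $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, $\boldsymbol{B} \in \mathbb{R}^{m \times n}$ is defined as the
element-wise sum, i.e.,
$$\boldsymbol{A} + \boldsymbol{B} := \begin{bmatrix}
a_{11} + b_{11} & \cdots & a_{1n} + b_{1n} \\
\vdots & & \vdots \\
a_{m1} + b_{m1} & \cdots & a_{mn} + b_{mn}
\end{bmatrix} \in \mathbb{R}^{m \times n}. \tag{2.12}$$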
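  For matrices $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, $\boldsymbol{B} \in \mathbb{R}^{n \times k}$, the elements $c_{ij}$ of the product
$\boldsymbol{C} = \boldsymbol{A}\boldsymbol{B} \in \mathbb{R}^{m \times k}$ are computed as
$$c_{ij} = \sum_{l=1}^{n} a_{il} b_{lj}, \quad i = 1, \dots, m, \quad j = 1, \dots, k. \tag{2.13}$$
Note the size of the matrices. C = np.einsum('il, lj', A, B)

This means, to compute element $c_{ij}$ we multiply the elements of the $i$th
row of $\boldsymbol{A}$ with the $j$th column of $\boldsymbol{B}$ and sum them up. Later in Section 3.2,
we will call this the dot product of the corresponding row and column. In
cases where we need to be explicit that we are performing multiplication,
we use the notation $\boldsymbol{A} \cdot \boldsymbol{B}$ to denote multiplication (explicitly showing
"$\cdot$").

Note: There are $n$ columns in $\boldsymbol{A}$ and $n$ rows in $\boldsymbol{B}$ so that we can compute
$a_{il} b_{lj}$ for $l = 1, \dots, n$. Commonly, the dot product between two vectors
$\boldsymbol{a}, \boldsymbol{b}$ is denoted by $\boldsymbol{a}^\top \boldsymbol{b}$ or $\langle \boldsymbol{a}, \boldsymbol{b} \rangle$.

Remark. Matrices can only be multiplied if their "neighboring" dimensions
match. For instance, an $n \times k$-matrix $\boldsymbol{A}$ can be multiplied with a $k \times m$-matrix
$\boldsymbol{B}$, but only from the left side:
$$\underbrace{\boldsymbol{A}}_{n \times k} \, \underbrace{\boldsymbol{B}}_{k \times m} = \underbrace{\boldsymbol{C}}_{n \times m} \tag{2.14}$$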
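Example 2.3
For
\[
A = \begin{bmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{bmatrix} \in \mathbb{R}^{2 \times 3}, \qquad
B = \begin{bmatrix} 0 & 2 \\ 1 & -1 \\ 0 & 1 \end{bmatrix} \in \mathbb{R}^{3 \times 2},
\]
we obtain
\[
AB = \begin{bmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{bmatrix}
     \begin{bmatrix} 0 & 2 \\ 1 & -1 \\ 0 & 1 \end{bmatrix}
   = \begin{bmatrix} 2 & 3 \\ 2 & 5 \end{bmatrix} \in \mathbb{R}^{2 \times 2}, \tag{2.15}
\]
\[
BA = \begin{bmatrix} 0 & 2 \\ 1 & -1 \\ 0 & 1 \end{bmatrix}
     \begin{bmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{bmatrix}
   = \begin{bmatrix} 6 & 4 & 2 \\ -2 & 0 & 2 \\ 3 & 2 & 1 \end{bmatrix} \in \mathbb{R}^{3 \times 3}. \tag{2.16}
\]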
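The two products in Example 2.3 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `matmul_loops` is hypothetical and only spells out (2.13) as explicit loops.

```python
import numpy as np

def matmul_loops(A, B):
    """Compute C = AB entry by entry, following (2.13)."""
    m, n = A.shape
    n_b, k = B.shape
    if n != n_b:
        raise ValueError("neighboring dimensions must match")
    C = np.zeros((m, k), dtype=A.dtype)
    for i in range(m):
        for j in range(k):
            for l in range(n):
                C[i, j] += A[i, l] * B[l, j]
    return C

A = np.array([[1, 2, 3],
              [3, 2, 1]])
B = np.array([[0, 2],
              [1, -1],
              [0, 1]])

AB = matmul_loops(A, B)                    # 2x2 result, cf. (2.15)
BA = matmul_loops(B, A)                    # 3x3 result, cf. (2.16)
AB_einsum = np.einsum('il,lj->ij', A, B)   # same contraction written with einsum
```

In practice one would simply write `A @ B`; the loop version is only meant to mirror the summation index l.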
© 2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020).
Associativity:
  (λψ)C = λ(ψC), C ∈ ℝ^{m×n}
  λ(BC) = (λB)C = B(λC) = (BC)λ, B ∈ ℝ^{m×n}, C ∈ ℝ^{n×k}.
  Note that this allows us to move scalar values around.
  (λC)⊤ = C⊤λ⊤ = C⊤λ = λC⊤ since λ = λ⊤ for all λ ∈ ℝ.
Distributivity:
  (λ + ψ)C = λC + ψC, C ∈ ℝ^{m×n}
  λ(B + C) = λB + λC, B, C ∈ ℝ^{m×n}
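These rules for scalars can be spot-checked numerically. The sketch below assumes NumPy; the scalars, shapes, and random seed are arbitrary choices made for illustration, and a numerical check on one example is of course not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)    # fixed seed, arbitrary choice
lam, psi = 2.5, -1.5              # arbitrary scalars
B = rng.standard_normal((3, 4))   # B in R^{m x n}
C = rng.standard_normal((4, 2))   # C in R^{n x k}
D = rng.standard_normal((3, 4))   # same shape as B, used for distributivity

# lambda (BC) = (lambda B) C = B (lambda C)
assoc = (np.allclose(lam * (B @ C), (lam * B) @ C)
         and np.allclose((lam * B) @ C, B @ (lam * C)))
# (lambda C)^T = lambda C^T
transp = np.allclose((lam * C).T, lam * C.T)
# (lambda + psi) C = lambda C + psi C  and  lambda (B + D) = lambda B + lambda D
distr = (np.allclose((lam + psi) * C, lam * C + psi * C)
         and np.allclose(lam * (B + D), lam * B + lam * D))
```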
                 and use the rules for matrix multiplication, we can write this equation
                 system in a more compact form as
                                                     
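\[
\begin{bmatrix} 2 & 3 & 5 \\ 4 & -2 & -7 \\ 9 & 5 & -3 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}
= \begin{bmatrix} 1 \\ 8 \\ 2 \end{bmatrix}. \tag{2.36}
\]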
                 Note that x1 scales the first column, x2 the second one, and x3 the third
                 one.
   Generally, a system of linear equations can be compactly represented in its matrix form as Ax = b; see (2.3). The product Ax is a (linear) combination of the columns of A. We will discuss linear combinations in more detail in Section 2.5.
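To make the column view concrete, the sketch below solves the system (2.36) and rebuilds b as a combination of the columns of A. It assumes NumPy; `np.linalg.solve` is used only because this particular A happens to be square and invertible.

```python
import numpy as np

A = np.array([[2.0, 3.0, 5.0],
              [4.0, -2.0, -7.0],
              [9.0, 5.0, -3.0]])
b = np.array([1.0, 8.0, 2.0])

x = np.linalg.solve(A, b)   # A is square and invertible here

# Ax written out as x1 * (column 1) + x2 * (column 2) + x3 * (column 3)
combo = x[0] * A[:, 0] + x[1] * A[:, 1] + x[2] * A[:, 2]
```

Both `A @ x` and `combo` reproduce b up to floating-point error.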
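so that 0 = 8c₁ + 2c₂ − 1c₃ + 0c₄ and (x₁, x₂, x₃, x₄) = (8, 2, −1, 0). In fact, any scaling of this solution by λ₁ ∈ ℝ produces the 0 vector, i.e.,
\[
\begin{bmatrix} 1 & 0 & 8 & -4 \\ 0 & 1 & 2 & 12 \end{bmatrix}
\left( \lambda_1 \begin{bmatrix} 8 \\ 2 \\ -1 \\ 0 \end{bmatrix} \right)
= \lambda_1 (8c_1 + 2c_2 - c_3) = 0. \tag{2.41}
\]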
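Following the same line of reasoning, we express the fourth column of the matrix in (2.38) using the first two columns and generate another set of non-trivial versions of 0 as
\[
\begin{bmatrix} 1 & 0 & 8 & -4 \\ 0 & 1 & 2 & 12 \end{bmatrix}
\left( \lambda_2 \begin{bmatrix} -4 \\ 12 \\ 0 \\ -1 \end{bmatrix} \right)
= \lambda_2 (-4c_1 + 12c_2 - c_4) = 0 \tag{2.42}
\]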
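for any λ₂ ∈ ℝ. Putting everything together, we obtain all solutions of the equation system in (2.38), which is called the general solution, as the set
\[
\left\{ x \in \mathbb{R}^4 : x =
\begin{bmatrix} 42 \\ 8 \\ 0 \\ 0 \end{bmatrix}
+ \lambda_1 \begin{bmatrix} 8 \\ 2 \\ -1 \\ 0 \end{bmatrix}
+ \lambda_2 \begin{bmatrix} -4 \\ 12 \\ 0 \\ -1 \end{bmatrix},
\ \lambda_1, \lambda_2 \in \mathbb{R} \right\}. \tag{2.43}
\]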
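The structure of the general solution (one particular solution plus all solutions of the homogeneous system) can be checked directly. This is a minimal sketch assuming NumPy; the right-hand side is recomputed from the particular solution rather than typed in, and the function name `general_solution` and the sampled values of λ₁, λ₂ are illustrative choices.

```python
import numpy as np

A = np.array([[1, 0, 8, -4],
              [0, 1, 2, 12]])
x_p = np.array([42, 8, 0, 0])     # particular solution from (2.43)
v1 = np.array([8, 2, -1, 0])      # direction scaled by lambda_1
v2 = np.array([-4, 12, 0, -1])    # direction scaled by lambda_2

b = A @ x_p                       # right-hand side reproduced by x_p

def general_solution(lam1, lam2):
    """One member of the solution set (2.43)."""
    return x_p + lam1 * v1 + lam2 * v2

# Residuals A x - b for a few arbitrary choices of lambda_1, lambda_2
residuals = [A @ general_solution(l1, l2) - b
             for l1 in (-2, 0, 3) for l2 in (-1, 0, 5)]
```

Every residual is the zero vector, because A maps both direction vectors to 0.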
                                                                                
Example 2.6
For a ∈ R, we seek all solutions of the following system of equations:
\[
\begin{aligned}
-2x_1 + 4x_2 - 2x_3 - x_4 + 4x_5 &= -3 \\
4x_1 - 8x_2 + 3x_3 - 3x_4 + x_5 &= 2 \\
x_1 - 2x_2 + x_3 - x_4 + x_5 &= 0 \\
x_1 - 2x_2 - 3x_4 + 4x_5 &= a
\end{aligned}
\ . \tag{2.44}
\]
We start by converting this system of equations into the compact matrix notation Ax = b. We no longer mention the variables x explicitly and build the augmented matrix (in the form [A | b])
\[
\left[ \begin{array}{rrrrr|r}
-2 & 4 & -2 & -1 & 4 & -3 \\
4 & -8 & 3 & -3 & 1 & 2 \\
1 & -2 & 1 & -1 & 1 & 0 \\
1 & -2 & 0 & -3 & 4 & a
\end{array} \right]
\begin{array}{l}
\text{Swap with } R_3 \\ \\ \text{Swap with } R_1 \\ \\
\end{array}
\]
where we used the vertical line to separate the left-hand side from the right-hand side in (2.44). We use ⇝ to indicate a transformation of the augmented matrix using elementary transformations.
Margin note: The augmented matrix [A | b] compactly represents the system of linear equations Ax = b.
   Swapping Rows 1 and 3 leads to
\[
\left[ \begin{array}{rrrrr|r}
1 & -2 & 1 & -1 & 1 & 0 \\
4 & -8 & 3 & -3 & 1 & 2 \\
-2 & 4 & -2 & -1 & 4 & -3 \\
1 & -2 & 0 & -3 & 4 & a
\end{array} \right]
\begin{array}{l}
\\ -4R_1 \\ +2R_1 \\ -R_1
\end{array}
\]
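When we now apply the indicated transformations (e.g., subtract Row 1 four times from Row 2), we obtain
\[
\left[ \begin{array}{rrrrr|r}
1 & -2 & 1 & -1 & 1 & 0 \\
0 & 0 & -1 & 1 & -3 & 2 \\
0 & 0 & 0 & -3 & 6 & -3 \\
0 & 0 & -1 & -2 & 3 & a
\end{array} \right]
\begin{array}{l}
\\ \\ \\ -R_2 - R_3
\end{array}
\]
\[
\leadsto
\left[ \begin{array}{rrrrr|r}
1 & -2 & 1 & -1 & 1 & 0 \\
0 & 0 & -1 & 1 & -3 & 2 \\
0 & 0 & 0 & -3 & 6 & -3 \\
0 & 0 & 0 & 0 & 0 & a + 1
\end{array} \right]
\begin{array}{l}
\\ \cdot(-1) \\ \cdot(-\tfrac{1}{3}) \\ \\
\end{array}
\]
\[
\leadsto
\left[ \begin{array}{rrrrr|r}
1 & -2 & 1 & -1 & 1 & 0 \\
0 & 0 & 1 & -1 & 3 & -2 \\
0 & 0 & 0 & 1 & -2 & 1 \\
0 & 0 & 0 & 0 & 0 & a + 1
\end{array} \right]
\]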
This (augmented) matrix is in a convenient form, the row-echelon form (REF). Reverting this compact notation back into the explicit notation with the variables we seek, we obtain
\[
\begin{aligned}
x_1 - 2x_2 + x_3 - x_4 + x_5 &= 0 \\
x_3 - x_4 + 3x_5 &= -2 \\
x_4 - 2x_5 &= 1 \\
0 &= a + 1
\end{aligned}
\ . \tag{2.45}
\]
Only for a = −1 this system can be solved. A particular solution is
\[
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix}
= \begin{bmatrix} 2 \\ 0 \\ -1 \\ 1 \\ 0 \end{bmatrix}. \tag{2.46}
\]
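The general solution, which captures the set of all possible solutions, is
\[
\left\{ x \in \mathbb{R}^5 : x =
\begin{bmatrix} 2 \\ 0 \\ -1 \\ 1 \\ 0 \end{bmatrix}
+ \lambda_1 \begin{bmatrix} 2 \\ 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}
+ \lambda_2 \begin{bmatrix} 2 \\ 0 \\ -1 \\ 2 \\ 1 \end{bmatrix},
\ \lambda_1, \lambda_2 \in \mathbb{R} \right\}. \tag{2.47}
\]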
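As a sanity check on (2.46) and (2.47), the sketch below plugs the particular solution and the two direction vectors back into the original system (2.44) with a = −1. It assumes NumPy; the variable names are illustrative.

```python
import numpy as np

a = -1  # the only value for which (2.44) is solvable
A = np.array([[-2, 4, -2, -1, 4],
              [4, -8, 3, -3, 1],
              [1, -2, 1, -1, 1],
              [1, -2, 0, -3, 4]])
b = np.array([-3, 2, 0, a])

x_p = np.array([2, 0, -1, 1, 0])   # particular solution (2.46)
v1 = np.array([2, 1, 0, 0, 0])     # first direction in (2.47)
v2 = np.array([2, 0, -1, 2, 1])    # second direction in (2.47)

particular_ok = np.array_equal(A @ x_p, b)                    # A x_p = b
homogeneous_ok = not (A @ v1).any() and not (A @ v2).any()    # A v = 0
```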
                                                                                      
   All rows that contain only zeros are at the bottom of the matrix; correspondingly, all rows that contain at least one nonzero element are on top of rows that contain only zeros.
   Looking at nonzero rows only, the first nonzero number from the left (also called the pivot or the leading coefficient) is always strictly to the right of the pivot of the row above it.
Margin note: In other texts, it is sometimes required that the pivot is 1.

Remark (Basic and Free Variables). The variables corresponding to the pivots in the row-echelon form are called basic variables and the other variables are free variables. For example, in (2.45), x₁, x₃, x₄ are basic variables, whereas x₂, x₅ are free variables. ♦
                        Remark (Obtaining a Particular Solution). The row-echelon form makes
                       0        0          0        0
From here, we find relatively directly that λ3 = 1, λ2 = −1, λ1 = 2. When
we put everything together, we must not forget the non-pivot columns
for which we set the coefficients implicitly to 0. Therefore, we get the
particular solution x = [2, 0, −1, 1, 0]⊤. ♦
Remark (Reduced Row Echelon Form). An equation system is in reduced row-echelon form (also: row-reduced echelon form or row canonical form) if
   It is in row-echelon form.
   Every pivot is 1.
   The pivot is the only nonzero entry in its column.
♦
   The reduced row-echelon form will play an important role later in Sec-
tion 2.3.3 because it allows us to determine the general solution of a sys-
tem of linear equations in a straightforward way.
Remark (Gaussian Elimination). Gaussian elimination is an algorithm that performs elementary transformations to bring a system of linear equations into reduced row-echelon form. ♦
the second column from three times the first column. Now, we look at the
fifth column, which is our second non-pivot column. The fifth column can
be expressed as 3 times the first pivot column, 9 times the second pivot
column, and −4 times the third pivot column. We need to keep track of
the indices of the pivot columns and translate this into 3 times the first col-
umn, 0 times the second column (which is a non-pivot column), 9 times
the third column (which is our second pivot column), and −4 times the
fourth column (which is the third pivot column). Then we need to subtract
the fifth column to obtain 0. In the end, we are still solving a homogeneous
equation system.
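   To summarize, all solutions of Ax = 0, x ∈ ℝ⁵ are given by
\[
\left\{ x \in \mathbb{R}^5 : x =
\lambda_1 \begin{bmatrix} 3 \\ -1 \\ 0 \\ 0 \\ 0 \end{bmatrix}
+ \lambda_2 \begin{bmatrix} 3 \\ 0 \\ 9 \\ -4 \\ -1 \end{bmatrix},
\ \lambda_1, \lambda_2 \in \mathbb{R} \right\}. \tag{2.50}
\]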
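The two vectors in (2.50) can be checked against a matrix that matches the verbal description above. The matrix below is an assumption: it is the reduced row-echelon form with three rows, pivot columns 1, 3, and 4, second column equal to three times the first, and fifth column equal to 3, 9, and −4 times the three pivot columns. NumPy is assumed as well.

```python
import numpy as np

# Assumed matrix in reduced row-echelon form, reconstructed from the description:
# pivot columns 1, 3, 4; column 2 = 3 * column 1; column 5 = 3*p1 + 9*p2 - 4*p3.
A = np.array([[1, 3, 0, 0, 3],
              [0, 0, 1, 0, 9],
              [0, 0, 0, 1, -4]])

v1 = np.array([3, -1, 0, 0, 0])    # first vector in (2.50)
v2 = np.array([3, 0, 9, -4, -1])   # second vector in (2.50)

rank = np.linalg.matrix_rank(A)
null_dim = A.shape[1] - rank       # dimension of the solution space of Ax = 0
```

Both vectors are mapped to 0, and the solution space has dimension 5 − 3 = 2, so two independent vectors suffice to describe it.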
                                                             
\[
[A \mid I_n] \ \leadsto \ \cdots \ \leadsto \ [I_n \mid A^{-1}]. \tag{2.56}
\]
This means that if we bring the augmented equation system into reduced
row-echelon form, we can read out the inverse on the right-hand side of
the equation system. Hence, determining the inverse of a matrix is equiv-
alent to solving systems of linear equations.
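The scheme in (2.56) can be sketched as follows, assuming NumPy. The function name `inverse_via_gauss_jordan` and the 2×2 example matrix are hypothetical, and the partial pivoting step (choosing the largest available pivot) is an addition for numerical robustness that the text does not prescribe.

```python
import numpy as np

def inverse_via_gauss_jordan(A):
    """Reduce [A | I_n] to [I_n | A^{-1}] with elementary row operations."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    aug = np.hstack([A, np.eye(n)])             # build [A | I_n]
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column
        pivot = col + np.argmax(np.abs(aug[col:, col]))
        if np.isclose(aug[pivot, col], 0.0):
            raise ValueError("matrix is singular")
        aug[[col, pivot]] = aug[[pivot, col]]   # swap rows
        aug[col] /= aug[col, col]               # make the pivot 1
        for r in range(n):
            if r != col:
                aug[r] -= aug[r, col] * aug[col]   # clear the rest of the column
    return aug[:, n:]                           # right-hand block is A^{-1}

A = np.array([[2.0, 1.0],
              [5.0, 3.0]])      # hypothetical example matrix
A_inv = inverse_via_gauss_jordan(A)
```

For this example the result agrees with the closed-form 2×2 inverse, and `A @ A_inv` is the identity up to rounding.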
and use the Moore-Penrose pseudo-inverse (A⊤A)⁻¹A⊤ to determine the solution (2.59) that solves Ax = b, which also corresponds to the minimum norm least-squares solution. A disadvantage of this approach is that it requires many computations for the matrix-matrix product and computing the inverse of A⊤A. Moreover, for reasons of numerical precision it is generally not recommended to compute the inverse or pseudo-inverse. In the following, we therefore briefly discuss alternative approaches to solving systems of linear equations.
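The sketch below contrasts the explicit formula with a solver that avoids forming the inverse. It assumes NumPy; the overdetermined system is a made-up example whose columns are linearly independent, which the formula (A⊤A)⁻¹A⊤ requires.

```python
import numpy as np

# Hypothetical overdetermined system: 4 equations, 2 unknowns
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0, 4.0])

# Explicit formula (A^T A)^{-1} A^T b; needs linearly independent columns
x_normal = np.linalg.inv(A.T @ A) @ A.T @ b

# Usually preferable in practice: no explicit inverse is formed
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

# At a least-squares solution the residual is orthogonal to the columns of A
orthogonality = A.T @ (A @ x_lstsq - b)
```

On a small, well-conditioned example both routes agree; the difference matters for large or ill-conditioned systems.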
   Gaussian elimination plays an important role when computing deter-
minants (Section 4.1), checking whether a set of vectors is linearly inde-
pendent (Section 2.5), computing the inverse of a matrix (Section 2.2.2),
computing the rank of a matrix (Section 2.6.2), and determining a basis
of a vector space (Section 2.6.1). Gaussian elimination is an intuitive and
constructive way to solve a system of linear equations with thousands of
variables. However, for systems with millions of variables, it is impracti-
cal as the required number of arithmetic operations scales cubically in the
number of simultaneous equations.
   In practice, systems of many linear equations are solved indirectly, by ei-
ther stationary iterative methods, such as the Richardson method, the Ja-
cobi method, the Gauß-Seidel method, and the successive over-relaxation
method, or Krylov subspace methods, such as conjugate gradients, gener-
alized minimal residual, or biconjugate gradients. We refer to the books
by Stoer and Bulirsch (2002), Strang (2003), and Liesen and Mehrmann
(2015) for further details.
   Let x∗ be a solution of Ax = b. The key idea of these iterative methods is to set up an iteration of the form
\[
x^{(k+1)} = C x^{(k)} + d \tag{2.60}
\]
for suitable C and d that reduces the residual error ‖x^{(k+1)} − x∗‖ in every iteration and converges to x∗. We will introduce norms ‖·‖, which allow us to compute similarities between vectors, in Section 3.1.
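As one concrete instance of (2.60), the Jacobi method uses C = I − D⁻¹A and d = D⁻¹b, where D is the diagonal part of A. The sketch below assumes NumPy; the 2×2 system is a hypothetical example chosen to be strictly diagonally dominant, a standard sufficient condition for this iteration to converge, and the iteration count is arbitrary.

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 5.0]])          # hypothetical, strictly diagonally dominant
b = np.array([6.0, 12.0])

D_inv = np.diag(1.0 / np.diag(A))   # inverse of the diagonal part of A
C = np.eye(2) - D_inv @ A           # Jacobi iteration matrix
d = D_inv @ b

x = np.zeros(2)                     # initial guess x^(0)
for _ in range(50):
    x = C @ x + d                   # one step of (2.60)

x_star = np.linalg.solve(A, b)      # reference solution for comparison
```

Each step only needs a matrix-vector product, which is what makes such methods attractive for very large, sparse systems.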
2.4.1 Groups
Groups play an important role in computer science. Besides providing a fundamental framework for operations on sets, they are heavily used in cryptography, coding theory, and graphics.

Definition 2.7 (Group). Consider a set 𝒢 and an operation ⊗ : 𝒢 × 𝒢 → 𝒢 defined on 𝒢. Then G := (𝒢, ⊗) is called a group if the following hold:
1.   Closure of 𝒢 under ⊗: ∀x, y ∈ 𝒢 : x ⊗ y ∈ 𝒢
2.   Associativity: ∀x, y, z ∈ 𝒢 : (x ⊗ y) ⊗ z = x ⊗ (y ⊗ z)
3.   Neutral element: ∃e ∈ 𝒢 ∀x ∈ 𝒢 : x ⊗ e = x and e ⊗ x = x
4.   Inverse element: ∀x ∈ 𝒢 ∃y ∈ 𝒢 : x ⊗ y = e and y ⊗ x = e, where e is the neutral element. We often write x⁻¹ to denote the inverse element of x.

Remark. The inverse element is defined with respect to the operation ⊗ and does not necessarily mean 1/x. ♦

If additionally ∀x, y ∈ 𝒢 : x ⊗ y = y ⊗ x, then G = (𝒢, ⊗) is an Abelian group (commutative).
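For a finite set, the four conditions of Definition 2.7 can be checked by brute force. The sketch below is illustrative only: the function names `is_group` and `is_abelian` are hypothetical, and the integers modulo 5 are an example chosen here, not one taken from the text.

```python
from itertools import product

def is_group(elements, op):
    """Brute-force check of the four group axioms on a finite set."""
    elements = list(elements)
    s = set(elements)
    # 1. Closure
    if any(op(x, y) not in s for x, y in product(elements, repeat=2)):
        return False
    # 2. Associativity
    if any(op(op(x, y), z) != op(x, op(y, z))
           for x, y, z in product(elements, repeat=3)):
        return False
    # 3. Neutral element
    neutral = [e for e in elements
               if all(op(x, e) == x and op(e, x) == x for x in elements)]
    if not neutral:
        return False
    e = neutral[0]
    # 4. Inverse element
    return all(any(op(x, y) == e and op(y, x) == e for y in elements)
               for x in elements)

def is_abelian(elements, op):
    """Check commutativity of op on a finite set."""
    elements = list(elements)
    return all(op(x, y) == op(y, x) for x, y in product(elements, repeat=2))

Z5 = range(5)
add_mod5 = lambda x, y: (x + y) % 5
mul_mod5 = lambda x, y: (x * y) % 5
```

With addition modulo 5 the set {0, ..., 4} is an Abelian group; with multiplication modulo 5 it is not a group because 0 has no inverse, whereas {1, ..., 4} is.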
Remark. The following properties are useful to find out whether vectors
are linearly independent:
     – The pivot columns indicate the vectors, which are linearly indepen-
       dent of the vectors on the left. Note that there is an ordering of vec-
       tors when the matrix is built.
     – The non-pivot columns can be expressed as linear combinations of
       the pivot columns on their left. For instance, the row-echelon form
                                              
\[
\begin{bmatrix} 1 & 3 & 0 \\ 0 & 0 & 2 \end{bmatrix} \tag{2.66}
\]
       tells us that the first and third columns are pivot columns. The sec-
       ond column is a non-pivot column because it is three times the first
       column.
  All column vectors are linearly independent if and only if all columns
  are pivot columns. If there is at least one non-pivot column, the columns
  (and, therefore, the corresponding vectors) are linearly dependent.
Example 2.14
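Consider ℝ⁴ with
\[
x_1 = \begin{bmatrix} 1 \\ 2 \\ -3 \\ 4 \end{bmatrix}, \quad
x_2 = \begin{bmatrix} 1 \\ 1 \\ 0 \\ 2 \end{bmatrix}, \quad
x_3 = \begin{bmatrix} -1 \\ -2 \\ 1 \\ 1 \end{bmatrix}. \tag{2.67}
\]
To check whether they are linearly dependent, we follow the general approach and solve
\[
\lambda_1 x_1 + \lambda_2 x_2 + \lambda_3 x_3
= \lambda_1 \begin{bmatrix} 1 \\ 2 \\ -3 \\ 4 \end{bmatrix}
+ \lambda_2 \begin{bmatrix} 1 \\ 1 \\ 0 \\ 2 \end{bmatrix}
+ \lambda_3 \begin{bmatrix} -1 \\ -2 \\ 1 \\ 1 \end{bmatrix} = 0 \tag{2.68}
\]
for λ₁, ..., λ₃. We write the vectors x_i, i = 1, 2, 3, as the columns of a matrix and apply elementary row operations until we identify the pivot columns:
\[
\begin{bmatrix} 1 & 1 & -1 \\ 2 & 1 & -2 \\ -3 & 0 & 1 \\ 4 & 2 & 1 \end{bmatrix}
\ \leadsto \ \cdots \ \leadsto \
\begin{bmatrix} 1 & 1 & -1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}. \tag{2.69}
\]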
Here, every column of the matrix is a pivot column. Therefore, there is no
non-trivial solution, and we require λ1 = 0, λ2 = 0, λ3 = 0 to solve the
equation system. Hence, the vectors x1 , x2 , x3 are linearly independent.
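The conclusion of Example 2.14 can be cross-checked numerically: the vectors are linearly independent exactly when the matrix with these vectors as columns has rank 3. This sketch assumes NumPy and uses `np.linalg.matrix_rank` (which works via a singular value decomposition) in place of hand elimination.

```python
import numpy as np

x1 = np.array([1, 2, -3, 4])
x2 = np.array([1, 1, 0, 2])
x3 = np.array([-1, -2, 1, 1])

X = np.column_stack([x1, x2, x3])   # vectors as columns, as in (2.69)
rank = np.linalg.matrix_rank(X)
independent = rank == X.shape[1]    # true iff every column is a pivot column
```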
This means that {x1 , . . . , xm } are linearly independent if and only if the
column vectors {λ1 , . . . , λm } are linearly independent.
                                                                            ♦
Remark. In a vector space V , m linear combinations of k vectors x1 , . . . , xk
are linearly dependent if m > k .                                             ♦
Example 2.15
Consider a set of linearly independent vectors b1 , b2 , b3 , b4 ∈ Rn and
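\[
\begin{aligned}
x_1 &= b_1 - 2b_2 + b_3 - b_4 \\
x_2 &= -4b_1 - 2b_2 + 4b_4 \\
x_3 &= 2b_1 + 3b_2 - b_3 - 3b_4 \\
x_4 &= 17b_1 - 10b_2 + 11b_3 + b_4
\end{aligned}
\ . \tag{2.73}
\]
Are the vectors x₁, ..., x₄ ∈ ℝⁿ linearly independent? To answer this question, we investigate whether the column vectors
\[
\left\{
\begin{bmatrix} 1 \\ -2 \\ 1 \\ -1 \end{bmatrix},
\begin{bmatrix} -4 \\ -2 \\ 0 \\ 4 \end{bmatrix},
\begin{bmatrix} 2 \\ 3 \\ -1 \\ -3 \end{bmatrix},
\begin{bmatrix} 17 \\ -10 \\ 11 \\ 1 \end{bmatrix}
\right\} \tag{2.74}
\]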
                                                    
                    Generating sets are sets of vectors that span vector (sub)spaces, i.e.,
                 every vector can be represented as a linear combination of the vectors
                 in the generating set. Now, we will be more specific and characterize the
                 smallest generating set that spans a vector (sub)space.
Example 2.16
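   The set
\[
\mathcal{A} = \left\{
\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix},
\begin{bmatrix} 2 \\ -1 \\ 0 \\ 2 \end{bmatrix},
\begin{bmatrix} 1 \\ 1 \\ 0 \\ -4 \end{bmatrix}
\right\} \tag{2.80}
\]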
                                                
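\[
[x_1, x_2, x_3, x_4] =
\begin{bmatrix}
1 & 2 & 3 & -1 \\
2 & -1 & -4 & 8 \\
-1 & 1 & 3 & -5 \\
-1 & 2 & 5 & -6 \\
-1 & -2 & -3 & 1
\end{bmatrix}. \tag{2.83}
\]
With the basic transformation rules for systems of linear equations, we obtain the row-echelon form
\[
\begin{bmatrix}
1 & 2 & 3 & -1 \\
2 & -1 & -4 & 8 \\
-1 & 1 & 3 & -5 \\
-1 & 2 & 5 & -6 \\
-1 & -2 & -3 & 1
\end{bmatrix}
\ \leadsto \ \cdots \ \leadsto \
\begin{bmatrix}
1 & 2 & 3 & -1 \\
0 & 1 & 2 & -2 \\
0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{bmatrix}.
\]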
Since the pivot columns indicate which set of vectors is linearly indepen-
dent, we see from the row-echelon form that x1 , x2 , x4 are linearly inde-
pendent (because the system of linear equations λ1 x1 + λ2 x2 + λ4 x4 = 0
can only be solved with λ1 = λ2 = λ4 = 0). Therefore, {x1 , x2 , x4 } is a
basis of U .
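A quick numerical cross-check of this basis, assuming NumPy: the four vectors (read off as the columns of the matrix in the row-echelon computation above) span a three-dimensional subspace, and dropping the non-pivot vector x₃ does not reduce the rank.

```python
import numpy as np

x1 = np.array([1, 2, -1, -1, -1])
x2 = np.array([2, -1, 1, 2, -2])
x3 = np.array([3, -4, 3, 5, -3])
x4 = np.array([-1, 8, -5, -6, 1])

rank_all = np.linalg.matrix_rank(np.column_stack([x1, x2, x3, x4]))
rank_basis = np.linalg.matrix_rank(np.column_stack([x1, x2, x4]))
```

Both ranks equal 3, so {x₁, x₂, x₄} is linearly independent and spans the same subspace as all four vectors.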
                                       2.6.2 Rank
The number of linearly independent columns of a matrix A ∈ ℝ^{m×n} equals the number of linearly independent rows and is called the rank of A and is denoted by rk(A).
Remark. The rank of a matrix has some important properties:
   rk(A) = rk(A⊤), i.e., the column rank equals the row rank.
   The columns of A ∈ ℝ^{m×n} span a subspace U ⊆ ℝ^m with dim(U) = rk(A). Later we will call this subspace the image or range. A basis of U can be found by applying Gaussian elimination to A to identify the pivot columns.
   The rows of A ∈ ℝ^{m×n} span a subspace W ⊆ ℝ^n with dim(W) = rk(A). A basis of W can be found by applying Gaussian elimination to A⊤.
   For all A ∈ ℝ^{n×n} it holds that A is regular (invertible) if and only if rk(A) = n.
   For all A ∈ ℝ^{m×n} and all b ∈ ℝ^m it holds that the linear equation system Ax = b can be solved if and only if rk(A) = rk(A|b), where A|b denotes the augmented system.
   For A ∈ ℝ^{m×n} the subspace of solutions for Ax = 0 possesses dimension n − rk(A). Later, we will call this subspace the kernel or the null space.
   A matrix A ∈ ℝ^{m×n} has full rank if its rank equals the largest possible rank for a matrix of the same dimensions. This means that the rank of a full-rank matrix is the lesser of the number of rows and columns, i.e., rk(A) = min(m, n). A matrix is said to be rank deficient if it does not have full rank.
♦
                                  
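\[
A = \begin{bmatrix} 1 & 2 & 1 \\ -2 & -3 & 1 \\ 3 & 5 & 0 \end{bmatrix}.
\]
We use Gaussian elimination to determine the rank:
\[
\begin{bmatrix} 1 & 2 & 1 \\ -2 & -3 & 1 \\ 3 & 5 & 0 \end{bmatrix}
\ \leadsto \ \cdots \ \leadsto \
\begin{bmatrix} 1 & 2 & 1 \\ 0 & 1 & 3 \\ 0 & 0 & 0 \end{bmatrix}. \tag{2.84}
\]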
                   Here, we see that the number of linearly independent rows and columns
                   is 2, such that rk(A) = 2.
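The rank of this matrix, and several of the rank properties listed earlier in this section, can be checked with a short sketch. It assumes NumPy; the helper name `solvable` and the two right-hand sides are hypothetical choices made for illustration.

```python
import numpy as np

A = np.array([[1, 2, 1],
              [-2, -3, 1],
              [3, 5, 0]])

rank = np.linalg.matrix_rank(A)
rank_T = np.linalg.matrix_rank(A.T)   # column rank equals row rank
null_dim = A.shape[1] - rank          # dimension of the solutions of Ax = 0

def solvable(A, b):
    """Ax = b has a solution if and only if rk(A) = rk(A | b)."""
    augmented = np.column_stack([A, b])
    return np.linalg.matrix_rank(A) == np.linalg.matrix_rank(augmented)

b_in = A @ np.array([1, 1, 1])   # lies in the column space by construction
b_out = np.array([1, 0, 0])      # hypothetical right-hand side outside it
```

Since rk(A) = 2 < 3, the matrix is rank deficient and not invertible, and only right-hand sides in its two-dimensional column space admit a solution.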
B = (b1 , . . . , bn ) (2.89)
x = α1 b1 + . . . + αn bn (2.90)
Example 2.20
Let us have a look at a geometric vector x ∈ R^2 with coordinates [2, 3]^⊤
with respect to the standard basis (e1, e2) of R^2. This means, we can write
x = 2e1 + 3e2. However, we do not have to choose the standard basis to
represent this vector. If we use the basis vectors b1 = [1, −1]^⊤, b2 = [1, 1]^⊤,
we will obtain the coordinates ½[−1, 5]^⊤ to represent the same vector with
respect to (b1, b2), i.e., x = −½ b1 + (5/2) b2 (see Figure 2.9).

Figure 2.9 Different coordinate representations of a vector x, depending on the choice of basis: x = 2e1 + 3e2 = −½ b1 + (5/2) b2.
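
As a small check of Example 2.20, the coordinates with respect to (b1, b2) can be obtained by solving α1 b1 + α2 b2 = x. The sketch below is an illustration under our own assumptions: the helper name `coordinates_in_basis` is hypothetical, and Cramer's rule is used only because the system is 2 × 2.

```python
from fractions import Fraction


def coordinates_in_basis(b1, b2, x):
    """Solve alpha1*b1 + alpha2*b2 = x for a basis (b1, b2) of R^2 (Cramer's rule)."""
    det = Fraction(b1[0] * b2[1] - b2[0] * b1[1])
    if det == 0:
        raise ValueError("b1 and b2 are linearly dependent and do not form a basis")
    alpha1 = (x[0] * b2[1] - b2[0] * x[1]) / det
    alpha2 = (b1[0] * x[1] - x[0] * b1[1]) / det
    return alpha1, alpha2


alpha1, alpha2 = coordinates_in_basis((1, -1), (1, 1), (2, 3))
print(alpha1, alpha2)  # -1/2 5/2
```

The result agrees with the coordinates ½[−1, 5]^⊤ stated in the example.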
ŷ = AΦ x̂ . (2.94)
                        This means that the transformation matrix can be used to map coordinates
                        with respect to an ordered basis in V to coordinates with respect to an
                        ordered basis in W .
where we first expressed the new basis vectors c̃k ∈ W as linear com-
binations of the basis vectors cl ∈ W and then swapped the order of
summation.
  Alternatively, when we express the b̃j ∈ V as linear combinations of
bj ∈ V , we arrive at
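\begin{align}
\Phi(\tilde{b}_j) &\overset{(2.106)}{=} \Phi\Big(\sum_{i=1}^{n} s_{ij} b_i\Big) = \sum_{i=1}^{n} s_{ij} \Phi(b_i) = \sum_{i=1}^{n} s_{ij} \sum_{l=1}^{m} a_{li} c_l \tag{2.109a} \\
&= \sum_{l=1}^{m} \Big(\sum_{i=1}^{n} a_{li} s_{ij}\Big) c_l , \quad j = 1, \dots, n , \tag{2.109b}
\end{align}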
and, therefore,
such that
ÃΦ = T −1 AΦ S , (2.112)
ÃΦ = T −1 AΦ S. (2.113)
                                                         
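   \[ \tilde{B} = \left( \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix} , \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} , \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} \right) \in \mathbb{R}^3 , \quad \tilde{C} = \left( \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix} , \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \end{bmatrix} , \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix} , \begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \end{bmatrix} \right) . \tag{2.119} \]
   Then,
   \[ S = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix} , \quad T = \begin{bmatrix} 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} , \tag{2.120} \]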
where the ith column of S is the coordinate representation of b̃i in
terms of the basis vectors of B. Since B is the standard basis, the
coordinate representation is straightforward to find. For a general basis B,
we would need to solve a linear equation system to find the λi such that
∑_{i=1}^{3} λi bi = b̃j, j = 1, . . . , 3. Similarly, the jth column of T is the
coordinate representation of c̃j in terms of the basis vectors of C.
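               Therefore, we obtain
\begin{align}
\tilde{A}_\Phi = T^{-1} A_\Phi S &= \frac{1}{2} \begin{bmatrix} 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ -1 & 1 & 1 & 1 \\ 0 & 0 & 0 & 2 \end{bmatrix} \begin{bmatrix} 3 & 2 & 1 \\ 0 & 4 & 2 \\ 10 & 8 & 4 \\ 1 & 6 & 3 \end{bmatrix} \tag{2.121a} \\
&= \begin{bmatrix} -4 & -4 & -2 \\ 6 & 0 & 0 \\ 4 & 8 & 4 \\ 1 & 6 & 3 \end{bmatrix} . \tag{2.121b}
\end{align}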
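
The result in (2.121b) can be checked without inverting T: multiplying both sides of Ã_Φ = T^{-1} A_Φ S by T gives T Ã_Φ = A_Φ S. The sketch below is a plain-Python illustration; the helper `matmul` and the variable names are our own, and the matrix `A_S` is the product A_Φ S as it appears in (2.121a).

```python
def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


# T from (2.120); its columns are the coordinates of the new basis vectors of W.
T = [[1, 1, 0, 1],
     [1, 0, 1, 0],
     [0, 1, 1, 0],
     [0, 0, 0, 1]]

# A_Phi S, the right-hand factor in (2.121a).
A_S = [[3, 2, 1],
       [0, 4, 2],
       [10, 8, 4],
       [1, 6, 3]]

# Claimed transformation matrix with respect to the new bases, (2.121b).
A_tilde = [[-4, -4, -2],
           [6, 0, 0],
           [4, 8, 4],
           [1, 6, 3]]

print(matmul(T, A_tilde) == A_S)  # True
```

Checking T Ã_Φ = A_Φ S keeps the arithmetic in integers, so no rounding is involved.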
[Figure: kernel ker(Φ), containing 0_V, and image Im(Φ), containing 0_W, of a linear mapping Φ.]
                                                                           
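                    \[ = x_1 \begin{bmatrix} 1 \\ 1 \end{bmatrix} + x_2 \begin{bmatrix} 2 \\ 0 \end{bmatrix} + x_3 \begin{bmatrix} -1 \\ 0 \end{bmatrix} + x_4 \begin{bmatrix} 0 \\ 1 \end{bmatrix} \tag{2.125b} \]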
                    is linear. To determine Im(Φ), we can take the span of the columns of the
                    transformation matrix and obtain
                                                            
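                    \[ \operatorname{Im}(\Phi) = \operatorname{span}\left[ \begin{bmatrix} 1 \\ 1 \end{bmatrix} , \begin{bmatrix} 2 \\ 0 \end{bmatrix} , \begin{bmatrix} -1 \\ 0 \end{bmatrix} , \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right] . \tag{2.126} \]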
                    To compute the kernel (null space) of Φ, we need to solve Ax = 0, i.e.,
                    we need to solve a homogeneous equation system. To do this, we use
                    Gaussian elimination to transform A into reduced row-echelon form:
                                                                      
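                    \[ \begin{bmatrix} 1 & 2 & -1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix} \rightsquigarrow \cdots \rightsquigarrow \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & -\tfrac{1}{2} & -\tfrac{1}{2} \end{bmatrix} . \tag{2.127} \]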
                      This matrix is in reduced row-echelon form, and we can use the Minus-1
                    Trick to compute a basis of the kernel (see Section 2.3.3). Alternatively,
                    we can express the non-pivot columns (columns 3 and 4) as linear
                    combinations of the pivot columns (columns 1 and 2). The third column a3 is
                    equivalent to −½ times the second column a2. Therefore, 0 = a3 + ½ a2. In
                    the same way, we see that a4 = a1 − ½ a2 and, therefore, 0 = a1 − ½ a2 − a4.
                    Overall, this gives us the kernel (null space) as
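                    \[ \ker(\Phi) = \operatorname{span}\left[ \begin{bmatrix} 0 \\ \tfrac{1}{2} \\ 1 \\ 0 \end{bmatrix} , \begin{bmatrix} -1 \\ \tfrac{1}{2} \\ 0 \\ 1 \end{bmatrix} \right] . \tag{2.128} \]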
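
A quick way to sanity-check a claimed kernel basis is to multiply each vector by A and confirm that the result is the zero vector. The following sketch does this with exact fractions; the helper `matvec` and the variable names are illustrative assumptions, while A is the matrix on the left of (2.127).

```python
from fractions import Fraction


def matvec(A, x):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


A = [[1, 2, -1, 0],
     [1, 0, 0, 1]]

# The two spanning vectors of ker(Phi) from (2.128).
kernel_basis = [
    [0, Fraction(1, 2), 1, 0],
    [-1, Fraction(1, 2), 0, 1],
]

for v in kernel_basis:
    # Each vector should be mapped to the zero vector of R^2.
    print(all(entry == 0 for entry in matvec(A, v)))  # True
```

Both vectors are mapped to 0, and since neither is a multiple of the other they span a two-dimensional subspace of R^4.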
                    Theorem 2.24 (Rank-Nullity Theorem). For vector spaces V, W and a linear
                    mapping Φ : V → W it holds that
                                       dim(ker(Φ)) + dim(Im(Φ)) = dim(V) .                        (2.129)
                       The rank-nullity theorem is also referred to as the fundamental theorem
                    of linear mappings (Axler, 2015, theorem 3.22). The following are direct
                    consequences of Theorem 2.24:
                      If dim(Im(Φ)) < dim(V), then ker(Φ) is non-trivial, i.e., the kernel
                      contains more than 0_V and dim(ker(Φ)) ≥ 1.
                      If A_Φ is the transformation matrix of Φ with respect to an ordered basis
                      and dim(Im(Φ)) < dim(V), then the system of linear equations A_Φ x =
                      0 has infinitely many solutions.
                      If dim(V) = dim(W), then the following three-way equivalence holds:
                      – Φ is injective
                      – Φ is surjective
                      – Φ is bijective
                      since Im(Φ) ⊆ W.
   One-dimensional affine subspaces are called lines and can be written
   as y = x0 + λb1, where λ ∈ R and U = span[b1] ⊆ R^n is a one-dimensional
   subspace of R^n. This means that a line is defined by a support
   point x0 and a vector b1 that defines the direction. See Figure 2.13
   for an illustration.
                                      Exercises
2.1   We consider (R\{−1}, ⋆), where
      a ⋆ b := ab + a + b,    a, b ∈ R\{−1}    (2.134)
      3 ⋆ x ⋆ x = 15
      k̄ = {x ∈ Z | x − k = 0 (mod n)}
        = {x ∈ Z | ∃a ∈ Z : (x − k = n · a)} .
      Z_n = {0̄, 1̄, . . . , \overline{n − 1}}
      ā ⊕ b̄ := \overline{a + b}
      ā ⊗ b̄ = \overline{a × b},    (2.135)
                                0 0 1
                                                             
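       a.
          \[ \begin{bmatrix} 1 & 2 \\ 4 & 5 \\ 7 & 8 \end{bmatrix} \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{bmatrix} \]
       b.
          \[ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{bmatrix} \]
       c.
          \[ \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} \]
       d.
          \[ \begin{bmatrix} 1 & 2 & 1 & 2 \\ 4 & 1 & -1 & -4 \end{bmatrix} \begin{bmatrix} 0 & 3 \\ 1 & -1 \\ 2 & 1 \\ 5 & 2 \end{bmatrix} \]
       e.
          \[ \begin{bmatrix} 0 & 3 \\ 1 & -1 \\ 2 & 1 \\ 5 & 2 \end{bmatrix} \begin{bmatrix} 1 & 2 & 1 & 2 \\ 4 & 1 & -1 & -4 \end{bmatrix} \]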
2.5   Find the set S of all solutions in x of the following inhomogeneous linear
      systems Ax = b, where A and b are defined as follows:
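       a.
          \[ A = \begin{bmatrix} 1 & 1 & -1 & -1 \\ 2 & 5 & -7 & -5 \\ 2 & -1 & 1 & 3 \\ 5 & 2 & -4 & 2 \end{bmatrix} , \quad b = \begin{bmatrix} 1 \\ -2 \\ 4 \\ 6 \end{bmatrix} \]
       b.
          \[ A = \begin{bmatrix} 1 & -1 & 0 & 0 & 1 \\ 1 & 1 & 0 & -3 & 0 \\ 2 & -1 & 0 & 1 & -1 \\ -1 & 2 & 0 & -2 & -1 \end{bmatrix} , \quad b = \begin{bmatrix} 3 \\ 6 \\ 5 \\ -1 \end{bmatrix} \]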
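2.6   Using Gaussian elimination, find all solutions of the inhomogeneous
      equation system Ax = b with
      \[ A = \begin{bmatrix} 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 & 1 \end{bmatrix} , \quad b = \begin{bmatrix} 2 \\ -1 \\ 1 \end{bmatrix} . \]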
                                            
2.7   Find all solutions in x = [x1, x2, x3]^⊤ ∈ R^3 of the equation system Ax = 12x,
      where
      \[ A = \begin{bmatrix} 6 & 4 & 3 \\ 6 & 0 & 9 \\ 0 & 8 & 0 \end{bmatrix} \]
      and ∑_{i=1}^{3} x_i = 1.
2.8   Determine the inverses of the following matrices if possible:
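      a.
         \[ A = \begin{bmatrix} 2 & 3 & 4 \\ 3 & 4 & 5 \\ 4 & 5 & 6 \end{bmatrix} \]
      b.
         \[ A = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{bmatrix} \]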
1 −1 1 1 0 −1
     Determine a basis of U1 ∩ U2 .
2.13 Consider two subspaces U1 and U2 , where U1 is the solution space of the
     homogeneous equation system A1 x = 0 and U2 is the solution space of the
     homogeneous equation system A2 x = 0 with
                                                        
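     \[ A_1 = \begin{bmatrix} 1 & 0 & 1 \\ 1 & -2 & -1 \\ 2 & 1 & 3 \\ 1 & 0 & 1 \end{bmatrix} , \quad A_2 = \begin{bmatrix} 3 & -3 & 0 \\ 1 & 2 & 3 \\ 7 & -5 & 2 \\ 3 & -1 & 2 \end{bmatrix} . \]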
             where L1 ([a, b]) denotes the set of integrable functions on [a, b].
       b.
          Φ : C^1 → C^0
              f ↦ Φ(f) = f′ ,
       c.
          Φ : R → R
              x ↦ Φ(x) = cos(x)
       d.
          Φ : R^3 → R^2
          \[ x \mapsto \begin{bmatrix} 1 & 2 & 3 \\ 1 & 4 & 3 \end{bmatrix} x \]
and let us define two ordered bases B = (b1, b2) and B′ = (b′1, b′2) of R^2.
       a. Show that B and B′ are two bases of R^2 and draw those basis vectors.
       b. Compute the matrix P_1 that performs a basis change from B′ to B.
       c. We consider c1, c2, c3, three vectors of R^3 defined in the standard basis
          of R^3 as
          \[ c_1 = \begin{bmatrix} 1 \\ 2 \\ -1 \end{bmatrix} , \quad c_2 = \begin{bmatrix} 0 \\ -1 \\ 2 \end{bmatrix} , \quad c_3 = \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix} \]
3       Analytic Geometry
This material is published by Cambridge University Press as Mathematics for Machine Learning by
Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view
and download for personal use only. Not for re-distribution, re-sale, or use in derivative works.
© by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021. https://mml-book.com.
                                         3.1 Norms
When we think of geometric vectors, i.e., directed line segments that start
at the origin, then intuitively the length of a vector is the distance of the
“end” of this directed line segment from the origin. In the following, we
will discuss the notion of the length of vectors using the concept of a norm.
                                     ‖·‖ : V → R,                                      (3.1)
                                             x ↦ ‖x‖ ,                                  (3.2)
which assigns each vector x its length ‖x‖ ∈ R, such that for all λ ∈ R
and x, y ∈ V the following hold:
   Absolutely homogeneous: ‖λx‖ = |λ|‖x‖
\[ \|x\|_1 := \sum_{i=1}^{n} |x_i| , \tag{3.3} \]
where | · | is the absolute value. The left panel of Figure 3.3 shows all
vectors x ∈ R^2 with ‖x‖_1 = 1. The Manhattan norm is also called ℓ1
norm.
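
To make (3.3) concrete, here is a minimal sketch of the Manhattan norm together with a numerical check of absolute homogeneity. The function name `manhattan_norm` and the example vector and scalar are our own illustrative choices.

```python
def manhattan_norm(x):
    """Manhattan (l1) norm: sum of the absolute values of the components."""
    return sum(abs(x_i) for x_i in x)


x = [3, -4]   # illustrative vector, not taken from the text
lam = -2      # illustrative scalar
scaled = [lam * x_i for x_i in x]

print(manhattan_norm(x))                                       # 7
print(manhattan_norm(scaled) == abs(lam) * manhattan_norm(x))  # True
```

The second line confirms ‖λx‖_1 = |λ|‖x‖_1 for this particular x and λ; it is a spot check, not a proof.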
                     and computes the Euclidean distance of x from the origin. The right panel
                     of Figure 3.3 shows all vectors x ∈ R^2 with ‖x‖_2 = 1. The Euclidean
                     norm is also called ℓ2 norm.
                     Remark. Throughout this book, we will use the Euclidean norm (3.4) by
                     default if not stated otherwise.                                   ♦
                     We will refer to this particular inner product as the dot product in this
                     book. However, inner products are more general concepts with specific
                     properties, which we will now introduce.
where A_ij := ⟨b_i, b_j⟩ and x̂, ŷ are the coordinates of x and y with respect
to the basis B. This implies that the inner product ⟨·, ·⟩ is uniquely
determined through A. The symmetry of the inner product also means that A
                        The null space (kernel) of A consists only of 0 because x^⊤Ax > 0 for
                        all x ≠ 0. This implies that Ax ≠ 0 if x ≠ 0.
                        The diagonal elements a_ii of A are positive because a_ii = e_i^⊤ A e_i > 0,
                        where e_i is the ith vector of the standard basis in R^n.
in a natural way, such that we can compute lengths of vectors using the in-
ner product. However, not every norm is induced by an inner product. The
Manhattan norm (3.3) is an example of a norm without a corresponding
inner product. In the following, we will focus on norms that are induced
by inner products and introduce geometric concepts, such as lengths, dis-
tances, and angles.
Remark (Cauchy-Schwarz Inequality). For an inner product vector space
(V, ⟨·, ·⟩) the induced norm ‖·‖ satisfies the Cauchy-Schwarz inequality
\[ |\langle x, y \rangle| \leqslant \|x\| \|y\| . \tag{3.17} \]
                                                                                             ♦
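
The Cauchy-Schwarz inequality can be spot-checked numerically for the dot product. The sketch below uses our own helper names (`dot`, `norm`, `cauchy_schwarz_gap`) and arbitrary example vectors; it computes the gap ‖x‖‖y‖ − |⟨x, y⟩|, which should be nonnegative and should vanish (up to rounding) when y is a multiple of x.

```python
import math


def dot(x, y):
    """Dot product of two vectors of equal length."""
    return sum(x_i * y_i for x_i, y_i in zip(x, y))


def norm(x):
    """Norm induced by the dot product."""
    return math.sqrt(dot(x, x))


def cauchy_schwarz_gap(x, y):
    """Return ||x|| ||y|| - |<x, y>|, which is nonnegative up to rounding."""
    return norm(x) * norm(y) - abs(dot(x, y))


x = [1.0, 2.0, 3.0]    # illustrative vectors, not from the text
y = [-2.0, 0.0, 5.0]
print(cauchy_schwarz_gap(x, y) >= 0)  # True

# For linearly dependent vectors the inequality becomes an equality.
parallel = [4.0 * x_i for x_i in x]
print(math.isclose(cauchy_schwarz_gap(x, parallel), 0.0, abs_tol=1e-9))  # True
```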
is called the distance between x and y for x, y ∈ V. If we use the dot
product as the inner product, then the distance is called Euclidean distance.
The mapping
                                                           d : V × V → R                                         (3.22)
                                                              (x, y) ↦ d(x, y)                              (3.23)
                            1. d is positive definite, i.e., d(x, y) ≥ 0 for all x, y ∈ V and d(x, y) =
                               0 ⇐⇒ x = y.
                            2. d is symmetric, i.e., d(x, y) = d(y, x) for all x, y ∈ V.
                            3. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z) for all x, y, z ∈ V.
                            Remark. At first glance, the lists of properties of inner products and
                            metrics look very similar. However, by comparing Definition 3.3 with
                            Definition 3.6 we observe that ⟨x, y⟩ and d(x, y) behave in opposite directions.
                            Very similar x and y will result in a large value for the inner product and
                            a small value for the metric.                                             ♦
\[ -1 \leqslant \frac{\langle x, y \rangle}{\|x\| \|y\|} \leqslant 1 . \tag{3.24} \]
                            Therefore, there exists a unique ω ∈ [0, π], illustrated in Figure 3.4, with
\[ \cos \omega = \frac{\langle x, y \rangle}{\|x\| \|y\|} . \tag{3.25} \]
                            The number ω is the angle between the vectors x and y. Intuitively, the
                            angle between two vectors tells us how similar their orientations are. For
                            example, using the dot product, the angle between x and y = 4x, i.e., y
                            is a scaled version of x, is 0: Their orientation is the same.
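
The definition of the angle in (3.25) translates directly into a few lines of code. This is a sketch for the dot product only; the names `dot` and `angle` are ours, and the clamping step is an implementation detail we add because floating-point rounding can push the quotient marginally outside [−1, 1].

```python
import math


def dot(x, y):
    """Dot product of two vectors of equal length."""
    return sum(x_i * y_i for x_i, y_i in zip(x, y))


def angle(x, y):
    """Angle in [0, pi] between nonzero vectors x and y, using the dot product."""
    cos_omega = dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))
    # Guard against rounding errors that leave the valid domain of acos.
    cos_omega = max(-1.0, min(1.0, cos_omega))
    return math.acos(cos_omega)


x = [1.0, 2.0]                 # illustrative vector
y = [4.0 * x_i for x_i in x]   # y = 4x, a scaled version of x
print(angle(x, y))             # approximately 0, up to floating-point rounding
```

Because acos is ill-conditioned near 1, the printed value may be a tiny positive number rather than exactly 0.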
   Consider two vectors x = [1, 1]^⊤, y = [−1, 1]^⊤ ∈ R^2; see Figure 3.6.
We are interested in determining the angle ω between them using two
different inner products. Using the dot product as the inner product yields
an angle ω between x and y of 90°, such that x ⊥ y. However, if we
choose the inner product
\[ \langle x, y \rangle = x^\top \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} y , \tag{3.27} \]
                      which gives exactly the angle between x and y. This means that orthogonal
                      matrices A with A^⊤ = A^{-1} preserve both angles and distances. It
                      turns out that orthogonal matrices define transformations that are
                      rotations (with the possibility of flips). In Section 3.9, we will discuss more
                      details about rotations.
for all i, j = 1, . . . , n then the basis is called an orthonormal basis (ONB).
If only (3.33) is satisfied, then the basis is called an orthogonal basis. Note
that (3.34) implies that every basis vector has length/norm 1.
   Recall from Section 2.6.1 that we can use Gaussian elimination to find a
basis for a vector space spanned by a set of vectors. Assume we are given
a set {b̃1, . . . , b̃n} of non-orthogonal and unnormalized basis vectors. We
concatenate them into a matrix B̃ = [b̃1, . . . , b̃n] and apply Gaussian
elimination to the augmented matrix (Section 2.3.2) [B̃ B̃^⊤ | B̃] to obtain an
orthonormal basis. This constructive way to iteratively build an orthonormal
basis {b1, . . . , bn} is called the Gram-Schmidt process (Strang, 2003).
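
For illustration, the following sketch builds an orthonormal basis using the common projection-and-subtract form of Gram-Schmidt with respect to the dot product. Note that this is a different formulation from the augmented-matrix procedure described above; the function names and the example vectors are our own assumptions.

```python
import math


def dot(u, v):
    """Dot product of two vectors of equal length."""
    return sum(u_i * v_i for u_i, v_i in zip(u, v))


def gram_schmidt(vectors):
    """Orthonormalize linearly independent vectors (modified Gram-Schmidt)."""
    basis = []
    for v in vectors:
        w = list(v)
        # Remove the components along the basis vectors found so far.
        for b in basis:
            coeff = dot(w, b)
            w = [w_i - coeff * b_i for w_i, b_i in zip(w, b)]
        length = math.sqrt(dot(w, w))
        if length < 1e-12:
            raise ValueError("input vectors are linearly dependent")
        basis.append([w_i / length for w_i in w])
    return basis


# Illustrative non-orthogonal, unnormalized basis of R^2.
onb = gram_schmidt([[2.0, 0.0], [1.0, 1.0]])
print(onb)  # [[1.0, 0.0], [0.0, 1.0]]
```

Each new vector has its components along the previously computed basis vectors removed and is then normalized, so the output vectors are pairwise orthogonal with unit length.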
for lower and upper limits a, b < ∞, respectively. As with our usual inner
product, we can define norms and orthogonality by looking at the inner
product. If (3.37) evaluates to 0, the functions u and v are orthogonal. To
make the preceding inner product mathematically precise, we need to take
care of measures and the definition of integrals, leading to the definition of
a Hilbert space. Furthermore, unlike inner products on finite-dimensional
vectors, inner products on functions may diverge (have infinite value). All
this requires diving into some more intricate details of real and functional
analysis, which we do not cover in this book.
product evaluates to 0. Therefore, sin and cos are orthogonal functions.
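
The orthogonality of sin and cos can be checked numerically by approximating the integral inner product (3.37). The sketch below rests on assumptions that are ours: the integration limits a = −π and b = π, the midpoint rule as the quadrature scheme, and the helper name `inner_product`.

```python
import math


def inner_product(u, v, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of u(x) * v(x) over [a, b]."""
    h = (b - a) / n
    return h * sum(u(a + (k + 0.5) * h) * v(a + (k + 0.5) * h) for k in range(n))


# Assumed limits: a = -pi, b = pi.
value = inner_product(math.sin, math.cos, -math.pi, math.pi)
print(abs(value) < 1e-9)  # True
```

The integrand sin(x)cos(x) is odd, so its contributions on either side of the origin cancel and the approximation is zero up to rounding.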
Figure 3.9 Orthogonal projection (orange dots) of a two-dimensional dataset (blue dots) onto a one-dimensional subspace (straight line). Axes: x1 (horizontal), x2 (vertical).
[Figure: (a) Projection of x ∈ R^2 onto a subspace U with basis vector b. (b) Projection of a two-dimensional vector x with ‖x‖ = 1 onto a one-dimensional subspace spanned by b.]
    We can now exploit the bilinearity of the inner product and arrive at
    \[ \langle x, b \rangle - \lambda \langle b, b \rangle = 0 \iff \lambda = \frac{\langle x, b \rangle}{\langle b, b \rangle} = \frac{\langle b, x \rangle}{\|b\|^2} . \tag{3.40} \]
    In the last step, we exploited the fact that inner products are symmetric.
    With a general inner product, we get λ = ⟨x, b⟩ if ‖b‖ = 1.
    If we choose ⟨·, ·⟩ to be the dot product, we obtain
    \[ \lambda = \frac{b^\top x}{b^\top b} = \frac{b^\top x}{\|b\|^2} . \tag{3.41} \]
                            \[ \pi_U(x) = \lambda b = \frac{\langle x, b \rangle}{\|b\|^2} b = \frac{b^\top x}{\|b\|^2} b , \tag{3.42} \]
                            where the last equality holds for the dot product only. We can also
                            compute the length of πU (x) by means of Definition 3.1 as
                            Hence, our projection is of length |λ| times the length of b. This also
                            adds the intuition that λ is the coordinate of πU (x) with respect to the
                            basis vector b that spans our one-dimensional subspace U .
                              If we use the dot product as an inner product, we get
                            \[ \pi_U(x) = \lambda b = b \lambda = b \frac{b^\top x}{\|b\|^2} = \frac{b b^\top}{\|b\|^2} x , \tag{3.45} \]
                            we immediately see that
                            \[ P_\pi = \frac{b b^\top}{\|b\|^2} . \tag{3.46} \]
                            Note that bb^⊤ (and, consequently, P_π) is a symmetric matrix (of rank
                            1), and ‖b‖^2 = ⟨b, b⟩ is a scalar. Projection matrices are always
                            symmetric.
                       The projection matrix P_π projects any vector x ∈ R^n onto the line through
                       the origin with direction b (equivalently, the subspace U spanned by b).
                       Remark. The projection π_U(x) ∈ R^n is still an n-dimensional vector and
                       not a scalar. However, we no longer require n coordinates to represent the
                       projection, but only a single one if we want to express it with respect to
                       the basis vector b that spans the subspace U: λ.                       ♦
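Let us now choose a particular x and see whether it lies in the subspace spanned by b. For $x = [1, 1, 1]^\top$, the projection is
$$\pi_U(x) = P_\pi x = \frac{1}{9} \begin{bmatrix} 1 & 2 & 2 \\ 2 & 4 & 4 \\ 2 & 4 & 4 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \frac{1}{9} \begin{bmatrix} 5 \\ 10 \\ 10 \end{bmatrix} \in \operatorname{span}\left[ \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix} \right] . \tag{3.48}$$
Note that the application of $P_\pi$ to πU (x) does not change anything, i.e., $P_\pi \pi_U(x) = \pi_U(x)$. This is expected because according to Definition 3.10, we know that a projection matrix $P_\pi$ satisfies $P_\pi^2 x = P_\pi x$ for all x.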
Remark. With the results from Chapter 4, we can show that πU (x) is an
eigenvector of P π , and the corresponding eigenvalue is 1.          ♦
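The projection onto a line can be checked numerically. The following is a minimal sketch using NumPy, with b = [1, 2, 2]ᵀ read off the span in (3.48); the variable names are illustrative choices and not part of the text.

```python
import numpy as np

# Direction vector of the line U = span[b], read off (3.48), and the vector to project.
b = np.array([1.0, 2.0, 2.0])
x = np.array([1.0, 1.0, 1.0])

# Coordinate of the projection with respect to b, as in (3.41).
lam = (b @ x) / (b @ b)

# Projection matrix P = b b^T / ||b||^2, as in (3.46).
P = np.outer(b, b) / (b @ b)

# The projection itself; it equals lam * b.
proj = P @ x

print(9 * proj)                                        # numerator vector of (3.48)
print(np.allclose(proj, lam * b))                      # both routes give the same vector
print(np.allclose(P @ proj, proj))                     # projecting again changes nothing
print(np.allclose(P, P.T), np.linalg.matrix_rank(P))   # symmetric, rank 1
```

The third check is the numerical counterpart of the remark that πU (x) is an eigenvector of the projection matrix with eigenvalue 1.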
The matrix $(B^\top B)^{-1} B^\top$ is also called the pseudo-inverse of B, which can be computed for non-square matrices B. It only requires that $B^\top B$ is positive definite, which is the case if B is full rank. In practical applications (e.g., linear regression), we often add a “jitter term” $\epsilon I$ to
Remark. The solution for projecting onto general subspaces includes the 1D case as a special case: If dim(U) = 1, then $B^\top B \in \mathbb{R}$ is a scalar and we can rewrite the projection matrix in (3.59), $P_\pi = B(B^\top B)^{-1} B^\top$, as $P_\pi = \frac{B B^\top}{B^\top B}$, which is exactly the projection matrix in (3.46). ♦
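As a rough numerical illustration of this remark, the sketch below builds the projection matrix B(BᵀB)⁻¹Bᵀ and compares the one-dimensional case with (3.46). The matrix B and the helper name `projection_matrix` are assumptions made for this example only.

```python
import numpy as np

def projection_matrix(B):
    """Return P = B (B^T B)^{-1} B^T for a matrix B with linearly independent columns."""
    # Solve (B^T B) Y = B^T rather than forming the inverse explicitly.
    return B @ np.linalg.solve(B.T @ B, B.T)

# Hypothetical basis of a two-dimensional subspace of R^3 (columns of B).
B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
P = projection_matrix(B)

# (B^T B)^{-1} B^T should agree with NumPy's pseudo-inverse when B has full column rank.
pinv_manual = np.linalg.solve(B.T @ B, B.T)
same_pinv = np.allclose(pinv_manual, np.linalg.pinv(B))

# One-dimensional special case: B consists of a single column b.
b = np.array([[1.0], [2.0], [2.0]])
P_line = projection_matrix(b)
same_line = np.allclose(P_line, (b @ b.T) / (b.T @ b))

print(same_pinv, same_line)
```

Solving the linear system instead of inverting BᵀB is a design choice for numerical stability; it does not change the result in exact arithmetic.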
The corresponding projection error is the norm of the difference vector between the original vector and its projection onto U, i.e.,
$$\|x - \pi_U(x)\| = \left\| \begin{bmatrix} 1 & -2 & 1 \end{bmatrix}^\top \right\| = \sqrt{6} . \tag{3.63}$$
The projection error is also called the reconstruction error.
To verify the results, we can (a) check whether the displacement vector πU (x) − x is orthogonal to all basis vectors of U, and (b) verify that $P_\pi = P_\pi^2$ (see Definition 3.10).
                        Remark. The projections πU (x) are still vectors in Rn although they lie in
                        an m-dimensional subspace U ⊆ Rn . However, to represent a projected
                        vector we only need the m coordinates λ1 , . . . , λm with respect to the
                        basis vectors b1 , . . . , bm of U .                                     ♦
                        Remark. In vector spaces with general inner products, we have to pay
                        attention when computing angles and distances, which are defined by
                        means of the inner product.                                       ♦
Projections allow us to look at situations where we have a linear system Ax = b without a solution. Recall that this means that b does not lie in the span of A, i.e., the vector b does not lie in the subspace spanned by the columns of A. Given that the linear equation cannot be solved exactly, we can find an approximate solution. The idea is to find the vector in the subspace spanned by the columns of A that is closest to b, i.e., we compute the orthogonal projection of b onto the subspace spanned by the columns of A. This problem arises often in practice, and the solution is called the least-squares solution (assuming the dot product as the inner product) of an overdetermined system. This is discussed further in Section 9.4. Using reconstruction errors (3.63) is one possible approach to derive principal component analysis (Section 10.3).
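The link between projections and least squares can be sketched in a few lines of NumPy. The matrix A and the vector b below are assumptions chosen for illustration; with them, the residual is [1, −2, 1]ᵀ and its norm is √6, the values that appear in (3.63).

```python
import numpy as np

# Hypothetical overdetermined system A x = b (three equations, two unknowns).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Normal equations A^T A x = A^T b give the coordinates of the projection of b.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# NumPy's least-squares solver should return the same coefficients.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

projection = A @ x_normal        # closest point to b in the column space of A
residual = b - projection        # displacement vector
error = np.linalg.norm(residual)

print(x_normal, x_lstsq)
print(A.T @ residual)            # numerically zero: residual is orthogonal to the columns of A
print(error)
```

The orthogonality of the residual to every column of A is check (a) from the verification step above, applied to the least-squares setting.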
Remark. We just looked at projections of vectors x onto a subspace U with basis vectors {b1, . . . , bk}. If this basis is an ONB, i.e., (3.33) and (3.34) are satisfied, the projection equation (3.58) simplifies greatly to
$$\pi_U(x) = B B^\top x \tag{3.65}$$
Figure 3.12 Gram-Schmidt orthogonalization. (a) non-orthogonal basis (b1, b2) of R2; (b) first constructed basis vector u1 = b1 and orthogonal projection of b2 onto span[u1]; (c) orthogonal basis (u1, u2) of R2, with u2 = b2 − πspan[u1](b2).
Consider a basis (b1, b2) of R2, where
$$b_1 = \begin{bmatrix} 2 \\ 0 \end{bmatrix} , \quad b_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix} ; \tag{3.69}$$
see also Figure 3.12(a). Using the Gram-Schmidt method, we construct an orthogonal basis (u1, u2) of R2 as follows (assuming the dot product as the inner product):
$$u_1 := b_1 = \begin{bmatrix} 2 \\ 0 \end{bmatrix} , \tag{3.70}$$
$$u_2 := b_2 - \pi_{\operatorname{span}[u_1]}(b_2) \overset{(3.45)}{=} b_2 - \frac{u_1 u_1^\top}{\|u_1\|^2}\, b_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} . \tag{3.71}$$
Figure 3.13 Projection onto an affine space. (a) original setting; (b) setting shifted by −x0 so that x − x0 can be projected onto the direction space U = L − x0; (c) projection is translated back to x0 + πU (x − x0), which gives the final orthogonal projection πL (x).
These steps are illustrated in Figures 3.12(b) and (c). We immediately see that u1 and u2 are orthogonal, i.e., $u_1^\top u_2 = 0$.
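The construction in (3.70) and (3.71) can be reproduced with a short sketch. The function name `gram_schmidt` and its list-based interface are assumptions for this illustration; it implements the classical (unnormalized) variant with the dot product as inner product.

```python
import numpy as np

def gram_schmidt(basis):
    """Orthogonalize linearly independent vectors (dot product as inner product)."""
    ortho = []
    for b in basis:
        u = b.astype(float)
        # Subtract the projection of b onto each previously constructed vector.
        for v in ortho:
            u = u - (v @ b) / (v @ v) * v
        ortho.append(u)
    return ortho

# Basis vectors from (3.69).
b1 = np.array([2.0, 0.0])
b2 = np.array([1.0, 1.0])
u1, u2 = gram_schmidt([b1, b2])

print(u1, u2)    # compare with (3.70) and (3.71)
print(u1 @ u2)   # orthogonality check
```

Dividing each resulting vector by its norm would turn the orthogonal basis into an orthonormal one.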
Figure 3.14 A rotation rotates objects in a plane about the origin. If the rotation angle is positive, we rotate counterclockwise. (Shown: the original object and the object rotated by 112.5°.)
                                      3.9 Rotations
Length and angle preservation, as discussed in Section 3.4, are the two
characteristics of linear mappings with orthogonal transformation matri-
ces. In the following, we will have a closer look at specific orthogonal
transformation matrices, which describe rotations.
   A rotation is a linear mapping (more specifically, an automorphism of
a Euclidean vector space) that rotates a plane by an angle θ about the
origin, i.e., the origin is a fixed point. For a positive angle θ > 0, by com-
mon convention, we rotate in a counterclockwise direction. An example is
shown in Figure 3.14, where the transformation matrix is
$$R = \begin{bmatrix} -0.38 & -0.92 \\ 0.92 & -0.38 \end{bmatrix} . \tag{3.74}$$
Important application areas of rotations include computer graphics and
robotics. For example, in robotics, it is often important to know how to
rotate the joints of a robotic arm in order to pick up or place an object,
see Figure 3.15.
3.9.1 Rotations in R2
Consider the standard basis $e_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$, $e_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$ of R2, which defines
                       the standard coordinate system in R2 . We aim to rotate this coordinate
                       system by an angle θ as illustrated in Figure 3.16. Note that the rotated
                       vectors are still linearly independent and, therefore, are a basis of R2 . This
                       means that the rotation performs a basis change.
Rotations Φ are linear mappings so that we can express them by a rotation matrix R(θ). Trigonometry (see Figure 3.16) allows us to determine the coordinates of the rotated axes (the image of Φ) with respect to the standard basis in R2. We obtain
$$\Phi(e_1) = \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} , \quad \Phi(e_2) = \begin{bmatrix} -\sin\theta \\ \cos\theta \end{bmatrix} . \tag{3.75}$$
Therefore, the rotation matrix that performs the basis change into the rotated coordinates R(θ) is given as
$$R(\theta) = \begin{bmatrix} \Phi(e_1) & \Phi(e_2) \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} . \tag{3.76}$$
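A small sketch can make (3.76) concrete. It builds R(θ) for θ = 112.5°, the angle used in Figure 3.14, and checks that the matrix is orthogonal; the function name `rotation_matrix` is an illustrative choice.

```python
import numpy as np

def rotation_matrix(theta):
    """Return R(theta) as in (3.76); theta is given in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

theta = np.deg2rad(112.5)
R = rotation_matrix(theta)

e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

print(np.round(R, 2))                      # compare with the entries of (3.74)
print(R @ e1, R @ e2)                      # images of the standard basis, as in (3.75)
print(np.allclose(R.T @ R, np.eye(2)))     # orthogonal, so lengths and angles are preserved
```

Rounded to two decimals, the entries agree with the transformation matrix in (3.74).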
                                                    3.9.2 Rotations in R3
                       In contrast to the R2 case, in R3 we can rotate any two-dimensional plane
                       about a one-dimensional axis. The easiest way to specify the general rota-
                       tion matrix is to specify how the images of the standard basis e1 , e2 , e3 are
                       supposed to be rotated, and making sure these images Re1 , Re2 , Re3 are
                       orthonormal to each other. We can then obtain a general rotation matrix
                       R by combining the images of the standard basis.
                          To have a meaningful rotation angle, we have to define what “coun-
                       terclockwise” means when we operate in more than two dimensions. We
                       use the convention that a “counterclockwise” (planar) rotation about an
                       axis refers to a rotation about an axis when we look at the axis “head on,
                       from the end toward the origin”. In R3 , there are therefore three (planar)
                       rotations about the three standard basis vectors (see Figure 3.17):
Figure 3.17 Rotation of a vector (gray) in R3 by an angle θ about the e3-axis. The rotated vector is shown in blue.
                  trix
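$$R_{ij}(\theta) := \begin{bmatrix} I_{i-1} & 0 & \cdots & \cdots & 0 \\ 0 & \cos\theta & 0 & -\sin\theta & 0 \\ 0 & 0 & I_{j-i-1} & 0 & 0 \\ 0 & \sin\theta & 0 & \cos\theta & 0 \\ 0 & \cdots & \cdots & 0 & I_{n-j} \end{bmatrix} \in \mathbb{R}^{n \times n} , \tag{3.80}$$
for 1 ≤ i < j ≤ n and θ ∈ R. Then $R_{ij}(\theta)$ is called a Givens rotation. Essentially, $R_{ij}(\theta)$ is the identity matrix $I_n$ with
$$r_{ii} = \cos\theta , \quad r_{ij} = -\sin\theta , \quad r_{ji} = \sin\theta , \quad r_{jj} = \cos\theta . \tag{3.81}$$
In two dimensions (i.e., n = 2), we obtain (3.76) as a special case.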
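The description in (3.81) translates almost directly into code. The sketch below is one possible construction; the function name `givens_rotation` and the choice to accept 1-based indices (to match the text) are assumptions of this example.

```python
import numpy as np

def givens_rotation(n, i, j, theta):
    """Return R_ij(theta) in R^{n x n} for 1 <= i < j <= n (1-based indices as in (3.80))."""
    if not 1 <= i < j <= n:
        raise ValueError("need 1 <= i < j <= n")
    R = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    # Shift to 0-based indexing and overwrite the four entries listed in (3.81).
    p, q = i - 1, j - 1
    R[p, p] = c
    R[p, q] = -s
    R[q, p] = s
    R[q, q] = c
    return R

theta = np.pi / 3
G = givens_rotation(4, 2, 4, theta)     # rotation in the plane of the 2nd and 4th axes
R2 = givens_rotation(2, 1, 2, theta)    # special case n = 2

print(np.allclose(G.T @ G, np.eye(4)))  # Givens rotations are orthogonal
print(R2)                               # compare with (3.76)
```

Starting from the identity and changing only four entries keeps the remaining coordinates untouched, which is exactly the block structure of (3.80).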
kernel methods (Schölkopf and Smola, 2002). Kernel methods exploit the
fact that many linear algorithms can be expressed purely by inner prod-
uct computations. Then, the “kernel trick” allows us to compute these
inner products implicitly in a (potentially infinite-dimensional) feature
space, without even knowing this feature space explicitly. This allowed the
“non-linearization” of many algorithms used in machine learning, such as
kernel-PCA (Schölkopf et al., 1997) for dimensionality reduction. Gaus-
sian processes (Rasmussen and Williams, 2006) also fall into the category
of kernel methods and are the current state of the art in probabilistic re-
gression (fitting curves to data points). The idea of kernels is explored
further in Chapter 12.
   Projections are often used in computer graphics, e.g., to generate shad-
ows. In optimization, orthogonal projections are often used to (iteratively)
minimize residual errors. This also has applications in machine learning,
e.g., in linear regression where we want to find a (linear) function that
minimizes the residual errors, i.e., the lengths of the orthogonal projec-
tions of the data onto the linear function (Bishop, 2006). We will investi-
gate this further in Chapter 9. PCA (Pearson, 1901; Hotelling, 1933) also
uses projections to reduce the dimensionality of high-dimensional data.
We will discuss this in more detail in Chapter 10.
                                       Exercises
3.1   Show that ⟨·, ·⟩ defined for all $x = [x_1, x_2]^\top \in \mathbb{R}^2$ and $y = [y_1, y_2]^\top \in \mathbb{R}^2$ by
      is an inner product.
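3.2   Consider R2 with ⟨·, ·⟩ defined for all x and y in R2 as
      $$\langle x, y\rangle := x^\top \underbrace{\begin{bmatrix} 2 & 0 \\ 1 & 2 \end{bmatrix}}_{=:A} y .$$
      using
      a. $\langle x, y\rangle := x^\top y$
      b. $\langle x, y\rangle := x^\top A y$, $A := \begin{bmatrix} 2 & 1 & 0 \\ 1 & 3 & -1 \\ 0 & -1 & 2 \end{bmatrix}$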
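3.4   Compute the angle between
      $$x = \begin{bmatrix} 1 \\ 2 \end{bmatrix} , \quad y = \begin{bmatrix} -1 \\ -1 \end{bmatrix}$$
      using
      a. $\langle x, y\rangle := x^\top y$
      b. $\langle x, y\rangle := x^\top B y$, $B := \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}$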
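3.5   Consider the Euclidean vector space R5 with the dot product. A subspace U ⊆ R5 and x ∈ R5 are given by
      $$U = \operatorname{span}\left[ \begin{bmatrix} 0 \\ -1 \\ 2 \\ 0 \\ 2 \end{bmatrix} , \begin{bmatrix} 1 \\ -3 \\ 1 \\ -1 \\ 2 \end{bmatrix} , \begin{bmatrix} -3 \\ 4 \\ 1 \\ 2 \\ 1 \end{bmatrix} , \begin{bmatrix} -1 \\ -3 \\ 5 \\ 0 \\ 7 \end{bmatrix} \right] , \quad x = \begin{bmatrix} -1 \\ -9 \\ -1 \\ 4 \\ 1 \end{bmatrix} .$$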
3.8   Using the Gram-Schmidt method, turn the basis B = (b1, b2) of a two-dimensional subspace U ⊆ R3 into an ONB C = (c1, c2) of U, where
      $$b_1 := \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} , \quad b_2 := \begin{bmatrix} -1 \\ 2 \\ 0 \end{bmatrix} .$$
by 30°.
4 Matrix Decompositions
                       This material is published by Cambridge University Press as Mathematics for Machine Learning by
                       Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view
                       and download for personal use only. Not for re-distribution, re-sale, or use in derivative works.
                       © by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021. https://mml-book.com.
Figure (mind map; caption only partly recovered): Eigenvalues, Eigenvectors, Orthogonal matrix, Diagonalization, and SVD, with links to Chapter 6 (Probability & distributions) and Chapter 10 (Dimensionality reduction), showing where they are used in other parts of the book.
4.1 Determinant and Trace
For a memory aid of the product terms in Sarrus’ rule, try tracing the
elements of the triple products in the matrix.
   We call a square matrix T an upper-triangular matrix if $T_{ij} = 0$ for i > j, i.e., the matrix is zero below its diagonal. Analogously, we define a lower-triangular matrix as a matrix with zeros above its diagonal. For a triangular matrix $T \in \mathbb{R}^{n \times n}$, the determinant is the product of the diagonal elements, i.e.,
$$\det(T) = \prod_{i=1}^{n} T_{ii} . \tag{4.8}$$
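A quick numerical check of (4.8), as a minimal sketch: the matrix T below is a hypothetical upper-triangular example, and `np.linalg.det` serves as the general-purpose reference.

```python
import numpy as np

# Hypothetical upper-triangular matrix (zeros below the diagonal).
T = np.array([[2.0, -1.0,  3.0],
              [0.0,  4.0,  5.0],
              [0.0,  0.0, -0.5]])

det_from_diagonal = np.prod(np.diag(T))   # right-hand side of (4.8)
det_numpy = np.linalg.det(T)              # general-purpose determinant
det_lower = np.linalg.det(T.T)            # the transpose is lower-triangular

print(det_from_diagonal, det_numpy, det_lower)
```

All three values agree up to floating-point rounding, since the transpose has the same diagonal.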
Example 4.2 (Determinants as Measures of Volume)
The notion of a determinant is natural when we consider it as a mapping from a set of n vectors spanning an object in Rn. It turns out that the determinant det(A) is the signed volume of an n-dimensional parallelepiped formed by columns of the matrix A.
   For n = 2, the columns of the matrix form a parallelogram; see Figure 4.2. As the angle between vectors gets smaller, the area of a parallelogram shrinks, too. Consider two vectors b, g that form the columns of a matrix A = [b, g]. Then, the absolute value of the determinant of A is the area of the parallelogram with vertices 0, b, g, b + g. In particular, if b, g are linearly dependent so that b = λg for some λ ∈ R, they no longer form a two-dimensional parallelogram. Therefore, the corresponding area is 0. On the contrary, if b, g are linearly independent and are multiples of the canonical basis vectors e1, e2, then they can be written as $b = [b, 0]^\top$ and $g = [0, g]^\top$, and the determinant is
$$\begin{vmatrix} b & 0 \\ 0 & g \end{vmatrix} = bg - 0 = bg .$$
This becomes the familiar formula: area = height × length.
   The sign of the determinant indicates the orientation of the spanning vectors b, g with respect to the standard basis (e1, e2). In our figure, flipping the order to g, b swaps the columns of A and reverses the orientation of the shaded area.
   This intuition extends to higher dimensions. In R3, we consider three vectors r, b, g ∈ R3 spanning the edges of a parallelepiped, i.e., a solid with faces that are parallel parallelograms (see Figure 4.3). The absolute value of the determinant of the 3 × 3 matrix [r, b, g] is the volume of the solid. Thus, the determinant acts as a function that measures the signed volume formed by column vectors composed in a matrix.
Figure 4.2 The area of the parallelogram (shaded region) spanned by the vectors b and g is |det([b, g])|.
Figure 4.3 The volume of the parallelepiped (shaded volume) spanned by vectors r, b, g is |det([r, b, g])|.
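   Consider the three linearly independent vectors r, g, b ∈ R3 given as
$$r = \begin{bmatrix} 2 \\ 0 \\ -8 \end{bmatrix} , \quad g = \begin{bmatrix} 6 \\ 1 \\ 0 \end{bmatrix} , \quad b = \begin{bmatrix} 1 \\ 4 \\ -1 \end{bmatrix} . \tag{4.9}$$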
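For these three vectors, the signed volume can be evaluated numerically. The sketch below stacks the vectors from (4.9) as columns and also swaps two columns to illustrate the change of orientation; the variable names are illustrative.

```python
import numpy as np

# Vectors from (4.9).
r = np.array([2.0, 0.0, -8.0])
g = np.array([6.0, 1.0, 0.0])
b = np.array([1.0, 4.0, -1.0])

# Signed volume of the parallelepiped with the vectors as columns.
A = np.column_stack([r, g, b])
signed_volume = np.linalg.det(A)
volume = abs(signed_volume)

# Swapping two columns reverses the orientation, so only the sign changes.
swapped = np.linalg.det(np.column_stack([r, b, g]))

print(signed_volume, swapped, volume)
```

A cofactor expansion along the first row of [r, g, b] gives 2·(−1) − 6·32 + 1·8 = −186, so the volume is 186 regardless of the column order.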
$$\operatorname{tr}(A) := \sum_{i=1}^{n} a_{ii} , \tag{4.18}$$
$$c_0 = \det(A) , \tag{4.23}$$
$$c_{n-1} = (-1)^{n-1} \operatorname{tr}(A) . \tag{4.24}$$
  The characteristic polynomial (4.22a) will allow us to compute eigen-
values and eigenvectors, covered in the next section.
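The relations (4.23) and (4.24) can be checked numerically. The sketch below assumes that the characteristic polynomial is defined as det(A − λI), which is the convention the sign in (4.24) suggests; the matrix A is a hypothetical example. NumPy's `np.poly` returns the coefficients of det(λI − A), so they are rescaled by (−1)ⁿ.

```python
import numpy as np

# Hypothetical symmetric example matrix.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
n = A.shape[0]

# np.poly gives the coefficients of det(lambda*I - A), highest power first.
# Multiplying by (-1)^n converts them to those of det(A - lambda*I) (assumed convention).
coeffs = (-1) ** n * np.poly(A)

c0 = coeffs[-1]           # constant term
c_n_minus_1 = coeffs[1]   # coefficient of lambda^(n-1)

print(c0, np.linalg.det(A))
print(c_n_minus_1, (-1) ** (n - 1) * np.trace(A))
```

Each printed pair should agree up to floating-point rounding.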
Definition 4.9. Let a square matrix A have an eigenvalue λi. The algebraic multiplicity of λi is the number of times the root appears in the characteristic polynomial.
A matrix A and its transpose $A^\top$ possess the same eigenvalues, but not necessarily the same eigenvectors.
The eigenspace Eλ is the null space of A − λI since
\begin{align}
Ax = \lambda x &\iff Ax - \lambda x = 0 \tag{4.27a} \\
&\iff (A - \lambda I)x = 0 \iff x \in \ker(A - \lambda I). \tag{4.27b}
\end{align}
                                                                                    
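This means any vector $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$, where $x_2 = -x_1$, such as $\begin{bmatrix} 1 \\ -1 \end{bmatrix}$, is an eigenvector with eigenvalue 2. The corresponding eigenspace is given as
$$E_2 = \operatorname{span}\left[\begin{bmatrix} 1 \\ -1 \end{bmatrix}\right] . \tag{4.35}$$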
Example 4.6
The matrix $A = \begin{bmatrix} 2 & 1 \\ 0 & 2 \end{bmatrix}$ has two repeated eigenvalues λ1 = λ2 = 2 and an algebraic multiplicity of 2. The eigenvalue has, however, only one distinct unit eigenvector $x_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$ and, thus, geometric multiplicity 1.
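The geometric multiplicity in Example 4.6 can be checked numerically as the dimension of the null space of A − λI, that is, n − rk(A − λI). The sketch below is a minimal version under that assumption; the helper names `rank` and `geometric_multiplicity` are hypothetical, and exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction

def rank(M):
    """Rank of a small matrix via Gaussian elimination with exact fractions."""
    rows = [[Fraction(x) for x in row] for row in M]
    n_rows, n_cols = len(rows), len(rows[0])
    r = 0  # number of pivots found so far
    for col in range(n_cols):
        pivot = next((i for i in range(r, n_rows) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, n_rows):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == n_rows:
            break
    return r

def geometric_multiplicity(A, lam):
    """dim ker(A - lam*I) = n - rank(A - lam*I)."""
    n = len(A)
    shifted = [[A[i][j] - (lam if i == j else 0) for j in range(n)]
               for i in range(n)]
    return n - rank(shifted)

A = [[2, 1], [0, 2]]          # matrix from Example 4.6
scaled_identity = [[2, 0], [0, 2]]  # contrast case, chosen for illustration

print(geometric_multiplicity(A, 2))                # 1
print(geometric_multiplicity(scaled_identity, 2))  # 2
```

Both matrices have the eigenvalue 2 with algebraic multiplicity 2, but only the second has a two-dimensional eigenspace.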
Figure 4.4 Determinants and eigenspaces. Overview of five linear mappings and their associated transformation matrices $A_i \in \mathbb{R}^{2\times 2}$ projecting 400 color-coded points $x \in \mathbb{R}^2$ (left column) onto target points $A_i x$ (right column). The central column depicts the first eigenvector, stretched by its associated eigenvalue λ1, and the second eigenvector stretched by its eigenvalue λ2. Each row depicts the effect of one of five transformation matrices $A_i$ with respect to the standard basis.
Panel labels, by row:
   Row 1: λ1 = 2.0, λ2 = 0.5, det(A) = 1.0
   Row 2: λ1 = 1.0, λ2 = 1.0, det(A) = 1.0
   Row 3: λ1 = (0.87-0.5j), λ2 = (0.87+0.5j), det(A) = 1.0
   Row 4: λ1 = 0.0, λ2 = 2.0, det(A) = 0.0
   Row 5: λ1 = 0.5, λ2 = 1.5, det(A) = 0.75
   half of the vertical axis, and to the left vice versa. This mapping is area preserving ($\det(A_2) = 1$). The eigenvalue λ1 = 1 = λ2 is repeated and the eigenvectors are collinear (drawn here for emphasis in two opposite directions). This indicates that the mapping acts only along one direction (the horizontal axis).
   $A_3 = \begin{bmatrix} \cos(\frac{\pi}{6}) & -\sin(\frac{\pi}{6}) \\ \sin(\frac{\pi}{6}) & \cos(\frac{\pi}{6}) \end{bmatrix} = \frac{1}{2}\begin{bmatrix} \sqrt{3} & -1 \\ 1 & \sqrt{3} \end{bmatrix}$ The matrix $A_3$ rotates the points by $\frac{\pi}{6}$ rad = 30° counter-clockwise and has only complex eigenvalues, reflecting that the mapping is a rotation (hence, no eigenvectors are drawn). A rotation has to be volume preserving, and so the determinant is 1. For more details on rotations, we refer to Section 3.9.
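The eigenvalues and determinants quoted for these mappings can be reproduced with the closed-form solution of the 2 × 2 characteristic polynomial. The following is a sketch under that assumption; `eigenvalues_2x2` is a hypothetical helper, and `cmath.sqrt` is used so that complex eigenvalues (as for the rotation) are handled as well.

```python
import cmath
import math

def eigenvalues_2x2(A):
    """Eigenvalues and determinant of a 2x2 matrix.

    Solves t**2 - tr(A)*t + det(A) = 0 with the quadratic formula.
    """
    (a, b), (c, d) = A
    trace = a + d
    det = a * d - b * c
    root = cmath.sqrt(trace * trace - 4 * det)
    return (trace - root) / 2, (trace + root) / 2, det

theta = math.pi / 6
A3 = [[math.cos(theta), -math.sin(theta)],
      [math.sin(theta),  math.cos(theta)]]   # rotation by 30 degrees
A4 = [[1, -1],
      [-1, 1]]                               # collapses the plane onto a line

lam1, lam2, det3 = eigenvalues_2x2(A3)
mu1, mu2, det4 = eigenvalues_2x2(A4)

print(lam1, lam2, det3)  # approximately (0.866-0.5j), (0.866+0.5j), 1.0
print(mu1, mu2, det4)    # eigenvalues 0 and 2, determinant 0
```

The complex conjugate pair for the rotation and the zero eigenvalue for the collapsing map agree with the values shown in the panel labels of Figure 4.4.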
   $A_4 = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}$ represents a mapping in the standard basis that collapses a two-dimensional domain onto one dimension. Since one eigen-
Figure 4.5 Caenorhabditis elegans neural network (Kaiser and Hilgetag, 2006). (a) Symmetrized connectivity matrix (axis label: neuron index); (b) Eigenspectrum (axis label: eigenvalue).
   Methods to analyze and learn from network data are an essential component of machine learning methods. The key to understanding networks is the connectivity between network nodes, in particular whether two nodes are connected to each other or not. In data science applications, it is often useful to study the matrix that captures this connectivity data.
   We build a connectivity/adjacency matrix $A \in \mathbb{R}^{277\times 277}$ of the complete neural network of the worm C. elegans. Each row/column represents one of the 277 neurons of this worm’s brain. The connectivity matrix A has a value of $a_{ij} = 1$ if neuron i talks to neuron j through a synapse, and $a_{ij} = 0$ otherwise. The connectivity matrix is not symmetric, which implies that eigenvalues may not be real valued. Therefore, we compute a symmetrized version of the connectivity matrix as $A_{\mathrm{sym}} := A + A^\top$. This new matrix $A_{\mathrm{sym}}$ is shown in Figure 4.5(a) and has a nonzero value $a_{ij}$ if and only if two neurons are connected (white pixels), irrespective of the direction of the connection. In Figure 4.5(b), we show the corresponding eigenspectrum of $A_{\mathrm{sym}}$. The horizontal axis shows the index of the eigenvalues, sorted in descending order. The vertical axis shows the corresponding eigenvalue. The S-like shape of this eigenspectrum is typical for many biological neural networks. The underlying mechanism responsible for this is an area of active neuroscience research.
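The symmetrization step can be sketched on a toy directed network. The 4-node adjacency matrix below is hypothetical data chosen for illustration, not the C. elegans connectome, and the helper names are assumptions.

```python
# Toy directed network with 4 nodes (hypothetical data).
# A[i][j] = 1 means node i sends a connection to node j.
A = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]

def transpose(M):
    return [list(col) for col in zip(*M)]

def symmetrize(M):
    """Return M + M^T, which is nonzero wherever either direction is connected."""
    Mt = transpose(M)
    n = len(M)
    return [[M[i][j] + Mt[i][j] for j in range(n)] for i in range(n)]

A_sym = symmetrize(A)
is_symmetric = A_sym == transpose(A_sym)

for row in A_sym:
    print(row)
print(is_symmetric)  # True
```

Because the result is real and symmetric, its eigenvalues are real, so they can be sorted and plotted as an eigenspectrum. Note that an entry of the sum equals 2 where both directions are connected; only the zero/nonzero pattern encodes connectivity.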
Example 4.8
Consider the matrix
$$A = \begin{bmatrix} 3 & 2 & 2 \\ 2 & 3 & 2 \\ 2 & 2 & 3 \end{bmatrix} . \tag{4.37}$$
Figure 4.6 Geometric interpretation of eigenvalues. The eigenvectors of A get stretched by the corresponding eigenvalues. The area of the unit square changes by $|\lambda_1 \lambda_2|$, the perimeter changes by a factor of $\tfrac{1}{2}(|\lambda_1| + |\lambda_2|)$.
Theorem 4.17. The trace of a matrix $A \in \mathbb{R}^{n\times n}$ is the sum of its eigenvalues, i.e.,
$$\mathrm{tr}(A) = \sum_{i=1}^{n} \lambda_i , \tag{4.43}$$
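Theorem 4.17 can be checked on the matrix from (4.37). The eigenpairs listed in the sketch below were worked out by hand for this illustration (they are an assumption that the code verifies through Av = λv before using them); `matvec` is a hypothetical helper.

```python
A = [[3, 2, 2],
     [2, 3, 2],
     [2, 2, 3]]

# Hand-computed eigenpairs (lambda, v); verified below via A v = lambda v.
eigenpairs = [
    (7, [1, 1, 1]),
    (1, [1, -1, 0]),
    (1, [1, 0, -1]),
]

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

for lam, v in eigenpairs:
    assert matvec(A, v) == [lam * vi for vi in v]

trace = sum(A[i][i] for i in range(len(A)))
eigen_sum = sum(lam for lam, _ in eigenpairs)
print(trace, eigen_sum)  # 9 9
```

The three eigenvectors are linearly independent, so the list accounts for all eigenvalues with their multiplicities, and their sum equals the trace.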
on a different web site. The matrix A has the property that for any initial rank/importance vector x of a web site the sequence $x, Ax, A^2x, \dots$ converges to a vector $x^*$. This vector is called the PageRank and satisfies $Ax^* = x^*$, i.e., it is an eigenvector (with corresponding eigenvalue 1) of A. After normalizing $x^*$, such that $\|x^*\| = 1$, we can interpret the entries as probabilities. More details and different perspectives on PageRank can be found in the original technical report (Page et al., 1999).
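The iteration $x, Ax, A^2x, \dots$ can be sketched as a plain power iteration. The 3-page transition matrix below is hypothetical (its columns sum to 1), the normalization uses the 1-norm so that the entries sum to one, and the fixed iteration count is an arbitrary choice for this illustration.

```python
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

# Hypothetical 3-page web: column j holds the probabilities of
# moving from page j to each page (every column sums to 1).
A = [
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]

x = [1.0, 0.0, 0.0]       # arbitrary initial importance vector
for _ in range(200):      # repeatedly apply A: x, Ax, A^2 x, ...
    x = matvec(A, x)

total = sum(x)            # normalize so the entries sum to 1
x_star = [xi / total for xi in x]

# x_star should satisfy A x_star = x_star (eigenvalue 1).
residual = max(abs(a - b) for a, b in zip(matvec(A, x_star), x_star))
print(x_star)    # approaches [4/9, 2/9, 3/9]
print(residual)  # close to 0
```

For this toy matrix the remaining eigenvalues have magnitude 0.5, so the iteration converges quickly; real web graphs need additional care that is not shown here.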
Comparing the left-hand side of (4.45) and the right-hand side of (4.46)
shows that there is a simple pattern in the diagonal elements lii :
$$l_{11} = \sqrt{a_{11}} , \quad l_{22} = \sqrt{a_{22} - l_{21}^2} , \quad l_{33} = \sqrt{a_{33} - (l_{31}^2 + l_{32}^2)} . \tag{4.47}$$
Similarly for the elements below the diagonal (lij , where i > j ), there is
also a repeating pattern:
$$l_{21} = \frac{1}{l_{11}} a_{21} , \quad l_{31} = \frac{1}{l_{11}} a_{31} , \quad l_{32} = \frac{1}{l_{22}} (a_{32} - l_{31} l_{21}) . \tag{4.48}$$
Thus, we constructed the Cholesky decomposition for any symmetric, positive definite 3 × 3 matrix. The key realization is that we can backward calculate what the components $l_{ij}$ of L should be, given the values $a_{ij}$ of A and previously computed values of $l_{ij}$.
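The formulas (4.47) and (4.48) translate directly into code for the 3 × 3 case. The sketch below assumes a symmetric, positive definite input and does no error checking; the function name `cholesky_3x3` and the test matrix are illustrative choices.

```python
import math

def cholesky_3x3(A):
    """Lower-triangular L with A = L L^T, following (4.47) and (4.48).

    Assumes A is symmetric and positive definite (not checked here).
    """
    l11 = math.sqrt(A[0][0])
    l21 = A[1][0] / l11
    l31 = A[2][0] / l11
    l22 = math.sqrt(A[1][1] - l21 ** 2)
    l32 = (A[2][1] - l31 * l21) / l22
    l33 = math.sqrt(A[2][2] - (l31 ** 2 + l32 ** 2))
    return [[l11, 0.0, 0.0],
            [l21, l22, 0.0],
            [l31, l32, l33]]

A = [[4, 12, -16],
     [12, 37, -43],
     [-16, -43, 98]]   # illustrative symmetric, positive definite matrix

L = cholesky_3x3(A)
for row in L:
    print(row)
# [2.0, 0.0, 0.0]
# [6.0, 1.0, 0.0]
# [-8.0, 5.0, 3.0]
```

Each entry only uses entries of A and entries of L computed earlier, which is the order of evaluation chosen in the function body.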
$$D = \begin{bmatrix} c_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & c_n \end{bmatrix} .$$
Diagonal matrices allow fast computation of determinants, powers, and inverses. The determinant is the product of the diagonal entries, a matrix power $D^k$ is given by each diagonal element raised to the power k, and the inverse $D^{-1}$ is given by the reciprocals of the diagonal elements if all of them are nonzero.
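These three properties are simple enough to sketch with the diagonal stored as a list. The entries below are hypothetical, and exact fractions are used so the results can be compared without rounding.

```python
from fractions import Fraction

# Diagonal entries c_1, ..., c_n of a diagonal matrix D (hypothetical values).
diagonal = [Fraction(2), Fraction(-3), Fraction(1, 2)]

# Determinant: product of the diagonal entries.
det_D = Fraction(1)
for c in diagonal:
    det_D *= c

# Power D^k: each diagonal entry raised to the power k.
k = 3
D_power = [c ** k for c in diagonal]

# Inverse D^{-1}: reciprocal of each entry, defined only if none is zero.
D_inverse = [1 / c for c in diagonal] if all(c != 0 for c in diagonal) else None

print(det_D)      # -3
print(D_power)    # [Fraction(8, 1), Fraction(-27, 1), Fraction(1, 8)]
print(D_inverse)  # [Fraction(1, 2), Fraction(-1, 3), Fraction(2, 1)]
```

Each operation costs one pass over the n diagonal entries, which is the computational advantage the text refers to.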
  In this section, we will discuss how to transform matrices into diagonal
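polynomial of A is
$$\det(A - \lambda I) = \det\begin{bmatrix} \frac{5}{2} - \lambda & -1 \\ -1 & \frac{5}{2} - \lambda \end{bmatrix} \tag{4.56a}$$
$$= \left(\tfrac{5}{2} - \lambda\right)^2 - 1 = \lambda^2 - 5\lambda + \tfrac{21}{4} = \left(\lambda - \tfrac{7}{2}\right)\left(\lambda - \tfrac{3}{2}\right) . \tag{4.56b}$$
Therefore, the eigenvalues of A are $\lambda_1 = \frac{7}{2}$ and $\lambda_2 = \frac{3}{2}$ (the roots of the characteristic polynomial), and the associated (normalized) eigenvectors are obtained via
$$A p_1 = \frac{7}{2} p_1 , \quad A p_2 = \frac{3}{2} p_2 . \tag{4.57}$$
This yields
$$p_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \end{bmatrix} , \quad p_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix} . \tag{4.58}$$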
Step 2: Check for existence. The eigenvectors $p_1, p_2$ form a basis of $\mathbb{R}^2$. Therefore, A can be diagonalized.
Step 3: Construct the matrix P to diagonalize A. We collect the eigenvectors of A in P so that
$$P = [p_1, p_2] = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix} . \tag{4.59}$$
We then obtain
$$P^{-1} A P = \begin{bmatrix} \frac{7}{2} & 0 \\ 0 & \frac{3}{2} \end{bmatrix} = D . \tag{4.60}$$
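Equivalently, we get (exploiting that $P^{-1} = P^\top$ since the eigenvectors $p_1$ and $p_2$ in this example form an ONB)
$$\underbrace{\frac{1}{2} \begin{bmatrix} 5 & -2 \\ -2 & 5 \end{bmatrix}}_{A} = \underbrace{\frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}}_{P} \underbrace{\begin{bmatrix} \frac{7}{2} & 0 \\ 0 & \frac{3}{2} \end{bmatrix}}_{D} \underbrace{\frac{1}{\sqrt{2}} \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}}_{P^{-1}} . \tag{4.61}$$
Figure 4.7 visualizes the eigendecomposition of $A = \begin{bmatrix} 5 & -2 \\ -2 & 5 \end{bmatrix}$ as a sequence of linear transformations.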
$$A^k = (P D P^{-1})^k = P D^k P^{-1} . \tag{4.62}$$
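The eigendecomposition of the example matrix and the power formula (4.62) can be verified numerically. The sketch below uses plain lists of rows; the helpers `matmul`, `transpose`, and `close` are hypothetical names, and k = 3 is an arbitrary choice.

```python
import math

def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(M):
    return [list(col) for col in zip(*M)]

def close(X, Y, tol=1e-9):
    return all(abs(x - y) <= tol
               for rx, ry in zip(X, Y) for x, y in zip(rx, ry))

s = 1 / math.sqrt(2)
A = [[2.5, -1.0], [-1.0, 2.5]]   # A = 1/2 [[5, -2], [-2, 5]]
P = [[s, s], [-s, s]]            # columns are p1 and p2 from (4.58)
D = [[3.5, 0.0], [0.0, 1.5]]     # eigenvalues 7/2 and 3/2

# A = P D P^T, using P^{-1} = P^T because the columns of P are orthonormal.
reconstructed = matmul(matmul(P, D), transpose(P))

# A^k via the decomposition versus repeated multiplication.
k = 3
D_k = [[3.5 ** k, 0.0], [0.0, 1.5 ** k]]
A_k_fast = matmul(matmul(P, D_k), transpose(P))
A_k_slow = matmul(matmul(A, A), A)

print(close(reconstructed, A))   # True
print(close(A_k_fast, A_k_slow)) # True
```

Only the diagonal entries are raised to the power k, so the cost of the decomposition route does not grow with k in the way repeated multiplication does.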
$$A = U \Sigma V^\top \tag{4.64}$$
   The diagonal entries $\sigma_i$, $i = 1, \dots, r$, of Σ are called the singular values, $u_i$ are called the left-singular vectors, and $v_j$ are called the right-singular vectors. By convention, the singular values are ordered, i.e., $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_r \geq 0$.
   The singular value matrix Σ is unique, but it requires some attention. Observe that $\Sigma \in \mathbb{R}^{m\times n}$ is rectangular. In particular, Σ is of the same size as A. This means that Σ has a diagonal submatrix that contains the singular values and needs additional zero padding. Specifically, if m > n, then the matrix Σ has diagonal structure up to row n and then consists of
value matrix Σ. Finally, it performs a second basis change via U. The SVD entails a number of important details and caveats, which is why we will review our intuition in more detail. (It is useful to review basis changes (Section 2.7.2), orthogonal matrices (Definition 3.8) and orthonormal bases (Section 3.5).)
  Assume we are given a transformation matrix of a linear mapping $\Phi : \mathbb{R}^n \to \mathbb{R}^m$ with respect to the standard bases B and C of $\mathbb{R}^n$ and $\mathbb{R}^m$, respectively. Moreover, assume a second basis B̃ of $\mathbb{R}^n$ and C̃ of $\mathbb{R}^m$. Then

1. The matrix V performs a basis change in the domain $\mathbb{R}^n$ from B̃ (represented by the red and orange vectors $v_1$ and $v_2$ in the top-left of Figure 4.8) to the standard basis B. $V^\top = V^{-1}$ performs a basis change from B to B̃. The red and orange vectors are now aligned with the canonical basis in the bottom-left of Figure 4.8.
2. Having changed the coordinate system to B̃ , Σ scales the new coordi-
   nates by the singular values σi (and adds or deletes dimensions), i.e.,
   Σ is the transformation matrix of Φ with respect to B̃ and C̃ , rep-
   resented by the red and orange vectors being stretched and lying in
   the e1 -e2 plane, which is now embedded in a third dimension in the
   bottom-right of Figure 4.8.
3. U performs a basis change in the codomain $\mathbb{R}^m$ from C̃ into the canonical basis of $\mathbb{R}^m$, represented by a rotation of the red and orange vectors out of the e1-e2 plane. This is shown in the top-right of Figure 4.8.
The SVD expresses a change of basis in both the domain and codomain.
This is in contrast with the eigendecomposition that operates within the
same vector space, where the same basis change is applied and then un-
done. What makes the SVD special is that these two different bases are
simultaneously linked by the singular value matrix Σ.
the x1-x2 plane. The third coordinate is always 0. The vectors in the x1-x2 plane have been stretched by the singular values.
   The direct mapping of the vectors X by A to the codomain $\mathbb{R}^3$ equals the transformation of X by $U \Sigma V^\top$, where U performs a rotation within the codomain $\mathbb{R}^3$ so that the mapped vectors are no longer restricted to the x1-x2 plane; they still are on a plane as shown in the top-right panel of Figure 4.9.
[Figure 4.9: panels with axes x1, x2 (two-dimensional plots) and x1, x2, x3 (three-dimensional plots), arranged following the structure of Figure 4.8.]
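$$A^\top A = P D P^\top = P \begin{bmatrix} \lambda_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \lambda_n \end{bmatrix} P^\top , \tag{4.71}$$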
where P is an orthogonal matrix, which is composed of the orthonormal eigenbasis. The $\lambda_i \geq 0$ are the eigenvalues of $A^\top A$. Let us assume the SVD of A exists and inject (4.64) into (4.71). This yields
$$A^\top A = (U \Sigma V^\top)^\top (U \Sigma V^\top) = V \Sigma^\top U^\top U \Sigma V^\top , \tag{4.72}$$
The spectral theorem tells us that $AA^\top = SDS^\top$ can be diagonalized and we can find an ONB of eigenvectors of $AA^\top$, which are collected in S. The orthonormal eigenvectors of $AA^\top$ are the left-singular vectors U and form an orthonormal basis in the codomain of the SVD.
   This leaves the question of the structure of the matrix Σ. Since $AA^\top$ and $A^\top A$ have the same nonzero eigenvalues (see page 106), the nonzero entries of the Σ matrices in the SVD for both cases have to be the same.
   The last step is to link up all the parts we touched upon so far. We have an orthonormal set of right-singular vectors in V. To finish the construction of the SVD, we connect them with the orthonormal vectors U. To reach this goal, we use the fact that the images of the $v_i$ under A have to be orthogonal, too. We can show this by using the results from Section 3.4. We require that the inner product between $Av_i$ and $Av_j$ must be 0 for $i \neq j$. For any two orthogonal eigenvectors $v_i, v_j$, $i \neq j$, it holds that
$$(Av_i)^\top (Av_j) = v_i^\top (A^\top A) v_j = v_i^\top (\lambda_j v_j) = \lambda_j v_i^\top v_j = 0 . \tag{4.77}$$
For the case $m \geq r$, it holds that $\{Av_1, \dots, Av_r\}$ is a basis of an r-dimensional subspace of $\mathbb{R}^m$.
  To complete the SVD construction, we need left-singular vectors that are orthonormal: We normalize the images of the right-singular vectors $Av_i$ and obtain
$$u_i := \frac{Av_i}{\|Av_i\|} = \frac{1}{\sqrt{\lambda_i}} Av_i = \frac{1}{\sigma_i} Av_i , \tag{4.78}$$
where the last equality was obtained from (4.75) and (4.76b), showing us that the eigenvalues of $AA^\top$ are such that $\sigma_i^2 = \lambda_i$.
   Therefore, the eigenvectors of $A^\top A$, which we know are the right-singular vectors $v_i$, and their normalized images under A, the left-singular vectors $u_i$, form two self-consistent ONBs that are connected through the singular value matrix Σ.
   Let us rearrange (4.78) to obtain the singular value equation
$$Av_i = \sigma_i u_i , \quad i = 1, \dots, r . \tag{4.79}$$
This equation closely resembles the eigenvalue equation (4.25), but the
vectors on the left- and the right-hand sides are not the same.
  For n < m, (4.79) holds only for $i \leq n$, but (4.79) says nothing about the $u_i$ for i > n. However, we know by construction that they are orthonormal. Conversely, for m < n, (4.79) holds only for $i \leq m$. For i > m, we have $Av_i = 0$ and we still know that the $v_i$ form an orthonormal set. This means that the SVD also supplies an orthonormal basis of the kernel (null space) of A, the set of vectors x with Ax = 0 (see Section 2.7.3).
  Concatenating the $v_i$ as the columns of V and the $u_i$ as the columns of U yields
$$AV = U\Sigma , \tag{4.80}$$
where Σ has the same dimensions as A and a diagonal structure for rows 1, . . . , r. Hence, right-multiplying with $V^\top$ yields $A = U \Sigma V^\top$, which is the SVD of A.
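The construction can be traced end to end on a small case. The sketch below assumes a hypothetical 2 × 2 matrix A, obtains the eigenpairs of the symmetric matrix $A^\top A$ in closed form, sets $\sigma_i = \sqrt{\lambda_i}$ and $u_i = Av_i / \sigma_i$ as in (4.78), and checks (4.80). It is meant only to mirror the derivation; it is not how an SVD would be computed in practice.

```python
import math

def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(M):
    return [list(col) for col in zip(*M)]

A = [[3.0, 0.0],
     [4.0, 5.0]]                       # hypothetical example matrix

# Eigen-decompose the symmetric 2x2 matrix A^T A in closed form.
AtA = matmul(transpose(A), A)
a, b, d = AtA[0][0], AtA[0][1], AtA[1][1]
mean = (a + d) / 2
radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
lams = [mean + radius, mean - radius]  # eigenvalues, largest first

V_cols, U_cols, sigmas = [], [], []
for lam in lams:
    v = [b, lam - a]                   # eigenvector of A^T A (requires b != 0)
    norm = math.hypot(v[0], v[1])
    v = [vi / norm for vi in v]
    sigma = math.sqrt(lam)             # sigma_i^2 = lambda_i
    Av = [sum(A[i][j] * v[j] for j in range(2)) for i in range(2)]
    u = [x / sigma for x in Av]        # u_i = A v_i / sigma_i, cf. (4.78)
    V_cols.append(v)
    U_cols.append(u)
    sigmas.append(sigma)

U = transpose(U_cols)                  # columns are u_1, u_2
V = transpose(V_cols)                  # columns are v_1, v_2
Sigma = [[sigmas[0], 0.0], [0.0, sigmas[1]]]

AV = matmul(A, V)                      # left-hand side of (4.80)
USigma = matmul(U, Sigma)              # right-hand side of (4.80)
reconstructed = matmul(USigma, transpose(V))
print(sigmas)                          # approximately [6.708, 2.236]
```

Here AV and UΣ agree up to rounding, and right-multiplying by the transpose of V recovers A.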
$A^\top A$ we obtain them straight from D. Since rk(A) = 2, there are only two nonzero singular values: $\sigma_1 = \sqrt{6}$ and $\sigma_2 = 1$. The singular value matrix must be the same size as A, and we obtain
$$\Sigma = \begin{bmatrix} \sqrt{6} & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} . \tag{4.85}$$
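   Step 3: Left-singular vectors as the normalized image of the right-singular vectors.
   We find the left-singular vectors by computing the image of the right-singular vectors under A and normalizing them by dividing them by their corresponding singular value. We obtain
$$u_1 = \frac{1}{\sigma_1} A v_1 = \frac{1}{\sqrt{6}} \begin{bmatrix} 1 & 0 & 1 \\ -2 & 1 & 0 \end{bmatrix} \begin{bmatrix} \frac{5}{\sqrt{30}} \\ \frac{-2}{\sqrt{30}} \\ \frac{1}{\sqrt{30}} \end{bmatrix} = \begin{bmatrix} \frac{1}{\sqrt{5}} \\ -\frac{2}{\sqrt{5}} \end{bmatrix} , \tag{4.86}$$
$$u_2 = \frac{1}{\sigma_2} A v_2 = \frac{1}{1} \begin{bmatrix} 1 & 0 & 1 \\ -2 & 1 & 0 \end{bmatrix} \begin{bmatrix} 0 \\ \frac{1}{\sqrt{5}} \\ \frac{2}{\sqrt{5}} \end{bmatrix} = \begin{bmatrix} \frac{2}{\sqrt{5}} \\ \frac{1}{\sqrt{5}} \end{bmatrix} , \tag{4.87}$$
$$U = [u_1, u_2] = \frac{1}{\sqrt{5}} \begin{bmatrix} 1 & 2 \\ -2 & 1 \end{bmatrix} . \tag{4.88}$$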
Note that on a computer the approach illustrated here has poor numerical behavior, and the SVD of A is normally computed without resorting to the eigenvalue decomposition of $A^\top A$.
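The numbers in this worked example can be checked directly. The sketch below takes A, the right-singular vectors, and the singular values as they appear in (4.85) to (4.87), recomputes the left-singular vectors, and rebuilds A as the sum of the two rank-1 terms $\sigma_i u_i v_i^\top$; the helper `matvec` is a hypothetical name.

```python
import math

A = [[1, 0, 1],
     [-2, 1, 0]]

# Right-singular vectors and singular values from the worked example.
v1 = [5 / math.sqrt(30), -2 / math.sqrt(30), 1 / math.sqrt(30)]
v2 = [0.0, 1 / math.sqrt(5), 2 / math.sqrt(5)]
sigma1, sigma2 = math.sqrt(6), 1.0

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

# Left-singular vectors as normalized images: u_i = A v_i / sigma_i.
u1 = [y / sigma1 for y in matvec(A, v1)]
u2 = [y / sigma2 for y in matvec(A, v2)]

# Since rk(A) = 2, A equals sigma1 * u1 v1^T + sigma2 * u2 v2^T.
reconstructed = [[sigma1 * u1[i] * v1[j] + sigma2 * u2[i] * v2[j]
                  for j in range(3)] for i in range(2)]

print(u1)  # approximately [0.447, -0.894], i.e. [1, -2] / sqrt(5)
print(u2)  # approximately [0.894, 0.447], i.e. [2, 1] / sqrt(5)
```

The third right-singular vector spans the kernel of A and contributes nothing to the reconstruction, which is why two terms suffice.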
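Figure: ratings of three people for four movies and its SVD decomposition.

                 Ali   Beatrix   Chandra
Star Wars         5       4         1
Blade Runner      5       5         0
Amelie            0       0         5
Delicatessen      1       0         4

$$\begin{bmatrix} 5 & 4 & 1 \\ 5 & 5 & 0 \\ 0 & 0 & 5 \\ 1 & 0 & 4 \end{bmatrix} = \begin{bmatrix} -0.6710 & 0.0236 & 0.4647 & -0.5774 \\ -0.7197 & 0.2054 & -0.4759 & 0.4619 \\ -0.0939 & -0.7705 & -0.5268 & -0.3464 \\ -0.1515 & -0.6030 & 0.5293 & -0.5774 \end{bmatrix} \begin{bmatrix} 9.6438 & 0 & 0 \\ 0 & 6.3639 & 0 \\ 0 & 0 & 0.7056 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} -0.7367 & -0.6515 & -0.1811 \\ 0.0852 & 0.1762 & -0.9807 \\ 0.6708 & -0.7379 & -0.0743 \end{bmatrix}$$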
© 2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020).
   Sometimes this formulation is called the reduced SVD (e.g., Datta (2010))
   or the SVD (e.g., Press et al. (2007)). This alternative format changes
   merely how the matrices are constructed but leaves the mathematical
   structure of the SVD unchanged. The convenience of this alternative
   formulation is that Σ is diagonal, as in the eigenvalue decomposition.
   In Section 4.6, we will learn about matrix approximation techniques
   using the SVD, which is also called the truncated SVD.
   It is possible to define the SVD of a rank-r matrix A so that U is an
   m × r matrix, Σ a diagonal r × r matrix, and V an n × r matrix (so that
   $V^\top$ is r × n).
   This construction is very similar to our definition, and ensures that the
   diagonal matrix Σ has only nonzero entries along the diagonal. The
   main convenience of this alternative notation is that Σ is diagonal, as
   in the eigenvalue decomposition.
   A restriction that the SVD for A only applies to m × n matrices with
   m > n is practically unnecessary. When m < n, the SVD
   will yield Σ with more zero columns than rows and, consequently, the
   singular values $\sigma_{m+1}, \ldots, \sigma_n$ are 0.
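As a rough illustration of the two conventions, the sketch below uses NumPy's `full_matrices` flag on an arbitrary 5 × 3 matrix (an assumption made only for this example). With `full_matrices=False`, NumPy returns factors of size m × min(m, n) and min(m, n) × n; this matches the rank-r construction above only when the matrix has full rank.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))  # arbitrary example with m > n

# Full SVD: U is m x m, Vt is n x n, and Sigma must be padded to m x n.
U_full, s, Vt_full = np.linalg.svd(A, full_matrices=True)
Sigma_full = np.zeros(A.shape)
Sigma_full[: len(s), : len(s)] = np.diag(s)

# Reduced SVD: U is m x n and Sigma is a square n x n diagonal matrix.
U_red, s_red, Vt_red = np.linalg.svd(A, full_matrices=False)
Sigma_red = np.diag(s_red)

print(U_full.shape, Sigma_full.shape, Vt_full.shape)
print(U_red.shape, Sigma_red.shape, Vt_red.shape)
```

Both factorizations reconstruct A; only the shapes of the factors differ.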
   The SVD is used in a variety of applications in machine learning, from
least-squares problems in curve fitting to solving systems of linear equations.
These applications harness various important properties of the SVD,
its relation to the rank of a matrix, and its ability to approximate matrices
of a given rank with lower-rank matrices. Substituting a matrix with its
SVD often has the advantage of making calculations more robust to numerical
rounding errors. As we will explore in the next section, the SVD’s
ability to approximate matrices with “simpler” matrices in a principled
manner opens up machine learning applications ranging from dimensionality
reduction and topic modeling to data compression and clustering.
Figure 4.11 Image processing with the SVD. (a) The original grayscale image is a 1,432 × 1,910 matrix of values between 0 (black) and 1 (white). (b)–(f) Rank-1 matrices $A_1, \ldots, A_5$ and their corresponding singular values $\sigma_1, \ldots, \sigma_5$. The grid-like structure of each rank-1 matrix is imposed by the outer-product of the left and right-singular vectors.
Panels: (a) Original image A. (b) $A_1$, $\sigma_1 \approx 228{,}052$. (c) $A_2$, $\sigma_2 \approx 40{,}647$. (d) $A_3$, $\sigma_3 \approx 26{,}125$. (e) $A_4$, $\sigma_4 \approx 20{,}232$. (f) $A_5$, $\sigma_5 \approx 15{,}436$.
A matrix $A \in \mathbb{R}^{m \times n}$ of rank r can be written as a sum of rank-1 matrices $A_i$ so that
\[
A = \sum_{i=1}^{r} \sigma_i u_i v_i^\top = \sum_{i=1}^{r} \sigma_i A_i ,
\tag{4.91}
\]
of A with $\operatorname{rk}(\widehat{A}(k)) = k$. Figure 4.12 shows low-rank approximations
$\widehat{A}(k)$ of an original image A of Stonehenge. The shape of the rocks becomes
increasingly visible and clearly recognizable in the rank-5 approximation.
While the original image requires 1,432 · 1,910 = 2,735,120
numbers, the rank-5 approximation requires us only to store the five singular
values and the five left- and right-singular vectors (1,432- and 1,910-dimensional
each) for a total of 5 · (1,432 + 1,910 + 1) = 16,715 numbers,
just above 0.6% of the original.
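The storage count above follows a simple formula, sketched here in Python. The helper name `rank_k_storage` is hypothetical and introduced only for this illustration.

```python
def rank_k_storage(m: int, n: int, k: int) -> int:
    """Numbers stored for a rank-k approximation of an m x n matrix:
    k singular values plus k left (length m) and k right (length n) vectors."""
    return k * (m + n + 1)


m, n = 1432, 1910                 # image dimensions quoted in the text
full = m * n                      # entries of the original image
compressed = rank_k_storage(m, n, 5)
ratio = compressed / full

print(full, compressed, ratio)
```

The ratio grows linearly in k, so the saving persists as long as k stays well below mn / (m + n + 1).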
To measure the difference (error) between A and its rank-k approximation
$\widehat{A}(k)$, we need the notion of a norm. In Section 3.1, we already used
\[
\widehat{A}(k) = \operatorname{argmin}_{\operatorname{rk}(B) = k} \|A - B\|_2 ,
\tag{4.94}
\]
\[
\left\| A - \widehat{A}(k) \right\|_2 = \sigma_{k+1} .
\tag{4.95}
\]
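Equation (4.95) can be checked numerically. The sketch below builds the truncated SVD of a random 6 × 4 matrix (an arbitrary choice for illustration) and compares the spectral norm of the error with the next singular value. The function name `rank_k_approximation` is hypothetical.

```python
import numpy as np


def rank_k_approximation(A, k):
    """Return the truncated-SVD approximation of rank k and all singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Scale the first k left-singular vectors by their singular values,
    # then multiply by the first k right-singular vectors (rows of Vt).
    return (U[:, :k] * s[:k]) @ Vt[:k, :], s


rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
k = 2

A_k, s = rank_k_approximation(A, k)
err = np.linalg.norm(A - A_k, ord=2)  # spectral norm of the error

print(err, s[k])  # s[k] is sigma_{k+1} because indexing starts at 0
```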
                                           
\[
= \begin{bmatrix}
0.4943 & 0.4372 & 0.1215 \\
0.5302 & 0.4689 & 0.1303 \\
0.0692 & 0.0612 & 0.0170 \\
0.1116 & 0.0987 & 0.0274
\end{bmatrix} .
\tag{4.100b}
\]
This first rank-1 approximation $A_1$ is insightful: it tells us that Ali and
Beatrix like science fiction movies, such as Star Wars and Blade Runner
(entries have values > 0.4), but fails to capture the ratings of the other
movies by Chandra. This is not surprising, as Chandra’s type of movies is
not captured by the first singular value. The second singular value gives
us a better rank-1 approximation for those movie-theme lovers:
                             
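\[
A_2 = u_2 v_2^\top
= \begin{bmatrix} 0.0236 \\ 0.2054 \\ -0.7705 \\ -0.6030 \end{bmatrix}
  \begin{bmatrix} 0.0852 & 0.1762 & -0.9807 \end{bmatrix}
\tag{4.101a}
\]
\[
= \begin{bmatrix}
0.0020 & 0.0042 & -0.0231 \\
0.0175 & 0.0362 & -0.2014 \\
-0.0656 & -0.1358 & 0.7556 \\
-0.0514 & -0.1063 & 0.5914
\end{bmatrix} .
\tag{4.101b}
\]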
In this second rank-1 approximation $A_2$, we capture Chandra’s ratings
and movie types well, but not the science fiction movies. This leads us to
consider the rank-2 approximation $\widehat{A}(2)$, where we combine the first two
rank-1 approximations
\[
\widehat{A}(2) = \sigma_1 A_1 + \sigma_2 A_2 =
\begin{bmatrix}
4.7801 & 4.2419 & 1.0244 \\
5.2252 & 4.7522 & -0.0250 \\
0.2493 & -0.2743 & 4.9724 \\
0.7495 & 0.2756 & 4.0278
\end{bmatrix} .
\tag{4.102}
\]
$\widehat{A}(2)$ is similar to the original movie ratings table
\[
A = \begin{bmatrix} 5 & 4 & 1 \\ 5 & 5 & 0 \\ 0 & 0 & 5 \\ 1 & 0 & 4 \end{bmatrix} ,
\tag{4.103}
\]
and this suggests that we can ignore the contribution of $A_3$. We can
interpret this so that in the data table there is no evidence of a third
movie-theme/movie-lovers category. This also means that the entire space of
movie-themes/movie-lovers in our example is a two-dimensional space
spanned by science fiction and French art house movies and lovers.
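The numbers in this example can be reproduced with a few lines of NumPy. The sketch below assumes the ratings matrix A from (4.103) and forms the rank-1 matrices as outer products; the signs of individual singular vectors returned by the library may differ from those printed in the text, but the products $u_i v_i^\top$ do not depend on them.

```python
import numpy as np

# Movie ratings: rows are Star Wars, Blade Runner, Amelie, Delicatessen;
# columns are Ali, Beatrix, Chandra.
A = np.array([[5, 4, 1],
              [5, 5, 0],
              [0, 0, 5],
              [1, 0, 4]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-1 building blocks A_i = u_i v_i^T.
A1 = np.outer(U[:, 0], Vt[0])
A2 = np.outer(U[:, 1], Vt[1])

# Rank-2 approximation and its spectral-norm error.
A_hat_2 = s[0] * A1 + s[1] * A2
err = np.linalg.norm(A - A_hat_2, ord=2)

print(np.round(s, 4))
print(np.round(A_hat_2, 4))
print(err)  # equals the third singular value
```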
Figure 4.13 A functional phylogeny of matrices encountered in machine learning.
[Diagram; recoverable labels: Real matrices (∃ pseudo-inverse, ∃ SVD); Square, $\mathbb{R}^{n \times n}$ (∃ determinant, ∃ trace); Nonsquare, $\mathbb{R}^{n \times m}$; Singular (det = 0); Regular (invertible) (det ≠ 0, ∃ inverse matrix); Defective (no basis of eigenvectors); Non-defective (diagonalizable) (basis of eigenvectors); Normal; Non-normal; $A^\top A = AA^\top$; $= I$; Symmetric (eigenvalues ∈ ℝ).]
                                           Exercises
4.1   Compute the determinant using the Laplace expansion (using the first row)
      and the Sarrus rule for
\[
A = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \\ 0 & 2 & 4 \end{bmatrix} .
\]
       b.
\[
B := \begin{bmatrix} -2 & 2 \\ 2 & 1 \end{bmatrix}
\]
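4.6   Compute the eigenspaces of the following transformation matrices. Are they
      diagonalizable?
       a. For
\[
A = \begin{bmatrix} 2 & 3 & 0 \\ 1 & 4 & 3 \\ 0 & 0 & 1 \end{bmatrix}
\]
       b. For
\[
A = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}
\]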
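4.7   Are the following matrices diagonalizable? If yes, determine their diagonal
      form and a basis with respect to which the transformation matrices are
      diagonal. If no, give reasons why they are not diagonalizable.
      a.
\[
A = \begin{bmatrix} 0 & 1 \\ -8 & 4 \end{bmatrix}
\]
      b.
\[
A = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}
\]
      c.
\[
A = \begin{bmatrix} 5 & 4 & 2 & 1 \\ 0 & 1 & -1 & -1 \\ -1 & -1 & 3 & 0 \\ 1 & 1 & -1 & 2 \end{bmatrix}
\]
      d.
\[
A = \begin{bmatrix} 5 & -6 & -6 \\ -1 & 4 & 2 \\ 3 & -6 & -4 \end{bmatrix}
\]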
4.11 Show that for any $A \in \mathbb{R}^{m \times n}$ the matrices $A^\top A$ and $AA^\top$ possess the
     same nonzero eigenvalues.
4.12 Show that for $x \neq 0$ Theorem 4.24 holds, i.e., show that
\[
\max_{x} \frac{\|Ax\|_2}{\|x\|_2} = \sigma_1 ,
\]
Vector Calculus
[Figure (start of caption lost in extraction): ... density estimation, i.e., modeling data distributions. (a) Regression problem: Find parameters, such that the curve explains the observations (crosses) well; axes x and y. (b) Density estimation with a Gaussian mixture model: Find means and covariances, such that the data (dots) can be explained well; horizontal axis $x_1$.]
This material is published by Cambridge University Press as Mathematics for Machine Learning by
Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view
and download for personal use only. Not for re-distribution, re-sale, or use in derivative works.
© by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021. https://mml-book.com.
[Figure (start of caption lost in extraction): ... when they are used in other parts of the book. Recoverable node labels: Jacobian, Hessian, Chapter 6 Probability, Chapter 11 Density estimation. Recoverable edge labels: defines, collected in, used in.]
Example 5.1
Recall the dot product as a special case of an inner product (Section 3.2).
In the previous notation, the function $f(x) = x^\top x$, $x \in \mathbb{R}^2$, would be
specified as
\[
f : \mathbb{R}^2 \to \mathbb{R}
\tag{5.2a}
\]
\[
x \mapsto x_1^2 + x_2^2 .
\tag{5.2b}
\]
\[
\frac{\delta y}{\delta x} := \frac{f(x + \delta x) - f(x)}{\delta x}
\tag{5.3}
\]
computes the slope of the secant line through two points on the graph of
f . In Figure 5.3, these are the points with x-coordinates x0 and x0 + δx.
  The difference quotient can also be considered the average slope of f
between x and x + δx if we assume f to be a linear function. In the limit
for δx → 0, we obtain the tangent of f at x, if f is differentiable. The
tangent is then the derivative of f at x.
Definition 5.2 (Derivative). More formally, for h > 0 the derivative of f
at x is defined as the limit
\[
\frac{\mathrm{d}f}{\mathrm{d}x} := \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} ,
\tag{5.4}
\]
and the secant in Figure 5.3 becomes a tangent.
   The derivative of f points in the direction of steepest ascent of f .
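To see the limit in (5.4) at work, the sketch below evaluates the difference quotient (5.3) for shrinking step sizes. The test function $f(x) = x^3$ and the point $x_0 = 2$ are assumptions chosen only for illustration; the exact derivative there is 12.

```python
def difference_quotient(f, x, dx):
    """Slope of the secant through (x, f(x)) and (x + dx, f(x + dx))."""
    return (f(x + dx) - f(x)) / dx


def f(x):
    return x ** 3  # illustrative choice; derivative is 3 x^2


x0 = 2.0
for dx in (1.0, 0.1, 0.01, 0.001):
    # The secant slope approaches the tangent slope 3 * x0**2 = 12.
    print(dx, difference_quotient(f, x0, dx))
```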
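\[
= \lim_{h \to 0} \frac{\sum_{i=0}^{n} \binom{n}{i} x^{n-i} h^{i} - x^{n}}{h} .
\tag{5.5c}
\]
We see that $x^n = \binom{n}{0} x^{n-0} h^0$. By starting the sum at 1, the $x^n$-term cancels,
and we obtain
\begin{align}
\frac{\mathrm{d}f}{\mathrm{d}x}
&= \lim_{h \to 0} \frac{\sum_{i=1}^{n} \binom{n}{i} x^{n-i} h^{i}}{h} \tag{5.6a} \\
&= \lim_{h \to 0} \sum_{i=1}^{n} \binom{n}{i} x^{n-i} h^{i-1} \tag{5.6b} \\
&= \lim_{h \to 0} \Bigg( \binom{n}{1} x^{n-1} + \underbrace{\sum_{i=2}^{n} \binom{n}{i} x^{n-i} h^{i-1}}_{\to 0 \text{ as } h \to 0} \Bigg) \tag{5.6c} \\
&= \frac{n!}{1!(n-1)!} x^{n-1} = n x^{n-1} . \tag{5.6d}
\end{align}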
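The split in (5.6c) can be inspected numerically: the first term of the sum does not depend on h, while the remaining terms shrink with h. The following sketch uses the arbitrary values x = 1.5 and n = 4; the helper `quotient_terms` is a hypothetical name introduced for this illustration.

```python
from math import comb


def quotient_terms(x, h, n):
    """Terms C(n, i) x^(n-i) h^(i-1), i = 1..n, of the difference quotient of x^n."""
    return [comb(n, i) * x ** (n - i) * h ** (i - 1) for i in range(1, n + 1)]


x, n = 1.5, 4
for h in (1e-1, 1e-3, 1e-5):
    terms = quotient_terms(x, h, n)
    # terms[0] is n x^(n-1); the rest vanishes as h -> 0.
    print(h, terms[0], sum(terms[1:]))
```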
Definition 5.3 (Taylor Polynomial). The Taylor polynomial of degree n of
$f : \mathbb{R} \to \mathbb{R}$ at $x_0$ is defined as
\[
T_n(x) := \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k ,
\tag{5.7}
\]
(Margin note: We define $t^0 := 1$ for all $t \in \mathbb{R}$.)
For $x_0 = 0$, we obtain the Maclaurin series as a special instance of the
Taylor series. If $f(x) = T_\infty(x)$, then f is called analytic.
(Margin note: $f \in C^\infty$ means that f is continuously differentiable infinitely many times.)

Remark. In general, a Taylor polynomial of degree n is an approximation
of a function, which does not need to be a polynomial. The Taylor polynomial
is similar to f in a neighborhood around $x_0$. However, a Taylor
polynomial of degree n is an exact representation of a polynomial f of
degree $k \leq n$ since all derivatives $f^{(i)}$, $i > k$, vanish.              ♦
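The remark can be illustrated with a small sketch that evaluates (5.7) from a list of derivative values. The function $f(x) = x^3$ expanded at $x_0 = 1$ is an assumption chosen for illustration: its derivatives at $x_0$ are 1, 3, 6, 6, and all higher ones vanish.

```python
from math import factorial


def taylor_polynomial(derivs_at_x0, x0, x):
    """Evaluate T_n(x) = sum_k f^(k)(x0) / k! * (x - x0)^k,
    where derivs_at_x0[k] holds the k-th derivative of f at x0."""
    return sum(d / factorial(k) * (x - x0) ** k
               for k, d in enumerate(derivs_at_x0))


# f(x) = x^3 at x0 = 1: f = 1, f' = 3, f'' = 6, f''' = 6.
derivs = [1.0, 3.0, 6.0, 6.0]

t3 = taylor_polynomial(derivs, 1.0, 2.5)      # degree 3: exact for a cubic
t2 = taylor_polynomial(derivs[:3], 1.0, 2.5)  # degree 2: only an approximation

print(t3, 2.5 ** 3)
print(t2)
```

Python evaluates `0.0 ** 0` as `1.0`, which matches the convention $t^0 := 1$ used in the definition.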
[Figure (start of caption lost in extraction): ... Taylor polynomials (dashed) around $x_0 = 0$. Higher-order Taylor polynomials approximate the function f better and more globally. $T_{10}$ is already similar to f in [−4, 4]. Axes: x and y.]
Example 5.4 (Taylor Series)
Consider the function in Figure 5.4 given by
\[
f(x) = \sin(x) + \cos(x) \in C^\infty .
\tag{5.19}
\]
We seek a Taylor series expansion of f at $x_0 = 0$, which is the Maclaurin
series expansion of f . We obtain the following derivatives:
\begin{align}
f(0) &= \sin(0) + \cos(0) = 1 \tag{5.20} \\
f'(0) &= \cos(0) - \sin(0) = 1 \tag{5.21} \\
f''(0) &= -\sin(0) - \cos(0) = -1 \tag{5.22} \\
f^{(3)}(0) &= -\cos(0) + \sin(0) = -1 \tag{5.23} \\
f^{(4)}(0) &= \sin(0) + \cos(0) = f(0) = 1 \tag{5.24} \\
&\;\;\vdots \notag
\end{align}
We can see a pattern here: The coefficients in our Taylor series are only
±1 (since sin(0) = 0), each of which occurs twice before switching to the
other one. Furthermore, $f^{(k+4)}(0) = f^{(k)}(0)$.
  Therefore, the full Taylor series expansion of f at $x_0 = 0$ is given by
\begin{align}
T_\infty(x) &= \sum_{k=0}^{\infty} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k \tag{5.25a} \\
&= 1 + x - \frac{1}{2!}x^2 - \frac{1}{3!}x^3 + \frac{1}{4!}x^4 + \frac{1}{5!}x^5 - \cdots \tag{5.25b} \\
&= 1 - \frac{1}{2!}x^2 + \frac{1}{4!}x^4 \mp \cdots + x - \frac{1}{3!}x^3 + \frac{1}{5!}x^5 \mp \cdots \tag{5.25c} \\
&= \sum_{k=0}^{\infty} (-1)^k \frac{1}{(2k)!} x^{2k} + \sum_{k=0}^{\infty} (-1)^k \frac{1}{(2k+1)!} x^{2k+1} \tag{5.25d} \\
&= \cos(x) + \sin(x) , \tag{5.25e}
\end{align}
where $a_k$ are coefficients and c is a constant, which has the special form
in Definition 5.4.                                                      ♦
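The pattern of derivatives in Example 5.4 (1, 1, −1, −1, repeating with period 4) is enough to evaluate the Maclaurin polynomials of $f(x) = \sin(x) + \cos(x)$. The sketch below does so; the function name and the evaluation point x = 2 are assumptions made for illustration.

```python
import math


def maclaurin_sin_plus_cos(x, n):
    """Degree-n Maclaurin polynomial of sin(x) + cos(x).
    The derivatives at 0 cycle through 1, 1, -1, -1."""
    pattern = (1.0, 1.0, -1.0, -1.0)
    return sum(pattern[k % 4] / math.factorial(k) * x ** k
               for k in range(n + 1))


exact = math.sin(2.0) + math.cos(2.0)
for n in (1, 5, 10):
    # Higher-degree polynomials track f more closely at x = 2.
    print(n, maclaurin_sin_plus_cos(2.0, n), exact)
```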
\[
\frac{\partial f(x, y)}{\partial y} = 2(x + 2y^3) \frac{\partial}{\partial y}(x + 2y^3) = 12(x + 2y^3) y^2 .
\tag{5.42}
\]
where we used the chain rule (5.32) to compute the partial derivatives.
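A quick numerical check of (5.42), assuming the underlying function is $f(x, y) = (x + 2y^3)^2$ (the form implied by the factor $2(x + 2y^3)$ in the equation). The evaluation point and step size are arbitrary choices, and a central difference is used instead of the one-sided quotient because it is more accurate for the same h.

```python
def f(x, y):
    # Assumed form of the function differentiated in (5.42).
    return (x + 2 * y ** 3) ** 2


def df_dy(x, y):
    # Analytic partial derivative from (5.42).
    return 12 * (x + 2 * y ** 3) * y ** 2


x, y, h = 0.7, -1.3, 1e-6
numeric = (f(x, y + h) - f(x, y - h)) / (2 * h)  # central difference in y

print(numeric, df_dy(x, y))
```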
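\[
\text{Chain rule:} \quad \frac{\partial}{\partial x}(g \circ f)(x) = \frac{\partial}{\partial x} g(f(x)) = \frac{\partial g}{\partial f} \frac{\partial f}{\partial x}
\tag{5.48}
\]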
Let us have a closer look at the chain rule. The chain rule (5.48) resembles
to some degree the rules for matrix multiplication where we said that
neighboring dimensions have to match for matrix multiplication to be defined;
see Section 2.2.1. If we go from left to right, the chain rule exhibits
similar properties: ∂f shows up in the “denominator” of the first factor
and in the “numerator” of the second factor. If we multiply the factors together,
multiplication is defined, i.e., the dimensions of ∂f match, and ∂f
“cancels”, such that ∂g/∂x remains.
(Margin note: This is only an intuition, but not mathematically correct since the partial derivative is not a fraction.)
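Example 5.8
Consider $f(x_1, x_2) = x_1^2 + 2x_2$, where $x_1 = \sin t$ and $x_2 = \cos t$, then
\begin{align}
\frac{\mathrm{d}f}{\mathrm{d}t}
&= \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial t} \tag{5.50a} \\
&= 2 \sin t \, \frac{\partial \sin t}{\partial t} + 2 \, \frac{\partial \cos t}{\partial t} \tag{5.50b} \\
&= 2 \sin t \cos t - 2 \sin t = 2 \sin t (\cos t - 1) \tag{5.50c}
\end{align}
is the corresponding derivative of f with respect to t.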
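The result (5.50c) can be sanity-checked by differentiating the composed function numerically. This is a minimal sketch; the evaluation point t = 0.9 and the step size are arbitrary assumptions.

```python
import math


def f_of_t(t):
    """f(x1, x2) = x1^2 + 2 x2 with x1 = sin t and x2 = cos t."""
    x1, x2 = math.sin(t), math.cos(t)
    return x1 ** 2 + 2 * x2


def df_dt(t):
    """Analytic derivative from (5.50c)."""
    return 2 * math.sin(t) * (math.cos(t) - 1)


t, h = 0.9, 1e-6
numeric = (f_of_t(t + h) - f_of_t(t - h)) / (2 * h)  # central difference

print(numeric, df_dt(t))
```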
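\[
\frac{\partial f}{\partial s} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial s} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial s} ,
\tag{5.51}
\]
\[
\frac{\partial f}{\partial t} = \frac{\partial f}{\partial x_1} \frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial t} ,
\tag{5.52}
\]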
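\[
\frac{\mathrm{d}f}{\mathrm{d}(s, t)} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial (s, t)}
= \underbrace{\begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} \end{bmatrix}}_{= \frac{\partial f}{\partial x}}
  \underbrace{\begin{bmatrix} \frac{\partial x_1}{\partial s} & \frac{\partial x_1}{\partial t} \\ \frac{\partial x_2}{\partial s} & \frac{\partial x_2}{\partial t} \end{bmatrix}}_{= \frac{\partial x}{\partial (s, t)}} .
\]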
This compact way of writing the chain rule as a matrix multiplication only
makes sense if the gradient is defined as a row vector. Otherwise, we will
need to start transposing gradients for the matrix dimensions to match.
This may still be straightforward as long as the gradient is a vector or a
matrix; however, when the gradient becomes a tensor (we will discuss this
in the following), the transpose is no longer a triviality.
(Margin note: The chain rule can be written as a matrix multiplication.)
Remark (Verifying the Correctness of a Gradient Implementation). The definition of the partial derivatives as the limit of the corresponding difference quotient (see (5.39)) can be exploited when numerically checking the correctness of gradients in computer programs (gradient checking): When we compute gradients and implement them, we can use finite differences to numerically test our computation and implementation: We choose the value h to be small (e.g., $h = 10^{-4}$) and compare the finite-difference approximation from (5.39) with our (analytic) implementation of the gradient. If the error is small, our gradient implementation is probably correct. "Small" could mean that
$$\sqrt{\frac{\sum_i (dh_i - df_i)^2}{\sum_i (dh_i + df_i)^2}} < 10^{-6},$$
where $dh_i$ is the finite-difference approximation and $df_i$ is the analytic gradient of f with respect to the ith variable $x_i$. ♦
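A minimal sketch of such a gradient check follows. The test function, its hand-derived gradient and all function names are hypothetical, and the sketch uses a central difference rather than the one-sided quotient of (5.39), which is a common but assumed substitution.

```python
import numpy as np

def finite_difference_gradient(f, x, h=1e-4):
    """Approximate each partial derivative of f at x by a central difference."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * h)
    return grad

def relative_error(dh, df):
    """Error measure from the remark: sqrt(sum (dh-df)^2 / sum (dh+df)^2)."""
    return np.sqrt(np.sum((dh - df) ** 2) / np.sum((dh + df) ** 2))

# Hypothetical test function and its hand-derived gradient.
def f(x):
    return x[0] ** 2 * x[1] + np.sin(x[1])

def grad_f(x):
    return np.array([2.0 * x[0] * x[1], x[0] ** 2 + np.cos(x[1])])

x0 = np.array([1.5, -0.7])
err = relative_error(finite_difference_gradient(f, x0), grad_f(x0))
print(err, err < 1e-6)
```

Note that the error measure is undefined when the numerical and analytic gradients sum to zero, so a practical check would guard against that case.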
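$$f(x) = \begin{bmatrix} f_1(x) \\ \vdots \\ f_m(x) \end{bmatrix} \in \mathbb{R}^m .$$
Writing the vector-valued function in this way allows us to view a vector-valued function $f : \mathbb{R}^n \to \mathbb{R}^m$ as a vector of functions $[f_1, \ldots, f_m]^\top$, $f_i : \mathbb{R}^n \to \mathbb{R}$ that map onto $\mathbb{R}$. The differentiation rules for every $f_i$ are exactly the ones we discussed in Section 5.2.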
© 2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020).
                    150                                                                    Vector Calculus
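$$\frac{\partial f}{\partial x_i} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_i} \\ \vdots \\ \dfrac{\partial f_m}{\partial x_i} \end{bmatrix} \in \mathbb{R}^m \tag{5.55}$$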
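From (5.40), we know that the gradient of f with respect to a vector is the row vector of the partial derivatives. In (5.55), every partial derivative $\partial f/\partial x_i$ is itself a column vector. Therefore, we obtain the gradient of $f : \mathbb{R}^n \to \mathbb{R}^m$ with respect to $x \in \mathbb{R}^n$ by collecting these partial derivatives:
\begin{align}
\frac{df(x)}{dx} &= \begin{bmatrix} \dfrac{\partial f(x)}{\partial x_1} & \cdots & \dfrac{\partial f(x)}{\partial x_n} \end{bmatrix} \tag{5.56a}\\
&= \begin{bmatrix} \dfrac{\partial f_1(x)}{\partial x_1} & \cdots & \dfrac{\partial f_1(x)}{\partial x_n} \\ \vdots & & \vdots \\ \dfrac{\partial f_m(x)}{\partial x_1} & \cdots & \dfrac{\partial f_m(x)}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n} . \tag{5.56b}
\end{align}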
exists also the denominator layout, which is the transpose of the numerator layout. In this book, we will use the numerator layout. ♦
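To make the numerator layout of (5.56b) concrete, here is a small sketch that builds an m × n Jacobian by finite differences and compares it with a hand-derived one. The function f : R^3 → R^2 and all names are hypothetical examples, not taken from the text.

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    """Finite-difference Jacobian in numerator layout: shape (m, n)."""
    x = np.asarray(x, dtype=float)
    m = np.asarray(f(x)).size
    J = np.zeros((m, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        # Column j holds the partial derivative of f with respect to x_j.
        J[:, j] = (f(x + step) - f(x - step)) / (2.0 * h)
    return J

# Hypothetical example f: R^3 -> R^2.
def f(x):
    return np.array([x[0] * x[1], np.sin(x[2]) + x[0] ** 2])

def jacobian_f(x):
    # Rows follow the components f_1, f_2; columns follow x_1, x_2, x_3.
    return np.array([
        [x[1], x[0], 0.0],
        [2.0 * x[0], 0.0, np.cos(x[2])],
    ])

x0 = np.array([1.0, 2.0, 0.5])
J_num = numerical_jacobian(f, x0)
print(J_num.shape)
print(np.allclose(J_num, jacobian_f(x0), atol=1e-6))
```

In the denominator layout the same information would be stored as the transpose, with shape (n, m).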
   We will see how the Jacobian is used in the change-of-variable method
for probability distributions in Section 6.7. The amount of scaling due to
the transformation of a variable is provided by the determinant.
   In Section 4.1, we saw that the determinant can be used to compute the area of a parallelogram. If we are given two vectors $b_1 = [1, 0]^\top$, $b_2 = [0, 1]^\top$ as the sides of the unit square (blue; see Figure 5.5), the area of this square is
$$\left| \det \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right| = 1 . \tag{5.60}$$
If we take a parallelogram with the sides $c_1 = [-2, 1]^\top$, $c_2 = [1, 1]^\top$ (orange in Figure 5.5), its area is given as the absolute value of the determinant (see Section 4.1)
$$\left| \det \begin{bmatrix} -2 & 1 \\ 1 & 1 \end{bmatrix} \right| = |-3| = 3 , \tag{5.61}$$
i.e., the area of this parallelogram is exactly three times the area of the unit square. We can find this scaling factor by finding a mapping that transforms the unit square into the parallelogram. In linear algebra terms, we effectively perform a variable transformation from $(b_1, b_2)$ to $(c_1, c_2)$. In our case, the mapping is linear and the absolute value of the determinant of this mapping gives us exactly the scaling factor we are looking for.
   We will describe two approaches to identify this mapping. First, we exploit that the mapping is linear so that we can use the tools from Chapter 2 to identify this mapping. Second, we will find the mapping via partial derivatives, using the tools we have been discussing in this chapter.
   Approach 1. To get started with the linear algebra approach, we identify both $\{b_1, b_2\}$ and $\{c_1, c_2\}$ as bases of $\mathbb{R}^2$ (see Section 2.6.1 for a recap). What we effectively perform is a change of basis from $(b_1, b_2)$ to $(c_1, c_2)$, and we are looking for the transformation matrix that implements the basis change. Using results from Section 2.7.2, we identify the desired basis change matrix as
$$J = \begin{bmatrix} -2 & 1 \\ 1 & 1 \end{bmatrix} , \tag{5.62}$$
such that $J b_1 = c_1$ and $J b_2 = c_2$. The absolute value of the determinant of $J$, which yields the scaling factor we are looking for, is given as $|\det(J)| = 3$, i.e., the area of the parallelogram spanned by $(c_1, c_2)$ is three times the area of the unit square spanned by $(b_1, b_2)$.
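The linear algebra approach can be reproduced in a few lines of NumPy. This is only an illustrative sketch; the variable names B, C and scale are not from the text.

```python
import numpy as np

# Bases from the text: unit square (b1, b2) and parallelogram (c1, c2),
# stored as the columns of B and C.
B = np.column_stack(([1.0, 0.0], [0.0, 1.0]))
C = np.column_stack(([-2.0, 1.0], [1.0, 1.0]))

# J must satisfy J @ B = C; since B is invertible, J = C @ inv(B).
J = C @ np.linalg.inv(B)

area_unit = abs(np.linalg.det(B))   # area spanned by (b1, b2)
area_para = abs(np.linalg.det(C))   # area spanned by (c1, c2)
scale = abs(np.linalg.det(J))       # scaling factor of the mapping
print(J)
print(area_unit, area_para, scale)
```

Because B is the identity here, J coincides with C, which matches (5.62).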
   Approach 2. The linear algebra approach works for linear transformations; for nonlinear transformations (which become relevant in Section 6.7), we follow a more general approach using partial derivatives.
   For this approach, we consider a function $f : \mathbb{R}^2 \to \mathbb{R}^2$ that performs a variable transformation. In our example, f maps the coordinate representation of any vector $x \in \mathbb{R}^2$ with respect to $(b_1, b_2)$ onto the coordinate representation $y \in \mathbb{R}^2$ with respect to $(c_1, c_2)$. We want to identify the mapping so that we can compute how an area (or volume) changes when it is being transformed by f. For this, we need to find out how $f(x)$ changes if we modify $x$ a bit. This question is exactly answered by the Jacobian matrix $\frac{df}{dx} \in \mathbb{R}^{2 \times 2}$. Since we can write
\begin{align}
y_1 &= -2x_1 + x_2 \tag{5.63}\\
y_2 &= x_1 + x_2 \tag{5.64}
\end{align}
we obtain the functional relationship between $x$ and $y$, which allows us to get the partial derivatives
$$\frac{\partial y_1}{\partial x_1} = -2 , \quad \frac{\partial y_1}{\partial x_2} = 1 , \quad \frac{\partial y_2}{\partial x_1} = 1 , \quad \frac{\partial y_2}{\partial x_2} = 1 \tag{5.65}$$
and compose the Jacobian as
$$J = \begin{bmatrix} \dfrac{\partial y_1}{\partial x_1} & \dfrac{\partial y_1}{\partial x_2} \\ \dfrac{\partial y_2}{\partial x_1} & \dfrac{\partial y_2}{\partial x_2} \end{bmatrix} = \begin{bmatrix} -2 & 1 \\ 1 & 1 \end{bmatrix} . \tag{5.66}$$
The Jacobian represents the coordinate transformation we are looking for. It is exact if the coordinate transformation is linear (as in our case), and (5.66) recovers exactly the basis change matrix in (5.62). If the coordinate transformation is nonlinear, the Jacobian approximates this nonlinear transformation locally with a linear one. The absolute value of the Jacobian determinant $|\det(J)|$ is the factor by which areas or volumes are scaled when coordinates are transformed. Our case yields $|\det(J)| = 3$.
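The following sketch computes the Jacobian of the map in (5.63) and (5.64) by finite differences and, for contrast, the Jacobian of a nonlinear map at a single point. The polar-to-Cartesian map is an illustrative choice that does not appear in the text; its local area scaling is known to equal r.

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    """Central-difference Jacobian with rows for outputs, columns for inputs."""
    x = np.asarray(x, dtype=float)
    cols = []
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        cols.append((f(x + step) - f(x - step)) / (2.0 * h))
    return np.column_stack(cols)

def linear_map(x):
    # The transformation from (5.63) and (5.64).
    return np.array([-2.0 * x[0] + x[1], x[0] + x[1]])

def polar_to_cartesian(p):
    # Illustrative nonlinear map (not from the text): (r, phi) -> (x, y).
    r, phi = p
    return np.array([r * np.cos(phi), r * np.sin(phi)])

# For the linear map the Jacobian is the same at every point.
J_lin = numerical_jacobian(linear_map, np.array([0.3, -1.2]))
print(abs(np.linalg.det(J_lin)))

# For the nonlinear map the scaling factor depends on the evaluation point.
p0 = np.array([2.0, 0.4])
J_pol = numerical_jacobian(polar_to_cartesian, p0)
print(abs(np.linalg.det(J_pol)), p0[0])
```

The second print shows the Jacobian determinant next to r, illustrating that the linear approximation is only valid locally.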
   The Jacobian determinant and variable transformations will become relevant in Section 6.7 when we transform random variables and probability distributions. These transformations are extremely relevant in machine learning in the context of training deep neural networks using the reparametrization trick, also called infinitesimal perturbation analysis.
   In this chapter, we encountered derivatives of functions. Figure 5.6 summarizes the dimensions of those derivatives. If $f : \mathbb{R} \to \mathbb{R}$ the gradient is simply a scalar (top-left entry). For $f : \mathbb{R}^D \to \mathbb{R}$ the gradient is a $1 \times D$ row vector (top-right entry). For $f : \mathbb{R} \to \mathbb{R}^E$, the gradient is an $E \times 1$ column vector, and for $f : \mathbb{R}^D \to \mathbb{R}^E$ the gradient is an $E \times D$ matrix.
[Figure 5.6: Dimensionality of (partial) derivatives.]
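We collect the partial derivatives in the Jacobian and obtain the gradient
$$\frac{df}{dx} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_N} \\ \vdots & & \vdots \\ \dfrac{\partial f_M}{\partial x_1} & \cdots & \dfrac{\partial f_M}{\partial x_N} \end{bmatrix} = \begin{bmatrix} A_{11} & \cdots & A_{1N} \\ \vdots & & \vdots \\ A_{M1} & \cdots & A_{MN} \end{bmatrix} = A \in \mathbb{R}^{M \times N} . \tag{5.68}$$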
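A quick numerical illustration of (5.68), under the assumption that the function in question is the linear map f(x) = Ax (the definition is not restated in this passage). The matrix and vector are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 4
A = rng.standard_normal((M, N))   # hypothetical matrix
x = rng.standard_normal(N)        # hypothetical evaluation point

def f(x):
    # Assumed setting: the linear map f(x) = A x, so that df_i/dx_j = A_ij.
    return A @ x

# Because f is linear, f(x + e_j) - f(x) equals column j of the Jacobian.
J = np.column_stack([f(x + np.eye(N)[j]) - f(x) for j in range(N)])
print(J.shape)
print(np.allclose(J, A))
```

The Jacobian has shape (M, N) and does not depend on the point x, as expected for a linear map.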
Remark. We would have obtained the same result without using the chain rule by immediately looking at the function
$$L_2(\theta) := \|y - \Phi\theta\|^2 = (y - \Phi\theta)^\top (y - \Phi\theta) . \tag{5.84}$$
This approach is still practical for simple functions like $L_2$ but becomes impractical for deep function compositions. ♦
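The remark can be illustrated numerically. The chain-rule result it refers to is not reproduced in this passage, so the sketch below relies on the standard gradient of the squared error, $-2(y - \Phi\theta)^\top \Phi$, and compares a chain-rule computation, a direct computation and a finite-difference estimate. All data are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 6, 3
Phi = rng.standard_normal((N, D))   # hypothetical feature matrix
y = rng.standard_normal(N)          # hypothetical observations
theta = rng.standard_normal(D)      # hypothetical parameters

def loss(theta):
    # L_2(theta) = ||y - Phi theta||^2 as in (5.84).
    e = y - Phi @ theta
    return e @ e

def grad_chain_rule(theta):
    # Chain rule with e = y - Phi theta: dL/de = 2 e^T and de/dtheta = -Phi.
    e = y - Phi @ theta
    dL_de = 2.0 * e            # shape (N,), read as a row vector
    de_dtheta = -Phi           # shape (N, D)
    return dL_de @ de_dtheta   # shape (D,)

def grad_direct(theta):
    # Differentiating the quadratic form (5.84) directly.
    return -2.0 * (y - Phi @ theta) @ Phi

def grad_numeric(theta, h=1e-6):
    g = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = h
        g[i] = (loss(theta + step) - loss(theta - step)) / (2.0 * h)
    return g

print(np.allclose(grad_chain_rule(theta), grad_direct(theta)))
print(np.allclose(grad_direct(theta), grad_numeric(theta), atol=1e-5))
```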
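[Figure residue: $A \in \mathbb{R}^{4 \times 2}$, $\tilde{A} \in \mathbb{R}^{8}$, $\frac{d\tilde{A}}{dx} \in \mathbb{R}^{8 \times 3}$, $\frac{dA}{dx} \in \mathbb{R}^{4 \times 2 \times 3}$.]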
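$$\frac{\partial f_i}{\partial A_{k \neq i,:}} = 0^\top \in \mathbb{R}^{1 \times 1 \times N} \tag{5.91}$$
where we have to pay attention to the correct dimensionality. Since $f_i$ maps onto $\mathbb{R}$ and each row of $A$ is of size $1 \times N$, we obtain a $1 \times 1 \times N$-sized tensor as the partial derivative of $f_i$ with respect to a row of $A$.
   We stack the partial derivatives (5.91) and get the desired gradient in (5.87) via
$$\frac{\partial f_i}{\partial A} = \begin{bmatrix} 0^\top \\ \vdots \\ 0^\top \\ x^\top \\ 0^\top \\ \vdots \\ 0^\top \end{bmatrix} \in \mathbb{R}^{1 \times (M \times N)} . \tag{5.92}$$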
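The structure of (5.92) can be checked with a small experiment. The sketch assumes the underlying function is f = Ax with $A \in \mathbb{R}^{M \times N}$, which is suggested by the appearance of $x^\top$ in row i but not restated in this passage; the array names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 3, 4
A = rng.standard_normal((M, N))   # hypothetical matrix
x = rng.standard_normal(N)        # hypothetical vector

# Analytic gradient tensor of shape M x (M x N):
# slice [i, k, :] is x^T if k == i and 0^T otherwise, as in (5.91) and (5.92).
dfdA = np.zeros((M, M, N))
for i in range(M):
    dfdA[i, i, :] = x

# Finite-difference check: perturb one entry A[k, j] at a time.
h = 1e-6
dfdA_num = np.zeros((M, M, N))
for k in range(M):
    for j in range(N):
        E = np.zeros((M, N))
        E[k, j] = h
        dfdA_num[:, k, j] = ((A + E) @ x - (A - E) @ x) / (2.0 * h)

print(dfdA.shape)
print(np.allclose(dfdA, dfdA_num, atol=1e-6))
```

Most entries of this tensor are zero, which is why explicit gradient tensors are rarely formed in practice.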
$$\partial_{pqij} = \begin{cases} R_{iq} & \text{if } j = p,\ p \neq q \\ R_{ip} & \text{if } j = q,\ p \neq q \\ 2R_{iq} & \text{if } j = p,\ p = q \\ 0 & \text{otherwise} \end{cases} . \tag{5.98}$$
From (5.94), we know that the desired gradient has the dimension $(N \times N) \times (M \times N)$, and every single entry of this tensor is given by $\partial_{pqij}$ in (5.98), where $p, q, j = 1, \ldots, N$ and $i = 1, \ldots, M$.
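The case distinction in (5.98) can be verified entry by entry. The sketch below assumes that the differentiated quantity is $K = R^\top R$ with $R \in \mathbb{R}^{M \times N}$, so that $\partial_{pqij} = \partial K_{pq} / \partial R_{ij}$; this definition is not restated in this passage, and the function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 3, 2
R = rng.standard_normal((M, N))   # hypothetical matrix

def dK_dR_entry(R, p, q, i, j):
    """Entry dK_pq / dR_ij following the case distinction in (5.98)."""
    if j == p and p != q:
        return R[i, q]
    if j == q and p != q:
        return R[i, p]
    if j == p and p == q:
        return 2.0 * R[i, q]
    return 0.0

# Assemble the (N x N) x (M x N) tensor entry by entry.
dK = np.zeros((N, N, M, N))
for p in range(N):
    for q in range(N):
        for i in range(M):
            for j in range(N):
                dK[p, q, i, j] = dK_dR_entry(R, p, q, i, j)

# Finite-difference check under the assumption K = R^T R.
h = 1e-6
dK_num = np.zeros_like(dK)
for i in range(M):
    for j in range(N):
        E = np.zeros((M, N))
        E[i, j] = h
        dK_num[:, :, i, j] = ((R + E).T @ (R + E) - (R - E).T @ (R - E)) / (2.0 * h)

print(dK.shape)
print(np.allclose(dK, dK_num, atol=1e-6))
```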
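$$\frac{df}{dx} = \frac{2x + 2x\exp(x^2)}{2\sqrt{x^2 + \exp(x^2)}} - \sin\left(x^2 + \exp(x^2)\right)\left(2x + 2x\exp(x^2)\right) \tag{5.110}$$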
Writing out the gradient in this explicit way is often impractical since it often results in a very lengthy expression for a derivative. In practice, it means that, if we are not careful, the implementation of the gradient could be significantly more expensive than computing the function, which imposes unnecessary overhead. For training deep neural network models, the backpropagation algorithm (Kelley, 1960; Bryson, 1961; Dreyfus, 1962; Rumelhart et al., 1986) is an efficient way to compute the gradient of an error function with respect to the parameters of the model.
In neural networks with multiple layers, we have functions $f_i(x_{i-1}) = \sigma(A_{i-1} x_{i-1} + b_{i-1})$ in the ith layer. Here $x_{i-1}$ is the output of layer $i - 1$ and $\sigma$ an activation function, such as the logistic sigmoid $\frac{1}{1 + e^{-x}}$, tanh or a rectified linear unit (ReLU). (We discuss the case where the activation functions are identical in each layer to unclutter notation.) In order to train these models, we require the gradient of a loss function $L$ with respect to all model parameters $A_j, b_j$ for $j = 0, \ldots, K - 1$. This also requires us to compute the gradient of $L$ with respect to the inputs of each layer. For example, if we have inputs $x$ and observations $y$ and a network structure defined by
\begin{align}
f_0 &:= x \tag{5.112}\\
f_i &:= \sigma_i(A_{i-1} f_{i-1} + b_{i-1}) , \quad i = 1, \ldots, K , \tag{5.113}
\end{align}
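$$\frac{\partial L}{\partial \theta_i} = \frac{\partial L}{\partial f_K} \frac{\partial f_K}{\partial f_{K-1}} \cdots \frac{\partial f_{i+2}}{\partial f_{i+1}} \frac{\partial f_{i+1}}{\partial \theta_i} \tag{5.118}$$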
The orange terms (of the form $\partial f_{k+1}/\partial f_k$) are partial derivatives of the output of a layer with respect to its inputs, whereas the blue terms (of the form $\partial f_{i+1}/\partial \theta_i$) are partial derivatives of the output of a layer with respect to its parameters. Assuming we have already computed the partial derivatives $\partial L/\partial \theta_{i+1}$, most of the computation can be reused to compute $\partial L/\partial \theta_i$. The additional terms that we
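[Figure 5.9: Backward pass in a multi-layer neural network to compute the gradients of the loss function. The chain runs $x \to f_1 \to \cdots \to f_{K-1} \to f_K \to L$, with parameters $(A_0, b_0)$, $(A_1, b_1)$, ..., $(A_{K-2}, b_{K-2})$, $(A_{K-1}, b_{K-1})$.]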
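The reuse of intermediate results in the backward pass can be sketched for a tiny network. Everything specific here is an assumption made for illustration: the layer widths, the sigmoid activation in every layer, the squared-error loss and the function names are not prescribed by the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
sizes = [3, 4, 2]                       # hypothetical layer widths, K = 2
params = [(rng.standard_normal((m, n)), rng.standard_normal(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal(sizes[0])       # hypothetical input
y = rng.standard_normal(sizes[-1])      # hypothetical observation

def forward(params, x):
    """Return all layer outputs f_0, ..., f_K as in (5.112) and (5.113)."""
    fs = [x]
    for A, b in params:
        fs.append(sigmoid(A @ fs[-1] + b))
    return fs

def loss(params):
    # Assumed squared-error loss between the network output f_K and y.
    return np.sum((y - forward(params, x)[-1]) ** 2)

def backward(params, x):
    """Backward pass: reuse dL/df_k while moving from layer K down to layer 1."""
    fs = forward(params, x)
    dL_df = -2.0 * (y - fs[-1])                      # dL/df_K
    grads = [None] * len(params)
    for k in reversed(range(len(params))):
        A, _ = params[k]
        dz = dL_df * fs[k + 1] * (1.0 - fs[k + 1])   # through the sigmoid
        grads[k] = (np.outer(dz, fs[k]), dz)         # dL/dA_k and dL/db_k
        dL_df = A.T @ dz                             # dL/df_k, reused next
    return grads

grads = backward(params, x)

# Finite-difference check of a single weight in the first layer.
h = 1e-6
A0, b0 = params[0]
E = np.zeros_like(A0)
E[1, 2] = h
num = (loss([(A0 + E, b0)] + params[1:]) - loss([(A0 - E, b0)] + params[1:])) / (2.0 * h)
print(grads[0][0][1, 2], num)
```

The variable dL_df plays the role of the accumulated product of layer-to-layer derivatives in (5.118): it is computed once per layer and shared by the gradients of all earlier parameters.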
Example 5.14
Consider the function
$$f(x) = \sqrt{x^2 + \exp(x^2)} + \cos\left(x^2 + \exp(x^2)\right) \tag{5.122}$$
from (5.109). If we were to implement a function f on a computer, we would be able to save some computation by using intermediate variables:
\begin{align}
a &= x^2 , \tag{5.123}\\
b &= \exp(a) , \tag{5.124}\\
c &= a + b , \tag{5.125}\\
d &= \sqrt{c} , \tag{5.126}\\
e &= \cos(c) , \tag{5.127}\\
f &= d + e . \tag{5.128}
\end{align}
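[Figure 5.11: Computation graph with inputs $x$, function values $f$, and intermediate variables $a, b, c, d, e$. Edges: $x \to (\cdot)^2 \to a$; $a \to \exp(\cdot) \to b$; $a, b \to + \to c$; $c \to \sqrt{\cdot} \to d$; $c \to \cos(\cdot) \to e$; $d, e \to + \to f$.]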
                              This is the same kind of thinking process that occurs when applying
                           the chain rule. Note that the preceding set of equations requires fewer
                           operations than a direct implementation of the function f (x) as defined
                           in (5.109). The corresponding computation graph in Figure 5.11 shows
                           the flow of data and computations required to obtain the function value
                           f.
                              The set of equations that include intermediate variables can be thought
                           of as a computation graph, a representation that is widely used in imple-
                           mentations of neural network software libraries. We can directly compute
                           the derivatives of the intermediate variables with respect to their corre-
                           sponding inputs by recalling the definition of the derivative of elementary
                           functions. We obtain the following:
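\begin{align}
\frac{\partial a}{\partial x} &= 2x \tag{5.129}\\
\frac{\partial b}{\partial a} &= \exp(a) \tag{5.130}\\
\frac{\partial c}{\partial a} &= 1 = \frac{\partial c}{\partial b} \tag{5.131}\\
\frac{\partial d}{\partial c} &= \frac{1}{2\sqrt{c}} \tag{5.132}\\
\frac{\partial e}{\partial c} &= -\sin(c) \tag{5.133}\\
\frac{\partial f}{\partial d} &= 1 = \frac{\partial f}{\partial e} . \tag{5.134}
\end{align}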
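By looking at the computation graph in Figure 5.11, we can compute $\partial f/\partial x$ by working backward from the output and obtain
\begin{align}
\frac{\partial f}{\partial c} &= \frac{\partial f}{\partial d}\frac{\partial d}{\partial c} + \frac{\partial f}{\partial e}\frac{\partial e}{\partial c} \tag{5.135}\\
\frac{\partial f}{\partial b} &= \frac{\partial f}{\partial c}\frac{\partial c}{\partial b} \tag{5.136}\\
\frac{\partial f}{\partial a} &= \frac{\partial f}{\partial b}\frac{\partial b}{\partial a} + \frac{\partial f}{\partial c}\frac{\partial c}{\partial a} \tag{5.137}\\
\frac{\partial f}{\partial x} &= \frac{\partial f}{\partial a}\frac{\partial a}{\partial x} . \tag{5.138}
\end{align}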
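Note that we implicitly applied the chain rule to obtain $\partial f/\partial x$. By substituting the results of the derivatives of the elementary functions, we get
\begin{align}
\frac{\partial f}{\partial c} &= 1 \cdot \frac{1}{2\sqrt{c}} + 1 \cdot (-\sin(c)) \tag{5.139}\\
\frac{\partial f}{\partial b} &= \frac{\partial f}{\partial c} \cdot 1 \tag{5.140}\\
\frac{\partial f}{\partial a} &= \frac{\partial f}{\partial b}\exp(a) + \frac{\partial f}{\partial c} \cdot 1 \tag{5.141}\\
\frac{\partial f}{\partial x} &= \frac{\partial f}{\partial a} \cdot 2x . \tag{5.142}
\end{align}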
By thinking of each of the derivatives above as a variable, we observe that the computation required for calculating the derivative is of similar complexity as the computation of the function itself. This is quite counterintuitive since the mathematical expression for the derivative $\frac{\partial f}{\partial x}$ (5.110) is significantly more complicated than the mathematical expression of the function $f(x)$ in (5.109).
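The forward and backward computations of this example translate almost line by line into code. The following is a hand-written sketch (the function names and the evaluation point are illustrative) that also compares the result with a finite-difference estimate.

```python
import math

def f_and_grad(x):
    """Forward pass (5.123)-(5.128), then backward pass (5.139)-(5.142)."""
    # Forward pass: each intermediate value is computed once.
    a = x ** 2
    b = math.exp(a)
    c = a + b
    d = math.sqrt(c)
    e = math.cos(c)
    f = d + e
    # Backward pass: propagate df/d(node) from the output to the input.
    df_dd = 1.0
    df_de = 1.0
    df_dc = df_dd * (1.0 / (2.0 * math.sqrt(c))) + df_de * (-math.sin(c))
    df_db = df_dc * 1.0
    df_da = df_db * math.exp(a) + df_dc * 1.0
    df_dx = df_da * 2.0 * x
    return f, df_dx

def f_direct(x):
    # Direct evaluation of (5.122), used only for the numerical check.
    return math.sqrt(x ** 2 + math.exp(x ** 2)) + math.cos(x ** 2 + math.exp(x ** 2))

x0 = 0.8
value, grad = f_and_grad(x0)
h = 1e-6
grad_fd = (f_direct(x0 + h) - f_direct(x0 - h)) / (2.0 * h)
print(value, grad, grad_fd)
```

The backward pass uses about as many elementary operations as the forward pass, which is the point made in the text.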
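where the $g_i(\cdot)$ are elementary functions and $x_{\mathrm{Pa}(x_i)}$ are the parent nodes of the variable $x_i$ in the graph. Given a function defined in this way, we can use the chain rule to compute the derivative of the function in a step-by-step fashion. Recall that by definition $f = x_D$ and hence
$$\frac{\partial f}{\partial x_D} = 1 . \tag{5.144}$$
For other variables $x_i$, we apply the chain rule
$$\frac{\partial f}{\partial x_i} = \sum_{x_j : x_i \in \mathrm{Pa}(x_j)} \frac{\partial f}{\partial x_j}\frac{\partial x_j}{\partial x_i} = \sum_{x_j : x_i \in \mathrm{Pa}(x_j)} \frac{\partial f}{\partial x_j}\frac{\partial g_j}{\partial x_i} , \tag{5.145}$$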
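Equations (5.144) and (5.145) describe a generic procedure that works on any computation graph. The sketch below is one possible, deliberately minimal encoding: the list-of-tuples graph format and the name reverse_mode are hypothetical, and the graph used is the one from Example 5.14. It assumes the nodes are listed in an order in which parents come before children.

```python
import math

# Each node: (name, function g, partial derivatives of g, parent names).
graph = [
    ("a", lambda x: x ** 2, [lambda x: 2.0 * x], ["x"]),
    ("b", lambda a: math.exp(a), [lambda a: math.exp(a)], ["a"]),
    ("c", lambda a, b: a + b, [lambda a, b: 1.0, lambda a, b: 1.0], ["a", "b"]),
    ("d", lambda c: math.sqrt(c), [lambda c: 0.5 / math.sqrt(c)], ["c"]),
    ("e", lambda c: math.cos(c), [lambda c: -math.sin(c)], ["c"]),
    ("f", lambda d, e: d + e, [lambda d, e: 1.0, lambda d, e: 1.0], ["d", "e"]),
]

def reverse_mode(graph, inputs):
    """Return the output value and df/d(node) for every node and input."""
    values = dict(inputs)
    for name, g, _, parents in graph:                 # forward pass
        values[name] = g(*(values[p] for p in parents))
    output = graph[-1][0]
    adjoint = {name: 0.0 for name in values}
    adjoint[output] = 1.0                             # (5.144)
    for name, _, partials, parents in reversed(graph):    # backward pass
        args = [values[p] for p in parents]
        for parent, dg in zip(parents, partials):
            # (5.145): add df/dx_j * dg_j/dx_i to the parent x_i.
            adjoint[parent] += adjoint[name] * dg(*args)
    return values[output], adjoint

value, adjoint = reverse_mode(graph, {"x": 0.8})
print(value, adjoint["x"])
```

Processing the nodes in reverse order guarantees that the derivative of f with respect to a node is complete before it is passed on to that node's parents.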
[Figure: plot over $x \in [-4, 4]$ showing $f(x_0)$ and the first-order Taylor series expansion $f(x_0) + f'(x_0)(x - x_0)$.]
[Figure 5.13: Visualizing outer products. Outer products of vectors increase the dimensionality of the array by 1 per term. (a) The outer product of two vectors results in a matrix; (b) the outer product of three vectors yields a third-order tensor. Panel (a): Given a vector $\delta \in \mathbb{R}^4$, we obtain the outer product $\delta^2 := \delta \otimes \delta = \delta\delta^\top \in \mathbb{R}^{4 \times 4}$ as a matrix.]
where $D_x^k f(x_0)$ is the k-th (total) derivative of f with respect to $x$, evaluated at $x_0$.
Definition 5.8 (Taylor Polynomial). The Taylor polynomial of degree n of f at $x_0$ contains the first $n + 1$ components of the series in (5.151) and is defined as
$$T_n(x) = \sum_{k=0}^{n} \frac{D_x^k f(x_0)}{k!} \delta^k . \tag{5.152}$$
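in the Taylor series, where $D_x^k f(x_0)\delta^k$ contains k-th order polynomials.
   Now that we defined the Taylor series for vector fields, let us explicitly write down the first terms $D_x^k f(x_0)\delta^k$ of the Taylor series expansion for $k = 0, \ldots, 3$ and $\delta := x - x_0$:
\begin{align}
k = 0:\quad D_x^0 f(x_0)\delta^0 &= f(x_0) \in \mathbb{R} \tag{5.156}\\
k = 1:\quad D_x^1 f(x_0)\delta^1 &= \underbrace{\nabla_x f(x_0)}_{1 \times D}\,\underbrace{\delta}_{D \times 1} = \sum_{i=1}^{D} \nabla_x f(x_0)[i]\,\delta[i] \in \mathbb{R} \tag{5.157}\\
k = 2:\quad D_x^2 f(x_0)\delta^2 &= \mathrm{tr}\big(\underbrace{H(x_0)}_{D \times D}\,\underbrace{\delta}_{D \times 1}\,\underbrace{\delta^\top}_{1 \times D}\big) = \delta^\top H(x_0)\delta \tag{5.158}\\
&= \sum_{i=1}^{D}\sum_{j=1}^{D} H[i, j]\,\delta[i]\,\delta[j] \in \mathbb{R} \tag{5.159}\\
k = 3:\quad D_x^3 f(x_0)\delta^3 &= \sum_{i=1}^{D}\sum_{j=1}^{D}\sum_{k=1}^{D} D_x^3 f(x_0)[i, j, k]\,\delta[i]\,\delta[j]\,\delta[k] \in \mathbb{R} \tag{5.160}
\end{align}
In NumPy, the terms for $k = 1, 2, 3$ can be computed as np.einsum('i,i',Df1,d), np.einsum('ij,i,j',Df2,d,d), and np.einsum('ijk,i,j,k',Df3,d,d,d), respectively.
Here, $H(x_0)$ is the Hessian of f evaluated at $x_0$.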
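The three np.einsum calls from the margin notes can be compared with the explicit sums in (5.157) to (5.160). The arrays below are random stand-ins for the derivative tensors, used only to exercise the index notation; they do not come from an actual function.

```python
import numpy as np

rng = np.random.default_rng(5)
D = 3
d = rng.standard_normal(D)              # delta = x - x0 (hypothetical values)
Df1 = rng.standard_normal(D)            # stands in for the gradient, shape (D,)
Df2 = rng.standard_normal((D, D))       # stands in for the Hessian, shape (D, D)
Df3 = rng.standard_normal((D, D, D))    # stands in for the third derivative

# einsum calls as given in the margin notes of the text.
t1 = np.einsum('i,i', Df1, d)
t2 = np.einsum('ij,i,j', Df2, d, d)
t3 = np.einsum('ijk,i,j,k', Df3, d, d, d)

# The same quantities written with explicit products and sums.
t1_ref = Df1 @ d                                    # (5.157)
t2_ref = d @ Df2 @ d                                # (5.158)
t3_ref = sum(Df3[i, j, k] * d[i] * d[j] * d[k]      # (5.160)
             for i in range(D) for j in range(D) for k in range(D))
print(np.isclose(t1, t1_ref), np.isclose(t2, t2_ref), np.isclose(t3, t3_ref))
```

Each call contracts every index of the derivative tensor with one copy of delta, so the result is always a scalar.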
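\begin{align}
\frac{\partial f}{\partial x} &= 2x + 2y \implies \frac{\partial f}{\partial x}(1, 2) = 6 \tag{5.163}\\
\frac{\partial f}{\partial y} &= 2x + 3y^2 \implies \frac{\partial f}{\partial y}(1, 2) = 14 . \tag{5.164}
\end{align}
Therefore, we obtain
$$D_{x,y}^1 f(1, 2) = \nabla_{x,y} f(1, 2) = \begin{bmatrix} \dfrac{\partial f}{\partial x}(1, 2) & \dfrac{\partial f}{\partial y}(1, 2) \end{bmatrix} = \begin{bmatrix} 6 & 14 \end{bmatrix} \in \mathbb{R}^{1 \times 2} \tag{5.165}$$
such that
$$\frac{D_{x,y}^1 f(1, 2)}{1!}\delta = \begin{bmatrix} 6 & 14 \end{bmatrix} \begin{bmatrix} x - 1 \\ y - 2 \end{bmatrix} = 6(x - 1) + 14(y - 2) . \tag{5.166}$$
Note that $D_{x,y}^1 f(1, 2)\delta$ contains only linear terms, i.e., first-order polynomials.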
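  The second-order partial derivatives are given by
\begin{align}
\frac{\partial^2 f}{\partial x^2} &= 2 \implies \frac{\partial^2 f}{\partial x^2}(1, 2) = 2 \tag{5.167}\\
\frac{\partial^2 f}{\partial y^2} &= 6y \implies \frac{\partial^2 f}{\partial y^2}(1, 2) = 12 \tag{5.168}\\
\frac{\partial^2 f}{\partial y \partial x} &= 2 \implies \frac{\partial^2 f}{\partial y \partial x}(1, 2) = 2 \tag{5.169}\\
\frac{\partial^2 f}{\partial x \partial y} &= 2 \implies \frac{\partial^2 f}{\partial x \partial y}(1, 2) = 2 . \tag{5.170}
\end{align}
When we collect the second-order partial derivatives, we obtain the Hessian
$$H = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x \partial y} \\ \dfrac{\partial^2 f}{\partial y \partial x} & \dfrac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2 & 2 \\ 2 & 6y \end{bmatrix} , \tag{5.171}$$
such that
$$H(1, 2) = \begin{bmatrix} 2 & 2 \\ 2 & 12 \end{bmatrix} \in \mathbb{R}^{2 \times 2} . \tag{5.172}$$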
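Therefore, the next term of the Taylor-series expansion is given by
    \frac{D^2_{x,y} f(1, 2)}{2!} \delta^2 = \frac{1}{2} \delta^\top H(1, 2) \delta                                                              (5.173a)
        = \frac{1}{2} \begin{bmatrix} x - 1 & y - 2 \end{bmatrix} \begin{bmatrix} 2 & 2 \\ 2 & 12 \end{bmatrix} \begin{bmatrix} x - 1 \\ y - 2 \end{bmatrix}      (5.173b)
        = (x - 1)^2 + 2(x - 1)(y - 2) + 6(y - 2)^2 .                                                                                      (5.173c)
Here, $D^2_{x,y} f(1, 2)\delta^2$ contains only quadratic terms, i.e., second-order
polynomials.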
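As a numerical check of the expansion, the following sketch compares a function with its second-order Taylor polynomial around (1, 2). It assumes the function under discussion is f(x, y) = x^2 + 2xy + y^3, which is consistent with the partial derivatives quoted above; the helper names `f` and `taylor2` are illustrative only.

```python
def f(x, y):
    # Assumed function: its first and second partial derivatives match
    # the values used in (5.164)-(5.172).
    return x**2 + 2 * x * y + y**3


def taylor2(x, y):
    """Second-order Taylor polynomial of f around (1, 2)."""
    dx, dy = x - 1.0, y - 2.0
    constant = f(1.0, 2.0)                         # zeroth-order term
    linear = 6 * dx + 14 * dy                      # first-order term, (5.166)
    quadratic = dx**2 + 2 * dx * dy + 6 * dy**2    # second-order term, (5.173c)
    return constant + linear + quadratic


for x, y in [(1.0, 2.0), (1.1, 2.1), (1.5, 2.5)]:
    print(x, y, f(x, y), taylor2(x, y), f(x, y) - taylor2(x, y))
```

Under this assumption the only nonzero third-order partial derivative is with respect to y, so the printed difference should agree with (y - 2)^3 up to rounding: the approximation is exact at (1, 2) and degrades as y moves away from 2.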
© 2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020).
                                                           Exercises
                      5.1   Compute the derivative f'(x) for
                                                        f(x) = \log(x^4) \sin(x^3) .
            where $x, \mu \in \mathbb{R}^D$, $S \in \mathbb{R}^{D \times D}$.
       b.
                                 f(x) = \mathrm{tr}(x x^\top + \sigma^2 I) ,       x \in \mathbb{R}^D
          Here tr(A) is the trace of A, i.e., the sum of the diagonal elements $A_{ii}$.
          Hint: Explicitly write out the outer product.
       c. Use the chain rule. Provide the dimensions of every single partial
          derivative. You do not need to compute the product of the partial
          derivatives explicitly.
                          f = \tanh(z) \in \mathbb{R}^M
                          z = Ax + b,         x \in \mathbb{R}^N , A \in \mathbb{R}^{M \times N} , b \in \mathbb{R}^M .
            Here, tanh is applied to every component of z.
5.9   We define
                                g(z, \nu) := \log p(x, z) - \log q(z, \nu)
                                      z := t(\epsilon, \nu)
6   Probability and Distributions
                  This material is published by Cambridge University Press as Mathematics for Machine Learning by
                  Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view
                  and download for personal use only. Not for re-distribution, re-sale, or use in derivative works.
                  © by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021. https://mml-book.com.
6.1 Construction of a Probability Space
[Figure: concept diagram; only fragments of its labels survive. Recoverable labels: Independence, Similarity, Bernoulli, Sufficient statistics, Conjugate, Finite, Density estimation, Chapter 11.]
(Grinstead and Snell, 1997; Jaynes, 2003) that introduce the three concepts
of sample space, event space, and probability measure. The probability space
models a real-world process (referred to as an experiment) with random
outcomes.
   The probability of a single event must lie in the interval [0, 1], and the
total probability over all outcomes in the sample space Ω must be 1, i.e.,
P(Ω) = 1. Given a probability space (Ω, A, P), we want to use it to model
some real-world phenomenon. In machine learning, we often avoid explicitly
referring to the probability space, but instead refer to probabilities on
quantities of interest, which we denote by T. In this book, we refer to T
as the target space and refer to elements of T as states. We introduce a
function X : Ω → T that takes an element of Ω (an outcome) and returns
a particular quantity of interest x, a value in T. This association/mapping
from Ω to T is called a random variable. For example, in the case of tossing
two coins and counting the number of heads, a random variable X maps
to the three possible outcomes: X(hh) = 2, X(ht) = 1, X(th) = 1, and
X(tt) = 0. In this particular case, T = {0, 1, 2}, and it is the probabilities
on elements of T that we are interested in. For a finite sample space Ω and
finite T, the function corresponding to a random variable is essentially a
lookup table. For any subset S ⊆ T, we associate $P_X(S)$ ∈ [0, 1] (the
probability) to a particular event occurring corresponding to the random
variable X. Example 6.1 provides a concrete illustration of the terminology.
   [Margin note: The name “random variable” is a great source of
misunderstanding as it is neither random nor is it a variable. It is a
function.]
Remark. The aforementioned sample space Ω unfortunately is referred
to by different names in different books. Another common name for Ω
is “state space” (Jacod and Protter, 2004), but state space is sometimes
reserved for referring to states in a dynamical system (Hasselblatt and
                       Example 6.1
                       We assume that the reader is already familiar with computing probabilities
                       of intersections and unions of sets of events. A gentler introduction to
                       probability with many examples can be found in chapter 2 of Walpole
                       et al. (2011).
                       [Margin note: This toy example is essentially a biased coin flip example.]
                          Consider a statistical experiment where we model a funfair game con-
                       sisting of drawing two coins from a bag (with replacement). There are
                       coins from USA (denoted as $) and UK (denoted as £) in the bag, and
                       since we draw two coins from the bag, there are four outcomes in total.
                       The state space or sample space Ω of this experiment is then ($, $), ($,
                       £), (£, $), (£, £). Let us assume that the composition of the bag of coins is
                       such that a draw returns at random a $ with probability 0.3.
                          The event we are interested in is the total number of times the repeated
                       draw returns $. Let us define a random variable X that maps the sample
                       space Ω to T , which denotes the number of times we draw $ out of the
                       bag. We can see from the preceding sample space we can get zero $, one $,
                       or two $s, and therefore T = {0, 1, 2}. The random variable X (a function
                       or lookup table) can be represented as a table like the following:
                                                          X(($, $)) = 2                                  (6.1)
                                                         X(($, £)) = 1                                   (6.2)
                                                         X((£, $)) = 1                                   (6.3)
                                                         X((£, £)) = 0 .                                 (6.4)
                       Since we return the first coin we draw before drawing the second, this
                       implies that the two draws are independent of each other, which we will
                       discuss in Section 6.4.5. Note that there are two experimental outcomes,
                       which map to the same event, where only one of the draws returns $.
                       Therefore, the probability mass function (Section 6.2.1) of X is given by
                                   P (X = 2) = P (($, $))
                                             = P ($) · P ($)
                                             = 0.3 · 0.3 = 0.09                                          (6.5)
                                   P (X = 1) = P (($, £) ∪ (£, $))
                                             = P (($, £)) + P ((£, $))
                                             = 0.3 · (1 − 0.3) + (1 − 0.3) · 0.3 = 0.42                  (6.6)
                                   P (X = 0) = P ((£, £))
                                             = P (£) · P (£)
                                             = (1 − 0.3) · (1 − 0.3) = 0.49 .                            (6.7)
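The computation in (6.5)-(6.7) can be reproduced by enumerating the sample space and applying the random variable as a lookup. This is a minimal sketch: the labels "USD" and "GBP" stand in for the symbols $ and £, and the names `P_DRAW`, `OMEGA`, and `pmf_of_X` are illustrative, not taken from the text.

```python
from itertools import product

P_USD = 0.3  # probability that one draw returns $ (value from Example 6.1)
P_DRAW = {"USD": P_USD, "GBP": 1.0 - P_USD}  # "USD" stands for $, "GBP" for £

# Sample space Omega: ordered pairs of draws (with replacement).
OMEGA = list(product(P_DRAW, repeat=2))


def X(outcome):
    """Random variable X: number of $ coins in an outcome."""
    return outcome.count("USD")


def pmf_of_X():
    """Accumulate P(X = k) over all outcomes that map to k."""
    pmf = {}
    for first, second in OMEGA:
        prob = P_DRAW[first] * P_DRAW[second]  # the two draws are independent
        k = X((first, second))
        pmf[k] = pmf.get(k, 0.0) + prob
    return pmf


pmf = pmf_of_X()
for k in sorted(pmf):
    print(f"P(X = {k}) = {pmf[k]:.2f}")
```

The two outcomes with exactly one $ are accumulated into the same state X = 1, which is the step written out in (6.6).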
                                     6.1.3 Statistics
Probability theory and statistics are often presented together, but they con-
cern different aspects of uncertainty. One way of contrasting them is by the
kinds of problems that are considered. Using probability, we can consider
a model of some process, where the underlying uncertainty is captured
by random variables, and we use the rules of probability to derive what
happens. In statistics, we observe that something has happened and try
to figure out the underlying process that explains the observations. In this
sense, machine learning is close to statistics in its goals to construct a
model that adequately represents the process that generated the data. We
can use the rules of probability to obtain a “best-fitting” model for some
data.
   Another aspect of machine learning systems is that we are interested
in generalization error (see Chapter 8). This means that we are actually
interested in the performance of our system on instances that we will
observe in future, which are not identical to the instances that we have
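Figure 6.2 Visualization of a discrete bivariate probability mass function, with random variables X and Y. This diagram is adapted from Bishop (2006).
[Figure: grid with columns x1, ..., x5 (states of X) and rows y1, y2, y3 (states of Y); one cell is labeled n_ij, its column sum c_i, and its row sum r_j.]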
Example 6.2
Consider two random variables X and Y, where X has five possible states
and Y has three possible states, as shown in Figure 6.2. We denote by $n_{ij}$
the number of events with state $X = x_i$ and $Y = y_j$, and denote by
N the total number of events. The value $c_i$ is the sum of the individual
frequencies for the ith column, that is, $c_i = \sum_{j=1}^{3} n_{ij}$. Similarly, the value
$r_j$ is the row sum, that is, $r_j = \sum_{i=1}^{5} n_{ij}$. Using these definitions, we can
compactly express the distribution of X and Y.
   The probability distribution of each random variable, the marginal
probability, can be seen as the sum over a row or column
    P(X = x_i) = \frac{c_i}{N} = \frac{\sum_{j=1}^{3} n_{ij}}{N}                            (6.10)
and
    P(Y = y_j) = \frac{r_j}{N} = \frac{\sum_{i=1}^{5} n_{ij}}{N} ,                          (6.11)
where $c_i$ and $r_j$ are the ith column and jth row of the probability table,
respectively. By convention, for discrete random variables with a finite
number of events, we assume that probabilities sum up to one, that is,
    \sum_{i=1}^{5} P(X = x_i) = 1 \quad \text{and} \quad \sum_{j=1}^{3} P(Y = y_j) = 1 .               (6.12)
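The column and row sums in (6.10)-(6.12) can be sketched on a small table of counts. The counts below are hypothetical, chosen only to give five states for X and three for Y; the table is stored row by row, so `n[j][i]` holds the count for $X = x_i$ and $Y = y_j$.

```python
# Hypothetical counts: rows are states y_1..y_3 of Y, columns are states
# x_1..x_5 of X, so n[j][i] is the count for X = x_i and Y = y_j.
n = [
    [2, 1, 0, 3, 4],
    [1, 5, 2, 0, 2],
    [3, 0, 4, 1, 2],
]

N = sum(sum(row) for row in n)                     # total number of events
c = [sum(row[i] for row in n) for i in range(5)]   # column sums c_i
r = [sum(row) for row in n]                        # row sums r_j

p_x = [c_i / N for c_i in c]   # marginal P(X = x_i), as in (6.10)
p_y = [r_j / N for r_j in r]   # marginal P(Y = y_j), as in (6.11)

print("P(X):", p_x)
print("P(Y):", p_y)
print("sums:", sum(p_x), sum(p_y))  # both should be 1 up to rounding, (6.12)
```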
Remark. We reiterate that there are in fact two distinct concepts when
talking about distributions. First is the idea of a pdf (denoted by f(x)),
which is a nonnegative function that integrates to one. Second is the law of a
random variable X, that is, the association of a random variable X with
the pdf f(x).                                                           ♦
Figure 6.3 Examples of (a) discrete and (b) continuous uniform distributions. See Example 6.3 for details of the distributions.
[Figure: panel (a) "Discrete distribution" plots P(Z = z) against z; panel (b) "Continuous distribution" plots p(x) against x.]
                         For most of this book, we will not use the notation f (x) and FX (x) as
                       we mostly do not need to distinguish between the pdf and cdf. However,
                       we will need to be careful about pdfs and cdfs in Section 6.7.
                       Example 6.3
                       We consider two examples of the uniform distribution, where each state is
                       equally likely to occur. This example illustrates some differences between
                       discrete and continuous probability distributions.
                         Let Z be a discrete uniform random variable with three states {z =
                       −1.1, z = 0.3, z = 1.5}. The probability mass function can be represented
                       as a table of probability values:

                            z            −1.1    0.3    1.5
                            P(Z = z)     1/3     1/3    1/3

                       [Margin note: The actual values of these states are not meaningful here,
                       and we deliberately chose numbers to drive home the point that we do not
                       want to use (and should ignore) the ordering of the states.]
                       Alternatively, we can think of this as a graph (Figure 6.3(a)), where we
                       use the fact that the states can be located on the x-axis, and the y-axis
                       represents the probability of a particular state. The y-axis in Figure 6.3(a)
                       is deliberately extended so that it is the same as in Figure 6.3(b).
                          Let X be a continuous random variable taking values in the range 0.9 ≤
                       X ≤ 1.6, as represented by Figure 6.3(b). Observe that the height of the
                     naturally from fulfilling the desiderata (Jaynes, 2003, chapter 2). Prob-
                     abilistic modeling (Section 8.4) provides a principled foundation for de-
                     signing machine learning methods. Once we have defined probability dis-
                     tributions (Section 6.2) corresponding to the uncertainties of the data and
                     our problem, it turns out that there are only two fundamental rules, the
                     sum rule and the product rule.
                        Recall from (6.9) that p(x, y) is the joint distribution of the two ran-
                     dom variables x, y . The distributions p(x) and p(y) are the correspond-
                     ing marginal distributions, and p(y | x) is the conditional distribution of y
                     given x. Given the definitions of the marginal and conditional probability
                     for discrete and continuous random variables in Section 6.2, we can now
                     present the two fundamental rules in probability theory.
                     [Margin note: These two rules arise naturally (Jaynes, 2003) from the
                     requirements we discussed in Section 6.1.1.]
                        The first rule, the sum rule, states that
                          p(x) = \begin{cases} \sum_{y \in \mathcal{Y}} p(x, y) & \text{if } y \text{ is discrete} \\ \int_{\mathcal{Y}} p(x, y) \, dy & \text{if } y \text{ is continuous} \end{cases} ,          (6.20)
                     where $\mathcal{Y}$ are the states of the target space of random variable Y. This
                     means that we sum out (or integrate out) the set of states y of the random
                     variable Y. The sum rule is also known as the marginalization property.
                     The sum rule relates the joint distribution to a marginal distribution. In
                     general, when the joint distribution contains more than two random
                     variables, the sum rule can be applied to any subset of the random variables,
                     resulting in a marginal distribution of potentially more than one random
                     variable. More concretely, if $x = [x_1, \dots, x_D]^\top$, we obtain the marginal
                          p(x_i) = \int p(x_1, \dots, x_D) \, dx_{\setminus i}                    (6.21)
                     The product rule can be interpreted as the fact that every joint distribu-
                     tion of two random variables can be factorized (written as a product)
of two other distributions. The two factors are the marginal distribu-
tion of the first random variable p(x), and the conditional distribution
of the second random variable given the first p(y | x). Since the ordering
of random variables is arbitrary in p(x, y), the product rule also implies
p(x, y) = p(x | y)p(y). To be precise, (6.22) is expressed in terms of the
probability mass functions for discrete random variables. For continuous
random variables, the product rule is expressed in terms of the probability
density functions (Section 6.2.3).
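For discrete random variables, both rules can be checked directly on a small joint table. The sketch below uses a hypothetical joint pmf over x ∈ {0, 1} and y ∈ {0, 1, 2}; the numbers and the dictionary names are assumptions made for illustration.

```python
# Hypothetical joint pmf p(x, y); the six values sum to one.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.30, (1, 1): 0.05, (1, 2): 0.25,
}
xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

# Sum rule: marginalize out the other variable.
p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}
p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}

# Conditionals obtained by dividing the joint by a marginal.
p_y_given_x = {(y, x): joint[(x, y)] / p_x[x] for x in xs for y in ys}
p_x_given_y = {(x, y): joint[(x, y)] / p_y[y] for x in xs for y in ys}

# Product rule: both factorizations recover the same joint value.
for (x, y), p_xy in joint.items():
    via_x = p_y_given_x[(y, x)] * p_x[x]   # p(y | x) p(x)
    via_y = p_x_given_y[(x, y)] * p_y[y]   # p(x | y) p(y)
    print((x, y), p_xy, via_x, via_y)
```

Each conditional is itself a distribution: for every fixed x, the values p(y | x) sum to one over y.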
   In machine learning and Bayesian statistics, we are often interested in
making inferences of unobserved (latent) random variables given that we
have observed other random variables. Let us assume we have some prior
knowledge p(x) about an unobserved random variable x and some rela-
tionship p(y | x) between x and a second random variable y , which we
can observe. If we observe y , we can use Bayes’ theorem to draw some
conclusions about x given the observed values of y. Bayes’ theorem (also
Bayes’ rule or Bayes’ law)
                          The quantity
                              p(y) := \int p(y \mid x) p(x) \, dx = \mathbb{E}_X[p(y \mid x)]                   (6.27)
                        is the marginal likelihood/evidence. The right-hand side of (6.27) uses the
                        expectation operator which we define in Section 6.4.1. By definition, the
                        marginal likelihood integrates the numerator of (6.23) with respect to the
                        latent variable x. Therefore, the marginal likelihood is independent of
                        x, and it ensures that the posterior p(x | y) is normalized. The marginal
                        likelihood can also be interpreted as the expected likelihood where we
                        take the expectation with respect to the prior p(x). Beyond normalization
                        of the posterior, the marginal likelihood also plays an important role in
                        Bayesian model selection, as we will discuss in Section 8.6. Due to the
                        integration in (8.44), the evidence is often hard to compute.
                           Bayes’ theorem (6.23) allows us to invert the relationship between x
                        and y given by the likelihood. Therefore, Bayes’ theorem is sometimes
                        called the probabilistic inverse. We will discuss Bayes’ theorem further in
                        Section 8.4.
                        [Margin note: Bayes’ theorem is also called the “probabilistic inverse.”]
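For a latent variable with finitely many states, the integral in (6.27) becomes a sum, and the role of the evidence as a normalizer is easy to see. The following sketch uses hypothetical numbers for the prior and for the likelihood of one observed value of y; it assumes the usual form of Bayes' theorem, posterior = likelihood × prior / evidence.

```python
# Hypothetical prior p(x) over two latent states.
prior = {0: 0.7, 1: 0.3}

# Hypothetical likelihood p(y_obs | x) of the observed value under each state.
likelihood = {0: 0.1, 1: 0.8}

# Evidence: expected likelihood under the prior, the discrete analog of (6.27).
evidence = sum(likelihood[x] * prior[x] for x in prior)

# Posterior: numerator of Bayes' theorem divided by the evidence.
posterior = {x: likelihood[x] * prior[x] / evidence for x in prior}

print("evidence:", evidence)
print("posterior:", posterior)
print("posterior sums to:", sum(posterior.values()))
```

The evidence does not depend on x, so dividing by it rescales the numerator without changing the relative weights of the states.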
                        Remark. In Bayesian statistics, the posterior distribution is the quantity
                        of interest as it encapsulates all available information from the prior and
                        the data. Instead of carrying the posterior around, it is possible to focus
                        on some statistic of the posterior, such as the maximum of the posterior,
                        which we will discuss in Section 8.3. However, focusing on some statistic
                        of the posterior leads to loss of information. If we think in a bigger con-
                        text, then the posterior can be used within a decision-making system, and
                        having the full posterior can be extremely useful and lead to decisions that
                        are robust to disturbances. For example, in the context of model-based re-
                        inforcement learning, Deisenroth et al. (2015) show that using the full
                        posterior distribution of plausible transition functions leads to very fast
                        (data/sample efficient) learning, whereas focusing on the maximum of
                        the posterior leads to consistent failures. Therefore, having the full pos-
                        terior can be very useful for a downstream task. In Chapter 9, we will
                        continue this discussion in the context of linear regression.             ♦
Definition 6.3 (Expected Value). The expected value of a function g : R →
R of a univariate continuous random variable X ∼ p(x) is given by
    \mathbb{E}_X[g(x)] = \int_{\mathcal{X}} g(x) p(x) \, dx .                   (6.28)
where X is the set of possible outcomes (the target space) of the random
variable X .
where the subscript $\mathbb{E}_{X_d}$ indicates that we are taking the expected value
with respect to the dth element of the vector x.                     ♦
   Definition 6.3 defines the meaning of the notation EX as the operator
indicating that we should take the integral with respect to the probabil-
ity density (for continuous distributions) or the sum over all states (for
discrete distributions). The definition of the mean (Definition 6.4) is a
special case of the expected value, obtained by choosing g to be the
identity function.
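For a discrete random variable, the expected value is a probability-weighted sum over the states. The sketch below assumes the pmf of Example 6.1 (states 0, 1, 2 with probabilities 0.49, 0.42, 0.09); the function name `expected_value` is illustrative.

```python
def expected_value(g, pmf):
    """E_X[g(x)] for a discrete pmf given as {state: probability}."""
    return sum(g(x) * p for x, p in pmf.items())


# pmf of the number of $ draws, taken from Example 6.1.
pmf = {0: 0.49, 1: 0.42, 2: 0.09}

mean = expected_value(lambda x: x, pmf)               # g is the identity
second_moment = expected_value(lambda x: x**2, pmf)   # g(x) = x^2

print("E[x]   =", mean)
print("E[x^2] =", second_moment)
```

Passing the identity function recovers the mean, which is the special case described in the text.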
Definition 6.4 (Mean). The mean of a random variable X with states
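         Example 6.4
         Consider the two-dimensional distribution illustrated in Figure 6.4:
            p(x) = 0.4 \, \mathcal{N}\left( x \,\middle|\, \begin{bmatrix} 10 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) + 0.6 \, \mathcal{N}\left( x \,\middle|\, \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 8.4 & 2.0 \\ 2.0 & 1.7 \end{bmatrix} \right) .      (6.33)
         We will define the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ in Section 6.5. Also
         shown is its corresponding marginal distribution in each dimension.
         Observe that the distribution is bimodal (has two modes), but one of the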
Figure 6.4 Illustration of the mean, mode, and median for a two-dimensional dataset, as well as its marginal densities.
[Figure: legend entries Mean, Modes, Median.]
Remark. The expected value (Definition 6.3) is a linear operator. For
example, given a real-valued function f(x) = ag(x) + bh(x) where a, b ∈ R
and $x \in \mathbb{R}^D$, we obtain
    \mathbb{E}_X[f(x)] = \int f(x) p(x) \, dx                                   (6.34a)
                       = \int [a g(x) + b h(x)] p(x) \, dx                      (6.34b)
                       = a \int g(x) p(x) \, dx + b \int h(x) p(x) \, dx        (6.34c)
                       = a \mathbb{E}_X[g(x)] + b \mathbb{E}_X[h(x)] .          (6.34d)
                                                                                             ♦
   For two random variables, we may wish to characterize their correspon-
[Figure (start of caption lost in extraction): "... each axis (colored lines) but with different covariances." Panels: (a) x and y are negatively correlated; (b) x and y are positively correlated. Axes: x (horizontal), y (vertical).]
    p(x_i) = \int p(x_1, \dots, x_D) \, dx_{\setminus i} ,                    (6.39)
where “\i” denotes “all variables but i”. The off-diagonal entries are the
cross-covariance terms $\mathrm{Cov}[x_i, x_j]$ for i, j = 1, . . . , D, i ≠ j.
                       where $x_n \in \mathbb{R}^D$.
                          Similar to the empirical mean, the empirical covariance matrix is a D×D
                       matrix
                           \Sigma := \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^\top .               (6.42)
                          To compute the statistics for a particular dataset, we would use the
                       realizations (observations) $x_1, \dots, x_N$ and use (6.41) and (6.42).
                       Empirical covariance matrices are symmetric, positive semidefinite (see
                       Section 3.2.3).
                       [Margin note: Throughout the book, we use the empirical covariance, which
                       is a biased estimate. The unbiased (sometimes called corrected) covariance
                       has the factor N − 1 in the denominator instead of N.]

                                       6.4.3 Three Expressions for the Variance
                       We now focus on a single random variable X and use the preceding
                       empirical formulas to derive three possible expressions for the variance. The
                       following derivation is the same for the population variance, except that
                       we need to take care of integrals. The standard definition of variance,
                       corresponding to the definition of covariance (Definition 6.5), is the
                       expectation of the squared deviation of a random variable X from its expected
                       value µ, i.e.,
                           \mathbb{V}_X[x] := \mathbb{E}_X[(x - \mu)^2] .                          (6.43)
                       [Margin note: The derivations are exercises at the end of this chapter.]
                       The expectation in (6.43) and the mean $\mu = \mathbb{E}_X(x)$ are computed
                       using (6.32), depending on whether X is a discrete or continuous random
                       variable. The variance as expressed in (6.43) is the mean of a new random
                       variable $Z := (X - \mu)^2$.
When estimating the variance in (6.43) empirically, we need to resort
to a two-pass algorithm: one pass through the data to calculate the mean
estimate µ̂ using (6.41), and then a second pass using this estimate to
calculate the variance. It turns out that we can avoid two passes by
rearranging the terms. The formula in (6.43) can be converted to the so-called
raw-score formula for variance:
$$\mathbb{V}_X[x] = \mathbb{E}_X[x^2] - (\mathbb{E}_X[x])^2 . \tag{6.44}$$
The expression in (6.44) can be remembered as “the mean of the square
minus the square of the mean”. It can be calculated empirically in one pass
through data since we can accumulate $x_i$ (to calculate the mean) and $x_i^2$
simultaneously, where $x_i$ is the ith observation. Unfortunately, if
implemented in this way, it can be numerically unstable. The raw-score version
of the variance can be useful in machine learning, e.g., when deriving the
bias–variance decomposition (Bishop, 2006).

(Margin note: If the two terms in (6.44) are huge and approximately equal, we
may suffer from an unnecessary loss of numerical precision in floating-point
arithmetic.)

A third way to understand the variance is that it is a sum of pairwise
differences between all pairs of observations. Consider a sample
$x_1, \ldots, x_N$ of realizations of random variable X, and compute the
squared difference between each pair $x_i$ and $x_j$. By expanding the square,
we can show that the normalized sum of the $N^2$ pairwise squared differences
is twice the empirical variance of the observations:
$$\frac{1}{N^2} \sum_{i,j=1}^{N} (x_i - x_j)^2 = 2 \left[ \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \left( \frac{1}{N} \sum_{i=1}^{N} x_i \right)^2 \right] . \tag{6.45}$$
We see that (6.45) is twice the raw-score expression (6.44). This means
that we can express the sum of pairwise distances (of which there are $N^2$)
as a sum of deviations from the mean (of which there are N).
Geometrically, this means that there is an equivalence between the pairwise
distances and the distances from the center of the set of points. From a
computational perspective, this means that by computing the mean (N
terms in the summation), and then computing the variance (again N
terms in the summation), we can obtain an expression (left-hand side
of (6.45)) that has $N^2$ terms.
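The three expressions can be compared numerically. The sketch below is only an illustration under assumed inputs: the function names and the four data values are made up for the example. It also shifts the data by a large constant to show the loss of precision that the raw-score formula (6.44) can suffer in double-precision arithmetic.

```python
import numpy as np

def variance_two_pass(x):
    """Standard definition (6.43): compute the mean, then the mean squared deviation."""
    mu = np.mean(x)
    return np.mean((x - mu) ** 2)

def variance_raw_score(x):
    """Raw-score formula (6.44), accumulated in a single pass over the data."""
    n, s, s2 = 0, 0.0, 0.0
    for xi in x:
        n += 1
        s += xi        # running sum of x_i
        s2 += xi * xi  # running sum of x_i^2
    return s2 / n - (s / n) ** 2

def variance_pairwise(x):
    """Half of the left-hand side of (6.45): averaged squared pairwise differences."""
    diffs = x[:, None] - x[None, :]
    return 0.5 * np.mean(diffs ** 2)

# Hypothetical observations
x = np.array([4.0, 7.0, 13.0, 16.0])
v_two_pass = variance_two_pass(x)
v_raw = variance_raw_score(x)
v_pair = variance_pairwise(x)

# Shifting the data leaves the variance unchanged but makes both terms of (6.44) huge
shifted = x + 1e9
v_shifted_two_pass = variance_two_pass(shifted)
v_shifted_raw = variance_raw_score(shifted)
```

On the unshifted data all three values agree. On the shifted data the two-pass result is unchanged, while the one-pass raw-score result is off by far more than ordinary rounding error, because two numbers of magnitude around 10^18 are subtracted.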
                        Example 6.5
Consider a random variable X with zero mean ($\mathbb{E}_X[x] = 0$) and also
$\mathbb{E}_X[x^3] = 0$. Let $y = x^2$ (hence, Y is dependent on X) and consider the
covariance (6.36) between X and Y. But this gives
$$\mathrm{Cov}[x, y] = \mathbb{E}[xy] - \mathbb{E}[x]\mathbb{E}[y] = \mathbb{E}[x^3] = 0 . \tag{6.54}$$
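A small numerical check of this example, assuming a symmetric grid of values as a stand-in for samples of X (the grid itself is an assumption for illustration): the empirical covariance between x and y = x² vanishes even though y is completely determined by x.

```python
import numpy as np

# Symmetric values around zero, so the empirical mean and third moment vanish
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2  # y is a deterministic function of x, hence dependent on x

# Empirical version of Cov[x, y] = E[xy] - E[x]E[y]
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)
```

The value of `cov_xy` is zero up to floating-point rounding, which illustrates that zero covariance does not imply independence.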
Figure 6.6 Geometry of random variables. If random variables X and Y are
uncorrelated, they are orthogonal vectors in a corresponding vector space,
and the Pythagorean theorem applies. [Figure: right triangle with legs
a = √var[x] and b = √var[y] and hypotenuse c = √var[x + y] = √(var[x] + var[y]).]

Figure 6.7 Gaussian distribution of two random variables x1 and x2.
[Figure: surface plot of p(x1, x2) over x1 and x2.]
Figure 6.8 Gaussian distributions overlaid with 100 samples. (a) One-dimensional
case; (b) two-dimensional case. [Figure: (a) density p(x) over x with the mean,
the samples, and the 2σ range marked; (b) mean and samples in the (x1, x2) plane.]
                    Example 6.6
Figure 6.9 (a) Bivariate Gaussian; (b) marginal of a joint Gaussian
distribution is Gaussian; (c) the conditional distribution of a Gaussian is
also Gaussian. [Figure: (a) plot in the (x1, x2) plane with x2 = −1 marked;
(b) and (c) densities over x1.]
Example 6.7
Since expectations are linear operations, we can obtain the weighted sum
of independent Gaussian random variables
$$p(ax + by) = \mathcal{N}\big( a\mu_x + b\mu_y ,\; a^2 \Sigma_x + b^2 \Sigma_y \big) . \tag{6.79}$$
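Equation (6.79) can be checked by simulation in the univariate case. The following sketch draws independent samples and compares the sample mean and variance of ax + by with the predicted parameters; the weights, means, variances, sample size, and seed are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, -1.0          # hypothetical weights
mu_x, var_x = 1.0, 0.5    # hypothetical parameters of x
mu_y, var_y = 3.0, 2.0    # hypothetical parameters of y

x = rng.normal(mu_x, np.sqrt(var_x), size=200_000)
y = rng.normal(mu_y, np.sqrt(var_y), size=200_000)  # drawn independently of x
z = a * x + b * y

# Parameters predicted by (6.79)
predicted_mean = a * mu_x + b * mu_y
predicted_var = a**2 * var_x + b**2 * var_y
```

The sample statistics `z.mean()` and `z.var()` should lie close to `predicted_mean` and `predicted_var`, up to Monte Carlo error.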
where the scalar 0 < α < 1 is the mixture weight, and p1 (x) and p2 (x) are
univariate Gaussian densities (Equation (6.62)) with different parameters,
i.e., $(\mu_1, \sigma_1^2) \neq (\mu_2, \sigma_2^2)$.
   Then the mean of the mixture density p(x) is given by the weighted sum
of the means of each random variable:
(6.82)
Proof The mean of the mixture density p(x) is given by the weighted
sum of the means of each random variable. We apply the definition of the
mean (Definition 6.4), and plug in our mixture (6.80), which yields
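$$\begin{aligned}
\mathbb{E}[x] &= \int_{-\infty}^{\infty} x\, p(x)\, dx && \text{(6.83a)} \\
&= \int_{-\infty}^{\infty} \big( \alpha x\, p_1(x) + (1 - \alpha) x\, p_2(x) \big)\, dx && \text{(6.83b)} \\
&= \alpha \int_{-\infty}^{\infty} x\, p_1(x)\, dx + (1 - \alpha) \int_{-\infty}^{\infty} x\, p_2(x)\, dx && \text{(6.83c)} \\
&= \alpha \mu_1 + (1 - \alpha) \mu_2 . && \text{(6.83d)}
\end{aligned}$$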
To compute the variance, we can use the raw-score version of the vari-
ance from (6.44), which requires an expression of the expectation of the
squared random variable. Here we use the definition of an expectation of
a function (the square) of a random variable (Definition 6.3),
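$$\begin{aligned}
\mathbb{E}[x^2] &= \int_{-\infty}^{\infty} x^2 p(x)\, dx && \text{(6.84a)} \\
&= \int_{-\infty}^{\infty} \big( \alpha x^2 p_1(x) + (1 - \alpha) x^2 p_2(x) \big)\, dx && \text{(6.84b)}
\end{aligned}$$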
Remark. The preceding derivation holds for any density, but since the
Gaussian is fully determined by the mean and variance, the mixture den-
sity can be determined in closed form.                               ♦
   For a mixture density, the individual components can be considered
to be conditional distributions (conditioned on the component identity).
Equation (6.85c) is an example of the conditional variance formula, also
known as the law of total variance, which generally states that for two
random variables X and Y it holds that
$\mathbb{V}_X[x] = \mathbb{E}_Y[\mathbb{V}_X[x \mid y]] + \mathbb{V}_Y[\mathbb{E}_X[x \mid y]]$,
i.e., the (total) variance of X is the expected conditional variance plus the
variance of a conditional mean.
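The law of total variance can be checked numerically for a mixture of two univariate Gaussians. The sketch below is an illustration under assumed inputs: the mixture weight, component parameters, and integration grid are made up. It compares moments obtained by brute-force numerical integration with the decomposition into expected conditional variance plus variance of the conditional mean.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density with mean mu and variance var."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Hypothetical mixture parameters, chosen only for illustration
alpha = 0.3
mu1, var1 = -1.0, 0.5
mu2, var2 = 2.0, 1.5

# Moments by numerical integration (Riemann sum) on a fine grid
x = np.linspace(-20.0, 20.0, 400_001)
dx = x[1] - x[0]
p = alpha * gaussian_pdf(x, mu1, var1) + (1 - alpha) * gaussian_pdf(x, mu2, var2)
mean_numeric = np.sum(x * p) * dx
var_numeric = np.sum(x**2 * p) * dx - mean_numeric**2  # raw-score formula (6.44)

# Law of total variance, conditioning on the component identity
mean_total = alpha * mu1 + (1 - alpha) * mu2            # matches (6.83d)
expected_cond_var = alpha * var1 + (1 - alpha) * var2
var_of_cond_mean = alpha * mu1**2 + (1 - alpha) * mu2**2 - mean_total**2
var_total = expected_cond_var + var_of_cond_mean
```

Both routes should agree up to the discretization error of the grid.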
   In Example 6.17, we consider a bivariate standard Gaussian random
variable X and perform a linear transformation Ax on it. The outcome
is a Gaussian random variable with mean zero and covariance $AA^\top$.
Observe that adding a constant vector will change the mean of the
distribution, without affecting its variance, that is, the random variable x + µ is
Gaussian with mean µ and identity covariance. Hence, any linear/affine
transformation of a Gaussian random variable is Gaussian distributed.
   Consider a Gaussian distributed random variable $X \sim \mathcal{N}(\mu, \Sigma)$. For
a given matrix A of appropriate shape, let Y be a random variable such
that y = Ax is a transformed version of x. We can compute the mean of
y by exploiting that the expectation is a linear operator (6.50) as follows:
It turns out that the class of distributions called the exponential family
provides the right balance of generality while retaining favorable compu-
tation and inference properties. Before we introduce the exponential fam-
ily, let us see three more members of “named” probability distributions,
the Bernoulli (Example 6.8), Binomial (Example 6.9), and Beta (Exam-
ple 6.10) distributions.
Example 6.8
The Bernoulli distribution is a distribution for a single binary random
variable X with state x ∈ {0, 1}. It is governed by a single continuous
parameter µ ∈ [0, 1] that represents the probability of X = 1. The Bernoulli
distribution Ber(µ) is defined as
$$p(x \mid \mu) = \mu^x (1 - \mu)^{1 - x} , \quad x \in \{0, 1\} , \tag{6.92}$$
$$\mathbb{E}[x] = \mu , \tag{6.93}$$
$$\mathbb{V}[x] = \mu (1 - \mu) , \tag{6.94}$$
where E[x] and V[x] are the mean and variance of the binary random
variable X .
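The mean (6.93) and variance (6.94) follow directly from the two-point pmf (6.92), which a short sketch can confirm. The function name and the parameter value µ = 0.3 are assumptions chosen for illustration.

```python
def bernoulli_pmf(x, mu):
    """Bernoulli probability mass function (6.92) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return mu**x * (1 - mu) ** (1 - x)

mu = 0.3  # hypothetical parameter value

# Expectation and variance computed by summing over the two states
mean = sum(x * bernoulli_pmf(x, mu) for x in (0, 1))
variance = sum((x - mean) ** 2 * bernoulli_pmf(x, mu) for x in (0, 1))
```

Summing over both states reproduces µ for the mean and µ(1 − µ) for the variance.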
Figure 6.10 Examples of the Binomial distribution for µ ∈ {0.1, 0.4, 0.75}
and N = 15. [Figure: p(m) against the number m of observations x = 1 in
N = 15 experiments.]

Figure 6.11 Examples of the Beta distribution for different values of α
and β. [Figure: p(µ | α, β) against µ for α = 0.5 = β; α = 1 = β;
α = 2, β = 0.3; α = 4, β = 10; α = 5, β = 1.]
Remark. There is a whole zoo of distributions with names, and they are
related in different ways to each other (Leemis and McQueston, 2008).
It is worth keeping in mind that each named distribution is created for a
particular reason, but may have other applications. Knowing the reason
behind the creation of a particular distribution often allows insight into
how to best use it. We introduced the preceding three distributions to be
                                                   6.6.1 Conjugacy
                  According to Bayes’ theorem (6.23), the posterior is proportional to the
                  product of the prior and the likelihood. The specification of the prior can
                  be tricky for two reasons: First, the prior should encapsulate our knowl-
                  edge about the problem before we see any data. This is often difficult to
                  describe. Second, it is often not possible to compute the posterior distribu-
                  tion analytically. However, there are some priors that are computationally
                  convenient: conjugate priors.
                  Definition 6.13 (Conjugate Prior). A prior is conjugate for the likelihood
                  function if the posterior is of the same form/type as the prior.
                    Conjugacy is particularly convenient because we can algebraically cal-
                  culate our posterior distribution by updating the parameters of the prior
                  distribution.
                  Remark. When considering the geometry of probability distributions, con-
                  jugate priors retain the same distance structure as the likelihood (Agarwal
                  and Daumé III, 2010).                                                   ♦
                    To introduce a concrete example of conjugate priors, we describe in Ex-
                  ample 6.11 the Binomial distribution (defined on discrete random vari-
                  ables) and the Beta distribution (defined on continuous random vari-
                  ables).
                                 ∝ Beta(h + α, N − h + β) ,                         (6.104d)
i.e., the posterior distribution is a Beta distribution, just like the prior.
In other words, the Beta prior is conjugate for the parameter µ in the
Binomial likelihood function.
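The practical benefit of conjugacy is that the posterior is obtained by updating two numbers. The sketch below assumes a Beta(α, β) prior and h outcomes x = 1 in N trials, as in (6.104d); the function names, the prior parameters, and the data are hypothetical. It cross-checks the closed-form update against a brute-force normalization of prior times likelihood on a grid.

```python
import math
import numpy as np

def beta_pdf(mu, a, b):
    """Beta density, normalized via log-Gamma functions."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return np.exp(log_norm + (a - 1) * np.log(mu) + (b - 1) * np.log(1 - mu))

def beta_binomial_update(alpha, beta, h, N):
    """Posterior Beta parameters after h outcomes x = 1 in N trials, cf. (6.104d)."""
    return alpha + h, beta + (N - h)

alpha, beta = 2.0, 3.0  # hypothetical prior parameters
h, N = 7, 10            # hypothetical data
alpha_post, beta_post = beta_binomial_update(alpha, beta, h, N)

# Cross-check: normalize prior x likelihood numerically on a grid over (0, 1)
mu = np.linspace(1e-6, 1 - 1e-6, 200_001)
unnormalized = beta_pdf(mu, alpha, beta) * mu**h * (1 - mu) ** (N - h)
posterior_grid = unnormalized / (np.sum(unnormalized) * (mu[1] - mu[0]))
```

The grid-based posterior should match `beta_pdf(mu, alpha_post, beta_post)` up to discretization error, without any integral having to be solved by hand.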
   Table 6.2 lists examples for conjugate priors for the parameters of some
standard likelihoods used in probabilistic modeling. Distributions such as
Multinomial, inverse Gamma, inverse Wishart, and Dirichlet can be found
in any statistical text, and are described in Bishop (2006), for example.
   The Beta distribution is the conjugate prior for the parameter µ in both
the Binomial and the Bernoulli likelihood. For a Gaussian likelihood
function, we can place a conjugate Gaussian prior on the mean. The reason
why the Gaussian likelihood appears twice in the table is that we need
to distinguish the univariate from the multivariate case. In the univariate
(scalar) case, the inverse Gamma is the conjugate prior for the variance.
In the multivariate case, we use a conjugate inverse Wishart distribution
as a prior on the covariance matrix. The Dirichlet distribution is the
conjugate prior for the multinomial likelihood function. For further details, we
refer to Bishop (2006).

(Margin note: The Gamma prior is conjugate for the precision (inverse
variance) in the univariate Gaussian likelihood, and the Wishart prior is
conjugate for the precision matrix (inverse covariance matrix) in the
multivariate Gaussian likelihood.)
$$\theta = \log \frac{\mu}{1 - \mu} \tag{6.115}$$
$$\phi(x) = x \tag{6.116}$$
$$A(\theta) = -\log(1 - \mu) = \log(1 + \exp(\theta)) . \tag{6.117}$$
The relationship between θ and µ is invertible so that
$$\mu = \frac{1}{1 + \exp(-\theta)} . \tag{6.118}$$
The relation (6.118) is used to obtain the right equality of (6.117).
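The relations (6.115), (6.117), and (6.118) can be verified numerically. The function names and the value µ = 0.8 in this sketch are assumptions for illustration.

```python
import math

def natural_parameter(mu):
    """theta = log(mu / (1 - mu)), the natural parameter in (6.115)."""
    return math.log(mu / (1.0 - mu))

def sigmoid(theta):
    """mu = 1 / (1 + exp(-theta)), the inverse relationship (6.118)."""
    return 1.0 / (1.0 + math.exp(-theta))

def log_partition(theta):
    """A(theta) = log(1 + exp(theta)), the right-hand form in (6.117)."""
    return math.log1p(math.exp(theta))

mu = 0.8  # hypothetical Bernoulli parameter
theta = natural_parameter(mu)
recovered_mu = sigmoid(theta)
A_from_theta = log_partition(theta)
A_from_mu = -math.log(1.0 - mu)
```

Mapping µ to θ and back recovers µ, and the two forms of A(θ) in (6.117) coincide.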
Example 6.15
Recall the exponential family form of the Bernoulli distribution (6.113d)
$$p(x \mid \mu) = \exp\left[ x \log \frac{\mu}{1 - \mu} + \log(1 - \mu) \right] . \tag{6.121}$$
ables. However, we may not be able to obtain the functional form of the
distribution under transformations. Furthermore, we may be interested
in nonlinear transformations of random variables for which closed-form
expressions are not readily available.
Remark (Notation). In this section, we will be explicit about random vari-
ables and the values they take. Hence, recall that we use capital letters
X, Y to denote random variables and small letters x, y to denote the val-
ues in the target space T that the random variables take. We will explicitly
write pmfs of discrete random variables X as P (X = x). For continuous
random variables X (Section 6.2.2), the pdf is written as f (x) and the cdf
is written as FX (x).                                                     ♦
  We will look at two approaches for obtaining distributions of
transformations of random variables: a direct approach using the definition of a
cumulative distribution function and a change-of-variable approach that
uses the chain rule of calculus (Section 5.2.2). The change-of-variable
approach is widely used because it provides a “recipe” for attempting to
compute the resulting distribution due to a transformation. We will
explain the techniques for univariate random variables, and will only briefly
provide the results for the general case of multivariate random variables.

(Margin note: Moment generating functions can also be used to study
transformations of random variables (Casella and Berger, 2002, chapter 2).)

  Transformations of discrete random variables can be understood
directly. Suppose that there is a discrete random variable X with pmf
P(X = x) (Section 6.2.1), and an invertible function U(x). Consider the
transformed random variable Y := U(X), with pmf P(Y = y). Then
                      We also need to keep in mind that the domain of the random variable may
                      have changed due to the transformation by U .
                      Example 6.16
Let X be a continuous random variable with probability density function
on $0 \leq x \leq 1$
$$f(x) = 3x^2 . \tag{6.128}$$
We are interested in finding the pdf of $Y = X^2$.
  The function f is an increasing function of x, and therefore the resulting
value of y lies in the interval [0, 1]. We obtain
$$\begin{aligned}
F_Y(y) &= P(Y \leq y) && \text{definition of cdf} && \text{(6.129a)} \\
&= P(X^2 \leq y) && \text{transformation of interest} && \text{(6.129b)} \\
&= P(X \leq y^{1/2}) && \text{inverse} && \text{(6.129c)} \\
&= F_X(y^{1/2}) && \text{definition of cdf} && \text{(6.129d)} \\
&= \int_0^{y^{1/2}} 3t^2\, dt && \text{cdf as a definite integral} && \text{(6.129e)} \\
&= \big[ t^3 \big]_{t=0}^{t=y^{1/2}} && \text{result of integration} && \text{(6.129f)} \\
&= y^{3/2} , \quad 0 \leq y \leq 1 . && && \text{(6.129g)}
\end{aligned}$$
Therefore, the cdf of Y is
$$F_Y(y) = y^{3/2} \tag{6.130}$$
for $0 \leq y \leq 1$. To obtain the pdf, we differentiate the cdf
$$f(y) = \frac{d}{dy} F_Y(y) = \frac{3}{2} y^{1/2} \tag{6.131}$$
for $0 \leq y \leq 1$.
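The result of this example can be checked by simulation. The sketch below is an illustration under assumptions: it draws X by inverting its cdf F_X(x) = x³ (a convenience that is not part of the example itself), squares the samples, and compares the empirical cdf of Y with (6.130). The sample size, seed, and evaluation grid are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# X has pdf f(x) = 3x^2 on [0, 1], so F_X(x) = x^3 and X = U^(1/3) for uniform U
u = rng.uniform(size=100_000)
x = u ** (1.0 / 3.0)
y = x ** 2

def cdf_y(t):
    """Derived cdf (6.130) of Y = X^2 on [0, 1]."""
    return t ** 1.5

def pdf_y(t):
    """Derived pdf (6.131) of Y = X^2 on [0, 1]."""
    return 1.5 * np.sqrt(t)

# Compare the empirical cdf of the samples with the derived cdf at a few points
grid = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
empirical_cdf = np.array([(y <= t).mean() for t in grid])
max_gap = np.max(np.abs(empirical_cdf - cdf_y(grid)))

# Central finite difference of the cdf as a check on the pdf
eps = 1e-6
pdf_numeric = (cdf_y(0.5 + eps) - cdf_y(0.5 - eps)) / (2 * eps)
```

The gap between the empirical and derived cdf should be small and shrink as the number of samples grows, and the finite difference of the cdf should match `pdf_y(0.5)`.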
$$Y := F_X(X) \tag{6.132}$$
   Theorem 6.15 is known as the probability integral transform, and it is
used to derive algorithms for sampling from distributions by transforming
the result of sampling from a uniform random variable (Bishop, 2006).
The algorithm works by first generating a sample from a uniform distribu-
tion, then transforming it by the inverse cdf (assuming this is available)
to obtain a sample from the desired distribution. The probability integral
transform is also used for hypothesis testing whether a sample comes from
a particular distribution (Lehmann and Romano, 2005). The idea that the
output of a cdf gives a uniform distribution also forms the basis of copu-
las (Nelsen, 2006).
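The sampling algorithm described above can be sketched as follows. The exponential distribution is used as the target only because its cdf has a simple closed-form inverse; this choice, the rate parameter, the function names, and the seed are assumptions for illustration and do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 2.0  # hypothetical rate of the exponential target distribution

def exponential_cdf(x, rate):
    """cdf F(x) = 1 - exp(-rate * x) for x >= 0."""
    return 1.0 - np.exp(-rate * x)

def exponential_inverse_cdf(u, rate):
    """Inverse cdf F^{-1}(u) = -log(1 - u) / rate for u in [0, 1)."""
    return -np.log1p(-u) / rate

# Step 1: sample from a uniform distribution; step 2: apply the inverse cdf
u = rng.uniform(size=200_000)
samples = exponential_inverse_cdf(u, rate)

# Probability integral transform: the cdf maps the samples back to uniform values
pit = exponential_cdf(samples, rate)
```

The sample mean should be close to 1/rate, and `pit` recovers the uniform draws, which is the statement Y := F_X(X) read in reverse.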
Let us break down the reasoning step by step, with the goal of
understanding the more general change-of-variables approach in Theorem 6.16.

(Margin note: Change of variables in probability relies on the
change-of-variables method in calculus (Tandra, 2014).)

Remark. The name “change of variables” comes from the idea of
changing the variable of integration when faced with a difficult integral. For
univariate functions, we use the substitution rule of integration,
$$\int f(g(x))\, g'(x)\, dx = \int f(u)\, du , \quad \text{where } u = g(x) . \tag{6.133}$$
The derivation of this rule is based on the chain rule of calculus (5.32) and
by applying twice the fundamental theorem of calculus. The fundamental
theorem of calculus formalizes the fact that integration and differentiation
are somehow “inverses” of each other. An intuitive understanding of the
rule can be obtained by thinking (loosely) about small changes
(differentials) to the equation u = g(x), that is by considering $\Delta u = g'(x) \Delta x$ as a
differential of u = g(x). By substituting u = g(x), the argument inside the
integral on the right-hand side of (6.133) becomes f(g(x)). By pretending
that the term du can be approximated by $du \approx \Delta u = g'(x) \Delta x$, and that
$dx \approx \Delta x$, we obtain (6.133).                                                 ♦
   Consider a univariate random variable X , and an invertible function
U , which gives us another random variable Y = U (X). We assume that
random variable X has states x ∈ [a, b]. By the definition of the cdf, we
have
$$F_Y(y) = P(Y \leq y) . \tag{6.134}$$
Theorem 6.16. [Theorem 17.2 in Billingsley (1995)] Let f (x) be the value
of the probability density of the multivariate continuous random variable X .
If the vector-valued function y = U (x) is differentiable and invertible for
all values within the domain of x, then for corresponding values of y , the
probability density of Y = U(X) is given by
$$f(y) = f_x(U^{-1}(y)) \cdot \left| \det\left( \frac{\partial}{\partial y} U^{-1}(y) \right) \right| . \tag{6.144}$$
   The theorem looks intimidating at first glance, but the key point is that
a change of variable of a multivariate random variable follows the pro-
cedure of the univariate change of variable. First we need to work out
the inverse transform, and substitute that into the density of x. Then we
calculate the determinant of the Jacobian and multiply the result. The
following example illustrates the case of a bivariate random variable.
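Example 6.17
Consider a bivariate random variable X with states x = [x_1, x_2]^\top and
probability density function
    f\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) = \frac{1}{2\pi} \exp\left( -\frac{1}{2} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^\top \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) .        (6.145)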
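Theorem 6.16 can be checked numerically for the density (6.145). The sketch below assumes, purely for illustration, a linear map U(x) = Ax with an invertible matrix A chosen here (it is not taken from the text). Then U^{-1}(y) = A^{-1}y, the Jacobian of the inverse is A^{-1}, and the result of (6.144) can be compared with the known density of a zero-mean Gaussian with covariance AA^\top.

```python
import numpy as np

# Assumed invertible linear map U(x) = A x (A is an illustrative choice).
A = np.array([[2.0, 1.0],
              [0.5, 3.0]])
A_inv = np.linalg.inv(A)


def density_x(x):
    """Standard bivariate Gaussian density, as in (6.145)."""
    return np.exp(-0.5 * x @ x) / (2.0 * np.pi)


def density_y(y):
    """Density of Y = A X via (6.144): f_x(U^{-1}(y)) * |det(dU^{-1}/dy)|."""
    x = A_inv @ y                          # inverse transform U^{-1}(y)
    jac_det = abs(np.linalg.det(A_inv))    # Jacobian of U^{-1} is A^{-1}
    return density_x(x) * jac_det


def gaussian_density(y, cov):
    """Zero-mean bivariate Gaussian density with covariance cov."""
    quad = y @ np.linalg.solve(cov, y)
    return np.exp(-0.5 * quad) / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))


y_test = np.array([0.7, -1.2])
via_theorem = density_y(y_test)
via_closed_form = gaussian_density(y_test, A @ A.T)
print(via_theorem, via_closed_form)
```

The two printed numbers should agree up to floating-point error, since a linear transformation of a Gaussian random variable is again Gaussian.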
                                         Exercises
6.1   Consider the following bivariate distribution p(x, y) of two discrete random
      variables X and Y .
                                         x1     x2    x3     x4     x5
                                                      X
      Compute:
      a. The marginal distributions p(x) and p(y).
      b. The conditional distributions p(x|Y = y1 ) and p(y|X = x3 ).
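6.2   Consider a mixture of two Gaussian distributions (illustrated in Figure 6.4),
          0.4 \, \mathcal{N}\left( \begin{bmatrix} 10 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) + 0.6 \, \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 8.4 & 2.0 \\ 2.0 & 1.7 \end{bmatrix} \right) .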
      Choose a conjugate prior for the Bernoulli likelihood and compute the pos-
      terior distribution p(µ | x1 , . . . , xN ).
6.4   There are two bags. The first bag contains four mangos and two apples; the
      second bag contains four mangos and four apples.
      We also have a biased coin, which shows “heads” with probability 0.6 and
      “tails” with probability 0.4. If the coin shows “heads”, we pick a fruit at
      random from bag 1; otherwise we pick a fruit at random from bag 2.
      Your friend flips the coin (you cannot see the result), picks a fruit at random
      from the corresponding bag, and presents you a mango.
      What is the probability that the mango was picked from bag 2?
      Hint: Use Bayes’ theorem.
6.5   Consider the time-series model
          x_{t+1} = A x_t + w ,   \quad w \sim \mathcal{N}(0, Q)
          y_t = C x_t + v ,       \quad v \sim \mathcal{N}(0, R) ,
      where w, v are i.i.d. Gaussian noise variables. Further, assume that
      p(x_0) = \mathcal{N}(\mu_0, \Sigma_0).
              C = (A^{-1} + B^{-1})^{-1}
              c = C (A^{-1} a + B^{-1} b)
              c = (2\pi)^{-D/2} \, |A + B|^{-1/2} \exp\left( -\frac{1}{2} (a - b)^\top (A + B)^{-1} (a - b) \right) .
     Furthermore, we have
                                         y = Ax + b + w ,
Continuous Optimization
but there are several design choices, which we discuss in Section 7.1. For
constrained optimization, we need to introduce other concepts to man-
age the constraints (Section 7.2). We will also introduce a special class
of problems (convex optimization problems in Section 7.3) where we can
make statements about reaching the global optimum.
   Consider the function in Figure 7.2. The function has a global minimum
around x = −4.5, with a function value of approximately −47. Since
the function is “smooth,” the gradients can be used to help find the min-
imum by indicating whether we should take a step to the right or left.
This assumes that we are in the correct bowl, as there exists another local
minimum around x = 0.7. Recall that we can solve for all the stationary
points of a function by calculating its derivative and setting it to zero.
(Stationary points are the real roots of the derivative, that is, points that
have zero gradient.) For
                         \ell(x) = x^4 + 7x^3 + 5x^2 - 17x + 3 ,                                (7.1)
we obtain the corresponding gradient as
                         \frac{d\ell(x)}{dx} = 4x^3 + 21x^2 + 10x - 17 .                        (7.2)
This material is published by Cambridge University Press as Mathematics for Machine Learning by
Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view
and download for personal use only. Not for re-distribution, re-sale, or use in derivative works.
 c
by  M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021. https://mml-book.com.
                      Since this is a cubic equation, it has in general three solutions when set to
                      zero. In the example, two of them are minimums and one is a maximum
                      (around x = −1.4). To check whether a stationary point is a minimum
                      or maximum, we need to take the derivative a second time and check
                      whether the second derivative is positive or negative at the stationary
                      point. In our case, the second derivative is
                         \frac{d^2\ell(x)}{dx^2} = 12x^2 + 42x + 10 .                               (7.3)
                      By substituting our visually estimated values of x = −4.5, −1.4, 0.7, we
                      will observe that, as expected, the middle point is a maximum
                      (\frac{d^2\ell(x)}{dx^2} < 0) and the other two stationary points are minimums.
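For this low-order polynomial the stationary points can also be located numerically instead of visually. The following sketch uses NumPy's polynomial helpers (an implementation choice made here, not something the text prescribes) to find the real roots of the derivative (7.2) and classify each one by the sign of the second derivative (7.3).

```python
import numpy as np

# Coefficients of the derivative (7.2): 4x^3 + 21x^2 + 10x - 17.
d1 = np.poly1d([4.0, 21.0, 10.0, -17.0])
# Second derivative (7.3): 12x^2 + 42x + 10.
d2 = d1.deriv()

# Stationary points are the real roots of the derivative.
roots = np.roots(d1.coeffs)
stationary = np.sort(roots[np.isreal(roots)].real)

for x in stationary:
    kind = "minimum" if d2(x) > 0 else "maximum"
    print(f"x = {x:.3f}: second derivative {d2(x):.2f} -> {kind}")
```

The computed roots lie close to the visually estimated values −4.5, −1.4, and 0.7.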
   Note that we have avoided analytically solving for values of x in the
previous discussion, although for low-order polynomials such as the pre-
ceding we could do so. In general, we are unable to find analytic solu-
tions, and hence we need to start at some value, say x0 = −6, and follow
the negative gradient. (According to the Abel–Ruffini theorem, there is in
general no algebraic solution for polynomials of degree 5 or more
(Abel, 1826).) The negative gradient indicates that we should go
right, but not how far (this is called the step-size). Furthermore, if we
had started at the right side (e.g., x0 = 0), the negative gradient would
have led us to the wrong minimum. Figure 7.2 illustrates the fact that for
x > −1, the negative gradient points toward the minimum on the right of
the figure, which has a larger objective value.
[Figure 7.2: the objective plotted against the value of the parameter, for values from −6 to 2.]
   In Section 7.3, we will learn about a class of functions, called convex
functions, that do not exhibit this tricky dependency on the starting point
of the optimization algorithm. For convex functions, all local minima
are global minima. It turns out that many machine learning objective
functions are designed such that they are convex, and we will see an ex-
ample in Chapter 12.
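The dependence on the starting point can be reproduced in a few lines. The sketch below follows the negative gradient of (7.1) with a fixed step-size; the step-size 0.005 and the iteration count are assumptions chosen for illustration, not values from the text.

```python
def grad(x):
    """Derivative (7.2) of the objective (7.1)."""
    return 4 * x**3 + 21 * x**2 + 10 * x - 17


def descend(x0, step=0.005, iters=500):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(iters):
        x = x - step * grad(x)
    return x


x_left = descend(-6.0)   # starts to the left of the global minimum
x_right = descend(0.0)   # starts in the right-hand bowl
print(x_left, x_right)
```

Starting from x0 = −6 the iterates settle near the global minimum, whereas starting from x0 = 0 they settle in the local minimum on the right.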
   The discussion in this chapter so far was about a one-dimensional func-
tion, where we are able to visualize the ideas of gradients, descent direc-
tions, and optimal values. In the rest of this chapter we develop the same
ideas in high dimensions. Unfortunately, we can only visualize the con-
cepts in one dimension, and some concepts do not generalize directly to
higher dimensions; therefore, some care needs to be taken when reading.
Example 7.1
Consider a quadratic function in two dimensions
    f\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) = \frac{1}{2} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^\top \begin{bmatrix} 2 & 1 \\ 1 & 20 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - \begin{bmatrix} 5 \\ 3 \end{bmatrix}^\top \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}        (7.7)
with gradient
    \nabla f\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^\top \begin{bmatrix} 2 & 1 \\ 1 & 20 \end{bmatrix} - \begin{bmatrix} 5 \\ 3 \end{bmatrix}^\top .        (7.8)
Starting at the initial location x0 = [−3, −1]^\top, we iteratively apply (7.6)
to obtain a sequence of estimates that converge to the minimum value
(illustrated in Figure 7.3). We can see (both from the figure and by plug-
ging x0 into (7.8) with γ = 0.085) that the negative gradient at x0 points
north and east, leading to x1 = [−1.98, 1.21]^\top. Repeating that argument
gives us x2 = [−1.32, −0.42]^\top, and so on.
[Figure 7.3: Gradient descent on a two-dimensional quadratic surface (shown as a heatmap), with axes x1 and x2. See Example 7.1 for a description.]
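The iterates quoted in Example 7.1 can be reproduced with a short script. This is a minimal sketch that assumes (7.6) is the plain gradient descent update, in which each new estimate is the previous one minus γ times the gradient; the iteration count is an arbitrary choice.

```python
import numpy as np

Q = np.array([[2.0, 1.0],
              [1.0, 20.0]])
b = np.array([5.0, 3.0])


def grad_f(x):
    """Gradient (7.8) of the quadratic (7.7), returned as a 1-D array."""
    return Q @ x - b


gamma = 0.085                      # step-size used in Example 7.1
x = np.array([-3.0, -1.0])         # initial location x0
iterates = [x]
for _ in range(200):               # iteration count is an assumption
    x = x - gamma * grad_f(x)
    iterates.append(x)

x_star = np.linalg.solve(Q, b)     # exact minimizer, for comparison
print(iterates[1], iterates[2], iterates[-1], x_star)
```

The first two printed vectors match x1 and x2 from the example after rounding, and the last iterate agrees with the exact minimizer obtained by solving the linear system.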
                                              7.1.1 Step-size
As mentioned earlier, choosing a good step-size is important in gradient
descent. (The step-size is also called the learning rate.) If the step-size is
too small, gradient descent can be slow. If the step-size is chosen too large,
gradient descent can overshoot, fail to converge, or even diverge. We will
discuss the use of momentum in the next section. It is a method that
smooths out erratic behavior of gradient updates and dampens oscillations.
   Adaptive gradient methods rescale the step-size at each iteration, de-
pending on local properties of the function. There are two simple heuris-
tics (Toussaint, 2012):
   When the function value increases after a gradient step, the step-size
   was too large. Undo the step and decrease the step-size.
   When the function value decreases, the step could have been larger. Try
   to increase the step-size.
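The two heuristics can be sketched as follows. The shrink factor 0.5, the growth factor 1.1, and the deliberately oversized initial step-size are assumptions made for this illustration; the text does not specify them. The objective is the quadratic from Example 7.1.

```python
import numpy as np

Q = np.array([[2.0, 1.0], [1.0, 20.0]])
b = np.array([5.0, 3.0])


def f(x):
    """Quadratic objective (7.7)."""
    return 0.5 * x @ Q @ x - b @ x


def grad_f(x):
    """Gradient (7.8) as a 1-D array."""
    return Q @ x - b


def adaptive_descent(x0, step=1.0, iters=200, shrink=0.5, grow=1.1):
    """Gradient descent with the undo/decrease and increase heuristics."""
    x, fx = x0, f(x0)
    for _ in range(iters):
        candidate = x - step * grad_f(x)
        f_candidate = f(candidate)
        if f_candidate > fx:
            # Value increased: the step was too large. Undo it and shrink.
            step *= shrink
        else:
            # Value did not increase: accept and try a larger step next.
            x, fx = candidate, f_candidate
            step *= grow
    return x, step


x_final, final_step = adaptive_descent(np.array([-3.0, -1.0]))
print(x_final, f(x_final), final_step)
```

Because a step is only accepted when the function value does not increase, the objective is non-increasing along the accepted iterates, even though the initial step-size is far too large for this problem.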
where α ∈ [0, 1]. Sometimes we will only know the gradient approxi-
mately. In such cases, the momentum term is useful since it averages out
different noisy estimates of the gradient. One particularly useful way to
obtain an approximate gradient is by using a stochastic approximation,
which we discuss next.
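A minimal sketch of gradient descent with momentum follows. It assumes the common form in which each update adds α times the previous change in x to the plain gradient step. The objective (the quadratic of Example 7.1), the step-size γ = 0.085, and the value α = 0.5 are illustrative choices.

```python
import numpy as np

Q = np.array([[2.0, 1.0], [1.0, 20.0]])
b = np.array([5.0, 3.0])


def grad_f(x):
    """Gradient of the quadratic from Example 7.1."""
    return Q @ x - b


def momentum_descent(x0, gamma=0.085, alpha=0.5, iters=200):
    """x_{i+1} = x_i - gamma * grad_f(x_i) + alpha * (x_i - x_{i-1})."""
    x = x0
    delta = np.zeros_like(x0)          # previous change; zero at the start
    for _ in range(iters):
        delta = alpha * delta - gamma * grad_f(x)
        x = x + delta
    return x


x_mom = momentum_descent(np.array([-3.0, -1.0]))
print(x_mom, np.linalg.solve(Q, b))
```

Setting alpha to 0 recovers plain gradient descent, which makes it easy to compare the two on the same problem.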
where x_n ∈ R^D are the training inputs, y_n are the training targets, and θ
are the parameters of the regression model.
  Standard gradient descent, as introduced previously, is a “batch” opti-
mization method, i.e., optimization is performed using the full training set
for a suitable step-size parameter γi . Evaluating the sum gradient may re-
quire expensive evaluations of the gradients from all individual functions
Ln . When the training set is enormous and/or no simple formulas exist,
evaluating the sums of gradients becomes very expensive.
   Consider the term \sum_{n=1}^{N} (∇Ln (θ i )) in (7.15). We can reduce the amount
of computation by taking a sum over a smaller set of Ln . In contrast to
batch gradient descent, which uses all Ln for n = 1, . . . , N , we randomly
choose a subset of Ln for mini-batch gradient descent. In the extreme
case, we randomly select only a single Ln to estimate the gradient. The
key insight about why taking a subset of data is sensible is to realize that
for gradient descent to converge, we only require that the gradient is an
unbiased estimate of the true gradient. In fact the term \sum_{n=1}^{N} (∇Ln (θ i ))
in (7.15) is an empirical estimate of the expected value (Section 6.4.1) of
the gradient. Therefore, any other unbiased empirical estimate of the ex-
pected value, for example using any subsample of the data, would suffice
for convergence of gradient descent.
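A minimal sketch of mini-batch gradient descent on synthetic least-squares data is given below. The squared-error loss used for each Ln, the data-generating parameters, the batch size, and the constant step-size are all assumptions made for this illustration. Gradients are averaged over the examples in the batch, so the mini-batch gradient is an unbiased estimate of the full-data average gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative assumption).
N, D = 1000, 3
theta_true = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(N, D))
y = X @ theta_true + 0.1 * rng.normal(size=N)


def mean_gradient(theta, idx):
    """Average gradient of L_n = 0.5 * (x_n^T theta - y_n)^2 over rows idx."""
    residual = X[idx] @ theta - y[idx]
    return X[idx].T @ residual / len(idx)


def minibatch_sgd(batch_size=32, step=0.05, iters=2000):
    theta = np.zeros(D)
    for _ in range(iters):
        # Uniform sampling without replacement keeps the mini-batch
        # average an unbiased estimate of the full-data average.
        idx = rng.choice(N, size=batch_size, replace=False)
        theta = theta - step * mean_gradient(theta, idx)
    return theta


theta_sgd = minibatch_sgd()
theta_batch = np.linalg.lstsq(X, y, rcond=None)[0]   # full-data solution
print(theta_sgd, theta_batch)
```

Each update touches only 32 of the 1000 examples, yet the result stays close to the full-data least-squares solution. With a constant step-size the iterates keep fluctuating slightly around it.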
Remark. When the learning rate decreases at an appropriate rate, and sub-
ject to relatively mild assumptions, stochastic gradient descent converges
almost surely to a local minimum (Bottou, 1998).                        ♦
   Why should one consider using an approximate gradient? A major rea-
son is practical implementation constraints, such as the size of central
processing unit (CPU)/graphics processing unit (GPU) memory or limits
on computational time. We can think of the size of the subset used to esti-
mate the gradient in the same way that we thought of the size of a sample
when estimating empirical means (Section 6.4.1). Large mini-batch sizes
will provide accurate estimates of the gradient, reducing the variance in
the parameter update. Furthermore, large mini-batches take advantage of
highly optimized matrix operations in vectorized implementations of the
cost and gradient. The reduction in variance leads to more stable conver-
gence, but each gradient calculation will be more expensive.
   In contrast, small mini-batches are quick to estimate. If we keep the
mini-batch size small, the noise in our gradient estimate will allow us to
get out of some bad local optima, which we may otherwise get stuck in.
In machine learning, optimization methods are used for training by min-
imizing an objective function on the training data, but the overall goal
is to improve generalization performance (Chapter 8). Since the goal in
machine learning does not necessarily need a precise estimate of the min-
imum of the objective function, approximate gradients using mini-batch
approaches have been widely used. Stochastic gradient descent is very
effective in large-scale machine learning problems (Bottou et al., 2018),
[Figure 7.4: Illustration of constrained optimization, with axes x1 and x2. The unconstrained problem (indicated by the contour lines) has a minimum on the right side (indicated by the circle). The box constraints (−1 ≤ x ≤ 1 and −1 ≤ y ≤ 1) require that the optimal solution is within the box, resulting in an optimal value indicated by the star.]
where f : R^D → R.
  In this section, we have additional constraints. That is, for real-valued
functions g_i : R^D → R for i = 1, . . . , m, we consider the constrained
optimization problem (see Figure 7.4 for an illustration)
                         \min_{x} f(x)                                                       (7.17)
This gives infinite penalty if the constraint is not satisfied, and hence
would provide the same solution. However, this infinite step function is
equally difficult to optimize. We can overcome this difficulty by introduc-
ing Lagrange multipliers. The idea of Lagrange multipliers is to replace the
step function with a linear function.
   We associate to problem (7.17) the Lagrangian by introducing the La-
grange multipliers λ_i ≥ 0 corresponding to each inequality constraint re-
spectively (Boyd and Vandenberghe, 2004, chapter 4) so that
                         L(x, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x)              (7.20a)
                                       = f(x) + \lambda^\top g(x) ,                         (7.20b)
where in the last line we have concatenated all constraints g_i(x) into a
vector g(x), and all the Lagrange multipliers into a vector λ ∈ R^m.
                         We now introduce the idea of Lagrangian duality. In general, duality
                      in optimization is the idea of converting an optimization problem in one
                      set of variables x (called the primal variables), into another optimization
                      problem in a different set of variables λ (called the dual variables). We
                      introduce two different approaches to duality: In this section, we discuss
                      Lagrangian duality; in Section 7.3.3, we discuss Legendre-Fenchel duality.
Remark. In the discussion of Definition 7.1, we use two concepts that are
also of independent interest (Boyd and Vandenberghe, 2004).
   First is the minimax inequality, which says that for any function with
two arguments ϕ(x, y), the maximin is less than or equal to the minimax, i.e.,
Note that taking the maximum over y of the left-hand side of (7.24) main-
tains the inequality since the inequality is true for all y . Similarly, we can
take the minimum over x of the right-hand side of (7.24) to obtain (7.23).
   The second concept is weak duality, which uses (7.23) to show that
primal values are always greater than or equal to dual values. This is de-
scribed in more detail in (7.27).                                             ♦
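The minimax inequality is easy to check numerically on a finite grid. In the sketch below, a table of random values stands in for ϕ(x, y), with rows indexed by x and columns by y; the table is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# phi[i, j] plays the role of phi(x_i, y_j) on a finite grid
# (random values, purely illustrative).
phi = rng.normal(size=(50, 40))

maximin = phi.min(axis=0).max()   # max over y of (min over x)
minimax = phi.max(axis=1).min()   # min over x of (max over y)
print(maximin, minimax, maximin <= minimax)
```

The inequality holds for any table of values, not only for this random draw, which mirrors the argument given in the remark.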
   Recall that the difference between J(x) in (7.18) and the Lagrangian
in (7.20b) is that we have relaxed the indicator function to a linear func-
tion. Therefore, when λ ≥ 0, the Lagrangian L(x, λ) is a lower bound of
J(x). Hence, the maximum of L(x, λ) with respect to λ is
                         J(x) = \max_{\lambda \geq 0} L(x, \lambda) .                  (7.25)
This is also known as weak duality. Note that the inner part of the right-
hand side is the dual objective function D(λ) and the definition follows.
   In contrast to the original optimization problem, which has constraints,
min_{x ∈ R^d} L(x, λ) is an unconstrained optimization problem for a given
value of λ. If solving min_{x ∈ R^d} L(x, λ) is easy, then the overall problem is
easy to solve. We can see this by observing from (7.20b) that L(x, λ) is
affine with respect to λ. Therefore min_{x ∈ R^d} L(x, λ) is a pointwise min-
imum of affine functions of λ, and hence D(λ) is concave even though
f (·) and g_i (·) may be nonconvex. The outer problem, maximization over
λ, is the maximum of a concave function and can be efficiently computed.
   Assuming f (·) and gi (·) are differentiable, we find the Lagrange dual
problem by differentiating the Lagrangian with respect to x, setting the
differential to zero, and solving for the optimal value. We will discuss two
concrete examples in Sections 7.3.1 and 7.3.2, where f (·) and gi (·) are
convex.
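As a small worked illustration (the problem is an assumption made here, not one of the examples in Sections 7.3.1 and 7.3.2), consider minimizing f(x) = x^2 subject to a constraint of the form g(x) ≤ 0 with g(x) = 1 − x. Setting the derivative of L(x, λ) = x^2 + λ(1 − x) with respect to x to zero gives x = λ/2, so that D(λ) = λ − λ^2/4. The sketch evaluates D(λ) on a grid, checks that it is concave, and compares it with the primal optimum f(1) = 1.

```python
import numpy as np


def f(x):
    return x**2


def g(x):
    return 1.0 - x          # constraint g(x) <= 0, i.e. x >= 1


def dual(lam):
    """D(lam) = min_x L(x, lam); the minimizer x = lam / 2 is analytic here."""
    x = lam / 2.0
    return f(x) + lam * g(x)


lams = np.linspace(0.0, 5.0, 501)
dual_values = dual(lams)

primal_optimum = f(1.0)                   # constrained minimum at x = 1
best_lam = lams[np.argmax(dual_values)]
second_diff = np.diff(dual_values, n=2)   # <= 0 for a concave function

print(best_lam, dual_values.max(), primal_optimum)
```

Every dual value is a lower bound on the primal optimum, and in this convex example the best lower bound, attained at λ = 2, coincides with it.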
Remark (Equality Constraints). Consider (7.17) with additional equality
constraints
                         \min_{x} f(x)
[Figure: plot of y = 3x^2 − 5x + 2 for x from −3 to 3.]
Example 7.3
The negative entropy f (x) = x log2 x is convex for x > 0. A visualization
of the function is shown in Figure 7.8, and we can see that the function is
convex. To illustrate the previous definitions of convexity, let us check the
calculations for two points x = 2 and x = 4. Note that to prove convexity
of f (x) we would need to check all pairs of points in the domain x > 0.
   Recall Definition 7.3. Consider a point midway between the two points
(that is θ = 0.5); then the left-hand side is f (0.5 · 2 + 0.5 · 4) = 3 log2 3 ≈
4.75. The right-hand side is 0.5(2 log2 2) + 0.5(4 log2 4) = 1 + 4 = 5. And
therefore the definition is satisfied.
   Since f (x) is differentiable, we can alternatively use (7.31). Calculating
the derivative of f (x), we obtain
      \nabla_x (x \log_2 x) = 1 \cdot \log_2 x + x \cdot \frac{1}{x \log_e 2} = \log_2 x + \frac{1}{\log_e 2} .             (7.32)
Using the same two test points x = 2 and x = 4, the left-hand side of
(7.31) is given by f (4) = 8. The right-hand side is
      f(x) + \nabla_x^\top (y - x) = f(2) + \nabla f(2) \cdot (4 - 2)                              (7.33a)
                                   = 2 + \left( 1 + \frac{1}{\log_e 2} \right) \cdot 2 \approx 6.9 .                         (7.33b)
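The numbers in Example 7.3 can be reproduced with the standard library. This sketch evaluates both sides of Definition 7.3 at θ = 0.5 and both sides of the first-order condition (7.31) for the test points x = 2 and y = 4; the function and variable names are chosen here for illustration.

```python
import math


def f(x):
    """Negative entropy x * log2(x), defined for x > 0."""
    return x * math.log2(x)


def df(x):
    """Derivative (7.32): log2(x) + 1 / ln(2)."""
    return math.log2(x) + 1.0 / math.log(2.0)


x, y, theta = 2.0, 4.0, 0.5

# Definition 7.3: f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y).
lhs_def = f(theta * x + (1 - theta) * y)
rhs_def = theta * f(x) + (1 - theta) * f(y)

# First-order condition (7.31): f(y) >= f(x) + f'(x) * (y - x).
lhs_grad = f(y)
rhs_grad = f(x) + df(x) * (y - x)

print(round(lhs_def, 2), rhs_def, lhs_grad, round(rhs_grad, 1))
```

Checking two points only illustrates the definitions. It does not prove convexity, which requires the inequalities to hold for all points in the domain.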
[Figure 7.8: f(x) plotted against x, for x from 0 to 5.]
Example 7.4
A nonnegative weighted sum of convex functions is convex. Observe that
if f is a convex function, and α ≥ 0 is a nonnegative scalar, then the
function αf is convex. We can see this by multiplying both sides of the
equation in Definition 7.3 by α, and recalling that multiplying by a
nonnegative number does not change the inequality.
   If f1 and f2 are convex functions, then we have by the definition
                  f1 (θx + (1 − θ)y) ≤ θf1 (x) + (1 − θ)f1 (y)                (7.34)
                  f2 (θx + (1 − θ)y) ≤ θf2 (x) + (1 − θ)f2 (y) .              (7.35)
Summing up both sides gives us
        f1 (θx + (1 − θ)y) + f2 (θx + (1 − θ)y)
         ≤ θf1 (x) + (1 − θ)f1 (y) + θf2 (x) + (1 − θ)f2 (y) ,                   (7.36)
where the right-hand side can be rearranged to
                  θ(f1 (x) + f2 (x)) + (1 − θ)(f1 (y) + f2 (y)) ,             (7.37)
completing the proof that the sum of convex functions is convex.
   Combining the preceding two facts, we see that αf1 (x) + βf2 (x) is
convex for α, β ≥ 0. This closure property can be extended using a sim-
ilar argument for nonnegative weighted sums of more than two convex
functions.
Remark. The inequality in (7.30) is sometimes called Jensen’s inequality.
In fact, a whole class of inequalities for taking nonnegative weighted sums
of convex functions are all called Jensen’s inequality.                   ♦
  In summary, a constrained optimization problem is called a convex opti-
mization problem if
                           \min_{x} f(x)
                                 subject to       Ax ≤ b ,
where A ∈ R^{m×d} and b ∈ R^m. This is known as a linear program. It has d
variables and m linear constraints. (Linear programs are one of the most
widely used approaches in industry.) The Lagrangian is given by
                           L(x, \lambda) = c^\top x + \lambda^\top (Ax - b) ,                  (7.40)
where λ ∈ R^m is the vector of non-negative Lagrange multipliers. Rear-
ranging the terms corresponding to x yields
                           L(x, \lambda) = (c + A^\top \lambda)^\top x - \lambda^\top b .      (7.41)
Taking the derivative of L(x, λ) with respect to x and setting it to zero
gives us
                           c + A^\top \lambda = 0 .                                   (7.42)
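For a concrete check, the sketch below uses a tiny linear program whose data (c, A, b) are assumptions chosen so that the solution is obvious by inspection: minimize −x1 − x2 subject to x1 ≤ 1 and x2 ≤ 1. With λ = [1, 1]^\top, condition (7.42) holds, and by (7.41) the Lagrangian no longer depends on x.

```python
import numpy as np

# Illustrative linear program: minimize c^T x subject to A x <= b.
c = np.array([-1.0, -1.0])
A = np.eye(2)
b = np.array([1.0, 1.0])


def lagrangian(x, lam):
    """L(x, lam) = c^T x + lam^T (A x - b), as in (7.40)."""
    return c @ x + lam @ (A @ x - b)


lam = np.array([1.0, 1.0])            # satisfies c + A^T lam = 0 with lam >= 0
stationarity = c + A.T @ lam          # left-hand side of (7.42)

rng = np.random.default_rng(2)
xs = rng.normal(size=(5, 2))          # arbitrary points, feasible or not
values = np.array([lagrangian(x, lam) for x in xs])

x_opt = np.array([1.0, 1.0])          # primal optimum, found by inspection
print(stationarity, values, c @ x_opt)
```

For this choice of λ the Lagrangian takes the same value, equal to the negative of λ^\top b, at every x, and that value matches the primal objective at the optimum.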
[Figure: axes x1 (from 0 to 16) and x2. Surviving caption fragment: "... by the contour lines) has a minimum on the right side. The optimal value given ..."]
                         Note that the preceding convex conjugate definition does not need the
                       function f to be convex nor differentiable. In Definition 7.4, we have used
                       a general inner product (Section 3.2) but in the rest of this section we
Example 7.8
In machine learning, we often use sums of functions; for example, the objective
function of the training set includes a sum of the losses for each example
in the training set. In the following, we derive the convex conjugate
of a sum of losses $\ell(t)$, where $\ell : \mathbb{R} \to \mathbb{R}$. This also illustrates the application
of the convex conjugate to the vector case. Let $L(t) = \sum_{i=1}^{n} \ell_i(t_i)$.
Then,
\begin{align}
L^*(z) &= \sup_{t \in \mathbb{R}^n} \langle z, t \rangle - \sum_{i=1}^{n} \ell_i(t_i) \tag{7.63a} \\
&= \sup_{t \in \mathbb{R}^n} \sum_{i=1}^{n} \big( z_i t_i - \ell_i(t_i) \big) && \text{definition of dot product} \tag{7.63b} \\
&= \sum_{i=1}^{n} \sup_{t \in \mathbb{R}^n} \big( z_i t_i - \ell_i(t_i) \big) \tag{7.63c} \\
&= \sum_{i=1}^{n} \ell_i^*(z_i) . && \text{definition of conjugate} \tag{7.63d}
\end{align}
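The identity in (7.63) can be illustrated numerically. The sketch below assumes two hypothetical quadratic losses, ℓ_i(t) = a_i t²/2 with a_i > 0, whose conjugate z²/(2 a_i) is known in closed form, and it approximates each supremum by a maximum over a finite grid. The losses, the grid and the variable names are assumptions made for illustration.

```python
import numpy as np

a = np.array([1.0, 2.0])              # assumed curvatures of the two losses
z = np.array([0.5, -1.5])             # point at which the conjugate is evaluated
grid = np.linspace(-10.0, 10.0, 401)  # finite grid standing in for the real line


def loss(i, t):
    """Hypothetical loss l_i(t) = 0.5 * a_i * t^2."""
    return 0.5 * a[i] * t**2


# Left-hand side (7.63a): joint supremum over a two-dimensional grid of t.
T1, T2 = np.meshgrid(grid, grid, indexing="ij")
joint = np.max(z[0] * T1 + z[1] * T2 - loss(0, T1) - loss(1, T2))

# Right-hand side (7.63d): sum of one-dimensional conjugates.
separate = sum(np.max(z[i] * grid - loss(i, grid)) for i in range(2))

# Closed-form conjugate of the assumed quadratic losses.
closed_form = np.sum(z**2 / (2 * a))

print(joint, separate, closed_form)
```

The joint maximum and the sum of coordinate-wise maxima agree because the objective separates across coordinates, which is exactly the step from (7.63b) to (7.63c).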
Example 7.9
Let f (y) and g(x) be convex functions, and A a real matrix of appropriate
dimensions such that Ax = y . Then
\[ \min_{x} f(Ax) + g(x) = \min_{Ax = y} f(y) + g(x) . \tag{7.64} \]
where the last step of swapping max and min is due to the fact that f (y)
and g(x) are convex functions. By splitting up the dot product term and
collecting x and y ,
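\begin{align}
& \max_{u} \min_{x, y} f(y) + g(x) + (Ax - y)^\top u \tag{7.66a} \\
&= \max_{u} \Big[ \min_{y} -y^\top u + f(y) + \min_{x} (Ax)^\top u + g(x) \Big] \tag{7.66b} \\
&= \max_{u} \Big[ \min_{y} -y^\top u + f(y) + \min_{x} x^\top A^\top u + g(x) \Big] \tag{7.66c}
\end{align}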
 Recall the convex conjugate (Definition 7.4) and the fact that dot products
are symmetric. (For general inner products, $A^\top$ is replaced by the adjoint $A^*$.)
\begin{align}
& \max_{u} \Big[ \min_{y} -y^\top u + f(y) + \min_{x} x^\top A^\top u + g(x) \Big] \tag{7.67a} \\
&= \max_{u} -f^*(u) - g^*(-A^\top u) . \tag{7.67b}
\end{align}
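As a concrete instance of (7.64) and (7.67b), the sketch below compares the primal and dual optimal values for one particular pair of convex functions. The choices f(y) = ½‖y − b‖² and g(x) = (λ/2)‖x‖² are assumptions made for illustration, as are the conjugates f*(u) = ½‖u‖² + bᵀu and g*(v) = ‖v‖²/(2λ), which follow from Definition 7.4 under the standard dot product. The random data and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))
b = rng.normal(size=5)
lam = 0.7                         # assumed regularization weight in g


def primal(x):
    """f(Ax) + g(x) with f(y) = 0.5*||y - b||^2 and g(x) = 0.5*lam*||x||^2."""
    return 0.5 * np.sum((A @ x - b) ** 2) + 0.5 * lam * np.sum(x**2)


def dual(u):
    """-f*(u) - g*(-A^T u), the objective in (7.67b) for the assumed f and g."""
    f_conj = 0.5 * np.sum(u**2) + b @ u
    v = -A.T @ u
    g_conj = np.sum(v**2) / (2 * lam)
    return -f_conj - g_conj


# Both problems are quadratic, so their optima solve linear systems.
x_star = np.linalg.solve(A.T @ A + lam * np.eye(3), A.T @ b)
u_star = -np.linalg.solve(np.eye(5) + A @ A.T / lam, b)

print(primal(x_star), dual(u_star))
```

For this pair of functions the two printed values coincide, and evaluating `dual` at any other u gives a value no larger than `primal` at any x.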
                                         Exercises
7.1   Consider the univariate function
      $f(x) = x^3 + 6x^2 - 3x - 5$.
      Find its stationary points and indicate whether they are maximum, mini-
      mum, or saddle points.
7.2   Consider the update equation for stochastic gradient descent (Equation (7.15)).
      Write down the update when we use a mini-batch size of one.
7.3   Consider whether the following statements are true or false:
       a. The intersection of any two convex sets is convex.
       b. The union of any two convex sets is convex.
       c. The difference of a convex set A from another convex set B is convex.
7.4   Consider whether the following statements are true or false:
       a.   The sum of any two convex functions is convex.
       b.   The difference of any two convex functions is convex.
       c.   The product of any two convex functions is convex.
       d.   The maximum of any two convex functions is convex.
7.5   Express the following optimization problem as a standard linear program in
      matrix notation
\[ \max_{x \in \mathbb{R}^2,\ \xi \in \mathbb{R}} \; p^\top x + \xi \]
     Derive the convex conjugate function f ∗ (s), by assuming the standard dot
     product.
     Hint: Take the gradient of an appropriate function and set the gradient to zero.
7.10 Consider the function
\[ f(x) = \frac{1}{2} x^\top A x + b^\top x + c , \]
     where A is strictly positive definite, which means that it is invertible. Derive
     the convex conjugate of f (x).
     Hint: Take the gradient of an appropriate function and set the gradient to zero.
7.11 The hinge loss (which is the loss used by the support vector machine) is
     given by
                                   L(α) = max{0, 1 − α} ,
This material is published by Cambridge University Press as Mathematics for Machine Learning by
Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view
and download for personal use only. Not for re-distribution, re-sale, or use in derivative works.
 c
by  M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021. https://mml-book.com.
8       When Models Meet Data
In the first part of the book, we introduced the mathematics that form
the foundations of many machine learning methods. The hope is that a
reader would be able to learn the rudimentary forms of the language of
mathematics from the first part, which we will now use to describe and
discuss machine learning. The second part of the book introduces four
pillars of machine learning:
   Regression (Chapter 9)
   Dimensionality reduction (Chapter 10)
   Density estimation (Chapter 11)
   Classification (Chapter 12)
The main aim of this part of the book is to illustrate how the mathematical
concepts introduced in the first part of the book can be used to design
machine learning algorithms that can be used to solve tasks within the
remit of the four pillars. We do not intend to introduce advanced machine
learning concepts, but instead to provide a set of practical methods that
allow the reader to apply the knowledge they gained from the first part
of the book. It also provides a gateway to the wider machine learning
literature for readers already familiar with the mathematics.
Table 8.1 Example data from a fictitious human resource database that is not in a numerical format.

Name        Gender   Degree   Postcode   Age   Annual salary
Aditya      M        MSc      W21BG      36    89563
Bob         M        PhD      EC1A1BA    47    123543
Chloé       F        BEcon    SW1A1BH    26    23989
Daisuke     M        BSc      SE207AT    68    138769
Elisabeth   F        MBA      SE10AA     33    113888

                     used to talk about machine learning models. By doing so, we briefly out-
                     line the current best practices for training a model such that the resulting
                     predictor does well on data that we have not yet seen.
                        As mentioned in Chapter 1, there are two different senses in which we
                     use the phrase “machine learning algorithm”: training and prediction. We
                     will describe these ideas in this chapter, as well as the idea of selecting
                     among different models. We will introduce the framework of empirical
                     risk minimization in Section 8.2, the principle of maximum likelihood in
                     Section 8.3, and the idea of probabilistic models in Section 8.4. We briefly
                     outline a graphical language for specifying probabilistic models in Sec-
                     tion 8.5 and finally discuss model selection in Section 8.6. The rest of this
                     section expands upon the three main components of machine learning:
                     data, models and learning.
1. Prediction or inference
2. Training or parameter estimation
3. Hyperparameter tuning or model selection
the model to only fit the training data well, the predictor needs to perform
well on unseen data. We simulate the behavior of our predictor on
future unseen data using cross-validation (Section 8.2.4). As we will see
in this chapter, to achieve the goal of performing well on unseen data,
we will need to balance between fitting well on training data and finding
“simple” explanations of the phenomenon. This trade-off is achieved using
regularization (Section 8.2.3) or by adding a prior (Section 8.3.2). In
philosophy, this is considered to be neither induction nor deduction, but
is called abduction. According to the Stanford Encyclopedia of Philosophy,
abduction is the process of inference to the best explanation (Douven,
2017). (A good movie title is “AI abduction”.)
   We often need to make high-level modeling decisions about the structure
of the predictor, such as the number of components to use or the
class of probability distributions to consider. The choice of the number of
components is an example of a hyperparameter, and this choice can affect
the performance of the model significantly. The problem of choosing
among different models is called model selection, which we describe in
Section 8.6. For non-probabilistic models, model selection is often done
using nested cross-validation, which is described in Section 8.6.1. We also
use model selection to choose hyperparameters of our model.
                        Remark. The distinction between parameters and hyperparameters is some-
                        what arbitrary, and is mostly driven by the distinction between what can
                        be numerically optimized versus what needs to use search techniques.
                        Another way to consider the distinction is to consider parameters as the
                        explicit parameters of a probabilistic model, and to consider hyperparam-
                        eters (higher-level parameters) as parameters that control the distribution
                        of these explicit parameters.                                             ♦
                           In the following sections, we will look at three flavors of machine learn-
                        ing: empirical risk minimization (Section 8.2), the principle of maximum
                        likelihood (Section 8.3), and probabilistic modeling (Section 8.4).
Section 8.2.1 What is the set of functions we allow the predictor to take?
Section 8.2.2 How do we measure how well the predictor performs on
  the training data?
Section 8.2.3 How do we construct predictors from only training data
  that perform well on unseen test data?
Section 8.2.4 What is the procedure for searching over the space of mod-
  els?
Example 8.1
We introduce the problem of ordinary least-squares regression to illustrate
empirical risk minimization. A more comprehensive account of regression
is given in Chapter 9. When the label yn is real-valued, a popular choice
of function class for predictors is the set of affine functions. (Affine functions
are often referred to as linear functions in machine learning.) We choose a
more compact notation for an affine function by concatenating an additional
unit feature $x^{(0)} = 1$ to $x_n$, i.e., $x_n = [1, x_n^{(1)}, x_n^{(2)}, \ldots, x_n^{(D)}]^\top$. The
parameter vector is correspondingly $\theta = [\theta_0, \theta_1, \theta_2, \ldots, \theta_D]^\top$, allowing us
to write the predictor as a linear function
\[ f(x_n, \theta) = \theta^\top x_n . \tag{8.4} \]
This linear predictor is equivalent to the affine model
\[ f(x_n, \theta) = \theta_0 + \sum_{d=1}^{D} \theta_d x_n^{(d)} . \tag{8.5} \]
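The equivalence of (8.4) and (8.5) is easy to see on a small example. The following sketch uses NumPy with a hypothetical feature vector of dimension D = 3 and hypothetical parameter values; none of the numbers come from the text.

```python
import numpy as np

x_raw = np.array([2.0, -1.0, 0.5])         # hypothetical features x^(1), ..., x^(D)
theta = np.array([0.3, 1.0, -2.0, 4.0])    # [theta_0, theta_1, ..., theta_D]

# Concatenate the unit feature x^(0) = 1, as described above.
x_aug = np.concatenate(([1.0], x_raw))

linear_form = theta @ x_aug                            # (8.4)
affine_form = theta[0] + np.sum(theta[1:] * x_raw)     # (8.5)

print(linear_form, affine_form)
```

Absorbing the intercept into the parameter vector means that later derivations only need to handle a single inner product.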
where $\hat{y}_n = f(x_n, \theta)$. Equation (8.6) is called the empirical risk and depends
on three arguments, the predictor $f$ and the data $X$, $y$. This general
strategy for learning is called empirical risk minimization.
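A minimal sketch of this idea follows, assuming that the empirical risk is the average of a per-example loss between the labels and the predictions. The squared loss, the linear predictor, the tiny dataset and the function names are all illustrative assumptions.

```python
import numpy as np


def empirical_risk(predictor, theta, X, y, loss):
    """Average loss of the predictions y_hat_n = predictor(x_n, theta) over the data."""
    y_hat = np.array([predictor(x_n, theta) for x_n in X])
    return float(np.mean(loss(y, y_hat)))


def linear_predictor(x_n, theta):
    """Linear predictor theta^T x_n, with a unit feature already in x_n."""
    return theta @ x_n


def squared_loss(y, y_hat):
    """Assumed loss function; other choices are possible."""
    return (y - y_hat) ** 2


# Hypothetical data generated exactly by y = 1 + 2 * x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

risk_good = empirical_risk(linear_predictor, np.array([1.0, 2.0]), X, y, squared_loss)
risk_bad = empirical_risk(linear_predictor, np.array([0.0, 0.0]), X, y, squared_loss)
print(risk_good, risk_bad)
```

Empirical risk minimization then amounts to searching over theta for the smallest value of this function on the training data.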
where $y$ is the label and $f(x)$ is the prediction based on the example $x$.
The notation $R_{\text{true}}(f)$ indicates that this is the true risk if we had access to
an infinite amount of data. The expectation is over the (infinite) set of all
possible data and labels. (Another phrase commonly used for expected risk is
“population risk”.) There are two practical questions that arise from
our desire to minimize expected risk, which we address in the following
two subsections:
  The regularization term is sometimes called the penalty term, which biases
the vector θ to be closer to the origin. The idea of regularization also
appears in probabilistic models as the prior probability of the parameters.
Recall from Section 6.6 that for the posterior distribution to be of the same
form as the prior distribution, the prior and the likelihood need to be con-
jugate. We will revisit this idea in Section 8.3.2. We will see in Chapter 12
that the idea of the regularizer is equivalent to the idea of a large margin.
\[ \mathbb{E}_{\mathcal{V}}[R(f, \mathcal{V})] \approx \frac{1}{K} \sum_{k=1}^{K} R(f^{(k)}, \mathcal{V}^{(k)}) , \tag{8.13} \]
where $R(f^{(k)}, \mathcal{V}^{(k)})$ is the risk (e.g., RMSE) on the validation set $\mathcal{V}^{(k)}$ for
predictor $f^{(k)}$. The approximation has two sources: first, due to the finite
training set, which results in not the best possible $f^{(k)}$; and second, due to
the finite validation set, which results in an inaccurate estimation of the
risk $R(f^{(k)}, \mathcal{V}^{(k)})$. A potential disadvantage of $K$-fold cross-validation is
                      the computational cost of training the model K times, which can be bur-
                      densome if the training cost is computationally expensive. In practice, it
                      is often not sufficient to look at the direct parameters alone. For example,
                      we need to explore multiple complexity parameters (e.g., multiple regu-
                      larization parameters), which may not be direct parameters of the model.
                      Evaluating the quality of the model, depending on these hyperparameters,
                      may result in a number of training runs that is exponential in the number
                      of model parameters. One can use nested cross-validation (Section 8.6.1)
                      to search for good hyperparameters.
   However, cross-validation is an embarrassingly parallel problem, i.e., little
effort is needed to separate the problem into a number of parallel
tasks. Given sufficient computing resources (e.g., cloud computing, server
farms), cross-validation does not require longer than a single performance
assessment.
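The estimate in (8.13) can be sketched in a few lines. The example below assumes NumPy, a least-squares linear predictor, RMSE as the risk and synthetic data; the function name `k_fold_rmse` and every numerical setting are hypothetical. The loop runs sequentially here, but each iteration is independent of the others, which is what makes the procedure easy to parallelize.

```python
import numpy as np


def k_fold_rmse(X, y, K=5, seed=0):
    """Average validation RMSE over K folds, in the spirit of (8.13)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))          # shuffle before splitting
    folds = np.array_split(perm, K)         # K disjoint validation sets
    risks = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        # Train predictor f^(k) on all folds except the k-th.
        theta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        # Evaluate the risk R(f^(k), V^(k)) on the held-out fold.
        residual = y[val] - X[val] @ theta
        risks.append(float(np.sqrt(np.mean(residual**2))))
    return float(np.mean(risks)), risks


# Synthetic data: a line with Gaussian noise of standard deviation 0.5.
rng = np.random.default_rng(42)
N = 100
X = np.column_stack([np.ones(N), rng.uniform(0, 10, size=N)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=N)

cv_risk, fold_risks = k_fold_rmse(X, y, K=5)
print(cv_risk, fold_risks)
```

With a correctly specified model, the cross-validated RMSE should land near the noise standard deviation used to generate the data.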
                         In this section, we saw that empirical risk minimization is based on the
                      following concepts: the hypothesis class of functions, the loss function and
                      regularization. In Section 8.3, we will see the effect of using a probability
                      distribution to replace the idea of loss functions and regularization.
                  The notation Lx (θ) emphasizes the fact that the parameter θ is varying
                  and the data x is fixed. We very often drop the reference to x when writing
                  the negative log-likelihood, as it is really a function of θ , and write it as
                  L(θ) when the random variable representing the uncertainty in the data
                  is clear from the context.
                     Let us interpret what the probability density p(x | θ) is modeling for a
                  fixed value of θ . It is a distribution that models the uncertainty of the data
                  for a given parameter setting. For a given dataset x, the likelihood allows
                  us to express preferences about different settings of the parameters θ , and
                  we can choose the setting that more “likely” has generated the data.
                     In a complementary view, if we consider the data to be fixed (because
                  it has been observed), and we vary the parameters θ , what does L(θ) tell
                  us? It tells us how likely a particular setting of θ is for the observations x.
                  Based on this second view, the maximum likelihood estimator gives us the
                  most likely parameter θ for the set of data.
   We consider the supervised learning setting, where we obtain pairs
$(x_1, y_1), \ldots, (x_N, y_N)$ with $x_n \in \mathbb{R}^D$ and labels $y_n \in \mathbb{R}$. We are interested
in constructing a predictor that takes a feature vector $x_n$ as input
and produces a prediction $y_n$ (or something close to it), i.e., given a vector
$x_n$ we want the probability distribution of the label $y_n$. In other words,
we specify the conditional probability distribution of the labels given the
examples for the particular parameter setting $\theta$.
                  Example 8.4
The first example that is often used is to specify that the conditional
probability of the labels given the examples is a Gaussian distribution. In
other words, we assume that we can explain our observation uncertainty
by independent Gaussian noise (refer to Section 6.5) with zero mean,
$\varepsilon_n \sim \mathcal{N}(0, \sigma^2)$. We further assume that the linear model $x_n^\top \theta$ is used for
prediction. This means we specify a Gaussian likelihood for each example
label pair $(x_n, y_n)$,
\[ p(y_n \mid x_n, \theta) = \mathcal{N}(y_n \mid x_n^\top \theta, \sigma^2) . \tag{8.15} \]
An illustration of a Gaussian likelihood for a given parameter $\theta$ is shown
in Figure 8.3. We will see in Section 9.2 how to explicitly expand the
preceding expression out in terms of the Gaussian distribution.
  We assume that the set of examples $(x_1, y_1), \ldots, (x_N, y_N)$ are independent
and identically distributed (i.i.d.). The word “independent” (Section 6.4.5)
implies that the likelihood involving the whole dataset ($\mathcal{Y} = \{y_1, \ldots, y_N\}$
and $\mathcal{X} = \{x_1, \ldots, x_N\}$) factorizes into a product of the likelihoods of
It is tempting to interpret θ, which appears to the right of the conditioning
bar in p(yn | xn , θ) (8.15), as observed and fixed, but this interpretation
is incorrect. The negative log-likelihood L(θ) is a function of θ. Therefore,
to find a good parameter vector θ that explains the data
(x1 , y1 ), . . . , (xN , yN ) well, we minimize the negative log-likelihood
L(θ) with respect to θ.
Remark. The negative sign in (8.17) is a historical artifact that is due
to the convention that we want to maximize likelihood, but numerical
optimization literature tends to study minimization of functions.     ♦
Example 8.5
Continuing on our example of Gaussian likelihoods (8.15), the negative
log-likelihood can be rewritten as
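\begin{align}
L(\theta) &= -\sum_{n=1}^{N} \log p(y_n \mid x_n, \theta) = -\sum_{n=1}^{N} \log \mathcal{N}(y_n \mid x_n^\top \theta, \sigma^2) \tag{8.18a} \\
&= -\sum_{n=1}^{N} \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_n - x_n^\top \theta)^2}{2\sigma^2} \right) \right] \tag{8.18b} \\
&= -\sum_{n=1}^{N} \log \exp\left( -\frac{(y_n - x_n^\top \theta)^2}{2\sigma^2} \right) - \sum_{n=1}^{N} \log \frac{1}{\sqrt{2\pi\sigma^2}} \tag{8.18c} \\
&= \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - x_n^\top \theta)^2 - \sum_{n=1}^{N} \log \frac{1}{\sqrt{2\pi\sigma^2}} . \tag{8.18d}
\end{align}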
As σ is given, the second term in (8.18d) is constant, and minimizing L(θ)
corresponds to solving the least-squares problem (compare with (8.8))
expressed in the first term.
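The chain (8.18a) to (8.18d) can be verified numerically, together with the claim that the least-squares solution minimizes L(θ). The sketch below assumes NumPy and synthetic data; the sample size, noise level, seed and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 50, 0.3                          # assumed sample size and known noise level
X = np.column_stack([np.ones(N), rng.normal(size=N)])
theta_true = np.array([0.5, -1.0])
y = X @ theta_true + rng.normal(scale=sigma, size=N)


def nll(theta):
    """Negative log-likelihood as in (8.18a), summing Gaussian log-densities."""
    mean = X @ theta
    log_density = -0.5 * np.log(2 * np.pi * sigma**2) - (y - mean) ** 2 / (2 * sigma**2)
    return -np.sum(log_density)


def nll_split(theta):
    """Form (8.18d): a least-squares term plus a term that is constant in theta."""
    least_squares = np.sum((y - X @ theta) ** 2) / (2 * sigma**2)
    constant = -N * np.log(1.0 / np.sqrt(2 * np.pi * sigma**2))
    return least_squares + constant


# The least-squares solution should also minimize the negative log-likelihood.
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(nll(theta_true), nll_split(theta_true))
print(nll(theta_ls) <= nll(theta_true))
```

Changing σ rescales the first term and shifts the constant, but it does not move the minimizer, which is why σ can be treated as given in this example.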
Figure 8.6 Comparing the predictions with the maximum likelihood estimate and the MAP estimate at x = 60. The prior biases the slope to be less steep and the intercept to be closer to zero. In this example, the bias that moves the intercept closer to zero actually increases the slope. [Plot of y against x; legend: MLE, MAP.]
\[ p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)} . \tag{8.19} \]
Recall that we are interested in finding the parameter θ that maximizes
the posterior. Since the distribution p(x) does not depend on θ , we can
ignore the value of the denominator for the optimization and obtain
                                p(θ | x) ∝ p(x | θ)p(θ) .                              (8.20)
The preceding proportion relation hides the density of the data p(x),
which may be difficult to estimate. Instead of estimating the minimum
of the negative log-likelihood, we now estimate the minimum of the negative
log-posterior, which is referred to as maximum a posteriori estimation
(MAP estimation). An illustration of the effect of adding a zero-mean
Gaussian prior is shown in Figure 8.6.
Example 8.6
In addition to the assumption of Gaussian likelihood in the previous example,
we assume that the parameter vector is distributed as a multivariate
Gaussian with zero mean, i.e., $p(\theta) = \mathcal{N}(0, \Sigma)$, where $\Sigma$ is the covariance
matrix (Section 6.5). Note that the conjugate prior of a Gaussian
is also a Gaussian (Section 6.6.1), and therefore we expect the posterior
distribution to also be a Gaussian. We will see the details of maximum a
posteriori estimation in Chapter 9.
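Ahead of the full treatment, a minimal sketch can show how the zero-mean Gaussian prior changes the estimate. It assumes the Gaussian likelihood (8.15) with known σ, so that the negative log-posterior is, up to a constant, a least-squares term plus ½ θᵀΣ⁻¹θ, and its minimizer solves a linear system. The synthetic data, the isotropic prior covariance and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N, sigma = 20, 1.0                          # assumed sample size and noise level
X = np.column_stack([np.ones(N), rng.uniform(0, 5, size=N)])
y = X @ np.array([2.0, 1.5]) + rng.normal(scale=sigma, size=N)
Sigma = 0.5 * np.eye(2)                     # assumed prior covariance


def neg_log_posterior(theta):
    """Negative log-posterior up to an additive constant that is independent of theta."""
    nll = np.sum((y - X @ theta) ** 2) / (2 * sigma**2)
    neg_log_prior = 0.5 * theta @ np.linalg.solve(Sigma, theta)
    return nll + neg_log_prior


def grad_neg_log_posterior(theta):
    """Gradient of the function above with respect to theta."""
    return -X.T @ (y - X @ theta) / sigma**2 + np.linalg.solve(Sigma, theta)


# Maximum likelihood estimate: ordinary least squares.
theta_mle = np.linalg.solve(X.T @ X, X.T @ y)
# MAP estimate: set the gradient of the negative log-posterior to zero.
theta_map = np.linalg.solve(X.T @ X / sigma**2 + np.linalg.inv(Sigma), X.T @ y / sigma**2)

print(theta_mle, theta_map)
```

With an isotropic prior such as this one, the MAP estimate has a smaller norm than the maximum likelihood estimate, which matches the description of the prior as pulling the parameters toward zero.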
[Figure: caption fragment "... different model classes to a regression dataset."; three panels with axes x (horizontal) and y (vertical).]
                     this is the model class we would want to work with since it has good
                     generalization properties.
                        In practice, we often define very rich model classes Mθ with many pa-
                     rameters, such as deep neural networks. To mitigate the problem of over-
                     fitting, we can use regularization (Section 8.2.3) or priors (Section 8.3.2).
                     We will discuss how to choose the model class in Section 8.6.
                        and they no longer depend on the model parameters θ , which have been
                        marginalized/integrated out. Equation (8.23) reveals that the prediction
                        is an average over all plausible parameter values θ , where the plausibility
                        is encapsulated by the parameter distribution p(θ).
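The idea of a prediction as an average over parameter values can be sketched with Monte Carlo sampling. The example below assumes a linear-Gaussian setting that is not spelled out in this passage: a Gaussian parameter distribution p(θ) = N(m, S) and the likelihood (8.15). In that setting the exact predictive mean is xᵀm and the exact predictive variance is xᵀSx + σ². All numerical values and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
m = np.array([1.0, 2.0])                    # assumed mean of p(theta)
S = np.array([[0.5, 0.1],
              [0.1, 0.2]])                  # assumed covariance of p(theta)
sigma = 0.3                                 # assumed observation noise level
x = np.array([1.0, 4.0])                    # test input, unit feature first

# Draw plausible parameters, then draw one prediction for each of them.
thetas = rng.multivariate_normal(m, S, size=200_000)
y_samples = thetas @ x + rng.normal(scale=sigma, size=len(thetas))

mc_mean, mc_var = y_samples.mean(), y_samples.var()

# Closed-form values for this linear-Gaussian case.
exact_mean = x @ m
exact_var = x @ S @ x + sigma**2

print(mc_mean, exact_mean)
print(mc_var, exact_var)
```

The sampled predictions no longer depend on any single θ. Their spread combines the parameter uncertainty in S with the observation noise σ².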
                           Having discussed parameter estimation in Section 8.3 and Bayesian in-
                        ference here, let us compare these two approaches to learning. Parameter
                        estimation via maximum likelihood or MAP estimation yields a consistent
                        point estimate θ ∗ of the parameters, and the key computational problem
                        to be solved is optimization. In contrast, Bayesian inference yields a (pos-
                        terior) distribution, and the key computational problem to be solved is
                        integration. Predictions with point estimates are straightforward, whereas
                        predictions in the Bayesian framework require solving another integration
                        problem; see (8.23). However, Bayesian inference gives us a principled
                        way to incorporate prior knowledge, account for side information, and
                        incorporate structural knowledge, none of which is easily done in the
                        context of parameter estimation. Moreover, the propagation of parameter
                        uncertainty to the prediction can be valuable in decision-making systems
                        for risk assessment and exploration in the context of data-efficient learn-
                        ing (Deisenroth et al., 2015; Kamthe and Deisenroth, 2018).
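The contrast between a point-estimate prediction and a Bayesian prediction can be sketched on a coin-flip model. This is a minimal illustration, not the book's own code: the data, the Beta(2, 2) prior, and all variable names are assumptions chosen so that the averaging integral has a closed form that a Monte Carlo average can be checked against.

```python
import random

# Hypothetical coin-flip data: 1 = heads, 0 = tails.
data = [1, 1, 1, 0, 1]
heads, n = sum(data), len(data)

# Point estimate (maximum likelihood): an optimization problem whose
# solution for a Bernoulli likelihood is the fraction of heads.
mu_ml = heads / n
pred_point = mu_ml  # p(x = 1 | mu_ml)

# Bayesian prediction: average p(x = 1 | mu) = mu over the posterior of mu.
# Assumed prior: Beta(alpha, beta) with alpha = beta = 2 (illustrative choice).
alpha, beta = 2.0, 2.0
post_a, post_b = alpha + heads, beta + (n - heads)

# Conjugacy gives the integral in closed form (the posterior mean) ...
pred_bayes_exact = post_a / (post_a + post_b)

# ... and a Monte Carlo average approximates the same integral by sampling.
rng = random.Random(0)
samples = [rng.betavariate(post_a, post_b) for _ in range(100_000)]
pred_bayes_mc = sum(samples) / len(samples)

print(pred_point, pred_bayes_exact, pred_bayes_mc)
```

With only five observations, the point estimate commits to the observed frequency, whereas the Bayesian prediction is pulled toward the prior because parameter uncertainty is averaged over.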
                           While Bayesian inference is a mathematically principled framework for
                        learning about parameters and making predictions, there are some prac-
                        tical challenges that come with it because of the integration problems we
                        need to solve; see (8.22) and (8.23). More specifically, if we do not choose
                        a conjugate prior on the parameters (Section 6.6.1), the integrals in (8.22)
                        and (8.23) are not analytically tractable, and we cannot compute the pos-
                       may make the model structure and the generative process easier, learning
                       in latent-variable models is generally hard, as we will see in Chapter 11.
                          Since latent-variable models also allow us to define the process that
                       generates data from parameters, let us have a look at this generative pro-
                       cess. Denoting data by x, the model parameters by θ and the latent vari-
                       ables by z , we obtain the conditional distribution
                                                             p(x | z, θ)                                (8.24)
                       that allows us to generate data for any model parameters and latent vari-
                       ables. Given that z are latent variables, we place a prior p(z) on them.
                          Like the models we discussed previously, models with latent variables
                       can be used for parameter learning and inference within the frameworks
                       we discussed in Sections 8.3 and 8.4.2. To facilitate learning (e.g., by
                       means of maximum likelihood estimation or Bayesian inference), we fol-
                       low a two-step procedure. First, we compute the likelihood p(x | θ) of the
                       model, which does not depend on the latent variables. Second, we use this
                       likelihood for parameter estimation or Bayesian inference, where we use
                       exactly the same expressions as in Sections 8.3 and 8.4.2, respectively.
                          Since the likelihood function p(x | θ) is the predictive distribution of the
                       data given the model parameters, we need to marginalize out the latent
                       variables so that
                       \[ p(x \mid \theta) = \int p(x \mid z, \theta)\, p(z)\, \mathrm{d}z , \tag{8.25} \]
                       where p(x | z, θ) is given in (8.24) and p(z) is the prior on the latent
                       variables. Note that the likelihood must not depend on the latent variables
                       z; it is a function only of the data x and the model parameters θ.
                          The likelihood in (8.25) directly allows for parameter estimation via
                       maximum likelihood. MAP estimation is also straightforward with an
                       additional prior on the model parameters θ, as discussed in Section 8.3.2.
                       Moreover, with the likelihood (8.25) Bayesian inference (Section 8.4.2)
                       in a latent-variable model works in the usual way: We place a prior p(θ)
                       on the model parameters and use Bayes’ theorem to obtain a posterior
                       distribution
                       \[ p(\theta \mid \mathcal{X}) = \frac{p(\mathcal{X} \mid \theta)\, p(\theta)}{p(\mathcal{X})} \tag{8.26} \]
                       over the model parameters given a dataset X . The posterior in (8.26) can
                       be used for predictions within a Bayesian inference framework; see (8.23).
                          One challenge we have in this latent-variable model is that the like-
                       lihood p(X | θ) requires the marginalization of the latent variables ac-
                       cording to (8.25). Except when we choose a conjugate prior p(z) for
                       p(x | z, θ), the marginalization in (8.25) is not analytically tractable, and
                       we need to resort to approximations (Bishop, 2006; Paquet, 2008; Mur-
                       phy, 2012; Moustaki et al., 2015).
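The marginalization in (8.25) can be illustrated on a toy model. The sketch below assumes a scalar latent variable z with prior N(0, 1) and a conditional p(x | z, θ) = N(x | θ + z, σ²); this model, the function names, and the numbers are hypothetical choices made so that the integral has a known Gaussian closed form, N(x | θ, 1 + σ²), against which a simple Monte Carlo approximation can be compared.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def likelihood_mc(x, theta, noise_var=1.0, num_samples=100_000, seed=0):
    """Monte Carlo estimate of p(x | theta) = integral of p(x | z, theta) p(z) dz."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)                       # z ~ p(z) = N(0, 1)
        total += normal_pdf(x, theta + z, noise_var)  # p(x | z, theta)
    return total / num_samples

def likelihood_exact(x, theta, noise_var=1.0):
    """Closed form for this Gaussian toy model: N(x | theta, 1 + noise_var)."""
    return normal_pdf(x, theta, 1.0 + noise_var)

x_obs, theta = 0.7, 0.2
print(likelihood_mc(x_obs, theta), likelihood_exact(x_obs, theta))
```

Both functions take only x and θ as arguments, which mirrors the point that the likelihood does not depend on z once the latent variable has been integrated out. In models without such a closed form, only the approximate route remains.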
                     2017) are used. Moustaki et al. (2015) and Paquet (2008) provide a good
                     overview of Bayesian inference in latent-variable models.
                       In recent years, several programming languages have been proposed
                     that aim to treat the variables defined in software as random variables
                     corresponding to probability distributions. The objective is to be able to
                     write complex functions of probability distributions, while under the hood
                     the compiler automatically takes care of the rules of Bayesian inference.
                     This rapidly changing field is called probabilistic programming.
Example 8.7
Consider the joint distribution
                                 p(a, b, c) = p(c | a, b)p(b | a)p(a)                  (8.29)
of three random variables a, b, c. The factorization of the joint distribution
in (8.29) tells us something about the relationship between the random
variables:
   c depends directly on a and b.
   b depends directly on a.
   a depends neither on b nor on c.
For the factorization in (8.29), we obtain the directed graphical model in
Figure 8.9(a).
exactly the opposite and describe how to extract the joint distribution of
a set of random variables from a given graphical model.
Example 8.8
Looking at the graphical model in Figure 8.9(b), we exploit two proper-
ties:
  The joint distribution p(x1 , . . . , x5 ) we seek is the product of a set of
  conditionals, one for each node in the graph. In this particular example,
  we will need five conditionals.
  Each conditional depends only on the parents of the corresponding
  node in the graph. For example, x4 will be conditioned on x2 .
These two properties yield the desired factorization of the joint distribu-
tion
p(x1 , x2 , x3 , x4 , x5 ) = p(x1 )p(x5 )p(x2 | x5 )p(x3 | x1 , x2 )p(x4 | x2 ) . (8.30)
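In general, the joint distribution p(x1 , . . . , xK ) of a directed graphical
model factorizes as
\[ p(x_1, \dots, x_K) = \prod_{k=1}^{K} p(x_k \mid \mathrm{Pa}_k) , \tag{8.31} \]
where Pak means “the parent nodes of xk ”. Parent nodes of xk are nodes
that have arrows pointing to xk .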
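The two properties used in Example 8.8 translate directly into a few lines of code. The sketch below is illustrative only: the `parents` dictionary encodes the parent sets implied by (8.30), and the function name `factorization` is a hypothetical helper, not part of any library.

```python
# Parent sets for each node, as implied by the factorization in (8.30).
parents = {
    "x1": [],
    "x2": ["x5"],
    "x3": ["x1", "x2"],
    "x4": ["x2"],
    "x5": [],
}

def factorization(parents):
    """Return the joint as a product of one conditional per node."""
    factors = []
    for node, pa in parents.items():
        if pa:
            # Each conditional depends only on the node's parents.
            factors.append(f"p({node} | {', '.join(pa)})")
        else:
            # Nodes without parents contribute a marginal.
            factors.append(f"p({node})")
    return "".join(factors)

print(factorization(parents))
```

The printed product contains the same five factors as (8.30), only in a different order.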
  We conclude this subsection with a concrete example of the coin-flip
experiment. Consider a Bernoulli experiment (Example 6.8) where the
probability that the outcome x of this experiment is “heads” is
                                 p(x | µ) = Ber(µ) .                             (8.32)
We now repeat this experiment N times and observe outcomes x1 , . . . , xN
so that we obtain the joint distribution
\[ p(x_1, \dots, x_N \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu) . \tag{8.33} \]
the plate notation. The plate (box) repeats everything inside (in this case,
the observations xn ) N times. Therefore, both graphical models are equivalent,
but the plate notation is more compact. Graphical models immediately
allow us to place a hyperprior on µ. A hyperprior is a second layer
of prior distributions on the parameters of the first layer of priors.
Figure 8.10(c) places a Beta(α, β) prior on the latent variable µ. If we treat
α and β as deterministic parameters, i.e., not random variables, we omit
the circles around them.
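The generative reading of the coin-flip model with a Beta(α, β) prior on µ can be sketched as ancestral sampling: first draw µ, then draw the N outcomes. The function names and the values α = β = 2, N = 10 are assumptions made for illustration; the loop over n plays the role of the plate.

```python
import random

def sample_coin_flips(alpha, beta, num_flips, seed=0):
    """Ancestral sampling: draw mu from its Beta prior, then N Bernoulli outcomes."""
    rng = random.Random(seed)
    mu = rng.betavariate(alpha, beta)  # mu ~ Beta(alpha, beta)
    # The comprehension mirrors the plate: the same conditional p(x_n | mu)
    # is repeated for n = 1, ..., N.
    flips = [1 if rng.random() < mu else 0 for _ in range(num_flips)]
    return mu, flips

def joint_likelihood(flips, mu):
    """p(x_1, ..., x_N | mu) as a product of per-flip Bernoulli terms, cf. (8.33)."""
    prob = 1.0
    for x in flips:
        prob *= mu if x == 1 else 1.0 - mu
    return prob

mu, flips = sample_coin_flips(alpha=2.0, beta=2.0, num_flips=10)
print(mu, flips, joint_likelihood(flips, mu))
```

Here α and β enter as plain function arguments rather than sampled quantities, which corresponds to treating them as deterministic parameters.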
\[ \mathcal{A} \perp\!\!\!\perp \mathcal{B} \mid \mathcal{C} , \tag{8.34} \]
   The arrows on the path meet either head to tail or tail to tail at the
   node, and the node is in the set C .
   The arrows meet head to head at the node, and neither the node nor
   any of its descendants is in the set C .
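The two blocking conditions above can be checked mechanically for a single path. The following sketch assumes the graph is given as a dictionary mapping each node to its children; the graph, the function names, and the example path are hypothetical and do not reproduce Figure 8.11. It tests one path only; a full d-separation check would apply it to every path between the two node sets.

```python
def descendants(node, children):
    """All nodes reachable from `node` by following arrows."""
    found, stack = set(), list(children.get(node, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, []))
    return found

def path_is_blocked(path, children, observed):
    """Apply the two blocking rules at every interior node of a path."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        head_to_head = node in children.get(prev, []) and node in children.get(nxt, [])
        if head_to_head:
            # Arrows meet head to head: blocked if neither the node nor
            # any of its descendants is in the conditioning set.
            if node not in observed and not (descendants(node, children) & observed):
                return True
        elif node in observed:
            # Arrows meet head to tail or tail to tail at an observed node.
            return True
    return False

# Hypothetical graph: a -> c <- b, c -> d.
children = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
path = ["a", "c", "b"]
print(path_is_blocked(path, children, observed=set()))  # collider c unobserved
print(path_is_blocked(path, children, observed={"d"}))  # descendant d observed
```

In this example, observing the descendant d of the head-to-head node c unblocks the path between a and b, whereas leaving c and d unobserved keeps it blocked.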
Figure 8.11 D-separation example.
                        is that at training time we can only use the training set to evaluate the
                        performance of the model and learn its parameters. However, the per-
                        formance on the training set is not really what we are interested in. In
                        Section 8.3, we have seen that maximum likelihood estimation can lead
                        to overfitting, especially when the training dataset is small. Ideally, our
                        model (also) works well on the test set (which is not available at training
                        time). Therefore, we need some mechanisms for assessing how a model
                        generalizes to unseen test data. Model selection is concerned with exactly
                        this problem.
Figure 8.14 Bayesian inference embodies Occam’s razor. The horizontal axis
describes the space of all possible datasets D. The evidence (vertical axis)
evaluates how well a model predicts available data. Since p(D | Mi ) needs to
integrate to 1, we should choose the model with the greatest evidence. Adapted
from MacKay (2003). Plot labels: Evidence, p(D | M1 ), p(D | M2 ), D, C.
mean estimate is. Once the model is chosen, we can evaluate the final
performance on the test set.
                       With a uniform prior p(Mk ) = 1/K, which gives every model equal (prior)
                       probability, determining the MAP estimate over models amounts to picking
                       the model that maximizes the model evidence (8.44).
                       Remark (Likelihood and Marginal Likelihood). There are some important
                       differences between a likelihood and a marginal likelihood (evidence):
                       While the likelihood is prone to overfitting, the marginal likelihood is typically
                       not, because the model parameters have been marginalized out (i.e., we
                       no longer have to fit the parameters). Furthermore, the marginal likelihood
                       automatically embodies a trade-off between model complexity and
                       data fit (Occam’s razor).                                                  ♦
The ratio of the posteriors is also called the posterior odds. The first fraction
on the right-hand side of (8.46), the prior odds, measures how much
our prior (initial) beliefs favor M1 over M2 . The ratio of the marginal likelihoods
(second fraction on the right-hand side) is called the Bayes factor
and measures how well the data D is predicted by M1 compared to M2 .
Remark. The Jeffreys-Lindley paradox states that the “Bayes factor always
favors the simpler model since the probability of the data under a complex
model with a diffuse prior will be very small” (Murphy, 2012). Here, a
diffuse prior refers to a prior that does not favor specific models, i.e.,
many models are a priori plausible under this prior.                    ♦
   If we choose a uniform prior over models, the prior odds term in (8.46)
is 1, i.e., the posterior odds is the ratio of the marginal likelihoods (Bayes
factor)
\[ \frac{p(\mathcal{D} \mid M_1)}{p(\mathcal{D} \mid M_2)} . \tag{8.47} \]
If the Bayes factor is greater than 1, we choose model M1 , otherwise
model M2 . As in frequentist statistics, there are guidelines on how large
the ratio should be before the result is considered “significant”
(Jeffreys, 1961).
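A Bayes factor of the form (8.47) can be computed exactly for two simple coin models. The sketch below is a hypothetical illustration: M1 fixes the heads probability at 0.5 (no free parameters), while M2 places a Beta(1, 1) prior on it, for which the marginal likelihood of a specific flip sequence is a ratio of Beta functions. The function names and data counts are assumptions.

```python
import math

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def evidence_fair_coin(heads, tails):
    """p(D | M1): hypothetical model M1 fixes mu = 0.5, so nothing is integrated."""
    return 0.5 ** (heads + tails)

def evidence_beta_coin(heads, tails, alpha=1.0, beta=1.0):
    """p(D | M2): hypothetical model M2 puts a Beta(alpha, beta) prior on mu.

    Integrating mu out of the Bernoulli likelihood gives
    B(alpha + heads, beta + tails) / B(alpha, beta).
    """
    return math.exp(log_beta(alpha + heads, beta + tails) - log_beta(alpha, beta))

def bayes_factor(heads, tails):
    """Ratio p(D | M1) / p(D | M2) for a specific sequence of flips."""
    return evidence_fair_coin(heads, tails) / evidence_beta_coin(heads, tails)

print(bayes_factor(heads=5, tails=5))   # balanced data
print(bayes_factor(heads=10, tails=0))  # one-sided data
```

For balanced data the ratio exceeds 1, so the simpler model M1 is chosen; for strongly one-sided data it falls below 1 and the more flexible model M2 is preferred.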
Remark (Computing the Marginal Likelihood). The marginal likelihood
plays an important role in model selection: We need to compute Bayes
factors (8.46) and posterior distributions over models (8.43).
   Unfortunately, computing the marginal likelihood requires us to solve
an integral (8.44). This integration is generally analytically intractable,
and we will have to resort to approximation techniques, e.g., numerical
integration (Stoer and Bulirsch, 2002), stochastic approximations using
Monte Carlo (Murphy, 2012), or Bayesian Monte Carlo techniques (O’Hagan,
1991; Rasmussen and Ghahramani, 2003).
   However, there are special cases in which we can solve it. In Section 6.6.1,
we discussed conjugate models. If we choose a conjugate parameter prior
p(θ), we can compute the marginal likelihood in closed form. In Chap-
ter 9, we will do exactly this in the context of linear regression.        ♦
   We have seen a brief introduction to the basic concepts of machine
learning in this chapter. For the rest of this part of the book we will see
                       how the three different flavors of learning in Sections 8.2, 8.3, and 8.4 are
                       applied to the four pillars of machine learning (regression, dimensionality
                       reduction, density estimation, and classification).
9 Linear Regression
Figure 9.1 (a) Dataset; (b) possible solution to the regression problem.
(a) Regression problem: observed noisy function values from which we wish to
infer the underlying function that generated the data.
(b) Regression solution: possible function that could have generated the data
(blue) with indication of the measurement noise of the function value at the
corresponding inputs (orange distributions).
This material is published by Cambridge University Press as Mathematics for Machine Learning by
Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view
and download for personal use only. Not for re-distribution, re-sale, or use in derivative works.
© by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021. https://mml-book.com.
\[ p(y \mid x) = \mathcal{N}\big(y \mid f(x), \sigma^2\big) . \tag{9.1} \]
\[ y = f(x) + \epsilon , \tag{9.2} \]
where ε ∼ N(0, σ²) is independent, identically distributed (i.i.d.) Gaussian
measurement noise with mean 0 and variance σ². Our objective is
to find a function that is close (similar) to the unknown function f that
generated the data and that generalizes well.
   In this chapter, we focus on parametric models, i.e., we choose a para-
metrized function and find parameters θ that “work well” for modeling the
data. For the time being, we assume that the noise variance σ 2 is known
and focus on learning the model parameters θ . In linear regression, we
consider the special case that the parameters θ appear linearly in our
model. An example of linear regression is given by
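\[ p(y \mid x, \theta) = \mathcal{N}\big(y \mid x^\top \theta, \sigma^2\big) \tag{9.3} \]
\[ \iff \quad y = x^\top \theta + \epsilon , \quad \epsilon \sim \mathcal{N}\big(0, \sigma^2\big) , \tag{9.4} \]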
Figure 9.2 Linear regression example. (a) Example functions that fall into this
category; (b) training set; (c) maximum likelihood estimate.
(a) Example functions (straight lines) that can be described using the linear
model in (9.4).
(b) Training set.
(c) Maximum likelihood estimate.
                       refers to models that are “linear in the parameters”, i.e., models that de-
                       scribe a function by a linear combination of input features. Here, a “fea-
                       ture” is a representation φ(x) of the inputs x.
                          In the following, we will discuss in more detail how to find good pa-
                       rameters θ and how to evaluate whether a parameter set “works well”.
                       For the time being, we assume that the noise variance σ 2 is known.
\[ p(y_* \mid x_*, \theta^*) = \mathcal{N}\big(y_* \mid x_*^\top \theta^*, \sigma^2\big) . \tag{9.6} \]
where we exploited that the likelihood (9.5b) factorizes over the number
of data points due to our independence assumption on the training set.
   In the linear regression model (9.4), the likelihood is Gaussian (due to
the Gaussian additive noise term), such that we arrive at
\[ \log p(y_n \mid x_n, \theta) = -\frac{1}{2\sigma^2}\,(y_n - x_n^\top \theta)^2 + \text{const} , \tag{9.9} \]
where the constant includes all terms independent of θ . Using (9.9) in the
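\begin{align}
\frac{\mathrm{d}L}{\mathrm{d}\theta} &= \frac{\mathrm{d}}{\mathrm{d}\theta}\left( \frac{1}{2\sigma^2} (y - X\theta)^\top (y - X\theta) \right) \tag{9.11a} \\
&= \frac{1}{2\sigma^2} \frac{\mathrm{d}}{\mathrm{d}\theta}\left( y^\top y - 2 y^\top X \theta + \theta^\top X^\top X \theta \right) \tag{9.11b} \\
&= \frac{1}{\sigma^2} \left( -y^\top X + \theta^\top X^\top X \right) \in \mathbb{R}^{1 \times D} . \tag{9.11c}
\end{align}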
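The maximum likelihood estimator θ ML solves dL/dθ = 0⊤ (necessary
optimality condition) and we obtain
\begin{align}
\frac{\mathrm{d}L}{\mathrm{d}\theta} = 0^\top &\overset{(9.11c)}{\iff} \theta_{\mathrm{ML}}^\top X^\top X = y^\top X \tag{9.12a} \\
&\iff \theta_{\mathrm{ML}}^\top = y^\top X (X^\top X)^{-1} \tag{9.12b} \\
&\iff \theta_{\mathrm{ML}} = (X^\top X)^{-1} X^\top y . \tag{9.12c}
\end{align}
Margin note: Ignoring the possibility of duplicate data points, rk(X) = D
if N ⩾ D, i.e., we do not have more parameters than data points.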
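The closed-form estimator in (9.12c) can be checked numerically. The sketch below uses NumPy and synthetic data; the sizes N = 50 and D = 3, the noise level, and `theta_true` are assumptions made for illustration. It solves the normal equations from (9.12a) in transposed form rather than forming the inverse explicitly, which is a common numerical choice and not something the derivation itself prescribes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data under assumed values: N = 50 points, D = 3 parameters.
N, D, sigma = 50, 3, 0.1
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(N, D))
y = X @ theta_true + sigma * rng.normal(size=N)

# Solve the normal equations X^T X theta = X^T y (the transpose of (9.12a)),
# which avoids forming the inverse in (9.12c) explicitly.
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)

# Necessary optimality condition: the gradient (9.11c) vanishes at theta_ml.
gradient = (-y @ X + theta_ml @ X.T @ X) / sigma**2

# Cross-check against NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_ml, np.abs(gradient).max())
```

Because N > D and the inputs are drawn at random, X has full column rank here, so the system has a unique solution.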
\[ \iff \quad y = \phi^\top(x)\,\theta + \epsilon = \sum_{k=0}^{K-1} \theta_k \phi_k(x) + \epsilon , \tag{9.13} \]
                     K − 1 is
\[ f(x) = \sum_{k=0}^{K-1} \theta_k x^k = \phi^\top(x)\,\theta , \tag{9.15} \]
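The polynomial in (9.15) is linear in θ once the inputs are mapped to the features 1, x, ..., x^(K−1). A minimal NumPy sketch of this mapping follows; the helper names `poly_features` and `poly_eval` and the example parameters are hypothetical.

```python
import numpy as np

def poly_features(x, K):
    """Feature matrix whose n-th row is phi(x_n) = [1, x_n, ..., x_n^(K-1)]."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=K, increasing=True)

def poly_eval(x, theta):
    """Evaluate f(x) = sum_k theta_k x^k = phi(x)^T theta for each input."""
    return poly_features(x, len(theta)) @ np.asarray(theta, dtype=float)

theta = [1.0, 0.0, 2.0]  # hypothetical parameters: f(x) = 1 + 2 x^2
print(poly_features([0.0, 1.0, 2.0], K=3))
print(poly_eval([0.0, 1.0, 2.0], theta))
```

Stacking the feature vectors row by row gives a feature matrix that can take the place of X in a least-squares fit.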
Figure 9.4 Polynomial regression: (a) dataset consisting of (xn , yn ) pairs,
n = 1, . . . , 10; (b) maximum likelihood polynomial of degree 4.
(a) Regression dataset.
(b) Polynomial of degree 4 determined by maximum likelihood estimation.
Figure 9.5 Maximum likelihood fits for different polynomial degrees M
(panels plot y against x).
Figure 9.6 Plot of RMSE (vertical axis) against the degree of the polynomial
(horizontal axis).
                      maximum likelihood fits in Figure 9.5. Note that the training error (blue
                      curve in Figure 9.6) never increases when the degree of the polynomial
                      increases. In our example, the best generalization (the point of the smallest
                      test error) is obtained for a polynomial of degree M = 4.
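The behavior of training and test error can be reproduced in a small experiment. The sketch below is not the book's dataset: the ground-truth function, noise level, and input locations are assumptions, and the degree with the smallest test error in this synthetic setting need not be 4. It fits maximum likelihood polynomials of degree 0 to 9 and records the RMSE on both sets.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Hypothetical ground-truth function, chosen only for illustration.
    return np.sin(x) + 0.5 * x

x_train = np.linspace(-4, 4, 10)
y_train = f(x_train) + 0.2 * rng.normal(size=x_train.size)
x_test = np.linspace(-4, 4, 200)
y_test = f(x_test) + 0.2 * rng.normal(size=x_test.size)

def rmse(y, y_pred):
    """Root mean square error between targets and predictions."""
    return np.sqrt(np.mean((y - y_pred) ** 2))

train_err, test_err = [], []
for M in range(0, 10):
    # Polynomial features of degree M for training and test inputs.
    Phi_train = np.vander(x_train, N=M + 1, increasing=True)
    Phi_test = np.vander(x_test, N=M + 1, increasing=True)
    # Maximum likelihood estimate = least-squares solution.
    theta_ml, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    train_err.append(rmse(y_train, Phi_train @ theta_ml))
    test_err.append(rmse(y_test, Phi_test @ theta_ml))

best_M = int(np.argmin(test_err))
print(train_err, test_err, best_M)
```

The training error cannot increase with M because each lower-degree polynomial is a special case of the next higher degree; the test error carries no such guarantee.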
where the constant comprises the terms that are independent of θ . We see
that the log-posterior in (9.25) is the sum of the log-likelihood log p(Y | X , θ)
and the log-prior log p(θ) so that the MAP estimate will be a “compromise”
between the prior (our suggestion for plausible parameter values before
observing data) and the data-dependent likelihood.
  To find the MAP estimate θ MAP , we minimize the negative log-posterior
distribution with respect to θ , i.e., we solve
\[ \theta_{\mathrm{MAP}} \in \arg\min_{\theta} \{ -\log p(\mathcal{Y} \mid \mathcal{X}, \theta) - \log p(\theta) \} . \tag{9.26} \]
                     Φ⊤Φ + (σ²/b²)I is symmetric and strictly positive definite (i.e., its inverse
                     exists and the MAP estimate is the unique solution of a system of linear
                     equations). Moreover, it reflects the impact of the regularizer.
[Figure 9.7: Polynomial regression: maximum likelihood and MAP estimates. (a) Polynomials of degree 6; (b) polynomials of degree 8. Axes: x (horizontal), y (vertical); legend: Training data, MLE, MAP.]
useful for variable selection. For $p = 1$, the regularizer is called LASSO
(least absolute shrinkage and selection operator) and was proposed by
Tibshirani (1996).                                                         ♦
   The regularizer $\lambda \|\theta\|_2^2$ in (9.32) can be interpreted as a
negative log-Gaussian prior, which we use in MAP estimation; see (9.26). More
specifically, with a Gaussian prior $p(\theta) = \mathcal{N}(0, b^2 I)$, we
obtain the negative log-Gaussian prior
   $$ -\log p(\theta) = \frac{1}{2b^2} \|\theta\|_2^2 + \text{const} \tag{9.33} $$
so that for $\lambda = \frac{1}{2b^2}$ the regularization term and the negative
log-Gaussian prior are identical.
   Given that the regularized least-squares loss function in (9.32) consists
of terms that are closely related to the negative log-likelihood plus a
negative log-prior, it is not surprising that, when we minimize this loss, we
obtain a solution that closely resembles the MAP estimate in (9.31). More
specifically, minimizing the regularized least-squares loss function yields
   $$ \theta_{\mathrm{RLS}} = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y , \tag{9.34} $$
which is identical to the MAP estimate in (9.31) for
$\lambda = \frac{\sigma^2}{b^2}$, where $\sigma^2$ is the noise variance and
$b^2$ the variance of the (isotropic) Gaussian prior
$p(\theta) = \mathcal{N}(0, b^2 I)$.
   So far, we have covered parameter estimation using maximum likelihood
and MAP estimation where we found point estimates $\theta^*$ that optimize an
objective function (likelihood or posterior). (A point estimate is a single
specific parameter value, unlike a distribution over plausible parameter
settings.) We saw that both maximum likelihood and MAP estimation can lead to
overfitting. In the next section, we will discuss Bayesian linear regression,
where we use Bayesian inference (Section 8.4) to find a posterior distribution
over the unknown parameters, which we subsequently use to make predictions.
More specifically, for predictions we will average over all plausible sets of
parameters instead of focusing on a point estimate.
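The link between the regularized least-squares solution (9.34) and the MAP estimate can be checked numerically. The following is a minimal sketch, not the book's own code: the toy data, the polynomial features, and the values of $\sigma^2$ and $b^2$ are assumptions made purely for illustration. It computes $\theta_{\mathrm{RLS}}$ with $\lambda = \sigma^2 / b^2$ and evaluates the gradient of the negative log-posterior from (9.26) at that point, which should vanish if the two estimates agree.

```python
import numpy as np

def poly_features(x, degree):
    """Feature matrix Phi whose n-th row is [1, x_n, x_n^2, ..., x_n^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def regularized_least_squares(Phi, y, lam):
    """theta_RLS = (Phi^T Phi + lam I)^{-1} Phi^T y, cf. (9.34)."""
    K = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(K), Phi.T @ y)

def neg_log_posterior_grad(theta, Phi, y, sigma2, b2):
    """Gradient of -log p(y | Phi, theta) - log p(theta) for Gaussian
    noise N(0, sigma2) and Gaussian prior N(0, b2 I)."""
    return -(Phi.T @ (y - Phi @ theta)) / sigma2 + theta / b2

# Toy data (assumed for illustration only).
rng = np.random.default_rng(0)
sigma2, b2 = 0.2 ** 2, 1.0
x = rng.uniform(-1.0, 1.0, size=10)
y = np.sin(3.0 * x) + rng.normal(0.0, np.sqrt(sigma2), size=10)

Phi = poly_features(x, degree=4)
theta_rls = regularized_least_squares(Phi, y, lam=sigma2 / b2)
theta_mle = np.linalg.lstsq(Phi, y, rcond=None)[0]  # unregularized fit

# Zero gradient of the negative log-posterior means theta_rls is the MAP estimate.
grad_at_rls = neg_log_posterior_grad(theta_rls, Phi, y, sigma2, b2)
print("max |gradient| at theta_RLS:", np.abs(grad_at_rls).max())
print("norms (RLS, MLE):", np.linalg.norm(theta_rls), np.linalg.norm(theta_mle))
```

Solving the linear system with `np.linalg.solve` instead of forming the inverse explicitly is a design choice for numerical stability; the regularized parameter vector also has a norm no larger than the unregularized one, which reflects the shrinkage effect of the prior.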
9.3.1 Model
In Bayesian linear regression, we consider the model
   $$ \begin{aligned} \text{prior} \quad & p(\theta) = \mathcal{N}(m_0, S_0) , \\ \text{likelihood} \quad & p(y \mid x, \theta) = \mathcal{N}(y \mid \phi^\top(x)\theta, \sigma^2) , \end{aligned} \tag{9.35} $$
where we now explicitly place a Gaussian prior $p(\theta) = \mathcal{N}(m_0, S_0)$
on $\theta$, which turns the parameter vector into a random variable. This
allows us to write down the corresponding graphical model in Figure 9.8, where
we made the parameters of the Gaussian prior on $\theta$ explicit. The full
probabilistic model, i.e., the joint distribution of observed and unobserved
random variables, $y$ and $\theta$, respectively, is
   $$ p(y, \theta \mid x) = p(y \mid x, \theta) p(\theta) . \tag{9.36} $$
[Figure 9.8: Graphical model for Bayesian linear regression, with nodes $m_0$, $S_0$, $\theta$, $\sigma$, $x$, $y$.]
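The factorization in (9.36) suggests a simple way to simulate from the model: first draw $\theta$ from the prior, then draw $y$ from the likelihood given that $\theta$. The sketch below assumes polynomial features and arbitrary values for $m_0$, $S_0$, and $\sigma$; none of these choices come from the text. For a linear-Gaussian model the resulting $y$ values should have mean $\phi^\top(x) m_0$ and variance $\phi^\top(x) S_0 \phi(x) + \sigma^2$, which the sample statistics can be compared against.

```python
import numpy as np

def phi(x, degree):
    """Polynomial feature vector [1, x, ..., x^degree] (an assumed feature choice)."""
    return x ** np.arange(degree + 1)

def sample_joint(x, m0, S0, sigma, n_samples, rng):
    """Draw (theta, y) pairs from p(y, theta | x) = p(y | x, theta) p(theta)."""
    features = phi(x, len(m0) - 1)
    thetas = rng.multivariate_normal(m0, S0, size=n_samples)          # theta ~ N(m0, S0)
    ys = thetas @ features + sigma * rng.standard_normal(n_samples)   # y ~ N(phi^T theta, sigma^2)
    return thetas, ys

rng = np.random.default_rng(1)
m0 = np.zeros(4)          # assumed prior mean
S0 = 0.25 * np.eye(4)     # assumed prior covariance
sigma = 0.1               # assumed noise standard deviation
x = 0.5

thetas, ys = sample_joint(x, m0, S0, sigma, n_samples=20000, rng=rng)

features = phi(x, 3)
expected_mean = features @ m0
expected_var = features @ S0 @ features + sigma ** 2
print("sample mean / expected:", ys.mean(), expected_mean)
print("sample var  / expected:", ys.var(), expected_var)
```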
9.3.2 Prior Predictions
In practice, we are usually not so much interested in the parameter values
$\theta$ themselves. Instead, our focus often lies in the predictions we make
with those parameter values. In a Bayesian setting, we take the parameter
distribution and average over all plausible parameter settings when we
make predictions. More specifically, to make predictions at an input $x_*$,
we integrate out $\theta$ and obtain
   $$ p(y_* \mid x_*) = \int p(y_* \mid x_*, \theta) p(\theta) \, d\theta = \mathbb{E}_{\theta}[p(y_* \mid x_*, \theta)] , \tag{9.37} $$
   $$ \Sigma := A , \tag{9.53} $$
   $$ \mu := \Sigma^{-1} a \tag{9.54} $$
   $$ A := \sigma^{-2} \Phi^\top \Phi + S_0^{-1} , \tag{9.55} $$
   $$ a := \sigma^{-2} \Phi^\top y + S_0^{-1} m_0 . \tag{9.56} $$
The term $\phi^\top(x_*) S_N \phi(x_*)$ reflects the posterior uncertainty
associated with the parameters $\theta$. Note that $S_N$ depends on the
training inputs through $\Phi$; see (9.43b). The predictive mean
$\phi^\top(x_*) m_N$ coincides with the predictions made with the MAP estimate
$\theta_{\mathrm{MAP}}$, i.e.,
$\mathbb{E}[y_* \mid \mathcal{X}, \mathcal{Y}, x_*] = \phi^\top(x_*) m_N = \phi^\top(x_*) \theta_{\mathrm{MAP}}$.
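The quantities above can be computed in a few lines. The sketch below assumes the standard result that the posterior covariance $S_N$ is the inverse of $A$ in (9.55) and the posterior mean is $m_N = S_N a$ with $a$ from (9.56); the toy data, the polynomial features, and the function names `posterior` and `predict` are illustrative assumptions. With a prior $\mathcal{N}(0, b^2 I)$ it also compares $m_N$ with the MAP estimate $(\Phi^\top \Phi + \frac{\sigma^2}{b^2} I)^{-1} \Phi^\top y$.

```python
import numpy as np

def posterior(Phi, y, m0, S0, sigma2):
    """Parameter posterior N(m_N, S_N), assuming S_N = A^{-1} and m_N = S_N a."""
    S0_inv = np.linalg.inv(S0)
    A = Phi.T @ Phi / sigma2 + S0_inv       # (9.55)
    a = Phi.T @ y / sigma2 + S0_inv @ m0    # (9.56)
    S_N = np.linalg.inv(A)
    m_N = S_N @ a
    return m_N, S_N

def predict(phi_star, m_N, S_N, sigma2):
    """Predictive mean and variance of a noisy observation y_* at features phi_star."""
    mean = phi_star @ m_N
    var = phi_star @ S_N @ phi_star + sigma2  # parameter uncertainty + noise
    return mean, var

# Toy data (assumed for illustration only).
rng = np.random.default_rng(2)
sigma2, b2, degree = 0.05, 1.0, 3
x = rng.uniform(-1.0, 1.0, size=12)
y = np.sin(3.0 * x) + rng.normal(0.0, np.sqrt(sigma2), size=12)
Phi = np.vander(x, degree + 1, increasing=True)

m0 = np.zeros(degree + 1)
S0 = b2 * np.eye(degree + 1)
m_N, S_N = posterior(Phi, y, m0, S0, sigma2)

# MAP estimate for the same zero-mean isotropic prior.
theta_map = np.linalg.solve(Phi.T @ Phi + (sigma2 / b2) * np.eye(degree + 1), Phi.T @ y)

phi_star = 0.3 ** np.arange(degree + 1)   # features of a test input x_* = 0.3
pred_mean, pred_var = predict(phi_star, m_N, S_N, sigma2)
print("posterior mean equals MAP estimate:", np.allclose(m_N, theta_map))
print("predictive mean, variance:", pred_mean, pred_var)
```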
   Figure 9.10 shows the posterior over functions that we obtain via
Bayesian linear regression. The training dataset is shown in panel (a);
panel (b) shows the posterior distribution over functions, including the
functions we would obtain via maximum likelihood and MAP estimation.
The function we obtain using the MAP estimate also corresponds to the
posterior mean function in the Bayesian linear regression setting. Panel (c)
shows some plausible realizations (samples) of functions under that
posterior over functions.
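Samples like those in panel (c) can be generated by drawing parameter vectors from the parameter posterior and evaluating the corresponding functions on a grid of inputs. The sketch below is one possible way to do this, under assumptions not taken from the text: a zero-mean isotropic prior $\mathcal{N}(0, b^2 I)$, polynomial features of degree 3, and synthetic training data.

```python
import numpy as np

rng = np.random.default_rng(3)
degree, sigma2, b2 = 3, 0.05, 1.0          # assumed model settings

# Synthetic training data (illustration only).
x_train = rng.uniform(-1.0, 1.0, size=15)
y_train = np.sin(3.0 * x_train) + rng.normal(0.0, np.sqrt(sigma2), size=15)
Phi = np.vander(x_train, degree + 1, increasing=True)

# Parameter posterior N(m_N, S_N) for the prior N(0, b2 I).
S_N = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(degree + 1) / b2)
S_N = 0.5 * (S_N + S_N.T)                  # remove round-off asymmetry
m_N = S_N @ Phi.T @ y_train / sigma2

# Each parameter sample theta_i induces a function f_i(x) = phi(x)^T theta_i.
x_grid = np.linspace(-1.0, 1.0, 50)
Phi_grid = np.vander(x_grid, degree + 1, increasing=True)
theta_samples = rng.multivariate_normal(m_N, S_N, size=2000)
f_samples = theta_samples @ Phi_grid.T     # shape (2000, 50): one row per function

# Posterior mean function and marginal variance phi(x)^T S_N phi(x) on the grid.
mean_fn = Phi_grid @ m_N
marginal_var = np.einsum("ij,jk,ik->i", Phi_grid, S_N, Phi_grid)

print("largest gap between sample mean and posterior mean function:",
      np.abs(f_samples.mean(axis=0) - mean_fn).max())
```

The marginal variance computed here covers only the parameter uncertainty of the noise-free function values; predictive bounds for noisy observations would add $\sigma^2$.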
[Figure 9.10: Bayesian linear regression and posterior over functions. (a) Training data. (b) Posterior over functions represented by the marginal uncertainties (shaded) showing the 67% and 95% predictive confidence bounds, the maximum likelihood estimate (MLE) and the MAP estimate (MAP), the latter of which is identical to the posterior mean function. (c) Samples from the posterior over functions, which are induced by the samples from the parameter posterior. Axes: x (horizontal), y (vertical); legend: Training data, MLE, MAP, BLR.]
[Figure: (a) Posterior distribution for polynomials of degree M = 3 (left) and samples from the posterior over functions (right). (b) Posterior distribution for polynomials of degree M = 5 (left) and samples from the posterior over functions (right). (c) Posterior distribution for polynomials of degree M = 7 (left) and samples from the posterior over functions (right). Axes: x (horizontal), y (vertical); legend: Training data, MLE, MAP, BLR. Caption (fragment): ... 67% (dark gray) and 95% (light gray) predictive confidence bounds. The mean of the Bayesian linear regression model coincides with the MAP estimate. The predictive uncertainty is the sum of the noise term and the posterior parameter uncertainty, which ...]
[Figure: (a) Regression dataset consisting of noisy observations $y_n$ (blue) of function values $f(x_n)$ at input locations $x_n$. (b) The orange dots are the projections of the noisy observations (blue dots) onto the line $\theta_{\mathrm{ML}} x$. The maximum likelihood solution to a linear regression problem finds a subspace (line) onto which the overall projection error (orange lines) of the observations is minimized. Axes: x (horizontal), y (vertical); legend: Projection, Observations, Maximum likelihood estimate. Caption (fragment): ... (b) maximum likelihood solution interpreted as a projection.]
   $$ = \mathcal{N}(y \mid X m_0, X S_0 X^\top + \sigma^2 I) . \tag{9.64b} $$
Given the close connection with the posterior predictive distribution (see
Remark on Marginal Likelihood and Posterior Predictive Distribution earlier
in this section), the functional form of the marginal likelihood should
not be too surprising.
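Because (9.64b) is a Gaussian density in $y$, its logarithm can be evaluated directly. The following sketch is one way to do so with a Cholesky factorization; the function name `log_marginal_likelihood` and the small numerical example are assumptions for illustration, not part of the text.

```python
import numpy as np

def log_marginal_likelihood(X, y, m0, S0, sigma2):
    """log N(y | X m0, X S0 X^T + sigma2 I), cf. (9.64b)."""
    N = X.shape[0]
    mean = X @ m0
    cov = X @ S0 @ X.T + sigma2 * np.eye(N)
    L = np.linalg.cholesky(cov)                  # cov = L L^T
    alpha = np.linalg.solve(L, y - mean)         # alpha^T alpha = (y-mean)^T cov^{-1} (y-mean)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))   # log |cov|
    return -0.5 * (alpha @ alpha + log_det + N * np.log(2.0 * np.pi))

# Small example with assumed values: three data points, two parameters.
X = np.array([[1.0, -1.0],
              [1.0, 0.0],
              [1.0, 1.0]])
y = np.array([0.1, 0.9, 2.1])
m0 = np.zeros(2)
S0 = np.eye(2)
value = log_marginal_likelihood(X, y, m0, S0, sigma2=0.1)
print("log marginal likelihood:", value)
```

Working with the Cholesky factor avoids forming the inverse and the determinant of the covariance explicitly, which is usually the more stable choice for larger $N$.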
When the basis is not orthogonal, one can convert a set of linearly
independent basis functions to an orthogonal basis by using the Gram-Schmidt
process; see Section 3.8.3 and Strang (2003).
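As a minimal sketch of the Gram-Schmidt process, the code below orthogonalizes the columns of a matrix with respect to the standard dot product. Treating basis functions as vectors of their values at a fixed set of inputs, and the function name `gram_schmidt`, are assumptions made for this illustration.

```python
import numpy as np

def gram_schmidt(V):
    """Return a matrix whose columns are mutually orthogonal and span the
    same space as the linearly independent columns of V."""
    V = np.asarray(V, dtype=float)
    Q = np.zeros_like(V)
    for k in range(V.shape[1]):
        q = V[:, k].copy()
        for j in range(k):
            # Remove the component of q along the previously computed q_j.
            q -= (Q[:, j] @ q) / (Q[:, j] @ Q[:, j]) * Q[:, j]
        Q[:, k] = q
    return Q

# Monomials 1, x, x^2 evaluated on a grid are linearly independent but not orthogonal.
x = np.linspace(-1.0, 1.0, 21)
V = np.vander(x, 3, increasing=True)
Q = gram_schmidt(V)

gram = Q.T @ Q   # off-diagonal entries should be (numerically) zero
print(np.round(gram, 10))
```

The columns are left unnormalized, so the result is an orthogonal rather than an orthonormal basis; dividing each column by its norm would give the latter.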
Working directly with high-dimensional data, such as images, comes with
some difficulties: It is hard to analyze, interpretation is difficult,
visualization is nearly impossible, and (from a practical point of view)
storage of the data vectors can be expensive. (A 640 × 480 pixel color image
is a data point in a roughly million-dimensional space, where every pixel
corresponds to three dimensions, one for each color channel: red, green, and
blue.) However, high-dimensional data often has properties that we can
exploit. For example, high-dimensional data is often overcomplete, i.e., many
dimensions are redundant and can be explained by a combination of other
dimensions. Furthermore, dimensions in high-dimensional data are often
correlated so that the data possesses an intrinsic lower-dimensional
structure. Dimensionality reduction exploits structure and correlation and
allows us to work with a more compact representation of the data, ideally
without losing information. We can think of dimensionality reduction as a
compression technique, similar to JPEG or MP3, which are compression
algorithms for images and music.
   In this chapter, we will discuss principal component analysis (PCA), an
algorithm for linear dimensionality reduction. PCA, proposed by Pearson
(1901) and Hotelling (1933), has been around for more than 100 years
and is still one of the most commonly used techniques for data compression
and data visualization. It is also used for the identification of simple
patterns, latent factors, and structures of high-dimensional data. In the
[Figure 10.1: Illustration: dimensionality reduction. (a) The original dataset does not vary much along the $x_2$ direction. (b) The data from (a) can be represented using the $x_1$-coordinate alone with nearly no loss. Panel titles: (a) Dataset with $x_1$ and $x_2$ coordinates; (b) Compressed dataset where only the $x_1$ coordinate is relevant.]
This material is published by Cambridge University Press as Mathematics for Machine Learning by
Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view
and download for personal use only. Not for re-distribution, re-sale, or use in derivative works.
© by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021. https://mml-book.com.
[Figure 10.3: Examples of handwritten digits from the MNIST dataset. http://yann.lecun.com/exdb/mnist/.]
occurring example, which contains 60,000 examples of handwritten digits
0 through 9. Each digit is a grayscale image of size 28 × 28, i.e., it contains
784 pixels so that we can interpret every image in this dataset as a vector
$x \in \mathbb{R}^{784}$. Examples of these digits are shown in Figure 10.3.
i.e., the variance of the low-dimensional code does not depend on the
mean of the data. Therefore, we assume without loss of generality that the
data has mean $0$ for the remainder of this section. With this assumption
the mean of the low-dimensional code is also $0$ since
$\mathbb{E}_z[z] = \mathbb{E}_x[B^\top x] = B^\top \mathbb{E}_x[x] = 0$.      ♦
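Both statements in this remark are easy to check numerically. In the sketch below, the synthetic data and the matrix `B` with orthonormal columns are assumptions for illustration; data points are stored as rows, so the codes $z_n = B^\top x_n$ are computed as `X @ B`.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data with a clearly non-zero mean (illustration only).
X = rng.normal(size=(500, 3)) * np.array([2.0, 1.0, 0.5]) + np.array([5.0, -3.0, 1.0])

# A hypothetical projection matrix B with two orthonormal columns.
B, _ = np.linalg.qr(rng.normal(size=(3, 2)))

Z = X @ B                              # codes of the original data
X_centered = X - X.mean(axis=0)        # shift the data to mean 0
Z_centered = X_centered @ B            # codes of the centered data

cov_Z = np.cov(Z, rowvar=False)
cov_Z_centered = np.cov(Z_centered, rowvar=False)

print("covariance unchanged by centering:", np.allclose(cov_Z, cov_Z_centered))
print("mean of centered codes:", Z_centered.mean(axis=0))
```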
   $$ z_{1n} = b_1^\top x_n , \tag{10.8} $$
[Figure 10.5: (a) Eigenvalues (sorted in descending order) of the data covariance matrix of all digits “8” in the MNIST training set; axes: index (horizontal), eigenvalue (vertical). (b) Variance captured by the principal components associated with the largest eigenvalues; horizontal axis: number of principal components.]
[Figure 10.6: Illustration of the projection approach: Find a subspace (line) that minimizes the length of the difference vector between projected (orange) and original (blue) data.]
   Taking all digits “8” in the MNIST training data, we compute the
eigenvalues of the data covariance matrix. Figure 10.5(a) shows the 200 largest
eigenvalues of the data covariance matrix. We see that only a few of
them have a value that differs significantly from 0. Therefore, most of
the variance, when projecting data onto the subspace spanned by the
corresponding eigenvectors, is captured by only a few principal components,
as shown in Figure 10.5(b).
where the $\lambda_m$ are the $M$ largest eigenvalues of the data covariance
matrix $S$. Consequently, the variance lost by data compression via PCA is
   $$ J_M := \sum_{j=M+1}^{D} \lambda_j = V_D - V_M . \tag{10.25} $$
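The split of the total variance into a captured part $V_M$ and a lost part $J_M$ in (10.25) can be computed directly from the eigenvalues of the data covariance matrix. The sketch below uses synthetic data instead of MNIST and assumes the covariance is normalized by the number of data points $N$; the function name `pca_variance_split` is hypothetical.

```python
import numpy as np

def pca_variance_split(X, M):
    """Return the sorted eigenvalues of the data covariance matrix, the variance
    V_M captured by the M largest ones, and the lost variance J_M, cf. (10.25)."""
    X_centered = X - X.mean(axis=0)
    S = X_centered.T @ X_centered / X.shape[0]   # data covariance matrix (1/N normalization)
    eigvals = np.linalg.eigvalsh(S)[::-1]        # eigvalsh is ascending; reverse to descending
    V_M = eigvals[:M].sum()
    J_M = eigvals[M:].sum()
    return eigvals, V_M, J_M

# Synthetic data with most variance in the first two dimensions (illustration only).
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4)) * np.array([3.0, 1.0, 0.3, 0.1])

eigvals, V_M, J_M = pca_variance_split(X, M=2)
print("eigenvalues:", eigvals)
print("captured variance V_M:", V_M)
print("lost variance J_M:", J_M)
print("fraction captured:", V_M / (V_M + J_M))
```

`np.linalg.eigvalsh` is used because the covariance matrix is symmetric, which guarantees real eigenvalues returned in sorted order.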
[Figure 10.7: Simplified projection setting. (a) A vector $x \in \mathbb{R}^2$ (red cross) shall be projected onto a one-dimensional subspace $U \subseteq \mathbb{R}^2$ spanned by $b$. (b) shows the difference vectors between $x$ and some candidates $\tilde{x}$. Axes: $x_1$ (horizontal), $x_2$ (vertical).]
                         following, we will look at the difference vectors between the original data
                         xn and their reconstruction x̃n and minimize this distance so that xn and
                         x̃n are as close as possible. Figure 10.6 illustrates this setting.
tion, we would arrive at exactly the same solution, but the notation would
be substantially more cluttered.
   We are interested in finding the best linear projection of $\mathcal{X}$
onto a lower-dimensional subspace $U$ of $\mathbb{R}^D$ with $\dim(U) = M$ and
orthonormal basis vectors $b_1, \dots, b_M$. We will call this subspace $U$
the principal subspace. The projections of the data points are denoted by
   $$ \tilde{x}_n := \sum_{m=1}^{M} z_{mn} b_m = B z_n \in \mathbb{R}^D , \tag{10.28} $$
where we make it explicit that the dimension of the subspace onto which
we project the data is $M$. In order to find this optimal linear projection,
we need to find the orthonormal basis of the principal subspace and the
coordinates $z_n \in \mathbb{R}^M$ of the projections with respect to this basis.
   To find the coordinates $z_n$ and the ONB of the principal subspace, we
follow a two-step approach. First, we optimize the coordinates $z_n$ for a
given ONB $(b_1, \dots, b_M)$; second, we find the optimal ONB.
[Figure 10.8: Optimal projection of a vector $x \in \mathbb{R}^2$ onto a one-dimensional subspace (continuation from Figure 10.7). (a) Distances $\|x - \tilde{x}\|$ for some $\tilde{x} = z_1 b \in U = \mathrm{span}[b]$; see panel (b) for the setting. (b) The vector $\tilde{x}$ that minimizes the distance in panel (a) is its orthogonal projection onto $U$. The coordinate of the projection $\tilde{x}$ with respect to the basis vector $b$ that spans $U$ is the factor we need to scale $b$ in order to “reach” $\tilde{x}$. Axes: (a) $x_1$ versus $\|x - \tilde{x}\|$; (b) $x_1$ versus $x_2$.]
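$$\frac{\partial \tilde{x}_n}{\partial z_{in}} \overset{(10.28)}{=} \frac{\partial}{\partial z_{in}} \left( \sum_{m=1}^{M} z_{mn} b_m \right) = b_i \tag{10.30c}$$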
the basis vector bj that spans that subspace since zj bj = x̃. Figure 10.8(b)
illustrates this setting.
   More generally, if we aim to project onto an M-dimensional subspace of $\mathbb{R}^D$, we obtain the orthogonal projection of x onto the M-dimensional subspace with orthonormal basis vectors $b_1, \ldots, b_M$ as
$$\tilde{x} = B (\underbrace{B^\top B}_{=I})^{-1} B^\top x = B B^\top x , \tag{10.34}$$
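The projection in (10.34) can be checked numerically. The following is a minimal sketch, assuming NumPy and a randomly generated orthonormal basis (obtained here from a QR decomposition, which is our illustrative choice and not part of the derivation); the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 5, 2  # ambient and subspace dimensions (illustrative values)

# Orthonormal basis of an M-dimensional subspace: the Q factor of a
# reduced QR decomposition has orthonormal columns, so B.T @ B = I.
B, _ = np.linalg.qr(rng.normal(size=(D, M)))
x = rng.normal(size=D)

z = B.T @ x          # coordinates of the projection w.r.t. b_1, ..., b_M
x_tilde = B @ z      # orthogonal projection B B^T x, cf. (10.34)

# General projection formula B (B^T B)^{-1} B^T x; identical for an ONB.
x_tilde_general = B @ np.linalg.solve(B.T @ B, B.T @ x)
```

Because the basis is orthonormal, the inverse in the general formula is the identity and both expressions agree; the residual x − x̃ has no component along any basis vector.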
c
2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020).
                      330                 Dimensionality Reduction with Principal Component Analysis
Figure 10.9 Orthogonal projection and displacement vectors. When projecting data points xn (blue) onto subspace U1, we obtain x̃n (orange). The displacement vector x̃n − xn lies completely in the orthogonal complement U2 of U1. [Plot axes: x1, x2; subspaces labeled U and U⊥.]
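                      Since we can generally write the original data point $x_n$ as a linear combination of all basis vectors, it holds that
$$
\begin{aligned}
x_n &= \sum_{d=1}^{D} z_{dn} b_d \overset{(10.32)}{=} \sum_{d=1}^{D} (x_n^\top b_d) b_d = \left( \sum_{d=1}^{D} b_d b_d^\top \right) x_n && (10.37a) \\
&= \left( \sum_{m=1}^{M} b_m b_m^\top \right) x_n + \left( \sum_{j=M+1}^{D} b_j b_j^\top \right) x_n , && (10.37b)
\end{aligned}
$$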
                      where we split the sum with D terms into a sum over M and a sum over D − M terms. With this result, we find that the displacement vector $x_n - \tilde{x}_n$, i.e., the difference vector between the original data point and its projection, is
$$
\begin{aligned}
x_n - \tilde{x}_n &= \left( \sum_{j=M+1}^{D} b_j b_j^\top \right) x_n && (10.38a) \\
&= \sum_{j=M+1}^{D} (x_n^\top b_j) b_j . && (10.38b)
\end{aligned}
$$
                      This means the difference is exactly the projection of the data point onto the orthogonal complement of the principal subspace: We identify the matrix $\sum_{j=M+1}^{D} b_j b_j^\top$ in (10.38a) as the projection matrix that performs this projection. Hence the displacement vector $x_n - \tilde{x}_n$ lies in the subspace that is orthogonal to the principal subspace as illustrated in Figure 10.9.
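To make the statement about the displacement vector concrete, here is a small sketch, assuming NumPy and a random orthonormal basis of $\mathbb{R}^D$ whose first M columns play the role of the principal subspace (an arbitrary choice for illustration, not a fitted PCA basis).

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 5, 2  # illustrative dimensions

# Full orthonormal basis of R^D; split into b_1..b_M and b_{M+1}..b_D.
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))
B = Q[:, :M]          # spans the (hypothetical) principal subspace
B_perp = Q[:, M:]     # spans its orthogonal complement

x_n = rng.normal(size=D)
x_tilde = B @ (B.T @ x_n)                 # projection onto the subspace
displacement = x_n - x_tilde              # left-hand side of (10.38a)

# Right-hand side of (10.38a): projection onto the orthogonal complement.
proj_onto_complement = B_perp @ (B_perp.T @ x_n)
```

The two vectors coincide, and the displacement is orthogonal to every basis vector of the subspace.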
                      Remark (Low-Rank Approximation). In (10.38a), we saw that the projection matrix, which projects x onto x̃, is given by
$$\sum_{m=1}^{M} b_m b_m^\top = B B^\top . \tag{10.39}$$
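We now explicitly compute the squared norm and exploit the fact that the $b_j$ form an ONB, which yields
$$
\begin{aligned}
J_M &= \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} (b_j^\top x_n)^2 = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} b_j^\top x_n \, b_j^\top x_n && (10.42a) \\
&= \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} b_j^\top x_n x_n^\top b_j , && (10.42b)
\end{aligned}
$$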
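where we exploited the symmetry of the dot product in the last step to write $b_j^\top x_n = x_n^\top b_j$. We now swap the sums and obtain
$$
\begin{aligned}
J_M &= \sum_{j=M+1}^{D} b_j^\top \underbrace{\left( \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top \right)}_{=:S} b_j = \sum_{j=M+1}^{D} b_j^\top S b_j && (10.43a) \\
&= \sum_{j=M+1}^{D} \operatorname{tr}(b_j^\top S b_j) = \sum_{j=M+1}^{D} \operatorname{tr}(S b_j b_j^\top) = \operatorname{tr}\Big( \underbrace{\Big( \sum_{j=M+1}^{D} b_j b_j^\top \Big)}_{\text{projection matrix}} S \Big) , && (10.43b)
\end{aligned}
$$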
where we exploited the property that the trace operator tr(·) (see (4.18))
is linear and invariant to cyclic permutations of its arguments. Since we
assumed that our dataset is centered, i.e., E[X ] = 0, we identify S as the
data covariance matrix. Since the projection matrix in (10.43b) is constructed as a sum of rank-one matrices $b_j b_j^\top$, it itself is of rank D − M.
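The chain (10.42a) to (10.43b) can be verified on synthetic data. The sketch below assumes NumPy, data stored column-wise as a D × N matrix, and an eigenvector basis of S as the ONB (any ONB would satisfy the trace identity; the eigenvector choice additionally makes $J_M$ equal to the sum of the discarded eigenvalues). All names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, M = 4, 200, 2

# Synthetic data with different spread per dimension, then centered.
X = rng.normal(size=(D, N)) * np.array([[3.0], [2.0], [1.0], [0.5]])
X = X - X.mean(axis=1, keepdims=True)
S = X @ X.T / N                          # data covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)     # eigenvalues in ascending order
B = eigvecs[:, -M:]                      # M directions with largest eigenvalues
B_perp = eigvecs[:, :-M]                 # remaining D - M directions

# Average squared reconstruction error, computed directly.
X_tilde = B @ (B.T @ X)
J_direct = np.mean(np.sum((X - X_tilde) ** 2, axis=0))

# Same quantity via the trace form (10.43b).
P_perp = B_perp @ B_perp.T               # projection matrix of rank D - M
J_trace = np.trace(P_perp @ S)

# With an eigenvector basis: sum of the D - M smallest eigenvalues.
J_eig = eigvals[:-M].sum()
```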
   Equation (10.43a) implies that we can formulate the average squared
reconstruction error equivalently as the covariance matrix of the data,
Figure 10.10 Embedding of MNIST digits 0 (blue) and 1 (orange) in a two-dimensional principal subspace using PCA. Four embeddings of the digits “0” and “1” in the principal subspace are highlighted in red with their corresponding original digit.
                        Figure 10.10 visualizes the training data of the MNIST digits “0” and “1” embedded in the vector subspace spanned by the first two principal components. We observe a relatively clear separation between “0”s (blue dots) and “1”s (orange dots), and we see the variation within each individual cluster. Four embeddings of the digits “0” and “1” in the principal subspace are highlighted in red with their corresponding original digit. The figure reveals that the variation within the set of “0” is significantly greater than the variation within the set of “1”.
With the results from Section 4.5, we get that the columns of U are the eigenvectors of $X X^\top$ (and therefore S). Furthermore, the eigenvalues $\lambda_d$ of S are related to the singular values of X via
$$\lambda_d = \frac{\sigma_d^2}{N} . \tag{10.49}$$
This relationship between the eigenvalues of S and the singular values
of X provides the connection between the maximum variance view (Sec-
tion 10.2) and the singular value decomposition.
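The relationship between the eigenvalues of S and the singular values of X can be checked with a few lines of NumPy. This is a sketch under the assumption that X is a centered D × N data matrix with data points as columns and $S = \frac{1}{N} X X^\top$; the data are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
D, N = 3, 50

X = rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)    # center the data
S = X @ X.T / N                          # data covariance matrix

# Reduced SVD X = U diag(sigma) V^T; sigma is sorted in descending order.
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

# Eigenvalues of S, reordered to descending order for comparison.
lam = np.linalg.eigvalsh(S)[::-1]

lam_from_svd = sigma ** 2 / N            # lambda_d = sigma_d^2 / N
```

The columns of U are eigenvectors of S, i.e., S U equals U with its columns scaled by the corresponding eigenvalues.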
$$\tilde{X}_M = \underbrace{U_M}_{D \times M} \, \underbrace{\Sigma_M}_{M \times M} \, \underbrace{V_M^\top}_{M \times N} \in \mathbb{R}^{D \times N} \tag{10.51}$$
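A rank-M matrix of the form (10.51) can be built by truncating the SVD. The following sketch assumes NumPy and a random D × N matrix; it also checks the standard fact (Eckart-Young) that the spectral-norm error of this truncation equals the first discarded singular value.

```python
import numpy as np

rng = np.random.default_rng(3)
D, N, M = 6, 40, 2

X = rng.normal(size=(D, N))
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the M largest singular values and their singular vectors.
U_M = U[:, :M]                   # D x M
Sigma_M = np.diag(sigma[:M])     # M x M
Vt_M = Vt[:M, :]                 # M x N
X_tilde_M = U_M @ Sigma_M @ Vt_M # D x N, rank M

# Spectral-norm error of the truncation.
spectral_error = np.linalg.norm(X - X_tilde_M, 2)
```

The same matrix is obtained by projecting the columns of X onto the span of the first M left-singular vectors, i.e., `U_M @ U_M.T @ X`.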
                  and we recover the data covariance matrix again. This now also means
                  that we recover Xcm as an eigenvector of S .
                  Remark. If we want to apply the PCA algorithm that we discussed in Sec-
                  tion 10.6, we need to normalize the eigenvectors Xcm of S so that they
                  have norm 1.                                                         ♦
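A sketch of this idea for the case D ≫ N, assuming NumPy and the convention that the $c_m$ are eigenvectors of the N × N matrix $\frac{1}{N} X^\top X$ (our reading of the surrounding derivation). The data and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
D, N, M = 100, 10, 3             # many more dimensions than data points

X = rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)

# Small N x N eigenproblem instead of the D x D one.
K = X.T @ X / N
lam, C = np.linalg.eigh(K)                   # ascending order
lam, C = lam[::-1][:M], C[:, ::-1][:, :M]    # keep the M largest

V = X @ C                                    # columns X c_m: eigenvectors of S
B = V / np.linalg.norm(V, axis=0)            # normalize to unit length

S = X @ X.T / N                              # formed only to verify the result
```

In practice one would avoid forming the D × D matrix S at all; it appears here only to confirm that the normalized vectors are eigenvectors of S with the same eigenvalues.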
[Figure: six panels (a) to (f); axes x1, x2. Caption fragment: (c) divide by standard deviation; (d) eigendecomposition; (e) projection; (f) mapping back to original data space.]
(a) Original dataset.
(b) Step 1: Centering by subtracting the mean from each data point.
(c) Step 2: Dividing by the standard deviation to make the data unit free. Data has variance 1 along each axis.
(d) Step 3: Compute eigenvalues and eigenvectors (arrows) of the data covariance matrix (ellipse).
(e) Step 4: Project data onto the principal subspace.
(f) Undo the standardization and move projected data back into the original data space from (a).
with coordinates
$$z_* = B^\top x_* \tag{10.60}$$
with respect to the basis of the principal subspace. Here, B is the matrix that contains the eigenvectors that are associated with the largest eigenvalues of the data covariance matrix as columns. PCA returns the coordinates (10.60), not the projections $x_*$.
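The steps illustrated in panels (a) to (f) can be sketched as follows. This is a minimal illustration assuming NumPy and column-wise data; the function names `pca_fit`, `pca_encode`, and `pca_decode` are hypothetical helpers introduced here, and the toy dataset is made up.

```python
import numpy as np

def pca_fit(X, M):
    """X has shape (D, N). Returns mean, standard deviation, and basis B (D, M)."""
    mu = X.mean(axis=1, keepdims=True)           # Step 1: centering
    std = X.std(axis=1, keepdims=True)           # Step 2: standardization
    Xs = (X - mu) / std
    S = Xs @ Xs.T / X.shape[1]                   # covariance of standardized data
    eigvals, eigvecs = np.linalg.eigh(S)         # Step 3: eigendecomposition
    B = eigvecs[:, ::-1][:, :M]                  # eigenvectors of the M largest eigenvalues
    return mu, std, B

def pca_encode(x_star, mu, std, B):
    """Coordinates z_* = B^T x_* of a standardized test point, cf. (10.60)."""
    return B.T @ ((x_star - mu) / std)           # Step 4: projection (coordinates)

def pca_decode(z_star, mu, std, B):
    """Map coordinates back into the original data space."""
    return (B @ z_star) * std + mu               # undo the standardization

rng = np.random.default_rng(5)
A = np.array([[2.0, 0.0], [1.5, 0.3]])
X = A @ rng.normal(size=(2, 500)) + np.array([[5.0], [-1.0]])

mu, std, B = pca_fit(X, M=1)
x_star = np.array([[6.0], [0.0]])
z_star = pca_encode(x_star, mu, std, B)          # one-dimensional code
x_back = pca_decode(z_star, mu, std, B)          # point in the original space
```

Note that the same mean and standard deviation estimated from the training data are applied to the test point before projecting it.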
[Figure panels labeled PCs: 10, PCs: 100, PCs: 500.]
Figure 10.13 Average squared reconstruction error as a function of the number of principal components. The average squared reconstruction error is the sum of the eigenvalues in the orthogonal complement of the principal subspace. [Plot: horizontal axis “Number of PCs”.]
   Come with a likelihood function, and we can explicitly deal with noisy
   observations (which we did not even discuss earlier)
   Allow us to do Bayesian model comparison via the marginal likelihood
   as discussed in Section 8.6
   View PCA as a generative model, which allows us to simulate new data
Figure 10.14 Graphical model for probabilistic PCA. The observations $x_n$ explicitly depend on corresponding latent variables $z_n \sim \mathcal{N}(0, I)$. The model parameters B, µ and the likelihood parameter σ are shared across the dataset. [Graph nodes: z_n, B, µ, x_n, σ; plate over n = 1, . . . , N.]
Remark. Note the direction of the arrow that connects the latent variables z and the observed data x: The arrow points from z to x, which means that the PPCA model assumes a lower-dimensional latent cause z for high-dimensional observations x. In the end, we are obviously interested in finding something out about z given some observations. To get there we will apply Bayesian inference to “invert” the arrow implicitly and go from observations to latent variables. ♦
Figure 10.15 Generating new MNIST digits. The latent variables z can be used to generate new data x̃ = Bz. The closer we stay to the training data, the more realistic the generated data.
   Figure 10.15 shows the latent coordinates of the MNIST digits “8” found
by PCA when using a two-dimensional principal subspace (blue dots). We
can query any vector z ∗ in this latent space and generate an image x̃∗ =
Bz ∗ that resembles the digit “8”. We show eight such generated images
with their corresponding latent space representation. Depending on where
we query the latent space, the generated images look different (shape,
rotation, size, etc.). If we query away from the training data, we see more
and more artifacts, e.g., the top-left and top-right digits. Note that the
intrinsic dimensionality of these generated images is only two.
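                      From Section 6.5, we know that the solution to this integral is a Gaussian distribution with mean
$$p(x, z \mid B, \mu, \sigma^2) = \mathcal{N}\left( \begin{bmatrix} x \\ z \end{bmatrix} \,\middle|\, \begin{bmatrix} \mu \\ 0 \end{bmatrix}, \begin{bmatrix} B B^\top + \sigma^2 I & B \\ B^\top & I \end{bmatrix} \right) , \tag{10.72}$$
                      with a mean vector of length D + M and a covariance matrix of size (D + M) × (D + M).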
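To see how the blocks of (10.72) fit together, the joint mean and covariance can be assembled explicitly. The sketch below assumes NumPy; the values of B, µ, and σ² are random placeholders, not estimates from data.

```python
import numpy as np

rng = np.random.default_rng(6)
D, M = 4, 2
sigma2 = 0.5                          # noise variance sigma^2 (illustrative)
B = rng.normal(size=(D, M))           # placeholder model parameters
mu = rng.normal(size=D)

# Mean vector of length D + M: [mu; 0].
joint_mean = np.concatenate([mu, np.zeros(M)])

# Covariance of size (D + M) x (D + M), assembled from its four blocks.
joint_cov = np.block([
    [B @ B.T + sigma2 * np.eye(D), B],
    [B.T,                          np.eye(M)],
])

# Marginal covariance of x is the upper-left block.
cov_x = joint_cov[:D, :D]
```

For σ² > 0 the assembled matrix is symmetric and positive definite, as required for a valid Gaussian covariance.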
$$\frac{1}{N} \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2 = \frac{1}{N} \sum_{n=1}^{N} \left\| x_n - B B^\top x_n \right\|^2 . \tag{10.76}$$
                        This means we end up with the same objective function as in (10.29) that we discussed in Section 10.3 so that we obtain the PCA solution when we minimize the squared auto-encoding loss. If we replace the linear mapping of PCA with a nonlinear mapping, we get a nonlinear auto-encoder. A prominent example of this is a deep auto-encoder where the linear functions are replaced with deep neural networks. In this context, the encoder is also known as a recognition network or inference network, whereas the decoder is also called a generator.
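The linear auto-encoder view can be illustrated with a short sketch, assuming NumPy, centered column-wise data, and an orthonormal B so that the encoder is $B^\top x$ and the decoder is $Bz$. The function names and the random comparison basis are hypothetical choices made for this illustration.

```python
import numpy as np

def encoder(X, B):
    """Linear encoder: code z = B^T x for every column of X."""
    return B.T @ X

def decoder(Z, B):
    """Linear decoder: reconstruction x_tilde = B z."""
    return B @ Z

def autoencoding_loss(X, B):
    """Average squared auto-encoding loss, cf. (10.76)."""
    X_tilde = decoder(encoder(X, B), B)
    return np.mean(np.sum((X - X_tilde) ** 2, axis=0))

rng = np.random.default_rng(7)
D, N, M = 5, 300, 2
X = rng.normal(size=(D, N)) * np.arange(D, 0, -1)[:, None]
X -= X.mean(axis=1, keepdims=True)

# PCA basis: eigenvectors of S with the M largest eigenvalues.
S = X @ X.T / N
_, eigvecs = np.linalg.eigh(S)
B_pca = eigvecs[:, -M:]

# Some other orthonormal basis of an M-dimensional subspace, for comparison.
B_rand, _ = np.linalg.qr(rng.normal(size=(D, M)))

loss_pca = autoencoding_loss(X, B_pca)
loss_rand = autoencoding_loss(X, B_rand)
```

Among all orthonormal bases of M-dimensional subspaces, the PCA basis attains the smallest loss, so `loss_pca` cannot exceed `loss_rand`.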
                           Another interpretation of PCA is related to information theory. We can think of the code as a smaller or compressed version of the original data point. When we reconstruct our original data using the code, we do not get the exact data point back, but a slightly distorted or noisy version of it. This means that our compression is “lossy”. Intuitively, we want to maximize the correlation between the original data and the lower-dimensional code. More formally, this is related to the mutual information. We would then get the same solution to PCA we discussed in Section 10.3 by maximizing the mutual information, a core concept in information theory (MacKay, 2003).
                           In our discussion on PPCA, we assumed that the parameters of the
                        model, i.e., B, µ, and the likelihood parameter σ 2 , are known. Tipping
                        and Bishop (1999) describe how to derive maximum likelihood estimates
                        for these parameters in the PPCA setting (note that we use a different
                        notation in this chapter). The maximum likelihood parameters, when pro-
where $T \in \mathbb{R}^{D \times M}$ contains M eigenvectors of the data covariance matrix, $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_M) \in \mathbb{R}^{M \times M}$ is a diagonal matrix with the eigenvalues associated with the principal axes on its diagonal, and $R \in \mathbb{R}^{M \times M}$ is an arbitrary orthogonal matrix. The maximum likelihood solution $B_{\mathrm{ML}}$ is unique up to an arbitrary orthogonal transformation, e.g., we can right-multiply $B_{\mathrm{ML}}$ with any rotation matrix R so that (10.78) essentially is a singular value decomposition (see Section 4.5). An outline of the proof is given by Tipping and Bishop (1999).
[Margin note: The matrix $\Lambda - \sigma^2 I$ in (10.78) is guaranteed to be positive semidefinite as the smallest eigenvalue of the data covariance matrix is bounded from below by the noise variance $\sigma^2$.]
   The maximum likelihood estimate for µ given in (10.77) is the sample
mean of the data. The maximum likelihood estimator for the observation
noise variance σ 2 given in (10.79) is the average variance in the orthog-
onal complement of the principal subspace, i.e., the average leftover vari-
ance that we cannot capture with the first M principal components is
treated as observation noise.
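The three maximum likelihood estimates can be sketched in NumPy. We assume here the closed forms of Tipping and Bishop (1999), i.e., $B_{\mathrm{ML}} = T(\Lambda - \sigma^2 I)^{1/2} R$ and $\sigma^2_{\mathrm{ML}} = \frac{1}{D-M}\sum_{j=M+1}^{D} \lambda_j$, written in this chapter's notation; the data are synthetic and R is set to the identity for simplicity.

```python
import numpy as np

rng = np.random.default_rng(8)
D, N, M = 5, 1000, 2
scales = np.array([3.0, 2.0, 1.0, 0.8, 0.6])
X = rng.normal(size=(D, N)) * scales[:, None] + 1.0

# Maximum likelihood estimate of mu: the sample mean.
mu_ml = X.mean(axis=1, keepdims=True)
Xc = X - mu_ml
S = Xc @ Xc.T / N

eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order

# Noise variance: average variance in the orthogonal complement.
sigma2_ml = eigvals[M:].mean()

# B_ML = T (Lambda - sigma^2 I)^{1/2} R with R an arbitrary orthogonal matrix.
T = eigvecs[:, :M]
R = np.eye(M)                                        # simplest choice of R
B_ml = T @ np.diag(np.sqrt(eigvals[:M] - sigma2_ml)) @ R

# Covariance implied by the fitted model.
model_cov = B_ml @ B_ml.T + sigma2_ml * np.eye(D)
```

The model covariance reproduces the M largest eigenvalues of S exactly and replaces the remaining D − M eigenvalues by their average, which is the sense in which the leftover variance is treated as observation noise.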
   In the noise-free limit where σ → 0, PPCA and PCA provide identical solutions: Since the data covariance matrix S is symmetric, it can be diagonalized (see Section 4.4), i.e., there exists a matrix T of eigenvectors of S so that
$$S = T \Lambda T^{-1} . \tag{10.80}$$
In the PPCA model, the data covariance matrix is the covariance matrix of the Gaussian likelihood $p(x \mid B, \mu, \sigma^2)$, which is $B B^\top + \sigma^2 I$, see (10.70b). For σ → 0, we obtain $B B^\top$ so that this data covariance must equal the PCA data covariance (and its factorization given in (10.80)) so that
$$\operatorname{Cov}[X] = T \Lambda T^{-1} = B B^\top \iff B = T \Lambda^{\frac{1}{2}} R , \tag{10.81}$$
i.e., we obtain the maximum likelihood estimate in (10.78) for σ = 0. From (10.78) and (10.80), it becomes clear that (P)PCA performs a decomposition of the data covariance matrix.
   In a streaming setting, where data arrives sequentially, it is recom-
mended to use the iterative expectation maximization (EM) algorithm for
maximum likelihood estimation (Roweis, 1998).
   To determine the dimensionality of the latent variables (the length of
the code, the dimensionality of the lower-dimensional subspace onto which
we project the data), Gavish and Donoho (2014) suggest the heuristic
that, if we can estimate the noise variance σ 2 of the data, we should
                       discard all singular values smaller than $\frac{4\sigma\sqrt{D}}{\sqrt{3}}$. Alternatively, we can use
                       (nested) cross-validation (Section 8.6.1) or Bayesian model selection cri-
                       teria (discussed in Section 8.6.2) to determine a good estimate of the
                       intrinsic dimensionality of the data (Minka, 2001b).
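As a small illustration of the singular-value heuristic, the following sketch applies the threshold $4\sigma\sqrt{D}/\sqrt{3}$ to a list of singular values. It assumes NumPy, a known noise standard deviation σ, and that the conditions under which Gavish and Donoho (2014) derive this threshold apply; the helper name `hard_threshold` and the numbers are made up.

```python
import numpy as np

def hard_threshold(singular_values, sigma, D):
    """Return the threshold 4*sigma*sqrt(D)/sqrt(3) and the number of
    singular values that are not discarded (hypothetical helper)."""
    tau = 4.0 * sigma * np.sqrt(D) / np.sqrt(3.0)
    keep = singular_values >= tau
    return tau, int(np.sum(keep))

# Made-up singular values, sorted in descending order.
singular_values = np.array([30.0, 12.0, 5.0, 3.0, 1.0])
tau, M = hard_threshold(singular_values, sigma=1.0, D=3)
```

With these illustrative numbers the threshold is 4, so the three largest singular values are retained and M = 3 would be the suggested latent dimensionality.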
                          Similar to our discussion on linear regression in Chapter 9, we can place
                       a prior distribution on the parameters of the model and integrate them
                       out. By doing so, we (a) avoid point estimates of the parameters and the
                       issues that come with these point estimates (see Section 8.6) and (b) al-
                       low for an automatic selection of the appropriate dimensionality M of the
                       latent space. In this Bayesian PCA, which was proposed by Bishop (1999),
                       a prior p(µ, B, σ 2 ) is placed on the model parameters. The generative
                       process allows us to integrate the model parameters out instead of condi-
                       tioning on them, which addresses overfitting issues. Since this integration
                       is analytically intractable, Bishop (1999) proposes to use approximate in-
                       ference methods, such as MCMC or variational inference. We refer to the
                       work by Gilks et al. (1996) and Blei et al. (2017) for more details on these
                       approximate inference techniques.
                         In PPCA, we considered the linear model $p(x_n \mid z_n) = \mathcal{N}(x_n \mid B z_n + \mu, \sigma^2 I)$ with prior $p(z_n) = \mathcal{N}(0, I)$, where all observation dimensions are affected by the same amount of noise. If we allow each observation dimension d to have a different variance $\sigma_d^2$, we obtain factor analysis (FA) (Spearman, 1904; Bartholomew et al., 2011). This means that FA gives the likelihood some more flexibility than PPCA, but still forces the data to be explained by the model parameters B, µ. However, FA no longer allows for a closed-form maximum likelihood solution so that we need to use an iterative scheme, such as the expectation maximization algorithm, to estimate the model parameters. While in PPCA all stationary points are global optima, this no longer holds for FA. Compared to PPCA, FA does not change if we scale the data, but it does return different solutions if we rotate the data.
[Margin note: An overly flexible likelihood would be able to explain more than just the noise.]
                          An algorithm that is also closely related to PCA is independent component analysis (ICA) (Hyvarinen et al., 2001). Starting again with the latent-variable perspective $p(x_n \mid z_n) = \mathcal{N}(x_n \mid B z_n + \mu, \sigma^2 I)$, we now change the prior on $z_n$ to non-Gaussian distributions. ICA can be used for blind-source separation. Imagine you are in a busy train station with many people talking. Your ears play the role of microphones, and they linearly mix different speech signals in the train station. The goal of blind-source separation is to identify the constituent parts of the mixed signals. As discussed previously in the context of maximum likelihood estimation for PPCA, the original PCA solution is invariant to any rotation. Therefore, PCA can identify the best lower-dimensional subspace in which the signals live, but not the signals themselves (Murphy, 2012). ICA addresses this issue by modifying the prior distribution p(z) on the latent sources
                                                                   11
Figure 11.1 Two-dimensional dataset that cannot be meaningfully represented by a Gaussian. [Plot axes: x1, x2.]
                      348
                      This material is published by Cambridge University Press as Mathematics for Machine Learning by
                      Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view
                      and download for personal use only. Not for re-distribution, re-sale, or use in derivative works.
                      © by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021. https://mml-book.com.
                                        11.1 Gaussian Mixture Model
\[ p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \tag{11.3} \]
\[ 0 \leqslant \pi_k \leqslant 1 , \quad \sum_{k=1}^{K} \pi_k = 1 , \tag{11.4} \]
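
As a minimal sketch of (11.3) and (11.4), the following snippet evaluates a
univariate GMM density. The function names normal_pdf and gmm_pdf and all
parameter values are our own illustrative assumptions and do not come from the
text; the scalar case is used only to keep the example short.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_pdf(x, weights, means, variances):
    """Mixture density p(x | theta) = sum_k pi_k N(x | mu_k, sigma_k^2), cf. (11.3)."""
    # Constraints from (11.4): every weight lies in [0, 1] and they sum to 1.
    assert all(0.0 <= w <= 1.0 for w in weights)
    assert abs(sum(weights) - 1.0) < 1e-12
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


# Hypothetical parameters, chosen only for illustration.
weights = [0.5, 0.2, 0.3]
means = [-2.0, 1.0, 4.0]
variances = [0.5, 2.0, 3.0]

# Crude Riemann sum over [-30, 30): a valid density integrates to about 1.
step = 0.01
area = sum(gmm_pdf(-30.0 + i * step, weights, means, variances) * step
           for i in range(6000))
print(f"approximate integral of the mixture density: {area:.4f}")
```

Because the mixture is a convex combination of normalized densities, the
numerical integral should be close to 1 for any weights satisfying (11.4).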
                       350                                           Density Estimation with Gaussian Mixture Models
[Figure: mixture density p(x) plotted against x. Surviving caption text: "...
convex combination of Gaussian distributions and is more expressive than any
individual component. Dashed lines represent the weighted Gaussian components."]
                                                                  (11.11)
This simple form allows us to find closed-form maximum likelihood estimates of
$\mu$ and $\Sigma$, as discussed in Chapter 8. In (11.10), we cannot move the
log into the sum over $k$ so that we cannot obtain a simple closed-form maximum
likelihood solution.                                                        ♦
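   Any local optimum of a function exhibits the property that its gradient with
respect to the parameters must vanish (necessary condition); see Chapter 7. In
our case, we obtain the following necessary conditions when we optimize the
log-likelihood in (11.10) with respect to the GMM parameters
$\mu_k, \Sigma_k, \pi_k$:
\begin{align}
\frac{\partial \mathcal{L}}{\partial \mu_k} = 0^\top
  &\iff \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \mu_k} = 0^\top , \tag{11.12} \\
\frac{\partial \mathcal{L}}{\partial \Sigma_k} = 0
  &\iff \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \Sigma_k} = 0 , \tag{11.13} \\
\frac{\partial \mathcal{L}}{\partial \pi_k} = 0
  &\iff \sum_{n=1}^{N} \frac{\partial \log p(x_n \mid \theta)}{\partial \pi_k} = 0 . \tag{11.14}
\end{align}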
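For all three necessary conditions, by applying the chain rule (see Section
5.2.2), we require partial derivatives of the form
\[ \frac{\partial \log p(x_n \mid \theta)}{\partial \theta}
   = \frac{1}{p(x_n \mid \theta)} \frac{\partial p(x_n \mid \theta)}{\partial \theta} , \tag{11.15} \]
where $\theta = \{\mu_k, \Sigma_k, \pi_k, k = 1, \ldots, K\}$ are the model
parameters and
\[ \frac{1}{p(x_n \mid \theta)}
   = \frac{1}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} . \tag{11.16} \]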
                                              11.2.1 Responsibilities
We define the quantity
\[ r_{nk} := \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}
                  {\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \tag{11.17} \]
as the responsibility of the $k$th mixture component for the $n$th data point.
The responsibility $r_{nk}$ of the $k$th mixture component for data point $x_n$
is proportional to the likelihood
\[ p(x_n \mid \pi_k, \mu_k, \Sigma_k) = \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \tag{11.18} \]
of the mixture component given the data point. Therefore, mixture components
have a high responsibility for a data point when the data point could be a
plausible sample from that mixture component.
Margin note: $r_n$ follows a Boltzmann/Gibbs distribution.
Note that $r_n := [r_{n1}, \ldots, r_{nK}]^\top \in \mathbb{R}^K$ is a
(normalized) probability vector, i.e.,
Here we used the identity from (11.16) and the result of the partial deriva-
tive in (11.21b) to get to (11.22b). The values rnk are the responsibilities
we defined in (11.17).
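
The responsibilities in (11.17) can be computed directly from the current
parameters. The sketch below does this for a univariate GMM; the dataset, the
parameter values, and the helper names are assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def responsibilities(data, weights, means, variances):
    """Return r[n][k] as defined in (11.17), for a univariate GMM."""
    r = []
    for x in data:
        # Numerators pi_k N(x_n | mu_k, sigma_k^2), cf. (11.18).
        weighted = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
        total = sum(weighted)  # p(x_n | theta), the denominator from (11.16)
        r.append([wk / total for wk in weighted])
    return r


# Illustrative data and parameters (assumed, not quoted from the text).
data = [-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0]
weights = [1 / 3, 1 / 3, 1 / 3]
means = [-4.0, 0.0, 8.0]
variances = [1.0, 0.2, 3.0]

r = responsibilities(data, weights, means, variances)
for x, row in zip(data, r):
    print(x, [round(v, 3) for v in row])
```

Each row of r sums to 1, so it can be read as a probability vector over the
mixture components for the corresponding data point.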
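   We now solve (11.22c) for $\mu_k^{\text{new}}$ so that
$\frac{\partial \mathcal{L}(\mu_k^{\text{new}})}{\partial \mu_k} = 0^\top$ and obtain
\[ \sum_{n=1}^{N} r_{nk} x_n = \sum_{n=1}^{N} r_{nk} \mu_k^{\text{new}}
   \iff \mu_k^{\text{new}} = \frac{\sum_{n=1}^{N} r_{nk} x_n}{\sum_{n=1}^{N} r_{nk}}
   = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} x_n , \tag{11.23} \]
where we defined
\[ N_k := \sum_{n=1}^{N} r_{nk} . \tag{11.24} \]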
Therefore, the mean $\mu_k$ is pulled toward a data point $x_n$ with strength
given by $r_{nk}$. The means are pulled more strongly toward data points for
which the corresponding mixture component has a high responsibility, i.e., a
high likelihood. Figure 11.4 illustrates this. We can also interpret the mean
update in (11.20) as the expected value of all data points under the
distribution given by
\[ r_k := [r_{1k}, \ldots, r_{Nk}]^\top / N_k , \tag{11.25} \]
which is a normalized probability vector, i.e.,
\[ \mu_k \leftarrow \mathbb{E}_{r_k}[\mathcal{X}] . \tag{11.26} \]

Figure 11.4 Update of the mean parameter of mixture component in a GMM. The
mean $\mu$ is being pulled toward individual data points with the weights given
by the corresponding responsibilities. (Diagram: data points $x_1, x_2, x_3$
pull $\mu$ with weights $r_1, r_2, r_3$.)

Example 11.3 (Mean Updates)
  In our example from Figure 11.3, the mean values are updated as fol-
lows:
                                           µ1 : −4 → −2.7                                           (11.27)
                                           µ2 : 0 → −0.4                                            (11.28)
                                           µ3 : 8 → 3.7                                             (11.29)
Here we see that the means of the first and third mixture components
move toward the regime of the data, whereas the mean of the second
component does not change so dramatically. Figure 11.5 illustrates this
change, where Figure 11.5(a) shows the GMM density prior to updating
the means and Figure 11.5(b) shows the GMM density after updating the
mean values µk .
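
A mean update of this kind can be sketched in a few lines. The code below
computes the responsibilities and then applies the weighted average from
(11.23). The dataset and starting parameters are assumed for illustration and
are not read off the figures, so the printed values need not agree with (11.27)
to (11.29).

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def update_means(data, weights, means, variances):
    """One mean update mu_k <- (1 / N_k) sum_n r_nk x_n, cf. (11.23)."""
    # Responsibilities r[n][k] from the current parameters, cf. (11.17).
    r = []
    for x in data:
        weighted = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
        total = sum(weighted)
        r.append([wk / total for wk in weighted])

    new_means = []
    for k in range(len(means)):
        n_k = sum(row[k] for row in r)  # total responsibility N_k, cf. (11.24)
        new_means.append(sum(row[k] * x for row, x in zip(r, data)) / n_k)
    return new_means


# Illustrative data and parameters (assumed, not quoted from the text).
data = [-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0]
weights = [1 / 3, 1 / 3, 1 / 3]
means = [-4.0, 0.0, 8.0]
variances = [1.0, 0.2, 3.0]

new_means = update_means(data, weights, means, variances)
for old, new in zip(means, new_means):
    print(f"{old:5.1f} -> {new:6.3f}")
```

Since every updated mean is a convex combination of the data points, it always
lies between the smallest and the largest observation.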
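and obtain (after some rearranging) the desired partial derivative required
in (11.31) as
\[ \frac{\partial p(x_n \mid \theta)}{\partial \Sigma_k}
   = \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
     \cdot \left[ -\tfrac{1}{2} \left( \Sigma_k^{-1}
       - \Sigma_k^{-1} (x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1} \right) \right] . \tag{11.35} \]
Combining (11.15), (11.16), and (11.35), the partial derivative of the
log-likelihood with respect to $\Sigma_k$ is
\begin{align}
\frac{\partial \mathcal{L}}{\partial \Sigma_k}
  &= \sum_{n=1}^{N} \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}
                         {\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}
     \cdot \left[ -\tfrac{1}{2} \left( \Sigma_k^{-1}
       - \Sigma_k^{-1} (x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1} \right) \right] \tag{11.36b} \\
  &= -\frac{1}{2} \sum_{n=1}^{N} r_{nk} \left( \Sigma_k^{-1}
       - \Sigma_k^{-1} (x_n - \mu_k)(x_n - \mu_k)^\top \Sigma_k^{-1} \right) \tag{11.36c} \\
  &= -\frac{1}{2} \Sigma_k^{-1} \underbrace{\sum_{n=1}^{N} r_{nk}}_{=N_k}
     + \frac{1}{2} \Sigma_k^{-1}
       \left( \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top \right) \Sigma_k^{-1} . \tag{11.36d}
\end{align}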
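We see that the responsibilities $r_{nk}$ also appear in this partial
derivative. Setting this partial derivative to $0$, we obtain the necessary
optimality condition
\begin{align}
N_k \Sigma_k^{-1}
  &= \Sigma_k^{-1} \left( \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top \right) \Sigma_k^{-1} \tag{11.37a} \\
\iff N_k I
  &= \left( \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top \right) \Sigma_k^{-1} . \tag{11.37b}
\end{align}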
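
Right-multiplying (11.37b) by $\Sigma_k$ and dividing by $N_k$ gives the
responsibility-weighted covariance
$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top$.
The following sketch evaluates this expression for a single component using
plain Python lists. The function name update_covariance, the data, and the
responsibilities are hypothetical and serve only to illustrate the formula.

```python
def update_covariance(data, resp_k, mean_k):
    """Weighted covariance (1 / N_k) sum_n r_nk (x_n - mu_k)(x_n - mu_k)^T.

    data:   list of D-dimensional points x_n
    resp_k: responsibilities r_nk of one fixed component k
    mean_k: mean mu_k of that component
    """
    dim = len(mean_k)
    n_k = sum(resp_k)  # total responsibility N_k
    cov = [[0.0] * dim for _ in range(dim)]
    for x, r in zip(data, resp_k):
        diff = [xi - mi for xi, mi in zip(x, mean_k)]
        for i in range(dim):
            for j in range(dim):
                # Accumulate the weighted outer product entry by entry.
                cov[i][j] += r * diff[i] * diff[j] / n_k
    return cov


# Hypothetical example: four corner points of a square, equal responsibilities.
data = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
resp_k = [0.5, 0.5, 0.5, 0.5]
mean_k = [1.0, 1.0]

cov = update_covariance(data, resp_k, mean_k)
print(cov)
```

For this symmetric configuration the off-diagonal entries cancel, so the result
is the identity matrix.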
                      Here we see that the variances of the first and third components shrink
                      significantly, whereas the variance of the second component increases
                      slightly.
                         Figure 11.6 illustrates this setting. Figure 11.6(a) is identical (but
                      zoomed in) to Figure 11.5(b) and shows the GMM density and its indi-
                      vidual components prior to updating the variances. Figure 11.6(b) shows
                      the GMM density after updating the variances.
Figure 11.6 Effect of updating the variances in a GMM. (a) GMM density and
individual components prior to updating the variances; (b) GMM density and
individual components after updating the variances while retaining the means
and mixture weights. (Axes: x horizontal, p(x) vertical; legend:
$\pi_k \mathcal{N}(x \mid \mu_k, \sigma_k^2)$ for $k = 1, 2, 3$.)
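where $\mathcal{L}$ is the log-likelihood from (11.10) and the second term
encodes the equality constraint that all the mixture weights need to sum up to
1. We obtain the partial derivative of the Lagrangian $\mathfrak{L}$ with
respect to $\pi_k$ as
\begin{align}
\frac{\partial \mathfrak{L}}{\partial \pi_k}
  &= \sum_{n=1}^{N} \frac{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}
                         {\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} + \lambda \tag{11.44a} \\
  &= \frac{1}{\pi_k} \underbrace{\sum_{n=1}^{N}
       \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}
            {\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}_{=N_k} + \lambda
   = \frac{N_k}{\pi_k} + \lambda , \tag{11.44b}
\end{align}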
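
Setting (11.44b) to zero gives $\pi_k = -N_k / \lambda$. Summing over $k$ and
using the constraint that the weights sum to 1, together with
$\sum_{k} N_k = N$, yields $\lambda = -N$ and hence $\pi_k = N_k / N$. The
sketch below checks this numerically; the responsibility matrix is a
hypothetical example and update_weights is our own helper name.

```python
def update_weights(resp):
    """Return (pi, N_k) with pi_k = N_k / N for responsibilities resp[n][k]."""
    num_points = len(resp)
    num_components = len(resp[0])
    n_k = [sum(row[k] for row in resp) for k in range(num_components)]
    return [value / num_points for value in n_k], n_k


# Hypothetical responsibilities; each row sums to 1.
resp = [
    [0.9, 0.1, 0.0],
    [0.6, 0.4, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.2, 0.8],
]

new_weights, n_k = update_weights(resp)
lam = -len(resp)  # Lagrange multiplier lambda = -N

# Stationarity condition N_k / pi_k + lambda = 0 from (11.44b).
stationarity = [nk / w + lam for nk, w in zip(n_k, new_weights)]
print(new_weights, stationarity)
```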
                                                            11.3 EM Algorithm
Unfortunately, the updates in (11.20), (11.30), and (11.42) do not constitute a
closed-form solution for the updates of the parameters $\mu_k, \Sigma_k, \pi_k$
of the mixture model because the responsibilities $r_{nk}$ depend on those
parameters in a complex way. However, the results suggest a simple iterative
scheme for finding a solution to the parameter estimation problem via
maximum likelihood. The expectation maximization algorithm (EM algo-
[Figure 11.8 EM algorithm applied to the GMM from Figure 11.2. (a) Final GMM
fit; ... Plotted quantities: GMM density p(x) with components
$\pi_k \mathcal{N}(x \mid \mu_k, \sigma_k^2)$, $k = 1, 2, 3$, and the negative
log-likelihood.]
Figure 11.9 Illustration of the EM algorithm for fitting a Gaussian mixture
model with three components to a two-dimensional dataset. (a) Dataset;
(b) negative log-likelihood (lower is better) as a function of the EM
iterations. The red dots indicate the iterations for which the mixture
components of the corresponding GMM fits are shown in (c) through (f). The
yellow discs indicate the means of the Gaussian mixture components.
Figure 11.10(a) shows the final GMM fit.
Panels (axes x1 and x2 unless noted): (a) Dataset. (b) Negative log-likelihood
against EM iteration. (c) EM initialization. (d) EM after one iteration.
   When we run EM on our example from Figure 11.3, we obtain the final result
shown in Figure 11.8(a) after five iterations, and Figure 11.8(b) shows how the
negative log-likelihood evolves as a function of the EM iterations. The final
GMM is given as
\[ p(x) = 0.29\,\mathcal{N}(x \mid -2.75, 0.06) + 0.28\,\mathcal{N}(x \mid -0.50, 0.25)
        + 0.43\,\mathcal{N}(x \mid 3.64, 1.63) . \tag{11.57} \]
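
To make the iteration concrete, here is a minimal sketch of EM for a univariate
GMM: an E-step that evaluates the responsibilities, followed by an M-step that
updates means, variances, and mixture weights. The dataset, the initialization,
the function names, and the small variance floor are assumptions on our part;
the floor is a numerical safeguard that is not part of the derivation.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def neg_log_likelihood(data, weights, means, variances):
    """Negative log-likelihood of the data under the mixture."""
    return -sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_step(data, weights, means, variances, var_floor=1e-6):
    """One EM iteration; returns updated (weights, means, variances)."""
    n, k_total = len(data), len(means)
    # E-step: responsibilities r[n][k] under the current parameters.
    resp = []
    for x in data:
        weighted = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
        total = sum(weighted)
        resp.append([wk / total for wk in weighted])
    # M-step: re-estimate the parameters from the responsibilities.
    n_k = [sum(row[k] for row in resp) for k in range(k_total)]
    new_means = [sum(row[k] * x for row, x in zip(resp, data)) / n_k[k]
                 for k in range(k_total)]
    new_vars = [
        max(sum(row[k] * (x - new_means[k]) ** 2
                for row, x in zip(resp, data)) / n_k[k], var_floor)
        for k in range(k_total)
    ]
    new_weights = [n_k[k] / n for k in range(k_total)]
    return new_weights, new_means, new_vars


# Illustrative data and initialization (assumed, not quoted from the text).
data = [-3.0, -2.5, -1.0, 0.0, 2.0, 4.0, 5.0]
params = ([1 / 3, 1 / 3, 1 / 3], [-4.0, 0.0, 8.0], [1.0, 0.2, 3.0])

history = [neg_log_likelihood(data, *params)]
for _ in range(5):
    params = em_step(data, *params)
    history.append(neg_log_likelihood(data, *params))

print([round(value, 3) for value in history])
print([[round(p, 2) for p in group] for group in params])
```

The list history records the negative log-likelihood before the first update
and after each iteration; for EM it should never increase from one iteration to
the next.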
[Figure 11.10: (a) GMM fit after 62 iterations. (b) Dataset colored according
to the responsibilities of the mixture components. Surviving caption text:
"... EM converges; (b) each data point is colored according to the
responsibilities of the mixture components." Axes: x1, x2.]
the corresponding final GMM fit. Figure 11.10(b) visualizes the final re-
sponsibilities of the mixture components for the data points. The dataset is
colored according to the responsibilities of the mixture components when
EM converges. While a single mixture component is clearly responsible
for the data on the left, the overlap of the two data clusters on the right
could have been generated by two mixture components. It becomes clear
that there are data points that cannot be uniquely assigned to a single
component (either blue or yellow), such that the responsibilities of these
two clusters for those points are around 0.5.
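so that
\[ p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k) . \tag{11.58} \]
\[ \pi_k = p(z_k = 1) \tag{11.60} \]
for $k = 1, \ldots, K$, so that
\[ p(x, z) = \begin{bmatrix} p(x, z_1 = 1) \\ \vdots \\ p(x, z_K = 1) \end{bmatrix}
           = \begin{bmatrix} \pi_1 \mathcal{N}(x \mid \mu_1, \Sigma_1) \\ \vdots \\
                             \pi_K \mathcal{N}(x \mid \mu_K, \Sigma_K) \end{bmatrix} , \tag{11.62} \]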
which fully specifies the probabilistic model.
                                      11.4.2 Likelihood
To obtain the likelihood $p(x \mid \theta)$ in a latent-variable model, we need
to marginalize out the latent variables (see Section 8.4.3). In our case, this
can be done by summing out all latent variables from the joint $p(x, z)$
in (11.62) so that
\[ p(x \mid \theta) = \sum_{z} p(x \mid \theta, z)\, p(z \mid \theta) , \quad
   \theta := \{\mu_k, \Sigma_k, \pi_k : k = 1, \ldots, K\} . \tag{11.63} \]
We now explicitly condition on the parameters $\theta$ of the probabilistic
model, which we previously omitted. In (11.63), we sum over all $K$ possible
one-hot encodings of $z$, which is denoted by $\sum_{z}$. Since there is only a
single nonzero entry in each $z$, there are only $K$ possible
configurations/settings of $z$. For example, if $K = 3$, then $z$ can have the
configurations
\[ \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} , \quad
   \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} , \quad
   \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} . \tag{11.64} \]
Summing over all possible configurations of $z$ in (11.63) is equivalent to
looking at the nonzero entry of the $z$-vector and writing
\begin{align}
p(x \mid \theta) &= \sum_{z} p(x \mid \theta, z)\, p(z \mid \theta) \tag{11.65a} \\
                 &= \sum_{k=1}^{K} p(x \mid \theta, z_k = 1)\, p(z_k = 1 \mid \theta) \tag{11.65b}
\end{align}
Figure 11.12 Graphical model for a GMM with N data points. (Nodes: $\pi$,
$z_n$, $x_n$, $\mu_k$, $\Sigma_k$; plates over $k = 1, \ldots, K$ and
$n = 1, \ldots, N$.)
                      which is exactly the GMM likelihood from (11.9). Therefore, the latent-
                      variable model with latent indicators zk is an equivalent way of thinking
                      about a Gaussian mixture model.
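
The equivalence can be checked numerically. The sketch below enumerates the
one-hot configurations of z, sums the product of conditional and prior over
them, and compares the result with the mixture density evaluated directly. It
uses a univariate GMM with hypothetical parameters, and all function names are
our own.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def one_hot_configurations(num_components):
    """All one-hot vectors z of the given length, cf. (11.64)."""
    return [[1 if i == k else 0 for i in range(num_components)]
            for k in range(num_components)]


def marginal_via_latents(x, weights, means, variances):
    """Sum over z of p(x | theta, z) p(z | theta), cf. (11.63)."""
    total = 0.0
    for z in one_hot_configurations(len(weights)):
        k = z.index(1)                                   # the single nonzero entry
        prior = weights[k]                               # p(z_k = 1) = pi_k
        conditional = normal_pdf(x, means[k], variances[k])  # p(x | z_k = 1)
        total += conditional * prior
    return total


def mixture_density(x, weights, means, variances):
    """Mixture density evaluated directly as sum_k pi_k N(x | mu_k, sigma_k^2)."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


# Hypothetical parameters, chosen only for illustration.
weights = [0.2, 0.5, 0.3]
means = [-1.0, 0.5, 3.0]
variances = [0.4, 1.0, 2.0]

configs = one_hot_configurations(3)
gap = max(
    abs(marginal_via_latents(x, weights, means, variances)
        - mixture_density(x, weights, means, variances))
    for x in [-2.0, 0.0, 1.5, 4.0]
)
print(configs, gap)
```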
where the expectation of $\log p(x, z \mid \theta)$ is taken with respect to the
posterior $p(z \mid x, \theta^{(t)})$ of the latent variables. The M-step
selects an updated set of model parameters $\theta^{(t+1)}$ by maximizing (11.73b).
   Although an EM iteration does increase the log-likelihood, there are
no guarantees that EM converges to the maximum likelihood solution.
It is possible that the EM algorithm converges to a local maximum of
the log-likelihood. Different initializations of the parameters θ could be
used in multiple EM runs to reduce the risk of ending up in a bad local
optimum. We do not go into further details here, but refer to the excellent
expositions by Rogers and Girolami (2016) and Bishop (2006).
[Figure 11.13, surviving caption text: "... kernel density estimator produces a
smooth estimate of the underlying density, whereas the histogram is an
unsmoothed count measure of how many data points (black) fall into a single
bin." Horizontal axis: x.]

   In this chapter, we discussed mixture models for density estimation.
There is a plethora of density estimation techniques available. In practice,
we often use histograms and kernel density estimation.
   Histograms provide a nonparametric way to represent continuous den-
sities and have been proposed by Pearson (1895). A histogram is con-
structed by “binning” the data space and count, how many data points fall
into each bin. Then a bar is drawn at the center of each bin, and the height
of the bar is proportional to the number of data points within that bin. The
bin size is a critical hyperparameter, and a bad choice can lead to overfit-
ting and underfitting. Cross-validation, as discussed in Section 8.2.4, can
be used to determine a good bin size.                                                              kernel density
   Kernel density estimation, independently proposed by Rosenblatt (1956)                          estimation
and Parzen (1962), is a nonparametric way for density estimation. Given
N i.i.d. samples, the kernel density estimator represents the underlying
distribution as
                                     N
                                 1 X        x − xn
                                                  
                        p(x) =          k            ,              (11.74)
                                N h n=1       h
where k is a kernel function, i.e., a nonnegative function that integrates to
1, and h > 0 is a smoothing/bandwidth parameter, which plays a role similar
to that of the bin size in histograms. Note that we place a kernel on every
single data point xn in the dataset. Commonly used kernel functions are
the uniform distribution and the Gaussian distribution. Kernel density esti-
mates are closely related to histograms, but by choosing a suitable kernel,
we can guarantee smoothness of the density estimate. Figure 11.13 illus-
trates the difference between a histogram and a kernel density estimator
(with a Gaussian-shaped kernel) for a given dataset of 250 data points.
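The two estimators described above can be compared numerically. The following is a minimal sketch, assuming NumPy is available; the synthetic two-component dataset, the bandwidth h = 0.5, the 20 bins, and the helper names `gaussian_kernel` and `kde` are illustrative assumptions and are not taken from the text. It evaluates (11.74) with a Gaussian kernel and checks that both the histogram and the kernel density estimate integrate to approximately 1.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density: nonnegative and integrates to 1."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x_query, samples, h):
    """Evaluate p(x) = 1/(N h) * sum_n k((x - x_n) / h) at each query point."""
    x_query = np.atleast_1d(np.asarray(x_query, dtype=float))
    samples = np.asarray(samples, dtype=float)
    u = (x_query[:, None] - samples[None, :]) / h  # shape (num_queries, N)
    return gaussian_kernel(u).sum(axis=1) / (samples.size * h)

# Hypothetical dataset: 250 draws from a two-component Gaussian mixture.
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2.0, 0.7, 125), rng.normal(3.0, 1.0, 125)])

# Histogram estimate, normalized so that the bar areas sum to 1.
heights, edges = np.histogram(samples, bins=20, density=True)
hist_area = float(np.sum(heights * np.diff(edges)))

# Kernel density estimate on a grid wide enough to cover the tails.
grid = np.linspace(-10.0, 12.0, 2201)
density = kde(grid, samples, h=0.5)
kde_area = float(np.sum(density) * (grid[1] - grid[0]))  # Riemann sum

print(f"histogram area: {hist_area:.3f}, KDE area: {kde_area:.3f}")
```

Shrinking h (or the bin width) makes both estimates follow individual data points more closely, while enlarging it smooths them, which is the overfitting/underfitting trade-off mentioned for the bin size.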
© 2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020).
12 Classification with Support Vector Machines
This material is published by Cambridge University Press as Mathematics for Machine Learning by
Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view
and download for personal use only. Not for re-distribution, re-sale, or use in derivative works.
© by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021. https://mml-book.com.
Figure 12.1 Example 2D data, illustrating the intuition of data where we can find a linear classifier that separates orange crosses from blue discs. Axes: $x^{(1)}$ and $x^{(2)}$.
SVMs. First, the SVM allows for a geometric way to think about supervised
machine learning. While in Chapter 9 we considered the machine learning
problem in terms of probabilistic models and attacked it using maximum
likelihood estimation and Bayesian inference, here we will consider an
alternative approach where we reason geometrically about the machine
learning task. It relies heavily on concepts, such as inner products and
projections, which we discussed in Chapter 3. The second reason why we
find SVMs instructive is that in contrast to Chapter 9, the optimization
problem for SVM does not admit an analytic solution so that we need to
resort to a variety of optimization tools introduced in Chapter 7.
   The SVM view of machine learning is subtly different from the max-
imum likelihood view of Chapter 9. The maximum likelihood view pro-
poses a model based on a probabilistic view of the data distribution, from
which an optimization problem is derived. In contrast, the SVM view starts
by designing a particular function that is to be optimized during training,
based on geometric intuitions. We have seen something similar already
in Chapter 10, where we derived PCA from geometric principles. In the
SVM case, we start by designing a loss function that is to be minimized
on training data, following the principles of empirical risk minimization
(Section 8.2).
   Let us derive the optimization problem corresponding to training an
SVM on example–label pairs. Intuitively, we imagine binary classification
data, which can be separated by a hyperplane as illustrated in Figure 12.1.
Here, every example $x_n$ (a vector of dimension 2) is a two-dimensional
location ($x_n^{(1)}$ and $x_n^{(2)}$), and the corresponding binary label $y_n$ is one of
two different symbols (orange cross or blue disc). “Hyperplane” is a word
that is commonly used in machine learning, and we encountered hyper-
planes already in Section 2.8. A hyperplane is an affine subspace of di-
mension D − 1 (if the corresponding vector space is of dimension D).
The examples consist of two classes (there are two possible labels) that
have features (the components of the vector representing the example)
arranged in such a way as to allow us to separate/classify them by draw-
ing a straight line.
where the second line is obtained by the linearity of the inner product
(Section 3.2). Since we have chosen $x_a$ and $x_b$ to be on the hyperplane,
this implies that $f(x_a) = 0$ and $f(x_b) = 0$ and hence $\langle w, x_a - x_b \rangle = 0$.
Recall that two vectors are orthogonal when their inner product is zero.
Therefore, we obtain that w is orthogonal to any vector on the hyperplane.
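The orthogonality argument can be checked numerically. The sketch below assumes NumPy and the standard dot product as the inner product; the values of `w` and `b`, the two starting points, and the helper `project_onto_hyperplane` are hypothetical choices made only for illustration. It places two points on the hyperplane $\langle w, x \rangle + b = 0$ and confirms that their difference has zero inner product with w.

```python
import numpy as np

def f(x, w, b):
    """Affine function whose zero set is the hyperplane."""
    return float(np.dot(w, x) + b)

def project_onto_hyperplane(x, w, b):
    """Move x along w until it satisfies <w, x> + b = 0."""
    return x - (f(x, w, b) / np.dot(w, w)) * w

# Hypothetical hyperplane in three dimensions.
w = np.array([2.0, -1.0, 0.5])
b = -3.0

# Two arbitrary points, each moved onto the hyperplane.
x_a = project_onto_hyperplane(np.array([1.0, 4.0, -2.0]), w, b)
x_b = project_onto_hyperplane(np.array([-3.0, 0.5, 7.0]), w, b)

# The difference x_a - x_b lies in the hyperplane, so <w, x_a - x_b> vanishes.
inner = float(np.dot(w, x_a - x_b))
print(f"f(x_a) = {f(x_a, w, b):.1e}, f(x_b) = {f(x_b, w, b):.1e}, <w, x_a - x_b> = {inner:.1e}")
```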
Remark. Recall from Chapter 2 that we can think of vectors in different
ways. In this chapter, we think of the parameter vector w as an arrow
indicating a direction, i.e., we consider w to be a geometric vector. In
contrast, we think of the example vector x as a data point (as indicated
by its coordinates), i.e., we consider x to be the coordinates of a vector
with respect to the standard basis.                                     ♦
   When presented with a test example, we classify the example as pos-
itive or negative depending on the side of the hyperplane on which it
occurs. Note that (12.3) not only defines a hyperplane; it additionally de-
fines a direction. In other words, it defines the positive and negative side
of the hyperplane. Therefore, to classify a test example $x_{\text{test}}$, we calculate
the value of the function $f(x_{\text{test}})$ and classify the example as +1 if
$f(x_{\text{test}}) \geq 0$ and −1 otherwise. Thinking geometrically, the positive examples
lie “above” the hyperplane and the negative examples “below” the
hyperplane.
   When training the classifier, we want to ensure that the examples with
positive labels are on the positive side of the hyperplane, i.e.,
    $$\langle w, x_n \rangle + b \geq 0 \quad \text{when} \quad y_n = +1$$    (12.5)
and the examples with negative labels are on the negative side, i.e.,
    $$\langle w, x_n \rangle + b < 0 \quad \text{when} \quad y_n = -1 .$$    (12.6)
Refer to Figure 12.2 for a geometric intuition of positive and negative
examples. These two conditions are often presented in a single equation
    $$y_n(\langle w, x_n \rangle + b) \geq 0 .$$    (12.7)
Equation (12.7) is equivalent to (12.5) and (12.6) when we multiply both
sides of (12.5) and (12.6) with $y_n = 1$ and $y_n = -1$, respectively.
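The decision rule and the combined condition (12.7) translate directly into a few lines of array code. This is a minimal sketch, assuming NumPy and the dot product as the inner product; the toy dataset, the parameters `w` and `b`, and the function names `predict` and `correctly_separated` are assumptions made for illustration.

```python
import numpy as np

def predict(X, w, b):
    """Label each row x of X as +1 if <w, x> + b >= 0 and as -1 otherwise."""
    return np.where(X @ w + b >= 0, 1, -1)

def correctly_separated(X, y, w, b):
    """Check y_n (<w, x_n> + b) >= 0 for every example; returns one boolean per row."""
    return y * (X @ w + b) >= 0

# Hypothetical linearly separable data in two dimensions.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])

# Hypothetical hyperplane parameters.
w = np.array([1.0, 1.0])
b = 0.0

labels = predict(X, w, b)
separated = correctly_separated(X, y, w, b)
print(labels, separated)
```

Flipping the sign of both w and b leaves the hyperplane unchanged but swaps its positive and negative sides, so every prediction is reversed.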
Figure 12.3 Possible separating hyperplanes. There are many linear classifiers (green lines) that separate orange crosses from blue discs. Axes: $x^{(1)}$ and $x^{(2)}$.
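Figure 12.5 Derivation of the margin: $r = \frac{1}{\|w\|}$. Labels in the figure: $x_a$, $x_a'$, $r$, $w$, and the hyperplanes $\langle w, x \rangle + b = 0$ and $\langle w, x \rangle + b = 1$.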
                      problem, we obtain the objective
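    $$\max_{w, b, r} \; \underbrace{r}_{\text{margin}} \quad \text{subject to} \quad \underbrace{y_n(\langle w, x_n \rangle + b) \geq r}_{\text{data fitting}} , \quad \underbrace{\|w\| = 1}_{\text{normalization}} , \quad r > 0 ,$$    (12.10)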
                      which says that we want to maximize the margin r while ensuring that
                      the data lies on the correct side of the hyperplane.
                      Remark. The concept of the margin turns out to be highly pervasive in ma-
                      chine learning. It was used by Vladimir Vapnik and Alexey Chervonenkis
                      to show that when the margin is large, the “complexity” of the function
                      class is low, and hence learning is possible (Vapnik, 2000). It turns out
                      that the concept is useful for various different approaches for theoret-
                      ically analyzing generalization error (Steinwart and Christmann, 2008;
                      Shalev-Shwartz and Ben-David, 2014).                                   ♦
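    $$\max_{w', b, r} \; r^2 \quad \text{subject to} \quad y_n\left( \left\langle \frac{w'}{\|w'\|}, x_n \right\rangle + b \right) \geq r , \quad r > 0 .$$    (12.22)
(Margin note: Note that r > 0 because we assumed linear separability, and hence there is no issue to divide by r.)
Equation (12.22) explicitly states that the distance r is positive. Therefore,
we can divide the first constraint by r, which yields
    $$\max_{w', b, r} \; r^2 \quad \text{subject to} \quad y_n\Bigg( \Bigg\langle \underbrace{\frac{w'}{\|w'\| \, r}}_{w''}, x_n \Bigg\rangle + \underbrace{\frac{b}{r}}_{b''} \Bigg) \geq 1 , \quad r > 0$$    (12.23)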
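renaming the parameters to $w''$ and $b''$. Since $w'' = \frac{w'}{\|w'\| \, r}$, rearranging for
r gives
    $$\|w''\| = \left\| \frac{w'}{\|w'\| \, r} \right\| = \frac{1}{r} \cdot \left\| \frac{w'}{\|w'\|} \right\| = \frac{1}{r} .$$    (12.24)
[Figure panels: (a) Linearly separable data, with a large margin; (b) Non-linearly separable data. Axes: $x^{(1)}$ and $x^{(2)}$.]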
By substituting this result into (12.23), we obtain
    $$\max_{w'', b''} \; \frac{1}{\|w''\|^2} \quad \text{subject to} \quad y_n(\langle w'', x_n \rangle + b'') \geq 1 .$$    (12.25)
The final step is to observe that maximizing $\frac{1}{\|w''\|^2}$ yields the same solution
as minimizing $\frac{1}{2}\|w''\|^2$, which concludes the proof of Theorem 12.1.
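The reparametrization used in this proof is easy to sanity-check with numbers. The sketch below assumes NumPy and the Euclidean norm; the values of `w_prime`, `b`, `r`, and the single example `(x, y)` are arbitrary illustrative choices. It verifies that $\|w''\| = 1/r$ as in (12.24) and that dividing the constraint of (12.22) by r gives the left-hand side of the constraint in (12.23).

```python
import numpy as np

# Hypothetical parameters of the margin formulation (12.22).
w_prime = np.array([3.0, -4.0])
b = 0.5
r = 0.25

# Renamed parameters from (12.23).
w_dd = w_prime / (np.linalg.norm(w_prime) * r)
b_dd = b / r

# (12.24): the norm of the rescaled vector equals 1/r.
norm_w_dd = float(np.linalg.norm(w_dd))

# One hypothetical example-label pair to compare both constraint forms.
x = np.array([1.0, -2.0])
y = 1.0
lhs_22 = y * (np.dot(w_prime / np.linalg.norm(w_prime), x) + b)  # compared against r
lhs_23 = y * (np.dot(w_dd, x) + b_dd)                            # compared against 1

print(norm_w_dd, 1.0 / r, lhs_22 / r, lhs_23)
```

Because r > 0, the inequality direction is preserved by the division, so a point satisfies the constraint in (12.22) exactly when it satisfies the one in (12.23).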
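[Figure caption, continued: ... measures the distance of a positive example $x_+$ to the positive margin hyperplane $\langle w, x \rangle + b = 1$ when $x_+$ is on the wrong side. Hyperplanes labeled in the figure: $\langle w, x \rangle + b = 0$ and $\langle w, x \rangle + b = 1$.]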
[Figure caption, continued: ... convex upper bound of zero-one loss. The plot shows the hinge loss max{0, 1 − t} as a function of t.]
                     This loss can be interpreted as never allowing any examples inside the
                     margin.
   For a given training set $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, we seek to minimize
the total loss, while regularizing the objective with $\ell_2$-regularization (see
Section 8.2.3). Using the hinge loss (12.28) gives us the unconstrained
optimization problem
    $$\min_{w, b} \; \underbrace{\frac{1}{2}\|w\|^2}_{\text{regularizer}} + \underbrace{C \sum_{n=1}^{N} \max\{0, 1 - y_n(\langle w, x_n \rangle + b)\}}_{\text{error term}} .$$    (12.31)
The first term in (12.31) is called the regularization term or the regularizer
(see Section 8.2.3), and the second term is called the loss term or the error
term. Recall from Section 12.2.4 that the term $\frac{1}{2}\|w\|^2$ arises directly from
the margin. In other words, margin maximization can be interpreted as
regularization.
                        In principle, the unconstrained optimization problem in (12.31) can
                     be directly solved with (sub-)gradient descent methods as described in
                     Section 7.1. To see that (12.31) and (12.26a) are equivalent, observe that
                     the hinge loss (12.28) essentially consists of two linear parts, as expressed
                     in (12.29). Consider the hinge loss for a single example-label pair (12.28).
                     We can equivalently replace minimization of the hinge loss over t with a
                     minimization of a slack variable ξ with two constraints. In equation form,
    $$\min_{t} \; \max\{0, 1 - t\}$$    (12.32)
is equivalent to
    $$\min_{\xi, t} \; \xi \quad \text{subject to} \quad \xi \geq 0 , \quad \xi \geq 1 - t .$$    (12.33)
                     By substituting this expression into (12.31) and rearranging one of the
                     constraints, we obtain exactly the soft margin SVM (12.26a).
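As noted above, the unconstrained problem (12.31) can in principle be attacked with subgradient descent. The following is a minimal sketch, assuming NumPy, the dot product as the inner product, and a diminishing step size; the dataset, the step-size schedule, the number of steps, and the function names are illustrative assumptions and not a tuned or recommended solver. For an example with $y_n(\langle w, x_n \rangle + b) < 1$ the hinge term contributes the subgradient $-y_n x_n$ with respect to w and $-y_n$ with respect to b; otherwise it contributes zero.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Value of (1/2)||w||^2 + C * sum_n max(0, 1 - y_n (<w, x_n> + b))."""
    margins = y * (X @ w + b)
    return 0.5 * float(np.dot(w, w)) + C * float(np.sum(np.maximum(0.0, 1.0 - margins)))

def subgradient_descent(X, y, C=1.0, steps=2000, lr=0.01):
    """Minimize the soft margin objective with a step size of lr / sqrt(t)."""
    num_examples, dim = X.shape
    w = np.zeros(dim)
    b = 0.0
    for t in range(1, steps + 1):
        margins = y * (X @ w + b)
        active = margins < 1.0  # examples with nonzero hinge loss
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        step = lr / np.sqrt(t)
        w = w - step * grad_w
        b = b - step * grad_b
    return w, b

# Hypothetical linearly separable training set.
X = np.array([[2.0, 2.0], [3.0, 1.0], [2.5, 3.0],
              [-1.0, -2.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

w, b = subgradient_descent(X, y, C=1.0)
initial_value = soft_margin_objective(np.zeros(2), 0.0, X, y, C=1.0)
final_value = soft_margin_objective(w, b, X, y, C=1.0)
print(f"objective: {initial_value:.3f} -> {final_value:.3f}")
```

The objective is not differentiable where a margin equals exactly 1, which is why a subgradient is used there; any choice between the two one-sided slopes is valid at that point.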
                     Remark. Let us contrast our choice of the loss function in this section to the
                     loss function for linear regression in Chapter 9. Recall from Section 9.2.1
                     that for finding maximum likelihood estimators, we usually minimize the
which is a particular instance of the representer theorem (Kimeldorf and
Wahba, 1970). Equation (12.38) states that the optimal weight vector in
the primal is a linear combination of the examples $x_n$. Recall from
Section 2.6.1 that this means that the solution of the optimization problem
lies in the span of training data. Additionally, the constraint obtained by
setting (12.36) to zero implies that the optimal weight vector is an affine
combination of the examples. The representer theorem turns out to hold
for very general settings of regularized empirical risk minimization
(Hofmann et al., 2008; Argyriou and Dinuzzo, 2014). The theorem has more
general versions (Schölkopf et al., 2001), and necessary and sufficient
conditions on its existence can be found in Yu et al. (2013).
(Margin note: The representer theorem is actually a collection of theorems saying that the solution of minimizing empirical risk lies in the subspace (Section 2.4.3) defined by the examples.)
Remark. The representer theorem (12.38) also provides an explanation
of the name “support vector machine.” The examples $x_n$, for which the
corresponding parameters $\alpha_n = 0$, do not contribute to the solution w at
all. The other examples, where $\alpha_n > 0$, are called support vectors since
they “support” the hyperplane.                                       ♦
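   By substituting the expression for w into the Lagrangian (12.34), we
obtain the dual
    $$\begin{aligned} D(\xi, \alpha, \gamma) = {} & \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle - \sum_{i=1}^{N} y_i \alpha_i \left\langle \sum_{j=1}^{N} y_j \alpha_j x_j, x_i \right\rangle \\ & + C \sum_{i=1}^{N} \xi_i - b \sum_{i=1}^{N} y_i \alpha_i + \sum_{i=1}^{N} \alpha_i - \sum_{i=1}^{N} \alpha_i \xi_i - \sum_{i=1}^{N} \gamma_i \xi_i . \end{aligned}$$    (12.39)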
bilinear (see Section 3.2). Therefore, the first two terms in (12.39) are
over the same objects. These terms (colored blue) can be simplified, and
we obtain the Lagrangian
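    $$D(\xi, \alpha, \gamma) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle + \sum_{i=1}^{N} \alpha_i + \sum_{i=1}^{N} (C - \alpha_i - \gamma_i) \xi_i .$$    (12.40)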
The last term in this equation is a collection of all terms that contain slack
variables $\xi_i$. By setting (12.37) to zero, we see that the last term in (12.40)
is also zero. Furthermore, by using the same equation and recalling that
the Lagrange multipliers $\gamma_i$ are non-negative, we conclude that $\alpha_i \leq C$.
We now obtain the dual optimization problem of the SVM, which is expressed
exclusively in terms of the Lagrange multipliers $\alpha_i$. Recall from
Lagrangian duality (Definition 7.1) that we maximize the dual problem.
This is equivalent to minimizing the negative dual problem, such that we
end up with the dual SVM
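    $$\begin{aligned} \min_{\alpha} \quad & \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle - \sum_{i=1}^{N} \alpha_i \\ \text{subject to} \quad & \sum_{i=1}^{N} y_i \alpha_i = 0 , \\ & 0 \leq \alpha_i \leq C \quad \text{for all } i = 1, \ldots, N . \end{aligned}$$    (12.41)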
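A tiny worked instance makes the dual (12.41) concrete. The sketch below assumes NumPy and the dot product as the inner product; the two-point dataset, the value C = 10, and the helper names are hypothetical. For the points $x_1 = (1, 0)$ with $y_1 = +1$ and $x_2 = (-1, 0)$ with $y_2 = -1$, the equality constraint forces $\alpha_1 = \alpha_2 = a$, the objective reduces to $2a^2 - 2a$, and the minimizer is $a = 1/2$. The weight vector is then recovered as the linear combination of examples $w = \sum_n \alpha_n y_n x_n$ that was substituted to obtain (12.39).

```python
import numpy as np

def dual_objective(alpha, X, y):
    """(1/2) sum_ij y_i y_j a_i a_j <x_i, x_j> - sum_i a_i, as in (12.41)."""
    Q = (y[:, None] * y[None, :]) * (X @ X.T)
    return 0.5 * float(alpha @ Q @ alpha) - float(alpha.sum())

def is_dual_feasible(alpha, y, C, tol=1e-9):
    """Check sum_i y_i a_i = 0 and 0 <= a_i <= C."""
    balanced = abs(float(np.dot(y, alpha))) <= tol
    boxed = bool(np.all(alpha >= -tol) and np.all(alpha <= C + tol))
    return balanced and boxed

def primal_weights(alpha, X, y):
    """Recover w as the linear combination sum_n a_n y_n x_n of the examples."""
    return (alpha * y) @ X

# Hypothetical two-point dataset.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
C = 10.0

alpha_star = np.array([0.5, 0.5])  # minimizer derived by hand in the lead-in
w_star = primal_weights(alpha_star, X, y)
dual_value = dual_objective(alpha_star, X, y)
primal_value = 0.5 * float(np.dot(w_star, w_star))
print(w_star, dual_value, primal_value)
```

In this example both examples have $\alpha_n > 0$, so both are support vectors, and the primal value $\frac{1}{2}\|w\|^2 = 0.5$ equals the negative of the minimized dual objective.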
Figure 12.9 Convex hulls. (a) Convex hull of points, some of which lie within the boundary; (b) convex hulls around positive (blue) and negative (orange) examples. The distance between the two convex sets is the length of the difference vector c − d.
w := c − d . (12.44)
Picking the points c and d as in the preceding cases, and requiring them
to be closest to each other is equivalent to minimizing the length/norm of
w, so that we end up with the corresponding optimization problem
    $$\arg\min_{w} \|w\| = \arg\min_{w} \frac{1}{2}\|w\|^2 .$$    (12.45)
Since c must be in the positive convex hull, it can be expressed as a convex
combination of the positive examples, i.e., for non-negative coefficients
$\alpha_n^+$
    $$c = \sum_{n : y_n = +1} \alpha_n^+ x_n .$$    (12.46)
The objective function (12.48) and the constraint (12.50), along with the
assumption that $\alpha \geq 0$, give us a constrained (convex) optimization problem.
This optimization problem can be shown to be the same as that of
the dual hard margin SVM (Bennett and Bredensteiner, 2000a).
Remark. To obtain the soft margin dual, we consider the reduced hull. The
reduced hull is similar to the convex hull but has an upper bound on the
size of the coefficients α. The maximum possible value of the elements
of α restricts the size that the convex hull can take. In other words, the
bound on α shrinks the convex hull to a smaller volume (Bennett and
Bredensteiner, 2000b).                                                  ♦
                                                           12.4 Kernels
                      Consider the formulation of the dual SVM (12.41). Notice that the in-
                      ner product in the objective occurs only between examples xi and xj .
                      There are no inner products between the examples and the parameters.
                      Therefore, if we consider a set of features φ(xi ) to represent xi , the only
                      change in the dual SVM will be to replace the inner product. This mod-
                      ularity, where the choice of the classification method (the SVM) and the
                      choice of the feature representation φ(x) can be considered separately,
                      provides flexibility for us to explore the two problems independently. In
                      this section, we discuss the representation φ(x) and briefly introduce the
                      idea of kernels, but do not go into the technical details.
                         Since φ(x) could be a non-linear function, we can use the SVM (which
                      assumes a linear classifier) to construct classifiers that are nonlinear in
                      the examples xn . This provides a second avenue, in addition to the soft
                      margin, for users to deal with a dataset that is not linearly separable. It
                      turns out that there are many algorithms and statistical methods that have
                      this property that we observed in the dual SVM: the only inner products
                      are those that occur between examples. Instead of explicitly defining a
                      non-linear feature map φ(·) and computing the resulting inner product
between examples $x_i$ and $x_j$, we define a similarity function $k(x_i, x_j)$ between
$x_i$ and $x_j$. For a certain class of similarity functions, called kernels,
the similarity function implicitly defines a non-linear feature map φ(·).
Kernels are by definition functions $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ for which there exists
a Hilbert space $\mathcal{H}$ and a feature map $\phi : \mathcal{X} \to \mathcal{H}$ such that
    $$k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}} .$$    (12.52)
(Margin note: The inputs $\mathcal{X}$ of the kernel function can be very general and are not necessarily restricted to $\mathbb{R}^D$.)
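To make the defining property (12.52) concrete, the sketch below checks it for one simple kernel. It assumes NumPy; the choice of the homogeneous degree-2 polynomial kernel $k(x, z) = \langle x, z \rangle^2$, the feature map built from all pairwise products, and the random test vectors are illustrative assumptions rather than anything prescribed by the text. The explicit map has $D^2$ coordinates, whereas the kernel needs only one inner product in the original D dimensions.

```python
import numpy as np

def phi(x):
    """Explicit feature map: all D*D pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

def k(x, z):
    """Homogeneous polynomial kernel of degree 2: (<x, z>)^2."""
    return float(np.dot(x, z)) ** 2

# Hypothetical inputs of dimension D = 5.
rng = np.random.default_rng(1)
x = rng.normal(size=5)
z = rng.normal(size=5)

via_features = float(np.dot(phi(x), phi(z)))  # inner product in feature space
via_kernel = k(x, z)                          # same value without forming phi
print(via_features, via_kernel, phi(x).size)
```

The equality holds because $\sum_{i,j} x_i x_j z_i z_j = \big(\sum_i x_i z_i\big)^2$, so the two computations agree up to floating-point rounding.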
[Figure caption, continued: ... nonlinear, the underlying problem being solved is for a linear separating hyperplane (albeit with a nonlinear kernel). Axis label: Second feature.]
                       and Williams, 2006). Figure 12.10 illustrates the effect of different kernels
                       on separating hyperplanes on an example dataset. Note that we are still
solving for hyperplanes, that is, the hypothesis class of functions is still
linear. The non-linear surfaces are due to the kernel function.
                       Remark. Unfortunately for the fledgling machine learner, there are mul-
                       tiple meanings of the word “kernel.” In this chapter, the word “kernel”
                       comes from the idea of the reproducing kernel Hilbert space (RKHS) (Aron-
                       szajn, 1950; Saitoh, 1988). We have discussed the idea of the kernel in lin-
                       ear algebra (Section 2.7.3), where the kernel is another word for the null
                       space. The third common use of the word “kernel” in machine learning is
                       the smoothing kernel in kernel density estimation (Section 11.5).         ♦
Since the explicit representation φ(x) is mathematically equivalent to the
kernel representation k(x_i, x_j), a practitioner will often design the
kernel function such that it can be computed more efficiently than the inner
product between explicit feature maps. For example, consider the polynomial
kernel (Schölkopf and Smola, 2002), where the number of terms in the explicit
expansion grows very quickly (even for polynomials of low degree) when the
input dimension is large. The kernel function only requires one
multiplication per input dimension, which can provide significant
computational savings. Another example is the Gaussian radial basis function
kernel (Schölkopf and Smola, 2002; Rasmussen and Williams, 2006), where the
corresponding feature space is infinite dimensional. In this case, we cannot
explicitly represent the feature space but can still compute similarities
between a pair of examples using the kernel.

The choice of kernel, as well as the parameters of the kernel, is often
chosen using nested cross-validation (Section 8.6.1).

Another useful aspect of the kernel trick is that there is no need for the
original data to be already represented as multivariate real-valued data.
Note that the inner product is defined on the output of the function φ(·),
but does not restrict the input to real numbers. Hence, the function φ(·) and
the kernel function k(·, ·) can be defined on any object, e.g., sets,
sequences, strings, graphs, and distributions (Ben-Hur et al., 2008; Gärtner,
2008; Shi et al., 2009; Sriperumbudur et al., 2010; Vishwanathan et al.,
2010).
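The computational saving of the polynomial kernel can be illustrated with a
small sketch. It assumes the inhomogeneous degree-2 kernel
k(x, y) = (⟨x, y⟩ + 1)^2 and one particular explicit feature map for it; the
function names (poly2_kernel, poly2_features, inner) and the example vectors
are chosen here for illustration and do not come from the text.

```python
import math
from itertools import combinations


def inner(u, v):
    """Standard dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, y):
    """Degree-2 polynomial kernel k(x, y) = (<x, y> + 1)^2 (assumed form)."""
    return (inner(x, y) + 1.0) ** 2


def poly2_features(x):
    """One explicit feature map phi with <phi(x), phi(y)> = (<x, y> + 1)^2."""
    root2 = math.sqrt(2.0)
    constant = [1.0]
    linear = [root2 * a for a in x]
    squares = [a * a for a in x]
    cross = [root2 * a * b for a, b in combinations(x, 2)]
    return constant + linear + squares + cross


# Illustrative inputs with D = 3 dimensions.
x = [1.0, 2.0, 3.0]
y = [0.5, -1.0, 2.0]

# Kernel evaluation: one dot product in the input space (D multiplications).
k_direct = poly2_kernel(x, y)

# Explicit evaluation: build both feature vectors, then take their dot product.
k_explicit = inner(poly2_features(x), poly2_features(y))

# The explicit map has (D + 1)(D + 2) / 2 entries, i.e. 10 for D = 3.
num_features = len(poly2_features(x))

print(k_direct, k_explicit, num_features)
```

Both routes give the same similarity value up to floating-point rounding,
but the explicit feature vector grows quadratically with the input dimension
for degree 2, whereas the kernel evaluation grows only linearly.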
relationship between loss function and the likelihood (also compare
Sections 8.2 and 8.3). The maximum likelihood approach corresponding to a
well-calibrated transformation during training is called logistic regression,
which comes from a class of methods called generalized linear models. Details
of logistic regression from this point of view can be found in Agresti (2002,
chapter 5) and McCullagh and Nelder (1989, chapter 4). Naturally, one could
take a more Bayesian view of the classifier output by estimating a posterior
distribution using Bayesian logistic regression. The Bayesian view also
includes the specification of the prior, which includes design choices such
as conjugacy (Section 6.6.1) with the likelihood. Additionally, one could
consider latent functions as priors, which results in Gaussian process
classification (Rasmussen and Williams, 2006, chapter 3).
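The link between a loss function and a likelihood can be made concrete with
a short sketch. It assumes labels y ∈ {−1, +1}, a real-valued classifier
score, and the logistic sigmoid as the transformation from score to
probability; the function names and example scores are illustrative only.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, mapping a real-valued score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_loss(y, score):
    """Logistic loss log(1 + exp(-y * score)) for a label y in {-1, +1}."""
    return math.log(1.0 + math.exp(-y * score))


def negative_log_likelihood(y, score):
    """Negative log-likelihood of y when p(y = +1 | x) = sigmoid(score)."""
    p_positive = sigmoid(score)
    p_label = p_positive if y == 1 else 1.0 - p_positive
    return -math.log(p_label)


# Hypothetical (label, score) pairs; the scores stand in for f(x).
examples = [(+1, 2.0), (-1, 2.0), (+1, -0.5), (-1, 0.0)]

for y, score in examples:
    # The two columns agree up to rounding: minimizing the logistic loss is
    # the same as maximizing the Bernoulli likelihood under the sigmoid.
    print(y, score, logistic_loss(y, score), negative_log_likelihood(y, score))
```

Under these assumptions the per-example loss and the negative
log-likelihood coincide, which is one way to read logistic regression as a
maximum likelihood method.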
This material is published by Cambridge University Press as Mathematics for Machine Learning by
Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view
and download for personal use only. Not for re-distribution, re-sale, or use in derivative works.
© by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021. https://mml-book.com.
References
Bennett, Kristin P., and Bredensteiner, Erin J. 2000a. Duality and Geometry in SVM
   Classifiers. In: Proceedings of the International Conference on Machine Learning.
Bennett, Kristin P., and Bredensteiner, Erin J. 2000b. Geometry in Learning. Pages
   132–145 of: Geometry at Work. Mathematical Association of America.
Berlinet, Alain, and Thomas-Agnan, Christine. 2004. Reproducing Kernel Hilbert Spaces
   in Probability and Statistics. Springer.
Bertsekas, Dimitri P. 1999. Nonlinear Programming. Athena Scientific.
Bertsekas, Dimitri P. 2009. Convex Optimization Theory. Athena Scientific.
Bickel, Peter J., and Doksum, Kjell. 2006. Mathematical Statistics, Basic Ideas and
   Selected Topics. Vol. 1. Prentice Hall.
Bickson, Danny, Dolev, Danny, Shental, Ori, Siegel, Paul H., and Wolf, Jack K. 2007.
   Linear Detection via Belief Propagation. In: Proceedings of the Annual Allerton Con-
   ference on Communication, Control, and Computing.
Billingsley, Patrick. 1995. Probability and Measure. Wiley.
Bishop, Christopher M. 1995. Neural Networks for Pattern Recognition. Clarendon
   Press.
Bishop, Christopher M. 1999. Bayesian PCA. In: Advances in Neural Information Pro-
   cessing Systems.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
Blei, David M., Kucukelbir, Alp, and McAuliffe, Jon D. 2017. Variational Inference: A
   Review for Statisticians. Journal of the American Statistical Association, 112(518),
   859–877.
Blum, Avrim, and Hardt, Moritz. 2015. The Ladder: A Reliable Leaderboard for
   Machine Learning Competitions. In: International Conference on Machine Learning.
Bonnans, J. Frédéric, Gilbert, J. Charles, Lemaréchal, Claude, and Sagastizábal, Clau-
   dia A. 2006. Numerical Optimization: Theoretical and Practical Aspects. Springer.
Borwein, Jonathan M., and Lewis, Adrian S. 2006. Convex Analysis and Nonlinear
   Optimization. 2nd edn. Canadian Mathematical Society.
Bottou, Léon. 1998. Online Algorithms and Stochastic Approximations. Pages 9–42
   of: Online Learning and Neural Networks. Cambridge University Press.
Bottou, Léon, Curtis, Frank E., and Nocedal, Jorge. 2018. Optimization Methods for
   Large-Scale Machine Learning. SIAM Review, 60(2), 223–311.
Boucheron, Stephane, Lugosi, Gabor, and Massart, Pascal. 2013. Concentration In-
   equalities: A Nonasymptotic Theory of Independence. Oxford University Press.
Boyd, Stephen, and Vandenberghe, Lieven. 2004. Convex Optimization. Cambridge
   University Press.
Boyd, Stephen, and Vandenberghe, Lieven. 2018. Introduction to Applied Linear Alge-
   bra. Cambridge University Press.
Brochu, Eric, Cora, Vlad M., and de Freitas, Nando. 2009. A Tutorial on Bayesian
   Optimization of Expensive Cost Functions, with Application to Active User Modeling
   and Hierarchical Reinforcement Learning. Tech. rept. TR-2009-023. Department of
   Computer Science, University of British Columbia.
Brooks, Steve, Gelman, Andrew, Jones, Galin L., and Meng, Xiao-Li (eds). 2011. Hand-
   book of Markov Chain Monte Carlo. Chapman and Hall/CRC.
Brown, Lawrence D. 1986. Fundamentals of Statistical Exponential Families: With Ap-
   plications in Statistical Decision Theory. Institute of Mathematical Statistics.
Bryson, Arthur E. 1961. A Gradient Method for Optimizing Multi-Stage Allocation
   Processes. In: Proceedings of the Harvard University Symposium on Digital Computers
   and Their Applications.
Bubeck, Sébastien. 2015. Convex Optimization: Algorithms and Complexity. Founda-
   tions and Trends in Machine Learning, 8(3-4), 231–357.
Bühlmann, Peter, and Van De Geer, Sara. 2011. Statistics for High-Dimensional Data.
   Springer.
Grinstead, Charles M., and Snell, J. Laurie. 1997. Introduction to Probability. American
   Mathematical Society.
Hacking, Ian. 2001. Probability and Inductive Logic. Cambridge University Press.
Hall, Peter. 1992. The Bootstrap and Edgeworth Expansion. Springer.
Hallin, Marc, Paindaveine, Davy, and Šiman, Miroslav. 2010. Multivariate
   Quantiles and Multiple-Output Regression Quantiles: From ℓ1 Optimization to
   Halfspace Depth. Annals of Statistics, 38, 635–669.
Hasselblatt, Boris, and Katok, Anatole. 2003. A First Course in Dynamics with a
   Panorama of Recent Developments. Cambridge University Press.
Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. 2001. The Elements of Sta-
   tistical Learning – Data Mining, Inference, and Prediction. Springer.
Hausman, Karol, Springenberg, Jost T., Wang, Ziyu, Heess, Nicolas, and Riedmiller,
   Martin. 2018. Learning an Embedding Space for Transferable Robot Skills. In:
   Proceedings of the International Conference on Learning Representations.
Hazan, Elad. 2015. Introduction to Online Convex Optimization. Foundations and
   Trends in Optimization, 2(3–4), 157–325.
Hensman, James, Fusi, Nicolò, and Lawrence, Neil D. 2013. Gaussian Processes for
   Big Data. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence.
Herbrich, Ralf, Minka, Tom, and Graepel, Thore. 2007. TrueSkill(TM): A Bayesian
   Skill Rating System. In: Advances in Neural Information Processing Systems.
Hiriart-Urruty, Jean-Baptiste, and Lemaréchal, Claude. 2001. Fundamentals of Convex
   Analysis. Springer.
Hoffman, Matthew D., Blei, David M., and Bach, Francis. 2010. Online Learning for
   Latent Dirichlet Allocation. In: Advances in Neural Information Processing Systems.
Hoffman, Matthew D., Blei, David M., Wang, Chong, and Paisley, John. 2013. Stochas-
   tic Variational Inference. Journal of Machine Learning Research, 14(1), 1303–1347.
Hofmann, Thomas, Schölkopf, Bernhard, and Smola, Alexander J. 2008. Kernel Meth-
   ods in Machine Learning. Annals of Statistics, 36(3), 1171–1220.
Hogben, Leslie. 2013. Handbook of Linear Algebra. Chapman and Hall/CRC.
Horn, Roger A., and Johnson, Charles R. 2013. Matrix Analysis. Cambridge University
   Press.
Hotelling, Harold. 1933. Analysis of a Complex of Statistical Variables into Principal
   Components. Journal of Educational Psychology, 24, 417–441.
Hyvarinen, Aapo, Oja, Erkki, and Karhunen, Juha. 2001. Independent Component Anal-
   ysis. Wiley.
Imbens, Guido W., and Rubin, Donald B. 2015. Causal Inference for Statistics, Social
   and Biomedical Sciences. Cambridge University Press.
Jacod, Jean, and Protter, Philip. 2004. Probability Essentials. Springer.
Jaynes, Edwin T. 2003. Probability Theory: The Logic of Science. Cambridge University
   Press.
Jefferys, William H., and Berger, James O. 1992. Ockham’s Razor and Bayesian Anal-
   ysis. American Scientist, 80, 64–72.
Jeffreys, Harold. 1961. Theory of Probability. Oxford University Press.
Jimenez Rezende, Danilo, and Mohamed, Shakir. 2015. Variational Inference with Nor-
   malizing Flows. In: Proceedings of the International Conference on Machine Learning.
Jimenez Rezende, Danilo, Mohamed, Shakir, and Wierstra, Daan. 2014. Stochastic
   Backpropagation and Approximate Inference in Deep Generative Models. In: Pro-
   ceedings of the International Conference on Machine Learning.
Joachims, Thorsten. 1999. Advances in Kernel Methods – Support Vector Learning. MIT
   Press. Chap. Making Large-Scale SVM Learning Practical, pages 169–184.
Jordan, Michael I., Ghahramani, Zoubin, Jaakkola, Tommi S., and Saul, Lawrence K.
   1999. An Introduction to Variational Methods for Graphical Models. Machine Learn-
   ing, 37, 183–233.
Julier, Simon J., and Uhlmann, Jeffrey K. 1997. A New Extension of the Kalman Filter
   to Nonlinear Systems. In: Proceedings of AeroSense Symposium on Aerospace/Defense
   Sensing, Simulation and Controls.
Kaiser, Marcus, and Hilgetag, Claus C. 2006. Nonoptimal Component Placement, but
   Short Processing Paths, Due to Long-Distance Projections in Neural Systems. PLoS
   Computational Biology, 2(7), e95.
Kalman, Dan. 1996. A Singularly Valuable Decomposition: The SVD of a Matrix. Col-
   lege Mathematics Journal, 27(1), 2–23.
Kalman, Rudolf E. 1960. A New Approach to Linear Filtering and Prediction Problems.
   Transactions of the ASME – Journal of Basic Engineering, 82(Series D), 35–45.
Kamthe, Sanket, and Deisenroth, Marc P. 2018. Data-Efficient Reinforcement Learning
   with Probabilistic Model Predictive Control. In: Proceedings of the International
   Conference on Artificial Intelligence and Statistics.
Katz, Victor J. 2004. A History of Mathematics. Pearson/Addison-Wesley.
Kelley, Henry J. 1960. Gradient Theory of Optimal Flight Paths. ARS Journal, 30(10),
   947–954.
Kimeldorf, George S., and Wahba, Grace. 1970. A Correspondence between Bayesian
   Estimation on Stochastic Processes and Smoothing by Splines. Annals of Mathemat-
   ical Statistics, 41(2), 495–502.
Kingma, Diederik P., and Welling, Max. 2014. Auto-Encoding Variational Bayes. In:
   Proceedings of the International Conference on Learning Representations.
Kittler, Josef, and Föglein, Janos. 1984. Contextual Classification of Multispectral Pixel
   Data. Image and Vision Computing, 2(1), 13–29.
Kolda, Tamara G., and Bader, Brett W. 2009. Tensor Decompositions and Applications.
   SIAM Review, 51(3), 455–500.
Koller, Daphne, and Friedman, Nir. 2009. Probabilistic Graphical Models. MIT Press.
Kong, Linglong, and Mizera, Ivan. 2012. Quantile Tomography: Using Quantiles with
   Multivariate Data. Statistica Sinica, 22, 1598–1610.
Lang, Serge. 1987. Linear Algebra. Springer.
Lawrence, Neil D. 2005. Probabilistic Non-Linear Principal Component Analysis with
   Gaussian Process Latent Variable Models. Journal of Machine Learning Research,
   6(Nov.), 1783–1816.
Leemis, Lawrence M., and McQueston, Jacquelyn T. 2008. Univariate Distribution
   Relationships. American Statistician, 62(1), 45–53.
Lehmann, Erich L., and Romano, Joseph P. 2005. Testing Statistical Hypotheses.
   Springer.
Lehmann, Erich Leo, and Casella, George. 1998. Theory of Point Estimation. Springer.
Liesen, Jörg, and Mehrmann, Volker. 2015. Linear Algebra. Springer.
Lin, Hsuan-Tien, Lin, Chih-Jen, and Weng, Ruby C. 2007. A Note on Platt’s Probabilistic
   Outputs for Support Vector Machines. Machine Learning, 68, 267–276.
Ljung, Lennart. 1999. System Identification: Theory for the User. Prentice Hall.
Loosli, Gaëlle, Canu, Stéphane, and Ong, Cheng Soon. 2016. Learning SVM in Kreĭn
   Spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(6),
   1204–1216.
Luenberger, David G. 1969. Optimization by Vector Space Methods. Wiley.
MacKay, David J. C. 1992. Bayesian Interpolation. Neural Computation, 4, 415–447.
MacKay, David J. C. 1998. Introduction to Gaussian Processes. Pages 133–165 of:
   Bishop, C. M. (ed), Neural Networks and Machine Learning. Springer.
MacKay, David J. C. 2003. Information Theory, Inference, and Learning Algorithms.
   Cambridge University Press.
Magnus, Jan R., and Neudecker, Heinz. 2007. Matrix Differential Calculus with Appli-
   cations in Statistics and Econometrics. Wiley.
Ruffini, Paolo. 1799. Teoria Generale delle Equazioni, in cui si Dimostra Impossibile la
   Soluzione Algebraica delle Equazioni Generali di Grado Superiore al Quarto. Stampe-
   ria di S. Tommaso d’Aquino.
Rumelhart, David E., Hinton, Geoffrey E., and Williams, Ronald J. 1986. Learning
   Representations by Back-Propagating Errors. Nature, 323(6088), 533–536.
Sæmundsson, Steindór, Hofmann, Katja, and Deisenroth, Marc P. 2018. Meta Rein-
   forcement Learning with Latent Variable Gaussian Processes. In: Proceedings of the
   Conference on Uncertainty in Artificial Intelligence.
Saitoh, Saburou. 1988. Theory of Reproducing Kernels and its Applications. Longman
   Scientific and Technical.
Särkkä, Simo. 2013. Bayesian Filtering and Smoothing. Cambridge University Press.
Schölkopf, Bernhard, and Smola, Alexander J. 2002. Learning with Kernels – Support
   Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
Schölkopf, Bernhard, Smola, Alexander J., and Müller, Klaus-Robert. 1997. Kernel
   Principal Component Analysis. In: Proceedings of the International Conference on
   Artificial Neural Networks.
Schölkopf, Bernhard, Smola, Alexander J., and Müller, Klaus-Robert. 1998. Nonlinear
   Component Analysis as a Kernel Eigenvalue Problem. Neural Computation, 10(5),
   1299–1319.
Schölkopf, Bernhard, Herbrich, Ralf, and Smola, Alexander J. 2001. A Generalized
   Representer Theorem. In: Proceedings of the International Conference on Computa-
   tional Learning Theory.
Schwartz, Laurent. 1964. Sous Espaces Hilbertiens d’Espaces Vectoriels Topologiques
   et Noyaux Associés. Journal d’Analyse Mathématique, 13, 115–256.
Schwarz, Gideon E. 1978. Estimating the Dimension of a Model. Annals of Statistics,
   6(2), 461–464.
Shahriari, Bobak, Swersky, Kevin, Wang, Ziyu, Adams, Ryan P., and De Freitas, Nando.
   2016. Taking the Human out of the Loop: A Review of Bayesian Optimization.
   Proceedings of the IEEE, 104(1), 148–175.
Shalev-Shwartz, Shai, and Ben-David, Shai. 2014. Understanding Machine Learning:
   From Theory to Algorithms. Cambridge University Press.
Shawe-Taylor, John, and Cristianini, Nello. 2004. Kernel Methods for Pattern Analysis.
   Cambridge University Press.
Shawe-Taylor, John, and Sun, Shiliang. 2011. A Review of Optimization Methodologies
   in Support Vector Machines. Neurocomputing, 74(17), 3609–3618.
Shental, Ori, Siegel, Paul H., Wolf, Jack K., Bickson, Danny, and Dolev, Danny. 2008.
   Gaussian Belief Propagation Solver for Systems of Linear Equations. Pages 1863–
   1867 of: Proceedings of the International Symposium on Information Theory.
Shewchuk, Jonathan R. 1994. An Introduction to the Conjugate Gradient Method with-
   out the Agonizing Pain.
Shi, Jianbo, and Malik, Jitendra. 2000. Normalized Cuts and Image Segmentation.
   IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905.
Shi, Qinfeng, Petterson, James, Dror, Gideon, Langford, John, Smola, Alexander J.,
   and Vishwanathan, S. V. N. 2009. Hash Kernels for Structured Data. Journal of
   Machine Learning Research, 2615–2637.
Shiryayev, Albert N. 1984. Probability. Springer.
Shor, Naum Z. 1985. Minimization Methods for Non-Differentiable Functions. Springer.
Shotton, Jamie, Winn, John, Rother, Carsten, and Criminisi, Antonio. 2006. Texton-
   Boost: Joint Appearance, Shape and Context Modeling for Multi-Class Object Recog-
   nition and Segmentation. In: Proceedings of the European Conference on Computer
   Vision.
Smith, Adrian F. M., and Spiegelhalter, David. 1980. Bayes Factors and Choice Criteria
   for Linear Models. Journal of the Royal Statistical Society B, 42(2), 213–220.
Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. 2012. Practical Bayesian Op-
   timization of Machine Learning Algorithms. In: Advances in Neural Information
   Processing Systems.
Spearman, Charles. 1904. “General Intelligence,” Objectively Determined and Mea-
   sured. American Journal of Psychology, 15(2), 201–292.
Sriperumbudur, Bharath K., Gretton, Arthur, Fukumizu, Kenji, Schölkopf, Bernhard,
   and Lanckriet, Gert R. G. 2010. Hilbert Space Embeddings and Metrics on Proba-
   bility Measures. Journal of Machine Learning Research, 11, 1517–1561.
Steinwart, Ingo. 2007. How to Compare Different Loss Functions and Their Risks.
   Constructive Approximation, 26, 225–287.
Steinwart, Ingo, and Christmann, Andreas. 2008. Support Vector Machines. Springer.
Stoer, Josef, and Bulirsch, Roland. 2002. Introduction to Numerical Analysis. Springer.
Strang, Gilbert. 1993. The Fundamental Theorem of Linear Algebra. The American
   Mathematical Monthly, 100(9), 848–855.
Strang, Gilbert. 2003. Introduction to Linear Algebra. Wellesley-Cambridge Press.
Stray, Jonathan. 2016. The Curious Journalist’s Guide to Data. Tow Center for Digital
   Journalism at Columbia’s Graduate School of Journalism.
Strogatz, Steven. 2014. Writing about Math for the Perplexed and the Traumatized.
   Notices of the American Mathematical Society, 61(3), 286–291.
Sucar, Luis E., and Gillies, Duncan F. 1994. Probabilistic Reasoning in High-Level
   Vision. Image and Vision Computing, 12(1), 42–60.
Szeliski, Richard, Zabih, Ramin, and Scharstein, Daniel, et al. 2008. A Compar-
   ative Study of Energy Minimization Methods for Markov Random Fields with
   Smoothness-Based Priors. IEEE Transactions on Pattern Analysis and Machine In-
   telligence, 30(6), 1068–1080.
Tandra, Haryono. 2014. The Relationship between the Change of Variable Theorem
   and the Fundamental Theorem of Calculus for the Lebesgue Integral. Teaching of
   Mathematics, 17(2), 76–83.
Tenenbaum, Joshua B., De Silva, Vin, and Langford, John C. 2000. A Global Geometric
   Framework for Nonlinear Dimensionality Reduction. Science, 290(5500), 2319–
   2323.
Tibshirani, Robert. 1996. Regression Shrinkage and Selection via the Lasso. Journal
   of the Royal Statistical Society B, 58(1), 267–288.
Tipping, Michael E., and Bishop, Christopher M. 1999. Probabilistic Principal Compo-
   nent Analysis. Journal of the Royal Statistical Society: Series B, 61(3), 611–622.
Titsias, Michalis K., and Lawrence, Neil D. 2010. Bayesian Gaussian Process Latent
   Variable Model. In: Proceedings of the International Conference on Artificial Intelli-
   gence and Statistics.
Toussaint, Marc. 2012. Some Notes on Gradient Descent.
   https://ipvs.informatik.uni-stuttgart.de/mlr/marc/notes/gradientDescent.pdf.
Trefethen, Lloyd N., and Bau III, David. 1997. Numerical Linear Algebra. SIAM.
Tucker, Ledyard R. 1966. Some Mathematical Notes on Three-Mode Factor Analysis.
   Psychometrika, 31(3), 279–311.
Vapnik, Vladimir N. 1998. Statistical Learning Theory. Wiley.
Vapnik, Vladimir N. 1999. An Overview of Statistical Learning Theory. IEEE Transac-
   tions on Neural Networks, 10(5), 988–999.
Vapnik, Vladimir N. 2000. The Nature of Statistical Learning Theory. Springer.
Vishwanathan, S. V. N., Schraudolph, Nicol N., Kondor, Risi, and Borgwardt,
   Karsten M. 2010. Graph Kernels. Journal of Machine Learning Research, 11, 1201–
   1242.
von Luxburg, Ulrike, and Schölkopf, Bernhard. 2011. Statistical Learning Theory:
   Models, Concepts, and Results. Pages 651–706 of: D. M. Gabbay, S. Hartmann,
   J. Woods (eds), Handbook of the History of Logic, vol. 10. Elsevier.
Wahba, Grace. 1990. Spline Models for Observational Data. Society for Industrial and
   Applied Mathematics.
Walpole, Ronald E., Myers, Raymond H., Myers, Sharon L., and Ye, Keying. 2011.
   Probability and Statistics for Engineers and Scientists. Prentice Hall.
Wasserman, Larry. 2004. All of Statistics. Springer.
Wasserman, Larry. 2007. All of Nonparametric Statistics. Springer.
Whittle, Peter. 2000. Probability via Expectation. Springer.
Wickham, Hadley. 2014. Tidy Data. Journal of Statistical Software, 59, 1–23.
Williams, Christopher K. I. 1997. Computing with Infinite Networks. In: Advances in
   Neural Information Processing Systems.
Yu, Yaoliang, Cheng, Hao, Schuurmans, Dale, and Szepesvári, Csaba. 2013. Charac-
   terizing the Representer Theorem. In: Proceedings of the International Conference on
   Machine Learning.
Zadrozny, Bianca, and Elkan, Charles. 2001. Obtaining Calibrated Probability Esti-
   mates from Decision Trees and Naive Bayesian Classifiers. In: Proceedings of the
   International Conference on Machine Learning.
Zhang, Haizhang, Xu, Yuesheng, and Zhang, Jun. 2009. Reproducing Kernel Banach
   Spaces for Machine Learning. Journal of Machine Learning Research, 10, 2741–2775.
Zia, Royce K. P., Redish, Edward F., and McKay, Susan R. 2009. Making Sense of the
   Legendre Transform. American Journal of Physics, 77(7), 614–622.
                                          Index