A Comprehensive Guide To Machine Learning
A Comprehensive Guide To Machine Learning
About
CS 189 is the Machine Learning course at UC Berkeley. In this guide we have created a com-
prehensive course guide in order to share our knowledge with students and the general public,
and hopefully draw the interest of students from other universities to Berkeley’s Machine Learning
curriculum.
We owe gratitude to Professor Anant Sahai and Professor Stella Yu, as this book is heavily inspired
from their lectures. In addition, we are indebted to Professor Jonathan Shewchuk for his machine
learning notes, from which we drew inspiration.
The latest version of this document can be found at http://snasiriany.me/cs189/. Please report
any mistakes to snasiriany@berkeley.edu. Please contact the authors if you wish to redistribute
this document.
Notation
Notation Meaning
R set of real numbers
Rn set (vector space) of n-tuples of real numbers, endowed with the usual inner product
Rm×n set (vector space) of m-by-n matrices
δij Kronecker delta, i.e. δij = 1 if i = j, 0 otherwise
∇f (x) gradient of the function f at x
∇2 f (x) Hessian of the function f at x
p(X) distribution of random variable X
p(x) probability density/mass function evaluated at x
E[X] expected value of random variable X
Var(X) variance of random variable X
Cov(X, Y ) covariance of random variables X and Y
Other notes:
• Vectors and matrices are in bold (e.g. x, A). This is true for vectors in Rn as well as for
vectors in general vector spaces. We generally use Greek letters for scalars and capital Roman
letters for matrices and random variables.
• We assume that vectors are column vectors, i.e. that a vector in Rn can be interpreted as an
n-by-1 matrix. As such, taking the transpose of a vector is well-defined (and produces a row
vector, which is a 1-by-n matrix).
Contents
1 Regression, Validation 5
1.1 Machine Learning Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Ordinary Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Hyperparameters and Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6 Bias-Variance Tradeoff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3 Low-Rank approximation 43
3.1 Total Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3 Canonical Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4 Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5 Neural Networks 69
5.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 Training Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3
4 CONTENTS
6 Classification 81
6.1 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.2 Generative Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.3 QDA Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.4 LDA Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.5 LDA vs. QDA: Differences and Decision Boundaries . . . . . . . . . . . . . . . . . . 85
6.6 Discriminative Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.7 Least Squares Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.8 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.9 Multiclass Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.10 Training Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.11 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8 Sparsity 131
8.1 Sparsity and LASSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
8.2 Coordinate Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
8.3 Sparse Least-Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Regression, Validation
Have you ever tailored your exam preparation, to optimize exam performance? It seems likely that
you would, and in fact, those preparations may range from “taking one practice exam” to “wearing
lucky underwear”. The question were implicitly answering is: which of these actually betters exam
performance? This is where machine learning comes in. In machine learning, we strive to find
patterns in data, then make predictions using those patterns, just as the average student would do
to better exam scores. We start by structuring an approach to understanding machine learning.
Levels of Abstraction
We will take an example from astronomy, below. How exactly do we define the parameters of the
problem, and how do we leverage the machine learning framework to help with this problem? See
below, for a concrete outline of what to think about.
1. Applications and Data: What are you trying to do? What is your data like? Here, identify
your problem and the nature of your observations. Your data may be Cartesian coordinates,
an matrix of RGB values etc.
Example: We have (x1 , y1 ) coordinates. We want to compute each planet’s orbit.
2. Model: What kind of pattern do you want to find? This could be a polynomial, a geometric
figure, a set of dynamics governing a self-feedback loop etc.
Example: An ellipse for an orbit.
3. Optimization Problem: Whatever problem you have, turn it into an optimization problem.
We could minimize losses or maximize gains, but in either case, define an objective function
that formally specifies what you care about.
Example: minx kAx − bk22
5
6 CHAPTER 1. REGRESSION, VALIDATION
In this course, we will study both models and optimization problems extensively. Being able
to compartmentalize each of these will make it easier for you to frame and understand machine
learning.
Let us now use the framework specified above to discuss ordinary least squares, a canonical op-
timization problem that we’ll study in-depth from various perspectives. We’ll then slowly add
complexity to least squares, with a number of different techniques. We’re most familiar with the
2-dimensional case, commonly called “linear regression”. So, let’s start there.
1. Applications and Data: We have n data points (ai , bi ) and wish to determine how the ai ’s
determine the bi ’s.
2. Model: With least squares, we assume ai , bi are linearly related. More concretely, we wish
to find a model m, c such that bi ≈ mai + c where bi , ai , m, c ∈ R. In other words, we want to
learn how the data determines m, c.
4. Optimization Algorithm: As it turns out, ordinary least squares has a closed-form solution
for the optimal value of x. We can solve for this a number of ways, and below, we’ll see
perspectives that each of these methods belies.
What if our samples are vectors, instead? Our model is then a matrix X, and we’ll need to be more
careful about our formulation of the least squares framework.
1. Applications and Data: We have n data points (ai , bi ) and wish to determine how the ai ’s
determine the bi ’s.
2. Model: Assume ai , bi are linearly related. Find X such that bi ≈ Xai where bi ∈ Rm , ai ∈
Rl , X ∈ Rm×l .
aT1 0 0 0
0
aT1 0 0
..
0
0 . 0
0 0 T
0 a1
aT 0 0 0
2
0 aT2 0 0
b1
..
0
0 . 0
0
0 0 aT2 b2
aT 0 0 0
min k − k2
3
x 0 aT3 0 0 .
..
..
0 0 . 0
.
T
0
0 0 a3
.. bn
.. .. .. ..
. . . .
T
an 0 0 0
aTn
0 0 0
..
. 0
0 0
0 0 0 aTn
Using the notation Mm×n to denote a matrix with m rows and n columns (an m by n matrix),
we see that the matrices and vectors above have the following dimensions:
We consider the brute-force approach to deriving the solution, first. Just as you would find optimum
for scalar-valued functions, take the derivative, set it equal to 0, and solve. Before we begin, keep
in mind the following convention.
∂wT x ∂xT Ax
= w, = (A + AT )x
∂x ∂x
Using these conventions, we can now take the gradient of our objective function. Apply the defini-
tion of the L2-norm to start, that kxk22 = xT x.
∂
kAx − bk22
∂x
∂
= ((Ax − b)T (Ax − b))
∂x
∂ T T
= (x A Ax − (Ax)T b − bT (Ax) + bT b)
∂x
∂ T T
= (x A Ax − 2xT AT b + bT b) recall bT a = aT b for a, b ∈ Rd
∂x
= 2AT Ax − 2AT b
8 CHAPTER 1. REGRESSION, VALIDATION
2AT Ax = 2AT b
AT Ax = AT b
x∗ = (AT A)−1 AT b
Projection Approach
We want to find something, denoted x̂, spanned by the columns of A, that is closest to b. As it
turns out, this is precisely b projected onto the column space of A. Then, the error b − Ax is
perpendicular to everything spanned by the columns of A. This gives us the following formulation:
aTi (b − Ax) = 0, ∀x
AT (b − Ax) = 0
AT b = AT Ax
x∗ = (AT A)−1 AT b
represents the “best-fit” linear model, by projecting b onto the subspace spanned by the columns
of A. However, sometimes we would like to fit not just linear models, but also nonlinear models
such as ellipses and cubic functions. We can still do this under the framework of least-squares,
by augmenting (in addition to the raw features) new arbitrary features to the data. Note that
the resulting models are still linear w.r.t the augmented features, but they are nonlinear w.r.t the
original, raw features.
Let’s use least-squares to fit a model for data points that come from an ellipse.
1. Applications and Data: We have n data points (xi , yi ), which may be noisy (could be off
the actual orbit). Our goal is to determine how the xi ’s determine the yi ’s.
Since every single data point has the same prediction value of 1, we can represent b ∈ Rn as
all 1’s. Each coefficient ai can be interpreted as the weight of the i’th feature. In expanded
form, the optimization problem can be formulated as the following:
1 x21 y12 x1 y1 x1 y1 a0 1
1 x2 y 2 x2 y2 x2 y2 a1 1
2 2
min k − .. k2
a0 ,a1 ,a2 ,a3 ,a4 ,a5
.. .. .. .. .. .. ..
. . . . . . . .
1 x2n yn2 xn yn xn yn a5 1
Specifically:
• Each column represents a feature
• Each row represents a data point
• A feature is a building block of a model
• Assuming the columns are linearly independent, we can fit the data with OLS
• Augmenting more features in addition to the raw features makes the columns less and
less linearly independent, making OLS more and more infeasible
Model Notation
• y is the prediction from the model (note that y is a linear combination of the features)
• Φi represents the i’th feature - which depends on the raw features of the data point x
• αi is the fixed coefficient/weight corresponding to the i’th feature
Polynomial Features
There are many different kinds of features, and an important class is polynomial features. Poly-
nomial features can be interpreted as polynomials of the raw features.
Remember that polynomials are merely linear combination of monomial basis terms. Monomials
can be classified in two ways, by their degree and dimension:
degree → 0 1 2 3 ···
↓ dimension
univariate 1 x x2 x3 ···
bivariate 1 x1 , x2 x21 , x22 , x1 x2 x31 , x32 , x21 x2 , x1 x22 ···
Why are polynomial features so important? Note that under Taylor’ Theorem, any sufficiently
smooth function can be approximated arbitrarily closely by a high enough degree polynomial. This
10 CHAPTER 1. REGRESSION, VALIDATION
is true for both univariate polynomials and multivariate polynomials, giving them the title of uni-
versal approximators. In this sense, polynomial features are building blocks for many kinds of
functions.
One downside of polynomials is that as their degree increases, they increase exponentially in the
number of terms. That is, for a polynomial of degree at most d in l dimensional space,
# of terms = O(ld )
The justification is as follows: there are d “slots” (corresponding to the degree), and l “choices”
for each slot (corresponding to which of the l original “raw” features will be used).
Due to their exponential number of terms, polynomial features suffer from the curse of dimen-
sionality. Note however that we can bypass this issue by employing kernels, allowing us to even
deal with infinite-diemsional polynomials! More on this later in the course.
Types of Error
One question when using polynomial features is how do we choose the right degree for the polyno-
mial? In order to answer this question we would idealy measure the prediction error of our model
for different choices of the degree and choose the degree that minimizes the error.
Here we must make a distinction between training error and true error. Training error is the
prediction error when the model is evaluated on the training data, in other words the same data
that was used to train the model is being used to evaluate it. True error is the prediction error
when the model is evaluated on unseen data that comes from the true underlying model.
For different choices of the degree, let’s observe what happens to the training error versus the
true error. We observe that as degree increases, both training error and true error decrease initially
due to using a more complex model. Training error continues to always decrease as the degree
increases. This is true, because as the degree increases, the polynomial becomes strictly more ex-
pressive, fitting the training data more accurately. However, the true error eventually increases as
the degree increases. This indicates the presence of overfitting: the model is too complex, and
it is fitting the noise too much, leading to a low training error but high true error. Overfitting
highlights that getting a better fit for your training data does not mean that you’ve learned the
correct model.
Motivation
While OLS can be used for solving linear least squares problems, it falls short for two reasons:
numerical instabilty and overfitting.
Numerical Instability: Numerical instability arises when the features of the data are close to
collinear (leading to linearly dependent feature columns), causing the feature matrix A to have
singular values that are either 0 or very close to 0. Why are small singular values bad? Let us
illustrate this via the singular value decomposition (SVD) of A:
A = UΣVT
1.5. HYPERPARAMETERS AND VALIDATION 11
Now let’s try to expand the (AT A)−1 term in OLS using the SVD of A:
(AT A)−1 = (VΣUT UΣVT )−1 = (VΣ(I)ΣVT )−1 = (VΣ2 VT )−1 = (VT )−1 (Σ2 )−1 V−1 = VΣ−2 VT
This means that (AT A)−1 will have singular values that are the squared inverse of the singular
values of A, leading to either extremely large or infinite singular values (when the singular value
of A is 0). Such excessively large singular values can be very problematic for numerical stability
purposes.
Overfitting: Overfitting arises when the features of the A matrix are too complex and prevent
OLS from producing a model that generalizes to unseen data. We want to be able to still keep
these complex features, but somehow penalize them in our objective function so that we do not
overfit.
Solution
There is a very simple solution to both of these issues: penalize the entries of x from becoming
too large. We can do this by adding a penalty term constraining the norm of x. For a fixed, small
scalar λ, we now have:
min(kAx − bk2 + λ2 kxk2 )
x
Note that the λ in our objective function is a hyperparameter that measures the sensivity to the
values in x. Just like the degree in polynomial features, λ is a value that we must choose arbitrarily.
Let’s expand the terms of the objective function:
kAx − bk2 + λ2 kxk2 = (Ax − b)T (Ax − b) + λ2 xT x
= xT AT Ax − bT Ax − xT AT b + bT b + λ2 xT x
Finally take the gradient of the objective and find the value of x that achieves 0 for the gradient:
0 = 2x(AT A + λ2 I) − 2bT A
x = bT A(AT A + λ2 I)−1
x = (AT A + λ2 I)−1 AT b
The technique we just described is known as ridge regression, aka Tikhonov regularization.
Note that now, the SVD of (AT A + λ2 I)−1 becomes:
(AT A + λ2 I)−1 = (VΣ2 VT + λ2 I)−1 = (VΣ2 VT + Vλ2 IVT )−1 = (V(Σ2 + λ2 I)VT )−1
Now with our slight tweak, the singular values have become 1/(σ 2 + λ2 ), meaning that even if
σ = 0, the singular values are guaranteed to be at least 1/λ2 , solving our numerical instability
issues. Furthermore, we have partially solved the overfitting issue. By penalizing the norm of x,
we encourage the weights corresponding to relevant features that capture the main structure of the
true model, and penalized the weights corresponding to complex features that only serve to fine
tune the model and fit noise in the data.
Recall the supervised regression setting in which we attempt to learn a mapping f : Rd → R from
labeled examples {(xi , yi )}ni=1 , where yi ≈ f (xi ). The typical strategy in regression is to pick some
12 CHAPTER 1. REGRESSION, VALIDATION
parametric model and then fit the parameters of the model to the data. Here we consider a model
of the form
model order
p
X
fα (x) = αj φj (x)
j=1
|{z} | {z }
weights features
The functions φj compute features of the input x. They are sometimes called basis functions
because our model fα is a linear combination of them. In the simplest case, we could just use the
components of x as features (i.e. φj (x) = xj ) but in future lectures it will sometimes be helpful
to disambiguate the features of an example from the example itself, so we encourage this way of
thinking now.
Once we’ve fixed a model, we can the optimal value of α by minimizing some cost function L that
measures how poorly the model fits the data:
n
e.g. X
min L({xi , yi }ni=1 , α) = min (fα (x) − yi )2
α α
i=1
Observe that the model order p is not one of the decision variables being optimized in the mini-
mization above. For this reason p is called a hyperparameter. We might say more specifically
that it is a model hyperparameter, since it determines the structure of the model.
Let’s consider another example. Ridge regression is like ordinary least squares except that we
add an `2 penalty on the parameters α:
The solution to this optimization problem is obtained by solving the linear system
(AT A + λI)α = AT b
The regularization weight λ is also a hyperparameter, as it is fixed during the minimization above.
However λ, unlike the previously discussed hyperparameter p, is not a part of the model. Rather,
it is an aspect of the optimization procedure used to fit the model, so we say it is an optimization
hyperparameter. Hyperparameters tend to fall into one of these two categories.
Types of Error
Let us define the mean-squared training error (or just training error) as
n
1X
d train (α) =
mse (fα (xi ) − yi )2
n
i=1
training error
Model Order
But training error is not exactly the kind of error we care about. In particular, training error only
reflects the data in the training set, whereas we really want the model to perform well on new points
sampled from the same data distribution as the training data. With this motivation we define the
true error as
where the expectation is taken over all pairs (x, y) from the data distribution. We generally cannot
compute the true error because we do not have access to the data distribution, but we will see later
that we can estimate it.
Adding more features tends to reduce true error as long as the additional features are useful
predictors of the output. However, if we keep adding features, these begin to fit noise in the
training data instead of the true signal, causing true error to actually increase. This phenomenon
is known as overfitting.
True Error
Model Order
The regularization hyperparameter λ has a somewhat different effect on training error. Observe
that if λ = 0, we recover the exact OLS problem, which is directly minimizing the training error. As
λ increases, the optimizer places less emphasis on the training error and more emphasis on reducing
the magnitude of the parameters. This leads to a degradation in training error as λ grows:
14 CHAPTER 1. REGRESSION, VALIDATION
training error
Regularization Weight
Validation
We’ve given some examples of hyperparameters and mentioned that they are not determined by
the data-fitting optimization procedure. How, then, should we choose the values of p and λ?
Validation Error
The key idea is to set aside some portion (say 30%) of the training data, called the validation set,
which is not allowed to be used when fitting the model:
Model Order
Note that this approximation only works because the validation set is disjoint from the training
set. If we trained and evaluated the model using the same data points, the estimate would be very
biased, since the model has been constructed specifically to perform well on those points.
1.6. BIAS-VARIANCE TRADEOFF 15
Now that we can estimate the true error, we have a simple method for choosing hyperparameter
values: simply try a bunch of configurations of the hyperparameters and choose the one that yields
the lowest validation error.
Cross-validation
Setting aside a validation set works well, but comes at a cost, since we cannot use the validation
data for training. Since having more data generally improves the quality of the trained model,
we may prefer not to let that data go to waste, especially if we have little data to begin with
and/or collecting more data is expensive. Cross-validation is an alternative to having a dedicated
validation set.
k-fold cross-validation works as follows:
1. Shuffle the data and partition it into k equally-sized (or as equal as possible) blocks.
2. For i = 1, . . . , k,
• Train the model on all the data except block i.
• Evaluate the model (i.e. compute the validation error) using block i.
1 2 3 4 5 6 ··· k
validate train
3. Average the k validation errors; this is our final estimate of the true error.
Note that this process (except for the shuffling and partitioning) must be repeated for every hyper-
parameter configuration we wish to test. This is the principle drawback of k-fold cross-validation
as compared to using a held-out validation set – there is roughly k times as much computation
required. This is not a big deal for the relatively small linear models that we’ve seen so far, but it
can be prohibitively expensive when the model takes a long time to train, as is the case in the Big
Data regime or when using neural networks.
Recall from our previous note that for a fixed input x, our measurement Y is a noisy measurement
of the true underlying response f (x):
Y = f (x) + N
Where N is a zero-mean random variable, and is typically represented as a Gaussian distribution.
Our goal in regression is to recover the underlying model f (.) as closely as possible. We previously
mentioned MLE and MAP as two techniques that try to find of reasonable approximation to f (.)
16 CHAPTER 1. REGRESSION, VALIDATION
by solving a probabilistic objective. In this section, we would like to form a theoretical metric that
can exactly measure the effectiveness of a hypothesis function h. Keep in mind that this is only a
theoretical metric that cannot be measured in real life, but it can be approximated via empirical
experiments – more on this later.
Before we introduce the metric, let’s make a few subtle statements about the data and hypothesis.
As you may recall from our previous discussion on MLE and MAP, we had a dataset D that con-
sisted of n (xi , yi ) pairs. In that context, we treated the xi ’s in our dataset D as fixed values. In
this case however, we treat the xi ’s as random variables. Specifically:
D = {X1 , Y1 ; X2 , Y2 ; . . . ; Xn , Yn }
D is a random variable, because it consists of random variables Xi and Yi . (Notice that because
these quantities are random variables, they are all represented by uppercase letters.) Our hypothesis
model h(.) is also a random variable – h(x, D) depends on some arbitrary test input x, as well as
the random variable D that was used to train h. Since D is random, we will have a slightly different
hypothesis model h(., D) everytime we use a new dataset.
Metric
Our objective is to, for a fixed value of x, evaluate how closely the hypothesis can estimate the noisy
observation corresponding to x. Note that we have denoted x here as a lowercase letter because
we are treating it as a fixed constant, while we have denoted the Y and D as uppercase letters
because we are treating them as random variables. Y and D as independent random variables,
because our x and Y have no relation to the set of Xi ’s and Yi ’s in D. In a way, we can think of
D as the training data, and (x, Y ) as the test data – in fact, x is often not even in the training set
D! Mathematically, we represent our metric as the expected squared error between the hypothesis
and the observation Y = f (x) + N :
ε(x; h) = E[(h(x; D) − Y )2 ]
Note that the error is w.r.t the observation and not the true underlying model, because we do not
know the true model and only have access to the noisy observations from the true model.
Bias-Variance Decomposition
The error metric is difficult to interpret and work with, so let’s try to decompose it into parts that
are easier to understand. Before we start, let’s find the expectation and variance of Y :
Recall that for any two independent random variables D and Y , g1 (D) and g2 (Y ) are also in-
dependent, for any functions g1 , g2 . This implies that h(x; D) and Y are independent, allowing
us to express E[h(x; D) · Y ] = E[h(x; D)] · E[Y ] in the second line of the derivation. The final
decomposition, also known as the bias-variance decomposition, consists of three terms:
• Bias2 of method: Measures how well the average hypothesis (over all possible training sets)
can come close to the true underlying value f (x), for a fixed value of x. A low bias means
that on average the regressor h(x) accurately estimates f (x).
• Variance of method: Measures the variance of the hypothesis (over all possible training
sets), for a fixed value of x. A low variance means that the prediction does not change much
as the training set varies. An un-biased method (bias = 0) could have a large variance.
• Irreducible error: This is the error in our model that we cannot control or eliminate, because
it is due to errors inherent in our noisy observation Y .
The decomposition allows us to measure the error in terms of bias, variance, and irreducible error.
Irreducible error has no relation with the hypothesis model, so we can fully ignore it in theory when
minimizing the error. As we have discussed before, models that are very complex have very little
bias because on average they can fit the true underlying model value f (x) very well, but have very
high bias and are very far off from f (x) on an individual basis.
Note that the error above is only for a fixed input x, but in regression our goal is to minimize
the average error over all possible values of X. If we know the distribution for X, we can find the
effectiveness of a hypothesis model as a whole by taking an expectation of the error over all possible
values of x: EX [ε(x; h)].
Alternative Decomposition
The previous derivation is short, but may seem somewhat hacky and arbitrary. Here is an al-
ternative derivation that is longer, but more intuitive. At its core, it uses the technique that
h 2 i h 2 i
E Z −Y = E (Z − E[Z]) + (E[Z] − Y ) which decomposes to easily give us the variance
18 CHAPTER 1. REGRESSION, VALIDATION
h 2 i
ε(x; h) = E[(h(x; D) − Y )2 ] = E h(x; D) − E[h(x; D)] + E[h(x; D)] − Y
h 2 i h 2 i h i
= E h(x; D) − E[h(x; D)] + E E[h(x; D)] − Y + 2E h(x; D) − E[h(x; D)] · E[h(x; D)] − Y
h 2 i h 2 i
= E h(x; D) − E[h(x; D)] + E E[h(x; D)] − Y + 2E[h(x; D) − E[h(x; D)]] · E[E[h(x; D)] − Y ]
h 2 i h 2 i
= E h(x; D) − E[h(x; D)] + E E[h(x; D)] − Y + 2 E[h(x; D)] − E[h(x; D)] · E[E[h(x; D)] − Y ]
h 2 i h 2 i
= E h(x; D) − E[h(x; D)] + E E[h(x; D)] − Y + 2 · (0) · E[E[h(x; D)] − Y ]
h 2 i h 2 i
= E h(x; D) − E[h(x; D)] + E E[h(x; D)] − Y
h 2 i
= Var((h(x; D)) + E E[h(x; D)] − Y
h i2
= Var((h(x; D)) + Var E[h(x; D)] − Y + E E[h(x; D)] − Y
2
= Var((h(x; D)) + Var(Y ) + E[h(x; D)] − E[Y ]
2
= Var((h(x; D)) + Var(N ) + E[h(x; D)] − f (x)
2
= E[h(x; D)] − f (x) + Var(h(x; D)) + Var(N )
| {z |
} variance {z } | {z }
of method irreducible error
bias2 of method
Experiments
Let’s confirm the theory behind the bias-variance decomposition with an empirical experiment that
measures the bias and variance for polynomial regression with 0 degree, 1st degree, and 2nd degree
polynomials. In our experiment, we will repeatedly fit our hypothesis model to a random training
set. We then find the expectation and variance of the fitted models generated from these training
sets.
Let’s first look at a 0 degree (constant) regression model. We repeatedly fit an optimal constant
line to a training set of 10 points. The true model is denoted by gray and the hypothesis is denoted
by red. Notice that at each time the red line is slightly different due to the different training set
used.
1.6. BIAS-VARIANCE TRADEOFF 19
Let’s combine all of these hypotheses together into one picture to see the bias and variance of our
model.
On the top left diagram we see all of our hypotheses and all training sets used. The bottom left
diagram shows the average hypothesis in cyan. As we can see, this model has low bias for x’s in
20 CHAPTER 1. REGRESSION, VALIDATION
the center of the graph, but very high bias for x’s that are away from the center of the graph. The
diagram in the bottom right shows that the variance of the hypotheses is quite high, for all values
of x.
Now let’s look at a 1st degree (linear) regression model.
The bias is now very low bias for all x’s. The variance is low for x’s in the middle of the graph,
1.6. BIAS-VARIANCE TRADEOFF 21
but higher for x’s that are away from the center of the graph.
Finally, let’s look at a 2nd degree (quadratic) regression model.
The bias is still very low for all x’s. However, the variance is much higher for all values of x.
Let’s summarize our results. We find the bias and the variance empirically and graph them for all
values of x, as shown in the first two graphs. Finally, we take an expectation over the bias and
22 CHAPTER 1. REGRESSION, VALIDATION
Takeaways
2. Training error reflects bias but not variance. Test error reflects both. In practice, if the
training error is much smaller than the test error, then there is overfitting.
3. V ariance → 0 as n → ∞ .
4. If f is in the set of hypothesis functions, bias will decrease with more data. If f is not in the
set of hypothesis functions, then there is an underfitting problem and more data won’t help.
1.6. BIAS-VARIANCE TRADEOFF 23
5. Adding good features will decrease the bias, but adding a bad feature rarely increase the bias.
(just set the coefficient of the feature to 0)
6. Adding a feature usually increase the variance, so a feature should only be added if it decreases
bias more than it increases variance.
7. Irreducible error can not be reduced.
8. Noise in the test set only affects Var(N ) , but noise in the training set also affects bias and
variance.
9. For real-world data, f is rarely known, and the noise model might be wrong, so we can’t
calculate bias and variance. But we can test algorithms over synthetic data.
24 CHAPTER 1. REGRESSION, VALIDATION
Chapter 2
So far, we’ve explored a number of ideas under the least squares framework:
n
X
min (yi − φ(xi )T w)2 = ky − Awk22 =⇒ w∗OLS = (AT A)−1 AT y
w
i=1
• Ridge Regression
n
X d
X
min (yi −φ(xi )T w)2 +λ2 wj2 = ky−Awk22 +λ2 kwk22 =⇒ w∗Ridge = (AT A+λ2 I)−1 AT y
w
i=1 j=1
One question that you may have been asking yourself is why we are using the squared error to
measure the effectiveness of our model, and why we use the `2 norm for the model weights (and
not some other norm). We will justify all of these design choices by exploring the statistical
interpretations of supervised learning methods and in particular, regression methods. In more
concrete terms, we will use a variety of concepts from probability, such as Gaussians, MLE and
MAP, in order to validate what we’ve done so far through a different lens.
Our goal is to find the optimal slope and intercept values that we believe best describe the data.
Each arbitrary (slope, intercept) pair forms a line in data space. However, it can also be viewed
as a single point in model space. Learning the optimal model amounts to fitting a line to the
data points in the data space; it’s equivalent to locating the optimal parameter point in the model
space:
25
26 CHAPTER 2. MLE, MAP, AND GAUSSIANS
The data that we observe are random samples with different distributions, coverages, and densities.
There are many different distributions that we will encounter, such as Unifrom, Gaussian, and
Laplacian.
However, it is arguable that Gaussians distributions are by far the most prevalent. For the purposes
of this section we will assume that you already have had exposure to Gaussian distributions before.
Let’s review their properties for convenience. Gaussians are particularly appealing because they
occur quite frequently in natural settings. Noise distributions in particular are often assumed to be
Gaussian distributions, partly because most of noise is captured within 1 or 2 standard deviations
of the mean:
2.1. MLE AND MAP 27
In the context of regression (and all of supervised learning for that matter), we assume a true
underlying model that maps inputs to outputs. Our goal as machine learning experts is to find
a hypothesis model that best represents the true underlying model, by using the data we are
given.
Let’s define more concretely our definition of the data and the model. Our data takes consists of
n (x, y) pairs, just as we have seen before:
D = {(xi , yi )}ni=1
The true underlying model f is a function that maps the inputs xi to the true outputs f (xi ). Each
observation yi is simply a noisy version of f (xi ):
yi = f (xi ) + Ni
Note that f (xi ) is a constant, while Ni is a random variable. We always assume that Ni has zero
mean, because otherwise there would be systematic bias in our observations. The Ni ’s could be gaus-
i.i.d
sian, uniform, laplacian, etc.. Here, let us assume that they are i.i.d and gaussian: Ni ∼ N (0, σ 2 ).
2
We can therefore say that yi |xi ∼ N (f (xi ), σ ).
Now that we have introduced the data and model, we wish to find a hypothesis model that best
describes the data, while possibly taking into account prior beliefs that we have about the true
model. We can represent this as a probability problem, where the goal is to find the optimal model
that maximizes our probability.
In Maximum Likelihood Estimation (MLE), the goal is to find the model that maximizes the
probability of the data. If we denote the set of hypothesis models as H, we can represent the
28 CHAPTER 2. MLE, MAP, AND GAUSSIANS
problem as:
h∗M LE = arg max P (data = D| true model = h)
h∈H
More concretely:
h∗M LE = arg max P (y1 , . . . , yn |x1 , . . . , xn , h)
h∈H
Note that we actually conditioned on the xi ’s, because we treat them as fixed values of the data.
The only randomness in our data comes from the yi ’s (since they are noisy versions of the true
values f (xi )).
From the chain rule of probability, we can expand the probability statement:
P (y1 , . . . , yn |x1 , . . . , xn , h) = P (y1 |x1 , . . . , xn , h)·P (y2 |y1 , x1 , x2 . . . , xn , h)·. . .·P (yn |y1 , . . . , yn−1 , x1 , . . . , xn , h)
We can simplify this expression by viewing the problem as a graphical model. Note that yi only
depends on its parent in the graphical model, xi . It does not depend on the other yj ’s, since all y’s
have independent noise terms. We can therefore simplify the objective:
n
Y
h∗M LE = arg max P (y1 , . . . yn |x1 , . . . xn , h) = P (yi |xi , h)
h∈H i=1
Now let’s focus on each individual term P (yi |xi , h). We know that yi |xi , h ∼ N (h(xi ), σ 2 ), which is
cumbersome to work with because gaussians have exponential terms. So instead we wish to work
with logs, which eliminate the exponential terms:
n
X
h∗M LE = arg max log[P (y1 , . . . yn |x1 , . . . xn , h)] = log[P (yi |xi , h)]
h∈H i=1
Note that with logs we are still working with the same problem, because logarithms are monotonic
functions. Let’s try to understand this mathematically through calculus:
d 1 df df
log(f (x)) = 0 ⇐⇒ = 0 ⇐⇒ =0
dx f (x) dx dx
The last statement is true in this case since f (x) is a gaussian function and thus can never equal
0. Continuing with logs:
n
X
h∗M LE = arg max log[P (yi |xi , h)] (2.1)
h∈H i=1
n
X (yi − h(xi ))2 √
= arg max − 2
− n log 2πσ (2.2)
h∈H 2σ
i=1
n
X (yi − h(xi ))2 √
= arg min 2
+ n log 2πσ (2.3)
h∈H 2σ
i=1
n
X
= arg min (yi − h(xi ))2 (2.4)
h∈H i=1
Note that in step (3) we turned the problem from a maximization problem to a minimization prob-
lem by negating the objective. In step (4) we eliminated the second term and the denominator in
2.1. MLE AND MAP 29
the first term, because they do not depend on the variables we are trying to optimize over.
Now let’s look at the case of regression. In that case our hypothesis has the form h(xi ) = φ(xi )T w,
where w ∈ Rd , where d is the number of dimensions of our featurized datapoints. For this specific
setting, the problem becomes:
n
X (yi − φ(xi )T w)2
w∗M LE = arg min
w∈Rd 2σ 2
i=1
This is just the Ordinary Least Squares (OLS) problem! We just proved that OLS and MLE for
regression lead to the same answer! We conclude that MLE is a probabilistic justification for why
using squared error (which is the basis of OLS) is a good metric for evaluating a regression model.
Maximum a Posteriori
In Maximum a Posteriori (MAP) Estimation, the goal is to find the model, for which the data
maximizes the probability of the model:
h∗M AP = arg max P (true model = h| data = D) (2.1)
h∈H
P (true model = h, data = D)
= arg max (2.2)
h∈H P ( data = D)
= arg max c · P (true model = h, data = D) (2.3)
h∈H
= arg max P (true model = h, data = D) (2.4)
h∈H
= arg max P ( data = D|true model = h) · P (true model = h) (2.5)
h∈H
Here, we used Bayes’ Rule to reexpress the objective. In step (3) we represent P ( data = D) as
a constant value because it does not depend on the variables we are optimizing over. Notice that
MAP is just like MLE, except we add a term P (h) to our objective. This term is the prior over
our true model.
More concretely, we have (just as we did with MLE):
h∗M AP = arg max P (h|y1 , . . . , yn , x1 , . . . , xn ) (2.1)
h∈H
P (h, y1 , . . . , yn |x1 , . . . , xn )
= arg max (2.2)
h∈H P (y1 , . . . , yn |x1 , . . . , xn )
= arg max c · P (h, y1 , . . . , yn |x1 , . . . , xn ) (2.3)
h∈H
= arg max P (h, y1 , . . . , yn |x1 , . . . , xn ) (2.4)
h∈H
= arg max P (y1 , . . . , yn |h, x1 , . . . , xn ) · P (h) (2.5)
h∈H
= arg max log[P (y1 , . . . , yn |h, x1 , . . . , xn ) · P (h)] (2.6)
h∈H
= arg max log[P (y1 , . . . , yn |h, x1 , . . . , xn )] + log[P (h)] (2.7)
h∈H
n
X
= arg max log[P (yi |xi , h)] + log[P (h)] (2.8)
h∈H i=1
30 CHAPTER 2. MLE, MAP, AND GAUSSIANS
Again, just as in MLE, notice that we condition on the xi ’s in the whole process because we treat
them as constants. Also, let us assume as before that the noise terms are i.i.d and gaussian:
i.i.d
Ni ∼ N (0, σ 2 ). For the prior term P (h), we assume that it follows a shifted and scaled version of
the standard Multivariate Gaussian distribution: h ∼ N (h0 , σh2 I). Using this specific information,
we now have:
X n
h∗M AP = arg max log[P (yi |xi , h)] + log[P (h)] (2.1)
h∈H i=1
Pn
(yi − h(xi ))2 −kh − h0 k2
= arg max − i=1 + (2.2)
h∈H 2σ 2 2σh2
n (y − h(x ))2 kh − h k2
P
i=1 i i 0
= arg min + (2.3)
h∈H 2σ 2 2σh2
Xn σ2
= arg min (yi − h(xi ))2 + 2 kh − h0 k2 (2.4)
h∈H i=1
σh
Let’s look again at the case for linear regression to illustrate the effect of the prior term when
h0 = 0:
X n σ2
∗
wM AP = arg min (yi − φ(xi )T w))2 + 2 kwk2
w∈Rd i=1
σh
This is just the Ridge Regression problem! We just proved that Ridge Regression and MAP for
regression lead to the same answer! We can simply set λ = σσh . We conclude that MAP is a
probabilistic justification for adding the penalized ridge term in Ridge Regression.
Based on our analysis of Ordinary Least Squares Regression and Ridge Regression, we should expect
to see MAP perform better than MLE. But is that always the case? Let us revisit at the (slope,
intercept) example from earlier. We already know the true underlying model parameters, and we
will compare them to the values that MLE and MAP select. Let’s start with MLE:
2.1. MLE AND MAP 31
The diagram on the left shows the data space representation of the problem, and the diagram on
the right shows the model space representation. The gray line in the left diagram and the gray
dot in the right diagram are the true underlying model. Using noisy data samples, MLE predicts a
reasonable hypothesis model (as indicated by the green line in the left diagram and the green dot
in the right diagram).
Now, let’s take a look at the hypothesis model from MAP. One question that arises is where the
prior should be centered and what its variance should be. This depends on your belief of what the
true underlying model is. If you have reason to believe that the model weights should all be small,
then the prior should be centered at zero. Let’s look at MAP for a prior that is centered at zero:
For reference, we have marked the MLE estimation from before as a green point and the true model
as a gray point. As we can see from the right diagram, using a prior centered at zero leads us to
skew our prediction of the model weights toward the origin, leading to a less accurate model than
MLE.
Let’s say in our case that we have reason to believe that both model weights should be centered
around the 0.5 to 1 range. As the first set of diagrams below show, our prediction would be better
than MLE. However, if we believe the model weights should be centered around the -0.5 to -1 range,
we would make a much poorer prediction than MLE.
32 CHAPTER 2. MLE, MAP, AND GAUSSIANS
As always, in order to compare our beliefs to see which prior works best in practice, we should use
cross validation!
So far we have used MLE in the context of Gaussian noise to justify the optimization formulation
of regression problems, such as OLS. Let’s apply this dual optimization-probability philosophy to
other regression problems, such as Weighted Least Squares.
Optimization View
The basic idea of weighted least squares is the following: we place more emphasis on the loss
contributed from certain data points over others - that is, we care more about fitting some data
2.2. WEIGHTED LEAST SQUARES 33
points over others. From an optimization perspective, the problem can be expressed as
n
X
w∗W LS = arg min ωi (yi − φ(xi )T w)2
w∈Rd i=1
This objective is the same as OLS, except that each term in the sum is weighted by a coefficient
ωi . As always, we can vectorize this problem:
Where the i’th row A is φ(xi )T , and Ω ∈ Rn×n is a diagonal matrix with Ωi,i = ωi .
We rewrite the WLS objective to an OLS objective:
This formulation is identical to OLS except that we have scaled the data matrix and the observation
vector by Ω1/2 , and we conclude that
−1 T
w∗W LS = (Ω1/2 A)T (Ω1/2 A) Ω1/2 A Ω1/2 y = (AT ΩA)−1 AT Ωy
Probabilistic View
As in MLE, we assume that our observations y are noisy, but now suppose that some of the yi ’s
are more noisy than others. How can we take this into account in our learning algorithm so we can
get a better estimate of the weights?
yi φ(xi )T w Zi
= +
σi σi σi
Note that the scaled noise entries are now i.i.d:
Zi i.i.d
∼ N (0, 1)
σi
The MLE estimate of this scaled problem is equivalent to the WLS estimate of the original problem:
−1 −1 −1 −1
w∗W LS = (AT Σz 2 Σz 2 A)−1 AT Σz 2 Σz 2 y = (AT Σ−1 −1 T −1
z A) A Σz y
where 2
σ1 0 ··· 0
0 σ22 ··· 0
Σz = ..
.. .. ..
. . . .
0 · · · · · · σn2
Note that as long as no σ is 0, Σz is invertible. Note that ωi from the optimization perspective is
directly related to σi2 from the probabilistic perspective: ωi = σ12 . As the variance σi2 of the noise
i
corresponding to data point i decreases, the weight ωi increases: we are more concerned about
fitting data point i because it is likely to match the true underlying denoised point. Inversely, as
the variance σi2 increases, the weight ωi decreases: we are less concerned about fitting data point i
because it is noisy and should not be trusted.
Dependent Noise
What if the Zi ’s aren’t independent? This usually happens for time series data. Again, we assume
we have a model for how the noise behaves. The noise entries are not independent, but there is a
known process.
e.g. Zi+1 = rZi + Ui where Ui ∼ N (0, 1), i.i.d, −1 ≤ r ≤ 1 (so that it doesn’t blow up)
or a ”sliding window” example (like echo of audio) where Zi = Σrj Ui−j , Ui ∼ N (0, 1).
There are three equivalent definitions of a jointly Gaussian (JG) random vector:
Note that all of these conditions are equivalent. In this note we will start by showing a proof that
(1) =⇒ (3). We will leave it as an exercise to prove the rest of the implications needed to show
that the three conditions are in fact equivalent.
In the context of the noise problem we defined earlier, we are starting with condition (1), ie.
Z = RU (in this case k = l = n), and we would like to derive the probability density of Z. Note
that here we removed the µ from consideration because in machine learning we always assume that
the noise has a mean of 0. We leave it as an exercise for the reader to prove the case for an arbitrary
µ.
We will first start by relating the probability density function of U to that of Z. Denote fU (u) as
the probability density for U = u, and similarly denote fZ (z) as the probability density for Z = z.
One may initially believe that fU (u) = fZ (Ru), but this is NOT true. Remember that since there
is a change of variables from U to Z, we must make sure to incorporate the change of variables
constant, which in this case is the absolute value of the determinant of R. Incorporating this
constant, we will have the correct formula:
Let’s see why this is true, with a simple 2D geometric explanation. Define U space to be the 2D
space with axes U1 and U2 . Now take any arbitrary region R0 in U space (note that this R0 is
different from the matrix R that relates U to Z). As shown in the diagram below, we have some
off-centered circular region R0 and we would like to approximate the probability that U takes a
value in this region. We can do so by taking a Riemann sum of the density function fU (.) over
smaller and smaller squares that make up the region R0 :
Now, let’s apply the linear transformation Z = RU, mapping the region R0 in U space, to the
region T (R0 ) in Z space.
36 CHAPTER 2. MLE, MAP, AND GAUSSIANS
The graph on the right is now Z space, the 2D space with axes Z1 and Z2 . Assuming that the
matrix R is invertible, there is a one-to-one correspondence between points in U space to points in
Z space. As we can note in the diagram above, each unit square in U space maps to a parallelogram
in Z space (in higher dimensions, we would use the terms hypercube and parallelepiped). Recall
the relationship between each unit hypercube and the parallelepiped it maps to:
Area(parallelepiped) = | det(R)| · Area(hypercube)
In this 2D example, if we denote the area of each unit square as ∆u1 ∆u2 , and the area of each unit
parallelepiped as ∆A, we say that
∆A = | det(R)| · ∆u1 ∆u2
Now let’s take a Riemann sum to find the probability that Z takes a value in T (R0 ):
ZZ X X XX
0
P (Z ⊆ T (R )) = fZ (z1 , z2 ) dz1 dz2 ≈ fZ (z) ∆A = fZ (Ru) | det(R)|∆u1 ∆u2
T (R0 ) T (R0 ) R0
Note the change of variables in the last step: we sum over the squares in U space, instead of
parallelograms in R space.
and Z ZZ
0
P (Z ⊆ T (R )) = ... fZ (Ru) | det(R)|du1 du2 . . . dun
R0
Notice that these two probabilities are equivalent! The probability that U takes value in R0 must
equal the probability that the transformed random vector Z takes a value in the transformed region
T (R0 ).
Therefore, we can say that
Z ZZ Z ZZ
P (U ⊆ R0 ) = . . . fU (u) du1 du2 . . . dun = . . . fZ (Ru) | det(R)|du1 du2 . . . dun = P (Z ⊆ T (R0 ))
0 0
R R
2.4. MLE WITH DEPENDENT NOISE 37
We conclude that
fU (u) = fZ (Ru) | det(R)|
Since the densities for all the Ui ’s are i.i.d, and U = R−1 Z, we can write the joint density function
of Z as
1
fZ (z) = fU (R−1 z)
| det(R)|
n
1 Y
= fUi ((R−1 z)i )
| det(R)|
i=1
1 1 1 −1 T −1
= √ e− 2 (R z) (R z)
| det(R)| ( 2π)n
1 1 1 T −T −1
= √ e− 2 z R R z
| det(R)| ( 2π) n
1 1 1 T T −1
= √ e− 2 z (RR ) z
| det(R)| ( 2π)n
Up to this point, we have been able to easily view regression problems from an optimization
perspective. In the context of dependent noise however, it is much easier to view the problem from
a probabilistic perspective. Note as before that:
y = Aw + Z
Where Z is now a jointly Gaussian random vector. That is, Z ∼ N (0, ΣZ ), and y ∼ N (Aw, ΣZ ).
Our goal is to maximize the probability of our data over the set of possible w’s:
1 1 1 T −1
w∗ = arg max p √e− 2 (y−Aw) ΣZ (y−Aw) (2.1)
w∈Rd | det(ΣZ )| ( 2π)n
= arg min(y − Aw)T Σ−1
Z (y − Aw) (2.2)
w∈Rd
38 CHAPTER 2. MLE, MAP, AND GAUSSIANS
Notice that ΣZ is symmetric, which means it has a good eigen structure, therefore we can take the
advantage interpret this geometrically. ΣZ can be written as
2
σ1 0 · · · 0
0 σ2 · · · 0
2 T
ΣZ = Q .. .. Q
.. ..
. . . .
0 · · · · · · σn2
where Q is orthonormal. This means a multivariate Gaussian can be thought of having its level
sets be the ellipsoid having axis given by Q and amount of stretch given by σs.
Let
σ1 0 ··· 0
1 0 σ2 ··· 0
RZ = ΣZ2 = Q ..
.. .. ..
. . . .
0 · · · · · · σn
As before, we can scale the data to morph the problem into an MLE problem with i.i.d noise
variables, by premultiplying the data matrix A and the observation vector y by R−1
Z .
In a very similar fashion to the independent noise problem, the MLE of the scaled dependent noise
problem is w∗ = (AT Σ−1 −1 T −1
Z A) A ΣZ y.
Recall the ordinary least squares (OLS) model. We have a dataset D = {(ai , yi )}ni=1 and assume
that each yi is a linear function of ai , plus some independent Gaussian noise, which we have rescaled
to have variance 1:
iid
zi ∼ N (0, 1) (2.3)
yi = a>i w + zi (2.4)
Initially we used the geometric interpretation of OLS to solve for w. The previous two lectures
showed how we can find w with estimators instead:
When deriving ridge regression via MAP estimation, our prior assumed that wi were i.i.d. (uni-
variate) Gaussian, but more generally, we can allow w to be any multivariate Gaussian:
w ∼ N (µw , Σw )
Recall (see Discussion 4) that we can rewrite a multivariate Gaussian variable as an affine trans-
formation of a standard Gaussian variable:
1
w = Σw/2 |{z}
v + µw v ∼ N (0, I)
|{z}
noise mean
2.5. MAP WITH COLORED NOISE 39
Since the variance from data and prior have both been normalized, the noise-to-signal ratio (λ) is
equal to 1.
However v is not what we care about – we need to convert back to the actual weights w in order
to make predictions. Using our identity again,
ŵ = µw + Σw/2 (Σw/2 A>AΣw/2 + I)−1 Σw/2 A>(y − Aµw )
1 T 1 T
| {z }
Σ−1
w
Note that there are two terms: the prior mean µw , plus another term that depends on both the
data and the prior. The precision matrix of w’s prior (Σ−1
w ) controls how the data fit error affects
our estimate.
To gain intuition, let us consider the simplified case where
σ2 0 ··· 0
w,1 2
0 σw,2 ··· 0
Σw = .. .. .. ..
. . . .
0 0 2
· · · σw,n
When the prior variance σw,j2 for dimension j is large, the prior is telling us that wj may take on
a wide range of values. Thus we do not want to penalize that dimension as much, preferring to let
the data fit sort it out. And indeed the corresponding entry in Σ−1 w will be small, as desired.
2
Conversely if σw,j is small, there is little variance in the value of wj , so wj ≈ µj . As such we
penalize the magnitude of the data-fit contribution to ŵj more heavily.
Alternative derivation
Now w | Y is straightforward. Since v 0 = D−1 Y, the conditional mean and variance of w | Y can
be computed as follows:
E[w | Y] = E[Au0 + Bv 0 | Y]
= E[Au0 | Y] + E[BD−1 Y | Y]
= A E[u0 ] +E[BD−1 Y | Y]
| {z }
0
= BD−1 Y
Var(w | Y) = E[(w − E[w])(w − E[w])> | Y]
= E[(Au0 + BD−1 Y − BD−1 Y)(Au0 + BD−1 Y − BD−1 Y)> | Y]
= E[(Au0 )(Au0 )> | Y]
= E[Au0 (u0 )>A>]
= A E[u0 (u0 )>] A>
| {z }
=Var(u0 )=I
= AA>
In both cases above where we drop the conditioning on Y, we are using the fact u0 is independent
of v 0 (and thus independent of Y = Dv 0 ). Therefore
w | Y ∼ N (BD−1 Y, AA>)
Recall that a Gaussian distribution is completely specified by its mean and covariance matrix. We
see that the covariance matrix of the joint distribution is
" # " > #
w A B A 0
h i
E w> Y> =
Y 0 D B> D>
" #
AA> + BB> BD>
=
DB> DD>
Σw Σw,Y
=
ΣY,w ΣY
Matching the corresponding terms, we can express the conditional mean and variance of w | Y in
terms of these (cross-)covariance matrices:
D−T} D−1 Y = (BD>)(DD>)−1 Y = Σw,Y Σ−1
BD−1 Y = B |D>{z Y Y
I
AA = AA + BB> − BB>
> >
I I
> > > > −1
= AA + BB − (BD )(DD ) DB>
= Σw − Σw,Y Σ−1
Y ΣY,w
We can then apply the same reasoning to the original setup:
" # " #
Y
h i
R R> + AR R> A> AR R>
> z w w
E Y w> = z w w
w Rw R>w A> Rw R>w
ΣY ΣY,w
=
Σw,Y Σw
2.5. MAP WITH COLORED NOISE 41
Σw = Rw R>w
ΣY = Σz + AΣw A>
ΣY,w = AΣw
Σw,Y = Σw A>
ŵ = E[w | Y = y]
= Σw,Y Σ−1
Y y
= Σw A>(Σz + AΣw A> )−1 y
One may be concerned because this expression does not take the form we expect – the inverted
matrix is hitting y directly, unlike in other solutions we’ve seen. But using the Woodbury matrix
identity1 , it turns out that we can rewrite this expression as
ŵ = (A>Σ−1 −1 −1 > −1
z A + Σ w ) A Σz y
Low-Rank approximation
Previously, we have covered Ordinary Least Squares (OLS). which assumes that the dependent
variable y is noisy but the independent variables x are noise-free. We now discuss Total Least
Squares (TLS), which arises in the case where we assume that our x data is also corrupted by
noise. Both LS methods want to get a model that produce an approximation closest to all the
points, but they measure the distance differently. OLS tries to minimize the vertical distance
between the fitted line and data points, while TLS tries to minimize the perpendicular distance.
The red line represents vertical distance, which OLS aims to minimize. The blue line represents
perpendicular distance, which TLS aims to minimize. Note that all blue lines are perpendicular
to the black line (hypothesis model), while all red lines are perpendicular to the x axis.
We might begin with a probabilistic formulation and fit the parameters via maximum likelihood
estimation, as before. Suppose on the plane, we have a true model that we want to recover from
some data points:
yi = axi (1)
iid
where the noise terms are normally distributed, i.e. zxi , zyi ∼ N (0, 1).
Applying the equation (1) that comes from the true model mentioned above, we rewrite (2) in the
43
44 CHAPTER 3. LOW-RANK APPROXIMATION
Given these assumptions, we can derive the likelihood for just 1 point under hypothesis a:
!
1 1 (yi − axi )2
P (xi , yi ; a) = p exp − (4)
2π(a2 + 1) 2 a2 + 1
1 1 (y − ax )2
i i
log P (xi , yi ; a) = constant − log a2 + 1 − (5)
2 2 a2 + 1
Observe that a shows up in three places, unlike the form that we are familiar with, where a only
appears in the quadratic term. Our usual strategy of setting the derivative equal to zero to find
a maximizer will not yield a nice system of linear equations in this case, so we’ll try a different
approach.
Solution
To solve the TLS problem, we develop another formulation that can be solved using the singular
value decomposition.
Assume we have n data points, xi ∈ Rd , yi ∈ R, and we stack them to get
(X + e)w = y + f (6)
where w ∈ Rd is the weight, and E and f are noise terms that we add to explain the error in the
model. Our goal is to minimize the Frobenius norm1 of the matrix composed of these error vectors.
Recall that from a probabilistic perspective, finding the most likely value of a Gaussian corresponds
to minimizing the squared distance from the mean. Since we assume the noise is 0-centered, we
want to minimize the sum of squares of each entry in the error matrix, which corresponds exactly to
minimizing the Frobenius norm. Thus we arrive at the following constrained optimization problem:
2
min
E f
subject to (X + E)w = y + f
E,f F
In order to separate out the term being minimized, we rearrange the constraint equation as
w
X y + E f =0 (7)
| {z } −1
∈Rn×(d+1)
vector: XX
kAk2F = A2ij
i j
3.1. TOTAL LEAST SQUARES 45
we must add something to it so that it loses rank, since otherwise the nullspace is just {0} and the
equation cannot be solved. We use the SVD coordinate system to achieve this:
σ1 · · · 0
.. . . .. v>1
. . . .
X y = u1 . . . ud+1 | Urest 0 · · · σd+1 .. (8)
.. . . .. v>
. . . d+1
0 ··· 0
where σ1 ≥ σ2 ≥ · · · ≥ σd+1 > 0, and U and V are orthogonal matrices. Recall that this implies
not change the Frobenius norm, so minimizing k E f k2F is
that multiplication by U or V does
equivalent to minimizing k E 0 f 0 k2F where E 0 , f 0 are E, f expressed in the SVD coordinates. Now
our problem reduces to finding E 0 , f 0 such that
σ1 · · · 0
.. . . ..
. . .
0 · · · σd+1 + E 0 f 0
(9)
.. . . ..
. . .
0 ··· 0
0
is not full rank and k E 0 f k2F is as small as possible. Since the matrix on the left is diagonal, we
can
0 reduce
its rank by simply zeroing out one of its diagonal elements. Therefore our perturbation
E f 0 will have −σj in the (j, j) position for some j, and zeros everywhere else. To minimize the
size of the perturbation, we decide to eliminate the smallest σj by taking
0 ··· 0
.. . . ..
0 0
. . .
E f = 0 · · · −σd+1
.. . . ..
. . .
0 ··· 0
Such a perturbation in SVD coordinates corresponds to a perturbation of
0 ··· 0
.. . . .. v>1
. . . . >
E f = u1 . . . ud+1 | Urest 0 · · · −σd+1 .. = −σd+1 ud+1 vd+1
.. . . .. v>
. . . d+1
0 ··· 0
in the original coordinate system. It turns out that this choice is optimal, as guaranteed by the
Eckart-Young theorem, which is stated at the end for reference.
The nullspace of our resulting matrix is then
d
X
σj uj v>j = span{vd+1 }
null X y + E f = null
j=1
where the last equality holds because {v1 , . . . , vd+1 } form an orthogonal basis for Rd+1 . To get the
weight w, we find a scaling α such that (w, −1) is in the nullspace, i.e.
w
= αvd+1
−1
46 CHAPTER 3. LOW-RANK APPROXIMATION
Once we have vd+1 , or any scalar multiple of it, we simply rescale it so that the second component
is −1, and then the first component gives us w. Since vd+1 is a right-singular vector of X y , it
is an eigenvector of the matrix
" #
> X>X X>y
X y X y = > (11)
y X y>y
Observe that the off-diagonal terms of E[Z> x Zx ] terms are zero because the ith and jth rows of Zx are
independent for i 6= j, and the on-diagonal terms are essentially variances. Thus the −σd+1 2 I term
is there to compensate for the extra noise introduced by our assumptions regarding the independent
variables.
Eckart-Young Theorem
The Eckart-Young theorem essentially says that the best low-rank approximation (in terms of the
Frobenius norm) is obtained by throwing away the smallest singular values.
Theorem. Suppose A ∈ Rm×n has rank r ≤ min(m, n), and let A = U ΣV > = ri=1 σi ui v>i be its
P
singular value decomposition. Then
σ1 · · · 0 · · · 0
.. . .
Xk . . 0 · · · 0
σi ui v>i = U
>
Ak = 0 0 σ k · · · 0 V
. . . .
i=1
..
.. .. . . . ..
0 0 0 ··· 0
3.2. PRINCIPAL COMPONENT ANALYSIS 47
kA − Ak kF ≤ kA − ÃkF
In machine learning, the data we have are often very high-dimensional. There are a number of
reasons why we might want to work with a lower-dimensional representation:
• Visualization (if we can get it down to 2 or 3 dimensions), e.g. for exploratory data analysis
• Reduce noise
Projection
Let us first review the meaning of scalar projection of one vector onto another. If u ∈ Rd is a unit
vector, i.e. kuk = 1, then the projection of another vector x ∈ Rd onto u is given by x>u. This
quantity tells us roughly how much of the projected vector x lies along the direction given direction
u. Why does this expression make sense? Recall the slightly more general formula which holds for
vectors of any length:
x>u = kxkkuk cos θ
where θ is the angle between the vectors. In this case, since kuk = 1, the expression simplifies to
x>u = kxk cos θ. But since cosine gives the ratio of the adjacent side (the projection we want to
find) to the hypotenuse (kxk), this is exactly what we want:
48 CHAPTER 3. LOW-RANK APPROXIMATION
Formulating PCA
Let X ∈ Rn×d be our matrix of data, where each row is a d-dimensional data point. We will assume
that the data points have mean zero; otherwise we subtract the mean to make them zero-mean:
1
1 > ..
X − 1n 1n X, where 1n = .
n
1
The motivation for this is that we want to find directions of high variance within the data, and
variance is defined relative to the mean of the data. If we did not zero-center the data, the directions
found would be heavily influenced by where the data lie relative to the origin, rather than where
they lie relative to the other data, which is more useful. For example, translating all the data by
some fixed vector could completely change the principal components if we did not center.
Recall that for any random variable Z,
Var(Z) = E[(Z − E[Z])2 ]
so if E[Z] = 0 then Var(Z) = E[Z 2 ].
Hence once X is zero-mean, the variance of the projections is given by
n
X
P CA var (u) = (x>i u)2 = kXuk2
i=1
where u is constrained to have unit norm. We want to maximize the variance, so the objective
becomes
kXuk2 u>X>Xu
max P CA var (u) = max kXuk2 = max = max
kuk=1 kuk=1 u6=0 kuk2 u6=0 u>u
The ratio on the right is known as a Rayleigh quotient. We will see that Rayleigh quotients are
heavily related to eigenvalues, so anytime you see one, your eigensense should tingle.
Rayleigh Quotients
Suppose M ∈ Rd×d is a real, symmetric (M = M>) matrix. The Rayleigh quotient of M is defined
as
u>M u
R(u; M ) = >
uu
Denote the eigenvalues of M by λ1 ≥ λ2 ≥ · · · ≥ λd , with corresponding eigenvectors v1 , . . . , vd .
That is, M vj = λj vj for j = 1, · · · , d. If we stack the vj as columns of a matrix V :
V = v1 · · · vd
then the eigenvector equations can be simulatneously written as
MV = V Λ
where
Λ = diag(λ1 , · · · , λd )
Then since V is an orthogonal matrix, it is invertible with V −1 = V >, so
V >M V = Λ
3.2. PRINCIPAL COMPONENT ANALYSIS 49
Let u be a unit length vector. Since {vj } form a basis, we can write
d
X
u= αj vj = V α
j=1
Since we have the constraint α11 + · · · + αd2 = 1, the right-most expression is a weighted average of
the eigenvalues and hence bounded by the smallest and largest of these:
λd ≤ R(u; M ) ≤ λ1
The lower bound is achieved by putting αd = ±1, αj6=d = 0, and the upper bound by α1 = ±1,
αj6=1 = 0. The maximizing value can then be recovered by
d
X
αj vj = ±v1
j=1
Finally, note that since the Rayleigh quotient is scale invariant, i.e. R(γu; M ) = R(u; M ) for any
γ 6= 0, the inequality above holds for any scaling of the vectors, not just unit-length vectors.
Armed with our knowledge of Rayleigh quotients, the solution to the PCA problem is immediate:
u>X>Xu
max P CA var (u) = max >u
= λ1 (X>X)
kuk=1 u6=0 u
| {z }
R(u,X>X)
where the maximizer u∗ is a unit eigenvector corresponding to this eigenvalue. Writing X = U ΣV >,
we have
X>X = V Σ> U >
U ΣV > = V Σ>ΣV >
|{z}
I
We have seen how to derive the first principal component, which maximizes the variance of the
projected data. But usually we will need more than one direction, since one direction is unlikely
50 CHAPTER 3. LOW-RANK APPROXIMATION
to capture the data well. The basic idea here is to subtract off the contributions of the previ-
ously computed principal components, and then apply the same rule as before to what remains. If
u(1) , . . . , u(k−1) denote the principal components already computed, this subtracting off is accom-
plished by
k−1
X k−1
X
X̂ = X − Xu(j) u>(j) = X I − u(j) u>(j)
j=1 j=1
This expression should be understood as applying the same subtracting transformation to each row
of the data2 :
k−1
X k−1
X
x̂i = I − u(j) u>(j) xi = xi − u(j) u>
(j) xi
j=1 j=1
The vector u(j) u>(j) xi should be recognized as the orthogonal projection of xi onto the subspace
spanned by u(j) . Hence x̂i is what’s left when you start with xi and then remove all the components
that belong to the subspaces spanned by each u(j) .3
We want to find the direction of largest variance subject to the constraint that it must be orthogonal
to all the previously computed directions. Thus we have a constrained problem of the form
u>X̂>X̂u
u(k) = arg max
∀j<k: u>u(j) =0 u>u
But we don’t want to actually compute X̂. Fortunately, we don’t have to! Consider that if u is
orthogonal to u(1) , . . . , u(k−1) (as we constrain it to be), then
k−1
X k−1
X
>
X̂u = X − Xu(j) u(j) u = Xu − Xu(j) u>(j) u = Xu
j=1 j=1 | {z }
0
u>X>Xu
u(k) = arg max
∀j<k: u>u(j) =0 u>u
eliminating the need to compute X̂. Unsurprisingly, the solution to this problem is given by
u(k) = vk , that is, a unit eigenvector corresponding to the kth largest eigenvalue of X>X.
Rather than iteratively computing each new u(j) , we can view the problem of finding the first k
principal components as a joint optimization problem over all k directions simultaneously. This
amounts to maximizing the variance as projected onto a k-dimensional subspace:
k
X
U = arg max u>(j) X>Xu(j) = arg max tr U>X>XU
U>U =I j=1 U>U =I
As before, the bounds for this expression are given in terms of the smallest and largest eigenvalues,
but now there are k of them:
λ1 + λ2 + · · · + λk = R([v1 , v2 , . . . , vk ]; M )
≥ R(U ; M )
≥ R([vd−k+1 , . . . , vd−1 , vd ]; M )
= λd−k+1 + · · · + λd−1 + λd
Hence, projection onto the subspace spanned by the first k leading eigenvectors maximizes the
variance of the projected data. We can find k principal components by computing the SVD,
X = U ΣV >, and then taking the first k columns of the matrix V .
Once we have the principal components, we can use them as a new coordinate system. To do this
we must project the data onto this coordinate system, which can be done in the same way as above
(taking inner products). Each data point xi ∈ Rd becomes a new vector x̃i ∈ Rk , where k is the
number of principal components. The components of the projection write
[x̃i ]j = x>i uj
We can compute all these vectors at once more efficiently using a matrix-matrix multiplication
X̃ = XU
Observe that the data are uncorrelated in the projected space. Also note that this example does
not show the full power of PCA since we have not reduced the dimensionality of the data at all –
the plot is merely to show the PCA coordinate transformation.
We have given the most common derivation of PCA above, but it turns out that there are other
equivalent ways to arrive at the same formulation. These give us helpful additional perspectives on
what PCA is doing.
52 CHAPTER 3. LOW-RANK APPROXIMATION
Gaussian assumption
Let us assume that the data are generated by a multivariate Gaussian distribution:
iid
xi ∼ N (µ, Σ)
Then the maximum likelihood estimate of the covariance matrix Σ is
n
1X 1
Σ̂ = (xi − x̄)(xi − x̄)> = X>X
n n
i=1
where x̄ is the sample average and the matrix X is assumed to be zero-mean as before. The
eigenvectors of Σ̂ and X>X are the same since they are positive scalar multiples of each other.
The contours of the multivariate Gaussian density form ellipsoids (see Figure 1). The direction of
largest variance (i.e. the first principal component) is the eigenvector corresponding to the smallest
eigenvalue of Σ−1 , which is the largest eigenvalue of Σ. We do not know Σ in general, so we use Σ̂
in its place. Thus the principal component is an eigenvector corresponding to the largest eigenvalue
of Σ̂. As mentioned earlier, this matrix has the same eigenvalues and eigenvectors as X>X, so we
arrive at the same solution.
Ordinary least squares minimizes the vertical distance between the fitted line and the data points:
ky − Xuk2
We show that PCA can be interpreted as minimizing the perpendicular distance between the
principal component subspace and the data points, so in this sense it is doing the same thing as
total least squares.
The orthogonal projection of a vector x onto the subspace spanned by a unit vector u equals u
scaled by the scalar projection of x onto u:
Pu x = (uu>)x = (x>u)u
Suppose we want to minimize the total reconstruction error:
n
X
P CA Err (u) = kxi − Pu xi k2
i=1
n
X
2 2
= kxi k − kPu xi k (∗)
i=1
Xn n
X
= 2
kxi k − k(x>i u)uk2
i=1 i=1
n
X n
X
= kxi k2 − (x>i u)2
i=1 i=1
| {z }
P CA V ar (u)
2
P
since x − Pu x ⊥ Pu x. Then since the first term i kxi k is constant with respect to u, we have
arg min P CA Err (u) = arg min constant − P CA V ar (u) = arg max P CA V ar (u)
u u u
The Pearson Correlation Coefficient ρ(X, Y ) is a way to measure how linearly related (in other
words, how well a linear model captures the relationship between) random variables X and Y .
Cov(X, Y )
ρ(X, Y ) = p
Var(X) Var(Y )
Here are some important facts about it:
The correlation is defined in terms of random variables rather than observed data. Assume now
that x, y ∈ Rn are vectors containing n independent observations of X and Y , respectively. Recall
the law of large numbers, which states that for i.i.d. Xi with mean µ,
n
1X a.s.
Xi −→ µ as n → ∞
n
i=1
Plugging these estimates into the definition for correlation and canceling the factor of 1/n leads us
to the sample Pearson Correlation Coefficient ρ̂:
Pn
(xi − x̄)(yi − ȳ)
ρ̂(x, y) = pPn i=1 Pn
2 2
i=1 (xi − x̄) · i=1 (yi − ȳ)
x̃>ỹ
=p where x̃ = x − x̄, ỹ = y − ȳ
x̃>x̃ · ỹ>ỹ
Here are some 2-D scatterplots and their corresponding correlation coefficients:
then we can define a distribution on normalized X and Y and have their relationship entirely
captured by ρ(X, Y ). First write
σxy
ρ(X, Y ) =
σx σy
Then " 2 #
σx2 σxy
σx ρσx σy
Σ= =
σxy σy 2 ρσx σy σy2
so
" # " # " #>
−1
σx 0 σ −1 0 σ −1 0
X
∼ N 0, x Σ x
0 σy−1 Y 0 σy−1 0 σy−1
!
1 ρ
∼ N 0,
ρ 1
3.3. CANONICAL CORRELATION ANALYSIS 55
Canonical Correlation Analysis (CCA) is a method of modeling the relationship between two
point sets by making use of the correlation coefficient. Formally, given zero-mean random vectors
Xrv ∈ Rp and Yrv ∈ Rq , we want to find projection vectors u ∈ Rp and v ∈ Rq that maximizes the
correlation between X> >
rv u and Yrv v:
Cov(X> >
rv u, Yrv v)
max ρ(X> >
rv u, Yrv v) = max p
u,v u,v Var(X> >
rv u) Var(Yrv v)
Observe that
Cov(X> > > > > >
rv u, Yrv v) = E[(Xrv u − E[Xrv u])(Yrv v − E[Yrv v])]
= E[u>(Xrv − E[Xrv ])(Yrv − E[Yrv ])>v]
= u>E[(Xrv − E[Xrv ])(Yrv − E[Yrv ])>]v
= u> Cov(Xrv , Yrv )v
which also implies (since Var(Z) = Cov(Z, Z) for any random variable Z) that
Var(X> >
rv u) = u Cov(Xrv , Xrv )u
>
Var(Yrv v) = v> Cov(Yrv , Yrv )v
so the correlation writes
u> Cov(Xrv , Yrv )u
ρ(X> >
rv u, Yrv v) = p
u> Cov(Xrv , Xrv )u · v> Cov(Yrv , Yrv )v
Unfortunately, we do not have access to the true distributions of Xrv and Yrv , so we cannot compute
these covariances matrices. However, we can estimate them from data. Assume now that we are
given zero-mean data matrices X ∈ Rn×p and Y ∈ Rn×q , where the rows of the matrix X are i.i.d.
samples xi ∈ Rp from the random variable Xrv , and correspondingly for Yrv . Then
n
1X > 1 >
Cov(Xrv , Yrv ) = E[(Xrv − E[Xrv ])(Yrv − E[Yrv ])>] = E[Xrv Yrv
>
]≈ xi yi = X Y
| {z } | {z } n n
0 0 i=1
where again the sample-based approximation is justified by the law of large numbers. Similarly,
n
1X > 1 >
Cov(Xrv , Xrv ) = E[Xrv X>
rv ] ≈ xi xi = X X
n n
i=1
n
> 1 X 1 >
Cov(Yrv , Yrv ) = E[Yrv Yrv ]≈ yi y>
i = Y Y
n n
i=1
Plugging these estimates in for the true covariance matrices, we arrive at the problem
u> n1 X>Y u u>X>Y v
max r = max √
u,v u,v > > > >
| u X Xu {z· v Y Y v}
u> n1 X>X u · v> n1 Y >Y v
ρ̂(Xu,Y v)
Let’s try to massage the maximization problem into a form that we can reason with more easily.
Our strategy is to choose matrices to transform X and Y such that the maximization problem is
equivalent but easier to understand.
56 CHAPTER 3. LOW-RANK APPROXIMATION
1. First, let’s choose matrices Wx , Wy to whiten X and Y . This will make the (co)variance
matrices (XWx )>(XWx ) and (Y Wy )>(Y Wy ) become identity matrices and simplify our ex-
pression. To do this, note that X>X is positive definite (and hence symmetric), so we can
employ the eigendecomposition
X>X = Ux Sx U>
x
Since
Sx = diag(λ1 (X>X), . . . , λd (X>X))
where all the eigenvalues are positive, we can define the “square root” of this matrix by taking
the square root of every diagonal entry:
q q
1/2
Sx = diag > >
λ1 (X X), . . . , λd (X X)
−1/2 >
Then, defining Wx = Ux Sx Ux , we have
(XWx )>(XWx ) = Wx>X>XWx
= Ux Sx− /2 U> > − /2 >
1 1
x Ux Sx Ux Ux Sx Ux
= Ux Sx− /2 Sx Sx− /2 U>
1 1
x
= Ux U>
x
=I
which shows that Wx is a whitening matrix for X. The same process can be repeated to
−1/2
produce a whitening matrix Wy = Uy Sy U> y for Y .
Let’s denote the whitened data Xw = XWx and Yw = Y Wy . Then by the change of variables
uw = Wx−1 u, vw = Wy−1 v,
(Xu)>Y v
max ρ̂(Xu, Y v) = max p
u,v u,v (Xu)>Xu(Y v)>Y v
(XWx Wx−1 u)>Y Wy Wy−1 v
= max q
u,v
(XWx Wx−1 u)>XWx Wx−1 u(Y Wy Wy−1 v)>Y Wy Wy−1 v
(Xw uw )>Yw vw
= max p
uw ,vw (Xw uw )>Xw uw (Yw vw )>Yw vw
u>w X>
w Yw vw
= max p
uw ,vw u>w X> > >
w Xw uw · vw Yw Yw vw
u> X> Yw vw
= max p w w
uw ,vw u> uw · v> vw
| w {z w }
ρ̂(Xw uw ,Yw vw )
2. Second, let’s choose matrices Dx , Dy to decorrelate Xw and Yw . This will let us simplify
the covariance matrix (Xw Dx )>(Yw Dy ) into a diagonal matrix. To do this, we’ll make use of
the SVD:
X> w Yw = U SV
>
Let’s denote the decorrelated data Xd = Xw Dy and Yd = Yw Wy . Then by the change of variables
ud = Dx−1 uw = D> −1 >
x uw , vd = Dy vw = Dy vw ,
(Xw uw )>Yw vw
max ρ̂(Xw uw , Yw vw ) = max p
uw ,vw uw ,vw u>w uw · v>
w vw
(Xw Dx Dx−1 uw )>Yw Dy Dy−1 vw
= max p
uw ,vw (Dx uw )>Dx uw · (Dy vw )>Dy vw
(Xd ud )>Yd vd
= max q
ud ,vd
u>d ud · v>
d vd
u>Xd Yd vd
= max q d
ud ,vd
u>d ud · v>
d vd
| {z }
ρ̂(Xd ud ,Yd vd )
u>d Svd
= max q
ud ,vd
u>d ud · v>
d vd
Without loss of generality, suppose ud and vd are unit vectors4 so that the denominator becomes
1, and we can ignore it:
u>d Svd u>d Svd
max q = max = max u>d Svd
ud ,vd kud k=1 kud kkvd k kud k=1
u>d ud · v>
d vd kvd k=1 kvd k=1
The diagonal nature of S implies Sij = 0 for i 6= j, so our simplified objective expands as
XX X
u>d Svd = (ud )i Sij (vd )j = Sii (ud )i (vd )i
i j i
where Sii , the singular values of X> w Yw , are arranged in descending order. Thus we have a weighted
sum of these singular values, where the weights are given by the entries of ud and vd , which are
constrained to have unit norm. To maximize the sum, we “put all our eggs in one basket” and
extract S11 by setting the first components of ud and vd to 1, and the rest to 0:
1 1
0 0
ud = .. ∈ Rp vd = .. ∈ Rq
. .
0 0
Any other arrangement would put weight on Sii at the expense of taking that weight away from
S11 , which is the largest, thus reducing the value of the sum.
Finally we have an analytical solution, but it is in a different coordinate system than our original
problem! In particular, ud and vd are the best weights in a coordinate system where the data has
been whitened and decorrelated. To bring it back to our original coordinate system and find the
vectors we actually care about (u and v), we must invert the changes of variables we made:
u = Wx uw = Wx Dx ud v = Wy vw = Wy Dy vd
4 Why can we assume this? Observe that the value of the objective does not change if we replace u by αu and v by βv ,
d d d d
where α and β are any positive constants. Thus if there are maximizers ud , vd which are not unit vectors, then ud /kud k and
vd /kvd k (which are unit vectors) are also maximizers.
58 CHAPTER 3. LOW-RANK APPROXIMATION
U = Wx Dx Ud V = Wy Dy Vd
Note that Ud and Vd have orthogonal columns. The columns of U and V , which are the projection
directions we seek, will in general not be orthogonal, but they will be linearly independent (since
they come from the application of invertible matrices to the columns of Ud , Vd ).
An advantage of CCA over PCA is that it is invariant to scalings and affine transformations of X
and Y . Consider a simplified scenario in which two matrix-valued random variables X, Y satisfy
Y = X + where the noise has huge variance. What happens when we run PCA on Y ? Since
PCA maximizes variance, it will actually project Y (largely) into the column space of ! However,
we’re interested in Y ’s relationship to X, not its dependence on noise. How can we fix this? As
it turns out, CCA solves this issue. Instead of maximizing variance of Y , we maximize correlation
between X and Y . In some sense, we want the maximize “predictive power” of information we
have.
CCA regression
Once we’ve computed the CCA coefficients, one application is to use them for regression tasks,
predicting Y from X (or vice-versa). Recall that the correlation coefficient attains a greater value
when the two sets of data are more linearly correlated. Thus, it makes sense to find the k × k weight
matrix A that linearly relates XU and Y V . We can accomplish this with ordinary least squares.
Denote the projected data matrices by Xc = XU and Yc = Y V . Observe that Xc and Yc are
zero-mean because they are linear transformations of X and Y , which are zero-mean. Thus we can
fit a linear model relating the two:
Yc ≈ Xc A
The least-squares solution is given by
A = (X> −1 >
c Xc ) Xc Yc
= (U>X>XU )−1 U>X>Y V
However, since what we really want is an estimate of Y given new (zero-mean) observations X̃
(or vice-versa), it’s useful to have the entire series of transformations that relates the two. The
predicted canonical variables are given by
We can collapse all these terms into a single matrix Aeq that gives the prediction Ŷ from X̃:
Aeq = U
|{z} (U>X>XU )−1 (U>X>Y V ) (V >V )−1 V >
| {z }| {z }| {z }
projection whitening decorrelation projection back
There are many issues with working in high dimensions. As the dimension d grows, machine learning
algorithms can become more computationally intensive. It also becomes more difficult to visualize
our data - humans are notoriously bad at visualizing beyond 3 dimensions. Additionally, redundant
features can add more noise than signal. There is a widely held belief that most natural data in
high dimensions (for example, data used in genomics) can be represented in a lower dimensional
space.
Dimensionality reduction is an unsupervised method - unlike supervised learning, there are no labels
that we need to match, no classes to predict. As such, defining problems becomes more subjective
and heuristic. One approach to dimensionality reduction is feature selection, in which we remove
features that we deem to be irrelevant based on some criteria. For example, the LASSO provides
this feature selection using L1 regularization.
Another approach to dimensionality reduction is learning latent features. This approach seeks to
find new latent features that are transformations of our given features that represent the data well.
Figure 3.2: The Swiss Roll dataset is a 2-dimensional manifold embedded in 3 dimensional space. We should
be able to represent it with a 2-dimensional feature space.
PCA can be used for dimensionality reduction. Recall that finding the PCA decomposition of
X ∈ Rn×D amounts to finding the SVD
X = U SV T
Suppose d << D, and let Ũ , S̃, Ṽ contain the first d columns of U, S, V respectively. Then Ṽ T x
projects data x into a d-dimensional latent space. This projection into a lower-dimensional space
60 CHAPTER 3. LOW-RANK APPROXIMATION
will typically cause some loss of information, but if the data really does live in a d-dimensional
subspace, then we should be able to reconstruct x via Ṽ Ṽ T x and obtain something that is close to
the original x. PCA can be seen as finding a basis for a subspace that minimizes this reconstruction
error (low-rank approximation property).
Sometimes, it does not make sense to find orthogonal directions that capture the maximum variance,
as PCA does. Independent Components Analysis instead seeks directions that are statistically
independent, and is more suitable for some applications.
NMF takes a non-negative matrix X, where each column is a data entry, and approximately factors
it into X = BH, where B is skinny, H is fat, and both these matrices are non-negative. The k-th
column of X is the sum of the columns of B, weighted according to the entries in the k-th column
of H. In this regard, each column of B can be seen as a feature that contributes to the data
in a meaningful way - the number of columns in B is the number of features we are allowed to
reconstruct the data with. If each column of X is a vectorized image, then each column of B will be
a vectorized image. NMF learns a part-based representation of the data, since the reconstruction is
a non-negative linear combination of features of the same dimension as the data. It turns out that
NMF will tend to produce sparser features than PCA. NMF has applications in computer vision
and recommender systems, among other fields.
Multidimensional Scaling
MDS seeks to learn a low-dimensional representation such that pairwise distances between points
are exactly preserved in the latent representation. Specifically, if Xi ∈ RD and Wij = kXi − Xj k2 ,
then MDS finds Yi such that kYi − Yj k2 ≈ Wij . More generally, one can use any dissimilarity
measure d(Xi , Xj ) in place of the norm. If the dissimilarity measure is a metric, then MDS is
equivalent to PCA. MDS is typically used to visualize high dimensional data.
Note that while MDS captures interpoint distances in its low-dimensional embedding, it is incapable
of capturing local geometric structure - for example, in the Swiss roll dataset shown above, points
in the red region are dissimilar from points in the blue region, but are relatively close to these points
in distance, so MDS will provide a representation where red and blue points are not separated.
Nonlinear Methods
Linear methods such as PCA are not well-suited to capture intrinsic geometric structure in data.
There are several approaches to solving this problem:
neighborhood homeomorphic to a neighborhood in Euclidean space. Most machine learning researchers do not care for these
definitions and will call almost anything a manifold.
3.4. DIMENSIONALITY REDUCTION 61
(1) Construct a local neighborhood graph by connecting each point with its k nearest neighbors.
(2) Compute all-pairs shortest path distances (these are the geodesic distances).
(3) Apply MDS on geodesic distances.
If we apply IsoMap to the Swiss roll, we see that while the Euclidean distance between the red and
blue regions is low, the geodesic distance is high - the geodesic distance between points represents
how far we would have to go if we were walking on the Swiss roll manifold in 2 dimensions.
IsoMap has been used effectively for dimensionality reduction in facial data.
Next time: Laplacian Eigenmaps, t-SNE.
62 CHAPTER 3. LOW-RANK APPROXIMATION
Chapter 4
All the models we’ve seen so far are linear in the parameters we’re trying to learn. That is, our
prediction ŷ = f (x; θ) is some linear function of the parameters θ. For example, in OLS, θ = w and
the residuals ri are computed by yi − w>xi , which is linear in the components of w. In the case of
least-squares polynomial regression, the predicted value is not a linear function of the input x, but
it is still a linear function of the parameters.
However, we may have need for models which are nonlinear function of their parameters. We
consider a motivating example first.
Suppose we want to estimate the 2D position θ = (θ1 , θ2 ) of some entity, for example a robot.
The information we have to work with are noisy distance estimates Yi ∈ R from m sensors whose
positions Xi ∈ R2 are fixed and known. If we assume i.i.d. Gaussian noise as usual, our statistical
model has the form
iid
Yi = kXi − θk + Ni , Ni ∼ N (0, σ 2 ), i = 1, . . . , m
where p
kXi − θk = (Xi1 − θ1 )2 + (Xi2 − θ2 )2
Here our prediction is
ŷ = f (x; θ) = kx − θk
which is clearly not linear in θ.
More generally, let us assume a model similar to the one above, but where f is now some arbitrary
differentiable function and θ ∈ Rd :
iid
Yi = f (Xi ; θ) + Ni , Ni ∼ N (0, σ 2 ), i = 1, . . . , m
63
64 CHAPTER 4. GRADIENT DESCENT, NEWTON’S METHOD
The last step holds because the first term in the sum is constant w.r.t. the optimization variable
θ, and we flip from max to min because of the negative sign.
Observe that the objective function is a sum of squared residuals as we’ve seen before, even though
the function f is nonlinear in general. For this reason the method is called nonlinear least
squares.
Unfortunately, there is no closed-form solution for θ̂mle in general. Later we will see an iterative
method for computing it.
Motivated by the MLE formulation above, we consider the following optimization problem:
X
min LS (θ) = min (yi − f (xi ; θ))2
θ θ
i
One way to minimize a function is to use calculus. We know that the gradient of the objective
function at any minimum must be zero, because if it isn’t, we can take a sufficiently small step in
the direction of the negative gradient that the objective function’s value will be reduced.
Thus, the first-order optimality condition that needs to be satisfied is:
X
∇θ LS = 2 (yi − f (xi ; θ))∇θ f (xi ; θ) = 0
i
J(θ)>(Y − F (θ)) = 0
4.1. NONLINEAR LEAST SQUARES 65
where
f (x1 ; θ)
F (θ) =
..
.
f (xn ; θ)
∇θ f (x1 ; θ)>
..
J(θ) = = ∇θ F , the Jacobian of F
.
∇θ f (xn ; θ) >
Observe that when f is linear in θ (i.e. f (xi ; θ) = θ>xi ), the gradient ∇θ LS will only have θ in one
place because the term ∇θ f (xi ; θ) will only depend on xi :
X X
∇θ LS = 2 (yi − θ>xi )∇θ (θ>xi ) = 2 (yi − θ>xi )xi
i i
and it is easy to derive a closed-form solution for θ in terms of the yi ’s and xi ’s:
2X>(Y − Xθ) = 0
2X>Y − 2X>Xθ = 0
X>Y = X>Xθ
θ = (X>X)−1 X>Y
Since there is no closed-form solution to the nonlinear least squares optimization problem, we resort
to an iterative algorithm, the Gauss-Newton algorithm1 , to tackle it. At each iteration, this
method linearly approximates the function F about the current iterate and solves a least-squares
problem involving the linearization in order to compute the next iterate.
Let’s say that we have a “guess” for θ at iteration k, which we denote θ(k) . We can then approximate
F (θ) to first order using a Taylor expansion about θ(k) :
where ∆θ := θ − θ(k) .
1 For some reason this algorithm was called gradient descent in lecture, but it is not really gradient descent. However,
like gradient descent, it is an iterative, first-order optimization algorithm. Another popular method for solving nonlinear least
squares, the Levenberg-Marquardt algorithm, is a sort of interpolation between Gauss-Newton and gradient descent.
66 CHAPTER 4. GRADIENT DESCENT, NEWTON’S METHOD
Now since F̃ is linear in ∆θ (the Jacobian and F are just constants: functions evaluated at θ(k) ),
we can use the closed form solution for ∆θ from the optimality condition to update our current
guess θ(k) . Applying the first-order optimality condition from earlier to the objective F̃ yields the
following equation:
> (k) > (k) (k)
0 = JF̃ (θ) (Y − F̃ (θ)) = J(θ ) Y − F (θ ) + J(θ )∆θ
Note that we have used the fact that the Jacobian of the linearized function F̃ , evaluated at any θ,
is precisely J(θ(k) ). This is because F̃ is affine where the linear map is J(θ(k) ), so the best linear
approximation is just that.
Writing J = J(θ(k) ) for brevity, we have
where ∆Y := Y − F (θ(k) ). By comparing this solution to OLS, we see that it is effectively solving
∆θ = arg min kJδθ − ∆Y k2
δθ
Since δF ≈ Jδθ close to θ(k) , this is saying that we choose a change to the weights that corrects
for the current error in the function values, but it bases this calculation on the linearization of F .
Recalling that ∆θ = θ − θ(k) , we can improve upon our current guess θ(k) with the update
θ(k+1) = θ(k) + ∆θ
= θ(k) + (J>J)−1 J>∆Y
Here’s the entire process laid out in steps:
Note that the solution found will depend on the initial value θ(0) in general.
The choice for measuring convergence is up to the practitioner. Some common choices include
testing changes in the objective value:
(k+1) − (k)
≤ threshold
(k)
Gradient descent is an iterative algorithm for finding local minima of differentiable functions. For
an analogy, imagine walking downhill surrounded a thick fog that prevents you from seeing the
height of the land around you, other than being able to tell which direction is steepest.
Recall that the gradient of f at x, denoted ∇f (x), points in the direction of steepest ascent.
Conversely, the negative gradient points in the direction of steepest descent. Therefore, if we take
a small step in the direction of the negative gradient, we will decrease the value of the function.
The update performed is
xi+1 = xi − αi ∇f (xi )
where αi > 0 is the step size. We may choose αi to be a fixed constant, but in many cases it is
decayed to zero over the course of training.
68 CHAPTER 4. GRADIENT DESCENT, NEWTON’S METHOD
Chapter 5
Neural Networks
Neural networks are a class of compositional function approximators. Unlike other function ap-
proximators we have seen (e.g. polynomials), they are nonlinear in their parameters.
(Aside) In information processing we have several perspectives:
These circuits represent matrix multiplications, with the weights on the edges being the entries of
the matrix. Assuming that the flow of information is from left to right, the circuit on the left is
multiplication by a 3 × 4 matrix, and the circuit on the right is multiplication by a 2 × 4 matrix
followed by multiplication by a 3 × 2 matrix.
Are these two circuits equally expressive? Certainly not, as the one on the left has rank at most
2, while the one on the right may have rank 3. However, the one on the left has more layers
69
70 CHAPTER 5. NEURAL NETWORKS
of processing, so it seems like it should be more expressive. The key thing that is missing is
nonlinearity.
Let’s insert a nonlinear function, called an activation function, after these linear computations. We
would like to choose this activation function to make the circuit a universal function approximator.1
A key observation is that piecewise-constant functions are universal function approximators:
(
1 x≥0
u(x) =
0 x<0
We can build very complicated functions from this simple step function by combining translated
and scaled versions of it. Observe that
1
This essentially means that given any continuous function, we can choose the weights such that the output of the circuit
can be made arbitrarily close to the output of the given function for all inputs.
5.1. NEURAL NETWORKS 71
The input x is one-dimensional, and the weight on x to neuron i is bi . We also introduce a constant
1, whose weight ai into neuron i is ai . (This is referred to as the bias, but it has nothing to do
with bias in the sense of the bias-variance tradeoff. It’s just there to provide the network with the
ability to shift the function.) The outputs of the intermediate nodes are ai + bi x, and then we pass
each of these through the activation function u. The output of the network is a linear combination
of the outputs of the activation functions:
d
X
h(x) = ci u(ai + bi x)
i=1
Choosing weights
With a proper choice of ai , bi , and ci , this function can approximate any continuous function we
want. But the question remains: given some target function, how do we choose these parameters
in the appropriate way?
Let’s try a familiar technique: least squares. Assume we have training data {(xk , yk )}nk=1 . We aim
to solve the optimization problem
Xn
min ek
a,b,c
k=1
| {z }
f (a,b,c)
72 CHAPTER 5. NEURAL NETWORKS
where
ek = (yk − h(xk ))2
To run gradient descent, we need derivatives of the loss with respect to our optimization variables.
We compute via the chain rule
n
∂f X ∂h
= −2(yk − h(xk )) (xk )
∂ci ∂ci
k=1 | {z }
=u(ai +bi xk )
∂ek
=0
∂ci
so no update will be made for that example. More notably, consider the derivative with respect to
ai :
n
∂f X ∂h
= −2(yk − h(xk )) (xk )
∂ai ∂ai
k=1 | {z }
0
Similarly,
∂f
=0
∂bi
The derivative at the jump is undefined, but in practice we will never hit that point of discontinuity.
The bigger issue is that gradient descent will do nothing to optimize the ai and bi weights. Even
though the step function is useful for the purpose of showing the approximation capabilities of neural
networks, it is seldom used in practice because it cannot be trained by conventional gradient-based
methods.
The next simplest universal approximator is the class of piecewise-linear functions. Just as
piecewise-constant functions can be achieved by combinations of the step function as a nonlinearity,
piecewise-linear functions can be achieved by combinations of the rectified linear unit (ReLU)
function
u(x) = max(0, x)
(
n n
∂f X ∂ X 0 if ai + bi x < 0
= −2(yk − h(xk ))ci max(0, ai + bi x) = −2(yk − h(xk ))ci
∂ai ∂ai 1 if ai + bi x > 0
k=1 k=1
(
n n
∂f X ∂ X 0 if ai + bi x < 0
= −2(yk − h(xk ))ci max(0, ai + bi x) = −2(yk − h(xk ))ci
∂bi ∂bi xi otherwise
k=1 k=1
Crucially, we see that the gradient with respect to a and b is not uniformly zero, unlike with the
step function.
If the ReLU is active, both weights are adjustable. Depending on the gradient of the objective
function, our ReLUs can move to the left or right, increase or decrease their slope, and flip direction.
If the weight initialization turns off the ReLU for every training point, then the gradient descent
updates will not change the parameters of the neuron, and we say it is dead. Random initialization
should result in a reasonable number of active neurons. There are more sophisticated initialization
methods, such as “Glorot” initialization2 , which take into account the number of connections and
are more effective in practice. Leaky ReLUs, which have a small slope in the section where the
ReLU is flat (say, u(x) = .01x when x < 0) are sometimes used to provide some small gradient
signal to avoid dead neurons.
The celebrated neural network universal approximation theorem, due to Kurt Hornik3 , tells us that
neural networks are universal function approximators in the following sense.
Theorem. Suppose u : R → R is nonconstant, bounded, nondecreasing, and continuous4 , and let
S ⊆ Rn be closed and bounded. Then for any continuous function f : S → R and any > 0, there
exists a neural network with one hidden layer and a finite number of neurons, which we can write
d
X
h(x) = ci u(ai + b>i x)
i=1
2 See Understanding the difficulty of training deep feedforward neural networks.
3 See Approximation Capabilities of Multilayer Feedforward Networks.
4 Both ReLU and sigmoid satisfy these requirements.
74 CHAPTER 5. NEURAL NETWORKS
such that
|h(x) − f (x)| <
for all x ∈ S.
There’s some subtlety in the theorem that’s worth noting. It says that for any given continuous
function, there exists a neural network of finite size that uniformly approximates the given func-
tion. However, it says nothing about how well any particular architecture you’re considering will
approximate the function. It also doesn’t tell us how to compute the weights.
It’s also worth pointing out that in the theorem, the network consists of just one hidden layer. In
practice, people find that using more layers works better.
Last time we saw that the second-most simple universal function approximator was the piecewise
linear function. Specifically, we talked about a specific component of piecewise linear functions
called the ReLU, which is defined as f (x) = max(0, x).
In our discussion of neural nets, we saw that we would have the ReLUs act on linear combinations
of neural net units to introduce nonlinearity to the hypothesis function encoded by the neural net.
For example, when acting on one input (and a bias term) our ReLUs will take in arguments of the
form a + bx. Let’s see an example of how expressive they can be. Suppose we wanted to build this
function from ReLUs:
2.5
1.5
1 f (x) = 2x − 12
f (x) = x − 2 f (x) = 6 − x
0.5
All we would need to do is center a ReLU at each hinge of the function and give it the appropriate
parameters. For example, to match f from 0 to 3, we would only need the ReLU defined by
max(0, x − 2). The full function can be matched with this linear combination of ReLUs (and a
constant bias term):
In higher dimensions, i.e. when a ReLU takes in an arbitrarily long dot-product as input: f (x) =
max(0, w>x), the unit can be viewed as representative of a “ramp” in the higher-dimensional space.
Here’s a plot of a 3-D ramp:
We have a very powerful tool at our disposal in neural nets, but it does us no good if we can’t
train them. Let’s talk about how this happens. The output units of a neural net can be thought of
as akin to a regression or hypothesis function determined by the parameters of some model.
For example, in ordinary least-squares regression, we learned a hypothesis function f (x) = w>x
determined by the parameter w. It is just the same with neural nets, except that our hypothesis
function can be arbitrarily complex. Consider the following neural net:
The hypothesis function that this neural net encodes is represented by the two outputs, O =
[O1 , O2 ].
76 CHAPTER 5. NEURAL NETWORKS
Since neural net outputs are not linear functions of their inputs, there is no closed-form solution for
the minimum to any reasonable loss function defined on them. Thus, we resort to using gradient
descent to train them. To run gradient descent, we must calculate the derivative of the loss function
with respect to the parameters of the neural net. In the network depicted above, the parameters
are W = [w1 , w2 , w3 ], a, and V = [v1 , v2 ]. These are the values of the model which we are allowed
to tweak. Backpropagation is an efficient manner of computing these gradients that exploits the
nested structure of neural nets.
This section is an aside meant to recall your knowledge about the chain rule in multivariable
calculus. Let’s take a look at two slices of our neural net from above.
If you want to compute the derivatives of upstream stuff with respect to the weights on these
connections, you need only consider the input of a single connection at a time. That is to say:
∂L ∂H1
= upstream terms ·
∂w1 ∂w1
completely independent of x2 and x3 .
If you want to compute the derivatives of upstream stuff with respect to the weights downstream
of these connections, you’ll need to sum over the contributions of the inputs to these connections.
5.2. TRAINING NEURAL NETWORKS 77
That is to say:
∂L X ∂Oi
= upstream termsi ·
∂a ∂a
i
Backpropagation
A naive way of calculating the gradients might be to differentiate the loss with respect to each
parameter we’re interested in updating, one at a time. However, because the later layers of a
neural net are just functions of the earlier layers, doing this would be wasteful. We can see this by
taking a look at the derivatives of L with respect to v1 , a, and w1 in our example:
∂L ∂L ∂O1
=
∂v1 ∂O1 ∂v1
∂L ∂L ∂O1 ∂L ∂O2
= +
∂a ∂O1 ∂a ∂O2 ∂a
∂L ∂L ∂O1 ∂L ∂O2
= +
∂w1 ∂O1 ∂w1 ∂O2 ∂w1
∂L
You should notice that by invoking the chain rule, we see that the term ∂O 1
is common to all
∂L
three derivatives, and the term ∂O2 is common to the second two. This suggests that we should
consider caching the results of the derivatives of weights in the later layers and reuse them in our
computation of the derivatives of the weights of earlier layers: a dynamic programming approach.
The following is a general outline of how backpropagation might be implemented. Keep in mind
that the specifics, especially those pertaining to the structure of each successive layer, will depend
heavily on the architecture of the neural net in question.
1. The forward pass: populate each unit of the neural net with the value it’s supposed to have
(i.e., invoke all your dot-products and ReLUs).
2. Start at the upstream end, i.e. the outputs. Compute the gradient of the loss function with
respect to the outputs (these are just numbers), and memoize/cache them.
3. Go back one layer. Now, treat your outputs Oi as endpoints of your neural net, and compute
the gradients w.r.t. the previous layer, caching these as well. The contributions to the final
loss should be summed appropriately over the paths through which they influence the loss.
4. Repeat until you hit the last layer (the inputs). You should now have all the necessary
components to compute the derivative of the loss function with respect to any parameter of
the neural net.
Backpropagation efficiently computes gradients, so we can run gradient descent to optimize neural
nets. However, computing the full gradient may be expensive, particularly if we have a lot of data.
Consider that the loss function in machine learning typically is an average (or sum, but this is the
same up to a constant factor) of errors over the training points:
n
1X
L(w) = `(h(xi , w), yi )
n
i=1
78 CHAPTER 5. NEURAL NETWORKS
So computing the gradient takes time in linear in n. In the “big data” regime, n is very large, so
this cost may be unacceptable.
For this reason, we have reason to try to approximate the gradient of the loss function of a neural
net by taking a representative random sample of the inputs to calculate the gradient over. Since
gradient descent is an iterative process, we might consider forfeiting the exactness of the update of
each individual iteration in exchange for better speed of computation. If we take the gradient over
a random sample of these training samples at each step, we will get a noisy but unbiased estimate
of the true gradient. The noise in each step is often an acceptable tradeoff in exchange for the
opportunity to take many more steps.
In the regular gradient descent update, we have the policy:
w(k+1) ← w(k) − αk ∇w L(w(k) )
In stochastic gradient descent, we have the update rule:
w(k+1) ← w(k) − αk Gk
where Gk is a random variable which satisfies E[Gk ] = ∇L(w(k) ). Typically Gk is constructed
by sampling m < n training points uniformly at random (say the index is Ik ⊂ {1, . . . , n}) and
computing the gradient on these points only:
1 X
Gk = ∇w `(h(xi ), yi )
m
i∈Ik
1. The loss function f is `-strongly convex; that is, there exists a constant ` > 0 such that
(∇f (x) − ∇f (y))>(x − y) ≥ `kx − yk2 for all x, y.
2. The expected squared norm of the stochastic gradient is bounded as E[kGk k2 ] ≤ M 2 < ∞.
To evaluate the expectation, we condition on the past (i.e. all random decisions that contributed
to w(k) , including past choices of points to evaluate the gradient at). By the law of iterated
expectation, we have
h i
E[(w(k) − w∗ )>Gk ] = Epast E[(w(k) − w∗ )>Gk | past]
Here the inner expectation is taken over the choice of training points used to compute the stochastic
gradient. But given the past (which includes w(k) ), we already know w(k) , so
The current gradient Gk depends on our new choice of evaluation point, which is presumed inde-
pendent of the past and thus E[Gk | past] = E[Gk ] = ∇f (w(k) ), where the second equality holds
because Gk is an unbiased estimator for the true gradient at w(k) . Putting all this together, we
have
E[(w(k) − w∗ )>Gk ] = Epast [(w(k) − w∗ )>∇f (w(k) )]
First order necessary conditions for optimality imply that ∇f (w∗ ) = 0. Then by the assumption
of `-strong convexity, we have
The αk2 M 2 term was incurred by the randomness in our updates. We can try to send this term to
0 by diminishing the step size αk over time – but decreasing αk will also decrease the effect of the
(1 − 2αk `)dk term, so we need to choose our step size carefully.
1
Setting αk = 2`k gives
1 1
dk+1 ≤ 1− dk + M2
k (2`k)2
M2
Let S = (2`)2 so that this inequality becomes
1 1
dk+1 ≤ 1− dk + S
k k2
d2 ≤ S
1 1 1 1
d3 ≤ 1 − d2 + 2
S= S + 2S
2 2 1·2 2
1 1 1 1 1
d4 ≤ 1 − d3 + 2
S= S+ S+ S
3 3 1·3 2·3 32
1 1 1 1 1 1
d5 ≤ 1 − d4 + 2
S= S+ S+ S + 2S
4 4 1·4 2·4 3·4 4
80 CHAPTER 5. NEURAL NETWORKS
Classification
6.1 Classification
The task of classification differs from regression in that we are now interested in assigning a
d-dimensional data point one of a discrete number of classes instead of assigning it a continuous
value. Thus, the task is simpler in that there are fewer choices of labels per data point but more
complicated in that we now need to somehow factor in information about each class to obtain the
classifier that we want.
Here’s a formal definition: Given a training set D = {(xi , ti )}ni=1 of n points, with each data point
xi ∈ Rd paired with a known discrete class label ti ∈ {1, 2, ..., K}, train a classifier which, when fed
any arbitrary d-dimensional data point, classifies that data point as one of the K discrete classes.
Classifiers have strong roots in probabilistic modeling. The idea is that given an arbitrary datapoint
x, we classify x with the class label k ∗ that maximizes the posterior probability of the class label
given the data point:
k ∗ = arg max P (class = k|x)
k
Consider the example of digit classification. Suppose we are given dataset of images of handwritten
digits each with known values in the range {0, 1, 2, . . . , 9}. The task is, given an image of a
handwritten digit, to classify it to the correct digit. A generic classifier for this this task would
effectively form a posterior probability distribution over the 10 possible digits and choose the digit
that achieves the maximum posterior probability:
k ∗ = arg max P (digit = k|image)
k∈{0,1,2...,9}
There are two main types of models that can be used to train classifiers: generative models and
discriminative models.
81
82 CHAPTER 6. CLASSIFICATION
fk (x) = f (x|class k)
In total there are K + 1 probability distributions: 1 for the prior, and K for all of the individual
classes. Note that the prior probability distribution is a categorical distribution over the K discrete
classes, whereas each class conditional probability distribution is a continuous distribution over Rd
(often represented as a Gaussian). Using the prior and the conditional distributions in conjunction,
we conclude (from Bayes’ rule) that we are effectively solving
P (k) fk (x)
k ∗ = arg max P (class = k|x) = arg max = arg max P (k) fk (x)
k k f (x) k
Quadratic Discriminant Analysis (QDA) is a specific generative method in which the class
conditional probability distributions are independent Gaussians: fk (.) ∼ N (µk , Σk ).
Note: the term “discriminant” in QDA is misleading: remember that QDA is not a discriminative
method, it is a generative method!
Estimating fk (.)
For a particular class conditional probability distribution fk (.), if we do not have the true means
and covariances µk , Σk , then our best bet is to estimate them empirically with the samples in our
training data that are classified as class k:
1 X
µˆk = xi
nk
ti =k
1 X
Σ̂k = (xi − µˆk )(xi − µˆk )T
nk
ti =k
Note that the above formulas are not necessarily trivial and must be formally proven using MLE.
Just to present a glimpse of the process, let’s prove that these formulas hold for the case where we
are dealing with 1-d data points. For notation purposes, assume that Dk = {x1 , x2 , . . . , xnk } is the
set of all training data points that belong to class k. Note that the data points are i.i.d. Our goal
6.3. QDA CLASSIFICATION 83
Note that the objective above is not jointly convex, so we cannot simply take derivatives and set
them to 0! Instead, we decompose the minimization over σk2 and µk into a nested optimization
problem:
nk nk
X (xi − µk )2 X (xi − µk )2
min2 + ln(σk ) = min min + ln(σk )
µk ,σk
i=1
2σk2 σk2 µk
i=1
2σk2
The optimization problem has been decomposed into an inner problem that optimizes for µk given
a fixed σk2 , and an outer problem that optimizes for σk2 given the optimal value µˆk . Let’s first solve
the inner optimization problem. Given a fixed σk2 , the objective is convex in µk , so we can simply
take a partial derivative w.r.t µk and set it equal to 0:
nk nk nk
∂ X (xi − µk )2 X −(xi − µk ) 1 X
+ ln(σ k ) = = 0 =⇒ µ
ˆ k = xi
∂µk
i=1
2σk2 σk2 i=1
nk
i=1
Note that this objective is not convex in σk , so we must instead find the critical point of the
objective that minimizes the objective. Assuming that σk ≥ 0, the critical points are:
• σk = 0: assuming that not all of the points xi are equal to µˆk , there are two terms that are
at odds with each other: a 1/σk2 term that blows off to ∞, and a ln(σk ) term that blows off
to −∞ as σk → 0. Note that the 1/σk2 term blows off at a faster rate, so we conclude that
nk
X (xi − µˆk )2
lim + ln(σk ) = ∞
σk →0
i=1
2σk2
• σk = ∞: this case does not lead to the solution, because it gives a maximum, not a minimum.
nk
X (xi − µˆk )2
lim + ln(σk ) = ∞
σk →∞
i=1
2σk2
84 CHAPTER 6. CLASSIFICATION
Assuming that we know the means and covariances for all the classes, we can use Bayes’ Rule to
directly solve the optimization problem
√ d
For future reference, let’s use Qk (x) = ln 2π P (k) fk (x)) to simplify our notation.
Estimating fk (.)
The training and classification procedures for LDA are almost identical that of QDA. To compute
the within-class means, we still want to take the empirical mean. However, the empirical covariance
is now computed with
n
1X
Σ̂ = (xi − µˆti )(xi − µˆti )T
n
i=1
One way to understand this formula is as a weighted average of the within-class covariances. Here,
assume we have sorted our training data by class and we can index through the xi ’s by specifying
6.5. LDA VS. QDA: DIFFERENCES AND DECISION BOUNDARIES 85
n
1X
Σ̂ = (xi − µˆti )(xi − µˆti )T
n
i=1
K Xnk
1 X
= (xj,k − µˆk )(xj,k − µˆk )T
n
k=1 j=1
K
1X
= nk Σk
n
k=1
K
X nk
= Σk
n
k=1
Up to this point, we have used the term quadratic in QDA and linear in LDA. These terms signify
the shape of the decision boundary in x-space. Given any two classes, the decision boundary
represents the points in x-space at which the two classes are equally likely.
Let’s study binary (2-class) examples for simplicity. Assume that the two classes in question are
class A and class B. An arbitrary point x can be classified according to three cases:
A P (class = A|x) > P (class = B|x)
∗
k = B P (class = A|x) < P (class = B|x)
Either A or B
P (class = A|x) = P (class = B|x)
The decision boundary is the set of all points in x-space that are classified according to the third
case. Let’s look at the form of the decision boundary according to the different scenarios possible
under QDA and LDA.
The simplest case is when the two classes are equally likely in prior, and their conditional proba-
bility distributions are isotropic with identical covariances. Isotropic Gaussian distributions have
covariances of the form of Σ = σ 2 I, which means that their isocontours are circles. In this case,
fA (.) and fB (.) have identical covariances of the form ΣA = ΣB = σ 2 I.
86 CHAPTER 6. CLASSIFICATION
Figure 6.1: Contour plot of two isotropic, identically distributed Gaussians in R2 . The circles are the level
sets of the Gaussians.
Geometrically, we can see that the task of classifying a 2-D point into one of the two classes amounts
simply to figuring out which of the means it’s closer to. Using our notation of Qk (x) from before,
this can be expressed mathematically as:
QA (x) = QB (x)
1
1 1 1 1 1
ln − (x − µˆA )T σ 2 I(x − µˆA ) − ln |σ 2 I| = ln − (x − µˆB )T σ 2 I(x − µˆB ) − ln |σ 2 I|
2 2 2 2 2 2
T T
(x − µˆA ) (x − µˆA ) = (x − µˆB ) (x − µˆB )
The decision boundary is the set of points x for which ||x − µˆA ||2 = ||x − µˆB ||2 , which is simply
the set of points that are equidistant from µA and µB . This decision boundary is linear because
the set of points that are equidistant from µA and µB are simply the perpendicular bisector of the
segment connecting µA and µB .
The next case is when the two classes are equally likely in prior, and their conditional probability
distributions are anisotropic with identical covariances. Anisotropic Gaussian distributions are
simply all Gaussian distributions that are not isotopic.
In order to understand the difference, let’s take a closer look at the covariance matrix Σ. Since
Σ is a symmetric, positive semidefinite matrix, we can decompose it by the spectral theorem into
Σ = VΛVT , where the columns of V form an orthonormal basis in Rd , and Λ is a diagonal matrix
with real, non-negative values. The entries of Λ dictate how elongated or shrunk the distribution is
along each direction. To see why this is the case, let’s consider a zero-mean Gaussian distribution
N (0, Σ). We wish to find its level set f (x) = k, or simply the set of all points x such that
the probability density f (x) evaluates to a fixed constant k. This is equivalent to the level set
ln f (x) = ln(k), which further reduces to xT Σ−1 x = c, for some constant c. Without loss of
generality, assume that this constant is 1. The level set xT Σ−1 x = 1 is an ellipsoid with axes
6.5. LDA VS. QDA: DIFFERENCES AND DECISION BOUNDARIES 87
√ √ √
v1 , v2 , . . . , vd , with lengths λ1 , λ2 , . . . , λd , respectively. Each axis of the ellipsoid is the vector
√
λi vi , and we can verify that
( λi vi )T Σ−1 ( λi vi ) = λi vTi Σ−1 vi = λi vTi (Σ−1 vi ) = λi vTi (λ−1
p p
T
i vi ) = vi vi = 1
In the case of isotropic distributions, the entries of Λ are all identical, meaning the the axes of the
ellipsoid form a circle. In the case of anisotropic distributions, the entries of Λ are not necessarily
identical, meaning that the resulting ellipsoid may be elongated/shruken and also rotated.
Figure 6.2: Two anisotropic, identically distributed Gaussians in R2 . The ellipses are the level sets of the
Gaussians.
The case when the two classes are identical anisotropic distributions can be reduced to the isotropic
case simply by performing a change of coordinates that transforms the ellipses back into circles.
Thus, the decision boundary is still linear.
Now, let’s find the decision boundary when the two classes still have identical covariances but are
not necessarily equally likely in prior:
QA (x) = QB (x)
1 1 1 1
ln P (A) − (x − µˆA )T Σ̂−1 (x − µˆA ) − ln |Σ̂| = ln P (B) − (x − µˆB )T Σ̂−1 (x − µˆB ) − ln |Σ̂|
2 2 2 2
1 1
ln P (A) − (x − µˆA )T Σ̂−1 (x − µˆA ) = ln P (B) − (x − µˆB )T Σ̂−1 (x − µˆB )
2 2
2 ln P (A) − x Σ̂ x + 2xT Σ̂−1 µˆA − µˆA T µˆA
T −1
= 2 ln P (B) − xT Σ̂−1 x + 2xT Σ̂−1 µˆB − µˆB T µˆB
2 ln P (A) + 2xT Σ̂−1 µˆA − µˆA T µˆA = 2 ln P (B) + 2xT Σ̂−1 µˆB − µˆB T µˆB
The decision boundary is the level set of a linear function f (x) = wT x + k. Notice the pattern:
the decision boundary is always the level set of a linear function (which itself is linear) as long as
the two class conditional probability distributions share the same covariance matrices. This is the
reason for why LDA has a linear decision boundary.
Nonidentical Distributions
Here, unlike the case when ΣA = ΣB , we cannot cancel out the quadratic terms in x from both
sides of the equation, and thus our decision boundary is now represented by the level set of an
arbitrary quadratic function.
It should now make sense why QDA is short for quadratic discriminant analysis and LDA is short
for linear discriminant analysis!
The quadratic nature of the decision boundary in QDA and the linear nature of the decision
boundary in LDA still apply to the general case when there are more than two classes. The
following excellent figures from Professor Shewchuk’s notes illustrate this point:
Figure 6.3: LDA (left) vs QDA (right): a collection of linear vs quadratic level set boundaries
In our discussion of LDA and QDA, we focused on generative models, where we explicitly model
the probability of the data P (X, Y ) and choose the class c∗ that maximizes the posterior probability:
c∗ = arg maxc P (Y = c|X). For example, in QDA we model P (X|Y = c) as a Gaussian with an
estimated mean µc and covariance matrix Σc . If we choose a prior P (Y ), then our predicted class
for a new test point x is:
The generative approach is flexible and we can actually use our model to generate new samples.
However, LDA and QDA can be inefficient in that they require estimation of a large number of
parameters (ie. the covariance matrices, which have d(d−1)
2 parameters). For 2-class LDA, the
decision boundary is of the form
wT x + k = 0
where k, w are estimated parameters. The decision boundary only requires d + 1 parameters, but
we ended up estimating O(d2 ) parameters because we needed to determine the class-conditional
Gaussian generative models. LDA is an indirect approach to classification - it estimates parameters
for P (X|Y ) and use Bayes’ rule to compute a decision rule, which leads to the discrepancy between
the number of parameters required to specify the generative model and the number of parameters
required to perform classification.
This leads us to the concept of discriminative models, where we bypass learning a generative
model altogether and directly learn a decision boundary for the purpose of classification. The
process of constructing a discriminative model is composed of two key design choices:
1) Representation: how we represent the output of the model (for example, the output of a model
could be any real-valued number that we threshold to perform classification, or the output
could represent a probability)
As a first example of a discriminative model, we discuss the Least Squares Support Vector
Machine (LS-SVM). Consider the binary classification problem where the classes are represented
by −1 and +1. One way to classify a data point x is to estimate parameters w, compute wT x, and
classify x to be sign(wT x). Geometrically, the decision boundary this produces is a hyperplane,
wT x = 0.
We need to figure out how to optimize the parameter w. One simple procedure we can try is to fit
a least squares objective:
n
X
arg min kyi − sign(wT xi )k2 + λkwk2
w
i=1
Where xi , w ∈ Rd+1 . Note that we have not forgotten about the bias term! Even though we are
dealing with d dimensional data, xi and w are d + 1 dimensional: we add an extra “feature” of 1
to x, and a corresponding bias term of k in w. Note that in practice, we do not want to penalize
the bias term in the regularization term, because the we should be able to work with any affine
transformation of the data and still end up with the same decision boundary. Therefore, rather
than taking the norm of w, we often take the norm of w0 , which is every term of w excluding the
corresponding bias term. For simplicity of notation however, let’s just take the norm of w.
Without the regularization term, this would be equivalent to minimizing the number of misclassified
training points. Unfortunately, optimizing this is NP-hard (computationally intractable). However,
90 CHAPTER 6. CLASSIFICATION
n
X
arg min kyi − wT xi k2 + λkwk2
w
i=1
This method is called the 2-class least squares support vector machine (LS-SVM). Note that in this
relaxed version, we care about the magnitude of wT xi and not just the sign.
One drawback of LS-SVM is that the hyperplane decision boundary it computes does not necessar-
ily make sense for the sake of classification. For example, consider the following set of data points,
color-coded according to the class:
LS-SVM will classify every data point correctly, since all the +1 points lie on one side of the decision
boundary and all the −1 points lie on the other side. Now if we add another cluster of points as
follows:
6.7. LEAST SQUARES SUPPORT VECTOR MACHINE 91
The original LS-SVM fit would still have classified every point correctly, but now the LS-SVM gets
confused and decides that the points at the bottom right are contributing too much to the loss
(perhaps for these points, wT xi = −5 for the original choice of w so even though they are on
the correct side of the original separating hyperplane, they incur a high squared loss and thus the
hyperplane is shifted to accommodate). This problem will be solved when we introduce general
Support Vector Machines (SVM’s).
Feature Extension
Working with linear classifiers in the raw feature space may be extremely limiting, so we may
consider adding features that that allow us to come up with nonlinear classifiers (note that we
are still working with linear classifiers in the augmented feature space). For example, adding
quadratic features allows us to find a linear decision boundary in the augmented quadratic space
that corresponds to a nonlinear “circle” decision boundary projected down into the raw feature
space.
Note that φ is a function that takes as input the data in raw feature space, and outputs the data
in augmented feature space.
Instead of using the linear function wT x or augmenting features to the data, we can also directly
use a non-linear function of our choice in the original feature space, such as a neural network. One
can imagine a whole family of discriminative binary classifiers that minimize
n
X
arg min kyi − gw (xi )k2 + λkwk2
w
i=1
where gw (xi ) can be any function that is easy to optimize. Then we can classify using the rule
(
1 gw (xi ) > θ
ŷi =
−1 gw (xi ) ≤ θ
Where θ is some threshold. In LS-SVM, gw (xi ) = xT wi and θ = 0. Using a neural network with
non-linearities as gw can produce complex, non-linear decision boundaries.
Multi-Class Extension
We can also adapt this approach to the case where we have multiple classes. Suppose there are K
classes, labeled 1, 2, ..., K. One possible way to extend the approach from binary classification is to
compute gw (xi ) and round it to the nearest number from 1 to K. However, this approach gives an
“ordering” to the classes, even if the classes themselves have no natural ordering. This is clearly a
problem. For example, in fruit classification, suppose 1 is used to represent “peach,” 2 is used to
represent “banana,” and 3 is used to represent “apple.” In our numerical representation, it would
appear that peaches are less than bananas, which are less than apples. As a result, if we have an
image that looks like some cross between an apple and a peach, we may simply end up classifying
it as a banana.
The typical way to get around this issue is as follows: if the i-th observation has class k, instead
of using the representation yi = k, we can use the representation yi = ek , the k-th canonical basis
vector. Now there is no relative ordering in the representations of the classes. This method is called
one-hot vector encoding.
When we have multiple classes, each yi is a K-dimensional one-hot vector, so for LS-SVM, we
instead have a K × (d + 1) weight matrix to optimize over:
n
X
arg min kyi − Wxi k2 + λkwk2
W i=1
To classify a data point, we compute Wxi and see which component j is the largest - we then
predict xi to be in class j.
6.8. LOGISTIC REGRESSION 93
Figure 6.7: Logistic function. For our purposes, the horizontal axis is the output of the linear function wT xi
and the vertical axis is the output of the logistic function, which can be interpreted as a probability between
0 and 1.
In order to train our model, we need a loss function to optimize. One possibility is least squares:
n
X
arg min kyi − s(wT xi )k2 + λkwk2
w
i=1
However, this may not be the best choice. Ordinary least squares regression has theoretical justi-
fications such as being the maximum likelihood objective under Gaussian noise. Least squares for
this classification problem does not have a similar justification.
Instead, the loss function we use for logistic regression is called the log-loss, or cross entropy:
n
X 1 1
L(w) = yi ln + (1 − yi ) ln
s(wT xi ) 1 − s(wT xi )
i=1
94 CHAPTER 6. CLASSIFICATION
If we define pi = s(wT xi ), then using the properties of logs we can write this as
n
X
L(w) = − yi ln pi + (1 − yi ) ln(1 − pi )
i=1
For each xi , pi represents our predicted probability that its corresponding class is 1. Because
yi ∈ {0, 1}, the loss corresponding to the i-th data point is
(
− ln pi when yi = 1
Li =
− ln(1 − pi ) when yi = 0
Intuitively, if pi = yi , then we incur 0 loss. However, this is never actually the case. The logistic
function can never actually output a value of exactly 0 or 1, and we will therefore always incur
some loss. If the actual label is yi = 1, then as we lower pi towards 0, the loss for this data point
approaches infinity. The loss function can be derived from a maximum likelihood perspective or an
information-theoretic perspective.
Logistic regression can be derived from a maximum likelihood model. Suppose we observe D =
{(xi , yi )}ni=1 , and each observation is sampled from a distribution Yi ∼ Bernoulli(pi ), where pi
is a function of xi . We can view P (Yi = 1) = P (Y = 1|xi ) as the posterior probability that xi
is classified as 1, and similarly P (Yi = 0) as the posterior probability that xi is classified as 0.
The observation yi is simply sampled from this posterior probability distribution Yi . Thus our
observation yi , which we can view as a “sample,” has probability:
(
pi if yi = 1
P (Yi = yi ) =
1 − pi if yi = 0
1
pi = s(wT xi ) =
1 + e−wT xi
6.8. LOGISTIC REGRESSION 95
Now we can estimate the weights w via maximum likelihood. We have the problem
n
Y
w∗LR = arg max P (Yi = yi )
w
i=1
Yn
= arg max pyi i (1 − pi )(1−yi )
w
i=1
n
hY i
= arg max ln pyi i (1 − pi )(1−yi )
w
i=1
n
X
= arg max yi ln pi + (1 − yi ) ln(1 − pi )
w
i=1
Xn
= arg min − yi ln pi + (1 − yi ) ln(1 − pi )
w
i=1
The logistic regression loss function can also be justified from an information-theoretic perspective.
To motivate this approach, we introduce Kullback-Leibler (KL) divergence (also called rel-
ative entropy), which measures the amount that one distribution diverges from another. Given
any two discrete random variables P and Q, the KL divergence from Q to P is defined to be
X P (x)
DKL (P ||Q) = P (x) ln
x
Q(x)
Note that DKL is not a true distance metric, because it is not symmetric, ie. DKL (P ||Q) 6=
DKL (Q||P ) in general.
In the context of classification, if the class label yi is interpreted as the probability of being class 1,
then logistic regression provides an estimate pi of the probability that the data is in class 1. The
true class label can be viewed as a sampled value from the “true” distribution Pi ∼ Bernoulli(yi ).
Pi is not a particularly interesting distribution because all values sampled from it will yield yi :
P (Pi = yi ) = 1. Logistic regression yields a distribution Qi ∼ Bernoulli(pi ), which is the posterior
probability that estimates the true distribution Pi . Be careful not to mix up the notation between
the random variable Pi and the quantity pi ! In the context of this problem pi is associated with
the estimated distribution Qi , not the true distribution Pi .
The KL divergence from Qi to Pi provides a measure of how closely logistic regression can match
the true label. We would like to minimize this KL divergence, and ideally we would try to choose
our weights so that DKL (Pi ||Qi ) = 0. However, this is impossible for two reasons. First, if we want
DKL (Pi ||Qi ) = 0, then we would need pi = yi , which is impossible because pi is the output of a
logistic function that can never actually reach 0 or 1. Second, even if we tried tuning the weights so
that DKL (Pi ||Qi ) = 0, that’s only optimizing one of the data points – we need to tune the weights
so that we can collectively minimize the totality of all of the KL divergences contributed by all
data points.
Therefore, our goal is to tune the weights w (which indirectly affect the pi values and therefore the
estimated distribution Qi ), in order to minimize the total sum of KL divergences contributed by
96 CHAPTER 6. CLASSIFICATION
Recall that in logistic regression, we are tuning a weight vector w ∈ Rd+1 , which leads to a posterior
distribution Qi ∼ Bernoulli(pi ) over the binary classes 0 and 1:
1
P (Qi = 1) = pi = s(wT xi ) =
1 + e−wT xi
T
e−w xi
P (Qi = 0) = 1 − s(wT xi ) =
1 + e−wT xi
Let’s generalize this concept to Multiclass Logistic Regression, where there are K classes. Sim-
ilarly to our discussion of the multi-class LS-SVM, it is important to note that there is no inherent
6.9. MULTICLASS LOGISTIC REGRESSION 97
ordering to the classes, and predicting a class in the continuous range from 1 to K would be a poor
choice. To see why, recall our fruit classification example. Suppose 1 is used to represent peach,
2 is used to represent banana, and 3 is used to represent apple. In our numerical representation,
it would appear that peaches are less than bananas, which are less than apples. As a result, if we
have an image that looks like some cross between an apple and a peach, we may simply end up
classifying it as a banana.
The solution is to use a one-hot vector encoding to represent all of our labels. If the i-th
observation has class k, instead of using the representation yi = k, we can use the representation
yi = ek , the k-th canonical basis vector. For example, in our fruit example, if the i-th image is
classified as “banana”, its label representation would be
yi = [0 1 0]T
Now there is no relative ordering in the representations of the classes. We must modify our weight
representation accordingly to the one-hot vector encoding. Now, there are a set of d + 1 weights
associated with every class, which amounts to a matrix W ∈ RK×(d+1) . For each input xi ∈ Rd+1 ,
each class k is given a “score”
zk = wkT xi
Where wk is the k-th row of the W matrix. In total there are K scores for an input xi :
The higher the score for a class, the more likely logistic regression will pick that class. Now that we
have a score system, we must transform all of these scores into a posterior probability distribution
Q. For binary logistic regression, we used the logistic function, which takes the value wT xi and
squashes it to a value between 0 and 1. The generalization to the the logistic function for the multi-
class case is the softmax function. The softmax function takes as input all K scores (formally
known as logits) and an index j, and outputs the probability that the corresponding softmax
distribution takes value j:
e zj
softmax(j, {z1 , z2 , . . . , zK }) = PK
zk
k=1 e
The logits induce a softmax distribution, which we can verify is indeed a probability distribution:
On inspection, this softmax distribution is reasonable, because the higher the score of a class, the
higher its probability. In fact, we can verify that the logistic function is a special case of the softmax
function. Assuming that the corresponding weights for class 0 and 1 are w0 and w1 , we have that:
T T
ew1 xi e(w1 −w1 ) xi 1
P (Qi = 1) = wT x w T
x
= (w −w )Tx (w −w )Tx
= −(w −w ) Tx
= s((w1 − w0 )T xi )
e 0 i +e 1 i e 0 1 i + e 1 1 i 1 + e 1 0 i
T T T
ew0 xi e(w0 −w1 ) xi e−(w1 −w0 ) xi
P (Qi = 0) = wT x T = = = 1 − s((w1 − w0 )T xi )
e 0 i + ew1 xi e(w0 −w1 )T xi + e(w1 −w1 )T xi 1 + e−(w1 −w0 )T xi
In the 2-class case, because we are only interested in the difference between w1 and w0 , we just use
a change of variables w = w1 − w0 . We don’t need to know w1 and w0 individually, because once
we know P (Qi = 1), we know by default that P (Qi = 0) = 1 − P (Qi = 1).
98 CHAPTER 6. CLASSIFICATION
Let’s derive the loss function for multiclass logistic regression, using the information-theoretic
perspective. The “true” or more formally the target distribution in this case is P (Pi = j) = yi [j].
In other words, the entire distribution is concentrated on the label for the training example. The
estimated distribution Q comes from multiclass logistic regression, and in this case is the softmax
distribution:
T
ewj xi
P (Qi = j) = PK
wkT xi
k=1 e
Now let’s proceed to deriving the loss function. The objective, as always, is to minimize the sum
of the KL divergences contributed by all of the training examples.
n
X
∗
WM CLR = arg min DKL (Pi ||Qi )
W i=1
n XK
X P (Pi = j)
= arg min P (Pi = j) ln
W P (Qi = j)
i=1 j=1
n X
K
X yi [j]
= arg min yi [j] ln
W i=1 j=1
softmax(j, {w1T xi , w1T xi , . . . , wK
T x })
i
n X
X K
yi [j] · ln yi [j] − yi [j] · ln softmax(j, {w1T xi , w2T xi , . . . , wK
T
= arg min xi })
W i=1 j=1
n X
X K
yi [j] · ln softmax(j, {w1T xi , w1T xi , . . . , wK
T
= arg min − xi })
W i=1 j=1
n X
X K ewjT xi
= arg min − yi [j] · ln PK
wkT xi
W i=1 j=1 k=1 e
n
X
= arg min H(Pi , Qi )
W i=1
6.10. TRAINING LOGISTIC REGRESSION 99
Just like binary logistic regression, we can justify the loss function with MLE as well:
n
Y
∗
WM CLR = arg max P (Yi = yi )
W i=1
Yn YK
= arg max P (Qi = j)yi [j]
W i=1 j=1
n X
X K
= arg max yi [j] ln P (Qi = j)
W i=1 j=1
n X
X K ewjT xi
= arg max yi [j] ln PK
wkT xi
W i=1 j=1 k=1 e
n X
X K ewjT xi
= arg min − yi [j] ln PK
wkT xi
W i=1 j=1 k=1 e
n X
X K
L(W ) = − yi [j] · ln P (Qi = j)
i=1 j=1
The logistic regression loss function has no known analytic closed-form solution. Therefore, in order
to minimize it, we can use gradient descent, either in batch form or stochastic form. Let’s examine
the case for batch gradient descent.
where
1
pi = s(wT xi ) =
1 + e−wT xi
100 CHAPTER 6. CLASSIFICATION
n
X
∇w L(w) = ∇w − yi ln pi + (1 − yi ) ln(1 − pi )
i=1
n
X
=− yi ∇w ln pi + (1 − yi )∇w ln(1 − pi )
i=1
n
X yi 1 − yi
=− ∇w p i − ∇w pi
pi 1 − pi
i=1
n
X yi 1 − yi
=− − ∇w pi
pi 1 − pi
i=1
Note that ∇z s(z) = s(z)(1 − s(z)), and from the chain rule we have that
n
X yi 1 − yi
∇w L(w) = − − ∇w pi
pi 1 − pi
i=1
n
X yi 1 − yi
=− − pi (1 − pi )xi
pi 1 − pi
i=1
n
X
=− yi (1 − pi ) − (1 − yi )(pi ) xi
i=1
n
X
=− (yi − pi ) xi
i=1
We conclude that the gradient of the loss function with respect to the parameters is
n
X
∇w L(w) = − (yi − pi ) xi = −X T (y − p)
i=1
w = w − ∇w L(w)
It does not matter what initial values we pick for w, because the loss function L(w) is convex and
does not have any local minima. Let’s prove this, by first finding the Hessian of the loss function.
6.10. TRAINING LOGISTIC REGRESSION 101
The k, lth entry of the Hessian is the partial derivative of the gradient with respect to wk and wl :
∂ 2 L(w)
Hkl =
∂wk ∂wl
n
∂ X
= − (yi − pi ) xil
∂wk
i=1
n
X ∂
= pi xil
∂wk
i=1
Xn
= pi (1 − pi )xik xil
i=1
We conclude that
n
X
H= pi (1 − pi )xi xTi
i=1
Instead of finding the gradient with respect to all of the parameters of the matrix W , let’s find
them with respect to one row of W at a time:
Xn X K ewjT xi
∇wl L(W ) = ∇wl − yi [j] · ln PK
e wkT xi
i=1 j=1 k=1
n X K T
!
X ewj xi
=− yi [j] · ∇wl ln PK
wkT xi
i=1 j=1 k=1 e
Xn X K XK
T
=− yi [j] · ∇wl wjT xi − ∇wl ln ewk xi
i=1 j=1 k=1
n X
K T
!
ewl xi
1{j = l}xi − PK
X
=− yi [j] · xi
wkT xi
i=1 j=1 k=1 e
n X
K T
!
ewl xi
1{j = l} − PK
X
=− yi [j] · xi
wkT xi
i=1 j=1 k=1 e
n
1{yi = l} − P (Qi = l) xi
X
=−
i=1
= −X T 1{yi = l} − P (Q = l)
Note the use of indicator functions: 1{j = l} evaluates to 1 if j = l, otherwise 0. Also note that
since yi is a one-hot vector encoding, it evaluates to 1 only for one entry and 0 for all other entries.
102 CHAPTER 6. CLASSIFICATION
We can therefore simplify the expression by only considering the j for which yi [j] = 1. The gradient
descent update for wl is
wl = wl − ∇wl L(W )
Just as with binary logistic regression, it does not matter what initial values we pick for W , because
the loss function L(W ) is convex and does not have any local minima.
So far we’ve explored generative models (LDA) and discriminative models (logistic regres-
sion), but in both of these methods, we tasked ourselves with modeling some kind of probability
distribution. One observation about classification is that in the end, if we only care about assigning
each data point a class, all we really need to know do is find a “good” decision boundary, and we
can skip thinking about the distributions. Support Vector Machines (SVMs) are an attempt
to model decision boundaries directly in this spirit.
Here’s the setup for the problem:
• Goal: find a d − 1 dimensional hyperplane decision boundary H which separates the +1’s
from the −1’s.
In order to motivate SVMs, we first have to understand the much simpler perceptron classifier and
its shortcomings. Given that the training data is linearly separable, the perceptron algorithm
finds a d − 1 dimensional hyperplane that perfectly separates the +1’s from the −1’s. Mathemat-
ically, the goal is to learn a weight vector w ∈ Rd and a bias term b ∈ R, that satisfy the linear
separability constraints:
(
wT xi − b ≥ 0 if yi = 1
∀i,
wT xi − b ≤ 0 if yi = −1
Equivalently,
∀i, yi (wT xi − b) ≥ 0
The resulting decision boundary is a hyperplane H = {x : wT x − b = 0}. All points on the positive
side of the hyperplane are classified as +1, and all points on the negative side are classified as −1.
Note that perceptrons have two major shortcomings that as we shall see, SVMs can overcome. First
of all, if the data is not linearly separable, the perceptron fails to find a stable solution. As we shall
see, soft-margin SVMs fix this issue by allowing best-fit decision boundaries even when the data is
not linearly separable. Second, if the data is linearly separable, the perceptron could find infinitely
many decision boundaries – if (w, b) is a pair that separates the data points, then the perceptron
could also end up choosing a slightly different (w, b + ) pair. Some hyperplanes are better than
others, but the perceptron cannot distinguish between them. This leads to generalization issues.
6.11. SUPPORT VECTOR MACHINES 103
Figure 6.8: Two possible decision boundaries under the perceptron. The X’s and C’s represent the +1’s and
−1’s respectively.
In the figure above, we consider two potential linear separators that satisfy the constraints. One
could imagine that if we observed new test points that are nearby the region of C’s in the training
data, they should also be of class C. The orange separator would incorrectly classify some of these
new test points, while the black separator would most likely still be able to classify them correctly.
To the eyes of the perceptron algorithm, both the orange and black lines are perfectly valid decision
boundaries. Therefore, the perceptron may not be able to generalize well to unseen data.
Hard-Margin SVMs
Hard-Margin SVMs solve the generalization problem of perceptrons by maximizing the margin,
formally known as the minimum distance from the decision boundary to any of the training points.
Figure 6.9: The optimal decision boundary (as shown) maximizes the margin.
Intuitively, maximizing the margin allows us to generalize better to unseen data, because the
decision boundary with the maximum margin is as far away from the training data as possible and
the boundary cannot be violated unless the unseen data contains outliers.
Simply put, the goal of hard-margin SVMs is to find a hyperplane H that maximizes the margin
m. Let’s formalize an optimization problem for hard-margin SVMs. The variables we are trying to
optimize over are the margin m and the parameters of the hyperplane, w and b. The objective is
to maximize the margin m, subject to the following constraints:
104 CHAPTER 6. CLASSIFICATION
• All points classified as +1 are to the positive side of the hyperplane and their distance to H
is greater than the margin
• All points classified as −1 are to the negative side of the hyperplane and their distance to H
is greater than the margin
• The margin is non-negative.
Proof: consider any two points on H, x0 and x1 . We will show that (x1 − x0 ) ⊥ ((x1 + w) − x1 ).
Note that
(x1 − x0 )T ((x1 + w) − x1 ) = (x1 − x0 )T w = xT1 w − xT0 w = b − b = 0
Since w is perpendicular to H, the (shortest) distance from any arbitrary point z to the hyperplane
H is determined by a scaled multiple of w. If we take any point on the hyperplane x0 , the distance
from z to H is the length of the projection from z − x0 to the vector w, which is
|wT (z − x0 )| |wT z − wT x0 | |wT z − b|
D= = =
kwk2 kwk2 kwk2
In order to ensure that positive points are on the positive side of the hyperplane outside a margin
of size m, and that negative points are on the negative side of the hyperplane outside a margin of
size m, we can express the constraint
(wT xi − b)
yi ≥m
kwk2
max m
m,w,b
(wT xi − b) (6.1)
s.t. yi ≥ m ∀i
kwk2
m≥0
Maximizing the margin m means that there exists at least one point on the positive side of the
hyperplane and at least one point on the negative side whose distance to the hyperplane is exactly
equal to m. These points are the support vectors, hence the name “support vector machines”.
Through a series of optimization steps, we can simplify the problem by removing the margin variable
and just optimizing the parameters of the hyperplane. In order to do so, we have to first introduce
two new variables w0 and b0 that capture the relationship between the three original variables m,
w, and b.
1
max0 0
m,w,b,w ,b kw0 k2
s.t. yi (w0T xi − b0 ) ≥ 1 ∀i
m≥0 (6.2)
w
w0 =
kwk2 m
b
b0 =
kwk2 m
Having introduced the new variables w0 and b0 , the old variables m, w, and b are no longer relevant
to the optimization problem, and we can remove them. The previous optimization problem is
equivalent to
1
max
w0 ,b0 kw0 k2 (6.3)
s.t. yi (w0T xi − b0 ) ≥ 1 ∀i
Let’s verify that (2) and (3) are equivalent. We will show that
1. The optimal value of (2) is at least as good as the optimal value of (3). Assume that
the optimal values for (3) are w0∗ and b0∗ . One feasible point for (2) is (m, w, b, w0 , b0 ) =
( kw10 k , w0∗ , b0∗ , w0∗ , b0∗ ), which leads to the same objective value as (3). Therefore, the optimal
2
value of (2) is at least as good as that of (3).
2. The optimal value of (3) is at least as good as the optimal value of (2). Assume that the
optimal values for (2) are (m∗ , w∗ , b∗ , w0∗ , b0∗ ). One feasible point for (3) is (w0 , b0 ) = (w0∗ , b0∗ )
which leads to the same objective value as (2). Therefore, the optimal value of (3) is at least
as good as that of (2).
106 CHAPTER 6. CLASSIFICATION
We can rewrite objective so that the problem is a minimization rather than a maximization:
1
w0
2
min 2
0 0
w ,b 2 (6.4)
s.t. yi (w0T xi − b0 ) ≥ 1 ∀i
At last, we have formulated the hard-margin SVM optimization problem! Using the notation w
and b, the objective of hard-margin SVMs is
1
min kwk22
w,b 2 (6.5)
T
s.t. yi (w xi − b) ≥ 1 ∀i
Soft-Margin SVMs
The hard-margin SVM optimization problem has a unique solution only if the data are linearly
separable, but it has no solution otherwise. This is because the constraints are impossible to satisfy
if we can’t draw a hyperplane that separates the +1’s from the −1’s. In addition, hard-margin
SVMs are very sensitive to outliers – for example, if our data is class-conditionally distributed
Gaussian such that the two Gaussians are far apart, if we witness an outlier from class +1 that
crosses into the typical region for class −1, then hard-margin SVM will be forced to compromise a
more generalizable fit in order to accommodate for this point. Our next goal is to come up with
a classifier that is not sensitive to outliers and can work even in the presence of data that is not
linearly separable. To this end, we’ll talk about Soft-Margin SVMs.
A soft-margin SVM modifies the constraints from the hard-margin SVM by allowing some points
to violate the margin. Formally, it introduces slack variables ξi , one for each training point, into
the constraints:
yi (wT xi − b) ≥ 1 − ξi
ξi ≥ 0
which, is a less-strict, softer version of the hard-margin SVM constraints because it says that each
point xi need only be a ”distance” of 1−ξi of the separating hyperplane instead of a hard ”distance”
of 1.
(By the way, the Greek letter ξ is spelled ”xi” and pronounced ”zai”. ξi is pronounced ”zai-eye.”)
These constraints would be fruitless if we didn’t bound the values of the ξi ’s, because by setting them
to large values, we are essentially saying that any point may violate the margin by an arbitrarily
large distance...which makes our choice of w meaningless. We modify the objective function to be:
n
1 X
min kwk2 + C ξi
w,b,,ξi 2
i=1
Where C is a hyperparameter tuned through cross-validation. Putting the objective and constraints
together, the soft-margin SVM optimization problem is
n
1 X
min ||w||2 + C ξi
w,b,ξi 2
i=1
(6.6)
T
s.t. yi (w xi − b) ≥ 1 − ξi ∀i
ξi ≥ 0 ∀i
6.11. SUPPORT VECTOR MACHINES 107
The table below compares the effects of having a large C versus a small C. As C goes to infinity,
the penalty for having non-zero ξi goes to infinity, and thus we force the ξi ’s to be zero, which is
exactly the setting of the hard-margin SVM.
small C large C
Desire maximize margin keep ξi ’s small or zero
Danger underfitting overfitting
Outliers less sensitive more sensitive
In the context of classification, the loss function that we would like to optimize is 0-1 step loss:
(
1 y(wT x − b) < 0
Lstep (y, wT x − b) =
0 y(wT x − b) ≥ 0
The 0-1 loss is 0 if x is correctly classified and 1 otherwise. Thus minimizing n1 ni=1 L(yi , wT xi − b)
P
directly minimizes classification error on the training set. However, the 0-1 loss is difficult to
optimize: it is neither convex nor differentiable (see Figure 9.1).
Figure 6.12: Step (0-1) loss, hinge loss, and squared loss. Squared loss is convex and differentiable, hinge
loss is only convex, and step loss is neither.
We can try to modify the 0-1 loss to be convex. The points with y(wT x − b) ≥ 0 should remain at
0 loss, but we may consider allowing a linear penalty “ramp” for misclassified points. This leads
us to the hinge loss, as illustrated in Figure 9.1:
We claim these two formulations are actually equivalent. Manipulating the first constraint, we have
that
ξi ≥ 1 − yi (wT xi − b)
Combining with the constraint ξi ≥ 0, we have that
At the optimal value of the optimization problem, these inequalities must be tight. Otherwise,
we could lower each ξi to equal max(1 − yi (wT xi − b), 0) and decrease the value of the objective
function. Thus we can rewrite the soft-margin SVM optimization problem as
n
1 X
min ||w||2 + C ξi
w,b,ξi 2 (6.7)
i=1
T
s.t. ξi = max(1 − yi (w xi − b), 0) ∀i
If we divide by Cn (which does not change the optimal solution of the optimization problem), we can
1
see that this formulation is equivalent to the regularized regression problem, with λ = 2Cn . Thus
we have two interpretations of soft-margin SVM: either as finding a max-margin hyperplane that is
allowed to make some mistakes via slack variables ξi , or as regularized empirical risk minimization
with the hinge loss.
Chapter 7
Duality, Kernels
7.1 Duality
Now, we will investigate the dual of this problem, which will motivate our discussion of kernels.
Before we do so, we first have to understand the primal and dual of an optimization problem.
min f0 (x)
x
s.t. fi (x) ≤ 0 i = 1, . . . , m (7.1)
hj (x) = 0 j = 1, . . . , n
For the purposes of our discussion, assume that x ∈ Rd . The main components of an optimization
problem are:
109
110 CHAPTER 7. DUALITY, KERNELS
Working with the constraints can be cumbersome and challenging to manipulate, and it would be
ideal if we could somehow turn this constrained optimization problem into an unconstrained one.
One idea is to re-express the optimization problem into
where (
f0 (x) if fi (x) ≤ 0, ∀i ∈ [1, m] and hj (x) = 0, ∀j ∈ [1, n]
L(x) =
∞ otherwise
Note that the unconstrained optimization problem above is equivalent to the original constrained
problem. Even though the unconstrained problem considers values that violate the constraints (and
therefore are not in the feasible set for the constrained optimization problem), it will effectively
ignore them because they are treated as ∞ in a minimization problem.
Even though we are now dealing with an unconstrained problem, it still is difficult to solve the
optimization problem, because we still have to deal with all of the casework in the objective function
L(x). In order to solve this issue, we have to introduce dual variables, specifically one set of dual
variables for the equality constraints, and one set for the inequality constraints. If we only take
into account the dual variables for the equality constraints, the optimization problem now becomes
where (
f0 (x) + nj=1 νj hj (x) if fi (x) ≤ 0, ∀i ∈ [1, m]
P
L(x, ν) =
∞ otherwise
We are still working with an unconstrained optimization problem, except that now, we are opti-
mizing over two sets of variables: the primal variables x ∈ Rd and the dual variables ν ∈ Rn .
Also note that the optimization problem has now become a nested one, with an inner optimization
problem the maximizes over the dual variables, and an outer optimization problem that minimizes
over the primal variables. Let’s examine why this optimization problem is equivalent to the original
constrained optimization problem:
• Any x that violates the inequality constraints is still treated as ∞ by the outer minimization
problem over x and therefore ignored
• For any x that violates the equality constraints (meaning that ∃j s.t. hj (x) 6= 0), the inner
maximization problem over ν can choose νj as ∞ if hj (x) > 0 (or νj as −∞ if hj (x) < 0) to
cause the inner maximization blow off to ∞, therefore being ignored by the outer minimization
over x
• For any x that does not violate any of the equality or inequality constraints, the inner maxi-
mization problem over ν is simply equal to f0 (x)
7.1. DUALITY 111
This solution comes at a cost – in an effort to remove the equality constraints, we had to add in
dual variables, one for each inequality constraint. With this in mind, let’s try to do the same for
the inequality constraints. Adding in dual variable λi to represent each inequality constraint, we
now have
Xm Xn
min max L(x, λ, ν) = f0 (x) + λi fi (x) + νj hj (x)
x λ,ν (7.4)
i=1 j=1
s.t. λi ≥ 0 i = 1, . . . , m
For convenience, we can place the constraints involving λ into the optimization variable.
m
X n
X
min max L(x, λ, ν) = f0 (x) + λi fi (x) + νj hj (x)
x λ≥0,ν
i=1 j=1
This optimization problem above is otherwise known as the primal (not to be confused with the
primal variables), and its optimal value is indeed equivalent to that of the original constrained
optimization problem.
p∗ = min max L(x, λ, ν)
x λ≥0,ν
We can verify that this is indeed the case:
• For any x that violates the inequality constraints (meaning that ∃i ∈ [1, m] s.t. fi (x) > 0),
the inner maximization problem over λ can choose λi as ∞ to cause the inner maximization
blow off to ∞, therefore being ignored by the outer minimization over x
• For any x that violates the equality constraints (meaning that ∃j s.t. hj (x) 6= 0), the inner
maximization problem over ν can choose νj as ∞ if hj (x) > 0 (or νj as −∞ if hj (x) < 0) to
cause the inner maximization blow off to ∞, therefore being ignored by the outer minimization
over x
• For any x that does not violate any of the P equality or inequality constraints, in the inner
maximization problem over ν, the expression nj=1 νj hj (x) evaluates to 0 no matter what the
value of ν is, and in the inner maximization problem over λ, the expression m
P
i=1 λi fi (x) can
at maximum be 0, because λi is constrained to be non-negative, and fi (x) is non-positive.
Therefore, at best, the maximization problem sets λi fi (x) = 0, and
max L(x, λ, ν) = f0 (x)
λ≥0,ν
In its full form, the objective L(x, λ, ν) is called the Lagrangian, and it takes into account the
unconstrained set of primal variables x ∈ Rd , the constrained set of dual variables λ ∈ Rn cor-
responding to the inequality constraints, and the unconstrained set of dual variables ν ∈ Rm
corresponding to the equality constraints. Note that our dual variables λi are in fact constrained,
so ultimately we were not able to turn the original optimization problem into an unconstrained
one, but our constraints are much simpler than before.
The dual of this optimization problem is still over the same optimization objective, except that
now we swap the order of the maximization of the dual variables and the minimization of the primal
variables.
d∗ = max min L(x, λ, ν) = max g(λ, ν)
λ≥0,ν x λ≥0,ν
where
g(λ, ν) = min L(x, λ, ν)
x
The dual is very useful to work with, because now the inner optimization problem over x is an
unconstrained problem!
It is always true that the solution to the primal problem is at least as large as the solution to the
dual problem:
p∗ ≥ d∗ (7.5)
This condition is know as weak duality.
More compactly,
∀x, λ ≥ 0, ν max L(x, λ̃, ν̃) ≥ min L(x̃, λ, ν)
λ̃≥0,ν̃ x̃
Since this is true for all x, λ ≥ 0, ν this is true in particular when we set
and
λ, ν = arg max min L(x̃, λ̃, ν̃)
x̃
λ̃≥0,ν̃
p∗ = min max L(x̃, λ̃, ν̃) ≥ max min L(x̃, λ̃, ν̃) = d∗
x̃ λ̃≥0,ν̃ λ̃≥0,ν̃ x̃
The difference p∗ − d∗ is known as the duality gap. In the case of strong duality, the duality
gap is 0. That is, we can swap the order of the minimization and maximization and up with the
same optimal value:
p∗ = d∗ (7.6)
There are several useful theorems detailing the existence of strong duality, such as Slater’s the-
orem, which states that if the primal problem is convex, and there exists an x that can strictly
meet the inequality constraints and meet the equality constraints, then strong duality holds. Given
that strong duality holds, the Karush-Kuhn-Tucker (KKT) conditions can help us find the
solutions to the dual variables of the optimization problem. The KKT conditions are composed of:
3. Dual feasibility
λi ≥ 0, ∀i ∈ [1, m]
4. Complementary Slackness
λi fi (x) = 0, ∀i ∈ [1, m]
5. Stationarity
m
X n
X
∇x f0 (x) + λi ∇x fi (x) + νj ∇x hj (x) = 0
i=1 j=1
Proof. KKT conditions 1, 2, 3 are trivially true, because the primal solution x∗ must satisfy the
primal constraints, and the dual solution λ∗ , ν ∗ must satisfy the dual constraints. Now, let’s prove
conditions 4 and 5. We know that since strong duality holds, we can say that
p∗ = f0 (x∗ ) = g(λ∗ , ν ∗ ) = d∗
= min L(x, λ∗ , ν ∗ )
x
≤ L(x∗ , λ∗ , ν ∗ )
Xm n
X
∗ ∗ ∗ ∗ ∗
= f0 (x ) + λi fi (x ) + ν
jh j (x )
i=1
j=1
Xm
= f0 (x∗ ) + λ∗i fi (x∗ )
i=1
≤ f0 (x∗ )
In the fourth step, we can cancel the terms involving hj (x∗ ) because we know that the primal
solution must satisfy hj (x∗ ) = 0. In the fifth step, we know that λ∗i fi (x∗ ) ≤ 0, because λ∗i ≥ 0
in order to satisfy the dual constraints, and fi (x∗ ) ≤ 0 in order to satisfy the primal constraints.
Since we established that f0 (x∗ ) = minx L(x, λ∗ , ν ∗ ) ≤ L(x∗ , λ∗ , ν ∗ ) ≤ f0 (x∗ ), we know that all
of the inequalities hold with equality and therefore L(x∗ , λ∗ , ν ∗ ) = minx L(x, λ∗ , ν ∗ ). This implies
KKT condition 5 (stationarity), that
m
X n
X
∇x f0 (x∗ ) + λ∗i ∇x fi (x∗ ) + νj∗ ∇∗x hj (x∗ ) = 0
i=1 j=1
The theorem above establishes that in the presence of strong duality, if the solutions are optimal,
then they satisfy the KKT conditions. Let’s prove a statement that is almost (but not quite) the
converse, which will be much more helpful for solving optimization problems.
114 CHAPTER 7. DUALITY, KERNELS
Theorem 2. If x̄ and λ̄, ν̄ satisfy the KKT conditions, and the primal problem is convex, then they
are the optimal solutions to the primal and dual problems with zero duality gap.
Proof. If x̄ and λ̄, ν̄ satisfy KKT conditions 1, 2, 3 we know that they are at least feasible for the
primal and dual problem. From the KKT stationarity condition we know that
m
X n
X
∇x f0 (x̄) + λ̄i ∇x fi (x̄) + ν¯j ∇∗x hj (x̄) = 0
i=1 j=1
Since the primal problem is convex, we know that L(x, λ, ν) is convex in x, and if the gradient of
L(x, λ̄, ν̄) at x̄ is 0, we know that
Therefore, we know that the optimal primal values for the primal problem optimize the inner
optimization problem of the dual problem, and
m
X n
X
g(λ̄, ν̄) = f0 (x̄) + λ̄i fi (x̄) + ν¯j hj (x̄)
i=1 j=1
By the primal feasibility conditions for hj (x) and the complementary slackness condition, we know
that
g(λ̄, ν̄) = f0 (x̄)
Now, all we have to do is to prove that x̄ and λ̄, ν̄ are primal and dual optimal, respectively. Note
that since weak duality always holds, we know that
Since p∗ is the minimum value for the primal problem, we can go further by saying that p∗ ≥ f0 (x̄)
holds with equality and
p∗ = f0 (x̄) = g(λ̄, ν̄) = d∗
Therefore, we have proven that x̄ and λ̄, ν̄ are primal and dual optimal respectively, with zero
duality gap. We eventually arrived at the conclusion that strong duality does indeed hold.
Let’s pause for a second to understand what we’ve found so far. Given an optimization problem,
its primal problem is an optimization problem over the primal variables, and its dual problem is
an optimization problem over the dual variables. If strong duality holds, then we can solve the
dual problem and arrive at the same optimal value. In order to solve the dual, we have to first
solve the unconstrained inner optimization problem over the primal variables and then solve the
constrained outer optimization problem over the dual variables. But how do we even know in the
first place that strong duality holds? This is where KKT comes into play. If the the primal problem
is convex and the KKT conditions hold, we can solve for the dual variables easily and also verify
strong duality does indeed hold. We shall do just that, in our discussion of SVMs.
7.2. THE DUAL OF SVMS 115
Let’s apply our knowledge of duality to find the dual of the soft-margin SVM optimization problem.
min f (w,t,ξ)
z }| {
n
1 X
min ||w||2 + C ξi
w,t,ξ 2
i=1
T
(1 − ξi ) − yi (w xi − t) ≤ 0 and − ξi ≤ 0
| {z }
g(w,t,ξ)≤0
Let’s identify the primal and dual variables for the SVM problem. We will have
For the purposes of notation, note that we are using α and β in place of λ, and there are no dual
variables corresponding to ν because there are no equality constraints. The lagrangian for the SVM
problem is:
n n n
1 X X X
L(w, t, ξ, α, β) = ||w||2 + C ξi + αi ((1 − ξi ) − yi (wT xi − t)) + βi (−ξi )
2
i=1 i=1 i=1
n n n (7.7)
1 X X X
= ||w||2 − T
αi yi (w xi − t) + αi + (C − αi − βi )ξi
2
i=1 i=1 i=1
Let’s use the KKT conditions to find the optimal dual variables. Verify that the primal problem
is convex in the primal variables. We know that from the stationarity conditions, evaluated at the
optimal dual values α∗ and β ∗ , and the optimal primal values w∗ , t∗ , ξi∗ :
∂L ∂L ∂L
= = =0
∂wi ∂t ∂ξi
• ∂L
∂ξi = C − αi∗ − βi∗ = 0 =⇒ 0 ≤ αi∗ ≤ C. This tells us that the weights αi∗ are restricted to
being less than the hyperparameter C.
116 CHAPTER 7. DUALITY, KERNELS
Verify that the other KKT also hold, establishing strong duality. Using these observations, we can
eliminate some terms of the dual problem.
n n n
∗ ∗ 1 X X X
L(w, t, ξ, α , β ) = ||w||2 − αi∗ yi (wT xi − t) + αi∗ + (C − αi∗ − βi∗ )ξi
2
i=1 i=1 i=1
n n n n
1 X X X X
= ||w||2 − αi∗ yi (wT xi ) +t αi∗ yi + αi∗ + (C − αi∗ − βi∗ )ξi
2
i=1 i=1 i=1 i=1
| {z } | {z }
=0 =0
n n
1 X X
= ||w||2 − αi∗ yi (wT xi ) + αi∗
2
i=1 i=1
We can also rewrite all the optimal primal variables w∗ , t∗ , ξ ∗ in terms of the optimal dual variables
αi∗ :
g(α∗ , β ∗ ) = min L(w, t, ξ, α∗ , β ∗ )
w,t,ξ
= L(w∗ , t∗ , ξ ∗ , α∗ , β ∗ )
n n n n
1 X ∗ X X X
= || αi yi xi ||2 − αi∗ yi (( αj∗ yj xj )T xi ) + αi∗
2
i=1 i=1 j=1 i=1
n n n n
1 X X X X
= || αi∗ yi xi ||2 − (αi∗ yi xTi ( αj∗ yj xj )) + αi∗
2
i=1 i=1 j=1 i=1
1
= α∗ T 1 − α∗ T Qα∗
2
where Qij = yi (xTi xj )yj (and Q = (diag y)XX T (diag y)).
Now, we can write the final form of the dual, which is only in terms of α and x and y (Note that
we have eliminated all references to β):
1
max αT 1 − αT Qα
α 2
n
(7.9)
X
s.t. αi yi = 0
i=1
0 ≤ αi ≤ C i = 1, . . . , n
Geometric intuition
We’ve seen that the optimal value of the dual problem in terms of α is equivalent to the optimal
value of the primal problem in terms of w, t, and ξ. But what do these dual values αi even mean?
That’s a good question!
We know that the following KKT conditions are enforced:
• Stationarity
C − αi∗ − βi∗ = 0
• Complementary slackness
αi∗ · ((1 − ξi∗ ) − yi (w∗ T xi − t∗ )) = 0
7.2. THE DUAL OF SVMS 117
βi∗ · ξi∗ = 0
Here are some noteworthy relationships between αi and the properties of the SVMs:
• Case 1: αi∗ = 0. In this case, we know βi∗ = C, which is nonzero, and therefore ξi∗ = 0. That
is, if for point i we have that αi∗ = 0 by the dual problem, then we know that there is no slack
given to this point. Looking at the other complementary slackness condition, this makes sense
because if αi∗ = 0, then yi (w∗ T xi − t∗ ) − (1 − ξi∗ ) may be any value, and if we’re minimizing
the sum of our ξi ’s, we should have ξi∗ = 0. So, point i lies on or outside the margin.
• Case 2: αi∗ is nonzero. If this is the case, then we know βi∗ = C − αi∗ ≥ 0
– Case 2.1: αi∗ = C. If this is the case, then we know βi∗ = 0, and therefore ξi∗ may be
exactly 0 or nonzero. So, point i lies on or inside the margin.
– Case 2.2: 0 < αi∗ < C. In this case, then βi∗ is nonzero and ξi∗ = 0. But this is different
from Case 1 because with αi∗ nonzero, we can divide by αi∗ in the complementary slackness
condition and arrive at the fact that 1 − yi (w∗ T xi − t∗ ) = 0 =⇒ yi (w∗ T xi − t∗ ) = 1,
which means xi lies exactly on the margin determined by w∗ and t∗ . So, point i lies on
the margin.
Finally, let’s reconstruct the optimal primal values w∗ , t∗ , ξi∗ from the optimal dual values α∗ :
n
X
∗
w = αi∗ yi xi
i=1
t = mean(w∗ T xi
∗
∀i : 0 < αi∗ < C) (7.10)
(
1 − yi (w∗ T xi − t∗ ) if αi∗ = C,
ξi∗ =
0 otherwise
118 CHAPTER 7. DUALITY, KERNELS
7.3 Kernels
Before we begin our discussion of kernels, let’s first introduce the Fundamental Theorem of
Linear Algebra (FTLA). Suppose that there is a matrix (linear map) A that maps Rn to Rm .
Denote N (A) as the nullspace of A, and R(A) as the range of A. Then the following properties
hold:
⊥
1. N (A) ⊕ R(AT ) = Rn
The symbol ⊕ indicates that we taking a direct sum of N (A) and R(AT ), which means that
∀u ∈ Rn there exist unique elements u1 ∈ N (A) and u2 ∈ R(AT ) such that u = u1 + u2 .
Furthermore, the symbol ⊥ indicates that N (A) and R(AT ) are orthogonal subspaces.
u ∈ N (A) ⇐⇒ Au = 0
⇐⇒ hv|Aui = 0 ∀v ∈ Rm
D E
⇐⇒ AT v u = 0 ∀v ∈ Rm
⇐⇒ u ∈ R(AT )⊥
⊥
N (AT ) ⊕ R(A) = Rm , by symmetry.
2. N (AT A) = N (A)
u ∈ N (A) ⇐⇒ Au = 0
=⇒ AT Au = 0
⇐⇒ u ∈ N (AT A)
u ∈ N (AT A) ⇐⇒ AT Au = 0
D E
⇐⇒ u1 AT Au = 0 ∀u1 ∈ Rn
3. R(AT A) = R(AT )
7.3. KERNELS 119
Proof.
u ∈ R(AT ) ⇐⇒ ∃v ∈ Rm AT v = u
⇐⇒ ∃v1 ∈ R(A), v2 ∈ N (AT ) AT (v1 + v2 ) = u
⇐⇒ ∃v1 ∈ R(A) AT v1 = u
⇐⇒ ∃u1 ∈ Rn AT (Au1 ) = u
⇐⇒ u ∈ R(AT A)
What exactly is the fundamental theorem of linear algebra useful for in the context of kernels?
In most machine learning problems such as regression and SVM, we given a vector y ∈ Rn and a
matrix X ∈ Rn×m , where n is the number of training points and m is the dimension of the raw
data points. In most settings we don’t want to work with just the raw feature space, so we augment
features to the data points and end up with a matrix A ∈ Rn×d , where ai = φ(xi ) ∈ Rd . Then we
solve a well-defined optimization problem that involves A and y, over the weights w ∈ Rd . Note
the problem that arises here. If we have polynomial features of degree at most p in the raw m
dimensional space, then there are d = O(mp ) terms that we need to optimize, which can be very
very very large, in fact much larger than the number of training points n. Wouldn’t it be useful,
if instead of solving a optimization problem over d variables, we could solve an equivalent problem
over (potentially much smaller) n variables? Here’s where FTLA comes in. We know that we can
express any w ∈ Rd as a unique combination w = w1 + w2 , where w1 ∈ R(AT ) and w2 ∈ N (A).
Equivalently we can express this as w = AT v + r, where v ∈ Rn and r ∈ N (A). Now, instead of
optimizing over w ∈ Rd , we can optimize over v ∈ Rn and r ∈ Rd , which equates to optimizing over
n + d variables. However, as we shall see, the optimization over r will be trivial so we just have to
optimize an n dimensional problem. Let’s first see how this plays our in ridge regression.
120 CHAPTER 7. DUALITY, KERNELS
We know that w = AT v + r, where v ∈ Rn and r ∈ N (A). Let’s now solve ridge regression by
optimizing over the variables v and r instead of w:
We crossed out Ar and 2v T Ar because r ∈ N (A) and therefore Ar = 0. Now we are optimizing
over L(v, r), which is jointly convex in v and r, because its hessian is positive semi-definite. Let’s
show that this is indeed the case:
∇2r L(v, r) = 2I 0
∇r ∇v L(v, r) = ∇v ∇r L(v, r) = 0
Since the cross terms of the hessian are 0, it suffices that ∇2r L(v, r) and ∇2v L(v, r) are psd to
establish joint convexity. With joint convexity established, we can set the gradient to 0 wrt r and
v and obtain the global minimum:
∇r L(v, r) = 2r = 0 =⇒ r∗ = 0
Note that AAT + λI is positive definite and therefore invertible, so we can compute (AAT + λI)−1 y.
Even though (AAT +λI)−1 y is a critical point for which the gradient is 0, it must achieve the global
minimum because the objective is jointly convex. We conclude that
w∗ = AT (AAT + λI)−1 y
w∗ = (AT A + λI)−1 Ay
In fact, these two are equivalent expressions! The question that now arises is which expression
should you pick? Which is more efficient to calculate? We will answer this question after we
introduce kernels.
7.3. KERNELS 121
Alternative Derivation
We can arrive at the same expression for w∗ with some clever algebraic manipulations. Our previous
derivation of ridge regression was w∗ = (AT A+λI)−1 AT Y . Rearranging the terms of this equation,
we have
(AT A + λI)w = AT y
AT Aw + λw = AT y
λw = AT y − AT Aw
1
w = (AT y − AT Aw)
λ
AT y − AT Aw
w=
λ
y − Aw
w = AT
λ
which says that whatever w is, it is some linear combination of the training points ai (because
anything of the form AT v is a linear combination of the columns of AT , which are the training
points). To find w it suffices to find v, where w = AT v.
Recall that the relationship we have to satisfy is AT Aw − λw = AT y. Let’s assume that we had v,
and just substitute AT v in for all the w’s.
AT A(AT v) + λ(AT v) = AT y
AT AAT v + AT (λv) = AT y
AT (AAT v + λv) = AT (y)
We can’t yet isolate v and have a closed-form solution for it, but we can make the observation that
if we found an v such that we had
AAT v + λv = y
that would imply that this v also satisfies the above equation. Note that we did not “cancel the
AT ’s on both sides of the equation.” We saw that having v satisfy one equation implied that it
satisfied the other as well. So, indeed, we can isolate v in this new equation:
and have that the v which satisfies this equation will be such that AT v equals w. We conclude that
the optimal w is
w∗ = AT v ∗ = AT (AAT + λI)−1 y
Kernels
Recall that in the solution for ridge regression we have an AAT term. If we expand this term we
have that
aT1
aT1 a1 aT1 a2 . . .
aT2
..
AAT =
T
.. a1 a2
. . . an =
a2 a1 .
. ..
. T
an an
aTn
122 CHAPTER 7. DUALITY, KERNELS
Each entry AATij is a dot product between ai and aj and can be interpreted as a similarity measure.
In fact, we can express
AATij = ai aj = φ(xi )φ(xj ) = k(xi , xj )
where k(., .) is the kernel function. The kernel function takes raw-feature inputs and outputs their
inner product in the augmented feature space. We denote the matrix of k(xi , xj ) terms as the
Gram matrix and denote it as K:
k(x1 , x1 ) k(x1 , x2 ) ...
..
K = AAT =
k(x2 , x1 .
..
. k(xn , xn )
Formally, k(x1 , x2 ) is defined to be a valid kernel function if there exists a feature map φ(.) such
that ∀x1 , x2 ,
k(x, y) = φ(x1 )φ(x2 )
Equivalently, we can also say that for all sets D = {a1 , a2 , . . . , an }, the Gram matrix K(D) is
positive semi-definite.
Computing the each Gram matrix entry k(xi , xj ) can be done in a straightforward fashion if we
apply the feature map to xi and xj and then take their dot product in the augmented feature space
— this takes O(d) time, where d is the dimensionality of the problem in the augmented feature
space. However, if we use the kernel trick, we can perform this operation in O(m + log p) time,
where m is the dimensionality of the problem in the raw feature space and k is the degree of the
polynomials in the augmented feature space.
Suppose that you are computing k(x, z), using a p-degree polynomial feature map that maps m
dimensional inputs to d = O(mp ) dimensional outputs. Let’s take p = 2 and m = 2 as an example.
We have that
k(x, z) = φ(x)φ(z)
h √ √ √ iT h √ √ √ i
= x21 x22 2x1 x2 2x1 2x2 1 z12 z22 2z1 z2 2z1 2z2 1
= x21 z12 + x22 z22 + 2x1 z1 x2 z2 + 2x1 z1 + 2x2 z2 + 1
= (x21 z12 + 2x1 z1 x2 z2 + x22 z22 ) + 2x1 z1 + 2x2 z2 + 1
= (x1 z1 + x2 z2 )2 + 2(x1 z1 + x2 z2 ) + 1
= (xT z)2 + 2xT z + 1
= (xT z + 1)2
1. Raising the inputs to the augmented feature space and take their inner product
2. Computing (xT z + 1)2 , which involves an inner product of the raw-feature inputs
Clearly, the latter option is much cheaper to calculate, taking O(m + log p) time, instead of O(mp )
time. In fact, this concept generalizes for any arbitrary m and p, and for p-degree polynomial
7.5. NEAREST NEIGHBOR CLASSIFICATION 123
Computational Analysis
w∗ = AT (AAT + λI)−1 y
or
w∗ = (AT A + λI)−1 Ay
Let’s compare their computational complexities. Suppose you are given an arbitrary test point
z ∈ Rm , and you would like to compute its predicted value ŷ. Let’s see how these values are
calculated in each case:
1. Kernelized
T
ŷ = φ(z)w∗ = φ(z)T AT (AAT + λI)−1 y = k(x1 , z) . . . k(xn , z) (K + λI)−1 y
Computing the K term takes O(n2 (m + log p)), and inverting the matrix takes O(n3 ). These
two computations dominate, for a total computation time of O(n3 + n2 (m + log p)).
2. Non-kernelized
ŷ = φ(z)w∗ = φ(z)T (AT A + λI)−1 AT y
Computing the AT A term takes O(d2 n), and inverting the matrix takes O(d3 ). These two
computations dominate, for a total computation time of O(d3 + d2 n) = O(m3p + m2p n).
Here is the takeaway: if d << n, the non-kernelized method is preferable. Otherwise if n << d,
the kernelized method is preferable.
This discussion now motivates a major reason for dual SVMs. Recall that the dual SVM for-
mulation is an optimization problem over n variables instead of d variables. It incorporates a
Q = (diag y)AAT (diag y) term, and using the kernel trick, the AAT term takes O(n2 (m + log p))
instead of O(n2 mp ) to calculate, making the dual SVM formulation a very attractive choice when
n << d.
Kernelization can be applied to pretty much any machine learning problem, such as logistic regres-
sion, k nearest neighbors, and even PCA.
In classification, it is reasonable to conjecture that data points that are sufficiently close to one
another should be of the same class. For example, in fruit classification, perturbing a few pixels in
an image of a banana should still result in something that looks like a banana. The k-nearest-
neighbors (k-NN) classifier is based on this observation. Assuming that there is no preprocessing
of the training data, the training time for k-NN is effectively O(1). To train this classifier, we simply
124 CHAPTER 7. DUALITY, KERNELS
store our training data for future reference.1 For this reason, k-NN is sometimes referred to as “lazy
learning.” The major work of k-NNs in done at testing time: to predict on a test data point z,
we compute the k closest training data points to z, where “closeness” can be quantified in some
distance function such as Euclidean distance - these are the k nearest neighbors to z. We then
find the most common class y among these k neighbors and classify z as y (that is, we perform a
majority vote). For binary classification, k is usually chosen to be odd so we can break ties cleanly.
Note that k-NN can also be applied to regression tasks — in that case k-NN would return the
average label of the k nearest points.
Figure 7.2: Voronoi diagram for k-NN. Points in a region shaded a certain color will be classified as that
color. Test points in a region shaded with a combination of 2 colors have those colors as their 2 nearest
neighbors.
Choosing k
Nearest neighbors can produce very complex decision functions, and its behavior is highly dependent
on the choice of k.
1 Sometimes we store the data in a specialized structure called a k-d tree. This data structure is out of scope for this course,
but it usually allows for faster (average-case O(log n)) nearest neighbors queries.
7.5. NEAREST NEIGHBOR CLASSIFICATION 125
(a) k = 1 (b) k = 15
Figure 7.3: Voronoi diagram for k = 1 vs. k = 15. Figure from Introduction to Statistical Learning.
Choosing k = 1, we achieve an optimal training error of 0 because each training point will classify
as itself, thus achieving 100% accuracy on itself. However, k = 1 overfits to the training data, and
is a terrible choice in the context of the bias-variance tradeoff. Increasing k leads to an increase in
training error, but a decrease in testing error and achieves better generalization. At one point, if
k becomes too large, the algorithm will underfit the training data, and suffer from huge bias. In
general, in order to select k we use cross-validation.
Figure 7.4: Training and Testing error as a function of k. Figure from Introduction to Statistical Learning.
Bias-Variance Analysis
Let’s justify this reasoning formally for k-NN applied to regression tasks. Suppose we are given
a training dataset D = {(xi , yi )}ni=1 , where the labels yi are real valued scalars. We model our
126 CHAPTER 7. DUALITY, KERNELS
hypothesis h(z) as
n
1X
h(z) = N (xi , z, k)
k
i=1
where the function N is defined as
(
yi if xi is one of the k closest points to z
N (xi , z, k) =
0 o.w.
Suppose also we assume our labels yi = f (xi ) + , where is the noise that comes from N (0, σ 2 )
and f is the true function. Let x1 . . . xk be the k closest points. Let’s first derive the bias2 of our
model for the given dataset D.
2 2
n k
2 1X 1X
E[h(z)] − f (z) = E N (xi , z, k) − f (z) = E yi − f (z)
k k
i=1 i=1
2 2
k k
1 X 1 X
= E[yi ] − f (z) = E[f (xi ) + ] − f (z)
k k
i=1 i=1
2
k
1X
= f (xi ) − f (z)
k
i=1
When k −→ ∞, then k1 ki=1 f (xi ) goes to the average label for x. When k = 1, then the bias
P
is simply f (x1 ) − f (z). Assuming x1 is close enough to f (z), the bias would likely be small when
k = 1 since it’s likely to share a similar label. Meanwhile, when k −→ ∞, the bias doesn’t depend
on the training points at all which like will restrict it to be higher.
Now, let’s derive the variance of our model.
k k
1 X 1 X
V ar[h(z)] = V ar yi = 2 V ar[f (xi ) + ]
k k
i=1 i=1
k
1 X
= V ar[]
k2
i=1
k
1 X 1 σ2
= σ2 = kσ 2
=
k2 k2 k
i=1
Properties
Computational complexity: We require O(n) space to store a training set of size n. There is no
runtime cost during training if we do not use specialized data structures to store the data. However,
predictions take O(n) time, which is costly. There has been research into approximate nearest
neighbors (ANN) procedures that quickly find an approximation for the nearest neighbor - some
common ANN methods are Locality-Sensitive Hashing and algorithms that perform dimensionality
reduction via randomized (Johnson-Lindenstrauss) distance-preserving projections.2
2 ANN methods are beyond the scope of this course, but are useful in real applications.
7.5. NEAREST NEIGHBOR CLASSIFICATION 127
Flexibility: When k > 1, k-NN can be modified to output predicted probabilities P (Y |X) by
defining P (Y |X) as the proportion of nearest neighbors to X in the training set that have class Y .
k-NN can also be adapted for regression — instead of taking the majority vote, take the average
of the y values for the nearest neighbors. k-NN can learn very complicated, non-linear decision
boundaries.
Non-parametric: k-NN is a non-parametric method, which means that the number of parameters
in the model grows with n, the number of training points. This is as opposed to parametric
methods, for which the number of parameters is independent of n. Some examples of parametric
models include linear regression, LDA, and neural networks.
Behavior in high dimensions: k-NN does not behave well in high dimensions. As the dimension
increases, data points drift farther apart, so even the nearest neighbor to a point will tend to be
very far away.
Theoretical properties: 1-NN has impressive theoretical guarantees for such a simple method. Cover
and Hart, 1967 prove that as the number of training samples n approaches infinity, the expected
prediction error for 1-NN is upper bounded by 2∗ , where ∗ is the Bayes (optimal) error. Fix and
Hodges, 1951 prove that as n and k approach infinity and if nk → 0, then the k nearest neighbor
error approaches the Bayes error.
Curse of Dimensionality
To understand why k-NN does not perform well in high-dimensional space, we first need to un-
derstand the properties of metric spaces. In high-dimensional spaces, much of our low-dimensional
intuition breaks down. Here is one classical example. Consider a ball in Rd centered at the origin
with radius r, and suppose we have another ball of radius r − centered at the origin. In low
dimensions, we can visually see that much of the volume of the outer ball is also in the inner ball.
In general, the volume of the outer ball is proportional to rd , while the volume of the inner ball is
proportional to (r − )d . Thus the ratio of the volume of the inner ball to that of the outer ball is
(r − )d d
= 1− ≈ e−d/r −→ 0
rd r d→∞
Hence as d gets large, most of the volume of the outer ball is concentrated in the annular region
{x : r − < x < r} instead of the inner ball.
High dimensions also make Gaussian distributions behave counter-intuitively. Suppose X ∼ N (0, σ 2 I).
If Xi are the components of X and R is the distance from X to the origin, then R2 = di=1 Xi2 . We
P
128 CHAPTER 7. DUALITY, KERNELS
have E(R2 ) = dσ 2 , so in expectation a random Gaussian will actually be reasonably far from the
origin. If σ = 1, then R2 is distributed chi-squared with d degrees of freedom. One can show that
in high dimensions, with high probability 1 − O(e−d ), this multivariate Gaussian will lie within
the annular region {X : |R2 − E(R2 )| ≤ d1/2+ } where E(R2 ) = dσ 2 (one possible approach is to
note that as d → ∞, the chi-squared approaches a Gaussian by the CLT, and use a Chernoff bound
to show exponential decay). This phenomenon is known as concentration of measure. Without
resorting to more complicated inequalities, we can show a simple, weaker result:
Theorem: If Xi ∼ N (0, σ 2 ), i = 1, ..., d are independent and R2 = di=1 Xi2 , then for every > 0,
P
the following holds:
1
lim P (|R2 − E(R2 )| ≥ d 2 + ) = 0
d→∞
Thus in the limit, the squared radius is concentrated about its mean.
Proof. From the formula for the variance of a chi-squared distribution, we see that V ar(R2 ) = 2dσ 4 .
Applying a Chebyshev bound yields
1 2dσ 4
P (|R2 − E(R2 )| ≥ d 2 + ) ≤ −→ 0
d1+2 d→∞
Thus a random Gaussian will lie within a thin annular region away from the origin in high dimen-
sions with high probability, even though the mode of the Gaussian bell curve is at the origin. This
illustrates the phenomenon in high dimensions where random data is spread very far apart. The
k-NN classifier was conceived on the principle that nearby points should be of the same class -
however, in high dimensions, even the nearest neighbors that we have to a random test point will
tend to be far away, so this principle is no longer useful.
Improving k-NN
There are two main ways to improve k-NN and overcome the shortcomings we have discussed.
2. Reduce the dimensionality of the features and/or pick better features. Consider other choices
of distance function.
One example of reducing the dimensionality in image space is to lower the resolution of the image
— while this is throwing some of the original pixel features away, we may still be able to get the
same or better performance with a nearest neighbors method.
We can also modify the distance function. For example, we have a whole family of Minkowski
distances that are induced by the Lp norms:
1
d p
X
p
Dp (x, z) = |xi − zi |
i=1
Without preprocessing the data, 1-NN with the L3 distance outperforms 1-NN with L2 on MNIST.
7.5. NEAREST NEIGHBOR CLASSIFICATION 129
We can also use kernels to compute distances in a different feature space. For example, if k is a
kernel with associated feature map Φ and we want to compute the Euclidean distance from Φ(x)
to Φ(z), then we have
Sparsity
Note that if a point xi has a nonzero slack ξi > 0, by definition it must lie inside the margin. Due
to the heavy penalty factor C for violating the margin there are relatively few such points, and
thus the slack vector ξ is sparse — most of its entries are 0. We are interested in explaining why
this phenomenon occurs in this specific optimization problem, and identifying the key properties
that determine sparse solutions for arbitrary optimization problems.
To reason about the SVM case, let’s see how changing some arbitrary slack variable ξi affects the
loss. A unit decrease in ξi results in a “reward” of C, and is captured by the partial derivative
∂L
∂ξi . Note that no matter what the current value of ξi is, the reward for decreasing ξi is constant.
Of course, decreasing ξi may change the boundary and thus the cost attributed to the size of the
margin kwk2 . The overall reward for decreasing ξi is either going to be worth the effort (greater
than cost incurred from w) or not worth the effort (less than cost incurred from w). Intuitively, ξi
will continue to decrease until it hits a lower-bound “equilibrium” — which is often just 0.
Now consider the following formulation (constraints omitted for brevity again):
n
1 X
min kwk2 + C ξi2
w,ξ 2
i=1
The reward for decreasing ξi is no longer constant — at any point, a unit decrease in ξi results in a
“reward” of 2Cξi . As ξi approaches 0, the rewards get smaller and smaller, reaching infinitesimal
values. On the other hand, decreasing ξi causes a finite increase in the cost incurred by the kwk2
— the same increase in cost as in the previous example. Intuitively, we can reason that there will
be a threshold value ξi∗ such that decreasing ξi further will no longer outweigh the cost incurred by
the size of the margin, and that the ξi ’s will halt their descent before they hit zero.
There are many motivations for designing optimization problems with sparse solutions. One advan-
tage of sparse solutions is that they speed up testing time. In the context of primal problems, if the
131
132 CHAPTER 8. SPARSITY
weight vector w is sparse, then after we compute w in training, we can discard features/dimensions
with 0 weight, as they will contribute nothing to the evaluation of the hypothesized regression
values of test points. A similar reasoning applies to dual problems with dual weight vector v,
allowing us to discard the training points corresponding to dual weight 0, ultimately allowing for
faster evaluation of our hypothesis function on test points.
LASSO
Given the motivations for sparsity in SVMs, let’s modify the ridge regression objective to achieve
sparsity as well. The least absolute shrinkage and selection operator (LASSO) developed
in 1996 by Robert Tibshirani, is one method that achieves this desired effect. LASSO is identical
to the ridge regression objective, except that the L2-norm (squared) penalizing w is now changed
to an L1-norm (with no squaring term):
d
X
kzk1 = |wi |
i=1
Compare this to the L2-norm squared of w, the sum of squared values of its entries:
d
X
kzk2 = wi2
i=1
Just as in ridge regression, there is a statistical justification for using the L1-norm. Whereas
ridge regression assumes a Gaussian prior over the weights w, LASSO assumes a Laplace prior
(otherwise known as a double exponential distribution) over the weights w.
Let’s understand why such a simple change from the L2 to L1-norm inherently leads to a sparse
solution. For any particular component wi of w, note that the corresponding loss in LASSO is
the absolute value |wi |, while the loss in ridge regression is the squared term wi2 . In the case of
LASSO the “reward” for decreasing wi by a unit amount is a constant λ, while for ridge regression
the equivalent “reward” is 2λwi , which depends on the value of wi . Thus for the same reasons
as we presented for SVMs, LASSO achieves sparsity while ridge regression does not. There is a
compelling geometric argument behind this reasoning as well.
8.2. COORDINATE DESCENT 133
Figure 8.1: Comparing contour plots for LASSO (left) vs. ridge regression (right).
Suppose for simplicity that we are only working with 2-dimensional data points and are thus
optimizing over two weight variables w1 and w2 . In both figures above, the red ellipses represent
isocontours in w-space of the squared loss kXw − yk2 . In ridge regression, each isocontour of
λkwk22 is represented by a circle, one of which is shown in the right figure. Note that the optimal
w will only occur at points of tangency between the red ellipse and the blue circle. Otherwise
we could always move along the isocontour of one of the functions (keeping its overall cost fixed)
while improving the value of the the other function, thereby improving the overall value of the
loss function. We can’t really infer much about these points of tangency other than the fact that
the blue circle centered at the origin draws the optimal point closer to the origin (ridge regression
penalizes large weights).
Now, let’s examine the LASSO case. The red ellipses represent the same objective kXw − yk2 ,
but now the L1 regularization term λkwk1 is represented by diamond isocontours. As with ridge
regression, note that the optimal point in w-space must occur at points of tangency between the
ellipse and the diamond. Due to the “pointy” property of the diamonds, tangency is very likely to
happen at the corners of the diamond because they are single points from which the rest of the
diamond draws away from. And what are the corners of the diamond? Why, they are points at
which one component of w is 0!
Note that the LASSO objective function is convex but not differentiable, due to the “pointiness”
of the L1-norm. This is a problem for gradient descent techniques — in particular, LASSO zeros
out features, and once these weights are set to 0 the objective function becomes non-differentiable.
Coordinate descent is an alternative optimization algorithm that can solve convex but non-
differentiable problems such as LASSO.
While SGD focuses on iteratively optimizing the value of the objective L(w) for each sample in the
training set, coordinate descent iteratively optimizes the value of the objective for each feature.
134 CHAPTER 8. SPARSITY
Coordinate descent is guaranteed to find the global minimum if L is jointly convex. No such
guarantees can be made however if L is only elementwise convex, since it may have local minima.
To understand why, let’s start by understanding elementwise vs joint convexity. Suppose you are
trying to minimize f (x, y), a function of two scalar variables x and y. For simplicity, assume that f
is twice differentiable, so we can take its hessian. f (x, y) is element-wise convex in x if its hessian
is psd when y is fixed:
∂2
f (x, y) ≥ 0
∂x∂x
Same goes for element-wise convexity in y.
f (x, y) is jointly convex in x and y if its hessian ∇2 f (x, y) is psd. Note that being element-wise
convex in both x and y does not imply joint convexity in x and y (consider f (x, y) = x2 + y 2 − 4xy
as an example). However, being joint convexity in x and y does imply being element-wise convex
in both x and y.
Now, if f (x, y) was jointly convex, then we could find the gradient wrt x and y individually, set them
to 0, and be guaranteed that would be the global minimum. Can we do this if f (x, y) is element-
wise convex in both x and y? Actually not always. Even though it is true that minx,y f (x, y) =
minx miny f (x, y), we can’t always just set gradients to 0 if f (x, y) is not jointly convex. The reason
for this is that while the inner optimization problem over y is convex, the outer optimization problem
over x may no longer be convex. In the case when joint convexity is not reached, there is no clean
strategy to find global minimum and we must analyze all of the critical points to find the minimum.
In the case of LASSO, the objective function is jointly convex, so we can use coordinate descent.
There are a few details to be filled in, namely the choice of which feature to update and how wi
is updated. One simple way is to just pick a random feature i each iteration. After choosing the
feature, we have to update wi ← arg minwi L(w). For LASSO, it turns out there is a closed-form
solution (note that we are only minimizing with respect to one feature instead of all the features).
The LASSO objective is kXw − yk22 + λkwk1 . It will be convenient to separate the terms that
depend on wi from those that don’t. Denoting xj as the j-th column of X, we have
n
(1)
X
L(w) = λ|wi | + C (2) + (wi xj,i + Cj )2
j=1
Suppose the optimal wi is strictly positive: wi > 0. Setting the partial derivative of the objective
8.3. SPARSE LEAST-SQUARES 135
Pn (1) Pn 2
Denoting a = − j=1 2xj,i Cj and b = j=1 2xj,i , we have
−λ + a
wi∗ =
b
−λ+a
But this only holds if the right hand side, b , is actually positive. If it is negative or 0, then this
means there is no optimum in (0, ∞).
When wi∗ < 0, then similar calculations will lead to
λ+a
wi∗ =
b
λ+a
Again, this only holds if b is actually negative. If it is positive or 0, then there is no optimum in
(−∞, 0).
If neither the conditions −λ+a
b > 0 or λ+a
b < 0 hold, then there is no optimum in (−∞, 0) or (0, ∞).
But the LASSO objective is convex in wi and has an optimum somewhere, thus in this case wi∗ = 0.
For this to happen, −λ+ab ≤ 0 and λ+a
b ≥ 0. Rearranging, we can see this is equivalent to |a| ≤ λ.
To summarize:
0
if |a| ≤ λ
wi∗ = −λ+a
b if −λ+a
b >0
λ+a λ+a
if b < 0
b
where
n n
(1)
X X
a=− 2xj,i Cj , b= 2x2j,i
j=1 j=1
The term ab is the least squares solution (without regularization), so we can see that the regu-
larization term tries to pull the least squares update towards 0. This is not a gradient-descent
based update — we have a closed-form solution for the optimum wi , given all the other weights
are fixed constants. We can also see explicitly how the LASSO objective induces sparsity — a is
some function of the data and the other weights, and when |a| ≤ λ, we set wi = 0 in this iteration
of coordinate descent. By increasing λ, we increase the threshold for |a| to be set to 0, and our
solution becomes more sparse.
Note that during the optimization, weights can be set to 0, but they can also be “reactivated” after
they have been set to 0 in a previous iteration, since a is affected by factors other than wi .
Suppose we want to solve the least squares objective, subject to a constraint that w is sparse.
Mathematically this is expressed as
min kXw − yk22
w
s.t. kwk0 ≤ k
136 CHAPTER 8. SPARSITY
Note that the L0-norm of w is simply the number of non-zero elements in w. This optimization
problem aims to minimize the residual error between our prediction Xw and y while ensuring that
the solution w is sparse. Solving this optimization problem is NP-hard, so we instead aim to find a
computationally feasible alternative method that can approximate the optimal solution. Matching
pursuit is a greedy algorithm that achieves this goal.
Recall that in ordinary least squares, we minimize the residual error kXw − yk22 by projecting y
onto the subspace spanned by the columns of X, thereby obtaining a linear combination w∗ of the
columns of X that minimizes the residual error. Matching pursuit is a greedy algorithm that starts
with with a completely sparse solution (w = 0) and iteratively “builds up” w until the the sparsity
constraint kwk0 ≤ k can no longer be met. The algorithm is as follows:
At each step of the algorithm, we pick the coordinate i such that xi (the i-th column of X corre-
sponding to feature i, not datapoint i) minimizes the residual r. This equates to finding the index
i for which the length of the projection onto xi is maximized:
hr, xj i
i = arg max
j kxj k
Let’s see why this is true. When we project the residual r on the vector xj , the resulting projection
has length hr, xj i/kxj k. The length of the new residual follows from pythagoras theorem:
!2
2 2 hr, xj i hr, xj i
krold k = krnew k + =⇒ i = arg max
kxj k j kxj k
We move wi to the optimum projection value and repeat greedily at each iteration. While matching
pursuit is not guaranteed to find the optimal w∗ , in practice it works well for most applications.
Setting the number of iterations is typically determined through cross-validation.
Chapter 9
A decision tree is a type of classifier which classifies things by repeatedly posing questions of the
form, “Is feature xi greater than value v?” You can visualize such a thing as a tree, where each
node consists of a split-feature, split-value pair (xi , v) and the leaves are possible classes. Here’s a
picture:
Note that a class may appear multiple times at the leaves of the decision tree. Picking a split-
feature, split-value pair essentially carves up the feature space into chunks, and you can imagine
that having enough of these pairs lets you create an arbitrarily complex classifier that can perfectly
overfit any training set (unless two training points of different classes coincide).
When we consider the task of building this tree, we want to ask ourselves a couple of things:
Let’s address the first point. Intuitively, when we pick a feature to split on, we want choose the
one which maximizes how sure we are of the classes of the two resulting chunks. If we had a bunch
137
138 CHAPTER 9. DECISION TREE LEARNING
of data points and the barrier represented by the line xi = v perfectly split them such that all the
instances of the positive class were on one side and all the instances of the negative class were on
the other, then the split (xi , v) is a pretty good one.
Now, to concretely talk about the idea of “how sure” we are, we’ll use the ideas of surprise and
entropy. The numerical surprise of observing that a random variable X takes on value k is:
1
log = − log P (X = k)
P (X = k)
H(X) = E[surprise]
1
= EX [log ]
P (X = k)
X 1
= P (X = k) log
P (X = k)
k
Here’s a graph of entropy vs. P (X = k) for a binary discrete random variable X (as in, k ∈ {0, 1}):
Observe that (1) it’s convex and (2) it’s maximized when the probability distribution is even with
respect to the outcomes. That is to say, a coin that is completely fair (P (X = 0) = P (X = 1) = 0.5)
has more entropy than a coin that is biased. This is because we are less sure of the outcome of the
fair coin than the biased coin overall. Even though we are more surprised when the biased coin
comes up as its more unlikely outcome, the way that entropy is defined gives a higher uncertainty
score to the fair coin. In general, a random variable has more entropy when the distribution of
its outcomes is closer to uniform and less entropy when the distribution is highly skewed to one
outcome.
An assumption
By the way, it’s strange that entropy is defined for random variables when there are no
random variables in a training set. To address this, we make the assumption that a random
variable X on the possible classes is representative of our training set. The distribution of X
is defined by our training points (xi , yi ), where P (X = cj ) = nnj with nj being the number
of training points whose yi is cj .
So now we know that when we choose a split-feature, split-value pair, we want to reduce entropy in
some way. First, let Yfi ,v be an indicator function that is 1 for a data point if its feature fi is less
than v. Note that Y is not a random variable, though we will use it in some notations that make
9.1. DECISION TREES 139
it appear like one. When we consider splitting one set of points represented by random variable X
with split-key (fi , v), there are a few entropies we should consider.
• H(X)
• H(X|Y = 1). That is, the entropy of the set of points whose fi is less than v.
• H(X|Y = 0). That is, the entropy of the set of points whose fi is not less than v.
In some ways H(X) is non-negotiable: we start with the set of points represented by X, and they
have some entropy, and now we wish to carve up those points in a way to minimize the entropy
remaining. Thus, the thing we want to minimize is a weighted average of the two sides of the split,
where the weights are (proportional to) the sizes of two sides:
An equivalent way of seeing this is that we want to maximize the information we’ve learned, which
represented by how much H(X) gets reduced after learning Y :
Note that I(X;Y) is nonnegative and it’s zero only if the resulting sides of the split have the same
distribution of classes as the original set of points. Let’s say you were using a decision tree to
classify emails as spam and ham. The precedng statement says, you gain no information if you
take a set of (20 ham, 10 spam) and split it on some feature to give you sets of (12 ham, 6 spam); (8
ham, 4 spam) because the empirical distribution of those two resulting sets is equal to the original
one.
Now that we have addressed how to pick splits, let’s talk about when to stop growing the tree. We
know that decision trees are powerful tools because they can be used to represent any arbitrarily
complex decision boundaries. Thus, it is very easy to overfit them. One way to do this is to keep
carving up the feature space until the leaves are entirely pure (as in, they only contain points of a
single class). As a result, when training decision trees, we always want to keep in mind a couple of
heuristics for stopping early:
• Limited depth: keep track of the depth of each split, and don’t go past some depth t
• Information gain criteria: stop carving once your splits don’t reward you with enough infor-
mation (i.e., set a threshold for I(X; Y ))
140 CHAPTER 9. DECISION TREE LEARNING
• Pruning: This isn’t a method for stopping early. Rather it’s the strategy of growing your
tree out as far as you can, and then combining splits back together if doing so reduces your
validation error.
Another way to combat overfitting is the use of random forests. A random forest is a set of
decision trees whose outputs are averaged together in some way (usually majority vote) to produce
the final answer to a classification problem. In addition, each tree of the forest will have some
elements of randomness to them. Let’s think about why this randomness is necessary.
Imagine you took the same set of training points and you built k decision trees on them with the
method of choosing splits described in the previous section. What could you say about these k
trees? Answer: They would all be the exact same trees in terms of what splits are chosen! Why:
because there is only one best split of a set of training points, and if we use the same algorithm on
the same set of training points, we’ll get the same result.
There are two options: (1) change our algorithm for each tree, or (2) change our training points.
The algorithm is pretty good, so let’s change the training points. In particular, we’ll use the method
of bagging: for each tree, we sample with replacement some constant number of training points
to “bag” and use for training. This will avoid the problem of having identical decision trees...
...Or will it? Imagine that you had an extremely predictive feature, like the appearance of the
word “viagra” for classifying spam, for your classification task. Then, even if you took a random
subsample of training points, your k trees would still be very similar, most likely choosing to split
on the same features. Thus, the randomness of our trees will come from:
Both the size of the random subsample of training points and the number of features per tree are
hyperparameters intended to be tuned through cross-validation.
Remember why this is a good idea: One decision tree by itself is prone to overfitting to the training
set. However, if we have a bunch of them that are diverse and uncorrelated, then we can be more
sure that their average is closer to the true classifier.
9.3 Boosting
We have seen that in the case of random forests, combining many imperfect models can produce
a single model that works very well. This is the idea of ensemble methods. However, random
forests treat each member of the forest equally, taking a majority vote or an average over their
outputs. The idea of boosting is to combine the models (typically called weak learners in this
context) in a more principled manner. Roughly, we want to:
• Treat each of the weak classifiers (e.g., trees in a random forest) as “features”
• Hunt for better “features” based on the current performance of the overall classifier
9.3. BOOSTING 141
The key idea is as follows: to improve our classifier, we should focus on finding learners that
correctly classify the points which the overall boosted classifier is currently getting wrong. Boosting
algorithms implement this idea by associating a weight with each training point and iteratively
reweighting so that misclassified points have relatively high weights. Intuitively, some points are
“harder” to classify than others, so the algorithm should focus its efforts on those.
AdaBoost
There are many flavors of boosting. We will discuss one of the most popular versions, known as
AdaBoost (short for adaptive boosting). Its developers won the Gödel Prize for this work.
Algorithm
2. Repeat for m = 1, . . . , M :
(a) Build a classifier Gm with data weighted according to wi .
P
wi
(b) Compute the weighted error em = i misclassified
P
wi .
i
We first address the issue of step (a): how do we train a classifier if we want to weight different
samples differently? One common way to do this is to resample from the original training set
every iteration to create a new training set that is fed to the next classifier. Specifically, we create
a training set of size n by sampling n values from the original training data with replacement,
according to the distribution wi . (This is why we might renormalize the weights in step (d).) This
way, data points with large values of wi are more likely to be included in this training set, and the
next classifier will place higher priority on such data points.
Suppose that our weak learners always produce an error em < 1/2. To make sense of the formulas
we see in the algorithm, note that for step
q (c), if the i-th data point is misclassified, then the
1−em
weight wi gets increased by a factor of em (more priority placed on sample i), while if it
is classified correctly, the priority gets decreased. AdaBoost does have a weakness in that this
aggressive reweighting can cause the classifier to focus too much on certain training examples – if
the data is noisy with outliers, then this will weaken the boosting algorithm’s performance.
We have not yet discussed how to make a prediction given our classifiers (say, G1 , . . . , GM ). One
conceivable method is to use logistic regression with Gm (x) as features. However, a smarter choice
that is based on the AdaBoost algorithm is to set
1 1 − em
αm = ln
2 em
142 CHAPTER 9. DECISION TREE LEARNING
P
and classify x by sign( m αm Gm (x)). Note that this choice of αm (derived later) gives high weight
to classifiers that have low error:
• As em → 0, 1−em
em → ∞, so αm → ∞.
• As em → 1, 1−em
em → 0, so αm → −∞.
We now proceed to demystify the formulas in the algorithm by presenting a matching pursuit
interpretation of AdaBoost.
Derivation of AdaBoost
Suppose we have computed classifiers G1 , ..., Gm−1 along with their corresponding weights αk and
we want to computeP the next classifier Gm along with its weight αm . The output of our model so
far is Fm−1 (x) := m−1
i=1 αi Gi (x), and we want to minimize the risk:
n
X
arg min L(yi , Fm−1 (xi ) + αm Gm (xi ))
αm ,Gm i=1
for some suitable loss function L. Loss functions we have previously used include mean squared
error for linear regression, cross-entropy loss for logistic regression, and hinge loss for SVM. For
AdaBoost, we use the exponential loss:
This loss function is illustrated in Figure 9.1. We have exponential decay as we increase the input
– thus if yh(x) is large and positive (so h(x) has the correct sign and high magnitude), our loss
is decreasing exponentially. Conversely, if yh(x) is a large negative value, our loss is increasing
exponentially, and thus we are heavily penalized for confidently making an incorrect prediction.
Figure 9.1: The exponential loss provides exponentially increasing penalty for confident incorrect predictions.
This figure is from Cornell CS4780 notes.
9.3. BOOSTING 143
We can write the AdaBoost optimization problem with exponential loss as follows:
n
X
αm , Gm = arg min e−yi (Fm−1 (xi )+αG(xi ))
α,G i=1
n
X
= arg min e−yi Fm−1 (xi ) e−yi αG(xi )
α,G i=1
(m)
The term wi := e−yi Fm−1 (x) is a constant with respect to our optimization variables. We can split
out this sum into the components with correctly classified points and incorrectly classified points:
(m) (m)
X X
αm , Gm = arg min wi e−α + wi e+α (∗)
α,G
yi =G(x) yi 6=G(x)
n
(m) (m)
X X
α −α −α
= arg min(e − e ) wi +e wi
α,G i=1
yi 6=G(xi )
For a fixed value of α, the second term does not depend on G. Thus we can see that the best choice
(m)
of Gm (x) is the classifier that minimizes the error given the weights wi . Let
P (m)
yi 6=Gm (xi ) wi
em = P (m)
i wi
Pn (m)
Once we have obtained Gm , we can solve for αm : by dividing (∗) by the constant i=1 wi , we
obtain
as claimed earlier.
From the optimal αm , we can derive the weights:
(m+1)
wi = exp −yi Fm (xi )
= exp −yi [Fm−1 (xi ) + αm Gm (xi )]
(m)
= wi exp −yi Gm (xi )αm
!
(m) 1 1 − e m
= wi exp −yi Gm (xi ) ln
2 em
− 1 yi Gm (xi )
(m) 1 − em 2
= wi exp ln
em
− 1 yi Gm (xi )
(m) 1 − em 2
= wi
em
144 CHAPTER 9. DECISION TREE LEARNING
q q
em 1−em
Here we see that the multiplicative factor is 1−em when yi = Gm (xi ) and em otherwise. This
completes the derivation of the algorithm.
As a final note about the intuition, we can view these α updates as pushing towards a solution
in some direction until we can no longer improve our performance. More precisely, whenever we
compute αm (and thus w(m+1) ), for the incorrectly classified entries, we have
r
X (m+1)
X (m) 1 − em
wi = wi
em
yi 6=Gm (xi ) yi 6=Gm (xi )
q
(m)
Dividing the right-hand side by ni=1 wi , we obtain em 1−e
P p
em =
m
em (1 − em ). Similarly, for
the correctly classified entries, we have
P (m+1)
yi =Gm (xi ) wi
r
em p
(m)
= (1 − em ) = em (1 − em )
Pn 1 − em
i=1 wi
Thus these two quantities are the same once we have adjusted our α, so the misclassified and
correctly classified sets both get equal total weight.
This observation has an interesting practical implication. Even after the training error goes to zero,
the AdaBoost test error may continue to decrease. This may be counter-intuitive, as one would
expect the classifier to be overfitting to the training data at this point. One interpretation for this
phenomenon is that even though the boosted classifier has achieved perfect training error, it is still
refining its fit in a max-margin fashion, which increases its generalization capabilities.
Chapter 10
Deep Learning
A convolutional neural net is just like a regular neural net, except more complicated (con-
voluted). But in some ways, a convolutional neural net can actually be a simpler model than a
fully-connected neural net.
We’ll be talking about convolutional neural networks (ConvNets) mostly in the framework of
image classification. First, let’s recall what one layer of a regular neural net looks like:
Remember that each layer consists of units which represent a single number. The values on the
units of some layer are determined by the values of the units behind them and the weights on the
edges in between the layers. More specifically, the vector of numbers representing layer 2, say y,
can be represented by linear combinations W x of the vector representing layer 1. If we were dealing
with images, each unit of the input layer might be the intensity of a single pixel of the image.
The fully-connected architecture of units has a lot of weights to learn. For just a small 32 × 32
image, there would be 1024 input units and at least as many weights just between the first two
layers. Now, let’s look at how a ConvNet which deals with images uses its weights differently.
145
146 CHAPTER 10. DEEP LEARNING
Notice that some of the layers are called “conv” and “pool.” These layers are the new ideas that
ConvNets introduce to the neural net architecture.
Convolutional Layers
A convolutional layer is determined by convolving a kernel about the previous layer. The
kernel is just a fancy name for an array of weights, and convolving means that we slide the array
of weights across the pixels of the previous layer and compute the sum of the elementwise products
(kind of like a 2D dot-product). Here is a picture that illustrates this:
In the above picture, we extracted 6 activation values from 12 input values (we would supposedly
pass the dot-products through some kind of nonlinearity as well). In a regular fully-connected neural
net, we would have used 6 × 12 = 72 weights to accomplish this. However, in this convolutional
layer, we only used 4 weights. This is because we made use of weight sharing, as in:
1. the weights w, x, y, z were shared among all the pixels of the input
2. the individual units of the output layer were all determined by the same weights w, x, y, z
Compare this to the fully-connected architecture where for each output unit, there is a separate
weight to learn for each input unit. This kind of strategy decreases the complexity of our model
(there are fewer weights), but it makes sense for image processing because there are lots of repeated
10.1. CONVOLUTIONAL NEURAL NETS 147
patterns in images, and if you have one kernel which can detect some kind of phenomenon, then it
would make sense to use it elsewhere in the image.
How do kernels detect things, anyways? The short answer is: they will produce large values in the
areas of the image which appear most similar to them. Consider a simple kernel [1 − 1]. This
kernel will have produce large values for which the left pixel is bright and the right pixel is dark.
Conversely, it will produce small values for which the left pixel is dark and the right pixel is bright.
Notice how the output image has high values (white) in the areas where the original image turned
from bright to dark (like the right hand side of the figure), and it has low values (black) in the
areas where the original image turned from dark to bright (like the left hand side of the figure).
This kernel can be thought of as a simple edge detector! As another example, consider the kernel:
0.6 0.2 0
0.2 0 0.2
0 0.2 0.6
If this was convolved about an image, it would detect edges at a positive 45-degree angle.
Just a few more things on convolutional layers:
1. You can stack the outputs of multiple kernels together to form a convolutional layer.
Figure 10.2: Here, we slid 6 separate 5 × 5 × 3 kernels across the original image to produce 6 activation maps
in the next convolutional layer. Each activation map can also be referred to as a channel.
2. To save memory, you can have your kernel stride across the image by multiple pixels.
148 CHAPTER 10. DEEP LEARNING
3. Zero-padding is sometimes used to control the exact dimensions of the convolutional layer.
4. As you add more convolutional layers, the effective receptive field of each successive layer
increases. That is to say, as you go downstream (of the layers), the value of any single unit
is informed by an increasingly large patch of the original image. For example. If you use
two successive layers of 3 × 3 kernels, any one unit in the first convolutional layer is informed
by 9 separate image pixels. Any one unit in the second convolutional layer is informed by
9 separate units of the first convolutional layer, which could informed by up to 9 × 9 = 81
original pixels.
Figure 10.3: The highlighted unit in the downstream layer uses information from all the highlighted units in
the input layer.
Pooling Layers
A pooling layer does not involve more weights. Instead, it is a layer whose purpose is solely to
downsample AKA pool AKA gather AKA consolidate the previous layer. It does this by sliding a
window of some size across the units of a layer of the ConvNet and choosing (somehow) one value
to effectively “represent” all the units captured by the window. You can tweak the nature of a
pooling layer in a few orthogonal ways.
1. How to pool? In max-pooling, the representative value just becomes the largest of all the units
in the window. In average-pooling, the representative value is the average of all the units in
the window.
10.1. CONVOLUTIONAL NEURAL NETS 149
Figure 10.4: Here, the input layer of the right image is a translated version of the input layer of the left
image, but because of max-pooling, the next layer looks more or less the same.
(b) Cross-channel pooling pools values across different channels. This has the capability of
introducing transformational invariance to your model.
Figure 10.5: Here, we have an example where our convolutional layer is represented by 3 kernels. Suppose
they were each good for detecting the number 5 in some degree of rotation. If we pooled across the three
channels determined by these kernels, then no matter what orientation of the number “5” we got as input
to our ConvNet, the pooling layer would have a large response!
3. “Lossiness” of pooling. This is determined by the stride of the pooling window. If you stride
by a large amount, you potentially lose more information, but you conserve memory.
If you now look back at the picture of the ConvNet near the beginning of this note, you should have
a better idea of what each layer is doing. The ConvNet in that picture is Yann Lecun’s LeNet,
which is used to classify handwritten alphanumeric characters!
CNN Architectures
Convolutional Neural Networks were first applied successfully to the ImageNet challenge in 2012
and continue to outperform computer vision techniques that do not use neural networks. Here are
a few of the architectures that have been developed over the years.
Key characteristics:
• Conv filters of varying sizes - for example, the first layer has 11 × 11 conv filters
• First use of ReLU, which fixed the problem of saturating gradients in the predominant tanh
activation.
150 CHAPTER 10. DEEP LEARNING
Figure 10.6: AlexNet architecture. Reference: “ImageNet Classification with Deep Convolutional Neural
Networks,” NIPS 2012.
• Several layers of convolution, max pooling, some normalization. Three fully connected layers
at the end of the network (these comprise the majority of the weights in the network).
• Around 60 million weights, over half of which are in the first fully connected layer following
the last convolution.
• Trained over two GPU’s - the top and bottom divisions in Figure 10.6 were due to the need
to separate training onto two GPU’s. There was limited communication between the GPU’s,
as illustrated by the arrows that go between the top and bottom.
• Heavy data augmentation. One form is image translation and reflection: for example, an
elephant facing the left is the same class as an elephant facing the right. The second form is
altering the intensity of RGB color channels: different cameras can have different lighting on
the same objects, so it is necessary to account for this.
Reference paper: “Very Deep Convolutional Networks for Large-Scale Image Recognition,” ICLR
2015.1 Key characteristics:
• Only uses 3×3 convolutional filters. Blocks of conv-conv-conv-pool layers are stacked together,
followed by fully connected layers at the end (the number of convolutional layers between
pooling layers can vary). Note that a stack of 3 3 × 3 conv filters has the same effective
receptive field as one 7 × 7 conv filter. To see this, imagine sliding a 3 × 3 filter over a 7 × 7
image - the result is a 5 × 5 image. Do this twice more and the result is a 1 × 1 cell - sliding
one 7 × 7 filter over the original image would also result in a 1 × 1 cell. The computational cost
of the 3 × 3 filters is lower - a stack of 3 such filters over C channels requires 3 ∗ (32 C) weights
(not including bias weights), while one 7 × 7 filter would incur a higher cost of 72 C learned
weights. Deeper, more narrow networks can introduce more non-linearities than shallower,
wider networks due to the repeated composition of activation functions.
1 VGG stands for the “Visual Geometry Group” at Oxford where this was developed.
10.1. CONVOLUTIONAL NEURAL NETS 151
Also codenamed as “Inception.”2 Published in CVPR 2015 as “Going Deeper with Convolutions.”
Key characteristics:
• Deeper than previous networks (22 layers), but more computationally efficient (5 million
parameters - no fully connected layers).
• Network is composed of stacked sub-networks called “Inception modules.” The naive Incep-
tion module (a) runs convolutional layers in parallel and concatenates the filters together.
However, this can be computationally inefficient. The dimensionality reduction Inception
module (b) performs 1 × 1 convolutions that act as dimensionality reduction. This lowers the
computational cost and makes it tractable to stack many Inception modules together.
Figure 10.8: Building block for the ResNet from “Deep Residual Learning for Image Recognition,” CVPR
2016. If the desired function to be learned is H(x), we instead learn the residual F(x) := H(x) − x, so the
output of the network is F(x) + x = H(x).
Key characteristics:
2 “In this paper, we will focus on an efficient deep neural network architecture for computer vision, codenamed Inception,
which derives its name from the Network in network paper by Lin et al [12] in conjunction with the famous we need to go
deeper internet meme [1].” The authors seem to be meme-friendly.
152 CHAPTER 10. DEEP LEARNING
• Very deep (152 layers). Residual blocks (Figure 10.8) are stacked together - each individual
weight layer in the residual block is implemented as a 3 × 3 convolution. There are no FC
layers until the final layer.
• Residual blocks solve the “vanishing gradient” problem: the gradient signal diminishes in
layers that are farther away from the end of the network. Let L be the loss, Y be the output
at a layer, x be the input. Regular neural networks have gradients that look like
∂L ∂L ∂Y
=
∂x ∂Y ∂x
but the derivative of Y with respect to x can be small. If we use a residual block where
Y = F (x) + x, we have
∂Y ∂F (x)
= +1
∂x ∂x
The +x term in the residual block always provides some default gradient signal so the signal
is still backpropagated to the front of the network. This allows the network to be very deep.
To conclude this section, we note that the winning ImageNet architectures have all increased in
depth over the years. While both shallow and deep neural networks are known to be universal
function approximators, there is growing empirical and theoretical evidence that deep neural net-
works can require fewer (even exponentially fewer) parameters than shallow nets to achieve the
same approximation performance. There is also evidence that deep neural networks possess better
generalization capabilities than their shallow counterparts. The performance, generalization, and
optimization benefits of adding more layers is an ongoing component of theoretical research.
We know that a convolutional net learns features, but these may not be directly useful to visualize.
There are several methods available that enable us to better understand what convolutional nets
actually learn. These include:
• Visualizing filters - can give an idea of what types of features the network learns, such as
edge detectors. This only works in the first layer. Visualizing activations - can see sparsity
in the responses as the depth increases. One can also visualize the feature map before a
fully connected layer by conducting a nearest neighbor search in feature space. This helps
to determine if the features learned by the CNN are useful - for example, in pixel space, an
elephant on the left side of the image would not be a neighbor of an elephant on the right side
of the image, but in a translation-invariant feature space these pictures might be neighbors.
• Reconstruction by deconvolution - isolate an activation and reconstruct the original image
based on that activation alone to determine its effect.
• Activation maximization - Hubel and Wiesel’s experiment, but computationally
• Saliency maps - find what locations in the image make a neuron fire
• Code inversion - given a feature representation, determine the original image
• Semantic interpretation - interpret the activations semantically (for example, is the CNN
determining whether or not an object is shiny when it is trying to classify?)