Lecture 1, Part 3: Training a Classifier
Roger Grosse
1 Introduction
Now that we’ve defined what binary classification is, let’s actually train a
classifier. We’ll approach this problem in much the same way as we did
linear regression: define a model and a cost function, and minimize the
cost using gradient descent. The one thing that makes the classification
case harder is that it’s not obvious what loss function to use. We can’t
just use the classification error itself, because the gradient is zero almost
everywhere! Instead, we’ll define a surrogate loss function, i.e. an alternative
loss function which is easier to optimize.
1.1 Learning Goals
• Understand why classification error and squared error are problematic
cost functions for classification.
• Know what cross-entropy is and understand why it can be easier to
optimize than squared error (assuming a logistic activation function).
• Be able to derive the gradient descent updates for all of the models
and cost functions mentioned in this lecture and to implement the
learning algorithms in Python.
• Know what hinge loss is, and how it relates to cross-entropy loss.
• Understand how binary logistic regression can be generalized to mul-
tiple variables.
• Know what it means for a function to be convex, how to check con-
vexity visually, and why it’s important for optimization.
– Algebraically proving functions to be convex is beyond the scope
of this course.
• Know how to check the correctness of gradients using finite differences.
2 Choosing a cost function
Recall the setup from the previous lecture. We have a set of training ex-
amples {(x(i) , t(i) )}N
i=1 , where the x
(i) are vector-valued inputs, and t(i) are
binary-valued targets. We would like to learn a binary linear classifier,
where we compute a linear function of x(i) and threshold it at zero:
1 if w> x > 0
y= (1)
0 otherwise.
1
In the last lecture, our goal was to correctly classify every training exam-
ple. But this might be impossible if the dataset isn’t linearly separable.
Even if it’s possible to correctly classify every training example, it may be
undesirable since then we might just overfit!
How can we define a sensible learning criterion when the dataset isn’t
linearly separable? One natural criterion is to minimize the number of mis-
classified training examples. We can formalize this with the classification
error loss, or the 0-1 loss:
0 if y = t
L0−1 (y, t) = (2)
1 otherwise.
As always, the cost function is just the loss averaged over the training
examples; in this case, that corresponds to the error rate, or fraction of
misclassified examples. How do we make this small?
2.1 Attempt 1: 0-1 loss
As a first attempt, let’s try to minimize 0-1 loss directly. In order to compute
the gradient descent updates, we need to compute the partial derivatives
∂L0−1 /∂wj . Rather than mechanically deriving this derivative, let’s think
about what it means. It means, how much does L0−1 change if you make a
very small change to wj ? As long as we’re not on the classification boundary,
making a small enough change to wj will have no effect on L0−1 , because
the prediction won’t change. This implies that ∂L0−1 /∂wj = 0, as long as
we’re not on the boundary. Gradient descent will go nowhere. (If we are on
the boundary, the cost is discontinuous, which certainly isn’t any better!)
OK, we certainly can’t optimize 0-1 loss with gradient descent. Near the end of the course, when
we discuss reinforcement learning,
we’ll see an algorithm which can
2.2 Attempt 2: linear regression minimize 0-1 loss directly. It’s
nowhere near as efficient as
Since that didn’t work, let’s try using something we already know: linear gradient descent, though, so we
regression. Recall that this assumes a linear model and the squared error still need the techniques of this
loss function: lecture!
y = w> x + b (3)
1
LSE (y, t) = (y − t)2 (4)
2
We’ve already seen two ways of optimizing this: gradient descent, and a
closed-form solution. But does it make sense for classification? One obvious
problem is that the predictions are real-valued rather than binary. But
that’s OK, since we can just pick some scheme for binarizing them, such
as thresholding at y = 1/2. When we replace a loss function we trust with
another one we trust less but which is easier to optimize, the replacement
one is called a surrogate loss function.
But there’s still a problem. Suppose we have a positive example, i.e. t =
1. If we predict y = 1, we get a cost of 0, whereas if we make the wrong
prediction y = 0, we get a cost of 1/2; so far, so good. But suppose we’re
really confident that this is a positive example, and predict y = 9. Then we
pay a cost of 12 (9 − 1)2 = 32. This is far higher than the cost for y = 0, so
the learning algorithm will try very hard to prevent this from happening.
2
That’s not bad in itself, but it means that something else might need to be
sacrificed, if it’s impossible to match all of the targets exactly. Perhaps the
sacrifice will be that it incorrectly classifies some other training examples.
2.3 Attempt 3: logistic nonlinearity
The problem with linear regression is that the predictions were allowed to
take arbitrary real values. But it makes no sense to predict anything smaller
than 0 or larger than 1. Let’s fix this problem by applying a nonlinearity, If you predict y > 1, then
or activation function, which squashes the predictions to be between 0 regardless of the target, you can
decrease the loss by setting y = 1.
and 1. In particular, we’ll use something called the logistic function: Similarly for y < 0.
1
σ(z) = . (5)
1 + e−z
This is a kind of sigmoidal, or S-shaped, function:
What’s important about this function is that it increases monotonically,
with asymptotes at 0 and 1. (Plus, it’s smooth, so we can compute deriva-
tives.) Another advantage of the logistic
We refine the model as follows: function is that calculations tend
to work out very nicely.
z = w> x + b (6)
y = σ(z) (7)
1
LSE (y, t) = (y − t)2 . (8)
2
Notice that this model solves the problem we observed with linear regression.
As the predictions get more and more confident on the correct answer, the
loss continues to decrease.
To derive the gradient descent updates, we’ll need the partial derivatives
of the cost function. We’ll do this by applying the Chain Rule twice: first
to compute dLSE /dz, and then again to compute ∂LSE /∂wj . But first, let’s
note the convenient fact that This is equivalent to the elegant
identity σ 0 (z) = σ(z)(1 − σ(z)).
∂y e−z
=
∂z (1 + e−z )2
= y(1 − y). (9)
3
Figure 1: Visualization of derivatives of squared error loss with logistic
nonlinearity, for a training example with t = 1. The derivative dJ /dz
corresponds to the slope of the tangent line.
Now for the Chain Rule:
dLSE dLSE dy
=
dz dy dz
= (y − t)y(1 − y) (10)
∂LSE dLSE ∂z
=
∂wj dz ∂wj
dLSE
= · xj . (11)
dz
Done! Why don’t we go one step further and plug Eqn. 10 into Eqn. 11? At this point, you should stop and
The reason is that our goal isn’t to compute a formula for ∂LSE /∂wj ; as sanity check the equations we just
derived, e.g. checking that they
computer scientists, our goal is to come up with a procedure for computing have the sign that they ought to.
it. The two formulas above give us a procedure which we can implement Get in the habit of doing this.
directly in Python. One advantage of doing it this way is that we can reuse
some of the work we’ve done in computing the derivative with respect to
the bias: Reusing computation of derivatives
is one of the main insights behind
dLSE dLSE ∂z backpropagation, one of the
= central algorithms in this course.
db dz ∂b
dLSE
= (12)
dz
If we had expanded out the entire formula, it might not be obvious to us
that we can reuse computation like this.
So far, so good. But there’s one hitch. Let’s suppose you classify one
of the training examples extremely wrong, e.g. you confidently predict a
negative label with z = −5, which gives y ≈ 0.0067, for a positive example
(i.e. t = 1). Plugging these values into Eqn 10, we find that ∂LSE /∂z ≈
−0.0066. This is a pretty small value, considering how big the mistake
was. As shown in Figure 1, the more confident the wrong prediction, the
smaller the gradient is! The most badly misclassified examples will have
hardly any effect on the training. This doesn’t seem very good. We say
the learning algorithm does not have a strong gradient signal for those
training examples.
4
Figure 2: Plot of cross-entropy loss as a function of the input z to the
logistic activation function.
The problem with squared error loss in the classification setting is that
it doesn’t distinguish bad predictions from extremely bad predictions. If Think about how the argument in
t = 1, then a prediction of y = 0.01 has roughly the same squared-error this paragraph relates to the one
in the previous paragraph.
loss as a prediction of y = 0.00001, even though in some sense the latter is
more wrong. This isn’t necessarily a problem in terms of the cost function
itself: whether 0.00001 is inherently much worse than 0.01 depends on the
situation. (If all we care about is classification error, they’re essentially
equivalent.) But from the perspective of optimization, the fact that the
losses are nearly equivalent is a big problem. If we can increase y from Actually, the effect discussed here
0.00001 to 0.0001, that means we’re “getting warmer,” but this doesn’t can also be beneficial, because it
makes the algorithm robust, in that
show up in the squared-error loss. We’d like a loss function which reflects it can learn to ignore mislabeled
our intuitive notion of getting warmer. examples. Cost functions like this
are sometimes used for this reason.
However, when you do use it, you
2.4 Final touch: cross-entropy loss should be aware of the
optimization difficulties it creates!
The problem with squared-error loss is that it treats y = 0.01 and y =
0.00001 as nearly equivalent (for a positive example). We’d like a loss
function which makes these very different. One such loss function is cross-
entropy (CE). This is defined as follows: You’ll sometimes see cross-entropy
abbreviated XE.
− log y if t = 1
LCE (y, t) = (13)
− log 1 − y if t = 0
In our earlier example, we see that LCE (0.01, 1) = 4.6, whereas LCE (0.00001, 1) =
11.5, so cross-entropy treats the latter as much worse than the former.
When we do calculations, it’s cumbersome to use the case notation, so
we instead rewrite Eqn. 13 in the following form. You should check that
they are equivalent:
LCE (y, t) = −t log y − (1 − t) log 1 − y. (14)
Remember, the logistic function squashes y to be between 0 and 1, but See if you can derive the equations
cross-entropy draws big distinctions between probabilities close to 0 or 1. for the asymptote lines.
Interestingly, these effects cancel out: Figure 2 plots the loss as a function
of z. You get a sizable gradient signal even when the predictions are very
wrong.
5
When we combine the logistic activation function with cross-entropy
loss, you get logistic regression:
z = w> x + b
y = σ(z) (15)
LCE = −t log y − (1 − t) log 1 − y.
Now let’s compute the derivatives. We’ll do it two different ways: the
mechanical way, and the clever way. Let’s do the mechanical way first, as
an example of the chain rule for derivatives. Remember, our job here isn’t
to produce a formula for the derivatives, the way we would in calculus class.
Our job is to give a procedure for computing the derivatives which we could
translate into NumPy code. The following does that: The second step of this derivation
uses Eqn. 9.
dLCE t 1−t
=− +
dy y 1−y
dLCE dLCE dy
=
dz dy dz
dLCE
= · y(1 − y) (16)
dy
∂LCE dLCE ∂z
=
∂wj dz ∂wj
dLCE
= · xj
dz
This can be translated directly into NumPy (exercise: how do you vec-
torize this?). If we were good little computer scientists, we would stop here.
But today we’re going to be naughty computer scientists and break the
abstraction barrier between the activation function (logistic) and the cost
function (cross-entropy).
2.5 Logistic-cross-entropy function
There’s a big problem with Eqns. 15 and 16: what happens if we have a
positive example (t = 1), but we confidently classify it as a negative example
(z 0, implying y ≈ 0)? This is likely to happen at the very beginning
of training, so we should be able to handle it. But if y is small enough, it
could be smaller than the smallest floating point value, i.e. numerically
zero. Then when we compute the cross-entropy, we take the log of 0 and
get −∞. Or if this doesn’t happen, think about Eqn. 16. Since y appears
in the denominator, dLCE /dy will be extremely large in magnitude, which
again can cause numerical difficulties. These bugs are very subtle, and can
be hard to track down if you don’t expect them.
What we do instead is combine the logistic function and cross-entropy
loss into a single function, which we term logistic-cross-entropy:
LLCE (z, t) = LCE (σ(z), t) = t log(1 + e−z ) + (1 − t) log(1 + ez ) (17)
This still isn’t numerically stable if we implement it directly, since ez could
blow up. But most scientific computing environments provide a numerically
6
Figure 3: Comparison of the loss functions considered so far.
stable log-sum-exp routine.1 In numpy, this is np.logaddexp. So the
following code would compute the logistic-cross-entropy:
E = t * np.logaddexp(0, -z) + (1-t) * np.logaddexp(0, z)
Now to compute the loss derivatives:
dLLCE d
t log(1 + e−z ) + (1 − t) log(1 + ez )
=
dz dz
e−z ez
= −t · + (1 − t) (18)
1 + e−z 1 + ez
= −t(1 − y) + (1 − t)y
=y−t
This is like magic! We took a somewhat complicated formula for the logistic This isn’t a coincidence. The
activation function, combined it with a somewhat complicated formula for reason it happens is beyond the
scope of this course, but if you’re
the cross-entropy loss, and wound up with a stunningly simple formula for curious, look up “generalized
the loss derivative! Observe that this is exactly the same formula as for linear models.”
dLSE /dy in the case of linear regression. And it has the same intuitive
interpretation: if y > t, you made too positive a prediction, so you want
to shift your prediction in the negative direction. Conversely, if y < t, you
want to shift your prediction in the positive direction.
2.6 Another alternative: hinge loss
Another loss function you might encounter is hinge loss. Here, y is a real
value, and t ∈ {−1, 1}.
LH (y, t) = max(0, 1 − ty)
Hinge loss is plotted in Figure 3 for a positive example. One useful property
of hinge loss is that it’s an upper bound on 0–1 loss; this is a useful property
1
The log-sum-exp trick is pretty neat. https://hips.seas.harvard.edu/blog/2013/
01/09/computing-log-sum-exp/
7
for a surrogate loss function, since it means that if you make the hinge loss
small, you’ve also made 0–1 loss small. A linear model with hinge loss is
known as a support vector machine (SVM):
y = w> x + b (19)
LH = max(0, 1 − ty) (20)
If you take CSC411, you’ll learn a lot about SVMs, including their statis-
tical motivation, how to optimize them efficiently and how to make them
nonlinear (using something called the “kernel trick”). But you already know
one optimization method: you already know enough to derive the gradient
descent updates for an SVM.
Interestingly, even though SVMs came from a different community and
had a different sort of motivation from logistic regression, the algorithms
behave very similarly in practice. The reason has to do with the loss func-
tions. Figure 3 compares hinge loss to cross-entropy loss; even though cross-
entropy is smoother, the asymptotic behavior is the same, suggesting the
loss functions are basically pretty similar.
All of the loss functions covered so far is shown in Figure 3. Take the
time to review them, to understand their strengths and weaknesses.
3 Multiclass classification
So far we’ve talked about binary classification, but most classification prob-
lems involve more than two categories. Fortunately, this doesn’t require any
new ideas: everything pretty much works by analogy with the binary case.
The first question is how to represent the targets. We could represent them
as integers, but it’s more convenient to use a one-hot vector, also called
a one-of-K encoding:
t = (0, . . . , 0, 1, 0, . . . , 0) (21)
| {z }
entry k is 1
Now let’s design the whole model by analogy with the binary case.
First of all, consider the linear part of the model. We have K outputs
and D inputs. To represent a linear function, we’ll need a K × D weight
matrix, as well as a K-dimensional bias vector. We first compute the
intermediate quantities as follows:
z = Wx + b. (22)
This is the general form of a linear function from RD to RK .
Next, the activation function. We saw that the logistic function was a
good thing to use in the binary case. There’s a multivariate generalization
of the logistic function called the softmax function: Try plugging in K = 2 to figure
out how the softmax relates to the
ezk logistic function.
yk = softmax(z1 , . . . , zK )k = P zk 0
(23)
k0 e
Importantly, the outputs of the softmax function are nonnegative and sum
to 1, so they can be interpreted as a probability distribution over the K
8
classes (just like the output of the logistic could be interpreted as a prob-
ability). The inputs to the softmax are called the logits. Note that when Think about the logits as the
one of the zk ’s is much larger than the others, the output of the softmax “log-odds”, because when you
exponentiate them you get the
will be approximately the argmax, in the one-hot encoding. Hence, a more odds ratios of the probabilities.
accurate name might be “soft-argmax.”
Finally, the loss function. Cross-entropy can be generalized to the
multiple-output case: You’ll sometimes see σ(z) used to
denote the softmax function, by
K analogy with the logistic. But in
this course, it will always denote
X
LCE (y, t) = − tk log yk
the logistic function.
k=1
>
= −t (log y).
Here, log y represents the elementwise log. Note that only one of the tk ’s is
1 and the rest are 0, so the summation has the effect of picking the relevant
entry of the vector log y. (See how convenient the one-hot notation is?)
Note that this loss function only makes sense for predictions which sum to Try plugging in K = 2 to see how
1; if you eliminate that constraint, you could trivially minimize the loss by this relates to binary cross-entropy.
making all the yk ’s large.
When we put these things together, we get multiclass logistic regres-
sion, or softmax regression:
z = Wx + b
y = softmax(z)
LCE = −t> (log y)
We won’t go through the derivatives in detail, but it basically works out
exactly like logistic regression. The softmax and cross-entropy functions
interact nicely with each other, so we always combine them into a single
softmax-cross-entropy function LSCE for purposes of numerical stability.
The derivatives of LSCE have the same elegant formula we’ve been seeing
repeatedly, except this time remember that t and y are both vectors:
∂LSCE
=y−t (24)
∂z
Softmax regression is an elegant learning algorithm which can work very
well in practice.
4 Convex Functions
An important criterion we often use to compare different loss functions is
convexity. Recall that a set S is convex if the line segment connecting any
two points in S lies entirely within S. Mathematically, this means that for
x0 , x1 ∈ S,
(1 − λ)x0 + λx1 ∈ S for 0 ≤ λ ≤ 1.
The definition of a convex function is closely related. A function f is
convex if for any x0 , x1 in the domain of f ,
f ((1 − λ)x0 + λx1 ) ≤ (1 − λ)f (x0 ) + λf (x1 ) (25)
9
Figure 4: Left: Definition of convexity. Right: Proof-by-picture that if
the model is linear and L is a convex function of z = w> x + b, then it’s also
convex as a function of w and b.
This is shown graphically in Figure 4. Another way to phrase this require-
ment is that the line segment connecting any two points on the graph of f
must lie above the graph of f . Equivalently, the set of points lying above
the graph of f must be a convex set. Intuitively, convex functions are bowl-
shaped.
Convexity is a really important property from the standpoint of opti-
mization. There are two main reasons for this:
1. All critical points are global minima, so if you can set the derivatives
to zero, you’ve solved the problem.
2. Gradient descent will always converge to the global minimum.
We’ll talk in more detail in a later lecture about what can go wrong when the
cost function is not convex. Look back at our comparison of loss functions
in Figure 3. You can see visually that squared error, hinge loss, and the
logistic regression objective are all convex; 0–1 loss, and logistic-with-least-
squares are not convex. It’s not a coincidence that the loss functions we
might actually try to optimize are the convex ones. There is an entire field
of research on convex optimization, which comes up with better ways to
minimize convex functions over convex sets, as well as ways to formulate
various kinds of problems in terms of convex optimization.
Note that even though convexity is important, most of the optimization
problems we’ll consider in this course will be non-convex, because training
a deep neural network is a non-convex problem, even when the loss func-
tion is convex. Nonetheless, convex loss functions somehow still tend to be
advantageous from the standpoint of optimization.
5 Gradient Checking with Finite Differences
We’ve derived a lot of gradients so far. How do we know if they’re correct?
Fortunately, there’s an easy and effective procedure for testing them. Recall
10
Figure 5: Estimating a derivative using one-sided and two-sided finite dif-
ferences.
the definition of the partial derivative:
∂ f (x1 , . . . , xi + h, . . . , xN ) − f (x1 , . . . , xi , . . . , xN )
f (x1 , . . . , xN ) = lim
∂xi h→0 h
(26)
We can check the derivatives numerically by plugging in a small value of h,
such as 10−10 . This is known as the method of finite differences. You don’t want to implement your
It’s actually better to use the two-sided definition of the partial deriva- actual learning algorithm using
finite differences, because it’s very
tive than the one-sided one, since it is much more accurate: slow, but it’s great for testing.
∂ f (x1 , . . . , xi + h, . . . , xN ) − f (x1 , . . . , xi − h, . . . , xN )
f (x1 , . . . , xN ) = lim
∂xi h→0 2h
(27)
An example is shown in Figure 5 of estimating a derivative using the one-
sided and two-sided formulas.
Gradient checking is really important! In machine learning, your algo-
rithm can often seem to learn well even if the gradient calculation is totally
wrong. This might lead you to skip the correctness checks. But it might
work even better if the derivatives are correct, and this is important when
you’re trying to squeeze out the last bit of accuracy. Wrong derivatives can
also lead you on wild goose chases, as you make changes to your system
which appear to help significantly, but actually are only helping because
they compensate for errors in the gradient calculations. If you implement
derivatives by hand, gradient checking is the single most important thing
you need to do to get your algorithm to work well.
11