Notes CCE 577
SPRING 2019
TABLE OF CONTENTS

2 CLASSIFICATION
2.1 Logistic Regression
2.2 Generative Learning Algorithms
2.2.1 Gaussian Discriminant Analysis
2.3 Naive Bayes Algorithm
2.3.1 The Naive Bayes Classifier
2.3.2 Estimation using Maximum Likelihood
2.3.3 Spam Filtering
2.4 Decision Trees
2.4.1 Decision Tree Representation
2.4.2 Decision Tree Problem Setting
2.4.3 The Iterative Dichotomy 3 (ID3) Learning Algorithm
2.4.4 Which Attribute is the Best Classifier?
2.5 k-Nearest Neighbors (k-NN) Algorithm
2.5.1 Do As Your Neighbor Does
2.5.2 k-Nearest Neighbors
3 REGULARIZATION
3.1 Generalization in Machine Learning
3.1.1 Overfitting and Underfitting
3.1.2 Capacity
3.2 Hypothesis Evaluation: Training Error and Testing Error
3.2.1 Bias/Variance Tradeoff
3.3 Regularization
LIST OF FIGURES

1.1 Housing prices in USD versus their surface areas in square feet.
1.3 Estimating a house of surface area 750 ft² using a second order polynomial.
1.4 The data set denoted by a cross refers to a benign tumor and the one denoted by a circle refers to a malignant tumor.
1.5 Classification example with two features: the size of the tumor and the age of the patient. The data set denoted by a cross refers to a benign tumor and the one denoted by a circle refers to a malignant tumor.
1.6 Fitting a line to separate the data set in a binary classification problem.
1.7 Cocktail party problem, where each microphone picks up two simultaneous speeches. The cocktail party algorithm separates the mixed speeches into the two original speeches with no prior information about the sources.
2.1 A plot of a dataset giving the living areas and prices of 47 houses in a certain town.
2.2 A plot of a dataset giving the living areas and prices of 47 houses in a certain town, using hθ(x) = 50 + 0.06x to predict the housing prices.
2.5 Starting from a random location (θ0, θ1) and following the gradient descent.
2.7 Choosing another initial location and applying the gradient descent algorithm leads to another local minimum.
2.8 Geometrical intuition of the convergence for the gradient descent algorithm.
2.3 A scenario showing that linear regression is not the right technique to be used for a classification problem.
2.7 Shown in the figure are the training set, as well as the contours of the two Gaussian distributions that have been fit to the data in each of the two classes. Note that the two Gaussians have contours that are the same shape and orientation, since they share a covariance matrix, but they have different mean vectors.
2.8 An example of a decision tree for the concept of PlayTennis. An example is classified by sorting it through the tree to the appropriate leaf node, then returning the classification associated with this leaf (in this case, Yes or No). This tree classifies Saturday mornings according to whether or not they are suitable for playing tennis.
2.10 The partially learned decision tree resulting from the first step of ID3.
2.11 An illustration of the decision boundaries of the 1-NN rule. The points depicted are the sample points, and the predicted label of any new point will be the label of the sample point in the center of the cell it belongs to. These cells are called a Voronoi tessellation of the space, and are drawn using the Euclidean distance. Other metrics will change the decision surface.
2.12 An illustration of the operation of k = 3-NN. Circles and squares denote the training points and diamonds the test points. Test point A will be assigned to the square class and B to the circle class.
3.1 The zig-zag line on the left panel is consistent over the blue and red training sample, but it is a complex separation surface that is not likely to generalize well to unseen data. In contrast, the decision surface on the right panel is simpler and might generalize better in spite of its misclassification of a few points of the training sample.
3.4 We fit three models to this example training set. The training data was generated synthetically, by randomly sampling x values and choosing y deterministically by evaluating a quadratic function. (Left) A linear function fit to the data suffers from underfitting: it cannot capture the curvature that is present in the data. (Center) A quadratic function fit to the data generalizes well to unseen points. It does not suffer from a significant amount of overfitting or underfitting. (Right) A polynomial of degree 9 fit to the data suffers from overfitting.
3.5 Typical relationship between capacity and error. Training and test error behave differently. At the left end of the graph, training error and generalization error are both high. This is the underfitting regime. As we increase capacity, training error decreases, but the gap between training and generalization error increases. Eventually, the size of this gap outweighs the decrease in training error, and we enter the overfitting regime, where capacity is too large, above the optimal capacity.
3.6 We fit a high-degree polynomial regression model to our example training set. The true function is quadratic, but here we use only models with degree 9. We vary the amount of weight decay to prevent these high-degree models from overfitting. (Left) With very large λ, we can force the model to learn a function with no slope at all. This under-fits because it can only represent a constant function. (Center) With a medium value of λ, the learning algorithm recovers a curve with the right general shape. Even though the model is capable of representing functions with much more complicated shape, weight decay has encouraged it to use a simpler function described by smaller coefficients. (Right) With weight decay approaching zero (i.e., using the Moore-Penrose pseudo-inverse to solve the underdetermined problem with minimal regularization), the degree-9 polynomial overfits significantly, as we saw in Fig. 3.4.
3.7 In L2 regularization the regularizer term has a minimum of 0 and the error term has its own minimum; however, the optimum solution of the new regularized cost function lies in between.
3.8 A schematic display of 5-fold CV. A set of n observations is randomly split into five non-overlapping groups. Each of these fifths acts as a validation set (shown in beige), and the remainder as a training set (shown in blue). The test error is estimated by averaging the five resulting MSE estimates.
4.1 (a) Three possible hyperplanes used to separate the dataset. (b) The hyperplane with the maximum margin.
4.3 Margin and equations of the hyperplanes for a canonical maximum-margin hyperplane. The marginal hyperplanes are represented by dashed lines on the figure.
4.4 Non-linearly separable case. The classification task consists of discriminating between solid squares and solid circles. (a) No hyperplane can separate the two populations. (b) A non-linear mapping can be used instead.
4.5 Illustration of the XOR classification problem. The XOR problem is linearly non-separable in the input space.
4.6 Illustration of the XOR classification problem. The XOR problem is linearly separable in the feature space using a quadratic kernel.
CHAPTER 1
INTRODUCTION TO MACHINE LEARNING
Example 1.
Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to filter spam
better. Classify the following items as T , E and P.
Solution. 1 = E, 2 = T, 3 = P.
It is often claimed that machine learning can only discover correlations, not causal relationships. In fact, one of the most popular types of machine learning consists of trying out different actions and observing their consequences, which is the essence of causal discovery. For example, an e-commerce site can try many different ways of presenting a product and choose the one that leads to the most purchases. You have probably participated in thousands of these experiments without knowing it. Causal relationships can even be discovered in some situations where experiments are out of the question, and all the computer can do is look at past data.
Figure 1.1: Housing prices in USD versus their surface areas in square feet.
Assume we want to use the data to estimate the price of a house having a surface area of 750 square feet. How could a
learning algorithm help us with this task? One way of estimating the price is to try to fit a curve through the plotted data and
map a price that corresponds to the 750 ft 2 . The simplest curve that we could fit through the data would be a straight line as
seen in Figure. 1.2, but maybe a better one would be some second order polynomial as seen in Figure. 1.3. Obviously, fitting
the data with a second order polynomial gives a better estimate of the price of the house.
Figure 1.2: Estimating a house of surface area 750 ft 2 using a straight line.
This is an example of a supervised learning algorithm. The term "supervised" refers to the fact that we gave the algorithm data sets for which the "right answers" (the right price in this example) were given. The task of the algorithm is to produce more of these right answers. This specific kind of problem is referred to as a regression problem, where a continuous-valued output (the price) needs to be predicted.
Figure 1.3: Estimating a house of surface area 750 ft 2 using a second order polynomial.
[Plot: tumor label, 1 (Y) or 0 (N), versus tumor size.]
Figure 1.4: The data set denoted by a cross refers to a benign tumor and the one denoted by a circle refers to a malignant tumor.
The machine learning task is to estimate the probability that a breast tumor of a certain size, one which does not belong to the collected data set, is benign or malignant. This sort of example is known as a classification problem. The term "classification" refers to the fact that we are trying to predict a discrete-valued output (0 or 1). In general, classification problems can have more than two outputs. As a concrete example, we may want to predict one of the following discrete-valued outputs: 0, 1, 2, 3, where 1, 2 and 3 refer respectively to individual types of breast cancer.
In some other classification problems, we might have more than one “feature” or “attribute”. Assume that our data set
consists of two features: the size of the tumor and the age of the patient. See Figure. 1.5.
[Plot: patient age versus tumor size, showing two classes of points.]
Figure 1.5: Classification example with two features: the size of the tumor and the age of the patient. The data set denoted by a cross refers to a benign tumor and the one denoted by a circle refers to a malignant tumor.
Given a certain patient with a given age and a given tumor size, the objective of a machine learning algorithm is to predict whether the tumor is benign or malignant. What the learning algorithm may do is separate the data by fitting a straight line. Now if the data pertaining to the patient falls below the line, as seen in Figure. 1.6, then the patient is more likely to have a benign tumor, and if the data falls above the line, then the patient is more likely to have a malignant tumor.
In general, this kind of classification problem has more features, like clump thickness, uniformity of cell size, uniformity of cell shape, etc. In practice, we would like to have a countably infinite number of features, so that the machine learning algorithm has more attributes to help it make more accurate predictions. But how do we deal with and store an infinite number of features on a computer? It turns out that there is an algorithm known as the Support Vector Machine which uses a neat mathematical trick to accomplish such a mission.
[Plot: patient age versus tumor size, with a separating straight line.]
Figure 1.6: Fitting a line to separate the data set in a binary classification problem.
Example 2. You’re running a company, and you want to develop learning algorithms to address each of two problems.
• Problem 1: You would like your software to examine individual customer accounts, and for each account decide if it has
been hacked/compromised.
• Problem 2: You have a large inventory of identical items. You want to predict how many of these items will sell over the
next three months.
Which one of these problems is a linear regression problem and which one is a classification problem?
Remark 1. In statistics, a fit refers to how well you approximate a target function. This is good terminology to use in machine
learning, because supervised machine learning algorithms seek to approximate the unknown underlying mapping function for
the output variables given the input variables. Statistics often describe the goodness of fit which refers to measures used to
estimate how well the approximation of the function matches the target function. Some of these methods are useful in machine
learning (e.g. calculating the residual errors), but some of these techniques assume we know the form of the target function
we are approximating, which is not the case in machine learning. If we knew the form of the target function, we would use it
directly to make predictions, rather than trying to learn an approximation from samples of noisy training data.
Figure 1.7: Cocktail party problem, where each microphone picks up two simultaneous speeches. The Cocktail party algorithm
separates the mixed speeches into the two original speeches with no prior information about the sources.
Supervised learning involves observing several examples of a random vector x and an associated value or vector y, and learning to predict y from x, usually by estimating p(y|x). The term supervised learning originates from the view of the target y being provided by an instructor or teacher who shows the machine learning system what to do. In unsupervised learning, there is no instructor or teacher, and the algorithm must learn to make sense of the data without this guide.
Unsupervised learning and supervised learning are not formally defined terms. The lines between them are often blurred.
Many machine learning technologies can be used to perform both tasks. For example, the chain rule of probability states that
for a vector x ∈ R^n, the joint distribution can be decomposed as
$$p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})$$
This decomposition means that we can solve the ostensibly unsupervised problem of modeling p(x) by splitting it into n supervised learning problems. Alternatively, we can solve the supervised learning problem of learning p(y|x) by using traditional unsupervised learning technologies to learn the joint distribution p(x, y) and inferring
$$p(y \mid x) = \frac{p(x, y)}{\sum_{y'} p(x, y')}$$
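As a tiny illustration of the second direction (the joint distribution below is made up for the example), p(y|x) can be read off a learned joint p(x, y) exactly as in the formula above:

```python
import numpy as np

# Made-up joint distribution p(x, y) over x in {0, 1, 2} (rows) and y in {0, 1} (columns)
p_xy = np.array([[0.10, 0.05],
                 [0.20, 0.15],
                 [0.10, 0.40]])
assert np.isclose(p_xy.sum(), 1.0)

# p(y | x) = p(x, y) / sum_{y'} p(x, y'), computed row by row
p_y_given_x = p_xy / p_xy.sum(axis=1, keepdims=True)
print(p_y_given_x)   # each row sums to 1
```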
Though unsupervised learning and supervised learning are not completely formal or distinct concepts, they do help to roughly
categorize some of the things we do with machine learning algorithms. Traditionally, people refer to regression, classification
and structured output problems as supervised learning. Density estimation in support of other tasks is usually considered
unsupervised learning.
Other variants of the learning paradigm are possible. For example, in semi-supervised learning, some examples include a
supervision target but others do not. In multi-instance learning, an entire collection of examples is labeled as containing or
not containing an example of a class, but the individual members of the collection are not labeled. Some machine learning
algorithms do not just experience a fixed dataset. For example, reinforcement learning algorithms interact with an environment,
so there is a feedback loop between the learning system and its experiences.
CHAPTER 2
Our definition of a machine learning algorithm as an algorithm that is capable of improving a computer program's performance at some task via experience is somewhat abstract. To make this more concrete, we present an example of a simple machine learning algorithm: linear regression.
As the name implies, linear regression solves a regression problem. In other words, the goal is to build a system that can take a vector x ∈ R^n as input and predict the value of a scalar y ∈ R as its output. The output of linear regression is a linear function of the input. Let ŷ be the value that our model predicts y should take on. We define the output to be
$$\hat{y} = w^T x \qquad (2.1)$$
where w ∈ R^n is a vector of "parameters". Parameters are values that control the behavior of the system. In this case, wi is the coefficient that we multiply by feature xi before summing up the contributions from all the features. We can think of w as a set of weights that determine how each feature affects the prediction. If a feature xi receives a positive weight wi, then increasing the value of that feature increases the value of our prediction ŷ. If a feature receives a negative weight, then increasing the value of that feature decreases the value of our prediction. If a feature's weight is large in magnitude, then it has a large effect on the prediction. If a feature's weight is zero, then it has no effect on the prediction.
We thus have a definition of our task T: to predict y from x by outputting ŷ = w^T x. What remains to be specified is a definition of our performance measure, P. This will be done subsequently.
Consider again the motivating example of predicting housing prices. A dataset is collected and plotted in Figure. 2.1, where we have collected the "right answers" for 47 houses. Assume we want to predict the price of a house (i.e., predict a real-valued output) having an area of 1250 ft². As we have seen before, we can try to fit a straight line through the data and predict a price.
Figure 2.1: A plot of a dataset giving the living areas and prices of 47 houses in a certain town.
More formally, in the language of supervised learning, we have a "training set" of housing prices, and the task is to learn from this data how to predict the prices of houses. To establish notation for future use, we will use x^(i) to denote the "input" variables (living area in this example), also called "input features", and y^(i) to denote the "output" or "target" variable that we are trying to predict (price). A pair (x^(i), y^(i)) is called a "training example", and the dataset that we will be using to learn (a list of m training examples {(x^(i), y^(i)); i = 1, ..., m}) is called a "training set". Note that the superscript (i) in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use X to denote the space of input values, and Y the space of output values. In this example, X = Y = R. Remember that when the target variable that we are trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict whether a dwelling is a house or an apartment, say), we call it a classification problem.
To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function
h : X → Y so that h(x) is a “good” predictor for the corresponding value of y . For historical reasons, this function h is called
a “hypothesis”. Seen pictorially, the process is therefore like this:
To perform supervised learning, we must decide how we are going to represent functions/hypotheses h in a computer. As
an initial choice, let us say we decide to approximate y as a linear function of x, hence the hypothesis is expressed as
hθ (x) = θ0 + θ1 x (2.2)
Here, the θi ’s are the parameters (also called “weights”) parameterizing the space of linear functions mapping from X to Y.
When there is no risk of confusion, we will drop the θ subscript in hθ (x), and write it more simply as h(x). This kind of linear
regression is known as linear regression with one variable or univariate linear regression, where the variable here is x.
Now we need to investigate how to fit the best linear function to a given dataset, i.e., how to find the best θi's that will make the straight line fit the training set as well as possible. The idea is to choose θ0 and θ1 such that hθ(x) is close to y for the training examples (x, y) in our training set. Analytically, we need to solve the following minimization problem
$$\min_{\theta_0, \theta_1} \sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \qquad (2.3)$$
where
$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
is known as the "cost function" or the "squared error". There are other cost functions that can work pretty well in terms of fitting the data, but the squared error function is one of the most widely used for regression problems.
Example 3.
Consider the following dataset {(x, y)} = {(1, 2), (2, 4), (3, 6)}. Assume that the chosen hypothesis is hθ(x) = θ1 x. Compute the squared error function for θ1 = 0, 1, 2, 3. Which θ1 minimizes the squared error function?
Solution. In this case, hθ1(x) = θ1 x, so $J(\theta_1) = \frac{1}{6}\sum_{i=1}^{3}\left(\theta_1 x^{(i)} - y^{(i)}\right)^2$.
• For θ1 = 0, $J(0) = \frac{1}{6}\left(2^2 + 4^2 + 6^2\right) = 9.33$
• For θ1 = 1, $J(1) = \frac{1}{6}\left((1-2)^2 + (2-4)^2 + (3-6)^2\right) = 2.33$
• For θ1 = 2, $J(2) = \frac{1}{6}\left((2-2)^2 + (4-4)^2 + (6-6)^2\right) = 0$
• For θ1 = 3, $J(3) = \frac{1}{6}\left((3-2)^2 + (6-4)^2 + (9-6)^2\right) = 2.33$
Obviously, the optimal value for θ1 is θ1* = 2, as it leads to a squared error equal to 0.
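The values in Example 3 can be reproduced with a few lines of code (a minimal sketch, not part of the original notes):

```python
import numpy as np

# Toy dataset from Example 3
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
m = len(x)

def cost(theta1):
    """Squared-error cost J(theta1) = (1/2m) * sum((theta1*x - y)^2); with m = 3 this is the 1/6 factor."""
    return np.sum((theta1 * x - y) ** 2) / (2 * m)

for theta1 in [0, 1, 2, 3]:
    print(f"J({theta1}) = {cost(theta1):.2f}")
# Expected output: J(0) = 9.33, J(1) = 2.33, J(2) = 0.00, J(3) = 2.33
```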
Going back to the scenario of Figure. 2.1, we want to use the hypothesis described by eq. (2.2) to solve the linear regression
problem. Assuming θ0 = 50 and θ1 = 0.06, we obtain hθ (x) = 50 + 0.06x. The situation is illustrated in Figure. 2.2.
Figure 2.2: A plot of a dataset giving the living areas and prices of 47 houses in a certain town, using hθ(x) = 50 + 0.06x to predict the housing prices.
Plotting the curve of the squared error function as defined by (2.3), we get a three-dimensional surface plot with one local
minimum, as shown in Figure. 2.3. For a particular choice of θ0 and θ1 , we obtain a specific hypothesis hθ (x), and the height
of the corresponding point is the value of the error function.
For ease of explanation, we will use contour plots to illustrate the squared error function. See Figure. 2.4. Each of the
ellipses denote a set of points that take on the same value for J(θ0 , θ1 ). All points belonging to the same ellipse have the same
J(θ0 , θ1 ). The innermost ellipse corresponds to the minimum of J(θ0 , θ1 ).
Example 4. This example illustrates how a good choice of hypothesis corresponds to minimizing the cost function J(θ0, θ1).
Consider Figure.2.6(a), showing a hypothesis that apparently does not fit the data; hθ (x) = 360, i.e., θ0 = 360 and θ1 = 0.
This choice of hypothesis can be seen on the contour plots as a specific point x, which is pretty far from the minimum (the
midpoint of the inner contour). Figure.2.6(b) shows a hypothesis with θ0 = 800 and θ1 = −1.5; hθ (x) = 800 − 1.5x, which
also does not fit the data properly. This choice of hypothesis can be seen on the contour plots as a specific point x, which is
pretty far from the minimum. Notice that the hypothesis hθ (x) = 360 is a better hypothesis than hθ (x) = 800 − 1.5x, since
the x on the contour plots is closer to the minimum in the former case than in the latter case. Figure.2.6(c) shows a hypothesis
with θ0 = 250 and θ1 = 0.15; hθ (x) = 250 + 0.15x. This choice of hypothesis can be seen on the contour plots as a specific
point x, which is pretty close to the minimum.
In higher-dimensional problems with more features, an algorithm is needed in order to automatically find the choice of parameters θi that minimizes the cost function J(θ0, θ1, ..., θn). In the next section we derive such an algorithm.
Figure 2.5: Starting from a random location (θ0 , θ1 ) and following the gradient descent.
We will first present a graphical intuition of how the gradient descent algorithm works. In Figure. 2.5(a), we pick an initial guess for (θ0, θ1) (illustrated as a star point on the surface), and according to the gradient descent algorithm we need to look 360° around and see which direction corresponds to the steepest descent for a small step (going down the hill as quickly as possible). Assume that this direction leads to another star point, as shown in Figure. 2.5(b). We repeat the same process of looking for the steepest descent from the second point to a third point, and so on, until we reach a minimum value. The gradient descent algorithm has an interesting property: if we choose another initial guess for (θ0, θ1) which is not far from the original guess, and repeat the same procedure of following the steepest descent, we will reach a different local minimum. See Figure. 2.7. We will say more about this property later.
After the graphical intuition, we turn to the analytical interpretation of the gradient descent algorithm. The following is the definition of the Gradient Descent Algorithm.

Gradient Descent Algorithm
Repeat until convergence { θj := θj − α ∂J(θ0, θ1)/∂θj (for j = 0 and j = 1) }

According to this definition, we should repeatedly update the parameter θj by taking θj and subtracting from it α ∂J(θ0, θ1)/∂θj, where α > 0 is the "learning rate" that controls the size of the step that we take during the descent. If α is very large, this is equivalent to taking a very large step, while if α is very small, this is equivalent to taking a very small step. If α is too small, gradient descent can be slow, while if α is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge. The partial derivative term ∂J(θ0, θ1)/∂θj is the gradient of the cost function.
Figure 2.7: Choosing another initial location and applying the gradient descent algorithm leads to another local minimum.
Remark 2. Note that in applying the gradient descent algorithm, we need to simultaneously update θ0 and θ1, and this is an important subtlety in applying the algorithm. More concretely, the update should be carried out in the following manner:
$$t_0 := \theta_0 - \alpha\,\frac{\partial}{\partial\theta_0}J(\theta_0, \theta_1), \qquad t_1 := \theta_1 - \alpha\,\frac{\partial}{\partial\theta_1}J(\theta_0, \theta_1), \qquad \theta_0 := t_0, \qquad \theta_1 := t_1$$
Remark 3. In the gradient descent algorithm, as we approach a local minimum the algorithm will automatically take smaller steps. This is due to the fact that as we approach a local minimum the term ∂J(θ0, θ1)/∂θj gets smaller, becoming equal to zero once the local minimum is reached. Hence, there is no need to decrease the learning rate α over time.
Remark 4. While gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global optimum and no other local optima; thus gradient descent always converges (assuming the learning rate α is not too large) to the global minimum. Indeed, J is a convex quadratic function with a single global minimum.
Example 5.
Consider the following dataset {(x, y)} = {(1, 2), (2, 4), (3, 6)}. Assume that the chosen hypothesis is hθ(x) = θ1 x. Apply the gradient descent algorithm to find the optimal value of θ1, assuming an initial choice of θ1 = 0 and a learning rate α = 0.2.

Solution. The cost function is
$$J(\theta_1) = \frac{1}{6}\sum_{i=1}^{3}\left(\theta_1 x^{(i)} - y^{(i)}\right)^2 = \frac{1}{6}\left[(\theta_1 - 2)^2 + (2\theta_1 - 4)^2 + (3\theta_1 - 6)^2\right] = \frac{14\theta_1^2 - 56\theta_1 + 56}{6}$$
For α = 0.2, the update rule is θ1 := θ1 − 0.2 (28θ1 − 56)/6. Starting with θ1 = 0,
$$\theta_1 := 0 - 0.2\,\frac{-56}{6} = \frac{28}{15} \approx 1.867 \;\Rightarrow\; J(\theta_1) = \frac{28}{675} \approx 4.15\times 10^{-2}$$
$$\theta_1 := \frac{28}{15} - 0.2\,\frac{28\,(28/15) - 56}{6} = \frac{448}{225} \approx 1.991 \;\Rightarrow\; J(\theta_1) \approx 1.84\times 10^{-4}$$
$$\theta_1 := \frac{448}{225} - 0.2\,\frac{28\,(448/225) - 56}{6} = \frac{6748}{3375} \approx 1.9994 \;\Rightarrow\; J(\theta_1) \approx 8.19\times 10^{-7}$$
$$\theta_1 := \frac{6748}{3375} - 0.2\,\frac{28\,(6748/3375) - 56}{6} = \frac{101248}{50625} \approx 1.99996 \;\Rightarrow\; J(\theta_1) \approx 3.64\times 10^{-9}$$
Each update shrinks the distance to 2 by a factor of 1 − 0.2·(28/6) ≈ 0.067, so the iterates converge rapidly to the optimal value θ1 = 2, and the resulting hypothesis is hθ(x) = 2x, in agreement with Example 3.
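The same iteration in code (a short sketch reproducing the hand computation above):

```python
import numpy as np

# Dataset and settings from Example 5
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
m = len(x)
alpha = 0.2
theta1 = 0.0

def J(t):
    # J(theta1) = (1/2m) * sum((theta1 * x - y)^2); with m = 3 this is the 1/6 factor
    return np.sum((t * x - y) ** 2) / (2 * m)

for step in range(1, 6):
    grad = np.sum((theta1 * x - y) * x) / m   # dJ/dtheta1 = (28*theta1 - 56)/6 here
    theta1 -= alpha * grad                    # gradient descent update
    print(f"step {step}: theta1 = {theta1:.6f}, J = {J(theta1):.2e}")
# theta1 converges to 2, matching the hand computation.
```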
Returning to the gradient of the cost function, and using hθ(x^(i)) = θ0 + θ1 x^(i), we can write
$$\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1) = \frac{\partial}{\partial\theta_j}\,\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = \frac{1}{2m}\,\frac{\partial}{\partial\theta_j}\sum_{i=1}^{m}\left(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\right)^2 = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}, \quad (\text{for } j = 0 \text{ and } j = 1) \qquad (2.5)$$
where we use the convention x_0^(i) = 1 and x_1^(i) = x^(i). Substituting this into the gradient descent update gives

Repeat until convergence { θj := θj − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) x_j^(i) (for every j) }

called the LMS update rule (LMS stands for "least mean squares"), and also known as the Widrow-Hoff learning rule.
This rule has several properties that seem natural and intuitive. For instance, the magnitude of the update is proportional to the error term hθ(x^(i)) − y^(i); thus, if we encounter a training example on which our prediction nearly matches the actual value of y^(i), there is little need to change the parameters; in contrast, a larger change to the parameters will be made if our prediction hθ(x^(i)) has a large error (i.e., if it is very far from y^(i)). This method looks at every example in the entire training set (the summation term Σ_{i=1}^{m}) on every step, and is thus called Batch Gradient Descent.
There are other versions of gradient descent that are not batch versions (they do not look at the entire training set). For example, consider the following algorithm.

Stochastic Gradient Descent Algorithm
Loop { for i = 1 to m, { θj := θj − α (hθ(x^(i)) − y^(i)) · x_j^(i) (for every j) } }

In the Stochastic Gradient Descent algorithm, we repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only.
Whereas batch gradient descent has to scan through the entire training set before taking a single step−a costly operation if
m is large−stochastic gradient descent can start making progress right away, and continues to make progress with each example
it looks at. Often, stochastic gradient descent gets θ “close” to the minimum much faster than batch gradient descent. Note
however that it may never “converge” to the minimum, and the parameters θ will keep oscillating around the minimum of J(θ);
but in practice most of the values near the minimum will be reasonably good approximations to the true minimum. For these
reasons, particularly when the training set is large, stochastic gradient descent is often preferred over batch gradient descent.
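A minimal sketch of stochastic gradient descent for univariate linear regression (the synthetic data, learning rate, and number of epochs are illustrative assumptions, not values from the notes):

```python
import numpy as np

# Synthetic data generated by the line h(x) = 50 + 0.06*x (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 50 + 0.06 * x

theta0, theta1, alpha = 0.0, 0.0, 0.01
for epoch in range(50):
    for i in rng.permutation(len(x)):          # visit the examples in random order
        err = (theta0 + theta1 * x[i]) - y[i]  # error on this single training example
        theta0 -= alpha * err                  # update the parameters using it alone
        theta1 -= alpha * err * x[i]
print(theta0, theta1)                          # approaches (50, 0.06)
```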
Remark 5. While it is more common to run stochastic gradient descent as we have described it and with a fixed learning rate
α, by slowly letting the learning rate α decrease to zero as the algorithm runs, it is also possible to ensure that the parameters
will converge to the global minimum rather than merely oscillate around the minimum.
Remark 6. Reaching a point at which the gradient descent algorithm makes only very small changes in the objective (cost) function is called convergence. So how do we decide on the convergence of the gradient descent algorithm? We can imagine how gradient descent works by thinking of throwing a marble inside a bowl and taking snapshots. The marble will oscillate until friction stops it at the bottom. Now imagine that we are in an environment where friction is so small that the marble takes a long time to completely stop, so we can assume that when the oscillations are small enough the marble has reached the bottom (although it could continue oscillating). In Figure. 2.8 you can see the first eight steps (snapshots of the marble) of the gradient descent.
Figure 2.8: Geometrical intuition of the convergence for the gradient descent algorithm.
If we continue taking photos of the movement of the marble, we notice that the movement becomes less and less significant; this does not mean that the marble has necessarily reached the bottom of the bowl (the optimal result), but it is quite close to it, if not at it. The precision value can be chosen as the threshold at which consecutive iterations of GD are almost the same. For example, we can use the following precision rule:
grad(i) = 0.0001
grad(i + 1) = 0.000099989 (grad has changed by only about 0.01% in one iteration ⇒ STOP)
Such a precision rule is also referred to as a test of convergence, and choosing a precision threshold might not be an obvious task.
In general, we could have n features and m training examples, where each training example i is represented by an n-dimensional feature vector x^(i), and x_j^(i) is the value of the j-th feature in the i-th training example.
To perform supervised learning, we must decide how we are going to represent functions/hypotheses h in a computer. As an initial choice, let us say we decide to approximate y as a linear function of x:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \sum_{j=0}^{n}\theta_j x_j = \Theta^T x \qquad (2.8)$$
where the θj's are the parameters (weights) parameterizing the space of linear functions mapping from X to Y. When there is no risk of confusion, we will drop the θ subscript in hθ(x), and write it more simply as h(x). To simplify our notation, we have used the convention of letting x0 = 1. Note now that x and Θ are both (n + 1)-dimensional vectors.
We now address the question of how to apply the gradient descent algorithm in the context of multivariate linear regression. The corresponding cost function is given by
$$J(\Theta) = J(\theta_0, \theta_1, \ldots, \theta_n) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = \frac{1}{2m}\sum_{i=1}^{m}\left(\sum_{j=0}^{n}\theta_j x_j^{(i)} - y^{(i)}\right)^2 \qquad (2.9)$$
Gradient descent gives one way of minimizing J. Let us discuss a second way of doing so, this time performing the minimization explicitly
and without resorting to an iterative algorithm. In this method, we will minimize J by explicitly taking its derivatives with
respect to the θj ’s, and setting them to zero. To enable us to do this without having to write reams of algebra and pages full
of matrices of derivatives, let us introduce some notation for doing calculus with matrices.
We also introduce the "trace" operator, written "Tr". For an n-by-n (square) matrix A, the trace of A is defined to be the sum of its diagonal entries:
$$\operatorname{Tr} A = \sum_{i=1}^{n} A_{ii} \qquad (2.12)$$
If a is a real number (i.e., a 1-by-1 matrix), then Tr a = a. The trace operator has the property that for two matrices A and B such that AB is square, we have that Tr AB = Tr BA. (Check this yourself!) As corollaries of this, we also have, e.g.,
$$\operatorname{Tr} ABC = \operatorname{Tr} CAB = \operatorname{Tr} BCA$$
$$\operatorname{Tr} ABCD = \operatorname{Tr} DABC = \operatorname{Tr} CDAB = \operatorname{Tr} BCDA$$
The following properties of the trace operator are also easily verified. Here, A and B are square matrices, and a is a real number:
$$\operatorname{Tr} A = \operatorname{Tr} A^T \qquad (2.13)$$
$$\operatorname{Tr}(A + B) = \operatorname{Tr} A + \operatorname{Tr} B \qquad (2.14)$$
$$\operatorname{Tr} aA = a\operatorname{Tr} A \qquad (2.15)$$
We now state without proof some facts of matrix derivatives:
$$\nabla_A \operatorname{Tr} AB = B^T \qquad (2.16)$$
$$\nabla_{A^T} f(A) = \left(\nabla_A f(A)\right)^T \qquad (2.17)$$
$$\nabla_A \operatorname{Tr} ABA^T C = CAB + C^T AB^T \qquad (2.18)$$
$$\nabla_A |A| = |A|\left(A^{-1}\right)^T \qquad (2.19)$$
where (2.19) applies only to non-singular square matrices A, and |A| denotes the determinant of the matrix A.
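As an illustrative check (a sketch, not part of the original notes), the cyclic property of the trace and the identity (2.16) can be verified numerically with random matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))

# Cyclic property: Tr(AB) = Tr(BA) whenever AB is square
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))   # True

# Gradient fact (2.16): d/dA Tr(AB) = B^T, checked entrywise by finite differences
eps = 1e-6
grad = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        Ap = A.copy()
        Ap[i, j] += eps
        grad[i, j] = (np.trace(Ap @ B) - np.trace(A @ B)) / eps
print(np.allclose(grad, B.T, atol=1e-4))               # True
```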
Also, let y = [y^(1) y^(2) · · · y^(m)]^T be the vector of all the target values. Now, since hθ(x^(i)) = (x^(i))^T Θ, we can easily verify that
$$X\Theta - y = \begin{bmatrix}(x^{(1)})^T\Theta \\ (x^{(2)})^T\Theta \\ \vdots \\ (x^{(m)})^T\Theta\end{bmatrix} - \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)}\end{bmatrix} = \begin{bmatrix}h_\theta(x^{(1)}) - y^{(1)} \\ h_\theta(x^{(2)}) - y^{(2)} \\ \vdots \\ h_\theta(x^{(m)}) - y^{(m)}\end{bmatrix}$$
Thus, using the fact that for a vector z we have $z^T z = \sum_i z_i^2$,
$$\frac{1}{2}(X\Theta - y)^T(X\Theta - y) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
which is m J(Θ) for the cost defined in (2.9); since the constant factor m does not affect the minimizer, we can work with this quantity directly.
Finally, to minimize J, let us find the derivatives of this quantity with respect to Θ. Combining Eq. (2.17) and Eq. (2.18), we find that
$$\nabla_{A^T}\operatorname{Tr} ABA^T C = B^T A^T C^T + BA^T C \qquad (2.20)$$
Hence,
$$\begin{aligned}
\nabla_\Theta\,\frac{1}{2}(X\Theta - y)^T(X\Theta - y) &= \frac{1}{2}\nabla_\Theta\left(\Theta^T X^T X\Theta - \Theta^T X^T y - y^T X\Theta + y^T y\right)\\
&= \frac{1}{2}\nabla_\Theta\operatorname{Tr}\left(\Theta^T X^T X\Theta - \Theta^T X^T y - y^T X\Theta + y^T y\right)\\
&= \frac{1}{2}\nabla_\Theta\left(\operatorname{Tr}\Theta^T X^T X\Theta - 2\operatorname{Tr} y^T X\Theta\right) \qquad (y^T y\ \text{is a constant and its gradient is zero})\\
&= \frac{1}{2}\left(X^T X\Theta + X^T X\Theta - 2X^T y\right)\\
&= X^T X\Theta - X^T y
\end{aligned}$$
In the second step, we used the fact that the trace of a real number is just the real number; the third step used the fact that Tr A = Tr A^T; and the fourth step used Eq. (2.20) with A^T = Θ, B = B^T = X^T X and C = I, together with Eq. (2.16). To minimize J, we set this gradient to zero and obtain the "normal equations", whose solution is
$$\Theta = (X^T X)^{-1}X^T y \qquad (2.22)$$
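A numerical sketch of Eq. (2.22) with synthetic data (the numbers are illustrative assumptions, not the dataset from the notes):

```python
import numpy as np

# Synthetic data roughly following price = 50 + 0.06 * area (illustrative only)
rng = np.random.default_rng(0)
area = rng.uniform(500, 3000, size=47)
price = 50 + 0.06 * area + rng.normal(0, 5, size=47)

# Design matrix X with the convention x0 = 1 (intercept column)
X = np.column_stack([np.ones_like(area), area])

# Normal equations: Theta = (X^T X)^{-1} X^T y. Solving the linear system is
# numerically preferable to forming the explicit inverse.
Theta = np.linalg.solve(X.T @ X, X.T @ price)
print(Theta)   # roughly [50, 0.06]
```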
$$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$$
We can define x1 ≜ x, x2 ≜ x² and x3 ≜ x³, and cast the given hypothesis as a linear regression hypothesis
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3$$
and use the machinery of gradient descent as defined earlier for linear regression problems. Note that in such a case feature scaling becomes increasingly important.
CHAPTER 2
CLASSIFICATION
In this chapter we discuss the “classification problem”. This is just like the regression problem, except that the values y we
now want to predict take on only a small number of discrete values. For now, we will focus on the binary classification problem
in which y can take on only two values, 0 and 1. (Most of what we say here will also generalize to the multiple-class case.) For
instance, if we are trying to build a spam classifier for email, then x (i) may be some features of a piece of email, and y may
be 1 if it is a piece of spam mail, and 0 otherwise. 0 is also called the negative class, and 1 the positive class, and they are
sometimes also denoted by the symbols “−” and “+”. Given x (i) , the corresponding y (i) is also called the label for the training
example.
[Plot: tumor training set with labels 1 (Y) and 0 (N) versus tumor size.]
One thing we can do with this training set is to apply the linear regression algorithm we already know, i.e., try to fit a straight line hθ(x) = Θ^T x to the data. Doing that, assume that we threshold the classifier output at 0.5:
• If hθ(x) ≥ 0.5, then predict y = 1
• If hθ(x) < 0.5, then predict y = 0
Now, every data point to the right of the tumor size at which hθ(x) = 0.5 will be predicted as belonging to the positive class (malignant tumor), and every data point to the left of it will be predicted as belonging to the negative class (benign tumor). See Figure. 2.2. It seems that in this example the linear regression approach reasonably solves the classification problem.
[Plot: the fitted straight line hθ(x) and the 0.5 threshold over the same training set.]
However, it is easy to construct examples where this method performs very poorly. Consider the scenario shown in Figure. 2.3,
where one extra training example is added to the far right.
Figure 2.3: A scenario showing that linear regression is not the right technique to be used for a classification problem.
The previously proposed hypothesis still works well for this example; however, if we re-run linear regression, the hypothesis hθ(x) fit to the data would correspond to a line with a smaller slope. Using a classifier threshold at 0.5, we would get a prediction rule that performs poorly (the data to the left of the threshold contain both positive and negative classes). In general, linear regression is not the right tool for solving classification problems. Intuitively, it also does not make sense for hθ(x) to take values larger than 1 or smaller than 0 when we know that y ∈ {0, 1}. To fix this, we will develop a new algorithm called "logistic regression", whose outputs have the property that 0 ≤ hθ(x) ≤ 1.
Let us change the form of our hypothesis hθ(x). We will choose
$$h_\theta(x) = g(\Theta^T x) = \frac{1}{1 + e^{-\Theta^T x}} \qquad (2.1)$$
where
$$g(z) = \frac{1}{1 + e^{-z}} \qquad (2.2)$$
is called the "logistic function" or the "sigmoid function". See Figure. 2.4 for a plot of g(z).
[Figure 2.4: plot of the sigmoid function g(z) for z between −5 and 5; g(z) rises from 0 towards 1, passing through 0.5 at z = 0.]
Note that g(z) tends towards 1 as z → ∞, and g(z) tends towards 0 as z → −∞. Moreover, g(z), and hence also hθ(x), is always bounded between 0 and 1. As before, we keep the convention of letting x0 = 1, so that Θ^T x = θ0 + Σ_{j=1}^{n} θj xj. For now, let us take the choice of g as given. Other functions that smoothly increase from 0 to 1 can also be used. Before moving on, here is a useful property of the derivative of the sigmoid function, which we write as g′:
$$g'(z) = \frac{d}{dz}\,\frac{1}{1 + e^{-z}} = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}}\left(1 - \frac{1}{1 + e^{-z}}\right) = g(z)\big(1 - g(z)\big)$$
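A quick numerical check of this identity (a sketch, not part of the notes):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Verify g'(z) = g(z) * (1 - g(z)) by central finite differences
z = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(np.allclose(numeric, analytic, atol=1e-8))   # True
```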
How do we interpret the output hθ(x)? When the hypothesis outputs some number, we will interpret this number as the estimated probability that y = 1 on a new input x. To understand this, consider the tumor example with a feature vector x consisting of two entries: x0 = 1 and x1 = tumor size; x = [1 x1]^T. Given a certain patient, assume that the features provide us with the hypothesis hθ(x) = 0.7. This is interpreted as the patient's tumor having a 70% chance of being malignant (y = 1). In probabilistic terms, hθ(x) is interpreted as the probability that y = 1, given x, parametrized by Θ. Since this is a classification task, we know that y can only be 0 or 1, so we can compute the probability that y = 0, given x, parametrized by Θ, as 1 − hθ(x). In summary, p(y = 1|x; Θ) = hθ(x) and p(y = 0|x; Θ) = 1 − hθ(x).
[Plot: training points in the (x1, x2) plane, with the two classes separated by the line x2 = −x1 + 3.]
Solution. From the graph of the sigmoid function, we can see that if hθ(x) ≥ 0.5, this corresponds to Θ^T x ≥ 0. As we have two features x1 and x2, this yields θ0 + θ1 x1 + θ2 x2 ≥ 0. This inequality represents a decision region delimited by the line θ2 x2 + θ1 x1 = −θ0. Looking at the figure, the decision boundary is the line x2 = −x1 + 3. By simple identification, θ0 = −3, θ1 = θ2 = 1.
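A small check of this solution (the test points below are hypothetical, chosen for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-3.0, 1.0, 1.0])                       # [theta0, theta1, theta2] from the solution

def h(x1, x2):
    return sigmoid(theta @ np.array([1.0, x1, x2]))      # convention x0 = 1

print(h(3.0, 3.0))   # well above the line x2 = -x1 + 3  -> close to 1 (predict y = 1)
print(h(0.5, 0.5))   # below the line                    -> close to 0 (predict y = 0)
print(h(1.0, 2.0))   # exactly on the boundary           -> 0.5
```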
[Plot: training points in the (x1, x2) plane, with the two classes separated by the unit circle centered at the origin.]
Solution. From the graph of the sigmoid function, we can see that if hθ (x) ≥ 0.5, this corresponds to θ0 + θ1 x1 + θ2 x2 +
θ3 x12 + θ4 x22 ≥ 0. Substituting with the provided parameters, this is equivalent to −1 + x12 + x22 ≥ 0. Clearly, the decision region
is a circle centered at the origin and with radius 1. This example shows that for more complex data sets, we need to add extra
features of higher orders.
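The same idea in code, using the quadratic features and the parameters θ = (−1, 0, 0, 1, 1) implied by the solution above (the test points are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothesis with quadratic features; the boundary is x1^2 + x2^2 = 1
theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])

def h(x1, x2):
    features = np.array([1.0, x1, x2, x1**2, x2**2])
    return sigmoid(theta @ features)

print(h(0.0, 0.0))   # inside the unit circle  -> below 0.5 (predict y = 0)
print(h(2.0, 0.0))   # outside the unit circle -> above 0.5 (predict y = 1)
```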
The natural question to ask now is: given a training set, how do we fit the parameters Θ to it? Consider a training set of m training examples {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))}, where each example is represented by a feature vector x = [x1 x2 ... xn]^T (with the convention x0 = 1) and y ∈ {0, 1}. The hypothesis is always given by (2.1). We need to choose Θ to fit our training examples.
The cost function for linear regression was given by
$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
Suppose we define cost(hθ(x), y) ≜ ½(hθ(x) − y)² to be the cost of predicting y via hθ(x). However, the problem we are dealing with here is a logistic regression problem, and in this context hθ(x) is a non-linear function, which results in J(θ) being non-convex. Indeed, using the sigmoid function, J(θ) will be a function with many local optima, and hence if we apply the gradient descent algorithm there is no guarantee that it will converge to the global optimum. Therefore, it makes sense to define a new cost function that turns out to be convex.
Define the following logistic regression cost function
$$\text{cost}(h_\theta(x), y) = \begin{cases} -\log\left(h_\theta(x)\right), & \text{if } y = 1 \\ -\log\left(1 - h_\theta(x)\right), & \text{if } y = 0 \end{cases} \qquad (2.6)$$
Obviously, if we predict hθ(x) = 1 while the true value is y = 1, then cost = 0, and similarly, if we predict hθ(x) = 0 while the true value is y = 0, then the cost = 0. However, if for instance hθ(x) = 0 while the true value is y = 1, then cost → +∞. See Figure. 2.5.
This captures the intuition that if hθ(x) = 0 (i.e., we predict P(y = 1|x; Θ) = 0) but y = 1, we penalize the learning algorithm with a very large cost. It is easy to show that (2.6) is indeed a convex function and hence free of local optima.
This result can be derived as a maximum likelihood estimator under a set of assumptions (this also applies to the cost function for linear regression). Let us endow our classification model with a set of probabilistic assumptions, and then fit the parameters via maximum likelihood. Assuming that the m training examples were generated independently, we can write down the likelihood of the parameters as
$$L(\Theta) = p(Y|X;\Theta) = \prod_{i=1}^{m} p\!\left(y^{(i)}|x^{(i)};\Theta\right) = \prod_{i=1}^{m}\left(h_\theta(x^{(i)})\right)^{y^{(i)}}\left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}} \qquad (2.8)$$
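A small numerical sketch of the cost (2.6), which is the negative log of the corresponding factor in the likelihood (2.8) (the prediction values below are made up):

```python
import numpy as np

def cost(h, y):
    """Per-example logistic regression cost from Eq. (2.6)."""
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

# Illustrative predictions: confident-correct, uncertain, confident-wrong
for h, y in [(0.99, 1), (0.5, 1), (0.01, 1)]:
    print(f"h = {h}, y = {y}, cost = {cost(h, y):.3f}")
# The cost grows without bound as h -> 0 while y = 1.
```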
The density of a multivariate Gaussian (normal) distribution with mean µ ∈ R^n and covariance matrix Σ is
$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\!\left(-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\right)$$
In this equation, |Σ| denotes the determinant of the matrix Σ. Figure 2.6 shows some examples of what the density of a Gaussian distribution looks like.
The left-most figure shows a Gaussian with mean zero and covariance matrix Σ = I. A Gaussian distribution with zero mean and identity covariance is also called the "standard normal distribution". The middle figure shows the density of a Gaussian with zero mean and Σ = 0.6I, and the rightmost figure shows one with Σ = 2I. We see that as Σ becomes larger, the Gaussian distribution becomes more "spread-out", and as it becomes smaller, the distribution becomes more "compressed".
Let us look at some more examples.
[Figure: three surface plots of zero-mean bivariate Gaussian densities with covariance matrices I, 0.6I and 2I.]
Figure 2.6: Examples of the multivariate Gaussian distribution.
[Figure: three more surface plots of zero-mean bivariate Gaussian densities.]
The figures above show Gaussians with mean 0 and with covariance matrices, respectively,
$$\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}; \qquad \Sigma = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}; \qquad \Sigma = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}.$$
The leftmost figure shows the familiar standard normal distribution, and we see that as we increase the off-diagonal entry in Σ, the density becomes more "compressed" towards the 45° line (given by x1 = x2). We can see this more clearly when we look at the contours of the same three densities:
[Figure: contour plots of the same three densities.]
The Gaussian Discriminant Analysis Model
When we have a classification problem in which the input features x are continuous-valued random variables, we can use the Gaussian Discriminant Analysis (GDA) model, which models p(x|y) using a multivariate normal distribution. The model is:
$$y \sim \text{Bernoulli}(\phi), \qquad x\,|\,y=0 \sim \mathcal{N}(\mu_0, \Sigma), \qquad x\,|\,y=1 \sim \mathcal{N}(\mu_1, \Sigma)$$
Here, the parameters of our model are φ, Σ, µ0 and µ1. (Note that there are two different mean vectors but only one covariance matrix Σ.) The log-likelihood of the data is given by
$$\ell(\phi, \Sigma, \mu_0, \mu_1) = \log\prod_{i=1}^{m} p\!\left(x^{(i)}, y^{(i)}; \phi, \mu_0, \mu_1, \Sigma\right) = \log\prod_{i=1}^{m} p\!\left(x^{(i)}\,|\,y^{(i)}; \mu_0, \mu_1, \Sigma\right)p\!\left(y^{(i)}; \phi\right) \qquad (2.16)$$
By maximizing ℓ with respect to the parameters, we find the maximum likelihood estimates of the parameters to be
$$\phi = \frac{1}{m}\sum_{i=1}^{m} 1\{y^{(i)} = 1\} \qquad (2.17)$$
$$\mu_0 = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}} \qquad (2.18)$$
$$\mu_1 = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}} \qquad (2.19)$$
$$\Sigma = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)} - \mu_{y^{(i)}}\right)\left(x^{(i)} - \mu_{y^{(i)}}\right)^T \qquad (2.20)$$
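A minimal sketch of the estimates (2.17)-(2.20) on synthetic data (the data-generating choices below are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
y = (rng.random(m) < 0.3).astype(int)               # Bernoulli(0.3) labels
X = np.where(y[:, None] == 1,
             rng.normal([2.0, 2.0], 1.0, (m, 2)),   # class-1 points around (2, 2)
             rng.normal([0.0, 0.0], 1.0, (m, 2)))   # class-0 points around (0, 0)

phi   = np.mean(y == 1)                              # (2.17)
mu0   = X[y == 0].mean(axis=0)                       # (2.18)
mu1   = X[y == 1].mean(axis=0)                       # (2.19)
mus   = np.where(y[:, None] == 1, mu1, mu0)          # mu_{y^(i)} for each example
Sigma = (X - mus).T @ (X - mus) / m                  # (2.20), the shared covariance
print(phi, mu0, mu1, Sigma, sep="\n")
```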
Pictorially, what the algorithm is doing can be seen in Figure. 2.7.
[Figure 2.7: the training set, together with the contours of the two fitted Gaussian distributions, which share a covariance matrix but have different means.]
where θ is some appropriate function of φ, µ0 , µ1 and Σ. This is exactly the form that logistic regression (a discriminative
algorithm) used to model p(y = 1|x).
When would we prefer one model over another? GDA and logistic regression will, in general, give different decision boundaries
when trained on the same dataset. Which is better? We just argued that if p(x|y ) is multivariate gaussian (with shared Σ),
then p(y |x) necessarily follows a logistic function. The converse, however, is not true; i.e., p(y |x) being a logistic function does
not imply p(x|y ) is multivariate gaussian. This shows that GDA makes stronger modeling assumptions about the data than
does logistic regression. It turns out that when these modeling assumptions are correct, then GDA will find better fits to the
data, and is a better model. Specifically, when p(x|y ) is indeed gaussian (with shared Σ), then GDA is asymptotically efficient.
Informally, this means that in the limit of very large training sets (large m), there is no algorithm that is strictly better than
GDA (in terms of, say, how accurately they estimate p(y |x)). In particular, it can be shown that in this setting, GDA will be a
better algorithm than logistic regression; and more generally, even for small training set sizes, we would generally expect GDA
to do better.
In contrast, by making significantly weaker assumptions, logistic regression is also more robust and less sensitive to incorrect
modeling assumptions. There are many different sets of assumptions that would lead to p(y |x) taking the form of a logistic
function. For example, if x|y = 0 ∼ Poisson(λ0 ), and x|y = 1 ∼ Poisson(λ1 ), then p(y |x) will be logistic. Logistic regression
will also work well on Poisson data like this. But if we were to use GDA on such data, fitting Gaussian distributions to such non-Gaussian data, then the results will be less predictable, and GDA may (or may not) do well.
To summarize: GDA makes stronger modeling assumptions, and is more data efficient (i.e., requires less training data to
learn “well”) when the modeling assumptions are correct or at least approximately correct. Logistic regression makes weaker
assumptions, and is significantly more robust to deviations from modeling assumptions. Specifically, when the data is indeed
non-Gaussian, then in the limit of large datasets, logistic regression will almost always do better than GDA. For this reason,
in practice logistic regression is used more often than GDA. (Some related considerations about discriminative vs. generative
models also apply for the Naive Bayes algorithm that we discuss next, but the Naive Bayes algorithm is still considered a very
good, and is certainly also a very popular, classification algorithm.)
$$p(x, y) = p(y)\prod_{j=1}^{D} p(x_j\,|\,y) \qquad (2.22)$$
which is much easier to compute, and reduces the severity of the dimensionality problem.
So the decision rule for the naive Bayes classifier is to select the class yi for which the following quantity is maximal:
$$p(y)\prod_{j=1}^{D} p(x_j\,|\,y)$$
Coupled with a suitable choice for each conditional distribution p(xj|y), we can use Bayes' rule to form the Bayes classifier for a novel input vector x*:
$$p(y\,|\,x^*) = \frac{p(x^*\,|\,y)\,p(y)}{p(x^*)} = \frac{p(y)\prod_{j=1}^{D}p(x_j^*\,|\,y)}{\sum_{y} p(y)\,p(x^*\,|\,y)} \qquad (2.23)$$
In practice it is common to consider only two classes dom(y ) = {0, 1}. The theory we describe below is valid for any number
of classes y , though our examples are restricted to the binary class case. Also, the attributes xj are often taken to be binary, as
we shall do initially below as well. The extension to more than two attribute states, or continuous attributes is straightforward.
As an example, consider a bank that has a record of past loans containing customer data (the customer's yearly income and savings) and whether the loan was paid back or not. From this data of particular applications, the aim is to infer a general rule coding the association between a customer's attributes and his or her risk. That is, the machine learning system fits a model to the past data to be able to calculate the risk for a new application, and then decides to accept or refuse it accordingly. Assume the customer's yearly income and savings are represented by two random variables x1 and x2. With what we can observe, the credibility of a customer is denoted by a Bernoulli random variable y conditioned on the observables x = [x1, x2]^T, where y = 1 indicates a high-risk customer and y = 0 indicates a low-risk customer. Thus, if we know p(y|x1, x2), we can choose
$$\begin{cases} y = 1, & \text{if } p(y = 1|x_1, x_2) > p(y = 0|x_1, x_2) \\ y = 0, & \text{otherwise} \end{cases} \qquad (2.24)$$
• Class Likelihood. p(x | y) is called the class likelihood and is the conditional probability that an event belonging to y has
the associated observation value x. In our case, p(x_1, x_2 | y = 1) is the probability that a high-risk customer has his or her
observed values of x_1 and x_2. It is what the data tells us regarding the class.
• Evidence. p(x), the evidence, is the marginal probability that an observation x is seen, regardless of whether it is a
positive or negative example:
p(x) = \sum_{y} p(x, y) = p(x | y = 1)\, p(y = 1) + p(x | y = 0)\, p(y = 0)
• Posterior Probability. Combining the prior and what the data tells us using Bayes' rule, we calculate the posterior
probability of the concept, p(y | x), after having seen the observation x:
posterior = \frac{prior \times likelihood}{evidence}
Because of normalization by the evidence, the posteriors sum up to 1: p(y = 1 | x) + p(y = 0 | x) = 1.
In the general case, we have K mutually exclusive and exhaustive classes y_i, i = 1, ..., K. The posterior probability of class y_i
can be calculated as
p(y_i | x) = \frac{p(x | y_i)\, p(y_i)}{p(x)} = \frac{p(x | y_i)\, p(y_i)}{\sum_{k=1}^{K} p(x | y_k)\, p(y_k)}
and for minimum error, the Bayes classifier chooses the class with the highest posterior probability; that is, we choose y_i if
p(y_i | x) = \max_{k} p(y_k | x).
Example 8. EZsurvey.org partitions radio station listeners into two groups − the “young” and “old”. They assume that, given
the knowledge that a customer is either “young” or “old”, this is sufficient to determine whether or not a customer will like a
particular radio station, independent of their likes or dislikes for any other stations. Here each of the variables r_1, r_2, r_3, r_4
can take the states like or dislike, and the "age" variable can take the value young or
old. Thus the information about the age of the customer determines the individual radio station preferences without needing
to know anything else. To complete the specification, given that a customer is young, she has a 95% chance to like Radio1, a
5% chance to like Radio2, a 2% chance to like Radio3 and a 20% chance to like Radio4. Similarly, an old listener has a 3%
chance to like Radio1, an 82% chance to like Radio2, a 34% chance to like Radio3 and a 92% chance to like Radio4. They
know that 90% of the listeners are old.
Given this model, and the fact that a new customer likes Radio1, and Radio3, but dislikes Radio2 and Radio4, what is the
probability that the new customer is young?
Solution. Let D denote "dislike" and L denote "like". The probability in question can be written as
p(young | r_1 = L, r_2 = D, r_3 = L, r_4 = D) = \frac{p(r_1 = L, r_2 = D, r_3 = L, r_4 = D | young)\, p(young)}{\sum_{age} p(r_1 = L, r_2 = D, r_3 = L, r_4 = D | age)\, p(age)}
which gives
p(young | r_1 = L, r_2 = D, r_3 = L, r_4 = D) = \frac{0.0014}{0.0014 + 1.3219 \times 10^{-4}} = 0.9161
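The arithmetic can be checked with a few lines of Python. The following is a minimal sketch (not part of the original notes); the probability values are the ones stated in the example.

```python
# Reproducing the posterior of Example 8 under the Naive Bayes factorization.
p_age = {"young": 0.10, "old": 0.90}
p_like = {  # p(r_j = like | age) for Radio1..Radio4, as stated in the example
    "young": [0.95, 0.05, 0.02, 0.20],
    "old":   [0.03, 0.82, 0.34, 0.92],
}
observation = ["L", "D", "L", "D"]  # likes Radio1 and Radio3, dislikes Radio2 and Radio4

def joint(age):
    """p(r_1, ..., r_4, age): prior times the product of the conditionals."""
    prob = p_age[age]
    for theta, r in zip(p_like[age], observation):
        prob *= theta if r == "L" else (1.0 - theta)
    return prob

posterior_young = joint("young") / (joint("young") + joint("old"))
print(round(posterior_young, 4))  # ~0.9161, matching the worked example
```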
Based on the NB conditional independence assumption, the probability of observing a vector x can be compactly written as
p(x | y) = \prod_{j=1}^{D} p(x_j | y) = \prod_{j=1}^{D} (\theta_{j,y})^{x_j} (1 - \theta_{j,y})^{1 - x_j}    (2.26)
In the above expression, x_j is either 0 or 1 and hence each term j contributes a factor \theta_{j,y} if x_j = 1 or (1 - \theta_{j,y}) if x_j = 0. Together
with the assumption that the training data is i.i.d. generated, the log likelihood of the attributes and class labels is
L = \sum_{n} \log p(x^n, y^n) = \sum_{n} \log p(y^n) \prod_{j} p(x_j^n | y^n)    (2.27)
  = \sum_{j,n} x_j^n \log \theta_{j,y^n} + (1 - x_j^n) \log(1 - \theta_{j,y^n}) + n_0 \log p(y = 0) + n_1 \log p(y = 1)    (2.28)
  = \sum_{j,n} \{ I[x_j^n = 1, y^n = 1] \log \theta_{j,1} + I[x_j^n = 0, y^n = 1] \log(1 - \theta_{j,1}) + I[x_j^n = 1, y^n = 0] \log \theta_{j,0} + I[x_j^n = 0, y^n = 0] \log(1 - \theta_{j,0}) \} + n_0 \log p(y = 0) + n_1 \log p(y = 1)    (2.30)
We can find the maximum likelihood optimal \theta_{j,y} by differentiating with respect to \theta_{j,y} and equating to zero, giving
\theta_{j,y} = p(x_j = 1 | y) = \frac{\sum_n I[x_j^n = 1, y^n = y]}{\sum_n I[x_j^n = 0, y^n = y] + \sum_n I[x_j^n = 1, y^n = y]} = \frac{\text{number of times } x_j = 1 \text{ for class } y}{\text{number of datapoints in class } y}    (2.31)
To classify a novel input x^*, we assign it to class 1 if its log posterior exceeds that of class 0, i.e., if
\log p(x^* | y = 1) + \log p(y = 1) - \log p(x^*) > \log p(x^* | y = 0) + \log p(y = 0) - \log p(x^*)
This is equivalent to
\sum_j \log p(x_j^* | y = 1) + \log p(y = 1) > \sum_j \log p(x_j^* | y = 0) + \log p(y = 0)
Using the binary encoding x_j^* \in \{0, 1\}, we therefore classify x^* as class 1 if
\sum_j \{ x_j^* \log \theta_{j,1} + (1 - x_j^*) \log(1 - \theta_{j,1}) \} + \log p(y = 1) > \sum_j \{ x_j^* \log \theta_{j,0} + (1 - x_j^*) \log(1 - \theta_{j,0}) \} + \log p(y = 0)
Example 9. Consider the following vector of binary attributes: (pizza, espresso, beer, sausages, football). A vector x =
(1, 0, 1, 1, 0)^T would describe that a person likes pizza, does not like espresso, drinks beer, eats sausages, and does not like to play
football. Together with each vector x, there is a label y describing the nationality of the person, dom(y) = {German, Italian}.
Table (a) depicts German tastes for 6 people and table (b) depicts Italian tastes for 7 people over the attributes (pizza, espresso,
beer, sausages, football). Each column represents the tastes of an individual.
(a) German tastes (each column is one person):
pizza:     0 1 1 1 0 0
espresso:  0 0 1 1 1 0
beer:      1 1 0 0 0 0
sausages:  1 1 0 0 0 1
football:  1 0 1 0 1 0

(b) Italian tastes (each column is one person):
pizza:     1 1 1 1 1 1 1
espresso:  0 1 1 1 1 0 0
beer:      0 0 1 0 0 1 1
sausages:  1 0 1 1 1 1 0
football:  1 1 0 0 1 0 0

We wish to classify a person who likes eating pizza and sausages and drinking beer, but does not like drinking espresso and
does not play football, as either German or Italian.
Solution. We wish to classify the vector x = (1, 0, 1, 1, 0)^T as either German or Italian. Using Bayes' rule,
p(Italian | x) = \frac{p(x | Italian)\, p(Italian)}{p(x)} = \frac{p(x | Italian)\, p(Italian)}{p(x | Italian)\, p(Italian) + p(x | German)\, p(German)}
By maximum likelihood the "prior" class probabilities are p(Italian) = 7/13 and p(German) = 6/13. For p(x | y), under the Naive Bayes
assumption:
p(x | y) = p(x_1 | y)\, p(x_2 | y)\, p(x_3 | y)\, p(x_4 | y)\, p(x_5 | y)
Based on the tables and using maximum likelihood we have:
p(x_1 = 1 | German) = 1/2,   p(x_1 = 1 | Italian) = 1
p(x_2 = 1 | German) = 1/2,   p(x_2 = 1 | Italian) = 4/7
p(x_3 = 1 | German) = 1/3,   p(x_3 = 1 | Italian) = 3/7
p(x_4 = 1 | German) = 1/2,   p(x_4 = 1 | Italian) = 5/7
p(x_5 = 1 | German) = 1/2,   p(x_5 = 1 | Italian) = 3/7
For x = (1, 0, 1, 1, 0)^T, we get
p(Italian | x) = \frac{1 \times \frac{3}{7} \times \frac{3}{7} \times \frac{5}{7} \times \frac{4}{7} \times \frac{7}{13}}{1 \times \frac{3}{7} \times \frac{3}{7} \times \frac{5}{7} \times \frac{4}{7} \times \frac{7}{13} + \frac{1}{2} \times \frac{1}{2} \times \frac{1}{3} \times \frac{1}{2} \times \frac{1}{2} \times \frac{6}{13}} = 0.8076
Since this is greater than 0.5, we would classify this person as being Italian.
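The computation of Example 9 can also be checked numerically. The sketch below (with the tables above entered as arrays; the layout is an assumption made for illustration, not the notes' own code) reproduces the posterior 0.8076.

```python
import numpy as np

# Rows are attributes (pizza, espresso, beer, sausages, football); columns are people.
german = np.array([[0, 1, 1, 1, 0, 0],
                   [0, 0, 1, 1, 1, 0],
                   [1, 1, 0, 0, 0, 0],
                   [1, 1, 0, 0, 0, 1],
                   [1, 0, 1, 0, 1, 0]])
italian = np.array([[1, 1, 1, 1, 1, 1, 1],
                    [0, 1, 1, 1, 1, 0, 0],
                    [0, 0, 1, 0, 0, 1, 1],
                    [1, 0, 1, 1, 1, 1, 0],
                    [1, 1, 0, 0, 1, 0, 0]])

def class_likelihood(x, table):
    theta = table.mean(axis=1)                       # ML estimate of p(x_j = 1 | class)
    return np.prod(theta**x * (1 - theta)**(1 - x))  # NB factorization (2.26)

x = np.array([1, 0, 1, 1, 0])                        # the person to be classified
prior_i, prior_g = 7 / 13, 6 / 13                    # ML class priors
num = class_likelihood(x, italian) * prior_i
den = num + class_likelihood(x, german) * prior_g
print(round(num / den, 4))                           # ~0.8076 -> classified as Italian
```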
2.3.3 Spam Filtering
For our motivating example, consider building an email spam filter using machine learning. Here, we wish to classify messages
according to whether they are unsolicited commercial (spam) email or non-spam email. After learning to do this, we can then
have our mail reader automatically filter out the spam messages and perhaps place them in a separate mail folder. Classifying
emails is one example of a broader set of problems called text classification.
Let us say we have a training set (a set of emails labeled as spam or non-spam). We will begin construction of our spam
filter by specifying the features x_j used to represent an email. We will represent an email via a feature vector whose length is
equal to the number of words in the dictionary. Specifically, if an email contains the j−th word of the dictionary, then we will
set x_j = 1; otherwise, x_j = 0. For instance, the vector
x = [ 1 (a), 0 (aardvark), 0 (aardwolf), ..., 1 (buy), ..., 0 (zymurgy) ]^T    (2.34)
is used to represent an email that contains the words "a" and "buy," but not "aardvark," "aardwolf" or "zymurgy"1 . The
set of words encoded into the feature vector is called the vocabulary, so the dimension of x is equal to the size of the vocabulary.
Having chosen our feature vector, we now want to build a generative model. So, we have to model p(x | y). But if we
have, say, a vocabulary of 50000 words, then x ∈ {0, 1}^{50000} (x is a 50000-dimensional vector of 0's and 1's), and if we
were to model x explicitly with a multinomial distribution over the 2^{50000} possible outcomes, then we would end up with a
(2^{50000} − 1)−dimensional parameter vector. This is clearly too many parameters.
To model p(x | y), we will assume that the x_j 's are conditionally independent given y, i.e., the Naive Bayes (NB)
assumption, and the resulting algorithm is called the Naive Bayes classifier. For instance, if y = 1 means spam email; “buy” is
word 2087 and “price” is word 39831; then we are assuming that if I tell you y = 1 (that a particular piece of email is spam),
then knowledge of x2087 (knowledge of whether “buy” appears in the message) will have no effect on your beliefs about the
value of x39831 (whether “price” appears). More formally, this can be written p(x2087 |y ) = p(x2087 |y , x39831 ). (Note that this
is not the same as saying that x2087 and x39831 are independent, which would have been written p(x2087 ) = p(x2087 |x39831 );
rather, we are only assuming that x2087 and x39831 are conditionally independent given y .)
We now have
p(x_1, ..., x_{50000} | y) = p(x_1 | y)\, p(x_2 | y, x_1)\, p(x_3 | y, x_1, x_2) \cdots p(x_{50000} | y, x_1, ..., x_{49999})
                           = p(x_1 | y)\, p(x_2 | y)\, p(x_3 | y) \cdots p(x_{50000} | y)
                           = \prod_{j=1}^{n} p(x_j | y)
The first equality simply follows from the usual properties of probabilities, and the second equality used the NB assumption.
We note that even though the Naive Bayes assumption is an extremely strong assumption, the resulting algorithm works well
on many problems.
Our model is parametrized by φ_{j|y=1} = p(x_j = 1 | y = 1), φ_{j|y=0} = p(x_j = 1 | y = 0), and φ_y = p(y = 1). As usual, given a
training set {(x^{(i)}, y^{(i)}); i = 1, ..., m}, we can write down the joint likelihood of the data
L(φ_y, φ_{j|y=1}, φ_{j|y=0}) = \prod_{i=1}^{m} p(x^{(i)}, y^{(i)})    (2.35)
1 Actually, rather than looking through an English dictionary for the list of all English words, in practice it is more common to look through our
training set and encode in our feature vector only the words that occur at least once there. Apart from reducing the number of words modeled and
hence reducing our computational and space requirements, this also has the advantage of allowing us to model/include as a feature many words that
may appear in your email (such as “CCE414”) but that you will not find in a dictionary. Sometimes, we also exclude the very high frequency words
(which will be words like “the”, “of”, “and”; these high frequency, “content free” words are called stop words) since they occur in so many documents
and do little to indicate whether an email is spam or non-spam.
Maximizing this with respect to φ_y, φ_{j|y=1}, φ_{j|y=0} gives the maximum likelihood estimates
φ_{j|y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1, y^{(i)} = 1\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}    (2.36)
φ_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1, y^{(i)} = 0\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}}    (2.37)
φ_y = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{m}    (2.38)
The parameters have a very natural interpretation. For instance, φj|y =1 is just the fraction of the spam (y = 1) emails in
which word j does appear.
Having fit all these parameters, to make a prediction on a new example with features x, we then simply calculate
p(y = 1 | x) = \frac{p(x | y = 1)\, p(y = 1)}{p(x)} = \frac{p(y = 1) \prod_{j=1}^{n} p(x_j | y = 1)}{p(y = 1) \prod_{j=1}^{n} p(x_j | y = 1) + p(y = 0) \prod_{j=1}^{n} p(x_j | y = 0)}    (2.39)
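As an illustration (not the notes' own implementation), the following sketch puts the estimates (2.36)–(2.38) and the prediction rule (2.39) together on a tiny made-up data set of four "emails" over a three-word vocabulary.

```python
import numpy as np

# Toy data, invented purely for illustration: each row is an email, each column a word.
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = non-spam

phi_y = y.mean()                        # (2.38)
phi_j_spam = X[y == 1].mean(axis=0)     # (2.36)
phi_j_ham = X[y == 0].mean(axis=0)      # (2.37)

def p_spam(x):
    """Posterior p(y = 1 | x) from (2.39) under the NB assumption."""
    lik_spam = np.prod(phi_j_spam**x * (1 - phi_j_spam)**(1 - x)) * phi_y
    lik_ham = np.prod(phi_j_ham**x * (1 - phi_j_ham)**(1 - x)) * (1 - phi_y)
    return lik_spam / (lik_spam + lik_ham)

print(p_spam(np.array([1, 0, 0])))  # 1.0 here: the ML estimate p(word 1 | ham) = 0 rules out ham
```

Note that with pure maximum likelihood estimates a word never seen in one class forces a zero probability, as the last comment shows; smoothing techniques address this but are not used in this sketch.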
Figure 2.8: An example of a decision tree for the concept of PlayTennis. An example is classified by sorting it through the
tree to the appropriate leaf node, then returning the classification associated with this leaf (in this case, Yes or No). This tree
classifies Saturday mornings according to whether or not they are suitable for playing tennis.
Figure. 2.8 illustrates a typical learned decision tree. This decision tree classifies Saturday mornings according to whether
they are suitable for playing tennis. For example, an instance whose attribute values send it down the leftmost branch of this
decision tree would therefore be classified as a negative instance (i.e., the tree predicts that PlayTennis = no).
A decision tree consists of:
• Nodes: specify a test of some attribute, with one branch for each possible value of that attribute
• Edges: correspond to the outcome of a test and connect to the next node or leaf
• Leaves: assign a classification
A decision tree classifies data using the attributes. Every tree consists of decision nodes and decision leaves. Nodes can have
two or more branches, which represent the values of the attribute tested. Leaf nodes produce a homogeneous result.
In the rest of the section we will use the learning task represented by the training examples of Table. 2.9.
Every row i corresponds to the i-th labeled instance ⟨x_i, y_i⟩ for i = 1, 2, ..., n, and every column denotes a feature x^{(j)}.
Going back to Figure. 2.8, each internal node tests one feature x^{(j)}, each branch from a node selects one value for x^{(j)}, and
each leaf node predicts y. Note that the features in our example are discrete; if they were continuous, internal nodes could test
the value of a feature against a threshold.
Although a variety of decision tree learning methods have been developed with somewhat differing capabilities and require-
ments, decision tree learning is generally best suited to problems with the following characteristics:
• Instances are represented by attribute-value pairs. Instances are described by a fixed set of attributes (e.g., Temperature)
and their values (e.g., Hot). The easiest situation for decision tree learning is when each attribute takes on a small number
of disjoint possible values (e.g., Hot, Mild, Cold). However, extensions to the basic algorithm (which we are yet to explain)
allow handling real-valued attributes as well (e.g., representing Temperature numerically).
• The target function has discrete output values. The decision tree in Figure. 2.8 assigns a boolean classification (e.g., yes
or no) to each example. Decision tree methods easily extend to learning functions with more than two possible output
values. A more substantial extension allows learning target functions with real-valued outputs, though the application of
decision trees in this setting is less common.
• The training data may contain errors. Decision tree learning methods are robust to errors, both errors in classifications
of the training examples and errors in the attribute values that describe these examples.
• The training data may contain missing attribute values. Decision tree methods can be used even when some training
examples have unknown values (e.g., if the Humidity of the day is known for only some of the training examples).
Many practical problems have been found to fit these characteristics. Decision tree learning has therefore been applied
to problems such as learning to classify medical patients by their disease, equipment malfunctions by their cause, and loan
applicants by their likelihood of defaulting on payments.
A simplified version of the algorithm, specialized to learning boolean-valued functions (i.e., concept learning), is described
below.
Our basic algorithm, ID3, learns decision trees by constructing them top-down, beginning with the question “which attribute
should be tested at the root of the tree?’” To answer this question, each instance attribute is evaluated using a statistical test
to determine how well it alone classifies the training examples. The best attribute is selected and used as the test at the root
node of the tree. A descendant of the root node is then created for each possible value of this attribute, and the training
examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the example’s value for this
attribute). The entire process is then repeated using the training examples associated with each descendant node to select
the best attribute to test at that point in the tree. This forms a greedy search for an acceptable decision tree, in which the
algorithm never backtracks to reconsider earlier choices.
ID3 Algorithm
• If all Examples are positive, Return the single-node tree Root, with label = +
• If all Examples are negative, Return the single-node tree Root, with label = −
• If Attributes is empty, Return the single-node tree Root, with label = most common value of Target-attribute in Examples
• Otherwise Begin
  – A ← the attribute from Attributes that best classifies Examples (the best attribute is the one with the highest information gain, as will be defined next)
  – The decision attribute for Root ← A
  – For each possible value, v_i, of A
    ∗ Add a new tree branch below Root, corresponding to the test A = v_i
    ∗ Let Examples_{v_i} be the subset of Examples that have value v_i for A
    ∗ If Examples_{v_i} is empty
      · Then below this new branch add a leaf node with label = most common value of Target-attribute in Examples
      · Else below this new branch add the subtree ID3(Examples_{v_i}, Target-attribute, Attributes − {A})
  – End
  – Return Root
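A compact Python sketch of the ID3 recursion above follows (an illustration, not the notes' own implementation); it uses the entropy and information-gain measures defined next. The data layout is an assumption: `examples` is a list of (attribute_dict, label) pairs and `attributes` maps each attribute name to its list of possible values.

```python
import math
from collections import Counter

def entropy(examples):
    counts = Counter(label for _, label in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    remainder = 0.0
    for value in {ex[attribute] for ex, _ in examples}:
        subset = [(ex, y) for ex, y in examples if ex[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(examples) - remainder

def id3(examples, attributes):
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                    # all examples share one label
        return labels[0]
    if not attributes:                           # no attributes left to test
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a))
    tree = {best: {}}
    for value in attributes[best]:
        subset = [(ex, y) for ex, y in examples if ex[best] == value]
        if not subset:                           # empty branch: most common label
            tree[best][value] = Counter(labels).most_common(1)[0][0]
        else:
            remaining = {a: v for a, v in attributes.items() if a != best}
            tree[best][value] = id3(subset, remaining)
    return tree
```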
Given a collection S, containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification is
Entropy(S) = −p_⊕ \log_2 p_⊕ − p_⊖ \log_2 p_⊖
where p_⊕ is the proportion of positive examples in S and p_⊖ is the proportion of negative examples in S.
One interpretation of entropy from information theory is that it specifies the minimum number of bits of information needed
to encode the classification of an arbitrary member of S (i.e., a member of S drawn at random with uniform probability). For
example, if p⊕ , is 1, the receiver knows the drawn example will be positive, so no message need be sent, and the entropy is
zero. On the other hand, if p⊕ is 0.5, one bit is required to indicate whether the drawn example is positive or negative. If p⊕ is
0.8, then a collection of messages can be encoded using on average less than 1 bit per message by assigning shorter codes to
collections of positive examples and longer codes to less likely negative examples.
Thus far we have discussed entropy in the special case where the target classification is boolean. More generally, if the
target attribute can take on c different values, then the entropy of S relative to this c−wise classification is defined as
Entropy(S) = − \sum_{i=1}^{c} p_i \log_2 p_i    (2.40)
where pi is the proportion of S belonging to class i . Note the logarithm is still base 2 because entropy is a measure of the
expected encoding length measured in bits. Note also that if the target attribute can take on c possible values, the entropy
can be as large as log2 c.
Given entropy as a measure of the impurity in a collection of training examples, we can now define a measure of the
effectiveness of an attribute in classifying the training data. The measure we will use, called information gain, is simply the
expected reduction in entropy caused by partitioning the examples according to this attribute. More precisely, the information
gain, Gain(S, A) of an attribute A, relative to a collection of examples S, is defined as
Gain(S, A) = Entropy(S) − \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)    (2.41)
where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v
(i.e., Sv = {s ∈ S|A(s) = v }). Note the first term in (2.41) is just the entropy of the original collection S, and the second term
is the expected value of the entropy after S is partitioned using attribute A. The expected entropy described by this second
term is simply the sum of the entropies of each subset S_v, weighted by the fraction of examples that belong to S_v. Gain(S, A)
is therefore the expected reduction in entropy caused by knowing the value of attribute A. Put another way, Gain(S, A) is the
information provided about the target function value, given the value of some other attribute A. The value of Gain(S, A) is
the number of bits saved when encoding the target value of an arbitrary member of S, by knowing the value of attribute A.
Example 10. Consider the problem depicted in Fig. 2.9. Let S be the collection of training-example days described by the
attributes of Outlook, Temperature, Humidity, and Wind. Compute Gain(S, Wind).
Solution. Suppose S is a collection of training-example days described by attributes including Wind, which can have the values
Weak or Strong. Assume S is a collection containing 14 examples, [9+, 5−]. Of these 14 examples, suppose 6 of the positive
and 2 of the negative examples have Wind = Weak, and the remainder have Wind = Strong. The information gain due to
sorting the original 14 examples by the attribute Wind may then be calculated as
Entropy(S) = −\frac{9}{14} \log_2 \frac{9}{14} − \frac{5}{14} \log_2 \frac{5}{14} = 0.94
Entropy(S_{Weak}) = −\frac{6}{8} \log_2 \frac{6}{8} − \frac{2}{8} \log_2 \frac{2}{8} = 0.811
Entropy(S_{Strong}) = −\frac{3}{6} \log_2 \frac{3}{6} − \frac{3}{6} \log_2 \frac{3}{6} = 1
∴ Gain(S, Wind) = 0.94 − \frac{8}{14}(0.811) − \frac{6}{14}(1) = 0.048
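As a quick numerical check of Example 10 (a sketch assuming the counts stated in the solution), the entropies and the gain can be computed as follows.

```python
import math

def entropy(pos, neg):
    """Entropy of a collection with `pos` positive and `neg` negative examples."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

H_S = entropy(9, 5)                                # collection S = [9+, 5-]
H_weak, H_strong = entropy(6, 2), entropy(3, 3)    # Wind = Weak and Wind = Strong subsets
gain = H_S - (8 / 14) * H_weak - (6 / 14) * H_strong
print(round(H_S, 3), round(H_weak, 3), round(gain, 3))   # 0.94 0.811 0.048
```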
Information gain is precisely the measure used by ID3 to select the best attribute at each step in growing the tree. To
decide on which attribute is the best classifier, we should compute G (S, Humidity ), G (S, Outlook), and G (S, Temperature) as
well. Consider the first step through the algorithm, in which the topmost node of the decision tree is created. The information
gain values for all four attributes are: G (S, Humidity ) = 0.151, G (S, Outlook) = 0.246, G (S, Temperature) = 0.029, and
G (S, Wind) = 0.048. According to the information gain measure, the Outlook attribute provides the best prediction of the
target attribute, PlayTennis, over the training examples. Therefore, Outlook is selected as the decision attribute for the root
node, and branches are created below the root for each of its possible values (i.e., Sunny, Overcast, and Rain).
Figure 2.10: The partially learned decision tree resulting from the first step of ID3: the root tests Outlook over the 14 training examples [9+, 5−], the Overcast branch is already a Yes leaf, and the Sunny and Rain branches remain to be expanded.
The resulting partial decision tree is shown in Figure. 2.10, along with the training examples sorted to each new descendant
node. Note that every example for which Outlook = Overcast is also a positive example of PlayTennis. Therefore, this node
of the tree becomes a leaf node with the classification PlayTennis = Yes. In contrast, the descendants corresponding to
Outlook = Sunny and Outlook = Rain still have nonzero entropy, and the decision tree will be further elaborated below these
nodes.
The process of selecting a new attribute and partitioning the training examples is now repeated for each non-terminal
descendant node, this time using only the training examples associated with that node. Attributes that have been incorporated
higher in the tree are excluded, so that any given attribute can appear at most once along any path through the tree. This
process continues for each new leaf node until either of two conditions is met: (1) every attribute has already been included
along this path through the tree, or (2) the training examples associated with this leaf node all have the same target attribute
value (i.e., their entropy is zero).
Considering the Sunny branch, to decide on which of the attributes Humidity, Temperature, and Wind should be tested next, we
compute the information gain of each attribute relative to the examples sorted to that branch.
Clearly Humidity has the highest information gain, so this should be the attribute connected to the Sunny branch. Continuing
in the same manner, the final decision tree learned by ID3 from the 14 training examples is shown in Figure. 2.8.
Figure 2.11: An illustration of the decision boundaries of the 1−NN rule. The points depicted are the sample points, and the
predicted label of any new point will be the label of the sample point in the center of the cell it belongs to. These cells are
called a Voronoi tessellation of the space, and are drawn using the Euclidean distance. Other metrics will change the decision
surface.
1−NN Algorithm
1. Calculate the dissimilarity of the test point x to each of the train points, d^n = d(x, x^n), n = 1, ..., N.
2. Find the training point x^{n^*} which is nearest to x: n^* = \arg\min_n d(x, x^n).
3. Assign the class label y(x) = y^{n^*}.
4. In the case that there are two or more nearest neighbors with different class labels, the most numerous class is chosen. If there is no single most numerous class, we use the K−nearest-neighbors.
The nearest neighbor algorithm is simple and intuitive. There are, however, some issues:
• How should we measure the distance between points? Whilst the Euclidean square distance is popular, this may not
always be appropriate. A fundamental limitation of the Euclidean distance is that it does not take into account how the
data is distributed. For example, if the length scales of the components of the vector x vary greatly, the largest length
scale will dominate the squared distance, with potentially useful class-specific information in other components of x lost.
The Mahalanobis distance
d(x, x') = (x − x')^T \Sigma^{−1} (x − x')    (2.43)
where \Sigma is the covariance matrix of the inputs (from all classes), can overcome some of these problems since it effectively
rescales the input vector components.
• The whole dataset needs to be stored to make a classification since the novel point must be compared to all of the train
points. This can be partially addressed by a method called data editing in which data points which have little or no effect
on the decision boundary are removed from the training dataset. Depending on the geometry of the training points,
finding the nearest neighbor can also be accelerated by examining the values of each of the components x_i of x in turn.
Such an axis-aligned space-split is called a KD−tree and can reduce the possible set of candidate nearest neighbors in
the training set to the novel x^*, particularly in low dimensions.
• Each distance calculation can be expensive if the data points are high dimensional. Principal Components Analysis is
one way to address this and replaces x with a low dimensional projection p. The Euclidean distance of two data points
x^a and x^b is then approximately given by the distance between p^a and p^b. This is both faster to compute and can also improve classification
accuracy since only the large scale characteristics of the data are retained in the PCA projections.
If we assume the Euclidean distance as the dissimilarity measure, the k−Nearest Neighbor algorithm considers a hypersphere
centered on the test point x. The radius of the hypersphere is increased until it contains exactly K train inputs. The class
label y(x) is then given by the most numerous class within the hypersphere. This is illustrated in Figure. 2.12. The training
data consists of data points belonging to one of two classes (grey circles and white squares). Two test points are indicated
by black diamonds and in both cases, the dotted circles enclose their k = 3 nearest neighbors. The neighbors of test point A
include two from the square class and one from the circle class and it will be classified as belonging to the square class. All of
the neighbors of test point B belong to the circle class to which B is assigned.
Figure 2.12: An illustration of the operation of k = 3-NN. Circles and squares denote the training points and diamonds the
test points. Test point A will be assigned to the square class and B to the circle class.
We have the data points positioned within input space, so we just need to work out which of the training data are close to
it. This requires computing the distance to each datapoint in the training set, which is relatively expensive: if we are in normal
Euclidean space, then we have to compute d subtractions and d squarings (we can ignore the square root since we only want
to know which points are the closest, not the actual distance). We can then identify the k nearest neighbors to the test point,
and then set the class of the test point to be the most common one out of those for the nearest neighbors. The choice of k is
not trivial. Make it too small and nearest neighbor methods are sensitive to noise; make it too large and the accuracy drops as
points that are too far away are considered.
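A minimal sketch of the k-NN prediction rule just described (illustrative, not the notes' own code); it uses squared Euclidean distance, which suffices because only the ordering of distances matters.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    d2 = np.sum((X_train - x_test) ** 2, axis=1)   # squared distances to the test point
    nearest = np.argsort(d2)[:k]                   # indices of the k closest training points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]              # majority class among the neighbors
```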
One drawback of the k−nearest neighbor approach is the issue of ties: two or more classes having an equal number of
votes. For example, if k = 8 in Figure. 2.12, we will always have four neighbors belonging to each class and no majority. One
option is to assign the class randomly from the set of tied classes. This may not always be sensible as it means that the same
x^* may be assigned to different classes if it is tested more than once. For binary classification, a neater solution is to always
use an odd number of neighbors. More generally, we can weight the votes according to distance such that the votes from closer
points have greater influence, making ties highly unlikely. Also, in the case of a tie one may increase k until the tie is broken.
The k−nearest Neighbor algorithm is one of the simplest of all machine learning classifiers (also known as instance-based
learning, or memory-based learning). However, this method suffers from the curse of dimensionality. First, as shown above,
the computational costs get higher as the number of dimensions grows. This is not as bad as it might appear at first: there
are sets of methods such as KD-Trees that can reduce the computational time. However, more importantly, as the number of
dimensions increases, so the distance to other data points tends to increase. In addition, they can be far away in a variety of
different directions−there might be points that are relatively close in some dimensions, but a long way in others. There are
methods for dealing with these problems, known as adaptive nearest neighbor methods.
Once we have some data and have chosen a suitable distance measure, the only thing that remains is the choice of k. If
k is too small, our classification can be heavily influenced by noise. Whilst there is some sense in making k > 1 (it reduces the
chances of overfitting), there is certainly little sense in making k = N (N being the number of training points). For k very large,
all classifications will become the same: simply assign each novel x^* to the most numerous class in the train data (we have
smoothed to such an extent that everywhere belongs to a single class). This suggests that there is an optimal intermediate
setting of k which gives the best generalization performance. This can be determined using cross-validation, as described in
the next chapter.
Data sets with uneven numbers of objects in each class are known as imbalanced and are common in machine learning and
something we must be aware of when we undertake any classification analysis.
Example 11. Consider the following training set in the 2-dimensional Euclidean space
x y Class
-1 1 −
0 1 +
0 2 −
1 -1 −
1 0 +
1 2 +
2 2 −
2 3 +
1. What is the prediction of the 3−nearest neighbor classifier at point (1, 1)?
2. What is the prediction of the 5−nearest neighbor classifier at point (1, 1)?
3. What is the prediction of the 7−nearest neighbor classifier at point (1, 1)?
4. What is the prediction of the 4−nearest neighbor classifier at point (1, 1)?
Solution. We start by computing the square distance from (1, 1) to each of the points in the data set. We get:
d^2_{1,−} = 4, d^2_{2,+} = 1, d^2_{3,−} = 2, d^2_{4,−} = 4, d^2_{5,+} = 1, d^2_{6,+} = 1, d^2_{7,−} = 2, d^2_{8,+} = 5.
1. The three smallest distances are: d^2_{2,+} = 1, d^2_{5,+} = 1 and d^2_{6,+} = 1. Then, the given point will be classified as belonging
to class +.
2. The five smallest distances are: d^2_{2,+} = 1, d^2_{5,+} = 1, d^2_{6,+} = 1, d^2_{3,−} = 2, d^2_{7,−} = 2. Then, the given point will be
classified as belonging to class +.
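The answers can be checked numerically. The following sketch (using the training set of Example 11) computes the k-NN prediction at (1, 1) for k = 3, 5, 7.

```python
import numpy as np
from collections import Counter

pts = np.array([[-1, 1], [0, 1], [0, 2], [1, -1], [1, 0], [1, 2], [2, 2], [2, 3]])
cls = np.array(["-", "+", "-", "-", "+", "+", "-", "+"])
d2 = np.sum((pts - np.array([1, 1])) ** 2, axis=1)     # squared distances to (1, 1)
for k in (3, 5, 7):
    nearest = np.argsort(d2, kind="stable")[:k]        # indices of the k closest points
    print(k, Counter(cls[nearest]).most_common(1)[0][0])
# 3 -> '+', 5 -> '+', 7 -> '-' (the sixth and seventh neighbors tip the majority to '-')
```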
CHAPTER 3
REGULARIZATION
Figure 3.1: The zig-zag line on the left panel is consistent over the blue and red training sample, but it is a complex separation
surface that is not likely to generalize well to unseen data. In contrast, the decision surface on the right panel is simpler and
might generalize better in spite of its misclassification of a few points of the training sample.
Underfitting refers to a model that can neither model the training data nor generalize to new data. An under-fit machine learning model is not a suitable
model and will be obvious as it will have poor performance on the training data. Underfitting is often not discussed as it is
easy to detect given a good performance metric. The remedy is to move on and try alternate machine learning algorithms.
In the scenario of Figure. 3.2(b), we could choose the quadratic hypothesis hθ (x) = θ0 + θ1 x + θ2 x 2 to fit the data, which
seems to work pretty well. The extreme scenario corresponds to Figure. 3.2(c), where the data is perfectly fitting all five of our
training examples using the higher order hypothesis hθ (x) = θ0 + θ1 x + θ2 x 2 + θ3 x 3 + θ4 x 4 . The issue of overly fitting the
training data using a certain hypothesis is known as “overfitting”. We also say the corresponding algorithm has “high variance”.
More concretely, overfitting occurs when, having too many features, the learned hypothesis may fit the training set very
well (J(θ) ' 0), but fail to generalize to new examples (predict prices on new examples). Overfitting happens when a model
learns the detail and the noise1 in the training data to the extent that it negatively impacts the performance of the model on
new data. This means that the noise or random fluctuations in the training data is picked up and learned as concepts by the
model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.
Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function.
As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how
much detail the model learns. For example, decision trees are nonparametric machine learning algorithms that are very flexible
and subject to overfitting training data. This problem can be addressed by pruning a tree after it has been learned in order to
remove some of the details it has picked up.
We have discussed the problem of overfitting and underfitting for linear regression, but the same problems could apply for
logistic regression as well.
Figure 3.3: Binary classification with two features x_1 and x_2: (a) a linear decision boundary that underfits, (b) a quadratic decision boundary, (c) a high-order polynomial decision boundary that overfits.
In Figure. 3.3(a), we fit the data using a simple hypothesis hθ (x) = g(θ0 + θ1 x1 + θ2 x2 ), which is equivalent to using a
1 Noise can occur both at the feature level and at the label level. Some features might correspond to measurements taken by sensors. For instance,
a robot might use a laser range finder to compute its distance to a wall. However, this sensor might fail and return an incorrect value. In a sentiment
classification problem, someone might have a typo in their review of a course. These would lead to noise at the feature level. There might also be
noise at the label level. A student might write a scathingly negative review of a course, but then accidentally click the wrong button for the course
rating.
straight line to separate negative and positive examples. Obviously, this does not look like a good fit and once again this is
an example of "underfitting". In contrast, in Figure. 3.3(b), we add to the features some quadratic terms, giving a hypothesis
h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1^2 + θ_4 x_2^2 + θ_5 x_1 x_2). This choice of hypothesis corresponds to a decision boundary which is a
more realistic fit for the training data. Figure. 3.3(c) shows an extreme scenario where a very high order polynomial is used to
fit the data. As a result, logistic regression will go to a great length to find a decision boundary that overfits the training data,
and this will not generalize well for future training examples.
When we have a machine learning problem that involves a lot of features but few training examples, overfitting can
become a problem. In order to address overfitting, there are two main options:
• reduce the number of features by manually selecting which important features to keep
• use Model Selection algorithms, which help to choose which features to keep and which features to throw out (in
subsequent chapters)
However, features are information, and by throwing away some features we are throwing away some information that could
help in improving the predictions. If we decide not to get rid of some features, we need to resort to "regularization" in order to
handle the problem of overfitting. Using regularization, we will keep all the features, but reduce the magnitude of the values of
θ_j. This method works well when we have a lot of features, where each one contributes a bit to predicting y.
3.1.2 Capacity
We can control whether a model is more likely to overfit or under-fit by altering its “capacity”. Informally, a model’s capacity
is its ability to fit a wide variety of functions. Models with low capacity may struggle to fit the training set. Models with high
capacity can overfit by memorizing properties of the training set that do not serve them well on the test set. One way to
control the capacity of a learning algorithm is by choosing its hypothesis space, the set of functions that the learning algorithm
is allowed to select as being the solution. For example, the linear regression algorithm has the set of all linear functions of its
input as its hypothesis space. We can generalize linear regression to include polynomials, rather than just linear functions, in
its hypothesis space. Doing so increases the model’s capacity. Ideally, we want to select a model at the sweet spot between
underfitting and overfitting. This is the goal, but is very difficult to do in practice.
A polynomial of degree one gives us the linear regression model with which we are already familiar, with prediction
hθ (x) = θ0 + θ1 x
By introducing x 2 as another feature provided to the linear regression model, we can learn a model that is quadratic as a
function of x:
hθ (x) = θ0 + θ1 x + θ2 x 2
Though this model implements a quadratic function of its input, the output is still a linear function of the parameters, so we
can still use the normal equations to train the model in closed form. We can continue to add more powers of x as additional
features, for example to obtain a polynomial of degree 9:
h_θ(x) = θ_0 + \sum_{j=1}^{9} θ_j x^j
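As an illustration of this idea (a sketch with synthetic data assumed purely for the example), we can fit polynomials of increasing degree by ordinary least squares; a numerically stable least-squares solve is used here in place of explicitly forming the normal equations.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 10)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.05 * rng.standard_normal(x.size)  # quadratic + noise

def fit_poly(x, y, degree):
    """Least-squares fit of h_theta(x) = theta_0 + sum_j theta_j x^j."""
    X = np.vander(x, degree + 1, increasing=True)   # columns 1, x, x^2, ...
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

for degree in (1, 2, 9):
    theta = fit_poly(x, y, degree)
    residual = y - np.vander(x, degree + 1, increasing=True) @ theta
    print(degree, round(float(residual @ residual), 6))
# The degree-9 fit drives the training error to nearly zero but, as discussed
# above, it is the model most likely to overfit.
```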
Machine learning algorithms will generally perform best when their capacity is appropriate in regard to the true complexity
of the task they need to perform and the amount of training data they are provided with. Models with insufficient capacity
are unable to solve complex tasks. Models with high capacity can solve complex tasks, but when their capacity is higher than
needed to solve the present task they may overfit. The most extreme case of arbitrarily high capacity is reached through
non-parametric models (k-NN, decision trees, etc.)
Fig. 3.4 shows this principle in action. We compare a linear, quadratic and degree-9 predictor attempting to fit a problem
where the true underlying function is quadratic. The linear function is unable to capture the curvature in the true underlying
problem, so it under-fits. The degree-9 predictor is capable of representing the correct function, but it is also capable of
representing infinitely many other functions that pass exactly through the training points, because we have more parameters
than training examples. We have little chance of choosing a solution that generalizes well when so many wildly different solutions
exist. In practice, we estimate the generalization error of a machine learning model by measuring its performance on a test set of examples that were
collected separately from the training set. Considering for example linear regression, the first step would be to learn the
parameters θ from the training data, thus minimizing the training error J(θ), and the second step is to compute the test error
J_test(θ) given by
J_{test}(θ) = \frac{1}{2 m_{test}} \sum_{i=1}^{m_{test}} \left( h_θ(x_{test}^{(i)}) − y_{test}^{(i)} \right)^2
Quantifying the capacity of the model allows statistical learning theory to make quantitative predictions. The most important
results in statistical learning theory show that the discrepancy between training error and generalization error is bounded from
above by a quantity that grows as the model capacity grows but shrinks as the number of training examples increases. These
bounds provide intellectual justification that machine learning algorithms can work, but they are rarely used in practice when
working with deep learning algorithms.
We must remember that while simpler functions are more likely to generalize (to have a small gap between training and test
error) we must still choose a sufficiently complex hypothesis to achieve low training error. Typically, training error decreases
until it asymptotes to the minimum possible error value as model capacity increases (assuming the error measure has a minimum
value). Typically, generalization error has a U-shaped curve as a function of model capacity. This is illustrated in Fig. 3.5.
Remark 8. Note that it is possible for the model to have optimal capacity and yet still have a large gap between training and
generalization error. In this situation, we may be able to reduce this gap by gathering more training examples.
Figure 3.5: Typical relationship between capacity and error: training error decreases as capacity grows, while generalization error follows a U-shaped curve, with an underfitting zone to the left of the optimal capacity and an overfitting zone to the right.
We can also give a learning algorithm a preference for one solution in its hypothesis space over another. This means that
both functions are eligible, but one is preferred. The unpreferred solution will be chosen only if it fits the training data significantly
better than the preferred solution.
For example, we can modify the training criterion for linear regression to include weight decay. To perform linear regression
with weight decay, we minimize a sum comprising both the mean squared error on the training set and a criterion J(θ) that expresses
a preference for the weights to have smaller squared L2 norm. Specifically,
J(θ) = MSE_{train} + λ \, Θ^T Θ    (3.1)
where λ is a value chosen ahead of time that controls the strength of our preference for smaller weights. When λ = 0, we
impose no preference, and larger λ forces the weights to become smaller. Minimizing J(θ) results in a choice of weights that
make a tradeoff between fitting the training data and being small. This gives us solutions that have a smaller slope, or put
weight on fewer of the features. As an example of how we can control a model's tendency to overfit or underfit via weight
decay, we can train a high-degree polynomial regression model with different values of λ. See Fig. 3.6 for the results.
Going back to the extreme scenario of Figure. 3.2(c), suppose we want to penalize the higher-order parameters θ_3 and θ_4 and make them really small. One way of doing it is to adjust the cost function and write it as
J(θ) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_θ(x^{(i)}) − y^{(i)} \right)^2 + 1000 \, θ_3^2 + 1000 \, θ_4^2
The only way for this new cost function to be minimized is for θ_3 and θ_4 to be made very small: θ_3 ≈ 0 and θ_4 ≈ 0. This is
equivalent to getting rid of θ_3 and θ_4 and ending up with a quadratic hypothesis that fits the data well.
In general, the idea behind regularization is as follows. If we were to pick small values for θ0 , θ1 , ... , θn , we could end up with
a simpler hypothesis which is less prone to overfitting. Back to the housing example, we could have 100 features x1 , ... , x100
and we need to learn 101 parameters θ0 , θ1 , ... , θ100 in order to predict the price of a certain house. It is not clear which
of these parameters would be tied to a high order term and hence needs to be made less relevant. In regularization, we pick
the cost function and modify it to shrink all the parameters by adding an extra regularization term λ \sum_{j=1}^{n} θ_j^2, where λ is the
"regularization parameter". The resulting "regularized cost function" is given by
J(θ) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_θ(x^{(i)}) − y^{(i)} \right)^2 + λ \sum_{j=1}^{n} θ_j^2 \right]    (3.2)
By convention the θ_0 term is not included in the adjusted cost function. The regularization parameter λ is used to control a
tradeoff between two different goals: the first, captured by the first term in the cost function, conveys the need to fit the data,
and the second, captured in the second term, conveys the need to keep the parameters small. This technique is known as ridge
regression (L2 regularization). If λ is made very large (λ ≈ 10^{10}), then all of the parameters θ_1, ..., θ_n will be made very
small, and this results in the hypothesis h_θ(x) = θ_0, which leads to underfitting. Figure. 3.7 is a geometric illustration of the
concept of ridge regression.
Figure 3.7: In L2 regularization the regularizer term has a minimum of 0 and the error term has its own minimum; the optimum solution of the new regularized cost function lies in between.
If we apply gradient descent to the regularized cost function, the update for θ_j (j ≥ 1) becomes
θ_j := θ_j \left(1 − α \frac{λ}{m}\right) − α \frac{1}{m} \sum_{i=1}^{m} \left( h_θ(x^{(i)}) − y^{(i)} \right) x_j^{(i)}
Since the term (1 − α λ/m) is smaller than 1, θ_j is updated by first shrinking it to θ_j (1 − α λ/m) and then subtracting the same
gradient term that we used to have before regularization.
If we were to use the normal equation with regularization, then the vector of learned parameters Θ ends up being
Θ = (X^T X + λ A)^{−1} X^T y    (3.6)
where A is an (n + 1) × (n + 1) matrix given by
A = diag(0, 1, 1, ..., 1)
i.e., the identity matrix with its top-left entry (corresponding to θ_0) set to zero.
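A minimal sketch of the regularized normal equation (3.6); the design matrix X is assumed to already contain a leading column of ones, so that θ_0 is left unpenalized.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * A) theta = X^T y with A = diag(0, 1, ..., 1)."""
    n_plus_1 = X.shape[1]
    A = np.eye(n_plus_1)
    A[0, 0] = 0.0                          # do not regularize the intercept theta_0
    return np.linalg.solve(X.T @ X + lam * A, X.T @ y)

# toy usage with made-up data
X = np.column_stack([np.ones(5), np.arange(5.0)])
y = np.array([1.0, 2.1, 2.9, 4.2, 5.1])
print(ridge_fit(X, y, lam=1.0))
```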
• Try decreasing λ
• Try increasing λ
One should not randomly choose one of these adjustments and use it toward improving the performance of a machine
learning algorithm. Fortunately, there is a pretty simple technique that can allow us to rule out at least half of these options,
and as a result save us some time pursuing something that will not work. This technique is referred to as a "Machine Learning
Diagnostic": a test that one can run to gain insight into what is and what is not working with a learning algorithm, and
gain guidance on the best way to improve its performance. However, one should note that diagnostics can take quite some
time to implement, but doing so can be a very good use of time.
• ...
3.5.3 Cross-Validation
Let us suppose we are, as usual, given a training set S. Given what we know about empirical risk minimization, here is what
might initially seem like a reasonable algorithm for model selection, resulting from training-error minimization: simply pick the
model with the lowest training error on S.
This algorithm does not work. Consider choosing the order of a polynomial. The higher the order of the polynomial, the
better it will fit the training set S, and thus the lower the training error. Hence, this method will always select a high-variance,
high-degree polynomial model, which we saw previously is often a poor choice.
Here is an algorithm that works better. In hold-out cross validation, we do the following:
1. Randomly split S into S_train (say, 70% of the data) and S_cv (the remaining 30%). Here, S_cv is called the hold-out cross
validation set.
2. Train each candidate model M_i on S_train only, to get some hypothesis h_i.
3. Select and output the hypothesis h_i that had the smallest error on the hold-out cross validation set. For a hypothesis h,
we define this error (also called empirical risk or empirical error) to be
\hat{\varepsilon}_{S_{cv}}(h) = \frac{1}{m_{cv}} \sum_{i=1}^{m_{cv}} I\{ h(x^{(i)}) \neq y^{(i)} \}    (3.11)
where the sum is taken over the m_{cv} examples in S_cv.
By testing on a set of examples S_cv that the models were not trained on, we obtain a better estimate of each hypothesis
h_i's true generalization error, and can then pick the one with the smallest estimated generalization error. Usually, somewhere
between 1/4 and 1/3 of the data is used in the hold-out cross validation set, and 30% is a typical choice. Validation sets are
used to optimize hyperparameters.
Optionally, step 3 in the algorithm may also be replaced with selecting the model M_i according to \arg\min_i \hat{\varepsilon}_{S_{cv}}(h_i), and then
retraining M_i on the entire training set S. (This is often a good idea, with one exception being learning algorithms that are
very sensitive to perturbations of the initial conditions and/or data. For these methods, M_i doing well on S_train does not
necessarily mean it will also do well on S_cv, and it might be better to forgo this retraining step.)
Earlier we discussed how a held-out test set, composed of examples coming from the same distribution as the training set,
can be used to estimate the generalization error of a learner, after the learning process has completed. It is important that the
test examples are not used in any way to make choices about the model, including its hyperparameters. For this reason, no
example from the test set can be used in the validation set. Therefore, we always construct the validation set from the training
data.
Example 12. Let us employ the leave-one-out cross validation (LOOCV) approach to choose k in k nearest neighbors. As the
name suggests, it involves using a single example from the training data as the validation data, and the remaining examples as
the training data. This is repeated such that each example in the training data is used once as the validation data. Formally,
given a classifier C , the approach computes the Misclassification Error of C as follows. Let d be the number of training examples
and let us assume that the examples are indexed from 1 to d.
• Error = 0
• For i = 1 to d
– Remove the i −th example from the training set and use it as your validation set
– Train the classifier C on the new training set
– If the test example is incorrectly classified by C then Error = Error + 1
– Put back the i −th example in the training set
• Error
To select the best classifier among a group of classifiers, you apply the LOOCV approach to all the classifiers and choose
the one that has the smallest error.
x Class
-0.1 −
0.7 +
1 +
1.6 −
2 +
2.5 +
3.2 −
3.5 −
4.1 +
4.9 +
1. What is the LOOCV error of the 1-nearest neighbor algorithm on the dataset given above (use Euclidean distance)?
2. What is the LOOCV error of the 3-nearest neighbor algorithm on the dataset given above (use Euclidean distance)?
Solution.
1. For i = 1, pick −0.1 and classify it as 0.7, i.e., +. This is an incorrect classification, then Error = 0 + 1 = 1. For i = 2,
pick 0.7 and classify it as 1, i.e., +. This is a correct classification, then Error = 1. For i = 3, pick 1 and classify it as
0.7, i.e., +. This is a correct classification, then Error = 1. For i = 4, pick 1.6 and classify it as 2, i.e., +. This is an
incorrect classification, then Error = 2. For i = 5, pick 2 and classify it as 1.6, i.e., −. This is an incorrect classification,
then Error = 3. For i = 6, pick 2.5 and classify it as 2, i.e., +. This is a correct classification, then Error = 3. For
i = 7, pick 3.2 and classify it as 3.5, i.e., −. This is a correct classification, then Error = 3. For i = 8, pick 3.5 and
classify it as 3.2, i.e., −. This is a correct classification, then Error = 3. For i = 9, pick 4.1 and classify it as 3.5, i.e.,
−. This is an incorrect classification, then Error = 4. For i = 10, pick 4.9 and classify it as 4.1, i.e., +. This is a correct
classification, then Error = 4. Hence, the LOOCV error of 1 nearest neighbor algorithm is 4.
2. For i = 1, pick −0.1 and classify it as +. This is an incorrect classification, then Error = 0 + 1 = 1. For i = 2, pick 0.7
and classify it as −. This is an incorrect classification, then Error = 2. For i = 3, pick 1 and classify it as +. This is a
correct classification, then Error = 2. For i = 4, pick 1.6 and classify it as +. This is an incorrect classification, then
Error = 3. For i = 5, pick 2 and classify it as +. This is a correct classification, then Error = 3. For i = 6, pick 2.5
and classify it as −. This is an incorrect classification, then Error = 4. For i = 7, pick 3.2 and classify it as +. This is
an incorrect classification, then Error = 5. For i = 8, pick 3.5 and classify it as +. This is an incorrect classification,
then Error = 6. For i = 9, pick 4.1 and classify it as −. This is an incorrect classification, then Error = 7. For i = 10,
pick 4.9 and classify it as −. This is an incorrect classification, then Error = 8. Hence, the LOOCV error of the 3 nearest
neighbor algorithm is 8.
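Both LOOCV counts can be reproduced with a short script (a sketch, using the data set of Example 12).

```python
import numpy as np
from collections import Counter

x = np.array([-0.1, 0.7, 1, 1.6, 2, 2.5, 3.2, 3.5, 4.1, 4.9])
y = np.array(["-", "+", "+", "-", "+", "+", "-", "-", "+", "+"])

def loocv_error(k):
    errors = 0
    for i in range(len(x)):
        train = np.delete(np.arange(len(x)), i)            # leave example i out
        d = np.abs(x[train] - x[i])                        # 1-D Euclidean distances
        nearest = train[np.argsort(d, kind="stable")[:k]]
        prediction = Counter(y[nearest]).most_common(1)[0][0]
        errors += prediction != y[i]
    return errors

print(loocv_error(1), loocv_error(3))   # 4 and 8, matching the worked solution
```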
LOOCV has a major advantage over the validation set approach. It has far less bias. In LOOCV, we repeatedly fit the
statistical learning method using training sets that contain m − 1 observations, almost as many as are in the entire data set.
LOOCV has the potential to be expensive to implement, since the model has to be fit m times (think linear regression for
instance). Hold out cross validation wastes about 30% of the data. Even if we were to take the optional step of retraining the
model on the entire training set, it is still as if we are trying to find a good model for a learning problem in which we had 0.7m
training examples, rather than m training examples, since we are testing models that were trained on only 0.7m examples each
time. While this is fine if data is abundant and/or cheap, in learning problems in which data is scarce (consider a problem with
m = 20, say), we would like to do something better; k-fold cross validation is one option.
The k-fold cross validation approach involves randomly dividing the set of observations into k groups, or folds, of approxi-
mately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. The mean
squared error, MSE1 , is then computed on the observations in the held-out fold. This procedure is repeated k times; each time,
a different group of observations is treated as a validation set. This process results in k estimates of the test error, MSE1 ,
MSE2 , ... , MSEk . The k−fold CV estimate is computed by averaging these values,
CV_k = \frac{1}{k} \sum_{i=1}^{k} \mathrm{MSE}_i    (3.12)
Figure. 3.8 illustrates the k-fold CV approach. It is not hard to see that LOOCV is a special case of k−fold CV in which k
is set to equal m. In practice, one typically performs k−fold CV using k = 5 or k = 10. What is the advantage of using k = 5
or k = 10 rather than k = m? The most obvious advantage is computational.
The validation set approach can lead to overestimates of the test error, since in this approach the training set used to fit the statistical learning method contains only half the observations of the entire data set. Using this logic, it is not hard to see that LOOCV will give approximately unbiased estimates of the test error, since each training set contains m − 1 observations, which is almost as many as the number of observations in the full data set. Performing k−fold CV for, say, k = 5 or k = 10 will lead to an intermediate level of bias, since each training set contains (k − 1)m/k observations, fewer than in the LOOCV approach but substantially more than in the validation set approach. Therefore, from the perspective of bias reduction, LOOCV is to be preferred to k−fold CV.
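As an illustration of (3.12), the sketch below (plain numpy; the function names and the least-squares example are ours, not from the notes) splits the m observations into k folds, fits the learning method on the remaining k − 1 folds, computes the MSE on each held-out fold, and averages the k estimates.

import numpy as np

def k_fold_cv_mse(X, y, fit, predict, k=5, seed=0):
    m = len(y)
    idx = np.random.default_rng(seed).permutation(m)   # random fold assignment
    folds = np.array_split(idx, k)
    mses = []
    for held_out in folds:
        train = np.setdiff1d(idx, held_out)
        model = fit(X[train], y[train])                # fit on the other k-1 folds
        err = y[held_out] - predict(model, X[held_out])
        mses.append(np.mean(err ** 2))                 # MSE_i on the held-out fold
    return np.mean(mses)                               # CV_k = (1/k) sum_i MSE_i

# example usage with least-squares linear regression as the learning method
fit = lambda X, y: np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)[0]
predict = lambda w, X: np.c_[np.ones(len(X)), X] @ w

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(40, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.5, size=40)
print(k_fold_cv_mse(X, y, fit, predict, k=5))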
CHAPTER 4
SUPPORT VECTOR MACHINES
This chapter presents one of the most theoretically well-motivated and practically effective classification algorithms in modern machine learning: Support Vector Machines (SVMs). SVMs are our second non-probabilistic classifier after k−NN. Their success is due to their excellent empirical performance; for many applications they are hard to beat. They have been found to be particularly useful in applications where the number of features is much larger than the number of training examples. This is because the number of parameters that must be set for the SVM is related to the number of training examples and not to the number of features. We first introduce the algorithm for separable datasets, then present its general version designed for non-separable datasets. We start with the description of the problem of linear binary classification.
Not only do we want the instances to be on the right side of the hyperplane, but we also want them some distance away, for better generalization. The intuition “the further away the better” translates to solving
\max_{w, b} \; \min_{n} \; \frac{|w^T x_n + b|}{\|w\|_2}
The distance from the hyperplane to the instances closest to it on either side is called the margin, which we want to maximize
for best generalization. The objective is to specify b and w to maximize the margin.
4.2 Support Vector Machines: Separable Case
In this section, we assume that the training sample S can be linearly separated, that is, we assume the existence of a
hyperplane that perfectly separates the training sample into two populations of positively and negatively labeled points, as
illustrated by the left panel of Figure. 4.1. There are then infinitely many such separating hyperplanes. Which hyperplane
should a learning algorithm select? The solution returned by the SVM algorithm is the hyperplane with the maximum margin, or
distance to the closest points, and is thus known as the maximum-margin hyperplane. The right panel of Figure. 4.1 illustrates
that choice.
A separating hyperplane can be written as the set of points x satisfying
w \cdot x + b = 0    (4.3)
where w ∈ R^n is a non-zero vector normal to the hyperplane and b ∈ R a scalar. Note that this definition of a hyperplane is invariant to non-zero scalar multiplication. Hence, for a hyperplane that does not pass through any sample point, we can scale w and b appropriately such that \min_{(x,y)\in S} |w \cdot x + b| = 1. We define this representation of the hyperplane, i.e., the corresponding pair (w, b), as the canonical hyperplane. The distance of any point x_0 ∈ R^n to the hyperplane defined by (4.3) is given by
\frac{|w \cdot x_0 + b|}{\|w\|}    (4.4)
Thus, for a canonical hyperplane, the margin ρ is given by
\rho = \min_{(x,y)\in S} \frac{|w \cdot x + b|}{\|w\|} = \frac{1}{\|w\|}    (4.5)
Figure: the canonical hyperplane w · x + b = 0, the marginal hyperplanes w · x + b = +1 and w · x + b = −1, and the margin between them.
\min_{x \in X} f(x)
subject to: g_i(x) \le 0, \; \forall i \in \{1, \ldots, m\}    (4.8)
This general formulation does not make any convexity assumptions and can be augmented with equality constraints. It is
referred to as the primal problem in contrast with a related problem introduced later. We will denote by p ∗ the optimal value
of the objective.
For any x ∈ X, we will denote by g(x) the vector (g_1(x), \ldots, g_m(x))^T. Thus, the constraints can be written as g(x) ≤ 0. To any constrained optimization problem, we can associate a Lagrange function that plays an important role in the analysis of the problem and its relationship with another related optimization problem.
Definition 1. (Lagrangian) The Lagrange function or Lagrangian associated with the general constrained optimization problem defined in (4.8) is the function defined over X × R_+^m by:
\forall x \in X, \; \forall \alpha_i \ge 0, \quad L(x, \alpha) = f(x) + \sum_{i=1}^{m} \alpha_i g_i(x)    (4.9)
where the variables αi are known as the Lagrange multipliers with α = (α1 , ... , αm )T .
Any equality constraint of the form g(x) = 0 for a function g can be equivalently expressed by two inequalities: −g(x) ≤ 0 and +g(x) ≤ 0. Let α_− ≥ 0 be the Lagrange multiplier associated with the first constraint and α_+ ≥ 0 the one associated with the second constraint. The sum of the terms corresponding to these constraints in the definition of the Lagrange function can therefore be written as α g(x) with α = α_+ − α_−. Thus, in general, for an equality constraint g(x) = 0 the Lagrangian is augmented with a term α g(x) but with α ∈ R not constrained to be non-negative. Note that in the case of a convex optimization problem, equality constraints g(x) = 0 are required to be affine since both g(x) and −g(x) are required to be convex.
Definition 2. (Dual Function) The Lagrange dual function associated with the constrained optimization problem is defined by
\forall \alpha \ge 0, \quad F(\alpha) = \min_{x \in X} L(x, \alpha) = \min_{x \in X} \left( f(x) + \sum_{i=1}^{m} \alpha_i g_i(x) \right)    (4.10)
Note that F is always concave, since the Lagrangian is linear with respect to α and since the minimum preserves concavity.
We further observe that
\forall \alpha \ge 0, \quad F(\alpha) \le p^*    (4.11)
since for any feasible x, f(x) + \sum_{i=1}^{m} \alpha_i g_i(x) \le f(x). The dual function naturally leads to the following optimization problem.
Definition 3. (Dual Problem) The dual (optimization) problem associated with the constrained optimization problem is
\max_{\alpha} F(\alpha) \quad \text{subject to: } \alpha \ge 0    (4.12)
The dual problem is always a convex optimization problem, as the maximization of a concave function. Let d^* denote its optimal value. By (4.11), the inequality d^* \le p^* always holds; this is known as weak duality. Equality, d^* = p^*, is known as strong duality and does not hold in general. However, strong duality does hold for convex optimization problems whose constraints satisfy a constraint qualification. When the constraints are qualified, the solutions can be characterized by the Karush-Kuhn-Tucker (KKT) conditions.
Theorem 1. (Karush-Kuhn-Tucker Theorem) Assume that f, g_i : X → R, ∀i ∈ {1, ..., m}, are convex and differentiable and that the constraints are qualified. Then x^* is a solution of the constrained program if and only if there exists α^* ≥ 0 such that
\nabla_x L(x^*, \alpha^*) = 0    (4.15)
\nabla_{\alpha} L(x^*, \alpha^*) \le 0    (4.16)
\alpha^* \cdot g(x^*) = \sum_{i=1}^{m} \alpha_i^* g_i(x^*) = 0    (4.17)
These conditions are known as the KKT conditions. The last two conditions are known as the complementarity conditions.
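As a small illustration (our example, not from the notes), consider minimizing f(x) = x^2 subject to the single constraint g(x) = 1 − x ≤ 0. The KKT conditions identify the constrained minimizer directly:
L(x, \alpha) = x^2 + \alpha (1 - x), \qquad \alpha \ge 0
\nabla_x L = 2x - \alpha = 0 \;\Rightarrow\; x = \alpha / 2
\alpha \, g(x) = \alpha (1 - x) = 0 \;\Rightarrow\; \alpha = 0 \;\vee\; x = 1
Taking α = 0 would give x = 0, which violates g(x) ≤ 0; hence the constraint is active, x^* = 1 and α^* = 2. A non-zero multiplier forces the corresponding constraint to be tight, exactly as in the support-vector condition (4.21) below.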
For the SVM primal problem in the separable case, \min_{w,b} \frac{1}{2}\|w\|^2 subject to y_i(w \cdot x_i + b) \ge 1 for all i, the Lagrangian is L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i [y_i(w \cdot x_i + b) - 1]. The KKT conditions are obtained by setting the gradient of the Lagrangian with respect to the primal variables w and b to zero and by writing the complementarity conditions:
\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i    (4.19)
\nabla_b L(w, b, \alpha) = -\sum_{i=1}^{m} \alpha_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0    (4.20)
\forall i, \; \alpha_i [y_i(w \cdot x_i + b) - 1] = 0 \;\Rightarrow\; \alpha_i = 0 \;\vee\; y_i(w \cdot x_i + b) = 1    (4.21)
We can see that the weight vector w solution of the SVM problem is a linear combination of the training set vectors x_1, ..., x_m. A vector x_i appears in that expansion iff α_i ≠ 0. Such vectors are called support vectors. By the complementarity conditions, if α_i ≠ 0, then y_i(w \cdot x_i + b) = 1. Thus, support vectors lie on the marginal hyperplanes w \cdot x_i + b = \pm 1.
Support vectors fully define the maximum-margin hyperplane or SVM solution, which justifies the name of the algorithm. By definition, vectors not lying on the marginal hyperplanes do not affect the definition of these hyperplanes; in their absence, the solution to the SVM problem remains unchanged. Note that while the solution w of the SVM problem is unique, the support vectors are not. In dimension n, n + 1 points are sufficient to define a hyperplane. Thus, when more than n + 1 points lie on a marginal hyperplane, different choices are possible for the n + 1 support vectors.
Plugging the expression (4.19) of w into the Lagrangian and using (4.20) yields a function of the dual variables α only:
L(w, b, \alpha) = \frac{1}{2}\left(\sum_i \alpha_i y_i x_i\right)\cdot\left(\sum_j \alpha_j y_j x_j\right) - \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_i \alpha_i y_i b + \sum_i \alpha_i    (4.22)
= \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)    (4.23)
This leads to the following dual optimization problem for SVMs in the separable case:
\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
subject to: \alpha_i \ge 0, \; \forall i \in [1, m]
\sum_{i=1}^{m} \alpha_i y_i = 0    (4.24)
It can easily be shown that the objective function F(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) is a concave function of α. Since the constraints are affine and convex, the maximization problem (4.24) is equivalent to a convex optimization problem. Since F is a quadratic function of α, this dual optimization problem is also a QP problem, as in the case of the primal optimization, and once again both general-purpose and specialized QP solvers can be used to obtain the solution. Moreover, since the constraints are affine, they are qualified and strong duality holds. Thus, the primal and dual problems are equivalent, i.e., the solution α of the dual problem (4.24) can be used directly to determine the hypothesis returned by SVMs, using (4.19):
h(x) = \mathrm{sgn}(w \cdot x + b) = \mathrm{sgn}\left(\sum_{i=1}^{m} \alpha_i y_i (x_i \cdot x) + b\right)    (4.25)
Since support vectors lie on the marginal hyperplanes, for any support vector x_i, w \cdot x_i + b = y_i, and thus b can be obtained via
b = y_i - \sum_{j=1}^{m} \alpha_j y_j (x_i \cdot x_j)    (4.26)
The dual optimization problem and the decision rule reveal an important property of SVMs: the hypothesis solution depends
only on inner products between vectors and not directly on the vectors themselves.
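To make the dual concrete, here is a minimal numerical sketch (our code, not part of the notes; it assumes numpy and scipy are available) that solves (4.24) with a general-purpose solver on a tiny separable dataset, then recovers w via (4.19) and b via (4.26).

import numpy as np
from scipy.optimize import minimize

# toy, linearly separable data with labels in {-1, +1}
X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, 2.0],
              [-1.0, -1.0], [-2.0, -1.5], [0.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
m = len(y)

G = (y[:, None] * X) @ (y[:, None] * X).T   # G[i, j] = y_i y_j <x_i, x_j>

def neg_dual(alpha):
    # negative of the dual objective (4.24), since scipy minimizes
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

constraints = ({'type': 'eq', 'fun': lambda a: a @ y},)   # sum_i alpha_i y_i = 0
bounds = [(0, None)] * m                                   # alpha_i >= 0
alpha = minimize(neg_dual, np.zeros(m), bounds=bounds, constraints=constraints).x

w = (alpha * y) @ X                       # (4.19)
sv = np.argmax(alpha)                     # index of one support vector
b = y[sv] - (alpha * y) @ (X @ X[sv])     # (4.26)
print("alpha:", np.round(alpha, 3))
print("w:", np.round(w, 3), "b:", round(b, 3))
print("y_i (w.x_i + b):", np.round(y * (X @ w + b), 3))   # all >= 1; support vectors attain 1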
Equation (4.26) can now be used to derive a simple expression of the margin ρ in terms of α. Since (4.26) holds for all i with α_i ≠ 0, multiplying both sides by α_i y_i and taking the sum leads to
\sum_{i=1}^{m} \alpha_i y_i b = \sum_{i=1}^{m} \alpha_i y_i^2 - \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
Using the fact that y_i^2 = 1 along with (4.19) and (4.20) then yields
0 = \sum_{i=1}^{m} \alpha_i - \|w\|^2
that is, \|w\|^2 = \sum_{i=1}^{m} \alpha_i, and hence the margin satisfies \rho^2 = \frac{1}{\|w\|^2} = \frac{1}{\sum_{i=1}^{m} \alpha_i}.
Remark 10. In SVMs, the optimization problems are convex, so we do not have to worry about local optima, in contrast to neural networks, which are plagued by them.
4.3 Kernels
Kernel methods are widely used in machine learning. They are flexible techniques that can be used to extend algorithms
such as SVMs to define non-linear decision boundaries. Other algorithms that only depend on inner products between sample
points can be extended similarly. The main idea behind these methods is based on so-called kernels or kernel functions, which,
under some technical conditions of symmetry and positive-definiteness, implicitly define an inner product in a high-dimensional
space. Replacing the original inner product in the input space with positive definite kernels immediately extends algorithms such
as SVMs to a linear separation in that high-dimensional space, or, equivalently, to a non-linear separation in the input space.
In the previous sections, we presented an algorithm for linear classification, SVMs, which is both effective in applications
and benefits from a strong theoretical justification. In practice, linear separation is often not possible. Figure. 4.4(a) shows an
example where any hyperplane crosses both populations. However, one can use more complex functions to separate the two
sets as in figure. 4.4(b). One way to define such a non-linear decision boundary is to use a non-linear mapping Φ from the
input space X to a higher-dimensional space H, where linear separation is possible.
Figure 4.4: Non-linearly separable case. The classification task consists of discriminating between solid squares and solid circles. (a) No hyperplane can separate the two populations. (b) A non-linear mapping can be used instead.
The dimension of H can truly be very large in practice. For example, in the case of document classification, one may wish to use as features sequences of three consecutive words, i.e., trigrams. Thus, with a vocabulary of just 100,000 words, the dimension of the feature space H reaches 10^{15}. On the positive side, the margin bounds show that, remarkably, the generalization ability of large-margin classification algorithms such as SVMs does not depend on the dimension of the feature space, but only on the margin ρ and the number of training examples m. Thus, with a favorable margin ρ, such algorithms could succeed even in very high-dimensional spaces. However, determining the hyperplane solution requires multiple inner product computations in high-dimensional spaces, which can become very costly. A solution to this problem is to use kernel methods, which are based on kernels or kernel functions.
Definition 4. (Kernels) A function K : X × X → R is called a kernel over X. The idea is to define a kernel K such that, for any two points x, x′ ∈ X, K(x, x′) is equal to an inner product of the vectors Φ(x) and Φ(x′):
\forall x, x' \in X, \quad K(x, x') = \langle \Phi(x), \Phi(x') \rangle    (4.28)
for some mapping Φ : X → H to a Hilbert space H called a feature space. Since an inner product is a measure of the similarity of two vectors, K is often interpreted as a similarity measure between elements of the input space X.
The kernel has two main advantages:
1. Efficiency: K is often significantly more efficient to compute than Φ and an inner product in H; fewer computations are needed to evaluate K than ⟨Φ(x), Φ(x′)⟩.
2. Flexibility: there is no need to explicitly define or compute a mapping Φ. The kernel can be arbitrarily chosen as long as the existence of Φ is guaranteed; K should satisfy a condition known as Mercer's condition, i.e., K should be positive definite symmetric (PDS). This condition is important to guarantee the convexity of the optimization problem for algorithms such as SVMs, and thus convergence guarantees.
The following are some standard examples of kernels commonly used in applications.
1. Linear Kernels:
K(x, x') = \langle x, x' \rangle    (4.29)
2. Polynomial Kernels: For any constant c > 0, a polynomial kernel of degree d ∈ N is the kernel K defined over R^N by:
K(x, x') = (\langle x, x' \rangle + c)^d    (4.30)
3. Gaussian Kernels: For any constant σ > 0, a Gaussian kernel or radial basis function (RBF) kernel is the kernel K defined over R^N by:
K(x, x') = \exp\left(-\frac{\|x' - x\|^2}{2\sigma^2}\right)    (4.31)
Gaussian kernels are among the most frequently used kernels in applications.
4. Sigmoid Kernels: For any real constants a, b ≥ 0, a sigmoid kernel is the kernel K defined over R^N by:
K(x, x') = \tanh(a \langle x, x' \rangle + b)    (4.32)
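As a small illustration (our code, plain numpy; the function names are ours), the snippet below implements these kernels and checks numerically that the degree-2 polynomial kernel coincides with an explicit inner product in a 6-dimensional feature space, which is exactly the property exploited in the XOR example that follows.

import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def poly_kernel(x, xp, c=1.0, d=2):
    return (x @ xp + c) ** d

def rbf_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, xp, a=1.0, b=1.0):
    return np.tanh(a * (x @ xp) + b)

def phi_deg2(x, c=1.0):
    # explicit feature map for the degree-2 polynomial kernel in R^2
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2,
                     np.sqrt(2*c)*x1, np.sqrt(2*c)*x2, c])

x, xp = np.array([1.0, -1.0]), np.array([2.0, 0.5])
print(poly_kernel(x, xp))            # kernel value computed directly in R^2
print(phi_deg2(x) @ phi_deg2(xp))    # the same value via the 6-dimensional feature map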
As an example, consider the XOR classification problem in an input space of dimension N = 2, as shown in Figure. 4.5. The label of a point is blue iff exactly one of its coordinates is 1. The XOR classification problem is obviously non-separable in the input space.
Figure 4.5: Illustration of the XOR classification problem: the problem is linearly non-separable in the input space.
Now, using the quadratic kernel K(x, x') = (\langle x, x' \rangle + 1)^2, the problem is equivalent to an inner product in a feature space of dimension 6. Indeed, for any x, x' ∈ R^2,
K(x, x') = (x_1 x_1' + x_2 x_2' + 1)^2 = \Phi(x) \cdot \Phi(x'), \quad \text{with } \Phi(x) = (x_1^2, \; x_2^2, \; \sqrt{2}\, x_1 x_2, \; \sqrt{2}\, x_1, \; \sqrt{2}\, x_2, \; 1)^T
Thus, the features corresponding to a second-degree polynomial kernel are the original features x_1 and x_2, the products of these features, and the constant feature; more generally, the features associated with a polynomial kernel of degree d are all the monomials of degree at most d in the original features. The explicit expression of polynomial kernels as inner products proves directly that they are PDS kernels. Mapped to this six-dimensional space, the four XOR points become linearly separable, since the sign of the coordinate \sqrt{2}\, x_1 x_2 determines the label.
Figure 4.6: Illustration of the XOR classification problem: the problem becomes linearly separable in the feature space defined by the quadratic kernel.
In the feature space defined by a PDS kernel K, the dual optimization problem (4.24) becomes
\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j)
subject to: \alpha_i \ge 0, \; \forall i \in [1, m]
\sum_{i=1}^{m} \alpha_i y_i = 0    (4.34)
As a result, the hypothesis solution (4.25) can be re-written as
h(x) = \mathrm{sgn}\left(\sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b\right)    (4.35)
with b = y_i - \sum_{j=1}^{m} \alpha_j y_j K(x_i, x_j) for any support vector x_i.
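For instance, the XOR problem above can be handled by the kernelized dual directly. A brief sketch (assuming scikit-learn is available; gamma=1 and coef0=1 make the library's polynomial kernel match (⟨x, x′⟩ + 1)^2, and a very large C approximates the hard-margin case) is:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([-1, -1, 1, 1])    # label +1 iff exactly one coordinate is 1

clf = SVC(kernel='poly', degree=2, gamma=1, coef0=1, C=1e6).fit(X, y)
print(clf.predict(X))           # expected [-1 -1  1  1]: all four XOR points correctly separated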
Example 13. An SVM is trained with the following data:
x_1    x_2    y
1      0      1
−1     0      −1
2      0      1
−2     0      −1
We separable
propose tobyusethe
a linear kernel of
hyperplane to equation
solve the problem.
x1 x2 = 0.The kernel
Figure 5.2bis Killustrates x T xn . In other words, Φ(x) = x. We index
(xm , xn ) =that
T T T ¯ T¯ ¯m ¯ by ¯ ¯
the showing
data set the
as xprojection
1 = (1, 0) of, xthese
2 = (−1, , xthe
0) on
points (2, 0) , and x4 =space
3 = two-dimensional 0) . The
(−2, defined bydual
theiroptimization problem is:
¯ ¯ ¯ ¯
third and fourth coordinates. 4 4 4
X 1XX
Example 5.2 Gaussian kernels max α n − αn αm yn ym Km,n
α
n=1
2 n=1 m=1
subject to: αi ≥ 0, ∀i ∈ [1, 4]
subject to: α1 y1 + α2 y2 + α3 y3 + α4 y4 = 0
Substituting the values of the kernel matrix and using the symmetry of the problem (α_1 = α_2 and α_3 = α_4), we can re-write the dual optimization problem as
\max_{\alpha_1, \alpha_3 \ge 0} \; 2\alpha_1 + 2\alpha_3 - 2(\alpha_1 + 2\alpha_3)^2
Since α_3 is constrained to be nonnegative and, for any fixed value of α_1 + 2α_3, the objective is largest when α_3 = 0, we set α_3 = 0; maximizing 2α_1 − 2α_1^2 then gives α_1 = 1/2. Hence, we have shown that α_1 = α_2 = 1/2 and α_3 = α_4 = 0. Namely, x_1 and x_2 are support vectors, and x_3 and x_4 can be removed without changing the solution. This yields
w = \sum_{n} \alpha_n y_n x_n = \frac{1}{2}(x_1 - x_2) = (1, \; 0)^T
The constant b is given by b = y_n - \sum_{m=1}^{2} \alpha_m y_m K(x_n, x_m) for any support vector x_n. Picking x_1, we get b = 1 - \alpha_1 K(x_1, x_1) y_1 - \alpha_2 K(x_1, x_2) y_2 = 1 - \frac{1}{2}(1) - \frac{1}{2}(-1)(-1) = 0. Thus, the decision boundary is w^T x + b = x_1 = 0.