
ELEMENTS OF MACHINE LEARNING

Instructor: Dr. Mustafa El-Halabi

SPRING 2019

TABLE OF CONTENTS

LIST OF FIGURES

1 INTRODUCTION TO MACHINE LEARNING
1.1 Defining Machine Learning
1.2 Types of Machine Learning Algorithms
1.2.1 Supervised Learning: Linear Regression
1.2.2 Supervised Learning: Classification
1.2.3 Unsupervised Learning

2 LINEAR REGRESSION & GRADIENT DESCENT ALGORITHM
2.1 Gradient Descent Algorithm
2.2 Multivariate Linear Regression
2.2.1 Matrix Derivatives
2.2.2 The Normal Equations
2.3 Gradient Descent in Practice
2.3.1 Feature Scaling
2.3.2 Learning Rate
2.4 Features and Polynomial Regression

3 CLASSIFICATION
3.1 Logistic Regression
3.2 Generative Learning Algorithms
3.2.1 Gaussian Discriminant Analysis
3.3 Naive Bayes Algorithm
3.3.1 The Naive Bayes Classifier
3.3.2 Estimation using Maximum Likelihood
3.3.3 Spam Filtering
3.4 Decision Trees
3.4.1 Decision Tree Representation
3.4.2 Decision Tree Problem Setting
3.4.3 The Iterative Dichotomy 3 (ID3) Learning Algorithm
3.4.4 Which Attribute is the Best Classifier?
3.5 k-Nearest Neighbors (k-NN) Algorithm
3.5.1 Do As Your Neighbor Does
3.5.2 k-Nearest Neighbors

4 REGULARIZATION
4.1 Generalization in Machine Learning
4.1.1 Overfitting and Underfitting
4.1.2 Capacity
4.2 Hypothesis Evaluation: Training Error and Testing Error
4.2.1 Bias/Variance Tradeoff
4.3 Regularization
4.3.1 Regularized Linear Regression
4.3.2 Regularized Logistic Regression
4.4 Debugging a Machine Learning Problem
4.5 Hyperparameters, Model Selection and Validation Sets
4.5.1 Hyperparameters
4.5.2 Model Selection
4.5.3 Cross-Validation

5 SUPPORT VECTOR MACHINES
5.1 Margins: Intuition
5.2 Support Vector Machines: the Separable Case
5.2.1 Primal Optimization Problem
5.2.2 Constrained Optimization
5.2.3 Support Vectors
5.2.4 Dual Optimization Problem
5.3 Kernels
5.4 SVM with PDS Kernels


LIST OF FIGURES

1.1 Housing prices in USD versus their surface areas in square feet.

1.2 Estimating a house of surface area 750 ft² using a straight line.

1.3 Estimating a house of surface area 750 ft² using a second-order polynomial.

1.4 The data points denoted by a cross refer to benign tumors, and those denoted by a circle refer to malignant tumors.

1.5 Classification example with two features: the size of the tumor and the age of the patient. The data points denoted by a cross refer to benign tumors, and those denoted by a circle refer to malignant tumors.

1.6 Fitting a line to separate the data set in a binary classification problem.

1.7 Cocktail party problem, where each microphone picks up two simultaneous speeches. The cocktail party algorithm separates the mixed speeches into the two original speeches with no prior information about the sources.

2.1 A plot of a dataset giving the living areas and prices of 47 houses in a certain town.

2.2 A plot of a dataset giving the living areas and prices of 47 houses in a certain town, using hθ(x) = 50 + 0.06x to predict the housing prices.

2.3 A plot of the squared error function J(θ0, θ1).

2.4 Contour plots of the squared error function J(θ0, θ1).

2.5 Starting from a random location (θ0, θ1) and following the gradient descent.

2.6 Contour plots of the squared error function J(θ0, θ1).

2.7 Choosing another initial location and applying the gradient descent algorithm leads to another local minimum.

2.8 Geometrical intuition of the convergence of the gradient descent algorithm.

3.1 Training data for tumor classification.

3.2 Using linear regression in a classification problem.

3.3 A scenario showing that linear regression is not the right technique for a classification problem.

3.4 A plot of the sigmoid function.

3.5 Cost function for logistic regression.

3.6 Examples of the multivariate Gaussian distribution.

3.7 The training set, together with the contours of the two Gaussian distributions that have been fit to the data in each of the two classes. Note that the two Gaussians have contours of the same shape and orientation, since they share a covariance matrix, but they have different mean vectors.

3.8 An example of a decision tree for the concept PlayTennis. An example is classified by sorting it through the tree to the appropriate leaf node, then returning the classification associated with that leaf (in this case, Yes or No). This tree classifies Saturday mornings according to whether or not they are suitable for playing tennis.

3.9 Training examples for the target concept PlayTennis.

3.10 The partially learned decision tree resulting from the first step of ID3.

3.11 An illustration of the decision boundaries of the 1-NN rule. The points depicted are the sample points, and the predicted label of any new point is the label of the sample point in the center of the cell it belongs to. These cells form a Voronoi tessellation of the space, drawn using the Euclidean distance. Other metrics would change the decision surface.

3.12 An illustration of the operation of k = 3-NN. Circles and squares denote the training points and diamonds the test points. Test point A will be assigned to the square class and B to the circle class.

4.1 The zig-zag line in the left panel is consistent over the blue and red training sample, but it is a complex separation surface that is not likely to generalize well to unseen data. In contrast, the decision surface in the right panel is simpler and might generalize better in spite of misclassifying a few points of the training sample.

4.2 Different hypotheses for predicting housing prices.

4.3 Different hypotheses for a classification problem.

4.4 We fit three models to an example training set. The training data was generated synthetically, by randomly sampling x values and choosing y deterministically by evaluating a quadratic function. (Left) A linear function fit to the data suffers from underfitting: it cannot capture the curvature present in the data. (Center) A quadratic function fit to the data generalizes well to unseen points; it does not suffer from a significant amount of overfitting or underfitting. (Right) A polynomial of degree 9 fit to the data suffers from overfitting.

4.5 Typical relationship between capacity and error. Training and test error behave differently. At the left end of the graph, training error and generalization error are both high: this is the underfitting regime. As we increase capacity, training error decreases, but the gap between training and generalization error increases. Eventually, the size of this gap outweighs the decrease in training error, and we enter the overfitting regime, where capacity is too large, above the optimal capacity.

4.6 We fit a high-degree polynomial regression model to our example training set. The true function is quadratic, but here we use only models with degree 9. We vary the amount of weight decay to prevent these high-degree models from overfitting. (Left) With very large λ, we can force the model to learn a function with no slope at all; this underfits because it can only represent a constant function. (Center) With a medium value of λ, the learning algorithm recovers a curve with the right general shape. Even though the model is capable of representing functions with much more complicated shapes, weight decay has encouraged it to use a simpler function described by smaller coefficients. (Right) With weight decay approaching zero (i.e., using the Moore-Penrose pseudo-inverse to solve the underdetermined problem with minimal regularization), the degree-9 polynomial overfits significantly, as we saw in Figure 4.4.

4.7 In L2 regularization the regularizer term has a minimum at 0 and the error term has its own minimum; the optimum of the new regularized cost function lies in between.

4.8 A schematic display of 5-fold CV. A set of n observations is randomly split into five non-overlapping groups. Each of these fifths acts as a validation set (shown in beige), and the remainder as a training set (shown in blue). The test error is estimated by averaging the five resulting MSE estimates.

5.1 (a) Three possible hyperplanes used to separate the dataset. (b) The hyperplane with the maximum margin.

5.2 The wide-street approach to classification.

5.3 Margin and equations of the hyperplanes for a canonical maximum-margin hyperplane. The marginal hyperplanes are represented by dashed lines in the figure.

5.4 Non-linearly separable case. The classification task consists of discriminating between solid squares and solid circles. (a) No hyperplane can separate the two populations. (b) A non-linear mapping can be used instead.

5.5 Illustration of the XOR classification problem: linearly non-separable in the input space.

5.6 Illustration of the XOR classification problem: linearly separable in the feature space using a quadratic kernel.


CHAPTER 1

INTRODUCTION TO MACHINE LEARNING

1.1 Defining Machine Learning


Even among machine learning practitioners there is no well-accepted definition of what is and what is not machine learning. However, let us look at a couple of examples of how people have tried to define it.
1. Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed. In 1959, Arthur Samuel designed a checkers-playing program, where he programmed the computer to play the game thousands of times against itself. Instead of searching each path until it came to the game's conclusion, Samuel developed a scoring function based on the position of the board at any given time. This function tried to measure the chance of winning for each side from the given position. It took into account such things as the number of pieces on each side, the number of kings, and the proximity of pieces to being "kinged". The program chose its move based on a minimax strategy, meaning it made the move that optimized the value of this function, assuming that the opponent was trying to optimize the value of the same function from its own point of view. Over time the computer learned which board positions were good and which were bad, and, thanks to its comparatively large memory, eventually learned to play checkers better than Arthur Samuel himself could.
This definition of machine learning is an old one. The next is a relatively more modern definition.
2. Well-posed learning problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. (Tom Mitchell, 1998.) In reference to the checkers-playing example, the experience E is the experience of letting the computer play tens of thousands of games against itself, the task T is the task of playing checkers, and the performance measure P is the probability of winning the next game of checkers against a new opponent.

Example 1.
Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to filter spam
better. Classify the following items as T , E and P.

1. Watching you label emails as spam or not spam.

2. Classifying emails as spam or not spam.

3. The number (or fraction) of emails correctly classified as spam/not spam.

Solution. 1 = E, 2 = T, 3 = P.


It is a common misconception that machine learning can only discover correlations, not causal relationships. In fact, one of the most popular types of machine learning consists of trying out different actions and observing their consequences, which is the essence of causal discovery. For example, an e-commerce site can try many different ways of presenting a product and choose the one that leads to the most purchases. You have probably participated in thousands of these experiments without knowing it. Causal relationships can sometimes be discovered even in situations where experiments are out of the question, and all the computer can do is look at past data.

1.2 Types of Machine Learning Algorithms


In general, there are two main types of learning algorithms: supervised learning and unsupervised learning. In brief, in supervised learning we teach the computer what to do, while in unsupervised learning we let the computer learn by itself. There are other types of machine learning algorithms, such as reinforcement learning and recommender systems.

1.2.1 Supervised Learning: Linear Regression


Let us say we want to predict housing prices in a certain town. We start by collecting housing prices and plotting the data pertaining to the sizes of the houses and the corresponding prices in thousands of dollars. The result is shown in Figure 1.1.

Figure 1.1: Housing prices in USD versus their surface areas in square feet.

Assume we want to use the data to estimate the price of a house having a surface area of 750 square feet. How could a learning algorithm help us with this task? One way of estimating the price is to fit a curve through the plotted data and read off the price that corresponds to 750 ft². The simplest curve we could fit through the data is a straight line, as seen in Figure 1.2, but perhaps a better one is a second-order polynomial, as seen in Figure 1.3. Clearly, fitting the data with a second-order polynomial gives a better estimate of the price of the house.

Figure 1.2: Estimating a house of surface area 750 ft² using a straight line.

This is an example of a supervised learning algorithm. The term "supervised" refers to the fact that we gave the algorithm data sets for which the "right answers" (the right price in this example) were given. The task of the algorithm is to produce more of these right answers. This specific kind of problem is referred to as a regression problem, where a continuous-valued output (the price) needs to be predicted.

1.2.2 Supervised Learning: Classification


Here we present another supervised learning example, where the data set consists of the sizes of breast tumors and the associated classification that identifies each tumor as benign (denoted by N or 0) or malignant (denoted by Y or 1). See Figure 1.4.


Figure 1.3: Estimating a house of surface area 750 ft² using a second-order polynomial.


Figure 1.4: The data points denoted by a cross refer to benign tumors, and those denoted by a circle refer to malignant tumors.

The machine learning task is to estimate the probability, or chance, that a breast tumor of a certain size, one which does not belong to the collected data set, is benign or malignant. This sort of example is known as a classification problem. The term "classification" refers to the fact that we are trying to predict a discrete-valued output (0 or 1). In general, classification problems can have more than two outputs. As a concrete example, we may want to predict one of the following discrete-valued outputs: 0, 1, 2, 3, where 1, 2 and 3 refer to different types of breast cancer.
In some other classification problems, we might have more than one "feature" or "attribute". Assume that our data set consists of two features: the size of the tumor and the age of the patient. See Figure 1.5.


Figure 1.5: Classification example with two features: the size of the tumor and the age of the patient. The data points denoted by a cross refer to benign tumors, and those denoted by a circle refer to malignant tumors.

Given a patient of a certain age with a tumor of a certain size, the objective of a machine learning algorithm is to predict whether the tumor is benign or malignant. What the learning algorithm may do is separate the data by fitting a straight line. If the data point pertaining to the patient falls below the line, as seen in Figure 1.6, then the patient is more likely to have a benign tumor, and if it falls above the line, then the patient is more likely to have a malignant tumor.
In general, this kind of classification problem has more features, like clump thickness, uniformity of cell size, uniformity of cell shape, etc. In practice, we would like to be able to use an arbitrarily large, even infinite, number of features, so that the machine learning algorithm has more attributes to help it make more accurate predictions. But how do we deal with and store an infinite number of features on a computer? It turns out that there is an algorithm known as the Support Vector Machine which uses a neat mathematical trick to accomplish exactly that.



Figure 1.6: Fitting a line to separate the data set in a binary classification problem.

Example 2. You’re running a company, and you want to develop learning algorithms to address each of two problems.

• Problem 1: You would like your software to examine individual customer accounts, and for each account decide if it has
been hacked/compromised.

• Problem 2: You have a large inventory of identical items. You want to predict how many of these items will sell over the
next three months.

Which one of these problems is a linear regression problem and which one is a classification problem?

Solution. Problem 1 is classification and Problem 2 is linear regression.

Remark 1. In statistics, a fit refers to how well you approximate a target function. This is good terminology to use in machine learning, because supervised machine learning algorithms seek to approximate the unknown underlying mapping function for the output variables given the input variables. Statisticians often describe the goodness of fit, which refers to measures used to estimate how well the approximation of the function matches the target function. Some of these methods are useful in machine learning (e.g., calculating the residual errors), but some of these techniques assume we know the form of the target function we are approximating, which is not the case in machine learning. If we knew the form of the target function, we would use it directly to make predictions, rather than trying to learn an approximation from samples of noisy training data.

1.2.3 Unsupervised Learning


In supervised ML problems, we are provided with the right answers. In contrast, in unsupervised learning problems, we are given data that has no labels (or where all the data carries the same label). We are given a data set, but we are not told what to do with it or what each data point means. What needs to be done is to find some structure in the data. A typical unsupervised learning algorithm is a clustering algorithm, which breaks the data into a number of separate clusters. One example where clustering is used is Google News (https://news.google.com), where Google looks at hundreds of thousands of news stories online and groups them into cohesive stories. For instance, it would group a number of stories related to cellphones, each having its own URL, from different sources into one page. Another example of clustering is finding which servers in a data center tend to work together; by identifying these servers and grouping them together, data centers can work more efficiently. On the level of social networking, given knowledge about which friends you email the most, or given your Facebook friends, an ML algorithm can build cohesive clusters of people who may know each other (social network analysis).
A famous application of unsupervised learning is blind source separation. It deals with the separation of a set of source signals from a set of mixed signals. A well-known instance of this problem is the cocktail party problem, where a number of people are talking simultaneously and we want to separate each person's speech so we can listen to it separately. The caveat with this type of approach is that we need as many mixtures as we have source signals; in terms of the cocktail party problem, we need as many microphones as there are people talking in the room. It turns out that this problem can be solved using the cocktail party algorithm, which can be written in a single line of code.
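As an illustration, the following sketch separates two synthetic mixed signals. The use of FastICA from scikit-learn, and the sinusoid/square-wave "speakers", are assumptions made for the sake of the example; the notes do not prescribe a particular implementation, and the famous "single line" refers to a compact matrix formulation rather than a library call.

```python
# A minimal blind-source-separation sketch (illustrative, not the notes' own code).
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                       # speaker 1: a sinusoid
s2 = np.sign(np.sin(3 * t))              # speaker 2: a square wave
S = np.c_[s1, s2]                        # true sources, shape (2000, 2)

A = np.array([[1.0, 0.5],                # unknown mixing matrix:
              [0.5, 1.0]])               # two microphones, two speakers
X = S @ A.T                              # observed microphone recordings

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)             # recovered sources (up to order/scale)
```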
Roughly speaking, unsupervised learning involves observing several examples of a random vector x, and attempting to implicitly or explicitly learn the probability distribution p(x), or some interesting properties of that distribution, while supervised learning involves observing several examples of a random vector x and an associated value or vector y, and learning to predict y from x, usually by estimating p(y|x). The term supervised learning originates from the view of the target y being provided by an instructor or teacher who shows the machine learning system what to do. In unsupervised learning, there is no instructor or teacher, and the algorithm must learn to make sense of the data without this guide.
Unsupervised learning and supervised learning are not formally defined terms. The lines between them are often blurred. Many machine learning technologies can be used to perform both tasks. For example, the chain rule of probability states that for a vector x ∈ Rⁿ, the joint distribution can be decomposed as

p(x) = Π_{i=1}^{n} p(xi | x1, ..., xi−1)

This decomposition means that we can solve the ostensibly unsupervised problem of modeling p(x) by splitting it into n supervised learning problems. Alternatively, we can solve the supervised learning problem of learning p(y|x) by using traditional unsupervised learning technologies to learn the joint distribution p(x, y) and inferring

p(y|x) = p(x, y) / Σ_{y′} p(x, y′)
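As a tiny numeric illustration of this inference step, the sketch below recovers p(y|x) from a made-up joint table p(x, y) by normalizing each row (the table values are purely illustrative):

```python
# A small check of p(y|x) = p(x,y) / sum_{y'} p(x,y') on a made-up joint table.
import numpy as np

p_xy = np.array([[0.10, 0.20, 0.10],     # joint p(x, y); rows index x,
                 [0.25, 0.05, 0.30]])    # columns index y; entries sum to 1

p_x = p_xy.sum(axis=1, keepdims=True)    # marginal p(x) = sum over y
p_y_given_x = p_xy / p_x                 # conditional p(y | x)

print(p_y_given_x.sum(axis=1))           # each row sums to 1: [1. 1.]
```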

Though unsupervised learning and supervised learning are not completely formal or distinct concepts, they do help to roughly
categorize some of the things we do with machine learning algorithms. Traditionally, people refer to regression, classification
and structured output problems as supervised learning. Density estimation in support of other tasks is usually considered
unsupervised learning.
Other variants of the learning paradigm are possible. For example, in semi-supervised learning, some examples include a
supervision target but others do not. In multi-instance learning, an entire collection of examples is labeled as containing or
not containing an example of a class, but the individual members of the collection are not labeled. Some machine learning
algorithms do not just experience a fixed dataset. For example, reinforcement learning algorithms interact with an environment,
so there is a feedback loop between the learning system and its experiences.


CHAPTER 2

LINEAR REGRESSION & GRADIENT DESCENT ALGORITHM

Our definition of a machine learning algorithm as an algorithm that is capable of improving a computer program's performance at some task via experience is somewhat abstract. To make this more concrete, we present an example of a simple machine learning algorithm: linear regression.
As the name implies, linear regression solves a regression problem. In other words, the goal is to build a system that can take a vector x ∈ Rⁿ as input and predict the value of a scalar y ∈ R as its output. The output of linear regression is a linear function of the input. Let ŷ be the value that our model predicts y should take on. We define the output to be

ŷ = wᵀx    (2.1)

where w ∈ Rⁿ is a vector of "parameters". Parameters are values that control the behavior of the system. In this case, wi is the coefficient that we multiply by feature xi before summing up the contributions from all the features. We can think of w as a set of weights that determine how each feature affects the prediction. If a feature xi receives a positive weight wi, then increasing the value of that feature increases the value of our prediction ŷ. If a feature receives a negative weight, then increasing the value of that feature decreases the value of our prediction. If a feature's weight is large in magnitude, then it has a large effect on the prediction. If a feature's weight is zero, then it has no effect on the prediction.
We thus have a definition of our task T: to predict y from x by outputting ŷ = wᵀx. What remains to be specified is a definition of our performance measure P. This will be done subsequently.
Consider again the motivating example of predicting housing prices. A data set is collected and plotted in Figure 2.1, where we have collected the "right answers" for 47 houses. Assume we want to predict the price of a house (i.e., predict a real-valued output) having an area of 1250 ft². As we have seen before, we can try to fit a straight line through the data and predict a price.

Figure 2.1: A plot of a dataset giving the living areas and prices of 47 houses in a certain town.


More formally, in the language of supervised learning, we have a "training set" of housing prices, and the task is to learn from this data how to predict the prices of houses. To establish notation for future use, we will use x^(i) to denote the "input" variables (living area in this example), also called "input features", and y^(i) to denote the "output" or "target" variable that we are trying to predict (price). A pair (x^(i), y^(i)) is called a "training example", and the dataset that we will be using to learn, a list of m training examples {(x^(i), y^(i)); i = 1, ..., m}, is called a "training set". Note that the superscript (i) in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use X to denote the space of input values, and Y the space of output values. In this example, X = Y = R. Remember that when the target variable that we are trying to predict is continuous, as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict whether a dwelling is a house or an apartment, say), we call it a classification problem.
To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that h(x) is a "good" predictor for the corresponding value of y. For historical reasons, this function h is called a "hypothesis". Seen pictorially, the process is as follows: the training set is fed to a learning algorithm, which outputs a hypothesis h; the hypothesis then maps a new input x to a predicted output y.

To perform supervised learning, we must decide how we are going to represent functions/hypotheses h in a computer. As an initial choice, let us say we decide to approximate y as a linear function of x, hence the hypothesis is expressed as

hθ(x) = θ0 + θ1x    (2.2)

Here, the θi's are the parameters (also called "weights") parameterizing the space of linear functions mapping from X to Y. When there is no risk of confusion, we will drop the θ subscript in hθ(x), and write it more simply as h(x). This kind of linear regression is known as linear regression with one variable, or univariate linear regression, where the variable here is x.
Now we need to investigate how to fit the best linear function to a given dataset, i.e., how to find the best θi's that make the straight line fit the training set as well as possible. The idea is to choose θ0 and θ1 such that hθ(x) is close to y for the training examples (x, y) in our training set. Analytically, we need to solve the following minimization problem:

min_{θ0,θ1} Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²    (2.3)

where hθ(x^(i)) = θ0 + θ1x^(i), or equivalently, solving the following minimization problem

min_{θ0,θ1} J(θ0, θ1)    (2.4)

where J(θ0, θ1) = (1/(2m)) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))² is known as the "cost function" or the "squared error". There are other cost functions that can work quite well in terms of fitting the data, but the squared error function is one of the most widely used for regression problems.

Example 3.
Consider the dataset {(x, y)} = {(1, 2), (2, 4), (3, 6)}. Assume that the chosen hypothesis is hθ(x) = θ1x. Compute the squared error function for θ1 = 0, 1, 2, 3. Which θ1 minimizes the squared error function?

Solution. In this case hθ1(x) = θ1x, so J(θ1) = (1/6) Σ_{i=1}^{3} (θ1x^(i) − y^(i))².

• For θ1 = 0: J(0) = (1/6) Σ_{i=1}^{3} (0 − y^(i))² = (1/6)(2² + 4² + 6²) = 9.33
• For θ1 = 1: J(1) = (1/6) Σ_{i=1}^{3} (x^(i) − y^(i))² = (1/6)((1 − 2)² + (2 − 4)² + (3 − 6)²) = 2.33
• For θ1 = 2: J(2) = (1/6) Σ_{i=1}^{3} (2x^(i) − y^(i))² = (1/6)((2 − 2)² + (4 − 4)² + (6 − 6)²) = 0
• For θ1 = 3: J(3) = (1/6) Σ_{i=1}^{3} (3x^(i) − y^(i))² = (1/6)((3 − 2)² + (6 − 4)² + (9 − 6)²) = 2.33

Clearly, the optimal value for θ1 is θ1* = 2, as it makes the squared error equal to 0.
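These values are easy to verify in code. Below is a minimal sketch of the squared-error cost for hθ(x) = θ1x on this dataset (Python and the helper name J are our own illustrative choices):

```python
# Squared-error cost J(theta1) for h(x) = theta1 * x, on the data of Example 3.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
m = len(x)

def J(theta1):
    return np.sum((theta1 * x - y) ** 2) / (2 * m)

for theta1 in [0, 1, 2, 3]:
    print(theta1, J(theta1))   # 9.33..., 2.33..., 0.0, 2.33...
```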

Going back to the scenario of Figure 2.1, we want to use the hypothesis described by Eq. (2.2) to solve the linear regression problem. Assuming θ0 = 50 and θ1 = 0.06, we obtain hθ(x) = 50 + 0.06x. The situation is illustrated in Figure 2.2.

Figure 2.2: A plot of a dataset giving the living areas and prices of 47 houses in a certain town, using hθ(x) = 50 + 0.06x to predict the housing prices.

Plotting the squared error function as defined by (2.3), we get a three-dimensional bowl-shaped surface with a single minimum, as shown in Figure 2.3. For a particular choice of θ0 and θ1, we obtain a specific hypothesis hθ(x), and the height of the corresponding point on the surface is the value of the error function.

Figure 2.3: A plot of the squared error function J(θ0 , θ1 ).


For ease of explanation, we will use contour plots to illustrate the squared error function; see Figure 2.4. Each ellipse denotes a set of points that take on the same value of J(θ0, θ1). The innermost ellipse corresponds to the minimum of J(θ0, θ1).

Figure 2.4: Contour plots of the squared error function J(θ0, θ1).

Example 4. This example illustrates how a good choice of hypothesis corresponds to minimizing the cost function J(θ0, θ1). Consider Figure 2.6(a), showing a hypothesis that clearly does not fit the data: hθ(x) = 360, i.e., θ0 = 360 and θ1 = 0. This choice of hypothesis can be seen on the contour plot as a specific point x, which is quite far from the minimum (the midpoint of the innermost contour). Figure 2.6(b) shows a hypothesis with θ0 = 800 and θ1 = −1.5, i.e., hθ(x) = 800 − 1.5x, which also does not fit the data properly. This choice of hypothesis can again be seen on the contour plot as a point x far from the minimum. Notice that the hypothesis hθ(x) = 360 is a better hypothesis than hθ(x) = 800 − 1.5x, since the x on the contour plot is closer to the minimum in the former case than in the latter. Figure 2.6(c) shows a hypothesis with θ0 = 250 and θ1 = 0.15, i.e., hθ(x) = 250 + 0.15x. This choice of hypothesis appears on the contour plot as a point x quite close to the minimum.

In high-dimensional problems with more features, an algorithm is needed to automatically find the choice of parameters θi that minimizes the cost function J(θ0, θ1, ..., θn). In the next section we derive such an algorithm.

2.1 Gradient Descent Algorithm


The Gradient Descent Algorithm is a general algorithm: it is used not only to minimize the cost function for linear regression but also in many other machine learning problems. The problem setup is as follows: we have a function J(θ0, θ1) and we want to minimize it by finding appropriate values for θ0 and θ1. The approach is to start with some initial guesses for θ0 and θ1, and to keep updating those parameters to reduce J(θ0, θ1), hopefully ending up at a minimum.


Figure 2.5: Starting from a random location (θ0 , θ1 ) and following the gradient descent.


Figure 2.6: Contour plots of the squared error function J(θ0, θ1) for the three hypotheses of Example 4.

We first present a graphical intuition for how the gradient descent algorithm works. In Figure 2.5(a), we pick an initial guess for (θ0, θ1) (illustrated as a star point on the surface), and according to the gradient descent algorithm we look 360° around and see which direction corresponds to the steepest descent for a small step (going down the hill as quickly as possible). Assume that this direction leads to another star point, as shown in Figure 2.5(b). We repeat the same process of following the steepest descent from the second point to a third point, and so on, until we reach a minimum. The gradient descent algorithm has an interesting property: if we choose another initial guess for (θ0, θ1), not far from the original guess, and repeat the same procedure of following the steepest descent, we may reach a different local minimum. See Figure 2.7. We will say more about this property later.
After the graphical intuition, we turn to the analytical interpretation of the gradient descent algorithm. The following is the definition of the Gradient Descent Algorithm.

Gradient Descent Algorithm
Repeat until convergence {
θj := θj − α (∂/∂θj) J(θ0, θ1)   (for j = 0 and j = 1)
}

According to this definition, we repeatedly update the parameter θj by taking θj and subtracting from it α (∂/∂θj) J(θ0, θ1), where α > 0 is the "learning rate" that controls the size of the step we take during the descent. If α is very large, this is equivalent to taking a very large step, while if α is very small, this is equivalent to taking a very small step. If α is too small, gradient descent can be slow, while if α is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge. The partial derivative term (∂/∂θj) J(θ0, θ1) is the gradient of the cost function.

Figure 2.7: Choosing another initial location and applying the gradient descent algorithm leads to another local minimum.

Remark 2. Note that in applying the gradient descent algorithm, we need to update θ0 and θ1 simultaneously, and this is an important subtlety in applying the algorithm. More concretely, the update should be carried out in the following manner:

t0 := θ0 − α (∂/∂θ0) J(θ0, θ1)
t1 := θ1 − α (∂/∂θ1) J(θ0, θ1)
θ0 := t0
θ1 := t1

Remark 3. In the gradient descent algorithm, as we approach a local minimum the algorithm will automatically take smaller steps. This is due to the fact that as we approach a local minimum the term (∂/∂θj) J(θ0, θ1) gets smaller, until it becomes equal to zero upon reaching the local minimum. Hence, there is no need to decrease the learning rate α over time.

Remark 4. While gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global optimum, and no other local optima; thus gradient descent always converges (assuming the learning rate α is not too large) to the global minimum. Indeed, J is a convex quadratic function with a single global minimum.

Example 5.
Consider the dataset {(x, y)} = {(1, 2), (2, 4), (3, 6)}. Assume that the chosen hypothesis is hθ(x) = θ1x. Apply the gradient descent algorithm to find the optimal value of θ1, assuming an initial choice of θ1 = 0 and a learning rate α = 0.2.

Solution. The cost function is J(θ1) = (1/6) Σ_{i=1}^{3} (θ1x^(i) − y^(i))² = (1/6)[(θ1 − 2)² + (2θ1 − 4)² + (3θ1 − 6)²] = (14θ1² − 56θ1 + 56)/6. For α = 0.2, the algorithm is {θ1 := θ1 − 0.2 (d/dθ1)[(14θ1² − 56θ1 + 56)/6]} = {θ1 := θ1 − 0.2 (28θ1 − 56)/6}. Starting with θ1 = 0:

θ1 := 0 − 0.2 (28(0) − 56)/6 = 28/15 ≃ 1.867 ⇒ J(θ1) = 28/675 ≃ 4.15 × 10⁻²
θ1 := 28/15 − 0.2 (28(28/15) − 56)/6 = 448/225 ≃ 1.991 ⇒ J(θ1) ≃ 1.84 × 10⁻⁴
θ1 := 448/225 − 0.2 (28(448/225) − 56)/6 = 6748/3375 ≃ 1.9994 ⇒ J(θ1) ≃ 8.19 × 10⁻⁷
θ1 := 6748/3375 − 0.2 (28(6748/3375) − 56)/6 = 101248/50625 ≃ 1.99996 ⇒ J(θ1) ≃ 3.64 × 10⁻⁹
θ1 := 101248/50625 − 0.2 (28(101248/50625) − 56)/6 = 1518748/759375 ≃ 1.999997 ⇒ J(θ1) ≃ 1.62 × 10⁻¹¹

The iterates converge rapidly to θ1 = 2, the minimizer found in Example 3, so the resulting hypothesis is hθ(x) = 2x.
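The same iteration is easy to reproduce in code. Below is a minimal batch gradient descent sketch for this one-parameter problem (Python and the variable names are our own choices); it prints the iterates above and converges to θ1 = 2:

```python
# Batch gradient descent for h(x) = theta1 * x on the data of Example 5.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
m = len(x)
alpha, theta1 = 0.2, 0.0

for step in range(5):
    grad = np.sum((theta1 * x - y) * x) / m        # dJ/dtheta1 = (28*theta1 - 56)/6
    theta1 -= alpha * grad
    cost = np.sum((theta1 * x - y) ** 2) / (2 * m)
    print(step + 1, theta1, cost)                  # 1.8667, 1.9911, 1.9994, ...
```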


Returning to the partial derivatives of the cost function, ∂/∂θj J(θ0, θ1), and using hθ(x^(i)) = θ0 + θ1x^(i), we can write

∂/∂θj J(θ0, θ1) = ∂/∂θj [ (1/(2m)) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))² ] = (1/(2m)) ∂/∂θj Σ_{i=1}^{m} (θ0 + θ1x^(i) − y^(i))²,  (for j = 0 and j = 1)    (2.5)

Specializing (2.5) for j = 0 and j = 1, we get

• j = 0: ∂/∂θ0 J(θ0, θ1) = (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))
• j = 1: ∂/∂θ1 J(θ0, θ1) = (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) · x^(i)

As a result, the update rule in the gradient descent algorithm becomes:

Repeat until convergence {
θ0 := θ0 − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))    (2.6)
θ1 := θ1 − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) · x^(i)    (2.7)
}

This rule is called the LMS update rule (LMS stands for "least mean squares"), and is also known as the Widrow-Hoff learning rule. This rule has several properties that seem natural and intuitive. For instance, the magnitude of the update is proportional to the error term hθ(x^(i)) − y^(i); thus, if we encounter a training example on which our prediction nearly matches the actual value of y^(i), then there is little need to change the parameters; in contrast, a larger change to the parameters will be made if our prediction hθ(x^(i)) has a large error (i.e., if it is very far from y^(i)). This method looks at every example in the entire training set (the summation term Σ_{i=1}^{m}) on every step, and is thus called Batch Gradient Descent.
There are other versions of gradient descent that are not batch versions (i.e., that do not look at the entire training set). For example, consider the following algorithm.

Stochastic Gradient Descent Algorithm
Loop {
  for i = 1 to m {
    θj := θj − α (hθ(x^(i)) − y^(i)) · x_j^(i)   (for every j)
  }
}

In the Stochastic Gradient Descent algorithm, we repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only.
Whereas batch gradient descent has to scan through the entire training set before taking a single step (a costly operation if m is large), stochastic gradient descent can start making progress right away, and continues to make progress with each example it looks at. Often, stochastic gradient descent gets θ "close" to the minimum much faster than batch gradient descent. Note however that it may never "converge" to the minimum, and the parameters θ will keep oscillating around the minimum of J(θ); but in practice most of the values near the minimum will be reasonably good approximations to the true minimum. For these reasons, particularly when the training set is large, stochastic gradient descent is often preferred over batch gradient descent.
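Below is a minimal stochastic gradient descent sketch for univariate linear regression; the synthetic data, learning rate, and epoch count are illustrative assumptions:

```python
# Stochastic gradient descent: one parameter update per training example.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 100)
y = 2.0 + 3.0 * x + 0.05 * rng.standard_normal(100)  # synthetic data

theta = np.zeros(2)                   # [theta0, theta1]
alpha = 0.05
for epoch in range(100):
    for i in rng.permutation(len(x)):
        xi = np.array([1.0, x[i]])    # prepend the convention x0 = 1
        err = theta @ xi - y[i]       # h_theta(x^(i)) - y^(i)
        theta -= alpha * err * xi     # update from this single example only
print(theta)                          # close to [2, 3]
```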

Remark 5. While it is more common to run stochastic gradient descent as we have described it, with a fixed learning rate α, it is also possible to ensure that the parameters converge to the global minimum, rather than merely oscillate around it, by slowly letting the learning rate α decrease to zero as the algorithm runs.

Remark 6. Reaching a point at which the gradient descent algorithm makes only very small changes in the objective function (cost function) is called convergence. So how do we decide that the gradient descent algorithm has converged? We can imagine how gradient descent works by thinking that we throw a marble inside a bowl and start taking snapshots. The marble will oscillate until friction stops it at the bottom. Now imagine that we are in an environment where friction is so small that the marble takes a long time to completely stop, so we can assume that when the oscillations are small enough the marble has reached the bottom (although it could continue oscillating). In Figure 2.8 you can see the first eight steps (snapshots of the marble) of the gradient descent.

Figure 2.8: Geometrical intuition of the convergence of the gradient descent algorithm.

If we continue taking photos of the movement of the marble, we notice that the movement becomes less and less significant. This does not mean that the marble has necessarily reached the bottom of the bowl (the optimal result), but it is quite close to it, if not at it. A precision value can be chosen as the threshold at which consecutive iterations of GD are almost the same. For example, we can use the following precision rule:

grad(i) = 0.0001
grad(i + 1) = 0.000099989  (grad has changed by less than 0.01% in one iteration ⇒ STOP)

Such a precision rule is also referred to as a test of convergence, and choosing a precision threshold is not necessarily an evident task.
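A sketch of such a convergence test in code follows; the relative tolerance of 10⁻⁴ (i.e., 0.01%) and the generic step/cost interface are illustrative choices, not a prescription from the notes:

```python
# An illustrative convergence test: stop when the cost changes by less than
# a relative tolerance `tol` between consecutive iterations.
def run_until_converged(step, cost, theta, tol=1e-4, max_iters=100_000):
    prev = cost(theta)
    for _ in range(max_iters):
        theta = step(theta)                    # one gradient descent update
        curr = cost(theta)
        if abs(prev - curr) <= tol * max(abs(prev), 1e-12):
            break                              # relative change below tol: stop
        prev = curr
    return theta
```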

2.2 Multivariate Linear Regression


In the original version of linear regression that we developed, we had a single feature (the area of the house), denoted by x, and the objective was to use that single feature to predict the house price y via the hypothesis hθ(x) = θ0 + θ1x. To make our housing example more interesting, let us consider a slightly richer dataset in which we also know more features of each house, such as the number of bedrooms, the number of floors, and how old the house is. Here, the x's are four-dimensional vectors in R⁴. For instance, x1^(i), x2^(i), x3^(i), and x4^(i) are respectively the area, the number of bedrooms, the number of floors, and the age of the i-th house in the training set. (In general, when designing a learning problem, it will be up to you to decide what features to choose, so if you are out gathering housing data, you might also decide to include other features such as whether each house has a fireplace, the number of bathrooms, and so on. We will say more about feature selection later, but for now let us take the features as given.)


In general, we could have n features and m training examples, where the i-th training example is represented by a vector x^(i) whose j-th component, x_j^(i), is the value of the j-th feature in the i-th training example.
To perform supervised learning, we must decide how we are going to represent functions/hypotheses h in a computer. As an initial choice, let us say we decide to approximate y as a linear function of x:

hθ(x) = θ0 + θ1x1 + θ2x2 + ... + θnxn = Σ_{j=0}^{n} θjxj = Θᵀx    (2.8)

where the θj's are the parameters (weights) parameterizing the space of linear functions mapping from X to Y. When there is no risk of confusion, we will drop the θ subscript in hθ(x), and write it more simply as h(x). To simplify our notation, we have used the convention of letting x0 = 1. Note that x and Θ are now both (n + 1)-dimensional vectors.
We now address the question of how to apply the gradient descent algorithm in the context of multivariate linear regression. The corresponding cost function is given by

J(Θ) = J(θ0, θ1, ..., θn) = (1/(2m)) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))² = (1/(2m)) Σ_{i=1}^{m} (Σ_{j=0}^{n} θjx_j^(i) − y^(i))²    (2.9)

The gradient descent algorithm should be updated to the following format.

For n ≥ 1, repeat until convergence {
θj := θj − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) · x_j^(i)   (simultaneously update θj for j = 0, ..., n)    (2.10)
}

Gradient descent gives one way of minimizing J. Let us discuss a second way of doing so, this time performing the minimization explicitly and without resorting to an iterative algorithm. In this method, we will minimize J by explicitly taking its derivatives with respect to the θj's, and setting them to zero. To enable us to do this without having to write reams of algebra and pages full of matrices of derivatives, let us introduce some notation for doing calculus with matrices.

2.2.1 Matrix Derivatives


For a function f : R^{m×n} → R mapping from m-by-n matrices to the real numbers, we define the derivative of f with respect to a matrix A to be the m-by-n matrix whose (i, j) entry is ∂f/∂Aij:

∇_A f(A) = [ ∂f/∂A11  ···  ∂f/∂A1n
             ···       ···  ···
             ∂f/∂Am1  ···  ∂f/∂Amn ]    (2.11)

For example, suppose A = [ A11 A12 ; A21 A22 ] is a 2-by-2 matrix, and the function f : R^{2×2} → R is given by

f(A) = A11 + 5A12² + (3/2) A21A22

Here, Aij denotes the (i, j) entry of the matrix A. We then have

∇_A f(A) = [ 1          10A12
             (3/2)A22   (3/2)A21 ]

We also introduce the "trace" operator, written "Tr". For an n-by-n (square) matrix A, the trace of A is defined to be the sum of its diagonal entries:

Tr A = Σ_{i=1}^{n} Aii    (2.12)

If a is a real number (i.e., a 1-by-1 matrix), then Tr a = a. The trace operator has the property that for two matrices A and B such that AB is square, Tr AB = Tr BA. (Check this yourself!) As corollaries of this, we also have, e.g.,

Tr ABC = Tr CAB = Tr BCA
Tr ABCD = Tr DABC = Tr CDAB = Tr BCDA

The following properties of the trace operator are also easily verified. Here, A and B are square matrices, and a is a real number:

Tr A = Tr Aᵀ    (2.13)
Tr(A + B) = Tr A + Tr B    (2.14)
Tr aA = a Tr A    (2.15)

We now state without proof some facts of matrix derivatives:

∇_A Tr AB = Bᵀ    (2.16)
∇_{Aᵀ} f(A) = (∇_A f(A))ᵀ    (2.17)
∇_A Tr ABAᵀC = CAB + CᵀABᵀ    (2.18)
∇_A |A| = |A| (A⁻¹)ᵀ    (2.19)

where (2.19) applies only to non-singular square matrices A, and |A| denotes the determinant of A.
where (2.19) applies only to non-singular square matrices A, and |A| denotes the determinant of a matrix.

2.2.2 The Normal Equations


Armed with the tools of matrix derivatives, let us now proceed to find in closed form the value of Θ that minimizes J(Θ). We begin by re-writing J in matrix-vector notation. Given a training set, define the design matrix X to be the m-by-n matrix (actually m-by-(n + 1), if we include the intercept term) that contains the training examples' input values in its rows:

X = [ (x^(1))ᵀ
      (x^(2))ᵀ
      ...
      (x^(m))ᵀ ]

Also, let y = [y^(1) y^(2) ··· y^(m)]ᵀ. Now, since hθ(x^(i)) = (x^(i))ᵀΘ, we can easily verify that

XΘ − y = [ (x^(1))ᵀΘ − y^(1)
           ...
           (x^(m))ᵀΘ − y^(m) ] = [ hθ(x^(1)) − y^(1)
                                   ...
                                   hθ(x^(m)) − y^(m) ]

Thus, using the fact that for a vector z we have zᵀz = Σi zi², we can write

(1/2)(XΘ − y)ᵀ(XΘ − y) = (1/2) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²

which equals the J(Θ) of (2.9) up to the constant factor 1/m; a positive constant factor does not affect the minimizing Θ, so below we simply denote this quantity by J(Θ).

Finally, to minimize J, let us find its derivatives with respect to Θ. Combining Eq. (2.17) and Eq. (2.18), we find that

∇_{Aᵀ} Tr ABAᵀC = BᵀAᵀCᵀ + BAᵀC    (2.20)

Hence,

∇_Θ J(Θ) = ∇_Θ (1/2)(XΘ − y)ᵀ(XΘ − y)
         = (1/2) ∇_Θ (ΘᵀXᵀXΘ − ΘᵀXᵀy − yᵀXΘ + yᵀy)
         = (1/2) ∇_Θ Tr(ΘᵀXᵀXΘ − ΘᵀXᵀy − yᵀXΘ + yᵀy)
         = (1/2) ∇_Θ (Tr ΘᵀXᵀXΘ − 2 Tr yᵀXΘ)    (yᵀy is a constant and hence its derivative is zero)
         = (1/2) (XᵀXΘ + XᵀXΘ − 2Xᵀy)
         = XᵀXΘ − Xᵀy

In the third step, we used the fact that the trace of a real number is just the real number; the fourth step used the fact that Tr A = Tr Aᵀ; and the fifth step used Eq. (2.20) with Aᵀ = Θ, B = Bᵀ = XᵀX, and C = I, together with Eq. (2.16). To minimize J, we set its derivatives to zero and obtain the "normal equations":

XᵀXΘ = Xᵀy    (2.21)

Thus, the value of Θ that minimizes J(Θ) is given in closed form by the equation

Θ = (XᵀX)⁻¹Xᵀy    (2.22)
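Below is a minimal sketch of the normal equations on made-up data; np.linalg.solve is applied to (2.21) rather than explicitly forming the inverse in (2.22), which is numerically preferable:

```python
# Solving the normal equations (X^T X) Theta = X^T y on synthetic housing data.
import numpy as np

rng = np.random.default_rng(0)
m = 47
area = rng.uniform(500, 3000, m)                 # living areas (made-up data)
X = np.c_[np.ones(m), area]                      # design matrix with x0 = 1
y = 50 + 0.06 * area + rng.standard_normal(m)    # noisy synthetic prices

Theta = np.linalg.solve(X.T @ X, X.T @ y)        # normal equations (2.21)
print(Theta)                                     # approximately [50, 0.06]
```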

2.3 Gradient Descent in Practice


In this section we will introduce practical tricks for making gradient descent work well.

2.3.1 Feature Scaling


The idea is that if we have a linear regression problem with multiple features, and we make the ranges of values of the features
comparable (features on a similar scale), then the gradient descent algorithm will converge more quickly. Assume we have a problem
with two features: x1′ ∈ [0, 2000] representing the size of a house and x2′ ∈ {1, 2, 3, 4, 5} representing the number of bedrooms
of the house. If we draw the contours of J(θ) as a function of θ1 and θ2, it will turn out that the contours are very skewed
ovals. As a result, the gradient descent algorithm will oscillate a lot before converging to the global minimum.
In such a setting it is helpful to scale the features, i.e., redefine the features as x1 = x1′/2000 and x2 = x2′/5. As
a result, the contours of J(θ) will be much less skewed and hence convergence can be attained more quickly. In general, in feature
scaling, features should be normalized into approximately the −1 ≤ xi ≤ 1 range.
Another way of scaling is to perform what we call “mean normalization”, where we replace xi by xi − µi, where µi is the
mean of the dataset, e.g., x1 = (x1′ − 1000)/2000 if µ1 = 1000 is the average size of houses. This will make all the features
have approximately zero mean. Note that we do not apply mean normalization to x0 = 1.
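A minimal NumPy sketch of the two scalings discussed above, assuming the features sit in a design matrix whose first column is the intercept x0 = 1 (names and data are illustrative):

import numpy as np

def scale_features(X):
    # X: m-by-(n+1) design matrix whose first column is the intercept x0 = 1.
    Xs = X.astype(float).copy()
    mu = Xs[:, 1:].mean(axis=0)                          # per-feature mean
    rng = Xs[:, 1:].max(axis=0) - Xs[:, 1:].min(axis=0)  # per-feature range
    Xs[:, 1:] = (Xs[:, 1:] - mu) / rng  # mean normalization + range scaling
    return Xs, mu, rng                  # keep mu, rng to scale new inputs

# House size in [0, 2000] and number of bedrooms in {1, ..., 5}.
X = np.array([[1, 1800, 4], [1, 900, 2], [1, 1200, 3]], dtype=float)
X_scaled, mu, rng = scale_features(X)   # scaled features lie within [-1, 1]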

2.3.2 Learning Rate


If the learning rate α is too big, the gradient descent might overshoot the minimum and keep on overshooting. The solution
is to decrease the value of α. Mathematically, it can be shown that, for sufficiently small α, J(θ) should decrease on every
iteration. However, if α is too small, gradient descent can be very slow to converge. In practice, we could start by testing a
number of values for α and see how quickly things evolve (typically, start with α = 0.001, 0.003, 0.01, 0.03, 0.1).
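One possible way to carry out this test is to run a few iterations of gradient descent for each candidate α and check whether J(θ) decreases; a rough sketch on toy linear regression data (all names and values are illustrative):

import numpy as np

def cost(X, y, theta):
    r = X @ theta - y
    return 0.5 * (r @ r) / len(y)

def gd_step(X, y, theta, alpha):
    # One batch gradient descent update for linear regression.
    return theta - alpha * X.T @ (X @ theta - y) / len(y)

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5]])
y = np.array([1.0, 2.0, 3.0])
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1]:
    theta = np.zeros(X.shape[1])
    J = [cost(X, y, theta)]
    for _ in range(50):
        theta = gd_step(X, y, theta, alpha)
        J.append(cost(X, y, theta))
    # J should decrease on every iteration for a sufficiently small alpha;
    # an increase would signal that alpha is too large (overshooting).
    ok = all(b <= a for a, b in zip(J, J[1:]))
    print(alpha, "J decreasing" if ok else "J increasing: alpha too large")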


2.4 Features and Polynomial Regression


Sometimes we can get different learning algorithms, some very powerful ones, by choosing appropriate features. In
particular, we can use polynomial regression, which allows the use of the machinery of linear regression to fit very non-linear
datasets. Assume that the hypothesis that would fit our data is of the form

hθ(x) = θ0 + θ1 x + θ2 x^2 + θ3 x^3

We can define x1 ≜ x, x2 ≜ x^2 and x3 ≜ x^3, and cast the given hypothesis as a linear regression hypothesis

hθ(x) = θ0 + θ1 x1 + θ2 x2 + θ3 x3

and use the machinery of gradient descent as defined earlier for linear regression problems. Note that in such a case feature
scaling becomes increasingly important.
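A minimal sketch of this feature mapping in NumPy (toy data; the last step applies the feature scaling discussed above):

import numpy as np

# Map a single raw feature x to (1, x, x^2, x^3) so that the cubic
# hypothesis above becomes linear in the parameters theta.
x = np.array([0.5, 1.0, 1.5, 2.0])
X = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Feature scaling matters here: x, x^2 and x^3 live on very different ranges.
X[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)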


CHAPTER 2

CLASSIFICATION

In this chapter we discuss the “classification problem”. This is just like the regression problem, except that the values y we
now want to predict take on only a small number of discrete values. For now, we will focus on the binary classification problem
in which y can take on only two values, 0 and 1. (Most of what we say here will also generalize to the multiple-class case.) For
instance, if we are trying to build a spam classifier for email, then x (i) may be some features of a piece of email, and y may
be 1 if it is a piece of spam mail, and 0 otherwise. 0 is also called the negative class, and 1 the positive class, and they are
sometimes also denoted by the symbols “−” and “+”. Given x (i) , the corresponding y (i) is also called the label for the training
example.

2.1 Logistic Regression


We could approach the classification problem ignoring the fact that y is discrete-valued, and use our old linear regression
algorithm to try to predict y given x. Let us re-consider the example of classifying tumors as being malignant or benign.
Figure 2.1 shows a training set for such a classification task.


Figure 2.1: Training data for tumor classification.

One thing we can do with this training set is to apply the linear regression algorithm which we already know, i.e., try to fit a
straight line hθ(x) = Θ^T x to the data. Doing that, assume that we set a “classifier threshold” at y = 0.5:
• If hθ (x) ≥ 0.5, then predict y = 1

• If hθ (x) < 0.5, then predict y = 0

Now, every data to the right of the value corresponding to y = 0.5 will be predicted as belonging to the positive class
(malignant tumor) and every data to the left of the value corresponding to y = 0.5 will be predicted as belonging to the
negative class (benign tumor). See Figure 2.2. It seems that in this example the linear regression approach is reasonably
solving the classification problem.



Figure 2.2: Using linear regression in a classification problem.

However, it is easy to construct examples where this method performs very poorly. Consider the scenario shown in Figure 2.3,
where one extra training example is added to the far right.


Figure 2.3: A scenario showing that linear regression is not the right technique to be used for a classification problem.

The previously proposed hypothesis still works well for this example; however, if we re-run linear regression,
the hypothesis hθ(x) that would fit the data would correspond to a line with a smaller slope. Using a classifier threshold
at y = 0.5, we will get a prediction rule that performs poorly (the data to the left of the threshold contain both positive and
negative classes). In general, linear regression is not the right tool to be used to solve classification problems. Intuitively, it also
does not make sense for hθ (x) to take values larger than 1 or smaller than 0 when we know that y ∈ {0, 1}. To fix this, what
we will do is to develop a new algorithm called “logistic regression”, where the outputs have the property that 0 ≤ hθ (x) ≤ 1.
Let us change the form of our hypothesis hθ (x). We will choose
hθ(x) = g(Θ^T x) = 1/(1 + e^{−Θ^T x})    (2.1)

where
g(z) = 1/(1 + e^{−z})    (2.2)
is called the “logistic function” or the “sigmoid function”. See Figure 2.4 for a plot of g(z).

Figure 2.4: A plot of the sigmoid function.


Note that g(z) tends towards 1 as z → ∞, and g(z) tends towards 0 as z → −∞. Moreover, g(z), and hence also Pn hθ (x),
is always bounded between 0 and 1. As before, we are keeping the convention of letting x0 = 1, so that ΘT x = θ0 + j=1 θj xj .
¯ ¯
For now, let us take the choice of g as given. Other functions that smoothly increase from 0 to 1 can also be used. Before
moving on, here is a useful property of the derivative of the sigmoid function, which we write as g′:

g′(z) = (d/dz) [1/(1 + e^{−z})] = e^{−z}/(1 + e^{−z})^2 = (1/(1 + e^{−z})) · (1 − 1/(1 + e^{−z})) = g(z)(1 − g(z))
How do we interpret the output hθ(x)? When the hypothesis outputs some number, we will interpret this number as the
estimated probability that y = 1 on a new input x. To understand this, consider the tumor example with a feature vector
x consisting of two entries: x0 = 1 and x1 = tumor size; x = [1 x1]^T. Given a certain patient, assume that the features
provide us with the hypothesis hθ(x) = 0.7. This hypothesis is interpreted as saying that the patient’s tumor has a 70%
chance of being malignant (y = 1). In probabilistic terms, hθ(x) is interpreted as the probability that y = 1, given
x, parametrized by θ. Since this is a classification task, we know that y can be either 0 or 1; as a result we can compute the
probability that y = 0, given x, parametrized by θ, as 1 − hθ(x). In summary,

P(y = 1|x; θ) = hθ(x)    (2.3)

P(y = 0|x; θ) = 1 − hθ(x)    (2.4)

Note that this can be written more compactly as

p(y|x; θ) = (hθ(x))^y (1 − hθ(x))^{1−y}    (2.5)
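To make this probabilistic reading concrete, here is a small Python sketch; the parameter values anticipate those derived in Example 6 below, and the test point is an arbitrary assumption:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    # h_theta(x) = g(Theta^T x), read as the estimate of P(y = 1 | x; theta)
    return sigmoid(theta @ x)

theta = np.array([-3.0, 1.0, 1.0])  # the parameters derived in Example 6 below
x = np.array([1.0, 2.0, 2.0])       # x0 = 1 plus two features
p1 = h(theta, x)                    # P(y = 1 | x; theta), as in (2.3)
p0 = 1.0 - p1                       # P(y = 0 | x; theta), as in (2.4)
print(p1, "predict y = 1" if p1 >= 0.5 else "predict y = 0")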

Example 6. (Linear Decision Boundary)


Consider a dataset with two features x1 and x2 and the associated classes × and ◦, as shown below. Using a logistic regression
algorithm, the derived classification rule is: choose ◦ if hθ(x) ≥ 0.5, and × otherwise. The resulting “decision boundary” is the
oblique bold line shown in the figure. What parameters Θ fit the given logistic regression model?

[Figure: training data in the (x1, x2) plane; the classes are separated by the line x2 = −x1 + 3.]

Solution. From the graph of the sigmoid function, we can see that hθ(x) ≥ 0.5 corresponds to Θ^T x ≥ 0. As we have
two features x1 and x2, this yields: θ0 + θ1 x1 + θ2 x2 ≥ 0. This inequality represents a decision region delimited by the line
θ2 x2 + θ1 x1 = −θ0. Looking at the figure, the decision boundary is the line x2 = −x1 + 3. By simple identification, θ0 = −3,
θ1 = θ2 = 1.

Example 7. (Non-Linear Decision Boundary)


Consider a dataset with two features x1 and x2 and the associated classes × and ◦, as shown below. Using a logistic regression
algorithm, the derived classification rule is: choose ◦ if hθ(x) ≥ 0.5, and × otherwise. Assume the predicted hypothesis has
the form hθ(x) = g(θ0 + θ1 x1 + θ2 x2 + θ3 x1^2 + θ4 x2^2), where Θ^T = [−1 0 0 1 1]. Specify the decision boundary for such
a classification problem.


[Figure: training data in the (x1, x2) plane; the ◦ points lie inside and the × points outside the unit circle centered at the origin.]

Solution. From the graph of the sigmoid function, we can see that hθ(x) ≥ 0.5 corresponds to θ0 + θ1 x1 + θ2 x2 +
θ3 x1^2 + θ4 x2^2 ≥ 0. Substituting the provided parameters, this is equivalent to −1 + x1^2 + x2^2 ≥ 0. Clearly, the decision
boundary is a circle centered at the origin with radius 1. This example shows that for more complex datasets, we need to add extra
features of higher orders.

The natural question to ask now is: given a training set, how do we fit the parameters Θ to it? Consider a training set of
m training examples {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))}, where each of the examples is represented by a feature vector
x = [x0 x1 x2 ... xn]^T, with x0 = 1 and y ∈ {0, 1}. The hypothesis is always given by (2.1). We need to choose Θ to fit
our training examples.
The cost function for linear regression was given by
J(θ) = (1/m) Σ_{i=1}^{m} (1/2)(hθ(x^(i)) − y^(i))^2

Suppose we define cost(hθ(x), y) ≜ (1/2)(hθ(x) − y)^2 to be the cost of predicting y via hθ(x). However, the problem we are
dealing with here is a logistic regression problem, and in this context hθ (x) is a non-linear function, which results in J(θ) being
a non-convex function. Actually, using the sigmoid function, J(θ) will be a function with many local optima, and hence, applying
the gradient descent algorithm, there is no guarantee that the algorithm would converge to the global optimum. Therefore, it
would make sense to define a new cost function that would turn out to be convex.
Define the following logistic regression cost function

cost(hθ(x), y) = { − log(hθ(x)),      if y = 1
                 { − log(1 − hθ(x)),  if y = 0        (2.6)
Obviously, if we predict that hθ (x) = 1, while the true value is y = 1, then cost = 0, and similarly, if we predict that hθ (x) = 0,
while the true value is y = 0, then the cost = 0. However, if for instance hθ (x) = 0, while the true value is y = 1, then
cost → +∞. See Figure 2.5.

Figure 2.5: Cost function for logistic regression

This result captures the intuition that if hθ (x) = 0 (predict P(y = 1|x; θ) = 0), but y = 1, we penalize the learning
algorithm with a very large cost. It is easy to show that (2.6) is indeed a convex function and hence free of local optima. The


cost function for logistic regression can be compactly written as

J(θ) = −(1/m) [ Σ_{i=1}^{m} y^(i) log hθ(x^(i)) + Σ_{i=1}^{m} (1 − y^(i)) log(1 − hθ(x^(i))) ]    (2.7)

This result can be derived as a maximum likelihood estimator under a set of assumptions (this also applies to the cost
function for linear regression). Let us endow our classification model with a set of probabilistic assumptions, and then fit the
parameters via maximum likelihood. Assuming that the m training examples were generated independently, we can then write
down the likelihood of the parameters as
L(Θ) = p(y|X; Θ) = Π_{i=1}^{m} p(y^(i)|x^(i); θ) = Π_{i=1}^{m} (hθ(x^(i)))^{y^(i)} (1 − hθ(x^(i)))^{1−y^(i)}    (2.8)

It is easier to maximize the log-likelihood function:


ℓ(Θ) = log L(Θ) = Σ_{i=1}^{m} [ y^(i) log h(x^(i)) + (1 − y^(i)) log(1 − h(x^(i))) ]

Now, we need to find the set of parameters Θ that minimizes J(θ):


Θ = arg min_Θ J(θ)
Similar to our derivation in the case of linear regression, we can use the gradient descent algorithm. Written in vectorial
notation, our updates will therefore be given by Θ := Θ − α∇_Θ J(θ). Let us start by working with just one training example
(x, y), and take derivatives to derive the stochastic gradient descent rule:
 
∂J(θ)/∂θj = −( y · 1/g(Θ^T x) − (1 − y) · 1/(1 − g(Θ^T x)) ) ∂g(Θ^T x)/∂θj
          = −( y · 1/g(Θ^T x) − (1 − y) · 1/(1 − g(Θ^T x)) ) g(Θ^T x)(1 − g(Θ^T x)) ∂(Θ^T x)/∂θj
          = −( y(1 − g(Θ^T x)) − (1 − y)g(Θ^T x) ) xj
          = (hθ(x) − y) xj
Above, we used the fact that g′(z) = g(z)(1 − g(z)). This therefore gives us the stochastic gradient descent rule:

Stochastic Gradient Descent Algorithm
Repeat until convergence { θj := θj − α(hθ(x^(i)) − y^(i)) xj^(i)   (simultaneously update all θj) }
If we compare this to the LMS update rule, we see that it looks identical; but this is not the same algorithm, because hθ(x^(i))
is now defined as a non-linear function of Θ^T x^(i). Nonetheless, it is a little surprising that we end up with the same update rule
for a rather different algorithm and learning problem. Is this coincidence, or is there a deeper reason behind this? It turns out
that both logistic regression and linear regression models belong to a broader family of models known as the Generalized Linear
Models (GLM), and for these models the LMS update rule is a natural rule.
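As an illustration, a minimal NumPy sketch of this stochastic gradient descent rule for logistic regression (toy data; names, step size, and epoch count are assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_sgd(X, y, alpha=0.1, epochs=200):
    # X: m-by-(n+1) design matrix with intercept column; y: labels in {0, 1}.
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # theta_j := theta_j - alpha (h_theta(x) - y) x_j, one example at a time
            theta -= alpha * (sigmoid(theta @ xi) - yi) * xi
    return theta

X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = logistic_sgd(X, y)   # separates the two classes around x1 = 2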
Remark 7. Gradient descent is not the only algorithm that could be used for logistic regression problems. There are more
powerful and sophisticated algorithms that could attain the optimal solution more quickly and that scale much better to larger
machine learning problems with a large number of features. Such algorithms are:
• Conjugate gradient
• BFGS
• L-BFGS
The aforementioned algorithms have another advantage in addition to fast convergence. For these algorithms, there is no need
to manually pick the learning rate α; they have a clever inner loop called a “line search algorithm” that tries
different values for α and automatically picks an appropriate learning rate, sometimes even picking a different learning
rate for every iteration. The main disadvantage of using these algorithms is the complexity of implementation (one needs to have
some experience in numerical computing).


2.2 Generative Learning Algorithms


So far, we have mainly been talking about learning algorithms that model p(y|x; θ), the conditional distribution of y given
x. For instance, logistic regression modeled p(y|x; θ) as hθ(x) = g(Θ^T x) where g is the sigmoid function. In this section, we
will treat a different type of learning algorithm.
Consider a classification problem in which we want to learn to distinguish between elephants (y = 1) and dogs (y = 0),
based on some features of an animal. Given a training set, an algorithm like logistic regression tries to find a straight line−
that is, a decision boundary−that separates the elephants and dogs. Then, to classify a new animal as either an elephant or a
dog, it checks on which side of the decision boundary it falls, and makes its prediction accordingly.
Here is a different approach. First, looking at elephants, we can build a model of what elephants look like. Then, looking
at dogs, we can build a separate model of what dogs look like. Finally, to classify a new animal, we can match the new animal
against the elephant model, and match it against the dog model, to see whether the new animal looks more like the elephants
or more like the dogs we had seen in the training set.
Algorithms that try to learn p(y |x) directly (such as logistic regression), or algorithms that try to learn mappings directly
from the space of inputs X to the labels {0, 1}, (such as the perceptron algorithm) are called discriminative learning algorithms.
Here, we will talk about algorithms that instead try to model p(x|y ) (and p(y )). These algorithms are called generative learning
algorithms. For instance, if y indicates whether an example is a dog (0) or an elephant (1), then p(x|y = 0) models the
distribution of dogs’ features, and p(x|y = 1) models the distribution of elephants’ features.
After modeling p(y ) (called the class priors) and p(x|y ), our algorithm can then use Bayes’ rule to derive the posterior
distribution on y given x:
p(y|x) = p(x|y)p(y)/p(x)

Here, the denominator is given by p(x) = p(x|y = 1)p(y = 1) + p(x|y = 0)p(y = 0) (you should be able to verify that this is
true from the standard properties of probabilities), and thus can also be expressed in terms of the quantities p(x|y) and p(y)
that we have learned. Actually, if we are calculating p(y|x) in order to make a prediction, then we do not actually need to
calculate the denominator, since

arg max_y p(y|x) = arg max_y p(x|y)p(y)/p(x) = arg max_y p(x|y)p(y)

2.2.1 Gaussian Discriminant Analysis


The first generative learning algorithm that we will look at is Gaussian discriminant analysis (GDA). In this model, we will
assume that p(x|y ) is distributed according to a multivariate normal distribution. Let us talk briefly about the properties of
multivariate normal distributions before moving on to the GDA model itself.

The Multivariate Normal Distribution


The multivariate normal distribution in n dimensions, also called the multivariate Gaussian distribution, is parameterized
by a mean vector E[x] = µ and a covariance matrix Kxx = Σ ∈ R^{n×n}, where Σ is a symmetric and positive-definite matrix. Also
written N(µ, Σ), its density is given by

p(x; µ, Σ) = 1/((2π)^{n/2} |Σ|^{1/2}) exp( −(1/2)(x − µ)^T Σ^{−1} (x − µ) )    (2.9)

In this equation, |Σ| denotes the determinant of the matrix Σ. Figure 2.6 shows some examples of what the density of a Gaussian
distribution looks like.
The left-most figure shows a Gaussian with mean zero and covariance matrix Σ = I. A Gaussian distribution with zero mean
and identity covariance is also called the “standard normal distribution”. The middle figure shows the density of a Gaussian
with zero mean and Σ = 0.6I, and the rightmost figure shows one with Σ = 2I. We see that as Σ becomes larger, the
Gaussian distribution becomes more “spread-out”, and as it becomes smaller, the distribution becomes more “compressed”.
Let us look at some more examples.



Figure 2.6: Examples of the density of a multivariate Gaussian distribution for (left to right) Σ = I, Σ = 0.6I, and Σ = 2I.
The figures above show Gaussians with mean 0, and with respective covariance matrices

Σ = [ 1   0 ]      Σ = [ 1    0.5 ]      Σ = [ 1    0.8 ]
    [ 0   1 ] ;        [ 0.5   1  ] ;        [ 0.8   1  ]

The leftmost figure shows the familiar standard normal distribution, and we see that as we increase the off-diagonal entry
in Σ, the density becomes more “compressed” towards the 45° line (given by x1 = x2). We can see this more clearly when we
look at the contours of the same three densities.

Here is one last set of examples generated by varying Σ:

Σ = [ 1    −0.5 ]      Σ = [ 1    −0.8 ]      Σ = [ 3    0.8 ]
    [ −0.5   1  ] ;        [ −0.8   1  ] ;        [ 0.8   1  ]

From the leftmost and middle figures, we see that by decreasing the off-diagonal elements of the covariance matrix, the density
becomes “compressed” again, but in the opposite direction. Lastly, as we vary the parameters, more generally the contours will
form ellipses (the rightmost figure showing an example). As our last set of examples, fixing Σ = I and varying µ, we can also
move the mean of the density around.

The Gaussian Discriminant Analysis Model

When we have a classification problem in which the input features x are continuous-valued random variables, we can then
use the Gaussian Discriminant Analysis (GDA) model, which models p(x|y) using a multivariate normal distribution. The model
is:

y ∼ Ber(φ)    (2.10)
x|y = 0 ∼ N(µ0, Σ)    (2.11)
x|y = 1 ∼ N(µ1, Σ)    (2.12)

Writing out the distributions,

p(y) = φ^y (1 − φ)^{1−y}    (2.13)

p(x|y = 0) = 1/((2π)^{n/2} |Σ|^{1/2}) exp( −(1/2)(x − µ0)^T Σ^{−1} (x − µ0) )    (2.14)

p(x|y = 1) = 1/((2π)^{n/2} |Σ|^{1/2}) exp( −(1/2)(x − µ1)^T Σ^{−1} (x − µ1) )    (2.15)

Here, the parameters of our model are φ, Σ, µ0 , µ1 . (Note that there are two different mean vectors and only one covariance
matrix Σ). The log-likelihood of the data is given by
ℓ(φ, Σ, µ0, µ1) = log Π_{i=1}^{m} p(x^(i), y^(i); φ, µ0, µ1, Σ) = log Π_{i=1}^{m} p(x^(i)|y^(i); µ0, µ1, Σ) p(y^(i); φ)    (2.16)
By maximizing ℓ with respect to the parameters, we find the maximum likelihood estimates of the parameters to be

φ = (1/m) Σ_{i=1}^{m} 1{y^(i) = 1}    (2.17)

µ0 = Σ_{i=1}^{m} 1{y^(i) = 0} x^(i) / Σ_{i=1}^{m} 1{y^(i) = 0}    (2.18)

µ1 = Σ_{i=1}^{m} 1{y^(i) = 1} x^(i) / Σ_{i=1}^{m} 1{y^(i) = 1}    (2.19)

Σ = (1/m) Σ_{i=1}^{m} (x^(i) − µ_{y^(i)})(x^(i) − µ_{y^(i)})^T    (2.20)
Pictorially, what the algorithm is doing can be seen in Figure 2.7.
Figure 2.7: Shown in the figure are the training set, as well as the contours of the two Gaussian distributions that have been
fit to the data in each of the two classes. Note that the two Gaussians have contours of the same shape and orientation,
since they share a covariance matrix Σ, but they have different mean vectors µ0 and µ1.

Also shown in Figure 2.7 is the straight line giving the decision boundary at which p(y = 1|x) = 0.5. On one side of the
boundary, we will predict y = 1 to be the most likely outcome, and on the other side, we will predict y = 0.
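Before comparing GDA with logistic regression, here is a minimal NumPy sketch of fitting a GDA model via the maximum likelihood estimates (2.17)–(2.20) (toy data; names are illustrative):

import numpy as np

def fit_gda(X, y):
    # Maximum likelihood estimates (2.17)-(2.20); X is m-by-n, y in {0, 1}.
    m = len(y)
    phi = np.mean(y == 1)                 # (2.17)
    mu0 = X[y == 0].mean(axis=0)          # (2.18)
    mu1 = X[y == 1].mean(axis=0)          # (2.19)
    D = X - np.where((y == 1)[:, None], mu1, mu0)   # x^(i) - mu_{y^(i)}
    Sigma = D.T @ D / m                   # (2.20): single shared covariance
    return phi, mu0, mu1, Sigma

X = np.array([[1.0, 2.0], [0.8, 1.5], [3.0, 4.2], [3.5, 3.9]])
y = np.array([0, 0, 1, 1])
phi, mu0, mu1, Sigma = fit_gda(X, y)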
GDA Versus Logistic Regression
The GDA model has an interesting relationship to logistic regression. If we view the quantity p(y = 1|x; φ, µ0 , µ1 , Σ) as a
function of x, we will find that it can be expressed in the form
p(y = 1|x; φ, µ0, µ1, Σ) = 1/(1 + exp(−θ^T x))    (2.21)

where θ is some appropriate function of φ, µ0 , µ1 and Σ. This is exactly the form that logistic regression (a discriminative
algorithm) used to model p(y = 1|x).
When would we prefer one model over another? GDA and logistic regression will, in general, give different decision boundaries
when trained on the same dataset. Which is better? We just argued that if p(x|y ) is multivariate gaussian (with shared Σ),
then p(y |x) necessarily follows a logistic function. The converse, however, is not true; i.e., p(y |x) being a logistic function does
not imply p(x|y ) is multivariate gaussian. This shows that GDA makes stronger modeling assumptions about the data than
does logistic regression. It turns out that when these modeling assumptions are correct, then GDA will find better fits to the
data, and is a better model. Specifically, when p(x|y ) is indeed gaussian (with shared Σ), then GDA is asymptotically efficient.
Informally, this means that in the limit of very large training sets (large m), there is no algorithm that is strictly better than
GDA (in terms of, say, how accurately they estimate p(y |x)). In particular, it can be shown that in this setting, GDA will be a
better algorithm than logistic regression; and more generally, even for small training set sizes, we would generally expect GDA
to better.
In contrast, by making significantly weaker assumptions, logistic regression is also more robust and less sensitive to incorrect
modeling assumptions. There are many different sets of assumptions that would lead to p(y |x) taking the form of a logistic
function. For example, if x|y = 0 ∼ Poisson(λ0 ), and x|y = 1 ∼ Poisson(λ1 ), then p(y |x) will be logistic. Logistic regression
will also work well on Poisson data like this. But if we were to use GDA on such data, fitting Gaussian distributions to such
non-Gaussian data, then the results will be less predictable, and GDA may (or may not) do well.
To summarize: GDA makes stronger modeling assumptions, and is more data efficient (i.e., requires less training data to
learn “well”) when the modeling assumptions are correct or at least approximately correct. Logistic regression makes weaker
assumptions, and is significantly more robust to deviations from modeling assumptions. Specifically, when the data is indeed
non-Gaussian, then in the limit of large datasets, logistic regression will almost always do better than GDA. For this reason,
in practice logistic regression is used more often than GDA. (Some related considerations about discriminative vs. generative
models also apply for the Naive Bayes algorithm that we discuss next, but the Naive Bayes algorithm is still considered a very
good, and is certainly also a very popular, classification algorithm.)

2.3 Naive Bayes Algorithm


In GDA, the feature vectors x were continuous, real-valued vectors. Let us now talk about a different learning algorithm in
which the xj ’s are discrete-valued.

2.3.1 The Naive Bayes Classifier


Consider the problem of classifying an input image into one of two classes - male or female. Naive Bayes (NB) is a popular
classification method and aids in discussions related to conditional independence, overfitting and Bayesian methods.
Let us suppose that the vector of features x had many elements, so that there were lots of different features that were
measured. How would this affect the classifier? In NB, we are trying to estimate a joint distribution of a D−dimensional
attribute (input) vector x and the corresponding class label y , p(x, y ). As the dimensionality of x increases, this task becomes
harder and harder. However, there is one simplifying assumption that we can make. We can assume that the elements of the
feature vector are conditionally independent of each other, given the classification. So given a class y , the values of the different
features do not affect each other. This is the naïveté in the name of the classifier, since it often does not make much sense−it
tells us that the features are independent of each other. If we were to try to classify coins it would say that the weight and the
diameter of the coin are independent of each other, which clearly is not true. However, it does mean that the probability of
getting the string of feature values is just equal to the product of multiplying together all of the individual probabilities:

p(x, y) = p(y) Π_{j=1}^{D} p(xj|y)    (2.22)

which is much easier to compute, and reduces the severity of the dimensionality problem.
So the rule for the naive Bayes classifier is to select the class yi for which the following quantity is maximal:

p(y) Π_{j=1}^{D} p(xj|y)


Coupled with a suitable choice for each conditional distribution p(xj |y ), we can use Bayes’ rule to form the Bayes classifier
for a novel input vector x ∗ :
p(y|x*) = p(x*|y)p(y)/p(x*) = [ p(y) Π_{j=1}^{D} p(xj*|y) ] / [ Σ_y p(y)p(x*|y) ]    (2.23)
In practice it is common to consider only two classes dom(y ) = {0, 1}. The theory we describe below is valid for any number
of classes y , though our examples are restricted to the binary class case. Also, the attributes xj are often taken to be binary, as
we shall do initially below as well. The extension to more than two attribute states, or continuous attributes is straightforward.
As an example, consider a bank that has a record of past loans containing customer data (the customer’s yearly income and
savings) and whether the loan was paid back or not. From this data of particular applications, the aim is to infer a general
rule coding the association between a customer’s attributes and his risk. That is, the machine learning system fits a model to
the past data to be able to calculate the risk for a new application and then decides to accept or refuse it accordingly. Assume
the customer’s yearly income and savings are represented by two random variables x1 and x2. With what we can observe, the
credibility of a customer is denoted by a Bernoulli random variable y conditioned on the observables x = [x1 , x2 ]T where y = 1
indicates a high-risk customer and y = 0 indicates a low-risk customer. Thus if we know p(y |x1 , x2 ) we can choose


y = 1, if p(y = 1|x1, x2) > p(y = 0|x1, x2); y = 0, otherwise    (2.24)

The probability of error is


pe = 1 − max{p(y = 1|x1 , x2 ), p(y = 0|x1 , x2 )} (2.25)
The problem then is to be able to calculate p(y|x). Using Bayes’ rule, it can be written

p(y|x) = p(x|y)p(y)/p(x)
• Prior Probability. p(y = 1) is called the prior probability that y takes the value 1, which in our example corresponds to
the probability that a customer is high-risk, regardless of the x value. It is called the prior probability because it is the
knowledge we have as to the value of y before looking at the observables x, satisfying
p(y = 0) + p(y = 1) = 1

• Class Likelihood. p(x|y ) is called the class likelihood and is the conditional probability that an event belonging to y has
the associated observation value x. In our case, p(x1 , x2 |y = 1) is the probability that a high-risk customer has his or her
x1 and x2 . It is what the data tells us regarding the class.

• Evidence. p(x), the evidence, is the marginal probability that an observation x is seen, regardless of whether it is a
positive or negative example.
p(x) = Σ_y p(x, y) = p(x|y = 1)p(y = 1) + p(x|y = 0)p(y = 0)

• Posterior Probability. Combining the prior and what the data tells us using Bayes’ rule, we calculate the posterior
probability of the concept, p(y |x), after having seen the observation, x.
prior × likelihood
posterior =
evidence
Because of normalization by the evidence, the posteriors sum up to 1:

p(y = 0|x) + p(y = 1|x) = 1
Once we have the posteriors, we decide by using Eq. (2.24).


In the general case, we have K mutually exclusive and exhaustive classes; yi ,i = 1, ... , K . The posterior probability of class yi
can be calculated as
p(yi|x) = p(x|yi)p(yi) / p(x) = p(x|yi)p(yi) / Σ_{k=1}^{K} p(x|yk)p(yk)

and for minimum error, the Bayes classifier chooses the class with the highest posterior probability; that is, we choose yi if
p(yi|x) = max_k p(yk|x).

Example 8. EZsurvey.org partitions radio station listeners into two groups − the “young” and “old”. They assume that, given
the knowledge that a customer is either “young” or “old”, this is sufficient to determine whether or not a customer will like a
particular radio station, independent of their likes or dislikes for any other stations:

p(r1 , r2 , r3 , r4 |age) = p(r1 |age)p(r2 |age)p(r3 |age)p(r4 |age)

where each of the variables r1 , r2 , r3 , r4 can take the states like or dislike, and the “age” variable can take the value young or
old. Thus the information about the age of the customer determines the individual radio station preferences without needing
to know anything else. To complete the specification, given that a customer is young, she has a 95% chance to like Radio1, a
5% chance to like Radio2, a 2% chance to like Radio3 and a 20% chance to like Radio4. Similarly, an old listener has a 3%
chance to like Radio1, an 82% chance to like Radio2, a 34% chance to like Radio3 and a 92% chance to like Radio4. They
know that 90% of the listeners are old.
Given this model, and the fact that a new customer likes Radio1, and Radio3, but dislikes Radio2 and Radio4, what is the
probability that the new customer is young?

Solution. Let D denote “dislike” and L denote “like”. The probability in question can be written as

p(young|r1 = L, r2 = D, r3 = L, r4 = D) = p(r1 = L, r2 = D, r3 = L, r4 = D|young)p(young) / Σ_age p(r1 = L, r2 = D, r3 = L, r4 = D|age)p(age)

Using the naive Bayes structure, the numerator above is given by

p(r1 = L|young)p(r2 = D|young)p(r3 = L|young)p(r4 = D|young)p(young)

Plugging in the values we obtain


0.95 × 0.95 × 0.02 × 0.8 × 0.1 = 0.0014
The denominator is given by this value plus the corresponding term evaluated assuming the customer is old,

0.03 × 0.18 × 0.34 × 0.08 × 0.9 = 1.3219 × 10−4

Which gives
p(young|r1 = L, r2 = D, r3 = L, r4 = D) = 0.0014 / (0.0014 + 1.3219 × 10^{−4}) = 0.9161
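The arithmetic above is easy to verify numerically; a small Python sketch using the probabilities given in the example:

# p_like[age][j] is the probability that a listener of that age likes station j.
p_like = {"young": [0.95, 0.05, 0.02, 0.20], "old": [0.03, 0.82, 0.34, 0.92]}
prior = {"young": 0.1, "old": 0.9}
obs = [1, 0, 1, 0]   # likes Radio1 and Radio3, dislikes Radio2 and Radio4

joint = {}
for age in ("young", "old"):
    p = prior[age]
    for liked, r in zip(obs, p_like[age]):
        p *= r if liked else (1.0 - r)   # naive Bayes factorization
    joint[age] = p

print(joint["young"] / (joint["young"] + joint["old"]))  # approximately 0.9161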

2.3.2 Estimation using Maximum Likelihood


Consider a dataset {(x^n, y^n), n = 1, ..., N} of binary attributes, xj^n ∈ {0, 1}, j = 1, ..., D, and associated classes y^n. The
number of data points from class y = 0 is denoted n0 and the number from class y = 1 is denoted n1. For each attribute of the
two classes, we need to estimate the values p(xj = 1|y) ≜ θj,y. The other probability, p(xj = 0|y), is given by the normalization
property p(xj = 0|y) = 1 − θj,y.


Based on the NB conditional independence assumption the probability of observing a vector x can be compactly written
¯
p(x|y) = Π_{j=1}^{D} p(xj|y) = Π_{j=1}^{D} (θj,y)^{xj} (1 − θj,y)^{1−xj}    (2.26)

In the above expression, xj is either 0 or 1, and hence each term contributes a factor θj,y if xj = 1 or 1 − θj,y if xj = 0. Together
with the assumption that the training data is i.i.d. generated, the log likelihood of the attributes and class labels is
L = Σ_n log p(x^n, y^n) = Σ_n log [ p(y^n) Π_j p(xj^n|y^n) ]    (2.27)
  = Σ_{j,n} [ xj^n log θ_{j,y^n} + (1 − xj^n) log(1 − θ_{j,y^n}) ] + n0 log p(y = 0) + n1 log p(y = 1)    (2.28)

This can be written more explicitly in terms of the parameters as


L = Σ_{j,n} { I[xj^n = 1, y^n = 0] log θj,0 + I[xj^n = 0, y^n = 0] log(1 − θj,0)    (2.29)
      + I[xj^n = 1, y^n = 1] log θj,1 + I[xj^n = 0, y^n = 1] log(1 − θj,1) } + n0 log p(y = 0) + n1 log p(y = 1)    (2.30)

We can find the maximum likelihood optimal θj,y by differentiating with respect to θj,y and equating to zero, giving
θj,y = p(xj = 1|y) = Σ_n I[xj^n = 1, y^n = y] / ( Σ_n I[xj^n = 0, y^n = y] + Σ_n I[xj^n = 1, y^n = y] )
     = (number of times xj = 1 for class y) / (number of data points in class y)    (2.31)

Similarly, optimizing L with respect to p(y ) gives

p(y) = (number of times class y occurs) / (total number of data points)    (2.32)
These results are consistent with a general theory that maximum likelihood corresponds to setting tables by counting.
We classify a novel input x* as class 1 if

p(y = 1|x*) > p(y = 0|x*)    (2.33)
Using Bayes’ rule and writing the log of the above expression, this is equivalent to

log p(x*|y = 1) + log p(y = 1) − log p(x*) > log p(x*|y = 0) + log p(y = 0) − log p(x*)

This is equivalent to

Σ_j log p(xj*|y = 1) + log p(y = 1) > Σ_j log p(xj*|y = 0) + log p(y = 0)

Using the binary encoding xj ∈ {0, 1}, we therefore classify x as class 1 if
Σ_j { xj* log θj,1 + (1 − xj*) log(1 − θj,1) } + log p(y = 1) > Σ_j { xj* log θj,0 + (1 − xj*) log(1 − θj,0) } + log p(y = 0)
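A minimal NumPy sketch of this counting-based training and the log-domain classification rule; the data rows reproduce the tables of Example 9 below (the clipping is a practical guard against zero counts, not part of the derivation):

import numpy as np

def fit_naive_bayes(X, y):
    # X: N-by-D binary attribute matrix; y: class labels in {0, 1}.
    # theta[j, c] estimates p(x_j = 1 | y = c) by counting, as in (2.31);
    # prior[c] estimates p(y = c) by counting, as in (2.32).
    theta = np.stack([X[y == c].mean(axis=0) for c in (0, 1)], axis=1)
    prior = np.array([np.mean(y == 0), np.mean(y == 1)])
    return theta, prior

def classify(x, theta, prior):
    # Clipping guards against log(0) when a count is zero; in practice one
    # would smooth the counts instead.
    t = np.clip(theta, 1e-12, 1.0 - 1e-12)
    scores = [np.sum(x * np.log(t[:, c]) + (1 - x) * np.log(1 - t[:, c]))
              + np.log(prior[c]) for c in (0, 1)]
    return int(scores[1] > scores[0])

# Columns of tables (a) and (b) of Example 9 below, one person per row
# (German = class 0, Italian = class 1).
X = np.array([[0,0,1,1,1],[1,0,1,1,0],[1,1,0,0,1],[1,1,0,0,0],[0,1,0,0,1],[0,0,0,1,0],
              [1,0,0,1,1],[1,1,0,0,1],[1,1,1,1,0],[1,1,0,1,0],[1,1,0,1,1],[1,0,1,1,0],[1,0,1,0,0]])
y = np.array([0]*6 + [1]*7)
theta, prior = fit_naive_bayes(X, y)
print(classify(np.array([1, 0, 1, 1, 0]), theta, prior))   # 1, i.e. Italian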

Example 9. Consider the following vector of binary attributes: (pizza, espresso, beer, sausages, football). A vector x =
(1, 0, 1, 1, 0)^T would describe that a person likes pizza, does not like espresso, drinks beer, eats sausages, and does not like to play
football. Together with each vector x, there is a label y describing the nationality of the person, dom(y) = {German, Italian}.
Table (a) depicts German tastes for 6 people and table (b) depicts Italian tastes for 7 people over the attributes (pizza, espresso,
beer, sausages, football). Each column represents the tastes of an individual.


             (a) German         (b) Italian
pizza:       0 1 1 1 0 0        1 1 1 1 1 1 1
espresso:    0 0 1 1 1 0        0 1 1 1 1 0 0
beer:        1 1 0 0 0 0        0 0 1 0 0 1 1
sausages:    1 1 0 0 0 1        1 0 1 1 1 1 0
football:    1 0 1 0 1 0        1 1 0 0 1 0 0

We wish to classify a person who likes eating pizza and sausages and drinking beer, but does not like drinking espresso and
does not play football, as either German or Italian.

Solution. We wish to classify the vector x = (1, 0, 1, 1, 0)^T as either German or Italian. Using Bayes’ rule,

p(Italian|x) = p(x|Italian)p(Italian)/p(x) = p(x|Italian)p(Italian) / [ p(x|Italian)p(Italian) + p(x|German)p(German) ]

By maximum likelihood, the “prior” class probabilities are p(Italian) = 7/13 and p(German) = 6/13. For p(x|y), we use the
Naive Bayes assumption:

p(x|y) = p(x1|y)p(x2|y)p(x3|y)p(x4|y)p(x5|y)

Based on the tables and using maximum likelihood, we have:

p(x1 = 1|German) = 1/2        p(x1 = 1|Italian) = 1
p(x2 = 1|German) = 1/2        p(x2 = 1|Italian) = 4/7
p(x3 = 1|German) = 1/3        p(x3 = 1|Italian) = 3/7
p(x4 = 1|German) = 1/2        p(x4 = 1|Italian) = 5/7
p(x5 = 1|German) = 1/2        p(x5 = 1|Italian) = 3/7

For x = (1, 0, 1, 1, 0)^T, we get

p(Italian|x) = [1 · (3/7) · (3/7) · (5/7) · (4/7) · (7/13)] / ( [1 · (3/7) · (3/7) · (5/7) · (4/7) · (7/13)] + [(1/2) · (1/2) · (1/3) · (1/2) · (1/2) · (6/13)] ) = 0.8076
Since this is greater than 0.5, we would classify this person as being Italian.

2.3.3 Spam Filtering
For our motivating example, consider building an email spam filter using machine learning. Here, we wish to classify messages
according to whether they are unsolicited commercial (spam) email or non-spam email. After learning to do this, we can then
have our mail reader automatically filter out the spam messages and perhaps place them in a separate mail folder. Classifying
n=1 i=1 s=1 c=1 c=1 i=1 s=1
emails is one example of a broader set of problems called text classification.
Let us say we have a training set (a set of emails labeled as spam or non-spam). We will begin construction of our spam
filter by specifying the features xj used to represent an email. We will represent an email via a feature vector whose length is
equal to the number of words in the dictionary. Specifically, if an email contains the j−th word of the dictionary, then we will


set xj = 1; otherwise, we let xj = 0. For instance, the vector

x = [1, 0, 0, ..., 1, ..., 0]^T   (a, aardvark, aardwolf, ..., buy, ..., zymurgy)    (2.34)

is used to represent an email that contains the words “a” and “buy,” but not “aardvark,” “aardwolf” or “zymurgy”1 . The
set of words encoded into the feature vector is called the vocabulary, so the dimension of x is equal to the size of the vocabulary.
¯
Having chosen our feature vector, we now want to build a generative model. So, we have to model p(x|y). But if we
have, say, a vocabulary of 50000 words, then x ∈ {0, 1}^50000 (x is a 50000-dimensional vector of 0’s and 1’s), and if we
were to model x explicitly with a multinomial distribution over the 2^50000 possible outcomes, then we would end up with a
(2^50000 − 1)-dimensional parameter vector. This is clearly too many parameters.
To model p(x|y), we will assume that the xj ’s are conditionally independent given y, i.e., the Naive Bayes (NB)
assumption, and the resulting algorithm is called the Naive Bayes classifier. For instance, if y = 1 means spam email; “buy” is
word 2087 and “price” is word 39831; then we are assuming that if I tell you y = 1 (that a particular piece of email is spam),
then knowledge of x2087 (knowledge of whether “buy” appears in the message) will have no effect on your beliefs about the
value of x39831 (whether “price” appears). More formally, this can be written p(x2087 |y ) = p(x2087 |y , x39831 ). (Note that this
is not the same as saying that x2087 and x39831 are independent, which would have been written p(x2087 ) = p(x2087 |x39831 );
rather, we are only assuming that x2087 and x39831 are conditionally independent given y .)
We now have,

p(x1 , ... , x50000 |y ) = p(x1 |y )p(x2 |y , x1 )p(x3 |y , x1 , x2 ) ... p(x50000 |y , x1 , ... , x49999 )
= p(x1 |y )p(x2 |y )p(x3 |y ) ... p(x50000 |y )
= Π_{j=1}^{n} p(xj|y)

The first equality simply follows from the usual properties of probabilities, and the second equality used the NB assumption.
We note that even though the Naive Bayes assumption is an extremely strong assumptions, the resulting algorithm works well
on many problems.
Our model is parametrized by φj|y =1 = p(xj = 1|y = 1), φj|y =0 = p(xj = 1|y = 0), and φy = p(y = 1). As usual, given a
training set {(x (i) , y (i) ); i = 1, ... , m}, we can write down the joint likelihood of the data
L(φy, φ_{j|y=1}, φ_{j|y=0}) = Π_{i=1}^{m} p(x^(i), y^(i))    (2.35)
1 Actually, rather than looking through an English dictionary for the list of all English words, in practice it is more common to look through our

training set and encode in our feature vector only the words that occur at least once there. Apart from reducing the number of words modeled and
hence reducing our computational and space requirements, this also has the advantage of allowing us to model/include as a feature many words that
may appear in your email (such as “CCE414”) but that you will not find in a dictionary. Sometimes, we also exclude the very high frequency words
(which will be words like “the”, “of”, “and”; these high frequency, “content free” words are called stop words) since they occur in so many documents
and do little to indicate whether an email is spam or non-spam.


Maximizing this with respect to φy, φ_{j|y=1}, φ_{j|y=0} gives the maximum likelihood estimates

φ_{j|y=1} = Σ_{i=1}^{m} 1{xj^(i) = 1, y^(i) = 1} / Σ_{i=1}^{m} 1{y^(i) = 1}    (2.36)

φ_{j|y=0} = Σ_{i=1}^{m} 1{xj^(i) = 1, y^(i) = 0} / Σ_{i=1}^{m} 1{y^(i) = 0}    (2.37)

φy = (1/m) Σ_{i=1}^{m} 1{y^(i) = 1}    (2.38)
The parameters have a very natural interpretation. For instance, φj|y =1 is just the fraction of the spam (y = 1) emails in
which word j does appear.
Having fit all these parameters, to make a prediction on a new example with features x, we then simply calculate

p(y = 1|x) = p(x|y = 1)p(y = 1) / p(x)
           = [ p(y = 1) Π_{j=1}^{n} p(xj|y = 1) ] / [ p(y = 1) Π_{j=1}^{n} p(xj|y = 1) + p(y = 0) Π_{j=1}^{n} p(xj|y = 0) ]    (2.39)

and pick whichever class has the higher posterior probability.
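A minimal sketch of this prediction step, assuming the parameters have already been estimated; the vocabulary and parameter values below are toy assumptions:

import numpy as np

# A toy vocabulary; in practice it is built from the words seen in training.
vocab = {"a": 0, "aardvark": 1, "buy": 2, "price": 3, "zymurgy": 4}

def email_to_vector(words, vocab):
    x = np.zeros(len(vocab))
    for w in words:
        if w in vocab:
            x[vocab[w]] = 1.0        # x_j = 1 iff word j occurs in the email
    return x

def p_spam(x, phi_y, phi1, phi0):
    # Posterior (2.39) under the NB assumption; phi1[j] = p(x_j = 1 | y = 1).
    like1 = np.prod(np.where(x == 1, phi1, 1 - phi1)) * phi_y
    like0 = np.prod(np.where(x == 1, phi0, 1 - phi0)) * (1 - phi_y)
    return like1 / (like1 + like0)

x = email_to_vector(["buy", "a", "price"], vocab)
phi1 = np.array([0.40, 0.01, 0.60, 0.50, 0.01])   # toy estimates of (2.36)
phi0 = np.array([0.50, 0.02, 0.10, 0.05, 0.01])   # toy estimates of (2.37)
print(p_spam(x, 0.3, phi1, phi0))                 # posterior spam probability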


Lastly, we note that while we have developed the Naive Bayes algorithm mainly for the case of problems where the features
xj are binary-valued, the generalization to where xj can take values in {1, 2, ... , kj } is straightforward. Here, we would simply
model p(xj |y ) as multinomial rather than as Bernoulli. Indeed, even if some original input attribute (say, the living area of a
house, as in our earlier example) were continuous valued, it is quite common to discretize it−that is, turn it into a small set
of discrete values−and apply Naive Bayes. For instance, if we use some feature xj to represent living area, we might discretize
the continuous values as follows: xj = 1 if the living area is smaller than 400, xj = 2 if it is between 400 and 800, xj = 3 if
it is between 800 and 1200, etc. We can then apply the Naive Bayes algorithm, and model p(xj|y) with a multinomial
distribution, as described previously. When the original, continuous-valued attributes are not well-modeled by a multivariate
normal distribution, discretizing the features and using Naive Bayes (instead of GDA) will often result in a better classifier.
Bayesian methods typically generalize much better when limited training data is available, but typically suffer from high
computational cost when the number of training examples is large.

2.4 Decision Trees


Decision tree learning is one of the most widely used and practical methods for inductive inference. It is a method for
approximating discrete-valued functions that is robust to noisy data and capable of learning disjunctive expressions. This
section describes the ID3 (Iterative Dichotomiser 3) algorithm, a widely used decision tree algorithm.
Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is
represented by a decision tree. Learned trees can also be re-represented as sets of if-then rules to improve human readability.
These learning methods are among the most popular of inductive inference algorithms and have been successfully applied to a
broad range of tasks from learning to diagnose medical cases to learning to assess credit risk of loan applicants.

2.4.1 Decision Tree Representation


Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the
classification of the instance. Each node in the tree specifies a test of some attribute of the instance, and each branch
descending from that node corresponds to one of the possible values for this attribute. An instance is classified by starting at
the root node of the tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the
value of the attribute in the given example. This process is then repeated for the subtree rooted at the new node.


Figure 2.8: An example of a decision tree for the concept of PlayTennis. An example is classified by sorting it through the
tree to the appropriate leaf node, then returning the classification associated with this leaf (in this case, Yes or No). This tree
classifies Saturday mornings according to whether or not they are suitable for playing tennis.

Figure. 2.8 illustrates a typical learned decision tree. This decision tree classifies Saturday mornings according to whether
they are suitable for playing tennis. For example, the “instance”

⟨Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong⟩

would be sorted down the leftmost branch of this decision tree and would therefore be classified as a negative instance (i.e.,
the tree predicts that PlayTennis = no).
A decision tree consists of:

• Nodes: test for the value of a certain attribute

• Edges: correspond to the outcome of a test and connect to the next node or leaf

• Leaves: terminal nodes that predict the outcome

A decision tree classifies data using the attributes. Every tree consists of decision nodes and decision leaves. Nodes can have
two or more branches, which represent the value for the attribute tested. Leaf nodes produce a homogeneous result.

2.4.2 Decision Tree Problem Setting


Any problem cast as a decision tree problem is set as follows

• Set of possible “instances” X

• Set of possible “labels” Y

• Unknown “target function” f : X → Y

• Set of function hypotheses H = {h|h : X → Y}. Each hypothesis h is a decision tree.

• Input: Training examples of unknown target function f

{(xi, yi)}_{i=1}^{n} = {(x1, y1), ..., (xn, yn)}
• Output: Hypothesis h ∈ H that best approximates f

In the rest of the section we will use the learning task represented by the training examples of Figure 2.9.
Every row i corresponds to the i-th labeled instance (xi, yi) for i = 1, 2, ..., n, and every column denotes a feature x^(j).
Going back to Figure 2.8, each internal node tests one feature x^(j), each branch from a node selects one value for x^(j), and
each leaf node predicts y. Note that the features in our example are discrete; if they were continuous, internal nodes could test
the value of a feature against a threshold.



Figure 2.9: Training examples for the target concept PlayTennis. Rows denote labeled instances ⟨xi, yi⟩; the class label
denotes whether a tennis game was played.

A sample task would be to classify the following instance:

⟨Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong⟩

Although a variety of decision tree learning methods have been developed with somewhat differing capabilities and require-
ments, decision tree learning is generally best suited to problems with the following characteristics:
• Instances are represented by attribute-value pairs. Instances are de-
scribed by a fixed set of attributes (e.g., Temperature) and their values (e.g., Hot). The easiest situation for decision
tree learning is when each attribute takes on a small number of disjoint possible values (e.g., Hot, Mild, Cold). How-
ever, extensions to the basic algorithm (which we are yet to explain) allow handling real-valued attributes as well (e.g.,
representing Temperature numerically).

• The target function has discrete output values. The decision tree in Figure. 2.8 assigns a boolean classification (e.g., yes
or no) to each example. Decision tree methods easily extend to learning functions with more than two possible output
values. A more substantial extension allows learning target functions with real-valued outputs, though the application of
decision trees in this setting is less common.

• The training data may contain errors. Decision tree learning methods are robust to errors, both errors in classifications
of the training examples and errors in the attribute values that describe these examples.

• The training data may contain missing attribute values. Decision tree methods can be used even when some training
examples have unknown values (e.g., if the Humidity of the day is known for only some of the training examples).
Many practical problems have been found to fit these characteristics. Decision tree learning has therefore been applied
to problems such as learning to classify medical patients by their disease, equipment malfunctions by their cause, and loan
applicants by their likelihood of defaulting on payments.

2.4.3 The Iterative Dichotomy 3 (ID3) Learning Algorithm


Most algorithms that have been developed for learning decision trees are variations of a core algorithm that employs a
top-down, greedy search through the space of possible decision trees. This approach is exemplified by the ID3 algorithm and its
successor C4.5, which form the primary focus of our discussion here. In this section we present the basic algorithm for decision
tree learning, corresponding approximately to the ID3 algorithm. ID3 performs a Hill-Climbing search in the space of trees.




A simplified version of the algorithm, specialized to learning boolean-valued functions (i.e., concept learning), is described below.
Our basic algorithm, ID3, learns decision trees by constructing them top-down, beginning with the question “which attribute
should be tested at the root of the tree?” To answer this question, each instance attribute is evaluated using a statistical test
to determine how well it alone classifies the training examples. The best attribute is selected and used as the test at the root
node of the tree. A descendant of the root node is then created for each possible value of this attribute, and the training
examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the example’s value for this
attribute). The entire process is then repeated using the training examples associated with each descendant node to select
the best attribute to test at that point in the tree. This forms a greedy search for an acceptable decision tree, in which the
algorithm never backtracks to reconsider earlier choices.
ID3 Algorithm

Inputs: Examples, Target-attribute, Attributes

Examples are the training examples. Target-attribute is the attribute whose value is to be predicted by the
tree. Attributes is a list of other attributes that may be tested by the learned decision tree. Returns a decision
tree that correctly classifies the given Examples.

• Create a Root node for the tree

• If all Examples are positive, Return the single-node tree Root, with label = +

• If all Examples are negative, Return the single-node tree Root, with label = −

• If Attributes is empty, Return the single-node tree Root, with label = most common value of Target-attribute in Examples

• Otherwise Begin
  – A ← the attribute from Attributes that best classifies Examples (the best attribute is the one with the highest information gain, as will be defined next)
  – The decision attribute for Root ← A
  – For each possible value, vi , of A
    ∗ Add a new tree branch below Root, corresponding to the test A = vi
    ∗ Let Examples_vi be the subset of Examples that have value vi for A
    ∗ If Examples_vi is empty
      · Then below this new branch add a leaf node with label = most common value of Target-attribute in Examples
      · Else below this new branch add the subtree ID3(Examples_vi , Target-attribute, Attributes − {A})
  – End

• Return Root

2.4.4 Which Attribute is the Best Classifier?


The central choice in the ID3 algorithm is selecting which attribute to test at each node in the tree. We would like to
select the attribute that is most useful for classifying examples. What is a good quantitative measure of the worth of an
attribute? We will define a statistical property, called information gain, that measures how well a given attribute separates the
training examples according to their target classification. ID3 uses this information gain measure to select among the candidate
attributes at each step while growing the tree.
In order to define information gain precisely, we begin by defining a measure commonly used in information theory, called
entropy, that characterizes the (im)purity of an arbitrary collection of examples. Given a collection S, containing positive and




negative examples of some target concept, the entropy of S relative to this boolean classification is

Entropy(S) = −p⊕ log2 p⊕ − p⊖ log2 p⊖

where p⊕ is the proportion of positive examples in S and p⊖ is the proportion of negative examples in S.
One interpretation of entropy from information theory is that it specifies the minimum number of bits of information needed
to encode the classification of an arbitrary member of S (i.e., a member of S drawn at random with uniform probability). For
example, if p⊕ is 1, the receiver knows the drawn example will be positive, so no message need be sent, and the entropy is
zero. On the other hand, if p⊕ is 0.5, one bit is required to indicate whether the drawn example is positive or negative. If p⊕ is
0.8, then a collection of messages can be encoded using on average less than 1 bit per message by assigning shorter codes to
collections of positive examples and longer codes to less likely negative examples.
Thus far we have discussed entropy in the special case where the target classification is boolean. More generally, if the
target attribute can take on c different values, then the entropy of S relative to this c−wise classification is defined as
Entropy(S) = − Σ_{i=1}^{c} p_i log2 p_i        (2.40)

where pi is the proportion of S belonging to class i . Note the logarithm is still base 2 because entropy is a measure of the
expected encoding length measured in bits. Note also that if the target attribute can take on c possible values, the entropy
can be as large as log2 c.
Given entropy as a measure of the impurity in a collection of training examples, we can now define a measure of the
effectiveness of an attribute in classifying the training data. The measure we will use, called information gain, is simply the
expected reduction in entropy caused by partitioning the examples according to this attribute. More precisely, the information
gain, Gain(S, A) of an attribute A, relative to a collection of examples S, is defined as
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sv |/|S|) Entropy(Sv )        (2.41)

where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v
(i.e., Sv = {s ∈ S|A(s) = v }). Note the first term in (2.41) is just the entropy of the original collection S, and the second term
is the expected value of the entropy after S is partitioned using attribute A. The expected entropy described by this second
term is simply the sum of the entropies of each subset Sv , weighted by the fraction |Sv |/|S| of examples that belong to Sv . Gain(S, A)
is therefore the expected reduction in entropy caused by knowing the value of attribute A. Put another way, Gain(S, A) is the
information provided about the target function value, given the value of some other attribute A. The value of Gain(S, A) is
the number of bits saved when encoding the target value of an arbitrary member of S, by knowing the value of attribute A.

Example 10. Consider the problem depicted in Fig. 2.9. Let S be the collection of training-example days described by the
attributes of Outlook, Temperature, Humidity, and Wind. Compute Gain(S, Wind).

Solution. Suppose S is a collection of training-example days described by attributes including Wind, which can have the values
Weak or Strong. Assume S is a collection containing 14 examples, [9+, 5−]. Of these 14 examples, suppose 6 of the positive
and 2 of the negative examples have Wind = Weak, and the remainder have Wind = Strong. The information gain due to
sorting the original 14 examples by the attribute Wind may then be calculated as

Values(Wind) = {Weak, Strong}
S = [9+, 5−]
SWeak ← [6+, 2−]
SStrong ← [3+, 3−]

Gain(S, Wind) = Entropy(S) − Σ_{v ∈ {Weak,Strong}} (|Sv |/|S|) Entropy(Sv ) = Entropy(S) − (8/14) Entropy(SWeak ) − (6/14) Entropy(SStrong )

Copyright c 2019, Dr. Mustafa El-Halabi 31


CLASSIFICATION 32

Entropy(S) = −(9/14) log2 (9/14) − (5/14) log2 (5/14) = 0.94
Entropy(SWeak ) = −(6/8) log2 (6/8) − (2/8) log2 (2/8) = 0.811
Entropy(SStrong ) = −(3/6) log2 (3/6) − (3/6) log2 (3/6) = 1

∴ Gain(S, Wind) = 0.94 − (8/14)(0.811) − (6/14)(1) = 0.048
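As a quick check of these numbers, here is a small Python sketch (an illustration, not part of the original notes; the function name entropy is our own) that recomputes Entropy(S) and Gain(S, Wind) from the class counts:

```python
import math

def entropy(pos, neg):
    """Entropy of a collection with pos positive and neg negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count > 0:            # 0 log 0 is taken to be 0
            p = count / total
            result -= p * math.log2(p)
    return result

# S = [9+, 5-]; split by Wind: Weak = [6+, 2-], Strong = [3+, 3-]
h_s = entropy(9, 5)                                         # 0.940
gain_wind = h_s - (8/14) * entropy(6, 2) - (6/14) * entropy(3, 3)
print(round(h_s, 3), round(gain_wind, 3))                   # 0.94 0.048
```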

Information gain is precisely the measure used by ID3 to select the best attribute at each step in growing the tree. To
decide on which attribute is the best classifier, we should compute Gain(S, Humidity), Gain(S, Outlook), and Gain(S, Temperature) as
well. Consider the first step through the algorithm, in which the topmost node of the decision tree is created. The information
gain values for all four attributes are: Gain(S, Humidity) = 0.151, Gain(S, Outlook) = 0.246, Gain(S, Temperature) = 0.029, and
Gain(S, Wind) = 0.048. According to the information gain measure, the Outlook attribute provides the best prediction of the
target attribute, PlayTennis, over the training examples. Therefore, Outlook is selected as the decision attribute for the root
node, and branches are created below the root for each of its possible values (i.e., Sunny, Overcast, and Rain).

[Partial tree: root Outlook over [D1, ..., D14] = [9+, 5−]; Sunny → [D1, D2, D8, D9, D11] = [2+, 3−] (?); Overcast → [D3, D7, D12, D13] = [4+, 0−] (Yes); Rain → [D4, D5, D6, D10, D14] = [3+, 2−] (?).]

Figure 2.10: The partially learned decision tree resulting from the first step of ID3.

The resulting partial decision tree is shown in Figure. 2.10, along with the training examples sorted to each new descendant
node. Note that every example for which Outlook = Overcast is also a positive example of PlayTennis. Therefore, this node
of the tree becomes a leaf node with the classification PlayTennis = Yes. In contrast, the descendants corresponding to
Outlook = Sunny and Outlook = Rain still have nonzero entropy, and the decision tree will be further elaborated below these
nodes.
The process of selecting a new attribute and partitioning the training examples is now repeated for each non-terminal
descendant node, this time using only the training examples associated with that node. Attributes that have been incorporated
higher in the tree are excluded, so that any given attribute can appear at most once along any path through the tree. This
process continues for each new leaf node until either of two conditions is met: (1) every attribute has already been included
along this path through the tree, or (2) the training examples associated with this leaf node all have the same target attribute
value (i.e., their entropy is zero).
Considering the Sunny branch, to decide whether the attribute tested there should be Humidity, Temperature, or Wind, we



compute the following information gains


Gain(SSunny , Humidity) = 0.97 − (3/5)(0) − (2/5)(0) = 0.97
Gain(SSunny , Temperature) = 0.97 − (2/5)(0) − (2/5)(1) − (1/5)(0) = 0.57
Gain(SSunny , Wind) = 0.97 − (2/5)(1) − (3/5)(0.918) = 0.019

Clearly Humidity has the highest information gain, so this should be the attribute connected to the Sunny branch. Continuing
in the same manner, the final decision tree learned by ID3 from the 14 training examples is shown in Figure. 2.8.
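The whole greedy recursion is compact enough to sketch in code. The following Python sketch is illustrative only: the representation of examples as dictionaries and the function names are our own choices, and ties between attributes are broken arbitrarily.

```python
import math
from collections import Counter

def entropy_of(labels):
    # Entropy of a list of class labels, as in (2.40)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, labels, attr):
    # Information gain of splitting on attr, as in (2.41)
    total = entropy_of(labels)
    for value in set(ex[attr] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == value]
        total -= (len(subset) / len(labels)) * entropy_of(subset)
    return total

def id3(examples, labels, attributes):
    if len(set(labels)) == 1:           # pure node -> leaf
        return labels[0]
    if not attributes:                  # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, labels, a))
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        tree[best][value] = id3([examples[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attributes if a != best])
    return tree

# A tiny demo (made-up data, not the full 14-example table):
examples = [{"Outlook": "Sunny", "Wind": "Weak"},
            {"Outlook": "Sunny", "Wind": "Strong"},
            {"Outlook": "Overcast", "Wind": "Weak"}]
print(id3(examples, ["no", "no", "yes"], ["Outlook", "Wind"]))
# {'Outlook': {'Sunny': 'no', 'Overcast': 'yes'}}
```

Run on the 14 training examples encoded this way, the same recursion should recover Outlook as the root split, since it maximizes the information gain computed above.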

2.5 k−Nearest Neighbors (k−NN) Algorithm


We now turn our attention to non-probabilistic classifiers. Rather than providing a probability of class membership, p(y|x),
their output is an assignment of an object to a class: y_new = c. We will look at two different algorithms: k−Nearest Neighbors
(k−NN) and the Support Vector Machine (SVM). Both are very popular within machine learning due to their excellent empirical
performance. In this section we discuss the k−Nearest Neighbors algorithm, which can handle both binary and multi-class data
and makes no assumption about the parametric form of the decision boundary. k−NN does not have a training phase and is
best described through the simple process used to classify new objects. It is an example of instance-based learning.

2.5.1 Do As Your Neighbor Does


Suppose that you are in a nightclub and decide to dance. It is unlikely that you will know the dance moves for the particular
song that is playing, so you will probably try to work out what to do by looking at what the people close to you are doing. The
first thing you could do would be just to pick the person closest to you and copy them. However, since most of the people
who are in the nightclub are also unlikely to know all the moves, you might decide to look at a few more people and do what
most of them are doing. This is pretty much exactly the idea behind nearest neighbor methods: if we do not have a model
that describes the data, then the best thing to do is to look at similar data and choose to be in the same class as them. This
method performs best if we are working in low dimensions and if we have reason to believe that points (x, y) that are “close”
in input have similar labels/values.
Successful prediction typically relies on smoothness in the data − if the class label can change as we move a small amount
in the input space, the problem is essentially random and no algorithm will generalize well. In machine learning one constructs
appropriate measures of smoothness for the problem at hand and hopes to exploit this to obtain good generalization. Nearest
neighbor methods are a useful starting point since they readily encode basic smoothness intuitions and are easy to program.
In a classification problem each input vector x has a corresponding class label, y^n ∈ {1, ... , Y}. Given a dataset of N train
examples, D = {x^n , y^n}, n = 1, ... , N, and a novel x, we aim to return the correct class y(x). A simple, but often effective,
strategy for this supervised learning problem can be stated as: for a novel x, find the nearest input in the train set and use the
class of this nearest input, which is known as the 1−NN rule. For vectors x and x′ representing two different data points, we
measure “nearness” using a dissimilarity function d(x, x′). A common dissimilarity is the squared Euclidean distance

d(x, x′) = (x − x′)^T (x − x′)        (2.42)
Based on the squared Euclidean distance, the decision boundary is determined by the perpendicular bisectors of the closest
training points with different training labels, see Figure. 2.11. This partitions the input space into regions classified equally and
is called a Voronoi tessellation.
The 1−NN algorithm to classify a vector x, given training data D = {(x^n , y^n), n = 1, ... , N}, is given as follows.



Figure 2.11: An illustration of the decision boundaries of the 1−NN rule. The points depicted are the sample points, and the predicted label of any new point will be the label of the sample point in the center of the cell it belongs to. These cells are called a Voronoi tessellation of the space, and are drawn using the Euclidean distance. Other metrics will change the decision surface.

1−NN Algorithm

1. Calculate the dissimilarity of the test point x to each of the train points, d^n = d(x, x^n), n = 1, ... , N.

2. Find the training point x^{n∗} which is nearest to x: n∗ = argmin_n d(x, x^n).

3. Assign the class label y(x) = y^{n∗}.

4. In the case that there are two or more nearest neighbors with different class labels, the most numerous
class is chosen. If there is no single most numerous class, we use the K−nearest-neighbors.

The nearest neighbor algorithm is simple and intuitive. There are, however, some issues:

• How should we measure the distance between points? Whilst the Euclidean square distance is popular, this may not
always be appropriate. A fundamental limitation of the Euclidean distance is that it does not take into account how the
data is distributed. For example, if the length scales of the components of the vector x vary greatly, the largest length
scale will dominate the squared distance, with potentially useful class-specific information in other components of x lost.
The Mahalanobis distance

d(x, x′) = (x − x′)^T Σ^{−1} (x − x′)        (2.43)

where Σ is the covariance matrix of the inputs (from all classes), can overcome some of these problems since it effectively
rescales the input vector components. (A minimal sketch of this rescaling effect is given after this list.)

• The whole dataset needs to be stored to make a classification, since the novel point must be compared to all of the train
points. This can be partially addressed by a method called data editing, in which data points which have little or no effect
on the decision boundary are removed from the training dataset. Depending on the geometry of the training points,
finding the nearest neighbor can also be accelerated by examining the values of each of the components x_i of x in turn.
Such an axis-aligned space-split is called a KD−tree and can reduce the possible set of candidate nearest neighbors of
the novel x∗, particularly in low dimensions.

• Each distance calculation can be expensive if the data points are high dimensional. Principal Components Analysis is
one way to address this: it replaces x with a low dimensional projection p. The Euclidean distance of two data points
x^a and x^b is then approximately given by the distance between p^a and p^b. This is both faster to compute and can
also improve classification accuracy, since only the large scale characteristics of the data are retained in the PCA projections.
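To make the rescaling effect concrete, here is a minimal numpy sketch (our own illustration with synthetic data; not from the notes) comparing the squared Euclidean and Mahalanobis distances when one feature has a much larger length scale:

```python
import numpy as np

# Synthetic data where feature 0 has a length scale 100x that of feature 1,
# so it dominates the squared Euclidean distance.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) * np.array([100.0, 1.0])

sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance

def sq_euclidean(a, b):
    d = a - b
    return d @ d

def sq_mahalanobis(a, b):
    # Equation (2.43): rescales each component by the data's spread
    d = a - b
    return d @ sigma_inv @ d

a, b = X[0], X[1]
print(sq_euclidean(a, b))    # dominated by feature 0
print(sq_mahalanobis(a, b))  # both features contribute comparably
```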

2.5.2 k−Nearest Neighbors


If your neighbor is simply mistaken (has an incorrect training class label), or is not a particularly representative example
of its class, the result will typically be an incorrect classification. By including more than the single nearest
neighbor, we hope to make a more robust classifier with a smoother decision boundary (less swayed by single neighbor opinions).
neighbor, we hope to make a more robust classifier with a smoother decision boundary (less swayed by single neighbor opinions).

If we assume the Euclidean distance as the dissimilarity measure, the k−Nearest Neighbor algorithm considers a hypersphere
centered on the test point x. The radius of the hypersphere is increased until it contains exactly k train inputs. The class
label y(x) is then given by the most numerous class within the hypersphere. This is illustrated in Figure. 2.12. The training
data consist of data points belonging to one of two classes (grey circles and white squares). Two test points are indicated
by black diamonds, and in both cases the dotted circles enclose their k = 3 nearest neighbors. The neighbors of test point A
include two from the square class and one from the circle class, so it will be classified as belonging to the square class. All of
the neighbors of test point B belong to the circle class, to which B is assigned.

Figure 2.12: An illustration of the operation of k = 3-NN. Circles and squares denote the training points and diamonds the
test points. Test point A will be assigned to the square class and B to the circle class.

We have the data points positioned within input space, so we just need to work out which of the training data are close to
it. This requires computing the distance to each datapoint in the training set, which is relatively expensive: if we are in normal
Euclidean space, then we have to compute d subtractions and d squarings (we can ignore the square root since we only want
to know which points are the closest, not the actual distance). We can then identify the k nearest neighbors to the test point,
and then set the class of the test point to be the most common one among those neighbors. The choice of k is
not trivial: make it too small and nearest neighbor methods are sensitive to noise; make it too large and accuracy degrades,
as points that are too far away are considered.
One drawback of the k−nearest neighbor approach is the issue of ties − two or more classes having an equal number of
votes. For example, if k = 8 in Figure. 2.12, we will always have four neighbors belonging to each class and no majority. One
option is to assign the class randomly from the set of tied classes. This may not always be sensible, as it means that the same
x∗ may be assigned to different classes if it is tested more than once. For binary classification, a neater solution is to always
use an odd number of neighbors. More generally, we can weight the votes according to distance such that the votes from closer
points have greater influence, making ties highly unlikely. Also, in the case of a tie one may increase k until the tie is broken.
The k−nearest Neighbor algorithm is one of the simplest of all machine learning classifiers (also known as instance-based
learning, or memory-based learning). However, this method suffers from the curse of dimensionality. First, as shown above,
the computational costs get higher as the number of dimensions grows. This is not as bad as it might appear at first: there
are sets of methods such as KD-Trees that can reduce the computational time. However, more importantly, as the number of
dimensions increases, so the distance to other data points tends to increase. In addition, they can be far away in a variety of
different directions−there might be points that are relatively close in some dimensions, but a long way in others. There are
methods for dealing with these problems, known as adaptive nearest neighbor methods.
Once we have some data and have chosen a suitable distance measure, the only thing that remains is the choice of k. If
k is too small, our classification can be heavily influenced by noise. Whilst there is some sense in making k > 1 (reduces the
chances of overfitting), there is certainly little sense in making k = N (N being the number of training points). For k very large,
all classifications will become the same − simply assign each novel x∗ to the most numerous class in the train data (we have
smoothed to such an extent that everywhere belongs to a specific class). This suggests that there is an optimal intermediate




setting of k which gives the best generalization performance. This can be determined using cross-validation, as described in
the next chapter.
Data sets with uneven numbers of objects in each class are known as imbalanced; they are common in machine learning and
are something we must be aware of when we undertake any classification analysis.

Example 11. Consider the following training set in the 2-dimensional Euclidean space

x y Class
-1 1 −
0 1 +
0 2 −
1 -1 −
1 0 +
1 2 +
2 2 −
2 3 +

1. What is the prediction of the 3−nearest neighbor classifier at point (1, 1)?

2. What is the prediction of the 5−nearest neighbor classifier at point (1, 1)?

3. What is the prediction of the 7−nearest neighbor classifier at point (1, 1)?

4. What is the prediction of the 4−nearest neighbor classifier at point (1, 1)?
Solution. We start by computing the squared distance from (1, 1) to each of the points in the data set. We get: d^2_{1,−} = 4,
d^2_{2,+} = 1, d^2_{3,−} = 2, d^2_{4,−} = 4, d^2_{5,+} = 1, d^2_{6,+} = 1, d^2_{7,−} = 2, d^2_{8,+} = 5.

1. The three smallest distances are d^2_{2,+} = 1, d^2_{5,+} = 1 and d^2_{6,+} = 1. The given point will therefore be classified
as belonging to class +.

2. The five smallest distances are d^2_{2,+} = 1, d^2_{5,+} = 1, d^2_{6,+} = 1, d^2_{3,−} = 2 and d^2_{7,−} = 2. With three +
votes against two − votes, the given point will be classified as belonging to class +.

3. The seven smallest distances are d^2_{2,+} = 1, d^2_{5,+} = 1, d^2_{6,+} = 1, d^2_{3,−} = 2, d^2_{7,−} = 2, d^2_{1,−} = 4 and
d^2_{4,−} = 4. With four − votes against three + votes, the given point will be classified as belonging to class −.

4. The three smallest distances are d^2_{2,+} = 1, d^2_{5,+} = 1 and d^2_{6,+} = 1; the fourth nearest neighbor is either
d^2_{3,−} = 2 or d^2_{7,−} = 2 (a tie). In either case the majority vote is three + against one −, so the given point will be
classified as belonging to class +.
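The computations of Example 11 are easy to script. The sketch below (an illustration with our own function name knn_predict; not part of the notes) reproduces the k = 3, 5, 7 predictions; for k = 4, which of the two points at distance 2 enters the neighborhood depends on how argsort breaks the tie, though the majority vote is + either way.

```python
import numpy as np
from collections import Counter

# Training set of Example 11
train_x = np.array([[-1, 1], [0, 1], [0, 2], [1, -1],
                    [1, 0], [1, 2], [2, 2], [2, 3]], dtype=float)
train_y = ['-', '+', '-', '-', '+', '+', '-', '+']

def knn_predict(x, k):
    # Squared distances to every training point (no square root needed,
    # since we only compare distances, never use their actual values).
    d2 = np.sum((train_x - x) ** 2, axis=1)
    nearest = np.argsort(d2)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

for k in (3, 5, 7):
    print(k, knn_predict(np.array([1.0, 1.0]), k))  # expected: +, +, -
```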




CHAPTER 3

REGULARIZATION

3.1 Generalization in Machine Learning


In machine learning we describe the learning of the target function from training data as inductive learning. Induction refers
to learning general concepts from specific examples which is exactly the problem that supervised machine learning problems aim
to solve. This is different from deduction that is the other way around and seeks to learn specific concepts from general rules.
The main practical objectives of machine learning consist of generating accurate predictions for unseen inputs and of
designing efficient and robust algorithms to produce these predictions, even for large-scale problems. The ability to perform well
on previously unobserved inputs is called “generalization”. There is a terminology used in machine learning when we talk about
how well a machine learning model learns and generalizes to new data, namely “overfitting” and “underfitting”. See Figure. 3.1.

Figure 3.1: The zig-zag line on the left panel is consistent over the blue and red training sample, but it is a complex separation surface that is not likely to generalize well to unseen data. In contrast, the decision surface on the right panel is simpler and might generalize better in spite of its misclassification of a few points of the training sample.

3.1.1 Overfitting and Underfitting


By now we have seen the main machine learning algorithms pertaining to linear regression, logistic regression, etc. However,
when we apply these machine learning algorithms, they can run into a problem called “overfitting” that can cause them to
perform very poorly. To understand this problem, let us go back to the example of predicting the housing prices based on the
size of the house.
Consider Figure. 3.2(a), where the data is being fitted using a linear hypothesis hθ(x) = θ0 + θ1 x. This is apparently not
a very good model, because looking at the data, as the size of the houses increases, the prices kind of plateau. Since this
hypothesis does not represent a good fit to the data, we refer to this problem as “underfitting”, or we say that the algorithm
has a “high bias”, which roughly means that the algorithm has a very strong pre-conception (bias) that the housing prices are
going to vary linearly with their size, despite the fact that the training data do not reflect this.

[Three panels of price vs. size: (a) hθ(x) = θ0 + θ1 x, (b) hθ(x) = θ0 + θ1 x + θ2 x², (c) hθ(x) = θ0 + θ1 x + θ2 x² + θ3 x³ + θ4 x⁴.]

Figure 3.2: Different hypotheses for predicting the housing prices.

Underfitting refers to a model that can neither model the training data nor generalize to new data. An under-fit machine
learning model is not a suitable model, and this will be obvious as it will have poor performance on the training data.
Underfitting is often not discussed, as it is easy to detect given a good performance metric. The remedy is to move on and
try alternate machine learning algorithms.
In the scenario of Figure. 3.2(b), we could choose the quadratic hypothesis hθ (x) = θ0 + θ1 x + θ2 x 2 to fit the data, which
seems to work pretty well. The extreme scenario corresponds to Figure. 3.2(c), where the data is perfectly fitting all five of our
training examples using the higher order hypothesis hθ (x) = θ0 + θ1 x + θ2 x 2 + θ3 x 3 + θ4 x 4 . The issue of overly fitting the
training data using a certain hypothesis is known as “overfitting”. We also say the corresponding algorithm has “high variance”.
More concretely, overfitting occurs when, having too many features, the learned hypothesis may fit the training set very
well (J(θ) ' 0), but fail to generalize to new examples (predict prices on new examples). Overfitting happens when a model
learns the detail and the noise1 in the training data to the extent that it negatively impacts the performance of the model on
new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the
model. The problem is that these concepts do not apply to new data and negatively impact the model’s ability to generalize.
Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function.
As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how
much detail the model learns. For example, decision trees are nonparametric machine learning algorithms that are very flexible
and subject to overfitting training data. This problem can be addressed by pruning a tree after it has been learned in order to
remove some of the details it has picked up.
We have discussed the problem of overfitting and underfitting for linear regression, but the same problems could apply for
logistic regression as well.

[Three panels in the (x1 , x2) plane: (a) hθ(x) = g(θ0 + θ1 x1 + θ2 x2), (b) hθ(x) = g(θ0 + θ1 x1 + θ2 x2 + θ3 x1² + θ4 x2² + θ5 x1 x2), (c) hθ(x) = g(θ0 + θ1 x1 + θ2 x1² + θ3 x1² x2 + θ4 x1² x2² + θ5 x1² x2³ + ...).]
Figure 3.3: Different hypotheses for a classification problem.

In Figure. 3.3(a), we fit the data using a simple hypothesis hθ (x) = g(θ0 + θ1 x1 + θ2 x2 ), which is equivalent to using a
1 Noise can occur both at the feature level and at the label level. Some features might correspond to measurements taken by sensors. For instance,

a robot might use a laser range finder to compute its distance to a wall. However, this sensor might fail and return an incorrect value. In a sentiment
classification problem, someone might have a typo in their review of a course. These would lead to noise at the feature level. There might also be
noise at the label level. A student might write a scathingly negative review of a course, but then accidentally click the wrong button for the course
rating.




straight line to separate negative and positive examples. Obviously, this does not look like a good fit and once again this is
an example of “underfitting”. In contrast, in Figure. 3.3(b), we add to the features some quadratic terms; giving a hypothesis
hθ (x) = g(θ0 + θ1 x1 + θ2 x2 + θ3 x12 + θ4 x22 + θ5 x1 x2 ). This choice of hypothesis corresponds to a decision boundary which is a
more realistic fit for the training data. Figure. 3.3(c) shows an extreme scenario where a very high order polynomial is used to
fit the data. As a result, logistic regression will go to a great length to find a decision boundary that overfits the training data,
and this will not generalize well for future training examples.
When we have a machine learning problem that encompasses a lot of features but few training examples, overfitting can
become a problem. In order to address overfitting, there are two main options:

• reduce the number of features by manually selecting which important features to keep

• use Model Selection algorithms, which help to choose which features to keep and which features to throw out (in
subsequent chapters)

However, features are information, and by throwing away some features we are throwing away information that could
help in improving the predictions. If we decide not to get rid of some features, we need to resort to “regularization” in order to
handle the problem of overfitting. Using regularization, we keep all the features, but reduce the magnitude of the values of
θj. This method works well when we have a lot of features, where each one contributes a bit to predicting y.

3.1.2 Capacity
We can control whether a model is more likely to overfit or under-fit by altering its “capacity”. Informally, a model’s capacity
is its ability to fit a wide variety of functions. Models with low capacity may struggle to fit the training set. Models with high
capacity can overfit by memorizing properties of the training set that do not serve them well on the test set. One way to
control the capacity of a learning algorithm is by choosing its hypothesis space, the set of functions that the learning algorithm
is allowed to select as being the solution. For example, the linear regression algorithm has the set of all linear functions of its
input as its hypothesis space. We can generalize linear regression to include polynomials, rather than just linear functions, in
its hypothesis space. Doing so increases the model’s capacity. Ideally, we want to select a model at the sweet spot between
underfitting and overfitting. This is the goal, but is very difficult to do in practice.
A polynomial of degree one gives us the linear regression model with which we are already familiar, with prediction

hθ (x) = θ0 + θ1 x

By introducing x 2 as another feature provided to the linear regression model, we can learn a model that is quadratic as a
function of x:
hθ (x) = θ0 + θ1 x + θ2 x 2
Though this model implements a quadratic function of its input, the output is still a linear function of the parameters, so we
can still use the normal equations to train the model in closed form. We can continue to add more powers of x as additional
features, for example to obtain a polynomial of degree 9:
hθ(x) = θ0 + Σ_{j=1}^{9} θj x^j
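The following numpy sketch (our own illustration; the quadratic ground truth and the sample size are assumptions) shows capacity in action by fitting polynomial hypotheses of degree 1, 2, and 9 with least squares on the expanded features:

```python
import numpy as np

# Synthetic data from a quadratic "true function" plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=10)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=x.size)

def fit_poly(x, y, degree):
    # Design matrix with columns [1, x, x^2, ..., x^degree]; the model is
    # still linear in theta, so least squares / the normal equations apply.
    X = np.vander(x, degree + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

for degree in (1, 2, 9):
    theta = fit_poly(x, y, degree)
    pred = np.vander(x, degree + 1, increasing=True) @ theta
    mse = np.mean((pred - y) ** 2)
    # degree 9 drives the training MSE to ~0 by interpolating the 10 points
    print(f"degree {degree}: training MSE = {mse:.4f}")
```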

Machine learning algorithms will generally perform best when their capacity is appropriate in regard to the true complexity
of the task they need to perform and the amount of training data they are provided with. Models with insufficient capacity
are unable to solve complex tasks. Models with high capacity can solve complex tasks, but when their capacity is higher than
needed to solve the present task they may overfit. The most extreme case of arbitrarily high capacity is reached through
non-parametric models (k-NN, decision trees, etc.)
Fig. 3.4 shows this principle in action. We compare a linear, quadratic and degree-9 predictor attempting to fit a problem
where the true underlying function is quadratic. The linear function is unable to capture the curvature in the true underlying
problem, so it under-fits. The degree-9 predictor is capable of representing the correct function, but it is also capable of
representing infinitely many other functions that pass exactly through the training points, because we have more parameters
than training examples. We have little chance of choosing a solution that generalizes well when so many wildly different solutions



exist. In this example, the quadratic model is perfectly matched to the true structure of the task, so it generalizes well to new data.

Figure 3.4: We fit three models to this example training set. The training data was generated synthetically, by randomly sampling x values and choosing y deterministically by evaluating a quadratic function. (Left) A linear function fit to the data suffers from underfitting: it cannot capture the curvature that is present in the data. (Center) A quadratic function fit to the data generalizes well to unseen points. It does not suffer from a significant amount of overfitting or underfitting. (Right) A polynomial of degree 9 fit to the data suffers from overfitting. Here we used the Moore-Penrose pseudo-inverse to solve the underdetermined normal equations. The solution passes through all of the training points exactly, but we have not been lucky enough for it to extract the correct structure. It has a deep valley in between two training points that does not appear in the true underlying function. It also increases sharply on the left side of the data, while the true function decreases in this area.

So far we have only described changing a model’s capacity by changing the number of input features it has (and simultaneously adding new parameters associated with those features). There are in fact many ways of changing a model’s capacity.
Capacity is not determined only by the choice of model. The model specifies which family of functions the learning algorithm can choose from when varying the parameters in order to reduce a training objective. This is called the representational capacity of the model. In many cases, finding the best function within this family is a very difficult optimization problem. In practice, the learning algorithm does not actually find the best function, but merely one that significantly reduces the training error. These additional limitations, such as the imperfection of the optimization algorithm, mean that the learning algorithm’s effective capacity may be less than the representational capacity of the model family.

Our modern ideas about improving the generalization of machine learning models are refinements of thoughts dating back to philosophers at least as early as Ptolemy. Many early scholars invoke a principle of parsimony that is now most widely known as Occam’s razor (c. 1287-1347). This principle states that among competing hypotheses that explain known observations equally well, one should choose the “simplest” one. This idea was formalized and made more precise in the 20th century by the founders of statistical learning theory.

Statistical learning theory provides various means of quantifying model capacity. Among these, the most well-known is the
Vapnik-Chervonenkis dimension, or VC dimension. The VC dimension measures the capacity of a binary classifier. The VC
dimension is defined as being the largest possible value of m for which there exists a training set of m different x points that
the classifier can label arbitrarily.

3.2 Hypothesis Evaluation: Training Error and Testing Error


In this section we describe a way to evaluate a hypothesis. Typically, when training a machine learning model, we have
access to a training set, we can compute some error measure on the training set called the “training error”, and we reduce this
training error. But if a hypothesis has a low training error, that does not necessarily mean it is a good hypothesis; it might
still fail to generalize to new examples not in the training set (overfitting).
In general, with machine learning problems with many features, it is impossible to plot the hypothesis function in order to
inspect overfitting. A standard way is to split the data X we have into two portions: a “training set” X(train) and a “test set”
X(test). A typical split is to randomly pick 70% of X as the training set and 30% of X as the test set. We will denote the data
in the training set by (x^(i), y^(i)), where i = 1, ... , m, and the data in the test set by (x_test^(i), y_test^(i)), where
i = 1, ... , mtest, with mtest the number of our test examples.
What separates machine learning from optimization is that we want the generalization error, also called the test error, to
be low as well. The test error is defined as the expected value of the error on a new input. Here the expectation is taken across
different possible inputs, drawn from the distribution of inputs we expect the system to encounter in practice. We typically

Copyright c 2019, Dr. Mustafa El-Halabi 40


REGULARIZATION 41

estimate the generalization error of a machine learning model by measuring its performance on a test set of examples that were
collected separately from the training set. Considering for example linear regression, then the first step would be to learn the
parameters θ from the training data, thus minimizing the training error J(θ), and the second step is to compute the test error
Jtest (θ) given by
Jtest(θ) = (1 / (2 mtest)) Σ_{i=1}^{mtest} ( hθ(x_test^(i)) − y_test^(i) )²

by plugging the parameters θ we have learned into Jtest(θ).
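Concretely, the split-then-evaluate procedure might look like the following numpy sketch (our own illustration on synthetic data; not from the notes):

```python
import numpy as np

# Synthetic data from an assumed line y = 3 + 2x plus unit-variance noise.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(0, 10, size=100)])
y = X @ np.array([3.0, 2.0]) + rng.normal(scale=1.0, size=100)

# Random 70/30 train/test split
perm = rng.permutation(100)
train, test = perm[:70], perm[70:]

# Step 1: learn theta on the training set (minimizes the training error)
theta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

# Step 2: plug the learned theta into J_test(theta)
m_test = len(test)
residuals = X[test] @ theta - y[test]
J_test = residuals @ residuals / (2 * m_test)
print(J_test)   # roughly Var(noise)/2 ~ 0.5 if the fit generalizes
```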


How can we affect performance on the test set when we get to observe only the training set? The field of statistical learning
theory provides some answers. If the training and the test set are collected arbitrarily, there is indeed little we can do. If we are
allowed to make some assumptions about how the training and test set are collected, then we can make some progress.
The training and test data are generated by a probability distribution over datasets called the data generating process. We
typically make a set of assumptions known collectively as the i.i.d. assumptions. These assumptions are that the examples in
each dataset are independent from each other, and that the training set and test set are identically distributed, drawn from
the same probability distribution as each other. This assumption allows us to describe the data generating process with a
probability distribution over a single example. The same distribution is then used to generate every training example and every
test example. We call that shared underlying distribution the data generating distribution, denoted pdata . This probabilistic
framework and the i.i.d. assumptions allow us to mathematically study the relationship between training error and test error.
One immediate connection we can observe between the training and test error is that the expected training error of a
randomly selected model is equal to the expected test error of that model. Suppose we have a probability distribution p(x, y)
and we sample from it repeatedly to generate the train set and the test set. For some fixed value θ, the expected training set
error is exactly the same as the expected test set error, because both expectations are formed using the same dataset sampling
process. The only difference between the two conditions is the name we assign to the dataset we sample.
Of course, when we use a machine learning algorithm, we do not fix the parameters ahead of time, then sample both
datasets. We sample the training set, then use it to choose the parameters to reduce training set error, then sample the test
set. Under this process, the expected test error is greater than or equal to the expected value of training error. The factors
determining how well a machine learning algorithm will perform are its ability to:

1. Make the training error small

2. Make the gap between training and test error small

Quantifying the capacity of the model allows statistical learning theory to make quantitative predictions. The most important
results in statistical learning theory show that the discrepancy between training error and generalization error is bounded from
above by a quantity that grows as the model capacity grows but shrinks as the number of training examples increases. These
bounds provide intellectual justification that machine learning algorithms can work, but they are rarely used in practice when
working with deep learning algorithms.
We must remember that while simpler functions are more likely to generalize (to have a small gap between training and test
error) we must still choose a sufficiently complex hypothesis to achieve low training error. Typically, training error decreases
until it asymptotes to the minimum possible error value as model capacity increases (assuming the error measure has a minimum
value). Typically, generalization error has a U-shaped curve as a function of model capacity. This is illustrated in Fig. 3.5.
Remark 8. Note that it is possible for the model to have optimal capacity and yet still have a large gap between training and
generalization error. In this situation, we may be able to reduce this gap by gathering more training examples.

3.2.1 Bias/Variance Tradeoff


Revisiting Fig. 3.2, fitting a 9th order polynomial to the data (rightmost figure) did not result in a good model. Specifically,
even though the 9th order polynomial did a very good job predicting y (say, prices of houses) from x (say, living area) for the
examples in the training set, we do not expect the model shown to be a good one for predicting the prices of houses not in
the training set. In other words, what has been learned from the training set does not generalize well to other houses. The
generalization error of a hypothesis is its expected error on examples not necessarily in the training set.
Both the models in the leftmost and the rightmost figures above have large generalization error. However, the problems
that the two models suffer from are very different. If the relationship between y and x is not linear, then even if we were fitting




Figure 3.5: Typical relationship between capacity and error. Training and test error behave differently. At the left end of the graph, training error and generalization error are both high. This is the underfitting regime. As we increase capacity, training error decreases, but the gap between training and generalization error increases. Eventually, the size of this gap outweighs the decrease in training error, and we enter the overfitting regime, where capacity is too large, above the optimal capacity.
a linear model to a very large amount of training data, the linear model would still fail to accurately capture the structure in
the data. Informally, we define the bias of a model to be the expected generalization error even if we were to fit it to a very
(say, infinitely) large training set. Thus, for the problem above, the linear model suffers from large bias, and may under-fit
(i.e., fail to capture structure exhibited by) the data.
Apart from bias, there is a second component to the generalization error, consisting of the variance of a model fitting
procedure. Specifically, when fitting a 9th order polynomial as in the rightmost figure, there is a large risk that we are fitting
patterns in the data that happened to be present in our small, finite training set, but that do not reflect the wider pattern of
the relationship between x and y. This could be, say, because in the training set we just happened by chance to get a slightly
more-expensive-than-average house here, and a slightly less-expensive-than-average house there, and so on. By fitting these
“spurious” patterns in the training set, we might again obtain a model with large generalization error. In this case, we say the
model has large variance.
Often, there is a tradeoff between bias and variance. If our model is too “simple” and has very few parameters, then it may
have large bias (but small variance); if it is too “complex” and has very many parameters, then it may suffer from large variance
(but have smaller bias). In the example above, fitting a quadratic function does better than either of the extremes of a first or
a ninth order polynomial.

Remark 9. In k−NN, choosing a small k might be good for capturing fine-grained patterns; however, it could lead to overfitting,
as the model will be sensitive to random idiosyncrasies in the training data. On the other hand, choosing a large k makes
stable predictions by averaging over lots of examples, but it could lead to underfitting, as the model fails to capture important
irregularities. A rule of thumb is to choose k < √m, where m is the number of training examples.
3.3 Regularization
So far, the only method of modifying a learning algorithm we have discussed is to increase or decrease the model’s capacity
by adding or removing functions from the hypothesis space of solutions the learning algorithm is able to choose. We gave the
specific example of increasing or decreasing the degree of a polynomial for a regression problem. The view we have described
so far is oversimplified.
The behavior of our algorithm is strongly affected not just by how large we make the set of functions allowed in its hypothesis
space, but by the specific identity of those functions. The learning algorithm we have studied so far, linear regression, has a
hypothesis space consisting of the set of linear functions of its input. These linear functions can be very useful for problems
where the relationship between inputs and outputs truly is close to linear. They are less useful for problems that behave in a
very nonlinear fashion. For example, linear regression would not perform very well if we tried to use it to predict sin(x) from
x. We can thus control the performance of our algorithms by choosing what kind of functions we allow them to draw solutions
from, as well as by controlling the amount of these functions.




We can also give a learning algorithm a preference for one solution in its hypothesis space over another. This means that
both functions are eligible, but one is preferred. The unpreferred solution will be chosen only if it fits the training data significantly
better than the preferred solution.
For example, we can modify the training criterion for linear regression to include weight decay. To perform linear regression
with weight decay, we minimize a sum comprising both the mean squared error on the training set and a criterion J(θ) that
expresses a preference for the weights to have smaller squared L2 norm. Specifically,

J(θ) = MSEtrain + λ θ^T θ        (3.1)

where λ is a value chosen ahead of time that controls the strength of our preference for smaller weights. When λ = 0, we
impose no preference, and larger λ forces the weights to become smaller. Minimizing J(θ) results in a choice of weights that
make a tradeoff between fitting the training data and being small. This gives us solutions that have a smaller slope, or that
put weight on fewer of the features. As an example of how we can control a model’s tendency to overfit or underfit via weight
decay, we can train a high-degree polynomial regression model with different values of λ. See Fig. 3.6.

Figure 3.6: We fit a high-degree polynomial regression model to our example training set. The true function is quadratic, but here we use only models with degree 9. We vary the amount of weight decay to prevent these high-degree models from overfitting. (Left) With very large λ, we can force the model to learn a function with no slope at all. This underfits because it can only represent a constant function. (Center) With a medium value of λ, the learning algorithm recovers a curve with the right general shape. Even though the model is capable of representing functions with much more complicated shape, weight decay has encouraged it to use a simpler function described by smaller coefficients. (Right) With weight decay approaching zero (i.e., using the Moore-Penrose pseudo-inverse to solve the underdetermined problem with minimal regularization), the degree-9 polynomial overfits significantly, as we saw in Fig. 3.4.

More generally, we can regularize a model that learns a function by adding a penalty called a regularizer to the cost function. In the case of weight decay, the regularizer is Ω(θ) = θᵀθ.
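To make the weight-decay criterion (3.1) concrete, the following is a minimal numpy sketch; the toy quadratic data, the degree-9 feature map, and the λ values are illustrative assumptions mirroring Fig. 3.6, not taken from the notes.

import numpy as np

# J(theta) = MSE_train + lambda * theta^T theta, as in (3.1).
def weight_decay_cost(theta, X, y, lam):
    residuals = X @ theta - y
    return np.mean(residuals ** 2) + lam * (theta @ theta)

# Quadratic true function, degree-9 polynomial features, as in Fig. 3.6.
x = np.linspace(-1, 1, 20)
y = x ** 2 + 0.1 * np.random.default_rng(0).normal(size=20)
X = np.vander(x, N=10, increasing=True)           # columns 1, x, ..., x^9

theta = np.linalg.lstsq(X, y, rcond=None)[0]      # unregularized fit
for lam in (0.0, 1e-3, 1e2):
    print(lam, weight_decay_cost(theta, X, y, lam))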
Expressing preferences for one function over another is a more general way of controlling a model’s capacity than including or excluding members from the hypothesis space. We can think of excluding a feature from a hypothesis space as expressing an infinitely strong preference against that feature. In our weight decay example, we expressed our preference for linear functions defined with smaller weights explicitly, via an extra term in the criterion we minimize. There are many other ways of expressing preferences for different solutions, both implicitly and explicitly. Together, these different approaches are known as regularization. Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error. Regularization is one of the central concerns of the field of machine learning, rivaled in its importance only by optimization. In what follows we explain at a deeper level how regularization works for linear and logistic regression.

3.3.1 Regularized Linear Regression


Assume that for the housing price model, a hypothesis that fits the training data well is hθ(x) = θ0 + θ1 x + θ2 x² and a hypothesis that overfits the training data is hθ(x) = θ0 + θ1 x + θ2 x² + θ3 x³ + θ4 x⁴. Suppose we were to penalize the parameters


θ3 and θ4 and make them really small. One way of doing it is to adjust the cost function and write it as

J(θ) = (1/2m) Σ_{i=1}^m (hθ(x^(i)) − y^(i))² + 1000 θ3² + 1000 θ4²

The only way for this new cost function to be minimized is if θ3 and θ4 are made very small: θ3 ≈ 0 and θ4 ≈ 0. This is equivalent to getting rid of θ3 and θ4 and ending up with a quadratic hypothesis that fits the data well.
In general, the idea behind regularization is as follows. If we were to pick small values for θ0 , θ1 , ... , θn , we could end up with
a simpler hypothesis which is less prone to overfitting. Back to the housing example, we could have 100 features x1 , ... , x100
and we need to learn 101 parameters θ0 , θ1 , ... , θ100 in order to predict the price of a certain house. It is not clear which
of these parameters would be tied to a high-order term and hence need to be made less relevant. In regularization, we take the cost function and modify it to shrink all the parameters by adding an extra regularization term λ Σ_{j=1}^n θj², where λ is the “regularization parameter”. The resulting “regularized cost function” is given by

J(θ) = (1/2m) [ Σ_{i=1}^m (hθ(x^(i)) − y^(i))² + λ Σ_{j=1}^n θj² ]   (3.2)

By convention the θ0 term is not included in the adjusted cost function. The regularization parameter λ is used to control a tradeoff between two different goals: the first, captured by the first term in the cost function, conveys the need to fit the data, while the second, captured by the regularization term, conveys the need to keep the parameters small. This technique is known as ridge regression (or L2 regularization). If λ is made very large (λ ≈ 10^10), then all of the parameters θ1, ..., θn will be made pretty small, and this results in the hypothesis hθ(x) = θ0, which underfits. Figure 3.7 is a geometric illustration of the concept of ridge regression.
Figure 3.7: In L2 regularization the regularizer term has a minimum of 0 and the error term has its own minimum; the optimum solution of the new regularized cost function lies in between.

The gradient descent algorithm corresponding to regularized linear regression is


Gradient Descent Algorithm
Repeat until convergence {

θ0 := θ0 − α (1/m) Σ_{i=1}^m (hθ(x^(i)) − y^(i))   (3.3)

θj := θj − α [ (1/m) Σ_{i=1}^m (hθ(x^(i)) − y^(i)) xj^(i) + (λ/m) θj ],   j = 1, 2, ..., n   (3.4)

   := θj (1 − α λ/m) − α (1/m) Σ_{i=1}^m (hθ(x^(i)) − y^(i)) xj^(i),   j = 1, 2, ..., n   (3.5)

}


Since the term (1 − α λ/m) is smaller than 1, θj is updated by first shrinking it to θj (1 − α λ/m) and then reducing it by the same gradient term that we used before regularization.
If we were to use the normal equation with regularization, then the vector of learned parameters θ ends up being

θ = (XᵀX + λA)⁻¹ Xᵀ y   (3.6)

where A is an (n + 1) × (n + 1) diagonal matrix given by

A = diag(0, 1, 1, ..., 1)

i.e., the identity matrix with the entry corresponding to θ0 set to zero.
3.3.2 Regularized Logistic Regression


Logistic regression can also be prone to overfitting, as seen in Figure 3.3(c). Overfitting can also occur if the number of features is too large, irrespective of whether those features are polynomial or not. To perform regularization on logistic regression, the cost function needs to be modified to shrink all the parameters by adding an extra regularization term (λ/2m) Σ_{j=1}^n θj². The resulting regularized logistic cost function is

J(θ) = −(1/m) [ Σ_{i=1}^m y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ] + (λ/2m) Σ_{j=1}^n θj²   (3.7)

The gradient descent algorithm corresponding to regularized logistic regression is


Gradient Descent Algorithm
Repeat until convergence {

θ0 := θ0 − α (1/m) Σ_{i=1}^m (hθ(x^(i)) − y^(i))   (3.8)

θj := θj − α [ (1/m) Σ_{i=1}^m (hθ(x^(i)) − y^(i)) xj^(i) + (λ/m) θj ],   j = 1, 2, ..., n   (3.9)

   := θj (1 − α λ/m) − α (1/m) Σ_{i=1}^m (hθ(x^(i)) − y^(i)) xj^(i),   j = 1, 2, ..., n   (3.10)

}

These updates are identical in form to (3.3)-(3.5); the difference is that hθ(x) is now the sigmoid hypothesis hθ(x) = g(θᵀx).

3.4 Debugging a Machine Learning Problem


Suppose you have implemented regularized linear regression to predict housing prices. However, when you test your hypothesis on a new set of houses, you find that it makes unacceptably large errors in its predictions. What would you try next? The following are possible options:

• Get more training examples

• Try smaller sets of features

• Try getting additional features

• Try adding polynomial features (x12 , x22 , x1 x2 , ...)

• Try decreasing λ

• Try increasing λ


One should not randomly choose one of these adjustments and use it toward improving the performance of a machine learning algorithm. Fortunately, there is a pretty simple technique that can allow us to rule out at least half of these options, and as a result save us some time pursuing something that will not work. This technique is referred to as “Machine Learning Diagnostics”: a test that one can run to gain insight into what is and what is not working with a learning algorithm, and gain guidance on the best way to improve its performance. One should note that diagnostics can take quite some time to implement, but doing so can be a very good use of time.

3.5 Hyperparameters, Model Selection and Validation Sets


3.5.1 Hyperparameters
Most machine learning algorithms have several settings that we can use to control the behavior of the learning algorithm.
These settings are called hyperparameters. The values of hyperparameters are not adapted by the learning algorithm itself
(though we can design a nested learning procedure where one learning algorithm learns the best hyperparameters for another
learning algorithm). In polynomial regression, there is a single hyperparameter: the degree of the polynomial, which acts as a
capacity hyperparameter. The λ value used to control the strength of weight decay is another example of a hyperparameter.
Broadly speaking, hyperparameters are tunable knobs (a hopefully small set of high-level parameters) that govern how the
lower-level parameters (i.e., regression weights) interact with the training data.
Sometimes a setting is chosen to be a hyperparameter that the learning algorithm does not learn because it is difficult to
optimize. More frequently, we do not learn the hyperparameter because it is not appropriate to learn that hyperparameter
on the training set. This applies to all hyperparameters that control model capacity. If learned on the training set, such
hyperparameters would always choose the maximum possible model capacity, resulting in overfitting. For example, we can
always fit the training set better with a higher degree polynomial and a weight decay setting of λ = 0 than we could with a
lower degree polynomial and a positive weight decay setting.
To solve this problem, we need a validation set of examples that the training algorithm does not observe.
Examples of hyperparameters:

• Number of leaves or depth of a tree

• Learning rate (in many models)

• Number of hidden layers in a deep neural network

• Number of clusters in a k-means clustering

• Number of neighbors in k−NN

• ...

3.5.2 Model Selection


Suppose we are trying to select among several different models for a learning problem. For instance, we might be using
a polynomial regression model hθ (x) = g(θ0 + θ1 x + θ2 x 2 + ... + θk x k ), and wish to decide if k should be 0, 1, ... , 10. How
can we automatically select a model that represents a good tradeoff between the twin evils of bias and variance? Alternatively,
suppose we want to automatically choose the number of neighbors k in the k−NN algorithm. How can we do that?
For the sake of concreteness, in these notes we assume we have some finite set of models M = {M1 , ... , Md } that we are
trying to select among. For instance, in our first example above, the model Mi would be an i −th order polynomial regression
model. Alternatively, if we are trying to decide between using decision trees, a naïve Bayes classifier, or logistic regression, then
M may contain these models.

3.5.3 Cross-Validation
Let us suppose we are, as usual, given a training set S. Given what we know about empirical risk minimization, here is what might initially seem like a reasonable algorithm for model selection, resulting from training-error minimization:

1. Train each model Mi on S, to get some hypothesis hi .


2. Pick the hypothesis with the smallest training error.

This algorithm does not work. Consider choosing the order of a polynomial. The higher the order of the polynomial, the better it will fit the training set S, and thus the lower the training error. Hence, this method will always select a high-variance, high-degree polynomial model, which we saw previously is often a poor choice.
Here is an algorithm that works better. In hold-out cross validation, we do the following:

1. Randomly split S into Strain (say, 70% of the data) and Scv (the remaining 30%). Here, Scv is called the hold-out cross
validation set.

2. Train each model Mi on Strain only, to get some hypothesis hi .

3. Select and output the hypothesis hi that had the smallest error on the hold out cross validation set. For a hypothesis h,
we define this error (also called empirical risk or empirical error) to be

ε̂_Scv(h) = (1/m) Σ_{i=1}^m I{h(x^(i)) ≠ y^(i)}   (3.11)

By testing on a set of examples Scv that the models were not trained on, we obtain a better estimate of each hypothesis hi’s true generalization error, and can then pick the one with the smallest estimated generalization error. Usually, somewhere between 1/4 and 1/3 of the data is used in the hold-out cross validation set, and 30% is a typical choice. Validation sets are used to optimize hyperparameters.
Optionally, step 3 in the algorithm may also be replaced with selecting the model Mi according to argminᵢ ε̂_Scv(hi), and then retraining Mi on the entire training set S. (This is often a good idea, with one exception being learning algorithms that are very sensitive to perturbations of the initial conditions and/or data. For these methods, Mi doing well on Strain does not necessarily mean it will also do well on Scv, and it might be better to forgo this retraining step.)
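The procedure is short enough to sketch in full. In the following numpy sketch, the candidate models are k-NN classifiers with different k, the 70/30 split follows step 1, and the 0/1 error is (3.11); the toy data and the candidate values of k are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)           # a linearly separable toy problem

idx = rng.permutation(100)
train, cv = idx[:70], idx[70:]                    # step 1: S_train (70%) and S_cv (30%)

def make_knn(k):
    # step 2: "train" a k-NN hypothesis on S_train only
    def h(Xq):
        d = np.linalg.norm(Xq[:, None, :] - X[train][None, :, :], axis=2)
        votes = y[train][np.argsort(d, axis=1)[:, :k]]
        return (votes.mean(axis=1) > 0.5).astype(int)
    return h

# step 3: pick the hypothesis with the smallest hold-out error (3.11)
errors = {k: np.mean(make_knn(k)(X[cv]) != y[cv]) for k in (1, 3, 5, 7)}
print(min(errors, key=errors.get), errors)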
Earlier we discussed how a held-out test set, composed of examples coming from the same distribution as the training set,
can be used to estimate the generalization error of a learner, after the learning process has completed. It is important that the
test examples are not used in any way to make choices about the model, including its hyperparameters. For this reason, no
example from the test set can be used in the validation set. Therefore, we always construct the validation set from the training
data.

Example 12. Let us employ the leave-one-out cross validation (LOOCV) approach to choose k in k nearest neighbors. As the
name suggests, it involves using a single example from the training data as the validation data, and the remaining examples as
the training data. This is repeated such that each example in the training data is used once as the validation data. Formally,
given a classifier C , the approach computes the Misclassification Error of C as follows. Let d be the number of training examples
and let us assume that the examples are indexed from 1 to d.

• Error = 0

• For i = 1 to d

– Remove the i −th example from the training set and use it as your validation set
– Train the classifier C on the new training set
– If the held-out example is incorrectly classified by C then Error = Error + 1
– Put back the i −th example in the training set

• Return Error

To select the best classifier among a group of classifiers, you apply the LOOCV approach to all the classifiers and choose
the one that has the smallest error.


x Class
-0.1 −
0.7 +
1 +
1.6 −
2 +
2.5 +
3.2 −
3.5 −
4.1 +
4.9 +

1. What is the LOOCV error of 1 nearest neighbor algorithm on the dataset given above (use Euclidean distance)?

2. What is the LOOCV error of the 3 nearest neighbor algorithm on the dataset given above (use Euclidean distance)?

3. Will you choose k = 1 or k = 3 for this dataset?

Solution.

1. For i = 1, pick −0.1 and classify it as 0.7, i.e., +. This is an incorrect classification, then Error = 0 + 1 = 1. For i = 2,
pick 0.7 and classify it as 1, i.e., +. This is a correct classification, then Error = 1. For i = 3, pick 1 and classify it as
0.7, i.e., +. This is a correct classification, then Error = 1. For i = 4, pick 1.6 and classify it as 2, i.e., +. This is an
incorrect classification, then Error = 2. For i = 5, pick 2 and classify it as 1.6, i.e., −. This is an incorrect classification,
then Error = 3. For i = 6, pick 2.5 and classify it as 2, i.e., +. This is a correct classification, then Error = 3. For
i = 7, pick 3.2 and classify it as 3.5, i.e., −. This is a correct classification, then Error = 3. For i = 8, pick 3.5 and
classify it as 3.2, i.e., −. This is a correct classification, then Error = 3. For i = 9, pick 4.1 and classify it as 3.5, i.e.,
−. This is an incorrect classification, then Error = 4. For i = 10, pick 4.9 and classify it as 4.1, i.e., +. This is a correct
classification, then Error = 4. Hence, the LOOCV error of 1 nearest neighbor algorithm is 4.

2. For i = 1, pick −0.1 and classify it as +. This is an incorrect classification, then Error = 0 + 1 = 1. For i = 2, pick 0.7
and classify it as −. This is an incorrect classification, then Error = 2. For i = 3, pick 1 and classify it as +. This is a
correct classification, then Error = 2. For i = 4, pick 1.6 and classify it as +. This is an incorrect classification, then
Error = 3. For i = 5, pick 2 and classify it as +. This is a correct classification, then Error = 3. For i = 6, pick 2.5
and classify it as −. This is an incorrect classification, then Error = 4. For i = 7, pick 3.2 and classify it as +. This is
an incorrect classification, then Error = 5. For i = 8, pick 3.5 and classify it as +. This is an incorrect classification,
then Error = 6. For i = 9, pick 4.1 and classify it as −. This is an incorrect classification, then Error = 7. For i = 10,
pick 4.9 and classify it as −. This is an incorrect classification, then Error = 8. Hence, the LOOCV error of the 3 nearest
neighbor algorithm is 8.

3. Choose k = 1, because the LOOCV error of k = 1 is smaller than that of k = 3.
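The whole computation is easy to check in code. The following numpy sketch implements the LOOCV loop above for k-NN on this dataset and reproduces the error counts from the solution (4 for k = 1, 8 for k = 3); labels are encoded as ±1, an assumption of the sketch.

import numpy as np

x = np.array([-0.1, 0.7, 1.0, 1.6, 2.0, 2.5, 3.2, 3.5, 4.1, 4.9])
y = np.array([-1, 1, 1, -1, 1, 1, -1, -1, 1, 1])    # + -> 1, - -> -1

def loocv_knn_errors(x, y, k):
    errors = 0
    for i in range(len(x)):
        d = np.abs(np.delete(x, i) - x[i])          # remove example i, use it as validation
        labels = np.delete(y, i)
        nearest = labels[np.argsort(d)[:k]]
        prediction = 1 if nearest.sum() > 0 else -1 # majority vote of the k neighbors
        errors += prediction != y[i]                # count misclassifications
    return errors

print(loocv_knn_errors(x, y, 1))   # 4
print(loocv_knn_errors(x, y, 3))   # 8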

LOOCV has a major advantage over the validation set approach. It has far less bias. In LOOCV, we repeatedly fit the
statistical learning method using training sets that contain m − 1 observations, almost as many as are in the entire data set.
LOOCV has the potential to be expensive to implement, since the model has to be fit m times (think linear regression for
instance). Hold out cross validation wastes about 30% of the data. Even if we were to take the optional step of retraining the
model on the entire training set, it is still as if we are trying to find a good model for a learning problem in which we had 0.7m
training examples, rather than m training examples, since we are testing models that were trained on only 0.7m examples each
time. While this is fine if data is abundant and/or cheap, in learning problems in which data is scarce (consider a problem with
m = 20, say), we would like to do something better; k-fold cross validation is one option.


The k-fold cross validation approach involves randomly dividing the set of observations into k groups, or folds, of approxi-
mately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. The mean
squared error, MSE1 , is then computed on the observations in the held-out fold. This procedure is repeated k times; each time,
a different group of observations is treated as a validation set. This process results in k estimates of the test error, MSE1 ,
MSE2 , ... , MSEk . The k−fold CV estimate is computed by averaging these values,
CV_k = (1/k) Σ_{i=1}^k MSEᵢ   (3.12)

Figure 3.8 illustrates the k-fold CV approach. It is not hard to see that LOOCV is a special case of k-fold CV in which k is set to equal m. In practice, one typically performs k-fold CV using k = 5 or k = 10. What is the advantage of using k = 5 or k = 10 rather than k = m? The most obvious advantage is computational.
The validation set approach can lead to overestimates of the test error, since in this approach the training set used to fit the statistical learning method contains only a fraction of the observations of the entire data set. Using this logic, it is not hard to see that LOOCV will give approximately unbiased estimates of the test error, since each training set contains m − 1 observations, which is almost as many as the number of observations in the full data set. And performing k-fold CV for, say, k = 5 or k = 10 will lead to an intermediate level of bias, since each training set contains (k − 1)m/k observations: fewer than in the LOOCV approach, but substantially more than in the validation set approach. Therefore, from the perspective of bias reduction, it is clear that LOOCV is to be preferred to k-fold CV.
Figure 3.8: A schematic display of 5-fold CV. A set of n observations is randomly split into five non-overlapping groups. Each of these fifths acts as a validation set (shown in beige), and the remainder as a training set (shown in blue). The test error is estimated by averaging the five resulting MSE estimates.
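A compact numpy sketch of the k-fold estimate (3.12) follows; the model inside each fold is ordinary least squares, and the toy data and choice of k are illustrative assumptions.

import numpy as np

def k_fold_cv(X, y, k=5, seed=0):
    m = len(y)
    idx = np.random.default_rng(seed).permutation(m)
    folds = np.array_split(idx, k)                  # k groups of roughly equal size
    mses = []
    for i in range(k):
        val = folds[i]                              # fold i is the validation set
        train = np.hstack([folds[j] for j in range(k) if j != i])
        theta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        mses.append(np.mean((X[val] @ theta - y[val]) ** 2))
    return np.mean(mses)                            # CV_k = (1/k) sum_i MSE_i

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(40), rng.normal(size=40)])
y = 2.0 + 3.0 * X[:, 1] + 0.1 * rng.normal(size=40)
print(k_fold_cv(X, y, k=5))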


CHAPTER 4

SUPPORT VECTOR MACHINES

This chapter presents one of the most theoretically well-motivated and practically most effective classification algorithms in modern machine learning: Support Vector Machines (SVMs). SVMs are our second non-probabilistic classifier after k-NN. Their success is due to their excellent empirical performance, and for many applications they are hard to beat. They have been found to be particularly useful in applications where the number of features is much larger than the number of training examples. This is because the number of parameters that must be set for the SVM is related to the number of training examples and not the number of features. We first introduce the algorithm for separable datasets, then present its general version designed for non-separable datasets. We start with the description of the problem of linear classification, focusing only on binary classification.

4.1 Margins: Intuition


We will start our story on SVMs by talking about margins. This section will give the intuitions about margins and about
the “confidence” of our predictions.
Consider logistic regression, where the probability p(y = 1|x; θ) is modeled by hθ(x) = g(θᵀx). We would then predict “1” on an input x if and only if hθ(x) ≥ 0.5, or equivalently, if and only if θᵀx ≥ 0. Consider a positive training example (y = 1). The larger θᵀx is, the larger also is hθ(x) = p(y = 1|x; θ), and thus also the higher our degree of “confidence” that the label is 1. Thus, informally we can think of our prediction as being a very confident one that y = 1 if θᵀx ≫ 0. Similarly, we think of logistic regression as making a very confident prediction of y = 0 if θᵀx ≪ 0. Given a training set, again informally it seems that we would have found a good fit to the training data if we can find θ so that θᵀxᵢ ≫ 0 whenever yᵢ = 1, and θᵀxᵢ ≪ 0 whenever yᵢ = 0, since this would reflect a very confident (and correct) set of classifications for all the training examples. This seems to be a nice goal to aim for, and we will soon formalize this idea using the notion of functional margins.
Consider the scenario of Figure 4.1: the data are clearly linearly separable and obviously infinitely many hyperplanes can be used for separation with zero training error. So, which one should we choose? The first intuition is to use a hyperplane which is the farthest away from the data points, i.e., to draw a hyperplane in such a way that it is bounded by two other hyperplanes (which pass through the so-called support vectors); one touching the closest positive examples and the other touching the closest negative examples, forming some kind of a street. Hence, the objective is to make this street as wide as possible. But how can we formalize this intuition?
We want to start with some vector w normal to the side of the street (gutter). Let u be the position of some unknown data point that needs to be classified as belonging to the positive class (right-hand side of the street) or the negative class (left-hand side of the street). The projection of u on w gives us a distance which is proportional to how far we go on the street. The bigger u · w is, the more confident we are that the unknown data point belongs to the positive class, and the smaller u · w is, the more confident we are that the unknown data point belongs to the negative class. Hence, we can set up a decision rule

Figure 4.1: (a) Three possible hyperplanes used to separate the dataset. (b) The hyperplane with the maximum margin.

to be: if u · w ≥ c, then decide the positive class, for some decision threshold c. This decision rule can be re-written as:

u · w + b ≥ 0, for the positive (+) class   (4.1)

where b ≜ −c. Actually, we do not simply require u · w + b ≥ 0; we will require that

u · w + b ≥ 1, for the positive (+) class   (4.2)

Not only do we want the instances to be on the right side of the hyperplane, but we also want them some distance away, for better generalization. The intuition “the further away the better” translates to solving

max_{w,b} min_n |wᵀxₙ + b| / ‖w‖

Figure 4.2: The wide street approach for classification.

The distance from the hyperplane to the instances closest to it on either side is called the margin, which we want to maximize
for best generalization. The objective is to specify b and w to maximize the margin.
4.2 Support Vector Machines: Separable Case
In this section, we assume that the training sample S can be linearly separated, that is, we assume the existence of a hyperplane that perfectly separates the training sample into two populations of positively and negatively labeled points, as illustrated by the left panel of Figure 4.1. But there are then infinitely many such separating hyperplanes. Which hyperplane should a learning algorithm select? The solution returned by the SVM algorithm is the hyperplane with the maximum margin, or distance to the closest points, and is thus known as the maximum-margin hyperplane. The right panel of Figure 4.1 illustrates that choice.

4.2.1 Primal Optimization Problem


We now derive the equations and optimization problem that define the SVM solution. The general equation of a hyperplane in Rⁿ is

w · x + b = 0   (4.3)


where w ∈ Rⁿ is a non-zero vector normal to the hyperplane and b ∈ R a scalar. Note that this definition of a hyperplane is invariant to non-zero scalar multiplication. Hence, for a hyperplane that does not pass through any sample point, we can scale w and b appropriately such that min_{(x,y)∈S} |w · x + b| = 1. We define this representation of the hyperplane, i.e., the corresponding pair (w, b), as the canonical hyperplane. The distance of any point x0 ∈ Rⁿ to a hyperplane defined by (4.3) is given by

|w · x0 + b| / ‖w‖   (4.4)

Thus, for a canonical hyperplane, the margin ρ is given by

ρ = min_{(x,y)∈S} |w · x + b| / ‖w‖ = 1 / ‖w‖   (4.5)

Figure 4.3: Margin and equations of the hyperplanes for a canonical maximum-margin hyperplane. The separating hyperplane is w · x + b = 0, and the marginal hyperplanes, represented by dashed lines on the figure, are w · x + b = +1 and w · x + b = −1.
hyperplane. The marginal hyperplanes are represented by dashed lines on
the figure.
Figure. 4.3 illustrates the margin for a maximum-margin hyperplane with a canonical representation (w , b). It also shows
¯
the marginal hyperplanes, which are the hyperplanes parallel to the separating hyperplane and passing through the closest points
We define this representation of the hyperplane, i.e., the corresponding pair (w, b),
on the negative or positive sides. Since they are parallel to the separating hyperplane,N they admit the same normal vector w .
as the canonical hyperplane. The distance of any point x0 ∈ R to a hyperplane ¯
Furthermore, by definition of a canonical representation, for a point x on a marginal hyperplane, |w · x + b| = 1, and thus the
defined by (4.3) is given by
equations of the marginal hyperplanes are w · x + b = ±1.
¯ ¯ ¯
¯ ¯
A hyperplane defined by (w , b) correctly classifies a|wtraining · x0 + b|point xi , i ∈ [1.m] when w · xi + b has the same sign as
¯ ¯ ¯ ¯
yi . For a canonical hyperplane, by definition, we have |w ·∥w∥ xi + b| .≥ 1 for all i ∈ [1, m]; thus, xi is(4.4)
correctly classified when
¯ ¯ ¯
yi (w · xi + b) ≥ 1. In view of (4.5), maximizing the margin of a canonical hyperplane is equivalent to minimizing ||w || or
1 ¯ 2¯ Thus, for a canonical hyperplane, the margin ρ is given by ¯
2 ||w || . Thus, in the separable case, the SVM solution, which is a hyperplane maximizing the margin while correctly classifying
¯
all training points, can be expressed as the solution to the|w following
· x + b| convex1 optimization problem:
ρ = min = . (4.5)
(x,y)∈S ∥w∥ ∥w∥
1
Figure 4.2 illustrates the min margin for a||maximum-margin
w ||2 hyperplane with a canon- (4.6)
w ,b
ical representation (w, b).¯ It also shows 2 ¯ the marginal hyperplanes, which are the
hyperplanes parallelsubject to theto: yi (w
separating · xi + b) ≥and ∀i ∈ [1.m]
1, passing (4.7)
¯hyperplane
¯ through the closest
points on the negative or positive sides. Since they are parallel to the separating
Since the objective function is quadratic and the constraints affine, the above optimization problem is in fact a specific
hyperplane, they admit the same normal vector w. Furthermore, by definition of a
instance of quadratic programming (QP), a family of problems extensively studied in optimization. A variety of commercial and
open-source solverscanonical
are availablerepresentation, for a point
for solving convex x on a marginal
QP problems. hyperplane,
Additionally, |w by
motivated · x the
+ b|empirical
= 1, success of SVMs
and thus the equations of the marginal hyperplanes are w · x + b = ±1.
along with its rich theoretical underpinnings, specialized methods have been developed to more efficiently solve this particular
convex QP problem. A hyperplane defined by (w, b) correctly classifies a training point xi , i ∈ [1, m]
when w · xi + b has the same sign as yi . For a canonical hyperplane, by definition,
we have |w · xi + b| ≥ 1 for all i ∈ [1, m]; thus, xi is correctly classified when
4.2.2 Constrained Optimization
yi (w · xi + b) ≥ 1. In view of (4.5), maximizing the margin of a canonical hyperplane
We now defineisa equivalent
general constrained optimization 1
to minimizing ∥w∥ or problem 2 and the specific properties associated to convex constrained
2 ∥w∥ . Thus, in the separable case, the SVM
optimization problems.solution, which is a hyperplane maximizing the margin while correctly classifying all
Let X ⊆ Rn and f , gi : X
training → R,can
points, for be i ∈ [1, m].asThen,
all expressed a constrained
the solution to the optimization problem
following convex of the form
optimization
problem:
Copyright c 2019, Dr. Mustafa El-Halabi 52
SUPPORT VECTOR MACHINES 53

min f (x)
x∈X
¯
¯
subject to: gi (x) ≤ 0, ∀i ∈ {1, ... , m} (4.8)
¯
This general formulation does not make any convexity assumptions and can be augmented with equality constraints. It is referred to as the primal problem, in contrast with a related problem introduced later. We will denote by p* the optimal value of the objective.
For any x ∈ X, we will denote by g(x) the vector (g1(x), ..., gm(x))ᵀ. Thus, the constraints can be written as g(x) ≤ 0. To any constrained optimization problem, we can associate a Lagrange function that plays an important role in the analysis of the problem and its relationship with another related optimization problem.

Definition 1. (Lagrangian) The Lagrange function or the Lagrangian associated to the general constrained optimization problem defined in (4.8) is the function defined over X × R₊ᵐ by:

∀x ∈ X, ∀αᵢ ≥ 0,   L(x, α) = f(x) + Σ_{i=1}^m αᵢ gᵢ(x)   (4.9)

where the variables αᵢ are known as the Lagrange multipliers, with α = (α1, ..., αm)ᵀ.

Any equality constraint of the form g(x) = 0 for a function g can be equivalently expressed by two inequalities: −g(x) ≤ 0 and +g(x) ≤ 0. Let α₋ ≥ 0 be the Lagrange multiplier associated to the first constraint and α₊ ≥ 0 the one associated to the second constraint. The sum of the terms corresponding to these constraints in the definition of the Lagrange function can therefore be written as αg(x) with α = (α₊ − α₋). Thus, in general, for an equality constraint g(x) = 0 the Lagrangian is augmented with a term αg(x), but with α ∈ R not constrained to be non-negative. Note that in the case of a convex optimization problem, equality constraints g(x) = 0 are required to be affine, since both g(x) and −g(x) are required to be convex.
Definition 2. (Dual Function) The Lagrange dual function associated to the constrained optimization problem is defined by

∀α ≥ 0,   F(α) = min_{x∈X} L(x, α) = min_{x∈X} ( f(x) + Σ_{i=1}^m αᵢ gᵢ(x) )   (4.10)

Note that F is always concave, since the Lagrangian is linear with respect to α and since the minimum preserves concavity. We further observe that

∀α ≥ 0,   F(α) ≤ p*   (4.11)

since for any feasible x, f(x) + Σ_{i=1}^m αᵢ gᵢ(x) ≤ f(x). The dual function naturally leads to the following optimization problem.
Definition 3. (Dual Problem) The dual (optimization) problem associated to the constrained optimization problem is

max_α F(α)
subject to: α ≥ 0   (4.12)

The dual problem is always a convex optimization problem (as the maximization of a concave function). Let d* denote its optimal value. By (4.11), the following inequality always holds:

d* ≤ p*   (weak duality)   (4.13)

The difference (p* − d*) is known as the duality gap. The equality case

d* = p*   (strong duality)   (4.14)

does not hold in general. However, strong duality does hold when convex optimization problems satisfy a constraint qualification. When the problem is convex, differentiable, and the constraints are qualified, the solution is characterized by the Karush-Kuhn-Tucker (KKT) conditions, given next.


Theorem 1. (Karush-Kuhn-Tucker Theorem) Assume that f, gᵢ : X → R, ∀i ∈ {1, ..., m}, are convex and differentiable and that the constraints are qualified. Then x* is a solution of the constrained program if and only if there exists α* ≥ 0 such that

∇ₓ L(x*, α*) = 0   (4.15)

∇_α L(x*, α*) ≤ 0   (4.16)

α* · g(x*) = Σ_{i=1}^m αᵢ* gᵢ(x*) = 0   (4.17)

These conditions are known as the KKT conditions. The last condition, (4.17), is known as the complementarity condition.

4.2.3 Support Vectors


Back to the SVM optimization problem. The objective function as well as the affine constraints are convex and differentiable. Thus, the KKT conditions apply at the optimum.
We introduce Lagrange multipliers αᵢ ≥ 0, i ∈ [1, m], associated to the m constraints, and denote by α the vector (α1, ..., αm)ᵀ. The Lagrangian can then be defined for all w ∈ Rⁿ, b ∈ R, and α ∈ R₊ᵐ, by

L(w, b, α) = (1/2)‖w‖² − Σ_{i=1}^m αᵢ [yᵢ(w · xᵢ + b) − 1]   (4.18)

The KKT conditions are obtained by setting the gradient of the Lagrangian with respect to the primal variables w and b to zero and by writing the complementarity conditions:

∇_w L(w, b, α) = w − Σᵢ αᵢ yᵢ xᵢ = 0   ⇒   w = Σᵢ αᵢ yᵢ xᵢ   (4.19)

∇_b L(w, b, α) = −Σᵢ αᵢ yᵢ = 0   ⇒   Σᵢ αᵢ yᵢ = 0   (4.20)

∀i,  αᵢ [yᵢ(w · xᵢ + b) − 1] = 0   ⇒   αᵢ = 0  ∨  yᵢ(w · xᵢ + b) = 1   (4.21)
We can see that the weight vector w solving the SVM problem is a linear combination of the training set vectors x1, ..., xm. A vector xᵢ appears in that expansion iff αᵢ ≠ 0. Such vectors are called support vectors. By the complementarity conditions, if αᵢ ≠ 0, then yᵢ(w · xᵢ + b) = 1. Thus, support vectors lie on the marginal hyperplanes w · xᵢ + b = ±1.
Support vectors fully define the maximum-margin hyperplane or SVM solution, which justifies the name of the algorithm. By definition, vectors not lying on the marginal hyperplanes do not affect the definition of these hyperplanes; in their absence, the solution to the SVM problem remains unchanged. Note that while the solution w of the SVM problem is unique, the support vectors are not. In dimension n, n + 1 points are sufficient to define a hyperplane. Thus, when more than n + 1 points lie on a marginal hyperplane, different choices are possible for the n + 1 support vectors.

4.2.4 Dual Optimization Problem


To derive the dual form of the constrained optimization problem (4.6)-(4.7), we plug into the Lagrangian the definition of w in terms of the dual variables as expressed in (4.19) and apply the constraint (4.20). This yields

L(w, b, α) = (1/2) (Σᵢ αᵢ yᵢ xᵢ) · (Σⱼ αⱼ yⱼ xⱼ) − Σᵢ αᵢ yᵢ xᵢ · (Σⱼ αⱼ yⱼ xⱼ) − Σᵢ αᵢ yᵢ b + Σᵢ αᵢ   (4.22)

= Σᵢ αᵢ − (1/2) Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ (xᵢ · xⱼ)   (4.23)
This leads to the following dual optimization problem for SVMs in the separable case:


max_α Σᵢ αᵢ − (1/2) Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ (xᵢ · xⱼ)

subject to: αᵢ ≥ 0, ∀i ∈ [1, m]
subject to: Σ_{i=1}^m αᵢ yᵢ = 0   (4.24)

It can be easily shown that the function F(α) = Σᵢ αᵢ − (1/2) Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ (xᵢ · xⱼ) is a concave function of α. Since the constraints are affine and convex, the maximization problem (4.24) is equivalent to a convex optimization problem. Since F is a quadratic function of α, this dual optimization problem is also a QP problem, as in the case of the primal optimization, and once again both general-purpose and specialized QP solvers can be used to obtain the solution. Moreover, since the constraints are affine, they are qualified and strong duality holds. Thus, the primal and dual problems are equivalent, i.e., the solution α of the dual problem (4.24) can be used directly to determine the hypothesis returned by SVMs, using (4.19):

h(x) = sgn(w · x + b) = sgn( Σ_{i=1}^m αᵢ yᵢ (xᵢ · x) + b )   (4.25)

Since support vectors lie on the marginal hyperplanes, for any support vector xᵢ, w · xᵢ + b = yᵢ, and thus b can be obtained via

b = yᵢ − Σ_{j=1}^m αⱼ yⱼ (xᵢ · xⱼ)   (4.26)

The dual optimization problem and the decision rule reveal an important property of SVMs: the hypothesis solution depends only on inner products between vectors and not directly on the vectors themselves.
Equation (4.26) can now be used to derive a simple expression of the margin ρ in terms of α. Since (4.26) holds for all i with αᵢ ≠ 0, multiplying both sides by αᵢ yᵢ and taking the sum leads to

Σ_{i=1}^m αᵢ yᵢ b = Σ_{i=1}^m αᵢ yᵢ² − Σ_{i=1}^m Σ_{j=1}^m αᵢ αⱼ yᵢ yⱼ (xᵢ · xⱼ)

Using the fact that yᵢ² = 1 along with (4.19) and (4.20) then yields

0 = Σ_{i=1}^m αᵢ − ‖w‖²

Noting that αᵢ ≥ 0, we obtain the following expression of the margin ρ in terms of α:

ρ² = 1/‖w‖² = 1 / Σ_{i=1}^m αᵢ   (4.27)

Remark 10. In SVMs, the optimization problem is convex, so we need not worry about local optima, in contrast to neural networks, which are plagued with them.

4.3 Kernels
Kernel methods are widely used in machine learning. They are flexible techniques that can be used to extend algorithms
such as SVMs to define non-linear decision boundaries. Other algorithms that only depend on inner products between sample
points can be extended similarly. The main idea behind these methods is based on so-called kernels or kernel functions, which,
under some technical conditions of symmetry and positive-definiteness, implicitly define an inner product in a high-dimensional
space. Replacing the original inner product in the input space with positive definite kernels immediately extends algorithms such
as SVMs to a linear separation in that high-dimensional space, or, equivalently, to a non-linear separation in the input space.


In the previous sections, we presented an algorithm for linear classification, SVMs, which is both effective in applications and benefits from a strong theoretical justification. In practice, linear separation is often not possible. Figure 4.4(a) shows an example where any hyperplane crosses both populations. However, one can use more complex functions to separate the two sets, as in Figure 4.4(b). One way to define such a non-linear decision boundary is to use a non-linear mapping Φ from the input space X to a higher-dimensional space H, where linear separation is possible.

Figure 4.4: Non-linearly separable case. The classification task consists of discriminating between solid squares and solid circles. (a) No hyperplane can separate the two populations. (b) A non-linear mapping can be used instead.
The dimension of H can truly be very large in practice. For example, in the case of document classification, one may wish to use as features sequences of three consecutive words, i.e., trigrams. Thus, with a vocabulary of just 100,000 words, the dimension of the feature space H reaches 10¹⁵. On the positive side, margin bounds show that, remarkably, the generalization ability of large-margin classification algorithms such as SVMs does not depend on the dimension of the feature space, but only on the margin ρ and the number of training examples m. Thus, with a favorable margin ρ, such algorithms could succeed even in very high-dimensional spaces. However, determining the hyperplane solution requires multiple inner product computations in high-dimensional spaces, which can be very costly. A solution to this problem is to use kernel methods, which are based on kernels or kernel functions.

Definition 4. (Kernels) A function K : X × X → R is called a kernel over X. The idea is to define a kernel K such that for any two points x, x′ ∈ X, K(x, x′) is equal to an inner product of the vectors Φ(x) and Φ(x′):

∀x, x′ ∈ X,   K(x, x′) = ⟨Φ(x), Φ(x′)⟩   (4.28)

for some mapping Φ : X → H to a Hilbert space H called a feature space. Since an inner product is a measure of the similarity of two vectors, K is often interpreted as a similarity measure between elements of the input space X.
The kernel has two main advantages:

1. Efficiency: K is often significantly more efficient to compute than Φ and an inner product in H. We need fewer computations to evaluate K than to evaluate ⟨Φ(x), Φ(x′)⟩ directly.

2. Flexibility: there is no need to explicitly define or compute a mapping Φ. The kernel can be arbitrarily chosen as long as the existence of Φ is guaranteed; K should satisfy a condition known as Mercer’s condition, i.e., K should be positive definite symmetric (PDS). This condition is important to guarantee the convexity of the optimization problem for algorithms such as SVMs, and thus convergence guarantees.
The following are some standard examples of kernels commonly used in applications.

1. Linear Kernels:

   K(x, x′) = ⟨x, x′⟩   (4.29)


2. Polynomial Kernels: For any constant c > 0, a polynomial kernel of degree d ∈ N is the kernel defined over Rᴺ by:

   K(x, x′) = (⟨x, x′⟩ + c)ᵈ   (4.30)
For d = 2, polynomial kernels are referred to as quadratic kernels.

3. Gaussian Kernels: For any constant σ > 0, a Gaussian kernel or radial basis function (RBF) is the kernel K defined over Rᴺ by:

   K(x, x′) = exp( −‖x′ − x‖² / (2σ²) )   (4.31)

   Gaussian kernels are among the most frequently used kernels in applications.

4. Sigmoid Kernels: For any real constants a, b ≥ 0, a sigmoid kernel is the kernel K defined over Rᴺ by:

   K(x, x′) = tanh(a⟨x, x′⟩ + b)   (4.32)
Using sigmoid kernels with SVMs leads to an algorithm that is closely related to learning algorithms based on simple
neural networks, which are also often defined via a sigmoid function.
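For concreteness, here are the four kernels (4.29)-(4.32) written out as small numpy functions; the parameter values in the demo call at the end are illustrative assumptions.

import numpy as np

def linear_kernel(x, xp):
    return x @ xp                                   # (4.29)

def polynomial_kernel(x, xp, c=1.0, d=2):
    return (x @ xp + c) ** d                        # (4.30)

def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((xp - x) ** 2) / (2.0 * sigma ** 2))   # (4.31)

def sigmoid_kernel(x, xp, a=1.0, b=0.0):
    return np.tanh(a * (x @ xp) + b)                # (4.32)

x, xp = np.array([1.0, -1.0]), np.array([0.5, 2.0])
for K in (linear_kernel, polynomial_kernel, gaussian_kernel, sigmoid_kernel):
    print(K.__name__, K(x, xp))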

As an example, consider the XOR classification problem in an input space of dimension N = 2, as shown in Figure 4.5. The label of a point is blue iff exactly one of its coordinates is 1. The XOR classification problem is obviously non-separable in the input space.
Figure 4.5: Illustration of the XOR classification problem. XOR problem linearly non-separable in the input space.

Now, using the quadratic kernel K(x, x′) = (⟨x, x′⟩ + 1)², the problem is equivalent to an inner product in a feature space of dimension N = 6:

K(x, x′) = (⟨x, x′⟩ + 1)² = (x1², x2², √2 x1x2, √2 x1, √2 x2, 1) · (x1′², x2′², √2 x1′x2′, √2 x1′, √2 x2′, 1)   (4.33)

Mapping the blue and red points to the six-dimensional space defined by a second-degree polynomial, as described in (4.33), the problem becomes separable by the hyperplane of equation x1x2 = 0. Figure 4.6 illustrates that by showing the projection of these points on the two-dimensional space defined by their third and fourth coordinates.
4.4 SVM with PDS Kernels
As noted earlier, the dual optimization problem for SVMs, as well as the form of the solution, did not directly depend on the input vectors but only on inner products. Since a PDS kernel implicitly defines an inner product, we can extend SVMs and combine them with an arbitrary PDS kernel K by replacing each instance of an inner product ⟨x, x′⟩ with K(x, x′). This leads to the following general form of the SVM optimization problem and solution with PDS kernels, extending (4.24):


Figure 4.6: Illustration of the XOR classification problem. XOR problem linearly separable in the feature space using a quadratic
kernel.
max_α Σᵢ αᵢ − (1/2) Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ K(xᵢ, xⱼ)

subject to: αᵢ ≥ 0, ∀i ∈ [1, m]
subject to: Σ_{i=1}^m αᵢ yᵢ = 0   (4.34)

As a result, the hypothesis solution (4.25) can be re-written as

h(x) = sgn( Σ_{i=1}^m αᵢ yᵢ K(xᵢ, x) + b )   (4.35)

with b = yᵢ − Σ_{j=1}^m αⱼ yⱼ K(xᵢ, xⱼ) for any support vector xᵢ.

Example 13. An SVM is trained with the following data:

x1   x2   y
1    0    1
−1   0    −1
2    0    1
−2   0    −1

We propose to use a linear kernel to solve the problem. The kernel is K(xₘ, xₙ) = xₘᵀxₙ. In other words, Φ(x) = x. We index the data set as x1 = (1, 0)ᵀ, x2 = (−1, 0)ᵀ, x3 = (2, 0)ᵀ, and x4 = (−2, 0)ᵀ. The dual optimization problem is:

max_α Σ_{n=1}^4 αₙ − (1/2) Σ_{n=1}^4 Σ_{m=1}^4 αₙ αₘ yₙ yₘ Kₘ,ₙ

subject to: αᵢ ≥ 0, ∀i ∈ [1, 4]
subject to: α1 y1 + α2 y2 + α3 y3 + α4 y4 = 0

The kernel matrix is

K = [  1  −1   2  −2
      −1   1  −2   2
       2  −2   4  −4
      −2   2  −4   4 ]   (4.36)

Substituting the values of the kernel matrix and using the symmetry of the problem (α1 = α2 and α3 = α4), we can re-write the dual optimization problem as

min_{α1,α3} 2(α1² + 4α1α3 + 4α3² − α1 − α3)

subject to: α1 ≥ 0, α3 ≥ 0

Up to the constant factor 2, the objective function is (after completing the square)

min_{α1,α3} (α1 + 2α3 − 1/2)² + α3 − 1/4

Since α3 is always nonnegative, to minimize the objective function we set α3 = 0 and α1 = 1/2. Hence, we have shown that α1 = α2 = 1/2 and α3 = α4 = 0. Namely, x1 and x2 are support vectors, and x3 and x4 are removable without changing the solution. This yields

w = Σₙ αₙ yₙ xₙ = (1/2)(x1 − x2) = (1, 0)ᵀ

The constant b is given by b = yₙ − Σ_{m=1}^2 αₘ yₘ K(xₙ, xₘ) for any support vector xₙ. Picking x1, we get b = 1 − α1 K(x1, x1) y1 − α2 K(x1, x2) y2 = 1 − (1/2)(1) − (1/2)(−1)(−1) = 0. Thus, the decision boundary is wᵀx + b = x1 = 0.
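As a quick numerical check of this example (a sketch, not part of the notes), plugging α = (1/2, 1/2, 0, 0) into the expansion (4.19), the offset formula (4.26), and the decision rule (4.35) recovers w = (1, 0)ᵀ and b = 0, classifies the training set correctly, and confirms the margin identity (4.27).

import numpy as np

X = np.array([[1.0, 0.0], [-1.0, 0.0], [2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])
alpha = np.array([0.5, 0.5, 0.0, 0.0])

K = X @ X.T                                   # linear-kernel Gram matrix, eq. (4.36)
w = (alpha * y) @ X                           # w = sum_n alpha_n y_n x_n, eq. (4.19)
b = y[0] - np.sum(alpha * y * K[0])           # b from support vector x_1, eq. (4.26)
print(w, b)                                   # [1. 0.] 0.0

pred = np.sign(K @ (alpha * y) + b)           # h(x) = sgn(sum_n alpha_n y_n K(x_n, x) + b)
print(pred)                                   # matches y

print(1.0 / alpha.sum(), 1.0 / (w @ w))       # rho^2 via (4.27) equals 1/||w||^2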
