KEMBAR78
Supervised Learning | PDF | Spamming | Loss Function
0% found this document useful (0 votes)
15 views5 pages

Supervised Learning

The lecture introduces supervised learning, emphasizing its goal of making predictions from data, such as classifying emails as spam or not. It formalizes the setup with training data pairs and discusses various label spaces, feature vectors, hypothesis classes, loss functions, and methods to avoid overfitting through train/test splits. The importance of making assumptions about the data and the implications of the No Free Lunch Theorem are also highlighted.

Uploaded by

chakumchukum
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views5 pages

Supervised Learning

The lecture introduces supervised learning, emphasizing its goal of making predictions from data, such as classifying emails as spam or not. It formalizes the setup with training data pairs and discusses various label spaces, feature vectors, hypothesis classes, loss functions, and methods to avoid overfitting through train/test splits. The importance of making assumptions about the data and the implications of the No Free Lunch Theorem are also highlighted.

Uploaded by

chakumchukum
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 5

Lecture 1: Supervised Learning

back next

Lecture 1 "Supervised Learning Setup" -Cornell CS4780 Ma…


Ma…

Video II

Intro

The goal in supervised learning is to make predictions from data. For example, one popular application of
supervised learning is email spam filtering. Here, an email (the data instance) needs to be classified as spam
or not-spam. Following the approach of traditional computer science, one might be tempted to write a carefully
designed program that follows some rules to decide if an email is spam or not. Although such a program
might work reasonably well for a while, it has significant drawbacks. As email spam changes it would have to
be rewritten. Spammers could attempt to reverse engineer the software and design messages that circumvent
it. And even if it is successful, it could probably not easily be applied to different languages. Machine Learning
uses a different approach to generate a program that can make predictions from data. Instead of
programming it by hand it is learned from past data. This process works if we have data instances for which
we know exactly what the right prediction would have been. For example past data might be user-annotated as
spam or not-spam. A machine learning algorithm can utilize such data to learn a program, a classifier, to
predict the correct label of each annotated data instance. Other successful applications of machine learning
include web-search ranking (predict which web-page the user will click on based on his/her search query),
placing of online advertisements (predict the expected revenue of an ad, when placed on a homepage, which
is seen by a specific user), visual object recognition (predict which object is in an image - e.g. a camera
mounted on a self-driving car), face-detection (predict if an image patch contains a human face or not).

Setup

Let us formalize the supervised machine learning setup. Our training data comes in pairs of inputs (x, y) ,
d
where x ∈ R is the input instance and y its label. The entire training data is denoted as

d
D = {(x1 , y1 ), … , (xn , yn )} ⊆ R ×C

where:

d
R is the d-dimensional feature space
th
xi is the input vector of the i sample
th
yi is the label of the i sample

C is the label space

The data points (xi , yi ) are drawn from some (unknown) distribution P(X, Y ). Ultimately we

would like to learn a function h such that for a new pair (x, y) ∼ P , we have h(x) = y with

high probability (or h(x) ≈ y ). We will get to this later. For now let us go through some examples of

X and Y .

Examples of Label Spaces


There are multiple scenarios for the label space C:
Binary C = {0, 1} or
Eg. spam filtering. An email is either spam (+1), or not (−1).
classification C = {−1, +1} .

Multi-class C = {1, 2, ⋯ , K} Eg. face classification. A person can be exactly one of K identities
classification (K ≥ 2) . (e.g., 1="Barack Obama", 2="George W. Bush", etc.).
Regression C = R. Eg. predict future temperature or the height of a person.

Examples of feature vectors

We call xi a feature vector. Each one of its d dimensions is a features describing the i−th sample. Let us look
at some examples:
1 2 d 1
Patient Data in a hospital. xi = (x , x , ⋯ , x ) , where x = 0 or 1 , may refer
i i i i
2 3
to the patient i's gender, x could be the height of patient i in cm, and x may be his/her in years, etc.
i i

In this case, d ≤ 100 and the feature vector is dense, i.e., the number of nonzero coordinates in xi is

large relative to d .
1 2 d α
Text document in bag-of-words format. xi = (x , x , ⋯ , x ) , where x is the
i i i i
th
number of occurrences of the α word in a dictionary in document i (often referred to as term
frequencies). In this case, d ∼ 100, 000 − 10M and the feature vector is sparse, i.e.,
xi consists of mostly zeros. A common way to avoid the use of a dictionary is to use feature hashing
instead to directly hash any string to a dimension index (the advantage is that no dictionary is needed, but
a minor disadvantage can be that multiple words are hashed into the same dimension.) A popular
improvement over bag-of-words features is TF-IDF, which down-scales common words and highlights
rare words.
1 2 3k
Images. Here, the features typically represent pixel values. xi = (x ,x ,⋯,x ),
i i i
3j−2 3j−1 3j
where x ,x , and x refer to the red, green, and blue values of the jth pixel in the
i i i

image. In this case, d ∼ 100, 000 − 10M and the feature vector is dense. A 7MP

camera results in 7M × 3 = 21M features.

Hypothesis classes and No Free Lunch

Before we can find a function h , we must specify what type of function it is that we are looking for. It could be
an artificial neural network, a decision tree or many other types of classifiers. We call the set of possible
functions the hypothesis class. By specifying the hypothesis class, we are encoding important assumptions
about the type of problem we are trying to learn. The No Free Lunch Theorem states that every successful ML
algorithm must make assumptions. This also means that there is no single ML algorithm that works for every
setting.

Loss Functions

There are typically two steps involved in learning a hypothesis function h() . First, we select the type of
machine learning algorithm that we think is appropriate for this particular learning problem. This defines the

hypothesis class H, i.e. the set of functions we can possibly learn. The second step is to find the best

function within this class, h ∈ H . This second step is the actual learning process and often, but not
always, involves an optimization problem. Essentially, we try to find a function h within the hypothesis class
that makes the fewest mistakes within our training data. (If there is not a single function we typically try to
choose the "simplest" by some notion of simplicity - but we will cover this in more detail in a later class.) How
can we find the best function? For this we need some way to evaluate what it means for one function to be
better than another. This is where the loss function (aka risk function) comes in. A loss function evaluates a

hypothesis h ∈ H on our training data and tells us how bad it is. The higher the loss, the worse it is - a
loss of zero means it makes perfect predictions. It is common practice to normalize the loss by the total
number of training samples, n, so that the output can be interpreted as the average loss per sample (and is
independent of n).

Examples:

Zero-one loss:

The simplest loss function is the zero-one loss. It literally counts how many mistakes an hypothesis function
h makes on the training set. For every single example it suffers a loss of 1 if it is mispredicted, and 0
otherwise. The normalized zero-one loss returns the fraction of misclassified training samples, also often
referred to as the training error. The zero-one loss is often used to evaluate classifiers in multi-class/binary
classification settings but rarely useful to guide optimization procedures because the function is non-
differentiable and non-continuous. Formally, the zero-one loss can be stated has:

n
1 1, if h(xi ) ≠ yi
L0/1 (h) = ∑ δh(x , where δh(x = {
i )≠yi i )≠yi
n 0, o.w.
i=1

This loss function returns the error rate on this data set D. For every example that the classifier misclassifies
(i.e. gets wrong) a loss of 1 is suffered, whereas correctly classified samples lead to 0 loss.

Squared loss:

The squared loss function is typically used in regression settings. It iterates over all training samples and
2
suffers the loss (h(xi ) − yi ) . The squaring has two effects: 1., the loss suffered is always nonnegative; 2.,
the loss suffered grows quadratically with the absolute mispredicted amount. The latter property encourages
no predictions to be really far off (or the penalty would be so large that a different hypothesis function is
likely better suited). On the flipside, if a prediction is very close to be correct, the square will be tiny and little
attention will be given to that example to obtain zero error. For example, if |h(xi ) − yi | = 0.001 the
squared loss will be even smaller, 0.000001 , and will likely never be fully corrected. If, given an input x, the
label y is probabilistic according to some distribution P (y|x) then the optimal prediction to minimize the
squared loss is to predict the expected value, i.e. h(x) = EP (y|x) [y] . Formally the squared loss is:

n
1
2
Lsq (h) = ∑(h(xi ) − yi ) .
n
i=1

Absolute loss:

Similar to the squared loss, the absolute loss function is also typically used in regression settings. It suffers
the penalties |h(xi ) − yi | . Because the suffered loss grows linearly with the mispredictions it is more
suitable for noisy data (when some mispredictions are unavoidable and shouldn't dominate the loss). If,
given an input x, the label y is probabilistic according to some distribution P (y|x) then the optimal
prediction to minimize the absolute loss is to predict the median value, i.e. h(x) = MEDIANP (y|x) [y] .
Formally, the absolute loss can be stated as:

n
1
Labs (h) = ∑ |h(xi ) − yi |.
n
i=1

Generalization:

Given a loss function, we can then attempt to find the function h that minimizes the loss:

h = argminh∈H L(h)

A big part of machine learning focuses on the question, how to do this minimization efficiently.

If you find a function h(⋅) with low loss on your data D , how do you know whether it will still get

examples right that are not in D ?

Bad example: "memorizer" h(⋅)

yi , if ∃(xi , yi ) ∈ D, s.t., x = xi ,
h(x) = {
0, o.w.

For this h(⋅) , we get 0% error on the training data D, but does horribly with samples not in D, i.e., there's the
overfitting issue with this function.

Train / Test splits

To resolve the overfitting issue, we usually split D into three subsets: DTR as the training data, DVA , as the
validation data, and DTE , as the test data. Usually, they are split into a proportion of 80% , 10% , and 10% .
Then, we choose h(⋅) based on DTR , and evaluate h(⋅) on DTE .

Quiz: Why do we need DVA ?


DVA is used to check whether the h(⋅) obtained from DTR suffers from the overfitting issue. h(⋅) will need

to be validated on DVA , if the loss is too large, h(⋅) will get revised based on DTR , and validated again on
DVA . This process will keep going back and forth until it gives low loss on DVA . Here's a trade-off between

the sizes of DTR and DVA : the training results will be better for a larger DTR , but the validation will be more
reliable (less noisy) if DVA is larger.

How to Split the Data?

You have to be very careful when you split the data in Train,Validation,Test. The test set must simulate a real
test scenario, i.e. you want to simulate the setting that you will encounter in real life. For example, if you want
to train an email spam filter, you train a system on past data to predict if future email is spam. Here it is
important to split train / test temporally - so that you strictly predict the future from the past. If there is no such
thing as a temporal component, it is often best to split uniformly at random. Definitely never split
alphabetically, or by feature values.

By time, if the data is temporally collected.


In general, if the data has a temporal component, we must split it by time.

Uniformly at random, if (and, in general, only if) the data is i. i. d. .


The test error (or testing loss) approximates the true generalization error/loss.

Putting everything together:

We train our classifier by minimizing the training loss:

1

Learning: h (⋅) = argmin ∑ ℓ(x, y|h(⋅)),
h(⋅)∈H
|DTR |
(x,y)∈DTR

where H is the hypothetical class (i.e., the set of all possible classifiers h(⋅) ). In other words, we are trying to
find a hypothesis h which would have performed well on the past/known data.

We evaluate our classifier on the testing loss:

1

Evaluation: ϵTE = ∑ ℓ(x, y|h (⋅)).
|DT E |
(x,y)∈DTE

If the samples are drawn i.i.d. from the same distribution P , then the testing loss is an unbiased estimator of
the true generalization loss:


Generalization: ϵ = E(x,y)∼P [ℓ(x, y|h (⋅))].

Quiz: Why does ϵTE → ϵ as |DTE | → +∞ ? This is due to the weak law of large numbers,
which says that the empirical average of data drawn from a distribution converges to its mean.

No free lunch. Every ML algorithm has to make assumptions on which hypothesis class H should you

choose? This choice depends on the data, and encodes your assumptions about the data set/distribution P .

Clearly, there's no one perfect H for all problems.

Example. Assume that (x1 , y1 ) = (1, 1) , (x2 , y2 ) = (2, 2) , (x3 , y3 ) = (3, 3) , (x4 , y4 ) = (4, 4) , and
(x5 , y5 ) = (5, 5) .

Question: what is the value of y if x = 2.5 ? Well, it is utterly impossible to know the answer without
assumptions. The most common assumption of ML algorithms is that the function to be approximated is
locally smooth.

You might also like