Machine Learning Coms-4771
Alina Beygelzimer
Tony Jebara, John Langford, Cynthia Rudin
February 3, 2008
(partially based on Yann LeCun’s and Sam Roweis’s slides; see links at the web page)
Logistics
I The course web page is
http://hunch.net/~coms-4771
I If you have a question, email
beygel@us.ibm.com
or post it at
http://coms-4771.blogspot.com/
I Do interrupt and ask questions during the class.
I The web page has notes on probability theory and statistics (if
you need to refresh your memory)
What is Machine Learning?
I In October 2006, Netflix announced a
$1M problem:
Predict the rating a given user would
assign to a given movie (based on 100
million past user-movie ratings).
I 10% improvement = $1M
I 2500 teams; annual progress prize of
www.netflixprize.com
$50K went to KorBell from AT&T
Labs (8.43%)
Netflix Problem Setup
I Success is measured by the root mean
squared error (RMSE) on the test set:
v
u n
u1 X
t (yi − ŷi )2 ,
n i=1
where yi and ŷi are the actual and
predicted movie ratings.
Netflix Problem Setup
I Success is measured by the root mean
squared error (RMSE) on the test set:
v
u n
u1 X
t (yi − ŷi )2 ,
n i=1
where yi and ŷi are the actual and
predicted movie ratings.
I Q: What’s the role of the probe set?
I Q: Why are there both quiz and test sets?
Netflix Problem Setup
I Success is measured by the root mean
squared error (RMSE) on the test set:
v
u n
u1 X
t (yi − ŷi )2 ,
n i=1
where yi and ŷi are the actual and
predicted movie ratings.
I Q: What’s the role of the probe set?
I Q: Why are there both quiz and test sets?
I We have a well defined task. How would
you go about solving it?
What is Machine Learning?
I We want robust, intelligent behavior. Hand-programming a solution
directly is not going to work. The world is too complex.
I Learning approach = programming by example: Get the machine to
program itself by showing it examples of the behavior we want.
Learning is about improving performance through experience.
I Learning is data driven. It can examine much larger amounts
of data than you can.
I Labeling examples is perhaps the easiest way to express
knowledge.
I Learning is general purpose—algorithm reuse!
I In reality, we specify a space of possible solutions, and let the
machine find a good solution in this space.
Learning Problems, Structure of Learning Machines
I Learning problem = (unknown) distribution of inputs and outputs D
+ (typically known) loss function L.
I Hypothesis space H = Space of functions mapping inputs to
outputs (H is often indexed by a set of parameters the algorithm
can tune to create different solutions)
I Learning algorithm searches (or prunes or tunes parameters in) H to
find a hypothesis minimizing the expected L on D, based on a
limited set of input-output examples.
The hardest part is deciding how to represent inputs/outputs and
how to select appropriate L and H.
How do we incorporate prior information?
Supervised Learning
Given a set of labeled examples, predict outputs of future unlabeled
examples.
Classification: Feature space X , discrete set of
labels Y (categories). Find a decision boundary
between the categories in Y .
Loss function: `(y , y 0 ) = 1(y 6= y 0 )
(zero-one loss)
Distribution D over X × Y . Find a classifier
h : X → Y minimizing the expected loss on D
given by Pr(x,y )∼D [h(x) 6= y ] = E(x,y )∼D `(y , h(x)).
Regression (“curve fitting” or “function
approximation”): Y = R
Loss function: `sq (y , y 0 ) = (y − y 0 )2 (squared
loss)
Learn a continuous mapping f : X → R minimizing
E(x,y )∼D `sq (y , f (x)).
Training vs. Testing
I Training (empirical) error: the average loss on the training
data
I Test error: the average loss on the test data
I Ideally we want to minimize the test error, but we can’t
evaluate it! (Most of the time we don’t even know future
inputs.)
I Do we want to minimize the training error instead?
Training vs. Testing
I Training (empirical) error: the average loss on the training
data
I Test error: the average loss on the test data
I Ideally we want to minimize the test error, but we can’t
evaluate it! (Most of the time we don’t even know future
inputs.)
I Do we want to minimize the training error instead?
I NO. Consider an algorithm that memorizes training examples.
If a test example is in the training set, produce the memorized
output. Otherwise, choose a random output.
I We are overfitting: Training error is 0. Test error is HUGE.
I Learning is not Memorization. We want to generalize from
training examples to predict well on previously unseen
examples.
Inductive Bias
I There should be some hope that the data is predictive.
I No Free Lunch Theorems: an unbiased learner can never generalize.
Inductive bias = set of assumptions that favor some predictors over
others.
I Ways of incorporating bias: assumptions on the data distribution
(test data is drawn from the same distribution as the training data,
examples are independent), choice of the hypothesis class
I Examples: Occam’s Razor (choose the simplest consistent
hypothesis), maximum margin (attempt to maximize the width of
the boundary), nearest neighbors (guess the label based on closest
training examples).
Choosing Hypothesis Space
I The number of training examples for
which the training error and test error
start converging = “capacity” of the
learning machine.
I Can we bound the expected loss as a
function of the empirical loss, the
capacity of the family of functions,
and the size of the training set? (Yes,
sometimes.)
I Problem: if the class is too rich, there
is a risk of overfitting the data. If the
class if too simple, there is a risk of
not being able to approximate the
data well.
Q: How to choose the hypothesis space so that it is large enough to
contain a solution to your problem, yet small enough to ensure
generalization from the training sets you have?
Choosing Hypothesis Space
For each training set size, there is an optimal capacity for the machine.
Choosing the Loss Function
I L quantifies what it means to do well/poorly on the task
I Tradeoff: L captures what we actually want to minimize vs. L is
computationally easy to optimize.
I Loss function semantics:
I Optimizing squared loss means predicting the conditional mean
E(x,y )∼D (y | x).
I Optimizing the absolute loss |y − y 0 | means predicting the
conditional median.
I Good rule: start with what you want and try to derive a loss.
Other Types of Learning
I Unsupervised learning: given only inputs, automatically
discover structure (clustering, outlier detection, embedding in
low-dimensional manifold, compression). Not so well defined.
I Semi-supervised learning: labels are expensive, unlabeled data
is cheap. Use the unlabeled data to help you learn.
I Active learning: ask for labels on unlabeled examples of your
choice, direct the learning process.
I Reinforcement learning: learn how to act in a way that
maximizes your expected reward; your actions change the
distribution of future inputs.
Some Applications of Machine Learning
I Handwritten character recongition, speech recognition, speaker
verification, object detection, tracking objects in videos
I Search, targeted ads, recommendation systems, spam filtering, auctions
I Credit card fraud, insurance premium prediction, product pricing, stock
market analysis (Wall Street uses a lot of machine learning)
I Medical diagnosis and prognosis, fMRI analysis
I Game playing (adaptive opponents)
I Robotics, adaptive decision making under uncertainty