An introduction to
Machine Learning
Pierre Geurts
p.geurts@ulg.ac.be
Department of EE and CS &
GIGA-R, Bioinformatics and modelling
University of Liège
Outline
Introduction
Supervised Learning
Other learning protocols/frameworks
Machine Learning: definition
Machine Learning is concerned with the development,
the analysis, and the application of algorithms that
allow computers to learn
Learning:
A computer learns if it improves its performance at
some task with experience (i.e. by collecting data)
Extracting a model of a system solely from observations
(or simulations) of this system in some situations.
A model = some relationships between the variables
used to describe the system.
Two main goals: making predictions and better
understanding the system
3
Machine learning: when?
Learning is useful when:
Human expertise does not exist (navigating on Mars),
Humans are unable to explain their expertise (speech
recognition)
Solution changes in time (routing on a computer
network)
Solution needs to be adapted to particular cases (user
biometrics)
Example: it is easier to write a program that learns to
play checkers or backgammon well through self-play
than to encode the expertise of a master player into a
program.
4
Applications: autonomous driving
DARPA Grand challenge 2005: build a robot capable
of navigating 175 miles through desert terrain in less
than 10 hours, with no human intervention
The actual winning time of Stanley [Thrun et al., 05]
was 6 hours 54 minutes.
http://www.darpa.mil/grandchallenge/
Applications: recommendation system
Netflix prize: predict how much someone is going to love a
movie based on their movie preferences
Data: over 100 million ratings that over 480,000 users
gave to nearly 18,000 movies
Reward: $1,000,000 for a 10% improvement over
Netflix's current system
http://www.netflixprize.com
Applications: credit risk analysis
Data: [table of past credit data]
Logical rules automatically learned from the data: [example rules]
Applications
Machine learning has a wide spectrum of applications
including:
Retail: Market basket analysis, Customer relationship
management (CRM)
Finance: Credit scoring, fraud detection
Manufacturing: Optimization, troubleshooting
Medicine: Medical diagnosis
Telecommunications: Quality of service optimization,
routing
Bioinformatics: Motifs, alignment
Web mining: Search engines
...
8
Related fields
Artificial Intelligence: smart algorithms
Statistics: inference from a sample
Computer Science: efficient algorithms and complex
models
Systems and control: analysis, modeling, and control
of dynamical systems
Data Mining: searching through large volumes of data
Machine learning is one part of the data mining process:
Problem definition → Data generation (raw data) → Preprocessing (preprocessed data) → Machine learning (hypothesis) → Validation → Knowledge/Predictive model
Each step generates many questions:
Data generation: data types, sample size, online/offline...
Preprocessing: normalization, missing values, feature selection/extraction...
Machine learning: hypothesis, choice of learning paradigm/algorithm...
Hypothesis validation: cross-validation, model deployment...
10
Glossary
Data = a table (dataset, database, sample)
Variables (attributes, features) = measurements made on objects
Objects (samples, observations, individuals, examples, patterns) = the rows of the table

[Table: 10 objects (rows) described by 11 variables VAR 1, ..., VAR 11 (columns); each cell holds a measurement such as 0, 1, or 2]

Dimension = number of variables
Size = number of objects
Objects: samples, patients, documents, images...
Variables: genes, proteins, words, pixels...
11
Outline
Introduction
Supervised Learning
Introduction
Model selection, cross-validation, overfitting
Some supervised learning algorithms
Beyond classification and regression
Other learning protocols/frameworks
12
Supervised learning
Learning sample: inputs and output

A1     A2     A3  A4     Y
-0.69  -0.72  Y   0.47   Healthy
-2.3   -1.2   N   0.15   Disease
0.32   -0.9   N   -0.76  Healthy
0.37   -1     Y   -0.59  Disease
-0.67  -0.53  N   0.33   Healthy
0.51   -0.09  Y   -0.05  Disease

Supervised learning builds a model (hypothesis) Ŷ = h(A1, A2, A3, A4)

Goal: from the database (learning sample), find a
function h of the inputs that best approximates the
output
Symbolic output → classification problem
Numerical output → regression problem
13
Two main goals
Predictive:
Make predictions for a new sample described by its
attributes
A1     A2     A3  A4     Y
0.83   -0.54  T   0.68   Healthy
-2.3   -1.2   F   -0.83  Disease
0.08   0.63   F   0.76   Healthy
0.06   -0.29  T   -0.57  Disease
-0.98  -0.18  F   -0.38  Healthy
-0.68  0.82   T   -0.95  Disease
0.92   -0.33  F   -0.48  ?

Informative:
Help to understand the relationship between the inputs
and the output, e.g.:
Y = Disease if A3 = F and A2 < 0.3
Find the most relevant inputs
14
Example of applications
Biomedical domain: medical diagnosis, differentiation
of diseases, prediction of the response to a
treatment...
Inputs: gene expression, metabolite concentrations... (one row per patient)

A1     A2     ...  A4     Y
-0.61  0.23   ...  0.49   Healthy
-2.3   -1.2   ...  -0.11  Disease
-0.82  -0.41  ...  0.24   Healthy
-0.74  -0.1   ...  -0.15  Disease
-0.14  0.98   ...  -0.13  Healthy
-0.37  0.27   ...  -0.67  Disease
15
Example of applications
Perceptual tasks: handwritten character recognition,
speech recognition...
Inputs:
a grey intensity in [0,255] for each pixel; each image is
represented by a vector of pixel intensities,
e.g. 32x32 = 1024 dimensions
Output:
10 discrete values, Y = {0,1,2,...,9}
16
Example of applications
Time series prediction: predicting electricity load,
network usage, stock market prices...
17
Outline
Introduction
Supervised Learning
Introduction
Model selection, cross-validation, overfitting
Some supervised learning algorithms
Beyond classification and regression
Other learning protocols/frameworks
18
Illustrative problem
Medical diagnosis from two measurements (e.g., weight
and temperature)

M1    M2    Y
0.52  0.18  Healthy
0.44  0.29  Disease
0.89  0.88  Healthy
0.99  0.37  Disease
...   ...   ...
0.95  0.47  Disease
0.29  0.09  Healthy

[Scatter plot of the learning sample in the (M1, M2) plane]

Goal: find a model that best classifies new cases for
which M1 and M2 are known
19
Learning algorithm
A learning algorithm is defined by:
a family of candidate models (=hypothesis space H)
a quality measure for a model
an optimization strategy
It takes as input a learning sample and outputs a
function h in H of maximum quality
[Figure: a model obtained by supervised learning, shown as a decision boundary in the (G1, G2) plane]
20
Linear model
h(M1,M2) = Disease if w0 + w1·M1 + w2·M2 > 0,
Normal otherwise

[Figure: a linear decision boundary in the (M1, M2) plane]

Learning phase: from the learning sample, find the
best values for w0, w1 and w2
Many alternatives exist even for this simple model (LDA,
perceptron, SVM...)
21
Quadratic model
h(M1,M2) = Disease if w0 + w1·M1 + w2·M2 + w3·M1² + w4·M2² > 0,
Normal otherwise

[Figure: a quadratic decision boundary in the (M1, M2) plane]

Learning phase: from the learning sample, find the
best values for w0, w1, w2, w3 and w4
Many alternatives exist even for this simple model (LDA,
perceptron, SVM...)
22
Artificial neural network
h(M1,M2) = Disease if some very complex function of M1, M2 > 0,
Normal otherwise

[Figure: a highly non-linear decision boundary in the (M1, M2) plane]
Learning phase: from the learning sample, find the
numerous parameters of the very complex function
23
Which model is the best?
[Figures: the linear, quadratic, and neural network decision boundaries on the same data]
Why not choose the model that minimises the error
rate on the learning sample? (also called resubstitution error)
How well are you going to predict future data drawn
from the same distribution? (generalisation error)
24
The test set method
1. Randomly choose 30% of the data to be in a test sample
2. The remainder is a learning sample
3. Learn the model from the learning sample
4. Estimate its future performance on the test sample
25
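As an aside (not in the original deck), the test-set method is a few lines in Python; a minimal sketch, assuming scikit-learn and synthetic placeholder data:

```python
# Test-set method: hold out 30% of the data, learn on the rest,
# estimate future performance on the held-out sample.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))              # placeholder inputs (e.g. M1, M2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # placeholder labels

# 1-2. randomly choose 30% of the data as a test sample
X_learn, X_test, y_learn, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 3. learn the model from the learning sample
model = LogisticRegression().fit(X_learn, y_learn)

# 4. estimate its future performance on the test sample
print("TS error:", 1 - model.score(X_test, y_test))
```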
Which model is the best?
[Figures: the three decision boundaries on the same data, with their errors]
Linear: LS error = 3.4%, TS error = 3.5%
Quadratic: LS error = 1.0%, TS error = 1.5%
Neural net: LS error = 0%, TS error = 3.5%
We say that the neural network overfits the data
Overfitting occurs when the learning algorithm starts
fitting noise.
(by contrast, the linear model underfits the data)
26
The test set method
Upside:
very simple
computationally efficient
Downside:
Wastes data: we get an estimate of the best method to
apply to 30% less data
Very unstable when the database is small (the test
sample choice might just be lucky or unlucky)
27
Leave-one-out Cross Validation
For k = 1 to N:
remove the k-th object from the learning sample
learn the model on the remaining objects
apply the model to get a prediction for the k-th object
Report the proportion of misclassified objects
28
Leave-one-out Cross Validation
Upside:
Does not waste the data (you get an estimate of the
best method to apply to N-1 data)
Downside:
Expensive (need to train N models)
High variance
29
k-fold Cross Validation
Randomly partition the dataset into k subsets (for example 10)
For each subset:
learn the model on the objects that are not in the subset
compute the error rate on the objects in the subset
Report the mean error rate over the k subsets
When k = the number of objects → leave-one-out cross-validation
30
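Written out by hand (a numpy sketch, not from the slides; `model` stands for any object with fit/predict methods), k-fold cross-validation looks like this:

```python
import numpy as np

def kfold_error(model, X, y, k=10, seed=0):
    """Mean misclassification rate over k random folds."""
    idx = np.random.RandomState(seed).permutation(len(X))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)      # objects not in the subset
        model.fit(X[train], y[train])        # learn on them
        # error rate on the objects in the subset
        errors.append(np.mean(model.predict(X[fold]) != y[fold]))
    return np.mean(errors)                   # mean over the k subsets
```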
Which kind of Cross Validation?
Test set: cheap but wastes data and unreliable when data are few
Leave-one-out: doesn't waste data but expensive
k-fold cross-validation: a compromise between the two
Rule of thumb:
a lot of data (>1000): test set validation
small data (100-1000): 10-fold CV
very small data (<100): leave-one-out CV
31
CV-based complexity control
[Figure: error versus model complexity. The LS error decreases steadily with complexity, while the CV error is U-shaped: under-fitting at low complexity, over-fitting at high complexity, and an optimal complexity in between]
32
Complexity
Controlling complexity is called regularization or
smoothing
Complexity can be controlled in several ways:
The size of the hypothesis space: number of candidate
models, range of the parameters...
The performance criterion: learning-set performance
penalized by model complexity, e.g. minimize
Err(LS) + λ C(model)
The optimization algorithm: number of iterations,
nature of the optimization problem (one global optimum
versus several local optima)...
33
CV-based algorithm choice
Step 1: compute 10-fold (or test set or LOO) CV error
for different algorithms
[Bar chart: CV error for Algo 1, Algo 2, Algo 3, Algo 4]
Step 2: whichever algorithm gave best CV score: learn
a new model with all data, and that's the predictive
model
What is the expected error rate of this model?
34
Warning: Intensive use of CV can overfit
If you compare many (complex) models, the
probability that you will find a good one by chance on
your data increases
Solution:
Hold out an additional test set before starting the
analysis (or, better, generate this data afterwards)
Use it to estimate the performance of your final
model
(For small datasets: use two stages of 10-fold CV)
35
A note on performance measures
Example: true classes and the predictions of two models on 15 objects:

#   True class  Model 1    Model 2
1   Negative    Positive   Negative
2   Negative    Negative   Negative
3   Negative    Positive   Positive
4   Negative    Positive   Negative
5   Negative    Negative   Negative
6   Negative    Negative   Negative
7   Negative    Negative   Positive
8   Negative    Negative   Negative
9   Negative    Negative   Negative
10  Positive    Positive   Positive
11  Positive    Positive   Negative
12  Positive    Positive   Positive
13  Positive    Positive   Positive
14  Positive    Negative   Negative
15  Positive    Positive   Negative

Which of these two models is the best?
The choice of an error or quality measure is highly
application dependent.
36
A note on performance measures
The error rate is not the only way to assess a predictive
model
In binary classification, results can be summarized in a
contingency table (aka confusion matrix):

                Predicted p      Predicted n       Total
Actual p        True Positive    False Negative    P
Actual n        False Positive   True Negative     N

Various criteria:
Error rate = (FP+FN)/(N+P)
Accuracy = (TP+TN)/(N+P) = 1 - error rate
Sensitivity = TP/P (aka recall)
Specificity = TN/(TN+FP)
Precision = TP/(TP+FP) (aka PPV)
37
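To make the criteria concrete, here is a small Python sketch (not in the original slides) that computes them for the two models of the 15-example table two slides back; the counts are read off that table:

```python
def metrics(tp, fn, fp, tn):
    p, n = tp + fn, fp + tn                  # actual positives / negatives
    return {
        "error rate":  (fp + fn) / (p + n),
        "accuracy":    (tp + tn) / (p + n),  # = 1 - error rate
        "sensitivity": tp / p,               # aka recall
        "specificity": tn / (tn + fp),
        "precision":   tp / (tp + fp),       # aka PPV
    }

print("Model 1:", metrics(tp=5, fn=1, fp=3, tn=6))  # more sensitive
print("Model 2:", metrics(tp=3, fn=3, fp=2, tn=7))  # more specific
```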
ROC and Precision/recall curves
Each point on the curve corresponds to a particular choice of the
decision threshold
[Figures: an ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity); a precision/recall curve plots precision against recall (sensitivity)]
38
Outline
Introduction
Model selection, cross-validation, overfitting
Some supervised learning algorithms
k-NN
Linear methods
Artificial neural networks
Support vector machines
Decision trees
Ensemble methods
Beyond classification and regression
39
Comparison of learning algorithms
Three main criteria:
Accuracy: measured by the generalization error (estimated by CV)
Efficiency: computing times and scalability for learning and testing
Interpretability: comprehension brought by the model about the input-output relationship
Unfortunately, there is usually a tradeoff between
these criteria
40
1-Nearest Neighbor (1-NN)
(prototype-based method, instance-based learning, non-parametric
method)
One of the simplest learning algorithms:
predict, for a test object, the output associated with the
closest object in the learning sample

#  M1    M2    Y
1  0.32  0.81  Healthy
2  0.15  0.38  Disease
3  0.39  0.34  Healthy
4  0.62  0.11  Disease
5  0.92  0.43  ?

d(5,1) = √((0.32−0.92)² + (0.81−0.43)²) = 0.71
d(5,2) = √((0.15−0.92)² + (0.38−0.43)²) = 0.77
d(5,3) = √((0.39−0.92)² + (0.34−0.43)²) = 0.54
d(5,4) = √((0.62−0.92)² + (0.11−0.43)²) = 0.44

closest = usually of minimal Euclidean distance
41
Obvious extension: k-NN
Find the k nearest neighbors (instead of only the first
one) with respect to the Euclidean distance
Output the most frequent class (classification) or the
average output (regression) among the k neighbors.
42
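A compact numpy sketch of k-NN as just described (not from the deck; it reuses the 1-NN example data from the previous slide as a check):

```python
import numpy as np
from collections import Counter

def knn_predict(X, y, x, k=1):
    dists = np.sqrt(((X - x) ** 2).sum(axis=1))   # Euclidean distances
    nearest = np.argsort(dists)[:k]               # the k closest objects
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

# objects 1-4 of the 1-NN example; object 5 is the test object
X = np.array([[0.32, 0.81], [0.15, 0.38], [0.39, 0.34], [0.62, 0.11]])
y = np.array(["Healthy", "Disease", "Healthy", "Disease"])
print(knn_predict(X, y, np.array([0.92, 0.43]), k=1))  # -> Disease (object 4)
```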
Effect of k on the error
[Figure: error versus k. The LS error increases with k, while the CV error is U-shaped: over-fitting at small k, under-fitting at large k, and an optimal k in between]
43
Small exercise
In this classification problem with two inputs [figure from Andrew Moore]:
What is the resubstitution error (LS error) of 1-NN?
What is the LOO error of 1-NN?
What is the LOO error of 3-NN?
What is the LOO error of 22-NN?
44
k-NN
Advantages:
very simple
can be adapted to any data type by changing the
distance measure
Drawbacks:
choosing a good distance measure is a hard problem
very sensitive to the presence of noisy variables
slow for testing
45
Linear methods
Find a model which is a linear combination of the inputs
Regression: y = w0 + w1·x1 + w2·x2 + ... + wn·xn
Classification: y = c1 if w0 + w1·x1 + ... + wn·xn > 0, c2 otherwise
Several methods exist to find the coefficients w0, w1, ..., corresponding
to different objective functions and optimization algorithms, e.g.:
Regression: least-square regression, ridge regression, partial
least square, support vector regression, LASSO...
Classification: linear discriminant analysis, PLS-discriminant
analysis, support vector machines...
46
Example: ridge regression
Find w that minimizes (λ > 0):
Σi (yi − wᵀxi)² + λ Σj wj²
From simple algebra, the solution is given by:
w* = (XᵀX + λI)⁻¹ Xᵀy
where X is the input matrix and y is the output vector
λ regulates complexity (and avoids problems related to
the singularity of XᵀX)
47
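The closed form above translates directly into numpy (a sketch; `lam` plays the role of λ):

```python
import numpy as np

def ridge(X, y, lam=1.0):
    # w* = (X^T X + lam I)^(-1) X^T y, solved without an explicit inverse
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
```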
Example: perceptron
Find w that minimizes:
Σi (yi − wᵀxi)²
using gradient descent: given a training example (x, y), update
wj ← wj + η (y − wᵀx) xj for all j
Online algorithm, i.e. one that treats every example in turn
(vs. a batch algorithm that treats all examples at once)
Complexity is regulated by the learning rate η and the
number of iterations
Can be adapted to classification
48
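A sketch of the online update above in numpy (not from the deck; `eta` is the learning rate, `n_epochs` the number of passes over the data):

```python
import numpy as np

def perceptron_regression(X, y, eta=0.01, n_epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):            # treat every example in turn
            w += eta * (y_i - w @ x_i) * x_i  # gradient step on (y - w.x)^2
    return w
```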
Linear methods
Advantages:
simple
there exist fast and scalable variants
provide interpretable models through variable weights
(magnitude and sign)
Drawbacks:
often not as accurate as other (non-linear) methods
49
Non-linear extensions
Generalization of linear methods:
y = w0 + w1·φ1(x) + w2·φ2(x) + ... + wn·φn(x)
Any linear method can be applied (but regularization
becomes more important)
Artificial neural networks (with a single hidden layer):
y = g(Σj Wj g(Σi wi,j xi))
where g is a non-linear function (e.g. sigmoid)
(a non-linear function of a linear combination of non-linear
functions of linear combinations of the inputs)
Kernel methods:
y = Σi wi φi(x) = Σj αj k(xj, x)
where k(x,x') = ⟨φ(x), φ(x')⟩ is the dot product in the
feature space and j indexes training examples
50
Artificial neural networks
Supervised learning method initially inspired by the
behaviour of the human brain
Consists of the inter-connection of several small units
Essentially numerical but can handle classification and
discrete inputs with appropriate coding
Introduced in the late 50s, very popular in the 90s
51
Hypothesis space: a single neuron
[Figure: a single neuron computes a weighted sum of its inputs A1, ..., AN plus a bias w0, passed through a tanh activation that saturates at +1 and −1]
Y = tanh(w1·A1 + w2·A2 + ... + wN·AN + w0)
52
Hypothesis space:
multi-layer perceptron
Inter-connection of several neurons (just like in the human
brain)
[Figure: input layer → hidden layer → output layer]
With a sufficient number of neurons and a sufficient
number of layers, a neural network can model any function
of the inputs.
53
Learning
Choose a structure
Tune the value of the parameters (connections
between neurons) so as to minimize the learning
sample error.
Non-linear optimization by the back-propagation
algorithm. In practice, quite slow.
Repeat for different structures
Select the structure that minimizes CV error
54
Illustrative example
[Figures: decision boundaries in the (G1, G2) plane learned with 1 neuron, 2+2 neurons, and 10+10 neurons; the boundary becomes increasingly complex]
55
Artificial neural networks
Advantages:
Universal approximators
May be very accurate (if the method is well used)
Drawbacks:
The learning phase may be very slow
Black-box models, very difficult to interpret
Scalability
56
Support vector machines
Recent (mid-90's) and very successful method
Based on two smart ideas:
large margin classifier
kernelized input space
57
Linear classifier
Where would you place a linear classifier?
58
Margin of a linear classifier
The margin = the
width that the
boundary could be
increased by before
hitting a datapoint.
59
Maximum-margin linear classifier
The linear classifier
with the maximum
margin (= linear SVM)
Support vectors: the
samples closest to the
hyperplane
Intuitively, this feels
safest
Works very well in
practice
60
Mathematically
Linearly separable case: amounts to solving the
following quadratic programming optimization problem:
minimize (1/2)‖w‖²
subject to yi (wᵀxi + b) ≥ 1, i = 1, ..., N
Decision function:
y = 1 if wᵀx + b > 0, y = −1 otherwise
Non-linearly separable case:
minimize (1/2)‖w‖² + C Σi ξi
subject to yi (wᵀxi + b) ≥ 1 − ξi, ξi ≥ 0, i = 1, ..., N
61
Non-linear boundary
What about this problem?
[Figure: a dataset that is not linearly separable in (x1, x2) becomes linearly separable after the mapping (x1, x2) → (x1², x2²)]
Solution:
map the data into a new feature space where the
boundary is linear
find the maximum-margin model in this new space
62
The kernel trick
Intuitively:
You don't need to compute explicitly the mapping
All you need is a (special) similarity measure between
objects (as in k-NN)
This similarity measure is called a kernel
Mathematically:
The maximum-margin classifier in some feature space
can be written only in terms of dot-products in that
feature space:
k(x,x') = ⟨φ(x), φ(x')⟩
63
Mathematically
Primal form of the optimization problem:
minimize (1/2)‖w‖²
subject to yi (⟨w, xi⟩ + b) ≥ 1, i = 1, ..., N
Dual form:
maximize Σi αi − (1/2) Σi,j αi αj yi yj ⟨xi, xj⟩
subject to αi ≥ 0 and Σi αi yi = 0
with w = Σi αi yi xi
Decision function:
y = 1 if ⟨w, x⟩ + b = Σi αi yi k(xi, x) + b > 0
y = −1 otherwise
64
Support vector machines
A learning sample and its kernel matrix:

#  G1    G2     Y
1  0.21  0.64   C1
2  0.57  0.26   C2
3  0.21  0.68   C2
4  0.69  0.52   C1
5  0.83  0.96   C1
6  0.48  -0.52  C2

Kernel matrix (entry (i,j) = k(xi, xj)):

     1      2      3      4      5      6
1    1      0.14   0.96   0.17   0.01   0.24
2    0.14   1      0.02   0.7    0.22   0.67
3    0.96   0.02   1      0.15   0.27   0.07
4    0.17   0.7    0.15   1      0.37   0.55
5    0.01   0.22   0.27   0.37   1      -0.25
6    0.24   0.67   0.07   0.55   -0.25  1

kernel matrix + class labels → SVM algorithm → classification model

The same scheme works for structured objects, e.g. sequences, once a kernel is defined for them:

#  G1           Y
1  ACGCTCTATAG  C1
2  ACTCGCTTAGA  C2
3  GTCTCTGAGAG  C2
4  CGCTAGCGTCG  C1
5  CGATCAGCAGC  C1
6  GCTCGCGCTCG  C2
65
Examples of kernels
Linear kernel:
k(x,x') = ⟨x, x'⟩
Polynomial kernel:
k(x,x') = (⟨x, x'⟩ + 1)^d
(main parameter: d, the maximum degree)
Radial basis function kernel:
k(x,x') = exp(−‖x − x'‖² / (2σ²))
(main parameter: σ, the spread of the distribution)
+ many kernels that have been defined for structured
data types (e.g. texts, graphs, trees, images)
66
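The three kernels written out in numpy (a sketch; `d` and `sigma` are the complexity parameters mentioned above):

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def poly_kernel(x, xp, d=3):
    return (x @ xp + 1) ** d

def rbf_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))
```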
Feature ranking with linear kernel
With a linear kernel, the model looks like:
h(x1, x2, ..., xK) = C1 if w0 + w1·x1 + w2·x2 + ... + wK·xK > 0,
C2 otherwise
The most important variables are those corresponding to
large |wi|
[Bar chart: |w| values sorted over the variables]
67
SVM parameters
Mainly two sets of parameters in SVMs:
Optimization algorithm's parameters:
the parameter C controls the number of training errors versus
the margin (when the learning sample is not linearly separable)
Kernel's parameters:
the choice of a particular kernel
given this choice, usually one complexity parameter,
e.g. the degree d of the polynomial kernel
Again, these parameters can be determined by cross-validation
68
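As an illustration of such cross-validated tuning (assuming scikit-learn, which parameterizes the RBF kernel with gamma = 1/(2σ²); the data here are synthetic placeholders):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)   # a non-linear problem

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X, y)
print(search.best_params_, "CV error:", 1 - search.best_score_)
```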
Support vector machines
Advantages:
State-of-the-art accuracy on many problems
Can handle any data types by changing the kernel
(many applications on sequences, texts, graphs...)
Drawbacks:
Tuning the parameters is crucial to get good results,
and somewhat tricky
Black-box models, not easy to interpret
69
A note on kernel methods
The kernel trick can be applied to any (learning)
algorithm whose solution can be expressed in terms of
dot-products in the original input space
It turns a linear algorithm into a non-linear one
Can work in a very high-dimensional (even infinite-dimensional)
feature space without explicitly computing the features
Decouples the representation stage from the learning
stage: the same learning machine can be applied to a
large range of problems
Examples: ridge regression, perceptron, PCA, k-means...
70
Decision (classification) trees
A learning algorithm that can handle:
Classification problems (binary or multi-valued)
Attributes may be discrete (binary or multi-valued) or
continuous.
Classification trees were invented at least twice:
By statisticians: CART (Breiman et al.)
By the AI community: ID3, C4.5 (Quinlan et al.)
71
Decision trees
A decision tree is a tree where:
Each interior node tests an attribute
Each branch corresponds to an attribute value
Each leaf node is labeled with a class

Example:
A1 = a11 → test A2: a21 → c1, a22 → c2
A1 = a12 → c1
A1 = a13 → test A3: a31 → c2, a32 → c1
72
A simple database: playtennis
Day  Outlook   Temperature  Humidity  Wind    Play Tennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         Normal    Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         High      Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Hot          Normal    Weak    Yes
D10  Rain      Mild         Normal    Strong  Yes
D11  Sunny     Cool         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
73
A decision tree for playtennis
Outlook?
Sunny → Humidity? : High → no, Normal → yes
Overcast → yes
Rain → Wind? : Strong → no, Weak → yes

Should we play tennis on D15?

Day  Outlook  Temperature  Humidity  Wind  Play Tennis
D15  Sunny    Hot          High      Weak  ?
74
Top-down induction of DTs
Choose the best attribute
Split the learning sample accordingly
Proceed recursively until each object is correctly classified

Split on Outlook:

Outlook = Sunny:
Day  Temp.  Humidity  Wind    Play
D1   Hot    High      Weak    No
D2   Hot    High      Strong  No
D8   Mild   High      Weak    No
D9   Hot    Normal    Weak    Yes
D11  Cool   Normal    Strong  Yes

Outlook = Overcast:
Day  Temp.  Humidity  Wind    Play
D3   Hot    High      Weak    Yes
D7   Cool   High      Strong  Yes
D12  Mild   High      Strong  Yes
D13  Hot    Normal    Weak    Yes

Outlook = Rain:
Day  Temp.  Humidity  Wind    Play
D4   Mild   Normal    Weak    Yes
D5   Cool   Normal    Weak    Yes
D6   Cool   Normal    Strong  No
D10  Mild   Normal    Strong  Yes
D14  Mild   High      Strong  No
75
Top-down induction of DTs
Procedure learn_dt(learning sample LS):
If all objects from LS have the same class:
Create a leaf with that class
Else:
Find the best splitting attribute A
Create a test node for this attribute
For each value a of A:
Build LS_a = {o ∈ LS | A(o) = a}
Use learn_dt(LS_a) to grow a subtree from LS_a
76
Which attribute is best?
A1=? splits [29+,35−] into [21+,5−] and [8+,30−]
A2=? splits [29+,35−] into [18+,33−] and [11+,2−]
A score measure is defined to evaluate splits
This score should favor class separation at each step (to
shorten the tree depth)
Common score measures are based on information theory, e.g. the information gain:
I(LS, A) = H(LS) − (|LS_left|/|LS|) H(LS_left) − (|LS_right|/|LS|) H(LS_right)
77
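The entropy-based score above, checked numerically on the [29+,35−] example (a short Python sketch, not from the slides):

```python
import math

def H(pos, neg):
    """Shannon entropy of a two-class sample."""
    total, out = pos + neg, 0.0
    for c in (pos, neg):
        p = c / total
        if p > 0:
            out -= p * math.log2(p)
    return out

def gain(parent, left, right):
    n, nl, nr = sum(parent), sum(left), sum(right)
    return H(*parent) - nl / n * H(*left) - nr / n * H(*right)

print(gain((29, 35), (21, 5), (8, 30)))   # A1: ~0.27
print(gain((29, 35), (18, 33), (11, 2)))  # A2: ~0.12 -> A1 is the better split
```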
Effect of number of nodes on error
[Figure: error versus number of nodes. The LS error decreases as nodes are added, while the CV error is U-shaped: under-fitting for small trees, over-fitting for large trees, and an optimal complexity in between]
78
How can we avoid overfitting?
Pre-pruning: stop growing the tree earlier, before it
reaches the point where it perfectly classifies the
learning sample
Post-pruning: allow the tree to overfit and then post-prune the tree
Ensemble methods (later)
79
Post-pruning
[Figure: the same error-versus-size curves; 1. tree growing moves rightward to larger trees, 2. tree pruning moves back toward the optimal complexity]
80
Numerical variables
Example: temperature as a number instead of a
discrete value
Two solutions:
Pre-discretize: Cold if Temperature<70, Mild between 70 and
75, Hot if Temperature>75
Discretize during tree growing, e.g. a test node
"Temperature ≤ 65.4?" (no / yes),
with the threshold optimized to maximize the score
81
Illustrative example
[Figure: a decision tree grown on the (M1, M2) diagnosis data, with successive tests such as M2 < 0.33?, M1 < 0.91?, M1 < 0.23?, M2 < 0.75?, ... and Healthy/Sick leaves, together with the corresponding axis-parallel partition of the (M1, M2) plane]
82
Regression trees
Trees for regression problems: exactly the same model but
with a number in each leaf instead of a class
Outlook?
Sunny → Humidity? : High → 45.6, Normal → Temperature? : <71 → 1.2, ≥71 → 3.4
Overcast → 22.3
Rain → Wind? : Strong → 64.4, Weak → 7.4
83
Interpretability and attribute selection
Interpretability
Intrinsically, a decision tree is highly interpretable
A tree may be converted into a set of if-then rules.
Attribute selection
If some attributes are not useful for classification,
they will not be selected in the (pruned) tree
Of practical importance, if measuring the value of a
variable is costly (e.g. medical diagnosis)
Decision trees are often used as a pre-processing for
other learning algorithms that suffer more when
there are irrelevant variables
84
Attribute importance
In many applications, not all variables contribute
equally to predicting the output.
Variable importances can be evaluated directly from trees
[Bar chart: importances in decreasing order: Outlook, Humidity, Wind, Temperature]
85
Decision and regression trees
Advantages:
very fast and scalable method (able to handle a very
large number of inputs and objects)
provide directly interpretable models and give an idea of
the relevance of attributes
Drawbacks:
high variance (more on this later)
often not as accurate as other methods
86
Ensemble methods
[Figure: several models vote; individual predictions such as Sick, Healthy, Sick are aggregated (e.g. by voting) into a final prediction]
Combine the predictions of several models built with a learning
algorithm. This often improves accuracy considerably.
Often used in combination with decision trees, for efficiency
reasons
Examples of algorithms: Bagging, Random Forests, Boosting...
87
Bagging: motivation
Different learning samples yield different models,
especially when the learning algorithm overfits the data
[Figures: two different decision boundaries obtained from two learning samples of the same problem]
As there is only one optimal model, this variance is a source
of error
Solution: aggregate several models to obtain a more
stable one
88
Bagging: bootstrap aggregating
Bootstrap sampling (= sampling with replacement) builds
several learning samples from the original one; a model is
learned on each, and their predictions (e.g. Sick, Healthy,
Sick) are aggregated by voting.
Note: the more models, the better.
89
Bootstrap sampling
Sampling with replacement:

Original sample:

#   G1     G2     Y
1   0.74   0.68   Healthy
2   0.78   0.45   Disease
3   0.86   0.09   Healthy
4   0.2    0.61   Disease
5   0.2    -5.6   Healthy
6   0.32   0.6    Disease
7   -0.34  -0.45  Healthy
8   0.89   -0.34  Disease
9   0.1    0.3    Healthy
10  -0.34  -0.65  Healthy

Bootstrap sample (objects 3, 7, 2, 9, 3, 10, 1, 8, 6, 10):

#   G1     G2     Y
3   0.86   0.09   Healthy
7   -0.34  -0.45  Healthy
2   0.78   0.45   Disease
9   0.1    0.3    Healthy
3   0.86   0.09   Healthy
10  -0.34  -0.65  Healthy
1   0.74   0.68   Healthy
8   0.89   -0.34  Disease
6   0.32   0.6    Disease
10  -0.34  -0.65  Healthy

Some objects do not appear, some objects appear
several times
90
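Bootstrap sampling is essentially one line in numpy (a sketch; drawing N indices with replacement reproduces the behaviour above):

```python
import numpy as np

def bootstrap_sample(X, y, seed=None):
    rng = np.random.RandomState(seed)
    idx = rng.randint(0, len(X), size=len(X))   # sampling with replacement
    return X[idx], y[idx]
```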
Boosting
Idea of boosting: combine many weak models to
produce a more powerful one.
Weak model = a model that underfits the data (strictly,
in classification, a model slightly better than random
guessing)
Adaboost:
At each step, Adaboost forces the learning algorithm to focus
on the cases from the learning sample misclassified by the
last model (e.g., by duplicating the misclassified examples
in the learning sample)
The predictions of the models are combined through a
weighted vote. More accurate models have more weight in
the vote.
91
Boosting
[Figure: from the learning sample LS, reweighted samples LS1, LS2, ..., LST are built in sequence; the models' predictions (e.g. Healthy, Sick, ..., Healthy) are combined with weights w1, w2, ..., wT]
92
Interpretability and efficiency
When combined with decision trees, ensemble methods lose
interpretability and efficiency
However:
We can still use the ensemble to compute variable
importances (by averaging them over all trees)
Ensemble methods can be parallelized, and boosting-type
algorithms use smaller trees, so the increase in computing
times is not too detrimental.
93
Example on microarray data
72 patients, 7129 gene expressions, 2 classes of leukemia
(ALL and AML) (Golub et al., Science, 1999)
Leave-one-out error with several variants:

Method                           Error
1 decision tree                  22.2% (16/72)
Random forests (k=85, T=500)     9.7% (7/72)
Extra-trees (sth=0.5, T=500)     5.5% (4/72)
Adaboost (1 test node, T=500)    1.4% (1/72)

[Bar chart: variable importances computed with boosting]
94
Method comparison
Method     Accuracy  Efficiency  Interpretability  Ease of use
kNN        ++        ++          +                 ++
DT         +         +++         +++               +++
Linear     ++        +++         ++                +++
Ensemble   +++       +++         ++                +++
ANN        +++       +           +                 +
SVM        ++++      +           +                 ++
Note:
The relative importance of the criteria depends on
the specific application
These are only general trends. Eg., in terms of
accuracy, no algorithm is always better than all
others.
95
Outline
Introduction
Supervised Learning
Introduction
Model selection, cross-validation, overfitting
Some supervised learning algorithms
Beyond classification and regression
Other learning protocols/frameworks
96
Beyond classification and regression
Not all supervised learning problems can be turned
into standard classification or regression problems
Examples:
Graph prediction
Sequence labeling
Image segmentation
97
Structured output approaches
Decomposition:
Reduce the problem to several simpler classification or
regression problems by decomposing the output
Not always possible and does not take into account
interactions between suboutputs
Kernel output methods
Extend regression methods to handle an output space
endowed with a kernel
This can be done with regression trees or ridge regression
for example
Large margin methods
Use SVM-based approaches to learn a model that directly
scores input-output pairs:
y = argmax_{y'} Σi wi φi(x, y')
98
Outline
Introduction
Supervised learning
Other learning protocols/frameworks
Semi-supervised learning
Transductive learning
Active learning
Reinforcement learning
Unsupervised learning
99
Labeled versus unlabeled data
Unlabeled data = input-output pairs for which the output value is missing
In many settings, unlabeled data is cheap but labeled
data can be hard to get
labels may require human experts
human annotation is expensive, slow, unreliable
labels may require special devices
Examples:
Biomedical domain
Speech analysis
Natural language parsing
Image categorization/segmentation
Network measurement
100
Semi-supervised learning
Goal: exploit both labeled and unlabeled data to build
better models than using each one alone
labeled data, unlabeled data, and test data:

A1     A2     A3  A4     Y
0.01   0.37   T   0.54   Healthy
-2.3   -1.2   F   0.37   Disease
0.69   -0.78  F   0.63   Healthy
-0.56  -0.89  T   -0.42  ?
-0.85  0.62   F   -0.05  ?
-0.17  0.09   T   0.29   ?
-0.09  0.3    F   0.17   ?

(the first three rows are labeled; the rows without Y are the unlabeled and test data)
Why would it improve?
101
Some approaches
Self-training
Iteratively label some unlabeled examples with a model
learned from the previously labeled examples
Semi-supervised SVM (S3VM)
Enumerate all possible labelings of the unlabeled
examples
Learn an SVM for each labeling
Pick the one with the largest margin
102
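A sketch of the self-training loop described above (assuming scikit-learn; the base classifier and the 0.95 confidence threshold are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_lab, y_lab, X_unlab, threshold=0.95, max_iter=10):
    model = LogisticRegression()
    for _ in range(max_iter):
        if len(X_unlab) == 0:
            break
        model.fit(X_lab, y_lab)
        proba = model.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break
        # label the confident examples and move them to the labeled set
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate(
            [y_lab, model.classes_[proba.argmax(axis=1)][confident]])
        X_unlab = X_unlab[~confident]
    return model.fit(X_lab, y_lab)
```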
Some approaches
Graph-based algorithms
Build a graph over the (labeled and unlabeled)
examples (from the inputs)
Learn a model that predicts well labeled examples and
is smooth over the graph
103
Transductive learning
Like supervised learning but we have access to the
test data from the beginning and we want to exploit it
We don't need to output a model; we only need predictions
for the unlabeled data
Simple solution:
Apply semi-supervised learning techniques using the
test data as unlabeled data to get a model
Use the resulting model to make predictions on the test
data
There exist also specific algorithms that avoid building
a model
104
Active learning
Goal:
given unlabeled data, find (adaptively) the examples to
label in order to learn an accurate model
The hope is to reduce the number of labeled instances
with respect to the standard batch SL
Usually, in an online setting:
choose the k best unlabeled examples
determine their labels
update the model and iterate
Algorithms differ in the way the unlabeled examples
are selected
Example: choose the k examples for which the model
predictions are the most uncertain
105
Reinforcement learning
Learning from interactions:
s0 → (a0, r0) → s1 → (a1, r1) → s2 → (a2, r2) → ...
Goal: learn to choose the sequence of actions (= policy) that
maximizes
r0 + γ r1 + γ² r2 + ..., where 0 ≤ γ < 1
106
RL approaches
The system is usually modeled by:
state transition probabilities P(s_{t+1} | s_t, a_t)
reward probabilities P(r_{t+1} | s_t, a_t)
(= Markov Decision Process)
If the model of the dynamics and reward is known → try to
compute the optimal policy by dynamic programming
If the model is unknown:
Model-based approaches first learn a model of the
dynamics and then derive an optimal policy from it (by DP)
Model-free approaches learn a policy directly from
the observed system trajectories
107
Reinforcement versus supervised learning
Batch-mode SL: learn a mapping from input to output from
observed input-output pairs
Batch-mode RL: learn a mapping from state to action from
observed (state,action,reward) triplets
Online active learning: combine SL and (online) selection
of instances to label
Online RL: combine policy learning with control of the
system and generation of the training trajectories
Note:
RL would reduce to SL if the optimal action were known
in each state
SL is used inside RL to model system dynamics and/or
value functions
108
Examples of applications
Robocup Soccer Teams (Stone & Veloso, Riedmiller et al.)
Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis)
Dynamic Channel Assignment, Routing (Singh &
Bertsekas, Nie & Haykin, Boyan & Littman)
Elevator Control (Crites & Barto)
Many Robots: navigation, bi-pedal walking, grasping,
switching between skills...
Games: TD-Gammon and Jellyfish (Tesauro, Dahl)
109
Unsupervised learning
Unsupervised learning tries to find any regularities in
the data without guidance about inputs and outputs
[Table: a dataset of samples described by many variables A1, ..., A19, with no designated output]

Are there interesting groups of variables or samples?
Outliers? What are the dependencies between the
variables?
110
Unsupervised learning methods
Many families of problems exist, among which:
Clustering: try to find natural groups of
samples/variables (e.g. k-means, hierarchical clustering)
Dimensionality reduction: project the data from a
high-dimensional space down to a small number of
dimensions (e.g. principal/independent component analysis, MDS)
Density estimation: determine the distribution of data
within the input space (e.g. Bayesian networks, mixture models)
111
Clustering
Goal: grouping a collection of objects into subsets or
clusters, such that those within each cluster are more
closely related to one another than objects assigned to
different clusters
112
Clustering
Clustering rows: grouping objects that are similar across variables
Clustering columns: grouping variables that are similar across samples
Bi-clustering/two-way clustering: grouping objects that are
similar across a subset of variables (a bi-cluster)
113
Clustering
Two essential components of cluster analysis:
Distance measure: A notion of distance or similarity of
two objects: When are two objects close to each other?
Cluster algorithm: A procedure to minimize distances
of objects within groups and/or maximize distances
between groups
114
Examples of distance measures
Euclidean distance: measures the average difference across coordinates
Manhattan distance: measures the average difference across
coordinates, in a robust way
Correlation distance: measures the difference with respect to trends
115
Clustering algorithms
Popular algorithms for clustering
hierarchical clustering
K-means
SOMs (Self-Organizing Maps)
autoclass, mixture models...
Hierarchical clustering allows the choice of the
dissimilarity matrix.
k-Means and SOMs take original data directly as input.
Attributes are assumed to live in Euclidean space.
116
Hierarchical clustering
Agglomerative clustering:
1. Each object is assigned to its own cluster
2. Iteratively:
the two most similar clusters are joined and
replaced by a new one
the distance matrix is updated with this new cluster
replacing the two joined clusters
(divisive clustering would start from a big cluster)
117
Distance between two clusters
Single linkage uses the smallest distance
Complete linkage uses the largest distance
Average linkage uses the average distance
118
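A sketch of agglomerative clustering with these three linkage rules, using scipy (synthetic placeholder data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.RandomState(0).normal(size=(20, 2))
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # merge history
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
    print(method, labels)
```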
Hierarchical clustering
[Figure: example of hierarchical clustering and the associated dendrogram (Wikipedia)]
119
Dendrogram
Hierarchical clusterings are visualized through dendrograms:
Clusters that are joined are combined by a line
The height of a line is the distance between the two clusters it joins
Can be used to determine visually the number of
clusters
120
Hierarchical clustering
Strengths
No need to assume any particular number of clusters
Can use any distance matrix
Sometimes finds a meaningful taxonomy
Limitations
Finds a taxonomy even if none exists
Once a decision is made to combine two clusters, it
cannot be undone
Not well motivated theoretically
121
k-Means clustering
Partitioning algorithm with a prefixed number k of
clusters
Uses the Euclidean distance between objects
Tries to minimize the sum of intra-cluster variances:
Σ_{j=1..k} Σ_{o ∈ Cluster_j} d²(o, c_j)
where c_j is the center of cluster j and d is the Euclidean
distance
122
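A compact numpy sketch of the k-means iteration (Lloyd's algorithm) minimizing the criterion above; note this sketch does not guard against empty clusters:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # initial centers
    for _ in range(n_iter):
        # assign each object to its closest center (Euclidean distance)
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        # move each center to the mean of its cluster
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels
```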
k-Means clustering
[Figures across three slides (123-125): objects are assigned to the closest center, centers are recomputed, and the process repeats until convergence]
k-Means clustering
Strengths
Simple, understandable
Can cluster any new point (unlike hierarchical
clustering)
Well motivated theoretically
Limitations
Must fix the number of clusters beforehand
Sensitive to the initial choice of cluster centers
Sensitive to outliers
126
Suboptimal clustering
You could obtain any of these from a random start of
k-means
Solution: restart the algorithm several times
127
Principal Component Analysis
An exploratory technique used to reduce the
dimensionality of the data set to a smaller space (2D,
3D)
[Table: samples described by ten variables A1, ..., A10 are mapped to two principal components PC1, PC2]

Transform some large number of variables into a
smaller number of uncorrelated variables called
principal components (PCs)
128
Objectives of PCA
Reduce dimensionality (pre-processing for other
methods)
Choose the most useful (informative) variables
Compress the data
Visualize multidimensional data
to identify groups of objects
to identify outliers
129
Basic idea
Goal: map data points into a few dimensions while
trying to preserve the variance of the data as much as
possible
[Figure: a point cloud with its first and second principal components drawn as orthogonal axes]
130
Each component is a linear combination of
the original variables
[Table: the original variables A1, ..., A10 and the scores of each sample on PC1 and PC2]

PC1 = 0.2·A1 + 3.4·A2 − 4.5·A3 + ...    VAR(PC1) = 4.5 → 45%
PC2 = 0.4·A4 + 5.6·A5 + 2.3·A7 + ...    VAR(PC2) = 3.3 → 33%

The loading of a variable gives an idea of its importance in
the component, and can be used for selecting biomarkers
The scores give the coordinates of each sample on each PC
For each component, we have a measure of the
percentage of the variance of the initial data that it
contains
131
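A numpy sketch matching this description: loadings are eigenvectors of the covariance matrix, scores are projections of the centered data, and eigenvalues give the variance percentages:

```python
import numpy as np

def pca(X, n_components=2):
    Xc = X - X.mean(axis=0)                    # center the variables
    eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigval)[::-1][:n_components]
    loadings = eigvec[:, order]                # weights of the variables
    scores = Xc @ loadings                     # coordinates of the samples
    explained = eigval[order] / eigval.sum()   # fraction of variance per PC
    return scores, loadings, explained
```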
Books
Reference textbooks for the course
The elements of statistical learning: data mining, inference, and
prediction. T. Hastie et al, Springer, 2001
Pattern Recognition and Machine Learning (Information Science and
Statistics). C.M. Bishop, Springer, 2006
Other textbooks
Pattern classification (2nd edition). R.Duda, P.Hart, D.Stork, Wiley
Interscience, 2000
Introduction to Machine Learning. Ethem Alpaydin, MIT Press, 2004.
Machine Learning. Tom Mitchell, McGraw Hill, 1997.
132
Books
More advanced topics
Kernel Methods for Pattern Analysis. J. Shawe-Taylor and N. Cristianini.
Cambridge University Press, 2004
Reinforcement Learning: An Introduction. R.S. Sutton and A.G. Barto.
MIT Press, 1998
Neuro-Dynamic Programming. D.P Bertsekas and J.N. Tsitsiklis. Athena
Scientific, 1996
Semi-supervised learning. Chapelle et al., MIT Press, 2006
Predicting structured data. G. Bakir et al., MIT Press, 2007
133
Software
Pepito
www.pepite.be
Free for academic research and education
WEKA
http://www.cs.waikato.ac.nz/ml/weka/
Many R and Matlab packages
http://www.kyb.mpg.de/bs/people/spider/
http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html
134
Journals
Journal of Machine Learning Research
Machine Learning
IEEE Transactions on Pattern Analysis and Machine
Intelligence
Journal of Artificial Intelligence Research
Neural Computation
Annals of Statistics
IEEE Transactions on Neural Networks
Data Mining and Knowledge Discovery
...
135
Conferences
International Conference on Machine Learning (ICML)
European Conference on Machine Learning (ECML)
Neural Information Processing Systems (NIPS)
Uncertainty in Artificial Intelligence (UAI)
International Joint Conference on Artificial Intelligence
(IJCAI)
International Conference on Artificial Neural Networks
(ICANN)
Computational Learning Theory (COLT)
Knowledge Discovery and Data mining (KDD)
...
136