CS 509: Pattern Recognition
Non-parametric Methods
Dr. Mohammed Ayoub Alaoui Mhamdi
Bishop's University
Sherbrooke, Qc, Canada
malaoui@ubishops.ca
Introduction
Density estimation with parametric models assumes that
the forms of the underlying density functions are known.
However, common parametric forms do not always fit the
densities actually encountered in practice.
In addition, most of the classical parametric densities are
unimodal, whereas many practical problems involve
multimodal densities.
Non-parametric methods can be used with arbitrary
distributions and without the assumption that the forms of
the underlying densities are known.
Histograms.
Kernel Density Estimation / Parzen Windows.
k-Nearest Neighbor Density Estimation.
Real Example in Figure-Ground Segmentation
Histograms
Histogram Density Representation
Consider a single continuous variable $x$ and suppose we have
a set $\mathcal{D}$ of $n$ observations $\{x_1, \dots, x_n\}$. Our goal is to model $p(x)$ from $\mathcal{D}$.
Standard histograms simply partition $x$ into distinct bins of
width $\Delta_i$ and then count the number $n_i$ of observations falling
into bin $i$.
To turn this count into a normalized probability density, we
simply divide by the total number of observations $n$ and by the
width $\Delta_i$ of the bins.
This gives us:
$$p_i = \frac{n_i}{n \Delta_i}$$
Hence the model for the density $p(x)$ is constant over the
width of each bin. (And often the bins are chosen to have the
same width $\Delta_i = \Delta$.)
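A minimal sketch of the equal-width histogram estimate, $p_i = n_i / (n \Delta)$, in plain Python. The function name `histogram_density` and the range arguments `lo` and `hi` are illustrative choices, not part of the lecture.

```python
def histogram_density(samples, lo, hi, num_bins):
    """Return (edges, densities) where densities[i] = n_i / (n * width)."""
    n = len(samples)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in samples:
        if lo <= x <= hi:
            # Clamp so that x == hi is counted in the last bin.
            i = min(int((x - lo) / width), num_bins - 1)
            counts[i] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    densities = [c / (n * width) for c in counts]
    return edges, densities


samples = [0.1, 0.2, 0.25, 0.6, 0.9]
edges, dens = histogram_density(samples, 0.0, 1.0, 4)
print(dens)
```

When every sample lies inside `[lo, hi]`, the bin densities times the bin width sum to one, so the estimate is a proper density.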
Histogram Density as a Function of Bin Width
The green curve is the underlying true density $p(x)$ from
which the samples were drawn. It is a mixture of two
Gaussians.
When $\Delta$ is very small (top), the resulting density is quite
spiky and hallucinates a lot of structure that is not present
in $p(x)$.
When $\Delta$ is very big (bottom), the resulting density is quite
smooth and consequently fails to capture the bimodality of $p(x)$.
It appears that the best results are obtained for some
intermediate value of $\Delta$, which is given in the middle figure.
In principle, a histogram density model is also dependent on
the choice of the edge location of each bin.
Analyzing the Histogram Density
What are the advantages and disadvantages of the
histogram density estimator?
Advantages:
Simple to evaluate and simple to use.
One can throw away $\mathcal{D}$ once the histogram is computed.
Can be computed sequentially if data continues to come in.
Disadvantages:
The estimated density has discontinuities due to the bin
edges rather than any property of the underlying density.
Scales poorly (curse of dimensionality): we would have $M^d$ bins
if we divided each variable in a $d$-dimensional space into
$M$ bins.
What can we learn from Histogram Density
Estimation?
Lesson 1: To estimate the probability density at a particular
location, we should consider the data points that lie within
some local neighborhood of that point.
This requires we define some distance measure.
There is a natural smoothness parameter describing the spatial
extent of the regions (this was the bin width for the
histograms).
Lesson 2: The value of the smoothing parameter should
be neither too large nor too small in order to obtain good
results.
With these two lessons in mind, we proceed to kernel
density estimation and nearest neighbor density estimation,
two closely related methods for density estimation.
The Space-Averaged / Smoothed Density
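Consider again samples $x$ from underlying density
$p(x)$.
Let $\mathcal{R}$ denote a small region containing $x$.
The probability mass associated with $\mathcal{R}$ is given by
$$P = \int_{\mathcal{R}} p(x')\, dx'$$
Suppose we have $n$ samples $x \in \mathcal{D}$. The probability of each
sample falling into $\mathcal{R}$ is $P$.
How will the total number $k$ of points falling into $\mathcal{R}$ be
distributed?
This will be a binomial distribution:
$$P_k = \binom{n}{k} P^k (1 - P)^{n-k}$$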
The expected value for $k$ is thus
$$E[k] = nP$$
The binomial for $k$ peaks very sharply about the mean.
So, we expect $k/n$ to be a very good estimate for the
probability $P$ (and thus for the space-averaged density).
This estimate is increasingly accurate as $n$ increases.
Assuming continuous $p(x)$ and that $\mathcal{R}$ is so small that $p(x)$ does
not appreciably vary within it, we can write:
$$\int_{\mathcal{R}} p(x')\, dx' \simeq p(x) V$$
where $x$ is a point within $\mathcal{R}$ and $V$ is the volume enclosed by $\mathcal{R}$.
After some rearranging, we get the following estimate
for $p(x)$:
$$p(x) \simeq \frac{k}{nV}$$
Example
Simulated example: estimating the density at $x = 0.5$ for
an underlying zero-mean, unit-variance Gaussian.
Varied the volume used to estimate the density.
Red=1000, Green=2000, Blue=3000, Yellow=4000,
Black=5000.
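A rough sketch of this kind of simulation, assuming the region is an interval of length $V$ centered at the query point and the estimate is $k / (nV)$. The helper name `region_estimate`, the seed, and the particular volumes and sample counts are illustrative assumptions and do not reproduce the figure.

```python
import math
import random


def region_estimate(samples, x, volume):
    """Estimate p(x) as k / (n V), with the interval of length V centered at x."""
    half = volume / 2.0
    k = sum(1 for s in samples if abs(s - x) <= half)
    return k / (len(samples) * volume)


random.seed(0)
# True N(0, 1) density at x = 0.5, used only for comparison.
true_value = math.exp(-0.5 * 0.5 ** 2) / math.sqrt(2.0 * math.pi)
for n in (1000, 5000):
    samples = [random.gauss(0.0, 1.0) for _ in range(n)]
    for volume in (0.01, 0.1, 1.0):
        est = region_estimate(samples, 0.5, volume)
        print(f"n={n} V={volume}: estimate={est:.3f} (true {true_value:.3f})")
```

Very small volumes tend to give noisy estimates because few points fall inside the region, while large volumes average the density over too wide an interval.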
Practical Concerns
The validity of our estimate depends on two contradictory
assumptions:
1. The region must be sufficiently small that the density is
approximately constant over the region.
2. The region must be sufficiently large that the number of
points falling inside it is sufficient to yield a sharply peaked
binomial.
Another way of looking at it is to fix the volume $V$ and increase
the number of training samples. Then, the ratio $k/n$ will
converge as desired. But this will only yield an estimate of
the space-averaged density ($P/V$).
We want $p(x)$, so we need to let $V$ approach 0. However, with
a fixed $n$, $\mathcal{R}$ will become so small that no points will fall into it
and our estimate would be useless: $p(x) \simeq 0$.
Note that in practice, we cannot let $V$ become arbitrarily
small because the number of samples is always limited.
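How can we skirt these limitations when an unlimited
number of samples is available?
To estimate the density at $x$, form a sequence of regions $\mathcal{R}_1, \mathcal{R}_2, \dots$
containing $x$, with $\mathcal{R}_1$ having 1 sample ($n = 1$), $\mathcal{R}_2$ having 2 samples ($n = 2$),
and so on.
Let $V_n$ be the volume of $\mathcal{R}_n$, $k_n$ be the number of samples falling in
$\mathcal{R}_n$, and $p_n(x)$ be the $n$th estimate for $p(x)$:
$$p_n(x) = \frac{k_n}{n V_n} \qquad (*)$$
If $p_n(x)$ is to converge to $p(x)$, we need the following three
conditions:
$$\lim_{n \to \infty} V_n = 0, \qquad \lim_{n \to \infty} k_n = \infty, \qquad \lim_{n \to \infty} \frac{k_n}{n} = 0$$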
$\lim_{n \to \infty} V_n = 0$ ensures that our space-averaged density $P/V$ will converge to $p(x)$.
$\lim_{n \to \infty} k_n = \infty$ basically ensures that the frequency ratio will converge to
the probability $P$ (the binomial will be sufficiently peaked).
$\lim_{n \to \infty} k_n / n = 0$ is required for $p_n(x)$ to converge at all. It also says that
although a huge number of samples will fall within the
region $\mathcal{R}_n$, they will form a negligibly small fraction of the
total number of samples.
There are two common ways of obtaining regions that
satisfy these conditions:
1. Shrink an initial region by specifying the volume $V_n$ as
some function of $n$ such as $V_n = 1/\sqrt{n}$. Then, we need to show that
$p_n(x)$ converges to $p(x)$. (This is like the Parzen window we’ll
talk about next.)
2. Specify $k_n$ as some function of $n$ such as $k_n = \sqrt{n}$. Then, we grow
the volume $V_n$ until it encloses $k_n$ neighbors of $x$. (This is the
k-nearest-neighbor method.)
Both of these methods converge...
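The two schedules can be tabulated to see how they behave as $n$ grows. This is only a small numerical illustration of $V_n = V_1/\sqrt{n}$ and $k_n = k_1\sqrt{n}$; the function names and the scale parameters `v1` and `k1` are hypothetical.

```python
import math


def parzen_volume(n, v1=1.0):
    """Parzen-style schedule: the volume shrinks as V_n = v1 / sqrt(n)."""
    return v1 / math.sqrt(n)


def knn_count(n, k1=1.0):
    """kNN-style schedule: the neighbor count grows as k_n = k1 * sqrt(n)."""
    return max(1, round(k1 * math.sqrt(n)))


for n in (1, 100, 10_000, 1_000_000):
    k_n = knn_count(n)
    print(f"n={n:>9}  V_n={parzen_volume(n):.4f}  k_n={k_n:>5}  k_n/n={k_n / n:.6f}")
```

In the printed table $V_n$ and $k_n/n$ shrink toward zero while $k_n$ keeps growing, which is the behavior the three conditions ask for.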
Parzen Windows
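Let’s temporarily assume the region $\mathcal{R}$ is a $d$-dimensional
hypercube with $h_n$ being the length of an edge.
The volume of the hypercube is given by
$$V_n = h_n^d$$
We can derive an analytic expression for $k_n$:
Define a windowing function:
$$\varphi(u) = \begin{cases} 1 & |u_j| \le 1/2, \quad j = 1, \dots, d \\ 0 & \text{otherwise} \end{cases}$$
This windowing function $\varphi$ defines a unit hypercube centered
at the origin.
Hence, $\varphi((x - x_i)/h_n)$ is equal to unity if $x_i$ falls within the hypercube of
volume $V_n$ centered at $x$, and is zero otherwise.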
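The number of samples in this hypercube is therefore
given by
$$k_n = \sum_{i=1}^{n} \varphi\left(\frac{x - x_i}{h_n}\right)$$
Substituting in equation (*) yields the estimate
$$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n} \varphi\left(\frac{x - x_i}{h_n}\right)$$
Hence, the windowing function $\varphi$, in this context called
a Parzen window, tells us how to weight all of the
samples in $\mathcal{D}$ to determine $p(x)$ at a particular $x$.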
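A minimal sketch of the hypercube Parzen estimate, counting the samples inside a cube of edge `h` centered at the query point and dividing by $n V_n$ with $V_n = h^d$. The function names and the toy 2-D data are assumptions made for illustration.

```python
def hypercube_window(u):
    """phi(u): 1 inside the unit hypercube centered at the origin, else 0."""
    return 1.0 if all(abs(uj) <= 0.5 for uj in u) else 0.0


def parzen_estimate(x, samples, h):
    """p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i) / h), with V_n = h**d."""
    d = len(x)
    volume = h ** d
    k = sum(
        hypercube_window([(xj - sj) / h for xj, sj in zip(x, s)])
        for s in samples
    )
    return k / (len(samples) * volume)


samples = [(0.0, 0.0), (0.1, -0.1), (1.0, 1.0), (0.4, 0.4)]
print(parzen_estimate((0.0, 0.0), samples, h=0.5))
```

Here two of the four samples fall inside the cube of edge 0.5 around the origin, so the estimate is $2 / (4 \cdot 0.25)$.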
Example
But, what undesirable traits from histograms are inherited
by Parzen window density estimates of the form we’ve
just defined?
Discontinuities...
Dependence on the bandwidth.
Generalizing the Kernel Function
What if we allow a more general class of windowing
functions rather than the hypercube?
If we think of the windowing function as an interpolator,
rather than considering the window function about $x$ only, we
can visualize it as a kernel sitting on each data sample $x_i$ in $\mathcal{D}$.
And, if we require the following two conditions on the
kernel function $\varphi$, then we can be assured that the resulting
density will be proper: non-negative and integrating to 1.
$$\varphi(u) \ge 0, \qquad \int \varphi(u)\, du = 1$$
For our previous case of $V_n = h_n^d$, it then follows that $p_n(x)$ will also satisfy
these conditions.
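Example: A Univariate Gaussian Kernel
A popular choice of the kernel is the Gaussian kernel:
$$\varphi(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right)$$
The resulting density is given by:
$$p(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_n \sqrt{2\pi}} \exp\left(-\frac{(x - x_i)^2}{2 h_n^2}\right)$$
It will give us smoother estimates without the
discontinuities from the hypercube kernel.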
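A short sketch of the univariate Gaussian-kernel estimate, where each sample contributes a Gaussian bump of width `h` and the bumps are averaged. The function name `gaussian_kde` and the sample values are illustrative only.

```python
import math


def gaussian_kde(x, samples, h):
    """Average of Gaussian kernels of width h centered on each sample."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples)


samples = [-1.0, 0.0, 2.0]
for x in (-1.0, 0.5, 2.0):
    print(f"p({x}) ~ {gaussian_kde(x, samples, h=0.5):.4f}")
```

Because every kernel is smooth, non-negative, and integrates to one, so is the resulting estimate for any `h > 0`.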
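Effect of the Window Width
An important question is what effect does the window
width $h_n$ have on $p_n(x)$?
Define $\delta_n(x)$ as
$$\delta_n(x) = \frac{1}{V_n} \varphi\left(\frac{x}{h_n}\right)$$
and rewrite $p_n(x)$ as the average
$$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \delta_n(x - x_i)$$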
$h_n$ clearly affects both the amplitude and the width of $\delta_n(x)$.
Effect of Window Width (And, hence, Volume $V_n$)
But, for any value of $h_n$, the distribution is normalized:
$$\int \delta_n(x - x_i)\, dx = \int \frac{1}{V_n} \varphi\left(\frac{x - x_i}{h_n}\right) dx = \int \varphi(u)\, du = 1$$
If $V_n$ is too large, the estimate will suffer from too little
resolution.
If $V_n$ is too small, the estimate will suffer from too much
variability.
In theory (with an unlimited number of samples), we can
let $V_n$ slowly approach zero as $n$ increases and then $p_n(x)$ will
converge to the unknown $p(x)$. But, in practice, we can, at
best, seek some compromise.
Example: Revisiting the Univariate Gaussian Kernel
Example: A Bimodal Distribution
Parzen Window-Based Classifiers
Estimate the densities for each category.
Classify a query point by the label corresponding to the
maximum posterior (i.e., one can include priors).
As you guessed it, the decision regions for a Parzen
window-based classifier depend upon the kernel
function.
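One way to sketch such a classifier: estimate each class-conditional density with a Gaussian-kernel Parzen estimate and pick the label that maximizes prior times density. The names `classify`, `class_samples`, and `priors`, the shared bandwidth `h`, and the toy 1-D data are all assumptions for illustration.

```python
import math


def gaussian_kde(x, samples, h):
    """Univariate Parzen estimate with a Gaussian kernel of width h."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples)


def classify(x, class_samples, priors, h):
    """Return the label maximizing prior * estimated class-conditional density."""
    scores = {
        label: priors[label] * gaussian_kde(x, points, h)
        for label, points in class_samples.items()
    }
    return max(scores, key=scores.get)


class_samples = {"a": [-2.1, -1.9, -2.0, -1.5], "b": [1.8, 2.2, 2.0, 2.6]}
priors = {"a": 0.5, "b": 0.5}
print(classify(-1.0, class_samples, priors, h=0.5))
print(classify(1.5, class_samples, priors, h=0.5))
```

The evidence $p(x)$ is the same for every class, so comparing prior times density is enough to find the maximum posterior.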
During training, we can make the error arbitrarily low by
making the window sufficiently small, but this will have an
ill effect during testing (which is our ultimate need).
Can you think of any systematic rules for choosing the
kernel?
One possibility is to use cross-validation. Break up the data
into a training set and a validation set. Then, perform
training on the training set with varying bandwidths. Select
the bandwidth that minimizes the error on the validation
set.
There is little theoretical justification for choosing one
window width over another.
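The hold-out procedure described above can be sketched as follows, assuming a 1-D Gaussian-kernel Parzen classifier with equal priors. The function names, the toy data, and the candidate bandwidths are hypothetical.

```python
import math


def kde(x, samples, h):
    """Univariate Parzen estimate with a Gaussian kernel of width h."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)


def predict(x, train, h):
    """Pick the class with the largest estimated density (equal priors assumed)."""
    return max(train, key=lambda label: kde(x, train[label], h))


def validation_error(train, validation, h):
    """Fraction of validation points that are misclassified with bandwidth h."""
    wrong = sum(1 for x, label in validation if predict(x, train, h) != label)
    return wrong / len(validation)


def select_bandwidth(train, validation, candidates):
    """Return the candidate bandwidth with the lowest validation error."""
    return min(candidates, key=lambda h: validation_error(train, validation, h))


train = {"a": [-2.0, -1.2, -0.6, 0.4], "b": [-0.3, 0.8, 1.5, 2.2]}
validation = [(-1.5, "a"), (-0.8, "a"), (0.0, "a"), (0.5, "b"), (1.0, "b"), (1.9, "b")]
candidates = [0.05, 0.2, 0.5, 1.0, 2.0]
best = select_bandwidth(train, validation, candidates)
print(best, validation_error(train, validation, best))
```

With ties, `min` keeps the first candidate in the list, so the order of `candidates` acts as a tie-breaker in this sketch.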
$k_n$-Nearest Neighbor Methods
Selecting the best window / bandwidth is a severe limiting
factor for Parzen window estimators.
$k_n$-nearest-neighbor methods circumvent this problem by making the window
size a function of the actual training data.
The basic idea here is to center our window around $x$ and
let it grow until it captures $k_n$ samples, where $k_n$ is a function
of $n$.
These samples are the $k_n$ nearest neighbors of $x$.
If the density is high near $x$ then the window will be relatively
small leading to good resolution.
If the density is low near $x$, the window will grow large, but it
will stop soon after it enters regions of higher density.
In either case, we estimate $p_n(x)$ according to
$$p_n(x) = \frac{k_n}{n V_n}$$
We want $k_n$ to go to infinity as $n$ goes to infinity thereby
assuring us that $k_n/n$ will be a good estimate of the
probability that a point will fall in the window of volume
$V_n$.
But, we also want $k_n$ to grow sufficiently slowly so that the
size of our window will go to zero.
Thus, we want $k_n/n$ to go to zero.
Recall these conditions from the earlier discussion; these
will ensure that $p_n(x)$ converges to $p(x)$ as $n$ approaches infinity.
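A minimal 1-D sketch of the nearest-neighbor estimate $k / (nV)$, where the window is the interval reaching out to the $k$-th nearest sample, so its length is twice that distance. The function name `knn_density` and the handling of a zero-length window are assumptions of this sketch.

```python
import math


def knn_density(x, samples, k):
    """Estimate p(x) as k / (n V), with V the interval reaching the k-th neighbor."""
    distances = sorted(abs(x - s) for s in samples)
    radius = distances[k - 1]
    volume = 2.0 * radius  # length of the interval [x - radius, x + radius]
    if volume == 0.0:
        # x coincides with its k-th neighbor; the estimate is unbounded.
        return math.inf
    return k / (len(samples) * volume)


samples = [0.0, 1.0, 2.0, 3.0]
k = round(math.sqrt(len(samples)))  # k_n = sqrt(n), here 2
print(knn_density(1.5, samples, k))
```

For the query point 1.5 the second nearest sample is at distance 0.5, so the window has length 1 and the estimate is $2 / (4 \cdot 1)$.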
Examples of $k_n$-NN Estimation
Notice the discontinuities in the slopes of the estimate.
$k_n$-NN Estimation From 1 Sample
We don’t expect the density estimate from 1 sample to
be very good, but in the case of $k_n$-NN it will diverge!
With $n = 1$ and $k_n = \sqrt{n} = 1$, the estimate for $p_n(x)$ is
$$p_n(x) = \frac{1}{2 |x - x_1|}$$
But, as we increase the number of samples, the estimate will
improve.
Limitations
The $k_n$-NN estimator suffers from a flaw analogous to the one from which
the Parzen window methods suffer.
What is it? How do we specify $k_n$?
We saw earlier that the specification of $k_n$ can lead to
radically different density estimates (in practical situations
where the number of training samples is limited).
One could obtain a sequence of estimates by taking $k_n = k_1 \sqrt{n}$ and
choosing different values of $k_1$.
But, like the Parzen window size, one choice is as good as
another absent any additional information.
Similarly, in classification scenarios, we can base our
judgement on classification error.
Posterior Estimation for Classification
We can directly apply the $k$-NN methods to estimate the
posterior probabilities $P(\omega_i | x)$ from a set of $n$ labeled samples.
Place a window of volume $V$ around $x$ and capture $k$
samples, with $k_i$ turning out to be of label $\omega_i$.
The estimate for the joint probability is thus
$$p_n(x, \omega_i) = \frac{k_i}{nV}$$
A reasonable estimate for the posterior is thus
$$P_n(\omega_i | x) = \frac{p_n(x, \omega_i)}{\sum_{c} p_n(x, \omega_c)} = \frac{k_i}{k}$$
Hence, the posterior probability for $\omega_i$ is simply the
fraction of samples within the window that are
labeled $\omega_i$. This is a simple and intuitive result.
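The $k_i / k$ rule can be sketched in a few lines: take the $k$ nearest labeled samples and report the fraction carrying each label. The function name `knn_posteriors` and the 1-D toy data are illustrative assumptions.

```python
from collections import Counter


def knn_posteriors(x, labeled_samples, k):
    """Return {label: k_i / k} over the k samples nearest to x."""
    nearest = sorted(labeled_samples, key=lambda item: abs(item[0] - x))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: count / k for label, count in counts.items()}


data = [(0.0, "a"), (0.2, "a"), (0.9, "b"), (1.1, "b"), (1.3, "b")]
posteriors = knn_posteriors(0.5, data, k=3)
print(posteriors)
print(max(posteriors, key=posteriors.get))
```

The three nearest samples to 0.5 are at 0.2, 0.9, and 0.0, so label "a" receives 2/3 and label "b" receives 1/3.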
Example: Figure-Ground Discrimination
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.
Figure-ground discrimination is an important low-level
vision task.
Want to separate the pixels that contain some
foreground object (specified in some meaningful way)
from the background.
This paper presents a method for figure-ground
discrimination based on non-parametric densities for
the foreground and background.
They use a subset of the pixels from each of the two
regions. They propose an algorithm called iterative
sampling-expectation for performing the actual
segmentation.
The required input is simply a region of interest
(mostly) containing the object.
We are given a set of samples $S = \{x_i\}_{i=1}^{N}$ where each $x_i$ is a $d$-dimensional
vector.
We know the kernel density estimate is defined as
$$\hat{p}(y) = \frac{1}{N \sigma_1 \cdots \sigma_d} \sum_{i=1}^{N} \prod_{j=1}^{d} \varphi\left(\frac{y_j - x_{ij}}{\sigma_j}\right)$$
where the same kernel $\varphi$ with different bandwidth $\sigma_j$ is
used in each dimension.
The Representation
The representation used here is a function of RGB:
$$r = \frac{R}{R + G + B}, \qquad g = \frac{G}{R + G + B}, \qquad s = \frac{R + G + B}{3}$$
Separating the chromaticity from the brightness allows them
to use a wider bandwidth in the brightness dimension to
account for variability due to shading effects.
And, much narrower kernels can be used on the $r$ and $g$
chromaticity channels to enable better discrimination.
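The Color Density
Given a sample of $N$ pixels $S = \{x_i = (r_i, g_i, s_i)\}$, the color density estimate is
given by
$$\hat{P}(x = (r, g, s)) = \frac{1}{N} \sum_{i=1}^{N} K_{\sigma_r}(r - r_i)\, K_{\sigma_g}(g - g_i)\, K_{\sigma_s}(s - s_i)$$
where we have simplified the kernel definition:
$$K_{\sigma}(t) = \frac{1}{\sigma} \varphi\left(\frac{t}{\sigma}\right)$$
They use Gaussian kernels
$$K_{\sigma}(t) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left(-\frac{1}{2} \left(\frac{t}{\sigma}\right)^2\right)$$
with a different bandwidth in each dimension.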
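A hedged sketch of a product-kernel color density with one Gaussian bandwidth per channel. It assumes the chromaticity and brightness representation $r = R/(R+G+B)$, $g = G/(R+G+B)$, $s = (R+G+B)/3$; the function names, the treatment of black pixels, and the bandwidth values are illustrative and are not taken from the paper.

```python
import math


def to_rgs(R, G, B):
    """Map RGB to (r, g, s): two chromaticity channels and a brightness channel."""
    total = R + G + B
    if total == 0:
        # Chromaticity is undefined for black; a neutral value is assumed here.
        return (1.0 / 3.0, 1.0 / 3.0, 0.0)
    return (R / total, G / total, total / 3.0)


def gaussian_kernel(t, sigma):
    """K_sigma(t): a univariate Gaussian with standard deviation sigma."""
    return math.exp(-0.5 * (t / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)


def color_density(x, samples, sigmas):
    """Average over samples of the per-channel product of Gaussian kernels."""
    total = 0.0
    for sample in samples:
        product = 1.0
        for xj, sj, sigma in zip(x, sample, sigmas):
            product *= gaussian_kernel(xj - sj, sigma)
        total += product
    return total / len(samples)


pixels = [to_rgs(200, 40, 40), to_rgs(180, 50, 45), to_rgs(90, 20, 20)]
sigmas = (0.02, 0.02, 20.0)  # narrow on chromaticity, wide on brightness
print(color_density(to_rgs(150, 35, 35), pixels, sigmas))
print(color_density(to_rgs(40, 40, 200), pixels, sigmas))
```

With these bandwidths a darker pixel of similar hue still scores far higher than a pixel of a different hue, which is the effect the separate brightness bandwidth is meant to achieve.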
Data-Driven Bandwidth
The bandwidth for each channel is calculated directly from
the image based on sample statistics:
$$\sigma \approx 1.06\, \hat{\sigma}\, n^{-1/5}$$
where $\hat{\sigma}$ is estimated from the sample variance and $n$ is the sample size.
Initialization: Choosing the Initial Scale
For initialization, they compute a distance between the
foreground and background distribution by varying the scale
of a single Gaussian kernel (on the foreground).
To evaluate the “significance” of a particular scale, they
compute the normalized KL-divergence between $\hat{P}_{fg}$ and $\hat{P}_{bg}$,
the density estimates for the foreground and
background regions respectively. To compute each, they use
only a subset of the pixels (using all of the pixels would lead to quite
slow performance).
Iterative Sampling-Expectation Algorithm
Given the initial segmentation, they need to refine the
models and labels to adapt better to the image.
However, this is a chicken-and-egg problem. If we knew the
labels, we could compute the models, and if we knew the
models, we could compute the best labels.
They propose an EM algorithm for this. The basic idea is to
alternate between estimating the probability that each pixel
belongs to each of the two classes and, given this probability, refining
the underlying models.
EM is guaranteed to converge (but only to a local
optimum).
1. Initialize using the normalized KL-divergence.
2. Uniformly sample a set of pixels from the image to use in
the kernel density estimation. This is essentially the ‘M’
step (because we have a non-parametric density).
3. Update the pixel assignment based on maximum
likelihood (the ‘E’ step).
4. Repeat until stable. One can use a hard assignment of the
pixels and the kernel density estimator we’ve discussed,
or a soft assignment of the pixels and then a weighted
kernel density estimate (the weight is between the
different classes).
5. The overall probability of a pixel belonging to the
foreground class
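The alternation in steps 2 to 4 can be illustrated with a heavily simplified sketch. It is not the authors' implementation: it assumes scalar pixel features, a fixed Gaussian bandwidth `h`, hard assignments, sampling from each current class rather than the whole image, and an initial labeling that is simply given instead of being derived from the KL-divergence step. All names and data are hypothetical.

```python
import math
import random


def kde(x, samples, h):
    """Univariate Gaussian-kernel density estimate."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)


def sampling_expectation(values, labels, h=1.0, sample_size=10, max_iters=20, rng=None):
    """Alternate between sampling pixels per class and relabeling every pixel."""
    rng = rng or random.Random(0)
    labels = list(labels)
    for _ in range(max_iters):
        fg = [v for v, is_fg in zip(values, labels) if is_fg]
        bg = [v for v, is_fg in zip(values, labels) if not is_fg]
        if not fg or not bg:
            break  # one class vanished, so there is nothing left to compare
        # Sampling step: the drawn pixels define each non-parametric model.
        fg_sample = rng.sample(fg, min(sample_size, len(fg)))
        bg_sample = rng.sample(bg, min(sample_size, len(bg)))
        # Expectation step with hard assignment: keep the more likely class.
        new_labels = [kde(v, fg_sample, h) > kde(v, bg_sample, h) for v in values]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels


fg_values = [10.0 + 0.02 * i for i in range(-10, 10)]  # bright object pixels
bg_values = [0.02 * i for i in range(-10, 10)]         # dark background pixels
values = fg_values + bg_values
# Initial region of interest: all object pixels plus five background pixels.
initial = [True] * 25 + [False] * 15
final = sampling_expectation(values, initial)
print(sum(final), "pixels labeled foreground")
```

On this toy data the five mislabeled background pixels are reassigned after the first pass, and the second pass confirms that the labels no longer change.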
Results: Stability
Results
Summary
Advantages:
No assumptions are needed about the distributions ahead
of time (generality).
With enough samples, convergence to an arbitrarily
complicated target density can be obtained.
Disadvantages:
The number of samples needed may be very large
(number grows exponentially with the dimensionality of
the feature space).
There may be severe requirements for computation time
and storage.