Probability Theory
Sargur N. Srihari
srihari@cedar.buffalo.edu
Probability Theory in Machine Learning
• Probability is a key concept in dealing with
uncertainty
– Arises due to the finite size of data sets and noise in
measurements
• Probability Theory
– Framework for quantification and manipulation of
uncertainty
– One of the central foundations of machine learning
Random Variable (R.V.)
• Takes values subject to chance
– E.g., X is the result of a coin toss with values Head
and Tail, which are non-numeric
• X can be denoted by a r.v. x which takes values 1 and 0
– Each value of x has an associated probability
• Probability Distribution
– Mathematical function that describes
1. possible values of a r.v.
2. and associated probabilities
Probability with Two Variables
• Key concepts:
– conditional & joint probabilities of variables
• Random Variables: B and F
– Box B, Fruit F
• F has two values orange (o) or apple (a)
• B has values red (r) or blue (b)
[Figure: red box contains 2 apples and 6 oranges; blue box contains 3 apples and 1 orange]
For the red box, p(F=o|B=r)=3/4 and p(F=a|B=r)=1/4
Let p(B=r)=4/10 and p(B=b)=6/10
Given the above data we are interested in
several probabilities of interest:
marginal, conditional and joint (described next)
Probabilities of Interest
• Marginal Probability
– What is the probability of an apple? P(F=a)
• Note that we have to consider P(B)
• Conditional Probability
– Given that we have an orange
what is the probability that we
chose the blue box? P(B=b|F=o)
• Joint Probability
– What is the probability of orange
AND blue box? P(B=b,F=o)
Sum Rule of Probability Theory
• Consider two random variables
• X can take on values x_i, i = 1,…,M
• Y can take on values y_j, j = 1,…,L
• N trials sampling both X and Y
• Number of trials with X = x_i and Y = y_j is n_ij
• Joint Probability:  p(X = x_i, Y = y_j) = n_ij / N
• Marginal Probability:  p(X = x_i) = c_i / N, where c_i is the number of trials with X = x_i
  Since c_i = ∑_j n_ij,  p(X = x_i) = ∑_{j=1}^{L} p(X = x_i, Y = y_j)   (Sum Rule)
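A minimal numerical sketch of these definitions, assuming a hypothetical 2×3 table of counts n_ij:

import numpy as np

# Hypothetical counts n_ij of trials with X = x_i and Y = y_j (M = 2 rows, L = 3 columns)
n = np.array([[10,  5, 15],
              [20, 30, 20]])
N = n.sum()                      # total number of trials

joint = n / N                    # p(X = x_i, Y = y_j) = n_ij / N
c = n.sum(axis=1)                # c_i = number of trials with X = x_i
marginal_X = c / N               # p(X = x_i) = c_i / N
# Sum rule: marginalizing the joint over Y gives the same result
assert np.allclose(marginal_X, joint.sum(axis=1))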
Product Rule of Probability Theory
• Consider only those instances for which X=xi
• Then fraction of those instances for which Y=yj is
written as p(Y=yj|X=xi)
• Called conditional probability
• Relationship between joint and conditional probability:
    p(Y = y_j | X = x_i) = n_ij / c_i
    p(X = x_i, Y = y_j) = n_ij / N = (n_ij / c_i) · (c_i / N)
                        = p(Y = y_j | X = x_i) p(X = x_i)   (Product Rule)
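A short sketch, with the same hypothetical counts as before, checking the conditional probability and the product rule:

import numpy as np

# Same hypothetical count matrix n_ij as in the earlier sketch
n = np.array([[10,  5, 15],
              [20, 30, 20]])
N = n.sum()
c = n.sum(axis=1)                          # c_i: trials with X = x_i

cond_Y_given_X = n / c[:, None]            # p(Y = y_j | X = x_i) = n_ij / c_i
joint = n / N                              # p(X = x_i, Y = y_j) = n_ij / N
# Product rule: p(X, Y) = p(Y | X) p(X), with p(X = x_i) = c_i / N
assert np.allclose(joint, cond_Y_given_X * (c / N)[:, None])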
Bayes Theorem
• From the product rule together with the symmetry
property p(X,Y)=p(Y,X) we get
    p(Y | X) = p(X | Y) p(Y) / p(X)
• Which is called Bayes’ theorem
• Using the sum rule the denominator is expressed as
    p(X) = ∑_Y p(X | Y) p(Y)
  This is the normalization constant that ensures the conditional probability on the LHS
  sums to 1 over all values of Y
Rules of Probability
• Given random variables X and Y
• Sum Rule gives Marginal Probability
    p(X = x_i) = ∑_{j=1}^{L} p(X = x_i, Y = y_j) = c_i / N
• Product Rule: joint probability in terms of conditional and marginal
    p(X, Y) = n_ij / N = (n_ij / c_i) × (c_i / N) = p(Y | X) p(X)
• Combining we get Bayes Rule
    p(Y | X) = p(X | Y) p(Y) / p(X),  where p(X) = ∑_Y p(X | Y) p(Y)
  Viewed as: posterior ∝ likelihood × prior
Ex: Joint Distribution over two Variables
X takes nine possible values, Y takes two values
N = 60 data points
[Figure: histogram of Y (fraction of data points having each value of Y); histogram of X;
histogram of X given Y = 1]
Fractions would equal the probabilities as N → ∞
Bayes rule applied to Fruit Problem
• Probability that box is red given
that fruit picked is orange
    p(B = r | F = o) = p(F = o | B = r) p(B = r) / p(F = o)
                     = (3/4 × 4/10) / (9/20) = 2/3 ≈ 0.67
  The a posteriori probability of 2/3 is different from the a priori probability of 0.4
• Probability that fruit is orange
  – From the sum and product rules
    p(F = o) = p(F = o, B = r) + p(F = o, B = b)
             = p(F = o | B = r) p(B = r) + p(F = o | B = b) p(B = b)
             = 6/8 × 4/10 + 1/4 × 6/10 = 9/20 = 0.45
  The marginal probability of 0.45 is lower than the overall fraction of oranges,
  7/12 ≈ 0.58 (7 of the 12 fruits are oranges)
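A small sketch that reproduces these fruit-box numbers from the stated priors and box contents:

# Priors over boxes and conditionals over fruit, from the box contents
p_B = {'r': 4/10, 'b': 6/10}
p_F_given_B = {'r': {'o': 6/8, 'a': 2/8},      # red box: 6 oranges, 2 apples
               'b': {'o': 1/4, 'a': 3/4}}      # blue box: 1 orange, 3 apples

# Sum + product rules: p(F=o) = sum_B p(F=o|B) p(B)
p_F_o = sum(p_F_given_B[b]['o'] * p_B[b] for b in p_B)          # 0.45

# Bayes rule: p(B=r|F=o) = p(F=o|B=r) p(B=r) / p(F=o)
p_B_r_given_F_o = p_F_given_B['r']['o'] * p_B['r'] / p_F_o      # 2/3
print(p_F_o, p_B_r_given_F_o)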
Independent Variables
• If p(X,Y)=p(X)p(Y) then X and Y are said to be
independent
• Why?
• From the product rule:  p(Y | X) = p(X, Y) / p(X) = p(X) p(Y) / p(X) = p(Y)
• In fruit example if each box contained same
fraction of apples and oranges then p(F|B)=p(F)
Probability Density Function (pdf)
• Continuous Variables
  – If the probability that x falls in the interval (x, x+δx) is given by p(x)δx as δx → 0,
    then p(x) is the pdf of x
• Probability that x lies in the interval (a,b) is
    p(x ∈ (a,b)) = ∫_a^b p(x) dx
• Probability that x lies in the interval (−∞, z) is the cumulative distribution function
    P(z) = ∫_{−∞}^{z} p(x) dx
Several Variables
• If there are several continuous variables x1,…,xD
denoted by vector x then we can define a joint
probability density p(x)=p(x1,..,xD)
• Multivariate probability density must satisfy
p(x) ≥ 0
    ∫_{−∞}^{∞} p(x) dx = 1
Sum, Product, Bayes for Continuous
• Rules apply for continuous, or combinations of
discrete and continuous variables
p(x) = ∫ p(x,y)dy
p(x,y) = p(y | x)p(x)
p(y | x) = p(x | y) p(y) / p(x)
• Formal justification of sum, product rules for
continuous variables requires measure theory
Expectation
• Expectation is average value of some function f(x) under the
probability distribution p(x) denoted E[f]
• For a discrete distribution
    E[f] = ∑_x p(x) f(x)
• For a continuous distribution
    E[f] = ∫ p(x) f(x) dx
  Examples of f(x) of use in ML:
    f(x) = x:  E[f] is the mean
    f(x) = −ln p(x):  E[f] is the entropy
    f(x) = −ln[q(x)/p(x)]:  E[f] is the K-L divergence
• If there are N points drawn from the pdf, then the expectation can be approximated as
    E[f] ≈ (1/N) ∑_{n=1}^{N} f(x_n)
  This approximation is extremely important when we use sampling to determine an expected value
• Conditional Expectation with respect to a conditional distribution
    E_x[f | y] = ∑_x p(x|y) f(x)
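A minimal sketch of the sampling approximation, assuming f(x) = x² and a standard Gaussian p(x) so the true value E[f] = 1 is known:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)   # N samples from p(x) = N(0, 1)
estimate = np.mean(x**2)                           # (1/N) sum_n f(x_n) with f(x) = x^2
print(estimate)                                    # close to the true expectation E[x^2] = 1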
Variance
• Measures how much variability there is in f(x)
around its mean value E[f(x)]
• Variance of f(x) is denoted as
    var[f] = E[(f(x) − E[f(x)])²]
• Expanding the square
    var[f] = E[f(x)²] − E[f(x)]²
• Variance of the variable x itself
    var[x] = E[x²] − E[x]²
Covariance
• For two random variables x and y their covariance is
• cov[x,y] = E_{x,y}[{x − E[x]}{y − E[y]}]
           = E_{x,y}[xy] − E[x]E[y]
– Expresses how x and y vary together
• If x and y are independent then their covariance
vanishes
• If x and y are two vectors of random variables
covariance is a matrix
• If we consider the covariance of the components of vector x
with each other, then we denote it as cov[x] = cov[x,x]
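A short sketch of these variance and covariance formulas on simulated data (the correlated samples are an assumed example):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(size=10_000)               # y constructed to vary together with x

var_x = np.mean(x**2) - np.mean(x)**2               # var[x] = E[x^2] - E[x]^2
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)   # cov[x,y] = E[xy] - E[x]E[y]
print(var_x, cov_xy)                                # near 1 and 2 for this construction
# For a vector x of random variables, np.cov(...) returns the covariance matrix cov[x]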
Bayesian Probabilities
• Classical or Frequentist view of Probabilities
– Probability is frequency of random, repeatable event
– Frequency of a tossed coin coming up heads is 1/2
• Bayesian View
– Probability is a quantification of uncertainty
– Degree of belief in propositions that do not involve random
variables
– Examples of uncertain events as probabilities:
• Whether Arctic Sea ice cap will disappear
• Whether moon was once in its own orbit around the sun
• Whether Thomas Jefferson had a child by one of his slaves
• Whether a signature on a check is genuine
Whether the Arctic Sea ice cap will disappear
• We have some idea of how quickly polar ice is melting
• Revise it on the basis of fresh evidence (satellite observations)
• Assessment will affect actions we take (to reduce greenhouse gases)
[NASA video]
This is an uncertain event, answered by the general Bayesian interpretation
Bayesian Representation of Uncertainty
• Use of probability to represent uncertainty is not an
ad-hoc choice
• If numerical values are used to represent degrees of
belief, then a simple set of axioms for manipulating
degrees of belief leads to the sum and product rules of
probability (Cox's theorem)
• Probability theory can be regarded as an extension of
Boolean logic to situations involving uncertainty
(Jaynes)
Bayesian Approach
• Quantify uncertainty around choice of parameters w
– E.g., w is vector of parameters in curve fitting
    y(x, w) = w_0 + w_1 x + w_2 x² + … + w_M x^M = ∑_{j=0}^{M} w_j x^j
• Uncertainty before observing data expressed by p(w)
• Given observed data D = {t_1, …, t_N}
– Uncertainty in w after observing D, by Bayes rule:
    p(w | D) = p(D | w) p(w) / p(D)
– Quantity p(D|w) is evaluated for observed data
• It can be viewed as function of w
• It represents how probable the data set is for different parameters w
• It is called the Likelihood function
• Not a probability distribution over w
Bayes theorem in words
• Uncertainty in w expressed as
    p(w | D) = p(D | w) p(w) / p(D)
• Bayes theorem in words:
posterior ∝ likelihood × prior
• Denominator is normalization factor
• Involves marginalization over w
    p(D) = ∫ p(D | w) p(w) dw   (by the Sum Rule)
Role of Likelihood Function
• Likelihood Function plays central role in both
Bayesian and frequentist paradigms
• Frequentist:
• w is a fixed parameter determined by an estimator;
• Error bars on estimate are obtained from possible
data sets D
• Bayesian:
• There is a single data set D
• Uncertainty in parameters expressed as probability
distribution over w
Maximum Likelihood Approach
• In frequentist setting w is a fixed parameter
– w is set to value that maximizes likelihood function p(D|w)
– In ML, negative log of likelihood function is called error
function since maximizing likelihood is equivalent to
minimizing error
• Error Bars
– Bootstrap approach to creating L data sets
• From N data points new data sets are created by drawing N points at
random with replacement
• Repeat L times to generate L data sets
• Accuracy of parameter estimate can be evaluated by variability of
predictions between different bootstrap sets
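A minimal sketch of bootstrap error bars, assuming hypothetical data and the sample mean as the estimator:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=50)     # hypothetical data set of N points

L = 1000                                           # number of bootstrap data sets
estimates = []
for _ in range(L):
    resample = rng.choice(data, size=data.size, replace=True)  # draw N points with replacement
    estimates.append(resample.mean())              # re-estimate the parameter on each set

print(np.std(estimates))   # variability across bootstrap sets gives the error bar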
Bayesian: Prior and Posterior
• Inclusion of prior knowledge arises naturally
• Coin Toss Example
– A fair-looking coin is tossed three times and lands Heads each time
– The classical MLE of the probability of landing Heads is 1, implying all
future tosses will land Heads
– Bayesian approach with reasonable prior will lead to less
extreme conclusion
[Figure: prior p(µ) and posterior p(µ|H) over µ = p(H)]
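The slides do not specify the prior; one common choice (assumed here) is a Beta prior over µ = p(H), which gives a closed-form posterior. A sketch for three observed Heads:

# Assumed Beta(2, 2) prior over mu = p(H); posterior after h heads and t tails is Beta(2+h, 2+t)
a, b = 2.0, 2.0
heads, tails = 3, 0
post_a, post_b = a + heads, b + tails
posterior_mean = post_a / (post_a + post_b)   # = 5/7 ≈ 0.71, less extreme than the MLE of 1.0
print(posterior_mean)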
Practicality of Bayesian Approach
• Marginalization over whole parameter space is
required to make predictions or compare
models
• Factors making it practical:
• Sampling Methods such as Markov Chain Monte Carlo
methods
• Increased speed and memory of computers
• Deterministic approximation schemes such as
Variational Bayes and Expectation propagation
are alternatives to sampling
The Gaussian Distribution
• For a single real-valued variable x
    N(x | µ, σ²) = (1 / (2πσ²)^{1/2}) exp{ −(x − µ)² / (2σ²) }
  What is an exponential: y = e^x, where e = 2.718…; its value for argument 0 is 1,
  and it is its own derivative
• It has two parameters:
  – Mean µ, variance σ²
  – Standard deviation σ
  – Precision β = 1/σ²
  The maximum of a distribution is its mode; for a Gaussian, the mode coincides with the mean
• Can find expectations of functions of x under the Gaussian
    E[x] = ∫_{−∞}^{∞} N(x | µ, σ²) x dx = µ
    E[x²] = ∫_{−∞}^{∞} N(x | µ, σ²) x² dx = µ² + σ²
    var[x] = E[x²] − E[x]² = σ²
  [Figure: Gaussian density with µ = 0, σ = 1]
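A small numerical sketch of the density formula and these moments, using a simple grid approximation (not part of the slides):

import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2) = (2*pi*sigma^2)^(-1/2) exp(-(x-mu)^2 / (2*sigma^2))"""
    return np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

mu, sigma2 = 1.5, 0.5
x = np.linspace(mu - 10, mu + 10, 200_001)
p = gaussian_pdf(x, mu, sigma2)
dx = x[1] - x[0]

print(np.sum(p) * dx)                    # ≈ 1      (normalization)
print(np.sum(x * p) * dx)                # ≈ mu     (E[x])
print(np.sum(x**2 * p) * dx - mu**2)     # ≈ sigma2 (var[x] = E[x^2] - E[x]^2)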
Multivariate Gaussian Distribution
• For a D-dimensional vector x
    N(x | µ, Σ) = (1 / (2π)^{D/2}) (1 / |Σ|^{1/2}) exp{ −½ (x − µ)ᵀ Σ⁻¹ (x − µ) }
• It has parameters:
– Mean µ, a D-dimensional vector
– Covariance matrix Σ
• Which is a D ×D matrix
Likelihood Function for Gaussian
• Given N scalar observations x = [x_1, …, x_N]ᵀ
  – which are independent and identically distributed
• Probability of the data set is given by the likelihood function
    p(x | µ, σ²) = ∏_{n=1}^{N} N(x_n | µ, σ²)
  [Figure: data are the black points; the likelihood is the product of the blue values;
   pick the mean and variance to maximize this product]
• Log-likelihood function is
    ln p(x | µ, σ²) = −(1/(2σ²)) ∑_{n=1}^{N} (x_n − µ)² − (N/2) ln σ² − (N/2) ln(2π)
• Maximum likelihood solutions are given by
    µ_ML = (1/N) ∑_{n=1}^{N} x_n               which is the sample mean
    σ²_ML = (1/N) ∑_{n=1}^{N} (x_n − µ_ML)²    which is the sample variance
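A minimal sketch of these maximum-likelihood estimates on simulated observations:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1_000)   # N i.i.d. observations

mu_ml = x.mean()                                 # mu_ML = (1/N) sum_n x_n
sigma2_ml = np.mean((x - mu_ml)**2)              # sigma^2_ML = (1/N) sum_n (x_n - mu_ML)^2
print(mu_ml, sigma2_ml)                          # close to the true values 2 and 9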
Bias in Maximum Likelihood
• Maximum likelihood systematically underestimates the variance
  – E[µ_ML] = µ
  – E[σ²_ML] = ((N−1)/N) σ²
  – Not an issue as N increases
• Problem is related to the over-fitting problem
[Figure: green curve is the true distribution; averaged across three data sets the mean is
 correct, but the variance is underestimated because it is estimated relative to the sample
 mean and not the true mean]
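A quick simulation (an illustrative sketch, not from the slides) of the bias result E[σ²_ML] = ((N−1)/N)σ²:

import numpy as np

rng = np.random.default_rng(0)
N, trials, true_var = 3, 100_000, 1.0
data = rng.normal(0.0, np.sqrt(true_var), size=(trials, N))   # many data sets of size N

# ML variance estimate of each data set, relative to its own sample mean
sigma2_ml = np.mean((data - data.mean(axis=1, keepdims=True))**2, axis=1)
print(sigma2_ml.mean())          # ≈ (N-1)/N * true_var = 2/3, i.e. an underestimate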
Curve Fitting Probabilistically
• Goal is to predict for target
variable t given a new value
of the input variable x
– Given N input values x = (x_1, …, x_N)ᵀ and corresponding
target values t = (t_1, …, t_N)ᵀ
– Assume that, given a value of x, the value of t has a Gaussian
distribution with mean equal to the value y(x,w) of the polynomial curve
    p(t | x, w, β) = N(t | y(x,w), β⁻¹)
    y(x, w) = w_0 + w_1 x + w_2 x² + … + w_M x^M = ∑_{j=0}^{M} w_j x^j
[Figure: Gaussian conditional distribution for t given x; the mean is given by the
 polynomial function y(x,w) and the precision by β]
Curve Fitting with Maximum Likelihood
• Likelihood Function is
    p(t | x, w, β) = ∏_{n=1}^{N} N(t_n | y(x_n, w), β⁻¹)
• Logarithm of the Likelihood function is
    ln p(t | x, w, β) = −(β/2) ∑_{n=1}^{N} {y(x_n, w) − t_n}² + (N/2) ln β − (N/2) ln(2π)
• To find maximum likelihood solution for
polynomial coefficients wML
– Maximize w.r.t w
– Can omit last two terms -- don’t depend on w
– Can replace β/2 with ½ (since it is constant wrt w)
– Minimize negative log-likelihood
– Identical to sum-of-squares error function
Precision parameter with MLE
• Maximum likelihood can also be used to
determine β of Gaussian conditional distribution
• Maximizing likelihood wrt β gives
    1/β_ML = (1/N) ∑_{n=1}^{N} {y(x_n, w_ML) − t_n}²
• First determine parameter vector wML governing
the mean and subsequently use this to find
precision βML
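A minimal sketch of this two-step procedure on assumed synthetic data: fit w_ML by least squares (equivalent to maximizing the likelihood), then obtain β_ML from the residuals:

import numpy as np

rng = np.random.default_rng(0)
N, M, true_beta = 30, 3, 25.0
x = rng.uniform(0, 1, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 1/np.sqrt(true_beta), size=N)  # assumed toy data

Phi = np.vander(x, M + 1, increasing=True)        # design matrix with columns x^0 .. x^M
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)    # w_ML minimizes sum_n {y(x_n,w) - t_n}^2

residuals = Phi @ w_ml - t
beta_ml = 1.0 / np.mean(residuals**2)             # 1/beta_ML = (1/N) sum_n {y(x_n,w_ML) - t_n}^2
print(w_ml, beta_ml)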
Predictive Distribution
• Knowing parameters w and β
• Predictions for new values of x can be made
using
    p(t | x, w_ML, β_ML) = N(t | y(x, w_ML), β_ML⁻¹)
• Instead of a point estimate we are now giving a
probability distribution over t
A More Bayesian Treatment
• Introducing a prior distribution over polynomial
coefficients w
    p(w | α) = N(w | 0, α⁻¹ I) = (α / 2π)^{(M+1)/2} exp{ −(α/2) wᵀw }
– where α is the precision of the distribution
– M+1 is total no. of parameters for an Mth order polynomial
– α is a hyperparameter: a parameter that controls the
distribution of the model parameters
Posterior Distribution
• Using Bayes theorem, posterior distribution for w is
proportional to product of prior distribution and
likelihood function
    p(w | x, t, α, β) ∝ p(t | x, w, β) p(w | α)
• w can be determined by finding the most probable
value of w given the data, i.e., by maximizing the posterior
distribution (the MAP estimate)
• This is equivalent (by taking logs) to minimizing
    (β/2) ∑_{n=1}^{N} {y(x_n, w) − t_n}² + (α/2) wᵀw
2
• Same as sum of squared errors function with a
regularization parameter given by λ=α/β
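A sketch of the resulting MAP solution: with λ = α/β the minimization is regularized (ridge) least squares, which has a closed form (the function name is illustrative):

import numpy as np

def map_weights(Phi, t, alpha, beta):
    """Minimize (beta/2) * sum_n {y(x_n,w) - t_n}^2 + (alpha/2) * w^T w.
    Closed form: w = (Phi^T Phi + lambda*I)^(-1) Phi^T t with lambda = alpha/beta."""
    lam = alpha / beta
    num_params = Phi.shape[1]                 # M + 1 parameters
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(num_params), Phi.T @ t)

# Usage with the design matrix Phi and targets t from the earlier curve-fitting sketch:
# w_map = map_weights(Phi, t, alpha=5e-3, beta=11.1)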
Bayesian Curve Fitting
• Previous treatment still makes point estimate of w
– In fully Bayesian approach consistently apply sum and
product rules and integrate over all values of w
• Given training data x and t and a new test point x, the goal
is to predict the value of t
– i.e., we wish to evaluate the predictive distribution p(t | x, x, t)
• Applying sum and product rules of probability
– Predictive distribution can be written in the form
    p(t | x, x, t) = ∫ p(t, w | x, x, t) dw                 by Sum Rule (marginalizing over w)
                   = ∫ p(t | x, w, x, t) p(w | x, x, t) dw  by Product Rule
                   = ∫ p(t | x, w) p(w | x, t) dw           by eliminating unnecessary conditioning
  where p(t | x, w) = N(t | y(x, w), β⁻¹), and p(w | x, t) is the posterior distribution over
  parameters, which is also a Gaussian
Bayesian Curve Fitting
• Predictive distribution is also Gaussian
    p(t | x, x, t) = N(t | m(x), s²(x))
– where the mean and variance depend on x:
    m(x) = β φ(x)ᵀ S ∑_{n=1}^{N} φ(x_n) t_n
    s²(x) = β⁻¹ + φ(x)ᵀ S φ(x)
    S⁻¹ = α I + β ∑_{n=1}^{N} φ(x_n) φ(x_n)ᵀ
    φ(x) has elements φ_i(x) = x^i for i = 0,…,M
  In s²(x), the first term is the uncertainty in the predicted value due to noise on the target;
  the second term is the uncertainty in the parameters due to the Bayesian treatment
[Figure: predictive distribution for an M = 9 polynomial with α = 5×10⁻³ and β = 11.1;
 the red curve is the mean and the red region spans ±1 standard deviation]
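A sketch implementing m(x) and s²(x) directly from these formulas, with a polynomial basis and the α, β values quoted in the figure caption (function names are illustrative):

import numpy as np

def poly_features(x, M):
    """phi(x) with elements phi_i(x) = x^i for i = 0,..,M; returns shape (len(x), M+1)."""
    return np.power.outer(np.atleast_1d(x), np.arange(M + 1))

def bayes_predict(x_new, x_train, t_train, M=9, alpha=5e-3, beta=11.1):
    Phi = poly_features(x_train, M)                       # N x (M+1)
    S_inv = alpha * np.eye(M + 1) + beta * Phi.T @ Phi    # S^-1 = alpha I + beta sum phi phi^T
    S = np.linalg.inv(S_inv)
    phi = poly_features(x_new, M)
    mean = beta * phi @ S @ (Phi.T @ t_train)             # m(x) = beta phi(x)^T S sum phi(x_n) t_n
    var = 1.0 / beta + np.sum(phi @ S * phi, axis=1)      # s^2(x) = 1/beta + phi(x)^T S phi(x)
    return mean, var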
Model Selection
Models in Curve Fitting
• In polynomial curve fitting:
– an optimal order of polynomial gives best
generalization
• Order of the polynomial controls
– the number of free parameters in the model and
thereby model complexity
• With regularized least squares, λ also controls
model complexity
Validation Set to Select Model
• Performance on training set is not a good
indicator of predictive performance
• If there is plenty of data,
– use some of the data to train a range of models, or a
given model with a range of values for its parameters
– Compare them on an independent set, called
validation set
– Select one having best predictive performance
• If data set is small then some over-fitting can
occur and it is necessary to keep aside a test set
S-fold Cross Validation
• Supply of data is limited
• All available data is partitioned into S groups
• S−1 groups are used to train and the model is evaluated on the remaining group
• Repeat for all S choices of held-out group
• Performance scores from the S runs are averaged
[Figure: illustration with S = 4]
If S = N this is the leave-one-out method
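A minimal sketch of S-fold cross-validation for choosing the polynomial order, reusing a least-squares fit (the model and error measure are illustrative assumptions):

import numpy as np

def cross_val_error(x, t, M, S=4):
    """Average held-out squared error of an order-M polynomial over S folds."""
    idx = np.array_split(np.random.default_rng(0).permutation(len(x)), S)
    errors = []
    for s in range(S):
        val = idx[s]
        train = np.concatenate([idx[j] for j in range(S) if j != s])
        Phi_tr = np.vander(x[train], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi_tr, t[train], rcond=None)      # train on S-1 groups
        Phi_val = np.vander(x[val], M + 1, increasing=True)
        errors.append(np.mean((Phi_val @ w - t[val])**2))          # evaluate on held-out group
    return np.mean(errors)                                         # average over the S runs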
Bayesian Information Criterion
• Criterion for choosing model
• Akaike Information Criterion (AIC) chooses the model for which the quantity
    ln p(D | w_ML) − M
  is highest
• Where M is number of adjustable parameters
• BIC is a variant of this quantity
The Curse of Dimensionality
Need to deal with spaces with many
variables in machine learning
Example Classification Problem
• Three classes
• 12 variables:
two shown
• 100 points
• Learn to classify from data
[Figure: scatter plot of the two variables; which class should a new point x belong to?]
Cell-based Classification
• Naïve approach of
cell based voting
will fail because of
exponential growth
of cells with
dimensionality
• Hardly any points in
each cell
Volume of Sphere in High Dimensions
• Sphere is of radius r = 1 in D dimensions
• What fraction of its volume lies between radius r = 1 − ε and r = 1?
• V_D(r) = K_D r^D
• This fraction is given by 1 − (1 − ε)^D
• As D increases, a high proportion of the volume lies near the outer shell
[Figure: fraction of the volume of the sphere lying in the range r = 1 − ε to r = 1
 for various dimensions D]
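A short check of the fraction 1 − (1 − ε)^D, assuming a thin shell with ε = 0.01:

eps = 0.01
for D in (1, 2, 20, 200, 1000):
    fraction = 1 - (1 - eps)**D        # fraction of the unit sphere's volume with r in (1-eps, 1)
    print(D, round(fraction, 3))       # approaches 1 as D grows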
Gaussian in High-dimensional Space
• x-y space converted to r-
space using polar
coordinates
• Most of the probability
mass is located in a thin
shell at a specific radius