Lecture 4 Notes
1 Introduction
1.1 Recap: Frequentist Inference
In last year's subject, FIT2086, we learned about what is commonly referred to as the frequentist
approach to statistical inference. The approach takes its name from the fact that it follows a very
strict frequency-based interpretation of probability; that is, probabilities are only well defined if they
represent the frequencies with which events would occur under repeated sampling/experimentation,
i.e., if we repeat an experiment enough times, the observed frequency with which events occur would
match the theoretical probabilities. In our frequentist framework we had a model for our data described
by a probability distribution
p(y | θ)
where θ were some parameters that described the population we were studying/modelling. The fun-
damental assumption that we made was that θ was unknown, but fixed. That is, we did not know the
population parameters, but we assumed they were equal to some fixed value. From this assumption,
we learned a complete theory of inference, including methods for point estimation (choosing a good
value of θ from the data), interval estimation (quantifying how uncertain we were about our estimates
of θ) and hypothesis testing (testing to see if certain hypotheses about the population were supported
by the data); i.e.,
• Parameter estimation. We learned about the maximum likelihood estimator, which is a standard
frequentist point estimation technique:
We choose the value of θ that maximises the probability of seeing our observed data y as the
best estimate of the unknown population parameter.
• Interval estimation. We learned about the idea of confidence intervals to quantify the uncertainty
we have about our estimates; in particular, we can produce a 100(1 − α)% confidence interval
(θ− (y), θ+ (y)) such that
P (θ ∈ (θ− (y), θ+ (y))) = 1 − α
under repeated sampling of y, where 1 − α is the confidence level. That is, if we repeatedly sample
from the population, and use the above rule with α = 0.05 to compute an interval for θ using
the sample y, then for 95% of possible datasets, our confidence interval will contain the fixed,
but unknown θ.
• Hypothesis testing. We learned about the idea of computing p-values to quantify how much
evidence the data present against a proposed null hypothesis. This could be interpreted as
the probability of seeing data with a deviation from our null hypothesis as extreme as, or more
extreme than, that of the data we have observed, if the null hypothesis were true.
Some of the concepts were difficult to grasp, because they are not particularly intuitive; this springs
from the assumption that θ is fixed. We will now look at a completely different branch of inference
that has much simpler interpretations, due to the relaxation of the “fixed θ” assumption.
1.2 Bayes' Rule
The Bayesian approach is built on Bayes' rule. For two discrete random variables X and Y,

P(X = x | Y = y) = P(Y = y | X = x)P(X = x) / P(Y = y)

where

P(Y = y) = Σ_{x∈X} P(Y = y | X = x)P(X = x)

is the marginal distribution of Y, i.e., the probability of Y = y irrespective of what value X takes.
Example 1: Example use of Bayes’ Rule
To see how Bayes' rule can be used, and to get a basic idea of what it does, let us consider the
following standard example. Imagine that a woman attends a GP clinic regarding a lump she has felt in her
breast; she is concerned she may have breast cancer. The GP knows that the population frequency
of breast cancer (C = 1) is 0.0066 (our prior probability); that is, without knowing anything about
the woman other than that she comes from the Australian population, the GP knows there is a 0.0066
probability she has breast cancer. From a study, the GP also knows that (this is made up!) the
probability of developing a breast lump (L = 1) if:
• a woman has breast cancer (C = 1) is 60%;
• a woman does not have breast cancer (C = 0) is 5%.
So a woman with breast cancer is quite likely to develop a lump, but a woman without breast cancer
can also develop a benign lump. Bayes’ rule lets us combine these probabilities together to get
P(C = 1 | L = 1):
P(C = 1 | L = 1) = P(L = 1 | C = 1)P(C = 1) / [P(L = 1 | C = 0)P(C = 0) + P(L = 1 | C = 1)P(C = 1)]
             = (0.6 · 0.0066) / (0.05 · (1 − 0.0066) + 0.6 · 0.0066)
             ≈ 0.0738
To summarise, before seeing the lump, the GP's estimate of P(C = 1) was 0.0066. Bayes' rule lets us take
the evidence (the value of L) and revise this probability to 0.0738; so the chance that the woman
has breast cancer is now more than 10 times greater after observing the fact that she has a breast lump
than before the GP knew anything about her condition. This is the essence of Bayes' rule; it lets us
update our knowledge using evidence.
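To make this concrete, here is a minimal Python sketch of the same calculation; the probabilities are exactly those quoted above.

# Bayes' rule for Example 1: computing P(C = 1 | L = 1).
prior_c = 0.0066           # P(C = 1), the prior probability of cancer
p_lump_cancer = 0.60       # P(L = 1 | C = 1)
p_lump_no_cancer = 0.05    # P(L = 1 | C = 0)

# Marginal probability of a lump, P(L = 1).
p_lump = p_lump_cancer * prior_c + p_lump_no_cancer * (1 - prior_c)

# Posterior probability of cancer given a lump.
posterior_c = p_lump_cancer * prior_c / p_lump
print(posterior_c)         # approximately 0.0738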
2 Bayesian Inference
2.1 Problem Setup
How can we use Bayes' rule to perform statistical inference? In the Bayesian framework, we have the
following ingredients:
• a probability model (the likelihood) p(y | θ), which describes the probability of observing the data y if the population parameter takes the value θ;
• a prior distribution π(θ), which describes our beliefs about the value of the population parameter before we observe any data; the unknown parameter θ is treated as a random variable.
As we can see, the primary difference between frequentist and Bayesian statistics is that we now treat
the unknown parameter as a random variable. The advantage of this approach is that it allows us to
make probabilistic statements about θ.
2.2 The Posterior Distribution
Applying Bayes' rule to the unknown parameter θ gives

p(θ | y) = p(y | θ)π(θ) / p(y) ∝ p(y | θ)π(θ)    (1)

where

p(y) = ∫_Θ p(y | θ)π(θ) dθ
is the marginal distribution of the data y arising, irrespective of the value of the population parameter
(only assuming that it follows the prior distribution). The quantity (1) is called the posterior distri-
bution and is the central inferential quantity in Bayesian inference. Everything else flows from this
quantity. To summarise, in the Bayesian framework we can interpret the different quantities as:
• π(θ) is the prior probability of model θ generating the data, i.e., the probability that the true
population parameter takes on the value θ;
• p(y | θ) is the probability of seeing a data sample y from our population if the population value
is θ;
• p(θ | y) is the conditional (posterior) probability of the population parameter being θ, given that
we have observed the data sample y.
The posterior distribution p(θ | y) is the central component in Bayesian inference. It is important then
to understand how to interpret it; essentially, if our prior distribution, π(θ), accurately describes the
probability that θ is the true population parameter, then the posterior probability
P(θ ∈ A | y) = ∫_A p(θ′ | y) dθ′
is the probability the population parameter is in the set A, given that we observed the data sample
y = (y1 , . . . , yn ). This obviously says that areas of the posterior distribution which have higher
probability density are the values of θ that are more likely to be the true population parameter. This
offers a very nice, clean interpretation – we don't need ideas of "confidence" or repeated sampling.
Instead, the posterior takes the data sample we have observed, and uses it to update or refine our
prior beliefs π(θ) about the value of true, underlying model θ. This is a very elegant approach as it
naturally combines the evidence in the data about θ with any prior beliefs about θ that we might have.
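As a small illustration of how the posterior can be computed in practice, the following Python sketch approximates (1) by evaluating the likelihood and prior on a grid of θ values and normalising numerically; the Bernoulli model, uniform prior and data here are purely illustrative assumptions, not part of the notes.

import numpy as np

# Grid approximation of a posterior: evaluate p(y | theta) * pi(theta) on a
# grid and normalise. Model, prior and data are illustrative assumptions.
theta = np.linspace(0.001, 0.999, 999)         # grid over the parameter space
prior = np.ones_like(theta)                    # a uniform prior pi(theta)

y = np.array([1, 0, 1, 1, 0, 1, 1])            # hypothetical Bernoulli sample
k, n = y.sum(), len(y)
likelihood = theta**k * (1 - theta)**(n - k)   # p(y | theta)

# Posterior is proportional to likelihood times prior, as in (1);
# normalising over the grid plays the role of dividing by p(y).
posterior = likelihood * prior
posterior /= posterior.sum()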
2.3 The Prior Distribution
The prior distribution π(θ) can be interpreted in (at least) two different ways:
• As a subjective description of prior beliefs about θ. In contrast to the frequentist interpretation of
probability as the frequency of events under repeated sampling, the idea of subjective probability
is about quantifying how strongly you believe something to be potentially true. The odds of a
team winning a soccer match are an example of subjective probability – they reflect the gambler's
beliefs about which team is more likely to win, and how likely they are to win. They cannot have
a frequency interpretation because once the game is played, it can never be played again – so
repeated sampling is not a possibility. Subjective probability lets us model such beliefs.
• We can also view the prior probability as a model of a truly random process. For example, imagine
we are working for a company that has factories in many countries. They start a new factory
and want to estimate the failure rate of components being manufactured in this new plant. We
can model the collection of different factories as a population in its own right, and the different failure
rates in different factories as a random process (being influenced by factors such as quality of
training of local labour and other local features), and use this as a model of the prior probability
over the failure rates. In this case our prior is a model of variability in manufacturing processes
between countries.
Over the years, frequentists have usually attacked Bayesianism by targeting the prior distribution.
The claim is that frequentist statistics is free of “personal priors” and that Bayesian inference can
be unduly influenced by the choice of prior distribution. An obvious counter-argument is that the
frequentist approach hinges crucially on accepting the probability model of the population p(y | θ) as a
true model of reality – when obviously, all models are likely to be inaccurate representations of reality.
However, such a criticism does prompt the question: where do prior distributions come from? Over one
hundred years of research in Bayesian statistics has gone into trying to answer this question;
some approaches include
1. They are chosen to reflect prior information/beliefs about the problem. This was the original motivation
for priors – the statistician must use whatever real prior information they have available (perhaps
gathered from experts, perhaps their opinions) to form the prior. More recently, the idea of prior
information has been refined to include more “general” beliefs about the underlying population
– e.g., the population parameter is likely to be “large” or “close to zero”. We will see some
examples of this kind later on.
2. They can be chosen for mathematical convenience; this is often done so that the choice of prior
distribution π(·) leads to a simple equation for the posterior distribution.
3. A lot of work has gone into trying to create prior distributions that express prior ignorance,
rather than knowledge; i.e., expressing the concept that “I know nothing”. These are sometimes
called uninformative priors, and are created by defining a mathematical concept of ignorance,
and generating prior distributions from this concept.
4. Finally, we can choose priors so that our Bayesian method ends up matching classical procedures;
for example, the Bayesian LASSO or the Bayesian ridge regression are Bayesian versions of these
classical, non-Bayesian procedures.
Of course, these ideas are not mutually exclusive – one can (and often does) combine different ap-
proaches, i.e., we might choose a convenient prior distribution in such a way that it (at least partially)
reflects real prior information or prior beliefs.
2.4 Using the Posterior for Inference
Once we have obtained a posterior distribution using (1), the obvious question is how to use it for
inferential purposes. Let us first examine how we might obtain point estimates (i.e., a best guess of θ)
using the posterior. In general, point estimates are statistics of the posterior distribution. Two very
common choices are:
• The posterior maximum (MAP – maximum a posteriori estimator). The MAP estimator is found
by using the value of θ that maximises the posterior probability density, i.e.,

θ̂MAP = arg max_θ {p(θ | y)}

In this approach we try to select the "most likely" value of θ, conditional on the particular data
sample y that we have observed.
• An alternative approach is to use the “average” value of θ from the posterior distribution. This
is called the posterior mean estimator and is given by
θ̂PM = ∫_Θ θ p(θ | y) dθ = E[θ | y]

This approach uses the posterior average value of θ as the best guess of the unknown population
parameter; both estimators are illustrated in the sketch following this list.
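As a quick numerical sketch of these two estimators, the following Python fragment computes both from a posterior evaluated on a grid; the Beta(8, 4) posterior used here is purely illustrative (it is the one that arises in the coin-tossing example later in these notes).

import numpy as np
from scipy import stats

# Point estimates from a posterior evaluated on a grid; a Beta(8, 4)
# posterior is used purely for illustration.
theta = np.linspace(0.001, 0.999, 999)
posterior = stats.beta(8, 4).pdf(theta)
posterior /= posterior.sum()                   # normalise over the grid

map_est = theta[np.argmax(posterior)]          # posterior mode (MAP)
pm_est = np.sum(theta * posterior)             # posterior mean E[theta | y]
print(map_est, pm_est)                         # ~0.700 and ~0.667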
There are a large variety of other approaches, and we will examine them more formally in Section 4. In
general, however, all Bayesian estimates have the property that they combine the information encoded
in the prior distribution with information encoded in the likelihood (i.e., from the observed data) to
produce a best guess of the population parameter. As in frequentist statistics, point estimates give us a
best “guess” at the population parameter values; they tell us nothing about the variability/uncertainty
in our guess. These aspects can also be naturally measured using the posterior distribution. In
particular, one way to measure the uncertainty about our point estimate is to use the posterior standard
deviation:

√V[θ | y]
This quantity tells us how much the posterior distribution concentrates probability around the posterior
mean. The more information in your posterior distribution, the smaller (less uncertainty) the posterior
standard deviation will be. The difference in Bayesian statistics is that the information in the posterior
can come from two sources: the information in the sample, which grows with increasing n, or the
“information” in the prior, which can be increased by making our prior beliefs more strict or exact.
We can also generate interval estimates to capture uncertainty in our posterior inferences in a
similar fashion to confidence intervals. The Bayesian equivalent of a confidence interval is called a
credible set. A 100α% credible interval is any interval (θ̂−, θ̂+) of the parameter space Θ such that

P(θ̂− < θ < θ̂+ | y) = ∫_{θ̂−}^{θ̂+} p(θ | y) dθ = α

where α ∈ (0, 1) is the level of the set. That is, the posterior probability that the population parameter
θ is in the interval (θ̂−, θ̂+) is α. In this way credible sets have a different – and cleaner – interpretation
than a confidence interval. A 100α% confidence interval is some interval such that for 100α% of possible
data samples from the population that we might see, the interval will contain the (fixed) unknown
true θ. In contrast, a 100α% credible interval says that if our prior is accurate, then the probability
that θ ∈ (θ̂− , θ̂+ ) is α, given we have observed the data sample y. We can interpret this directly as a
probability, which makes things a lot cleaner.
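As an illustration, the following Python sketch computes an equal-tailed 95% credible interval from a posterior evaluated on a grid; the Beta(8, 4) posterior is again purely illustrative.

import numpy as np
from scipy import stats

# Equal-tailed 95% credible interval from a grid posterior (alpha = 0.95);
# the Beta(8, 4) posterior is purely illustrative.
theta = np.linspace(0.001, 0.999, 999)
posterior = stats.beta(8, 4).pdf(theta)
posterior /= posterior.sum()

cdf = np.cumsum(posterior)                     # numerical posterior CDF
lo = theta[np.searchsorted(cdf, 0.025)]        # 2.5% posterior quantile
hi = theta[np.searchsorted(cdf, 0.975)]        # 97.5% posterior quantile
print(lo, hi)                                  # ~ stats.beta(8, 4).interval(0.95)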
2.5 Bayesian Inference – The Different Components
We conclude with a re-cap/summary of the different components of Bayesian inference.
• The likelihood p(y | θ) describes the probability of seeing data y, if the population parameter
was θ. It is the probability model that we want to use to model our population.
• The prior distribution π(θ) describes the probability that the population parameter is θ, before
we have observed any data from our population.
• Together, these form a joint distribution p(y, θ) = p(y | θ)π(θ) which describes the probability
that we would see sample y and the population parameter is θ.
• From this joint distribution, we can find the posterior distribution p(θ | y), which describes the
probability that θ is the population parameter, given we have observed the particular data sample
y. This is the quantity we use to perform inference as it describes how likely different values of
θ are to be the true population parameter, after taking into account our prior beliefs and the
evidence in the data sample.
• The marginal distribution p(y) describes the probability of observing a data sample y if all we
know about the population parameter is that it follows our prior distribution π(θ).
3 Example: Bayesian Inference with the Bernoulli Distribution
Suppose we model a sequence of n coin tosses y = (y1, . . . , yn), with yj = 1 denoting a head, as
yj ∼ Be(θ), so that

p(y | θ) = θ^k (1 − θ)^{n−k}

where θ ∈ [0, 1] is the probability of the coin coming up a head when we toss it, and k = Σ_{j=1}^{n} yj
is the number of observed heads. The maximum likelihood estimate of θ, found by maximising the
likelihood w.r.t. θ, is well known to be

θ̂ML(y) = k/n

so the maximum likelihood estimate is just the
observed proportion of heads in our data sample. To undertake a Bayesian analysis of this problem we
need: (i) a probability model, and (ii) a prior distribution. We have the former; as above, we assume
that yj ∼ Be(θ). We then need a prior for θ. We will choose something called the beta distribution:

p(θ | α, β) = θ^{α−1} (1 − θ)^{β−1} / B(α, β)

where B(α, β) is a special function called the beta function. The beta distribution is often used to
model proportions or probabilities, as it is supported on the range [0, 1]. It is frequently used as a
prior in Bayesian analysis because it is both mathematically convenient and quite flexible. The values
of α and β control the shape of the distribution, and therefore also control our prior beliefs about θ.
These types of parameters – ones that control the shape of our prior probability distributions – are
usually called hyperparameters. They are not strictly parameters of our model, but they are instead
parameters of our prior beliefs. To see how the beta distribution can encode a variety of prior beliefs,
we first note that the mean of the beta distribution is

E[θ] = α / (α + β).
The mean of the beta distribution can be viewed as our "best guess" of the population parameter before
seeing any data. It is our best prior guess, and by setting α, β we can control the prior expected value
of θ, and therefore our prior guess. So the prior mean lets us set our best guess – but how confident
are we in that guess? The variance of the beta distribution is
V[θ] = αβ / ((α + β)² (α + β + 1)).
The prior variance tells us how much probability is concentrated around our prior mean; in this sense
we can use it to control how confident we are in our prior guess. Varying α, β lets us modify the
strength of prior belief about E [θ]; the larger α, β, the smaller the prior variance and the stronger
our prior belief that θ is close to α/(α + β) will be. The smaller α and β are, the larger the prior
variance and the less sure we are about our prior guess. If α = β = 1 the beta distribution reduces
to a special case: the uniform distribution over θ, which says that, a priori, we believe every value of
the success probability θ is equally likely. This is in some sense more an expression of prior ignorance
than of belief.
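The following small Python sketch evaluates these two formulas for a few (purely illustrative) choices of α and β, showing how larger hyperparameters concentrate the prior around its mean.

# Prior mean and variance of a Beta(a, b) prior for several illustrative
# hyperparameter settings; larger a + b means a tighter, more confident prior.
for a, b in [(0.5, 0.5), (1, 1), (3, 3), (30, 30)]:
    mean = a / (a + b)
    var = a * b / ((a + b)**2 * (a + b + 1))
    print(f"Beta({a},{b}): prior mean = {mean:.3f}, prior variance = {var:.4f}")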
Multiplying the likelihood by the prior gives

p(θ | y) ∝ θ^k (1 − θ)^{n−k} · θ^{α−1} (1 − θ)^{β−1} = θ^{(k+α)−1} (1 − θ)^{((n−k)+β)−1}

which, after normalisation (see the solutions for Studio 5) is a beta distribution of the form
θ | y ∼ Beta(k + α, (n − k) + β) (2)
The fact that the posterior distribution is also a beta distribution is because the prior distribution
is conjugate to the likelihood; this is an example of a convenient prior distribution because it leads
to a simple form for the posterior. We can see from (2) that the prior hyperparameters act as “extra
datapoints”; the α hyperparameter adds an additional α heads to the k heads actually observed in our
sample, and the β hyperparameter adds an additional β tails to the n − k tails we actually observed
in our sample.
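The following Python sketch performs this conjugate update for the data used in Figure 1 below (k = 7 heads in n = 10 tosses) with a uniform Beta(1, 1) prior; changing a and b reproduces the other panels.

from scipy import stats

# Conjugate beta-Bernoulli update (2) for k = 7 heads in n = 10 tosses,
# with a uniform Beta(1, 1) prior (panel (c) of Figure 1).
a, b = 1.0, 1.0
k, n = 7, 10
post = stats.beta(k + a, n - k + b)            # posterior Beta(k + a, (n - k) + b)

post_mean = post.mean()                        # (k + a)/(n + a + b) = 8/12
post_mode = (k + a - 1) / (n + a + b - 2)      # 7/10, same as the ML estimate here
print(post_mean, post_mode, post.interval(0.95))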
[Figure 1: four panels over θ ∈ [0, 1]: (a) the likelihood function, with θ̂ML = 0.7; (b) Beta(0.5, 0.5) prior and Beta(7.5, 3.5) posterior, θ̂MAP ≈ 0.7222, θ̂PM ≈ 0.6818; (c) Beta(1, 1) prior and Beta(8, 4) posterior, θ̂MAP = 0.7, θ̂PM ≈ 0.666; (d) Beta(3, 3) prior and Beta(10, 6) posterior, θ̂MAP ≈ 0.643, θ̂PM = 0.625.]

Figure 1: Bayesian analysis of Bernoulli data (k = 7 successes out of n = 10 trials) with various choices
of beta prior distributions.
From the posterior (2), the posterior mode (MAP) and posterior mean estimators are easily shown to be

θ̂MAP(y) = (k + α − 1)/(n + α + β − 2),   θ̂PM(y) = (k + α)/(n + α + β)   (3)

If we compare these to the maximum likelihood estimate θ̂ML (y) = k/n we see that:
• The posterior mode adds an additional (α − 1) heads and (β − 1) tails;
• the posterior mean adds an extra α heads and β tails.
So, as we determined above, the prior hyperparameters α and β can be interpreted as additional data
points we have observed before doing our experiment. We can easily see that the posterior mode is
same as ML if α = β = 1, i.e., if we choose a uniform prior over θ. In contrast, the posterior mean can
never be made equivalent to the maximum likelihood estimator. To see how the two estimators differ,
consider the case when k = 0 (i.e., our data sequence contains no heads); then
θ̂ML (y) = 0
which says that we will predict the probability of seeing future heads to be zero, while for the posterior
mean
θ̂PM(y) = α / (n + α + β) > 0
which assigns a non-zero probability to seeing a head in future data. This is obviously more sensible
– for example, tossing a coin twice and seeing two tails should not lead us to believe that we will
never see a head in the future! In fact, if α > 1 and β > 1, then both the posterior mean and mode
estimators will act by shrinking the maximum likelihood estimator θ̂ML towards our choice of prior
mean

E[θ] = α / (α + β)
with the degree of shrinkage increasing for large α, β. This is a very common theme with Bayesian
estimates – they tend to shrink the maximum likelihood estimate (the information in the data) towards
prior mean (the information encoded by our prior beliefs). In fact, this can be made even more explicit
in the case of the Bernoulli by re-writing the posterior mean (3) as a weighted average of (i) the prior
mean (our prior guess of θ) and (ii) the ML estimate (what our data say about θ):

θ̂PM(y) = w · α/(α + β) + (1 − w) · θ̂ML(y)

where

w = (α + β) / (n + α + β)
is the amount of weight we put on the prior mean, and (1 − w) is therefore the weight put on the
sample mean. Looking at this formula we can determine that:
• The larger the prior hyperparameters α + β, the more weight that is placed on the prior mean;
if we remember their interpretation as extra data, increasing α and β is equivalent to saying we
have seen more evidence a priori to believe the true θ is close to our prior guess.
• The larger the sample size n, the more weight that is placed on the ML estimate; that is, as
the sample size grows we start to ignore our prior guess, and if n → ∞, we ignore the prior
completely and use only the information in the data.
This is a very elegant and neat example of the idea of weighing up evidence against our prior beliefs
that occurs in most Bayesian procedures. In general, the formulas are not so simple, but a similar idea
applies – the posterior will contain a weighted mixture of information from our data with information
encoded in our prior distribution, with the relative weights determined by the sample size n and how
strongly our prior distribution is concentrated around our best prior guess.
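The following Python sketch illustrates this behaviour: holding the observed proportion of heads fixed at 0.7 and increasing n, the weight w on the (illustrative) Beta(3, 3) prior mean shrinks and the posterior mean approaches the ML estimate.

# Posterior mean as the weighted average of prior mean and ML estimate.
def posterior_mean(k, n, a, b):
    w = (a + b) / (n + a + b)                  # weight placed on the prior mean
    return w * a / (a + b) + (1 - w) * k / n

# Hold the observed proportion of heads at 0.7 and grow n: the weight on
# the Beta(3, 3) prior mean (0.5) shrinks and the estimate approaches 0.7.
for n in [10, 100, 1000, 10000]:
    print(n, posterior_mean(0.7 * n, n, a=3, b=3))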
Figure 1(a) shows the likelihood function for our data sample (k = 7 heads in n = 10 tosses) plotted
as a function of θ. Note that this curve shows the probability assigned to the data string y by
p(y | θ) for different values of θ, and as such, the curve does
not represent a probability density. The maximum likelihood estimate of θ is seen to be 7/10 = 0.7.
Figure 1(b) shows a Beta(1/2, 1/2) prior, which encodes a prior belief that the coin is more likely to
be biased (towards either heads or tails) than it is to be fair, along with the resulting posterior. We
note even in this case, the posterior mean is slightly shrunk towards 1/2 (the prior guess at θ). Figure
1(c) shows the posterior obtained when we use a uniform prior on θ (any value of θ believed to be a
priori equally likely). In this case the posterior mode coincides exactly with the maximum likelihood
estimate, though the posterior mean is shrunk more towards 1/2. Finally, Figure 1(d) shows the posterior
obtained when we choose a Beta(3, 3) prior distribution, which encodes the belief that the coin is less
likely to be biased (towards zero or one) than unbiased. Comparing this posterior to the one obtained
using the uniform prior (Figure 1(c)), we see that the posterior distribution is “pulled” towards 1/2,
and the posterior mean and mode are both shrunken towards 1/2 (our prior guess). This is because
our prior encodes more strongly the belief that θ will be around 1/2.
4 Bayes Estimators
4.1 Loss Functions and Bayes Estimators
We now examine Bayesian point estimation in a more formal fashion. Given that we have observed some data sample y, the
posterior distribution p(θ | y) tells us how likely it is that different values of θ are to be the (unknown)
population parameter. The more posterior probability assigned to a value of θ, the more likely we
believe that it will be the true value of the population parameter. We can use this information, coupled
with the idea of a loss function, to define a generic class of estimation methods.
The idea is as follows: we first choose to measure how badly an estimate performs by using a loss
function L(θ, θ̂). This tells us how far some estimate θ̂ is from the true value θ. We then define a
Bayes estimator as one that minimises the posterior expected loss.
Definition 1. Let p(θ | y) denote a posterior distribution and L(θ, θ̂) a loss function; the Bayes
estimator relative to this loss function is then given by solving
θ̂L = arg min_{θ′} ∫_Θ L(θ, θ′) p(θ | y) dθ    (4)
That is, given a data sample, we form the posterior distribution which contains all the information we
have about the true values of θ, and then look for a guess at θ that has the smallest average loss, the
average being taken with respect to the posterior distribution (i.e., our updated beliefs over what are
the true values of θ). In other words, we use the value of θ that has the smallest expected loss, given
that we have observed data y, as our best guess. This approach reveals a very interesting strength of
the Bayesian choice. In particular, we see that Bayesian inference decouples
1. the inference problem, which is completely determined by the posterior distribution p(θ | y);
2. and the decision/estimation problem, which is determined by choosing a loss function that mea-
sures how badly making the incorrect decision will be penalized.
In contrast to the frequentist approach of estimator design in which one tends to first choose a loss,
and then tries to design an estimator so that the risk will be small, the Bayesian approach gives an
exact formula (4) for finding an estimator. This Bayes estimator formula looks complex, as we have
to minimise an expectation with respect to a complex posterior distribution. Luckily, for certain loss
functions the solution is known to have a reasonably simple form; in particular:
• If L(θ, θ̂) = (θ − θ̂)2 (i.e., squared-error loss) then
θ̂L (y) = E [θ | y]
which is just the posterior mean. So the posterior mean minimises the posterior expected squared-
error.
• If L(θ, θ̂) = |θ − θ̂| (i.e., absolute-error loss) then

θ̂L(y) = median(θ | y)

which is just the posterior median. So the posterior median minimises the posterior expected
absolute error.
For other more general loss functions, the solution is harder to find, but can usually be solved numer-
ically.
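As a sketch of such a numerical solution, the following Python fragment minimises the posterior expected loss (4) over a grid of candidate values under absolute-error loss, using an illustrative Beta(8, 4) posterior; as expected, the minimiser agrees with the posterior median.

import numpy as np
from scipy import stats

# Numerically solving (4) under absolute-error loss on a grid of candidate
# estimates, using an illustrative Beta(8, 4) posterior.
theta = np.linspace(0.001, 0.999, 999)
posterior = stats.beta(8, 4).pdf(theta)
posterior /= posterior.sum()

# Posterior expected loss of each candidate estimate t, then its minimiser.
exp_loss = np.array([np.sum(np.abs(t - theta) * posterior) for t in theta])
print(theta[np.argmin(exp_loss)])              # should match the posterior median
print(stats.beta(8, 4).median())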
4.2 Are Bayes Estimators Good?
By choosing a prior distribution π(θ) over Θ and loss function L(θ, θ̂) we can generate an infinite
number of different estimators by solving (4). A natural question might be: are these Bayes estimators
good in some sense? The short answer to this question is yes; the long answer can be framed in terms
of the language of risk and admissibility. In particular, we have the following result.
Theorem 1. A Bayes estimator θ̂L that solves (4) using prior π(θ) and loss L(θ, θ̂) is always
an admissible estimator under the chosen loss.
This theorem says that by choosing a prior (any prior!) and a loss function (any loss function), solving
(4) will lead to an estimator that cannot be outperformed for every value θ ∈ Θ by any other estimator.
There will always be some area of Θ (intuitively, the areas that the prior distribution puts most of
its probability) for which the Bayes estimator will do better than any other estimator. This means,
for example, that the posterior mean is always admissible under squared-error loss, and the posterior
median is always admissible under absolute-error loss, regardless of the choice of prior! This leads to
the following corollary.
Corollary 1. If an estimator θ̂ is equivalent to a Bayes estimator θ̂L for some choice of prior
π(θ) and loss L(θ, θ̂), then θ̂ is admissible under that loss.
This result tells us that if we have any estimator θ̂ that we have proposed, we can prove it is admissible
if we can find a Bayes estimator that has the same form. This is a commonly used tool to establish
admissibility. This result quantifies why being admissible is both strong (it means there is no other
method that is everywhere better than our method), but also quite weak (because there are an infinite
number of different admissible procedures – one for each possible prior distribution). Bayes estimators
have a special link to Bayes risk, i.e., the average risk (performance) of an estimator, the average being
taken with respect to a weighting function π(θ) over the parameter space Θ.
Theorem 3. A Bayes estimator θ̂L based on prior π(θ) and loss L(θ, θ̂) always minimises the
Bayes risk for that loss with weighting function equal to the prior, i.e.
Rπ(θ̂L) = min_{θ̂′} ∫_Θ π(θ) R(θ, θ̂′) dθ
That is, amongst all possible estimators, the Bayes estimator will have the smallest average risk if
the population parameter follows the distribution π(θ). So we can view the “weighting function” (see
Lecture 1) as a prior distribution; if our true population parameter is generated from θ ∼ π(θ) then the
Bayes estimator will do the best, on average, at estimating θ. A final theorem relates Bayes estimators
to minimax estimators (i.e., estimators that minimise the worst case performance amongst all possible
estimation methods):
Theorem 4. If a Bayes estimator has constant risk (with respect to the population parameter),
then it is minimax.
This says that if we have a Bayes estimator for which the performance (risk) does not depend on the
true value of the population parameter, then that estimator will be minimax. Again, this technique
is used frequently to prove minimaxity of estimators.
Example 2: Estimation of Bernoulli
Consider the Bernoulli model once more, now with a Beta(√n/2, √n/2) prior on θ. The resulting
posterior mean (Bayes) estimator is

θ̂(y) = (k + √n/2) / (n + √n)

and a direct calculation shows that its risk under squared-error loss is

R(θ, θ̂) = 1 / (4(1 + √n)²)

which is independent of the population success probability θ; that is, regardless of the true population
success probability, the average squared-error we would obtain from using this estimator to estimate θ
from data samples drawn from the population would always be the same. As this estimator is a Bayes
estimator with constant risk, Theorem 4 above tells us it must be minimax. This demonstrates the
power of these tools, and why Bayes results are used even by people who are not explicitly Bayesians
to justify their estimators.
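The constant-risk claim is easy to check numerically; the following Python sketch computes the squared-error risk of the estimator above by summing over the binomial distribution of k, for several values of θ.

import numpy as np
from scipy import stats

# Numerical check that the estimator (k + sqrt(n)/2)/(n + sqrt(n)) has
# squared-error risk that does not depend on theta (constant risk).
n = 25
k = np.arange(n + 1)
est = (k + np.sqrt(n) / 2) / (n + np.sqrt(n))
for theta in [0.1, 0.5, 0.9]:
    pmf = stats.binom.pmf(k, n, theta)         # distribution of k given theta
    risk = np.sum(pmf * (est - theta)**2)      # E[(est - theta)^2]
    print(theta, risk)                         # all equal 1/(4*(1 + sqrt(n))^2)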
In contrast, the posterior mode (the MAP estimator) does not, in general, correspond to any Bayes
estimator, and is therefore not necessarily admissible.
Yet in practice, the posterior mode is often used – perhaps more frequently than the posterior mean
or median. Why? One reason is philosophical: the mode looks for the most likely estimate of the
population parameter, given the data sample y and our prior beliefs. It tells us which value of θ we
think is the most likely to be true, in a sense, and avoids the need to specify some measure of loss.
Another reason is purely practical; recall the formula for the posterior distribution
p(θ | y) = p(y | θ)π(θ) / p(y),

where

p(y) = ∫_Θ π(θ)p(y | θ) dθ
is the marginal probability of the data sample y. The problem is that the posterior mean and median
(and many other Bayes estimators) require the marginal distribution to be computed, and the marginal
distribution is in general very hard (even potentially impossible) to compute analytically, and is even
difficult to compute through numerical methods. It is a nasty integral to deal with, in general! The
posterior mode does not require knowledge of the marginal distribution, which is why it is often utilised.
To understand why, we can note that

p(θ | y) = p(y | θ)π(θ) / p(y) ∝ p(y | θ)π(θ),
because p(y) does not depend on θ. So therefore, the value of θ that maximises the product p(y | θ)π(θ)
is the same value of θ that maximises p(θ | y). This means that the posterior mode can be equivalently
found by solving
θ̂MAP = arg max_θ {p(y | θ)π(θ)}
which avoids the need to compute p(y). All we need to do is write down the product of the likelihood
p(y | θ) and the prior π(θ), and maximise this expression for θ.
Recall that the maximum likelihood estimator is found by solving

θ̂ML(y) = arg max_θ {p(y | θ)}.

That is, we say that the value of θ that maximises the probability of our data sample y is considered
our best guess at the true, unknown population parameter. We can see that this is closely related to
the posterior mode: taking negative logarithms of p(θ | y) and ignoring terms that are constant in θ
(i.e., the − log p(y) term) yields

θ̂MAP(y) = arg min_θ {− log p(y | θ) − log π(θ)}
which is equivalent to minimising the negative log-likelihood plus a penalty function. So the poste-
rior mode/MAP estimator can be interpreted as penalized maximum likelihood, with the negative-
logarithm of the prior distribution being the penalty applied to parameter θ. The smaller the prior
probability assigned by π(θ), the less likely we believe θ to be the true population parameter, and the
larger the penalty will be. This idea can be used later in the subject to connect Bayesian inference and
penalized regression. This form also lets us show that Bayesian inference is very similar to maximum
likelihood for large amounts of data. Assume that our prior π(θ) does not depend on the sample size
(i.e., none of the hyperparameters depend on n), and assume our data are independently distributed.
Then, we see that the MAP estimator
θ̂MAP(y) = arg min_θ { − Σ_{i=1}^{n} log p(yi | θ) − log π(θ) }
for large n is dominated by the likelihood term, which grows in magnitude with increasing sample
size. That is, for large sample sizes and prior distributions not depending on n, Bayesian estimation
and maximum likelihood converge (become the same) – this is because the information provided by
the data outweighs any information we have encoded in our prior beliefs, and eventually completely
swamps the effects of our prior distribution. This is comforting, as it tells us for large amounts of data
our prior beliefs no longer affect our inferences, which is intuitively sensible. We eventually want the
“data to speak for themselves”, as it were, regardless of our prior beliefs.
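To tie this together, the following Python sketch computes a MAP estimate as penalized maximum likelihood, numerically minimising the negative log-likelihood plus the negative log-prior; the Bernoulli model with an illustrative Beta(3, 3) prior and a hypothetical sample matching Figure 1(d) are assumptions for the sketch.

import numpy as np
from scipy import optimize, stats

# MAP as penalized maximum likelihood: minimise the negative log-likelihood
# plus the penalty -log pi(theta). Bernoulli model, illustrative Beta(3, 3) prior.
y = np.array([1, 1, 1, 0, 1, 1, 0, 1, 0, 1])   # hypothetical sample: k = 7, n = 10

def neg_log_posterior(theta):
    nll = -np.sum(stats.bernoulli.logpmf(y, theta))  # -log p(y | theta)
    penalty = -stats.beta.logpdf(theta, 3, 3)        # -log pi(theta)
    return nll + penalty

res = optimize.minimize_scalar(neg_log_posterior, bounds=(1e-6, 1 - 1e-6),
                               method="bounded")
print(res.x)   # ~ (k + 3 - 1)/(n + 6 - 2) = 9/14, matching Figure 1(d)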