Lecture 4 Notes
1 Introduction
1.1 Recap: Frequentist Inference
In last year's subject, FIT2086, we learned about what is commonly referred to as the frequentist
approach to statistical inference. The approach takes its name from the fact that it follows a very
strict frequency-based interpretation of probability; that is, probabilities are only well defined if they
represent the frequencies with which events would occur under repeated sampling/experimentation,
i.e., if we repeat an experiment enough times, the observed frequency with which events occur would
match the theoretical probabilities. In our frequentist framework we had a model for our data described
by a probability distribution
p(y | θ)
where θ were some parameters that described the population we were studying/modelling. The fun-
damental assumption that we made was that θ was unknown, but fixed. That is, we did not know the
population parameters, but we assumed they were equal to some fixed value. From this assumption,
we learned a complete theory of inference, including methods for point estimation (choosing a good
value of θ from the data), interval estimation (quantifying how uncertain we were about our estimates
of θ) and hypothesis testing (testing to see if certain hypotheses about the population were supported
by the data); i.e.,
• Parameter estimation. We learned about the maximum likelihood estimator, which is a standard
frequentist point estimation technique:
We choose the value of θ that maximises the probability of seeing our observed data y as the
best estimate of the unknown population parameter.
• Interval estimation. We learned about the idea of confidence intervals to quantify the uncertainty
we have about our estimates; in particular, we can produce a 100(1 − α)% confidence interval
(θ− (y), θ+ (y)) such that
P (θ ∈ (θ− (y), θ+ (y))) = 1 − α
under repeated sampling of y, where 1 − α is the confidence level. That is, if we repeatedly sample
from the population, and use the above rule with α = 0.05 to compute an interval for θ using
the sample y, then for 95% of possible datasets, our confidence interval will contain the fixed,
but unknown θ.
• Hypothesis testing. We learned about the idea of computing p-values to quantify how much
evidence the data present against a proposed null hypothesis. This could be interpreted as
the probability of seeing data with a deviation from our null hypothesis as extreme as, or more
extreme than, that of the data we have observed, if the null hypothesis were true.
Some of the concepts were difficult to grasp, because they are not particularly intuitive; this springs
from the assumption that θ is fixed. We will now look at a completely different branch of inference
that has much simpler interpretations, due to the relaxation of the “fixed θ” assumption.
1.2 Bayes' Rule
The Bayesian approach is built on Bayes' rule. For two discrete random variables X and Y,

P(X = x | Y = y) = P(Y = y | X = x)P(X = x) / P(Y = y)

where

P(Y = y) = Σ_{x∈X} P(Y = y | X = x)P(X = x)

is the marginal distribution of Y, i.e., the probability of Y = y irrespective of what value X takes.
Example 1: Example use of Bayes’ Rule
To see how Bayes' rule can be used, and to get a basic idea of what it does, let us consider the
following standard example. Imagine that a woman attends a GP clinic regarding a lump she has felt in her
breast; she is concerned she may have breast cancer. The GP knows that the population frequency
of breast cancer (C = 1) is 0.0066 (our prior probability); that is, without knowing anything about
the woman other than that she comes from the Australian population, the GP knows there is a 0.0066
probability she has breast cancer. From a study, the GP also knows that (this is made up!) the
probability of developing a breast lump (L = 1) if:
• a woman has breast cancer (C = 1) is 60%;
• a woman does not have breast cancer (C = 0) is 5%.
So a woman with breast cancer is quite likely to develop a lump, but a woman without breast cancer
can also develop a benign lump. Bayes’ rule lets us combine these probabilities together to get
P(C = 1 | L = 1):
P(C = 1 | L = 1) = P(L = 1 | C = 1)P(C = 1) / [P(L = 1 | C = 0)P(C = 0) + P(L = 1 | C = 1)P(C = 1)]
             = (0.6 · 0.0066) / (0.05 · (1 − 0.0066) + 0.6 · 0.0066)
             ≈ 0.0738
To summarise, before seeing the lump, the GP's estimate of P(C = 1) was 0.0066. Bayes' rule lets us take
the evidence (the value of L) and revise this probability to 0.0738; so the chance that the woman
has breast cancer is now more than 10 times greater after observing the fact that she has a breast lump
than before the GP knew anything about her condition. This is the essence of Bayes' rule; it lets us
update our knowledge using evidence.
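To make this concrete, here is a minimal Python sketch of the same calculation; the probabilities are exactly those quoted above.

# Bayes' rule for Example 1: computing P(C = 1 | L = 1).
prior_c = 0.0066           # P(C = 1), the prior probability of cancer
p_lump_cancer = 0.60       # P(L = 1 | C = 1)
p_lump_no_cancer = 0.05    # P(L = 1 | C = 0)

# Marginal probability of a lump, P(L = 1).
p_lump = p_lump_cancer * prior_c + p_lump_no_cancer * (1 - prior_c)

# Posterior probability of cancer given a lump.
posterior_c = p_lump_cancer * prior_c / p_lump
print(posterior_c)         # approximately 0.0738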
2 Bayesian Inference
2.1 Problem Setup
How can we use Bayes' rule to perform statistical inference? In the Bayesian framework, we have the
following ingredients:
• a probability model (the likelihood) p(y | θ), which describes the probability of observing the data y if the population parameter takes the value θ;
• a prior distribution π(θ), which describes our beliefs about the value of the population parameter before we observe any data; the unknown parameter θ is treated as a random variable.
As we can see, the primary difference between frequentist and Bayesian statistics is that we now treat
the unknown parameter as a random variable. The advantage of this approach is that it allows us to
make probabilistic statements about θ.
2.2 The Posterior Distribution
Applying Bayes' rule to the unknown parameter θ gives

p(θ | y) = p(y | θ)π(θ) / p(y) ∝ p(y | θ)π(θ)    (1)

where

p(y) = ∫_Θ p(y | θ)π(θ) dθ
is the marginal distribution of the data y arising, irrespective of the value of the population parameter
(only assuming that it follows the prior distribution). The quantity (1) is called the posterior distri-
bution and is the central inferential quantity in Bayesian inference. Everything else flows from this
quantity. To summarise, in the Bayesian framework we can interpret the different quantities as:
• π(θ) is the prior probability of model θ generating the data, i.e., the probability that the true
population parameter takes on the value θ;
• p(y | θ) is the probability of seeing a data sample y from our population if the population value
is θ;
• p(θ | y) is the conditional (posterior) probability of the population parameter being θ, given that
we have observed the data sample y.
The posterior distribution p(θ | y) is the central component in Bayesian inference. It is important then
to understand how to interpret it; essentially, if our prior distribution, π(θ), accurately describes the
probability that θ is the true population parameter, then the posterior probability
P(θ ∈ A | y) = ∫_A p(θ′ | y) dθ′
is the probability the population parameter is in the set A, given that we observed the data sample
y = (y1 , . . . , yn ). This obviously says that areas of the posterior distribution which have higher
probability density are the values of θ that are more likely to be the true population parameter. This
offers a very nice, clean interpretation – we don't need ideas of "confidence" or repeated sampling.
Instead, the posterior takes the data sample we have observed, and uses it to update or refine our
prior beliefs π(θ) about the value of true, underlying model θ. This is a very elegant approach as it
naturally combines the evidence in the data about θ with any prior beliefs about θ that we might have.
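As a small illustration of how the posterior can be computed in practice, the following Python sketch approximates (1) by evaluating the likelihood and prior on a grid of θ values and normalising numerically; the Bernoulli model, uniform prior and data here are purely illustrative assumptions, not part of the notes.

import numpy as np

# Grid approximation of a posterior: evaluate p(y | theta) * pi(theta) on a
# grid and normalise. Model, prior and data are illustrative assumptions.
theta = np.linspace(0.001, 0.999, 999)         # grid over the parameter space
prior = np.ones_like(theta)                    # a uniform prior pi(theta)

y = np.array([1, 0, 1, 1, 0, 1, 1])            # hypothetical Bernoulli sample
k, n = y.sum(), len(y)
likelihood = theta**k * (1 - theta)**(n - k)   # p(y | theta)

# Posterior is proportional to likelihood times prior, as in (1);
# normalising over the grid plays the role of dividing by p(y).
posterior = likelihood * prior
posterior /= posterior.sum()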
2.3 The Prior Distribution
The prior distribution π(θ) can be interpreted in (at least) two different ways:
• As a subjective description of prior beliefs about θ. In contrast to the frequentist interpretation of
probability as the frequency of events under repeated sampling, the idea of subjective probability
is about quantifying how strongly you believe something to be potentially true. The odds of a
team winning a soccer match are an example of subjective probability – they reflect the gambler's
beliefs about which team is more likely to win, and how likely they are to win. They cannot have
a frequency interpretation because once the game is played, it can never be played again – so
repeated sampling is not a possibility. Subjective probability lets us model such beliefs.
• We can also view the prior probability as a model of a truly random process. For example, imagine
we are working for a company that has factories in many countries. They start a new factory
and want to estimate the failure rate of components being manufactured in this new plant. We
can model the collection of different factories as a population in its own right, and the different failure
rates in different factories as a random process (being influenced by factors such as quality of
training of local labour and other local features), and use this as a model of the prior probability
over the failure rates. In this case our prior is a model of variability in manufacturing processes
between countries.
Over the years, frequentists have usually attacked Bayesianism by targeting the prior distribution.
The claim is that frequentist statistics is free of “personal priors” and that Bayesian inference can
be unduly influenced by the choice of prior distribution. An obvious counter-argument is that the
frequentist approach hinges crucially on accepting the probability model of the population p(y | θ) as a
true model of reality – when obviously, all models are likely to be inaccurate representations of reality.
However, such a criticism does prompt the question: where do prior distributions come from? Over one
hundred years of research in Bayesian statistics has gone into trying to answer this question;
some approaches include
1. They are chosen to reflect prior information/beliefs about the problem. This was the original motivation
for priors – the statistician must use whatever real prior information they have available (perhaps
gathered from experts, perhaps their opinions) to form the prior. More recently, the idea of prior
information has been refined to include more “general” beliefs about the underlying population
– e.g., the population parameter is likely to be “large” or “close to zero”. We will see some
examples of this kind later on.
2. They can be chosen for mathematical convenience; this is often done so that the choice of prior
distribution π(·) leads to a simple equation for the posterior distribution.
3. A lot of work has gone into trying to create prior distributions that express prior ignorance,
rather than knowledge; i.e., expressing the concept that “I know nothing”. These are sometimes
called uninformative priors, and are created by defining a mathematical concept of ignorance,
and generating prior distributions from this concept.
4. Finally, we can choose priors so that our Bayesian method ends up matching classical procedures;
for example, the Bayesian LASSO or the Bayesian ridge regression are Bayesian versions of these
classical, non-Bayesian procedures.
Of course, these ideas are not mutually exclusive – one can (and often does) combine different ap-
proaches, i.e., we might choose a convenient prior distribution in such a way that it (at least partially)
reflects real prior information or prior beliefs.
2.4 Using the Posterior for Inference
Once we have obtained a posterior distribution using (1), the obvious question is how to use it for
inferential purposes. Let us first examine how we might obtain point estimates (i.e., a best guess of θ)
using the posterior. In general, point estimates are statistics of the posterior distribution. Two very
common choices are:
• The posterior maximum (MAP – maximum a posteriori estimator). The MAP estimator is found
by using the value of θ that maximises the posterior probability density, i.e.,

θ̂MAP = arg max_θ {p(θ | y)}

In this approach we try to select the "most likely" value of θ, conditional on the particular data
sample y that we have observed.
• An alternative approach is to use the “average” value of θ from the posterior distribution. This
is called the posterior mean estimator and is given by
θ̂PM = ∫_Θ θ p(θ | y) dθ = E[θ | y]

This approach uses the posterior average value of θ as the best guess of the unknown population
parameter; both estimators are illustrated in the sketch following this list.
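As a quick numerical sketch of these two estimators, the following Python fragment computes both from a posterior evaluated on a grid; the Beta(8, 4) posterior used here is purely illustrative (it is the one that arises in the coin-tossing example later in these notes).

import numpy as np
from scipy import stats

# Point estimates from a posterior evaluated on a grid; a Beta(8, 4)
# posterior is used purely for illustration.
theta = np.linspace(0.001, 0.999, 999)
posterior = stats.beta(8, 4).pdf(theta)
posterior /= posterior.sum()                   # normalise over the grid

map_est = theta[np.argmax(posterior)]          # posterior mode (MAP)
pm_est = np.sum(theta * posterior)             # posterior mean E[theta | y]
print(map_est, pm_est)                         # ~0.700 and ~0.667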
There are a large variety of other approaches, and we will examine them more formally in Section 4. In
general, however, all Bayesian estimates have the property that they combine the information encoded
in the prior distribution with information encoded in the likelihood (i.e., from the observed data) to
produce a best guess of the population parameter. As in frequentist statistics, point estimates give us a
best “guess” at the population parameter values; they tell us nothing about the variability/uncertainty
in our guess. These aspects can also be naturally measured using the posterior distribution. In
particular, one way to measure the uncertainty about our point estimate is to use the posterior standard
deviation:

√V[θ | y]
This quantity tells us how much the posterior distribution concentrates probability around the posterior
mean. The more information in your posterior distribution, the smaller (less uncertainty) the posterior
standard deviation will be. The difference in Bayesian statistics is that the information in the posterior
can come from two sources: the information in the sample, which grows with increasing n, or the
“information” in the prior, which can be increased by making our prior beliefs more strict or exact.
We can also generate interval estimates to capture uncertainty in our posterior inferences in a
similar fashion to confidence intervals. The Bayesian equivalent of a confidence interval is called a
credible set. A 100α% credible interval is any interval (θ̂−, θ̂+) of the parameter space Θ such that

P(θ̂− < θ < θ̂+ | y) = ∫_{θ̂−}^{θ̂+} p(θ | y) dθ = α

where α ∈ (0, 1) is the level of the set. That is, the posterior probability that the population parameter
θ is in the interval (θ̂−, θ̂+) is α. In this way credible sets have a different – and cleaner – interpretation
than a confidence interval. A 100α% confidence interval is some interval such that for 100α% of possible
data samples from the population that we might see, the interval will contain the (fixed) unknown
true θ. In contrast, a 100α% credible interval says that if our prior is accurate, then the probability
that θ ∈ (θ̂− , θ̂+ ) is α, given we have observed the data sample y. We can interpret this directly as a
probability, which makes things a lot cleaner.
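As an illustration, the following Python sketch computes an equal-tailed 95% credible interval from a posterior evaluated on a grid; the Beta(8, 4) posterior is again purely illustrative.

import numpy as np
from scipy import stats

# Equal-tailed 95% credible interval from a grid posterior (alpha = 0.95);
# the Beta(8, 4) posterior is purely illustrative.
theta = np.linspace(0.001, 0.999, 999)
posterior = stats.beta(8, 4).pdf(theta)
posterior /= posterior.sum()

cdf = np.cumsum(posterior)                     # numerical posterior CDF
lo = theta[np.searchsorted(cdf, 0.025)]        # 2.5% posterior quantile
hi = theta[np.searchsorted(cdf, 0.975)]        # 97.5% posterior quantile
print(lo, hi)                                  # ~ stats.beta(8, 4).interval(0.95)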
2.5 Bayesian Inference – The Different Components
We conclude with a re-cap/summary of the different components of Bayesian inference.
• The likelihood p(y | θ) describes the probability of seeing data y, if the population parameter
was θ. It is the probability model that we want to use to model our population.
• The prior distribution π(θ) describes the probability that the population parameter is θ, before
we have observed any data from our population.
• Together, these form a joint distribution p(y, θ) = p(y | θ)π(θ) which describes the probability
that we would see sample y and the population parameter is θ.
• From this joint distribution, we can find the posterior distribution p(θ | y), which describes the
probability that θ is the population parameter, given we have observed the particular data sample
y. This is the quantity we use to perform inference as it describes how likely different values of
θ are to be the true population parameter, after taking into account our prior beliefs and the
evidence in the data sample.
• The marginal distribution p(y) describes the probability of observing a data sample y if all we
know about the population parameter is that it follows our prior distribution π(θ).
3 Example: Bayesian Inference with the Bernoulli Distribution
Suppose we model a sequence of n coin tosses y = (y1, . . . , yn), with yj = 1 denoting a head, as
yj ∼ Be(θ), so that

p(y | θ) = θ^k (1 − θ)^{n−k}

where θ ∈ [0, 1] is the probability of the coin coming up a head when we toss it, and k = Σ_{j=1}^{n} yj
is the number of observed heads. The maximum likelihood estimate of θ, found by maximising the
likelihood w.r.t. θ, is well known to be

θ̂ML(y) = k/n

so the maximum likelihood estimate is just the
observed proportion of heads in our data sample. To undertake a Bayesian analysis of this problem we
need: (i) a probability model, and (ii) a prior distribution. We have the former; as above, we assume
that yj ∼ Be(θ). We then need a prior for θ. We will choose something called the beta distribution:

p(θ | α, β) = θ^{α−1} (1 − θ)^{β−1} / B(α, β)

where B(α, β) is a special function called the beta function. The beta distribution is often used to
model proportions or probabilities, as it is supported on the range [0, 1]. It is frequently used as a
prior in Bayesian analysis because it is both mathematically convenient and quite flexible. The values
of α and β control the shape of the distribution, and therefore also control our prior beliefs about θ.
These types of parameters – ones that control the shape of our prior probability distributions – are
usually called hyperparameters. They are not strictly parameters of our model, but they are instead
parameters of our prior beliefs. To see how the beta distribution can encode a variety of prior beliefs,
we first note that the mean of the beta distribution is

E[θ] = α / (α + β).
The mean of the beta distribution can be viewed as our "best guess" of the population parameter before
seeing any data. It is our best prior guess, and by setting α, β we can control the prior expected value
of θ, and therefore our prior guess. So the prior mean lets us set our best guess – but how confident
are we in that guess? The variance of the beta distribution is
V[θ] = αβ / ((α + β)² (α + β + 1)).
The prior variance tells us how much probability is concentrated around our prior mean; in this sense
we can use it to control how confident we are in our prior guess. Varying α, β lets us modify the
strength of prior belief about E [θ]; the larger α, β, the smaller the prior variance and the stronger
our prior belief that θ is close to α/(α + β) will be. The smaller α and β are, the larger the prior
variance and the less sure we are about our prior guess. If α = β = 1 the beta distribution reduces
to a special case: the uniform distribution over θ, which says that, a priori, we believe every value of
the success probability θ is equally likely. This is in some sense more an expression of prior ignorance
than of belief.
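The following small Python sketch evaluates these two formulas for a few (purely illustrative) choices of α and β, showing how larger hyperparameters concentrate the prior around its mean.

# Prior mean and variance of a Beta(a, b) prior for several illustrative
# hyperparameter settings; larger a + b means a tighter, more confident prior.
for a, b in [(0.5, 0.5), (1, 1), (3, 3), (30, 30)]:
    mean = a / (a + b)
    var = a * b / ((a + b)**2 * (a + b + 1))
    print(f"Beta({a},{b}): prior mean = {mean:.3f}, prior variance = {var:.4f}")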
Multiplying the likelihood by the prior gives

p(θ | y) ∝ θ^k (1 − θ)^{n−k} · θ^{α−1} (1 − θ)^{β−1} = θ^{(k+α)−1} (1 − θ)^{((n−k)+β)−1}

which, after normalisation (see the solutions for Studio 5) is a beta distribution of the form
θ | y ∼ Beta(k + α, (n − k) + β) (2)
The fact that the posterior distribution is also a beta distribution is because the prior distribution
is conjugate to the likelihood; this is an example of a convenient prior distribution because it leads
to a simple form for the posterior. We can see from (2) that the prior hyperparameters act as “extra
datapoints”; the α hyperparameter adds an additional α heads to the k heads actually observed in our
sample, and the β hyperparameter adds an additional β tails to the n − k tails we actually observed
in our sample.
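The following Python sketch performs this conjugate update for the data used in Figure 1 below (k = 7 heads in n = 10 tosses) with a uniform Beta(1, 1) prior; changing a and b reproduces the other panels.

from scipy import stats

# Conjugate beta-Bernoulli update (2) for k = 7 heads in n = 10 tosses,
# with a uniform Beta(1, 1) prior (panel (c) of Figure 1).
a, b = 1.0, 1.0
k, n = 7, 10
post = stats.beta(k + a, n - k + b)            # posterior Beta(k + a, (n - k) + b)

post_mean = post.mean()                        # (k + a)/(n + a + b) = 8/12
post_mode = (k + a - 1) / (n + a + b - 2)      # 7/10, same as the ML estimate here
print(post_mean, post_mode, post.interval(0.95))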
[Figure 1: four panels over θ ∈ [0, 1]: (a) the likelihood function, with θ̂ML = 0.7; (b) Beta(0.5, 0.5) prior and Beta(7.5, 3.5) posterior, θ̂MAP ≈ 0.7222, θ̂PM ≈ 0.6818; (c) Beta(1, 1) prior and Beta(8, 4) posterior, θ̂MAP = 0.7, θ̂PM ≈ 0.666; (d) Beta(3, 3) prior and Beta(10, 6) posterior, θ̂MAP ≈ 0.643, θ̂PM = 0.625.]

Figure 1: Bayesian analysis of Bernoulli data (k = 7 successes out of n = 10 trials) with various choices
of beta prior distributions.
From the posterior (2), the posterior mode (MAP) and posterior mean estimators are easily shown to be

θ̂MAP(y) = (k + α − 1)/(n + α + β − 2),   θ̂PM(y) = (k + α)/(n + α + β)   (3)

If we compare these to the maximum likelihood estimate θ̂ML (y) = k/n we see that:
• The posterior mode adds an additional (α − 1) heads and (β − 1) tails;
• the posterior mean adds an extra α heads and β tails.
So, as we determined above, the prior hyperparameters α and β can be interpreted as additional data
points we have observed before doing our experiment. We can easily see that the posterior mode is
same as ML if α = β = 1, i.e., if we choose a uniform prior over θ. In contrast, the posterior mean can
never be made equivalent to the maximum likelihood estimator. To see how the two estimators differ,
consider the case when k = 0 (i.e., our data sequence contains no heads); then
θ̂ML (y) = 0
which says that we will predict the probability of seeing future heads to be zero, while for the posterior
mean
θ̂PM(y) = α / (n + α + β) > 0
which assigns a non-zero probability to seeing a head in future data. This is obviously more sensible
– for example, tossing a coin twice and seeing two tails should not lead us to believe that we will
never see a head in the future! In fact, if α > 1 and β > 1, then both the posterior mean and mode
estimators will act by shrinking the maximum likelihood estimator θ̂ML towards our choice of prior
mean

E[θ] = α / (α + β)
with the degree of shrinkage increasing for large α, β. This is a very common theme with Bayesian
estimates – they tend to shrink the maximum likelihood estimate (the information in the data) towards
prior mean (the information encoded by our prior beliefs). In fact, this can be made even more explicit
in the case of the Bernoulli by re-writing the posterior mean (3) as a weighted average of (i) the prior
mean (our prior guess of θ) and (ii) the ML estimate (what our data say about θ):

θ̂PM(y) = w · α/(α + β) + (1 − w) · θ̂ML(y)

where

w = (α + β) / (n + α + β)
is the amount of weight we put on the prior mean, and (1 − w) is therefore the weight put on the
sample mean. Looking at this formula we can determine that:
• The larger the prior hyperparameters α + β, the more weight that is placed on the prior mean;
if we remember their interpretation as extra data, increasing α and β is equivalent to saying we
have seen more evidence a priori to believe the true θ is close to our prior guess.
• The larger the sample size n, the more weight that is placed on the ML estimate; that is, as
the sample size grows we start to ignore our prior guess, and if n → ∞, we ignore the prior
completely and use only the information in the data.
This is a very elegant and neat example of the idea of weighing up evidence against our prior beliefs
that occurs in most Bayesian procedures. In general, the formulas are not so simple, but a similar idea
applies – the posterior will contain a weighted mixture of information from our data with information
encoded in our prior distribution, with the relative weights determined by the sample size n and how
strongly our prior distribution is concentrated around our best prior guess.
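The following Python sketch illustrates this behaviour: holding the observed proportion of heads fixed at 0.7 and increasing n, the weight w on the (illustrative) Beta(3, 3) prior mean shrinks and the posterior mean approaches the ML estimate.

# Posterior mean as the weighted average of prior mean and ML estimate.
def posterior_mean(k, n, a, b):
    w = (a + b) / (n + a + b)                  # weight placed on the prior mean
    return w * a / (a + b) + (1 - w) * k / n

# Hold the observed proportion of heads at 0.7 and grow n: the weight on
# the Beta(3, 3) prior mean (0.5) shrinks and the estimate approaches 0.7.
for n in [10, 100, 1000, 10000]:
    print(n, posterior_mean(0.7 * n, n, a=3, b=3))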
Figure 1(a) shows the likelihood function for our data sample (k = 7 heads in n = 10 tosses) plotted
as a function of θ. Note that this curve shows the probability assigned to the data string y by
p(y | θ) for different values of θ, and as such, the curve does
not represent a probability density. The maximum likelihood estimate of θ is seen to be 7/10 = 0.7.
Figure 1(b) shows a Beta(1/2, 1/2) prior, which encodes a prior belief that the coin is more likely to
be biased (towards either heads or tails) than it is to be fair, along with the resulting posterior. We
note even in this case, the posterior mean is slightly shrunk towards 1/2 (the prior guess at θ). Figure
1(c) shows the posterior obtained when we use a uniform prior on θ (any value of θ believed to be a
priori equally likely). In this case the posterior mode coincides exactly with the maximum likelihood
estimate, though the posterior mean is shrunk more towards 1/2. Finally, Figure 1(d) shows the posterior
obtained when we choose a Beta(3, 3) prior distribution, which encodes the belief that the coin is less
likely to be biased (towards zero or one) than unbiased. Comparing this posterior to the one obtained
using the uniform prior (Figure 1(c)), we see that the posterior distribution is “pulled” towards 1/2,
and the posterior mean and mode are both shrunken towards 1/2 (our prior guess). This is because
our prior encodes more strongly the belief that θ will be around 1/2.
4 Bayes Estimators
4.1 Loss Functions and Bayes Estimators
We now examine Bayesian point estimation in a more formal fashion. Given that we have observed some data sample y, the
posterior distribution p(θ | y) tells us how likely it is that different values of θ are to be the (unknown)
population parameter. The more posterior probability assigned to a value of θ, the more likely we
believe that it will be the true value of the population parameter. We can use this information, coupled
with the idea of a loss function, to define a generic class of estimation methods.
The idea is as follows: we first choose to measure how badly an estimate performs by using a loss
function L(θ, θ̂). This tells us how far some estimate θ̂ is from the true value θ. We then define a
Bayes estimator as one that minimises the posterior expected loss.
Definition 1. Let p(θ | y) denote a posterior distribution and L(θ, θ̂) a loss function; the Bayes
estimator relative to this loss function is then given by solving
θ̂L = arg min_{θ′} ∫_Θ L(θ, θ′) p(θ | y) dθ    (4)
That is, given a data sample, we form the posterior distribution which contains all the information we
have about the true values of θ, and then look for a guess at θ that has the smallest average loss, the
average being taken with respect to the posterior distribution (i.e., our updated beliefs over what are
the true values of θ). In other words, we use the value of θ that has the smallest expected loss, given
that we have observed data y, as our best guess. This approach reveals a very interesting strength of
the Bayesian choice. In particular, we see that Bayesian inference decouples
1. the inference problem, which is completely determined by the posterior distribution p(θ | y);
2. and the decision/estimation problem, which is determined by choosing a loss function that mea-
sures how badly making the incorrect decision will be penalized.
In contrast to the frequentist approach of estimator design in which one tends to first choose a loss,
and then tries to design an estimator so that the risk will be small, the Bayesian approach gives an
exact formula (4) for finding an estimator. This Bayes estimator formula looks complex, as we have
to minimise an expectation with respect to a complex posterior distribution. Luckily, for certain loss
functions the solution is known to have a reasonably simple form; in particular:
• If L(θ, θ̂) = (θ − θ̂)2 (i.e., squared-error loss) then
θ̂L (y) = E [θ | y]
which is just the posterior mean. So the posterior mean minimises the posterior expected squared-
error.
• If L(θ, θ̂) = |θ − θ̂| (i.e., absolute-error loss) then

θ̂L(y) = median(θ | y)

which is just the posterior median. So the posterior median minimises the posterior expected
absolute error.
For other more general loss functions, the solution is harder to find, but can usually be solved numer-
ically.
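As a sketch of such a numerical solution, the following Python fragment minimises the posterior expected loss (4) over a grid of candidate values under absolute-error loss, using an illustrative Beta(8, 4) posterior; as expected, the minimiser agrees with the posterior median.

import numpy as np
from scipy import stats

# Numerically solving (4) under absolute-error loss on a grid of candidate
# estimates, using an illustrative Beta(8, 4) posterior.
theta = np.linspace(0.001, 0.999, 999)
posterior = stats.beta(8, 4).pdf(theta)
posterior /= posterior.sum()

# Posterior expected loss of each candidate estimate t, then its minimiser.
exp_loss = np.array([np.sum(np.abs(t - theta) * posterior) for t in theta])
print(theta[np.argmin(exp_loss)])              # should match the posterior median
print(stats.beta(8, 4).median())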
4.2 Are Bayes Estimators Good?
By choosing a prior distribution π(θ) over Θ and loss function L(θ, θ̂) we can generate an infinite
number of different estimators by solving (4). A natural question might be: are these Bayes estimators
good in some sense? The short answer to this question is yes; the long answer can be framed in terms
of the language of risk and admissibility. In particular, we have the following result.
Theorem 1. A Bayes estimator θ̂L that solves (4) using prior π(θ) and loss L(θ, θ̂) is always
an admissible estimator under the chosen loss.
This theorem says that by choosing a prior (any prior!) and a loss function (any loss function), solving
(4) will lead to an estimator that cannot be outperformed for every value θ ∈ Θ by any other estimator.
There will always be some area of Θ (intuitively, the areas that the prior distribution puts most of
its probability) for which the Bayes estimator will do better than any other estimator. This means,
for example, that the posterior mean is always admissible under squared-error loss, and the posterior
median is always admissible under absolute-error loss, regardless of the choice of prior! This leads to
the following corollary.
Corollary 1. If an estimator θ̂ is equivalent to a Bayes estimator θ̂L for some choice of prior
π(θ) and loss L(θ, θ̂), then θ̂ is admissible under that loss.
This result tells us that if we have any estimator θ̂ that we have proposed, we can prove it is admissible
if we can find a Bayes estimator that has the same form. This is a commonly used tool to establish
admissibility. This result quantifies why being admissible is both strong (it means there is no other
method that is everywhere better than our method), but also quite weak (because there are an infinite
number of different admissible procedures – one for each possible prior distribution). Bayes estimators
have a special link to Bayes risk, i.e., the average risk (performance) of an estimator, the average being
taken with respect to a weighting function π(θ) over the parameter space Θ.
Theorem 3. A Bayes estimator θ̂L based on prior π(θ) and loss L(θ, θ̂) always minimises the
Bayes risk for that loss with weighting function equal to the prior, i.e.
Rπ(θ̂L) = min_{θ̂′} ∫_Θ π(θ) R(θ, θ̂′) dθ
That is, amongst all possible estimators, the Bayes estimator will have the smallest average risk if
the population parameter follows the distribution π(θ). So we can view the “weighting function” (see
Lecture 1) as a prior distribution; if our true population parameter is generated from θ ∼ π(θ) then the
Bayes estimator will do the best, on average, at estimating θ. A final theorem relates Bayes estimators
to minimax estimators (i.e., estimators that minimise the worst case performance amongst all possible
estimation methods):
Theorem 4. If a Bayes estimator has constant risk (with respect to the population parameter),
then it is minimax.
This says that if we have a Bayes estimator for which the performance (risk) does not depend on the
true value of the population parameter, then that estimator will be minimax. Again, this technique
is used frequently to prove minimaxity of estimators.
Example 2: Estimation of Bernoulli
Consider the Bernoulli model once more, now with a Beta(√n/2, √n/2) prior on θ. The resulting
posterior mean (Bayes) estimator is

θ̂(y) = (k + √n/2) / (n + √n)

and a direct calculation shows that its risk under squared-error loss is

R(θ, θ̂) = 1 / (4(1 + √n)²)

which is independent of the population success probability θ; that is, regardless of the true population
success probability, the average squared-error we would obtain from using this estimator to estimate θ
from data samples drawn from the population would always be the same. As this estimator is a Bayes
estimator with constant risk, Theorem 4 above tells us it must be minimax. This demonstrates the
power of these tools, and why Bayes results are used even by people who are not explicitly Bayesians
to justify their estimators.
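The constant-risk claim is easy to check numerically; the following Python sketch computes the squared-error risk of the estimator above by summing over the binomial distribution of k, for several values of θ.

import numpy as np
from scipy import stats

# Numerical check that the estimator (k + sqrt(n)/2)/(n + sqrt(n)) has
# squared-error risk that does not depend on theta (constant risk).
n = 25
k = np.arange(n + 1)
est = (k + np.sqrt(n) / 2) / (n + np.sqrt(n))
for theta in [0.1, 0.5, 0.9]:
    pmf = stats.binom.pmf(k, n, theta)         # distribution of k given theta
    risk = np.sum(pmf * (est - theta)**2)      # E[(est - theta)^2]
    print(theta, risk)                         # all equal 1/(4*(1 + sqrt(n))^2)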
In contrast, the posterior mode (the MAP estimator) does not, in general, correspond to any Bayes
estimator, and is therefore not necessarily admissible.
Yet in practice, the posterior mode is often used – perhaps more frequently than the posterior mean
or median. Why? One reason is philosophical: the mode looks for the most likely estimate of the
population parameter, given the data sample y and our prior beliefs. It tells us which value of θ we
think is the most likely to be true, in a sense, and avoids the need to specify some measure of loss.
Another reason is purely practical; recall the formula for the posterior distribution
p(θ | y) = p(y | θ)π(θ) / p(y),

where

p(y) = ∫_Θ π(θ)p(y | θ) dθ
is the marginal probability of the data sample y. The problem is that the posterior mean and median
(and many other Bayes estimators) require the marginal distribution to be computed, and the marginal
distribution is in general very hard (even potentially impossible) to compute analytically, and is even
difficult to compute through numerical methods. It is a nasty integral to deal with, in general! The
posterior mode does not require knowledge of the marginal distribution, which is why it is often utilised.
To understand why, we can note that

p(θ | y) = p(y | θ)π(θ) / p(y) ∝ p(y | θ)π(θ),
because p(y) does not depend on θ. So therefore, the value of θ that maximises the product p(y | θ)π(θ)
is the same value of θ that maximises p(θ | y). This means that the posterior mode can be equivalently
found by solving
θ̂MAP = arg max_θ {p(y | θ)π(θ)}
which avoids the need to compute p(y). All we need to do is write down the product of the likelihood
p(y | θ) and the prior π(θ), and maximise this expression for θ.
Recall that the maximum likelihood estimator is found by solving

θ̂ML(y) = arg max_θ {p(y | θ)}.

That is, we say that the value of θ that maximises the probability of our data sample y is considered
our best guess at the true, unknown population parameter. We can see that this is closely related to
the posterior mode: taking negative logarithms of p(θ | y) and ignoring terms that are constant in θ
(i.e., the − log p(y) term) yields

θ̂MAP(y) = arg min_θ {− log p(y | θ) − log π(θ)}
which is equivalent to minimising the negative log-likelihood plus a penalty function. So the poste-
rior mode/MAP estimator can be interpreted as penalized maximum likelihood, with the negative-
logarithm of the prior distribution being the penalty applied to parameter θ. The smaller the prior
probability assigned by π(θ), the less likely we believe θ to be the true population parameter, and the
larger the penalty will be. This idea can be used later in the subject to connect Bayesian inference and
penalized regression. This form also lets us show that Bayesian inference is very similar to maximum
likelihood for large amounts of data. Assume that our prior π(θ) does not depend on the sample size
(i.e., none of the hyperparameters depend on n), and assume our data are independently distributed.
Then, we see that the MAP estimator
θ̂MAP(y) = arg min_θ { − Σ_{i=1}^{n} log p(yi | θ) − log π(θ) }
for large n is dominated by the likelihood term, which grows in magnitude with increasing sample
size. That is, for large sample sizes and prior distributions not depending on n, Bayesian estimation
and maximum likelihood converge (become the same) – this is because the information provided by
the data outweighs any information we have encoded in our prior beliefs, and eventually completely
swamps the effects of our prior distribution. This is comforting, as it tells us for large amounts of data
our prior beliefs no longer affect our inferences, which is intuitively sensible. We eventually want the
“data to speak for themselves”, as it were, regardless of our prior beliefs.
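To tie this together, the following Python sketch computes a MAP estimate as penalized maximum likelihood, numerically minimising the negative log-likelihood plus the negative log-prior; the Bernoulli model with an illustrative Beta(3, 3) prior and a hypothetical sample matching Figure 1(d) are assumptions for the sketch.

import numpy as np
from scipy import optimize, stats

# MAP as penalized maximum likelihood: minimise the negative log-likelihood
# plus the penalty -log pi(theta). Bernoulli model, illustrative Beta(3, 3) prior.
y = np.array([1, 1, 1, 0, 1, 1, 0, 1, 0, 1])   # hypothetical sample: k = 7, n = 10

def neg_log_posterior(theta):
    nll = -np.sum(stats.bernoulli.logpmf(y, theta))  # -log p(y | theta)
    penalty = -stats.beta.logpdf(theta, 3, 3)        # -log pi(theta)
    return nll + penalty

res = optimize.minimize_scalar(neg_log_posterior, bounds=(1e-6, 1 - 1e-6),
                               method="bounded")
print(res.x)   # ~ (k + 3 - 1)/(n + 6 - 2) = 9/14, matching Figure 1(d)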