Bayesian Inference:
A Practical Primer
Tom Loredo
Department of Astronomy, Cornell University
loredo@spacenet.tn.cornell.edu
http://www.astro.cornell.edu/staff/loredo/bayes/
Outline
• Parametric Bayesian inference
– Probability theory
– Parameter estimation
– Model uncertainty
• What’s different about it?
• Bayesian calculation
– Asymptotics: Laplace approximations
– Quadrature
– Posterior sampling and MCMC
Bayesian Statistical Inference:
Quantifying Uncertainty
Inference:
• Reasoning from one proposition to another
• Deductive Inference: Strong syllogisms, logic; quantify with Boolean
algebra
• Plausible Inference: Weak syllogisms; quantify with probability
Propositions of interest to us are descriptions of data (D),
and hypotheses about the data, Hi
Statistical:
• Statistic: Summary of what data say about a particular ques-
tion/issue
• Statistic = f (D) (value, set, etc.); implicitly also f (question)
• Statistic is chosen & interpreted via probability theory
• Statistical inference = Plausible inference using probability theory
Bayesian (vs. Frequentist):
What are valid arguments for probabilities P (A| · · ·)?
• Bayesian: Any propositions are valid (in principle)
• Frequentist: Only propositions about random events (data)
How should we use probability theory to do statistics?
• Bayesian: Calculate P (Hi |D, · · ·) vs. Hi with D = Dobs
• Frequentist: Create methods for choosing among Hi with
good long run behavior determined by examining P (D|H i )
for all possible hypothetical D; apply method to Dobs
What is distributed in p(x)?
Bayesian: Probability describes uncertainty
Bernoulli, Laplace, Bayes, Gauss. . .
p(x) describes how probability (plausibility) is distributed
among the possible choices for x in the case at hand.
Analog: a mass density, ρ(x)
P p is distributed
x has a single,
uncertain value
x
Relationships between probability and frequency were
demonstrated mathematically (large number theorems,
Bayes’s theorem).
Frequentist: Probability describes “randomness”
Venn, Boole, Fisher, Neymann, Pearson. . .
x is a random variable if it takes different values through-
out an infinite (imaginary?) ensemble of “identical”
sytems/experiments.
p(x) describes how x is distributed throughout the infi-
nite ensemble.
P x is distributed
x
Probability ≡ frequency.
Interpreting Abstract Probabilities
Symmetry/Invariance/Counting
• Resolve possibilities into equally plausible “microstates”
using symmetries
• Count microstates in each possibility
Frequency from probability
Bernoulli’s laws of large numbers: In repeated trials,
given P (success), predict
Nsuccess
→P as N →∞
Ntotal
Probability from frequency
Bayes’s “An Essay Towards Solving a Problem in the
Doctrine of Chances” → Bayes’s theorem
Probability 6= Frequency!
Bayesian Probability:
A Thermal Analogy
Intuitive notion Quantification Calibration
Hot, cold Temperature, T Cold as ice = 273K
Boiling hot = 373K
uncertainty Probability, P Certainty = 0, 1
p = 1/36:
plausible as “snake’s eyes”
p = 1/1024:
plausible as 10 heads
The Bayesian Recipe
Assess hypotheses by calculating their probabilities
p(Hi | . . .) conditional on known and/or presumed in-
formation using the rules of probability theory.
Probability Theory Axioms (“grammar”):
‘OR’ (sum rule) P (H1 + H2 |I) = P (H1 |I) + P (H2 |I)
−P (H1 , H2 |I)
‘AND’ (product rule) P (H1 , D|I) = P (H1 |I) P (D|H1 , I)
= P (D|I) P (H1 |D, I)
Direct Probabilities (“vocabulary”):
• Certainty: If A is certainly true given B, P (A|B) = 1
• Falsity: If A is certainly false given B, P (A|B) = 0
• Other rules exist for more complicated types of informa-
tion; for example, invariance arguments, maximum (in-
formation) entropy, limit theorems (tying probabilities
to frequencies), bold (or desperate!) presumption. . .
Important Theorems
Normalization:
For exclusive, exhaustive Hi
X
P (Hi | · · ·) = 1
i
Bayes’s Theorem:
P (D|Hi, I)
P (Hi|D, I) = P (Hi |I)
P (D|I)
posterior ∝ prior × likelihood
Marginalization:
Note that for exclusive, exhaustive {Bi},
X X
P (A, Bi |I) = P (Bi|A, I)P (A|I) = P (A|I)
i i
X
= P (Bi|I)P (A|Bi, I)
i
→ We can use {Bi} as a “basis” to get P (A|I). This is
sometimes called “extending the conversation.”
Example: Take A = D, Bi = Hi ; then
X
P (D|I) = P (D, Hi |I)
i
X
= P (Hi |I)P (D|Hi, I)
i
prior predictive for D = Average likelihood for Hi
Inference With Parametric Models
Parameter Estimation
I = Model M with parameters θ (+ any add’l info)
Hi = statements about θ; e.g. “θ ∈ [2.5, 3.5],” or “θ > 0”
Probability for any such statement can be found using a
probability density function (PDF) for θ:
P (θ ∈ [θ, θ + dθ]| · · ·) = f (θ)dθ
= p(θ| · · ·)dθ
Posterior probability density:
p(θ|M ) L(θ)
p(θ|D, M ) = R
dθ p(θ|M ) L(θ)
Summaries of posterior:
• “Best fit” values: mode, posterior mean
• Uncertainties: Credible regions
• Marginal distributions:
– Interesting parameters ψ, nuisance parameters φ
– Marginal dist’n for ψ:
Z
p(ψ|D, M ) = dφ p(ψ, φ|D, M )
Generalizes “propagation of errors”
Model Uncertainty: Model Comparison
I = (M1 + M2 + . . .) — Specify a set of models.
Hi = Mi — Hypothesis chooses a model.
Posterior probability for a model:
p(D|Mi, I)
p(Mi|D, I) = p(Mi|I)
p(D|I)
∝ p(Mi)L(Mi)
R
But L(Mi) = p(D|Mi) = dθi p(θi |Mi )p(D|θi , Mi ).
Likelihood for model = Average likelihood for its
parameters
L(Mi ) = hL(θi)i
Posterior odds and Bayes factors:
Discrete nature of hypothesis space makes odds conve-
nient:
p(Mi |D, I)
Oij ≡
p(Mj |D, I)
p(Mi |I) p(D|Mi)
= ×
p(Mj |I) p(D|Mj )
= Prior Odds × Bayes Factor Bij
Often take models to be equally probable a priori
→ Oij = Bij .
Model Uncertainty: Model Averaging
Models have a common subset of interesting pa-
rameters, ψ.
Each has different set of nuisance parameters φi
(or different prior info about them).
Hi = statements about ψ
Calculate posterior PDF for ψ:
X
p(ψ|D, I) = p(ψ|D, Mi)p(Mi|D, I)
i Z
X
∝ L(Mi) dθi p(ψ, φi|D, Mi)
i
The model choice is itself a (discrete) nuisance
parameter here.
An Automatic Occam’s Razor
Predictive probabilities prefer simpler models:
P(D|H) Simple H
Complicated H
D obs D
The Occam Factor:
p, L
Likelihood
δθ
Prior
θ
∆θ
Z
p(D|Mi ) = dθi p(θi |M ) L(θi )
≈ p(θ̂i |M )L(θ̂i )δθi
δθi
≈ L(θ̂i )
∆θi
= Maximum Likelihood × Occam Factor
Models with more parameters usually make the data
more probable for the best fit.
The Occam factor penalizes models for “wasted” vol-
ume of parameter space.
Comparison of Bayesian &
Frequentist Approaches
Bayesian Inference (BI):
• Specify at least two competing hypotheses and priors
• Calculate their probabilities using the rules of probability
theory
– Parameter estimation:
p(θ|M )L(θ)
p(θ|D, M ) = R
dθ p(θ|M )L(θ)
– Model Comparison:
R
dθ1 p(θ1 |M1 ) L(θ1 )
O∝R
dθ2 p(θ2 |M2 ) L(θ2 )
Frequentist Statistics (FS):
• Specify null hypothesis H0 such that rejecting it implies
an interesting effect is present
• Specify statistic S(D) that measures departure of the
data from null expectations
R
• Calculate p(S|H0 ) = dD p(D|H0 )δ[S − S(D)]
(e.g. by Monte Carlo simulation of data)
• Evaluate
R S(Dobs ); decide whether to reject H0 based on,
e.g., >Sobs dS p(S|H0 )
Crucial Distinctions
The role of subjectivity:
BI exchanges (implicit) subjectivity in the choice of null
& statistic for (explicit) subjectivity in the specification
of alternatives.
• Makes assumptions explicit
• Guides specification of further alternatives that gen-
eralize the analysis
• Automates identification of statistics:
BI is a problem-solving approach
FS is a solution-characterization approach
The types of mathematical calculations:
The two approaches require calculation of very different
sums/averages.
• BI requires integrals over hypothesis/parameter space
• FS requires integrals over sample/data space
A Frequentist Confidence Region
(xi − µ)2
· ¸
1
Infer µ : x i = µ + ²i ; p(xi |µ, M ) = √ exp −
σ 2π 2σ 2
p(x1,x2| µ )
x2
µ
x2 x1
x1
√
68% confidence region: x̄ ± σ/ N
1. Pick a null hypothesis, µ = µ0
2. Draw xi ∼ N (µ0 , σ 2 ) for i = 1 to N
√
3. Find x̄; check if µ0 ∈ x̄ ± σ/ N
4. Repeat M >> 1 times; report fraction (≈ 0.683)
5. Hope result is independent of µ0 !
A Monte Carlo calculation of the N -dimensional integral:
(x −µ)2 (x −µ)2
− 12σ2 − N2σ2
√
Z Z
e e
dx1 √ · · · dxN √ × [µ0 ∈ x̄ ± σ/ N ] ≈ 0.683
σ 2π σ 2π
A Bayesian Credible Region
(x̄ − µ)2
· ¸
Infer µ : Flat prior; L(µ) ∝ exp − √
2(σ/ N )2
p(x1,x2| µ )
L( µ)
x2
µ
x1
√
68% credible region: x̄ ± σ/ N
R x̄−σ/√N h
(x̄−µ) 2
i
√
x̄−σ/ N
dµ exp − 2(σ/ √ 2
N)
R∞ h i ≈ 0.683
(x̄−µ) 2
−∞
dµ exp − 2(σ/√N )2
Equivalent to a Monte Carlo calculation of a 1-d integral:
1. Draw µ from N (x̄, σ 2 /N ) (i.e., prior ×L)
2. Repeat M >> 1 times; histogram
3. Report most probable 68.3% region
This simulation uses hypothetical hypotheses rather than
hypothetical data.
When Will Results Differ?
When models are linear in the parameters and
have additive Gaussian noise, frequentist results
are identical to Bayesian results with flat priors.
This mathematical coincidence will not occur if:
• The choice of statistic is not obvious
(no sufficient statistics)
• There is no identity between parameter space
and sample space integrals (due to nonlinearity
or the form of the sampling distribution)
• There is important prior information
In addition, some problems can be quantitatively
addressed only from the Bayesian viewpoint; e.g.,
systematic error.
Benefits of Calculating
in Parameter Space
• Provides probabilities for hypotheses
– Straightforward interpretation
– Identifies weak experiments
– Crucial for global (hierarchical) analyses
(e.g., pop’n studies)
– Allows analysis of systematic error models
– Forces analyst to be explicit about assumptions
• Handles nuisance parameters via marginalization
• Automatic Occam’s razor
• Model comparison for > 2 alternatives; needn’t be nested
• Valid for all sample sizes
• Handles multimodality
• Avoids inconsistency & incoherence
• Automated identification of statistics
• Accounts for prior information (including other data)
• Avoids problems with sample space choice:
– Dependence of results on “stopping rules”
– Recognizable subsets
– Defining number of “independent” trials in searches
• Good frequentist properties:
– Consistent
– Calibrated—E.g., if you choose a model only if B >
100, you will be right ≈ 99% of the time
– Coverage as good or better than common methods
Challenges from Calculating
in Parameter Space
Inference with independent data:
Consider N data, D = {xi }; and model M with m pa-
rameters (m ¿ N ).
Suppose L(θ) = p(x1 |θ) p(x2 |θ) · · · p(xN |θ).
Frequentist integrals:
Z Z Z
dx1 p(x1 |θ) dx2 p(x2 |θ) · · · dxN p(xN |θ)f (D)
Seek integrals with properties independent of θ. Such
rigorous frequentist integrals usually cannot be identi-
fied.
Approximate results are easy via Monte Carlo (due to
independence).
Bayesian integrals:
Z
dm θ g(θ) p(θ|M ) L(θ)
Such integrals are sometimes easy if analytic (especially
in low dimensions).
Asymptotic approximations require ingredients familiar
from frequentist calculations.
For large m (> 4 is often enough!) the integrals are
often very challenging because of correlations (lack of
independence) in parameter space.
Bayesian Integrals:
Laplace Approximations
Suppose posterior has a single dominant (interior) mode at
θ̂, with m parameters
· ¸
1
→ p(θ|M )L(θ) ≈ p(θ̂|M )L(θ̂) exp − (θ − θ̂)I(θ − θ̂)
2
∂ 2 ln[p(θ|M )L(θ)] ¯¯
where I= ¯, Info matrix
∂ 2θ θ̂
Bayes Factors:
Z
dθ p(θ|M )L(θ) ≈ p(θ̂|M )L(θ̂) (2π)m/2 |I|−1/2
Marginals:
Profile likelihood Lp (θ) ≡ max L(θ, φ)
φ
−1/2
→ p(θ|D, M ) ∝
∼ Lp (θ)|I(θ)|
Uses same ingredients as common frequentist calculations
Uses ratios → approximation is often O(1/N )
Using “unit info prior” in i.i.d. setting → Schwarz criterion;
Bayesian Information Criterion (BIC)
1
ln B ≈ ln L(θ̂) − ln L(θ̂, φ̂) + (m2 − m1 ) ln N
2
Low-D Models (m<∼10):
Quadrature & MC Integration
Quadrature/Cubature Rules:
Z X
dθ f (θ) ≈ wi f (θi ) + O(n−2 ) or O(n−4 )
i
Smoothness → fast convergence in 1-D
Curse of dimensionality → O(n−2/m ) or O(n−4/m ) in m-D
Monte Carlo Integration:
· ¸
∼ O(n−1 ) with
Z X
dθ g(θ)p(θ) ≈ g(θi ) + O(n−1/2 )
quasi-MC
θi ∼p(θ)
Ignores smoothness → poor performance in 1-D
Avoids curse: O(n−1/2 ) regardless of dimension
Practical problem: multiplier is large (variance of g)
→ hard if m>∼6 (need good “importance sampler” p)
Randomized Quadrature:
Quadrature rule + random dithering of abscissas
→ get benefits of both methods
Most useful in settings resembling Gaussian quadrature
Subregion-Adaptive Quadrature/MC:
Concentrate points where most of the probability lies
via recursion
Adaptive quadrature: Use a pair of lattice rules (for
error estim’n), subdivide regions w/ large error
• ADAPT (Genz & Malik) at GAMS (gams.nist.gov)
• BAYESPACK (Genz; Genz & Kass)—many methods
Automatic; regularly used up to m ≈ 20
Adaptive Monte Carlo: Build the importance sampler
on-the-fly (e.g., VEGAS, miser in Numerical Recipes)
ADAPT in action (galaxy polarizations)
High-D Models (m ∼ 5–106):
Posterior Sampling
General Approach:
Draw samples of θ, φ from p(θ, φ|D, M ); then:
P
• Integrals, moments easily found via i f (θi , φi )
• {θi } are samples from p(θ|D, M )
But how can we obtain {θi , φi }?
Rejection Method:
P(θ ) ●
● ● ●
● ● ● ●
●
● ● ●
● ●
● ● ●
●
● ●
● ● ●
● ● ●
● ● ● ●
● ●
● ● ● ●
● ●
● ●
● ● ●
● ●
● ●
● ● ●
θ
Hard to find efficient comparison function if m>
∼6
A 2-D marginal of a 6-D posterior
Markov Chain Monte Carlo (MCMC):
Let − Λ(θ) = ln [p(θ|M ) p(D|θ, M )]
e−Λ(θ)
Z
Then p(θ|D, M ) = Z≡ dθ e−Λ(θ)
Z
Bayesian integration looks like problems addressed in
computational statmech and Euclidean QFT!
Markov chain methods are standard: Metropolis; Metropolis-
Hastings; molecular dynamics; hybrid Monte Carlo; sim-
ulated annealing
The MCMC Recipe:
Create a “time series” of samples θi from p(θ):
• Draw a candidate θi+1 from a kernel T (θi+1 |θi )
• Enforce “detailed balance” by accepting with p = α
· ¸
T (θi |θi+1 )p(θi+1 )
α(θi+1 |θi ) = min 1,
T (θi+1 |θi )p(θi)
Choosing T to minimize “burn-in” and corr’ns is an art!
Coupled, parallel chains eliminate this for select prob-
lems (“exact sampling”).
Summary
What’s different about Bayesian Inference:
• Problem-solving vs. solution-characterizing approach
• Calculate in parameter space rather than sample
space
Bayesian Benefits:
• Rigorous foundations, consistent & simple interpre-
tation
• Automated identification of statistics
• Numerous benefits from parameter space vs. sample
space
Bayesian Challenges:
• More complicated problem specification
(≥ 2 alternatives; priors)
• Computational difficulties with large parameter spaces
– Laplace approximation for “quick entry”
– Adaptive & randomized quadrature for lo-D
– Posterior sampling via MCMC for hi-D
Compare or Reject Hypotheses?
Frequentist Significance Testing (G.O.F. tests):
• Specify simple null hypothesis H0 such that rejecting
it implies an interesting effect is present
• Divide sample space into probable and improbable
parts (for H0 )
• If Dobs lies in improbable region, reject H0 ; otherwise
accept it
P(D|H) H0
P=95%
D
D obs
Bayesian Model Comparison:
• Favor the hypothesis that makes the observed data
most probable (up to a prior factor)
P(D|H) H2
H1
H0
D
D obs
If the data are improbable under M1 , the hypothesis may be
wrong, or a rare event may have occured. GOF tests reject
the latter possibility at the outset.
Backgrounds as Nuisance
Parameters
Background marginalization with Gaussian noise:
Measure background rate b = b̂ ± σb with source off.
Measure total rate r = r̂ ± σr with source on.
Infer signal source strength s, where r = s + b.
With flat priors,
(b − b̂)2 (s + b − r̂)2
· ¸ · ¸
p(s, b|D, M ) ∝ exp − × exp −
2σb2 2σr2
Marginalize b to summarize the results for s (complete
the square to isolate b dependence; then do a simple
Gaussian integral over b):
(s − ŝ)2
· ¸
ŝ = r̂ − b̂
p(s|D, M ) ∝ exp −
2σs2 σs2 = σr2 + σb2
Background subtraction is a special case of background
marginalization.
Analytical Simplification:
The Jaynes-Bretthorst Algorithm
Superposed Nonlinear Models
N samples of a superpos’n of nonlinear functions plus Gaus-
sian errors,
M
X
di = Aα gα(xi ; θ) + ²i
α=1
X
or d~ = Aα~gα(θ) + ~².
α
The log-likelihood is a quadratic form in Aα,
· ¸
1 Q(A, θ)
L(A, θ) ∝ N exp −
σ 2σ 2
" #2
X
Q = d~ − Aα~gα
α
X X
2
= d −2 Aαd~ · ~gα + AαAβ ηαβ
α α,β
ηαβ = ~gα · ~gβ
Estimate θ given a prior, π(θ).
Estimate amplitudes.
Compare rival models.
The Algorithm
• Switch to orthonormal set of models, ~hµ(θ) by
diagonalizing ηαβ ; new amplitudes B = {Bµ}.
Xh i2
Q= Bµ − d~ · ~hµ (θ) + r 2 (θ, B)
µ
X
residual r(θ, B) = d~ −
~ Bµ~hµ
µ
" #
r2
· ¸
π(θ)J(θ) −1 X 2
p(B, θ|D, I) ∝ exp − exp (B µ − B̂ µ )
σN 2σ 2 2σ 2 µ
Y
where J(θ) = λµ (θ)−1/2
µ
• Marginalize B’s analytically.
r 2(θ)
" #
π(θ)J(θ)
p(θ|D, I) ∝ exp −
σ N −M 2σ 2
residual sum of squares
r 2(θ) =
from least squares
• If σ unknown, marginalize using p(σ|I) ∝ σ1 .
i M −N
p(θ|D, I) ∝ π(θ)J(θ) r 2(θ)
h
2
Frequentist Behavior
of Bayesian Results
Bayesian inferences have good long-run proper-
ties, sometimes better than conventional frequen-
tist counterparts.
Parameter Estimation:
• Credible regions found with flat priors are typically con-
fidence regions to O(n−1/2 ).
• Using standard nonuniform “reference” priors can im-
prove their performance to O(n−1 ).
• For handling nuisance parameters, regions based on marginal
likelihoods have superior long-run performance to re-
gions found with conventional frequentist methods like
profile likelihood.
Model Comparison:
• Model comparison is asymptotically consistent. Popular
frequentist procedures (e.g., χ2 test, asymptotic likeli-
hood ratio (∆χ2 ), AIC) are not.
• For separate (not nested) models, the posterior prob-
ability for the true model converges to 1 exponentially
quickly.
• When selecting between more than 2 models, carrying
out multiple frequentist significance tests can give mis-
leading results. Bayes factors continue to function well.