Tutorial 08 Part 1
inductive learning
Tom Griffiths
UC Berkeley
Charles Kemp
CMU
Josh Tenenbaum
MIT
What you will get out of this tutorial
• Our view of what Bayesian models have to offer
cognitive science
• In-depth examples of basic and advanced
models: how the math works & what it buys you
• A sense for how to go about making your own
Bayesian models
• Some (not extensive) comparison to other
approaches
• Opportunities to ask questions
Resources…
• “Bayesian models of cognition” chapter in
Handbook of Computational Psychology
• Tom’s Bayesian reading list:
– http://cocosci.berkeley.edu/tom/bayes.html
– tutorial slides will be posted there!
• Trends in Cognitive Sciences special issue on
probabilistic models of cognition (vol. 10, iss. 7)
• IPAM graduate summer school on probabilistic
models of cognition (with videos!)
Outline
• Morning
– Introduction: Why Bayes? (Josh)
– Basics of Bayesian inference (Josh)
– How to build a Bayesian cognitive model (Tom)
• Afternoon
– Hierarchical Bayesian models and learning
structured representations (Charles)
– Monte Carlo methods and nonparametric Bayesian
models (Tom)
Why probabilistic models of cognition?
The big question
(Marr)
Learning the meanings of words
“tufa”
The big question
How does the mind get so much out of so little?
– Perceiving the world from sense data
– Learning about kinds of objects and their properties
– Learning and interpreting the meanings of words, phrases,
and sentences
– Inferring causal relations
– Inferring the mental states of other people (beliefs,
desires, preferences) from observing their actions
– Learning social structures, conventions, and rules
Phrase structure S
P(
P(U | S)
Utterance U
Grammar:
S → NP VP
NP → Det [Adj] Noun [RelClause]
RelClause → [Rel] NP V
VP → VP NP
VP → Verb
P(phrase structure | grammar)
Phrase structure
Utterance
P(speech | utterance)
Speech signal
Vision as probabilistic parsing
Structure
Data
Causal learning and reasoning
Principles
Structure
Data
Goal-directed action
(production and comprehension)
$$p(h \mid d) = \frac{p(d \mid h)\, p(h)}{\sum_{h' \in H} p(d \mid h')\, p(h')}$$
(the denominator is a sum over the space H of alternative hypotheses)
Bayesian inference
• Bayes’ rule: $$P(h \mid d) = \frac{P(h)\, P(d \mid h)}{\sum_{h_i} P(h_i)\, P(d \mid h_i)}$$
• An example
HHTHT
HHHHH
What process produced these sequences?
Comparing two hypotheses
• Contrast simple hypotheses:
– h1: “fair coin”, P(H) = 0.5
– h2: “always heads”, P(H) = 1.0
• Bayes’ rule: $$P(h \mid d) = \frac{P(h)\, P(d \mid h)}{\sum_{h_i} P(h_i)\, P(d \mid h_i)}$$
D: HHTHT
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^5    P(H1) = ?
P(D|H2) = 0        P(H2) = 1 - ?
Comparing two hypotheses
$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$
D: HHTHT
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^5    P(H1) = 999/1000
P(D|H2) = 0        P(H2) = 1/1000
$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{1/32}{0} \times \frac{999}{1} = \infty$$
Comparing two hypotheses
$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$
D: HHHHH
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^5    P(H1) = 999/1000
P(D|H2) = 1        P(H2) = 1/1000
$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{1/32}{1} \times \frac{999}{1} \approx 30$$
Comparing two hypotheses
$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$
D: HHHHHHHHHH
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^10   P(H1) = 999/1000
P(D|H2) = 1        P(H2) = 1/1000
$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{1/1024}{1} \times \frac{999}{1} \approx 1$$
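The three posterior-odds calculations above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides; the function names are illustrative, and exact fractions are used so that the zero likelihood for HHTHT under "always heads" is handled cleanly.

```python
from fractions import Fraction

def likelihood(sequence, p_heads):
    """P(sequence | coin with P(H) = p_heads), assuming independent flips."""
    result = Fraction(1)
    for flip in sequence:
        result *= p_heads if flip == "H" else 1 - p_heads
    return result

def posterior_odds(sequence, prior_fair=Fraction(999, 1000)):
    """Posterior odds P(H1 | D) / P(H2 | D) for H1 = fair coin, H2 = always heads."""
    like_fair = likelihood(sequence, Fraction(1, 2))
    like_heads = likelihood(sequence, Fraction(1))
    prior_odds = prior_fair / (1 - prior_fair)
    if like_heads == 0:
        # H2 cannot produce the data, so the odds for H1 are unbounded.
        return float("inf")
    return float(like_fair / like_heads * prior_odds)

for data in ["HHTHT", "HHHHH", "HHHHHHHHHH"]:
    print(data, posterior_odds(data))
```

With the 999:1 prior used on the slides, five heads leaves the odds at about 31:1 for the fair coin, while ten heads brings them down to roughly even.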
Measuring prior knowledge
1. The fact that HHHHH looks like a “mere coincidence”,
without making us suspicious that the coin is unfair, while
HHHHHHHHHH does begin to make us suspicious,
measures the strength of our prior belief that the coin is
fair.
– If θ is the threshold for suspicion in the posterior odds, and D* is
the shortest suspicious sequence, the prior odds in favor of a fair
coin are roughly θ/P(D*|“fair coin”).
– If θ ~ 1 and D* is between 10 and 20 heads, the prior odds in favor of
a fair coin are roughly between 1,000 and 1,000,000 (a prior
probability of about 1/1,000 to 1/1,000,000 for the trick coin).
2. The fact that HHTHT looks representative of a fair coin,
and HHHHH does not, reflects our prior knowledge about
possible causal mechanisms in the world.
– Easy to imagine how a trick all-heads coin could work: low (but
not negligible) prior probability.
– Hard to imagine how a trick “HHTHT” coin could work: extremely
low (negligible) prior probability.
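The first point can be turned into a small calculation: given the length of the shortest suspicious run of heads, recover the prior odds that run implies. The sketch below assumes the alternative is an always-heads coin, so P(D* | trick) = 1, and that suspicion starts at posterior odds θ; the function name is illustrative.

```python
def implied_prior_odds(n_heads, threshold=1.0):
    """Prior odds (fair : trick) implied by first becoming suspicious
    after n_heads consecutive heads, assuming P(D* | trick coin) = 1."""
    p_data_given_fair = 0.5 ** n_heads
    return threshold / p_data_given_fair

for n in (10, 20):
    print(n, implied_prior_odds(n))
```

Ten heads corresponds to prior odds of about 1,000 to 1 in favor of the fair coin, and twenty heads to about 1,000,000 to 1.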
Coin flipping
• Basic Bayes
– data = HHTHT or HHHHH
– compare two hypotheses:
P(H) = 0.5 vs. P(H) = 1.0
• Parameter estimation (Model fitting)
– compare many hypotheses in a parameterized family
P(H) = θ : Infer θ
• Model selection
– compare qualitatively different hypotheses, often
varying in complexity:
P(H) = 0.5 vs. P(H) = θ
Parameter estimation
• Assume data are generated from a
parameterized model:
θ
d1 d2 d3 d4
P(H) = θ
d1 d2 d3 d4 d1 d2 d3 d4 d1 d2 d3 d4
model M
data D
A hierarchical learning framework
model class C
Model selection: $$p(M \mid D, C) \propto p(D \mid M)\, p(M \mid C)$$
model M
$$p(D \mid M) = \sum_w p(D \mid w, M)\, p(w \mid M)$$
data D
Bayesian parameter estimation
• Assume data are generated from a model:
θ
d1 d2 d3 d4
P(H) = θ
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta')\, p(\theta')\, d\theta'}$$
• Posterior distribution P(θ | D) is a probability
density over θ = P(H)
• Need to specify likelihood P(D | θ ) and prior
distribution P(θ ).
Likelihood and prior
• Likelihood: Bernoulli distribution
$$P(D \mid \theta) = \theta^{N_H} (1-\theta)^{N_T}$$
– N_H: number of heads
– N_T: number of tails
• Prior:
P(θ ) ∝ ?
Some intuitions
• D = 10 flips, with 5 heads and 5 tails.
• θ = P(H) on next flip? 50%
• Why? 50% = 5 / (5+5) = 5/10.
• Why? Maximum likelihood: $\hat{\theta} = \arg\max_{\theta} P(D \mid \theta)$
• Suppose we had seen 4 heads and 6 tails.
• P(H) on next flip? Closer to 50% than to 40%.
• Why? Prior knowledge.
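The maximum-likelihood answer can be checked numerically. The sketch below (illustrative names, not from the slides) searches a grid of θ values for the one that maximizes the Bernoulli likelihood; it recovers the closed form N_H / (N_H + N_T), which is 40% for 4 heads and 6 tails. The intuition that the answer should sit closer to 50% is what the prior has to supply.

```python
def likelihood(theta, n_heads, n_tails):
    """Bernoulli likelihood of one specific sequence with the given counts."""
    return theta ** n_heads * (1 - theta) ** n_tails

def max_likelihood_theta(n_heads, n_tails, grid_size=1001):
    """Grid search for the theta in [0, 1] that maximizes the likelihood."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda theta: likelihood(theta, n_heads, n_tails))

print(max_likelihood_theta(5, 5))  # 5 heads, 5 tails
print(max_likelihood_theta(4, 6))  # 4 heads, 6 tails
```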
A simple method of specifying priors
• Imagine some fictitious trials, reflecting a
set of previous experiences
– strategy often used with neural networks or
building invariance into machine vision.
• Posterior is Beta(NH+FH,NT+FT)
– same form as prior!
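A minimal sketch of the fictitious-trials update, assuming a Beta(F_H, F_T) prior. The posterior is obtained by adding real and fictitious counts, and the probability of heads on the next flip is the posterior mean. The fictitious counts (10, 10) in the example are a hypothetical choice for illustration only.

```python
def beta_posterior(n_heads, n_tails, f_heads, f_tails):
    """Parameters of the Beta posterior after adding observed counts
    to the fictitious counts of a Beta(f_heads, f_tails) prior."""
    return n_heads + f_heads, n_tails + f_tails

def predictive_heads(n_heads, n_tails, f_heads, f_tails):
    """P(heads on next flip | data) = mean of the Beta posterior."""
    a, b = beta_posterior(n_heads, n_tails, f_heads, f_tails)
    return a / (a + b)

# 4 heads and 6 tails, with hypothetical fictitious counts of 10 heads, 10 tails.
print(beta_posterior(4, 6, 10, 10))
print(predictive_heads(4, 6, 10, 10))
```

With these counts the prediction is 14/30, about 47%, which is closer to 50% than to the maximum-likelihood value of 40%.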
Bayesian parameter estimation
$$P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta) \propto \theta^{N_H+F_H-1} (1-\theta)^{N_T+F_T-1}$$
FH,FT
D = NH,NT
d1 d2 d3 d4 H
Coins
θ ~ Beta(FH,FT)
FH,FT
Coin 1 θ1 Coin 2 θ2
... θ200 Coin 200
d1 d2 d3 d4 d1 d2 d3 d4 d1 d2 d3 d4
Model selection:
model M
$$p(M \mid D, C) \propto p(D \mid M)\, p(M \mid C)$$
data D
Stability versus Flexibility
• Can all domain knowledge be represented
with conjugate priors?
• Suppose you flip a coin 25 times and get all
heads. Something funny is going on …
• But with F ={1000 heads, 1000 tails},
P(heads) on next flip = 1025 / (1025+1000)
= 50.6%. Looks like nothing unusual.
• How do we balance stability and flexibility?
– Stability: 6 heads, 4 tails → θ ~ 0.5
– Flexibility: 25 heads, 0 tails → θ ~ 1
Bayesian model selection
θ
d1 d2 d3 d4 vs. d1 d2 d3 d4
Fair coin, P(H) = 0.5 P(H) = θ
θ = 0.5
D = HHHHH
Comparing simple and complex hypotheses
$$P(D \mid \theta) = \theta^{n} (1-\theta)^{N-n}$$
θ = 1.0
Probability
θ = 0.5
D = HHHHH
Comparing simple and complex hypotheses
$$P(D \mid \theta) = \theta^{n} (1-\theta)^{N-n}$$
Probability
θ = 0.6
θ = 0.5
D = HHTHT
Comparing simple and complex hypotheses
• P(H) = θ is more complex than P(H) = 0.5 in
two ways:
– P(H) = 0.5 is a special case of P(H) = θ
– for any observed sequence X, we can choose θ
such that X is more probable than if P(H) = 0.5
• How can we deal with this?
– Some version of Occam’s razor?
– Bayes: automatic version of Occam’s razor
follows from the “law of conservation of belief”.
Comparing simple and complex hypotheses
$$P(D \mid h_0) = (1/2)^{n} (1 - 1/2)^{N-n} = 1/2^{N}$$
$$P(D \mid h_1) = \int_0^1 P(D \mid \theta, h_1)\, p(\theta \mid h_1)\, d\theta$$
The “evidence” or “marginal likelihood”: The
probability that randomly selected parameters
from the prior would generate the data.
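The marginal likelihood can be approximated by averaging the likelihood over the prior. The sketch below uses a midpoint rule and, as an assumption not fixed by this slide, a uniform prior p(θ | h1) = 1 on [0, 1]; under that prior the integral has the closed form n!(N-n)!/(N+1)!. Function names are illustrative.

```python
from math import log

def evidence_fair(n_flips):
    """P(D | h0) for a specific sequence of n_flips under a fair coin."""
    return 0.5 ** n_flips

def evidence_unknown_bias(n_heads, n_flips,
                          prior_density=lambda theta: 1.0, steps=10_000):
    """Midpoint-rule approximation of the integral of
    theta^n (1 - theta)^(N - n) p(theta | h1) over [0, 1].
    The default prior density is uniform (an assumption)."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * width
        like = theta ** n_heads * (1 - theta) ** (n_flips - n_heads)
        total += like * prior_density(theta)
    return total * width

def log_evidence_ratio(sequence):
    """log P(D | h1) / P(D | h0); positive values favor the biased-coin model."""
    n_flips, n_heads = len(sequence), sequence.count("H")
    return log(evidence_unknown_bias(n_heads, n_flips) / evidence_fair(n_flips))

for data in ["HHTHT", "HHHHH"]:
    print(data, log_evidence_ratio(data))
```

Under the uniform prior, HHTHT favors the fair coin (negative log ratio) and HHHHH favors the flexible model (positive log ratio), even though the flexible model can always fit at least as well at its best θ.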
[Plot of the log evidence ratio $\log \frac{P(D \mid h_1)}{P(D \mid h_0)}$, where
$P(D \mid h_1) = \int_0^1 P(D \mid \theta, h_1)\, p(\theta \mid h_1)\, d\theta$ and $P(D \mid h_0) = 1/2^{N}$]
Stability versus Flexibility revisited
• Model class hypothesis: is this coin fair or unfair?
• Example probabilities:
– P(fair) = 0.999
– P(θ | fair) is Beta(1000,1000)
– P(θ | unfair) is Beta(1,1)
• 25 heads in a row propagates up, affecting θ and then P(fair | D)
[Figure: graphical model, fair/unfair? → FH,FT → θ → d1 d2 d3 d4]
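$$\frac{P(\text{fair} \mid 25\text{ heads})}{P(\text{unfair} \mid 25\text{ heads})} = \frac{P(25\text{ heads} \mid \text{fair})}{P(25\text{ heads} \mid \text{unfair})} \times \frac{P(\text{fair})}{P(\text{unfair})} \approx 0.001$$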
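The posterior odds for the fair/unfair example can be computed from the example probabilities on this slide. For θ ~ Beta(a, b), the probability of one specific sequence with n_H heads and n_T tails is B(a + n_H, b + n_T) / B(a, b); this is a standard Beta-Bernoulli identity rather than something stated on the slide. The sketch works in log space to avoid overflow, and the function names are illustrative.

```python
from math import exp, lgamma

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def sequence_probability(n_heads, n_tails, prior_heads, prior_tails):
    """P(one specific sequence | theta ~ Beta(prior_heads, prior_tails))."""
    return exp(log_beta(prior_heads + n_heads, prior_tails + n_tails)
               - log_beta(prior_heads, prior_tails))

def posterior_odds_fair(n_heads, n_tails, p_fair=0.999):
    """Posterior odds fair : unfair, with Beta(1000, 1000) and Beta(1, 1) priors."""
    like_fair = sequence_probability(n_heads, n_tails, 1000, 1000)
    like_unfair = sequence_probability(n_heads, n_tails, 1, 1)
    return (like_fair / like_unfair) * (p_fair / (1 - p_fair))

print(posterior_odds_fair(25, 0))  # 25 heads in a row
print(posterior_odds_fair(6, 4))   # 6 heads, 4 tails
```

The same model stays confident that the coin is fair after 6 heads and 4 tails, yet switches decisively to "unfair" after 25 heads, which is the balance of stability and flexibility the slide is after.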
Bayesian Occam’s Razor
[Figure: p(D = d | M) plotted over all possible data sets D for models M1, M2, and M3, with the observed data marked]
M1: A model that is too simple is unlikely to generate
the data.
M3: A model that is too complex can generate many
possible data sets, so it is unlikely to generate
this particular data set at random.
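The trade-off can be seen by enumerating every possible data set. In the sketch below, three coin models stand in for a too-simple, an intermediate, and a flexible model; this mapping onto M1, M2, M3 is an illustrative assumption, as is the uniform prior on θ for the flexible model. Each model's probabilities over all 32 sequences of five flips sum to 1, so mass spent on one data set is unavailable for another.

```python
from fractions import Fraction
from itertools import product
from math import comb

N_FLIPS = 5

def p_always_heads(seq):
    """Too simple: all probability on the all-heads sequence."""
    return Fraction(1) if all(flip == "H" for flip in seq) else Fraction(0)

def p_fair(seq):
    """Fair coin: every sequence is equally likely."""
    return Fraction(1, 2 ** len(seq))

def p_unknown_bias(seq):
    """Flexible: theta ~ Uniform(0, 1), integrated out in closed form."""
    n_heads = seq.count("H")
    return Fraction(1, (len(seq) + 1) * comb(len(seq), n_heads))

MODELS = {"always heads": p_always_heads, "fair": p_fair,
          "unknown bias": p_unknown_bias}
ALL_SEQUENCES = ["".join(flips) for flips in product("HT", repeat=N_FLIPS)]

def total_probability(model):
    """Sum of p(D = d | M) over every possible data set d."""
    return sum(model(seq) for seq in ALL_SEQUENCES)

for name, model in MODELS.items():
    print(name, total_probability(model), model("HHTHT"), model("HHHHH"))
```

For HHTHT the fair coin assigns 1/32, the flexible model 1/60, and the always-heads model 0; for HHHHH the ordering reverses.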
Summary so far
• Three kinds of Bayesian inference
– Comparing two simple hypotheses
– Parameter estimation
   • The importance and subtlety of prior knowledge
– Model selection
   • Bayesian Occam’s razor, the blessing of abstraction
• Key concepts
– Probabilistic generative models
– Hierarchies of abstraction, with statistical
inference at all levels
– Flexibly structured representations
Plan for this lecture
• Some basic aspects of Bayesian statistics
– Comparing two hypotheses
– Model fitting
– Model selection
• Two (very brief) case studies in modeling
human inductive learning
– Causal learning
– Concept learning
Learning causation from correlation
                 C present (c+)   C absent (c-)
E present (e+)         a                c
E absent (e-)          b                d
[Figure: two candidate causal structures over E, h1 vs. h0]
Bayesian learning of causal structure
• Hypotheses: h1 (B → E and C → E) vs. h0 (B → E only)
$$P(d \mid h_1) = \int_0^1 \int_0^1 P(d \mid w_0, w_1)\, p(w_0, w_1 \mid h_1)\, dw_0\, dw_1$$
$$P(d \mid h_0) = \int_0^1 P(d \mid w_0)\, p(w_0 \mid h_0)\, dw_0$$
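The two integrals can be approximated on a grid once a parameterization is chosen. The slides do not fix one here, so the sketch below makes two explicit assumptions: a noisy-OR likelihood, with P(e+ | c-) = w0 and P(e+ | c+) = w0 + w1 - w0 w1, and uniform priors on w0 and w1. The counts a, b, c, d follow the contingency table (a: e+ with c+, b: e- with c+, c: e+ with c-, d: e- with c-); function names are illustrative.

```python
from math import log

def likelihood(a, b, c, d, p_e_given_c, p_e_given_not_c):
    """Probability of the contingency counts given the two effect rates."""
    return (p_e_given_c ** a * (1 - p_e_given_c) ** b
            * p_e_given_not_c ** c * (1 - p_e_given_not_c) ** d)

def evidence_h0(a, b, c, d, steps=200):
    """P(d | h0): C has no effect, so both rates equal w0 (uniform prior)."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        w0 = (i + 0.5) * width
        total += likelihood(a, b, c, d, w0, w0)
    return total * width

def evidence_h1(a, b, c, d, steps=200):
    """P(d | h1): noisy-OR of background w0 and cause w1 (uniform priors)."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        w0 = (i + 0.5) * width
        for j in range(steps):
            w1 = (j + 0.5) * width
            p_e_given_c = w0 + w1 - w0 * w1  # noisy-OR assumption
            total += likelihood(a, b, c, d, p_e_given_c, w0)
    return total * width * width

def log_evidence_ratio(a, b, c, d):
    """log P(d | h1) / P(d | h0); positive values favor a causal link."""
    return log(evidence_h1(a, b, c, d) / evidence_h0(a, b, c, d))

print(log_evidence_ratio(8, 0, 0, 8))  # strong contingency
print(log_evidence_ratio(4, 4, 4, 4))  # no contingency
```

Under these assumptions, data with a strong contingency favor h1, while data with no contingency favor the simpler h0 even though h1 contains it as a special case.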
Bayesian Occam’s Razor
For any model h, $$\sum_d P(d \mid h) = 1$$
[Figure: P(d | h) plotted over all data sets d, for h0 (no relationship) and h1 (positive relationship)]
[Figure: people’s judgments compared with ΔP and Power PC, which assume the structure B → E ← C (strengths w0, w1) and estimate the strength w1]
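For reference, both strength estimates can be computed directly from the contingency counts. The formulas below are the standard definitions from the causal-learning literature, not stated on the slide: ΔP = P(e+ | c+) - P(e+ | c-), and causal power (for a generative cause) = ΔP / (1 - P(e+ | c-)). The example counts are hypothetical.

```python
def delta_p(a, b, c, d):
    """Delta-P: P(e+ | c+) - P(e+ | c-), with a, b, c, d as in the
    contingency table (a: e+ c+, b: e- c+, c: e+ c-, d: e- c-)."""
    return a / (a + b) - c / (c + d)

def causal_power(a, b, c, d):
    """Power PC estimate for a generative cause.
    Undefined (division by zero) when P(e+ | c-) = 1."""
    p_e_given_not_c = c / (c + d)
    return delta_p(a, b, c, d) / (1 - p_e_given_not_c)

# Hypothetical counts: effect in 6 of 8 cases with C, 2 of 8 without.
print(delta_p(6, 2, 2, 6))
print(causal_power(6, 2, 2, 6))
```

Both measures take the causal structure for granted and only estimate a strength, which is the contrast with comparing h1 against h0.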
Examples of “yes” numbers    Generalization judgments (N = 20)
60                           Diffuse similarity
60 80 10 30                  Rule: “multiples of 10”
60 52 57 55                  Focused similarity: numbers near 50-60
Bayesian model
• H: Hypothesis space of possible concepts:
– H1: Mathematical properties: multiples and powers of small numbers.
– H2: Magnitude: intervals with endpoints between 1 and 100.
$$p(y \in C \mid X) = \sum_{h \in H} p(y \in C \mid h)\, p(h \mid X)$$
[Figure: background knowledge → h → X = x1 x2 x3 x4; query: y ∈ C?]
Likelihood: p(X|h)
• Size principle: Smaller hypotheses receive greater
likelihood, and exponentially more so as n increases.
$$p(X \mid h) = \begin{cases} \left[ \dfrac{1}{\text{size}(h)} \right]^{n} & \text{if } x_1, \ldots, x_n \in h \\ 0 & \text{if any } x_i \notin h \end{cases}$$
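A minimal sketch of the size principle combined with hypothesis averaging. The hypothesis space here is a tiny hypothetical one (three mathematical concepts and two intervals over 1 to 100) with a uniform prior; both are assumptions made for illustration and are far smaller than the space the model actually uses.

```python
def make_hypotheses():
    """A small illustrative hypothesis space over the numbers 1..100."""
    return {
        "multiples of 10": set(range(10, 101, 10)),
        "even numbers": set(range(2, 101, 2)),
        "powers of 2": {2 ** k for k in range(1, 7)},
        "interval 50-60": set(range(50, 61)),
        "interval 10-80": set(range(10, 81)),
    }

def likelihood(examples, extension):
    """Size principle: (1 / size(h))^n if every example is in h, else 0."""
    if all(x in extension for x in examples):
        return (1.0 / len(extension)) ** len(examples)
    return 0.0

def posterior(examples, hypotheses):
    """p(h | X) under a uniform prior over the hypotheses (an assumption)."""
    scores = {name: likelihood(examples, ext) for name, ext in hypotheses.items()}
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no hypothesis is consistent with the examples")
    return {name: score / total for name, score in scores.items()}

def prob_in_concept(y, examples, hypotheses):
    """p(y in C | X): sum of p(h | X) over hypotheses that contain y."""
    post = posterior(examples, hypotheses)
    return sum(p for name, p in post.items() if y in hypotheses[name])

H = make_hypotheses()
for X in ([60], [60, 80, 10, 30], [60, 52, 57, 55]):
    print(X, posterior(X, H), prob_in_concept(20, X, H))
```

With one example the posterior is spread across several consistent hypotheses; with four examples it concentrates on the smallest consistent one, either "multiples of 10" or the narrow interval, which mirrors the shift from diffuse similarity to rule-like or focused generalization.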
60
60 80 10 30
60 52 57 55
16
16 8 2 64
16 23 19 20
Stability versus Flexibility
• Higher-level hypothesis: is this concept
mathematical or magnitude-based?
• Example probabilities:
– P(math) = λ
– P(h | math) …
– P(h | magnitude) …
[Figure: graphical model, math/magnitude? → h → X = x1 x2 x3 x4]