0% found this document useful (0 votes)

32 views95 pages

Tutorial 08 Part 1

The document presents a tutorial on Bayesian models of inductive learning, highlighting their significance in cognitive science and providing insights into building Bayesian cognitive models. It covers foundational concepts, advanced techniques, and comparisons with other approaches, along with practical examples and resources for further learning. The tutorial aims to address key questions regarding how the mind infers and generalizes from limited data using probabilistic frameworks.

Uploaded by

Lawrence Lee

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

32 views95 pages

Tutorial 08 Part 1

Uploaded by

Lawrence Lee

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 95

Bayesian models of

inductive learning

Tom Griffiths
UC Berkeley
Charles Kemp
CMU
Josh Tenenbaum
MIT
What you will get out of this tutorial
• Our view of what Bayesian models have to offer
cognitive science
• In-depth examples of basic and advanced
models: how the math works & what it buys you
• A sense for how to go about making your own
Bayesian models
• Some (not extensive) comparison to other
approaches
• Opportunities to ask questions
Resources…
• “Bayesian models of cognition” chapter in
Handbook of Computational Psychology
• Tom’s Bayesian reading list:
– http://cocosci.berkeley.edu/tom/bayes.html
– tutorial slides will be posted there!
• Trends in Cognitive Sciences special issue on
probabilistic models of cognition (vol. 10, iss. 7)
• IPAM graduate summer school on probabilistic
models of cognition (with videos!)
Outline
• Morning
– Introduction: Why Bayes? (Josh)
– Basics of Bayesian inference (Josh)
– How to build a Bayesian cognitive model (Tom)
• Afternoon
– Hierarchical Bayesian models and learning
structured representations (Charles)
– Monte Carlo methods and nonparametric Bayesian
models (Tom)
Why probabilistic models of cognition?
The big question

How does the mind get so much out of so little?

How do we make inferences, generalizations,
models, theories and decisions about the world
from impoverished (sparse, incomplete, noisy)
data?

“The problem of induction”

Visual perception

(Marr)
Learning the meanings of words

“horse” “horse” “horse”

The objects of planet Gazoob
“tufa”
“tufa”

“tufa”
The big question
How does the mind get so much out of so little?
– Perceiving the world from sense data
– Learning about kinds of objects and their properties
– Learning and interpreting the meanings of words, phrases,
and sentences
– Inferring causal relations
– Inferring the mental states of other people (beliefs,
desires, preferences) from observing their actions
– Learning social structures, conventions, and rules

The goal: A general-purpose computational framework

for understanding of how people make these
inferences, and how they can be successful.
The problems of induction
1. How does abstract knowledge guide inductive
learning, inference, and decision-making from sparse,
noisy or ambiguous data?
2. What is the form and content of our abstract
knowledge of the world?
3. What are the origins of our abstract knowledge? To
what extent can it be acquired from experience?
4. How do our mental models grow over a lifetime,
balancing simplicity versus data fit (Occam),
accommodation versus assimilation (Piaget)?
5. How can learning and inference proceed efficiently
and accurately, even in the presence of complex
hypothesis spaces?
A toolkit for reverse-engineering induction
1. Bayesian inference in probabilistic generative models
2. Probabilities defined over structured representations:
graphs, grammars, predicate logic, schemas
3. Hierarchical probabilistic models, with inference at all
levels of abstraction
4. Models of unbounded complexity (“nonparametric
Bayes” or “infinite models”), which can grow in
complexity or change form as observed data dictate.
5. Approximate methods of learning and inference, such
as belief propagation, expectation-maximization (EM),
Markov chain Monte Carlo (MCMC), and sequential
Monte Carlo (particle filtering).
Grammar G S ! NP VP
NP ! Det [ Adj ] Noun [ RelClause]
RelClause ! [ Rel ] NP V
VP ! VP NP
P(S | G)
VP ! Verb

Phrase structure S
P(
P(U | S)

Utterance U

P(S | U, G) ~ P(U | S) x P(S | G)

Bottom-up Top-down
“Universal Grammar” Hierarchical phrase structure
grammars (e.g., CFG, HPSG, TAG)
P(grammar | UG)

Grammar S ! NP VP
NP ! Det [ Adj ] Noun [ RelClause]
RelClause ! [ Rel ] NP V
VP ! VP NP
P(phrase structure | grammar)
VP ! Verb

Phrase structure

P(utterance | phrase structure)

Utterance
P(speech | utterance)
Speech signal
Vision as probabilistic parsing

(Han and Zhu, 2006)

Learning word meanings
Whole-object principle
Shape bias
Principles Taxonomic principle
Contrast principle
Basic-level bias

Structure

Data
Causal learning and reasoning
Principles

Structure

Data
Goal-directed action
(production and comprehension)

(Wolpert et al., 2003)

Why Bayesian models of cognition?
• A framework for understanding how the mind can solve
fundamental problems of induction.
• Strong, principled quantitative models of human cognition.
• Tools for studying people’s implicit knowledge of the world.
• Beyond classic limiting dichotomies: “rules vs. statistics”,
“nature vs. nurture”, “domain-general vs. domain-specific” .
• A unifying mathematical language for all of the cognitive
sciences: AI, machine learning and statistics, psychology,
neuroscience, philosophy, linguistics…. A bridge between
engineering and “reverse-engineering”.

Why now? Much recent progress, in computational resources,

theoretical tools, and interdisciplinary connections.
Outline
• Morning
– Introduction: Why Bayes? (Josh)
– Basics of Bayesian inference (Josh)
– How to build a Bayesian cognitive model (Tom)
• Afternoon
– Hierarchical Bayesian models & probabilistic
models over structured representations (Charles)
– Monte Carlo methods of approximate learning and
inference; nonparametric Bayesian models (Tom)
Bayes’ rule

For any hypothesis h and data d,

Posterior Likelihood Prior
probability probability

– Data: John is coughing

– Some hypotheses:
1. John has a cold
2. John has lung cancer
3. John has a stomach flu
– Prior P(h) favors 1 and 3 over 2
– Likelihood P(d|h) favors 1 and 2 over 3
– Posterior P(h|d) favors 1 over 2 and 3
Plan for this lecture
• Some basic aspects of Bayesian statistics
– Comparing two hypotheses
– Model fitting
– Model selection
• Two (very brief) case studies in modeling
human inductive learning
– Causal learning
– Concept learning
Coin flipping
• Comparing two hypotheses
– data = HHTHT or HHHHH
– compare two simple hypotheses:
P(H) = 0.5 vs. P(H) = 1.0
• Parameter estimation (Model fitting)
– compare many hypotheses in a parameterized family
P(H) = θ : Infer θ
• Model selection
– compare qualitatively different hypotheses, often
varying in complexity:
P(H) = 0.5 vs. P(H) = θ
Coin flipping

HHTHT
HHHHH
What process produced these sequences?
Comparing two hypotheses
• Contrast simple hypotheses:
– h1: “fair coin”, P(H) = 0.5
– h2:“always heads”, P(H) = 1.0
• Bayes’ rule:
P ( h) P ( d | h)
P(h | d ) =
! P(hi ) P(d | hi )
hi

• With two hypotheses, use odds form

Comparing two hypotheses
P ( H1 | D ) P ( D | H1 ) P ( H1 )
= !
P( H 2 | D) P( D | H 2 ) P( H 2 )

D: HHTHT
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/25 P(H1) = ?
P(D|H2) = 0 P(H2) = 1-?
Comparing two hypotheses
P ( H1 | D ) P ( D | H1 ) P ( H1 )
= !
P( H 2 | D) P( D | H 2 ) P( H 2 )

D: HHTHT
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/25 P(H1) = 999/1000
P(D|H2) = 0 P(H2) = 1/1000

P ( H1 | D ) 1 32 999
= ! = infinity
P( H 2 | D) 0 1
Comparing two hypotheses
P ( H1 | D ) P ( D | H1 ) P ( H1 )
= !
P( H 2 | D) P( D | H 2 ) P( H 2 )

D: HHHHH
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/25 P(H1) = 999/1000
P(D|H2) = 1 P(H2) = 1/1000

P( H1 | D) 1 32 999
= " ! 30
P( H 2 | D) 1 1
Comparing two hypotheses
P ( H1 | D ) P ( D | H1 ) P ( H1 )
= !
P( H 2 | D) P( D | H 2 ) P( H 2 )

D: HHHHHHHHHH
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/210 P(H1) = 999/1000
P(D|H2) = 1 P(H2) = 1/1000

P( H1 | D) 1 1024 999
= " !1
P( H 2 | D) 1 1
Measuring prior knowledge
1. The fact that HHHHH looks like a “mere coincidence”,
without making us suspicious that the coin is unfair, while
HHHHHHHHHH does begin to make us suspicious,
measures the strength of our prior belief that the coin is
fair.
– If θ is the threshold for suspicion in the posterior odds, and D* is
the shortest suspicious sequence, the prior odds for a fair coin is
roughly θ/P(D*|“fair coin”).
– If θ ~ 1 and D* is between 10 and 20 heads, prior odds are roughly
between 1/1,000 and 1/1,000,000.
2. The fact that HHTHT looks representative of a fair coin,
and HHHHH does not, reflects our prior knowledge about
possible causal mechanisms in the world.
– Easy to imagine how a trick all-heads coin could work: low (but
not negligible) prior probability.
– Hard to imagine how a trick “HHTHT” coin could work: extremely
low (negligible) prior probability.
Coin flipping
• Basic Bayes
– data = HHTHT or HHHHH
– compare two hypotheses:
P(H) = 0.5 vs. P(H) = 1.0
• Parameter estimation (Model fitting)
– compare many hypotheses in a parameterized family
P(H) = θ : Infer θ
• Model selection
– compare qualitatively different hypotheses, often
varying in complexity:
P(H) = 0.5 vs. P(H) = θ
Parameter estimation
• Assume data are generated from a
parameterized model:
θ

d1 d2 d3 d4
P(H) = θ

• What is the value of θ ?

– each value of θ is a hypothesis H
– requires inference over infinitely many hypotheses
Model selection
• Assume hypothesis space of possible models:
θ s1 s2 s3 s4

d1 d2 d3 d4 d1 d2 d3 d4 d1 d2 d3 d4

Fair coin: P(H) = 0.5 P(H) = θ Hidden Markov model:

si !{Fair coin, Trick coin}

• Which model generated the data?

– requires summing out hidden variables
– requires some form of Occam’s razor to trade off
complexity with fit to the data.
Parameter estimation vs. Model selection
across learning and development
• Causality: learning the strength of a relation vs. learning
the existence and form of a relation
• Language acquisition: learning a speaker's accent, or
frequencies of different words vs. learning a new tense or
syntactic rule (or learning a new language, or the existence
of different languages)
• Concepts: learning what horses look like vs. learning that
there is a new species (or learning that there are species)
• Intuitive physics: learning the mass of an object vs.
learning about gravity or angular momentum
A hierarchical learning framework

model M

parameter Parameter estimation:

w
setting
p ( w | D, M ) ! p ( D | w, M ) p ( w | M )

data D
A hierarchical learning framework
model class C p ( D | M ) = ! p ( D | w, M ) p ( w | M )
w

Model selection:
model M
p ( M | D, C ) ! p ( D | M ) p ( M | C )

parameter Parameter estimation:

w
setting
p ( w | D, M ) ! p ( D | w, M ) p ( w | M )

data D
Bayesian parameter estimation
• Assume data are generated from a model:
θ

d1 d2 d3 d4
P(H) = θ

• What is the value of θ ?

– each value of θ is a hypothesis H
– requires inference over infinitely many hypotheses
Some intuitions
• D = 10 flips, with 5 heads and 5 tails.
• θ = P(H) on next flip? 50%
• Why? 50% = 5 / (5+5) = 5/10.
• Why? “The future will be like the past”

• Suppose we had seen 4 heads and 6 tails.

• P(H) on next flip? Closer to 50% than to 40%.
• Why? Prior knowledge.
Integrating prior knowledge and data

p ( D | !) p (!)
p (! | D ) =
" p( D | !' ) p(!' ) d!'
• Posterior distribution P(θ | D) is a probability
density over θ = P(H)
• Need to specify likelihood P(D | θ ) and prior
distribution P(θ ).
Likelihood and prior
• Likelihood: Bernoulli distribution
P(D | θ ) = θ NH (1-θ ) NT
– NH: number of heads
– NT: number of tails

• Prior:
P(θ ) ∝ ?
Some intuitions
• D = 10 flips, with 5 heads and 5 tails.
• θ = P(H) on next flip? 50%
• Why? 50% = 5 / (5+5) = 5/10.
• Why? Maximum likelihood: !ˆ = arg max P( D | ! )
!
• Suppose we had seen 4 heads and 6 tails.
• P(H) on next flip? Closer to 50% than to 40%.
• Why? Prior knowledge.
A simple method of specifying priors
• Imagine some fictitious trials, reflecting a
set of previous experiences
– strategy often used with neural networks or
building invariance into machine vision.

• e.g., F ={1000 heads, 1000 tails} ~ strong

expectation that any new coin will be fair

• In fact, this is a sensible statistical idea...

Likelihood and prior
• Likelihood: Bernoulli(θ ) distribution
P(D | θ ) = θ NH (1-θ ) NT
– NH: number of heads
– NT: number of tails

• Prior: Beta(FH,FT) distribution

P(θ ) ∝ θ FH-1 (1-θ ) FT-1
– FH: fictitious observations of heads
– FT: fictitious observations of tails
Shape of the Beta prior
Bayesian parameter estimation

P(θ | D) ∝ P(D | θ ) P(θ ) = θ NH+FH-1 (1-θ ) NT+FT-1

• Posterior is Beta(NH+FH,NT+FT)
– same form as prior!
Bayesian parameter estimation
P(θ | D) ∝ P(D | θ ) P(θ ) = θ NH+FH-1 (1-θ ) NT+FT-1

FH,FT

D = NH,NT
d1 d2 d3 d4 H

• Posterior predictive distribution:

1
P(H|D, FH, FT) = !0 P(H|θ ) P(θ | D, FH, FT) dθ
“hypothesis averaging”
Bayesian parameter estimation
P(θ | D) ∝ P(D | θ ) P(θ ) = θ NH+FH-1 (1-θ ) NT+FT-1

FH,FT

D = NH,NT
d1 d2 d3 d4 H

• Posterior predictive distribution:

P(H|D, FH, FT) = (NH+FH)
(NH+FH+NT+FT)
Conjugate priors
• A prior p(θ ) is conjugate to a likelihood
function p(D | θ ) if the posterior has the same
functional form of the prior.
– Parameter values in the prior can be thought of as a
summary of “fictitious observations”.
– Different parameter values in the prior and
posterior reflect the impact of observed data.
– Conjugate priors exist for many standard models
(e.g., all exponential family models)
Some examples
• e.g., F ={1000 heads, 1000 tails} ~ strong
expectation that any new coin will be fair
• After seeing 4 heads, 6 tails, P(H) on next
flip = 1004 / (1004+1006) = 49.95%

• e.g., F ={3 heads, 3 tails} ~ weak

expectation that any new coin will be fair
• After seeing 4 heads, 6 tails, P(H) on next
flip = 7 / (7+9) = 43.75%
Prior knowledge too weak
But… flipping thumbtacks
• e.g., F ={4 heads, 3 tails} ~ weak expectation
that tacks are slightly biased towards heads
• After seeing 2 heads, 0 tails, P(H) on next flip
= 6 / (6+3) = 67%

• Some prior knowledge is always necessary to

avoid jumping to hasty conclusions...
• Suppose F = { }: After seeing 1 heads, 0 tails,
P(H) on next flip = 1 / (1+0) = 100%
Origin of prior knowledge
• Tempting answer: prior experience
• Suppose you have previously seen 2000
coin flips: 1000 heads, 1000 tails
Problems with simple empiricism
• Haven’t really seen 2000 coin flips, or any flips of a
thumbtack
– Prior knowledge is stronger than raw experience justifies
• Haven’t seen exactly equal number of heads and tails
– Prior knowledge is smoother than raw experience justifies
• Should be a difference between observing 2000 flips
of a single coin versus observing 10 flips each for 200
coins, or 1 flip each for 2000 coins
– Prior knowledge is more structured than raw experience
A simple theory
• “Coins are manufactured by a standardized
procedure that is effective but not perfect, and
symmetric with respect to heads and tails.
Tacks are asymmetric, and manufactured to
less exacting standards.”
– Justifies generalizing from previous coins to the
present coin.
– Justifies smoother and stronger prior than raw
experience alone.
– Explains why seeing 10 flips each for 200 coins is
more valuable than seeing 2000 flips of one coin.
A hierarchical Bayesian model
physical knowledge

Coins
θ ~ Beta(FH,FT)
FH,FT

Coin 1 θ1 Coin 2 θ2
... θ200 Coin 200

d1 d2 d3 d4 d1 d2 d3 d4 d1 d2 d3 d4

• Qualitative physical knowledge (symmetry) can

influence estimates of continuous parameters (FH, FT).
• Explains why 10 flips of 200 coins are better than 2000
flips of a single coin: more informative about FH, FT.
Summary: Bayesian parameter estimation
• Learning the parameters of a generative
model as Bayesian inference.
• Prediction by Bayesian hypothesis averaging.
• Conjugate priors
– an elegant way to represent simple kinds of prior
knowledge.
• Hierarchical Bayesian models
– integrate knowledge across instances of a system,
or different systems within a domain, to explain
the origins of priors.
A hierarchical learning framework
model class C p ( D | M ) = ! p ( D | w, M ) p ( w | M )
w

Model selection:
model M
p ( M | D, C ) ! p ( D | M ) p ( M | C )

parameter Model fitting:

w
setting
p ( w | D, M ) ! p ( D | w, M ) p ( w | M )

data D
Stability versus Flexibility
• Can all domain knowledge be represented
with conjugate priors?
• Suppose you flip a coin 25 times and get all
heads. Something funny is going on …
• But with F ={1000 heads, 1000 tails},
P(heads) on next flip = 1025 / (1025+1000)
= 50.6%. Looks like nothing unusual.
• How do we balance stability and flexibility?
– Stability: 6 heads, 4 tails θ ~ 0.5
– Flexibility: 25 heads, 0 tails θ ~1
Bayesian model selection
θ

d1 d2 d3 d4 vs. d1 d2 d3 d4
Fair coin, P(H) = 0.5 P(H) = θ

• Which provides a better account of the data:

the simple hypothesis of a fair coin, or the
complex hypothesis that P(H) = θ ?
Comparing simple and complex hypotheses

• P(H) = θ is more complex than P(H) = 0.5 in

two ways:
– P(H) = 0.5 is a special case of P(H) = θ
– for any observed sequence D, we can choose θ
such that D is more probable than if P(H) = 0.5
Comparing simple and complex hypotheses
n N !n
P( D | " ) = " (1 ! " )
Probability

θ = 0.5

D = HHHHH
Comparing simple and complex hypotheses
n N !n
P( D | " ) = " (1 ! " )

θ = 1.0
Probability

θ = 0.5

D = HHHHH
Comparing simple and complex hypotheses
n N !n
P( D | " ) = " (1 ! " )
Probability

θ = 0.6
θ = 0.5

D = HHTHT
Comparing simple and complex hypotheses
• P(H) = θ is more complex than P(H) = 0.5 in
two ways:
– P(H) = 0.5 is a special case of P(H) = θ
– for any observed sequence X, we can choose θ
such that X is more probable than if P(H) = 0.5
• How can we deal with this?
– Some version of Occam’s razor?
– Bayes: automatic version of Occam’s razor
follows from the “law of conservation of belief”.
Comparing simple and complex hypotheses

P(h1|D) P(D|h1) P(h1)

= x
P(h0|D) P(D|h0) P(h0)

P( D | h0 ) = (1 / 2) n (1 ! 1 / 2) N ! n = 1 / 2 N

1
P( D | h1 ) = ! P( D | ", h1 ) p (" | h1 )d"
0
The “evidence” or “marginal likelihood”: The
probability that randomly selected parameters
from the prior would generate the data.
P( D | h1 )
log
P( D | h0 )

1
P( D | h1 ) = ! P( D | ", h1 ) p (" | h1 )d"
0

P( D | h0 ) = 1 / 2 N

!
Stability versus Flexibility revisited
fair/unfair?
• Model class hypothesis: is this
coin fair or unfair?
FH,FT
• Example probabilities:
– P(fair) = 0.999
θ
– P(θ |fair) is Beta(1000,1000)
– P(θ |unfair) is Beta(1,1)
d1 d2 d3 d4
• 25 heads in a row propagates up,
affecting θ and then P(fair|D)
P(fair|25 heads) P(25 heads|fair) P(fair)
=
P(unfair|25 heads) P(25 heads|unfair) P(unfair) ~ 0.001
Bayesian Occam’s Razor

p(D = d | M )

All possible data sets d

For any model M, ! p (D = d | M ) = 1

all d "D
Law of “conservation of belief”: A model that can predict many
possible data sets must assign each of them low probability.
Occam’s Razor in curve fitting
! p (D = d | M ) = 1 M1
all d "D

M1
p(D = d | M )

M2
M2
M3

D
Observed data
M3
M1: A model that is too simple is unlikely to generate
the data.
M3: A model that is too complex can generate many
possible data sets, so it is unlikely to generate
this particular data set at random.
Summary so far
• Three kinds of Bayesian inference
– Comparing two simple hypotheses
– Parameter estimation
• The importance and subtlety of prior knowledge
– Model selection
• Bayesian Occam’s razor, the blessing of abstraction
• Key concepts
– Probabilistic generative models
– Hierarchies of abstraction, with statistical
inference at all levels
– Flexibly structured representations
Plan for this lecture
• Some basic aspects of Bayesian statistics
– Comparing two hypotheses
– Model fitting
– Model selection
• Two (very brief) case studies in modeling
human inductive learning
– Causal learning
– Concept learning
Learning causation from correlation

C present C absent
(c+) (c-)

E present (e+) a c

E absent (e-) b d

“Does C cause E?”

(rate on a scale from 0 to 100)
Learning with graphical models
• Strength: how strong is the relationship?
B C
Delta-P, Power PC, …
w0 w1
E

• Structure: does a relationship exist?

B C B C

vs.
h1 h0
E E
Bayesian learning of causal structure
• Hypotheses: B C B C

vs.
h1 h0
E E

• Bayesian causal inference:

P(d|h1) likelihood ratio (Bayes factor)
support = log
P(d|h0) gives evidence in favor of h1

1 1
P(d | h1 ) = " " P(d | w ,w ) p(w ,w | h ) dw
0 0 0 1 0 1 1 0 dw1
1
P(d | h0 ) = " 0 P(d | w 0 ) p(w 0 | h0 ) dw 0

!
Bayesian Occam’s Razor

h0 (no relationship)
For any model h,
P(d | h )

" P(d | h) = 1
d

h1 (positive relationship)

All!data sets d

P(e+|c+) ~ P(e+|c+) >>

P(e+|c-) P(e+|c-)
Comparison with human judgments
(Buehner & Cheng, 1997; 2003) 5
.2 5 .5 .7 1
=0 = 0 = 0 = 0 =
ΔP ΔP ΔP ΔP Δ P

People

Assume
B C ΔP
structure:
Estimate w0 w1
strength w1 E
Power PC

Bayesian structure learning

B C B C
vs.
w0 w1 w0
E E
Inferences about causal structure depend on
the functional form of causal relations
Concept learning: the number game

• Program input: number between 1 and 100

• Program output: “yes” or “no”
• Learning task:
– Observe one or more positive (“yes”) examples.
– Judge whether other numbers are “yes” or “no”.
Concept learning: the number game

Examples of Generalization
“yes” numbers judgments (N = 20)

60
Diffuse similarity

60 80 10 30 Rule:
“multiples of 10”

60 52 57 55 Focused similarity:
numbers near 50-60
Bayesian model
• H: Hypothesis space of possible concepts:
– H1: Mathematical properties: multiples and powers of small numbers.
– H2: Magnitude: intervals with endpoints between 1 and 100.

• X = {x1, . . . , xn}: n examples of a concept C.

• Evaluate hypotheses given data:
p ( X | h) p ( h)
p(h | X ) =
! p( X | h#) p(h#)
h#"H
– p(h) [prior]: domain knowledge, pre-existing biases
– p(X|h) [likelihood]: statistical information in examples.
– p(h|X) [posterior]: degree of belief that h is the true extension of C.
Generalizing to new objects
Given p(h|X), how do we compute p( y " C | X ) =
, ! p( y " C |
the probability that C applies to some new h"H
stimulus y?

Background
knowledge
p( y ! C | X ) =
h ! p ( y " C | h) p ( h | X )
h"H
X= x1 x2 x3 x4 y !C ?
Likelihood: p(X|h)
• Size principle: Smaller hypotheses receive greater
likelihood, and exponentially more so as n increases.
n
' 1 $
p ( X | h) = % " if x1 , K , xn ! h
& size(h) #
= 0 if any xi ! h

• Follows from assumption of randomly sampled examples

+ law of “conservation of belief”:
! p (D = d | M ) = 1
all d "D
• Captures the intuition of a “representative” sample.
Illustrating the size principle
h1 2 4 6 8 10 h2
12 14 16 18 20
22 24 26 28 30
32 34 36 38 40
42 44 46 48 50
52 54 56 58 60
62 64 66 68 70
72 74 76 78 80
82 84 86 88 90
92 94 96 98 100
Illustrating the size principle
h1 2 4 6 8 10 h2
12 14 16 18 20
22 24 26 28 30
32 34 36 38 40
42 44 46 48 50
52 54 56 58 60
62 64 66 68 70
72 74 76 78 80
82 84 86 88 90
92 94 96 98 100

Data slightly more of a coincidence under h1

Illustrating the size principle
h1 2 4 6 8 10 h2
12 14 16 18 20
22 24 26 28 30
32 34 36 38 40
42 44 46 48 50
52 54 56 58 60
62 64 66 68 70
72 74 76 78 80
82 84 86 88 90
92 94 96 98 100

Data much more of a coincidence under h1

Prior: p(h)
• Choice of hypothesis space embodies a strong prior:
effectively, p(h) ~ 0 for many logically possible but
conceptually unnatural hypotheses.
• Prevents overfitting by highly specific but unnatural
hypotheses, e.g. “multiples of 10 except 50 and 70”.

e.g., X = {60 80 10 30}:

• X = {60, 80, 10, 30}

• Why prefer “multiples of 10” over “even
numbers”? p(X|h).
• Why prefer “multiples of 10” over “multiples of
10 except 50 and 20”? p(h).
• Why does a good generalization need both high
prior and high likelihood? p(h|X) ~ p(X|h) p(h)

Occam’s razor: balancing simplicity and fit to data

H1: Mathematical properties (24) H2: Magnitude intervals (5050)

• even numbers • 10-15
• powers of two • 20-32
• multiples of three • 37-54
... p(h) = λ / 24 … p(h) = 1-λ / 5050 * Gamma(s;σ)
+ Examples Human generalization Bayesian Model

60 80 10 30

60 52 57 55

16 8 2 64

16 23 19 20
Stability versus Flexibility
math/magnitude?
• Higher-level hypothesis: is this concept
mathematical or magnitude-based?
• Example probabilities: h
– P(math) = λ
– P(h | math) … X= x1 x2 x3 x4
– P(h | magnitude) …

• Just a few examples may be sufficient to infer the kind of

concept, under the size-principle likelihood
– if an a priori reasonable hypothesis of one kind fits much more tightly
than all reasonable hypothesis of the other kind.
• Just a few examples can give all-or-none, “rule-like”
generalization or more graded, “similarity-like” generalization.
– More all-or-none when the smallest consistent hypothesis is much
smaller than all other reasonable hypotheses; otherwise more graded.
Conclusion:
Contributions of Bayesian models
• A framework for understanding how the mind can solve
fundamental problems of induction.
• Strong, principled quantitative models of human cognition.
• Tools for studying people’s implicit knowledge of the world.
• Beyond classic limiting dichotomies: “rules vs. statistics”,
“nature vs. nurture”, “domain-general vs. domain-specific” .
• A unifying mathematical language for all of the cognitive
sciences: AI, machine learning and statistics, psychology,
neuroscience, philosophy, linguistics…. A bridge between
engineering and “reverse-engineering”.
A toolkit for reverse-engineering induction
1. Bayesian inference in probabilistic generative models
2. Probabilities defined over structured representations:
graphs, grammars, predicate logic, schemas
3. Hierarchical probabilistic models, with inference at all
levels of abstraction
4. Models of unbounded complexity (“nonparametric
Bayes” or “infinite models”), which can grow in
complexity or change form as observed data dictate.
5. Approximate methods of learning and inference, such
as belief propagation, expectation-maximization (EM),
Markov chain Monte Carlo (MCMC), and sequential
Monte Carlo (particle filtering).

Probabilistic Reasoning
No ratings yet
Probabilistic Reasoning
32 pages
Bayesian Cognition Explored
No ratings yet
Bayesian Cognition Explored
13 pages
Nhso401 r7b BayesianModeling CompPsy
No ratings yet
Nhso401 r7b BayesianModeling CompPsy
12 pages
COMP3411 Week 9 - Uncertainty
No ratings yet
COMP3411 Week 9 - Uncertainty
70 pages
Unit3pdf 2025 01 14 10 38 08
No ratings yet
Unit3pdf 2025 01 14 10 38 08
4 pages
Bayesian Cognitive Modeling A Practical Course, 1st Edition Entire Book Download
100% (20)
Bayesian Cognitive Modeling A Practical Course, 1st Edition Entire Book Download
15 pages
Ai Unit V
No ratings yet
Ai Unit V
18 pages
Lecture 14
No ratings yet
Lecture 14
54 pages
16 - The Bayesian Mind
No ratings yet
16 - The Bayesian Mind
33 pages
Tics Theories Reprint
No ratings yet
Tics Theories Reprint
10 pages
Artifical Intelligence Notes Part 7
No ratings yet
Artifical Intelligence Notes Part 7
49 pages
Machine Learning Course Overview
No ratings yet
Machine Learning Course Overview
91 pages
Intro to Learning Theory
No ratings yet
Intro to Learning Theory
35 pages
Lec14 15 GenerativeModelsForDiscreteData
No ratings yet
Lec14 15 GenerativeModelsForDiscreteData
74 pages
Statistical Reasoning
No ratings yet
Statistical Reasoning
19 pages
Unit 3 Bayesian Concept Learning
No ratings yet
Unit 3 Bayesian Concept Learning
66 pages
Unit 2
No ratings yet
Unit 2
20 pages
Bayesian Learning for ML Experts
No ratings yet
Bayesian Learning for ML Experts
18 pages
Bayesian Cognitive Modeling A Practical Course - 1st Edition ISBN 1107018455, 9781107018457 Digital Download
100% (1)
Bayesian Cognitive Modeling A Practical Course - 1st Edition ISBN 1107018455, 9781107018457 Digital Download
14 pages
Unit2 AI & ML
No ratings yet
Unit2 AI & ML
29 pages
This Is A Traditional AI Topic, But We Need To Cover It in at Least A Little Detail Here There Are Many Different Approaches To Handling Uncertainty
No ratings yet
This Is A Traditional AI Topic, But We Need To Cover It in at Least A Little Detail Here There Are Many Different Approaches To Handling Uncertainty
32 pages
Chapter 5 - Uncertain Knowledge and Reasoning
No ratings yet
Chapter 5 - Uncertain Knowledge and Reasoning
29 pages
Module 3
No ratings yet
Module 3
70 pages
Bayesian Learning
No ratings yet
Bayesian Learning
42 pages
4th Module - Probabilistic Reasoning
No ratings yet
4th Module - Probabilistic Reasoning
24 pages
Machine Learning Notes
No ratings yet
Machine Learning Notes
180 pages
CC Unit 4
No ratings yet
CC Unit 4
29 pages
CC Unit 4
No ratings yet
CC Unit 4
29 pages
Bayes ML Tutorial
No ratings yet
Bayes ML Tutorial
69 pages
Unit Ii
No ratings yet
Unit Ii
23 pages
Lecture 13
No ratings yet
Lecture 13
45 pages
Module V - v1
No ratings yet
Module V - v1
58 pages
Alison Gopnik Cognitive Development
No ratings yet
Alison Gopnik Cognitive Development
4 pages
Slide07 Bayes
No ratings yet
Slide07 Bayes
51 pages
Bayes Algorithm
No ratings yet
Bayes Algorithm
26 pages
Hypothesis Space and Inductive Bias, Training, Test Data and Cross Validation
No ratings yet
Hypothesis Space and Inductive Bias, Training, Test Data and Cross Validation
53 pages
Probabilistic Reasoning: Unit-V
No ratings yet
Probabilistic Reasoning: Unit-V
33 pages
Unit 2 - Probabilistic Reasoning
No ratings yet
Unit 2 - Probabilistic Reasoning
25 pages
Artificial Intelligence M2
No ratings yet
Artificial Intelligence M2
12 pages
Aiml 2 3
No ratings yet
Aiml 2 3
51 pages
Unit 5
No ratings yet
Unit 5
21 pages


Bayesian models of inductive learning

How does the mind get so much out of so little?

“The problem of induction”

“horse” “horse” “horse”

The goal: A general-purpose computational framework

P(S | U, G) ∝ P(U | S) × P(S | G)

P(utterance | phrase structure)

(Han and Zhu, 2006)

(Wolpert et al., 2003)

Why now? Much recent progress, in computational resources,

For any hypothesis h and data d,

– Data: John is coughing

• With two hypotheses, use odds form
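
For reference, the standard statement of Bayes' rule and its odds form can be written out as below. This is a sketch in notation assumed here: the hypothesis space $\mathcal{H}$, the dummy variable $h'$, and the labels $h_1$ and $h_2$ are illustrative choices, not taken from the slides.

```latex
% Bayes' rule for any hypothesis h and data d
P(h \mid d) = \frac{P(d \mid h)\, P(h)}{\sum_{h' \in \mathcal{H}} P(d \mid h')\, P(h')}

% Odds form for two hypotheses: the normalizing sum cancels
\frac{P(h_1 \mid d)}{P(h_2 \mid d)}
  = \frac{P(d \mid h_1)}{P(d \mid h_2)} \times \frac{P(h_1)}{P(h_2)}
```

The odds form is convenient because the denominator of Bayes' rule never has to be computed: posterior odds are the likelihood ratio times the prior odds.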

• What is the value of θ?

Fair coin: P(H) = 0.5
P(H) = θ
Hidden Markov model:

• Which model generated the data?

Parameter estimation:

• What is the value of θ?

• Suppose we had seen 4 heads and 6 tails.

• e.g., F = {1000 heads, 1000 tails} ~ strong

• In fact, this is a sensible statistical idea...

• Prior: $\mathrm{Beta}(F_H, F_T)$ distribution

$P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta) \propto \theta^{N_H + F_H - 1} (1 - \theta)^{N_T + F_T - 1}$

• Posterior predictive distribution:

• e.g., F = {3 heads, 3 tails} ~ weak
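
The effect of prior strength can be checked numerically. The sketch below assumes the standard Beta-Bernoulli result that, with a Beta(FH, FT) prior and NH heads and NT tails observed, the posterior is Beta(NH + FH, NT + FT) and the predictive probability of heads is its mean. The function name is illustrative; the data (4 heads, 6 tails) and the two priors are the ones mentioned in the slides.

```python
from fractions import Fraction


def posterior_predictive_heads(n_heads, n_tails, f_heads, f_tails):
    """P(next flip is heads | data) under a Beta(f_heads, f_tails) prior.

    The posterior is Beta(n_heads + f_heads, n_tails + f_tails); the
    predictive probability of heads is the mean of that distribution.
    """
    return Fraction(n_heads + f_heads,
                    n_heads + f_heads + n_tails + f_tails)


# Observed data: 4 heads and 6 tails.
N_HEADS, N_TAILS = 4, 6

# Strong prior: F = {1000 heads, 1000 tails}.
strong = posterior_predictive_heads(N_HEADS, N_TAILS, 1000, 1000)
# Weak prior: F = {3 heads, 3 tails}.
weak = posterior_predictive_heads(N_HEADS, N_TAILS, 3, 3)

print(f"strong prior: {float(strong):.4f}")
print(f"weak prior:   {float(weak):.4f}")
```

With the strong prior the prediction stays very close to 0.5 (1004/2010), while the weak prior lets ten flips pull it down to 7/16.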

• Some prior knowledge is always necessary to

• Qualitative physical knowledge (symmetry) can

Model fitting:

• Which provides a better account of the data:

• P(H) = θ is more complex than P(H) = 0.5 in

P(h1|D) P(D|h1) P(h1)
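
A minimal sketch of this comparison for the coin example, under two assumptions that are not stated in the slides: the unknown bias θ gets a uniform prior, and the two models have equal prior probability. Under the uniform prior, the marginal likelihood of a particular sequence with NH heads and NT tails is NH! NT! / (NH + NT + 1)!. All function names are illustrative.

```python
from math import exp, lgamma, log


def log_marginal_fair(n_heads, n_tails):
    """log P(D | fair coin): each flip has probability 0.5."""
    return (n_heads + n_tails) * log(0.5)


def log_marginal_unknown_bias(n_heads, n_tails):
    """log P(D | P(H) = theta), with theta integrated out.

    Assumes a uniform prior on theta, so the integral of
    theta^NH (1 - theta)^NT is NH! NT! / (NH + NT + 1)!.
    """
    return (lgamma(n_heads + 1) + lgamma(n_tails + 1)
            - lgamma(n_heads + n_tails + 2))


def log_posterior_odds(n_heads, n_tails, log_prior_odds=0.0):
    """Log posterior odds of the unknown-bias model against the fair coin."""
    return (log_marginal_unknown_bias(n_heads, n_tails)
            - log_marginal_fair(n_heads, n_tails)
            + log_prior_odds)


for n_heads, n_tails in [(5, 5), (10, 0)]:
    odds = exp(log_posterior_odds(n_heads, n_tails))
    print(f"{n_heads}H {n_tails}T: odds (unknown bias : fair) = {odds:.3f}")
```

Balanced data favor the fair coin even though the flexible model can fit them equally well, because the flexible model spreads its probability over many possible values of θ. A run of ten heads reverses the preference.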

All possible data sets d

For any model M, $\sum_{d} p(D = d \mid M) = 1$
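
This normalization constraint can be verified by brute force for short coin sequences. The sketch below enumerates every sequence of five flips and sums its probability under a fair-coin model and under an unknown-bias model. The uniform prior on θ and the choice of five flips are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def p_fair(seq):
    """P(sequence | fair coin)."""
    return Fraction(1, 2) ** len(seq)


def p_unknown_bias(seq):
    """P(sequence | P(H) = theta), theta integrated out under a uniform prior."""
    n_heads = seq.count("H")
    n_tails = len(seq) - n_heads
    return Fraction(factorial(n_heads) * factorial(n_tails),
                    factorial(len(seq) + 1))


N_FLIPS = 5
# All possible data sets d: every sequence of N_FLIPS heads and tails.
datasets = ["".join(flips) for flips in product("HT", repeat=N_FLIPS)]

total_fair = sum(p_fair(d) for d in datasets)
total_unknown = sum(p_unknown_bias(d) for d in datasets)
print(total_fair, total_unknown)

# The same total mass is distributed differently across data sets.
print(p_fair("HHHHH"), p_unknown_bias("HHHHH"))
print(p_fair("HTHTH"), p_unknown_bias("HTHTH"))
```

Both totals are exactly 1. The flexible model puts more mass on extreme sequences such as HHHHH and therefore must put less on mixed sequences such as HTHTH, which is the source of the automatic Occam's razor.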

“Does C cause E?”

• Structure: does a relationship exist?

• Bayesian causal inference:

P(e+|c+) ~ P(e+|c+) >>

Bayesian structure learning

• Program input: number between 1 and 100

• X = {x1, . . . , xn}: n examples of a concept C.

• Follows from assumption of randomly sampled examples

Data slightly more of a coincidence under h1

Data much more of a coincidence under h1

e.g., X = {60, 80, 10, 30}:

• X = {60, 80, 10, 30}

Occam’s razor: balancing simplicity and fit to data

H1: Mathematical properties (24)
H2: Magnitude intervals (5050)
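
The sketch below illustrates how such a hypothesis space can be scored on X = {60, 80, 10, 30}. It rests on several assumptions that are not in the slides: examples are sampled uniformly at random from the concept, so the likelihood of n consistent examples is (1/|h|)^n; the prior is uniform over hypotheses; and only two of the 24 mathematical hypotheses are included. The count of 5050 intervals is recovered by enumerating every interval [a, b] with 1 ≤ a ≤ b ≤ 100.

```python
N_MAX = 100
X = [60, 80, 10, 30]

# Two illustrative mathematical hypotheses (a hypothetical subset of the 24).
hypotheses = {
    "multiples of 10": {n for n in range(1, N_MAX + 1) if n % 10 == 0},
    "even numbers": {n for n in range(1, N_MAX + 1) if n % 2 == 0},
}

# Every magnitude interval [a, b] with 1 <= a <= b <= N_MAX.
intervals = {
    f"interval [{a}, {b}]": set(range(a, b + 1))
    for a in range(1, N_MAX + 1)
    for b in range(a, N_MAX + 1)
}
n_intervals = len(intervals)
hypotheses.update(intervals)


def likelihood(examples, extension):
    """P(examples | h) if examples are drawn uniformly from h's extension."""
    if not all(x in extension for x in examples):
        return 0.0
    return (1.0 / len(extension)) ** len(examples)


# Uniform prior over all hypotheses (an assumption made for this sketch).
prior = 1.0 / len(hypotheses)
unnormalized = {name: likelihood(X, ext) * prior
                for name, ext in hypotheses.items()}
evidence = sum(unnormalized.values())
posterior = {name: p / evidence for name, p in unnormalized.items()}

best = max(posterior, key=posterior.get)
print(n_intervals)
print(best, round(posterior[best], 3))
```

Many hypotheses are consistent with the four examples, but the smallest consistent one receives by far the highest likelihood, so a few examples are enough to concentrate the posterior.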

• Just a few examples may be sufficient to infer the kind of
