Introduction To Probability

The Foundations of Probability: A Comprehensive Guide

1 Introduction to Probability
Probability is a numerical measure that quantifies the likelihood of an event occurring. It’s expressed
as a number between 0 and 1, inclusive:

• 0 (Impossible): An event with a probability of 0 will never happen.


– Example: The probability of rolling an 8 on a standard six-sided die is 0.

• 1 (Certain): An event with a probability of 1 is guaranteed to happen.


– Example: The probability that the sun will rise tomorrow is 1.

1.1 Core Concepts and the Probability Space


The foundation of probability theory lies in defining a probability space, which consists of three main
components:
1. Sample Space (Ω): The set of all possible outcomes.
2. Event Space (F): A collection of subsets of Ω that we define as "events" (these are the things
we can assign probabilities to). For most practical purposes, this includes all individual outcomes
and combinations of outcomes.
3. Probability Measure (P ): A function that assigns a probability to each event in F.

1.1.1 Sample Space (Ω)


The sample space is the set of all possible outcomes of a random process or experiment. It’s the
complete collection of results that can occur.

• Example 1: Tossing a single coin


The possible outcomes are Heads or Tails. Ω = {Heads, Tails}

• Example 2: Rolling a single standard six-sided die


The possible outcomes are the numbers 1 through 6. Ω = {1, 2, 3, 4, 5, 6}

1.1.2 Event
An event is any subset of the sample space (Ω). It’s a specific outcome or a collection of outcomes
that we are interested in, belonging to the event space (F).

• Example 1: Rolling an even number on a die


Let Event A be "Rolling an even number." From the sample space Ω = {1, 2, 3, 4, 5, 6}, the even
numbers are 2, 4, and 6. A = {2, 4, 6}

• Example 2: Tossing a coin twice and getting exactly one Head


From the sample space Ω = {HH, HT, TH, TT}, the outcomes with exactly one head are HT and
TH. Let Event B be "Getting exactly one Head." B = {HT, TH}

1.1.3 Probability Axioms
These are fundamental rules that all probabilities must follow:

1. Range: The probability of any event A, denoted P (A), must be between 0 and 1. 0 ≤ P (A) ≤ 1
2. Total Probability: The probability of the entire sample space Ω (meaning something in the
sample space will happen) is 1. P (Ω) = 1

3. Additivity (for Disjoint Events): If two events, A and B, are disjoint (also known as mutually
exclusive, meaning they cannot happen at the same time and have no common outcomes, i.e.,
A ∩ B = ∅), then the probability of A or B happening is the sum of their individual probabilities.
P (A ∪ B) = P (A) + P (B)

• Example: When rolling a fair die, let A be "rolling an odd number" (A = {1, 3, 5}) and B be
"rolling a 2" (B = {2}). A and B are disjoint. P (A) = 3/6, P (B) = 1/6. The event "rolling
an odd number or a 2" is {1, 2, 3, 5}, and its probability is 4/6. Using the axiom: P (A ∪ B) =
P (A) + P (B) = 3/6 + 1/6 = 4/6. The results match.

1.2 Example: Basic Probability Calculation


A fair die is rolled once.

• Sample space: Ω = {1, 2, 3, 4, 5, 6}


• Let A = {2, 4, 6} (even numbers). To find P (A), we use the classical definition of probability (for
equally likely outcomes): P (A) = (Number of outcomes in A) / (Total number of outcomes in Ω) = 3/6 = 0.5
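The classical definition and the additivity axiom can be checked with a few lines of Python. This is a minimal sketch that assumes a fair die (equally likely outcomes); the helper name `prob` and the constant `OMEGA` are illustrative and do not come from the text.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die.
OMEGA = frozenset({1, 2, 3, 4, 5, 6})

def prob(event, omega=OMEGA):
    """Classical probability |event| / |omega|; valid only for equally likely outcomes."""
    event = frozenset(event)
    if not event <= omega:
        raise ValueError("event must be a subset of the sample space")
    return Fraction(len(event), len(omega))

A = {1, 3, 5}      # rolling an odd number
B = {2}            # rolling a 2
evens = {2, 4, 6}  # rolling an even number

print(prob(evens))                     # 1/2
print(prob(A | B), prob(A) + prob(B))  # 2/3 2/3 (additivity for disjoint events)
print(prob(OMEGA), prob(set()))        # 1 0
```

Exact fractions are used instead of floats so that the equality P (A ∪ B) = P (A) + P (B) can be compared without rounding error.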

2 Random Variables
A random variable (RV) is a function that assigns a numerical value to each outcome in the
sample space of a random experiment. Random variables are typically denoted by capital letters like X,
Y, Z.

2.1 Discrete Random Variables


A discrete random variable can take on a finite number of values or a countably infinite number of values
(e.g., whole numbers).

• Example 1: Toss two coins


Let X be the random variable representing the number of heads.
– Outcome HH → X = 2
– Outcome HT → X = 1
– Outcome TH → X = 1
– Outcome TT → X = 0
So, the possible values for X are X ∈ {0, 1, 2}.

2.1.1 Probability Mass Function (PMF)


The Probability Mass Function (PMF) describes how the probability is distributed among the
discrete values that a random variable can take. For a discrete RV X, the PMF, denoted P (X = x)
or p(x), gives the probability that X takes on a specific value x.

• Properties of a PMF:
– 0 ≤ P (X = x) ≤ 1 for all values of x.
– $\sum_x P(X = x) = 1$ (the sum of all probabilities for all possible values must equal 1).

• Example: Toss two coins (continued):


Assuming a fair coin, each outcome in Ω = {HH, HT, TH, TT} has a probability of 1/4.
The PMF for X (number of heads) is:
– P (X = 0) = P (TT) = 1/4
– P (X = 1) = P (HT or TH) = P (HT) + P (TH) = 1/4 + 1/4 = 2/4 = 1/2
– P (X = 2) = P (HH) = 1/4
To verify, sum the probabilities: 1/4 + 1/2 + 1/4 = 1. This is a valid PMF.
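The same PMF can be derived by enumerating the sample space. The sketch below assumes a fair coin, as in the example; the function name `pmf_number_of_heads` is an illustrative choice.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def pmf_number_of_heads(n_tosses=2):
    """PMF of X = number of heads in n_tosses tosses of a fair coin."""
    outcomes = list(product("HT", repeat=n_tosses))  # HH, HT, TH, TT for n = 2
    counts = Counter(outcome.count("H") for outcome in outcomes)
    # Each outcome is equally likely, so P(X = x) = (#outcomes with x heads) / (#outcomes).
    return {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

pmf = pmf_number_of_heads()
for x, p in pmf.items():
    print(f"P(X = {x}) = {p}")
print("sum =", sum(pmf.values()))
```

For two tosses this prints the probabilities 1/4, 1/2, 1/4 and a total of 1, matching the hand calculation.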

2.2 Continuous Random Variables


A continuous random variable can take on any real number value within a given range (or across
the entire real number line). You cannot list all possible values.

• Examples:
– Temperature (28.63◦ C)
– Height (175.4 cm)
– Time (e.g., time it takes for a light bulb to burn out)

2.2.1 Probability Density Function (PDF)
For continuous random variables, we use a Probability Density Function (PDF), denoted f (x).
Important distinction: For a continuous RV, the probability of it taking on any single exact value
is zero: P (X = x) = 0. This is because there are infinitely many possible values, so the chance of hitting
one specific value is infinitesimally small.
Instead, probabilities for continuous variables are found by calculating the area under the curve
of the PDF over a given interval.
$P(a \le X \le b) = \int_a^b f(x)\,dx$

• Properties of a PDF:
– f (x) ≥ 0 for all x (the probability density cannot be negative).
– $\int_{-\infty}^{\infty} f(x)\,dx = 1$ (the total area under the entire PDF curve must equal 1).

• Example: A Simple Continuous PDF


Consider a random variable X with the following PDF: f (x) = 2x, for 0 ≤ x ≤ 1
f (x) = 0, otherwise.

– Verify if it’s a valid PDF:


1. Is f (x) ≥ 0? Yes, for 0 ≤ x ≤ 1, 2x is non-negative.
2. Does $\int_{-\infty}^{\infty} f(x)\,dx = 1$? $\int_0^1 2x\,dx = [x^2]_0^1 = 1^2 - 0^2 = 1$. Yes, it’s a valid PDF.
– Find P (0.2 ≤ X ≤ 0.5):
This is the area under the PDF curve from 0.2 to 0.5. $P(0.2 \le X \le 0.5) = \int_{0.2}^{0.5} 2x\,dx = [x^2]_{0.2}^{0.5} = (0.5)^2 - (0.2)^2 = 0.25 - 0.04 = 0.21$
What this tells us: There’s a 21% chance that the random variable X will fall between 0.2 and 0.5.
The PDF shows that values closer to 1 are more likely than values closer to 0.

2.2.2 Cumulative Distribution Function (CDF)


The Cumulative Distribution Function (CDF), denoted F (x), gives the probability that a random
variable X takes on a value less than or equal to a specific value x. It’s a fundamental function that
applies to both discrete and continuous random variables.
F (x) = P (X ≤ x)

• Properties of a CDF:

– 0 ≤ F (x) ≤ 1 for all x.


– F (x) is non-decreasing: if x1 < x2 , then F (x1 ) ≤ F (x2 ).
– limx→−∞ F (x) = 0
– limx→∞ F (x) = 1
– For continuous RVs, $f(x) = \frac{d}{dx} F(x)$ (the PDF is the derivative of the CDF).
– For any two points a, b: P (a < X ≤ b) = F (b) − F (a).
• What it tells us: The CDF provides a comprehensive view of the entire probability distribution,
showing the accumulated probability up to any given point. It allows for easy calculation of
probabilities for intervals.

• Example: CDF for Discrete Variable (Toss two coins, X=number of heads)
PMF: P (0) = 1/4, P (1) = 1/2, P (2) = 1/4.
– F (x) = 0 for x < 0
– F (0) = P (X ≤ 0) = P (X = 0) = 1/4
– F (1) = P (X ≤ 1) = P (X = 0) + P (X = 1) = 1/4 + 1/2 = 3/4
– F (2) = P (X ≤ 2) = P (X = 0) + P (X = 1) + P (X = 2) = 1/4 + 1/2 + 1/4 = 1
– F (x) = 1 for x ≥ 2

• Example: CDF for Continuous Variable (from PDF f (x) = 2x for 0 ≤ x ≤ 1)
For 0 ≤ x ≤ 1: $F(x) = \int_0^x 2t\,dt = [t^2]_0^x = x^2$. So:
– F (x) = 0 for x < 0
– F (x) = x^2 for 0 ≤ x ≤ 1
– F (x) = 1 for x > 1

– Using CDF to find P (0.2 ≤ X ≤ 0.5): $P(0.2 \le X \le 0.5) = F(0.5) - F(0.2) = (0.5)^2 - (0.2)^2 = 0.25 - 0.04 = 0.21$, matching the previous result.
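As a numerical cross-check of the PDF and CDF calculations for f (x) = 2x, the sketch below integrates the density with a simple midpoint rule and compares the result with the closed-form CDF. The helper `integrate` and its step count are illustrative assumptions, not part of the text.

```python
def pdf(x):
    """f(x) = 2x on [0, 1], 0 elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def cdf(x):
    """F(x) = x^2 on [0, 1], 0 below the range and 1 above it."""
    if x < 0.0:
        return 0.0
    if x > 1.0:
        return 1.0
    return x * x

def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

total_area = integrate(pdf, 0.0, 1.0)   # should be close to 1
p_numeric = integrate(pdf, 0.2, 0.5)    # area under the PDF
p_from_cdf = cdf(0.5) - cdf(0.2)        # F(b) - F(a)
print(round(total_area, 6), round(p_numeric, 6), round(p_from_cdf, 6))
```

Both routes give approximately 0.21 for the interval probability. For densities without a closed-form CDF, numerical integration of this kind is usually the practical option.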

3 Measures of Distribution
These are measures that describe the central tendency and spread of a random variable.

3.1 Expectation (Mean) - E[X]


The expectation or mean of a random variable X, denoted E[X] or µ, is the weighted average of all
possible values that the random variable can take. The weights are their respective probabilities.

• What it tells us: The expectation represents the ”long-run average” value if you were to repeat
the random experiment infinitely many times. It’s the ”central tendency” or ”balancing point”
of the distribution. For a fair die, an expectation of 3.5 means that, over many rolls, the average
outcome will tend towards 3.5, even though 3.5 is not an actual outcome. It’s the best single value
to summarize the distribution’s typical outcome.

• For Discrete Random Variables: $E[X] = \sum_x x \cdot P(x)$

• For Continuous Random Variables: $E[X] = \int_{-\infty}^{\infty} x f(x)\,dx$
• Example 1: Expectation of a Die Roll
Let X be the outcome of a fair 6-sided die. PMF: P (X = x) = 1/6 for x ∈ {1, 2, 3, 4, 5, 6}.
           
$E[X] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6}$
$E[X] = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = \frac{21}{6} = 3.5$
What this tells us: If you roll a fair die many times, the average of all your rolls will approach 3.5.
• Example 2: Expectation of a Continuous Variable
Using the PDF from the previous section: f (x) = 2x, for 0 ≤ x ≤ 1.
$E[X] = \int_0^1 x \cdot (2x)\,dx = \int_0^1 2x^2\,dx$
$E[X] = \left[\frac{2x^3}{3}\right]_0^1 = \frac{2(1)^3}{3} - \frac{2(0)^3}{3} = \frac{2}{3}$

What this tells us: If you were to repeatedly draw samples from this distribution, the average of
those samples would tend towards 2/3.

3.2 Variance (Var(X)) and Standard Deviation (σ)


• What they tell us: These measures quantify the spread or dispersion of the values around
the mean.
– A small variance/standard deviation means the values are tightly clustered around the
mean, implying more consistency, less variability, or greater predictability.
– A large variance/standard deviation means the values are widely scattered, indicating
greater variability, less consistency, or less predictability.
The standard deviation is particularly useful because it is in the same units as the random variable
itself, providing an intuitive measure of the typical deviation from the mean.
• Variance: $\mathrm{Var}(X) = E[(X - E[X])^2]$ (the expected squared difference from the mean). A computationally easier formula is: $\mathrm{Var}(X) = E[X^2] - (E[X])^2$
• Standard Deviation (σ): The square root of the variance. $\sigma = \sqrt{\mathrm{Var}(X)}$

• Example: Variance of a Die Roll


Die roll: X ∈ {1, 2, 3, 4, 5, 6}, P (x) = 1/6. Mean E[X] = 3.5.

First, calculate $E[X^2]$:
$E[X^2] = 1^2 \cdot \frac{1}{6} + 2^2 \cdot \frac{1}{6} + 3^2 \cdot \frac{1}{6} + 4^2 \cdot \frac{1}{6} + 5^2 \cdot \frac{1}{6} + 6^2 \cdot \frac{1}{6}$
$E[X^2] = \frac{1}{6}(1 + 4 + 9 + 16 + 25 + 36) = \frac{91}{6} \approx 15.1667$
Now, calculate Variance: $\mathrm{Var}(X) = E[X^2] - (E[X])^2 = \frac{91}{6} - (3.5)^2 = \frac{35}{12} \approx 2.9167$
Standard Deviation: $\sigma = \sqrt{35/12} \approx 1.7078$
What this tells us: The outcomes of a fair die roll typically deviate from the mean (3.5) by about
1.71 units, measured as a root-mean-square deviation.
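The expectation and variance formulas for a discrete random variable translate directly into code. The following is a small sketch for the fair-die example; the function names `expectation` and `variance` are illustrative choices.

```python
from fractions import Fraction
from math import sqrt

def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum over x of g(x) * P(X = x), for a PMF given as a dict."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[X^2] - (E[X])^2."""
    mean = expectation(pmf)
    return expectation(pmf, lambda x: x * x) - mean ** 2

die = {face: Fraction(1, 6) for face in range(1, 7)}
mean = expectation(die)
var = variance(die)
std = sqrt(var)
print(mean, var, round(std, 4))  # 7/2 35/12 1.7078

# The definitional form E[(X - E[X])^2] gives the same variance.
print(expectation(die, lambda x: (x - mean) ** 2) == var)  # True
```

Passing a function `g` to `expectation` reuses one routine for E[X], E[X^2], and the definitional variance.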

4 Key Probability Distributions
4.1 Univariate (Single) Gaussian (Normal) Distribution
The Univariate Gaussian distribution, also known as the Normal distribution, is the most im-
portant and widely used continuous probability distribution for a single random variable. It’s often
described as the ”bell curve” due to its characteristic shape.

4.1.1 Definition:
The PDF of a normal distribution for a single variable X is given by: $f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$

Where:
• µ (mu) is the mean of the distribution, which determines its central location and the peak of the
bell curve.

• σ 2 is the variance, which determines the spread or width of the distribution.


• σ (sigma) is the standard deviation (square root of variance). A larger σ means a wider, flatter
curve, while a smaller σ means a narrower, taller curve.

4.1.2 Properties:
• Bell-shaped and Symmetric: The curve is symmetrical around its mean µ.

• Parameters: It is completely defined by its mean (µ) and variance (σ 2 ).


• Empirical Rule (68-95-99.7 Rule): For a normal distribution:
– Approximately 68% of the data falls within one standard deviation of the mean (µ ± σ).
– Approximately 95% of the data falls within two standard deviations of the mean (µ ± 2σ).
– Approximately 99.7% of the data falls within three standard deviations of the mean (µ ± 3σ).
• Example: Heights of Adult Males
The heights of adult males in a population often approximate a normal distribution. Suppose the
mean height is µ = 175 cm and the standard deviation is σ = 7 cm. What this tells us: Most men
are around 175 cm tall. About 68% of men would have heights between 168 cm and 182 cm. This
rule helps quickly estimate the proportion of data expected within certain ranges from the mean,
providing a quick understanding of the spread of heights in the population.
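The percentages in the empirical rule can be reproduced from the normal CDF, which the Python standard library exposes indirectly through `math.erf`. This sketch uses the heights example (µ = 175 cm, σ = 7 cm); the helper names `normal_cdf` and `prob_within` are illustrative.

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, written in terms of the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def prob_within(k, mu, sigma):
    """P(mu - k*sigma <= X <= mu + k*sigma) for X ~ Normal(mu, sigma^2)."""
    return normal_cdf(mu + k * sigma, mu, sigma) - normal_cdf(mu - k * sigma, mu, sigma)

mu, sigma = 175.0, 7.0  # heights example from the text, in cm
for k in (1, 2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    print(f"within {k} sd: [{low:.0f}, {high:.0f}] cm -> {prob_within(k, mu, sigma):.4f}")
```

The printed probabilities are about 0.6827, 0.9545, and 0.9973, which are the values the 68-95-99.7 rule rounds. They do not depend on the particular µ and σ chosen.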

4.2 Multivariate Gaussian Distribution


The Multivariate Gaussian Distribution is a generalization of the one-dimensional Gaussian dis-
tribution to multiple dimensions. It describes the joint probability distribution of a vector of related
continuous random variables.
A k-dimensional random vector X = [X1 , . . . , Xk ]T is said to have a multivariate Gaussian distribu-
tion if its PDF is given by:
$f(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^k |\Sigma|}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$
Where:

• x is a k × 1 column vector representing a specific realization of the random variables.


• µ (mu) is the k × 1 mean vector. Each element µi is the mean of the corresponding random
variable Xi .
– What it tells us: The mean vector indicates the ”center” or ”average point” in the multi-
dimensional space where the data tends to cluster.

• Σ (Sigma) is the k × k covariance matrix.


– The diagonal elements Σii represent the variance of Xi .
– The off-diagonal elements Σij represent the covariance between Xi and Xj .

– What the covariance matrix tells us:
∗ Diagonal elements: Indicate the spread of each individual variable along its own axis.
∗ Off-diagonal elements (Covariance): Quantify how two variables vary together.
· Positive Covariance: Implies that if one variable increases, the other tends to
increase as well (e.g., height and weight). The elliptical contours of the distribution
would be stretched along the direction where both variables increase.
· Negative Covariance: Implies that if one variable increases, the other tends to
decrease (e.g., hours spent studying and hours spent playing video games, if they are
inversely related). The elliptical contours would be stretched along the anti-diagonal.
· Zero Covariance: Implies no linear relationship between the two variables. For
jointly Gaussian variables, zero covariance also implies independence (this does not
hold for arbitrary distributions). The elliptical contours would be aligned with the coordinate axes.
∗ Overall, the covariance matrix Σ defines the shape and orientation of the multi-dimensional
”bell” (an ellipsoid), showing how the variables are correlated and spread out together in
the multi-dimensional space.
• |Σ| is the determinant of the covariance matrix.

• $\Sigma^{-1}$ is the inverse of the covariance matrix.


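The covariance matrix Σ must be symmetric and positive semi-definite. For the PDF above to be well-defined (so that the inverse of Σ exists and |Σ| > 0), Σ must in fact be positive definite.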
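To make the multivariate formula concrete, here is a minimal sketch for the two-dimensional case (k = 2), where the determinant and inverse of Σ have simple closed forms. The function name `bivariate_normal_pdf` and the example covariance values are hypothetical, chosen only for illustration.

```python
from math import exp, pi, sqrt

def bivariate_normal_pdf(x, mu, cov):
    """Density of a 2-D Gaussian at point x, with mean mu and 2x2 covariance cov."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if abs(b - c) > 1e-12 or a <= 0 or det <= 0:
        raise ValueError("cov must be symmetric and positive definite")
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T cov^{-1} (x - mu), using the closed-form 2x2 inverse.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return exp(-0.5 * quad) / sqrt((2 * pi) ** 2 * det)

mu = (0.0, 0.0)
cov = ((1.0, 0.8), (0.8, 1.0))  # hypothetical positive covariance between X1 and X2

# With positive covariance, (1, 1) lies along the stretched direction of the
# elliptical contours, so its density is higher than at (1, -1).
print(bivariate_normal_pdf((1.0, 1.0), mu, cov) > bivariate_normal_pdf((1.0, -1.0), mu, cov))  # True
```

For higher dimensions a linear-algebra library would normally compute the determinant and solve the linear system rather than forming the inverse by hand.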

5 Multiple Random Variables and Their Relationships
These concepts explain how different random variables interact and how their probabilities are related.

5.1 Joint Probability


The joint probability describes the probability of two or more random variables taking on specific
values simultaneously.

• For discrete variables X, Y : P (X = x, Y = y) is the probability that X takes value x AND Y takes value y.
• For continuous variables X, Y : f (x, y) is the joint probability density function. $P(a \le X \le b,\ c \le Y \le d) = \int_c^d \int_a^b f(x, y)\,dx\,dy$.
• What it tells us: The joint probability (or joint PDF/PMF) provides information about the
likelihood of observing specific combinations of outcomes for multiple random variables. It describes
their combined behavior.
• Example (Discrete): Weather and Umbrella
Consider Weather (X: Rainy/Sunny) and Umbrella (Y: Yes/No).
– P (X=Rainy, Y=Yes) = 0.5 (50% chance it’s rainy AND someone has an umbrella)
• Example (Continuous): Uniform Distribution over a Square
Let X and Y be two continuous random variables jointly distributed with the PDF: f (x, y) = 1,
for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1
f (x, y) = 0, otherwise. This means the probability density is uniform over the unit square.

– Find P (0.2 ≤ X ≤ 0.5, 0.3 ≤ Y ≤ 0.7):


$P(0.2 \le X \le 0.5,\ 0.3 \le Y \le 0.7) = \int_{0.3}^{0.7} \int_{0.2}^{0.5} 1\,dx\,dy$
$= \int_{0.3}^{0.7} [x]_{0.2}^{0.5}\,dy$
$= \int_{0.3}^{0.7} (0.5 - 0.2)\,dy$
$= \int_{0.3}^{0.7} 0.3\,dy$
$= [0.3y]_{0.3}^{0.7}$
$= 0.3 \cdot (0.7 - 0.3) = 0.3 \cdot 0.4 = 0.12$

What this tells us: There is a 12% chance that X will be between 0.2 and 0.5 and Y will be
between 0.3 and 0.7 at the same time. This shows the joint likelihood of two events occurring
in specific continuous ranges.

5.2 Marginal Probability


The marginal probability of a single variable is the probability of that variable taking a specific
value, regardless of the values of other variables. It’s obtained by ”summing out” (for discrete) or
”integrating out” (for continuous) the other variables from the joint distribution.

• For discrete variables: $P(X = x) = \sum_y P(X = x, Y = y)$

• For continuous variables: $f(x) = \int_{-\infty}^{\infty} f(x, y)\,dy$
• What it tells us: The marginal probability (or marginal PDF/PMF) gives the individual prob-
ability distribution for one variable, effectively ignoring or ”averaging out” the influence of other
variables it might be jointly distributed with. It allows us to analyze each variable’s behavior
independently of the others.

• Example (Discrete): Weather and Umbrella (continued)
Suppose we also know P (X=Rainy, Y=No) = 0.1, P (X=Sunny, Y=Yes) = 0.2, and P (X=Sunny, Y=No) = 0.2, so that the four joint probabilities sum to 1. Then P (X=Rainy) =
P (X=Rainy, Y=Yes) + P (X=Rainy, Y=No) = 0.5 + 0.1 = 0.6. What this tells us: There is an
overall 60% chance of it being rainy, regardless of whether someone carries an umbrella or not.

• Example (Continuous): Using the Uniform Joint PDF (continued)


Using f (x, y) = 1 for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1:
– Find the marginal PDF for X, f (x): $f(x) = \int_0^1 f(x, y)\,dy = \int_0^1 1\,dy = [y]_0^1 = 1 - 0 = 1$,
for 0 ≤ x ≤ 1. f (x) = 0, otherwise. By symmetry, the marginal PDF for Y is f (y) = 1 for 0 ≤ y ≤ 1.
What this tells us: Even though X and Y are part of a joint distribution, the probability distribution
of X by itself (ignoring Y ) is a uniform distribution from 0 to 1.

5.3 Conditional Probability


Conditional probability is the probability of an event occurring given that another event has
already occurred. It quantifies how the probability of one event changes if we know something about
another related event.
The formula is: $P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$ (provided P (B) > 0)
For continuous variables, the conditional probability density function (PDF) is: $f(x \mid y) = \frac{f(x, y)}{f(y)}$
(provided f (y) > 0)

• What it tells us: The conditional probability (or conditional PDF/PMF) allows us to update our
beliefs about one variable’s likelihood after observing the value of another. It reveals dependencies
between variables: if knowing one variable’s value changes the probability of another, they are
dependent.
• Example (Discrete): Weather and Umbrella (continued)
Find the probability that it’s Rainy, given that someone has an Umbrella (Yes): P (X=Rainy | Y=Yes) = P (X=Rainy, Y=Yes) / P (Y=Yes).
Using our previous examples, P (Y=Yes) = P (Rainy, Yes) + P (Sunny, Yes) = 0.5 + 0.2 = 0.7, so P (X=Rainy | Y=Yes) = 0.5 / 0.7 ≈ 0.714.
What this tells us: If you know someone has an umbrella, there’s about a 71.4% chance it’s rainy. This is higher than the overall 60% chance
of rain, indicating a positive association between rain and carrying an umbrella.
• Example (Continuous): Using the Uniform Joint PDF (continued)
Using f (x, y) = 1 for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, and the marginal f (y) = 1 for 0 ≤ y ≤ 1 (obtained by integrating out x, in the same way as f (x)):
– Find the conditional PDF for X given Y=y, f (x|y): $f(x \mid y) = \frac{f(x, y)}{f(y)} = \frac{1}{1} = 1$, for
0 ≤ x ≤ 1 (for any specific y ∈ [0, 1]). f (x|y) = 0, otherwise.
What this tells us: In this specific uniform example, knowing the value of Y (as long as it’s within
its range) does not change the distribution of X. X is still uniformly distributed between 0 and 1.
This is a characteristic of independent variables. If X and Y were dependent, f (x|y) would be a
function of y, meaning Y ’s value would indeed influence X’s distribution.
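The discrete Weather and Umbrella calculations (marginal and conditional) can be reproduced from the joint table. This sketch stores the four joint probabilities used in the running example, including P (Rainy, No) = 0.1 as implied by the marginal calculation; the helper names `marginal` and `conditional_x_given_y` are illustrative.

```python
from fractions import Fraction

# Joint PMF P(X = weather, Y = umbrella) from the running example.
joint = {
    ("Rainy", "Yes"): Fraction(5, 10),
    ("Rainy", "No"): Fraction(1, 10),
    ("Sunny", "Yes"): Fraction(2, 10),
    ("Sunny", "No"): Fraction(2, 10),
}

def marginal(joint, axis):
    """Sum out the other variable: axis 0 gives P(X), axis 1 gives P(Y)."""
    result = {}
    for pair, p in joint.items():
        result[pair[axis]] = result.get(pair[axis], 0) + p
    return result

def conditional_x_given_y(joint, y):
    """P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)."""
    p_y = marginal(joint, 1)[y]
    if p_y == 0:
        raise ValueError("conditioning event has probability zero")
    return {x: p / p_y for (x, y_value), p in joint.items() if y_value == y}

p_x = marginal(joint, 0)
p_y = marginal(joint, 1)
p_x_given_yes = conditional_x_given_y(joint, "Yes")
print(p_x["Rainy"], p_y["Yes"], p_x_given_yes["Rainy"])  # 3/5 7/10 5/7
```

The value 5/7 is approximately 0.714, matching the conditional probability computed by hand.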

5.4 Independence
Two events A and B (or two random variables X and Y) are considered independent if the occurrence
of one does not affect the probability of the other.

• Mathematically, for random variables X, Y :


– Discrete: P (X = x, Y = y) = P (X = x) · P (Y = y) for all x, y.
– Continuous: f (x, y) = f (x) · f (y) for all x, y.
• Alternatively, using conditional probabilities (if P (B) > 0 or f (y) > 0):
– P (A|B) = P (A)
– f (x|y) = f (x)

• What it tells us: Independence signifies a lack of predictive power between variables. Knowing
the value of one variable gives you no additional information about the likelihood of the other.
• Example: Tossing 2 coins.
Let X be the outcome of the first toss (Heads/Tails) and Y be the outcome of the second toss.
The outcome of the first toss is independent of the outcome of the second toss. For fair coins, each of the four joint outcomes (HH, HT, TH, TT) has probability 0.25,
and P (X = H) = P (X = T) = P (Y = H) = P (Y = T) = 0.5. For example, P (X = H, Y = H) = 0.25 and
P (X = H) · P (Y = H) = 0.5 · 0.5 = 0.25. The same equality holds for the other three combinations, so X and Y are independent. What this tells us: Knowing the
result of the first coin toss gives you no information that changes your prediction for the second
coin toss. The two events do not influence each other’s probabilities.
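Independence requires the product rule to hold for every pair of values, not just one. The sketch below checks all pairs for the two-coin example and, for contrast, for the Weather and Umbrella joint table from earlier in the text; the function name `is_independent` is an illustrative choice.

```python
from fractions import Fraction
from itertools import product

def is_independent(joint):
    """True if P(X = x, Y = y) == P(X = x) * P(Y = y) for every pair (x, y)."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    p_x = {x: sum(joint.get((x, y), 0) for y in ys) for x in xs}
    p_y = {y: sum(joint.get((x, y), 0) for x in xs) for y in ys}
    return all(joint.get((x, y), 0) == p_x[x] * p_y[y] for x, y in product(xs, ys))

# Two fair coin tosses: every joint outcome has probability 1/4.
two_coins = {(x, y): Fraction(1, 4) for x, y in product("HT", repeat=2)}

# Weather and Umbrella joint table from the earlier example.
weather = {
    ("Rainy", "Yes"): Fraction(5, 10),
    ("Rainy", "No"): Fraction(1, 10),
    ("Sunny", "Yes"): Fraction(2, 10),
    ("Sunny", "No"): Fraction(2, 10),
}

print(is_independent(two_coins), is_independent(weather))  # True False
```

Exact fractions avoid false negatives from floating-point rounding when comparing products of probabilities.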

6 Bayesian Networks: Modeling Conditional Dependencies
Bayesian Networks (BNs), also known as Bayes nets or belief networks, are powerful probabilistic
graphical models that represent a set of random variables and their conditional dependencies via
a Directed Acyclic Graph (DAG). They provide a structured way to represent and reason about
uncertainty in complex systems.

6.1 Components of a Bayesian Network:


1. Nodes: Each node in the graph represents a random variable (e.g., ”Rain,” ”Wet Grass,” ”Sprin-
kler,” ”Alarm”). These variables can be discrete or continuous.

2. Directed Edges: Arrows connect nodes to indicate a direct probabilistic influence or causal
relationship. An arrow from A to B means A is a ”parent” of B, and B is a ”child” of A, implying
that B’s probability distribution directly depends on A.
3. No Cycles (Acyclic): The graph must not contain any directed cycles (you cannot start at a
node and follow arrows to return to the same node). This ensures that probabilities are well-defined
and that cause-and-effect relationships don’t loop back on themselves.
4. Conditional Probability Distributions (CPDs) / Tables (CPTs): Each node has an asso-
ciated CPD (or CPT for discrete variables) that quantifies the influence of its parents.
• For a node with no parents (a ”root” node), its CPD is simply its prior probability distribution.
• For a node with parents, its CPD defines the probability of that node’s value given the values
of its parents.

6.2 Key Property: Conditional Independence


The structure of the DAG in a Bayesian Network directly encodes conditional independence assumptions. A key property is that a node is conditionally independent of its non-descendants given its
parents. This means that once you know the state of a node’s parents, information about its other
non-descendants (including more distant ancestors) becomes irrelevant for predicting its state. This property allows the joint probability
distribution of all variables in the network to be factorized (broken down) into a product of simpler
conditional probabilities:
$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Parents}(X_i))$
This factorization vastly simplifies the computation and representation of complex joint distributions,
as you only need to specify local conditional probabilities rather than the full joint table for all variables.

6.3 Inference in Bayesian Networks


One of the main uses of Bayesian Networks is probabilistic inference: given some observed evidence
(values for some variables), what are the updated probabilities of other unobserved variables? This in-
volves using the structure of the network and the CPDs to compute marginal or conditional probabilities.

• Example Inference Types:


– Predictive (or Causal) Inference: Predicting effects from causes (e.g., P (Wet Grass|Sprinkler On)).
– Diagnostic Inference: Diagnosing causes from effects (e.g., P (Rain|Wet Grass)).
– Intercausal Inference (Explaining Away): When two causes influence a common effect,
observing one cause can ”explain away” the other (e.g., P (Rain|Wet Grass, Sprinkler On)
might be lower than P (Rain|Wet Grass)).

6.4 Example: ”Wet Grass” Network


Consider a simple Bayesian Network with three binary variables:
• Rain (R): True/False
• Sprinkler (S): True/False (whether sprinkler is on)

• Wet Grass (W): True/False
Structure (DAG):
• Rain → Wet Grass

• Sprinkler → Wet Grass


This structure implies that Rain and Sprinkler cause Wet Grass, and that Rain and Sprinkler are
independent of each other (unless evidence is introduced about Wet Grass, which can create dependencies
through ”explaining away”).
Conditional Probability Distributions (CPDs):
• Prior for Rain: The probability of Rain being True is P (R = True) = 0.2, and False is P (R =
False) = 0.8.
• Prior for Sprinkler: The probability of Sprinkler being True is P (S = True) = 0.3, and False is
P (S = False) = 0.7.

• Conditional Probabilities for Wet Grass given Rain and Sprinkler:


– If Rain is False and Sprinkler is False, the probability of Wet Grass being True is 0.0.
– If Rain is False and Sprinkler is True, the probability of Wet Grass being True is 0.9.
– If Rain is True and Sprinkler is False, the probability of Wet Grass being True is 0.95.
– If Rain is True and Sprinkler is True, the probability of Wet Grass being True is 0.99.
(For each combination of Rain and Sprinkler, the probability of Wet Grass being False is simply
1 − P (W = True) for that combination).
What this tells us: This network explicitly models the probabilistic causal relationships: whether the
grass is wet depends directly on both rain and the sprinkler. By defining these local dependencies, the
network implicitly defines the full joint probability distribution P (R, S, W ) = P (W |R, S) · P (R) · P (S).
This allows us to perform various inferences, such as calculating the probability of rain given that the
grass is wet, or determining the likelihood of the sprinkler being on given no rain but wet grass.
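The inferences mentioned above can be carried out by brute-force enumeration of the joint distribution, which is feasible here because the network has only eight possible worlds. This is a minimal sketch using the CPD values from the example; the function names `joint` and `query` are illustrative, and larger networks would call for a dedicated inference algorithm.

```python
from itertools import product

P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER = {True: 0.3, False: 0.7}
# P(W = True | R, S), keyed by (rain, sprinkler).
P_WET_GIVEN = {
    (False, False): 0.0,
    (False, True): 0.9,
    (True, False): 0.95,
    (True, True): 0.99,
}

def joint(r, s, w):
    """P(R = r, S = s, W = w) = P(W = w | r, s) * P(r) * P(s)."""
    p_wet = P_WET_GIVEN[(r, s)]
    return (p_wet if w else 1.0 - p_wet) * P_RAIN[r] * P_SPRINKLER[s]

def query(target, evidence):
    """P(target | evidence), summing the joint over all consistent assignments.

    target and evidence are dicts mapping 'R', 'S', 'W' to booleans.
    """
    numerator = denominator = 0.0
    for r, s, w in product((True, False), repeat=3):
        world = {"R": r, "S": s, "W": w}
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(r, s, w)
        denominator += p
        if all(world[k] == v for k, v in target.items()):
            numerator += p
    if denominator == 0.0:
        raise ValueError("evidence has probability zero")
    return numerator / denominator

p_rain_given_wet = query({"R": True}, {"W": True})
p_rain_given_wet_and_sprinkler = query({"R": True}, {"W": True, "S": True})
print(round(p_rain_given_wet, 4), round(p_rain_given_wet_and_sprinkler, 4))  # 0.4711 0.2157
```

With these numbers, observing wet grass raises the probability of rain from the prior 0.2 to about 0.47, and additionally learning that the sprinkler was on lowers it to about 0.22. This is the "explaining away" effect.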

7 Bayes’ Theorem
Bayes’ Theorem describes the probability of an event, based on prior knowledge of conditions that might
be related to the event. It’s a cornerstone of probabilistic inference, allowing us to update our beliefs
based on new evidence. It is frequently applied within Bayesian Networks for various inference tasks,
particularly diagnostic inference.
The formula for Bayes’ Theorem is:
$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$

Where:
• P (A|B) is the posterior probability: the probability of event A occurring given that event B
has occurred. This is often what we want to find – our updated belief about A after observing B.
• P (B|A) is the likelihood: the probability of event B occurring given that event A has occurred.
This tells us how likely the observed evidence B is, assuming our hypothesis A is true.
• P (A) is the prior probability: the initial probability of event A occurring before considering any
information about B. This is our initial belief or knowledge about A.
• P (B) is the evidence or marginal probability of B: the total probability of event B occurring,
considering all possible scenarios. This acts as a normalizing constant. It can be calculated using
the law of total probability: P (B) = P (B|A)P (A) + P (B|not A)P (not A) (for two complementary
events A and not A).
What it tells us: Bayes’ Theorem provides a formal way to reverse conditional probabilities and update
our confidence in a hypothesis (A) when new evidence (B) becomes available. It shows how our initial
belief (P (A)) is weighted by how well the evidence supports the hypothesis (P (B|A)) relative to how
likely the evidence is overall (P (B)). It’s fundamental for inference and learning from data.

7.1 Example: Medical Diagnosis


Suppose a rare disease (D) affects 1 in 1000 people (P (D) = 0.001). There’s a test for this disease.
• The test is 99% accurate in detecting the disease if a person has it (true positive rate): P (Positive Test |D) =
0.99.
• The test has a 5% false positive rate (incorrectly identifies healthy people as having the disease):
P (Positive Test |not D) = 0.05.
You test positive. What is the probability that you actually have the disease? (P (D|Positive Test))
Let:
• A = D (Having the Disease)
• B = T + (Positive Test)
We want to find P (D|T +).
We have:
• P (D) = 0.001 (Prior probability of having the disease)
• P (T + |D) = 0.99 (Likelihood of positive test given disease)
• P (not D) = 1 − P (D) = 1 − 0.001 = 0.999 (Prior probability of not having the disease)
• P (T + |not D) = 0.05 (Likelihood of positive test given no disease)
First, we calculate P (T +) (the overall probability of getting a positive test):

P (T +) = P (T + |D)P (D) + P (T + |not D)P (not D)


P (T +) = (0.99 · 0.001) + (0.05 · 0.999)
P (T +) = 0.00099 + 0.04995 = 0.05094

(Evidence: the general chance of a positive test result in the population)

Now, apply Bayes’ Theorem:

$P(D \mid T+) = \frac{P(T+ \mid D) \cdot P(D)}{P(T+)}$
$P(D \mid T+) = \frac{0.00099}{0.05094} \approx 0.0194$
What this tells us: Even with a positive test, the probability of actually having the disease is only about
1.94%! This illustrates the power of Bayes’ Theorem: it shows how the very low prior probability of the
disease (P (D) = 0.001) significantly tempers the impact of the positive test result, given the relatively
high false positive rate (5%) in the larger healthy population. This kind of analysis is crucial in fields
like medicine and other areas requiring probabilistic reasoning.
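The medical-diagnosis calculation is easy to wrap in a small function, which also makes it simple to see how strongly the result depends on the prior. This is a sketch under the example's assumptions (a binary disease state and a single test); the function name `posterior` and the extra prior values 0.01 and 0.1 are hypothetical additions for illustration.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(D | T+) via Bayes' theorem with the law of total probability for P(T+)."""
    evidence = true_positive_rate * prior + false_positive_rate * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("a positive test has probability zero under these inputs")
    return true_positive_rate * prior / evidence

# Values from the example: P(D) = 0.001, P(T+|D) = 0.99, P(T+|not D) = 0.05.
print(round(posterior(0.001, 0.99, 0.05), 4))  # 0.0194

# Same test, hypothetical higher priors (for instance, a higher-risk group).
for prior in (0.01, 0.1):
    print(prior, round(posterior(prior, 0.99, 0.05), 4))
```

Keeping the test characteristics fixed and raising the prior to 0.01 or 0.1 gives posteriors of roughly 0.17 and 0.69, which shows that the low posterior in the example is driven mainly by the rarity of the disease.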

8 KL Divergence (Kullback-Leibler Divergence)
KL Divergence, also known as relative entropy, is a non-symmetric measure of the difference be-
tween two probability distributions. It quantifies how much information is lost when a probability
distribution Q is used to approximate another probability distribution P . In simpler terms, KL Diver-
gence tells us how ”different” or ”distant” one probability distribution Q is from a reference
probability distribution P .
It is commonly used in various fields, including statistics, information theory, and machine learning,
to measure how similar one probability distribution is to another, often when optimizing models to learn
target distributions.

8.1 Formula:
For discrete probability distributions P (x) and Q(x) (where P (x) and Q(x) are their PMFs):
$D_{KL}(P \| Q) = \sum_x P(x) \log\left(\frac{P(x)}{Q(x)}\right)$
For continuous probability distributions P (x) and Q(x) (where p(x) and q(x) are their PDFs):
$D_{KL}(P \| Q) = \int_{-\infty}^{\infty} p(x) \log\left(\frac{p(x)}{q(x)}\right) dx$
Note: The logarithm can be in any base. Using base 2 gives units in ”bits”, while using the natural
logarithm (base e) gives units in ”nats”. The choice of base only scales the result, not its fundamental
meaning.

8.2 What KL Divergence Measures: The Information Theory Perspective


To understand DKL (P ||Q), it’s helpful to consider its roots in information theory.
• Entropy (H(P )): The entropy of a distribution P measures the average number of bits required
to encode (or describe) an event drawn from P , assuming an optimal encoding strategy for P . It
quantifies the inherent unpredictability or "surprise" of events from that distribution. $H(P) = -\sum_x P(x) \log(P(x))$
• Cross-Entropy (H(P, Q)): Cross-entropy measures the average number of bits required to encode
an event drawn from distribution P , if we use an encoding strategy that is optimized for distribution
Q. $H(P, Q) = -\sum_x P(x) \log(Q(x))$
KL Divergence can then be expressed as the difference between cross-entropy and entropy:
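$$
\begin{aligned}
D_{KL}(P \| Q) &= H(P, Q) - H(P) \\
&= -\sum_x P(x) \log Q(x) - \left( -\sum_x P(x) \log P(x) \right) \\
&= \sum_x P(x) \left( \log P(x) - \log Q(x) \right) \\
&= \sum_x P(x) \log\left( \frac{P(x)}{Q(x)} \right)
\end{aligned}
$$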

Interpretation: DKL (P ||Q) represents the extra bits (or nats) required to encode samples from the
true distribution P when using an encoding scheme optimized for the approximating distribution Q,
compared to using an optimal encoding scheme for P itself. A higher DKL (P ||Q) means that Q is a
poorer approximation of P , and using Q’s encoding would be more inefficient, leading to more ”surprise”
or information loss when observing actual events from P .
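The "extra bits" reading can be checked numerically. The sketch below uses base-2 logarithms and two illustrative distributions (assumed values, chosen as powers of 1/2 so that the ideal code lengths are whole numbers of bits); the helper names are hypothetical.

```python
import math


def entropy_bits(p):
    """H(P): average bits per event with a code optimized for P."""
    return -sum(p_x * math.log2(p_x) for p_x in p if p_x > 0)


def cross_entropy_bits(p, q):
    """H(P, Q): average bits per event from P with a code optimized for Q.

    Assumes q_x > 0 wherever p_x > 0.
    """
    return -sum(p_x * math.log2(q_x) for p_x, q_x in zip(p, q) if p_x > 0)


def kl_divergence_bits(p, q):
    """D_KL(P || Q) in bits, computed directly from the definition."""
    return sum(p_x * math.log2(p_x / q_x) for p_x, q_x in zip(p, q) if p_x > 0)


# Illustrative distributions (assumed values).
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

h_p = entropy_bits(p)             # 1.5 bits
h_pq = cross_entropy_bits(p, q)   # 1.75 bits
extra = h_pq - h_p                # 0.25 bits of overhead
print(h_p, h_pq, extra, kl_divergence_bits(p, q))
```

Here a code built for Q spends 2 bits on the outcome that P produces half the time, which is where the 0.25-bit overhead comes from.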

8.3 Properties of KL Divergence:


1. Non-negativity: DKL (P ||Q) ≥ 0. The KL divergence is always non-negative. It is equal to 0 if
and only if P (x) = Q(x) for all values of x. This means if the distributions are identical, there’s
no ”information loss” or ”difference.”
2. Non-symmetry: DKL (P ||Q) ≠ DKL (Q||P ) in general. This is a crucial property. It means KL
divergence is not a true ”distance” metric (like Euclidean distance, where d(A, B) = d(B, A)). The
order matters significantly.

• Asymmetry Explained:
– If P (x) > 0 for some x where Q(x) = 0, then DKL (P ||Q) will be infinite. This is
because if the approximating distribution Q assigns zero probability to an event that the
true distribution P considers possible, the ”cost” of encoding that event using Q’s scheme
becomes infinitely large. This means Q must cover all possibilities that P covers.
– Conversely, if Q(x) > 0 for some x where P (x) = 0, that specific term in the sum/integral
contributes 0 to DKL (P ||Q) (by the convention 0 log 0 = 0). Here Q is penalized only
indirectly: the probability it spends on events that P never produces must be taken away
from events that P does produce, which raises the other terms. This penalty is finite,
unlike the infinite penalty incurred when Q misses events that P considers possible.
• This asymmetry is important in machine learning. For instance, in Maximum Likelihood
Estimation (MLE), we often minimize DKL (Pdata ||Pmodel ), which means the model tries to
cover all data modes. In Variational Inference (VI), we often minimize DKL (Qapprox ||Ptrue ),
which implies that the approximation Q should not assign probability mass where the true
posterior P has none (it will tend to be narrower than P ).
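The two support cases described above can be illustrated with a small sketch. The distributions `narrow` and `wide` are assumed values picked for illustration, and `kl_divergence` is a hypothetical helper that returns infinity when Q rules out an outcome that P allows.

```python
import math


def kl_divergence(p, q):
    """Discrete D_KL(P || Q) in nats, with explicit handling of zeros."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue  # P never produces x: the term contributes 0
        if q_x == 0.0:
            return math.inf  # Q rules out an outcome that P produces
        total += p_x * math.log(p_x / q_x)
    return total


# Illustrative distributions over three outcomes (assumed values).
narrow = [0.5, 0.5, 0.0]  # never produces the third outcome
wide = [0.4, 0.4, 0.2]    # spreads mass over all three outcomes

# The approximation misses an outcome the reference produces: infinite cost.
print(kl_divergence(wide, narrow))
# The approximation covers extra outcomes: finite cost, equal to ln(1.25).
print(kl_divergence(narrow, wide))
```

The second value is finite but not zero, because the mass `wide` places on the third outcome is no longer available for the first two.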

8.4 Examples:
8.4.1 Example 1: Comparing Two Coins
Let’s consider two binary distributions (coin tosses):
• Distribution P (True Coin): A biased coin that lands Heads (H) with probability 0.7 and Tails
(T) with probability 0.3.

– P (H) = 0.7
– P (T ) = 0.3
• Distribution Q (Approximation Coin): A fair coin.
– Q(H) = 0.5
– Q(T ) = 0.5
Let’s calculate DKL (P ||Q) using natural logarithms.

   
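$$
\begin{aligned}
D_{KL}(P \| Q) &= P(H) \ln\left(\frac{P(H)}{Q(H)}\right) + P(T) \ln\left(\frac{P(T)}{Q(T)}\right) \\
&= 0.7 \ln\left(\frac{0.7}{0.5}\right) + 0.3 \ln\left(\frac{0.3}{0.5}\right) \\
&= 0.7 \ln(1.4) + 0.3 \ln(0.6) \\
&\approx 0.7(0.33647) + 0.3(-0.51083) \\
&\approx 0.23553 - 0.15325 = 0.08228 \text{ nats}
\end{aligned}
$$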

Now, let’s calculate DKL (Q||P ) to demonstrate asymmetry:

   
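$$
\begin{aligned}
D_{KL}(Q \| P) &= Q(H) \ln\left(\frac{Q(H)}{P(H)}\right) + Q(T) \ln\left(\frac{Q(T)}{P(T)}\right) \\
&= 0.5 \ln\left(\frac{0.5}{0.7}\right) + 0.5 \ln\left(\frac{0.5}{0.3}\right) \\
&= 0.5 \ln(0.71428) + 0.5 \ln(1.66667) \\
&\approx 0.5(-0.33647) + 0.5(0.51083) \\
&\approx -0.168235 + 0.255415 = 0.08718 \text{ nats}
\end{aligned}
$$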

What this tells us:


• DKL (P ||Q) ≈ 0.08228 nats. This is the extra information (in nats) we’d expect to need to represent
outcomes from the biased coin (P) if we assumed they came from the fair coin (Q).

• DKL (Q||P ) ≈ 0.08718 nats. This is the extra information we’d expect to need to represent out-
comes from the fair coin (Q) if we assumed they came from the biased coin (P).
• Notice that DKL (P ||Q) ≠ DKL (Q||P ), confirming the non-symmetric property. The values are
close in this simple case because the distributions aren’t dramatically different, but the principle
holds.
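Both directions of the coin comparison can be reproduced with a few lines. This is a sketch: the dictionary representation and the names `biased`, `fair`, `forward`, and `reverse` are illustrative choices.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes q[x] > 0 wherever p[x] > 0."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)


biased = {"H": 0.7, "T": 0.3}  # distribution P from the example
fair = {"H": 0.5, "T": 0.5}    # distribution Q from the example

forward = kl_divergence(biased, fair)  # D_KL(P || Q)
reverse = kl_divergence(fair, biased)  # D_KL(Q || P)
print(f"{forward:.5f} {reverse:.5f}")  # prints 0.08228 0.08718
```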

8.4.2 Example 2: Distributions Over Three Outcomes


Consider a discrete variable X that can take values A, B, C.
• True Distribution P:
– P (A) = 0.2
– P (B) = 0.5
– P (C) = 0.3
• Approximating Distribution Q:
– Q(A) = 0.4
– Q(B) = 0.4
– Q(C) = 0.2
Calculate DKL (P ||Q):

     
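$$
\begin{aligned}
D_{KL}(P \| Q) &= P(A) \ln\left(\frac{P(A)}{Q(A)}\right) + P(B) \ln\left(\frac{P(B)}{Q(B)}\right) + P(C) \ln\left(\frac{P(C)}{Q(C)}\right) \\
&= 0.2 \ln\left(\frac{0.2}{0.4}\right) + 0.5 \ln\left(\frac{0.5}{0.4}\right) + 0.3 \ln\left(\frac{0.3}{0.2}\right) \\
&= 0.2 \ln(0.5) + 0.5 \ln(1.25) + 0.3 \ln(1.5) \\
&\approx 0.2(-0.6931) + 0.5(0.2231) + 0.3(0.4055) \\
&\approx -0.13862 + 0.11155 + 0.12165 \\
&= 0.09458 \text{ nats}
\end{aligned}
$$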

What this tells us: This positive value indicates that there is some information loss when using Q to
approximate P. Specifically, if we were to encode outcomes of P using a code optimized for Q, on average,
we would use about 0.09458 more nats than if we used a code optimized for P. The higher value compared
to the coin example reflects a greater difference between these two distributions. KL divergence provides
a single numerical value to quantify this discrepancy.
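To see where the total comes from, the per-outcome terms can be listed individually. This is a sketch with illustrative variable names; it assumes every outcome has nonzero probability under both distributions, as in this example.

```python
import math

p = {"A": 0.2, "B": 0.5, "C": 0.3}  # true distribution P
q = {"A": 0.4, "B": 0.4, "C": 0.2}  # approximating distribution Q

# One term of the sum per outcome: P(x) * ln(P(x) / Q(x)).
contributions = {x: p[x] * math.log(p[x] / q[x]) for x in p}
total = sum(contributions.values())

for outcome, value in contributions.items():
    print(f"{outcome}: {value:+.5f}")
print(f"total: {total:.5f} nats")  # total: 0.09458 nats
```

The term for A is negative because Q overestimates that outcome, yet the sum over all outcomes is still positive, as the non-negativity property requires.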
