CHAPTER 5
Conditional Probability and Bayes’s Theorem
1. Introduction
Previously we mentioned the Bayesian interpretation of probability: the probability of an event is the
(subjective) degree of belief I have that the event will occur. (I may modify this belief in the light of
new evidence.)
It is called Bayesian because Bayes’s Theorem is used to modify our degree of belief (probability) in the
light of new evidence. But first we need to discuss how the probability of one event depends on another:
conditional probability.
2. Joint and Conditional Probability
2.1. Joint probability.
Definition 5.1. The joint probability of events A and B is the probability P (A and B) of both events
occurring, sometimes expressed as $P(A \cap B)$ (say ‘A intersection B’).
In Figure 4.2 in Chapter 4, we saw two sets (events) A and B within a sample space S, with the joint
event $A \cap B$ = A and B shown in green.
Examples of joint probability:
• probability that a random playing card is a red ace (red and an ace);
• probability that a person is dark-haired and has a limp;
• probability that a town has a railway station and a youth hostel.
For the last of these, Figure 5.1 shows the joint event as the intersection of the two sets “Towns with
Stations” (A) and “Towns with Hostels” (B).
Figure 5.1. Joint probability
70 MIS10090
2.2. Independent events.
Definition 5.2. Two events A and B are called independent if the occurrence of one event does not
influence the probability of occurrence of the other event.
Fact: If the two events A and B are independent, then:
P (A and B) = P (A)P (B) (multiply probabilities)
In particular, note that independence of A and B is an entirely different thing from having P (A and B) =
0, which we have called mutually exclusive. Suppose A and B are mutually exclusive (and each has non-zero
probability): if A occurs then B cannot occur, so the occurrence of A has definitely changed the
probability of B occurring to P (B|A) = 0, and the events are certainly not independent!
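The difference between independent and mutually exclusive events can be checked by counting cards in a standard 52-card deck. The following is a minimal sketch only: the deck encoding and the helper name `prob` are assumptions introduced for illustration, not part of these notes.

```python
from fractions import Fraction
from itertools import product

# Build a standard 52-card deck as (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))

def prob(event):
    """Probability that a uniformly random card satisfies `event` (hypothetical helper)."""
    favourable = sum(1 for card in deck if event(card))
    return Fraction(favourable, len(deck))

def is_red(card):
    return card[1] in ("hearts", "diamonds")

def is_ace(card):
    return card[0] == "A"

def is_heart(card):
    return card[1] == "hearts"

def is_spade(card):
    return card[1] == "spades"

# Independent events: P(red and ace) equals P(red) * P(ace).
p_red_ace = prob(lambda c: is_red(c) and is_ace(c))
independent = p_red_ace == prob(is_red) * prob(is_ace)

# Mutually exclusive events: P(heart and spade) is 0, which differs from
# P(heart) * P(spade), so these events are NOT independent.
p_heart_spade = prob(lambda c: is_heart(c) and is_spade(c))
exclusive_not_independent = (
    p_heart_spade == 0 and p_heart_spade != prob(is_heart) * prob(is_spade)
)

print(p_red_ace, independent)                    # 1/26 True
print(p_heart_spade, exclusive_not_independent)  # 0 True
```

Exact fractions are used rather than floats so that the equality test for independence is not affected by rounding.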
2.3. Conditional probability.
Definition 5.3. The conditional probability of event B on event A is the probability that event B occurs,
given that event A has already occurred. It is defined as
\[
P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}.
\]
We sometimes call this the probability of B conditioned on A.
Some intuition for this definition:
Notice that if A and B are independent events, then
\[
\begin{aligned}
P(B \mid A) &= \frac{P(A \text{ and } B)}{P(A)} \\
&= \frac{P(A)\,P(B)}{P(A)} \quad \text{since A and B are independent} \\
&= P(B).
\end{aligned}
\]
That is, the probability of B given that A has occurred is just the normal probability of B: exactly as
we would expect if A and B are independent.
Also note that, rearranging the terms in the definition, P (A and B) = P (B|A)P (A).
By symmetry (since P (A and B) = P (B and A)), we have that P (A and B) = P (A|B)P (B).
In particular, if A and B are mutually exclusive events, that is, P (A and B) = 0, then
• P (B|A) = 0: the probability of B occurring, given that A has occurred, is 0; and
• P (A|B) = 0: the probability of A occurring, given that B has occurred, is 0.
It is intuitively easier to see in terms of areas in a Venn diagram by observing the following: if we know
that a town has a railway station (that is, A has occurred), what is the probability that it has a youth
hostel (that is, P (B|A))?
Figure 5.2. Conditional probability
Taking the conditional probability P (B|A) means we are changing (reducing) the sample space by
restricting to the subset A. Intuitively, if the areas on the diagram are roughly proportional to the
Data Analysis for Decision Makers 71
probabilities, we are asking: what is the relative proportion within the event A in which the event B
occurs (the right-hand part of Figure 5.2)?
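The idea of shrinking the sample space can be sketched with counts. The town counts below are hypothetical (they are not given in these notes) and were chosen only so that the arithmetic is easy to follow.

```python
from fractions import Fraction

# Hypothetical counts (NOT from the notes), for illustration only.
total_towns = 200
with_station = 80   # event A: town has a railway station
with_hostel = 50    # event B: town has a youth hostel
with_both = 30      # event A and B

p_a = Fraction(with_station, total_towns)
p_b = Fraction(with_hostel, total_towns)
p_a_and_b = Fraction(with_both, total_towns)

# Definition 5.3: P(B|A) = P(A and B) / P(A).
p_b_given_a = p_a_and_b / p_a

# Reduced sample space: look only at towns with a station and
# count how many of those also have a hostel.
p_b_given_a_by_counting = Fraction(with_both, with_station)

print(p_b_given_a, p_b_given_a_by_counting)  # 3/8 3/8
print(p_b)                                   # 1/4
```

With these made-up counts, knowing that a town has a station raises the probability of a hostel from 1/4 to 3/8, so the two events would not be independent.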
2.3.1. Total probability. If we have an event A and another event B, we can write
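\[
A = (A \cap B) \cup (A \cap \overline{B}),
\]
that is, A is the union of
• the part of A overlapping with B, and
• the part of A overlapping with the complement $\overline{B}$.
Thus, since $A \cap B$ and $A \cap \overline{B}$ are disjoint (because B and $\overline{B}$ are a partition of S), we can write the
probability of A as the sum of two things:
• the probability of A conditioned on B, weighted by $P(B)$, and
• the probability of A conditioned on the complement $\overline{B}$, weighted by $P(\overline{B})$:
\[
\begin{aligned}
P(A) &= P(A \text{ and } B) + P(A \text{ and } \overline{B}) \\
&= P(A \mid B)\,P(B) + P(A \mid \overline{B})\,P(\overline{B}).
\end{aligned}
\]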
We sometimes call this expression the total probability of A with respect to B.
Example: P (Hostel) = P (Hostel|Station)P (Station) + P (Hostel|No Station)P (No Station).
The above is for the case where the sample space S is partitioned into two parts, B and $\overline{B}$. More
generally, suppose the sample space S is partitioned into k parts; then we have:
Definition 5.4. Suppose the sample space S is partitioned into k parts, B1 , . . . , Bk . Then we can write
the total probability of A with respect to B1 , . . . , Bk as
P (A) = P (A and B1 ) + P (A and B2 ) + · · · + P (A and Bk )
= P (A|B1 )P (B1 ) + P (A|B2 )P (B2 ) + · · · + P (A|Bk )P (Bk ).
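The total probability formula can be written as a short function. This is a sketch under stated assumptions: the function name `total_probability` is introduced here, it expects exact `Fraction` inputs so that the partition check is not upset by rounding, and the hostel and station figures are hypothetical.

```python
from fractions import Fraction

def total_probability(priors, likelihoods):
    """Return P(A) = sum over i of P(A|B_i) * P(B_i) for a partition B_1..B_k.

    `priors` holds P(B_i); `likelihoods` holds P(A|B_i) in the same order.
    Exact Fractions are assumed so the 'sums to 1' check is reliable.
    """
    if len(priors) != len(likelihoods):
        raise ValueError("priors and likelihoods must have the same length")
    if sum(priors) != 1:
        raise ValueError("the P(B_i) must sum to 1 for a partition")
    return sum(lik * pri for lik, pri in zip(likelihoods, priors))

# Hypothetical values (NOT from the notes) for the hostel example:
# P(Station) = 2/5, P(Hostel|Station) = 3/8, P(Hostel|No Station) = 1/6.
p_hostel = total_probability(
    priors=[Fraction(2, 5), Fraction(3, 5)],
    likelihoods=[Fraction(3, 8), Fraction(1, 6)],
)
print(p_hostel)  # 1/4
```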
2.4. Worked examples.
Example 5.5. From the contingency table in Example 4.15 earlier, let’s define M to be the event that
an employee is male and I to be the event that an employee works in IT. We can work out from the
table the probabilities that when we select a random employee, that employee is:
• Male: P (M ) = 470/1000 = 0.47
• Male and works in IT: P (M and I) = 50/1000 = 0.05
• Male or works in IT: P (M or I) = (470 + 130 − 50)/1000 = 0.55
• Male given that he/she works in IT: P (M |I) = 50/130 ≈ 0.38.
Notice that P (M ) ≠ P (M |I), so these events are not independent.
Also note that the phrase “Male given that he/she works in IT” is just another way of saying “Male
conditioned on working in IT”, which hints to us that conditional probabilities are involved.
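The calculations in Example 5.5 can be reproduced from the four counts used above (1000 employees, 470 male, 130 in IT, 50 both). The variable names are chosen here for illustration.

```python
from fractions import Fraction

# Counts used in Example 5.5.
total = 1000
male = 470
works_in_it = 130
male_and_it = 50

p_m = Fraction(male, total)
p_i = Fraction(works_in_it, total)
p_m_and_i = Fraction(male_and_it, total)

# Addition rule: subtract the overlap so it is not counted twice.
p_m_or_i = p_m + p_i - p_m_and_i

# Conditional probability: restrict attention to the 130 IT employees.
p_m_given_i = p_m_and_i / p_i

print(float(p_m), float(p_m_and_i), float(p_m_or_i))  # 0.47 0.05 0.55
print(round(float(p_m_given_i), 3))                   # 0.385
print(p_m_and_i == p_m * p_i)                         # False, so not independent
```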
3. Bayes’ Theorem and adjusting Probability in light of new Evidence
3.1. Bayes’ Theorem. Recall, to calculate the joint probability of A and B happening we can say
• P (A and B) = P (A|B)P (B)
or
• P (A and B) = P (B|A)P (A).
Thus P (B|A)P (A) = P (A|B)P (B) since each is equal to P (A and B).
Divide across by P (A) to get P (B|A) = P (A|B)P (B)/P (A).
Now substitute P (A) with our earlier formula for total probability, to get:
Theorem 5.6 (Bayes’ Theorem). For any two events A and B,
\[
P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A \mid B)\,P(B) + P(A \mid \overline{B})\,P(\overline{B})}.
\]
In the more general case of the sample space S being partitioned into k parts B1 , . . . , Bk (see
Definition 5.4), we get
Theorem 5.7 (Bayes’ Theorem, general case). Suppose B1 , . . . , Bk is a partition of the sample space S.
Then for any event A, and any one of the Bi ,
\[
\begin{aligned}
P(B_i \mid A) &= \frac{P(A \mid B_i)\,P(B_i)}{P(A)} \\
&= \frac{P(A \mid B_i)\,P(B_i)}{P(A \mid B_1)\,P(B_1) + \cdots + P(A \mid B_k)\,P(B_k)}.
\end{aligned}
\]
3.1.1. How do we use Bayes’ Theorem? First, it will allow us to “flip” conditional probabilities: if
we know P (A|B) (and some other information), we can calculate P (B|A).
Second, recall the Bayesian interpretation of probability = degree of belief.
Bayes’ theorem is used to factor in new probability information in order to revise existing probabilities
(called “prior” probabilities: prior to or before the new evidence).
We modify our prior belief (probability) in the light of new evidence to get our posterior belief (proba-
bility). (Posterior means after using the new evidence.)
We’ll look at an example in this lecture of the use of Bayes’ theorem: the trial of John Hinckley. This
will show us that the correct understanding of conditional probability can be a matter of life or death.
3.2. Example of incorporating new evidence: John Hinckley’s trial. This example is based
on work of Professor Irwin Greenberg of George Mason University, presented in the article: Barnett, A.,
Greenberg, I. and Machol, R. E. (1984), “Misapplication Reviews: Hinckley and the Chemical Bath”,
Interfaces, 14(4):48–52.
When John Hinckley was on trial for his 1981 assassination attempt on US President Ronald Reagan,
his defence team wanted to use an insanity defence. The insanity defence was used in fewer than 2% of
criminal trials and was unsuccessful in three out of every four attempts. They used an expert witness
in an attempt to demonstrate that Hinckley was likely to be schizophrenic (the basis of their insanity
plea).
Dr. Daniel R. Weinberger, psychiatrist for the National Institute of Mental Health, testified that:
• 30% of CAT scans on schizophrenic people show brain atrophy (shrinkage);
• Only 2% of scans on non-schizophrenics show atrophy.
Defence lawyer Gregory B. Craig wanted to introduce a CAT scan of Hinckley, which showed brain
atrophy.
So is it the case that the odds of Hinckley having schizophrenia are 30:2 (that is, 15:1), which would
mean a probability of 15/(15 + 1) = 15/16 = 93.75%? Let’s see . . .
Bayes’s theorem can be used to test the implications of the psychiatrist’s evidence.
Let A be the event that a CAT scan shows atrophy, and let S be the event that a person is schizophrenic
(so $\overline{S}$ means the person is not schizophrenic). Note that in this example S denotes an event, not the
sample space.
• Atrophy is found in 30% of schizophrenics, so $P(A \mid S) = 0.30$.
• Atrophy is found in only 2% of non-schizophrenics, so $P(A \mid \overline{S}) = 0.02$.
• It is known (from a psychiatry conference in 1976) that 1.5% of the US population suffer from
schizophrenia, so the base probability is $P(S) = 0.015$ (and therefore $P(\overline{S}) = 1 - 0.015 = 0.985$).
What is P (S|A)?
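\[
\begin{aligned}
P(S \mid A) &= \frac{P(A \mid S)\,P(S)}{P(A \mid S)\,P(S) + P(A \mid \overline{S})\,P(\overline{S})} \\
&= \frac{0.30(0.015)}{0.30(0.015) + 0.02(0.985)} \\
&\approx 0.186.
\end{aligned}
\]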
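The calculation above can be checked numerically. The function name `posterior` and its parameter names are assumptions introduced for this sketch; the three input probabilities are the ones quoted in the testimony and the base rate given above.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Two-event Bayes' Theorem: P(H | evidence) from P(H) and the two likelihoods."""
    numerator = p_evidence_given_h * prior
    # Denominator is the total probability of seeing the evidence.
    evidence = numerator + p_evidence_given_not_h * (1 - prior)
    return numerator / evidence

p_s_given_a = posterior(
    prior=0.015,                  # P(S): base rate of schizophrenia
    p_evidence_given_h=0.30,      # P(A|S): atrophy given schizophrenic
    p_evidence_given_not_h=0.02,  # P(A|not S): atrophy given not schizophrenic
)
print(round(p_s_given_a, 3))  # 0.186
```

The result is far below the 93.75% suggested by reading the 30:2 ratio directly as odds, because that reading ignores the low base rate.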
Therefore, if the statistical implications of Hinckley’s CAT scan had been explained, the defence would
have been arguing that there was only an 18.6% chance that Hinckley suffered from schizophrenia.
However, Bayes’ theorem was not used to make such an interpretation.
Hinckley was found not guilty by reason of insanity.
Some members of the jury indicated that they had been strongly persuaded by the evidence put forward
by the psychiatrist.
Does the Bayes’ finding of 18.6% show that the psychiatric evidence was misapplied?
Some statisticians have suggested that a jury finding Hinckley to be schizophrenic could be consistent
with Bayes’ theorem (see footnote 1).
The 18.6% hinges on assuming that prior to introducing the CAT scan evidence, Hinckley has a standard
1.5% chance of being schizophrenic. But it could be argued that there are already grounds for suspecting
a higher than normal prior probability of schizophrenia: say 10%. What would this lead to?
A 10% prior probability belief that Hinckley is schizophrenic becomes a 63% posterior probability, after
the positive scan result is taken into account.
Exercise 5.8. For practice, test this finding by applying Bayes’ theorem to the CAT scan
probabilities, using
• a 10% prior probability of schizophrenia, and
• a 90% prior probability of non-schizophrenia.
Show that you get a posterior probability of about 63% (62.5% before rounding).
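To see how strongly the posterior depends on the prior, the same scan probabilities can be applied to several priors. This is a sketch only: the function name `posterior` and the choice of priors to try (including the 50% case) are assumptions made for illustration.

```python
from fractions import Fraction

# Scan probabilities from the testimony.
P_ATROPHY_GIVEN_S = Fraction(30, 100)
P_ATROPHY_GIVEN_NOT_S = Fraction(2, 100)

def posterior(prior):
    """P(S | atrophy) for a given prior P(S), using exact fractions."""
    numerator = P_ATROPHY_GIVEN_S * prior
    return numerator / (numerator + P_ATROPHY_GIVEN_NOT_S * (1 - prior))

# Priors to compare: the 1.5% base rate, the 10% suggested in the text,
# and a 50% prior added here purely for comparison.
priors = [Fraction(15, 1000), Fraction(1, 10), Fraction(1, 2)]
posteriors = {prior: posterior(prior) for prior in priors}

for prior, post in posteriors.items():
    print(f"prior {prior} -> posterior {post}")
```

The 10% prior gives exactly 5/8 = 62.5%. A 50% prior gives 15/16, which is the 93.75% figure from the naive 15:1 reading: treating the ratio of the two scan rates as the final odds amounts to assuming even odds before the scan.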
The not guilty finding sparked public outrage. Following the trial, insanity defence laws were rewritten by
Congress and the defence was abolished by some states. This has led to the execution of several people
who might otherwise have successfully pleaded insanity. Hinckley remained in care in a psychiatric
institution until released on September 10, 2016.
3.3. Bayes’ Theorem in Decision Analysis. Bayes’ Theorem can be applied to decision analysis:
it enables us to factor in the impact of evidence and new information. Prior probabilities are revised in
light of new information, to generate posterior probabilities.
It can also be used to “reverse” or “flip” conditional probabilities.
For example, medical research may give the probability of observing a symptom (e. g., fever) given a
person has a particular disease (e. g., flu), that is, P (fever|flu). But usually a doctor is presented with
a patient showing a symptom and wants the probability that they have the disease, that is, the doctor
wants to know P (flu|fever).
If the doctor knows the actual or base prevalence of flu in the population (that is, P (flu)), he/she can
work out the probability that a particular patient with fever has flu by using Bayes’ Theorem:
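\[
P(\text{flu} \mid \text{fever}) = \frac{P(\text{fever} \mid \text{flu})\,P(\text{flu})}{P(\text{fever} \mid \text{flu})\,P(\text{flu}) + P(\text{fever} \mid \text{no flu})\,P(\text{no flu})}.
\]
(Of course, P (no flu) = 1 − P (flu).)
Exercise 5.9. Try this yourself: research online to find reasonable values for P (fever|flu), P (fever|no flu)
and P (flu).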
1. See the Interfaces article referenced earlier, which gives Greenberg’s work.
Similar applications to this medical one arise in testing for COVID-19 and in drug testing of athletes.
Exercise 5.10. A new test for COVID-19 has been developed. It gives either a positive or a negative
result. Experiments have been carried out on the usefulness of this test, both on people known to have
COVID-19 and people known not to have COVID-19. The results of these experiments were:
• if the tested person has COVID-19, there is a 0.90 probability that the test will be positive;
• if the tested person does not have COVID-19, there is a 0.95 probability that the test will be
negative.
Suppose that 8% of the people to be tested do in fact have COVID-19.
(1) Work out the probability that a randomly selected person will test positive.
(2) Suppose that a randomly selected person tests positive. Work out the probability that he or
she actually has COVID-19.
(3) Suppose that a randomly selected person tests negative. Work out the probability that he or
she actually has COVID-19.
Exercise 5.11. Suppose that a certain test for drugs can only give either a “positive” or a “negative”
result. From earlier experiments on drug testing, scientists have found that
• if the tested person uses drugs, there is a 0.92 probability that the test will be “positive.”
• if the tested person does not use drugs, there is a 0.96 probability that the test will be “negative.”
Suppose that 10% of the competitors to be tested are in fact drug users.
(a) Work out the probability that a randomly selected competitor will test positive.
(b) Suppose that a randomly selected competitor tests positive. Work out the probability that he
or she actually uses drugs.
(c) Suppose that a randomly selected competitor tests negative. Work out the probability that he
or she actually uses drugs.
The numbers given here are actually realistic. Do the answers surprise you? Consider the potential
advantages and disadvantages of bringing in a mandatory drug testing policy using this particular test.
Example 5.12. An experimental chemotherapy treatment improves the recovery rate for a certain type
of cancer from 50% if untreated to 75% if treated. Of 100 patients in a hospital, ten are given the
new treatment. Later, one of the 100 patients is found to have recovered from the cancer. What is the
probability that this patient received the experimental chemotherapy treatment?
Solution. Define sample space S = all 100 patients, and events as follows:
R: patient recovers from the cancer
C: patient receives the chemotherapy treatment
N : patient does not receive the chemotherapy treatment.
Since C and $N = \overline{C}$ are a partition (mutually exclusive and collectively exhaustive), Bayes’s Theorem
says
\[
P(C \mid R) = \frac{P(R \mid C)\,P(C)}{P(R \mid C)\,P(C) + P(R \mid N)\,P(N)}.
\]
From the information given, P (C) = 10/100 = 0.1 and P (N ) = 90/100 = 0.9. Also, P (R|C) = 0.75 and
P (R|N ) = 0.5. Thus
\[
P(C \mid R) = \frac{0.75 \times 0.1}{0.75 \times 0.1 + 0.5 \times 0.9} = \frac{0.075}{0.075 + 0.45} = \frac{0.075}{0.525} = \frac{1}{7}.
\]
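Another way to see the answer of 1/7 in Example 5.12 is to think in expected numbers of patients rather than probabilities. The sketch below takes that "natural frequencies" view; it is offered as an alternative check, not as the method used in the solution above, and the variable names are chosen here.

```python
from fractions import Fraction

# Figures from Example 5.12.
patients = 100
treated = 10
untreated = patients - treated
recovery_rate_treated = Fraction(3, 4)    # 75% if treated
recovery_rate_untreated = Fraction(1, 2)  # 50% if untreated

# Expected number of recoveries in each group.
recovered_treated = treated * recovery_rate_treated        # 7.5 patients
recovered_untreated = untreated * recovery_rate_untreated  # 45 patients

# Among all expected recoveries, the share who were treated is P(C|R).
p_treated_given_recovered = recovered_treated / (
    recovered_treated + recovered_untreated
)
print(p_treated_given_recovered)  # 1/7
```

Although treatment raises the recovery rate, only 10 of the 100 patients were treated, so most recovered patients still come from the untreated group.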
Example 5.13. A data processing company has 3 locations, A1 , A2 and A3 , which are responsible for
50%, 30% and 20% of the data processing respectively. Past records reveal that the probability of having
an error with data processed by:
• A1 is 0.01
• A2 is 0.05
• A3 is 0.20
Calculate:
(1) the probability of an error,
(2) the probability the error is the responsibility of A3 , given that an error has been found.
Solution Approach:
• First, let B represent the event that an error has occurred. We can partition the sample space
of all pieces of data processing work as in Figure 5.3. Note that these areas are not proportional
to the probabilities given.
Figure 5.3. Partitioning the sample space of all pieces of data processing work into
those processed by A1 , A2 and A3 . The errors are the event B (in red).
• Next, label the information given:
– Prior probabilities P (A1 ) = 0.5, P (A2 ) = 0.3 and P (A3 ) = 0.2;
– Conditional probabilities P (B|A1 ) = 0.01, P (B|A2 ) = 0.05 and P (B|A3 ) = 0.20.
Then the solutions are as follows:
(1) Probability of an error is just the total probability that an error has occurred, P (B). Sum up
the ways the errors can occur: we can get an error from A1 , A2 or A3 :
\[
\begin{aligned}
P(B) &= \sum_{i=1}^{3} P(B \mid A_i)\,P(A_i) \\
&= 0.5 \times 0.01 + 0.3 \times 0.05 + 0.2 \times 0.20 \\
&= 0.06 = 6\%.
\end{aligned}
\]
(2) To find the probability that the error is the responsibility of A3 , given an error has occurred,
we use Bayes’s Theorem to “flip” the conditional probabilities:
• By Bayes’s Theorem,
\[
P(A_3 \mid B) = \frac{P(B \mid A_3)\,P(A_3)}{P(B)}.
\]
• Recall from (1) above that P (B) = 0.06 = 6%, the simple probability of an error.
• What proportion is attributable to A3 , that is, what is P (A3 |B)?
\[
P(A_3 \mid B) = \frac{0.2 \times 0.20}{0.06} \approx 66.67\%.
\]
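Both parts of Example 5.13 can be computed together, and the same steps give the posterior for every location, not just A3. This is a minimal sketch; the dictionary layout and variable names are choices made here for illustration.

```python
from fractions import Fraction

# Prior share of the work done at each location: P(A_i).
priors = {
    "A1": Fraction(50, 100),
    "A2": Fraction(30, 100),
    "A3": Fraction(20, 100),
}
# Error rate at each location: P(B|A_i).
error_rates = {
    "A1": Fraction(1, 100),
    "A2": Fraction(5, 100),
    "A3": Fraction(20, 100),
}

# Joint probabilities P(B and A_i) = P(B|A_i) * P(A_i).
joint = {loc: error_rates[loc] * priors[loc] for loc in priors}

# Part (1): total probability of an error.
p_error = sum(joint.values())

# Part (2): Bayes' Theorem for each location, P(A_i|B).
posteriors = {loc: joint[loc] / p_error for loc in priors}

print(p_error)           # 3/50, that is 0.06
print(posteriors["A3"])  # 2/3, about 66.67%
```

The posteriors for A1 and A2 come out as 1/12 and 1/4, and the three posteriors sum to 1, as they must for a partition.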
Example 5.14. Each year, before delivering presents, Santa divides all of the children in Ireland into
two groups: “naughty” and “nice”. Historical data shows that “naughty” children have a 50% chance of
getting the present they asked for, while “nice” children have a 75% chance of getting what they asked
for. This year Santa has decided that 10% of the children meet the standard of “nice”.
Santa tells you that your five-year-old nephew will get the present he asked for. What is the probability
that your nephew is “nice”?
Solution: Define events as follows:
G: a child is “nice” (G stands for Good);
N : a child is “naughty”;
W : Win: the child gets the present they asked for;
L: Lose: the child doesn’t get the present they asked for.
Then we are told:
$P(G) = 0.1$ and so $P(N) = 1 - 0.1 = 0.9$, since $N = \overline{G}$;
$P(W \mid N) = 0.5$ and so $P(L \mid N) = 1 - 0.5 = 0.5$, since $L = \overline{W}$;
$P(W \mid G) = 0.75$ and so $P(L \mid G) = 1 - 0.75 = 0.25$.
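We are asked for P (G|W ), so use Bayes’s Theorem:
\[
\begin{aligned}
P(G \mid W) &= \frac{P(W \mid G)\,P(G)}{P(W \mid G)\,P(G) + P(W \mid \overline{G})\,P(\overline{G})} \\
&= \frac{P(W \mid G)\,P(G)}{P(W \mid G)\,P(G) + P(W \mid N)\,P(N)} \\
&= \frac{0.75 \times 0.1}{0.75 \times 0.1 + 0.5 \times 0.9} \\
&= \frac{0.075}{0.075 + 0.45} = \frac{0.075}{0.525} = \frac{1}{7},
\end{aligned}
\]
so the probability that your nephew was “nice” is $\frac{1}{7} \approx 0.142857$ or 14.3%.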
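The answer of 1/7 can also be checked by simulation: generate many children, keep only those who got the present they asked for, and see what fraction of them were “nice”. This is a rough sketch; the function name `simulate`, the number of simulated children and the fixed seed are all choices made here, and the estimate only approximates the exact value.

```python
import random

def simulate(n_children, seed=0):
    """Estimate P(nice | got present) by simulating n_children children."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    winners = 0
    nice_winners = 0
    for _ in range(n_children):
        nice = rng.random() < 0.10        # P(G) = 0.1
        p_win = 0.75 if nice else 0.50    # P(W|G) and P(W|N)
        if rng.random() < p_win:
            winners += 1
            if nice:
                nice_winners += 1
    # Condition on W: restrict to children who got their present.
    return nice_winners / winners

estimate = simulate(200_000)
print(f"simulated P(G|W) = {estimate:.3f}, exact 1/7 = {1 / 7:.3f}")
```

Dividing by the number of winners, rather than by the number of children, is the simulation counterpart of reducing the sample space to the event W.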