Lecture-2
Assignment-1 problems:
1. P(∅) = 0
We have S = S ∪ ∅ and S ∩ ∅ = ∅, so S and ∅ are disjoint events. Hence
P(S) = 1
P(S ∪ ∅) = P(S) = 1
and, by additivity over disjoint events,
P(S) + P(∅) = P(S ∪ ∅) = 1 ⟹ P(∅) = 0
2. P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
[Figure: Venn diagram of events A and B]
From the figure, we can write
A ∪ B = A ∪ (Aᶜ ∩ B)
and also we have A ∩ (Aᶜ ∩ B) = ∅
Thus
P(A ∪ B) = P(A) + P(Aᶜ ∩ B)   (1)
[Figure: Venn diagram showing B split into A ∩ B and Aᶜ ∩ B]
Further we have
B = (A ∩ B) ∪ (Aᶜ ∩ B) and (A ∩ B) ∩ (Aᶜ ∩ B) = ∅
Thus
P(B) = P(A ∩ B) + P(Aᶜ ∩ B)
Hence,
P(Aᶜ ∩ B) = P(B) − P(A ∩ B)   (2)
Substituting (2) in (1), we get
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
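To see the identity concretely, here is a minimal Python sketch (not part of the original assignment; the sample space and events are illustrative choices) that checks the formula on a finite sample space with equally likely outcomes:

# Check P(A ∪ B) = P(A) + P(B) − P(A ∩ B) with equally likely outcomes.
S = set(range(1, 11))        # sample space {1, ..., 10} (illustrative)
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7}

def P(E):
    # Probability of event E: favourable outcomes over total outcomes.
    return len(E) / len(S)

lhs = P(A | B)                    # P(A ∪ B)
rhs = P(A) + P(B) - P(A & B)      # P(A) + P(B) − P(A ∩ B)
print(lhs, rhs)                   # both print 0.7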
Example 1: A card is drawn at random from a pack of cards. Find the probability of getting the Jack of Spades.
There is one outcome favourable to the event 'Jack of Spades' and there are 52 total outcomes for the experiment of 'drawing a card from a pack of cards'. Hence P(Jack of Spades) = 1/52.
Conditional Probability: Let A be the event 'I carry an umbrella for 60 days a year'. I may carry an umbrella on really hot days as well as on days when it rains. Then P(A) = 60/365. Let B be the event 'It rains for 90 days in a year'. Then P(B) = 90/365. This information is gathered after a lot of observations, and P(A) and P(B) are known as a priori probabilities, that is, probabilities based on overall information and not based on any specific information.
Suppose it has rained today and someone in my office needs an umbrella to use for a short time. What would be his idea of the probability that I have carried an umbrella with me, so that he can borrow it from me? The fact that it has rained would increase the probability of my carrying an umbrella. This leads to the idea of conditional probability -
P(I carry an umbrella given that it has rained) - denoted by P(A|B).
Suppose I carry an umbrella for 45 days (out of the 60 days in total that I carry an umbrella) out of the 90 days it rains. Then P(A|B) = 45/90 = 0.5. P(A|B) is known as the a posteriori probability, that is, the probability that I carry an umbrella after we get some additional information (that it has rained).
The conditional probability reduces the sample space - you consider only the 90 days when it rains, not all 365 days, and count how many times I have carried an umbrella out of those 90 days.
The event when both A and B have occurred is denoted by A ∩ B, and in our example it happens on 45 days of the year. Hence P(A ∩ B) = 45/365. Now we have
P(A|B) = 45/90 = (45/365)/(90/365) = P(A ∩ B)/P(B)
Thus we have the expression for conditional probability:
P(A|B) = P(A ∩ B)/P(B)   (3)
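The umbrella numbers above can be plugged into (3) directly; a small Python sketch (illustrative, using the counts from the text):

# Conditional probability from the umbrella example.
days_year, days_umbrella, days_rain, days_both = 365, 60, 90, 45
P_B = days_rain / days_year            # P(B)
P_A_and_B = days_both / days_year      # P(A ∩ B)
print(days_both / days_rain)           # reduced sample space: 45/90 = 0.5
print(P_A_and_B / P_B)                 # definition (3): same value, 0.5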
Bayes Theorem: The conditional probability P(B|A) is given by
P(B|A) = P(A ∩ B)/P(A)
From the above we get
P(A ∩ B) = P(B|A) P(A)   (4)
Substituting (4) in (3) we get
P(A|B) = P(B|A) P(A)/P(B)   (5)
Equation (5) is known as Bayes Theorem and is used extensively in many real-life applications. The probabilities P(A), P(B), and P(B|A) are measured experimentally, and P(A|B) is calculated to make some decisions.
Example:
Consider a vehicle whose axle is found to fail on certain occasions.
Let A be the event 'Axle failed on a given day'.
Let B be the event 'Axle temperature has gone > (ambient + 75°) on a given day'.
Let us say that it has been observed that the axle fails on average 4 times a year. Thus P(A) = 4/365.
Let us also say that it has been observed that the axle temperature has gone beyond (ambient + 75°) on 5 days a year. That is, P(B) = 5/365.
Now researchers analyse the temperature of the axle on the days it has failed and observe that the axle's temperature went beyond (ambient + 75°) just before the axle failed in 90% of the failed cases. Thus P(the axle temperature > ambient + 75° given the axle has failed) = P(B|A) = 90/100 = 0.9.
Now we want the probability P(A|B) - what is the probability that the axle fails in the near future given that the temperature has been observed to go beyond ambient + 75°?
P(A|B) = P(B|A) P(A)/P(B)
Plugging in the values of P(A), P(B), and P(B|A), we get
P(A|B) = 0.9 × (4/365)/(5/365) = 0.72
We see that the a priori probability of axle failure, P(A) = 4/365, was very small, but with the new information that the axle temperature has been observed to cross ambient + 75°, the probability of axle failure P(A|B) shoots up to 0.72. This enables the technical people to take the vehicle immediately to a service station, check the axle, and rectify any problems before it actually fails.
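As a quick numerical check of this calculation, a minimal Python sketch using the numbers given above:

# Bayes theorem for the axle example: P(A|B) = P(B|A) P(A) / P(B).
P_A = 4 / 365          # axle fails on about 4 days a year
P_B = 5 / 365          # temperature exceeds ambient + 75° on 5 days a year
P_B_given_A = 0.9      # threshold crossed in 90% of failure cases
print(P_B_given_A * P_A / P_B)   # 0.72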
Likewise, Bayes theorem has been applied in various fields like medicine, weather forecasting, crime investigation, component failure analysis, etc.
Total Probability: A partition of a set B is a collection of subsets {B_i}, i = 1, …, n, of B such that
B_i ∩ B_j = ∅ for i ≠ j   (6)
∪_{i=1}^{n} B_i = B_1 ∪ B_2 ∪ … ∪ B_n = B   (7)
The property denoted by equation (6) is called the 'mutually exclusive' property; that is, if the event denoted by subset B_i occurs, it eliminates the possibility of B_j (i ≠ j) occurring. The property denoted by equation (7) is called the 'collectively exhaustive' property; that is, any outcome of the experiment will fall into at least one of the B_i. The statements 'mutually exclusive' and 'collectively exhaustive' together mean that any outcome of the experiment falls into precisely one B_i. Thus, a partition of set B is a collection of mutually exclusive and collectively exhaustive subsets B_i of B.
Total Probability Theorem: Any subset A of B can be written as
A = (A ∩ B_1) ∪ (A ∩ B_2) ∪ … ∪ (A ∩ B_n) = ∪_{i=1}^{n} (A ∩ B_i)
Since (A ∩ B_i) ∩ (A ∩ B_j) = ∅ for i ≠ j, we have
P(A) = P(A ∩ B_1) + P(A ∩ B_2) + … + P(A ∩ B_n).
Thus we obtain the
[Total Probability Theorem]  P(A) = ∑_{i=1}^{n} P(A ∩ B_i) = ∑_{i=1}^{n} P(A|B_i) P(B_i)   (8)
Equation (8) is known as the total probability theorem. Now we want to find the probability P(B_i|A):
P(B_i|A) = P(B_i ∩ A)/P(A) = P(A|B_i) P(B_i)/P(A) = P(A|B_i) P(B_i) / ∑_{j=1}^{n} P(A|B_j) P(B_j)   (9)
The importance of equation (9) is seen in the following example.
Example: Factories B_i, i = 1, 2, 3, manufacture a certain part, and the probability of a defective part for the three factories is given by P(D|B_1) = 0.3, P(D|B_2) = 0.4, P(D|B_3) = 0.1. A company procures the part from factory B_1 10% of the time, from factory B_2 15% of the time, and from factory B_3 the remaining 75% of the time. After procuring the part, if the company finds it to be defective, what is the probability that the part has come from B_1, B_2, and B_3 respectively?
We are asked to find P(B_i|D) for i = 1, 2, 3.
From (8),
P(D) = ∑_{i=1}^{3} P(D ∩ B_i) = ∑_{i=1}^{3} P(D|B_i) P(B_i) = 0.3 × 0.1 + 0.4 × 0.15 + 0.1 × 0.75 = 0.165
P(B_1|D) = P(B_1 ∩ D)/P(D) = P(D|B_1) P(B_1)/P(D) = (0.3 × 0.1)/0.165 = 0.18
P(B_2|D) = P(B_2 ∩ D)/P(D) = P(D|B_2) P(B_2)/P(D) = (0.4 × 0.15)/0.165 = 0.36
P(B_3|D) = P(B_3 ∩ D)/P(D) = P(D|B_3) P(B_3)/P(D) = (0.1 × 0.75)/0.165 = 0.45
Even though B_2 has the highest probability of defective parts (P(D|B_2) = 0.4), the probability that the defective part has come from B_3 is the largest (P(B_3|D) = 0.45) because the probability of procuring from B_3 is the largest (75%).
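The whole computation for this example can be written compactly; a short Python sketch (illustrative, using the numbers from the example):

# Total probability (8) and Bayes posteriors (9) for the factory example.
P_B = [0.10, 0.15, 0.75]         # P(B1), P(B2), P(B3): procurement shares
P_D_given_B = [0.3, 0.4, 0.1]    # P(D|Bi): defect rates per factory

# Total probability theorem: P(D) = sum_i P(D|Bi) P(Bi)
P_D = sum(pd * pb for pd, pb in zip(P_D_given_B, P_B))
# Posteriors: P(Bi|D) = P(D|Bi) P(Bi) / P(D)
posteriors = [pd * pb / P_D for pd, pb in zip(P_D_given_B, P_B)]
print(P_D)           # 0.165
print(posteriors)    # ≈ [0.18, 0.36, 0.45]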
Random Variables: A numerical variable representing any quantity like profit, loss, voltage, current, number of marks scored, etc. can be classified into different categories. For example, a variable can be classified according to the numerical type of values it takes - continuous or discrete. A variable representing voltage, current, or temperature takes on a continuous set of values. On the other hand, a variable representing the number of marks scored takes on values from a discrete set, that is, it takes on only those specific values mentioned in the discrete set. Thus, a variable could be continuous or discrete. Another way of classifying a variable depends on the behaviour of the underlying phenomenon - deterministic or stochastic (probabilistic). For example, every time a stone is thrown up from a point in a complete vacuum, it reaches the same height (deterministic). However, the temperature at a particular point in a town at exactly 12:00 noon will not be the same every day of the year (stochastic). Hence we have classifications like (continuous, deterministic), (continuous, stochastic), (discrete, deterministic), and (discrete, stochastic). In this course we focus more on the (continuous, stochastic) and (discrete, stochastic) categories.
Random Variable: A random variable is a function that maps a point in the sample space (an outcome of an experiment) to a number on the real line. That is, a random variable associates a real number with every outcome (and, in general, with every event).
Example: Toss a coin twice and the outcomes are {HH, HT, TH, TT}. The random variable
X='number of heads' associates the numbers as follows X(HH) = 2, X(HT) = 1, X(TH) = 1,
X(TT) = 0. Another random variable Y specifies reward (in rupee) you get as Y(HH) = 10,
Y(HT) = 2, Y(TH) = -5, Y(TT) = -5.
Now we can assign probabilities using random variables X and Y as
P(X=3) = 0, since you cannot get 3 heads in tossing a coin twice
P(X=2) = P(HH) = 1/4
P(X=1) = P(HT U TH) = P(HT)+P(TH) = 1/2
P(X=0) = 1/4
P(X<=1) = P(X=0) + P(X=1) = 3/4
P(X<=2) = P(X=0) + P(X=1) + P(X=2) = 1
P(X<=3) = P(X=0) + P(X=1) + P(X=2) + P(X=3) = 1
Similarly
P(Y = -5) = P(TH)+P(TT) = 1/2
P(Y=10) = P(HH) = 1/4
and so on.
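These probabilities can also be obtained by enumerating the four equally likely outcomes; a minimal Python sketch (illustrative):

from itertools import product
from collections import defaultdict

# Two tosses of a fair coin; X = number of heads, Y = reward in rupees.
outcomes = [''.join(t) for t in product('HT', repeat=2)]  # HH, HT, TH, TT
X = {'HH': 2, 'HT': 1, 'TH': 1, 'TT': 0}
Y = {'HH': 10, 'HT': 2, 'TH': -5, 'TT': -5}

def pmf(rv):
    # Each outcome has probability 1/4; accumulate over equal values of the RV.
    p = defaultdict(float)
    for w in outcomes:
        p[rv[w]] += 1 / len(outcomes)
    return dict(p)

print(pmf(X))   # {2: 0.25, 1: 0.5, 0: 0.25}
print(pmf(Y))   # {10: 0.25, 2: 0.25, -5: 0.5}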
Topics for self study: (i) Binomial distribution (ii) Poisson distribution. Each of these distributions has certain parameter(s). Study the importance of those parameters.
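As a starting point for the self-study, a minimal sketch using scipy.stats (assuming SciPy is available; the parameter values are illustrative) showing where each distribution's parameters enter:

from scipy.stats import binom, poisson

# Binomial(n, p): n = number of trials, p = probability of success per trial.
n, p = 10, 0.3
print(binom.pmf(3, n, p))     # P(exactly 3 successes in 10 trials)

# Poisson(mu): mu = average number of events per interval.
mu = 2.5
print(poisson.pmf(4, mu))     # P(exactly 4 events in one interval)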
In continuous distributions we use integrals while evaluating probabilities, whereas in discrete distributions we use summations while evaluating probabilities.
Continuous distributions are characterised by a probability density function (pdf). The pdf of a random variable X is usually denoted by f_X(x), where 'X' is the name of the random variable and x denotes the values that X takes.
Example: The probability that a random variable X takes any value between 'a' and 'b' is given by
P(a ≤ X ≤ b) = ∫_{a}^{b} f_X(x) dx
and
P(−∞ ≤ X ≤ ∞) = ∫_{−∞}^{+∞} f_X(x) dx = 1
because the value that X takes on is a real number, and any real number certainly lies in the interval (−∞, +∞).
For a continuous random variable, P(X = c) = 0 because
P(X = c) = P(c ≤ X ≤ c) = ∫_{c}^{c} f_X(x) dx = 0
Hence, in the case of continuous random variables, think of a range of numbers to calculate the probability and not specific values: P(X ∈ an interval (a, b)) is meaningful, while P(X = c) = 0.
Example: Find the value of 'c' for the f_X(x) given below to be a valid probability density function:
f_X(x) = c x² for 0 ≤ x ≤ 3, and 0 otherwise
For a valid probability density function we need to have
∫_{−∞}^{+∞} f_X(x) dx = 1
therefore,
∫_{−∞}^{0} f_X(x) dx + ∫_{0}^{3} f_X(x) dx + ∫_{3}^{∞} f_X(x) dx = 1
Substituting f_X(x) = 0 for −∞ ≤ x ≤ 0, f_X(x) = c x² for 0 ≤ x ≤ 3, and f_X(x) = 0 for 3 ≤ x ≤ ∞, we get
∫_{−∞}^{0} 0 dx + ∫_{0}^{3} c x² dx + ∫_{3}^{∞} 0 dx = 1
which gives c = 1/9
Hence we get
f_X(x) = (1/9) x² for 0 ≤ x ≤ 3, and 0 otherwise
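The normalisation can also be checked symbolically; a minimal sketch using sympy (assuming it is installed):

import sympy as sp

x, c = sp.symbols('x c', positive=True)
total = sp.integrate(c * x**2, (x, 0, 3))   # integral of c*x^2 over [0, 3] = 9c
print(sp.solve(sp.Eq(total, 1), c))         # [1/9]

# With f_X fixed, probabilities are integrals, e.g. P(1 <= X <= 2):
print(sp.integrate(x**2 / 9, (x, 1, 2)))    # 7/27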
Study Topic: (i) Normal (Gaussian) Distribution - parameters ( μ , σ )
(ii) Exponential Distribution - parameter λ
Moments: The expected value, denoted by E(X), of a random variable X is given by the expression
E(X) = ∫_{−∞}^{+∞} x f_X(x) dx
E(X) is also called the first moment M_1 of the random variable X. E(X) is also the mean of the random variable X, which is denoted by μ(X) or μ_X. Hence, E(X) is the same as μ(X) or μ_X.
Similarly, the second moment M_2 of X is defined as
M_2 = ∫_{−∞}^{+∞} x² f_X(x) dx
The second moment M_2 becomes the variance σ² of X if E(X) = 0. In general, the nth moment M_n is given by
M_n = ∫_{−∞}^{+∞} xⁿ f_X(x) dx
The third moment M_3 is associated with 'skewness', and the fourth moment M_4 with 'kurtosis'.
These moments can be found using the data from experiments, and these moments are nothing but the parameters of the assumed distribution (like (μ, σ) for the normal distribution or λ for the exponential distribution). Once we find the parameters we can identify the exact distribution. Thus we will be able to fit a distribution to the observed data.
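A small sketch of this idea for a normal distribution, using simulated data (the true values μ = 5, σ = 2 are illustrative choices, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100_000)   # 'observed' data

mu_hat = np.mean(data)                                # first moment estimates mu
sigma_hat = np.sqrt(np.mean((data - mu_hat) ** 2))    # second central moment estimates sigma^2
print(mu_hat, sigma_hat)                              # close to the true parameters (5, 2)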
Sampling: Many times we need to estimate the parameters using a subset of the data that could be made available. The reason is that, though we could collect the entire data for estimating the parameters, the collection of data itself is in many instances impractical, time consuming, and costly. This gives rise to the idea of sampling. For example, let us assume we receive a big consignment, say 10,000 units, of some machine part. We need to know the average number of defective units per 100 units. It is evidently a very time-consuming and costly affair to test each and every unit we have received to estimate this average. Instead we take a sample of this big consignment, say a sample of 1000 units, and test only these 1000 units to estimate the average number of defective units per 100 units. Suppose out of these 1000 units we find 30 to be defective; then we can say there are on average 3 defective units per 100 units. The whole batch of 10,000 units is called the 'population' and the specific batch we have chosen is called a 'sample'. What we have found is the 'sample average per 100', denoted by Ȳ:
Ȳ = (∑_{i=1}^{1000} x_i / 1000) × 100
The average resulting from testing the whole population (10,000 units) is denoted by μ:
μ = (∑_{i=1}^{10000} x_i / 10000) × 100
Now the quantity μ is fixed because we take the whole population in finding it, whereas the value of Ȳ depends on the sample we have chosen. If we had chosen a different batch of 1000 units, we would have obtained a different value for Ȳ. Thus Ȳ is a random variable (its value is not fixed, but depends on the sample we have chosen, and also on the sample size we have chosen). This Ȳ is called an 'estimate' of μ, which is fixed. We say Ȳ is an example of a 'statistic' - which means 'an estimate of a population parameter'.
We should note that the values x_i in the sample may be from a discrete set. This does not mean that the random variable Ȳ is discrete (because it is the sum of the x_i divided by some number).
We say a statistic (X̄) is unbiased when the expected value of the statistic equals the population parameter (μ). Thus X̄ is an unbiased estimate of the population parameter μ if
E(X̄) = μ
Usually the population parameters are denoted by Greek letters like μ, σ, θ, λ, etc., and the statistics are denoted using Roman letters like X̄, s², etc.
Prove that
s² = ∑_{i=1}^{n} (x_i − X̄)² / (n − 1)
is an unbiased estimate of the variance σ², where X̄ is the sample mean given by
X̄ = ∑_{i=1}^{n} x_i / n
On the other hand,
s̃² = ∑_{i=1}^{n} (x_i − X̄)² / n
is NOT an unbiased estimate of σ².
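A simulation sketch of this claim (assuming numpy; the sample size and trial count are illustrative): averaging s² over many samples approaches σ², while the divide-by-n version stays biased low.

import numpy as np

rng = np.random.default_rng(1)
sigma2 = 4.0                 # true population variance (illustrative)
n, trials = 5, 200_000       # small samples make the bias visible

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
s2 = samples.var(axis=1, ddof=1)        # divide by n-1 (unbiased)
s2_tilde = samples.var(axis=1, ddof=0)  # divide by n (biased)
print(s2.mean())        # ≈ 4.0
print(s2_tilde.mean())  # ≈ 4.0 × (n-1)/n = 3.2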
One should ideally develop 'domain knowledge' as well as knowledge about the data. Domain knowledge pertains to the underlying system - how the variables pertaining to the system (like temperature, pressure, speed, and acceleration in a physical system, or sugar levels, blood pressure, and cholesterol levels in a medical system) influence one another and how they are related to some of the effects observed. Knowledge about the data involves details about how the sampling was made: is the sample chosen representative of the whole population?