Probability Theory
What is probability?
Probability of occurrence of
◦ an event – discrete
◦ a number in continuous data
Example –
◦ Air Quality
What is the probability that today's air quality index is unhealthy?
What is the probability that today's air quality index is 161.2?
◦ Grades
What is the probability that I will not qualify in the exam?
What is the probability that I will be among the first 10 in a class of 40 with marks = 75?
Probability = n/N, where n is the number of outcomes in which the event occurs and N is the total number of outcomes
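As a minimal sketch of Probability = n/N, the snippet below counts how often an event occurs in a set of observations. The AQI readings and the cut-off of 151 for "unhealthy" are hypothetical values chosen for illustration, not real measurements.

```python
# Hypothetical daily AQI readings (illustrative values, not real measurements).
aqi_readings = [42, 88, 155, 161.2, 97, 170, 60, 152, 110, 75]

# Assumed cut-off for the "unhealthy" category; replace with the scale you use.
UNHEALTHY_THRESHOLD = 151

# n = number of observations in which the event occurred, N = all observations.
n = sum(1 for value in aqi_readings if value >= UNHEALTHY_THRESHOLD)
N = len(aqi_readings)
p_unhealthy = n / N

print(f"n = {n}, N = {N}, P(unhealthy) = {p_unhealthy}")
```

For continuous data, a question about one exact value such as 161.2 is usually rephrased as a question about a range of values.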
Taking the grades example:
If the mean grade of the class is 50, what is the likelihood that I will be among the top 10 in a class of 40 with a grade of 80?
To answer this, you need a measure of dispersion and z-scores.
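The grades question can be sketched with a z-score once a standard deviation is known. The slide does not give one, so the value of 15 below is a hypothetical assumption, as is the assumption that grades are normally distributed.

```python
from statistics import NormalDist

class_mean = 50  # from the grades example
my_grade = 80    # from the grades example
class_sd = 15    # hypothetical: the example does not state the spread

# z-score: how many standard deviations the grade lies above the mean.
z = (my_grade - class_mean) / class_sd

# Assuming normally distributed grades, the share of students expected to score higher.
share_above = 1 - NormalDist(mu=class_mean, sigma=class_sd).cdf(my_grade)

# The top 10 of 40 students corresponds to the top 25% of the class.
in_top_quarter = share_above < 10 / 40

print(f"z = {z:.2f}, share scoring higher = {share_above:.4f}, top 10 of 40: {in_top_quarter}")
```

With a different standard deviation the same grade of 80 gives a different z-score, which is why the mean alone cannot answer the question.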
MEASURES OF DISPERSION
What is dispersion?
Data arranged in ascending order: 1, 1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 9
Average = 3.9
Is it the true measure?
What is dispersion?
[Figure: two plots with a horizontal axis running from 0 to 12]
Dispersion
Describes the spread of the data around a central location
The mean is a hypothetical value
The mean is a model created to summarize our data
Deviance
It is the difference between the observed
value and the modelled value
It can be considered the error for each observation
This gives information on each observation
What if we add up the deviances of all the observations?
Total error = sum of all the deviances = ???
Dispersion
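Total error = sum of deviations from the mean
Deviance of observation $i$ = $x_i - \bar{x}$, so total error = $\sum_{i=1}^{N} (x_i - \bar{x})$
So, the total error is zero, because positive and negative deviations cancel
But there are errors
Calculate the sum of squared errors
Sum of squared errors = sum of the squared deviations
SS = $\sum_{i=1}^{N} (x_i - \bar{x})^2$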
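The cancellation of deviations, and how squaring avoids it, can be illustrated with a short sketch. The five values below are a hypothetical sample, chosen so that the mean is a whole number and the arithmetic is easy to follow.

```python
# Hypothetical sample, chosen so that the mean is a whole number.
data = [2, 4, 4, 5, 10]

mean = sum(data) / len(data)

# Deviance of each observation: observed value minus the modelled value (the mean).
deviances = [x - mean for x in data]

# Positive and negative deviances cancel, so the total error is zero.
total_error = sum(deviances)

# Squaring removes the sign, so the errors no longer cancel.
sum_of_squares = sum(d ** 2 for d in deviances)

print("deviances:", deviances)
print("total error:", total_error)
print("sum of squared errors (SS):", sum_of_squares)
```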
Dispersion
Variance ($s^2$) = average of the squared errors = SS / (N - 1) (for a sample, SS is divided by N - 1 rather than N)
Standard deviation ($s$) = $\sqrt{s^2}$
◦ Measures how well the mean represents the data, in the same units as the observations $x_i$
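A minimal sketch of the step from SS to variance and standard deviation follows. It assumes the sample form of the variance (dividing by N - 1), which is what Python's `statistics.variance` uses; the data are a hypothetical sample.

```python
import math
import statistics

# Hypothetical sample (not taken from the slides).
data = [2, 4, 4, 5, 10]

mean = sum(data) / len(data)
ss = sum((x - mean) ** 2 for x in data)  # sum of squared errors

# Assumption: sample variance, so SS is divided by N - 1 rather than N.
variance = ss / (len(data) - 1)

# The square root brings the measure back to the units of the observations.
sd = math.sqrt(variance)

# Cross-check against the standard library's sample variance.
matches_library = math.isclose(variance, statistics.variance(data))

print(f"SS = {ss}, variance = {variance}, sd = {sd}, matches library: {matches_library}")
```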
Coefficient of variation
Standard deviation relative to the mean, also known as relative variability: CV = $s / \bar{x}$
For example, a standard deviation of 2 has a different interpretation for a mean of 6 than for a mean of 60.
The coefficient of variation is the degree of dispersion relative to the mean of the data
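The example of a standard deviation of 2 with means of 6 and 60 can be worked through in a few lines. The helper name `coefficient_of_variation` is a hypothetical one introduced here for illustration.

```python
def coefficient_of_variation(sd, mean):
    """Return the standard deviation as a fraction of the mean (hypothetical helper)."""
    return sd / mean

# Same standard deviation, two different means.
cv_small_mean = coefficient_of_variation(2, 6)
cv_large_mean = coefficient_of_variation(2, 60)

print(f"mean 6:  CV = {cv_small_mean:.1%}")
print(f"mean 60: CV = {cv_large_mean:.1%}")
```

The same spread is about a third of the mean in the first case and only about a thirtieth in the second, so it signals far more relative variability when the mean is 6.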
Example
Data 1 Data 2
3 1
3 4
6 10
7 20
8 2
9 4
2 6
3 8
1 14
4 16
Compute the mean, median and standard deviation for both datasets
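One way to check a hand calculation for this exercise is sketched below using Python's `statistics` module. It assumes the sample standard deviation (N - 1 denominator) is wanted; `summarize` is a hypothetical helper name.

```python
import statistics

# The two datasets from the table above.
data_1 = [3, 3, 6, 7, 8, 9, 2, 3, 1, 4]
data_2 = [1, 4, 10, 20, 2, 4, 6, 8, 14, 16]

def summarize(values):
    """Return mean, median and sample standard deviation (N - 1 denominator)."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
    }

summary_1 = summarize(data_1)
summary_2 = summarize(data_2)

for name, s in (("Data 1", summary_1), ("Data 2", summary_2)):
    print(f"{name}: mean = {s['mean']:.2f}, median = {s['median']:.2f}, sd = {s['sd']:.2f}")
```

If the population form is wanted instead, `statistics.pstdev` divides by N.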
Properties of frequency distribution
Lack of symmetry (skewness)
◦ Skewed distributions are not symmetrical
◦ Frequent scores (tall bars in the histogram) are clustered at one end
Pointiness (kurtosis)
◦ Platykurtic: flatter than a normal distribution, with less data in the tails
◦ Leptokurtic: more pointed than a normal distribution, with more data in the tails
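Skewness and kurtosis can be put into numbers with moment-based formulas. The sketch below uses one common definition (third and fourth central moments scaled by the variance, with 3 subtracted from kurtosis so a normal distribution scores about 0); the function names and the three samples are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment: the mean of (x - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    """Moment-based skewness: 0 for a symmetric sample, > 0 for a long right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

# Hypothetical samples chosen to show each property.
right_skewed = [1, 1, 2, 2, 3, 9]                 # frequent scores at the low end
flat = [1, 2, 3, 4, 5]                            # evenly spread, light tails
heavy_tailed = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]  # peaked with extreme values

print("skewness, right-skewed sample:", skewness(right_skewed))
print("excess kurtosis, flat sample:", excess_kurtosis(flat))
print("excess kurtosis, heavy-tailed sample:", excess_kurtosis(heavy_tailed))
```

A negative excess kurtosis indicates a platykurtic shape and a positive value a leptokurtic one. Statistical packages may apply small-sample corrections, so their results can differ slightly from these formulas.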
Skewness
What will the mean be with respect to the median?
Symmetrical:
◦ Mean = median
Positively skewed:
◦ Mean > median
Negatively skewed:
◦ Mean < median
[Figure: a positively skewed and a negatively skewed distribution]
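The relation between mean and median can be checked on small samples. The three lists below are hypothetical, chosen so that one is symmetrical, one has a long right tail and one has a long left tail.

```python
import statistics

# Hypothetical samples (not taken from the slides).
symmetrical = [1, 2, 3, 4, 5]
positively_skewed = [1, 2, 2, 3, 12]     # long tail to the right
negatively_skewed = [1, 10, 11, 11, 12]  # long tail to the left

results = {}
for name, sample in (
    ("symmetrical", symmetrical),
    ("positively skewed", positively_skewed),
    ("negatively skewed", negatively_skewed),
):
    # The extreme values in the tail pull the mean, but barely move the median.
    results[name] = (statistics.mean(sample), statistics.median(sample))
    print(f"{name}: mean = {results[name][0]}, median = {results[name][1]}")
```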
Kurtosis
What is the
difference in standard
deviation?
Box and Whisker plots
Measures of shape of distribution
If the standard deviation
◦ is small, then the data are clustered around the mean
◦ is large, then the mean is a poor representation of the data