Statistics for Data Science-1
Week-3 Graded Assignment
1. The numbers a, b, c, d have frequencies (x + 6), (x + 2), (x − 3) and x respectively. If
their mean is m, find the value of x. (Enter the value as next highest integer)
Solution:
$$\frac{a(x + 6) + b(x + 2) + c(x - 3) + dx}{(x + 6) + (x + 2) + (x - 3) + x} = m$$
$$\frac{ax + 6a + bx + 2b + cx - 3c + dx}{4x + 5} = m$$
$$ax + bx + cx + dx + 6a + 2b - 3c = m(4x + 5) = (4m)x + 5m$$
$$(a + b + c + d - 4m)x = 5m - 6a - 2b + 3c$$
$$x = \frac{5m - 6a - 2b + 3c}{a + b + c + d - 4m}$$
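Suppose we substitute the values of a, b, c, d and m as 2, 7, 9, 17 and 6.88 respectively. Then
$$x = \frac{(5 \times 6.88) - (6 \times 2) - (2 \times 7) + (3 \times 9)}{2 + 7 + 9 + 17 - (4 \times 6.88)} = \frac{35.4}{7.48} \approx 4.73$$
Hence, taking the next highest integer, x = 5.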
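The closed form for x can be checked numerically. The following is a minimal sketch, not part of the original solution; the function name `solve_frequency_offset` is a hypothetical choice, and `math.ceil` is used on the assumption that "next highest integer" means rounding up.

```python
import math


def solve_frequency_offset(a, b, c, d, m):
    """Return x = (5m - 6a - 2b + 3c) / (a + b + c + d - 4m)."""
    numerator = 5 * m - 6 * a - 2 * b + 3 * c
    denominator = a + b + c + d - 4 * m
    if denominator == 0:
        # The formula is undefined when a + b + c + d equals 4m.
        raise ValueError("a + b + c + d must differ from 4m")
    return numerator / denominator


x = solve_frequency_offset(2, 7, 9, 17, 6.88)
x_rounded_up = math.ceil(x)  # assumed meaning of "next highest integer"
print(round(x, 2), x_rounded_up)  # prints: 4.73 5
```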
The mean and sample standard deviation of the dataset consisting of N observations
is m and s respectively. Later it is noted that one observation x is wrongly noted as p.
Based on the given information, answer questions (2) and (3).
2. What is the mean of the original dataset? (Correct up to 2 decimal place accuracy)
Solution:
Let the sum of all the observations of the noted dataset be $T$ and that of the original dataset be $T'$.
$$\text{Mean} = \frac{T}{N} = m \quad\Rightarrow\quad T = m \times N$$
Therefore, $T' = T - p + x$. Hence, the mean of the original dataset is $\frac{T'}{N}$.
Suppose we substitute the values of N, m, s, x and p as 8, 13, 8, 18 and 13 respectively.
Let the sum of all the observations of the noted dataset be $T$ and that of the original dataset be $T'$.
$$\text{Mean} = \frac{T}{8} = 13 \quad\Rightarrow\quad T = 13 \times 8 = 104$$
Therefore, $T' = T - p + x = 104 - 13 + 18 = 109$.
Hence, the mean of the original dataset is $\frac{T'}{N} = \frac{109}{8} = 13.625$.
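The correction T' = T − p + x translates directly into code. This is an illustrative sketch; the function and parameter names are assumptions introduced here, not taken from the assignment.

```python
def corrected_mean(n, noted_mean, wrong_value, true_value):
    """Mean of the original dataset when one observation was misrecorded."""
    noted_total = noted_mean * n                             # T = m * N
    original_total = noted_total - wrong_value + true_value  # T' = T - p + x
    return original_total / n


original_mean = corrected_mean(n=8, noted_mean=13, wrong_value=13, true_value=18)
print(original_mean)  # prints: 13.625
```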
3. What is the sample variance of the original dataset? (Correct up to 2 decimal place
accuracy)
Solution:
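Sample variance,
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{N - 1} = \frac{\sum (x_i^2 - 2x_i\bar{x} + \bar{x}^2)}{N - 1} = \frac{\sum x_i^2 - 2\bar{x}\sum x_i + N\bar{x}^2}{N - 1}$$
$$\Rightarrow s^2 = \frac{\sum x_i^2 - 2\bar{x}(N\bar{x}) + N\bar{x}^2}{N - 1} = \frac{\sum x_i^2}{N - 1} - \frac{N\bar{x}^2}{N - 1}$$
Let $\sum x_i^2$ be equal to $A$ for the noted dataset and $B$ for the original dataset.
So, $B = A - p^2 + x^2$,
$$\text{where } A = \left(s^2 + \frac{N m^2}{N - 1}\right) \times (N - 1)$$
Also, mean of the original dataset $= \frac{T'}{N}$.
Hence, sample variance for the original dataset
$$= \frac{B}{N - 1} - \frac{N \times \left(\frac{T'}{N}\right)^2}{N - 1} = \frac{B}{N - 1} - \frac{T'^2}{N(N - 1)}$$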
Suppose we substitute the values of N, m, s, x and p as 8, 13, 8, 18 and 13 respectively.
Let $\sum x_i^2$ be equal to $A$ for the noted dataset and $B$ for the original dataset.
$$\text{So, } A = \left(8^2 + \frac{8 \times 13^2}{7}\right) \times (8 - 1) = 1800$$
Therefore, $B = 1800 - 13^2 + 18^2 = 1955$.
Hence, sample variance for the original dataset $= \frac{1955}{8 - 1} - \frac{109^2}{8 \times 7} = 67.125$.
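The same steps (recover the sum of squares A, correct it to B, then apply the shortcut formula) can be sketched in Python. The names below are hypothetical and chosen only for illustration.

```python
def corrected_sample_variance(n, noted_mean, noted_sd, wrong_value, true_value):
    """Sample variance of the original dataset after fixing one observation."""
    # A: sum of squares of the noted data, from s^2 = (A - N*m^2) / (N - 1)
    noted_sum_sq = noted_sd ** 2 * (n - 1) + n * noted_mean ** 2
    # B = A - p^2 + x^2
    original_sum_sq = noted_sum_sq - wrong_value ** 2 + true_value ** 2
    # T' = T - p + x
    original_total = noted_mean * n - wrong_value + true_value
    # s'^2 = B / (N - 1) - T'^2 / (N (N - 1))
    return original_sum_sq / (n - 1) - original_total ** 2 / (n * (n - 1))


original_variance = corrected_sample_variance(
    n=8, noted_mean=13, noted_sd=8, wrong_value=13, true_value=18
)
print(round(original_variance, 3))  # prints: 67.125
```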
4. Let the data $x_1, x_2, \dots, x_n$ represent the retail prices in rupees of a certain commodity
in n randomly selected shops in a particular city. What will be the sample variance in
the retail prices, if c rupees is added to all the retail prices? (Correct up to 2 decimal
place accuracy)
Solution:
$$\text{Mean, } \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$$
If c rupees is added to all the retail prices, then the new prices will be $y_i = x_i + c$; $i = 1, 2, \dots, n$.
Then, New variance = Old variance, i.e.,
$$\frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{\sum [(x_i + c) - (\bar{x} + c)]^2}{n - 1} = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
Suppose the value of n is 6 and the observations are 46, 34, 82, 37, 83, 66. Then
$$\text{Mean} = \frac{46 + 34 + 82 + 37 + 83 + 66}{6} = 58$$
$$\text{Sample variance } (s^2) = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
$$= \frac{(46 - 58)^2 + (34 - 58)^2 + (82 - 58)^2 + (37 - 58)^2 + (83 - 58)^2 + (66 - 58)^2}{5} = 485.2$$
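The invariance of the sample variance under a constant shift can be confirmed with the standard library. In this sketch the shift `c = 10` is an arbitrary value assumed for illustration; the question leaves c unspecified.

```python
import statistics

prices = [46, 34, 82, 37, 83, 66]
c = 10  # hypothetical shift in rupees; any constant gives the same variance

shifted_prices = [price + c for price in prices]

# statistics.variance uses the n - 1 denominator (sample variance).
original_variance = statistics.variance(prices)
shifted_variance = statistics.variance(shifted_prices)

print(original_variance, shifted_variance)
```

Both values agree with the hand computation of 485.2, since every deviation from the mean is unchanged by the shift.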
Suppose we have n observations $x_1, x_2, \dots, x_n$. Based on the given information, answer questions (5), (6) and (7):
5. Calculate the 10th, 50th and 100th percentiles.
Solution:
To find the sample 100p percentile of a dataset of size n:
(1) Arrange the data in ascending order.
(2) If np is not an integer, determine the smallest integer greater than np. The data
value in that position is the sample 100p percentile.
(3) If np is an integer, then the average of the values in positions np and np + 1 is the
sample 100p percentile.
For example,
Let n = 7 with observations 31, 36, 25, 34, 115, 108, 88 and ascending order is 25, 31, 34, 36, 88, 108, 115
then,
(i) n = 7 and p = 0.1, then np = 0.7.
Therefore, 10th percentile will be 1st observation = 25.
(ii) n = 7 and p = 0.5, then np = 3.5.
Therefore, 50th percentile will be the 4th observation = 36.
(iii) n = 7 and p = 1, then np = 7.
Since there is no observation in position np + 1 = 8, the 100th percentile will be the last observation = 115.
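The positional rule used in this solution can be sketched as a small function. The name `sample_percentile` is hypothetical. Two assumptions are made here that the text does not state: the percentile is passed as an integer percentage so that np is computed exactly, and when np = 0 the first observation is returned (the text only covers the np = n case).

```python
def sample_percentile(data, percent):
    """Sample percentile by the positional rule; percent is an int in 0..100."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    ordered = sorted(data)
    n = len(ordered)
    # n * p = whole + remainder / 100, computed with integers to avoid
    # floating-point error when testing whether np is an integer.
    whole, remainder = divmod(n * percent, 100)
    if remainder != 0:
        # np is not an integer: take position whole + 1 (1-based).
        return ordered[whole]
    if whole == 0:
        return ordered[0]    # assumption: no position 0, use the first value
    if whole == n:
        return ordered[-1]   # no position n + 1, use the last value
    # np is an integer: average positions np and np + 1 (1-based).
    return (ordered[whole - 1] + ordered[whole]) / 2


observations = [31, 36, 25, 34, 115, 108, 88]
print([sample_percentile(observations, q) for q in (10, 50, 100)])
# prints: [25, 36, 115]
```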
6. Calculate the Inter Quartile Range (IQR) of the data.
Solution:
To find the sample 100p percentile of a dataset of size n:
(1) Arrange the data in ascending order.
(2) If np is not an integer, determine the smallest integer greater than np. The data
value in that position is the sample 100p percentile.
(3) If np is an integer, then the average of the values in positions np and np + 1 is the
sample 100p percentile.
For Q1 , p = 0.25
And, for Q3 , p = 0.75
Therefore, IQR = Q3 − Q1
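For example, take the 7 observations from question (5), whose ascending order is 25, 31, 34, 36, 88, 108, 115.
Q1 = 25th percentile. Given n = 7 and p = 0.25, np = 1.75.
Therefore, Q1 is the 2nd observation = 31.
Q3 = 75th percentile. Given n = 7 and p = 0.75, np = 5.25.
Therefore, Q3 is the 6th observation = 108.
Hence, IQR = Q3 − Q1 = 108 − 31 = 77.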
7. How many outliers are there?
Solution:
We know, IQR = Q3 − Q1.
An observation is an outlier if it is less than Q1 − 1.5 × IQR or greater than Q3 + 1.5 × IQR.
For example, take the 7 observations from question (5), whose ascending order is 25, 31, 34, 36, 88, 108, 115.
Q1 = 25th percentile of the data. Given n = 7 and p = 0.25, np = 1.75.
Therefore, Q1 is the 2nd observation = 31.
Q3 = 75th percentile of the data. Given n = 7 and p = 0.75, np = 5.25.
Therefore, Q3 is the 6th observation = 108.
Hence, IQR = Q3 − Q1 = 108 − 31 = 77.
Now, 31 − (1.5 × 77) = −84.5 and 108 + (1.5 × 77) = 223.5.
No observation is less than −84.5 or greater than 223.5. Hence, there are
no outliers in the given data.
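The quartile, IQR and outlier-fence computations can be combined in one short sketch. The helper names are hypothetical, and `Fraction` is used here only as an assumed way to test exactly whether np is an integer.

```python
import math
from fractions import Fraction


def positional_percentile(ordered, p):
    """Percentile of ascending data for a Fraction p with 0 < p < 1."""
    position = len(ordered) * p
    if position.denominator != 1:
        # np is not an integer: smallest integer greater than np (1-based).
        return ordered[math.ceil(position) - 1]
    k = int(position)
    # np is an integer: average of positions np and np + 1 (1-based).
    return (ordered[k - 1] + ordered[k]) / 2


def find_outliers(data):
    """Return Q1, Q3, IQR, the two fences and the list of outliers."""
    ordered = sorted(data)
    q1 = positional_percentile(ordered, Fraction(1, 4))
    q3 = positional_percentile(ordered, Fraction(3, 4))
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in ordered if v < lower_fence or v > upper_fence]
    return q1, q3, iqr, lower_fence, upper_fence, outliers


q1, q3, iqr, lower_fence, upper_fence, outliers = find_outliers(
    [31, 36, 25, 34, 115, 108, 88]
)
print(q1, q3, iqr, lower_fence, upper_fence, outliers)
# prints: 31 108 77 -84.5 223.5 []
```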
8. In a deck, there are cards numbered 1 to n such that the number of cards of a given
number is the same as the number on the card. Which of the following statement(s)
is/are true about the mean and mode of the numbers on this deck of card?
a. Mode is n.
b. Mean is $\frac{2n + 1}{3}$.
c. Mode is n − 1.
d. Mean is n.
e. Mean is $\frac{n + 1}{2}$.
f. Mode is not defined for this data.
Answer: a, b
Solution:
Given that the number of cards of a number in the deck is the same as the number on
the card. It means that:
Number (xi ) Frequency (fi )
1 1
2 2
... ...
... ...
n n
Table 3.1
Hence, Mode = n.
Now, total number of observations $= f_1 + f_2 + \dots + f_n = 1 + 2 + \dots + n = \frac{n(n + 1)}{2}$
Sum of observations $= f_1 x_1 + f_2 x_2 + \dots + f_n x_n = 1 \times 1 + 2 \times 2 + \dots + n \times n$
So, $f_1 x_1 + f_2 x_2 + \dots + f_n x_n = \frac{n(n + 1)(2n + 1)}{6}$
$$\text{Therefore, Mean} = \frac{f_1 x_1 + f_2 x_2 + \dots + f_n x_n}{f_1 + f_2 + \dots + f_n} = \frac{\frac{n(n + 1)(2n + 1)}{6}}{\frac{n(n + 1)}{2}} = \frac{2n + 1}{3}$$
Hence, options (a) and (b) are correct.
For example, n = 42
Given that the number of cards of a number in the deck is the same as the number on
the card, it means that:
Number (xi ) Frequency (fi )
1 1
2 2
... ...
... ...
42 42
Table 3.2
Hence, Mode = 42.
Now, total number of observations $= f_1 + f_2 + \dots + f_{42} = 1 + 2 + \dots + 42 = \frac{42(42 + 1)}{2}$
Sum of observations $= f_1 x_1 + f_2 x_2 + \dots + f_{42} x_{42} = 1 \times 1 + 2 \times 2 + \dots + 42 \times 42$
So, $f_1 x_1 + f_2 x_2 + \dots + f_{42} x_{42} = \frac{42(42 + 1)(2(42) + 1)}{6}$
$$\text{Mean} = \frac{f_1 x_1 + f_2 x_2 + \dots + f_{42} x_{42}}{f_1 + f_2 + \dots + f_{42}} = \frac{\frac{42(42 + 1)(2(42) + 1)}{6}}{\frac{42(42 + 1)}{2}} = \frac{2(42) + 1}{3}$$
Hence, Mean = 28.33
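The frequency-table argument can be checked by building the table directly. This is a minimal sketch with a hypothetical function name; `Fraction` is used so that the mean can be compared exactly with (2n + 1)/3.

```python
from fractions import Fraction


def deck_mean_and_mode(n):
    """Mean and mode of a deck in which the number k appears on k cards."""
    if n < 1:
        raise ValueError("n must be at least 1")
    frequencies = {number: number for number in range(1, n + 1)}
    total_cards = sum(frequencies.values())
    total_of_numbers = sum(number * count for number, count in frequencies.items())
    mean = Fraction(total_of_numbers, total_cards)
    mode = max(frequencies, key=frequencies.get)  # number with highest frequency
    return mean, mode


mean, mode = deck_mean_and_mode(42)
print(mode, mean, round(float(mean), 2))  # prints: 42 85/3 28.33
```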
Figure 3.1.G shows a stem and leaf plot of the ratings (out of 100) of an actor’s
performance in different movies. Based on the given information, answer questions (9)
and (10).
Stem | Leaf
5 | 3 9
7 | 2 2 5 8
8 | 7 7 7
9 | 9
Here 6 | 4 represents rating of 64.
Figure 3.1.G
9. What is the Inter Quartile Range (IQR) (Correct up to 1 decimal point accuracy)?
Solution:
To find the sample 100p percentile of a dataset of size n:
(1) Arrange the data in ascending order.
(2) If np is not an integer, determine the smallest integer greater than np. The data
value in that position is the sample 100p percentile.
(3) If np is an integer, then the average of the values in positions np and np + 1 is the
sample 100p percentile.
For Q1 , p = 0.25
And, for Q3 , p = 0.75
Therefore, IQR = Q3 − Q1
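Here, the number of observations is n = 10, and the ratings in ascending order are 53, 59, 72, 72, 75, 78, 87, 87, 87, 99.
For $Q_1$: $np = \frac{10}{4} = 2.5$, which is not an integer, so $Q_1$ = 3rd observation = 72.
For $Q_3$: $np = \frac{30}{4} = 7.5$, which is not an integer, so $Q_3$ = 8th observation = 87.
Therefore, IQR = Q3 − Q1 = 87 − 72 = 15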
10. What is the median rating, if x points are added to all of his ratings and then converted
to y points? (Correct up to 2 decimal point accuracy)
Solution:
There are 10 observations in the data. So, the median of the given data will be the
mean of the 5th and 6th observations.
$$\text{Median of given data} = \frac{75 + 78}{2} = 76.5$$
Now, if x points are added to all of his ratings, the median becomes 76.5 + x.
And, for conversion to y points, we have to multiply all the observations by $\frac{y}{100}$. Hence,
the median for converted data $= (76.5 + x) \times \frac{y}{100}$.
Therefore, option b is correct.
Suppose we substitute the values of x and y as 3 and 40 respectively.
There are 10 observations in the data. So, the median of the given data will be the
mean of the 5th and 6th observations.
$$\text{Median of given data} = \frac{75 + 78}{2} = 76.5$$
Now, if 3 points are added to all of his ratings, the median becomes 76.5 + 3 = 79.5.
And, for conversion to 40 points, we have to multiply all the observations by $\frac{40}{100}$.
Hence, the median for converted data $= (76.5 + 3) \times \frac{40}{100} = 31.8$.
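The claim that the median can be transformed directly, rather than transforming every rating first, can be verified on the stem-and-leaf data. In this sketch the dictionary encoding of the plot and the `rescale` helper are assumptions made for illustration. The shortcut relies on the conversion being increasing (a positive new maximum), so the order of the ratings is preserved.

```python
import statistics

# Stem-and-leaf plot from Figure 3.1.G, encoded as stem -> leaves.
stem_and_leaf = {5: [3, 9], 7: [2, 2, 5, 8], 8: [7, 7, 7], 9: [9]}
ratings = [10 * stem + leaf for stem, leaves in stem_and_leaf.items() for leaf in leaves]


def rescale(rating, added_points, new_maximum):
    """Add points to a rating out of 100, then convert to a new maximum."""
    return (rating + added_points) * new_maximum / 100


median_original = statistics.median(ratings)

# Route 1: transform every rating, then take the median.
converted_ratings = [rescale(r, 3, 40) for r in ratings]
median_converted = statistics.median(converted_ratings)

# Route 2: transform the median directly.
median_shortcut = rescale(median_original, 3, 40)

print(median_original, round(median_converted, 2), round(median_shortcut, 2))
# prints: 76.5 31.8 31.8
```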