Chapter 2: Descriptive Statistics
2.2: Histograms
Graphics for Quantitative Data
Histogram - a graph consisting of bars of equal width drawn adjacent to each other (the bars
touch). The horizontal scale represents classes of quantitative data values, and the vertical scale
represents frequencies. The height of each bar corresponds to the frequency of its class.
Relative Frequency Histogram - a graph with the same shape and horizontal scale as a
histogram, but the vertical scale is marked with relative frequencies (proportions or percents).
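The two definitions above can be sketched in a few lines of Python. This is only an illustration: the data values and the class width of 20 are assumptions chosen for the sketch, and each class is assumed to include its lower limit but not its upper limit.

```python
# Illustrative data (an assumption for this sketch): daily player counts
counts = [10, 90, 30, 20, 50, 30, 60, 40, 70, 40, 30, 60, 80, 20]

# Assumed classes: equal width of 20, lower limit included, upper limit excluded
class_width = 20
lower_limits = range(0, 100, class_width)

# Frequency = number of data values that fall in each class
frequencies = {}
for low in lower_limits:
    high = low + class_width
    frequencies[(low, high)] = sum(low <= x < high for x in counts)

# Relative frequency = class frequency divided by the total number of values
n = len(counts)
relative = {cls: f / n for cls, f in frequencies.items()}

# Crude text histogram: the bar length plays the role of the bar height
for (low, high), f in frequencies.items():
    bar = "#" * f
    print(f"{low:>3} to <{high:<3} | {bar:<6} freq={f}  rel={relative[(low, high)]:.2f}")
```

Both versions have the same shape; only the vertical scale changes, and the relative frequencies add up to 1 (100%).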
CC By Creative Attribution 4.0 International License
2.3, 2.4: Measures of the Location of the Data, Boxplots
Median – a number that measures the “center” of the data. It is the middle value when the
original data values are arranged in order of increasing (or decreasing) magnitude.
If the number of data values is odd, the median is the number located in the exact
middle of the list.
If the number of data values is even, the median is found by computing the mean of the
two middle numbers.
Example 1: Determine the median of each of the following sets of numbers.
(a) 132, 162, 133, 145, 148, 139, 147, 150, 153
(b) 40, 38, 42, 39, 43, 39
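One way to check work on Example 1 is a short Python sketch of the odd/even rule described above. The function name `median_by_hand` is made up for this illustration; the standard library's `statistics.median` is used only as a cross-check.

```python
from statistics import median

def median_by_hand(values):
    """Median via the odd/even rule: sort, then take the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value
        return ordered[mid]
    # Even count: the mean of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

set_a = [132, 162, 133, 145, 148, 139, 147, 150, 153]  # 9 values (odd)
set_b = [40, 38, 42, 39, 43, 39]                       # 6 values (even)

print(median_by_hand(set_a))  # 147
print(median_by_hand(set_b))  # 39.5
```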
Common measures of location are Quartiles and Percentiles. Quartiles are special percentiles.
The first quartile, 𝑄1, is the same as the 25th percentile, and the third quartile, 𝑄3 , is the same as
the 75th percentile. The median, M, is called both the second quartile, 𝑄2 , and the 50th
percentile.
Quartiles are the numbers that separate the ordered data into quarters (four equal parts).
The 25th, 50th, and 75th percentiles are denoted by 𝑄1, 𝑄2, 𝑄3, respectively.
First quartile Q1 = 25th percentile (P25) = the number that
25% of scores fall below and 75% of scores fall
above.
Second quartile Q2 = 50th percentile (P50) = the
number that 50% of scores fall below and 50% of
scores fall above. Known as the median.
Third quartile Q3 = 75th percentile (P75) = the number that
75% of scores fall below and 25% of scores fall above.
Interquartile Range – a number that indicates the spread of the middle 50% of the data.
It is the difference between the third quartile 𝑄3 and the first quartile 𝑄1.
𝑰𝑸𝑹 = 𝑸𝟑 −𝑸𝟏
Example 2: In a park that has several basketball courts, a student counts the number of players
playing basketball each day over a two-week period and records the following data.
10, 90, 30, 20, 50, 30, 60, 40, 70, 40, 30, 60, 80, 20
a.) Find and interpret the first, second and third quartiles for the number of basketball players
on the courts over the two-week period.
b.) Determine and interpret the interquartile range.
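A possible way to check Example 2 is sketched below. It assumes the "median of each half" method for quartiles (split the ordered data at the median, then take the median of the lower and upper halves). Calculators and software may use slightly different quartile rules, so their answers can differ a little. The function name `quartiles` is hypothetical.

```python
from statistics import median

def quartiles(values):
    """Q1, Q2, Q3 by the median-of-halves method (an assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd count, the middle value itself is left out of both halves
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)

players = [10, 90, 30, 20, 50, 30, 60, 40, 70, 40, 30, 60, 80, 20]

q1, q2, q3 = quartiles(players)
iqr = q3 - q1  # spread of the middle 50% of the data

print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}, IQR = {iqr}")
```

Under this convention the sketch gives Q1 = 30, Q2 = 40, Q3 = 60, and IQR = 30. One reading: on about a quarter of the days there were 30 or fewer players, and the middle half of the daily counts spans 30 players.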
5-Number Summary: a set of five values that summarizes the data: the minimum value, the first
quartile, the median, the third quartile, and the maximum value.
Boxplots - also called box-and-whisker plots; constructed from the 5-number-summary;
shows how far extreme values are from the bulk of the data.
Strengths: give a direct look at location and spread; outliers identified; great for
comparing.
Weaknesses: symmetry and skewness can be judged, but finer details of shape (for example,
the number of peaks) cannot.
Outlier – a data point that is not consistent with the bulk of the data from that group.
Example 3: Find the 5-number summary and construct a boxplot on the number of players
playing basketball each day over a two-week period.
10, 90, 30, 20, 50, 30, 60, 40, 70, 40, 30, 60, 80, 20
[Blank axis for drawing the boxplot: number of players, 0 to 100 in steps of 10]
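The 5-number summary for Example 3 can be checked with the sketch below, which also prints a rough text version of the boxplot on a 0 to 100 axis. It assumes the median-of-halves rule for quartiles; the helper names `five_number_summary` and `text_boxplot` are invented for this illustration, and the text plot is only a stand-in for a hand-drawn one.

```python
from statistics import median

def five_number_summary(values):
    """(min, Q1, median, Q3, max) using the median-of-halves rule (assumed)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

def text_boxplot(summary, scale=2, axis_max=100):
    """One character per `scale` units: '-' whiskers, '=' box, '|' median."""
    lo, q1, med, q3, hi = (round(v / scale) for v in summary)
    cells = [" "] * (axis_max // scale + 1)
    for i in range(lo, hi + 1):
        cells[i] = "-"   # whiskers run from the minimum to the maximum
    for i in range(q1, q3 + 1):
        cells[i] = "="   # the box runs from Q1 to Q3
    cells[med] = "|"     # the median is marked inside the box
    return "".join(cells)

players = [10, 90, 30, 20, 50, 30, 60, 40, 70, 40, 30, 60, 80, 20]
summary = five_number_summary(players)

print(summary)
print(text_boxplot(summary))
```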
2.5: Measure of the Center of the Data
A measure of center is a value at the center (or middle) of a data set of numbers.
The three common measures of center are the mean, median, and mode.
The mean (average) of a data set is found by adding all numbers in the data set and
then dividing by the number of values in the set.
The median is the middle value when a data set is ordered from least to greatest.
The mode is the number that occurs most often in a data set.
Is 𝑥̅ a parameter or a statistic?
Is 𝜇 a parameter or a statistic?
Example 4: Use Table 3
a.) Compute the population mean and median of student
test scores.
b.) Take a sample from this class by randomly
selecting four students, then compute the
sample mean.
c.) Now, suppose we replace the lowest test score with 0. Find
the new population mean and median.
d) Which measure of center (mean or median) is more appropriate to use from (c) above?
The median is barely affected by an outlier, whereas the mean is pulled in the
direction of the outlier.
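The effect described above can be seen in a small sketch. Table 3 is not reproduced here, so the test scores below are hypothetical values made up for illustration only.

```python
from statistics import mean, median

# Hypothetical test scores (NOT the values from Table 3)
scores = [70, 75, 80, 85, 90]

# Replace the lowest score with 0 to create an outlier
altered = sorted(scores)
altered[0] = 0

print("original:", mean(scores), median(scores))
print("altered: ", mean(altered), median(altered))

# How far each measure moved after the change
mean_shift = mean(scores) - mean(altered)
median_shift = median(scores) - median(altered)
print("mean moved by", mean_shift, "; median moved by", median_shift)
```

With these made-up scores the mean drops from 80 to 66 while the median stays at 80, which is why the median is usually the better summary when an outlier is present.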
Resistant - a measure (statistic) is said to be resistant if extreme values (very large or very
small relative to the rest of the data) do not affect its value substantially.
Which measure of center is more resistant to outliers – mean or median?
Which measure of center is more appropriate to use when outliers are present?
Example 5:
1. Sam has 20 rose bushes, but only counted the flowers on 6 of them.
Sam's flower counts are: 9, 2, 5, 4, 12, 7
a.) Find the mean and include the unit of measure.
b.) Find the median and include the unit of measure.
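A quick check for Example 5, sketched with Python's standard `statistics` module. The unit label "flowers per bush" is our reading of the problem, not something stated in it.

```python
from statistics import mean, median

# Flower counts on the 6 bushes Sam actually counted (a sample of the 20 bushes)
flower_counts = [9, 2, 5, 4, 12, 7]

sample_mean = mean(flower_counts)      # sum of the counts divided by 6
sample_median = median(flower_counts)  # mean of the two middle ordered values

print(f"mean   = {sample_mean} flowers per bush")
print(f"median = {sample_median} flowers per bush")
```

Because only 6 of the 20 bushes were counted, these values describe a sample, so the mean here is x̄ rather than μ.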
2.6: Skewness and the Mean, Median, and Mode
Shape of a Distribution
Bell-shaped Distribution – the highest frequency occurs in the middle and frequencies tail off
to the left and the right of the middle. This shape will be used in our study of continuous
probability distributions and will be known as a normal curve.
Right-Skewed (positively skewed) – The tail extends to the right of the peak longer than to the
left.
There are extreme values (outliers) to the right.
Left-Skewed (negatively skewed) – The tail extends to the left of the peak longer than to the
right.
There are extreme values (outliers) to the left.
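A small sketch of how the mean and median tend to behave in skewed data. The two data sets are made up for illustration, and the pattern shown (mean pulled toward the long tail) is typical but not guaranteed for every skewed data set.

```python
from statistics import mean, median

# Made-up right-skewed data: most values are small, one extreme value on the right
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]

# Mirror image of the same data: left-skewed, extreme value on the left
left_skewed = [21 - x for x in right_skewed]

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(f"{name}: mean = {mean(data)}, median = {median(data)}")
```

In the right-skewed set the mean (4.75) sits above the median (3); in the mirrored left-skewed set the mean (16.25) sits below the median (18).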
2.7: Measures of the Spread of the Data
Variation - the degree to which the data is spread out.
Are the data values in our set of data concentrated closely near the mean (the center) OR are
the data values more widely spread out from the mean (the center)? To answer this question,
we need to learn about the most common measure of variation, or spread, which is the
standard deviation.
Standard deviation – a number that measures how far data values are from their mean (the center).
The standard deviation provides a measure of the overall variation in a data set.
Range – the difference between the maximum data value and the minimum data value.
Range = max value – min value
Population Standard Deviation: $\sigma = \sqrt{\dfrac{\sum (x_i - \mu)^2}{N}}$
Sample Standard Deviation: $s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$
Standard Deviation (Properties)
The standard deviation is a measure of how much data values deviate from the mean.
The value of the standard deviation can never be negative. It is positive or zero.
It is zero if all the values are exactly the same.
Larger values of standard deviation indicate greater amounts of variation.
Outlier(s) can drastically change the value of the standard deviation since it is
not a resistant measure.
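The properties listed above can be illustrated with a short sketch. All three data sets are made up for this purpose, and the sample standard deviation (`statistics.stdev`) is used throughout.

```python
from statistics import stdev

constant = [7, 7, 7, 7]                # every value is the same
typical = [10, 12, 11, 13, 12]         # values close to their mean
with_outlier = typical + [50]          # same data plus one extreme value

sd_constant = stdev(constant)
sd_typical = stdev(typical)
sd_outlier = stdev(with_outlier)

print(f"all values equal:  s = {sd_constant}")
print(f"typical data:      s = {sd_typical:.2f}")
print(f"with one outlier:  s = {sd_outlier:.2f}")
```

The first result is exactly zero, neither of the others can be negative, and the single outlier makes the standard deviation many times larger.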
Variance – the average of the squared deviations from the mean (dividing by n − 1 instead of n
for a sample); it is the square of the standard deviation.
Population Variance: $\sigma^2 = \dfrac{\sum (x_i - \mu)^2}{N}$
Sample Variance: $s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$
Relationship between standard deviation and variance
variance = (standard deviation)²
standard deviation = √variance
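To see the population and sample versions side by side, here is a sketch using Python's `statistics` module (`pvariance`/`pstdev` divide by N, `variance`/`stdev` divide by n − 1). The data set is made up for illustration.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values with mean 5

# Treating the data as a whole population: divide by N
pop_var = pvariance(data)
pop_sd = pstdev(data)

# Treating the same data as a sample: divide by n - 1
samp_var = variance(data)
samp_sd = stdev(data)

print(f"population: variance = {pop_var}, standard deviation = {pop_sd}")
print(f"sample:     variance = {samp_var:.3f}, standard deviation = {samp_sd:.3f}")

# In both cases the variance is the square of the standard deviation
print(math.isclose(pop_sd ** 2, pop_var), math.isclose(samp_sd ** 2, samp_var))
```

Dividing by n − 1 instead of N makes the sample values slightly larger than the population values for the same numbers.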
Example 6:
(a) If σ = 9, find σ².
(b) If s² = 36, find s.
(c) If s = 4, find s².
(d) If σ² = 49, find σ.
How to calculate the standard deviation for a sample:
1. Calculate the mean of the numbers in the data set.
2. Subtract the mean from each data point (x), then square the result.
3. Sum the squared differences.
4. Divide the sum by n − 1, then take the square root of the result.
Calculate the standard deviation:
The following are the home run totals of a sample of five players who played at least half of the
games for the 2017 World Champions Houston Astros:
24, 16, 24, 18, 13
Sample Standard Deviation: $s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$
Find the standard deviation. Round to two decimal places.
𝑥        (𝑥 − 𝑥̅)        (𝑥 − 𝑥̅)²
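The table for the home run data can be filled in by following the four steps for a sample standard deviation. The sketch below does this in plain Python (no statistics library) so that each step is visible; the variable names are our own.

```python
import math

home_runs = [24, 16, 24, 18, 13]
n = len(home_runs)

# Step 1: calculate the mean
mean = sum(home_runs) / n

# Step 2: subtract the mean from each value, then square the result
rows = [(x, x - mean, (x - mean) ** 2) for x in home_runs]

# Step 3: sum the squared differences
sum_sq = sum(sq for _, _, sq in rows)

# Step 4: divide by n - 1, then take the square root
s = math.sqrt(sum_sq / (n - 1))

print(f"{'x':>4} {'x - mean':>10} {'(x - mean)^2':>14}")
for x, dev, sq in rows:
    print(f"{x:>4} {dev:>10.1f} {sq:>14.1f}")
print(f"mean = {mean}, sum of squares = {sum_sq}, s = {s:.2f}")
```

The mean is 19 home runs, the squared differences add up to 96, and s = √(96/4) = √24, which is about 4.90 home runs.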
Example 7:
You try: Given the following sample data, answer the following questions:
2 5 4 9 4 8 10
Find the variance.
Find the standard deviation.
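After working Example 7 by hand, the answers can be checked with Python's `statistics` module. Since the data are described as sample data, the sketch uses the sample versions (`variance` and `stdev`, which divide by n − 1).

```python
from statistics import mean, stdev, variance

sample = [2, 5, 4, 9, 4, 8, 10]

x_bar = mean(sample)     # sample mean
s_sq = variance(sample)  # sample variance: sum of squared deviations / (n - 1)
s = stdev(sample)        # sample standard deviation: square root of the variance

print(f"mean = {x_bar}, variance = {s_sq}, standard deviation = {s}")
```

Here the mean is 6, the squared deviations add up to 54, so the variance is 54/6 = 9 and the standard deviation is 3.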
Example 8: Identify the symbol for each of the following.
a) the sample mean
b) the population mean
c) the sample standard deviation
d) the population standard deviation
e) the sample variance
f) the population variance
g) the sample size
Z-score – the number of standard deviations that a given data value x is above or below the
mean. The z-score is unitless. When every value in a data set is converted to a z-score, the
resulting z-scores have a mean of 0 and a standard deviation of 1.
Z-score (also called Standardized Score)
It represents the number of standard deviations a data point is above or below the mean.
If it is positive, then it is above the mean.
If it is negative, then it is below the mean.
It is a standardized measurement since it is in terms of standard deviation.
The z-score has no unit of measurement.
Z-Score Formula
Population: $z = \dfrac{x - \mu}{\sigma}$     Sample: $z = \dfrac{x - \bar{x}}{s}$
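A z-score is computed as (data value − mean) ÷ (standard deviation). The sketch below standardizes every value of a small sample (the values from Example 7 are reused) and checks that the resulting z-scores have mean 0 and standard deviation 1. Using the sample standard deviation here is a choice made for this illustration.

```python
import math
from statistics import mean, stdev

data = [2, 5, 4, 9, 4, 8, 10]

x_bar = mean(data)  # sample mean
s = stdev(data)     # sample standard deviation

# Standardize: how many standard deviations each value lies from the mean
z_scores = [(x - x_bar) / s for x in data]

for x, z in zip(data, z_scores):
    side = "above" if z > 0 else "below"
    print(f"x = {x:>2}: z = {z:+.2f} ({side} the mean)")

# The z-scores themselves have mean 0 and standard deviation 1 (up to rounding)
print(math.isclose(mean(z_scores), 0, abs_tol=1e-9), math.isclose(stdev(z_scores), 1))
```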
Discovery:
Algebra quiz: Mean = 90 St. dev = 10
Find the z scores for the following students’ grades:
June: 80
Dahn: 95
Linh: 73
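The three quiz grades can be converted to z-scores with a one-line function. The function name `z_score` is hypothetical; the mean of 90 and standard deviation of 10 come from the quiz above.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

quiz_mean, quiz_sd = 90, 10
grades = {"June": 80, "Dahn": 95, "Linh": 73}

z_by_student = {name: z_score(g, quiz_mean, quiz_sd) for name, g in grades.items()}

for name, z in z_by_student.items():
    print(f"{name}: z = {z:+.1f}")
```

June is 1 standard deviation below the mean (z = −1.0), Dahn is half a standard deviation above (z = +0.5), and Linh is 1.7 standard deviations below (z = −1.7).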
Compare…using z-scores. On which test did I do “relatively” better?
             History Test     Math Test
Mean              92              80
St. Dev            3               5
My Score          95              90
A higher z-score indicates a “relatively better” outcome (when a higher score is better).
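As a worked check of the History and Math comparison, each score is converted to a z-score using that test's own mean and standard deviation:

```latex
z_{\text{History}} = \frac{95 - 92}{3} = 1
\qquad
z_{\text{Math}} = \frac{90 - 80}{5} = 2
```

The History score is the higher raw score, but it is only 1 standard deviation above its class mean, while the Math score is 2 standard deviations above its class mean. Relative to the class, the Math result is the better one.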
You try:
Alex got an 82 on the first exam and an 84 on the second exam. The class averaged a 79 on
the first exam with a standard deviation of 3, while they averaged a 79 on the second exam with
a standard deviation of 5.
On which exam did Alex do better?
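The Alex question can be checked the same way. The sketch below compares the two z-scores and allows for a tie; the function name and the wording of the verdict are our own.

```python
import math

def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_exam1 = z_score(82, 79, 3)  # first exam: class mean 79, standard deviation 3
z_exam2 = z_score(84, 79, 5)  # second exam: class mean 79, standard deviation 5

if math.isclose(z_exam1, z_exam2):
    verdict = "equally well on both exams, relative to the class"
elif z_exam1 > z_exam2:
    verdict = "relatively better on the first exam"
else:
    verdict = "relatively better on the second exam"

print(f"z1 = {z_exam1}, z2 = {z_exam2}: Alex did {verdict}")
```

Both z-scores equal 1, so even though 84 is the higher raw score, Alex's standing relative to the class is the same on both exams.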
Key Takeaways:
The median, quartiles, and IQR are RESISTANT (robust) measures. They resist being pulled
toward an extreme value (outlier).
The mean, range, standard deviation, and variance are NOT RESISTANT (not robust)
measures. They are affected by extreme values: the mean is pulled toward outliers in the
right or left tail, and the range, standard deviation, and variance are inflated by them.
Relationship between Quartiles and Percentiles
Units of measure
The z-score has no unit of measurement.
Variance has squared units (points squared, inches squared, home runs squared).
All other measures have the same units as the data (points, inches, home runs).
Z-score – the number of standard deviations that a data value 𝑥 is above or below the mean.
Also called standardized score or standardized value
Question: Who did relatively better?
Answer: The higher z-score
Measures of center: µ, 𝑥̅, mean, median, mode
Measures of spread: variance, standard deviation, range, IQR, σ, 𝝈𝟐 , 𝒔𝟐 , s