Describing Data
Visual Information
Khris Griffis, Ph.D.
Lecture 06
CSULA: ME3040
Today's Objectives
🎯 Describe the shape, central tendency, and variability of data distributions
using statistical measures.
🎯 Compare measures of spread, such as range, interquartile range, variance,
and standard deviation.
🎯 Recognize data symmetry, skewness, and understand their impact on
interpreting data.
🎯 Key Focus: Using statistical measures to describe and interpret data
distributions.
1 Descriptive Statistics
Summarizing Data
Describing A Dataset
When describing a dataset, we generally consider the following three questions:
What is the general shape of the data?
Where are the data values centered?
How do the data vary?
These are all aspects of what we call the
distribution of the data.
We use some simple arithmetic, which depends (a little) on what the data distribution represents,
to describe these aspects of the data.
Shape
General Shape Of A Distribution
A symmetric distribution is one in which the left and right hand sides of the distribution are
roughly equally balanced.
A skewed (asymmetric) distribution is one in which there is no such equal balance. Right-skewness
refers to a longer right tail, while left-skewness corresponds to a longer left tail.
A uniform distribution is a specific symmetric distribution in which all outcomes are equally
likely.
Example: Distributions
[Figures: Internet access in the world; Time between Old Faithful geyser eruptions]
Central Tendency
The Mean
The arithmetic mean is generally calculated as:
$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$
Calculation:
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\text{sum of all the observations}}{\text{number of observations}}$$
Read: "ex-bar".
We can generalize for unequal weights.
Let $w(x|a) = \sum_{i}^{n} (x_i == a)$ be an indicator (count) of value $a$, and let $\sum_{b \in S_x} w(b|S_x)$ be the sum of all possible weights in the set, $S_x$.
Then $p(x) = \frac{w(x)}{\sum_{b \in S_x} w(b)}$ is the weight of value $x$.
Thus, the weighted mean of random variable $X$ is:
$$\bar{x}_w = \sum_{i=1}^{N} x_i \, p(x_i)$$
We can see that $\bar{x}$ is the weighted sum of a random variable when all values carry equal weights ($\frac{1}{N}$).
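A minimal MATLAB sketch of the weighted mean, assuming a small made-up sample `x` (not from the lecture). Here the weights are the relative frequencies of each distinct value and the sum runs over those distinct values, so the weighted mean should agree with the ordinary mean.

```matlab
% Hypothetical sample, chosen only for illustration.
x = [2 2 3 5 5 5];

% Arithmetic mean: every observation carries weight 1/N.
xbar = sum(x) / numel(x);

% Weighted mean: count each distinct value, then normalize counts to weights.
[vals, ~, idx] = unique(x);
w = accumarray(idx(:), 1)';     % w(a): number of observations equal to a
p = w / sum(w);                 % p(a): weight of value a (weights sum to 1)
xbar_w = sum(vals(:)' .* p);    % sum over distinct values of a * p(a)

fprintf('mean = %.4f, weighted mean = %.4f\n', xbar, xbar_w);
```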
The Median
The median of an ordered set of data values is:
The average of the middle 2 values (for an even number of entries): mean of the $(\frac{n}{2})^{th}$ and $(\frac{n}{2} + 1)^{th}$ observations
The middle entry (for an odd number of entries): the $(\frac{n+1}{2})^{th}$ observation
The Mode
The mode is the value that occurs most often in the dataset. If no value in the dataset is repeated, then there
is no mode.
What is the mode in the dataset below?
4 5 9 5 11 7 5 3 7 8 6 5 12
Note: In a bimodal distribution, the taller peak is
called the major mode and the shorter one is
referred to as the minor mode.
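The three measures of central tendency can be checked on the dataset from the mode question with a short MATLAB sketch. The variable names are illustrative choices; `mean`, `median`, and `mode` are built-in functions.

```matlab
x = [4 5 9 5 11 7 5 3 7 8 6 5 12];   % dataset from the mode question

x_mean   = mean(x);     % sum of observations / number of observations
x_median = median(x);   % middle entry of the sorted data (n = 13 is odd)
x_mode   = mode(x);     % most frequently occurring value

fprintf('mean = %.2f, median = %g, mode = %g\n', x_mean, x_median, x_mode);
```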
Robustness
Robustness is related to the impact of outliers on a statistic. In general, we say that a statistic is
robust if it is relatively unaffected by extreme values.
The median and the mode are robust, while the mean is not.
Data: 2.5 3.2 3.5 3.9 4.0 4.4 5.3 5.9 6.1 6.4 (optionally followed by the outlier 30)
Without the outlier: Mean: 4.52, Median: 4.2
With the outlier: Mean: 6.84, Median: 4.4
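A short MATLAB sketch that reproduces the numbers on this slide: appending the single extreme value 30 moves the mean a lot and the median very little. Variable names are illustrative.

```matlab
x     = [2.5 3.2 3.5 3.9 4.0 4.4 5.3 5.9 6.1 6.4];   % data without the outlier
x_out = [x 30];                                       % same data plus the outlier

fprintf('without outlier: mean = %.2f, median = %.1f\n', mean(x), median(x));
fprintf('with outlier:    mean = %.2f, median = %.1f\n', mean(x_out), median(x_out));
```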
Pick The Central Tendency Descriptor
Wisely
Number of children per household in China
(2012)
Mean: 1.55
Median: 1
More representative of the "typical" 2012
family (One Child Policy)
Averages Are Not “Truth”
In 1943, artist Abram Belskie and
obstetrician-gynecologist Robert Latou
Dickinson created sculptures of the
“average” American man and woman.
[Figure: the sculptures Norman and Norma]
Dickinson averaged measurements from
15,000 men and women between the ages
of 21 and 25 to make idealized sculptures of
beauty.
In 1945, Cleveland Health Museum sought
to find a woman who matched Norma's
measurements. Of 4,000 entries only a
handful were similar, and an award of $100
was given to a waitress, who just kind of
matched.
Read Todd Rose's “Flaw of Averages” for more
on this.
Central Variability:
"Spread" or "Variation" of Data Points
Variance and Standard Deviation
A common measure of data variability is the variance, which measures the squared distance each
data point is from the mean, on average.
Variance: $s^2$ for a sample, $\sigma^2$ for a population.
$$s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}$$
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}$$
$$s^2 = \frac{\text{sum of observed squared distances from sample mean}}{\text{number of observations} - 1}$$
For standard deviation, we use: $s = \sqrt{s^2}$ and $\sigma = \sqrt{\sigma^2}$.
The standard deviation puts the variance into the same units as the data, providing a measure of
how far a typical data point lies from the center.
Note: The use of 𝑛 − 1 in calculating variance helps to ensure that our estimate of the population variance is
unbiased and accounts for the extra uncertainty introduced by estimating the population mean from the
sample itself.
Bias
Sampling bias: when a sampling method
systematically yields results that are either too
high or too low.
🚨 May be avoided by using good sampling
technique (randomized).
Sampling variation: natural variation in results
from one random sample to the next.
May be reduced by using a larger sample.
Standard Deviation
Let's consider 2 sets of data; both have a mean of 100.
Numbers                    Mean   SD
100, 100, 100, 100, 100    100    0
90, 90, 100, 110, 110      100    10
Set 1: all values are equal to the mean, so there is no variability at all
Set 2: one value equals the mean and the other four values are 10 points away from the mean.
So the average distance away from the mean is about 10
Example, procedure for Set 2:
1. Calculate the sample mean: $\bar{x} = 100$
2. Calculate the difference between each value and the mean: [-10, -10, 0, 10, 10]
3. Square each difference in step 2: [100, 100, 0, 100, 100]
4. Sum the results of step 3 and divide by $(n - 1)$: $s^2 = \frac{400}{5 - 1} = \frac{400}{4} = 100$
5. Take the square root of the result in step 4: $s = \sqrt{100} = 10$
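The five steps for Set 2 can be sketched in MATLAB as follows. The variable names are illustrative, and the built-in `std` (which also divides by n − 1 by default) is printed only as a cross-check.

```matlab
x = [90 90 100 110 110];    % Set 2

n    = numel(x);
xbar = sum(x) / n;          % step 1: sample mean
d    = x - xbar;            % step 2: differences from the mean
d2   = d .^ 2;              % step 3: squared differences
s2   = sum(d2) / (n - 1);   % step 4: sample variance
s    = sqrt(s2);            % step 5: sample standard deviation

fprintf('s^2 = %g, s = %g (built-in std: %g)\n', s2, s, std(x));
```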
Use SD With Caution
Like the mean, the standard deviation does not cope well with skewed
distributions.
Why Are We Squaring Things?
$$s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}$$
Variance is based on the sum of the squared distances from each data point to the model (here, the mean).
Squaring the distances ensures that they are never negative, so deviations above and below the mean cannot cancel.
Squaring also gives disproportionately greater weight to large deviations than to small ones,
which taking the square root afterwards doesn't undo.
The Normal Distribution
The normal (or Gaussian) distribution is a continuous
probability distribution characterized by a symmetric, bell-shaped
curve.
PDF:
$$f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}$$
$\mu$ = Central Tendency ($E[X]$)
$\sigma$ = Spread ($\sqrt{E[(X - E[X])^2]}$)
$x$ = Specific value of the continuous variable
68% of the observations lie within 1 standard "distance" of the center
95% lie within 1.96 standard "distance" of the center
99% lie within 2.58 standard "distance" of the center
Often denoted $N(\mu, \sigma^2)$, the normal distribution is special because of the Central Limit Theorem:
suitably standardized sums (or means) of many independent, identically distributed variables with finite
variance tend toward this form, regardless of their initial distributions.
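The 68%, 95%, and 99% figures can be checked numerically. The sketch below relies on the standard identity P(|X − μ| < zσ) = erf(z/√2) for a normal variable and on MATLAB's built-in `erf`; the variable names are illustrative.

```matlab
z = [1 1.96 2.58];              % number of standard deviations from the center
coverage = erf(z / sqrt(2));    % probability of falling within z*sigma of mu

for i = 1:numel(z)
    fprintf('within %.2f sigma: %.2f%%\n', z(i), 100 * coverage(i));
end
```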
Let's Get MAD
MAD(𝑥) = median(|𝑥𝑖 − median(𝑥)|)
The median absolute deviation (MAD) is a measure of the variability of a dataset. It is calculated
by taking the median of the absolute differences between each data point and the median of the
dataset.
MAD is a robust measure of variability, meaning it is less affected by outliers than measures
such as the variance and standard deviation.
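A minimal MATLAB sketch of the MAD formula, reusing the robustness data (with and without the outlier 30) as an illustration. The helper name `mad_of` is hypothetical; the formula is written out with `median` and `abs` rather than relying on a toolbox function.

```matlab
x     = [2.5 3.2 3.5 3.9 4.0 4.4 5.3 5.9 6.1 6.4];   % robustness data
x_out = [x 30];                                       % same data plus the outlier

% MAD(x) = median(|x_i - median(x)|)
mad_of = @(v) median(abs(v - median(v)));

fprintf('without outlier: MAD = %.3f, SD = %.3f\n', mad_of(x), std(x));
fprintf('with outlier:    MAD = %.3f, SD = %.3f\n', mad_of(x_out), std(x_out));
```

The MAD changes only slightly when the outlier is added, while the standard deviation grows several-fold.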
Coefficient of Variation (CV)
Introduced by Karl Pearson to compare relative variability
of different datasets, in an attempt to mitigate confusion
in interpreting standard deviation.
Mathematically, it is defined as:
$$\text{CV} = \frac{\text{standard deviation}}{\text{mean}} \times 100\%$$
Uses:
Neuroscience: Comparing variance-to-mean (Fano Factor) in spike
counts.
Engineering: Assessing the uniformity of processes or materials.
The CV is “crude”, subject to the same
problems as the mean, and should be
used with caution.
Note: CV is dimensionless, but it is often reported as a
percentage.
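A small MATLAB sketch of the CV formula, assuming Set 2 from the standard deviation example as data. The anonymous function `cv` and the factor 1000 (standing in for a change of units) are illustrative; the rescaling shows that the CV does not depend on the units of the data.

```matlab
cv = @(v) std(v) / mean(v) * 100;   % coefficient of variation, in percent

x = [90 90 100 110 110];            % Set 2 from the standard deviation example
x_rescaled = 1000 * x;              % same data expressed in smaller units

fprintf('CV = %.1f%%, CV after rescaling = %.1f%%\n', cv(x), cv(x_rescaled));
```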
2 Extended Measures Of Spread
Ways To Describe Intervals And Ranges
The Range
The range gives you the most basic information about the spread of a dataset. It is calculated by the
(arithmetic) difference between the lowest and highest data value.
Percentiles
The $k^{th}$ percentile is a value in the dataset that has $k\%$ of the data values at or below it and
$(100 - k)\%$ of the data values at or above it.
Note:
Here our dataset contains 40 data points. So each data point corresponds to $\frac{100}{40} = 2.5\%$ of the data.
A "quantile" is simply the fractional position, i.e., $\frac{k}{100}$.
Understanding Quantiles
Quantiles are values below which a specified proportion of the data falls.
The quantile function is generally defined as: $Q_i(p) = (1 - \gamma) x_j + \gamma x_{j+1}$,
where $x_j$ is the $j$-th smallest data value, $j = \lfloor k \rfloor$ for $k = np + m$, and $\gamma = k - j$.
For quantile method 8 (median-unbiased), the quantile function is defined as
$Q_8(p) = (1 - \gamma) x_j + \gamma x_{j+1}$,
where $j = \lfloor k \rfloor$, $m = \frac{p+1}{3}$, $k = np + m$, and $\gamma = k - j$.
Reference: Hyndman and Fan (1996) for a detailed explanation of quantile types and their applications.
Given dataset:
[−0.977, −0.151, −0.103, 0.4, 0.411, 0.95, 0.979, 1.764, 1.868, 2.241]
✔︎ $n = 10$
✔︎ $p = 0.5$
✔︎ $m = \frac{p+1}{3} = \frac{0.50 + 1}{3} = \frac{1.5}{3} = 0.5$
✔︎ $k = np + m = 10 \times 0.50 + 0.5 = 5.5$
✔︎ $j = \lfloor k \rfloor = \lfloor 5.5 \rfloor = 5$
✔︎ $\gamma = k - j = 5.5 - 5 = 0.5$
Calculating the median from $Q_8(0.5)$:
$Q_8(0.5) = (1 - \gamma) x_5 + \gamma x_{5+1} = (1 - 0.5)\,0.411 + (0.5)\,0.95 = 0.6805$
Calculating the median by taking the mean of the middle 2 elements:
$\text{median} = \frac{0.411 + 0.95}{2} = 0.6805$
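The worked example can be followed step by step in MATLAB. This is a sketch of the method 8 formula only: it assumes 1 ≤ j < n and does not handle quantiles so close to 0 or 1 that j falls outside the data. Variable names mirror the symbols above, with `g` standing in for γ.

```matlab
x = sort([-0.977, -0.151, -0.103, 0.4, 0.411, 0.95, 0.979, 1.764, 1.868, 2.241]);
n = numel(x);
p = 0.5;                 % requested quantile (the median)

m = (p + 1) / 3;         % method 8 offset
k = n * p + m;           % fractional position in the sorted data
j = floor(k);            % index of the lower neighbor
g = k - j;               % gamma: interpolation weight

Q8 = (1 - g) * x(j) + g * x(j + 1);
fprintf('k = %.1f, j = %d, gamma = %.1f, Q8(%.1f) = %.4f\n', k, j, g, p, Q8);
```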
Interquartile Range
The median divides the data into two equal halves (it is the 50𝑡ℎ percentile). If we divide each of those
halves again, we obtain two additional statistics known as the first (Q1) and third (Q3) quartiles, which are
the 25𝑡ℎ and 75𝑡ℎ percentiles.
Interquartile range: $\text{IQR} = Q_3 - Q_1$
A value is considered an outlier if it is:
Smaller than 𝑄1 − 1.5 × 𝐼𝑄𝑅
or
Larger than 𝑄3 + 1.5 × 𝐼𝑄𝑅
MATLAB Code
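data = [-0.977, -0.151, -0.103, 0.4, 0.411, 0.95, 0.979, 1.764, 1.868, 2.241];
x = sort(data);
n = numel(x);
% Method 8 quantile written out explicitly, because a numeric 'method' option is
% not accepted by every quantile implementation (GNU Octave takes it positionally:
% quantile(x, p, dim, 8)). Interpolate at position k = n*p + (p+1)/3.
q8 = @(p) interp1(1:n, x, min(max(n*p + (p + 1)/3, 1), n));
Q1 = q8(0.25);   % first quartile (25th percentile)
Q3 = q8(0.75);   % third quartile (75th percentile)
IQR = Q3 - Q1;
outliers = data(data < Q1 - 1.5 * IQR | data > Q3 + 1.5 * IQR);
disp(outliers);  % empty for this dataset: no value lies outside the fences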
Outliers
An outlier is an observed value that is notably distinct from the other
values in a dataset. Usually, an outlier is much larger or much smaller than
the rest of the data values.
Displaying the data: Boxplot
A boxplot is a graphical display of the five-number summary for a quantitative variable. It shows the general
shape of the distribution, identifies the middle 50% of the data, and highlights any outliers.
A boxplot includes:
A box stretching from Q1 to Q3
A line that divides the box drawn at the
median
A line from each quartile to the most extreme
data value that is not an outlier (if there are no outliers,
these are the minimum and maximum)
Each outlier plotted individually
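The ingredients of a boxplot can be computed by hand. The sketch below assumes the robustness data including the value 30 and the method 8 quantile rule. The helper `q8` and the variable names are illustrative, and the code computes the numbers only, without drawing the plot.

```matlab
x = sort([2.5 3.2 3.5 3.9 4.0 4.4 5.3 5.9 6.1 6.4 30]);   % data with one extreme value
n = numel(x);

% Method 8 quantile: linear interpolation at position k = n*p + (p+1)/3.
q8 = @(p) interp1(1:n, x, min(max(n*p + (p + 1)/3, 1), n));

Q1 = q8(0.25);  med = q8(0.5);  Q3 = q8(0.75);   % box edges and median line
iqr_x = Q3 - Q1;

is_out     = x < Q1 - 1.5 * iqr_x | x > Q3 + 1.5 * iqr_x;
whisker_lo = min(x(~is_out));    % most extreme low value that is not an outlier
whisker_hi = max(x(~is_out));    % most extreme high value that is not an outlier
outliers   = x(is_out);          % plotted individually

fprintf('box: [%.3f, %.3f], median: %.1f\n', Q1, Q3, med);
fprintf('whiskers: [%.1f, %.1f], outliers: %s\n', whisker_lo, whisker_hi, mat2str(outliers));
```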
Displaying the data: Boxplot v. Histogram
3 Visualizing Data
Graphical Exploration Of Quantitative
Information
Anscombe's Quartet
Dataset #1      Dataset #2      Dataset #3      Dataset #4
x     y         x     y         x     y         x     y
10    8.04      10    9.14      10    7.46      8     6.58
8     6.95      8     8.14      8     6.77      8     5.76
13    7.58      13    8.74      13    12.74     8     7.71
9     8.81      9     8.77      9     7.11      8     8.84
11    8.33      11    9.26      11    7.81      8     8.47
14    9.96      14    8.1       14    8.84      8     7.04
6     7.24      6     6.13      6     6.08      8     5.25
4     4.26      4     3.1       4     5.39      19    12.5
12    10.84     12    9.13      12    8.15      8     5.56
7     4.82      7     7.26      7     6.42      8     7.91
5     5.68      5     4.74      5     5.73      8     6.89

Summary statistics:
                  Dataset #1    Dataset #2    Dataset #3    Dataset #4
                  x     y       x     y       x     y       x     y
Mean              9     7.5     9     7.5     9     7.5     9     7.5
Variance          11    4.1     11    4.1     11    4.1     11    4.1
Correlation       0.82          0.82          0.82          0.82
Regression line   y=3+0.5x      y=3+0.5x      y=3+0.5x      y=3+0.5x
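The summary statistics can be recomputed from the tabulated values. The MATLAB sketch below stores each dataset as one row of `X` and `Y` (an illustrative layout) and uses the built-ins `mean`, `var`, `corrcoef`, and `polyfit`; all four rows should come out nearly identical even though the scatter plots look very different.

```matlab
x123 = [10 8 13 9 11 14 6 4 12 7 5];             % shared x for datasets 1 to 3
X = [x123; x123; x123; 8 8 8 8 8 8 8 19 8 8 8];
Y = [8.04 6.95  7.58 8.81 8.33 9.96 7.24  4.26 10.84 4.82 5.68;
     9.14 8.14  8.74 8.77 9.26 8.10 6.13  3.10  9.13 7.26 4.74;
     7.46 6.77 12.74 7.11 7.81 8.84 6.08  5.39  8.15 6.42 5.73;
     6.58 5.76  7.71 8.84 8.47 7.04 5.25 12.50  5.56 7.91 6.89];

% Columns of stats: mean(x), mean(y), var(x), var(y), correlation, intercept, slope
stats = zeros(4, 7);
for d = 1:4
    x = X(d, :);  y = Y(d, :);
    R = corrcoef(x, y);           % 2x2 correlation matrix
    b = polyfit(x, y, 1);         % least-squares line: b(1) slope, b(2) intercept
    stats(d, :) = [mean(x) mean(y) var(x) var(y) R(1, 2) b(2) b(1)];
    fprintf('#%d: mean (%.1f, %.2f), var (%.1f, %.2f), r = %.3f, y = %.2f + %.3fx\n', ...
            d, stats(d, :));
end
```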
Release The Datasaurus!
X Mean: 54.26
Y Mean: 47.83
X SD: 16.76
Y SD: 26.93
Correlation: -0.06
[Figure: scatter plot of y against x, both axes from 0 to 100]
"Same Stats, Different Graphs: Generating Datasets with Varied
Appearance and Identical Statistics through Simulated Annealing"
Justin Matejka and George Fitzmaurice,
ACM SIGCHI Conference on Human Factors in Computing Systems (2017)
Visualizing The Distribution
A dotplot is a common way to visualize the shape of a moderately sized dataset.
Species Longevity Species Longevity Species Longevity Species Longevity Species Longevity
Baboon 20 Chimpanzee 20 Fox 7 Leopard 12 Rabbit 5
Black bear 18 Chipmunk 6 Giraffe 10 Lion 15 Rhinoceros 15
Grizzly bear 25 Cow 15 Goat 8 Monkey 15 Sea lion 12
Polar bear 20 Deer 8 Gorilla 20 Moose 12 Sheep 12
Beaver 5 Dog 12 Guinea Pig 4 Mouse 3 Squirrel 10
Buffalo 15 Donkey 12 Hippopotamus 25 Opossum 1 Tiger 16
Camel 12 Elephant 40 Horse 20 Pig 10 Wolf 5
Cat 12 Elk 15 Kangaroo 7 Puma 12 Zebra 15
Note:
For this particular dataset,
values are integers and can be
easily stacked.
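Because the longevity values are integers, a rough text dotplot can be sketched in MATLAB by stacking one marker per animal at each value. The vector below is transcribed from the table, column by column; the marker character and layout are illustrative choices.

```matlab
% Longevity values from the table, one line per column pair.
longevity = [20 18 25 20  5 15 12 12, ...
             20  6 15  8 12 12 40 15, ...
              7 10  8 20  4 25 20  7, ...
             12 15 15 12  3  1 10 12, ...
              5 15 12 12 10 16  5 15];

% One row per integer value; each 'o' is one species with that longevity.
for v = min(longevity):max(longevity)
    c = sum(longevity == v);
    fprintf('%2d | %s\n', v, repmat('o', 1, c));
end
```

Empty rows are printed on purpose so that gaps in the data, such as the stretch between 25 and 40, stay visible.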
From Dotplot To Histogram
A dotplot becomes hard to construct and read when many values are similar and the dots overlap;
a histogram is a good replacement. Histograms group similar values into bins and display the
count in each bin, effectively showing the data distribution.
Process to construct a histogram
1 Define "boundaries" (they form bins), e.g., 45, 50, 55, ..., 105
2 Count the number of elements inside each
bin
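The two steps can be sketched in MATLAB with an explicit counting loop. The `data` vector is made up for illustration (the slide's data are not listed), and each bin is taken to include its left edge and exclude its right edge, which is one common convention.

```matlab
% Hypothetical measurements, made up for illustration.
data = [47 52 53 58 61 64 66 67 71 72 73 74 76 78 79 81 82 83 86 88 91 94 97 102];

edges  = 45:5:105;                       % step 1: boundaries that form the bins
counts = zeros(1, numel(edges) - 1);
for b = 1:numel(counts)
    % step 2: count the values falling in [edges(b), edges(b+1))
    counts(b) = sum(data >= edges(b) & data < edges(b + 1));
end

for b = 1:numel(counts)
    fprintf('[%3d, %3d): %d\n', edges(b), edges(b + 1), counts(b));
end
```

In MATLAB, the built-in `histcounts(data, edges)` does similar binning in one call; the loop is spelled out here to mirror the two steps.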
Histogram Characteristics
Histograms can be sensitive to parameter choices!
In particular, the bin width and bin offset can drastically change the histogram's overall look.
[Figure: histogram with bin width 5 and bin offset 0; x-axis from 40 to 110, y-axis: Count]
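A small MATLAB sketch of the offset effect, assuming eight made-up values chosen so that the effect is easy to see. With the same bin width of 5, shifting the bin boundaries by 2.5 turns a peaked histogram into a flat one. The helper `count_bins` is illustrative.

```matlab
data  = [49 49 51 51 54 54 56 56];   % hypothetical values
width = 5;

% Count the values falling in each bin [edges(b), edges(b+1)).
count_bins = @(edges) arrayfun(@(b) sum(data >= edges(b) & data < edges(b + 1)), ...
                               1:numel(edges) - 1);

counts_a = count_bins(45:width:60);       % offset 0:   bins start at 45, 50, 55
counts_b = count_bins(47.5:width:57.5);   % offset 2.5: bins start at 47.5, 52.5

disp(counts_a);   % 2 4 2 (one central peak)
disp(counts_b);   % 4 4   (flat)
```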
Bargraphs are evil
1) Part of the range covered by the bar might have never been observed in the sample
2) They conceal the variance and the underlying distribution of the data
Look the same? They're not!
3) They are associated with (usually not defined) error bars
Different types, different meanings:
Cumming, G. et al. (2007). 'Error bars in experimental biology'. J Cell Biol 177 (1): 7-11
Avoid Bargraphs!
To reveal the distribution of the data:
Display data in their raw form
A dot plot is a good start
Dynamite plunger plots conceal data
Check the pattern of distribution of the values
About Figure 1:
First set: Gaussian (or normal) distribution (symmetrically distributed)
Second set: right skewed, log-normal (few large values). This type of distribution of values is quite common.
Plunger plots only: who would know that the values were skewed [...] and that
the common statistical tests would be inappropriate?
"For better characterization of a sample, we prefer box, swarm, or violin plots for their ability to show the distribution of the data."
You've been warned before!
A Better Option: Dotplot
If the number of data points is relatively small, directly showing the raw data together with the
mean or median is best.
A Better Option: Beeswarm
A Beeswarm is a dot plot that shows the distribution of data points in a way that avoids overlap.
[Figure: 10 generated random points on an axis from 45 to 54, shown three ways: Dotplot, Jitter, Beeswarm]
A Better Option: Boxplot
A boxplot is a graphical display of the five-number summary for a quantitative variable. It shows the general
shape of the distribution, identifies the middle 50% of the data, and highlights any outliers.
Showing The Data Is Best
“Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing”
Justin Matejka and George Fitzmaurice,
ACM SIGCHI Conference on Human Factors in Computing Systems (2017)
Lecture 06 Khris Griffis ©2024