Exploring and Organizing Data to Address Public Health Questions
Marie Diener-West, PhD
Johns Hopkins University
Organizing and Grouping Data from Studies
Experimental Studies
► Experimental studies: control allocation of “treatment” to subjects (experimental units)
► Laboratory studies: control variation (e.g., effect of pesticide on rate of mutations in rat pups)
► Clinical trials: randomize to produce groups with similar observed and unobserved characteristics; average over rather than control variation (e.g., compare two treatments to reduce blood pressure)
Observational Studies
► Observational studies: do not control allocation of “treatment” to subjects (experimental units)
► Longitudinal (cohort) or cross-sectional
► Prospective or retrospective
► Sampling
  ► At random
  ► By risk factors
  ► Or by health outcome
    ● Case-control study
Objectives of Exploratory Data Analysis
► Look at your data
► Think about the mechanism that produced them
► Look at detailed distributions
► Not just summary measures
► Look for patterns, anomalies, or possible errors
► Gain a “feeling” for the data and the underlying mechanisms that produced them
Methods for Organizing Quantitative Data I
Ordering data
► Tallies
► Stem-and-leaf displays
Grouping data
► Frequency distributions
► Percentiles
Ordering Data: Tallies—6
Example: ages of public health students
Unordered data (n=10): 35,40,52,27,31,42,43,28,50,35
Ordered data (order statistics) (n=10):
► Order by hand: 27,28,31,35,35,40,42,43,50,52
► Tally

Group   Observations
20–29   //
30–39   ///
40–49   ///
50–59   //
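The ordering and tally for the student ages can also be reproduced in a few lines of code. This is a minimal sketch in Python (the language and the variable names are assumptions for illustration; the slides do the work by hand).

```python
from collections import Counter

ages = [35, 40, 52, 27, 31, 42, 43, 28, 50, 35]

# Order statistics: the observations sorted from smallest to largest.
ordered = sorted(ages)

# Tally by 10-year age group; (age // 10) * 10 is the lower bound of the group.
tally = Counter((age // 10) * 10 for age in ordered)

print(ordered)
for lower in sorted(tally):
    # One slash per observation in the group, as in the hand tally.
    print(f"{lower}-{lower + 9}  {'/' * tally[lower]}")
```

The tally keeps only the group counts, so the individual ages cannot be recovered from it.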
Ordering Data: Stem-and-Leaf Displays
► Easier way: create a stem-and-leaf display for 10-year age groups
► Unordered data (n=10): 35,40,52,27,31,42,43,28,50,35
► Where the stem is the first digit of the age group and the leaf is the second digit

Stem Leaves
2|
3|
4|
5|

Or can depict as:

Stem Leaves
2*|
3*|
4*|
5*|

► Where:
  ► 2* = 20–29
  ► 3* = 30–39
  ► Etc.
Example: Stem-and-Leaf Displays—1
► Stem-and-leaf display for 10-year age groups
► By hand, one can quickly create a stem-and-leaf display using the unordered observations
Example: Stem-and-Leaf Displays—9
► Unordered data (n=10): 35,40,52,27,31,42,43,28,50,35

Stem Leaves
2| 7 8
3| 5 1 5
4| 0 2 3
5| 2 0
Example: Stem-and-Leaf Displays—10

Stem Leaves
2| 7 8
3| 5 1 5
4| 0 2 3
5| 2 0

Or

Stem Leaves
2*| 7 8
3*| 5 1 5
4*| 0 2 3
5*| 2 0

► Where:
  ► 2* = 20–29
  ► 3* = 30–39
  ► Etc.
Example: Stem-and-Leaf Displays—11
► Easily order the leaves (second digits) within each stem (age group)

Unordered observations
Stem Leaves
2*| 7 8
3*| 5 1 5
4*| 0 2 3
5*| 2 0

Ordered observations
Stem Leaves
2*| 7 8
3*| 1 5 5
4*| 0 2 3
5*| 0 2

► Where:
  ► 2* = 20–29
  ► 3* = 30–39
  ► Etc.
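The hand procedure for the 10-year stem-and-leaf display can be sketched in code as follows. The function name `stem_and_leaf` and its `sort_leaves` flag are hypothetical, and the sketch assumes non-negative whole-number ages, with the stem taken as the tens digit and the leaf as the units digit.

```python
from collections import defaultdict

def stem_and_leaf(values, sort_leaves=True):
    """Return {stem: [leaves]}; stem = tens digit, leaf = units digit."""
    stems = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)  # leaves arrive in the order of the raw data
    if sort_leaves:
        for leaves in stems.values():
            leaves.sort()
    return dict(sorted(stems.items()))

ages = [35, 40, 52, 27, 31, 42, 43, 28, 50, 35]
unordered = stem_and_leaf(ages, sort_leaves=False)
ordered = stem_and_leaf(ages)

for stem, leaves in ordered.items():
    print(f"{stem}*| {' '.join(str(leaf) for leaf in leaves)}")
```

With `sort_leaves=False` the result matches the display built directly from the unordered observations; with the default it matches the ordered display.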
Example: Stem-and-Leaf Displays—12
► For five-year age groupings, one could display:

2* |
2. | 78
3* | 1
3. | 55
4* | 023
4. |
5* | 02
5. |

► Where:
  ► 2* = 20–24
  ► 2. = 25–29
  ► Etc.
Stem-and-Leaf Display: Considerations
► Aids in sorting or ordering data
► Retains more information than a tally
► Rough guideline: choose the intervals based upon your knowledge of the range of the data and with approximately $\sqrt{n}$ groups (categories)
  ► For the previous example: $n = 10$ students
  ► The guideline suggests $\sqrt{10} \approx 3$ age groups
  ► The approximate range of the data = 60 − 20 = 40 years
  ► 40/3 ≈ 13-year spread for each group
► Consider logical categories for the number of stems
  ► 10-year age groups are more reasonable than 13-year age groups!
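The arithmetic behind the rough guideline can be checked with a short sketch. The bounds 20 and 60 are the approximate range quoted on the slide; the variable names are illustrative only.

```python
import math

n = 10                  # number of students
low, high = 20, 60      # approximate range of the ages, in years

groups = round(math.sqrt(n))     # square root of 10 is about 3.16, rounded to 3
width = (high - low) / groups    # 40 / 3, about 13.3 years per group

print(groups, round(width, 1))
```

The computed width is only a starting point; a round, interpretable width such as 10 years is usually the better choice.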
Grouping Data: Frequency
► Frequency: the count (frequency) of the number of individuals in a particular group
► Empirical distribution function: a frequency distribution which describes an observed set of values of a variable

Age group   Frequency   Cumulative frequency   Relative frequency   Cumulative relative frequency
20–29       2
30–39       3
40–49       3
50–59       2
Total       10
Grouping Data: Cumulative Frequency
► Cumulative frequency: the count (frequency) of the number of individuals in a particular age group or lower age group
  ► That is, the cumulative count

Age group   Frequency   Cumulative frequency   Relative frequency   Cumulative relative frequency
20–29       2           2
30–39       3           5
40–49       3           8
50–59       2           10
Total       10
Grouping Data: Relative Frequency
► Relative frequency: the proportion of individuals in a particular age group = the count (frequency) of the number of individuals in a particular age group divided by the overall total

Age group   Frequency   Cumulative frequency   Relative frequency   Cumulative relative frequency
20–29       2           2                      0.2
30–39       3           5                      0.3
40–49       3           8                      0.3
50–59       2           10                     0.2
Total       10                                 1.0
Grouping Data: Cumulative Relative Frequency
► Cumulative relative frequency: the cumulative proportion of individuals in a particular age group or any lower age group

Age group   Frequency   Cumulative frequency   Relative frequency   Cumulative relative frequency
20–29       2           2                      0.2                  0.2
30–39       3           5                      0.3                  0.5
40–49       3           8                      0.3                  0.8
50–59       2           10                     0.2                  1.0
Total       10                                 1.0

Observation   1    2    3    4    5    6    7    8    9    10
Age           27   28   31   35   35   40   42   43   50   52
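The four columns of the frequency table can be derived from the raw ages in a few steps. This is a minimal sketch; the 10-year groups are those used on the slides, and all variable names are illustrative.

```python
from itertools import accumulate

ages = [27, 28, 31, 35, 35, 40, 42, 43, 50, 52]
lower_bounds = [20, 30, 40, 50]  # each group spans lower .. lower + 9

# Frequency: number of individuals in each age group.
freq = [sum(lo <= age <= lo + 9 for age in ages) for lo in lower_bounds]
total = sum(freq)

# Cumulative frequency: count in the group or any lower group.
cum_freq = list(accumulate(freq))

# Relative and cumulative relative frequency: counts divided by the total.
rel_freq = [f / total for f in freq]
cum_rel_freq = [c / total for c in cum_freq]

for lo, f, cf, rf, crf in zip(lower_bounds, freq, cum_freq, rel_freq, cum_rel_freq):
    print(f"{lo}-{lo + 9}  {f:>2}  {cf:>2}  {rf:.1f}  {crf:.1f}")
```

The cumulative relative frequencies are computed from the cumulative counts rather than by summing proportions, which avoids small floating-point drift.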
Grouping Data: Percentiles—1
► The $r$-th percentile $P_r$ is the value that is greater than or equal to $r$ percent of a sample of $n$ observations and less than or equal to $(100 - r)$ percent of the observations

Percentile   Quartile   Percentage of observations falling below
$P_{25}$     $Q_1$      25%
$P_{50}$     $Q_2$      50%
$P_{75}$     $Q_3$      75%
Grouping Data: Percentiles—2
► Using the example ($n = 10$), a simple way to calculate quartiles:
  ► $Q_2$ = median = average of the 5th and 6th observations → 37.5
  ► $Q_1$ = median of the lower half of the data; third smallest value → 31
  ► $Q_3$ = median of the upper half of the data; third largest value → 43
► What if $n = 11$ (or, in general, $n$ is odd)?
  ► Include the median in both the upper and lower halves of the data
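The median-of-halves rule described above can be sketched as a small function. The name `quartiles` is hypothetical, and this follows the slide's convention (for odd $n$ the median is counted in both halves); statistical packages often use other quartile definitions and may give slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    xs = sorted(values)
    n = len(xs)
    half = (n + 1) // 2       # for odd n, both halves include the median
    lower = xs[:half]         # lower half of the ordered data
    upper = xs[n - half:]     # upper half of the ordered data
    return median(lower), median(xs), median(upper)

ages = [35, 40, 52, 27, 31, 42, 43, 28, 50, 35]
q1, q2, q3 = quartiles(ages)
print(q1, q2, q3)
```

For the ten student ages this gives the same quartiles as the hand calculation: 31, 37.5, and 43.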
Recap
► The type of study design is important and may impact how you look at the data
► Looking at data is important for understanding patterns
► Exploratory data analysis techniques provide useful descriptions
► Data can be organized through stem-and-leaf displays and frequency distributions
Summarizing and Displaying Data from Studies
Methods for Organizing Quantitative Data II
Summarizing data
► Measures of central tendency
► Measures of dispersion
► Box-and-whiskers plots
Displaying data
► Graphs
Summarizing Data: Measures of Central Tendency
► Student ages:
  ► 35,40,52,27,31,42,43,28,50,35 = $x_1, x_2, x_3, \dots, x_{10}$, where $x_i$ represents the age of student $i$ and $n = 10$ is the number of observations
► Mean (average):
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \dots + x_n}{n}$$
► Median = middle observation of the ordered data (average of the two middle observations when $n$ is even)
► Mode = most frequent observation
Summarizing Data: Measures of Dispersion or Spread
► Range: difference between largest and smallest values
► Variance: “average” of the squared differences of observations from the sample mean
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
► Standard deviation: $s = \sqrt{s^2}$
Summarizing Data: Back to the Example
► Public health student ages: 27,28,31,35,35,40,42,43,50,52
► Mean $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = 38.3$ years
► Mode = 35 years
► Median = (35 + 40)/2 = 37.5 years
► Range = 52 − 27 = 25 years
► Variance $= s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = 74.7$ years$^2$
► Standard deviation $= \sqrt{\text{variance}} = s = 8.6$ years
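These summary measures can be verified with Python's standard `statistics` module. This is a sketch for checking the hand calculations; `statistics.variance` and `statistics.stdev` use the $n - 1$ denominator, matching the sample variance formula on the slides.

```python
import statistics as st

ages = [27, 28, 31, 35, 35, 40, 42, 43, 50, 52]

mean = st.mean(ages)
median = st.median(ages)             # average of the 5th and 6th ordered values
mode = st.mode(ages)                 # most frequent observation
data_range = max(ages) - min(ages)
variance = st.variance(ages)         # sample variance (divides by n - 1)
sd = st.stdev(ages)                  # square root of the sample variance

print(round(mean, 1), median, mode, data_range, round(variance, 1), round(sd, 1))
```

Rounded to one decimal place, the results agree with the values on the slide.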
Summarizing Data: Box-and-Whiskers Plots—1
► Graphical display using quartiles
► Terminology
  ► Upper hinge = $Q_3$
  ► Median = $Q_2$
  ► Lower hinge = $Q_1$
  ► Interquartile range (IQR) = $Q_3 - Q_1$
    ● Contains the middle 50% of the observations
  ► Whiskers: lines drawn to the smallest and largest actual observations within the calculated fences
Summarizing Data: Box-and-Whiskers Plots—2
► What are fences?
  ► Fences are not observed data points
  ► Fences are calculated to provide guidelines for identifying outliers
  ► Upper fence = upper hinge + 1.5 × IQR = $Q_3 + 1.5 \times \text{IQR}$
  ► Lower fence = lower hinge − 1.5 × IQR = $Q_1 - 1.5 \times \text{IQR}$
► What are outliers?
  ► Outliers are actual observed data values falling beyond the calculated fences (higher or lower)
Summarizing Data: Box-and-Whiskers Plots—3
Back to the example of public health student ages
Interquartile range (IQR) = $Q_3 - Q_1$ = 43 − 31 = 12 years
Upper fence = $Q_3 + 1.5 \times \text{IQR}$ = 43 + 1.5 × 12 = 61 years
Lower fence = $Q_1 - 1.5 \times \text{IQR}$ = 31 − 1.5 × 12 = 13 years
Summarizing Data: Box-and-Whiskers Plots—4
[Figure: box plot of public health student ages]
Summarizing Data: Box-and-Whiskers Plots, Modified Example—1
► Suppose the data set contains two more students with ages 80 and 8 (n=12)
► All summary statistics need to be recalculated:
  ► Upper fence = 72 (UF = $Q_3 + 1.5 \times \text{IQR}$)
  ► Lower fence = 4 (LF = $Q_1 - 1.5 \times \text{IQR}$)
► The box plot would now show the outlying value of the age of 80 years
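The recalculated fences can be reproduced with a short sketch that computes the hinges, fences, whisker ends, and outliers for both the original and the modified data. The function name `box_plot_summary` is hypothetical, and the hinges are assumed to follow the median-of-halves rule used earlier in the slides; plotting software may compute quartiles differently.

```python
from statistics import median

def box_plot_summary(values):
    """Hinges (median-of-halves), fences, whisker ends, and outliers."""
    xs = sorted(values)
    n = len(xs)
    half = (n + 1) // 2  # for odd n, both halves include the median
    q1, q3 = median(xs[:half]), median(xs[n - half:])
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers reach the most extreme observations that lie within the fences.
    inside = [x for x in xs if lower_fence <= x <= upper_fence]
    return {
        "q1": q1, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "whiskers": (inside[0], inside[-1]),
        "outliers": [x for x in xs if x < lower_fence or x > upper_fence],
    }

ages = [27, 28, 31, 35, 35, 40, 42, 43, 50, 52]
original = box_plot_summary(ages)
modified = box_plot_summary(ages + [80, 8])
print(original)
print(modified)
```

With the two extra students, the hinges become 29.5 and 46.5, so the IQR is 17 and the fences are 4 and 72. The age of 8 lies inside the lower fence, so only 80 is flagged as an outlier.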
Summarizing Data: Box-and-Whiskers Plots, Modified Example—2
[Figure: box plot of public health student ages]
Summarizing Data: Box-and-Whiskers Plots: Comparing Groups
► Box plots can be used to compare different groups
  ► For example, student ages in Class 1 and Class 2
Summarizing Data: Look for Skewness
► Positively skewed: more lower values, sparse higher values
► Also: long “tail” of higher values
► Also: mean > median > mode
► Negatively skewed: reverse of positively skewed
► Symmetric: not skewed in either direction
Summarizing Data: Types of Skewness
Summarizing Data: Look for Outliers
► Outlying values
  ► Values that are “far” from most values
► Importance: a few outlying values can strongly influence certain statistical summary measures and analyses
► Example from student age data:

Data                                 Mean   Median   Mode
Original (n=10)                      38.3   37.5     35
With additional 80-year-old (n=11)   42.1   40       35
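The effect of the single 80-year-old on each summary measure can be checked directly. This is a minimal sketch using the standard `statistics` module; the helper name `summarize` is illustrative.

```python
import statistics as st

original = [27, 28, 31, 35, 35, 40, 42, 43, 50, 52]
with_outlier = original + [80]

def summarize(data):
    """Return (mean rounded to one decimal, median, mode)."""
    return round(st.mean(data), 1), st.median(data), st.mode(data)

print("Original (n=10):", summarize(original))
print("With 80-year-old (n=11):", summarize(with_outlier))
```

The mean shifts by almost four years, the median moves less, and the mode does not change, which is why the median and mode are described as less sensitive to outlying values.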
Displaying Data: Graphs
Graphing on an Arithmetic vs. Logarithmic Scale
► On an arithmetic scale, each increment represents change by a constant amount
► On a logarithmic scale, each increment represents change by a constant multiplier
Displaying Data: Logarithmic Scales
► Logarithm to the base 10 (common log)
  $x = 10^y$ or $\log_{10}(x) = y$
  $\log_{10}(1) = 0$ since $10^0 = 1$
  $\log_{10}(100) = 2$ since $10^2 = 100$
► Logarithm to the base e (natural log)
  $x = e^y$ or $\log_e(x) = y$
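A short sketch with Python's `math` module illustrates these identities and the "constant multiplier" property of a logarithmic scale. The example values 1, 10, 100, 1000 are chosen for illustration and are not from the slides.

```python
import math

# Common (base 10) and natural (base e) logarithms.
log10_of_1 = math.log10(1)
log10_of_100 = math.log10(100)
ln_of_e = math.log(math.e)

# Each value is 10 times the previous one (a constant multiplier) ...
values = [1, 10, 100, 1000]
logs = [math.log10(v) for v in values]

# ... so on the log scale the values are separated by a constant amount.
steps = [b - a for a, b in zip(logs, logs[1:])]

print(log10_of_1, log10_of_100, ln_of_e)
print(logs, steps)
```

Equal steps on the log scale correspond to equal ratios on the original scale, which is what lets values of very different magnitudes share one axis.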
Graphing Data: Why Use Logarithms?
► Allows plotting of numbers of different orders of magnitude on the same graph
► May aid in data analysis, data transformations
► May describe a biological relationship (e.g., exponential growth) more accurately
Graphing Data: Example of an Arithmetic Scale on the Y-Axis
[Figure: box-and-whiskers plots of medical expenditures ($) for individuals without a major smoking-caused disease (mscd–) versus with a major smoking-caused disease (mscd+) within age groups]
Source: National Medical Expenditures Data Set.
Graphing Data: Example of a Logarithmic Scale on the Y-Axis
[Figure: box-and-whiskers plots of the log medical expenditures (log $)]
Source: National Medical Expenditures Data Set.
Recap
► Data can be summarized in percentiles, box-and-whiskers plots, and measures of central tendency and dispersion
► Data may be displayed on graphs using either an arithmetic or logarithmic scale