Exploring and Organizing Data to Address Public Health Questions


Marie Diener-West, PhD
Johns Hopkins University

1
Organizing and Grouping Data from Studies

2
Experimental Studies

► Experimental studies: control allocation of “treatment” to subjects (experimental units)

► Laboratory studies: control variation (e.g., effect of pesticide on rate of mutations in rat pups)

► Clinical trials: randomize to produce groups with similar observed and unobserved characteristics; average over rather than control variation (e.g., compare two treatments to reduce blood pressure)

3
Observational Studies

► Observational studies: do not control allocation of “treatment” to subjects (experimental units)

► Longitudinal (cohort) or cross-sectional

► Prospective or retrospective

► Sampling
  ► At random
  ► By risk factors
  ► Or by health outcome
    ● Case-control study

4
Objectives of Exploratory Data Analysis

► Look at your data
  ► Think about the mechanism that produced them

► Look at detailed distributions
  ► Not just summary measures

► Look for patterns, anomalies, or possible errors

► Gain a “feeling” for the data and the underlying mechanisms that produced them

5
Methods for Organizing Quantitative Data I

Ordering data
► Tallies
► Stem-and-leaf displays

Grouping data
► Frequency distributions
► Percentiles

6
Ordering Data: Tallies—6

Example: ages of public health students

Unordered data (n=10): 35,40,52,27,31,42,43,28,50,35

Ordered data (order statistics) (n=10):

► Order by hand: 27,28,31,35,35,40,42,43,50,52

► Tally

Group    Observations
20–29    //
30–39    ///
40–49    ///
50–59    //
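
The hand ordering and tally can also be reproduced in code. The following is a minimal Python sketch, not part of the original lecture; the variable names (`ages`, `ordered`, `tally`) are illustrative assumptions.

```python
from collections import Counter

# Ages of the 10 public health students, in the order they were recorded.
ages = [35, 40, 52, 27, 31, 42, 43, 28, 50, 35]

# Order statistics: the observations sorted from smallest to largest.
ordered = sorted(ages)

# Tally by 10-year age group; each group is keyed by its lower bound (20, 30, ...).
tally = Counter((age // 10) * 10 for age in ages)

print(ordered)
for lower in sorted(tally):
    # One slash per observation, as in the hand tally.
    print(f"{lower}–{lower + 9}  {'/' * tally[lower]}")
```

Integer division by 10 assigns each age to its decade, which matches the 10-year groups used on the slide.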

12
Ordering Data: Stem-and-Leaf Displays

► Easier way: create a stem-and-leaf display for 10-year age groups

► Unordered data (n=10): 35,40,52,27,31,42,43,28,50,35

► Where the stem is the first digit of the age group and the leaf is the second digit

Stem Leaves
2|
3|
4|
5|

Or can depict as:

Stem Leaves
2*|
3*|
4*|
5*|

► Where:
  ► 2* = 20–29
  ► 3* = 30–39
  ► Etc.

13
Example: Stem-and-Leaf Displays—1

► Stem-and-leaf display for 10-year age groups

► By hand, one can quickly create a stem-and-leaf display using the unordered observations

14
Example: Stem-and-Leaf Displays—9

► Unordered data (n=10): 35,40,52,27,31,42,43,28,50,35

Stem Leaves
2| 7 8
3| 5 1 5
4| 0 2 3
5| 2 0

22
Example: Stem-and-Leaf Displays—10

Stem Leaves
2| 7 8
3| 5 1 5
4| 0 2 3
5| 2 0

Or:

Stem Leaves
2*| 7 8
3*| 5 1 5
4*| 0 2 3
5*| 2 0

► Where:
  ► 2* = 20–29
  ► 3* = 30–39
  ► Etc.

23
Example: Stem-and-Leaf Displays—11

► Easily order the leaves (second digits) within each stem (age group)

Unordered observations

Stem Leaves
2*| 7 8
3*| 5 1 5
4*| 0 2 3
5*| 2 0

Ordered observations

Stem Leaves
2*| 7 8
3*| 1 5 5
4*| 0 2 3
5*| 0 2

► Where:
  ► 2* = 20–29
  ► 3* = 30–39
  ► Etc.
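
The construction of a stem-and-leaf display from unordered observations, followed by ordering the leaves within each stem, can be sketched in Python. This is an illustration under stated assumptions, not the lecture's own code: the helper `stem_and_leaf` and its `order_leaves` parameter are hypothetical names, and the sketch assumes two-digit values so that the stem is the tens digit.

```python
from collections import defaultdict

def stem_and_leaf(values, order_leaves=False):
    """Return {stem: [leaves]} where stem is the tens digit and leaf the units digit."""
    stems = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)   # e.g. 35 -> stem 3, leaf 5
        stems[stem].append(leaf)     # leaves arrive in data order
    if order_leaves:
        for leaves in stems.values():
            leaves.sort()            # order the leaves within each stem
    return dict(sorted(stems.items()))

ages = [35, 40, 52, 27, 31, 42, 43, 28, 50, 35]
unordered = stem_and_leaf(ages)
ordered = stem_and_leaf(ages, order_leaves=True)

for stem, leaves in ordered.items():
    print(f"{stem}*| {' '.join(map(str, leaves))}")
```

Reading the ordered display from top to bottom and left to right recovers the order statistics, which is why the display doubles as a sorting aid.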

24
Example: Stem-and-Leaf Displays—12

► For five-year age groupings, one could display:

2* |
2. | 78
3* | 1
3. | 55
4* | 023
4. |
5* | 02
5. |

► Where:
  ► 2* = 20–24
  ► 2. = 25–29
  ► Etc.
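
The five-year display splits each stem in two: leaves 0 to 4 go on the `*` row and leaves 5 to 9 on the `.` row. A small Python sketch of that rule follows; the `rows` dictionary and its key format are assumptions made for illustration.

```python
ages = [35, 40, 52, 27, 31, 42, 43, 28, 50, 35]

# Create every row up front so that empty five-year groups are still shown.
rows = {}
for tens in range(2, 6):
    rows[f"{tens}*"] = []   # leaves 0-4, e.g. 2* = 20-24
    rows[f"{tens}."] = []   # leaves 5-9, e.g. 2. = 25-29

# Sorting first means the leaves within each row end up in order.
for age in sorted(ages):
    tens, leaf = divmod(age, 10)
    key = f"{tens}*" if leaf < 5 else f"{tens}."
    rows[key].append(leaf)

for key, leaves in rows.items():
    print(f"{key} | {''.join(map(str, leaves))}")
```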

25
Stem-and-Leaf Display: Considerations

► Aids in sorting or ordering data

► Retains more information than a tally

► Rough guideline: choose the intervals based upon your knowledge of the range of the data and with approximately √n groups (categories)
  ► For the previous example: n = 10 students
  ► The guideline suggests √10 ≈ 3 age groups
  ► The approximate range of the data = 60 − 20 = 40 years
  ► 40/3 ≈ 13-year spread for each group

► Consider logical categories for the number of stems
  ► 10-year age groups are more reasonable than 13-year age groups!
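
The arithmetic behind the rough guideline can be checked quickly. This sketch only restates the slide's numbers in Python; the variable names are illustrative.

```python
import math

n = 10                                # number of students
n_groups = round(math.sqrt(n))        # sqrt(10) is about 3.16, so about 3 groups
approx_range = 60 - 20                # approximate range of the ages, in years
width = approx_range / n_groups       # about 13.3 years per group

print(n_groups, round(width))
```

The guideline yields a 13-year width, which is then rounded to the more natural 10-year grouping.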

26
Grouping Data: Frequency

► Frequency: the count (frequency) of the number of individuals in a particular group

► Empirical distribution function: a frequency distribution which describes an observed set of values of a variable

Age group   Frequency   Cumulative frequency   Relative frequency   Cumulative relative frequency
20–29       2
30–39       3
40–49       3
50–59       2
Total       10

27
Grouping Data: Cumulative Frequency

► Cumulative frequency: the count (frequency) of the number of individuals in a particular age group or lower age group
  ► That is, the cumulative count

Age group   Frequency   Cumulative frequency   Relative frequency   Cumulative relative frequency
20–29       2           2
30–39       3           5
40–49       3           8
50–59       2           10
Total       10

28
Grouping Data: Relative Frequency

► Relative frequency: the proportion of individuals in a particular age group = the count (frequency) of the number of individuals in a particular age group divided by the overall total

Age group   Frequency   Cumulative frequency   Relative frequency   Cumulative relative frequency
20–29       2           2                      0.2
30–39       3           5                      0.3
40–49       3           8                      0.3
50–59       2           10                     0.2
Total       10                                 1.0

29
Grouping Data: Cumulative Relative Frequency

► Cumulative relative frequency: the cumulative proportion of individuals in a particular age group or any lower age group

Age group   Frequency   Cumulative frequency   Relative frequency   Cumulative relative frequency
20–29       2           2                      0.2                  0.2
30–39       3           5                      0.3                  0.5
40–49       3           8                      0.3                  0.8
50–59       2           10                     0.2                  1.0
Total       10                                 1.0

Observation 1 2 3 4 5 6 7 8 9 10
Age 27 28 31 35 35 40 42 43 50 52
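
All four columns of the frequency table can be derived from the ordered ages. The following Python sketch is one way to do it and is not taken from the lecture; the group boundaries are hard-coded as an assumption matching the 10-year age groups on the slides.

```python
from itertools import accumulate

ages = [27, 28, 31, 35, 35, 40, 42, 43, 50, 52]
groups = [(20, 29), (30, 39), (40, 49), (50, 59)]   # inclusive bounds
n = len(ages)

# Frequency: count of observations falling in each group.
freq = [sum(lo <= a <= hi for a in ages) for lo, hi in groups]
# Cumulative frequency: running total of the counts.
cum_freq = list(accumulate(freq))
# Relative and cumulative relative frequency: counts divided by n.
rel_freq = [f / n for f in freq]
cum_rel_freq = [c / n for c in cum_freq]

print("Age group  Freq  Cum freq  Rel freq  Cum rel freq")
for (lo, hi), f, cf, rf, crf in zip(groups, freq, cum_freq, rel_freq, cum_rel_freq):
    print(f"{lo}–{hi}      {f:<5} {cf:<9} {rf:<9.1f} {crf:.1f}")
```

The cumulative relative frequency is computed from the cumulative counts rather than by summing proportions, which avoids accumulating floating-point rounding error.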

30
Grouping Data: Percentiles—1

► The r-th percentile P_r is the value that is greater than or equal to r percent of a sample of n observations and less than or equal to (100 − r) percent of the observations

Percentile   Quartile   Percentage of observations falling below
P25          Q1         25%
P50          Q2         50%
P75          Q3         75%

31
Grouping Data: Percentiles—2

► Using the example (n = 10), a simple way to calculate quartiles:
  ► Q2 = median = average of the 5th and 6th observations → 37.5
  ► Q1 = median of the lower half of the data; third smallest value → 31
  ► Q3 = median of the upper half of the data; third largest value → 43

► What if n = 11 (or n is odd in general)?
  ► Include the median in both the upper and lower halves of the data
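
The "median of each half" rule, including the convention for odd n, can be written as a short function. This is a sketch of the simple method described on the slide; the function name `quartiles` is a hypothetical choice. Library routines such as `statistics.quantiles` use interpolation conventions that can give slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    x = sorted(values)
    n = len(x)
    half = n // 2
    if n % 2 == 0:
        lower, upper = x[:half], x[half:]
    else:
        # Odd n: include the median in both the lower and upper halves.
        lower, upper = x[:half + 1], x[half:]
    return median(lower), median(x), median(upper)

ages = [35, 40, 52, 27, 31, 42, 43, 28, 50, 35]
q1, q2, q3 = quartiles(ages)
print(q1, q2, q3)
```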

32
Recap

► The type of study design is important and may impact how you look at the data

► Looking at data is important for understanding patterns

► Exploratory data analysis techniques provide useful descriptions

► Data can be organized through stem-and-leaf displays and frequency distributions

33
Summarizing and Displaying Data from Studies

1
Methods for Organizing Quantitative Data II

Summarizing data
► Measures of central tendency
► Measures of dispersion
► Box-and-whiskers plots

Displaying data
► Graphs

2
Summarizing Data: Measures of Central Tendency

► Student ages:
  ► 35,40,52,27,31,42,43,28,50,35 = $X_1, X_2, X_3, \dots, X_{10}$ where $X_i$ represents the age of student $i$ and $n = 10$, which is the number of observations

► Mean (average):

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \dots + x_n}{n}$$

► Median = middle observation

► Mode = most frequent observation

3
Summarizing Data: Measures of Dispersion or Spread

► Range: difference between largest and smallest values

► Variance: “average” of the squared differences of observations from the sample mean

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

► Standard deviation: $s = \sqrt{s^2}$

4
Summarizing Data: Back to the Example

► Public health student ages: 27,28,31,35,35,40,42,43,50,52

► Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = 38.3$ years

► Mode = 35 years

► Median = (35 + 40)/2 = 37.5 years

► Range = 52 − 27 = 25 years

► Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = 74.7$ years²

► Standard deviation = $\sqrt{\text{variance}}$ = $s$ = 8.6 years
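
These summary measures can be verified with Python's standard `statistics` module. The sketch below is an illustration rather than part of the lecture; note that `statistics.variance` and `statistics.stdev` use the n − 1 denominator, matching the sample variance formula above.

```python
import statistics

ages = [27, 28, 31, 35, 35, 40, 42, 43, 50, 52]

mean = statistics.mean(ages)            # sum of the ages divided by n
median = statistics.median(ages)        # average of the 5th and 6th ordered values
mode = statistics.mode(ages)            # most frequent value
data_range = max(ages) - min(ages)      # largest minus smallest
variance = statistics.variance(ages)    # sample variance, denominator n - 1
sd = statistics.stdev(ages)             # square root of the sample variance

print(round(mean, 1), median, mode, data_range, round(variance, 1), round(sd, 1))
```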

5
Summarizing Data: Box-and-Whiskers Plots—1

► Graphical display using quartiles

► Terminology
  ► Upper hinge = Q3
  ► Median = Q2
  ► Lower hinge = Q1
  ► Interquartile range (IQR) = Q3 − Q1
    ● Contains the middle 50% of the observations
  ► Whiskers: lines drawn to the smallest and largest actual observations within the calculated fences

6
Summarizing Data: Box-and-Whiskers Plots—2

► What are fences?
  ► Fences are not observed data points
  ► Fences are calculated to provide guidelines for identifying outliers
    ● Upper fence = upper hinge + 1.5 * IQR = Q3 + 1.5 * IQR
    ● Lower fence = lower hinge − 1.5 * IQR = Q1 − 1.5 * IQR

► What are outliers?
  ► Outliers are actual observed data values falling beyond the calculated fences (higher or lower)

7
Summarizing Data: Box-and-Whiskers Plots—3

Back to the example of public health student ages

Interquartile range (IQR) = Q3 − Q1
= 43 − 31 = 12 years

Upper fence = Q3 + 1.5 * IQR
= 43 + 1.5 * 12 = 61 years

Lower fence = Q1 − 1.5 * IQR
= 31 − 1.5 * 12 = 13 years
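
The fence calculation is simple enough to express as a small function. In this sketch the function name `fences` and the parameter `k` (the 1.5 multiplier) are illustrative assumptions; the quartiles 31 and 43 are taken from the example.

```python
def fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) computed from the hinges Q1 and Q3."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

ages = [27, 28, 31, 35, 35, 40, 42, 43, 50, 52]
lower_fence, upper_fence = fences(31, 43)

# Observed values beyond either fence would be flagged as outliers.
outliers = [a for a in ages if a < lower_fence or a > upper_fence]
print(lower_fence, upper_fence, outliers)
```

Every age lies between 13 and 61, so the original ten observations contain no outliers and the whiskers extend to the minimum (27) and maximum (52).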

8
Summarizing Data: Box-and-Whiskers Plots—4

[Figure: Box plot of public health student ages]

9
Summarizing Data: Box-and-Whiskers Plots, Modified Example—1

► Suppose the data set contains two more students with ages 80 and 8 (n=12)

► All summary statistics need to be recalculated:
  ► Upper fence (UF) = Q3 + 1.5 * IQR = 72
  ► Lower fence (LF) = Q1 − 1.5 * IQR = 4

► The boxplot would now show the outlying value of the age of 80 years
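
The recalculated fences can be reproduced by applying the median-of-halves rule to the 12 observations. The following Python sketch is illustrative only; it shows why the age of 80 is flagged as an outlier while the age of 8 is not.

```python
from statistics import median

# Original ten ages plus the two additional students (80 and 8).
ages = [35, 40, 52, 27, 31, 42, 43, 28, 50, 35, 80, 8]
x = sorted(ages)

half = len(x) // 2                    # n = 12 is even, so the halves do not overlap
q1, q3 = median(x[:half]), median(x[half:])
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Outliers are observed values beyond the fences.
outliers = [v for v in x if v < lower_fence or v > upper_fence]
# Whiskers reach the most extreme observations that are still within the fences.
inside = [v for v in x if lower_fence <= v <= upper_fence]
whiskers = (min(inside), max(inside))

print(q1, q3, iqr, lower_fence, upper_fence, outliers, whiskers)
```

With Q1 = 29.5 and Q3 = 46.5, the age of 8 lies above the lower fence of 4, so it becomes the end of the lower whisker rather than an outlier.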

10
Summarizing Data: Box-and-Whiskers Plots, Modified Example—2

[Figure: Box plot of public health student ages]

11
Summarizing Data: Box-and-Whiskers Plots: Comparing Groups

► Box plots can be used to compare different groups
  ► For example, student ages in Class 1 and Class 2

12
Summarizing Data: Look for Skewness

► Positively skewed: more lower values, sparse higher values
  ► Also: long “tail” of higher values
  ► Also: mean > median > mode

► Negatively skewed: reverse of positively skewed

► Symmetric: not skewed in either direction

13
Summarizing Data: Types of Skewness

14
Summarizing Data: Look for Outliers

► Outlying values
► Values that are “far” from most values
► Importance: a few outlying values can strongly influence certain statistical summary measures and
analyses
► Example from student age data:

Data                                 Mean   Median   Mode
Original (n=10)                      38.3   37.5     35
With additional 80-year-old (n=11)   42.1   40       35
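
The table can be reproduced directly, which makes the sensitivity of each measure to the single outlying value explicit. This is a minimal Python sketch using the standard `statistics` module; the helper name `summarize` is a hypothetical choice.

```python
import statistics

original = [27, 28, 31, 35, 35, 40, 42, 43, 50, 52]
with_outlier = original + [80]        # one additional 80-year-old student

def summarize(data):
    """Return (mean rounded to 1 decimal, median, mode)."""
    return round(statistics.mean(data), 1), statistics.median(data), statistics.mode(data)

print("Original (n=10):", summarize(original))
print("With 80-year-old (n=11):", summarize(with_outlier))
```

The mean shifts by almost 4 years, the median by 2.5 years, and the mode not at all, which illustrates how strongly a single outlying value can influence the mean.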

15
Displaying Data: Graphs

Graphing on an Arithmetic vs. Logarithmic Scale

► On an arithmetic scale, each increment represents change by a constant amount

► On a logarithmic scale, each increment represents change by a constant multiplier

16
Displaying Data: Logarithmic Scales

► Logarithm to the base 10 (common log)
  ► $x = 10^y$ or $\log_{10}(x) = y$
  ► $\log_{10}(1) = 0$ since $10^0 = 1$
  ► $\log_{10}(100) = 2$ since $10^2 = 100$

► Logarithm to the base e (natural log)
  ► $x = e^y$ or $\log_e(x) = y$

17
Graphing Data: Why Use Logarithms?

► Allows plotting of numbers of different orders of magnitude on the same graph

► May aid in data analysis, data transformations

► May describe a biological relationship (e.g., exponential growth) more accurately
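
A short numerical check shows how a logarithmic scale turns a constant multiplier into a constant increment, which is what lets values of very different magnitudes share one axis. The sketch below uses Python's `math` module; the example values are arbitrary and chosen only for illustration.

```python
import math

# Values that each differ from the previous one by a constant multiplier of 10.
values = [1, 10, 100, 1000]

# On the log10 scale they are evenly spaced.
logs = [math.log10(v) for v in values]
steps = [b - a for a, b in zip(logs, logs[1:])]

print(logs)    # positions on a base-10 logarithmic axis
print(steps)   # spacing between consecutive values

# The natural log inverts the exponential: log_e(e^y) = y.
y = 2.5
recovered = math.log(math.exp(y))
print(recovered)
```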

18
Graphing Data: Example of an Arithmetic Scale on the Y-Axis

[Figure: Box-and-whiskers plots of medical expenditures ($) for individuals without a major smoking-caused disease (mscd–) versus with a major smoking-caused disease (mscd+) within age groups]

Source: National Medical Expenditures Data Set.

19


Graphing Data: Example of a Logarithmic Scale on the Y-Axis

[Figure: Box-and-whiskers plots of the log medical expenditures (log $)]

Source: National Medical Expenditures Data Set.

20


Recap

► Data can be summarized in percentiles, box-and-whisker plots, and measures of central tendency and dispersion

► Data may be displayed on graphs using either an arithmetic or logarithmic scale

21