Principles of Biostatistics
Third Edition
Marcello Pagano
Kimberlee Gauvreau
Heather Mattie
Third edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have
attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders
if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please
write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized
in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying,
microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are
not available on CCC please contact mpkbookspermissions@tandf.co.uk
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for
identification and explanation without intent to infringe.
DOI: 10.1201/9780429340512
Typeset in TeXGyreTermesX
by KnowledgeWorks Global Ltd.
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
Preface xiii
1 Introduction 1
1.1 Why Study Biostatistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Difficult Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Overview of the Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Part I: Chapters 2–4 Variability . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Part II: Chapters 5–8 Probability . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.3 Part III: Chapters 9–22 Inference . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.4 Computing Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
I Variability 13
2 Descriptive Statistics 15
2.1 Types of Numerical Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.1 Nominal Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.2 Ordinal Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.3 Ranked Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.4 Discrete Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.5 Continuous Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.1 Frequency Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.2 Relative Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 Bar Charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.2 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.3 Frequency Polygons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.4 Box Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.5 Two-Way Scatter Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.6 Line Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Numerical Summary Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.1 Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.2 Median . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.4.3 Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.4 Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4.5 Interquartile Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4.6 Variance and Standard Deviation . . . . . . . . . . . . . . . . . . . . . . . 39
2.5 Empirical Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.6 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.7 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4 Life Tables 89
4.1 Historical Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.2 Life Table as a Predictor of Longevity . . . . . . . . . . . . . . . . . . . . . . . . 95
4.3 Mean Survival . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.4 Median Survival . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.5 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.6 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
II Probability 109
5 Probability 111
5.1 Operations on Events and Probability . . . . . . . . . . . . . . . . . . . . . . . . 111
5.2 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3 Total Probability Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.4 Relative Risk and Odds Ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.5 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.6 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
16 Correlation 381
16.1 Two-Way Scatter Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
16.2 Pearson Correlation Coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
16.3 Spearman Rank Correlation Coefficient . . . . . . . . . . . . . . . . . . . . . . . 387
16.4 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
16.5 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
Bibliography 547
Glossary 569
Index 601
Preface
This book was written for students of the health sciences and serves as an introduction to the study
of biostatistics – the use of numbers and numerical techniques to extract information from data and
facts, and to then use this information to communicate scientific results. However, just as one can lie
with words, one can also lie with numbers. Indeed, numbers and lies have been linked for quite some
time; there is even a book titled How to Lie with Statistics. This association may owe its origin – or
its affirmation at the very least – to the British Prime Minister Benjamin Disraeli. Disraeli is credited
by Mark Twain as having said, “There are three kinds of lies: lies, damned lies, and statistics.” One
has only to observe any modern political campaign to be convinced of the abuse of statistics. But
enough about lies; this book adopts the position of Professor Frederick Mosteller, who said, “It is
easy to lie with statistics, but it is easier to lie without them.”
Background
Principles of Biostatistics is aimed at students in the biological and health sciences who wish to learn
traditional research methods. The first edition was based on a required course for graduate students
at the Harvard T.H. Chan School of Public Health, which is also attended by a large number of health
professionals from the Harvard medical area. The course is as old as the school itself, which attests
to its importance. It spans 16 weeks of lectures and laboratory sessions; the lab sessions reinforce
the material covered in lectures and introduce the computer into the course. We have included a
selection of lab materials – either additional examples, or a different perspective on the material
covered in a chapter – in the sections called Further Applications. These sections are designed to
provoke discussion, although they are sufficiently complete for an individual who is not using the
book as a course text to benefit from reading them.
The book includes a range of biostatistical topics, the majority of which can be covered at some
depth in one semester in an American university. However, there is enough material to allow the
instructor some flexibility. For example, some instructors may choose to omit the sections covering
the calculation of prevalence (Section 6.5) or the Poisson distribution (Section 7.3), or the chapter
on analysis of variance (Chapter 12), if they consider these concepts to be less important than others.
Structure
Some say that statistics is the study of variability and uncertainty. We believe there is truth to this
adage, and have used it as a guide to divide the book into three parts covering the basic principles
of vip: (1) variability, (2) inference, and (3) probability. For pedagogical purposes, inference and
probability are covered in reverse order in the text. Chapters 2 through 4 deal with the variability
inherent in collections of numbers, and the ways in which to summarize, explore, and explain
them. Chapters 5 through 8 focus on probability, and serve as an introduction to the tools needed
for the subsequent investigation of uncertainty. In Chapter 8 we distinguish between populations
and samples and begin to examine the variability introduced by sampling from a population, thus
progressing to inference in the book’s remaining chapters. We think that this modular introduction
to the quantification of uncertainty is justified by the success achieved by our students. Postponing
the slightly more difficult concepts until a solid foundation has been established makes it easier for
the reader to comprehend and retain them.
Computing
There is something about numbers – maybe a little magic – that makes them fun to study. The fun
is in the conceptualization more than the calculations, however, and we are fortunate that we have
the computer to do the drudge work. This allows students to concentrate on the concepts. In other
words, the computer allows the instructor to teach the poetry of statistics and not the plumbing.
To take advantage of the computer, one needs a good statistical package. We use Stata, a product
of the Stata Corporation in College Station, Texas, and also R, a software environment available for
free download. Stata is user-friendly, accurate, powerful, reasonably priced, and works on a number
of different platforms, including Windows, Unix, and Macintosh. R is available on an open-source
license, and also works on a number of platforms. It is a versatile and efficient programming language.
Other statistical packages are available, and this book can be supplemented by any one of them. We
strongly recommend that some statistical package be used for calculations.
Some of the review exercises in the text require the use of a computer. The required datasets
are available on the book’s companion website at https://github.com/Principles-of-Biostatistics/3rd-Edition.
There are also many exercises that do not require the computer. As always, active learning
yields better results than passive observation. To this end, we cannot stress enough the importance
of the review exercises, and urge the reader to attempt as many as time permits.
New to the Third Edition
• The material on Screening and Diagnostic Tests – formerly contained within the Probability
chapter – has been given its own chapter. This new chapter includes sections on likelihood ratios
and the concept of varying sensitivities.
• New sections on sample size calculations for two-sample tests on means and proportions, the
Kruskal-Wallis test, and the Cox proportional hazards model have been added to existing chapters.
• Concepts previously covered in a chapter titled Multiple 2 × 2 Tables have now been moved into
the Logistic Regression chapter.
• The chapter on Sampling Theory has been greatly expanded.
• A new chapter introducing the basic principles of Study Design has been added at the end of the
text.
• Datasets used in the text and those needed for selected review exercises are now available on the
book’s companion website at https://github.com/Principles-of-Biostatistics/3rd-Edition.
• The companion website also contains the Stata and R code used to produce the computer output
displayed in the text’s Further Applications sections, as well as introductory material describing
the use of both statistical packages.
• A glossary of definitions for important statistical terms has been added at the back of the book.
• As previously mentioned, mathematical notation and formulas used in the text have been included
in summary boxes at the end of each section for ease of reference.
• Additional review exercises have been included in each chapter.
In addition to these changes in content, previously used data have been updated whenever possible
to reflect more current public health information. As its name suggests, Principles of Biostatistics
covers topics which are fundamental to an introduction to biostatistics. Of course we have had to limit
the material presented, and some important topics have not been included. Decisions about what to
exclude were difficult, especially as the field of biostatistics and data science continues to evolve. No
small role in this evolution is played by the computer; the capacity of statistical software seems to
increase limitlessly, providing new and exciting inferential tools. However, to truly appreciate these
tools and to be able to utilize them properly requires a strong foundation in traditional statistical
principles. Those laid out in this text are still essential and will be useful to the reader both today
and in the future.
Acknowledgments
A debt of gratitude is owed to a number of people: former Harvard University President Derek
Bok for providing the support which got the first edition of this book off the ground, Dr. Michael
K. Martin for performing the calculations for the Statistical Tables section, John-Paul Pagano for
assisting in the editing of the first edition, and the individuals who reviewed the manuscript. We
thank the teaching assistants who have helped us teach our courses over the years and who have
made many valuable suggestions. Probably the most deserving of thanks are our students, who have
tolerated us as we learned how to best teach the material. We are still learning.
Marcello Pagano
Kimberlee Gauvreau
Heather Mattie
Boston, Massachusetts
1
Introduction
CONTENTS
1.1 Why Study Biostatistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Difficult Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Overview of the Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Part I: Chapters 2–4 Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Part II: Chapters 5–8 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.3 Part III: Chapters 9–22 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.4 Computing Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
In 1903, H.G. Wells hypothesized that statistical thinking would one day be as necessary for good
citizenship as the ability to read and write. Wells was correct, and today statistics play an important
role in many decision-making processes. For example, before any new drug or medical device can
be marketed legally in the United States, the United States Food and Drug Administration (fda)
requires that it be subjected to a clinical trial, an experimental study involving human subjects. The
data from this study are compiled and analyzed to determine not only whether the drug is effective,
but also if it is safe. How is this determined? As another example, the United States government’s
decisions regarding Social Security and public health programs rely in part on the longevity of the
nation’s population; the government must therefore be able to accurately predict the number of years
each individual will live. How does it do this? If the government incorrectly forecasts human life
expectancy, it could render itself insolvent and endanger the well-being of its citizens.
There are many other issues that must be addressed as well. Where should a government invest
its resources if it wishes to reduce infant mortality? Should a mastectomy always be recommended
to a patient with breast cancer? Should a child play football? What factors increase the risk that an
individual will develop coronary heart disease? Will we be able to afford our health care system in
the future? Does global warming impact the sea level? Our health? What effect would a particular
change in policy have on longevity? To answer these questions and others, we rely on the methods
of biostatistics.
DOI: 10.1201/9780429340512-1
FIGURE 1.1
Maternal mortality per 100,000 live births, 1990–2015
pregnancy, irrespective of the duration and site of the pregnancy, from any cause related to or
aggravated by the pregnancy or its management but not from accidental or incidental causes” [1].
Therefore, when presented with the graph in Figure 1.1 [2, 3], someone concerned with maternal
mortality might react with alarm at the reported striking behavior of the United States and research
the issue further.
How useful is the study of biostatistics? Biostatistics are certainly ubiquitous in the health
sciences. The Centers for Disease Control and Prevention (cdc) reports that “During the 20th century,
the health and life expectancy of persons residing in the United States improved dramatically. Since
1900, the average lifespan of persons in the United States has lengthened by greater than 30 years;
25 years of this gain are attributable to advances in public health” [4–6]. They go on to list what they
consider to be ten great achievements:
When one reads the recounting of these achievements in subsequent Morbidity and Mortality Weekly
Reports, it is evident that biostatistics played an important role in every one of them.
Notwithstanding these societal successes, work still needs to be done. The future, with its exabytes
of data – known as big data – providing amounts of information orders of magnitude larger than
were previously available, presents a new challenge. But if we are to progress responsibly, we cannot
ignore the lessons of the past [7]. A case in point is our failure to control the number of deaths from
guns that has led to a public health crisis in the United States. The statistic blared from a headline in
The New York Times in 2018 [8]: “nearly 40,000 people died from guns in u.s. last year, highest in
50 years.” This crisis looks even worse when one considers what is happening with mass shootings
in schools. The United States is experiencing a remarkable upward trend in the number of casualties
involved. There have been more school shooting deaths in the first 18 years of the 21st century (66)
than in the last 60 years of the 20th century (55). The same is true for injuries due to guns, with 260
and 81 in each of these two time periods, respectively [9]. A summary of this situation is made more
pithy by the statistics.
FIGURE 1.2
Racial breakdown of COVID-19 cases in the United States through May 28, 2020
(1) Chapters 2–4 discuss variability, (2) Chapters 5–8 cover probability, and (3) Chapters 9–22 cover
inference.
FIGURE 1.3
Racial breakdown of COVID-19 cases in the United States in 2020, by age
Chapter 3 focuses on rates and proportions, which are computed from data that assume only two
values. The notion of dividing a group into smaller subgroups or classes based
on a characteristic such as age or sex is also introduced. Grouping individuals into smaller, more
homogeneous subgroups decreases variability, thus allowing better prognosis. For example, it might
make sense to determine the mortality of females separately from that of males, or the mortality of
20- to 29-year-olds separately from 80- to 89-year-olds. Chapter 3 also investigates techniques that
allow us to make valid comparisons among populations whose compositions may differ substantially.
Chapter 4 introduces the classical life table, one of the most important numerical summary
techniques available in the health sciences. Life tables are used by public health professionals
to characterize the well-being of a population, and by insurance companies to predict how long
individuals will live. In this chapter, the study of mortality begun in Chapter 3 is extended to
incorporate the actual time to death for each individual, resulting in a more refined analysis.
Together, Chapters 2 through 4 demonstrate that the extraction of information from a collection
of measurements is not precluded by the variability among those measurements. Despite their vari-
ability, the data often exhibit a certain regularity as well. For example, here are the birth rates in the
United States among women 15–19 years of age over the 5-year time span shown [15]:
Are the numbers showing a natural variability around a constant rate over time – think of how many
mistakes can go into the reporting of such numbers – or is this indicative of a real downward trend?
This question deserves better than a simple choice between these two options. To answer it properly,
we need to apply the principles of probability and inference, the subjects covered in the next two
sections of the text.
FIGURE 1.4
Boxed salesman’s sample set of glass bottles, containing samples from the Crocker company (Buffalo,
New York) (photo courtesy of Judy Weaver Gonyeau) [16]
The ability to generalize results from a sample to a population is the bedrock of empirical
research, and a central issue in this book. One requirement for credible inference is that it be based
on a representative sample. In any particular study, do we truly have a representative sample? If we
answer yes, this leads to a logical conundrum. To truly judge that we have a representative sample we
need to know the entire population. And if we know the entire population, why then focus only on a
sample? If we do not have the ability to study an entire population, the best solution available is to
utilize a simple random sample of the population. This means, amongst other things, that everyone in
the population has an equal chance of being selected into the sample. It ensures us that, on average,
we have a representative sample. A pivotal side benefit of a simple random sample is that it also
provides an estimate of the possible inaccuracy of our inference.
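As a concrete illustration, the short R sketch below draws a simple random sample without replacement; the population register and the sample size of 250 are hypothetical stand-ins, not values from the text.
```r
# A minimal sketch of drawing a simple random sample in R; the
# population identifiers and sample size are invented for illustration.
set.seed(2022)                          # make the draw reproducible
population <- 1:10000                   # identifiers for a stand-in population
srs <- sample(population, size = 250)   # each individual has the same chance of selection
length(srs)                             # 250 sampled identifiers
```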
It can often be difficult to obtain a simple random sample. The consequences of mistakenly
thinking that a sample is representative when in fact it is not lead to invalid inferences. A case
in point is provided by the behavioral sciences, where empirical results are often derived from
individuals sampled from western, educated, industrialized, rich, and democratic (weird) societies.
An example of this is the undergraduate students who make a few extra dollars by volunteering to
be a subject for an on-campus study. Since most of these studies are done in the United States, we
can see the problem. Clearly the results will reflect the pool from which the subjects came. Use of
the label weird implies a certain contempt for a large number of published findings attacked in an
article by Henrich and colleagues [17]. They investigate results in the domains of visual perception,
fairness, cooperation, spatial reasoning, categorization and inferential induction, moral reasoning,
reasoning styles, self-concepts and related motivations, and the heritability of iq. They conclude
that “members of weird societies, including young children, are among the least representative
populations one could find for generalizing about humans.” Yet the researchers who published the
original results presumably believed that their samples were random and representative.
We have repeated this mistake in the bio-medical sciences, where the consequences can be even
more severe. For example, we do not perform as many clinical trials on children as on adults [18].
Trials of adults, even randomized clinical trials, are not representative of children. Children are not
small adults who simply require a modification in dosage. Some conditions – such as prematurity
and many of its sequelae – occur only in infants and children [19]. Certain genetic conditions such
as phenylketonuria (pku) will, if untreated, lead to severe disability or even death in childhood. The
diagnosis, prevention, and treatment of these conditions cannot be adequately investigated without
studying children. Other conditions such as influenza and certain cancers and forms of arthritis
also occur in both adults and children, but their pathophysiology, severity, course, and response to
treatment may be quite different for infants, children, and adolescents. Treatments that are safe and
effective for adults may be dangerous or ineffective for children.
There are many more examples where certain groups have been largely ignored by researchers.
The lack of trials in women [20] and people of color led Congress, in 1993, to pass the National
Institutes of Health Revitalization Act, which requires the agency to include more people from these
groups in their research studies. Unfortunately, success in the implementation of this law has been
slow [21]. The headline in Scientific American on September 1, 2018 – 25 years after the Act was
passed – was clinical trials have far too little racial and ethnic diversity; it’s unethical
and risky to ignore racial and ethnic minorities [22].
This problem extends beyond clinical trials. The 21st century has seen the mapping of the human
genome. Genome wide association studies (gwas) have identified thousands of genetic variants
identified with human traits and diseases. This exciting source of information is unfortunately
restricted, so inference is constrained or biased. A 2009 study showed that 96% of participants in
gwas studies were of European descent [23]. Seven years later this had decreased to 80%, largely
due to studies carried out in China and Japan; the Asian content has increased, but the representation
of other groups has not. Since gwas studies are the basis for precision medicine, this has raised the
fear that precision medicine will exacerbate racial health disparities [24]. This, of course, is a general
trait of artificial intelligence systems: they reflect the information that goes into them.
As an example of the value of inference, we can consider a group of investigators who were
interested in evaluating whether, at the time of their study, there was a difference in how analgesics
were administered to male versus female patients with acute abdominal pain. It would be impossible
to investigate this issue by observing every person in the world with acute abdominal pain, so they
designed a study of a smaller group of individuals with this ailment so they could, on the basis of
the sample, infer what was happening in the population as a whole. How far their inference should
reach is not our focus right now, but it is important to take notice of what they say. Here is a copy of
the abstract from the published article [25]:
objectives: Oligoanalgesia for acute abdominal pain historically has been attributed to the
provider’s fear of masking serious underlying pathology. The authors assessed whether a
gender disparity exists in the administration of analgesia for acute abdominal pain.
methods: This was a prospective cohort study of consecutive nonpregnant adults with
acute nontraumatic abdominal pain of less than 72 hours duration who presented to an urban
emergency department (ed) from April 5, 2004, to January 4, 2005. The main outcome mea-
sures were analgesia administration and time to analgesic treatment. Standard comparative
statistics were used.
results: Of the 981 patients enrolled (mean age ± standard deviation [sd] 41 ± 17 years;
65% female), 62% received any analgesic treatment. Men and women had similar mean pain
scores, but women were less likely to receive any analgesia (60% vs. 67%, difference 7%,
95% confidence interval (ci) = 1.1% to 13.6%) and less likely to receive opiates (45% vs.
56%, difference 11%, 95% ci = 4.1% to 17.1%). These differences persisted when gender-
specific diagnoses were excluded (47% vs. 56%, difference 9%, 95% ci = 2.5% to 16.2%).
After controlling for age, race, triage class, and pain score, women were still 13% to 25%
less likely than men to receive opioid analgesia. There was no gender difference in the receipt
of nonopioid analgesia. Women waited longer to receive their analgesia (median time 65
minutes vs. 49 minutes, difference 16 minutes, 95% ci = 3.5 to 33 minutes).
conclusions: Gender bias is a possible explanation for oligoanalgesia in women who
present to the ed with acute abdominal pain. Standardized protocols for analgesic adminis-
tration may ameliorate this discrepancy.
This is a fairly typical abstract in the health sciences literature – it reports on a clinical study and
uses statistics to describe the findings – so we look at it more closely. First consider the objectives
of the study. We are told that the goal is to discover whether there is a gender disparity in the
administration of drugs. This is not whether there was a difference in administering the drugs
between genders in this particular study – that question is easy to answer – but rather a more
ambitious finding; namely, is there something in this study that allows us to generalize the findings
to a broader population?
The abstract goes on to describe the methods utilized in the study, and then its results. We first
learn that the researchers studied a group of 981 patients. To allow the reader to get an understanding
of who these 981 patients are, they provide some descriptive statistics about the patients’ ages and
genders. This is done to lay the groundwork for generalizing the results of the study to individuals
not included in the study sample.
The investigators then start generalizing their results. We are told that even though men and
women suffered similar amounts of pain, women were less likely – 7% less likely – to receive any
analgesia. This difference of 7% is clearly study specific. Had they chosen fewer than 981 patients or
more, or even a different group of 981 patients, they likely would have observed a difference other
than 7%. How to quantify this potential variability from sample to sample – even though we have
observed only a single sample – and how to accommodate it when making inference, is answered
by the most useful and effective result in the book. It is an application of the theory covered in
Chapter 8, and is known as the central limit theorem.
An application of the central limit theorem allows the study investigators to construct a 95%
confidence interval for the difference in proportions, 1.1% to 13.6%. One way to interpret this
interval is to appeal to a thought experiment and repetition: If we were to sample repeatedly from
the underlying population, each sample might result in a difference other than 7%, and a confidence
interval other than 1.1% to 13.6%. However, 95% of these intervals from repeated sampling will
include the true population difference between the genders, whatever its value. The interpretations for
all the other confidence intervals in the abstract are similar. More general applications of confidence
intervals are introduced in Chapter 9, and examples appear throughout the text.
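To make the calculation concrete, here is a hedged R sketch using prop.test. The group counts are approximate reconstructions from the reported 981 patients, 65% female, and the 67% versus 60% treatment proportions; they are not figures given in the abstract, so the resulting interval only roughly matches the published 1.1% to 13.6%.
```r
# A hedged sketch: counts reconstructed approximately from the abstract,
# not the investigators' actual data.
men_treated   <- 230; men_total   <- 343   # ~67% of men received analgesia
women_treated <- 383; women_total <- 638   # ~60% of women received analgesia

# 95% confidence interval for the difference in proportions (men minus women)
prop.test(x = c(men_treated, women_treated),
          n = c(men_total, women_total))$conf.int
```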
For a study to be of general interest and usefulness, we must be able to extrapolate its findings
to a larger population. By generalizing in this manner, however, we inevitably introduce uncertainty.
There are various ways to measure and convey this uncertainty, and we cover two such inferential
methods in this book. One is to use confidence intervals, as we just saw in the abstract, and the other is
to use hypothesis testing. The latter is introduced in Chapter 10. The two methods are consistent with
each other, and will lead to the same action following a study. There are some questions, however,
that are best answered in the hypothesis testing framework.
As an example, consider the way we monitor the water supply for lead contamination [26].
In 1974, the United States Congress passed the Safe Drinking Water Act, and its enforcement is
a responsibility of the Environmental Protection Agency (epa). The epa determines the level of
contaminants in drinking water at which no adverse health effects are likely to occur, with an
adequate margin of safety. This level for lead is zero, which is untenable in practice. As a result, the epa established
a treatment technique, an enforceable procedure which water systems must follow to ensure control
of a contaminant. The treatment technique regulation for lead – referred to as the Lead and Copper
Rule [27] – requires water systems to control the corrosivity of water. The regulation stipulates that
to determine whether a system is safe, health regulators must sample taps in the system that are
more likely to have plumbing materials containing lead. The number of taps sampled depends on
the size of the system served. To accommodate aberrant local conditions, if 10% or fewer of the
sampled taps have more than 15 parts per billion (ppb) of lead, the system is considered safe. If
not, additional actions by the water authority are required. We can phrase this monitoring procedure
in a hypothesis testing framework: We wish to test the hypothesis that the water has 15 ppb or fewer
of lead. The action we take depends on whether we reject this hypothesis, or not. According to the
Lead and Copper Rule, the decision depends on the measured tap water samples. If more than 10%
of the water samples have more than 15 ppb, we reject the hypothesis and take corrective action.
Just as with diagnostic testing in Chapter 6, we have the potential to make the wrong decision
when conducting a hypothesis test. The chance of such an error is influenced by the way in which
the samples are chosen, how many samples we take, and the 10% cutoff rule. In 2015, the city of
Flint, Michigan, took water samples in order to check the level of lead in the water [28]. According
to the Lead and Copper Rule, they were supposed to take 100 samples from houses most likely to
have a lead problem. They did not. First, they took only 71 samples; second, they chose the 71 in
what seemed like a random fashion. Setting aside these contraventions, they found that 8 of the 71
samples had more than 15 ppb. This is more than 10% of the samples, and thus they were required to
alert the public and take corrective action. Instead, the State of Michigan forced Flint to drop two of
the water samples, both with more than 15 ppb of lead. This meant that there were only 69 samples,
and 6 had more than 15 ppb of lead. Thus fewer than 10% crossed the threshold, and the authorities
felt free to tell the residents of Flint that their water was fine. This is yet another example of ignoring
the message produced by the scientific method and having catastrophe follow [29]. It seems like the
lead problem is repeating itself, only this time in Newark, New Jersey [30].
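The arithmetic behind the 10% criterion can be checked directly. A minimal R sketch using the Flint counts quoted above:
```r
# Proportion of sampled taps with more than 15 ppb of lead; under the
# Lead and Copper Rule, a proportion above 0.10 triggers corrective action.
frac_over <- function(n_over, n_samples) n_over / n_samples
frac_over(8, 71)   # 0.113 > 0.10: corrective action required with all 71 samples
frac_over(6, 69)   # 0.087 <= 0.10: declared "safe" after two samples were dropped
```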
In Chapter 10 we apply hypothesis testing techniques to statements about the mean of a single
population, and in Chapter 11 extend these techniques to the comparison of two population means.
They are further generalized to the comparison of three or more means in Chapter 12. Chapter 13
continues the development of hypothesis testing concepts, but introduces techniques that allow the
relaxation of some of the assumptions necessary to carry out the tests. Chapters 14 and 15 develop
inferential methods that can be applied to enumerated data or counts – such as the numbers of cases
of sudden infant death syndrome among children put to sleep in various positions – rather than
continuous measurements.
Inference can also be used to explore the relationships among a number of different attributes,
with the underlying motivation being to reduce variability. If a full-term infant whose gestational age
is 39 weeks is born weighing 4 kilograms, or 8.8 pounds, no one would be surprised. If the infant’s
gestational age is only 22 weeks, however, then their weight would be cause for alarm. Why? We
know that birth weight tends to increase with gestational age, and, although it is extremely rare to
find a baby weighing 4 kilograms at 22 weeks, it is not uncommon at 39 weeks. There is sufficient
variability in birth weights to not be surprised to hear that an infant weighs 4 kilograms at birth,
but when the gestational age of the child is known, there is much less variability among infants of a
particular gestational age, and 4 kilograms may seem out of place. In other words, our measurements
have a more precise interpretation the more information we have about the measurement.
The study of the extent to which two factors are related is known as correlation analysis; this is
the topic of Chapter 16. If we wish to predict the outcome of one factor based on the value of another,
then regression is the appropriate technique. Simple linear regression is investigated in Chapter 17,
and is extended to the multiple regression setting – where two or more factors are used to predict
a single outcome – in Chapter 18. If the outcome of interest can take on only two possible values,
such as alive or dead, then an alternative technique must be applied; logistic regression is explored
in Chapter 19.
In Chapter 20, the inferential methods appropriate for life tables are introduced. These techniques
enable us to draw conclusions about the mortality of a population based on the experience of a sample
of individuals drawn from the population. This is common in clinical trials, especially in randomized
clinical trials, when the purpose of the trial is to study whether a patient’s survival has been prolonged
by a treatment [31].
Chapter 21 is devoted to surveys and inference in finite populations. These techniques are very
popular around election time in democracies, but also find many uses in public health. For example,
the United States Census Bureau supplements the decennial census with an annual survey called
the American Community Survey; its purpose is to help “local officials, community leaders, and
businesses understand the changes taking place in their communities. It is the premier source for detailed
population and housing information about our nation” [32]. In 2017, 2,145,639 households were
interviewed. Once again, the mainstay that enables us to make credible inference about the entire
United States population, 325.7 million people in 2017, is the simple random sample. We take that
as our starting point, and build on it with more refined designs. Practical examples are given by the
National Centers for Health Statistics within the cdc [33].
Once again it would be helpful if we could control variability and lessen its effect. Some survey
designs help in this regard. For example, if we can divide a population into strata where we know
the size of each stratum, we can take advantage of that extra information – the size of the strata – to
estimate the population characteristics more accurately via stratified sampling. If on the other hand
we wish to lower the cost of the survey, we can turn to cluster sampling. Of course, we can combine
these ideas and utilize both in a single survey. These design considerations and some of the issues
raised are addressed in this chapter.
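As a hedged sketch of why knowing the stratum sizes helps, the R fragment below weights the sample mean from each stratum by its known share of the population; the stratum names, sizes, and sample means are invented for illustration only.
```r
# Stratified estimate of a population mean: weight each stratum's sample
# mean by its known share of the population. All values are hypothetical.
N_h    <- c(young = 6000, old = 4000)   # known stratum sizes
ybar_h <- c(young = 2.1,  old = 3.4)    # sample means observed within each stratum
sum(N_h / sum(N_h) * ybar_h)            # weighted estimate of the population mean
```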
The last chapter, Chapter 22, could have been the first. Even though it is foundational, one needs
the material developed in the rest of the book to appreciate its content. It is here that we bolster the
belief that it is not just the numbers that count, but what they represent, and how they are obtained.
This was made quite clear during the covid-19 pandemic. The proper monitoring of a viral epidemic
and its course requires an enumeration of people infected by the virus. This, unfortunately, did
not happen. Miscounting of covid-19 cases occurred across the world [34], including the United
States [35,36]. One cannot help but think that this disinformation contributed to the resultant damage
from the pandemic.
Chapter 22 explores how best to design studies to take advantage of the methods described in
this book. It also should whet your appetite to study biostatistics further, as the story gets even more
fascinating. To quote what George Udny Yule wrote almost a century ago [37]:
When his work takes an investigator out of the field of the nearly perfect experiments, in
which the influence of disturbing causes is practically negligible, into the field of imperfect
experiment (or a fortiori of pure observation) where the influence of disturbing causes is
important, the first step necessary for him is to get out of the habit of thinking in terms of
the single observation and to think in terms of the average. Some seem never to get beyond
this stage. But the next stage is even more important, viz., to get out of the habit of thinking
in terms of the average, and think in terms of the frequency distribution. Unless and until he
does this, his conclusions will always be liable to fallacy.
1.4 Review Exercises
1. Design a study aimed at investigating an issue you believe might influence the health of
the world. Briefly describe the data you will require, how you will obtain them, how you
intend to analyze the data, and the method you will use to present your results. Keep this
study design and reread it after you have completed the text.
2. Suppose it is stated that in a given year, 512 million people around the world were
malnourished, up from 460 million just five years prior [38].
(a) Suppose that you sympathize with the point being made. Justify the use of these
numbers.
(b) Are you sure that the numbers are correct? Do you think it is possible that 513 million
people were malnourished during the year in question rather than 512 million?
3. In addition to stating that “the Chinese have eaten pasta since 1100 b.c.,” the label on
a box of pasta shells claims that “Americans eat 11 pounds of pasta per year,” whereas
“Italians eat 60 pounds per year.” Do you believe that these statistics are accurate? Would
you use these numbers as the basis for a nutritional study?
Part I
Variability
2
Descriptive Statistics
CONTENTS
2.1 Types of Numerical Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.1 Nominal Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.2 Ordinal Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.3 Ranked Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.4 Discrete Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.5 Continuous Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.1 Frequency Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.2 Relative Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 Bar Charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.2 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.3 Frequency Polygons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.4 Box Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.5 Two-Way Scatter Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.6 Line Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Numerical Summary Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.1 Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.2 Median . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.4.3 Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.4 Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4.5 Interquartile Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4.6 Variance and Standard Deviation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.5 Empirical Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.6 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.7 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Every study or experiment yields a set of data. Its size can range from a few measurements to
many millions of observations. A complete set of data, however, will not necessarily provide an
investigator with information that can be easily interpreted. For example, Table 2.1 lists the first
2560 cases of human immunodeficiency virus infection and acquired immunodeficiency syndrome
(hiv/aids) reported to the Centers for Disease Control and Prevention [39]. Each individual was
classified as either suffering from Kaposi sarcoma, designated by a 1, or not suffering from the disease,
represented by a 0. (Kaposi sarcoma is a malignant tumor which affects the skin, mucous membranes,
and lymph nodes.) Although Table 2.1 displays the entire set of outcomes, it is extremely difficult
to characterize the data. We cannot even identify the relative proportions of 0s and 1s. Between the
raw data and the reported results of the study lies some intelligent and imaginative manipulation of
the numbers carried out using the methods of descriptive statistics.
DOI: 10.1201/9780429340512-2
TABLE 2.1
Outcomes indicating whether an individual had Kaposi sarcoma for the first 2560 cases of hiv/aids
reported to the Centers for Disease Control and Prevention in Atlanta, Georgia
Descriptive statistics are a means of organizing and summarizing observations. They provide us
with an overview of the general features of a set of data. Descriptive statistics can assume a number
of different forms, including tables, graphs, and numerical summary measures. Before we decide
which techniques are the most appropriate in a given situation, however, we must first determine
what type of data we have.
TABLE 2.2
Eastern Cooperative Oncology Group’s classification of patient performance status
Status Definition
0 Patient fully active, able to carry on all pre-disease performance
without restriction
1 Patient restricted in physically strenuous activity but ambulatory
and able to carry out work of a light or sedentary nature
2 Patient ambulatory and capable of all self-care but unable to carry
out any work activities; up and about more than 50% of waking
hours
3 Patient capable of only limited self-care; confined to bed or chair
more than 50% of waking hours
4 Patient completely disabled; not capable of any self-care; totally
confined to bed or chair
TABLE 2.3
Ten leading causes of death in the United States, 2016
month. The average number of fatal motor vehicle accidents for these two months is 18.5, which is
not itself an integer.
TABLE 2.4
Cases of Kaposi sarcoma for the first 2560 hiv/aids patients reported to the Centers for Disease
Control and Prevention in Atlanta, Georgia
Kaposi Sarcoma    Number of Individuals
Yes               246
No                2314
Section 2.1 describes a gradation of numerical data that ranges from nominal to continuous.
As we progress, the nature of the relationship between possible data values becomes increasingly
complex. Distinctions must be made among the various types of data because different techniques
are used to analyze them. As previously mentioned, it does not make sense to speak of an average
blood type of 1.8; it does make sense, however, to refer to an average temperature of 36.1°C or
37.2°C, which are the lower and upper bounds for normal human body temperature.
2.2 Tables
Now that we are able to differentiate among the various types of data, we must learn how to identify
the statistical techniques that are most appropriate for describing each kind. Although a certain
amount of information is lost when data are summarized, a great deal can also be gained. A table
is perhaps the simplest means of summarizing a set of observations and can be used for all types of
numerical data.
TABLE 2.5
Cigarette consumption per person 18 years of age or older, United States, 1900–2015
Year    Number of Cigarettes
1900 54
1910 151
1920 665
1930 1485
1940 1976
1950 3522
1960 4171
1970 3985
1980 3851
1990 2828
1995 2505
2000 2076
2005 1717
2010 1278
2015 1078
TABLE 2.6
Absolute frequencies of serum cholesterol levels for 1067 United States males, aged 25 to 34 years
the greatest number of observations. Table 2.6 provides us with a much better understanding of the
data than would a list of 1067 cholesterol level readings. Although we have lost some information
– given the table, we can no longer recreate the raw data values – we have also extracted important
information that helps us to understand the distribution of serum cholesterol levels for this group of
males.
The fact that one kind of information is gained while another is lost holds true even for the simple
binary data in Tables 2.1 and 2.4. We might feel that we do not lose anything by summarizing these
data and counting the numbers of 0s and 1s, but in fact we do. For example, if there is some type
of trend in the observations over time – perhaps the proportion of hiv/aids patients with Kaposi
sarcoma is either increasing or decreasing as the epidemic matures – then this information is lost in
the summary.
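To illustrate how such a frequency distribution might be tabulated, here is a hedged R sketch. Because the raw measurements behind Table 2.6 are not reproduced here, simulated stand-in values are used; only the 40 mg/100 ml interval width mirrors the text.
```r
# A sketch only: simulated values stand in for the 1067 observed serum
# cholesterol levels, clamped to the 80-399 mg/100 ml range used in the text.
set.seed(1)
chol <- pmin(pmax(round(rnorm(1067, mean = 200, sd = 40)), 80), 399)

# group the continuous measurements into 40 mg/100 ml intervals
breaks <- seq(80, 400, by = 40)                    # 80-119, 120-159, ..., 360-399
freq   <- table(cut(chol, breaks = breaks, right = FALSE))
freq                                               # absolute frequency per interval
```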
Tables are most informative when they are not overly complex. As a general rule, tables and the
columns within them should always be clearly labeled. If units of measurement are involved, such
as mg/100 ml for the serum cholesterol levels in Table 2.6, these units should be specified.
TABLE 2.7
Absolute and relative frequencies of serum cholesterol levels for 2294 United States males
TABLE 2.8
Relative and cumulative relative frequencies in percentages of serum cholesterol levels for 2294
United States males
it is in Table 2.7. For example, 56.7% of the 25- to 34-year-olds have a serum cholesterol level less
than or equal to 199 mg/100 ml, whereas only 25.9% of the 55- to 64-year-olds fall into this category.
Because the relative proportions for the two groups follow this trend in every interval in the table,
the two distributions are said to be stochastically ordered. For any specified level, a larger proportion
of the older males have serum cholesterol readings above this value than do the younger males;
therefore, the distribution of cholesterol levels for the older males is stochastically larger than the
distribution for the younger males. This definition will start to make more sense when we encounter
random variables and probability distributions in Chapter 6. At that point, the implications of this
ordering will become more apparent.
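Continuing the stand-in sketch above, relative and cumulative relative frequencies in the spirit of Tables 2.7 and 2.8 can be obtained as follows.
```r
# Relative and cumulative relative frequencies (as percentages), built from
# the freq table of stand-in values created in the earlier sketch.
rel_freq <- prop.table(freq)           # relative frequency of each interval
round(100 * rel_freq, 1)               # percentage of observations in each interval
round(100 * cumsum(rel_freq), 1)       # cumulative percentage up to each interval
```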
2.3 Graphs
A second way to summarize and display data is through the use of graphs, or pictorial representations
of numerical data. Graphs should be designed so that they convey the general patterns in a set of
observations at a single glance. Although they are easier to read than tables, graphs often supply
a lesser degree of detail. Once again, however, the loss of detail may be accompanied by a gain in
understanding of the data. The most informative graphs are relatively simple and self-explanatory.
Like tables, they should be clearly labeled, and units of measurement should be indicated.
2.3.2 Histograms
Perhaps the most commonly used type of graph is the histogram. While a bar chart is a pictorial
representation of a frequency distribution for either nominal or ordinal data, a histogram depicts a
frequency distribution for discrete or continuous data. The horizontal axis displays the true limits of
the various intervals. The true limits of an interval are the points that separate it from the intervals
on either side. For example, the boundary between the first two classes of serum cholesterol level
in Table 2.6 is 119.5 mg/100 ml; it is the true upper limit of the interval 80–119 and the true lower
limit of 120–159. The vertical axis of a histogram depicts either the frequency or relative frequency
of observations within each interval.
The first step in constructing a histogram is to determine the scales of the axes. The vertical
scale should begin at zero; if it does not, visual comparisons among the intervals may be distorted.
Once the axes have been drawn, a vertical bar centered at the midpoint is placed over each interval.
The height of the bar marks the frequency associated with that interval. As an example, Figure 2.2
displays a histogram constructed from the serum cholesterol level data in Table 2.6.
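A minimal R sketch of this construction, again using the stand-in values from the earlier sketch rather than the actual data:
```r
# Histogram of the stand-in cholesterol values: bars span the true limits
# of each 40 mg/100 ml interval, and the vertical axis starts at zero.
hist(chol,
     breaks = seq(80, 400, by = 40),
     right  = FALSE,
     main   = "Stand-in serum cholesterol values",
     xlab   = "Serum cholesterol level (mg/100 ml)",
     ylab   = "Frequency")
```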
FIGURE 2.1
Bar chart: Major long-term health conditions experienced by Australian adults, 2014–2015; MBC =
mental and behavioral conditions
FIGURE 2.2
Histogram: Absolute frequencies of serum cholesterol levels for 1067 United States males, aged 25
to 34 years
FIGURE 2.3
Histogram: Relative frequencies of serum cholesterol levels for 1067 United States males, aged 25
to 34 years
In reality, the frequency associated with each interval in a histogram is represented not by the
height of the bar above it but by the bar’s area. Thus, in Figure 2.2, 1.2% of the total area corresponds
to the 13 observations that lie between 79.5 and 119.5 mg/100 ml, and 14.1% of the area corresponds
to the 150 observations between 119.5 and 159.5 mg/100 ml. The area of the entire histogram sums
to 100%, or 1. Note that the proportion of the total area corresponding to an interval is equal to the
relative frequency of that interval. As a result, a histogram displaying relative frequencies – such as
Figure 2.3 – will have the same shape as a histogram displaying absolute frequencies. Because it is
the area of each bar that represents the relative proportion of observations in an interval, care must
be taken when constructing a histogram with unequal interval widths; the height must vary along
with the width so that the area of each bar remains in proper proportion.
FIGURE 2.4
Frequency polygon: Absolute frequencies of serum cholesterol levels for 1067 United States males,
aged 25 to 34 years
ml, and drop off a little more quickly to the left of this value than they do to the right. Most of the
observations lie between 120 and 280 mg/100 ml, and all are between 80 and 400 mg/100 ml.
Because they can be easily superimposed, frequency polygons are superior to histograms for
comparing two or more sets of data. Figure 2.5 displays the frequency polygons of the serum
cholesterol level data presented in Table 2.7. Since the older males tend to have higher serum
cholesterol levels, their polygon lies to the right of the polygon for the younger males.
Although its horizontal axis is the same as that for a standard frequency polygon, the vertical axis
of a cumulative frequency polygon displays cumulative relative frequencies. A point is placed at the
true upper limit of each interval; the height of the point represents the cumulative relative frequency
associated with that interval. The points are then connected by straight lines. Like frequency polygons,
cumulative frequency polygons may be used to compare sets of data. This is illustrated in Figure 2.6.
By noting that the cumulative frequency polygon for 55- to 64-year-old males lies to the right of the
polygon for 25- to 34-year-old males for each value of serum cholesterol level, we can see that the
distribution for older males is stochastically larger than the distribution for younger males.
Cumulative frequency polygons can also be used to obtain the percentiles of a set of data. The
95th percentile is a value which is greater than or equal to 95% of the observations and less than or
equal to the remaining 5%. Similarly, the 75th percentile is a value which is greater than or equal to
75% of the observations and less than or equal to the other 25%. This definition is only approximate
because taking 75% of an integer does not typically result in another integer; consequently, there
is often some rounding or interpolation involved. In Figure 2.6, the 50th percentile of the serum
cholesterol levels for the group of 25- to 34-year-olds – the value that is greater than or equal to half
of the observations and less than or equal to the other half – is approximately 193 mg/100 ml; the
50th percentile for the 55- to 64-year-olds is about 226 mg/100 ml.
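Reading a percentile off a cumulative frequency polygon amounts to linear interpolation between the true upper limits of the intervals. The short Python sketch below illustrates the idea; the first two cumulative values follow from the 1.2% and 14.1% relative frequencies quoted above, while the remaining values are hypothetical rather than the actual Figure 2.6 numbers.

import numpy as np

# Cumulative relative frequencies (%) at the true upper limit of each
# interval for the 25- to 34-year-old males; values after the second
# entry are illustrative only.
upper_limits = np.array([119.5, 159.5, 199.5, 239.5, 279.5, 319.5, 359.5, 399.5])
cum_rel_freq = np.array([1.2, 15.3, 54.0, 84.2, 94.9, 98.6, 99.8, 100.0])

# Interpolating the polygon at 50% approximates the 50th percentile (median)
median_approx = np.interp(50, cum_rel_freq, upper_limits)
print(round(float(median_approx), 1))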
Percentiles are useful for describing the shape of a distribution. For example, if the 40th and 60th percentiles of a set of data lie an equal distance away from the midpoint, and the same is true of the 30th and 70th percentiles, the 20th and 80th, and all other pairs of percentiles that sum to 100, then the data are symmetric; that is, the distribution of values has the same shape on each side of the 50th percentile. Alternatively, if there are a number of outlying observations on one side of the midpoint only, then the data are said to be skewed. If these observations are smaller than the rest of the values, the data are skewed to the left; if they are larger than the other measurements, the data are skewed to the right. The various shapes that a distribution of data can assume are discussed further in Section 2.4.
FIGURE 2.5
Frequency polygon: Relative frequencies of serum cholesterol levels for 2294 United States males
FIGURE 2.6
Cumulative frequency polygon: Cumulative relative frequencies of serum cholesterol levels for 2294 United States males
FIGURE 2.7
Box plot: Crude death rates for each state in the United States, 2016
FIGURE 2.8
Box plots: Crude death rates for each state in the United States, 1996, 2006, and 2016
The lines projecting out from the box on either side extend to the adjacent values of the plot.
The adjacent values are the most extreme observations in the data set that are not more than 1.5
times the height of the box beyond either quartile. In Figure 2.7, 1.5 times the height of the box is
1.5× (969.3−794.1) = 262.8 per 100,000 population. Therefore, the adjacent values are the smallest
and largest observations in the data set which are not more extreme than 794.1 − 262.8 = 531.3 and
969.3 + 262.8 = 1232.1 per 100,000 population, respectively. Since there is no crude death rate less
than 531.3, the lower adjacent value is simply the minimum value, 587.1 per 100,000. There is one
value higher than 1232.1 – the maximum value of 1241.4 per 100,000 – and thus the upper adjacent
value is 1078.8 per 100,000, the next largest value. In fairly symmetric data sets, the adjacent values
should contain approximately 99% of the measurements. All points outside this range are represented
by circles; these observations are considered to be outliers, or data points which are not typical of
the rest of the values. It should be noted that the preceding explanation is merely one way to define
a box plot; other definitions exist and exhibit varying degrees of complexity [47].
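As a rough illustration of the adjacent value calculation, here is a Python sketch using the quartiles quoted for Figure 2.7. The list of crude death rates is hypothetical, except that it includes the minimum, maximum, and upper adjacent values mentioned in the text.

# Quartiles of the 2016 crude death rates (per 100,000) reported for Figure 2.7
q1, q3 = 794.1, 969.3
step = 1.5 * (q3 - q1)            # 1.5 times the height of the box = 262.8
lower_fence = q1 - step           # 531.3
upper_fence = q3 + step           # 1232.1

# Hypothetical subset of the 51 crude death rates; only a few values shown
rates = [587.1, 652.4, 710.0, 805.3, 900.2, 1078.8, 1241.4]

inside = [r for r in rates if lower_fence <= r <= upper_fence]
lower_adjacent = min(inside)      # most extreme value not below the lower fence
upper_adjacent = max(inside)      # most extreme value not above the upper fence
outliers = [r for r in rates if r < lower_fence or r > upper_fence]
print(lower_adjacent, upper_adjacent, outliers)   # 587.1 1078.8 [1241.4]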
Because the box plot displays only a summary of the data on a single axis, it can be used to make
comparisons across groups or over time. Figure 2.8, for example, contains summaries of crude death
rates for the 50 states and the District of Columbia for three different calendar years: 1996, 2006,
and 2016 [46]. The 25th, 50th, and 75th percentiles of crude death rate all decrease from 1996 to
2006, but then increase again in 2016.
FIGURE 2.9
Two-way scatter plot: Forced vital capacity versus forced expiratory volume in one second for
nineteen asthmatic subjects
in a study investigating the physical effects of sulfur dioxide exposure [48]. Forced vital capacity is
the volume of air that can be expelled from the lungs in six seconds, and forced expiratory volume
in one second is the volume that can be expelled after one second of constant effort. Note that the
individual represented by the point that is farthest to the left had an fev1 measurement of 2.0 liters
and an fvc measurement of 2.8 liters. (There are only 18 points marked on the graph instead of 19
because two individuals had identical values of fvc and fev1 ; consequently, one point lies directly on
top of another.) As might be expected, the graph indicates that there is a strong relationship between
these two quantities; fvc increases in magnitude as fev1 increases.
FIGURE 2.10
Line graph: Reported rates of malaria by year, United States, 1940–2015
FIGURE 2.11
Line graph: Health care expenditures as a percentage of gross domestic product (gdp) for the United
States and Canada, 1970–2017
FIGURE 2.12
Leading causes of death in South Africa, 1997–2013, in thousands of deaths; colored bands from
bottom to top represent other causes, digestive, nervous, endocrine, respiratory system, circulatory
system, neoplasm, infectious, external causes, blood and immune disorders
In this section, we have not attempted to examine all possible types of graphs. Instead, we have
included only a selection of the more common ones. It should be noted that many other imaginative
displays exist [52]. One such example is Figure 2.12, which displays the leading causes of death in
South Africa from 1997 through 2013 [53]. The top border of the light blue segment at the bottom is
actually a line graph tracking the number of deaths due to “other causes” – those not represented by
the nine colored bands above it – over the 17-year time period. The purple segment above this shows
the number of deaths due to diseases of the digestive system in each year; the top border of this
segment displays the number of deaths due to other and digestive causes combined. The top of the
uppermost blue segment displays the total number of deaths due to all causes in each calendar year,
allowing us to see that the number of deaths in South Africa increased from 1997 through 2006, and
then decreased from 2006 through 2013. Some of this decrease can be attributed to a fall in deaths
due to diseases of the respiratory system, the bright pink band; note that this band becomes more
narrow beginning in the late 2000s. The number of deaths due to infectious disease – the light green
band – decreased after 2009. Deaths due to many of the other causes have not changed much over
this time period, as evidenced by the segments of constant height.
Regardless of the type of display being used, as a general rule, too much information should not
be squeezed into a single graph. A relatively simple illustration is often the most effective.
2.4.1 Mean
The most frequently used measure of central tendency is the arithmetic mean, or average. The mean
is calculated by summing all the observations in a set of data and dividing by the total number of
measurements. In Table 2.9, for example, we have 13 observations. If $x$ is used to represent fev1, then $x_1 = 2.30$ denotes the first in the series of observations; $x_2 = 2.15$, the second; and so on up through $x_{13} = 3.38$. In general, $x_i$ refers to a single fev1 measurement where the subscript $i$ can take on any value from 1 to $n$, the total number of observations in the group. The mean of the observations in the dataset – represented by $\bar{x}$, or x-bar – is
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$
Note that we have used some mathematical shorthand. The uppercase Greek letter sigma, $\sum$, is the symbol for summation. The expression $\sum_{i=1}^{n} x_i$ indicates that we should add up the values of all of the observations in the group, from $x_1$ to $x_n$. When $\sum$ appears in the text, the limits of summation are placed beside it; when it does not, the limits are above and below it. Both representations of a summation denote exactly the same thing. In some cases where it is clear that we are supposed to sum all observations in a dataset, the limits may be dropped altogether. For the fev1 measurements,
$$\bar{x} = \frac{1}{13} \sum_{i=1}^{13} x_i = \frac{1}{13}(2.30 + 2.15 + 3.50 + 2.60 + 2.75 + 2.82 + 4.05 + 2.25 + 2.68 + 3.00 + 4.02 + 2.85 + 3.38) = \frac{38.35}{13} = 2.95 \text{ liters}.$$
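This calculation is easy to verify with a few lines of Python; the 13 fev1 values are those listed in Table 2.9.

# fev1 measurements (liters) for the 13 subjects in Table 2.9
fev1 = [2.30, 2.15, 3.50, 2.60, 2.75, 2.82, 4.05,
        2.25, 2.68, 3.00, 4.02, 2.85, 3.38]

x_bar = sum(fev1) / len(fev1)
print(round(x_bar, 2))   # 2.95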
The mean can be used as a summary measure for both discrete and continuous measurements. In
general, however, it is not appropriate for either nominal or ordinal data. Recall that for these types
of observations, the numbers are merely labels; even if we choose to represent the blood types o, a,
b, and ab by the numbers 1, 2, 3, and 4, an average blood type of 1.8 is meaningless.
TABLE 2.9
Forced expiratory volumes in 1 second for 13 adolescents suffering from asthma
One exception to this rule applies when we have dichotomous data, and the two possible outcomes
are represented by the values 0 and 1. In this situation, the mean of the observations is equal to
the proportion of 1s in the data set. For example, suppose that we want to know the proportion of
asthmatic adolescents in the previously described study who are males. Listed in Table 2.10 are the
relevant dichotomous data; the value 1 represents a male and 0 designates a female. If we compute
the mean of these observations, we find that
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{13}(0 + 1 + 1 + 0 + 0 + 1 + 1 + 1 + 0 + 1 + 1 + 1 + 0) = \frac{8}{13} = 0.615.$$
Therefore, 61.5% of the study subjects are males. It would have been a little more difficult to
determine the relative frequency of males, however, if we had represented males by the value 5 and
females by 12.
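In Python, the same shortcut applies directly, since the mean of the 0/1 indicators in Table 2.10 is the proportion of 1s.

# Sex indicators from Table 2.10: 1 = male, 0 = female
sex = [0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0]
proportion_male = sum(sex) / len(sex)
print(round(proportion_male, 3))   # 0.615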
The method for calculating the mean takes into consideration the magnitude of each and every
observation in a set of data. What happens when one observation has a value that is very different
from the others? Suppose, for instance, that for the data shown in Table 2.9, we had accidentally
recorded the fev1 measurement of subject 11 as 40.2 rather than 4.02 liters. The mean fev1 of all
13 subjects would then be calculated as
$$\bar{x} = \frac{1}{13}(2.30 + 2.15 + 3.50 + 2.60 + 2.75 + 2.82 + 4.05 + 2.25 + 2.68 + 3.00 + 40.2 + 2.85 + 3.38) = \frac{74.53}{13} = 5.73 \text{ liters},$$
which is nearly twice as large as it was before. Clearly, the mean is extremely sensitive to unusual values. In this particular example, we would have rightfully questioned an fev1 measurement of 40.2 liters and would have either corrected the error or separated this observation from the others. In general, however, the error might not be as obvious, or the unusual observation might not be an error at all. Since it is our intent to characterize an entire group of individuals, we might prefer to use a summary measure that is not as sensitive to each and every observation.
TABLE 2.10
Indicators of sex for 13 adolescents suffering from asthma
Subject Sex
1 0
2 1
3 1
4 0
5 0
6 1
7 1
8 1
9 0
10 1
11 1
12 1
13 0
2.4.2 Median
One measure of central tendency which is not as sensitive to the value of each measurement is the
median. Like the mean, the median can be used as a summary measure for discrete and continuous
measurements. It can also be used for ordinal data. The median is defined as the
50th percentile of a set of measurements; if a list of observations is ranked from smallest to largest,
then half the values would be greater than or equal to the median, and the other half would be less
than or equal to it. If a set of data contains a total of n observations where n is odd, the median is the
middle value, or the [(n + 1)/2]th largest measurement; if n is even, the median is usually taken to
be the average of the two middlemost values, the (n/2)th and [(n/2) + 1]th observations. If we were
to rank the 13 fev1 measurements listed in Table 2.9, for example, the following sequence would
result:
2.15, 2.25, 2.30, 2.60, 2.68, 2.75, 2.82, 2.85, 3.00, 3.38, 3.50, 4.02, 4.05.
Since there are an odd number of observations in the list, the median would be the (13 + 1)/2 = 7th
observation, or 2.82. Seven of the measurements are less than or equal to 2.82 liters, and seven are
greater than or equal to 2.82.
The calculation of the median takes into consideration only the ordering and relative magnitude
of the observations in a set of data. In the situation where the fev1 of subject 11 was recorded as
40.2 rather than 4.02, the ranking of the measurements would change only slightly:
2.15, 2.25, 2.30, 2.60, 2.68, 2.75, 2.82, 2.85, 3.00, 3.38, 3.50, 4.05, 40.2.
As a result, the median fev1 would still be 2.82 liters. The median is said to be robust; that is, it is
much less sensitive to unusual data points than is the mean.
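A brief Python sketch makes the contrast concrete, using the Table 2.9 values with and without the erroneous 40.2 liter entry.

from statistics import mean, median

fev1 = [2.30, 2.15, 3.50, 2.60, 2.75, 2.82, 4.05,
        2.25, 2.68, 3.00, 4.02, 2.85, 3.38]
# Same data with subject 11's value mistakenly recorded as 40.2 liters
fev1_error = [40.2 if x == 4.02 else x for x in fev1]

print(round(mean(fev1), 2), median(fev1))              # 2.95 2.82
print(round(mean(fev1_error), 2), median(fev1_error))  # 5.73 2.82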
2.4.3 Mode
A third measure of central tendency is the mode; it can be used as a summary measure for all types
of data, although it is most useful for categorical measurements. The mode of a set of values is the
observation that occurs most frequently. The continuous fev1 data in Table 2.9 do not have a unique
mode since each of the values occurs only once. It is not uncommon for continuous measurements to have either no unique mode or more than one. This is less likely to occur with nominal or ordinal
measurements. For example, the mode for the dichotomous data in Table 2.10 is 1; this value appears
eight times, whereas 0 appears only five times.
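The mode can be found by simply tabulating frequencies; a one-line check in Python:

from collections import Counter

sex = [0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0]   # Table 2.10 indicators
print(Counter(sex).most_common(1))               # [(1, 8)]: the mode is 1, appearing 8 times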
The best measure of central tendency for a given set of data often depends on the way in which
the values are distributed. If continuous or discrete measurements are symmetric and unimodal –
meaning that, if we were to draw a histogram or a frequency polygon, there would be only one peak,
as in the smoothed distribution pictured in Figure 2.13(a) – then the mean, the median, and the mode
should all be roughly the same. If the distribution of values is symmetric but bimodal, so that the
corresponding frequency polygon would have two peaks as in Figure 2.13(b), then the mean and
median should again be the same. Note, however, that this common value could lie between the
two peaks, and hence be a measurement that is extremely unlikely to occur. A bimodal distribution
often indicates that the population from which the values are taken actually consists of two distinct
subgroups that differ in the characteristic being measured; in this situation, it might be better to report
two modes rather than the mean or the median, or to treat the two subgroups separately. The data
in Figure 2.13(c) are skewed to the right, and those in Figure 2.13(d) are skewed to the left. When
the data are not symmetric, as in these two figures, the median is often the best measure of central
tendency. Because the mean is sensitive to extreme observations, it is pulled in the direction of the
outlying data values. As a result, the mean might end up either excessively inflated or excessively
deflated. Note that when the data are skewed to the right, the mean lies to the right of the median;
when they are skewed to the left, the mean lies to the left of the median. In both instances, the mean
is pulled in the direction of the extreme values.
Regardless of the measure of central tendency used in a particular situation, it can be misleading
to assume that this value is representative of all observations in the group. One example that illustrates
this point was included in an episode of the popular news program “60 Minutes,” where it was noted
that although the French diet tends to be high in fat and cholesterol, France has a fairly low rate of
heart disease relative to other countries, including the United States. This paradox was attributed to
the French habit of drinking wine with meals, red wine in particular. Studies have suggested that
moderate alcohol consumption can lessen the risk of heart disease. The per capita intake of wine in
France is one of the highest in the world, and the program implied that the French drink a moderate
amount of wine each day, perhaps two or three glasses. The reality may be quite different, however.
According to a wine industry survey, more than half of all French adults never drink wine at all [55].
Of those who do, only 28% of males and 11% of females drink it daily. Obviously the distribution is
far more variable than the “typical value” would suggest. Remember that when we summarize a set
of data, information is always lost. Thus, although it is helpful to know where the center of a dataset
lies, this information is usually not sufficient to characterize an entire distribution of measurements.
As another example, the two very different distributions of data values pictured in Figure 2.14
have the same means, medians, and modes. To know how good our measure of central tendency
actually is, we need to have some idea about the variation among the measurements. Do all the
observations tend to be quite similar and therefore lie close to the center, or are they spread out
FIGURE 2.13
Possible distributions of data values: (a) unimodal, (b) bimodal, (c) right-skewed, (d) left-skewed
across a broad range of values? To answer this question, we need to calculate a measure of the
variability among values, also called a measure of dispersion.
2.4.4 Range
One number that can be used to describe the variability in a set of data is the range. The range of a
group of measurements is defined as the difference between the largest and the smallest observations.
Although the range is easy to compute, its usefulness is limited; it considers only the extreme values of
a dataset rather than the majority of the observations. Therefore, like the mean, it is highly sensitive
to exceptionally large or exceptionally small values. The range for the fev1 data in Table 2.9 is
4.05 − 2.15 = 1.90 liters. If the fev1 of subject 11 was recorded as 40.2 instead of 4.02 liters,
however, the range would be 40.2 − 2.15 = 38.05 liters, a value 20 times as large.
FIGURE 2.14
Two distributions with identical means, medians, and modes
largest. If nk/100 is an integer, then the kth percentile of the data is the average of the (nk/100)th
and (nk/100 + 1)th largest observations. If nk/100 is not an integer, then the kth percentile is the
( j + 1)th largest measurement, where j is the largest integer which is less than nk/100. To find the
25th percentile of the 13 fev1 measurements, for example, we first note that 13(25)/100 = 3.25 is
not an integer. Therefore, the 25th percentile is the 3 + 1 = 4th largest measurement (since 3 is the
largest integer less than 3.25), or 2.60 liters. Similarly, 13(75)/100 = 9.75 is not an integer, and the
75th percentile is the 9 + 1 = 10th largest measurement, or 3.38 liters. The interquartile ranges of
daily glucose levels measured at each minute over a 24-hour period for a total of 90 days – as well as
10th and 90th percentiles – are presented for a single individual in Figure 2.15. These interquartile
ranges allow us to determine at which times of day glucose has the most variability, and when there
is less variability.
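The percentile rule described above is straightforward to code. The sketch below defines a small helper function (hypothetical, not part of any statistical package) and reproduces the 25th and 75th percentiles of the fev1 data.

def percentile(data, k):
    # kth percentile (0 < k < 100) using the rule described in the text
    x = sorted(data)
    n = len(x)
    position = n * k / 100
    if position == int(position):
        j = int(position)
        return (x[j - 1] + x[j]) / 2   # average of the (nk/100)th and next largest values
    j = int(position)                  # largest integer less than nk/100
    return x[j]                        # the (j + 1)th largest measurement

fev1 = [2.30, 2.15, 3.50, 2.60, 2.75, 2.82, 4.05,
        2.25, 2.68, 3.00, 4.02, 2.85, 3.38]
print(percentile(fev1, 25), percentile(fev1, 75))   # 2.6 3.38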
FIGURE 2.15
Medians and interquartile ranges of daily glucose levels measured over a 24-hour period for a total
of 90 days
A mathematically equivalent formula is the more commonly used
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2,$$
which is based on the squared difference of each measurement from the sample mean x̄. Although
less intuitive, this formula is easier to calculate by hand.
For the 13 fev1 measurements presented in Table 2.9, the mean is x̄ = 2.95 liters, and the
difference and squared difference of each observation from the mean is given below.
Subject    x_i    x_i − x̄    (x_i − x̄)²
1 2.30 −0.65 0.4225
2 2.15 −0.80 0.6400
3 3.50 0.55 0.3025
4 2.60 −0.35 0.1225
5 2.75 −0.20 0.0400
6 2.82 −0.13 0.0169
7 4.05 1.10 1.2100
8 2.25 −0.70 0.4900
9 2.68 −0.27 0.0729
10 3.00 0.05 0.0025
11 4.02 1.07 1.1449
12 2.85 −0.10 0.0100
13 3.38 0.43 0.1849
Total 38.35 0.00 4.6596
Therefore, the variance of the 13 fev1 measurements is
$$s^2 = \frac{4.6596}{12} = 0.39 \text{ liters}^2.$$
The standard deviation of a set of values is the positive square root of the variance. Thus, for the
13 fev1 measurements above, the standard deviation is equal to
$$s = \sqrt{s^2} = \sqrt{0.39 \text{ liters}^2} = 0.62 \text{ liters}.$$
In practice, the standard deviation is used more frequently than the variance. This is primarily
because the standard deviation has the same units of measurement as the mean, rather than squared
units. In a comparison of two sets of measurements, the group with the smaller standard deviation
has the more homogeneous observations. The group with the larger standard deviation exhibits a
greater amount of variability. The actual magnitude of the standard deviation depends on the values
in the dataset; what is large for one set of data may be small for another. In addition, because the
standard deviation has units of measurement, it is meaningless to compare standard deviations for
two unrelated quantities, such as age and weight.
Together, these two numbers, a measure of central tendency and a measure of dispersion, can be
used to summarize an entire distribution of values. It is most common to see the standard deviation
reported with the mean, and either the range or the interquartile range reported with the median.
Mean: $\bar{x} = \dfrac{1}{n} \sum_{i=1}^{n} x_i$
Variance: $s^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \dfrac{1}{2n(n-1)} \sum_{i=1}^{n} \sum_{j=1,\, j \neq i}^{n} (x_i - x_j)^2$
Standard deviation: $s = \sqrt{s^2} = \sqrt{\dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
The interval 236.8 ± (1 × 43.8), or (193.0, 280.6), contains approximately 67% of the total cholesterol measurements; 236.8 ± (2 × 43.8), or (149.2, 324.4), contains 95%; and 236.8 ± (3 × 43.8), or (105.4, 368.2), contains nearly all of the observations. In fact, for the 4380 measurements, 69.9% are between 193.0 and 280.6 mg/dL, 96.0% are between 149.2 and 324.4 mg/dL, and 99.4% are between 105.4 and 368.2 mg/dL. The empirical rule allows us to use the mean and the standard deviation of a set of data, just two numbers, to describe the entire group.
Interpretation of the magnitude of the mean is enhanced by the empirical rule. As previously
noted, however, in order to apply the empirical rule, a distribution of data values must be at least
approximately symmetric and unimodal. The closer the distribution is to this ideal, the more precise
the descriptions provided by the rule. Deviations from the ideal – especially if they are extreme – not
only invalidate the use of the empirical rule, but even call into question the usefulness of the mean
and standard deviation as numerical summary measures.
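One way to see how well the empirical rule describes a particular data set is to count the observations falling within one, two, and three standard deviations of the mean. A Python sketch follows, with a small hypothetical sample standing in for the 4380 Framingham cholesterol measurements.

from statistics import mean, stdev

# Hypothetical cholesterol-like values; any roughly symmetric, unimodal
# data set could be substituted here.
values = [236.8, 201.5, 250.3, 180.0, 275.2, 222.9, 310.4, 190.7, 260.1, 240.6]

x_bar, s = mean(values), stdev(values)
for k in (1, 2, 3):
    lower, upper = x_bar - k * s, x_bar + k * s
    fraction = sum(lower <= v <= upper for v in values) / len(values)
    print(k, round(fraction, 2))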
Returning to the Framingham Heart Study, consider the reported average number of cigarettes
smoked per day at the time of enrollment. In addition to this discrete measurement, the researchers
also collected a binary measurement of smoking status: smoker versus non-smoker. If d is used to
represent smoking status (taking the value 1 for a smoker, and 0 for a non-smoker), while x represents
the average number of cigarettes smoked per day, then the ith individual in the group has a pair
of measurements $(d_i, x_i)$. The subscript $i$ takes on any value from 1 to 4402, the total number of
subjects in the study for whom these values were recorded.
Figure 2.17 displays the x values, the average numbers of cigarettes smoked per day. Note that
these values are not symmetric and unimodal, and therefore the empirical rule should not be applied.
Beyond that, however, we might wonder whether the mean is providing any useful information at
FIGURE 2.16
Total cholesterol measurements at the time of enrollment for individuals participating in the Fram-
ingham Heart Study
all. Recall that we introduced the mean as a measure of central tendency, a “typical” value for
a set of measurements. Knowing that the center for the number of cigarettes smoked per day is
x̄ = 9.0 is not particularly helpful. The problem is that there are really two distinct groups of study
subjects: smokers and non-smokers. The mean of the x values ignores the information contained in
d. Cigarette consumption for the individuals who do not smoke – the 51% of the total cohort for
whom $d_i = 0$ – is 0 cigarettes per day, resulting in a mean value of 0 for this subgroup. For the
subgroup of smokers – those for whom $d_i = 1$ – the mean cigarette consumption is 18.4 cigarettes
per day. The overall mean of x̄ = 9.0 is not representative of either of these subgroups. It might be
useful for the manufacturer who is trying to determine how many cigarettes to make, but it does not
help us to understand the health of the population. Instead of attempting to capture the situation with
a single mean, it is more informative to present two numerical summary measures: the proportion
of the population who smokes, and the mean number of cigarettes smoked per day only for the
subgroup of smokers. (Since the binary measurements of smoking status are represented by 0s and
1s, the proportion of 1s – equivalently, the proportion of smokers – is simply the mean of the d i s.)
These two numbers give us a more complete summary of the data. Of course, reporting two means
also complicates the interpretation. Suppose that we want to track changes in smoking habits over
time. With a single mean, it is easy to see whether cigarette consumption is increasing or decreasing;
with two means, it is not. What if fewer people smoke over time, but those who do smoke increase
their consumption? Can this be considered an improvement in health?
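A short Python sketch of the two-summary approach, using hypothetical (d, x) pairs rather than the actual Framingham records:

# Hypothetical (d, x) pairs: d = 1 for a smoker, x = cigarettes smoked per day
pairs = [(0, 0), (1, 20), (0, 0), (1, 10), (1, 30), (0, 0), (1, 15), (0, 0)]

proportion_smokers = sum(d for d, _ in pairs) / len(pairs)
smokers = [x for d, x in pairs if d == 1]
mean_among_smokers = sum(smokers) / len(smokers)
print(proportion_smokers, mean_among_smokers)   # 0.5 18.75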
Additional complexity is introduced if we are dealing with a rare event. The information in
Table 2.11 was presented as part of an argument about the loss of human lives attributable to
guns [57]. The entries in the table show the number of deaths over each year from 2009 through
2015, by country, attributed to mass shootings. Although there is some disagreement on how to
define a mass shooting, here it is defined as an incident resulting in four or more fatalities. The
FIGURE 2.17
Average number of cigarettes smoked per day at the time of enrollment for individuals participating
in the Framingham Heart Study
argument utilizing these data focused on a contrast between the United States and Europe. The
authors took a country’s mean number of deaths per year over the seven-year period and divided
by its population size to calculate the “annual death rate from mass shootings per million people.”
Doing this, the United States ranked eighth highest, and it was therefore claimed that it is safer to live
in the United States than in the seven European countries which ranked higher. We might consider,
however, whether this metric is the most meaningful way to summarize these data.
First, note that there are currently 44 countries in Europe, but only 16 are listed in Table 2.11.
These 16 countries were selected because they had at least one mass shooting episode over the
seven-year period. Since the majority of European countries had no mass shootings at all, the sample
of countries shown is not representative. To more fairly compare the situation in Europe to that in
the United States, all countries must be included.
Second, just as with the cigarette consumption measurements from the Framingham Heart Study,
we should consider two dimensions of these data rather than just one: the frequency of mass shootings,
and the number of fatalities when a shooting does occur. Both of these pieces of information are
important. To better understand the frequency of mass shootings, Table 2.12 contains the number of
mass shootings in each year from 2009 to 2015. Over the seven-year period, there were six shootings
in France, and two in Belgium, Russia, Serbia, and Switzerland. Each of the other countries in the
table had just one mass shooting. The 28 European countries not shown in the table had none at all.
At the country level, a mass shooting is a rare event, and the mean number of shootings per year is
not a helpful summary measure, as all the means are low. In contrast, over the same time period, the
United States had 25 shootings, the same number as all of Europe combined. In fact, looking at the
last two rows of the table, the behavior of the two regions is quite similar.
Some might say that a fairer comparison would take into account the relative population sizes of
Europe and the United States. This would certainly be true if we believe that a certain fixed proportion
of a population are potential mass shooters, and therefore a larger population would produce more of
TABLE 2.11
Number of deaths per year attributed to mass shootings, 2009–2015
Country 2009 2010 2011 2012 2013 2014 2015 Total Mean Median
Albania 0 0 0 0 0 4 0 4 0.57 0
Austria 0 0 0 0 4 0 0 4 0.57 0
Belgium 0 0 6 0 0 4 0 10 1.43 0
Czech Republic 0 0 0 0 0 0 9 9 1.29 0
Finland 5 0 0 0 0 0 0 5 0.71 0
France 0 0 0 8 0 0 150 158 22.60 0
Germany 13 0 0 0 0 0 0 13 1.86 0
Italy 0 0 0 0 0 0 4 4 0.57 0
Macedonia 0 0 0 5 0 0 0 5 0.71 0
Netherlands 0 0 6 0 0 0 0 6 0.86 0
Norway 0 0 67 0 0 0 0 69 9.86 0
Russia 0 0 0 6 6 0 0 12 1.71 0
Serbia 0 0 0 0 13 0 0 19 2.17 0
Slovakia 0 7 0 0 0 0 0 7 1.00 0
Switzerland 0 0 0 0 4 0 4 8 1.14 0
United Kingdom 0 12 0 0 0 0 0 12 1.71 0
United States 38 12 18 66 16 12 37 199 28.40 18
TABLE 2.12
Number of mass shootings per year, 2009–2015
Country 2009 2010 2011 2012 2013 2014 2015 Total Mean Median
Albania 0 0 0 0 0 1 0 1 0.14 0
Austria 0 0 0 0 1 0 0 1 0.14 0
Belgium 0 0 1 0 0 1 0 2 0.28 0
Czech Republic 0 0 0 0 0 0 1 1 0.14 0
Finland 1 0 0 0 0 0 0 1 0.14 0
France 0 0 0 1 0 0 1 6 0.86 0
Germany 1 0 0 0 0 0 0 1 0.14 0
Italy 0 0 0 0 0 0 1 1 0.14 0
Macedonia 0 0 0 1 0 0 0 1 0.14 0
Netherlands 0 0 1 0 0 0 0 1 0.14 0
Norway 0 0 1 0 0 0 0 1 0.14 0
Russia 0 0 0 1 1 0 0 2 0.28 0
Serbia 0 0 0 0 1 0 0 2 0.28 0
Slovakia 0 1 0 0 0 0 0 1 0.14 0
Switzerland 0 0 0 0 1 0 1 2 0.28 0
United Kingdom 0 1 0 0 0 0 0 1 0.14 0
Europe 2 2 4 4 4 1 8 25 3.57 3
United States 4 2 3 6 3 3 4 25 3.57 4
FIGURE 2.18
Frequency of number of fatalities per mass shooting, 2009–2015
these individuals. The population of Europe is more than twice that of the United States – in 2019,
there were approximately 740 million people in Europe and 330 million in the United States – and
that ratio has been fairly consistent since 2009. Therefore, we can conclude that the proportion of
mass shooters in the United States is more than twice as high as in Europe.
Going a step further, the description above did not account for the number of fatalities in each
shooting. Figure 2.18 displays the frequencies with which each number of fatalities occurred. In
Europe, the mean number of fatalities is 13.7 per event, while in the United States it is 8.0. Note,
however, the three outlying values, representing 26 deaths in Newtown, Connecticut in 2012, 67 in
Norway in 2011, and 130 in France in 2015. We have seen that the mean is affected by outlying
values. To assess their impact, we might consider excluding the outliers and recalculating the means.
If we do this, the means are 6.4 fatalities per event in Europe and 7.2 in the United States, both
of which are much lower; furthermore, the mean for Europe is now smaller than the mean for the
United States.
In summary, there are instances when a single mean does not provide an accurate representation
of a complex situation, especially when comparisons are being made. To fully understand what is
happening when comparing gun violence in the United States and Europe, the annual death rate
from mass shootings per million people does not give the whole picture, especially when Europe is
represented by a hand-picked sample of countries, chosen in a biased way so as to make a political
point. We might convey a better understanding of the data by noting that the frequency of mass
shootings was the same in Europe and the United States over the seven-year period from 2009
through 2015, with 25 mass shootings in each region, even though the population of Europe is more
than twice as large. There were three mass shootings with exceptionally high numbers of fatalities,
noted above, and excluding these, the mean numbers of deaths per event were 6.4 in Europe and 7.2
in the United States.