Principles of Biostatistics
Third Edition

Marcello Pagano
Kimberlee Gauvreau
Heather Mattie
Third edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742

and by CRC Press


4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

CRC Press is an imprint of Taylor & Francis Group, LLC

© 2022 Taylor & Francis Group, LLC

Second edition published in 2000 by Brooks/Cole and then Cengage Learning

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have
attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders
if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please
write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized
in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying,
microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are
not available on CCC please contact mpkbookspermissions@tandf.co.uk

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for
identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data


Names: Pagano, Marcello, 1945- author. | Gauvreau, Kimberlee, 1963- author.
| Mattie, Heather, author.
Title: Principles of biostatistics / Marcello Pagano, Kimberlee Gauvreau,
Heather Mattie.
Description: Third edition. | Boca Raton : CRC Press, 2022. | Revised
edition of: Principles of biostatistics / Marcello Pagano, Kimberlee
Gauvreau. 2nd ed. c2000. | Includes bibliographical references and
index.
Identifiers: LCCN 2021057073 (print) | LCCN 2021057074 (ebook) | ISBN
9780367355807 (hardback) | ISBN 9781032252445 (paperback) | ISBN
9780429340512 (ebook)
Subjects: LCSH: Biometry.
Classification: LCC QH323.5 .P34 2022 (print) | LCC QH323.5 (ebook) | DDC
570.1/5195--dc23/eng/20211223
LC record available at https://lccn.loc.gov/2021057073
LC ebook record available at https://lccn.loc.gov/2021057074

ISBN: 978-0-367-35580-7 (hbk)


ISBN: 978-1-032-25244-5 (pbk)
ISBN: 978-0-429-34051-2 (ebk)

DOI: 10.1201/9780429340512

Typeset in TeXGyreTermesX
by KnowledgeWorks Global Ltd.

Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.

Access the Support Material: www.routledge.com/9780367355807


This book is dedicated with love to
Phyllis, Marisa, John-Paul, Camille and Ivy,
Neil and Eliza,
Ali, Bud, Connie, Nanette, Steve, Katie and Buddy
Contents

Preface xiii

1 Introduction 1
1.1 Why Study Biostatistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Difficult Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Overview of the Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Part I: Chapters 2–4 Variability . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Part II: Chapters 5–8 Probability . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.3 Part III: Chapters 9–22 Inference . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.4 Computing Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

I Variability 13
2 Descriptive Statistics 15
2.1 Types of Numerical Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.1 Nominal Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.2 Ordinal Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.3 Ranked Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.4 Discrete Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.5 Continuous Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.1 Frequency Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.2 Relative Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 Bar Charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.2 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.3 Frequency Polygons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.4 Box Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.5 Two-Way Scatter Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.6 Line Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Numerical Summary Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.1 Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.2 Median . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.4.3 Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.4 Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4.5 Interquartile Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4.6 Variance and Standard Deviation . . . . . . . . . . . . . . . . . . . . . . . 39
2.5 Empirical Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.6 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.7 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56


3 Rates and Standardization 67


3.1 Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.2 Adjusted Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.2.1 Direct Standardization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.2.2 Indirect Standardization . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.3 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.4 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

4 Life Tables 89
4.1 Historical Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.2 Life Table as a Predictor of Longevity . . . . . . . . . . . . . . . . . . . . . . . . 95
4.3 Mean Survival . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.4 Median Survival . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.5 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.6 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

II Probability 109
5 Probability 111
5.1 Operations on Events and Probability . . . . . . . . . . . . . . . . . . . . . . . . 111
5.2 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3 Total Probability Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.4 Relative Risk and Odds Ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.5 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.6 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

6 Screening and Diagnostic Tests 135


6.1 Sensitivity and Specificity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.2 Bayes’ Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.3 Likelihood Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.4 ROC Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.5 Calculation of Prevalence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.6 Varying Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.7 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.8 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

7 Theoretical Probability Distributions 159


7.1 Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.2 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
7.3 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
7.4 Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
7.5 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
7.6 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

8 Sampling Distribution of the Mean 191


8.1 Sampling Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
8.2 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
8.3 Applications of the Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . 193
8.4 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
8.5 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204

III Inference 207


9 Confidence Intervals 209
9.1 Two-Sided Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
9.2 One-Sided Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
9.3 Student’s t Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
9.4 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
9.5 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222

10 Hypothesis Testing 227


10.1 General Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
10.2 Two-Sided Tests of Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
10.3 One-Sided Tests of Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
10.4 Types of Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
10.5 Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
10.6 Sample Size Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
10.7 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
10.8 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249

11 Comparison of Two Means 253


11.1 Paired Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
11.2 Independent Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
11.2.1 Equal Variances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
11.2.2 Unequal Variances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
11.3 Sample Size Estimation for Two Means . . . . . . . . . . . . . . . . . . . . . . . 266
11.4 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
11.5 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274

12 Analysis of Variance 279


12.1 One-Way Analysis of Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
12.1.1 The Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
12.1.2 Sources of Variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
12.2 Multiple Comparisons Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . 286
12.3 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
12.4 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293

13 Nonparametric Methods 297


13.1 Sign Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
13.2 Wilcoxon Signed-Rank Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
13.3 Wilcoxon Rank Sum Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
13.4 Kruskal-Wallis Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
13.5 Advantages and Disadvantages of Nonparametric Methods . . . . . . . . . . . . . 311
13.6 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
13.7 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318

14 Inference on Proportions 323


14.1 Normal Approximation to the Binomial Distribution . . . . . . . . . . . . . . . . 324
14.2 Sampling Distribution of a Proportion . . . . . . . . . . . . . . . . . . . . . . . . 326
14.3 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
14.4 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
14.5 Sample Size Estimation for One Proportion . . . . . . . . . . . . . . . . . . . . . 330
14.6 Comparison of Two Proportions . . . . . . . . . . . . . . . . . . . . . . . . . . . 332

14.7 Sample Size Estimation for Two Proportions . . . . . . . . . . . . . . . . . . . . 335


14.8 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
14.9 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345

15 Contingency Tables 351


15.1 Chi-Square Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
15.1.1 2 × 2 Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
15.1.2 r × c Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
15.2 McNemar’s Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
15.3 Odds Ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
15.4 Berkson’s Fallacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
15.5 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
15.6 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373

16 Correlation 381
16.1 Two-Way Scatter Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
16.2 Pearson Correlation Coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
16.3 Spearman Rank Correlation Coefficient . . . . . . . . . . . . . . . . . . . . . . . 387
16.4 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
16.5 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395

17 Simple Linear Regression 399


17.1 Regression Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
17.2 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
17.2.1 Population Regression Line . . . . . . . . . . . . . . . . . . . . . . . . . 402
17.2.2 Method of Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
17.2.3 Inference for Regression Coefficients . . . . . . . . . . . . . . . . . . . . 408
17.2.4 Inference for Predicted Values . . . . . . . . . . . . . . . . . . . . . . . . 410
17.3 Evaluation of the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
17.3.1 Coefficient of Determination . . . . . . . . . . . . . . . . . . . . . . . . . 413
17.3.2 Residual Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
17.3.3 Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
17.4 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
17.5 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425

18 Multiple Linear Regression 431


18.1 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
18.1.1 Least Squares Regression Equation . . . . . . . . . . . . . . . . . . . . . 432
18.1.2 Inference for Regression Coefficients . . . . . . . . . . . . . . . . . . . . 434
18.1.3 Indicator Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
18.1.4 Interaction Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
18.2 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
18.3 Evaluation of the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
18.4 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
18.5 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451

19 Logistic Regression 455


19.1 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
19.1.1 Logistic Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
19.1.2 Fitted Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
19.2 Indicator Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
19.3 Multiple Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464

19.4 Simpson’s Paradox . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466


19.5 Interaction Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
19.6 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
19.7 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
19.8 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474

20 Survival Analysis 479


20.1 Life Table Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481
20.2 Product-Limit Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
20.3 Log-Rank Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
20.4 Cox Proportional Hazards Model . . . . . . . . . . . . . . . . . . . . . . . . . . 495
20.5 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496
20.6 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505

21 Sampling Theory 509


21.1 Sampling Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
21.1.1 Simple Random Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 512
21.1.2 Systematic Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
21.1.3 Stratified Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
21.1.4 Cluster Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
21.1.5 Ratio Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
21.1.6 Two-Stage Cluster Sampling . . . . . . . . . . . . . . . . . . . . . . . . . 523
21.1.7 Design Effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
21.1.8 Nonprobability Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
21.2 Sources of Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
21.3 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
21.4 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535

22 Study Design 537


22.1 Randomized Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
22.1.1 Control Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
22.1.2 Randomization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
22.1.3 Blinding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
22.1.4 Intention to Treat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
22.1.5 Crossover Trial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
22.1.6 Equipoise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
22.2 Observational Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
22.2.1 Cross-Sectional Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
22.2.2 Longitudinal Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
22.2.3 Case-Control Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
22.2.4 Cohort Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
22.2.5 Consequences of Design Flaws . . . . . . . . . . . . . . . . . . . . . . . . 544
22.3 Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
22.4 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546

Bibliography 547

Glossary 569

Statistical Tables 583

Index 601
Preface

This book was written for students of the health sciences and serves as an introduction to the study
of biostatistics – the use of numbers and numerical techniques to extract information from data and
facts, and to then use this information to communicate scientific results. However, just as one can lie
with words, one can also lie with numbers. Indeed, numbers and lies have been linked for quite some
time; there is even a book titled How to Lie with Statistics. This association may owe its origin – or
its affirmation at the very least – to the British Prime Minister Benjamin Disraeli. Disraeli is credited
by Mark Twain as having said, “There are three kinds of lies: lies, damned lies, and statistics.” One
has only to observe any modern political campaign to be convinced of the abuse of statistics. But
enough about lies; this book adopts the position of Professor Frederick Mosteller, who said, “It is
easy to lie with statistics, but it is easier to lie without them.”

Background
Principles of Biostatistics is aimed at students in the biological and health sciences who wish to learn
traditional research methods. The first edition was based on a required course for graduate students
at the Harvard T.H. Chan School of Public Health, which is also attended by a large number of health
professionals from the Harvard medical area. The course is as old as the school itself, which attests
to its importance. It spans 16 weeks of lectures and laboratory sessions; the lab sessions reinforce
the material covered in lectures and introduce the computer into the course. We have included a
selection of lab materials – either additional examples, or a different perspective on the material
covered in a chapter – in the sections called Further Applications. These sections are designed to
provoke discussion, although they are sufficiently complete for an individual who is not using the
book as a course text to benefit from reading them.
The book includes a range of biostatistical topics, the majority of which can be covered at some
depth in one semester in an American university. However, there is enough material to allow the
instructor some flexibility. For example, some instructors may choose to omit the sections covering
the calculation of prevalence (Section 6.5) or the Poisson distribution (Section 7.3), or the chapter
on analysis of variance (Chapter 12), if they consider these concepts to be less important than others.

Structure
Some say that statistics is the study of variability and uncertainty. We believe there is truth to this
adage, and have used it as a guide to divide the book into three parts covering the basic principles
of vip: (1) variability, (2) inference, and (3) probability. For pedagogical purposes, inference and
probability are covered in reverse order in the text. Chapters 2 through 4 deal with the variability
inherent in collections of numbers, and the ways in which to summarize, explore, and explain
them. Chapters 5 through 8 focus on probability, and serve as an introduction to the tools needed
for the subsequent investigation of uncertainty. In Chapter 8 we distinguish between populations
and samples and begin to examine the variability introduced by sampling from a population, thus
progressing to inference in the book’s remaining chapters. We think that this modular introduction
to the quantification of uncertainty is justified by the success achieved by our students. Postponing
the slightly more difficult concepts until a solid foundation has been established makes it easier for
the reader to comprehend and retain them.

Datasets and Examples


Throughout the text we have used data drawn from published studies to illustrate biostatistical
concepts. Not only is real data more meaningful, it is usually more interesting as well. Of course, we
do not wish to use examples in which the subject matter is too esoteric or too complex. To this end,
we have been guided by the backgrounds and interests of our students – primarily topics in public
health and clinical research – to choose examples that best demonstrate the concepts at hand.
There is some risk involved in using published data. We cannot guarantee that all of the examples
are honest and that the data were properly collected; for this we must rely on the reputations of our
sources. We do not belittle the importance of this consideration. The value of our inference depends
critically on the worth of the data, and we strongly recommend that a good deal of effort be expended
on evaluating its quality. We assume that this is understood by the reader.
In some cases we have used examples in which the population of the United States is broken
down along racial lines. In reporting these official statistics we follow the lead of the government
agencies that release them. We do not wish to ratify this racial categorization, since the observed
differences may well be due to socioeconomic factors rather than the implied racial ones. One option
would be to ignore these statistics; however, this would hide inequities which exist in our health
system – inequities that need to be eliminated. We focus attention on the problem in the hope of
stimulating interest in promoting solutions.
We have minimized the use of mathematical notation because of its well-deserved reputation of
being the ultimate jargon. If used excessively, it can intimidate even the most ardent scholar. We
do not wish to eliminate it entirely, however; it has been developed over the ages to be helpful in
communicating results. In this third edition, mathematical notation and important formulas used in
the text have also been included in summary boxes at the ends of relevant sections.

Computing
There is something about numbers – maybe a little magic – that makes them fun to study. The fun
is in the conceptualization more than the calculations, however, and we are fortunate that we have
the computer to do the drudge work. This allows students to concentrate on the concepts. In other
words, the computer allows the instructor to teach the poetry of statistics and not the plumbing.
To take advantage of the computer, one needs a good statistical package. We use Stata, a product
of the Stata Corporation in College Station, Texas, and also R, a software environment available for
free download. Stata is user-friendly, accurate, powerful, reasonably priced, and works on a number
of different platforms, including Windows, Unix, and Macintosh. R is available on an open-source
license, and also works on a number of platforms. It is a versatile and efficient programming language.
Other statistical packages are available, and this book can be supplemented by any one of them. We
strongly recommend that some statistical package be used for calculations.
Some of the review exercises in the text require the use of a computer. The required datasets
are available on the book’s companion website at https://github.com/Principles-of-Biostatistics/3rd-Edition.
There are also many exercises that do not require the computer. As always, active learning
yields better results than passive observation. To this end, we cannot stress enough the importance
of the review exercises, and urge the reader to attempt as many as time permits.
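
As a minimal illustration of this workflow, the short R sketch below reads a comma-separated dataset and produces a few descriptive summaries. The filename is a placeholder, not a file named in the text; the actual dataset names and formats are listed on the companion website.

# "dataset.csv" is a placeholder for a file downloaded from the companion website
study_data <- read.csv("dataset.csv")

head(study_data)      # the first few records
summary(study_data)   # basic descriptive summaries of each variable
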

New to the Third Edition


The third edition continues in the spirit of the first edition, but has been updated to reflect some
of the advances of the last 30 years. It includes revised and expanded discussions on many topics
throughout the book. Major revisions include:
• The chapters on Data Presentation and Numerical Summary Measures from the second edition
have been streamlined and combined into a single chapter titled Descriptive Statistics.
• The chapter on Life Tables has been rewritten, and detailed calculations for the life table have
been moved into the Further Applications section.

• The material on Screening and Diagnostic Tests – formerly contained within the Probability
chapter – has been given its own chapter. This new chapter includes sections on likelihood ratios
and the concept of varying sensitivities.
• New sections on sample size calculations for two-sample tests on means and proportions, the
Kruskal-Wallis test, and the Cox proportional hazards model have been added to existing chapters.
• Concepts previously covered in a chapter titled Multiple 2 × 2 Tables have now been moved into
the Logistic Regression chapter.
• The chapter on Sampling Theory has been greatly expanded.
• A new chapter introducing the basic principles of Study Design has been added at the end of the
text.
• Datasets used in the text and those needed for selected review exercises are now available on the
book’s companion website at https://github.com/Principles-of-Biostatistics/3rd-Edition.
• The companion website also contains the Stata and R code used to produce the computer output
displayed in the text’s Further Applications sections, as well as introductory material describing
the use of both statistical packages.
• A glossary of definitions for important statistical terms has been added at the back of the book.
• As previously mentioned, mathematical notation and formulas used in the text have been included
in summary boxes at the end of each section for ease of reference.
• Additional review exercises have been included in each chapter.
In addition to these changes in content, previously used data have been updated whenever possible
to reflect more current public health information. As its name suggests, Principles of Biostatistics
covers topics which are fundamental to an introduction to biostatistics. Of course we have had to limit
the material presented, and some important topics have not been included. Decisions about what to
exclude were difficult, especially as the field of biostatistics and data science continues to evolve. No
small role in this evolution is played by the computer; the capacity of statistical software seems to
increase limitlessly, providing new and exciting inferential tools. However, to truly appreciate these
tools and to be able to utilize them properly requires a strong foundation in traditional statistical
principles. Those laid out in this text are still essential and will be useful to the reader both today
and in the future.

Acknowledgments
A debt of gratitude is owed to a number of people: former Harvard University President Derek
Bok for providing the support which got the first edition of this book off the ground, Dr. Michael
K. Martin for performing the calculations for the Statistical Tables section, John-Paul Pagano for
assisting in the editing of the first edition, and the individuals who reviewed the manuscript. We
thank the teaching assistants who have helped us teach our courses over the years and who have
made many valuable suggestions. Probably the most deserving of thanks are our students, who have
tolerated us as we learned how to best teach the material. We are still learning.

Marcello Pagano
Kimberlee Gauvreau
Heather Mattie

Boston, Massachusetts
1
Introduction

CONTENTS
1.1 Why Study Biostatistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Difficult Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Overview of the Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Part I: Chapters 2–4 Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Part II: Chapters 5–8 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.3 Part III: Chapters 9–22 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.4 Computing Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

In 1903, H.G. Wells hypothesized that statistical thinking would one day be as necessary for good
citizenship as the ability to read and write. Wells was correct, and today statistics play an important
role in many decision-making processes. For example, before any new drug or medical device can
be marketed legally in the United States, the United States Food and Drug Administration (fda)
requires that it be subjected to a clinical trial, an experimental study involving human subjects. The
data from this study is compiled and analyzed to determine not only whether the drug is effective,
but also if it is safe. How is this determined? As another example, the United States government’s
decisions regarding Social Security and public health programs rely in part on the longevity of the
nation’s population; the government must therefore be able to accurately predict the number of years
each individual will live. How does it do this? If the government incorrectly forecasts human life
expectancy, it could render itself insolvent and endanger the well-being of its citizens.
There are many other issues that must be addressed as well. Where should a government invest
its resources if it wishes to reduce infant mortality? Should a mastectomy always be recommended
to a patient with breast cancer? Should a child play football? What factors increase the risk that an
individual will develop coronary heart disease? Will we be able to afford our health care system in
the future? Does global warming impact the sea level? Our health? What effect would a particular
change in policy have on longevity? To answer these questions and others, we rely on the methods
of biostatistics.

1.1 Why Study Biostatistics


The study of statistics explores the collection, organization, analysis, and interpretation of numerical
data. The concepts of statistics may be applied to a number of fields, including business, psychology,
and agriculture. When focus is on the biological and health sciences, we use the term biostatistics.
Historically, statistics have been used to tell a story with numbers. Numbers often communicate
ideas more succinctly than do words. For example, the World Health Organization (who) defines
maternal mortality as “the death of a woman while pregnant or within 42 days of termination of
pregnancy, irrespective of the duration and site of the pregnancy, from any cause related to or
aggravated by the pregnancy or its management but not from accidental or incidental causes” [1].
Therefore, when presented with the graph in Figure 1.1 [2, 3], someone concerned with maternal
mortality might react with alarm at the reported striking behavior of the United States and research
the issue further.

FIGURE 1.1
Maternal mortality per 100,000 live births, 1990–2015

How useful is the study of biostatistics? Biostatistics are certainly ubiquitous in the health
sciences. The Centers for Disease Control and Prevention (cdc) reports that “During the 20th century,
the health and life expectancy of persons residing in the United States improved dramatically. Since
1900, the average lifespan of persons in the United States has lengthened by greater than 30 years;
25 years of this gain are attributable to advances in public health” [4–6]. They go on to list what they
consider to be ten great achievements:

– Vaccination
– Motor vehicle safety
– Safer workplaces
– Control of infectious diseases
– Healthier mothers and babies
– Safer and healthier foods
– Fluoridation of drinking water
– Family planning
– Decline in deaths from coronary heart disease and stroke
– Recognition of tobacco use as a health hazard

When one reads the recounting of these achievements in subsequent Morbidity and Mortality Weekly
Reports, it is evident that biostatistics played an important role in every one of them.
Notwithstanding these societal successes, work still needs to be done. The future with its exabytes
of data – known as big data – providing amounts of information which are orders of magnitude larger
than was previously available is a new challenge. But if we are to progress responsibly, we cannot
ignore the lessons of the past [7]. A case in point is our failure to control the number of deaths from
guns that has led to a public health crisis in the United States. The statistic blared from a headline in
The New York Times in 2018 [8]: “nearly 40,000 people died from guns in u.s. last year, highest in
50 years.” This crisis looks even worse when one considers what is happening with mass shootings
in schools. The United States is experiencing a remarkable upward trend in the number of casualties
involved. There have been more school shooting deaths in the first 18 years of the 21st century (66)
than in the last 60 years of the 20th century (55). The same is true for injuries due to guns, with 260
and 81 in each of these two time periods, respectively [9]. A summary of this situation is made more
pithy by the statistics.

1.2 Difficult Numbers


The numbers needed to tell a story are not always easy to come by – examples include attempts
to investigate the volume of illicit human trafficking [10], or to measure the prevalence of female
genital mutilation [11] – but are indispensable for communicating important ideas. The powerful use
of statistics in this argument against continued restrictions on the drug mifepristone’s distribution is
clear [12]:
Since its approval in 2000, more than 3.7 million women have used mifepristone to end an
early pregnancy in the United States — it is approved for use up to 70 days into a pregnancy.
Nearly two decades of data on its use and effects on patients provide significant new insights
into its safety and efficacy. Mifepristone is more than 97% effective. Most adverse effects
are mild, such as cramping or abdominal pain, and the rate of severe adverse events is
very low: such events occur in less than 0.5% of patients, according to the fda. Many drugs
marketed in the United States have higher adverse event rates and are not subject to restricted
distribution.
In this example, the numbers provide a concise summary of the situation being studied. They, of
course, must be both accurate and precise if we are to trust any conclusions based on them.
The examples described deal with complex situations, yet the numbers convey essential informa-
tion. A word of caution: we must remain realistic in our expectations of what statistics can achieve.
No matter how powerful it is, no statistic will convince everyone that a given conclusion is true.
The data on gun deaths in the United States mentioned above are often brushed away with some
variant of the aphorism, “Guns don’t kill people, people do.” This should not come as a surprise.
After all, there are still deniers of global warming, people who believe that the vaccine for measles,
mumps, and rubella causes autism, and members of the Flat Earth Society, whose website states:
“This website is dedicated to unravelling the true mysteries of the universe and demonstrating that
the earth is flat and that Round Earth doctrine is little more than an elaborate hoax” [13].

1.3 Overview of the Text


The aim of a study using biostatistics is to analyze and present data in a transparent, interpretable,
and coherent manner to effectively communicate results and to help lead policy makers to the best
informed decisions. This textbook, as its title states, covers the principles of biostatistics. The
21 chapters beyond this one can be arranged into three parts to cover the tenets of biostatistics:
(1) variability, (2) inference, and (3) probability. We list them in this order so students can easily
remember the acronym vip. For pedagogical reasons, however, we present them in a different order:
(1) Chapters 2–4 discuss variability, (2) Chapters 5–8 cover probability, and (3) Chapters 9–22 cover
inference.

FIGURE 1.2
Racial breakdown of COVID-19 cases in the United States through May 28, 2020

1.3.1 Part I: Chapters 2–4 Variability


If we wish to study the effects of a new diet, we might place a group of individuals on that diet and
measure changes in their body mass over time. Similarly, if we want to investigate the success of an
innovative therapy for treating pancreatic cancer, we might record the lengths of time that patients
treated with this therapy survive beyond their initial diagnosis. These numbers, however, can display
a great deal of variability from one person to another. They are generally not very informative
until we begin combining them in some way. Descriptive statistics, the topic of Chapter 2, are
methods for organizing and summarizing a set of measurements. They help us to better understand
the attributes of a group or population. For instance, to support the premise that there was racial
inequity in who was afflicted by the coronavirus, reporters from The New York Times collected data
and displayed it not only in a table, but also as a graph similar to Figure 1.2 [14]. To dig deeper into
their analysis and show the impact by age group, they also included Figure 1.3 [14]. This example
demonstrates the power of a picture to tell a story. The graphical capabilities of computers make
this type of summarization feasible even for the most modest analyses, and use of both tables and
graphs to summarize information enables scientists and policy makers to formulate hypotheses that
then require further investigation.
By definition, a summary captures only a particular aspect of the data being studied; consequently,
it is important to have an idea of how well the summary represents the set of measurements as a
whole. For example, we might wish to know how long hiv/aids patients survive after diagnosis with
one of the opportunistic infections that characterize the disease. If we calculate an average survival
time, is this average representative of all patients? Furthermore, how useful is it for planning future
health service needs? In addition to tables and graphs, Chapter 2 examines numerical summary
measures that help answer questions such as these. The chapter includes an introduction to the mean
and standard deviation; the former tells us where the measurements are centered, and the latter how
dispersed they are. The chapter ends with the splendid empirical rule, which quantifies the metaphor
“the apple does not fall far from the tree.”
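
For readers who want to see these summary measures in action right away, the brief sketch below uses R, one of the statistical packages discussed later in the chapter, to compute the mean and standard deviation of a small set of measurements and to check how many observations fall within two standard deviations of the mean, the kind of statement the empirical rule makes precise. The blood pressure values are invented for illustration and do not come from the text.

# Hypothetical systolic blood pressure measurements (mm Hg) for ten subjects
x <- c(118, 124, 131, 142, 109, 127, 135, 121, 116, 139)

x_bar <- mean(x)   # where the measurements are centered
s     <- sd(x)     # how dispersed they are

# Proportion of observations within two standard deviations of the mean;
# for roughly bell-shaped data the empirical rule says this is about 95%
mean(abs(x - x_bar) <= 2 * s)
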
Measurements that take on only two distinct values require special attention. In the health
sciences, one of the most common examples of this type of data is the categorization of being
alive or dead. If we denote survival by 0 and death by 1, we are able to classify each member
of a group of individuals using these numbers and then average the results. In this way, we can
summarize the mortality associated with the group. Chapter 3 deals exclusively with measurements
that assume only two values. The notion of dividing a group into smaller subgroups or classes based
on a characteristic such as age or sex is also introduced. Grouping individuals into smaller, more
homogeneous subgroups decreases variability, thus allowing better prognosis. For example, it might
make sense to determine the mortality of females separately from that of males, or the mortality of
20- to 29-year-olds separately from 80- to 89-year-olds. Chapter 3 also investigates techniques that
allow us to make valid comparisons among populations whose compositions may differ substantially.

FIGURE 1.3
Racial breakdown of COVID-19 cases in the United States in 2020, by age

Chapter 4 introduces the classical life table, one of the most important numerical summary
techniques available in the health sciences. Life tables are used by public health professionals
to characterize the well-being of a population, and by insurance companies to predict how long
individuals will live. In this chapter, the study of mortality begun in Chapter 3 is extended to
incorporate the actual time to death for each individual, resulting in a more refined analysis.
Together, Chapters 2 through 4 demonstrate that the extraction of information from a collection
of measurements is not precluded by the variability among those measurements. Despite their vari-
ability, the data often exhibit a certain regularity as well. For example, here are the birth rates in the
United States among women 15–19 years of age over the 5-year time span shown [15]:

Year:                  2011   2012   2013   2014   2015
Birth rate per 1000:   31.3   29.4   26.5   24.2   22.3

Are the numbers showing a natural variability around a constant rate over time – think of how many
mistakes can go into the reporting of such numbers – or is this indicative of a real downward trend?
This question deserves better than a simple choice between these two options. To answer it properly,
we need to apply the principles of probability and inference, the subjects covered in the next two
sections of the text.
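
Before turning to those tools, a reader who wants to examine the birth rates above can already display them graphically. The short R sketch below simply enters the five rates from the table and plots them against year; it describes the data but, on its own, cannot say whether the decline is real or merely natural variability.

year <- c(2011, 2012, 2013, 2014, 2015)
rate <- c(31.3, 29.4, 26.5, 24.2, 22.3)   # births per 1000 women aged 15-19

# Line graph of the observed rates over time (a picture, not an inference)
plot(year, rate, type = "b",
     xlab = "Year",
     ylab = "Birth rate per 1000 women 15-19 years of age")
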

1.3.2 Part II: Chapters 5–8 Probability


Probability theory resides within what is known as an axiomatic system; we start with some basic
truths (axioms), and then build up a logical system around them. In its purest form, this theoretical
system has no practical value. Its practical importance comes from knowing how to use the theory to
yield useful approximations. An analogy can be drawn with geometry, a subject that most students
are exposed to relatively early in their schooling. Although it is impossible for an ideal straight line
to exist other than in our imaginations, that has not stopped us from constructing some wonderful
buildings based on geometric calculations, including some that have lasted thousands of years. The
same is true of probability theory. Although it is not practical in its pure form, its basic principles –
which we investigate in Chapter 5 – can be applied to provide a means of quantifying uncertainty.
An important application of probability theory arises in medical screening and diagnostic testing,
as we see in Chapter 6. Uncertainty is present because, despite some manufacturers’ claims, no
biological test is perfect. This leads to complicated findings, which are sometimes unintuitive, even
in the simple situation where the test is diagnosing the presence or absence of a medical condition.
Before performing the test, we consider each of four possible classifications: the test result is correct
or not, and the person being tested has the condition or not. The relationship between the results
of the test and the truth gives rise to important practical questions. For instance, can we conclude
that every blood sample that tests positive for hiv actually harbors the virus? All the units in the
Red Cross blood supply have tested negative for hiv; does this mean that there are no contaminated
samples? If there are contaminated samples, how many might there be? To address questions such
as these, we study the average or long-term behavior of diagnostic tests by using probability theory.
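
As a preview of the kinds of calculations Chapter 6 develops, the sketch below applies Bayes' theorem to a hypothetical screening test. The sensitivity, specificity, and prevalence values are assumptions chosen only for illustration; they show how the probability that a positive result means the condition is truly present can be surprisingly small when the condition is rare.

# Hypothetical operating characteristics of a screening test
sensitivity <- 0.99    # P(test positive | condition present)
specificity <- 0.98    # P(test negative | condition absent)
prevalence  <- 0.001   # P(condition present) among those screened

# Overall probability of a positive test (total probability rule)
p_positive <- sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes' theorem: probability the condition is present given a positive test
sensitivity * prevalence / p_positive
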
Chapters 7 and 8 extend probability theory and introduce some common probability distributions
used to describe the variability in a set of measurements. These mathematical models are useful as
a basis for the inferential methods covered in the remainder of the text.

1.3.3 Part III: Chapters 9–22 Inference


The Cambridge Dictionary defines inference as a guess that is made or an opinion that is formed
based on the information available. The paradigm we use in this text is that the inference we make
about a population is based on a sample of observations selected from that much larger population.
On the basis of the sample, we draw conclusions about the entire population, including the part of
the population we did not measure – those not in the sample. Humans are much more similar to
each other than dissimilar, and we capitalize on this fact to add credibility to our inference. However,
knowing how the sample is chosen and whom the sample represents are also of critical importance
for making inference.
An analogy can be made with the way in which traveling salesmen in the late 19th and early 20th
centuries in the United States were able to sell their goods to potential customers. Rather than carry
all the goods to be sold – including big items such as stoves – they would transport miniature models
of the products they were selling; see Figure 1.4 for an example. These replicas were very carefully
crafted, so as to convey an honest likeness, albeit a much smaller version of the sale item [16].
Although these are also called samples, this is where the analogy ceases to be useful; to make
realistic models, the manufacturers had the real item as a guide. When we sample in biostatistics, it
is because we do not know what the measurements look like for the entire target population.
Suppose we want to know whether a new drug is effective in lowering high blood pressure. Since
the population of all people in the world who have high blood pressure is very large, it is implausible
to think we would have either the time or the resources necessary to locate and examine each and
every person with this condition who might be a candidate to use the drug. Out of necessity, we
must rely on a sample of people drawn from the population. The limits to our subsequent inference
– which are always there – are determined by both the population that we sample, and by how well
the sample represents that population.

FIGURE 1.4
Boxed salesman’s sample set of glass bottles, containing samples from the Crocker company (Buffalo,
New York) (photo courtesy of Judy Weaver Gonyeau) [16]

The ability to generalize results from a sample to a population is the bedrock of empirical
research, and a central issue in this book. One requirement for credible inference is that it be based
on a representative sample. In any particular study, do we truly have a representative sample? If we
answer yes, this leads to a logical conundrum. To truly judge that we have a representative sample we
need to know the entire population. And if we know the entire population, why then focus only on a
sample? If we do not have the ability to study an entire population, the best solution available is to
utilize a simple random sample of the population. This means, amongst other things, that everyone in
the population has an equal chance of being selected into the sample. It ensures us that, on average,
we have a representative sample. A pivotal side benefit of a simple random sample is that it also
provides an estimate of the possible inaccuracy of our inference.
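
To make this idea concrete, the R sketch below draws a simple random sample from a large simulated population and computes the sample mean together with its estimated standard error, the quantity that expresses the possible inaccuracy mentioned above. The population values are artificial; they stand in for the target population we can never measure in full.

set.seed(1)

# Artificial population of 100,000 measurements standing in for the target population
population <- rnorm(100000, mean = 120, sd = 15)

# Simple random sample: every member has the same chance of being selected
srs <- sample(population, size = 200)

mean(srs)                    # estimate of the population mean
sd(srs) / sqrt(length(srs))  # estimated standard error of that estimate
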
It can often be difficult to obtain a simple random sample. The consequences of mistakenly
thinking that a sample is representative when in fact it is not lead to invalid inferences. A case
in point is provided by the behavioral sciences, where empirical results are often derived from
individuals sampled from western, educated, industrialized, rich, and democratic (weird) societies.
An example of this is the undergraduate students who make a few extra dollars by volunteering to
be a subject for an on-campus study. Since most of these studies are done in the United States, we
can see the problem. Clearly the results will reflect the pool from which the subjects came. Use of
the label weird implies a certain contempt for a large number of published findings attacked in an
article by Henrich and colleagues [17]. They investigate results in the domains of visual perception,
fairness, cooperation, spatial reasoning, categorization and inferential induction, moral reasoning,
reasoning styles, self-concepts and related motivations, and the heritability of iq. They conclude
that “members of weird societies, including young children, are among the least representative
populations one could find for generalizing about humans.” Yet the researchers who published the
original results presumably believed that their samples were random and representative.

We have repeated this mistake in the bio-medical sciences, where the consequences can be even
more severe. For example, we do not perform as many clinical trials on children as on adults [18].
Trials of adults, even randomized clinical trials, are not representative of children. Children are not
small adults who simply require a modification in dosage. Some conditions – such as prematurity
and many of its sequelae – occur only in infants and children [19]. Certain genetic conditions such
as phenylketonuria (pku) will, if untreated, lead to severe disability or even death in childhood. The
diagnosis, prevention, and treatment of these conditions cannot be adequately investigated without
studying children. Other conditions such as influenza and certain cancers and forms of arthritis
also occur in both adults and children, but their pathophysiology, severity, course, and response to
treatment may be quite different for infants, children, and adolescents. Treatments that are safe and
effective for adults may be dangerous or ineffective for children.
There are many more examples where certain groups have been largely ignored by researchers.
The lack of trials in women [20] and people of color led Congress, in 1993, to pass the National
Institutes of Health Revitalization Act, which requires the agency to include more people from these
groups in their research studies. Unfortunately, success in the implementation of this law has been
slow [21]. The headline in Scientific American on September 1, 2018 – 25 years after the Act was
passed – was clinical trials have far too little racial and ethnic diversity; it’s unethical
and risky to ignore racial and ethnic minorities [22].
This problem extends beyond clinical trials. The 21st century has seen the mapping of the human
genome. Genome-wide association studies (gwas) have identified thousands of genetic variants
associated with human traits and diseases. This exciting source of information is unfortunately
restricted, so inference is constrained or biased. A 2009 study showed that 96% of participants in
gwas studies were of European descent [23]. Seven years later this had decreased to 80%, largely
due to studies carried out in China and Japan; the proportion of Asian participants has increased, but the representation
of other groups has not. Since gwas studies are the basis for precision medicine, this has raised the
fear that precision medicine will exacerbate racial health disparities [24]. This, of course, is a general
trait of artificial intelligence systems: they reflect the information that goes into them.
As an example of the value of inference, we can consider a group of investigators who were
interested in evaluating whether, at the time of their study, there was a difference in how analgesics
were administered to male versus female patients with acute abdominal pain. It would be impossible
to investigate this issue by observing every person in the world with acute abdominal pain, so they
designed a study of a smaller group of individuals with this ailment so they could, on the basis of
the sample, infer what was happening in the population as a whole. How far their inference should
reach is not our focus right now, but it is important to take notice of what they say. Here is a copy of
the abstract from the published article [25]:
objectives: Oligoanalgesia for acute abdominal pain historically has been attributed to the
provider’s fear of masking serious underlying pathology. The authors assessed whether a
gender disparity exists in the administration of analgesia for acute abdominal pain.
methods: This was a prospective cohort study of consecutive nonpregnant adults with
acute nontraumatic abdominal pain of less than 72 hours duration who presented to an urban
emergency department (ed) from April 5, 2004, to January 4, 2005. The main outcome mea-
sures were analgesia administration and time to analgesic treatment. Standard comparative
statistics were used.
results: Of the 981 patients enrolled (mean age ± standard deviation [sd] = 41 ± 17 years;
65% female), 62% received any analgesic treatment. Men and women had similar mean pain
scores, but women were less likely to receive any analgesia (60% vs. 67%, difference 7%,
95% confidence interval (ci) = 1.1% to 13.6%) and less likely to receive opiates (45% vs.
56%, difference 11%, 95% ci = 4.1% to 17.1%). These differences persisted when gender-
specific diagnoses were excluded (47% vs. 56%, difference 9%, 95% ci = 2.5% to 16.2%).
After controlling for age, race, triage class, and pain score, women were still 13% to 25%
less likely than men to receive opioid analgesia. There was no gender difference in the receipt
of nonopioid analgesia. Women waited longer to receive their analgesia (median time 65
minutes vs. 49 minutes, difference 16 minutes, 95% ci = 3.5 to 33 minutes).
conclusions: Gender bias is a possible explanation for oligoanalgesia in women who
present to the ed with acute abdominal pain. Standardized protocols for analgesic adminis-
tration may ameliorate this discrepancy.

This is a fairly typical abstract in the health sciences literature – it reports on a clinical study and
uses statistics to describe the findings – so we look at it more closely. First consider the objectives
of the study. We are told that the goal is to discover whether there is a gender disparity in the
administration of drugs. This is not whether there was a difference in administering the drugs
between genders in this particular study – that question is easy to answer – but rather a more
ambitious finding; namely, is there something in this study that allows us to generalize the findings
to a broader population?
The abstract goes on to describe the methods utilized in the study, and then its results. We first
learn that the researchers studied a group of 981 patients. To allow the reader to get an understanding
of who these 981 patients are, they provide some descriptive statistics about the patients’ ages and
genders. This is done to lay the groundwork for generalizing the results of the study to individuals
not included in the study sample.
The investigators then start generalizing their results. We are told that even though men and
women suffered similar amounts of pain, women were less likely – 7% less likely – to receive any
analgesia. This difference of 7% is clearly study specific. Had they chosen fewer than 981 patients or
more, or even a different group of 981 patients, they likely would have observed a difference other
than 7%. How to quantify this potential variability from sample to sample – even though we have
observed only a single sample – and how to accommodate it when making inference, is answered
by the most useful and effective result in the book. It is an application of the theory covered in
Chapter 8, and is known as the central limit theorem.
An application of the central limit theorem allows the study investigators to construct a 95%
confidence interval for the difference in proportions, 1.1% to 13.6%. One way to interpret this
interval is to appeal to a thought experiment and repetition: If we were to sample repeatedly from
the underlying population, each sample might result in a difference other than 7%, and a confidence
interval other than 1.1% to 13.6%. However, 95% of these intervals from repeated sampling will
include the true population difference between the genders, whatever its value. The interpretations for
all the other confidence intervals in the abstract are similar. More general applications of confidence
intervals are introduced in Chapter 9, and examples appear throughout the text.
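To make the thought experiment concrete, the interval can be reproduced, at least approximately, with statistical software. The sketch below uses R, the package whose output appears in the Further Applications sections of this text. Because the counts are back-calculated from the percentages reported in the abstract rather than taken from the study's raw data, the resulting interval will only roughly match the published 1.1% to 13.6%.

# Approximate 95% confidence interval for the difference in the
# proportions of men and women receiving any analgesia. The counts are
# reconstructed from summary percentages and are therefore approximate.
n_men   <- round(0.35 * 981)       # about 343 men
n_women <- round(0.65 * 981)       # about 638 women
x_men   <- round(0.67 * n_men)     # about 230 men received analgesia
x_women <- round(0.60 * n_women)   # about 383 women received analgesia

prop.test(x = c(x_men, x_women), n = c(n_men, n_women))$conf.int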
For a study to be of general interest and usefulness, we must be able to extrapolate its findings
to a larger population. By generalizing in this manner, however, we inevitably introduce uncertainty.
There are various ways to measure and convey this uncertainty, and we cover two such inferential
methods in this book. One is to use confidence intervals, as we just saw in the abstract, and the other is
to use hypothesis testing. The latter is introduced in Chapter 10. The two methods are consistent with
each other, and will lead to the same action following a study. There are some questions, however,
that are best answered in the hypothesis testing framework.
As an example, consider the way we monitor the water supply for lead contamination [26].
In 1974, the United States Congress passed the Safe Drinking Water Act, and its enforcement is
a responsibility of the Environmental Protection Agency (epa). The epa determines the level of
contaminants in drinking water at which no adverse health effects are likely to occur, with an
adequate margin of safety. For lead this level is zero, a goal which is untenable in practice. As a result, the epa established
a treatment technique, an enforceable procedure which water systems must follow to ensure control
of a contaminant. The treatment technique regulation for lead – referred to as the Lead and Copper
Rule [27] – requires water systems to control the corrosivity of water. The regulation stipulates that
to determine whether a system is safe, health regulators must sample taps in the system that are
more likely to have plumbing materials containing lead. The number of taps sampled depends on
the size of the system served. To accommodate aberrant local conditions, if 10% or fewer of the
sampled taps have more than 15 parts per billion (ppb) of lead, the system is considered safe. If
not, additional actions by the water authority are required. We can phrase this monitoring procedure
in a hypothesis testing framework: We wish to test the hypothesis that the water has 15 ppb or fewer
of lead. The action we take depends on whether we reject this hypothesis, or not. According to the
Lead and Copper Rule, the decision depends on the measured tap water samples. If more than 10%
of the water samples have more than 15 ppb, we reject the hypothesis and take corrective action.
Just as with diagnostic testing in Chapter 6, we have the potential to make the wrong decision
when conducting a hypothesis test. The chance of such an error is influenced by the way in which
the samples are chosen, how many samples we take, and the 10% cutoff rule. In 2015, the city of
Flint, Michigan, took water samples in order to check the level of lead in the water [28]. According
to the Lead and Copper Rule, they were supposed to take 100 samples from houses most likely to
have a lead problem. They did not. First, they took only 71 samples; second, they chose the 71 in
what seemed like a random fashion. Setting aside these contraventions, they found that 8 of the 71
samples had more than 15 ppb. This is more than 10% of the samples, and thus they were required to
alert the public and take corrective action. Instead, the State of Michigan forced Flint to drop two of
the water samples, both with more than 15 ppb of lead. This meant that there were only 69 samples,
and 6 had more than 15 ppb of lead. Thus fewer than 10% crossed the threshold, and the authorities
felt free to tell the residents of Flint that their water was fine. This is yet another example of ignoring
the message produced by the scientific method and having catastrophe follow [29]. It seems like the
lead problem is repeating itself, only this time in Newark, New Jersey [30].
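The arithmetic behind this decision rule is simple enough to check directly. The following sketch, again in R, applies the 10% threshold to the Flint figures quoted above; it illustrates the calculation only, not the full regulatory procedure.

# Lead and Copper Rule screening: corrective action is triggered when
# more than 10% of the sampled taps exceed 15 ppb of lead.
exceeds_limit <- 8          # samples with more than 15 ppb
n_samples     <- 71
exceeds_limit / n_samples   # 0.113, above the 0.10 threshold: take action

# After the two high samples were dropped:
6 / 69                      # 0.087, below the 0.10 threshold: no action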
In Chapter 10 we apply hypothesis testing techniques to statements about the mean of a single
population, and in Chapter 11 extend these techniques to the comparison of two population means.
They are further generalized to the comparison of three or more means in Chapter 12. Chapter 13
continues the development of hypothesis testing concepts, but introduces techniques that allow the
relaxation of some of the assumptions necessary to carry out the tests. Chapters 14 and 15 develop
inferential methods that can be applied to enumerated data or counts – such as the numbers of cases
of sudden infant death syndrome among children put to sleep in various positions – rather than
continuous measurements.
Inference can also be used to explore the relationships among a number of different attributes,
with the underlying motivation being to reduce variability. If a full-term infant whose gestational age
is 39 weeks is born weighing 4 kilograms, or 8.8 pounds, no one would be surprised. If the infant’s
gestational age is only 22 weeks, however, then their weight would be cause for alarm. Why? We
know that birth weight tends to increase with gestational age, and, although it is extremely rare to
find a baby weighing 4 kilograms at 22 weeks, it is not uncommon at 39 weeks. There is sufficient
variability in birth weights to not be surprised to hear that an infant weighs 4 kilograms at birth,
but when the gestational age of the child is known, there is much less variability among infants of a
particular gestational age, and 4 kilograms may seem out of place. In other words, our measurements
have a more precise interpretation the more information we have about the measurement.
The study of the extent to which two factors are related is known as correlation analysis; this is
the topic of Chapter 16. If we wish to predict the outcome of one factor based on the value of another,
then regression is the appropriate technique. Simple linear regression is investigated in Chapter 17,
and is extended to the multiple regression setting – where two or more factors are used to predict
a single outcome – in Chapter 18. If the outcome of interest can take on only two possible values,
such as alive or dead, then an alternative technique must be applied; logistic regression is explored
in Chapter 19.
In Chapter 20, the inferential methods appropriate for life tables are introduced. These techniques
enable us to draw conclusions about the mortality of a population based on the experience of a sample
of individuals drawn from the population. This is common in clinical trials, especially in randomized
clinical trials, when the purpose of the trial is to study whether a patient’s survival has been prolonged
by a treatment [31].
Chapter 21 is devoted to surveys and inference in finite populations. These techniques are very
popular around election time in democracies, but also find many uses in public health. For example,
the United States Census Bureau supplements the decennial census with an annual survey called
the American Community Survey; its purpose is to help “local officials, community leaders, and
businesses understand the changes taking place in their communities. It is the premier source for detailed
population and housing information about our nation” [32]. In 2017, 2,145,639 households were
interviewed. Once again, the mainstay that enables us to make credible inference about the entire
United States population, 325.7 million people in 2017, is the simple random sample. We take that
as our starting point, and build on it with more refined designs. Practical examples are given by the
National Centers for Health Statistics within the cdc [33].
Once again it would be helpful if we could control variability and lessen its effect. Some survey
designs help in this regard. For example, if we can divide a population into strata where we know
the size of each stratum, we can take advantage of that extra information – the size of the strata – to
estimate the population characteristics more accurately via stratified sampling. If on the other hand
we wish to lower the cost of the survey, we can turn to cluster sampling. Of course, we can combine
these ideas and utilize both in a single survey. These design considerations and some of the issues
raised are addressed in this chapter.
The last chapter, Chapter 22, could have been the first. Even though it is foundational, one needs
the material developed in the rest of the book to appreciate its content. It is here that we bolster the
belief that it is not just the numbers that count, but what they represent, and how they are obtained.
This was made quite clear during the covid-19 pandemic. The proper monitoring of a viral epidemic
and its course requires an enumeration of people infected by the virus. This, unfortunately, did
not happen. Miscounting of covid-19 cases occurred across the world [34], including the United
States [35,36]. One cannot help but think that this disinformation contributed to the resultant damage
from the pandemic.
Chapter 22 explores how best to design studies to take advantage of the methods described in
this book. It also should whet your appetite to study biostatistics further, as the story gets even more
fascinating. To quote what George Udny Yule wrote almost a century ago [37]:
When his work takes an investigator out of the field of the nearly perfect experiments, in
which the influence of disturbing causes is practically negligible, into the field of imperfect
experiment (or a fortiori of pure observation) where the influence of disturbing causes is
important, the first step necessary for him is to get out of the habit of thinking in terms of
the single observation and to think in terms of the average. Some seem never to get beyond
this stage. But the next stage is even more important, viz., to get out of the habit of thinking
in terms of the average, and think in terms of the frequency distribution. Unless and until he
does this, his conclusions will always be liable to fallacy.

1.3.4 Computing Resources


In addition to Stata output, R output is presented for all examples in the Further Applications
sections of each chapter. All of the Stata and R code used in the text is available online, and can be accessed at
https://github.com/Principles-of-Biostatistics/3rd-Edition.
1.4 Review Exercises

1. Design a study aimed at investigating an issue you believe might influence the health of
the world. Briefly describe the data you will require, how you will obtain them, how you
intend to analyze the data, and the method you will use to present your results. Keep this
study design and reread it after you have completed the text.

2. Suppose it is stated that in a given year, 512 million people around the world were
malnourished, up from 460 million just five years prior [38].
(a) Suppose that you sympathize with the point being made. Justify the use of these
numbers.
(b) Are you sure that the numbers are correct? Do you think it is possible that 513 million
people were malnourished during the year in question rather than 512 million?

3. In addition to stating that “the Chinese have eaten pasta since 1100 b.c.,” the label on
a box of pasta shells claims that “Americans eat 11 pounds of pasta per year,” whereas
“Italians eat 60 pounds per year.” Do you believe that these statistics are accurate? Would
you use these numbers as the basis for a nutritional study?
Part I

Variability
2
Descriptive Statistics

CONTENTS
2.1 Types of Numerical Data
    2.1.1 Nominal Data
    2.1.2 Ordinal Data
    2.1.3 Ranked Data
    2.1.4 Discrete Data
    2.1.5 Continuous Data
2.2 Tables
    2.2.1 Frequency Distributions
    2.2.2 Relative Frequency
2.3 Graphs
    2.3.1 Bar Charts
    2.3.2 Histograms
    2.3.3 Frequency Polygons
    2.3.4 Box Plots
    2.3.5 Two-Way Scatter Plots
    2.3.6 Line Graphs
2.4 Numerical Summary Measures
    2.4.1 Mean
    2.4.2 Median
    2.4.3 Mode
    2.4.4 Range
    2.4.5 Interquartile Range
    2.4.6 Variance and Standard Deviation
2.5 Empirical Rule
2.6 Further Applications
2.7 Review Exercises

Every study or experiment yields a set of data. Its size can range from a few measurements to
many millions of observations. A complete set of data, however, will not necessarily provide an
investigator with information that can be easily interpreted. For example, Table 2.1 lists the first
2560 cases of human immunodeficiency virus infection and acquired immunodeficiency syndrome
(hiv/aids) reported to the Centers for Disease Control and Prevention [39]. Each individual was
classified as either suffering from Kaposi sarcoma, designated by a 1, or not suffering from the disease,
represented by a 0. (Kaposi sarcoma is a malignant tumor which affects the skin, mucous membranes,
and lymph nodes.) Although Table 2.1 displays the entire set of outcomes, it is extremely difficult
to characterize the data. We cannot even identify the relative proportions of 0s and 1s. Between the
raw data and the reported results of the study lies some intelligent and imaginative manipulation of
the numbers carried out using the methods of descriptive statistics.

TABLE 2.1
Outcomes indicating whether an individual had Kaposi sarcoma for the first 2560 cases of hiv/aids
reported to the Centers for Disease Control and Prevention in Atlanta, Georgia

00000000 00010100 00000010 00001000 00000001 00000000 10000000 00000000


00101000 00000000 00000000 00011000 00100001 01001100 00000000 00000010
00000001 00000000 00000010 01100000 00000000 00000100 00000000 00000000
00100010 00100000 00000101 00000000 00000000 00000001 00001001 00000000
00000000 00010000 00010000 00010000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00001000 00000000 00010000 10000000 00000000
00100000 00000000 00001000 00000010 00000000 00000100 00000000 00010000
00000000 00000000 00000100 00001000 00001000 00000101 00000000 01000000

00010000 00000000 00010000 01000000 00000000 00000000 00000101 00100000


00000000 00000000 00000100 00000000 01000100 00000000 00000001 10100000
00000100 00000000 00010000 00000000 00001000 00000000 00000010 00100000
00000000 00000000 00000000 10001000 00001000 00000000 01000000 00000000
00000000 00001100 00000000 00000000 10000011 00000001 11000000 00001000
00000000 00000000 00000000 00000000 01000000 00000001 00010001 00000000
10000000 00000000 01000000 00000000 00000000 01010100 00000000 00010100
00000000 00000000 00000000 00001010 00000101 00000000 00000000 00010000

00000000 00000000 00000000 00000001 00000100 00000000 00000000 00001000


11000000 00000100 00000000 00000000 00000000 00000000 00000000 00001000
11000000 00010010 00000000 00001000 00000000 00111000 00000001 01001100
00000000 01100000 00100010 10000000 00000000 00000010 00000001 00000000
01000010 01000100 00000000 00010000 00000000 01000000 00000001 00000000
01000000 00000001 00000000 10000000 01000000 00000000 00000000 00000100
00000000 00000000 01000010 00000000 00000000 00000000 00000000 00000000
00000000 00000010 00001010 00001001 10000000 00000000 00000010 00000000

00000000 01000000 00000000 00001000 00000000 01000000 00010000 00000000


00001000 01000010 01001111 00100000 00000000 00100000 00000000 10000001
00000001 00000000 01000000 00000000 00000000 00000000 00000000 01000000
00000000 00000000 00100000 01000000 00100000 00000000 00000011 00000000
01000000 00000100 10000001 00000001 00001000 00000100 00001000 00001000
00100000 00000000 00000000 00000000 00000010 01000001 00010011 00000000
00000000 10000000 10000000 00000000 00000000 00001000 01000000 00000000
00001000 00000000 01000010 00011000 00000001 00001001 00000000 00000001

01000010 01001000 01000000 00000010 00000000 10000000 00000100 00000000


00000010 00000000 00000000 00000010 00000000 00100100 00000000 10110100
00001100 00000100 00001010 00000000 00000000 00000000 00000000 00000000
00000010 00000000 00000000 00000000 00100000 10100000 00001000 00000000
01000000 00000000 00000000 00100000 00000000 01000001 00010010 00010001
00000000 00100000 00110000 00000000 00010000 00000000 00000100 00000000
00010100 00000000 00001001 00000001 00000000 00000000 00000000 00000000
00000010 00000100 01010100 10000001 00001000 00000000 00010010 00010000
Descriptive statistics are a means of organizing and summarizing observations. They provide us
with an overview of the general features of a set of data. Descriptive statistics can assume a number
of different forms, including tables, graphs, and numerical summary measures. Before we decide
which techniques are the most appropriate in a given situation, however, we must first determine
what type of data we have.

2.1 Types of Numerical Data


In the study of biostatistics we encounter many different types of numerical data: nominal, ordinal,
ranked, discrete, and continuous. The different types of data have varying degrees of structure in the
relationships among possible values.

2.1.1 Nominal Data


One of the simplest types of data is nominal data, in which the values fall into unordered categories
or classes. As in Table 2.1, numbers are often used to represent the categories. In a certain study,
for instance, males might be assigned the value 1 and females the value 0. Although the attributes
are labeled with numbers rather than words, both the order and the magnitude of the numbers are
unimportant. We could just as easily let 1 represent females and 0 designate males, or 5 represent
females and 6 males. Numbers are used for the sake of convenience; numerical values allow us to
use computers to perform complex analyses of the data.
Nominal data that take on one of two distinct values – such as male and female, or alive and
dead – are said to be dichotomous, or binary, depending on whether the Greek or the Latin root
for “two” is preferred. However, not all nominal data need be dichotomous. Often there are three or
more possible categories into which the observations can fall. For example, persons may be grouped
according to their blood type, where 1 represents type o, 2 is type a, 3 is type b, and 4 is type ab.
Again, the sequence or order of these values is not important. The numbers simply serve as labels
for the different blood types, just as the letters do. We must keep this in mind when we perform
arithmetic operations on the data. An average blood type of 1.8 for a given population is meaningless.
One arithmetic operation that can be interpreted, however, is the proportion of individuals that fall
into each group. An analysis of the data in Table 2.1 shows that 9.6% of the hiv/aids patients suffered
from Kaposi sarcoma and 90.4% did not.
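For nominal data coded as 0s and 1s, this kind of proportion is easily computed with software, since the mean of an indicator variable is simply the proportion of 1s. The brief R sketch below uses a short hypothetical vector in place of the 2560 actual outcomes in Table 2.1.

# The mean of a 0/1 indicator variable is the proportion of 1s.
kaposi <- c(0, 0, 1, 0, 0, 0, 0, 1, 0, 0)   # hypothetical outcomes
mean(kaposi)        # proportion with Kaposi sarcoma
1 - mean(kaposi)    # proportion without

# Using the counts reported for the full data set:
246 / 2560          # 0.096, or 9.6%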

2.1.2 Ordinal Data


When the order among categories becomes important, the observations are referred to as ordinal
data. For example, injuries may be classified according to their level of severity, where 1 represents
a fatal injury, 2 is severe, 3 is moderate, and 4 is minor. Here a natural order exists among the
groupings; a smaller number represents a more serious injury. We are still not concerned with the
magnitude of these numbers, however. We could have let 4 represent a fatal injury and 1 a minor one.
Furthermore, the difference between a fatal injury and a severe injury is not necessarily the same
as the difference between a moderate injury and a minor one, even though both pairs of outcomes
are one unit apart. As a result, many arithmetic operations still do not make sense when applied to
ordinal data. Table 2.2 provides a second example of ordinal data; the scale displayed is used by
oncologists to classify the performance status of patients enrolled in trials comparing alternative
treatments for cancer [40]. Together, nominal and ordinal measurements are called categorical data.
TABLE 2.2
Eastern Cooperative Oncology Group’s classification of patient performance status

Status Definition
0 Patient fully active, able to carry on all pre-disease performance
without restriction
1 Patient restricted in physically strenuous activity but ambulatory
and able to carry out work of a light or sedentary nature
2 Patient ambulatory and capable of all self-care but unable to carry
out any work activities; up and about more than 50% of waking
hours
3 Patient capable of only limited self-care; confined to bed or chair
more than 50% of waking hours
4 Patient completely disabled; not capable of any self-care; totally
confined to bed or chair

2.1.3 Ranked Data


In some situations we have a group of observations that are first arranged from highest to lowest
according to magnitude, and then assigned numbers corresponding to each observation’s place in
the sequence. This type of data is known as ranked data. As an example, consider all possible causes
of death in the United States. We could make a list of all of these causes, along with the number
of lives that each one claimed in a particular calendar year. If the causes are ordered from the one
that resulted in the greatest number of deaths to the one that caused the fewest and then assigned
consecutive integers, the data are said to have been ranked. Table 2.3 lists the ten leading causes of
death in the United States in 2016 [41]. Note that cerebrovascular diseases would be ranked fifth
whether they caused 117,000 deaths or 154,000. In assigning the ranks, we disregard the actual
values of the observations, and consider only their relative magnitudes. Even with this imprecision,
it is amazing how much information the ranks contain. In fact, it is sometimes better to work with
ranks than with the original data; this point is explored further in Chapter 13.

2.1.4 Discrete Data


For discrete data, both ordering and magnitude of the numbers are important. In this case, the
numbers represent actual measurable quantities rather than mere labels. Despite this, discrete data
are restricted to taking on only specified values – often integers or counts – that differ by fixed
amounts; no intermediate values are possible. Examples of discrete data include the number of fatal
motor vehicle accidents in Massachusetts in a specified month, the number of times a female has
given birth, the number of new cases of tuberculosis reported in the United States during a one-year
period, and the number of beds available in the intensive care unit of a particular hospital.
Note that for discrete data a natural order exists among the possible values. If we are interested in
the number of fatal motor vehicle accidents over one month, for instance, a larger number indicates
more fatal accidents. Furthermore, the difference between one and two accidents is the same as the
difference between four and five accidents, or the difference between 20 and 21 accidents. Finally,
the number of fatal motor vehicle accidents is restricted to the nonnegative integers; there cannot be
20.2 fatal accidents. Because it is meaningful to measure the distance between possible data values
for discrete observations, arithmetic rules can be applied. However, the outcome of an arithmetic
operation performed on two discrete values is not necessarily discrete itself. Suppose, for instance,
that in one month there are 15 fatal motor vehicle accidents, whereas there are 22 the following
TABLE 2.3
Ten leading causes of death in the United States, 2016

Rank Cause of Death Total Deaths


1 Diseases of the heart 635,260
2 Malignant neoplasms 599,038
3 Unintentional injuries 161,374
4 Chronic lower respiratory diseases 154,596
5 Cerebrovascular diseases 142,142
6 Alzheimer’s disease 116,103
7 Diabetes mellitus 80,058
8 Influenza and pneumonia 51,537
9 Nephritis, nephrotic syndrome and nephrosis 50,046
10 Intentional self harm (suicide) 44,965

month. The average number of fatal motor vehicle accidents for these two months is 18.5, which is
not itself an integer.

2.1.5 Continuous Data


Data that represent measurable quantities but are not restricted to taking on certain specified values
(such as integers) are known as continuous data. In this case, the difference between any two possible
values can be arbitrarily small. Examples of continuous data include weight, age, serum cholesterol
level, the concentration of a pollutant, length of time between two events, and temperature. In all
instances, fractional values are possible. Since we are able to measure the distance between two
observations in a meaningful way, arithmetic operations can be applied. The only limiting factor for
a continuous observation is the degree of accuracy with which it can be measured; consequently,
we often see time rounded off to the nearest second and weight to the nearest pound or gram or
kilogram. The more accurate our measuring instruments, however, the greater the amount of detail
that can be achieved in our recorded data.
At times we might require a lesser degree of detail than that afforded by continuous data; hence we
occasionally transform continuous observations into ordinal or even dichotomous ones. In a study of
the effects of maternal smoking on newborns, for example, we might first record the birth weights of
a large number of infants and then categorize the infants into three groups: those who weigh less than
1500 grams, those who weigh between 1500 and 2500 grams, and those who weigh more than 2500
grams. Although we have the actual measurements of birth weight, we are not concerned whether a
particular child weighs 1560 grams or 1580 grams; we are only interested in the number of infants
who fall into each category. From prior experience, we may not expect substantial differences among
children within the very low birth weight, low birth weight, and normal birth weight groupings.
Furthermore, ordinal data are often easier to work with than continuous data, thus simplifying the
analysis. There is a consequential loss of detail in the information about the infants, however. In
general, the degree of precision required in a given set of data depends upon the questions that are
being studied.

Section 2.1 describes a gradation of numerical data that ranges from nominal to continuous.
As we progress, the nature of the relationship between possible data values becomes increasingly
TABLE 2.4
Cases of Kaposi sarcoma for the first 2560 hiv/aids patients reported to the Centers for Disease
Control and Prevention in Atlanta, Georgia

Kaposi Number of
Sarcoma Individuals
Yes 246
No 2314

complex. Distinctions must be made among the various types of data because different techniques
are used to analyze them. As previously mentioned, it does not make sense to speak of an average
blood type of 1.8; it does make sense, however, to refer to an average temperature of 36.1°C or
37.2°C, which are the lower and upper bounds for normal human body temperature.

2.2 Tables
Now that we are able to differentiate among the various types of data, we must learn how to identify
the statistical techniques that are most appropriate for describing each kind. Although a certain
amount of information is lost when data are summarized, a great deal can also be gained. A table
is perhaps the simplest means of summarizing a set of observations and can be used for all types of
numerical data.

2.2.1 Frequency Distributions


One type of table that is commonly used to evaluate data is known as a frequency distribution.
For nominal and ordinal data, a frequency distribution consists of a set of classes or categories
along with the numerical counts that correspond to each one. As a simple illustration of this format,
Table 2.4 displays the numbers of individuals (numerical counts) who did and did not suffer from
Kaposi sarcoma (classes or categories) for the first 2560 cases of hiv/aids reported to the Centers for
Disease Control and Prevention [39]. A more complex example is given in Table 2.5, which specifies
the numbers of cigarettes smoked per adult in the United States from 1900 through 2015 [42].
To display discrete or continuous data in the form of a frequency distribution, we must break
down the range of values of the observations into a series of distinct, nonoverlapping intervals. If
there are too many intervals, the summary is not much of an improvement over the raw data. If there
are too few, then a great deal of information is lost. Although it is not necessary to do so, intervals are
often constructed so that they all have equal widths; this facilitates comparisons among the classes.
Once the upper and lower limits for each interval have been selected, the number of observations
whose values fall within each pair of limits is counted, and the results are arranged as a table. As part
of a National Health Examination Survey, for example, the serum cholesterol levels of 1067 25- to
34-year-old males were recorded to the nearest milligram per 100 milliliters [43]. The observations
were then subdivided into intervals of equal width; the frequencies corresponding to each interval
are presented in Table 2.6.
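In practice this grouping is rarely done by hand. The R sketch below shows one way to build such a frequency distribution using the cut and table functions; the ten measurements are hypothetical stand-ins for the 1067 actual serum cholesterol values, and only the interval boundaries are taken from Table 2.6.

# Group continuous measurements into nonoverlapping intervals of equal
# width, then count the observations falling into each interval.
chol <- c(192, 168, 175, 220, 143, 301, 185, 254, 199, 163)   # hypothetical

breaks    <- seq(80, 400, by = 40)          # 80-119, 120-159, ..., 360-399
intervals <- cut(chol, breaks = breaks, right = FALSE)
table(intervals)                            # frequency in each interval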
Table 2.6 gives us an overall picture of what the data look like; it shows how the values of serum
cholesterol level are distributed across the intervals. Note that the observations range from 80 to
399 mg/100 ml, with relatively few measurements at the ends of the range and a large proportion
of the values falling between 120 and 279 mg/100 ml. The interval 160–199 mg/100 ml contains
TABLE 2.5
Cigarette consumption per person 18 years of age or older, United States, 1900–2015

Year          Number of Cigarettes
1900 54
1910 151
1920 665
1930 1485
1940 1976
1950 3522
1960 4171
1970 3985
1980 3851
1990 2828
1995 2505
2000 2076
2005 1717
2010 1278
2015 1078

TABLE 2.6
Absolute frequencies of serum cholesterol levels for 1067 United States males, aged 25 to 34 years

Cholesterol Level       Number
(mg/100 ml)             of Males
80–119 13
120–159 150
160–199 442
200–239 299
240–279 115
280–319 34
320–359 9
360–399 5
Total 1067
the greatest number of observations. Table 2.6 provides us with a much better understanding of the
data than would a list of 1067 cholesterol level readings. Although we have lost some information
– given the table, we can no longer recreate the raw data values – we have also extracted important
information that helps us to understand the distribution of serum cholesterol levels for this group of
males.
The fact that one kind of information is gained while another is lost holds true even for the simple
binary data in Tables 2.1 and 2.4. We might feel that we do not lose anything by summarizing these
data and counting the numbers of 0s and 1s, but in fact we do. For example, if there is some type
of trend in the observations over time – perhaps the proportion of hiv/aids patients with Kaposi
sarcoma is either increasing or decreasing as the epidemic matures – then this information is lost in
the summary.
Tables are most informative when they are not overly complex. As a general rule, tables and the
columns within them should always be clearly labeled. If units of measurement are involved, such
as mg/100 ml for the serum cholesterol levels in Table 2.6, these units should be specified.

2.2.2 Relative Frequency


It is sometimes useful to know the proportion of values that fall into a given interval in a frequency
distribution rather than the absolute number. The relative frequency for an interval is the proportion
of the total number of observations that appear in that interval. The relative frequency is calculated
by dividing the number of values within an interval by the total number of values in the table. The
proportion can be left as it is, or can be multiplied by 100% to obtain the percentage of values
in the interval. In Table 2.6, for example, the relative frequency in the 80–119 mg/100 ml class is
(13/1067) × 100% = 1.2%; similarly, the relative frequency in the 120–159 mg/100 ml class is
(150/1067) × 100% = 14.1%. The relative frequencies for all intervals in a table sum to 100%.
Relative frequencies are useful for comparing sets of data that contain unequal numbers of
observations. Table 2.7 displays the absolute and relative frequencies of serum cholesterol level
readings for the 1067 25- to 34-year-old males depicted in Table 2.6, as well as a group of 1227 55-
to 64-year-olds. Because there are more males in the older age group, it is inappropriate to compare
the columns of absolute frequencies for the two sets of males. Comparing the relative frequencies
is meaningful, however. We can see that, in general, the older males have higher serum cholesterol
levels than the younger ones; the younger males have a greater proportion of observations in each of
the intervals below 200 mg/100 ml, whereas the older males have a greater proportion in each class
above this value.
The cumulative relative frequency for an interval is the percentage of the total number of
observations that have a value less than or equal to the upper limit of the interval. The cumulative
relative frequency is calculated by summing the relative frequencies for the specified interval and
all previous ones. Thus, for the group of 25- to 34-year-olds in Table 2.7, the cumulative relative
frequency of the second interval is 1.2 + 14.1 = 15.3%; similarly, the cumulative relative frequency
of the third interval is 1.2 + 14.1 + 41.4 = 56.7%. Like relative frequencies, cumulative relative
frequencies are useful for comparing sets of data that contain unequal numbers of observations.
Table 2.8 lists the cumulative relative frequencies for the serum cholesterol levels of the two groups
of males in Table 2.7.
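These calculations are easily reproduced from the counts in Table 2.6. The short R sketch below computes the relative and cumulative relative frequencies for the 25- to 34-year-old males; apart from rounding, the results match the first two columns of Table 2.8.

# Relative and cumulative relative frequencies from absolute counts.
counts <- c("80-119" = 13, "120-159" = 150, "160-199" = 442, "200-239" = 299,
            "240-279" = 115, "280-319" = 34, "320-359" = 9, "360-399" = 5)

relative   <- 100 * counts / sum(counts)   # e.g. 100 * 13/1067 = 1.2%
cumulative <- cumsum(relative)             # e.g. 1.2 + 14.1 = 15.3%
round(cbind(relative, cumulative), 1)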
According to Table 2.7, older males tend to have higher serum cholesterol levels than younger
ones do. This is the sort of generalization we hear quite often; for instance, it might also be said
that males are taller than females, or that females live longer than males. The generalization about
serum cholesterol does not mean that every 55- to 64-year-old male has a higher cholesterol level
than every 25- to 34-year-old male; nor does it mean that the serum cholesterol level of every male
increases with age. What the statement does imply is that for a given cholesterol level, the proportion
of younger males with a reading less than or equal to this value is greater than the proportion of older
males with a reading less than or equal to the value. This pattern is more obvious in Table 2.8 than
TABLE 2.7
Absolute and relative frequencies of serum cholesterol levels for 2294 United States males

Cholesterol            Ages 25–34                     Ages 55–64
Level             Number      Relative          Number      Relative
(mg/100 ml)       of Males    Frequency (%)     of Males    Frequency (%)
80–119                13          1.2                5          0.4
120–159              150         14.1               48          3.9
160–199              442         41.4              265         21.6
200–239              299         28.0              458         37.3
240–279              115         10.8              281         22.9
280–319               34          3.2              128         10.4
320–359                9          0.8               35          2.9
360–399                5          0.5                7          0.6
Total               1067        100.0             1227        100.0

TABLE 2.8
Relative and cumulative relative frequencies in percentages of serum cholesterol levels for 2294
United States males

Cholesterol            Ages 25–34                     Ages 55–64
Level             Relative     Cumulative        Relative     Cumulative
(mg/100 ml)       Frequency    Frequency         Frequency    Frequency
80–119 1.2 1.2 0.4 0.4
120–159 14.1 15.3 3.9 4.3
160–199 41.4 56.7 21.6 25.9
200–239 28.0 84.7 37.3 63.2
240–279 10.8 95.5 22.9 86.1
280–319 3.2 98.7 10.4 96.5
320–359 0.8 99.5 2.9 99.4
360–399 0.5 100.0 0.6 100.0
it is in Table 2.7. For example, 56.7% of the 25- to 34-year-olds have a serum cholesterol level less
than or equal to 199 mg/100 ml, whereas only 25.9% of the 55- to 64-year-olds fall into this category.
Because the relative proportions for the two groups follow this trend in every interval in the table,
the two distributions are said to be stochastically ordered. For any specified level, a larger proportion
of the older males have serum cholesterol readings above this value than do the younger males;
therefore, the distribution of cholesterol levels for the older males is stochastically larger than the
distribution for the younger males. This definition will start to make more sense when we encounter
random variables and probability distributions in Chapter 6. At that point, the implications of this
ordering will become more apparent.

2.3 Graphs
A second way to summarize and display data is through the use of graphs, or pictorial representations
of numerical data. Graphs should be designed so that they convey the general patterns in a set of
observations at a single glance. Although they are easier to read than tables, graphs often supply
a lesser degree of detail. Once again, however, the loss of detail may be accompanied by a gain in
understanding of the data. The most informative graphs are relatively simple and self-explanatory.
Like tables, they should be clearly labeled, and units of measurement should be indicated.

2.3.1 Bar Charts


Bar charts are a popular type of graph used to display a frequency distribution for nominal or ordinal
data. In a bar chart, the various categories into which the observations fall are typically listed along
a horizontal axis. A vertical bar is then drawn above each category such that the height of the bar
represents either the frequency or relative frequency of observations within that class. Sometimes this
format is reversed, with categories listed on the vertical axis and frequencies or relative frequencies
along the horizontal axis. Either way, the bars should be of equal width and separated from one
another so as not to imply continuity. As an example, Figure 2.1 is a bar chart displaying the relative
frequencies of Australian adults experiencing major long-term health conditions, with various health
conditions listed on the vertical axis [44].

2.3.2 Histograms
Perhaps the most commonly used type of graph is the histogram. While a bar chart is a pictorial
representation of a frequency distribution for either nominal or ordinal data, a histogram depicts a
frequency distribution for discrete or continuous data. The horizontal axis displays the true limits of
the various intervals. The true limits of an interval are the points that separate it from the intervals
on either side. For example, the boundary between the first two classes of serum cholesterol level
in Table 2.6 is 119.5 mg/100 ml; it is the true upper limit of the interval 80–119 and the true lower
limit of 120–159. The vertical axis of a histogram depicts either the frequency or relative frequency
of observations within each interval.
The first step in constructing a histogram is to determine the scales of the axes. The vertical
scale should begin at zero; if it does not, visual comparisons among the intervals may be distorted.
Once the axes have been drawn, a vertical bar centered at the midpoint is placed over each interval.
The height of the bar marks the frequency associated with that interval. As an example, Figure 2.2
displays a histogram constructed from the serum cholesterol level data in Table 2.6.
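When the raw measurements are available, a histogram such as Figure 2.2 can be produced with a single call to the hist function in R. The sketch below uses the same hypothetical cholesterol values as the earlier frequency distribution sketch; only the true interval limits are taken from the data in Table 2.6.

# Histogram of serum cholesterol level; the breaks are placed at the true
# limits of the intervals, so each bar spans one class of Table 2.6.
chol <- c(192, 168, 175, 220, 143, 301, 185, 254, 199, 163)   # hypothetical

hist(chol,
     breaks = seq(79.5, 399.5, by = 40),
     xlab   = "Serum cholesterol level (mg/100 ml)",
     ylab   = "Number of males",
     main   = "")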
In reality, the frequency associated with each interval in a histogram is represented not by the
height of the bar above it but by the bar’s area. Thus, in Figure 2.2, 1.2% of the total area corresponds
FIGURE 2.1
Bar chart: Major long-term health conditions experienced by Australian adults, 2014–2015; MBC =
mental and behavioral conditions

FIGURE 2.2
Histogram: Absolute frequencies of serum cholesterol levels for 1067 United States males, aged 25
to 34 years
FIGURE 2.3
Histogram: Relative frequencies of serum cholesterol levels for 1067 United States males, aged 25
to 34 years

to the 13 observations that lie between 79.5 and 119.5 mg/100 ml, and 14.1% of the area corresponds
to the 150 observations between 119.5 and 159.5 mg/100 ml. The area of the entire histogram sums
to 100%, or 1. Note that the proportion of the total area corresponding to an interval is equal to the
relative frequency of that interval. As a result, a histogram displaying relative frequencies – such as
Figure 2.3 – will have the same shape as a histogram displaying absolute frequencies. Because it is
the area of each bar that represents the relative proportion of observations in an interval, care must
be taken when constructing a histogram with unequal interval widths; the height must vary along
with the width so that the area of each bar remains in proper proportion.

2.3.3 Frequency Polygons


The frequency polygon, another commonly used graph, is similar to the histogram in many respects.
A frequency polygon uses the same two axes as a histogram. It is constructed by placing a point at the
center of each interval such that the height of the point is equal to the frequency or relative frequency
associated with that interval. Points are also placed on the horizontal axis at the midpoints of the
intervals immediately preceding and immediately following the intervals that contain observations.
The points are then connected by straight lines. As in a histogram, the frequency of observations for
a particular interval is represented by the area within the interval and beneath the line segment.
Figure 2.4 is a frequency polygon of the serum cholesterol level data in Table 2.6. Compare it
with the histogram in Figure 2.2, which is reproduced very lightly in the background. If the total
number of observations in the data set were to increase steadily, we could decrease the widths of the
intervals in the histogram and still have an adequate number of measurements in each class; in this
case, the histogram and the frequency polygon would become indistinguishable. As they are, both
types of graphs convey essentially the same information about the distribution of serum cholesterol
levels for this population of men. We can see that the measurements are centered around 180 mg/100
FIGURE 2.4
Frequency polygon: Absolute frequencies of serum cholesterol levels for 1067 United States males,
aged 25 to 34 years

ml, and drop off a little more quickly to the left of this value than they do to the right. Most of the
observations lie between 120 and 280 mg/100 ml, and all are between 80 and 400 mg/100 ml.
Because they can be easily superimposed, frequency polygons are superior to histograms for
comparing two or more sets of data. Figure 2.5 displays the frequency polygons of the serum
cholesterol level data presented in Table 2.7. Since the older males tend to have higher serum
cholesterol levels, their polygon lies to the right of the polygon for the younger males.
Although its horizontal axis is the same as that for a standard frequency polygon, the vertical axis
of a cumulative frequency polygon displays cumulative relative frequencies. A point is placed at the
true upper limit of each interval; the height of the point represents the cumulative relative frequency
associated with that interval. The points are then connected by straight lines. Like frequency polygons,
cumulative frequency polygons may be used to compare sets of data. This is illustrated in Figure 2.6.
By noting that the cumulative frequency polygon for 55- to 64-year-old males lies to the right of the
polygon for 25- to 34-year-old males for each value of serum cholesterol level, we can see that the
distribution for older males is stochastically larger than the distribution for younger males.
Cumulative frequency polygons can also be used to obtain the percentiles of a set of data. The
95th percentile is a value which is greater than or equal to 95% of the observations and less than or
equal to the remaining 5%. Similarly, the 75th percentile is a value which is greater than or equal to
75% of the observations and less than or equal to the other 25%. This definition is only approximate
because taking 75% of an integer does not typically result in another integer; consequently, there
is often some rounding or interpolation involved. In Figure 2.6, the 50th percentile of the serum
cholesterol levels for the group of 25- to 34-year-olds – the value that is greater than or equal to half
of the observations and less than or equal to the other half – is approximately 193 mg/100 ml; the
50th percentile for the 55- to 64-year-olds is about 226 mg/100 ml.
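When the raw data are available, percentiles need not be read from a graph; they can be computed directly. A minimal R sketch, once more using hypothetical values in place of the actual measurements, is shown below.

# The quantile function returns the requested percentiles of a data set.
chol <- c(192, 168, 175, 220, 143, 301, 185, 254, 199, 163)   # hypothetical

quantile(chol, probs = c(0.25, 0.50, 0.75, 0.95))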
Percentiles are useful for describing the shape of a distribution. For example, if the 40th and 60th
percentiles of a set of data lie an equal distance away from the midpoint, and the same is true of
FIGURE 2.5
Frequency polygon: Relative frequencies of serum cholesterol levels for 2294 United States males

FIGURE 2.6
Cumulative frequency polygon: Cumulative relative frequencies of serum cholesterol levels for 2294
United States males
FIGURE 2.7
Box plot: Crude death rates for each state in the United States, 2016

the 30th and 70th percentiles, the 20th and 80th, and all other pairs of percentiles that sum to 100,
then the data are symmetric; that is, the distribution of values has the same shape on each side of
the 50th percentile. Alternatively, if there are a number of outlying observations on one side of the
midpoint only, then the data are said to be skewed. If these observations are smaller than the rest of
the values, the data are skewed to the left; if they are larger than the other measurements, the data are
skewed to the right. The various shapes that a distribution of data can assume are discussed further
in Section 2.4.

2.3.4 Box Plots


Another type of graph that can be used to summarize a set of discrete or continuous observations
is the box plot. Unlike the histogram or frequency polygon, a box plot uses a single axis to display
selected summaries of the measurements [45]. As an example, Figure 2.7 depicts the crude death
rates for each of the 50 states and the District of Columbia in 2016, from a low of 587.1 per 100,000
population in Utah to a high of 1241.4 per 100,000 population in West Virginia [46]. (For each state,
the “crude” death rate is simply the number of deaths in 2016 divided by the size of the population in
that year. In Chapter 3 we will discuss this further, and investigate the differences among crude rates,
specific rates, and adjusted rates.) The central box in the box plot – which is depicted vertically in
Figure 2.7 but which can also be horizontal – extends from the 25th percentile, 794.1 per 100,000, to
the 75th percentile, 969.3 per 100,000. The 25th and 75th percentiles of a data set are called quartiles
of the data. The line running between the quartiles at 891.6 deaths per 100,000 population marks
the 50th percentile of the data set; half the observations are less than or equal to 891.6 per 100,000,
and the other half are greater than or equal to this value. If the 50th percentile lies approximately
halfway between the two quartiles, this implies that the observations in the center of the data set are
roughly symmetric.
FIGURE 2.8
Box plots: Crude death rates for each state in the United States, 1996, 2006, and 2016

The lines projecting out from the box on either side extend to the adjacent values of the plot.
The adjacent values are the most extreme observations in the data set that are not more than 1.5
times the height of the box beyond either quartile. In Figure 2.7, 1.5 times the height of the box is
1.5× (969.3−794.1) = 262.8 per 100,000 population. Therefore, the adjacent values are the smallest
and largest observations in the data set which are not more extreme than 794.1 − 262.8 = 531.3 and
969.3 + 262.8 = 1232.1 per 100,000 population, respectively. Since there is no crude death rate less
than 531.3, the lower adjacent value is simply the minimum value, 587.1 per 100,000. There is one
value higher than 1232.1 – the maximum value of 1241.4 per 100,000 – and thus the upper adjacent
value is 1078.8 per 100,000, the next largest value. In fairly symmetric data sets, the adjacent values
should contain approximately 99% of the measurements. All points outside this range are represented
by circles; these observations are considered to be outliers, or data points which are not typical of
the rest of the values. It should be noted that the preceding explanation is merely one way to define
a box plot; other definitions exist and exhibit varying degrees of complexity [47].
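For readers who wish to experiment with this construction, the short sketch below (written in Python, which is not otherwise part of this text) computes the summaries displayed in a box plot for an arbitrary list of observations. It is only an illustration: the quartiles are taken from the Python standard library, whose convention may differ slightly from the one used to produce Figure 2.7, and the function name box_plot_summary is our own.

```python
# Illustrative sketch: the box plot summaries described above, for a list of values.
import statistics

def box_plot_summary(data):
    # Quartiles from the standard library; quartile conventions vary across software.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    step = 1.5 * (q3 - q1)                      # 1.5 times the height of the box
    lower_fence, upper_fence = q1 - step, q3 + step
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "quartiles": (q1, q3),
        "median": median,
        # Adjacent values: the most extreme observations within the fences.
        "adjacent_values": (min(inside), max(inside)),
        # Outliers: observations beyond the fences, plotted as circles.
        "outliers": sorted(x for x in data if x < lower_fence or x > upper_fence),
    }
```

Applied to the 51 crude death rates for 2016, a function such as this would be expected to return values close to the quartiles, adjacent values, and single outlier described above, subject to the percentile convention used.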
Because the box plot displays only a summary of the data on a single axis, it can be used to make
comparisons across groups or over time. Figure 2.8, for example, contains summaries of crude death
rates for the 50 states and the District of Columbia for three different calendar years: 1996, 2006,
and 2016 [46]. The 25th, 50th, and 75th percentiles of crude death rate all decrease from 1996 to
2006, but then increase again in 2016.

2.3.5 Two-Way Scatter Plots
Unlike the other graphs we have discussed, a two-way scatter plot is used to depict the relationship
between two different continuous measurements. Each point on the graph represents a pair of values;
the scale for one quantity is marked on the horizontal axis, and the scale for the other on the vertical
axis. For example, Figure 2.9 plots two simple measures of lung function – forced vital capacity (fvc) and forced expiratory volume in one second (fev1) – for 19 asthmatic subjects who participated in a study investigating the physical effects of sulfur dioxide exposure [48]. Forced vital capacity is the volume of air that can be expelled from the lungs in six seconds, and forced expiratory volume in one second is the volume that can be expelled after one second of constant effort. Note that the individual represented by the point that is farthest to the left had an fev1 measurement of 2.0 liters and an fvc measurement of 2.8 liters. (There are only 18 points marked on the graph instead of 19 because two individuals had identical values of fvc and fev1; consequently, one point lies directly on top of another.) As might be expected, the graph indicates that there is a strong relationship between these two quantities; fvc increases in magnitude as fev1 increases.

FIGURE 2.9
Two-way scatter plot: Forced vital capacity versus forced expiratory volume in one second for nineteen asthmatic subjects

2.3.6 Line Graphs
A line graph is similar to a two-way scatter plot in that it can be used to illustrate the relationship
between continuous quantities. Once again, each point on the graph represents a pair of values. In
this case, however, each value on the horizontal axis has a single corresponding measurement on the
vertical axis, and adjacent points are connected by straight lines. Most commonly, the scale along the
horizontal axis represents time. Consequently, we are able to trace the chronological change in the
quantity on the vertical axis over a specified period. Figure 2.10 displays the trends in the reported
rates of malaria that occurred in the United States between 1940 and 2015 [49]. Note the log scale
on the vertical axis; this scale allows us to depict a large range of observations while still showing
the variation among the smaller values.
To compare two or more groups with respect to a given quantity, it is possible to plot more than
one measurement along the vertical axis. Suppose we are concerned with the rising costs of health
care. To investigate this problem, we might wish to compare the variations in cost that have occurred
under two different health care systems in recent years. Figure 2.11 depicts the trends in health care
expenditures in both the United States and Canada between 1970 and 2016 [50, 51].
FIGURE 2.10
Line graph: Reported rates of malaria by year, United States, 1940–2015

FIGURE 2.11
Line graph: Health care expenditures as a percentage of gross domestic product (gdp) for the United States and Canada, 1970–2017

FIGURE 2.12
Leading causes of death in South Africa, 1997–2013, in thousands of deaths; colored bands from bottom to top represent other causes, digestive, nervous, endocrine, respiratory system, circulatory system, neoplasm, infectious, external causes, blood and immune disorders
In this section, we have not attempted to examine all possible types of graphs. Instead, we have
included only a selection of the more common ones. It should be noted that many other imaginative
displays exist [52]. One such example is Figure 2.12, which displays the leading causes of death in
South Africa from 1997 through 2013 [53]. The top border of the light blue segment at the bottom is
actually a line graph tracking the number of deaths due to “other causes” – those not represented by
the nine colored bands above it – over the 17-year time period. The purple segment above this shows
the number of deaths due to diseases of the digestive system in each year; the top border of this
segment displays the number of deaths due to other and digestive causes combined. The top of the
uppermost blue segment displays the total number of deaths due to all causes in each calendar year,
allowing us to see that the number of deaths in South Africa increased from 1997 through 2006, and
then decreased from 2006 through 2013. Some of this decrease can be attributed to a fall in deaths
due to diseases of the respiratory system, the bright pink band; note that this band becomes more
narrow beginning in the late 2000s. The number of deaths due to infectious disease – the light green
band – decreased after 2009. Deaths due to many of the other causes have not changed much over
this time period, as evidenced by the segments of constant height.
Regardless of the type of display being used, as a general rule, too much information should not
be squeezed into a single graph. A relatively simple illustration is often the most effective.
2.4 Numerical Summary Measures
Although tables and graphs are extremely useful methods for organizing, visually summarizing,
and displaying a set of data, they do not allow us to make concise, quantitative statements that
characterize the distribution of values as a whole. In order to do this, we instead rely on numerical
summary measures. Together, the various types of descriptive statistics can provide a great deal of
information about a set of observations.
The most commonly investigated characteristic of a set of data is its center, or the point about
which the observations tend to cluster. This is sometimes called a “measure of central tendency.”
Suppose we are interested in examining the response to air pollutants such as ozone and sulfur
dioxide among adolescents suffering from asthma. Listed in Table 2.9 are the initial measurements
of forced expiratory volume in one second for 13 subjects involved in such a study [54]. Recall
that fev1 is the volume of air that can be expelled from the lungs after one second of constant
effort. Before investigating the effect of pollutants on lung function, we might wish to determine the
“typical” value of fev1 prior to exposure for the individuals in this group.

2.4.1 Mean
The most frequently used measure of central tendency is the arithmetic mean, or average. The mean
is calculated by summing all the observations in a set of data and dividing by the total number of
measurements. In Table 2.9, for example, we have 13 observations. If x is used to represent fev1, then $x_1 = 2.30$ denotes the first in the series of observations; $x_2 = 2.15$, the second; and so on up through $x_{13} = 3.38$. In general, $x_i$ refers to a single fev1 measurement where the subscript $i$ can take on any value from 1 to $n$, the total number of observations in the group. The mean of the observations in the dataset – represented by $\bar{x}$, or x-bar – is

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .$$

Note that we have used some mathematical shorthand. The uppercase Greek letter sigma, $\Sigma$, is the symbol for summation. The expression $\sum_{i=1}^{n} x_i$ indicates that we should add up the values of all of the observations in the group, from $x_1$ to $x_n$. When $\Sigma$ appears in the text, the limits of summation are placed beside it; when it does not, the limits are above and below it. Both representations of a summation denote exactly the same thing. In some cases where it is clear that we are supposed to sum all observations in a dataset, the limits may be dropped altogether. For the fev1 measurements,

$$\bar{x} = \frac{1}{13}\sum_{i=1}^{13} x_i = \frac{1}{13}\,(2.30 + 2.15 + 3.50 + 2.60 + 2.75 + 2.82 + 4.05 + 2.25 + 2.68 + 3.00 + 4.02 + 2.85 + 3.38) = \frac{38.35}{13} = 2.95 \text{ liters}.$$
The mean can be used as a summary measure for both discrete and continuous measurements. In
general, however, it is not appropriate for either nominal or ordinal data. Recall that for these types
of observations, the numbers are merely labels; even if we choose to represent the blood types o, a,
b, and ab by the numbers 1, 2, 3, and 4, an average blood type of 1.8 is meaningless.
TABLE 2.9
Forced expiratory volumes in 1 second for 13 adolescents suffering from asthma

Subject FEV1 (liters)
1 2.30
2 2.15
3 3.50
4 2.60
5 2.75
6 2.82
7 4.05
8 2.25
9 2.68
10 3.00
11 4.02
12 2.85
13 3.38

One exception to this rule applies when we have dichotomous data, and the two possible outcomes
are represented by the values 0 and 1. In this situation, the mean of the observations is equal to
the proportion of 1s in the data set. For example, suppose that we want to know the proportion of
asthmatic adolescents in the previously described study who are males. Listed in Table 2.10 are the
relevant dichotomous data; the value 1 represents a male and 0 designates a female. If we compute
the mean of these observations, we find that
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{13}\,(0 + 1 + 1 + 0 + 0 + 1 + 1 + 1 + 0 + 1 + 1 + 1 + 0) = \frac{8}{13} = 0.615.$$
Therefore, 61.5% of the study subjects are males. It would have been a little more difficult to
determine the relative frequency of males, however, if we had represented males by the value 5 and
females by 12.
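As an illustration, the following Python sketch (not part of the original presentation) computes the mean of the fev1 values in Table 2.9 and of the 0/1 sex indicators in Table 2.10, showing that the mean of a set of 0/1 values is simply the proportion of 1s.

```python
# Illustrative sketch: the sample mean applied to the data in Tables 2.9 and 2.10.
fev1 = [2.30, 2.15, 3.50, 2.60, 2.75, 2.82, 4.05, 2.25, 2.68, 3.00, 4.02, 2.85, 3.38]
sex = [0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0]   # 1 = male, 0 = female

def mean(values):
    return sum(values) / len(values)

print(round(mean(fev1), 2))   # 2.95 liters
print(round(mean(sex), 3))    # 0.615, the proportion of males
```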
The method for calculating the mean takes into consideration the magnitude of each and every
observation in a set of data. What happens when one observation has a value that is very different
from the others? Suppose, for instance, that for the data shown in Table 2.9, we had accidentally
recorded the fev1 measurement of subject 11 as 40.2 rather than 4.02 liters. The mean fev1 of all
13 subjects would then be calculated as

$$\bar{x} = \frac{1}{13}\,(2.30 + 2.15 + 3.50 + 2.60 + 2.75 + 2.82 + 4.05 + 2.25 + 2.68 + 3.00 + 40.2 + 2.85 + 3.38) = \frac{74.53}{13} = 5.73 \text{ liters},$$
TABLE 2.10
Indicators of sex for 13 adolescents suffering from asthma

Subject Sex
1 0
2 1
3 1
4 0
5 0
6 1
7 1
8 1
9 0
10 1
11 1
12 1
13 0

which is nearly twice as large as it was before. Clearly, the mean is extremely sensitive to unusual
values. In this particular example, we would have rightfully questioned an fev1 measurement of
40.2 liters and would have either corrected the error or separated this observation from the others.
In general, however, the error might not be as obvious, or the unusual observation might not be an
error at all. Since it is our intent to characterize an entire group of individuals, we might prefer to
use a summary measure that is not as sensitive to each and every observation.

2.4.2 Median
One measure of central tendency which is not as sensitive to the value of each measurement is the
median. Like the mean, the median can be used as a summary measure for discrete and continuous
measurements. However, it can be used for ordinal data as well. The median is defined as the
50th percentile of a set of measurements; if a list of observations is ranked from smallest to largest,
then half the values would be greater than or equal to the median, and the other half would be less
than or equal to it. If a set of data contains a total of n observations where n is odd, the median is the
middle value, or the [(n + 1)/2]th largest measurement; if n is even, the median is usually taken to
be the average of the two middlemost values, the (n/2)th and [(n/2) + 1]th observations. If we were
to rank the 13 fev1 measurements listed in Table 2.9, for example, the following sequence would
result:
2.15, 2.25, 2.30, 2.60, 2.68, 2.75, 2.82, 2.85, 3.00, 3.38, 3.50, 4.02, 4.05.
Since there are an odd number of observations in the list, the median would be the (13 + 1)/2 = 7th
observation, or 2.82. Seven of the measurements are less than or equal to 2.82 liters, and seven are
greater than or equal to 2.82.
The calculation of the median takes into consideration only the ordering and relative magnitude
of the observations in a set of data. In the situation where the fev1 of subject 11 was recorded as
40.2 rather than 4.02, the ranking of the measurements would change only slightly:

2.15, 2.25, 2.30, 2.60, 2.68, 2.75, 2.82, 2.85, 3.00, 3.38, 3.50, 4.05, 40.2.
As a result, the median fev1 would still be 2.82 liters. The median is said to be robust; that is, it is
much less sensitive to unusual data points than is the mean.
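The robustness of the median is easy to verify numerically. The sketch below is an illustration only; it follows the convention for odd and even sample sizes described above, and recomputes the median after the hypothetical recording error.

```python
# Illustrative sketch: the median of the fev1 values, with and without the
# data entry error described above (40.2 recorded in place of 4.02).
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:                            # odd number of observations
        return ordered[(n + 1) // 2 - 1]      # the [(n+1)/2]th smallest value
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

fev1 = [2.30, 2.15, 3.50, 2.60, 2.75, 2.82, 4.05, 2.25, 2.68, 3.00, 4.02, 2.85, 3.38]
fev1_error = [40.2 if x == 4.02 else x for x in fev1]

print(median(fev1), median(fev1_error))   # 2.82 2.82: the median is unchanged
# For comparison, the corresponding means are 2.95 and 5.73 liters.
```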

2.4.3 Mode
A third measure of central tendency is the mode; it can be used as a summary measure for all types
of data, although it is most useful for categorical measurements. The mode of a set of values is the
observation that occurs most frequently. The continuous fev1 data in Table 2.9 do not have a unique
mode since each of the values occurs only once. It is not uncommon for continuous measurements to have no unique mode, or to have more than one. This is less likely to occur with nominal or ordinal
measurements. For example, the mode for the dichotomous data in Table 2.10 is 1; this value appears
eight times, whereas 0 appears only five times.
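As an illustration, the mode of the dichotomous data in Table 2.10 can be found by tabulating the frequency of each value; the short sketch below uses the Counter class from the Python standard library.

```python
# Illustrative sketch: the mode as the most frequently occurring value.
from collections import Counter

sex = [0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0]
print(Counter(sex).most_common(1))   # [(1, 8)]: the value 1 occurs eight times
```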

The best measure of central tendency for a given set of data often depends on the way in which
the values are distributed. If continuous or discrete measurements are symmetric and unimodal –
meaning that, if we were to draw a histogram or a frequency polygon, there would be only one peak,
as in the smoothed distribution pictured in Figure 2.13(a) – then the mean, the median, and the mode
should all be roughly the same. If the distribution of values is symmetric but bimodal, so that the
corresponding frequency polygon would have two peaks as in Figure 2.13(b), then the mean and
median should again be the same. Note, however, that this common value could lie between the
two peaks, and hence be a measurement that is extremely unlikely to occur. A bimodal distribution
often indicates that the population from which the values are taken actually consists of two distinct
subgroups that differ in the characteristic being measured; in this situation, it might be better to report
two modes rather than the mean or the median, or to treat the two subgroups separately. The data
in Figure 2.13(c) are skewed to the right, and those in Figure 2.13(d) are skewed to the left. When
the data are not symmetric, as in these two figures, the median is often the best measure of central
tendency. Because the mean is sensitive to extreme observations, it is pulled in the direction of the
outlying data values. As a result, the mean might end up either excessively inflated or excessively
deflated. Note that when the data are skewed to the right, the mean lies to the right of the median;
when they are skewed to the left, the mean lies to the left of the median. In both instances, the mean
is pulled in the direction of the extreme values.

Regardless of the measure of central tendency used in a particular situation, it can be misleading
to assume that this value is representative of all observations in the group. One example that illustrates
this point was included in an episode of the popular news program “60 Minutes,” where it was noted
that although the French diet tends to be high in fat and cholesterol, France has a fairly low rate of
heart disease relative to other countries, including the United States. This paradox was attributed to
the French habit of drinking wine with meals, red wine in particular. Studies have suggested that
moderate alcohol consumption can lessen the risk of heart disease. The per capita intake of wine in
France is one of the highest in the world, and the program implied that the French drink a moderate
amount of wine each day, perhaps two or three glasses. The reality may be quite different, however.
According to a wine industry survey, more than half of all French adults never drink wine at all [55].
Of those who do, only 28% of males and 11% of females drink it daily. Obviously the distribution is
far more variable than the “typical value” would suggest. Remember that when we summarize a set
of data, information is always lost. Thus, although it is helpful to know where the center of a dataset
lies, this information is usually not sufficient to characterize an entire distribution of measurements.
As another example, the two very different distributions of data values pictured in Figure 2.14
have the same means, medians, and modes. To know how good our measure of central tendency
actually is, we need to have some idea about the variation among the measurements. Do all the observations tend to be quite similar and therefore lie close to the center, or are they spread out across a broad range of values? To answer this question, we need to calculate a measure of the variability among values, also called a measure of dispersion.

FIGURE 2.13
Possible distributions of data values: (a) unimodal, (b) bimodal, (c) right-skewed, (d) left-skewed

2.4.4 Range
One number that can be used to describe the variability in a set of data is the range. The range of a
group of measurements is defined as the difference between the largest and the smallest observations.
Although the range is easy to compute, its usefulness is limited; it considers only the extreme values of
a dataset rather than the majority of the observations. Therefore, like the mean, it is highly sensitive
to exceptionally large or exceptionally small values. The range for the fev1 data in Table 2.9 is
4.05 − 2.15 = 1.90 liters. If the fev1 of subject 11 was recorded as 40.2 instead of 4.02 liters,
however, the range would be 40.2 − 2.15 = 38.05 liters, a value 20 times as large.

2.4.5 Interquartile Range
A second measure of variability – one that is not as easily influenced by extreme values – is called the
interquartile range. The interquartile range is calculated by subtracting the 25th percentile of the data
from the 75th percentile; consequently, it encompasses the middle 50% of the observations. (Recall
that the 25th and 75th percentiles of a data set are called quartiles.) For the fev1 data in Table 2.9, the
75th percentile is 3.38. Note that three observations are greater than this value and nine are smaller.
Similarly, the 25th percentile is 2.60. Therefore, the interquartile range is 3.38 − 2.60 = 0.78 liters.
FIGURE 2.14
Two distributions with identical means, medians, and modes

If a computer is not available, there are rules for finding the kth percentile of a set of measurements by hand, just as there were rules for finding the median. In that case, we ordered the measurements from smallest to largest, and the rule used depended on whether the number of observations n was even or odd. For other percentiles, we again begin by ranking the measurements from smallest to largest. If nk/100 is an integer, then the kth percentile of the data is the average of the (nk/100)th and (nk/100 + 1)th smallest observations. If nk/100 is not an integer, then the kth percentile is the (j + 1)th smallest measurement, where j is the largest integer which is less than nk/100. To find the 25th percentile of the 13 fev1 measurements, for example, we first note that 13(25)/100 = 3.25 is not an integer. Therefore, the 25th percentile is the 3 + 1 = 4th smallest measurement (since 3 is the largest integer less than 3.25), or 2.60 liters. Similarly, 13(75)/100 = 9.75 is not an integer, and the 75th percentile is the 9 + 1 = 10th smallest measurement, or 3.38 liters. The interquartile ranges of
daily glucose levels measured at each minute over a 24-hour period for a total of 90 days – as well as
10th and 90th percentiles – are presented for a single individual in Figure 2.15. These interquartile
ranges allow us to determine at which times of day glucose has the most variability, and when there
is less variability.
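The counting rule above translates directly into a short function. The sketch below is only an illustration (the helper name percentile is ours), and statistical software packages that use different percentile conventions may give slightly different answers.

```python
# Illustrative sketch: the counting rule for the kth percentile, applied to the
# 13 fev1 measurements to obtain the quartiles and the interquartile range.
def percentile(data, k):
    values = sorted(data)                  # rank from smallest to largest
    n = len(values)
    position = n * k / 100
    if position == int(position):          # nk/100 is an integer
        j = int(position)
        return (values[j - 1] + values[j]) / 2   # average of jth and (j+1)th smallest
    return values[int(position)]           # otherwise the (j+1)th smallest value

fev1 = [2.30, 2.15, 3.50, 2.60, 2.75, 2.82, 4.05, 2.25, 2.68, 3.00, 4.02, 2.85, 3.38]
q1 = percentile(fev1, 25)                  # 2.60 liters
q3 = percentile(fev1, 75)                  # 3.38 liters
print(q1, q3, round(q3 - q1, 2))           # interquartile range: 0.78 liters
```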

2.4.6 Variance and Standard Deviation
Another commonly used measure of dispersion is known as the variance. The variance quantifies
how different the observations are from each other by computing half of the average squared distance
between the measurements. To find this average, we list out all possible pairs of measurements $x_i$ and $x_j$ where $i \neq j$ (we do not want to compare a measurement to itself), calculate the difference between the observations in each pair, square the differences, sum them all up, and divide by twice the total number of pairs. Since the total number of possible pairs for a sample of size $n$ is $n(n-1)$, the variance is defined as

$$s^2 = \frac{1}{2n(n-1)} \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} (x_i - x_j)^2 .$$
FIGURE 2.15
Medians and interquartile ranges of daily glucose levels measured over a 24-hour period for a total
of 90 days
A mathematically equivalent formula is the more commonly used

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 ,$$

which is based on the squared difference of each measurement from the sample mean x̄. Although
less intuitive, this formula is easier to calculate by hand.
For the 13 fev1 measurements presented in Table 2.9, the mean is x̄ = 2.95 liters, and the
difference and squared difference of each observation from the mean are given below.

Subject   x_i   x_i − x̄   (x_i − x̄)²
1 2.30 −0.65 0.4225
2 2.15 −0.80 0.6400
3 3.50 0.55 0.3025
4 2.60 −0.35 0.1225
5 2.75 −0.20 0.0400
6 2.82 −0.13 0.0169
7 4.05 1.10 1.2100
8 2.25 −0.70 0.4900
9 2.68 −0.27 0.0729
10 3.00 0.05 0.0025
11 4.02 1.07 1.1449
12 2.85 −0.10 0.0100
13 3.38 0.43 0.1849
Total 38.35 0.00 4.6596
Therefore, the variance is

$$s^2 = \frac{1}{13-1} \sum_{i=1}^{13} (x_i - 2.95)^2 = \frac{4.6596}{12} = 0.39 \text{ liters}^2 .$$

The standard deviation of a set of values is the positive square root of the variance. Thus, for the 13 fev1 measurements above, the standard deviation is equal to

$$s = \sqrt{s^2} = \sqrt{0.39 \text{ liters}^2} = 0.62 \text{ liters}.$$
In practice, the standard deviation is used more frequently than the variance. This is primarily
because the standard deviation has the same units of measurement as the mean, rather than squared
units. In a comparison of two sets of measurements, the group with the smaller standard deviation
has the more homogeneous observations. The group with the larger standard deviation exhibits a
greater amount of variability. The actual magnitude of the standard deviation depends on the values
in the dataset; what is large for one set of data may be small for another. In addition, because the
standard deviation has units of measurement, it is meaningless to compare standard deviations for
two unrelated quantities, such as age and weight.
Together, these two numbers, a measure of central tendency and a measure of dispersion, can be
used to summarize an entire distribution of values. It is most common to see the standard deviation
reported with the mean, and either the range or the interquartile range reported with the median.

Summary: Numerical Summary Measures

Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Median: 50th percentile

Mode: Value that occurs most frequently

Range: Maximum value − minimum value

Interquartile range (IQR): 75th percentile − 25th percentile

Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{2n(n-1)}\sum_{i=1}^{n}\sum_{j=1,\,j \neq i}^{n}(x_i - x_j)^2$

Standard deviation: $s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
2.5 Empirical Rule
When a distribution of continuous measurements is symmetric and unimodal, the mean and standard
deviation can be used to construct an interval which captures a specified proportion of the observa-
tions in the dataset. The empirical rule tells us that approximately 67% of the observations lie in the
interval x̄ ± 1s, about 95% in the interval x̄ ± 2s, and almost all of the observations in the interval
x̄ ± 3s. Consider the measurements of total cholesterol depicted in the histogram in Figure 2.16,
which come from the Framingham Heart Study [56]. This study, which began enrolling subjects
who lived in Framingham, Massachusetts in 1948, was the first prospective study investigating risk
factors for cardiovascular outcomes. Total cholesterol levels were measured at the time of enroll-
ment for 4380 individuals in the study, and have a symmetric, unimodal distribution. The mean and
standard deviation of these observations are 236.8 mg/dL and 43.8 mg/dL, respectively. Therefore,
the empirical rule says that the interval

$$236.8 \pm (1 \times 43.8), \quad \text{or} \quad (193.0,\ 280.6),$$

contains approximately 67% of the total cholesterol measurements,

$$236.8 \pm (2 \times 43.8), \quad \text{or} \quad (149.2,\ 324.4),$$

contains 95%, and

$$236.8 \pm (3 \times 43.8), \quad \text{or} \quad (105.4,\ 368.2),$$
contains nearly all of the observations. In fact, for the 4380 measurements, 69.9% are between 193.0
and 280.6 mg/dL, 96.0% are between 149.2 and 324.4 mg/dL, and 99.4% are between 105.4 and
368.2 mg/dL. The empirical rule allows us to use the mean and the standard deviation of a set of
data, just two numbers, to describe the entire group.
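A small function can be used to check how closely a particular dataset follows the empirical rule. The sketch below is illustrative only (the name empirical_rule_check is ours, and the Framingham cholesterol values are not reproduced here); applying such a function to those 4380 measurements would be expected to return percentages close to the 69.9%, 96.0%, and 99.4% reported above.

```python
# Illustrative sketch: proportion of observations within 1, 2, and 3 standard
# deviations of the mean, for any list of measurements.
from math import sqrt

def empirical_rule_check(data):
    n = len(data)
    x_bar = sum(data) / n
    s = sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))
    for k in (1, 2, 3):
        lower, upper = x_bar - k * s, x_bar + k * s
        inside = sum(1 for x in data if lower <= x <= upper)
        print(f"within {k} SD: ({lower:.1f}, {upper:.1f}) "
              f"contains {100 * inside / n:.1f}% of the observations")
```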
Interpretation of the magnitude of the mean is enhanced by the empirical rule. As previously
noted, however, in order to apply the empirical rule, a distribution of data values must be at least
approximately symmetric and unimodal. The closer the distribution is to this ideal, the more precise
the descriptions provided by the rule. Deviations from the ideal – especially if they are extreme – not
only invalidate the use of the empirical rule, but even call into question the usefulness of the mean
and standard deviation as numerical summary measures.
Returning to the Framingham Heart Study, consider the reported average number of cigarettes
smoked per day at the time of enrollment. In addition to this discrete measurement, the researchers
also collected a binary measurement of smoking status: smoker versus non-smoker. If d is used to
represent smoking status (taking the value 1 for a smoker, and 0 for a non-smoker), while x represents
the average number of cigarettes smoked per day, then the ith individual in the group has a pair
of measurements $(d_i, x_i)$. The subscript $i$ takes on any value from 1 to 4402, the total number of
subjects in the study for whom these values were recorded.
FIGURE 2.16
Total cholesterol measurements at the time of enrollment for individuals participating in the Framingham Heart Study

Figure 2.17 displays the x values, the average numbers of cigarettes smoked per day. Note that these values are not symmetric and unimodal, and therefore the empirical rule should not be applied. Beyond that, however, we might wonder whether the mean is providing any useful information at all. Recall that we introduced the mean as a measure of central tendency, a “typical” value for
a set of measurements. Knowing that the center for the number of cigarettes smoked per day is
x̄ = 9.0 is not particularly helpful. The problem is that there are really two distinct groups of study
subjects: smokers and non-smokers. The mean of the x values ignores the information contained in
d. Cigarette consumption for the individuals who do not smoke – the 51% of the total cohort for
whom $d_i = 0$ – is 0 cigarettes per day, resulting in a mean value of 0 for this subgroup. For the subgroup of smokers – those for whom $d_i = 1$ – the mean cigarette consumption is 18.4 cigarettes
per day. The overall mean of x̄ = 9.0 is not representative of either of these subgroups. It might be
useful for the manufacturer who is trying to determine how many cigarettes to make, but it does not
help us to understand the health of the population. Instead of attempting to capture the situation with
a single mean, it is more informative to present two numerical summary measures: the proportion
of the population who smokes, and the mean number of cigarettes smoked per day only for the
subgroup of smokers. (Since the binary measurements of smoking status are represented by 0s and
1s, the proportion of 1s – equivalently, the proportion of smokers – is simply the mean of the $d_i$ values.)
These two numbers give us a more complete summary of the data. Of course, reporting two means
also complicates the interpretation. Suppose that we want to track changes in smoking habits over
time. With a single mean, it is easy to see whether cigarette consumption is increasing or decreasing;
with two means, it is not. What if fewer people smoke over time, but those who do smoke increase
their consumption? Can this be considered an improvement in health?
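As an illustration of reporting two summary measures rather than one, the sketch below uses a short, purely hypothetical list of $(d_i, x_i)$ pairs; the Framingham measurements themselves are not reproduced here.

```python
# Illustrative sketch: summarizing paired measurements (d_i, x_i) of smoking
# status and daily cigarette consumption with two numbers rather than one.
# The list below is hypothetical and exists only to show the calculation.
pairs = [(0, 0), (1, 20), (1, 10), (0, 0), (1, 30), (0, 0), (1, 15)]  # (d_i, x_i)

proportion_smokers = sum(d for d, _ in pairs) / len(pairs)
smokers = [x for d, x in pairs if d == 1]
mean_among_smokers = sum(smokers) / len(smokers)

# Together these two summaries describe the data better than the single
# overall mean, which mixes smokers and non-smokers.
print(round(proportion_smokers, 2), round(mean_among_smokers, 1))
```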

FIGURE 2.17
Average number of cigarettes smoked per day at the time of enrollment for individuals participating in the Framingham Heart Study

Additional complexity is introduced if we are dealing with a rare event. The information in Table 2.11 was presented as part of an argument about the loss of human lives attributable to guns [57]. The entries in the table show the number of deaths over each year from 2009 through 2015, by country, attributed to mass shootings. Although there is some disagreement on how to define a mass shooting, here it is defined as an incident resulting in four or more fatalities. The
argument utilizing these data focused on a contrast between the United States and Europe. The
authors took a country’s mean number of deaths per year over the seven-year period and divided
by its population size to calculate the “annual death rate from mass shootings per million people.”
Doing this, the United States ranked eighth highest, and it was therefore claimed that it is safer to live
in the United States than in the seven European countries which ranked higher. We might consider,
however, whether this metric is the most meaningful way to summarize these data.
First, note that there are currently 44 countries in Europe, but only 16 are listed in Table 2.11.
These 16 countries were selected because they had at least one mass shooting episode over the
seven-year period. Since the majority of European countries had no mass shootings at all, the sample
of countries shown is not representative. To more fairly compare the situation in Europe to that in
the United States, all countries must be included.
Second, just as with the cigarette consumption measurements from the Framingham Heart Study,
we should consider two dimensions of this data rather than just one: the frequency of mass shootings,
and the number of fatalities when a shooting does occur. Both of these pieces of information are
important. To better understand the frequency of mass shootings, Table 2.12 contains the number of
mass shootings in each year from 2009 to 2015. Over the seven-year period, there were six shootings
in France, and two in Belgium, Russia, Serbia, and Switzerland. Each of the other countries in the
table had just one mass shooting. The 28 European countries not shown in the table had none at all.
At the country level, a mass shooting is a rare event, and the mean number of shootings per year is
not a helpful summary measure, as all the means are low. In contrast, over the same time period, the
United States had 25 shootings, the same number as all of Europe combined. In fact, looking at the
last two rows of the table, the behavior of the two regions is quite similar.
Some might say that a fairer comparison would take into account the relative population sizes of
Europe and the United States. This would certainly be true if we believe that a certain fixed proportion
of a population are potential mass shooters, and therefore a larger population would produce more of
TABLE 2.11
Number of deaths per year attributed to mass shootings, 2009–2015

Country 2009 2010 2011 2012 2013 2014 2015 Total Mean Median
Albania 0 0 0 0 0 4 0 4 0.57 0
Austria 0 0 0 0 4 0 0 4 0.57 0
Belgium 0 0 6 0 0 4 0 10 1.43 0
Czech Republic 0 0 0 0 0 0 9 9 1.29 0
Finland 5 0 0 0 0 0 0 5 0.71 0
France 0 0 0 8 0 0 150 158 22.60 0
Germany 13 0 0 0 0 0 0 13 1.86 0
Italy 0 0 0 0 0 0 4 4 0.57 0
Macedonia 0 0 0 5 0 0 0 5 0.71 0
Netherlands 0 0 6 0 0 0 0 6 0.86 0
Norway 0 0 67 0 0 0 0 69 9.86 0
Russia 0 0 0 6 6 0 0 12 1.71 0
Serbia 0 0 0 0 13 0 0 19 2.17 0
Slovakia 0 7 0 0 0 0 0 7 1.00 0
Switzerland 0 0 0 0 4 0 4 8 1.14 0
United Kingdom 0 12 0 0 0 0 0 12 1.71 0
United States 38 12 18 66 16 12 37 199 28.40 18

TABLE 2.12
Number of mass shootings per year, 2009–2015

Country 2009 2010 2011 2012 2013 2014 2015 Total Mean Median
Albania 0 0 0 0 0 1 0 1 0.14 0
Austria 0 0 0 0 1 0 0 1 0.14 0
Belgium 0 0 1 0 0 1 0 2 0.28 0
Czech Republic 0 0 0 0 0 0 1 1 0.14 0
Finland 1 0 0 0 0 0 0 1 0.14 0
France 0 0 0 1 0 0 1 6 0.86 0
Germany 1 0 0 0 0 0 0 1 0.14 0
Italy 0 0 0 0 0 0 1 1 0.14 0
Macedonia 0 0 0 1 0 0 0 1 0.14 0
Netherlands 0 0 1 0 0 0 0 1 0.14 0
Norway 0 0 1 0 0 0 0 1 0.14 0
Russia 0 0 0 1 1 0 0 2 0.28 0
Serbia 0 0 0 0 1 0 0 2 0.28 0
Slovakia 0 1 0 0 0 0 0 1 0.14 0
Switzerland 0 0 0 0 1 0 1 2 0.28 0
United Kingdom 0 1 0 0 0 0 0 1 0.14 0
Europe 2 2 4 4 4 1 8 25 3.57 3
United States 4 2 3 6 3 3 4 25 3.57 4
FIGURE 2.18
Frequency of number of fatalities per mass shooting, 2009–2015

these individuals. The population of Europe is more than twice that of the United States – in 2019,
there were approximately 740 million people in Europe and 330 million in the United States – and
that ratio has been fairly consistent since 2009. Therefore, we can conclude that the proportion of
mass shooters in the United States is more than twice as high as in Europe.
Going a step further, the description above did not account for the number of fatalities in each
shooting. Figure 2.18 displays the frequencies with which each number of fatalities occurred. In
Europe, the mean number of fatalities is 13.7 per event, while in the United States it is 8.0. Note,
however, the three outlying values, representing 26 deaths in Newtown, Connecticut in 2012, 67 in
Norway in 2011, and 130 in France in 2015. We have seen that the mean is affected by outlying
values. To assess their impact, we might consider excluding the outliers and recalculating the means.
If we do this, the means are 6.4 fatalities per event in Europe and 7.2 in the United States, both
of which are much lower; furthermore, the mean for Europe is now smaller than the mean for the
United States.

In summary, there are instances when a single mean does not provide an accurate representation
of a complex situation, especially when comparisons are being made. To fully understand what is
happening when comparing gun violence in the United States and Europe, the annual death rate
from mass shootings per million people does not give the whole picture, especially when Europe is
represented by a hand-picked sample of countries, chosen in a biased way so as to make a political
point. We might convey a better understanding of the data by noting that the frequency of mass
shootings was the same in Europe and the United States over the seven-year period from 2009
through 2015, with 25 mass shootings in each region, even though the population of Europe is more
than twice as large. There were three mass shootings with exceptionally high numbers of fatalities,
noted above, and excluding these, the mean numbers of deaths per event were 6.4 in Europe and 7.2
in the United States.
