Data Analysis with
Stata
A Comprehensive Guide for Data Analysis and Interpretation of Outputs
First Edition
Stata Version 13
Russell Kabir
BDS, MPH, MSc, PhD
Senior Lecturer (Research Methods), Course Leader (MSc Public Health and
Community Wellbeing and MPH Global Public Health)
Anglia Ruskin University, UK
Monjura Nisha
BDS, MPH, PhD
Postdoctoral Research Fellow
The Daffodil Centre, The University of Sydney,
A joint venture with Cancer Council NSW
NSW, Australia
ASA Publications
Dhaka, Bangladesh
ISBN 978-984-35-3165-0 [First Edition]
Copyright © 2022 by Authors
First published 2022
This book is copyrighted by the authors. However, the e-book is free for everyone. Anyone can download it from the links, print it out for personal use, and share it with others, but it is strictly prohibited to use it for any kind of profit-making venture without the written permission of the first author. Its contents may be used and incorporated into other materials with proper acknowledgements and citations. The datasets provided in the links and used in this book are hypothetical and can be used for practice.
Publisher
ASA Publications
Dhaka, Bangladesh
Email: asapublications7@gmail.com
Distributor
Altaf Medical Book Center
Shop: 121 & 128; Lane: 3; Islamia Market, Nilkhet, Dhaka 1205
Cell: +880-1711-985991; +880-1611-985991; +880-1511-985991
Production credits
Editor: Mohammad Tajul Islam
Proofreading: Russell Kabir and Md. Golam Kibria
Composition and Cover Design: Zariath Al-Mamun Badhon
Production Director: Md Altaf Hossain
Printing: ASA Publications
Binding: Rahim Bindings
Suggested Citation
Islam MT, Kabir R, Nisha M, eds. Data Analysis with Stata: A Comprehensive Guide for Data Analysis and Interpretation of Outputs. Dhaka, Bangladesh: ASA Publications; December 2022.
Printed in Bangladesh
To all our family members, students, and promising researchers in
health and social sciences
Preface
Stata is a popular and versatile data analysis software package. Many books and guidelines on Stata are available online or in the market, but Stata books focusing on the analysis of health-related data are scarce. Most books use data from sources related to business or psychology, with which health researchers are largely unfamiliar. All these factors, and the inspiration of our students, encouraged us to write this book. In this book, we have used simple variables that are commonly used in health and social science-related research as examples, to make it easy and understandable for users.
This book is intended for students (MPH, FCPS, MD, MS, MPhil, PhD, and others),
teachers, and young researchers in health and social sciences. It is written in very
simple language. This book answers three basic questions about data analysis: a) what to do (what statistics to use for data analysis in order to achieve the objectives); b) how to do it (how to analyze data using Stata); and c) what the outputs mean (how to interpret the outputs). All these questions are answered in an understandable manner, with examples.
This book covers more than the basic statistical methods of data analysis that are com-
monly used in health and social sciences research. It is the gateway to learning statis-
tics and Stata together, and will help the users go further. This book covers data man-
agement, descriptive statistics, hypothesis testing using bivariate and multivariable
analysis, and others. It is easier to learn through exploration than through reading alone, and users are encouraged to explore further once the basics are known. We believe that, using the statistics covered in this book, students and researchers will be able to analyze most of their data from epidemiological studies and publish the results in international peer-reviewed journals.
We are optimistic that students and researchers will find this book useful while analyz-
ing their data and interpreting the outputs. If you have any comments or suggestions
about this book, feel free to write to the e-mail address below.
M. Tajul Islam
abc.taj@gmail.com
Foreword
The book titled "Data Analysis with Stata", written by Dr. M. Tajul Islam, Dr. Russell
Kabir, and Dr. Monjura Nisha, is an immense contribution to teaching, training, and
research in the field of medical sciences and relevant disciplines. The authors of the
book have provided a step-by-step approach from data management to data analysis
using Stata in this book.
The book is a useful resource for data analysis using Stata for undergraduate, postgrad-
uate, and doctoral students of medical sciences and other health-related disciplines,
such as Dental, Pharmacy, Nursing, Occupational Health, Physical Medicine, Biomed-
ical and Social Sciences.
Each chapter is written in a simple manner, focusing on the needs of students. Contents
are systematically presented and described so that students can easily grasp the process
of data analysis and interpret the outputs.
This book will serve as a helpful training tool for university lecturers, dissertation
supervisors, and data analysis instructors.
Readers will benefit from using this book, and I wish for the extensive circulation of
the book.
Acknowledgements
We are particularly indebted and grateful to Dr. Shakil Ahmed and Mr. Md. Golam
Kibria, who reviewed all the sections, provided their critical views and constructive
suggestions, and took the lead role in publishing this book. We are particularly thankful
to Professor Hafiz T. A. Khan of the University of West London for his careful review
and writing the foreword for this book.
We would like to say thanks to Dr. Ali Davod Parsa, Dr. Richard Hayhoe, Dr. Tim
Hayes, Dr. Rachael Cubberley, Dr. Oonagh Corrigan, and Dr. Shannon Doherty of
Anglia Ruskin University, UK; Mrs. Madhini Sivasubramanian and Dr. Divya Vinna-
kota of University of Sunderland, London; Dr. Julia Morgan of Greenwich University,
UK; Dr. Md. Anwarul Azim Majumder of University of West Indies; Dr. Brijesh Sathi-
an of Hamad Medical Corporation, Qatar; Dr. Sujita Kumar Kar of King George’s
Medical University, Lucknow, India; Dr. Shah Jalal Sarker of UCL, UK; and Dr. S. M.
Yasir Arafat of Enam Medical College and Hospital, Bangladesh.
Many of our students, who continually pushed and encouraged us to write this book,
deserve a share of the credit, including Dr. SM Anwar Sadat and Dr. Md. Naymul
Hasan. We, as authors, accept full responsibility for any deficiencies or errors that the
book may have. Finally, we express our sincerest thanks for the assistance that we have
received from ASA Publications in publishing this book.
To the users
This e-book is free for all. However, if you can afford it, please donate BDT 100 (US$ 2 for users outside Bangladesh) to a charity or a needy person. This small amount is sufficient to offer a meal to an orphan in developing countries.
Table of Contents
Chapter 1 Introduction 1
1.1 Version dilemma 1
1.2 Stata interface 2
1.3 Steps of data analysis 2
3.5 Stata operator symbols 30
3.6 Getting help in Stata 30
Chapter 7 Generating Graphs 69
7.1 Histogram 69
7.1.1 Saving graphs 70
7.3 Box plot 73
7.4 Bar graph 76
7.4.1 Bar graph for the mean of a quantitative variable across a
categorical variable 76
7.4.2 Bar graph for the frequencies of a categorical variable 76
7.5 Line graph 77
7.6 Pie chart 78
11.2.1 Commands for two-way ANOVA 105
11.2.2 Interpretation 106
11.2.3 Post hoc test for two-way ANOVA 106
16.1.1 Commands for simple linear regression 145
16.1.2 Interpretation 146
16.2 Multiple linear regression 149
16.2.1 Sample size for multiple regression 150
16.2.2 Commands for multiple linear regression analysis 151
16.2.3 Interpretation 153
16.2.4 Regression diagnostics 154
16.2.4.1 Checking for multicollinearity 155
16.2.4.2 Checking for linearity 156
16.2.4.3 Checking for normality of residuals 158
16.2.4.4 Checking for homoscedasticity 161
16.2.4.5 Checking for outliers 163
16.2.4.6 Test for independence 164
16.2.5 Variable selection for a model 166
16.2.5.1 Backward selection method 167
16.2.5.2 Forward selection method 168
16.2.5.3 Interpretation 169
17.3.2 Cox regression with constant time 193
17.3.3 Generalized linear model 194
References 249
1
Introduction
Stata is a data analysis software package that allows us to perform a wide range of statistical analyses. Stata can be operated through its drop-down menus as well as by writing commands directly in the command window. Stata commands are straightforward and simple; therefore, data analysis by writing commands is more popular among users. This book focuses on the use of Stata commands for data analysis.
This book is for students, young researchers, and teachers. It provides more than a
basic understanding and techniques for statistical analysis of data related to health and
social sciences research. This book is based on Stata version 13. There are newer as well as older versions of Stata, but the commands are sufficiently similar across versions that users can use this book with any of them. The newer versions mainly add more advanced and recently developed techniques that most users are unlikely to require.
This book focuses on statistical decision-making, data analysis, and interpretation of
the outputs. It covers commonly used data analysis techniques in health and social
sciences research. The topics covered in this book include data management, descrip-
tive statistics, and bivariate and multivariable analysis for hypothesis testing, including
nonparametric methods and others.
A useful feature of Stata is that the higher versions execute the commands regardless of the version of Stata in
which they are written. It is, therefore, expected that all the commands (syntax) used in
this book will run in higher (or lower) versions.
This book is based on Stata version 13. If you are using a different version (e.g.,
version 17), you can still use the commands. If you find any problem in executing a
command provided in this book in version 17 (or other versions), type the following
command at the top of every Do-file (Chapter 3) that you create:
version 13
This simple step will ensure that your Do-file or program will continue to run not only
in version 17 but also in all future versions of Stata, even if that future version has
changes in the syntax of some of the commands or programming constructs.
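For instance, a minimal sketch of such a Do-file header (the file path and the commands below it are illustrative):

version 13
use C:\Users\HP\Desktop\Data_3.dta, clear
summarize age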
You can also use the above command as a prefix while writing a command in the com-
mand window. For example, if you want to execute an ANOVA syntax in version 17,
which is written in version 13, use the command:
version 13: anova …..
This command will set Stata’s version to 13, run the anova command, and then reset
Stata’s version to whatever it was before the command was executed. For further infor-
mation, visit: https://www.stata.com/manuals/pversion.pdf.
Data may be collected by the researchers themselves or obtained from large-scale studies (e.g., Bangladesh Demographic and Health Survey data). Once data is collected, the steps of data analysis are:
• Data coding, if a pre-coded questionnaire or record sheet is not used
• Development of a data file and data entry
• Data cleaning (checking for errors during data entry)
• Data screening (checking assumptions for statistical tests)
• Data analysis
• Interpretation of results
Like other data analysis programs, Stata has to read a data file to analyze the data. It is, therefore, necessary to develop a data file in Stata or import one from another program. Data files can be generated in Stata, but this is not a popular practice; researchers mostly import data files generated in other programs, which can easily be converted into Stata format for analysis. In this chapter, we will discuss how to generate a data file in Stata.
For generating a data file, the first and basic step is to decide on a name for each of the
variables included in the questionnaire or record sheet. To name a variable, we need to
follow certain rules. They are:
• The variable names must be unique (i.e., all the variables should have different names)
• A variable name must be between 1 and 32 characters long, but try to keep it as short as possible
• Variable names must begin with a letter (lowercase or uppercase) or an underscore. Variable names cannot begin with a number. Although an underscore can be used to begin a variable name, it is strongly discouraged because such names are used for temporary variables in Stata
• Variable names cannot include a full stop (.), a space, or symbols like ?, *, µ, λ, ~, !, -, @, and #
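For example (hypothetical names):
• Valid: idno, age_years, SBP2
• Invalid: 2age (begins with a number), age years (contains a space), bp-sys (contains a hyphen)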
Categorical variables:
• Sex (m= male; f= female)
• Religion (1= Islam/Muslim; 2= Hindu; 3= Others)
• Occupation (1= Business; 2= Government job; 3= Private job; 4= Others)
• Marital status (1= Married; 2= Unmarried; 3= Others)
• Have diabetes mellitus (1= Yes; 2= No)
Table 2.2 Data collected from the study subjects (only a portion is shown)
idno age sex religion occu income marital
1 26 m 1 2 25000 1
2 28 f 2 2 35000 1
3 29 f 1 1 60000 1
4 34 m 1 3 20000 2
Open the Stata program by double-clicking the Stata icon. You will see the Stata inter-
face (Stata/SE 13.0) as shown in Figure 1.1 (Chapter 1). The simplest way to generate
a data file is through the Data Editor. To go to the Data Editor, select from the menu
bar:
Window > Data Editor
Or,
Data > Data Editor > Data Editor (Edit)
Or,
Click the icon on the toolbar.
You will see the “Data Editor (Edit) – Untitled” as shown in Figure 2.1. This is the
window for defining the variables as well as for data entry. Use the following steps to
generate a data file:
Step 1: Our first variable is "idno" (Table 2.2). When the cursor is placed in the first
column of the first row, it will show "var1[1]" in the box above. Type the first value of
the variable "idno" as shown in Table 2.2 (which is 1, i.e., just type 1 and press the
"Enter" button). You will notice that "var1" appears at the top of the first column (Fig
2.2).
Now, type the value (which is 26) of the second variable "age" in the first box of the
second column and press "Enter". You will see that "var2" appears at the top of the
second column. Like this, enter the values of other variables into the Stata Data Editor
spreadsheet.
If the first value entered for a variable is a number, Stata will consider it a numeric
variable and will permit only numbers as its values subsequently. The numeric values
may begin with a plus or minus sign, and include decimal points. However, the num-
bers should not have any commas (,), such as 10,000 or 1,000,000.
Generating Data Files 11
If the first value entered for a variable is a non-numeric character (such as m, f, or any other letter), Stata will consider it a string (text) variable. In Stata 13, a string (str#) variable may have values up to 2,045 characters long (244 in earlier versions) and may have any combination of letters, numbers, symbols, and spaces.
In the Data Editor or Data Browser, the string variable values appear as red, numeric
variable values appear as black, and labeled (coded) numeric variable values appear as
blue.
Step 2: In this step, we will rename (replace) the variable names that have been gener-
ated automatically by Stata (like var1, var2, and var3) with the variable names as
shown in the codebook (Table 2.1). For example, we need to rename the first variable
"var1" as "idno", the second variable "var2" as "age", and so on.
You can see that there are three windows on the right side of the Data Editor (Fig 2.1).
These are the "Variables", "Properties", and "Data" windows. In the "Variables"
window, all the variable names have appeared, like var1, var2, and var3 (Fig 2.2).
Click on "var1" in the "Variables" window. Now look at the "Properties" window. It
shows "var1" against "Name" in the window. Double-click on "Name" in the "Proper-
ties" window, delete "var1" and write "idno". This will replace "var1" with "idno". You
can see the new variable name (idno) in the spreadsheet as well as in the "Variables"
window. Like this, rename all the variables generated automatically by Stata with the
variable names of your choice. You can also use the following command to change the
variable name:
rename var1 idno
This command will rename the variable “var1” to “idno”.
Step 3: We will now label the variables one by one. To write the variable label for
“idno”, select (click on) the variable “idno” in the “Variables” window (Fig 2.2). Dou-
ble-click on “Label” in the “Properties” window and write “serial no”. This will be the
variable label for “idno” as shown in the “Variables” window for the variable “idno”
against “Label”. In this way, complete the variable labels of all variables as shown in
the codebook. An alternative to labeling a variable is by using the following command
in the command window:
label var idno "serial no"
This will label the variable “idno” as “serial no”.
Step 4: Step 4 is assigning the value labels. Since the variables "idno" and "age" are
not categorical variables (i.e., these variables are not coded), they do not need to have
value labels. Value labels are only needed for the categorical variables that are coded
numerically. Stata does not allow value labels for string variables. Therefore, we need
to assign value labels only for the numerically coded variables, such as "religion",
"occupation", and others.
In our example, the variable "religion" is a numerically coded variable, and the code
numbers are 1= Muslim; 2= Hindu; 3= Others. We need to assign value labels to this
variable. To assign value labels, either use the following command or the following
steps:
label define religion 1 "Muslim" 2 "Hindu" 3 "Others"
Or simply,
la de religion 1 "Muslim" 2 "Hindu" 3 "Others"
Or, use the following steps:
• Select the variable “religion” in the “Variables” window (in Data Editor) (Fig
2.2)
• Click on “Value Label” in “Properties” window (under “variables”)
• You will see a dropdown arrow and a small box with 3 dots
• Click on the 3 dot box
• Click on “Create Label”
• Write “religion” in the box “Label name”
• Write 1 in the “Value” box; write “Muslim” in the “Label” box; and click Add
• Write 2 in the “Value” box; write “Hindu” in the “Label” box; and click Add
• Write 3 in the “Value” box; write “Others” in the “Label” box; and click Add
• Click “OK” and then “Close”
• Click on “Value Label” in “Properties” window and using the dropdown
arrow, select “religion”
The above steps will insert value labels for the variable "religion". In this way (or using
the command), provide value labels to all the coded variables in the data file and enter
the data one by one. If there is any missing value, just keep the cell blank (or type the
assigned missing value code). Stata will consider it (when the cell is kept blank) a
missing value and will indicate it with a dot (.).
Closing the Data Editor window at any time during data entry will keep the data in
memory (data will not be lost) unless you exit the Stata program. Once data entry is
completed (or partially completed), we need to save the data file (Stata data files have
the extension ".dta"). To save a data file, select:
File (from the menu bar) > Save As… > Select the folder where you want to save
the file > Give a file name (.dta will appear by default) > Save
Or,
You can save the data file using the command “save”
For example, if you want to save the data file by the name "practice" on your desktop,
use the following command (you need to specify the path):
save "C:\Users\HP\Desktop\practice.dta"
When you are using a data file for analysis and want to save it with the same name, use the command:
save, replace
This will save the data file with the same name and in the same location. The replace
option in the command replaces the old file with the same name.
So far, we have discussed how to rename and label a variable and how to assign value
labels using the Stata Data Editor as well as by using the commands. However, it is
convenient to rename and label variables and assign value labels by writing commands
in the command window of Stata. The following is the summary of how to use Stata
commands for the above functions:
• If you want to change (rename) the variable name “var1” to “idno”, use the
following command:
rename var1 idno
• If you want to label the variable “idno” as “serial no.”, use the following command:
label var idno "serial no."
• To assign the value labels (1= Muslim, 2= Hindu, 3= Others) to the variable
“religion”, use the following command:
label define religion 1 "Muslim" 2 "Hindu" 3 "Others"
After executing the above command (label define), you need to select “religion” for the
“Value Label” in the “Properties” window using the dropdown arrow, or simply use the
following command:
label values religion religion
• If you want to change the variable label of a variable, issue the “label var” command again with the new label; the new label will replace the old one.

If you have a data file in SPSS format (e.g., wealth.sav), you can convert it into Stata format from within SPSS by selecting:

File > Save as… > Select Stata Version 13 SE (*.dta) for the “Save as type” box using the dropdown arrow > Write “wealth” in the file name box > Save
This will convert and save the SPSS data file “wealth” into Stata format. You can also
use a suitable data conversion program to import data from SPSS or other formats into
Stata format.
In this chapter, we will discuss some of the basic functions related to the use of Stata
program, such as Stata files, command syntax, and others. Before we proceed to data
analysis, it is important to know all these basic things.
A data file can be opened in different ways, such as by using the drop-down menu, writing a command, or using the icon.
Suppose that you have a Stata data file named "Data_3.dta" on your computer and its
location is C:\Users\HP\Desktop. If you want to open the data file, use the following
steps or command:
File (on the menu bar) > Open > Go to Desktop and select the data file “Data_3”
> Open
Or,
use C:\Users\HP\Desktop\Data_3, clear
Or,
Click on the icon > Go to Desktop > Select “Data_3” > Open
Or,
Double click on the data file that you want to open
To save the outputs of your analysis in a log-file, use the following steps or command:

File (on the menu bar) > Log > Begin… > Select the location where you want to save the file > Write “Results_3” in the “File name” box > Select the format .smcl (usually the default) > Save
Or,
log using C:\Users\HP\Desktop\Results_3
The above command will open/generate (begin) a log-file with the name
“Results_3.smcl” on the desktop.
Once a log-file is opened, you can temporarily stop (suspend) saving the outputs at any
time during analysis by using the following steps or command:
File > Log > Suspend
Or,
log off
You can restart (resume) saving the outputs at any point in your analysis session by
using the following steps or command:
File > Log > Resume
Or,
log on
The log-file is saved and closed automatically at the end of an analysis session when
you exit Stata. You can, however, save and close the log-file anytime during analysis
by using the following steps or command:
File > Log > Close
Or,
log close
Or, if you want to reopen a closed log-file and add the outputs of a new analysis session to it, use the option “append”:

log using C:\Users\HP\Desktop\Results_3, append

You need to specify the file’s path, otherwise the command will not be executed.
If you want to overwrite (replace the contents of a log-file with the outputs of a new
analysis) the outputs in a log-file (e.g., Results_3.smcl), use the following command:
log using C:\Users\HP\Desktop\Results_3, replace
This command will delete all the contents of the log-file previously saved and will save
the outputs of the new (subsequent) analysis session.
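Putting the pieces together, a typical logging session might look like this (a sketch; the paths and the commands between the log commands are illustrative):

log using C:\Users\HP\Desktop\Results_3, replace
use C:\Users\HP\Desktop\Data_3.dta, clear
sum age
log close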
While writing a long command in a Do-file, put three forward slashes (///) at the end of a line before continuing to the next line. Stata will then consider that the command is continued on the next line.
Once a new Do-file is generated, you can copy the commands from Stata’s Review
window and paste them into the Do-file.
You can also run a saved Do-file with the command “do”, for example:

do C:\Users\HP\Desktop\Test.do

This will execute all the commands saved in the Do-file “Test.do” without opening the Do-file in the Do-file Editor.
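Before describing the elements one by one, it helps to see the general form of a Stata command, which (as given in the Stata manuals) is:

[prefix:] command [varlist] [if exp] [in range] [weight] [, options]

The elements in square brackets are optional; each is explained below.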
command: Indicates Stata’s command for the analysis of data. It tells us what Stata is
supposed to analyze. Stata commands are case sensitive. All the commands must be
written in lowercase format, otherwise they will not work.
varlist: “varlist” stands for “variable list”. It indicates the list of variables needed for a
command to execute. The variable list is optional in many commands. If “varlist” is not
specified, the command runs on all the variables in the dataset. For example, if you use
the command:
summarize age
Stata will provide the summary statistics of the variable “age”. If you use only the command “summarize” without any variable name, Stata will provide the summary statistics of all the variables in the dataset. Instead of writing the command “summarize” in full, you can use just its first three letters, “sum”, to get the summary statistics.
if exp: “if exp” means “if expression”. It specifies the conditions to be considered
during an analysis. It is optional. For example, if you want to get the summary statistics
of age for males only (supposing that males are coded as 1 for the variable “sex”), use
the following command:
sum age if sex==1
in: “in” indicates the range restrictions in terms of observation numbers. It is optional.
For example, if you want to list the first (or last) 10 values of the variable “age” in the
dataset, use the following command:
list age in 1/10

Or, for the last 10 observations,

list age in -10/l
[ ]: All the syntax in [ ] is optional. You may not need to select anything. For example,
you can use the following command (without using anything for “if”, “in”, and
“weight”) to get the summary statistics for age.
sum age
weight: “weight” indicates the “weight variable”. If there is any weight variable (frequency or sampling weight) that you want to include in an analysis, put it in square brackets after the varlist and any “if” or “in” qualifiers. For example,
sum age [fweight = v2]
Here, “fweight” indicates frequency weight (“pweight” indicates sampling weight)
and “v2” is the weight variable that you want to consider.
options: “, options” indicate an optional instruction for data analysis. Note that there
is a comma (,) before the options, which must be used. For example,
sum age, detail
Here, we have used the option “detail”. Once we use this option (detail), Stata will give
the detailed summary statistics (mean, SD, skewness, kurtosis, percentile, and others)
of the variable.
prefix: The “prefix” is not mandatory for an analysis. The prefix is used to get the
results by sub-groups, such as by sex, occupation, or other variables. For example, if
you want to get the summary statistics of age by sex (i.e., disaggregated by males and
females), the prefix is needed. Then the commands would be like:
sort sex
by sex: sum age
Or,
bysort sex: sum age
Or,
by sex, sort: sum age
Here, “by sex”, “bysort sex”, and “by sex, sort” are the prefixes, while “sum” is the
main command to get the summary statistics of age. We have used the command “sort
sex” to sort (in ascending order) the variable “sex”. Stata needs sorting (in ascending
order) of the prefix variable (in this example, sex) before executing the main command
“sum”. That’s why we used the first command, “sort sex”. You can, however, use a
single command like “bysort sex” that would first sort the “by variable” (here it is sex)
before executing the main command.
3.3.2 Codebook
Before we start data analysis, it is important to know the variables in the dataset, espe-
cially how they are coded and how the missing values are identified. A good practice is
to either develop a codebook for the data file (as discussed in Chapter 2) or to look at
the data before analysis, so that you understand the structure of the information. This
can be done by using the command “codebook”.
codebook
codebook age sex_1 diabetes
codebook age sex_1 diabetes, compact
Or,
label list
label list religion occupation diabetes
The first command listed above (codebook), where the variables are not specified,
will provide the codebook of all the variables in the dataset. The second command
will provide the codebook of the variables included in the command (age, sex_1, and
diabetes) (Table 3.2). The “label list” command considers only the numerically coded
variables.
Table 3.2 Codebook of selected variables (abridged output)

------------------------------------------------------------------------------
age                                                                        Age
------------------------------------------------------------------------------
                 type: numeric (double)
                 mean: 26.5143
            std. dev.: 7.49049
------------------------------------------------------------------------------
sex_1                                                             Sex: numeric
------------------------------------------------------------------------------
                 type: numeric (double)
                label: sex_1
------------------------------------------------------------------------------
diabetes                                                Have diabetes mellitus
------------------------------------------------------------------------------
                 type: numeric (double)
                label: diabetes
------------------------------------------------------------------------------
To sort the data in ascending order of a variable (e.g., age), use the command “sort”; to sort in descending order, use “gsort” with a minus sign:

sort age
gsort -age
Table 3.3 Key operator symbols and logical expressions used in Stata
Operator symbols:
+ Addition
- Subtraction
* Multiplication
/ Division
^ Power
Logical expressions:
< Less than
<= Less than or equal to
== Equal to
> Greater than
>= Greater than or equal to
!= Not equal to (~= can also be used; ! or ~ indicates not)
& And
| Or
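As an illustration of how these operators combine in commands, a sketch using variables from the practice dataset <Data_3.dta>:

gen pp= sbp - dbp
sum age if sex_1==1 & (religion==2 | religion==3)

The first command generates a new variable for pulse pressure (systolic minus diastolic BP); the second summarizes age for subjects coded 1 for sex_1 whose religion is coded 2 or 3.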
4
Data Cleaning and Data Screening
Once data is entered into Stata or imported from other programs, we need to be sure
that the data is free from errors. Data cleaning is commonly done by generating
frequency distribution tables of all the variables to find the out-of-range values and by
cross-tabulations (or by other means) to check the conditional values. If errors are
identified, they need to be corrected.
Simultaneously, we need to check whether the data fulfil the assumptions of the desired statistical test (data screening). For example, is the data of a quantitative variable normally distributed, as required for a t-test? There are other assumptions that need to be checked before using statistical techniques, especially for hypothesis testing. In this chapter, we
will primarily discuss data cleaning. For the time being, users may skip this chapter
and proceed to Chapter 5. Once the users develop some skills in data analysis, they can
come back to this chapter. Use the data file <Data_3.dta> for practice. The codebook
for this data file can be seen in Chapter 6 (Table 6.1).
If there is any value that is outside of the valid range of a variable, we can identify that
value against the identification (ID) number (variable name of the identification num-
ber in our dataset is "ID_no"). Let us generate a frequency distribution table of "diabe-
tes" to check the values by using the following command:
tab diabetes
This command will produce Table 4.2. The table shows that there is data with a value
of "2" for diabetes, which is outside the data range (the valid codes for diabetes are “0”
and “1”). Now, to identify the subject (ID_no) who has this value, use the following
command (Table 4.2):
list ID_no diabetes if diabetes==2
This command will provide the identification number (ID_no) of the subject for whom
the value of diabetes is “2” (Table 4.2). The table shows that the subject with the serial
number (ID_no) 9 has a value of “2”. Correct this value in the data file in the Data
Editor (Edit) mode and save the file.
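Alternatively, after verifying the correct value from the original questionnaire, you can correct it with a command instead of editing the cell manually. A sketch, assuming the verified value turns out to be 1:

replace diabetes=1 if ID_no==9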
Have |
diabetes |
mellitus | Freq. Percent Cum.
------------+-----------------------------------
No | 164 78.10 78.10
Yes | 45 21.43 99.52
2 | 1 0.48 100.00
------------+-----------------------------------
Total | 210 100.00
+-----------------+
| ID_no diabetes |
|-----------------|
9. | 9 2 |
+-----------------+
A box plot of systolic BP with the markers labeled by ID number (e.g., graph box sbp, marker(1, mlabel(ID_no))) has produced Figure 4.2, which shows the case numbers (ID_no) of the dots. These dots indicate that there are 3 potential outliers in systolic BP, and their case numbers (ID numbers) are 20, 54, and 193.
[Figure 4.2 shows two box plots of systolic BP (axis scale 100–200); in the labeled panel, the outlying dots are marked with ID numbers 193, 54, and 20]

Figure 4.2 Box plot of systolic BP with ID numbers of the outliers
5

Data Management

Data analysis may require data manipulations, such as making class intervals, classifying a group of people with a specific characteristic using a cut-off value (e.g., you may want to classify people who have hypertension using a cut-off value of either systolic or diastolic BP), and recoding data for summarization and other purposes. In this chapter, we will discuss data manipulations that are commonly needed during data analysis, like:
• Converting string variables to numeric variables
• Recoding data
• Making class intervals
• Combining data to generate an additional variable
• Transforming data
• Calculating a total score
• Extracting durations from dates
• Relocating variables
• Selecting a subgroup for data analysis
• Transforming data from wide format to long format
Use the data file <Data_3.dta> for practice.
A string variable may contain a direct response (e.g., names of districts or provinces) or coded values. The codes of a string variable may be characters (e.g., m= male; f= female) or numbers (e.g., 1= male; 2= female). Even when such codes are made up of numbers, they are stored as characters. How to convert a string variable into a numeric variable is discussed in detail below.
5.1.1 Converting a string variable with non-numeric values into a numeric variable
In our dataset, the variable “sex” is a string variable with non-numeric codes (m= male;
f= female). We can convert the string variable “sex” into a numeric variable (say, sex1)
by using the command “encode”.
encode sex, generate(sex1)
The above command (you can use "gen" instead of "generate") will generate a new
numeric variable "sex1" with the codes/values of 1 for female and 2 for male. Stata
provides the values in alphabetical order, beginning with 1 (i.e., 1 would be given to
the alphabetically first value of the original variable).
5.1.2 Converting a string variable with numeric values into a numeric variable
A string variable may have string numeric values (such as 1, 2, 3, etc.). The following
command can be used to change a string variable with numeric string values into a
numeric variable, where the numeric values will be retained as numeric.
For example, suppose that the variable "sex" is a string variable with values/codes of
0= female and 1= male. We want to convert "sex" into a numeric variable (say, sex1),
retaining the values of the string variable "sex". Use the following command:
gen sex1= real(sex)
This command will generate a new numeric variable “sex1” while keeping the values
of the string variable same (0= female; 1= male).
You can also convert a string variable with numeric values without generating a new
variable, i.e., into the same variable. To do this, use the following command:
destring sex, replace
This will convert the string variable "sex" into a numeric variable, keeping the name of
the variable and its values the same as before.
You can recode the values of a variable into a new variable using the command “recode”. Suppose that you have the numeric variable “diabetes”, which is coded as “1= no diabetes” and “2= have diabetes”. You want to recode the variable into a new variable “diabetes1”, where “1” will be coded as “0 (no diabetes)” and “2” will be coded as “1 (have diabetes)”. Use the following commands:
gen diabetes1=diabetes
recode diabetes1 (1=0) (2=1)
Or,
gen diabetes1=0
replace diabetes1=1 if diabetes==2
The first command (gen) will generate a new variable “diabetes1” equivalent to the
original variable “diabetes” (i.e., the values of diabetes1 and diabetes will be the same).
The second command will change the values of the new variable “diabetes1” from “1”
to “0” and from “2” to “1”.
The alternative approach is to generate a new variable “diabetes1” with all the values
equal to “0” by using the command “gen” (third command). Then replace the value “0”
with “1” if the value of the variable “diabetes” is “2” by using the command “replace”
(fourth command).
You can check the coding scheme of the new variable by using the commands “tab” or
“codebook”, such as:
tab diabetes1
codebook diabetes1
Note that the new variable will always appear as the last variable in the dataset that you
can check in browser mode. When you change the codes, you need to give the variable
label (if coded into a new variable) and value labels of the new codes by using the
following commands (also discussed in Chapter 2):
label var diabetes1 "have diabetes"
la de diabetes1 0 "no diabetes" 1 "have diabetes"
label values diabetes1 diabetes1
The outputs of the above commands are displayed in Table 5.1. Other examples of
using the “recode” command are provided in Table 5.2.
For this exercise, we will recode the variable "age" into a new variable "age2". We
recommend that users always recode into a new variable. If you recode into the same
variable, the original data will be lost, and it cannot be recovered once the data file is
saved. To recode into a different variable (age2), use the following commands:
gen age2=age
recode age2 (0/20=1) (21/30=2) (31/40=3) (*=4)
label var age2 "age group"
label define age2 1"<=20 yrs" 2"21-30 yrs" 3"31-40 yrs" 4">40 yrs"
label values age2 age2
tab age2
The above commands will generate a new variable "age2" with four categories of age
as mentioned above (Table 5.3). The "*" in (*=4) indicates all other values.
Using the command "recode", you can also classify people who have hypertension and
those who do not have hypertension (for example). To do this, you need to use a cut-off
value to define hypertension. For example, we have collected data on diastolic BP
(variable name is "dbp"). We want to classify those as "hypertensive", if the diastolic
BP is >90 mmHg. Now, generate a new variable (say, htn) equivalent to the variable
"dbp". Recode the variable "htn" as ≤90= 0 (normal BP) and >90=1 (hypertensive). We
hope you can do it now. If you cannot, use the following commands:
gen htn=dbp
recode htn (0/90=0) (*=1)
The above two commands will generate a new variable "htn" with values/codes "0"
and "1" (the last variable in the data file). The code "0" indicates people who do not
have hypertension (diastolic BP ≤ 90) and "1" indicates people who have hypertension
(diastolic BP >90).
To give the variable label and value labels, use the following commands:
label var htn "hypertension"
la de htn 0"no hypertension" 1"have hypertension"
label values htn htn
To check whether the commands have been executed properly, make a frequency
distribution table of the new variable "htn". This will also provide you with the propor-
tion of subjects with diastolic hypertension. Use the following command to get the
frequency distribution table for “htn” (Table 5.4):
tab htn
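The commands for this step would be along the following lines (a sketch, assuming the four knowledge items are named k1 to k4 and coded 1= correct answer and 2= incorrect answer; the variable names are illustrative):

recode k1-k4 (2=0)
gen tknow= k1+k2+k3+k4

Or,

egen tknow= rowtotal(k1-k4)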
The first command will recode the knowledge variables k1 to k4 from “2” to “0”, while
the second (or third) command will calculate the total correct answers in the new
variable “tknow” (total knowledge on HIV). You can check the data by generating the
descriptive statistics and a frequency distribution table of the variable “tknow” by
using the following commands (Table 5.9):
sum tknow
tab tknow
Table 5.9 displays the descriptive statistics (mean, standard deviation, and others) and
frequency distribution of the students' overall knowledge on HIV transmission. The
table shows that the mean of total knowledge is 2.18 (SD 0.63), while the minimum
value is 0 and the maximum value is 4. The table also indicates that there are 2 (1.0%)
students who do not have any knowledge on HIV transmission (since the score is 0,
i.e., they could not answer any questions correctly). One hundred and twenty-five
(63.8%) students know two ways of HIV transmission, while only 1.5% of the students
know all the ways of HIV transmission. You can also classify the students as having
"good" or "poor" knowledge by using a cut-off value based on the total knowledge
score.
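For extracting a duration from dates, a minimal sketch, assuming the admission and discharge dates are stored as Stata date variables named adm_date and dis_date (hypothetical names):

gen dura= dis_date - adm_date

Subtracting one Stata date from another gives the number of days between the two dates.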
This command will generate a new variable “dura” (the last variable in the dataset) that
contains the duration of hospital stay in days. You can check the data by getting the
summary statistics of the variable “dura” using the command “sum” (Table 5.10).
sum dura, detail
To restrict the analysis to a subgroup of subjects, use the command “keep”. For example:

keep if diabetes1==1
keep if age>30
Or,
keep if diabetes1==1 & age>30
This will limit your analyses to subjects with diabetes who are over the age of 30. You
can check the data by making a frequency distribution table (using the command "tab
diabetes1") of "diabetes1" and the summary statistics (using the command "sum age,
detail") of "age".
To delete a variable from the dataset, use the command “drop”. For example, if you
want to delete the variables “income” and “sex”, use the following command:
drop income sex
Note that to have analyses for all the data after you use the commands “drop” or
“keep”, you need to read the data file again, either by using the command “use” or by
using the drop-down menu.
You can also select a subset of data or variables for analysis while opening the data file
by using the following commands:
use if diabetes1==1 & age<=30 using C:\Users\HP\Desktop\Data_3.dta
use age sex diabetes1 using C:\Users\HP\Desktop\Data_3.dta
The first command will select a subset of subjects who are less than or equal to 30
years old and have diabetes. The second command will select only the variables “age”,
“sex”, and “diabetes1” while opening the data file.
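For row-wise calculations across several variables, Stata's “egen” command provides functions such as rowmean and rowtotal. A sketch, assuming variables v1 to v4 exist in the dataset:

egen v5= rowmean(v1-v4)
egen v10= rowtotal(v1-v4)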
The first command will generate a new variable "v5" that will have the row-mean of
the variables v1 to v4. The second command will generate a new variable "v10", which
will have the row-total of the variables v1 to v4.
religion==M |
USLIM | Freq. Percent Cum.
------------+-----------------------------------
0 | 84 40.00 40.00
1 | 126 60.00 100.00
------------+-----------------------------------
Total | 210 100.00
religion==H |
INDU | Freq. Percent Cum.
------------+-----------------------------------
0 | 152 72.38 72.38
1 | 58 27.62 100.00
------------+-----------------------------------
Total | 210 100.00
religion==C |
hristian | Freq. Percent Cum.
------------+-----------------------------------
0 | 184 87.62 87.62
1 | 26 12.38 100.00
------------+-----------------------------------
Total | 210 100.00
Here, the data for each subject is plotted once under each time variable.
In long format, there is a time variable (say, time) with three or more levels/ categories
(in our example, we need three levels, such as 0= time0, 7= time7, and 14= time14) and
there is a separate variable for blood sugar level (say, sugar), as shown in Table 5.13.
In this format, the blood sugar levels of the study subjects are plotted under the variable
"sugar" against the categories of the time variable.
Table 5.12 Wide data format

Subject  Time0  Time7  Time14
1        110    108    107
2        115    112    110
3        112    110    115
4        106    105    104
5        109    108    107

Table 5.13 Long data format

Subject  Time  Sugar
1        0     110
2        0     115
3        0     112
4        0     106
5        0     109
1        7     108
2        7     112
3        7     110
4        7     105
5        7     108
1        14    107
2        14    110
3        14    115
4        14    104
5        14    107
Occasionally, it may be necessary to transform the data from a wide format to a long
format. For example, Stata cannot use wide data format for the analysis of repeated
measures ANOVA. We need to change the wide data format to a long format, which
can be done by Stata. For this exercise, let us use the data file <Data_repeat_2.dta>.
In the data file, there are seven variables — “sl (serial number)”, “subject (study
subject)”, “treatment (treatment group)”, “sugar0 (baseline blood sugar level)”, “sug-
ar7 (blood sugar level at 7 hours after treatment)”, “sugar14 (blood sugar level at 14
hours after treatment)”, and “sugar24 (blood sugar level at 24 hours after treatment)”.
This data file is in wide format for the data of blood sugar levels. We can transform this
data file into a long data format (for blood sugar) by using the following command:
reshape long sugar, i(subject) j(time)
This command will provide Table 5.14. The table shows that with this command, Stata
has generated a new variable "sugar", which contains the blood sugar levels of all the
subjects at different time points. Stata also generated another new variable "time"
with four levels – 0 (baseline blood sugar level), 7 (blood sugar level at 7 hours after
treatment), 14 (blood sugar level at 14 hours after treatment) and 24 (blood sugar level
at 24 hours after treatment) [You can check it by using the command “tab time”]. Since
there are 15 study subjects and blood sugar levels are measured four times on each
subject, the total number of observations is 60. The number of variables in the dataset
has also been reduced from 7 to 5.
After the transformation of the dataset, provide the variable label and value labels of
the new variables “time” and “sugar” by using the following commands:
la var time “time of blood sugar measurement”
la de time 0 "baseline" 7 "sugar at 7 hrs" 14 "sugar at 14 hrs" 24 "sugar at 24 hrs"
la values time time
la var sugar “blood sugar level”
You can also transform a long data format into a wide data format by using the follow-
ing command:
reshape wide sugar, i(subject) j(time)
6
Data Analysis: Descriptive Statistics
Descriptive statistics are always used at the beginning of data analysis. The objectives
of using descriptive statistics are to organize and summarize data. Commonly used
descriptive statistics are frequency distribution, measures of central tendency (mean,
median, and mode), and measures of dispersion (range, standard deviation, and
variance). Measures of central tendency convey information about the average value of
a dataset, while the measures of dispersion provide information about the amount of
variation present in the dataset. Other descriptive statistics that are used during data
analysis include quartile and percentile. In this chapter, we will discuss how to analyze
data for descriptive statistics. Use the data file <Data_3.dta> for practice. The code-
book for the dataset is given in Table 6.1.
Using two variables after the command “tab” will produce a cross-table of the variables included in the command. For example, if you use the variables sex and religion with the command “tab” (i.e., tab sex religion), Stata will produce a cross-table of sex by religion. However, if you want to get the frequency distribution of several variables at a time (e.g., sex and religion), use the command “tab1”. To get the frequency distributions of sex and religion, use the following command:
tab1 sex religion
This command will produce two separate tables, one each for sex and religion, as
shown in Table 6.2. The table shows that there are a total of 210 subjects in the dataset,
out of which 133 (63.33%) are female and 77 (36.67%) are male. The frequency distri-
bution of religion is provided after the frequency distribution table of sex.
By default, the output does not show missing data. If there are missing values, the total number of subjects shown in a table will be less than the number of subjects in the dataset. For example, Table 6.2 (frequency distribution of religion) shows 210 subjects, while Table 6.3 shows only 205 subjects in the frequency distribution table of sex. This indicates that there are five missing values in the data for "sex". To get the frequency distribution with missing cases (Table 6.3),
use the following command:
tab sex, miss
To get the identification numbers (variable name is "ID_no") of the missing subjects
(for example, for sex), use the following command.
list ID_no if missing(sex)
This will provide the id numbers of the missing cases for the variable “sex” (Table 6.3).
Table 6.3 Frequency distribution of sex with missing cases and their id numbers
. tab sex
+------+
| ID_no |
|------|
6. | 6 |
9. | 9 |
15. | 15 |
20. | 20 |
26. | 26 |
+------+
To get the detailed descriptive statistics of a quantitative variable (e.g., age), use the command “sum” with the option “detail”:

sum age, detail

This command will provide the percentiles, mean, SD, variance, skewness, kurtosis, and the four lowest and four highest values of the variable “age” (Table 6.4).
You can also get the specific statistics that you may want to have for a variable. For
example, if you want to have the mean, median, SD, and 10th percentile of age (you can
also use multiple variables), use the following command:
tabstat age, stat(mean median sd p10)
tabstat age sbp dbp, stat(n mean sd median p10) col(stat) long
You can also get all those statistics at each level of another categorical variable (e.g.,
sex). Use the following commands (Table 6.5):
tabstat age, stat(n mean sd median) by(sex)
tabstat age sbp, stat(n mean sd p50) by(sex) long format
tabstat age sbp, stat(n mean sd p50) by(sex) long col(stat)
tabstat age sbp, stat(n mean sd q) by(sex) long nototal
Sometimes we need to know the confidence interval (CI) for the mean. For instance, if
you want to get the 95% CI for the mean of age, use the following command:
ci age
ci age, level(99)
The first command will provide the 95% CI (default), while the second command will
provide the 99% CI (Table 6.4) for the mean age. You can use multiple variables with
the command to get the CIs. For example, if you want to get the 95% CIs for the means
of age, income, and systolic BP (variable name is sbp), use the following command
(Table 6.4).
ci age income sbp
When a variable is coded as 0/1, Stata considers it a dichotomous categorical variable.
For example, the variable "diabetes1" in the dataset is coded as 0/1 (Table 6.1), where
code "0" indicates the subjects who do not have diabetes and code "1" indicates the
subjects who have diabetes. With this coding scheme of a variable, if we use the com-
mand "sum", the mean actually indicates the proportion of the subjects coded as “1”,
i.e., the prevalence of diabetes in the sampled population when it is a cross-sectional
data (Table 6.6). We can also get the prevalence (proportion) of diabetes using the com-
mand “tab”.
sum diabetes1
tab diabetes1
The first command (sum) will provide the mean, while the command "tab" will provide
a frequency distribution table of the variable (Table 6.6). To get the CI for a proportion,
use the command “proportion”. For example, to get the 95% CI for the prevalence of
diabetes, use the following command (you can use multiple variables together with this
command) (Table 6.6):
proportion diabetes1
Table 6.6 shows the 95% CI for the proportion (%) of people who have diabetes, which
is 0.163 (16.3%) to 0.275 (27.5%). The 95% CI indicates we are 95% confident that the
prevalence of diabetes in the population is between 16.3% and 27.5%. You can use
other options with this command, such as:
proportion diabetes1, level(99)
proportion diabetes1, over(religion) level(99)
The second command will display the 99% CI of diabetes in different religious groups.
6.2.1 Interpretation
In Table 6.4, we can see all the descriptive statistics (central tendency and dispersion)
of the variable "age", including the statistics for Skewness and Kurtosis.
We believe you understand the meanings of mean (average), SD (average difference of
individual observations from the mean) and variance (square of SD). The analysis did
not give the median directly. As indicated in Table 6.4, the 50th percentile (P50) is the
median (middle value of the data set).
As presented in Table 6.4, the mean age is 26.5 years and the SD is 7.49 years. Let us
discuss the other statistics provided in Table 6.4, especially the skewness, kurtosis,
percentile, and quartile (not provided directly in the analysis).
. tab diabetes1
Have |
diabetes |
mellitus | Freq. Percent Cum.
------------+-----------------------------------
No | 165 78.57 78.57
Yes | 45 21.43 100.00
------------+-----------------------------------
Total | 210 100.00
. proportion diabetes1
--------------------------------------------------------------
| Proportion Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
diabetes1 |
no | .7857143 .0283828 .724512 .8363903
yes | .2142857 .0283828 .1636097 .275488
--------------------------------------------------------------
Skewness and Kurtosis: These two statistics are used to evaluate whether the data has
come from a normally distributed population or not. In Table 6.4, we can see the statistics for skewness (-0.091) and kurtosis (2.69). Skewness indicates the asymmetry of a distribution (a measure of symmetry). Skewness "greater than 0" indicates data skewed to the right; skewness "less than 0" indicates data skewed to the left; while skewness "near 0" indicates data that is symmetrical (normally distributed).
However, the normality of data should not be evaluated based on skewness alone. We
also need to consider the statistic for kurtosis. Kurtosis indicates the "peakedness" or "flatness" of the distribution. Kurtosis is a measure to understand whether the data is
heavy-tailed or light-tailed relative to the normal distribution. Data with high kurtosis
tends to have heavy tails. The heavy-tailed distributions usually have outliers. Data
with low kurtosis tends to have light tails, or a lack of outliers.
The value of kurtosis for a normal distribution is 3. The data for "age" has a skewness
of -0.091 and a kurtosis of 2.69. Since the value of skewness is close to zero and the
value of kurtosis is close to 3, we may consider that the variable "age" has come from
a normally distributed population.
Percentile and Quartile: Stata has provided the percentiles (1%, 5%, 10%, 25%, etc.)
for the variable "age" (Table 6.4). It did not show the quartiles directly. When a dataset
is divided into four equal parts after being arranged in ascending order, each part is
called a quartile. It is expressed as Q1 (first quartile or 25th percentile), Q2 (second
quartile or median or 50th percentile), and Q3 (third quartile or 75th percentile).
On the other hand, when the data is divided into 100 equal parts (after the ordered array
– from lowest to highest), each part is called a percentile (P). We can see in Table 6.4
that the 10th percentile (P10) is 16.5, the 25th percentile or P25 (Q1) is 21.0, the P50 (medi-
an or Q2) is 27.0, and the P75 (Q3) is 32.0 years. The Q1 (the first quartile) is 21 years,
indicating that 25% of the study subjects’ age is less than or equal to 21 years. On the
other hand, the 75th percentile (P75 or Q3) which is 32 years, indicates that 75% of the
study subjects’ age is less than or equal to 32 years. There is another measurement,
called interquartile range (IQR), which is not shown in the analysis. IQR is the differ-
ence between Q3 and Q1 (Q3 – Q1). In this example, the IQR is 11 years (32 – 21).
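Stata can report the quartiles and the IQR directly with the “tabstat” command; for example:

tabstat age, stat(q iqr)

Here, the statistic “q” requests the three quartiles, and “iqr” requests the interquartile range.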
Sum of weights (wgt): When the "detail" option is used with the command "sum",
Stata provides the statistics "sum of wgt.", indicating the sum of weights (Table 6.4).
The table shows that the sum of weights for age is 210. Since we did not use any weight
variable in the analysis, by default, each subject is given a weight of 1. When the
option "detail" is used, the sum of the weights will therefore be equal to the number of
observations, as shown in our example. We do not need this information to interpret the
descriptive statistics.
95% confidence interval (CI): Table 6.4 shows the 95% CIs for age and other
variables (income and systolic BP). The 95% CI for mean age is 25.4 to 27.5 years. This
means that we are 95% confident that the mean age of the population from which the
sample is drawn is between 25.4 and 27.5 years.
Table 6.6 shows the mean and 95% CI of the dichotomous categorical variable "diabe-
tes1", which is coded as 0/1 (0= don’t have diabetes; 1= have diabetes). Here, the mean
value is 0.2142. This indicates that 21.42% (mean value multiplied by 100) of the
subjects have diabetes. If it is a cross-sectional data, we can also say that the preva-
lence of diabetes is 21.42%. The 95% CI of the prevalence is 16.36% – 27.54%. Here,
the 95% CI indicates that the prevalence of diabetes in the population would be
between 16.36% and 27.54%, and we are 95% confident about it.
[Output of summary statistics disaggregated by sex (-> sex = f, -> sex = m) and by diabetes status (-> diabetes = No, -> diabetes = Yes) appeared here]
7

Generating Graphs

The information derived from data analysis needs to be presented in an effective and
understandable manner. Data and information can be presented in textual, tabular, or
graphical forms. Tables and graphs are powerful communication tools for the presenta-
tion of information. They can make an article easy to understand for the readers. While
a table is suitable for presenting quantitative and qualitative information, a graph is an
effective visual method for data presentation. A graph displays data at a glance, facili-
tates comparison, and can reveal trends and relationships.
The researchers need to carefully decide which type of graph or chart will be the best
way to present the information. The type of graph or chart to be used depends on the
data type, analysis outputs, and the objective of communicating the information. Inap-
propriate use of graphs may fail to convey the right information and may sometimes
confuse the readers, leading to misinterpretation of data. Therefore, the graphs for data
presentation must be chosen carefully.
The graphs commonly used for the presentation of data include histograms, scatter
plots, box plots (box-and-whisker plots), bar graphs, line graphs, and pie charts. In this chapter, we
will discuss how to generate these graphs using Stata. Use the data file <Data_3.dta>.
7.1 Histogram
We usually generate a histogram to assess the distribution of a continuous variable. The
histogram provides information about: a) the distribution of a dataset (whether
symmetrical or not); b) the concentration of values; and c) the range of values. To
generate a histogram (say, for the variable "age"), use the following command:
histogram age
This will display a histogram for age (Fig 7.1 A). You can specify the Y-axis scale in
the command, such as frequency or percentage, by adding the following options:
histogram age, frequency
histogram age, percent
histogram age, norm percent
The first command will display a histogram where the Y-axis value is the number
(frequency; Fig 7.1 B), while the second command will display the Y-axis value in
percentage. The third command will produce a histogram with the overlying normal
curve (Fig 7.1 C).
You can specify the interval width to construct a histogram. For example, if you want
to have the histogram of age at a class interval of 3, use the following command:
histogram age, width(3) frequency
Or,
histogram age, width(3) start(2) frequency
You can also get the histograms disaggregated by a categorical variable (say, by sex). To
do this, use the following command:
histogram age, by(sex) frequency
This command will display histograms of age for females and males, separately (Fig
7.1 D). You can use the graph edit options to give a title, subtitle, and change the font,
background, and color schemes of the histogram. These options can also be specified
through commands.
Looking at the histogram (Fig 7.1), it seems that the data is more or less symmetrical.
This indicates that age may be normally (approximately) distributed in the population.
To save the graph, use the following menu sequence:
File > Save As… > Select location of the file to be saved > Give a file name in the
“File name” box (e.g., Graph1) > Select a format in the “Save as type” box (e.g.,
*.gph) > Save
Fig 7.1 Histograms of age: (A) density scale; (B) frequency scale; (C) percent scale with
overlaid normal curve; (D) frequency histograms by sex (f and m)
You can also use the following commands to save the graph as “Graph1” on your desk-
top (or other locations):
graph save C:\Users\HP\Desktop\Graph1
graph export C:\Users\HP\Desktop\Graph1.emf, replace
The first command will save the graph on the desktop in .gph format, while the second
command will export the graph to the desktop in .emf format.
Fig 7.2 Scatter plots of systolic BP against diastolic BP (A, B with fitted values, and C)
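The commands for this part of the chapter do not appear on this page; the box plots
discussed below are presumably generated with commands of the following form (a sketch,
assuming the variable names sbp and dbp used elsewhere in this chapter):
graph box sbp
graph box sbp, over(religion)
graph box sbp dbp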
The second command will display the box plots of systolic BP for the categories of
religion (Fig 7.6). You can also generate box plots for multiple variables simultaneously
by using the last command. Horizontal box plots can also be generated by using the
following command:
graph hbox sbp
Figure 7.3 Scatter plots of systolic BP and diastolic BP with ID numbers (A) and by
diabetes (B)
Figure 7.4 Scatter plots matrix for age, systolic BP and diastolic BP
The box plot (box-and-whisker plot) provides information about the distribution of a dataset.
It also provides summary statistics of a variable, like Q1 (first quartile or 25th percentile),
median (second quartile or Q2), and Q3 (third quartile or 75th percentile), as well as
information about outliers or extreme values. The lower boundary of the box indicates
the value for Q1, while the upper boundary indicates the value for Q3. The median is
represented by the horizontal line within the box. The horizontal lines at the ends of the
whiskers indicate the smallest and largest values that are not outliers (the adjacent values) (Fig 7.5).
In the box plot, the presence of outliers is indicated by the dots (Fig 7.5). The
outliers are the values greater than 1.5 box length distance (i.e., interquartile range or
IQR) from the edge (upper or lower) of the box [i.e., greater than (1.5×IQR + Q3) or
less than (Q1 – 1.5×IQR)]. You can see that there are 3 outliers in the data of systolic
BP (Fig 7.5). If you want to get the case numbers of the outliers, use the following
command ("ID_no" is the variable name for identification number in our dataset):
graph box sbp, marker(1, mlabel(ID_no))
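As a worked example with the quartiles of age from Chapter 6 (Q1 = 21 and Q3 = 32 years,
so IQR = 11): any value above Q3 + 1.5×IQR = 32 + 16.5 = 48.5 years, or below
Q1 – 1.5×IQR = 21 – 16.5 = 4.5 years, would be plotted as an outlier.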
7.4.1 Bar graph for the mean of a quantitative variable across a categorical vari-
able
You may be interested in presenting the mean income of the respondents by religion.
To do this, use the following command:
graph bar income, over(religion)
graph hbar income, over(religion)
The first command will display a bar graph showing mean income across religious
groups (Fig 7.7A). The second command will display a horizontal bar with the same
information. If you want to get the median, instead of the mean income by religion, use
the following command:
graph bar (median) income, over(religion)
Use the following commands if you want to get the value labels on each bar:
graph bar age, over(religion) blabel(bar)
graph bar age, over(religion) blabel(bar, format(%3.1f))
The first command will provide the value labels (mean age) on the bars. Adding the
option "format(%3.1f)" displays the labels with one decimal place, e.g., 27.0 (Fig 7.7B).
You can also generate a frequency bar graph of a categorical variable (say, religion) with
the "fbar" command:
fbar religion
fbar religion, percent
fbar religion, by(sex)
The first command will generate a bar graph of religion with a frequency of occurrence
and the second one with a percentage. The last command will generate a frequency bar
graph of religion by sex. The bar graphs generated by the "fbar" command cannot be
edited in graph edits.
An alternative way to generate a bar graph is to use the modified command for histo-
gram. To generate a bar graph for religion with the modified command of histogram,
use the following command:
histogram religion, discrete freq gap(20) xlabel(1 2 3, valuelabel)
histogram religion, discrete percent gap(20) xlabel(1 2 3, valuelabel)
The first command will generate a bar graph with the frequency of the bars, while the
second one will provide the bar with a percentage.
Fig 7.7 Bar graph of mean income (A) and mean age (B) by religion
name is "tfr") over the period of 1975 to 2018 (variable name is "year"). To do this, use
the following command (use the data file <Line.dta>):
line tfr year
twoway connect tfr year
The first command will produce the line graph A without the markers of the scatter
points, while the second command will produce the line graph B with the markers as
shown in Figure 7.8 (graphs have been edited for better resolution). The command
"connect" actually combines the features of scatter with a line, i.e., connects the scatter
points with a line segment. The Y-axis variable (dependent variable) must be written
first after the basic command.
You can also include two dependent variables to generate a line graph. Suppose that
you want to demonstrate the trend of TFR as well as CPR (contraceptive prevalence
rate; variable name is “cpr”) over the years. Use the following command to get the line
graph (Fig 7.8 C).
twoway connect tfr cpr year
Though you can give a title, Y-axis and X-axis labels, and others by using the Stata
commands, it is easier to write them using the graph edits. Users can easily explore the
graph editing options in Stata.
If you have a dataset in wide format (use the data file “Repeated Anova_2”, which is
in wide format), you can make a line graph by using the following command:
profileplot sugar_0 sugar_7 sugar_14 sugar_24, by(treatment)
Or,
profileplot sugar_0 - sugar_24, by(treatment)
Either of the above commands will generate a line graph of mean blood sugar levels at
4 different time points for three treatment groups (“sugar_0 - sugar_24” indicates the
variables sugar_0 to sugar_24).
Fig 7.8 Line graphs of TFR (A & B) and trends of TFR and CPR (C) by year
Fig 7.9 Pie charts of religion (Muslim 60%, Hindu 27.62%, Christian 12.38%)
8
Checking Data for Normality
To construct the Q-Q plot for systolic BP, use the first command (Fig 8.2). The second
command will generate a Q-Q plot for males.
qnorm sbp
qnorm sbp if sex=="m"
Formal statistical tests to assess the normality of data can also be used. The statistical
tests for checking the normality of data are the Shapiro Wilk test (commonly used) and
the Skewness-Kurtosis test. To do these tests for systolic BP, use the following com-
mands:
swilk sbp
sktest sbp
swilk sbp if diabetes==1
sktest sbp if diabetes==1
The outputs of first two commands are provided in Table 8.1. You can also use these
tests for multiple variables at a time.
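For example, to test several variables in a single run (using variable names from our dataset):
swilk sbp dbp age
sktest sbp dbp age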
8.1.1 Interpretation
With Stata commands, we have generated the histogram and Q-Q plot (Figs 8.1 and
8.2) for systolic BP to visually check the distribution of data. We have also used the
formal statistical tests (Shapiro Wilk test and Skewness-Kurtosis test) to assess the
normality of data (Table 8.1).
Fig 8.1 Histogram and Fig 8.2 Q-Q plot of systolic BP
. sktest sbp
Skewness/Kurtosis tests for Normality
------- joint ------
Variable | Obs Pr(Skewness) Pr(Kurtosis) adj chi2(2) Prob>chi2
-------------+---------------------------------------------------------------
sbp | 210 0.0001 0.2925 14.72 0.0006
The histogram (Fig 8.1) provides an impression about the distribution of the data (whether
the distribution is symmetrical or not). If we look at the histogram of systolic BP, it seems
that the data is slightly skewed to the right (i.e., the distribution is not symmetrical).
The Q-Q plot (Fig 8.2) also provides information on whether the data has come from a
normally distributed population or not. The Q-Q plot compares the distribution of data
with the standardized theoretical distribution from a specified family of distribution (in
this case, from a normal distribution). If the data is normally distributed, all the points
(dots) lie on the diagonal straight line. Our interest is in the central portion of the line
as well as in the tails. Deviation from the central portion of the line means non-normal-
ity. Deviations at the ends of the plot indicate the existence of outliers. We can see (Fig
8.2) that there is a slight deviation of the dots at the central portion as well as at the two
ends. This may indicate that the data may not have come from a normally distributed
population.
The specific tests (objective tests) that we have used to assess if the data has come from
a normally distributed population are the Shapiro Wilk test and the S-K test. The results
of these two tests are provided in Table 8.1.
Look at the "Prob > z" (for the Shapiro Wild test) and "Prob>chi2" (for the S-K test)
columns of Table 8.1. These columns indicate the p-values of the tests. A p-value of
<0.05 indicates that the data has not come from a normally distributed population. In
our example, the p-value is 0.000 for both the tests. This indicates that the data for
systolic BP has not come from a normally distributed population. Here, the null
hypothesis is "data has come from a normally distributed population" and the alterna-
tive hypothesis is "data has not come from a normally distributed population". We will
reject the null hypothesis since the p-values of the tests are <0.05.
The formal tests are very sensitive to sample size. In large samples (n>100), these tests
may be significant even for slight deviations from normality. Conversely, for a small
sample (n<20, for example), the likelihood of getting a p-value <0.05 is low. Therefore, the
rules of thumb for normality checking are:
1) For a sample size of <30: Assume non-normal;
2) For a moderate sample size (30-100): If the formal statistical test is significant
(p<0.05), consider a non-normal distribution; otherwise, check the normality
using other methods, such as histograms and Q-Q plots;
3) For a large sample size (n>100): If the formal statistical test is not significant
(p>0.05), accept normality; otherwise, check with other methods.
However, for practical purposes, just look at the histogram. If it seems that the distri-
bution is approximately symmetrical, consider that the data has come from a normally
distributed population and use a parametric test. If the sample size is less than 30, use
a nonparametric test.
9
Testing of Hypothesis
The present and the following chapters provide basic information on how to select
statistical tests for testing hypotheses, perform the statistical tests with Stata, and inter-
pret the results of common problems related to health and social sciences research.
Before we proceed, let us discuss some of the basic concepts of hypothesis testing.
A hypothesis is a statement about one or more populations. A hypothesis is concerned
with the parameter of a population about which the statement is made. A hospital man-
ager may hypothesize that the average length of stay at his hospital is seven days, or a
researcher may hypothesize that the rate of recovery with drug A is better than that of
drug B. By means of hypothesis testing, one determines whether or not such statements
are compatible with the available data.
There are two types of statistical hypotheses: null (H0) and alternative (HA) hypotheses.
The null hypothesis is the hypothesis of equality or no difference. The null hypothesis
always says that two or more quantities (parameters) are equal. We always test the null
hypothesis, not the alternative hypothesis. Using a statistical test, we either reject or do
not reject the null hypothesis. Only if we reject the null hypothesis can we accept the
alternative hypothesis. It is, therefore, necessary to have a very clear under-
standing of the null hypothesis.
Suppose that we are interested in determining the association between coffee drinking
and stomach cancer. In this situation, the null hypothesis is "there is no association
between coffee drinking and stomach cancer (or, coffee drinking and stomach cancer
are independent)", while the alternative hypothesis is "there is an association between
coffee drinking and stomach cancer (or, coffee drinking and stomach cancer are not
independent)". If we can reject the null hypothesis with a statistical test (i.e., if the test
is significant; p-value <0.05), only then can we say that there is an association between
coffee drinking and stomach cancer.
Various statistical tests are available to test hypotheses. Selecting an appropriate statis-
tical test is the key to analyzing the data. What statistic is to be used to test a hypothesis
depends on the study design, data type, distribution of data, and objective of the study.
It is, therefore, important to understand the nature of the variable (categorical or quan-
titative), measurement type (nominal, ordinal, interval, or ratio scale), as well as the
study design. The following tables (Tables 9.1 to 9.4) provide basic guidelines about
the selection of statistical tests depending on the type of data and situation.
10
Student's t-test for Hypothesis Testing
The Student's t-test is commonly known as the t-test. It is a frequently used parametric
statistical method to test a hypothesis. There are several types of t-tests used in differ-
ent situations (Table 9.1), such as: a) One-sample t-test; b) Independent samples t-test;
and c) Paired t-test. In this chapter, we will discuss all these t-tests and the interpreta-
tion of their outputs. Use the data file <Data_3.dta> for practice.
10.1 One-sample t-test
Hypothesis
Null hypothesis (H0): The mean diastolic BP of students is equal to 80 mmHg in
the population (study population is the students of the State University of Bangla-
desh).
Alternative hypothesis (HA): The mean diastolic BP of students is different from
(not equal to) 80 mmHg in the population.
Assumptions
1. The distribution of diastolic BP in the population is normal;
2. The sample is a random sample from the population.
The first job, before testing the hypothesis, is to check whether or not the distribution
of diastolic BP is normal in the population (assumption 1). To do this, check the histo-
gram and/or Q-Q plot of diastolic BP or do the formal statistical test of normality (Sha-
piro Wilk test or S-K test) as discussed in Chapter 8. If the assumption is met (diastolic
BP is at least approximately normal), do the 1-sample t-test; otherwise, use the
nonparametric method (Wilcoxon Signed Rank test, as discussed in Chapter 20).
Suppose that the diastolic BP is normally distributed in the population. Use the follow-
ing command to do the 1-sample t-test:
ttest dbp=80
With this command, Stata will generate Table 10.1.
One-sample t test
------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
dbp | 210 82.76667 .8107779 11.74929 81.16832 84.36502
------------------------------------------------------------------------------
mean = mean(dbp)                                              t =   3.4124
Ho: mean = 80                                 degrees of freedom =      209

   Ha: mean < 80               Ha: mean != 80                 Ha: mean > 80
Pr(T < t) = 0.9996         Pr(|T| > |t|) = 0.0008          Pr(T > t) = 0.0004
10.1.1 Interpretation
In this example, we have tested the null hypothesis "the mean diastolic BP of the
students is equal to 80 mmHg in the population". Data shows that the mean diastolic
BP of the sample is 82.77 mmHg (95% CI: 81.17 - 84.37) with a SD of 11.75 mmHg
(Table 10.1). The results of the one-sample t-test show that the calculated value of "t"
is 3.412 and the 2-tailed p-value [Pr(|T| > |t|)] is 0.0008. Since the p-value is <0.05,
we will reject the null hypothesis at the 95% confidence level. This means that the
mean diastolic BP of the students (in the population) from where the sample is drawn
is different from 80 mmHg (p<0.001).
10.2 Independent samples t-test
Hypothesis
H0: The mean age of diabetic and non-diabetic patients is the same in the popula-
tion.
HA: The mean age of diabetic and non-diabetic patients is different (not the same)
in the population.
Assumptions
1. The dependent variable (age) is normally distributed at each level of the inde-
pendent variable (diabetes);
2. The variances of the dependent variable (age) at each level of the independent
variable (diabetes) are the same/equal;
3. Subjects represent random samples from the population.
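The commands whose outputs appear in Table 10.2 are not shown on this page; they are
presumably of the following form (a sketch, assuming age is compared across the groups
of "diabetes"):
robvar age, by(diabetes)
sdtest age, by(diabetes)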
Have |
diabetes | Summary of Age
mellitus | Mean Std. Dev. Freq.
------------+------------------------------------
Yes | 27.911111 8.4633494 45
No | 26.133333 7.1835969 165
------------+------------------------------------
Total | 26.514286 7.4904907 210
The results of Levene’s test (using the command "robvar") are provided in Table 10.2.
The table shows the mean and standard deviation of age as well as the total observa-
tions (Freq) for both diabetic (Yes) and non-diabetic (No) patients. We can see that the
standard deviation of age is higher among diabetic patients (8.46) compared to non-di-
abetic (7.18) patients, but the Levene’s test will tell us whether or not this difference is
statistically significant. The results of the "sdtest" command also provide the same
information as shown after Levene’s test results in Table 10.2.
Stata has provided three options for Levene’s test results:
• W0, which is the test statistic of Levene’s test centered at the mean;
• W50, which is the test statistic centered at the median; and
• W10, which is the test statistic centered at a 10% trimmed mean (i.e., the top
5% and bottom 5% of the values are trimmed out).
We will consider the test statistics centered at the median (W50) and the p-value for
this is 0.084 (>0.05). This means that the variance of age for diabetic and non-diabetic
patients is not different (equal). Similarly, the p-value (0.147) of the "sdtest" indicates
the same, i.e., the variances for diabetic and non-diabetic patients with regard to age
are not different.
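The t-test itself (Table 10.3) is presumably obtained with the "ttest" command and its
"by()" option; a sketch, assuming the same variables, with the "unequal" option added
when the variances are unequal (Table 10.4):
ttest age, by(diabetes)
ttest age, by(diabetes) unequal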
10.2.3 Interpretation
Table 10.3 shows the descriptive measures of age by diabetes. We can see that there are
45 subjects with diabetes and 165 subjects without diabetes. The mean age of the
diabetic patients is 27.9 (SD 8.46) and that of the non-diabetic patients is 26.1 (SD
7.18) years.
The t-test results are provided at the bottom of the descriptive statistics. The calculated
t-value is -1.41 (with the degrees of freedom 208). Since we are interested in under-
standing if the mean age is the same for both diabetic and nondiabetic patients, we will
consider the 2-tailed test. Look at the p-value, provided under "Ha: diff !=0", which is
0.158 (>0.05). We cannot, therefore, reject the null hypothesis. This means that the
mean age of diabetic and non-diabetic patients in the population from where samples
are drawn is not different (p=0.158).
The results of the t-test for unequal variances are provided in Table 10.4. The interpre-
tation of the results is the same as above.
10.3 Paired t-test
Hypothesis
H0: There is no difference in mean scores before and after the training (for exam-
ple 1).
HA: The mean scores are different before and after the training.
Assumptions
1. The difference between two measurements (pre- and post-test) of the depen-
dent variable (examination scores) is normally distributed;
2. Subjects represent a random sample from the population.
10.3.1 Commands
We will use the first example (to compare the pre- and post-test scores) to do the paired
t-test. Use the following command (the variable names are: pre_test and post_test) for
the paired t-test (Table 10.5):
ttest post_test = pre_test
Paired t test
------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
post_t~t | 32 90.98438 1.492164 8.440957 87.94109 94.02766
pre_test | 32 53.57813 2.727373 15.42835 48.01561 59.14064
---------+--------------------------------------------------------------------
diff | 32 37.40625 2.47848 14.0204 32.35136 42.46114
------------------------------------------------------------------------------
mean(diff) = mean(post_test - pre_test)                      t =  15.0924
Ho: mean(diff) = 0                            degrees of freedom =       31

 Ha: mean(diff) < 0           Ha: mean(diff) != 0           Ha: mean(diff) > 0
Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000
10.3.2 Interpretation
Table 10.5 shows the descriptive statistics of both the pre- and post-test results. Look-
ing at the mean scores, we can get an impression of whether the training has increased
the mean score or not. We can see that the post-test mean is 91.0, while the pre-test
mean is 53.6, and the difference is large (37.4). To understand if the difference between
post-test mean and pre-test mean is significant or not, we need to check the paired t-test
results given at the bottom of the descriptive statistics. The results show that the calcu-
lated t-value is 15.09 (degrees of freedom: 31).
Three different p-values for the paired t-test are displayed at the bottom of the table.
Since we are interested in understanding if there is any difference in the mean scores
before and after the training (a two-tailed test), we will consider the p-value provided
at the middle [Ha: mean(diff) != 0], which is 0.000. Since this value is smaller than
0.05 (our level of significance), we will reject the null hypothesis. We can, therefore,
conclude that there is sufficient evidence to say that there is a significant difference in
the mean scores before and after the training; i.e., the mean knowledge score has
increased significantly after the training (p<0.001).
11
Analysis of Variance (ANOVA)
11.1 One-way ANOVA
Hypothesis
H0: The mean income of all the religious groups is equal.
HA: Not all means (of income) among the religious groups are equal.
Assumptions
1. The dependent variable (income) is normally distributed at each level of the
independent variable (religion);
2. The variances of the dependent variable (income) at each level of the indepen-
dent variable (religion) are the same (homogeneity of variances); and
3. Subjects represent random samples from the population.
If the variances of the dependent variable in all the categories are not equal (violation
of assumption 2), but the sample size in all the groups is large and similar, ANOVA can
be used.
11.1.1 Commands
All the following commands will provide the ANOVA test results. We recommend
using the first command, which provides the descriptive statistics along with the ANO-
VA test results (Table 11.1).
oneway income religion_2, tabulate
oneway income religion_2
anova income religion_2
Analysis of Variance
Source SS df MS F Prob > F
------------------------------------------------------------------------
Between groups 3.3068e+09 3 1.1023e+09 3.64 0.0136
Within groups 6.2349e+10 206 302663565
------------------------------------------------------------------------
Total 6.5656e+10 209 314141354
11.1.2 Interpretation
The outputs of the one-way ANOVA test are provided in Table 11.1. In this example,
we have used "income" as the dependent variable and "religion" as the independent
variable. The independent variable (religion) has 4 categories (levels) – Muslim,
Hindu, Christian, and Buddhist. The results first provided the descriptive measures
(mean, SD, etc.) of income by religion. For example, the mean income of Muslims is
BDT 88,180.9, with a SD of 17,207.6.
The ANOVA test (F-test) results are provided at the bottom of the descriptive statistics.
The value of the F-statistic is 3.64 and the corresponding p-value (Prob > F) is 0.0136.
Since the p-value is <0.05, we can reject the null hypothesis. This means that not all
group means of income are equal.
However, before interpreting the ANOVA test results, we need to check the Bartlett's
test for homogeneity of variances (equal variances) provided at the bottom of the table.
This test is done to assess if all the group-variances in income are equal (assumption
2). The p-value (Prob>chi2) of the Bartlett’s test is 0.516. Since the p-value is >0.05,
the variances of income at all levels of the religious group are equal (i.e., assumption 2
is not violated). The assumption would have been violated if the p-value was <0.05.
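The Bonferroni pairwise comparisons shown below are presumably obtained with the
"pwmean" command used later in this chapter; a sketch, assuming the same variable names:
pwmean income, over(religion_2) mcompare(bonferroni) effects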
over : religion_2
---------------------------
| Number of
| Comparisons
-------------+-------------
religion_2 | 6
---------------------------
----------------------------------------------------------------------------------------
| Bonferroni Bonferroni
income | Contrast Std. Err. t P>|t| [95% Conf. Interval]
-----------------------+----------------------------------------------------------------
religion_2 |
HINDU vs MUSLIM | -9014.877 3287.767 -2.74 0.040 -17773.41 -256.3402
Christian vs MUSLIM | -8775.289 3747.399 -2.34 0.121 -18758.27 1207.696
BUDDHISM vs MUSLIM | -3384.314 4019.891 -0.84 1.000 -14093.21 7324.585
Christian vs HINDU | 239.5876 4477.525 0.05 1.000 -11688.44 12167.61
BUDDHISM vs HINDU | 5630.563 4707.946 1.20 1.000 -6911.298 18172.42
BUDDHISM vs Christian | 5390.976 5039.677 1.07 1.000 -8034.608 18816.56
----------------------------------------------------------------------------------------
You can also get the error-bar chart of mean income for different religious groups. Use
the following three commands consecutively to get the error-bar (Fig 11.2). If you use
the command "oneway" instead of “anova” for the ANOVA test, the commands “mar-
gins” and “marginsplot” for error-bar will not work.
anova income religion_2
margins religion_2
marginsplot
11.2 Two-way ANOVA
The two-way ANOVA is another parametric test commonly used in health research. Use
the data file <Data_3.dta> for practice.
Suppose that we want to compare the mean systolic BP (variable name is "sbp") in
different occupation and sex groups. Here, the dependent variable is systolic BP and
the independent variables are occupation and sex.
Since there are four levels (categories) of occupation (govt. job; private job; business;
and others) and two categories of sex (male and female) in the data, we will have a
factorial design with eight (4×2) data cells. The two-way ANOVA test answers the
following three questions:
1. Does occupation influence the systolic BP (i.e., is the mean systolic BP equal
among the occupation groups)?
2. Does sex influence the systolic BP (i.e., is the mean systolic BP equal for males
and females)?
3. Does the influence of occupation on systolic BP depend on sex (i.e., is there
an interaction between occupation and sex)?
Questions one and two refer to the main effects, while question three refers to the
interaction of two independent variables (occupation and sex) on the dependent
variable (systolic BP). The primary objective of two-way ANOVA is to assess if there
is an interaction between the independent categorical variables on the dependent
variable.
Assumptions
1. The dependent variable (systolic BP) is normally distributed at each level of
the independent variables (occupation and sex);
2. The variances of the dependent variable (systolic BP) at each level of the inde-
pendent variables are equal; and
3. Subjects represent random samples from the population.
First, we need to check if the systolic BP is normally distributed in different categories
of occupation and sex separately. We can check this by constructing the histograms and
Q-Q plots, or by doing the Shapiro Wilk test (Chapter 8). We also need to check the
homogeneity of variances in each group of the independent variables (occupation and
sex) by using the Levene's test (Section 10.2.1).
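The two-way ANOVA itself (Table 11.4) is presumably fitted with the factorial "anova"
command, where "##" requests both the main effects and their interaction; a sketch,
assuming the variable names sbp, occupation, and sex_1:
anova sbp occupation##sex_1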
You can get the error graph of the adjusted mean of systolic BP for sex (males and
females) by occupation. To get the plot of mean systolic BP with error bars of occupa-
tion by sex, use the following commands successively (Fig 11.3):
margins sex_1, at(occupation=(1(1)4))
marginsplot
11.2.2 Interpretation
Table 11.4 shows the outputs of the two-way ANOVA test. The table shows the main
effects of the independent variables as well as their interaction. Look at the p-values
(Prob > F) for occupation and sex. They are 0.758 and 0.072, respectively. The findings
indicate that the mean systolic BP does not differ across occupation groups or between
males and females. Now, look at the p-value for "occupation#sex_1", which
indicates the significance of the interaction between these two variables (occupation
and sex) on systolic BP. A p-value of <0.05 indicates the presence of interaction. The
presence of interaction indicates that the systolic BP in different occupation groups is
influenced by (depends on) sex. In our example, the p-value for interaction is 0.230
(>0.05), which means that there is no interaction between occupation and sex to influ-
ence the systolic BP.
Fig 11.3 Error-bar plot of the adjusted mean systolic BP (linear prediction) for females
and males, by occupation
over : occupation
---------------------------
| Number of
| Comparisons
-------------+-------------
occupation | 6
---------------------------
------------------------------------------------------------------------------------------
| Bonferroni Bonferroni
sbp | Contrast Std. Err. t P>|t| [95% Conf. Interval]
-------------------------+----------------------------------------------------------------
occupation |
PRIVATE JOB vs GOVT JOB | -3.036395 3.883548 -0.78 1.000 -13.38208 7.309289
BUSINESS vs GOVT JOB | -1.52619 3.883548 -0.39 1.000 -11.87187 8.819493
OTHERS vs GOVT JOB | -2.364103 3.821385 -0.62 1.000 -12.54419 7.81598
BUSINESS vs PRIVATE JOB | 1.510204 4.074798 0.37 1.000 -9.344964 12.36537
OTHERS vs PRIVATE JOB | .672292 4.015596 0.17 1.000 -10.02517 11.36975
OTHERS vs BUSINESS | -.8379121 4.015596 -0.21 1.000 -11.53537 9.859545
------------------------------------------------------------------------------------------
When the main effects are not significant, a multiple comparisons (post-hoc) test is not
necessary. Our data shows that there is no significant effect of occupation and
sex on systolic BP, since the p-values are 0.758 and 0.072, respectively. However, if
you want to perform the multiple comparisons test (say, Bonferroni) for “occupation”
after the two-way ANOVA, use the first command, while use the second command to
get the same for “sex_1”.
pwmean sbp, over(occupation) mcompare(bonferroni) effects
pwmean sbp, over(sex_1) mcompare(bonferroni) effects
The first command will provide Table 11.5. Interpretation of the results is the same as
discussed in Section 11.1.4.
12
Repeated Measures ANOVA
Suppose that we are interested in assessing the changes in mean blood sugar levels over
time (i.e., whether the mean blood sugar levels over time are the same or different).
To conduct this study, we have randomly selected 15 individuals from a population and
measured their blood sugar levels at the baseline, i.e., before administration of a drug
(hour-0). All the individuals are then administered a drug (say, drug A), and their blood
sugar levels are measured again after 7 hours, 14 hours, and 24 hours. We are interested
in knowing if the blood sugar levels over time, after giving the drug, are the same or
not (in other words, whether the drug is effective in reducing the blood sugar levels
over time). The variable "time" in the dataset indicates the times of measurement of
blood sugar levels in the subjects. In this example, we have only one treatment group
(received drug A) but have the outcome measurements (blood sugar) at four different
points in time on the same subjects (i.e., we have one treatment group with four levels
of measurement on the same subjects).
Hypothesis
H0: The mean blood sugar levels are equal (same) at each level of measurement
(i.e., the mean blood sugar levels at 0, 7, 14, and 24 hours in the population are the
same).
HA: The mean blood sugar levels are not equal at different levels of measurement
(i.e., the mean blood sugar levels at 0, 7, 14, and 24 hours in the population are
different).
Assumptions
1. The dependent variable (blood sugar level) is normally distributed in the popu-
lation at each level of within-subjects factor;
2. The population variances of the differences between all combinations of relat-
ed groups/levels are equal (called the Sphericity assumption); and
3. The subjects represent a random sample from the population.
12.1.1 Commands
The data file “Repeat anova_3.dta” has the following variables:
Subject: It indicates the study participants, like participants 1, 2, 3, and so on. You will
notice that there are 15 subjects enrolled in this study (you can check it by using the
command "tab", like "tab subject").
Time: The variable "time" indicates the times of measurement of blood sugar levels.
Blood sugar levels were measured on the same subjects at four different times, such as:
a) at the baseline, i.e., before treatment (coded as 0); b) 7 hours after treatment (coded
as 1); c) 14 hours after treatment (coded as 2); and d) 24 hours after treatment (coded
as 3). You can check the times of measurement of blood sugar levels on the subjects by
using the following command.
tab time
This will provide Table 12.1. The table shows that there are four categories of the
variable "time", and they are: a) before treatment; b) 7 hours after treatment; c) 14
hours after treatment; and d) 24 hours after treatment.
Sugar: The variable “sugar” indicates the blood sugar levels of the study subjects at
different times of measurement.
There is another variable named "treatment" in the dataset that indicates the treatment
group in which the subjects were enrolled. This variable is not needed for the analysis
of one-way repeated measures ANOVA because we need only one treatment group for
the analysis.
To perform the one-way repeated measures ANOVA test, use the following command:
anova sugar subject time, repeated(time)
Once you use this command, the Stata results may show “matsize too small”. If you see
this message in the output window, you need to increase the “matsize” by using the
following command. Matsize indicates “maximum matrix size”, which influences the
number of variables that can be included in any of Stata's estimation commands.
set matsize 10000
The above command will increase the "matsize" to 10,000 (the "matsize" can be
increased up to a maximum of 11,000). Now, use the following command to perform the
repeated measures ANOVA test:
anova sugar subject time, repeated(time)
While using the command, the computer may take some time to analyze the data,
depending on the speed and memory of your computer and the size of the data. It may
take 3-7 minutes to calculate the ANOVA test results. The results of the analysis are
shown in Table 12.2.
Panel 1.
Number of obs = 60 R-squared = 0.7643
Root MSE = 3.94868 Adj R-squared = 0.6689
Panel 2.
12.1.2 Interpretation
Look at the “time” row in Table 12.2 (Panel 1). The p-value (Prob > F) is 0.000. This
indicates that the mean blood sugar levels significantly differ at different time points
(we have four time points). However, before explaining this table, we need to check
whether the sphericity assumption (assumption 2) has been violated or not. To check
whether the sphericity assumption is violated or not, we need to do the "Mauchly’s
test" for the sphericity assumption. If the assumption is violated (i.e., Mauchly’s test
p-value is <0.05), we commonly use the p-value of the Greenhouse-Geisser (G-G) test
as shown in the second panel of Table 12.2 to interpret the results of the repeated
measures ANOVA test.
If the within-subjects factor has more than two levels, three types of tests are available
in repeated measures ANOVA. In our dataset, the within-subjects factor is the times of
measurement of blood sugar levels (variable name is "time"), which has four levels —
before treatment, 7 hours after treatment, 14 hours after treatment, and 24 hours after
treatment (Table 12.1). The tests are:
1. Standard univariate test (when the sphericity assumption is not violated);
2. Alternative univariate tests (Greenhouse-Geisser, Huynh-Feldt, and Lower-
bound); and
3. Multivariate tests (Pillai's Trace, Wilks' Lambda, Hotelling's Trace, and Roy's
Largest Root).
All these tests evaluate the same hypothesis that the population means are equal at all
levels of measurement. The standard univariate test is based on the sphericity assump-
tion, i.e., the standard univariate test result is considered if the sphericity assumption is
not violated. In reality, and in most cases, the sphericity assumption is violated, and we
cannot use the standard univariate test results as provided in panel 1 of Table 12.2
(time row). It is, therefore, recommended to use the alternative univariate test, such as
the "Greenhouse-Geisser (G-G)" test (or the others), as provided under panel 2 of Table
12.2. The results show that the p-value of the G-G test is 0.0001 (in the time row of
panel 2), which is <0.05. This indicates that the mean blood sugar levels differ signifi-
cantly at different time points. To see which pairs of time points differ significantly,
see the pairwise comparisons below (Table 12.5).
However, if you want to check the sphericity assumption, you need to do the Mauch-
ly’s test. To get the Mauchly’s test, you need to install the user-written modules
"mauchly" and "moremata" (they are not built-in Stata commands). To install the
modules, use the following commands:
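The installation commands do not appear on this page; assuming both modules are
distributed through the SSC archive, they would typically be installed as follows:
ssc install mauchly
ssc install moremata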
To get the means and standard deviations of blood sugar levels at four time points, use
the following command:
tabstat sugar, stat(n mean sd) by(time)
This will provide Table 12.4 with the means and standard deviations of blood sugar
levels at different time points, including the number of subjects (n). You can get the
pairwise comparison of mean blood sugar levels by using the following command
(with Bonferroni’s test) (Table 12.5):
pwmean sugar, over(time) mcompare(bonferroni) pveffects
Finally, to get the line graph of mean blood sugar levels over time with 95% CI (Fig
12.1), use the commands shown after Table 12.4.
Table 12.4 Mean and SD of blood sugar levels at different time points
. tabstat sugar, stat(n mean sd) by(time)
time | N mean sd
-----------------+------------------------------
before treatment | 15 110.5333 4.73387
7 hrs after trea | 15 105.2 4.427189
14 hrs after tre | 15 101.5333 6.300416
24 hrs after tre | 15 100.4667 7.099966
-----------------+------------------------------
Total | 60 104.4333 6.862738
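The commands for Figure 12.1 are not shown on this page; since "margins" and
"marginsplot" work after the "anova" command (Section 11.1), the figure can presumably
be produced as follows (a sketch, assuming the repeated measures model above has just
been fitted):
margins time
marginsplot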
Figure 12.1 Mean blood sugar levels with 95% CIs at different time points
over : time
---------------------------
| Number of
| Comparisons
-------------+-------------
time | 6
---------------------------
------------------------------------------------------------------------------------------
| Bonferroni
sugar | Contrast Std. Err. t P>|t|
--------------------------------------------------+---------------------------------------
time |
7 hrs after treatment vs before treatment | -5.333333 2.098526 -2.54 0.083
14 hrs after treatment vs before treatment | -9 2.098526 -4.29 0.000
24 hrs after treatment vs before treatment | -10.06667 2.098526 -4.80 0.000
14 hrs after treatment vs 7 hrs after treatment | -3.666667 2.098526 -1.75 0.516
24 hrs after treatment vs 7 hrs after treatment | -4.733333 2.098526 -2.26 0.168
24 hrs after treatment vs 14 hrs after treatment | -1.066667 2.098526 -0.51 1.000
------------------------------------------------------------------------------------------
13
Association Between Two Categorical Variables: Chi-square Test of Independence
The chi-square test is a commonly used statistical test for testing a hypothesis in health
and social sciences research. The chi-square value ranges from 0 to ∞ (infinity), and it
does not take any negative value. This test is suitable to determine the association
between two categorical variables, whether the data are from cross-sectional,
case-control, or cohort studies. On the other hand, in epidemiology, cross-tabulations
are commonly done to calculate the odds ratio (OR) [for a case-control study] and
relative risk (RR) [for a cohort study] with their 95% confidence intervals (CI). The
OR and RR are the measures of strength of association between two variables. In this
chapter, we have discussed the chi-square and the Fisher’s exact tests for hypothesis
testing and how to get the RR and OR using Stata. We have also demonstrated how to
perform a stratified analysis in this chapter. Use the data file <Data_3.dta> for practice.
13.1 Chi-square test
Hypothesis
H0: There is no association between gender and diabetes (it can also be stated as
gender and diabetes are independent).
HA: There is an association between gender and diabetes (or, gender and diabetes
are not independent).
Assumption
1. The data is a random sample drawn from a population.
13.1.1 Commands
The basic command to get the chi-square test results is to generate a cross-table using
the command "tab" with the option of chi-square statistics. Use any of the following
commands to find an association (chi-square test) between sex and diabetes.
tab sex diabetes, chi2
tab sex diabetes, row col chi2
The first command will generate a cross-table of sex (the first variable is placed on the
row) and diabetes with only the observed frequencies and the chi-square test results.
The second command will provide the row and column percentages in the cross-table,
including the observed frequencies and chi-square test results (Table 13.1).
You can also use the option “all” to get all the relevant statistics (chi-square, Cramer’s
V, and others) for the association between two categorical variables (Table 13.2), such
as:
tab sex diabetes, all
tab sex diabetes, col row all
tab sex diabetes, expected all
The last command will provide all the relevant statistics with the expected cell values.
The chi-square test is valid if no more than 20% of cells have expected values of less
than 5. If the expected value is less than 5 in more than 20% of the cells, we need to use
the Fisher’s exact test instead of the chi-square test. The command for the Fisher’s
exact test is (Table 13.3):
tab sex diabetes, exact
tab sex diabetes, expected exact
Table 13.1 Chi-square test results with row and column percentages
. tab sex diabetes, row col chi2
| Have diabetes
Sex: | mellitus
string | Yes No | Total
-----------+----------------------+----------
f | 20 113 | 133
| 15.04 84.96 | 100.00
| 44.44 68.48 | 63.33
-----------+----------------------+----------
m | 25 52 | 77
| 32.47 67.53 | 100.00
| 55.56 31.52 | 36.67
-----------+----------------------+----------
Total | 45 165 | 210
| 21.43 78.57 | 100.00
| 100.00 100.00 | 100.00

Pearson chi2(1) = 8.7995   Pr = 0.003
Table 13.2 Cross tabulation with chi-square and other test results
. tab sex diabetes, col row all
| Have diabetes
Sex: | mellitus
string | Yes No | Total
-----------+----------------------+----------
f | 20 113 | 133
| 15.04 84.96 | 100.00
| 44.44 68.48 | 63.33
-----------+----------------------+----------
m | 25 52 | 77
| 32.47 67.53 | 100.00
| 55.56 31.52 | 36.67
-----------+----------------------+----------
Total | 45 165 | 210
| 21.43 78.57 | 100.00
| 100.00 100.00 | 100.00
13.1.2 Interpretation
Table 13.1 is a 2 by 2 table of sex and diabetes with row and column percentages and
the chi-square test results. The question is, which percentage should you report? It
depends on the situation/study design and what you want to report. For the data of a
cross-sectional study, it may provide better information to the readers if row percent-
ages are reported. In that case, the row percentages indicate the prevalence of the
condition (in this example, diabetes).
Table 13.3 Fisher’s exact test results with expected cell values
. tab sex diabetes, expected exact
| Have diabetes
Sex: | mellitus
string | Yes No | Total
-----------+----------------------+----------
f | 20 113 | 133
| 28.5 104.5 | 133.0
-----------+----------------------+----------
m | 25 52 | 77
| 16.5 60.5 | 77.0
-----------+----------------------+----------
Total | 45 165 | 210
| 45.0 165.0 | 210.0
For example, one can understand from Table 13.1 that the prevalence of diabetes
among males is 32.47% and that of females is 15.04% when row percentages are used.
However, the column percentages can also be reported for cross-sectional data (most
of the publications use column percentages). If column percentages are used, the
meaning will be different. In our example (Table 13.1), the results indicate that 55.56%
of the diabetic patients are males, compared to 31.52% of the non-diabetic individuals.
If the data is from a case-control study, you must report column percentages (you
cannot use row percentages for case-control studies). On the other hand, for the data of
a cohort study, one should report the row percentages. In such a situation, it would
indicate the incidence (instead of the prevalence) of the disease among males and
females.
We can see in Table 13.1 (in the row of "Total") that the overall (irrespective of gender)
prevalence of diabetes is 21.43% (considering the data is from a cross-sectional study).
Table 13.1 also shows that 32.47% of males have diabetes compared to only 15.04% of
females (i.e., the prevalence among males and females). The chi-square test actually
tests the hypothesis of whether the prevalence of diabetes among males and females in
the population is the same or not.
Now, look at the Pearson’s chi-square test results provided at the bottom of the table
[(Pearson chi2(1) = 8.7995 Pr = 0.003)]. The "Pearson chi2(1)" indicates the Pearson’s
chi-square test result with the degree of freedom (df) of 1, while "Pr" indicates the
p-value.
Before we conclude the chi-square test results, it is important to check if there is any
cell in the 2 by 2 table that has an expected value of less than 5. This can be checked
using the option "expected". Table 13.3 displays the expected cell values and the Fish-
er’s exact test p-value. The table shows that there is no cell in the 2 by 2 table with an
expected value of less than 5 (the minimum expected value found is 16.5). For the use
of the chi-square test results, it is desirable to have no cell in a 2 by 2 table with an
expected value of less than 5. If this is not fulfilled, we need to use the Fisher’s exact
test p-value to interpret the results.
Since we do not have the problem of an expected value of less than 5 in the 2 by 2 table,
we will consider the chi-square test results given at the bottom of Tables 13.1 and 13.2
for conclusion. The results show that the calculated chi-square value is 8.79 (df= 1) and
the corresponding p-value is 0.003. Since the p-value is <0.05, we will reject the null
hypothesis. This indicates that there is a significant association between gender and
diabetes. It can also be concluded that the prevalence of diabetes is higher among males
than among females, and the difference is statistically significant at the 95% confidence
level (p=0.003).
13.2 Calculating the RR and OR
Since the variable "diabetes" is coded as 1/2 (1= have diabetes and 2= don’t have diabetes), we need to change the
coding structure of the variable "diabetes" as 0= don’t have diabetes and 1= have
diabetes using the command "recode" as discussed in Section 5.2. There is no problem
with the variable "sex_1" in the dataset since it is already coded as 0= female and 1=
male. Once it is done (there is a variable in the dataset "diabetes1", which is coded as
0/1), use the following command to get the RR and OR with their 95% CIs:
cs diabetes1 sex_1
cs diabetes1 sex_1, or
cs diabetes1 sex_1, or level(99)
The command “cs” indicates “cohort study”. This command will automatically
provide the RR and its 95% CI (by default) along with other statistics (Table 13.4). If
you want to get the OR, in addition to RR, use the second command. The third com-
mand is for getting the 99% CIs for both the RR and OR.
| Sex: numeric |
| Exposed Unexposed | Total
-----------------+------------------------+------------
Cases | 25 20 | 45
Noncases | 52 113 | 165
-----------------+------------------------+------------
Total | 77 133 | 210
| |
Risk | .3246753 .1503759 | .2142857
| |
| Point estimate | [95% Conf. Interval]
|------------------------+------------------------
Risk difference | .1742994 | .0533492 .2952495
Risk ratio | 2.159091 | 1.287892 3.619616
Attr. frac. ex. | .5368421 | .2235371 .7237276
Attr. frac. pop | .2982456 |
+-------------------------------------------------
chi2(1) = 8.80 Pr>chi2 = 0.0030
Instead of using the “cs” command, you can use the following commands to get the OR
without a 2 by 2 table:
tabodds diabetes1 sex_1, or
tabodds diabetes1 sex_1, base(2) or
The first command will provide the OR without the 2 by 2 table, and with the first
category of sex (females; indicated by Odds Ratio = 1.0) as the comparison group
(Table 13.5). The second command will provide the same but the second category of
sex (males) as the comparison group (Table 13.5).
---------------------------------------------------------------------------
sex_1 | Odds Ratio chi2 P>chi2 [95% Conf. Interval]
-------------+-------------------------------------------------------------
Female | 1.000000 . . . .
Male | 2.716346 8.76 0.0031 1.362845 5.414068
---------------------------------------------------------------------------
Test of homogeneity (equal odds): chi2(1) = 8.76
Pr>chi2 = 0.0031
---------------------------------------------------------------------------
sex_1 | Odds Ratio chi2 P>chi2 [95% Conf. Interval]
-------------+-------------------------------------------------------------
Female | 0.368142 8.76 0.0031 0.184704 0.733759
Male | 1.000000 . . . .
---------------------------------------------------------------------------
Test of homogeneity (equal odds): chi2(1) = 8.76
Pr>chi2 = 0.0031
If you want to calculate the OR with 95% CI (default) for a case-control study, use the
first command (Table 13.6) as shown below. Use the second command if you want to
get the 99% CI.
cc diabetes1 sex_1
cc diabetes1 sex_1, level(99)
Another way of getting the OR is to use the command “logistic”, which is used for
logistic regression analysis. The command is:
logistic diabetes1 sex_1
The results of the above command (called univariate logistic regression analysis) are
shown in Table 13.7.
------------------------------------------------------------------------------
diabetes1 | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sex_1 | 2.716346 .9334131 2.91 0.004 1.385123 5.326991
_cons | .1769912 .0429362 -7.14 0.000 .1100168 .284737
------------------------------------------------------------------------------
In the above examples, both the outcome (dependent variable) and exposure (indepen-
dent variable; sex) variables are dichotomous variables with the coding schemes of 0/1.
If the exposure variable has more than two levels/categories (e.g., the variable "reli-
gion" has 3 categories in our dataset; 1= Muslim, 2= Hindu, and 3= Christian), the
commands "cc" or "cs" will not calculate the ORs. In that case (when the exposure
variable has more than two categories), use the command "logistic" to get the ORs,
such as (outputs are provided in Table 13.8):
logistic diabetes1 i.religion
The prefix "i." used for the variable "religion" (i.e., i.religion) is necessary when it is a
categorical variable with more than two levels. When the "i." prefix is used before the
exposure variable, Stata considers it a categorical variable. If the prefix "i." is not used,
Stata will consider the variable as a continuous variable and the results will be mislead-
ing. With the above command, Stata will consider the first category (Muslims) as the
comparison group by default. If you want the other category of religion as the compari-
son group (say, last category or Christians), use any of the following commands:
logistic diabetes1 ib3.religion
tabodds diabetes1 religion, base(3) or
Table 13.8 Odds ratios for an exposure variable with more than 2 levels
. logistic diabetes1 i.religion
------------------------------------------------------------------------------
diabetes1 | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
religion |
HINDU | 1.465201 .5378927 1.04 0.298 .7135283 3.008732
Christian | .5016722 .3271554 -1.06 0.290 .1397418 1.801001
|
_cons | .26 .0572364 -6.12 0.000 .1688846 .4002734
------------------------------------------------------------------------------
13.2.1 Interpretation
Table 13.4 shows the cross-tabulation of diabetes and sex. The table shows that of the
45 cases of diabetes (cases), 25 are males (males are exposed since they are coded as
1). On the other hand, of the 165 non-cases (who do not have diabetes), 52 are males.
The RR (also called risk ratio) calculated from the data is 2.16, and the 95% CI is 1.29
- 3.62. Stata has also provided the chi-square test results at the bottom of the table,
including the p-value of the test (0.003). From the data, we can conclude that males are
at 2.16 times higher risk of having diabetes compared to females, which is statistically
significant at the 95% confidence level (RR: 2.16; 95% CI of RR: 1.29 - 3.62; p=0.003).
Table 13.6 shows the OR (of being male) with the 95% CI. The OR, as calculated, is
2.71 and its 95% CI is 1.31 - 5.64. The analysis also provided the p-value of the
chi-square test (Pr>chi2), which is 0.003. We can, therefore, conclude that males are
more likely to have diabetes compared to females, which is statistically significant at
95% confidence level (OR: 2.71; 95% CI of OR: 1.31 - 5.64; p=0.003).
Table 13.7 shows the analysis of the data using the "logistic" command (univariate
logistic regression analysis). The table shows the OR (2.71), the 95% CI of OR (1.38 -
5.32) and the p-value (0.004). This is the OR for males compared to females (lower
value is considered as the comparison group).
Table 13.8 shows the association between diabetes and religion, which has three levels.
Note that Stata considers the lowest value/code (Muslims) of the exposure variable as
the comparison group during such an analysis (by default). The table shows the ORs,
95% CIs, and p-values for Hindus and Christians. The OR of diabetes for being a
Hindu is 1.46 (95% CI: 0.71 - 3.00; p=0.298) compared to Muslims, which is not statis-
tically significant since the p-value is greater than 0.05. Similarly, the OR of diabetes
for being a Christian is 0.50 (95% CI: 0.13 - 1.80; p=0.290) compared to Muslims,
which is also not statistically significant.
13.3.1 Interpretation
The stratified analysis in epidemiology is done to estimate the strength of association
(RR or OR) between an exposure (e.g., sex) and an outcome (diabetes) after
controlling (adjusting) for a third categorical variable (e.g., family history of diabetes).
It also enables us to examine whether: a) the third variable is a confounding factor, or
b) there is an interaction (also called effect modification) between exposure and the
third factor.
The stratified analysis can be done for the data of cross-sectional, cohort, or case-con-
trol studies. It is suitable to adjust for a single stratified variable, though more than one
variable can be used in the analysis.
In our example, we have examined the relationship between sex (exposure) and diabetes
(outcome) at two levels of the categorical stratified variable "family history of diabetes".
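The commands for the stratified analyses discussed below are not reproduced in this section; a plausible reconstruction, assuming Stata's epitab commands with the by() option and the family-history variable fhistory used in Chapter 17, is:
cs diabetes1 sex_1, by(fhistory)
cc diabetes1 sex_1, by(fhistory)
The first command treats the data as a cohort study (RR, Table 13.9) and the second as a case-control study (OR, Table 13.10); both report the crude and M-H combined estimates along with the test of homogeneity.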
Table 13.9 shows the results of stratified analysis considering that the data is from a
cohort study. The results show that among those who have a family history of diabetes,
the RR for males (compared to females) is 2.90 (95% CI: 1.33 - 6.35), while the RR is
1.79 (95% CI: 0.70 - 4.58) when there is no family history of diabetes. The table also
shows the crude (Crude) and adjusted RR (M-H combined). The crude (or unadjusted)
RR is calculated without considering the third (stratified) variable (i.e., family history
of diabetes). The crude RR, as calculated by Stata, is 2.15 (95% CI: 1.28 - 3.61), while
the adjusted RR (M-H combined) is 2.51 (95% CI: 1.36 - 4.65). The adjusted RR
indicates the RR after controlling for the family history of diabetes.
In our data (Table 13.9), we can see that there is a difference between the crude (2.15)
and adjusted (2.51) RR. This (i.e., when there is a difference between crude and adjust-
ed RR) may indicate that the family history of diabetes has some confounding effect
(influence) on the relationship between sex and diabetes. Though there is no set rule for
deciding what amount of difference is considered significant, in general, more than
20% change is considered important (in our example, it is less than 20%). However, if
the crude (unadjusted) and adjusted RRs (or ORs) are close together, the third variable
(stratified variable) is not a confounding factor in the relationship between an exposure
and an outcome.
Now, the question is whether there is an interaction between sex and family history of
diabetes on the outcome. If there is an interaction, the RRs (or ORs) in two strata of the
third variable will be different. In our example, for males, if there is a family history of
diabetes, the RR is 2.90, while the RR is 1.79 if there is no family history of diabetes,
and they are different. This indicates that there may be an interaction between sex
and family history of diabetes on the outcome. However, before we conclude like this,
we need to check the statistical significance of the difference. Stata has provided the
statistical test (Test of homogeneity (M-H); chi2(1) = 0.631; Pr>chi2 = 0.4271) to
understand the significance of the difference at the bottom of the table (Table 13.9). It
shows that the p-value (Pr>chi2 = 0.4271) is 0.427. Since the p-value is >0.05, the
difference in RRs in the two strata is not statistically significant. We may, therefore,
conclude that there is no interaction between sex and family history of diabetes on
outcome even though the RRs are different in two strata of family history of diabetes.
If the homogeneity test is significant (less than 0.05), it is not appropriate to report the
adjusted RR (or OR). The results should be presented for each stratum separately.
Table 13.10 has provided similar information, where we have calculated the OR
(instead of RR) considering that the data is from a case-control study. We can see that
the crude and adjusted ORs are 2.71 and 3.25, respectively. Since the ORs are substantially different (a change of about 20%), the family history of diabetes is a confounding factor in
the relationship between sex and diabetes. The p-value for the adjusted OR is provided
at the bottom of the table (Mantel-Haenszel chi2(1) = 9.53; Pr>chi2 = 0.0020). It
shows that the p-value is 0.002. We, therefore, conclude that males are 3.25 (M-H com-
bined OR) times more likely to have diabetes compared to females after controlling for
family history of diabetes, which is statistically significant (95% CI: 1.42 – 6.97;
p=0.002).
On the other hand, the ORs of diabetes for males with and without a family history of
diabetes are 3.81 and 2.19, respectively. Though they are different, the difference is not
statistically significant (p=0.505) as shown in the table [Test of homogeneity (M-H);
chi2(1) = 0.44; Pr>chi2 = 0.5056]. Therefore, there is no interaction between sex and
family history of diabetes on the outcome variable.
14
Hypothesis Test of Proportions
Data is frequently collected in health and social sciences research to estimate the
proportions. We may have a situation where there is a single group of individuals and
a certain proportion of them have a particular characteristic (e.g., a disease). For exam-
ple, a researcher has collected data from a population by taking a random sample and
found that a certain percentage (proportion) of individuals have diabetes. The research-
er may be interested in testing the null hypothesis that the population proportion is
equal to a pre-specified value/proportion (one-sample test of proportion).
On the other hand, for the comparison of proportions of two independent samples or
the proportions of two groups of individuals, a two-sample test of proportions is used.
In this chapter, we have discussed how to test the null hypothesis for one-sample and
two-sample proportions. Use the data file “Data_3.dta”.
The frequency distribution of diabetes (Table 14.1) shows that there are 45 individuals who have diabetes out of 210. Therefore, the prevalence of diabetes is 21.43%.
We are interested in testing the hypothesis of whether or not the prevalence of diabetes
in the district is different from the national prevalence of 19.0%. Here, the null hypoth-
esis is "the prevalence of diabetes in the district is not different from 19.0%", while the
alternative hypothesis is "the prevalence of diabetes in the district is different from
19.0%". This is a situation where we can apply the one-sample test of proportion.
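The frequency distribution below was presumably obtained with Stata's tabulate command (a reconstruction; the variable name diabetes1 matches the commands that follow):
tab diabetes1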
Table 14.1 Frequency distribution of diabetes
have diabetes 01 |      Freq.     Percent        Cum.
-----------------+-----------------------------------
              no |        165       78.57       78.57
             yes |         45       21.43      100.00
-----------------+-----------------------------------
           Total |        210      100.00
The one-sample test of proportion is analogous to the one-sample t-test. The one-sam-
ple proportion test is used to compare the observed proportion with a hypothetical
value. To do the one-sample test of proportion, use the following command:
prtest diabetes1==.19
The results are shown in Table 14.2. The hypothesis that we have tested is a two-tailed
hypothesis. The p-value of the two-tailed test is provided under "Ha: p != 0.19", which
is 0.369 [Pr(|Z| > |z|) = 0.3697]. Since the p-value is >0.05, we cannot reject the null
hypothesis. It can, therefore, be concluded that the prevalence of diabetes in the district
may not be different from the national prevalence of 19.0%.
Table 14.3 Cross-tabulation of sex and diabetes (frequency and row percentages)
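A command that would produce this cross-tabulation (frequencies with row percentages), presumably, is:
tab sex_1 diabetes1, row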
Suppose that we are interested in examining whether the same proportion of males and
females has diabetes in the population, i.e., whether the prevalence of diabetes is the
same among females and males. This is a situation where we can apply the two-sample
test of proportions. For this example, the null hypothesis is "the proportion of males
who have diabetes is equal to the proportion of females who have diabetes in the popu-
lation". The alternative hypothesis is that the two proportions are not the same (differ-
ent) in the population.
To test the null hypothesis that the two proportions are the same in the population, use
the following command:
prtest diabetes1, by(sex_1)
The results are shown in Table 14.4. The table shows that the difference between the two proportions (female minus male) is -.174 (-17.4 percentage points). This means that the prevalence of diabetes among females is 17.4 percentage points lower than that of males (15.03% vs. 32.46%). The 95% CI of the difference is also given in the table, which is -.295 (-29.5%) to -.053 (-5.3%) ["diff" row in the table]. Our interest is in the two-sided p-value of the test, which is 0.003 [Pr(|Z| > |z|) = 0.0030]. Since the p-value is <0.05, we will reject the null hypothesis and conclude that the proportion of males who have diabetes is different from that of females (i.e., the prevalence of diabetes among males is significantly higher than that of females). The chi-square test of independence also tests the same hypothesis.
15
Association Between Two Continuous Variables: Correlation
The nature and strength of the relationships between two or more continuous variables
can be examined by regression and correlation analysis. Correlation is concerned with
measuring the strength of a relationship between continuous variables. The correlation
model provides information about the relationship between two variables without
distinguishing which one is the dependent and which one is the independent variable.
But the basic procedure for regression and correlation models is the same.
Under the correlation model, we calculate the "r" value. The "r" is called the sample correlation coefficient. It indicates the degree of linear relationship between the two variables (X and Y). The value of "r" ranges between “–1” and “+1”. This chapter will cover the correlation model. Use the data file <Data_3.dta> for practice.
Hypothesis
H0: There is no correlation between systolic and diastolic BP.
Assumptions
1. The variables (systolic and diastolic BP) are normally distributed in the popu-
lation;
2. The subjects represent a random sample from the population.
Figure 15.1 Scatter plot of systolic and diastolic BP
Figure 15.2 Scatter plot of systolic and diastolic BP with the regression line
Figure 15.3 Scatter plot of diastolic BP and age
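A plausible set of the four commands described below, based on the outputs shown in Tables 15.2 and 15.3 (the first and fourth commands are assumptions following the same pattern), is:
corr sbp dbp
corr sbp dbp age income
pwcorr sbp dbp, sig
pwcorr sbp dbp age income, sig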
The first command will provide the Pearson’s correlation coefficient (r value) of
systolic and diastolic BP without the p-value (Table 15.1). The second command will
generate the correlation matrix of the variables (sbp, dbp, age, and income) included in
the command (Table 15.2). If you want to obtain the correlation coefficient of systolic
and diastolic BP along with the p-value, use the third command (Table 15.3). The
fourth command will provide a correlation matrix table for the variables included in the
command with the corresponding p-values (Table 15.4). The commands for the
correlation matrix are provided as an example.
15.1.3 Interpretation
In the first step, we have constructed the scatter plots of systolic and diastolic BP (Figs
15.1 and 15.2). Figure 15.1 shows that the data points are scattered around an invisible
straight line (indicating linear relationship) and there is an increase in the diastolic BP
(Y) as the systolic BP increases (X). This indicates that there may be a positive correlation between these two variables.
Table 15.2 Correlation matrix of systolic BP, diastolic BP, age and income
. corr sbp dbp age income
(obs=210)
| sbp dbp age income
-------------+------------------------------------
sbp | 1.0000
dbp | 0.8468 1.0000
age | -0.0378 -0.0224 1.0000
income | 0.0163 0.0697 -0.1324 1.0000
Table 15.3 Pearson’s correlation between systolic and diastolic BP with the p-value
. pwcorr sbp dbp, sig
| sbp dbp
-------------+------------------
sbp | 1.0000
|
dbp | 0.8468 1.0000
| 0.0000
Look at Fig 15.2, which shows the regression line on the scatter plot. The regression
line has passed from near the lower left corner to the upper right corner, indicating a
positive correlation between systolic and diastolic BP. If the relationship was negative
(inverse), the regression line would have passed from the upper left corner to the lower
right corner.
Figure 15.3 shows the scatter plot of diastolic BP and age. It does not indicate any
correlation between diastolic BP and age since the dots are scattered around the regres-
sion line, which is more or less parallel to the X-axis.
Table 15.1 shows the Pearson’s correlation coefficient (r value) of systolic and diastol-
ic BP, which is 0.846. Table 15.3 shows the same result, but with the p-value, which is
0.000. The correlation coefficient [r] indicates the strength or degree of the linear
relationship between two variables (systolic and diastolic BP). As the value of "r" is
positive and the p-value is <0.05, we can conclude that there is a significant positive
correlation between systolic and diastolic BP.
The value of “r” lies between “–1” and “+1”. Values near “zero” indicate no correlation, while values near “+1” or “–1” indicate a strong correlation. A negative value of “r” indicates an inverse relationship. A value of r ≥ 0.8 indicates a very strong correlation; an “r” value between 0.6 and <0.8 indicates a moderately strong correlation; an “r” value between 0.3 and <0.6 indicates a fair correlation; and an “r” value of <0.3 indicates a poor correlation [8].
Spearman's rank correlation is used when one or both variables are not normally distributed or are measured on an ordinal scale, such as severity of pain (no, mild, moderate, severe, and very severe pain) and grade of cancer (stage 1, stage 2, stage 3, stage 4, and stage 5).
To obtain the Spearman correlation coefficient of systolic BP (variable name “sbp”)
and income, where income is not normally distributed, use the following command:
spearman sbp income
This command will report the Spearman correlation coefficient of systolic BP and
income along with the p-value as shown in Table 15.5.
The Kendall’s tau-b statistic is used to determine the correlation between two ordinal variables, or between an ordinal and a continuous variable (provided the ordinal variable has fewer than 5 levels). To determine the correlation between age group (variable name is age2) and systolic BP, use the following command (Table 15.6):
ktau age2 sbp
15.2.1 Interpretation
Table 15.5 shows the number of pairwise observations (n=210) used to calculate the Spearman correlation coefficient. Spearman’s rho indicates the Spearman correlation coefficient. In our example, Spearman’s rho is 0.007, which is very small. The p-value of the test, indicated by “Prob > |t|”, is 0.9192, which is >0.05. We cannot, therefore, reject the null hypothesis. This indicates that there is no statistically significant correlation between systolic BP and income.
Table 15.6 shows the results of the Kendall’s tau correlation between age group and
systolic BP. The correlation coefficient (Kendall’s tau-b) of the variables is 0.0114 and
the corresponding p-value is 0.83. This indicates that there is no significant correlation
between age group and systolic BP.
Table 15.6 Kendall’s tau-b correlation between age group and systolic BP
. ktau age2 sbp
15.3.1 Interpretation
The results of the partial correlation of diastolic and systolic BP after adjusting for age
and diabetes mellitus are displayed in Table 15.7. We can observe that the partial
correlation coefficient of diastolic and systolic BP is 0.847 and the p-value is 0.000.
This means that these two variables (diastolic and systolic BP) are significantly
correlated (p=0.000) even after controlling for age and diabetes mellitus. The table also
provides the results of semipartial correlation. In semipartial correlation, the correla-
tion coefficient is calculated holding the other variables (age and diabetes, in our
example) constant either for X or Y, but not for both. In partial correlation, the other
variables are held constant for both X and Y. We should consider the partial correlation for reporting.
Table 15.7 Correlation between systolic and diastolic BP after controlling for age and diabetes
. pcorr dbp sbp age diabetes
(obs=210)
The partial correlation provided for the age indicates the partial correlation coefficient
of age and diastolic BP (0.022; p=0.752) after controlling for systolic BP and diabetes.
16
Linear Regression
Regression analysis is a useful statistical method of data analysis in health and social
sciences disciplines. The nature and strength of relationships between two or more
continuous variables can be ascertained by regression and correlation analyses.
We have discussed correlation in Chapter 15. While correlation is concerned with
measuring the strength of the linear relationship between variables, regression analysis
is useful in predicting or estimating the value of one variable corresponding to a given
value of another variable. For instance, we can use regression analysis to examine
whether systolic BP is a good predictor of diastolic BP and also to get the estimated
(predicted) value of diastolic BP corresponding to a value of systolic BP. In regression
analysis, our main interest is in the regression coefficient (also called the slope or β). The regression coefficient indicates the strength of association between the dependent (Y) and independent (X) variables. Linear regression analyses can be done either as simple
linear regression or multiple linear regression methods.
In this chapter, both simple and multiple linear regression methods are discussed.
Multiple linear regression is a type of multivariable analysis. The multivariable analy-
sis is a statistical tool where multiple independent variables are considered for a single
outcome variable. The terms "multivariate analysis" and "multivariable analysis" are
often used interchangeably in health and social sciences research. In fact, multivariate
analysis refers to a statistical method for the analysis of multiple outcome variables.
Multivariable analyses are widely used in observational studies, intervention studies (randomized or nonrandomized trials), and studies of diagnosis and prognosis. The main purposes of multivariable analyses are to adjust (control) for confounding factors and to predict an outcome variable from multiple independent variables.
Assumptions
1. Normality: For any fixed value of X (systolic BP), the sub-population of Y
values (diastolic BP) is normally distributed;
2. Homoscedasticity: The variances of the sub-populations of “Y” are all equal;
3. Linearity: The means of the sub-populations of “Y” lie on the same straight
line; and
4. Independence: Observations are independent of each other.
The first step in analyzing the data for regression is to construct a scatter diagram to
visualize the relationship between the two variables, which is discussed in Chapter 15.
The scatterplot will provide an indication of the linear relationship between the
variables, diastolic and systolic BP. For example, to get the scatter plot of diastolic and
systolic BP with the fit-line, use the following command:
graph twoway lfit dbp sbp || scatter dbp sbp
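The regression whose outputs appear in Table 16.1 is presumably run with Stata's regress command (a reconstruction consistent with the output shown below):
regress dbp sbp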
16.1.2 Interpretation
The results of simple linear regression analysis are provided in Table 16.1. The table
shows the coefficient of determination (R-squared) of diastolic BP on systolic BP. The
coefficient of determination is the square of the correlation coefficient value (r value).
The table shows that the coefficient of determination (R-squared) of diastolic BP on
systolic BP is 0.717 and the p-value (Prob > F) is 0.000. The table also shows the value
for the adjusted coefficient of determination (Adj R-squared).
------------------------------------------------------------------------------
dbp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sbp | .4960326 .0216034 22.96 0.000 .4534429 .5386223
_cons | 19.40677 2.793132 6.95 0.000 13.90029 24.91325
------------------------------------------------------------------------------
The coefficient of determination, or the R-squared value, indicates the amount of varia-
tion in "Y" due to "X" that can be explained by the regression line. Here, the R-squared
value is 0.717 (~0.72), which indicates that 72% of the variation in diastolic BP can be
explained by systolic BP. The rest of the variation (28%) is due to other factors (unex-
plained variation). The adjusted R-squared value (0.715), as shown in the table, is the
adjusted value for better population estimation. The significance of the R-squared
value is assessed by the F-test [F (1, 208) = 527.20] as shown in the table. Since the
p-value (Prob > F) of the coefficient of determination is less than 0.05, it is statistically
significant. This finding indicates that the linear regression model is useful in predict-
ing the dependent variable (diastolic BP) by the independent variable (systolic BP) in
the model.
------------------------------------------------------------------------------
dbp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sbp | .5177267 .025748 20.11 0.000 .466791 .5686625
_cons | 17.38358 3.380971 5.14 0.000 10.69521 24.07194
------------------------------------------------------------------------------
We can conclude from the data that there is a significant positive correlation between diastolic and systolic BP since the regression coefficient (0.496; “Coef.” in Table 16.1) is positive and statistically significant (p=0.000), and we can, therefore, use the
regression equation for prediction. Table 16.1 shows the regression (also called
explained) sum of squares (20688.99; in the row "Model" and column "SS") and the
residual (also called error) sum of squares (8162.57; in the row "Residual" and column
"SS"). The residual is the difference between the observed value and the predicted
value (i.e., the observed value and the value on the regression line). The residual sum
of squares provides an idea of how well the regression line actually fits into the data.
The smaller the value, the better the fit.
The strength of the relationship between variables is measured by the regression coef-
ficient (β or b) or slope. Table 16.1 shows the regression coefficient (row "sbp" and
column "Coef.”) of diastolic BP on systolic BP, which is 0.496 with a p-value of 0.000
(P> | t |). The table also shows the value for "a" (_cons) or Y-intercept, which is 19.40.
These values (regression coefficient and constant) are needed to construct the linear
regression equation.
The value of “b” (regression coefficient) indicates the amount of change in “Y” for
each unit change in “X”. In our example, the value of “b” is 0.496. It indicates that if
the systolic BP increases (or decreases) by 1 mmHg, the diastolic BP will increase (or
decrease) by 0.496 mmHg. The table shows the significance (p-value) of “b”, which is
0.000. A p-value of <0.05 indicates that “b” is not equal to zero in the population (the
null hypothesis is that “b” is equal to zero in the population). For simple linear regres-
sion, if R-square is significant, “b” will also be significant.
On the other hand, the value of "a" (constant or Y-intercept) in our example is 19.407.
The value, a= +19.407, indicates that the regression line crosses or cuts the Y-axis
above the origin (zero) and at the point of 19.407 (a negative value indicates the regres-
sion line cuts the Y-axis below the origin). The value of "a" does not have any practical
meaning since it indicates the average diastolic BP of individuals if the systolic BP is
zero.
We know that the equation for simple linear regression is Y = a + bX, where "Y" is the
predicted value of the dependent variable; "a" is the Y-intercept or constant; "b" is the
regression coefficient or slope; and "X" is a value of the independent variable. There-
fore, the regression or prediction equation for this regression model is:
Y = 19.407 + 0.496 × X
With this equation, we can estimate the diastolic BP of an individual by his/her systolic
BP. For example, what will be the estimated diastolic BP of an individual whose systol-
ic BP is 130 mmHg? Using the above equation, the answer is that the estimated diastol-
ic BP will be equal to 83.89 mmHg (19.407 + 0.496 × 130). Stata can calculate this for
you if you use the following command after performing the regression analysis:
margins, at(sbp=(130))
If we want to use the regression equation for the purpose of prediction, “b” needs to be
statistically significant (p<0.05). In our example, the p-value for “b” is 0.000. We can,
therefore, use the equation for the prediction of diastolic BP by systolic BP.
The analysis (Table 16.1) has actually evaluated whether "b" is zero or not in the popu-
lation by the t-test (Null hypothesis: the regression coefficient (b) is equal to "zero" in
the population; Alternative hypothesis: the population regression coefficient is not
equal to "zero"). We will reject the null hypothesis since the p-value is 0.000 (<0.05).
We can, therefore, conclude that the systolic BP can be considered in estimating the
diastolic BP by using the following regression equation:
Y = 19.407 + 0.496 × X.
Stata can also generate the dummy variables automatically during regression analysis if the prefix "i." is used before a variable that has more than two levels (in general, if the prefix "i." is used for any variable, Stata considers it a categorical variable during analysis).
For example, if we include religion in the regression analysis as "i.religion", Stata will
automatically generate two dummy variables for religion during the analysis, with the
first category (Muslim) as the comparison group by default. You can change the comparison group by using the prefix “ib.”. For example, if you want the second category of religion (Hindu) as the comparison group, use the prefix “ib2.” (i.e., ib2.religion, since Hindus are coded as 2).
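A minimal sketch of such a command, using the variables of this chapter for illustration (not the model reported in Table 16.3):
regress dbp age sbp i.sex ib2.religion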
We always need to decide on a comparison group (e.g., a comparison group for religion or other variables) before generating the dummy variables or entering a categorical variable in the model with the prefix “i.”. Stata, by default, considers the first category (e.g., Muslims for the variable “religion”, since Muslims are in the first category) of a variable as the comparison group if the variable is entered with the prefix “i.” (e.g., i.religion). If we want to consider another category (e.g., the third category, or Christians) as the comparison group, we should use the prefix “ib3.” since Christians are coded as 3.
In our regression analysis, we will use the variable “religion” with Christians (coded as 3) as our comparison group by using the prefix “ib3.”. When a variable is coded as 0/1 (e.g., the variables “sex” and “diabetes”), the regression estimates in multiple regression analysis will be for the higher value, and the lower value will be the comparison group.
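The commands behind the two panels of Table 16.3 (unstandardized coefficients, then standardized beta coefficients) are presumably the following; the first matches the robust-variance version shown later in this chapter, and beta is a standard regress option:
regress dbp age sbp i.sex ib3.religion
regress dbp age sbp i.sex ib3.religion, beta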
------------------------------------------------------------------------------
dbp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | -.0283828 .0562385 -0.50 0.614 -.139266 .0825004
sbp | .4892801 .0216686 22.58 0.000 .446557 .5320032
|
sex |
Male | -2.164135 .9128191 -2.37 0.019 -3.963905 -.3643654
|
religion |
MUSLIM | .212462 1.363372 0.16 0.876 -2.475645 2.900569
HINDU | -.229616 1.488835 -0.15 0.878 -3.165094 2.705862
|
_cons | 21.82239 3.426989 6.37 0.000 15.06553 28.57925
------------------------------------------------------------------------------
------------------------------------------------------------------------------
dbp | Coef. Std. Err. t P>|t| Beta
-------------+----------------------------------------------------------------
age | -.0283828 .0562385 -0.50 0.614 -.0185771
sbp | .4892801 .0216686 22.58 0.000 .8352803
|
sex |
male | -2.164135 .9128191 -2.37 0.019 -.0889736
|
religion |
MUSLIM | .212462 1.363372 0.16 0.876 .00888
HINDU | -.229616 1.488835 -0.15 0.878 -.0087588
|
_cons | 21.82239 3.426989 6.37 0.000 .
------------------------------------------------------------------------------
The first command will provide the average diastolic BP for the different categories of religion and sex separately. The second and third commands will provide the average diastolic BP for religion and sex together (e.g., Muslim males, Muslim females, Hindu males, Hindu females, Christian males, and Christian females).
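The commands themselves are not reproduced here; a plausible reconstruction, assuming Stata's margins postestimation command, is:
margins religion sex
margins religion#sex
margins sex#religion
The first gives the margins for religion and sex separately, while the interaction forms give the six religion-by-sex combinations.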
16.2.3 Interpretation
In the analysis, we used diastolic BP (dbp) as the dependent variable and age, systolic
BP (sbp), sex, and religion as the explanatory variables.
Table 16.3 shows the outputs of the multiple regression analysis. The table shows that
data from 210 subjects were analyzed. It shows the values for R-squared (0.725) and
adjusted R-squared (0.718), including the p-value (Prob > F; 0.000).
The R-squared value of 0.725 indicates that all the independent variables (age, systolic
BP, sex, and religion) together in the model explain 72.5% of the variation in diastolic
BP, which is statistically significant (p= 0.000). If the sample size is small, the
R-squared value may overestimate the population value. The adjusted R-squared
(0.718) gives the R-squared value for better population estimation.
Table 16.3 also shows the regression coefficients (Coef.), p-values (P > | t |) and 95%
confidence intervals (95% Conf. Interval) for all the explanatory variables in the mod-
el, along with the constant (_cons). The regression coefficients as shown in the table
are for age (-0.028; p= 0.614), systolic BP (0.489; p<0.001), sex (-2.164; p= 0.019 for
males compared to females), Muslims (0.212; p= 0.876 compared to Christians) and
Hindus (-0.229; p= 0.878 compared to Christians).
From the analysis, we can conclude that the systolic BP and sex are the factors signifi-
cantly influencing the diastolic BP (since the p-values are <0.05). The other variables
in the model (age and religion) do not have any significant influence in explaining (or
predicting) the diastolic BP. The regression coefficient (Coef.) [also called multiple
regression coefficient] for systolic BP, in this example, is 0.489 (95% CI: 0.44 to
0.53; p<0.001). This indicates that the average increase (or decrease) in diastolic BP
is 0.489 mmHg if the systolic BP increases (or decreases) by 1 mmHg after adjusting
for all other variables (age, sex, and religion) in the model. On the other hand, the
regression coefficient for sex is –2.164 (95% CI: -3.96 to -0.36; p= 0.019), which
means that the average diastolic BP of males is 2.16 mmHg less (since the coefficient
is negative) than that of females after adjusting for all other variables (age, systolic BP,
and religion) in the model. If the regression coefficient was positive (e.g., +2.164), the
average diastolic BP of males would be 2.16 mmHg higher than that of females, given
the other variables constant in the model.
Table 16.3 (at the bottom) shows the standardized coefficients (beta). The standardized
coefficients are used to understand the magnitude of the influence of independent
variables on the dependent variable. The higher the value, the greater the influence.
The table shows that the beta for systolic BP and sex are 0.835 and -0.088, respectively.
Therefore, systolic BP has a much greater influence than sex in predicting diastolic BP.
Regression equation
The regression equation to estimate the average value of the dependent variable with the explanatory variables, based on the coefficients in Table 16.3, is:
Y (diastolic BP) = 21.822 – 0.028 × age + 0.489 × sbp – 2.164 × sex(male) + 0.212 × religion(Muslim) – 0.229 × religion(Hindu)
where sex(male), religion(Muslim), and religion(Hindu) are 0/1 indicator (dummy) variables, with females and Christians as the comparison groups.
In addition to the assumption that the relationship between the dependent and independent variables is linear, we need to check the following four assumptions on residuals for a linear regression model to be valid. The assumptions to be checked on the residuals are:
1) The residuals are normally distributed with the mean equal to zero;
2) The residuals have constant variance (homoscedasticity);
3) There is no outlier; and
4) The data points are independent.
Table 16.4 Correlation matrix of systolic BP, age, sex, and religion
. corr sbp age sex religion
(obs=210)
To get the measures of tolerance and variance inflation factor (VIF), use the command
“vif” after performing the multiple regression analysis. A tolerance value indicates the
degree of collinearity. The tolerance value is the inverse of the VIF measure (1/VIF).
The tolerance value ranges from 0 to 1. A value of “zero” indicates that the variable is
almost in a linear combination (i.e., has a very strong correlation) with other indepen-
dent variables. To get the values for VIF and tolerance, use the following command
after performing the regression analysis:
vif
This command will provide both the VIF and Tolerance (1/VIF) values of the indepen-
dent variables included in the regression analysis (Table 16.5).
The table (Table 16.5) shows that the tolerance (1/VIF) values for sex, systolic BP, and
age are greater than 0.95. The tolerance values for both religion 1 (Muslims) and 2
(Hindus) are 0.41. Usually, the dummy variables have lower tolerance values. The
recommended tolerance level is greater than 0.6 before we include the variables in the
multiple regression model. However, a tolerance value of 0.40 and above is also
acceptable, especially if it is a dummy variable.
If there are variables that are highly correlated (tolerance value is <0.4), one way to
solve the problem is to exclude one of the correlated variables from the model. The
other way is to combine the explanatory variables together (e.g., by taking their sum).
Finally, to develop a model for multiple regression, we should first check for multicollinearity and then the other assumptions (see below). Only when these requirements are fulfilled can we finalize the regression model.
One of the assumptions of linear regression is that the relationship between the dependent and independent variables is linear. This is called the assumption of linearity. The best way to check the linearity assumption is to construct a scatterplot and visually inspect the plot for linearity.
Checking linearity for a simple linear regression is straightforward since there is only
one predictor (independent) variable against the response (dependent or outcome)
variable. For a simple linear regression, we can check the linearity by generating a
scatterplot of the dependent variable against the independent variable, which is
discussed in Section 7.2. However, to check the linear relationship between diastolic
and systolic BP, use the following commands to get a scatterplot of diastolic BP against
systolic BP:
twoway (scatter dbp sbp)
twoway (scatter dbp sbp) (lfit dbp sbp)
twoway (scatter dbp sbp) (lfit dbp sbp) (lowess dbp sbp)
The first command will display a scatterplot of diastolic BP and systolic BP without the
regression line (fit-line). The second command will provide the scatterplot with the
fit-line (Fig 16.1), while the third command will provide all the above with a smooth
prediction line. Figure 16.1 shows that the relationship between diastolic and systolic
BP is linear since the scatter dots are lying more or less symmetrically around a straight
line.
Checking the linearity assumption in multiple linear regression is not straightforward.
It can be checked in several different ways. The most straightforward way is to draw
scatter plots of standardized residuals (z-residuals) against each of the predictor (inde-
pendent) variables included in the regression analysis. Standardized residuals are the residuals divided by the standard deviation of the residuals.
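To draw such plots, the residuals must first be saved as variables. The variable names residual and zresid used later in this chapter suggest commands along these lines (a sketch based on Stata's predict postestimation options, run after the regression):
predict residual, resid
predict zresid, rstandard
scatter zresid sbp
scatter zresid age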
Figure 16.1 Scatterplot of diastolic and systolic BP with the fit line
Figures 16.2 and 16.3 Scatter plots of standardized residuals against the predictor variables
You can also use formal statistical tests (Shapiro-Wilk test or Skewness-Kurtosis test)
to evaluate the normality of residuals by using any of the following commands (Table
16.6):
swilk residual
sktest residual
Figure 16.4 Kernel density estimate with overlaying normal density curve
Figure 16.5 Histogram of residuals
Figure 16.6 P-P plot of residuals
Figure 16.7 Q-Q plot of residuals
Table 16.6 shows that the p-values of both these tests are greater than 0.05, indicating
that the distribution of residuals is normal.
To get the mean of the residuals, use the following command (Table 16.7):
sum residual
All the plots (Kernel density, histogram, P-P, and Q-Q plots) indicate that the distribution of the residuals is normal (also see Chapters 5 and 8). The mean of the residuals, as shown in Table 16.7, is -7.01e-09 (i.e., -7.01 × 10^-9), which is practically equal to “zero”. For practical purposes, simply construct the histogram of the residuals and decide whether the distribution is approximately normal or not.
. sktest residual
Skewness/Kurtosis tests for Normality
------- joint ------
Variable | Obs Pr(Skewness) Pr(Kurtosis) adj chi2(2) Prob>chi2
-------------+---------------------------------------------------------------
residual | 210 0.7485 0.9940 0.10 0.9499
The first test of heteroscedasticity, given by the command “hettest” (in full, estat hettest), is the Breusch-Pagan test, and the second, given by the command “imtest” (estat imtest), is White’s test. Both tests evaluate the null hypothesis that the variance of the residuals is homogeneous. Therefore, to fulfill the assumption, we expect a p-value greater than 0.05. Table 16.8 shows that the p-values of both tests are greater than 0.05 (use the p-value of either test for interpretation). This indicates that the variances of the residuals are homogeneous.
. estat hettest
chi2(1) = 0.27
Prob > chi2 = 0.6059
. estat imtest
---------------------------------------------------
Source | chi2 df p
---------------------+-----------------------------
Heteroskedasticity | 18.98 16 0.2697
Skewness | 4.25 5 0.5144
Kurtosis | 0.02 1 0.8930
---------------------+-----------------------------
Total | 23.24 22 0.3880
---------------------------------------------------
However, these tests are very sensitive to model assumptions. It is, therefore, a com-
mon practice to combine the tests with a diagnostic plot (a plot of residuals against the
predicted values) to make a judgment on the severity of heteroscedasticity. To generate
the plot of residuals against the fitted (predicted) values of the dependent variable
(diastolic BP), use either of the following commands:
rvfplot
rvfplot, yline(0)
The above commands will generate Figures 16.8 and 16.9, respectively. The figures are
similar except that Figure 16.9 has a reference line that represents Y= 0. If the scatters
of the points show no clear pattern (as seen in Fig 16.8 and Fig 16.9), we can conclude
that the variances of the sub-population of “Y” are constant (homoscedastic).
If there is evidence of heteroscedasticity (Fig 16.10), one of the solutions is to run a
regression with robust standard errors by using the following command:
regress dbp age sbp i.sex ib3.religion, vce(robust)
Figure 16.8 Plot of residuals against fitted values
Figure 16.9 Plot of residuals against fitted values with a reference line at Y = 0
+-----------+ +----------+
| zresid | | zresid |
|-----------| |----------|
1. | -2.462475 | 201. | 1.600753 |
2. | -2.258894 | 202. | 1.665619 |
3. | -2.158022 | 203. | 1.800676 |
4. | -2.071153 | 204. | 1.823561 |
5. | -2.065784 | 205. | 1.846748 |
|-----------| |----------|
6. | -2.003306 | 206. | 1.887106 |
7. | -1.93732 | 207. | 1.951255 |
8. | -1.908815 | 208. | 2.181215 |
9. | -1.884941 | 209. | 2.678532 |
10. | -1.83873 | 210. | 2.960519 |
+-----------+ +----------+
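The listing above shows the ten smallest and ten largest standardized residuals. A sketch of commands that would produce such a listing (assuming the zresid variable generated earlier):
sort zresid
list zresid in 1/10
list zresid in 201/210
Since all standardized residuals lie within ±3 (the extremes are -2.46 and 2.96), there is no indication of extreme outliers by the common |z| > 3 rule of thumb.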
The independence assumption relates to autocorrelation, i.e., the degree of correlation of a variable with itself over successive time intervals (the degree of similarity between a time series and a lagged version of itself). For cross-sectional data, autocorrelation is not an issue.
The independence of residuals (autocorrelation) is assessed by the Durbin-Watson (DW) statistic and is applicable to time-series data. The DW test is done after executing the multiple regression analysis. The DW statistic ranges from 0 to 4. A value of 2 indicates that there is no autocorrelation. Values below 2 indicate a positive autocorrelation, while values above 2 indicate a negative autocorrelation. To perform the DW test, a time variable is required. In our
dataset (Data_4.dta), there is no time variable. However, we can generate a time
variable (for the purpose of demonstration) by using the following command before
doing the DW test:
gen time = _n
This command will generate a time variable “time”. Now, use the following command
to let Stata know which variable is the time variable for this analysis (if the time variable
in your data file is “year”, use the command: tsset year):
tsset time
Finally, use the following command, after multiple regression analysis, to get the DW
test statistic (Table 16.10):
estat dwatson
. tsset time
time variable: time, 1 to 210
delta: 1 unit
. estat dwatson
If the DW statistic value hovers around 2 (between 1.5 and 2.5), it indicates that the
data points are independent (there is no autocorrelation). Table 16.10 shows that the
value of the DW statistic is 1.70. Since the value is close to 2, it can be considered that
there is no autocorrelation in the residuals.
------------------------------------------------------------------------------
dbp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sbp | .489694 .0215077 22.77 0.000 .4472918 .5320963
_Isex | -2.180167 .8930832 -2.44 0.015 -3.940872 -.4194618
_cons | 21.01581 2.838019 7.41 0.000 15.42068 26.61094
------------------------------------------------------------------------------
For a forward stepwise method with a removal criterion of p ≥ 0.2 and an entry criterion of p < 0.15, use the following command:
xi: stepwise, pr(.2) pe(.15) forward: regress dbp sbp age i.sex i.religion
16.2.5.3 Interpretation
The interpretation of the outputs of stepwise methods is the same as discussed in
Section 16.2.3. In the backward selection method of analysis, our dependent variable
was “dbp” (Table 16.12). The independent variables included in the analysis were
“sbp, age, sex, and religion”. The outputs of the analysis (Table 16.12) show that two
variables (systolic BP and sex) are significantly associated with diastolic BP and are
retained in the model.
The R-squared value, calculated in the analysis, is 0.725, indicating that the indepen-
dent variables (systolic BP and sex) together explain 72.5% of the variation in diastolic
BP, which is statistically significant [p=0.000 (Prob > F)]. The regression coefficients
(Coef.) and p-values (P > | t |) for systolic BP and sex are 0.489 (p=0.000) and -2.180
(for males compared to females; p=0.015), respectively. We can, therefore, conclude
from the analysis that the systolic BP and sex are the important factors significantly
influencing the diastolic BP (since the p-values are <0.05).
The regression coefficient for systolic BP is 0.489 (95% CI: 0.44 to 0.53; p<0.001).
This indicates that the average increase (or decrease) in diastolic BP is 0.49 mmHg if
the systolic BP increases (or decreases) by 1 mmHg after adjusting for sex. The regres-
sion coefficient for sex, on the other hand, is –2.180 (95% CI: -3.940 to -0.419; p=
0.015), which means that the average diastolic BP of males is 2.18 mmHg less (since the coefficient is negative) than that of females after adjusting for systolic BP.
Table 16.13 Backward selection method with forced-entry of age in the model
. xi: stepwise, pr(0.2) lockterm1: regress dbp age sbp i.religion
------------------------------------------------------------------------------
dbp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | -.0353995 .056452 -0.63 0.531 -.1466941 .0758952
sbp | .4956516 .0216435 22.90 0.000 .4529816 .5383216
_cons | 20.48269 3.281515 6.24 0.000 14.01322 26.95217
------------------------------------------------------------------------------
17
Logistic Regression
1. Binary logistic regression: This method is used when the dependent variable is a
dichotomous (binary) categorical variable, such as a disease (present/absent), vaccinat-
ed (yes/no), or the outcome of a patient (died/survived). The binary logistic regression
can be applied as:
a) Unconditional binary logistic regression: This method is used when the dependent variable is dichotomous and the observations are independent (unmatched data); and
b) Conditional binary logistic regression: This method is used when the data are matched, such as in matched case-control studies.
2. Multinomial logistic regression: This method of data analysis is used when the outcome (dependent) variable is a nominal categorical variable with more than two levels, such as health-seeking behavior (did not seek treatment/ received treatment from village doctors/ received treatment from pharmacists), type of cancer (stomach cancer/ lung cancer/ skin cancer) and others.
ln[P/(1 – P)] = β0 + β1X1 + β2X2 + … + βnXn
where P denotes the probability of the outcome, β0 is the intercept (constant), βi is the regression coefficient of the ith variable (i = 1, 2, …, n), and Xi represents the values of the predictor (independent) variables in the model, Xi = (X1, X2, …, Xn).
The regression coefficients (β) that we get in logistic regression analysis are the ln odds, and the exponentials of the regression coefficients are the odds ratios (ORs) for the
categorical independent variables after adjusting for other variables in the model. If the
independent variable is a continuous variable, the interpretation is different and is
discussed in Section 17.2.1.1. We can calculate the probability (p) of an outcome by
using the following formula:
P = e^(β0 + β1X1 + … + βnXn) / [1 + e^(β0 + β1X1 + … + βnXn)]
or, equivalently,
P = 1 / [1 + e^–(β0 + β1X1 + … + βnXn)]
A detailed explanation of the model can be found in any standard biostatistics book. In this chapter, we will discuss the unconditional and conditional logistic regression methods, while the multinomial logistic regression method is discussed in Chapter 18.
------------------------------------------------------------------------------
diabetes | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sex |
Male | 4.80143 2.555999 2.95 0.003 1.691379 13.63014
age | 1.262949 .0503621 5.85 0.000 1.168 1.365616
|
pulcer |
Yes | 5.937446 2.947017 3.59 0.000 2.244453 15.70684
|
fhistory |
Yes | 2.85935 1.531351 1.96 0.050 1.000918 8.168382
|
religion |
HINDU | 1.655078 .9067061 0.92 0.358 .5655909 4.843225
Christian | .8324862 .7949383 -0.19 0.848 .1281054 5.409867
|
_cons | .0000187 .0000296 -6.90 0.000 8.48e-07 .0004131
------------------------------------------------------------------------------
The variable “sex” is coded as 0 for females and 1 for males. Therefore, the analysis will give the OR for males compared to females. On the other hand, the independent variable “religion” has three levels (Table 17.1). Since Muslims are coded as 1 (first category), Muslims will be the comparison group for religion. The ORs that Stata will provide for the other religions (Hindus and Christians) will be compared to Muslims. However, you can change the comparison group by using the following commands:
logistic diabetes i.sex age i.pulcer i.fhistory ib3.religion
logistic diabetes ib1.sex age i.pulcer i.fhistory ib3.religion
The first command will consider the last category of religion (category 3 or Christians)
as the comparison group, while the second command will consider males (because
males are coded as 1) as the comparison group for sex, and Christians as the compari-
son group for religion during analysis.
Occasionally, in cross-sectional studies, data is collected through cluster sampling
methods. In such a situation, it is necessary to control (adjust) for the cluster effects
during analysis. To adjust for cluster effects (say, the name of the cluster variable is
“clus”), use the following command:
logistic diabetes i.sex age i.pulcer i.fhistory i.religion, vce(cluster clus)
17.2.1.1 Interpretation
In logistic regression analysis, we have entered five independent variables, among
which one is a continuous variable (age) and the others are categorical variables (sex,
peptic ulcer, family history of diabetes, and religion). Let us interpret the outputs
provided in Table 17.2.
The table shows that the data from 210 subjects were analyzed. The likelihood ratio
(LR) chi-square [LR Chi2(4)] value is 102.61 and its p-value (Prob > chi2) is 0.000.
We want the LR chi-square test to be significant (p<0.05). A significant LR chi-square
test indicates that the proposed model is better than the null model (i.e., a model with-
out any independent variable) in predicting the outcome variable. Furthermore, if the
LR chi-square test p-value is greater than 0.05, it means that the independent variables
are unable to predict the outcome variable (a situation where none of the independent
variables in the model are significant).
The pseudo R-square value indicates the amount of variation in the outcome variable
that can be explained by the independent variables in the model. In this example, the
pseudo R-square (Pseudo R2) value is 0.4702. This indicates that 47.02% of the varia-
tion in the outcome variable (diabetes) can be explained by all the independent
variables (sex, age, peptic ulcer, family history of diabetes, and religion) in the model.
However, the pseudo R-square value should be interpreted cautiously because it is not equivalent to the R-squared value that we get in a linear regression model (Sections 16.1.2 and 16.2.3). This information is not needed if the objective of the analysis is only to obtain adjusted ORs (i.e., to adjust for confounding factors).
------------------------------------------------------------------------------
diabetes | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sex |
Male | 1.568914 .5323411 2.95 0.003 .5255445 2.612283
age | .2334492 .0398766 5.85 0.000 .1552926 .3116058
|
pulcer |
Yes | 1.781279 .4963443 3.59 0.000 .808462 2.754096
|
fhistory |
Yes | 1.050594 .5355591 1.96 0.050 .0009176 2.100271
|
religion |
HINDU | .5038483 .5478328 0.92 0.358 -.5698842 1.577581
Christian | -.1833386 .9548967 -0.19 0.848 -2.054902 1.688225
|
_cons | -10.88592 1.578628 -6.90 0.000 -13.97998 -7.79187
------------------------------------------------------------------------------
The table (Table 17.2) also shows the ORs (Odds Ratios) for the independent variables
with their standard errors (Std. Err.), 95% CIs (95% Conf. Interval), and p-values
(P>|z|). In our analysis, there are five independent variables in the model. All of them
are categorical variables (sex, family history of diabetes, peptic ulcer, and religion),
except for age, which is entered as a continuous variable. The interpretation of the OR
for a categorical and a continuous variable is not the same.
Let us first interpret the ORs for categorical variables. Table 17.2 shows that the OR
for sex (being male) is 4.80 (95% CI: 1.69 - 13.63), which is statistically significant
(P=0.003). Here, the comparison group is female. The OR of 4.80 suggests that males
are 4.8 times more likely to have diabetes compared to females after adjusting (con-
trolling) for age, family history of diabetes, peptic ulcer, and religion. Note that the
ORs provided by logistic regression analysis are the adjusted ORs. Similarly, individu-
als who have a family history of diabetes are 2.85 times more likely (OR: 2.85; 95%
CI: 1.00 to 8.16; p=0.05) to have diabetes compared to those who do not have a family
history, after adjusting for age, sex, peptic ulcer, and religion. But the OR is not statisti-
cally significant (though the p-value is 0.05) as the null value (one) is included in the
95% CI. Lastly, the ORs for Hindus and Christians provided in the table are compared
to Muslims after controlling for age, sex, peptic ulcer, and family history of diabetes.
The interpretation of OR for age is different since the variable is entered as a continu-
ous variable. In our example, the OR for age is 1.262 (95% CI: 1.168 – 1.365;
p=0.000). This means that the odds of having diabetes will increase (since the value of
OR is greater than one) by 26.2% (calculated as OR – 1; i.e., 1.262 – 1 = 0.262 or
26.2%) (95% CI: 16.8% to 36.5%) with each year increase in age after adjusting for all
other variables in the model, which is statistically significant (p<0.001).
All the outputs in Table 17.3 are the same as the outputs in Table 17.2 except that Table
17.3 provides the regression coefficients (with their standard errors and 95% CIs)
rather than the ORs. Note that the z-values (Z) and p-values (P>|z|) are the same in both
tables. The coefficients are used to calculate the predicted probabilities of the outcome
variable with the independent variables in the model. In logistic regression, the odds
are transformed into ln odds (logit transformation) during analysis. Therefore, the
coefficients reported in the table are the ln odds, and their exponentials (e^coefficient) are the ORs.
For example, the coefficient for males is 1.568914 and its exponential is 4.80, which is
the OR for males (Table 17.2).
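This can be checked directly in Stata:
display exp(1.568914)
which returns 4.80143, the OR for males shown in Table 17.2.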
The table (Table 17.3) also shows the iterations. The iterations indicate the log
likelihood values. The first iteration (iteration 0) indicates the log likelihood of the
null model (i.e., a model without any independent variable). The table shows that the
log likelihood has increased [from -109.11 (Iteration 0) to -57.80 (Iteration 5)] with
the inclusion of independent variables in the model. The improvement in log likeli-
hood value is statistically significant as indicated by the p-value (0.000) of the log
likelihood ratio chi-square test (Prob > Chi2). A significant p-value indicates that the
model is significantly better than the null model.
Violations of the underlying assumptions of logistic regression may produce biased estimates of the coefficients, which may lead to misinterpretation of the results. It is, therefore, important to check the underlying assumptions before considering the logistic regression analysis valid.
the assumptions are true or not. In this section, we will discuss how to assess some of
the important assumptions to validate the model, like multicollinearity, model fit, and
others.
For a well-fitting model, we expect the Hosmer-Lemeshow goodness-of-fit test to be non-significant. If the p-value is not significant (>0.05), it suggests that the model is good for prediction of the outcome variable (i.e., the observed and predicted values are close together). This test is done after running the logistic regression analysis, and the
command is:
lfit, group(10)
The above command will generate Table 17.6. The table shows that the Hosmer-Leme-
show chi-square test p-value is 0.085. Since the p-value is greater than 0.05, we can
conclude that the model is useful for prediction of the outcome variable by the inde-
pendent variables included in the model. If the test is significant (p<0.05), we will
conclude that the model is not good enough to predict the outcome variable by the
independent variables in the model. This information is not needed if the objective of
doing the logistic regression analysis is to adjust for the confounding factors.
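Note that lfit is the syntax of Stata 13, the version used in this book; in more recent Stata releases, the equivalent postestimation command is:
estat gof, group(10)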
Classification table
The classification table provides us with the sensitivity, specificity, and positive and
negative predictive values, and overall accuracy of the model. The predictive values
indicate how well the model is able to predict the correct category of the dependent
variable (i.e., have or do not have the disease). To get the sensitivity, specificity, and
positive and negative predictive values, we need to generate the classification table by
using the following command (after running the logistic regression analysis). Usually,
the classification table is generated at a cut-off value of 0.5 (50%). You can, however,
change the cut-off to a value of your choice.
lstat, cutoff(0.5)
This command will generate Table 17.7. The table shows that the overall accuracy of
this model to predict diabetes (with a predicted probability of 0.5 or greater) is 90.95%
(shown at the bottom of the table as correctly classified). The sensitivity and specificity
of the model are 71.11% [32 ÷ 45] and 96.36% [159 ÷ 165], respectively, while the
positive and negative predictive values are 84.2% [32 ÷ 38] and 92.4% [159 ÷ 172],
respectively. Interpretation of these findings is a little complicated and needs further
explanation, especially the explanation of sensitivity, specificity, and positive and
negative predictive values. For a detailed explanation, readers may refer to any
standard epidemiology book [10, 16]. You can see the sensitivity and specificity with
varying cut-off points (in a graph) by using the following command (not shown):
lsens
To conclude, the information provided in the classification table is needed if the inten-
tion of logistic regression analysis is to predict the outcome variable with independent
variables in the model. We can ignore the information if the objective of the analysis is
to adjust for the confounding factors.
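As a quick sanity check, the figures in Table 17.7 can be reproduced from its four cell counts (true positives = 32, false positives = 6, false negatives = 13, true negatives = 159; 210 subjects in total, of whom 45 have diabetes):

* hand-checking the classification measures from the 2 x 2 cell counts
display "Sensitivity = " 100*32/45
display "Specificity = " 100*159/165
display "PPV         = " 100*32/38
display "NPV         = " 100*159/172
display "Accuracy    = " 100*(32+159)/210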
An area under the ROC curve of 0.9 or higher is conventionally described as outstanding discrimination.
To get the ROC curve, after performing the logistic regression analysis, use the follow-
ing command (Fig 17.1):
lroc
Figure 17.1, generated by the command above, shows that the area under the curve is
0.915 (at the bottom of Figure 17.1). Since the value is >0.9, it is an excellent model
for prediction. However, we can separately calculate the area under the ROC curve
with its 95% CI. To calculate the area under the curve, we need to generate a classifier
variable (say, class). Then we will calculate the area under the curve with its 95% CI.
Use the following commands:
predict class
roctab diabetes class
The first command will generate the variable “class”, which holds the predicted probability of the outcome for each subject, while the second command will calculate the area under the curve with its 95% CI (Table 17.8).
[Figure 17.1 ROC curve: sensitivity plotted against 1 – specificity; area under the ROC curve = 0.915]
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
religion |
MUSLIM | .0720608 .0275834 2.61 0.009 .0179982 .1261233
HINDU | .11389 .0490783 2.32 0.020 .0176983 .2100817
Christian | .0607226 .0527785 1.15 0.250 -.0427214 .1641666
|
sex |
female | .0467855 .0199011 2.35 0.019 .00778 .085791
male | .1907179 .0637455 2.99 0.003 .065779 .3156569
------------------------------------------------------------------------------
[Figure: adjusted predictions of Pr(Diabetes) by sex (female, male), with the other covariates at their means]
For example, to look for an interaction between smoking and hypertension, we can estimate the association between smoking and heart disease separately in the stratum where hypertension is present and in the stratum where it is absent. If the strength of association (OR or RR) is different in those two strata, there is an interaction between smoking and hypertension for the causation of heart disease (see Section 13.3).
We can include the interaction terms in logistic regression analysis with other variables
for adjustment in the model. Suppose that we are interested in assessing if there is an
association of sex and a family history of diabetes with diabetes mellitus. We are also
keen to know if there is an interaction between sex and family history of diabetes on
the outcome. In such a situation, we need to include both the independent variables
(sex and family history of diabetes) and their interaction terms in the model. Note that
the variables for which we are looking for interaction must be included in the model
independently. To add the interaction term (sex#fhistory) into the model, use the
following command:
logistic diabetes i.sex age i.pulcer i.religion i.fhistory i.sex#i.fhistory
The above command will generate Table 17.10. The interaction term included in the
model is indicated as “male#Yes”. The table shows that the p-value for the interaction
is 0.514 (>0.05), indicating that there is no interaction between sex and family history
of diabetes for the outcome (i.e., the effect of sex on diabetes is not dependent on the
family history of diabetes). If there is an interaction (p-value <0.05), data needs to be
analyzed separately at each level of sex or family history of diabetes, depending on the
objective.
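An alternative way to judge the interaction term (a sketch, not part of the commands shown above) is a likelihood-ratio test comparing the models fitted with and without it:

* model without the interaction term
logistic diabetes i.sex age i.pulcer i.religion i.fhistory
estimates store main
* model with the interaction term, then LR test of the two models
logistic diabetes i.sex age i.pulcer i.religion i.fhistory i.sex#i.fhistory
lrtest main .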
------------------------------------------------------------------------------
diabetes | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sex |
male | 3.421403 2.510496 1.68 0.094 .8121282 14.41398
age | 1.259393 .0504362 5.76 0.000 1.16432 1.362229
|
fhistory |
Yes | 2.104785 1.4767 1.06 0.289 .5321228 8.325375
|
pulcer |
Yes | 5.802878 2.886573 3.53 0.000 2.188887 15.3838
|
religion |
HINDU | 1.692301 .9342122 0.95 0.341 .5735658 4.993121
Christian | .8405074 .8056811 -0.18 0.856 .1284131 5.501405
|
sex#fhistory |
male#Yes | 1.978321 2.069448 0.65 0.514 .2546167 15.37117
|
_cons | .0000256 .0000418 -6.49 0.000 1.05e-06 .0006254
------------------------------------------------------------------------------
To perform the conditional logistic regression analysis, use the following command:
clogit death age ib3.religion i.htn i.diabetes, group(mID)
The variable immediately after the command (clogit) must be the outcome (dependent)
variable. The above command will generate Table 17.12. The table shows the logistic
regression coefficients, which are difficult to interpret. We prefer to get the ORs, which
are easier to interpret. To get the ORs, use the following command after the primary
analysis:
clogit, or
This command will give you Table 17.13 with the ORs. We will interpret the outputs
provided in this table to draw conclusions.
17.2.6.1 Interpretation
The interpretation of the outputs is similar to unconditional logistic regression analysis
as discussed earlier. Table 17.12 shows that the model explains 34% of the variation in
the outcome variable by the independent variables (age, religion, hypertension, and
diabetes) included in the model (Pseudo R2 = 0.3421).
Our main interest is in ORs, 95% CIs, and p-values. Table 17.13 shows the adjusted
ORs for all the variables in the model. The table shows that age (p= 0.004), religion
(Muslims compared to Christians) (p= 0.020) and having hypertension (compared to
no hypertension) (p= 0.037) are the factors significantly associated with deaths due to
COVID. Data indicates that Muslims are more likely to die compared to Christians
(adjusted OR: 13.0; 95% CI: 1.49 – 113.2; p= 0.020) after adjusting for age, hyperten-
sion, and diabetes. On the other hand, those who have hypertension are 2.56 times
more likely to die compared to those who do not have hypertension (adjusted OR:
2.56; 95% CI: 1.06 – 6.22; p= 0.037) after controlling for age, religion, and diabetes.
The interpretation of the OR for age is different since the variable is entered as a continuous variable (see Section 17.2.1.1). In this example, the OR for age is 1.076 (95% CI: 1.02 – 1.13; p=0.004). This means that the odds of dying increase by 7.6% (1.076 – 1.0 = 0.076, or 7.6%) with each year increase in age after adjusting for religion, hypertension, and diabetes, which is statistically significant.
------------------------------------------------------------------------------
death | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0741377 .0258347 2.87 0.004 .0235026 .1247727
|
religion |
MUSLIM | 2.565603 1.104299 2.32 0.020 .401217 4.729989
HINDU | 1.922886 1.091629 1.76 0.078 -.2166663 4.062439
|
htn |
Yes | .9436792 .4514828 2.09 0.037 .0587891 1.828569
|
diabetes |
Yes | .780711 .4194145 1.86 0.063 -.0413263 1.602748
------------------------------------------------------------------------------
------------------------------------------------------------------------------
death | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | 1.076955 .0278228 2.87 0.004 1.023781 1.132891
|
religion |
MUSLIM | 13.0085 14.36527 2.32 0.020 1.493641 113.2943
HINDU | 6.840675 7.467477 1.76 0.078 .8051986 58.1159
|
htn |
Yes | 2.569417 1.160048 2.09 0.037 1.060552 6.224974
|
diabetes |
Yes | 2.183024 .9155918 1.86 0.063 .959516 4.966664
------------------------------------------------------------------------------
In this section, we will discuss how to analyze cross-sectional data to get the adjusted prevalence ratio (PR). Several methods are available to obtain the adjusted PR with Stata. They are:
a) Poisson regression;
b) Cox regression (proportional hazards analysis) with constant time; and
c) Generalized linear model (GLM).
Of the above methods, Poisson regression is the easiest. We will, however, also demonstrate the use of the other methods in this section.
We will use the data file <Data_4.dta> for the analysis (consider that the data is from
a cross-sectional study). Our interest is to estimate the PR of diabetes for males com-
pared to females after controlling for age, family history of diabetes (fhistory), and
peptic ulcer (pulcer).
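For instance, the Poisson approach can be sketched as follows (a minimal sketch, assuming the variable names above; robust standard errors are commonly recommended when Poisson regression is applied to a binary outcome):

* adjusted PR of diabetes for males vs females via Poisson regression
poisson diabetes i.sex age i.fhistory i.pulcer, irr vce(robust)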
------------------------------------------------------------------------------
diabetes | IRR Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sex |
male | 1.938541 .5830272 2.20 0.028 1.075156 3.495254
age | 1.139111 .0281784 5.27 0.000 1.085199 1.1957
|
fhistory |
Yes | 1.446838 .4649351 1.15 0.250 .770708 2.716125
|
pulcer |
Yes | 2.359057 .7257438 2.79 0.005 1.290843 4.311251
_cons | .0011379 .0010385 -7.43 0.000 .0001902 .0068073
------------------------------------------------------------------------------
------------------------------------------------------------------------------
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sex |
male | 1.938541 .5830272 2.20 0.028 1.075156 3.495254
age | 1.139111 .0281784 5.27 0.000 1.085199 1.1957
|
fhistory |
Yes | 1.446838 .4649351 1.15 0.250 .770708 2.716125
|
pulcer |
Yes | 2.359057 .7257438 2.79 0.005 1.290843 4.311251
------------------------------------------------------------------------------
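The hazard ratios above come from the Cox-with-constant-time approach. A minimal sketch of the usual setup for this method, assuming the same variable names (the constant follow-up variable “constime” is created only for this purpose):

* assign every subject the same follow-up time, declare the data, then fit Cox
gen constime = 1
stset constime, failure(diabetes)
stcox i.sex age i.fhistory i.pulcer
* vce(robust) may be added for more conservative standard errors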
18.1 Interpretation
Table 18.2 shows the results of the analysis of the second command. The first iteration
(iteration 0) indicates the log likelihood of the null model (i.e., a model without any
independent variable). The table shows that the log likelihood has increased [from
-223.23 (Iteration 0) to -174.89 (Iteration 5)] with the inclusion of independent
variables in the model. The log likelihood chi-square p-value (Prob > Chi2) as provid-
ed in the table is significant (p= 0.000). A significant p-value indicates that the model
is significantly better than the null model; i.e., the addition of independent variables
(religion, severity of diarrhea, and age) in the model has improved the ability to predict
the outcome variable compared to the null model.
-------------------------------------------------------------------------------------------
behavior | RRR Std. Err. z P>|z| [95% Conf. Interval]
--------------------------+----------------------------------------------------------------
Did_not_receive_treatment | (base outcome)
--------------------------+----------------------------------------------------------------
Treated_by_vill_doc |
religion |
Muslim | 2.627298 1.324947 1.92 0.055 .9777956 7.059443
|
sdiar |
severe | 4.8923 2.469855 3.14 0.002 1.818812 13.15947
age | 1.271572 .0517161 5.91 0.000 1.174145 1.377083
_cons | .0000683 .0001029 -6.37 0.000 3.56e-06 .0013092
--------------------------+----------------------------------------------------------------
Treated_by_pharmacist |
religion |
Muslim | 2.080179 .6647571 2.29 0.022 1.111948 3.891498
|
sdiar |
severe | .8920665 .3586054 -0.28 0.776 .4057134 1.961441
age | 1.002805 .0242082 0.12 0.908 .9564624 1.051392
_cons | .5782204 .395847 -0.80 0.424 .1511349 2.212188
-------------------------------------------------------------------------------------------
The Pseudo R-squared (Pseudo R2) value indicates how much variation in the depen-
dent variable can be explained by the independent variables in the model. The results
show that 21.6% of the variation in the dependent variable can be explained by the
independent variables together in the model (religion, severity of diarrhea, and age).
However, the readers should interpret the pseudo R-squared value cautiously as it is
not equivalent to the R-squared value that we get in linear regression analysis (Sections
16.1.2 and 16.2.3).
The main output has two parts, labelled with the categories of the outcome variable (treated by village doctors and treated by pharmacists). The analysis provides the RRRs with their 95% CIs and p-values.
The first half of the table has the results for “treated by village doctors” compared
to “did not receive treatment”. The results indicate that Muslims (compared to other
religions) are more likely to receive treatment from village doctors after controlling for
mothers’ age and severity of diarrhea, but the association is not statistically significant
(adjusted RRR: 2.62; 95% CI: 0.97 – 7.05; p=0.055). However, severity of diarrhea
(i.e., if the baby has severe diarrhea) is significantly associated with seeking treatment
from village doctors after controlling for mothers’ age and religion (adjusted RRR:
4.89; 95% CI: 1.81 – 13.15; p=0.002).
For the quantitative variables, an RRR greater than one indicates an increased likeli-
hood of the response category (treated by village doctors) compared to the reference
category (did not receive treatment). The results show that, with increasing mothers’ age, seeking treatment from the village doctors becomes significantly more likely after adjusting for religion and severity of diarrhea (adjusted RRR: 1.27; 95% CI: 1.17 – 1.37; p=0.000) [in other words, the likelihood of receiving treatment from village doctors increases by 27% with each year increase in mother’s age].
The second half of the table shows the results for "treated by pharmacists" compared
to "did not receive treatment". The interpretations are similar to those mentioned
above. The results show that only religion (being Muslim) is significantly associated
with seeking treatment from pharmacists after adjusting for maternal age and severity
of diarrhea (adjusted RRR: 2.08; 95% CI: 1.11 – 3.89; p= 0.022).
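For reference, RRR output of this kind is produced by a multinomial logistic regression command of the following form (a sketch, assuming the outcome “behavior” is coded with 1 = did not receive treatment):

* multinomial logistic regression reporting relative risk ratios
mlogit behavior i.religion i.sdiar age, rrr baseoutcome(1)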
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
religion |
Muslim | .3902659 .0514549 7.58 0.000 .2894162 .4911156
Others | .5810089 .0563714 10.31 0.000 .470523 .6914949
------------------------------------------------------------------------------
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
religion |
Muslim | .1166562 .0368918 3.16 0.002 .0443495 .1889628
Others | .0661029 .0283829 2.33 0.020 .0104735 .1217323
------------------------------------------------------------------------------
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
religion |
Muslim | .4930779 .0527806 9.34 0.000 .3896299 .5965259
Others | .3528881 .0548775 6.43 0.000 .2453302 .460446
------------------------------------------------------------------------------
We can use the command "margins" to plot the predicted probabilities of each category
of the outcome variable (behavior) by religion. Stata will create the plots based on the
last margins command used. We can also combine 3 margins plots into a single figure
by using the command "graph combine" (Fig 18.1). Use all the following commands
successively:
margins religion, atmeans predict (outcome(1))
marginsplot, name (not_treated)
margins religion, atmeans predict (outcome(2))
marginsplot, name (treated_vdoc)
margins religion, atmeans predict (outcome(3))
marginsplot, name (treated_pharm)
graph combine not_treated treated_vdoc treated_pharm
[Figure 18.1: combined margins plots titled “Adjusted Predictions of religion with 95% CIs”, one panel each for Pr(Behavior==Did_Not_Receive_Treatment), Pr(Behavior==Treated_By_Vill_Doc), and Pr(Behavior==Treated_By_Pharmacist), plotted by religion (Muslim, Others)]
Figure 18.1 Marginal values of different categories of religion for seeking treatment
19
Survival Analysis
There are situations when researchers are interested in knowing the progress of a
patient from a specific point in time (e.g., from the point of diagnosis or initiation of
treatment) until the occurrence of a certain outcome, such as death or recurrence of any
event (e.g., recurrence of cancer). The prognosis of a condition is commonly assessed
by estimating the: a) Median survival time and b) Cumulative probability of survival
after a certain time interval (e.g., 5-year or 3-year or others).
For instance, a researcher may be interested in determining the median survival time of
colonic cancer if the patient is treated (or not treated), and the estimated probability
that a patient with colonic cancer may survive for more than 5 years (5-year cumulative
survival probability). The methods employed to answer the above questions in a
follow-up study are known as survival analysis (or lifetable analysis) methods.
Survival analysis is done in follow-up studies, where subjects are usually followed
over a specified period of time and the focus is on the time at which the event of inter-
est occurs. To do the survival analysis, we need to have data (information) from each
of the subjects, at least on the following variables:
• Time: It is the length of time the patient was observed in the study (called
“survival time”). Time can be measured in days, weeks, months, years or other
units of time.
• Outcome: Whether the patient developed the outcome of interest (event) during
the study period, or whether the patient was either lost to follow-up or remained
alive at the end of the study (censored); and
• Treatment group: Which treatment (e.g., treatment A or B) did the patient receive
in the study (optional)?
Survival time is of two types – a) Censored time; and b) Event time. The censored time
is the amount of time contributed by the:
a) Patients who did not develop the outcome and remained in the study up to the
end of the study period, or
b) Patients who were lost to follow-up due to any reason, such as migration or
withdrawal; or
c) Patients who developed the outcome for reasons other than the disease of interest
(e.g., died in a car accident)
On the other hand, the event time is the amount of time contributed by the patients who
developed the outcome of interest during the study period. In survival analysis, the
outcome measure of interest is survival time, which is a mixture of event time and
censored time.
If we have the above information, it is possible to estimate the median survival times
and cumulative survival probabilities for two or more treatment groups for compari-
son. Such a comparison allows us to answer the question, “which treatment delays the
time of occurrence of an event?” The method commonly used to analyze survival-time
data is the Kaplan-Meier method, and Stata can be used for the analysis of such data.
Use the data file <Data_survival_4.dta> for practice. The codebook for the data file
is provided in Table 19.1. Note that in this data, the event of our interest is death.
Suppose that, in a clinical trial, patients with acute heart attack were randomized to receive either a new treatment or a placebo and were followed until time to death or up to six months (180 days), whichever came first.
The outcome of interest in this study was the time to death (event) due to acute heart
attack. The objective was to assess whether the "new treatment" delays (increases) the
time to event (death) compared to placebo among patients with heart attack. Data from
this study is provided in the data file <Data_survival_4.dta>, which includes the
following necessary variables:
• Time: The variable “time” indicates the amount of time each patient has spent
in the study in days;
• Treatment: It specifies which treatment the patient received in this clinical trial
(coded as 0= received placebo; 1= received new treatment);
• Outcome (event): Whether the patient developed the event, i.e., died or not
(coded as 0= censored/did not die; 1= died).
Assumptions
• The probability of the outcome is similar among the censored (lost to follow-up)
individuals and those remaining under observation;
• There is no secular trend over the calendar period;
• Risk is uniform during the interval; and
• Loss to follow-up is uniform over the interval.
The primary objectives of survival analysis are to get the median survival time and
cumulative survival probability (survival functions) for each treatment group and the
significance of the difference (log-rank test) in survival functions between the treat-
ment groups. Once the data set is prepared (by the "stset" command), use the following
commands to get the median survival time and its 95% CI by treatment group. The
option "by(treatment)" will present the results separately for the placebo and new treat-
ment groups.
stsum, by(treatment)
stci, by(treatment)
The first command will provide the median survival times of the placebo and new
treatment groups, while the second command will provide the same but with their 95%
CIs (Table 19.3).
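Both commands presuppose that the data have already been declared as survival-time data with the “stset” command (Section 19.1.1); a minimal sketch of that step, assuming the event variable is named “death” and coded 1 = died:

* declare survival-time data: analysis time in days, death as the event
stset time, failure(death==1)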
. stci, by(treatment)
| no. of
treatment | subjects 50% Std. Err. [95% Conf. Interval]
-------------+-------------------------------------------------------------
Placebo | 22 40 12.89864 22 71
New trea | 22 146 10.79461 89 .
-------------+-------------------------------------------------------------
total | 44 89 21.23218 41 168
To get the survival functions disaggregated by treatment group, use the following com-
mands:
sts list, by(treatment)
sts list, by(treatment) compare
The first command will provide a long table showing the survival functions (cumula-
tive survival probabilities) at different time points (Table 19.4), while the second com-
mand will provide a table showing a comparison of survival functions between placebo
and new treatment groups (Table 19.5).
We need to use a statistical test to objectively assess the significance of the difference
in survival functions between the treatment groups. The most commonly used statisti-
cal test is the log-rank test. However, there are alternatives to this test. They are the
Tarone-Ware and Peto tests. The commands to get these test statistics are:
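The following is a sketch of these commands, consistent with the outputs shown in Table 19.6 (the first form is also what “sts test treatment” gives by default):

sts test treatment, logrank
sts test treatment, tware
sts test treatment, peto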
Survivor Function
treatment Placebo New treat
----------------------------------
time 2 0.9545 0.9545
24 0.7273 0.8636
46 0.4545 0.8636
68 0.3182 0.7701
90 0.2727 0.7219
112 0.2727 0.6257
134 0.2727 0.6257
156 0.2727 0.4562
178 0.2727 0.3041
200 . .
----------------------------------
19.1.3 Interpretation
In total, 44 subjects were enrolled in this study, of which 22 received a placebo and
| Events Events
treatment | observed expected
--------------+-------------------------
Placebo | 16 10.62
New treatment | 11 16.38
--------------+-------------------------
Total | 27 27.00
chi2(1) = 4.66
Pr>chi2 = 0.0309
chi2(1) = 6.07
Pr>chi2 = 0.0138
chi2(1) = 6.03
Pr>chi2 = 0.0141
another 22 received the new treatment. There were a total of 27 deaths during the study
period, among which 16 (59.26%) were in the placebo group and 11 (40.74%) were in
the new treatment group (Table 19.2).
[Two panels: survival probability (0.00–1.00) against analysis time (0–200 days), by treatment group (Placebo vs New treatment)]
Figure 19.1 Kaplan-Meier survival curve; Figure 19.2 Cumulative incidence curve
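Curves of this kind are drawn with the standard post-stset graphing commands (a sketch; the display options used for the printed figures may differ):

* Kaplan-Meier survival curves and cumulative incidence (failure) curves
sts graph, by(treatment)
sts graph, by(treatment) failure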
Table 19.3 shows the median survival times for both the placebo and new treatment
groups, including their 95% CIs. The median survival time is the time when the cumu-
lative survival probability is 0.50 (i.e., the time when 50% of the patients develop the
event). The table indicates that the median survival time, if a patient is in the placebo
group, is 40 days (95% CI: 22 to 71), while it is 146 days (95% CI: 89 to .), if a patient
is in the new treatment group. This means that the new treatment increases the survival
time, i.e., the new treatment is associated with a longer time-to-event (and the placebo
is associated with a shorter time-to-event). We, therefore, conclude that an individual
will live longer if s/he receives the new treatment compared to the placebo.
Table 19.4 shows the cumulative survival probability (survivor function) at different
points in time in the placebo and new treatment groups. In this table, we can see that
the cumulative survival probability at the end of 71 days (in the time column), in the
placebo group, is 0.272 (27.2%). Since there is no death after that, the cumulative
survival probability at the end of 180 days will be the same (0.272).
On the other hand, the cumulative survival probability is 0.304 (30.4%) at the end of
168 days for the patients who were in the new treatment group. As there is no death
after that, the cumulative survival probability at the end of 180 days will be the same
(0.304). In the new treatment group, the cumulative survival probability at the end of
71 days (68 days in the table) is about 0.770 (77.0%), which is much higher than the
placebo group (0.273). This indicates that the probability of surviving at the end of 71
days is higher among the patients who received the new treatment compared to place-
bo. This also indicates the benefit of the new treatment (i.e., the new treatment is better
than the placebo). A comparison of the survival functions at different points in time is
provided in Table 19.5. For instance, at the end of 90 days, the cumulative probability
of survival is 0.2727 in the placebo group compared to 0.7219 in the new treatment
group.
However, if we consider the cumulative survival probability of patients in both groups
at the end of 180 days, the probabilities are not that different (0.272 in the placebo
group and 0.304 in the new treatment group). This information suggests that though
survival experiences are significantly different between the treatment groups (as
indicated by the log rank test), the difference in survival probability at the end of 180
days is small.
Now, the question is whether the survival experiences of both these groups in the popu-
lation are different or not. For an objective comparison of survival experiences in two
groups, it is desirable to use a statistical method that will tell us whether the difference
in survival experiences in the population is statistically significant or not. Here, the
null hypothesis is that “the survival experiences in the placebo and new treatment
groups are the same in the population”. Such a null hypothesis can be tested by the
log-rank test. We have done the log-rank test and the results are provided in Table 19.6.
The p-value (Pr>chi2) of the test is 0.030, which is <0.05. This indicates that survival
experiences of both these groups in the population are not the same. In other words, it
tells us that the survival probability is better (since the median survival time is higher
in the new treatment group) if the patient is in the new treatment group compared to the
placebo group (i.e., the new treatment is more effective/better than placebo in improv-
ing the patients’ survival).
There are alternative procedures for testing the null hypothesis that the two survival
curves are identical. They are the Breslow test, the Tarone-Ware test, and the Peto test.
Here, we have performed the Tarone-Ware and Peto tests, and the results are shown in
Table 19.6. The log-rank test ranks all the deaths equally, while the other tests give
more weight to early deaths.
19.2.1 Commands
In Cox regression analysis, we will consider the variables age and sex for adjustment.
The command for Cox regression is “stcox” and it reports the hazard ratios (HRs). We
only use the independent variables with the “stcox” command. In survival analysis,
once the data are set with the “stset” command (see Section 19.1.1), Stata automatical-
ly recognizes the time and event variables. To perform the Cox regression analysis, use the following command (reconstructed here to be consistent with the output in Table 19.7):
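stcox i.treatment i.sex age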
--------------------------------------------------------------------------------
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
treatment |
New treatment | .3665283 .1678311 -2.19 0.028 .1493989 .8992233
|
sex |
male | 2.433823 1.060213 2.04 0.041 1.036315 5.715921
|
age | 1.008859 .0229496 0.39 0.698 .9648665 1.054857
--------------------------------------------------------------------------------
19.2.2 Interpretation
We have analyzed the data of 44 subjects (22 in the new treatment group and 22 in the
placebo group). The variables included in the analysis are “treatment”, “sex”, and
“age” to get the estimated effect of the new treatment (compared to placebo) after
adjusting for sex and age. Table 19.7 shows the results of the Cox regression analysis.
The table at the beginning shows the iteration history. The first iteration (iteration 0)
indicates the log likelihood of the null model (i.e., a model without any independent
variable). The table shows that the log likelihood has increased (from -88.96 to -83.98)
with the inclusion of independent variables in the model. The log likelihood ratio
chi-square p-value (Prob > chi2), as shown in the table, is significant (p= 0.018). A
significant p-value indicates that the addition of independent variables in the model has
improved the ability to predict the outcome compared to the null model.
------------------------------------------------------------------------------
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
treatment |
Placebo | 2.728302 1.249273 2.19 0.028 1.112071 6.693488
|
sex |
male | 2.433823 1.060213 2.04 0.041 1.036315 5.715921
|
age | 1.008859 .0229496 0.39 0.698 .9648665 1.054857
------------------------------------------------------------------------------
Table 19.7 shows that out of a total of 44 subjects included in the analysis, 27 died
(number of failures). In our analysis, placebo is the comparison group for the variable
“treatment” and females are the comparison group for “sex”. We will, therefore, get the
HRs for the new treatment group compared to placebo and for the males compared to
females.
The table (Table 19.7) shows the HRs and their corresponding p-values (P>|z|) with the 95% CIs. The HR for the new treatment is 0.366 (95% CI: 0.149 to 0.899) with a p-value of 0.028. This finding indicates that, compared to the placebo, patients in the new treatment group have a 63.4% (one minus 0.366) lower hazard of the event, i.e., a longer survival time, after controlling for age and sex, which is statistically significant (p=0.028). Furthermore, males have a 2.43 times higher hazard (i.e., a shorter survival time) compared to females (p=0.041) after controlling for the variables “treatment” and “age”. Age, independently, does not have any significant effect on the survival time since the p-value is greater than 0.05.
[Two panels: adjusted survival probability (0.00–1.00) against analysis time (0–200 days), by treatment group]
Figure 19.3 Survival curve after controlling for age and sex; Figure 19.4 Cumulative incidence curve after controlling for age and sex
If you consider the new treatment as the comparison group (second command; Table
19.8), you will get the HR for the placebo group, which is 2.728. This value (2.728) is
actually the inverse of the HR (0.366) that we had for the new treatment group com-
pared to the placebo. Interpretation is simple. An HR of 2.7 indicates that the patients
in the placebo group are 2.7 times more likely to have a shorter time to event (i.e., the
patients in the placebo group are more likely to die early) compared to patients in the
new treatment group.
The interpretation of a continuous variable (e.g., age) included in the model is different. If the HR for age is greater than 1 (say, 1.3), it indicates that the hazard of the event increases by 30% (1.3 minus 1) with each year increase in age. On the other hand, if the HR is less than 1 (say, 0.7), it means that with each year increase in age, the hazard is reduced by 30% (1 minus 0.7).
To check for the presence of multicollinearity, look at the standard errors (std. err.) of
the coefficients of the variables included in the model (Table 19.9). Since there is no
value which is very small (<0.001) or very large (>5.0) (also see Section 17.2.2.1),
there is no problem of multicollinearity in the model.
The second assumption (the proportionality assumption) is the major one. If this
assumption is violated, the simple Cox regression model is invalid, and more sophisti-
cated analyses are required. Formal statistical tests and graphical methods (log-mi-
nus-log plot) can be used for detecting violation of this assumption.
The statistical test for checking the proportional hazards assumption is performed by
the following command (this command needs to be used after performing the Cox
regression analysis since this test is based on the most recent use of the Cox regression
results):
estat phtest, detail
This command will provide Table 19.10. The table shows the p-values for the variables
“treatment” (new treatment), “sex” (male) and “age”. The p-value for age is less than
0.05, while the p-values for other variables are greater than 0.05. A p-value of less than
0.05 indicates that the assumption is violated. This suggests that there are some poten-
tial problems with age, while there is no problem with sex and treatment groups (since
the p-values are greater than 0.05). The overall test (global test) p-value is also <0.05.
In such a situation, you can either omit age from the model (since it is not significant)
or use the time-dependent Cox regression method. For further details, readers are
referred to any standard text book.
As an alternative, we can also check the assumption by using the graphical method (by
a log-minus-log plot). To generate the log-minus-log plot for the treatment groups,
after adjusting for age and sex, use the following command:
stphplot, by(treatment) adjust(age sex)
The above command will display the log-minus-log plot for the treatment groups after
adjusting for age and sex (Fig 19.5). If there is a constant vertical difference between
the two curves (i.e., the curves are parallel to each other), it means that the relative
hazards over time are proportional. If the curves cross each other or are much closer
together at some points in time and much further apart at other points in time, then the
assumption is violated. In our example (Fig 19.5), the two lines are not really parallel,
indicating that the assumption may be violated. When the proportional hazards
assumption is violated, it is recommended to use the Cox regression with a time-depen-
dent covariate.
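A minimal sketch of that time-dependent approach, assuming age is the covariate allowed to vary with (log) time:

* Cox model with age interacted with ln(analysis time)
stcox i.treatment i.sex, tvc(age) texp(ln(_t))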
------------------------------------------------------------------------------
_t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
treatment |
Placebo | 1.003679 .4578939 2.19 0.028 .1062239 1.901135
|
sex |
male | .8894631 .4356164 2.04 0.041 .0356707 1.743255
|
age | .0088198 .022748 0.39 0.698 -.0357655 .0534051
------------------------------------------------------------------------------
Time: Time
----------------------------------------------------------------
| rho chi2 df Prob>chi2
------------+---------------------------------------------------
0b.treatment| . . 1 .
1.treatment | 0.05039 0.07 1 0.7892
0.sex | 0.20439 1.12 1 0.2901
1b.sex | . . 1 .
age | -0.53299 9.22 1 0.0024
------------+---------------------------------------------------
global test | 12.40 3 0.0061
----------------------------------------------------------------
[Figure 19.5: log-minus-log plot; −ln[−ln(Survival Probability)] against ln(analysis time), by treatment group]
Nonparametric methods, in general, are used when the continuous dependent variable
is not normally distributed. Nonparametric tests are also used when the data is
measured on nominal and ordinal scales. Table 20.1 shows the types of nonparametric
methods recommended against parametric tests when the dependent variable is not
normally distributed in the population. Nonparametric tests are less sensitive (i.e., less powerful) than parametric tests and may, therefore, fail to detect differences between groups that actually exist. Use the data file <Data_4.dta> for practice.
The Wilcoxon rank-sum test (Mann-Whitney U test) is used in place of the independent-samples t-test when the sample distributions are not normal or the sample size is small (<30). This test is based on ranks of observations and is more efficient than the median test. It tests the null hypothesis that the two populations have the same median. For example, we may want to know whether the median systolic BP (where the distribution of systolic BP is non-normal) of males and females is the same. To test this hypothesis, use the following command:
ranksum sbp, by(sex)
sex | p50
-------+----------
female | 124
male | 122
-------+----------
Total | 123
------------------
This command will provide Table 20.2. The table shows the p-value (Prob > |z|) of the test, which is 0.167 (greater than 0.05). This indicates that the distribution of systolic BP among males and females is not different (or that the median systolic BP of males and females is not different). With this test result, the median systolic BP of females and males should be reported. To get the median systolic BP by sex, use the following command (a sketch consistent with the p50 table shown above):
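tabstat sbp, by(sex) stat(p50)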
Median test
Greater |
than the | Diabetes mellitus
median | No Yes | Total
-----------+----------------------+----------
no | 106 4 | 110
yes | 59 41 | 100
-----------+----------------------+----------
Total | 165 45 | 210
Continuity corrected:
Pearson chi2(1) = 41.2416 Pr = 0.000
diabetes | p50
---------+----------
No | 26
Yes | 39
---------+----------
Total | 28
--------------------
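The test used here is the Wilcoxon signed-rank test; a minimal sketch of the command, assuming the score variables are named “pretest” and “posttest”:

* Wilcoxon signed-rank test for paired (pre/post) scores
signrank posttest = pretest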
This command will provide Table 20.4. Just look at the p-value of the test, which is
0.000. This indicates that the pre- and post-test scores (medians) are significantly
different. We may, therefore, conclude that the training has significantly improved the
post-test scores compared to the pre-test scores. You can get the medians of the pre-
and post-test scores by using the “tabstat” command as shown earlier (Section 20.2).
+----------------------------+
| religion | Obs | Rank Sum |
|-----------+-----+----------|
| MUSLIM | 126 | 13298.50 |
| HINDU | 58 | 6175.00 |
| Christian | 26 | 2681.50 |
+----------------------------+
measured their blood sugar levels at the baseline, i.e., before administration of a drug
(hour-0). All the individuals were then administered a drug (say, drug A), and their
blood sugar levels were measured again after 7 hours, 14 hours, and 24 hours. We are
interested in knowing if the blood sugar levels over time after giving the drug are the
same or not (in other words, whether the drug is effective in reducing the blood sugar
levels over time). The variable "time" in the dataset indicates the times of measurement
of blood sugar levels. In this example, we have only one treatment group (received
drug A) but the outcome is measured (blood sugar) at four different points in time on
the same subjects (i.e., we have one treatment group with four levels of measurements
on the same subjects). Use the data file <Data_Repeat_anova_3.dta> for the exercise.
To perform the Friedman test, we need to install a module (emh package), which does
not come preinstalled in Stata. To install the module, use the following command:
ssc install emh
For the Friedman test, Stata needs a dataset which is in long format (our dataset is in
long format). If your dataset is in wide format, you need to convert it into long format
as discussed in Section 5.12. After installing the module, let us first check the number
of study subjects and the mean and median blood sugar levels at different time points
by using the following command (Table 20.6):
tabstat sugar, by(time) stat(n mean p50)
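After this descriptive check, the Friedman-type test itself is run with the installed emh module. A sketch of the call, assuming the subject identifier variable is named “id” (see “help emh” for the full syntax):

* Friedman test via the extended Mantel-Haenszel (emh) module
emh sugar time, strata(id) transformation(rank) anova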
Covariates are typically continuous variables that are significantly correlated with the dependent variable. One can also include categorical variables as covariates in the model.
Suppose that a researcher is interested in comparing the effectiveness of 3 drugs (drug
A, drug B, and drug C) in reducing systolic BP. To conduct the study, the researcher
randomly selected three groups of people and assigned a drug to each group. In this
scenario, one-way ANOVA could be used. However, it was observed that the mean age
and pre-treatment systolic BP of these three groups were not the same. Since age and
pre-treatment systolic BP may influence the effectiveness of the drugs in reducing
systolic BP, it requires adjustment for these variables (age and pre-treatment systolic
BP) to conclude the results. In such a situation, one-way ANCOVA can be used. In
ANCOVA, the independent variable must be a categorical variable (here it is "type of
drug"). ANCOVA can adjust more than one covariate, either continuous or categorical.
Another example: Assume that you have organized a staff training. You have taken the
pre- and post-tests of the participants to evaluate the effectiveness of the training. Now,
you want to conclude if males and females (independent variable "sex") have similar
performance in the post-test (dependent variable), after controlling for age and pre-test
scores (covariates). If the assumptions are met, one-way ANCOVA is the appropriate
test for this situation as well.
Hypothesis
Assume that you want to assess if the mean systolic BP (dependent variable; variable
name is “sbp”) is the same among males and females (independent variable; variable
name is “sex_1”) after controlling for diastolic BP (covariate; variable name is “dbp”).
H0: There is no difference in mean systolic BP between males and females in the
population, after controlling for diastolic BP.
HA: The mean systolic BP of males and females is different in the population, after
controlling for diastolic BP.
Assumptions
1. The dependent variable is normally distributed at each level of the independent
variable;
2. The variances of the dependent variable at each level of the independent
variable are the same (homogeneity of variances);
3. The covariates (if more than one) are not strongly correlated with each other
(r<0.8);
4. There is a linear relationship between the dependent variable and the covariates
at each level of the independent variable;
5. There is no interaction between the covariate (diastolic BP) and the independent
variable (sex) [called homogeneity of regression slopes].
21.1.1 Commands
Before performing the ANCOVA, it is better to check the descriptive statistics of the
dependent variable (systolic BP) at each level of the independent variable (sex). Use
the following command to get the descriptive statistics of systolic BP by sex (Table
21.1):
tabstat sbp, by(sex_1) stat(n mean sd)
sex_1 | N mean sd
-------+------------------------------
Female | 133 129.5714 21.37695
Male | 77 124.5584 17.22108
-------+------------------------------
Total | 210 127.7333 20.05794
--------------------------------------
Now, to perform the ANCOVA, use the first of the following commands. Here, the
dependent variable is systolic BP (sbp), the independent variable is sex (sex_1), and
the covariate is diastolic BP (dbp).
anova sbp i.sex_1 c.dbp i.sex_1#c.dbp
regress
margins sex_1, atmeans
The outputs of the above commands are displayed in Tables 21.2 (first and second
commands) and 21.3 (third command). The prefix “i.” when used before the name of
an independent variable, tells Stata that it is a categorical variable. It also tells Stata to
generate dummy variables (a set of dichotomous or indicator variables) if the categori-
cal variable has more than two levels (see Sections 5.11 and 16.2). On the other hand,
the prefix “c.” is used before a variable to indicate that the variable is a continuous
------------------------------------------------------------------------------
sbp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sex_1 |
Male | 1.569876 12.44837 0.13 0.900 -22.97267 26.11242
dbp | 1.458842 .0732245 19.92 0.000 1.314476 1.603207
|
sex_1#c.dbp |
Male | .0022405 .1526608 0.01 0.988 -.2987373 .3032184
|
_cons | 6.348646 6.254367 1.02 0.311 -5.982131 18.67942
------------------------------------------------------------------------------
variable. The follow-up “regress” command without any argument will display results
in the form of a regression table (Table 21.2). The last command will provide the
predicted (adjusted) mean of systolic BP at each level of sex considering the average
value of the covariate (diastolic BP) (Table 21.3).
You can get the pairwise comparison of predicted means of the dependent variable
when the main effect of the independent variable (with more than two levels) is statisti-
cally significant. If the main effect is not significant (in our example, sex is not signifi-
cantly associated with systolic BP; Table 21.2), it is not really necessary. However, for
the purpose of demonstration, we have performed the pairwise comparison with the
Bonferroni option, using the following command (a sketch consistent with the pairwise results discussed below for Table 21.4):
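pwcompare sex_1, effects mcompare(bonferroni)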
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sex_1 |
Female | 127.0921 .9372843 135.60 0.000 125.2442 128.94
Male | 128.8475 1.282687 100.45 0.000 126.3186 131.3763
------------------------------------------------------------------------------
However, before we conclude the results, it is important to check whether the assump-
tion of homogeneity of regression slopes is violated or not. To check this, we need to
look at the significance level (p-value) of the interaction term (sex_1#dbp) in the table
(Table 21.2). We can see that the p-value for interaction is 0.988 (>0.05), suggesting
that there is no interaction. This indicates that the homogeneity of regression slopes
assumption is not violated. A p-value of <0.05 (i.e., if there is an interaction) suggests
the regression slopes are not homogeneous and the ANCOVA test is inappropriate.
Table 21.3 shows the outputs of the “margins” command. Margins are the statistics
calculated from predictions of a previously analyzed model. In our example, we have
used the option “atmeans” with the “margins” command (there are other options like
“asbalanced” and “asobserved”) to get the predicted values (adjusted values) consider-
ing the average value of the covariate. Therefore, the “margins” command has provid-
ed us with the adjusted (predicted) mean of systolic BP for each level of sex, consider-
ing the average value of diastolic BP (82.76 mmHg). We can see that the mean systolic
BP of females is 127.09 mmHg, while it is 128.84 mmHg for males after adjusting for
the average value of diastolic BP (the adjusted means are different from the unadjusted
means as shown in Table 21.1).
Table 21.4 shows the pairwise comparison of the predicted means. This analysis is not
necessary in our example since the independent variable (sex) is not significantly asso-
ciated with the dependent variable. If the independent variable has more than two
levels and is statistically significant, then the table for pairwise comparison is import-
ant.
Table 21.4 shows that there is no significant difference in mean systolic BP between
males and females as the p-value is 0.90 (>0.05). The analysis ignored the Bonferroni
option since the variable “sex” has only two levels. The Bonferroni test is a type of
multiple comparison test (there are other tests for pairwise comparisons, such as Schef-
fe’s, Sidak’s, and Tukey’s tests) used for pairwise comparison of means. The Bonfer-
roni correction is done to adjust the type I errors when multiple pairwise tests are
performed on a single variable.
Suppose that we want to assess whether occupation and diabetes are associated with the dependent variable (diastolic BP) after controlling for age. For the analysis, we will use the data file <Data_3.dta>. The variable names are: “dbp” for diastolic BP; “occupation” for occupation (1= govt. job; 2= private job; 3= business; 4= others); “diabetes1” for diabetes (0= no diabetes; 1= have diabetes); and “age” for age.
Assumptions
All the assumptions for the one-way ANCOVA are applicable to two-way ANCOVA.
21.2.1 Commands
To perform the two-way ANCOVA, use the following command. The variables includ-
ed in the analysis are dbp (diastolic BP; as dependent variable), occupation, diabetes1,
and age.
anova dbp i.occupation i.diabetes1 c.age i.occupation#i.diabetes1
regress
The first command is the basic command for ANCOVA. The follow-up command
"regress" will display the outputs in the form of a regression table, as we have seen
earlier in one-way ANCOVA. The outputs of both the commands are displayed in
Table 21.5.
To get the predicted (adjusted) mean of diastolic BP at each level of occupation and
diabetes considering the average values of age (covariate), use the following com-
mand:
margins occupation diabetes1, atmeans
The outputs are displayed in Table 21.6. You can also get the marginal means for a
combination of occupation and diabetes (e.g., govt. job with diabetes, govt. job without
diabetes, etc.), if you use the following command (outputs not shown):
margins occupation#diabetes1
Or,
margins occupation#diabetes1, atmeans
You can generate a plot of predicted values (adjusted means) of the dependent variable
for the independent variables in the model. To generate a plot of adjusted mean diastol-
ic BP for occupation and diabetes, use the following commands successively (Fig
21.1):
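A sketch of the plotting commands meant here, following the margins/marginsplot pattern used earlier (the xdimension() choice matches the axis of Fig 21.1):

* predicted means for each occupation-by-diabetes combination, then plot
margins occupation#diabetes1, atmeans
marginsplot, xdimension(diabetes1)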
. regress
--------------------------------------------------------------------------------------
dbp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------------------+----------------------------------------------------------------
occupation |
PRIVATE JOB | -1.370739 2.507417 -0.55 0.585 -6.314954 3.573477
BUSINESS | -.9578347 2.602651 -0.37 0.713 -6.089838 4.174168
OTHERS | -3.433968 2.561338 -1.34 0.182 -8.484508 1.616571
|
diabetes1 |
yes | -1.513519 4.127483 -0.37 0.714 -9.652241 6.625204
age | -.0365282 .1107028 -0.33 0.742 -.2548161 .1817597
|
occupation#diabetes1 |
PRIVATE JOB#yes | -1.536793 6.179707 -0.25 0.804 -13.72217 10.64858
BUSINESS#yes | 2.673676 5.64161 0.47 0.636 -8.450656 13.79801
OTHERS#yes | 3.312635 5.554671 0.60 0.552 -7.640267 14.26554
|
_cons | 85.12535 3.322763 25.62 0.000 78.57341 91.67729
--------------------------------------------------------------------------------------
You can perform the pairwise comparison of predicted means of the dependent
variable when the main effects of the independent variables are statistically significant.
If the main effect is not significant for an independent variable, it is not necessary to do
the pairwise comparison test. In our example, the main effects of both occupation (p=
0.76) and diabetes (p= 0.84) are not statistically significant. However, for the purpose of demonstration, we have performed the pairwise comparisons with the Bonferroni option, using the following command (a sketch consistent with the output in Table 21.7):
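pwcompare occupation diabetes1, effects mcompare(bonferroni)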
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
occupation |
GOVT JOB | 83.83251 1.549116 54.12 0.000 80.7779 86.88711
PRIVATE JOB | 82.13245 1.717086 47.83 0.000 78.74664 85.51827
BUSINESS | 83.4476 1.712003 48.74 0.000 80.07181 86.82339
OTHERS | 81.10839 1.663185 48.77 0.000 77.82886 84.38792
|
diabetes1 |
no | 82.76318 .92934 89.06 0.000 80.93067 84.59569
yes | 82.33521 1.830683 44.98 0.000 78.7254 85.94501
------------------------------------------------------------------------------
[Figure 21.1: adjusted mean diastolic BP (linear prediction, roughly 70–90 mmHg) by diabetes status (no/yes), with one line per occupational group]
Margins : asbalanced
---------------------------
| Number of
| Comparisons
-------------+-------------
occupation | 6
diabetes1 | 1
---------------------------
------------------------------------------------------------------------------------------
| Bonferroni Bonferroni
| Contrast Std. Err. t P>|t| [95% Conf. Interval]
-------------------------+----------------------------------------------------------------
occupation |
PRIVATE JOB vs GOVT JOB | -2.139135 3.089479 -0.69 1.000 -10.37143 6.093161
BUSINESS vs GOVT JOB | .3790031 2.821173 0.13 1.000 -7.138358 7.896364
OTHERS vs GOVT JOB | -1.77765 2.778096 -0.64 1.000 -9.18023 5.624929
BUSINESS vs PRIVATE JOB | 2.518138 3.00248 0.84 1.000 -5.482339 10.51862
OTHERS vs PRIVATE JOB | .3614848 2.963006 0.12 1.000 -7.53381 8.256779
OTHERS vs BUSINESS | -2.156654 2.677556 -0.81 1.000 -9.291331 4.978024
|
diabetes1 |
yes vs no | -.4011393 2.050795 -0.20 0.845 -4.444971 3.642693
------------------------------------------------------------------------------------------
Table 21.7 depicts the pairwise comparisons of adjusted mean diastolic BP across the occupational groups and the diabetes categories. When an independent variable with more than two levels is significantly associated with the dependent variable, a pairwise comparison is required. Examine the p-values (P>|t|) in Table 21.7. Since all the p-values are >0.05, there is no significant difference in mean diastolic BP among the occupational groups or between those with and without diabetes.
Figure 21.1 plots the adjusted mean diastolic BP with 95% CIs for the different occupational groups, disaggregated by diabetes. Finally, from the data, we can conclude that diastolic BP is not associated with occupation or diabetes after controlling for age and the other independent variable in the model (diabetes for occupation, and occupation for diabetes).
22
Miscellaneous
In this chapter, we will discuss two important and useful statistical methods that are frequently needed during data management and analysis: how to test the reliability of a scale and how to develop wealth quintiles.
alpha q1-q4
alpha q1-q4, std
alpha q1-q4, std item detail
alpha q1-q4, item reverse(q3 q4)
The first command will provide Cronbach’s alpha coefficient based on unstandardized
items (Table 22.1) (q1-q4 indicates the variables from q1 to q4). When the option “std”
is used (second command), Stata will provide the coefficient based on standardized
values of the items (Table 22.1). This option is used when items are not measured on
the same scale. It also provides an unbiased estimate. Our suggestion is to use the
option “std” to get the alpha coefficient even though items are on the same scale of
measurement.
With the third command (using the options "std", "item", and "detail"), Stata will provide the item-wise values of the coefficients in a table and an interitem correlation matrix (Table 22.2). The last command [with the option "reverse(q3 q4)"] will
reverse the coding of variables q3 and q4. This option is used when one or more nega-
tively worded variables need to be reversed (we don’t have such a problem in our data).
22.1.1 Interpretation
Table 22.1 shows the unstandardized (0.839) and standardized (0.840) values of Cronbach's alpha coefficient. The standardized value is especially important when the items are not all on the same scale of measurement; it also provides an unbiased estimate.
Interitem correlation matrix (bottom panel of Table 22.2):
         q1      q2      q3      q4
q1   1.0000
q2   0.5122  1.0000
q3   0.4911  0.6346  1.0000
q4   0.5483  0.6295  0.5892  1.0000
However, before considering the value of Cronbach’s alpha coefficient, look at the
"Interitem correlations matrix" displayed at the bottom of Table 22.2. All the values in
the matrix must be positive (all the values are positive in our example). The presence
of one or more negative values indicates that some of the items have not been "reverse
scored" correctly. This information is also provided in the first part of the table under
the "Sign" column (all the items have a positive sign).
The "item-rest correlation" in Table 22.2 indicates the degree to which each item correlates with the total score of the remaining items. In our example, the values for q1 to q4 are 0.60, 0.71, 0.67, and 0.70, respectively. A small value (<0.30) for any item could indicate a problem. If both the Cronbach's alpha coefficient and the item-rest correlation for an item are small (<0.7 and <0.3, respectively), one may consider omitting that item from the scale. In our example, there is no such problem.
If the number of items on the scale is small (fewer than 10), it may be difficult to obtain a reasonable Cronbach's alpha coefficient. In such a situation, consider the average interitem correlation provided in Table 22.2. In this example, the item-level average interitem correlations range from 0.542 to 0.617, and the overall scale average is 0.567.
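As a consistency check, the standardized alpha can be recovered from this scale-average interitem correlation using the standard formula for standardized Cronbach's alpha (with $k$ items and average interitem correlation $\bar{r}$):
\[
\alpha_{\text{std}} = \frac{k\bar{r}}{1+(k-1)\bar{r}} = \frac{4 \times 0.567}{1 + 3 \times 0.567} \approx 0.840,
\]
which agrees with the standardized coefficient reported in Table 22.1.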
22.2 Constructing wealth quintiles
Wealth quintiles are commonly constructed by running a principal component analysis (PCA) on household asset indicators and then dividing the resulting wealth index into five equal groups, as in the steps below.
Step 1: All the variables to be used for constructing the wealth quintiles must be dichotomous variables with a 0/1 coding scheme. In our dataset, there are 18 variables, of which two (water and toilet) are categorical variables with more than two levels. We need to dichotomize them with a 0/1 coding scheme. Let us first check the value labels and coding schemes of the variables "water" and "toilet" by using the following command:
label list water toilet
Table 22.3 shows the code numbers and value labels of the variables: the variable "water" has 14 categories and the variable "toilet" has 11 categories. We need to dichotomize (0/1) both of these variables following some guidelines (e.g., the Demographic and Health Survey guidelines). Let us consider the categories (code numbers) 11 to 13, 21, and 91 as safe sources of drinking water and code them as 1; the other categories will be coded as 0. Similarly, for the variable "toilet" (use of latrine), let us consider the code numbers 11 to 13 and 22 as hygienic practices and code them as 1, while the others will be coded as 0. Use the following commands to recode the variable "water":
gen water1=0
replace water1=1 if water==11
replace water1=1 if water==12
replace water1=1 if water==13
replace water1=1 if water==21
replace water1=1 if water==91
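Incidentally, the five commands above can be collapsed into a single line using Stata's inlist() function; a minimal equivalent sketch:
gen water1 = inlist(water, 11, 12, 13, 21, 91)
As with the five-command version, observations whose water code is not in the list (including missing values) are coded 0.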
All these commands will generate a new variable "water1" with the 0/1 coding scheme as stated before. Now label the new variable (water1) and assign the value labels, as sketched below.
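These labeling commands are a minimal sketch, using the same label syntax the book applies to w_quintile later in this chapter; the label texts are our own choice, since the originals are not shown at this point in the text:
lab var water1 "safe source of drinking water"
la de water1 0"unsafe" 1"safe"
la values water1 water1
The variable "toilet" is dichotomized into "toilet1" in the same way, following the coding rule stated above (codes 11 to 13 and 22 coded as 1):
gen toilet1=0
replace toilet1=1 if toilet==11
replace toilet1=1 if toilet==12
replace toilet1=1 if toilet==13
replace toilet1=1 if toilet==22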
Step 2: Once the multinomial variables are dichotomized, check the prevalence (relative frequency) of all the variables to be included in the PCA by using the following command:
sum electricity-toilet1
This command will provide Table 22.4 ("electricity-toilet1", as used with the command, indicates all the variables from electricity to toilet1). We can see that all the variables are coded as 0/1 (columns Min and Max). The column "Mean" in the table indicates the relative frequency (proportion) of code 1. For example, the mean of the variable "electricity" is 0.42, indicating that 42% of the subjects use electricity. Now, identify the variables with very small proportions (<0.01, i.e., <1%). In our example, the variables "car" and "boat" have very low proportions (0.005 or 0.5%, and 0.004 or 0.4%, respectively), and we will not include them in the PCA.
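If preferred, the two low-prevalence variables can instead be removed from the dataset before the analysis (a design choice rather than a requirement; below we simply omit them from the pca varlist):
drop car boat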
Step 3: Now we will do the principal component analysis (PCA) of all the variables except "car" and "boat", and calculate the wealth index, by using the following commands:
pca electricity radio tv mobile refrigerator almirah table chair watch ///
cycle motorcycle rikshow hhland firmland water1 toilet1, factor(1)
Or,
pca electricity - motorcycle rikshow - toilet1, factor(1)
predict comp1
ren comp1 w_index
The first or the second command will do the PCA (output not shown). The third command (which must be used after performing the PCA) will generate a new variable "comp1" containing the wealth index (the first principal component score) for each study subject. The last command will rename "comp1" to "w_index" (we did this purely for our own convenience).
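Although not a required step here, it is worth checking how much of the total variance the first component captures before using it as a wealth index. Standard post-estimation commands after pca include:
screeplot
estat loadings
screeplot graphs the eigenvalues of the components, and estat loadings lists each variable's loading on the retained component.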
Step 4: We will now construct the wealth quintiles from the wealth index variable (w_index) by using the following command:
xtile w_quintile=w_index, nq(5)
The above command will construct the wealth quintiles in a new variable "w_quintile". Finally, label the new variable and assign the value labels by using the first three commands below, and check the wealth quintiles by generating a frequency distribution table using the last command. The output is provided in Table 22.5.
lab var w_quintile "wealth quintile"
la de w_quintile 1"poorest" 2"poorer" 3"middle" 4"richer" 5"richest"
la values w_quintile w_quintile
tab w_quintile
Table 22.5. Frequency distribution of the wealth quintiles
. tab w_quintile
 5 quantiles |
 of w_index  |      Freq.     Percent        Cum.
-------------+-----------------------------------
     poorest |        237       20.00       20.00
      poorer |        237       20.00       40.00
      middle |        238       20.08       60.08
      richer |        236       19.92       80.00
     richest |        237       20.00      100.00
-------------+-----------------------------------
       Total |      1,185      100.00
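As a quick sanity check (our suggestion, not a step from the original procedure), confirm that the mean wealth index rises monotonically from the poorest to the richest quintile:
tabstat w_index, by(w_quintile) stat(mean min max)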
Subject index with chapters and sections