The Workflow of Data Analysis Using Stata

J. Scott Long
Departments of Sociology and Statistics
Indiana University Bloomington

A Stata Press Publication
StataCorp LP
College Station, Texas

Copyright © 2009 by StataCorp LP
All rights reserved. First edition 2009
Published by Stata Press, 4905 Lakeway Drive, College Station, Texas 77845
Typeset in LaTeX2e
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
ISBN-10: 1-59718-047-5
ISBN-13: 978-1-59718-047-4

No part of this book may be reproduced, stored in a retrieval system, or transcribed, in any form or by any means—electronic, mechanical, photocopy, recording, or otherwise—without the prior written permission of StataCorp LP.

Stata is a registered trademark of StataCorp LP. LaTeX2e is a trademark of the American Mathematical Society.

To Valerie

Contents
List of tables
List of figures
List of examples
Preface
A word about fonts, files, commands, and examples

1 Introduction
  1.1 Replication: The guiding principle for workflow
  1.2 Steps in the workflow
    1.2.1 Cleaning data
    1.2.2 Running analysis
    1.2.3 Presenting results
    1.2.4 Protecting files
  1.3 Tasks within each step
    1.3.1 Planning
    1.3.2 Organization
    1.3.3 Documentation
    1.3.4 Execution
  1.4 Criteria for choosing a workflow
    1.4.1 Accuracy
    1.4.2 Efficiency
    1.4.3 Simplicity
    1.4.4 Standardization
    1.4.5 Automation
    1.4.6 Usability
    1.4.7 Scalability
  1.5 Changing your workflow
  1.6 How the book is organized
2 Planning, organizing, and documenting
  2.1 The cycle of data analysis
  2.2 Planning
  2.3 Organization
    2.3.1 Principles for organization
    2.3.2 Organizing files and directories
    2.3.3 Creating your directory structure
      A directory structure for a small project
      A directory structure for a large, one-person project
      Directories for collaborative projects
      Special-purpose directories
      Remembering what directories contain
      Planning your directory structure
      Naming files
      Batch files
    2.3.4 Moving into a new directory structure (advanced topic)
      Example of moving into a new directory structure
  2.4 Documentation
    2.4.1 What should you document?
    2.4.2 Levels of documentation
    2.4.3 Suggestions for writing documentation
      Evaluating your documentation
    2.4.4 The research log
      A sample page from a research log
      A template for research logs
    2.4.5 Codebooks
      A codebook based on the survey instrument
    2.4.6 Dataset documentation
  2.5 Conclusions
3 Writing and debugging do-files
  3.1 Three ways to execute commands
    3.1.1 The Command window
    3.1.2 Dialog boxes
    3.1.3 Do-files
  3.2 Writing effective do-files
    3.2.1 Making do-files robust
      Make do-files self-contained
      Use version control
      Exclude directory information
      Include seeds for random numbers
    3.2.2 Making do-files legible
      Use lots of comments
      Use alignment and indentation
      Use short lines
      Limit your abbreviations
      Be consistent
    3.2.3 Templates for do-files
      Commands that belong in every do-file
      A template for simple do-files
      A more complex do-file template
  3.3 Debugging do-files
    3.3.1 Simple errors and how to fix them
      Log file is open
      Log file already exists
      Incorrect command name
      Incorrect variable name
      Incorrect option
      Missing comma before options
    3.3.2 Steps for resolving errors
      Step 1: Update Stata and user-written programs
      Step 2: Start with a clean slate
      Step 3: Try other data
      Step 4: Assume everything could be wrong
      Step 5: Run the program in steps
      Step 6: Exclude parts of the do-file
      Step 7: Starting over
      Step 8: Sometimes it is not your mistake
    3.3.3 Example 1: Debugging a subtle syntax error
    3.3.4 Example 2: Debugging unanticipated results
    3.3.5 Advanced methods for debugging
  3.4 How to get help
  3.5 Conclusions
4 Automating your work
  4.1 Macros
    4.1.1 Local and global macros
      Local macros
      Global macros
      Using double quotes when defining macros
      Creating long strings
    4.1.2 Specifying groups of variables and nested models
    4.1.3 Setting options with locals
  4.2 Information returned by Stata commands
      Using returned results with local macros
  4.3 Loops: foreach and forvalues
      The foreach command
      The forvalues command
    4.3.1 Ways to use loops
      Loop example 1: Listing variable and value labels
      Loop example 2: Creating interaction variables
      Loop example 3: Fitting models with alternative measures of education
      Loop example 4: Recoding multiple variables the same way
      Loop example 5: Creating a macro that holds accumulated information
      Loop example 6: Retrieving information returned by Stata
    4.3.2 Counters in loops
      Using loops to save results to a matrix
    4.3.3 Nested loops
    4.3.4 Debugging loops
  4.4 The include command
    4.4.1 Specifying the analysis sample with an include file
    4.4.2 Recoding data using include files
    4.4.3 Caution when using include files
  4.5 Ado-files
    4.5.1 A simple program to change directories
    4.5.2 Loading and deleting ado-files
    4.5.3 Listing variable names and labels
    4.5.4 A general program to change your working directory
    4.5.5 Words of caution
  4.6 Help files
    4.6.1 nmlabel.hlp
    4.6.2 help me
  4.7 Conclusions
5 Names, notes, and labels
  5.1 Posting files
  5.2 The dual workflow of data management and statistical analysis
  5.3 Names, notes, and labels
  5.4 Naming do-files
    5.4.1 Naming do-files to re-create datasets
    5.4.2 Naming do-files to reproduce statistical analysis
    5.4.3 Using master do-files
      Master log files
    5.4.4 A template for naming do-files
      Using subdirectories for complex analyses
  5.5 Naming and internally documenting datasets
      Never name it final!
    5.5.1 One-time-only and temporary datasets
    5.5.2 Datasets for larger projects
    5.5.3 Labels and notes for datasets
    5.5.4 The datasignature command
      A workflow using the datasignature command
      Changes datasignature does not detect
  5.6 Naming variables
    5.6.1 The fundamental principle for creating and naming variables
    5.6.2 Systems for naming variables
      Sequential naming systems
      Source naming systems
      Mnemonic naming systems
    5.6.3 Planning names
    5.6.4 Principles for selecting names
      Anticipate looking for variables
      Use simple, unambiguous names
      Try names before you decide
  5.7 Labeling variables
    5.7.1 Listing variable labels and other information
      Changing the order of variables in your dataset
    5.7.2 Syntax for label variable
    5.7.3 Principles for variable labels
      Beware of truncation
      Test labels before you post the file
    5.7.4 Temporarily changing variable labels
    5.7.5 Creating variable labels that include the variable name
  5.8 Adding notes to variables
    5.8.1 Commands for working with notes
      Listing notes
      Removing notes
      Searching notes
    5.8.2 Using macros and loops with notes
  5.9 Value labels
    5.9.1 Creating value labels is a two-step process
      Step 1: Defining labels
      Step 2: Assigning labels
      Why a two-step system?
      Removing labels
    5.9.2 Principles for constructing value labels
      1) Keep labels short
      2) Include the category number
      3) Avoid special characters
      4) Keeping track of where labels are used
    5.9.3 Cleaning value labels
    5.9.4 Consistent value labels for missing values
    5.9.5 Using loops when assigning value labels
  5.10 Using multiple languages
    5.10.1 Using label language for different written languages
    5.10.2 Using label language for short and long labels
  5.11 A workflow for names and labels
      Step 1: Plan the changes
      Step 2: Archive, clone, and rename
      Step 3: Revise variable labels
      Step 4: Revise value labels
      Step 5: Verify the changes
    5.11.1 Step 1: Check the source data
      Step 1a: List the current names and labels
      Step 1b: Try the current names and labels
    5.11.2 Step 2: Create clones and rename variables
      Step 2a: Create clones
      Step 2b: Create rename commands
      Step 2c: Rename variables
    5.11.3 Step 3: Revise variable labels
      Step 3a: Create variable-label commands
      Step 3b: Revise variable labels
    5.11.4 Step 4: Revise value labels
      Step 4a: List the current labels
      Step 4b: Create label define commands to edit
      Step 4c: Revise labels and add them to the dataset
    5.11.5 Step 5: Check the new names and labels
  5.12 Conclusions
6 Cleaning your data
  6.1 Importing data
    6.1.1 Data formats
      ASCII data formats
      Binary-data formats
    6.1.2 Ways to import data
      Stata commands to import data
      Using other statistical packages to export data
      Using a data conversion program
    6.1.3 Verifying data conversion
      Converting the ISSP 2002 data from Russia
  6.2 Verifying variables
    6.2.1 Values review
      Values review of data about the scientific career
      Values review of data on family values
    6.2.2 Substantive review
      What does time to degree measure?
      Examining high-frequency values
      Links among variables
      Changes in survey questions
    6.2.3 Missing-data review
      Comparisons and missing values
      Creating indicators of whether cases are missing
      Using extended missing values
      Verifying and expanding missing-data codes
      Using include files
    6.2.4 Internal consistency review
      Consistency in data on the scientific career
    6.2.5 Principles for fixing data inconsistencies
  6.3 Creating variables for analysis
    6.3.1 Principles for creating new variables
      New variables get new names
      Verify that new variables are correct
      Document new variables
      Keep the source variables
    6.3.2 Core commands for creating variables
      The generate command
      The clonevar command
      The replace command
    6.3.3 Creating variables with missing values
    6.3.4 Additional commands for creating variables
      The recode command
      The egen command
      The tabulate, generate() command
    6.3.5 Labeling variables created by Stata
    6.3.6 Verifying that variables are correct
      Checking the code
      Listing variables
      Plotting continuous variables
      Tabulating variables
      Constructing variables multiple ways
  6.4 Saving datasets
    6.4.1 Selecting observations
      Deleting cases versus creating selection variables
    6.4.2 Dropping variables
      Selecting variables for the ISSP 2002 Russian data
    6.4.3 Ordering variables
    6.4.4 Internal documentation
    6.4.5 Compressing variables
    6.4.6 Running diagnostics
      The codebook, problems command
      Checking for unique ID variables
    6.4.7 Adding a data signature
    6.4.8 Saving the file
    6.4.9 After a file is saved
  6.5 Extended example of preparing data for analysis
      Creating control variables
      Creating binary indicators of positive attitudes
      Creating four-category scales of positive attitudes
  6.6 Merging files
    6.6.1 Match-merging
      Sorting the ID variable
      One-to-one merging
    6.6.2 Combining unrelated datasets
    6.6.3 Forgetting to match-merge
  6.7 Conclusions
7 Analyzing data and presenting results
  7.1 Planning and organizing statistical analysis
    7.1.1 Planning in the large
    7.1.2 Planning in the middle
    7.1.3 Planning in the small
  7.2 Organizing do-files
    7.2.1 Using master do-files
    7.2.2 What belongs in your do-file?
  7.3 Documentation for statistical analysis
    7.3.1 The research log and comments in do-files
    7.3.2 Documenting the provenance of results
      Captions on graphs
  7.4 Analyzing data using automation
    7.4.1 Locals to define sets of variables
    7.4.2 Loops for repeated analyses
      Computing t tests using loops
      Loops for alternative model specifications
    7.4.3 Matrices to collect and print results
      Collecting results of t tests
      Saving results from nested regressions
      Saving results from different transformations of articles
      Creating a graph from a matrix
    7.4.4 Include files to load data and select your sample
  7.5 Baseline statistics
  7.6 Replication
    7.6.1 Lost or forgotten files
    7.6.2 Software and version control
    7.6.3 Unknown seed for random numbers
      Bootstrap standard errors
      Letting Stata set the seed
      Training and confirmation samples
    7.6.4 Using a global that is not in your do-file
  7.7 Presenting results
    7.7.1 Creating tables
      Using spreadsheets
      Regression tables with esttab
    7.7.2 Creating graphs
      Colors, black, and white
      Font size
    7.7.3 Tips for papers and presentations
      Papers
      Presentations
  7.8 A project checklist
  7.9 Conclusions

8 Protecting your files
  8.1 Levels of protection and types of files
  8.2 Causes of data loss and issues in recovering a file
  8.3 Murphy's law and rules for copying files
  8.4 A workflow for file protection
      Part 1: Mirroring active storage
      Part 2: Offline backups
  8.5 Archival preservation
  8.6 Conclusions
9 Conclusions

A How Stata works
  A.1 How Stata works
      Stata directories
      The working directory
  A.2 Working on a network
  A.3 Customizing Stata
    A.3.1 Fonts and window locations
    A.3.2 Commands to change preferences
      Options that can be set permanently
      Options that need to be set each session
    A.3.3 profile.do
      Function keys
  A.4 Additional resources

References

Author index

Subject index
Tables

3.1 Stata command abbreviations used in the book
5.1 Recommendations for capital letters used when naming variables
5.2 Suggested meanings for extended missing-value codes
7.1 Example of a TeX table created using esttab
8.1 Issues related to backup from the perspective of a data analyst

Figures
2.1 The cycle of data analysis
2.2 Spreadsheet plan of a directory structure
2.3 Sample page from a research log
2.4 Workflow research log template
2.5 Codebook created from the survey instrument for the SGC-MHS Study
2.6 Data registry spreadsheet
4.1 Viewer window displaying help nmlabel
4.2 Viewer window displaying help me
5.1 The dual workflow of data management and statistical analysis
5.2 The dual workflow of data management and statistical analysis after fixing an error in data03.do
5.3 Sample spreadsheet for planning variable names
6.1 A fixed-format ASCII file
6.2 A free-format ASCII file with variable names
6.3 A binary file in Stata format
6.4 Descriptive statistics from SPSS
6.5 Frequency distribution from SPSS
6.6 Transfer tab from the Stat/Transfer dialog box
6.7 Observations tab from the Stat/Transfer dialog box
6.8 Combined missing values in frequencies from SPSS
6.9 Four overlapping dimensions of data verification
6.10 Thumbnails of two-way graphs
6.11 Spreadsheet of possible reasons for missing data
6.12 Spreadsheet with variables that require similar data processing grouped together
6.13 Merging unrelated datasets
7.1 Example of a research log with links to do-files
7.2 Example of results reported in a paper
7.3 Addition using hidden font to show the provenance of the results
7.4 Spreadsheet with pasted text
7.5 Convert text to columns wizard
7.6 Spreadsheet with header information added
7.7 Colored graph printed in black and white
7.8 Graph with small, hard to read text and a graph with readable text
8.1 Levels of protection for files
8.2 A two-part workflow for protecting files used in data analysis
Examples

Selecting a random subsample
Debugging a syntax error in graph
Combining information on binary variables
Debugging unanticipated results
Local, specifying groups of variables
Local, with graph options
Returned results, centering a variable
Loop, listing variable and value labels
Loop, creating interactions
Loop, fitting models with alternative measures
Loop, recoding variables
Loop, creating a macro with results
Loop, retrieving returned information
Loop, adding a counter
Loop, saving results to matrix
Include file, specifying the sample
Include file, recoding data
Ado-file, change to specific directory
Ado-file, listing variable and value labels
Ado-file, general program to change directories
Ado-file, listing variable and value labels
Help file, nmlabel.hlp
Master do-file and log file
Truncation, long names
Truncation, long labels
Loop, cloning variables and adding notes
Local, using a tag local and a loop
Truncation, long value labels
Loop, adding value labels
Names and labels (extended example)
Verifying data conversion
Values review of data about the scientific career
Loop, generating dotplots
Values review of data on family values
Substantive review of time to degree
Graphs, dotplots to compare variables
Substantive review using links among science variables
Loop, generating scatterplots
Graphs, all pairs of variables
Missing data, creating indicator variable
Missing data, months of marriage
Internal consistency review of science data
Recoding variables with recode
Creating indicator variables with tabulate, generate()
Graphs, -y compared with y plot
Finding identical observations with duplicates
Preparing data for analysis (extended example)
Creating binary indicators of attitude variables
Creating four-category scales
Merging files, match-merging
Merging files, merging unrelated datasets
Master do-file and log file for study of well-being
Graphs, adding a caption
Local, define sets of variables
Loop, collect data on multiple t tests
Loop, alternative model specifications
Matrix, collect results of group comparison
Matrix, collect results from nested models
Matrix, collect results for different transformations
Graph, created from data in matrix
Include file, load data and select sample
Baseline statistics
Bootstrap standard errors
Stepwise model selection
Regression tables with esttab
Preface
This book is about methods that allow you to work efficiently and accurately when you
analyze data. Although it does not deal with specific statistical techniques, it discusses
the steps that you go through with any type of data analysis. These steps include
planning your work, documenting your activities, creating and verifying variables, gen-
erating and presenting statistical analyses, replicating findings, and archiving what you
have done. These combined issues are what I refer to as the workflow of data analysis.
A good workflow is essential for replication of your work, and replication is essential for
good science.
My decision to write this book grew out of my teaching, researching, consulting, and
collaborating. I increasingly saw that people were drowning in their data. With cheap
computing and storage, it is easier to create files and variables than it is to keep track
of them. As datasets have become more complicated, the process of managing data has
become more challenging. When consulting, much of my time was spent on issues of data
management and figuring out what had been done to generate a particular set of results.
In collaborative projects, I found that problems with workflow were multiplied. Another
motivation came from my work with Jeremy Freese on the package of Stata programs
known as SPost (Long and Freese 2006). These programs were downloaded more than
20,000 times last year, and we were contacted by hundreds of users. Responding to these
questions showed me how researchers from many disciplines organize their data analysis
and the ways in which this organization can break down. When helping someone with
what appeared to be a problem with an SPost command, I often discovered that the
problem was related to some aspect of the user's workflow. When people asked if there
was something they could read about this, I had nothing to suggest.
A final impetus for writing the book came from Bruce Fraser’s Real World Camera
Raw with Adobe Photoshop CS2 (2005). A much-touted advantage of digital photography is that you can take a lot of pictures. The catch is keeping track of thousands of
pictures. Imaging experts have been aware of this issue for a long time and refer to it as
workflow—keeping track of your work as it flows through the many stages to the final
product. As the amount of time I spent looking for a particular picture became greater
than the time I spent taking pictures, it was clear that I needed to take Fraser’s advice
and develop a workflow for digital imaging. Fraser’s book got me thinking about data
analysis in terms of the concept of a workflow.
After years of gestation, the book took two years to write. When I started, I thought
my workflow was very good and that it was simply a matter of recording what I did. As
writing proceeded, I discovered gaps, inefficiencies, and inconsistencies in what I did.
Sometimes these involved procedures that I knew were awkward, but where I never took
the time to find a better approach. Some problems were due to oversights where I had
not realized the consequences of the things I did or failed to do. In other instances,
I found that I used multiple approaches for the same task, never choosing one as the
best practice. Writing this book forced me to be more consistent and efficient. The
advantages of my improved workflow became clear when revising two papers that were
accepted for publication. The analyses for one paper were completed before | started
the workflow project, whereas the analyses for the other were completed after much
of the book had been drafted. I was pleased by how much easier it was to revise the
analyses in the paper that used the procedures from the book. Part of the improvement
was due to having better ways of doing things. Equally important was that I had a
consistent and documented way of doing things.
I have no illusions that the methods I recommend are the best or only way of doing
things. Indeed, I look forward to hearing from readers who have suggestions for a better
workflow. Your suggestions will be added to the book’s web site. However, the methods
I present work well and avoid many pitfalls. An important aspect of an efficient workflow
is to find one way of doing things and sticking with it. Uniform procedures allow you
to work faster when you initially do the work, and they help you to understand your
earlier work if you need to return to it at a later time. Uniformity also makes working
in research teams easier because collaborators can more easily follow what others have
done. There is a lot to be said in favor of having established procedures that are
documented and working with others who use the same procedures. I hope you find
that this book provides such procedures.
Although this book should be useful for anyone who analyzes data, it is written
within several constraints. First, Stata is the primary computing language because I
find Stata to be the best general-purpose software for data management and statistical
analysis. Although nearly everything I do with Stata can be done in other software, I
do not include examples from other packages. Second, most examples use data from
the social sciences, because that is the field in which I work. The principles I discuss,
however, apply broadly to other fields. Finally, I work primarily in Windows. This
is not because I think Windows is a better operating system than Mac or Linux, but
because Windows is the primary operating system where I work. Just about everything
I suggest works equally well in other operating systems, and I have tried to note when
there are differences.
I want to thank the many people who commented on drafts or answered questions
about some aspect of workflow. I particularly thank Tait Runfeldt Medina, Curtis Child,
Nadine Reibling, and Shawna L. Rohrman whose detailed comments greatly improved
the book. I also thank Alan Acock, Myron Gutmann, Patricia McManus, Jack Thomas,
Leah VanWey, Rich Watson, Terry White, and Rich Williams for talking with me about
workflow. Many people at StataCorp helped in many ways. I particularly want to thank
Lisa Gilmore for producing the book, Jennifer Neve for editing, and Annette Fett for
designing the cover. David M. Drukker at StataCorp answered many of my questions.
His feedback made it a better book, and his friendship made it more fun to write.
Some of the material in this book grew out of research funded by NIH Grant Number
R01 TW006374 from the Fogarty International Center, the National Institute of Mental
Health, and the Office of Behavioral and Social Science Research to Indiana University-
Bloomington. Other work was supported by an anonymous foundation and The Bayer
Group. I gratefully acknowledge support provided by the College of Arts and Sciences
at Indiana University.
Without the unintended encouragement from my dear friend Fred, I would not have
started the book. Without the support of my dear wife Valerie, I would not have
completed it. Long overdue, this book is dedicated to her.
Bloomington, Indiana
Scott Long
October 2008

A word about fonts, files, commands, and examples
The book uses standard Stata conventions for typography. Items printed in a typewriter-
style typeface are Stata commands and options. For example, use mydata, clear.
Italics indicate information that you should add. For example, use dataset-name,
clear indicates that you should substitute the name of your dataset. When I provide
the syntax for a command, I generally show only some of the options. For full docu-
mentation, you can type help command-name or check the reference manual. Manuals
are referred to with the usual Stata conventions. For example, [R] logit refers to the
logit entry in the Base Reference Manual and [D] sort refers to the sort entry in the
Data Management Reference Manual.
Within the text, the commands or output for some examples will trail off the right
side of the page; see page 59 for an example. This is intentional to show you the
consequence of not controlling the length of commands and output.
The book includes many examples that I encourage you to try as you read. If the
name of a file begins with wf, you can download that file. I use (file: filename .do) to
let you know the name of the do-file that corresponds to the example being presented.
With few exceptions (e.g., some ado-files), if the name of a file does not begin with wf
(e.g., science2.dta), the file is not available for download. To find where a downloaded
file is used in the text, check the index under the entry for Workflow package files.
To download the examples, you must be in Stata and connected to the Internet.
There are two Workflow packages for Stata 10 (wf10-part1 and wf10-part2) and two
for Stata 9 (w£09-part1 and wf09-part2). To find and install the packages, type
findit workflow, choose the packages you need, and follow the instructions. Al-
though two packages are needed because of the large number of examples, I refer to
them simply as the Workflow package. Before trying these examples, be sure to up-
date your copy of Stata as described in [GS] 20 Updating and extending Stata—
Internet functionality. Additional information related to the book is located at
http://www.indiana.edu/~jslsoc/workflow.htm.

1 Introduction
This book is about methods for analyzing your data effectively, efficiently, and accu-
rately. I refer to these methods as the workflow of data analysis. Workflow involves the
entire process of data analysis including planning and documenting your work, cleaning
data and creating variables, producing and replicating statistical analyses, presenting
findings, and archiving your work. You already have a workflow, even if you do not
think of it as such. This workflow might be carefully planned or it might be ad hoc.
Because workflow for data analysis is rarely described in print or formally taught, re-
searchers often develop their workflow in reaction to problems they encounter and from
informal suggestions from colleagues. For example, after you discover two files with the
same name but different content, you might develop procedures (i.e., a workflow) for
naming files. Too often, good practice in data analysis is learned inefficiently through
trial and error. Hopefully, my book will shorten the learning process and allow you to
spend more time on what you really want to do.
Reactions to early drafts of this book convinced me that both beginners and experi-
enced data analysts can benefit from a more formal consideration of how they do data
analysis. Indeed, when I began this project, I thought that my workflow was pretty good
and that it was simply a matter of writing down what I routinely do. I was surprised
and pleased by how much my workflow improved as a result of thinking about these
issues systematically and from exchanging ideas with other researchers. Everyone can
improve their workflow with relatively little effort. Even though changing your workflow
involves an investment of time, you will recoup this investment by saving time in later
work and by avoiding errors in your data analysis.
Although I make many specific suggestions about workflow, most of the things that I
recommend can be done in other ways. My recommendations about the best practice for
a particular problem are based on my work with hundreds of researchers and students
from all sectors of employment and from fields ranging from chemistry to history. My
suggestions have worked for me and most have been refined with extensive use. This is
not to say that there is only one way to accomplish a given task or that I have the best
way. In Stata, as in any complex software environment, there are a myriad of ways to
complete a task. Some of these work only in the limited sense that they get a job done
but are error prone or inefficient. Among the many approaches that work well, you will
need to choose your preferred approach. To help you do this, I often discuss several
approaches to a given task. I also provide examples of ineffective procedures because
seeing the consequences of a misguided approach can be more effective than hearing
about the virtues of a better approach. These examples are all real, based on mistakes
I made (and I have made lots) or mistakes I encountered when helping others with
data analysis. You will have to choose a workflow that matches the project at hand,
the tools you have, and your temperament. There are as many workflows as there are
people doing data analysis, and there is no single workflow that is ideal for every person
or every project. What is critical is that you consider the general issues, choose your
own procedures, and stick with them unless you have a good reason to change them.
In the rest of this chapter, I provide a framework for understanding and evaluating
your workflow. I begin with the fundamental principle of replicability that should guide
every aspect of your workflow. No matter how you proceed in data analysis, you must
be able to justify and reproduce your results. Next I consider the four steps involved
in all types of data analysis: preparing data, running analysis, presenting results, and
preserving your work. Within each step there are four major tasks: planning the work,
organizing your files and materials, documenting what you have done, and executing
the analysis. Because there are alternative approaches to accomplish any given aspect,
of your work, what makes one workflow better than another? To answer this question,
I provide several criteria for evaluating the way you work. These criteria should help
you decide which procedures to use, and they motivate many of my recommendations
for best practice that are given in this book.
1.1 Replication: The guiding principle for workflow
Being able to reproduce the work you have presented or published should be the cor-
nerstone of any workflow. Science demands replicability and a good workflow facilitates
your ability to replicate your results. How you plan your project, document your work,
write your programs, and save your results should anticipate the need to replicate. Too
often researchers do not worry about replication until their work is challenged. This is
not to say that they are taking shortcuts, doing shoddy work, or making decisions that
are unjustified. Rather, I am talking about taking the steps necessary so that all the
good work that has been done can be easily reproduced at a later time. For example,
suppose that a colleague wants to expand upon your work and asks you for the data
and commands used to produce results in a published paper. When this happens, you
do not want to scramble furiously to replicate your results. Although it might take a
few hours to dig out your results (many of mine are in notebooks stacked behind my
file cabinets), this should be a matter of retrieving the records, not trying to remember
what it was you did or discovering that what you documented does not correspond to
what you presented.
Think about replication throughout your workflow. At the completion of each stage
of your work, take an hour or a day if necessary to review what you have done, to check
that the procedures are documented, and to confirm that the materials are archived.
When you have a draft of a paper to circulate, review the documentation, check that
you still have the files you used, confirm that the do-files still run, and double-check
that the numbers in your paper correspond to those in your output. Finally, make sure
that all this is documented in your research log (discussed on page 37).
If you have tried to replicate your own work months after it was completed or
tried to reproduce another author’s results using only the original dataset and the
published paper, you know how difficult it can be to replicate something. A good way
to understand what is required to replicate your work is to consider some of the things
that can make replication impossible. Many of these issues are discussed in detail later
in the book. First, you have to find the original files, which gets more difficult as time
passes. Once you have the files, are they in formats that can be analyzed by your current
software? If you can read the file, do you know exactly how variables were constructed
or cases were selected? Do you know which variables were in each regression model?
Even if you have all this information, it is possible that the software you are currently
using does not compute things exactly the same way as the software you used for the
original analyses. An effective workflow can make replication easier.
A recent example illustrates how difficult it can be to replicate even simple analyses.
I collected some data that were analyzed by a colleague in a published paper. I wanted
to replicate his results to extend the analyses. Due to a drive failure, some of his files
were lost. Neither of us could reproduce the exact results from the published paper.
We came close, but not close enough. Why? Suppose that 10 decisions were made in
the process of constructing the variables and selecting the sample for analysis. Many
of these decisions involve choices between options where neither choice is incorrect. For
example, do you take the square root of publications or the log after adding .5? With
10 such decisions, there are 2^10 = 1,024 different outcomes. All of them will lead to
similar findings, but not exactly the same findings. If you lose track of decisions made
in constructing your data, you will find it very difficult to reproduce what you have
done. By the way, remarkably, another researcher who was using these data discovered
the secret to reproducing the published results.
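For example, a do-file fragment along these lines (the variable name is hypothetical) records one such decision where it is made, so that rerunning the file always reproduces the same one of the 1,024 possible datasets:

    * Decision: use the log of publications after adding .5, not the square
    * root, to reduce skew while keeping cases with zero publications.
    generate lnpubs = ln(pubs + .5)
    label variable lnpubs "Log of (publications + .5)"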
Even if you have the original data and analysis files, it can be difficult to reproduce
results. For published papers, it is often impossible to obtain the original data or the
details on how the results were computed. Freese (2007) makes a compelling argument
for why disciplines should have policies that govern the availability of information needed
to replicate results. I fully support his recommendations.
1.2 Steps in the workflow
Data analysis involves four major steps: cleaning data, performing analysis, presenting
findings, and saving your work. Although there is a logical sequence to these steps, the
dynamics of an effective workflow are flexible and highly dependent upon the specific
project. Ideally, you advance one step at a time, always moving forward until you are
done. But, it never works that way for me. In practice, I move up and down the
steps depending on how the work goes. Perhaps I find a problem with a variable while
analyzing the data, which takes me back to cleaning. Or my results provide unexpected
insights, so I revise my plans for analysis. Still, I find it useful to think of these as
distinct steps.
1.2.1 Cleaning data

Before substantive analysis begins, you need to verify that your data are accurate and
that the variables are well named and properly labeled. That is, you clean the data.
First, you must bring your data into Stata. If you received the data in Stata format,
this is as simple as a single use command. If the data arrived in another format, you
need to verify that they were imported correctly into Stata. You should also evaluate
the variable names and labels. Awkward names make it more difficult to analyze the
data and can lead to mistakes. Likewise, incomplete or poorly designed labels make the
output difficult to read and lead to mistakes. Next verify that the sample and variables
are what they should be. Do the variables have the correct values? Are missing data
coded appropriately? Are the data internally consistent? Is the sample size correct?
Do the variables have the distribution that you would expect? Once these questions are
resolved, you can select the sample and construct new variables needed for analysis.
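A sketch of these checks, using a hypothetical dataset and hypothetical variable names, might look like this:

    use mysurvey, clear
    describe                        // names, labels, and storage types
    codebook age income, compact    // ranges, unique values, missing values
    tabulate marstat, missing       // are missing data coded appropriately?
    count                           // is the sample size correct?
    assert inrange(age, 18, 99) if !missing(age)  // values in expected range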
1.2.2 Running analysis

Once the data are cleaned, fitting your models and computing the graphs and tables
for your paper or book are often the simplest part of the workflow. Indeed, this part of
the book is relatively short. Although I do not discuss specific types of analysis, I talk
about ways to ensure the accuracy of your results, to facilitate later replication, and to
keep track of your do-files, data files, and log files regardless of the statistical methods
you are using.
1.2.3 Presenting results
Once the analyses are complete, you want to present them. I consider several issues
in the workflow of presentation. First, you need to move the results from your Stata
output into your paper or presentation. An efficient workflow can automate much of this
work. Second, you need to document the provenance of all findings that you present.
If your presentation does not preserve the source of your results, it can be very difficult
to track them down later (e.g., someone is trying to replicate your results or you must
respond to a reviewer). Finally, there are a number of simple things that you can do to
make your presentations more effective.
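As one illustration, the user-written esttab command (from Ben Jann's estout package, discussed in chapter 7) writes a formatted regression table directly to a file, so no numbers are retyped by hand; the models and file name below are only an example:

    sysuse auto, clear
    regress price weight
    estimates store m1
    regress price weight foreign
    estimates store m2
    * Write both models, with standard errors and R-squared, to a file:
    esttab m1 m2 using mytables.rtf, se r2 replace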
1.2.4 Protecting files

When you are cleaning your data, running analyses, and writing, you need to protect
your files to prevent loss due to hardware failure, file corruption, or unintentional dele-
tions. Nobody enjoys redoing analyses or rewriting a paper because a file was lost.
There are a number of simple things you can do to make it easier to routinely save your
work. With backup software readily available and the cost of disk storage so cheap, the
hardest part of making backups is keeping track of what you have. Archiving is distinct
from backing up and more difficult because it involves the long-term preservation of
files so that they will be accessible years into the future. You need to consider if the
file formats and storage media will be accessible in the future. You must also consider
the operating system you use (it is now difficult to read data stored using the CP/M
operating system), the storage media (can you read 5 1/4" floppy disks from the 1980s
or even a ZIP disk from a few years ago?), natural disasters, and hackers.

1.3 Tasks within each step
Within each of the four major steps, there are four primary tasks: planning your work,
organizing your materials, documenting what you do, and executing the work. While
some tasks are more important within particular steps (e.g., organization while plan-
ning), each task is important for all steps of the workflow.
1.3.1 Planning

Most of us spend too little time planning and too much time working. Before you
load data into Stata, you should draft a plan of what you want to do and assess your
priorities. What types of analyses are needed? How will you handle missing data? What
new variables need to be constructed? As your work progresses, periodically reassess
your plan by refining your goals and analytic strategy based on the work you have
completed. A little planning goes a long way, and I almost always find that planning
saves time.
1.3.2 Organization
Careful organization helps you work faster. Organization is driven by the need to find
things and to avoid duplication of effort. Good organization can prevent you from
searching for lost files or, worse yet, having to reconstruct them. If you have good
documentation about what you did, but you cannot find the files used to do the work,
little is gained. Organization requires you to think systematically about how you name
files and variables, how you organize directories on your hard drive, how you keep track
of which computer has what information (if you use more than one computer), and
where you store research materials. Problems with organization show up when you
have not been working on a project for a while or when you need something quickly.
Throughout the book, I make suggestions on how to organize materials and discuss
tools that make it easier to find and work with what you have.
1.3.3 Documentation
Without adequate documentation, replication is virtually impossible, mistakes are more
likely, and work usually takes longer. Documentation includes a research log that records
what you do and codebooks that document the datasets you create and the variables
they contain. Complete documentation also requires comments in your do-files and6 Chapter } Introduction
labels and notes within data files. Although I find writing documentation to be an
onerous task, certainly the least enjoyable part of data analysis, I have learned that
time spent on documentation can literally save weeks of work and frustration later.
Although there is no way to avoid time spent writing documentation, I can suggest
things that make documenting your work faster and more effective.
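For example, Stata lets you embed documentation directly in the dataset, where it travels with the data; the file and variable names below are hypothetical:

    label data "Working extract with cleaned demographic variables"
    note: Created by data02-clean.do on 2008-10-15.
    label variable agekid1 "Age of first child in years"
    note agekid1: Values over 40 were verified against the source file.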
1.3.4 Execution

Execution involves carrying out specific tasks within each step. Effective execution
requires the right tools for the job. A simple example is the editor used to write
your programs. Mastering a good text editor can save you hours when writing your
programs and will lead to programs that are better written. Another example is learning
the most effective commands in Stata. A few minutes spent learning how to use the
recode command can save you hours of writing replace commands. Much of this
book involves selecting the right tool for the job. Throughout my discussion of tools, I
emphasize standardizing tasks and automating them. The reason to standardize is that
it is generally faster to do something the way you did it before than it is to think up a
new way to do it. If you set up templates for common tasks, your work becomes more
uniform, which makes it easier to find and avoid errors. Efficient execution requires
assessing the trade-off between investing the time in learning a new tool, the accuracy
gained by the new tools, and the time you save by being more efficient.
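As a sketch of the recode example (the variable and codes are hypothetical), compare the two approaches, either of which collapses the same categories:

    * With replace, one command per change, which grows error prone as lists grow:
    generate rating3 = rating
    replace rating3 = 1 if rating == 2
    replace rating3 = 3 if rating == 4 | rating == 5

    * With recode, one self-documenting command that also labels the values:
    recode rating (2 = 1 "low") (4 5 = 3 "high"), generate(rating4)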
1.4 Criteria for choosing a workflow

As you work on the various tasks in each step of your workflow, you will have choices
of different ways to do things. How do you decide which procedure to use? In this
section, I consider several criteria for evaluating your current workflow and choosing
from among alternative procedures for your work.
1.4.1 Accuracy
Getting the correct answer is the sine qua non of a good workflow. Oliveira and Stewart
(2006, 30) make the point very well, “If your program is not correct, then nothing else
matters.” At each step in your work, you must verify that your results are correct.
Are you answering the question you set out to answer? Are your results what you
wanted and what you think they are? A good workflow is also about making mistakes.
Invariably, mistakes will happen, probably a lot of them. Although an effective workflow
can prevent some errors, it should also help you find and correct them quickly.
1.4.2 Efficiency
You want to get your analyses done as quickly as possible, given the need for accuracy
and replicability. There is an unavoidable tension between getting your work done and1.4.6 Usability 7
the need to work carefully. If you spend so much time verifying and documenting your
work that you never finish the project, you do not have a viable workflow. On the other
hand, if you finish by publishing incorrect results, both you and your field suffer. You
want a workflow that gets things done as quickly as possible without sacrificing the
accuracy of your results. A good workflow, in effect, increases the time you have to do
your work, without sacrificing the accuracy of what you do.
1.4.3 Simplicity
A simpler workflow is better than a more complex workflow. The more complicated
your procedures, the more likely you will make mistakes or abandon your plan. But
what is simple for one person might not be simple for another. Many of the procedures
that I recommend involve programming methods that may be new to you. If you have
never used a loop, you might find my suggestion of using a loop much more complex
than repeating the same commands for multiple variables. With experience, however,
you might decide that loops are the simplest way to work.
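For instance, the two fragments below do the same thing for three hypothetical variables:

    * Without a loop, the command is retyped for each variable:
    tabulate warm, missing
    tabulate male, missing
    tabulate educ, missing

    * With a foreach loop, the command appears only once:
    foreach varname in warm male educ {
        tabulate `varname', missing
    }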
1.4.4 Standardization
Standardization makes things easier because you do not have to repeatedly decide how
to do things and you will be familiar with how things look. When you use standardized
formats and procedures, it is easier to see when something is wrong and ensure that you
do things consistently the next time. For example, my do-files all use the same structure
for organizing the commands. Accordingly, when I look at the log file, it is easier for
me to find what I want. Whenever you do something repeatedly, consider creating a
template and establishing conventions that become part of your routine workflow.
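As a sketch of what such a template might look like (templates for do-files are developed in chapter 3; the log name here is illustrative), every do-file can begin and end the same way:

    capture log close            // close any log left open by an earlier run
    log using analysis01, replace text
    version 10                   // ensure the same behavior in future runs
    clear all
    macro drop _all
    set linesize 80

    * ...commands for this analysis go here...

    log close
    exit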
1.4.5 Automation
Procedures that are automated are better because you are less likely to make mistakes.
Entering numbers into your do-file by hand is more error prone than using programming
tools to transfer the information automatically. Typing the same list of variables multi-
ple times in a do-file makes it easy to create lists that are supposed to be the same but
are not. Again, automation can eliminate this problem. Automation is the backbone for
many of the methods recommended in this book.
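For example, defining a variable list once in a local macro (the variable names are hypothetical) guarantees that every command uses exactly the same list:

    local rhs "age educ income south"
    summarize `rhs'
    regress wage `rhs'
    regress wage `rhs' union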
1.4.6 Usability
Your workflow should reflect the way you like to work. If you set up a workflow and
then ignore it, you do not have a good workflow. Anything that increases the chances
of maintaining your workflow is helpful. Sometimes it is better to use a less efficient
approach that is also more enjoyable. For example, I like experimenting with software
and prefer taking longer to complete a task while learning a new program than getting8 Chapter 1 Introduction
things done quicker the old way. On the other hand, I have a colleague who prefers
using a familiar tool even if it takes a bit longer to complete the task. Both approaches
make for a good workflow because they complement our individual styles of work.
1.4.7 Scalability
Some ways of work are fine for small jobs but do not work well for larger jobs. Consider
the simple problem of alphabetizing 10 articles by author. The easiest approach is to
lay the papers on a table and pick them up in order. This works well for 10 articles but
is dreadfully slow with 100 or 1,000 articles. This issue is referred to as scalability—
how well do procedures work when applied to a larger problem? As you develop your
workflow, think about how well the tools and practices you develop can be applied
to a larger project. An effective workflow for a small project where you are the only
researcher might not be sustainable for a large project involving many people. Although
you can visually inspect every case for every variable in a dataset with 25 measures of
development in 80 countries, this approach does not work with the National Longitudinal
Survey that has thousands of cases and thousands of variables. You should strive for a
workflow that adapts easily to different types of projects. Few procedures scale perfectly.
As a consequence you are likely to need different workflows for projects of different
complexities.
1.5 Changing your workflow
This book has hundreds of suggestions. Decide which suggestions will help you the most,
and adapt them to the way you work. Suggestions for minor changes to your workflow
can be adopted at any time. For example, it takes only a few minutes to learn how to use
notes, and you can benefit from this command almost immediately. Other suggestions
might require major changes to how you work and should be made only when you have
the time to fully integrate them into your work. It is a bad idea to make major changes
when a deadline is looming. On the other hand, make sure you find time to improve
your workflow. Time spent improving your workflow should save time in the long run
and improve the quality of your work. An effective workflow is something that evolves
over time, reflecting your experience, changing technology, your personality, and the
nature of your current research.
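For instance, the notes command attaches permanent notes to a variable or to the dataset as a whole; the variable and text here are hypothetical:

    note income: Top-coded at 150,000 dollars in the 2002 wave.
    notes income       // list the notes attached to income
    notes              // list all notes in the dataset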
1.6 How the book is organized
This book is organized so that it can be read front to back by someone wanting to learn
about the entire workflow of data analysis. I also wanted it to be useful as a reference
for people who encounter a problem and who want a specific solution. For this purpose,
I have tried to make the index and table of contents extensive. It is also useful to
understand the overall structure of this book before you proceed with your reading.1.6 How the book is organized 9
Chapter 2 - Planning, organizing, and documenting your work discusses how to plan
your work, organize your files, and document what you have done. Avoid the
temptation of skipping this chapter so that you can get to the “important” details
in later chapters.
Chapter 3 - Writing and debugging do-files discusses how do-files should be used
for almost all your work in Stata. I provide information on how to write more
effective do-files and how to debug programs that do not work. Both beginners
and advanced users should find useful information here.
Chapter 4 - Automating Stata is an introduction to programming that discusses how to
create macros, run loops, and write short programs. This chapter is not intended
to teach you how to write sophisticated programs in Stata (although it might
be a good introduction); rather, it discusses tools that all data analysts should
find useful. I encourage every reader to study the material in this chapter before
reading chapters 5-7.
Chapter 5 - Names and labels discusses both principles and tools for creating names
and labels that are clear and consistent. Even if you have received data that are
labeled, you should consider improving the names and labels in the dataset. This
chapter is long and includes a lot of technical details that you can skip until you
need them.
Chapter 6 - Cleaning data and constructing variables discusses how to check whether
your data are correct and how to construct new variables and verify that they
were created correctly. At least 80% of the work in data analysis involves getting
the data ready, so this chapter is essential.
Chapter 7 - Analyzing, presenting, and replicating results discusses how to keep track
of the analyses used in presentations and papers, issues to consider when present-
ing your results, and ways to make replication simpler.
Chapter 8 - Saving your work discusses how to back up and archive your work. This
seemingly simple task is often frustratingly difficult and involves subtle problems
that can be easy to overlook.
Chapter 9 - Conclusions draws general conclusions about workflow.
Appendix A reviews how the Stata program operates; considers working with a net-
worked version of Stata, such as that found in many computer labs; explains how
to install user-written programs, such as the Workflow package; and shows you
how to customize the way in which Stata works.
Additional information about workflow, including examples and discussion of other
software, is available at http://www.indiana.edu/~jslsoc/workflow.htm.

2 Planning, organizing, and documenting
This chapter describes the three critical activities that occur at each step of data anal-
ysis: planning your work, organizing materials, and documenting what has been done.
These tasks, which are closely related and equally irksome to many, are an essential part
of your workflow. Planning is strategic, focusing on broader objectives and priorities.
Organization is tactical, developing the structures and procedures needed to complete
your plan. This includes deciding what goes where, what to name it, and how to find
it. Documentation involves bookkeeping, recording what you have done, why you did
it, when it was done, and where you put it. Without documentation, replication is
effectively impossible.
All data analysts plan, organize, and document (PO&D), but to greatly differing
degrees. When you begin your analysis, you have at least a basic idea of what you want
to do (the plan), you know where things will be put (the organization), and you keep at
least a few notes (the documentation). Most researchers will benefit from a more formal
approach to these activities. Although this is true for all research, the importance of
PO&D increases with the complexity of the project, the number of projects you are
working on, and the frequency of interruptions while you work.
There is a huge temptation to jump into analysis and let planning, organization,
and documentation come later. Crunching numbers is immensely more engaging than
writing a plan, putting files in order, and documenting what you have done. However,
even preliminary, exploratory analysis needs a plan, benefits from organization, and
must be documented. Investing time in these activities makes you a better data analyst,
speeds up your work, and helps you avoid mistakes. Critically, these activities make it
easier to replicate your work.
One of the few advantages of working on a mainframe computer during the 1960s,
1970s, and 1980s was that card punches with 10-minute limits for use, queues to submit
programs, delays in mounting tapes, and waits of hours or days for output encouraged
and rewarded efficiency and planning. Although you waited for results, you had time
to plan your next steps, to document what you were doing, and to organize earlier
printout. Importantly, you also had the opportunity to watch how more experienced
researchers did things. With delays built into the process, you did not want to forget
a critical step in your program, incorrectly type a command, lose analyses that were
completed, use the wrong variables, add unnecessary steps to the analyses, or forget
what you had already done. Because computing was more expensive during the day
(and you paid real dollars to compute), you used the day to plan the most efficient way
to proceed and submitted your programs to run overnight. An unanticipated cost of
cheap computing is that computation no longer imposes delays that encourage you to
plan, organize, and document. Such planning is still rewarded, but the inducements are
less obvious. With personal computers, there is less opportunity to watch and to learn
from how others work.
The most impressive example of planning that I know of involves Blau and Duncan’s
(1967) masterpiece The American Occupational Structure. In the preface, the authors
write (1967, 18-19)
It should be mentioned here that at no time have we had access to the original
survey documents or to the computer tapes on which individual records are
stored. ... Consequently it was necessary for us to provide detailed outlines
of the statistical tables we desired for analysis without inspecting the “raw”
data, and to provide these, moreover, some 9 to 12 months ahead of the
time when we might expect their delivery. ... We had to state in advance
just which tables were wanted, out of the virtually unlimited number that
conceivably might have been produced, and to be prepared to make the best
of what we got. Cost factors, of course, put strict limits on how many tables
we could request. We had to imagine in advance most of the analysis we
would want to make, before having any advance indications of what any of
the tables would look like. The general plan of the analysis had, therefore,
to be laid out a year or more before the analysis actually began, ... We
were conscious of the very real hazard that our initial plans would overlook
relationships of great interest. However, some months of work were devoted
to making rough estimates from various sources to anticipate as closely as
possible how the tables might look.
I doubt if this exemplar of quantitative social science research would have been com-
pleted more quickly or better if the authors had been given full access to the data and
complete control of a mainframe.
2.1 The cycle of data analysis
[Figure: the cycle of data analysis, a loop through plan, organize, compute, and document.]

Figure 2.1. The cycle of data analysis
In an ideal world, planning, organizing, computing, and documenting occur in the
sequence illustrated in figure 2.1. You begin by sketching a plan for analysis, setting
up a folder for data and do-files, spending a week fitting models, and taking a few
notes as you proceed. In practice, you are likely to go through this cycle many times,
often moving among tasks in any order. On a large project, you begin with the master
plan (e.g., the grant proposal, the dissertation proposal), set up an initial structure to
organize your work (e.g., notebooks, files, a directory structure on disk drives), and
examine the general characteristics of your datasets (e.g., how many cases, where data
are missing). Once you have a sense of the complexities and problems with your data
(e.g., inconsistent coding of missing data, problems converting the data into Stata),
you develop a more detailed plan for cleaning the data, selecting your sample, and
constructing variables. As analyses progress, you might reorganize your files to make
them easier to find. At this point, you are ready to fit additional models. Preliminary
results might uncover problems with variables that send you back to cleaning the data,
perhaps requiring you to construct new variables, thus starting the cycle again.
An effective workflow involves PO&D at different levels and in different ways. Broad
plans consider your research within the context of the existing literature and determine
where your research can make a contribution. More specific plans consider which vari-
ables to extract, how to select the sample, and what scales to construct. When data
have been extracted and variables created, you need a plan for which models to fit, tests
to make, and graphs and tables to summarize your results. Similarly, you need to orga-
nize materials including datasets, reprints, output, and budget sheets. You must decide
where to locate files and where to archive them. During the analyses, you organize your
do-files so that you can find what you need quickly, and within the files you organize
the commands in a logical sequence. Documentation also occurs on many levels. A
research log keeps track of what you did and when. Codebooks, along with variable
and value labels, document variables. Comments within do-files provide indispensable
documentation of your analyses. When you write a paper, book, or presentation, you
need to record where each number comes from should you need to revisit it later.
Planning, organizing, and documenting are ongoing tasks that affect everything that
you do throughout the life of the project. At each new stage of data management and
statistical analysis, you should revise and extend your plan, decide how to incorporate
new work into the existing organization, and update your documentation. Each of these
tasks pays huge dividends in the quality and efficiency of your work. As you read this
chapter, keep in mind that PO&D do not need to take a great deal of time and often
save time. For example, I find that it takes much longer to search for one lost file than
to create a directory structure that prevents losing a file. Plus, many of the tasks are
quite simple. For example, when I suggest that you “decide how to incorporate new
work into the existing organization”, this might simply involve looking at the directories
you have and deciding everything is fine, or it might require quickly adding one or two
directories to hold new analyses.
2.2 Planning
Planning at the beginning of a project saves time and prevents errors. A plan begins
with broad considerations and goals for the entire project, anticipating the work that
needs to be completed and thinking about how to complete these tasks most efficiently.
Data analysis often involves side trips to deal with unavoidable problems and to explore
unanticipated findings. A good plan keeps your work on track. Michael Faraday, one of
the greatest scientists of all time, seemed well aware of the need to stay focused until a
project is complete. His laboratory had a sign that said simply (Cragg 1967): “Work.
Finish. Publish.” A plan is a reminder to stay on track, finish the project, and get it
into print.
Although planning is important in all types of research, I find it particularly valuable
in certain types of projects. First, in collaborative work, inadequate planning can lead
to misunderstandings about who is doing what. This leads to a duplication of effort, to
working at cross-purposes with one person undoing what someone else is doing, and to
misunderstandings about access to data and authorship. Second, the larger the project
and the more complex the analysis, the more important it is to plan. In projects such as
a dissertation or book, it is impossible to remember all the details of what you have done.
However, even if your analysis is exploratory and the project is small, your work will
benefit from a plan. Third, the longer the duration of a project, the more important it
is to plan and document your work. Finally, the more projects you work on, the greater
the need to have a written plan.2.2 Planning 15
In the rest of this section, I suggest issues to consider as you plan. This list is
suggestive, not definitive. It includes topics that might be irrelevant to your work and
excludes other topics that might be important. The list suggests the range of issues that
should be considered as you plan. Ultimately, you have the best idea of what issues
need to be addressed.
General goals and publishing plans
Begin with the broad objectives of your research. What papers do you plan to
write and where will you submit them? Thinking about potential papers is a useful
way to prioritize tasks so that initial writing is not held up by data collection or data
management that has not been completed.
Scheduling
A plan should include a timeline with target dates for completing key stages of the
project (e.g., data collection, cleaning and documenting data, and initial analysis). You
might not meet the goals, but comparing what you have done with the timeline is useful
for assessing your plan. If you are falling behind, consider revising the plan. You also
want to note deadlines. If there are conferences where you want to present the results,
when are the submission deadlines? If there is external funding, are there deadlines for
progress reports or expending funds?
Size and duration
The size and duration of the project have implications for how much detail and
structure is needed. If you are writing a research note, a simple structure suffices. A
paper takes more planning and organization, whereas a book or series of articles makes
it more important to think about how the structure you develop adapts as the research
evolves.
Division of labor
Working in a group requires special considerations. Who is responsible for which
tasks? Who coordinates data management? If multiple people have access to the data,
how do you ensure that only one person is changing the data at a time? If the analysis
begins while data collection continues, how do you make sure that people are working
with the latest version of the data? Who handles backups and keeps the documentation
up to date? What agreements do team members have about collaboration and joint
authorship? Both the success of the project and interpersonal relationships depend on
these considerations.
The enforcer
In collaborations, you need to agree on policies for documentation and organization,
including many of the issues discussed in chapters 5-8. Even if everyone agrees, however,
it is easy to assume (or hope) that somebody else is taking care of PO&D while you fit
the models. By the time a problem is noticed, it can take longer to fix things than
if the issue had been anticipated and resolved earlier. In collaborative research, you
should decide who is responsible for enforcing policies on documenting, organizing, and
archiving. This does not need to be the person who is doing the work, but someone has
to make it their responsibility and a high priority.
Datasets
What data will be used? Do you need to apply for access to restricted datasets
such as the National Longitudinal Study of Adolescent Health? What variables will
be used? How many panels? Which countries? Anticipating the complexity of the
dataset can prevent initial decisions that later cause problems. If you are extracting
variables from a large dataset, reviewing the thousands of variables and deciding which
you need to extract can prevent you from repeatedly returning to the dataset to get a
few forgotten variables. If your research includes many variables, consider dividing the
variables among multiple datasets. For example, in a study of work, health, and labor-
force participation using the National Longitudinal Survey, we decided that keeping
all variables in one file would not work because only one person could construct new
variables at a time. We divided variables into groups and created separate datasets
for each type of variable (e.g., demographic characteristics, health measures, and work
history). We created analysis datasets by merging variables from these files (see page 279
for details on merging files).
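As a hedged sketch of this approach (the filenames and the id variable are hypothetical, and the merge syntax assumes a recent version of Stata), an analysis dataset might be built like this:

use demographics01, clear
merge 1:1 caseid using health01, assert(match) nogenerate
merge 1:1 caseid using workhist01, assert(match) nogenerate
save analysis01, replace

The assert(match) option stops the do-file if any case fails to match, which catches id problems early.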
Variable names and labels
Start with a master plan for naming and labeling variables, rather than choosing
names and labels in an ad hoc manner. A simple example of the problems caused by
careless names and labels occurred in a survey where the same question was asked early
in the survey and again near the end. Unfortunately, the variables were named ownsex
with the label How good own sexuality? and ownsexu with the label Own sexuality
is .... Neither the names nor the labels made it clear which variable corresponded
to the question that was asked first. It took hours to verify which was which. When
planning names, anticipate new variables that could be added later. For example, if
you expect to add future panels, you need names that distinguish between variables in
different panels (e.g., health status in panel 1, health status in panel 2). If you are using
software that restricts names to eight characters, you should plan for this. Chapter 5
has an extended discussion on variable names (section 5.6) and labels (section 5.7).2.2 Planning 17
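Returning to the panel example, a hypothetical sketch of such a naming plan adds a panel suffix from the start so that later waves extend the scheme rather than break it:

rename health health1
label variable health1 "Self-rated health, panel 1"
* when panel 2 arrives, the same pattern continues:
* rename health health2, then label variable health2 accordingly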
Data collection and survey design
When collecting your own data, many things can go wrong. Before you start col-
lecting data, I recommend that you create a codebook and write the do-files that create
variable and value labels. This gives you one more chance to find problems when you
can do something about them. Another survey gave respondents options for percentage
of time that included the ranges 0-10%, 20-30%, 40-50%, and so on. After the data
collection was complete, the person adding value labels noticed that 11-19%, 31-39%,
and so on had been excluded.
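A hypothetical sketch shows why writing the labels first helps; typing out the categories makes a gap of this kind hard to miss:

label define pcttime 1 "0-10%" 2 "20-30%" 3 "40-50%" 4 "60-70%"
* writing these out before fielding the survey would have exposed
* the missing ranges (11-19%, 31-39%, and so on)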
Missing data
What types of missing data will be encountered, and how will these types be coded?
Will a single code for missing values be sufficient, or will you need multiple codes that
indicate why the data are missing (e.g., attrition, refusal, or a skip pattern in the
survey)? Try to use the same missing-value codes for all variables. For example, letting
.n stand for “not answered” in one variable and for “not applicable” in another
is bound to cause confusion. See section 6.2.3 for details on missing data in Stata.
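A minimal sketch, assuming a hypothetical variable income coded -8 and -9 in the source data, shows one consistent scheme using Stata's extended missing values:

replace income = .n if income == -8   // .n means "not answered" everywhere
replace income = .a if income == -9   // .a means "not applicable" everywhere

Using the same codes with the same meanings for every variable keeps later tabulations and recodes consistent.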
Analysis
What types of statistical analyses are anticipated? What software is needed, and
is it locally available? Thinking about software helps you plan data formats, naming
conventions, and data structures. For example, if you plan to use software that limits
names to eight characters, you might want a simpler naming structure than if you plan
to work exclusively in Stata, which allows longer names.
Documentation
What documentation is needed? Who will keep it? In what format? A plan for how
to document the project makes it more likely that things will be documented.
Backing up and archiving
Who is going to make regular backups of the files? Long-term preservation should
also be considered. If the research is funded, what requirements does the funding agency
have for archiving the data? What sort of documentation do they expect and what data
formats? If the research is not funded, would it not be a good idea to make the data
available when you finish the research? Creating the documentation as you go makes
this much simpler. See chapter 8 for further information on backing up and archiving
files.
2.3 Organization
Organization involves deciding what goes where, what to name it, and how you will find
it. A good plan makes it easier to create a rational structure to organize your work.
Plans for the broader objectives help you define how complex your organization needs to
be. Plans for more specific issues, such as how to name files, help you complete the work
accurately and quickly. Thoughtful organization also makes it simpler to document your
work because a clear logic to the organization makes it easier to explain what files are
and where they are located.
2.3.1 Principles for organization
There are several principles that should guide how you organize your work. These prin-
ciples apply to all aspects of your research, including budget sheets, reprints, computer
files, and more. Because this book is about data management, I focus on issues related
to data analysis.
Start early
The more organized you are when a project begins, the more organized you will be
at the end. Organization is contagious. If things are disorganized, there is a temptation
to leave them that way because it takes so much time to put them into order. If things
start out organized, keeping them organized takes very little time.
Simple, but not too simple
More elaborate schemes for organization are not necessarily better. The goal is to be
organized but to do this as simply as possible. A complex directory or folder structure
is essential for large projects but makes things harder for simple projects. For example,
if you have only one dataset and a few dozen do-files, a single directory should be fine.
If you have hundreds of do-files and dozens of datasets, it can be difficult to find things
in a single directory. Because I find that most projects end up more complicated than
anticipated, I prefer more elaborate organization at the start. You can also start with a
simple structure, and let it grow more complex as needed. Examples of how to organize
directories are given in section 2.3.2.
Consistency
Consistency and uniformity pay dividends in organization as well as in documenta-
tion. If you use the same scheme for organizing all your projects, you will spend less time
thinking about organization because you can take advantage of what you already know.
For example, if all projects keep codebooks in a directory named \Documentation, you
always know where to find this information. If you organize different projects differently,
you are bound to confuse yourself and spend time looking for things.2.3.2 Organizing files and directories 19
Can you find it?
Always keep in mind how you will find things. This seems obvious but is easily
overlooked. For example, how will you find a file that is not in the directory where
it should be? Software that searches for files helps, but these programs work better if
you plan your file naming and content so that search programs work more effectively.
For example, suppose you have a paper about cohort effects on work and health that
you refer to as the CWH paper. To take advantage of searching by name, filenames
must include the right information (e.g., the abbreviation cwh). With search programs,
you can look for a file with a specific name (e.g., cwh-scale1.do) or for a file with
a name that matches some pattern (e.g., cwh*.do looks for all files that begin with
cwh and end with .do). To search by content, you must include keywords within your
files. For example, suppose that all do-files related to the project include the letters
“CWH” within them. If you lose a file, you can let a search program run overnight to
find all files that have the extension .do and contain the phrase “CWH”. If you forget
to include “CWH” inside a file, you will not find the file. Or, if you place different files
with the same name in different directories (e.g., two projects each use a file called
extract-data.do), searching by filename will turn up multiple files.
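One way to guarantee that the keyword is present is to start every do-file in the project from a template; a hypothetical header might look like this:

// CWH: cohort, work, and health project
// cwh-scale1.do: construct the health scale
capture log close
log using cwh-scale1, replace text

A content search for “CWH” will then find every do-file in the project.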
Document your organization
You are more likely to stay organized if you document your procedures. Written doc-
umentation helps you find things, prevents you from changing conventions midproject
if you forget the original plan, and reminds you to stick to your plan for organization.
In collaborations, written procedures are essential.
2.3.2 Organizing files and directories
It is easier to create a file than to find a file. It is easier to find a file than to
know what is in the file. With disk space so cheap, it is tempting to create
a lot of files.
Do any of the following sound familiar?
• You have multiple versions of a file and do not know which is which.
• You cannot find a file and think you might have deleted it.
• You and a colleague are not sure which draft of your paper is the latest or find
that there are two different “latest” drafts.
• You want the final version of the questionnaire and are not sure which file it is
because two versions of the questionnaire include “final” in the name.
I find that these and similar problems are very common. One approach is to document
the name, content, and location of each file in your research log. In practice, this takes
too long. Instead, care in naming files and organizing their location is the key to keeping
track of your files.20 Chapter 2 Planning, organizing, and documenting
The easiest approach to organizing project files is to start with a carefully designed
directory structure. When files are created, place them in the appropriate directory.
For example, if you decide that all PDFs for readings associated with a project belong
in the \Readings directory, you are less likely to have PDFs scattered across your hard
drive, including duplicate copies downloaded after you misplaced the first copy. Another
advantage of a carefully planned directory structure is that a file’s location becomes an
integral part of your documentation. If a file is located in the directory \CWH in the
subdirectory \Proposal, you know the file is related to the research proposal for the
CWH project. Section 2.3.3 discusses creating a directory structure. Approaches to
naming files are discussed in chapter 5. Before proceeding, keep in mind that if you
create an elaborate directory structure but do not use it consistently, you will only make
things worse.
What characters to use in names?
Not all names work equally well in all operating systems. Names are most likely to
work across operating systems if you limit the characters used to a-z, A-Z, 0-9, the
underscore _, and the dash -. Macintosh names can include any character except a
colon :. Windows names have more exceptions and should not use \, /, :, *, ?, ", <, >,
or |. In Linux, names can include numbers, letters, and the symbols ., _, and -.
Although blank spaces can be used in file and directory names, some people feel strongly
that spaces should never be used. For example, instead of having a directory called
\My Documents, they prefer \My-documents, \My_documents, or simply \Documents.
Blanks can make it more difficult to refer to a file. For example, suppose that I save
auto.dta in d:\Workflow\My data\auto.dta. To use this dataset, I must include
double quotes: use "d:\Workflow\My data\auto.dta". If I forget the quotes, an
error message is generated:

. use d:\Workflow\My data\auto.dta
invalid 'data'
r(198);
Similarly, if you name a do-file my pgm.do and need to search for the file, you need
to search for "my pgm.do", not simply my pgm.do. As a general rule, I avoid filenames
that include spaces, but I use spaces in directory names when the spaces make it easier
for me to understand what is in the directory or because I think it looks better. Thus, in
the names of the directories that I suggest below, some directory names include spaces,
although the most frequently used directories do not. If you want to avoid spaces, you
can replace them with either a dash (-) or an underscore (_), or simply remove the space
from the name.
Pick a mnemonic for each project
The first step in naming files and directories is to pick a short mnemonic for your
project. For example, cwh for a paper on cohort, work, and health; sdsc for the project
on sex differences in the scientific career; epsl for my collaboration with Eliza Pavalko.
This lets me easily add the project identifier to file and directory names. When choosing
a mnemonic, pick a string that is short because you do not want your names to get too
long. Avoid mnemonics that are commonly found in other contexts or as part of words.
For example, do not choose the mnemonic the because “the” occurs in many other
contexts, and do not use ead because these letters are part of many common words.
2.3.3 Creating your directory structure
Directories allow you to organize many files, just as file cabinets and file folders allow
you to organize many papers. Indeed, some operating systems use the term folder in-
stead of directory. When referring to a directory or folder, I start the name with \,
such as \Examples. Directories themselves can contain directories, which are called
subdirectories because they are “below” the parent directory. All the work related to
a project should be contained within a single directory that I refer to as the project
directory or the level-0 directory. For example, \Workflow is the project directory
for this book. The project directory can be a subdirectory of some other directory
or can be on a network, on your local hard drive, on an external drive, or on a
flash drive. Under the project directory you can create subdirectories to organize
your files. The term /evel indicates how far a directory is below the project direc-
tory. A level-1 directory is the first level under the project directory. For example,
\Workflow\Examples indicates the level-1 directory \Examples contained within the
level-0 directory \Workflow. A level-1 directory can have level-2 directories within
it, and so on. For example, \Workflow\Examples\SourceData adds the level-2 di-
rectory \SourceData. When referring to a directory, I might indicate all levels (e.g.,
\Workflow\Examples\SourceData) or simply refer to the subdirectory of interest (e.g.,
\SourceData). With this terminology in hand, I consider several directory structures
for use with increasingly complex projects.
A directory structure for a small project
Consider a small project that uses a single data source, only a few variables, and a
limited number of statistical analyses. The project might be a research note about
labor-force participation. I start by creating a project directory \LFP that will hold
everything related to the project. Under the project directory, there are five level-1
subdirectories:
Directory Content
\LFP Project name
\Administration Correspondence, budgets, etc.
\Documentation Research log, codebooks, and other documentation
\Posted Completed text, datasets, do-files, and log files
\Readings PDF files with articles related to the project
\Work Text and analyses that are being worked on22 Chapter 2 Planning, organizing, and documenting
To make it easier to find things, all files are placed in one of the subdirectories,
rather than in the project directory itself.
The \Work and \Posted directories
The folders \Work and \Posted are critical for the workflow that I recommend. The
directory \Work holds work in progress. For example, the draft of a paper I am actively
working on would be located here, as would the do-files that I am debugging. At some
point I decide that a draft is ready to circulate to colleagues. Before sharing the paper,
I move the text file to the \Posted directory. Or, when I think that a group of do-files
is running correctly and I want to share the results with others, I move the files to
\Posted. There are two essential rules for posting files:
The share rule: Results are only shared after the associated files are posted.
The no-change rule: Once a file is posted, it is never changed.
These simple rules prevent many problems and help assure that publicly available results
can be replicated. By following these rules, you cannot have multiple copies of the
“same” paper or results that differ because they were changed after they were shared.
If you decide something is wrong in your analyses or you want to revise a paper that
was circulated, you create new files with new names, but do not change the posted files.
The distinction between the \Work and \Posted directories also helps me keep track of
work that is not finished (e.g., I am still revising a draft of a paper, I am debugging
programs to construct scales) and work that is finished. When I return to a project
after an interruption, I check the \Work directory to see if there is work that I need to
finish. For a detailed discussion of the idea of posting and why it is critical for your
workflow, see page 125.
Expanding the directory structure
As my work develops, I might accumulate dozens or hundreds of do-files. When
this happens, I could divide \LFP\Posted to include level-2 subdirectories for different
aspects of data management and statistical analysis. For example,
Directory Content
\LFP Project name
\Posted Datasets, do-files, logs, and text files
\Analysis Do-files and logs for statistical analyses
\DataClean Do-files and logs for data management
\Datasets Datasets
\Text Drafts of paper
The idea is to add subdirectories when you have trouble keeping track of what is in
a directory. The principle is the same as used when putting reprints in a file cabinet.2.3.3 Creating your directory structure 23
Initially, I might have sections A-F, G-K, L-P, and Q-Z. If I have a lot of papers
in the L-P folder, I might divide that folder into L-M and N-P. Or, if I have lots of
papers by R. A. Fisher, I might create a separate folder just for his papers.
A directory structure for a large, one-person project
Larger projects require a more elaborate structure. Suppose that you are the only
person working on a paper, book, or grant. Collaborative projects are discussed below.
Your project directory might begin with a structure like this:
Directory Content
\Administration Files for administrative issues
\Budget Budget spreadsheets and billing information
\Correspondence Letters and emails
\Proposal Grant proposal and related materials
\Posted Datasets, do-files, logs, and text files
\DataClean Clean data and construct variables
\Datasets Datasets
\Derived Datasets constructed from the source data
\Source Original, unchanged data sources
\DescStats Descriptive statistics
\Figures Programs to create graphs
\PanelModels Panel models of discrimination
\Text Drafts of paper
\Documentation Project documentation (e.g., research log, codebooks)
\Readings Reprints and bibliography
\Work Text and analyses that are being worked on
Later in this section, I suggest other directories that you might want to add, but
first I discuss changes needed for collaborative projects.
Directories for collaborative projects
A clear directory structure is particularly important for collaborative projects where
things can get disorganized quickly. In addition to the directories from the prior section,
I suggest a few more.
The mailbox directory
You need a way to exchange files among researchers. Sending files as attachments
can fill up your email storage quota and is not efficient. I suggest a mailbox directory.
Suppose that Eliza, Fong, and Scott are working on the project. The mailbox looks like
this:24 Chapter 2. Planning, organizing, and documenting
Directory Content
\Mailbox Files being exchanged
\Eliza to Fong Eliza's files for Fong
\Eliza to Scott Eliza's files for Scott
\Fong to Eliza Fong’s files for Eliza
\Fong to Scott Fong’s files for Scott
\Scott to Eliza Scott's files for Eliza
\Scott to Fong Scott's files for Fong
We exchange files by placing them within the appropriate directory.
Private directories
I also suggest private directories where you can put work that you are not ready
to share with others. One approach is to create a level-1 directory \Private with
subdirectories for each person:
Directory Content
\Private
\Eliza Eliza’s private files
\Fong Fong’s private files
\Scott Scott's private files
With only a few team members, you might not need the \Private directory and
could create the private directories in the first level of the project directory, such as
\epsl\Eliza and \epsl\Scott. Each person can decide how they want to organize
files within their private directory.
The data manager and transfer directories
Even if everyone agrees in principle on where the files should be put, you need a
data manager to enforce the agreement. Otherwise, entropy creeps in and you will lose
files, have multiple copies of some files, and have different files with the same name. The
data manager makes sure that files are put in the right place. The principle is the same
as used by libraries where librarians rather than users shelve the books. Each member
of the team needs a way to transfer files to the data manager. To make this work, I
suggest a data transfer directory called \- To file along with subdirectories for each
member of the team. The directory name begins with - so that it appears at the top
of a sorted list of files and directories. For our project, we set up this structure:2.3.3 Creating your directory structure 25
Directory Content
\- To file Files for the data manager to relocate
\- To clean Files that need to be evaluated before filing
\From Eliza Files Eliza wants to have relocated
\From Fong Files Fong wants to have relocated
\From Scott Files Scott wants to have relocated
The data manager verifies each file before moving it to the appropriate location. The
\- To clean directory is for the files that invariably appear even though nobody is sure
who created them or what they are.
Restricting access
For collaborations, you are probably using a local area network (LAN) where everyone
can potentially access the files. If people store project files on their local hard drives,
you risk having data scattered across multiple machines and it is difficult to find and to
back up what you need. Although a LAN solves this problem, you might have files that
you do not want everyone to use. For example, you might want to restrict access to
the budget materials in \Administration\Budget. Or you might want some people to
have only read access to datasets to avoid the possibility of accidental changes. You can
work with your network administrator to set up file permissions that determine who
gets what type of access to which files and directories.
Is the LAN backed up?
If you are using a LAN, you should not assume that it is backed up until you talk with
your LAN manager. Find out how often the LAN is backed up, how long the backups
are kept, where the backups are located, and how easy it is to retrieve a lost file from
the backup. These issues are discussed in chapter 8.
Special-purpose directories
I also use several special-purpose directories for things such as holding work that needs
to be done or holding backup copies of files. Although I begin the names of these
directories with a dash (e.g., \- To do), you can remove the dash if you prefer (e.g.,
\To do).
The \- To do directory
Work that has not been started goes here as a subdirectory under \Work. These files
are essentially a to-do list. If I think of something that needs to be done, a reprint I
need to read, a do-file that needs to be revised, etc., it belongs here until I get a chance
to do it. I begin the name with a dash so that it appears at the top of a sorted list of
directories.26 Chapter 2 Planning, organizing, and documenting
The \- To clean directory
Inevitably, I accumulate files that I am not sure about or that need to be moved
to the appropriate directory. By having a special folder for these files, I am less likely
to carelessly put them in the wrong directory. At some point, I review these files and
move them to their proper location. This directory can be located immediately under
the project directory or as a subdirectory elsewhere.
The \- Hold then delete directory
This directory holds files that I want to eventually delete and short-term copies of
files as a fail-safe in case I accidentally delete or incorrectly change the original. For
example, if I decide to abandon a set of do-files and logs for analyses that did not work,
I move them here. This makes it easy to “undelete” the files if I change my mind. Or
suppose that I am writing a series of do-files to create scales, select cases, merge datasets,
and so on. These programs work, but before finalizing them I want to add labels and
comments and perhaps streamline the commands. Making these improvements should
not change the results, but there is a chance that I will make a mistake and break a
program that was working correctly. When this happens, it is sometimes easiest to
return to the version of the program that worked and start again rather than debugging
the program that does not work. With this in mind, before I start revising the programs
I copy them from \Work to \- Hold then delete. I might have subdirectories with
the date on which the backup was made. For example,
Directory Content
\- Hold then delete Temporary copies of files
\2006-01-12 Files backed up on January 12, 2006
\2006-02-09 Files backed up on February 9, 2006
Or I might use subdirectories that indicate what the backups are for. For example,
Directory Content
\- Hold then delete Temporary copies of files
\VarConstruct Files used in variable construction
\REmodels Files used to fit random-effects models
When I have completed a major step in the project (e.g., submitted a paper for review),
I might copy all the critical files to \- Hold then delete. For example,
Directory Content
\- Hold then delete Temporary copies of files
\2007-06-13 submitted Do-files, logs, data, and text when paper
was submitted
\2008-04-17 revised Do-files, logs, data, and text when revisions
were submitted
\2008-01-02 accepted Do-files, logs, data, and text when paper
was accepted
The critical files should already be in the \Posted directory, but before posting files, I
often delete things that I do not expect to need. By keeping temporary copies of these
files, I can easily recover a file if I made a mistake by deleting it. In many ways, this
directory is like the Windows Recycle Bin or Mac OS Trash Can. I put files here that
I do not expect to need again, but I want to easily recover them if I change my mind.
When organizing files, it is important to keep track of the files you need and also the
files which you do not need. If you do not keep track of files that can be deleted, you
are likely to end up with lots of files that you do not know what to do with (sound
familiar?). When I need disk space or the project is finished, I delete the files in the
\- Hold then delete directory.
The \Scratch directory
When learning a new command or method of analysis, I often experiment to make
sure that I understand how things work. For example, if I am importing data, I might
verify that missing data codes are transferred the way I expect. If I am trying a new
regression command, I might experiment with the command using data from a published
source where I know what the estimates should be. These analyses are important, but
I do not need to save the results. For this type of work, I use a \Scratch directory.
When I need disk space or the project is finished, these files can be deleted. Generally,
\Scratch is located within the \Work directory. But, wherever it appears, I know that
the files are not critical.
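For example, a few quick checks in \Scratch after importing data might look like this; the filename and variable are hypothetical:

use imported01, clear
summarize                  // scan ranges for stray codes such as -8 or -9
tabulate q1, missing       // confirm that missing codes transferred as expected
codebook q1                // inspect values and labels for one variable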
Remembering what directories contain
You need a way to keep track of what a directory is for and which files go where. You
could give each directory a long name that describes its contents, such as \Text for
workflow book. However, if each directory name is long, you can end up with path
names that are so long that some programs will not process the file. Long names are
also tedious to type. To keep track of what a directory is for, I suggest a combination
of the following approaches.
First, decide on a directory structure with short names and use the same structure
for everything you do. Eventually, it will become second nature. For example, if ev-
ery project directory contains a subdirectory \Work, you know where things you are28 Chapter 2 Planning, organizing, and documenting
currently working on are located when you return to the project. You can choose a
different name than \Work but use the same name for all your projects.
Second, use a text file within the directory to explain what goes in the directory. For
example, the \Workflow\Posted\Text\Submitted directory for the workflow project
could have a file Submitted.txt that contains
Project: Workflow of Data Analysis
Directory: \Workflow\Posted\Text\Submitted
Content: Files submitted to StataCorp for production.
Author: Scott Long
Created: 2008-06-09
Note: These files were submitted to StataCorp for copy
editing and latexing. Prior drafts are located
in \Workflow\Posted\Text\Drafts.
The file can be as large as you like. Because you must open the file to read the
information, this approach is not effective as a quick reminder.
Third, you can create naming directories whose sole purpose is to remind you of
what is in the directory above it. For example,
Directory Content
\Private Private files
\- Private files for team members Description of the \Private
directory
I use this approach to keep track of directories containing backup files. The naming
directory tells me which external drive holds the backup copies. For example,
Directory Content
\- Hold then delete Backup files
\2006-01-12 Date files were placed in this directory
\- Copied to EX02 Reminder that files are on external drive EX02
\2007-06-03 Date files were placed in this directory.
\- Copied to EX03 Reminder that files are on external drive EX03
Finally, I use a directory named \- History that contains naming directories with
critical information about the files in the project. For example,
Directory
\- History
\2006-01-12 project directory created
\2006-06-17 all files backed up to EX02
\2007-03-10 initial draft completed
\2007-03-10 all files backed up to EX04
I find these reminders to be very useful when returning to a project that has been put
on hold. It also documents where backup copies of files have been put (e.g., EX02 is the
volume name of an external drive).
Planning your directory structure
You might prefer to use different directory names than I have suggested. Having names
that make sense to you is an advantage, but there is also an advantage to using names
that have been documented. This, I believe, is a good reason to stick with the names I
suggest or versions of these names that replace spaces with dashes or underscores. If you
add people to your project, they can read this chapter to find out what the directories
are for. Still, even if you use my names, you will need to customize some things. A
spreadsheet is a convenient way to plan your directory structure. For example (file:
wf2-directory-design.xls),1 see figure 2.2.
[Figure 2.2 is a spreadsheet listing each directory of the hypothetical \AgeDisc project by level (e.g., \- To file; \Administration with \Budget, \Correspondence, and \Proposal; \Documentation\Codebooks; \- Hold then delete; \Posted with \Datasets, \DataClean, \DescStats, \Figures, \PanelModels, and \Text; \Readings; and \Work with \- To do and \Text), along with the purpose of each directory.]

Figure 2.2. Spreadsheet: plan of a directory structure
This spreadsheet would be kept in the \Documentation directory.
1. This is the first time I have referred to a file that is available from the Workflow web site. Throughout
the book, files that have names that begin with wf can be downloaded. See the Preface for further
details.
Naming files
After you set up a directory structure, you should think about how to name the files
in these directories. Just as you need a logical structure for your directories, you need
a logical structure for how you will name files. For example, if you put reprints in the
\Readings directory, but the files are not consistently named, it will be hard to find
them. My PDF files with reprints are a good example of what not to do. Although I
routinely filed paper reprints by author in a file cabinet, I often downloaded files and
kept whatever names they had. As a result, here is a sample of files from my \Readings
directory:
03-19Greene.pdf
OOWENS94.pdf
12087810.pdf
12087811.pdf
Chapter03.pdf
CICoxBM95.pdf
cordermanton.pdf
faigq-example.pdf
gllamm2004-12-10.pdf
long2.pdf
Muthen1999biometrics.pdf
It is not worth the effort to rename these files, but I name new PDFs with the first
author's last name followed by year, journal abbreviation, and keyword (e.g., Smith
2005 JASA missingdata.pdf). Issues of naming, which are even more important when
it comes to do-files and datasets, are discussed in chapter 5.
Batch files
I prefer to create the directory structure using a batch file in Windows or a script file
in Mac OS or Linux rather than right-clicking, choosing Create a new folder, and
typing the name. A batch file is a text file that contains instructions to your operating
system about doing things such as creating directories. The first advantage of a batch
file is that if you change your mind, you can easily edit the batch file to re-create the
directories. Second, you can use the batch file from one project as a template for creating
the directory structure for another project. For example, I use this file to create the
directories for a project with Eliza (file: wf2-dircollab.bat):
md "- Hold then delete"
md "- To file\Eliza to data manager"
nd "- To file\Scott to data manager"
md "- To file\- To clean"
nd "Administration\Budget”
nd “Administration\Correspondence"
nd “Adninistration\Proposal"
md "Posted\Datasets"
nd "Documentation\Codebooks"
md "Mailbox\Eliza to Scott"
nd "Mailbox\Scott to Eliza"
md "Private\Eliza"
nd "Private\Scott"
nd "Readings"2.3.4 Moving into a new directory structure (advanced topic) 31
To set up directories for a different project, I only need to make a few changes to the
batch file. Details on batch files are beyond the scope of this book; ask your local
computer support person for help.
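If you prefer to stay inside Stata, its mkdir command offers a rough equivalent; this hedged sketch creates a few of the directories described above (each parent must exist before its subdirectories, because mkdir creates one level at a time):

mkdir "- Hold then delete"
mkdir "Administration"
mkdir "Administration/Budget"
mkdir "Posted"
mkdir "Posted/Datasets"
mkdir "Work"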
2.3.4 Moving into a new directory structure (advanced topic)
Ideally, you create a directory structure at the start of a project and routinely place new
files in the proper directory. However, even with the best intentions, you are likely to end
up with orphan files created over several years and scattered across directories on several
computers. At some point, these files need to be combined into one project directory.
Or, perhaps this chapter has convinced you to reorganize your files. In this section,
I discuss how to merge files from multiple locations into a unified directory structure.
Reorganizing files is difficult, especially if you have lots of files. If you start the job but
do not finish it, you are likely to make things worse. And if you begin to reorganize files
without a careful plan, you can even lose valuable data.
Aside on software
When doing a lot of work with files, utility programs can save time and
prevent errors. First, third-party file managers are often more efficient
for moving and copying files than those built into the operating system.
Second, when you copy a file, most programs do not verify that the copy is
exactly like the original. For example, in Windows when Explorer copies
a file, it only verifies that the copied file can be opened but it does not
(contrary to what you sometimes read) verify that the new file is exactly
like the source file. I highly recommend using a program that verifies the
copy is exactly the same as the original by comparing every bit in the
original file to every bit in the destination file. This is referred to as bit
verification. Programs for backing up files and many file managers do this.
Third, when combining files from many locations, you are likely to have
duplicate files. It is slow and tedious to verify that files with the same
names are in fact identical and that files with different names are not the
same. I recommend using a utility to find duplicate files. Software for file
management is discussed on the Workflow web site.
Example of moving into a new directory structure
To make the discussion of moving into a new directory structure concrete, I explain how
I would do this for a collaborative project known as epsl (named with the initials of
the two researchers).32 Chapter 2 Planning, organizing, and documenting
Step 1. Inform collaborators
Before I start to reorganize files, I let everyone using the files know what I am doing.
Others can still use files from their current locations, but they should not add, change,
or delete files within the current directory structure. Instead, I create new directories
(e.g., \epsl-new-files\eliza and \epsl-new-files\scott) where new or changed
files can be saved until the new directory structure is completed.
Step 2. Take an inventory
Next I take an inventory of all files related to the project. The inventory is critical
because I do not want to complete the reorganization and then discover that I forgot
some files. I found files on the LAN directory \epsl; on Eliza's home, office, and laptop
computers; and on my home and two work computers. I create a text file that lists
each file and where it was found. This list is used to document where files were before
they were reorganized and to help plan the new organization. I do not want to try to
relocate 10,000 files without having a good idea of where I want to put things. Most
operating systems have a way to list files; see the Workflow web site for further details.
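If some of the files live where Stata can see them, the dir extended macro function offers one hedged way to build part of the list; the path is hypothetical:

local flist : dir "d:/epsl" files "*.do"
foreach f of local flist {
    display "`f'"
}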
Step 3. Copy files from all source locations
On an external drive, I create a holding directory with subdirectories for each source
location. For example,
Directory Content
\epsl-to-be-merged Holding directory with copies of files to be merged
\Eliza-home Files from Eliza’s home computer
\Eliza-laptop Files from Eliza’s laptop
\Eliza-office Files from Eliza’s office computer
\LAN Files from LAN
\Scott-home Files from Scott's home computer
\Scott-officeWin Files from Scott’s Windows computer
\Scott-officeMac Files from Scott's Mac computer
Using bit verification, I copy files from each source location to the appropriate directory
in \epsl-to-be-merged. Do not delete the files from their original location until the
entire reorganization is complete.
Step 4. Make a second copy of the combined files
After all the files have been copied to the external drive, I make a second backup
copy of these files. If you do not have many files, you could copy the files to CDs or DVDs,
although I prefer using a second external drive because hard drives are much faster and
hold more. The copies are bit verified against the originals. The first portable drive will
be used to move files into their new location, while the second backup copy is put in a
safe place as part of the backups for the project.2.3.4 Moving into a new directory structure (advanced topic) 33
Step 5. Create a directory structure for the merged files
Next I create the destination directory structure that will hold the merged source
files. For example,
Directory Content
\epsl-cleaned-and-merged Destination directory with cleaned files
\- Hold then delete Files that can be deleted
\- To file Files to move to their proper folder
\- To clean Files to clean before relocating
\From Eliza
\From Scott
\Administration Administrative materials
\Budget
\Correspondence
\Documentation Project documentation
\Codebooks
\Mailbox Location for exchanging files
\Eliza to Scott
\Scott to Eliza
\Posted Posted datasets, do-files, etc.
\Datasets Completed datasets
\Derived
\Source
\Text Completed drafts of papers
\Private Private files
\Eliza
\Scott
\Readings PDFs related to project
I make the directory structure as complete as possible. For example, if there are a lot
of analysis files, I would create subdirectories for each type of analysis. Creating the
new directory structure takes careful planning but is critical for getting the job done
efficiently.
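If you prefer to script this step, Stata's mkdir command can build the skeleton; a minimal sketch for a few of the directories above (the quotation marks are needed because the names contain blanks, and parent directories must be created before their subdirectories):

* Create part of the destination structure from within Stata.
mkdir "epsl-cleaned-and-merged"
mkdir "epsl-cleaned-and-merged/- Hold then delete"
mkdir "epsl-cleaned-and-merged/- To file"
mkdir "epsl-cleaned-and-merged/- To file/- To clean"
mkdir "epsl-cleaned-and-merged/Administration"
mkdir "epsl-cleaned-and-merged/Administration/Budget"

Scripting the structure has the side benefit of leaving a record in a do-file of exactly what was created.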
Step 6. Delete duplicate files
There are likely to be multiple copies of some files. For example, Eliza and I might
both have copies of the grant proposal or key datasets. Or my laptop and office machine
might have copies of many of the same files. We could also have files with different names
but identical content. Or worse, we could have files with the same name but different
content. I need to delete these duplicate files, but the problem is finding them efficiently.
For this, I use a utility that searches for duplicate files.
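A dedicated duplicate-file utility is the practical choice for thousands of files, but for a handful of suspect files you can stay within Stata; a minimal sketch, with hypothetical file names:

* Compare two files by length and checksum; matching values strongly
* suggest (though do not absolutely guarantee) identical content.
checksum "Eliza-home/proposal-draft.doc"
local len1 = r(filelen)
local chk1 = r(checksum)
checksum "Scott-home/proposal-draft.doc"
display "possible duplicate? " (r(filelen)==`len1' & r(checksum)==`chk1')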
Step 7. Move files to the new directory structure
Next I move the files from the directory \epsl-to-be-merged to their new location
in \epsl-cleaned-and-merged. Because I am moving the files, I cannot accidentally
copy the same file to two locations and end up with more files than I started with.
Moving the files to their new location can take a lot of time, and I might encounter files
that I am unsure about. I put these files in the \- To file\- To clean directory to
relocate later.
Step 8. Back up and publish the new files and structure
When I am done moving files to their new location, I back up the newly merged
files in \epsl-cleaned-and-merged. If I have room for these files on the portable
drive that I used for the backup copy of \epsl-to-be-merged, I would put them there.
Next, I move \epsl-cleaned-and-merged to its new location on the LAN and start
implementing new procedures for saving files.
Step 9. Clean up and let people know
I now either delete the original files or move them into a directory called \- Hold
and delete epsl. It is essential that people stop using their old files or we will end
up repeating the entire process, but next time we will need to deal with the files that
were just cleaned. I inform collaborators that the new directory structure is available
and ask them to move any new files they created to the \- To file directory.
Step 10. Documentation
I return to the list of files I created in step 2 and add details on where the files were
moved. I also list problems that I encountered and assumptions that I made (e.g., I
assumed that mydataxyz.dta was the most recent version of data even though it had
an older date). I also add information to my research log that briefly discusses how the
files were reorganized and where the archived copies of the original files are stored.
2.4 Documentation
Long's law of documentation: It is always faster to document it today than
tomorrow.
Documentation boils down to keeping track of what you have done and thought. It
reminds you of decisions made, work completed, and plans for future work. Without
documentation, replication is essentially impossible. Unfortunately, writing good docu-
mentation is hard and few enjoy the task. It is more compelling to discover new things
by analyzing your data than it is to document how you scaled your variables, where
you stored a file, or how you handled missing data. But the time spent documenting
your work saves time in the long run. When writing a paper or responding to reviews,
I often use analyses that were completed months or even years before. This is much
easier when decisions and analyses are clearly documented. For example, a collaborator
and I were asked by a reviewer to refit our models using the number of children under
18 years old in the family rather than the number of children under 6 years old, which
we had used. Using our documentation and archived copies of the do-files, the new
analyses took only an hour. Without careful documentation and archiving, it would
have taken us much longer, perhaps days.
If you do not document your work, many of the advantages of planning and organi-
zation are lost. A wonderful directory structure is not much help if you forget what goes
where. The most efficient plan for archiving is of no value if you forget what the plan is
or you fail to document the location of the archived files. To ensure that you keep up
with documentation, you need to include it as a regular part of your workflow. You can
add the task to your calendar just like a meeting, although this does not work for me.
Instead, I keep up with documentation by linking it to the completion of key steps in
the project. For example, when a paper is sent for review, I check the documentation
for the analyses used in the paper, add things that are missing, organize files, and verify
that files are archived. When I finish data cleaning and am ready to start the analysis,
I make sure that my documentation of the dataset and variables is up to date.
Ironically, the insights you gain through days, weeks, or years on a project make it
harder to write documentation. When you are immersed in data analysis, it is difficult
to realize that details that are second nature to you now are obscure to others and may
be forgotten by you in the future. Was cohort 1 the youngest cohort or the oldest?
Which is the latest version of a variable? What assumptions were made about missing
data? Is ownsex or ownsexu the name of the variable for the question asked later in the
survey? Does JM refer to Jack Martin or Janice McCabe? As you work on a project,
you accumulate tacit knowledge that needs to be made explicit. Rather than thinking
of documentation as notes for your own use, think of it as a public record that someone
else could follow. Terry White, a researcher at Indiana University, refers to the “hit-
by-a-bus” test. If you were hit by a bus, would a colleague be able to reconstruct what
you were doing and keep the project moving forward?
Although documentation is central to training in some fields, it is largely ignored in
others. In chemistry, a great deal of attention is given to recording what was done in
the laboratory and publishers even sell special notebooks for this purpose. The Ameri-
can Chemical Society has published Writing the Laboratory Notebook (Kanare 1985),
which is devoted entirely to this topic. A search of the web provides wonderful examples
of how chemists document their work. For example, Oregon State University’s Special
Collection Library maintains a web site with scans of 7,680 pages from 46 volumes of
research notes written by Nobel Laureate Linus Pauling
(http://osulibrary.oregonstate.edu/specialcollections/rmb/index.html). A Google search
turns up job descriptions that include statements like the following
(http://ilearn.syr.edu/pgm-urp-project.htm): “Involvement in on-going chemical re-
search toward published results. Act as junior scientist, not skilled technician. Maintain
research log, attend weekly (evening) group meetings, present own results informally.”36 Chapter 2 Planning, organizing, and documenting
In my experience, documentation is rarely discussed in courses in applied statistics
(if you know of exceptions, please let me know). This is not to say that skilled data
analysts do not keep research records but rather that the training is haphazard and too
many data analysts learn the hard way about the importance of documentation.
2.4.1 What should you document?
What needs to be documented varies by the nature of the research. The ultimate
criterion for whether something should be documented is whether it is necessary for
replicating your findings. Unfortunately, it is not always obvious what will be necessary.
For example, you might not think of recording which version of Stata was used to fit
your model, but this can be critical information (see section 7.6.2). Hopefully, the
following list gives you an idea of the range of materials to consider for inclusion in your
documentation.
Data sources
If you are using secondary sources, keep track of where you got the data and which
release of the data you are using. Some datasets are updated periodically to correct
errors, to add new information, or to revise the imputations for missing data.
Data decisions
How were variables created and cases selected? Who did the work? When was it
done? What coding decisions were made and why? How did you scale the data and what
alternatives were considered? If you dichotomized a scale, what was your justification?
For critical decisions, also document why you decided not to do something.
Statistical analysis
What steps were taken in the statistical analysis, in what order, and what guided
those analyses? If you explored an approach to modeling but decided not to use it, keep
a record of that as well.
Software
Your choice of software can affect your results. This is particularly true with recent
statistical techniques where competing packages might use different algorithms lead-
ing to different results. Moreover, newer versions of the same software package might
compute things differently.
Storage
Where are the results archived? When you complete a project or put it aside to work
on other projects, keep a record of where you are storing the files and other materials.
Ideas and plans
Ideas for future research and lists of tasks to be completed should be included in
the documentation. What seems like an obvious idea for future analysis today might
be forgotten later.
2.4.2 Levels of documentation
Documentation occurs on several levels that complement one another.
The research log
The research log is the cornerstone of your documentation. The log chronicles the
ideas underlying the project, the work you have done, the decisions made, and the
reasoning behind each step in data construction and statistical analysis. The log includes
dates when work was completed, who did the work, what files were used, and where the
materials are located. As the core of your documentation, the log should indicate what
other documentation is available and where it is located. In section 2.4.4, I present an
excerpt from one of my research logs and provide a template that makes it easier to
keep a log.
Codebooks
A codebook summarizes information on the variables in your dataset. The codebook
reflects the final decisions made in collecting and constructing variables, whereas the
research log chronicles the steps taken and computer programs used to implement these
decisions. The amount of detail in a codebook depends on a number of things. How
many people will use the data? How much detail is in your research log? How much
documentation was stored internally to the dataset, such as variable labels, value labels,
and notes? Additional information on codebooks is provided in section 2.4.5. See also
section 8.5 on preparing data for archival preservation.
Dataset documentation
If you have many datasets, you might want a registry of datasets. This will help you
find a particular dataset and can help ensure that you are working with the latest data.
An example is given below. You can also use Stata’s label and notes commands to
add metadata to your datasets as discussed in section 2.4.6 and chapter 5.
Documenting do-files
Although the research log should include information about your do-files, your do-
files should also include detailed comments. These comments are echoed in the Stata
log file and clarify what the output means, where it came from, and how it should be
interpreted. You need to find a practical balance between how much information goes in
the research log and how much goes in the do-file. My research log usually has limited
information about each do-file, with fuller documentation located within the do-files.
Indeed, for smaller projects, you might find that your do-files along with the variable
labels, value labels, and notes in the dataset provide all the documentation you need for
a project. This approach, however, requires that you include very detailed comments
in your do-files and that you are able to fully replicate your results by rerunning the
do-files in sequence.
Internally labeling documents
Every document should include the author’s name, the name of the document file
(so you can search for the file if you have a paper copy but want to edit the file),
and the date it was created. One of the most frequent and easily remedied problems
I see is documents that do not include this information. Worse yet, someone revises
a document, but does not change the document's internal date and perhaps does not
change the name of the file. (Have you ever been in a meeting where participants debate
which version of a document is the latest?) On collaborative projects, it is easy to lose
track of which version of a document is the latest. This can be avoided if you add a
section at the end of each document that records a document’s pedigree. With each
revision, add a new entry indicating who wrote it, when, and what it was called. You
might wonder why you cannot use the operating system’s file date to determine when a
file was created. Unfortunately, that date can be changed by the operating system even
if the file has not changed. It is much safer to rely on a date that is internal to the file.
2.4.3 Suggestions for writing documentation
Although there are many ways to write documentation and I encourage you to find the
method that works best for you, there are several principles of documentation that are
worth remembering.
Do it today
When things are fresh in your mind, you can write documentation faster and more
accurately.
Check it later
If you write documentation while doing the work, it is easy to forget information
that is obvious now but that should be recorded for future reference. Ideally, write your
documentation soon after the work is completed. Then either have someone else check
the documentation or check it yourself at a later time.
Know where the documentation is
Decide where to keep your documentation. If you cannot find it, it does not do
you any good! I keep electronic copies of my documentation in the \Documentation
subdirectory of each project. I usually keep a printed copy in a project notebook that
I update after each step of the project is completed.
Include full dates and names
When it comes to dates, the year is important. On February 26, it might seem
inconceivable that the project will continue through the next calendar year, but even
simple research notes can take years to finish. Include full names. “Scott” or the initials
“SL” may be clear now, but at a later time, there might be more than one Scott or two
people with the same initials.
Evaluating your documentation
Here is a thought experiment for assessing the adequacy of your documentation. Think
of a variable or data decision that was completed early in the project. In a study of
aging, this could be how the age of a respondent was determined. Imagine that you have
finished the first draft of a paper and then discovered that age was computed incorrectly.
This might seem far-fetched, but the National Longitudinal Survey has revised the birth
years of respondents several times. How long would it take to create a corrected dataset
and redo the analyses? Could other researchers understand your documentation well
enough to revise your programs to correct the variable and recompute all later analyses?
If not, your documentation is inadequate. When teaching statistics, I require students
to keep a research log. This log mimics what they should record if they were working
on a research paper. The standard for assessing the adequacy of the log and the file
organization is the following. During the last week of class, imagine returning to the
second assignment, removing the first three cases in the dataset (i.e., drop if _n < 4),
and rerunning the analyses. If the documentation and file organization are adequate,
this should take less than five minutes.
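In Stata terms, the test amounts to something like the following minimal sketch, with hypothetical do-file names (the drop command goes in the do-file that first saves the analysis dataset):

* In the data-construction do-file, after loading the raw data:
drop if _n < 4
* Then rerun the project's do-files in order:
do clean01
do analysis01
do analysis02

If the documentation says what runs in what order, this really is a five-minute job.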
2.4.4 The research log
The research log is the core of your documentation, serving as a diary of the work you
have done on a project. Your research log should accomplish three things:
• The research log keeps your work on track. By including your research plan, the
log helps you set priorities and complete work in an efficient way.
• The research log helps you deal with interruptions. Ideally, you start a project
and work on it without interruption until it is finished. In practice, you are likely
to move among projects. When you return to a project, the research log helps
you pick up the work where it ended without spending a lot of time remembering
what you did and what needs to be done.
• The research log facilitates replication. By recording the work that was done and
the files that were used, the research log is critical for replicating your work.
As long as these objectives are met, your research log is a good one.
Researchers keep logs in many formats (e.g., bound books, loose-leaf notebooks,
computer files) and refer to them by different names (e.g., project notes, think books,
project diaries, workbooks). While writing this book, I asked several people to show
me how they keep track of their research. I discovered that there are many styles and
approaches, all of which do an admirable job of meeting the fundamental objective of
recording what was done so that results could be reproduced at a later time. Several
people conveyed stories of how their logs became more detailed as the result of a painful
lesson caused by inadequate documentation in the past. Without question, keeping
a research log involves considerable work. So, it is important to find an approach to
keeping a log that appeals to you. If you prefer writing by hand to typing, use a bound
volume. If you would rather type or your handwriting is illegible, use a word processor.
The critical thing is to find a way that allows you to keep your documentation current.
My research log records what I have done, why I did it, and how I did it. It also
records things that I decided not to pursue and why. Finally, it includes information on
what I am thinking about doing next. To accomplish this, the log includes information
on the source of data, problems encountered and solutions used, extracts from emails
from coauthors, summaries of meetings, ideas related to the current analyses, and lists
of things I need to do. When I begin a project, I start with my research plan. The plan
lays out the work for the following weeks or months. As work progresses, the plan is
folded into the record of work that was completed. As such, the plan becomes a to-do
list, whereas the research log is a diary of how the work was completed.
A sample page from a research log
To give you an idea of what my research logs look like, figure 2.3 shows an extract from
the research log for a paper completed several years ago.
JSL FLIM log: 4/1/02 to 6/12/02 - Page 11

First complete FLIM measures paper

f2alt01a.do - 24May2002
    Descriptive information on all rhs, lhs, and flim measures.
f2alt01b.do - 25May2002
    Compute bic' for each of four outcomes and all flim measures.
    Outcome: Can work                  global lhs "qcanwrk95"
           : Work in three categories  global lhs "dhlthwk95"
           : Bath trouble              global lhs "bathdif95"
           : adlsum95 - sum of adls    global lhs "adlsum95"
f2alt01c.do - 25May2002
    Compute bic' for each of four outcomes and with only these restricted
    flim measures.
    1. ln(x+.5) and ln(x+1)
    2. 9 counts: >=7=7 (50% and 75%)
    3. 8 counts: >=6=6 (50% and 75%)
    4. 16 counts: >=9=9 >=14=14 (50% and 75%)
    5. probability splits at .5: these don't work well in prior tests
f2alt01d.do - 25May2002
    bic' for all four outcomes in models that include all raw flim measures
    (flu*p5; fll*p5); pairs of u/l measures; groups of LCA measures.
f2alt01e.do - all LCA probabilities - 25May2002
f2alt01j.do - use three probability measures from LCA - 29May2002
f2alt02c.do - 29May2002
    Use three binary variables, not just LC class numbers.
    : dummies work better than the class number;
    : effects of lower and severe are not significantly different.
Redo f2 analyses - error in adlsum - 3Jun2002
    ARGH! adlsum is incorrect -- it included going to bed twice.
    All of the f2alt analyses need to be redone using the corrected dataset.
f3alt_qflim07.do: create qflim07.dta - 3Jun2002
    1) Correct adlsum: adlsum95p
    2) Add binary indicators of lmaxp5: lmaxNonep5, etc.
f3alt01a.do (redo f2alt01a.do) - 3Jun2002
f3alt01b.do (redo f2 job) - 3Jun2002

Figure 2.3. Sample page from a research log
This section of the log documents analyses completed after the data were cleaned and
variables were constructed. The do-files from f2alt01a.do to f2alt02c.do complete
the primary analyses for the paper. When reviewing these results, I discovered that
a summated scale was incorrect, as it included the same variable twice. The program
f3alt_qflim07.do fixed the problem and created the dataset qflim07.dta. The do-
files f3alt*.do are identical to f2alt*.do except that the corrected scale is used. As
I reread this research log, which was written four years ago, I found many things that
were unclear. But, because the log pointed to the do-files that were used, it was simple
to figure things out by checking those files. Thus the comments in the do-files were
critical for making the log effective. The point is that your research log does not need
to record every detail. Rather, it needs to provide enough detail so that you can find
the information you need.
A template for research logs
Keeping a research log is easier if you start with a template. For example, I use the
Word document research-log-blank.docx (see figure 2.4) to start a new research log
(available at the Workflow web site):
Workflow research log template (alt-h)

Heading level 1 (alt-1)
Normal text (ctrl-n)
Heading level 2 (alt-2)
Normal text follows by default.
Heading level 3 (alt-3)
Normal text follows by default.
Heading level 4 (alt-4)
Normal text follows by default.
Heading level 5 (alt-5)
Normal text follows by default.
Output in 10 point font (alt-0):
12345678901234567890123456789012345678901234567890123456789012345678901234567890
Output in 9 point font (alt-9):
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890

Figure 2.4. Workflow research log template
I load the file, change the header and title to correspond to the project, and delete the
remaining lines in the file. These lines are included in the template to remind me of
the keyboard shortcuts built into the document. For example, to add a level-1 heading,
press Alt+1; to add output in a 9-point font, press Alt+9; and so on. The body of
the document is in a fixed font, which I find easiest because I can paste output and it
will line up properly. I change the name of the file and save it to the \Documentation
directory for the project.
2.4.5 Codebooks
Codebooks describe your dataset. If you are collecting your own data, you need to
create a codebook for all the variables. If you are using an existing dataset that has a
codebook, you only need to create a codebook for variables that you add.
There is an excellent guide for preparing a codebook, Guide to Social Science Data
Preparation and Archiving: Best Practice Throughout the Data Life Cycle (ICPSR
2005), which can be downloaded as a PDF. Here I highlight only some key points. The
Guide suggests that you start by writing a plan for what the codebook will look like
and think about how you can use the output from your software to help you write the
codebook. For example, Stata's codebook command might have most of the information
you want to include in the codebook; a sketch follows the list below. For each variable, consider including the following
information:
• The variable name and the question number if the variable came from a survey.
• The text of the original question from the survey that collected the data or the
details on the method by which the data for that variable were obtained. Include
the variable label used in your data file.
• If the data are collected with a survey that includes branching (e.g., if the respon-
dent answers A, go to question 2; if B, go to question 7), include information on
how the branching was determined.
• Descriptive statistics including value labels for categorical variables.
• Descriptions of how missing data can occur along with codes for each type of
missing data.
• If there was recoding or imputation, include details. If a variable was constructed
from other variables in the survey (e.g., a scale), provide details, including how
missing data were handled.
• An appendix with abbreviations and other conventions used.
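As raw material for this write-up, you can capture Stata's codebook output in a log file; a minimal sketch (the log file name is arbitrary, and the options shown are one reasonable choice):

log using codebook-draft, replace text
use wf-lfp, clear
* dataset header, notes, and a tabulation of up to nine values per variable
codebook, header notes tabulate(9)
log close

The log file then gives you, for every variable, the name, label, value labels, and summary statistics to edit into the written codebook.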
A codebook based on the survey instrument
If you are collecting data, editing the original survey instrument is a quick and effective
way to create a codebook. For example, figure 2.5 is an edited version of the survey
instrument used for the SGC-MHS Study (Pescosolido et al. 2003). The variable names
and labels were added in bold. Other information on coding decisions, skip patterns,
and so on was documented elsewhere.
                                                        Not at all              Very
                                                        important          important
Q43. Turn to family for help                             1 2 3 4 5 6 7 8 9 10
tcfam     Q43 How Important: Turn to family for help
Q44. Turn to friends for help                            1 2 3 4 5 6 7 8 9 10
tcfriend  Q44 How Important: Turn to friends for help
Q45. Turn to a minister, priest, rabbi, or other religious leader   1 2 3 4 5 6 7 8 9 10
tcrelig   Q45 How Important: Turn to a Minister, Priest, Rabbi or other religious leader
Q46. Go to a general medical doctor for help             1 2 3 4 5 6 7 8 9 10
tcdoc     Q46 How Important: Go to a general medical doctor for help
Q47. Go to a psychiatrist for help                       1 2 3 4 5 6 7 8 9 10
tcpsy     Q47 How Important: Go to a psychiatrist for help
Q48. Go to a mental health professional for help         1 2 3 4 5 6 7 8 9 10
tcmhprof  Q48 How Important: Go to a mental health professional
          ALLOWED DEFINITION - PSYCHOLOGIST, THERAPIST, SOCIAL WORKER, OR COUNSELOR
          INTERVIEWER NOTE: CODE "DON'T KNOW" AS 88 IN ABOVE SEQUENCE.

The next few questions deal with the government's responsibility to help people like
NAME. For each statement please tell me if you think the government definitely should,
probably should, probably should not, or definitely should not be responsible for helping
people with situations like NAME's.

Figure 2.5. Codebook created from the survey instrument for the SGC-MHS Study
2.4.6 Dataset documentation
Your research log should include details on how each dataset was created. For ex-
ample, the log might indicate that cwh-data01a-scales.do started with the dataset
cwh-01.dta, created new scales, and saved the dataset cwh-02.dta. I also recommend
including information inside the dataset. Stata’s label data command lets you add a
label that is displayed every time you load your data. For example,
. use jsl-ageism04
(Ageism data from NLS \ 2006-06-27)
The data label, listed in parentheses, reminds me that the file is for a project that is
analyzing reports of age discrimination from the NLS and that the dataset was created on
June 27, 2006. Stata’s notes command lets you embed additional information in your
dataset. When I create a dataset, I include a notes with the name of the do-file that
created the dataset. When a file is updated or merged with another file, the notes are
carried along. This means that internal to the dataset I have a history of how the dataset
was created. For example, jsl-ageism04.dta is from a project with Eliza Pavalko that
has been ongoing for five years- The project required dozens of datasets, thousands of
variables, and hundreds of do-files. If I found a problem in jsl-ageism04.dta, I can
use notes to track down what caused the problem. For example,2.5 Conclusions 45
. notes _dta

_dta:
  1.  base01.dta: base vars with birthyr and cohort \ base01a.do jsl 2001-05-31.
  2.  base02.dta: add attrition info \ base01b.do jsl 2001-06-29.
 (output omitted)
 38.  jsl-ageism04.dta: add analysis variables \ age07b.do jsl 2006-06-27.
There were 38 steps that went into creating this dataset. If a problem was found
with the attrition variable, the second note indicates that this variable was created by
base01b.do on June 29, 2001. I can check the research log for that date or go to the
do-file to find the problem. The advantage of internal documentation is that it travels
with the dataset, and saves me from searching research logs to track down the problem.
Essentially, I use the notes command to index the research log. Details on Stata’s
label data and notes commands are given on page 138.
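Putting this together, the end of a data-construction do-file might look like the following minimal sketch, which reuses the hypothetical CWH file names from above (the label text and date are illustrative):

use cwh-01, clear
* ... commands that create the new scales ...
label data "CWH data with new scales \ 2006-06-27"
note: cwh-02.dta: create scales \ cwh-data01a-scales.do jsl 2006-06-27.
save cwh-02, replace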
For large projects, you might want a registry of datasets. For example, I am working
on a project in which we will be receiving datasets from 17 countries where each country
has several files. We created a registry to keep track of the datasets. The data registry
can be kept in a spreadsheet that looks like figure 2.6 (file: wf2-data-registry.xls):
Data Registry for _____ Data Files
Created by:

Dataset #   File name   Date created   do-file   Comments

Figure 2.6. Data registry spreadsheet
2.5 Conclusions
The critical point of this chapter is that planning, organizing, and documenting are
essential tasks in data analysis. Planning saves time. Organization makes it easier to
find things. Documentation is essential for replication, and replication is fundamental
to the research enterprise. Although I hope that my discussion will help you accomplish
these tasks more effectively and convince you of their importance, any way you look at
it PO&D are hard work. When you are tempted to postpone these tasks, keep in mind
that it is almost always easier to do these tasks earlier than later. Make these tasks a
routine part of your work. Get in the habit of checking your documentation at natural
divisions in your work. If you find something confusing (e.g., you cannot remember how
a variable was created) or if you have trouble finding something, take the time right
then to improve your documentation and organization. When thinking about PO&D,
consider the worst-case scenario when things go wrong and time is short, not the ideal
situation when you have plenty of uninterrupted time to work on a project from start to
finish. By the time you lose track of what you are doing, it often takes longer to create
the plan, organize the files, and document the work than if you had started these tasks
at the very beginning.
The next two chapters look at features of Stata that are critical for developing an
effective workflow. Chapter 3 reviews basic tools and provides handy tricks for working
with Stata. Chapter 4 introduces Stata features for automating your work. Time spent
learning these tools really pays off when using Stata.

3 Writing and debugging do-files
Before discussing how to use Stata for specific tasks in your workflow, I want to talk
about using Stata itself. Part of an effective workflow is taking advantage of the powerful
features of your software. Although you can learn the basics of Stata in an hour, to work
efficiently you need to understand some of its more advanced features. I am not talking
about specific commands for transforming data or fitting a model, but rather about the
interface of the program, the principles for writing do-files, and how to automate your
work. The time you spend learning these tools will quickly be recovered as you apply
these tools to your substantive work. Moreover, each of these tools contributes to the
accuracy, efficiency, and replicability of your work. This chapter discusses writing and
debugging do-files. Chapter 4 introduces powerful tools for automating your work. The
tools and techniques from chapters 3 and 4 are used and expanded upon in chapters 5-7
where different parts of the workflow of data analysis are discussed.
I begin the chapter reviewing three ways to execute commands: submit them from
the Command window, construct them with dialog boxes, or include them in do-files.
Each approach has its advantages, but I argue that the most effective way to work is
with do-files. Because the examples in the rest of the book depend on do-files, I discuss
in section 3.2 how to write more effective do-files that are easier to understand and
that will continue to work on different computers, in later versions of Stata, and after
you change the directories on your computer. Although these guidelines can prevent
many errors, sometimes your do-files will not work. Section 3.3 describes how to debug
do-files, and section 3.4 describes how to get help when the do-files still do not work.
I assume that you have used Stata before, although I do not assume that you are an
expert. If you have not used Stata, I encourage you to read [GS] Getting Started with
Stata and those sections of the [U] User’s Guide that seem most useful. Appendix A
discusses how the Stata program works, which directories it uses, how to use Stata on
a network, and ways to customize Stata. Even experienced users may find some useful
information there.
3.1 Three ways to execute commands
There are three ways to execute commands in Stata. You can submit commands inter-
actively from the command line. This is ideal for trying new things and exploring your
data. You can use dialog boxes to construct and submit commands, which is particu-
larly useful for finding the options you need when exploring new commands. You can
also run do-files, which are text files that contain Stata commands. Each method has
advantages, but I will argue that serious work requires do-files. Indeed, I only use the
other methods to help me write do-files.
3.1.1 The Command window
You can type one command at a time in the Command window. Type the command and
press Enter. When experimenting with how a command works or checking some aspect
of my data, I often use this method. I try a command, press Page Up to redisplay the
command in the Command window, revise it, press Enter to run it again, and so on. The
disadvantage of working interactively is that you cannot easily rerun your commands
at a later time.
Stata has a number of features that are very useful when working from the Command
window.
Review window
The commands you submit from the Command window are echoed to the Review
window. When you click on a command in the Review window, it is pasted into the
Command window where you can revise it and then submit it by pressing Enter. If you
double-click on a command in the Review window, it is sent to the Command window
and automatically executed.
Page up and page down
The Page Up and Page Down keys let you scroll through the commands in the Review
window. Pressing Page Up multiple times moves through multiple prior commands. Page
Down moves you forward to more recent commands. When a command appears in the
Command window, you can edit it and then rerun it by pressing Enter.
Copy and paste
You can highlight and copy text from the Command window or the Results window.
This information can be pasted into other applications, such as your text editor. This
allows you to debug a command interactively, then copy the corrected commands to
your do-file.
Variables window
The Variables window lists the variables in the current dataset. If you click on a
variable name in this window, the name is pasted into the Command window. This is
often the fastest way to construct a list of variable names. You can then copy the list
of names and paste it into your do-file.
Logging with log and cmdlog
If you want to reproduce the results you obtain interactively, you should save your
session to a log file with the log using command. You can then edit the log file to
create a do-file to rerun the commands. Suppose that you start an interactive session
with the command
log using datacheck, replace text
After you are done with your session, you close the log file with log close to cre-
ate the file datacheck.log. To create a do-file that will produce the same results,
you can copy the log file to datacheck.do, remove the .’s in front of each command,
and delete the output. This is tedious but sometimes quite useful. An alternative
is to use cmdlog to save your interactive commands. For example, cmdlog using
datacheck.do, replace saves all commands from the Command window (but no out-
put) to a file named datacheck.do, which you can use to create your do-file. You close
a cmdlog with the cmdlog close command.
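For example, an interactive checking session might look like this (a minimal sketch; the summarize command stands in for whatever checking you actually do):

. cmdlog using datacheck.do, replace
. use wf-lfp, clear
. summarize lfp age
. cmdlog close

Afterward, datacheck.do contains the use and summarize commands, without the dots or output, ready to be edited into a proper do-file.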
3.1.2 Dialog boxes
You can use dialog boxes to construct commands using point-and-click. You open a
dialog box from the menus in Stata by selecting the task you want to complete. For ex-
ample, to construct a scatterplot matrix, you select Graphics (Alt+G) > Scatterplot
matrix (s, Enter). Next you select options using your mouse. After you have selected
your options, click on the Submit button to run the command. The command you
submit is echoed to the Results window so that you can see how to type the command
from the Command window or with a do-file. If you press Page Up, the command gen-
erated by the dialog box is brought into the Command window where you can edit it,
copy it, or rerun it.
Although dialog boxes are easy to learn, they are slow to use. However, dialog boxes
are very efficient when you are looking for an option used by a complex command. I use
them frequently when creating graphs. I select the options I need, run the command by
clicking on the Submit button, and then copy the command from the Results window
to my do-file.
3.1.3 Do-files
Over 99% of the work I do in Stata uses do-files. Do-files are simply text files that
contain your commands. Here is a simple do-file named wf3-intro.do.
log using wf3-intro, replace text
use wf-lfp, clear
summarize lfp age
log close
This program loads data on labor-force participation and computes summary statistics
for two variables. If you have installed the Workflow package in your working directory,
you can run this do-file by typing the command do wf3-intro.do.¹ The extension .do
is optional, so you could simply type do wf3-intro. After submitting the file, I obtain
these results:
       log:  e:\workflow\work\wf3-intro.log
  log type:  text
 opened on:  3 Apr 2008, 05:27:01

. use wf-lfp, clear
(Workflow data on labor force participation \ 2008-04-02)

. summarize lfp age

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         lfp |       753    .5683931    .4956295          0          1
         age |       753    42.53785    8.072574         30         60

. log close
       log:  e:\workflow\work\wf3-intro.log
  log type:  text
 closed on:  3 Apr 2008, 05:27:01
That is how simple it is to run a do-file. If you have avoided them in the past, this is
a good time to take an hour and learn how they work. That hour will save you many
hours later.
I use do-files for two major reasons. First, with do-files you have a record of the
commands you ran, so you can rerun them in the future to replicate your results or to
modify the program. Recall the research log on page 41 that documented a problem
with how a variable was created. If 1 had not been using do-files, I would have needed
to reconstruct weeks of work rather than changing a few lines of code and rerunning
the do-files in sequence. Second, with do-files, you can use the powerful features of
your text editor, including copying, pasting, global changes, and much more (see the
Workflow web site for information on text editors), The editor built into Stata can
be opened several ways: run the command doedit, select the Do-file Editor from the
Window menu of Stata, or click on the Do-file Editor icon. For details on the Stata
Do-file Editor, type help doedit, or see [R] doedit.
3.2 Writing effective do-files
The rest of the book assumes that you are using do-files to run commands, with the
exceptions of occasionally testing commands from the Command window or using dialog
boxes to track down options. In this section, I consider how to write do-files that are
robust and legible. Here is what I mean by these terms:
1. Appendix A explains the idea of a working directory. The Preface has information on installing
the Workflow package.
Robust do-files produce exactly the same result when run at a later time or
on another computer.
Legible do-files are documented and formatted so that it is easy to under-
stand what is being done.
Both criteria are important because they make it possible to replicate and correctly
interpret your results. As a bonus, robust and legible do-files are easier to write and
debug. To illustrate these characteristics of do-files, I use examples that contain basic
Stata commands. Although you might encounter a command that you have not seen
before, you should still be able to understand the general points I am making even if
you do not follow the specific details.
3.2.1 Making do-files robust
A do-file is robust if it produces exactly the same result when it is rerun on your
computer or run on a different computer. The key to writing robust do-files is to make
sure that results do not depend on something left in memory (e.g., from another do-file
or a command submitted from the Command window) or how your computer is set up
(e.g., the directory structure you use). To operationalize this standard, imagine that
after running a do-file you copy this file and all datasets used to a USB drive, insert the
USB drive in another computer, and run the do-file again without any changes. If you
cannot do this and get the same results, replication will be difficult or impossible. Here
are my suggestions for making your do-files robust.
Make do-files self-contained
Your do-file should not rely on something left in memory by a prior do-file or commands
run from the Command window. A do-file should not use a dataset unless it loads
the dataset itself. It should not compute a test of coefficients unless it estimates those
coefficients. And so on. To understand why this is important, consider a simple example.
Suppose that wf3-step1.do creates new variables and wf3-step2.do fits a model. The
first program loads a dataset and creates two variables indicating whether a family has
young children and whether a family has older children:
log using wf3-step1, replace text
use wf-lfp, clear
generate hask5 = (k5>0) & (k5<.)
label var hask5 "Has children less than 5 yrs old?"
generate hask618 = (k618>0) & (k618<.)
label var hask618 "Has children between 6 and 18 yrs old?"
log close
The program wf3-step2.do estimates the logit of lfp on seven variables, including the
two created by wf3-step1.do:
log using wf3-step2, replace
logit lfp hask5 hask618 age wc hc lwg inc, nolog
log close
If these programs are run one after the other, with no commands run in between,
everything works fine. What if the programs are not run in sequence? For example,
suppose that I run wf3-step1.do and then run other do-files or commands from the
Command window. Or I might later decide that the model should not include age, so I
modify wf3-step2.do and run it again without running wf3-step1.do first. Regardless
of the reason, if I run the second do-file without running wf3-step1.do first, I get the
following error:
. logit lfp hask5 hask618 age wc hc lwg inc, nolog
no variables defined
r(111);
The error occurs because the dataset is no longer in memory. I might change the
program so that the original dataset is loaded:
log using wf3-step2, replace
use wf-lfp, clear
logit lfp hask5 hask618 age wc hc lwg inc, nolog
log close
Now the error is
. logit lfp hask5 hask618 age wc hc lwg inc, nolog
variable hask5 not found
r(111);

This error occurs because hask5 is not in the original dataset but was created by
wf3-step1.do.
To avoid this type of problem, I can modify the two programs to make them self-
contained. I change the first program so that it saves a dataset with the new variables
(file: wf3-step1-v2.do):
log using wf3-step1-v2, replace
use wf-lfp, clear
generate hask5 = (k5>0) & (k5<.)
label var hask5 "Has children less than 5 yrs old?"
generate hask618 = (k618>0) & (k618<.)
label var hask618 "Has children between 6 and 18 yrs old?"
save wf-lfp-v2, replace
log close
I change the second program so that it loads the dataset created by the first program
(file: wf3-step2-v2.do):
log using wf3-step2-v2, replace
use wf-lfp-v2, clear
logit lfp hask5 hask618 age wc hc lwg inc, nolog
log close
The do-file wf3-step2-v2.do still requires running wf3-step1-v2.do to create the new
dataset, but it does not require running wf3-step2-v2.do immediately after
wf3-step1-v2.do or even that it be run in the same Stata session.
There are a few exceptions, where do-files need to be run in sequence. For example,
if I am doing postestimation analysis of coefficients from a model that takes a long
time to fit (e.g., asmprobit), I do not want to refit the model repeatedly while I debug
the postestimation commands. I would use one do-file to fit the model and a second
do-file for postestimation analysis. The second do-file only works if the prior do-file was
run. To ensure that I remember that the programs need to be run in tandem, I add a
comment to the second do-file:
// Note: This do-file assumes that program1.do was run first.
After debugging the second program, I would combine the two do-files to create one
do-file that is self-contained.²
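With Stata 10's estimates save and estimates use (see footnote 2 below), each do-file can be self-contained even in this situation; a minimal sketch, using a quick logit as a stand-in for a slow-fitting model (the do-file split and the file name basemodel are hypothetical):

* First do-file: fit the model once and save the estimates.
use wf-lfp, clear
logit lfp k5 k618 age wc hc lwg inc, nolog
estimates save basemodel, replace

* Second do-file: reload the saved estimates for postestimation work.
estimates use basemodel
test wc hc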
Use version control
If you run a do-file at a later time, perhaps to verify a result or to modify some part of
the program, you could be using a newer version of Stata. If you share a do-file with
a colleague, she might be using a different version of Stata. Sometimes new versions of
Stata change the way in which a statistic is computed, perhaps reflecting advances in
computational methods. When this occurs, the same commands can produce different
results in different versions of Stata. Newer versions of Stata might change the name of
a command (e.g., clear in Stata 9 was changed to clear all in Stata 10). The solution
is to include a version command in your do-file. For example, if your do-file includes
the command version 6 and you run the do-file in Stata 10, you will get exactly the
same answer that you would obtain in Stata 6. This is true even if Stata 10 computes
the particular statistic differently (e.g., the computations in some xt commands changed
between Stata 6 and Stata 10). On the other hand, if your do-file includes the command
version 10 and you try to run the program in Stata 8.2, you get an error:
. version 10
this is version 8.2 of Stata; it cannot run version 10.0 programs
You can purchase the latest version of Stata by visiting
http://www.stata.com.
r(9);
You could rerun the program after changing the version 10 command to version 8.2.
There is no guarantee that programs written for newer versions of Stata will work in
older versions.
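In practice, this means putting a version command near the top of every do-file; a minimal sketch (the log file name is arbitrary):

* Fix how all later commands are interpreted, no matter which newer
* version of Stata actually runs this do-file.
version 10
log using wf3-version, replace text
use wf-lfp, clear
summarize lfp age
log close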
Exclude directory information
I almost never specify a directory location in commands that read or write files. This
lets my do-files run even if the directory structure of the computer I am using changes.
For example, suppose that my do-file loads data with the command
2. With Stata 10, I might use the new estimates save command to save the estimates in the first
do-file and then load them at the start of the second do-file that does postestimation analysis. This
would allow each program to be self-contained, even when debugging the second program. For
details, see [R] estimates save.
use c:\data\wf-lfp, clear
Later, when I rerun the do-file on a computer where the dataset is stored in d:\data\,
I get an error:
. use c:\data\wf-lfp, clear
file c:\data\wf-lfp.dta not found
r(601);
To avoid such problems, I do not include a directory location. For example, to load
wf-lfp.dta, I use the command
use wf-lfp, clear
When no directory is specified, Stata looks in the working directory.
The working directory is the directory you are in when you launch Stata.³ In Win-
dows, you can determine your working directory by typing cd. For example,
. cd
e:\data
In Mac OS or Unix, you use the pwd command. For example, on a Mac:
. pwd
~/data
You can change your working directory with the cd command. For example, when
testing commands for this book, I used the e: \workflow\work directory. To make this
my working directory, I would type
ed e:\workflow\work
To change to the working directory used for the CWH project, I would type
cd e:\cwh\work
If the directory name includes blanks or special characters, you need to put the name
in quotes. For example,
cd "c:\Documents and Settings\jslong\Projects\workflow\work"
The advantage of not including directory locations in your do-file is that you can run
your do-files on other computers without any changes. Although it is tempting to say
that you will always keep your data in the same place (e.g., d:\data), this is unlikely
for several reasons.
1. If you change computers or add a new drive to your computer, the drive letters
might change.
3. Appendix A has a detailed discussion of the directories used by Stata.
2. If you keep data on external drives, including USB flash drives, the operating
system will not always assign the drive the same drive letter.
3. If you reorganize your files, the directory structure could change.
4. When you restore files from your archive, you might not remember what the
directory structure used to be.
If you share do-files with a collaborator or someone helping you debug your program,
they will probably have a different directory structure than yours. If you hardcode
the directory, the person you send the do-file to must either create the same direc-
tory structure or change your program to load data from a different directory. When
the collaborator sends you the corrected do-file, you will have to undo the directory
changes that were made, and so on. All things considered, I think that it is best prac-
tice to write do-files that do not require a particular directory structure or location
for the data. There are two exceptions that are useful. First, if you are loading a
dataset from the web, you need to specify the specific location of the file. For example,
use http://www.stata-press.com/data/r10/auto, clear. Second, you can specify
relative directories. Suppose there is a subdirectory \data located in your working di-
rectory. To keep things organized, you place all your datasets in this directory, while
your do-files and log files remain in your working directory. You can access the datasets
by specifying the subdirectory. For example, use data\wf-lfp, clear.
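A do-file using this layout might look like the following minimal sketch (the generate command and the output file name wf-lfp-v3 are illustrative, not part of the Workflow package):

* Read from and write to the \data subdirectory of the working
* directory; only the relative location is hardcoded.
use data\wf-lfp, clear
generate agesq = age*age
save data\wf-lfp-v3, replace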
Include seeds for random numbers
Random numbers are used in a variety of ways in data analysis. For example, if you
are bootstrapping standard errors, Stata draws repeated random samples. If you try to
replicate results that use random numbers, you need to use the same random numbers or
you will obtain different results. Stata uses pseudorandom numbers that are generated
by a formula in which one pseudorandom number is transformed to create the next
number. This transformation is done in such a way that the sequence of numbers
behaves as if it were truly random. With pseudorandom numbers, if you start with the
same number, referred to as the seed, you will re-create exactly the same sequence of
numbers. Accordingly, to reproduce exactly the same results when you rerun a program
that uses pseudorandom numbers, you need to start with the same seed. To set the
seed, use the command
set seed #
where # is a number you choose. For example, set seed 11020. For further details
and an example, see section 7.6.3.
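For example, a do-file that bootstraps a standard error can be made exactly reproducible by setting the seed first; a minimal sketch (the seed and the number of replications are arbitrary choices):

* With the same seed, rerunning this do-file draws the same 50
* bootstrap samples and reproduces the same standard error.
set seed 11020
use wf-lfp, clear
bootstrap r(mean), reps(50): summarize lfp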
3.2.2 Making do-files legible
I use the term legible to describe do-files that are internally documented and carefully
formatted. When writing a do-file, particularly one that does complex statistical analy-
ses or data manipulations, it is easy to get caught up in the logic of what you are doing
and forget about documenting your work and formatting the file to make the content
clear. Applying uniform procedures for documenting and formatting your do-files makes
them easier to debug and helps you and your collaborators understand what you did.
There are many ways to make your do-files easier to understand. If you do not like
my stylistic suggestions, feel free to create your own style. The important thing is to
establish a style that you and others find legible. If you are collaborating, try to agree
upon a common style for writing do-files that makes it simpler to share programs and
results. Clear and well-formatted do-files are so important for working efficiently that
one of the first things I do when helping someone debug a program is to reformat their
do-file to make the code easier to read.
Use lots of comments
I have never returned to a do-file and regretted how many comments it had, but I have
often wished that I had written more. Commands that seem obvious when I write them
can be obscure later. I try to add at least a few comments when I initially write a
do-file. After the program works the way I want, I add additional comments. These
comments are used both to label the output and to explain commands and options that
might later be confusing.
Stata provides three ways to add comments. The first two create comments on a
single line, whereas the third allows you to easily write multiline comments. The method
you use is largely a matter of personal preference.
* comments
If you start a line with a *, everything that follows on that line is treated as a
comment. For example,
* Select sample based on age and gender
or
* The following analysis includes only those people
* who responded to all four waves of the survey.
You can temporarily stop a command from being executed:
* logit lfp wc hc age inc
// comments
You can add comments after a //. For example,
// Select sample based on age and gender
This method can also be used at the end of a command. For example,
logit lfp wc hc // includes only education, add wages later
/* and */ comments
Everything between an opening /* and a closing */ is treated as a comment. This
is particularly useful for comments that extend over multiple lines. For example,
/*
These analyses are preliminary and are based on those countries
for which complete data were available by January 17, 2005.
*/
Comments as dividers
Comments can be used as dividers to distinguish among different parts of your
program. For example,
* ==================================================
* == Descriptive statistics by gender

or

// ==================================================
// == Logit models of depression on genetic factors
Obscure comments
Comments are useful only when they are accurate and clear. When writing a complex
do-file, I use comments to remind me of things I need to do. For example,
* check this. wrong variable?
or
* see ekp’s comment and model specification
After the program is written, these comments should be deleted because later they will
be confusing.
Use alignment and indentation
It is easier to verify your commands if things line up. For example, here are two ways
to format the same commands for renaming variables. Which is easier for spotting a
mistake? This?
rename dev origin
rename major jobchoice
rename HE parented
rename interest goals
rename score testscore
rename sgbt sgstd
rename restrict restrictions
Or this?
rename dev       origin
rename major     jobchoice
rename HE        parented
rename interest  goals
rename score     testscore
rename sgbt      sgstd
rename restrict  restrictions
Most text editors, including Stata’s Do-file Editor, allow tabbing that makes lining
things up easier.
When commands take more than one line, I indent the second and later lines. I find
it easier to read
logit y var01 var02 var03 var04 var05 var06 ///
    var07 var08 var09 var10 var11 var12 ///
    var13 var14 var15
than
logit y var01 var02 var03 var04 var05 var06 ///
var07 var08 var09 var10 var11 var12 ///
var13 var14 var15
Some text editors, including Stata’s, can automatically indent. This means that if you
indent a line, the next line is automatically indented. If you find that the Stata Do-file
Editor does not do this, you need to turn on the Auto-indent option. While in the
Editor, press Alt+e and then f to open the dialog box where you can set this option.
You can also highlight lines in the Do-file Editor and indent them all by pressing Ctrl+i
or outdent them by pressing Ctrl+Shift+i.
Use short lines
Mistakes are easy to make if you cannot see the complete command or all the output. To
avoid problems with truncation or wrapping, I keep my command lines to 80 columns or
less and set the line size to 80 (set linesize 80) because this works with most printers
and on most screens. To illustrate why this is important, here is a problem I encountered
when helping someone debug a program using the listcoef command, which is part
of SPost. The do-file I received looked like this, where the line with mlogit that trails
off the right-hand side of the page is 182 characters long (file: wf3-longcommand.do):
use wf-longcommand, clear
mlogit jobchoice income origin prestigepar aptitude siblings friends scale1_std demands interestlvl
listcoef
Because the outcome had three categories, listcoef should have listed coefficients
comparing outcomes 1 to 2, 2 to 3, and 1 to 3 for each variable. For some variables,
that was the case:4
Variable: income (sd=1.1324678)

Odds comparing
Alternative 1
to Alternative 2          b         z     P>|z|       e^b   e^bStdX
------------------------------------------------------------------
2    -3             0.49569     0.826     0.409    1.6416    1.7530
2    -1             0.68435     2.483     0.013    1.9825    2.1706
3    -2            -0.49569    -0.826     0.409    0.6092    0.5704
3    -1             0.18866     0.377     0.706    1.2076    1.2382
1    -2            -0.68435    -2.483     0.013    0.5044    0.4607
1    -3            -0.18866    -0.377     0.706    0.8281    0.8076
For other variables, some comparisons were missing:
Variable: female (sd=.50129175)

Odds comparing
Alternative 1
to Alternative 2          b         z     P>|z|       e^b   e^bStdX
------------------------------------------------------------------
2    -1             1.25085     1.758     0.079    3.4933    1.8721
1    -2            -1.25085    -1.758     0.079    0.2863    0.5342
Initially, I did not see a problem with the model and began looking for a problem in
the code for the listcoef command. Eventually, I did what I should have done from
the start: I reformatted the do-file so that it looked like this:
mlogit jobchoice income origin prestigepar aptitude siblings friends ///
    scale1_std demands interestlvl jobgoal scale3 scale2_std motivation ///
    parented city female, noconstant baseoutcome(1)
Once reformatted, I immediately saw that the problem was caused by the noconstant
option. Although noconstant is a valid option for mlogit, it was inappropriate for the
model as specified. While this problem did not show up in the mlogit output, it did
lead to misleading results from listcoef.
Having output lines that are too long also causes problems. Because you can control
line length of output in your do-file, this is a good place to talk about it. Suppose that
your line size is set at 132 and you create a table (file: wf3-longoutputlines.do):
set linesize 132
tabulate occ ed, row
4. The real example had comparisons among six categories, so the output took dozens of pages.60 Chapter 3. Writing and debugging do-files
When you print the results they are truncated on the right:
  frequency
  row percentage

                          Years of education
Occupation          3        6        7        8        9       10       11
    Menial          0        2        0        0        3        1        3
                 0.00     6.45     0.00     0.00     9.68     3.23     9.68
   BlueCol          1        3        1        7        4        6        5
                 1.45     4.35     1.45    10.14     5.80     8.70     7.26
     Craft          0        3        2        3        2        2        7
                 0.00     3.57     2.38     3.57     2.38     2.38     8.33
  WhiteCol          0        0        0
                 0.00     0.00     0.00
      Prof          0        0        1
                 0.00     0.00     0.89
     Total          1        8        4       12        9       10       19
                 0.30     2.37     1.19     3.56     2.67     2.97     5.64
Depending on how you print the log file, the results might wrap and look like this:
                          Years of education
Occupation          3        6        7        8        9       10
        11       12       13    Total
    Menial          0        2        0        0        3        1
         3       12        2       31
                 0.00     6.45     0.00     0.00     9.68     3.23
      9.68    38.71     6.45   100.00
   BlueCol          1        3        1        7        4        6
         5       26        7       69
                 1.45     4.35     1.45    10.14     5.80     8.70
      7.26    37.68    10.14   100.00
     Craft          0        3        2        3        2        2
         7       39        7       84
                 0.00     3.57     2.38     3.57     2.38     2.38
      8.33    46.43     8.33   100.00
  (output omitted)
I have often seen incorrect numbers taken from wrapped output. If your output wraps,
fix it right away by changing the line size to 80 and recycling the original output!
Limit your abbreviations
Variable abbreviations
In Stata, you can refer to a variable using the shortest abbreviation that is unique.
As an extreme example, suppose you have a variable with the valid but unwieldy name
age_at_1st_survey. If there is no other variable in your dataset that starts with a,
you can abbreviate the long name simply as a. Although this is easy to type, your
program will not work if you add a variable starting with a. For example, suppose you
add a variable agesq that is the square of age_at_1st_survey. Now the abbreviation
a generates the error:

a ambiguous abbreviation
r(111);

This error is received because Stata cannot tell if you are referring to age_at_1st_survey
or agesq.
Abbreviations can lead to other, perplexing problems. Here is an example I recently
encountered. The dataset has four binary variables bmi1_1019, bmi2_2024, bmi3_2530,
and bmi4_31up indicating a person’s body mass index (BMI). I got in the habit of using
the abbreviations bmi1, bmi2, bmi3, and bmi4. Indeed, I had forgotten that these were
abbreviations. Then I wanted to use svy: mean to test race differences in the mean of
bmi1:

svy: mean bmi1, over(black)
test [bmi1]black = [bmi1]white
The svy: mean command worked, but test gave the error:

equation [bmi1] not found
r(303);

Because I do not use svy commands regularly, I assumed that there must be another
way to compute the test when using survey means. The problem could not be with the
name bmi1 because I "knew" that was the right name. Eventually, I realized that the
problem was the abbreviation. Although svy: mean allows abbreviations (e.g., bmi1
for bmi1_1019), the test command requires the full name:

test [bmi1_1019]black = [bmi1_1019]white
The time saved using the abbreviation was more than lost uncovering the problem
caused by the abbreviation.
As tempting as it is to use abbreviations for variables, it is better not to use them.
If you find that names are too long to type, consider changing the names (see sec-
tion 5.11.2) or enter the variable names by clicking on the names in the Variables
window. Then copy the names from the Command window to your do-file. To prevent
Stata from allowing abbreviations for variable names, you can turn this feature on and
off with the command:

set varabbrev {on|off} [, permanently]
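For example, here is a sketch of a session with abbreviations turned off, using the variable
from the example above; the error line is what Stata reports when a name does not exist:

    . set varabbrev off
    . summarize age_at_1st_survey
      (output omitted)
    . summarize a
    variable a not found
    r(111);

With varabbrev off, a is no longer expanded to age_at_1st_survey, so adding agesq
later cannot silently change what your do-file analyzes.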
Command abbreviations
Many commands and options can also be abbreviated, which can be confusing. For
example, you can abbreviate the command name and variable name for
summarize education
as
su e
I find this to be too terse to be clear. A compromise is to use something like
sum educ
or
sum education
Consider a slight modification of a command I received in a do-file someone sent to me:
l a lw in 1/3
I find it much clearer to write the command like this:
list age lwg in 1/3
Longer abbreviations are not necessarily better than shorter ones. For example, in a
recent article that used Stata I saw the command:
nois sum mpg
I had not seen the nois command before, so I checked the manual. Eventually, I realized
that nois is an abbreviation for noisily. For me, noi is clearer than nois.
If you use abbreviations for commands, I suggest keeping them to three letters or
more. In the rest of the book, I will abbreviate only a few commands where I find the
abbreviations clear and convenient. Specifically, those in table 3.1.
Table 3.1. Stata command abbreviations used in the book
Full command name Abbreviation
generate gen
label define label def
label values label val
label variable label var
quietly qui
summarize sum
tabulate tab
As a general rule, command abbreviations make it harder for others to read your
code. If you want your code to be completely legible to others, do not use command
abbreviations.
Be consistent
All else being equal, you will make fewer errors and work faster if you find a standard
way to do things. This applies to the style of your do-files (more on this below), how
you format things, the order you run commands, and which commands you use. For
example, when I create a variable with generate, I follow it with a variable label, a
note, and a value label (if appropriate).
generate incomesqrt = sqrt(income)
label var incomesqrt "Square root of income"
notes incomesqrt: sqrt of income \ dataclean01.do jsl 2006-07-18.
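When a value label is appropriate, the same pattern extends naturally. Here is a sketch
for a hypothetical binary variable built from inc; the name highinc and the cutoff of 20
are made up for illustration:

    generate highinc = (inc > 20) if !missing(inc)  // hypothetical cutoff
    label var highinc "1 if inc greater than 20"
    label def highinc 0 0_NotHigh 1 1_High
    label val highinc highinc
    notes highinc: binary from inc \ dataclean01.do jsl 2006-07-18.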
3.2.3 Templates for do-files
The more uniform your do-files are, the less likely you are to make errors and the easier
it will be to read your output. Accordingly, I suggest that you create a template. You
can load the template into your text editor, make changes, and save the files with a
new name. This has several advantages. First, the template includes commands used in
all do-files that you will not have to type (e.g., capture log close). Second, you will
not forget to include commands that are in the template. Third, a standard structure
makes it simpler to work uniformly across projects.
Commands that belong in every do-file
Before presenting two templates for do-files, I want to discuss commands that I suggest
you include in every do-file. Here is a simple do-file named wf3-example.do, where the
line numbers on the left are used to refer to a specific line but are not part of the file:
1> capture log close
2> log using wf3-example, replace text
3>
4> // wf3-example.do: compute descriptive statistics
5> // scott long 03Apr2008
6>
7> version 10
8> clear all
9> macro drop _all
10> set linesize 80
11>
12> * load the data and check descriptive statistics
13> use wf-lfp, clear
14> summarize
15>
16> log close
17> exit
Opening your log file
Line 1 can best be explained after I go through the rest of the program. Line 2 opens
a log file to record the output. I recommend that you give the log file the same name
as the do-file that created it (the prefix only, not the suffix .do). Because I have not
specified a directory, the log is created in the current working directory. The replace
option tells Stata that if wf3-example.log already exists, replace it. This is handy if
you need to rerun the do-file while debugging it. If you do not add replace, the second
time you run the program you get the error
. log using wf3-example, text
log file already open
r(604);
The text option specifies that the output is written in plain text rather than in Stata
Markup and Control Language (SMCL). Although SMCL output looks nicer, only Stata
can print it, so I do not use it. Line 16 closes the log file. This means that Stata will
stop sending output to the file.
Blank lines
Lines 3, 6, 11, and 15 are blank to make the program easier to read. If you do not
find blank lines to be useful, do not use them.
Comments about what the do-file does
Lines 4 and 5 explain what the do-file does so that this information will be included
in your log file. I recommend including the name of the do-file, who wrote the do-file,
the date it was written, and a summary of what the do-file does.3.2.3 Templates for do-files 65
Controlling Stata
Lines 7-10 affect the way Stata runs. Line 7 indicates the version of Stata being
used. Because version 10 is in the file, if you run this do-file in later versions of Stata,
you should get exactly the same output that you got today using Stata 10. Because
the version command is located after the log using command, version 10 will be
included in the log that allows you to verify from printed output which version of Stata
was used. Lines 8 and 9 reset Stata so that your do-file will run as if it was the
first thing done after starting Stata. This is important for making your do-file robust.
Many commands leave information in memory that you do not want to affect your do-
file. clear all removes from memory the data, value labels, matrices, scalars, saved
results, and more. For a full description, see help clear. In Stata 9, you use the
command clear, not clear all. Oddly, clear all clears everything but macros from
memory. To drop the macros, you use macro drop _all. Line 10 sets the line size for output
to 80 columns. Even if the default line size for my copy of Stata is 80 (see appendix A
for how to set the default line size), I want to explicitly set the line size in the do-file
so that it will generate output that is formatted the same way if it is run with a copy
of Stata with a different default line size. To see why this is important, you can try
running tabulate for variables with a lot of categories using different line sizes.
Your commands
Your commands begin at line 12 and include comments to describe what you are
doing.
Ending the do-file
Line 17 is critical. Stata only executes a command when it encounters a carriage
return.6 Without a carriage return at the end of line 16, the log close command does
not run and your log file remains open. Although line 17 could be anything, including
a blank, I prefer the exit command. This command tells Stata to terminate the do-
file (i.e., do not run any more commands in the do-file). For example, I could include
comments and commands after exit, such as
exit
1) Double check how the sample is selected.
2) Consider running these commands.
describe
summarize
tab1 _all
The lines after exit are ignored by Stata.
5. I used to place the version command immediately after line 1, as suggested by Long and Freese
(2006). When writing this book, a colleague showed me a problem that would have been simple to
resolve if the version command had been part of the output that he had been trying to replicate.
Instead, it took him two weeks to figure out why he could not replicate his earlier results.
6. The language of computers is filled with anachronisms. On a typewriter, the mechanism that holds
the paper using a platen is called the carriage. When you type to the end of a line, you "return
the carriage” to advance to the next line. Even though we no longer use a carriage to advance to
a new line, we refer to the symbol that is created by pressing Enter as a carriage return.
capture log close
Now I can explain why line 1 is needed. Suppose that the first time I ran wf3-
example.do, the program terminated with an error before executing log close in
line 16. The log file would be left open, meaning that new results generated by Stata
would continue to be sent to the log. When I rerun the program, assuming for the
moment that line 1 is not in the do-file, line 2 would cause the error r(604): log file
already open because I am trying to open a log file when a log file is already open. To
avoid this error, I could add the command log close before the log using command.
If I do this, the first time I run the do-file, the log close command will generate the
error r(606): no log file open because I am trying to close a log file when no log
file is open. The capture command in line 1 means “if this line generates an error,
ignore it and continue to the next line". If you do not completely follow what I just
explained, do not worry about it. Just get in the habit of beginning your do-files with
the command capture log close.
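capture is useful beyond log files. Any command that might fail harmlessly can be
prefixed the same way; here is a minimal sketch, where agesq may or may not already
exist:

    capture drop agesq      // no error if agesq does not exist
    generate agesq = age*age

This pair can be rerun as often as you like while debugging.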
A template for simple do-files
Based on the principles just considered, here is a template for simple do-files (file:
wf3-simple.do).
capture log close
log using _name_, replace text
// _name_.do:
// scott long _date_
version 10
clear all
macro drop _all
set linesize 80
* my commands start here
log close
exit
I save this file in my working directory or to my computer's desktop, perhaps with the
name simple.do. When I want to create a new do-file, I load simple.do into my editor,
change _name_ and _date_, and write my program. I save the file with its new name,
say, myprogram.do, and then from the Command window, type run myprogram.
A more complex do-file template
For most of my work, I use a more elaborate template (file: wf3-complex.do):
capture log close
log using _name_, replace text
// program: _name_.do
// task:
// project:
// author: _who_ \ _date_
// #0
// program setup
version 10
clear all
set linesize 80
macro drop _all
// #1
// describe task 1
// #2
// describe task 2
log close
exit
This template makes it easier to document what the do-file is doing, especially by includ-
ing numbered sections to the output for different steps of the analysis. By numbering
sections, it is easier to find things within the file and to discuss the results with others
(especially over email). When I send a log file to someone, I might write: “Do you think
the results at #6 are consistent with our earlier findings?” If you start numbering parts
of your do-files, I think you will find that it saves a lot of time and confusion.
There are many effective templates that can be used. The most important thing is
to find a template that you like and use it consistently.
Aside on text editors
A full-featured text editor is probably the most valuable tool you can have
for data analysis. A good editor speeds up your work, makes your do-files
more uniform, and helps you debug programs. Although text editors have
hundreds of valuable features, here are a few that are particularly useful.
First, many editors can automatically insert text into a file. I have mine
set up so that the keystroke Alt+0 inserts the simple do-file template (so I
do not have to remember where I stored the template) and Alt+1 inserts
the more complex template. Then the editor automatically inserts the
date. Second, sophisticated text editors have a feature known as syntax
highlighting that helps you find errors. These editors recognize predefined
words and display them in different colors. For example, if you type the line
oloigit warm wc hc age k5, the word oloigit will not be highlighted
because it is not a Stata command. If you had typed ologit, the word
would be highlighted because it is a valid command name. This is very
handy for finding and fixing errors before they occur. The Workflow web
site provides additional information.
3.3 Debugging do-files
In a perfect world, your do-files run the first time, every time. In practice, your do-
files generate errors and probably lots of errors. Sometimes it is frustrating and time-
consuming to determine the source of an error. While the principles for writing legible
and robust do-files should make errors less likely and make it easier to resolve errors
when they occur, you are still likely to spend more time than you like debugging your
do-files. This section discusses how to debug do-files for both simple and complicated
errors. I begin by reviewing a few simple strategies for finding problems. The section
ends with two extended examples that illustrate how to fix more subtle bugs.7
3.3.1 Simple errors and how to fix them
To get started, I want to illustrate some very common errors.
Log file is open
If you have a log file open (for example, it might be left open because your last do-file
ended with an error) and you try to open a log file, you get the message
. log using example1, replace
log file already open
r(604);
The simplest solution is to place capture log close at the top of your do-file.
Log file already exists
Because do-files are often run several times before they are debugged, you want to
replace the log file that contains an error with the output from the corrected do-file. If
your do-file contains the command
log using example2, text
and that log file already exists, you get the error
file e:\workflow\work\example2.log already exists
r(602);
The solution is the option replace:
log using example2, text replace
7. One theory of the origin of the term “bug” refers to a two-inch moth taped to Grace Murray
Hopper’s research log for September 9, 1947. This moth shorted a circuit in the Harvard University
Mark II Aiken Relay Calculator (Kanare 1985).
Incorrect command name
The command
loget lfp k5 k618 age wc hc lwg inc
generates the error
unrecognized command: loget
r(199);
The message makes it clear that something is wrong with the word loget and you
are likely to quickly see that you mistyped logit. If you did not understand what
unrecognized command meant, Stata can provide more information. In the Results
window, r(199) appears in blue. Blue indicates that the highlighted word is linked to
more information. If you click on r(199), a Viewer window opens with the information:
[P] error . . . . . . . . . . . . . . . . . . . . . . . .  Return code 199
        unrecognized command;
        Stata failed to recognize the command, program, or ado-file name,
        probably because of a typographical or abbreviation error.
Sometimes, unrecognized commands will not be easy to see. For example,
. tabl lfp k5
unrecognized command: tabl
(199);
The problem is that I typed tabl instead of tab1, which can look very similar with
some fonts. When I get an error related to the name of a command and everything
looks fine, I often just retype the command and find that the second time I typed the
command correctly.
Incorrect variable name
In the following logit command, the name of one of the variables is incorrect.
. logit lfp kO5 k618 age wc hc lwg inc
variable kO5 not found
r(111);

I meant to type k05 (kay-zero-five), not kO5 (kay-capital-oh-five). If you think a name
is correct but you are getting an error, there are a few things to try. Suppose the error
refers to a name beginning with "k". Type describe k* to describe all the variables
that begin with k. Verify that the name in your do-file is listed. If it is and you still
do not see the problem, you can click on the variable name in the Variables window.
This will paste the name to the Command window. Copy the name from here to your
do-file.
Stata reports only one incorrect name at a time. If you fixed the command above to
logit lfp k05 k618 age wc hc lwg inc
and k618 was the wrong name (e.g., it was supposed to be k0618), a new r(111) error
message is generated.
Incorrect option
If you type an incorrect option, you get an error message like this:
logit lfp k5 k618 age wc hc lwg inc, logoff
option logoff not allowed
r(198);
I wanted to turn off the iteration log for logit but incorrectly thought the option was
logoff. To find the correct option, I could 1) try another name for the option, 2) type
the help logit command from the Command window, 3) open the logit dialog box
and find the option name, or 4) check the manual. Each would show you that the option
I wanted was nolog.
Missing comma before options
This error confuses many people learning Stata:
. logit lfp wc nowc k5 k618 age hc lwg inc nocon
variable nocon not found
r(111);
The problem is that you need a comma before the nocon option:
logit lfp wc nowc k5 k618 age hc lwg inc, nocon
3.3.2 Steps for resolving errors
The errors above were easy to solve. In other cases, it can be very difficult to determine
from the error message what is wrong. In later sections, I give examples of the multiple
steps you might need to track down a problem. In this section, I provide some general
strategies that you should consider if you do not see an obvious solution for the error
you encountered.
Step 1: Update Stata and user-written programs
Before spending too much time debugging an error, make sure that your copy of Stata
and any installed user-written ado-files are up to date. Your error might be caused
by an error in the command that you are using, not by a mistake in your do-file.
Updating Stata is simple, unless you are running Stata on a network. If you are on
a network, you will have to talk to your network administrator (see appendix A for
further information). While Stata is running and you are connected to the Internet,
run the update all command and follow the instructions. This will update official
Stata, including the executable, the help files, and the ado-files. If the do-file you are
debugging uses commands written by others (e.g., listcoef in the SPost package),
you should update those programs as well. The easiest approach is the adoupdate
command, which was introduced in Stata 9.2. Typing adoupdate checks whether your
user-written ado-files are up to date. You can then update packages individually with
adoupdate package-name, update or update all packages at once with adoupdate,
update.
Unfortunately, this handy command only works with user-written packages where the
author has made the package compatible with adoupdate. If some of your user-written
commands are not checked with the adoupdate command (you will know this if they
are not listed after the command is entered), you can run findit command-or-package
and follow the instructions you receive.
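Putting these pieces together, a typical update session might look like this sketch;
spost9_ado, the name of the SPost package, is shown only as an illustration:

    . update all                     // update official Stata
    . adoupdate                      // check which user-written packages are out of date
    . adoupdate spost9_ado, update   // update one package
    . adoupdate, update              // or update all packages that adoupdate can check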
Step 2: Start with a clean state
When things do not work and your initial attempts fail to fix the problem, make sure
that there is no information left in memory that is causing the problem (e.g., a matrix
that should not be there). There are several ways to do this.
clear all and macro drop _all

From the command line, type clear all and macro drop _all or add them to
your do-file. These commands tell Stata to forget everything that happened since you
launched Stata. In Stata 9, use clear instead of clear all.
Restart Stata
If clear all and macro drop _all do not fix the problem, exit Stata, relaunch
Stata, and try the program again.
Rebooting
Next reboot your computer and try the program again. After rebooting and before
loading Stata, close all programs, including utilities such as macro programs, screen
capture utilities, and so on. This might seem extreme, but if I had followed this ad-
vice three years ago, I would have saved myself and a very patient econometrician at
StataCorp a great deal of trouble.
Use another computer
Still not working? You might try the program on another computer that is configured
differently than your own. If it works there, the problem is caused by the way Stata is
installed on your system.
Step 3: Try other data
Some errors are caused by problems in the dataset, such as perfect collinearity or zero
variance for a variable. In other cases, the specific names or labels could be causing
problems. The SPost command mlogview used to generate an error when certain char-
acters were included in the value labels. If you get the same error using another dataset,
you can be fairly sure that the problem is in your commands. If the error does not occur
with the new data, focus on characteristics of your data.
Step 4: Assume everything could be wrong
It is easy to ignore parts of your program that you are “sure” are right. Most people
who do a lot of programming have learned this lesson the hard way. As we will see, some
error messages point to a part of the program that is actually correct. If the obvious
solutions to an error do not work, review the entire program.
Step 5: Run the program in steps
I usually write a program a few commands at a time, rather than typing 100 lines at
once. For example, I start with a do-file that only loads the data and runs descriptive
statistics. If that works, I add the next set of commands. If that works, I add the
next lines, and so on. This approach does not work as well if you have an extremely
large sample or you are using a command that is computationally very demanding (e.g.,
asmprobit). In such cases, you can test your program using a small sample or block out
parts of the program that have been tested.
Aside on selecting a random subsample
If you need a small sample for debugging your program, here is how you
can take a random sample from your data (file: wf3-subsample.do):
. use wf-lfp, clear
(Workflow data on labor force participation \ 2008-04-02)
. set seed 11020
. generate isin = (runiform()>.8)
. label var isin "1 if in random sample (seed 11020)"
. label def isin 0 0_NotIn 1 1_InSample
- label val isin isin
. keep if isin
(601 observations deleted)
. tabulate isin, missing

     1 if in
      random
      sample
       (seed
      11020) |      Freq.     Percent        Cum.
-------------+-----------------------------------
  1_InSample |        152      100.00      100.00
-------------+-----------------------------------
       Total |        152      100.00
. label data "20% subsample of wf-lfp."
. notes: wf3-subsample.do \ jsl 2008-04-03
. save x-wf3-subsample, replace
file x-wf3-subsample.dta saved
The command set seed 11020 sets the seed for the random-number
generator and is important if you want to create exactly the same sample
later. You can pick any number for the seed. The command generate
isin = (runiform() > .8) creates a binary variable equal to 1 if the
random number is greater than .8. Because runiform() creates a uniform
random variable with values from 0 to 1, isin will be 1 about 20% of the
time. If you want a larger sample, replace .8 with a smaller number; for
a smaller sample, replace .8 with a larger number. The last part of the
program saves a dataset that contains roughly 20% of the original sample.
Note: The runiform() function was introduced in Stata 10.1. If
you are using Stata 10, but have not updated to Stata 10.1 and you are
connected to the Internet, run the update all command and follow the
instructions. If you are running Stata 9, use the uniform() function
instead of runiform().
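If you do not need the isin indicator, Stata's sample command is a quicker
alternative; here is a minimal sketch:

    set seed 11020
    sample 20    // keep a 20% random sample and drop the other observations

The result is a dataset about the same size as the one created above, but
without a record of which cases were kept.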
Step 6: Exclude parts of the do-file
If you have a long do-file that is generating an error, it is often useful to run only part
of the file. This can be done using comments. You can add a * to the front of any line
you do not want to run. For example,
* logit lfp wc hc
To comment out a series of lines, use /* and */. Everything between the opening /* and
the closing */ is ignored when you run the do-file. This technique is used extensively
with the extended examples presented later in this section.
Step 7: Starting over
Sometimes the fastest way to fix a problem is to start over. You checked the syntax
of each command, you clicked on the blue error message to make sure you understand
what the error means, you showed the problem to others who see no problems, yet the
program keeps generating an error. This is a good time to start over. Hopefully, if
you re-create the program without looking at the original version, you will not make
the same mistake again. Of course, you might make the same error again. But, if you
already tried everything you can think of, it is worth a try.
Why does this method sometimes work? Some errors are caused by subtle typing
errors that you do not see even when looking at the code very carefully. Research on
reading has shown that people construct much of what they read from what they think
they should be reading. This is why it can be so hard to find typos. For example,
you have written tabl rather than tab1 or tried to analyze varO1 or var1 instead of
var01. You can stare at this a long time and still not see it. If you start over, retyping
all commands and variable names, there is a chance that you will not make the same
typing error again. When starting over, here are some things to keep in mind.
Throw out all the original code
It is tempting to keep some of your original code that you “know” is correct. I once
spent hours debugging a complex program until I discovered that the error was in a
part of the program that was so simple and "obviously correct" that I skipped over it.
Use a new file
Start with a new file, rather than simply deleting everything in the original do-file.
Why? It is possible to have a problem in a do-file that is caused by characters that are
not visible and that your editor cannot delete. Your new program might look exactly
like the old one, but a bit comparison of the two files will show that the files are different.
Try alternative approaches
When starting over, I often use a different approach rather than trying to do exactly
what I did before. For example, if I think the command name is tabl and not tab1, I
will unintentionally enter the same incorrect command again. If instead I use a series of
tab commands, the problem is resolved.
Step 8: Sometimes it is not your mistake
It is possible that there is an error in Stata or a user-written program that you are using.
If you have tried everything you can think of to fix the problem, you might try posting
the problem on Statalist (http://www.stata.com/statalist/), checking Stata’s frequently
asked questions (http://www.stata.com/support/faqs/), or contacting technical support
at StataCorp (http://www.stata.com/support/). Before you do this, read section 3.4
about getting the most out of asking for help.
3.3.3 Example 1: Debugging a subtle syntax error
In this section, I go through the steps I would use to debug problems when the er-
ror message does not immediately point to a solution. I want to plot the prestige
of a person’s doctoral department against the prestige of the person’s first academic
job. These commands, which are so long they run off the page, were extracted from
wf3-debug-graph1.do:
use wf-acjob, clear
twoway (scatter job phd, msymbol(smcircle_hollow) msize(small)), ///
ytitle(Where do you work?) yscale(range(1 5.)) ylabel(1(1)5, angle(ninety)) xtitle(Where did yc
xscale(range(1 5)) xlabel(1,5) caption(wf3-debug-graph1.do 2006-03-17, size(small)) scheme(s2me
The error message is
option 5 not allowed
r(198);
Because the message confuses me, I click on r(198) and obtain
[P] error . . . . . . . . . . . . . . . . . . . . . . . .  Return code 198
        invalid syntax;
        _________ invalid;
        range invalid;
        _________ invalid obs no;
        invalid filename;
        invalid varname;
        _________ invalid name;
        multiple by's not allowed;
        _________ found where number expected;
        on or off required;
    All items in this list indicate invalid syntax. These errors are
    often, but not always, due to typographical errors. Stata attempts
    to provide you with as much information as it can. Review the
    syntax diagram for the designated command.
    In giving the message "invalid syntax", Stata is not very helpful.
    Errors in specifying expressions often result in this message.
This message does not help much (even Stata warns me that the error message is not
very helpful!), but it suggests that the problem might be related to an option that
contains a 5.
Aside on why error messages can be misleading
Error messages do not always point to the real problem. The reason is that
Stata knows how to parse the syntax of correct commands, not incorrect
commands. Although Stata tries to make sense out of incorrect commands,
it might not succeed. Think of error messages as suggestions that might
point to the problem or that might be misleading.
The first thing I do to debug this program is to reformat the command so that it is
easier to read (file: wf3-debug-graph2.do):
twoway (scatter job phd, msymbol(smcircle_hollow) msize(small)), ///
    ytitle(Where do you work?) yscale(range(1 5.)) ///
    ylabel(1(1)5, angle(ninety)) ///
    xtitle(Where did you graduate?) xscale(range(1 5)) xlabel(1,5) ///
    caption(wf3-debug-graph2.do 2006-03-17, size(small)) ///
    scheme(s2manual) aspectratio(1) by(fem)
The command is easier to read, but it generates the same error because I only changed
the formatting. If you have sharp eyes and a good understanding of the twoway com-
mand, you might see the error, particularly because the error message suggests that the
problem has something to do with a 5. Still, let us suppose that I do not know what is
causing the problem.
Next I check that the variables are appropriate for this type of plot by creating a sim-
ple graph from the command line using the same variables (file: wf3-debug-graph3.do):
scatter job phd
This works, so I know that the problem is not with the data. Next I comment out part
of the original command using the /* and */ delimiters. My strategy is to comment
out most of the command and verify that the program runs. Then I gradually add back
parts of the original code until I find exactly which part of the command is causing the
problem. Often this makes it simple to see what is causing the error. The next time I
try the program it looks like this (file: wf3-debug-graph4.do):
twoway (scatter job phd, msymbol(smcircle_hollow) msize(small)), /* ///
    ytitle(Where do you work?) yscale(range(1 5.)) ///
    ylabel(1(1)5, angle(ninety)) ///
    xtitle(Where did you graduate?) xscale(range(1 5)) xlabel(1,5) ///
    caption(wf3-debug-graph4.do 2008-04-03, size(small)) ///
    scheme(s2manual) aspectratio(1) by(fem) */
This works and adds symbols to the graph. Next I include options that refine the y axis
(file: wf3-debug-graph5.do):
twoway (scatter job phd, msymbol(smcircle_hollow) msize(small)), ///
    ytitle(Where do you work?) yscale(range(1 5.)) ///
    ylabel(1(1)5, angle(ninety)) /* ///
    xtitle(Where did you graduate?) xscale(range(1 5)) xlabel(1,5) ///
    caption(wf3-debug-graph5.do 2008-04-03, size(small)) ///
    scheme(s2manual) aspectratio(1) by(fem) */
This works too, so I decide that the error is not caused by the 5s in this part of my
program. Next I uncomment the commands controlling the x axis
(file: wf3-debug-graph6.do):
twoway (scatter job phd, msymbol(smcircle_hollow) msize(small)), ///
    ytitle(Where do you work?) yscale(range(1 5.)) ///
    ylabel(1(1)5, angle(ninety)) ///
    xtitle(Where did you graduate?) xscale(range(1 5)) xlabel(1,5) /* ///
    caption(wf3-debug-graph6.do 2008-04-03, size(small)) ///
    scheme(s2manual) aspectratio(1) by(fem) */
This generates the original error, so I conclude that the problem is probably in this
segment of code:
xtitle(Where did you graduate?) xscale(range(1 5)) xlabel(1,5)
The xtitle() option looks fine. I could verify this by rerunning the program after
commenting out the xscale() and xlabel() commands. Because it is hard to make
a mistake with a simple xtitle() option, I decide not to do this (yet). I assume that
the problem is caused by the xscale() or xlabel() options. Looking closely, I see the
error is with xlabel(1,5). Although this looks like a reasonable way to indicate that
labels should go from 1 to 5, the correct syntax is xlabel(1(1)5). I change this and
the program does just what I want it to do (file: wf3-debug-graph7.do).
If I did not see that the error was caused by xlabel(1,5), I would run the command
with only the xtitle() and xscale() options included (file: wf3-debug-graph8.do):
twoway (scatter job phd, msymbol(smcircle_hollow) msize(small)), ///
    ytitle(Where do you work?) yscale(range(1 5.)) ///
    ylabel(1(1)5, angle(ninety)) ///
    xtitle(Where did you graduate?) xscale(range(1 5)) /* xlabel(1,5) ///
    caption(wf3-debug-graph8.do 2008-04-03, size(small)) ///
    scheme(s2manual) aspectratio(1) by(fem) */
This also runs, so I would know that the problem is with the xlabel() option.
3.3.4 Example 2: Debugging unanticipated results
You might have a do-file that runs without error but produces strange or unanticipated
results. To illustrate this type of problem, I use an example motivated by a question
I received from a sophisticated Stata user.8 I have nine binary indicators of functional
limitations (e.g., Do you have problems standing? Walking? Reaching?). Before trying
8. Claudia Geist kindly allowed me to use this example. I have changed the data and variables, but
the problem is the same one she encountered.
to scale these measures, I want to determine if there are certain combinations that occur
commonly. For example, do troubles with walking tend to occur with other problems
in lower-body function? Do some limitations tend to occur in pairs, but less often by
themselves? And so on. I start by looking at the percentage of 1s for each variable (file:
wf3-debug-precision.do). Because the variables are binary, I can simply compute the
summary statistics:
. use wf-flims, clear
(Workflow data on functional limitations \ 2008-04-02)
. summarize hnd hvy lft rch sit std stp str wlk
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         hnd |      1644     .169708    .3754903          0          1
         hvy |      1644    .4288321    .4950598          0          1
         lft |      1644    .2475669    .4317301          0          1
         rch |      1644    .1703163    .3760248          0          1
         sit |      1644    .2104623     .407761          0          1
         std |      1644    .3607056    .4803514          0          1
         stp |      1644    .3643552    .4813953          0          1
         str |      1644    .2974453    .4572732          0          1
         wlk |      1644    .2706813    .4444469          0          1
The distributions for the nine variables individually (or even 72 tabulations between
pairs of variables) do not tell me all I want to know about how limitations cluster. A
seemingly quick way to look at this is to create a new variable that combines the nine
binary variables. For example, with the variables str and wlk, I create the variable
strwlk:
generate strwlk = 10*str + wlk

strwlk is 0 if both wlk and str are 0, 1 if only wlk is 1, 10 if only str is 1, and 11 if
both are 1.
. tabulate strwlk, missing

     strwlk |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,091       66.36       66.36
          1 |         64        3.89       70.26
         10 |        108        6.57       76.82
         11 |        381       23.18      100.00
------------+-----------------------------------
      Total |      1,644      100.00
Seems easy, so I extend the idea to the nine variables:
generate flimall = hnd*100000000 + hvy*10000000 + lft*1000000 ///
    + rch*100000 + sit*10000 + std*1000 + stp*100 + str*10 + wlk
label var flimall "hnd-hvy-lft-rch-sit-std-stp-str-wlk"
Next I tabulate flimall where the value 0 indicates no limitations in any function;
111,111,111 indicates limitations with all activities; and other combinations of 0s and
1s reflect other patterns of limitations. Here is the output:3.3.4 Example 2: Debugging unanticipated results 79
. tabulate flimall, missing

 hnd-hvy-lft |
 -rch-sit-st |
 d-stp-str-w |
          lk |      Freq.     Percent        Cum.
-------------+-----------------------------------
           0 |        715       43.49       43.49
           1 |          5        0.30       43.80
          10 |          8        0.49       44.28
          11 |          2        0.12       44.40
  (output omitted)
     1100111 |          1        0.06       54.08
     1101100 |          1        0.06       54.14
    1.00e+07 |         86        5.23       59.37
  (output omitted)
    1.10e+08 |          7        0.43       88.56
    1.11e+08 |         15        0.91       91.42
  (output omitted)
-------------+-----------------------------------
       Total |      1,644      100.00
Unfortunately, the large numbers are in scientific notation and I lose the information
that I want. To fix this, I create a string variable:
generate sflimall = string(flimall, "%16.0f")

The %16.0f indicates that I want the string to correspond to a 16-digit number without
decimal points (for details, see help format or [D] format; also see section 6.4.5, which
discusses how data are stored in Stata). I add a label and tabulate the new variable:
label var sflimall "hnd-hvy-lft-rch-sit-std-stp-str-wlk"
tabulate sflimall, missing
I see something very peculiar.
 hnd-hvy-lft |
 -rch-sit-st |
 d-stp-str-w |
          lk |      Freq.     Percent        Cum.
-------------+-----------------------------------
           0 |        715       43.49       43.49
           1 |          5        0.30       43.80
          10 |          8        0.49       44.28
         100 |         28        1.70       45.99
  (output omitted)
    10000000 |         86        5.23       53.83
   100000000 |         15        0.91       54.74
    10000001 |          4        0.24       54.99
   100000096 |          4        0.24       55.23
  (output omitted)
     1000001 |          1        0.06       55.29
    10000010 |          5        0.30       55.60
    10000011 |          5        0.30       55.90
  (output omitted)
-------------+-----------------------------------
       Total |      1,644      100.00
The values are supposed to be all 0s and 1s, but I have the number 100000096. To figure
out what went wrong, I run tab1 hnd-wlk, missing to verify that the variables only
have values of 0 and 1. If I find four cases with 9s for str and 6s for wlk, I know that I
have a problem with my original data, but the data look fine. Next I clean up the code
to make it easier to find typos:
generate flimall = hnd*100000000 ///
    + hvy*10000000 ///
    + lft*1000000 ///
    + rch*100000 ///
    + sit*10000 ///
    + std*1000 ///
    + stp*100 ///
    + str*10 ///
    + wlk
The code looks fine, so I try the same approach but with only four variables. A good
strategy when debugging is to see if you can get a similar but simpler program to work.
. generate flimall = std*1000 ///
>     + stp*100 ///
>     + str*10 ///
>     + wlk
. generate sflimall = string(flimall,"%9.0f")
. label var sflimall "std-stp-str-wlk"
. tabulate sflimall, missing

 std-stp-str |
        -wlk |      Freq.     Percent        Cum.
-------------+-----------------------------------
           0 |        866       52.68       52.68
           1 |         16        0.97       53.65
          10 |         24        1.46       55.11
         100 |         80        4.87       59.98
        1000 |         73        4.44       64.42
        1001 |         13        0.79       65.21
         101 |          8        0.49       65.69
        1010 |         15        0.91       66.61
        1011 |         25        1.52       68.13
          11 |         13        0.79       68.92
         110 |         24        1.46       70.38
        1100 |         72        4.38       74.76
        1101 |         27        1.64       76.40
         111 |         20        1.22       77.62
        1110 |         45        2.74       80.35
        1111 |        323       19.65      100.00
-------------+-----------------------------------
       Total |      1,644      100.00
Again this looks fine. I continue adding variables, and things still work with eight
variables. Further, it does not matter which eight I choose. I conclude that there is
a problem going from eight to nine variables. The problem is that the nine-digit
number I am creating with flimall is too large to be held accurately. Essentially, this
means that 100,000,096 (the number above that seemed odd) is only an approximation
to the correct result 100,000,100. Indeed, the number that raised suspicions is off by only
4 out of over 100 million. The solution is to store the information in double precision.
With the addition of one word, the problem is fixed:
generate double flimall = hnd*100000000 ///
    + hvy*10000000 ///
    + lft*1000000 ///
    + rch*100000 ///
    + sit*10000 ///
    + std*1000 ///
    + stp*100 ///
    + str*10 ///
    + wlk
See section 6.4.5 for more information on types of variables.
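You can verify the precision problem directly with display; this two-line check is my
addition and is not part of wf3-debug-precision.do:

    . display %16.0f float(100000100)
           100000096
    . display %16.0f 100000100
           100000100

The float() function rounds its argument to float precision, which carries only about
seven significant digits, while display itself works in double precision.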
3.3.5 Advanced methods for debugging
If things are still not working, you can trace the error. Tracing refers to looking at
each of the steps taken by Stata in executing your program (i.e., you trace the steps
the program takes). This shows you what the program is doing behind the scenes,
often revealing the specific step that causes the problem. To trace a program, type the
command set trace on. Stata echoes every line of code it runs, both from your do-file
and from your ado-files. To turn tracing off, type set trace off. For details on how
to use this powerful feature, type help trace or see [P] trace.
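For example, a debugging session with tracing might look like this sketch, where
myprogram.do is a placeholder name:

    set trace on
    set tracedepth 1    // echo only top-level commands, not nested ado-file code
    do myprogram
    set trace off

Setting the trace depth keeps the output manageable; without it, tracing a command
that calls many ado-files can produce thousands of lines.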
3.4 How to get help
At some point, you will need to ask for help. Here are some things that make it easier
for someone to help you and increase your chances of getting the help you need.
1. Try all the suggestions above to debug your program. Read the manual for the
commands related to your error.
2. Make sure that your copy of Stata and user-written programs are up to date.
3. Write a brief description of the problem and the things you have done to resolve
the problem (e.g., updated Stata, tried a different dataset). I often solve my own
problems when I am composing a detailed email asking someone for help.
4. Create a do-file that generates the error using a small dataset. Do not send a
huge dataset as an attachment. Make the do-file self-contained (e.g., it loads the
dataset) and transportable (e.g., it does not hardcode the directory for the data).
5. Send the do-file, the log file in text format, and the dataset to the person you are
asking for help.
When you ask for help, the clearer and more detailed the information you provide, the
greater the chance that someone will be willing and able to help you.
3.5 Conclusions
Although this chapter contains many suggestions on using Stata, it only touches on
the many features in the program. If you spend a lot of time with Stata, it is worth
browsing the manuals. Often you will find a command or feature that solves a problem.
I know that in writing this book I discovered many useful commands that I was un-
aware of. If you do not like reading manuals, consider a NetCourse (web course) from
StataCorp (http://www.stata.com/netcourse/). The investment of time in learning the
tools usually saves time in the long run.

4 Automating your work
A great deal of data management and statistical analysis involves doing the same task
multiple times. You create and label many variables, fit a sequence of models, and
run multiple tests. By automating these tasks, you can save time and prevent errors,
both of which are fundamental to an effective workflow. In this chapter, I discuss six tools for
automation in Stata.
Macros: Macros are simply abbreviations for a string of characters or a
number. These abbreviations are amazingly useful.
Saved results: Many Stata commands save their results in memory. This
information can be retrieved and used to automate your work.
Loops: Loops are a way to repeat a group of commands. By combining
macros with loops, you can speed up tasks ranging from creating variables
to fitting models.
The include command: include inserts text from one file into another, which
is useful when the same commands are used multiple times in do-files.
Ado-files: Ado-files let you write your own commands to customize Stata,
automate your workflow, and speed up routine tasks.
Help files: Although help files are primarily used to document ado-files, they
can also be used to document your workflow.
Macros, saved results, and loops are essential for chapters 5-7. Although include,
ado-files, and help files are very useful, they are not essential for later chapters. Still, I
encourage you to read these sections.
4.1 Macros
Macros are the simplest tool for automating your work. A macro assigns a string of
text or a number to an abbreviation. Here is a simple example. I want to fit the model
logit y var1 var2 var3
I can create the macro rhs with the names of the independent or right-hand-side vari-
ables:
local rhs "var1 var2 var3"
Then I can write the logit command as
logit y `rhs'

where the ` and ' indicate that I want to insert the contents of the macro rhs. The
command logit y `rhs' works exactly the same as logit y var1 var2 var3. In
the examples that follow, I show you many ways to use macros. For a more technical
discussion, see [P] macro.
4.1.1 Local and global macros
Stata has two types of macros, local macros and global macros. Local macros can be
used only within the do-file or ado-file in which they are defined. When that program
ends, the local macro disappears. For example, if I create the local rhs in step1.do,
that local disappears as soon as step1.do ends. By comparison, a global macro persists
until you delete it or exit Stata. Although global macros can be useful, they can lead
to do-files that unintentionally depend on a global macro created by another do-file or
from the Command window. Such do-files are not robust and can lead to unpredictable
results. Accordingly, I almost exclusively use local macros.
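To see the difference, consider this sketch with two hypothetical do-files. Suppose
step1.do contains

    local rhs "var1 var2 var3"
    global grhs "var1 var2 var3"

A do-file run afterward finds the global but not the local:

    display "`rhs'"     // prints an empty line: the local died with step1.do
    display "$grhs"     // prints var1 var2 var3

This persistence is exactly what makes global macros risky.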
Local macros
Local macros that contain a string of characters are defined as
local local-name "string"
For example,
local rhs "var1 var2 var3 var4"
A local macro can also be set. equal to a numerical expression:
local local-name = expression
For example,
local ncases = 198
The content of a macro is inserted into your do-file or ado-file by entering `local-name'.
For example, to print the contents of the local rhs, type
. display "The local macro rhs contains: ~rhs“"
The tocal macro rhs contains: vari var2 var3 var4
or type
. display "The local ncases equals: “neases“”
The local ncases equals: 1984.1.1 Local and global macros 85
The opening quote ` and closing quote ' are different symbols that look similar with
some fonts. To make sure you have the correct symbols, load the do-file wf4-macros.do
from the Workflow package and compare the symbols it contains with those you can
create from your keyboard.
Global macros
Global macros are defined much like local macros:
global global-name ," string"
global global-name = expression
For example,
global rhs "var1 var2 var3 var4"
global ncases = 198
The content of a global macro is inserted by entering $global-name. For example,
. display "The local macro rhs contains: rhs"
The local macro rhs contains: varl var2 var3 var4
or
. display "The local ncases equals: $ncases"
The local ncases equals: 198
Using double quotes when defining macros
When defining a macro containing a string, you can include the string in quotes. For
example,
local myvars "y xi x2"
Or you can remove the quotation marks:
local myvars y x1 x2
I prefer using quotation marks because they clarify where the string begins and ends.
Plus text editors with syntax highlighting can show everything that appears between
quotation marks in a different color, which helps when debugging programs.
Creating long strings
You can create a macro that contains a long string in one step, such as
local demogvars "female black hispanic age agesq edhighschi edcollege edpostgrad"86 Chapter 4 Automating your work
‘The problem is that long commands are truncated or wrapped when viewed on screen
or printed. As shown on page 58, this can make it harder to debug your program. To
keep lines shorter than 80 columns (the local command above is 81 columns wide), I
build long macros in steps. For example, I can create demogvars by starting with the
first five variable names:
local demogvars "female black hispanic age agesq"
The next line takes the current content of demogvars and adds new names to the end.
Remember, the content of demogvars is inserted by `demogvars':
local demogvars "`demogvars' edhighschl edcollege edpostgrad"
Additional names can be added in the same way.
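For example, one more line of the same form appends two more names (the names
married and nkids are hypothetical):

    local demogvars "`demogvars' married nkids"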
4.1.2 Specifying groups of variables and nested models
Macros can hold the names of variables that you are analyzing. Suppose that I want
summary statistics and estimates for a logit of lfp on k5, k618, age, wc, hc, lwg, and
inc. Without macros, I enter the commands like this (file: wf4-macros.do):
summarize lfp k5 k618 age wc hc lwg inc
logit lfp k5 k618 age wc hc lwg inc
If I change the variables, say, deleting hc and adding agesquared, I need to change
both commands:
summarize lfp k5 k618 age agesquared wc lwg inc
logit lfp k5 k618 age agesquared wc lwg inc
Alternatively, I can define a macro with the variable names:
local myvars "lfp k5 k618 age wc hc lwg inc"
Then I compute the statistics and fit the model like this:
summarize `myvars'
logit `myvars'
Stata replaces `myvars' with the content of the macro. Thus the summarize `myvars'
command is exactly equivalent to summarize lfp k5 k618 age wc hc lwg inc.
Using a local macro to specify variables allows me to change the variables being
analyzed by changing the local. For example, I can change the list of variables in the
macro myvars:
local myvars "lfp k5 k618 age agesquared wc lwg inc"
Then I can use the same commands as before to analyze a different set of variables:
summarize `myvars'
logit `myvars'
The idea of using a macro to hold variable names can be extended by using different
macros for different groups of variables (e.g., demographic variables, health variables).
These macros can be combined to specify a sequence of nested models. First, I create
macros for four groups of independent variables:
local set1_age "age agesquared"
local set2_educ "wc hc"
local set3_kids "k5 k618"
local set4_money "lwg inc"
To check that a local is correct, I display the content. For example,
. display "set3_kids: `set3_kids'"
set3_kids: k5 k618
Next I specify four nested models. The first model includes only the first set of variables
and is specified as
local model_1 "`set1_age'"
The macro model_2 combines the content of the local model_1 with the variables in
local set2_educ:
local model_2 "`model_1' `set2_educ'"
The next two models are specified the same way:
local model_3 "`model_2' `set3_kids'"
local model_4 "`model_3' `set4_money'"
Next I check the variables in each model:
. display "model_1: `model_1'"
model_1: age agesquared
. display "model_2: `model_2'"
model_2: age agesquared wc hc
. display "model_3: `model_3'"
model_3: age agesquared wc hc k5 k618
. display "model_4: `model_4'"
model_4: age agesquared wc hc k5 k618 lwg inc
Using these locals, I estimate a series of logits:
logit lfp `model_1'
logit lfp `model_2'
logit lfp `model_3'
logit lfp `model_4'
There are several advantages to using locals to specify models. First, when specifying
complex models, it is easy to make a mistake. For example, here are logit commands
for a series of nested models from a project I am currently working on. Do you see the
error?
logit y black
logit y black age10 age10sq edhs edcollege edpost incdollars childsqrt
logit y black age10 age10sq edhs edcollege edpost incdollars ///
    childsqrt bmi bmi3 bmi4 menoperi menopost mcs_12 pcs_12
logit y black age10 age10sq edhs edcollege edpost incdollars ///
    childsqrt bmi1 bmi3 bmi4 menoperi menopost mcs_12 ///
    pcs_12 sexactsqrt phys6_imp2 subj8_imp2
logit y black age10 age10sq edhs edcollege edpost incdollars ///
    childsqrt bmi1 bmi3 bmi4 menoperi menopost mcs_12 ///
    pcs_12 sexactsqrt phys8_imp2 subj8_imp2 selfattr partattr
Second, locals make it easy to revise model specifications. Even if I am successful in
initially defining a set of models by typing each variable name for each model, errors
creep in when I change the models. For example, suppose that I do not need a quadratic
term for age. Using locals, I need to make only one change:
local set1_age "age"
This change is automatically applied to the specifications of all models:
local model_1 "`set1_age'"
local model_2 "`model_1' `set2_educ'"
local model_3 "`model_2' `set3_kids'"
local model_4 "`model_3' `set4_money'"
In chapter 7, these ideas are combined with loops to simplify complex analyses.
4.1.3 Setting options with locals
I often use locals to specify the options for a command. This makes it easier to change
options for multiple commands and helps organize the complex options sometimes
needed for graphs.
Using locals with tabulate
Suppose that I want to compute several two-way tables using tabulate. This com-
mand has many options that control what is printed within the table and the summary
statistics that are computed. For my first tables, I want cell percentages, requiring the
cell option; missing values, requiring the missing option; numeric values rather than
value labels for row and column labels, requiring the nolabel option; and a chi-squared
test of independence, requiring the chi2 option. I can put these options in a local:
local opt_tab "cell miss nolabel chi2"
I use this local to set the options for two tabulate commands:
tabulate wc hc, `opt_tab'
tabulate wc lfp, `opt_tab'
I could have dozens of tabulate commands that use the same options. If I later decide
that I want to add row percentages and remove cell percentages, I need to change only
one line:
local opt_tab "row miss nolabel chi2"
This change will be applied to all the tabulate commands that use opt_tab to set the
options.
Using locals with graph
The options for the graph command can be very complicated. For example, here
is a graph comparing the probability of tenure by the number of published articles for
male and female biochemists:
[Figure omitted: probability of tenure (y axis) by number of articles (x axis), plotted separately for men and women]
Even though this is a simple graph, the graph command is complex and hard to read
(file: wf4-macros-graph.do):
graph twoway ///
    (connected pr_women articles, lpattern(solid) lwidth(medthick) ///
        lcolor(black) msymbol(i)) ///
    (connected pr_men articles, lpattern(dash) lwidth(medthick) ///
        lcolor(black) msymbol(i)) ///
    , ylabel(0(.2)1, grid glwidth(medium) glpattern(dash)) xlabel(0(10)50) ///
    ytitle("Probability of tenure") ///
    legend(pos(11) order(2 1) ring(0) cols(1))
Macros make it simpler to specify the options, to see which options are used, and to
revise them. For this example, I can create macros that specify the line options for men
and for women, the grid options, and the options for the legend:
local opt_linF "lpattern(solid) lwidth(medthick) lcolor(black) msymbol(i)"
local opt_linM "lpattern(dash) lwidth(medthick) lcolor(black) msymbol(i)"
local opt_ygrid "grid glwidth(medium) glpattern(dash)"
local opt_legend "pos(11) order(2 1) ring(0) cols(1)"
Using these macros, I create a graph command that I find easier to read:
graph twoway ///
    (connected pr_women articles, `opt_linF') ///
    (connected pr_men articles, `opt_linM') ///
    , xlabel(0(10)50) ylabel(0(.2)1, `opt_ygrid') ///
    ytitle("Probability of tenure") ///
    legend(`opt_legend')
Moreover, if I have a series of similar graphs, I can use the same locals to specify options
for all the graphs. If I want to change something, I only have to change the macros,
not each graph command. For example, if I decide to use colored lines to distinguish
between men and women, I change the macros containing line options:
local opt_linF "lpattern(solid) lwidth(medthick) lcolor(red) msymbol(i)"
local opt_linM "lpattern(dash) lwidth(medthick) lcolor(blue) msymbol(i)"
With these changes, I can use the same graph twoway command as before.
4.2 Information returned by Stata commands
Drukker’s dictum: Never type anything that you can obtain from a saved result.
When writing do-files, you never want to type a number if Stata can provide the
number for you. Fortunately, Stata can provide just about any number you need. To
understand what this means, consider a simple example where I mean-center the variable
age. I could do this by first computing the mean (file: wf4-returned.do):
. use wf-lfp, clear
(Workflow data on labor force participation \ 2008-04-02)
. summarize age
Variable | Obs Mean Std. Dev. Min Max
age | 753 42.53785 8.072574 30 60
Next I use the mean from summarize in the generate command:
. generate age_mean = age - 42.53785
The average of the new variable is very close to 0 as it should be (within .000001):
. summarize age_mean
Variable | Obs Mean Std. Dev. Min Max
age_mean | 753 -1.49e-06 8.072574 -12.53785 17.46215
I can do the same thing without typing the mean. The summarize command both
sends output to the Results window and saves this information in memory. In Stata’s
terminology, summarize returns this information. To see the information returned by
the last command, I use the return list command. For example,
. summarize age
Variable | Obs Mean Std. Dev. Min Max
age | 753 42.53785 8.072574 30 60
. return list
scalars:
    r(N) = 753
    r(sumw) = 753
    r(mean) = 42.53784860557769
    r(Var) = 65.16645121641095
    r(sd) = 8.072574014303674
    r(min) = 30
    r(max) = 60
    r(sum) = 32031
The mean is returned in a scalar named r(mean).1 I use this value to subtract the mean
from age:
. generate age_meanV2 = age - r(mean)
When I compare the two mean-centered variables, I find that the variable created using
r(mean) is slightly closer to zero:
. summarize age_mean age_meanV2
Variable | Obs Mean Std. Dev. Min Max
age_mean | 753 -1.49e-06 8.072574 -12.53785 17.46215
age_meanV2 | 753 6.29e-08 8.072574 -12.53785 17.46215
I could get even closer to zero by creating a variable using double precision:
. summarize age
Variable | Obs Mean Std. Dev. Min Max
age | 753 42.53785 8.072574 30 60
. generate double age_meanV3 = age - r(mean)
. label var age_meanV3 "age - mean(age) using double precision"
. summarize age_mean age_meanV2 age_meanV3
Variable | Obs Mean Std. Dev. Min Max
age_mean | 753 -1.49e-06 8.072574 -12.53785 17.46215
age_meanV2 | 753 6.29e-08 8.072574 -12.53785 17.46215
age_meanV3 | 753 -3.14e-25 8.072574 -12.53785 17.46215
This example illustrates the first reason why you never want to enter a number by
hand if the information is stored in memory. Values are returned with more numerical
precision than shown in the output from the Results window. Second, using returned
results prevents errors when typing a number. Finally, using a returned value is more
robust. If you type the mean based on the output from summarize and later change
the sample being analyzed, it is easy to forget to change the generate command where
you typed the mean. Using r(mean) automatically inserts the correct quantity.
1. Scalar means a single numeric value.
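As a minimal sketch of why the robustness argument matters, suppose the sample is later restricted (the if condition below is a hypothetical restriction of my own, not from the book's files); the same generate line still centers age correctly because r(mean) is recomputed:

summarize age if inc < 50    // hypothetical sample restriction
generate double age_c = age - r(mean)    // uses the mean from the restricted sample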
Most Stata commands that compute numerical quantities return those quantities
and often return additional information that is not in the output. To look at the
returned results from commands that are not fitting a model, use return list. For
estimation commands, use ereturn list. To find out what each return contains, enter
help command-name and look at the section on saved results.
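As a minimal sketch of ereturn list (my own illustration using variables from wf-lfp.dta; the exact list of returns depends on the command):

logit lfp k5 age wc
ereturn list                 // e()-class results saved by the last estimation command
display "N = " e(N)          // number of observations used in estimation
display "ll = " e(ll)        // log likelihood at convergence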
Using returned results with local macros
In the example above, I used the returned mean when generating a new variable. I can
also place the returned information in a macro. For example, if I run summarize age,
the mean and standard deviation are returned. I can assign these quantities to local
macros:
. local mean_age = r(mean)
. local sd_age = r(sd)
I can now display this information:
. display "The mean of age `mean_age' (sd=`sd_age')"
The mean of age 42.53784860557769 (sd=8.072574014303674)
If you are using returned results to compute other quantities (e.g., to center a variable),
you want to retain all 14 decimal digits. If you only want to display the quantity,
you might want to round the result to fewer decimal digits. You can do this with the
string() function. For example,
. local mean_agefmt = string(r(mean),"%8.3f")
. local sd_agefmt = string(r(sd),"%8.3f")
. display "The mean of age `mean_agefmt' (sd=`sd_agefmt')."
The mean of age 42.538 (sd=8.073).
The locals mean_agefmt and sd_agefmt have been printed with only three digits of
precision and should not be used for computing other quantities.
Returned results are used in many ways in the later chapters. I encourage you to
experiment with assigning returns to locals and using the display command. For more
information, see help display and help return, or [P] display, [R] saved results,
and [P] return.
4.3 Loops: foreach and forvalues
Loops let you execute a group of commands multiple times. Here is a simple example
that illustrates the key features of loops. I have a four-category ordinal variable y with
values from 1 to 4. I want to create the binary variables y_lt2, y_lt3, and y_lt4 that
equal 1 if y is less than the indicated value, else 0. I can create the variables with three
generate commands (file: wf4-loops.do):
generate y_lt2 = y<2 if !missing(y)
generate y_lt3 = y<3 if !missing(y)
generate y_lt4 = y<4 if !missing(y)
where the if condition !missing(y) selects cases where y is not missing. I could create
the same variables with a foreach loop:
1> foreach cutpt in 2 3 4 {
2>     generate y_lt`cutpt' = y<`cutpt' if !missing(y)
3> }
Let's look at each part of this loop. Line 1 starts the loop with the foreach command.
cutpt is the name I chose for a macro to hold the cutpoint used to dichotomize y. Each
time through the loop, the value of cutpt changes. in signals the start of a list of values
that will be assigned in sequence to the local cutpt. The numbers 2 3 4 are the values
to be assigned to cutpt. { indicates that the list has ended. Line 2 is the command
that I want to run multiple times. Notice that it uses the macro cutpt that was created
in line 1. Line 3 ends the foreach loop.
Here is what happens when the loop is executed. The first time through foreach
the local cutpt is assigned the first value in the list. This is equivalent to the command
local cutpt "2". Next the generate command is run, where `cutpt' is replaced by
the value assigned to cutpt. The first time through the loop, line 2 is evaluated as
generate y_lt2 = y<2 if !missing(y)
Next the closing brace } is encountered, which sends us back to the foreach command
in line 1. In the second pass, foreach assigns cutpt to the second value in the list,
which means that the generate command is evaluated as
generate y_lt3 = y<3 if !missing(y)
This continues once more, assigning cutpt to 4. When the foreach loop ends, three
variables have been generated.
Next I want to estimate binary logits on y_lt2, y_lt3, and y_lt4.2 I assign my
right-hand-side variables to the local rhs:
local rhs "yr89 male white age ed prst"
To run the logits, I could use the commands
logit y_lt2 `rhs'
logit y_lt3 `rhs'
logit y_lt4 `rhs'
2. I am using a series of binary logits to assess the parallel regression assumption in the ordinal logit
model; see Long and Freese (2006) for details.
Or I could do the same thing with a loop:
foreach lhs in y_lt2 y_lt3 y_lt4 {
    logit `lhs' `rhs'
}
Using foreach to fit three models is probably more trouble than it is worth. Suppose
that I also want to compute the frequency distribution of the dependent variable and
fit a probit model. I can add two lines to the loop:
foreach lhs in y_lt2 y_lt3 y_lt4 {
    tabulate `lhs'
    logit `lhs' `rhs'
    probit `lhs' `rhs'
}
If I want to change a command, say, adding the missing option to tabulate, I have to
make the change in only one place and it applies to all three outcomes.
I hope this simple example gives you some ideas about how useful loops can be.
In the next section, I present the syntax for the foreach and forvalues commands.
The foreach command has options to loop through lists of existing variables, through
lists of variables you want to create, or through numeric lists. The forvalues command
is for looping through numbers. After going through the syntax, I present more
complex examples of loops that illustrate techniques used in later chapters. For further
information, use help or check [P] foreach and [P] forvalues.
The foreach command
The syntax is
foreach local-name {in|of list-type} list {
    commands referring to `local-name'
}
where local-name is a local macro whose value is assigned by the loop. list contains the
items to be assigned to local-name. With the in option, you provide a list of values or
names and foreach goes through the list one at a time. For example,
foreach i in 1 2 3 4 5 {
will assign i the values 1, 2, 3, 4, and 5, or you can assign names to i:
foreach i in var1 var2 var3 var4 var5 {
The of option lets you specify the kind of list you are providing and Stata verifies
that all the elements in the list are appropriate. The command foreach local-name
of varlist list { is for lists of variables, where list is expanded according to standard
variable abbreviation rules. For example,4.3.1 Ways to use loops 95
foreach var of varlist lfp-inc {
expands lfp-inc to include all variables between lfp and inc. In wf-lfp.dta, this
would be lfp k5 k618 age wc hc lwg inc. Stata verifies that each name in the list
corresponds to a variable in the dataset in memory. If it does not, the loop ends with
an error.
The command foreach local-name of newlist newvarlist is for a list of variables
to be created. The names in newvarlist are not automatically created, but Stata verifies
that the names are valid for generating new variables. The command foreach
local-name of numlist numlist is used for lists of numbers, where numlist uses standard
number-list notation. For details on the many ways to create sequences of numbers
with numlist, type help numlist or see [U] 11.1.8 numlist.
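As a minimal sketch of these two forms (the names score1-score3 and the numeric list are my own illustrations):

foreach newvar of newlist score1-score3 {
    generate `newvar' = .        // names are verified as valid, then created here
}
foreach n of numlist 1/3 5(5)15 {
    display "n is `n'"           // assigns 1 2 3 5 10 15 in turn
}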
The forvalues command
The forvalues command loops through numbers. The syntax is
forvalues local-name = range {
    commands referring to `local-name'
}
where range is specified as
Syntax        Meaning                              Example    Generates
#1(#d)#2      From #1 to #2 in steps of #d         1(2)10     1, 3, 5, 7, 9
#1/#2         From #1 to #2 in steps of 1          1/10       1, 2, 3, ..., 10
#1 #t to #2   From #1 to #2 in steps of (#t-#1)    1 4 to 15  1, 4, 7, 10, 13
For example, to loop through ages 40 to 80 by 5s:
forvalues i = 40(5)80 {
Or to loop from 0 to 100 by .1:
forvalues i = 0(.1)100 {
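To show a complete loop rather than just the opening line, here is a minimal sketch of my own using the first range:

forvalues age = 40(5)80 {
    display "age is `age'"    // prints 40, 45, ..., 80
}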
4.3.1 Ways to use loops
Loops can be used in many ways that make your workflow faster and more accurate. In
this section, I use loops for the following tasks:
• Listing variable and value labels
• Creating interaction variables
• Fitting models with alternative measures of education
• Recoding multiple variables the same way
• Creating a macro that holds accumulated information
• Retrieving information returned by Stata
The examples are simple, but illustrate features that are extended in later chapters.
Hopefully, as you read these examples you will think of other ways in which loops can
benefit your work. All the examples assume that wf-loops.dta has been loaded (file:
wf4-loops.do).
Loop example 1: Listing variable and value labels
Surprisingly, Stata does not have a command to print a list of variable names followed
only by their variable labels. The describe command lists more information than I
often need, plus it contains details that often confuse people (e.g., what does byte
%9.0g warmlbl mean?). To create a list of names and labels, I loop through a list
of variables, retrieve each variable label, and print the information. To retrieve the
variable label for a given variable, I use an extended macro function. Stata has dozens
of extended macro functions that are used to create macros with information about
variables, datasets, and other things. For example, to retrieve the variable label for
warm, I use this command
local varlabel : variable label warm
To see the contents of varlabel, type
. display "Variable label for warm: `varlabel'"
Variable label for warm: Mom can have warm relations with child
To create a list for several variables, I loop through a list of variable names, extract
each variable label, and print the results:
1> foreach varname of varlist warm yr89 male white age ed prst {
2>     local varlabel : variable label `varname'
3>     display "`varname'" _col(12) "`varlabel'"
4> }
Line 1 starts the loop through seven variable names. The first time through the loop, the
local varname contains warm. Line 2 creates the local varlabel with the variable label
for the variable in varname. Line 3 prints the results. Everything in this line should be
familiar, except for _col(12), which specifies that the label should start printing in the
12th column. Here is the list produced by the loop:
warm       Mom can have warm relations with child
yr89       Survey year: 1=1989 0=1977
male       Gender: 1=male 0=female
white      Race: 1=white 0=not white
age        Age in years
ed         Years of education
prst       Occupational prestige
If I want the labels to be closer to the names, I could change _col(12) to _col(10)
or some other value. In section 4.5, I elaborate this simple loop to create a new Stata
command that lists variable names with their labels.
Loop example 2: Creating interaction variables
Suppose that I need variables that are interactions between the binary variable male
and a set of independent variables. I can do this quickly with a loop:
1> foreach varname of varlist yr89 white age ed prst {
2>     generate maleX`varname' = male*`varname'
3>     label var maleX`varname' "male*`varname'"
4> }
Line 1 loops through the list of independent variables. Line 2 generates a new variable
named maleX`varname'. For example, if varname is yr89, the new variable is
maleXyr89. The variable label created in line 3 combines the names of the two variables
used to create the interaction. For example, if varname is yr89, the variable label
is male*yr89. To examine the new variables and their labels, I use codebook:
. codebook maleX*, compact
Variable     Obs  Unique      Mean  Min  Max  Label
maleXyr89   2293       2  .1766245    0    1  male*yr89
maleXwhite  2293       2  .4147405    0    1  male*white
maleXage    2293      71  20.50807    0   89  male*age
maleXed     2293      21  8.735717    0   20  male*ed
maleXprst   2293      59  18.76625    0   82  male*prst
Although this variable label clearly indicates how the variable was generated, I prefer
a label that includes the variable label from the source variable. I do this using the
extended macro function introduced in Loop example 1:
1> foreach varname of varlist yr89 white age ed prst {
2>     local varlabel : variable label `varname'
3>     generate maleX`varname' = male*`varname'
4>     label var maleX`varname' "male*`varlabel'"
5> }
Line 2 retrieves the variable label for `varname' and line 4 uses this to create the new
variable label. For maleXage, the label is male*Age in years. I could create an even
more informative variable label by replacing line 4 with
label var maleX`varname' "male*`varname' (`varlabel')"
For example, for maleXprst, the label would be male*prst (Occupational prestige).
Loop example 3: Fitting models with alternative measures of education
Suppose I want to predict labor-force participation using education and five additional
independent variables. My dataset has five measures of education (e.g., years of educa-
tion, a binary indicator of attending high school), but I have no theoretical reason for
choosing among them. I decide to try each measure in my model. First, I create a local
containing the names of the education variables:
local edvars "edyrs edgths edgtcol edsqrtyrs edlths"
The other independent variables are
local rhs "male white age prst yr89"
I loop through the education variables and fit five ordinal logit models, each with a
different measure of education:
foreach edvarname of varlist `edvars' {
    display _newline "==> education variable: `edvarname'"
    ologit warm `edvarname' `rhs'
}
This is equivalent to running these commands:
display _newline "==> education variable: edyrs"
ologit warm edyrs male white age prst yr89
display _newline "==> education variable: edgths"
ologit warm edgths male white age prst yr89
display _newline "==> education variable: edgtcol"
ologit warm edgtcol male white age prst yr89
display _newline "==> education variable: edsqrtyrs"
ologit warm edsqrtyrs male white age prst yr89
display _newline "==> education variable: edlths"
ologit warm edlths male white age prst yr89
I find the loop to be simpler and easier to debug than the repeated list of commands.
In chapter 7, this idea is extended to collect information for selecting among the models
using a Bayesian information criterion statistic (see page 306).
Loop example 4: Recoding multiple variables the same way
I often have multiple variables that I want to recode the same way. For example, I have
six variables that measure social distance (e.g., would you be willing to have this person
live next door to you?) using the same 4-point scale. The variables are
local sdvars "sdneighb sdsocial sdchild sdfriend sdwork sdmarry"
To dichotomize these variables, I use a loop:
1> foreach varname of varlist `sdvars' {
2>     generate B`varname' = `varname'
3>     label var B`varname' "`varname': (1,2)=0 (3,4)=1"
4>     replace B`varname' = 0 if `varname'==1 | `varname'==2
5>     replace B`varname' = 1 if `varname'==3 | `varname'==4
6> }
Line 2 generates a new variable equal to the source variable. The new variable name
adds B (for binary) to the source variable name (e.g., Bsdneighb from sdneighb). Line 3
adds a variable label. Line 4 assigns 0 to the new variable when the source variable is
1 or 2, where the | symbol means "or". Similarly, line 5 assigns 1 when the source
variable is 3 or 4. The loop applies the same recoding to all the variables in the local
sdvars.
Suppose that I have measures of income from five panels of data. The variables are
named incp1 through incp5. I can transform each by adding .5 and taking the log:
foreach varname of varlist incp1 incp2 incp3 incp4 incp5 {
    generate ln`varname' = ln(`varname'+.5)
    label var ln`varname' "Log(`varname'+.5)"
}
Loop example 5: Creating a macro that holds accumulated information
Typing lists is boring and often leads to mistakes. In the last example, typing the five
income measures was simple, but if I had 20 panels it would be tedious. Instead, I can
use a loop to create the list of names. First, I create a local varlist that contains
nothing (known as a null string):
local varlist ""
I will use varlist to hold my list of names. Next I loop from 1 to 20 to build my list.
Here I use forvalues because it automatically creates the sequence of numbers 1-20:
1> forvalues panelnum = 1/20 {
2>     local varlist "`varlist' incp`panelnum'"
3> }
The local in line 2 can be confusing, so let me decode it from right to left (not left to
right). The first time through the loop, incp`panelnum' is evaluated as incp1 because
`panelnum' is 1. To the left, `varlist' is a null string. Combining `varlist' with
incp`panelnum' changes the local varlist from a null string to incp1. The second
time through the loop, incp`panelnum' is incp2. This is added to varlist, which now
contains incp1 incp2. And so on.
Hopefully, my explanation of this loop was clear. Suppose that you are still confused
(and macros can be confusing when you first use them). You could add display
commands that print the contents of the local macros at each iteration of the loop:
local varlist ""
forvalues panelnum = 1/20 {
    local varlist "`varlist' incp`panelnum'"
    display _newline "panelnum is: `panelnum'"
    display "varlist is: `varlist'"
}
The output looks like this:
panelnum is: 1
varlist is: incp1
panelnum is: 2
varlist is: incp1 incp2
panelnum is: 3
varlist is: incp1 incp2 incp3
panelnum is: 4
varlist is: incp1 incp2 incp3 incp4
(output omitted)
Adding display to loops is a good way to verify that the loop is doing what you think it
should. Once you have verified that the loop is working correctly, you can comment out
the display commands (e.g., put a * in front of each line). As an exercise, particularly
if any of the examples are confusing, add display commands to the loops in prior
examples to verify how they work.
Loop example 6: Retrieving information returned by Stata
When Stata executes a command, it almost always leaves information in memory. You
can use this information in many ways. For example, I start by computing summary
statistics for one variable:
. summarize Bsdneighb
Variable | Obs Mean Std. Dev. Min Max
Bsdneighb | 490 .1938776 .3957381 0 1
After summarize runs, I type the command return list to see what information was
left in memory. In Stata terminology, this information was "returned" by summarize:
. return list
scalars:
    r(N) = 490
    r(sumw) = 490
    r(mean) = .1938776510204082
    r(Var) = .1566086557322316
    r(sd) = .3957381150865198
    r(min) = 0
    r(max) = 1
    r(sum) = 95
This information can be moved into macros. For example, to retrieve the number of
cases, type
local samplesize = r(N)
To compute the percentage of cases equal to one, I can multiply the mean in r(mean)
by 100:
local pct1 = r(mean)*100
Next I use returned information in a loop to list the percentage of ones and the sample
size for each measure of social distance:
1> foreach varname of varlist `sdvars' {
2>     quietly summarize B`varname'
3>     local samplesize = r(N)
4>     local pct1 = r(mean)*100
5>     display "B`varname':" _col(14) "Pct1s = " %5.2f `pct1' ///
>          _col(30) "N = `samplesize'"
6> }
Line 1 loops through the list of variables. Line 2 computes statistics for one variable
at a time. After I was sure the loop worked correctly, I added quietly to suppress the
output from summarize. Line 3 grabs the sample size from r(N) and puts it in the local
samplesize. Similarly, line 4 grabs the mean and multiplies it by 100 to compute the
percentage of ones. Line 5 prints the results using the format %5.2f, which specifies
five columns and two decimal digits (type help format or see [D] format for further
details). The output from the loop looks like this:
Bsdneighb:   Pct1s = 19.39   N = 490
Bsdsocial:   Pct1s = 27.46   N = 488
Bsdchild:    Pct1s = 71.73   N = 481
Bsdfriend:   Pct1s = 28.75   N = 487
Bsdwork:     Pct1s = 31.13   N = 485
Bsdmarry:    Pct1s = 52.75   N = 455
As a second example, I use the returns from summarize to compute the coefficient of
variation (CV). The CV is a measure of inequality for ratio variables that equals the
standard deviation divided by the mean. I compute the CV with these commands:
foreach varname of varlist incp1 incp2 incp3 incp4 {
    quietly summarize `varname'
    local cv = r(sd)/r(mean)
    display "CV for `varname': " %8.3f `cv'
}
4.3.2 Counters in loops
In many applications using loops, you will need to count how many times you have gone
through the loop. To do this, I create a local macro that will contain how often I have
gone through the loop. Because I have not started the loop yet, I start by setting the
counter to 0 (file: wf4-loops.do):
local counter = 0
Next I loop through the variables as I did above:
1> foreach varname of varlist warm yr89 male white age ed prst {
2>     local counter = `counter' + 1
3>     local varlabel : variable label `varname'
4>     display "`counter'. `varname'" _col(12) "`varlabel'"
5> }
Line 2 increments the counter. To understand how this works, start on the right and
move left. I take 1 and add it to the current value of counter. I retrieve this value with
`counter'. The first time through the loop, `counter' is 0, so `counter' + 1 is 1.
Line 3 retrieves the variable label, and line 4 prints the results using the local counter
to number each line. The results look like this:
1. warm     Mom can have warm relations with child
2. yr89     Survey year: 1=1989 0=1977
3. male     Gender: 1=male 0=female
4. white    Race: 1=white 0=not white
5. age      Age in years
6. ed       Years of education
7. prst     Occupational prestige
Counters are so useful that Stata has a simpler way to increment them. The command
local ++counter is equivalent to local counter = `counter' + 1. Using this, the
loop becomes
local counter = 0
foreach varname in warm yr89 male white age ed prst {
    local ++counter
    local varlabel : variable label `varname'
    display "`counter'. `varname'" _col(12) "`varlabel'"
}
Using loops to save results to a matrix
Loops are critical for accumulating results from statistical analyses. To illustrate this
application, I extend the example on page 101 so that instead of printing the percentage
of ones and the sample size, I save this information in a matrix. I begin by creating a
local with the names of the six binary measures:
local sdvars "Bsdneighb Bsdsocial Bsdchild Bsdfriend Bsdwork Bsdmarry"
I use an extended macro function to count the number of variables in the list:
local nvars : word count `sdvars'
By using this extended macro function, I can change the list of variables in sdvars and
not worry about updating the count for the number of variables I want to analyze. You
are always better off letting Stata compute a quantity than entering it by hand. For
each variable, I need the percentage of ones and the number of nonmissing cases. I will
save these in a matrix that will have one row for each variable and two columns. I use
a matrix command ([P] matrix) to create a 6 x 2 matrix named stats:
matrix stats = J(`nvars',2,.)
The J() function creates a matrix based on three arguments. The first is the number
of rows, the second the number of columns, and the third is the value used to fill the
matrix. Here I want the matrix to be initialized with missing values that are indicated
by a period. The matrix looks like this:
. matrix list stats
stats[6,2]
     c1  c2
r1    .   .
r2    .   .
r3    .   .
r4    .   .
r5    .   .
r6    .   .
To document what is in the matrix, I add row and column labels:
matrix colnames stats = Pct1s N
matrix rownames stats = `sdvars'
The matrix now looks like this:
. matrix list stats
stats[6,2]
           Pct1s  N
Bsdneighb      .  .
Bsdsocial      .  .
Bsdchild       .  .
Bsdfriend      .  .
Bsdwork        .  .
Bsdmarry       .  .
Next I loop through the variables in local sdvars, run summarize for each variable, and
add the results to the matrix. I initialize a counter that will indicate the row where I
want to put the information:
local irow = 0
Then I loop through the variables, compute what I need, and place the values in the
matrix:
1> foreach varname in `sdvars' {
2>     local ++irow
3>     quietly sum `varname'
4>     local samplesize = r(N)
5>     local pct1 = r(mean)*100
6>     matrix stats[`irow',1] = `pct1'
7>     matrix stats[`irow',2] = `samplesize'
8> }
Lines 1-5 are similar to the example on page 101. Line 6 places the value of pct1 into
row irow and column 1 of the matrix stats. Line 7 places the sample size in column 2.
After the loop has completed, I list the matrix using the option format(%9.3f). This
option specifies that I want to display each number in nine columns and show three
decimal digits:
. matrix list stats, format(%9.3f)
stats[6,2]
               Pct1s        N
Bsdneighb     19.388  490.000
Bsdsocial     27.459  488.000
Bsdchild      71.726  481.000
Bsdfriend     28.747  487.000
Bsdwork       31.134  485.000
Bsdmarry      52.747  455.000
This technique for accumulating results is used extensively in chapter 7.
4.3.3 Nested loops
You can nest loops by placing one loop inside of another loop. Consider the earlier
example (page 93) of creating binary variables indicating if y was less than a given
value. Suppose that I need to do this for variables ya, yb, yc, and yd. I could repeat
the code used above four times, once for each variable. A better approach uses a foreach
loop over the four variables (file: wf4-loops.do):
foreach yvar in ya yb yc yd { // loop 1 begins
(content of loop goes here)
} // loop 1 ends
Within this loop, I insert a modification of the loop used before to dichotomize y. I
refer to this as loop 2:
1>  foreach yvar in ya yb yc yd {             // loop 1 begins
2>      foreach cutpt in 2 3 4 {              // loop 2 begins
3>          * create binary variable
4>          generate `yvar'_lt`cutpt' = `yvar'<`cutpt' if !missing(`yvar')
5>          * add labels
6>          label var `yvar'_lt`cutpt' "`yvar' is less than `cutpt'?"
7>          label define `yvar'_lt`cutpt' 0 "Not <`cutpt'" 1 "<`cutpt'"
8>          label values `yvar'_lt`cutpt' `yvar'_lt`cutpt'
9>      }                                     // loop 2 ends
10> }                                         // loop 1 ends
The first time through loop 1, the local yvar is assigned ya, so when `yvar' appears
in later lines, it is evaluated as ya. The second loop varies over the three values for
cutpt. The locals from the two loops are combined in later lines. For example, in line 4
I create a variable named `yvar'_lt`cutpt'. The local yvar is initially ya and the first
value of cutpt is 2. Accordingly, the first variable created is ya_lt2. Then ya_lt3 and
ya_lt4 are created. At this point, loop 2 ends and the value of yvar in loop 1 becomes
yb and variables yb_lt2, yb_lt3, and yb_lt4 are generated by loop 2.
4.3.4 Debugging loops
Loops can generate confusing errors. When this happens, I am often able to figure out
what is wrong by using display to monitor the values of the local variables created in
the loop. For example, this loop looks fine (file: wf4-loops-error1.do)
foreach varname in "sdneighb sdsocial sdchild sdfriend sdwork sdmarry" {
    generate B`varname' = `varname'
    replace B`varname' = 0 if `varname'==1 | `varname'==2
    replace B`varname' = 1 if `varname'==3 | `varname'==4
}
but it generates the following error:
sdsocial already defined
r(110);
To debug the loop, I start by removing sdsocial from the list to see if there was
something specific to this variable that caused the error. When I do this, however, I get
the same error for a different variable (file: wf4-loops-error1a.do):
sdchild already defined
r(110);
Because the second variable in the list causes the same error, I suspect that the problem
is not with the variables that I want to recode. Next I add a display command immediately
after the foreach command (file: wf4-loops-error1b.do):
display "==> varname is: >`varname'<"
This command prints ==> varname is: >...<, where ... is replaced by the contents
of the local varname. I print > and < to make it easy to see if there are blanks at the
beginning or end of the local. Here is the output:
==> varname is: >sdneighb sdsocial sdchild sdfriend sdwork sdmarry<
sdsocial already defined
r(110);
Now I see the problem. The first time through the loop, I wanted varname to contain
sdneighb, but instead it contains the entire list of variables sdneighb sdsocial
sdchild sdfriend sdwork sdmarry. This is because everything within quotes is
considered to be a single item; the solution is to get rid of the quote marks:
foreach varname in sdneighb sdsocial sdchild sdfriend sdwork sdmarry {
Errors in loops are often caused by problems in the local variable created by the foreach
or forvalues command. The specific error message you get depends on the commands
used within the loop. Regardless of the error message, the first thing I do when I have a
problem with a loop is to use display to show the value of the local created by foreach
or forvalues. More times than not, this uncovers the problem.
Using trace to debug loops
Another approach to debugging loops is to trace the program execution (see page 81).
Before the loop begins, type the command
set trace on
Then, as the loop is executed, you can see how each macro has been expanded. For
example,
. foreach varname in "sdneighb sdsocial sdchild sdfriend sdwork sdmarry" {
  2.     gen B`varname' = `varname'
  3.     replace B`varname' = 0 if `varname'==1 | `varname'==2
  4.     replace B`varname' = 1 if `varname'==3 | `varname'==4
  5. }
- foreach varname in "sdneighb sdsocial sdchild sdfriend sdwork sdmarry" {
- gen B`varname' = `varname'
= gen Bsdneighb sdsocial sdchild sdfriend sdwork sdmarry = sdneighb sdsocial sdc
> hild sdfriend sdwork sdmarry
sdsocial already defined
- replace B`varname' = 0 if `varname'==1 | `varname'==2
- replace B`varname' = 1 if `varname'==3 | `varname'==4
- }
r(110);
With trace, lines that begin with = show the command after all the macros have been
expanded. In this example, you can see right away what the problem is. To turn trace
off, type the command set trace off.
4.4 The include command
Sometimes I repeat the same code multiple times within the same do-file or across
multiple do-files. For example, when cleaning data, I might have many variables that
use 97, 98, and 99 for missing values where I want to recode these values to the extended
missing value codes .a, .b, and .c. Or I want to select my sample in the same way in
multiple do-files. Of course, I can copy the same code into each file, but if I decide to
change something, say, to use .n rather than .c for a missing value, I must change each
do-file in each location where the recoding is done. Making such repetitious changes is
time-consuming and error-prone. An alternative is to use the include command. The
include command inserts code from a file into your do-file just as if you had typed it
at the location of the include command. To give you an idea of how to use include,
I provide two examples. The first example uses an include file to select the sample in
multiple do-files. The second example uses include files to recode data. The section
ends with some warnings about things that can go wrong. The include command was
added in Stata 9.1, where help include is the only documentation; in Stata 10, also
see [P] include.
4.4.1 Specifying the analysis sample with an include file
I have a series of do-files that analyze mydata.dta.3 For these analyses I want to use
the same cases selected with the following commands:
use mydata, clear
keep if panel==1 // only use 1st panel
drop if male==0 // restrict analysis to males
drop if inc>=. // drop if missing on income
I could type these commands at the beginning of each do-file. Instead, I prefer to use
the include command. I create a file called mydata-sample.doi, where I chose the
suffix .doi to indicate that the file is an include file. You can use any suffix you want,
but I suggest you always use the same suffix to make it easier to find your include files.
My analysis program uses the include file like this:
* load data and select sample
include mydata-sample.doi
* get descriptive statistics
summarize
* run base model
logit y x1 x2 x3
This is exactly equivalent to the program:
* load data and select sample
use mydata, clear
keep if panel==1 // only use 1st panel
drop if male==0 // restrict analysis to males
drop if inc>=. // drop if missing on income
* get descriptive statistics
summarize
* run base model
logit y x1 x2 x3
If I use different analysis samples for different purposes, I can create a series of include
files, say,
mydata-males-p1.doi
mydata-males-allpanels.doi
mydata-females-p1.doi
mydata-females-allpanels.doi
By selecting one of these to include in a do-file, I can quickly select the sample I want
to use.
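As a minimal sketch, mydata-females-allpanels.doi might contain nothing more than the matching selection commands (my illustration of one plausible file, following the pattern above):

use mydata, clear
drop if male==1 // restrict analysis to females
drop if inc>=.  // drop if missing on income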
4.4.2 Recoding data using include files
I also use include files for data cleaning when I have a lot of variables that need to be
changed in similar ways. Here is a simple example. Suppose that variable inneighb
3. Recall that if a filename does not start with wf, it is not part of the Workflow package that you can
download.108 Chapter 4 Automating your work
uses 97, 98, and 99 as missing values. I want to recode these values to be extended
missing values. For example (file: wf4-include.do),
* inneighb: recode 97, 98 & 99
clonevar inneighbR = inneighb
replace inneighbR = .a if inneighbR==97
replace inneighbR = .b if inneighbR==98
replace inneighbR = .c if inneighbR==99
tabulate inneighb inneighbR, miss nolabel
Because I want to recode insocial, inchild, infriend, inmarry, and inwork the same
way, I use similar commands for each variable:
* insocial: recode 97, 98 & 99
clonevar insocialR = insocial
replace insocialR = .a if insocialR==97
replace insocialR = .b if insocialR==98
replace insocialR = .c if insocialR==99
tabulate insocial insocialR, miss nolabel
* inchild: recode 97, 98 & 99
clonevar inchildR = inchild
replace inchildR = .a if inchildR==97
replace inchildR = .b if inchildR==98
replace inchildR = .c if inchildR==99
tabulate inchild inchildR, miss nolabel
(and so on for infriend, inmarry, and inwork)
Or I can use a loop:
foreach varname in inneighb insocial inchild infriend {
    clonevar `varname'R = `varname'
    replace `varname'R = .a if `varname'R==97
    replace `varname'R = .b if `varname'R==98
    replace `varname'R = .c if `varname'R==99
    tabulate `varname' `varname'R, miss nolabel
}
I can do the same thing with an include file. I create the file
wf4-include-2digit-recode.doi that contains:
clonevar `varname'R = `varname'
replace `varname'R = .a if `varname'R==97
replace `varname'R = .b if `varname'R==98
replace `varname'R = .c if `varname'R==99
tabulate `varname' `varname'R, miss nolabel
As in the foreach loop, these commands assume that the local varname contains the
name of the variable being cloned and recoded. For example,
local varname inneighb
include wf4-include-2digit-recode.doi
For the next variable,
local varname insocial
include wf4-include-2digit-recode.doi
and so on. I create other include files for other types of recoding. For example,
wf4-include-3digit-recode.doi has the commands
clonevar `varname'R = `varname'
replace `varname'R = .a if `varname'R==997
replace `varname'R = .b if `varname'R==998
replace `varname'R = .c if `varname'R==999
tabulate `varname' `varname'R, miss nolabel
My program to recode all variables looks like this:
// recode two-digit missing values
local varname inneighb
include wf4-include-2digit-recode.doi
local varname insocial
include wf4-include-2digit-recode.doi
local varname inchild
include wf4-include-2digit-recode.doi
local varname infriend
include wf4-include-2digit-recode.doi
// recode three-digit missing values
local varname inmarry
include wf4-include-3digit-recode.doi
local varname inwork
include wf4-include-3digit-recode.doi
Or I could use loops:
// recode two-digit missing values
foreach varname in inneighb insocial inchild infriend {
    include wf4-include-2digit-recode.doi
}
// recode three-digit missing values
foreach varname in inmarry inwork {
    include wf4-include-3digit-recode.doi
}
I can create a different include file for each type of recoding that needs to be done. I
find this to be very helpful in large data-cleaning projects as shown on page 236.
4.4.3 Caution when using include files
Although include files can be very useful, you need to be careful about preserving,
documenting, and changing them. When backing up your work, it is easy to forget the
include files. If you cannot find the include file that is used by a do-file, the do-file will
not work correctly. Accordingly, you should carefully name and document your include
files. I give include files the suffix .doi so that I can easily recognize them when looking
at a list of files. I use a prefix that links them to the do-files that call them. For example,
if mypgm.do uses an include file and no other do-files use this include file, I name the
include file mypgm.doi. If I have an include file that is used by many do-files, I start
the name of the include file with the same starting letters of the do-file. For example,
cwh-men-sample.doi might be included in cwh-O1desc.do and cwh-O2logit.do. I
document include files both in my research log and within the file itself. For example,
the include file might contain110 Chapter 4 Automating your work
// include:  cwh-men-sample.doi
// used by:  cwh-*.do analysis files
// task:     select cases for the male sample
// author:   scott long \ 2007-08-05
The advantage of include files is that they let you easily use the same code in multiple
do-files or multiple times in the same do-file. If you change an include file, you must be
certain that the change is appropriate for all do-files that use the include file. For example,
suppose that cwh-sample.doi selects the sample for my analysis in the CWH project.
The do-files cwh-01desc.do, cwh-02table.do, cwh-03logit.do, and cwh-04graph.do
all include cwh-sample.doi. When reviewing the results for cwh-01desc.do, I decide
that I want to include cases that I had initially dropped. If I change cwh-sample.doi,
this will affect the other do-files. The best approach is to always follow the rule that,
once you have finished your work on a do-file or include file, if you change it, you
should give it a new name. For example, the do-file becomes cwh-01descv2.do and
includes cwh-samplev2.doi. For details on the importance of renaming changed files,
see section 5.1.
The include command should not be used when other methods will produce clearer
code. For example, the foreach version of the code fragment on page 108 is easier
to understand than the corresponding code using include that follows because the
include version hides what is being done in wf4-include-2digit-recode.doi. But,
as the block of code contained in wf4-include-2digit-recode.doi grows, the include
version becomes more attractive.
4.5 Ado-files
This section provides a basic introduction to writing ado-files.4 Ado-files are like do-files,
except that they are automatically run. Indeed, .ado stands for automatically loaded
do-file. To understand how these files work, it helps to know something about the
inner workings of Stata (see appendix A for further details). The Stata for Windows
executable is a file named wstata.exe or mpstata.exe that contains the compiled
program that is the core of Stata. When you click the Stata icon, this file is launched
by the operating system. Some commands are contained in the executable, such as
generate and summarize. Many other commands are not part of the executable but
instead are ado-files. Ado-files are programs written using features from the executable
to complete other tasks. For example, the executable does not have a program to fit
the negative binomial regression model. Instead, this model is fitted by the ado-file
nbreg.ado. Stata 10 has nearly 2,000 ado-files. A clever and powerful feature of Stata
is that when you run a command, you cannot tell whether it is part of the executable
or is an ado-file. This means that Stata users can write new commands and use them
just like official Stata commands.
4. When you install the Workflow package, the ado-files and help files from this section are placed in
your working directory. Because Stata automatically installs user-written ado-files and help files
to the PLUS directory (see page 350), I have named these files with the suffixes ._ado and ._hlp (e.g.,
wf._ado, wf._hlp) so they will be downloaded to your working directory. Using your file manager,
you should rename the files to remove the underscores.
Suppose that I have written the ado-file listcoef.ado and type listcoef in the
Command window. Because listcoef is not an internal command, Stata automatically
looks for the file listcoef.ado. If the file is found, it is run. This happens very quickly,
so you will not be able to tell if listcoef is part of the executable, an ado-file that is
part of official Stata, or an ado-file written by someone else. This is a very powerful
feature of Stata.
Although ado-files can be extremely complex (for example, from the Command
window, run viewsource mfx.ado to see an ado-file from official Stata), it is possible
to write your own ado-files that are simple yet very useful.
4.5.1 A simple program to change directories
The cd command changes your working directory. For example, my work for this book is
located in e:\workflow\work. To make this my working directory, I type the command
cd e:\workflow\work
Because I work on other projects and each project has its own directory, I change
directories frequently. To make this easier, I can write an ado-file called wf.ado that
automatically changes my working directory to e:\workflow\work. The ado-file is
program define wf
    version 10
    cd e:\workflow\work
end
The first line names the new command and the last line indicates that the code for
the command has ended. The second line indicates that the program assumes you are
running Stata 10 or later. Line 3 changes the working directory. I save wf.ado in my
PERSONAL directory (type adopath to find where your PERSONAL directory is located).
To change the working directory, I simply type wf.
I can create ado-files for each project. For example, my work on SPost is located in
e:\spost\work\. So I create spost.ado:
program define spost
    version 10
    cd e:\spost\work
end
For scratch work, I use the d:\scratch directory. So the ado-file is
program define scratch
    version 10
    cd d:\scratch
end
In Windows, I often download files to the desktop. To quickly check these files, I might
want to try them in Stata before moving them to their permanent location. To change
to the desktop, I need to type the awkward command112 Chapter 4 Automating your work
cd "c:\Documents and Settings\Scott Long\Desktop"
It is much easier to create a command called desk:
program define desk
    version 10
    cd "c:\Documents and Settings\Scott Long\Desktop"
end
Now I can move around directories for different projects easily:
. wf
e:\workflow\work
. desk
c:\Documents and Settings\Scott Long\Desktop
. spost
e:\spost\work
. wf
e:\workflow\work
. scratch
d:\scratch
If you have not written an ado-file before, this is a good time to try writing a few that
change to your favorite working directories.
4.5.2 Loading and deleting ado-files
Before proceeding to a more complex example, I need to further explain what happens
to ado-files once they are loaded into memory and what happens if you need to change
an ado-file that is already loaded. Suppose that you have the file wf.ado in your working
directory when you start Stata. If you enter the wf command, Stata will look for wf.ado
and run the file automatically. This loads the wf command into memory. Stata will try
to keep this command in memory as long as possible. This means that if you enter the
wf command again, Stata will use the command that is already in memory rather than
running wf.ado again. If you change wf.ado, say, to fix an error or add a feature, and
try to run it again, you get an error:
. run wf.ado
wf already defined
r(110);
Stata will not create a new version of the wf command because there is already a version
in memory. The solution is to drop the command stored in memory. For example,
. program drop wf
. run wf.ado
When debugging an ado-file, I start the file with the capture program drop command-name
command. If the command is in memory, it will be dropped. If it is not in
memory, capture prevents an error that occurs if you try to drop a command that is
not in memory.4.5.3 Listing variable names and labels 113
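As a minimal sketch of this pattern, the debugging version of wf.ado would begin like this:

capture program drop wf    // drop wf if it is loaded; do nothing if it is not
program define wf
    version 10
    cd e:\workflow\work
end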
4.5.3 Listing variable names and labels
As a more complex example, I will automate the loop used on page 96 to list variable
names and labels. I start by creating the nmlabel command that works very simply.
Then I add options to introduce new programming features.5 For a command to run
automatically, you need to give the file the same name as the command. For example,
nmlabel.ado should define the nmlabel command. In the examples that follow, I
create several versions of the nmlabel command. When you download the Workflow
package, these are named to reflect their version and have suffixes ._ado rather than .ado
(e.g., nmlabel-v1._ado). The suffix ._ado is necessary to download the files into your
working directory; if the suffix was .ado, the file would be placed in your PLUS directory.
Before working with these files, change the suffixes to .ado. For example, change
nmlabel-v1._ado to nmlabel-v1.ado. If you want a particular version of the command
to run automatically, you need to rename the file, such as renaming nmlabel-v1.ado
to nmlabel.ado. After renaming, it will run automatically if you enter nmlabel.
Version 1
My first version of nmlabel lists the names and labels with no options. It looks like
this (file: nmlabel-v1.ado):
1>  *! version 1.0.0 \ jsl 2007-08-05
2>  capture program drop nmlabel
3>  program define nmlabel
4>      version 10
5>      syntax varlist
6>      foreach varname in `varlist' {
7>          local varlabel : variable label `varname'
8>          display in yellow "`varname'" _col(10) "`varlabel'"
9>      }
10> end
and is saved as nmlabel.ado. Line 1 is a special type of comment. If a comment begins
with *!, I can list the comment using the which command:
. which nmlabel
.\nmlabel.ado
*! version 1.0.0 \ jsl 2007-08-05
The output .\nmlabel.ado tells me that the file is located in my working directory,
indicated by .\. Next the comment is echoed. If the file was in my PERSONAL directory,
which would produce the following output:
. which nmlabel
c:\ado\personal\nmlabel.ado
*! version 1.0.0 \ jsl 2007-08-05
5. After writing nmlabel.ado as an example, I found it so useful that I created a similar command
called nmlab to be part of my personal collection of ado-files. This file is installed as part of the
Workflow package.
When writing an ado-file, you can initially save it in your working directory. When it
works the way you want it to, move it to your PERSONAL directory so that Stata can
find the file regardless of what your current working directory is.
Returning to the ado-file, the third line names the command. Line 4 says that the
program is written for version 10 and later of Stata. If I run the command in version 9
or earlier, I will get an error. Line 5 is an example of the powerful syntax command,
which controls how and what information you can provide your program and generates
warnings and errors if you provide incorrect information (see help syntax or [P] syntax
for more information). The syntax element varlist means that I am going to provide
the program with a list of variable names from the dataset that is currently in memory.
If I enter a name that is not a variable in my dataset, syntax reports an error. Lines 6-9
are the loop used in section 4.3. In line 10, end indicates that the program has ended.
Here is how the command works:
. nmlabel lfp-wc
lfp      Paid Labor Force: 1=yes 0=no
k5       # kids < 6
k618     # kids 6-18
age      Wife's age in years
wc       Wife College: 1=yes 0=no
I typed the abbreviation lfp-wc rather than lfp k5 k618 age wc. The syntax command
automatically changed the abbreviation into a list of variables.
Version 2
Reviewing the output, I think it might look better if there was a blank line between
the echoing of the command and the list of variables. To do this, I add an option skip
that will determine whether to skip a line. Although this option is not terribly useful,
it shows you how to add options using the powerful syntax command. The new version
of the program looks like this (file: nmlabel-v2.ado):6
1>  *! version 2.0.0 \ jsl 2007-08-05
2>  capture program drop nmlabel
3>  program define nmlabel
4>      version 10
5>      syntax varlist [, skip]
6>      if "`skip'"=="skip" {
7>          display
8>      }
9>      foreach varname in `varlist' {
10>         local varlabel : variable label `varname'
11>         display in yellow "`varname'" _col(10) "`varlabel'"
12>     }
13> end
The syntax command in line 5 adds [, skip]. The , indicates that what follows is
an option (in Stata options are placed after a comma). The word skip is the name I
6. If you have already run nmlabel-v1.ado, you need to drop the program nmlabel before running
nmlabel-v2.ado. To do this, enter program drop nmlabel.4.5.3 Listing variable names and labels 115
chose for the option. The [ ]'s indicate that the option is optional; that is, you can
specify skip as an option but you do not have to. If I enter the command with the
skip option, say, nmlabel lfp wc hc, skip, the syntax command in line 5 creates a
local named skip. Think of this as if I ran the command
local skip "skip"
This can be confusing, so I want to discuss it in more detail. When I specify the skip
option, the syntax command creates a macro named skip that contains the string skip.
If I do not specify the skip option, syntax creates the local skip as a null string:
local skip ""
Line 6 checks whether the contents of the macro skip (the contents are indicated by
`skip') are equal to the string skip. If they are, the display command in line 7 is
run, creating a blank line. If not, the display command is not run.
To see how this works, I trace the execution of the ado-file by typing set trace on.
Here is the output, where I have added the line numbers:
1>  . nmlabel lfp k5, skip
2>  ----------------------------------- begin nmlabel ---
3>  - version 10
4>  - syntax varlist [, skip]
5>  - if "`skip'"=="skip" {
6>  = if "skip"=="skip" {
7>  - display
8>
9>  - }
(output omitted)
Line 1 is the command I typed in the Command window. Line 2 indicates that this is
a trace for a command named nmlabel. Line 3 reports that the version 10 command
was executed. The - in front of the command is how trace indicates that what follows
echoes the code exactly as it appears in the ado-file. Line 4 echoes the syntax command,
and line 5 echoes the if statement. Line 6 begins with = to indicate that what follows
expands the code from the ado-file to insert values for things like macros. Here `skip'
has been replaced by its value, which is skip.
Returning to the code for version 2 of nmlabel on page 114, lines 9-12 loop through
the variables being listed by nmlabel. To see what happens, I can look at the output
from the trace:
- foreach varname in `varlist' {
= foreach varname in lfp k5 {
- local varlabel : variable label `varname'
= local varlabel : variable label lfp
- display in yellow "`varname'" _col(10) "`varlabel'"
= display in yellow "lfp" _col(10) "In paid labor force? 1=yes 0=no"
lfp      In paid labor force? 1=yes 0=no
- }
- local varlabel : variable label `varname'
= local varlabel : variable label k5
- display in yellow "`varname'" _col(10) "`varlabel'"
= display in yellow "k5" _col(10) "# kids < 6"
k5       # kids < 6
- }
----------------------------------- end nmlabel ---
Not only is set trace on a good way to see how your ado-file works, but it is invaluable
when debugging your program. To turn trace off, type set trace off.
Version 3
Next I want to add line numbers to my list. To do this, I need a new option and a
counter as illustrated in section 4.3.2. Here is my new program (file: nmlabel-v3.ado):
1>  *! version 3.0.0 \ jsl 2007-08-05
2>  capture program drop nmlabel
3>  program define nmlabel
4>      version 10
5>      syntax varlist [, skip NUMber ]
6>      if "`skip'"=="skip" {
7>          display
8>      }
9>      local varnumber = 0
10>     foreach varname in `varlist' {
11>         local ++varnumber
12>         local varlabel : variable label `varname'
13>         if "`number'"!="number" {    // do not number lines
14>             display in yellow "`varname'" _col(10) "`varlabel'"
15>         }
16>         else {    // number lines
17>             display in green "#`varnumber': " ///
18>                 in yellow "`varname'" _col(13) "`varlabel'"
19>         }
20>     }
21> end
The syntax command in line 5 adds NUMber, which means that there is an option
named number that can be abbreviated as num (the capital letters indicate the shortest
abbreviation that is allowed). Line 9 creates a counter, and line 11 increments its
value. Lines 13-15 say that if the option number is not selected (i.e., `number' is not
number), then print things just as before. Line 16 starts the portion of the
program that runs when the if condition in line 13 is not true. Lines 17 and 18 print
the information I want, including a line number. Line 19 ends the else condition from
line 16. The new version of the command produces output like this:
. nmlabel lfp k5 k618 inc, num
#1: lfp     In paid labor force? 1=yes 0=no
#2: k5      # kids < 6
#3: k618    # kids 6-18
#4: inc     Family income excluding wife's
Version 4
Version 3 looks good, except that long variable names will get in the way of the
labels. I could change _col(13) to _col(18), but why not add an option instead? In
this version of nmlabel, I add COLnum(integer 10) to syntax to create an option
named colnum() that can be abbreviated as col(). integer 10 means that if I do
not specify the colnum() option, the local colnum will automatically be set equal to
10. If I do not want to begin the labels in column 10, I use the colnum() option, such
as nmlabel lfp, col(25), and the labels begin in column 25. Here is the new ado-file
(file: nmlabel-v4.ado):
capture program drop nmlabel
program define nmlabel
    version 10
    syntax varlist [, skip NUMber COLnum(integer 10) ]
    if "`skip'"=="skip" {
        display
    }
    local varnumber = 0
    foreach varname in `varlist' {
        local ++varnumber
        local varlabel : variable label `varname'
        if "`number'"!="number" {   // do not number lines
            display in yellow "`varname'" _col(`colnum') "`varlabel'"
        }
        else {   // number lines
            local colnumplus2 = `colnum' + 2
            display in green "#`varnumber': " ///
                in yellow "`varname'" _col(`colnumplus2') "`varlabel'"
        }
    }
end
I encourage you to study the changes. Although some of the changes might not be
obvious, you should be able to figure them out using the tools from chapters 3 and 4.
4.5.4 A general program to change your working directory
We now have enough tools to write a more general program for changing your working
directory.7 Instead of having a separate ado-file for each directory, I want a command
7. This example was suggested by David Drukker. When you install the Workflow package, the wd
command will be downloaded to your working directory with a suffix other than .ado; the different
suffix is necessary because a file with the suffix .ado would be placed in your PLUS directory rather
than your working directory. Before working with this file, you should rename it wd.ado.
wd, where wd wf changes to my working directory for the workflow project, wd spost
changes to my working directory for SPost, and so on. Here is the program:
1>  *! version 1.0.0 \ scott long 2007-08-05
2>  capture program drop wd
3>  program define wd
4>      version 10
5>      args dir
6>      if "`dir'"=="wf" {
7>          cd e:\workflow\work
8>      }
9>      else if "`dir'"=="spost" {
10>         cd e:\spost\work
11>     }
12>     else if "`dir'"=="scratch" {
13>         cd d:\scratch
14>     }
15>     else if "`dir'"=="" {   // list current working directory
16>         cd
17>     }
18>     else {
19>         display as error "Working directory `dir' is unknown."
20>     }
21> end
The args command in line 5 retrieves a single argument from the command line. If I
type wd wf, then args will take the argument wf and assign it to the local macro dir.
Line 6 checks if dir is wf. If so, line 7 changes to the directory e:\workflow\work.
Similarly, lines 9-11 check if the argument is spost and then change to the appropriate
working directory. If wd is run without an argument, lines 15-17 display the current
working directory. If any other argument is given, lines 18-20 display an error. You can
customize this program with your own else if conditions that use the abbreviations
you want for making changes to working directories that you specify. Then, after you
put wd.ado in your PERSONAL directory, you can easily change your working directories.
For example,
. wd wf
e:\workflow\work
. wd
e:\workflow\work
. wd scratch
d:\scratch
. wd spost
e:\spost\work
4.5.5 Words of caution
If you write ado-files, be careful of two things. First, you must archive these files. If
you have do-files that depend on ado-files you have written, your do-files will not work
if you lose the ado-files. Second, if you change your ado-files, you must verify that
your old do-files continue to work. For example, if I decide that I do not like the name
number for the option that numbers the list of variables and change the option name
to addnumbers, do-files that use the command nmlabel with the number option will
no longer work. With ado-files, you must be careful that improvements do not break
programs that used to work.
4.6 Help files
When you type help command, Stata searches the ado-path for the file command.sthlp
or command.hlp.8 If the file is found, it is shown in the Viewer window. In this section,
I start by showing you how to write a simple help file for the nmlabel command written
in section 4.5.3. Then I show you how I use help files to remind me of options and
commands that I frequently use.
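If you are unsure where Stata is looking, you can list the ado-path yourself; this two-command sketch is illustrative rather than taken from the book:

    adopath         // list the directories Stata searches for ado-files and help files
    help nmlabel    // displays nmlabel.sthlp or nmlabel.hlp if found along that path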
4.6.1 nmlabel.hlp
To document the nmlabel command, I create a text file called nmlabel.hlp. When I
type help nmlabel, a Viewer window displays the file; see figure 4.1.
8. The advantage of using the suffix .sthlp rather than .hlp is that many email systems refuse to
accept attachments that have the .hlp suffix because they might contain a virus.
Figure 4.1. Viewer window displaying help nmlabel
The file nmlabel.hlp is a text file that looks like this:

.-
help for ^nmlabel^ :: 2008-03-07
.-

Create a list of variable names and variable labels

^nmlabel^ varlist ^,^ [ ^num^ber ^col(^#^)^ ^skip^ ]

Description

^nmlabel^ lists the names and variable labels for a list of variables
that you provide.

Options

^number^ produces a numbered list.

^col(^#^)^ indicates the column in which the variable label will begin.
By default, the label begins in column 10.

^skip^ will skip a line between the echoed command name and the listing
of names and labels.

. ^use wf-lfp^
(Data from 1976 PSID-T Mroz)

. ^nmlabel lfp k5^
lfp       In paid labor force? 1=yes 0=no
k5        # kids < 6

. ^nmlabel lfp k5, num^
#1: lfp      In paid labor force? 1=yes 0=no
#2: k5       # kids < 6

. ^nmlabel lfp k5, num col(15)^
#1: lfp         In paid labor force? 1=yes 0=no
#2: k5          # kids < 6

. ^nmlabel lfp k5, skip^

lfp       In paid labor force? 1=yes 0=no
k5        # kids < 6

Author: Scott Long - www.indiana.edu/~jslsoc/workflow.htm
The file includes two shortcuts for making the file easier to write and easier to read
in the Viewer window. In the first line, .- is interpreted by the Viewer window as a
solid line from the left border to the right. The carets ^ are used to toggle bold text on
and off. For example, at the top of the Viewer window, the word nmlabel is in bold
because the text file contains ^nmlabel^. Or consider the sequence ^col(^#^)^: the
first six characters ^col(^ make col( bold, then # is not bold, and ^)^ makes ) bold.
If I wanted to make the help file fancier, with links to other files, automatic indentation,
italics, and many other features, I could use the Stata Markup and Control Language
(SMCL). See [U] 18.11.6 Writing online help and [R] help for further information.
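To give a taste of SMCL, here is a minimal sketch of what a SMCL version of the help file might begin with; the layout and wording are my own illustration, not the book's file:

    {smcl}
    {* a minimal SMCL help file; save as nmlabel.sthlp }{...}
    {title:Title}

    {p 4 4 2}
    {cmd:nmlabel} lists the names and variable labels for a list of variables.

    {title:Syntax}

    {p 8 8 2}
    {cmd:nmlabel} {it:varlist} [{cmd:,} {cmd:number} {cmd:col(}{it:#}{cmd:)} {cmd:skip}]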
4.6.2 help me
I use a help file named me.hlp to give me quick access to information that I frequently
use. This includes both summaries of options and fragments of code that I can copy
and paste into a do-file. I put me.hlp in the PERSONAL directory. Then, when I type
help me, a Viewer window opens (see figure 4.2) and I can quickly find this information
(file: me.hlp):
Figure 4.2. Viewer window displaying help me
4.7 Conclusions
Automation is fundamental to an effective workflow, and Stata provides many tools for
automating your work. Although this chapter provides a lot of useful information, it is
only a first step in learning to program in Stata. If you want to learn more, consider
taking a NetCourse from StataCorp (http://www.stata.com/netcourse/). NetCourse
151—Introduction to Stata Programming is a great way to learn to use Stata more
effectively even if you do not plan to do advanced programming. NetCourse 152—
Advanced Stata Programming teaches you how to write sophisticated commands in
Stata. If you spend a lot of time using Stata, learning how to automate your work will
make your work easier and more reliable, plus it will save you time.
5 Names, notes, and labels
This chapter marks the transition from discussions of broad strategy in chapter 2 and
general tools in chapters 3 and 4 to discussions of the specific tasks that you encounter
as you move from an initial dataset to published findings. Chapter 5 discusses names,
notes, and labels for variables, datasets, and do-files; these topics are essential for effec-
tive organization and documentation. Chapter 6 discusses cleaning data, constructing
variables, and other common tasks in data management. For most projects, the vast
majority of your time will be spent getting your data ready for statistical analysis. Fi-
nally, chapter 7 discusses the workflow of statistical analysis and presentation. Topics
include organizing your analyses, extracting results for presentation, and documenting
where the results you present came from. These three chapters incorporate two ideas
that I find indispensable for an effective workflow. First, the concept of posting a file
refers to deciding that a file is final and can no longer be changed. Posting files is critical
because otherwise you risk inconsistent results that cannot be replicated. The second
idea is that data analysis should be divided between data management and statistical
analysis. Data management includes cleaning your data, constructing variables, and
creating datasets. Statistical analysis involves examining the structure of your data us-
ing descriptive statistics, model estimates, hypothesis tests, graphical summaries, and
other methods. Creating a dual workflow for data management and statistical anal-
ysis simplifies writing documentation, makes it easier to fix problems, and facilitates
replication.
5.1 Posting files
Posting a file is a simple idea that is essential for data analysis. At some point when
writing a do-file, you decide that the program is working correctly. When this hap-
pens, you should post your work. Posting means that the do-file and log file, along
with datasets that were created, are placed in the directory where you save completed
work (e.g., c:\cwh\Posted\). The fundamental principle for posted files is simple but
absolute:
Posting principle: Once a file is posted, it should never be changed.
If you change a posted file, you risk producing inconsistent results based on different
variables that have the same name or two datasets with the same name but different
content. I have seen this problem repeatedly and the only practical way that I know
to avoid it is to have a strict policy that once a file is posted, it cannot be changed.
An implication of this rule is that only posted files should be shared with others or
incorporated into papers or presentations.
The posting principle does not mean that you cannot change a do-file during the
process of debugging. As you debug a do-file, you create the same dataset each time
you run the program and might change the way a variable is created. That is not a
problem because the files have not been posted, but once the files are posted, you must
not change them.
Nor does posting a file mean that you cannot correct errors in do-files that have
been posted. Rather, it means that to fix the errors you need to create new files and
possibly new variables. For example, suppose that mypgm01.do creates mydata01.dta
with variables var01-var99. After posting these files, I discover a mistake in how var49
was created. To fix this, I create a revised mypgm01V2.do that correctly generates the
variable that I now name var49V2 and saves the new dataset mydata01V2.dta. I can
keep the original var49 in the new dataset or I can delete it, but I must not change
var49. I can delete mydata01.dta or I can keep it, but I must not change it. Because
posted files are never changed, I can never have results for var49 where the meaning of
var49 has changed. Nor is it possible for two people to analyze datasets with the same
name but different content.
Finally, the practice of posting files does not mean that you must post each file
immediately after you decide that it is complete and verified. I often work on a dozen
related do-files at a time until I get things the way I want them. For me, this is the
most efficient way to work. Something I learn while debugging one do-file might lead
me to change another do-file. At some point, I decide that all the do-files and datasets
are the way I want. Then the iterative process of debugging and program development
ends. When this happens, I move the do-files, log files, and datasets from my working
directory into a directory with completed work. That is, I post the files. After the files
are posted, and only after they are posted, I can include the results in a paper, make
the datasets available to collaborators, or share the log files with colleagues.
Although I find that most people agree in theory with the idea of posting, in practice
the rule is violated frequently. I have been part of a project where a researcher posted a
dataset, quickly realized a mistake, and ten minutes later replaced the posted file with a
different file that had the same name. During those ten minutes, I had downloaded the
file. It took us a lot of time to figure out why we were getting different results from the
"same" dataset. I recently received a dataset that had the same name as an earlier one
but was a different size. When I asked if the dataset was the same, I was told, "Exactly
the same except that I changed the married variable".
The simplest thing is to make no exceptions to the rule for posting files. Once you
allow exceptions, you start down a slippery slope that is bound to lead to problems.
When a dataset is posted, if anything is changed, the dataset gets a new name. If a
posted do-file is changed, it gets a new name. And so on. If you do not make an absolute
distinction between files that are in process and those that are complete and posted, you
risk producing inconsistent results and undermining your ability to replicate findings.
5.2 The dual workflow of data management and statistical
analysis
Figure 5.1. The dual workflow of data management and statistical analysis
I distinguish between programs for data management and programs for statistical
analysis. I refer to this as a dual workflow, as illustrated in figure 5.1. The two sets
of do-files are distinct in the sense that programs for data management do not depend
on programs for statistical analysis. Operationally, this means that I can run the data-
management programs in sequence without running any of the programs for statistical
analysis. This is possible because programs for statistical analysis never change the
datasets (they might tell you how you want to change the dataset, but they do not make
the change). Programs for statistical analysis do, however, depend on the datasets
created by data-management programs. For example, stat03a.do will not work unless
data04.dta has been created by data04.do.
A dual workflow makes it easier to correct errors when they occur. For example,
if I find an error in var15 in data02.dta, I only have to look for the problem in
the data-management programs because the statistical analysis programs never create
variables that are saved. If I find a problem in data02.do, I create the corrected do-
file data02V2.do, which saves data02V2.dta, and corrects the problem in data02.dta.
Then I revise, rename, and rerun any of the stat*.do do-files that depend on the
changed data.
This workflow implies that you do not create and save new variables in your analysis
do-files. For example, if I have a variable named gender coded 1 for men and 2 for
women and decide that I want a variable female coded 1 for female and 0 for male,
I would create a new dataset that added female, rather than creating female in the
do-files for statistical analyses. I prefer this approach because I rarely create a variable
that I use only once. I might think I will use it only once, but in practice I often need it
for other, unanticipated analyses. Searching earlier do-files to find how a variable was
created is time consuming and error prone. Also, I might forget that I created a variable
and later create another variable with the same name but a different
meaning. Saving the variable in a dataset is easier and safer.
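To make this concrete, here is a minimal sketch of how female could be added in a data-management do-file; the dataset names are hypothetical:

    use data02, clear                                // current posted dataset (hypothetical name)
    generate female = gender==2 if !missing(gender)  // gender: 1=men, 2=women
    label variable female "1=female 0=male"
    save data03, replace                             // new variable, so a new dataset name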
The distinction between data management and statistical analysis is not always
clear. For example, I might use factor analysis to create a scale that I want to include
in a dataset. The task of specifying, fitting, and testing a factor model is part of
statistical analysis. But constructing a scale to save is part of data management. In
such situations, I might violate the principle of a dual workflow and create a dataset
with a program that is part of the statistical analysis workflow. More likely, I would
keep programs to fit, test, and perfect the factor model as part of the statistical analysis
workflow. Once I have decided on the model I want for creating factor scores, I would
incorporate that model into a program for data management. The dual workflow is
not a Procrustean bed but rather is a principle that generally makes your work more
efficient and facilitates replication.
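As a sketch of that data-management program, assuming items x1-x5 and a one-factor model settled on during the statistical analysis workflow (all names here are hypothetical):

    use data04, clear
    factor x1-x5, factors(1)   // refit the model chosen during analysis
    predict fscale             // save the factor score as a new variable
    label variable fscale "Factor score from 1-factor model of x1-x5"
    save data05, replace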
5.3 Names, notes, and labels
With the principles of posting and a dual workflow in mind, we are ready to consider
the primary topics of this chapter: names, notes, and labels for variables, datasets, and
do-files. Is it worth your time to read an entire chapter about something as seemingly
simple as picking names and labeling things? I think so. Many problems in data analysis
occur because of misleading names and incomplete labels. An unclear name can lead to
the wrong variable in a model or to incorrect interpretations of results. Less drastically,
inconsistent names and ineffective labels make it harder to find the variables that you
want and more difficult to interpret your output. On the other hand, clear, consistent,
and thoughtful names and labels speed things up and prevent errors. Planning names
and labels is one of the simplest things you can do to increase the ease and accuracy of
your data analysis. Because choosing better names and adding full labels does not take
much time, relative to the time lost by not doing this, the investment is well worth it.
Section 5.4 describes naming do-files in a way that keeps them organized and facil-
itates replication. Section 5.5 describes changing the filename of a dataset and adding
an internal note that documents how the dataset was changed every time you change
a dataset, no matter how small the change. The next five sections focus on variables.
Section 5.6 is about naming variables, with topics ranging from systems for organizing
names to how names appear in the Variables window. Section 5.7 describes variable
labels. These short descriptions are included in the output of many commands and are
essential for an effective workflow. Section 5.8 introduces the notes command for doc-
umenting variables. This command is incredibly useful, yet I find that many people are
unaware of it. Section 5.9 describes labels for values and tools for keeping track of these
labels. Section 5.10 is about a unique feature of Stata, the ability to create labels in
multiple languages within one dataset. This is most obviously valuable with languages
such as French, English, and German but is also a handy way to include long and short
labels in the same language. Although you have no choice about the names and labels
in data collected by others, you can change those names and create new labels that work
better. A workflow for changing variable names and labels is presented in section 5.11
that includes an extended example using programming tools from chapter 4. Even if
you are already familiar with commands such as label variable, label define, and
label values, I think this section will help you work faster and more accurately.
5.4 Naming do-files
A single project can require hundreds of do-files. How you name these files affects
how easily you can find results, document work, fix errors, and revise analyses. Most
importantly, carefully named do-files make it easier to replicate your work. My recom-
mendation for naming do-files is simple:
The run order rule: Do-files should be named so that when run in alphabet-
ical order they exactly re-create your datasets and replicate your statistical
analyses.
For simplicity, I refer to the order in which a group of do-files needs to be run as the run
order. The reasons you want names that reflect the run order differ slightly depending
on whether the do-files are used to create datasets or to compute statistical analyses.
5.4.1 Naming do-files to re-create datasets
Creating a dataset often requires that several do-files are run in a specific order. If
you run them in the wrong order, they will not work correctly. For example, sup-
pose that I need two do-files to create a dataset. The first do-file merges the variable
hlthexpend from medical.dta and the variable popsize from census.dta to create
health01.dta. The second do-file creates a variable with generate hlthpercap =
hlthexpend/popsize and then saves the dataset health02.dta. If I name the do-files
merge.do and addvar.do, the names do not reflect the run order needed to create
health02.dta. However, if I name them data01-merge.do and data02-addvar.do,
the order is clear. Of course, in such a simple example, I could easily determine the
sequence in which the do-files need to be run no matter how I name them. With dozens
of do-files written over months or years, names that indicate the sequence in which the
programs need to be run are extremely helpful.
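As a sketch, the two do-files might contain the following; the merge key id and the 1:1 merge syntax are my assumptions, not details given in the text:

    * data01-merge.do
    use medical, clear
    merge 1:1 id using census    // id is an assumed identifier
    keep id hlthexpend popsize
    save health01, replace

    * data02-addvar.do
    use health01, clear
    generate hlthpercap = hlthexpend/popsize
    save health02, replace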
Naming do-files to indicate the run order also makes it simpler to correct mistakes.
Suppose that I need ten do-files to create mydata01.dta and that the programs need
to run in the order data01.do, data02.do, through data10.do. After running the ten
do-files and posting mydata01.dta, I realize that data06.do incorrectly deleted several
observations. To fix the error, I create the corrected do-file data06V2.do and run the
sequence of programs data06V2.do through data10V2.do. Because of the way I named
the files, I know exactly which do-files need to be run and in what order to create a
corrected dataset named mydata01V2.dta.
5.4.2 Naming do-files to reproduce statistical analysis
If you write robust do-files, as discussed in chapter 3 (see page 51), results should
not depend on the order in which the programs are run. Still, I recommend that you
sequentially name your analysis do-files so that the last do-file in the sequence pro-
duces the latest analyses. Suppose that you are computing descriptive statistics and
fitting logit models. You might need a half dozen do-files as you refine your choice
of variables and decide on the descriptive statistics that you want. Similarly, you
might write several do-files as you explore the specification of your model. I sug-
gest naming the do-files to correspond to the run order for each task. For example,
you might have desc01.do-desc06.do and logit01.do-logit05.do, where you know
that desc06.log and logit05.log have the latest results. This naming scheme pre-
vents the problem of thinking that you are looking at the latest analyses when you are
not.5.4.3 Using master do-files 131
5.4.3 Using master do-files
Sometimes you will need to rerun a sequence of do-files to reproduce all the work related
to some part of your project. For example, when I complete the do-files to create a
dataset, I want to verify that all the programs work correctly before posting the files.
Or after discovering an error in one program in a sequence of related jobs, I want to fix
the error and verify that all the programs continue to work correctly. A master do-file
makes this simple. A master do-file is simply a do-file that runs other do-files. For
example, I can create the master do-file dual-dm.do to run all the programs from the
left column of figure 5.1:
// dual-dm.do: do-file for data management
// scott long \ 2008-03-14
do data01.do
do data02.do
do data03.do
do data04.do
exit
To rerun the four do-files in sequence, I type the command
do dual-dm.do
Similarly, for the statistical analysis, I can create dual-sa.do:
// dual-sa.do: do-file for statistical analysis
// scott long \ 2008-03-14
* descriptive statistics
do stat01a.do
do stat01b.do
do stat01c.do
* logit models
do stat02a.do
do stat02b.do
* graphs of predictions
do stat03a.do
do stat03b.do
exit
which can be run by typing
do dual-sa.do
Suppose that I find a problem in data03.do that affects the creation of data03.dta
and consequently the creation of data04.dta. This would also affect the statistical
analyses based on these datasets. I need to create V2 versions of several do-files for data
management and statistical analysis as shown in figure 5.2.
Figure 5.2. The dual workflow of data management and statistical analysis after fixing
an error in data03.do
After revising the do-files, my master do-files become
// dual-dm.do: do-file for data management
// scott long \ 2008-03-14; revised 2008-03-17
do data01.do
do data02.do
do data03V2.do
do data04V2.do
exit
and
// dual-sa.do: do-file for statistical analysis
// scott long \ 2008-03-14; revised 2008-03-17
* descriptive statistics
do stat01a.do
do stat01b.do
do stat01c.do
* logit models
do stat02aV2.do
do stat02bV2.do
* graphs of predictions
do stat03aV2.do
do stat03bV2.do
exit
By running the following commands, all my work will be corrected:
do dual-dm.do
do dual-sa.do
Master log files
Stata allows you to have more than one log file open at the same time. This provides a
convenient way to combine all the log files generated by a master do-file into one log.
For example (file: wf5-master.do),
1>  capture log close master
2>  log using wf5-master, name(master) replace text
3>  // program:  wf5-master.do
4>  // task:     Creating a master log file
5>  // project:  workflow chapter 5
6>  // author:   jsl \ 2008-04-03
7>  do wf5-master01-desc.do
8>  do wf5-master02-logit.do
9>  do wf5-master03-tabulate.do
10> log close master
11> exit
Line 2 opens wf5-master.log. The name(master) option assigns the log a nickname
referred to as the "logname". When you have more than one log file active, you need
a logname for all but one of the logs. Line 1 closes master if it is already open, with
capture meaning that if it is not open, ignore the error generated by the log close
command. Lines 3-6 are recorded in wf5-master.log. In addition, the output from
the do-files run in lines 7-9 is sent to wf5-master.log. Line 10 closes the master log
file. When wf5-master.do was run, four log files were created:
wf5-master.log
wf5-master01-desc.log
wf5-master02-logit.log
wf5-master03-tabulate.log
The file wf5-master.log contains all the information from the three other log files.
Instead of printing three files (or dozens in a complex set of analyses), I can print one
file. If I am including results on the web, I need to post only one log file.
5.4.4 A template for naming do-files
Although my primary consideration in naming do-files is that the alphabetized names
indicate the run order, there are other factors to consider:
• Use names that remind you of what is in the file and that help you find the file
later. For example, logit01.do is better than pgm01.do.
• Anticipate revising your do-files and adding new do-files. If you find an error in
a do-file, what will you name the corrected file, and will the new name retain the
sort order? If you need to add a step between two do-files, will your system allow
you to add the do-file with a name that retains the run order?
• Choose names that are easy to type. Names that are too long or that have special
characters should be avoided.
With these considerations in mind, I suggest the following template for naming do-files,
where no spaces are included in the filename:
project[-task]step[letter][Vversion][-description].do
For example, fl-clean01a-CheckLabels.do or fl-logit01aV2-BaseModel.do. Here
are the details:
project-task  The project is a short mnemonic such as cwh for a study of cohort, work, and
health; fl for a study of functional limitations; and sgc for the Stigma in a
Global Context project. As needed, I divide the project into tasks. For example,
I might have cwh-clean for jobs related to cleaning data for the cwh project.
step and letter  Within a project and task, the two-digit step indicates the order in which the
do-files are run. For example, fl-desc01.do, fl-desc02.do, etc. If the project
is complex, I might also use a letter, such as fl-desc01a.do and fl-desc01b.do.
version  A version number is added if there is a revision to a do-file that has been posted.
For example, if fl-desc01a.do was posted before an error was discovered, the
replacement file is named fl-desc01aV2.do. I have never needed ten revisions, so
only one digit is used.
description  The description is used to make it easier to remember what a do-file is for. The
description does not affect the sort order and is not required to make the name
unique. For example, I am not likely to remember what fl-desc01a.do does, but
fl-desc01a-health.do reminds me that the program is computing descriptive
statistics for health variables. When I refer to do-files, say, in my research log, I
do not need to include the description as part of the name. That is, I could refer
to fl-desc01a.do rather than fl-desc01a-health.do.
Expanding the template
What happens if you have a very large project with complicated programs that
require lots of modifications and additions? The proposed template scales easily. For
example, between fl-pgm01a.do and fl-pgm01b.do I can insert fl-pgm01a1.do and
fl-pgm01a2.do. Between these jobs I can insert fl-pgm01a1a.do and fl-pgm01a1b.do.
Collaborative projects
In a collaborative project, I often add the author's initials to the front of the job
name. For example, I could use jsl-fl-desc01a.do rather than fl-desc01a.do.
Using subdirectories for complex analyses
As discussed in chapter 2, I use subdirectories to organize the do-files from large projects.
This can best be explained by an example. Eliza Pavalko and I (Long and Pavalko
2004) wrote a paper examining how using different measures of functional limitations
affected substantive conclusions. These measures were created from questions that ask
if a person has trouble with physical activities such as standing, walking, stooping, or
lifting. Using questions on nine activities, we constructed hundreds of scales to measure
a person's overall limitations, where the scales were based on alternative measures used
in the research literature. When the paper was finished, we had nearly 500 do-files to
construct the scales and run analyses. Here is how we kept track of them.
The mnemonic for the project is fl, standing for functional limitations. All project
files had names that started with fl and were saved in the project directory \flalt.
Posted files are placed in \flalt\Posted within these subdirectories:
Directory      Task
\fl00-data     Datasets
\fl01-extr     Extract data from source files
\fl02-scal     Construct scales of functional limitations
\fl03-out      Construct outcome measures
\fl04-desc     Descriptive statistics for source variables
\fl05-lca      Fit latent-class models
\fl06-reg      Fit regression models
The first directory holds datasets, the next three directories are for data management
and scale construction, and the last three directories are used for statistical analyses.
If I need an additional step, say, verifying the scales, I can add a subdirectory that is
numbered so that it sorts to the proper location in the sequence (e.g., \fl03-1-verify).
Each subdirectory holds the do-files and log files for that task, with datasets kept in
\fl00-data. The do-files within a subdirectory are named so that, if they are run in
alphabetical order, they reproduce the work for that task. Even though it is unlikely
that I will finish the work in task \fl01-extr before I start \fl04-desc (e.g., while
looking at the descriptive statistics, I am likely to decide that I need to extract other
variables), my goal is to organize the files so that I could correctly reproduce everything
by running the jobs in order: all the jobs in \fl01-extr, followed by all the jobs in
\fl02-scal, and so on. This is very helpful when trying to replicate your work or when
you need to make revisions.
5.5 Naming and internally documenting datasets
The objective when naming datasets is to be certain that you never have two datasets
with the same name but different content. Because datasets are often revised as you
add variables, I suggest a simple convention that makes it easy to indicate the version
of the dataset:
dataset-name##.dta
For example, if the initial dataset is mydata01.dta, the next one is mydata02.dta, and
so on. Every time I change the current version, no matter how small the change, I
increment the number by one. The most common objections I get to creating a new
dataset every time a change is made are "I'm getting too many datasets!" and "These
datasets take up too much space!" Storage is cheap so you can easily keep many versions
of your data, or you can delete earlier versions of a dataset because you can reproduce
them with your do-files (assuming you have an effective workflow). Alternatively, you
can compress the datasets before archiving them. For example, the dataset attr04.dta
has information on attrition from the National Longitudinal Survey. The file is 2,065,040
bytes long but when compressed (see page 264) is reduced to 184,552 bytes. When I
compress a dataset, I like to combine the dataset with a do-file and log file that describe
the data. The do-file might simply contain
log using attr04-dta, replace
use attr04, clear
describe
summarize
notes
log close
When I unzip the dataset, I can quickly verify the content of the dataset without having
to load the dataset or check my research log.
Never name it final!
Although it is tempting to name a dataset as final, this usually leads to confusion.
For example, after a small error is found in mydata-final.dta, the next version is
called mydata-final2.dta, and then later mydata-reallyfinal.dta. If final is in
the name, you run the risk that you and others might believe that the dataset is final
when there is an updated version. Recently, I was copied on a message that asked,
"Does final2 have a date attached so I know it is the most recent version?"
5.5.1 One-time-only and temporary datasets
If I create a dataset that I expect to use only once, I give it the name of the do-file that
created it. For example, suppose that demogcheck01.do merges data from two datasets
to verify that the demographic data from the two sources are consistent. Because I
do not anticipate further analyses using this dataset, but I want to keep it if I have
questions later, I would name it demogcheck01.dta. Then the name of the dataset
documents its origin.
I often create temporary datasets when building a dataset for analysis (see sec-
tion 5.11 for an example). I keep these datasets until the project is completed, but
they are not posted. To remind me that these files are not critical, I name them be-
ginning with x-. Accordingly, if I find a dataset that starts with x-, I know that
I can delete it. For example, suppose that I am merging demographic information
from demog05.dta and data on functional limitations from flim06.dta. My pro-
gram fl-mrg01.do extracts the demographic data and stores it in x-fl-mrg01.dta;
fl-mrg02.do extracts the limitation data and stores it in x-fl-mrg02.dta. Then
fl-mrg03.do creates fl-paper01.dta by merging x-fl-mrg01.dta and x-fl-mrg02
.dta. I delete the x- files when I finish the project.
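A sketch of what fl-mrg03.do might contain; the merge key id and the 1:1 merge syntax are again my assumptions:

    * fl-mrg03.do
    use x-fl-mrg01, clear            // temporary extract of demographics
    merge 1:1 id using x-fl-mrg02    // temporary extract of limitations
    drop _merge
    save fl-paper01, replace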
I also find that prefacing a file with x- can prevent a problem when collaborating.
Suppose that I am constructing a dataset that my collaborator and I both plan to
analyze. I write a series of do-files to create the dataset that I will eventually name
fl-paper01.dta. Initially, I am not sure if I have extracted all the variables that we
need or created all the scales we planned. Rather than distributing a dataset named
fl-paper01.dta, I create x-fl-paper01.dta. Because the name begins with x-, my
colleague and I know that this is not a final dataset, so there is no chance of acciden-
tally running serious analyses. When we agree that the dataset is correct, I create
fl-paper01.dta and post the dataset.
5.5.2 Datasets for larger projects
When working on projects using lots of variables, I prefer a separate dataset for each
type of variable rather than one dataset for all variables. For example, in a project
using the National Longitudinal Survey, we grouped variables by content and created
these datasets:
Dataset       Content
attd##.dta    Attitude variables
attr##.dta    Attrition information
cntl##.dta    Control variables such as age and education
emps##.dta    Employment status
fami##.dta    Characteristics of the family
flim##.dta    Health and functional limitations
By dividing the variables, each member of the project could work on a different part of
the data without risk of interfering with the work done by other team members. This was
important because each set of variables took dozens of do-files and weeks to complete.
When a segment of the data was completed, the new dataset was posted along with
the associated do-files and log files. To run substantive analyses, we extracted variables
from the multiple source datasets and merged them into one analysis dataset.
5.5.3 Labels and notes for datasets
When you save a dataset, you should add internal documentation with a dataset label,
a note, and a data signature. These are all forms of what is referred to as metadata:
data about data. The advantage of metadata is that it is internal to the dataset, so
when you have the dataset you have the documentation. To add a dataset label, use
the command
label data "label"
For example,
label data "CWH analysis file \ 2006-12-07"
save cwh01, replace
The data label is echoed when you use the data:
. use cwh01, clear
(CWH analysis file \ 2006-12-07)
I use notes to add further details:
notes: note5.5.4 The datasignature command 139
Because no variable name is specified, the note applies to the dataset rather than to
a variable (see section 5.8). In the note, I include the name of the dataset, a brief
description, and details on who created the dataset with what do-file on what date. For
example,
notes: cwh01.dta \ initial CWH analysis dataset \ cwh-dta01a.do jsl 2006-12-07.
label data "CWH analysis file \ 2006-12-07"
save cwh01, replace
After I load the dataset, I can easily determine how the dataset was created. For
example,
. notes _dta
_dta:
1. cwh01.dta \ initial CWH analysis dataset \ cwh-dta01a.do jsl 2006-12-07.
Each time I update the data (e.g., create cwh02.dta from cwh01.dta), I add a note.
Listing the notes provides a quick summary of the do-files used to create the dataset:
. use cwh05, clear
(CWH analysis file \ 2006-12-22)
. notes _dta
_dta:
1. cwh01.dta \ initial CWH analysis dataset \ cwh-dta01a.do jsl 2006-12-07.
2. cwh02.dta \ add attrition \ cwh-dta02a.do jsl 2006-12-07.
3. cwh03.dta \ add demographics \ cwh-dta03c.do jsl 2006-12-09.
4. cwh04.dta \ add panel 5 data \ cwh-dta04a.do jsl 2006-12-19.
5. cwh05.dta \ exclude youngest cohort \ cwh-dta05a.do jsl 2006-12-22.
As an example of how useful this is, while writing this book I lost the do-file that created
a dataset used in an example. I had the dataset but needed to modify the do-file that
created it so I could add another variable. To find the file, I loaded the dataset, checked
the notes to find the name of the do-file that created it, and searched my hard drive for
the missing do-file. A good workflow makes up for lots of mistakes!
5.5.4 The datasignature command
The datasignature command protects the integrity of your data and should be used ev-
ery time you save a dataset.1 datasignature creates a string of numbers and symbols,
referred to as the data signature or simply signature, which is based on five character-
istics of the data. For example (file: wf5-datasignature.do),
. use wf-datasig01, clear
(Workflow data for illustrating datasignature #1 \ 2008-04-02)
. datasignature
753:8(54146):1899015902:1680634677
1. The datasignature command in Stata 10 is not the same as datasignature in Stata 9. The newer
command is much easier to use.140 Chapter 5 Names, notes, and labels
The string 753:8(54146):1899015902:1680634677 is the signature for
wf-datasig01.dta (below I explain where this string comes from). If I load a dataset
that does not have this signature, whether it is named wf-datasig01.dta or something
else, I am certain that the datasets differ. On the other hand, if I load a dataset that has
this signature, I am almost certain that I have the right dataset. (The reason that I am
not completely certain is discussed below.) This can be useful in many ways. You and
a colleague can verify whether you are analyzing the same dataset. If you are revising
labels, as discussed later in this chapter, you can check if you mistakenly changed the
data itself, not just the labels. If you store datasets on a LAN where others have read
and write privileges, you can determine if someone changed the dataset but forgot to
save it with a different name. datasignature is an easy way to prevent many problems.
The signature consists of five numbers, known as checksums, that describe the
dataset. Anyone with the same dataset using the same rules for computing the check-
sums will obtain the same values. The first checksum is the number of cases (753 in
the example above). If I load a dataset with more or fewer observations, this number
will not match and I will know I have the wrong data. The second is the number of
variables (8 in our example). If I load a dataset that does not have 8 variables, this part
of the signature will not match. The third part of the signature is based on the names
of the variables. To give you a simplified idea of how this works, consider the variables
in wf-datasig01.dta:
. describe, simple
lfp k5 k618 age wc hc lwg inc
These names are 22 (= 3+2+4+3+2+2+3+3) characters long. If I load a dataset
where the length of the names is not 22, I know that I have the wrong dataset. The
fourth and fifth numbers are checksums that characterize the values of variables.
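As a rough illustration of the name-based component, this sketch adds up the lengths of the variable names; it shows the idea only and is not the checksum algorithm that datasignature actually uses:

    local total 0
    foreach v of varlist _all {
        local total = `total' + strlen("`v'")   // add the length of each name
    }
    display "total length of variable names = `total'"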
The idea behind a data signature is that if the signature of a dataset that you use
matches the signature of a dataset you saved, it is very likely that the two datasets
are the same. The signature is not perfect, however. If you have a lot of computing
power, you could probably find two datasets with the same signature but different
content (Mackenzie 2008). In practice, this is extremely unlikely so you can reasonably
assume that if the data signatures from two datasets match, the data are the same.
For full details on how the signature is computed, type help datasignature or see
[D] datasignature.
A workflow using the datasignature command
I suggest that you always compute a data signature and save it with your dataset.
When you use a dataset, you should confirm that the embedded signature matches the
signature of the data in memory. The datasignature set command computes the
signature. For example,
. datasignature set
753:8(54146):1899015902:1680634677 (data signature set)
Once the signature is set, it is automatically saved when you save the dataset. For
example,
. notes: wf-datasig02.dta \ add signature \ wf5-datasignature.do jsl 2008-04-03.
. label data "Workflow dataset for illustrating datasignature \ 2008-04-03"
. save wf-datasig02, replace
file wf-datasig02.dta saved
When I load the dataset, I can confirm that the dataset in memory generates the same
signature as the one that was saved:
. use wf-datasig02, clear
(Workflow dataset for illustrating datasignature \ 2008-04-03)
. datasignature confirm
(data unchanged since 03apr2008 09:58)
Because the signature matches, I am confident that I have the right data.
Why would a signature fail to match? Suppose that my colleague used
wf-datasig02.dta that I created on 3 April 2008. He renamed a variable, changed the
dataset label, and violated good workflow by saving the changed data with the same
name:
. use wf-datasig02, clear
(Workflow dataset for illustrating datasignature \ 2008-04-03)
. rename k5 kids5
. save wf-datasig02, replace
file wf-datasig02.dta saved
He did not run the datasignature set command before saving the dataset. When I
load the dataset and check the signature, I am told that the dataset has changed:
. use wf-datasig02, clear
(Workflow data for illustrating datasignature \ 2008-04-03)
. datasignature confirm
data have changed since 03apr2008 09:58
r(9);
I know immediately that there is a problem.
Changes datasignature does not detect
The datasignature confirm command does not detect every change in a dataset.
First, the signature does not change if you only change labels. For example,
. use wf-datasig02, clear
(Workflow dataset for illustrating datasignature \ 2008-04-03)
. label var k5 "Number of children less than six years of age"
. datasignature confirm
(data unchanged since 03apr2008 09:58)
The signature does not change because it does not contain checksums based on variable
or value labels. Because changed labels can cause a great deal of confusion, I hope this
information is added to a later version of the command.
Second, datasignature confirm does not detect changes if the person saving the
dataset embeds a new signature. For example, I load a dataset that includes a signature:
. use wf-datasig02, clear
(Workflow dataset for illustrating datasignature \ 2008-04-03)
. datasignature confirm
(data unchanged since 03apr2008 09:58)
Next I rename variables k5 and k618:
. rename k5 kids5
. rename k618 kids618
Now I reset the signature and change the data label:
. datasignature set, reset
753:8(61387):1899015902:1680634677 (data signature reset)
. notes: Rename kids variables \ datasig02.do jsl 2008-04-04.
. label data "Workflow data for illustrating datasignature \ 2008-04-04"
By mistake, I save the dataset with the same name:
. save wf-datasig02, replace
file wf-datasig02.dta saved
The next time I load wf-datasig02.dta, I check the signature:
. use wf-datasig02, clear
(Workflow data for illustrating datasignature \ 2008-04-04)
. datasignature confirm
(data unchanged since 04apr2008 11:23)
Appropriately, datasignature confirm finds that the embedded signature matches the
dataset in memory. The problem is that I should not have saved the dataset with the
same name wf-datasig02.dta. Because I used label data and notes:, the dataset
contains information that points to the problem. First, the data label has the date
2008-04-04, whereas the original dataset was saved on 2008-04-03. The notes also show
a problem:
. notes
_dta:
1. wf-datasig01.dta \ no signature \ wf-datasig01-supportV2.do jsl
2008-03-09.
2. wf-datasig02.dta \ add signature \ wf5-datasig01.do jsl 2008-04-03.
3. wf-datasig02.dta \ rename kids variables \ datasig02.do jsl 2008-04-04.
Given my workflow for saving datasets, there should not be two notes indicating that
the same dataset was saved by different do-files on different dates.
5.6 Naming variables
Variable names are fundamental to everything you do in data management and statis-
tical analysis. You want names that are clear, informative, and easy to use. Choosing
effective names takes planning. Unfortunately, planning names is an uninspiring job,
is harder than it first appears, and seems thankless because the payoff generally comes
much later. Everyone should think about how variables are named before they begin
their analysis. Even if you use data collected by others, you need to choose names
for the variables that you want to add. You might also want to revise the original
names (discussed in section 5.11). In this section, I consider issues ranging from general
approaches for organizing names to practical considerations that affect your choice of
names.
5.6.1 The fundamental principle for creating and naming variables
The most basic principle for naming variables is simple:
Never change a variable unless you give it a new name.
Replication is nearly impossible if you have two versions of a dataset that contain
variable var27, but where the content of the variable has changed. Suppose that you
want to recode var27 to truncate values above 100. You should not replace the values
in the existing variable var27 (file: wf5-varnames.do):
replace var27 = 100 if var27>100 // do NOT do this!
Instead, you should use either generate or clonevar to create copies of the original
variable and then change the copy. The syntax for these commands is
generate newvar = sourcevar [if] [in]
clonevar newvar = sourcevar [if] [in]
The generate command creates a new variable but does not transfer labels and other
characteristics. The clonevar command creates a variable that is an exact duplicate of
an existing variable including variable and value labels; only the name is different. For
example, I can create two copies of the variable 1fp:
. use wf-names, clear
(Workflow data to illustrate names \ 2008-04-03)
. generate lfp_gen = lfp
(327 missing values generated)
. clonevar lfp_clone = lfp
(327 missing values generated)
The original lfp and the generated lfp_gen have the same descriptive statistics, but
lfp_gen does not have value or variable labels. lfp_clone, however, is identical to lfp:
. codebook lfp*, compact

Variable    Obs  Unique      Mean  Min  Max  Label
lfp         753       2  .5683931    0    1  Paid labor force?
lfp_gen     753       2  .5683931    0    1
lfp_clone   753       2  .5683931    0    1  Paid labor force?

. describe lfp*

              storage  display    value
variable name   type   format     label      variable label
lfp             byte   %8.0g      lfp        Paid labor force?
lfp_gen         float  %9.0g
lfp_clone       byte   %8.0g      lfp        Paid labor force?
Returning to our earlier example, after you generate or clone var27, you can change
the copy. With generate, type
generate var27trunc = var27
replace var27trunc = 100 if var27trunc>100 & !missing(var27trunc)
Or with clonevar, type
clonevar var27trunc = var27
replace var27trunc = 100 if var27trunc>100 & !missing(var27trunc)
Because truncating a variable can substantially affect later results, you probably agree
that I should create a new variable with a different name. Suppose that I am not "really"
changing the values. Imagine that educ uses 99 to indicate missing values and I decide
to recode these values to ., the sysmiss. Do I really need to create a new variable for
this? In one sense, I have not changed the data: missing are still missing. However,
you never want to risk that a changed variable will be confused with the original. The
best thing to do is to always create a new variable no matter how small the change.
Here I would use these commands:
clonevar educV2 = educ
replace educV2 = . if educV2==99
If you violate this rule, you can end up with results that are difficult or impossible to
replicate and findings that are unclear or wrong.
5.6.2 Systems for naming variables
There are three basic systems for naming variables: sequential naming, source naming,
and mnemonic naming.2 Each has its advantages, and in practice you might use a
combination of all three.
2. This discussion is based in part on ICPSR (2005).
Sequential naming systems
Sequential names use a stub followed by sequential digits. For example, the 2002 In-
ternational Social Survey Program (http://www.issp.org) uses the names v1, v2, v3,
..., v362. The National Longitudinal Survey uses names that start with R and end
with seven digits, such as R0000100, R0002203, and R0081000. Some sequential names
use padded numbers (e.g., v007, v011, v121), while others do not (e.g., v7, v11, v121).
Stata's aorder command (see page 155) alphabetizes sequential names as if they are
padded with zeros, even if they are not padded.
The numbers used in sequential names might correspond to the order in which the
questions were asked, to some other aspect of the data, or be meaningless. Although
sequential naming is often necessary with large datasets, these names do not work well
for data analysis. Because the names do not reflect content, it is easy to use the wrong
variable, it is hard to remember the name of the variable you need, and it is difficult to
interpret output. For example, was the command supposed to be this?
logit R0051400 R0000100 R0002203 R0081000
Or this?
logit R0051400 R1000100 R0002208 R0081000
Because of the risk of using the wrong variable when using sequential names, I often
refer to a printed list of variable names, descriptive statistics, and variable labels, such
as produced by codebook, compact.
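For example, a quick way to produce such a list for the current dataset (using the wf-names file from earlier in the chapter):

    use wf-names, clear
    codebook, compact    // one line per variable: name, n, unique values, mean, range, label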
Source naming systems
Source names use information about where a variable came from as part of the name.
The first three questions from a survey might be named q1, q2, and q3. If a question
had multiple parts, the variables might be named q4a, q4b, and q4c. In older datasets,
names might index the card and column where a variable is located (e.g., c1c15). With
source names, you are likely to have some variables that do not fit into the scheme,
which requires using some names that are not based on the source. For example, there
might be variables with information about the site of data collection or from debriefing
questions that are not numbered as part of the survey instrument. If you are creating
a dataset using source names, be sure to plan how you will name all the variables that
will be needed.
Names based on the source question are more useful than purely sequential names
because they refer to the questionnaire. Still, it is hard to look at a model specification
using source names and be certain that you have selected the correct variables.
Mnemonic naming systems
Mnemonic names use abbreviations that convey content (e.g., id, female, educ). I
much prefer this system because the names partially document your commands and the
output. A command like this

logit lfp age educ kids

is easier to use than this

logit R0051400 R0000100 R0002203 R0081000

or this

logit q17 q31 q19 q02

Although mnemonic names have many advantages, you need to choose the names care-
fully because finding names that are short, unambiguous, and informative is hard.
Mnemonic names created "on the fly" can be misleading and difficult to use.
5.6.3 Planning names
If you are collecting your own data, you should plan names before the dataset is created.
If you are extracting variables from an existing dataset, you should plan which vari-
ables you need and how you want to rename them before data extraction begins. Large
datasets such as the National Longitudinal Survey (NLS, http://www.bls.gov/nls) or the
National Longitudinal Study of Adolescent Health (http: //www.cpc.unc.edu/addhealth)
have thousands of variables, and you might want to extract hundreds of them. For exam-
ple, Eliza Pavalko and I (Long and Pavalko 2004) used data from the NLS on functional
limitations. We extracted variables measuring limitations for nine activities in each of
four panels for two cohorts and created over 200 scales. It took several iterations to
come up with names that were clear and consistent.
When planning names, think about how you will use the data. The more complex
the project, the more detailed your plan needs to be. Will the project last a few weeks
or several years? Do you anticipate a small number of analyses, or will the analyses be
detailed and complex? Are you the only one using the data, or will it be shared with
others? Will you be adding a new wave of data or another country? The answers to
these and similar questions need to be anticipated as you plan your names.
After you make general decisions on how to name variables, I suggest that you create
a spreadsheet to help you plan. For example, in a study of stigma
(http://www.indiana.edu/-sgemhs/), we received datasets from survey centers in 17
countries. Each center used source names for most variables. To create mnemonic
names, we began by listing the original name and question. We then classified vari-
ables into categories (e.g., questions about treatment, demographics, measures of social
distance). One member of the research team then proposed a set of mnemonic names
that was circulated for comments. After several iterations, we came up with names that
we agreed upon. Figure 5.3 is a portion of the large spreadsheet that we used (file:
wf5-names-plan.xls):
Question                                                     Question ID  Proposed name  Category
Question stem: What should NAME do about this situation...
...Talk to family                                            q2-1         tofam          treatment_option
...Talk to friends                                           q2-2         tofriend       treatment_option
...Talk to a religious leader                                q2-3         torel          treatment_option
...Go to a medical doctor                                    q2-4         todoc          treatment_option
...Go to a psychiatrist                                      q2-5         topsy          treatment_option
...Go to a counselor or another mental health professional   q2-6         tocou          treatment_option
...Go to a spiritual or traditional healer                   q2-7         tospi          treatment_option
...Take non-prescription medication                          q2-8         tonpm          treatment_option
...Take prescription medication                              q2-9         topme          treatment_option
...Check into a hospital                                     q2-10        tohos          treatment_option
...Pray                                                      q2-11        topray         treatment_option
...Change lifestyle                                          q2-12        tolifest       treatment_option
...Take herbs                                                q2-13        toherb         treatment_option
...Try to forget about it                                    q2-14        toforg         treatment_option
...Get involved in other activities                          q2-15        toothact       treatment_option
Figure 5.3. Sample spreadsheet for planning variable names
5.6.4 Principles for selecting names
Although choosing a system for naming variables is the first step, there are additional
factors to consider when selecting names (file: wf5-varnames.do).
Anticipate looking for variables
Before you decide on names (and labels, which are discussed in section 5.7), think about
how you will find variables during your analysis. This is particularly important with
large datasets. There are two aspects of finding a variable to consider. First, how will
the names work with Stata's lookfor command? Second, how will the names appear
in a sorted list?

The lookfor string command lists all variables that have string in their names or
variable labels. Of course, lookfor is only useful if you use names and labels that
include the strings that you are likely to search for. For example, if I name three
indicators of race raceblack, racewhite, and raceasian, then lookfor race will find
these variables. For example,
. lookfor race

              storage  display     value
variable name   type   format      label      variable label
racewhite       byte   %10.0g      Lyn        Is white?
raceblack       byte   %10.0g      Lyn        Is black?
raceasian       byte   %10.0g      Lyn        Is asian?
If I use the names black, white, and asian, then lookfor race will not find them
unless "race" is part of their variable labels. There is a trade-off between short names
and being able to find things. For example, if I abbreviate race as rce to create shorter
names, I must remember to use lookfor rce to find these variables because lookfor
race will not find them.
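To make this concrete, here is a minimal sketch; the variable rceblack and its label are
hypothetical:

generate byte rceblack = 0
label var rceblack "race: is black?"
lookfor race

Because lookfor searches labels as well as names, lookfor race still finds rceblack
through its label, even though the name uses the rce abbreviation.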
You can sort variables so that they appear in alphabetical order in the Variables
window (see the discussion of order and aorder on page 155). This is handy for finding
variables, especially if you like to click on a name in the Variables window to insert the
name into a command. When choosing names, think about how the names will appear
when sorted. For example, suppose I have several variables that measure a person's
preference for social distance from someone with mental illness. These questions deal
with different types of contact, such as having the person as a friend, having the person
marry a relative, working with the person, having her as a neighbor, and so on. I could
choose names such as friendsd, marrysd, worksd, and neighbsd. If I sorted the names,
the variables would not be next to one another. If I name the variables sdfriend, sdmarry,
sdwork, and sdneighb, they appear together in an alphabetized list. Similarly, the
names raceblack, racewhite, and raceasian work better than blackrace, whiterace,
and asianrace. If I have binary indicators of educational attainment (e.g., completing
high school, completing college), the names edhs, edcol, and edphd work better than
hsed, coled, and phded.
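Here is a minimal sketch of this principle in action, using a few hypothetical variables:

clear
set obs 1
generate friendsd = .
generate height = .
generate marrysd = .
generate sdfriend = .
generate sdmarry = .
aorder
describe, simple

After aorder, height falls between friendsd and marrysd, while sdfriend and sdmarry
remain side by side.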
Use simple, unambiguous names
There is a trade-off between the length of a name and its clarity. Although the name
IQ_23v is short, it is hard to remember and hard to type. A name like
socialdistancescale2 is descriptive but too long for typing and is likely to be trun-
cated in your output or when converting your data to another format. In a large dataset,
it is impossible to find names that meet all your goals for being clear and easy to use.
Keeping names short often conflicts with making names clear and being able to find
them with lookfor. With planning, however, you can select names that are much more
useful than if you create names without anticipating their later use. Here are some
things to consider when looking for simple, effective names.
Use shorter names Stata allows names of up to 32 characters but often truncates long
names when listing results. You need to consider not only how clear a name is but also
how clear it is when truncated in the output. For example, I generate three variables
with names that are 32 characters long and use the runiform() function to assign
uniform random numbers to the variables (file: wf5-varnames.do):
generate a2345678901234567890123456789012 = runiform()
generate a23456789012345678901234567890_1 = runiform()
generate a23456789012345678901234567890_2 = runiform()
When analyzed, the names are truncated in a way that is confusing:
. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
a23456789~12 |       100    .4718318    .2695077   .0118152   .9889972
a23456789~_1 |       100    .4994476    .2749245   .0068972   .9929506
a23456789~_2 |       100    .4973259    .3026792   .0075843   .9889733
Because most Stata commands show at least 12 characters for the name, I suggest the
following guideline:
Use names that are at most 12 characters long.
For the original variables in a dataset, limit names to 10 characters so that you have
two characters to indicate version if the variable is revised. For example,
generate socialdistV2 = socialdist if socialdist>=0 & !missing(socialdist)
Some statistics packages do not allow long variable names. For example, when I con-
verted the variables above to a LIMDEP dataset (http://www.limdep.com), the names
were changed to a2345678, a2345670, and a2345671. The only way to verify how the
converted names mapped to the source names was by looking at the raw data. If I plan
to use software that limits names to eight characters, I either restrict variable names to
eight characters in Stata, or I create a new Stata dataset in which I explicitly shorten
the names. After I rename a variable, I revise the variable label to document the original
name. For example,
rename socialdistance socdist
label var socdist "social distance from person with MI (socialdistance)"
Now when I convert the dataset I have control over the names that are used.
Use clear and consistent abbreviations Because long names are harder to type and might
be truncated, I often use abbreviations as part of the variable names. For example, I
might use ed as an abbreviation for education and create variables such as ed_lths and
ed_hs rather than educationlths and educationhs. Abbreviations, however, by their
nature are ambiguous. To make them as clear as possible, plan your abbreviations and
get feedback from a colleague before you finalize them. Then use those abbreviations
consistently, and keep the list of abbreviations as part of the project documentation. A
convenient way to do this is with the notes command, as discussed in section 5.8.
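For example, a minimal sketch of storing a hypothetical abbreviation list as a note
attached to the dataset itself:

notes _dta: Abbreviations: ed = education; sd = social distance \ jsl 2008-04-03.

Because the note is saved in the dataset, the abbreviation list travels with the data.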
Use names that convey content All else being equal, names that convey content are
easier to use than those that do not. Names such as educ or socdist are easier to use
and less likely to cause errors than names such as q32part2 or R003197. There are
other ways to make names more informative. For binary variables, I suggest names that
indicate the category that is coded as 1. For example, if 0 is male and 1 is female, I
would name the variable female, not gender. (When you see a regression coefficient for
gender, is it the effect of being male or being female?) If you have multiple scales coded
in different directions (i.e., scale1 is coded 1 = disagree, 2 = neutral, and 3 = agree,
whereas scale2 is coded 1 = agree, 2 = neutral, and 3 = disagree), I suggest names
that indicate the direction of the scale. For example, I might use the names sdist1P,
sdist2N, and sdist3N, where N and P indicate negative and positive coding.
Be careful with capitalization Stata distinguishes between names with the same letters
but different capitalization. For example, educ, Educ, and EDUC are three different
variables. Although such names are valid and distinct in Stata, they are likely to cause
confusion. Further, some statistical packages do not differentiate between uppercase
and lowercase letters. Worse, programs that convert between formats might simply drop the
"extra" variables. When I converted a Stata dataset containing educ, Educ, and EDUC
to a format that is case insensitive, the conversion program dropped two of the variables
without warning and without indicating which variable was kept! I do, however, use
capitalization to highlight information. For example, I use N to indicate negatively
coded scales and P for positively coded scales. Capitalization emphasizes this, so I prefer
scale1N and scale2P to scale1n and scale2p. I would never create a pair of variables
called scale1n and scale1N. I use the capitals in table 5.1 as standard abbreviations
within variable names:
Table 5.1. Recommendations for capital letters used when naming variables
Letter  Meaning                                         Example
B       Binary variable                                 highschlB
I       Indicator variable                              edIhs, edIgths, edIcol
L       Value labels used by multiple variables         Lyesno
M       Indicator of data being missing*                educM
N       A negatively coded scale                        sdworkN
O       Too close to the number 0, so I do not use it
P       A positively coded scale                        sdkidsP
S       The unchanged, source variable                  educS; Seduc
V       Version number for modified variables           marstatV2
X       A temporary variable                            Xtemp

* These are binary variables equal to 1 if the source variable is missing, and 0 otherwise.
For example, educM would be 1 if educ is missing, and 0 otherwise.
Try names before you decide
Selecting effective names and labels is an iterative process. After you make initial
selections, check how well the names work with the Stata commands you anticipate
using. If the names are truncated or confusing in the output from logit and you plan
to run a lot of logit models, consider different names. Continue revising and trying
names until you are satisfied.
5.7 Labeling variables
Variable labels are text strings of up to 80 characters that are associated with a variable.
These labels are listed in the output of many commands to document the variables being
analyzed. Variable labels are easy to create, and they can save a great deal of confusion.
My recommendation for variable labels is simple:
Every variable should have a variable label.
If you receive a dataset that does not include labels, add them. When you create a new
variable, always add a variable label. It is tempting to forgo labeling a variable that
you are "sure" you will not need later. Too often, such variables find their way into a
saved dataset (e.g., you create a temporary variable while constructing a variable but
forget to delete the temporary variable).3 When you later encounter these unlabeled
variables, you might forget what they are for and be reluctant to delete them. A quick
label such as

label var checkvar "Scott's temp var; can be dropped"

avoids this problem. The accumulation of stray variables is a bigger problem in collab-
orative projects where several people can add variables, and you do not want to delete
a variable someone else needs. In the long run, the time you spend adding labels is less
than the time you lose trying to figure out what a variable is.
5.7.1 Listing variable labels and other information
Before considering how to add variable labels and principles for choosing labels, I want
to review the ways you can examine variable labels. There are many reasons why
you might want a list of variables with their labels: to construct tables of descriptive
statistics in a paper, to remind you of the names of variables as you plan your analyses,
or to help you clean your data (file: wf5-varlabels.do).
3. One way to avoid the problem of saving temporary variables is to use the tempvar command. For
details, see help tempvar or [P] macro.
codebook, compact
The codebook, compact command lists variable names, labels, and some descriptive
statistics. The syntax is

codebook [varlist] [if] [in], compact
The if and in qualifiers allow you to select the cases for computing descriptive statistics.
Here is an example of the output:
. codebook id tc1fam tc2fam tc3fam vignum, compact

Variable    Obs Unique      Mean  Min   Max  Label
id         1080   1080     540.5    1  1080  Identification number
tc1fam     1074     10  8.755121    1    10  Q43 How important is it to turn ...
tc2fam     1074     10  8.755121    1    10  Q43 How Impt: Turn to family for...
tc3fam     1074     10  8.755121    1    10  Q43 Family help important
vignum     1080     12  6.187963    1    12  Vignette number
If your variable labels are truncated on the right, you can increase the line size, for
example, set linesize 120. Unfortunately, codebook does not give you a choice of
which statistics are shown, and there is no measure of variance.
describe
The describe command lists variable names, variable labels, and characteristics of
the variables. The syntax is

describe [varlist] [if] [in] [, simple fullnames numbers]

If varlist is not given, all variables are listed. If you have long variable names, by default
they are truncated in the list. With the fullnames option, the entire name is listed.
The numbers option numbers the variables. For other options, use help describe.
Here is an example of the default output:
. describe id tc1fam tc2fam tc3fam vignum

              storage  display     value
variable name   type   format      label      variable label
id              int    %9.0g                  Identification number
tc1fam          byte   %21.0g      Ltenpt   * Q43 How important is it to turn
                                              to family for help
tc2fam          byte   %21.0g      Ltenpt   * Q43 How Impt: Turn to family for
                                              help
tc3fam          byte   %21.0g      Ltenpt   * Q43 Family help important
vignum          byte   %35.0g      vignum   * Vignette number
Storage type tells you the numerical precision used for storing that variable (see the
compress command on page 264 for further details). Display format, reasonably enough,
describes the way a variable is displayed. I have never had to worry about this because
Stata seems to figure out how to display things just fine. However, if you are curious, see
[U] 15.5 Formats: controlling how data are displayed for details. The value label
column lists the name of the value label associated with each variable (see section 5.9
for information on value labels). The *'s indicate that there is a note associated with
that variable (see section 5.8 for further details). If you only want a list of names, add
the simple option. For example, to create a list of all variables in your dataset, type
. describe, simple
id        tc1fam     tc1mhprof   ed    var13
vignum    tc2fam     tc2mhprof   Ed    var14
tcfam     tc3fam     tc3mhprof   ED    var15
(output omitted)
Or, to quickly find the variables included in a varlist shorthand notation, say, id-opdoc,
type
. describe id-opdoc, simple
id       female    opnoth   opfriend   opdoc
vignum   serious   opfam    oprelig
nmlab
Stata does not have a command that lists only variable names and labels. Because
I find such lists to be useful, I adapted the code used as an example in chapter 4 to
create the command nmlab. Most simply, type
. nmlab id tc1fam tc2fam tc3fam vignum
id       Identification number
tc1fam   Q43 How important is it to turn to family for help
tc2fam   Q43 How Impt: Turn to family for help
tc3fam   Q43 Family help important
vignum   Vignette number
The number option numbers the list, whereas column(#) changes the start column for
the variable labels. The vl option adds the name of the value label, as discussed below.
Just typing nmlab lists all the variables in the dataset.
tabulate
This command shows you the variable label and the value labels (see section 5.9.3):
. tabulate tcfam, missing

   Q43 How Impt: |
  Turn to family |
        for help |      Freq.     Percent        Cum.
-----------------+-----------------------------------
1Not_at_all_Impt |          9        0.83        0.83
               2 |          4        0.37        1.20
               3 |         11        1.02        2.22
(output omitted)
Although tabulate does not truncate long labels, longer labels are often more difficult
to understand than shorter ones:
. tabulate tcfamV2, missing

Question 43: How |
 important is it |
  to you to turn |
   to the family |
    for support? |      Freq.     Percent        Cum.
-----------------+-----------------------------------
1Not_at_all_Impt |          9        0.83        0.83
               2 |          4        0.37        1.20
               3 |         11        1.02        2.22
(output omitted)
The Variables window
Because variable labels are shown in the Variables window, I also make sure that
the labels work well here. For example,
Name      Label
id        Identification number
vignum    Vignette number
female    R is female?
serious   Q01 How serious is Xs problem
opnoth    Q02_00 X do nothing
opfam     Q02_01 X talk to family
opfriend  Q02_02 X talk to friends
oprelig   Q02_03 X talk to relig leader
opdoc     Q02_04 X see medical doctor
...
sdchild   Q15 Would let X care for children
If your variable labels do not appear in the window or if there is a large gap between
the names and the start of the label, you need to change the column in which the labels
begin. By default, this is column 32, which means you need a wide Variables window
or you will not see the labels. In Windows and Macintosh for Stata 10, you can use the
mouse to resize the columns. In Unix, you can change the space allotted for variable
names with the command

set varlabelpos #

where # is the maximum number of characters to display for variable names. Once you
change this setting, it persists in later sessions. Because I typically limit names to 12
characters or less, I set the variable label position to 12.
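For example, under Unix I would type

set varlabelpos 12

so that the labels begin immediately after the first 12 characters of the variable names.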
Changing the order of variables in your dataset
Commands such as codebook, describe, nmlab, and summarize list variables in the
order they are arranged in the dataset. You can see how variables are ordered by
looking at the Variables window or by browsing your data (type browse to open a
spreadsheet view of your data). When a new variable is created, it is placed at the end
of the list. You can change the order of variables with the order, aorder, and move
commands. Changing the order lets you put frequently used variables first to make
them easier to click on in the Variables window. You can alphabetize names to make
them easier to find, place related variables together, and do other similar things. The
aorder command arranges the variables in varlist alphabetically. The syntax is
aorder [varlist]
If no varlist is given, all variables are alphabetized. The order command allows you to
move a group of variables to the front of the dataset:
order varlist
To move one variable, use the command
move variable-to-move target-variable
where variable-to-move is placed in front of the target-variable. For many datasets, I
run this pair of commands:
aorder
order id
where id is the name of the variable with the ID number. This arranges variables
alphabetically, except that the ID variable appears first. The best way to learn how
these commands work is to open a dataset, try the commands, and watch how the list
of variables in the Variables window changes.
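For example, assuming a dataset that contains the variables female and id, the command

move female id

places female immediately in front of id, so that it appears just above id in the
Variables window.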
5.7.2 Syntax for label variable
Now that we know how to look at variable labels, we can create them. The label
variable command assigns a text label of up to 80 characters to a variable. The
syntax is

label variable varname "label"

Although I generally do not abbreviate commands, I often use the abbreviation label
var, which is shorter yet still clear. For example,
label var artsqrt "Square root of # of articles"
To remove a label, you use the command
label variable varname
For example,
label var artsqrt
5.7.3 Principles for variable labels
Just like names, you can create more useful labels by planning. Here are some things
I think about as you plan your labels.
Beware of truncation
A variable label should be long enough to provide the essential information but short
enough that the content can be grasped quickly. Although variable labels can be 80
characters long, many commands truncate labels that are longer than about 30 charac-
ters. Accordingly, I recommend
Put the most important information in the first 30 columns of a variable label.
Here is an example of what can happen if you use the labels typically found in secondary
data. The data we received used labels that were slightly condensed versions of the
questions from the survey. For example, one group of questions asked a person who
they would turn to if they needed care:

tc1fam     Q43 How important is it to turn to family for help
tc1friend  Q44 How important is it to turn to friends for help
tc1relig   Q45 How important is it to turn to a minister, priest, rabbi or other religious
tc1doc     Q46 How important is it to go to a general medical doctor for help
tc1psy     Q47 How important is it to go to a psychiatrist for help
tc1mhprof  Q48 How important is it to go to a mental health professional
The labels are so long that they are useless for commands that truncate the labels at
column 30. For example,
. codebook tc1*, compact

Variable     Obs Unique      Mean  Min  Max  Label
tc1doc      1074     10  8.714153    1   10  Q46 How important is it to go to .
tc1fam      1074     10  8.755121    1   10  Q43 How important is it to turn t.
tc1friend   1073     10  7.799627    1   10  Q44 How important is it to turn t.
tc1mhprof   1045     10   7.58756    1   10  Q48 How important is it to go to .
tc1psy      1050     10  7.567619    1   10  Q47 How important is it to go to .
tc1relig    1039     10   5.66025    1   10  Q45 How important is it to turn t.
A better set of labels looks like this:
. codebook tc2*, compact

Variable     Obs Unique      Mean  Min  Max  Label
tc2doc      1074     10  8.714153    1   10  Q46 How Impt: Go to a gen med doc...
tc2fam      1074     10  8.755121    1   10  Q43 How Impt: Turn to family for ...
tc2friend   1073     10  7.799627    1   10  Q44 How Impt: Turn to friends for...
tc2mhprof   1045     10   7.58756    1   10  Q48 How Impt: Go to a mental heal...
tc2psy      1050     10  7.567619    1   10  Q47 How Impt: Go to a psych for help
tc2relig    1039     10   5.66025    1   10  Q45 How Impt: Turn to a religious...
We eventually chose even shorter labels:
. codebook tc3*, compact

Variable     Obs Unique      Mean  Min  Max  Label
tc3doc      1074     10  8.714153    1   10  Q46 Med doctor help important
tc3fam      1074     10  8.755121    1   10  Q43 Family help important
tc3friend   1073     10  7.799627    1   10  Q44 Friends help important
tc3mhprof   1045     10   7.58756    1   10  Q48 MH prof help important
tc3psy      1050     10  7.567619    1   10  Q47 Psychiatric help important
tc3relig    1039     10   5.66025    1   10  Q45 Relig leader help important
Given our familiarity with the survey instrument, these labels tell us everything we need
to know.
Although I find short variable labels work best for analysis, I sometimes want to see
the original labels. For example, I might want to verify the exact wording of a question
or know exactly how the categories are labeled. Stata's language command allows you
to have both long, detailed labels for documenting your variables and shorter labels that
work better in your output. This is discussed in section 5.10.
Test labels before you post the file
After creating a set of labels, you should check how they work with commands such
as codebook, compact and tabulate. If you do not like how the labels appear in the
output, try different labels. Rerun the test commands and repeat the cycle until you
are satisfied.
5.7.4 Temporarily changing variable labels
Sometimes I need to temporarily change or eliminate a variable label. For example,
tabulate does not list the name of a variable if it has a variable label. Yet, when
cleaning data, I often want to know the variable name. To see the variable name in the
tabulate output, you need to remove the variable label by assigning a null string as
the label:

label variable varname ""
I can do this for a group of variables using a loop (file: wf5-varlabels.do):

. foreach varname in pub1 pub3 pub6 pub9 {
  2.     label var `varname' ""
  3.     tabulate `varname', missing
  4. }

        pub1 |      Freq.     Percent        Cum.
-------------+-----------------------------------
           0 |         77       25.00       25.00
           1 |         75       24.35       49.35
           2 |         36       11.69       61.04
(output omitted)
Another reason to change the variable label temporarily is to revise labels in graphs.
By default, the variable label is used to label the axes.
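Here is a minimal sketch, using a hypothetical variable age, that blanks the label before
drawing a graph and then restores it; the local macro uses the extended macro function
discussed in section 5.7.5:

local origlabel : variable label age
label var age ""
histogram age
label var age "`origlabel'"

With the label removed, the axis is titled with the variable name age rather than the
longer label.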
5.7.5 Creating variable labels that include the variable name
Recently, I was asked, "Do you know of a Stata command that will add the variable
name to the beginning of the variable label?" Although there is not a Stata command
to do this, it is easy to do using a loop and a local macro
(file: wf5-varname-to-label.do).4 Here are the current labels:
. use wf-lfp, clear
(Workflow data on labor force participation \ 2008-04-02)

. nmlab
lfp    In paid labor force? 1=yes 0=no
k5     # kids < 6
k618   # kids 6-18
age    Wife's age in years
wc     Wife attended college? 1=yes 0=no
hc     Husband attended college? 1=yes 0=no
lwg    Log of wife's estimated wages
inc    Family income excluding wife's
To see why I want to add the name of the variable to the label, consider the output
from tabulate:
. tabulate wc hc, missing

      Wife |
  attended |   Husband attended
  college? |  college? 1=yes 0=no
1=yes 0=no |   0_NoCol  1_College |     Total
-----------+----------------------+----------
   0_NoCol |       417        124 |       541
 1_College |        41        171 |       212
-----------+----------------------+----------
     Total |       458        295 |       753
4. If you want to try creating your own command with an ado-file, I suggest you write a command
that adds a variable's name to the front of its label.
It would be convenient to know the names of the variables in this table. This can be
done by adding the variable name to the front of the variable label. I start by using
unab to create a list of the variables in the dataset, where _all is Stata shorthand for
"all the variables in memory":

. unab varlist : _all

. display "varlist is: `varlist'"
varlist is: lfp k5 k618 age wc hc lwg inc
Next, I loop through the variables:

1> foreach varname in `varlist' {
2>     local varlabel : variable label `varname'
3>     label var `varname' "`varname': `varlabel'"
4> }
Line 2 is an extended macro function that creates the local varlabel with the variable
label for the variable named in local varname. Extended macro functions, which are
used extensively in section 5.11, retrieve information about variables, datasets, labels,
and other things and place the information in a macro. The command begins with local
varlabel to indicate that you want to create a local macro named varlabel. The :
is like an equal sign, saying that the local equals the content described on the right.
For example, local varlabel : variable label lfp assigns local varlabel the
variable label for lfp. Line 3 creates a new variable label that begins with the variable
name (i.e., `varname'), adds a colon, and inserts the current label (i.e., `varlabel').
Here are the new variable labels:
. nmlab
lfp    lfp: In paid labor force? 1=yes 0=no
k5     k5: # kids < 6
k618   k618: # kids 6-18
age    age: Wife's age in years
wc     wc: Wife attended college? 1=yes 0=no
hc     hc: Husband attended college? 1=yes 0=no
lwg    lwg: Log of wife's estimated wages
inc    inc: Family income excluding wife's
Now when I use tabulate, I see both the variable name and its label:
. tabulate wc hc, missing

  wc: Wife |
  attended |  hc: Husband attended
  college? |  college? 1=yes 0=no
1=yes 0=no |   0_NoCol  1_College |     Total
-----------+----------------------+----------
   0_NoCol |       417        124 |       541
 1_College |        41        171 |       212
-----------+----------------------+----------
     Total |       458        295 |       753
I changed the variable labels without changing the names of the variables. In general,
I think this is fine. If I wanted to keep the new labels, I would save them in a new
dataset.
5.8 Adding notes to variables
The notes command attaches information to a variable that is saved in the dataset as
metadata. notes is incredibly useful for documenting your work, and I highly recom-
mend that you add a note when creating new variables. The syntax for notes is

notes [varname]: text
Here is how I routinely use notes when generating new variables. I start by creating
pub9trunc from pub9 and adding a variable label (file: wf5-varnotes.do):

. generate pub9trunc = pub9
(772 missing values generated)

. replace pub9trunc = 20 if pub9trunc>20 & !missing(pub9trunc)
(8 real changes made)

. label variable pub9trunc "Pub 9 truncated at 20: PhD yr 7 to 9"
I use notes to record how the variable was created, by what program, by whom, and
when:

. notes pub9trunc: pub9>20 recoded to 20 \ wf5-varnotes.do jsl 2008-04-03.

The note is saved when I save the dataset. Later, if I want details on how the variable
was created, I run the command:

. notes pub9trunc
pub9trunc:
  1.  pub9>20 recoded to 20 \ wf5-varnotes.do jsl 2008-04-03.
I can also add longer notes (up to 8,681 characters in Small Stata and 67,784 characters
in other versions). For example,

. notes pub9trunc: Earlier analyses (pubreg04a.do 2006-09-20) showed
>   that cases with a large number of articles were outliers. Program
>   pubreg04b.do 2006-09-21 examined different transformations of pub9
>   and found that truncation at 20 was most effective at removing
>   the outliers. \ jsl 2008-04-03.
Now, when I check the notes for pub9trunc, I see both notes:

. notes pub9trunc
pub9trunc:
  1.  pub9>20 recoded to 20 \ wf5-varnotes.do jsl 2008-04-03.
  2.  Earlier analyses (pubreg04a.do 2006-09-20) showed that cases with a large
      number of articles were outliers. Program pubreg04b.do 2006-09-21 examined
      different transformations of pub9 and found that truncation at 20 was most
      effective at removing the outliers. \ jsl 2008-04-03.
With this information and my research log, I can easily reconstruct how and why I
created the variable.
The notes command has an option to add a time stamp. In the text of the note,
the letters TS (for time stamp) surrounded by blanks are replaced by the date and time.
For example,

. notes pub9trunc: pub9 truncated at 20 \ wf5-varnotes.do jsl TS

. notes pub9trunc in 3
pub9trunc:
  3.  pub9 truncated at 20 \ wf5-varnotes.do jsl 3 Apr 2008 11:28
5.8.1 Commands for working with notes
Listing notes
To list all notes in a dataset, type
notes
To list the notes for selected variables, use the command
notes list variable-list
If you have multiple notes for a variable, they are numbered. To list notes from start-#
to end-#, type

notes list variable-list in start-#[/end-#]
For example, if vignum has many notes, I can look at just the second and third:
. notes list vignum in 2/3
vignum:
2. BGR - majority vs. minority = bulgarian vs. turk
3. ESP - majority vs. minority = spaniard vs. gypsy
You can also list notes with codebook using the notes option. For example,
. codebook pub1trunc, notes

pub1trunc                                                        (unlabeled)

                 type:  numeric (float)
                range:  [0,20]                       units:  1
        unique values:  17                       missing .:  772/1080

                 mean:  2.53247
             std. dev:  3.00958

          percentiles:      10%       25%       50%       75%       90%
                              0         1         2         4         6

pub1trunc:
  1.  pub# truncated at 20 \ wf5-varnotes.do jsl 2008-04-03.
Removing notes
To remove notes for a given variable, use the command

notes drop variable-name [in #[/#]]

where in #/# specifies which notes to drop. For example, notes drop vignum in
2/3.
Searching notes
Although there currently is no Stata command to search notes, this feature is planned
for future versions of Stata. For now, the only way to do this is to open a log file and run

notes

Then close the log and use a text editor to search the log file.
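A minimal sketch of this approach, where the log filename is arbitrary:

log using notes-search.log, text replace
notes
log close

You can then open notes-search.log in your text editor and search for the text you
need.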
5.8.2 Using macros and loops with notes
You can use macros when creating notes. For example, to create similar notes for several
variables, I use a local that I call tag with information for "tagging" each variable:

local tag "pub# truncated at 20 \ wf5-varnotes.do jsl 2008-04-09."
notes pub1trunc: `tag'
notes pub3trunc: `tag'
notes pub6trunc: `tag'
notes pub9trunc: `tag'
Then,

. notes pub*
pub1trunc:
  1.  pub# truncated at 20 \ wf5-varnotes.do jsl 2008-04-09.
pub3trunc:
  1.  pub# truncated at 20 \ wf5-varnotes.do jsl 2008-04-09.
(output omitted)
The advantage of using macros is that exactly the same information is added to each
variable. You can also create notes within a loop. For example,
local tag "wfS-varnotes.do jsl 2008-04-09."
foreach varname in publi pub3 pub6 pub9 {
clonevar ~varname“trunc = ~varname™
replace “varname“trunc = 20 if ‘varname trunc>20 ///
& !missing(~varname“trunc)
label var “varname‘trunc "“varname’ truncated at 20"
notes “varname“trunc: “varname® truncated at 20 \ “tag”5.9 Value labels 163
5.9 Value labels
Value labels assign text labels to the numeric values of a variable. The rule for value
labels is

Categorical variables should have value labels unless the variable has an
inherent metric.

Although there is little benefit from having value labels for something like the number of
young children in the family, a variable indicating attending college should be labeled.
To see why labels are important, consider k5, which is the number of young children
in the family, and wc, indicating whether the wife attended at least some college, coded
as 0 and 1. Without value labels, the tabulation of wc and k5 looks like this (file:
wf5-vallabels.do):
. tabulate wc_v1 k5

  Did wife |
    attend |      # of children younger than 6
  college? |         0          1          2          3 |     Total
-----------+--------------------------------------------+----------
         0 |       444         85         12          0 |       541
         1 |       162         33         14          3 |       212
-----------+--------------------------------------------+----------
     Total |       606        118         26          3 |       753
Although it is reasonable to assume that 1 stands for yes and 0 stands for no, what
would you decide if the output looked like this?
. tabulate wc_v2 k5

  Did wife |
    attend |      # of children younger than 6
  college? |         0          1          2          3 |     Total
         1 |       444         85         12          0 |       541
         2 |       162         33         14          3 |       212
     Total |       606        118         26          3 |       753
A value label attaches a label to each value. Here I use a label that includes both the
value and a description of the category:
. tabulate wc_v3 k5

  Did wife |
    attend |      # of children younger than 6
  college? |         0          1          2          3 |     Total
      0_No |       444         85         12          0 |       541
     1_Yes |       162         33         14          3 |       212
     Total |       606        118         26          3 |       753
5.9.1 Creating value labels is a two-step process
Stata assigns labels in two steps. In the first step, label define associates labels with
values; that is, the labels are defined. In the second step, label values assigns a
defined label to one or more variables.
Step 1: Defining labels
In the first step, I define a set of labels to be associated with values without indicating
which variables use these labels. For yes/no questions with yes coded as 1 and no coded
as 0, I could define the label as
label define yesno 1 yes 0 no
For a five-point scale with low values indicating negative responses, I could define

label define lowneg5 1 StDisagree 2 Disagree 3 Neutral 4 Agree 5 StAgree

For scales where low values are positive, I could define

label define lowpos5 1 StAgree 2 Agree 3 Neutral 4 Disagree 5 StDisagree
Step 2: Assigning labels
After labels are defined, label values assigns the defined labels to one or more vari-
ables. For example, because wc and hc are yes/no questions, I can use the label definition
yesno for both variables:

label values wc yesno
label values hc yesno

Or, in the latest version of Stata 10, I can assign labels to both variables in one command:

label values wc hc yesno
Why a two-step system?
The primary advantage of a two-step system for creating value labels is that it facilitates
having consistent labels across variables and simplifies making changes to labels used by
multiple variables. For example, surveys often have many yes/no variables and many
positively or negatively ordered five-point scales. For these three types of variables, I
need three label definitions:

label define yesno 0 No 1 Yes
label define neg5 1 StDisagree 2 Disagree 3 Neutral 4 Agree 5 StAgree
label define pos5 1 StAgree 2 Agree 3 Neutral 4 Disagree 5 StDisagree

If I assign the yesno label to all yes/no questions, I know that these questions have
exactly the same labels. The same holds for assigning neg5 and pos5 to variables that
are negative or positive five-point scales. Defining labels only once makes it more likely
that labels are assigned correctly.
This system also has advantages when changing value labels. Suppose that I want
to shorten the labels and begin each label with its value. All I need to do is change the
existing definitions using the modify option:

label define yesno 0 0No 1 1Yes, modify
label define neg5 1 1StDis 2 2Disagree 3 3Neutral ///
    4 4Agree 5 5StAgree, modify
label define pos5 1 1StAgree 2 2Agree 3 3Neutral ///
    4 4Disagree 5 5StDis, modify
The revised labels are automatically applied to all variables for which these definitions
have been assigned.
Removing labels
To remove an assigned value label, use label values without specifying the label. For
example, to remove the yesno label assigned to wc, type

label values wc

In the latest version of Stata 10, you can use a new syntax where a period indicates
that the label is being removed:

label values wc .

Although I have removed the yesno label from wc, the label definition has not been
deleted and can be used by other variables.
5.9.2 Principles for constructing value labels
You will save time and have clearer output if you plan value labels before you create
them. Your plan should determine which variables can share labels, how missing values
will be labeled, and what the content of your labels will be. As you plan your labels,
here are some things to consider.
1) Keep labels short
Because value labels are truncated by some commands, notably tabulate and tab1, I
recommend
Value labels should be eight or fewer characters in length.
Here's an example of what can happen if you use longer labels. I have created two
label definitions that could be used to label variables measuring social distance (file:
wf5-vallabels.do):
. labelbook sd_v1 sd_v2

value label sd_v1
(output omitted)
   definition
            1   Definitely Willing
            2   Probably Willing
            3   Probably Unwilling
            4   Definitely Unwilling
   variables:   sdchild_v1

value label sd_v2
(output omitted)
   definition
            1   1Definite
            2   2Probably
            3   3ProbNot
            4   4DefNot
   variables:   sdchild_v2

The sd_v1 definitions use labels that are identical to the wording on the questionnaire.
These labels were assigned to sdchild_v1. The sd_v2 labels are shorter and add the
category number to the label; these were assigned to sdchild_v2. With tabulate, the
original definitions are worthless:
. tabulate female sdchild_v1

      R is |        Q15 Would let X care for children
   female? | Definitel   Probably   Probably  Definitel |     Total
-----------+--------------------------------------------+----------
     0Male |        41         99        155        197 |       492
   1Female |        73         98        156        215 |       542
-----------+--------------------------------------------+----------
     Total |       114        197        311        412 |     1,034
The sd_v2 definitions are much better:
. tabulate female sdchild_v2

      R is |        Q15 Would let X care for children
   female? | 1Definite  2Probably   3ProbNot    4DefNot |     Total
     0Male |        41         99        155        197 |       492
   1Female |        73         98        156        215 |       542
     Total |       114        197        311        412 |     1,034
2) Include the category number
When looking at tabulated results, I often want to know the numeric value assigned to
a category. You can see the values associated with labels by using the nolabel option
of tabulate, but with this option, you no longer see the labels. For example,
. tabulate sdchild_v1, nolabel

 Q15 Would |
let X care |
       for |
  children |      Freq.     Percent        Cum.
-----------+-----------------------------------
         1 |        114       11.03       11.03
         2 |        197       19.05       30.08
         3 |        311       30.08       60.15
         4 |        412       39.85      100.00
-----------+-----------------------------------
     Total |      1,034      100.00
A better solution is to use value labels that include both a label and the value for each
category as illustrated with the label sd_v2.
Adding values to value labels
One way to include numeric values in value labels is to add them when you define
the labels (file: wf5-vallabels.do):

label define defnot 1 1Definite 2 2Probably 3 3ProbNot 4 4DefNot

If you already have label definitions that do not include the values, you can use the
numlabel command to add them. Suppose that I start with these labels:

label define defnot 1 Definite 2 Probably 3 ProbNot 4 DefNot

To add values to the front of the label, I use the command

numlabel defnot, mask(#) add
Before explaining the command, let us look at the new labels:

. label val sdchild defnot

. tabulate sdchild

 Q15 Would |
let X care |
       for |
  children |      Freq.     Percent        Cum.
 1Definite |        114       11.03       11.03
 2Probably |        197       19.05       30.08
  3ProbNot |        311       30.08       60.15
   4DefNot |        412       39.85      100.00
     Total |      1,034      100.00
The mask() option for numlabel controls how the values are added. The mask(#)
option adds only numbers (e.g., 1Definite); mask(#_) adds numbers followed by an
underscore (e.g., 1_Definite); and mask(#. ) adds the values followed by a period
and a space (e.g., 1. Definite).

You can remove values from labels with the remove option. For example, numlabel
defnot, mask(#_) remove removes values that are followed by an underscore.
Creating new labels before adding numbers

The numlabel command changes existing labels. Once the labels are changed, the original
labels are no longer in the dataset. This can be a problem if you want to replicate prior
results. With the label copy command, added in the February 25, 2008 update of
Stata 10, you can solve this problem by making copies of the original labels. For
example, I can create a new value label definition named defnotNew that is an exact
copy of defnot:
label copy defnot defnotNew
Then I revise the copy, leaving the original label intact:

. numlabel defnotNew, mask(#_) add

. label val sdchild defnotNew

. tabulate sdchild

 Q15 Would |
let X care |
       for |
  children |      Freq.     Percent        Cum.
1_Definite |        114       11.03       11.03
2_Probably |        197       19.05       30.08
 3_ProbNot |        311       30.08       60.15
  4_DefNot |        412       39.85      100.00
     Total |      1,034      100.00
To reassign the original labels, type

. label val sdchild defnot

. tabulate sdchild

 Q15 Would |
let X care |
       for |
  children |      Freq.     Percent        Cum.
  Definite |        114       11.03       11.03
  Probably |        197       19.05       30.08
   ProbNot |        311       30.08       60.15
    DefNot |        412       39.85      100.00
     Total |      1,034      100.00
3) Avoid special characters
Adding spaces and characters such as =, #, @, {, and } to labels can cause problems
with some commands (e.g., hausman), even though label define allows you to use
these characters in your labels. To avoid problems, I suggest that you use only letters,
numbers, dashes, and underscores. If you include spaces, you must have quotes around
your labels. For example, you need quotes here

label define yesno_v2 1 "1 yes" 0 "0 no"

but not here

label define yesno_v3 1 1_yes 0 0_no
4) Keeping track of where labels are used
The two-step system for labels can cause problems if you do not keep track of which
labels are assigned to which variables. Suppose female is coded 1 for female and 0 for
male, and lfp is coded 1 for being in the labor force and 0 for not. I could label the
values for both variables as yes and no:

label define twocat 0 No 1 Yes
label values lfp female twocat

When I tabulate the variables, I get the table I want:

. tabulate female lfp

      R is |  Paid labor force?
   female? |        No        Yes |     Total
        No |       149        196 |       345
       Yes |       176        232 |       408
     Total |       325        428 |       753
Later I decide that it would be more convenient to label female with 0_Male and 1_Female.
Forgetting that the label twocat is also used by lfp, I change the label definition:

label define twocat 0 0_Male 1 1_Female, modify

This works fine for female but causes a problem with lfp:

. tabulate female lfp

      R is |   Paid labor force?
   female? |    0_Male   1_Female |     Total
    0_Male |       149        196 |       345
  1_Female |       176        232 |       408
     Total |       325        428 |       753
To keep track of whether a label is used for one variable or many variables, I use these
rules:
If a value label is assigned to only one variable, the label definition should
have the same name as the variable.

If a value label is assigned to multiple variables, the name of the label defi-
nition should begin with L.

For example, I would define label define female 0 0_Male 1 1_Female and use it
with the variable female. I would define label define Lyesno 1 1_Yes 0 0_No to
remind me that if I change the definition of Lyesno I need to verify that the change is
appropriate for all the variables using this definition.
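Putting the two rules together, a minimal sketch using the variables from this example:

* label used by a single variable: same name as the variable
label define female 0 0_Male 1 1_Female
label values female female
* label shared by several variables: name begins with L
label define Lyesno 1 1_Yes 0 0_No
label values lfp Lyesno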
5.9.3 Cleaning value labels
There are several commands that make it easier to review and revise value labels. The
commands describe and nmlab list variables along with the name of their value labels.
The codebook, problems command searches for problems in your dataset, including
some related to value labels. I highly recommend using it; see section 6.4.6 for further
details. Two other commands provide lists of labels. The label dir command lists the
names of all value labels that have been defined. For example,

. label dir
vignum
serious
female
wc_v3
Lyesno
Ldefnot
Ltenpt
lfp
Lyn

This list includes defined labels even if they have not been assigned to a variable with
label values. The labelbook command lists all labels, their characteristics, and the
variables to which they are assigned. For example,
. labelbook Ltenpt

value label Ltenpt

              values                                  labels
       range:  [1,10]                 string length:  [6,16]
            N:  5              unique at full length:  yes
         gaps:  yes              unique at length 12:  yes
   missing .*:  3                        null string:  no
                             leading/trailing blanks:  no
                                  numeric -> numeric:  no

   definition
            1   1Not_at_all_Impt
           10   10Vry_Impt
           .a   .a_NAP
           .c   .c_Dont_know
           .d   .d_No_ansr_ref

   variables:  tcfam tc1fam tc2fam tc3fam tc1friend tc2friend tc3friend
               tc1relig tc2relig tc3relig tc1doc tc2doc tc3doc tc1psy tc2psy
               tc3psy tc1mhprof tc2mhprof tc3mhprof
5.9.4 Consistent value labels for missing values
Labels for missing values need to be considered carefully. Stata uses the sysmiss, ., and
26 extended missing values, .a through .z (see section 6.2.3 for more information on missing
values). Having multiple missing values allows you to code the reason why information
is missing. For example:

• The respondent did not know the answer.
• The respondent refused to answer.
• The respondent did not answer the current question because the lead-in question
was refused.
• The question was not appropriate for the respondent (e.g., asking children how
many cars they own).
• The respondent was not asked the question (e.g., random assignment of who gets
asked which questions).
You can prevent confusion by using the same missing-value codes to mean the same
things across questions. If you are collecting your own data, you can do this when
developing rules for coding the data. If you are using data collected by others, you
might find that the same codes are used throughout, or you might need to reassign
missing values to make them uniform (see section 5.11.4 for an example). In my work,
I generally associate the meanings in table 5.2 with the missing-value codes:
Table 5.2. Suggested meanings for extended missing-value codes
Letter  Meaning                        Example
.u      Unspecified missing value      Missing data without the reason being made explicit
.d      Don't know                     Respondent did not know the answer
.l      Do not use this code           l (lowercase L) is too close to 1 (one), so avoid it
.n      Not applicable                 Only adults were asked this question
.p      Preliminary question refused   Question 5 was not asked because respondent did
                                         not answer the lead-in question
.r      Refused                        Respondent refused to answer question
.s      Skipped due to skip pattern    Given answer to question 5, question 6 was not asked
.t      Technical problem              Error reading data from questionnaire
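For example, here is a minimal sketch that maps hypothetical numeric codes to two of
these extended missing values using mvdecode:

* in this hypothetical coding scheme, 98 = don't know and 99 = refused
mvdecode educ, mv(98=.d \ 99=.r)

After the command, tabulating educ with the missing option shows .d and .r in place
of 98 and 99.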
5.9.5 Using loops when assigning value labels
The foreach command is very effective for adding the same value labels to multi-
ple variables. Suppose that I want to recode the four-point scales sdneighb, sdsocial,
sdchild, sdfriend, sdwork, and sdmarry to binary variables that indicate whether
the respondent agrees or disagrees with the question. First, I define a new label (file:
wf5-vallabels.do):

label define Lagree 1 1_Agree 0 0_Disagree
Then I use a foreach loop to create new variables and add labels:

1> foreach varname in sdneighb sdsocial sdchild sdfriend sdwork sdmarry {
2>     display _newline "--> Recoding variable `varname'" _newline
3>     clonevar B`varname' = `varname'
4>     recode B`varname' 1/2=1 3/4=0
5>     label values B`varname' Lagree
6>     tabulate B`varname' `varname', miss
7> }
Line 1 creates the local varname that holds the name of the variable to recode. The first
time through the loop, varname contains sdneighb. Line 2 displays a header indicating
which variable is being processed (sample output is given below). The _newline directive
adds a blank line to improve readability. Line 3 creates the variable Bsdneighb as a
clone of the source variable sdneighb; the variables are identical except for name. Line 4
combines values 1 and 2 into the value 1 and values 3 and 4 into the value 0. Line 5
assigns the value label Lagree to Bsdneighb. Line 6 tabulates the new Bsdneighb with
the source sdneighb. Line 7 ends the loop. The output for the first pass through the
loop is
loop is
--> Recoding variable sdneighb

(20 missing values generated)
(Bsdneighb: 670 changes made)

 Q13 Would |
 have X as |             Q13 Would have X as neighbor
  neighbor | 1Definite  2Probably   3ProbNot    4DefNot      .c_DK |     Total
0_Disagree |         0          0        133         61          0 |       194
   1_Agree |       390        476          0          0          0 |       866
         . |         0          0          0          0         20 |        20
     Total |       390        476        133         61         20 |     1,080
The message 20 missing values generated means that when Bsdneighb was cloned
there were 20 cases with missing values in the source variable. Although .c had the label
.c_DK in the value label used for sdneighb, the value labels for the recoded variable do
not include a label for .c. I could revise the label definition to add this label:

label define Lagree 1 1_Agree 0 0_Disagree .c .c_DK .d .d_NA_ref, modify
The message Bsdneighb: 670 changes made was generated by recode to indicate
how many cases were changed when the recodes were made. The program can be
improved by adding notes and variable labels:

 1> local tag "wf5-vallabels.do jsl 2008-04-03."
 2> foreach varname in sdneighb sdsocial sdchild sdfriend sdwork sdmarry {
 3>     display _newline "--> Recoding variable `varname'" _newline
 4>     clonevar B`varname' = `varname'
 5>     recode B`varname' 1/2=1 3/4=0
 6>     label values B`varname' Lagree
 7>     notes B`varname': Recode of `varname' \ `tag'
 8>     label var B`varname' "Binary version of `varname'"
 9>     tabulate B`varname' `varname'
10> }
Line 1 creates a local used by notes in line 7. The variable label in line 8 describes
where the variable came from.
5.10 Using multiple languages
The language facility allows you to have multiple sets of labels saved within one dataset.
Most obviously, you can have labels in more than one language. For example, I have
created a dataset with labels in Spanish, English, and French (I discuss how to do this
later). If I want labels in English, I select that language and then run the commands
as I normally would (file: wf5-language.do):
. use wf-languages-spoken, clear
(Workflow data with spoken languages \ 2008-04-03)

. label language english

. tabulate male, missing

   Gender of |
  respondent |      Freq.     Percent        Cum.
-------------+-----------------------------------
     0_Women |      1,227       53.51       53.51
       1_Men |      1,066       46.49      100.00
-------------+-----------------------------------
       Total |      2,293      100.00
If I want labels in French, I specify French:

. label language french

. tabulate male, missing

    Genre de |
   répondant |      Freq.     Percent        Cum.
-------------+-----------------------------------
    0_Femmes |      1,227       53.51       53.51
    1_Hommes |      1,066       46.49      100.00
-------------+-----------------------------------
       Total |      2,293      100.00
When I first read about label language, I thought about it only in terms of languages
such as French and German. When documenting and archiving the data collected by
Alfred Kinsey, we faced the problem that some of the labels in the original dataset had
inconsistencies or small errors. We wanted to fix these, but we also wanted to preserve
the original labels. The solution was to use multiple languages. We let label language
original include the historical labels, whereas label language revised incorporated
our changes. In the same way, you can create a short and a long language for your dataset.
The long version could have labels that match the survey instrument. The short version
could use labels that are more effective for analysis.
5.10.1 Using label language for different written languages
To create a new language, you indicate the name for the new language and then create
labels as you normally would. A simple example shows you how to do this. I start by
loading a dataset with only English labels and add French and Spanish labels:

. use wf-languages-single, clear

. * french
. label language french, new
. label define male_fr 0 "0_Femmes" 1 "1_Hommes"
. label val male male_fr
. label var male "Genre de répondant"

. * spanish
. label language spanish, new
. label define male_es 0 "0_Mujeres" 1 "1_Hombres"
. label val male male_es
. label var male "Género del respondedor"
When you save the dataset, labels are saved for three languages. As far as I know, Stata
is the only data format with multiple languages. If you convert a Stata dataset with
multiple languages to other formats, you will have to create distinct datasets for each
language.
5.10.2 Using label language for short and long labels
Stata's label language feature is a great solution to the trade-off between labels that
correspond to the data source (e.g., the survey instrument) and labels that are conve-
nient for analysis. For analysis, shorter labels are often more useful, but for documen-
tation, you might want to know exactly how the questions were asked. Here is a simple
example of how label language can address this dilemma. First, I load the data and
set the language to source to use the labels based on the source questionnaire (file:
wf5-language.do):

. use wf-languages-analysis, clear
(Workflow data with analysis and source labels \ 2008-04-03)

. label language source
Using describe, I look at two variables:

. describe male warm

              storage  display     value
variable name   type   format      label      variable label
male            byte   %10.0g      Smale      Gender
warm            byte   %17.0g      Swarm      A working mother can establish
                                                just as warm and secure a
                                                relationship with her c

The value labels begin with S, which I use to indicate that these are the source labels.
If I tabulate the variables, I get results using the source labels:
. tabulate male warm, missing
A working mother can establish just as warm
and secure a relationship with her c
Gender | Strongly Agree Disagree Strongly Total
Female 139 323 461 304 1,227
Male 158 400 395 113 1,066
Total 297 723 856 417 2,293
These labels are too long to be useful. Next I switch to the labels I created for analyzing
the data:
. label language analysis
. describe male warm
              storage  display     value
variable name   type   format      label      variable label

male            byte   %10.0g      Amale      Gender: 1=male 0=female
warm            byte   %17.0g      Awarm      Mom can have warm relations with
                                              child?
The value and variable labels have changed. When I tabulate the variables, the results
are much clearer:
. tabulate male warm, missing
   Gender:
    1=male          Mom can have warm relations with child?
  0=female        1_SD       2_D       3_A      4_SA       Total
   0_Women         139       323       461       304       1,227
     1_Men         158       400       395       113       1,066
Total 297 723 856 417 2,293
If I need the original labels, I simply change the language with the command label
language source.
Note on variable and value labels
There is an important difference in how variable and value labels are treated with
languages. After changing to the analysis language, I simply created new variable
labels. For value labels, I had to define labels with different names than they had
before. For example, in wf-languages-analysis.dta, the label assigned to warm was
named Swarm (where S indicates that this is the source label). In the analysis language,
the label was named Awarm. With multiple languages, you must create new value-label
definitions for each language.
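A minimal sketch makes the difference concrete; the label text is taken from the output above, and the commands are illustrative rather than the exact ones used to build the dataset:

label language analysis
* A variable label can simply be reassigned in the new language
label var warm "Mom can have warm relations with child?"
* A value label needs a new definition with a new name
label define Awarm 1 "1_SD" 2 "2_D" 3 "3_A" 4 "4_SA"
label values warm Awarm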
5.11 A workflow for names and labels
This section provides an extended example of how to revise names and labels using the
tools for automation that were introduced in chapter 4. The example is taken from
research with Bernice Pescosolido and Jack Martin on a 17-country survey of stigma
and mental health. The data we received had nonmnemonic variable names with labels
that closely matched the questionnaire. Initial analyses showed that the names were
inconsistent and sometimes misleading, with labels that were often truncated or unclear
in the output. Accordingly, we undertook a major revision of names and labels that
took months to complete.
Because we needed to revise 17 datasets with thousands of variables, we spent a
great deal of time planning the work and perfecting the methods we used. To speed
up the process of entering thousands of rename, label variable, label define, and
label values commands, we used automation tools to create dummy commands that
were the starting point for the commands we needed. To understand the rest of this
section, it is essential to understand how dummy commands were used. Suppose that I
need the following rename commands:
rename atdis atdisease
rename atgenes atgenet
rename ctxfdoc clawdoc
rename ctxfhos clawhosp
rename ctxfmed clawpmed
Instead of typing each command from scratch, I create a list of dummy rename
commands that looks like this:
rename atdis atdis
rename atgenes atgenes
rename ctxfdoc ctxfdoc
rename ctxfhos ctxfhos
rename ctxfmed ctxfmed
The dummy commands are edited to create the commands I need. Before getting into
the specific details, I want to provide an overview of the five steps and 11 do-files
required.
Step 1: Plan the changes
The first step is planning the new names and labels. I start with a list of the current
names and labels:
wf5-sgc1a-list.do:
List names and labels from the source dataset wf-sgc-source.dta.
5. The data from each country consisted of about 150 variables with some variation of content across
countries. For the first country, our data manager estimates that it took a month to revise the
names and labels and verify the data. Later countries took four to five days. The data used for
this example are artificial but have similar names, labels, and content to the real data that have
not yet been released.
This information is exported to a spreadsheet used to plan the changes. To decide what
changes to make, I check how the current names and labels appear in Stata output:
wf5-sgc1b-try.do:
Try the names and labels with tabulate.
Step 2: Archive, clone, and rename
Before making any changes, I back up the source dataset. Because I want to keep the
original variables in the revised dataset, I create clones:
wf5-sgc2a-clone.do:
Add cloned variables and create wf-sgc01.dta.
Next I create a file with dummy rename commands:
wf5-sgc2b-rename-dump.do:
Create a file with rename commands.
I edit the file with rename commands and use it to rename the variables:
wf5-sgc2c-rename.do:
Rename variables and create wf-sgc02.dta.
Step 3: Revise variable labels
The original variable labels are used to create dummy commands:
wf5-sgc3a-varlab-dump.do:
Use a loop and extended functions to create a file with label variable com-
mands.
Before adding new labels, I save the original labels as a second language called original.
The revised labels are saved in the default language:
wf5-sgc3b-varlab-revise.do:
Create the original language for the original variable labels and save the revised
labels in the default language to create wf-sgc03.dta.
Step 4: Revise value labels
Changing value labels is more complicated than changing variable labels due to the
two-step process used to label values. I start by examining the current value labels
to determine which variables could share label definitions and how to handle missing
values:
wf5-sgc4a-vallab-check.do:
List current value labels for review.
To create new value labels, I create dummy label define and label values com-
mands:
wf5-sgc4b-vallab-dump.do:
Create a file with label define and label values commands.
The edited commands for value labels are used to create a new dataset:
wf5-sgc4c-vallab-revise.do:
Add new value labels to the default language and save wf-sgc04.dta.
Step 5: Verify the changes
Before finishing, I ask everyone on the research team to check the revised names and
labels, and then steps 2-4 are repeated as needed.
wf5-sgc5a-check.do:
Check the names and labels by trying them with Stata commands.
When everyone agrees on the new names and labels, the do-files and the dataset
wf-sgc04.dta are posted.
With this overview in mind, we can get into the details of making the changes.
5.11.1 Step 1: Check the source data
Step la: List the current names and labels
First, I load the source data and check the data signature (file: wf5-sgc1a-list.do).
. use wf-sgc-source, clear
(Workflow data for SGC renaming example \ 2008-04-03)
. datasignature confirm
(data unchanged since 03apr2008 13:28)
. notes _dta
_dta:
  1.  wf-sgc-source.dta \ wf-sgc-support.do jsl 2008-04-03
The unab command creates the macro varlist with the names of all variables:
. unab varlist : _all
. display "`varlist'"
id_iu cntry_iu vignum serious opfam opfriend tospi tonpm oppme opforg atdisease
> atraised atgenes sdlive sdsocial sdchild sdfriend sdwork sdmarry impown imptre
> at stout stfriend stlimits stuncom tcfam tcfriend tcdoc gvjob gvhealth gvhous
> gvdisben ctxfdoc ctxfmed ctxfhos cause puboften pubfright pubsymp trust gender
> age wrkstat marital edudeg
Using this list, I loop through each variable and display its name, value label, and
variable label. Before generating the list, I set linesize 120 so that long variable
labels are not wrapped. Here is the loop:
1> local counter = 1
2> foreach varname in `varlist' {
3>     local varlabel : variable label `varname'
4>     local vallabel : value label `varname'
5>     display "`counter'." _col(6) "`varname'" _col(19) ///
>         "`vallabel'" _col(32) "`varlabel'"
6>     local ++counter
7> }
Before explaining the loop, it helps to see some of the output:
1.   id_iu                     Respondent Number
2.   cntry_iu     cntry_iu     IU Country Number
3.   vignum       vignum       Vignette
4.   serious      serious      Q1 How serious would you consider Xs situation to be?
5.   opfam        Ldummy       Q2_1 What X should do:Talk to family
6.   opfriend     Ldummy       Q2_2 What X should do:Talk to friends
7.   tospi        Ldummy       Q2_7 What X should do:Go to spiritual or traditional healer
8.   tonpm        Ldummy       Q2_8 What X should do:Take nonprescription medication
(output omitted )
Returning to the program, line 1 initiates a counter for numbering the variables. Line 2
begins the loop through the variable names in varlist and creates the local varname
with the name of the current variable. Line 3 is an extended macro function that creates
the local varlabel with the variable label for the variable in varname (see page 159 for
further details). Line 4 uses another extended macro function to retrieve the name of
the value-label definition. Line 5 displays the results, line 6 adds one to the counter,
and line 7 ends the loop.
Although I could use this list to plan my changes, I prefer a spreadsheet where I
can sort and annotate the information. To move this information into a spreadsheet, I
create a text file, where the columns of data are separated by a delimiter (i.e., a character
designated to indicate a new column of data). Although commas are commonly used as
delimiters, I use a semicolon because some labels contain commas. The first five lines
of the file I created look like
Number;Name;Value label;Variable labels
1;id_iu;;Respondent Number
2;cntry_iu;cntry_iu;IU Country Number
3;vignum;vignum;Vignette
4;serious;serious;Q1 How serious would you consider Xs situation to be?
To create a text file, I need to tell the operating system to open the file named
wf5-sgc1a-list.txt. The commands that write to this file refer to it by a shorter
name, a nickname if you will, called a file handle. I chose myfile as the file handle.
This means that referring to myfile is the same as referring to wf5-sgc1a-list.txt.
Before opening myfile, I need to make sure that the file is not already open. I do this
with the command capture file close myfile, which tells the operating system to
close any file named myfile that is open. capture means that if the file is not open,
ignore the error that is generated when you try to close a file that is not open. Next,
the file open command creates the file:
capture file close myfile
file open myfile using wf5-sgc1a-list.txt, write replace
The options write and replace mean that I want to write to the file (not just read the
file) and, if the file exists, replace it. Here is the loop that writes to the file:
1> file write myfile "Number;Name;Value label;Variable labels" _newline
2> local counter = 1
3> foreach varname in `varlist' {
4>     local varlabel : variable label `varname'
5>     local vallabel : value label `varname'
6>     file write myfile "`counter';`varname';`vallabel';`varlabel'" _newline
7>     local ++counter
8> }
9> file close myfile
Line 1 writes an initial line with labels for each column: Number, Name, Value label,
and Variable labels. Lines 2-5 are the same as the commands used in the loop on
page 179. Line 6 replaces display with file write, where _newline starts a new line
in the file. The string "`counter';`varname';`vallabel';`varlabel'" combines
the local macros with semicolons in between. Line 7 increments the counter by 1, and
line 8 closes the foreach loop. Line 9 closes the file. I import the file into a spreadsheet
program, here Excel, where the data look like this (file: wf5-sgc1a-list.xls):
   A         B           C             D
   Number    Name        Value label   Variable labels
   1         id_iu                     Respondent Number
   2         cntry_iu    cntry_iu      IU Country Number
   3         vignum      vignum        Vignette
   4         serious     serious       Q1 How serious would you consider Xs situation to be?
   5         opfam       Ldummy        Q2_1 What X should do:Talk to family
   6         opfriend    Ldummy        Q2_2 What X should do:Talk to friends
   7         tospi       Ldummy        Q2_7 What X should do:Go to spiritual or traditional healer
   8         tonpm       Ldummy        Q2_8 What X should do:Take non-prescription medication
   9         oppme       Ldummy        Q2_9 What X should do:Take prescription medication
I use this spreadsheet to plan and document the changes I want to make.
Step 1b: Try the current names and labels
To determine how well the current names and labels work, I start with codebook,
compact (file: wf5-sgc1b-try.do):
. codebook, compact

Variable      Obs Unique       Mean       Min       Max  Label
id_iu         200    200    1772875   1100107   2601091  Respondent Number
cntry_iu      200      8     17.495        11        26  IU Country Number
vignum        200     12      6.305         1        12  Vignette
serious       196      4   1.709184         1         4  Q1 How serious would you c...
opfam         199      2   1.693467         1         2  Q2_1 What X should do:Talk
opfriend      198      2   1.833333         1         2  Q2_2 What X should do:Talk
(output omitted )
The labels for opfam and opfriend show that truncation is a problem. Next I use a loop
to run tabulate with each variable, quickly showing problems with the value labels. I
start by dropping the ID variables and age because they have too many unique values to
tabulate and then create a macro varlist with the names of the remaining variables:
drop id_iu cntry_iu age
unab varlist : _all
The loop is simple:
1> foreach varname in `varlist' {
2>     display "`varname':"
3>     tabulate gender `varname', miss
4> }
Line 2 prints the name of the variable (because tabulate does not tell you the name
of a variable if there is a variable label). Line 3 tabulates gender against the current
variable from the foreach loop. I use gender as the row variable because it has only
two categories, making the tables small. The loop produces tables like this:
vignum:
                              Vignette
Gender |  Depressiv  Depressiv  Depressiv  Depressiv  Schizophr      Total
Male             16         11          3          4          7         90
Female            8         12          9          8         13        110
Total            23         23         12          9         20        200
(output omitted)
Clearly, the value labels for vignum need to be changed. Here is another example where
the truncated category labels are a problem:
sdlive:
Q13 To have X as a neighbor?
Gender | Definitel Probably Probably Definite rc Total
Male 39 32 10 4 4 90
Female 45 51 9 5 0 110
Total 84 83 19 9 4 200
Q13 To
have X as
a
neighbor?
Gender 4 Total
Male 1 90
Female 0 110
Total 1 200
Other labels have less serious problems. For example, here I can tell what each category
means, but the labels are hard to read:
serious:
            Q1 How serious would you consider Xs situation to be?
Gender |  Very seri  Moderatel   Not very  Not at al          .      Total
Male            42         37          8          2          1         90
Female          49         38         18          2          3        110
Total           91         75         26          4          4        200
Or, for trust, I find the variable label is too long and the value labels are unclear:
trust:
Q75 Would you say people can be trusted or need to be
careful dealing w/people?
Gender | Most peop Need to b Loa .c ia Total
Male 14 47 29 0 0 90
Female 13 71 24 1 1 110
Total 27 118 53 1 1 200
As I go through the output, I add notes to the spreadsheet and plan the changes that
I want.
5.11.2 Step 2: Create clones and rename variables
When you rename and relabel variables, mistakes can happen. To prevent loss of critical
information, I back up the data as described in chapter 8. I also create clones of the
original variables that I keep in the dataset to compare them to the variables with
revised names and labels. For example, if the source variable is vignum, I create the
clone Svignum (where S stands for source variable). I can delete these variables later or
keep them in the final dataset. Next I run a pair of programs to rename variables.
Step 2a: Create clones
I start by defining a tag macro to use when adding notes to variables
(file: wf5-sgc2a-clone.do). The tag includes only that part of the do-file name that
is necessary to uniquely identify it:
local tag "wE5-sge2a.do jsl 2008-04-09."
Next I load the dataset and check the signature:
. use wf-sgc-source, clear
(Workflow data for SGC renaming example \ 2008-04-03)
. datasignature confirm
(data unchanged since 03apr2008 13:25)
To create clones, I use a foreach loop that is similar to that used in step 1:
1> unab varlist : _all
2> foreach varname in `varlist' {
3>     clonevar S`varname' = `varname'
4>     notes S`varname': Source variable for `varname' \ `tag'
5>     notes `varname': Clone of source variable S`varname' \ `tag'
6> }
Line 3 creates a clone whose name begins with S and ends with the name of the source
variable. Line 4 adds a note to the clone using the local tag to add the name of the
do-file, the date it was run, and who ran it. Line 5 adds a note to the original variable.
(To test your understanding of how notes works, think about what would happen if
line 5 was placed immediately after line 2.) All that remains is to sign and save the
dataset:
. note: wf-sgc01.dta \ create clones of source variables \ `tag'
. label data "Workflow data for SGC renaming example \ 2008-04-09"
. datasignature set, reset
  200:90(85238):981823927:1981917236        (data signature reset)
. save wf-sgc01, replace
file wf-sgc01.dta saved
Step 2b: Create rename commands
The rename command is used to rename variables:
rename old_varname new_varname
For example, to rename VAR06 to var06, the command is rename VAR06 var06. To
rename the variables in wf-sgc01.dta, I begin by creating a file that contains dummy
rename commands that I can edit. For example, I create the command rename atgenes
atgenes that I revise to rename atgenes atgenet. I start by loading the dataset and
verifying the data signature (file: wf5-sgc2b-rename-dump.do):
. use wf-sgc01, clear
(Workflow data for SGC renaming example \ 2008-04-09)
. datasignature confirm
(data unchanged since 09apr2008 14:12)
. notes _dta
_dta:
  1.  wf-sgc-source.dta \ wf-sgc-support.do jsl 2008-04-03
  2.  wf-sgc01.dta \ create clones of source variables \ wf5-sgc2a.do jsl
      2008-04-09.
Next I drop the clones (that I do not want to rename) and alphabetize the remaining
variables:
drop S*
aorder
I use a loop to create the text file wf5-sgc2b-rename-dummy.doi with dummy rename
commands that I edit and include in step 2c:
unab varlist : _all
file open myfile using wf5-sgc2b-rename-dummy.doi, write replace
foreach varname in `varlist' {
    file write myfile "*rename `varname'" _col(22) "`varname'" _newline
}
file close myfile
I use the file write command to write commands to the .doi file. I preface the
commands in the .doi file with * so that they are commented out. If I want to rename
a variable, I remove the * and edit the command. The output file looks like this:
*rename age age
*rename atdisease atdisease
*rename atgenes atgenes
(output omitted )
I copy wf5-sgc2b-rename-dummy.doi to wf5-sgc2b-rename-revised.doi and edit the
dummy commands.
Step 2c: Rename variables
The do-file to rename variables starts by creating a tag and checking the source data
(file: wf5-sgc2c-rename.do):
local tag "“wt5-sgc2c.do js) 2008-04-09."
use wf-sgce01, clear
datasignature confirm
notes _dta
Next I include the edited rename commands:
include wf5-sgc2b-rename-revised.doi
For variables that I do not want to rename (e.g., age), I leave the * so that the line is a
comment. I could delete these but decide to leave them in case I later change my mind.
Here are the names that changed:
Original         Revised
atgenes    =>    atgenet
ctxfdoc    =>    clawdoc
ctxfhos    =>    clawhosp
ctxfmed    =>    clawpmed
gvdisben   =>    gvdisab
gvhous     =>    gvhouse
opforg     =>    opforget
oppme      =>    oppremed
pubfright  =>    pubfrght
sdlive     =>    sdneighb
stuncom    =>    stuncmft
tonpm      =>    opnomed
tospi      =>    opspirit
Why were these variables renamed? atgenes was changed to atgenet because genet
is the abbreviation for genetics used in other names. ctxf refers to "coerced treatment,
forced", which is awkward compared with claw for "coerced by law". hos was changed
to hosp, which is a clearer abbreviation for hospital; med was changed to pmed to indicate
psychopharmacological medications. Next the dataset is saved with a new name:
. note: wf-sgc02.dta \ rename source variables \ `tag'
. label data "Workflow data for SGC renaming example \ 2008-04-09"
. datasignature set, reset
  200:90(109624):981823927:1981917236        (data signature reset)
. save wf-sgc02, replace
file wf-sgc02.dta saved
I check the new names using nmlab, summarize, or codebook, compact.
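For example, a quick spot check of a few of the renamed variables might look like this (the exact commands are illustrative):

* Verify new names and labels for a few renamed variables
nmlab atgenet clawdoc clawhosp sdneighb
codebook atgenet clawdoc clawhosp sdneighb, compact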
5.11.3 Step 3: Revise variable labels
Based on my review of variable labels in step 1, I decided to revise some variable labels.
Step 3a: Create variable-label commands
First, I use existing variable labels to create dummy label variable commands (file:
wf5-sgc3a-varlab-dump.do). As in step 2b, I load the dataset, drop the cloned vari-
ables, sort the remaining variables, and create a local with the names of the variables:
use wf-sgc02.dta
datasignature confirm
drop S*
aorder
unab varlist : _all
Next I open a text file that will hold the dummy variable-label commands. As before, I loop
through varlist and use an extended macro function to retrieve the variable labels.
The file write command sends the information to the file:
file open myfile using wf5-sgc3a-varlab-dummy.doi, write replace
foreach varname in `varlist' {
    local varlabel : variable label `varname'
    file write myfile "label var `varname' " _col(24) `""`varlabel'""' _newline
}
file close myfile
The only tricky thing is putting double quotes around the variable labels. That is, I
want to write "Current employment status" not just Current employment status.
This is done with the code `""`varlabel'""'. At the center, "`varlabel'" inserts
the variable label, such as Current employment status, where the double quotes are
standard syntax for enclosing strings. To write quote marks as opposed to using them
to delimit a string, the characters `" and "' are used. The resulting file looks like this:
label var age "Age"
label var atdisease "Q4 Xs situation is caused by: A brain disease or disorder"
label var atgenet "Q7 Xs situation is caused by: A genetic or inherited problem"
label var atraised "Q5 Xs situation is caused by: the way X was raised"
label var cause "Q62 Is Xs situation caused by depression, asthma, schizophrenia, stress"
(output omitted )
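If the quote syntax seems opaque, a quick experiment at the command line shows what is written; the local name here is illustrative:

local varlabel "Current employment status"
display `""`varlabel'""'

This displays "Current employment status", quote marks included.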
I copy wf5-sgc3a-varlab-dummy.doi to wf5-sgc3a-varlab-revised.doi and edit the
dummy commands to be used in step 3b.
Step 3b: Revise variable labels
The next do-file adds revised variable labels to the dataset (file: wf5-sgc3b-varlab-
revise.do). I start by creating a tag; then I load and verify the data:
local tag "“wf5-sgc3b.do jsi 2008-04-09."
use wi-sgc02, clear \
datasignature confirm !
notes _dta
Although I want to create better labels, I do not want to lose the original labels, so I
use Stata’s language capability. By default, a dataset uses a language called default. I
created a second language called original (for the original, unrevised variable labels)
that is a copy of the default language before that language is changed:
label language original, new copy
With a copy of the original labels saved, I go back to the default language where I will
change the labels:
label language default
To document how the languages were created, I add a note:
note: language original uses the original, unrevised labels; language ///
    default uses revised labels \ `tag'
Next I include the edited file with variable labels:
include wf5-sgc3a-varlab-revised.doi
The commands in the include file look like this:
label var age       "Age in years"
label var atdiseas  "Q04 Cause is brain disorder"
label var atgenet   "Q07 Cause is genetic"
label var atraised  "Q05 Cause is way X was raised"
label var cause     "Q62 Xs situation caused by what?"
(output omitted )
With the changes made, I save the data:
. note: wf-sgc03.dta \ revised var labels for original & default languages \ `tag'
. label data "Workflow data for SGC renaming example \ 2008-04-09"
. datasignature set, reset
  200:90(109624):981823927:1981917236        (data signature reset)
. save wf-sgc03, replace
file wf-sgc03.dta saved
To check the new labels in the default language, I use nmlab:
. nmlab tcfam tcfriend vignum
tcfam      Q43 Family help important?
tcfriend   Q44 Friends help important?
vignum     Vignette number
To see the original labels, type
. label language original
. nmlab tcfam tcfriend vignum
tcfam Q43 How Important: Turn to family for help
tcfriend Q44 How Important: Turn to friends for help
vignum     Vignette
If I am not satisfied with the changes, I revise the include file and rerun the program.
5.11.4 Step 4: Revise value labels
Revising value labels is more challenging for several reasons: value labels require the
two steps of defining and assigning labels; each label definition has labels for multiple
values; one value definition can be used by multiple variables; and to create value labels
in a new language, you must create new label definitions, not just revise the existing
definitions. Accordingly, the programs that follow, especially those of step 4b, are more
difficult than those in the earlier steps. I suggest that you start by skimming this section
without worrying about the details. Then reread it while working through each do-file,
preferably while in Stata where you can experiment with the programs.
Step 4a: List the current labels
I load the dataset and use labelbook to list the value labels and determine which
variables use which label definitions (file: wf5-sgc4a-vallab-check.do):
use wf-sgc03, clear
datasignature confirm
notes _dta
labelbook, length(10)
Here is the output for the Ldist label definition:
value label Ldist

values                                  labels
    range:  [1,4]                       string length:  [16,20]
        N:  4                           unique at full length:  yes
     gaps:  no                          unique at length 10:  no
missing .*: 0                           null string:  no
                                        leading/trailing blanks:  no
                                        numeric -> numeric:  no

definition
    1   Definitely Willing
    2   Probably Willing
    3   Probably Unwilling
    4   Definitely Unwilling

in default attached to   sdneighb sdsocial sdchild sdfriend sdwork sdmarry
                         Ssdlive Ssdsocial Ssdchild Ssdfriend Ssdwork Ssdmarry
in original attached to  sdneighb sdsocial sdchild sdfriend sdwork sdmarry
                         Ssdlive Ssdsocial Ssdchild Ssdfriend Ssdwork Ssdmarry
The first part of the output summarizes the label with information on the number of
values defined, whether the values have gaps (e.g., 1, 2, 4), how long the labels are, and
more. The most critical information for my purposes is unique at length 10, which
was requested with the length(10) option. This option determines whether the first
ten characters of the labels uniquely identify the value associated with the label. For
example, the label for 1 is Definitely Willing whereas the label for 4 is Definitely
Unwilling. If I take the first ten letters of these labels, both 1 and 4 are labeled as
Definitely. Because Stata commands often use only the first ten characters of the
value label, this is a big problem that is indicated by the warning unique at length
10: no. Next definition lists each value with its label. The section in default
attached to lists variables in the default language that use this label, followed by a
list of variables that use this label in the original language. I review the output and
plan my changes.
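While planning, labelbook can also be restricted to a single definition; for example:

* Re-examine one definition, flagging labels that collide at ten characters
labelbook Ldist, length(10)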
Step 4b: Create label define commands to edit
To change the value labels, I create a text file with dummy label define and label
values commands. These are edited and included in step 4c. I start by loading the
data and dropping the cloned variables (file: wf5-sgc4b-vallab-dump.do):
use wf-sgc03, clear
datasignature confirm
notes _dta
drop S*
Next I create the local valdeflist with the names of the label definitions used by all
variables except the clones, which have been dropped. Because I only want the list and
not the output, I use quietly:
quietly labelbook
local valdeflist = r(names)
The list of label definitions is placed in the local valdeflist. Next I create a file with
dummy label define commands. There are two ways to do this.
Approach 1: Create label define statements with label save
The simplest way to create a file with the label define command for the current
labels is with label save:
label save `valdeflist' using ///
    wf5-sgc4b-vallab-labelsave-dummy.doi, replace
This command creates wf5-sgc4b-vallab-labelsave-dummy.doi with information
that looks like this for the Ldist label definition:
label define Ldist 1 `"Definitely Willing"', modify
label define Ldist 2 `"Probably Willing"', modify
label define Ldist 3 `"Probably Unwilling"', modify
label define Ldist 4 `"Definitely Unwilling"', modify
I copy wf5-sgc4b-vallab-labelsave-dummy.doi to wf5-sgc4b-vallab-labelsave-
revised.doi and make the revisions. I change the name of the definition to NLdist because
I want to keep the original Ldist labels unchanged. The edited definitions look like this:
label define NLdist 1 `"1DefWillng"', modify
label define NLdist 2 `"2ProbWill"', modify
label define NLdist 3 `"3ProbUnwil"', modify
label define NLdist 4 `"4DefUnwill"', modify
After revising all definitions, I use the edited file as an include file in step 4c.
Approach 2: Create customized label define statements (advanced material)
If you have a small number of label definitions to change, label save works fine.
Because our project had 17 datasets and hundreds of variables, I wrote a program that
creates commands that are easier to edit. Although you can skip this section if you find
the commands created by label save adequate, you might find the programs useful
for learning more about automating your work. First, I run the uselabel command:
uselabel `valdeflist', clear
This command replaces the data in memory with a dataset consisting of value labels
from the definitions listed in valdeflist (created above by labelbook). Each obser-
vation has information about the label for one value from one value-label definition.
For example, here are the first four observations with information on the Ldist label
definition:
. list in 1/4, clean

       lname   value   label                   trunc
  1.   Ldist       1   Definitely Willing          0
  2.   Ldist       2   Probably Willing            0
  3.   Ldist       3   Probably Unwilling          0
  4.   Ldist       4   Definitely Unwilling        0
Variable lname is a string variable containing the name of the value-label definition;
value is the value being labeled by the current row of the dataset; label is the value
label; and trunc is 1 if the value label has been truncated to fit into the string variable
label.
Next I open a file to hold dummy label define commands that I edit to create an
include file used in step 4c to create new value labels:
file open myfile using wf5-sgc4b-vallab-labdef-dummy.doi, write replace
Before examining the loop that creates the commands, it helps to see what the file will
look like:
//                            1234567890
label define NLdist     1    "Definitely Willing", modify
label define NLdist     2    "Probably Willing", modify
label define NLdist     3    "Probably Unwilling", modify
label define NLdist     4    "Definitely Unwilling", modify
//                            1234567890
label define NLdummy    1    "Yes", modify
label define NLdummy    2    "No", modify
(output omitted)
The first line is a comment that includes the numbers 1234567890 that serve as a
guide for editing the label define commands to create labels that are 10 characters
or shorter. These guide numbers are the major advantage of the current approach over
approach 1. The next four lines are the label define commands needed to create
NLdist. Another line with the guide is written before the label define commands for
NLdummy, and so on. Here is the loop that produces this output:
 1> local rownum = 0
 2> local priorlbl ""
 3> while `rownum' <= _N {
 4>     local ++rownum
 5>     local lblnm = lname[`rownum']
 6>     local lblval = value[`rownum']
 7>     local lbllbl = label[`rownum']
 8>     local startletter = substr("`lblval'",1,1)
 9>     if "`priorlbl'"!="`lblnm'" {
10>         file write myfile "//" _col(31) "1234567890" _newline
11>     }
12>     if "`startletter'"!="." {
13>         file write myfile ///
14>             "label define N`lblnm'" _col(25) "`lblval'" ///
15>             _col(30) `""`lbllbl'""' ", modify" _newline
16>     }
17>     local priorlbl "`lblnm'"
18> }
19> file close myfile
Although the code in this section is complex, I describe what it does in the section
below. In addition, I encourage you to try running wf5-sgc4b-vallab-dump.do (part
of the Workflow package) and experiment with the code.
Lines 1 and 2: define locals. Line 1 creates local rownum to count the row of the dataset
that is being read. Line 2 defines the local priorlbl with the name of the label from
the prior row of the dataset. For example, if rownum is 9, priorlbl contains the name
of the label when rownum was 8. This is used to compare the label being read in the
current row with that in the prior row. If they differ, a new label is started.
Lines 3, 4, 18: loop. Line 3 begins a loop in which local rownum increases from 1
through the last row of the dataset (_N indicates the number of rows in the dataset).
Line 4 increases the counter rownum by 1. The loop ends in line 18.
Lines 5-8: retrieve information on current row of data. These lines retrieve information
about the label in the current row of the dataset. Line 5 creates local lblnm with the
contents of variable lname (a string variable with the name of the label for this row)
in row rownum. For example, in row 1, lblnm equals Ldist. Lines 6 and 7 do the same
thing for the variables value and label, thus retrieving the value considered in this
row and the label assigned to that value. Line 8 creates the local startletter with the
first character of the value (the function substr extracts a substring from a string).
If startletter contains a period, I know that the label is for a missing value.
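A quick check of this logic at the command line (the values shown are illustrative):

* Values retrieved into a local are text; missing values such as . or .a
* begin with a period, so substr() can detect them
display substr(".a",1,1)    // displays .
display substr("4",1,1)     // displays 4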
Lines 9-11: write a header with guide numbers. Line 9 checks if the name of the label in
the current row (contained in local lblnm) is the same as the name of the label from the
prior row (contained in the local priorlbl). The first time through the loop, the prior
label is a null string, which does not match the label for the first row. If the current
and prior labels differ, the current row is the first row for the new label. Line 10 adds
a comment with guide numbers that help when editing the labels to make them ten
characters or less. The if condition ends in line 11.
Lines 12-16: write label define command. Line 12 checks if the first character of the
value for the current label is a period. If it is, then the value is a missing value, and I
do not want to write a label define command; I will handle missing values later in the
program. Lines 13-15 write a dummy label define command to the file as illustrated
in the sample contents of the file listed above. The names of the value labels start with
an N (standing for new label) followed by the original label name (e.g., label age becomes
Nage). I change the name because I do not want to change the original labels. Line 16
ends the if condition.
Line 17: update local priorlbl. Line 17 assigns the current label name to the local
priorlbl. This information is used in line 9 to determine if the current observation
starts a new value label.
Line 19: close the file. Line 19 closes the file myfile. Remember that a file is not written
to disk until it is closed.
Create label values commands
Next I generate the commands to assign these labels to variables. By now, you
should be able to follow what the program is doing:
use wf-sgc03, clear
drop S*
aorder
unab varlist : _all
file open myfile using wf5-sgc4b-vallab-labval-dummy.doi, write replace
foreach varname in `varlist' {
    local lblnm : value label `varname'
    if "`lblnm'"!="" {
        file write myfile "label values `varname'" _col(27) "N`lblnm'" _newline
    }
}
file close myfile
The output looks like this:
label values age          Nage
label values atdisease    NLlikely
label values atgenet      NLlikely
label values atraised     NLlikely
label values cause        Ncause
label values clawdoc      NLrespons
label values clawhosp     NLrespons
label values clawpmed     NLrespons
(output omitted )
The two files created in this step are edited and used to create new labels in step 4c.
Step 4c: Revise labels and add them to dataset
I copy wf5-sgc4b-vallab-labdef-dummy.doi to
wf5-sgc4b-vallab-labdef-revised.doi and revise the label definitions. For example,
here are the revised commands for NLdist:
//                            1234567890
label define NLdist     1    "1Definite", modify
label define NLdist     2    "2Probably", modify
label define NLdist     3    "3ProbNot", modify
label define NLdist     4    "4DefNot", modify
The guide numbers verify that the new labels are not too long. Similarly, I copy
wf5-sgc4b-vallab-labval-dummy.doi to wf5-sgc4b-vallab-labval-revised.doi
and revise it to look like this:
label values age         Nage
label values atdiseas    NLlikely
label values atgenet     NLlikely
label values atraised    NLlikely
label values cause       Ncause
label values clawdoc     NLrespons
(output omitted )
Now I am ready to change the labels. I load the data and confirm that it has the right
signature (file: wf5-sgc4c-vallab-revise.do):
. use wf-sgc03, clear
(Workflow data for SGC renaming example \ 2008-04-09)
. datasignature confirm
(data unchanged since 09apr2008 17:59)
Next I include the files with the changes to the labels:
include wf5-sgc4b-vallab-labdef-revised.doi
include wf5-sgc4b-vallab-labval-revised.doi
Now I add labels for the missing values. To do this, I need a list of the value labels being
used in the noncloned variables. Because I do not want to lose the label definitions I
just added, I save a temporary dataset, drop the cloned variables, and use labelbook
to get a list of value definitions. Then I load the dataset that I had temporarily saved:
save x-temp, replace
drop S*
quietly labelbook
local valdeflist = r(names)
use x-temp, clear