• 371 • Shanghai Archives of Psychiatry, 2014, Vol. 26, No.
•Research methods in psychiatry•
Secondary analysis of existing data: opportunities and
implementation
Hui G. CHENG1*, Michael R. PHILLIPS1,2
Summary: The secondary analysis of existing data has become an increasingly popular method of enhancing
the overall efficiency of the health research enterprise. But this effort depends on governments, funding
agencies, and researchers making the data collected in primary research studies and in health-related
registry systems available to qualified researchers who were not involved in the original research or in the
creation and maintenance of the registry systems. The benefits of doing this are clear but the barriers are
many, so the effort of increasing access to such material has been slow, particularly in low- and middle-
income countries. This article introduces the rationale and concept of the secondary analysis of existing
data, describes several sources of publicly available datasets, provides general guidelines for conducting
secondary analyses of existing data, and discusses the advantages and disadvantages of analyzing existing
data.
Key words: statistical data interpretation; secondary analysis; existing data; data collection;
National Institutes of Health
[Shanghai Arch Psychiatry. 2014; 26(6): 371-375. doi: http://dx.doi.org/10.11919/j.issn.1002-0829.214171]
1 Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China
2 Departments of Psychiatry and Global Health, Emory University, Georgia, United States
*correspondence: chengyaojin@yahoo.com
A full-text Chinese translation of this article will be available at www.shanghaiarchivesofpsychiatry.org on January 25, 2015.

1. Background

A typical mental health research project begins with the development of a comprehensive research proposal and is (hopefully) followed by the successful acquisition of funding; the researcher then collects data, analyzes the results, and writes up one or more research reports. Another less common, but no less important, research method is the analysis of existing data. The analysis of existing data is a cost-efficient way to make full use of data that are already collected to address potentially important new research questions or to provide a more nuanced assessment of the primary results from the original study. In this article we discuss the distinction between primary and secondary data, provide information about existing mental health-related data that are publicly available for further analysis, list the steps of conducting analyses of existing data, and discuss the pros and cons of analyzing existing data.

2. Data sources

2.1 ‘Primary data’, ‘secondary data’, or ‘existing data’?

There is frequently confusion about the use of the terms ‘primary data’, ‘primary data analysis’, ‘secondary data’, and ‘secondary data analysis’. This confusion arises because it is never completely clear whether data employed in an analysis should be considered ‘primary data’ or ‘secondary data’. Based on the usage of the National Institutes of Health (NIH) in the United States, ‘primary data analysis’ is limited to analyses that are conducted by members of the research team that collected the data and that address the original hypotheses proposed in the study. All other analyses of data collected for specific research studies or analyses of data collected for other purposes (including registry data) are considered ‘secondary analyses of existing data’, whether or not the persons conducting the analyses participated in the collection of the data. This replacement of the traditional term ‘secondary data analysis’ with the term ‘secondary analysis of existing data’ is a much clearer categorization because it avoids the confusion of trying to decide whether the data employed in an analysis are ‘primary data’ or ‘secondary data’.

Of course, there are cases where the distinction is less clear. One example would be the analysis of data by a researcher who has no connection with the data collection team to address a research question that overlaps with the hypotheses considered in the original study. Another example would be when a member of the original research team subsequently revisits the original hypothesis in an analysis that uses different statistical methods. These situations commonly occur in the analyses of large-scale population surveys where the research questions are generally broad (e.g., sociodemographic correlates of depression) and when the participating researchers share the cleaned data with the broader research community. In both of these situations, based on a strict application of the NIH usage, the analyses would be considered ‘secondary analysis of existing data’, NOT ‘primary data analysis’ and NOT ‘secondary data analysis’. In fact, we recommend avoiding the ambiguous term ‘secondary data analysis’ entirely.

2.2 Sources of existing data

Existing data can be private or public. To maximize the output of data collection efforts, researchers often assess many more variables than those strictly needed to answer their original hypotheses. Often, these data are not fully used or explored by the original research team due to restrictions in time, resources, or interest. Unfortunately, the vast majority of these completed datasets are not made available, and in many countries (including China), there isn’t even a registry or other means of determining what data have been previously collected about a specific research topic (so there are many unnecessarily duplicated studies). However, if the research team is willing to share their data with other researchers who have the interest, skills, and resources to conduct additional analyses, this can greatly increase the productivity of the research team that conducted the original study. This type of exchange usually involves an agreement between the data collection team and the data analysis team to clarify details about data sharing protocols and how the data should be used.

There are several publicly available health-related electronic databases that can be used to address a variety of research topics. A few examples follow.

(a) The World Health Organization (WHO) Global Health Observatory Data Repository (http://apps.who.int/gho/data/?theme=main) provides statistics on an array of health-related topics for countries around the world. However, these statistics are generally at the country level, so regional or population subgroup-specific data are not usually available. A similar source is the website of the Institute for Health Metrics and Evaluation at the University of Washington in the United States (http://www.healthdata.org/). This website includes the Global Burden of Disease (GBD) estimates, which quantify country-level health-related burden (i.e., cause-specific mortality and disability) from 1990 to 2010, and data visualization tools which make it possible to compare the relative importance of different health conditions (including mental disorders) between countries and between different population groups within countries (http://www.healthdata.org/gbd/data-visualizations).

(b) Established in 1962, the Inter-university Consortium for Political and Social Research (ICPSR, http://www.icpsr.umich.edu/icpsrweb/landing.jsp) is a major data source for scholars in the social sciences. Located at the University of Michigan in the United States, ICPSR is a membership-based network that includes 65,000 datasets from over 8,000 discrete studies or surveys, including a number of large-scale population surveys conducted in the United States and other countries. The website provides online analysis tools to generate simple descriptive statistics including frequencies and cross-tabulations. In addition to ASCII and .txt format, the website also provides options for downloading data in formats that are compatible with popular statistical software packages such as SAS, Stata, SPSS, and R. The website also provides technical support in data analysis and in the identification of potential data sources. In order to download data, users need to register with the system.

(c) A variety of government agencies in the United States regularly collect data on different health-related topics and post them online for free download once data cleaning is completed. For example, the United States Census Bureau (http://www.census.gov/data.html) provides basic demographic data and the Centers for Disease Control and Prevention (http://www.cdc.gov) provides access to data on cause-specific disability, mortality, and an array of health conditions including injuries and violence, alcohol use, and tobacco smoking. The Substance Abuse and Mental Health Services Administration has a range of datasets posted on its website (http://www.samhsa.gov/data/) about various mental and substance use disorders. Users interested in more information about publicly available health-related data can refer to Secondary data sources for public health: A practical guide by Boslaugh.[1]
3. Conducting a secondary analysis of existing data

There are two general approaches for analyzing existing data: the ‘research question-driven’ approach and the ‘data-driven’ approach. In the research question-driven approach, researchers have an a priori hypothesis or a question in mind and then look for suitable datasets to address the question. In the data-driven approach, researchers glance through variables in a particular dataset and decide what kind of questions can be answered by the available data. In practice, the two approaches are often used jointly and iteratively. Researchers typically start with a general idea about the question or hypothesis and then look for available datasets which contain the variables needed to address the research questions of interest. If they do not find datasets that contain all variables needed, they usually modify the research question(s) or the analysis plan based on the best available data.

When conducting either research question-driven or data-driven approaches to the analysis of existing data, researchers need to follow the same basic steps.

(a) There needs to be an analytic plan that includes the specific variables to be considered and the types of analyses that will be conducted. (In the research question-driven approach this is determined before the researchers look at the actual data available in the dataset; in the data-driven approach this is determined after the researchers look through the dataset.)

(b) Researchers must have a comprehensive understanding of the strengths and weaknesses of the dataset. This involves obtaining detailed descriptions of the population under study, sampling scheme and strategy, time frame of data collection, assessment tools, response levels, and quality control measures. To the extent possible, researchers need to obtain and study in detail all survey instruments, codebooks, guidebooks, and any other documentation provided for users of the databases. These documents should provide sufficient information to assess the internal and external validity of the data and allow researchers to determine whether or not there are enough cases in the dataset to generate meaningful estimates about the topic(s) of interest.

(c) Before conducting the analysis, researchers need to generate operational definitions of the exposure variable(s), outcome variable(s), covariates, and confounding variables that will be considered in the analysis.

(d) The first step in the analysis is to run frequency tables and cross-tabulations of all variables that will be included in the main analysis. This provides information about the use of the coding pattern for each variable and about the profile of missing data for each variable. Due attention should be paid to skip patterns, which can result in large numbers of missing values for certain variables. In comprehensive surveys that take a long time to complete, skipping a group of questions that are not relevant for a particular respondent (i.e., ‘skips’) is a common method used to reduce interviewee burden and to avoid interviewee burn-out. For example, in a survey about alcohol-related problems, the survey module typically starts with questions about whether the interviewee has ever drunk alcohol. If the answer is negative, all questions about drinking behaviors and related problems are skipped because it is safe to assume that this interviewee does not have any such problems. Prior to conducting the full analysis, these types of missing values (which indicate that a particular condition is not relevant for the respondent) need to be distinguished from missing values for which the data are, in fact, missing (which indicate that the status of the individual related to the variable is unknown). Researchers should be aware of these skips in order to make a strategic judgment about the coding of these variables.

(e) Next, the researcher should recode the original variables in order to properly handle missing values and, if necessary, to transform the distribution of the variables so that they meet the assumptions of the statistical model to be used in the intended analysis. The recoded variables should be stored in a new dataset and all syntax for the recoding of variables (and for the analysis itself) should be documented. The original dataset should NEVER be altered in any way.

(f) When using data from longitudinal surveys or when using data stored in different datasets, it is critical to check the accuracy of the identifier variable(s) to ensure that the data from different time periods or from different datasets are matched correctly when merging the datasets.

(g) For longitudinal studies, the assessment methods and the coding methods for key variables can change over time. Thus, close examination of the survey questionnaires and codebooks is essential to ensure that each variable in the combined dataset has a uniform interpretation throughout the study. This may require the creation of separate uniform variables that are constructed in different ways at different points in time throughout the study, such as the crosswalks to convert diagnostic categories between DSM-III, DSM-IV, and DSM-5.

(h) Many population-based surveys, particularly those focused on assessing the prevalence of relatively uncommon conditions such as schizophrenia, employ multi-stage sampling strategies to enrich the sample. In this case, the dataset usually includes design variables for each case (including sampling weight, strata, and primary sampling unit) that are needed to adjust the analysis of interest (such as the prevalence of a condition, odds ratios, mean differences, etc.). Researchers who conduct secondary analysis of existing data should consider the design variables used in the original study and apply these variables appropriately in their own analyses in order to generate less biased estimates.[2,3]

4. Pros and cons of the secondary analysis of existing data

4.1 Advantages

The most obvious advantage of the secondary analysis of existing data is the low cost. There is sometimes a fee required to obtain access to such datasets, but this is almost always a tiny proportion of what it would cost to conduct an original study. Also, the data posted online are usually cleaned by professional staff members who often provide detailed documentation about the data collection and data cleaning process. Moreover, teams conducting large-scale population-based surveys that are made available to others usually employ statisticians to generate ready-to-use survey weights and design variables (something that most users of the data are unable to do), which helps users make necessary adjustments to their estimates. This is a great boon to graduate students and others who have lots of good ideas but no money to conduct the studies that could test their ideas.

Researchers who would rather spend their time testing hypotheses and thinking about different research approaches than collecting primary data can find a large amount of data online. The increasing availability of such data online encourages the creative use and cross-linking of information from different data sources. For example, experts in hierarchical models can combine data from individual surveys with aggregate data from different administrative levels of a community (e.g., village, township, county, province) to examine the factors associated with health-related outcomes at each level. The availability of such databases also provides statisticians with real-life data to test new statistical models. Such analyses could identify potential new interventions for existing problems that can subsequently be tested in prospective studies.

4.2 Disadvantages

By the very nature of the secondary analysis of existing data, the available data were not collected to address the particular research question or to test the particular hypothesis. It is not uncommon that some important third variables are not available for the analysis. Similarly, the data may not have been collected for all population subgroups of interest or for all geographic regions of interest. Another problem is that, to protect the confidentiality of respondents, publicly available datasets usually delete identifying variables about respondents, variables that may be important in the intended analysis such as zip codes, the names of the primary sampling units, and the race, ethnicity, and specific age of respondents. This can create residual confounding when the omitted variables are crucial covariates to control for in the secondary analysis.

Another major limitation of the analysis of existing data is that the researchers who are analyzing the data are not usually the same individuals as those involved in the data collection process. Therefore, they are probably unaware of study-specific nuances or glitches in the data collection process that may be important to the interpretation of specific variables in the dataset. Sometimes, the amount of documentation is daunting (particularly for complex, large-scale surveys conducted by government agencies), so users may miss important details unless they are prominently presented in the documents. Succinct documentation of important information about the validity of the data (by the provider) and careful examination of all relevant documents (by the user) can mitigate this problem.

5. Government support for secondary analysis of existing data

This paper discusses several issues related to the secondary analysis of existing data. There are definitely limitations to such analyses, but the great advantage is that secondary analyses can dramatically increase the overall efficiency of the research effort and, as a secondary advantage, give young researchers with good ideas but little access to research funds the opportunity to test their ideas. Recognizing the importance of making the most of high-quality research data and of rapidly translating research findings into actionable knowledge, starting in 2003 the United States National Institutes of Health (NIH), the largest funding agency for biomedical research in the world, required all projects with annual direct costs of 500,000 US dollars or more to include data-sharing plans in their proposals. Moreover, NIH has released several program announcements specifically designed to promote secondary analysis of existing datasets. Other countries and some large health care providers also make registry data available to qualified researchers. These practices ensure that other researchers not involved in the studies or in the creation and maintenance of the registries will be able to use the data generated by these big projects or by the registries to test a wide range of hypotheses. Other governments (including the Chinese government), health-related non-governmental organizations, and other funders of biomedical research need to follow these examples. Failure to provide qualified researchers access to government-generated registry data or to government-supported research data results in a huge but unnecessary wastage of economic and intellectual resources that could be better employed to improve the health of the nation.
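
As a concrete illustration of the data-preparation steps described in Section 3 (frequency tables, skip patterns, recoding into a new dataset, identifier checks, and sampling weights), the following is a minimal sketch in Python using only the standard library. Every variable name, code value, and record in it is hypothetical and invented for illustration; none comes from any real survey. Coding a legitimate skip as 'no problem' is only one possible judgment, and the weighted estimate shown is a point estimate only: correct standard errors also require the strata and primary sampling unit variables and dedicated survey software.

```python
from collections import Counter
from copy import deepcopy

# Hypothetical raw survey extract (all names and codes are illustrative).
# ever_drank:  1 = yes, 2 = no, None = not answered
# alc_problem: 1 = yes, 2 = no, None = blank (skipped OR truly missing)
# weight:      sampling weight supplied by the data provider
RAW = [
    {"id": 101, "ever_drank": 1, "alc_problem": 1, "weight": 50.0},
    {"id": 102, "ever_drank": 1, "alc_problem": 2, "weight": 150.0},
    {"id": 103, "ever_drank": 2, "alc_problem": None, "weight": 200.0},
    {"id": 104, "ever_drank": 1, "alc_problem": None, "weight": 100.0},
    {"id": 105, "ever_drank": 2, "alc_problem": None, "weight": 100.0},
]


def frequency_table(rows, variable):
    """Count every code of a variable, including blanks."""
    return Counter(row[variable] for row in rows)


def recode_problem(rows):
    """Build a NEW dataset with an analysis-ready 0/1 variable.

    Blanks caused by the skip pattern (never drank) become 0 (no problem);
    blanks for everyone else stay None because the true status is unknown.
    """
    recoded = deepcopy(rows)  # the original records are never altered
    for row in recoded:
        if row["alc_problem"] is not None:
            row["problem01"] = 1 if row["alc_problem"] == 1 else 0
        elif row["ever_drank"] == 2:
            row["problem01"] = 0      # legitimate skip
        else:
            row["problem01"] = None   # genuinely missing
    return recoded


def check_unique_ids(rows, key="id"):
    """Refuse to proceed if the identifier used for merging is duplicated."""
    counts = Counter(row[key] for row in rows)
    duplicates = sorted(k for k, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicated identifiers: {duplicates}")


def weighted_prevalence(rows, variable, weight="weight"):
    """Weighted mean of a 0/1 variable over cases with known status."""
    complete = [r for r in rows if r[variable] is not None]
    total_weight = sum(r[weight] for r in complete)
    return sum(r[weight] * r[variable] for r in complete) / total_weight


check_unique_ids(RAW)
analysis = recode_problem(RAW)
raw_counts = frequency_table(RAW, "alc_problem")
recoded_counts = frequency_table(analysis, "problem01")
prevalence = weighted_prevalence(analysis, "problem01")
```

In this toy extract the raw frequency table shows three blanks, but after recoding only one of them remains a true missing value; the other two were skips. Among the four cases with known status, the unweighted prevalence is 1/4 while the weighted prevalence is 50/500 = 0.10, which shows how ignoring the design variables can distort an estimate.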
Conflict of interest
The authors declare no conflict of interest related to this article.

Funding
This work was supported by a grant from the China Medical Board (13-165) to HGC.
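Analysis of existing data: opportunities and implementation
程辉, 费立鹏
Summary: The secondary analysis of existing data has become an increasingly popular way of improving the overall efficiency of health research institutions. This work depends on governments, funding agencies, and researchers, and on whether they are able to give qualified researchers who did not take part in the original research, or in creating and maintaining the registry systems, access to the original research data or to registry data. The benefits of secondary analysis are obvious, but the obstacles are many. Efforts to improve access to these data have therefore progressed slowly, particularly in low- and middle-income countries. This article introduces the rationale and concepts of the secondary analysis of existing data, describes several publicly available databases, provides general guidelines for the secondary analysis of existing data, and discusses the advantages and shortcomings of analyzing existing data.
Key words: statistical data interpretation; data collection; National Institutes of Health
A full-text Chinese version of this article will be freely available to read and download at www.shanghaiarchivesofpsychiatry.org from January 25, 2015.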
References
1. Boslaugh S. Secondary data sources for public health: A practical guide. New York, NY: Cambridge; 2007
2. Lohr SL. Sampling: Design and analysis (2nd Ed.). Boston, MA: Brooks/Cole; 2010
3. Graubard BI, Korn EL. Modelling the sampling design in the analysis of health surveys. Stat Methods Med Res. 1996; 5(3): 263-381
(received, 2014-11-11; accepted, 2014-12-04)
Dr. Hui Cheng is an epidemiologist by training. She is currently a post-doctoral research associate at
Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine. She has published
findings from studies on mental health-related topics using public data. Her main interests are substance
use and related problems, and public mental health.