Sensitivity, Specificity, and Predictive Values
*Correspondence: Robert Trevethan, robertrevethan@gmail.com
Specialty section: This article was submitted to Epidemiology, a section of the journal Frontiers in Public Health
Received: 05 September 2017
Accepted: 03 November 2017
Published: 20 November 2017
Citation: Trevethan R (2017) Sensitivity, Specificity, and Predictive Values: Foundations, Pliabilities, and Pitfalls in Research and Practice. Front. Public Health 5:307. doi: 10.3389/fpubh.2017.00307

INTRODUCTION

There are arguably two kinds of tests used for assessing people’s health: diagnostic tests and screening tests. Diagnostic tests are regarded as providing definitive information about the presence or absence of a target disease or condition. By contrast, screening tests, which are the focus of this article, typically have advantages over diagnostic tests such as placing fewer demands on the healthcare system and being more accessible as well as less invasive, less dangerous, less expensive, less time-consuming, and less physically and psychologically discomforting for clients. Screening tests are also, however, well-known for being imperfect and they are sometimes ambiguous. It is, therefore, important to determine the extent to which these tests are able to identify the likely presence or absence of a condition of interest so that their findings encourage appropriate decision making. If practitioners are confident when using screening tests, but their confidence is not justified, the consequences could be serious for both individuals and the healthcare system (1, 2). It is important, therefore, that confusion should be avoided with regard to how the adequacy and usefulness of screening tests are determined and described. In this article, an attempt is made to identify why confusion can exist, how it might be resolved, and how, once resolved, improvements could be made with
regard to the description and use of screening tests. The focus is on the sensitivity, specificity, and predictive values of those tests.

DETERMINING SENSITIVITY, SPECIFICITY, AND PREDICTIVE VALUES

When the adequacy, also known as the predictive power or predictive validity, of a screening test is being established, the outcomes yielded by that screening test are initially inspected to see whether they correspond to what is regarded as a definitive indicator, often referred to as a gold standard, of the same target condition. The analyses are typically characterized in the way shown in Figure 1. There it can be seen from the two columns under the heading Status of person according to “gold standard” that people are categorized as either having, or as not having, the target condition. The words “gold standard” suggest that this initial categorization is made on the basis of a test that provides authoritative, and presumably indisputable, evidence that a condition does or does not exist. Because there can be concerns about the validity of these so-called gold standards (3, 4), they have increasingly been referred to less glowingly as reference standards (5), thus removing what seemed to be unreserved endorsement. That wording (i.e., reference standard) will be used for the remainder of this article.

Independent of the categorization established on the basis of the reference standard, people are also assessed on the screening test of interest. That test might comprise a natural dichotomy or it might be based on whether the test outcomes fall below or above a specified cutoff point on a continuum. It might also comprise a battery of tests that, together, are regarded as a single test (6–8).

Based on their reference standard and screening test results, people are assigned to one of the four cells labeled a through d in Figure 1 depending on whether they are definitely regarded as having or as not having the target condition based on the reference standard, and whether the screening test yielded a positive result (the person appears to have the condition) or a negative result (the person appears not to have the condition).

What are referred to as sensitivity, specificity, and predictive values can then be calculated from the numbers of people in each of the four cells, and, if expressed as percentages, are based on the following formulas:

Sensitivity = [a / (a + c)] × 100
Specificity = [d / (b + d)] × 100
Positive predictive value (PPV) = [a / (a + b)] × 100
Negative predictive value (NPV) = [d / (c + d)] × 100.

These are the metrics that are cited (i.e., often as percentages, although sometimes as decimal fractions, and preferably with accompanying 95% confidence intervals) when researchers and clinicians refer to sensitivity, specificity, and predictive values to describe the characteristics of a screening test. The simplicity, and even familiarity, of these four metrics can mask the existence of a number of complexities that sometimes appear to be underappreciated, however. Deficiencies in either the reference standard or the screening test, or in both, can exist. Furthermore, the four metrics should not be regarded as unquestionably valid and fixed attributes of a screening test: the values that are entered into the cells of Figure 1 depend on how stringent the screening test is and the prevalence of the target condition in the sample of people used in the analysis.

Because of these complexities, it is sometimes necessary to examine the validity of measurement procedures within both the reference standard and the screening test (3, 8). It might also be necessary to question the stringency of the screening test and to ensure that there is a match between the samples that were used for assessing a screening test and the people subsequently being screened (2, 3, 9–11).
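As a minimal sketch, the four formulas for sensitivity, specificity, PPV, and NPV can be written in Python as follows. The function names, the example counts, and the choice of the Wilson score method for the 95% confidence interval are assumptions made for illustration only; the article does not prescribe an interval method, and the counts are invented.

```python
from math import sqrt


def screening_metrics(a, b, c, d):
    """Return sensitivity, specificity, PPV, and NPV as percentages.

    Cell labels follow Figure 1:
      a = screening test positive, reference standard positive (true positives)
      b = screening test positive, reference standard negative (false positives)
      c = screening test negative, reference standard positive (false negatives)
      d = screening test negative, reference standard negative (true negatives)
    """
    return {
        "sensitivity": 100 * a / (a + c),
        "specificity": 100 * d / (b + d),
        "ppv": 100 * a / (a + b),
        "npv": 100 * d / (c + d),
    }


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion, returned as percentages.

    Assumption: the Wilson method is one common choice; it is not the
    method named in the article, which names none.
    """
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return 100 * (centre - half), 100 * (centre + half)


# Hypothetical counts, invented purely for illustration.
cells = dict(a=45, b=15, c=5, d=135)
metrics = screening_metrics(**cells)

# Sensitivity uses only cells a and c, so its interval is based on a + c people.
sens_low, sens_high = wilson_interval(cells["a"], cells["a"] + cells["c"])
```

The sketch does not guard against empty rows or columns (for example, a + b = 0), which would raise a division error and would need handling in real use.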
FIGURE 1 | Diagram demonstrating the basis for deriving sensitivity, specificity, and positive and negative predictive values.
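The dependence of predictive values on prevalence can be illustrated with a short Python sketch based on Bayes' theorem. The function name and the figures used (90% sensitivity, 90% specificity, and three prevalence levels) are hypothetical and are not taken from any study cited in this article.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) as fractions, using Bayes' theorem.

    All three arguments are fractions between 0 and 1.
    """
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical test with fixed sensitivity and specificity,
# evaluated at three assumed prevalence levels.
results = {
    prev: predictive_values(0.90, 0.90, prev) for prev in (0.01, 0.10, 0.50)
}

for prev, (ppv, npv) in results.items():
    print(f"prevalence {prev:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

Under these assumed figures, sensitivity and specificity stay fixed while PPV falls from 90% at 50% prevalence to roughly 8% at 1% prevalence, which is why the sample used to evaluate a screening test should match the people subsequently screened.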
It is also important to recognize that there are sometimes noticeable tradeoffs between sensitivity and specificity, as well as between positive predictive values (PPVs) and negative predictive values (NPVs). This is demonstrated in the first four rows of entries in Table 1. Furthermore, as also illustrated in Table 1, there is little or no consistency regarding either size or pattern of sensitivity, specificity, and predictive values in different contexts, so it is not possible to determine one of them merely from information about any of the others. In that sense, they are pliable in relation to each other. This indicates that it is necessary to appreciate the foundations of, distinctions between, and uses and misuses of each of these metrics, and that it is necessary to provide information about all of them, as well as the reference standard and the sample on which they are based, to characterize a screening test adequately.

TABLE 1 | Five sets of sensitivity, specificity, and predictive values demonstrating differing patterns of results.

Research domain/researchers            Sensitivity (%)  Specificity (%)  PPV (%)  NPV (%)
Shoulder pain (6)                      96               7                15       90
Carpal tunnel syndrome (12)            5                98               10       96
Peripheral artery disease (13)         45               100              100      53
Aspiration risk following stroke (14)  47               86               50       85
Peripheral artery disease (15)         71               79               72       77

PPV, positive predictive value; NPV, negative predictive value.

Because sensitivity seems often to be confused with PPV, and specificity seems often to be confused with NPV, unambiguous definitions for each pair are necessary. These are provided below.

DEFINITIONS

Defining Sensitivity and PPV

The sensitivity of a screening test can be described in a variety of ways, typically such as sensitivity being the ability of a screening test to detect a true positive, being based on the true positive rate, reflecting a test’s ability to correctly identify all people who have a condition, or, if 100%, identifying all people with a condition of interest by those people testing positive on the test.

Each of these definitions is incontestably accurate, but they can all be easily misinterpreted because none of them sufficiently emphasizes an important distinction between two essentially different contexts. In the first context, only those people who obtain positive results on the reference standard are assessed in terms of whether they obtained positive or negative results on the screening test. This determines the test’s sensitivity. In the second context, the focus changes from people who tested positive on the reference standard to people who tested positive on the screening test. Here, an attempt is made to establish whether people who tested positive on the screening test do or do not actually have the condition of interest. This refers to the screening test’s PPV. Expressed differently, the first context is the screening test being assessed on the basis of its performance relative to a reference standard, which focuses on whether the foundations of the screening test are satisfactory; the second context is people being assessed on the basis of a screening test, which focuses on the practical usefulness of the test in clinical practice.

By way of further explanation, sensitivity is based solely on the cells labeled a and c in Figure 1 and, therefore, requires that all people in the analysis are diagnosed according to the reference standard as definitely having the target condition. The determination of sensitivity does not take into account any people who, according to the reference standard, do not have the condition of interest (who are in cells b and d). Confidence in a screening test’s ability, when it returns a positive result, to differentiate successfully between people who have a condition and those who do not, is another matter. As indicated above, it is the test’s PPV, and is based on the cells labeled a and b, which refer solely to the accuracy of positive results produced by the screening test. Those cells do not include any people who, according to results from the screening test, do not have the condition (who are in cells c and d).

Therefore, a clear definition of sensitivity, with italics for supportive emphasis, would be a screening test’s probability of correctly identifying, solely from among people who are known to have a condition, all those who do indeed have that condition (i.e., identifying true positives), and, at the same time, not categorizing other people as not having the condition when in fact they do have it (i.e., avoiding false negatives). Less elaborated, but perhaps also less helpfully explicit, definitions are possible, for example, that sensitivity is the proportion of people with a condition who are correctly identified by a screening test as indeed having that condition.

It follows that a clear definition of PPV would be a screening test’s probability, when returning a positive result, of correctly identifying, from among people who might or might not have a condition, all people who do actually have that condition (i.e., identifying true positives), and, at the same time, not categorizing some people as having the condition when in fact they do not (i.e., avoiding false positives). Expressed differently and more economically, PPV is the probability that people with a positive screening test result indeed do have the condition of interest.

Inspection of Figure 1 supports the above definitions and those that are provided within the next subsection.

Defining Specificity and NPV

The specificity of a test is defined in a variety of ways, typically such as specificity being the ability of a screening test to detect a true negative, being based on the true negative rate, correctly identifying people who do not have a condition, or, if 100%, identifying all patients who do not have the condition of interest by those people testing negative on the test.

As with the definitions often offered for sensitivity, these definitions are accurate but can easily be misinterpreted because they do not sufficiently indicate the distinction between two different contexts that parallel those identified for sensitivity. Specificity is based on the cells labeled b and d in Figure 1 and, therefore, requires that all the people in the analysis are diagnosed, according to a reference standard, as not having the target condition. Specificity does not take into account any people who, according to the reference standard, do have the condition (as pointed out
above, those people, in the cells labeled a and c, were taken into account when determining sensitivity). Confidence in a screening test’s ability, when it returns a negative result, to differentiate between people who have a condition and those who do not, is another matter. That is the test’s NPV and is based on the cells labeled c and d, which refer solely to the accuracy of negative results produced by the screening test. Those cells do not include any people who, according to the screening test, do have the condition (who are located in cells a and b).

Therefore, a clear definition of specificity, again with italics for supportive emphasis, would be a screening test’s probability of correctly identifying, solely from among people who are known not to have a condition, all those who do indeed not have that condition (i.e., identifying true negatives), and, at the same time, not categorizing some people as having the condition when in fact they do not have it (i.e., avoiding false positives). Less elaborated, but perhaps also less helpfully explicit, definitions are possible, for example, that specificity is the proportion of people without a condition who are correctly identified by a screening test as indeed not having the condition.

It follows that a clear definition of NPV would be a screening test’s probability, when returning a negative result, of correctly identifying, from among people who might or might not have a condition, all people who indeed do not have that condition (i.e., identifying true negatives), and, at the same time, not categorizing some people as not having the condition when in fact they do (i.e., avoiding false negatives). Expressed differently and more economically, NPV is the probability that people with a negative screening test result indeed do not have the condition of interest.

Summary Regarding Definitions

Sensitivity and specificity are concerned with the accuracy of a screening test relative to a reference standard. The focus is the adequacy of the screening test, or its fundamental “credentials.” The main question is: do the results on the screening test correspond to the results on the reference standard? Here, the screening test is being assessed. By contrast, for PPV and NPV, people are being assessed. There are two main questions of relevance in that second situation. First, if a person’s screening test yields a positive result, what is the probability that that person has the relevant condition (PPV)? Second, if the screening test yields a negative result, what is the probability that the person does not have the condition (NPV)?

In order to sharpen the distinction, it could be said that sensitivity and specificity indicate the effectiveness of a test with respect to a trusted “outside” referent, while PPV and NPV indicate the effectiveness of a test for categorizing people as having or not having a target condition. More precisely, sensitivity and specificity indicate the concordance of a test with respect to a chosen referent, while PPV and NPV, respectively, indicate the likelihood that a test can successfully identify whether people do or do not have a target condition, based on their test results.

The two contexts (i.e., the context that relates to sensitivity and specificity, versus the context that relates to the two predictive values) should not be confused with each other. Of particular importance, although it is desirable to have tests with high sensitivity and specificity, the values for those two metrics should not be relied on when making decisions about individual people in screening situations. In that second context, use of PPVs and NPVs is more appropriate. The lack of correspondence between sensitivity, specificity, and predictive values is illustrated by the inconsistent pattern of entries in Table 1 and should become more obvious in the next section.

USES AND MISUSES OF SENSITIVITY AND SPECIFICITY

Because the pairs of categories into which people are placed when sensitivity and specificity values are calculated are not the same as the pairs of categories that pertain in a screening context, there are not only important distinctions between sensitivity and PPV, and between specificity and NPV, but there are also distinct limitations on sensitivity and specificity for screening purposes. Akobeng [(9), p. 340] has gone so far as to write that “both sensitivity and specificity … are of no practical use when it comes to helping the clinician estimate the probability of disease in individual patients.”

Sensitivity does not provide the basis for informed decisions following positive screening test results because those positive test results could contain many false positive outcomes that appear in the cell labeled b in Figure 1. Those outcomes are ignored in determining sensitivity (cells a and c are used for determining sensitivity). Therefore, of itself, a positive result on a screening test, even if that test has high sensitivity, is not at all useful for definitely regarding a condition as being present in a particular person. Conversely, specificity does not provide an accurate indication about a negative screening test result because negative outcomes from a screening test could contain many false negative results that appear in the cell labeled c, which are ignored in determining specificity (cells b and d are used for determining specificity). Therefore, of itself, a negative result on a screening test with high specificity is not at all useful for definitely ruling out disease in a particular person.

Failing to appreciate the above major constraints on sensitivity and specificity arises from what is known in formal logic as confusion of the inverse (16). An example of this with regard to sensitivity, consciously chosen in a form that makes the problem clear, would be converting the logical proposition “This animal is a dog; therefore it is likely to have four legs” into the illogical proposition “This animal has four legs; therefore it is likely to be a dog.” A parallel confusion of the inverse can occur with specificity. An example of this would be converting the logical proposition “This person is not a young adult; therefore this person is not likely to be a university undergraduate” into the illogical proposition “This person is not a university undergraduate; therefore this person is not likely to be a young adult.”

These examples demonstrate the flaws in believing that a positive result on a highly sensitive test indicates the presence of a condition and that a negative result on a highly specific test indicates the absence of a condition. Instead, it should be emphasized that a highly sensitive test, when yielding a positive result, by no means indicates that a condition is present (many animals with
four legs are not dogs), and a highly specific test, when yielding a negative result, by no means indicates that a condition is absent (many young people are not university undergraduates).

Despite the above reservations concerning sensitivity and specificity in a screening situation, sensitivity and specificity can be useful in two circumstances but only if they are extremely high. First, because a highly sensitive screening test is unlikely to produce false negative outcomes (there will be few entries in cell c of Figure 1), people who test negative on that kind of screening test (i.e., a test with high sensitivity) are very unlikely to have the target condition. Expressed differently, high sensitivity permits people to be confidently regarded as not having a condition if their screening test yields a negative result. They can be “ruled out.” This has led to the mnemonic snout (sensitive, negative, out; it is useful to regard the n in snout as referring to the n in sensitive as well as the n in negative) concerning high sensitivity in screening.

Second, because a highly specific screening test is unlikely to produce false positive results (there will be few entries in cell b in Figure 1), people are very unlikely to be categorized as having a condition if they indeed do not have it. Expressed differently, high specificity permits people to be confidently regarded as having a condition if their screening test yields a positive result. They can be “ruled in,” and, thus, the mnemonic spin (specific, positive, in; it is useful to regard the p in spin as referring to the p in specific as well as the p in positive) concerning high specificity in screening.

The mnemonics snout and spin, it must be emphasized, pertain only when sensitivity and specificity are high. Their pliability, therefore, has some strong limitations. Furthermore, these mnemonics are applied in a way that might seem counterintuitive. A screening test with high sensitivity is not necessarily useful for “picking things up.” It is useful only for deciding that a negative screening test outcome is so unusual that it strongly indicates the absence of the target condition. Conversely, a screening test with high specificity is not so “choosy” that it is effective in ignoring a condition if that condition is not present; rather, a highly specific test is useful only for deciding that a positive screening test outcome is so unusual that it strongly indicates the presence of the target condition. In addition, Pewsner et al. (2) have pointed out that effective use of snout and spin is “eroded” when highly sensitive tests are not sufficiently specific or highly specific tests are not sufficiently sensitive, and for many screening tests, unfortunately, either sensitivity or specificity is low despite the other being high, or neither sensitivity nor specificity is high. As a consequence, both sensitivity and specificity remain unhelpful for making decisions about individual people in most screening contexts, and PPV and NPV should be retained as the metrics of choice in those contexts.

ASSESSING DESIRABLE PREDICTIVE VALUES AND CONSEQUENCES FOR SENSITIVITY AND SPECIFICITY

When assessing the desirability of specific PPVs and NPVs, a variety of costs and benefits need to be considered (1). These include the immediate and long-term burdens on the healthcare system, the treatability of a particular condition, and the psychological effect on clients as well as clients’ health status. Considerations might also include over- versus under-application of diagnostic procedures as well as the possibility of premature versus inappropriately delayed application of diagnostic procedures. Input from clinicians and policymakers is likely to be particularly informative in any deliberations.

Decisions about desirable PPVs and NPVs can be approached from two related and complementary, but different, directions. One approach involves the extent to which true positive and true negative results are desirable on a screening test. The other approach involves the extent to which false positive and false negative results are tolerable or even acceptable.

A high PPV is desirable, implying that false positive outcomes are minimized, under a variety of circumstances. Some of these are when, relative to potential benefits, the costs (including costs associated with finances, time, and personnel for health services, as well as inconvenience, discomfort, and anxiety for clients) are high. A high PPV, with its concomitant few false positive screening test results, is also desirable when the risk of harm from follow-up diagnosis or therapy (including hemorrhaging and infection) is high despite the benefits from treatment also being high, or when the target condition is not life-threatening or progresses slowly. Under these circumstances, false positive outcomes can be associated with overtreatment, unnecessary costs, and the prospect of iatrogenic complications. False positive outcomes may also be annoying and distressing for both the providers and the recipients of health care.

A moderate PPV (with its greater proportion of false positive screening test outcomes) might be acceptable under a number of circumstances, most of which are the opposite of the situations in which a high PPV is desirable. For example, a certain percentage of false positive outcomes might not be objectionable if follow-up tests are inexpensive, easily and quickly performed, and not stressful for clients. In addition, false positive screening outcomes might be quite permissible if no harm is likely to be done to clients in protecting them against a target condition even if that condition is not present. For example, people who are mistakenly told that they have peripheral artery disease, despite not actually having it, are likely to benefit from adopting advice to exercise appropriately, improve their diet, and discontinue smoking.

A high NPV is desirable, implying that false negatives are minimized, under a different set of circumstances. Some of these are a condition being serious, largely asymptomatic, or contagious, or if treatment for a condition is advisable early in its course, particularly if the condition can be treated effectively and is likely to progress quickly. Under these circumstances, it would be highly undesirable if a screening test indicated that people did not have a condition when in fact they did. A moderate NPV (with its greater proportion of false negative screening test outcomes) might be acceptable under other circumstances, however, and most of those circumstances are the opposite of those that make a high NPV desirable. For example, the false negative outcomes associated with moderate NPVs might not be problematic if the target condition is not serious or contagious, or if a condition does not progress quickly or benefit from early
treatment. Moderate NPVs might also be acceptable if diagnosis at low levels of a condition is known to be ambiguous and subsequent screening tests can easily be scheduled and performed, or if, given time, a condition is likely to resolve itself satisfactorily without treatment.

If, for a variety of reasons, the PPVs and NPVs on a screening test were deemed to be either too high or too low, they could be adjusted by altering the stringency of the screening test (for example, by raising or lowering cutpoints on a continuous variable or by changing the components that comprise a screening test), by altering the sample of people on whom the analyses were based (for example, by identifying people who are regarded as having more pertinent demographic or health status variables), or by altering the nature of the reference standard. Those strategies would almost inevitably result in changes to the sensitivity and specificity values, and those revised values would simply need to be reported as applying to the particular new level of stringency on the screening test, the applicable population, and the reference standard when that test was being described. This reveals, yet again, that pliability can be associated with sensitivity, specificity, and predictive values.

THE IMPORTANCE OF FULL DISCLOSURE OF INFORMATION IN RESEARCH

When describing screening tests, many researchers provide information about their reference standard; the prevalence of the target condition in their research sample(s); the criteria that had been used to indicate presence or absence of a condition according to the screening test; and the sensitivity, specificity, and predictive values they obtained (6, 7, 15, 17, 18). The research results are not always impressive or what the researchers might have hoped for, but at least it is possible to draw informed conclusions from those results.

Sometimes only partial information is provided, and that limits the usefulness of research. For example, in a systematic review concerning the toe–brachial index in screening for peripheral artery disease, Tehan et al. (19) were evidently unable to find predictive values in so many of the final seven studies they reviewed that they did not provide any information about those values, despite those metrics being of fundamental importance for screening.

In one of the more informative articles reviewed by Tehan et al. (19), Okamoto et al. (13) did include information about sensitivity, specificity, and predictive values of several screening tests. However, they provided insufficient interpretation at times. For example, they reported an unusually low sensitivity value of 45.2% for the toe–brachial index in detecting peripheral artery disease. That value occurred in the presence of 100% specificity, indicating that the cutoff point might have been too stringent and that sensitivity had been sacrificed in the interest of obtaining high specificity, but the researchers did not draw attention to that or provide any explanation for their strategy. Information in a receiver operating characteristic analysis within their article suggests that more appropriate sensitivity and specificity values would have both been approximately 73% and therefore, incidentally, similar to the values obtained by other researchers (15, 20).

Deficiencies in provision of information can be even more problematic. In a recently published article, Jönelid et al. (21) investigated the usefulness of the ankle–brachial index (ABI) for identifying polyvascular disease. Although they reported a specificity of 92.4% and a PPV of 68.4%, they did not provide results concerning either sensitivity or NPV. From information in their article, those unrevealed values can be calculated as both being 100%. That these values are so high in a screening context raises suspicions. When following those suspicions through, it becomes evident that the researchers used the ABI as a component of the reference standard as well as being the sole variable that comprised the screening test. Failure to sufficiently disclose this circular situation (the inevitability of something being highly related to something that is partly itself) permitted the authors to claim that the “ABI is a useful … measurement that appears predictive of widespread atherosclerosis” in their patients. That this statement is invalid becomes apparent only through an awareness of how researchers’ data should conform to entries in Figure 1 and how reference standards and screening tests are conceptualized.

The above examples illustrate the importance of research consumers being provided with complete information when screening tests are being described, and consumers being able to interpret that information appropriately, sometimes with at least a modicum of skepticism. Having a healthy level of skepticism, as well as clarity concerning the nature and appropriate interpretations and uses of sensitivity, specificity, and predictive values, can be seen as important for educators, researchers, and clinicians in public health.

SUMMARY

Sensitivity and specificity should be emphasized as having different origins, and different purposes, from PPVs and NPVs, and all four metrics should be regarded as important when describing and assessing a screening test’s adequacy and usefulness.

Researchers and clinicians should avoid confusion of the inverse when considering the application of sensitivity and specificity to screening tests.

Predictive values are more relevant than are sensitivity and specificity when people are being screened.

Predictive values on screening tests need to be determined on the basis of careful clinical deliberation and might be used in a reverse process that would result in adjustments to sensitivity and specificity values.

Researchers should provide information about sensitivity, specificity, and predictive values when describing screening test results, and that information should include how those metrics were derived as well as appropriate interpretations.

AUTHOR CONTRIBUTIONS

RT conceived of, conducted the research for, and wrote the complete manuscript.

ACKNOWLEDGMENTS

Rod Pope and Peta Tehan provided valuable feedback on earlier drafts of this manuscript. Professor Pope also drew my attention to literature that I had not been aware of, and Dr. Tehan generously shared computer output of receiver operating characteristic analyses that provided confirming insights about sensitivity and specificity.