Lesson 4: Multiple Testing
Key Learning Goals for this Lesson:
Understand the multiple testing problem
Interpret a histogram of p-values from independent tests
Be familiar with the Bonferroni method
Understand false discovery and non-discovery rates
Be familiar with some FDR estimation methods
An event that is rare when we have only one opportunity to observe it can become quite common when we observe thousands of events. For example, when you roll 2 fair dice, getting double sixes happens only about 1 out of 36 times. But if you roll 3600 times, you expect about 100 rolls with double sixes.
The p-value is the probability of obtaining a result at least as extreme as the observed result if the null
hypothesis is true. Suppose we accept p < 0.05 as "extreme". If we do 10,000 (independent) tests, and
all the null hypotheses are true, we expect about 5% of the tests (i.e. about 500) to have p < 0.05.
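A minimal simulation sketch makes this arithmetic concrete. It relies on the fact (discussed in Section 4.4) that under a true null hypothesis with a continuous test statistic, the p-value is uniform on (0,1), so we can draw the p-values directly; the seed and settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(555)    # arbitrary seed for reproducibility
m = 10_000                          # independent tests, all null

# Under a true null with a continuous test statistic, the p-value is
# Uniform(0, 1), so we can simulate the p-values directly.
p = rng.uniform(size=m)

print((p < 0.05).sum())             # about 500 "significant" tests by chance alone
```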
This is a huge problem in high throughput analysis, because we are usually doing thousands of tests. We do not want to waste our time following up false positive hypotheses. But if we use conventional p-value cut-offs, this will be inevitable.
This chapter discusses some approaches to correcting our inference methods when we are doing
multiple tests.
4.1 - Mistakes in Statistical Testing
The table below shows the possible correct and incorrect outcomes for a single test:
              Not Significant     Significant
H0 True       correct             Type I error
H0 False      Type II error       correct
A type I error is called a false discovery. A type II error is called a false non-discovery. Generally, false discoveries are considered to be more serious than false non-discoveries, although this is not always the case. Investigators usually follow up on discoveries, so false discoveries can lead to misleading and expensive follow-up studies. But non-discoveries are usually abandoned, so false non-discoveries can lead to missing potentially important results.
In high throughput studies we typically test each of our m features individually, leading to the following
table:
              Not Significant     Significant     Total
H0 True       U                   V               m0
H0 False      T                   S               m − m0
Total         W                   R               m
The total number of errors is T + V.
The false discovery proportion FDP is V / R.
The false discovery rate FDR is the expected value of V / R, given that R ≠ 0.
Similarly, the false non-discovery proportion FNP is T / W.
The false non-discovery rate is the expected value of T / W, given that W ≠ 0.
π0 = m0 / m is the proportion of null tests.
Before 1995, the objective of multiple testing correction was to control Pr(V > 0), the so-called Family-Wise Error Rate (FWER). The problem is that as m0 grows, so does Pr(V > 0) for any fixed cut-off used to declare statistical significance.
4.2 - Controlling Family-wise Error Rate
Pr(V > 0) is called the family-wise error rate or FWER. It is easy to show that if you declare tests
significant for p < α then FWER ≤ min(m0 α, 1).
The most commonly used method which controls FWER at level α is called Bonferroni's method. It rejects the null hypothesis when p < α/m. (It would be better to use m0, but we don't know what it is - more on that later.)
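A quick simulation sketch of this guarantee, with illustrative settings: under the global null, almost every experiment produces at least one p < 0.05, while the Bonferroni cut-off keeps the chance of any false discovery near α.

```python
import numpy as np

rng = np.random.default_rng(555)
m, alpha, n_sim = 1000, 0.05, 2000

# n_sim experiments, each consisting of m independent null tests.
p = rng.uniform(size=(n_sim, m))

# FWER = Pr(at least one false discovery in an experiment).
print(np.mean((p < alpha).any(axis=1)))      # near 1.0 with no correction
print(np.mean((p < alpha / m).any(axis=1)))  # near alpha with Bonferroni
```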
The Bonferroni method is guaranteed to control FWER, but it has a big problem. It greatly reduces your
power to detect real differences. For example, suppose the effect size is 2 and you are doing a t-test,
rejecting for p < 0.05. With 10 observations per group, the power is 99%. Now suppose you have 1000
tests, and use the Bonferroni method. That means that to reject, we need p < 0.00005. The power is
now only 29%. If you have 10 thousand tests (which is small for genomics studies) the power is only
10%.
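The power numbers above can be reproduced approximately from the noncentral t distribution. This is a sketch assuming "effect size 2" means a 2-SD difference in means between two equal-sized groups; it is not the exact calculation used for the text's figures.

```python
import numpy as np
from scipy import stats

def two_sample_t_power(d, n, alpha):
    """Power of a two-sided two-sample t-test when the true difference
    in means is d standard deviations and each group has n observations."""
    df = 2 * n - 2
    ncp = d * np.sqrt(n / 2)                    # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # P(|T'| > t_crit) under the noncentral t distribution
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

for m in (1, 1000, 10_000):
    print(m, round(two_sample_t_power(2, 10, 0.05 / m), 2))
# prints roughly 0.99, 0.29, and 0.10, in line with the text
```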
Sometimes the "Bonferroni-adjusted p-values are reported". They are just:
pb = min(mp, 1) .
Another simple method, more powerful but less popular, uses the sorted p-values:
p(1) ≤ p(2) ≤ ⋯ ≤ p(m)
Holm showed that the FWER is controlled with the following algorithm: compare p(i) with α/(m − i + 1), starting from i = 1, and reject until the first i for which p(i) is greater.
The most significant test must therefore pass the Bonferroni criterion. However, if it is significant, the next most significant is tested at a less stringent level. Heuristically, after rejecting the most significant test, we conclude that m0 ≤ m − 1 and use m − 1 for the next correction, and so on sequentially.
The Holm method is more powerful than the Bonferroni method, but it is still not very powerful. We can also compute "Holm-adjusted p-values" ph(i) = min((m − i + 1)p(i), 1).
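A sketch of the step-down rule and the adjusted p-values as defined here. (Note that standard software such as R's p.adjust additionally enforces monotonicity of the adjusted values with a running maximum; this sketch uses the simpler formula from the text.)

```python
import numpy as np

def holm(p, alpha=0.05):
    """Holm step-down: walk through the sorted p-values, comparing
    p_(i) with alpha / (m - i + 1), and stop at the first failure.
    Also returns the Holm-adjusted p-values from the text."""
    p = np.asarray(p)
    m = len(p)
    order = np.argsort(p)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):          # rank 0 corresponds to i = 1
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    ranks = np.argsort(order)                   # 0-based rank of each p-value
    adjusted = np.minimum((m - ranks) * p, 1)   # min((m - i + 1) * p_(i), 1)
    return reject, adjusted

print(holm([0.001, 0.01, 0.03, 0.5]))
# rejects the first two: 0.001 <= 0.05/4 and 0.01 <= 0.05/3, but 0.03 > 0.05/2
```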
4.3 - 1995: Two Huge Steps for Biological Inference
In 1995, the first microarray was "spotted" and hybridized, starting the "omics" revolution in biology.
Also in 1995, independently, Benjamini and Hochberg conceived of the idea of False Discovery Rate
or FDR. Their idea was that for large m, we do not expect all of the null hypotheses to be true, and so
we do not want to stringently control Pr(V > 0). Instead, we want to control the expected proportion of our discoveries that are false, given that we make at least one discovery; that is, FDR = E(V/R | R > 0).
Benjamini and Hochberg proved that if q is the target FDR, then rejecting while p(i) ≤ qi/m controls the FDR at level q. [1]
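A minimal sketch of this step-up rule: find the largest i with p(i) ≤ qi/m and reject the i smallest p-values.

```python
import numpy as np

def bh_reject(p, q=0.05):
    """Benjamini-Hochberg step-up: find the largest i with
    p_(i) <= q * i / m and reject the i smallest p-values."""
    p = np.asarray(p)
    m = len(p)
    order = np.argsort(p)
    passes = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passes.any():
        k = np.nonzero(passes)[0].max()   # 0-based index of the largest passing i
        reject[order[:k + 1]] = True
    return reject
```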
The Benjamini and Hochberg method is used extensively in bioinformatics and other "big data"
disciplines. It requires the tests to be independent. This is seldom true in "omics" data, for which our features may be gene expressions or proteins that occur in pathways, inducing correlated behavior. However, in a follow-up paper, Benjamini and Yekutieli showed that the procedure also controls FDR under certain types of positive correlation.
The BH procedure may not work so well for highly correlated data such as SNP frequencies for SNPs
that are densely located. Considerable work has gone into developing FDR controlling procedures for
highly correlated data such as dense SNPs and neuroimaging data. The Benjamini and Yekutieli (BY)
method [2] controls FDR for any correlation structure, but is much less powerful than the BH method.
Although the BH procedure is meant to control FDR, not the FWER, "BH-adjusted p-values" computed
as pBH(i) = min(mp(i)/i, 1) are often used as adjusted p-values.
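A sketch of these adjusted p-values. The running minimum from the largest p-value down is not in the formula above, but it is what standard implementations (for example, R's p.adjust with method "BH") apply so that the adjusted values stay monotone in the p-values.

```python
import numpy as np

def bh_adjust(p):
    """BH-adjusted p-values: m * p_(i) / i capped at 1, then a running
    minimum from the largest p-value down to keep the values monotone."""
    p = np.asarray(p)
    m = len(p)
    order = np.argsort(p)
    raw = np.minimum(m * p[order] / np.arange(1, m + 1), 1)
    monotone = np.minimum.accumulate(raw[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = monotone        # return to the original feature order
    return adjusted
```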
The BH procedure is more powerful than the Holm procedure.
All of the procedures could be made more powerful because we really only need to adjust for the null
tests. If we only knew m0 we could adjust for it instead of m, giving us larger cut-off values. Fortunately,
it turns out that when we have done many tests, it is fairly easy to estimate m0 . There are many
estimation methods.
FDR controlling or estimation methods that estimate m0 and use it in place of m, are called adaptive
FDR methods.
[1] Benjamini, Yoav, and Yosef Hochberg. "Controlling the false discovery rate: a practical and powerful approach to multiple testing." Journal of the Royal Statistical Society: Series B 57.1 (1995): 289-300. http://www.jstor.org/stable/2346101 [1]
[2] Benjamini, Yoav, and Daniel Yekutieli. "The control of the false discovery rate in multiple testing under dependency." Annals of Statistics 29.4 (2001): 1165-1188. http://projecteuclid.org/download/pdf_1/euclid.aos/1013699998 [2]
4.4 - Estimating m0 (or π0)
The proportion of hypotheses that are truly null is an important parameter in many situations. For example, when comparing normal and diseased tissues we might hypothesize that a greater number of genes change response in the normal tissue than in the diseased tissue when challenged by a toxin.
For multiple testing adjustments, the proportion of null hypotheses among the m tests is important, because we need only adjust for the m0 = π0 m tests which are actually null. The other m − m0 tests cannot contribute any false discoveries.
Estimation of π0 starts with the observation that when the test statistic is continuous, the distribution of p-values for the null tests is uniform on the interval (0,1). This is because for any significance level α, the proportion of tests with p-value less than α is α. By subtraction, we can see that the proportion of tests in any interval inside (0,1) is just the length of the interval.
Below we see a histogram of p-values from 10,000 simulated two-sample t-tests. For each test, both samples have the same mean and SD, so the null hypothesis is true. We can see the distribution is not exactly uniform, but it is pretty close: these are "observed" values and so deviate from the ideal histogram. Below this histogram is a histogram in which the two samples have different means, with the same power to detect the difference in every test. As expected, the p-values are skewed toward small values. The percentage of p-values in this histogram less than α is the power of the test when significance is declared at level α.
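A sketch that regenerates p-values like the ones in these two histograms; the seed, sample sizes, and 2-SD mean shift are illustrative, not the exact settings used for the figures.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(555)
m, n = 10_000, 10

# Null tests: both samples from N(0, 1), so the p-values are roughly uniform.
a = rng.normal(size=(m, n))
b = rng.normal(size=(m, n))
p_null = stats.ttest_ind(a, b, axis=1).pvalue

# Non-null tests: shift one sample's mean by 2 SDs, so every test has the
# same power and the p-values pile up near 0.
b_shift = rng.normal(loc=2, size=(m, n))
p_alt = stats.ttest_ind(a, b_shift, axis=1).pvalue

print(np.histogram(p_null, bins=10, range=(0, 1))[0])  # roughly equal bin counts
print(np.histogram(p_alt, bins=10, range=(0, 1))[0])   # most counts in the first bin
```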
Notice that in the histogram of non-null p-values, there are very few p-values bigger than 0.2 and almost
none bigger than 0.5.
Of course when we actually test many features, some of which have the same means in both samples
and some of which do not, we get a mix of histograms of this type. We should get a histogram
something like the one below. In each bin of the histogram, we do not know which tests come from the
truly null hypotheses, and which from the truly non-null. But we do know that the large p-values come
mostly from the null distribution, and that the histogram should be fairly flat where most of the p-values
come from the null. The red line shows an estimate of what the part of the histogram coming from the
null should look like (although it might be a bit lower than it should be, comparing with the top histogram
above).
There are several estimators of π0 based on this picture. Some try to estimate where the histogram
flattens out. The two simplest are Storey's method and the Pounds and Cheng method.
The Pounds and Cheng [1] method is based on noting that the expected average of the null p-values is 0.5. They then assume that all of the non-null p-values are exactly 0. Then π̂0 = 2 × average(p-value), because we expect the average p-value to be about π0 × 0.5 + (1 − π0) × 0 = π0/2.
Storey's method [2] uses the histogram directly, assuming that all p-values bigger than some cut-off λ come from the null distribution. Then we expect m0(1 − λ) of the tests to have p-value greater than λ. We then count and get some number mλ of p-values in this region, giving us m̂0 = mλ/(1 − λ) or π̂0 = mλ/(m(1 − λ)). Sophisticated implementations of the method estimate a value for λ, but λ = 0.5 works quite well in most cases.
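Both estimators are only a few lines of code; a minimal sketch (capping each estimate at 1, since π0 is a proportion):

```python
import numpy as np

def pi0_pounds_cheng(p):
    """Pounds-Cheng: null p-values average 0.5 and non-null p-values are
    treated as 0, so E(mean p) ~ pi0 / 2 and pi0_hat = 2 * mean(p)."""
    return min(2 * np.mean(p), 1.0)

def pi0_storey(p, lam=0.5):
    """Storey: assume all p-values above lambda are null; a fraction
    (1 - lambda) of null p-values lands there, so
    pi0_hat = #{p > lambda} / (m * (1 - lambda))."""
    p = np.asarray(p)
    return min(np.mean(p > lam) / (1 - lam), 1.0)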
[1] Pounds, Stan, and Cheng Cheng. "Improving false discovery rate estimation." Bioinformatics 20.11 (2004): 1737-1745. http://bioinformatics.oxfordjournals.org/content/20/11/1737.full.pdf [4]
[2] Storey, John D. "A direct approach to false discovery rates." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64.3 (2002): 479-498. http://www.genomine.org/papers/directfdr.pdf [3]
4.5 - q-Values
Storey's method also leads to a direct estimate of FDP. If we reject at level α, we expect the number of false discoveries to be αm0. So the estimate of FDP is αm̂0/R.
This leads directly to the Storey q-value [1], which is often interpreted as either an FDR-adjusted p-value or FDP(p), where p is any observed p-value in the experiment.
We start by sorting the p-values as we do for the BH or Holm procedures.
Note that if we reject for p ≤ p(i) then the total number of rejections will be at least i (with equality unless two or more of the p-values are equal to p(i)). Let R(α) be the number of rejections when we reject for all p ≤ α. Then define the q-values, working from the largest p-value down:
q(m) = min( p(m) m̂0 / R(p(m)), 1 )
q(i) = min( q(i+1), p(i) m̂0 / R(p(i)) ) for i = m − 1, …, 1
The running minimum keeps the q-values monotone in the p-values, so at any q-value cut-off we reject the most significant features first.
This associates a q-value with each feature, which estimates the FDP if you reject the null hypothesis
for this feature and all features which are this significant or more. Often we pick a cut-off for the q-value
and reject the null hypothesis for all features with q-value less than or equal to our cut-off.
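A sketch of this computation; pi0 would come from one of the estimators in Section 4.4, and setting pi0 = 1 reduces the q-values to BH-adjusted p-values.

```python
import numpy as np

def q_values(p, pi0=1.0):
    """q-values as above: m0_hat * p_(i) / R(p_(i)) for the sorted
    p-values, then a running minimum from the largest p-value down."""
    p = np.asarray(p)
    m = len(p)
    order = np.argsort(p)
    raw = pi0 * m * p[order] / np.arange(1, m + 1)   # m0_hat * p_(i) / i
    q = np.minimum(np.minimum.accumulate(raw[::-1])[::-1], 1)
    out = np.empty(m)
    out[order] = q                                   # back to feature order
    return out
```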
[1] Storey, John D. "The positive false discovery rate: a Bayesian interpretation and the q-value." Annals of Statistics 31.6 (2003): 2013-2035. http://projecteuclid.org/download/pdf_1/euclid.aos/1074290335 [5]
4.6 - Using the Histogram of p-values
We have seen that the histogram of all the p-values of our features plays an important role in estimating
FDR. This suggests that we should plot our p-values.
Below is a set of plots from a microarray experiment with multiple treatments. A test was done for each pair of treatments, leading to a large number of pairs, each giving a p-value for each feature. The p-values for each pair of treatments are shown below. Notice that they all have the expected shape, and so the use of q-values is appropriate.
Sometimes our tests do NOT have the expected distribution. Below is the histogram of p-values for a
set of p-values from an experiment using an antigen microarray. The histogram is very informative - it
tells me that the test statistic does not have the assumed null distribution, and so the computed p-values
are not valid. After looking at the data with the microarray provider, we concluded that some of the
statistical analysis steps were not appropriate. Another statistical analysis pipeline will need to be
developed.
I always look at the histogram of p-values before interpreting the p-values, computing q-values, or estimating π0. There are many reasons that it might not have the ideal shape. If the data are counts (like sequencing data), the histogram has a different characteristic shape, which we will cover when we discuss sequencing data. If the data are intensities, a hump at low p (rather than a peak near p = 0) might indicate correlation among the tests, due to strong association of the features. Another possibility is that the test statistic does not have the assumed null distribution. For example, if a block design was used but the blocks were not accounted for in the statistical analysis, the degrees of freedom of the test statistic will be incorrect. Another possibility is that the data are highly skewed and the sample size is too small for the t-statistic to have a t-distribution.
Source URL: https://onlinecourses.science.psu.edu/stat555/node/56
Links:
[1] http://www.jstor.org/stable/2346101
[2] http://projecteuclid.org/download/pdf_1/euclid.aos/1013699998
[3] http://www.genomine.org/papers/directfdr.pdf
[4] http://bioinformatics.oxfordjournals.org/content/20/11/1737.full.pdf
[5] http://projecteuclid.org/download/pdf_1/euclid.aos/1074290335