Multiplicity
Marcio Diniz | Michael Luu
Cedars Sinai Medical Center
08 November, 2022
Jelly beans
No link between jelly beans and acne (p > 0.05).
Jelly beans
No link between purple, brown, pink, blue and teal jelly beans and acne (p > 0.05).
Jelly beans
No link between salmon, red, turquoise, magenta and yellow jelly beans and acne (p > 0.05).
Jelly beans
No link between gray, tan, cyan and mauve jelly beans and acne (p > 0.05). There is a link between green jelly beans and acne (p < 0.05)!
Jelly beans
No link between beige, lilac, black, peach and orange jelly beans and acne (p > 0.05).
Jelly beans
In 20 comparisons, only one was statistically significant!
Jelly beans
Now, we can reduce acne prevalence!
Multiple Comparisons
Basic Science
- Compare mean tumor volume of nude mice between several treatment groups to control;
- Compare mean tumor volume for a given group at different time points;
- Compare gene expression levels of a large number of genes between treatment and control.
Clinical Trials
- Multiple endpoints: compare treatment response, disease progression, quality of life between treatment and control;
- Many treatments: compare each treatment arm to control and some pairwise comparisons;
- Multiple looks: Interim analysis of data at pre-specified time points;
- Sub-population: Testing for treatment effects for pre-specified subgroups of interest.
Why do we need multiple comparison adjustments?
Testing multiple hypotheses inflates the probability of making at least one type I error beyond the nominal level \(\alpha\);
Family-wise error rate (\(FWER\)) = P(Reject at least one true \(H_{0, i}\)) for \(i = 1, \ldots, n\) tests.
- For one test,
\[
P(\mbox{not making type I error}) = (1 - \alpha)
\]
- For two independent tests,
\[
P(\mbox{not making any type I error}) = (1 - \alpha) \times (1 - \alpha)
\]
Why do we need multiple comparison adjustments?
Then,
\[
\begin{eqnarray}
FWER &=& P(\mbox{making at least one type I error}) \\
&=& 1 - P(\mbox{not making any type I error}) \\
&=& 1 - (1 - \alpha)^2;
\end{eqnarray}
\]
Why do we need multiple comparison adjustments?
- If \(\alpha = 0.05\) with 20 independent tests,
\[
\begin{eqnarray*}
FWER = P(\mbox{making at least one type I error}) = 1 - (1 - \alpha)^{20} \approx 0.64;
\end{eqnarray*}
\]
- For perfectly correlated tests, \(FWER = \alpha\).
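For illustration, a minimal Python sketch that evaluates \(1 - (1 - \alpha)^n\) for a few numbers of independent tests:

```python
# FWER for n independent tests at significance level alpha: 1 - (1 - alpha)^n
alpha = 0.05

for n in (1, 2, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** n
    print(f"{n:2d} independent tests: FWER = {fwer:.3f}")
# With 20 tests the FWER is about 0.64, as on the slide.
```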
Family-Wise Error Rate
Controlling FWER
Any correlation structure between tests
- Bonferroni correction (1936);
- Holm (1979);
- Sidak correction (1967);
- Hochberg (1988);
- Hommel (1988).
Bonferroni correction
Procedure
- It is the most conservative correction;
- The significance level is divided by the number of comparisons;
- It does not require that the tests are independent;
- \(FWER \leq \alpha\).
Bonferroni correction
Example
- Investigators have three treatment arms and want to perform all pairwise comparisons with a family-wise error rate \(\alpha = 0.05\);
- There are 3 possible comparisons, so N = 3;
- Suppose the p-values of these three tests are \(p_1 = 0.0004\), \(p_2 = 0.03\), and \(p_3 = 0.0623\);
- Is \(p_1 < 0.05/3 = 0.0166\)?
- Is \(p_2 < 0.05/3 = 0.0166\)?
- Is \(p_3 < 0.05/3 = 0.0166\)?
Bonferroni correction
Calculating adjusted p-values
- Adjusted p-values are calculated by multiplying the p-values by the number of comparisons;
- Obtain p-values for all tests \(p_1, p_2, \ldots, p_N\);
- Then,
- \(p^*_{1} = min\{1, p_{1} \times N\}\)
- \(p^*_{2} = min\{1, p_{2} \times N\}\)
- and so on.
- In general,
- \(p^*_{k} = min\{1, p_{k} \times N\}\)
- Check if adjusted p-values \(p^*_{1}, \ldots, p^*_{N}\) are smaller than \(\alpha\).
Bonferroni correction
Example
- Investigators want to compare the expression of three genes between two groups with a family-wise error rate \(\alpha = 0.05\);
- There are 3 possible comparisons, so N = 3;
- Suppose the p-values of these three tests are \(p_1 = 0.0004\), \(p_2 = 0.03\), and \(p_3 = 0.0623\);
- The adjusted p-values are:
- \(p^*_1 = 3 \times 0.0004 = 0.0012\);
- \(p^*_2 = 3 \times 0.03 = 0.09\);
- \(p^*_3 = 3 \times 0.0623 = 0.1869\);
- Check whether adjusted p-values are less than 0.05.
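For illustration, a minimal Python sketch of the Bonferroni adjustment applied to these three p-values:

```python
# Bonferroni adjustment: p* = min(1, p * N), then compare with alpha
alpha = 0.05
pvalues = [0.0004, 0.03, 0.0623]
N = len(pvalues)

for p in pvalues:
    p_adj = min(1.0, p * N)
    print(f"p = {p:.4f} -> adjusted p = {p_adj:.4f}, reject H0: {p_adj < alpha}")
# Adjusted p-values: 0.0012, 0.0900, 0.1869 -- only the first test remains significant.
```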
Holm correction
- The Holm correction is uniformly more powerful than the Bonferroni correction;
Procedure
- Obtain p-values for all tests \(p_1, p_2, \ldots, p_N\);
- Sort p-values \(p_{(1)} < p_{(2)} < \ldots < p_{(N)}\);
- Then,
- If \(p_{(1)} > \alpha/N\), stop. No test is statistically significant. Otherwise,
- If \(p_{(2)} > \alpha/(N-1)\), stop. Only the first test is statistically significant. Otherwise,
- and so on.
- In general, if \(p_{(k)} > \alpha/(N-k+1)\), stop; only the first \(k-1\) tests are statistically significant.
- \(FWER \leq \alpha\).
Holm correction
Example
- Investigators want to compare the expression of three genes between two groups with a family-wise error rate \(\alpha = 0.05\);
- There are 3 possible comparisons, so N = 3;
- Suppose the p-values of these three tests are \(p_1 = 0.0004\), \(p_2 = 0.03\), and \(p_3 = 0.0623\);
- Is \(p_1 < 0.05/3 = 0.0166\)?
- Is \(p_2 < 0.05/2 = 0.025\)?
- Is \(p_3 < 0.05/1 = 0.05\)?
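For illustration, a minimal Python sketch of Holm's step-down rule for these p-values; the loop stops at the first p-value that misses its threshold:

```python
# Holm step-down procedure: compare p_(k) with alpha / (N - k + 1), stop at the first failure
alpha = 0.05
pvalues = sorted([0.0004, 0.03, 0.0623])
N = len(pvalues)

for k, p in enumerate(pvalues, start=1):
    threshold = alpha / (N - k + 1)
    if p >= threshold:
        print(f"p_({k}) = {p} >= {threshold:.4f}: stop, keep H0 from here on")
        break
    print(f"p_({k}) = {p} < {threshold:.4f}: reject H0")
# Output: only the first hypothesis (p = 0.0004) is rejected.
```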
Holm correction
Calculating adjusted p-values
- Obtain p-values for all tests \(p_1, p_2, \ldots, p_N\);
- Sort p-values \(p_{(1)} < p_{(2)} < \ldots < p_{(N)}\);
- Then,
- \(p^*_{(1)} = min\{1, p_{(1)} \times N\}\)
- \(p^*_{(2)} = min\{1, p_{(2)} \times (N-1)\}\)
- and so on.
- In general, \(p^*_{(k)} = min\{1, p_{(k)} \times (N-k+1)\}\)
- Check if adjusted p-values \(p^*_{(1)}, \ldots, p^*_{(N)}\) are smaller than \(\alpha\).
Holm correction
Example
- Investigators want to compare the expression of three genes between two groups with a family-wise error rate \(\alpha = 0.05\);
- There are 3 possible comparisons, so N = 3;
- Suppose the p-values of these three tests are \(p_1 = 0.0004\), \(p_2 = 0.03\), and \(p_3 = 0.0623\);
- The adjusted p-values are:
- \(p^*_1 = 3 \times 0.0004 = 0.0012\);
- \(p^*_2 = 2 \times 0.03 = 0.06\);
- \(p^*_3 = 1 \times 0.0623 = 0.0623\);
- Check whether adjusted p-values are less than 0.05.
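For illustration, a minimal Python sketch of the Holm-adjusted p-values for the same three tests; the running maximum keeps the adjusted values in non-decreasing order:

```python
# Holm adjusted p-values: p*_(k) = max over j <= k of min(1, p_(j) * (N - j + 1))
pvalues = sorted([0.0004, 0.03, 0.0623])
N = len(pvalues)

adjusted, running_max = [], 0.0
for k, p in enumerate(pvalues, start=1):
    running_max = max(running_max, min(1.0, p * (N - k + 1)))
    adjusted.append(round(running_max, 4))

print(adjusted)  # [0.0012, 0.06, 0.0623] -- compare each with alpha = 0.05
```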
False Discovery Rate
- The association of 100 genes with cancer will be investigated
- \(H_{0, i}\): gene \(i\) does not have association with cancer for \(i = 1, \ldots, 100\);
- \(H_{1, i}\): gene \(i\) has association with cancer for \(i = 1, \ldots, 100\);
False Discovery Rate
- Only 10 genes are actually associated with cancer;
False Discovery Rate
- Only 10 genes are actually associated with cancer;
- If the test has power of 80%, then 8 out of 10 genes associated with cancer will be identified;
False Discovery Rate
- Only 10 genes are actually associated with cancer;
- If the test has power of 80%, then 8 out of 10 genes associated with cancer will be identified;
- Assuming a type I error rate \(\alpha\) of 5%, then on average 4.5 of the 90 genes not associated with cancer will be falsely identified as associated.
False Discovery Rate
- The chance that an identified gene is truly associated with cancer is about 8 out of 13 (rounding the 4.5 expected false positives up to 5), i.e., roughly 62%;
- The false discovery rate (FDR) is the expected fraction of statistically significant results (discoveries) that are actually false positives, i.e., roughly 38% here;
- The family-wise error rate is \(1 - (1 - 0.05)^{90} \approx 99\%\);
- Controlling the FWER implies controlling the FDR, but the converse is not true.
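The arithmetic behind these percentages can be written out in a short Python sketch, using the numbers above (100 genes, 10 true associations, 80% power, \(\alpha = 0.05\)):

```python
# Expected counts for the 100-gene screening example
alpha, power = 0.05, 0.80
n_true, n_null = 10, 90                    # 10 truly associated genes, 90 not associated

true_positives = power * n_true            # 8 genes correctly identified
false_positives = alpha * n_null           # 4.5 genes falsely identified, on average
discoveries = true_positives + false_positives

print(f"P(a discovery is a true association) = {true_positives / discoveries:.2f}")   # ~0.64
print(f"False discovery rate                 = {false_positives / discoveries:.2f}")  # ~0.36
print(f"FWER = 1 - (1 - alpha)^90            = {1 - (1 - alpha) ** n_null:.2f}")      # ~0.99
# The slide rounds the 4.5 false positives up to 5, giving 8/13 ~ 62% and 5/13 ~ 38%.
```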
Controlling FDR
Assumptions about the correlation between tests
- Benjamini and Hochberg (1995): independent or positively dependent tests;
- Benjamini and Yekutieli (2001): any correlation structure between tests.
Controlling FDR
Procedure
- Choose an FDR level \(Q\);
- Obtain p-values for all tests \(p_1, p_2, \ldots, p_N\);
- Sort p-values \(p_{(1)} < p_{(2)} < \ldots < p_{(N)}\);
- Then, compare non-adjusted p-values with their Benjamini-Hochberg critical value:
- Is \(p_{(1)} < 1/N \times Q\)?
- Is \(p_{(2)} < 2/N \times Q\)?
- and so on.
- In general, Is \(p_{(k)} < k/N \times Q\)?
- The largest p-value that satisfies \(p_{(k)} < (k/N) \times Q\) is significant, and all p-values smaller than it are also significant, even those that are not less than their own Benjamini-Hochberg critical value.
Controlling FDR
Example
- Investigators want to compare the expression of three genes between two groups with a false discovery rate of \(5\%\);
- Number of comparisons = 3;
- Suppose the p-values of these three tests are \(p_1 = 0.0004\), \(p_2 = 0.03\), and \(p_3 = 0.0623\);
- Is \(p_1 < 0.05 \times 1/3 = 0.0166\)?
- Is \(p_2 < 0.05 \times 2/3 = 0.0333\)?
- Is \(p_3 < 0.05 \times 3/3 = 0.05\)?
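For illustration, a minimal Python sketch of the Benjamini-Hochberg step-up rule for these p-values: find the largest \(k\) with \(p_{(k)} < (k/N) \times Q\) and reject the hypotheses with the \(k\) smallest p-values:

```python
# Benjamini-Hochberg step-up rule at FDR level Q
Q = 0.05
pvalues = sorted([0.0004, 0.03, 0.0623])   # p_(1) <= p_(2) <= p_(3)
N = len(pvalues)

# largest k such that p_(k) < (k / N) * Q; 0 if no p-value passes
k_max = max((k for k, p in enumerate(pvalues, start=1) if p < k / N * Q), default=0)

for k, p in enumerate(pvalues, start=1):
    print(f"p_({k}) = {p}: critical value = {k / N * Q:.4f}, reject = {k <= k_max}")
# Here k_max = 2, so the two smallest p-values are declared discoveries.
```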
Controlling FDR
Calculating adjusted p-values
- Choose an FDR level \(Q\);
- Obtain p-values for all tests \(p_1, p_2, \ldots, p_N\);
- Sort p-values \(p_{(1)} < p_{(2)} < \ldots < p_{(N)}\);
- Then,
- \(p^*_{(N)} = p_{(N)}\)
- \(p^*_{(N-1)} = min\left\{p_{(N-1)} \times \frac{N}{N-1}, p^*_{(N)}\right\}\)
- and so on, until
- \(p^*_{(2)} = min\left\{p_{(2)} \times \frac{N}{2}, p^*_{(3)}\right\}\)
- \(p^*_{(1)} = min\left\{p_{(1)} \times \frac{N}{1}, p^*_{(2)}\right\}\).
- Check if adjusted p-values \(p^*_{(1)}, \ldots, p^*_{(N)}\) are smaller than \(Q\).
Controlling FDR
Example
- Investigators want to compare the expression of three genes between two groups with a false discovery rate of \(5\%\);
- There are 3 possible comparisons, so \(N = 3\);
- Suppose the p-values of these three tests are \(p_1 = 0.0004\), \(p_2 = 0.03\), and \(p_3 = 0.0623\);
- \(p^*_1 = 0.0004 \times 3/1 = 0.0012\);
- \(p^*_2 = 0.03 \times 3/2 = 0.045\);
- \(p^*_3 = 0.0623 \times 3/3 = 0.0623\);
- Check whether adjusted p-values are less than \(Q = 0.05\).
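As a cross-check, all three adjustments can be reproduced with the `multipletests` function from the Python package statsmodels (assuming it is installed):

```python
# Cross-check of the worked examples with statsmodels
from statsmodels.stats.multitest import multipletests

pvalues = [0.0004, 0.03, 0.0623]

for method in ("bonferroni", "holm", "fdr_bh"):
    reject, p_adjusted, _, _ = multipletests(pvalues, alpha=0.05, method=method)
    print(f"{method:10s}", p_adjusted.round(4), reject)

# Expected (approximately):
# bonferroni [0.0012 0.09   0.1869] [ True False False]
# holm       [0.0012 0.06   0.0623] [ True False False]
# fdr_bh     [0.0012 0.045  0.0623] [ True  True False]
```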
Multiple Comparisons
Criticism
- Adjustment for multiple comparisons is not universally accepted;
- Subgroup analyses, sequential tests, and searches for significant associations without pre-specified hypotheses are well-known scenarios that require adjustment;
- For multiple endpoints (unlike high-throughput data), the need for adjustment is debatable; if adjustment is required, it can encourage salami science (splitting one study into several publications to avoid the correction);
- Such corrections decrease the power of the tests, i.e., they increase the number of false negatives.
Multiple Comparisons
Down Syndrome dataset
- Down syndrome (DS) is caused by an extra copy of human chromosome 21 (Hsa21);
- There are no effective pharmacotherapies;
- The Ts65Dn mouse model of DS displays many features relevant to those seen in DS;
- The N-methyl-D-aspartate (NMDA) receptor antagonist memantine was shown to rescue performance of Ts65Dn mice in several learning and memory tasks;
- However, these studies have not been accompanied by molecular analysis.
Multiple Comparisons
Down Syndrome dataset
- Ahmed, M. M., Dhanasekaran, A. R., Block, A., Tong, S., Costa, A. C., Stasko, M., & Gardiner, K. J. (2015). Protein dynamics associated with failed and rescued learning in the Ts65Dn mouse model of Down syndrome. PLoS One, 10(3), e0119491.