Multiplicity

Marcio Diniz | Michael Luu

Cedars Sinai Medical Center

08 November, 2022

Jelly beans

No link between jelly beans and acne (p > 0.05).

Jelly beans

No link between purple, brown, pink, blue and teal jelly beans and acne (p > 0.05).

Jelly beans

No link between salmon, red, turquoise, magenta and yellow jelly beans and acne (p > 0.05).

Jelly beans

No link between gray, tan, cyan and mauve jelly beans and acne (p > 0.05). There is a link between green jelly beans and acne!

Jelly beans

No link between beige, lilac, black, peach and orange jelly beans and acne (p > 0.05).

Jelly beans

In 20 comparisons, only one was statistically significant!

Jelly beans

Now, we can reduce acne prevalence!

Multiple Comparisons

Basic Science

  • Compare mean tumor volume of nude mice between several treatment groups to control;
  • Compare mean tumor volume for a given group at different time points;
  • Compare gene expression levels of a large number of genes between treatment and control.

Clinical Trials

  • Multiple endpoints: compare treatment response, disease progression, quality of life between treatment and control;
  • Many treatments: compare each treatment arm to control and some pairwise comparisons;
  • Multiple looks: Interim analysis of data at pre-specified time points;
  • Sub-population: Testing for treatment effects for pre-specified subgroups of interest.

Why do we need multiple comparisons?

  • Testing multiple hypotheses inflates the probability of making a type I error (\(\alpha\));

  • Family-wise error rate (\(FWER\)) = P(Reject at least one true \(H_{0, i}\)) for \(i = 1, \ldots, n\) tests.

  • For one test,

\[ P(\mbox{not making type I error}) = (1 - \alpha) \]

  • For two independent tests,

\[ P(\mbox{not making any type I error}) = (1 - \alpha) \times (1 - \alpha) \]

Why do we need multiple comparisons?

Then,

\[ \begin{eqnarray} FWER &=& P(\mbox{making at least one type I error}) \\ &=& 1 - P(\mbox{not making any type I error}) \\ &=& 1 - (1 - \alpha)^2; \end{eqnarray} \]

Why do we need multiple comparisons?

  • If \(\alpha = 0.05\) with 20 independent tests,

\[ \begin{eqnarray*} FWER = P(\mbox{making at least one type I error}) = 1 - (1 - \alpha)^{20} \approx 0.64; \end{eqnarray*} \]
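The FWER formula can be checked numerically; a minimal Python sketch:

```python
# FWER for n independent tests, each at level alpha: 1 - (1 - alpha)^n
alpha = 0.05
for n_tests in (1, 2, 20):
    fwer = 1 - (1 - alpha) ** n_tests
    print(f"{n_tests:2d} tests: FWER = {fwer:.4f}")  # 20 tests give FWER ≈ 0.64
```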

  • For perfectly correlated tests, \(FWER = \alpha\).

Family Wise Error Rate

Controlling FWER

Any correlation structure between tests

  • Bonferroni correction (1936);
  • Holm (1979);

Independent or positively correlated tests

  • Sidak correction (1967);
  • Hochberg (1988);
  • Hommel (1988).

Bonferroni correction

Procedure

  • It is the most conservative correction;
  • The significance level is divided by the number of comparisons;
  • It does not require that the tests are independent;
  • \(FWER \leq \alpha\).

Bonferroni correction

Example

  • Three treatment arms and want to perform all pairwise comparisons with a familywise error rate \(\alpha = 0.05\);
  • There are 3 possible comparisons, so N = 3;
  • Suppose the p-values of these three tests are \(p_1 = 0.0004\), \(p_2 = 0.03\), and \(p_3 = 0.0623\);
  • Is \(p_1 < 0.05/3 \approx 0.0167\)?
  • Is \(p_2 < 0.05/3 \approx 0.0167\)?
  • Is \(p_3 < 0.05/3 \approx 0.0167\)?

Bonferroni correction

Calculating adjusted p-values

  • Adjusted p-values are calculated by multiplying the p-values by the number of comparisons;
  • Obtain p-values for all tests \(p_1, p_2, \ldots, p_N\);
  • Then,
    • \(p^*_{1} = min\{1, p_{1} \times N\}\)
    • \(p^*_{2} = min\{1, p_{2} \times N\}\)
    • and so on.
  • In general,
    • \(p^*_{k} = min\{1, p_{k} \times N\}\)
  • Check if adjusted p-values \(p^*_{1}, \ldots, p^*_{N}\) are smaller than \(\alpha\).

Bonferroni correction

Example

  • Three genes and investigators want to perform comparisons between two groups with a familywise error rate \(\alpha = 0.05\);
  • There are 3 possible comparisons, so N = 3;
  • Suppose the p-values of these three tests are \(p_1 = 0.0004\), \(p_2 = 0.03\), and \(p_3 = 0.0623\);
  • The adjusted p-values are:
    • \(p^*_1 = 3 \times 0.0004 = 0.0012\);
    • \(p^*_2 = 3 \times 0.03 = 0.09\);
    • \(p^*_3 = 3 \times 0.0623 = 0.1869\);
  • Check whether adjusted p-values are less than 0.05.
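The Bonferroni adjustment is one line of arithmetic; a minimal Python sketch using the example's p-values:

```python
# Bonferroni-adjusted p-values: p*_k = min(1, p_k * N)
pvalues = [0.0004, 0.03, 0.0623]  # p-values from the example
N = len(pvalues)
adjusted = [min(1.0, p * N) for p in pvalues]
print([round(a, 4) for a in adjusted])   # [0.0012, 0.09, 0.1869]
print([a < 0.05 for a in adjusted])      # only the first test stays significant
```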

Holm correction

  • Holm correction is uniformly more powerful than the Bonferroni correction;

Procedure

  • Obtain p-values for all tests \(p_1, p_2, \ldots, p_N\);
  • Sort p-values \(p_{(1)} < p_{(2)} < \ldots < p_{(N)}\);
  • Then,
    • If \(p_{(1)} > \alpha/N\), stop. No test is statistically significant. Otherwise,
    • If \(p_{(2)} > \alpha/(N-1)\), stop. Only the first test is statistically significant. Otherwise,
    • and so on.
  • In general, stop at the first \(k\) with \(p_{(k)} > \alpha/(N-k+1)\); all tests with smaller p-values are statistically significant;
  • \(FWER \leq \alpha\).

Holm correction

Example

  • Three genes and investigators want to perform comparisons between two groups with a familywise error rate \(\alpha = 0.05\);
  • There are 3 possible comparisons, so N = 3;
  • Suppose the p-values of these three tests are \(p_1 = 0.0004\), \(p_2 = 0.03\), and \(p_3 = 0.0623\);
    • Is \(p_1 < 0.05/3 \approx 0.0167\)?
    • Is \(p_2 < 0.05/2 = 0.025\)?
    • Is \(p_3 < 0.05/1 = 0.05\)?

Holm correction

Calculating adjusted p-values

  • Obtain p-values for all tests \(p_1, p_2, \ldots, p_N\);
  • Sort p-values \(p_{(1)} < p_{(2)} < \ldots < p_{(N)}\);
  • Then,
    • \(p^*_{(1)} = min\{1, p_{(1)} \times N\}\)
    • \(p^*_{(2)} = min\{1, p_{(2)} \times (N-1)\}\)
    • and so on.
  • In general, \(p^*_{(k)} = min\{1, p_{(k)} \times (N-k+1)\}\)
  • Check if adjusted p-values \(p^*_{(1)}, \ldots, p^*_{(N)}\) are smaller than \(\alpha\).

Holm correction

Example

  • Three genes and investigators want to perform comparisons between two groups with a familywise error rate \(\alpha = 0.05\);
  • There are 3 possible comparisons, so N = 3;
  • Suppose the p-values of these three tests are \(p_1 = 0.0004\), \(p_2 = 0.03\), and \(p_3 = 0.0623\);
  • The adjusted p-values are:
    • \(p^*_1 = 3 \times 0.0004 = 0.0012\);
    • \(p^*_2 = 2 \times 0.03 = 0.06\);
    • \(p^*_3 = 0.0623\);
  • Check whether adjusted p-values are less than 0.05.
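The Holm adjustment can be sketched in a few lines of Python; note that standard implementations also enforce monotonicity of the adjusted values with a running maximum, which leaves this example unchanged:

```python
# Holm step-down adjusted p-values: sort ascending, multiply the k-th
# smallest p-value by (N - k + 1), and keep a running maximum so the
# adjusted values never decrease.
pvalues = [0.0004, 0.03, 0.0623]  # p-values from the example
N = len(pvalues)
adjusted, running_max = [], 0.0
for k, p in enumerate(sorted(pvalues), start=1):
    running_max = max(running_max, min(1.0, p * (N - k + 1)))
    adjusted.append(running_max)
print([round(a, 4) for a in adjusted])  # [0.0012, 0.06, 0.0623]
```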

False Discovery Rate

  • The association of 100 genes with cancer will be investigated
  • \(H_{0, i}\): gene \(i\) does not have association with cancer for \(i = 1, \ldots, 100\);
  • \(H_{1, i}\): gene \(i\) has association with cancer for \(i = 1, \ldots, 100\);

False Discovery Rate

  • Only 10 genes are actually associated with cancer;
  • If the test has a power of 80%, then 8 out of the 10 genes associated with cancer will be identified;
  • Assuming a type I error \(\alpha\) of 5%, then on average 4.5 of the 90 genes not associated with cancer will be falsely identified.

False Discovery Rate

  • The chance that an identified gene is truly associated with cancer is 8 out of 12.5, i.e., 64%;
  • The false discovery rate is the expected fraction of statistically significant results (discoveries) that are actually false positives, i.e., 36%;
  • The family-wise error rate is \(1 - 0.95^{90} \approx 99\%\);
  • Controlling the FWER implies controlling the FDR, but the converse is not true.
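The expected counts in the 100-gene example can be checked directly; a minimal Python sketch under the slide's assumptions (10 truly associated genes, 80% power, \(\alpha = 0.05\)):

```python
# Expected outcomes of the 100-gene screen
n_genes, n_true = 100, 10
power, alpha = 0.80, 0.05

true_pos = power * n_true               # expected true positives: 8
false_pos = alpha * (n_genes - n_true)  # expected false positives: 4.5
fdr = false_pos / (true_pos + false_pos)
fwer = 1 - (1 - alpha) ** (n_genes - n_true)  # P(at least one false positive)

print(f"FDR  = {fdr:.2f}")   # 0.36
print(f"FWER = {fwer:.2f}")  # 0.99
```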

Controlling FDR

Independent or positively correlated tests

  • Benjamini and Hochberg (1995);

Any correlation structure between tests

  • Benjamini and Yekutieli (2001);

Controlling FDR

Procedure

  • Choose a FDR level \(Q\);
  • Obtain p-values for all tests \(p_1, p_2, \ldots, p_N\);
  • Sort p-values \(p_{(1)} < p_{(2)} < \ldots < p_{(N)}\);
  • Then, compare non-adjusted p-values with their Benjamini-Hochberg critical value:
    • Is \(p_{(1)} < 1/N \times Q\)?
    • Is \(p_{(2)} < 2/N \times Q\)?
    • and so on.
  • In general, Is \(p_{(k)} < k/N \times Q\)?
  • The largest p-value with \(p_{(k)} < (k/N) \times Q\) is significant, and all p-values smaller than it are also significant, even those that are not below their own Benjamini-Hochberg critical value.
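The step-up rule can be sketched in a few lines of Python (the function name is illustrative):

```python
# Benjamini-Hochberg step-up rule: find the largest k with
# p_(k) < (k / N) * Q; that test and every smaller p-value are significant.
def benjamini_hochberg(pvalues, Q=0.05):
    ranked = sorted(pvalues)
    N = len(ranked)
    cutoff = 0
    for k, p in enumerate(ranked, start=1):
        if p < k / N * Q:
            cutoff = k                 # largest k passing its critical value
    return ranked[:cutoff]             # significant p-values, ascending

print(benjamini_hochberg([0.0004, 0.03, 0.0623]))  # [0.0004, 0.03]
```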

Controlling FDR

Example

  • Three genes and investigators want to perform comparisons between two groups with a false discovery rate of \(5\%\);
  • Number of comparisons = 3;
  • Suppose the p-values of these three tests are \(p_1 = 0.0004\), \(p_2 = 0.03\), and \(p_3 = 0.0623\);
    • Is \(p_1 < 0.05 \times 1/3 \approx 0.0167\)?
    • Is \(p_2 < 0.05 \times 2/3 = 0.0333\)?
    • Is \(p_3 < 0.05 \times 3/3 = 0.05\)?

Controlling FDR

Calculating adjusted p-values

  • Choose a FDR level \(Q\);
  • Obtain p-values for all tests \(p_1, p_2, \ldots, p_N\);
  • Sort p-values \(p_{(1)} < p_{(2)} < \ldots < p_{(N)}\);
  • Then,
    • \(p^*_{(N)} = p_{(N)}\)
    • \(p^*_{(N-1)} = min\left\{p_{(N-1)} \times \frac{N}{N-1}, p^*_{(N)}\right\}\)
    • \(p^*_{(2)} = min\left\{p_{(2)} \times \frac{N}{2}, p^*_{(3)}\right\}\)
    • \(p^*_{(1)} = min\left\{p_{(1)} \times \frac{N}{1}, p^*_{(2)}\right\}\).
  • Check if adjusted p-values \(p^*_{(1)}, \ldots, p^*_{(N)}\) are smaller than \(Q\).

Controlling FDR

Example

  • Three genes and investigators want to perform comparisons between two groups with a false discovery rate of \(5\%\);
  • There are 3 possible comparisons, so \(N = 3\);
  • Suppose the p-values of these three tests are \(p_1 = 0.0004\), \(p_2 = 0.03\), and \(p_3 = 0.0623\);
    • \(p^*_1 = 0.0004 \times 3/1 = 0.0012\);
    • \(p^*_2 = 0.03 \times 3/2 = 0.045\);
    • \(p^*_3 = 0.0623 \times 3/3 = 0.0623\);
  • Check whether adjusted p-values are less than \(Q = 0.05\).
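The same adjusted values can be reproduced by running the recursion from the largest p-value downward; a minimal Python sketch:

```python
# BH-adjusted p-values: p*_(k) = min(p_(k) * N / k, p*_(k+1)),
# computed from the largest p-value down with a running minimum.
pvalues = [0.0004, 0.03, 0.0623]   # example p-values, already sorted ascending
N = len(pvalues)
adjusted = [0.0] * N
running_min = 1.0
for k in range(N, 0, -1):          # k = N, N-1, ..., 1
    running_min = min(running_min, pvalues[k - 1] * N / k)
    adjusted[k - 1] = running_min
print([round(a, 4) for a in adjusted])  # [0.0012, 0.045, 0.0623]
```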

Multiple Comparisons

Criticism

  • Multiple comparisons adjustment is not always accepted;
  • Subgroup analyses, sequential tests, and searches for significant associations without pre-established hypotheses are well-known scenarios that require adjustment;
  • For multiple endpoints, adjustment is debatable (unlike in high-throughput data); when required, it can encourage salami science;
  • Such corrections decrease the power of the tests, i.e., increase the number of false negatives.

Multiple Comparisons

Down Syndrome dataset

  • Down syndrome (DS) is caused by an extra copy of human chromosome 21 (Hsa21);
  • There are no effective pharmacotherapies;
  • The Ts65Dn mouse model of DS displays many features relevant to those seen in DS;
  • The N-methyl-D-aspartate (NMDA) receptor antagonist memantine was shown to rescue performance of Ts65Dn mice in several learning and memory tasks;
  • However, these studies have not been accompanied by molecular analysis.

Multiple Comparisons

Down Syndrome dataset

  • Ahmed, M. M., Dhanasekaran, A. R., Block, A., Tong, S., Costa, A. C., Stasko, M., & Gardiner, K. J. (2015). Protein dynamics associated with failed and rescued learning in the Ts65Dn mouse model of Down syndrome. PLoS One, 10(3), e0119491.