Test of Hypotheses

Marcio Diniz | Michael Luu

Cedars Sinai Medical Center

02 November, 2022

Introduction

Introduction

Inductive vs Deductive research

Steps

  • Research can be conducted following two approaches: Inductive and Deductive;
    • Inductive reasoning is used to formulate theories, while deductive reasoning is used to test hypotheses;
  • Deductive reasoning does not exist without inductive reasoning.

Introduction

Inductive reasoning

How do we test a hypothesis that not all swans are white?

  • After careful observation to test this hypothesis, with no sightings of a non-white swan, the scientist concludes that all swans are indeed white;
  • Can we conclude that all swans are white?
  • Even if the observations were accurate, the conclusion might still be false as the scientist assumes that all other swans that had yet to be observed would also be white.

Introduction

Deductive reasoning

How do we test a hypothesis that not all swans are white?

  • The scientist could formulate a reverse hypothesis that all swans are white.
  • Suppose that the scientist subsequently spots a non-white swan at place x and time t. What can we conclude?
  • It logically follows by deduction that not all swans are white and the hypothesis that all swans are white is rejected.

Introduction

Deductive reasoning

How do we test a hypothesis that not all swans are white?

  • If we did not see any non-white swans, can we conclude that the hypothesis that all swans are white is true?
  • No. Based on deductive reasoning, we cannot treat absence of evidence as evidence of absence.
  • Another classical example is the decision in a trial: the defendant can only be found guilty or not guilty, but not innocent.

Introduction

Exploratory

  • How can I explain or describe variation in my data set?
  • Searching for answers by visualizing, transforming, and modeling your data;
  • The researchers should feel free to investigate every idea that occurs to them;
  • It is not repeatable;
  • There is no pre-established hypothesis;
  • It generates hypotheses to be checked.

Introduction

Hypothesis-Driven

  • There are pre-established hypotheses strictly related to the aims of the research;
  • Analyses must be planned a priori;
  • It is repeatable;
  • It cannot be based on the same data that generated the hypothesis.

Introduction

History

  • The modern theory of test of hypotheses started with William Gosset's discovery of the t-test in 1908;
  • It was followed by Fisher with his book Statistical Methods for Research Workers (1925) presenting an approach called significance testing;
  • Neyman and Pearson (1933) introduced a more mathematical approach called Hypothesis testing;
  • A hybrid approach has been taught since 1940.

Statistical Inference

Statistical Inference

  • Assume a probability distribution for the endpoint of our experiment;
  • Interpret the parameters of the probability distribution in the context of the experiment;
  • Estimate the parameters based on the observed data in our sample;
  • Assume a probability distribution for the estimator of the parameters;
    • For medium and large sample sizes, the Central Limit Theorem can be used.
  • Calculate confidence intervals and test hypotheses for the parameters of the probability distribution.

Quantifying evidence

Is this coin fair?

  • A coin is fair when each coin flip has a 50% chance of landing on heads and a 50% chance of landing on tails;
  • You flip a coin 20 times and observe 16 heads and 4 tails;
  • How unlikely is it to find 16 heads in 20 flips?
  • We can calculate the probability of observing 16 heads in 20 flips assuming that the coin is fair;
  • P(16 heads in 20 flips | coin is fair) = 0.00462;

Quantifying evidence

Is this coin fair?

  • We would have been even more suspicious that the coin flipping wasn't fair had the coin landed on heads 17 times, 18 times, and so on;
    • P(16, 17, 18, 19 or 20 heads in 20 flips | coin is fair) = 0.0059;
  • Similarly, if the coin had landed on heads 4 times, 3 times, and so on;
    • P(4, 3, 2, 1 or 0 heads in 20 flips | coin is fair) = 0.0059;
  • p-value = 0.0059 + 0.0059 = 0.0118 (reproduced in the sketch below).
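
These tail probabilities can be reproduced directly in R; a minimal sketch, where the exact two-sided p-value from binom.test() matches the sum of the two tails.

Code
# Two-sided p-value for 16 heads in 20 flips of a fair coin
upper <- sum(dbinom(16:20, size = 20, prob = 0.5))  # P(Y >= 16 | coin is fair)
lower <- sum(dbinom(0:4, size = 20, prob = 0.5))    # P(Y <= 4  | coin is fair)
upper + lower                                       # 0.0118

binom.test(x = 16, n = 20, p = 0.5)$p.value         # same exact two-sided p-value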

Quantifying evidence

Is this coin fair?

  • \(X_i = \begin{cases} 0 \quad \mbox{if tails} \\ 1 \quad \mbox{if heads} \end{cases}\);
  • \(X_1, \ldots, X_{20} \quad \mbox{i.i.d. r.v.} \sim Bernoulli(p)\);
  • The hypothesis that the coin is fair means \(H_0: p = 0.5\) and the hypothesis that the coin is not fair means \(H_1: p \neq 0.5\);
    • \(H_0\) is called the null hypothesis. It is the reverse of the hypothesis that the researcher expects to be true.
    • \(H_1\) is called the alternative hypothesis. It is the researcher's hypothesis of interest.

Quantifying evidence

Is this coin fair?

Code
library(ggplot2)

# gg_color_hue() emulates ggplot2's default hue palette; it is assumed to be
# defined in the setup chunk of the original slides
gg_color_hue <- function(n) {
  hues <- seq(15, 375, length.out = n + 1)
  hcl(h = hues, l = 65, c = 100)[1:n]
}

n <- 20
size <- 20
p <- 0.5
# Binomial(20, 0.5) pmf; the two tails (0-4 and 16-20 heads) are highlighted
data_plot <- data.frame(y = 0:n, prob = dbinom(0:n, size, p),
                        col = c(rep("y", 5), rep("n", 11), rep("y", 5)))

ggplot(data_plot, aes(x = y, y = prob, fill = col)) +
  geom_bar(stat = "identity") +
  scale_x_continuous(breaks = seq(0, n)) +
  labs(x = "Y", y = "P(Y = y)") +
  theme_bw() +
  theme(text = element_text(size = 20),
        legend.text = element_text(size = 12),
        legend.title = element_text(size = 12)) +
  scale_fill_manual("col", values = c("grey", gg_color_hue(2)[2])) +
  theme(legend.position = "none")

  • Then, we can define the test statistic \(Y = \sum_{i = 1}^{20} X_i\) such that its sampling distribution is \(Binomial(n = 20, p)\), where \(p\) is unknown;
  • Based on the test statistic, the p-value is the probability of observing what we have observed, or any other result that is more extreme evidence favoring \(H_1: p \neq 0.5\), assuming that \(H_0: p = 0.5\) is true.

Quantifying evidence

Is this coin fair?

  • The calculation of the p-value depends on both hypotheses;
    • The alternative hypothesis is two-sided if \(H_1: p \neq 0.5\), i.e., the coin comes up heads too often or not often enough;
  • If there is some information prior to the experiment, then a one-sided hypothesis can be defined:
    • \(H_1: p > 0.5\) if the coin comes up heads too often;
    • \(H_1: p < 0.5\) if the coin comes up heads not often enough;
  • One-sided hypotheses require smaller sample sizes, but the results will be ignored if they are the opposite of the one-sided hypothesis;
  • After the data is collected, the hypotheses cannot be modified.

Quantifying evidence

Is this coin fair?

Code
library(patchwork)  # plot_layout() and plot_annotation() come from patchwork

n <- 20
size <- 20
p <- 0.5
data_plot <- data.frame(y = 0:20, prob = dbinom(0:20, size, p),
                        col = c(rep("y", 5), rep("n", 11), rep("y", 5)))

p1 <- ggplot(data_plot, aes(x = y, y = prob, fill = col)) +
  geom_bar(stat="identity") +
  #scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = seq(0, n)) +
  labs(x = "Y", y = "P(Y = y)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12)) +
  scale_fill_manual("col", values = c("grey", gg_color_hue(2)[2])) +
  theme(legend.position = "none")

data_plot <- data.frame(y = 0:20, prob = dbinom(0:20, size, p),
                        col = c(rep("n", 5), rep("n", 11), rep("y", 5)))

p2 <- ggplot(data_plot, aes(x = y, y = prob, fill = col)) +
  geom_bar(stat="identity") +
  #scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = seq(0, n)) +
  labs(x = "Y", y = "P(Y = y)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12)) +
  scale_fill_manual("col", values = c("grey", gg_color_hue(2)[2])) +
  theme(legend.position = "none")

data_plot <- data.frame(y = 0:20, prob = dbinom(0:20, size, p),
                        col = c(rep("y", 5), rep("y", 12), rep("n", 4)))

p3 <- ggplot(data_plot, aes(x = y, y = prob, fill = col)) +
  geom_bar(stat="identity") +
  #scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = seq(0, n)) +
  labs(x = "Y", y = "P(Y = y)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12)) +
  scale_fill_manual("col", values = c("grey", gg_color_hue(2)[2])) +
  theme(legend.position = "none")

p1 + p2 + p3 + plot_layout(nrow = 3) +
  plot_annotation(tag_levels = "A",
                  tag_suffix = ".")

A. p ≠ 0.5, B. p > 0.5, C. p < 0.5

Making a decision

  • Once data is collected, we can reject or not reject \(H_0\) based on deductive reasoning;
  • What are the consequences of a wrong decision?
  • Type I error is the probability of rejecting \(H_0\) when \(H_0\) is true, denoted by \(\alpha\);
  • Type II error is the probability of not rejecting \(H_0\) when \(H_0\) is false, denoted by \(\beta\);
  • Power is the probability of rejecting \(H_0\) when \(H_0\) is false, i.e., \(1 - \beta\);
  • Type I error can be interpreted as the probability of a false positive, and type II error as the probability of a false negative; consequently, power is the probability of a true positive. A simulation illustrating these definitions is sketched below.
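
A minimal simulation sketch of these definitions for a one-sample z-test with known variance; the sample size, effect size and significance level below are illustrative choices, not values from the slides.

Code
set.seed(1)
alpha <- 0.05; n <- 50; sigma <- 1; mu0 <- 2

# One simulated experiment: TRUE if H0: mu = mu0 is rejected at level alpha
reject <- function(mu_true) {
  x <- rnorm(n, mu_true, sigma)
  z <- (mean(x) - mu0) / (sigma / sqrt(n))
  abs(z) > qnorm(1 - alpha / 2)
}

mean(replicate(10000, reject(2)))    # H0 true:  rejection rate ~ 0.05 = type I error
mean(replicate(10000, reject(2.5)))  # H0 false: rejection rate ~ 0.94 = power = 1 - beta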

Making a decision

  • What are the type I and II errors for the hypotheses below?
  1. \(H_0\): the drug is ineffective and \(H_1\): the drug is effective;
  2. \(H_0\): the biomarker is not associated with disease and \(H_1\): the biomarker is associated with disease;

Making a decision

  • \(H_0\) is rejected if there is enough evidence against it. Otherwise, the null hypothesis is not rejected.
  • If the probability of observing the data, or any other result favoring \(H_1\), is small when \(H_0\) is assumed true, then how could we have observed the data we observed?
  • Either our data is wrong or our assumptions for the calculation of the p-value are wrong.
  • We are often confident in our experiments; therefore, our only option is to not believe that \(H_0\) is true, and consequently we can reject \(H_0\).
  • How small does a p-value need to be to indicate rejection of \(H_0\)?
  • P-values are compared with the significance level, a threshold for the type I error established in advance;
  • Historical values for the significance level are \(0.01\), \(0.05\) and \(0.10\).

Another example

  • In a trial, the prosecutor wants to prove that the defendant is guilty, \(H_1\): the defendant is guilty;
  • Based on deductive reasoning, the hypothesis of interest is reversed as \(H_0\): the defendant is innocent;
  • The prosecutor presents evidence against the hypothesis \(H_0\);
  • The defense attorney tries to invalidate the evidence;

Another example

  • Assuming that the defendant is innocent, the members of the jury need to evaluate whether the evidence is enough to consider the defendant guilty;
  • Is the evidence strong enough to reject \(H_0\) and declare the defendant guilty?
  • Otherwise, \(H_0\) is not rejected and the defendant is declared not guilty.
    • Type I error is to declare the defendant guilty when they are innocent;
    • Type II error is to declare the defendant not guilty when they are guilty.

Quantitative Variables

Troponin I

  • It is a cardiac and skeletal muscle protein useful in the laboratory diagnosis of heart attack. The test can be performed to confirm cardiac muscle damage;
  • Investigators believe that troponin levels are different between women and men;
  • Troponin probably does not follow a Normal distribution, but the Central Limit Theorem can help us make statements regarding average troponin levels.

How do we test a hypothesis about the mean?

  • \(X_i:\) Troponin I level for \(i = 1, \ldots, n\);
  • \(X_1, \ldots, X_n \mbox{ i.i.d. r.v.} \sim\) Normal\((\mu, \sigma^2)\) with \(\sigma^2\) known;
  • Is the average level of Troponin I in the healthy population equal to 2? \(H_0: \mu = 2\) vs \(H_1: \mu \neq 2\);
  • \(\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\);
  • If we observe \(\bar{X} = 5\), is there evidence against the null hypothesis?
  • How about \(\bar{X} = 2.5\)?
  • And \(\bar{X} = 1.9\)?
  • Intuitively, we checked how far the sample mean is from the null hypothesis, \(\mu = 2\).
  • Therefore, we could define the test statistic \(Z = \bar{X} - \mu\):
    • \(\bar{X} = 5 \rightarrow Z = 3\);
    • \(\bar{X} = 2.5 \rightarrow Z = 0.5\);
    • \(\bar{X} = 1.9 \rightarrow Z = -0.1\);
  • \(Z \sim N\left(0, \frac{\sigma^2}{n}\right)\)

How do we test a hypothesis about the mean?

Is \(Z = 3\) a large value?

If \(\sigma^2 = 10\) and \(n = 10\), then \(Z \sim N(0, 1)\);

Code
z <- 3
pz <- 2*(1 - pnorm(z, 0, 1))

data_plot <- data.frame(x = rep(c(-3, 3)))

ggplot(data_plot, aes(x = x)) +
  stat_function(fun = dnorm, args = list(0, 1)) +
  geom_area(fill = "blue",
            stat = "function", fun = dnorm,
            args = list(0, 1),
            xlim = c(z, 4),
            alpha = 0.5) +
  geom_area(fill = "blue",
            stat = "function", fun = dnorm,
            args = list(0, 1),
            xlim = c(-4, -z),
            alpha = 0.5) +
  labs(y = "Density", x = "Z") +
  theme_bw() +
  theme(legend.position = "None") +
  geom_vline(xintercept =
               c(qnorm(0.025, 0, 1), qnorm(0.975, 0, 1)))

p-value = 0.003

How do we test a hypothesis about the mean?

Is \(Z = 3\) a large value?

If \(\sigma^2 = 100\) and \(n = 10\), then \(Z \sim N(0, 10)\);

Code
z <- 3
pz <- 2*(1 - pnorm(z, 0, 10))

data_plot <- data.frame(x = rep(c(-50, 50)))

ggplot(data_plot, aes(x = x)) +
  stat_function(fun = dnorm, args = list(0, 10)) +
  geom_area(fill = "blue",
            stat = "function", fun = dnorm,
            args = list(0, 10),
            xlim = c(z, 50),
            alpha = 0.5) +
  geom_area(fill = "blue",
            stat = "function", fun = dnorm,
            args = list(0, 10),
            xlim = c(-50, -z),
            alpha = 0.5) +
  labs(y = "Density", x = "Z") +
  theme_bw() +
  theme(legend.position = "None") +
  geom_vline(xintercept =
               c(qnorm(0.025, 0, 10), qnorm(0.975, 0, 10)))

p-value = 0.76

How do we test a hypothesis about the mean?

Is \(Z = 3\) a large value?

If \(\sigma^2 = 1\) and \(n = 10\), then \(Z \sim N(0, 0.1)\);

Code
z <- 3
pz <- 2*(1 - pnorm(z, 0, 0.1))

data_plot <- data.frame(x = rep(c(-3.5, 3.5)))

ggplot(data_plot, aes(x = x)) +
  stat_function(fun = dnorm, args = list(0, 0.1)) +
  geom_area(fill = "blue",
            stat = "function", fun = dnorm,
            args = list(0, 0.1),
            xlim = c(z, 3.5),
            alpha = 0.5) +
  geom_area(fill = "blue",
            stat = "function", fun = dnorm,
            args = list(0, 0.1),
            xlim = c(-3.5, -z),
            alpha = 0.5) +
  labs(y = "Density", x = "Z") +
  scale_x_continuous(breaks = seq(-3, 3, 1)) +
  theme_bw() +
  theme(legend.position = "None") +
  geom_vline(xintercept =
               c(qnorm(0.025, 0, 0.1), qnorm(0.975, 0, 0.1)))

p-value < 0.001

One Sample Z-test

Code
data_plot <- data.frame(x = rep(c(-4, 4)))

ggplot(data_plot, aes(x = x)) +
  stat_function(fun = dnorm, args = list(0, 1)) +
  labs(y = "Density", x = "Z") +
  theme_bw() +
  theme(legend.position = "None") 

Is \(Z = 3\) a large value?

  • \(Z = \bar{X} - \mu\) is a measure of distance that can be considered large or small depending on \(\sigma^2\);
  • Therefore, we can standardize the distance to be invariant to \(\sigma^2\), i.e., \(Z = \displaystyle\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\);
  • Then, \(Z \sim N(0, 1)\);
    • If \(Z = 3\), p-value = 0.003 for any value of \(\sigma^2\);
    • If \(Z = 0.5\), p-value = 0.617 for any value of \(\sigma^2\);
    • If \(Z = -0.1\), p-value = 0.92 for any value of \(\sigma^2\).
  • The test statistic \(Z\) is a measure of the standardized distance between the observed data and the null hypothesis. It is also known as the Z-score;
  • The p-value is a transformation of this distance into a probability, assuming that the null hypothesis is true.
  • If the observed data is too far from the null hypothesis (high Z-score), then the p-value is small. Since the null hypothesis is assumed to be true when calculating the p-value, this is evidence against the null hypothesis (the p-values quoted above are checked in the sketch below).
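
Code
# Two-sided p-values for the standardized Z-scores, for any value of sigma^2
z <- c(3, 0.5, -0.1)
round(2 * (1 - pnorm(abs(z))), 3)   # 0.003 0.617 0.920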

One Sample Z-test

  • \(Z = \displaystyle\frac{\bar{X} - \mu}{\sqrt{\frac{\sigma^2}{n}}} ~ \sim N(0, 1)\) with \(\sigma^2 = 1\), \(n = 50\) and \(\mu = 2\) under \(H_0\);
  • \(\bar{X} = 2.34 \rightarrow Z^* = 2.45\),
Code
set.seed(1234)
x <- rnorm(50, 2.8, 1)
z <- (mean(x) - 2)/sqrt(1^2/length(x))
t <- (mean(x) - 2)/sqrt(var(x)/length(x))
pz <- 2*(1 - pnorm(z, 0, 1))
pt <- 2*(1 - pt(t, df = (length(x) - 1)))

data_plot <- data.frame(x = rep(c(-5, 5)))

ggplot(data_plot, aes(x = x)) +
  stat_function(fun = dnorm, args = list(0, 1)) +
  geom_area(fill = "blue",
            stat = "function", fun = dnorm,
            args = list(0, 1),
            xlim = c(z, 5),
            alpha = 0.5) +
  geom_area(fill = "blue",
            stat = "function", fun = dnorm,
            args = list(0, 1),
            xlim = c(-5, -z),
            alpha = 0.5) +
  labs(y = "Density", x = "Z") +
  theme_bw() +
  theme(legend.position = "None") +
  geom_vline(xintercept =
               c(qnorm(0.025, 0, 1), qnorm(0.975, 0, 1)))

\(H_0: µ = 2\) vs \(H_1: µ ≠ 2\), p-value = 0.014

One Sample Z-test

  • \(Z = \displaystyle\frac{\bar{X} - \mu}{\sqrt{\frac{\sigma^2}{n}}} ~ \sim N(0, 1)\) with \(\sigma^2 = 1\), \(n = 50\) and \(\mu = 2\) under \(H_0\);
  • \(\bar{X} = 2.34 \rightarrow Z^* = 2.45\),
Code
ggplot(data_plot, aes(x = x)) +
  stat_function(fun = dnorm, args = list(0, 1)) +
  geom_area(fill = "blue",
            stat = "function", fun = dnorm,
            args = list(0, 1),
            xlim = c(z, 5),
            alpha = 0.5) +
  labs(y = "Density", x = "Z") +
  theme_bw() +
  theme(legend.position = "None") +
  geom_vline(xintercept =
               qnorm(0.95, 0, 1))

\(H_0: µ = 2\) vs \(H_1: µ > 2\), p-value = 0.007

One Sample Z-test

  • \(Z = \displaystyle\frac{\bar{X} - \mu}{\sqrt{\frac{\sigma^2}{n}}} ~ \sim N(0, 1)\) with \(\sigma^2 = 1\), \(n = 50\) and \(\mu = 2\) under \(H_0\);
  • \(\bar{X} = 2.34 \rightarrow Z^* = 2.45\),
Code
ggplot(data_plot, aes(x = x)) +
  stat_function(fun = dnorm, args = list(0, 1)) +
  geom_area(fill = "blue",
            stat = "function", fun = dnorm,
            args = list(0, 1),
            xlim = c(-5, z),
            alpha = 0.5) +
  labs(y = "Density", x = "Z") +
  theme_bw() +
  theme(legend.position = "None") +
  geom_vline(xintercept =
               qnorm(0.05, 0, 1))

\(H_0: µ = 2\) vs \(H_1: µ < 2\), p-value = 0.993

One Sample Z-test

  • \(Z = \displaystyle\frac{\bar{X} - \mu}{\sqrt{\frac{\sigma^2}{n}}} ~ \sim N(0, 1)\) with \(\sigma^2 = 1\), \(n = 50\) and \(\mu = 2\) under \(H_0\);
  • \(\bar{X} = 2.24 \rightarrow Z^* = 1.72\),
Code
set.seed(892)
x <- rnorm(50, 2.2, 1)
z <- (mean(x) - 2)/sqrt(1^2/length(x))
t <- (mean(x) - 2)/sqrt(var(x)/length(x))
pz <- 2*(1 - pnorm(z, 0, 1))
pt <- 2*(1 - pt(t, df = (length(x) - 1)))

data_plot <- data.frame(x = rep(c(-5, 5)))

ggplot(data_plot, aes(x = x)) +
  stat_function(fun = dnorm, args = list(0, 1)) +
  geom_area(fill = "blue",
            stat = "function", fun = dnorm,
            args = list(0, 1),
            xlim = c(z, 5),
            alpha = 0.5) +
  geom_area(fill = "blue",
            stat = "function", fun = dnorm,
            args = list(0, 1),
            xlim = c(-5, -z),
            alpha = 0.5) +
  labs(y = "Density", x = "Z") +
  theme_bw() +
  theme(legend.position = "None") +
  geom_vline(xintercept =
               c(qnorm(0.025, 0, 1), qnorm(0.975, 0, 1))) 

\(H_0: µ = 2\) vs \(H_1: µ ≠ 2\), p-value = 0.084

One Sample Z-test

  • \(Z = \displaystyle\frac{\bar{X} - \mu}{\sqrt{\frac{\sigma^2}{n}}} ~ \sim N(0, 1)\) with \(\sigma^2 = 1\), \(n = 50\) and \(\mu = 2\) under \(H_0\);
  • \(\bar{X} = 2.24 \rightarrow Z^* = 1.72\),
Code
ggplot(data_plot, aes(x = x)) +
  stat_function(fun = dnorm, args = list(0, 1)) +
  geom_area(fill = "blue",
            stat = "function", fun = dnorm,
            args = list(0, 1),
            xlim = c(z, 5),
            alpha = 0.5) +
  labs(y = "Density", x = "Z") +
  theme_bw() +
  theme(legend.position = "None") +
  geom_vline(xintercept =
               qnorm(0.95, 0, 1))

\(H_0: µ = 2\) vs \(H_1: µ > 2\), p-value = 0.042

One Sample Z-test

  • \(Z = \displaystyle\frac{\bar{X} - \mu}{\sqrt{\frac{\sigma^2}{n}}} ~ \sim N(0, 1)\) with \(\sigma^2 = 1\), \(n = 50\) and \(\mu = 2\) under \(H_0\);
  • \(\bar{X} = 2.24 \rightarrow Z^* = 1.72\),
Code
ggplot(data_plot, aes(x = x)) +
  stat_function(fun = dnorm, args = list(0, 1)) +
  geom_area(fill = "blue",
            stat = "function", fun = dnorm,
            args = list(0, 1),
            xlim = c(-5, z),
            alpha = 0.5) +
  labs(y = "Density", x = "Z") +
  theme_bw() +
  theme(legend.position = "None") +
  geom_vline(xintercept =
               qnorm(0.05, 0, 1)) 

\(H_0: µ = 2\) vs \(H_1: µ < 2\), p-value = 0.958

One Sample Z-test

  • \(Z = \displaystyle\frac{\bar{X} - \mu}{\sqrt{\frac{\sigma^2}{n}}} ~ \sim N(0, 1)\) with \(\sigma^2 = 1\), \(n = 100\) and \(\mu = 2\) under \(H_0\);
  • \(\bar{X} = 2.1 \rightarrow Z^* = 1.02\),
Code
set.seed(2527)
x <- rnorm(100, 2, 1)
z <- (mean(x) - 2)/sqrt(1^2/length(x))
t <- (mean(x) - 2)/sqrt(var(x)/length(x))
pz <- 2*(1 - pnorm(z, 0, 1))
pt <- 2*(1 - pt(t, df = (length(x) - 1)))


ggplot(data_plot, aes(x = x)) +
  stat_function(fun = dnorm, args = list(0, 1)) +
  geom_area(fill = "blue",
            stat = "function", fun = dnorm,
            args = list(0, 1),
            xlim = c(z, 5),
            alpha = 0.5) +
  geom_area(fill = "blue",
            stat = "function", fun = dnorm,
            args = list(0, 1),
            xlim = c(-5, -z),
            alpha = 0.5) +
  labs(y = "Density", x = "Z") +
  theme_bw() +
  theme(legend.position = "None") +
  geom_vline(xintercept =
               c(qnorm(0.025, 0, 1), qnorm(0.975, 0, 1)))

\(H_0: µ = 2\) vs \(H_1: µ ≠ 2\), p-value = 0.308

One Sample Z-test

  • \(Z = \displaystyle\frac{\bar{X} - \mu}{\sqrt{\frac{\sigma^2}{n}}} ~ \sim N(0, 1)\) with \(\sigma^2 = 1\), \(n = 100\) and \(\mu = 2\) under \(H_0\);
  • \(\bar{X} = 2.1 \rightarrow Z^* = 1.02\),
Code
ggplot(data_plot, aes(x = x)) +
  stat_function(fun = dnorm, args = list(0, 1)) +
  geom_area(fill = "blue",
            stat = "function", fun = dnorm,
            args = list(0, 1),
            xlim = c(z, 5),
            alpha = 0.5) +
  labs(y = "Density", x = "Z") +
  theme_bw() +
  theme(legend.position = "None") +
  geom_vline(xintercept =
               qnorm(0.95, 0, 1)) 

\(H_0: µ = 2\) vs \(H_1: µ > 2\), p-value = 0.154

One Sample Z-test

  • \(Z = \displaystyle\frac{\bar{X} - \mu}{\sqrt{\frac{\sigma^2}{n}}} ~ \sim N(0, 1)\) with \(\sigma^2 = 1\), \(n = 100\) and \(\mu = 2\) under \(H_0\);
  • \(\bar{X} = 2.1 \rightarrow Z^* = 1.02\),
Code
ggplot(data_plot, aes(x = x)) +
  stat_function(fun = dnorm, args = list(0, 1)) +
  geom_area(fill = "blue",
            stat = "function", fun = dnorm,
            args = list(0, 1),
            xlim = c(-5, z),
            alpha = 0.5) +
  labs(y = "Density", x = "Z") +
  theme_bw() +
  theme(legend.position = "None") +
  geom_vline(xintercept =
               qnorm(0.05, 0, 1)) 

\(H_0: µ = 2\) vs \(H_1: µ < 2\), p-value = 0.846

What happens when the variance is not known?

One Sample t-test

  • \(X_i:\) Troponin I level for \(i = 1, \ldots, n\);
  • \(X_1, \ldots, X_n \mbox{ i.i.d. r.v.} \sim\) Normal\((\mu, \sigma^2)\) with \(\sigma^2\) unknown;
  • Is the average level of Troponin I in the healthy population equal to 2? \(H_0: \mu = 2\) vs \(H_1: \mu \neq 2\);
  • \(\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\);
  • \(S^2 = \displaystyle\frac{\sum_{i = 1}^{n} (X_i - \bar{X})^2 }{n - 1}\) is a good estimator for \(\sigma^2\);
  • \(T = \displaystyle\frac{\bar{X} - \mu}{\sqrt{\frac{S^2}{n}}} \sim t_{n - 1}\), where \(\mu\) is the value specified by \(H_0\).
  • The Student's t distribution has a degrees-of-freedom parameter that depends on the sample size;
  • The test statistic \(T\) is a measure of distance between the observed data and the null hypothesis, similar to the \(Z\) statistic.
  • If the observed data is too far from the null hypothesis (high T-score), then the p-value is small. Since the null hypothesis is assumed to be true when calculating the p-value, this is evidence against the null hypothesis (see the sketch below).
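
A minimal sketch of the one-sample t-test in R, using the same simulated sample as the next slide; t.test() reproduces the manual calculation.

Code
set.seed(1234)
x <- rnorm(50, 2.8, 1)                              # simulated troponin levels
t_stat <- (mean(x) - 2) / sqrt(var(x) / length(x))  # T statistic under H0: mu = 2
2 * (1 - pt(abs(t_stat), df = length(x) - 1))       # two-sided p-value
t.test(x, mu = 2)                                   # built-in one-sample t-test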

One Sample t-test

  • \(T = \displaystyle\frac{\bar{X} - \mu}{\sqrt{\frac{S^2}{n}}} ~ \sim t(n-1)\) with \(\mu = 2\) under \(H_0\);
  • \(\bar{X} = 2.34, S^2 = 0.78 \rightarrow T^* = 2.77\),
Code
set.seed(1234)
x <- rnorm(50, 2.8, 1)
z <- (mean(x) - 2)/sqrt(1^2/length(x))
t <- (mean(x) - 2)/sqrt(var(x)/length(x))
pz <- 2*(1 - pnorm(z, 0, 1))
pt <- 2*(1 - pt(t, df = (length(x) - 1)))

data_plot <- data.frame(x = rep(c(-5, 5)))

ggplot(data_plot, aes(x = x)) +
  stat_function(fun = dt, args = list((length(x) - 1))) +
  geom_area(fill = "blue",
            stat = "function", fun = dt,
            args = list((length(x) - 1)),
            xlim = c(t, 5),
            alpha = 0.5) +
  geom_area(fill = "blue",
            stat = "function", fun = dt,
            args = list((length(x) - 1)),
            xlim = c(-5, -t),
            alpha = 0.5) +
  labs(y = "Density", x = "T") +
  theme_bw() +
  theme(legend.position = "None") +
  geom_vline(xintercept =
               c(qt(0.025, df = (length(x) - 1)),
                 qt(0.975, df = (length(x) - 1))))

\(H_0: µ = 2\) vs \(H_1: µ ≠ 2\), p-value = 0.007

One Sample t-test

  • \(T = \displaystyle\frac{\bar{X} - \mu}{\sqrt{\frac{S^2}{n}}} ~ \sim t(n - 1)\) with \(\mu = 2\) under \(H_0\);
  • \(\bar{X} = 2.24, S^2 = 0.87 \rightarrow T^* = 1.79\),
Code
set.seed(892)
x <- rnorm(50, 2.2, 1)
z <- (mean(x) - 2)/sqrt(1^2/length(x))
t <- (mean(x) - 2)/sqrt(var(x)/length(x))
pz <- 2*(1 - pnorm(z, 0, 1))
pt <- 2*(1 - pt(t, df = (length(x) - 1)))

ggplot(data_plot, aes(x = x)) +
  stat_function(fun = dt, args = list((length(x) - 1))) +
  geom_area(fill = "blue",
            stat = "function", fun = dt,
            args = list((length(x) - 1)),
            xlim = c(t, 5),
            alpha = 0.5) +
  geom_area(fill = "blue",
            stat = "function", fun = dt,
            args = list((length(x) - 1)),
            xlim = c(-5, -t),
            alpha = 0.5) +
  labs(y = "Density", x = "T") +
  theme_bw() +
  theme(legend.position = "None") +
  geom_vline(xintercept =
               c(qt(0.025, df = (length(x) - 1)),
                 qt(0.975, df = (length(x) - 1))))

\(H_0: µ = 2\) vs \(H_1: µ ≠ 2\), p-value = 0.078

One Sample t-test

  • \(T = \displaystyle\frac{\bar{X} - \mu}{\sqrt{\frac{S^2}{n}}} ~ \sim t(n - 1)\) with \(\mu = 2\) under \(H_0\);
  • \(\bar{X} = 2.10, S^2 = 0.77 \rightarrow T^* = 1.15\),
Code
set.seed(2527)
x <- rnorm(100, 2, 1)
z <- (mean(x) - 2)/sqrt(1^2/length(x))
t <- (mean(x) - 2)/sqrt(var(x)/length(x))
pz <- 2*(1 - pnorm(z, 0, 1))
pt <- 2*(1 - pt(t, df = (length(x) - 1)))

ggplot(data_plot, aes(x = x)) +
  stat_function(fun = dt, args = list((length(x) - 1))) +
  geom_area(fill = "blue",
            stat = "function", fun = dt,
            args = list((length(x) - 1)),
            xlim = c(t, 5),
            alpha = 0.5) +
  geom_area(fill = "blue",
            stat = "function", fun = dt,
            args = list((length(x) - 1)),
            xlim = c(-5, -t),
            alpha = 0.5) +
  labs(y = "Density", x = "T") +
  theme_bw() +
  theme(legend.position = "None") +
  geom_vline(xintercept =
               c(qt(0.025, df = (length(x) - 1)),
                 qt(0.975, df = (length(x) - 1))))

\(H_0: µ = 2\) vs \(H_1: µ ≠ 2\), p-value = 0.249

Comparing two groups

Is the average level of Troponin I in the healthy population equal between men and women?

  • \(X_i:\) Troponin I level for females, \(i = 1, \ldots, n_F\);
  • \(Y_i:\) Troponin I level for males, \(i = 1, \ldots, n_M\);
  • \(X_1, \ldots, X_{n_F} \mbox{i.i.d. r.v.} \sim\) Normal\((\mu_F, \sigma_F^2)\);
  • \(Y_1, \ldots, Y_{n_M} \mbox{i.i.d. r.v.} \sim\) Normal\((\mu_M, \sigma_M^2)\);
  • \(\sigma_F^2\) and \(\sigma_M^2\) are equal i.e., \(\sigma_F^2 = \sigma_M^2 = \sigma^2\) such that \(\sigma^2\) is known;
  • The two samples are independent;
  • The statistical hypotheses are \(H_0: \mu_F = \mu_M\) vs \(H_1: \mu_F \neq \mu_M\) that can be re-written as \(H_0: \mu_F - \mu_M = 0\) vs \(H_1: \mu_F - \mu_M \neq 0\);

Two-sample z-test

  • \(H_0: \mu_F - \mu_M = 0\) vs \(H_1: \mu_F - \mu_M \neq 0\);
  • \(\bar{X} \sim N\left(\mu_F, \frac{\sigma^2_F}{n_F}\right)\);
  • \(\bar{Y} \sim N\left(\mu_M, \frac{\sigma^2_M}{n_M}\right)\);
  • Is there evidence against the null hypothesis if \(\bar{X} = 4.83\) and \(\bar{Y} = 6.74\)?
  • \(Z = (\bar{X} - \bar{Y}) - (\mu_F - \mu_M) = 4.83 - 6.74 = -1.91\) under \(H_0\);
  • Is Z = -1.91 a value far away from zero?
  • Similar to the one-sample z-test, a more adequate measure of distance is obtained after standardizing by the standard deviation of the difference \(\bar{X} - \bar{Y}\).
  • The variance of the difference is \(\frac{\sigma^2_F}{n_F} + \frac{\sigma^2_M}{n_M}\).
  • Then, \(Z = \displaystyle\frac{(\bar{X} - \bar{Y}) - (\mu_F - \mu_M)}{\sqrt{\frac{\sigma^2_F}{n_F} + \frac{\sigma^2_M}{n_M}}} \sim N(0, 1)\), where \(\mu_F - \mu_M = 0\) under \(H_0\).
  • Based on the test statistic \(Z\), p-values can be calculated using the same procedures previously discussed (a sketch follows below).
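
A minimal sketch of the two-sample z-test with the means above; the variances and sample sizes are illustrative values only, since they are not given on the slide.

Code
xbar <- 4.83; ybar <- 6.74            # observed means (females, males)
sigma2_F <- 4; sigma2_M <- 4          # known variances (illustrative)
n_F <- 30; n_M <- 30                  # sample sizes (illustrative)

z <- (xbar - ybar) / sqrt(sigma2_F / n_F + sigma2_M / n_M)
2 * (1 - pnorm(abs(z)))               # two-sided p-value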

Two-sample t-test

Student t-test (equal variances)

  • \(X_i:\) Troponin I level for females, \(i = 1, \ldots, n_F\);
  • \(Y_i:\) Troponin I level for males, \(i = 1, \ldots, n_M\);
  • \(X_1, \ldots, X_{n_F} \mbox{ i.i.d. r.v.} \sim\) Normal\((\mu_F, \sigma_F^2)\);
  • \(Y_1, \ldots, Y_{n_M} \mbox{ i.i.d. r.v.} \sim\) Normal\((\mu_M, \sigma_M^2)\);
  • \(\sigma_F^2\) and \(\sigma_M^2\) are equal, i.e., \(\sigma_F^2 = \sigma_M^2 = \sigma^2\) such that \(\sigma^2\) is unknown;
  • The two samples are independent;
  • \(H_0: \mu_F - \mu_M = 0\) vs \(H_1: \mu_F - \mu_M \neq 0\);
  • \(\bar{X} \sim N\left(\mu_F, \frac{\sigma^2}{n_F}\right)\);
  • \(\bar{Y} \sim N\left(\mu_M, \frac{\sigma^2}{n_M}\right)\);
  • \(T = \displaystyle\frac{(\bar{X} - \bar{Y}) - (\mu_F - \mu_M)}{\sqrt{\frac{S^2}{n_F} + \frac{S^2}{n_M}}} \sim t_{(n_F + n_M - 2)}\), where \(\mu_F - \mu_M = 0\) under \(H_0\);
  • \(S^2 = \displaystyle\frac{\sum_{i = 1}^{n_F} (X_i - \bar{X})^2 + \sum_{i = 1}^{n_M} (Y_i - \bar{Y})^2}{n_F + n_M - 2}\) is the pooled variance estimator (see the sketch below).
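
A minimal sketch of the pooled-variance (Student) t-test in R with simulated, illustrative data; var.equal = TRUE gives the t statistic above with n_F + n_M - 2 degrees of freedom.

Code
set.seed(1)
x <- rnorm(30, 4.8, 1)            # females (illustrative values)
y <- rnorm(30, 6.7, 1)            # males (same variance)
t.test(x, y, var.equal = TRUE)    # Student two-sample t-test, pooled variance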

Two-sample t-test

Welch t-test (unequal variances)

  • \(X_i:\) Troponin I level for females, \(i = 1, \ldots, n_F\);
  • \(Y_i:\) Troponin I level for males, \(i = 1, \ldots, n_M\);
  • \(X_1, \ldots, X_{n_F} \mbox{i.i.d. r.v.} \sim\) Normal\((\mu_F, \sigma_F^2)\);
  • \(Y_1, \ldots, Y_{n_M} \mbox{i.i.d. r.v.} \sim\) Normal\((\mu_M, \sigma_M^2)\);
  • \(\sigma_F^2\) and \(\sigma_M^2\) are unequal and unknown;
  • The two samples are independent;
  • \(H_0: \mu_F - \mu_M = 0\) vs \(H_1: \mu_F - \mu_M \neq 0\);
  • \(\bar{X} \sim N\left(\mu_F, \frac{\sigma_F^2}{n_F}\right)\);
  • \(\bar{Y} \sim N\left(\mu_M, \frac{\sigma_M^2}{n_M}\right)\);
  • \(T = \displaystyle\frac{(\bar{X} - \bar{Y}) - (\mu_F - \mu_M)}{\sqrt{\frac{S_F^2}{n_F} + \frac{S_M^2}{n_M}}} \sim t_{v}\), where \(\mu_F - \mu_M = 0\) under \(H_0\) and \(v\) is given by the Welch-Satterthwaite approximation;
  • \(S_F^2 = \displaystyle\frac{\sum_{i = 1}^{n_F} (X_i - \bar{X})^2 }{n_F - 1}\);
  • \(S_M^2 = \displaystyle\frac{\sum_{i = 1}^{n_M} (Y_i - \bar{Y})^2 }{n_M - 1}\) (see the sketch below).
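
A minimal sketch of the Welch t-test in R with simulated, illustrative data; var.equal = FALSE (the default in t.test()) uses the Welch-Satterthwaite approximation for the degrees of freedom \(v\).

Code
set.seed(1)
x <- rnorm(30, 4.8, 1)    # females (illustrative values)
y <- rnorm(40, 6.7, 2)    # males, different variance and sample size
t.test(x, y)              # Welch two-sample t-test (default)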

Tests of Hypotheses - Workflow

One Sample

  • Independent and identically distributed samples;
  • Normality;
  • Known variance;

Two Samples

  • Independent and identically distributed samples;
  • Two samples are independent;
  • Normality;
  • Known variances;
  • Equal Variances;

Tests of Hypotheses - Assumptions

  • The assumption of normality can be checked graphically and using tests of hypotheses;
    • The graphical approach is preferred;
    • For large sample sizes, the Central Limit Theorem tells us that the sample average is approximately normally distributed, therefore we can assume that the assumption is not violated;
  • In case of normality, the assumption of homoscedasticity (equal variances) can be checked graphically and using tests of hypotheses;
  • In case of non-normality, the assumption of same shape (equal variances + equal kurtosis + equal skewness) can only be verified graphically;
  • Data will never completely fulfill any of these assumptions; therefore, we look for extreme violations that could invalidate our conclusions.

Test of Hypotheses

Non-parametric tests

  • Non-parametric tests do not assume that the data follows the Normal distribution;
  • However, such tests have less power to identify differences between samples;
  • The hypotheses are not defined based on the mean values as previously, but based on the median or on the full probability distribution;
  • \(X_i:\) Troponin I level for females, \(i = 1, \ldots, n_F\);
  • \(Y_i:\) Troponin I level for males, \(i = 1, \ldots, n_M\);
  • \(X_1, \ldots, X_{n_F} \mbox{ i.i.d. r.v.} \sim P_F\);
  • \(Y_1, \ldots, Y_{n_M} \mbox{ i.i.d. r.v.} \sim P_M\);
  • \(H_0\): \(P(X > Y) + 0.5P(X = Y) = 0.5\), i.e., the probability that one group takes larger values than the other is equal to 0.5;
  • \(H_1\): \(P(X > Y) + 0.5P(X = Y) \neq 0.5\), i.e., the probability that one group takes larger values than the other is different from 0.5 (see the sketch below).
  • If the assumption of equal shapes holds, then the hypotheses above become:
    • \(H_0\): \(median_X = median_Y\);
    • \(H_1\): \(median_X \neq median_Y\).
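
These hypotheses correspond to the Wilcoxon rank-sum (Mann-Whitney) test; a minimal sketch with simulated data (the same gamma samples used in a later slide).

Code
set.seed(1234)
x <- rgamma(100, 5, 1)     # females
y <- rgamma(100, 10, 1)    # males
wilcox.test(x, y)          # H0: P(X > Y) + 0.5 P(X = Y) = 0.5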

Non-parametric tests

Are medians being compared?

  • Yes, there is not enough evidence that the distributions have different shapes.
Code
set.seed(1234)
x <- c(rgamma(100, 5, 1), rgamma(100, 10, 1))
sex <- c(rep("F", 100), rep("M", 100))

dp <- data.frame(x, sex)
ggplot(dp, aes(x = x, y = stat(density), fill = sex)) + geom_histogram() +
  facet_grid(sex ~ ., labeller = label_both) + 
  labs(x = "Troponin", y = "Density") +
  theme_bw() + theme(text = element_text(size = 14), legend.position = "none")

Non-parametric tests

Are medians being compared?

  • No, there is enough evidence that the shapes are different.
Code
set.seed(1234)
x <- c(rgamma(100, 5, 1), rweibull(100, 0.5, 0.8))
sex <- c(rep("F", 100), rep("M", 100))

dp <- data.frame(x, sex)
ggplot(dp, aes(x = x, y = stat(density), fill = sex)) + geom_histogram() +
  facet_grid(sex ~ ., labeller = label_both) + 
  labs(x = "Troponin", y = "Density") +
  theme_bw() + theme(text = element_text(size = 14), legend.position = "none")

Non-parametric tests

  • Non-parametric tests compare the full probability distributions between groups;
  • The test statistic is based on the probability that group B presents higher values than group A, i.e., P(B > A);
  • If this probability is equal to 0.5, then the distributions of the two groups overlap each other;
  • This test statistic is valid even when the data follows a Normal distribution.
  • Example 01:
    • A: 1, 3, 5
    • B: 2, 4, 6
    • All possible (A, B) pairs: (1, 2), (1, 4), (1, 6), (3, 2), (3, 4), (3, 6), (5, 2), (5, 4), (5, 6);
    • (A, B) pairs where B > A: (1, 2), (1, 4), (1, 6), (3, 4), (3, 6), (5, 6)
    • P(B > A) = 6/9;
  • Example 02:
    • A: 1, 4, 6
    • B: 2, 4, 5
    • All possible (A, B) pairs: (1, 2), (1, 4), (1, 5), (4, 2), (4, 4), (4, 5), (6, 2), (6, 4), (6,5)
    • (A, B) pairs where B > A: (1, 2), (1, 4), (1, 5), (4, 5)
    • (A, B) pairs where B = A: (4, 4)
    • P(B > A) + 0.5P(A = B) = 4/9 + 0.5/9 = 4.5/9
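
A minimal sketch reproducing the two worked examples; the quantity P(B > A) + 0.5 P(A = B) is the Mann-Whitney U statistic divided by the number of pairs.

Code
A1 <- c(1, 3, 5); B1 <- c(2, 4, 6)
mean(outer(A1, B1, "<") + 0.5 * outer(A1, B1, "=="))  # 6/9 = 0.667
wilcox.test(B1, A1)$statistic / 9                     # same value via the W statistic

A2 <- c(1, 4, 6); B2 <- c(2, 4, 5)
mean(outer(A2, B2, "<") + 0.5 * outer(A2, B2, "=="))  # 4.5/9 = 0.5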

Non-parametric tests

Code
library(dplyr)   # bind_rows() comes from dplyr

n <- 100000
sd <- 0.5

set.seed(1234)
x <- rnorm(n, mean=3, sd = sd)
y <- rnorm(n, mean=3, sd = sd)
aux01 <- data.frame(group = factor(rep(c("A", "B"), each=n)),
                    biomarker = c(x, y), 
                    PI = paste("P(B > A) = ", round(1 - pnorm(0, 0, sd = sd), 2)))

x <- rnorm(n, mean=3, sd = sd)
y <- rnorm(n, mean=3.5, sd = sd)
aux02 <- data.frame(group = factor(rep(c("A", "B"), each=n)),
                    biomarker = c(x, y), 
                    PI = paste("P(B > A) = ", round(1 - pnorm(0, 0.5, sd = sd), 2)))

x <- rnorm(n, mean=3, sd = sd)
y <- rnorm(n, mean=4, sd = sd)
aux03 <- data.frame(group = factor(rep(c("A", "B"), each=n)),
                    biomarker = c(x, y), 
                    PI = paste("P(B > A) = ", round(1 - pnorm(0, 1, sd = sd), 2)))

x <- rnorm(n, mean=3, sd = sd)
y <- rnorm(n, mean=2.5, sd = sd)
aux04 <- data.frame(group = factor(rep(c("A", "B"), each=n)),
                    biomarker = c(x, y), 
                    PI = paste("P(B > A) = ", round(1 - pnorm(0, -0.5, sd = sd), 2)))

x <- rnorm(n, mean=3, sd = sd)
y <- rnorm(n, mean=2, sd = sd)
aux05 <- data.frame(group = factor(rep(c("A", "B"), each=n)),
                    biomarker = c(x, y), 
                    PI = paste("P(B > A) = ", round(1 - pnorm(0, -1, sd = sd), 2)))

dp <- bind_rows(aux01, aux02, aux03, aux04, aux05)

ggplot(dp, aes(x = biomarker, color = group)) + 
  geom_density() + 
  theme_bw() +
  facet_grid(PI ~ ., switch = "y") + 
  labs(y = "Density", x = "Biomarker", color = "Group") +
  theme(legend.position = "right", strip.text.y = element_text(angle = 180)) +
  scale_x_continuous(limits = c(0, 6))

Checking assumptions

Normality assumption

  • Shapiro-Francia, Anderson-Darling, Kolmogorov-Smirnov, Cramer-von Mises, and Pearson are the most common tests;
  • \(H_0:\) The data follow a Normal distribution vs \(H_1:\) The data do not follow a Normal distribution;
  • Ideally, normality should be evaluated from an independent and sufficiently large sample (a sketch of these tests in R follows below);
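
A sketch of how these tests can be run in R, assuming the nortest package, which provides the tests listed above (base R only ships shapiro.test(), the Shapiro-Wilk test).

Code
library(nortest)

set.seed(123)
x <- rnorm(100, 2, 1)

sf.test(x)        # Shapiro-Francia
ad.test(x)        # Anderson-Darling
lillie.test(x)    # Lilliefors correction of Kolmogorov-Smirnov
cvm.test(x)       # Cramer-von Mises
pearson.test(x)   # Pearson chi-square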

Normality assumption

Code
set.seed(123)
df <- data.frame(x = rnorm(100, 2, 1))
ggplot(df, aes(sample = x)) +
  stat_qq() + 
  stat_qq_line() +
  theme_bw() +
  labs(y = "Sampling quantiles", x = "Theoretical quantiles")

p-value = 0.703

  • Sampling quantiles are compared to theoretical quantiles from a Normal(0, 1);

  • The data follows the Normal distribution perfectly if the points fall on the diagonal line.

Normality assumption

Code
set.seed(109)
df <- data.frame(x = rnorm(10, 2, 1))
ggplot(df, aes(sample = x)) +
  stat_qq() + 
  stat_qq_line() +
  theme_bw() +
  labs(y = "Sampling quantiles", x = "Theoretical quantiles")

p-value = 0.798

Normality assumption

Code
set.seed(484)
df <- data.frame(x = rnorm(100, 2, 1))
ggplot(df, aes(sample = x)) +
  stat_qq() + 
  stat_qq_line() +
  theme_bw() +
  labs(y = "Sampling quantiles", x = "Theoretical quantiles")

p-value = 0.13

Normality assumption

Code
set.seed(60)
# rgumbel() is not in base R; it is provided by packages such as 'evd'
df <- data.frame(x = rgumbel(100, 2, 1))
ggplot(df, aes(sample = x)) +
  stat_qq() + 
  stat_qq_line() +
  theme_bw() +
  labs(y = "Sampling quantiles", x = "Theoretical quantiles")

p-value < 0.001

Normality assumption

Code
set.seed(2430)
df <- data.frame(x = rgamma(10, 2, 1))
ggplot(df, aes(sample = x)) +
  stat_qq() + 
  stat_qq_line() +
  theme_bw() +
  labs(y = "Sampling quantiles", x = "Theoretical quantiles")

p-value = 0.091

Normality assumption

Code
set.seed(1751)
df <- data.frame(x = rexp(10, 2))
ggplot(df, aes(sample = x)) +
  stat_qq() + 
  stat_qq_line() +
  theme_bw() +
  labs(y = "Sampling quantiles", x = "Theoretical quantiles")

p-value = 0.010

Normality assumption

Code
set.seed(5286)
df <- data.frame(x = rnorm(10000, 2, 1))
ggplot(df, aes(sample = x)) +
  stat_qq() + 
  stat_qq_line() +
  theme_bw() +
  labs(y = "Sampling quantiles", x = "Theoretical quantiles")

p-value = 0.001

Normality assumption

  • For small sample sizes,
    • If the test rejects normality, there is evidence that the data is not normal;
    • If the test does not reject normality, there are two possible reasons: lack of power, or the data indeed follows the Normal distribution;
  • For large sample sizes,
    • If the test rejects normality, there are two possible reasons: small departures from the Normal distribution are enough evidence to reject the null hypothesis, or the data truly does not follow the Normal distribution;
    • If the test does not reject normality, we can only say that we do not have evidence against the normality assumption. We cannot state that our data follows the Normal distribution because we cannot accept the null hypothesis.
  • The most conservative assumption is to assume non-normality.
  • Graphical evaluation using quantile plots (Q-Q plots) or histograms is a more suitable alternative for small sample sizes;
  • Biological reasoning is essential when assessing normality.

Equal Variances

  • The most common tests are Levene's, Bartlett's, and the F test;
  • \(H_0:\) Equal variances vs \(H_1:\) Unequal variances;
  • If the sample size is small, all tests lack power to reject the null hypothesis when it is false;
  • If the sample size is large, small differences between the variances will be enough evidence to reject the null hypothesis;
  • Graphical approaches or biological reasoning are recommended (a sketch of these tests follows below).
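
A minimal sketch of these tests in R with simulated, illustrative data; var.test() (F test) and bartlett.test() are in base R, and Levene's test is available as leveneTest() in the car package.

Code
set.seed(123)
x <- rnorm(50, 2, 1)                 # females (illustrative)
y <- rnorm(50, 2, 1.5)               # males, larger variance
g <- factor(rep(c("F", "M"), each = 50))

var.test(x, y)                       # F test (assumes normality)
bartlett.test(c(x, y) ~ g)           # Bartlett test
car::leveneTest(c(x, y) ~ g)         # Levene test (more robust to non-normality)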

Qualitative variables

Comparing proportions

  • \(X_i:\) Occurrence of myocardial infarction for females, \(i = 1, \ldots, n_F\);
  • \(Y_i:\) Occurrence of myocardial infarction for males, \(i = 1, \ldots, n_M\);
  • \(X_1, \ldots, X_{n_F} \mbox{ i.i.d. r.v.} \sim Bernoulli(p_F)\);
  • \(Y_1, \ldots, Y_{n_M} \mbox{ i.i.d. r.v.} \sim Bernoulli(p_M)\);
  • \(H_0\): \(p_F = p_M\)
  • \(H_1\): \(p_F \neq p_M\);
  • Fisher’s Exact Test;
  • If it is not possible to calculate Fisher's Exact Test, then the Chi-squared test is used (see the sketch below).
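
A minimal sketch with an illustrative 2 x 2 table of myocardial infarction by sex (the counts are made up for the example).

Code
tab <- matrix(c(12, 88,    # females: events, non-events (illustrative counts)
                25, 75),   # males:   events, non-events
              nrow = 2, byrow = TRUE,
              dimnames = list(sex = c("F", "M"), MI = c("yes", "no")))

fisher.test(tab)   # Fisher's exact test
chisq.test(tab)    # Chi-squared test (requires large expected counts)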

Summary

  • The variable of interest is a random variable with an associated probabilistic model;
  • The probabilistic model has unknown parameters;
  • The biological hypothesis can be written as a statistical hypothesis, i.e., a hypothesis about the parameters of the probabilistic model;
  • Choose a test statistic that is a function of the data and helps to test the hypotheses;
  • Under the null hypothesis, the test statistic can be calculated and follows a known sampling distribution;
  • When the data is collected, the test statistic is calculated;
  • The p-value is calculated based on the test statistic value and its sampling distribution, assuming that the null hypothesis is true;
  • Compare the p-value with the significance level to decide whether there is enough evidence to reject the null hypothesis.