Inference

Marcio Diniz | Michael Luu

Cedars Sinai Medical Center

18 October, 2022

Introduction

The goal in statistics is to make inferences from a sample to a large population;
Inference can only be performed based on probabilistic assumptions;

Reference intervals

Troponin I

It is a cardiac and skeletal muscle protein useful in the laboratory diagnosis of heart attack. Test can be performed for confirmation of cardiac muscle damage.

Goal

To find reference intervals of Troponin I assay.

Probabilistic Model

\(X\) = Troponin I levels;
\(X \sim N(\mu, \sigma^2)\);

Reference Intervals

Code

data_plot <- data.frame(x = rep(c(0, 20))) 

ggplot(data_plot, aes(x = x)) +
  stat_function(fun = dnorm, args = list(7, 2)) + 
  geom_area(fill = "blue",
            stat = "function", fun = dnorm,
            args = list(7, 2),
            xlim = c(0, 15),
            alpha = 0.5) +
  labs(y = "Density", x = "Troponin I") +
  theme_bw(base_size = 20) + 
  theme(legend.position = "None") +
  geom_vline(xintercept = 
               c(qnorm(0.025, 7, 2), qnorm(0.975, 7, 2)))

Normal distribution

Reference Intervals

Code

ggplot(data_plot, aes(x = x)) +
  stat_function(fun = dnorm, args = list(10, 1.5)) + 
  geom_area(fill = "blue",
            stat = "function", fun = dnorm,
            args = list(10, 1.5),
            xlim = c(0, 15),
            alpha = 0.5) +
  labs(y = "Density", x = "Troponin I") +
  theme_bw(base_size = 20) + 
  theme(legend.position = "None") +
  geom_vline(xintercept = 
               c(qnorm(0.025, 10, 1.5), qnorm(0.975, 10, 1.5)))

Normal distribution

Reference Intervals

Code

ggplot(data_plot, aes(x = x)) +
  stat_function(fun = dnorm, args = list(5, 1)) + 
  geom_area(fill = "blue",
            stat = "function", fun = dnorm,
            args = list(5, 1),
            xlim = c(0, 15),
            alpha = 0.5) +
  labs(y = "Density", x = "Troponin I") +
  theme_bw(base_size = 20) + 
  theme(legend.position = "None") +
  geom_vline(xintercept = 
               c(qnorm(0.025, 5, 1), qnorm(0.975, 5, 1)))

Normal distribution

Reference Intervals

Code

ggplot(data_plot, aes(x = x)) +
  stat_function(fun = dgamma, args = list(5, 1)) + 
  geom_area(fill = "red", stat = "function", fun = dgamma,
            args = list(5, 1),
            xlim = c(0, 20),
            alpha = 0.5)  +
  labs(y = "Density", x = "Troponin I") +
  theme_bw(base_size = 20) + 
  geom_vline(xintercept = 
               c(qgamma(0.025, 5, 1), 
                 qgamma(0.975, 5, 1)))

Gamma distribution

Reference Intervals

Code

ggplot(data_plot, aes(x = x)) +
  stat_function(fun = dgumbel, args = list(5, 1)) + 
  geom_area(fill = "yellow", stat = "function", fun = dgumbel,
            args = list(5, 1),
            xlim = c(0, 20),
            alpha = 0.5)  +
  labs(y = "Density", x = "Troponin I") +
  theme_bw(base_size = 20) + 
  geom_vline(xintercept = 
               c(qgumbel(0.025, 5, 1), 
                 qgumbel(0.975, 5, 1)))

Gumbel distribution

Reference Intervals

What are the challenges?

Define the biomarker distribution;
Define the parameters for a given biormarker distribution.

Solutions

Collect the entire healthy population to measure the Troponin I levels;
Collect a sample of 100 patients from the healthy population to measure the Troponin I levels;

Solution 01

Collect the entire healthy population to measure the Troponin I levels;

Code

set.seed(1234)
data_plot <- data.frame(x = 
                          rnorm(100000, 5, 1))

ggplot(data_plot, aes(x = x, y = after_stat(density))) +
  geom_histogram(bins = 100) + 
  scale_x_continuous(limits = c(0, 20)) + 
  theme_bw(base_size = 20) +
  labs(y = "Density", x = "Troponin I")

Solution 01

Collect the entire healthy population to measure the Troponin I levels;

Code

ggplot(data_plot, aes(x = x, y = after_stat(density))) +
  geom_histogram(bins = 100) + 
  geom_density(colour = "blue", size = 1.2) +
  scale_x_continuous(limits = c(0, 20)) + 
  theme_bw(base_size = 20) +
  labs(y = "Density", x = "Troponin I")

Solution 02

Collect a sample of 100 patients from the healthy population to measure the Troponin I levels;

Code

set.seed(1234)
data_plot <- data.frame(x = 
                          rnorm(100, 5, 1))

ggplot(data_plot, aes(x = x, y = after_stat(density))) +
  geom_histogram(bins = 100) + 
  scale_x_continuous(limits = c(0, 20)) + 
  theme_bw(base_size = 20) +
  labs(y = "Density", x = "Troponin I")

Solution 02

Collect a sample of 100 patients from the healthy population to measure the Troponin I levels;

Code

ggplot(data_plot, aes(x = x, y = after_stat(density))) +
  geom_histogram(bins = 100) + 
  geom_density(colour = "blue", size = 1.2) +
  scale_x_continuous(limits = c(0, 20)) + 
  theme_bw(base_size = 20) +
  labs(y = "Density", x = "Troponin I")

Reference Intervals

Can we properly answer our challenges using a sample?

Limited number of patients makes harder to define the biomarker distribution;
Reference intervals require us to define the biomarker distribution, therefore, studies to define reference intervals have very large sample size;
If we were interested on the average level of Troponin I, then we could assume that the average Troponin level follows a Normal distribution, i.e., \(\sim N(\mu, \sigma^2/n)\) given that we have a medium/large sample size \(n\) based on the Central Limit Theorem;
On the other hand, inferences for the average based on small sample sizes and other questions requires us to choose a specific distribution for our variable of interest;
In both contexts, the parameters of the probability distribution will be unknown and we will need to define them.

How to determine the parameters from a sample?

Point Estimate

Statisticians have developed several methods to get point estimates for the parameters based on the sample;
The most common estimation method is Maximum Likelihood;
Other methods are Moments and Least Squares.

Maximum Likelihood - Example

\(X_1, \ldots, X_4\) are i.i.d random variables following \(Bernoulli(p)\) such that \(X_i = \begin{cases} 1 \quad \mbox{if the patient has cancer}\\ 0 \quad \mbox{otherwise} \end{cases}\);

Maximum Likelihood - Example 01

What is the value of \(p\) which would maximize the likelihood to have observed the sample \(X_1 = 1, X_2 = 1, X_3 = 0, X_4 = 0\)?
- If \(p = 0.1\), then the likelihood to observe 2 out of 4 is \(0.1 \times 0.1 \times 0.9 \times 0.9 = 0.0081\);
- If \(p = 0.5\), then the likelihood to observe 2 out of 4 is \(0.5 \times 0.5 \times 0.5 \times 0.5 = 0.0625\)
- If \(p = 0.9\), then the likelihood to observe 2 out of 4 is \(0.9 \times 0.9 \times 0.1 \times 0.1 = 0.0081\);

Maximum Likelihood - Example 01

Code

likelihood_binom <- function(p, x, size){
  out <- dbinom(x = x, size = size, prob = p)/choose(size, x)
  return(out)
}

data_plot <- data.frame(p = c(0, 1))

#| fig-cap: "Observed sample: 1, 0, 1, 0"
ggplot(data_plot, aes(x = p)) +
  stat_function(fun = likelihood_binom, 
                args = list(x = 2, size = 4)) + 
  labs(y = "Likelihood", x = "p") +
  theme_bw(base_size = 20) + 
  theme(legend.position = "None")

Maximum Likelihood - Example 02

What is the value of \(p\) which would maximize the likelihood to have observed the sample \(X_1 = 1, X_2 = 1, X_3 = 1, X_4 = 0\)?
- If \(p = 0.1\), then the likelihood to observe 2 out of 4 is \(0.1 \times 0.1 \times 0.1 \times 0.9 = 0.0009\);
- If \(p = 0.5\), then the likelihood to observe 2 out of 4 is \(0.5 \times 0.5 \times 0.5 \times 0.5 = 0.0625\);
- If \(p = 0.75\), then the likelihood to observe 2 out of 4 is \(0.75 \times 0.75 \times 0.75 \times 0.25 = 0.1054\)

Maximum Likelihood - Example 02

Code

ggplot(data_plot, aes(x = p)) +
  stat_function(fun = likelihood_binom, 
                args = list(x = 3, size = 4)) + 
  labs(y = "Likelihood", x = "p") +
  theme_bw(base_size = 20) + 
  theme(legend.position = "None")

Observed sample 1, 1, 1, 0

Point Estimates

Estimators are quantities calculated from the observed sample to estimate parameters of a probability distribution;
Different distributions have parameters with different estimators;
Estimators can be characterized based on their bias (towards the true value in the study population) and variance (around the true value in the study population);
Ideally, estimators have low bias and low variance.

Point Estimates

Examples

If the data is described with \(N(\mu, \sigma^2)\), then
- the sampling average, \(\bar{X}\), is the maximum likelihood estimator for the populational mean \(\mu\), i.e., \(\hat{\mu} = \bar{X}\);
- the sampling variance can be calculated in two ways:
  - \[\hat{\sigma}^2 = \frac{1}{n}\sum_{i = 1}^n (x_i - \bar{x})^2, \quad \quad S^2 = \frac{1}{n - 1}\sum_{i = 1}^n (x_i - \bar{x})^2\]
- The estimator \(\hat{\sigma}^2\) is the maximum likelihood estimator, but it is biased. Therefore, \(S^2\) is preferred.

Confidence intervals

Why do we need confidence intervals?

Code

shape <- 1
rate <- 0.2

set.seed(1257)
data_plot <- data.frame(a = rgamma(100, shape, rate),
                        b = rgamma(100, shape, rate),
                        c = rgamma(100, shape, rate),
                        d = rgamma(100, shape, rate))
data_plot <- data_plot %>% 
  pivot_longer(col = a:d)

tmp <- data_plot %>% group_by(name) %>%
  summarise(m = mean(value))

data_plot <- data_plot %>% 
  mutate(name = 
              factor(name, 
                     levels = 
                       c("a", "b", 
                         "c", "d"),  
                     labels =
                       format(round(tmp[[2]], 2), nsmall = 2)))

ggplot(data_plot, aes(x = value, y = after_stat(density))) +
  geom_histogram(bins = 20) + 
  theme_bw() +
  labs(y = "Density", x = "Troponin I") +
  facet_wrap(~ name, ncol = 2, 
             labeller = 
               label_bquote(bar(x) ==  .(as.character(name)))) +
  theme(strip.text = element_text(size = 12))

Every point estimate is also a random variable

Confidence Intervals

Confidence levels

We can construct a confidence interval for the average troponin level;
The uncertainty around the point estimate is measured by the confidence level \(100 \times(1 - \alpha)\%\);
Usual choices of \(\alpha\) are \(0.10\), \(0.05\), and \(0.01\):
- 90% CI: 4.82 to 5.18;
- 95% CI: 4.79 to 5.21;
- 99% CI: 4.72 to 5.27.

Confidence Intervals

How to interpret them?

If you repeat your experiment 100 times, then calculate a \(100 \times(1 - \alpha)\%\) confidence interval for each experiment follows that \(100 \times(1 - \alpha)\%\) of the calculated intervals will contain the true value of the parameter.

Confidence Intervals

Width

The confidence interval width is a function of three parameters:
- Decreases as the sample size increases;
- Increases as the variability increases;
- Increases as the confidence level increases.

Confidence Intervals

Example

\(X_1, \ldots, X_n\) such that \(X_i \sim Normal(\mu, \sigma^2)\) with known \(\sigma^2\);
The \(100 \times(1 - \alpha)\%\) for \(\mu\) is

\[ \left[\bar{X} - z_{1 - \alpha/2}\frac{\sigma}{\sqrt{n}} ; \bar{X} + z_{1 - \alpha/2}\frac{\sigma}{\sqrt{n}} \right] \]

The quantile \(1 - \alpha/2\) of the \(N(0, 1)\), \(z_{1 - \alpha/2}\), depends on the confidence level:
- \(\alpha = 0.01\), then \(z_{1 - \alpha/2} = 2.32\);
- \(\alpha = 0.05\), then \(z_{1 - \alpha/2} = 1.96\);
- \(\alpha = 0.10\), then \(z_{1 - \alpha/2} = 1.64\);

Confidence Intervals

Can we test hypothesis using confidence intervals?

We have the hypotheses \(H: \mu = 5\), then we obtain the 95% confidence interval [4.79 ; 5.21] for \(\mu\).
What can we tell about our hypothesis?
What would have changed if we had observed the 95% confidence interval [7.84 ; 8.24]?
Do we have any chance to commit a wrong decision when we use confidence interval to test a hypothesis?
How much is this error?
How to interpret such error?

Confidence Intervals

Can we compare groups using confidence intervals?

We are not sure if the biomarker has different or equal means by sex, i.e., we are interested in the hypothesis \(H: \mu_M = \mu_F\), i.e., \(H: \mu_M - \mu_F = 0\);
We calculate 95% confidence intervals for men (\(\mu_M\)), [7.84 ; 8.24], and for women, (\(\mu_F\)), [3.85 ; 4.13].
What can we tell about our hypothesis?
What would have changed if we had observed the 95% confidence interval [4.08 ; 4.51] for \(\mu_M\)?
What can we tell about out hypothesis if we have a 95% confidence interval for \(\mu_M - \mu_F\), [-0.67 ; -0.08]?

Confidence Intervals

Example 01

Code

set.seed(1234)
female <- data.frame(sex = "Female", measure = rnorm(100, 4, 1))
male <- data.frame(sex = "Male", measure = rnorm(100, 8, 1))
dataset <- rbind(female, male)


data_plot <- dataset %>% 
  select(sex, measure) %>% 
  group_by(sex) %>%
  summarize(mean = mean(measure, na.rm = TRUE), 
            sd = sd(measure, na.rm = TRUE))

ggplot(data_plot, aes(x = sex, y = mean, fill = sex)) +
  geom_bar(position = position_dodge(), stat = "identity") +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), 
                width = .1) +
  scale_fill_brewer("Sex", palette = "Set1") +
  theme_bw(base_size = 20) + 
  labs(x = "Sex", y = "Troponin I") +
  theme(legend.position = "none")

Mean and SD

Confidence Intervals

Example 01

Code

se <- function(x, na.rm = TRUE){
  out <- sd(x, na.rm = na.rm)/sqrt(length(na.omit(x)))
}

data_plot <- dataset %>% 
  select(sex, measure) %>% 
  group_by(sex) %>%
  summarize(mean = mean(measure, na.rm = TRUE),
            se = se(measure, na.rm = TRUE))


ggplot(data_plot, aes(x = sex, y = mean, fill = sex)) +
  geom_bar(position = position_dodge(), stat = "identity") +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = .1) +
  scale_fill_brewer("Sex", palette = "Set1") +
  theme_bw(base_size = 20) + labs(x = "Sex", y = "Troponin I") +
  theme(legend.position = "none")

Mean and SE

Confidence Intervals

Example 01

Code

se <- function(x, na.rm = TRUE){
  out <- sd(x, na.rm = na.rm)/sqrt(length(na.omit(x)))
}

data_plot <- dataset %>% 
  select(sex, measure) %>% 
  group_by(sex) %>%
  summarize(mean = mean(measure, na.rm = TRUE),
            se = se(measure, na.rm = TRUE))


ggplot(data_plot, aes(x = sex, y = mean, fill = sex)) +
  geom_bar(position = position_dodge(), stat = "identity") +
  geom_errorbar(aes(ymin = mean - 2*se, ymax = mean + 2*se), width = .1) +
  scale_fill_brewer("Sex", palette = "Set1") +
  theme_bw(base_size = 20) + labs(x = "Sex", y = "Troponin I") +
  theme(legend.position = "none")

Code

test01 <- t.test(dataset$measure ~ dataset$sex)

Comparisons between groups are often done checking whether individual confidence intervals (or approximately, mean \(\pm\) 2 x standard error) overlap;
In case confidence intervals do not overlap, then we conclude that groups are different;
p-value < 0.001 indicating that there is enough evidence that groups have different average at 5% significance level.

Confidence Intervals

Example 02

Code

set.seed(2847)
female <- data.frame(sex = "Female", measure = rnorm(100, 4, 1))
male <- data.frame(sex = "Male", measure = rnorm(100, 4, 1))
dataset <- rbind(female, male)

se <- function(x, na.rm = TRUE){
  out <- sd(x, na.rm = na.rm)/sqrt(length(na.omit(x)))
}

data_plot <- dataset %>% 
  select(sex, measure) %>% 
  group_by(sex) %>%
  summarize(mean = mean(measure, na.rm = TRUE),
            se = se(measure, na.rm = TRUE))


ggplot(data_plot, aes(x = sex, y = mean, fill = sex)) +
  geom_bar(position = position_dodge(), stat = "identity") +
  geom_errorbar(aes(ymin = mean - 2*se, ymax = mean + 2*se), width = .1) +
  scale_fill_brewer("Sex", palette = "Set1") +
  theme_bw(base_size = 20) + labs(x = "Sex", y = "Troponin I") +
  theme(legend.position = "none")

Code

test01 <- t.test(dataset$measure ~ dataset$sex)

Following this rationale, we can conclude that confidence intervals that overlap indicate that groups are not different;
The p-value is 0.612 indicating that there is not enough evidence to reject the hypothesis that groups have the same average at 5% significance level.

Confidence Intervals

Counter-Example

Code

set.seed(2847)
female <- data.frame(sex = "Female", measure = rnorm(100, 4, 1))
male <- data.frame(sex = "Male", measure = rnorm(100, 4.3, 1))
dataset <- rbind(female, male)

se <- function(x, na.rm = TRUE){
  out <- sd(x, na.rm = na.rm)/sqrt(length(na.omit(x)))
}

data_plot <- dataset %>% 
  select(sex, measure) %>% 
  group_by(sex) %>%
  summarize(mean = mean(measure, na.rm = TRUE),
            se = se(measure, na.rm = TRUE))


ggplot(data_plot, aes(x = sex, y = mean, fill = sex)) +
  geom_bar(position = position_dodge(), stat = "identity") +
  geom_errorbar(aes(ymin = mean - 2*se, ymax = mean + 2*se), width = .1) +
  scale_fill_brewer("Sex", palette = "Set1") +
  theme_bw(base_size = 20) + labs(x = "Sex", y = "Troponin I") +
  theme(legend.position = "none")

Code

test01 <- t.test(dataset$measure ~ dataset$sex)

However, that is not always the case:
- In this counter-example, the upper confidence limit for the average of females is 4.12, while the lower confidence limit for the average of males is 4.09;
- Even though the confidence intervals overlap, the p-value is 0.012.

Confidence Intervals

Do not compare confidence intervals

If we want to test hypothesis of difference between groups, then a confidence interval for the difference between averages should be calculated, [-0.67 ; -0.08];
If the confidence interval for the difference of averages does not contain zero, then groups are different. Otherwise, there is not enough evidence that groups are different.

Confidence Intervals

Mathematical Formulas for known \(\sigma^2\)

\(100 \times(1 - \alpha)\%\) Confidence Interval for \(\mu_M\) with known \(\sigma^2_M\):

\[ \left[\bar{X}_M - z_{1 - \alpha/2}\sqrt{\frac{\sigma_M^2}{n_M}} ; \bar{X}_M + z_{1 - \alpha/2}\sqrt{\frac{\sigma_M^2}{n_M}} \right] \]

\(100 \times(1 - \alpha)\%\) Confidence Interval for \(\mu_F\) with known \(\sigma^2_F\):

\[ \left[\bar{X}_F - z_{1 - \alpha/2}\sqrt{\frac{\sigma_F^2}{n_F}} ; \bar{X}_F + z_{1 - \alpha/2}\sqrt\frac{\sigma_F^2}{{n_F}} \right] \]

where \(Z \sim N(0, 1)\).

Confidence Intervals

Mathematical Formulas for known \(\sigma^2\)

\(100 \times(1 - \alpha)\%\) Confidence Interval for \(\mu_M - M_F\) with known \(\sigma^2_M\) and \(\sigma^2_F\):

\[ \left[\bar{X}_M - \bar{X}_F - z_{1 - \alpha/2}\sqrt{\frac{\sigma_M^2}{n_M} + \frac{\sigma_F^2}{n_F}} ; \bar{X}_M - \bar{X}_F + z_{1 - \alpha/2}\sqrt{\frac{\sigma_M^2}{n_M} + \frac{\sigma_F^2}{n_F}} \right] \]

where \(Z \sim N(0, 1)\).

\(\sqrt{\frac{\sigma_M^2}{n_M}} - \sqrt\frac{\sigma_F^2}{{n_F}} \neq \sqrt{\frac{\sigma_M^2}{n_M} + \frac{\sigma_F^2}{n_F}}\)

Confidence Intervals

Mathematical Formulas for unknown \(\sigma^2\)

\(100 \times(1 - \alpha)\%\) Confidence Interval for \(\mu_M\) with unknown \(\sigma^2_M\):

\[ \left[\bar{X}_M - t_{1 - \alpha/2, n - 1}\sqrt{\frac{S_M^2}{n_M}} ; \bar{X}_M + t_{1 - \alpha/2, n - 1}\sqrt{\frac{S_M^2}{n_M}} \right] \]

\(100 \times(1 - \alpha)\%\) Confidence Interval for \(\mu_F\) with unknown \(\sigma^2_F\):

\[ \left[\bar{X}_F - t_{1 - \alpha/2, n - 1}\sqrt{\frac{S_F^2}{n_F}} ; \bar{X}_F + t_{1 - \alpha/2, n - 1}\sqrt\frac{S_F^2}{{n_F}} \right] \]

Confidence Intervals

Mathematical Formulas for unknown \(\sigma^2\)

\(100 \times(1 - \alpha)\%\) Confidence Interval for \(\mu_M - \mu_F\) with unknown \(\sigma^2_M\) and \(\sigma^2_F\):

\[ \left[\bar{X}_M - \bar{X}_F - t_{1 - \alpha/2, n - 1}\sqrt{\frac{S_M^2}{n_M} + \frac{S_F^2}{n_F}} ; \bar{X}_M - \bar{X}_F + z_{1 - \alpha/2, n - 1}\sqrt{\frac{S_M^2}{n_M} + \frac{S_F^2}{n_F}} \right] \]

where \(T \sim t-student(n - 1)\).

\(\sqrt{\frac{\S_M^2}{n_M}} - \sqrt\frac{S_F^2}{{n_F}} \neq \sqrt{\frac{S_M^2}{n_M} + \frac{S_F^2}{n_F}}\)

Summary

Outcomes are modeled by distribution of probability such that we make inferences on the parameters of these probability distributions;
We often assume a probability distribution as sample size is limited;
Parameters of probability distribution are estimated based on the observed sample;
Confidence intervals shows the uncertainty around those estimates;
The uncertainty is defined based on confidence level that is often defined as function of \(\alpha\), \(100\times(1-\alpha)\%\) CI;
We can test hypothesis using confidence intervals evaluating if the confidence interval contains or does not contain the hypothesis to be tested;
\(\alpha\) is the probability to reject a hypothesis given that hypothesis was true;
We can compare groups using confidence intervals, but we cannot use individual intervals to compare them. We need a confidence interval of the difference between the groups.