Probability

Marcio Diniz | Michael Luu

Cedars Sinai Medical Center

11 October, 2022

Introduction

Why do we need to learn probability?

The world is full of uncertainty;

  • How do we make decisions when we have uncertainty?
  • How can we generalize results from a sample to a population?

Inference

Brief Probability History

Louvre Museum

Rolling a die

  • Dice have been used since before recorded history, and it is uncertain where they originated;
  • The Egyptian game of Senet was played with dice. Senet was played before 3000 BC and up to the 2nd century AD. It was likely a racing game, but there is no scholarly consensus on the rules of Senet;
  • In 1654, the mathematical methods of probability arose in the correspondence of Pierre de Fermat and Blaise Pascal on such questions as the fair division of the stakes in an interrupted game of chance, building on earlier work by Gerolamo Cardano in the 16th century;
  • In 1812, Pierre-Simon Laplace introduced a host of new ideas and mathematical techniques in his book, Théorie Analytique des Probabilités. Before Laplace, probability theory was solely concerned with developing a mathematical analysis of games of chance. Laplace applied probabilistic ideas to many scientific and practical problems;
  • In 1933, the Russian mathematician Andrey Kolmogorov outlined an axiomatic approach that forms the basis for the modern theory.

First steps

Random Experiment

  • Roll a die;
  • Possible results: \(\Omega = \{1, 2, 3, 4, 5, 6\}\);
  • What is the probability of observing the result 4?
  • If the die is fair, we expect to observe the face 4 in 1 out of 6 rolls.
  • Let’s try to observe it in practice:
Code
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
dice <- sample(x = 1:6, size = 6, replace = TRUE)
table(dice)
dice
1 2 4 5 6 
1 1 2 1 1 
  • Did we observe the face 4 with frequency 1 out of 6 when rolling a die 6 times?

First steps

Random Experiment

  • How about if we roll a die 1,000 times?
Code
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
dice <- sample(x = 1:6, size = 1000, replace = TRUE)
table(dice)
dice
  1   2   3   4   5   6 
161 153 188 149 157 192 

It is hard to say; let’s calculate the proportions:

Code
tmp <- table(dice)
prop.table(tmp)
dice
    1     2     3     4     5     6 
0.161 0.153 0.188 0.149 0.157 0.192 

Are the proportions equal to 1/6, i.e., 0.166…?

Code
prop.table(tmp) == 1/6
dice
    1     2     3     4     5     6 
FALSE FALSE FALSE FALSE FALSE FALSE 

First steps

Random Experiment

  • How about if we roll a die 100,000 times?
Code
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
dice <- sample(x = 1:6, size = 100000, replace = TRUE)
table(dice)
dice
    1     2     3     4     5     6 
16745 16806 16453 16757 16714 16525 

It is hard to say; let’s calculate the proportions:

Code
tmp <- table(dice)
prop.table(tmp)
dice
      1       2       3       4       5       6 
0.16745 0.16806 0.16453 0.16757 0.16714 0.16525 

Are the proportions equal to 1/6, i.e., 0.166…?

Code
prop.table(tmp) == 1/6
dice
    1     2     3     4     5     6 
FALSE FALSE FALSE FALSE FALSE FALSE 

First steps

Random Experiment

  • How about if we roll a die 10,000,000 times?
Code
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
dice <- sample(x = 1:6, size = 10000000, replace = TRUE)
table(dice)
dice
      1       2       3       4       5       6 
1666872 1668473 1667449 1665194 1667349 1664663 

It is hard to say; let’s calculate the proportions:

Code
tmp <- table(dice)
prop.table(tmp)
dice
        1         2         3         4         5         6 
0.1666872 0.1668473 0.1667449 0.1665194 0.1667349 0.1664663 

Are the proportions equal to 1/6, i.e., 0.166…?

Code
prop.table(tmp) == 1/6
dice
    1     2     3     4     5     6 
FALSE FALSE FALSE FALSE FALSE FALSE 

Probability distributions

  • Each experiment has a set of possible outcomes, which can be:
    • finite or non-finite;
    • enumerable or non-enumerable;
  • For example,
    • Number of mice with metastasis in a group of 10 mice is an outcome with a finite and enumerable number of possible values;
    • Number of laser photons hitting a detector in a particular time interval is an outcome with a non-finite and enumerable number of possible outcomes;
    • Gene expression is an outcome with a non-finite and non-enumerable number of values.
  • If the outcome is random, then we say that the outcome is a random variable;
  • A random variable is characterized by a distribution of probability;
  • A probability distribution associates each possible value of the outcome with the probability of observing that value;
  • Therefore, the observed frequencies from a sample converge to the probability distribution as the sample size increases, as sketched below.
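
As a quick check of this convergence, a minimal sketch (base R only) that tracks the running proportion of the face 4 as the number of rolls grows:

Code
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
rolls <- sample(x = 1:6, size = 10000, replace = TRUE)
# cumulative proportion of the face 4 after 1, 2, ..., 10000 rolls
running_prop <- cumsum(rolls == 4) / seq_along(rolls)
plot(running_prop, type = "l", xlab = "Number of rolls",
     ylab = "Proportion of face 4")
abline(h = 1/6, lty = 2) # the theoretical probability 1/6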

Probability distributions

Types

  • Probability distributions can be discrete, for enumerable outcomes, or continuous, for non-enumerable outcomes (and, as an approximation, for enumerable ones);

Discrete

  • Discrete probability distributions describe experiments with enumerable outcomes such as
    • presence of metastasis;
    • number of cells with mutations per replication cycle;
    • number of reads (sequences obtained from fragments of a gene) in an RNA-Seq experiment;
    • number of enriched genes for a given set.

Continuous

  • Continuous probability distributions describe experiments with non-enumerable outcomes such as
    • gene expression;
    • biomarker levels;
    • time-to-event;
  • They also can be used for enumerable outcomes.

Probability distributions are mathematical models

Models of the solar system and of DNA

  • Models are a simplified description of the real world;
  • Every scientific model is defined by a few parameters that are essential to represent a phenomenon;
  • Probability distributions are mathematical models to describe uncertainty;
  • Each probability distribution is also characterized by parameters;
  • What are the parameters of the probability distribution characterizing the outcome of rolling a fair die?
    • 6 possible outcomes;
    • each outcome is equally likely;
  • Furthermore, probability distributions are often characterized by their moments:
    • Mean;
    • Variance = Squared Standard Deviation;
    • Kurtosis;
    • Skewness.
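
As a minimal sketch, the four moments can be estimated from a sample of dice rolls with base R (using the standard moment estimators for skewness and excess kurtosis):

Code
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
x <- sample(x = 1:6, size = 10000, replace = TRUE)
m <- mean(x)
s <- sd(x)
mean(x)                   # sample mean
var(x)                    # sample variance
mean(((x - m) / s)^3)     # sample skewness (close to 0 for a fair die)
mean(((x - m) / s)^4) - 3 # sample excess kurtosis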

Discrete distributions

Discrete Uniform distribution

  • It is the simplest distribution of probability for discrete variables;
  • It has two main parameters \(a\) and \(b\) indicating the lowest and highest integer that can be observed;
  • It is denoted by \(X \sim U\{a, \ldots, b\}\);

Probability distribution

  • The random variable has \(n = b - a + 1\) possible values (\(a, \ldots, b\)) that are equally likely to occur, i.e., \(1/n\);
  • \(P(X = x) = \displaystyle\frac{1}{n}\);

Moments

  • \(E(X) = \displaystyle\frac{a + b}{2}\);
  • \(Var(X) = \displaystyle\frac{(b - a + 1)^2 - 1}{12}\);
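
A minimal sketch that verifies these formulas for \(a = 1\) and \(b = 6\) by direct enumeration of the probability distribution:

Code
a <- 1
b <- 6
x <- a:b
n <- b - a + 1
sum(x * (1/n))                  # E(X) = (a + b)/2 = 3.5
sum((x - (a + b)/2)^2 * (1/n))  # Var(X) by enumeration
((b - a + 1)^2 - 1) / 12        # Var(X) from the closed-form formula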

Discrete Uniform distribution

Example

  • Experiment: Roll a die;

  • \(X \sim U\{1, \ldots, 6\} \rightarrow a = 1, b = 6\)

  • \(E(X) = 3.5\)

  • \(Var(X) = 2.916667 \rightarrow SD(X) = 1.707825\)

  • In a sample size of 10,

Code
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
dice <- sample(x = 1:6, size = 10, replace = TRUE)
table(dice)
dice
1 2 4 5 6 
1 2 3 2 2 
Code
mean(dice)
[1] 3.9
Code
var(dice)
[1] 2.988889
Code
sd(dice)
[1] 1.72884

Discrete Uniform distribution

Example

  • \(X \sim U\{1, \ldots, 6\} \rightarrow a = 1, b = 6\)
  • \(E(X) = 3.5\)
  • \(Var(X) = 2.916667 \rightarrow SD(X) = 1.707825\)
  • In a sample of 100,000,
Code
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
dice <- sample(x = 1:6, size = 100000, replace = TRUE)
table(dice)
dice
    1     2     3     4     5     6 
16745 16806 16453 16757 16714 16525 
Code
mean(dice)
[1] 3.49464
Code
var(dice)
[1] 2.9166
Code
sd(dice)
[1] 1.707806

Discrete Uniform distribution

Code
n <- 6
data_plot <- data.frame(x = seq(1:n), y = 1/n)

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = seq(1:n)) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12))
n <- 10
data_plot <- data.frame(x = seq(1:n), y = 1/n)

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = seq(1:n)) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12))

a = 1, b = 6

a = 1, b = 10

Bernoulli distribution

  • The random variable has two possible values (0 = failure, 1 = success);
  • \(X \sim B(p)\), where \(p \in [0, 1]\) is the probability of success;

Probability distribution

  • \(P(X = x) = p^x (1- p)^{1 - x}\)

Moments

  • \(E(X) = p\)
  • \(Var(X) = p(1 - p)\)
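
A minimal sketch checking the probability distribution and moments against R's dbinom() with size = 1, which corresponds to a Bernoulli variable:

Code
p <- 0.8
x <- 0:1
dbinom(x, size = 1, prob = p)    # P(X = 0) and P(X = 1) from R
p^x * (1 - p)^(1 - x)            # the same values from the formula above
sum(x * dbinom(x, 1, p))         # E(X) = p
sum((x - p)^2 * dbinom(x, 1, p)) # Var(X) = p(1 - p)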

Bernoulli distribution

Example

  • Experiment: In a given pancreatic adenocarcinoma (PDAC) mouse model, we expect that a mouse will develop metastasis with a probability of 80% without treatment;
  • \(X = \begin{cases} 1 \quad \mbox{if the mouse is metastatic} \\ 0 \quad \mbox{if the mouse is not metastatic;} \end{cases}\)
  • \(X \sim B(0.8) \rightarrow p = 0.8\)
  • \(E(X) = 0.8\)
  • \(Var(X) = 0.16 \rightarrow SD(X) = 0.4\)

Bernoulli distribution

Example

  • \(X \sim B(0.8) \rightarrow p = 0.8\)
  • \(E(X) = 0.8\)
  • \(Var(X) = 0.16 \rightarrow SD(X) = 0.4\)
  • For a sample size of 10 and \(p = 0.8\),
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
mice <- rbinom(n = 10, size = 1, prob = 0.8)
table(mice)
mice
0 1 
1 9 
mean(mice)
[1] 0.9
var(mice)
[1] 0.1
sd(mice)
[1] 0.3162278

Bernoulli distribution

Code
size <- 1
p <- 0.2
data_plot <- data.frame(x = 0:1, y = dbinom(0:1, size, p))

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = 0:1) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12))
size <- 1
p <- 0.8
data_plot <- data.frame(x = 0:1, y = dbinom(0:1, size, p))

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = 0:1) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12))

p = 0.2

p = 0.8

Binomial distribution

  • The random variable is the sum of a sequence of \(m\) independent Bernoulli variables with probability of success \(p\);
  • It is denoted by \(Y \sim Bin(m, p)\).

Probability distribution

  • \(P(Y = y) = \displaystyle{m \choose y} p^y (1- p)^{m - y}\)

Moments

  • \(E(Y) = mp\)
  • \(Var(Y) = mp(1 - p)\)
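
A minimal sketch comparing the formula with R's dbinom() for \(m = 10\) and \(p = 0.8\):

Code
m <- 10
p <- 0.8
y <- 8
dbinom(y, size = m, prob = p)        # P(Y = 8) from R
choose(m, y) * p^y * (1 - p)^(m - y) # the same value from the formula above
m * p                                # E(Y)
m * p * (1 - p)                      # Var(Y)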

Binomial distribution

Example

  • Experiment: \(m = 10\) mice are followed such that each mouse has a probability of about 80% of developing metastasis;
  • \(X_i = \begin{cases} 1 \quad \mbox{if the mouse is metastatic} \\ 0 \quad \mbox{if the mouse is not metastatic;} \end{cases}\)
  • \(Y = \sum\limits_{i = 1}^m X_i\) is the total number of metastatic mice in a group with \(m\) mice;
  • \(Y \in \{0, 1, \ldots, m\}\);
  • \(E(Y) = mp = 10\times0.8 = 8\);
  • \(Var(Y) = mp(1-p) = 10\times0.8\times0.2=1.6\);
  • \(Y \sim Binomial(m, p)\).

Binomial distribution

Example

  • \(Y \sim Binomial(m, p)\);
  • \(E(Y) = mp = 10 \times 0.8 = 8\);
  • \(Var(Y) = mp(1-p) = 10 \times 0.8 \times 0.2 = 1.6\);
  • If we replicate this experiment 10 times,
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
mice <- rbinom(n = 10, size = 10, prob = 0.8)
table(mice)
mice
 7  8  9 10 
 1  6  2  1 
mean(mice)
[1] 8.3
var(mice)
[1] 0.6777778
sd(mice)
[1] 0.8232726

Binomial distribution

Code
n <- 10
size <- 10
p <- 0.2
data_plot <- data.frame(x = 0:size, y = dbinom(0:size, size, p))

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = 0:size) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12))
n <- 10
size <- 10
p <- 0.8
data_plot <- data.frame(x = 0:size, y = dbinom(0:size, size, p))

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = 0:size) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12))

p = 0.2

p = 0.8

Poisson distribution

  • The random variable is the number of occurrences of an event during a specific time interval or area;
  • \(X \sim Po(\lambda)\) where \(\lambda\) is the rate of events;

Probability distribution

  • \(P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}\);

Moments

  • \(E(X) = Var(X) = \lambda\).
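
A minimal sketch comparing the formula with R's dpois() for \(\lambda = 9\):

Code
lambda <- 9
x <- 9
dpois(x, lambda = lambda)              # P(X = 9) from R
lambda^x * exp(-lambda) / factorial(x) # the same value from the formula above
ppois(12, lambda = lambda)             # P(X <= 12), a cumulative probability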

Poisson distribution

Example

  • Experiment: Observe the number of mutations in a genome with about \(10^4\) nucleotides, with a mutation rate of \(9\times10^{-4}\) per nucleotide per replication cycle.

  • \(X \in \{0, 1, 2, 3, \ldots\}\);

  • \(X \sim Po(9) \rightarrow \lambda = 9\) ;

  • \(E(X) = \lambda = 9\);

  • \(Var(X) = \lambda = 9 \rightarrow SD(X) = 3\);

  • In a sample of 10 cell-culture dishes,

set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
mutations <- rpois(n = 10, lambda = 9)
table(mutations)
mutations
 3  5  7  9 10 12 
 1  1  1  1  5  1 
mean(mutations)
[1] 8.6
var(mutations)
[1] 7.6
sd(mutations)
[1] 2.75681

Poisson distribution

Code
n <- 20
lambda <- 1
data_plot <- data.frame(x = 0:n, y = dpois(0:n, lambda))

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = 0:n) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12)) 
n <- 20
lambda <- 9
data_plot <- data.frame(x = 0:n, y = dpois(0:n, lambda))

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = 0:n) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12))

Poisson rate = 1

Poisson rate = 9

Negative Binomial

  • The random variable is the number of occurrences of an event during a specific time interval or area when the counts are more dispersed than a Poisson distribution allows (variance greater than the mean);
  • \(X \sim NegBin(\mu, \sigma^2)\) where \(\mu\) is the mean and \(\sigma^2\) is the variance;

Probability distribution

  • \(P(X = x) = \displaystyle{x + \frac{\mu^2}{\sigma^2 - \mu} - 1 \choose x}\left(\frac{\sigma^2 - \mu}{\sigma^2}\right)^x\left(\frac{\mu}{\sigma^2}\right)^{\mu^2/(\sigma^2 - \mu)}\);

Moments

  • \(E(X) = \mu\);
  • \(Var(X) = \sigma^2\) such that \(\sigma^2 > \mu\);
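
R's dnbinom() and rnbinom() use a (size, mu) parametrization instead of \((\mu, \sigma^2)\); a minimal sketch of the conversion used later in these slides, with a large sample to check the moments:

Code
mu <- 5
sigma2 <- 9
size <- mu^2/(sigma2 - mu)             # convert (mu, sigma2) into R's size parameter
dnbinom(0:3, size = size, mu = mu)     # P(X = 0), ..., P(X = 3)
x <- rnbinom(n = 100000, size = size, mu = mu)
mean(x)                                # approximately mu = 5
var(x)                                 # approximately sigma2 = 9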

Negative Binomial

Example

  • Experiment: We perform RNA-Seq obtaining the number of reads (sequence obtained from a fragment of a gene) of a specific gene;
  • \(X \in \{0, 1, 2, 3, \ldots\}\);
  • \(X \sim NegBin(5, 9) \rightarrow \mu = 5, \sigma^2 = 9\);
  • \(E(X) = 5\);
  • \(Var(X) = 9 \rightarrow SD(X) = 3\)

Negative Binomial

Example

  • \(X \sim NegBin(5, 9) \rightarrow \mu = 5, \sigma^2 = 9\);
  • \(E(X) = 5\);
  • \(Var(X) = 9 \rightarrow SD(X) = 3\)
  • In a sample of 10 patients,
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.

mu <- 5
sigma2 <- 9
size <- mu^2/(sigma2 - mu)

reads <- rnbinom(n = 10, mu = 5, size = size)
table(reads)
reads
1 2 3 4 6 7 8 
1 3 1 2 1 1 1 
mean(reads)
[1] 3.9
var(reads)
[1] 5.655556
sd(reads)
[1] 2.378141

Negative Binomial

Code
n <- 20
mu <- 2
sigma2 <- 9
size <- mu^2/(sigma2 - mu)
data_plot <- data.frame(x = 0:n, y = dnbinom(0:n, 
                                             size = size, 
                                             mu = mu))

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = 0:n) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12))
n <- 20
mu <- 5
sigma2 <- 9 
size <- mu^2/(sigma2 - mu)
data_plot <- data.frame(x = 0:n, y = dnbinom(0:n,
                                             size = size, 
                                             mu = mu))

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = 0:n) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12))

Mean = 2 and Variance = 9

Mean = 5 and Variance = 9

Hypergeometric distribution

  • The random variable is the number of white balls among \(k\) sampled balls given that the urn has \(m\) white balls and \(n\) black balls;
  • \(X \sim HyperGeometric(k, m, n)\);

Probability distribution

  • \(P(X = x) = \frac{{m \choose x} {n \choose k - x}}{{m + n \choose k}}\);

Moments

  • \(E(X) = k\frac{m}{m + n}\);
  • \(Var(X) = k\frac{mn(m+n-k)}{(m + n)^2(m + n-1)}\);
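
A minimal sketch checking these moments numerically with R's dhyper() (note that R orders the arguments as x, m, n, k):

Code
k <- 50
m <- 300
n <- 700
x <- 0:k
p <- dhyper(x, m = m, n = n, k = k)                   # P(X = x) for x = 0, ..., k
mu <- k * m / (m + n)                                 # E(X) = 15
sum(x * p)                                            # numerical check of E(X)
k * (m * n * (m + n - k)) / ((m + n)^2 * (m + n - 1)) # Var(X) = 9.984985
sum((x - mu)^2 * p)                                   # numerical check of Var(X)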

Hypergeometric distribution

Example

  • Experiment: We observe the number of genes that belong to a specific pathway (defined by 300 genes) among the 50 genes found to be associated with diabetes after comparing 1,000 genes between case and control patients.

Enrichment Table

  • \(X \sim HyperGeometric(50, 300, 700) \rightarrow k = 50, m = 300, n = 700\);
  • \(E(X) = 15\);
  • \(Var(X) = 9.984985 \rightarrow SD(X) = 3.159903\);

Hypergeometric distribution

Example

  • \(X \sim HyperGeometric(50, 300, 700) \rightarrow k = 50, m = 300, n = 700\);
  • \(E(X) = 15\);
  • \(Var(X) = 9.984985 \rightarrow SD(X) = 3.159903\);
  • If we replicate this experiment 10 times,
genes_pathway <- rhyper(nn = 10, k = 50, m = 300, n = 700)
table(genes_pathway)
genes_pathway
11 12 13 14 16 17 18 
 1  1  2  2  1  1  2 
mean(genes_pathway)
[1] 14.6
var(genes_pathway)
[1] 6.266667
sd(genes_pathway)
[1] 2.503331

Hypergeometric distribution

Code
n <- 20
s <- 1:20
data_plot <- data.frame(x = s, y = dhyper(s, k = 10, m = 300, n = 700))

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = seq(1, 20, by = 2)) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12))
n <- 20
s <- 1:26
data_plot <- data.frame(x = s, y = dhyper(s, k = 50, m = 300, n = 700))

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = seq(1, 26, by = 2)) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12))

k = 10, m = 300, n = 700

k = 50, m = 300, n = 700

Other discrete distributions

Continuous distributions

Continuous Uniform distribution

  • It is the simplest distribution of probability for continuous variables;
  • \(X \sim U(a, b)\) where \(a\) is the lower limit and \(b\) is the upper limit;

Density function

  • \(f(x) = \begin{cases}\displaystyle\frac{1}{b - a} \quad \mbox{if} \quad a \leq x \leq b \\ 0 \quad \mbox{otherwise}; \end{cases}\);

Moments

  • \(E(X) = \displaystyle\frac{a + b}{2}\);
  • \(Var(X) = \displaystyle\frac{(b - a)^2}{12}\).
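
A minimal sketch: probabilities are obtained from the area under \(f(x)\), e.g., with punif() or by integrating the density:

Code
a <- 0
b <- 1
punif(0.75, min = a, max = b) - punif(0.25, min = a, max = b)  # P(0.25 <= X <= 0.75)
integrate(dunif, lower = 0.25, upper = 0.75, min = a, max = b) # the same area under f(x)
(a + b) / 2     # E(X)
(b - a)^2 / 12  # Var(X)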

Continuous Uniform distribution

Example

  • \(U(0, 1) \rightarrow a = 0, b = 1\);

  • \(E(X) = 0.5\);

  • \(Var(X) = 0.08333 \rightarrow SD(X) = 0.288\)

  • In a sample of 10 uniformly distributed variables,

Code
x <- runif(n = 10, min = 0, max = 1)
mean(x)
[1] 0.4022828
Code
var(x)
[1] 0.09913909
Code
sd(x)
[1] 0.3148636

Continuous Uniform distribution

Code
ggplot(data = data.frame(x = c(-10, 10)), aes(x)) +
  stat_function(fun = dunif, n = 1001, args = list(min = 0, max = 5), aes(colour = "b")) +
  stat_function(fun = dunif, args = list(min = -6, max = 1), aes(colour = "a")) +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("U(-6, 1)", "U(0, 5)"))

Probability distribution vs density function

  • How do we calculate the probability \(P(X = x)\) for a continuous distribution?
  • In the emergency dataset, what is the probability of creatinine = 3.5?
Code
dataset <- read_csv(file = "data/emergency.csv")

data_plot <- dataset

ggplot(data_plot, aes(x = creatinine)) +
  geom_histogram(breaks = seq(0, 12, by = 0.5))  +
  theme_bw(base_size = 25) +
  labs(x = "Creatinine", y = "Frequency") +
  theme(legend.position = "none")
ggplot(data_plot, aes(x = creatinine, y = after_stat(density))) +
  geom_histogram(breaks = seq(0, 12, by = 0.5))  +
  theme_bw(base_size = 25) +
  labs(x = "Creatinine", y = "Density") +
  theme(legend.position = "none")

  • The probability for continuous distributions is defined as the area under the density curve between two points on the x-axis;
  • What is the area under the curve at creatinine = 3.5?
    • \(P(X = 3.5) = 0\);
  • We can only calculate probabilities between two points, i.e., \(P(a \leq X \leq b)\);
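
A minimal sketch with the emergency dataset loaded above (using the creatinine column): the proportion of observations exactly equal to 3.5 is essentially zero, while the proportion falling in an interval estimates the corresponding probability:

Code
mean(dataset$creatinine == 3.5)                          # estimate of P(X = 3.5)
mean(dataset$creatinine > 3 & dataset$creatinine <= 3.5) # estimate of P(3 < X <= 3.5)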

Probability distribution vs density function

Code
tab <- data_plot %>% 
  mutate(creatinine_bin = cut(creatinine, seq(0, 12, 0.5))) %>% 
  group_by(creatinine_bin) %>% 
  summarize(frequency = n()) %>%
  mutate(relative_frequency = frequency/sum(frequency),
         interval_width = 0.5,
         density = relative_frequency/interval_width) %>%
  mutate(relative_frequency = 
           format(relative_frequency, digits = 3, nsmall = 3),
         density = 
           format(density, digits = 3, nsmall = 3),
         interval_width = as.character(interval_width)) %>%
  add_row(creatinine_bin = "Total", frequency = 149,
          relative_frequency = "1", interval_width = "",
          density = "") 

gt(tab)  %>%
  cols_label(
    creatinine_bin = md("**Creatinine Bin**"),
    frequency = md("**Frequency**"),
    relative_frequency = md("**Relative Frequency**"),
    interval_width = md("**Interval Width**"),
    density = md("**Density**"),
  ) %>%
  tab_footnote(
    footnote = "Relative Frequency = Density × Interval Width",
    locations = cells_column_labels(
      columns = density
    )
  )
Creatinine Bin Frequency Relative Frequency Interval Width Density1
(0,0.5] 2 0.01342 0.5 0.0268
(0.5,1] 50 0.33557 0.5 0.6711
(1,1.5] 40 0.26846 0.5 0.5369
(1.5,2] 11 0.07383 0.5 0.1477
(2,2.5] 17 0.11409 0.5 0.2282
(2.5,3] 11 0.07383 0.5 0.1477
(3,3.5] 4 0.02685 0.5 0.0537
(3.5,4] 2 0.01342 0.5 0.0268
(4,4.5] 2 0.01342 0.5 0.0268
(4.5,5] 4 0.02685 0.5 0.0537
(5,5.5] 2 0.01342 0.5 0.0268
(5.5,6] 1 0.00671 0.5 0.0134
(6,6.5] 2 0.01342 0.5 0.0268
(11,11.5] 1 0.00671 0.5 0.0134
Total 149 1
1 Relative Frequency = Density × Interval Width
  • We can only calculate the probability that creatinine will be between 3 and 3.5, or 3.4 and 3.6, or 3 and 4, and so on.
  • The probability \(P(3 < X \leq 3.5)\) is the relative frequency for the interval (3, 3.5], i.e., 0.02685;
  • The probability \(P(X \leq 3.5)\) is the relative frequency for the interval (0, 3.5], i.e., 0.01342 + 0.33557 + 0.26846 + 0.07383 + 0.11409 + 0.07383 + 0.02685 = 0.90605;

Probability distribution vs density function

  • The probability \(P(3 < X \leq 3.5)\) is the relative frequency for the interval (3, 3.5], i.e., 0.02685;
Code
data_plot <- dataset %>% 
  mutate(indicator = (creatinine > 3 & creatinine <= 3.5))

gp <- ggplot(data_plot, aes(x = creatinine, y = after_stat(density), fill = indicator)) + 
  geom_histogram(breaks = seq(0, 12, 0.5))  +
  theme_bw() + 
  labs(x = "Creatinine", y = "Density") + 
  theme(text = element_text(size=20),
        legend.position = "none") +
  scale_x_continuous(breaks = seq(0, 12, 0.5)) +
  theme(legend.position = "none") + 
  scale_fill_manual(values = c("grey", gg_color_hue(2)[2]))

gp

Probability distribution vs density function

  • The probability \(P(X \leq 3.5)\) is the relative frequency for the interval (0, 3.5], i.e., 0.01342 + 0.33557 + 0.26846 + 0.07383 + 0.11409 + 0.07383 + 0.02685 = 0.90605;
Code
data_plot <- dataset %>% 
  mutate(indicator = (creatinine <= 3.5))

gp <- ggplot(data_plot, aes(x = creatinine, y = after_stat(density), fill = indicator)) + 
  geom_histogram(breaks = seq(0, 12, 0.5))  +
  theme_bw() + 
  labs(x = "Creatinine", y = "Density") + 
  theme(text = element_text(size=20),
        legend.position = "none") +
  scale_x_continuous(breaks = seq(0, 12, 0.5)) +
  theme(legend.position = "none") + 
  scale_fill_manual(values = c("grey", gg_color_hue(2)[2]))

gp

Normal distribution

  • It is the most common distribution of probability for continuous variables;
  • \(X \sim N(\mu, \sigma^2)\), where \(\mu\) is the mean and \(\sigma^2\) is the variance;

Density function

  • \(f(x) = \displaystyle\frac{1}{\sqrt{2\pi \sigma^2}} \exp\left\{-\displaystyle\frac{1}{2\sigma^2}(x - \mu)^2\right\} \quad \mbox{if} \quad -\infty < x < \infty\);

Moments

  • \(E(X) = \mu\);
  • \(Var(X) = \sigma^2\);
  • Skewness = 0;
  • Kurtosis = 0.
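
A minimal sketch computing Normal probabilities with pnorm(), which is parametrized by the standard deviation \(\sigma\) rather than the variance \(\sigma^2\):

Code
mu <- 0
sigma <- 1
pnorm(1, mean = mu, sd = sigma) - pnorm(-1, mean = mu, sd = sigma) # P(-1 <= X <= 1), about 0.68
pnorm(2, mean = mu, sd = sigma) - pnorm(-2, mean = mu, sd = sigma) # about 0.95
pnorm(3, mean = mu, sd = sigma) - pnorm(-3, mean = mu, sd = sigma) # about 0.997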

Normal distribution

Code
ggplot(data = data.frame(x = c(-6, 6)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1), aes(colour = "a")) +
  stat_function(fun = dnorm, args = list(mean = 3, sd = 1), aes(colour = "b")) +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("N(0, 1)", "N(3, 1)"))
ggplot(data = data.frame(x = c(-6, 6)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1), aes(colour = "a")) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 2), aes(colour = "b")) +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("N(0, 1)", "N(0, 2)"))

Normal distribution

Properties

  • If \(X \sim N(0, 1)\), then \(a + X \sim N(a, 1)\) where \(a\) is a constant.
Code
ggplot(data = data.frame(x = c(-6, 6)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1), aes(colour = "a")) +
  stat_function(fun = dnorm, args = list(mean = 3, sd = 1), aes(colour = "b")) +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("N(0, 1)", "N(3, 1)"))

Normal distribution

Properties

  • If \(X \sim N(0, 1)\), then \(bX \sim N(0, b^2)\) where \(b\) is a constant.
Code
ggplot(data = data.frame(x = c(-6, 6)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1), aes(colour = "a")) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 2), aes(colour = "b")) +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("N(0, 1)", "N(0, 4)"))

Normal distribution

Properties

  • If \(X \sim N(a_1, b_1)\) and \(Y \sim N(a_2, b_2)\) are independent, then \(X + Y \sim N(a_1 + a_2, b_1 + b_2)\):
Code
ggplot(data = data.frame(x = c(-6, 6)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = -1, sd = 1), aes(colour = "a")) +
  stat_function(fun = dnorm, args = list(mean = 1, sd = 1), aes(colour = "b")) +
    stat_function(fun = dnorm, args = list(mean = 0, sd = sqrt(2)), aes(colour = "c")) +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("N(-1, 1)", "N(1, 1)", "N(0, 2)"))

t-Student distribution

  • It often replaces the Normal distribution when there are outliers;
  • Its most common use is in hypothesis testing;
  • \(X \sim t(\mu, \sigma^2, \nu)\), where \(\mu\) is the location, \(\sigma^2\) is the dispersion parameter and \(\nu\) is the degrees of freedom (kurtosis) parameter;
  • It is often assumed that \(\mu = 0\) and \(\sigma^2 = 1\), in which case \(X \sim t(\nu)\). If \(\mu \neq 0\) or \(\sigma^2 \neq 1\), then \(Z = \frac{X - \mu}{\sigma} \sim t(\nu)\).

Density function

  • \(f(x) = \displaystyle\frac{1}{\sigma\sqrt{\nu}\beta(1/2, \nu/2)}\left(1 + \frac{[(x - \mu)/\sigma]^2}{\nu}\right)^{-(\nu+1)/2} \quad \mbox{if} \quad -\infty < x < \infty\);

Moments

  • \(E(X) = \mu\);
  • \(Var(X) = \sigma^2\frac{\nu}{\nu - 2}\) for \(\nu > 2\);
  • Skewness = 0;
  • Kurtosis = \(\frac{6}{\nu - 4}\) for \(\nu > 4\).
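
A minimal sketch of the heavier tails: the probability beyond 3 standard units is much larger under a t distribution with few degrees of freedom than under the standard Normal:

Code
pnorm(3, lower.tail = FALSE)        # about 0.0013 for N(0, 1)
pt(3, df = 5, lower.tail = FALSE)   # about 0.015 for t(5): heavier tail
pt(3, df = 100, lower.tail = FALSE) # close to the Normal value for t(100)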

t-Student distribution

Code
ggplot(data = data.frame(x = c(-6, 6)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1), aes(colour = "a"), size = 1) +
  stat_function(fun = dt, args = list(df = 5), aes(colour = "b"), size = 1, linetype = "dashed") +
  stat_function(fun = dt, args = list(df = 20), aes(colour = "c"), size = 1, linetype = "dashed") +
  stat_function(fun = dt, args = list(df = 50), aes(colour = "d"), size = 1, linetype = "dashed") +
  stat_function(fun = dt, args = list(df = 100), aes(colour = "e"), size = 1, linetype = "dashed") +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("N(0, 1)", "t(5)", "t(20)",  "t(50)",  "t(100)"))

Varying the degrees of freedom

Skew Normal distribution

  • It is a generalization of the Normal distribution with an additional parameter for skewness;
  • \(X \sim SN(\mu, \sigma, \lambda)\), where \(\mu\) is the location, \(\sigma\) is the dispersion parameter and \(\lambda\) is the degree of skewness;
  • If \(\lambda = 0\), then we have a Normal distribution;

Density function

  • \(f(x) = \displaystyle\frac{2}{\sigma}\phi\left(\frac{x - \mu}{\sigma}\right)\Phi\left(\lambda\left(\frac{x - \mu}{\sigma}\right)\right) \quad \mbox{if} \quad -\infty < x < \infty\);

Moments

  • \(E(X) = \mu + \sigma\sqrt{\frac{2}{\pi}}\delta\);
  • \(Var(X) = \sigma^2\left(1 - \frac{2\delta^2}{\pi}\right)\);
  • Skewness = \(\frac{4 - \pi}{2}\frac{(\delta\sqrt{2/\pi})^3}{(1 - 2\delta^2/\pi)^{3/2}}\);
  • Kurtosis = \(2(\pi - 3)\frac{(\delta\sqrt{2/\pi})^4}{(1 - 2\delta^2/\pi)^2}\);

where \(\delta = \frac{\lambda}{\sqrt{1 + \lambda^2}}\)
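
A minimal sketch checking \(E(X)\) numerically with dSN1() from the gamlss.dist package (the same function used in the plots below), for \(\mu = 0\), \(\sigma = 1\), \(\lambda = 3\):

Code
library(gamlss.dist) # provides dSN1

mu <- 0
sigma <- 1
lambda <- 3
delta <- lambda / sqrt(1 + lambda^2)
mu + sigma * sqrt(2/pi) * delta                  # E(X) from the formula
integrate(function(x) x * dSN1(x, mu = mu, sigma = sigma, nu = lambda),
          lower = -Inf, upper = Inf)             # numerical check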

Skew Normal distribution

Code
ggplot(data = data.frame(x = c(-6, 6)), aes(x)) +
  stat_function(fun = dSN1, args = list(mu = 0, sigma = 1, nu = 0), aes(colour = "a"), size = 1) +
  stat_function(fun = dSN1, args = list(mu = 0, sigma = 1, nu = 1), aes(colour = "b"), size = 1, linetype = "dashed") +
  stat_function(fun = dSN1, args = list(mu = 0, sigma = 1, nu = -1), aes(colour = "c"), size = 1, linetype = "dashed") +
    stat_function(fun = dSN1, args = list(mu = 0, sigma = 1, nu = 3), aes(colour = "d"), size = 1, linetype = "dashed") +
  stat_function(fun = dSN1, args = list(mu = 0, sigma = 1, nu = -3), aes(colour = "e"), size = 1, linetype = "dashed") +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("N(0, 1)", "SN(0, 1, 1)", "SN(0, 1, -1)", "SN(0, 1, 3)", "SN(0, 1, -3)"))

Varying the skewness

Log-Normal distribution

  • It is also derived from the Normal distribution;
  • If \(Z \sim Normal(\mu, \sigma^2)\), then \(X = e^{Z} \sim Log-Normal(\mu, \sigma^2)\);
  • Conversely, if \(X \sim Log-Normal(\mu, \sigma^2)\), then \(Z = \ln(X) \sim Normal(\mu, \sigma^2)\);

Density function

  • \(f(x) = \displaystyle\frac{1}{x \sqrt{2\pi \sigma^2}} \exp\left\{-\displaystyle\frac{1}{2\sigma^2}(ln(x) - \mu)^2\right\} \quad \mbox{if} \quad 0 < x < \infty\);

Moments

  • \(E(X) = exp\{\mu + \sigma^2/2\}\);
  • \(Var(X) = (exp\{\sigma^2\} - 1)exp\{2\mu + \sigma^2\}\);
  • Skewness = \((exp\{\sigma^2\} + 2)\sqrt{\exp\{\sigma^2\} - 1}\);
  • Kurtosis = \(exp\{4\sigma^2\} + 2exp\{3\sigma^2\} + 3exp\{2\sigma^2\} - 6\).
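
A minimal sketch using the relationship with the Normal distribution to check the mean and variance by simulation:

Code
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
mu <- 0
sigma <- 1
z <- rnorm(n = 100000, mean = mu, sd = sigma)
x <- exp(z)                              # a Log-Normal sample built from a Normal sample
mean(x)                                  # approximately exp(mu + sigma^2/2) = 1.6487
exp(mu + sigma^2/2)                      # theoretical mean
var(x)                                   # approximately the theoretical variance below
(exp(sigma^2) - 1) * exp(2*mu + sigma^2) # theoretical variance = 4.6708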

Log-Normal distribution

Code
ggplot(data = data.frame(x = c(0, 50)), aes(x)) +
  stat_function(fun = dlnorm, args = list(meanlog = 0, sdlog = 1), aes(colour = "a")) +
  stat_function(fun = dlnorm, args = list(meanlog = 3, sdlog = 1), aes(colour = "b")) +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("LN(0, 1)", "LN(3, 1)"))
ggplot(data = data.frame(x = c(0, 30)), aes(x)) +
  stat_function(fun = dlnorm, args = list(meanlog = 0, sdlog = 1), aes(colour = "a")) +
  stat_function(fun = dlnorm, args = list(meanlog = 0, sdlog = 2), aes(colour = "b")) +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("LN(0, 1)", "LN(0, 2)"))

Chi-Square distribution

  • It is rarely used to describe endpoints, but it is often used in hypothesis testing;
  • \(X \sim \chi^2_{k}\) where \(k\) is a parameter known as degrees of freedom;

Density function

  • \(f(x) = \displaystyle\frac{1}{2^{k/2}\Gamma(k/2)} x^{k/2-1}e^{-x/2} \quad \mbox{if} \quad 0 < x < \infty\);

Moments

  • \(E(X) = k\);
  • \(Var(X) = 2k\);
  • Skewness = \(\sqrt{8/k}\);
  • Kurtosis = \(12/k\).
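
A minimal sketch: a Chi-Squared variable with \(k\) degrees of freedom arises as the sum of \(k\) squared independent \(N(0, 1)\) variables, which makes the moments easy to check by simulation:

Code
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
k <- 5
x <- replicate(10000, sum(rnorm(k)^2)) # sum of k squared standard Normal variables
mean(x)                                # approximately k = 5
var(x)                                 # approximately 2k = 10
pchisq(11.07, df = k)                  # P(X <= 11.07), about 0.95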

Chi-Square distribution

Code
ggplot(data = data.frame(x = c(0, 50)), aes(x)) +
  stat_function(fun = dchisq, args = list(df = 1), aes(colour = "a")) +
  stat_function(fun = dchisq, args = list(df = 5), aes(colour = "b")) +
  stat_function(fun = dchisq, args = list(df = 20), aes(colour = "c")) +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("Chi-Squared(1)", 
                                                   "Chi-Squared(5)", 
                                                   "Chi-Squared(20)"))

F-Snedecor distribution

  • It is rarely used to describe endpoints, but it is often used in hypothesis testing;
  • \(X \sim F_{d_1, d_2}\) where \(d_1\) and \(d_2\) are both parameters known as degrees of freedom;

Density function

  • \(f(x) = \displaystyle\frac{\sqrt{\frac{(d_1x)^{d_1}d_2^{d_2}}{(d_1x + d_2)^{d_1 + d_2}}}}{x \beta(d_1/2, d_2/2)} \quad \mbox{if} \quad 0 < x < \infty\);

Moments

  • \(E(X) = \displaystyle\frac{d_2}{d_2 - 2}\) for \(d_2 > 2\);
  • \(Var(X) = \frac{2d_2^2(d_1 + d_2 - 2)}{d_1(d_2 - 2)^2(d_2 - 4)}\) for \(d_2 > 4\);
  • Skewness = \(\frac{(2d_1 + d_2 - 2)\sqrt{8(d_2 - 4)}}{(d_2 - 6)\sqrt{d_1(d_1 + d_2 - 2)}}\);
  • Kurtosis = \(12\frac{d_1(5d_2 - 22)(d_1 + d_2 - 2) + (d_2 - 4)(d_2 - 2)^2}{d_1(d_2 - 6)(d_2 - 8)(d_1 + d_2 - 2)}\).
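
A minimal sketch: an \(F_{d_1, d_2}\) variable arises as the ratio of two independent Chi-Squared variables divided by their degrees of freedom, which makes the mean easy to check by simulation:

Code
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
d1 <- 5
d2 <- 10
x <- (rchisq(100000, df = d1)/d1) / (rchisq(100000, df = d2)/d2)
mean(x)                      # approximately d2/(d2 - 2) = 1.25
d2 / (d2 - 2)                # theoretical mean
pf(3.33, df1 = d1, df2 = d2) # P(X <= 3.33), about 0.95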

F-Snedecor distribution

Code
ggplot(data = data.frame(x = c(0, 20)), aes(x)) +
  stat_function(fun = df, args = list(df1 = 1, df2 = 1), aes(colour = "a")) +
  stat_function(fun = df, args = list(df1 = 1, df2 = 5), aes(colour = "b")) +
  stat_function(fun = df, args = list(df1 = 5, df2 = 1), aes(colour = "c")) +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("F(1, 1)", 
                                                   "F(1, 5)", 
                                                   "F(5, 1)"))

Other continuous distributions