Probability

Marcio Diniz | Michael Luu

Cedars Sinai Medical Center

11 October, 2022

Introduction

Why do we need to learn probability?

The world is full of uncertainty;

  • How do we make decisions when we have uncertainty?
  • How can we generalize results from a sample to a population?

Inference

Brief Probability History

Louvre Museum

Rolling a die

  • Dice have been used since before recorded history, and it is uncertain where they originated;
  • The Egyptian game of Senet was played with dice. Senet was played before 3000 BC and up to the 2nd century AD. It was likely a racing game, but there is no scholarly consensus on the rules of Senet;
  • In 1654, the mathematical methods of probability arose in the correspondence of Pierre de Fermat and Blaise Pascal on such questions as the fair division of the stakes in an interrupted game of chance, building on earlier work by Gerolamo Cardano in the 16th century;
  • In 1812, Pierre-Simon Laplace introduced a host of new ideas and mathematical techniques in his book, Théorie Analytique des Probabilités. Before Laplace, probability theory was solely concerned with developing a mathematical analysis of games of chance. Laplace applied probabilistic ideas to many scientific and practical problems;
  • In 1933, the Russian mathematician Andrey Kolmogorov outlined an axiomatic approach that forms the basis for the modern theory.

First steps

Random Experiment

  • Roll a die;
  • Possible results: \(\Omega = \{1, 2, 3, 4, 5, 6\}\);
  • What is the probability of observing the result 4?
  • If the die is fair, we expect to observe the face 4 in 1 out of 6 rolls.
  • Let’s try to observe it in practice:
Code
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
dice <- sample(x = 1:6, size = 6, replace = TRUE)
table(dice)
dice
1 2 4 5 6 
1 1 2 1 1 
  • Did we observe the face 4 with frequency 1 out of 6 when rolling a die 6 times?

First steps

Random Experiment

  • How about if we roll a die 1,000 times?
Code
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
dice <- sample(x = 1:6, size = 1000, replace = TRUE)
table(dice)
dice
  1   2   3   4   5   6 
161 153 188 149 157 192 

It is hard to say; let’s calculate the proportions:

Code
tmp <- table(dice)
prop.table(tmp)
dice
    1     2     3     4     5     6 
0.161 0.153 0.188 0.149 0.157 0.192 

Are the proportions equal to 1/6, i.e., 0.166…?

Code
prop.table(tmp) == 1/6
dice
    1     2     3     4     5     6 
FALSE FALSE FALSE FALSE FALSE FALSE 

First steps

Random Experiment

  • How about if we roll a die 100,000 times?
Code
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
dice <- sample(x = 1:6, size = 100000, replace = TRUE)
table(dice)
dice
    1     2     3     4     5     6 
16745 16806 16453 16757 16714 16525 

It is hard to say; let’s calculate the proportions:

Code
tmp <- table(dice)
prop.table(tmp)
dice
      1       2       3       4       5       6 
0.16745 0.16806 0.16453 0.16757 0.16714 0.16525 

Are the proportions equal to 1/6, i.e., 0.166…?

Code
prop.table(tmp) == 1/6
dice
    1     2     3     4     5     6 
FALSE FALSE FALSE FALSE FALSE FALSE 

First steps

Random Experiment

  • How about if we roll a die 10,000,000 times?
Code
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
dice <- sample(x = 1:6, size = 10000000, replace = TRUE)
table(dice)
dice
      1       2       3       4       5       6 
1666872 1668473 1667449 1665194 1667349 1664663 

It is hard to say; let’s calculate the proportions:

Code
tmp <- table(dice)
prop.table(tmp)
dice
        1         2         3         4         5         6 
0.1666872 0.1668473 0.1667449 0.1665194 0.1667349 0.1664663 

Are the proportions equal to 1/6, i.e., 0.166…?

Code
prop.table(tmp) == 1/6
dice
    1     2     3     4     5     6 
FALSE FALSE FALSE FALSE FALSE FALSE 

Probability distributions

  • Each experiment has a set of possible outcomes, which can be:
    • finite or non-finite;
    • enumerable or non-enumerable;
  • For example,
    • Number of mice with metastasis in a group of 10 mice is an outcome with a finite and enumerable number of possible values;
    • Number of laser photons hitting a detector in a particular time interval is an outcome with a non-finite and enumerable number of possible outcomes;
    • Gene expression is an outcome with a non-finite and non-enumerable number of values.
  • If the outcome is random, then we say that the outcome is a random variable;
  • A random variable is characterized by a distribution of probability;
  • A probability distribution associates each possible value of the outcome with the probability of observing that value;
  • Therefore, the observed frequencies from a sample converge to the probability distribution as the sample size increases, as sketched below.
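
As a quick check of this convergence, a minimal sketch (base R only) that tracks the running proportion of the face 4 as the number of rolls grows:

Code
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
rolls <- sample(x = 1:6, size = 10000, replace = TRUE)
# cumulative proportion of the face 4 after 1, 2, ..., 10000 rolls
running_prop <- cumsum(rolls == 4) / seq_along(rolls)
plot(running_prop, type = "l", xlab = "Number of rolls",
     ylab = "Proportion of face 4")
abline(h = 1/6, lty = 2) # the theoretical probability 1/6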

Probability distributions

Types

  • Probability distributions can be discrete, for enumerable outcomes, or continuous, for non-enumerable outcomes (and, as an approximation, for enumerable ones);

Discrete

  • Discrete probability distributions describe experiments with enumerable outcomes such as
    • presence of metastasis;
    • number of cells with mutations per replication cycle;
    • number of reads (sequences obtained from fragments of a gene) in an RNA-Seq experiment;
    • number of enriched genes for a given set.

Continuous

  • Continuous probability distributions describe experiments with non-enumerable outcomes such as
    • gene expression;
    • biomarker levels;
    • time-to-event;
  • They also can be used for enumerable outcomes.

Probability distributions are mathematical models

Models of the solar system and of DNA

  • Models are a simplified description of the real world;
  • Every scientific model is defined by a few parameters that are essential to represent a phenomenon;
  • Probability distributions are mathematical models to describe uncertainty;
  • Each probability distribution is also characterized by parameters;
  • What are the parameters of the probability distribution characterizing the outcome of rolling a fair die?
    • 6 possible outcomes;
    • each outcome is equally likely;
  • Furthermore, probability distributions are often characterized by their moments:
    • Mean;
    • Variance = Squared Standard Deviation;
    • Kurtosis;
    • Skewness.
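
As a minimal sketch, the four moments can be estimated from a sample of dice rolls with base R (using the standard moment estimators for skewness and excess kurtosis):

Code
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
x <- sample(x = 1:6, size = 10000, replace = TRUE)
m <- mean(x)
s <- sd(x)
mean(x)                   # sample mean
var(x)                    # sample variance
mean(((x - m) / s)^3)     # sample skewness (close to 0 for a fair die)
mean(((x - m) / s)^4) - 3 # sample excess kurtosis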

Discrete distributions

Discrete Uniform distribution

  • It is the simplest distribution of probability for discrete variables;
  • It has two main parameters \(a\) and \(b\) indicating the lowest and highest integer that can be observed;
  • It is denoted by \(X \sim U\{a, \ldots, b\}\);

Probability distribution

  • The random variable has \(n = b - a + 1\) possible values (\(a, \ldots, b\)) that are equally likely to occur, i.e., \(1/n\);
  • \(P(X = x) = \displaystyle\frac{1}{n}\);

Moments

  • \(E(X) = \displaystyle\frac{a + b}{2}\);
  • \(Var(X) = \displaystyle\frac{(b - a + 1)^2 - 1}{12}\);
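
A minimal sketch that verifies these formulas for \(a = 1\) and \(b = 6\) by direct enumeration of the probability distribution:

Code
a <- 1
b <- 6
x <- a:b
n <- b - a + 1
sum(x * (1/n))                  # E(X) = (a + b)/2 = 3.5
sum((x - (a + b)/2)^2 * (1/n))  # Var(X) by enumeration
((b - a + 1)^2 - 1) / 12        # Var(X) from the closed-form formula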

Discrete Uniform distribution

Example

  • Experiment: Roll a die;

  • \(X \sim U\{1, \ldots, 6\} \rightarrow a = 1, b = 6\)

  • \(E(X) = 3.5\)

  • \(Var(X) = 2.916667 \rightarrow SD(X) = 1.707825\)

  • In a sample size of 10,

Code
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
dice <- sample(x = 1:6, size = 10, replace = TRUE)
table(dice)
dice
1 2 4 5 6 
1 2 3 2 2 
Code
mean(dice)
[1] 3.9
Code
var(dice)
[1] 2.988889
Code
sd(dice)
[1] 1.72884

Discrete Uniform distribution

Example

  • \(X \sim U\{1, \ldots, 6\} \rightarrow a = 1, b = 6\)
  • \(E(X) = 3.5\)
  • \(Var(X) = 2.916667 \rightarrow SD(X) = 1.707825\)
  • In a sample of 100,000,
Code
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
dice <- sample(x = 1:6, size = 100000, replace = TRUE)
table(dice)
dice
    1     2     3     4     5     6 
16745 16806 16453 16757 16714 16525 
Code
mean(dice)
[1] 3.49464
Code
var(dice)
[1] 2.9166
Code
sd(dice)
[1] 1.707806

Discrete Uniform distribution

Code
n <- 6
data_plot <- data.frame(x = seq(1:n), y = 1/n)

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = seq(1:n)) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12))
n <- 10
data_plot <- data.frame(x = seq(1:n), y = 1/n)

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = seq(1:n)) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12))

a = 1, b = 6

a = 1, b = 10

Bernoulli distribution

  • The random variable has two possible values (0 = failure, 1 = success);
  • \(X \sim B(p)\), where \(p \in [0, 1]\) is the probability of success;

Probability distribution

  • \(P(X = x) = p^x (1- p)^{1 - x}\)

Moments

  • \(E(X) = p\)
  • \(Var(X) = p(1 - p)\)
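
A minimal sketch checking the probability distribution and moments against R's dbinom() with size = 1, which corresponds to a Bernoulli variable:

Code
p <- 0.8
x <- 0:1
dbinom(x, size = 1, prob = p)    # P(X = 0) and P(X = 1) from R
p^x * (1 - p)^(1 - x)            # the same values from the formula above
sum(x * dbinom(x, 1, p))         # E(X) = p
sum((x - p)^2 * dbinom(x, 1, p)) # Var(X) = p(1 - p)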

Bernoulli distribution

Example

  • Experiment: In a given pancreatic adenocarcinoma (PDAC) mouse model, we expect that a mouse will develop metastasis with a probability of 80% without treatment;
  • \(X = \begin{cases} 1 \quad \mbox{if the mouse is metastatic} \\ 0 \quad \mbox{if the mouse is not metastatic;} \end{cases}\)
  • \(X \sim B(0.8) \rightarrow p = 0.8\)
  • \(E(X) = 0.8\)
  • \(Var(X) = 0.16 \rightarrow SD(X) = 0.4\)

Bernoulli distribution

Example

  • \(X \sim B(0.8) \rightarrow p = 0.8\)
  • \(E(X) = 0.8\)
  • \(Var(X) = 0.16 \rightarrow SD(X) = 0.4\)
  • For a sample size of 10 and \(p = 0.8\),
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
mice <- rbinom(n = 10, size = 1, prob = 0.8)
table(mice)
mice
0 1 
1 9 
mean(mice)
[1] 0.9
var(mice)
[1] 0.1
sd(mice)
[1] 0.3162278

Bernoulli distribution

Code
size <- 1
p <- 0.2
data_plot <- data.frame(x = 0:1, y = dbinom(0:1, size, p))

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = 0:1) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12))
size <- 1
p <- 0.8
data_plot <- data.frame(x = 0:1, y = dbinom(0:1, size, p))

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = 0:1) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12))

p = 0.2

p = 0.8

Binomial distribution

  • The random variable is the sum of a sequence of \(m\) independent Bernoulli variables with probability of success \(p\);
  • It is denoted by \(Y \sim Bin(m, p)\).

Probability distribution

  • \(P(Y = y) = \displaystyle{m \choose y} p^y (1- p)^{m - y}\)

Moments

  • \(E(Y) = mp\)
  • \(Var(Y) = mp(1 - p)\)
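
A minimal sketch comparing the formula with R's dbinom() for \(m = 10\) and \(p = 0.8\):

Code
m <- 10
p <- 0.8
y <- 8
dbinom(y, size = m, prob = p)        # P(Y = 8) from R
choose(m, y) * p^y * (1 - p)^(m - y) # the same value from the formula above
m * p                                # E(Y)
m * p * (1 - p)                      # Var(Y)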

Binomial distribution

Example

  • Experiment: \(m = 10\) mice are followed such that each mouse has a probability of about 80% of developing metastasis;
  • \(X_i = \begin{cases} 1 \quad \mbox{if the mouse is metastatic} \\ 0 \quad \mbox{if the mouse is not metastatic;} \end{cases}\)
  • \(Y = \sum\limits_{i = 1}^m X_i\) is the total number of metastatic mice in a group with \(m\) mice;
  • \(Y \in \{0, 1, \ldots, m\}\);
  • \(E(Y) = mp = 10\times0.8 = 8\);
  • \(Var(Y) = mp(1-p) = 10\times0.8\times0.2=1.6\);
  • \(Y \sim Binomial(m, p)\).

Binomial distribution

Example

  • \(Y \sim Binomial(m, p)\);
  • \(E(Y) = mp = 10 \times 0.8 = 8\);
  • \(Var(Y) = mp(1-p) = 10 \times 0.8 \times 0.2 = 1.6\);
  • If we replicate this experiment 10 times,
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
mice <- rbinom(n = 10, size = 10, prob = 0.8)
table(mice)
mice
 7  8  9 10 
 1  6  2  1 
mean(mice)
[1] 8.3
var(mice)
[1] 0.6777778
sd(mice)
[1] 0.8232726

Binomial distribution

Code
n <- 10
size <- 10
p <- 0.2
data_plot <- data.frame(x = 0:size, y = dbinom(0:size, size, p))

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = 0:size) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12))
n <- 10
size <- 10
p <- 0.8
data_plot <- data.frame(x = 0:size, y = dbinom(0:size, size, p))

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = 0:size) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12))

p = 0.2

p = 0.8

Poisson distribution

  • The random variable is the number of occurrences of an event during a specific time interval or area;
  • \(X \sim Po(\lambda)\) where \(\lambda\) is the rate of events;

Probability distribution

  • \(P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}\);

Moments

  • \(E(X) = Var(X) = \lambda\).
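
A minimal sketch comparing the formula with R's dpois() for \(\lambda = 9\):

Code
lambda <- 9
x <- 9
dpois(x, lambda = lambda)              # P(X = 9) from R
lambda^x * exp(-lambda) / factorial(x) # the same value from the formula above
ppois(12, lambda = lambda)             # P(X <= 12), a cumulative probability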

Poisson distribution

Example

  • Experiment: Observe the number of mutations in a genome with about \(10^4\) nucleotides, with a mutation rate of \(9\times10^{-4}\) per nucleotide per replication cycle.

  • \(X \in \{0, 1, 2, 3, \ldots\}\);

  • \(X \sim Po(9) \rightarrow \lambda = 9\) ;

  • \(E(X) = \lambda = 9\);

  • \(Var(X) = \lambda = 9 \rightarrow SD(X) = 3\);

  • In a sample of 10 cell-culture dishes,

set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
mutations <- rpois(n = 10, lambda = 9)
table(mutations)
mutations
 3  5  7  9 10 12 
 1  1  1  1  5  1 
mean(mutations)
[1] 8.6
var(mutations)
[1] 7.6
sd(mutations)
[1] 2.75681

Poisson distribution

Code
n <- 20
lambda <- 1
data_plot <- data.frame(x = 0:n, y = dpois(0:n, lambda))

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = 0:n) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12)) 
n <- 20
lambda <- 9
data_plot <- data.frame(x = 0:n, y = dpois(0:n, lambda))

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = 0:n) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12))

Poisson rate = 1

Poisson rate = 9

Negative Binomial

  • The random variable is the number of occurrences of an event during a specific time interval or area when the counts are more dispersed than a Poisson distribution allows (variance greater than the mean);
  • \(X \sim NegBin(\mu, \sigma^2)\) where \(\mu\) is the mean and \(\sigma^2\) is the variance;

Probability distribution

  • \(P(X = x) = \displaystyle{x + \frac{\mu^2}{\sigma^2 - \mu} - 1 \choose x}\left(\frac{\sigma^2 - \mu}{\sigma^2}\right)^x\left(\frac{\mu}{\sigma^2}\right)^{\mu^2/(\sigma^2 - \mu)}\);

Moments

  • \(E(X) = \mu\);
  • \(Var(X) = \sigma^2\) such that \(\sigma^2 > \mu\);
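
R's dnbinom() and rnbinom() use a (size, mu) parametrization instead of \((\mu, \sigma^2)\); a minimal sketch of the conversion used later in these slides, with a large sample to check the moments:

Code
mu <- 5
sigma2 <- 9
size <- mu^2/(sigma2 - mu)             # convert (mu, sigma2) into R's size parameter
dnbinom(0:3, size = size, mu = mu)     # P(X = 0), ..., P(X = 3)
x <- rnbinom(n = 100000, size = size, mu = mu)
mean(x)                                # approximately mu = 5
var(x)                                 # approximately sigma2 = 9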

Negative Binomial

Example

  • Experiment: We perform RNA-Seq obtaining the number of reads (sequence obtained from a fragment of a gene) of a specific gene;
  • \(X \in \{0, 1, 2, 3, \ldots\}\);
  • \(X \sim NegBin(5, 9) \rightarrow \mu = 5, \sigma^2 = 9\);
  • \(E(X) = 5\);
  • \(Var(X) = 9 \rightarrow SD(X) = 3\)

Negative Binomial

Example

  • \(X \sim NegBin(5, 9) \rightarrow \mu = 5, \sigma^2 = 9\);
  • \(E(X) = 5\);
  • \(Var(X) = 9 \rightarrow SD(X) = 3\)
  • In a sample of 10 patients,
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.

mu <- 5
sigma2 <- 9
size <- mu^2/(sigma2 - mu)

reads <- rnbinom(n = 10, mu = 5, size = size)
table(reads)
reads
1 2 3 4 6 7 8 
1 3 1 2 1 1 1 
mean(reads)
[1] 3.9
var(reads)
[1] 5.655556
sd(reads)
[1] 2.378141

Negative Binomial

Code
n <- 20
mu <- 2
sigma2 <- 9
size <- mu^2/(sigma2 - mu)
data_plot <- data.frame(x = 0:n, y = dnbinom(0:n, 
                                             size = size, 
                                             mu = mu))

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = 0:n) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12))
n <- 20
mu <- 5
sigma2 <- 9 
size <- mu^2/(sigma2 - mu)
data_plot <- data.frame(x = 0:n, y = dnbinom(0:n,
                                             size = size, 
                                             mu = mu))

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = 0:n) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12))

Mean = 2 and Variance = 9

Mean = 5 and Variance = 9

Hypergeometric distribution

  • The random variable is the number of white balls among \(k\) sampled balls given that the urn has \(m\) white balls and \(n\) black balls;
  • \(X \sim HyperGeometric(k, m, n)\);

Probability distribution

  • \(P(X = x) = \frac{{m \choose x} {n \choose k - x}}{{m + n \choose k}}\);

Moments

  • \(E(X) = k\frac{m}{m + n}\);
  • \(Var(X) = k\frac{mn(m+n-k)}{(m + n)^2(m + n-1)}\);
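
A minimal sketch checking these moments numerically with R's dhyper() (note that R orders the arguments as x, m, n, k):

Code
k <- 50
m <- 300
n <- 700
x <- 0:k
p <- dhyper(x, m = m, n = n, k = k)                   # P(X = x) for x = 0, ..., k
mu <- k * m / (m + n)                                 # E(X) = 15
sum(x * p)                                            # numerical check of E(X)
k * (m * n * (m + n - k)) / ((m + n)^2 * (m + n - 1)) # Var(X) = 9.984985
sum((x - mu)^2 * p)                                   # numerical check of Var(X)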

Hypergeometric distribution

Example

  • Experiment: We observe the number of genes that belong to a specific pathway (defined by 300 genes) among the 50 genes found to be associated with diabetes after comparing 1,000 genes between case and control patients.

Enrichment Table

  • \(X \sim HyperGeometric(50, 300, 700) \rightarrow k = 50, m = 300, n = 700\);
  • \(E(X) = 15\);
  • \(Var(X) = 9.984985 \rightarrow SD(X) = 3.159903\);

Hypergeometric distribution

Example

  • \(X \sim HyperGeometric(50, 300, 700) \rightarrow k = 50, m = 300, n = 700\);
  • \(E(X) = 15\);
  • \(Var(X) = 9.984985 \rightarrow SD(X) = 3.159903\);
  • If we replicate this experiment 10 times,
genes_pathway <- rhyper(nn = 10, k = 50, m = 300, n = 700)
table(genes_pathway)
genes_pathway
11 12 13 14 16 17 18 
 1  1  2  2  1  1  2 
mean(genes_pathway)
[1] 14.6
var(genes_pathway)
[1] 6.266667
sd(genes_pathway)
[1] 2.503331

Hypergeometric distribution

Code
n <- 20
s <- 1:20
data_plot <- data.frame(x = s, y = dhyper(s, k = 10, m = 300, n = 700))

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = seq(1, 20, by = 2)) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12))
n <- 20
s <- 1:26
data_plot <- data.frame(x = s, y = dhyper(s, k = 50, m = 300, n = 700))

ggplot(data_plot, aes(x = x, y = y)) + 
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(breaks = seq(1, 26, by = 2)) +
  labs(x = "X", y = "P(X = x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12))

k = 10, m = 300, n = 700

k = 50, m = 300, n = 700

Other discrete distributions

Continuous distributions

Continuous Uniform distribution

  • It is the simplest distribution of probability for continuous variables;
  • \(X \sim U(a, b)\) where \(a\) is the lower limit and \(b\) is the upper limit;

Density function

  • \(f(x) = \begin{cases}\displaystyle\frac{1}{b - a} \quad \mbox{if} \quad a \leq x \leq b \\ 0 \quad \mbox{otherwise}; \end{cases}\);

Moments

  • \(E(X) = \displaystyle\frac{a + b}{2}\);
  • \(Var(X) = \displaystyle\frac{(b - a)^2}{12}\).
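
A minimal sketch: probabilities are obtained from the area under \(f(x)\), e.g., with punif() or by integrating the density:

Code
a <- 0
b <- 1
punif(0.75, min = a, max = b) - punif(0.25, min = a, max = b)  # P(0.25 <= X <= 0.75)
integrate(dunif, lower = 0.25, upper = 0.75, min = a, max = b) # the same area under f(x)
(a + b) / 2     # E(X)
(b - a)^2 / 12  # Var(X)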

Continuous Uniform distribution

Example

  • \(U(0, 1) \rightarrow a = 0, b = 1\);

  • \(E(X) = 0.5\);

  • \(Var(X) = 0.08333 \rightarrow SD(X) = 0.288\)

  • In a sample of 10 uniformly distributed variables,

Code
x <- runif(n = 10, min = 0, max = 1)
mean(x)
[1] 0.4022828
Code
var(x)
[1] 0.09913909
Code
sd(x)
[1] 0.3148636

Continuous Uniform distribution

Code
ggplot(data = data.frame(x = c(-10, 10)), aes(x)) +
  stat_function(fun = dunif, n = 1001, args = list(min = 0, max = 5), aes(colour = "b")) +
  stat_function(fun = dunif, args = list(min = -6, max = 1), aes(colour = "a")) +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("U(-6, 1)", "U(0, 5)"))

Probability distribution vs density function

  • How do we calculate the probability \(P(X = x)\) for a continuous distribution?
  • In the emergency dataset, what is the probability of creatinine = 3.5?
Code
dataset <- read_csv(file = "data/emergency.csv")

data_plot <- dataset

ggplot(data_plot, aes(x = creatinine)) +
  geom_histogram(breaks = seq(0, 12, by = 0.5))  +
  theme_bw(base_size = 25) +
  labs(x = "Creatinine", y = "Frequency") +
  theme(legend.position = "none")
ggplot(data_plot, aes(x = creatinine, y = after_stat(density))) +
  geom_histogram(breaks = seq(0, 12, by = 0.5))  +
  theme_bw(base_size = 25) +
  labs(x = "Creatinine", y = "Density") +
  theme(legend.position = "none")

  • The probability for continuous distributions is defined as the area under the density curve between two points on the x-axis;
  • What is the area under the curve at creatinine = 3.5?
    • \(P(X = 3.5) = 0\);
  • We can only calculate probabilities between two points, i.e., \(P(a \leq X \leq b)\);
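
A minimal sketch with the emergency dataset loaded above (using the creatinine column): the proportion of observations exactly equal to 3.5 is essentially zero, while the proportion falling in an interval estimates the corresponding probability:

Code
mean(dataset$creatinine == 3.5)                          # estimate of P(X = 3.5)
mean(dataset$creatinine > 3 & dataset$creatinine <= 3.5) # estimate of P(3 < X <= 3.5)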

Probability distribution vs density function

Code
tab <- data_plot %>% 
  mutate(creatinine_bin = cut(creatinine, seq(0, 12, 0.5))) %>% 
  group_by(creatinine_bin) %>% 
  summarize(frequency = n()) %>%
  mutate(relative_frequency = frequency/sum(frequency),
         interval_width = 0.5,
         density = relative_frequency/interval_width) %>%
  mutate(relative_frequency = 
           format(relative_frequency, digits = 3, nsmall = 3),
         density = 
           format(density, digits = 3, nsmall = 3),
         interval_width = as.character(interval_width)) %>%
  add_row(creatinine_bin = "Total", frequency = 149,
          relative_frequency = "1", interval_width = "",
          density = "") 

gt(tab)  %>%
  cols_label(
    creatinine_bin = md("**Creatinine Bin**"),
    frequency = md("**Frequency**"),
    relative_frequency = md("**Relative Frequency**"),
    interval_width = md("**Interval Width**"),
    density = md("**Density**"),
  ) %>%
  tab_footnote(
    footnote = "Relative Frequency = Density × Interval Width",
    locations = cells_column_labels(
      columns = density
    )
  )
Creatinine Bin Frequency Relative Frequency Interval Width Density1
(0,0.5] 2 0.01342 0.5 0.0268
(0.5,1] 50 0.33557 0.5 0.6711
(1,1.5] 40 0.26846 0.5 0.5369
(1.5,2] 11 0.07383 0.5 0.1477
(2,2.5] 17 0.11409 0.5 0.2282
(2.5,3] 11 0.07383 0.5 0.1477
(3,3.5] 4 0.02685 0.5 0.0537
(3.5,4] 2 0.01342 0.5 0.0268
(4,4.5] 2 0.01342 0.5 0.0268
(4.5,5] 4 0.02685 0.5 0.0537
(5,5.5] 2 0.01342 0.5 0.0268
(5.5,6] 1 0.00671 0.5 0.0134
(6,6.5] 2 0.01342 0.5 0.0268
(11,11.5] 1 0.00671 0.5 0.0134
Total 149 1
1 Relative Frequency = Density × Interval Width
  • We can only calculate the probability that creatinine will be between 3 and 3.5, or 3.4 and 3.6, or 3 and 4, and so on.
  • The probability \(P(3 < X \leq 3.5)\) is the relative frequency for the interval (3, 3.5], i.e., 0.02685;
  • The probability \(P(X \leq 3.5)\) is the relative frequency for the interval (0, 3.5], i.e., 0.01342 + 0.33557 + 0.26846 + 0.07383 + 0.11409 + 0.07383 + 0.02685 = 0.90605;

Probability distribution vs density function

  • The probability \(P(3 < X \leq 3.5)\) is the relative frequency for the interval (3, 3.5], i.e., 0.02685;
Code
data_plot <- dataset %>% 
  mutate(indicator = (creatinine > 3 & creatinine <= 3.5))

gp <- ggplot(data_plot, aes(x = creatinine, y = after_stat(density), fill = indicator)) + 
  geom_histogram(breaks = seq(0, 12, 0.5))  +
  theme_bw() + 
  labs(x = "Creatinine", y = "Density") + 
  theme(text = element_text(size=20),
        legend.position = "none") +
  scale_x_continuous(breaks = seq(0, 12, 0.5)) +
  theme(legend.position = "none") + 
  scale_fill_manual(values = c("grey", gg_color_hue(2)[2]))

gp

Probability distribution vs density function

  • The probability \(P(X \leq 3.5)\) is the relative frequency for the interval (0, 3.5], i.e., 0.01342 + 0.33557 + 0.26846 + 0.07383 + 0.11409 + 0.07383 + 0.02685 = 0.90605;
Code
data_plot <- dataset %>% 
  mutate(indicator = (creatinine <= 3.5))

gp <- ggplot(data_plot, aes(x = creatinine, y = after_stat(density), fill = indicator)) + 
  geom_histogram(breaks = seq(0, 12, 0.5))  +
  theme_bw() + 
  labs(x = "Creatinine", y = "Density") + 
  theme(text = element_text(size=20),
        legend.position = "none") +
  scale_x_continuous(breaks = seq(0, 12, 0.5)) +
  theme(legend.position = "none") + 
  scale_fill_manual(values = c("grey", gg_color_hue(2)[2]))

gp

Normal distribution

  • It is the most common distribution of probability for continuous variables;
  • \(X \sim N(\mu, \sigma^2)\), where \(\mu\) is the mean and \(\sigma^2\) is the variance;

Density function

  • \(f(x) = \displaystyle\frac{1}{\sqrt{2\pi \sigma^2}} \exp\left\{-\displaystyle\frac{1}{2\sigma^2}(x - \mu)^2\right\} \quad \mbox{if} \quad -\infty < x < \infty\);

Moments

  • \(E(X) = \mu\);
  • \(Var(X) = \sigma^2\);
  • Skewness = 0;
  • Kurtosis = 0.
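
A minimal sketch computing Normal probabilities with pnorm(), which is parametrized by the standard deviation \(\sigma\) rather than the variance \(\sigma^2\):

Code
mu <- 0
sigma <- 1
pnorm(1, mean = mu, sd = sigma) - pnorm(-1, mean = mu, sd = sigma) # P(-1 <= X <= 1), about 0.68
pnorm(2, mean = mu, sd = sigma) - pnorm(-2, mean = mu, sd = sigma) # about 0.95
pnorm(3, mean = mu, sd = sigma) - pnorm(-3, mean = mu, sd = sigma) # about 0.997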

Normal distribution

Code
ggplot(data = data.frame(x = c(-6, 6)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1), aes(colour = "a")) +
  stat_function(fun = dnorm, args = list(mean = 3, sd = 1), aes(colour = "b")) +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("N(0, 1)", "N(3, 1)"))
ggplot(data = data.frame(x = c(-6, 6)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1), aes(colour = "a")) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 2), aes(colour = "b")) +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("N(0, 1)", "N(0, 2)"))

Normal distribution

Properties

  • If \(X \sim N(0, 1)\), then \(a + X \sim N(a, 1)\) where \(a\) is a constant.
Code
ggplot(data = data.frame(x = c(-6, 6)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1), aes(colour = "a")) +
  stat_function(fun = dnorm, args = list(mean = 3, sd = 1), aes(colour = "b")) +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("N(0, 1)", "N(3, 1)"))

Normal distribution

Properties

  • If \(X \sim N(0, 1)\), then \(bX \sim N(0, b^2)\) where \(b\) is a constant.
Code
ggplot(data = data.frame(x = c(-6, 6)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1), aes(colour = "a")) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 2), aes(colour = "b")) +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("N(0, 1)", "N(0, 4)"))

Normal distribution

Properties

  • If \(X \sim N(a_1, b_1)\) and \(Y \sim N(a_2, b_2)\) are independent, then \(X + Y \sim N(a_1 + a_2, b_1 + b_2)\):
Code
ggplot(data = data.frame(x = c(-6, 6)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = -1, sd = 1), aes(colour = "a")) +
  stat_function(fun = dnorm, args = list(mean = 1, sd = 1), aes(colour = "b")) +
    stat_function(fun = dnorm, args = list(mean = 0, sd = sqrt(2)), aes(colour = "c")) +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("N(-1, 1)", "N(1, 1)", "N(0, 2)"))

t-Student distribution

  • It often replaces the Normal distribution when there are outliers;
  • Its most common use is in hypothesis testing;
  • \(X \sim t(\mu, \sigma^2, \nu)\), where \(\mu\) is the location, \(\sigma^2\) is the dispersion parameter and \(\nu\) is the degrees of freedom (kurtosis) parameter;
  • It is often assumed that \(\mu = 0\) and \(\sigma^2 = 1\), in which case \(X \sim t(\nu)\). If \(\mu \neq 0\) or \(\sigma^2 \neq 1\), then \(Z = \frac{X - \mu}{\sigma} \sim t(\nu)\).

Density function

  • \(f(x) = \displaystyle\frac{1}{\sigma\sqrt{\nu}\beta(1/2, \nu/2)}\left(1 + \frac{[(x - \mu)/\sigma]^2}{\nu}\right)^{-(\nu+1)/2} \quad \mbox{if} \quad -\infty < x < \infty\);

Moments

  • \(E(X) = \mu\);
  • \(Var(X) = \sigma^2\frac{\nu}{\nu - 2}\) for \(\nu > 2\);
  • Skewness = 0;
  • Kurtosis = \(\frac{6}{\nu - 4}\) for \(\nu > 4\).
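
A minimal sketch of the heavier tails: the probability beyond 3 standard units is much larger under a t distribution with few degrees of freedom than under the standard Normal:

Code
pnorm(3, lower.tail = FALSE)        # about 0.0013 for N(0, 1)
pt(3, df = 5, lower.tail = FALSE)   # about 0.015 for t(5): heavier tail
pt(3, df = 100, lower.tail = FALSE) # close to the Normal value for t(100)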

t-Student distribution

Code
ggplot(data = data.frame(x = c(-6, 6)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1), aes(colour = "a"), size = 1) +
  stat_function(fun = dt, args = list(df = 5), aes(colour = "b"), size = 1, linetype = "dashed") +
  stat_function(fun = dt, args = list(df = 20), aes(colour = "c"), size = 1, linetype = "dashed") +
  stat_function(fun = dt, args = list(df = 50), aes(colour = "d"), size = 1, linetype = "dashed") +
  stat_function(fun = dt, args = list(df = 100), aes(colour = "e"), size = 1, linetype = "dashed") +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("N(0, 1)", "t(5)", "t(20)",  "t(50)",  "t(100)"))

Varying the degrees of freedom

Skew Normal distribution

  • It is a generalization of the Normal distribution with an additional parameter for skewness;
  • \(X \sim SN(\mu, \sigma, \lambda)\), where \(\mu\) is the location, \(\sigma\) is the dispersion parameter and \(\lambda\) is the degree of skewness;
  • If \(\lambda = 0\), then we have a Normal distribution;

Density function

  • \(f(x) = \displaystyle\frac{2}{\sigma}\phi\left(\frac{x - \mu}{\sigma}\right)\Phi\left(\lambda\left(\frac{x - \mu}{\sigma}\right)\right) \quad \mbox{if} \quad -\infty < x < \infty\);

Moments

  • \(E(X) = \mu + \sigma\sqrt{\frac{2}{\pi}}\delta\);
  • \(Var(X) = \sigma^2\left(1 - \frac{2\delta^2}{\pi}\right)\);
  • Skewness = \(\frac{4 - \pi}{2}\frac{(\delta\sqrt{2/\pi})^3}{(1 - 2\delta^2/\pi)^{3/2}}\);
  • Kurtosis = \(2(\pi - 3)\frac{(\delta\sqrt{2/\pi})^4}{(1 - 2\delta^2/\pi)^2}\);

where \(\delta = \frac{\lambda}{\sqrt{1 + \lambda^2}}\)
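
A minimal sketch checking \(E(X)\) numerically with dSN1() from the gamlss.dist package (the same function used in the plots below), for \(\mu = 0\), \(\sigma = 1\), \(\lambda = 3\):

Code
library(gamlss.dist) # provides dSN1

mu <- 0
sigma <- 1
lambda <- 3
delta <- lambda / sqrt(1 + lambda^2)
mu + sigma * sqrt(2/pi) * delta                  # E(X) from the formula
integrate(function(x) x * dSN1(x, mu = mu, sigma = sigma, nu = lambda),
          lower = -Inf, upper = Inf)             # numerical check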

Skew Normal distribution

Code
ggplot(data = data.frame(x = c(-6, 6)), aes(x)) +
  stat_function(fun = dSN1, args = list(mu = 0, sigma = 1, nu = 0), aes(colour = "a"), size = 1) +
  stat_function(fun = dSN1, args = list(mu = 0, sigma = 1, nu = 1), aes(colour = "b"), size = 1, linetype = "dashed") +
  stat_function(fun = dSN1, args = list(mu = 0, sigma = 1, nu = -1), aes(colour = "c"), size = 1, linetype = "dashed") +
    stat_function(fun = dSN1, args = list(mu = 0, sigma = 1, nu = 3), aes(colour = "d"), size = 1, linetype = "dashed") +
  stat_function(fun = dSN1, args = list(mu = 0, sigma = 1, nu = -3), aes(colour = "e"), size = 1, linetype = "dashed") +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("N(0, 1)", "SN(0, 1, 1)", "SN(0, 1, -1)", "SN(0, 1, 3)", "SN(0, 1, -3)"))

Varying the skewness

Log-Normal distribution

  • It is also derived from the Normal distribution;
  • If \(Z \sim Normal(\mu, \sigma^2)\), then \(X = e^{Z} \sim Log-Normal(\mu, \sigma^2)\);
  • Conversely, if \(X \sim Log-Normal(\mu, \sigma^2)\), then \(Z = \ln(X) \sim Normal(\mu, \sigma^2)\);

Density function

  • \(f(x) = \displaystyle\frac{1}{x \sqrt{2\pi \sigma^2}} \exp\left\{-\displaystyle\frac{1}{2\sigma^2}(ln(x) - \mu)^2\right\} \quad \mbox{if} \quad 0 < x < \infty\);

Moments

  • \(E(X) = exp\{\mu + \sigma^2/2\}\);
  • \(Var(X) = (exp\{\sigma^2\} - 1)exp\{2\mu + \sigma^2\}\);
  • Skewness = \((exp\{\sigma^2\} + 2)\sqrt{\exp\{\sigma^2\} - 1}\);
  • Kurtosis = \(exp\{4\sigma^2\} + 2exp\{3\sigma^2\} + 3exp\{2\sigma^2\} - 6\).
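
A minimal sketch using the relationship with the Normal distribution to check the mean and variance by simulation:

Code
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
mu <- 0
sigma <- 1
z <- rnorm(n = 100000, mean = mu, sd = sigma)
x <- exp(z)                              # a Log-Normal sample built from a Normal sample
mean(x)                                  # approximately exp(mu + sigma^2/2) = 1.6487
exp(mu + sigma^2/2)                      # theoretical mean
var(x)                                   # approximately the theoretical variance below
(exp(sigma^2) - 1) * exp(2*mu + sigma^2) # theoretical variance = 4.6708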

Log-Normal distribution

Code
ggplot(data = data.frame(x = c(0, 50)), aes(x)) +
  stat_function(fun = dlnorm, args = list(meanlog = 0, sdlog = 1), aes(colour = "a")) +
  stat_function(fun = dlnorm, args = list(meanlog = 3, sdlog = 1), aes(colour = "b")) +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("LN(0, 1)", "LN(3, 1)"))
ggplot(data = data.frame(x = c(0, 30)), aes(x)) +
  stat_function(fun = dlnorm, args = list(meanlog = 0, sdlog = 1), aes(colour = "a")) +
  stat_function(fun = dlnorm, args = list(meanlog = 0, sdlog = 2), aes(colour = "b")) +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("LN(0, 1)", "LN(0, 2)"))

Chi-Square distribution

  • It is rarely used to describe endpoints, but it is often used in hypothesis testing;
  • \(X \sim \chi^2_{k}\) where \(k\) is a parameter known as degrees of freedom;

Density function

  • \(f(x) = \displaystyle\frac{1}{2^{k/2}\Gamma(k/2)} x^{k/2-1}e^{-x/2} \quad \mbox{if} \quad 0 < x < \infty\);

Moments

  • \(E(X) = k\);
  • \(Var(X) = 2k\);
  • Skewness = \(\sqrt{8/k}\);
  • Kurtosis = \(12/k\).
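
A minimal sketch: a Chi-Squared variable with \(k\) degrees of freedom arises as the sum of \(k\) squared independent \(N(0, 1)\) variables, which makes the moments easy to check by simulation:

Code
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
k <- 5
x <- replicate(10000, sum(rnorm(k)^2)) # sum of k squared standard Normal variables
mean(x)                                # approximately k = 5
var(x)                                 # approximately 2k = 10
pchisq(11.07, df = k)                  # P(X <= 11.07), about 0.95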

Chi-Square distribution

Code
ggplot(data = data.frame(x = c(0, 50)), aes(x)) +
  stat_function(fun = dchisq, args = list(df = 1), aes(colour = "a")) +
  stat_function(fun = dchisq, args = list(df = 5), aes(colour = "b")) +
  stat_function(fun = dchisq, args = list(df = 20), aes(colour = "c")) +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("Chi-Squared(1)", 
                                                   "Chi-Squared(5)", 
                                                   "Chi-Squared(20)"))

F-Snedecor distribution

  • It is rarely used to describe endpoints, but it is often used in hypothesis testing;
  • \(X \sim F_{d_1, d_2}\) where \(d_1\) and \(d_2\) are both parameters known as degrees of freedom;

Density function

  • \(f(x) = \displaystyle\frac{\sqrt{\frac{(d_1x)^{d_1}d_2^{d_2}}{(d_1x + d_2)^{d_1 + d_2}}}}{x \beta(d_1/2, d_2/2)} \quad \mbox{if} \quad 0 < x < \infty\);

Moments

  • \(E(X) = \displaystyle\frac{d_2}{d_2 - 2}\) for \(d_2 > 2\);
  • \(Var(X) = \frac{2d_2^2(d_1 + d_2 - 2)}{d_1(d_2 - 2)^2(d_2 - 4)}\) for \(d_2 > 4\);
  • Skewness = \(\frac{(2d_1 + d_2 - 2)\sqrt{8(d_2 - 4)}}{(d_2 - 6)\sqrt{d_1(d_1 + d_2 - 2)}}\);
  • Kurtosis = \(12\frac{d_1(5d_2 - 22)(d_1 + d_2 - 2) + (d_2 - 4)(d_2 - 2)^2}{d_1(d_2 - 6)(d_2 - 8)(d_1 + d_2 - 2)}\).
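
A minimal sketch: an \(F_{d_1, d_2}\) variable arises as the ratio of two independent Chi-Squared variables divided by their degrees of freedom, which makes the mean easy to check by simulation:

Code
set.seed(1234) #set.seed allows us to reproduce any random procedure in R.
d1 <- 5
d2 <- 10
x <- (rchisq(100000, df = d1)/d1) / (rchisq(100000, df = d2)/d2)
mean(x)                      # approximately d2/(d2 - 2) = 1.25
d2 / (d2 - 2)                # theoretical mean
pf(3.33, df1 = d1, df2 = d2) # P(X <= 3.33), about 0.95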

F-Snedecor distribution

Code
ggplot(data = data.frame(x = c(0, 20)), aes(x)) +
  stat_function(fun = df, args = list(df1 = 1, df2 = 1), aes(colour = "a")) +
  stat_function(fun = df, args = list(df1 = 1, df2 = 5), aes(colour = "b")) +
  stat_function(fun = df, args = list(df1 = 5, df2 = 1), aes(colour = "c")) +
  labs(x = "X", y = "f(x)") +
  theme_bw() +   theme(text = element_text(size=20),
                       legend.text = element_text(size = 12),
                       legend.title = element_text(size = 12),
                       legend.position = "bottom") +
  scale_colour_discrete("Distribution", labels = c("F(1, 1)", 
                                                   "F(1, 5)", 
                                                   "F(5, 1)"))

Other continuous distributions