Descriptive Analysis

Marcio Diniz | Michael Luu

Cedars Sinai Medical Center

15 September, 2022

Introduction

Emergency dataset

Background

  • Patients with cirrhosis have a high risk of bacterial infections and cirrhosis decompensation, resulting in admission to the emergency department (ED). However, no criteria have been developed for the ED to identify patients with cirrhosis who have a bacterial infection and a high mortality risk.

Sample

  • This is a retrospective single-center study using a tertiary hospital’s database to identify consecutive ED patients with decompensated cirrhosis;
  • Data from 149 patients were collected.

Types of variables

Can we classify these patients’ characteristics?

  • Infection (No, Yes)
  • Gender (Female, Male)
  • Age (Years)
  • Heart Rate (beat/min)
  • Amount of C-Reactive Protein (mg/L)
  • Child-Pugh Score (A, B, C)
  • Amount of Albumin (g/dL)
  • Number of Leukocytes (per mm\(^3\))

Types of variables

Quantitative variables

  • Age, Heart Rate, C-Reactive Protein, Albumin, Number of Leukocytes.
    • Discrete: Age, Heart Rate, Number of Leukocytes;
    • Continuous: Age, C-Reactive Protein, Albumin.

Qualitative variables

  • Infection, Gender, Child-Pugh Score.
    • Nominal: Infection, Gender;
    • Ordinal: Child-Pugh Score.

How do we summarize variables?

Quantitative Variables

Descriptive Measures

  • Measures of Location: Mean, Median, and Mode;
  • Measures of Dispersion: Standard Deviation, Quantiles, Minimum and Maximum.

Plots

  • Dot-plots;
  • Histograms;
  • Box-plots;
  • Violin-plots.

Measures of Location

Measures of Location

How can we summarize people’s height?

Height chart

Measures of Location

Mode = 188

Mode

  • It is the value that occurs most frequently in the data set;
  • It does not always assess the center of a distribution;
  • What happens when measures are continuous variables?
  • Each value will be unique, therefore there is no mode unless measures are assigned to bins.
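
The point about continuous measures can be illustrated with a small sketch. The heights below are hypothetical (they are not the data behind the slides), and `modal_bin` is an illustrative helper, not part of any library. With repeated values the mode is well defined; with finely measured values every observation is unique, and a mode only reappears after binning.

```python
from collections import Counter
from statistics import multimode

# Hypothetical heights in cm, recorded to the nearest centimetre.
heights_cm = [155.0, 170.0, 177.0, 181.0, 184.0, 188.0, 188.0, 190.0, 193.0]
modes = multimode(heights_cm)  # list of the most frequent value(s)

# Hypothetical heights recorded with one decimal: every value is unique,
# so all values tie for "most frequent" and the mode is uninformative.
precise_cm = [155.2, 170.4, 177.9, 181.3, 186.2, 187.8, 188.1, 190.5, 193.0]
unique_modes = multimode(precise_cm)


def modal_bin(values, width):
    """Return (lower, upper) limits of the bin [lower, upper) with most values."""
    counts = Counter(int(v // width) for v in values)
    index, _ = counts.most_common(1)[0]
    return index * width, (index + 1) * width


print(modes)                                # single repeated value
print(len(unique_modes), len(precise_cm))   # same length: no real mode
print(modal_bin(precise_cm, 5))             # modal 5 cm bin
```

The bin width (5 cm here) is an arbitrary choice, and a different width can move the modal bin.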

Measures of Location

Mean = 180.2

Mean

  • It is the average of the measures;
  • What happens to the mean if Tiago were replaced by the tallest basketball player (Sun Mingming, 7’9’’ or 236.22 cm)?
  • The mean would increase to 184.52;
  • It is susceptible to outliers.

Measures of Location

Median = 182.5

Median

  • It is the middle value.
    • Rank the values from lowest to highest and identify the middle one.
    • If there is an even number of values, average the two middle ones;
  • What happens to the median if Tiago were replaced by the tallest basketball player (Sun Mingming, 7’9’’ or 236.22 cm)?
  • The median would still be 182.5;
  • It is robust to outliers.
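
The contrast between the mean and the median can be checked with a short sketch. The six heights are hypothetical, not the values on the height chart. Only the 236.22 cm replacement comes from the slide.

```python
from statistics import mean, median

# Hypothetical heights in cm, already sorted.
heights_cm = [160, 170, 175, 180, 185, 192]

# Replace the tallest hypothetical person with the 236.22 cm player.
with_outlier = sorted(heights_cm[:-1] + [236.22])

mean_before, mean_after = mean(heights_cm), mean(with_outlier)
median_before, median_after = median(heights_cm), median(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")      # the mean shifts upward
print(f"median: {median_before:.2f} -> {median_after:.2f}")  # the median is unchanged
```

The median is unchanged because the replacement only alters the largest value, which never enters the two middle positions.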

Dot plots

Dot plots

  • Each point is a sample unit;
  • Sample units with the same value are stacked;
  • It is easy to identify the mode;
  • If the distribution is symmetric, mean and median can be considered the center of the dotplot;
  • How about identifying those location measures when the distribution is not symmetric?
  • It is not straightforward to identify mean and median if the distribution is not symmetric.

Dot plots

Dotplot for Age with Mean

Dotplot for Length of Stay with Median

Measures of Dispersion

Measures of Dispersion

What is the difference between the groups?

Dotplot with mean indicated in red

Measures of Dispersion

Minimum = 155 and Maximum = 193

Minimum and Maximum

  • They are the smallest and largest values in the sample;
  • They usually accompany the median as a dispersion measure: Median (Minimum ; Maximum).

Measures of Dispersion

Quantiles 25th = 173.5, 75th = 188

Percentiles/Quantiles

  • The median is the 50th percentile;
  • Other common percentiles are the 25th (1st quartile) and the 75th (3rd quartile);
  • The 25th percentile is the value below which a quarter of the sample values fall;
  • The 75th percentile is the value above which a quarter of the sample values fall;
  • There are several definitions of sample quantiles, so results may differ slightly between software packages;
  • They usually accompany the median as a dispersion measure: Median (25th quantile ; 75th quantile).
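
As a hedged illustration of how quantile definitions can disagree, the sketch below applies the two methods offered by Python's `statistics.quantiles` to the same hypothetical sample. These are only two of the conventions in use, and the slides do not say which one produced the reported values.

```python
from statistics import quantiles

# Hypothetical heights in cm (not the data behind the slides).
heights_cm = [150, 160, 165, 170, 175, 180, 185, 190]

# n=4 asks for the three cut points that split the data into quarters.
q_exclusive = quantiles(heights_cm, n=4, method="exclusive")
q_inclusive = quantiles(heights_cm, n=4, method="inclusive")

print(q_exclusive)  # quartiles under the "exclusive" convention
print(q_inclusive)  # quartiles under the "inclusive" convention
```

Both methods agree on the median here but give different 25th and 75th quantiles, so it helps to report which definition was used.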

Measures of Dispersion

Standard Deviation = 11.25

Standard deviation

  • It measures the spread around the mean;
  • It is expressed in the same unit as the mean;
  • Like the mean, it is susceptible to outliers;
  • It is usually presented with the mean: Mean \(\pm\) SD.

Quantitative Variables

  • If the variable has a symmetric distribution of values,
    • mean \(\pm\) sd;
    • median (25% - 75% quantiles);
    • median (minimum - maximum).
  • If the variable has an asymmetric distribution of values,
    • median (25% - 75% quantiles);
    • median (minimum - maximum).
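
A minimal sketch of the three reporting formats listed above, using hypothetical ages. The helper name `summarize`, the sample standard deviation (`stdev`, denominator n - 1), and the "inclusive" quantile method are assumptions made for illustration.

```python
from statistics import mean, median, quantiles, stdev


def summarize(values):
    """Return the three summary strings commonly used for a quantitative variable."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    med = median(values)
    return {
        "mean_sd": f"{mean(values):.1f} ± {stdev(values):.1f}",
        "median_iqr": f"{med:.1f} ({q1:.1f} - {q3:.1f})",
        "median_range": f"{med:.1f} ({min(values):.1f} - {max(values):.1f})",
    }


ages = [42, 51, 55, 58, 60, 63, 67, 71]  # hypothetical ages in years
summary = summarize(ages)
for label, text in summary.items():
    print(label, text)
```

For an asymmetric variable, only the two median-based strings would be reported.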

Histograms

What happens when we increase the sample size?

Histograms

Frequency

  • Histograms group sample units into bins of specified widths;
  • If the y-axis is frequency,
    • The bar heights add up to the sample size;
    • The y-axis indicates the number of sample values in each bin;
    • All bins must have the same width.

Histograms

Density

  • Histograms group sample units into bins of specified widths;
  • If the y-axis is density,
    • The area under the histogram is equal to 1;
    • The y-axis indicates the proportion of sample values per unit of the measure;
    • Bins can have different widths.

Histograms

Frequency vs Density for Histograms

Age Bin    Frequency    Relative Frequency    Interval Width    Density¹
(35,40]            3                0.0207                 5     0.00414
(40,45]            8                0.0552                 5     0.01103
(45,50]           18                0.1241                 5     0.02483
(50,55]           29                0.2000                 5     0.04000
(55,60]           21                0.1448                 5     0.02897
(60,65]           27                0.1862                 5     0.03724
(65,70]           20                0.1379                 5     0.02759
(70,75]           14                0.0966                 5     0.01931
(75,80]            4                0.0276                 5     0.00552
(80,85]            1                0.0069                 5     0.00138
Total            145                1

¹ Density = Relative Frequency / Interval Width
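
The density column can be recomputed from the frequencies in the table above. The sketch below uses the table's bins and counts. The helper names and the merged 10-year bin at the end are illustrative assumptions, added only to show that the total area stays equal to 1 when bin widths differ.

```python
def densities(bins, freq):
    """Density per bin = relative frequency / interval width."""
    n = sum(freq)
    return [f / n / (hi - lo) for (lo, hi), f in zip(bins, freq)]


def total_area(bins, dens):
    """Sum of bar areas (density x width) over all bins."""
    return sum(d * (hi - lo) for (lo, hi), d in zip(bins, dens))


# Bin limits and frequencies copied from the Age table.
bins = [(35, 40), (40, 45), (45, 50), (50, 55), (55, 60),
        (60, 65), (65, 70), (70, 75), (75, 80), (80, 85)]
freq = [3, 8, 18, 29, 21, 27, 20, 14, 4, 1]

dens = densities(bins, freq)
area = total_area(bins, dens)

# Illustrative variation: merge the last two bins into one 10-year bin.
merged_bins = bins[:-2] + [(75, 85)]
merged_freq = freq[:-2] + [freq[-2] + freq[-1]]
merged_area = total_area(merged_bins, densities(merged_bins, merged_freq))

print([round(d, 5) for d in dens])
print(round(area, 10), round(merged_area, 10))
```

With frequency on the y-axis, the merged bin would show a bar of height 5 next to bars of height 14 and below, overstating how common those ages are. Dividing by the width corrects for that.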

Histograms

Frequency vs Density for Histograms

Histogram for Length of Stay considering bins of equal width

Histogram for Length of Stay using Frequency considering bins of different widths

Histograms

Frequency vs Density for Histograms

Why do we need a histogram with density on the y-axis?

  • It allows one to see the underlying probability distribution that might describe the phenomenon of interest;
  • Probability distributions are the statistical tools to perform inferences from a sample to a population.

Box plots

Box plots

Dot plot for Age with Median (Red line)

Dot plot for Length of Stay with Median (Red line)

Box plots

Dot-Plot for Age with Median (Red line) and 25%, 75% quantiles (Orange lines)

Dot-Plot for Length of Stay with Median (Red line) and 25%, 75% quantiles (Orange lines)

Box plots

Dot-Plot for Age with Median (Red line) and 25%, 75% quantiles (Orange lines) and fences (Green lines)

Dot-Plot for Length of Stay with Median (Red line) and 25%, 75% quantiles (Orange lines) and fences (Green lines)

Box plots

Dot-Plot for Age with Median (Red line) and 25%, 75% quantiles (Orange lines) and fences (Green lines) and Minimum and Maximum (Blue lines)

Dot-Plot for Length of Stay with Median (Red line) and 25%, 75% quantiles (Orange lines) and fences (Green lines) and Minimum and Maximum (Blue lines)

Box plots

Box-Plot for Age

Box-Plot for Length of Stay

Box plots

How to interpret a box-plot?

  • Central line is the median;
  • Bottom line of the box is the 25% quantile;
  • Top line of the box is the 75% quantile;
  • Bottom whisker ends at the smallest value that is not below the bottom fence;
  • Bottom fence is the 25% quantile - 1.5 \(\times\) (75% quantile - 25% quantile);
  • Top whisker ends at the largest value that is not above the top fence;
  • Top fence is the 75% quantile + 1.5 \(\times\) (75% quantile - 25% quantile);
    • Fences are not shown in a usual box-plot;
    • Observations outside of the fences are considered outliers.
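
The elements of a box-plot can be computed by hand. This is a sketch under assumptions: the length-of-stay values are hypothetical, `boxplot_stats` is an illustrative helper, quartiles use the "inclusive" method of `statistics.quantiles`, and the whiskers follow the common Tukey convention of ending at the most extreme observations that still lie inside the fences. Plotting software may use a different quantile definition.

```python
from statistics import quantiles


def boxplot_stats(values):
    """Compute quartiles, fences, whisker ends, and outliers for a box-plot."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # 75% quantile minus 25% quantile
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "fences": (low_fence, high_fence),
        "whiskers": (min(inside), max(inside)),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


length_of_stay = [1, 2, 2, 3, 3, 4, 5, 6, 8, 30]  # hypothetical days
stats = boxplot_stats(length_of_stay)
print(stats)
```

In this example the 30-day stay falls above the top fence, so it is drawn as an individual point and the top whisker stops at 8 days.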

How do we summarize variables?

Qualitative Variables

Descriptive Measures

  • Frequency (Percentage);
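
A short sketch of the "Frequency (Percentage)" summary, using hypothetical Child-Pugh classes rather than the study data. The helper name `frequency_table` is illustrative.

```python
from collections import Counter


def frequency_table(values):
    """Return 'count (percentage%)' for each category, in sorted order."""
    counts = Counter(values)
    n = len(values)
    return {
        level: f"{counts[level]} ({100 * counts[level] / n:.1f}%)"
        for level in sorted(counts)
    }


# Hypothetical Child-Pugh classes for ten patients.
child_pugh = ["A", "B", "B", "C", "B", "C", "A", "B", "C", "B"]
table = frequency_table(child_pugh)
for level, text in table.items():
    print(level, text)
```

Sorting the levels alphabetically happens to respect the ordinal order A, B, C. Other ordinal variables would need an explicit ordering.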

Plots

  • Barplots;
  • Pie-charts.

Avoid a pie-chart

It is not a pie-chart. It is a histogram!

Florence Nightingale

  • Nurse in the Crimean war;
  • How could she show the generals that more soldiers were dying from disease than in battle?

Barplots

Barplot for Child-Pugh Score based on frequency

Barplot for Child-Pugh Score based on percentage

How do we compare groups?

Quantitative Variables

Bar plot for Age with Mean and SD

Dot plot for Age with Mean

Quantitative Variables

Bar plot for Age with Mean and SD

Histogram for Age

Quantitative Variables

Barplot for Age with Mean and SD

Box plot for Age

Stop using barplots for quantitative variables

  • Kick the bar chart habit. Nat Methods 11, 113 (2014).
  • Krzywinski, M., Altman, N. Visualizing samples with box plots. Nat Methods 11, 119-120 (2014).
  • Bar Barplots project

Quantitative Variables

Bar plot for Length of Stay with Mean and SD

Dot plot for Length of Stay with Mean

Quantitative Variables

Bar plot for Length of Stay with Mean and SD

Histogram for Length of Stay

Quantitative Variables

Barplot for Length of Stay with Mean and SD

Box plot for Length of Stay

Qualitative Variables

Summary

Flow-chart to guide descriptive analyses

What is next?

  • Once we know our data, the next step is to make inferences about the population from which our data were sampled;
  • Our main tool for this is probability distributions.