Descriptive Analysis
Marcio Diniz | Michael Luu
Cedars Sinai Medical Center
15 September, 2022
Emergency dataset
Background
- Patients with cirrhosis have a high risk of bacterial infections and cirrhosis decompensation, resulting in admission to the emergency department (ED). However, no criteria have been developed in the ED to identify patients with cirrhosis who have a bacterial infection and a high mortality risk.
Sample
- This is a retrospective single-center study using a tertiary hospital’s database to identify consecutive ED patients with decompensated cirrhosis;
- Data from 149 patients were collected.
Types of variables
Can we classify these patients’ characteristics?
- Infection (No, Yes)
- Gender (Female, Male)
- Age (Years)
- Heart Rate (beat/min)
- C-Reactive Protein level (mg/L)
- Child-Pugh Score (A, B, C)
- Amount of Albumin (g/dL)
- Number of Leukocytes (per mm\(^3\))
Types of variables
Quantitative variables
- Age, Heart Rate, C-Reactive Protein, Albumin, Number of Leukocytes.
  - Discrete: Age, Heart Rate, Number of Leukocytes;
  - Continuous: Age, C-Reactive Protein, Albumin.
Qualitative variables
- Infection, Gender, Child-Pugh Score.
  - Nominal: Infection, Gender;
  - Ordinal: Child-Pugh Score.
How do we summarize variables?
Quantitative Variables
Descriptive Measures
- Measures of Location: Mean, Median, and Mode;
- Measures of Dispersion: Standard Deviation, Quantiles, Minimum and Maximum.
Plots
- Dot-plots;
- Histograms;
- Box-plots;
- Violin-plots.
Measures of Location
How can we summarize people’s height?
Height chart
Measures of Location
Mode
- It is the value that occurs most frequently in the data set;
- It does not always assess the center of a distribution;
- What happens when measures are continuous variables?
- Each value will be unique, therefore there is no mode unless measures are assigned to bins.
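The binning idea can be sketched in Python. The heights and the 5 cm bin width below are hypothetical values chosen for illustration, not data from the slides.

```python
from collections import Counter
from statistics import multimode

# Hypothetical heights in cm (illustrative values, not from the slides)
heights = [171.2, 175.8, 176.4, 178.9, 179.5, 183.7, 184.1, 190.5]

# Every continuous value is unique, so every value ties as "the mode"
modes_raw = multimode(heights)

# Assign each value to a 5 cm bin, labelled by its lower edge (assumed width)
bin_width = 5
bins = [int(h // bin_width) * bin_width for h in heights]
modal_bin, count = Counter(bins).most_common(1)[0]

print("modes of raw values:", modes_raw)
print("modal bin starts at", modal_bin, "cm and holds", count, "values")
```

With the raw values, the mode is uninformative because all eight values are returned; after binning, a single modal bin emerges. A different bin width could give a different modal bin.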
Measures of Location
Mean
- It is the average of the measures;
- What happens to the mean if Tiago is replaced by the tallest basketball player (Sun Mingming, 7’9’’ or 236.22 cm)?
- The mean would increase to 184.52 cm;
- It is susceptible to outliers.
Measures of Location
Median
- It is the middle value.
- Rank the values from lowest to highest and identify the middle one.
- If there is an even number of values, average the two middle ones;
- What happens to the median if Tiago is replaced by the tallest basketball player (Sun Mingming, 7’9’’ or 236.22 cm)?
- The median would still be 182.5 cm;
- It is robust to outliers.
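A small Python sketch of this contrast between mean and median. The six heights are hypothetical, since the original height chart is not reproduced here. It assumes the replaced person was already taller than the median, which is why the middle values do not move.

```python
from statistics import mean, median

# Hypothetical heights in cm (not the values from the slide's height chart)
heights = [170.0, 175.0, 180.0, 185.0, 190.0, 192.0]

# Replace the tallest person by Sun Mingming (236.22 cm)
with_outlier = heights[:-1] + [236.22]

print("mean:  ", round(mean(heights), 2), "->", round(mean(with_outlier), 2))
print("median:", median(heights), "->", median(with_outlier))
```

The mean shifts upward because every value enters the average, while the median only depends on the two middle values, which are unchanged.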
Dot plots
- Each point is a sample unit;
- Sample units with the same value are stacked;
- It is easy to identify the mode;
- If the distribution is symmetric, mean and median can be considered the center of the dotplot;
- How about identifying those location measures when the distribution is not symmetric?
- It is not straightforward to identify mean and median if the distribution is not symmetric.
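As a rough illustration, the following Python sketch draws a stacked text dot plot for a hypothetical right-skewed sample and prints the three location measures. The heart rates and the `text_dotplot` helper are made up for this example, and the columns are not spaced to scale.

```python
from collections import Counter
from statistics import mean, median, multimode

# Hypothetical right-skewed heart rates in beats/min (illustrative values only)
rates = [60, 62, 62, 65, 65, 65, 70, 88]

def text_dotplot(values):
    """Return the lines of a stacked text dot plot, one column per distinct value.

    Simplification: columns are not spaced to scale along the axis.
    """
    counts = Counter(values)
    xs = sorted(counts)
    tallest = max(counts.values())
    lines = []
    for level in range(tallest, 0, -1):
        # Draw a dot where the stack for this value reaches the current level
        lines.append(" ".join("  *" if counts[x] >= level else "   " for x in xs))
    lines.append(" ".join(f"{x:>3}" for x in xs))
    return lines

plot = text_dotplot(rates)
print("\n".join(plot))
print("mode:", multimode(rates), "median:", median(rates), "mean:", mean(rates))
```

The tallest stack shows the mode directly, but the mean is pulled toward the long right tail and cannot be read off the plot by eye.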
Measures of Dispersion
What is the difference between the groups?
Dotplot with mean indicated in red
Measures of Dispersion
Minimum and Maximum
- They are the smallest and largest values in the sample;
- They usually follow the median as a dispersion measure: Median (Minimum ; Maximum).
Measures of Dispersion
Percentiles/Quantiles
- The median is the 50th percentile;
- Other common percentiles are the 25th (1st quartile) and the 75th (3rd quartile);
- The 25th percentile is the value below which the smallest quarter of the sample values fall;
- The 75th percentile is the value above which the largest quarter of the sample values fall;
- There are several definitions of quantiles;
- They usually follow the median as a dispersion measure: Median (25th percentile ; 75th percentile).
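To see that quantile definitions can disagree, here is a minimal Python sketch using the two methods offered by the standard library's `statistics.quantiles`. The ages are hypothetical, and these two methods are only a subset of the definitions in use.

```python
from statistics import median, quantiles

# Hypothetical ages in years (illustrative values only)
ages = [38, 45, 47, 52, 55, 58, 61, 64, 70, 82]

# Quartiles under two of the available definitions
q_excl = quantiles(ages, n=4, method="exclusive")
q_incl = quantiles(ages, n=4, method="inclusive")

print("exclusive:", q_excl)
print("inclusive:", q_incl)
print("median:   ", median(ages))
```

Both methods agree on the middle quartile, which equals the median, but give slightly different 1st and 3rd quartiles. It therefore helps to report which definition or software default was used.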
Measures of Dispersion
Standard deviation
- It measures the spread around the mean;
- It is expressed in the same unit as the mean;
- Like the mean, it is susceptible to outliers;
- It is usually presented with the mean: Mean \(\pm\) SD.
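A short Python sketch of the computation, assuming the usual sample standard deviation with an n - 1 denominator (the slides do not state which convention is used). The values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical measurements (illustrative values only)
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

m = mean(values)
# Sample standard deviation: root of the summed squared deviations over n - 1
sd_manual = sqrt(sum((x - m) ** 2 for x in values) / (len(values) - 1))

# Replacing the largest value by an extreme one inflates the SD
with_outlier = values[:-1] + [40.0]

print(f"{m:.2f} ± {sd_manual:.2f}")
print(f"{mean(with_outlier):.2f} ± {stdev(with_outlier):.2f}")
```

The manual formula matches `statistics.stdev`, and a single extreme value increases both the mean and the standard deviation.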
Quantitative Variables
- If the variable has a symmetric distribution of values,
  - mean \(\pm\) sd;
  - median (25% - 75% quantiles);
  - median (minimum - maximum).
- If the variable has an asymmetric distribution of values,
  - median (25% - 75% quantiles);
  - median (minimum - maximum).
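One possible way to turn this reporting convention into a helper is sketched below. The function name `describe`, the `symmetric` flag, and the ages are hypothetical. Whether a distribution is symmetric is left to the analyst's judgment, and the quartiles use the default method of `statistics.quantiles`.

```python
from statistics import mean, median, quantiles, stdev

def describe(values, symmetric):
    """Format a quantitative variable as 'mean ± sd' or 'median (Q1 - Q3)'."""
    if symmetric:
        return f"{mean(values):.1f} ± {stdev(values):.1f}"
    q1, _, q3 = quantiles(values, n=4)
    return f"{median(values):.1f} ({q1:.1f} - {q3:.1f})"

# Hypothetical ages in years (illustrative values only)
ages = [38, 45, 47, 52, 55, 58, 61, 64, 70, 82]

print(describe(ages, symmetric=True))
print(describe(ages, symmetric=False))
```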
What happens when we increase the sample size?
Histograms
Frequency
- Histograms group sample units into bins with specific widths;
- If the y-axis is frequency,
  - Adding up the bar heights gives the sample size;
  - The y-axis indicates the number of sample values in each bin;
  - Each bin has to have the same width.
Histograms
Density
- Histograms group sample units into bins with specific widths;
- If the y-axis is density,
  - The area under the histogram is equal to 1;
  - The y-axis indicates the proportion of sample values per unit of the measure;
  - Bins can have different widths.
Histograms
Frequency vs Density for Histograms
| Age Bin | Frequency | Relative Frequency | Interval Width | Density |
|---------|-----------|--------------------|----------------|---------|
| (35,40] | 3 | 0.0207 | 5 | 0.00414 |
| (40,45] | 8 | 0.0552 | 5 | 0.01103 |
| (45,50] | 18 | 0.1241 | 5 | 0.02483 |
| (50,55] | 29 | 0.2000 | 5 | 0.04000 |
| (55,60] | 21 | 0.1448 | 5 | 0.02897 |
| (60,65] | 27 | 0.1862 | 5 | 0.03724 |
| (65,70] | 20 | 0.1379 | 5 | 0.02759 |
| (70,75] | 14 | 0.0966 | 5 | 0.01931 |
| (75,80] | 4 | 0.0276 | 5 | 0.00552 |
| (80,85] | 1 | 0.0069 | 5 | 0.00138 |
| Total | 145 | 1 | | |
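The density column can be recomputed from the frequencies in the table: density is the relative frequency divided by the interval width. The Python sketch below uses the counts copied from the table; the variable names are only illustrative.

```python
# Age bins (lower edges) and frequencies copied from the table
lower_edges = [35, 40, 45, 50, 55, 60, 65, 70, 75, 80]
width = 5
freq = [3, 8, 18, 29, 21, 27, 20, 14, 4, 1]

n = sum(freq)                             # number of values in the table
rel_freq = [f / n for f in freq]          # proportion of values in each bin
density = [r / width for r in rel_freq]   # proportion per year of age

# Area under the density histogram: sum of bar height times bar width
area = sum(d * width for d in density)

for lo, f, d in zip(lower_edges, freq, density):
    print(f"({lo},{lo + width}]  freq={f:2d}  density={d:.5f}")
print("n =", n, " area =", round(area, 10))
```

The bar heights of the frequency histogram add up to the total count, while the bar areas of the density histogram add up to 1.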
Histograms
Frequency vs Density for Histograms
Why do we need a histogram with density on the y-axis?
- It allows one to see the underlying probability distribution that might describe the phenomenon of interest;
- Probability distributions are the statistical tools to perform inferences from a sample to a population.
Box plots
How to interpret a box-plot?
- Central line is the median;
- Bottom line of the box is the 25% quantile;
- Top line of the box is the 75% quantile;
- Bottom whisker is the smallest value that is still greater than or equal to the bottom fence;
- Bottom fence is the 25% quantile - 1.5 \(\times\) (75% quantile - 25% quantile);
- Top whisker is the largest value that is still less than or equal to the top fence;
- Top fence is the 75% quantile + 1.5 \(\times\) (75% quantile - 25% quantile);
- Fences are not shown in a usual box-plot;
- Observations outside of the fences are considered outliers.
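The box-plot elements can be computed by hand, as in the Python sketch below. It assumes the whiskers end at the most extreme observations that still lie within the fences, and it uses the "inclusive" quartile method from the standard library; plotting software may use another quantile definition. The data and the `boxplot_stats` helper are hypothetical.

```python
from statistics import quantiles

def boxplot_stats(values):
    """Compute quartiles, fences, whisker ends, and outliers for a box-plot."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the fences
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1, "median": med, "q3": q3,
        "low_fence": low_fence, "high_fence": high_fence,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }

# Hypothetical sample with one extreme value (illustrative only)
stats = boxplot_stats([2, 4, 5, 6, 7, 8, 9, 10, 30])
for name, value in stats.items():
    print(name, "=", value)
```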
How do we summarize variables?
Qualitative Variables
Descriptive Measures
Plots
Avoid a pie-chart
Florence Nightingale
- Nurse in the Crimean war;
- How could she show to generals that soldiers were dying more because of diseases than in battle?
How do we compare groups?
Stop using barplots for quantitative variables
- Kick the bar chart habit. Nat Methods 11, 113 (2014).
- Krzywinski, M., Altman, N. Visualizing samples with box plots. Nat Methods 11, 119-120 (2014).
- Bar Barplots project
Summary
Flow-chart to guide descriptive analyses
What is next?
- Once we know our data, the next step is to make inferences about the population from which the data were sampled;
- Our main tool for this aim is probability distributions.