Data manipulation in R

Introduction to the tidyverse: dplyr & tidyr

Michael Luu, MPH | Marcio Diniz, PhD

September 08, 2022

Workflow in Statistical Analysis

What is ‘tidy’ data?

Tidy data is a standardized structure of how data should be stored

Every column should consist of a variable with a single type of data e.g. logical, dbl, integer, character, etc.
Every row should consist of a single observation.
Every cell should consist of a single value type.

What is ‘tidy’ data?

# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# … with 140 more rows

tidyverse

Introduction to the tidyverse

What is the tidyverse?

A collection of R packages that are used to facilitate data exploration, visualization, and ‘tidying’ of your data.

dplyr - manipulation of data
ggplot2 - visualization of data
readr - importing data into R
tibble - creation of dataframes in R
tidyr - ‘tidying’ or ‘cleaning’ your data
purrr - functional or iterative programming in R

What is a function ?

A function has a ‘name’
A function has ‘inputs’
A function has an ‘output’
A function performs an ‘action’
A function is enclosed by parentheses after the name, e.g. f()

In the above example, the function ‘name’ is f(), the function has 1 input x, and the function has 1 output x^2.

dplyr and tidyr

dplyr

These are the six main dplyr functions that we will use to perform the vast majority of the necessary data manipulations in R and will be the focus

mutate() adds new variables that are functions of existing variables
select() subset variables based on their names.
filter() subset cases based on their values.
summarise() reduces multiple values down to a single summary.
arrange() changes the ordering of the rows.
glimpse() obtain a ‘glimpse’ of the dataframe.

tidyr

We will also focus on two main tidyr functions

pivot_longer() transposition of data into the long format
pivot_wider() transposition of data into the wider format

Functions in R programming

Similar to functions in mathematics, all functions have a name, some number of inputs, and an output
In the tidyverse class of functions, the first ‘input’ is always a class data.frame
In the tidyverse class of functions, the output is commonly a class data.frame
- This largely applies to the tidyverse class of functions only
- The output can technically be anything

%>% (Pipe)

Another very important function is the %>% (pipe) operator.

%>% allows us to chain multiple functions together.
x %>% f() is interpreted as f(x)
x %>% f(y) is interpretted as f(x, y)
x %>% f() %>% g() %>% h() is equivalent to h(g(f(x)))

%>% (Pipe)

For example, the pipe operator will allow us to chain together multiple functions and operations in a systematic and ‘readable’ manner

iris %>% filter() %>% mutate() %>% summarise()

As opposed to the alternative without using the pipe seen below

summarise(mutate(filter(iris, ...), ...), ...)

The input of each tidyverse function is a data.frame, and the output of each of these tidyverse function is a data.frame

The pipe operator makes our code significantly more ‘readable’

Help

Type a question mark before the name of the function to invoke the help file for that function.

Code

?mutate

Cheat Sheet

Highly recommend printing these ‘cheat cheets’, and have them available for reference.

Example

Revisiting the built in R dataset iris

# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# … with 140 more rows

glimpse()

glimpse() provides a ‘glimpse’ of ‘what’ is in the dataframe

Code

iris %>% glimpse(., width = 75)

Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.…
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.…
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.…
$ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, seto…

mutate()

Create new variables as functions of other variables mutate()

Code

iris %>% mutate(., ratio = Petal.Length / Petal.Width)

# A tibble: 150 × 6
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species ratio
          <dbl>       <dbl>        <dbl>       <dbl> <fct>   <dbl>
 1          5.1         3.5          1.4         0.2 setosa   7   
 2          4.9         3            1.4         0.2 setosa   7   
 3          4.7         3.2          1.3         0.2 setosa   6.5 
 4          4.6         3.1          1.5         0.2 setosa   7.5 
 5          5           3.6          1.4         0.2 setosa   7   
 6          5.4         3.9          1.7         0.4 setosa   4.25
 7          4.6         3.4          1.4         0.3 setosa   4.67
 8          5           3.4          1.5         0.2 setosa   7.5 
 9          4.4         2.9          1.4         0.2 setosa   7   
10          4.9         3.1          1.5         0.1 setosa  15   
# … with 140 more rows

select()

Select variables by their names select()

Code

iris %>% select(., Petal.Length, Petal.Width, Species)

# A tibble: 150 × 3
   Petal.Length Petal.Width Species
          <dbl>       <dbl> <fct>  
 1          1.4         0.2 setosa 
 2          1.4         0.2 setosa 
 3          1.3         0.2 setosa 
 4          1.5         0.2 setosa 
 5          1.4         0.2 setosa 
 6          1.7         0.4 setosa 
 7          1.4         0.3 setosa 
 8          1.5         0.2 setosa 
 9          1.4         0.2 setosa 
10          1.5         0.1 setosa 
# … with 140 more rows

filter()

Select observations by their values filter()

Code

iris %>% filter(., Sepal.Length >= 5.0)

# A tibble: 128 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          5           3.6          1.4         0.2 setosa 
 3          5.4         3.9          1.7         0.4 setosa 
 4          5           3.4          1.5         0.2 setosa 
 5          5.4         3.7          1.5         0.2 setosa 
 6          5.8         4            1.2         0.2 setosa 
 7          5.7         4.4          1.5         0.4 setosa 
 8          5.4         3.9          1.3         0.4 setosa 
 9          5.1         3.5          1.4         0.3 setosa 
10          5.7         3.8          1.7         0.3 setosa 
# … with 118 more rows

summarise()

Collapse values down to a summary summarise()

Code

iris %>% summarise(
  .,
  mean_sepal_length = mean(Sepal.Length)
)

# A tibble: 1 × 1
  mean_sepal_length
              <dbl>
1              5.84

summarise()

group_by() is another dplyr function that is commonly used with summarise(). It allows us to apply the function summarise() within a categorical variable or a ‘factor’

Code

iris %>%
  group_by(Species) %>%
  summarise(., mean_sepal_length = mean(Sepal.Length))

# A tibble: 3 × 2
  Species    mean_sepal_length
  <fct>                  <dbl>
1 setosa                  5.01
2 versicolor              5.94
3 virginica               6.59

arrange()

Reordering rows arrange()

Code

iris %>% arrange(., Sepal.Length)

# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          4.3         3            1.1         0.1 setosa 
 2          4.4         2.9          1.4         0.2 setosa 
 3          4.4         3            1.3         0.2 setosa 
 4          4.4         3.2          1.3         0.2 setosa 
 5          4.5         2.3          1.3         0.3 setosa 
 6          4.6         3.1          1.5         0.2 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          4.6         3.6          1           0.2 setosa 
 9          4.6         3.2          1.4         0.2 setosa 
10          4.7         3.2          1.3         0.2 setosa 
# … with 140 more rows

arrange()

Code

iris %>% arrange(., desc(Sepal.Length))

# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
          <dbl>       <dbl>        <dbl>       <dbl> <fct>    
 1          7.9         3.8          6.4         2   virginica
 2          7.7         3.8          6.7         2.2 virginica
 3          7.7         2.6          6.9         2.3 virginica
 4          7.7         2.8          6.7         2   virginica
 5          7.7         3            6.1         2.3 virginica
 6          7.6         3            6.6         2.1 virginica
 7          7.4         2.8          6.1         1.9 virginica
 8          7.3         2.9          6.3         1.8 virginica
 9          7.2         3.6          6.1         2.5 virginica
10          7.2         3.2          6           1.8 virginica
# … with 140 more rows

Transposition

The same data can be stored in both a ‘long’ and ‘wide’ representation

Transposition

Since the original dataset does not include a unique identifier for each observation, the below code is to simply add a identifier using the rowname called ‘index’.

Code

iris <- iris %>% rownames_to_column(var = "index")

iris

# A tibble: 150 × 6
   index Sepal.Length Sepal.Width Petal.Length Petal.Width Species
   <chr>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1 1              5.1         3.5          1.4         0.2 setosa 
 2 2              4.9         3            1.4         0.2 setosa 
 3 3              4.7         3.2          1.3         0.2 setosa 
 4 4              4.6         3.1          1.5         0.2 setosa 
 5 5              5           3.6          1.4         0.2 setosa 
 6 6              5.4         3.9          1.7         0.4 setosa 
 7 7              4.6         3.4          1.4         0.3 setosa 
 8 8              5           3.4          1.5         0.2 setosa 
 9 9              4.4         2.9          1.4         0.2 setosa 
10 10             4.9         3.1          1.5         0.1 setosa 
# … with 140 more rows

pivot_longer()

Code

long <- pivot_longer(
  data = iris,
  cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
  names_to = "name",
  values_to = "value"
)

long

# A tibble: 600 × 4
   index Species name         value
   <chr> <fct>   <chr>        <dbl>
 1 1     setosa  Sepal.Length   5.1
 2 1     setosa  Sepal.Width    3.5
 3 1     setosa  Petal.Length   1.4
 4 1     setosa  Petal.Width    0.2
 5 2     setosa  Sepal.Length   4.9
 6 2     setosa  Sepal.Width    3  
 7 2     setosa  Petal.Length   1.4
 8 2     setosa  Petal.Width    0.2
 9 3     setosa  Sepal.Length   4.7
10 3     setosa  Sepal.Width    3.2
# … with 590 more rows

pivot_wider()

Code

wide <- pivot_wider(
  data = long,
  names_from = "name",
  values_from = "value"
)

wide

# A tibble: 150 × 6
   index Species Sepal.Length Sepal.Width Petal.Length Petal.Width
   <chr> <fct>          <dbl>       <dbl>        <dbl>       <dbl>
 1 1     setosa           5.1         3.5          1.4         0.2
 2 2     setosa           4.9         3            1.4         0.2
 3 3     setosa           4.7         3.2          1.3         0.2
 4 4     setosa           4.6         3.1          1.5         0.2
 5 5     setosa           5           3.6          1.4         0.2
 6 6     setosa           5.4         3.9          1.7         0.4
 7 7     setosa           4.6         3.4          1.4         0.3
 8 8     setosa           5           3.4          1.5         0.2
 9 9     setosa           4.4         2.9          1.4         0.2
10 10    setosa           4.9         3.1          1.5         0.1
# … with 140 more rows

Debugging

Errors and bugs in your code will happen. Below are a few simple tips that will hopefully aid your debugging process.

Systematically run your code line by line to identify the specific piece of your code that is causing the error.
Identify the error message in your R console and Google it.
Stack Overflow is an online Q/A website where you can ask your programming questions.
Trial and Error - modify the piece of code, and run it again.