Spread, Gather, Separate, and Unite variables and datasets in R

With the use of tidyverse package is become easy to manage and create new datasets. Among many other useful functions that tidyverse has, such as mutate or summarise, other functions including spread, gather, separate, and unite are less used in data management. Therefore, in this post, I will focus on those functions. I will show how to transform the dataset from long to wide, how to separate one variable in two new variables or to unite two variables into one.

Load the library and data:

library(tidyverse)
library(datasets)

The dataset I will use in this post is Smoking, Alcohol and (O)esophageal Cancer which is included by default in the R. Lets take a look of variables.

dt = esoph
head(dt)
##   agegp     alcgp    tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day      0        40
## 2 25-34 0-39g/day    10-19      0        10
## 3 25-34 0-39g/day    20-29      0         6
## 4 25-34 0-39g/day      30+      0         5
## 5 25-34     40-79 0-9g/day      0        27
## 6 25-34     40-79    10-19      0         7

This dataset dt have 88 rows and 5 variable composed of age group, alcohol intake, smoking and number of cases (esophageal cancer) and controls for each row.

summary(dt)
##    agegp          alcgp         tobgp        ncases         ncontrols    
##  25-34:15   0-39g/day:23   0-9g/day:24   Min.   : 0.000   Min.   : 1.00  
##  35-44:15   40-79    :23   10-19   :24   1st Qu.: 0.000   1st Qu.: 3.00  
##  45-54:16   80-119   :21   20-29   :20   Median : 1.000   Median : 6.00  
##  55-64:16   120+     :21   30+     :20   Mean   : 2.273   Mean   :11.08  
##  65-74:15                                3rd Qu.: 4.000   3rd Qu.:14.00  
##  75+  :11                                Max.   :17.000   Max.   :60.00

Spread

As it is shown above, the variable agegp has 6 groups (i.e., 25-34, 35-44) which has different alcohol intake and smoking use combinations. I think it would be interesting to transform this dataset from long to wide and to create a column for each age group and show the respective cases. Let see how the dataset will look like.

dt %>% 
  spread(agegp, ncases) %>% 
  slice(1:5)
##       alcgp    tobgp ncontrols 25-34 35-44 45-54 55-64 65-74 75+
## 1 0-39g/day 0-9g/day        18    NA    NA    NA    NA    NA   1
## 2 0-39g/day 0-9g/day        40     0    NA    NA    NA    NA  NA
## 3 0-39g/day 0-9g/day        46    NA    NA     1    NA    NA  NA
## 4 0-39g/day 0-9g/day        48    NA    NA    NA    NA     5  NA
## 5 0-39g/day 0-9g/day        49    NA    NA    NA     2    NA  NA

Gather

This dataset would be informative if I would prefer to arrange the variables by each age group. However, to get the dataset back to the original structure I will use the function gather.

wide = dt %>% 
  spread(agegp, ncases)

wide %>% 
  gather(agegp, ncases, -alcgp, -tobgp, -ncontrols) %>% 
  filter(!is.na(ncases)) %>% 
  slice(1:5)
##       alcgp    tobgp ncontrols agegp ncases
## 1 0-39g/day 0-9g/day        40 25-34      0
## 2 0-39g/day    10-19        10 25-34      0
## 3 0-39g/day    20-29         6 25-34      0
## 4 0-39g/day      30+         5 25-34      0
## 5     40-79 0-9g/day        27 25-34      0

Separate

separate is used to create multiple columns starting from a single one. Below I am separating the agegp variable into two other variables min and max year.

dt %>% 
  separate(agegp, into = c("min", "max"), sep = "-") %>% 
  slice(1:5)
##   min max     alcgp    tobgp ncases ncontrols
## 1  25  34 0-39g/day 0-9g/day      0        40
## 2  25  34 0-39g/day    10-19      0        10
## 3  25  34 0-39g/day    20-29      0         6
## 4  25  34 0-39g/day      30+      0         5
## 5  25  34     40-79 0-9g/day      0        27

Unite

The opposite of separate is unite. Using the example above, I will unite the min and max variable into agegp as it was in original dataset.

dt2 = dt %>% 
  separate(agegp, into = c("min", "max"), sep = "-")
dt2 %>% 
  unite(agegp, min, max, sep = "-") %>% 
  slice(1:5)
##   agegp     alcgp    tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day      0        40
## 2 25-34 0-39g/day    10-19      0        10
## 3 25-34 0-39g/day    20-29      0         6
## 4 25-34 0-39g/day      30+      0         5
## 5 25-34     40-79 0-9g/day      0        27

Now I hope you will use more often the package tidyverse and its functions spread, gather, separate, and unite.

Data ManipulationtidyverseTips & Tricks