With the use of tidyverse
package is become easy to manage and create new datasets. Among many other useful functions that tidyverse
has, such as mutate
or summarise
, other functions including spread
, gather
, separate
, and unite
are less used in data management. Therefore, in this post, I will focus on those functions. I will show how to transform the dataset from long to wide, how to separate one variable in two new variables or to unite two variables into one.
Load the library and data:
library(tidyverse)
library(datasets)
The dataset I will use in this post is Smoking, Alcohol and (O)esophageal Cancer which is included by default in the R. Lets take a look of variables.
dt = esoph
head(dt)
## agegp alcgp tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day 0 40
## 2 25-34 0-39g/day 10-19 0 10
## 3 25-34 0-39g/day 20-29 0 6
## 4 25-34 0-39g/day 30+ 0 5
## 5 25-34 40-79 0-9g/day 0 27
## 6 25-34 40-79 10-19 0 7
This dataset dt
have 88 rows and 5 variable composed of age group, alcohol intake, smoking and number of cases (esophageal cancer) and controls for each row.
summary(dt)
## agegp alcgp tobgp ncases ncontrols
## 25-34:15 0-39g/day:23 0-9g/day:24 Min. : 0.000 Min. : 1.00
## 35-44:15 40-79 :23 10-19 :24 1st Qu.: 0.000 1st Qu.: 3.00
## 45-54:16 80-119 :21 20-29 :20 Median : 1.000 Median : 6.00
## 55-64:16 120+ :21 30+ :20 Mean : 2.273 Mean :11.08
## 65-74:15 3rd Qu.: 4.000 3rd Qu.:14.00
## 75+ :11 Max. :17.000 Max. :60.00
Spread
As it is shown above, the variable agegp
has 6 groups (i.e., 25-34, 35-44) which has different alcohol intake and smoking use combinations. I think it would be interesting to transform this dataset from long to wide and to create a column for each age group and show the respective cases. Let see how the dataset will look like.
dt %>%
spread(agegp, ncases) %>%
slice(1:5)
## alcgp tobgp ncontrols 25-34 35-44 45-54 55-64 65-74 75+
## 1 0-39g/day 0-9g/day 18 NA NA NA NA NA 1
## 2 0-39g/day 0-9g/day 40 0 NA NA NA NA NA
## 3 0-39g/day 0-9g/day 46 NA NA 1 NA NA NA
## 4 0-39g/day 0-9g/day 48 NA NA NA NA 5 NA
## 5 0-39g/day 0-9g/day 49 NA NA NA 2 NA NA
Gather
This dataset would be informative if I would prefer to arrange the variables by each age group. However, to get the dataset back to the original structure I will use the function gather
.
wide = dt %>%
spread(agegp, ncases)
wide %>%
gather(agegp, ncases, -alcgp, -tobgp, -ncontrols) %>%
filter(!is.na(ncases)) %>%
slice(1:5)
## alcgp tobgp ncontrols agegp ncases
## 1 0-39g/day 0-9g/day 40 25-34 0
## 2 0-39g/day 10-19 10 25-34 0
## 3 0-39g/day 20-29 6 25-34 0
## 4 0-39g/day 30+ 5 25-34 0
## 5 40-79 0-9g/day 27 25-34 0
Separate
separate
is used to create multiple columns starting from a single one. Below I am separating the agegp
variable into two other variables min
and max
year.
dt %>%
separate(agegp, into = c("min", "max"), sep = "-") %>%
slice(1:5)
## min max alcgp tobgp ncases ncontrols
## 1 25 34 0-39g/day 0-9g/day 0 40
## 2 25 34 0-39g/day 10-19 0 10
## 3 25 34 0-39g/day 20-29 0 6
## 4 25 34 0-39g/day 30+ 0 5
## 5 25 34 40-79 0-9g/day 0 27
Unite
The opposite of separate
is unite
. Using the example above, I will unite the min
and max
variable into agegp
as it was in original dataset.
dt2 = dt %>%
separate(agegp, into = c("min", "max"), sep = "-")
dt2 %>%
unite(agegp, min, max, sep = "-") %>%
slice(1:5)
## agegp alcgp tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day 0 40
## 2 25-34 0-39g/day 10-19 0 10
## 3 25-34 0-39g/day 20-29 0 6
## 4 25-34 0-39g/day 30+ 0 5
## 5 25-34 40-79 0-9g/day 0 27
Now I hope you will use more often the package tidyverse
and its functions spread
, gather
, separate
, and unite
.