When I have a dataset with many variables and want to create a new variable for each of them, then the first thing comes into my mind is to write a new line of code for each transformation (e.g., new variable). It is time-consuming when I have more than 10 variables. Therefore, as “an advanced R user,” I will use mutate_all
to create a new variable for each variable included in the dataset.
However, in most “real life” circumstances, I don't want to create a new variable for all variables in the dataset, but only for a few of them. For example, I don't want to create a new variable for ID or categorical variables. In this situation, the mutate_at
became a useful function in the tidyverse
package.
Let say that I want to create a tertiles for a selected list of variables in esoph
dataset.
Library and data
library(tidyverse)
dat = as.data.frame(esoph)
dat
## agegp alcgp tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day 0 40
## 2 25-34 0-39g/day 10-19 0 10
## 3 25-34 0-39g/day 20-29 0 6
## 4 25-34 0-39g/day 30+ 0 5
## 5 25-34 40-79 0-9g/day 0 27
## 6 25-34 40-79 10-19 0 7
## 7 25-34 40-79 20-29 0 4
## 8 25-34 40-79 30+ 0 7
## 9 25-34 80-119 0-9g/day 0 2
## 10 25-34 80-119 10-19 0 1
## 11 25-34 80-119 30+ 0 2
## 12 25-34 120+ 0-9g/day 0 1
## 13 25-34 120+ 10-19 1 1
## 14 25-34 120+ 20-29 0 1
## 15 25-34 120+ 30+ 0 2
...
...
The problem
I want to create tertiles for each variable in the dataset and want to exclude the ncases
and ncontrols
from the computation.
The solution
Here is the code (see below the explanation):
dat %>%
mutate_at(list(tertile = ~ntile(., 3)), .vars = vars(ends_with("gp"), -starts_with("nc")))
## agegp alcgp tobgp ncases ncontrols agegp_tertile alcgp_tertile
## 1 25-34 0-39g/day 0-9g/day 0 40 1 1
## 2 25-34 0-39g/day 10-19 0 10 1 1
## 3 25-34 0-39g/day 20-29 0 6 1 1
## 4 25-34 0-39g/day 30+ 0 5 1 1
## 5 25-34 40-79 0-9g/day 0 27 1 1
## 6 25-34 40-79 10-19 0 7 1 1
## 7 25-34 40-79 20-29 0 4 1 1
## 8 25-34 40-79 30+ 0 7 1 1
## 9 25-34 80-119 0-9g/day 0 2 1 2
## 10 25-34 80-119 10-19 0 1 1 2
## 11 25-34 80-119 30+ 0 2 1 2
## 12 25-34 120+ 0-9g/day 0 1 1 3
...
...
The ntile
function is used to create tertiles of the variables. The ends_with
will select variables of interests, and given that all ends with “gp” I used that function. The function -starts_with
will exlclude the variables ncases
and ncontrols
.
Another example
If you want to do other computations such as standardize the variables youu can use the code below. In this example I am using the ncases
and ncontrols
because are continuous.
dat %>%
mutate_at(list(sd = ~./sd(.)), .vars = vars(-ends_with("gp"), starts_with("nc")))
## agegp alcgp tobgp ncases ncontrols ncases_sd ncontrols_sd
## 1 25-34 0-39g/day 0-9g/day 0 40 0.0000000 3.14398607
## 2 25-34 0-39g/day 10-19 0 10 0.0000000 0.78599652
## 3 25-34 0-39g/day 20-29 0 6 0.0000000 0.47159791
## 4 25-34 0-39g/day 30+ 0 5 0.0000000 0.39299826
## 5 25-34 40-79 0-9g/day 0 27 0.0000000 2.12219060
## 6 25-34 40-79 10-19 0 7 0.0000000 0.55019756
## 7 25-34 40-79 20-29 0 4 0.0000000 0.31439861
## 8 25-34 40-79 30+ 0 7 0.0000000 0.55019756
## 9 25-34 80-119 0-9g/day 0 2 0.0000000 0.15719930
## 10 25-34 80-119 10-19 0 1 0.0000000 0.07859965
## 11 25-34 80-119 30+ 0 2 0.0000000 0.15719930
## 12 25-34 120+ 0-9g/day 0 1 0.0000000 0.07859965
## 13 25-34 120+ 10-19 1 1 0.3632179 0.07859965
## 14 25-34 120+ 20-29 0 1 0.0000000 0.07859965
## 15 25-34 120+ 30+ 0 2 0.0000000 0.15719930
...
...
Thats all.
I hope you find these tips and trics useful for your data analysis.