This post will focus on merging datasets with tidyverse
using R. I will use data from NHANES, which are freely available for everyone. The first dataset data1
consists of the blood pressure levels for each participant, and the second data2
contain their LDL and Triglycerides levels.
First, I will load the neccessary libraries and datasets.
library(tidyverse)
library(RNHANES)
data1 <- nhanes_load_data("BPX_G", "2011-2012") %>%
select(SEQN, BPXSY1, BPXDI1)
data2 <- nhanes_load_data("TRIGLY_G", "2011-2012") %>%
select(SEQN, LBXTR, LBDLDL)
Get some information on the data1
and data2
dim(data1)
## [1] 9338 3
dim(data2)
## [1] 3239 3
data1
has 9338 rows and 3 columns. data2
has 3239 rows and 3 colums.
data1 %>% slice(1:10)
## SEQN BPXSY1 BPXDI1
## 1 62161 110 82
## 2 62162 NA NA
## 3 62163 112 38
## 4 62164 116 56
## 5 62165 110 64
## 6 62166 96 32
## 7 62167 NA NA
## 8 62168 NA NA
## 9 62169 124 80
## 10 62170 124 82
data2 %>% slice(1:10)
## SEQN LBXTR LBDLDL
## 1 62161 84 110
## 2 62164 56 151
## 3 62165 71 84
## 4 62169 78 73
## 5 62170 44 77
## 6 62171 38 68
## 7 62172 141 68
## 8 62177 126 130
## 9 62178 100 134
## 10 62184 40 117
inner_join
Now I will merge data1
and data2
using the function inner_join
.
total <- inner_join(data1, data2, by="SEQN")
dim(total)
## [1] 3239 5
total
has 3239 rows and five variables. This indicates that the total
kept all the IDs that were available in both datasets and entered the new columns from data2
. So in case, you want to keep only IDs that are available in both databases use the function inner_join
left_join
Next function is left_join
. The left_join
is used when we want to keep all the ID in one dataset and to add columns from the other datasets. If the second dataset has less number of ID, the NA
will be added.
total1 <- left_join(data1, data2, by="SEQN")
dim(total1)
## [1] 9338 5
total1 %>% slice(1:10)
## SEQN BPXSY1 BPXDI1 LBXTR LBDLDL
## 1 62161 110 82 84 110
## 2 62162 NA NA NA NA
## 3 62163 112 38 NA NA
## 4 62164 116 56 56 151
## 5 62165 110 64 71 84
## 6 62166 96 32 NA NA
## 7 62167 NA NA NA NA
## 8 62168 NA NA NA NA
## 9 62169 124 80 78 73
## 10 62170 124 82 44 77
In comparison with the inner_join
, the left_join
does not delete rows that are not available in the second dataset.
right_join
Another merge is using right_join
function, which does the opposite of the left_join
. SO the matching will be based on the data2
and not data1
total2 <- right_join(data1, data2, by="SEQN")
dim(total2)
## [1] 3239 5
total2 %>% slice(1:10)
## SEQN BPXSY1 BPXDI1 LBXTR LBDLDL
## 1 62161 110 82 84 110
## 2 62164 116 56 56 151
## 3 62165 110 64 71 84
## 4 62169 124 80 78 73
## 5 62170 124 82 44 77
## 6 62171 112 54 38 68
## 7 62172 100 70 141 68
## 8 62177 152 68 126 130
## 9 62178 124 72 100 134
## 10 62184 120 70 40 117
full_join
The full_join
command, returns in a final dataset, all rows, and all columns from both datasets. When there is not a matching value, it is turned into N/A.
semi_join
The semi_join
function is different than the previous examples of joins. A semi join creates a new dataset in which there are all rows from the data1
where there is a corresponding matching value in data2
. Still, instead of the final dataset merging both the first (data1) and second (data2) datasets, it only contains the variables from the first one (data1).
This illustration is shown below:
total3 <- semi_join(data1, data2, by="SEQN")
dim(total3)
## [1] 3239 3
total3 %>% slice(1:10)
## SEQN BPXSY1 BPXDI1
## 1 62161 110 82
## 2 62164 116 56
## 3 62165 110 64
## 4 62169 124 80
## 5 62170 124 82
## 6 62171 112 54
## 7 62172 100 70
## 8 62177 152 68
## 9 62178 124 72
## 10 62184 120 70
anti_join
An example that is very helpful in exploring datasets, is anti_join
command, and does the opposite of semi_join
; it shows the rows from the first dataset data1
where there are not matching values from the second dataset data2
:
total4 <- anti_join(data1, data2, by="SEQN")
dim(total4)
## [1] 6099 3
total4 %>% slice(1:10)
## SEQN BPXSY1 BPXDI1
## 1 62162 NA NA
## 2 62163 112 38
## 3 62166 96 32
## 4 62167 NA NA
## 5 62168 NA NA
## 6 62173 NA NA
## 7 62174 NA NA
## 8 62175 NA NA
## 9 62176 NA NA
## 10 62179 126 78
nest_join
The nest_join
identified which IDs are in data2
that match that row from data1
. The '0' means there is no match.
total5 <- nest_join(data1, data2, by="SEQN")
dim(total5)
## [1] 9338 4
total5 %>% slice(1:10)
## SEQN BPXSY1 BPXDI1 y
## 1 62161 110 82 84, 110
## 2 62162 NA NA
## 3 62163 112 38
## 4 62164 116 56 56, 151
## 5 62165 110 64 71, 84
## 6 62166 96 32
## 7 62167 NA NA
## 8 62168 NA NA
## 9 62169 124 80 78, 73
## 10 62170 124 82 44, 77
Hope you find helpful!