This post will focus on merging datasets with tidyverse using R. I will use data from NHANES, which are freely available for everyone. The first dataset data1 consists of the blood pressure levels for each participant, and the second data2 contain their LDL and Triglycerides levels.
First, I will load the neccessary libraries and datasets.
library(tidyverse)
library(RNHANES)
data1 <- nhanes_load_data("BPX_G", "2011-2012") %>%
select(SEQN, BPXSY1, BPXDI1)
data2 <- nhanes_load_data("TRIGLY_G", "2011-2012") %>%
select(SEQN, LBXTR, LBDLDL)
Get some information on the data1 and data2
dim(data1)
## [1] 9338 3
dim(data2)
## [1] 3239 3
data1 has 9338 rows and 3 columns. data2 has 3239 rows and 3 colums.
data1 %>% slice(1:10)
## SEQN BPXSY1 BPXDI1
## 1 62161 110 82
## 2 62162 NA NA
## 3 62163 112 38
## 4 62164 116 56
## 5 62165 110 64
## 6 62166 96 32
## 7 62167 NA NA
## 8 62168 NA NA
## 9 62169 124 80
## 10 62170 124 82
data2 %>% slice(1:10)
## SEQN LBXTR LBDLDL
## 1 62161 84 110
## 2 62164 56 151
## 3 62165 71 84
## 4 62169 78 73
## 5 62170 44 77
## 6 62171 38 68
## 7 62172 141 68
## 8 62177 126 130
## 9 62178 100 134
## 10 62184 40 117
inner_join
Now I will merge data1 and data2 using the function inner_join.
total <- inner_join(data1, data2, by="SEQN")
dim(total)
## [1] 3239 5
total has 3239 rows and five variables. This indicates that the total kept all the IDs that were available in both datasets and entered the new columns from data2. So in case, you want to keep only IDs that are available in both databases use the function inner_join
left_join
Next function is left_join. The left_join is used when we want to keep all the ID in one dataset and to add columns from the other datasets. If the second dataset has less number of ID, the NA will be added.
total1 <- left_join(data1, data2, by="SEQN")
dim(total1)
## [1] 9338 5
total1 %>% slice(1:10)
## SEQN BPXSY1 BPXDI1 LBXTR LBDLDL
## 1 62161 110 82 84 110
## 2 62162 NA NA NA NA
## 3 62163 112 38 NA NA
## 4 62164 116 56 56 151
## 5 62165 110 64 71 84
## 6 62166 96 32 NA NA
## 7 62167 NA NA NA NA
## 8 62168 NA NA NA NA
## 9 62169 124 80 78 73
## 10 62170 124 82 44 77
In comparison with the inner_join, the left_join does not delete rows that are not available in the second dataset.
right_join
Another merge is using right_join function, which does the opposite of the left_join. SO the matching will be based on the data2 and not data1
total2 <- right_join(data1, data2, by="SEQN")
dim(total2)
## [1] 3239 5
total2 %>% slice(1:10)
## SEQN BPXSY1 BPXDI1 LBXTR LBDLDL
## 1 62161 110 82 84 110
## 2 62164 116 56 56 151
## 3 62165 110 64 71 84
## 4 62169 124 80 78 73
## 5 62170 124 82 44 77
## 6 62171 112 54 38 68
## 7 62172 100 70 141 68
## 8 62177 152 68 126 130
## 9 62178 124 72 100 134
## 10 62184 120 70 40 117
full_join
The full_join command, returns in a final dataset, all rows, and all columns from both datasets. When there is not a matching value, it is turned into N/A.
semi_join
The semi_join function is different than the previous examples of joins. A semi join creates a new dataset in which there are all rows from the data1 where there is a corresponding matching value in data2. Still, instead of the final dataset merging both the first (data1) and second (data2) datasets, it only contains the variables from the first one (data1).
This illustration is shown below:
total3 <- semi_join(data1, data2, by="SEQN")
dim(total3)
## [1] 3239 3
total3 %>% slice(1:10)
## SEQN BPXSY1 BPXDI1
## 1 62161 110 82
## 2 62164 116 56
## 3 62165 110 64
## 4 62169 124 80
## 5 62170 124 82
## 6 62171 112 54
## 7 62172 100 70
## 8 62177 152 68
## 9 62178 124 72
## 10 62184 120 70
anti_join
An example that is very helpful in exploring datasets, is anti_join command, and does the opposite of semi_join; it shows the rows from the first dataset data1 where there are not matching values from the second dataset data2:
total4 <- anti_join(data1, data2, by="SEQN")
dim(total4)
## [1] 6099 3
total4 %>% slice(1:10)
## SEQN BPXSY1 BPXDI1
## 1 62162 NA NA
## 2 62163 112 38
## 3 62166 96 32
## 4 62167 NA NA
## 5 62168 NA NA
## 6 62173 NA NA
## 7 62174 NA NA
## 8 62175 NA NA
## 9 62176 NA NA
## 10 62179 126 78
nest_join
The nest_join identified which IDs are in data2 that match that row from data1. The '0' means there is no match.
total5 <- nest_join(data1, data2, by="SEQN")
dim(total5)
## [1] 9338 4
total5 %>% slice(1:10)
## SEQN BPXSY1 BPXDI1 y
## 1 62161 110 82 84, 110
## 2 62162 NA NA
## 3 62163 112 38
## 4 62164 116 56 56, 151
## 5 62165 110 64 71, 84
## 6 62166 96 32
## 7 62167 NA NA
## 8 62168 NA NA
## 9 62169 124 80 78, 73
## 10 62170 124 82 44, 77
Hope you find helpful!