Merging Datasets with Tidyverse

This post will focus on merging datasets with tidyverse using R. I will use data from NHANES, which are freely available for everyone. The first dataset data1 consists of the blood pressure levels for each participant, and the second data2 contain their LDL and Triglycerides levels.

First, I will load the neccessary libraries and datasets.

library(tidyverse)
library(RNHANES)
data1 <- nhanes_load_data("BPX_G", "2011-2012") %>% 
  select(SEQN, BPXSY1, BPXDI1)
data2 <- nhanes_load_data("TRIGLY_G", "2011-2012") %>% 
  select(SEQN, LBXTR, LBDLDL)

Get some information on the data1 and data2

dim(data1)
## [1] 9338    3

dim(data2)
## [1] 3239    3

data1 has 9338 rows and 3 columns. data2 has 3239 rows and 3 colums.

data1 %>% slice(1:10)
##     SEQN BPXSY1 BPXDI1
## 1  62161    110     82
## 2  62162     NA     NA
## 3  62163    112     38
## 4  62164    116     56
## 5  62165    110     64
## 6  62166     96     32
## 7  62167     NA     NA
## 8  62168     NA     NA
## 9  62169    124     80
## 10 62170    124     82

data2 %>% slice(1:10)
##     SEQN LBXTR LBDLDL
## 1  62161    84    110
## 2  62164    56    151
## 3  62165    71     84
## 4  62169    78     73
## 5  62170    44     77
## 6  62171    38     68
## 7  62172   141     68
## 8  62177   126    130
## 9  62178   100    134
## 10 62184    40    117

inner_join

Now I will merge data1 and data2 using the function inner_join.

total <- inner_join(data1, data2, by="SEQN")
dim(total)
## [1] 3239    5

total has 3239 rows and five variables. This indicates that the total kept all the IDs that were available in both datasets and entered the new columns from data2. So in case, you want to keep only IDs that are available in both databases use the function inner_join

left_join

Next function is left_join. The left_join is used when we want to keep all the ID in one dataset and to add columns from the other datasets. If the second dataset has less number of ID, the NA will be added.

total1 <- left_join(data1, data2, by="SEQN")
dim(total1)
## [1] 9338    5

total1 %>% slice(1:10)
##     SEQN BPXSY1 BPXDI1 LBXTR LBDLDL
## 1  62161    110     82    84    110
## 2  62162     NA     NA    NA     NA
## 3  62163    112     38    NA     NA
## 4  62164    116     56    56    151
## 5  62165    110     64    71     84
## 6  62166     96     32    NA     NA
## 7  62167     NA     NA    NA     NA
## 8  62168     NA     NA    NA     NA
## 9  62169    124     80    78     73
## 10 62170    124     82    44     77

In comparison with the inner_join, the left_join does not delete rows that are not available in the second dataset.

right_join

Another merge is using right_join function, which does the opposite of the left_join. SO the matching will be based on the data2 and not data1

total2 <- right_join(data1, data2, by="SEQN")
dim(total2)
## [1] 3239    5

total2 %>% slice(1:10)
##     SEQN BPXSY1 BPXDI1 LBXTR LBDLDL
## 1  62161    110     82    84    110
## 2  62164    116     56    56    151
## 3  62165    110     64    71     84
## 4  62169    124     80    78     73
## 5  62170    124     82    44     77
## 6  62171    112     54    38     68
## 7  62172    100     70   141     68
## 8  62177    152     68   126    130
## 9  62178    124     72   100    134
## 10 62184    120     70    40    117

full_join

The full_join command, returns in a final dataset, all rows, and all columns from both datasets. When there is not a matching value, it is turned into N/A.

semi_join

The semi_join function is different than the previous examples of joins. A semi join creates a new dataset in which there are all rows from the data1 where there is a corresponding matching value in data2. Still, instead of the final dataset merging both the first (data1) and second (data2) datasets, it only contains the variables from the first one (data1).

This illustration is shown below:

total3 <- semi_join(data1, data2, by="SEQN")
dim(total3)
## [1] 3239    3

total3 %>% slice(1:10)
##     SEQN BPXSY1 BPXDI1
## 1  62161    110     82
## 2  62164    116     56
## 3  62165    110     64
## 4  62169    124     80
## 5  62170    124     82
## 6  62171    112     54
## 7  62172    100     70
## 8  62177    152     68
## 9  62178    124     72
## 10 62184    120     70

anti_join

An example that is very helpful in exploring datasets, is anti_join command, and does the opposite of semi_join; it shows the rows from the first dataset data1 where there are not matching values from the second dataset data2:

total4 <- anti_join(data1, data2, by="SEQN")
dim(total4)
## [1] 6099    3

total4 %>% slice(1:10)
##     SEQN BPXSY1 BPXDI1
## 1  62162     NA     NA
## 2  62163    112     38
## 3  62166     96     32
## 4  62167     NA     NA
## 5  62168     NA     NA
## 6  62173     NA     NA
## 7  62174     NA     NA
## 8  62175     NA     NA
## 9  62176     NA     NA
## 10 62179    126     78

nest_join

The nest_join identified which IDs are in data2 that match that row from data1. The '0' means there is no match.

total5 <- nest_join(data1, data2, by="SEQN")
dim(total5)
## [1] 9338    4

total5 %>% slice(1:10)
##     SEQN BPXSY1 BPXDI1       y
## 1  62161    110     82 84, 110
## 2  62162     NA     NA        
## 3  62163    112     38        
## 4  62164    116     56 56, 151
## 5  62165    110     64  71, 84
## 6  62166     96     32        
## 7  62167     NA     NA        
## 8  62168     NA     NA        
## 9  62169    124     80  78, 73
## 10 62170    124     82  44, 77

Hope you find helpful!

Data Managementtidyverse