Lecture 7B: Tibble Joins

October 14, 2025

Modified

August 12, 2025

From today’s class, students are anticipated to be able to:

We will require the tidyverse functions for this chapter:

# install.packages("tidyverse")
library(tidyverse)

Overview of join functions

Note: In order to merge two tibbles, you need to have an identifier variable that has unique values for every row of observations in both tibbles.

Create two sample tibbles:

# First tibble
df1 <- tibble(ID = 1:3,                     
              Name = c("Sophie", "Josh","Alex"))

# Second tibble
df2 <- tibble(ID = 2:4,                      
              Age = c(20,50,31))

Mutating joins

Join matching rows from df2 to df1

left_join(df1, df2, by = "ID")
# A tibble: 3 × 3
     ID Name     Age
  <int> <chr>  <dbl>
1     1 Sophie    NA
2     2 Josh      20
3     3 Alex      50

Join matching rows from df1 to df2

right_join(df1, df2, by = "ID")
# A tibble: 3 × 3
     ID Name    Age
  <int> <chr> <dbl>
1     2 Josh     20
2     3 Alex     50
3     4 <NA>     31

Retain only rows present in both sets

inner_join(df1, df2, by = "ID")
# A tibble: 2 × 3
     ID Name    Age
  <int> <chr> <dbl>
1     2 Josh     20
2     3 Alex     50

Retain all values, all rows

full_join(df1, df2, by = "ID")
# A tibble: 4 × 3
     ID Name     Age
  <int> <chr>  <dbl>
1     1 Sophie    NA
2     2 Josh      20
3     3 Alex      50
4     4 <NA>      31

Filtering joins

Retain all rows in df1 that have a match in df2

semi_join(df1, df2, by = "ID")
# A tibble: 2 × 2
     ID Name 
  <int> <chr>
1     2 Josh 
2     3 Alex 

Retain all rows in df1 that do not have a match in df2

anti_join(df1, df2, by = "ID")
# A tibble: 1 × 2
     ID Name  
  <int> <chr> 
1     1 Sophie

Binding

Append df2 to df1 as new rows

bind_rows(df1, df2)
# A tibble: 6 × 3
     ID Name     Age
  <int> <chr>  <dbl>
1     1 Sophie    NA
2     2 Josh      NA
3     3 Alex      NA
4     2 <NA>      20
5     3 <NA>      50
6     4 <NA>      31

Append df2 to df1 as new columns

bind_cols(df1, df2)
New names:
• `ID` -> `ID...1`
• `ID` -> `ID...3`
# A tibble: 3 × 4
  ID...1 Name   ID...3   Age
   <int> <chr>   <int> <dbl>
1      1 Sophie      2    20
2      2 Josh        3    50
3      3 Alex        4    31

Joining multiple (>2) tibbles

Create a third tibble

df3 <- tibble(ID = 1:5,                      
              Height = c(175,167,190,155,160))

Use piping operator (%>%) to layer multiple join functions

full_join(df1, df2, by = "ID") %>%
  full_join(df3, by = "ID") 
# A tibble: 5 × 4
     ID Name     Age Height
  <int> <chr>  <dbl>  <dbl>
1     1 Sophie    NA    175
2     2 Josh      20    167
3     3 Alex      50    190
4     4 <NA>      31    155
5     5 <NA>      NA    160

Joining tibbles on multiple conditions

Create two new tibbles df4 and df5

df4 <- tibble(FirstName = c("Sophie", "Josh","Alex"),
              LastName=c("Wang","Smith","Smith"),
              Age = c(42,20,50))

df5 <- tibble(First_name = c("Josh","Alex","Sophie"),        
              Last_name=c("Smith","Smith","Jones"),
              Height = c(167,190,155))

full_join(df4, df5, by = c("FirstName" = "First_name", "LastName" = "Last_name"))
# A tibble: 4 × 4
  FirstName LastName   Age Height
  <chr>     <chr>    <dbl>  <dbl>
1 Sophie    Wang        42     NA
2 Josh      Smith       20    167
3 Alex      Smith       50    190
4 Sophie    Jones       NA    155

Set operations

Create sample tibbles

# First tibble
df6 <- tibble(Number = 1:3,                     
              Letter = c("A", "B","C"))

# Second tibble
df7 <- tibble(Number = 2:4,                      
              Letter = c("B","C","D"))

Include rows that appear in both tibbles

# First tibble
df6 <- tibble(Number = 1:3,                     
              Letter = c("A", "B","C"))

# Second tibble
df7 <- tibble(Number = 2:4,                      
              Letter = c("B","C","D"))

Include rows that appear in either or both tibbles

union(df6, df7)
# A tibble: 4 × 2
  Number Letter
   <int> <chr> 
1      1 A     
2      2 B     
3      3 C     
4      4 D     

Include rows that appear in one tibble/dataset but not another

Include rows that appear in df6 but not in df7

setdiff(df6, df7)
# A tibble: 1 × 2
  Number Letter
   <int> <chr> 
1      1 A     

Include rows that appear in df7 but not in df6

setdiff(df7, df6)
# A tibble: 1 × 2
  Number Letter
   <int> <chr> 
1      4 D     

Joining tibbles with different types of variables

You can also join tibbles with sets of predictions:

set.seed(1) #make reproducible

x <- rnorm(5) #randomly sample 5 times from a N(0,1) distribution

model1 <- tibble(x = x, yhat = 2.1 + 3.2 * x) #do prediction based on the linear function
model2 <- tibble(x = x, yhat = 1.5 + 2.9 * x)

left_join(model1, model2, by = "x")
# A tibble: 5 × 3
       x  yhat.x yhat.y
   <dbl>   <dbl>  <dbl>
1 -0.626  0.0953 -0.317
2  0.184  2.69    2.03 
3 -0.836 -0.574  -0.923
4  1.60   7.20    6.13 
5  0.330  3.15    2.46 

Worksheet A5

Try your hand at basics of tibble joins by working through the corresponding part of Worksheet A5.

There will be some class time to go over solutions if you got stuck on any questions.

Resources

Back to top