# install.packages("tidyverse")
library(tidyverse)
Lecture 7B: Tibble Joins
October 14, 2025
From today’s class, students are anticipated to be able to:
Recognize how to manipulate data through a variety of tibble joins such as:
Mutating joins:
left_join()
,right_join()
,full_join()
,anti_join()
Filtering joins:
semi_join()
,anti_join()
Perform binding:
bind_rows()
,bind_cols()
Join more than 2 tibbles
Join based on multiple conditions
Perform set operations on data:
intersect()
,union()
,setdiff()
Join tibbles with different types of variables
We will require the tidyverse
functions for this chapter:
Overview of join functions
Note: In order to merge two tibbles, you need to have an identifier variable that has unique values for every row of observations in both tibbles.
Create two sample tibbles:
# First tibble
<- tibble(ID = 1:3,
df1 Name = c("Sophie", "Josh","Alex"))
# Second tibble
<- tibble(ID = 2:4,
df2 Age = c(20,50,31))
Mutating joins
Join matching rows from df2
to df1
left_join(df1, df2, by = "ID")
# A tibble: 3 × 3
ID Name Age
<int> <chr> <dbl>
1 1 Sophie NA
2 2 Josh 20
3 3 Alex 50
Join matching rows from df1
to df2
right_join(df1, df2, by = "ID")
# A tibble: 3 × 3
ID Name Age
<int> <chr> <dbl>
1 2 Josh 20
2 3 Alex 50
3 4 <NA> 31
Retain only rows present in both sets
inner_join(df1, df2, by = "ID")
# A tibble: 2 × 3
ID Name Age
<int> <chr> <dbl>
1 2 Josh 20
2 3 Alex 50
Retain all values, all rows
full_join(df1, df2, by = "ID")
# A tibble: 4 × 3
ID Name Age
<int> <chr> <dbl>
1 1 Sophie NA
2 2 Josh 20
3 3 Alex 50
4 4 <NA> 31
Filtering joins
Retain all rows in df1
that have a match in df2
semi_join(df1, df2, by = "ID")
# A tibble: 2 × 2
ID Name
<int> <chr>
1 2 Josh
2 3 Alex
Retain all rows in df1
that do not have a match in df2
anti_join(df1, df2, by = "ID")
# A tibble: 1 × 2
ID Name
<int> <chr>
1 1 Sophie
Binding
Append df2
to df1
as new rows
bind_rows(df1, df2)
# A tibble: 6 × 3
ID Name Age
<int> <chr> <dbl>
1 1 Sophie NA
2 2 Josh NA
3 3 Alex NA
4 2 <NA> 20
5 3 <NA> 50
6 4 <NA> 31
Append df2
to df1
as new columns
bind_cols(df1, df2)
New names:
• `ID` -> `ID...1`
• `ID` -> `ID...3`
# A tibble: 3 × 4
ID...1 Name ID...3 Age
<int> <chr> <int> <dbl>
1 1 Sophie 2 20
2 2 Josh 3 50
3 3 Alex 4 31
Joining multiple (>2) tibbles
Create a third tibble
<- tibble(ID = 1:5,
df3 Height = c(175,167,190,155,160))
Use piping operator (%>%) to layer multiple join functions
full_join(df1, df2, by = "ID") %>%
full_join(df3, by = "ID")
# A tibble: 5 × 4
ID Name Age Height
<int> <chr> <dbl> <dbl>
1 1 Sophie NA 175
2 2 Josh 20 167
3 3 Alex 50 190
4 4 <NA> 31 155
5 5 <NA> NA 160
Joining tibbles on multiple conditions
Create two new tibbles df4
and df5
<- tibble(FirstName = c("Sophie", "Josh","Alex"),
df4 LastName=c("Wang","Smith","Smith"),
Age = c(42,20,50))
<- tibble(First_name = c("Josh","Alex","Sophie"),
df5 Last_name=c("Smith","Smith","Jones"),
Height = c(167,190,155))
full_join(df4, df5, by = c("FirstName" = "First_name", "LastName" = "Last_name"))
# A tibble: 4 × 4
FirstName LastName Age Height
<chr> <chr> <dbl> <dbl>
1 Sophie Wang 42 NA
2 Josh Smith 20 167
3 Alex Smith 50 190
4 Sophie Jones NA 155
Set operations
Create sample tibbles
# First tibble
<- tibble(Number = 1:3,
df6 Letter = c("A", "B","C"))
# Second tibble
<- tibble(Number = 2:4,
df7 Letter = c("B","C","D"))
Include rows that appear in both tibbles
# First tibble
<- tibble(Number = 1:3,
df6 Letter = c("A", "B","C"))
# Second tibble
<- tibble(Number = 2:4,
df7 Letter = c("B","C","D"))
Include rows that appear in either or both tibbles
union(df6, df7)
# A tibble: 4 × 2
Number Letter
<int> <chr>
1 1 A
2 2 B
3 3 C
4 4 D
Include rows that appear in one tibble/dataset but not another
Include rows that appear in df6
but not in df7
setdiff(df6, df7)
# A tibble: 1 × 2
Number Letter
<int> <chr>
1 1 A
Include rows that appear in df7
but not in df6
setdiff(df7, df6)
# A tibble: 1 × 2
Number Letter
<int> <chr>
1 4 D
Joining tibbles with different types of variables
You can also join tibbles with sets of predictions:
set.seed(1) #make reproducible
<- rnorm(5) #randomly sample 5 times from a N(0,1) distribution
x
<- tibble(x = x, yhat = 2.1 + 3.2 * x) #do prediction based on the linear function
model1 <- tibble(x = x, yhat = 1.5 + 2.9 * x)
model2
left_join(model1, model2, by = "x")
# A tibble: 5 × 3
x yhat.x yhat.y
<dbl> <dbl> <dbl>
1 -0.626 0.0953 -0.317
2 0.184 2.69 2.03
3 -0.836 -0.574 -0.923
4 1.60 7.20 6.13
5 0.330 3.15 2.46
Worksheet A5
Try your hand at basics of tibble joins by working through the corresponding part of Worksheet A5.
There will be some class time to go over solutions if you got stuck on any questions.
Resources
Video lecture: Tibble Joins with dplyr
“Relational Data” chapter in “R for Data Science”.
“Two-table verbs” vignette gives a concise overview of tibble joins with dplyr.
Jenny’s Join Cheatsheet for a quick reference to joins.
dplyr cheatsheet for all these concepts packed onto a sheet of paper.