STAT 545 - Fall 2025
From today’s class, students are anticipated to be able to:
Read and write a delimited file, like a csv, from R using the readr package.
Make relative paths using the here::here() function.
Recognize how to manipulate data through a variety of tibble joins such as:
Mutating joins: left_join(), right_join(), full_join(), inner_join()
Filtering joins: semi_join(), anti_join()
Perform binding: bind_rows(), bind_cols()
Join more than 2 tibbles
Join based on multiple conditions
Perform set operations on data: intersect(), union(), setdiff()
Required packages:
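A minimal setup sketch, assuming the functions used in this lecture come from the tidyverse (readr, dplyr, tibble) and the here package; adjust to whatever your own project needs:

```r
# Assumed package set for this lecture.
# Install once if needed: install.packages(c("tidyverse", "here"))
library(tidyverse)  # attaches readr, dplyr, tibble, and the %>% pipe
library(here)       # builds file paths relative to the project root
```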
Data has to be stored somewhere. When saving data locally, common file formats include
Spreadsheets: Excel (.xlsx), Google Sheets (.gsheet)
Nice for human interaction (like through Excel)
Clunky in R
Use more memory to store due to their extra features
Delimited files: Plaintext files containing data, e.g., text files (.txt), comma separated values (.csv), tab separated values (.tsv)
most “one-size-fits-all”
lightweight
R binary: A serialization of an R object to a binary file (.rds).
First few entries of penguins.csv when opened as a text file (see for yourself here):
species,island,bill_len,bill_dep,flipper_len,body_mass,sex,year
Adelie,Torgersen,39.1,18.7,181,3750,male,2007
Adelie,Torgersen,39.5,17.4,186,3800,female,2007
Adelie,Torgersen,40.3,18,195,3250,female,2007
Adelie,Torgersen,NA,NA,NA,NA,NA,2007
Adelie,Torgersen,36.7,19.3,193,3450,female,2007
Adelie,Torgersen,39.3,20.6,190,3650,male,2007
Adelie,Torgersen,38.9,17.8,181,3625,female,2007
Adelie,Torgersen,39.2,19.6,195,4675,male,2007
read_csv and write_csv
From the readr package, we can use:
read_csv(): tidyverse equivalent of read.csv() used to import data from a CSV to a tibble
write_csv(): tidyverse equivalent of write.csv() used to export a tibble into CSV format
Let’s assume that a file called penguins.csv is saved in the same folder as our code. We can read in, and save the tibble as a variable called penguins using:
Note that the file path needs to be a string, relative to where you are now in the directory (i.e., your working directory, which is often the folder where the R script you’re working on is saved).
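Here is a minimal sketch of the read step. So that it runs anywhere, it first writes a small stand-in file to a temporary folder (the `csv_path` variable and the three-row sample are assumptions for illustration). With the real file sitting in your working directory, the call would simply be `read_csv("penguins.csv")`.

```r
library(readr)

# Stand-in file: a few of the penguins rows shown earlier,
# written to a temporary folder so no real files are touched.
csv_path <- file.path(tempdir(), "penguins.csv")
writeLines(c(
  "species,island,bill_len,bill_dep,flipper_len,body_mass,sex,year",
  "Adelie,Torgersen,39.1,18.7,181,3750,male,2007",
  "Adelie,Torgersen,39.5,17.4,186,3800,female,2007",
  "Adelie,Torgersen,NA,NA,NA,NA,NA,2007"
), csv_path)

# The path is passed as a string; the result is a tibble.
# The text "NA" is read as a missing value by default.
penguins <- read_csv(csv_path, show_col_types = FALSE)
penguins
```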
We can also manipulate the data and save the output as a new CSV. For example,
penguins_2007 <- penguins %>%
  filter(year == 2007)  # keep only rows from 2007
write_csv(penguins_2007, "penguins_2007.csv")  # save the new data as penguins_2007.csv
Note
Want to read from an Excel file? The readxl package in the tidyverse is for you! It only reads, though; to write an Excel file, use a separate package such as writexl.
For the very niche option of R binary: read_rds() and write_rds().
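To see when the R binary option can be worth it, here is a small sketch comparing a round trip through .rds with one through .csv. The `scores` tibble and file names are hypothetical.

```r
library(readr)
library(tibble)

# Hypothetical tibble with a factor column.
scores <- tibble(
  id    = 1:3,
  grade = factor(c("low", "high", "low"), levels = c("low", "high"))
)

rds_path <- file.path(tempdir(), "scores.rds")
csv_path <- file.path(tempdir(), "scores.csv")

write_rds(scores, rds_path)
write_csv(scores, csv_path)

from_rds <- read_rds(rds_path)
from_csv <- read_csv(csv_path, show_col_types = FALSE)

class(from_rds$grade)  # "factor": the R object comes back as it was saved
class(from_csv$grade)  # "character": plain text carries no factor information
```

The trade-off is that an .rds file can only be opened from R, while a CSV can be opened almost anywhere.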
In the previous example, we saved and read in data that was stored in the same folder. However, we will often want to read from or write to other locations, including sub-folders in our project.
To do so, we need to specify where we are reading/writing our data from/to.
Absolute paths start with “/” (or a drive letter such as “C:\” on Windows) and begin at the root of your computer. This is a looooong set of “directions” that tell you where the file is located.
I could always read in my penguins dataset using an absolute file path, which begins at the root of my computer. Consider the following file structure:

What is the absolute file path to penguins.csv?
The best practice is to use a relative path. This helps with reproducibility and automation!
Instead of starting at the root of your computer, you can give directions to the file you want to load in relative to the working directory (i.e., where you are now).

(If you’re having trouble visualizing the working directory, you could consider the folders nested this way as well:)

If our working directory is Lec7A, what is the relative file path to penguins.csv?
Some useful tips for relative paths:
they do not start with a slash
. represents the current directory
.. represents the parent directory (go up one folder from the current directory)
For example, to get to the thesis folder if my current working directory is Lec7a, the path is ..\..\thesis (leave the Lec7a folder to go to the STAT545 folder, then leave the STAT545 folder to go to documents, then go to thesis).
you can call getwd() in R to confirm where your working directory is (it will show the absolute file path as the output)
in R projects, by default your working directory is your R project folder.
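The tips above can be tried out in the console. This is a small sketch using only base R. The folder names come from the example above, and `same_folder` and `thesis_dir` are illustrative variable names. Note that file.path() joins with forward slashes, which R accepts on Windows as well.

```r
# Where am I? getwd() returns the absolute path of the working directory.
getwd()

# Build relative paths from their pieces instead of typing slashes by hand.
same_folder <- file.path(".", "penguins.csv")   # a file in the current directory
thesis_dir  <- file.path("..", "..", "thesis")  # up two levels, then into thesis

same_folder  # "./penguins.csv"
thesis_dir   # "../../thesis"

# normalizePath() shows the absolute path that a relative path resolves to.
normalizePath("..", mustWork = FALSE)
```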
here Package
As we stated before, things can get frustrating when sharing files between operating systems. Even with relative paths, we’ll need to manually replace forward slashes and backslashes when switching between Mac and Windows operating systems.
Thankfully, there is a package that allows us to use relative paths without specifying a file path string that is operating system dependent. Let’s (install, if necessary, and) load the here package.
Now, let’s call here():
Side note: we will explicitly call here() from the here package using here::here(), as other packages (for example, plyr) also export a function named here().
I get a long chain of folders where this R Project (which I used to build this website) is stored. The cool thing about here is that I can specify a file path relative to my project root (the above location) without using any operating system-specific strings.
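A minimal sketch of both uses, assuming the here package is installed. The "data" folder below is a hypothetical location, not one from this project.

```r
library(here)

# The project root that here detected (e.g., the folder holding an .Rproj file).
here::here()

# Build a path below the root from its components; no slashes typed by hand.
# "data" is a hypothetical sub-folder used only for illustration.
data_path <- here::here("data", "penguins.csv")
data_path

file.exists(data_path)  # TRUE only if that file actually exists in your project
```

The output of here::here() will differ from machine to machine, but the components passed to it stay the same, which is what makes the script portable.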
Example: the penguins.csv data set is located in webpages > lectures_i > datasets within my R project folder. I can access it by:
penguins <- read_csv(here::here("webpages", "lectures_i", "datasets", "penguins.csv"))
head(penguins)  # view first few entries of the tibble
This is reproducible!
Some final notes on here::here():
By default in an R project, here::here() returns the path to the project folder.
I don’t think you can go outside of your root folder for the R project, unless you re-initialize the root somehow using here::i_am().
This does not change the working directory. However, we recommend against using setwd() and similar functions to play around with directories in R projects. This again affects reproducibility.
Sometimes you’ll need to read in multiple data sets and then combine them. When we do this, we refer to it as “joining”.
Note: In order to join two tibbles, you need an identifier (key) variable that appears in both tibbles. Joins are easiest to reason about when the key has a unique value for every row in at least one of the tibbles.
Create two sample tibbles:

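left_join(): join matching rows from df2 to df1
right_join(): join matching rows from df1 to df2
inner_join(): keep only rows that have a match in both df1 and df2
full_join(): keep all rows from both df1 and df2
semi_join(): keep rows of df1 that have a match in df2
anti_join(): keep rows of df1 that do not have a match in df2
bind_rows(): append df2 to df1 as new rows
bind_cols(): append df2 to df1 as new columns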
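The joins and binds above can be compared side by side on two tiny tibbles. This is a sketch with made-up data: the contents of df1 and df2 here are assumptions, chosen so that ids 2 and 3 appear in both.

```r
library(dplyr)
library(tibble)

# Hypothetical sample tibbles sharing the key column `id`.
df1 <- tibble(id = c(1, 2, 3), x = c("a", "b", "c"))
df2 <- tibble(id = c(2, 3, 4), y = c("B", "C", "D"))

# Mutating joins: add columns from one tibble to the other.
left_join(df1, df2, by = "id")   # all rows of df1; y is NA for id 1
right_join(df1, df2, by = "id")  # all rows of df2; x is NA for id 4
inner_join(df1, df2, by = "id")  # only ids 2 and 3
full_join(df1, df2, by = "id")   # ids 1 to 4

# Filtering joins: keep or drop rows of df1; no columns are added.
semi_join(df1, df2, by = "id")   # ids 2 and 3, columns id and x only
anti_join(df1, df2, by = "id")   # id 1 only

# Binding: stack rows or paste columns side by side, with no key matching.
bind_rows(df1, df2)              # 6 rows; missing x or y values become NA
bind_cols(df1, df2)              # 3 rows; the duplicate `id` names are repaired
```

Note that bind_cols() matches rows purely by position, so it is only safe when both tibbles are already in the same row order.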
Create a third tibble

Use the pipe operator (%>%) to chain multiple join functions
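A minimal sketch of chaining two joins, using made-up data (the contents of df1, df2, and df3 are assumptions for illustration):

```r
library(dplyr)
library(tibble)

# Hypothetical tibbles that all share the key column `id`.
df1 <- tibble(id = c(1, 2, 3), x = c("a", "b", "c"))
df2 <- tibble(id = c(2, 3, 4), y = c("B", "C", "D"))
df3 <- tibble(id = c(1, 3, 5), z = c(TRUE, FALSE, TRUE))

# Each join takes the result of the previous step as its left-hand table.
all_three <- df1 %>%
  left_join(df2, by = "id") %>%
  left_join(df3, by = "id")

all_three  # 3 rows (ids 1 to 3) with columns id, x, y, z
```

Because both steps are left joins, only the ids in df1 survive; swapping in full_join() would keep ids 4 and 5 as well.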

Create two new tibbles df4 and df5
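A sketch of joining on more than one column, with made-up data. The contents of df4 and df5, and the renamed copy `df5_renamed`, are assumptions for illustration. Here a row is only identified by the combination of id and year.

```r
library(dplyr)
library(tibble)

# Hypothetical tibbles where a row is identified by BOTH id and year.
df4 <- tibble(
  id   = c(1, 1, 2, 2),
  year = c(2007, 2008, 2007, 2008),
  mass = c(3750, 3800, 3250, 3450)
)
df5 <- tibble(
  id     = c(1, 2, 2),
  year   = c(2007, 2007, 2009),
  island = c("Torgersen", "Biscoe", "Dream")
)

# Match on both columns at once.
left_join(df4, df5, by = c("id", "year"))

# If the key columns have different names, map them as
# c("name_in_left" = "name_in_right").
df5_renamed <- rename(df5, yr = year)
both_keys <- left_join(df4, df5_renamed, by = c("id" = "id", "year" = "yr"))
both_keys  # 4 rows; island is NA where no (id, year) pair matches
```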

Create sample tibbles




Include rows that appear in df6 but not in df7
Include rows that appear in df7 but not in df6
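The set operations can be sketched on two small tibbles with the same columns. The contents of df6 and df7 here are made up for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical tibbles with identical column names and types.
df6 <- tibble(id = c(1, 2, 3), x = c("a", "b", "c"))
df7 <- tibble(id = c(2, 3, 4), x = c("b", "c", "d"))

intersect(df6, df7)  # rows that appear in both: ids 2 and 3
union(df6, df7)      # all distinct rows from either: ids 1 to 4
setdiff(df6, df7)    # rows in df6 but not in df7: id 1
setdiff(df7, df6)    # rows in df7 but not in df6: id 4
```

Unlike joins, set operations compare entire rows, so both tibbles need the same columns.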
This is the last lecture of STAT545A! Goodbye to those not continuing onto STAT545B!
Feel free to stay in touch:
https://www.linkedin.com/in/grace-e-tompkins/
grace@stat.ubc.ca