Lecture 12: StringR and Regular Expressions
November 18, 2025
From this topic, students are expected to be able to:
Manipulate a character vector in R using the stringr package.
Write simple regular expressions (regex).
Apply stringr and regular expressions to manipulate data in tibbles.
We will require the following packages:
library(tidyverse) # stringr is loaded automatically with the tidyverse
Strings
You’ve used a bunch of strings at this point without knowing explicitly what they are: any time you surround text with ", you’ve been making a string: a storage format for text. In R, strings are of type “character”.
sample_string <- "This is a string"
typeof(sample_string)
[1] "character"
Two places where you’ll often want to manipulate these in data analysis:
Cleaning up column/variable names
Cleaning up character column values
Good to know: Constructing strings out of characters and numbers is intuitive, but there’s a gotcha involving particular symbols with special meaning in R. For example, try running quote <- """ in R; it won’t work, because each " symbol is interpreted as the start or end of a string! To literally include a quote in a string, you can use the \ character to “escape” it:
single_quote <- "\""
cat(single_quote)
"
You can see more examples of special characters and how to escape them in R4DS Chapter 15.2.
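As a minimal sketch of the escaping rule above (the variable names are illustrative and not part of the lecture data), compare how print() and cat() display an escaped string. The same rule applies to the backslash itself.

```r
# A string containing double quotes; each inner quote is escaped with a backslash
quoted <- "She said \"hello\""

print(quoted)  # shows the escaped form, as you would type it
cat(quoted)    # shows the raw contents of the string

# The backslash is itself special, so a literal backslash is written "\\"
backslash <- "\\"
nchar(backslash)  # 1: the escaping backslash is not stored in the string
```

The escape character only affects how the string is typed and printed; it does not add to the string's length.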
Working with Strings
Our main tools for working with strings will be the powerful stringr package in the tidyverse, paired with regular expressions (also called regex).
We’ll be going through a few examples with the enron dataset, which you can download on GitHub. It contains 270,087 Enron emails, taken from the May 7, 2015 version of the dataset, and was obtained from CMU. There are three columns in enron.csv:
person: The person associated with the email.
mail_num: Identifier for each person’s email.
email: Each entry is a line in an email, including the email’s metadata (like subject, who it was sent to, etc.)
Let’s load in the data:
enron <- read_csv("datasets/enron.csv")
head(enron)
# A tibble: 6 × 3
mail_num person email
<dbl> <chr> <chr>
1 1 allen-p Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>
2 1 allen-p Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
3 1 allen-p From: phillip.allen@enron.com
4 1 allen-p To: tim.belden@enron.com
5 1 allen-p Subject:
6 1 allen-p Mime-Version: 1.0
For each email sent in this data set (mail_num), we have the sender (person) along with the lines of the email in separate entries in the data frame (email). For now, we’ll focus on the contents of the email, in the email column.
enron_email <- enron$email
Non-regex String Manipulation
We’ll first go through some stringr functions that don’t require regular expressions.
str_detect() and str_subset()
Let’s keep only the entries that contain email addresses. We can do this by calling str_detect() and finding entries that contain “@”:
enron_email_at <- str_detect(enron_email, "@") # find all entries with "@" in the email column
head(enron_email_at)
[1] TRUE FALSE TRUE TRUE FALSE FALSE
str_detect() returns a logical vector showing whether or not “@” was found in each entry! To get the matching entries themselves, we can use str_subset():
enron_email_at <- str_subset(enron_email, "@") # overwrite the logical vector with the matching entries
head(enron_email_at)
[1] "Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>"
[2] "From: phillip.allen@enron.com"
[3] "To: tim.belden@enron.com"
[4] "X-To: Tim Belden <Tim Belden/Enron@EnronXGate>"
[5] "Message-ID: <15464986.1075855378456.JavaMail.evans@thyme>"
[6] "From: phillip.allen@enron.com"
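To make the relationship between the two functions concrete, here is a small sketch on a made-up vector (toy_lines and its addresses are hypothetical, not taken from the Enron data). It suggests that str_subset() gives the same result as indexing with the logical vector from str_detect().

```r
library(stringr)

# Hypothetical header lines in the same style as the Enron data
toy_lines <- c(
  "From: a.person@example.com",
  "Subject: lunch",
  "To: b.person@example.com"
)

has_at <- str_detect(toy_lines, "@")  # one TRUE/FALSE per entry
has_at

toy_lines[has_at]           # index with the logical vector
str_subset(toy_lines, "@")  # same entries in a single step

# Both routes keep exactly the same entries
identical(toy_lines[has_at], str_subset(toy_lines, "@"))
```

str_detect() tends to be the better fit inside filter() on a tibble, where a logical vector is needed, while str_subset() is convenient for plain character vectors.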
str_split()
Let’s say we wanted to count the number of times the word “important” was written in an email. To do this, we can split every email line into its individual words using the str_split() function. Let’s split by spaces to get each individual word.
enron_email_words <- str_split(enron_email, " ")
head(enron_email_words)
[[1]]
[1] "Message-ID:"
[2] "<18782981.1075855378110.JavaMail.evans@thyme>"
[[2]]
[1] "Date:" "Mon," "14" "May" "2001" "16:39:00" "-0700"
[8] "(PDT)"
[[3]]
[1] "From:" "phillip.allen@enron.com"
[[4]]
[1] "To:" "tim.belden@enron.com"
[[5]]
[1] "Subject:"
[[6]]
[1] "Mime-Version:" "1.0"
We get a list! This is because we don’t know how many words each entry will contain, so a list is the most flexible option. If we really wanted a vector of all of the individual words (which we do!), we can unlist() the list:
enron_email_words <- unlist(enron_email_words)
head(enron_email_words)
[1] "Message-ID:"
[2] "<18782981.1075855378110.JavaMail.evans@thyme>"
[3] "Date:"
[4] "Mon,"
[5] "14"
[6] "May"
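The split-then-flatten step can be sketched on a tiny made-up vector (toy_lines is hypothetical). The example uses lengths() to show why str_split() has to return a list: each line produces a different number of pieces.

```r
library(stringr)

# Hypothetical email lines with different numbers of words
toy_lines <- c("Date: Mon, 14 May 2001", "Subject:", "This is important")

word_list <- str_split(toy_lines, " ")  # a list with one character vector per line
lengths(word_list)                      # number of pieces in each line: 5, 1, 3

all_words <- unlist(word_list)          # flatten into a single character vector
length(all_words)                       # 9 words in total
```

Note that unlist() discards the information about which line each word came from, which is fine for counting words but not for per-email summaries.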
Now, we can subset by which words contain “important”, and count the number of entries!
enron_emails_important <- str_subset(enron_email_words, "important")
length(enron_emails_important)
[1] 91
91 times! But unfortunately, str_subset() is case sensitive. See:
enron_emails_important <- str_subset(enron_email_words, "Important")
length(enron_emails_important)
[1] 9
and:
enron_emails_important <- str_subset(enron_email_words, "IMPORTANT")
length(enron_emails_important)
[1] 5
This is where regular expressions come in handy! We’ll discuss this in the following section.
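As a hedged preview, here are two possible ways to count every capitalization in one pass, shown on a made-up vector (toy_words is hypothetical). The first wraps the pattern in stringr's regex() helper with ignore_case = TRUE; the second lower-cases the text before matching.

```r
library(stringr)

# Hypothetical words with different capitalizations
toy_words <- c("important", "Important", "IMPORTANT", "unimportant", "report")

# Option 1: ask the pattern to ignore case
n_regex <- length(str_subset(toy_words, regex("important", ignore_case = TRUE)))
n_regex  # 4

# Option 2: convert everything to lower case, then match as before
n_lower <- sum(str_detect(str_to_lower(toy_words), "important"))
n_lower  # 4
```

Both options count "unimportant" as well, because the pattern matches anywhere inside a word. The same is true of the counts above.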
Regular Expressions (regex)
Regular expressions provide a powerful tool to extract character data through patterns! In regexes, specific characters and constructs take on special meaning in order to match multiple strings.
We will go through a few common regex special characters, but refer readers to this Cheat Sheet by Ian Kopacka for a more in-depth review.
Let’s consider a simple vector of words:
words <- c("bat", "cat", "5at", "rats", "?at", "kittycat", "doormats", "attic", "matchstick", "hat-trick", "hattrick", "hat trick")
. (match any character)
The . in regex allows us to match any single character (a letter, number, or symbol, but not a newline) in an expression.
For example, the pattern .at matches any string in which some character is immediately followed by “at”, so the string must have at least three characters.
str_subset(words, pattern = ".at")
[1] "bat" "cat" "5at" "rats" "?at"
[6] "kittycat" "doormats" "matchstick" "hat-trick" "hattrick"
[11] "hat trick"
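To see which part of each string the pattern actually matched, one option is str_extract(), which returns the first match in each string (or NA when there is none). The sketch below uses a short made-up subset of the words (toy_words is hypothetical) and also shows how to match a literal period by escaping it.

```r
library(stringr)

toy_words <- c("bat", "kittycat", "attic", "matchstick")

# First match of ".at" in each string; "attic" has no character before "at"
matched <- str_extract(toy_words, ".at")
matched  # "bat" "cat" NA "mat"

# An unescaped . matches any character; "\\." matches only a literal period
any_char <- str_detect(c("a.b", "axb"), "a.b")    # TRUE TRUE
literal  <- str_detect(c("a.b", "axb"), "a\\.b")  # TRUE FALSE
```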
$, ^ (anchors)
Anchors can be included in an expression to indicate where the pattern must occur in a string. ^ indicates that the string must start with the pattern, and $ indicates that the string must end with it.
str_subset(words, pattern = "^.at")
[1] "bat" "cat" "5at" "rats" "?at"
[6] "matchstick" "hat-trick" "hattrick" "hat trick"
Notice how we did not select “attic”, “kittycat” or “doormats”.
str_subset(words, pattern = ".at$")
[1] "bat" "cat" "5at" "?at" "kittycat"
Notice how we only selected words ending in <some character>at.
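The two anchors can also be used together. As a small sketch using the same words vector, the pattern ^.at$ requires the whole string to be exactly one character followed by “at”:

```r
library(stringr)

words <- c("bat", "cat", "5at", "rats", "?at", "kittycat", "doormats",
           "attic", "matchstick", "hat-trick", "hattrick", "hat trick")

# Anchored at both ends: the entire string must be <some character>at
whole <- str_subset(words, "^.at$")
whole  # "bat" "cat" "5at" "?at"

# Anchored at the start only, with no wildcard: strings beginning with "at"
starts_at <- str_subset(words, "^at")
starts_at  # "attic"
```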
\b and \B (word boundary and not a word boundary)
\b indicates a word boundary, and \B indicates NOT a word boundary. In an R string the backslash must itself be escaped, so these are written "\\b" and "\\B". To see what we mean, see the following:
str_subset(words, pattern = "\\btrick")
[1] "hat-trick" "hat trick"
This selected the strings in which “trick” starts at a word boundary, here directly after a “-” or a space.
str_subset(words, pattern = "\\Btrick")
[1] "hattrick"
This selected the string in which “trick” does not start at a word boundary, because it directly follows another letter.
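A common use of \b is whole-word matching, with a boundary on both sides of the pattern. The sketch below uses a made-up vector (toy_phrases is hypothetical) and also prints the pattern with writeLines() to show what the regex engine receives after R processes the doubled backslashes.

```r
library(stringr)

toy_phrases <- c("cat", "kittycat", "cat nap", "concatenate")

# "cat" as a whole word: a boundary is required before and after it
whole_word <- str_detect(toy_phrases, "\\bcat\\b")
whole_word  # TRUE FALSE TRUE FALSE

# "cat" embedded in a longer word: no boundary immediately before it
embedded <- str_subset(toy_phrases, "\\Bcat")
embedded  # "kittycat" "concatenate"

# The regex engine sees single backslashes
writeLines("\\bcat\\b")
```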
*, +, ? (quantifiers)
Quantifiers specify how many times the preceding character or pattern may repeat.
| Quantifier | Number of Repetitions | Range Quantifier | Meaning |
|---|---|---|---|
| * | 0 or more | {n} | exactly n |
| + | 1 or more | {n,} | at least n |
| ? | 0 or 1 | {,m} | at most m |
| | | {n,m} | between n and m, inclusive |
For example, let’s match words that have 0 or more characters before “at”.
str_subset(words, pattern = ".*at")
[1] "bat" "cat" "5at" "rats" "?at"
[6] "kittycat" "doormats" "attic" "matchstick" "hat-trick"
[11] "hattrick" "hat trick"
All of our words contain “at” somewhere so they are all matched. What about strictly one or more characters before “at”?
str_subset(words, pattern = ".+at")
[1] "bat" "cat" "5at" "rats" "?at"
[6] "kittycat" "doormats" "matchstick" "hat-trick" "hattrick"
[11] "hat trick"
All but attic. Makes sense!
We can also match on specific letters. Let’s select words that contain one or more “o”s:
str_subset(words, pattern = "o+")
[1] "doormats"
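The remaining quantifiers from the table can be sketched on a few made-up vectors (all names below are hypothetical). The examples cover ?, {n,}, and {n}, each applied to the character or wildcard immediately before it.

```r
library(stringr)

# ? makes the preceding character optional (0 or 1 times)
optional_u <- str_detect(c("color", "colour", "colouur"), "colou?r")
optional_u  # TRUE TRUE FALSE

# {2,} requires at least two consecutive repetitions
double_o <- str_subset(c("dog", "door", "zoooom"), "o{2,}")
double_o  # "door" "zoooom"

# {3} with anchors: strings that are exactly three characters long
three_chars <- str_subset(c("bat", "rats", "5at"), "^.{3}$")
three_chars  # "bat" "5at"
```

Without the anchors, .{3} would also match “rats”, because a pattern only needs to match somewhere inside the string.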
Of course, regular expressions can be combined to make powerful search tools, as well. For more on regular expressions, check out Chapter 11 of the STAT545 Textbook and the STAT545 video lecture. We will also learn more as we work through Worksheet B2.
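As one possible illustration of combining these pieces inside a tibble, the sketch below builds a small made-up table in the style of the Enron data (toy_enron and its addresses are hypothetical). It keeps the lines that start with “From: ” and contain an “@”, then strips the prefix with str_remove() to leave the sender's address.

```r
library(dplyr)
library(stringr)

# Hypothetical lines laid out like the email column of the Enron data
toy_enron <- tibble(
  mail_num = c(1, 1, 1, 2),
  email = c(
    "From: a.person@example.com",
    "To: b.person@example.com",
    "Subject: From: the desk of",
    "From: c.person@example.com"
  )
)

senders <- toy_enron %>%
  # anchor + wildcard + quantifier: starts with "From: " and contains an "@"
  filter(str_detect(email, "^From: .+@.+$")) %>%
  # drop the "From: " prefix to keep only the address
  mutate(sender = str_remove(email, "^From: "))

senders$sender  # "a.person@example.com" "c.person@example.com"
```

The ^ anchor is what excludes the “Subject: From: ...” line, which contains “From:” but does not start with it.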
Worksheet B2
Class 1
Before class, start working on parts I and II of Worksheet B2.
Class will be dedicated to getting your questions answered.
Done early? Then do the optional R4DS Strings and R4DS Regular expressions readings (linked above), and do exercises for extra practice.
Class 2
Before class, start working on parts II and III of Worksheet B2.
Class will be dedicated to getting your questions answered.
Done early? Then do the optional R4DS Strings and R4DS Regular expressions readings (linked above), and do exercises for extra practice. Or, start Assignment B4.
Resources
Video lecture:
- Regular Expressions and stringr for Text Data (only labelled as “age restricted” because it looks at real emails within the Enron company.)
Written material:
Overview tutorials similar to our worksheet:
The stat545.com Chapter 11 on character vectors has an elaborate discussion on useful resources for learning more about strings.