Lecture 12: stringr and Regular Expressions

November 18, 2025

Modified

November 21, 2025

By the end of this topic, students should be able to:

Manipulate a character vector in R using the stringr package.
Write simple regular expressions (regex).
Apply stringr and regular expressions to manipulate data in tibbles.

We will require the following packages:

library(tidyverse) # stringr is loaded automatically as part of the tidyverse

Lecture Slides

Lecture 12 - R Character Data

Strings

You’ve already used plenty of strings without explicitly knowing what they are: any time you surround text with ", you are making a string, a storage format for text. In R, strings have type “character”.

sample_string <- "This is a string" 
typeof(sample_string)

[1] "character"

Two places where you’ll often want to manipulate these in data analysis:

Cleaning up column/variable names
Cleaning up character column values

Good to know: Constructing strings out of characters and numbers is intuitive, but there’s a gotcha involving particular symbols with special meaning in R. For example, try running quote <- """ in R; it won’t work, because the " symbol is interpreted as you trying to make a string! To literally include a quote in a string, you can use the \ character to “escape” it:

double_quote <- "\"" # a string holding one double-quote character
cat(double_quote)

You can see more examples of special characters and how to escape them in R4DS Chapter 15.2.
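As a small illustration of escaping, the sketch below builds a string that contains double quotes. The sentence is made up for illustration; the functions used are from base R and stringr.

```r
library(stringr)

# A string containing double quotes, each escaped with a backslash
quoted <- "She said \"hello\""

# print() shows the escaped form; writeLines() shows the raw contents
print(quoted)
writeLines(quoted)

# The backslashes are not stored in the string: it has 16 characters
str_length(quoted)

# Wrapping the string in single quotes avoids the escapes entirely
identical(quoted, 'She said "hello"')
```

The difference between print() and writeLines() is worth remembering: the printed form is what you would type to recreate the string, not what the string actually contains.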

Working with Strings

Our main tools for working with strings will be the powerful stringr package in the tidyverse paired with regular expressions (also called regex).

We’ll be going through a few examples with the enron dataset, which you can download on GitHub. Enron is a database of 270,087 Enron emails, taken from the May 7, 2015 version of the dataset. The data was obtained from CMU. There are three columns in enron.csv:

person: The person associated with the email.
mail_num: Identifier for each person’s email.
email: Each entry is a line in an email, including the email’s metadata (like subject, who it was sent to, etc.)

Let’s load in the data:

enron <- read_csv("datasets/enron.csv")
head(enron)

# A tibble: 6 × 3
  mail_num person  email                                                    
     <dbl> <chr>   <chr>                                                    
1        1 allen-p Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>
2        1 allen-p Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)              
3        1 allen-p From: phillip.allen@enron.com                            
4        1 allen-p To: tim.belden@enron.com                                 
5        1 allen-p Subject:                                                 
6        1 allen-p Mime-Version: 1.0

For each email in this data set (mail_num), we have the person associated with it (person) along with the lines of the email in separate rows of the data frame (email). For now, we’ll focus on the contents of the emails, in the email column.

enron_email <- enron$email

Non-regex String Manipulation

We’ll first go through some stringr functions that don’t require regular expressions.

str_detect() and str_subset()

Let’s keep only the entries that contain an email address. We can do this by calling str_detect() and finding entries that contain “@”:

enron_email_at <- str_detect(enron_email, "@") # find all entries with "@" in the email column

head(enron_email_at)

[1]  TRUE FALSE  TRUE  TRUE FALSE FALSE

str_detect() returns a logical vector indicating, for each entry, whether “@” was found in the email contents. To get the matching entries themselves, we can use str_subset():

enron_email_at <- str_subset(enron_email, "@") # overwrite with the matching entries
head(enron_email_at)

[1] "Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>"
[2] "From: phillip.allen@enron.com"                            
[3] "To: tim.belden@enron.com"                                 
[4] "X-To: Tim Belden <Tim Belden/Enron@EnronXGate>"           
[5] "Message-ID: <15464986.1075855378456.JavaMail.evans@thyme>"
[6] "From: phillip.allen@enron.com"
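The relationship between the two functions, and how str_detect() fits into a tibble workflow, can be sketched on a small made-up tibble. The tibble toy and its contents are hypothetical and only mimic the shape of the enron data.

```r
library(tidyverse)

# Made-up rows in the same shape as the enron data (not real emails)
toy <- tibble(
  mail_num = c(1, 1, 1, 2),
  email = c("From: a.person@example.com",
            "Subject: lunch",
            "To: b.person@example.com",
            "See you at noon")
)

# str_subset(x, p) gives the same result as x[str_detect(x, p)]
same <- identical(
  str_subset(toy$email, "@"),
  toy$email[str_detect(toy$email, "@")]
)
same

# Inside a tibble, str_detect() pairs naturally with filter()
toy_at <- filter(toy, str_detect(email, "@"))
toy_at
```

Using filter() with str_detect() keeps the other columns (here mail_num) attached to the matching rows, which is usually what you want when the strings live in a data frame rather than a bare vector.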

str_split()

Let’s say we wanted to count the number of times the word “important” was written in the emails. To do this, we can split every email line into its individual words using the str_split() function. Let’s split by spaces to get each individual word.

enron_email_words <- str_split(enron_email, " ")

head(enron_email_words)

[[1]]
[1] "Message-ID:"                                  
[2] "<18782981.1075855378110.JavaMail.evans@thyme>"

[[2]]
[1] "Date:"    "Mon,"     "14"       "May"      "2001"     "16:39:00" "-0700"   
[8] "(PDT)"   

[[3]]
[1] "From:"                   "phillip.allen@enron.com"

[[4]]
[1] "To:"                  "tim.belden@enron.com"

[[5]]
[1] "Subject:"

[[6]]
[1] "Mime-Version:" "1.0"

We get a list! This is because we don’t know how many words each entry will contain, so a list is the most flexible option. If we really want a single vector of all the individual words (which we do!), we can unlist() the list:

enron_email_words <- unlist(enron_email_words)

head(enron_email_words)

[1] "Message-ID:"                                  
[2] "<18782981.1075855378110.JavaMail.evans@thyme>"
[3] "Date:"                                        
[4] "Mon,"                                         
[5] "14"                                           
[6] "May"

Now, we can subset by which words contain “important”, and count the number of entries!

enron_emails_important <- str_subset(enron_email_words, "important")
length(enron_emails_important)

[1] 91

91 times! But unfortunately, str_subset() is case sensitive… See:

enron_emails_important <- str_subset(enron_email_words, "Important")
length(enron_emails_important)

[1] 9

and…

enron_emails_important <- str_subset(enron_email_words, "IMPORTANT")
length(enron_emails_important)

[1] 5

This is where regular expressions will come in handy! We’ll discuss this in the following section.
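The split-and-count steps above, and the case-sensitivity problem, can be reproduced on a tiny made-up vector. The lines in toy_lines are invented for illustration, and the lower-casing step is one possible non-regex workaround rather than the approach taken in this lecture.

```r
library(stringr)

# Made-up lines (not from the Enron data)
toy_lines <- c("This is important", "Important: read this", "VERY IMPORTANT news")

# Split each line on spaces, then flatten the list into one vector of words
toy_words <- unlist(str_split(toy_lines, " "))

# Matching is case sensitive, so each spelling is counted separately
length(str_subset(toy_words, "important"))  # 1
length(str_subset(toy_words, "Important"))  # 1
length(str_subset(toy_words, "IMPORTANT"))  # 1

# A non-regex workaround: convert everything to lower case first
n_any_case <- length(str_subset(str_to_lower(toy_words), "important"))
n_any_case  # 3
```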

Regular Expressions (regex)

Regular expressions provide a powerful tool to extract character data through patterns! In regexes, specific characters and constructs take on special meaning in order to match multiple strings.

We will go through a few common regex special characters, but refer readers to this Cheat Sheet by Ian Kopacka for a more in-depth review.

Let’s consider a simple vector of words:

words <- c("bat", "cat", "5at", "rats", "?at", "kittycat", "doormats", "attic", "matchstick", "hat-trick", "hattrick", "hat trick")

`.` (match any character)

The . in a regex matches any single character (a letter, number, or symbol) other than a newline.

For example, the pattern .at matches any string that contains some character followed by “at”, so a matching word must have at least three characters.

str_subset(words, pattern = ".at")

 [1] "bat"        "cat"        "5at"        "rats"       "?at"       
 [6] "kittycat"   "doormats"   "matchstick" "hat-trick"  "hattrick"  
[11] "hat trick"
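Because . matches any character, it can match more than intended when you actually mean a literal dot. A minimal sketch of the difference, using made-up file names:

```r
library(stringr)

files <- c("data.csv", "data_csv", "notes.txt")  # made-up file names

# Unescaped, "." matches any character, so "data_csv" matches too
any_char <- str_subset(files, "data.csv")
any_char

# "\\." matches a literal dot (the backslash is doubled inside an R string)
literal_dot <- str_subset(files, "data\\.csv")
literal_dot

# fixed() tells stringr to treat the whole pattern as literal text
literal_fixed <- str_subset(files, fixed("data.csv"))
literal_fixed
```

fixed() is often the simpler choice when the pattern contains no regex features at all.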

`$, ^` (anchors)

Anchors can be included in an expression to indicate where the pattern must occur in a string. ^ indicates that the string must start with the pattern, and $ indicates that the string must end with it.

str_subset(words, pattern = "^.at")

[1] "bat"        "cat"        "5at"        "rats"       "?at"       
[6] "matchstick" "hat-trick"  "hattrick"   "hat trick"

Notice how we did not select “attic”, “kittycat”, or “doormats”: none of them has “at” as its second and third characters.

str_subset(words, pattern = ".at$")

[1] "bat"      "cat"      "5at"      "?at"      "kittycat"

Notice how we only selected words ending in <some character>at.
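The two anchors can also be combined so that the pattern has to account for the whole string. A short sketch using the same words vector, repeated here so the block runs on its own:

```r
library(stringr)

words <- c("bat", "cat", "5at", "rats", "?at", "kittycat", "doormats",
           "attic", "matchstick", "hat-trick", "hattrick", "hat trick")

# ^ and $ together: the entire string must be one character followed by "at"
exact <- str_subset(words, "^.at$")
exact

# ^ on its own: strings that begin with "at"
starts_at <- str_subset(words, "^at")
starts_at
```

Only the three-character entries survive the first pattern, because “rats” and “kittycat” have extra characters that the anchors do not allow.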

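`\b` and `\B` (word boundary and non-word boundary)

\b indicates a word boundary, and \B indicates a position that is NOT a word boundary. Because the backslash is itself an escape character in R strings, we write these as "\\b" and "\\B" in code. To see what we mean, see the following: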

str_subset(words, pattern = "\\btrick")

[1] "hat-trick" "hat trick"

This selected the entries where “trick” starts a new word, here after either a “-” or a space.

str_subset(words, pattern = "\\Btrick")

[1] "hattrick"

This selected the entry where “trick” does not start at a word boundary, because it directly follows another letter.
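A common use of \b is matching a whole word rather than a fragment inside a longer word. A small sketch with made-up phrases:

```r
library(stringr)

phrases <- c("the cat sat", "concatenate", "cat-like", "bobcat")  # made-up examples

# Without boundaries, "cat" is found inside longer words as well
anywhere <- str_subset(phrases, "cat")
anywhere

# With a boundary on both sides, only the standalone word "cat" matches
whole_word <- str_subset(phrases, "\\bcat\\b")
whole_word
```

“cat-like” still matches because the hyphen is not a word character, so there is a boundary between “t” and “-”.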

`*`, `+`, `?` (quantifiers)

Quantifiers specify how many times the preceding character or group may be repeated.

Quantifier	Meaning
*	0 or more
+	1 or more
?	0 or 1
{n}	exactly n
{n,}	at least n
{0,m}	at most m
{n,m}	between n and m, inclusive

For example, let’s match words that have 0 or more characters before “at”.

str_subset(words, pattern = ".*at")

 [1] "bat"        "cat"        "5at"        "rats"       "?at"       
 [6] "kittycat"   "doormats"   "attic"      "matchstick" "hat-trick" 
[11] "hattrick"   "hat trick"

All of our words contain “at” somewhere so they are all matched. What about strictly one or more characters before “at”?

str_subset(words, pattern = ".+at")

 [1] "bat"        "cat"        "5at"        "rats"       "?at"       
 [6] "kittycat"   "doormats"   "matchstick" "hat-trick"  "hattrick"  
[11] "hat trick"

All but attic. Makes sense!

We can also apply quantifiers to specific letters. Let’s select words with one or more “o”s:

str_subset(words, pattern = "o+")

[1] "doormats"
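The range quantifiers from the table work the same way. The sketch below uses a made-up vector of spellings and anchors each pattern so that the count of “o”s has to be exact:

```r
library(stringr)

cheers <- c("gal", "goal", "gooal", "goooal")  # made-up spellings

# {n}: exactly two "o"s
exactly_two <- str_subset(cheers, "^go{2}al$")
exactly_two

# {n,}: two or more "o"s
two_or_more <- str_subset(cheers, "^go{2,}al$")
two_or_more

# {0,m}: at most one "o" (equivalent to "^go?al$")
at_most_one <- str_subset(cheers, "^go{0,1}al$")
at_most_one

# {n,m}: between one and two "o"s, inclusive
one_to_two <- str_subset(cheers, "^go{1,2}al$")
one_to_two
```

Without the ^ and $ anchors, a pattern such as "o{2}" would also match “goooal”, since that string contains two consecutive “o”s somewhere inside it.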

Of course, regular expressions can be combined to make powerful search tools as well. For more on regular expressions, check out Chapter 11 of the STAT545 Textbook and the STAT545 video lecture. We will also learn more as we work through Worksheet B2.

Example: Back to the Emails

So, how do we select words that contain “important” regardless of case? There are a few ways. The easiest is to leverage stringr by wrapping the regular expression in the regex() function with ignore_case = TRUE:

enron_emails_important <- str_subset(enron_email_words, regex("important", ignore_case = TRUE))
length(enron_emails_important)

[1] 105
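The same case-insensitive pattern can be applied directly to a tibble column. This is a minimal sketch on a made-up tibble (toy and its contents are hypothetical stand-ins for the enron data), counting how many distinct emails mention the word at least once:

```r
library(tidyverse)

# Made-up rows in the same shape as the enron data (not real emails)
toy <- tibble(
  mail_num = c(1, 1, 2, 3),
  email = c("This is important", "See attached", "IMPORTANT: budget", "Lunch?")
)

# Keep rows whose text mentions "important" in any capitalisation
toy_important <- toy %>%
  filter(str_detect(email, regex("important", ignore_case = TRUE)))
toy_important

# Count the distinct emails (mail_num) with at least one matching line
n_emails <- n_distinct(toy_important$mail_num)
n_emails
```

Note that this counts emails rather than individual word occurrences, so it answers a slightly different question from the word count above.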

Resources

Video lecture:

Regular Expressions and stringr for Text Data (only labelled as “age restricted” because it looks at real emails within the Enron company.)

Written material:

Overview tutorials similar to our worksheet:
The stat545.com Chapter 11 on character vectors has an elaborate discussion on useful resources for learning more about strings.
Regex cheat sheet