STAT 545 - Fall 2025
Manipulate a character vector in R using the stringr package.
Write simple regular expressions (regex).
Apply stringr and regular expressions to manipulate data in tibbles.
(No video lecture this week!)
We will require the following packages:
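The setup chunk is not shown here. Since these notes say stringr is loaded with tidyverse, a minimal sketch of the setup (assuming tidyverse is already installed) would be:

```r
# Attaching the tidyverse also attaches stringr, dplyr, readr, and friends.
library(tidyverse)
```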
You’ve used a bunch of strings at this point without knowing explicitly what they are: any time you surround text with ", you’re making a string, which is a storage format for text. In R, strings are of type “character”.
Two places where you’ll often want to manipulate these in data analysis:
Cleaning up column/variable names
Cleaning up character column values
Some characters need special treatment inside a string. For example, quote <- """ does not work, because the " symbol is interpreted as you trying to make a string! To literally include a quote symbol in a string, you can use the \ character to “escape” it.
You can see more examples of special characters and how to escape them in R4DS Chapter 15.2.
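As a small sketch of escaping, the following stores a single quote symbol in a string. The use of writeLines() and nchar() to inspect the result is our own choice for illustration:

```r
# The backslash escapes the double quote, so the string holds one " character.
quote <- "\""

# writeLines() prints the raw contents of the string, without the escape.
writeLines(quote)

# The backslash itself is not stored: the string has exactly one character.
nchar(quote)
```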
To manipulate strings, we’ll use the stringr package (loaded with tidyverse) paired with regular expressions (also called regex).
We’ll be going through a few examples with the enron dataset, which you can download on GitHub. It is a database of 270,087 Enron emails, taken from the May 7, 2015 version of the dataset. The data was extracted from CMU. There are three columns in enron.csv:
person: The person associated with the email.
mail_num: Identifier for each person’s email.
email: Each entry is a line in an email, including the email’s metadata (like subject, who it was sent to, etc.)
Let’s load in the data:
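The loading chunk is not shown here. One way to do it, assuming you saved enron.csv in your working directory (the path and the object name enron are assumptions, so adjust them to your setup), is:

```r
library(readr)

# Assumed location of the downloaded file; change this to wherever you saved it.
enron_path <- "enron.csv"

# Only read the file if it is actually there, so the chunk fails gently.
if (file.exists(enron_path)) {
  enron <- read_csv(enron_path)
  print(head(enron))
} else {
  message("Could not find ", enron_path, "; check the path.")
}
```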
For each email sent in this data set (mail_num), we have the sender (person) along with the lines of the email in separate entries in the data frame (email). For now, we’ll focus on the contents of the email, in the email column.
We’ll first go through some stringr functions that don’t require regular expressions.
Let’s filter only on entries that contain emails. We can do this by calling str_detect() and finding entries that contain “From:”:
str_detect() actually returns a logical vector showing whether or not the pattern was found in each entry! To get the matching entries themselves, we can use str_subset():
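To see the difference between the two functions, here is a minimal sketch on a made-up vector. The three lines in email_lines are hypothetical stand-ins for the email column, not actual rows from the Enron data:

```r
library(stringr)

# Hypothetical email lines standing in for the email column.
email_lines <- c(
  "From: alice@example.com",
  "Subject: Quarterly report",
  "See you at the meeting."
)

# str_detect() returns one TRUE/FALSE per element.
has_from <- str_detect(email_lines, "From:")
has_from

# str_subset() returns the matching elements themselves.
from_lines <- str_subset(email_lines, "From:")
from_lines
```

On a tibble, the logical vector from str_detect() is what you would pass to filter().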
Let’s say we wanted to count the number of times the word “important” was written in an email. To do this, we can split every email line into its individual words using the str_split() function. Let’s split by spaces to get each individual word.
We get a list! This is because we don’t know how many words each entry will contain, so a list is the most flexible option.
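Here is a small sketch of the splitting step, again on hypothetical lines rather than the real data. Notice that each element of the list has a different length:

```r
library(stringr)

# Hypothetical email lines with different numbers of words.
email_lines <- c(
  "This is important",
  "Please reply",
  "An IMPORTANT and important note"
)

# Split each line on single spaces; the result is a list of character vectors.
words_list <- str_split(email_lines, " ")
class(words_list)

# lengths() reports how many words each line produced.
lengths(words_list)
```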
If we really wanted a vector of all of the individual words (which we do!), we can unlist() the list:
Now, we can subset by which words contain “important”, and count the number of entries!
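Putting the last few steps together, a sketch of the split, unlist, subset, and count pipeline might look like this. The input lines are hypothetical, so the count here has nothing to do with the count in the Enron data:

```r
library(stringr)

# Hypothetical email lines.
email_lines <- c(
  "This is important",
  "Please reply",
  "An IMPORTANT and important note"
)

# Flatten the list of per-line words into one character vector.
all_words <- unlist(str_split(email_lines, " "))

# Keep only the words that contain "important" (lowercase), then count them.
important_words <- str_subset(all_words, "important")
length(important_words)
```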
91 times! But unfortunately, str_subset() is case sensitive… See:
and…
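A quick sketch of the case sensitivity problem, using three hypothetical spellings of the same word:

```r
library(stringr)

# Three spellings of the same word.
variants <- c("important", "Important", "IMPORTANT")

# Only the all-lowercase spelling matches the lowercase pattern.
matched <- str_subset(variants, "important")
matched

# str_detect() shows which elements were skipped.
detected <- str_detect(variants, "important")
detected
```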
This is where regular expressions will come in handy! We’ll discuss this in the following section.
Regular expressions provide a powerful tool to extract character data through patterns! In regexes, specific characters and constructs take on special meaning in order to match multiple strings.
We will go through a few common regex special characters, but refer readers to this Cheat Sheet by Ian Kopacka for a more in-depth review.
Let’s consider a simple vector of words:
. (match any character)
The . in regex allows us to match any single character (a number, letter, or symbol), aside from a new line.
For example, the pattern .at would match any word that has at least three characters, where we have some character followed by “at”.
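As a sketch, here is the .at pattern applied to a hypothetical word vector. The vector words is an assumption built from words mentioned in these notes, and may not be the exact vector used in the lecture:

```r
library(stringr)

# Hypothetical word vector.
words <- c("cat", "attic", "kittycat", "doormats")

# ".at" needs some character immediately before "at".
dot_matches <- str_subset(words, ".at")
dot_matches
```

"attic" is dropped because nothing comes before its "at".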
$, ^ (anchors)
Anchors can be included in an expression to indicate where the expression must occur in a string. ^ indicates that the string must start with the pattern, and $ indicates that the string must end with it.
Notice how we did not select “attic”,“kittycat” or “doormats”
Notice how we only selected words ending in <some character>at.
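A minimal sketch of both anchors, assuming the patterns ^.at and .at$ and a hypothetical word vector (neither is confirmed to be exactly what the lecture used):

```r
library(stringr)

# Hypothetical word vector.
words <- c("cat", "attic", "kittycat", "doormats")

# "^.at": the string must START with some character followed by "at".
starts_with <- str_subset(words, "^.at")
starts_with

# ".at$": the string must END with some character followed by "at".
ends_with <- str_subset(words, ".at$")
ends_with
```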
\b and \B (boundaries and not a word boundary)
\b indicates a word boundary, and \B indicates NOT a word boundary. To see what we mean, see the following:
This selected words that were separated by either a “-” or space.
This selected words that did not have any separation between them.
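To make the two boundary types concrete, here is a sketch with a hypothetical vector and the illustrative patterns \bat and \Bat, which may differ from the ones used in the lecture. Note that inside an R string the backslash itself must be escaped, so we write "\\b" and "\\B":

```r
library(stringr)

# Hypothetical phrases.
phrases <- c("at ease", "cat", "look-at", "doormats", "attic")

# "\\bat": "at" begins at a word boundary (start of string, or after "-" or a space).
at_boundary <- str_subset(phrases, "\\bat")
at_boundary

# "\\Bat": "at" is NOT at a word boundary, so a letter must come right before it.
not_boundary <- str_subset(phrases, "\\Bat")
not_boundary
```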
*, +, ? (quantifiers)
Quantifiers let us specify how many times a character (or group of characters) should be repeated in a match.
| Quantifier | Number of Characters | Range to Match | Meaning |
|---|---|---|---|
| * | 0 or more | {n} | exactly n |
| + | 1 or more | {n,} | at least n |
| ? | 0 or 1 | {,m} | at most m |
|  |  | {n,m} | between n and m, inclusive |
For example, let’s match words that have 0 or more characters before “at”.
All of our words contain “at” somewhere so they are all matched.
What about strictly one or more characters before “at”?
All but attic. Makes sense!
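A side-by-side sketch of * and +, assuming the patterns .*at and .+at and a hypothetical word vector:

```r
library(stringr)

# Hypothetical word vector.
words <- c("cat", "attic", "kittycat", "doormats")

# ".*at": zero or more characters before "at", so every word containing "at" matches.
zero_or_more <- str_subset(words, ".*at")
zero_or_more

# ".+at": at least one character before "at", which rules out "attic".
one_or_more <- str_subset(words, ".+at")
one_or_more
```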
We can also match on specific letters. Let’s select words with one or more “o”s in them:
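As a sketch, here is the pattern o+ on a hypothetical word vector. The str_extract() call is our own addition, included to show which part of the word actually matched:

```r
library(stringr)

# Hypothetical word vector.
words <- c("cat", "attic", "kittycat", "doormats")

# "o+": one or more consecutive "o"s anywhere in the word.
o_words <- str_subset(words, "o+")
o_words

# str_extract() pulls out the matched text; + is greedy, so both "o"s are taken.
o_match <- str_extract("doormats", "o+")
o_match
```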
Of course, regular expressions can be combined together to make powerful search tools, as well. For more on regular expressions, check out Chapter 11 of the STAT545 Textbook and the STAT545 video lecture. We will also learn more as we work through Worksheet B2.