Lecture 12: Character Data and Regex

STAT 545 - Fall 2025

Learning Outcomes

  • Manipulate a character vector in R using the stringr package.

  • Write simple regular expressions (regex).

  • Apply stringr and regular expressions to manipulate data in tibbles.

Written Notes

(No video lecture this week!)

Required Packages

We will require the following packages:

library(tidyverse) # stringr is loaded automatically with tidyverse

Strings

You’ve already used plenty of strings without being told explicitly what they are: any time you surround text with ", you are making a string, a storage format for text. In R, strings have type “character”.

Two places where you’ll often want to manipulate these in data analysis:

  • Cleaning up column/variable names

  • Cleaning up character column values

Escape Characters

  • Try running quote <- """.
  • It won’t work, because R reads each " as the start or end of a string rather than as a character inside one. To include a literal quote symbol in a string, use the \ character to “escape” it, as in quote <- "\"".
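A minimal sketch of the escape in action. The variable name quote comes from the bullet above; writeLines() is used here only to show the raw contents of the string.

```r
# Escape the inner quote with a backslash so R treats it as a character
quote <- "\""

# print() shows the escaped form; writeLines() shows the raw contents
print(quote)
writeLines(quote)

# The string holds exactly one character: the double quote itself
nchar(quote)
```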

You can see more examples of special characters and how to escape them in R4DS Chapter 15.2.

Working with Strings

  • We will focus on the stringr package (loaded with tidyverse) paired with regular expressions (also called regex).

We’ll go through a few examples with the enron dataset, which you can download from GitHub. It is a database of 270,087 Enron emails, taken from the May 7, 2015 version of the dataset and obtained from CMU. There are three columns in enron.csv:

  • person: The person associated with the email.

  • mail_num: Identifier for each person’s email.

  • email: Each entry is a line in an email, including the email’s metadata (like subject, who it was sent to, etc.)

Let’s load in the data:
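The loading step might look like the sketch below. The file name enron.csv comes from the description above, but where it lives on your machine is an assumption. So that the snippet runs on its own, it first writes a tiny made-up stand-in file with the same three columns; the rows and the name enron_path are hypothetical.

```r
library(readr)  # also attached by library(tidyverse)

# Hypothetical stand-in for enron.csv: same three columns, made-up rows.
# With the real file, point read_csv() at wherever you saved it instead.
enron_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "person,mail_num,email",
    "person_a,1,From: sender@example.com",
    "person_a,1,Subject: Important update",
    "person_a,1,Please review the attached forecast."
  ),
  enron_path
)

# Read the file into a tibble, stating the column types explicitly
# (c = character, i = integer)
enron <- read_csv(enron_path, col_types = "cic")
enron
```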

For each email sent in this data set (mail_num), we have the sender (person) along with the lines of the email in separate entries in the data frame (email). For now, we’ll focus on the contents of the email, in the email column.

Non-regex String Manipulation

We’ll first go through some stringr functions that don’t require regular expressions.

str_detect() and str_subset()

Let’s keep only the entries that contain email addresses. We can do this by calling str_detect() and finding entries that contain “@”:
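As a rough illustration on a few made-up lines (not the real Enron data), here is str_detect() searching for “@”. The vector email_lines is a hypothetical stand-in for the email column.

```r
library(stringr)  # also attached by library(tidyverse)

# Made-up email lines standing in for the email column
email_lines <- c(
  "From: sender@example.com",
  "Subject: Important update",
  "To: receiver@example.com",
  "Please review the attached forecast."
)

# One TRUE/FALSE per entry: does the line contain "@"?
has_at <- str_detect(email_lines, "@")
has_at
```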

str_detect() returns a logical vector indicating, for each entry, whether or not “@” was found in the email contents. To get the matching entries themselves, we can use str_subset():
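A sketch of the same search with str_subset(), again on hypothetical lines rather than the real data. It returns the matching strings instead of TRUE/FALSE values.

```r
library(stringr)  # also attached by library(tidyverse)

# Made-up email lines standing in for the email column
email_lines <- c(
  "From: sender@example.com",
  "Subject: Important update",
  "To: receiver@example.com",
  "Please review the attached forecast."
)

# Keep only the entries that contain "@"
with_at <- str_subset(email_lines, "@")
with_at

# Same result as indexing with the logical vector from str_detect()
identical(with_at, email_lines[str_detect(email_lines, "@")])
```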

str_split()

Let’s say we wanted to count the number of times the word “important” was written in an email. To do this, we can split every email line into its individual words using the str_split() function. Let’s split by spaces to get each individual word.
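A small sketch of the split, using two made-up lines in place of the email column:

```r
library(stringr)  # also attached by library(tidyverse)

# Two hypothetical email lines of different lengths
email_lines <- c(
  "This is important",
  "Please reply soon to this important message"
)

# Split each line on single spaces; one list element per input line
words_by_line <- str_split(email_lines, " ")
words_by_line

# The elements have different lengths, one per word in each line
lengths(words_by_line)
```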

We get a list! This is because we don’t know how many words each entry will contain, so a list is the most flexible option.

str_split() with unlist()

If we really want a vector of all of the individual words (which we do!), we can unlist() the list:
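Continuing with the same hypothetical lines, unlist() flattens the list into a single character vector:

```r
library(stringr)  # also attached by library(tidyverse)

# Two hypothetical email lines
email_lines <- c(
  "This is important",
  "Please reply soon to this important message"
)

# Split into words, then flatten the list into one character vector
all_words <- unlist(str_split(email_lines, " "))
all_words
```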

str_subset() for “important” emails

Now, we can subset by which words contain “important”, and count the number of entries!
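A sketch of the counting step on a made-up vector of words. The count it produces describes this toy vector only, not the real data.

```r
library(stringr)  # also attached by library(tidyverse)

# Hypothetical words, as if produced by str_split() and unlist()
all_words <- c("This", "is", "important", "Very", "Important",
               "news", "unimportant", "IMPORTANT")

# Keep the words that contain the lowercase text "important"
important_words <- str_subset(all_words, "important")
important_words

# Count them
length(important_words)
```

Note that the match is on any word containing the text, so “unimportant” is counted as well.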

91 times! But unfortunately, str_subset() is case sensitive… See:

and…
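To make the case sensitivity concrete, here is a sketch on three hypothetical spellings of the same word:

```r
library(stringr)  # also attached by library(tidyverse)

# The same word in three capitalizations (made-up example)
all_words <- c("important", "Important", "IMPORTANT")

# Each pattern only finds the spelling with exactly matching case
lower_match <- str_subset(all_words, "important")
capital_match <- str_subset(all_words, "Important")

lower_match
capital_match
```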

This is where regular expressions will come in handy! We’ll discuss this in the following section.
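As a preview, one possible way to ignore case is stringr’s regex() helper with ignore_case = TRUE. This is a sketch of one option, not necessarily the approach the course materials take.

```r
library(stringr)  # also attached by library(tidyverse)

# The same word in three capitalizations (made-up example)
all_words <- c("important", "Important", "IMPORTANT")

# Wrap the pattern in regex() and ask it to ignore case
any_case <- str_subset(all_words, regex("important", ignore_case = TRUE))
any_case
```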

Regular Expressions (regex)

Regular expressions provide a powerful tool to extract character data through patterns! In regexes, specific characters and constructs take on special meaning in order to match multiple strings.

We will go through a few common regex special characters, but refer readers to this Cheat Sheet by Ian Kopacka for a more in-depth review.

Let’s consider a simple vector of words:
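The original vector is not shown here, so the one below is a hypothetical stand-in, chosen to be consistent with the words named in the notes that follow (“attic”, “kittycat”, “doormats”) and with every word containing “at”.

```r
# Hypothetical vector of words; every entry contains "at" somewhere
words <- c("cat", "hat", "attic", "kittycat", "doormats")
words
```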

. (match any character)

The . in regex matches any single character (a letter, number, or symbol) except a new line.

For example, the pattern .at matches any word in which some character is followed by “at”, so a matching word must have at least three characters.
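A sketch of the .at pattern on a hypothetical vector of words (an assumption, since the original vector is not shown):

```r
library(stringr)  # also attached by library(tidyverse)

# Hypothetical vector of words
words <- c("cat", "hat", "attic", "kittycat", "doormats")

# Some character followed by "at", anywhere in the word
dot_at <- str_subset(words, ".at")
dot_at
```

In this toy vector, “attic” is left out because nothing comes before its “at”.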

$, ^ (anchors)

Anchors can be included in an expression to indicate where the expression must occur in a string. ^ indicates that the string must start with the phrase, and $ indicates the string must end with it.
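A sketch of the ^ anchor, again on a hypothetical vector of words:

```r
library(stringr)  # also attached by library(tidyverse)

# Hypothetical vector of words
words <- c("cat", "hat", "attic", "kittycat", "doormats")

# The word must START with some character followed by "at"
starts_with_dot_at <- str_subset(words, "^.at")
starts_with_dot_at
```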

Notice how we did not select “attic”, “kittycat” or “doormats”.
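And a matching sketch of the $ anchor on the same hypothetical vector:

```r
library(stringr)  # also attached by library(tidyverse)

# Hypothetical vector of words
words <- c("cat", "hat", "attic", "kittycat", "doormats")

# The word must END with some character followed by "at"
ends_with_dot_at <- str_subset(words, ".at$")
ends_with_dot_at
```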

Notice how we only selected words ending in <some character>at.

\b and \B (word boundary and not a word boundary)

\b indicates a word boundary, and \B indicates NOT a word boundary. To see what we mean, see the following:
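A sketch of \b, using a small hypothetical vector (phrases is an assumed name, and these entries are made up for illustration). Inside an R string the backslash itself must be escaped, so the regex \b is written "\\b".

```r
library(stringr)  # also attached by library(tidyverse)

# Hypothetical examples: "cat" after a space, after a hyphen, and glued on
phrases <- c("kitty cat", "kitty-cat", "kittycat")

# \b requires a word boundary immediately before "cat"
boundary_cat <- str_subset(phrases, "\\bcat")
boundary_cat
```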

This selected words that were separated by either a “-” or space.
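The counterpart with \B, on the same hypothetical vector:

```r
library(stringr)  # also attached by library(tidyverse)

# Hypothetical examples: "cat" after a space, after a hyphen, and glued on
phrases <- c("kitty cat", "kitty-cat", "kittycat")

# \B requires that there is NO word boundary immediately before "cat"
no_boundary_cat <- str_subset(phrases, "\\Bcat")
no_boundary_cat
```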

This selected words that did not have any separation between them.

*, +, ? (quantifiers)

Quantifiers specify how many times the preceding character must occur for a match.

Quantifier    Number of Characters
*             0 or more
+             1 or more
?             0 or 1

Range to Match    Meaning
{n}               exactly n
{n,}              at least n
{,m}              at most m
{n,m}             between n and m, inclusive

For example, let’s match words that have 0 or more characters before “at”.
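A sketch of the * quantifier on a hypothetical vector of words:

```r
library(stringr)  # also attached by library(tidyverse)

# Hypothetical vector of words; every entry contains "at"
words <- c("cat", "hat", "attic", "kittycat", "doormats")

# Zero or more of any character, followed by "at"
zero_or_more <- str_subset(words, ".*at")
zero_or_more
```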

All of our words contain “at” somewhere so they are all matched.

What about strictly one or more characters before “at”?
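The same idea with +, on the same hypothetical vector:

```r
library(stringr)  # also attached by library(tidyverse)

# Hypothetical vector of words; every entry contains "at"
words <- c("cat", "hat", "attic", "kittycat", "doormats")

# One or more of any character, followed by "at"
one_or_more <- str_subset(words, ".+at")
one_or_more
```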

All but “attic”. Makes sense, since nothing comes before its “at”!

We can also match on specific letters. Let’s select words that contain one or more “o”s:
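A sketch of a quantifier applied to a specific letter, on a hypothetical vector of words:

```r
library(stringr)  # also attached by library(tidyverse)

# Hypothetical vector of words
words <- c("cat", "hat", "attic", "kittycat", "doormats")

# One or more consecutive "o" characters anywhere in the word
has_o <- str_subset(words, "o+")
has_o
```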

Worksheet B2

Of course, regular expressions can be combined to make powerful search tools. For more on regular expressions, check out Chapter 11 of the STAT545 Textbook and the STAT545 video lecture. We will also learn more as we work through Worksheet B2.