STAT 545 - Fall 2025
From today’s class, students are anticipated to be able to:
Throughout this lecture, we will be using functions from the following packages:
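A minimal setup sketch: the code examples in these notes assume the tidyverse (which attaches forcats, dplyr, and ggplot2) and that the penguins tibble comes from the palmerpenguins package. The second package is an assumption here, not something stated in the text above.

```r
# Assumed setup for the examples in these notes
library(tidyverse)       # attaches forcats, dplyr, ggplot2, among others
library(palmerpenguins)  # assumed source of the `penguins` tibble
```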
Factors represent categorical variables: variables that take on a fixed number of known values
For example, in the penguins data set, species is a factor with three levels: “Adelie”, “Chinstrap”, and “Gentoo”. We can see this by looking at the str() (structure) of the tibble:
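A short sketch of that check, assuming penguins is the tibble from the palmerpenguins package:

```r
library(palmerpenguins)  # assumed source of `penguins`

# Structure of the whole tibble; species is reported as a Factor w/ 3 levels
str(penguins)

# The same information for the species column alone
class(penguins$species)
levels(penguins$species)
```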
Suppose that in the penguins data set, instead of names, species were coded as:
1 for Gentoo,
2 for Adelie, and
3 for Chinstrap penguins
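One way this coding could be simulated is sketched below. The name penguins2 and the use of case_when() are illustrative assumptions, not necessarily how the lecture built its example.

```r
library(tidyverse)
library(palmerpenguins)  # assumed source of `penguins`

# penguins2 is a hypothetical name for the numerically coded copy
penguins2 <- penguins %>%
  mutate(species = case_when(
    species == "Gentoo"    ~ 1,
    species == "Adelie"    ~ 2,
    species == "Chinstrap" ~ 3
  ))

# species now shows up as a numeric (num) column rather than a factor
str(penguins2)
```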
We see here that R treats species as a numeric variable, not a categorical one. This means R assumes species can take on any numeric value, such as 1.5, 2.567, or 5 (none of which make sense for this variable!)
To ensure R knows that species is categorical, we can use:
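A sketch of the conversion, assuming base R's factor() is the intended tool (as.factor() would behave the same way here). The numerically coded tibble penguins2 is a hypothetical name and is rebuilt inside the chunk so that it runs on its own.

```r
library(tidyverse)
library(palmerpenguins)  # assumed source of `penguins`

# Rebuild the numerically coded copy (hypothetical name penguins2)
penguins2 <- penguins %>%
  mutate(species = case_when(
    species == "Gentoo"    ~ 1,
    species == "Adelie"    ~ 2,
    species == "Chinstrap" ~ 3
  ))

# Convert the numeric codes into a factor
penguins2 <- penguins2 %>%
  mutate(species = factor(species))

str(penguins2$species)
```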
Now we see that species is a factor with three levels!
To make our lives easier, we will work with factors through the forcats package loaded as part of the tidyverse.
By default, factor levels are ordered alphabetically (for character data) or numerically (for numeric data).
However, in many cases factors have a logical ordering they should follow (e.g., elementary, secondary, post-secondary, graduate).
Reordering factor levels can be useful for both data visualization and model fitting.
To see the current ordering of a factor variable, we can call levels().
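For example, assuming penguins comes from palmerpenguins:

```r
library(palmerpenguins)  # assumed source of `penguins`

levels(penguins$species)
#> [1] "Adelie"    "Chinstrap" "Gentoo"
```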
We see here that the levels of the factor are ordered alphabetically.
There are many ways to reorder the levels of a factor in R. One option is to use a built-in R function such as ordered():
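A sketch of reordering with ordered(). Both the name penguins3 and the particular ordering (Chinstrap, Gentoo, Adelie) are illustrative assumptions; any ordering that suits the analysis could be supplied through the levels argument.

```r
library(tidyverse)
library(palmerpenguins)  # assumed source of `penguins`

# penguins3 and this level order are hypothetical, chosen for illustration
penguins3 <- penguins %>%
  mutate(species = ordered(species,
                           levels = c("Chinstrap", "Gentoo", "Adelie")))

levels(penguins3$species)
```

Note that ordered() does more than change the level order: it also marks the factor as ordinal, which affects how some modelling functions treat it.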
Now, when we call str() on the penguins data, we see that the factor is explicitly ordered, with our new ordering:
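A sketch of that check. The ordered copy (hypothetical name penguins3, with an illustrative level order) is rebuilt so the chunk runs on its own.

```r
library(tidyverse)
library(palmerpenguins)  # assumed source of `penguins`

# Rebuild the ordered copy (hypothetical name and ordering)
penguins3 <- penguins %>%
  mutate(species = ordered(species,
                           levels = c("Chinstrap", "Gentoo", "Adelie")))

# species is now reported as an Ord.factor, with levels joined by "<"
str(penguins3)
```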
Let’s look again at the original penguins data set and plot the frequency of each species using ggplot:
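A minimal sketch of such a bar chart, assuming geom_bar() with its default count statistic:

```r
library(tidyverse)
library(palmerpenguins)  # assumed source of `penguins`

# One bar per species; bar height is the number of rows for that species
p <- ggplot(penguins, aes(x = species)) +
  geom_bar()
p
```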
Notice that the ordering is alphabetical. What would be a more effective ordering?
Notes:
Let’s order the bars from largest to smallest (or smallest to largest) so readers can easily spot which species is the most common.
Two approaches:
edit the original tibble to have new factor ordering, or
order the factors directly in the ggplot2 call so that the original tibble isn’t overwritten.
Option A: Reorder Tibble Directly
To reorder the factor levels according to the frequency in the tibble:
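A sketch of this step, assuming forcats::fct_infreq() is the function used. It orders levels from most to least frequent, and wrapping it in fct_rev() would give smallest to largest instead.

```r
library(tidyverse)
library(palmerpenguins)  # assumed source of `penguins`

# New tibble whose species levels are ordered by decreasing frequency
penguins4 <- penguins %>%
  mutate(species = fct_infreq(species))

levels(penguins4$species)
#> [1] "Adelie"    "Gentoo"    "Chinstrap"
```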
We can then plot our data using penguins4:
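A sketch of the plot. penguins4 is rebuilt here (assuming it was created with fct_infreq()) so the chunk runs on its own.

```r
library(tidyverse)
library(palmerpenguins)  # assumed source of `penguins`

# Rebuild the frequency-ordered tibble
penguins4 <- penguins %>%
  mutate(species = fct_infreq(species))

# Bars now appear from most to least common species
p <- ggplot(penguins4, aes(x = species)) +
  geom_bar()
p
```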
Option B: Reorder the Factor in the ggplot Call
We can do these steps directly in ggplot without overwriting or making a new tibble!
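A sketch of the in-plot approach, assuming fct_infreq() is applied inside aes(). The labs() call is an addition here so the axis title reads "species" rather than the full expression.

```r
library(tidyverse)
library(palmerpenguins)  # assumed source of `penguins`

# Reorder inside aes(); the penguins tibble itself is not modified
p <- ggplot(penguins, aes(x = fct_infreq(species))) +
  geom_bar() +
  labs(x = "species")
p

# The original factor keeps its alphabetical levels
levels(penguins$species)
```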
We get the exact same plot with a single chunk of code! And, the original tibble penguins remains unchanged with the alphabetical ordering of the species factors.
Perhaps we want the plot to show that our data set contains no Emperor or King penguins. To do so, we can add “Emperor” and “King” as possible levels of species in our penguins data set using fct_expand():
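A sketch of adding the two empty levels. The name penguins5 is a hypothetical choice for illustration.

```r
library(tidyverse)
library(palmerpenguins)  # assumed source of `penguins`

# penguins5 is a hypothetical name; the new levels are appended at the end
penguins5 <- penguins %>%
  mutate(species = fct_expand(species, "Emperor", "King"))

levels(penguins5$species)

# The new levels exist but have no observations
table(penguins5$species)
```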
Now the visualization can show that no King or Emperor penguins were collected in the data.
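A sketch of such a plot. By default ggplot2 drops unused factor levels from a discrete axis, so this example assumes scale_x_discrete(drop = FALSE) is used to keep the empty Emperor and King positions. The name penguins5 is hypothetical.

```r
library(tidyverse)
library(palmerpenguins)  # assumed source of `penguins`

# Rebuild the expanded tibble (hypothetical name penguins5)
penguins5 <- penguins %>%
  mutate(species = fct_expand(species, "Emperor", "King"))

# drop = FALSE keeps axis positions for levels with zero observations
p <- ggplot(penguins5, aes(x = species)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)
p
```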
Let’s suppose we were only interested in comparing Adelie and Chinstrap penguins. We could drop the Gentoo level using forcats by:
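A sketch of one way to do this, assuming the rows are filtered first and fct_drop() then removes the now-unused level. The name adelie_chinstrap is hypothetical.

```r
library(tidyverse)
library(palmerpenguins)  # assumed source of `penguins`

# Filtering alone keeps "Gentoo" as an empty level; fct_drop() removes it
adelie_chinstrap <- penguins %>%
  filter(species != "Gentoo") %>%
  mutate(species = fct_drop(species))

levels(adelie_chinstrap$species)
#> [1] "Adelie"    "Chinstrap"
```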
Try your hand at using factors by working through the factors portion of Worksheet A5.
Finished attempting all of the questions? Then do the optional R4DS Factors reading, and maybe even do some of the exercises for extra practice.