Lecture 6b: Factors

STAT 545 - Fall 2025

Learning Outcomes

From today’s class, students are expected to be able to:

  • Reorder levels within factors according to various principles

Video Lecture

Course Notes

Set-up

Throughout this lecture, we will be using functions from the following packages:
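The package list itself is not shown in these notes. A minimal sketch, assuming the penguins data come from the palmerpenguins package and that forcats is loaded through the tidyverse (as mentioned later in the lecture):

```r
# Assumed set-up; the original package list is not reproduced in these notes.
library(tidyverse)       # attaches dplyr, ggplot2, forcats, and other core packages
library(palmerpenguins)  # assumed source of the penguins tibble
```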

Factors

  • Factors represent categorical variables: variables that take on a fixed number of known values

  • For example, in the penguins data set, species is a factor with three levels: “Adelie”, “Chinstrap”, and “Gentoo”. We can see this by looking at the str() (structure) of the tibble:
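A short sketch of that check, assuming penguins comes from the palmerpenguins package:

```r
library(palmerpenguins)  # assumed source of the penguins tibble

# Print the structure of the whole tibble; species is listed as a factor
str(penguins)

# Confirm the same thing for the species column on its own
is.factor(penguins$species)
nlevels(penguins$species)
```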

Factors coded as Numbers

Suppose that, instead of names, species in the penguins data set were coded with the levels:

  • 1 for Gentoo,

  • 2 for Adelie, and

  • 3 for Chinstrap penguins
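One way this coding could be reproduced is sketched below. The name penguins2 is hypothetical, and case_when() is only one of several ways to build such a column.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins tibble

# penguins2 is a hypothetical copy where species is stored as a number
penguins2 <- penguins %>%
  mutate(species = case_when(
    species == "Gentoo"    ~ 1,
    species == "Adelie"    ~ 2,
    species == "Chinstrap" ~ 3
  ))

# Inspect how R now stores the column
str(penguins2$species)
class(penguins2$species)
```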

We see here that R treats species as a numeric variable, not a categorical variable. This means R assumes species can take on any numeric value, such as 1.5, 2.567, or 5, none of which makes sense for this variable!

To ensure R knows that species is categorical, we can use:
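A minimal sketch of the conversion, using factor() from base R. The names penguins2 and penguins3 are hypothetical, and the numeric coding is rebuilt here so the example runs on its own.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins tibble

# Rebuild the hypothetical numerically coded tibble
penguins2 <- penguins %>%
  mutate(species = case_when(
    species == "Gentoo"    ~ 1,
    species == "Adelie"    ~ 2,
    species == "Chinstrap" ~ 3
  ))

# Tell R that species is categorical by converting it to a factor
penguins3 <- penguins2 %>%
  mutate(species = factor(species))

str(penguins3$species)
levels(penguins3$species)
```

If readable names are preferred over codes, factor() also accepts levels and labels arguments, for example factor(species, levels = c(1, 2, 3), labels = c("Gentoo", "Adelie", "Chinstrap")).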

Now we see that species is a factor with three levels!

To make our lives easier, we will work with factors through the forcats package loaded as part of the tidyverse.

Reordering Factor Levels

  • By default, R orders factor levels by sorting the unique values: alphabetically for text and in increasing order for numbers.

  • However, in many cases factors have a logical ordering they should follow (e.g., elementary, secondary, post-secondary, graduate).

  • Reordering data can be useful for both data visualization and model fitting.

To see the current ordering of a factor variable, we can call levels().
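For instance, assuming penguins comes from the palmerpenguins package:

```r
library(palmerpenguins)  # assumed source of the penguins tibble

# List the levels of species in their current order
levels(penguins$species)
```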

We see here that the levels of the factor are ordered alphabetically.

Reordering Levels of a Factor Manually

There are many ways to reorder the levels of a factor in R. One option is a built-in R function such as ordered(), which also marks the factor as ordered:

Now, when we call str() on the penguins data, we see that the factor is explicitly ordered, with our new ordering:
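A sketch of a manual reordering with ordered(). The particular order (Gentoo, Adelie, Chinstrap) and the name penguins_ord are chosen only for illustration and are not taken from the original slides.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins tibble

# Supply the levels in the order we want; ordered() also flags the factor as ordered
penguins_ord <- penguins %>%
  mutate(species = ordered(species, levels = c("Gentoo", "Adelie", "Chinstrap")))

str(penguins_ord$species)
levels(penguins_ord$species)
```

If only the level order should change, without marking the factor as ordered, factor(species, levels = ...) or forcats::fct_relevel() are alternatives.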

Reordering Levels of a Factor Based on a Condition Using forcats

Let’s look again at the original penguins data set and look at the frequency of each species using ggplot:
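A minimal sketch of such a frequency plot, assuming a simple bar chart was intended (the original plotting code is not shown):

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins tibble

# geom_bar() counts the rows in each species; bars follow the factor's level order
p <- ggplot(penguins, aes(x = species)) +
  geom_bar() +
  labs(x = "Species", y = "Count")
p
```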

Notice that the ordering is alphabetical. What would be a more effective ordering?

Let’s order the bars from largest to smallest (or smallest to largest) so readers can easily spot which species is the most common.

Two approaches:

  • edit the original tibble to have new factor ordering, or

  • order the factors directly in the ggplot2 call so that the original tibble isn’t overwritten.

Option A: Reorder Tibble Directly

To reorder the factor levels according to the frequency in the tibble:
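One way to do this is forcats::fct_infreq(), which orders levels from most to least frequent. This is a sketch under the assumption that fct_infreq() was the function used; the original chunk is not shown.

```r
library(dplyr)
library(forcats)
library(palmerpenguins)  # assumed source of the penguins tibble

# Store the reordered factor in a new tibble, penguins4
penguins4 <- penguins %>%
  mutate(species = fct_infreq(species))

# Levels now run from the most common species to the least common
levels(penguins4$species)
```

Wrapping the call as fct_rev(fct_infreq(species)) would give the smallest-to-largest order instead.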

We can then plot our data using penguins4:
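A sketch of that plot. penguins4 is rebuilt here, again assuming fct_infreq() was used, so the example runs on its own.

```r
library(dplyr)
library(forcats)
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins tibble

# Rebuild penguins4 with species ordered by frequency
penguins4 <- penguins %>%
  mutate(species = fct_infreq(species))

# The bars now appear from the most common species to the least common
p4 <- ggplot(penguins4, aes(x = species)) +
  geom_bar() +
  labs(x = "Species", y = "Count")
p4
```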

Option B: Reorder the Factor in the ggplot Call

We can do these steps directly in ggplot without overwriting or making a new tibble!
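A minimal sketch, assuming the reordering is done with fct_infreq() inside aes():

```r
library(forcats)
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins tibble

# Reorder species only for this plot; the penguins tibble itself is not modified
p_b <- ggplot(penguins, aes(x = fct_infreq(species))) +
  geom_bar() +
  labs(x = "Species", y = "Count")
p_b

# The original level order is untouched
levels(penguins$species)
```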

We get the exact same plot with a single chunk of code! And the original tibble penguins remains unchanged, with the species levels still in alphabetical order.

Expanding Factor Levels

Perhaps we want to show that there are no Emperor or King penguins in our data set. To do so, we can add “Emperor” and “King” as possible factor levels for species in our penguins data set. We do so with fct_expand():
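A sketch of the expansion. The name penguins5 is hypothetical.

```r
library(dplyr)
library(forcats)
library(palmerpenguins)  # assumed source of the penguins tibble

# Add two levels that have no observations; new levels are appended at the end
penguins5 <- penguins %>%
  mutate(species = fct_expand(species, "Emperor", "King"))

levels(penguins5$species)
table(penguins5$species)  # Emperor and King appear with a count of zero
```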

Now the visualization can show that no King or Emperor penguins were collected in the data.
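A sketch of such a plot. ggplot2 drops unused factor levels from a discrete axis by default, so this example assumes scale_x_discrete(drop = FALSE) is used to keep the empty categories visible. The name penguins5 is hypothetical.

```r
library(dplyr)
library(forcats)
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins tibble

# Rebuild the hypothetical tibble with the two extra, empty levels
penguins5 <- penguins %>%
  mutate(species = fct_expand(species, "Emperor", "King"))

# drop = FALSE keeps levels with no observations on the x-axis
p5 <- ggplot(penguins5, aes(x = species)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE) +
  labs(x = "Species", y = "Count")
p5
```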

Removing Factor Levels

Let’s suppose we were only interested in comparing Adelie and Chinstrap penguins. We could drop the Gentoo level using forcats by:
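A sketch of one way to do this. fct_drop() only removes levels that have no observations, so the Gentoo rows are filtered out first. The name penguins_ac is hypothetical.

```r
library(dplyr)
library(forcats)
library(palmerpenguins)  # assumed source of the penguins tibble

penguins_ac <- penguins %>%
  filter(species != "Gentoo") %>%        # remove the Gentoo rows
  mutate(species = fct_drop(species))    # then drop the now-unused Gentoo level

levels(penguins_ac$species)
```

Without the fct_drop() step, Gentoo would remain as a level with zero observations.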

Worksheet A5

Try your hand at using factors by working through the factors portion of Worksheet A5.

Finished attempting all of the questions? Then do the optional R4DS Factors reading, and maybe even do some of the exercises for extra practice.