Lecture 8b: Advanced Functions

October 23, 2025

Modified

October 19, 2025

We will learn about several advanced topics:

  • Data-masking and the curly-curly operator {{}}

  • Default parameters

  • Ellipses (...)

  • Handling NAs

These topics are covered in the R4DS Functions book chapter as well. So if you miss this class, then the R4DS Functions reading is a good alternative.

We will be using the following packages throughout this lecture:

library(palmerpenguins)
library(tidyverse)
library(gapminder)

Data-masking and the Curly-Curly {{}}

Sometimes your function needs to take in variable names without quotation marks and work with them that way.

For example, select(penguins, species) does not put quotation marks around species. The reasoning is that species behaves like a variable in our workspace, as if the column names were part of our R Environment. By contrast, select("penguins", "species") does not work.

We often need to write functions that refer to column names without quoting them as strings. If your function needs to do this, you must handle those arguments with extra care inside the function definition: wherever we use them, we need to embrace them within two curly brackets, an operator called “curly curly”.

Take this function that produces a quick scatterplot between two columns in a dataset as an example.

quick_scatter <- function(data, x, y) {
  ggplot(data, aes(x, y)) + #x and y are NOT embraced here
     geom_point()
}

quick_scatter(penguins, bill_length_mm, body_mass_g)
Error in `geom_point()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error:
! object 'bill_length_mm' not found

Why doesn’t this work? The reason is that R ends up looking for objects named bill_length_mm and body_mass_g in the workspace rather than in the data, and cannot find them. To fix the problem, we can change the function definition so that `x` and `y` are embraced within two curly brackets {{}} (“curly curly”):

quick_scatter <- function(data, x, y) {
  ggplot(data, aes({{ x }}, {{ y }})) + #note curly brackets here!
     geom_point()
}

quick_scatter(penguins, bill_length_mm, body_mass_g)

Note that curly-curly only works when you pass your function’s argument to another function that expects a variable name without quotation marks.
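
If the column names arrive as strings instead, curly-curly is not the right tool. One possible alternative, sketched here, is the `.data` pronoun with `[[`. The name quick_scatter_chr is hypothetical and is not part of the lecture code.

```r
library(ggplot2)
library(palmerpenguins)

# Hypothetical variant of quick_scatter() that takes column names as strings.
# .data[[x]] looks up the column whose name is stored in the string x.
quick_scatter_chr <- function(data, x, y) {
  ggplot(data, aes(.data[[x]], .data[[y]])) +
    geom_point()
}

p <- quick_scatter_chr(penguins, "bill_length_mm", "body_mass_g")
p
```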

Tip

In the `dplyr` documentation, if you spy the words “data masking” or “tidy selection”, then you will need to curly-curly your arguments when using those functions within your custom function.
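
As a small sketch of the tip above: group_by() and summarize() are documented as data-masking functions and select() as a tidy-selection function, so arguments passed on to them are embraced. The helper names group_mean and keep_cols are made up for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Hypothetical helper: mean of one column within each level of a grouping column.
# group_by() and summarize() use data masking, so both arguments are embraced.
group_mean <- function(data, group, var) {
  data %>%
    group_by({{ group }}) %>%
    summarize(mean = mean({{ var }}, na.rm = TRUE))
}

# Hypothetical helper: select() uses tidy selection, so the argument is embraced too.
keep_cols <- function(data, cols) {
  data %>% select({{ cols }})
}

means <- group_mean(penguins, species, body_mass_g)
kept <- keep_cols(penguins, c(species, island))
```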

Exercise: Curly Curly

Here’s some code that:

  • groups penguins by species, then summarizes the number of missing values in each variable.

  • groups gapminder by continent, then summarizes the number of missing values in each variable.

penguins %>% 
  group_by(species) %>% 
  summarize(across(everything(), ~ sum(is.na(.x))))
# A tibble: 3 × 8
  species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>      <int>          <int>         <int>             <int>       <int>
1 Adelie         0              1             1                 1           1
2 Chinstrap      0              0             0                 0           0
3 Gentoo         0              1             1                 1           1
# ℹ 2 more variables: sex <int>, year <int>
gapminder %>% 
  group_by(continent) %>% 
  summarize(across(everything(), ~ sum(is.na(.x))))
# A tibble: 5 × 6
  continent country  year lifeExp   pop gdpPercap
  <fct>       <int> <int>   <int> <int>     <int>
1 Africa          0     0       0     0         0
2 Americas        0     0       0     0         0
3 Asia            0     0       0     0         0
4 Europe          0     0       0     0         0
5 Oceania         0     0       0     0         0

These steps to summarize the data are quite similar! Instead of coding each step multiple times, let’s turn it into a function. By yourself or with a partner, write a function that could be used to group by a column name and summarize the number of NAs in a data set. Remember: we need to use the {{}} when referring to column names.

summarizeby_fun <- function(data, groups){
  data %>%
    group_by({{groups}}) %>%
    summarize(across(everything(), ~ sum(is.na(.x))))
}

summarizeby_fun(penguins, species)
# A tibble: 3 × 8
  species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>      <int>          <int>         <int>             <int>       <int>
1 Adelie         0              1             1                 1           1
2 Chinstrap      0              0             0                 0           0
3 Gentoo         0              1             1                 1           1
# ℹ 2 more variables: sex <int>, year <int>
summarizeby_fun(gapminder, continent)
# A tibble: 5 × 6
  continent country  year lifeExp   pop gdpPercap
  <fct>       <int> <int>   <int> <int>     <int>
1 Africa          0     0       0     0         0
2 Americas        0     0       0     0         0
3 Asia            0     0       0     0         0
4 Europe          0     0       0     0         0
5 Oceania         0     0       0     0         0

Default Parameters

Recall the dice-rolling example from the previous lecture. Sometimes you want to use a function frequently without re-writing the same parameters over and over again. Let’s make a more flexible function that allows us to change the number of faces on the dice being rolled.

#' @details
#' Simulates rolling `num_dice` number of dice with `n_sides` sides and outputs the sum. Note: no seed is used so the function will usually return a different result each time it is run
#'
#' @param num_dice integer representing number of dice to be rolled
#' @param n_sides integer representing the number of sides of each die
#' @return the sum of the dice rolled
  
roll_dice <- function(n_sides, num_dice) { 
  
    # throw an error if num_dice (the input) is not an integer
  
    if(num_dice %% 1 != 0){ #if num_dice mod 1 is NOT 0
      stop("num_dice must be an integer") #throw this error message and stop the function
    }
  
    #if the num_dice is an integer, continue with the function:
    sum(sample(1:n_sides, num_dice, replace=TRUE)) #sample num_dice numbers from one to n_sides with replacement, return sum
}

Notice that this function now has two parameters (the new one is n_sides). We now sample from 1:n_sides (instead of 1:10) to make the function more flexible. I also renamed the function to roll_dice, as we are not necessarily rolling 10-sided dice.

So to roll two 10-sided dice, I can call:

roll_dice(n_sides = 10, num_dice = 2)
[1] 13
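
Because the rolls are random, the value above will usually differ from run to run. If a repeatable result is wanted (for example, when testing), one option is to call set.seed() before the function. This is a sketch with an arbitrary seed value, not something the lecture prescribes.

```r
# Same definition as above, repeated so this block runs on its own.
roll_dice <- function(n_sides, num_dice) {
  if (num_dice %% 1 != 0) {
    stop("num_dice must be an integer")
  }
  sum(sample(1:n_sides, num_dice, replace = TRUE))
}

set.seed(2025) # arbitrary seed chosen for illustration
first <- roll_dice(n_sides = 10, num_dice = 2)

set.seed(2025) # resetting the seed replays the same random draws
second <- roll_dice(n_sides = 10, num_dice = 2)

identical(first, second)
```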

Now, perhaps I often want to roll ten-sided dice. To avoid having to type n_sides = 10 every time, I can simply set the default number of sides to 10!

#' @details
#' Simulates rolling `num_dice` number of dice with `n_sides` sides and outputs the sum. Note: no seed is used so the function will usually return a different result each time it is run
#'
#' @param num_dice integer representing number of dice to be rolled
#' @param n_sides integer representing the number of sides of each die. Default is 10.
#' @return the sum of the dice rolled
  
roll_dice <- function(n_sides = 10, num_dice) { #NEW: n_sides default is 10
  
    # throw an error if num_dice (the input) is not an integer
  
    if(num_dice %% 1 != 0){ #if num_dice mod 1 is NOT 0
      stop("num_dice must be an integer") #throw this error message and stop the function
    }
  
    #if the num_dice is an integer, continue with the function:
    sum(sample(1:n_sides, num_dice, replace=TRUE)) #sample num_dice numbers from one to n_sides with replacement, return sum
}

We have set n_sides = 10 as the default. This means the function will assume we are rolling 10-sided dice unless otherwise specified. Let’s roll 3 dice using 10-sided dice (the default):

roll_dice(num_dice = 3)
[1] 15

I didn’t need to include n_sides = 10 in my function call! But I can if I want to change it to a number other than 10. Let’s roll 3 standard 6-sided dice:

roll_dice(n_sides = 6, num_dice = 3)
[1] 10
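
One thing to watch for: because n_sides is listed first, an unnamed call such as roll_dice(3) matches 3 to n_sides and then fails because num_dice is missing. A possible alternative, sketched below under the hypothetical name roll_dice2, is to list the required argument first and the defaulted argument last.

```r
# Hypothetical reordering: required argument first, defaulted argument last.
roll_dice2 <- function(num_dice, n_sides = 10) {
  if (num_dice %% 1 != 0) {
    stop("num_dice must be an integer")
  }
  sum(sample(1:n_sides, num_dice, replace = TRUE))
}

total10 <- roll_dice2(3)    # 3 is matched to num_dice; n_sides stays at 10
total6 <- roll_dice2(3, 6)  # second unnamed argument is matched to n_sides
```

Naming the arguments in the call, as the lecture examples do, avoids the issue with either ordering.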

Exercise: Default Values

Add a new argument called columns to the summarizeby_fun() function we made previously. It should let you pass a vector of the columns you want to count missing values for (these can be written as strings). Set the default to everything().

summarizeby_fun <- function(data, groups, columns = everything()){
  data %>%
    group_by({{groups}}) %>%
    summarize(across(columns, ~ sum(is.na(.x))))
}

summarizeby_fun(penguins, species)
Warning: There was 1 warning in `summarize()`.
ℹ In argument: `across(columns, ~sum(is.na(.x)))`.
Caused by warning:
! Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
  # Was:
  data %>% select(columns)

  # Now:
  data %>% select(all_of(columns))

See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
# A tibble: 3 × 8
  species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>      <int>          <int>         <int>             <int>       <int>
1 Adelie         0              1             1                 1           1
2 Chinstrap      0              0             0                 0           0
3 Gentoo         0              1             1                 1           1
# ℹ 2 more variables: sex <int>, year <int>
summarizeby_fun(penguins, species, c("bill_length_mm", "sex"))
# A tibble: 3 × 3
  species   bill_length_mm   sex
  <fct>              <int> <int>
1 Adelie                 1     6
2 Chinstrap              0     0
3 Gentoo                 1     5
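
The deprecation warning in the first call appears because across() receives the bare name columns and treats it as an external vector. One way that should avoid the warning, sketched here, is to embrace columns just like groups. all_of() is the fix the warning suggests, but it expects a character vector, so it would not fit the everything() default.

```r
library(dplyr)
library(palmerpenguins)

# Sketch: same function as above, with columns embraced inside across().
summarizeby_fun <- function(data, groups, columns = everything()) {
  data %>%
    group_by({{ groups }}) %>%
    summarize(across({{ columns }}, ~ sum(is.na(.x))))
}

all_cols <- summarizeby_fun(penguins, species)
two_cols <- summarizeby_fun(penguins, species, c("bill_length_mm", "sex"))
```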

Ellipses (...)

The ellipsis (...) allows a function to accept a variable number of additional arguments, named or unnamed, beyond those explicitly listed in the function definition. Many built-in functions have ... in their argument list (check out c()!)

Let’s modify our function to allow grouping by any number of variables.

summarizeby_fun <- function(data, ..., columns = everything()){
  data %>%
    group_by(...) %>%
    summarize(across(columns, ~ sum(is.na(.x))))
}

summarizeby_fun(penguins, species)
# A tibble: 3 × 8
  species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>      <int>          <int>         <int>             <int>       <int>
1 Adelie         0              1             1                 1           1
2 Chinstrap      0              0             0                 0           0
3 Gentoo         0              1             1                 1           1
# ℹ 2 more variables: sex <int>, year <int>
summarizeby_fun(penguins, species, island)
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 5 × 8
# Groups:   species [3]
  species   island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <fct>              <int>         <int>             <int>       <int>
1 Adelie    Biscoe                 0             0                 0           0
2 Adelie    Dream                  0             0                 0           0
3 Adelie    Torgersen              1             1                 1           1
4 Chinstrap Dream                  0             0                 0           0
5 Gentoo    Biscoe                 1             1                 1           1
# ℹ 2 more variables: sex <int>, year <int>

Now our function is very flexible! And as a bonus, we didn’t need to use the {{}} for the grouping columns, because ... is passed straight through to group_by().
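
A detail worth keeping in mind: any argument listed after ... has to be supplied by name, otherwise R treats it as part of ... (here, as another grouping expression). The sketch below names columns explicitly. Embracing columns and setting .groups = "drop" are additions of this sketch, used to keep the output free of the warning and message seen earlier.

```r
library(dplyr)
library(palmerpenguins)

# Sketch: grouping columns travel through ..., columns is matched by name only.
summarizeby_fun <- function(data, ..., columns = everything()) {
  data %>%
    group_by(...) %>%
    summarize(across({{ columns }}, ~ sum(is.na(.x))), .groups = "drop")
}

res <- summarizeby_fun(penguins, species, island, columns = c(sex, body_mass_g))
```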

Handling NAs

Missing data is essentially inevitable. Few studies are able to collect 100% of the data they intend to.

Missing data can heavily complicate analyses and even lead to biased results when not handled properly. Missing data is a big research area in statistics! But for this class, we are going to focus on dealing with missing data using a simple example.

Let’s look at the flipper length of penguins in the penguins data set and count how many missing values there are using the is.na() function in R. is.na(flipper_len) will return a vector full of TRUE or FALSE values indicating whether or not the observation was missing. As TRUE is coded as a 1 and FALSE as a 0, we can sum over these to count how many missing values there are.

flipper_len <- penguins$flipper_length_mm #save this data as its own vector
sum(is.na(flipper_len)) # count how many NAs there are in the flipper data
[1] 2
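
To see the TRUE/FALSE counting trick on something small, here is a sketch with a made-up five-element vector (the values are illustrative, not taken from penguins).

```r
x <- c(181, NA, 195, 210, NA) # made-up values with two NAs

na_flags <- is.na(x)        # logical vector, TRUE where x is missing
n_missing <- sum(na_flags)  # TRUE counts as 1, FALSE as 0
prop_missing <- mean(na_flags) # proportion of values that are missing
na_positions <- which(na_flags) # positions of the missing values
```

The same idea is what the across() call in summarizeby_fun() applies to every column.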

We see we have two missing values. Let’s see if we can summarize the quantiles of the lengths:

quantile(flipper_len)
Error in quantile.default(flipper_len) :
missing values and NaN's not allowed if 'na.rm' is FALSE

We see here that missing values are not allowed unless we specify na.rm = TRUE. When na.rm = TRUE, we remove missing values from the data and then calculate the quantiles. This is also referred to as a complete case analysis.

quantile(flipper_len, na.rm = TRUE)
  0%  25%  50%  75% 100% 
 172  190  197  213  231 

Now suppose we wanted to make our own function that utilized the quantile() built-in function:

#' @details
#' calculates the range of data by finding the 100th and 0th quantile and finding their difference
#'
#' @param vec a vector that we want to find the range of
#' @return the difference in the maximum and minimum
  
get_range <- function(vec){
  quantiles <- quantile(vec, na.rm = TRUE) #calc quantiles, remove NA's
  return(max(quantiles) - min(quantiles)) #calculate and return the range
}

get_range(flipper_len)
[1] 59

We could also include na.rm in our function parameters to allow the user to specify whether or not it should be set to TRUE or FALSE. We can use a default, as well.

#' @details
#' calculates the range of data by finding the 100th and 0th quantile and finding their difference
#'
#' @param vec a vector that we want to find the range of
#' @param na.rm logical, whether or not to remove NAs. Default set to TRUE.
#' @return the difference in the maximum and minimum
  
get_range <- function(vec, na.rm = TRUE){
  quantiles <- quantile(vec, na.rm = na.rm) #calc quantiles, removing NA's only if na.rm is TRUE
  return(max(quantiles) - min(quantiles)) #calculate and return the range
}

get_range(flipper_len) #na.rm defaults to TRUE
[1] 59
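
As a quick sketch of what the new parameter changes, the block below calls get_range() on a small made-up vector with na.rm = FALSE, catching the resulting error with tryCatch(), and cross-checks the default result against base R’s range(). The vector and the use of tryCatch() are illustrative assumptions.

```r
get_range <- function(vec, na.rm = TRUE) {
  quantiles <- quantile(vec, na.rm = na.rm)
  return(max(quantiles) - min(quantiles))
}

x <- c(172, 190, NA, 231) # made-up values with one NA

# With na.rm = FALSE, quantile() stops with an error because x contains an NA.
strict <- tryCatch(
  get_range(x, na.rm = FALSE),
  error = function(e) "error"
)

# With the default na.rm = TRUE, the NA is dropped first.
lenient <- get_range(x)

# Cross-check against base R: difference between max and min, NAs removed.
check <- diff(range(x, na.rm = TRUE))
```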

Worksheet B1 and Assignment B1

You can now begin working on Worksheet B1 and Assignment B1 (pdf on Canvas).
