Lecture 8B: Advanced Functions

STAT 545 - Fall 2025

We will learn about several advanced topics:

  • Data-masking and the curly-curly {{}}

  • Default values

  • Ellipses ...

  • Handling NAs


Set-up

We will be using the following packages throughout this lecture:
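
The package list itself is not reproduced in these notes. Judging from the functions used later (select(), group_by(), quick scatterplots, the gapminder data), a plausible set-up might look like the sketch below. The exact packages are an assumption, not the lecture's confirmed list.

```r
# Assumed set-up: the original package list is not shown in these notes
library(dplyr)    # select(), group_by(), summarise(), across()
library(ggplot2)  # scatterplots

# library(gapminder)  # assumed source of the gapminder data used in the exercise
# Note: a `penguins` data set with columns such as bill_len and body_mass
# ships with base R from version 4.5.0 onwards.
```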

Data-masking and the Curly-Curly {{}}

Sometimes your function needs to accept column names without quotation marks (bare names) and work with them in that form.

For example,

  • select(penguins, species) works

  • select("penguins", "species") does not work.
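
To make the contrast concrete, here is a small sketch. It uses a hypothetical stand-in data frame, toy_penguins, with made-up values, so that it runs without the lecture's penguins object.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data (made-up values)
toy_penguins <- data.frame(
  species  = c("Adelie", "Gentoo", "Chinstrap"),
  bill_len = c(39.1, 46.1, 48.7)
)

# Bare names: select() looks up `species` inside the data frame
select(toy_penguins, species)

# Quoting the data frame's name passes a character string instead of a
# data frame, so select() throws an error; we catch it to show the outcome
result <- tryCatch(
  select("toy_penguins", "species"),
  error = function(e) "error"
)
result
```

Note that it is the quoted data frame that breaks this call; select() itself will also accept a quoted column name. Data-masking verbs such as filter() and mutate() are stricter and treat a quoted name as a plain string.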

We often need to write functions that refer to column names as bare names rather than as strings.

If your function needs to do this, embrace the relevant arguments in two pairs of curly brackets, {{ }}, an operator called “curly-curly”.

Take this function that produces a quick scatterplot between two columns in a dataset as an example.
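
The function itself is not reproduced in these notes. A first attempt might look like the sketch below, where quick_scatter() is a hypothetical name, ggplot2 is an assumed plotting package, and toy_penguins is a made-up stand-in for the penguins data.

```r
library(ggplot2)

# Hypothetical stand-in for the penguins data (made-up values)
toy_penguins <- data.frame(
  bill_len  = c(39.1, 46.1, 48.7, 50.0),
  body_mass = c(3750, 5000, 3800, 5700)
)

# First attempt: x and y are passed straight into aes()
quick_scatter <- function(data, x, y) {
  ggplot(data, aes(x, y)) +
    geom_point()
}

p <- quick_scatter(toy_penguins, bill_len, body_mass)

# The plot is only evaluated when it is built or printed,
# so that is where the failure shows up
outcome <- tryCatch(
  {
    ggplot_build(p)
    "built"
  },
  error = function(e) "error"
)
outcome
```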

Why doesn’t this work?


The reason is that R is looking for variables named bill_len and body_mass in the workspace rather than in the data frame, and cannot find them. To fix the problem, we can change the function definition so that `x` and `y` are embraced within two curly brackets {{}} (“curly-curly”):
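
A sketch of the repaired function, under the same assumptions as before (quick_scatter() is a hypothetical name, ggplot2 is assumed, and toy_penguins is a made-up stand-in for penguins):

```r
library(ggplot2)

# Hypothetical stand-in for the penguins data (made-up values)
toy_penguins <- data.frame(
  bill_len  = c(39.1, 46.1, 48.7, 50.0),
  body_mass = c(3750, 5000, 3800, 5700)
)

# Embracing x and y tells aes() to look the names up as columns of `data`
quick_scatter <- function(data, x, y) {
  ggplot(data, aes({{ x }}, {{ y }})) +
    geom_point()
}

p <- quick_scatter(toy_penguins, bill_len, body_mass)

# layer_data() builds the plot and returns the data ggplot2 will draw;
# print(p) would draw the scatterplot itself
plotted <- layer_data(p)
```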

Note that curly-curly only works when you pass your function’s argument on to another function that expects a variable name without quotation marks.

Curly-Curly {{}} Exercise

Here’s some code that:

  • groups penguins by species, then summarizes the number of missing values in each variable.

  • groups gapminder by continent, then summarizes the number of missing values in each variable.
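
The code chunk is missing from these notes. One way to write both summaries is sketched below, assuming dplyr and using small made-up stand-ins (toy_penguins, toy_gapminder) rather than the real data sets.

```r
library(dplyr)

# Hypothetical stand-ins with made-up values and a few NAs
toy_penguins <- data.frame(
  species     = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  bill_len    = c(39.1, NA, 46.1, 50.0),
  flipper_len = c(181, NA, NA, 230)
)
toy_gapminder <- data.frame(
  continent = c("Asia", "Asia", "Europe"),
  lifeExp   = c(28.8, NA, 55.2),
  pop       = c(8425333, 9240934, 1282697)
)

# Count the NAs in every non-grouping column, per species
penguin_nas <- toy_penguins |>
  group_by(species) |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Same pattern, different data set and grouping column
gapminder_nas <- toy_gapminder |>
  group_by(continent) |>
  summarise(across(everything(), ~ sum(is.na(.x))))
```

The two pipelines differ only in the data set and the grouping column, which is what makes them a good candidate for a function.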


Write a function that groups a data set by a given column and counts the number of NAs in each of the other variables. *Remember: we need to use the {{}} when referring to column names*.
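
One possible solution is sketched below. The name summarizeby_fun() is the one used later in the lecture; the argument names data and group_col, and the stand-in data frame, are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data (made-up values)
toy_penguins <- data.frame(
  species     = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  bill_len    = c(39.1, NA, 46.1, 50.0),
  flipper_len = c(181, NA, NA, 230)
)

# group_col is embraced so that group_by() sees it as a column name
summarizeby_fun <- function(data, group_col) {
  data |>
    group_by({{ group_col }}) |>
    summarise(across(everything(), ~ sum(is.na(.x))))
}

na_counts <- summarizeby_fun(toy_penguins, species)
na_counts
```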

Multiple Arguments

Recall the dice-rolling example from the previous lecture.

Sometimes you want to use a function frequently without re-writing the same parameters over and over again. Let’s make a more flexible function that allows us to change the number of faces on the dice being rolled.

So to roll two 10-sided dice, I can call:
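
The function and the call are not reproduced in these notes. A minimal sketch follows, using the argument names n_sides and num_dice from this lecture; the function body is an assumption about how the dice are rolled.

```r
# Assumed implementation: draw num_dice values from 1..n_sides with replacement
roll_dice <- function(n_sides, num_dice) {
  sample(seq_len(n_sides), size = num_dice, replace = TRUE)
}

set.seed(545)  # arbitrary seed, only so the example is reproducible

# Roll two 10-sided dice
rolls <- roll_dice(10, 2)
rolls
```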

NOTE: with multi-argument functions, you can either supply the arguments in the order in which they appear in the function definition, or name them explicitly, in which case the order no longer matters. The following are equivalent:

  • roll_dice(10, 2)

  • roll_dice(n_sides = 10, num_dice = 2)

  • roll_dice(num_dice = 2, n_sides = 10)
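
We can check the equivalence directly by resetting the random seed before each call. The roll_dice() body here is an assumed implementation; only the argument names come from the lecture.

```r
# Assumed implementation of the two-argument dice roller
roll_dice <- function(n_sides, num_dice) {
  sample(seq_len(n_sides), size = num_dice, replace = TRUE)
}

# Same seed before each call, so any difference would come from the arguments
set.seed(1)
by_position <- roll_dice(10, 2)

set.seed(1)
by_name <- roll_dice(n_sides = 10, num_dice = 2)

set.seed(1)
by_name_swapped <- roll_dice(num_dice = 2, n_sides = 10)

identical(by_position, by_name)      # TRUE
identical(by_name, by_name_swapped)  # TRUE
```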

Default Parameters

Now, perhaps I often want to roll ten-sided dice. To avoid having to type n_sides = 10 every time, I can simply set the default number of sides to 10!

We have set n_sides = 10 as the default. This means the function will assume we are rolling 10-sided dice unless otherwise specified. Let’s roll 3 dice using 10-sided dice (the default):


I didn’t need to include n_sides = 10 in my function call! But I can supply n_sides if I want a number other than 10. Let’s roll 3 standard 6-sided dice:
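
Both calls can be sketched as follows. The function body and the argument order (n_sides first, as in the earlier calls) are assumptions. With that order, num_dice has to be named, because a bare roll_dice(3) would match the 3 to n_sides.

```r
# Assumed implementation, now with a default of 10 sides
roll_dice <- function(n_sides = 10, num_dice) {
  sample(seq_len(n_sides), size = num_dice, replace = TRUE)
}

set.seed(545)  # arbitrary seed, only so the example is reproducible

# Three 10-sided dice: n_sides falls back to its default
default_rolls <- roll_dice(num_dice = 3)

# Three standard 6-sided dice: override the default
standard_rolls <- roll_dice(n_sides = 6, num_dice = 3)
```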

Exercise: Default Values

Add a new argument called columns to the summarizeby_fun() we made previously. It should let you supply a vector of the columns whose missing values you want to count (these can be written as strings). Set the default to everything().

Ellipses (...)

The ellipsis (...) allows a function to accept a variable number of additional arguments, named or unnamed, beyond those explicitly listed in its definition. Many built-in functions have ... in their argument lists:
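
The examples shown in lecture are not reproduced here. As an illustration, we can inspect the formal arguments of a few base R functions (paste(), cat() and data.frame() are our own picks):

```r
# List the argument names of a base R function; "..." appears among them
names(formals(paste))

# Check several functions at once for a `...` argument
has_dots <- vapply(
  list(paste = paste, cat = cat, data.frame = data.frame),
  function(f) "..." %in% names(formals(f)),
  logical(1)
)
has_dots
```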

Let’s modify our function to allow grouping by any number of variables.
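
A sketch of what the modified function might look like, assuming dplyr. The argument name data and the stand-in data frame are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data (made-up values)
toy_penguins <- data.frame(
  species  = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  island   = c("Torgersen", "Dream", "Biscoe", "Biscoe"),
  bill_len = c(39.1, NA, 46.1, NA)
)

# Everything after `data` is handed on to group_by() untouched
summarizeby_fun <- function(data, ...) {
  data |>
    group_by(...) |>
    summarise(across(everything(), ~ sum(is.na(.x))), .groups = "drop")
}

by_one <- summarizeby_fun(toy_penguins, species)
by_two <- summarizeby_fun(toy_penguins, species, island)
```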

Now our function is very flexible! And as a bonus, we didn’t need to use the {{}}, because ... is passed straight through to the function that expects the column names.

Handling NAs

Missing data is inevitable. Few studies are able to collect 100% of the data they intend to.

Missing data is a big research area in statistics! But for this class, we are going to focus on dealing with missing data using a simple example.

Let’s look at the flipper length of penguins in the penguins data set and count how many missing values there are using the is.na() function in R.

  • is.na(flipper_len) will return a vector full of TRUE or FALSE values indicating whether or not the observation was missing.

  • As TRUE is coded as a 1 and FALSE as a 0, we can sum over these to count how many missing values there are.
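
Here is the idea on a small made-up vector standing in for the flipper lengths (in lecture the same calls are applied to the flipper_len column of penguins):

```r
# Made-up flipper lengths with two missing values
flipper_len <- c(181, 186, NA, 193, 190, NA, 195)

# Logical vector: TRUE where the observation is missing
missing <- is.na(flipper_len)
missing

# TRUE counts as 1 and FALSE as 0, so the sum is the number of NAs
sum(missing)  # 2
```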

We see we have two missing values. Let’s see if we can summarize the quantiles of the lengths:
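
The output is not reproduced in these notes. The behaviour can be sketched on a made-up vector with two missing values, standing in for the flipper lengths:

```r
# Made-up flipper lengths with two missing values
flipper_len <- c(181, 186, NA, 193, 190, NA, 195)

# With the default na.rm = FALSE, quantile() stops with an error;
# we catch it here and keep the message
with_na <- tryCatch(
  quantile(flipper_len),
  error = function(e) conditionMessage(e)
)
with_na

# Dropping the missing values first gives the quantiles of the observed data
observed_quantiles <- quantile(flipper_len, na.rm = TRUE)
observed_quantiles
```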

We see here that missing values are not allowed unless we specify na.rm = TRUE. When na.rm = TRUE, we remove missing values from the data and then calculate the quantiles. This is also referred to as a complete case analysis.

Now suppose we wanted to make our own function that utilized the quantile() built-in function:
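
The lecture's function is not reproduced here. As a stand-in, suppose it computes the distance between the first and third quartiles; the name quartile_range() and its body are hypothetical.

```r
# Hypothetical helper: distance between the 25% and 75% quantiles
quartile_range <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75))
  unname(q[2] - q[1])
}

complete_values <- c(181, 186, 190, 193, 195)
range_complete <- quartile_range(complete_values)

# With missing values, the error from quantile() propagates out of our function
values_with_na <- c(181, 186, NA, 193, 190, NA, 195)
range_with_na <- tryCatch(
  quartile_range(values_with_na),
  error = function(e) "error"
)
```

Because our function never mentions na.rm, the caller has no way to ask for the missing values to be dropped.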

We could also include na.rm in our function parameters to let the user specify whether missing values should be removed. We can give it a default, as well.
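
A sketch of this pattern, using a hypothetical helper quartile_range(). Defaulting na.rm to FALSE mirrors quantile() itself and is a design choice made for this sketch; the lecture's own default may differ.

```r
# Hypothetical helper: na.rm is accepted and passed on to quantile()
quartile_range <- function(x, na.rm = FALSE) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)
  unname(q[2] - q[1])
}

# Made-up values with two NAs
values_with_na <- c(181, 186, NA, 193, 190, NA, 195)

# The caller opts in to dropping the missing values
range_removed <- quartile_range(values_with_na, na.rm = TRUE)

# With the default, missing values still trigger the error from quantile()
range_default <- tryCatch(
  quartile_range(values_with_na),
  error = function(e) "error"
)
```

Setting the default to TRUE instead would make the function more convenient, at the cost of silently dropping missing values.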

Worksheet B1

Get some practice with advanced functions on Worksheet B1, or begin on Assignment B1.