00 R, R Markdown, code, and {tidyverse}:
A whirlwind tour

Stat 406

Geoff Pleiss, Trevor Campbell

Last modified – 11 September 2024


Housekeeping

  1. Introduction
  2. My office hours will be in (room redacted) after class today only
    • I’m trying to book this room for future OHs
  3. This week’s lab is due Friday 23:00 (to help with setup issues)
    • only for this week; after that, labs are due at 23:00 on the night of the lab
  4. Reminders
    • 4 students still haven’t done Quiz 0
    • 25 students need to accept the GitHub invite
    • 8 students need to enroll in a lab section

The basics

Tour of RStudio

Things to note

  1. Console
  2. Terminal
  3. Scripts, .Rmd, Knit
  4. Files, Projects
  5. Getting help
  6. Environment, Git

R and the {tidyverse}

Today is going to be a whirlwind tour of R.

If you are new to R: read the first 4 chapters of Data Science: A First Introduction.

It’s available for free at https://datasciencebook.ca. It covers:

  • Data loading from .csv, Excel, database, and web sources
  • Data saving to .csv files
  • Data wrangling with tidyverse functions
  • Plotting with ggplot

Basic data structures

Vectors:

x <- c(1, 3, 4)
x[1]
[1] 1
x[-1]
[1] 3 4
rev(x)
[1] 4 3 1
c(x, x)
[1] 1 3 4 1 3 4

Matrices:

x <- matrix(1:25, nrow = 5, ncol = 5)
x[1,]
[1]  1  6 11 16 21
x[,-1]
     [,1] [,2] [,3] [,4]
[1,]    6   11   16   21
[2,]    7   12   17   22
[3,]    8   13   18   23
[4,]    9   14   19   24
[5,]   10   15   20   25
x[c(1,3),  2:3]
     [,1] [,2]
[1,]    6   11
[2,]    8   13

All elements of a vector/matrix must be of the same type
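For example, mixing types in c() silently coerces everything to the most general type:

c(1, "a")  # the number becomes a string
[1] "1" "a"
c(1, TRUE) # the logical becomes a number
[1] 1 1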

Basic data structures

Lists

(l <- list(
  a = letters[1:2], 
  b = 1:4, 
  c = list(a = 1)))
$a
[1] "a" "b"

$b
[1] 1 2 3 4

$c
$c$a
[1] 1
l$a
[1] "a" "b"
l$c$a
[1] 1
l["b"] # compare to l[["b"]] == l$b
$b
[1] 1 2 3 4

Data frames

(dat <- data.frame(
  z = 1:5, 
  b = 6:10, 
  c = letters[1:5]))
  z  b c
1 1  6 a
2 2  7 b
3 3  8 c
4 4  9 d
5 5 10 e
class(dat)
[1] "data.frame"
dat$b
[1]  6  7  8  9 10
dat[1,]
  z b c
1 1 6 a

Lists can hold elements of different types; data frames are just lists of equal-length vectors
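A quick check of that last claim:

is.list(dat)
[1] TRUE
dat[["c"]]  # a column is just a list element
[1] "a" "b" "c" "d" "e"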

Tibbles

These are {tidyverse} data frames

(dat2 <- tibble(z = 1:5, b = z + 5, c = letters[z]))
# A tibble: 5 × 3
      z     b c    
  <int> <dbl> <chr>
1     1     6 a    
2     2     7 b    
3     3     8 c    
4     4     9 d    
5     5    10 e    
class(dat2)
[1] "tbl_df"     "tbl"        "data.frame"

We’ll return to classes in a moment. A tbl_df is a “subclass” of data.frame.

Anything that data.frame can do, tbl_df can do (better).

For instance, the printing is more informative.

Also, you can construct one by referencing previously constructed columns.

Functions

Functions in R

A function is a mapping from inputs to outputs, and is defined with the function keyword.

The function’s body is wrapped in curly braces, and its output is given by return() (or by the last evaluated expression)

f <- function(x, y){
  x+y 
}

f(3,5)
[1] 8
f <- function(x, y){
  return(x+y)
}

f(3,5)
[1] 8

Function Signatures

sig <- sig::sig
sig(lm)
fn <- function(formula, data, subset, weights, na.action, method = "qr", model
  = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts =
  NULL, offset, ...)
sig(`+`)
fn <- function(e1, e2)
sig(dplyr::filter)
fn <- function(.data, ..., .by = NULL, .preserve = FALSE)
sig(stats::filter)
fn <- function(x, filter, method = c("convolution", "recursive"), sides = 2,
  circular = FALSE, init = NULL)
sig(rnorm)
fn <- function(n, mean = 0, sd = 1)

These are all the same

set.seed(12345)
rnorm(3)
[1]  0.5855288  0.7094660 -0.1093033
set.seed(12345)
rnorm(n = 3, mean = 0)
[1]  0.5855288  0.7094660 -0.1093033
set.seed(12345)
rnorm(3, 0, 1)
[1]  0.5855288  0.7094660 -0.1093033
set.seed(12345)
rnorm(sd = 1, n = 3, mean = 0)
[1]  0.5855288  0.7094660 -0.1093033
  • Functions can have default values.
  • You may, but don’t have to, name the arguments
  • If you name them, you can pass them out of order (but you shouldn’t).

Write lots of functions.

Outputs vs. Side Effects

f <- function(arg1, arg2, arg3 = 12, ...) {
  stuff <- arg1 * arg3
  stuff2 <- stuff + arg2
  plot(arg1, stuff2, ...)
  return(stuff2)
}
x <- rnorm(100)
y1 <- f(x, 3, 15, col = 4, pch = 19)

str(y1)
 num [1:100] -3.8 12.09 -24.27 12.45 -1.14 ...

Outputs vs. Side Effects

  • Side effects are things a function does beyond returning a value (plotting, printing, changing global state)
  • Outputs can be assigned to variables
  • A good example is the hist function
  • You have probably only seen the side effect which is to plot the histogram
my_histogram <- hist(rnorm(1000))

str(my_histogram)
List of 6
 $ breaks  : num [1:14] -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 ...
 $ counts  : int [1:13] 4 21 41 89 142 200 193 170 74 38 ...
 $ density : num [1:13] 0.008 0.042 0.082 0.178 0.284 0.4 0.386 0.34 0.148 0.076 ...
 $ mids    : num [1:13] -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 1.75 ...
 $ xname   : chr "rnorm(1000)"
 $ equidist: logi TRUE
 - attr(*, "class")= chr "histogram"
class(my_histogram)
[1] "histogram"

When writing functions, program defensively: validate your inputs to ensure predictable behaviour

incrementer <- function(x, inc_by = 1) {
  x + 1
}
  
incrementer(2)
[1] 3
incrementer(1:4)
[1] 2 3 4 5
incrementer("a")
Error in x + 1: non-numeric argument to binary operator
incrementer <- function(x, inc_by = 1) {
  stopifnot(is.numeric(x))
  return(x + 1)
}
incrementer("a")
Error in incrementer("a"): is.numeric(x) is not TRUE
incrementer <- function(x, inc_by = 1) {
  if (!is.numeric(x)) {
    stop("`x` must be numeric")
  }
  x + 1
}
incrementer("a")
Error in incrementer("a"): `x` must be numeric
incrementer(2, -3) ## oops!
[1] 3
incrementer <- function(x, inc_by = 1) {
  if (!is.numeric(x)) {
    stop("`x` must be numeric")
  }
  x + inc_by
}
incrementer(2, -3)
[1] -1

Unit Testing

When you write functions, test them!

Use testthat: check a few usual values and corner cases

library(testthat)
incrementer <- function(x, inc_by = 1) {
  if (!is.numeric(x)) {
    stop("`x` must be numeric")
  }
  if (!is.numeric(inc_by)) {
    stop("`inc_by` must be numeric")
  }
  x + inc_by
}
expect_error(incrementer("a"))
expect_equal(incrementer(1:3), 2:4)
expect_equal(incrementer(2, -3), -1)
expect_error(incrementer(1, "b"))
expect_identical(incrementer(1:3), 2:4)
Error: incrementer(1:3) not identical to 2:4.
Objects equal but not identical
is.integer(2:4)
[1] TRUE
is.integer(incrementer(1:3))
[1] FALSE
expect_identical(incrementer(1:3, 1L), 2:4)

Important

Don’t copy code; write a function. Validate your arguments. Write tests to check if inputs result in predicted outputs.

Classes and methods

Classes

We saw some of these earlier:

tib <- tibble(
  x1 = rnorm(100), 
  x2 = rnorm(100), 
  y = x1 + 2 * x2 + rnorm(100)
)
mdl <- lm(y ~ ., data = tib )
class(tib)
[1] "tbl_df"     "tbl"        "data.frame"
class(mdl)
[1] "lm"

The class allows for the use of “methods”

print(mdl)

Call:
lm(formula = y ~ ., data = tib)

Coefficients:
(Intercept)           x1           x2  
    -0.1742       1.0454       2.0470  
  • R “knows what to do” when you print() an object of class "lm".

  • print() is called a “generic” function.

  • You can create “methods” that get dispatched.

  • For any generic, R looks for a method for the class.

  • If available, it calls that function (a minimal sketch follows this list).
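A minimal sketch (the class name here is made up): define a function named print.<class> and R will dispatch to it.

print.course_grade <- function(x, ...) {
  cat("Grade:", x$letter, "-", x$percent, "%\n")
  invisible(x)
}
g <- structure(list(letter = "A", percent = 91), class = "course_grade")
print(g)  # R finds and calls print.course_grade()
Grade: A - 91 %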

Viewing the dispatch chain

sloop::s3_dispatch(print(incrementer))
=> print.function
 * print.default
sloop::s3_dispatch(print(tib))
   print.tbl_df
=> print.tbl
 * print.data.frame
 * print.default
sloop::s3_dispatch(print(mdl))
=> print.lm
 * print.default

Generic Methods

There are lots of generic functions in R

Common ones are print(), summary(), and plot().

Also, lots of important statistical modelling functions: residuals() and coef()

(In Python, these work the opposite way: obj.residuals. The dot after the object accesses methods defined for that type of object. But the dispatch behaviour is less robust.)

  • The convention is that the specialized function is named method.class(), e.g., summary.lm().

  • If no specialized function is defined, R will try to use method.default().

For this reason, R programmers try to avoid . in names of functions or objects.

Wherefore methods?

  • The advantage is that you don’t have to learn a totally new syntax to grab residuals or plot things

  • You just use residuals(mdl) whether mdl has class lm or any other class you expect to have residuals

  • The one drawback is that the help pages for generic methods tend to be pretty vague

  • Compare ?summary with ?summary.lm.

Environments

Different environments

(known as scope in other languages)

  • These are often tricky, but are very common.

  • Most programming languages have this concept in one way or another.

  • In R, code run in the Console produces objects in the “Global environment”

  • You can see what you create in the “Environment” tab.

  • But there’s lots of other stuff.

  • Many packages are automatically loaded at startup, so you have access to the functions and data inside

For example, mean(), lm(), plot(), and iris (technically iris is lazy-loaded, meaning it’s not in memory until you call it, but it is available)

  • Other packages require you to load them with library(pkg) before their functions are available.

  • But you can still call their functions by prefixing the package name, e.g. ggplot2::ggplot().

  • You can also access functions that the package developer didn’t “export”, using :::, like dplyr:::as_across_fn_call() (see the sketch below)
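A quick sketch of the difference, assuming {ggplot2} and {dplyr} are installed:

search()                    # the chain of attached environments R searches for names
ggplot2::ggplot             # exported: accessible with :: even without library(ggplot2)
dplyr:::as_across_fn_call   # unexported: needs ::: (use sparingly)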

Other issues with environments

As one might expect, each function call creates its own environment inside the function.

z <- 1
fun <- function(x) {
  z <- x
  print(z)
  invisible(z)
}
fun(14)
[1] 14

Non-trivial cases are data-masking environments.

tib <- tibble(x1 = rnorm(100),  x2 = rnorm(100),  y = x1 + 2 * x2)
mdl <- lm(y ~ x2, data = tib)
x2
Error in eval(expr, envir, enclos): object 'x2' not found
  • lm() looks “inside” the tib to find y and x2
  • The data variables are added to the lm() environment

Other issues with environments

When Knit, .Rmd files run in their OWN environment.

They are run from top to bottom, with later code chunks depending on previous ones

This makes them reproducible.

Objects in your local environment are not available in the .Rmd

Objects in the .Rmd are not available locally.

Tip

The most frequent error I see is:

  • running chunks individually, 1-by-1, and it works
  • Knitting, and it fails

The reason is almost always that the chunks refer to objects in the Environment that don’t exist in the .Rmd

This error also happens because:

  • library() calls were made globally but not in the .Rmd
    • so the packages aren’t loaded
  • paths to data or other objects are not relative to the .Rmd in your file system
    • they must be
  • Carefully keeping Labs / Assignments in their current location will help to avoid some of these (a minimal setup-chunk sketch follows).
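A minimal sketch of the first chunk in a .Rmd that avoids both problems (the file name here is hypothetical):

# load every package the later chunks will need
library(tidyverse)
# use paths relative to the .Rmd's folder, not your console's working directory
dat <- read_csv("data/my-data.csv")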

Tip

Knit frequently throughout your homework / lab so that you encounter environment errors early and often!

Debugging

How to fix code

  • If you’re using a function in a package, start with ?function to see the help
    • Make sure you’re calling the function correctly.
    • Try running the examples.
    • paste the error into Google (if you share the error on Slack, I often do this first)
    • Go to the package website if it exists, and browse around
  • If your .Rmd won’t Knit
    • Did you make the mistake on the last slide?
    • Did it Knit before? Then the bug is in whatever you added.
    • Did you never Knit it? Why not?
    • Call rstudioapi::restartSession(), then run the Chunks 1-by-1

Adding browser()

(known as a breakpoint in any other language)

  • Only useful with your own functions.
  • Open the script with the function, and add browser() to the code somewhere
  • Then call your function.
  • The execution will stop where you added browser(), and you’ll have access to the local environment to play around (see the sketch below)
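For example, dropping browser() into the function from the side-effects slide (a sketch):

f <- function(arg1, arg2, arg3 = 12, ...) {
  stuff <- arg1 * arg3
  browser()           # execution pauses here; inspect stuff, arg1, ... in the console
  stuff2 <- stuff + arg2
  stuff2
}
f(rnorm(100), 3)      # drops into the browser; type c to continue, Q to quit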

Reproducible examples

Question I frequently get:

“I ran this code, but it didn’t work.”

  • If you want to ask me why the code doesn’t work, you need to show me what’s wrong.

Don’t just paste a screenshot!

Unless you get lucky, I won’t be able to figure it out from that. And we’ll both get frustrated.

What you need is a Reproducible Example or reprex.

  • This is a small chunk of code that
    1. runs in its own environment
    2. and produces the error (a sketch follows).
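For example, a reprex-ready script might look like this (the data and the bug here are made up):

library(dplyr)                                # load everything the code needs
dat <- tibble(x = 1:3, y = c("a", "b", "c"))  # create (or load) the data inside the script
dat |> summarise(total = sum(y))              # the one line that produces the error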

Reprex example 1

Reprex example 2

The {reprex} package

  1. Open a new .R script.

  2. Paste your buggy code in the file (no need to save)

  3. Edit your code to make sure it’s “enough to produce the error” and nothing more. (By rerunning the code a few times.)

  4. Copy your code (so that it’s on the clipboard)

  5. Call reprex::reprex(venue = "r") from the console. This will run your code in a new environment and show the result in the Viewer tab. Does it create the error you expect?

  6. If it creates other errors, that may be the problem. You may fix the bug on your own!

  7. If it doesn’t have errors, then your global environment is Farblunget.

  8. The Output is now on your clipboard. Go to Slack and paste it in a message. Then press Cmd+Shift+Enter (on Mac) or Ctrl+Shift+Enter (Windows/Linux). Under Type, select R.

  9. Send the message, perhaps with more description and an SOS emoji.

Note

Because reprex runs in its own environment, it doesn’t have access to any of the libraries you loaded or the stuff in your global environment. You’ll have to load these things in the script.

R Pitfalls

  • R is very permissive, and this leads to frequent silent errors (a few are illustrated after this list)
    • nonstandard evaluation of arguments, data masking
    • allows dots in names (even though they mean something syntactically!)
    • allows accessing attributes that don’t exist
    • promotion of ints to floats, floats to strings 😱
  • Lots of unusual design decisions
    • many assignment operators (->, <-, ->>, <<-, =)
    • many accessors (a$b is a[["b"]] but not a["b"])
    • lacking basic data types (e.g., hash maps)
    • informal classes (class(x) <- "a weird new class!")
    • tonnes of functions/data/objects in the global namespace
    • 3 == "3" (evaluates to TRUE?!!?!)
  • Rscript executable treats code differently than the R REPL
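A few of those pitfalls in action:

3 == "3"     # the number is silently promoted to a string before comparing
[1] TRUE
l <- list(a = 1)
l$b          # accessing a component that doesn't exist: no error, just NULL
NULL
l["a"]       # a length-1 list...
$a
[1] 1

l[["a"]]     # ...versus the element itself
[1] 1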

Understanding {tidyverse}

{tidyverse} is huge

The {tidyverse} is a collection of ~30 packages, but we’re going to just talk about a few.

Load the core ones by calling library(tidyverse)

Packages fall roughly into a few categories:

  1. Convenience functions: {magrittr} and many many others.
  2. Data processing: {dplyr} and many others.
  3. Graphing: {ggplot2} and some others like {scales}.
  4. Utilities

We’re going to talk quickly about some of it, but ignore much of 2.

There’s a lot that’s great about these packages, especially ease of data processing.

But it doesn’t always jibe with base R (it’s almost a separate programming language at this point).

When in doubt…

Read the first 4 chapters (especially 3 and 4!)

https://datasciencebook.ca

Piping with {magrittr}

This was introduced by {magrittr} as %>%,

but is now in base R (>=4.1.0) as |>.

Note: there are other pipes in {magrittr} (e.g. %$% and %T>%) but I’ve never used them.

The point of the pipe is to logically sequence nested operations

The pipe passes the left hand side as the first argument of the right hand side
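In other words, x |> f(y) is the same as f(x, y):

sqrt(sum(1:10))
[1] 7.416198
1:10 |> sum() |> sqrt()
[1] 7.416198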

Example

select(filter(mtcars, cyl == 6), mpg)
                mpg
Mazda RX4      21.0
Mazda RX4 Wag  21.0
Hornet 4 Drive 21.4
Valiant        18.1
Merc 280       19.2
Merc 280C      17.8
Ferrari Dino   19.7
mse1 <- print(
  sum(
    residuals(
      lm(y~., data = mutate(
        tib, 
        x3 = x1^2,
        x4 = log(x2 + abs(min(x2)) + 1)
      )
      )
    )^2
  )
)
[1] 9.888005e-30
mtcars |> filter(cyl == 6) |> select(mpg)
                mpg
Mazda RX4      21.0
Mazda RX4 Wag  21.0
Hornet 4 Drive 21.4
Valiant        18.1
Merc 280       19.2
Merc 280C      17.8
Ferrari Dino   19.7
mse2 <- tib |>
  mutate(
    x3 = x1^2, 
    x4 = log(x2 + abs(min(x2)) + 1)
  ) %>% # base pipe only goes to first arg
  lm(y ~ ., data = .) |> # note the use of `.`
  residuals() |>
  magrittr::raise_to_power(2) |> # same as `^`(2)
  sum() |>
  print()
[1] 9.888005e-30

It may seem like we should push this all the way

tib |>
  mutate(
    x3 = x1^2, 
    x4 = log(x2 + abs(min(x2)) + 1)
  ) %>% # base pipe only goes to first arg
  lm(y ~ ., data = .) |> # note the use of `.`
  residuals() |>
  magrittr::raise_to_power(2) |> # same as `^`(2)
  sum() ->
  mse3

This technically works…but at a minimum it makes it hard to extend pipe sequences.

Note

Opinion zone: It’s also just weird. Don’t encourage the R devs.

Data processing in {dplyr}

This package has all sorts of things, and it generally works on {tibble}s.

The basic idea is “tibble in, tibble out”.

It uses data masking (and tidy selection), which means you can refer to columns by bare name, or pick them with helpers like ends_with("_rate")
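For example, with a made-up toy tibble:

toy <- tibble(case_rate = c(10, 12), death_rate = c(0.1, 0.2), geo = c("ca", "wa"))
toy |> select(ends_with("_rate"))
# A tibble: 2 × 2
  case_rate death_rate
      <dbl>      <dbl>
1        10        0.1
2        12        0.2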

Majorly useful operations:

  1. select() (chooses columns to keep)
  2. mutate() (showed this already)
  3. group_by()
  4. pivot_longer() and pivot_wider() (these two are technically from {tidyr})
  5. left_join() and full_join()
  6. summarise()

Note

filter() and select() clash with functions elsewhere (e.g., stats::filter() in base R).

Sometimes you get 🐞 because it called the wrong version.

To be sure, prefix it like dplyr::select().

A useful data frame

7-day rolling avg COVID case/death counts for CA and WA from Aug 1-21, 2022 from Johns Hopkins

library(tidyverse)
covid <- read_csv("data/covid.csv") |>
  select(geo_value, time_value, signal, value)

covid
# A tibble: 84 × 4
   geo_value time_value signal                        value
   <chr>     <date>     <chr>                         <dbl>
 1 ca        2022-08-01 confirmed_7dav_incidence_prop  45.4
 2 wa        2022-08-01 confirmed_7dav_incidence_prop  27.7
 3 ca        2022-08-02 confirmed_7dav_incidence_prop  44.9
 4 wa        2022-08-02 confirmed_7dav_incidence_prop  27.7
 5 ca        2022-08-03 confirmed_7dav_incidence_prop  44.5
 6 wa        2022-08-03 confirmed_7dav_incidence_prop  26.6
 7 ca        2022-08-04 confirmed_7dav_incidence_prop  42.3
 8 wa        2022-08-04 confirmed_7dav_incidence_prop  26.6
 9 ca        2022-08-05 confirmed_7dav_incidence_prop  40.7
10 wa        2022-08-05 confirmed_7dav_incidence_prop  34.6
# ℹ 74 more rows

Examples

Rename the signal to something short.

covid <- covid |> 
  mutate(signal = case_when(
    str_starts(signal, "confirmed") ~ "case_rate", 
    TRUE ~ "death_rate"
  ))

Sort by time_value then geo_value

covid <- covid |> arrange(time_value, geo_value)

Calculate grouped medians

covid |> 
  group_by(geo_value, signal) |>
  summarise(med = median(value), .groups = "drop")
# A tibble: 4 × 3
  geo_value signal        med
  <chr>     <chr>       <dbl>
1 ca        case_rate  33.2  
2 ca        death_rate  0.112
3 wa        case_rate  23.2  
4 wa        death_rate  0.178

Examples

Split the data into two tibbles by signal

cases <- covid |> 
  filter(signal == "case_rate") |>
  rename(case_rate = value) |> select(-signal)
deaths <- covid |> 
  filter(signal == "death_rate") |>
  rename(death_rate = value) |> select(-signal)

Join them together

joined <- full_join(cases, deaths, by = c("geo_value", "time_value"))

Do the same thing by pivoting

covid |> pivot_wider(names_from = signal, values_from = value)
# A tibble: 42 × 4
   geo_value time_value case_rate death_rate
   <chr>     <date>         <dbl>      <dbl>
 1 ca        2022-08-01      45.4      0.105
 2 wa        2022-08-01      27.7      0.169
 3 ca        2022-08-02      44.9      0.106
 4 wa        2022-08-02      27.7      0.169
 5 ca        2022-08-03      44.5      0.107
 6 wa        2022-08-03      26.6      0.173
 7 ca        2022-08-04      42.3      0.112
 8 wa        2022-08-04      26.6      0.173
 9 ca        2022-08-05      40.7      0.116
10 wa        2022-08-05      34.6      0.225
# ℹ 32 more rows
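Going the other way, pivot_longer() turns the wide table back into the long one (a sketch using the joined tibble from above):

joined |>
  pivot_longer(ends_with("_rate"), names_to = "signal", values_to = "value")
# back to 84 rows: one per geo_value / time_value / signal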

Plotting with {ggplot2}

  • Everything you can do with ggplot(), you can do with plot(). But the defaults are much prettier.

  • It’s also much easier to adjust aesthetics and split panels by factors.

  • It also uses “data masking”: data goes into ggplot(data = mydata), then the columns are available to the rest.

  • It (sort of) pipes, but by adding layers with +

  • It strongly prefers “long” data frames over “wide” data frames.


I’ll give a very fast overview of some confusing bits.

I suggest exploring

🔗 This slide deck

for more help

ggplot(
  data = covid |> 
    filter(signal == "case_rate")
) +
  geom_point(
    mapping = aes(
      x = time_value,
      y = value
    )
  ) + 
  geom_smooth( 
    mapping = aes( 
      x = time_value, 
      y = value 
    ) 
  ) 

ggplot(
  data = covid |> filter(signal == "case_rate")
) +
  geom_point(
    mapping = aes(
      x = time_value,
      y = value,
      colour = geo_value
    )
  ) + 
  geom_smooth( 
    mapping = aes( 
      x = time_value, 
      y = value,
      colour = geo_value
    ),
    se = FALSE,
    method = "lm"
  ) 

ggplot(
  data = covid |> filter(signal == "case_rate"),
  mapping = aes(
    x = time_value,
    y = value,
    colour = geo_value
  )
) +
  geom_point() + 
  geom_smooth(se = FALSE, method = "lm") 

ggplot(
  covid |> filter(signal == "case_rate"),
  aes(time_value, value, colour = geo_value)
) +
  geom_point() + 
  geom_smooth(se = FALSE, method = "lm") 

ggplot(
  covid, 
  aes(time_value, value, colour = geo_value)
) +
  geom_point() + 
  geom_smooth(se = FALSE, method = "lm") +
  facet_grid(signal ~ geo_value) +
  scale_colour_manual(
    name = NULL, 
    values = c(blue, orange)) +
  theme(legend.position = "bottom")

ggplot(
  covid, 
  aes(time_value, value, colour = geo_value)
) +
  geom_point() + 
  geom_smooth(se = FALSE, method = "lm") +
  facet_grid(signal ~ geo_value, scales = "free_y") +
  scale_colour_manual(
    name = NULL, 
    values = c(blue, orange)) +
  theme(legend.position = "bottom")