Unit tests and avoiding 🪲🪲

Stat 550

Daniel J. McDonald

Last modified – 03 April 2024

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\mid} \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \renewcommand{\hat}{\widehat} \]

I urge you to consult:

Carnegie Mellon’s 36-750 Notes

Thank you Alex and Chris for the heavy lifting.

Bugs happen. All. The. Time.

It is easy to write lots of code.

But are we sure it’s doing the right things?

Important

Effective testing tries to help.

A Common (Interactive) Workflow

  1. Write a function.
  2. Try some reasonable values at the REPL to check that it works.
  3. If there are problems, maybe insert some print statements, and modify the function.
  4. Repeat until things seem fine.

(REPL == Read-Eval-Print-Loop, the console, or Jupyter NB)

  • This tends to result in lots of bugs.

  • Later on, you forget which values you tried, whether they failed, how you fixed them.

  • So you make a change and maybe or maybe not try some again.

Step 1 — write functions

Write functions.

Lots of them.

👍 Functions are testable

👎 Scripts are not

It’s easy to alter the arguments and see “what happens”

There’s less ability to screw up environments.

I’m going to mainly describe R, but the logic is very similar (if not the syntax) for python, C++, and Julia

Understanding signatures

sig(lm)
fn <- function(formula, data, subset, weights, na.action, method = "qr", model
  = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts =
  NULL, offset, ...)
sig(`+`)
fn <- function(e1, e2)
sig(dplyr::filter)
fn <- function(.data, ..., .by = NULL, .preserve = FALSE)
sig(stats::filter)
fn <- function(x, filter, method = c("convolution", "recursive"), sides = 2,
  circular = FALSE, init = NULL)
sig(rnorm)
fn <- function(n, mean = 0, sd = 1)

These are all the same

set.seed(12345)
rnorm(3)
[1]  0.5855288  0.7094660 -0.1093033
set.seed(12345)
rnorm(n = 3, mean = 0)
[1]  0.5855288  0.7094660 -0.1093033
set.seed(12345)
rnorm(3, 0, 1)
[1]  0.5855288  0.7094660 -0.1093033
set.seed(12345)
rnorm(sd = 1, n = 3, mean = 0)
[1]  0.5855288  0.7094660 -0.1093033
  • Functions can have default values.
  • You may, but don’t have to, name the arguments
  • If you name them, you can pass them out of order (but you shouldn’t).

Outputs vs. Side effects

  • Side effects are things a function does, outputs can be assigned to variables
  • A good example is the hist function
  • You have probably only seen the side effect which is to plot the histogram
my_histogram <- hist(rnorm(1000))

str(my_histogram)
List of 6
 $ breaks  : num [1:14] -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 ...
 $ counts  : int [1:13] 4 21 41 83 138 191 191 182 74 43 ...
 $ density : num [1:13] 0.008 0.042 0.082 0.166 0.276 0.382 0.382 0.364 0.148 0.086 ...
 $ mids    : num [1:13] -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 1.75 ...
 $ xname   : chr "rnorm(1000)"
 $ equidist: logi TRUE
 - attr(*, "class")= chr "histogram"
class(my_histogram)
[1] "histogram"

Step 2 — program defensively, ensure behaviour

incrementer <- function(x, inc_by = 1) {
  x + 1
}
  
incrementer(2)
[1] 3
incrementer(1:4)
[1] 2 3 4 5
incrementer("a")
Error in x + 1: non-numeric argument to binary operator
incrementer <- function(x, inc_by = 1) {
  stopifnot(is.numeric(x))
  return(x + 1)
}
incrementer("a")
Error in incrementer("a"): is.numeric(x) is not TRUE
incrementer <- function(x, inc_by = 1) {
  if (!is.numeric(x)) {
    stop("`x` must be numeric")
  }
  x + 1
}
incrementer("a")
Error in incrementer("a"): `x` must be numeric
incrementer(2, -3) ## oops!
[1] 3
incrementer <- function(x, inc_by = 1) {
  if (!is.numeric(x)) {
    stop("`x` must be numeric")
  }
  x + inc_by
}
incrementer(2, -3)
[1] -1

Even better

incrementer <- function(x, inc_by = 1) {
  if (!is.numeric(x)) cli::cli_abort("`x` must be numeric")
  if (!is.numeric(inc_by)) cli::cli_abort("`inc_by` must be numeric")
  x + inc_by
}
incrementer("a")
Error in `incrementer()`:
! `x` must be numeric
incrementer(1:6, "b")
Error in `incrementer()`:
! `inc_by` must be numeric

Step 3 — Keep track of behaviour with tests

library(testthat)
incrementer <- function(x, inc_by = 1) {
  if (!is.numeric(x)) stop("`x` must be numeric")
  if (!is.numeric(inc_by)) stop("`inc_by` must be numeric")
  x + inc_by
}
test_that("incrementer validates arguments", {
  expect_error(incrementer("a"))
  expect_equal(incrementer(1:3), 2:4)
  expect_equal(incrementer(2, -3), -1)
  expect_error(incrementer(1, "b"))
  expect_identical(incrementer(1:3), 2:4)
})
── Failure: incrementer validates arguments ────────────────────────────────────
incrementer(1:3) not identical to 2:4.
Objects equal but not identical
Error:
! Test failed

Integers are trouble

is.integer(2:4)
[1] TRUE
is.integer(incrementer(1:3))
[1] FALSE
expect_identical(incrementer(1:3, 1L), 2:4)
expect_equal(incrementer(1:3, 1), 2:4)

Testing lingo

Unit testing

  • A unit is a small bit of code (function, class, module, group of classes)

  • A test calls the unit with a set of inputs, and checks if we get the expected output.

gcd <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  if (anyNA(x)) return(NA)
  stopifnot(is.numeric(x))
  if (!rlang::is_integerish(x)) cli_abort("`x` must contain only integers.")
  if (length(x) == 1L) return(as.integer(x))
  x <- x[x != 0]
  compute_gcd(x) # dispatch to a C++ function
}

test_that("gcd works", {
  # corner cases
  expect_identical(gcd(c(1, NA)), NA)
  expect_identical(gcd(c(1, NA), TRUE), 1L)
  expect_identical(gcd(c(1, 2, 4)), 1L)
  # error
  expect_error(gcd(1.3))
  # function
  expect_identical(gcd(c(2, 4, 6)), 2L)
  expect_identical(gcd(c(2, 3, 7)), 1L)
})

Unit testing

Unit testing consists of writing tests that are

  • focused on a small, low-level piece of code (a unit)
  • typically written by the programmer with standard tools
  • fast to run (so can be run often, i.e. before every commit).

Unit testing benefits

Among others:

  • Exposing problems early
  • Making it easy to change (refactor) code without forgetting pieces or breaking things
  • Simplifying integration of components
  • Providing natural documentation of what the code should do
  • Driving the design of new code.

Components of a Unit Testing Framework

  • Collection of Assertions executed in sequence.
  • Executed in a self-contained environment.
  • Any assertion fails Test fails.

Each test focuses on a single component.

Named so that you know what it’s doing.

## See https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life
test_that("Conway's rules are correct", {
    # conway_rules(num_neighbors, alive?)
    expect_true(conway_rules(3, FALSE))
    expect_false(conway_rules(4, FALSE))
    expect_true(conway_rules(2, TRUE))
    ...
})

A test suite

  • Collection of related tests in a common context.

  • Prepares the environment, cleans up after

  • (loads some data, connects to database, necessary library,…)

  • Test suites are run and the results reported, particularly failures, in a easy to parse and economical style.

  • For example, Python’s {unittest} can report like this

$ python test/trees_test.py -v

test_crime_counts (__main__.DataTreeTest)
Ensure Ks are consistent with num_points. ... ok
test_indices_sorted (__main__.DataTreeTest)
Ensure all node indices are sorted in increasing order. ... ok
test_no_bbox_overlap (__main__.DataTreeTest)
Check that child bounding boxes do not overlap. ... ok
test_node_counts (__main__.DataTreeTest)
Ensure that each node's point count is accurate. ... ok
test_oversized_leaf (__main__.DataTreeTest)
Don't recurse infinitely on duplicate points. ... ok
test_split_parity (__main__.DataTreeTest)
Check that each tree level has the right split axis. ... ok
test_trange_contained (__main__.DataTreeTest)
Check that child tranges are contained in parent tranges. ... ok
test_no_bbox_overlap (__main__.QueryTreeTest)
Check that child bounding boxes do not overlap. ... ok
test_node_counts (__main__.QueryTreeTest)
Ensure that each node's point count is accurate. ... ok
test_oversized_leaf (__main__.QueryTreeTest)
Don't recurse infinitely on duplicate points. ... ok
test_split_parity (__main__.QueryTreeTest)
Check that each tree level has the right split axis. ... ok
test_trange_contained (__main__.QueryTreeTest)
Check that child tranges are contained in parent tranges. ... ok

---------------------------------------------------------
Ran 12 tests in 23.932s

R example

testthat::test_local(here::here("../../../../../../Delphi/smoothqr/"))
✔ | F W  S  OK | Context

⠏ |          0 | smooth-rq                                                      
⠴ |          6 | smooth-rq                                                      
✔ |         12 | smooth-rq

══ Results ═════════════════════════════════════════════════════════════════════
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 12 ]

What do I test?

Core Principle:

Tests should be passed by a correct function, but not by an incorrect function.

The tests must apply pressure to know if things break.

  • several specific inputs for which you know the correct answer
  • “edge” cases, like a list of size zero or a matrix instead of a vector
  • special cases that the function must handle, but which you might forget about months from now
  • error cases that should throw an error instead of returning an invalid answer
  • previous bugs you’ve fixed, so those bugs never return.

What do I test?

Make sure that incorrect functions won’t pass (or at least, won’t pass them all).

add <- function(a, b) return(4)
add <- function(a, b) return(a * b)

test_that("Addition is commutative", {
  expect_equal(add(1, 3), add(3, 1)) # both pass this !!
  expect_equal(add(2, 5), add(5, 2)) # neither passes this
})

Tip

  • Cover all branches.

  • Make sure there aren’t branches you don’t expect.

Assertions

Assertions are things that must be true. Failure means “Quit”.

  • There’s no way to recover.
  • Think: passed in bad arguments.
def fit(data, ...):

    for it in range(max_iterations):
        # iterative fitting code here
        ...

        # Plausibility check
        assert np.all(alpha >= 0), "negative alpha"
        assert np.all(theta >= 0), "negative theta"
        assert omega > 0, "Nonpositive omega"
        assert eta2 > 0, "Nonpositive eta2"
        assert sigma2 > 0, "Nonpositive sigma2"

    ...

The parameters have to be positive. Negative is impossible. No way to recover.

Errors

Errors are for unexpected conditions that could be handled by the calling code.

  • You could perform some action to work around the error, fix it, or report it to the user.

Example:

  • I give you directions to my house. You get lost. You could recover.
  • Maybe retrace your steps, see if you missed a sign post.
  • Maybe search on Google Maps to locate your self in relation to a landmark.
  • If those fail, message me.
  • If I don’t respond, get an Uber.
  • Finally, give up and go home.

Errors

Code can also do this. It can try the function and catch errors to recover automatically.

For example:

  • Load some data from the internet. If the file doesn’t exist, create some.

  • Run some iterative algorithm. If we haven’t converged, restart from another place.

Code can fix errors without user input. It can’t fix assertions.

  • An input must be an integer. So round it, Warn, and proceed. Rather than fail.

Test-driven development

Test Driven Development (TDD) uses a short development cycle for each new feature or component:

  1. Write tests that specify the component’s desired behavior.
    The tests will initially fail because the component does not yet exist.

  2. Create the minimal implementation that passes the test.

  3. Refactor the code to meet design standards, running the tests with each change to ensure correctness.

Why work this way?

  • Writing the tests may help you realize

    1. what arguments the function must take,
    2. what other data it needs,
    3. and what kinds of errors it needs to handle.
  • The tests define a specific plan for what the function must do.

  • You will catch bugs at the beginning instead of at the end (or never).

  • Testing is part of design, instead of a lame afterthought you dread doing.

Rules of thumb

Keep tests in separate files
from the code they test. This makes it easy to run them separately.
Give tests names
Testing frameworks usually let you give the test functions names or descriptions. test_1 doesn’t help you at all, but test_tree_insert makes it easy for you to remember what the test is for.
Make tests replicable
If a test involves random data, what do you do when the test fails? You need some way to know what random values it used so you can figure out why the test fails.

Rules of thumb

Use tests instead of the REPL
If you’re building a complicated function, write the tests in advance and use them to help you while you write the function. You’ll waste time calling over and over at the REPL.
Avoid testing against another’s code/package
You don’t know the ins and outs of what they do. If they change the code, your tests will fail.
Test Units, not main functions
You should write small functions that do one thing. Test those. Don’t write one huge 1000-line function and try to test that.
Avoid random numbers
Seeds are not always portable.

Note

  • R, use {testthat}. See the Testing chapter from Hadley Wickham’s R Packages book.

  • python use {pytest}. A bit more user-friendly than {unittest}: pytest

Other suggestions

Do this

foo <- function(x) {
  if (x < 0) stop(x, " is not positive")
}

foo <- function(x) {
  if (x < 0) message(x, " is not positive")
  # not useful unless we fix it too...
}

foo <- function(x) {
  if (x < 0) warning(x, " is not positive")
  # not useful unless we fix it too...
}

foo <- function(x) {
  if (length(x) == 0)
    rlang::abort("no data", class="no_input_data")
}

These allow error handling.

Don’t do this

foo <- function(x) {
  if (x < 0) {
    print(paste0(x, " is not positive"))
    return(NULL)
  }
  ...
}

foo <- function(x) {
  if (x < 0) cat("uh oh.")
  ...
}

Can’t recover.

Don’t know what went wrong.

See here for more details.

Seems like overkill,

but when you run a big simulation that takes 2 weeks,

you don’t want it to die after 10 days.

You want it to recover.

More coding details, if time.

Classes

tib <- tibble(
  x1 = rnorm(100), 
  x2 = rnorm(100), 
  y = x1 + 2 * x2 + rnorm(100)
)
mdl <- lm(y ~ ., data = tib)
class(tib)
[1] "tbl_df"     "tbl"        "data.frame"
class(mdl)
[1] "lm"

The class allows for the use of “methods”

print(mdl)

Call:
lm(formula = y ~ ., data = tib)

Coefficients:
(Intercept)           x1           x2  
     0.1216       1.0803       2.1038  
  • R “knows what to do” when you print() an object of class "lm".

  • print() is called a “generic” function.

  • You can create “methods” that get dispatched.

  • For any generic, R looks for a method for the class.

  • If available, it calls that function.

Viewing the dispatch chain

sloop::s3_dispatch(print(incrementer))
=> print.function
 * print.default
sloop::s3_dispatch(print(tib))
   print.tbl_df
=> print.tbl
 * print.data.frame
 * print.default
sloop::s3_dispatch(print(mdl))
=> print.lm
 * print.default

R-Geeky But Important

There are lots of generic functions in R

Common ones are print(), summary(), and plot().

Also, lots of important statistical modelling concepts: residuals() coef()

(In python, these work the opposite way: obj.residuals. The dot after the object accesses methods defined for that type of object. But the dispatch behaviour is less robust.)

  • The convention is that the specialized function is named method.class(), e.g., summary.lm().

  • If no specialized function is defined, R will try to use method.default().

For this reason, R programmers try to avoid . in names of functions or objects.

Annoying example

print(mdl)

Call:
lm(formula = y ~ ., data = tib)

Coefficients:
(Intercept)           x1           x2  
     0.1216       1.0803       2.1038  
print.lm <- function(x, ...) print("This is an linear model.")
print(mdl)
[1] "This is an linear model."
  • Overwrote the method in the global environment.

Wherefore methods?

  • The advantage is that you don’t have to learn a totally new syntax to grab residuals or plot things

  • You just use residuals(mdl) whether mdl has class lm could have been done two centuries ago, or a Batrachian Emphasis Machine which won’t be invented for another five years.

  • The one draw-back is the help pages for the generic methods tend to be pretty vague

  • Compare ?summary with ?summary.lm.

Different environments

  • These are often tricky, but are very common.

  • Most programming languages have this concept in one way or another.

  • In R code run in the Console produces objects in the “Global environment”

  • You can see what you create in the “Environment” tab.

  • But there’s lots of other stuff.

  • Many packages are automatically loaded at startup, so you have access to the functions and data inside those package Environments.

For example mean(), lm(), plot(), iris (technically iris is lazy-loaded, meaning it’s not in memory until you call it, but it is available)

  • Other packages require you to load them with library(pkg) before their functions are available.

  • But, you can call those functions by prefixing the package name ggplot2::ggplot().

  • You can also access functions that the package developer didn’t “export” for use with ::: like dplyr:::as_across_fn_call()

Other issues with environments

As one might expect, functions create an environment inside the function.

z <- 1
fun <- function(x) {
  z <- x
  print(z)
  invisible(z)
}
fun(14)
[1] 14

Non-trivial cases are data-masking environments.

tib <- tibble(x1 = rnorm(100),  x2 = rnorm(100),  y = x1 + 2 * x2)
mdl <- lm(y ~ x2, data = tib)
x2
Error in eval(expr, envir, enclos): object 'x2' not found
  • lm() looks “inside” the tib to find y and x2
  • The data variables are added to the lm() environment

Other issues with environments

When Knit, .Rmd files run in their OWN environment.

They are run from top to bottom, with code chunks depending on previous

This makes them reproducible.

Jupyter notebooks don’t do this. 😱

Objects in your local environment are not available in the .Rmd

Objects in the .Rmd are not available locally.

Tip

The most frequent error I see is:

  • running chunks individually, 1-by-1, and it works
  • Knitting, and it fails

The reason is almost always that the chunks refer to objects in the Global Environment that don’t exist in the .Rmd

This error also happens because:

  • library() calls were made globally but not in the .Rmd
    • so the packages aren’t loaded
  • paths to data or other objects are not relative to the .Rmd in your file system
    • they must be
  • Careful organization and relative paths will help to avoid some of these.

Debugging

How to fix code

  • If you’re using a function in a package, start with ?function to see the help
    • Make sure you’re calling the function correctly.
    • Try running the examples.
    • paste the error into Google (this is my first step when you ask me)
    • Go to the package website if it exists, and browse around
  • If your .Rmd won’t Knit
    • Did you make the mistake on the last slide?
    • Did it Knit before? Then the bug is in whatever you added.
    • Did you never Knit it? Why not?
    • Call rstudioapi::restartSession(), then run the Chunks 1-by-1

Adding browser()

  • Only useful with your own functions.
  • Open the script with the function, and add browser() to the code somewhere
  • Then call your function.
  • The execution will Stop where you added browser() and you’ll have access to the local environment to play around

Reproducible examples

Question I get uncountably often that I hate:

“I ran the code like you had on Slide 39, but it didn’t work.”

  • If you want to ask me why the code doesn’t work, you need to show me what’s wrong.

Don’t just share a screenshot!

Unless you get lucky, I won’t be able to figure it out from that.

And I can’t copy-paste into Google.

What you need is a Reproducible Example or reprex. This is a small chunk of code that

  1. runs in it’s own environment
  2. and produces the error.

The best way to do this is with the {reprex} package.

Reproducible examples, How it works

  1. Open a new .R script.

  2. Paste your buggy code in the file (no need to save)

  3. Edit your code to make sure it’s “enough to produce the error” and nothing more. (By rerunning the code a few times.)

  4. Copy your code.

  5. Call reprex::reprex() from the console. This will run your code in a new environment and show the result in the Viewer tab. Does it create the error you expect?

  6. If it creates other errors, that may be the problem. You may fix the bug on your own!

  7. If it doesn’t have errors, then your global environment is Farblunget.

  8. The Output is now on your clipboard. Share that.

Note

Because Reprex runs in it’s own environment, it doesn’t have access to any of the libraries you loaded or the stuff in your global environment. You’ll have to load these things in the script. That’s the point

Practice

Gradient ascent.

  • Suppose we want to find \(\max_x f(x)\).

  • We repeat the update \(x \leftarrow x + \gamma f'(x)\) until convergence, for some \(\gamma > 0\).

Poisson likelihood.

  • Recall the likelihood: \(L(\lambda; y_1,\ldots,y_n) = \prod_{i=1}^n \frac{\lambda^{y_i} \exp(-\lambda)}{y_i!}I(y_i \in 0,1,\ldots)\)

Goal: find the MLE for \(\lambda\) using gradient ascent

Deliverables, 2 R scripts

  1. A function that evaluates the log likelihood. (think about sufficiency, ignorable constants)
  2. A function that evaluates the gradient of the log likelihood.
  3. A function that does the optimization.
    1. Should take in data, the log likelihood and the gradient.
    2. Use the loglikelihood to determine convergence.
    3. Pass in any other necessary parameters with reasonable defaults.
  4. A collection of tests that make sure your functions work.

\[ \begin{aligned} L(\lambda; y_1,\ldots,y_n) &= \prod_{i=1}^n \frac{\lambda^{y_i} \exp(-\lambda)}{y_i!}I(y_i \in 0,1,\ldots)\\ x &\leftarrow x + \gamma f'(x) \end{aligned} \]