18 The bootstrap

Stat 406

Daniel J. McDonald

Last modified – 02 November 2023


A small detour…

In statistics…

The “bootstrap” works. And well.

It’s good for “second-level” analysis.

  • “First-level” analyses are things like \(\hat\beta\), \(\hat \y\), an estimator of the center (a median), etc.

  • “Second-level” analyses are things like \(\Var{\hat\beta}\), a confidence interval for \(\hat \y\) or for the median, etc.


You usually get these “second-level” properties from “the sampling distribution of an estimator”.

But what if you don’t know the sampling distribution? Or you’re skeptical of the CLT argument?

Sampling distributions

  1. If \(X_i\) are iid Normal \((0,\sigma^2)\), then \(\Var{\overline{X}} = \sigma^2 / n\).
  2. If \(X_i\) are iid and \(n\) is big, then \(\Var{\overline{X}} \approx \Var{X_1} / n\) (checked by simulation below).
  3. If \(X_i\) are iid Binomial \((m, p)\), then \(\Var{\overline{X}} = mp(1-p) / n\).
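A quick simulation check of fact 2, as promised. This is a sketch, not from the slides: the sample size, the number of replications, and the Exponential\((1/5)\) choice (so \(\Var{X_1} = 25\)) are mine.

set.seed(406) # hypothetical seed, for reproducibility
n <- 250
xbars <- replicate(1000, mean(rexp(n, 1 / 5))) # 1000 independent sample means
var(xbars) # simulated variance of the sample mean
25 / n # theoretical Var(X_1) / n = 0.1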

Example of unknown sampling distribution

I estimate an LDA on some data.

I get a new \(x_0\) and produce \(\widehat{Pr}(y_0 =1 \given x_0)\).

Can I get a 95% confidence interval for \(Pr(y_0=1 \given x_0)\)?

The bootstrap gives this to you.

Bootstrap procedure

  1. Resample your training data w/ replacement.
  2. Calculate LDA on this sample.
  3. Produce a new prediction, call it \(\widehat{Pr}_b(y_0 =1 \given x_0)\).
  4. Repeat 1-3 \(b = 1,\ldots,B\) times.
  5. CI: \(\left[2\widehat{Pr}(y_0 =1 \given x_0) - \widehat{F}^{-1}_{boot}(1-\alpha/2),\ 2\widehat{Pr}(y_0 =1 \given x_0) - \widehat{F}^{-1}_{boot}(\alpha/2)\right]\)

\(\hat{F}_{boot}\) is the “empirical” distribution of the bootstrap predictions, and \(\widehat{F}^{-1}_{boot}(q)\) is its \(q\) quantile.
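Here is a minimal sketch of steps 1–5 in R, assuming a training data frame `train` with a binary response `y`, a one-row data frame `x0` of new features (both hypothetical names), and a class labelled `"1"`:

library(MASS) # for lda()
library(purrr) # for map_dbl()

B <- 1000
alpha <- 0.05
fit <- lda(y ~ ., data = train)
phat <- predict(fit, x0)$posterior[, "1"] # first-level estimate
phat_b <- map_dbl(1:B, ~ {
  boot <- train[sample(nrow(train), replace = TRUE), ] # 1. resample w/ replacement
  predict(lda(y ~ ., data = boot), x0)$posterior[, "1"] # 2-3. refit, re-predict
})
CI <- 2 * phat - quantile(phat_b, probs = c(1 - alpha / 2, alpha / 2)) # 5. pivotal CI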

Empirical distribution

Code
# `orange` (and `blue`, `red` below) are colour constants assumed defined in the course setup
r <- rexp(50, 1 / 5)
ggplot(tibble(r = r), aes(r)) + 
  stat_ecdf(colour = orange) +
  geom_vline(xintercept = quantile(r, probs = c(.05, .95))) +
  geom_hline(yintercept = c(.05, .95), linetype = "dashed") +
  annotate(
    "label", x = c(5, 12), y = c(.25, .75), 
    label = c("hat(F)[boot](.05)", "hat(F)[boot](.95)"), 
    parse = TRUE
  )
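For reference, `stat_ecdf()` is drawing the empirical CDF \(\hat{F}(x) = \frac{1}{n}\sum_{i=1}^n 1\{X_i \le x\}\). Base R’s `ecdf()` builds the same step function (an aside, not in the original):

Fhat <- ecdf(r) # step function: proportion of the sample <= x
Fhat(5) # estimate of Pr(X <= 5)
quantile(r, probs = c(.05, .95)) # the two vertical lines above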

Very basic example

  • Let \(X_i\sim \textrm{Exponential}(1/5)\). The pdf is \(f(x) = \frac{1}{5}e^{-x/5}\)

  • I know if I estimate the mean with \(\bar{X}\), then by the CLT (if \(n\) is big),

\[\frac{\sqrt{n}(\bar{X}-E[X])}{s} \approx N(0, 1).\]

  • This gives me a 95% confidence interval like \[\bar{X} \pm 2s/\sqrt{n}\] (a quick numerical check follows this list)

  • But I don’t want to estimate the mean, I want to estimate the median.
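Before turning to the median, here is that CLT interval computed on one simulated sample (my own sketch; the seed and \(n = 500\) are choices, not from the slides):

set.seed(406) # hypothetical seed
n <- 500
x <- rexp(n, 1 / 5)
mean(x) + c(-2, 2) * sd(x) / sqrt(n) # approximate 95% CI for E[X] = 5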

Code
ggplot(data.frame(x = c(0, 12)), aes(x)) +
  stat_function(fun = function(x) dexp(x, 1 / 5), color = orange) +
  geom_vline(xintercept = 5, color = blue) + # mean
  geom_vline(xintercept = qexp(.5, 1 / 5), color = red) + # median
  annotate("label",
    x = c(2.5, 5.5, 10), y = c(.15, .15, .05),
    label = c("median", "bar(x)", "pdf"), parse = TRUE,
    color = c(red, blue, orange), size = 6
  )

Now what…

  • I give you a sample of size 500, you give me the sample median.

  • How do you get a CI?

  • You can use the bootstrap!

set.seed(406406406)
n <- 500 # the sample size from above
x <- rexp(n, 1 / 5)
(med <- median(x)) # sample median
[1] 3.611615
B <- 100
alpha <- 0.05
Fhat <- map_dbl(1:B, ~ median(sample(x, replace = TRUE))) # repeat B times, "empirical distribution"
CI <- 2 * med - quantile(Fhat, probs = c(1 - alpha / 2, alpha / 2))

Code
ggplot(data.frame(Fhat), aes(Fhat)) +
  geom_density(color = orange) +
  geom_vline(xintercept = CI, color = orange, linetype = 2) +
  geom_vline(xintercept = med, col = blue) +
  geom_vline(xintercept = qexp(.5, 1 / 5), col = red) +
  annotate("label",
    x = c(3.15, 3.5, 3.75), y = c(.5, .5, 1),
    color = c(orange, red, blue),
    label = c("widehat(F)", "true~median", "widehat(median)"),
    parse = TRUE
  ) +
  xlab("x") +
  geom_rug(aes(2 * med - Fhat))

How does this work?

Approximations

The bootstrap substitutes the empirical distribution \(\hat{F}\) for the unknown population distribution \(F\): resampling the data mimics drawing fresh samples from the population, so the variability of the statistic across resamples approximates its true sampling variability.
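In symbols, the chain of approximations looks like this (my rendering of the standard picture, not verbatim from the slides):

\[ \begin{aligned} \text{Real world:} \quad & F \longrightarrow X_1, \ldots, X_n \longrightarrow \hat{\theta} \\ \text{Bootstrap world:} \quad & \hat{F} \longrightarrow X_1^*, \ldots, X_n^* \longrightarrow \theta^* \end{aligned} \]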

Slightly harder example

ggplot(fatcats, aes(Bwt, Hwt)) + # fatcats: the course's cat body/heart weight data
  geom_point(color = blue) +
  xlab("Cat body weight (kg)") +
  ylab("Cat heart weight (g)")

cats.lm <- lm(Hwt ~ 0 + Bwt, data = fatcats)
summary(cats.lm)

Call:
lm(formula = Hwt ~ 0 + Bwt, data = fatcats)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.2353  -0.7932  -0.1407   0.5968  11.1026 

Coefficients:
    Estimate Std. Error t value Pr(>|t|)    
Bwt  3.95424    0.06294   62.83   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.089 on 143 degrees of freedom
Multiple R-squared:  0.965, Adjusted R-squared:  0.9648 
F-statistic:  3947 on 1 and 143 DF,  p-value: < 2.2e-16
confint(cats.lm)
       2.5 %   97.5 %
Bwt 3.829836 4.078652

When we fit models, we examine diagnostics

qqnorm(residuals(cats.lm), pch = 16, col = blue)
qqline(residuals(cats.lm), col = orange, lwd = 2)

The tails are too fat. So I don’t believe that CI…

We bootstrap

B <- 500
alpha <- .05
bhats <- map_dbl(1:B, ~ {
  newcats <- fatcats |>
    slice_sample(prop = 1, replace = TRUE) # resample the rows w/ replacement
  coef(lm(Hwt ~ 0 + Bwt, data = newcats)) # refit on the resample
})

2 * coef(cats.lm) - # Bootstrap CI
  quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))
   97.5%     2.5% 
3.826735 4.084322 
confint(cats.lm) # Original CI
       2.5 %   97.5 %
Bwt 3.829836 4.078652

An alternative

  • So far, I didn’t use any information about the data-generating process.

  • We’ve done the non-parametric bootstrap.

  • This is the easiest version, and the most common for the methods in this module.

But there’s another version

  • You could try a “parametric bootstrap”.

  • This assumes knowledge about the DGP.

Same data

Non-parametric bootstrap

Same as before

B <- 500
alpha <- .05
bhats <- map_dbl(1:B, ~ {
  newcats <- fatcats |>
    slice_sample(prop = 1, replace = TRUE)
  coef(lm(Hwt ~ 0 + Bwt, data = newcats))
})

2 * coef(cats.lm) - # NP Bootstrap CI
  quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))
   97.5%     2.5% 
3.832907 4.070232 
confint(cats.lm) # Original CI
       2.5 %   97.5 %
Bwt 3.829836 4.078652

Parametric bootstrap

  1. Assume that the linear model is TRUE.
  2. Then, \(\texttt{Hwt}_i = \widehat{\beta}\times \texttt{Bwt}_i + \widehat{e}_i\), \(\widehat{e}_i \approx \epsilon_i\)
  3. The \(\epsilon_i\) are random \(\longrightarrow\) just resample \(\widehat{e}_i\).
B <- 500
cats.lm <- lm(Hwt ~ 0 + Bwt, data = fatcats)
r <- residuals(cats.lm)
bhats <- map_dbl(1:B, ~ {
  newcats <- fatcats |> mutate(
    Hwt = predict(cats.lm) +
      sample(r, n(), replace = TRUE)
  )
  coef(lm(Hwt ~ 0 + Bwt, data = newcats))
})

2 * coef(cats.lm) - # Parametric Bootstrap CI
  quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))
   97.5%     2.5% 
3.815162 4.065045 

Bootstrap error sources

Simulation error
Using only \(B\) samples to estimate \(F\) with \(\hat{F}\) (illustrated below).
Statistical error
Our data are themselves a sample from the population. We don’t observe the whole population, so any estimate based on a sample carries error. (Note: this part is what always happens with data, and is what the science of statistics analyzes.)
Specification error
If we use the parametric bootstrap and our model is wrong, then we are overconfident.
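To illustrate simulation error, here is a sketch (my own, reusing the median example from earlier): the CI endpoints wobble when \(B\) is small and settle down as \(B\) grows.

set.seed(406) # hypothetical seed
x <- rexp(500, 1 / 5)
med <- median(x)
boot_ci <- function(B, alpha = 0.05) { # pivotal CI from B resamples
  Fhat <- replicate(B, median(sample(x, replace = TRUE)))
  2 * med - quantile(Fhat, probs = c(1 - alpha / 2, alpha / 2))
}
sapply(c(50, 500, 5000), boot_ci) # endpoints stabilize as B grows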

Types of intervals

Let \(\hat{\theta}\) be our sample statistic, \(\hat{\Theta}\) be the resamples

Our interval is

\[ [2\hat{\theta} - \theta^*_{1-\alpha/2},\ 2\hat{\theta} - \theta^*_{\alpha/2}] \]

where \(\theta^*_q\) is the \(q\) quantile of \(\hat{\Theta}\).


  • Called the “Pivotal Interval”
  • Has the correct \(1-\alpha\) coverage under very mild conditions on \(\hat{\theta}\)

Types of intervals

Let \(\hat{\theta}\) be our sample statistic, \(\hat{\Theta}\) be the resamples

\[ [\hat{\theta} - z_{\alpha/2}\hat{s},\ \hat{\theta} + z_{\alpha/2}\hat{s}] \]

where \(\hat{s} = \sqrt{\Var{\hat{\Theta}}}\)


  • Called the “Normal Interval”
  • Only works if the distribution of \(\hat{\Theta}\) is approximately Normal.
  • Unlikely to work well
  • Don’t do this

Types of intervals

Let \(\hat{\theta}\) be our sample statistic, \(\hat{\Theta}\) be the resamples

\[ [\theta^*_{\alpha/2},\ \theta^*_{1-\alpha/2}] \]

where \(\theta^*_q\) is the \(q\) quantile of \(\hat{\Theta}\).


  • Called the “Percentile Interval”
  • Works if \(\exists\) monotone \(m\) so that \(m(\hat\Theta) \sim N(m(\theta), c^2)\)
  • Better than the Normal Interval
  • More assumptions than the Pivotal Interval (all three intervals are computed side by side below)
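As a sketch (not from the slides), here are all three intervals computed from the same resamples, reusing `bhats`, `alpha`, and `cats.lm` from the parametric bootstrap above:

theta_hat <- coef(cats.lm)
q <- quantile(bhats, probs = c(alpha / 2, 1 - alpha / 2)) # theta*_{alpha/2}, theta*_{1-alpha/2}
s_hat <- sd(bhats)
pivotal <- 2 * theta_hat - rev(q) # [2*theta_hat - upper, 2*theta_hat - lower]
normal <- theta_hat + c(-1, 1) * qnorm(1 - alpha / 2) * s_hat
percentile <- q
rbind(pivotal, normal, percentile) # compare the three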

More details

See “All of Statistics” by Larry Wasserman, Chapter 8.3

There’s a handout with the proofs on Canvas (under Modules)

Next time…

Bootstrap for bagging and random forests