18 The bootstrap

Stat 406

Geoff Pleiss, Trevor Campbell

Last modified – 29 October 2024

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \]

Uncertainty with one sample?

We get data \(X_{1:N},Y_{1:N}\). We compute something from the data \(\hat\theta = f(X_{1:N},Y_{1:N})\).

That \(\hat\theta\) is a point estimate (regression coefficients, the median of \(Y\), a risk estimate, …)

How do we understand the uncertainty in \(\hat\theta\approx \theta\) due to randomness in \(X_{1:N}, Y_{1:N}\)?

If we know the sampling distribution of \(\hat\theta\), we can use variance / confidence intervals.

We usually don’t know it in practice (we don’t know the true data distribution, and \(f\) can be nasty)

For some \(\hat\theta\), things are nice anyway. E.g. for empirical mean \(\hat\theta = \frac{1}{N}\sum_{n=1}^N X_n\) with iid \(X_n\):

  • \(\Var{\hat\theta} = \Var{X_1} / N\).
  • Confidence intervals: if the \(X_n\) are iid and \(N\) is “big”, then by the CLT, \(\hat\theta \overset{\text{approx}}{\sim}\mathcal{N}(\mu, \sigma^2/N)\)
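
A quick sanity check of that CLT interval in R (a sketch on simulated data, so all names and values here are illustrative):

set.seed(406)
N <- 500
x <- rexp(N, 1 / 5) # any iid sample works; Exponential(1/5) reappears below
xbar <- mean(x)
s <- sd(x)
xbar + c(-2, 2) * s / sqrt(N) # approximate 95% CI for E[X] = 5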

Unknown sampling distribution?

What if you are skeptical of CLT / don’t know the sampling distribution?

I fit LDA on some data.

I get a new \(X_0\) and produce the estimate \(\hat\theta = \widehat{\Pr}(Y_0 =1 \given X_0)\).

Can I get a 95% confidence interval for \(\theta = \Pr(Y_0=1 \given X_0)\)?

The bootstrap gives this to you.

Etymology of the “bootstrap”

  • “to build itself up incrementally, starting from very little”

  • “to create something valuable from very little/nothing”

  • attributed to a late 1800s physics textbook question

    • “why can a man not lift himself up by pulling on his bootstraps?”
  • became a sarcastic comment about self-driven socioeconomic advancement

    • as these things go: co-opted by others and taken seriously
  • fun fact: same etymology for the term in computing

    • “booting up your computer”

Generic Bootstrap Procedure

The bootstrap is super general-purpose. Given data \(X_{1:N}, Y_{1:N}\):

  1. Compute \(\hat\theta = f(X_{1:N}, Y_{1:N})\)
  2. For \(b=1,\dots, B\), resample data w/ replacement: \(X^{(b)}_{1:N}, Y^{(b)}_{1:N}\) (bootstrap samples)
  3. For \(b=1,\dots, B\), recompute \(\hat\theta^{(b)} = f(X^{(b)}_{1:N}, Y^{(b)}_{1:N})\) (bootstrap estimates)
  4. Variance of \(\hat\theta \approx\) the empirical variance of \(\hat\theta^{(b)}\)
  5. \(1-\alpha\) confidence interval for \(\theta\): \(\left[2\hat\theta - \widehat{F}_{\text{boot}}(1-\alpha/2),\ 2\hat\theta - \widehat{F}_{\text{boot}}(\alpha/2)\right]\)
    • \(\widehat{F}_{\text{boot}}(q)\) is the \(q\) quantile (inverse empirical CDF) of the samples \(\hat\theta^{(b)}\)
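
This recipe is only a few lines of R. Here is a minimal sketch for a generic statistic; boot_pivotal and its arguments are our own names, not from any package:

library(purrr) # map_dbl()

# data: a data frame; f: a function computing the statistic from a data frame
boot_pivotal <- function(data, f, B = 1000, alpha = 0.05) {
  theta_hat <- f(data) # step 1: the point estimate
  theta_b <- map_dbl(1:B, ~ { # steps 2-3: resample rows, recompute
    f(data[sample(nrow(data), replace = TRUE), , drop = FALSE])
  })
  list(
    var = var(theta_b), # step 4: bootstrap variance
    ci = 2 * theta_hat - quantile(theta_b, c(1 - alpha / 2, alpha / 2)) # step 5: pivotal CI
  )
}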

The bootstrap is very flexible

e.g., for LDA:

  1. Produce the original estimate \(\hat\theta=\widehat{\Pr}(Y_0=1 \given X_0)\)
  2. Resample your training data \(B\) times with replacement
  3. Refit LDA for each one to produce \(\widehat{\Pr}\nolimits_b(Y_0 =1 \given X_0)\).
  4. Variance: empirical variance of \(\widehat{\Pr}\nolimits_b(Y_0 =1 \given X_0)\).
  5. CI: \(\left[2\widehat{\Pr}(Y_0 =1 \given X_0) - \widehat{F}_{\text{boot}}(1-\alpha/2),\ 2\widehat{\Pr}(Y_0 =1 \given X_0) - \widehat{F}_{\text{boot}}(\alpha/2)\right]\)
    • \(\widehat{F}_{\text{boot}}(q)\) is the \(q\) quantile of the bootstrap estimates \(\widehat{\Pr}\nolimits_b(Y_0 =1 \given X_0)\)
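
A sketch of this recipe with MASS::lda; here train (a data frame with a binary factor response y) and x0 (a one-row data frame of predictors) are placeholders for your data:

library(MASS)  # lda()
library(purrr) # map_dbl()

# assumes y has levels "0" and "1"
theta_hat <- predict(lda(y ~ ., data = train), x0)$posterior[, "1"]
theta_b <- map_dbl(1:1000, ~ {
  boot <- train[sample(nrow(train), replace = TRUE), ] # resample training rows
  predict(lda(y ~ ., data = boot), x0)$posterior[, "1"]
})
alpha <- 0.05
2 * theta_hat - quantile(theta_b, c(1 - alpha / 2, alpha / 2)) # pivotal 95% CI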

A basic example

  • Let \(X_i\sim \textrm{Exponential}(1/5)\). The pdf is \(f(x) = \frac{1}{5}e^{-x/5}\)

  • I know if I estimate the mean with \(\bar{X}\), then by the CLT (if \(n\) is big),

\[\frac{\sqrt{n}(\bar{X}-E[X])}{s} \overset{\text{approx}}{\sim} N(0, 1).\]

  • This gives me a 95% confidence interval like \[\bar{X} \pm 2s/\sqrt{n}\]

  • But I don’t want to estimate the mean, I want to estimate the median.

Code
library(ggplot2)
# orange, blue, red are colour constants defined in the course setup code
ggplot(data.frame(x = c(0, 12)), aes(x)) +
  stat_function(fun = function(x) dexp(x, 1 / 5), color = orange) +
  geom_vline(xintercept = 5, color = blue) + # mean
  geom_vline(xintercept = qexp(.5, 1 / 5), color = red) + # median
  annotate("label",
    x = c(2.5, 5.5, 10), y = c(.15, .15, .05),
    label = c("median", "bar(x)", "pdf"), parse = TRUE,
    color = c(red, blue, orange), size = 6
  )

How to assess uncertainty in the median?

  • I give you a sample of size 500, you give me the sample median.

  • How do you get a CI?

  • You can use the bootstrap!

library(purrr) # map_dbl()
set.seed(406406406)
n <- 500
x <- rexp(n, 1 / 5)
(med <- median(x)) # sample median
[1] 3.611615
B <- 100
alpha <- 0.05
Fhat <- map_dbl(1:B, ~ median(sample(x, replace = TRUE))) # B bootstrap medians: the "empirical distribution"
CI <- 2 * med - quantile(Fhat, probs = c(1 - alpha / 2, alpha / 2)) # pivotal CI

Empirical distribution

Code
bootdf <- tibble(boots = Fhat)
ggplot(bootdf, aes(boots)) + 
  stat_ecdf(colour = orange) +
  geom_vline(xintercept = quantile(Fhat, probs = c(.05, .95))) +
  geom_hline(yintercept = c(.05, .95), linetype = "dashed") +
  annotate(
    "label", x = c(3.2, 3.9), y = c(.2, .8), 
    label = c("hat(F)[boot](.05)", "hat(F)[boot](.95)"), 
    parse = TRUE
  )

Result

Code
ggplot(data.frame(Fhat), aes(Fhat)) +
  geom_density(color = orange) +
  geom_vline(xintercept = CI, color = orange, linetype = 2) +
  geom_vline(xintercept = med, col = blue) +
  geom_vline(xintercept = qexp(.5, 1 / 5), col = red) +
  annotate("label",
    x = c(3.15, 3.5, 3.75), y = c(.5, .5, 1),
    color = c(orange, red, blue),
    label = c("widehat(F)", "true~median", "widehat(median)"),
    parse = TRUE
  ) +
  xlab("x") +
  geom_rug(aes(2 * med - Fhat)) # pivotal-transformed bootstrap samples

How does this work?

The Fundamental Premise (TM): a sufficiently large sample looks like the population.

So, sampling from the sample looks like sampling from the population.

Population:

Medians of three fresh samples of size \(N=100\) drawn from the population:

[1] 3.055229
[1] 4.155205
[1] 3.346588

One Sample ( \(N = 100\) ):

Medians of three resamples (with replacement) of size \(N=100\) from that one sample:

[1] 3.126391
[1] 3.063069
[1] 3.534019
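
A sketch of the code behind this comparison (our reconstruction of what produced the medians above, not the original):

library(purrr)
map_dbl(1:3, ~ median(rexp(100, 1 / 5))) # fresh samples from the population
x <- rexp(100, 1 / 5) # one fixed sample...
map_dbl(1:3, ~ median(sample(x, replace = TRUE))) # ...resampled with replacement

The two sets of medians vary on the same scale, which is exactly what the bootstrap exploits.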

Bootstrap error sources

Simulation error:
we use only \(B\) bootstrap samples to estimate the sampling distribution \(F\) with \(\hat{F}\).

Statistical error:
our data is just one sample from the population. We don’t have the whole population, so we make an error by using a sample. (Note: this part is what always happens with data, and it is what the science of statistics analyzes.)

Specification error:
if we use the parametric bootstrap and our model is wrong, we will be overconfident.
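
Simulation error is the easiest to see: rerun the median example with different values of \(B\) and watch the bootstrap quantiles stabilize (a sketch reusing the earlier setup):

library(purrr)
x <- rexp(500, 1 / 5)
for (B in c(50, 500, 5000)) {
  Fhat <- map_dbl(1:B, ~ median(sample(x, replace = TRUE)))
  print(quantile(Fhat, c(.025, .975))) # jumps around for small B
}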

Types of intervals

Let \(\hat{\theta}\) be our sample statistic and \(\hat\theta^{(b)}\), \(b = 1,\dots,B\), the bootstrap estimates

Our interval is

\[ [2\hat{\theta} - \hat\theta^{(b)}_{1-\alpha/2},\ 2\hat{\theta} - \hat\theta^{(b)}_{\alpha/2}] \]

where \(\hat\theta^{(b)}_q\) is the \(q\) quantile of \(\hat\theta^{(b)}\).


  • Called the “Pivotal Interval”
  • Has the correct \(1-\alpha\) coverage under very mild conditions on \(\hat{\theta}\)

Types of intervals

Let \(\hat{\theta}\) be our sample statistic and \(\hat\theta^{(b)}\), \(b = 1,\dots,B\), the bootstrap estimates

\[ [\hat{\theta} - z_{1-\alpha/2}\hat{s},\ \hat{\theta} + z_{1-\alpha/2}\hat{s}] \]

where \(\hat{s} = \sqrt{\Var{\hat\theta^{(b)}}}\) is the empirical standard deviation of the bootstrap estimates


  • Called the “Normal Interval”
  • Only works if the distribution of \(\hat\theta^{(b)}\) is approximately Normal.
  • Common and really tempting, but not likely to work well
  • Don’t do this

Types of intervals

Let \(\hat{\theta}\) be our sample statistic and \(\hat\theta^{(b)}\), \(b = 1,\dots,B\), the bootstrap estimates

\[ [\hat\theta^{(b)}_{\alpha/2},\ \hat\theta^{(b)}_{1-\alpha/2}] \]

where \(\hat\theta^{(b)}_q\) is the \(q\) quantile of \(\hat\theta^{(b)}\).


  • Called the “Percentile Interval”
  • Better than the Normal Interval, but requires more assumptions than the Pivotal Interval
  • We teach this one in DSCI100 because it’s easy and “looks right” (it’s not, in general)
    • Unlike the Pivotal Interval, it doesn’t follow from the fundamental premise of the bootstrap
  • It does have (asymptotically) correct coverage if \(\exists\) a monotone \(m\) so that \(m(\hat\theta) \sim N(m(\theta), c^2)\)
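
For concreteness, here are all three intervals computed from the same bootstrap output (a sketch; boot_cis is our own helper, with theta_hat the original estimate and theta_b the vector of bootstrap estimates):

boot_cis <- function(theta_hat, theta_b, alpha = 0.05) {
  q <- unname(quantile(theta_b, c(alpha / 2, 1 - alpha / 2)))
  z <- qnorm(1 - alpha / 2)
  s <- sd(theta_b)
  rbind(
    pivotal = c(2 * theta_hat - q[2], 2 * theta_hat - q[1]),
    normal = c(theta_hat - z * s, theta_hat + z * s), # don't do this
    percentile = c(q[1], q[2])
  )
}

For the median example above, boot_cis(med, Fhat) reproduces the pivotal CI computed earlier.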

Slightly harder example

ggplot(fatcats, aes(Bwt, Hwt)) +
  geom_point(color = blue) +
  xlab("Cat body weight (Kg)") +
  ylab("Cat heart weight (g)")

cats.lm <- lm(Hwt ~ 0 + Bwt, data = fatcats)
summary(cats.lm)

Call:
lm(formula = Hwt ~ 0 + Bwt, data = fatcats)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.8138 -0.9014 -0.2155  0.7548 22.5957 

Coefficients:
    Estimate Std. Error t value Pr(>|t|)    
Bwt  3.88297    0.08401   46.22   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.789 on 143 degrees of freedom
Multiple R-squared:  0.9373,    Adjusted R-squared:  0.9368 
F-statistic:  2136 on 1 and 143 DF,  p-value: < 2.2e-16
confint(cats.lm)
       2.5 %   97.5 %
Bwt 3.716912 4.049036

When we fit models, we examine diagnostics

qqnorm(residuals(cats.lm), pch = 16, col = blue)
qqline(residuals(cats.lm), col = orange, lwd = 2)

The tails are too fat. I don’t believe that CI…

We bootstrap

B <- 500
alpha <- .05
bhats <- map_dbl(1:B, ~ {
  newcats <- fatcats |>
    slice_sample(prop = 1, replace = TRUE) # resample all n rows with replacement
  coef(lm(Hwt ~ 0 + Bwt, data = newcats))
})

2 * coef(cats.lm) - # Bootstrap CI
  quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))
   97.5%     2.5% 
3.731614 4.030176 
confint(cats.lm) # Original CI
       2.5 %   97.5 %
Bwt 3.716912 4.049036

An alternative

  • So far, I didn’t use any information about the data-generating process.

  • We’ve done the non-parametric bootstrap

  • This is easiest, and most common for the methods in this module

But there’s another version

  • You could try a “parametric bootstrap”

  • This assumes knowledge about the DGP

Same data

Non-parametric bootstrap

Same as before

B <- 500
alpha <- .05
bhats <- map_dbl(1:B, ~ {
  newcats <- fatcats |>
    slice_sample(prop = 1, replace = TRUE) # resample all n rows with replacement
  coef(lm(Hwt ~ 0 + Bwt, data = newcats))
})

2 * coef(cats.lm) - # NP Bootstrap CI
  quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))
   97.5%     2.5% 
3.726857 4.038470 
confint(cats.lm) # Original CI
       2.5 %   97.5 %
Bwt 3.716912 4.049036

Parametric bootstrap

  1. Assume that the linear model is TRUE.
  2. Then, \(\texttt{Hwt}_i = \widehat{\beta}\times \texttt{Bwt}_i + \widehat{e}_i\), \(\widehat{e}_i \approx \epsilon_i\)
  3. The \(\epsilon_i\) are random \(\longrightarrow\) just resample the \(\widehat{e}_i\).
B <- 500
cats.lm <- lm(Hwt ~ 0 + Bwt, data = fatcats)
r <- residuals(cats.lm)
bhats <- map_dbl(1:B, ~ {
  newcats <- fatcats |> mutate(
    Hwt = predict(cats.lm) +
      sample(r, n(), replace = TRUE)
  )
  coef(lm(Hwt ~ 0 + Bwt, data = newcats))
})

2 * coef(cats.lm) - # Parametric Bootstrap CI
  quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))
   97.5%     2.5% 
3.707015 4.035300 

Next time…

Bootstrap for bagging and random forests