18 The bootstrap
Stat 406
Geoff Pleiss, Trevor Campbell
Last modified – 02 November 2023
\[
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
\DeclareMathOperator*{\minimize}{minimize}
\DeclareMathOperator*{\maximize}{maximize}
\DeclareMathOperator*{\find}{find}
\DeclareMathOperator{\st}{subject\,\,to}
\newcommand{\E}{E}
\newcommand{\Expect}[1]{\E\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]}
\newcommand{\given}{\ \vert\ }
\newcommand{\X}{\mathbf{X}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\P}{\mathcal{P}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\snorm}[1]{\lVert #1 \rVert}
\newcommand{\tr}[1]{\mbox{tr}(#1)}
\newcommand{\brt}{\widehat{\beta}^R_{s}}
\newcommand{\brl}{\widehat{\beta}^R_{\lambda}}
\newcommand{\bls}{\widehat{\beta}_{ols}}
\newcommand{\blt}{\widehat{\beta}^L_{s}}
\newcommand{\bll}{\widehat{\beta}^L_{\lambda}}
\newcommand{\U}{\mathbf{U}}
\newcommand{\D}{\mathbf{D}}
\newcommand{\V}{\mathbf{V}}
\]
A small detour…
In statistics…
The “bootstrap” works. And well.
It’s good for “second-level” analysis.
“First-level” analyses are things like \(\hat\beta\), \(\hat\y\), an estimator of the center (a median), etc.
“Second-level” analyses are things like \(\Var{\hat\beta}\), a confidence interval for \(\hat\y\) or for a median, etc.
You usually get these “second-level” properties from “the sampling distribution of an estimator”.
But what if you don’t know the sampling distribution? Or you’re skeptical of the CLT argument?
Sampling distributions
If \(X_i\) are iid Normal\((0,\sigma^2)\), then \(\Var{\overline{X}} = \sigma^2 / n\).
If \(X_i\) are iid and \(n\) is big, then \(\Var{\overline{X}} \approx \Var{X_1} / n\).
If \(X_i\) are iid Binomial\((m, p)\), then \(\Var{\overline{X}} = mp(1-p) / n\).
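These identities are easy to sanity-check by simulation; here is a minimal sketch in R (the values of n, \(\sigma\), and the number of replications are arbitrary choices, not from the slides):
# Monte Carlo check that Var(Xbar) = sigma^2 / n for iid Normal data
set.seed(406)
n <- 50
sigma <- 2
xbars <- replicate(5000, mean(rnorm(n, mean = 0, sd = sigma)))
var(xbars) # close to the theoretical value...
sigma^2 / n # ...which is 0.08 here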
Example of unknown sampling distribution
I estimate an LDA on some data.
I get a new \(x_0\) and produce \(\widehat{Pr}(y_0 = 1 \given x_0)\).
Can I get a 95% confidence interval for \(Pr(y_0 = 1 \given x_0)\)?
The bootstrap gives this to you.
Bootstrap procedure
1. Resample your training data with replacement.
2. Calculate the LDA on this sample.
3. Produce a new prediction, call it \(\widehat{Pr}_b(y_0 = 1 \given x_0)\).
4. Repeat steps 1–3 for \(b = 1,\ldots,B\).
CI: \(\left[2\widehat{Pr}(y_0 = 1 \given x_0) - \widehat{F}_{boot}(1-\alpha/2),\ 2\widehat{Pr}(y_0 = 1 \given x_0) - \widehat{F}_{boot}(\alpha/2)\right]\)
Here \(\widehat{F}_{boot}(q)\) denotes the \(q\) quantile of the “empirical” distribution of the bootstrap predictions \(\widehat{Pr}_b\). A sketch of this procedure in R appears below.
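A minimal sketch of the procedure in R, assuming MASS::lda(), a hypothetical training frame train with binary factor response y, and a one-row data frame x0 of new predictors (none of these objects come from the slides):
library(MASS) # lda()
library(purrr) # map_dbl()

B <- 1000
alpha <- 0.05
# first-level estimate: column 2 of the posterior is Pr(y = 1 | x0),
# assuming the factor levels of y are ordered 0 < 1
phat <- predict(lda(y ~ ., data = train), x0)$posterior[, 2]
phat_b <- map_dbl(1:B, ~ {
  resamp <- train[sample(nrow(train), replace = TRUE), ] # 1. resample w/ replacement
  fit <- lda(y ~ ., data = resamp) # 2. refit the LDA
  predict(fit, x0)$posterior[, 2] # 3. new prediction
})
# CI from the quantiles of the bootstrap predictions, as above
CI <- 2 * phat - quantile(phat_b, probs = c(1 - alpha / 2, alpha / 2))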
Empirical distribution
Code
r <- rexp(50, 1 / 5)
# `orange` (like `blue` and `red` below) is a colour constant defined in the course setup (assumed)
ggplot(tibble(r = r), aes(r)) +
  stat_ecdf(colour = orange) +
  geom_vline(xintercept = quantile(r, probs = c(.05, .95))) +
  geom_hline(yintercept = c(.05, .95), linetype = "dashed") +
  annotate(
    "label", x = c(5, 12), y = c(.25, .75),
    label = c("hat(F)[boot](.05)", "hat(F)[boot](.95)"),
    parse = TRUE
  )
Very basic example
Let \(X_i\sim \textrm{Exponential}(1/5)\). The pdf is \(f(x) = \frac{1}{5}e^{-x/5}\).
I know if I estimate the mean with \(\bar{X}\) , then by the CLT (if \(n\) is big),
\[\frac{\sqrt{n}(\bar{X}-E[X])}{s} \approx N(0, 1).\]
This gives me a 95% confidence interval like \[\bar{X} \pm 2s/\sqrt{n}\]
But I don’t want to estimate the mean; I want to estimate the median.
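For reference, the CLT interval for the mean is a one-liner in R (a sketch; the sample size and rate are arbitrary illustrative choices):
x <- rexp(500, 1 / 5)
mean(x) + c(-2, 2) * sd(x) / sqrt(length(x)) # approximate 95% CI for E[X]
# no analogous closed form is available for the median -- hence the bootstrap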
Code
ggplot(data.frame(x = c(0, 12)), aes(x)) +
  stat_function(fun = function(x) dexp(x, 1 / 5), color = orange) +
  geom_vline(xintercept = 5, color = blue) + # mean
  geom_vline(xintercept = qexp(.5, 1 / 5), color = red) + # median
  annotate("label",
    x = c(2.5, 5.5, 10), y = c(.15, .15, .05),
    label = c("median", "bar(x)", "pdf"), parse = TRUE,
    color = c(red, blue, orange), size = 6
  )
Now what…
set.seed(406406406)
n <- 500 # sample size (assumed; `n` was not defined in this chunk)
x <- rexp(n, 1 / 5)
(med <- median(x)) # sample median
B <- 100
alpha <- 0.05
Fhat <- map_dbl(1:B, ~ median(sample(x, replace = TRUE))) # B bootstrap medians: the "empirical distribution"
CI <- 2 * med - quantile(Fhat, probs = c(1 - alpha / 2, alpha / 2))
Code
ggplot(data.frame(Fhat), aes(Fhat)) +
  geom_density(color = orange) +
  geom_vline(xintercept = CI, color = orange, linetype = 2) +
  geom_vline(xintercept = med, col = blue) +
  geom_vline(xintercept = qexp(.5, 1 / 5), col = red) +
  annotate("label",
    x = c(3.15, 3.5, 3.75), y = c(.5, .5, 1),
    color = c(orange, red, blue),
    label = c("widehat(F)", "true~median", "widehat(median)"),
    parse = TRUE
  ) +
  xlab("x") +
  geom_rug(aes(2 * med - Fhat))
How does this work?
The key idea: the empirical distribution of the sample approximates the true distribution \(F\), so resampling the data with replacement mimics drawing new samples from the population.
Approximations
We approximate the unknown sampling distribution of our statistic under \(F\) by its distribution under the empirical distribution, and we approximate that, in turn, by simulating \(B\) resamples.
Slightly harder example
ggplot(fatcats, aes(Bwt, Hwt)) +
  geom_point(color = blue) +
  xlab("Cat body weight (Kg)") +
  ylab("Cat heart weight (g)")

cats.lm <- lm(Hwt ~ 0 + Bwt, data = fatcats)
summary(cats.lm)
Call:
lm(formula = Hwt ~ 0 + Bwt, data = fatcats)
Residuals:
Min 1Q Median 3Q Max
-11.2353 -0.7932 -0.1407 0.5968 11.1026
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Bwt 3.95424 0.06294 62.83 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.089 on 143 degrees of freedom
Multiple R-squared: 0.965, Adjusted R-squared: 0.9648
F-statistic: 3947 on 1 and 143 DF, p-value: < 2.2e-16
confint(cats.lm)
          2.5 %   97.5 %
Bwt    3.829836 4.078652
When we fit models, we examine diagnostics
qqnorm(residuals(cats.lm), pch = 16, col = blue)
qqline(residuals(cats.lm), col = orange, lwd = 2)
The tails are too fat. So I don’t believe that CI…
We bootstrap
B <- 500
alpha <- .05
bhats <- map_dbl(1:B, ~ {
  newcats <- fatcats |>
    slice_sample(prop = 1, replace = TRUE) # resample rows w/ replacement
  coef(lm(Hwt ~ 0 + Bwt, data = newcats))
})

2 * coef(cats.lm) - # Bootstrap CI
  quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))
97.5% 2.5%
3.826735 4.084322
confint(cats.lm) # Original CI
2.5 % 97.5 %
Bwt 3.829836 4.078652
An alternative
So far, we haven’t used any information about the data-generating process.
We’ve done the non-parametric bootstrap.
This is the easiest version, and the most common for the methods in this module.
But there’s another version.
Same data
Non-parametric bootstrap
Same as before
B <- 500
alpha <- .05
bhats <- map_dbl(1:B, ~ {
  newcats <- fatcats |>
    slice_sample(prop = 1, replace = TRUE)
  coef(lm(Hwt ~ 0 + Bwt, data = newcats))
})

2 * coef(cats.lm) - # NP Bootstrap CI
  quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))
97.5% 2.5%
3.832907 4.070232
confint(cats.lm) # Original CI
2.5 % 97.5 %
Bwt 3.829836 4.078652
Parametric bootstrap
Assume that the linear model is TRUE.
Then, \(\texttt{Hwt}_i = \widehat{\beta}\times \texttt{Bwt}_i + \widehat{e}_i\) , \(\widehat{e}_i \approx \epsilon_i\)
The \(\epsilon_i\) are random \(\longrightarrow\) just resample \(\widehat{e}_i\).
B <- 500
alpha <- .05
cats.lm <- lm(Hwt ~ 0 + Bwt, data = fatcats)
r <- residuals(cats.lm)
bhats <- map_dbl(1:B, ~ {
  newcats <- fatcats |> mutate(
    Hwt = predict(cats.lm) + # fitted values...
      sample(r, n(), replace = TRUE) # ...plus resampled residuals
  )
  coef(lm(Hwt ~ 0 + Bwt, data = newcats))
})

2 * coef(cats.lm) - # Parametric Bootstrap CI
  quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))
97.5% 2.5%
3.815162 4.065045
Bootstrap error sources
Simulation error
using only \(B\) resamples to estimate \(F\) with \(\hat{F}\). This error shrinks as \(B\) grows.
Statistical error
our data are just one sample from the population; we don’t observe the whole population, so any estimate based on a sample carries this error. (Note: this is what always happens with data, and it is what the science of statistics analyzes.)
Specification error
If we use the parametric bootstrap, and our model is wrong, then we are overconfident.
Types of intervals
Let \(\hat{\theta}\) be our sample statistic, \(\hat{\Theta}\) be the resamples
Our interval is
\[
[2\hat{\theta} - \theta^*_{1-\alpha/2},\ 2\hat{\theta} - \theta^*_{\alpha/2}]
\]
where \(\theta^*_q\) is the \(q\) quantile of \(\hat{\Theta}\).
Called the “Pivotal Interval”
Has the correct \(1-\alpha\) coverage under very mild conditions on \(\hat{\theta}\)
Types of intervals
Let \(\hat{\theta}\) be our sample statistic, \(\hat{\Theta}\) be the resamples
\[
[\hat{\theta} - z_{1-\alpha/2}\hat{s},\ \hat{\theta} + z_{1-\alpha/2}\hat{s}]
\]
where \(\hat{s} = \sqrt{\Var{\hat{\Theta}}}\) and \(z_q\) is the \(q\) quantile of \(N(0,1)\)
Called the “Normal Interval”
Only works if the distribution of \(\hat{\Theta}\) is approximately Normal.
Unlikely to work well
Don’t do this
Types of intervals
Let \(\hat{\theta}\) be our sample statistic, \(\hat{\Theta}\) be the resamples
\[
[\theta^*_{\alpha/2},\ \theta^*_{1-\alpha/2}]
\]
where \(\theta^*_q\) is the \(q\) quantile of \(\hat{\Theta}\).
Called the “Percentile Interval”
Works if \(\exists\) monotone \(m\) so that \(m(\hat\Theta) \sim N(m(\theta), c^2)\)
Better than the Normal Interval
More assumptions than the Pivotal Interval
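For concreteness, here is a small R helper computing all three intervals from a vector of bootstrap replicates (a sketch; the function name boot_cis and its arguments are ours, not from these slides):
# Sketch: pivotal, normal, and percentile intervals from bootstrap
# replicates `theta_star` and the sample statistic `theta_hat`
boot_cis <- function(theta_hat, theta_star, alpha = 0.05) {
  q <- unname(quantile(theta_star, probs = c(alpha / 2, 1 - alpha / 2)))
  s <- sd(theta_star) # bootstrap standard error
  z <- qnorm(1 - alpha / 2)
  out <- rbind(
    pivotal    = c(2 * theta_hat - q[2], 2 * theta_hat - q[1]),
    normal     = c(theta_hat - z * s, theta_hat + z * s), # shown for contrast; don't use
    percentile = q
  )
  colnames(out) <- c("lower", "upper")
  out
}
boot_cis(med, Fhat) # e.g., with the objects from the median example above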
More details
See “All of Statistics” by Larry Wasserman, Chapter 8.3
There’s a handout with the proofs on Canvas (under Modules)
Next time…
Bootstrap for bagging and random forests