Two Great Questions from Last Lecture

1) The Bias of OLS

In last class's clicker question, we were trying to predict income from a \(p=2\), \(n=50000\) dataset. I claimed that the OLS predictor has high bias in this example.


But Trevor proved that the OLS predictor is unbiased (via the following proof):

A. Assume that \(y = x^\top \beta + \epsilon, \quad \epsilon \sim \mathcal N(0, \sigma^2).\)

B. Then \(E[\hat \beta_\mathrm{ols}] = E[ E[\hat \beta_\mathrm{ols} \mid \mathbf X] ]\)

C. \(E[\hat \beta_\mathrm{ols} \mid \mathbf X] = (\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top E[ \mathbf y \mid \mathbf X]\)

D. \(E[ \mathbf y \mid \mathbf X] = \mathbf X \beta\)

E. So \(E[\hat \beta_\mathrm{ols}] - \beta = E[(\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top \mathbf X \beta] - \beta = \beta - \beta = 0\).


Why did this proof not apply to the clicker question?
(Which step of this proof breaks down?)

1) The Bias of OLS

Which step of the proof breaks down?

A. Assume that \(y = x^\top \beta + \epsilon, \quad \epsilon \sim \mathcal N(0, \sigma^2).\)


This assumption does not hold.

It is (almost certainly) not the case that Income ~ Age + Education.


In reality, \(y = f(x) + \epsilon\), where \(f\) is some potentially nonlinear function. So

\[ E[\hat \beta_\mathrm{ols}] = E\big[E[\hat \beta_\mathrm{ols} \mid \mathbf X ]\big] = E\big[ (\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top f(\mathbf X) \big] \ne \beta \]

In statistics speak, our model is misspecified.
Ridge/lasso still decrease variance (at the cost of some added bias), even under misspecification.
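To see this concretely, here's a small simulation sketch (the nonlinear \(f\), \(n=50\), and the point \(x_0\) are all made up for illustration): even averaging over many training sets, the OLS prediction does not recover \(f(x_0)\).

# Sketch: OLS predictions are biased when the model is misspecified
# (f, n, and x0 below are made up for illustration).
set.seed(406406)
f <- function(x) sin(2 * x)      # the true, nonlinear regression function
x0 <- 1.5                        # point at which we examine the bias
preds <- replicate(5000, {
  x <- runif(50, -2, 2)
  y <- f(x) + rnorm(50, sd = 0.5)
  predict(lm(y ~ x), newdata = data.frame(x = x0))  # OLS prediction at x0
})
mean(preds) - f(x0)              # clearly nonzero: the average OLS prediction misses f(x0)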

2) Why does ridge regression shrink variance?

Mathematically

  • Variance of OLS: \(\mathrm{Cov}[\hat \beta_\mathrm{ols}] = \sigma^2 E[ (\mathbf X^\top \mathbf X)^{-1} ]\)
  • Variance of ridge regression: \(\mathrm{Cov}[\hat \beta_\mathrm{ridge}] \approx \sigma^2 E[ (\mathbf X^\top \mathbf X + \lambda \mathbf I)^{-1} ]\) (the exact expression sandwiches \(\mathbf X^\top \mathbf X\) between two such inverses; either way, adding \(\lambda \mathbf I\) shrinks the inverse, as checked numerically below)
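A minimal numerical check of the formulas above (with an arbitrary simulated \(\mathbf X\) and \(\lambda\)): adding \(\lambda \mathbf I\) before inverting makes every diagonal entry of the inverse smaller, so every coefficient's variance shrinks.

# Sketch: (X'X + lambda I)^{-1} has uniformly smaller diagonal than (X'X)^{-1}
# (X and lambda are arbitrary, for illustration).
set.seed(406406)
n <- 100; p <- 5; lambda <- 10
X <- matrix(rnorm(n * p), n, p)
xtx <- crossprod(X)                              # X'X
v_ols   <- diag(solve(xtx))                      # per-coefficient variance, up to sigma^2
v_ridge <- diag(solve(xtx + lambda * diag(p)))
rbind(ols = v_ols, ridge = v_ridge)              # every ridge entry is smaller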

Intuitively…

Think about the constrained optimization problem pictorially.
What would happen if we had a different dataset?

11 Local methods

Stat 406

Geoff Pleiss, Trevor Campbell

Last modified – 08 October 2024


Last time…

We looked at feature maps as a way to do nonlinear regression.

We used new “features” \(\Phi(x) = \bigg(\phi_1(x),\ \phi_2(x),\ldots,\phi_k(x)\bigg)\)

Now we examine a nonparametric alternative

Suppose I just look at the “neighbours” of some point (based on the \(x\)-values)

I just average the \(y\)’s at those locations together

Let’s use 3 neighbours

Code
library(cowplot)
library(tidyverse) # ggplot2, dplyr, tibble, etc.
data(arcuate, package = "Stat406")
# blue, red, orange: colour strings assumed to be defined in the course setup
set.seed(406406)
arcuate_unif <- arcuate |> slice_sample(n = 40) |> arrange(position)
pt <- 15
nn <-  3
seq_range <- function(x, n = 101) seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), length.out = n)
neibs <- sort.int(abs(arcuate_unif$position - arcuate_unif$position[pt]), index.return = TRUE)$ix[1:nn]
arcuate_unif$neighbours = seq_len(40) %in% neibs
g1 <- ggplot(arcuate_unif, aes(position, fa, colour = neighbours)) + 
  geom_point() +
  scale_colour_manual(values = c(blue, red)) + 
  geom_vline(xintercept = arcuate_unif$position[pt], colour = red) + 
  annotate("rect", fill = red, alpha = .25, ymin = -Inf, ymax = Inf,
           xmin = min(arcuate_unif$position[neibs]), 
           xmax = max(arcuate_unif$position[neibs])
  ) +
  theme(legend.position = "none")
g2 <- ggplot(arcuate_unif, aes(position, fa)) +
  geom_point(colour = blue) +
  geom_line(
    data = tibble(
      position = seq_range(arcuate_unif$position),
      fa = FNN::knn.reg(
        arcuate_unif$position, matrix(position, ncol = 1),
        y = arcuate_unif$fa, k = nn # k defaults to 3, but be explicit
      )$pred
    ),
    colour = orange, linewidth = 2
  )
plot_grid(g1, g2, ncol = 2)

KNN

data(arcuate, package = "Stat406")
library(FNN)
arcuate_unif <- arcuate |> 
  slice_sample(n = 40) |> 
  arrange(position) 

new_position <- seq(
  min(arcuate_unif$position), 
  max(arcuate_unif$position),
  length.out = 101
)

knn3 <- knn.reg(
  train = arcuate_unif$position, 
  test = matrix(new_position, ncol = 1), 
  y = arcuate_unif$fa, 
  k = 3
)

This method is \(K\)-nearest neighbours.

It’s a linear smoother just like in previous lectures: \(\widehat{\mathbf{y}} = \mathbf{S} \mathbf{y}\) for some matrix \(\mathbf{S}\).

You should imagine what \(\mathbf{S}\) looks like.

  1. What is the degrees of freedom of KNN, and how does it depend on \(k\)?
  2. How does \(k\) affect the bias/variance?
  • \(\mathrm{df} = \mathrm{tr}(\mathbf{S}) = n/k\), since each point is one of its own \(k\) nearest neighbours, so every diagonal entry of \(\mathbf{S}\) is \(1/k\) (see the sketch below).
  • \(k = n\) produces a constant predictor (highest bias, lowest variance).
  • \(k = 1\) interpolates the training data: low bias but extremely high variance.
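Here is a small sketch (toy one-dimensional \(x\); \(n\) and \(k\) arbitrary) that builds the \(n \times n\) KNN smoother matrix \(\mathbf{S}\) explicitly and checks the claims above.

# Sketch: the KNN smoother matrix S (toy data; n and k are arbitrary).
set.seed(406406)
n <- 10; k <- 3
x <- sort(runif(n))
y <- rnorm(n)
dmat_knn <- abs(outer(x, x, "-"))
S <- t(apply(dmat_knn, 1, function(d) {
  w <- numeric(n)
  w[order(d)[1:k]] <- 1 / k   # equal weight on the k nearest x's (each point is its own nearest)
  w
}))
yhat <- S %*% y   # the KNN predictions at the training points
rowSums(S)        # each row sums to 1: every prediction is an average
sum(diag(S))      # = n / k, the degrees of freedom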
Code
set.seed(406406)

plot_knn <- function(k) {
  ggplot(arcuate_unif, aes(position, fa)) +
    geom_point(colour = blue) +
    geom_line(
      data = tibble(
        position = seq_range(arcuate_unif$position),
        fa = FNN::knn.reg(
          arcuate_unif$position, matrix(position, ncol = 1),
          y = arcuate_unif$fa,
          k = k
        )$pred
      ),
      colour = orange, linewidth = 2
    ) + ggtitle(paste("k =", k))
}

g1 <- plot_knn(1)
g2 <- plot_knn(5)
g3 <- plot_knn(length(arcuate_unif$position)) # k = n
plot_grid(g1, g2, g3, ncol = 3)

Local averages (soft KNN)

KNN averages the neighbours with equal weight.

But some neighbours are “closer” than others.

Instead of choosing the number of neighbours to average, we can average any observations within a certain distance.

The boxes have width 30 (\(\pm 15\) around each prediction point).

What is a “kernel” smoother?

  • The mathematics:

A kernel is any function \(K\) such that

  • \(K(u) \geq 0\) for all \(u\),
  • \(\int K(u)\, du = 1\),
  • \(\int u K(u)\, du = 0\).

  • The idea: a kernel takes weighted averages. The kernel function gives the weights.

  • The previous example is called the boxcar kernel.
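A quick numerical sanity check of the three conditions above, for the boxcar and Gaussian kernels (the particular widths are just for illustration); a simple Riemann sum suffices.

# Sketch: check the kernel conditions numerically (illustrative parameterizations).
boxcar_k   <- function(u) 0.5 * (abs(u) <= 1)   # constant on [-1, 1], zero outside
gaussian_k <- function(u) dnorm(u)              # standard Gaussian density

u <- seq(-6, 6, by = 0.001)                     # grid wide enough for both kernels
check <- function(K) c(
  nonnegative = all(K(u) >= 0),
  integral    = sum(K(u)) * 0.001,              # should be (approximately) 1
  mean_zero   = sum(u * K(u)) * 0.001           # should be (approximately) 0
)
rbind(boxcar = check(boxcar_k), gaussian = check(gaussian_k))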

Smoothing with the boxcar

Code
testpts <- seq(0, 200, length.out = 101)
dmat <- abs(outer(testpts, arcuate_unif$position, "-")) # 101 x 40 matrix of |test point - training point|
S <- (dmat < 15)          # boxcar: nonzero weight only within +/- 15
S <- S / rowSums(S)       # normalise so each row sums to 1
boxcar <- tibble(position = testpts, fa = S %*% arcuate_unif$fa)
ggplot(arcuate_unif, aes(position, fa)) +
  geom_point(colour = blue) +
  geom_line(data = boxcar, colour = orange)

This one gives the same non-zero weight to all points within a \(\pm 15\) range.

Other kernels

Most of the time, we don’t use the boxcar: it gives every point inside the window the same weight, and every point outside no weight at all.
Ideally we would like closer points to have more weight.

A more common kernel: the Gaussian kernel

\[ K(u) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{u^2}{2\sigma^2}\right) \]

For the plot, I made \(\sigma=7.5\).

Now the weights “die away” for points farther from where we’re predicting.
(but all nonzero!!)

Code
gaussian_kernel <- function(x) dnorm(x, mean = arcuate_unif$position[15], sd = 7.5) * 3 # scaled up for display
ggplot(arcuate_unif, aes(position, fa)) +
  geom_point(colour = blue) +
  geom_segment(aes(x = position[15], y = 0, xend = position[15], yend = fa[15]), colour = orange) +
  stat_function(fun = gaussian_kernel, geom = "area", fill = orange)

Other kernels

What if I made \(\sigma=15\)?

Code
gaussian_kernel <- function(x) dnorm(x, mean = arcuate_unif$position[15], sd = 15) * 3 # scaled up for display
ggplot(arcuate_unif, aes(position, fa)) +
  geom_point(colour = blue) +
  geom_segment(aes(x = position[15], y = 0, xend = position[15], yend = fa[15]), colour = orange) +
  stat_function(fun = gaussian_kernel, geom = "area", fill = orange)

Before, points far from \(x_{15}\) got very small weights; now they have more influence.

For the Gaussian kernel, \(\sigma\) determines something like the “range” of the smoother.

Many Gaussians

The following code creates \(\mathbf{S}\) for Gaussian kernel smoothers with different \(\sigma\)

dmat <- as.matrix(dist(x)) # x: the vector (or matrix) of covariate values
Sgauss <- function(sigma) {
  gg <- dnorm(dmat, sd = sigma) # not an argument, uses the global dmat
  sweep(gg, 1, rowSums(gg), "/") # make the rows sum to 1.
}
Code
# here dmat is the 101 x 40 matrix of |test point - training point| distances from the boxcar chunk
Sgauss <- function(sigma) {
  gg <- dnorm(dmat, sd = sigma) # not an argument, uses the global dmat
  sweep(gg, 1, rowSums(gg), "/") # make the rows sum to 1.
}
boxcar$S15 <- with(arcuate_unif, Sgauss(15) %*% fa)
boxcar$S08 <- with(arcuate_unif, Sgauss(8) %*% fa)
boxcar$S30 <- with(arcuate_unif, Sgauss(30) %*% fa)
bc <- boxcar |>
  select(position, S15, S08, S30) |>
  pivot_longer(-position, names_to = "Sigma")
ggplot(arcuate_unif, aes(position, fa)) + 
  geom_point(colour = blue) + 
  geom_line(data = bc, aes(position, value, colour = Sigma), linewidth = 1.5) +
  scale_colour_brewer(palette = "Set1")

The bandwidth

  • Choosing \(\sigma\) is very important.

  • This “range” parameter is called the bandwidth.

  • It is way more important than which kernel you use (see the sketch below).
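Here is a sketch supporting that claim, on made-up data, with the Gaussian and boxcar bandwidths picked by hand so the two smoothers have roughly the same effective degrees of freedom: swapping the kernel barely changes the fit, while changing the bandwidth changes it a lot.

# Sketch: kernel choice vs bandwidth (made-up data; bandwidths chosen to roughly match df).
set.seed(406406)
n <- 200
x <- sort(runif(n, 0, 200))
y <- sin(x / 20) + rnorm(n, sd = 0.3)
dmat_toy <- as.matrix(dist(x))
normalize <- function(W) sweep(W, 1, rowSums(W), "/")      # rows sum to 1
S_gauss  <- function(s) normalize(dnorm(dmat_toy, sd = s))
S_boxcar <- function(h) normalize((dmat_toy < h) * 1)

c(df_gauss = sum(diag(S_gauss(6))), df_boxcar = sum(diag(S_boxcar(7.5))))  # similar effective df
max(abs(S_gauss(6) %*% y - S_boxcar(7.5) %*% y))   # swap the kernel: small change
max(abs(S_gauss(6) %*% y - S_gauss(30) %*% y))     # swap the bandwidth: much larger change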

Choosing the bandwidth

As we have discussed, kernel smoothers (and KNN) are linear smoothers

\[\widehat{\mathbf{y}} = \mathbf{S}\mathbf{y}\]

The degrees of freedom is \(\textrm{tr}(\mathbf{S})\)

Therefore we can use our model selection criteria from before

Unfortunately, these don’t satisfy the “technical condition”, so cv_nice() doesn’t give LOO-CV

Smoothing more of the arcuate data

ar <- arcuate |> slice_sample(n = 200)

gcv <- function(y, S) {
  yhat <- S %*% y
  mean( (y - yhat)^2 / (1 - mean(diag(S)))^2 )  # generalised CV: uses the average leverage tr(S)/n
}

fake_loocv <- function(y, S) {
  yhat <- S %*% y
  mean( (y - yhat)^2 / (1 - diag(S))^2 )  # LOO-CV shortcut formula; not exact here (hence "fake")
}

dmat <- as.matrix(dist(ar$position))
sigmas <- 10^(seq(log10(300), log10(.3), length = 100))

gcvs <- map_dbl(sigmas, ~ gcv(ar$fa, Sgauss(.x)))
flcvs <- map_dbl(sigmas, ~ fake_loocv(ar$fa, Sgauss(.x)))
best_s <- sigmas[which.min(gcvs)]
other_s <- sigmas[which.min(flcvs)]

ar$smoothed <- Sgauss(best_s) %*% ar$fa
ar$other <- Sgauss(other_s) %*% ar$fa

Smoothing more of the arcuate data

Code
g3 <- ggplot(data.frame(sigma = sigmas, gcv = gcvs), aes(sigma, gcv)) +
  geom_point(colour = blue) +
  geom_vline(xintercept = best_s, colour = red) +
  scale_x_log10() +
  xlab(sprintf("Sigma, best is sig = %.2f", best_s))
g4 <- ggplot(ar, aes(position, fa)) +
  geom_point(colour = blue) +
  geom_line(aes(y = smoothed), colour = orange, linewidth = 2)
plot_grid(g3, g4, nrow = 1)

I considered \(\sigma \in [0.3,\ 300]\) and used the GCV-selected value \(\sigma \approx 3.97\).

It’s too wiggly, to my eye. Typical for GCV.

Smoothing manually

I did kernel smoothing “manually”:

  1. Fix a bandwidth \(\sigma\).

  2. Compute the smoothing matrix \(\mathbf{S}\).

  3. Make the predictions \(\widehat{\mathbf{y}} = \mathbf{S}\mathbf{y}\).

  4. Compute GCV, and repeat over a grid of bandwidths.

The point is to “show how it works”. It’s also really easy.

R functions / packages

There are a number of other ways to do this in R

loess()
ksmooth()
KernSmooth::locpoly()
mgcv::gam()
np::npreg()

These have tricks and ways of doing CV and other things automatically.

Note
All I needed was the distance matrix dist(x).
Given ANY distance function, say \(d(\mathbf{x}_i, \mathbf{x}_j) = \Vert\mathbf{x}_i - \mathbf{x}_j\Vert_2 + I(x_{i,3} = x_{j,3})\), I can use these methods (see the sketch below).
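For instance, here is a sketch on a made-up mixed-type dataset, plugging the distance above directly into the same Gaussian-weight construction.

# Sketch: kernel smoothing from a custom distance (toy data; distance taken from the note above).
set.seed(406406)
n <- 100
x1 <- runif(n); x2 <- runif(n)
x3 <- sample(c(0, 1), n, replace = TRUE)              # a third, binary feature
y  <- sin(6 * x1) + x3 + rnorm(n, sd = 0.2)

dmat_custom <- as.matrix(dist(cbind(x1, x2))) + outer(x3, x3, "==")  # ||xi - xj||_2 + I(x_i3 = x_j3)
W <- dnorm(dmat_custom, sd = 0.5)
S <- sweep(W, 1, rowSums(W), "/")    # rows sum to 1, as before
yhat <- S %*% y                      # fitted values from the custom-distance smoother
sum(diag(S))                         # effective degrees of freedom, same machinery as before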

Next time…

Why don’t we just smooth everything all the time?