Two Great Questions from Last Lecture
1) The Bias of OLS
In last class’s clicker question, we were trying to predict income from a \(p=2\), \(n=50000\) dataset. I claimed that the OLS predictor has high bias in this example.
But Trevor proved that the OLS predictor is unbiased (via the following proof):
A. Assume that \(y = x^\top \beta + \epsilon, \quad \epsilon \sim \mathcal N(0, \sigma^2).\)
B. Then \(E[\hat \beta_\mathrm{ols}] = E[ E[\hat \beta_\mathrm{ols} \mid \mathbf X] ]\)
C. \(E[\hat \beta_\mathrm{ols} \mid \mathbf X] = (\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top E[ \mathbf y \mid \mathbf X]\)
D. \(E[ \mathbf y \mid \mathbf X] = \mathbf X \beta\)
E. So \(E[\hat \beta_\mathrm{ols}] - \beta = E[(\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top \mathbf X \beta] - \beta = \beta - \beta = 0\) .
Why did this proof not apply to the clicker question?
(Which step of this proof breaks down?)
1) The Bias of OLS
At which step does the proof break down?
A. Assume that \(y = x^\top \beta + \epsilon, \quad \epsilon \sim \mathcal N(0, \sigma^2).\)
This assumption does not hold.
It is (almost certainly) not the case that Income ~ Age + Education.
In reality, \(y = f(x) + \epsilon\), where \(f(x)\) is some potentially nonlinear function. So
\[
\begin{align}
E[\hat \beta_\mathrm{ols}] = E[E[\hat \beta_\mathrm{ols} \mid \mathbf X ]]
= E[ (\mathbf X^\top \mathbf X)^{-1} \mathbf X^\top f(\mathbf X) ] \ne \beta
\end{align}
\]
In statistics speak, our model is misspecified; a short simulation sketch below makes the resulting bias concrete.
Ridge/lasso still decrease variance (and generally increase bias), even under misspecification.
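A minimal simulation sketch of this bias, where the nonlinear \(f\), the evaluation point, and the sample sizes are all invented for illustration: fit a straight line to data generated from a quadratic, and the average prediction over many datasets misses \(f(x_0)\) systematically.
# Sketch: OLS predictions are biased for f(x0) when the linear model is misspecified.
# (f, x0, n, and the number of replications are arbitrary illustration choices.)
set.seed(406406)
f <- function(x) x + 0.5 * x^2              # a nonlinear "truth"
x0 <- 2                                     # point at which we check the bias
preds <- replicate(1000, {
  x <- runif(100, -2, 2)
  y <- f(x) + rnorm(100, sd = 0.5)
  fit <- lm(y ~ x)                          # misspecified: no quadratic term
  predict(fit, newdata = data.frame(x = x0))
})
mean(preds) - f(x0)                         # systematically nonzero, even after averaging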
2) Why does ridge regression shrink variance?
Mathematically
Variance of OLS: \(\mathrm{Cov}[\hat \beta_\mathrm{ols}] = \sigma^2 E[ (\mathbf X^\top \mathbf X)^{-1} ]\)
Variance of ridge regression: \(\mathrm{Cov}[\hat \beta_\mathrm{ridge}] = \sigma^2 E[ (\mathbf X^\top \mathbf X + \lambda \mathbf I)^{-1} ]\)
Adding \(\lambda \mathbf I\) before inverting shrinks every eigenvalue of the inverse, so the ridge (co)variance is smaller.
Intuitively…
Think about the constrained optimization problem pictorially.
What happens if we had a different dataset? (A short simulation sketch follows below.)
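To make “a different dataset” concrete, here is a minimal simulation sketch; the data-generating process, \(n\), \(p\), and \(\lambda\) are invented for illustration, and ridge is computed directly from its closed form rather than with a package.
# Sketch: resample many datasets; compare how much OLS and ridge estimates vary.
set.seed(406406)
n <- 50; p <- 2; lambda <- 10
beta <- c(1, -1)
estimates <- replicate(1000, {
  X <- matrix(rnorm(n * p), n, p)
  y <- X %*% beta + rnorm(n)
  ols   <- solve(crossprod(X), crossprod(X, y))                      # (X'X)^{-1} X'y
  ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))   # (X'X + lambda I)^{-1} X'y
  c(ols = ols[1], ridge = ridge[1])
})
apply(estimates, 1, var)    # the ridge coefficient varies less across datasets...
apply(estimates, 1, mean)   # ...but is shrunk away from the true value (bias)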
11 Local methods
Stat 406
Geoff Pleiss, Trevor Campbell
Last modified – 08 October 2024
\[
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
\DeclareMathOperator*{\minimize}{minimize}
\DeclareMathOperator*{\maximize}{maximize}
\DeclareMathOperator*{\find}{find}
\DeclareMathOperator{\st}{subject\,\,to}
\newcommand{\E}{E}
\newcommand{\Expect}[1]{\E\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]}
\newcommand{\given}{\ \vert\ }
\newcommand{\X}{\mathbf{X}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\P}{\mathcal{P}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\snorm}[1]{\lVert #1 \rVert}
\newcommand{\tr}[1]{\mbox{tr}(#1)}
\newcommand{\brt}{\widehat{\beta}^R_{s}}
\newcommand{\brl}{\widehat{\beta}^R_{\lambda}}
\newcommand{\bls}{\widehat{\beta}_{ols}}
\newcommand{\blt}{\widehat{\beta}^L_{s}}
\newcommand{\bll}{\widehat{\beta}^L_{\lambda}}
\newcommand{\U}{\mathbf{U}}
\newcommand{\D}{\mathbf{D}}
\newcommand{\V}{\mathbf{V}}
\]
Last time…
We looked at feature maps as a way to do nonlinear regression.
We used new “features” \(\Phi(x) = \bigg(\phi_1(x),\ \phi_2(x),\ldots,\phi_k(x)\bigg)\)
Now we examine a nonparametric alternative
Suppose I just look at the “neighbours” of some point (based on the \(x\)-values)
I just average the \(y\)’s at those locations together
Let’s use 3 neighbours
Code
library(cowplot)
library(tidyverse)  # for slice_sample(), arrange(), ggplot(), tibble()
data(arcuate, package = "Stat406")
set.seed(406406)
arcuate_unif <- arcuate |> slice_sample(n = 40) |> arrange(position)
pt <- 15  # the observation whose neighbourhood we highlight
nn <- 3   # number of neighbours
seq_range <- function(x, n = 101) seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), length.out = n)
# indices of the nn observations closest to point pt
neibs <- sort.int(abs(arcuate_unif$position - arcuate_unif$position[pt]), index.return = TRUE)$ix[1:nn]
arcuate_unif$neighbours <- seq_len(40) %in% neibs
# left panel: the neighbourhood of point pt
# (blue, red, orange are colour constants defined in the course slide setup)
g1 <- ggplot(arcuate_unif, aes(position, fa, colour = neighbours)) +
  geom_point() +
  scale_colour_manual(values = c(blue, red)) +
  geom_vline(xintercept = arcuate_unif$position[pt], colour = red) +
  annotate("rect", fill = red, alpha = .25, ymin = -Inf, ymax = Inf,
    xmin = min(arcuate_unif$position[neibs]),
    xmax = max(arcuate_unif$position[neibs])
  ) +
  theme(legend.position = "none")
# right panel: the 3-nearest-neighbour fit (knn.reg defaults to k = 3)
g2 <- ggplot(arcuate_unif, aes(position, fa)) +
  geom_point(colour = blue) +
  geom_line(
    data = tibble(
      position = seq_range(arcuate_unif$position),
      fa = FNN::knn.reg(
        arcuate_unif$position, matrix(position, ncol = 1),
        y = arcuate_unif$fa
      )$pred
    ),
    colour = orange, linewidth = 2
  )
plot_grid(g1, g2, ncol = 2)
KNN
data(arcuate, package = "Stat406")
library(FNN)
arcuate_unif <- arcuate |>
  slice_sample(n = 40) |>
  arrange(position)
# a grid of positions at which to predict
new_position <- seq(
  min(arcuate_unif$position),
  max(arcuate_unif$position),
  length.out = 101
)
# 3-nearest-neighbour predictions on the 101-point grid
knn3 <- knn.reg(
  train = arcuate_unif$position,
  test = matrix(new_position, ncol = 1),
  y = arcuate_unif$fa,
  k = 3
)
This method is \(K\)-nearest neighbours.
It’s a linear smoother just like in previous lectures: \(\widehat{\mathbf{y}} = \mathbf{S} \mathbf{y}\) for some matrix \(\mathbf{S}\).
You should imagine what \(\mathbf{S}\) looks like (a sketch constructing it explicitly follows below).
What is the degrees of freedom of KNN, and how does it depend on \(k\) ?
How does \(k\) affect the bias/variance?
\(\mathrm{df} = \tr{\mathbf S} = n/k\) .
\(k = n\) produces a constant predictor (highest bias, lowest variance).
\(k = 1\) produces a low bias but extremely high variance predictor.
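To take “imagine what \(\mathbf{S}\) looks like” literally, here is a sketch that builds the KNN smoother matrix by hand for the 40 points in arcuate_unif (from the code above) and checks the degrees-of-freedom formula. It should agree with FNN::knn.reg at the training points, up to tie-breaking.
# Sketch: the KNN smoother matrix S for the training points (k = 3).
k <- 3
n <- nrow(arcuate_unif)
dmat_knn <- as.matrix(dist(arcuate_unif$position))
S <- matrix(0, n, n)
for (i in 1:n) {
  neighbours <- order(dmat_knn[i, ])[1:k]  # the k nearest points (point i is its own nearest)
  S[i, neighbours] <- 1 / k                # equal weight 1/k on each neighbour
}
yhat <- S %*% arcuate_unif$fa              # fitted values: yhat = S y
sum(diag(S))                               # df = tr(S) = n / k
Each row of \(\mathbf{S}\) has exactly \(k\) entries equal to \(1/k\) (on that point’s neighbours) and zeros elsewhere.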
Code
set.seed(406406)
# KNN fits on a fine grid of positions, for a given k
plot_knn <- function(k) {
  ggplot(arcuate_unif, aes(position, fa)) +
    geom_point(colour = blue) +
    geom_line(
      data = tibble(
        position = seq_range(arcuate_unif$position),
        fa = FNN::knn.reg(
          arcuate_unif$position, matrix(position, ncol = 1),
          y = arcuate_unif$fa,
          k = k
        )$pred
      ),
      colour = orange, linewidth = 2
    ) + ggtitle(paste("k =", k))
}
g1 <- plot_knn(1)                              # low bias, very high variance
g2 <- plot_knn(5)
g3 <- plot_knn(length(arcuate_unif$position))  # k = n: a constant predictor
plot_grid(g1, g2, g3, ncol = 3)
Local averages (soft KNN)
KNN averages the neighbours with equal weight.
But some neighbours are “closer” than other neighbours.
Instead of choosing the number of neighbours to average, we can average any observations within a certain distance.
The boxes have width 30.
What is a “kernel” smoother?
A kernel is any function \(K\) such that
\(K(u) \geq 0\) for all \(u\),
\(\int K(u)\, du = 1\),
\(\int u K(u)\, du = 0\).
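Putting the kernel to work: given a bandwidth \(h\) (the scale parameter discussed a few slides ahead), the kernel smoother at a point \(x\) is a weighted average of the \(y_i\)’s, with weights that fade with distance. This is the standard form, not course-specific notation:
\[
\hat f(x) = \sum_{i=1}^n w_i(x)\, y_i,
\qquad
w_i(x) = \frac{K\!\left(\frac{x - x_i}{h}\right)}{\sum_{j=1}^n K\!\left(\frac{x - x_j}{h}\right)}.
\]
The rows of the smoothing matrix \(\mathbf{S}\) are exactly these weights, and the boxcar kernel \(K(u) = \tfrac{1}{2}\, \mathbf{1}\{|u| \le 1\}\) recovers “average everything within distance \(h\)”.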
Smoothing with the boxcar
Code
testpts <- seq(0, 200, length.out = 101)
# distance from each test point to each observation
dmat <- abs(outer(testpts, arcuate_unif$position, "-"))
S <- (dmat < 15)      # boxcar: weight everything within +/- 15 equally
S <- S / rowSums(S)   # make each row sum to 1
boxcar <- tibble(position = testpts, fa = S %*% arcuate_unif$fa)
ggplot(arcuate_unif, aes(position, fa)) +
  geom_point(colour = blue) +
  geom_line(data = boxcar, colour = orange)
This one gives the same non-zero weight to all points within a \(\pm 15\) range.
Other kernels
Most of the time, we don’t use the boxcar because the weights are weird.
Ideally we would like closer points to have more weight.
A more common kernel: the Gaussian kernel
\[
K(u) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{u^2}{2\sigma^2}\right)
\]
For the plot, I made \(\sigma=7.5\) .
Now the weights “die away” for points farther from where we’re predicting.
(but all nonzero!!)
Code
# Gaussian weights centred at observation 15 (scaled by 3 so the curve is visible on the fa axis)
gaussian_kernel <- function(x) dnorm(x, mean = arcuate_unif$position[15], sd = 7.5) * 3
ggplot(arcuate_unif, aes(position, fa)) +
  geom_point(colour = blue) +
  geom_segment(aes(x = position[15], y = 0, xend = position[15], yend = fa[15]), colour = orange) +
  stat_function(fun = gaussian_kernel, geom = "area", fill = orange)
Other kernels
What if I made \(\sigma=15\) ?
Code
gaussian_kernel <- function(x) dnorm(x, mean = arcuate_unif$position[15], sd = 15) * 3
ggplot(arcuate_unif, aes(position, fa)) +
  geom_point(colour = blue) +
  geom_segment(aes(x = position[15], y = 0, xend = position[15], yend = fa[15]), colour = orange) +
  stat_function(fun = gaussian_kernel, geom = "area", fill = orange)
Before, points far from \(x_{15}\) got very small weights; now they have more influence.
For the Gaussian kernel, \(\sigma\) determines something like the “range” of the smoother.
Many Gaussians
The following code creates \(\mathbf{S}\) for Gaussian kernel smoothers with different \(\sigma\):
dmat <- as.matrix(dist(x))
Sgauss <- function(sigma) {
  gg <- dnorm(dmat, sd = sigma)   # dmat is not an argument; it uses the global dmat
  sweep(gg, 1, rowSums(gg), "/")  # make the rows sum to 1
}
Code
Sgauss <- function(sigma) {
  gg <- dnorm(dmat, sd = sigma)   # dmat is not an argument; it uses the global dmat
  sweep(gg, 1, rowSums(gg), "/")  # make the rows sum to 1
}
# here dmat is the testpts-by-observations matrix from the boxcar chunk,
# so each smoother is evaluated at the same 101 test points
boxcar$S15 <- with(arcuate_unif, Sgauss(15) %*% fa)
boxcar$S08 <- with(arcuate_unif, Sgauss(8) %*% fa)
boxcar$S30 <- with(arcuate_unif, Sgauss(30) %*% fa)
bc <- boxcar %>%
  select(position, S15, S08, S30) %>%
  pivot_longer(-position, names_to = "Sigma")
ggplot(arcuate_unif, aes(position, fa)) +
  geom_point(colour = blue) +
  geom_line(data = bc, aes(position, value, colour = Sigma), linewidth = 1.5) +
  scale_colour_brewer(palette = "Set1")
The bandwidth
Choosing \(\sigma\) is very important.
This “range” parameter is called the bandwidth.
It is way more important than which kernel you use.
Choosing the bandwidth
As we have discussed, kernel smoothing (and KNN) are linear smoothers
\[\widehat{\mathbf{y}} = \mathbf{S}\mathbf{y}\]
The degrees of freedom is \(\textrm{tr}(\mathbf{S})\)
Therefore we can use our model selection criteria from before
Unfortunately, these don’t satisfy the “technical condition”, so cv_nice() doesn’t give LOO-CV.
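For reference, the two criteria computed in the code on the next slide are, in terms of the hat matrix \(\mathbf{S}\): GCV, which replaces each diagonal entry by the average \(\mathrm{tr}(\mathbf{S})/n\), and the usual leave-one-out shortcut, which the code calls fake_loocv because of the caveat above.
\[
\mathrm{GCV} = \frac{1}{n} \sum_{i=1}^n \frac{(y_i - \hat y_i)^2}{\left(1 - \mathrm{tr}(\mathbf{S})/n\right)^2},
\qquad
\textrm{LOO-CV shortcut} = \frac{1}{n} \sum_{i=1}^n \frac{(y_i - \hat y_i)^2}{(1 - S_{ii})^2}.
\]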
Smoothing the full arcuate data
ar <- arcuate |> slice_sample(n = 200)
# generalized cross-validation for a linear smoother S
gcv <- function(y, S) {
  yhat <- S %*% y
  mean((y - yhat)^2 / (1 - mean(diag(S)))^2)
}
# the LOO-CV shortcut formula (not exact LOO-CV for these smoothers)
fake_loocv <- function(y, S) {
  yhat <- S %*% y
  mean((y - yhat)^2 / (1 - diag(S))^2)
}
dmat <- as.matrix(dist(ar$position))
sigmas <- 10^(seq(log10(300), log10(.3), length.out = 100))
gcvs <- map_dbl(sigmas, ~ gcv(ar$fa, Sgauss(.x)))
flcvs <- map_dbl(sigmas, ~ fake_loocv(ar$fa, Sgauss(.x)))
best_s <- sigmas[which.min(gcvs)]
other_s <- sigmas[which.min(flcvs)]
ar$smoothed <- Sgauss(best_s) %*% ar$fa
ar$other <- Sgauss(other_s) %*% ar$fa
Smoothing the full arcuate data
Code
g3 <- ggplot(data.frame(sigma = sigmas, gcv = gcvs), aes(sigma, gcv)) +
  geom_point(colour = blue) +
  geom_vline(xintercept = best_s, colour = red) +
  scale_x_log10() +
  xlab(sprintf("Sigma, best is sig = %.2f", best_s))
g4 <- ggplot(ar, aes(position, fa)) +
  geom_point(colour = blue) +
  geom_line(aes(y = smoothed), colour = orange, linewidth = 2)
plot_grid(g3, g4, nrow = 1)
I considered \(\sigma \in [0.3,\ 300]\) and used \(3.97\) .
It’s too wiggly, to my eye. Typical for GCV.
Smoothing manually
I did Kernel Smoothing “manually”
For a fixed bandwidth
Compute the smoothing matrix
Make the predictions
Repeat and compute GCV
The point is to “show how it works”. It’s also really easy.
R functions / packages
There are a number of other ways to do this in R
loess()
ksmooth()
KernSmooth::locpoly()
mgcv::gam()
np::npreg()
These have tricks and ways of doing CV and other things automatically (a couple of example calls are sketched below).
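As a rough sketch of what a couple of these calls look like on the same data (the span and bandwidth values are arbitrary choices for illustration, not recommendations):
lo <- loess(fa ~ position, data = arcuate_unif, span = 0.5)       # local polynomial regression
ks <- ksmooth(arcuate_unif$position, arcuate_unif$fa,
              kernel = "normal", bandwidth = 30, n.points = 101)  # Nadaraya-Watson smoother
head(predict(lo))  # fitted values at the observed positions
head(ks$y)         # smoothed values on a grid of 101 points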
Note
All I needed was the distance matrix dist(x).
Given ANY distance function
say, \(d(\mathbf{x}_i, \mathbf{x}_j) = \Vert\mathbf{x}_i - \mathbf{x}_j\Vert_2 + I(x_{i,3} = x_{j,3})\)
I can use these methods.
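For instance, a minimal sketch (the helper name kernel_smooth is mine, not from the lecture) that feeds any precomputed distance matrix into the same row-normalized Gaussian weighting as Sgauss() above:
# Sketch: kernel smoothing from an arbitrary precomputed distance matrix.
kernel_smooth <- function(D, y, sigma) {
  W <- dnorm(D, sd = sigma)          # turn distances into Gaussian weights
  W <- sweep(W, 1, rowSums(W), "/")  # rows sum to 1, as in Sgauss()
  W %*% y                            # predictions are weighted averages of y
}
# D can come from any distance: Euclidean, Euclidean plus an indicator term, etc.
D <- as.matrix(dist(arcuate_unif$position))
head(kernel_smooth(D, arcuate_unif$fa, sigma = 15))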