11 Local methods

Stat 406

Daniel J. McDonald

Last modified – 09 October 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \]

Last time…

We looked at feature maps as a way to do nonlinear regression.

We used new “features” \(\Phi(x) = \bigg(\phi_1(x),\ \phi_2(x),\ldots,\phi_k(x)\bigg)\)

Now we examine an alternative

Suppose I just look at the “neighbours” of some point (based on the \(x\)-values)

I just average the \(y\)’s at those locations together

Let’s use 3 neighbours

Code
library(cowplot)
library(tidyverse) # ggplot2, dplyr, tidyr, tibble, purrr (used throughout these slides)
data(arcuate, package = "Stat406")
set.seed(406406)
# blue, red, orange (and green, later) are colour objects defined in the course setup
arcuate_unif <- arcuate |> slice_sample(n = 40) |> arrange(position)
pt <- 15
nn <-  3
seq_range <- function(x, n = 101) seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), length.out = n)
neibs <- sort.int(abs(arcuate_unif$position - arcuate_unif$position[pt]), index.return = TRUE)$ix[1:nn]
arcuate_unif$neighbours <- seq_len(40) %in% neibs
g1 <- ggplot(arcuate_unif, aes(position, fa, colour = neighbours)) + 
  geom_point() +
  scale_colour_manual(values = c(blue, red)) + 
  geom_vline(xintercept = arcuate_unif$position[pt], colour = red) + 
  annotate("rect", fill = red, alpha = .25, ymin = -Inf, ymax = Inf,
           xmin = min(arcuate_unif$position[neibs]), 
           xmax = max(arcuate_unif$position[neibs])
  ) +
  theme(legend.position = "none")
g2 <- ggplot(arcuate_unif, aes(position, fa)) +
  geom_point(colour = blue) +
  geom_line(
    data = tibble(
      position = seq_range(arcuate_unif$position),
      fa = FNN::knn.reg(
        arcuate_unif$position, matrix(position, ncol = 1),
        y = arcuate_unif$fa, k = nn # 3 neighbours (also knn.reg's default)
      )$pred
    ),
    colour = orange, linewidth = 2
  )
plot_grid(g1, g2, ncol = 2)

KNN

data(arcuate, package = "Stat406")
library(FNN)
arcuate_unif <- arcuate |> 
  slice_sample(n = 40) |> 
  arrange(position) 

new_position <- seq(
  min(arcuate_unif$position), 
  max(arcuate_unif$position),
  length.out = 101
)

knn3 <- knn.reg(
  train = arcuate_unif$position, 
  test = matrix(new_position, ncol = 1), # predict at the grid of new positions defined above
  y = arcuate_unif$fa, 
  k = 3
)

This method is \(K\)-nearest neighbours.

It’s a linear smoother just like in previous lectures: \(\widehat{\mathbf{y}} = \mathbf{S} \mathbf{y}\) for some matrix \(\mathbf{S}\).

You should imagine what \(\mathbf{S}\) looks like.

What are the degrees of freedom of KNN?
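To make \(\mathbf{S}\) concrete, here is a small sketch (my own illustration, not part of the original code) that builds the KNN smoother matrix explicitly and computes its trace:

k <- 3
n <- nrow(arcuate_unif)
D <- abs(outer(arcuate_unif$position, arcuate_unif$position, "-")) # pairwise distances between the x's
S_knn <- t(apply(D, 1, function(d) {
  w <- numeric(n)
  w[order(d)[1:k]] <- 1 / k # weight 1/k on the k nearest x's (each point is its own nearest neighbour)
  w
}))
yhat <- S_knn %*% arcuate_unif$fa # fitted values at the observed positions
sum(diag(S_knn)) # degrees of freedom = tr(S); every diagonal entry is 1/k, so this equals n / k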

KNN averages the neighbours with equal weight.

But some neighbours are “closer” than others.

Local averages

Instead of choosing the number of neighbours to average, we can average any observations within a certain distance.

The boxes have width 30, i.e., they include all observations within \(\pm 15\).

What is a “kernel” smoother?

  • The mathematics:

A kernel is any function \(K\) such that for any \(u\), \(K(u) \geq 0\), \(\int K(u)\,du=1\), and \(\int u\,K(u)\,du=0\).

  • The idea: a kernel is a nice way to take weighted averages. The kernel function gives the weights (written out just below).

  • The previous example is called the boxcar kernel.
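In symbols (the standard kernel-smoother form, with \(h\) the bandwidth discussed later), the prediction at a point \(x\) is the weighted average

\[\widehat{f}(x) = \sum_{i=1}^n w_i(x)\, y_i, \qquad w_i(x) = \frac{K\left(\frac{x - x_i}{h}\right)}{\sum_{j=1}^n K\left(\frac{x - x_j}{h}\right)},\]

so the weights are nonnegative and sum to 1; stacking the \(w_i(x)\) over a grid of \(x\)’s gives the rows of \(\mathbf{S}\).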

Smoothing with the boxcar

Code
testpts <- seq(0, 200, length.out = 101)
dmat <- abs(outer(testpts, arcuate_unif$position, "-")) # 101 x 40 distances: test points vs. data
S <- (dmat < 15)    # boxcar: indicator of being within +/- 15
S <- S / rowSums(S) # normalize so each row sums to 1 (equal weight on the included points)
boxcar <- tibble(position = testpts, fa = S %*% arcuate_unif$fa)
ggplot(arcuate_unif, aes(position, fa)) +
  geom_point(colour = blue) +
  geom_line(data = boxcar, colour = orange)

This one gives the same non-zero weight to all points within a \(\pm 15\) range, and zero weight to everything else.
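Written as a kernel function in the sense above (a standard form, analogous to the epan() function that appears later):

boxcar_kernel <- function(u) 0.5 * (abs(u) < 1) # constant on [-1, 1], zero outside; nonnegative, integrates to 1, mean zero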

Other kernels

Most of the time, we don’t use the boxcar because the weights are weird: constant inside the window and zero outside.

A more common one is the Gaussian kernel:

Code
gaussian_kernel <- function(x) dnorm(x, mean = arcuate_unif$position[15], sd = 7.5) * 3 # the factor of 3 just scales the curve for display
ggplot(arcuate_unif, aes(position, fa)) +
  geom_point(colour = blue) +
  geom_segment(aes(x = position[15], y = 0, xend = position[15], yend = fa[15]), colour = orange) +
  stat_function(fun = gaussian_kernel, geom = "area", fill = orange)

For the plot, I made \(\sigma=7.5\).

Now the weights “die away” for points farther from where we’re predicting (but they are all nonzero!).

Other kernels

What if I made \(\sigma=15\)?

Code
gaussian_kernel <- function(x) dnorm(x, mean = arcuate_unif$position[15], sd = 15) * 3 # again scaled by 3 for display
ggplot(arcuate_unif, aes(position, fa)) +
  geom_point(colour = blue) +
  geom_segment(aes(x = position[15], y = 0, xend = position[15], yend = fa[15]), colour = orange) +
  stat_function(fun = gaussian_kernel, geom = "area", fill = orange)

Before, points far from \(x_{15}\) got very small weights; now they have more influence.

For the Gaussian kernel, \(\sigma\) determines something like the “range” of the smoother.

Many Gaussians

The following code creates \(\mathbf{S}\) for Gaussian kernel smoothers with different \(\sigma\)

dmat <- as.matrix(dist(x)) # x stands for the vector of predictor values
Sgauss <- function(sigma) {
  gg <- dnorm(dmat, sd = sigma) # not an argument, uses the global dmat
  sweep(gg, 1, rowSums(gg), "/") # make the rows sum to 1.
}
Code
# Sgauss() as defined just above. Here dmat is still the 101 x 40 distance matrix
# (test points vs. data) from the boxcar slide, so each smooth is evaluated at testpts.
boxcar$S15 <- with(arcuate_unif, Sgauss(15) %*% fa)
boxcar$S08 <- with(arcuate_unif, Sgauss(8) %*% fa)
boxcar$S30 <- with(arcuate_unif, Sgauss(30) %*% fa)
bc <- boxcar |> select(position, S15, S08, S30) |> 
  pivot_longer(-position, names_to = "Sigma")
ggplot(arcuate_unif, aes(position, fa)) + 
  geom_point(colour = blue) + 
  geom_line(data = bc, aes(position, value, colour = Sigma), linewidth = 1.5) +
  scale_colour_brewer(palette = "Set1")

The bandwidth

  • Choosing \(\sigma\) is very important.

  • This “range” parameter is called the bandwidth.

  • It is way more important than which kernel you use.

  • Another common kernel is the ‘Epanechnikov’ kernel:

epan <- function(x) 3 / 4 * (1 - x^2) * (abs(x) < 1) # Epanechnikov kernel: a parabola on [-1, 1], zero outside
ggplot(data.frame(x = c(-2, 2)), aes(x)) + stat_function(fun = epan, colour = green, linewidth = 2)

Choosing the bandwidth

As we have discussed, kernel smoothers (and KNN) are linear smoothers:

\[\widehat{\mathbf{y}} = \mathbf{S}\mathbf{y}\]

The degrees of freedom is \(\textrm{tr}(\mathbf{S})\)

Therefore we can use our model selection criteria from before

Unfortunately, these don’t satisfy the “technical condition”, so the shortcut formula in cv_nice() doesn’t give exact LOO-CV.
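For reference, the two criteria used on the next slides are the standard ones (they match the gcv() and fake_loocv() functions below):

\[\textrm{LOO-CV} = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \widehat{y}_i}{1 - S_{ii}}\right)^2, \qquad \textrm{GCV} = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \widehat{y}_i}{1 - \textrm{tr}(\mathbf{S})/n}\right)^2.\]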

Smoothing the arcuate data

ar <- arcuate |> slice_sample(n = 200)

gcv <- function(y, S) {
  yhat <- S %*% y
  mean( (y - yhat)^2 / (1 - mean(diag(S)))^2 ) # GCV: the denominator uses tr(S) / n
}

fake_loocv <- function(y, S) {
  yhat <- S %*% y
  mean( (y - yhat)^2 / (1 - diag(S))^2 ) # LOO-CV shortcut: only exact under the technical condition
}

dmat <- as.matrix(dist(ar$position))
sigmas <- 10^(seq(log10(300), log10(.3), length.out = 100)) # 100 log-spaced bandwidths from 300 down to 0.3

gcvs <- map_dbl(sigmas, ~ gcv(ar$fa, Sgauss(.x)))
flcvs <- map_dbl(sigmas, ~ fake_loocv(ar$fa, Sgauss(.x)))
best_s <- sigmas[which.min(gcvs)]
other_s <- sigmas[which.min(flcvs)]

ar$smoothed <- Sgauss(best_s) %*% ar$fa
ar$other <- Sgauss(other_s) %*% ar$fa

Smoothing the arcuate data

Code
g3 <- ggplot(data.frame(sigma = sigmas, gcv = gcvs), aes(sigma, gcv)) +
  geom_point(colour = blue) +
  geom_vline(xintercept = best_s, colour = red) +
  scale_x_log10() +
  xlab(sprintf("Sigma, best is sig = %.2f", best_s))
g4 <- ggplot(ar, aes(position, fa)) +
  geom_point(colour = blue) +
  geom_line(aes(y = smoothed), colour = orange, linewidth = 2)
plot_grid(g3, g4, nrow = 1)

I considered \(\sigma \in [0.3,\ 300]\); GCV chose \(\sigma = 3.97\).

It’s too wiggly, to my eye. Typical for GCV.
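As a check (my addition, using the ar$other column computed above but never plotted), you can overlay the fit chosen by the LOO-CV shortcut to see how the two criteria differ:

ggplot(ar, aes(position, fa)) +
  geom_point(colour = blue) +
  geom_line(aes(y = smoothed), colour = orange, linewidth = 2) + # GCV choice
  geom_line(aes(y = other), colour = green, linewidth = 2)       # LOO-CV shortcut choice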

Smoothing manually

I did kernel smoothing “manually”:

  1. Fix a bandwidth.

  2. Compute the smoothing matrix.

  3. Make the predictions.

  4. Repeat over a grid of bandwidths and compute GCV.

The point is to “show how it works”. It’s also really easy.

R functions / packages

There are a number of other ways to do this in R

loess()
ksmooth()
KernSmooth::locpoly()
mgcv::gam()
np::npreg()

These have tricks and ways of doing CV and other things automatically.
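For instance, here is a minimal sketch with base R’s ksmooth() on the same 40-point sample (the bandwidth of 30 is just an illustration, not a tuned choice):

ks <- with(arcuate_unif,
           ksmooth(position, fa, kernel = "normal", bandwidth = 30, n.points = 101))
ggplot(arcuate_unif, aes(position, fa)) +
  geom_point(colour = blue) +
  geom_line(data = tibble(position = ks$x, fa = ks$y), colour = orange, linewidth = 2)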

Note
All I needed was the distance matrix dist(x).
Given ANY distance function (say, \(d(\mathbf{x}_i, \mathbf{x}_j) = \Vert\mathbf{x}_i - \mathbf{x}_j\Vert_2 + I(x_{i,3} = x_{j,3})\)), I can use these methods.
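As a hypothetical sketch (the data frame df and columns x1, x2, x3, y below are made-up names), the recipe is just: compute your distance matrix, then reuse Sgauss():

d_custom <- function(df) {
  euclid <- as.matrix(dist(df[, c("x1", "x2")])) # Euclidean part on the numeric columns
  indic <- outer(df$x3, df$x3, `==`) * 1         # I(x_{i,3} = x_{j,3})
  euclid + indic
}
# dmat <- d_custom(df)              # any distance matrix works
# fitted <- Sgauss(sigma) %*% df$y  # then smooth exactly as before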

Next time…

Why don’t we just smooth everything all the time?