Stat 406
Geoff Pleiss, Trevor Campbell
Last modified – 07 October 2024
\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \]
\[ \begin{gathered} \text{Model:} \quad y = x^\top \beta + \epsilon, \qquad \epsilon \sim N(0, \sigma^2) \\ \text{OLS:} \quad \bls = \argmin_\beta \| \y - \X\beta\|_2^2\quad \end{gathered} \]
When is \((\X^\top \X)^{-1}\) large?
When we have nearly collinear features
Nearly collinear features \(\Rightarrow\) small singular values \(\Rightarrow\) a large matrix inverse
(Collinearity is more likely when \(n\) is small.)
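As a quick numerical illustration (my own sketch, not course code; the setup is arbitrary): two nearly collinear columns give one tiny singular value, and the entries of \((\X^\top \X)^{-1}\) blow up.
set.seed(406)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01) # nearly collinear with x1
X <- cbind(x1, x2)
svd(X)$d            # one singular value is close to 0
solve(crossprod(X)) # entries of (X^T X)^{-1} are huge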
Ridge shrinks all parameters towards 0.
Lasso performs automatic variable selection.
Increasing \(\lambda\) increases bias and decreases variance.
(For ridge, larger \(\lambda\) \(\rightarrow\) smaller \(\beta\).)
(For lasso, larger \(\lambda\) \(\rightarrow\) sparser \(\beta\).)
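A hedged illustration of these two facts with glmnet on simulated data (my own sketch; the data-generating setup is arbitrary): the coefficient paths show ridge coefficients shrinking smoothly toward zero and lasso coefficients hitting exactly zero as \(\lambda\) grows.
library(glmnet)
set.seed(406)
n <- 100
p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(2, -1, 0.5) + rnorm(n))
ridge <- glmnet(X, y, alpha = 0) # ridge
lasso <- glmnet(X, y, alpha = 1) # lasso
par(mfrow = c(1, 2))
plot(ridge, xvar = "lambda") # coefficients shrink smoothly toward 0
plot(lasso, xvar = "lambda") # coefficients are set exactly to 0 one by one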
(This proof is not too hard if you use facts about SVDs and eigenvalues. I recommend working through it.)
Let \(\mathbf{U}\mathbf{D}\mathbf{V}^\top = \X\) be the SVD of \(\X\).
\[ \begin{align} \brl &= (\X^\top \X + \lambda \mathbf{I})^{-1} \X^\top \y \\ &= (\X^\top \X + \lambda \mathbf{I})^{-1} {\color{blue} \X^\top \X (\X^\top \X)^{-1}} \X^\top \y \\ &= (\X^\top \X + \lambda \mathbf{I})^{-1} \X^\top \X \underbrace{\left( (\X^\top \X)^{-1} \X^\top \y \right)}_{\bls} \\ &= (\mathbf V \mathbf D^2 \mathbf V^\top + \lambda \mathbf{I})^{-1} \mathbf V \mathbf D^2 \mathbf V^\top \bls \\ &= (\mathbf V (\mathbf D^2 + \lambda I) \mathbf V^\top)^{-1} \mathbf V \mathbf D^2 \mathbf V^\top \bls \\ &= \mathbf V (\mathbf D^2)(\mathbf D^2 + \lambda I)^{-1} \mathbf V^\top \bls \end{align} \]
\((\mathbf D^2)(\mathbf D^2 + \lambda I)^{-1}\) is a diagonal matrix with entries \(d_i^2/(d_i^2 + \lambda) < 1\).
So \(\mathbf V (\mathbf D^2)(\mathbf D^2 + \lambda I)^{-1} \mathbf V^\top\) shrinks any vector it multiplies: each component in the \(\mathbf V\)-basis is scaled by a factor less than 1.
\[ \Vert \mathbf V (\mathbf D^2)(\mathbf D^2 + \lambda I)^{-1} \mathbf V^\top \bls \Vert_2 < \Vert \bls \Vert_2. \] So \(\Vert \brl \Vert_2 < \Vert \bls \Vert_2\)
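A quick numerical check of this identity and the shrinkage conclusion (an illustrative sketch on simulated data, not course code):
set.seed(406)
n <- 100
p <- 5
lambda <- 1
X <- matrix(rnorm(n * p), n, p)
y <- X %*% rnorm(p) + rnorm(n)
bols <- solve(crossprod(X), crossprod(X, y))                       # OLS
bridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))  # ridge
s <- svd(X)                                                        # X = U D V^T
shrinker <- s$v %*% diag(s$d^2 / (s$d^2 + lambda)) %*% t(s$v)
max(abs(bridge - shrinker %*% bols))                    # ~ 0: the identity holds
c(ridge = sqrt(sum(bridge^2)), ols = sqrt(sum(bols^2))) # ridge norm is smaller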
(But first, more clickers!)
\[\text{Our usual model:} \quad \Expect{Y \given X=x} = \sum_{j=1}^p x_j\beta_j\]
Now we relax this assumption of linearity:
\[\Expect{Y \given X=x} = f(x)\]
How do we estimate \(f\)?
For this lecture, we use \(x \in \R\) (1 dimensional)
Higher dimensions are possible, but complexity grows exponentially.
We’ll see some special techniques for \(x\in\R^p\) later in this module.
For any sufficiently smooth \(f : \R \rightarrow [0,1]\),
\[f(x) = f(x_0) + f'(x_0)(x-x_0) + \frac{1}{2}f''(x_0)(x-x_0)^2 + \frac{1}{3!}f'''(x_0)(x-x_0)^3 + R_3(x-x_0)\]
So we can linearly regress \(y_i = f(x_i)\) on the polynomial terms.
The more terms we use, the smaller the remainder.
# arcuate and the colour objects (blue, green, red, orange) are defined in the course setup (not shown here)
arcuate %>%
  ggplot(aes(position, fa)) +
  geom_point(color = blue) +
  geom_smooth(aes(color = "a"), formula = y ~ poly(x, 4), method = "lm", se = FALSE) +
  geom_smooth(aes(color = "b"), formula = y ~ poly(x, 7), method = "lm", se = FALSE) +
  geom_smooth(aes(color = "c"), formula = y ~ poly(x, 25), method = "lm", se = FALSE) +
  scale_color_manual(
    name = "Taylor order",
    values = c(green, red, orange),
    labels = c("4 terms", "7 terms", "25 terms")
  )
Really, this is still linear regression, just in a transformed space.
It’s not linear in \(x\), but it is linear in \((x,x^2,x^3)\) (for the 3rd-order case)
So, we’re still doing OLS with
\[\X=\begin{bmatrix}1& x_1 & x_1^2 & x_1^3 \\ \vdots&&&\vdots\\1& x_n & x_n^2 & x_n^3\end{bmatrix}\]
So we can still use our nice formulas for LOO-CV, GCV, Cp, AIC, etc.
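For example, the LOO-CV shortcut for OLS needs only the residuals and the leverages. I assume this is what the cv_nice() helper used in the simulation code later on computes; a minimal sketch:
# LOO-CV shortcut for OLS: mean of (residual_i / (1 - h_ii))^2, with h_ii the leverages
cv_nice <- function(fit) mean((residuals(fit) / (1 - hatvalues(fit)))^2)
# e.g., LOO-CV scores over polynomial degrees 1:20 on the arcuate data:
# cvscores <- sapply(1:20, function(k) cv_nice(lm(fa ~ poly(position, k), data = arcuate)))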
library(cowplot)
# cvscores and max_deg come from the LOO-CV computation (see the cv_nice sketch above)
g1 <- ggplot(tibble(cvscores, degrees = seq(max_deg)), aes(degrees, cvscores)) +
  geom_point(colour = blue) +
  geom_line(colour = blue) +
  labs(y = "LOO-CV", x = "polynomial degree") + # labs() takes x/y, not xlab/ylab
  geom_vline(xintercept = which.min(cvscores), linetype = "dotted")
g2 <- ggplot(arcuate, aes(position, fa)) +
  geom_point(colour = blue) +
  geom_smooth(
    colour = orange,
    formula = y ~ poly(x, which.min(cvscores)),
    method = "lm",
    se = FALSE
  )
plot_grid(g1, g2, ncol = 2)
library(cowplot)
library(ggplot2)

relu_shifted <- function(x, shift) pmax(0, x - shift)

# Create a sequence of x values
x_vals <- seq(-3, 3, length.out = 1000)

# Create a data frame with all the shifted functions
data <- data.frame(
  x = rep(x_vals, 5),
  polynomial = c(x_vals, x_vals^2, x_vals^3, x_vals^4, x_vals^5),
  linear.splines = c(
    relu_shifted(x_vals, 2), relu_shifted(x_vals, 1), relu_shifted(x_vals, 0),
    relu_shifted(x_vals, -1), relu_shifted(x_vals, -2)
  ),
  fourier = c(
    cos(pi / 2 * x_vals), sin(pi / 2 * x_vals),
    cos(pi / 4 * x_vals), sin(pi / 4 * x_vals), cos(pi * x_vals)
  ),
  function_label = rep(c("f1", "f2", "f3", "f4", "f5"), each = length(x_vals))
)

# Plot using ggplot2: one panel per basis family
g1 <- ggplot(data, aes(x = x, y = polynomial, color = function_label)) +
  geom_line(linewidth = 1, show.legend = FALSE) +
  theme(axis.text.y = element_blank())
g2 <- ggplot(data, aes(x = x, y = linear.splines, color = function_label)) +
  geom_line(linewidth = 1, show.legend = FALSE) +
  theme(axis.text.y = element_blank())
g3 <- ggplot(data, aes(x = x, y = fourier, color = function_label)) +
  geom_line(linewidth = 1, show.legend = FALSE) +
  theme(axis.text.y = element_blank())
plot_grid(g1, g2, g3, ncol = 3)
Procedure 1:
Perform OLS on different orders.
Use a model selection criterion to choose the order.
What bases do you think will work best for \(f_1\), \(f_2\), and \(f_3\)?
Answer: \(f_1\) was made from the polynomial basis, \(f_2\) from the Fourier basis, and \(f_3\) from linear splines.
Procedure 2:
Use a bunch of high-order bases, say linear splines, Fourier series, and whatever else you like.
Use lasso, ridge regression, or elastic net. (Combining bases can lead to multicollinearity, but we may not care.)
Use a model selection criterion to choose the tuning parameter.
1. Split arcuate into 75% training data and 25% testing data.
2. Estimate polynomials up to degree 20 as before and choose the best order.
3. Fit ridge, lasso, and elastic net (\(\alpha = 0.5\)) on 20th-order polynomials, splines with 20 knots, and a Fourier series with \(p = 20\). Choose the tuning parameter using lambda.1se.
4. Repeat steps 1-3 20 times (with different random splits).
library(glmnet)
library(purrr)  # map_dbl(), map(), list_rbind()
library(tibble)

# rescale position to lie strictly inside (0, 1)
mapto01 <- function(x, pad = .005) (x - min(x) + pad) / (max(x) - min(x) + 2 * pad)
x <- mapto01(arcuate$position)

# one big feature matrix: polynomials + linear splines + Fourier features
Xmat <- cbind(
  poly(x, 20),
  splines::bs(x, df = 20, degree = 1),
  cos(2 * pi * outer(x, 1:20)), sin(2 * pi * outer(x, 1:20))
)
y <- arcuate$fa

rmse <- function(z, s) sqrt(mean((z - s)^2))
nzero <- function(x) with(x, nzero[match(lambda.1se, lambda)]) # number of nonzero coefficients at lambda.1se

sim <- function(maxdeg = 20, train_frac = 0.75) {
  n <- nrow(arcuate)
  train <- as.logical(rbinom(n, 1, train_frac))
  test <- !train # not precisely 25%, but on average
  polycv <- map_dbl(seq(maxdeg), ~ cv_nice(lm(y ~ Xmat[, seq(.)], subset = train))) # figure out which order to use
  bpoly <- lm(y[train] ~ Xmat[train, seq(which.min(polycv))]) # now use it
  lasso <- cv.glmnet(Xmat[train, ], y[train])
  ridge <- cv.glmnet(Xmat[train, ], y[train], alpha = 0)
  elnet <- cv.glmnet(Xmat[train, ], y[train], alpha = .5)
  tibble(
    methods = c("poly", "lasso", "ridge", "elnet"),
    rmses = c(
      rmse(y[test], cbind(1, Xmat[test, 1:which.min(polycv)]) %*% coef(bpoly)),
      rmse(y[test], predict(lasso, Xmat[test, ])),
      rmse(y[test], predict(ridge, Xmat[test, ])),
      rmse(y[test], predict(elnet, Xmat[test, ]))
    ),
    nvars = c(which.min(polycv), nzero(lasso), nzero(ridge), nzero(elnet))
  )
}

set.seed(12345)
sim_results <- map(seq(20), ~ sim()) |> list_rbind() # repeat it 20 times (the index is ignored; sim() uses its defaults)
In all these cases, we transformed \(x\) to a higher-dimensional space
Used \(p+1\) dimensions with polynomials
Used \(p+4\) dimensions with cubic splines
Used \(2p+1\) dimensions with Fourier basis
Each case applied a feature map to \(x\), call it \(\Phi\)
We used new “features” \(\Phi(x) = \bigg(\phi_1(x),\ \phi_2(x),\ldots,\phi_k(x)\bigg)\) with a linear model
\[f(x) = \Phi(x)^\top \beta\]
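A hedged sketch of this feature-map view on toy data (my own example; the column counts match the list above once an intercept is included):
library(splines)
set.seed(406)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)
p <- 5
Phi_poly <- poly(x, p)                      # p columns (+ intercept = p + 1)
Phi_spline <- bs(x, df = p + 3, degree = 3) # cubic spline with p knots: p + 3 columns (+ intercept = p + 4)
Phi_fourier <- cbind(cos(2 * pi * outer(x, 1:p)),
                     sin(2 * pi * outer(x, 1:p))) # 2p columns (+ intercept = 2p + 1)
fit <- lm(y ~ Phi_fourier)                  # f(x) = Phi(x)^T beta, estimated by OLS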
Neural networks (coming in module 4) build upon this idea
Some methods (Support Vector Machines and other Kernel Machines) allow \(k=\infty\)
See [ISLR] 9.3.2 for a baby overview, or [ESL] 5.8 (note 😱)
Kernel regression and nearest neighbors
UBC Stat 406 - 2024