08 Ridge regression

Stat 406

Daniel J. McDonald

Last modified – 27 September 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \]

Recap

So far, we have emphasized model selection as

Decide which predictors we would like to use in our linear model

Or similarly:

Decide which of a few linear models to use

To do this, we used a risk estimate, and chose the “model” with the lowest estimate

Moving forward, we need to generalize this to

Decide which of possibly infinite prediction functions \(f\in\mathcal{F}\) to use

Thankfully, this isn’t really any different. We still use those same risk estimates.

Remember: We were choosing models that balance bias and variance (and hence have low prediction risk).

Regularization

  • Another way to control bias and variance is through regularization or shrinkage.

  • Rather than selecting a few predictors that seem reasonable (perhaps trying a few combinations), use them all.

  • I mean ALL.

  • But, make your estimates of \(\beta\) “smaller”

Brief aside on optimization

  • An optimization problem has 2 components:

    1. The “Objective function”: e.g. \(\frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2\).
    2. The “constraint”: e.g. “fewer than 5 non-zero entries in \(\beta\)”.
  • A constrained minimization problem is written

\[\min_\beta f(\beta)\;\; \mbox{ subject to }\;\; C(\beta)\]

  • \(f(\beta)\) is the objective function
  • \(C(\beta)\) is the constraint

Ridge regression (constrained version)

One way to do this for regression is to solve (say): \[ \minimize_\beta \frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2 \quad \st \sum_j \beta^2_j \leq s \] for some \(s>0\).

  • This is called “ridge regression”.
  • Call the minimizer of this problem \(\brt\)

Compare this to ordinary least squares:

\[ \minimize_\beta \frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2 \quad \st \beta \in \R^p \]

Geometry of ridge regression (contours)

Code
library(mvtnorm)
library(tidyverse) # tibble, tidyr, dplyr, purrr, and ggplot2 helpers used below

# plot colours: presumably set in the course theme; placeholder values here so the chunk runs standalone
red <- "darkred"
blue <- "steelblue"

# trace the boundary of the unit l_q ball as (x, y) pairs for plotting
norm_ball <- function(q = 1, len = 1000) {
  tg <- seq(0, 2 * pi, length = len)
  out <- tibble(x = cos(tg), b = (1 - abs(x)^q)^(1 / q), bm = -b) |>
    pivot_longer(-x, values_to = "y")
  out$lab <- paste0('"||" * beta * "||"', "[", signif(q, 2), "]")
  return(out)
}

# bivariate normal density over a grid: the elliptical loss contours
ellipse_data <- function(
  n = 75, xlim = c(-2, 3), ylim = c(-2, 3),
  mean = c(1, 1), Sigma = matrix(c(1, 0, 0, .5), 2)) {
  expand_grid(
    x = seq(xlim[1], xlim[2], length.out = n),
    y = seq(ylim[1], ylim[2], length.out = n)) |>
    rowwise() |>
    mutate(z = dmvnorm(c(x, y), mean, Sigma))
}

# grid search: the maximum-density point inside the unit l_q ball (positive quadrant)
lballmax <- function(ed, q = 1) {
  ed <- filter(ed, x > 0, y > 0)
  feasible <- (ed$x^q + ed$y^q)^(1 / q) <= 1
  best <- ed[feasible, ]
  best[which.max(best$z), ]
}


nb <- norm_ball(2)
ed <- ellipse_data()
bols <- data.frame(x = 1, y = 1)
bhat <- lballmax(ed, 2)
ggplot(nb, aes(x, y)) +
  xlim(-2, 2) +
  ylim(-2, 2) +
  geom_path(colour = red) +
  geom_contour(mapping = aes(z = z), colour = blue, data = ed, bins = 7) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  geom_point(data = bols) +
  coord_equal() +
  geom_label(
    data = bols,
    mapping = aes(label = bquote("hat(beta)[ols]")),
    parse = TRUE, 
    nudge_x = .3, nudge_y = .3
  ) +
  geom_point(data = bhat) +
  xlab(bquote(beta[1])) +
  ylab(bquote(beta[2])) +
  theme_bw(base_size = 24) +
  geom_label(
    data = bhat,
    mapping = aes(label = bquote("hat(beta)[s]^R")),
    parse = TRUE,
    nudge_x = -.4, nudge_y = -.4
  )

Brief aside on norms

Recall, for a vector \(z \in \R^p\)

\[\snorm{z}_2 = \sqrt{z_1^2 + z_2^2 + \cdots + z^2_p} = \sqrt{\sum_{j=1}^p z_j^2}\]

So,

\[\snorm{z}^2_2 = z_1^2 + z_2^2 + \cdots + z^2_p = \sum_{j=1}^p z_j^2.\]

Other norms we should remember:

\(\ell_q\)-norm
\(\left(\sum_{j=1}^p |z_j|^q\right)^{1/q}\)
\(\ell_1\)-norm (special case)
\(\sum_{j=1}^p |z_j|\)
\(\ell_0\)-norm
\(\sum_{j=1}^p I(z_j \neq 0 ) = \lvert \{j : z_j \neq 0 \}\rvert\)
\(\ell_\infty\)-norm
\(\max_{1\leq j \leq p} |z_j|\)
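
These norms are easy to check numerically. A tiny R illustration (not from the original slides), for an arbitrary vector z:

z <- c(-3, 0, 1, 2)
sqrt(sum(z^2)) # l2 norm
sum(abs(z))    # l1 norm
sum(z != 0)    # l0 "norm": the number of nonzero entries
max(abs(z))    # l-infinity norm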

Ridge regression

An equivalent way to write

\[\brt = \argmin_{ || \beta ||_2^2 \leq s} \frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2\]

is in the Lagrangian form

\[\brl = \argmin_{ \beta} \frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2 + \lambda || \beta ||_2^2.\]

For every \(\lambda\) there is a unique \(s\) (and vice versa) that makes

\[\brt = \brl\]

Ridge regression

\(\brt = \argmin_{ || \beta ||_2^2 \leq s} \frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2\)

\(\brl = \argmin_{ \beta} \frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2 + \lambda || \beta ||_2^2\)

Observe:

  • \(\lambda = 0\) (or \(s = \infty\)) makes \(\brl = \bls\)
  • Any \(\lambda > 0\) (or \(s <\infty\)) penalizes larger values of \(\beta\), effectively shrinking them.

\(\lambda\) and \(s\) are known as tuning parameters

Visualizing ridge regression (2 coefficients)

Code
library(tidyverse) # expand_grid, pivot_longer, map, set_names, ggplot2, etc.
set.seed(406) # arbitrary seed so the simulated data below are reproducible
b <- c(1, 1)
n <- 1000
lams <- c(1, 5, 10)
# average squared-error loss on a grid of (b1, b2) values; X and y are simulated below
ols_loss <- function(b1, b2) colMeans((y - X %*% rbind(b1, b2))^2) / 2
# ridge penalty evaluated on the same grid
pen <- function(b1, b2, lambda = 1) lambda * (b1^2 + b2^2) / 2
gr <- expand_grid(
  b1 = seq(b[1] - 0.5, b[1] + 0.5, length.out = 100),
  b2 = seq(b[2] - 0.5, b[2] + 0.5, length.out = 100)
)

X <- mvtnorm::rmvnorm(n, c(0, 0), sigma = matrix(c(1, .3, .3, .5), nrow = 2))
y <- drop(X %*% b + rnorm(n))

bols <- coef(lm(y ~ X - 1))
bridge <- coef(MASS::lm.ridge(y ~ X - 1, lambda = lams * sqrt(n)))

penalties <- lams |>
  set_names(~ paste("lam =", .)) |>
  map(~ pen(gr$b1, gr$b2, .x)) |>
  as_tibble()
gr <- gr |>
  mutate(loss = ols_loss(b1, b2)) |>
  bind_cols(penalties)

g1 <- ggplot(gr, aes(b1, b2)) +
  geom_raster(aes(fill = loss)) +
  scale_fill_viridis_c(direction = -1) +
  geom_point(data = data.frame(b1 = b[1], b2 = b[2])) +
  theme(legend.position = "bottom") +
  guides(fill = guide_colourbar(barwidth = 20, barheight = 0.5))

g2 <- gr |>
  pivot_longer(starts_with("lam")) |>
  mutate(name = factor(name, levels = paste("lam =", lams))) |>
  ggplot(aes(b1, b2)) +
  geom_raster(aes(fill = value)) +
  scale_fill_viridis_c(direction = -1, name = "penalty") +
  facet_wrap(~name, ncol = 1) +
  geom_point(data = data.frame(b1 = b[1], b2 = b[2])) +
  theme(legend.position = "bottom") +
  guides(fill = guide_colourbar(barwidth = 10, barheight = 0.5))

g3 <- gr |> 
  mutate(across(starts_with("lam"), ~ loss + .x)) |>
  pivot_longer(starts_with("lam")) |>
  mutate(name = factor(name, levels = paste("lam =", lams))) |>
  ggplot(aes(b1, b2)) +
  geom_raster(aes(fill = value)) +
  scale_fill_viridis_c(direction = -1, name = "loss + pen") +
  facet_wrap(~name, ncol = 1) +
  geom_point(data = data.frame(b1 = b[1], b2 = b[2])) +
  theme(legend.position = "bottom") +
  guides(fill = guide_colourbar(barwidth = 10, barheight = 0.5))

cowplot::plot_grid(g1, g2, g3, rel_widths = c(2, 1, 1), nrow = 1)

The effect on the estimates

Code
gr |> 
  mutate(z = ols_loss(b1, b2) + max(lams) * pen(b1, b2)) |>
  ggplot(aes(b1, b2)) +
  geom_raster(aes(fill = z)) +
  scale_fill_viridis_c(direction = -1) +
  geom_point(data = tibble(
    b1 = c(bols[1], bridge[,1]),
    b2 = c(bols[2], bridge[,2]),
    estimate = factor(c("ols", paste0("ridge = ", lams)), 
                      levels = c("ols", paste0("ridge = ", lams)))
  ),
  aes(shape = estimate), size = 3) +
  geom_point(data = data.frame(b1 = b[1], b2 = b[2]), colour = "orange", size = 4) # named colour in place of the course palette variable

Example data

prostate data from [ESL]

data(prostate, package = "ElemStatLearn")
prostate |> as_tibble()
# A tibble: 97 × 10
   lcavol lweight   age   lbph   svi   lcp gleason pgg45   lpsa train
    <dbl>   <dbl> <int>  <dbl> <int> <dbl>   <int> <int>  <dbl> <lgl>
 1 -0.580    2.77    50 -1.39      0 -1.39       6     0 -0.431 TRUE 
 2 -0.994    3.32    58 -1.39      0 -1.39       6     0 -0.163 TRUE 
 3 -0.511    2.69    74 -1.39      0 -1.39       7    20 -0.163 TRUE 
 4 -1.20     3.28    58 -1.39      0 -1.39       6     0 -0.163 TRUE 
 5  0.751    3.43    62 -1.39      0 -1.39       6     0  0.372 TRUE 
 6 -1.05     3.23    50 -1.39      0 -1.39       6     0  0.765 TRUE 
 7  0.737    3.47    64  0.615     0 -1.39       6     0  0.765 FALSE
 8  0.693    3.54    58  1.54      0 -1.39       6     0  0.854 TRUE 
 9 -0.777    3.54    47 -1.39      0 -1.39       6     0  1.05  FALSE
10  0.223    3.24    63 -1.39      0 -1.39       6     0  1.05  FALSE
# ℹ 87 more rows

Ridge regression path

Y <- prostate$lpsa
X <- model.matrix(~ ., data = prostate |> dplyr::select(-train, -lpsa))
library(glmnet)
ridge <- glmnet(x = X, y = Y, alpha = 0, lambda.min.ratio = .00001)


plot(ridge, xvar = "lambda", lwd = 3)

Model selection here:

  • means choosing some \(\lambda\) (see the sketch after this list)

  • A value of \(\lambda\) corresponds to a vertical line in this plot.

  • This graphic is a “path” or “coefficient trace”

  • Coefficients for varying \(\lambda\)
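
For example, once a value of \(\lambda\) is chosen, the corresponding coefficients and fitted values can be read directly off the fitted path. A minimal sketch, with \(\lambda = 0.1\) chosen arbitrarily for illustration:

coef(ridge, s = 0.1)                    # coefficients at lambda = 0.1
head(predict(ridge, newx = X, s = 0.1)) # fitted values at the same lambda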

Solving the minimization

  • One nice thing about ridge regression is that it has a closed-form solution (like OLS)

\[\brl = (\X^\top\X + \lambda \mathbf{I})^{-1}\X^\top \y\]

  • This is easy to calculate in R for any \(\lambda\) (a small sketch follows this list).

  • However, computations and interpretation are simplified if we examine the Singular Value Decomposition of \(\X = \mathbf{UDV}^\top\).

  • Recall: any matrix has an SVD.

  • Here \(\mathbf{D}\) is diagonal and \(\mathbf{U}\) and \(\mathbf{V}\) have orthonormal columns: \(\mathbf{U}^\top\mathbf{U} = \mathbf{V}^\top\mathbf{V} = \mathbf{I}\).
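
As a quick check of the closed form, here is a sketch that applies the displayed formula to the prostate X and Y built earlier; \(\lambda = 1\) is an arbitrary illustrative value, and no standardization is done, so this will not match glmnet's output exactly:

lambda <- 1
beta_ridge <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, Y))
drop(beta_ridge)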

Solving the minimization

\[\brl = (\X^\top\X + \lambda \mathbf{I})^{-1}\X^\top \y\]

  • Note that \(\mathbf{X}^\top\mathbf{X} = \mathbf{VDU}^\top\mathbf{UDV}^\top = \mathbf{V}\mathbf{D}^2\mathbf{V}^\top\).

  • Then,

\[\brl = (\X^\top \X + \lambda \mathbf{I})^{-1}\X^\top \y = (\mathbf{VD}^2\mathbf{V}^\top + \lambda \mathbf{I})^{-1}\mathbf{VDU}^\top \y = \mathbf{V}(\mathbf{D}^2+\lambda \mathbf{I})^{-1} \mathbf{DU}^\top \y.\]

  • For computations, we now only need to invert the diagonal matrix \(\mathbf{D}^2+\lambda \mathbf{I}\), which is trivial (see the check below).
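
A small numerical check (not in the original slides) that the SVD expression agrees with the direct solve, again using the prostate X and Y and an arbitrary \(\lambda = 1\):

lambda <- 1
s <- svd(X) # X = U D V'
beta_svd <- s$v %*% ((s$d / (s$d^2 + lambda)) * crossprod(s$u, Y))
beta_direct <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, Y))
max(abs(beta_svd - beta_direct)) # numerically zero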

Comparing with OLS

  • \(\mathbf{D}\) is a diagonal matrix

\[\bls = (\X^\top\X)^{-1}\X^\top \y = (\mathbf{VD}^2\mathbf{V}^\top)^{-1}\mathbf{VDU}^\top \y = \mathbf{V}\color{red}{\mathbf{D}^{-2}\mathbf{D}}\mathbf{U}^\top \y = \mathbf{V}\color{red}{\mathbf{D}^{-1}}\mathbf{U}^\top \y\]

\[\brl = (\X^\top \X + \lambda \mathbf{I})^{-1}\X^\top \y = \mathbf{V}\color{red}{(\mathbf{D}^2+\lambda \mathbf{I})^{-1}} \mathbf{DU}^\top \y.\]

  • Notice that \(\bls\) depends on \(d_j/d_j^2\) while \(\brl\) depends on \(d_j/(d_j^2 + \lambda)\).

  • Ridge regression makes the coefficients smaller relative to OLS.

  • But if \(\X\) has small singular values, ridge regression compensates with \(\lambda\) in the denominator.
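
To see the two sets of factors side by side, a sketch reusing the singular values of the prostate X (with the arbitrary \(\lambda = 1\)):

d <- svd(X)$d
cbind(d = d, ols = 1 / d, ridge = d / (d^2 + 1))
# the ridge factors are uniformly smaller, and stay bounded even when d is tiny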

Ridge regression and multicollinearity

Multicollinearity: a linear combination of predictor variables is nearly equal to another predictor variable.

Some comments:

  • A better phrase: \(\X\) is ill-conditioned

  • AKA “(numerically) rank-deficient”.

  • \(\X = \mathbf{U D V}^\top\) ill-conditioned \(\Longleftrightarrow\) some elements of \(\mathbf{D} \approx 0\)

  • \(\bls= \mathbf{V D}^{-1} \mathbf{U}^\top \y\), so small entries of \(\mathbf{D}\) \(\Longleftrightarrow\) huge elements of \(\mathbf{D}^{-1}\)

  • Means huge variance: \(\Var{\bls} = \sigma^2(\X^\top \X)^{-1} = \sigma^2 \mathbf{V D}^{-2} \mathbf{V}^\top\)

Ridge regression and ill-posed \(\X\)

Ridge regression fixes this problem by preventing division by a near-zero number.

Conclusion
\((\X^{\top}\X)^{-1}\) can be really unstable, while \((\X^{\top}\X + \lambda \mathbf{I})^{-1}\) is not.
Aside
The engineering approach to solving linear systems is to always add a small \(\lambda\); there the motivation is numerical stability rather than statistics.
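
A tiny demonstration with a simulated, nearly collinear design (illustrative only, not from the slides):

set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 1e-6) # x2 is almost an exact copy of x1
Xc <- cbind(x1, x2)
kappa(crossprod(Xc))                 # enormous condition number: inversion is unstable
kappa(crossprod(Xc) + 0.1 * diag(2)) # adding lambda * I shrinks it by many orders of magnitude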

Which \(\lambda\) to use?

Computational
Use CV and pick the \(\lambda\) that minimizes the CV estimate of prediction risk (a sketch follows below).
Intuition (bias)
As \(\lambda\rightarrow\infty\), bias ⬆
Intuition (variance)
As \(\lambda\rightarrow\infty\), variance ⬇

You should think about why.
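
For the computational route, cv.glmnet() automates the cross-validation. A sketch reusing the prostate X and Y from above (alpha = 0 again gives ridge):

cv_ridge <- cv.glmnet(x = X, y = Y, alpha = 0) # 10-fold CV by default
cv_ridge$lambda.min # the CV-risk-minimizing lambda
plot(cv_ridge)      # CV curve across the lambda path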

Can we get the best of both worlds?

To recap:

  • Deciding which predictors to include, or whether to add quadratic terms or interactions, is model selection (more precisely, variable selection within a linear model).

  • Ridge regression provides regularization, which trades off bias and variance and also stabilizes estimation under multicollinearity.

  • If the LM is true,

    1. OLS is unbiased, but its variance depends on \(\mathbf{D}^{-2}\), which can be big.
    2. Ridge is biased (can you find the bias?), but its variance is smaller than that of OLS.
  • Ridge regression does not perform variable selection.

  • But picking \(\lambda=3.7\) and thereby deciding to predict with \(\widehat{\beta}^R_{3.7}\) is model selection.

Can we get the best of both worlds?

Ridge regression
\(\minimize \frac{1}{n}||\y-\X\beta||_2^2 \ \st\ ||\beta||_2^2 \leq s\)
Best (in-sample) linear regression model of size \(s\)
\(\minimize \frac{1}{n}||\y-\X\beta||_2^2 \ \st\ ||\beta||_0 \leq s\)

\(||\beta||_0\) is the number of nonzero elements in \(\beta\)

Finding the best in-sample linear model (of size \(s\), among these predictors) is a nonconvex optimization problem (in fact, it is NP-hard).

The ridge regression problem is convex (and easy to solve), but ridge does not perform variable selection.

Can we somehow “interpolate” to get both?

Note: selecting \(\lambda\) is still model selection, but we’ve included all the variables.

Next time…

The lasso, interpolating variable selection and model selection