08 Ridge regression

Stat 406

Geoff Pleiss, Trevor Campbell

Last modified – 29 September 2024

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \]

Recap

General idea:

  1. Pick a model (a family of distributions).
  2. Pick a loss function (measures how wrong predictions are).
  3. Design a predictor/estimator that minimizes the prediction/estimation risk.

Risk includes bias, variance, and irreducible error.

So far, trade bias/variance via variable selection using a risk estimate.

Today: trade bias/variance via regularization

  • use all predictors
  • shrink \(\hat\beta\) towards 0 (increase bias, reduce variance)

OLS: bias and variance

Minimize

\[\min_\beta \frac{1}{n}\sum_i (y_i - x_i^T\beta)^2 \quad \mbox{ subject to } \quad \beta \in \R^p\]

Can rewrite using matrix-vector notation:

\[\min_\beta \frac{1}{n}\| \y - \X\beta\|_2^2\quad \mbox{ subject to } \quad \beta \in \R^p\]

Solve by setting the derivative to 0 (assuming \(\X^\top\X\) is invertible): \(\bls = (\X^\top\X)^{-1}\X^\top\y\)

  • unbiased: \(\E[\bls] = \E[\E[\bls \mid \X]] = \E[(\X^\top\X)^{-1}\X^\top\X\beta] = \beta\)
  • variance: \(\Var{\bls \mid \X} = \sigma^2(\X^\top \X)^{-1}\). How large can this be? (A quick numerical check follows; the SVD below makes the answer precise.)
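
As a sanity check of the closed form, here is a minimal sketch on simulated data (the sample size, coefficients, and noise level are arbitrary choices): the formula matches the coefficients returned by lm().

# simulate a small regression problem and verify the OLS formula against lm()
set.seed(406)
n <- 100
Xsim <- matrix(rnorm(n * 3), n, 3)
beta <- c(2, -1, 0.5)
ysim <- drop(Xsim %*% beta + rnorm(n))

bols_closed <- solve(crossprod(Xsim), crossprod(Xsim, ysim)) # (X^T X)^{-1} X^T y
cbind(closed_form = drop(bols_closed), lm = coef(lm(ysim ~ Xsim - 1)))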

Singular Value Decomposition (SVD)

A rectangular \(n \times p\) matrix \(X\) has singular value decomposition \(X = U D V^\top\)

  • \(U\) is \(n\times n\), orthonormal: \(U^\top U = UU^\top = I\)
  • \(V\) is \(p\times p\), orthonormal: \(V^\top V = VV^\top = I\)
  • \(D\) is rectangular \(n\times p\), diagonal and nonnegative (singular values)

e.g. \(n = 4, p = 2\), singular values \(d_1, d_2\):

\[D = \left[\begin{array}{cc} d_1 & 0 \\ 0 & d_2 \\ 0 & 0 \\ 0 & 0 \end{array}\right] \qquad D^\top D = \left[\begin{array}{cc} d_1^2 & 0 \\ 0 & d_2^2 \end{array}\right].\]

If \(X\) has (almost) linearly dependent columns, some singular values in \(D\) are (near) zero

\(X\) is (nearly) rank-deficient or ill-conditioned
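
Here is a minimal sketch (the \(4 \times 2\) matrix is made up, with its second column nearly a multiple of the first). Note that R's svd() returns the thin SVD by default, so u is \(n \times p\) rather than \(n \times n\).

# a 4 x 2 matrix with nearly linearly dependent columns
A <- cbind(c(1, 2, 3, 4), c(1.001, 2.002, 2.999, 4.001))
s <- svd(A) # thin SVD: A = s$u %*% diag(s$d) %*% t(s$v)
s$d # the second singular value is nearly zero
max(abs(A - s$u %*% diag(s$d) %*% t(s$v))) # reconstruction error is ~ 0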

The variance of OLS

Let’s use the SVD \(X = UDV^\top\) to examine the OLS variance \(\Var{\bls \mid \X} = \sigma^2(\X^\top \X)^{-1}\):

\[(\X^\top \X)^{-1} = (VD^\top U^\top UDV^\top)^{-1} = V (D^\top D)^{-1}V^\top = V\left[\begin{array}{ccc} d_1^{-2} & 0 & 0\\ 0 & \ddots & 0 \\ 0 & 0 & d_p^{-2}\end{array}\right]V^\top\]

Multicollinearity: a linear combination of variables is nearly equal to another variable.

  • so \(\X\) is ill-conditioned
  • so some singular values \(d_j\approx 0\)
  • so \(d^{-2}_j\) is large
  • so \(\bls\) has large, unstable values; i.e., large variance (see the sketch below)
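
A minimal sketch of this chain of implications (simulated data; the collinearity level is an arbitrary choice), using the diagonal of \((\X^\top\X)^{-1}\) as a stand-in for the variance with \(\sigma^2 = 1\):

set.seed(406)
n <- 50
x1 <- rnorm(n)
x2_indep <- rnorm(n) # unrelated to x1
x2_collin <- x1 + rnorm(n, sd = 0.01) # nearly equal to x1
ols_var_diag <- function(X) diag(solve(crossprod(X))) # diag of (X^T X)^{-1}
ols_var_diag(cbind(x1, x2_indep)) # modest values
ols_var_diag(cbind(x1, x2_collin)) # huge values: some d_j ~ 0, so d_j^{-2} is large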

How to stop large, unstable values of \(\widehat{\beta}\)?

Idea: constrain the values of \(\beta\) to be small. For \(s > 0\): \[ \minimize_\beta \underbrace{\frac{1}{n}\|\y - \X\beta\|_2^2}_{\text{objective function}} \quad \st \underbrace{\|\beta\|_2^2 \leq s}_{\text{constraint}}. \]

Recall, for a vector \(\beta \in \R^p\)

\(\ell_2\)-norm
\(\|\beta\|_2 = \sqrt{\beta_1^2 + \beta_2^2 + \dots + \beta_p^2} = \left(\sum_{j=1}^p |\beta_j|^2\right)^{1/2}\) (so \(\|\beta\|_2^2 = \sum_j \beta_j^2\))

Compare this to ordinary least squares:

\[ \minimize_\beta \underbrace{\frac{1}{n}\|\y-\X\beta\|_2^2}_{\text{same objective}} \quad \st \underbrace{\beta \in \R^p}_{\textbf{no constraint}} \]

Geometry of constrained regression (contours)

library(mvtnorm)
library(tidyverse) # tibble, tidyr, dplyr, and ggplot2 are used below
# plot colours (the slide theme defines its own; set them here so the code runs standalone)
red <- "red"
blue <- "blue"

# boundary of the unit l_q ball, for plotting the constraint region
norm_ball <- function(q = 1, len = 1000) {
  tg <- seq(0, 2 * pi, length = len)
  out <- tibble(x = cos(tg), b = (1 - abs(x)^q)^(1 / q), bm = -b) |>
    pivot_longer(-x, values_to = "y")
  out$lab <- paste0('"||" * beta * "||"', "[", signif(q, 2), "]")
  return(out)
}

# grid of bivariate normal density values; its contours stand in for the
# elliptical contours of the least-squares objective
ellipse_data <- function(
  n = 75, xlim = c(-2, 3), ylim = c(-2, 3),
  mean = c(1, 1), Sigma = matrix(c(1, 0, 0, .5), 2)) {
  expand_grid(
    x = seq(xlim[1], xlim[2], length.out = n),
    y = seq(ylim[1], ylim[2], length.out = n)) |>
    rowwise() |>
    mutate(z = dmvnorm(c(x, y), mean, Sigma))
}

# grid point inside the unit l_q ball with the largest density value,
# i.e. (approximately) the constrained optimum
lballmax <- function(ed, q = 1, tol = 1e-6, niter = 20) {
  ed <- filter(ed, x > 0, y > 0)
  feasible <- (ed$x^q + ed$y^q)^(1 / q) <= 1
  best <- ed[feasible, ]
  best[which.max(best$z), ]
}


nb <- norm_ball(2)
ed <- ellipse_data()
bols <- data.frame(x = 1, y = 1)
bhat <- lballmax(ed, 2)
ggplot(nb, aes(x, y)) +
  xlim(-2, 2) +
  ylim(-2, 2) +
  geom_path(colour = red) +
  geom_contour(mapping = aes(z = z), colour = blue, data = ed, bins = 7) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  geom_point(data = bols) +
  coord_equal() +
  geom_label(
    data = bols,
    mapping = aes(label = bquote("hat(beta)[ols]")),
    parse = TRUE, 
    nudge_x = .3, nudge_y = .3
  ) +
  geom_point(data = bhat) +
  xlab(bquote(beta[1])) +
  ylab(bquote(beta[2])) +
  theme_bw(base_size = 24) +
  geom_label(
    data = bhat,
    mapping = aes(label = bquote("hat(beta)[s]^R")),
    parse = TRUE,
    nudge_x = -.4, nudge_y = -.4
  )

Ridge regression

An equivalent way to write

\[\brt = \argmin_{\beta} \frac{1}{n}\|\y - \X\beta\|_2^2 \st || \beta ||_2^2 \leq s\]

is as a regularized (or penalized) optimization with regularization weight \(\lambda\):

\[\brl = \argmin_{ \beta} \frac{1}{n}\|\y-\X\beta\|_2^2 + \lambda || \beta ||_2^2.\]

For every \(\lambda\) there is a unique \(s\) (and vice versa) that makes \(\brt = \brl\). We will work with \(\lambda\).

Ridge regression

\[\brl = \argmin_{ \beta} \frac{1}{n}\|\y-\X\beta\|_2^2 + \lambda || \beta ||_2^2\]

Observe:

  • \(\lambda = 0\) makes \(\brl = \bls\)
  • \(\lambda \to\infty\) makes \(\brl \to 0\)
  • Any \(0 < \lambda < \infty\) penalizes larger values of \(\beta\), effectively shrinking them.
  • \(\lambda\) is a tuning parameter: we need to pick it

Example data

prostate data from [ESL] (The Elements of Statistical Learning)

data(prostate, package = "ElemStatLearn")
prostate |> as_tibble()
# A tibble: 97 × 10
   lcavol lweight   age   lbph   svi   lcp gleason pgg45   lpsa train
    <dbl>   <dbl> <int>  <dbl> <int> <dbl>   <int> <int>  <dbl> <lgl>
 1 -0.580    2.77    50 -1.39      0 -1.39       6     0 -0.431 TRUE 
 2 -0.994    3.32    58 -1.39      0 -1.39       6     0 -0.163 TRUE 
 3 -0.511    2.69    74 -1.39      0 -1.39       7    20 -0.163 TRUE 
 4 -1.20     3.28    58 -1.39      0 -1.39       6     0 -0.163 TRUE 
 5  0.751    3.43    62 -1.39      0 -1.39       6     0  0.372 TRUE 
 6 -1.05     3.23    50 -1.39      0 -1.39       6     0  0.765 TRUE 
 7  0.737    3.47    64  0.615     0 -1.39       6     0  0.765 FALSE
 8  0.693    3.54    58  1.54      0 -1.39       6     0  0.854 TRUE 
 9 -0.777    3.54    47 -1.39      0 -1.39       6     0  1.05  FALSE
10  0.223    3.24    63 -1.39      0 -1.39       6     0  1.05  FALSE
# ℹ 87 more rows

Ridge regression path

We can look at coefficients as we vary \(\lambda\) (a “path” or “coefficient trace”)

Y <- prostate$lpsa
X <- model.matrix(~ ., data = prostate |> dplyr::select(-train, -lpsa)) # includes an intercept column
library(glmnet)
# alpha = 0 selects the ridge penalty; lambda.min.ratio extends the path to very small lambda
ridge <- glmnet(x = X, y = Y, alpha = 0, lambda.min.ratio = .00001)
plot(ridge, xvar = "lambda", lwd = 3)

Model selection: choose some \(\lambda\) (a vertical line on this plot)

Ridge regression: closed-form

Ridge regression has a closed-form solution like OLS (set the derivative in \(\beta\) to 0):

\[\brl = (\X^\top\X + \lambda \mathbf{I})^{-1}\X^\top \y\]

Compare to OLS:

\[\bls = (\X^\top \X)^{-1}\X^\top \y\]

  • What does the \(+\lambda I\) do?

  • What about bias and variance?

  • Is \(\brl\) faster/slower/neither to compute?

  • Is \(\brl\) more/less numerically stable to compute?
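
Before answering, here is a minimal sketch of the two closed forms applied to the prostate X and Y built above (\(\lambda = 10\) is an arbitrary value, and this naive version also penalizes the intercept column of X, so it will not match glmnet's standardized fit exactly):

lambda <- 10
I_p <- diag(ncol(X))
b_ols <- solve(crossprod(X), crossprod(X, Y)) # (X^T X)^{-1} X^T y
b_ridge <- solve(crossprod(X) + lambda * I_p, crossprod(X, Y)) # (X^T X + lambda I)^{-1} X^T y
cbind(ols = drop(b_ols), ridge = drop(b_ridge)) # ridge coefficients are pulled toward 0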

Ridge regression: bias and variance

Ridge regression has a closed-form solution like OLS (set the derivative in \(\beta\) to 0):

\[\brl = (\X^\top\X + \lambda \mathbf{I})^{-1}\X^\top \y\]

  • bias: \(\E[\brl \mid \X] = (\X^\top\X + \lambda I)^{-1}\X^\top \X\beta \neq \beta\) (unless \(\lambda = 0\))

  • variance: \(\Var{\brl \mid \X} = \sigma^2 (\X^\top\X + \lambda I)^{-1}\X^\top \X(\X^\top\X + \lambda I)^{-1}\)

Using the SVD \(\X = UDV^T\), \(\Var{\brl \mid \X} = \sigma^2 VG V^\top\) where \(G\) is diagonal with entries

\[ g_j = \frac{d_j^2}{(d_j^2 + \lambda)^2} \qquad \left(\text{compare to OLS:} \quad g_j = \frac{1}{d_j^{2}} \right)\]

Mitigates the issue of multicollinearity (ill-conditioned \(\X\))!
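
Here is a minimal sketch of this comparison on the prostate X from above (again \(\lambda = 10\) is an arbitrary choice):

d <- svd(X)$d # singular values of the prostate design matrix
lambda <- 10
signif(cbind(d = d,
             g_ols = 1 / d^2, # explodes when d_j is small
             g_ridge = d^2 / (d^2 + lambda)^2), 3) # never exceeds 1 / (4 * lambda)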

Tuning the regularization weight \(\lambda\)

Use cross-validation (it’s built into glmnet if you use cv.glmnet!)

The plot shows the mean CV estimate \(\pm\) one standard error.

The left vertical line marks the \(\lambda\) minimizing the CV estimate; the right marks the largest \(\lambda\) whose estimate is within one standard error of that minimum.

ridge <- cv.glmnet(x = X, y = Y, alpha = 0, lambda.min.ratio = .00001)
plot(ridge, main = "Ridge")
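
The two vertical lines can be read off the fitted object directly (lambda.min and lambda.1se are standard components of a cv.glmnet fit):

c(lambda_min = ridge$lambda.min, # lambda minimizing the CV estimate
  lambda_1se = ridge$lambda.1se) # largest lambda within 1 SE of that minimum
coef(ridge, s = "lambda.1se") # coefficients at the more heavily regularized choice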

Ridge regression: summary

Ridge regression addresses multicollinearity by preventing division by (near-)zero singular values

Conclusion
\(\bls = (\X^{\top}\X)^{-1}\X^\top \y\) can be unstable, while \(\brl=(\X^{\top}\X + \lambda \mathbf{I})^{-1}\X^\top\y\) is not.
Aside
A common engineering approach to solving linear systems is to always add a small \(\lambda\); there the motivation is numerical stability rather than statistics.

Which \(\lambda\) to use?

Computational
Use CV and pick the \(\lambda\) that minimizes the CV risk estimate (or the largest \(\lambda\) within one standard error of that minimum).
Intuition (bias)
As \(\lambda\rightarrow\infty\), bias ⬆
Intuition (variance)
As \(\lambda\rightarrow\infty\), variance ⬇

Regularization for variable selection?

Ridge regression is a regularized approach to prediction.

  • nice bias/variance tradeoff
  • closed-form / easy-to-compute solution
  • no predictor variable selection.
    • (NB: picking a \(\lambda\) is still model selection)

Is there a regularization approach to variable selection?

e.g., best (in-sample) linear regression model of size \(s\):
\(\minimize_\beta \frac{1}{n}\|\y-\X\beta\|_2^2 \ \st\ (\text{\# of nonzero $\beta_j$}) \leq s\)

That is a super nasty optimization. What to do?…
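
A rough sense of why, just counting the candidate models a brute-force search would have to fit:

# the number of distinct subsets of p predictors grows exponentially
p <- c(10, 20, 40)
data.frame(p = p, n_subsets = 2^p) # 2^40 is about 10^12 regressions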

Next time…

The lasso, interpolating variable selection and model selection