08 Ridge regression

Stat 406

Daniel J. McDonald

Last modified – 27 September 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \]

Recap

So far, we have emphasized model selection as

Decide which predictors we would like to use in our linear model

Or similarly:

Decide which of a few linear models to use

To do this, we used a risk estimate, and chose the “model” with the lowest estimate

Moving forward, we need to generalize this to

Decide which of possibly infinite prediction functions \(f\in\mathcal{F}\) to use

Thankfully, this isn’t really any different. We still use those same risk estimates.

Remember: We were choosing models that balance bias and variance (and hence have low prediction risk).

Regularization

  • Another way to control bias and variance is through regularization or shrinkage.

  • Rather than selecting a few predictors that seem reasonable (perhaps trying a few combinations), use them all.

  • I mean ALL.

  • But, make your estimates of \(\beta\) “smaller”

Brief aside on optimization

  • An optimization problem has 2 components:

    1. The “Objective function”: e.g. \(\frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2\).
    2. The “constraint”: e.g. “fewer than 5 non-zero entries in \(\beta\)”.
  • A constrained minimization problem is written

\[\min_\beta f(\beta)\;\; \mbox{ subject to }\;\; C(\beta)\]

  • \(f(\beta)\) is the objective function
  • \(C(\beta)\) is the constraint

Ridge regression (constrained version)

One way to do this for regression is to solve (say): \[ \minimize_\beta \frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2 \quad \st \sum_j \beta^2_j \leq s \] for some \(s>0\).

  • This is called “ridge regression”.
  • Call the minimizer of this problem \(\brt\)

Compare this to ordinary least squares:

\[ \minimize_\beta \frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2 \quad \st \beta \in \R^p \]

Geometry of ridge regression (contours)

Code
library(mvtnorm)
library(tidyverse) # tibble, tidyr, dplyr, purrr, and ggplot2 helpers used below

# plot colours: presumably set in the course theme; placeholder values here so the chunk runs standalone
red <- "darkred"
blue <- "steelblue"

# trace the boundary of the unit l_q ball as (x, y) pairs for plotting
norm_ball <- function(q = 1, len = 1000) {
  tg <- seq(0, 2 * pi, length = len)
  out <- tibble(x = cos(tg), b = (1 - abs(x)^q)^(1 / q), bm = -b) |>
    pivot_longer(-x, values_to = "y")
  out$lab <- paste0('"||" * beta * "||"', "[", signif(q, 2), "]")
  return(out)
}

# bivariate normal density over a grid: the elliptical loss contours
ellipse_data <- function(
  n = 75, xlim = c(-2, 3), ylim = c(-2, 3),
  mean = c(1, 1), Sigma = matrix(c(1, 0, 0, .5), 2)) {
  expand_grid(
    x = seq(xlim[1], xlim[2], length.out = n),
    y = seq(ylim[1], ylim[2], length.out = n)) |>
    rowwise() |>
    mutate(z = dmvnorm(c(x, y), mean, Sigma))
}

# grid search: the maximum-density point inside the unit l_q ball (positive quadrant)
lballmax <- function(ed, q = 1) {
  ed <- filter(ed, x > 0, y > 0)
  feasible <- (ed$x^q + ed$y^q)^(1 / q) <= 1
  best <- ed[feasible, ]
  best[which.max(best$z), ]
}


nb <- norm_ball(2)
ed <- ellipse_data()
bols <- data.frame(x = 1, y = 1)
bhat <- lballmax(ed, 2)
ggplot(nb, aes(x, y)) +
  xlim(-2, 2) +
  ylim(-2, 2) +
  geom_path(colour = red) +
  geom_contour(mapping = aes(z = z), colour = blue, data = ed, bins = 7) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  geom_point(data = bols) +
  coord_equal() +
  geom_label(
    data = bols,
    mapping = aes(label = bquote("hat(beta)[ols]")),
    parse = TRUE, 
    nudge_x = .3, nudge_y = .3
  ) +
  geom_point(data = bhat) +
  xlab(bquote(beta[1])) +
  ylab(bquote(beta[2])) +
  theme_bw(base_size = 24) +
  geom_label(
    data = bhat,
    mapping = aes(label = bquote("hat(beta)[s]^R")),
    parse = TRUE,
    nudge_x = -.4, nudge_y = -.4
  )

Brief aside on norms

Recall, for a vector \(z \in \R^p\)

\[\snorm{z}_2 = \sqrt{z_1^2 + z_2^2 + \cdots + z^2_p} = \sqrt{\sum_{j=1}^p z_j^2}\]

So,

\[\snorm{z}^2_2 = z_1^2 + z_2^2 + \cdots + z^2_p = \sum_{j=1}^p z_j^2.\]

Other norms we should remember:

\(\ell_q\)-norm
\(\left(\sum_{j=1}^p |z_j|^q\right)^{1/q}\)
\(\ell_1\)-norm (special case)
\(\sum_{j=1}^p |z_j|\)
\(\ell_0\)-norm
\(\sum_{j=1}^p I(z_j \neq 0 ) = \lvert \{j : z_j \neq 0 \}\rvert\)
\(\ell_\infty\)-norm
\(\max_{1\leq j \leq p} |z_j|\)
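
These norms are easy to check numerically. A tiny R illustration (not from the original slides), for an arbitrary vector z:

z <- c(-3, 0, 1, 2)
sqrt(sum(z^2)) # l2 norm
sum(abs(z))    # l1 norm
sum(z != 0)    # l0 "norm": the number of nonzero entries
max(abs(z))    # l-infinity norm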

Ridge regression

An equivalent way to write

\[\brt = \argmin_{ || \beta ||_2^2 \leq s} \frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2\]

is in the Lagrangian form

\[\brl = \argmin_{ \beta} \frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2 + \lambda || \beta ||_2^2.\]

For every \(\lambda\) there is a unique \(s\) (and vice versa) that makes

\[\brt = \brl\]

Ridge regression

\(\brt = \argmin_{ || \beta ||_2^2 \leq s} \frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2\)

\(\brl = \argmin_{ \beta} \frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2 + \lambda || \beta ||_2^2\)

Observe:

  • \(\lambda = 0\) (or \(s = \infty\)) makes \(\brl = \bls\)
  • Any \(\lambda > 0\) (or \(s <\infty\)) penalizes larger values of \(\beta\), effectively shrinking them.

\(\lambda\) and \(s\) are known as tuning parameters

Visualizing ridge regression (2 coefficients)

Code
library(tidyverse) # expand_grid, pivot_longer, map, set_names, ggplot2, etc.
set.seed(406) # arbitrary seed so the simulated data below are reproducible
b <- c(1, 1)
n <- 1000
lams <- c(1, 5, 10)
# average squared-error loss on a grid of (b1, b2) values; X and y are simulated below
ols_loss <- function(b1, b2) colMeans((y - X %*% rbind(b1, b2))^2) / 2
# ridge penalty evaluated on the same grid
pen <- function(b1, b2, lambda = 1) lambda * (b1^2 + b2^2) / 2
gr <- expand_grid(
  b1 = seq(b[1] - 0.5, b[1] + 0.5, length.out = 100),
  b2 = seq(b[2] - 0.5, b[2] + 0.5, length.out = 100)
)

X <- mvtnorm::rmvnorm(n, c(0, 0), sigma = matrix(c(1, .3, .3, .5), nrow = 2))
y <- drop(X %*% b + rnorm(n))

bols <- coef(lm(y ~ X - 1))
bridge <- coef(MASS::lm.ridge(y ~ X - 1, lambda = lams * sqrt(n)))

penalties <- lams |>
  set_names(~ paste("lam =", .)) |>
  map(~ pen(gr$b1, gr$b2, .x)) |>
  as_tibble()
gr <- gr |>
  mutate(loss = ols_loss(b1, b2)) |>
  bind_cols(penalties)

g1 <- ggplot(gr, aes(b1, b2)) +
  geom_raster(aes(fill = loss)) +
  scale_fill_viridis_c(direction = -1) +
  geom_point(data = data.frame(b1 = b[1], b2 = b[2])) +
  theme(legend.position = "bottom") +
  guides(fill = guide_colourbar(barwidth = 20, barheight = 0.5))

g2 <- gr |>
  pivot_longer(starts_with("lam")) |>
  mutate(name = factor(name, levels = paste("lam =", lams))) |>
  ggplot(aes(b1, b2)) +
  geom_raster(aes(fill = value)) +
  scale_fill_viridis_c(direction = -1, name = "penalty") +
  facet_wrap(~name, ncol = 1) +
  geom_point(data = data.frame(b1 = b[1], b2 = b[2])) +
  theme(legend.position = "bottom") +
  guides(fill = guide_colourbar(barwidth = 10, barheight = 0.5))

g3 <- gr |> 
  mutate(across(starts_with("lam"), ~ loss + .x)) |>
  pivot_longer(starts_with("lam")) |>
  mutate(name = factor(name, levels = paste("lam =", lams))) |>
  ggplot(aes(b1, b2)) +
  geom_raster(aes(fill = value)) +
  scale_fill_viridis_c(direction = -1, name = "loss + pen") +
  facet_wrap(~name, ncol = 1) +
  geom_point(data = data.frame(b1 = b[1], b2 = b[2])) +
  theme(legend.position = "bottom") +
  guides(fill = guide_colourbar(barwidth = 10, barheight = 0.5))

cowplot::plot_grid(g1, g2, g3, rel_widths = c(2, 1, 1), nrow = 1)

The effect on the estimates

Code
gr |> 
  mutate(z = ols_loss(b1, b2) + max(lams) * pen(b1, b2)) |>
  ggplot(aes(b1, b2)) +
  geom_raster(aes(fill = z)) +
  scale_fill_viridis_c(direction = -1) +
  geom_point(data = tibble(
    b1 = c(bols[1], bridge[,1]),
    b2 = c(bols[2], bridge[,2]),
    estimate = factor(c("ols", paste0("ridge = ", lams)), 
                      levels = c("ols", paste0("ridge = ", lams)))
  ),
  aes(shape = estimate), size = 3) +
  geom_point(data = data.frame(b1 = b[1], b2 = b[2]), colour = "orange", size = 4) # named colour in place of the course palette variable

Example data

prostate data from [ESL]

data(prostate, package = "ElemStatLearn")
prostate |> as_tibble()
# A tibble: 97 × 10
   lcavol lweight   age   lbph   svi   lcp gleason pgg45   lpsa train
    <dbl>   <dbl> <int>  <dbl> <int> <dbl>   <int> <int>  <dbl> <lgl>
 1 -0.580    2.77    50 -1.39      0 -1.39       6     0 -0.431 TRUE 
 2 -0.994    3.32    58 -1.39      0 -1.39       6     0 -0.163 TRUE 
 3 -0.511    2.69    74 -1.39      0 -1.39       7    20 -0.163 TRUE 
 4 -1.20     3.28    58 -1.39      0 -1.39       6     0 -0.163 TRUE 
 5  0.751    3.43    62 -1.39      0 -1.39       6     0  0.372 TRUE 
 6 -1.05     3.23    50 -1.39      0 -1.39       6     0  0.765 TRUE 
 7  0.737    3.47    64  0.615     0 -1.39       6     0  0.765 FALSE
 8  0.693    3.54    58  1.54      0 -1.39       6     0  0.854 TRUE 
 9 -0.777    3.54    47 -1.39      0 -1.39       6     0  1.05  FALSE
10  0.223    3.24    63 -1.39      0 -1.39       6     0  1.05  FALSE
# ℹ 87 more rows

Ridge regression path

Y <- prostate$lpsa
X <- model.matrix(~ ., data = prostate |> dplyr::select(-train, -lpsa))
library(glmnet)
ridge <- glmnet(x = X, y = Y, alpha = 0, lambda.min.ratio = .00001)


plot(ridge, xvar = "lambda", lwd = 3)

Model selection here:

  • means choosing some \(\lambda\) (see the sketch after this list)

  • A value of \(\lambda\) corresponds to a vertical line in this plot.

  • This graphic is a “path” or “coefficient trace”

  • Coefficients for varying \(\lambda\)
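
For example, once a value of \(\lambda\) is chosen, the corresponding coefficients and fitted values can be read directly off the fitted path. A minimal sketch, with \(\lambda = 0.1\) chosen arbitrarily for illustration:

coef(ridge, s = 0.1)                    # coefficients at lambda = 0.1
head(predict(ridge, newx = X, s = 0.1)) # fitted values at the same lambda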

Solving the minimization

  • One nice thing about ridge regression is that it has a closed-form solution (like OLS)

\[\brl = (\X^\top\X + \lambda \mathbf{I})^{-1}\X^\top \y\]

  • This is easy to calculate in R for any \(\lambda\) (a small sketch follows this list).

  • However, computations and interpretation are simplified if we examine the Singular Value Decomposition of \(\X = \mathbf{UDV}^\top\).

  • Recall: any matrix has an SVD.

  • Here \(\mathbf{D}\) is diagonal and \(\mathbf{U}\) and \(\mathbf{V}\) have orthonormal columns: \(\mathbf{U}^\top\mathbf{U} = \mathbf{V}^\top\mathbf{V} = \mathbf{I}\).
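
As a quick check of the closed form, here is a sketch that applies the displayed formula to the prostate X and Y built earlier; \(\lambda = 1\) is an arbitrary illustrative value, and no standardization is done, so this will not match glmnet's output exactly:

lambda <- 1
beta_ridge <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, Y))
drop(beta_ridge)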

Solving the minimization

\[\brl = (\X^\top\X + \lambda \mathbf{I})^{-1}\X^\top \y\]

  • Note that \(\mathbf{X}^\top\mathbf{X} = \mathbf{VDU}^\top\mathbf{UDV}^\top = \mathbf{V}\mathbf{D}^2\mathbf{V}^\top\).

  • Then,

\[\brl = (\X^\top \X + \lambda \mathbf{I})^{-1}\X^\top \y = (\mathbf{VD}^2\mathbf{V}^\top + \lambda \mathbf{I})^{-1}\mathbf{VDU}^\top \y = \mathbf{V}(\mathbf{D}^2+\lambda \mathbf{I})^{-1} \mathbf{DU}^\top \y.\]

  • For computations, we now only need to invert the diagonal matrix \(\mathbf{D}^2+\lambda \mathbf{I}\), which is trivial (see the check below).
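
A small numerical check (not in the original slides) that the SVD expression agrees with the direct solve, again using the prostate X and Y and an arbitrary \(\lambda = 1\):

lambda <- 1
s <- svd(X) # X = U D V'
beta_svd <- s$v %*% ((s$d / (s$d^2 + lambda)) * crossprod(s$u, Y))
beta_direct <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, Y))
max(abs(beta_svd - beta_direct)) # numerically zero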

Comparing with OLS

  • \(\mathbf{D}\) is a diagonal matrix

\[\bls = (\X^\top\X)^{-1}\X^\top \y = (\mathbf{VD}^2\mathbf{V}^\top)^{-1}\mathbf{VDU}^\top \y = \mathbf{V}\color{red}{\mathbf{D}^{-2}\mathbf{D}}\mathbf{U}^\top \y = \mathbf{V}\color{red}{\mathbf{D}^{-1}}\mathbf{U}^\top \y\]

\[\brl = (\X^\top \X + \lambda \mathbf{I})^{-1}\X^\top \y = \mathbf{V}\color{red}{(\mathbf{D}^2+\lambda \mathbf{I})^{-1}} \mathbf{DU}^\top \y.\]

  • Notice that \(\bls\) depends on \(d_j/d_j^2\) while \(\brl\) depends on \(d_j/(d_j^2 + \lambda)\).

  • Ridge regression makes the coefficients smaller relative to OLS.

  • But if \(\X\) has small singular values, ridge regression compensates with \(\lambda\) in the denominator.
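
To see the two sets of factors side by side, a sketch reusing the singular values of the prostate X (with the arbitrary \(\lambda = 1\)):

d <- svd(X)$d
cbind(d = d, ols = 1 / d, ridge = d / (d^2 + 1))
# the ridge factors are uniformly smaller, and stay bounded even when d is tiny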

Ridge regression and multicollinearity

Multicollinearity: a linear combination of predictor variables is nearly equal to another predictor variable.

Some comments:

  • A better phrase: \(\X\) is ill-conditioned

  • AKA “(numerically) rank-deficient”.

  • \(\X = \mathbf{U D V}^\top\) ill-conditioned \(\Longleftrightarrow\) some elements of \(\mathbf{D} \approx 0\)

  • \(\bls= \mathbf{V D}^{-1} \mathbf{U}^\top \y\), so small entries of \(\mathbf{D}\) \(\Longleftrightarrow\) huge elements of \(\mathbf{D}^{-1}\)

  • Means huge variance: \(\Var{\bls} = \sigma^2(\X^\top \X)^{-1} = \sigma^2 \mathbf{V D}^{-2} \mathbf{V}^\top\)

Ridge regression and ill-posed \(\X\)

Ridge regression fixes this problem by preventing division by a near-zero number.

Conclusion
\((\X^{\top}\X)^{-1}\) can be really unstable, while \((\X^{\top}\X + \lambda \mathbf{I})^{-1}\) is not.
Aside
The engineering approach to solving linear systems is to always add a small \(\lambda\); there the motivation is numerical stability rather than statistics.
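
A tiny demonstration with a simulated, nearly collinear design (illustrative only, not from the slides):

set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 1e-6) # x2 is almost an exact copy of x1
Xc <- cbind(x1, x2)
kappa(crossprod(Xc))                 # enormous condition number: inversion is unstable
kappa(crossprod(Xc) + 0.1 * diag(2)) # adding lambda * I shrinks it by many orders of magnitude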

Which \(\lambda\) to use?

Computational
Use CV and pick the \(\lambda\) that minimizes the CV estimate of prediction risk (a sketch follows below).
Intuition (bias)
As \(\lambda\rightarrow\infty\), bias ⬆
Intuition (variance)
As \(\lambda\rightarrow\infty\), variance ⬇

You should think about why.
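
For the computational route, cv.glmnet() automates the cross-validation. A sketch reusing the prostate X and Y from above (alpha = 0 again gives ridge):

cv_ridge <- cv.glmnet(x = X, y = Y, alpha = 0) # 10-fold CV by default
cv_ridge$lambda.min # the CV-risk-minimizing lambda
plot(cv_ridge)      # CV curve across the lambda path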

Can we get the best of both worlds?

To recap:

  • Deciding which predictors to include, or whether to add quadratic terms or interactions, is model selection (more precisely, variable selection within a linear model).

  • Ridge regression provides regularization, which trades off bias and variance and also stabilizes estimation under multicollinearity.

  • If the LM is true,

    1. OLS is unbiased, but its variance depends on \(\mathbf{D}^{-2}\), which can be big.
    2. Ridge is biased (can you find the bias?), but its variance is smaller than that of OLS.
  • Ridge regression does not perform variable selection.

  • But picking \(\lambda=3.7\) and thereby deciding to predict with \(\widehat{\beta}^R_{3.7}\) is model selection.

Can we get the best of both worlds?

Ridge regression
\(\minimize \frac{1}{n}||\y-\X\beta||_2^2 \ \st\ ||\beta||_2^2 \leq s\)
Best (in-sample) linear regression model of size \(s\)
\(\minimize \frac{1}{n}||\y-\X\beta||_2^2 \ \st\ ||\beta||_0 \leq s\)

\(||\beta||_0\) is the number of nonzero elements in \(\beta\)

Finding the best in-sample linear model (of size \(s\), among these predictors) is a nonconvex optimization problem (in fact, it is NP-hard).

The ridge regression problem is convex (and easy to solve), but ridge does not perform variable selection.

Can we somehow “interpolate” to get both?

Note: selecting \(\lambda\) is still model selection, but we’ve included all the variables.

Next time…

The lasso, interpolating variable selection and model selection