09 L1 penalties

Stat 406

Daniel J. McDonald

Last modified – 02 October 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \]

Last time

Ridge regression
\(\min \frac{1}{n}\snorm{\y-\X\beta}_2^2 \st \snorm{\beta}_2^2 \leq s\)
Best (in sample) linear regression model of size \(s\)
\(\min \frac 1n \snorm{\y-\X\beta}_2^2 \st \snorm{\beta}_0 \leq s\)

\(\snorm{\beta}_0\) is the number of nonzero elements in \(\beta\)

Finding the “best” linear model (of size \(s\), among these predictors, in sample) is a nonconvex optimization problem (In fact, it is NP-hard)

Ridge regression is convex (easy to solve), but doesn’t do variable selection

Can we somehow “interpolate” to get both?

Geometry of convexity

Code
library(mvtnorm)
library(dplyr)
library(tidyr)
library(ggplot2)

# boundary of the unit l_q "ball" in two dimensions
normBall <- function(q = 1, len = 1000) {
  tg <- seq(0, 2 * pi, length = len)
  out <- data.frame(x = cos(tg)) %>%
    mutate(b = (1 - abs(x)^q)^(1 / q), bm = -b) %>%
    gather(key = "lab", value = "y", -x)
  out$lab <- paste0('"||" * beta * "||"', "[", signif(q, 2), "]")
  return(out)
}

# Gaussian density surface standing in for the least-squares contours
ellipseData <- function(n = 100, xlim = c(-2, 3), ylim = c(-2, 3),
                        mean = c(1, 1), Sigma = matrix(c(1, 0, 0, .5), 2)) {
  df <- expand.grid(
    x = seq(xlim[1], xlim[2], length.out = n),
    y = seq(ylim[1], ylim[2], length.out = n)
  )
  df$z <- dmvnorm(df, mean, Sigma)
  df
}

# point on the l_q ball boundary (positive quadrant) with the largest density,
# i.e. the constrained solution in this picture
lballmax <- function(ed, q = 1, tol = 1e-6) {
  ed <- filter(ed, x > 0, y > 0)
  for (i in 1:20) {
    ff <- abs((ed$x^q + ed$y^q)^(1 / q) - 1) < tol
    if (sum(ff) > 0) break
    tol <- 2 * tol
  }
  best <- ed[ff, ]
  best[which.max(best$z), ]
}

nbs <- list()
nbs[[1]] <- normBall(0, 1) # q = 0 contributes only the facet label; the l0 "ball" is drawn with the segments below
qs <- c(.5, .75, 1, 1.5, 2)
for (ii in 2:6) nbs[[ii]] <- normBall(qs[ii - 1])
nbs <- bind_rows(nbs)
nbs$lab <- factor(nbs$lab, levels = unique(nbs$lab))
seg <- data.frame(
  lab = levels(nbs$lab)[1],
  x0 = c(-1, 0), x1 = c(1, 0), y0 = c(0, -1), y1 = c(0, 1)
)
levels(seg$lab) <- levels(nbs$lab)
ggplot(nbs, aes(x, y)) +
  geom_path(size = 1.2) +
  facet_wrap(~lab, labeller = label_parsed) +
  geom_segment(data = seg, aes(x = x0, xend = x1, y = y0, yend = y1), size = 1.2) +
  theme_bw(base_family = "", base_size = 24) +
  coord_equal() +
  scale_x_continuous(breaks = c(-1, 0, 1)) +
  scale_y_continuous(breaks = c(-1, 0, 1)) +
  geom_vline(xintercept = 0, size = .5) +
  geom_hline(yintercept = 0, size = .5) +
  xlab(bquote(beta[1])) +
  ylab(bquote(beta[2]))

The best of both worlds

Code
# red and blue are colour constants from the course theme; any valid R colours work here
red <- "orangered"
blue <- "steelblue"
nb <- normBall(1)
ed <- ellipseData()
bols <- data.frame(x = 1, y = 1)
bhat <- lballmax(ed, 1)
ggplot(nb, aes(x, y)) +
  geom_path(colour = red) +
  geom_contour(mapping = aes(z = z), colour = blue, data = ed, bins = 7) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  geom_point(data = bols) +
  coord_equal(xlim = c(-2, 2), ylim = c(-2, 2)) +
  theme_bw(base_family = "", base_size = 24) +
  geom_label(
    data = bols, mapping = aes(label = bquote("hat(beta)[ols]")), parse = TRUE,
    nudge_x = .3, nudge_y = .3
  ) +
  geom_point(data = bhat) +
  xlab(bquote(beta[1])) +
  ylab(bquote(beta[2])) +
  geom_label(
    data = bhat, mapping = aes(label = bquote("hat(beta)[s]^L")), parse = TRUE,
    nudge_x = -.4, nudge_y = -.4
  )

This regularization set…

  • … is convex (computationally efficient)
  • … has corners (performs variable selection)

\(\ell_1\)-regularized regression

Known as

  • “lasso”
  • “basis pursuit”

The estimator satisfies

\[\blt = \argmin_{ \snorm{\beta}_1 \leq s} \frac{1}{n}\snorm{\y-\X\beta}_2^2\]

In its corresponding Lagrangian dual form:

\[\bll = \argmin_{\beta} \frac{1}{n}\snorm{\y-\X\beta}_2^2 + \lambda \snorm{\beta}_1\]
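For a fixed \(\lambda\), this penalized criterion is easy to write down directly (as \(s\) and \(\lambda\) vary, the constrained and Lagrangian forms trace out the same family of solutions). A minimal sketch in R; the function name lasso_objective is purely illustrative, and in practice we let software do the minimizing:

# Lasso objective in Lagrangian form: (1/n) ||y - X b||_2^2 + lambda * ||b||_1
# (illustrative only; we never minimize this by hand)
lasso_objective <- function(beta, X, y, lambda) {
  n <- length(y)
  sum((y - X %*% beta)^2) / n + lambda * sum(abs(beta))
}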

Lasso

While the ridge solution can be easily computed

\[\brl = \argmin_{\beta} \frac 1n \snorm{\y-\X\beta}_2^2 + \lambda \snorm{\beta}_2^2 = (\X^{\top}\X + \lambda \mathbf{I})^{-1} \X^{\top}\y\]

the lasso solution

\[\bll = \argmin_{\beta} \frac 1n\snorm{\y-\X\beta}_2^2 + \lambda \snorm{\beta}_1 = \; ??\]

doesn’t have a closed-form solution.

However, because the optimization problem is convex, there exist efficient algorithms for computing it
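To make the contrast concrete, here is the ridge closed form computed directly. This is a sketch only, and ridge_closed_form is an illustrative name: glmnet standardizes the predictors and scales \(\lambda\) differently, so its ridge estimates will not match these numbers exactly.

# Ridge closed form: (X^T X + lambda I)^{-1} X^T y
ridge_closed_form <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

Nothing analogous exists for the lasso; we need an iterative algorithm, like the coordinate descent sketched a few slides below.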

Coefficient path: ridge vs lasso

Code
library(glmnet)
data(prostate, package = "ElemStatLearn")
X <- prostate |> dplyr::select(-train, -lpsa) |>  as.matrix()
Y <- prostate$lpsa
lasso <- glmnet(x = X, y = Y) # alpha = 1 by default
ridge <- glmnet(x = X, y = Y, alpha = 0)
op <- par()
par(mfrow = c(1, 2), mar = c(5, 3, 5, .1))
plot(lasso, main = "Lasso")
plot(ridge, main = "Ridge")

Same but against Lambda

par(mfrow = c(1, 2), mar = c(5, 3, 5, .1))
plot(lasso, main = "Lasso", xvar = "lambda")
plot(ridge, main = "Ridge", xvar = "lambda")

Additional intuition for why Lasso selects variables

Suppose, for a particular \(\lambda\), I have solutions for \(\widehat{\beta}_k\), \(k = 1,\ldots,j-1, j+1,\ldots,p\).

Let \(\widehat{\y}_{-j} = \X_{-j}\widehat{\beta}_{-j}\), and assume WLOG \(\overline{\X}_k = 0\), \(\X_k^\top\X_k = 1\ \forall k\)

One can show that:

\[ \widehat{\beta}_j = S\left(\mathbf{X}^\top_j(\y - \widehat{\y}_{-j}),\ \lambda\right). \]

\[ S(z, \gamma) = \textrm{sign}(z)(|z| - \gamma)_+ = \begin{cases} z - \gamma & z > \gamma\\ z + \gamma & z < -\gamma \\ 0 & |z| \leq \gamma \end{cases} \]

  • Iterating this update over \(j = 1, \ldots, p\) until convergence is called coordinate descent; it converges to the lasso solution (see the sketch below).
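Here is a minimal coordinate-descent sketch implementing exactly this update, assuming the columns of \(\X\) are centred and scaled as above. The names soft_threshold and lasso_cd are illustrative; glmnet’s implementation handles the intercept, scaling, and convergence checks far more carefully.

# Soft-thresholding operator S(z, gamma)
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# Cycle through the coordinates, applying the update above, until nothing changes
lasso_cd <- function(X, y, lambda, maxit = 100, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (it in seq_len(maxit)) {
    beta_old <- beta
    for (j in seq_along(beta)) {
      resid_j <- y - X[, -j, drop = FALSE] %*% beta[-j] # y - yhat_{-j}
      beta[j] <- soft_threshold(drop(crossprod(X[, j], resid_j)), lambda)
    }
    if (max(abs(beta - beta_old)) < tol) break
  }
  beta
}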

Packages

There are two main R implementations for finding lasso

{glmnet}: lasso = glmnet(X, Y, alpha=1).

  • Setting alpha = 0 gives ridge regression (as does lm.ridge in the MASS package)
  • Setting alpha \(\in (0,1)\) gives a method called the “elastic net”, which combines ridge regression and lasso; more on that next lecture.
  • If you don’t specify alpha, it does lasso

{lars}: lars = lars(X, Y)

  • Besides the lasso, lars() also computes the “least angle”, “forward stagewise”, and “forward stepwise” regression paths

  • The path returned by lars() is more useful than that returned by glmnet().

But you should use {glmnet}.

Choosing the \(\lambda\)

You have to choose \(\lambda\) in lasso or in ridge regression

lasso selects variables (by setting coefficients to zero), but the value of \(\lambda\) determines how many/which.

All of these packages come with CV built in.

However, the way to do it differs from package to package

{glmnet} version (same procedure for lasso or ridge)

lasso <- cv.glmnet(X, Y) # 1. estimate full model and CV; no good reason to call glmnet() itself
# 2. Look at the CV curve. If the dashed lines are at the boundaries, redo and adjust lambda
lambda_min <- lasso$lambda.min # the value, not the location (or use lasso$lambda.1se)
coeffs <- coefficients(lasso, s = "lambda.min") # s can be string or a number
preds <- predict(lasso, newx = X, s = "lambda.1se") # must supply `newx`

  • \(\widehat{R}_{CV}\) is an estimator of \(R_n\); it has both bias and variance.
  • Because we did CV, we actually have 10 \(\widehat{R}\) values, one per split.
  • Calculate the mean (that’s what we’ve been using), but what about the SE?
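These pieces are already stored in the cv.glmnet object: cvm is the mean CV error at each \(\lambda\), cvsd is its estimated standard error, and lambda.1se is the largest \(\lambda\) whose mean error is within one standard error of the minimum. A quick look, reusing the lasso fit from above:

head(cbind(lambda = lasso$lambda, cvm = lasso$cvm, cvsd = lasso$cvsd)) # mean CV error and its SE, per lambda
c(lasso$lambda.min, lasso$lambda.1se) # the minimizer vs. the more conservative 1-SE choice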

par(mfrow = c(1, 2), mar = c(5, 3, 3, 0))
plot(lasso) # a plot method for the cv fit
plot(lasso$glmnet.fit) # the full path; lasso$glmnet.fit is the same as glmnet(X, Y)
abline(v = colSums(abs(coef(lasso$glmnet.fit)[-1, drop(lasso$index)])), lty = 2) # mark lambda.min and lambda.1se on the l1-norm axis

Paths with chosen lambda

ridge <- cv.glmnet(X, Y, alpha = 0, lambda.min.ratio = 1e-10) # added to get a minimum
par(mfrow = c(1, 4))
plot(ridge, main = "Ridge")
plot(lasso, main = "Lasso")
plot(ridge$glmnet.fit, main = "Ridge")
abline(v = sum(abs(coef(ridge)))) # defaults to `lambda.1se`
plot(lasso$glmnet.fit, main = "Lasso")
abline(v = sum(abs(coef(lasso)))) # again, `lambda.1se` unless told otherwise

Degrees of freedom

Lasso is not a linear smoother: there is no matrix \(\mathbf{S}\) such that \(\widehat{\y} = \mathbf{S}\y\) for the predicted values from lasso.

  • We can’t use cv_nice().

  • We don’t have \(\tr{\mathbf{S}} = \textrm{df}\) because there is no \(\mathbf{S}\).

However,

  • One can show that \(\textrm{df}_\lambda = \E[\#(\widehat{\beta}_\lambda \neq 0)] = \E[||\widehat{\beta}_\lambda||_0]\)

  • The proof is PhD-level material

Note that \(\widehat{\textrm{df}}_\lambda\) is displayed along the top of the CV plot, and that lasso$glmnet.fit$df contains this value for every \(\lambda\).
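So for the lasso, \(\widehat{\textrm{df}}_\lambda\) is just a count of nonzero coefficients, which we can check directly (df_hat is an illustrative name; lasso is the cv.glmnet fit from before):

df_hat <- colSums(coef(lasso$glmnet.fit)[-1, ] != 0) # nonzero coefficients at each lambda, intercept excluded
all(df_hat == lasso$glmnet.fit$df) # should agree with the df stored in the fitted object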

Other flavours

The elastic net
combines the ridge and lasso penalties; generally used when predictors are correlated. Use glmnet(..., alpha = a) with 0 < a < 1.
Grouped lasso
variables are included or excluded in groups. Required for factors (one-hot encoding).
Relaxed lasso
takes the variables selected by lasso and refits them by full least squares (less bias, more variance). Use glmnet(..., relax = TRUE).
Dantzig selector
a slightly modified version of the lasso.

Lasso cinematic universe

SCAD
a non-convex version of lasso that adds a more severe variable selection penalty
\(\sqrt{\textrm{lasso}}\)
claims to be tuning-parameter free (but isn’t). Uses \(\Vert\cdot\Vert_2\) instead of \(\Vert\cdot\Vert_2^2\) for the loss.
Generalized lasso
Adds various additional matrices to the penalty term (e.g. \(\Vert D\beta\Vert_1\)).
Arbitrary combinations
combine the above penalties in your favourite combinations

Warnings on regularized regression

  1. This isn’t a method unless you say how to choose \(\lambda\).
  2. The intercept is never penalized. Adds an extra degree-of-freedom.
  3. Predictor scaling is very important.
  4. Discrete predictors need groupings.
  5. Centering the predictors is important
  6. (These all work with other likelihoods.)

Software handles most of these automatically, but not always. (No Lasso with factor predictors.)
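As a partial check on items 2, 3, and 5, {glmnet}’s defaults already fit an unpenalized intercept and standardize the columns of X internally (coefficients are reported on the original scale). The object names below are illustrative:

fit_default <- glmnet(X, Y) # standardize = TRUE and intercept = TRUE are the defaults
fit_raw <- glmnet(X, Y, standardize = FALSE) # only sensible if you have scaled X yourself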

Next time…

What happens when we’re tired of all this linearity.