\[
S(z, \gamma) = \textrm{sign}(z)(|z| - \gamma)_+ = \begin{cases} z - \gamma & z > \gamma\\
z + \gamma & z < -\gamma \\ 0 & |z| \leq \gamma \end{cases}
\]

Iterating this update over each coordinate in turn is called coordinate descent, and it converges to the lasso solution.
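A minimal sketch of that iteration (not the optimized {glmnet} implementation; `soft_threshold()` and `lasso_cd()` are illustrative names, and we assume the columns of X are centered and scaled so that `crossprod(X[, j]) == n`):

```r
# Soft-thresholding operator S(z, gamma) from the display above
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# Coordinate descent for (1 / (2n)) * sum((y - X %*% b)^2) + lambda * sum(abs(b)),
# assuming each column of X satisfies crossprod(X[, j]) == n
lasso_cd <- function(X, y, lambda, maxit = 1000, tol = 1e-8) {
  n <- nrow(X); p <- ncol(X)
  beta <- rep(0, p)
  for (it in seq_len(maxit)) {
    beta_old <- beta
    for (j in seq_len(p)) {
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j] # partial residual, excluding coordinate j
      z_j <- crossprod(X[, j], r_j) / n             # univariate least-squares coefficient
      beta[j] <- soft_threshold(z_j, lambda)        # apply S(z, lambda)
    }
    if (max(abs(beta - beta_old)) < tol) break      # stop once no coordinate moves
  }
  beta
}
```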

Packages

There are two main R implementations for fitting the lasso:

{glmnet}: lasso <- glmnet(X, Y, alpha = 1).

Setting alpha = 0 gives ridge regression (as does lm.ridge in the MASS package)

Setting alpha \(\in (0,1)\) gives a method called the “elastic net”, which combines ridge regression and the lasso; more on that next lecture.

If you don’t specify alpha, it defaults to 1 (lasso).

{lars}: lars <- lars(X, Y)

lars() also does other things called “Least angle” and “forward stagewise” in addition to “forward stepwise” regression

The path returned by lars() is more useful than that returned by glmnet().

But you should use {glmnet}.

Choosing the \(\lambda\)

You have to choose \(\lambda\) in lasso or in ridge regression

lasso selects variables (by setting coefficients to zero), but the value of \(\lambda\) determines how many/which.

All of these packages come with CV built in.

However, the way to do it differs from package to package

{glmnet} version (same procedure for lasso or ridge)

```r
lasso <- cv.glmnet(X, Y) # estimate full model and CV; no good reason to call glmnet() itself
# 2. Look at the CV curve. If the dashed lines are at the boundaries, redo and adjust lambda
lambda_min <- lasso$lambda.min # the value, not the location (or use lasso$lambda.1se)
coeffs <- coefficients(lasso, s = "lambda.min") # s can be a string or a number
preds <- predict(lasso, newx = X, s = "lambda.1se") # must supply `newx`
```

\(\widehat{R}_{CV}\) is an estimator of \(R_n\), so it has both bias and variance.

Because we did CV, we actually have 10 \(\widehat{R}\) values, 1 per split.

We calculate the mean (that’s what we’ve been using so far), but what about the standard error?
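cv.glmnet() already stores both pieces: the fold-averaged error and its standard error at every \(\lambda\):

```r
lasso <- cv.glmnet(X, Y)
lasso$cvm  # mean CV error at each lambda (the curve we minimize)
lasso$cvsd # estimated standard error of that mean, across the folds
# lambda.1se: the largest lambda whose cvm is within one SE of the minimum
c(lasso$lambda.min, lasso$lambda.1se)
```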

```r
par(mfrow = c(1, 2), mar = c(5, 3, 3, 0))
plot(lasso) # a plot method for the cv fit
plot(lasso$glmnet.fit) # the glmnet.fit == glmnet(X, Y)
abline(v = colSums(abs(coef(lasso$glmnet.fit)[-1, drop(lasso$index)])), lty = 2)
```

Paths with chosen lambda

```r
ridge <- cv.glmnet(X, Y, alpha = 0, lambda.min.ratio = 1e-10) # added to get a minimum
par(mfrow = c(1, 4))
plot(ridge, main = "Ridge")
plot(lasso, main = "Lasso")
plot(ridge$glmnet.fit, main = "Ridge")
abline(v = sum(abs(coef(ridge)))) # defaults to `lambda.1se`
plot(lasso$glmnet.fit, main = "Lasso")
abline(v = sum(abs(coef(lasso)))) # again, `lambda.1se` unless told otherwise
```

Degrees of freedom

Lasso is not a linear smoother. There is no matrix \(\mathbf{S}\) such that \(\widehat{\mathbf{y}} = \mathbf{S}\mathbf{y}\) for the predicted values from lasso.

We can’t use cv_nice().

We don’t have \(\textrm{tr}(\mathbf{S}) = \textrm{df}\) because there is no \(\mathbf{S}\).

However,

One can show that \(\textrm{df}_\lambda = \mathbb{E}[\#\{j : \widehat{\beta}_{\lambda,j} \neq 0\}] = \mathbb{E}[\Vert\widehat{\beta}_\lambda\Vert_0]\)

The proof is PhD-level material

Note that \(\widehat{\textrm{df}}_\lambda\) is shown along the top of the CV plot, and that lasso$glmnet.fit$df contains this value for every \(\lambda\).
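A quick sanity check of that claim, using the cv.glmnet object fit above:

```r
# df at each lambda should equal the number of nonzero coefficients
# (excluding the unpenalized intercept in row 1 of the coefficient matrix)
nonzero <- colSums(as.matrix(coef(lasso$glmnet.fit))[-1, ] != 0)
all(nonzero == lasso$glmnet.fit$df)
```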

Other flavours

The elastic net

generally used for correlated predictors; combines the ridge and lasso penalties. Use glmnet(..., alpha = a) (0 < a < 1).

Grouped lasso

where variables are included or excluded in groups. Required for factors (1-hot encoding)
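A sketch using the {gglasso} package (one grouped-lasso implementation; `dat` is a hypothetical data frame of predictors):

```r
library(gglasso)
Xmat <- model.matrix(~ . - 1, data = dat) # 1-hot encode any factor columns
grp <- attr(Xmat, "assign")               # same group index for columns from the same factor
gfit <- gglasso(Xmat, Y, group = grp, loss = "ls")
```

Because the columns of each factor share one group index, the whole factor enters or leaves the model together.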

Relaxed lasso

Takes the estimated model from lasso and fits the full least squares solution on the selected covariates (less bias, more variance). Use glmnet(..., relax = TRUE).
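The refitting step is easy to sketch by hand (a rough one-\(\lambda\) version of what relax = TRUE does along the whole path):

```r
# Refit ordinary least squares on the variables lasso selected at lambda.min
sel <- which(coef(lasso, s = "lambda.min")[-1] != 0) # drop the intercept row
relaxed <- lm(Y ~ X[, sel, drop = FALSE])
```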

Dantzig selector

a slightly modified version of the lasso

Lasso cinematic universe

SCAD

a non-convex version of lasso that adds a more severe variable selection penalty

\(\sqrt{\textrm{lasso}}\)

claims to be tuning-parameter free (but isn’t). Uses \(\Vert\cdot\Vert_2\) instead of \(\Vert\cdot\Vert_2^2\) for the loss.

Generalized lasso

Adds various additional matrices to the penalty term (e.g. \(\Vert D\beta\Vert_1\)).

Arbitrary combinations

combine the above penalties in your favourite combinations

Warnings on regularized regression

This isn’t a method unless you say how to choose \(\lambda\).

The intercept is never penalized; it adds one extra degree of freedom.

Predictor scaling is very important.

Discrete predictors need groupings.

Centering the predictors is important

(These all work with other likelihoods.)

Software handles most of these automatically, but not always ({glmnet} doesn’t accept factor predictors, for example).
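For instance, {glmnet} deals with the scaling and intercept warnings internally:

```r
# glmnet standardizes the columns of X internally by default, then reports
# coefficients back on the original scale; the intercept is never penalized
fit <- glmnet(X, Y, standardize = TRUE, intercept = TRUE) # both are the defaults
```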

Next time…

What happens when we’re tired of all this linearity.