class: center, middle, inverse, title-slide

.title[# 08 Ridge regression]

.author[### STAT 406]

.author[### Daniel J. McDonald]

.date[### Last modified - 2022-10-03]

---
class: middle, center
background-image: url("https://disher.com/wp-content/uploads/2017/02/Product-Constraints.jpg")
background-size: cover

.primary[.larger[Module]] .huge-blue-number[2]

.primary[regularization, constraints, and nonparametrics]

---

## Recap

So far, we have thought of __model selection__ as

.hand[Decide which predictors we would like to use in our linear model]

Or similarly:

.hand[Decide which of a few linear models to use]

To do this, we used a risk estimate, and chose the "model" with the lowest estimate.

--

Let's call what we've done __variable selection__, a specific case of __model selection__.

--

Moving forward, we need to generalize this to

.hand[Decide which of possibly infinitely many prediction functions] `\(f\in\mathcal{F}\)` .hand[to use]

Thankfully, this isn't really any different. We still use those same risk estimates.

__Remember:__ We were choosing models that balance bias and variance (and hence have low prediction risk).

`$$\newcommand{\Expect}[1]{E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\E}{E} \renewcommand{\P}{\mathbb{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\X}{\mathbf{X}} \newcommand{\y}{\mathbf{y}}$$`

---

## Regularization

* Another way to control bias and variance is through __regularization__ or __shrinkage__.

* Rather than selecting a few predictors that seem reasonable, maybe trying a few combinations, use them all.

* I mean __ALL__.

* But, make your estimates of `\(\beta\)` "smaller".

---

## Brief aside on optimization

* An optimization problem has 2 components:

    1. The "objective function": e.g. `\(\frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2\)`.
    2. The "constraint": e.g. "fewer than 5 non-zero entries in `\(\beta\)`".

* A constrained minimization problem is written

`$$\min_\beta f(\beta)\;\; \mbox{ subject to }\;\; C(\beta)$$`

* `\(f(\beta)\)` is the objective function

* `\(C(\beta)\)` is the constraint
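
---

## Brief aside on optimization

To make the objective-function piece concrete, here is a minimal sketch with made-up toy data (two predictors whose true coefficients, 2 and -1, are chosen arbitrarily): write the unconstrained least-squares objective as an `R` function and minimize it numerically with `optim()`.

```r
# Toy illustration (hypothetical data): minimize the least-squares
# objective f(beta) numerically with a general-purpose optimizer.
set.seed(406)
n <- 100
x <- matrix(rnorm(n * 2), n, 2)
y <- drop(x %*% c(2, -1)) + rnorm(n)
objective <- function(beta) mean((y - drop(x %*% beta))^2)
optim(c(0, 0), objective)$par # lands on the OLS fit, close to c(2, -1)
```

Ridge regression (next slide) keeps this same objective but adds a constraint of the form `\(\sum_j \beta_j^2 \leq s\)`.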

---

## Ridge regression (constrained version)

One way to do this for regression is to solve (say):

`\begin{aligned} \min_\beta &\ \ \frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2 \\ \mbox{s.t.} &\ \ \sum_j \beta^2_j \leq s \end{aligned}`

for some `\(s>0\)`.

* This is called "ridge regression".

* The __minimizer__ of this problem is called `\(\brt\)`.

--

Compare this to ordinary least squares:

`\begin{aligned} \min_\beta &\ \ \frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2 \\ \mbox{s.t.} &\ \ \beta \in \R^p \end{aligned}`

---

## Geometry of ridge regression (2 dimensions)

<img src="rmd_gfx/08-ridge-regression/unnamed-chunk-1-1.svg" style="display: block; margin: auto;" />

---

## Brief aside on norms

Recall, for a vector `\(z \in \R^p\)`

`$$\norm{z}_2 = \sqrt{z_1^2 + z_2^2 + \cdots + z^2_p} = \sqrt{\sum_{j=1}^p z_j^2}$$`

So,

`$$\norm{z}^2_2 = z_1^2 + z_2^2 + \cdots + z^2_p = \sum_{j=1}^p z_j^2.$$`

--

Other norms we should remember:

`\(\ell_q\)`-norm
: `\(\left(\sum_{j=1}^p |z_j|^q\right)^{1/q}\)`

`\(\ell_1\)`-norm (special case)
: `\(\sum_{j=1}^p |z_j|\)`

`\(\ell_0\)`-norm
: `\(\sum_{j=1}^p I(z_j \neq 0 ) = \lvert \{j : z_j \neq 0 \}\rvert\)`

---

## Ridge regression

An equivalent way to write

`$$\brt = \arg \min_{ || \beta ||_2^2 \leq s} \frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2$$`

is in the __Lagrangian__ form

`$$\brl = \arg \min_{ \beta} \frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2 + \lambda || \beta ||_2^2.$$`

For every `\(\lambda\)` there is a unique `\(s\)` (and vice versa) that makes

`$$\brt = \brl$$`

--

Observe:

* `\(\lambda = 0\)` (or `\(s = \infty\)`) makes `\(\brl = \bls\)`
* Any `\(\lambda > 0\)` (or `\(s < \infty\)`) penalizes larger values of `\(\beta\)`, effectively shrinking them.

Note: `\(\lambda\)` and `\(s\)` are known as __tuning parameters__.

---

## Example data

__prostate__ data from [ESL]

```r
data(prostate, package = "ElemStatLearn")
head(prostate)
```

```
##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa train
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829  TRUE
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189  TRUE
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189  TRUE
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189  TRUE
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636  TRUE
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678  TRUE
```

???

Use lpsa as response.

---

## Ridge regression path

```r
library(dplyr)
library(glmnet)
Y <- prostate$lpsa
X <- prostate %>% select(-train, -lpsa) %>% as.matrix()
ridge <- glmnet(x = X, y = Y, alpha = 0, lambda.min.ratio = .00001)
```

.pull-left-wide[
```r
plot(ridge, xvar = "lambda")
```

<img src="rmd_gfx/08-ridge-regression/plot-prostate-ridge-1.svg" style="display: block; margin: auto;" />
]

.pull-right-narrow[
Model selection here:

* means __choose__ some `\(\lambda\)`
* A value of `\(\lambda\)` is a vertical line.
* This graphic is a "path" or "coefficient trace".
* Coefficients for varying `\(\lambda\)`.
]
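
---

## Ridge regression path

Choosing a vertical line on the path means asking for the coefficients at one particular `\(\lambda\)`. A small sketch using the `ridge` fit above (the two `\(\lambda\)` values are arbitrary, picked only to show the shrinkage):

```r
# Coefficients at two arbitrary lambda values; the larger lambda
# shrinks the whole coefficient vector further toward zero.
cbind(coef(ridge, s = 0.05), coef(ridge, s = 5))
```

We still need a risk estimate to decide which `\(\lambda\)` (which vertical line) to actually use.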

---

## Solving the minimization

* One nice thing about ridge regression is that it has a closed-form solution (like OLS)

`$$\brl = (\X^\top\X + \lambda \mathbf{I})^{-1}\X^\top \y$$`

* This is easy to calculate in `R` for any `\(\lambda\)`.

* However, computations and interpretation are simplified if we examine the Singular Value Decomposition of `\(\X = \mathbf{UDV}^\top\)`.

* Recall: any matrix has an SVD.

* Here `\(\mathbf{D}\)` is diagonal and `\(\mathbf{U}\)` and `\(\mathbf{V}\)` are orthonormal: `\(\mathbf{U}^\top\mathbf{U} = \mathbf{I}\)`.

* Note that `\(\mathbf{X}^\top\mathbf{X} = \mathbf{VDU}^\top\mathbf{UDV}^\top = \mathbf{V}\mathbf{D}^2\mathbf{V}^\top\)`.

* Then,

`$$\brl = (\X^\top \X + \lambda \mathbf{I})^{-1}\X^\top \y = (\mathbf{VD}^2\mathbf{V}^\top + \lambda \mathbf{I})^{-1}\mathbf{VDU}^\top \y = \mathbf{V}(\mathbf{D}^2+\lambda \mathbf{I})^{-1} \mathbf{DU}^\top \y.$$`

* For computations, we now only need to invert the diagonal matrix `\(\mathbf{D}^2 + \lambda \mathbf{I}\)`.

---

## Comparing with OLS

* `\(\mathbf{D}\)` is a diagonal matrix

`$$\bls = (\X^\top\X)^{-1}\X^\top \y = (\mathbf{VD}^2\mathbf{V}^\top)^{-1}\mathbf{VDU}^\top \y = \mathbf{V}\color{red}{\mathbf{D}^{-2}\mathbf{D}}\mathbf{U}^\top \y = \mathbf{V}\color{red}{\mathbf{D}^{-1}}\mathbf{U}^\top \y$$`

`$$\brl = (\X^\top \X + \lambda \mathbf{I})^{-1}\X^\top \y = \mathbf{V}\color{red}{(\mathbf{D}^2+\lambda \mathbf{I})^{-1}} \mathbf{DU}^\top \y.$$`

* Notice that `\(\bls\)` depends on `\(d_j/d_j^2\)` while `\(\brl\)` depends on `\(d_j/(d_j^2 + \lambda)\)`.

* Ridge regression makes the coefficients smaller relative to OLS.

* But if `\(\X\)` has small singular values, ridge regression compensates with `\(\lambda\)` in the denominator.

---

## Ridge regression and multicollinearity

__Multicollinearity:__ a linear combination of predictor variables is nearly equal to another predictor variable.

Some comments:

* A better phrase: `\(\X\)` is ill-conditioned
* AKA "(numerically) rank-deficient".
* `\(\X = \mathbf{U D V}^\top\)` ill-conditioned `\(\Longleftrightarrow\)` some elements of `\(\mathbf{D} \approx 0\)`
* `\(\bls= \mathbf{V D}^{-1} \mathbf{U}^\top \y\)`, so small entries of `\(\mathbf{D}\)` `\(\Longleftrightarrow\)` huge elements of `\(\mathbf{D}^{-1}\)`
* Means huge variance: `\(\Var{\bls} = \sigma^2(\X^\top \X)^{-1} = \sigma^2 \mathbf{V D}^{-2} \mathbf{V}^\top\)`

Ridge regression fixes this problem by preventing the division by a near-zero number.

__Conclusion:__ `\((\X^{\top}\X)^{-1}\)` can be really unstable, while `\((\X^{\top}\X + \lambda \mathbf{I})^{-1}\)` is not.

__Aside:__ The engineering approach to solving linear systems is to always do this with a small `\(\lambda\)`. The thinking is about the numerics rather than the statistics.

---
layout: true

## Can we get the best of both worlds?

---

To recap:

* Deciding which predictors to include, adding quadratic terms, or interactions is __model selection__ (more precisely __variable selection__ within a linear model).

* Ridge regression provides regularization, which trades off bias and variance and also stabilizes multicollinearity.

* If the LM is **true**,
    1. OLS is unbiased, but the variance depends on `\(\mathbf{D}^{-2}\)`, which can be big.
    2. Ridge is biased (can you find the bias?), but its variance is smaller than that of OLS.

* Ridge regression does not perform variable selection.

* But .hand[picking] `\(\lambda=3.7\)` and thereby .hand[deciding] to predict with `\(\widehat{\beta}^R_{3.7}\)` is __model selection__.

---

__Ridge regression__: `\(\min \frac{1}{n}||\y-\X\beta||_2^2 \textrm{ subject to } ||\beta||_2^2 \leq s\)`

__Best (in-sample) linear regression model of size `\(s\)`__: `\(\min \frac{1}{n}||\y-\X\beta||_2^2 \textrm{ subject to } ||\beta||_0 \leq s\)`

`\(||\beta||_0\)` is the number of nonzero elements in `\(\beta\)`

Finding the best in-sample linear model (of size `\(s\)`, among these predictors) is a nonconvex optimization problem (in fact, it is NP-hard).

Ridge regression is convex (easy to solve), but doesn't do __variable__ selection.

Can we somehow "interpolate" to get both?

Note: selecting `\(\lambda\)` is still __model__ selection, but we've included __all__ the variables.
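
---

A quick sketch with the `ridge` fit from before (the `\(\lambda\)` value is arbitrary): no matter how hard we shrink, ridge leaves every coefficient nonzero.

```r
# Ridge shrinks coefficients toward zero but (in practice) never exactly to zero,
# so all 8 prostate predictors stay in the model.
beta_ridge <- as.numeric(coef(ridge, s = 5))[-1] # drop the intercept
sum(beta_ridge != 0)
```

Regularization without variable selection; getting both at once is the question we take up next.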

---
class: center, middle, inverse
layout: false

# Next time...

The lasso, interpolating variable selection and model selection