class: center, middle, inverse, title-slide

.title[# 08 Ridge regression]

.author[### STAT 406]

.author[### Daniel J. McDonald]

.date[### Last modified - 2022-10-03]

---
class: middle, center
background-image: url("https://disher.com/wp-content/uploads/2017/02/Product-Constraints.jpg")
background-size: cover

.primary[.larger[Module]] .huge-blue-number[2]

.primary[regularization, constraints, and nonparametrics]

---

## Recap

So far, we have thought of __model selection__ as

.hand[Decide which predictors we would like to use in our linear model]

Or similarly:

.hand[Decide which of a few linear models to use]

To do this, we used a risk estimate, and chose the "model" with the lowest estimate.

--

Let's call what we've done __variable selection__, a specific case of __model selection__.

--

Moving forward, we need to generalize this to

.hand[Decide which of possibly infinitely many prediction functions] `\(f\in\mathcal{F}\)` .hand[to use]

Thankfully, this isn't really any different. We still use those same risk estimates.

__Remember:__ We were choosing models that balance bias and variance (and hence have low prediction risk).

`$$\newcommand{\Expect}[1]{E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\E}{E} \renewcommand{\P}{\mathbb{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\X}{\mathbf{X}} \newcommand{\y}{\mathbf{y}}$$`

---

## Regularization

* Another way to control bias and variance is through __regularization__ or __shrinkage__.

* Rather than selecting a few predictors that seem reasonable, maybe trying a few combinations, use them all.

* I mean __ALL__.

* But, make your estimates of `\(\beta\)` "smaller".

---

## Brief aside on optimization

* An optimization problem has 2 components:

    1. The "objective function": e.g. `\(\frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2\)`.
    2. The "constraint": e.g. "fewer than 5 non-zero entries in `\(\beta\)`".

* A constrained minimization problem is written

`$$\min_\beta f(\beta)\;\; \mbox{ subject to }\;\; C(\beta)$$`

* `\(f(\beta)\)` is the objective function

* `\(C(\beta)\)` is the constraint
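
---

## Brief aside on optimization

To make the objective-function piece concrete, here is a minimal sketch with made-up toy data (two predictors whose true coefficients, 2 and -1, are chosen arbitrarily): write the unconstrained least-squares objective as an `R` function and minimize it numerically with `optim()`.

```r
# Toy illustration (hypothetical data): minimize the least-squares
# objective f(beta) numerically with a general-purpose optimizer.
set.seed(406)
n <- 100
x <- matrix(rnorm(n * 2), n, 2)
y <- drop(x %*% c(2, -1)) + rnorm(n)
objective <- function(beta) mean((y - drop(x %*% beta))^2)
optim(c(0, 0), objective)$par # lands on the OLS fit, close to c(2, -1)
```

Ridge regression (next slide) keeps this same objective but adds a constraint of the form `\(\sum_j \beta_j^2 \leq s\)`.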

---

## Ridge regression (constrained version)

One way to do this for regression is to solve (say):

`\begin{aligned} \min_\beta &\ \ \frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2 \\ \mbox{s.t.} &\ \ \sum_j \beta^2_j \leq s \end{aligned}`

for some `\(s>0\)`.

* This is called "ridge regression".

* The __minimizer__ of this problem is called `\(\brt\)`.

--

Compare this to ordinary least squares:

`\begin{aligned} \min_\beta &\ \ \frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2 \\ \mbox{s.t.} &\ \ \beta \in \R^p \end{aligned}`

---

## Geometry of ridge regression (2 dimensions)

<img src="rmd_gfx/08-ridge-regression/unnamed-chunk-1-1.svg" style="display: block; margin: auto;" />

---

## Brief aside on norms

Recall, for a vector `\(z \in \R^p\)`

`$$\norm{z}_2 = \sqrt{z_1^2 + z_2^2 + \cdots + z^2_p} = \sqrt{\sum_{j=1}^p z_j^2}$$`

So,

`$$\norm{z}^2_2 = z_1^2 + z_2^2 + \cdots + z^2_p = \sum_{j=1}^p z_j^2.$$`

--

Other norms we should remember:

`\(\ell_q\)`-norm
: `\(\left(\sum_{j=1}^p |z_j|^q\right)^{1/q}\)`

`\(\ell_1\)`-norm (special case)
: `\(\sum_{j=1}^p |z_j|\)`

`\(\ell_0\)`-norm
: `\(\sum_{j=1}^p I(z_j \neq 0 ) = \lvert \{j : z_j \neq 0 \}\rvert\)`

---

## Ridge regression

An equivalent way to write

`$$\brt = \arg \min_{ || \beta ||_2^2 \leq s} \frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2$$`

is in the __Lagrangian__ form

`$$\brl = \arg \min_{ \beta} \frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2 + \lambda || \beta ||_2^2.$$`

For every `\(\lambda\)` there is a unique `\(s\)` (and vice versa) that makes

`$$\brt = \brl$$`

--

Observe:

* `\(\lambda = 0\)` (or `\(s = \infty\)`) makes `\(\brl = \bls\)`
* Any `\(\lambda > 0\)` (or `\(s < \infty\)`) penalizes larger values of `\(\beta\)`, effectively shrinking them.

Note: `\(\lambda\)` and `\(s\)` are known as __tuning parameters__.

---

## Example data

__prostate__ data from [ESL]

```r
data(prostate, package = "ElemStatLearn")
head(prostate)
```

```
##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa train
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829  TRUE
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189  TRUE
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189  TRUE
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189  TRUE
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636  TRUE
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678  TRUE
```

???

Use lpsa as response.

---

## Ridge regression path

```r
library(dplyr)
library(glmnet)
Y <- prostate$lpsa
X <- prostate %>% select(-train, -lpsa) %>% as.matrix()
ridge <- glmnet(x = X, y = Y, alpha = 0, lambda.min.ratio = .00001)
```

.pull-left-wide[
```r
plot(ridge, xvar = "lambda")
```

<img src="rmd_gfx/08-ridge-regression/plot-prostate-ridge-1.svg" style="display: block; margin: auto;" />
]

.pull-right-narrow[
Model selection here:

* means __choose__ some `\(\lambda\)`
* A value of `\(\lambda\)` is a vertical line.
* This graphic is a "path" or "coefficient trace".
* Coefficients for varying `\(\lambda\)`.
]
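
---

## Ridge regression path

Choosing a vertical line on the path means asking for the coefficients at one particular `\(\lambda\)`. A small sketch using the `ridge` fit above (the two `\(\lambda\)` values are arbitrary, picked only to show the shrinkage):

```r
# Coefficients at two arbitrary lambda values; the larger lambda
# shrinks the whole coefficient vector further toward zero.
cbind(coef(ridge, s = 0.05), coef(ridge, s = 5))
```

We still need a risk estimate to decide which `\(\lambda\)` (which vertical line) to actually use.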

---

## Solving the minimization

* One nice thing about ridge regression is that it has a closed-form solution (like OLS)

`$$\brl = (\X^\top\X + \lambda \mathbf{I})^{-1}\X^\top \y$$`

* This is easy to calculate in `R` for any `\(\lambda\)`.

* However, computations and interpretation are simplified if we examine the Singular Value Decomposition of `\(\X = \mathbf{UDV}^\top\)`.

* Recall: any matrix has an SVD.

* Here `\(\mathbf{D}\)` is diagonal and `\(\mathbf{U}\)` and `\(\mathbf{V}\)` are orthonormal: `\(\mathbf{U}^\top\mathbf{U} = \mathbf{I}\)`.

* Note that `\(\mathbf{X}^\top\mathbf{X} = \mathbf{VDU}^\top\mathbf{UDV}^\top = \mathbf{V}\mathbf{D}^2\mathbf{V}^\top\)`.

* Then,

`$$\brl = (\X^\top \X + \lambda \mathbf{I})^{-1}\X^\top \y = (\mathbf{VD}^2\mathbf{V}^\top + \lambda \mathbf{I})^{-1}\mathbf{VDU}^\top \y = \mathbf{V}(\mathbf{D}^2+\lambda \mathbf{I})^{-1} \mathbf{DU}^\top \y.$$`

* For computations, we now only need to invert the diagonal matrix `\(\mathbf{D}^2 + \lambda \mathbf{I}\)`.

---

## Comparing with OLS

* `\(\mathbf{D}\)` is a diagonal matrix

`$$\bls = (\X^\top\X)^{-1}\X^\top \y = (\mathbf{VD}^2\mathbf{V}^\top)^{-1}\mathbf{VDU}^\top \y = \mathbf{V}\color{red}{\mathbf{D}^{-2}\mathbf{D}}\mathbf{U}^\top \y = \mathbf{V}\color{red}{\mathbf{D}^{-1}}\mathbf{U}^\top \y$$`

`$$\brl = (\X^\top \X + \lambda \mathbf{I})^{-1}\X^\top \y = \mathbf{V}\color{red}{(\mathbf{D}^2+\lambda \mathbf{I})^{-1}} \mathbf{DU}^\top \y.$$`

* Notice that `\(\bls\)` depends on `\(d_j/d_j^2\)` while `\(\brl\)` depends on `\(d_j/(d_j^2 + \lambda)\)`.

* Ridge regression makes the coefficients smaller relative to OLS.

* But if `\(\X\)` has small singular values, ridge regression compensates with `\(\lambda\)` in the denominator.

---

## Ridge regression and multicollinearity

__Multicollinearity:__ a linear combination of predictor variables is nearly equal to another predictor variable.

Some comments:

* A better phrase: `\(\X\)` is ill-conditioned
* AKA "(numerically) rank-deficient".
* `\(\X = \mathbf{U D V}^\top\)` ill-conditioned `\(\Longleftrightarrow\)` some elements of `\(\mathbf{D} \approx 0\)`
* `\(\bls= \mathbf{V D}^{-1} \mathbf{U}^\top \y\)`, so small entries of `\(\mathbf{D}\)` `\(\Longleftrightarrow\)` huge elements of `\(\mathbf{D}^{-1}\)`
* Means huge variance: `\(\Var{\bls} = \sigma^2(\X^\top \X)^{-1} = \sigma^2 \mathbf{V D}^{-2} \mathbf{V}^\top\)`

Ridge regression fixes this problem by preventing the division by a near-zero number.

__Conclusion:__ `\((\X^{\top}\X)^{-1}\)` can be really unstable, while `\((\X^{\top}\X + \lambda \mathbf{I})^{-1}\)` is not.

__Aside:__ The engineering approach to solving linear systems is to always do this with a small `\(\lambda\)`. The thinking is about the numerics rather than the statistics.

---
layout: true

## Can we get the best of both worlds?

---

To recap:

* Deciding which predictors to include, adding quadratic terms, or interactions is __model selection__ (more precisely __variable selection__ within a linear model).

* Ridge regression provides regularization, which trades off bias and variance and also stabilizes multicollinearity.

* If the LM is **true**,
    1. OLS is unbiased, but the variance depends on `\(\mathbf{D}^{-2}\)`, which can be big.
    2. Ridge is biased (can you find the bias?), but its variance is smaller than that of OLS.

* Ridge regression does not perform variable selection.

* But .hand[picking] `\(\lambda=3.7\)` and thereby .hand[deciding] to predict with `\(\widehat{\beta}^R_{3.7}\)` is __model selection__.

---

__Ridge regression__: `\(\min \frac{1}{n}||\y-\X\beta||_2^2 \textrm{ subject to } ||\beta||_2^2 \leq s\)`

__Best (in-sample) linear regression model of size `\(s\)`__: `\(\min \frac{1}{n}||\y-\X\beta||_2^2 \textrm{ subject to } ||\beta||_0 \leq s\)`

`\(||\beta||_0\)` is the number of nonzero elements in `\(\beta\)`

Finding the best in-sample linear model (of size `\(s\)`, among these predictors) is a nonconvex optimization problem (in fact, it is NP-hard).

Ridge regression is convex (easy to solve), but doesn't do __variable__ selection.

Can we somehow "interpolate" to get both?

Note: selecting `\(\lambda\)` is still __model__ selection, but we've included __all__ the variables.
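
---

A quick sketch with the `ridge` fit from before (the `\(\lambda\)` value is arbitrary): no matter how hard we shrink, ridge leaves every coefficient nonzero.

```r
# Ridge shrinks coefficients toward zero but (in practice) never exactly to zero,
# so all 8 prostate predictors stay in the model.
beta_ridge <- as.numeric(coef(ridge, s = 5))[-1] # drop the intercept
sum(beta_ridge != 0)
```

Regularization without variable selection; getting both at once is the question we take up next.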

---
class: center, middle, inverse
layout: false

# Next time...

The lasso, interpolating variable selection and model selection