Another way to control bias and variance is through regularization or shrinkage.
Rather than selecting a few predictors that seem reasonable (and maybe trying a few combinations), use them all.
I mean ALL.
But make your estimates of \(\beta\) “smaller”.
Brief aside on optimization
An optimization problem has 2 components:
The “Objective function”: e.g. \(\frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2\).
The “constraint”: e.g. “fewer than 5 non-zero entries in \(\beta\)”.
A constrained minimization problem is written
\[\min_\beta f(\beta)\;\; \mbox{ subject to }\;\; C(\beta)\]
\(f(\beta)\) is the objective function
\(C(\beta)\) is the constraint
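As a concrete (toy) illustration of this template, here is a sketch of a constrained minimization solved numerically with `scipy.optimize.minimize` — the objective and constraint below are my own made-up example, not from the notes:

```python
# Sketch: minimize f(beta) subject to C(beta), numerically.
import numpy as np
from scipy.optimize import minimize

# Objective f(beta): squared distance from the point (2, 1)
def f(beta):
    return (beta[0] - 2) ** 2 + (beta[1] - 1) ** 2

# Constraint C(beta): beta[0] + beta[1] <= 2, encoded as g(beta) >= 0
constraint = {"type": "ineq", "fun": lambda beta: 2 - beta[0] - beta[1]}

result = minimize(f, x0=np.zeros(2), constraints=[constraint])
print(result.x)
```

The unconstrained minimizer \((2, 1)\) violates the constraint, so the solution lands on the boundary, near \((1.5, 0.5)\).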
Ridge regression (constrained version)
One way to do this for regression is to solve (say): \[
\minimize_\beta \frac{1}{n}\sum_i (y_i-x^\top_i \beta)^2
\quad \st \sum_j \beta^2_j \leq s
\] for some \(s>0\).
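This constrained problem has an equivalent penalized (Lagrangian) form, \(\min_\beta \Vert y - \X\beta\Vert_2^2 + \lambda \Vert\beta\Vert_2^2\), which admits a closed form. A minimal sketch on simulated data (the data-generating choices here are assumptions for illustration):

```python
# Sketch: closed-form ridge, beta_hat = (X^T X + lambda I)^{-1} X^T y,
# on simulated data; larger lambda shrinks the estimates toward zero.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ beta_true + rng.standard_normal(n)

def ridge(X, y, lam):
    p = X.shape[1]
    # Solve (X^T X + lambda I) beta = X^T y rather than inverting explicitly
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in [0.0, 10.0, 1000.0]:
    print(lam, np.round(ridge(X, y, lam), 3))
```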
Ridge regression fixes this problem by preventing division by a near-zero number.
Conclusion
\((\X^{\top}\X)^{-1}\) can be really unstable, while \((\X^{\top}\X + \lambda \mathbf{I})^{-1}\) is not.
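This instability is easy to see numerically. A small sketch (my own example, with one column nearly duplicating another, so \(\X^\top\X\) is close to singular):

```python
# Sketch: near-collinear columns make X^T X ill-conditioned;
# adding lambda * I makes the condition number modest again.
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.standard_normal(n)
x2 = x1 + 1e-6 * rng.standard_normal(n)  # almost an exact copy of x1
X = np.column_stack([x1, x2])

XtX = X.T @ X
lam = 1.0
print(np.linalg.cond(XtX))                    # enormous: inversion is unstable
print(np.linalg.cond(XtX + lam * np.eye(2)))  # small: inversion is fine
```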
Aside
The engineering approach to solving linear systems is to always add a small \(\lambda\). There the thinking is about the numerics rather than the statistics.
Which \(\lambda\) to use?
Computational
Use CV and pick the \(\lambda\) that makes the estimated prediction error smallest.
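A sketch of this recipe (plain K-fold CV over a grid of \(\lambda\) values, with the closed-form ridge fit at each fold; the grid and data here are assumptions for illustration):

```python
# Sketch: pick lambda by K-fold cross-validation.
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.0]
y = X @ beta_true + rng.standard_normal(n)

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, lam, k=5):
    folds = np.array_split(np.arange(len(y)), k)
    errs = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        b = ridge_fit(X[train_idx], y[train_idx], lam)
        errs.append(np.mean((y[test_idx] - X[test_idx] @ b) ** 2))
    return np.mean(errs)  # average held-out MSE across folds

lambdas = np.logspace(-2, 4, 20)
cv_errs = [cv_error(X, y, lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(cv_errs))]
print(best_lam)
```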
Intuition (bias)
As \(\lambda\rightarrow\infty\), bias ⬆
Intuition (variance)
As \(\lambda\rightarrow\infty\), variance ⬇
You should think about why.
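One way to see it is by simulation: refit ridge on many datasets drawn from the same model and look at the average estimate (bias) and the spread of estimates (variance) as \(\lambda\) grows. A sketch (my own simulation, not from the notes):

```python
# Sketch: as lambda grows, the ridge estimator's bias grows
# while its variance shrinks (fixed design, repeated noise draws).
import numpy as np

rng = np.random.default_rng(3)
n, p, reps = 50, 3, 500
X = rng.standard_normal((n, p))  # fixed design across repetitions
beta_true = np.array([1.0, 2.0, -1.0])

def ridge_fit(y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in [0.1, 10.0, 1000.0]:
    estimates = np.array(
        [ridge_fit(X @ beta_true + rng.standard_normal(n), lam)
         for _ in range(reps)]
    )
    bias = np.linalg.norm(estimates.mean(axis=0) - beta_true)
    variance = estimates.var(axis=0).sum()
    print(f"lambda={lam:7.1f}  bias={bias:.3f}  variance={variance:.4f}")
```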
Can we get the best of both worlds?
To recap:
Deciding which predictors to include, or whether to add quadratic terms or interactions, is model selection (more precisely, variable selection within a linear model).
Ridge regression provides regularization, which trades off bias and variance and also stabilizes estimation under multicollinearity.
If the LM is true,
OLS is unbiased, but its variance depends on \(\mathbf{D}^{-2}\), which can be large.
Ridge is biased (can you find the bias?). But Variance is smaller than OLS.
Ridge regression does not perform variable selection.
But picking \(\lambda=3.7\) and thereby deciding to predict with \(\widehat{\beta}^R_{3.7}\) is model selection.
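The bias question above can be worked out directly. Assuming a fixed design with \(y = \X\beta + \epsilon\) and \(\mathbb{E}[\epsilon] = 0\),
\[
\mathbb{E}\left[\widehat{\beta}^R_\lambda\right]
= (\X^\top\X + \lambda \mathbf{I})^{-1}\X^\top\X\,\beta,
\qquad
\mathbb{E}\left[\widehat{\beta}^R_\lambda\right] - \beta
= -\lambda\,(\X^\top\X + \lambda \mathbf{I})^{-1}\beta.
\]
So the bias is \(0\) when \(\lambda = 0\) (OLS) and approaches \(-\beta\) as \(\lambda \rightarrow \infty\) (everything shrunk to zero), matching the intuition slides.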