Stat 406
Daniel J. McDonald
Last modified – 09 October 2023
\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \]
We’ve been discussing smoothing methods in one dimension:
\[\Expect{Y\given X=x} = f(x),\quad x\in\R\]
We looked at basis expansions, e.g.:
\[f(x) \approx \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k\]
We looked at local methods, e.g.:
\[f(x_i) \approx s_i^\top \y\]
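The local-method idea above can be sketched as a Nadaraya–Watson kernel smoother, where the weight vector plays the role of one row \(s_i\) of the smoothing matrix (a minimal illustration; the data and bandwidth are made up):

```python
import numpy as np

def kernel_smooth_at(x, y, x0, h):
    # Gaussian kernel weights centred at x0 (bandwidth h is made up here)
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    s = w / w.sum()          # one row s_i of the smoothing matrix
    return s @ y             # f_hat(x0) = s_i' y

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=100)
fhat = np.array([kernel_smooth_at(x, y, x0, h=0.05) for x0 in x])
```

Stacking the rows \(s_i\) gives a linear smoother: the fitted values are a fixed (data-dependent) linear transformation of \(\y\).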
What if \(x \in \R^p\) and \(p>1\)?
In multivariate nonparametric regression, you estimate a surface over the input variables.
This is trying to find \(\widehat{f}(x_1,\ldots,x_p)\).
By construction, this function includes interactions, handles categorical data, and so on. This is in contrast with explicit linear models, which require you to specify these things.
This extra complexity (automatically including interactions, as well as other things) comes with tradeoffs.
More complicated functions (smooth kernel regressions vs. linear models) tend to have lower bias but higher variance.
For \(p=1\), one can show that for kernels (with the correct bandwidth)
\[\textrm{MSE}(\hat{f}) = \frac{C_1}{n^{4/5}} + \frac{C_2}{n^{4/5}} + \sigma^2\]
Important
you don’t need to memorize these formulas but you should know the intuition
the constants don’t matter for the intuition, but they matter for a particular data set. We don’t know them. So you estimate this.
Recall, this decomposition is squared bias + variance + irreducible error
\[\textrm{MSE}(\hat{f}) = C_1 h^4 + \frac{C_2}{nh} + \sigma^2\]
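Minimizing the right-hand side over \(h\) shows where the \(n^{-4/5}\) rate comes from (a standard calculation; the constants are illustrative):

\[\frac{\partial}{\partial h}\left( C_1 h^4 + \frac{C_2}{nh} \right) = 4 C_1 h^3 - \frac{C_2}{n h^2} = 0 \quad\Longrightarrow\quad h_{\mathrm{opt}} = \left( \frac{C_2}{4 C_1 n} \right)^{1/5} \propto n^{-1/5}.\]

Substituting \(h_{\mathrm{opt}}\) back in, both the squared-bias term and the variance term are of order \(n^{-4/5}\).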
As you collect more data, use a smaller bandwidth, and the MSE (on future data) decreases.
How does this compare to just using a linear model?
Bias
Variance
bias of kernels goes to zero, bias of lines doesn’t (unless the truth is linear).
but variance of lines goes to zero faster than for kernels.
If the linear model is right, you win.
But if it’s wrong, you (eventually) lose as \(n\) grows.
How do you know if you have enough data?
Compare the estimated risk of the kernel version (with CV-selected tuning parameter) to the estimated risk of the linear model.
You can’t just compare the CV minimum for the kernel version to the CV score for the LM. Because you used CV to select the tuning parameter, we’re back to the usual problem of using the data twice. You need a second, outer CV to estimate the risk of the kernel procedure at its CV-selected tuning parameter.
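A hypothetical sketch of this “two CVs” idea (the data, bandwidth grid, and fold counts are all made up): the inner CV selects the bandwidth, and the outer CV estimates the risk of the whole kernel procedure, so that it is comparable to the CV risk of the linear model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

def nw_predict(xtr, ytr, xte, h):
    # Nadaraya-Watson estimate with a Gaussian kernel
    w = np.exp(-0.5 * ((xte[:, None] - xtr[None, :]) / h) ** 2)
    return (w @ ytr) / w.sum(axis=1)

def kfold_mse(predict, x, y, k=5, seed=0):
    # CV risk estimate for any procedure predict(xtr, ytr, xte)
    idx = np.random.default_rng(seed).permutation(len(x))
    errs = []
    for fold in np.array_split(idx, k):
        tr = np.setdiff1d(idx, fold)
        errs.append(np.mean((predict(x[tr], y[tr], x[fold]) - y[fold]) ** 2))
    return np.mean(errs)

def kernel_with_inner_cv(xtr, ytr, xte):
    # the inner CV chooses h; this whole function is "the procedure"
    hs = [0.02, 0.05, 0.1, 0.2]
    best = min(hs, key=lambda h: kfold_mse(
        lambda a, b, c: nw_predict(a, b, c, h), xtr, ytr))
    return nw_predict(xtr, ytr, xte, best)

def linear_fit(xtr, ytr, xte):
    b1, b0 = np.polyfit(xtr, ytr, 1)
    return b0 + b1 * xte

risk_kernel = kfold_mse(kernel_with_inner_cv, x, y)  # outer CV
risk_linear = kfold_mse(linear_fit, x, y)
```

Here `risk_kernel` and `risk_linear` are estimates of the same thing (risk of the full fitting procedure), so comparing them is fair.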
For \(p>1\), there is more trouble.
First, let’s look again at \[\textrm{MSE}(\hat{f}) = \frac{C_1}{n^{4/5}} + \frac{C_2}{n^{4/5}} + \sigma^2\]
That is for \(p=1\). It’s not that much slower than \(C/n\), the variance for linear models.
If \(p>1\) similar calculations show,
\[\textrm{MSE}(\hat f) = \frac{C_1+C_2}{n^{4/(4+p)}} + \sigma^2 \hspace{2em} \textrm{MSE}(\hat \beta) = b + \frac{Cp}{n} + \sigma^2 .\]
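To see how quickly the nonparametric rate degrades, we can plug numbers into these formulas (illustrative only; the constants are unknown, so we ignore them):

```python
# Compare the kernel rate n^(-4/(4+p)) with the parametric rate p/n at a
# large (made-up) sample size: the kernel rate collapses as p grows,
# while the parametric rate barely moves.
n = 100_000
for p in [1, 2, 5, 10, 20]:
    kernel_rate = n ** (-4 / (4 + p))
    linear_rate = p / n
    print(f"p = {p:2d}   n^(-4/(4+p)) = {kernel_rate:.1e}   p/n = {linear_rate:.1e}")
```

At \(p=20\), the kernel rate is barely better than constant, while the parametric rate is still essentially \(1/n\). This is the curse of dimensionality in action.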
What if \(p\) is big (and \(n\) is really big)?
How do you tell? Do model selection to decide.
A very, very questionable rule of thumb: if \(p>\log(n)\), don’t do smoothing.
Compromises if p is big
Additive models and trees
UBC Stat 406 - 2023