12 To(o) smooth or not to(o) smooth?

Stat 406

Daniel J. McDonald

Last modified – 09 October 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \]

Last time…

We’ve been discussing smoothing methods in 1 dimension:

\[\Expect{Y\given X=x} = f(x),\quad x\in\R\]

We looked at basis expansions, e.g.:

\[f(x) \approx \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k\]

We looked at local methods, e.g.:

\[f(x_i) \approx s_i^\top \y\]
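
To make the recap concrete, here is a minimal sketch of both ideas in Python/numpy (purely illustrative, not course code: the toy data, the degree-5 basis, and the Gaussian-kernel bandwidth are arbitrary choices). The local fit below has exactly the \(s_i^\top \y\) form.

```python
import numpy as np

# Toy 1-D data (made up for illustration)
rng = np.random.default_rng(406)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Basis expansion: regress y on 1, x, ..., x^5
B = np.vander(x, N=6, increasing=True)
beta = np.linalg.lstsq(B, y, rcond=None)[0]
fhat_basis = B @ beta

# Local method: Nadaraya-Watson with a Gaussian kernel
def nw_weights(x0, x, h):
    """The weight vector s_i so that fhat(x0) = s_i' y."""
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)
    return w / w.sum()

h = 0.3  # bandwidth; in practice chosen by CV
fhat_local = np.array([nw_weights(x0, x, h) @ y for x0 in x])
```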

What if \(x \in \R^p\) and \(p>1\)?

Kernels and interactions

In multivariate nonparametric regression, you estimate a surface over the input variables.

That is, we are trying to find \(\widehat{f}(x_1,\ldots,x_p)\) directly.

By construction, this function includes interactions, handles categorical data, and so on.

This is in contrast with explicit linear models, which require you to specify these things yourself.

This extra complexity (automatically including interactions, as well as other things) comes with tradeoffs.

More complicated functions (smooth kernel regressions vs. linear models) tend to have lower bias but higher variance.
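
A toy illustration of the “interactions for free” point (Python/numpy; the data-generating function, bandwidth, and sample sizes are arbitrary choices for this sketch): the truth below depends on \(x_1 x_2\), and the kernel surface picks that up without our ever writing down an interaction term, while a main-effects-only linear model cannot.

```python
import numpy as np

# Toy 2-D data where the truth is an interaction: f(x1, x2) = x1 * x2
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(400, 2))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=400)

def nw2d(Xtr, ytr, Xte, h):
    """Nadaraya-Watson with a Gaussian kernel in 2 dimensions."""
    d2 = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-0.5 * d2 / h ** 2)
    return (w / w.sum(axis=1, keepdims=True)) @ ytr

Xte = rng.uniform(-1, 1, size=(200, 2))
truth = Xte[:, 0] * Xte[:, 1]

# The kernel surface captures the interaction without being told about it ...
ker = nw2d(X, y, Xte, h=0.25)
# ... while a main-effects-only linear model cannot, no matter how large n is.
beta = np.linalg.lstsq(np.column_stack([np.ones(400), X]), y, rcond=None)[0]
lin = np.column_stack([np.ones(200), Xte]) @ beta

print(np.mean((ker - truth) ** 2), np.mean((lin - truth) ** 2))
```

Adding an explicit \(x_1 x_2\) term would of course fix the linear model here; the point is that the kernel estimate never needed to be told.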

Issue 1

For \(p=1\), one can show that for kernels (with the correct bandwidth)

\[\textrm{MSE}(\hat{f}) = \frac{C_1}{n^{4/5}} + \frac{C_2}{n^{4/5}} + \sigma^2\]

Important

You don’t need to memorize these formulas, but you should know the intuition.

The constants don’t matter for the intuition, but they do matter for a particular data set. We don’t know them, so in practice you estimate the risk (e.g., with CV) rather than plugging into this formula.

Recall, this decomposition is squared bias + variance + irreducible error

  • The MSE depends on the choice of the bandwidth \(h\):

\[\textrm{MSE}(\hat{f}) = C_1 h^4 + \frac{C_2}{nh} + \sigma^2\]

  • Using \(h = cn^{-1/5}\) balances squared bias and variance and leads to the rate above; that balance minimizes the MSE (see the calculation below).
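
Filling in that calculation: setting the derivative in \(h\) to zero,

\[\frac{d}{dh}\left[C_1 h^4 + \frac{C_2}{nh}\right] = 4C_1 h^3 - \frac{C_2}{nh^2} = 0 \quad\Longrightarrow\quad h^5 = \frac{C_2}{4C_1 n} \quad\Longrightarrow\quad h = c\, n^{-1/5} .\]

Plugging \(h = cn^{-1/5}\) back in, both \(C_1 h^4\) and \(C_2/(nh)\) are of order \(n^{-4/5}\), which is where the \(n^{4/5}\) denominators above come from.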

Intuition:

As you collect more data, you can use a smaller bandwidth, and the MSE (on future data) decreases.

How does this compare to just using a linear model?

Bias

  1. The bias of using a linear model when the truth is nonlinear is a number \(b > 0\) which doesn’t depend on \(n\).
  2. The bias of using kernel regression is \(C_1/n^{4/5}\). This goes to 0 as \(n\rightarrow\infty\).

Variance

  1. The variance of using a linear model is \(C/n\) no matter what
  2. The variance of using kernel regression is \(C_2/n^{4/5}\).

To conclude:

  • bias of kernels goes to zero, bias of lines doesn’t (unless the truth is linear).

  • but variance of lines goes to zero faster than for kernels.

If the linear model is right, you win.

But if it’s wrong, you (eventually) lose as \(n\) grows.
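
Here is a minimal simulation sketch of that “eventually lose” claim (Python/numpy, not part of the course code; the truth \(\sin(2x)\), the noise level, and the bandwidth constant are all arbitrary choices): as \(n\) grows, the linear fit’s test MSE plateaus at its squared bias, while the kernel fit’s keeps shrinking.

```python
import numpy as np

rng = np.random.default_rng(1)

def f_true(x):
    return np.sin(2 * x)  # a mildly nonlinear truth (illustrative choice)

def nw_fit(xtr, ytr, xte, h):
    """Nadaraya-Watson prediction at xte with a Gaussian kernel."""
    w = np.exp(-0.5 * ((xte[:, None] - xtr[None, :]) / h) ** 2)
    return (w / w.sum(axis=1, keepdims=True)) @ ytr

xtest = np.linspace(0, 3, 500)
for n in [50, 500, 5000]:
    x = rng.uniform(0, 3, n)
    y = f_true(x) + rng.normal(scale=0.5, size=n)
    # Linear model: least squares on (1, x); its bias never goes away
    X = np.column_stack([np.ones(n), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    lin = np.column_stack([np.ones_like(xtest), xtest]) @ beta
    # Kernel: bandwidth shrinking like n^{-1/5}, so its MSE keeps decreasing
    ker = nw_fit(x, y, xtest, h=0.9 * n ** (-1 / 5))
    print(n,
          np.mean((lin - f_true(xtest)) ** 2),
          np.mean((ker - f_true(xtest)) ** 2))
```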

How do you know if you have enough data?

Compare the estimated risk of the kernel version (with CV-selected tuning parameter) to the estimated risk of the linear model.

☠️☠️ Danger ☠️☠️

You can’t just compare the CVM for the kernel version to the CVM for the LM. This is because you used the CVM to select the tuning parameter, so we’re back to the usual problem of using the data twice. You have to do another (outer) CV to estimate the risk of the kernel version at the CV-selected tuning parameter.
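
A sketch of what that “another CV” can look like (Python/numpy; one reasonable implementation under assumptions made here, not the course’s official code, and helper names like nw_fit and kernel_with_inner_cv are mine): the bandwidth is chosen by an inner CV within each training fold, so the outer CV score is an honest risk estimate that can be compared to the linear model’s.

```python
import numpy as np

def nw_fit(xtr, ytr, xte, h):
    """Nadaraya-Watson prediction at xte with a Gaussian kernel and bandwidth h."""
    w = np.exp(-0.5 * ((xte[:, None] - xtr[None, :]) / h) ** 2)
    return (w / w.sum(axis=1, keepdims=True)) @ ytr

def linear_fit_predict(xtr, ytr, xte):
    """Ordinary least squares with an intercept; no tuning parameter to select."""
    beta = np.linalg.lstsq(np.column_stack([np.ones_like(xtr), xtr]), ytr, rcond=None)[0]
    return np.column_stack([np.ones_like(xte), xte]) @ beta

def cv_mse(x, y, fit_predict, K, rng):
    """K-fold CV estimate of the prediction MSE of a fit_predict(xtr, ytr, xte) rule."""
    folds = np.array_split(rng.permutation(len(y)), K)
    errs = []
    for fold in folds:
        mask = np.ones(len(y), dtype=bool)
        mask[fold] = False
        pred = fit_predict(x[mask], y[mask], x[fold])
        errs.append(np.mean((y[fold] - pred) ** 2))
    return np.mean(errs)

def kernel_with_inner_cv(xtr, ytr, xte, hs=np.geomspace(0.05, 1.0, 20), K=5):
    """Choose h by an inner CV on the training fold only, then predict at xte."""
    inner_rng = np.random.default_rng(0)
    scores = [cv_mse(xtr, ytr, lambda a, b, c, h=h: nw_fit(a, b, c, h), K, inner_rng)
              for h in hs]
    return nw_fit(xtr, ytr, xte, hs[int(np.argmin(scores))])

# Toy data, purely for illustration
rng = np.random.default_rng(406)
x = rng.uniform(0, 3, 300)
y = np.sin(2 * x) + rng.normal(scale=0.5, size=x.size)

# Honest comparison: bandwidth selection happens inside each outer training fold,
# so neither risk estimate reuses data that was already used for tuning.
risk_kernel = cv_mse(x, y, kernel_with_inner_cv, 5, rng)
risk_linear = cv_mse(x, y, linear_fit_predict, 5, rng)
print(risk_kernel, risk_linear)
```

Comparing risk_kernel to risk_linear (rather than the minimized inner CV score to the linear model’s CV score) is what avoids using the data twice.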

Issue 2

For \(p>1\), there is more trouble.

First, let’s look again at \[\textrm{MSE}(\hat{f}) = \frac{C_1}{n^{4/5}} + \frac{C_2}{n^{4/5}} + \sigma^2\]

That rate is for \(p=1\): it’s not much slower than \(C/n\), the variance for linear models.

If \(p>1\), similar calculations show

\[\textrm{MSE}(\hat f) = \frac{C_1+C_2}{n^{4/(4+p)}} + \sigma^2 \hspace{2em} \textrm{MSE}(\hat \beta) = b + \frac{Cp}{n} + \sigma^2 .\]

What if \(p\) is big (and \(n\) is really big)?

  1. Then \((C_1 + C_2) / n^{4/(4+p)}\) is still big.
  2. But \(Cp / n\) is small.
  3. So unless \(b\) is big, we should use the linear model.

How do you tell? Do model selection to decide.

A very, very questionable rule of thumb: if \(p>\log(n)\), don’t do smoothing.
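
To put rough numbers on these rates (a made-up example, taking \(\log\) to be the natural log): with \(n = 10^6\) and \(p = 10\),

\[\frac{C_1+C_2}{n^{4/(4+p)}} = \frac{C_1+C_2}{(10^6)^{4/14}} \approx \frac{C_1+C_2}{52} \hspace{2em} \frac{Cp}{n} = \frac{10\,C}{10^6} = C\times 10^{-5} .\]

Constants aside, the linear model’s variance term is about three orders of magnitude smaller, so unless the bias \(b\) is substantial the linear model wins at this \(n\) and \(p\). And since \(\log(10^6)\approx 13.8 > 10 = p\), the rule of thumb would not even have ruled out smoothing here.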

Next time…

Compromises if \(p\) is big

Additive models and trees