12 To(o) smooth or not to(o) smooth?

Stat 406

Geoff Pleiss, Trevor Campbell

Last modified – 08 October 2024

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \]

Smoothing vs Linear Models

We’ve been discussing nonlinear methods in 1-dimension:

\[\Expect{Y\given X=x} = f(x),\quad x\in\R\]

  1. Basis expansions, e.g.:

\[\hat f_\mathrm{basis}(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k\]

  2. Local methods, e.g.:

\[\hat f_\mathrm{local}(x_i) = s_i^\top \y\]

Which should we choose?
Of course, we can do model selection. But can we analyze the risk mathematically?
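
Before the math, here is a minimal sketch of the two estimators side by side. The toy data, the polynomial degree, and the bandwidth below are arbitrary choices for illustration (Python; not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(406)

# Toy 1-D regression problem: Y = f(X) + noise, with a nonlinear f
n = 200
x = np.sort(rng.uniform(-2, 2, n))
y = np.sin(2 * x) + 0.5 * x + rng.normal(scale=0.3, size=n)
x_test = np.linspace(-2, 2, 500)
f_true = np.sin(2 * x_test) + 0.5 * x_test

# 1. Basis expansion: degree-5 polynomial fit by least squares
degree = 5
B_train = np.vander(x, degree + 1, increasing=True)
B_test = np.vander(x_test, degree + 1, increasing=True)
beta, *_ = np.linalg.lstsq(B_train, y, rcond=None)
f_basis = B_test @ beta

# 2. Local method: Nadaraya-Watson smoother with a Gaussian kernel,
#    i.e. each prediction is a weighted average s(x)^T y of the responses
def nw_smoother(x_train, y_train, x_eval, bandwidth):
    w = np.exp(-0.5 * ((x_eval[:, None] - x_train[None, :]) / bandwidth) ** 2)
    w /= w.sum(axis=1, keepdims=True)
    return w @ y_train

f_local = nw_smoother(x, y, x_test, bandwidth=0.2)

print("basis MSE vs true f:", np.mean((f_basis - f_true) ** 2))
print("local MSE vs true f:", np.mean((f_local - f_true) ** 2))
```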

Risk Decomposition

\[ R_n = \mathrm{Bias}^2 + \mathrm{Var} + \sigma^2 \]

How does \(R_n^{(\mathrm{basis})}\) compare to \(R_n^{(\mathrm{local})}\) as we change \(n\)?
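
For reference, this is the usual pointwise decomposition: writing \(Y = f(x) + \varepsilon\) with \(\Expect{\varepsilon} = 0\), \(\Var{\varepsilon} = \sigma^2\), and taking expectations over both the noise and the training sample used to build \(\hat f\),

\[
\begin{aligned}
R_n(x) = \Expect{\big(Y - \hat f(x)\big)^2}
&= \Expect{\big(f(x) - \hat f(x)\big)^2} + \sigma^2 \\
&= \underbrace{\big(f(x) - \Expect{\hat f(x)}\big)^2}_{\mathrm{Bias}^2}
 + \underbrace{\Expect{\big(\hat f(x) - \Expect{\hat f(x)}\big)^2}}_{\mathrm{Var}}
 + \sigma^2.
\end{aligned}
\]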

Variance

  • Basis: variance decreases as \(n\) increases
  • Local: variance decreases as \(n\) increases
    But at what rate?
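
Both are linear smoothers, so the rates can be previewed directly: with \(\hat f(x) = s(x)^\top \y\) and homoskedastic noise,

\[
\Var{\hat f(x)} = \sigma^2 \snorm{s(x)}_2^2 \approx
\begin{cases}
\sigma^2\, k / n & \text{(OLS on } k \text{ fixed basis features, averaged over the } x_i\text{)}\\
\sigma^2 / (n h) & \text{(kernel smoother with bandwidth } h\text{)}
\end{cases}
\]

so with the optimal \(h \propto n^{-1/5}\), the local variance shrinks like \(n^{-4/5}\) rather than \(n^{-1}\).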

Bias

  • Basis: bias is fixed
    Assuming num. basis features is fixed
  • Local: bias depends on choice of bandwidth \(\sigma\).

Risk Decomposition

Basis

\[ R_n^{(\mathrm{basis})} = \underbrace{C_1^{(\mathrm{basis})}}_{\mathrm{bias}^2} + \underbrace{\frac{C_2^{(\mathrm{basis})}}{n}}_{\mathrm{var}} + \sigma^2 \]

Local

With the optimal bandwidth (\(\propto n^{-1/5}\)), we have

\[ R_n^{(\mathrm{local})} = \underbrace{\frac{C_1^{(\mathrm{local})}}{n^{4/5}}}_{\mathrm{bias}^2} + \underbrace{\frac{C_2^{(\mathrm{local})}}{n^{4/5}}}_{\mathrm{var}} + \sigma^2 \]

Important

You don’t need to memorize these formulas, but you should know the intuition.

The constants don’t matter for the intuition, but they matter for a particular data set. You have to estimate them.

What do you notice?

  • As \(n\) increases, the optimal bandwidth \(\sigma\) decreases
  • \(R_n^{(\mathrm{basis})} \overset{n \to \infty}{\longrightarrow} C_1^{(\mathrm{basis})} + \sigma^2\)
  • \(R_n^{(\mathrm{local})} \overset{n \to \infty}{\longrightarrow} \sigma^2\)
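
A tiny numerical sketch of those limits; the constants below are made up, and only the rates \(1/n\) versus \(1/n^{4/5}\) come from the formulas above:

```python
# Made-up constants -- only the rates (1/n vs 1/n^{4/5}) matter for the intuition
C1_basis, C2_basis = 0.05, 2.0   # squared-bias floor, variance constant
C1_local, C2_local = 1.0, 3.0
sigma2 = 0.25                    # irreducible noise

for n in [100, 1_000, 10_000, 100_000, 1_000_000]:
    r_basis = C1_basis + C2_basis / n + sigma2
    r_local = (C1_local + C2_local) / n ** 0.8 + sigma2
    print(f"n={n:>9,d}  basis risk={r_basis:.4f}  local risk={r_local:.4f}")

# The basis risk levels off at C1_basis + sigma2 = 0.30;
# the local risk keeps decreasing toward sigma2 = 0.25.
```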

Takeaway

  1. Local methods are consistent universal approximators (bias and variance go to 0 as \(n \to \infty\))
  2. Fixed basis expansions are biased but have lower variance when \(n\) is relatively small.
    \(\underbrace{O(1/n)}_{\text{basis var.}} < \underbrace{O(1/n^{4/5})}_{\text{local var.}}\)

The Curse of Dimensionality

How do local methods perform when \(p > 1\)?

Intuitively

Parametric multivariate regressors (e.g. basis expansions) require you to specify nonlinear interaction terms
e.g. \(x^{(1)} x^{(2)}\), \(\cos( x^{(1)} + x^{(2)})\), etc.


Nonparametric multivariate regressors (e.g. KNN, local methods) automatically handle interactions.
The distance function (e.g. \(d(x,x') = \Vert x - x' \Vert_2\)) used by kernels implicitly defines infinitely many interactions!
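
One way to make this concrete (a standard kernel identity, writing \(h\) for the bandwidth): expanding the Gaussian kernel built from the Euclidean distance gives

\[
\exp\left(-\frac{\norm{x - x'}_2^2}{2h^2}\right)
= \exp\left(-\frac{\norm{x}_2^2 + \norm{x'}_2^2}{2h^2}\right)
  \sum_{k=0}^{\infty} \frac{(x^\top x')^k}{h^{2k}\,k!},
\]

and each power \((x^\top x')^k = \big(\sum_j x^{(j)} x'^{(j)}\big)^k\) contains every order-\(k\) product of coordinates, i.e. every polynomial interaction.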


This extra complexity (automatically including interactions, as well as other things) comes with a tradeoff.

Mathematically

Consider \(x_1, x_2, \ldots, x_n\) distributed uniformly within a \(p\)-dimensional ball of radius 1. For a test point \(x\) at the center of the ball, how far away are its \(k = n/10\) nearest neighbours?

(The picture on the right makes sense in 2D. However, it gives the wrong intuition for higher dimensions!)

Let \(r\) be the average distance between \(x\) and its \(k^\mathrm{th}\) nearest neighbour.

  • When \(p=2\), \(r = (0.1)^{1/2} \approx 0.316\)
  • When \(p=10\), \(r = (0.1)^{1/10} \approx 0.794\)(!)
  • When \(p=100\), \(r = (0.1)^{1/100} \approx 0.977\)(!!)
  • When \(p=1000\), \(r = (0.1)^{1/1000} \approx 0.999\)(!!!)
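
A quick simulation sketch of these numbers. The only fact used is that, for points uniform in the unit \(p\)-ball, the distance from the centre has CDF \(P(D \le r) = r^p\); the sample sizes below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(406)
n, n_reps = 1000, 200
k = n // 10  # the (n/10)-th nearest neighbour

for p in [2, 10, 100, 1000]:
    dists = []
    for _ in range(n_reps):
        # Distance from the centre for points uniform in the unit p-ball:
        # P(D <= r) = r^p, so D can be simulated as U^(1/p) with U ~ Unif(0, 1)
        d = rng.uniform(0, 1, n) ** (1 / p)
        dists.append(np.sort(d)[k - 1])  # distance to the k-th nearest neighbour
    print(f"p={p:>4d}  simulated r={np.mean(dists):.3f}  "
          f"theory 0.1^(1/p)={0.1 ** (1 / p):.3f}")
```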

Why is this problematic?

  • Nearly all points are nearly maximally far from \(x\)
  • Can’t distinguish between “similar” and “different” inputs

Curse of Dimensionality

Distance becomes (exponentially) meaningless in high dimensions.*
*(Unless our data has “low dimensional structure.”)

Risk decomposition (\(p > 1\))

With the optimal bandwidth (\(\propto n^{-1/(4+p)}\)), we have

\[ R_n^{(\mathrm{OLS})} = \underbrace{C_1^{(\mathrm{OLS})}}_{\mathrm{bias}^2} + \underbrace{\tfrac{C_2^{(\mathrm{OLS})}}{n/p}}_{\mathrm{var}} + \sigma^2, \qquad R_n^{(\mathrm{local})} = \underbrace{\tfrac{C_1^{(\mathrm{local})}}{n^{4/(4+p)}}}_{\mathrm{bias}^2} + \underbrace{\tfrac{C_2^{(\mathrm{local})}}{n^{4/(4+p)}}}_{\mathrm{var}} + \sigma^2. \]

Observations

  • \((C_1^{(\mathrm{local})} + C_2^{(\mathrm{local})}) / n^{4/(4+p)}\) is relatively big, but \(C_2^{(\mathrm{OLS})} / (n/p)\) is relatively small.
  • So unless \(C_1^{(\mathrm{OLS})}\) is big, we should use the linear model.
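
To make the second observation concrete, a back-of-the-envelope sketch (constants set to 1 purely for illustration): how large must \(n\) be before the local term \(1/n^{4/(4+p)}\) drops below 0.1?

```python
# Solving n^{4/(4+p)} = 10 gives n = 10^{(4+p)/4}: the required sample size
# grows exponentially in the dimension p
for p in [1, 2, 5, 10, 20, 50]:
    n_needed = 10 ** ((4 + p) / 4)
    print(f"p={p:>3d}  n needed ~ {n_needed:,.0f}")
```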

In practice

The previous math assumes that our data are “densely” distributed throughout \(\R^p\).

However, if our data lie on a low-dimensional manifold within \(\R^p\), then local methods can work well!

We generally won’t know the “intrinsic dimensionality” of our data though…

How to decide between basis expansions and local kernel smoothers:

  1. Model selection
  2. Using a very, very questionable rule of thumb: if \(p>\log(n)\), don’t do smoothing.

☠️☠️ Danger ☠️☠️

You can’t just compare the GCV/CV/etc. scores for basis models versus local kernel smoothers.

You used GCV/CV/etc. to select the tuning parameter, so you’re back to the usual problem of using the data twice. You have to run another CV to estimate the risk of the kernel version once you have used GCV/CV/etc. to select the bandwidth.
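
Here is a sketch of that “tune inside, assess outside” pattern. The lecture doesn’t prescribe a tool; this uses Python/scikit-learn, with KernelRidge standing in for a local kernel smoother and arbitrary toy data and parameter grids:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(406)
X = rng.uniform(-2, 2, size=(300, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.3, size=300)

outer = KFold(n_splits=5, shuffle=True, random_state=406)
inner = KFold(n_splits=5, shuffle=True, random_state=406)

# Inner CV selects each model's tuning parameter (degree / bandwidth);
# the outer CV then estimates the risk of the whole tuned procedure.
basis = GridSearchCV(
    make_pipeline(PolynomialFeatures(), LinearRegression()),
    {"polynomialfeatures__degree": [2, 3, 5, 8]},
    cv=inner, scoring="neg_mean_squared_error",
)
local = GridSearchCV(
    KernelRidge(kernel="rbf"),
    {"gamma": [0.1, 1.0, 10.0]},
    cv=inner, scoring="neg_mean_squared_error",
)

for name, model in [("basis", basis), ("local", local)]:
    scores = cross_val_score(model, X, y, cv=outer,
                             scoring="neg_mean_squared_error")
    print(f"{name}: estimated risk = {-scores.mean():.3f}")
```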

Next time…

Compromises if \(p\) is big

Additive models and trees