03 The regression function
Stat 406
Geoff Pleiss, Trevor Campbell
Last modified – 16 September 2024
\[
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
\DeclareMathOperator*{\minimize}{minimize}
\DeclareMathOperator*{\maximize}{maximize}
\DeclareMathOperator*{\find}{find}
\DeclareMathOperator{\st}{subject\,\,to}
\newcommand{\E}{E}
\newcommand{\Expect}[1]{\E\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]}
\newcommand{\given}{\ \vert\ }
\newcommand{\X}{\mathbf{X}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\P}{\mathcal{P}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\snorm}[1]{\lVert #1 \rVert}
\newcommand{\tr}[1]{\mbox{tr}(#1)}
\newcommand{\brt}{\widehat{\beta}^R_{s}}
\newcommand{\brl}{\widehat{\beta}^R_{\lambda}}
\newcommand{\bls}{\widehat{\beta}_{ols}}
\newcommand{\blt}{\widehat{\beta}^L_{s}}
\newcommand{\bll}{\widehat{\beta}^L_{\lambda}}
\newcommand{\U}{\mathbf{U}}
\newcommand{\D}{\mathbf{D}}
\newcommand{\V}{\mathbf{V}}
\]
Mean squared error (MSE)
Last time… Ordinary Least Squares
\[\widehat\beta = \argmin_\beta \sum_{i=1}^n ( y_i - x_i^\top \beta )^2.\]
“Find the \(\beta\) which minimizes the sum of squared errors.”
\[\widehat\beta = \argmin_\beta \frac{1}{n}\sum_{i=1}^n ( y_i - x_i^\top \beta )^2.\]
“Find the \(\beta\) which minimizes the mean squared error.”
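As a concrete reminder, here is a minimal sketch (simulated data; every object name is illustrative) showing that lm() and the closed-form least-squares solution agree, and that dividing by \(n\) does not change the minimizer.
Code
set.seed(406)
n <- 100
x <- matrix(rnorm(n * 2), n, 2)                   # two predictors, no intercept for simplicity
y <- drop(x %*% c(1, -2) + rnorm(n))
beta_lm <- coef(lm(y ~ x - 1))                    # minimizes the sum (equivalently, mean) of squared errors
beta_cf <- solve(crossprod(x), crossprod(x, y))   # closed form: (X'X)^{-1} X'y
cbind(beta_lm, beta_cf)                           # agree up to numerical error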
Forget all that…
That’s “stuff that seems like a good idea”
And it is, for many reasons.
This class is about those reasons, and the “statistics” behind them.
Methods for “Statistical” Learning
Starts with “what is a model?”
What is a model?
In statistics, “model” has a mathematical meaning.
Distinct from “algorithm” or “procedure”.
Defining a model often leads to a procedure/algorithm with good properties.
Sometimes procedure/algorithm \(\Rightarrow\) a specific model.
Statistics (the field) tells me how to understand when different procedures are desirable and the mathematical guarantees that they satisfy.
When are certain models appropriate?
One definition of “Statistical Learning” is the “statistics behind the procedure”.
Statistical models 101
We observe data \(Z_1,\ Z_2,\ \ldots,\ Z_n\) generated by some probability distribution \(P\). We want to use the data to learn about \(P\).
A statistical model is a set of distributions \(\mathcal{P}\).
Some examples:
- \(\P = \{P : P(Z=1)=p,\ P(Z=0)=1-p, 0 \leq p \leq 1\}\).
- \(\P = \{P : Y | X \sim N(X^\top\beta,\sigma^2),\ \beta \in \R^p,\ \sigma>0\}\) (here \(Z = (Y,X)\))
- \(\P = \{P \mbox{ given by any CDF }F\}\).
- \(\P = \{P : E[Y | X] = f(X) \mbox{ for some smooth } f: \R^p \rightarrow \R\}\) (here \(Z = (Y,X)\))
Statistical models
We want to use the data to select a distribution \(P \in \P\) that plausibly generated the data.
My model:
\[
\P = \{P: P(Z=1)=p,\ P(Z=0)=1-p,\ 0 < p < 1 \}
\]
To completely characterize \(P\), I just need to estimate \(p\).
Need to assume that \(P \in \P\).
This assumption is mostly empty: we only need the observations to be independent, and we will never observe anything other than 0 or 1 (no \(z=12\)).
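A minimal sketch of this model in action (simulated data; the true \(p\) is made up): estimating the single parameter \(p\), say by the sample proportion, completely determines the estimated distribution.
Code
set.seed(406)
p_true <- 0.3                          # unknown in practice; made up here for illustration
z <- rbinom(250, size = 1, prob = p_true)
p_hat <- mean(z)                       # the sample proportion estimates p
p_hat                                  # this one number characterizes the estimated P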
Statistical models
We observe data \((Y, X)\) generated by some probability distribution \(P\) on \(\R \times \R^p\). We want to use the data to learn about \(P\).
My model
\[
\P = \{P : Y | X \sim N(X^\top\beta,\ \sigma^2), \beta \in \R^p, \sigma>0\}.
\]
To completely characterize the \(Y|X\)-conditional of \(P\), I just need to estimate \(\beta\) and \(\sigma\).
- I’m not interested in learning the \(X\)-marginal of \(P\)
Need to assume that \(P\in\P\).
This time, I have to assume a lot more: (conditional) linearity, independence, conditional Gaussian noise, no ignored variables, no collinearity, etc.
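A minimal sketch (all numbers made up) of simulating from one member of this \(\P\) and then estimating \(\beta\) and \(\sigma\), which together pin down the conditional distribution of \(Y\) given \(X\).
Code
set.seed(406)
n <- 200; p <- 3
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -1, 0.5); sigma <- 1.5
y <- drop(X %*% beta + rnorm(n, sd = sigma))   # Y | X ~ N(X' beta, sigma^2)
fit <- lm(y ~ X - 1)                           # no intercept, matching the simulation
coef(fit)                                      # estimates beta
sigma(fit)                                     # estimates sigma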
Statistical models, unfamiliar example
We observe data \(Z \in \R\) generated by some probability distribution \(P\). We want to use the data to learn about \(P\).
My model
\[
\P = \{P : Z \textrm{ has a density function } f \}.
\]
To completely characterize \(P\), I need to estimate \(f\).
In fact, we can’t hope to do this.
Revised Model 1 - \(\P=\{ Z \textrm{ has a density function } f : \int (f'')^2 dx < M \}\)
Revised Model 2 - \(\P=\{ Z \textrm{ has a density function } f : \int (f'')^2 dx < K < M \}\)
Revised Model 3 - \(\P=\{ Z \textrm{ has a density function } f : \int |f'| dx < M \}\)
- Each of these suggests different ways of estimating \(f\)
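For instance, a kernel density estimate’s bandwidth encodes a smoothness assumption much like the constraints above. A hedged sketch with simulated data (the distribution and bandwidths are made up):
Code
set.seed(406)
z <- rnorm(300)                               # pretend the true density is unknown
plot(density(z, bw = "SJ"),                   # data-driven (Sheather-Jones) bandwidth
     main = "Two kernel density estimates of f")
lines(density(z, bw = 1), lty = 2)            # larger bandwidth: a smoother estimate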
Assumption Lean Regression
Imagine \(Z = (Y, \mathbf{X}) \sim P\) with \(Y \in \R\) and \(\mathbf{X} = (1, X_1, \ldots, X_p)^\top\).
We are interested in the conditional distribution \(P_{Y|\mathbf{X}}\)
Suppose we think that there is some function of interest which relates \(Y\) and \(X\).
Let’s call this function \(\mu(\mathbf{X})\) for the moment. How do we estimate \(\mu\)? What is \(\mu\)?
To make this precise, we
- Have a model \(\P\).
- Need to define a “good” functional \(\mu\).
- Let’s loosely define “good” as
Given a new (random) \(Z\), \(\mu(\mathbf{X})\) is “close” to \(Y\).
Evaluating “close”
We need more functions.
Choose some loss function \(\ell\) that measures how close \(\mu\) and \(Y\) are.
Squared-error:
\(\ell(y,\ \mu) = (y-\mu)^2\)
Absolute-error:
\(\ell(y,\ \mu) = |y-\mu|\)
Zero-One:
\(\ell(y,\ \mu) = I(y\neq\mu)=\begin{cases} 0 & y=\mu\\1 & \mbox{else}\end{cases}\)
Cauchy:
\(\ell(y,\ \mu) = \log(1 + (y - \mu)^2)\)
Code
library(ggplot2)
library(tibble)
# placeholder colours standing in for the slide theme's palette, so this chunk runs standalone
primary <- "blue"; tertiary <- "red"; orange <- "orange"
ggplot() +
  xlim(-2, 2) +
  geom_function(fun = ~ log(1 + .x^2), colour = "purple", linewidth = 2) +  # Cauchy
  geom_function(fun = ~ .x^2, colour = tertiary, linewidth = 2) +           # squared error
  geom_function(fun = ~ abs(.x), colour = primary, linewidth = 2) +         # absolute error
  geom_line(                                                                # zero-one (jump at 0)
    data = tibble(x = seq(-2, 2, length.out = 100), y = as.numeric(x != 0)),
    aes(x, y), colour = orange, linewidth = 2
  ) +
  geom_point(data = tibble(x = 0, y = 0), aes(x, y),
             colour = orange, pch = 16, size = 3) +
  ylab(bquote("\u2113" * (y - mu))) + xlab(bquote(y - mu))
Start with (Expected) Squared Error
Let’s try to minimize the expected squared error (MSE).
Claim: \(\mu(X) = \Expect{Y\ \vert\ X}\) minimizes MSE.
That is, for any \(r(X)\), \(\Expect{(Y - \mu(X))^2} \leq \Expect{(Y-r(X))^2}\).
Proof of Claim:
\[\begin{aligned}
\Expect{(Y-r(X))^2}
&= \Expect{(Y- \mu(X) + \mu(X) - r(X))^2}\\
&= \Expect{(Y- \mu(X))^2} + \Expect{(\mu(X) - r(X))^2} \\
&\quad +2\Expect{(Y- \mu(X))(\mu(X) - r(X))}\\
&=\Expect{(Y- \mu(X))^2} + \Expect{(\mu(X) - r(X))^2} \\
&\quad +2\Expect{(\mu(X) - r(X))\underbrace{\Expect{Y- \mu(X) \given X}}_{=0}}\\
&=\Expect{(Y- \mu(X))^2} + \Expect{(\mu(X) - r(X))^2} + 0\\
&\geq \Expect{(Y- \mu(X))^2}
\end{aligned}\]
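A quick Monte Carlo check of the claim (a sketch; the true \(\mu\) and the competitor \(r\) below are made up): the conditional mean attains the smaller expected squared error.
Code
set.seed(406)
n <- 1e5
x <- runif(n, -2, 2)
mu <- function(x) x^2                 # made-up conditional mean
y <- mu(x) + rnorm(n)                 # noise variance 1
r <- function(x) 1 + x                # an arbitrary competitor
mean((y - mu(x))^2)                   # about 1, the noise variance
mean((y - r(x))^2)                    # strictly larger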
The regression function
Sometimes people call this solution:
\[\mu(X) = \Expect{Y \ \vert\ X}\]
the regression function. (But don’t forget that it depends on the choice of \(\ell\).)
If we assume that \(\mu(x) = \Expect{Y \ \vert\ X=x} = x^\top \beta\), then we get back exactly OLS.
But why should we assume \(\mu(x) = x^\top \beta\)?
Brief aside
Some notation / terminology
“Hats” on things mean “estimates”, so \(\widehat{\mu}\) is an estimate of \(\mu\)
Parameters are “properties of the model”, so \(f_X(x)\) or \(\mu\) or \(\Var{Y}\)
Random variables like \(X\), \(Y\), \(Z\) may eventually become data, \(x\), \(y\), \(z\), once observed.
“Estimating” means “using observations to estimate parameters”
“Predicting” means “using observations to predict future data”
Often, there is a parameter whose estimate will provide a prediction.
This last point can lead to confusion.
The regression function
In mathematics: \(\mu(x) = \Expect{Y \ \vert\ X=x}\).
In words:
Regression with squared-error loss is really about estimating the (conditional) mean.
If \(Y\sim \textrm{N}(\mu,\ 1)\), our best guess for a new \(Y\) is \(\mu\).
For regression, we let the mean \((\mu)\) depend on \(X\).
Think of \(Y\sim \textrm{N}(\mu(X),\ 1)\), then conditional on \(X=x\), our best guess for a new \(Y\) is \(\mu(x)\)
[whatever this function \(\mu\) is]
Anything strange?
For any two variables \(Y\) and \(X\), we can always write
\[Y = E[Y\given X] + (Y - E[Y\given X]) = \mu(X) + \eta(X)\]
such that \(\Expect{\eta(X)}=0\).
- Suppose, \(\mu(X)=\mu_0\) (constant in \(X\)), are \(Y\) and \(X\) independent?
- Suppose \(Y\) and \(X\) are independent, is \(\mu(X)=\mu_0\)?
For more practice on this see the Fun Worksheet on Theory and solutions
In this course, I do not expect you to be able to create this math, but understanding and explaining it is important.
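A sketch of a counterexample for the first question (distributions made up): \(\mu(X)\) is constant in \(X\), but \(Y\) and \(X\) are not independent because the spread of \(Y\) changes with \(X\).
Code
set.seed(406)
x <- rnorm(1e5)
y <- x * rnorm(1e5)      # E[Y | X] = 0 for every X, so mu(X) is constant
cor(y, x)                # roughly 0: no linear relationship ...
cor(y^2, x^2)            # ... yet clearly dependent: |Y| grows with |X|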
What do we mean by good predictions?
We make observations and then attempt to “predict” new, unobserved data.
Sometimes this is the same as estimating the (conditional) mean.
Mostly, we observe \((y_1,x_1),\ \ldots,\ (y_n,x_n)\), and we want some way to predict \(Y\) from \(X\).
Expected test MSE
For regression applications, we will use squared-error loss:
\(R_n(\widehat{\mu}) = \Expect{(Y-\widehat{\mu}(X))^2}\)
I’m giving this a name, \(R_n\), for convenience.
(This notation differs from the textbook.)
This is the expected test MSE.
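We cannot compute this expectation analytically in real problems, but we can approximate it with a large held-out test set. A hedged sketch with simulated data and an arbitrary estimator:
Code
set.seed(406)
f <- function(x) sin(2 * pi * x)                       # made-up true conditional mean
train <- data.frame(x = runif(100));  train$y <- f(train$x) + rnorm(100, sd = 0.3)
test  <- data.frame(x = runif(1e4));  test$y  <- f(test$x)  + rnorm(1e4, sd = 0.3)
muhat <- lm(y ~ x, data = train)                       # some estimator of mu
mean((test$y - predict(muhat, newdata = test))^2)      # Monte Carlo approximation of R_n for this fit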
Example: Estimating/Predicting the (conditional) mean
Suppose we know that we want to predict a quantity \(Y\),
where \(\Expect{Y}= \mu \in \mathbb{R}\) and \(\Var{Y} = 1\).
Our data is \(\{y_1,\ldots,y_n\}\)
We will use the sample mean \(\overline{Y}_n\) both to estimate \(\mu\) and to predict a new \(Y\).
Estimating the mean
We evaluate the estimation risk (since we’re estimating \(\mu\)) via:
\[\begin{aligned}
E[(\overline{Y}_n-\mu)^2]
&= E[\overline{Y}_n^2] - 2\mu E[\overline{Y}_n] + \mu^2 \\
&= \mu^2 + \frac{1}{n} - 2\mu^2 + \mu^2\\
&= \frac{1}{n}
\end{aligned}\]
Useful trick
For any \(Z\),
\(\Var{Z} = \Expect{Z^2} - \Expect{Z}^2\).
Therefore:
\(\Expect{Z^2} = \Var{Z} + \Expect{Z}^2\).
Predicting new Y’s
We evaluate the prediction risk of \(\overline{Y}_n\) (since we’re predicting \(Y\)) via:
\[\begin{aligned}
R_n(\overline{Y}_n)
&= \E[(\overline{Y}_n-Y)^2]\\
&= \E[(\overline{Y}_n - \mu)^2] + \E[(\mu-Y)^2]\\
&= \frac{1}{n} + 1
\end{aligned}\]
- \(1/n\) for estimation risk
- \(1\) for remaining noise in \(Y\)
Tricks:
Add and subtract \(\mu\) inside the square.
\(\overline{Y}_n\) and \(Y\) are independent, and both have mean \(\mu\).
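A Monte Carlo sketch of both results (the true \(\mu\) and the \(n\) below are arbitrary): the estimation risk comes out near \(1/n\) and the prediction risk near \(1 + 1/n\).
Code
set.seed(406)
n <- 10; mu <- 2; nrep <- 1e5
ybar <- replicate(nrep, mean(rnorm(n, mean = mu)))   # sample mean from each simulated dataset
ynew <- rnorm(nrep, mean = mu)                       # an independent new Y for each replicate
mean((ybar - mu)^2)       # estimation risk: about 1/n = 0.1
mean((ybar - ynew)^2)     # prediction risk: about 1 + 1/n = 1.1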
Predicting new Y’s
What is the prediction risk of guessing \(Y=0\)?
You can probably guess that this is a stupid idea.
Let’s show why it’s stupid.
\[\begin{aligned}
R_n(0) &= \E[(0-Y)^2] = 1 + \mu^2
\end{aligned}\]
Predicting new Y’s
What is the prediction risk of guessing \(Y=\mu\)?
This is a great idea, but we don’t know \(\mu\).
Let’s see what happens anyway.
\[\begin{aligned}
R_n(\mu) &= \E[(Y-\mu)^2]= 1
\end{aligned}\]
Risk relations
Prediction risk: \(R_n(\overline{Y}_n) = 1 + \frac{1}{n}\)
Estimation risk: \(E[(\overline{Y}_n - \mu)^2] = \frac{1}{n}\)
There is actually a nice interpretation here:
The common \(1/n\) term is \(\Var{\overline{Y}_n}\)
The extra \(1\) in the prediction risk is the irreducible error
- \(Y\) is a random variable, and hence noisy.
- We can never eliminate its intrinsic variance.
- In other words, even if we knew \(\mu\), we could never get closer than \(1\), on average.
Intuitively, \(\overline{Y}_n\) is the obvious thing to do.
But what about unintuitive things…
Next time…
Trading bias and variance