## Mean squared error (MSE)

Last time… Ordinary Least Squares

\[\widehat\beta = \argmin_\beta \sum_{i=1}^n ( y_i - x_i^\top \beta )^2.\]

“Find the \(\beta\) which minimizes the sum of squared errors.”

\[\widehat\beta = \arg\min_\beta \frac{1}{n}\sum_{i=1}^n ( y_i - x_i^\top \beta )^2.\]

“Find the \(\beta\) which minimizes the mean squared error.”
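The two criteria differ only by the constant factor \(1/n\), so they share a minimizer. A minimal sketch of that in R, assuming simulated data with made-up coefficients (`beta_true`, `sse`, and `mse` are illustrative names, not from the slides):

```r
set.seed(1)
n <- 100
x <- cbind(1, rnorm(n))            # design matrix with an intercept column
beta_true <- c(2, -1)              # hypothetical coefficients, for illustration only
y <- drop(x %*% beta_true) + rnorm(n)

sse <- function(beta) sum((y - x %*% beta)^2)    # sum of squared errors
mse <- function(beta) mean((y - x %*% beta)^2)   # mean squared error

# Minimize each criterion numerically, then compare with lm()
fit_sse <- optim(c(0, 0), sse, method = "BFGS")$par
fit_mse <- optim(c(0, 0), mse, method = "BFGS")$par
fit_lm  <- unname(coef(lm(y ~ x - 1)))

rbind(fit_sse, fit_mse, fit_lm)
```

The three rows should agree up to the optimizer's tolerance.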

## Forget all that…

That’s “stuff that seems like a good idea”

And it is, for many reasons.

This class is about those reasons, and the “statistics” behind them.

#### Methods for “Statistical” Learning

Starts with “what is a model?”

## What is a model?

In statistics, “model” has a mathematical meaning.

Distinct from “algorithm” or “procedure”.

Defining a model often leads to a procedure/algorithm with good properties.

Sometimes procedure/algorithm \(\Rightarrow\) a specific model.

Statistics (the field) tells me how to understand when different procedures are desirable and the mathematical guarantees that they satisfy.

When are certain models appropriate?

One definition of “Statistical Learning” is the “statistics behind the procedure”.

## Statistical models 101

We observe data \(Z_1,\ Z_2,\ \ldots,\ Z_n\) generated by some probability distribution \(P\). We want to use the data to learn about \(P\).

A statistical model is a set of distributions \(\mathcal{P}\).

Some examples:

- \(\P = \{ 0 < p < 1 : P(z=1)=p,\ P(z=0)=1-p\}\).
- \(\P = \{ \beta \in \R^p,\ \sigma>0 : Y \sim N(X^\top\beta,\sigma^2),\ X\mbox{ fixed}\}\).
- \(\P = \{\mbox{all CDF's }F\}\).
- \(\P = \{\mbox{all smooth functions } f: \R^p \rightarrow \R : Z_i = (X_i, Y_i),\ E[Y_i] = f(X_i) \}\)

## Statistical models

We want to use the data to select a distribution \(P\) that probably generated the data.

#### My model:

\[
\P = \{ P(z=1)=p,\ P(z=0)=1-p,\ 0 < p < 1 \}
\]

To completely characterize \(P\), I just need to estimate \(p\).

Need to assume that \(P \in \P\).

This assumption is mostly empty: *need independence, can’t see \(z=12\).*

## Statistical models

We observe data \(Z_i=(Y_i,X_i)\) generated by some probability distribution \(P\). We want to use the data to learn about \(P\).

#### My model

\[
\P = \{ \beta \in \R^p, \sigma>0 : Y_i \given X_i=x_i \sim N(x_i^\top\beta,\ \sigma^2) \}.
\]

To completely characterize \(P\), I just need to estimate \(\beta\) and \(\sigma\).

Need to assume that \(P\in\P\).

This time, I have to assume a lot more: *(conditional) linearity, independence, conditional Gaussian noise, no ignored variables, no collinearity, etc.*
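As a rough illustration of "just need to estimate \(\beta\) and \(\sigma\)", here is a sketch that simulates from this model and recovers both. The values `beta_true` and `sigma_true` are hypothetical, and the sketch assumes every listed assumption actually holds:

```r
set.seed(6)
n <- 500
beta_true  <- c(1, 2, -0.5)   # hypothetical coefficients (intercept first)
sigma_true <- 1.5             # hypothetical noise standard deviation
x <- cbind(1, matrix(rnorm(n * 2), nrow = n))
y <- drop(x %*% beta_true) + rnorm(n, sd = sigma_true)

fit <- lm(y ~ x - 1)          # x already contains the intercept column
beta_hat  <- unname(coef(fit))
sigma_hat <- sigma(fit)       # residual standard error

rbind(truth = c(beta_true, sigma_true), estimate = c(beta_hat, sigma_hat))
```

Once \(\widehat\beta\) and \(\widehat\sigma\) are in hand, the fitted member of \(\P\) is fully specified.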

## Statistical models, unfamiliar example

We observe data \(Z_i \in \R\) generated by some probability distribution \(P\). We want to use the data to learn about \(P\).

#### My model

\[
\P = \{ Z_i \textrm{ has a density function } f \}.
\]

To completely characterize \(P\), I need to estimate \(f\).

In fact, we can’t hope to do this without restricting \(f\) further.

Revised Model 1 - \(\P=\{ Z_i \textrm{ has a density function } f : \int (f'')^2 dx < M \}\)

Revised Model 2 - \(\P=\{ Z_i \textrm{ has a density function } f : \int (f'')^2 dx < K < M \}\)

Revised Model 3 - \(\P=\{ Z_i \textrm{ has a density function } f : \int |f'| dx < M \}\)

- Each of these suggests different ways of estimating \(f\)

## Assumption Lean Regression

Imagine \(Z = (Y, \mathbf{X}) \sim P\) with \(Y \in \R\) and \(\mathbf{X} = (1, X_1, \ldots, X_p)^\top\).

We are interested in the *conditional* distribution \(P_{Y|\mathbf{X}}\)

Suppose we think that there is *some* function of interest which relates \(Y\) and \(X\).

Let’s call this function \(\mu(\mathbf{X})\) for the moment. How do we estimate \(\mu\)? What is \(\mu\)?

To make this precise, we

- Have a model \(\P\).
- Need to define a “good” functional \(\mu\).
- Let’s loosely define “good” as

Given a new (random) \(Z\), \(\mu(\mathbf{X})\) is “close” to \(Y\).

## Evaluating “close”

We need more functions.

Choose some *loss function* \(\ell\) that measures how close \(\mu\) and \(Y\) are.

*Squared-error:*

\(\ell(y,\ \mu) = (y-\mu)^2\)

*Absolute-error:*

\(\ell(y,\ \mu) = |y-\mu|\)

*Zero-One:*

\(\ell(y,\ \mu) = I(y\neq\mu)=\begin{cases} 0 & y=\mu\\1 & \mbox{else}\end{cases}\)

*Cauchy:*

\(\ell(y,\ \mu) = \log(1 + (y - \mu)^2)\)
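To see how differently these penalize the same miss, the four losses can be written as R functions and evaluated on a few residuals. The observations `y` and the candidate `mu` below are made up for illustration:

```r
# The four losses above, as functions of an observation y and a guess mu
squared_error  <- function(y, mu) (y - mu)^2
absolute_error <- function(y, mu) abs(y - mu)
zero_one       <- function(y, mu) as.numeric(y != mu)
cauchy         <- function(y, mu) log(1 + (y - mu)^2)

y  <- c(0, 0.5, 1, 3)   # hypothetical observations
mu <- 0                 # a single candidate prediction

losses <- data.frame(
  y        = y,
  squared  = squared_error(y, mu),
  absolute = absolute_error(y, mu),
  zero_one = zero_one(y, mu),
  cauchy   = cauchy(y, mu)
)
losses
```

At a residual of 3, squared error charges 9 while the Cauchy loss charges only \(\log 10 \approx 2.3\), so large misses matter far less under the latter.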

## Code

```
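library(ggplot2)
library(tibble)
# Placeholder colours (assumed; the slides' own palette is defined elsewhere)
primary <- "#2c365e"; tertiary <- "#0b8a8a"; orange <- "#e98a15"

# Plot each loss as a function of the residual y - mu
ggplot() +
  xlim(-2, 2) +
  geom_function(fun = ~ log(1 + .x^2), colour = "purple", linewidth = 2) + # Cauchy
  geom_function(fun = ~ .x^2, colour = tertiary, linewidth = 2) +          # squared error
  geom_function(fun = ~ abs(.x), colour = primary, linewidth = 2) +        # absolute error
  geom_line(                                                               # zero-one: 1 away from 0
    data = tibble(x = seq(-2, 2, length.out = 100), y = as.numeric(x != 0)),
    aes(x, y), colour = orange, linewidth = 2) +
  geom_point(data = tibble(x = 0, y = 0), aes(x, y),                       # zero-one: 0 at y = mu
    colour = orange, pch = 16, size = 3) +
  ylab(bquote("\u2113" * (y - mu))) + xlab(bquote(y - mu))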
```

## Start with (Expected) Squared Error

Let’s try to minimize the *expected* squared error (MSE).

Claim: \(\mu(X) = \Expect{Y\ \vert\ X}\) minimizes MSE.

That is, for any \(r(X)\), \(\Expect{(Y - \mu(X))^2} \leq \Expect{(Y-r(X))^2}\).

Proof of Claim:

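\[\begin{aligned}
\Expect{(Y-r(X))^2}
&= \Expect{(Y- \mu(X) + \mu(X) - r(X))^2}\\
&= \Expect{(Y- \mu(X))^2} + \Expect{(\mu(X) - r(X))^2} \\
&\quad +2\Expect{(Y- \mu(X))(\mu(X) - r(X))}\\
&=\Expect{(Y- \mu(X))^2} + \Expect{(\mu(X) - r(X))^2} \\
&\quad +2\Expect{(\mu(X) - r(X))\,\Expect{Y- \mu(X)\ \vert\ X}}\\
&=\Expect{(Y- \mu(X))^2} + \Expect{(\mu(X) - r(X))^2} + 0\\
&\geq \Expect{(Y- \mu(X))^2}
\end{aligned}\]

The cross term is handled by conditioning on \(X\) (the tower property): given \(X\), the factor \(\mu(X) - r(X)\) is fixed, and \(\Expect{Y - \mu(X)\ \vert\ X} = 0\) by the definition of \(\mu\).

## The regression function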

Sometimes people call this solution:

\[\mu(X) = \Expect{Y \ \vert\ X}\]

the regression function. (But don’t forget that it depended on \(\ell\).)

If we assume that \(\mu(x) = \Expect{Y \ \vert\ X=x} = x^\top \beta\), and replace the expectation with the average over the observed data, then we get back exactly OLS.

But why should we assume \(\mu(x) = x^\top \beta\)?
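A small simulation can make the question concrete. The sketch below assumes a made-up nonlinear regression function, `mu(x) = sin(2 * x)`, with unit-variance Gaussian noise, and compares the test MSE of the true conditional mean against a fitted straight line:

```r
set.seed(2)
mu <- function(x) sin(2 * x)            # hypothetical true regression function
n_train <- 200
n_test  <- 1e5

train <- data.frame(x = runif(n_train, -2, 2))
train$y <- mu(train$x) + rnorm(n_train)
test <- data.frame(x = runif(n_test, -2, 2))
test$y <- mu(test$x) + rnorm(n_test)

linear_fit <- lm(y ~ x, data = train)   # assumes mu(x) = b0 + b1 * x

mse_truth  <- mean((test$y - mu(test$x))^2)                  # oracle: uses the true mu
mse_linear <- mean((test$y - predict(linear_fit, test))^2)   # misspecified linear fit
c(truth = mse_truth, linear = mse_linear)
```

In this setup the conditional mean attains a test MSE near the noise variance of 1, and the linear fit does noticeably worse because no line tracks \(\sin(2x)\) well on \([-2, 2]\).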

## Brief aside

Some notation / terminology

“Hats” on things mean “estimates”, so \(\widehat{\mu}\) is an estimate of \(\mu\)

Parameters are “properties of the model”, so \(f_X(x)\) or \(\mu\) or \(\Var{Y}\)

Random variables like \(X\), \(Y\), \(Z\) may eventually become data, \(x\), \(y\), \(z\), once observed.

“Estimating” means “using observations to estimate *parameters*”

“Predicting” means “using observations to predict *future data*”

Often, there is a parameter whose estimate will provide a prediction.

This last point can lead to confusion.

## The regression function

In mathematics: \(\mu(x) = \Expect{Y \ \vert\ X=x}\).

In words:

Regression with squared-error loss is really about estimating the (conditional) mean.

If \(Y\sim \textrm{N}(\mu,\ 1)\), our best guess for a new \(Y\) is \(\mu\).

For regression, we let the mean \((\mu)\) depend on \(X\).

Think of \(Y\sim \textrm{N}(\mu(X),\ 1)\), then conditional on \(X=x\), our best guess for a new \(Y\) is \(\mu(x)\)

[whatever this function \(\mu\) is]

## Anything strange?

For any two variables \(Y\) and \(X\), we can always write

\[Y = E[Y\given X] + (Y - E[Y\given X]) = \mu(X) + \eta(X)\]

such that \(\Expect{\eta(X)}=0\).

- Suppose \(\mu(X)=\mu_0\) (constant in \(X\)). Are \(Y\) and \(X\) independent?

- Suppose \(Y\) and \(X\) are independent. Is \(\mu(X)=\mu_0\)?
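One way to probe the first question is to simulate a pair where the conditional mean is flat but the conditional spread is not. The construction below (`y <- 5 + x * rnorm(n)`) is a made-up example, not one from the course materials:

```r
set.seed(3)
n <- 1e5
x <- runif(n, 1, 3)
y <- 5 + x * rnorm(n)   # E[Y | X = x] = 5 for every x, but Var[Y | X = x] = x^2

# Conditional means in the two halves of the X range are both close to 5 ...
cond_means <- tapply(y, x > 2, mean)
# ... while the conditional standard deviations differ
cond_sds <- tapply(y, x > 2, sd)
cond_means
cond_sds
```

Here the distribution of \(Y\) changes with \(X\) even though \(\mu(X)\) does not, which suggests that a constant regression function alone does not give independence.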

For more practice on this see the Fun Worksheet on Theory and solutions

In this course, I do not expect you to be able to create this math, but understanding and explaining it is important.

## What do we mean by good predictions?

We make observations and then attempt to “predict” new, unobserved data.

Sometimes this is the same as estimating the (conditional) mean.

Mostly, we observe \((y_1,x_1),\ \ldots,\ (y_n,x_n)\), and we want some way to predict \(Y\) from \(X\).

## Expected test MSE

For *regression* applications, we will use squared-error loss:

\(R_n(\widehat{\mu}) = \Expect{(Y-\widehat{\mu}(X))^2}\)

I’m giving this a name, \(R_n\), for ease.

This notation differs from the text.

This is *expected test MSE*.

## Example: Estimating/Predicting the (conditional) mean

Suppose we know that we want to predict a quantity \(Y\),

where \(\Expect{Y}= \mu \in \mathbb{R}\) and \(\Var{Y} = 1\).

Our data is \(\{y_1,\ldots,y_n\}\), independent observations of \(Y\).

Claim: We want to estimate \(\mu\).

## Estimating the mean

- Let \(\widehat{\mu}=\overline{Y}_n\) be the sample mean.

- We can ask about the *estimation risk* (since we’re estimating \(\mu\)):

\[\begin{aligned}
E[(\overline{Y}_n-\mu)^2]
&= E[\overline{Y}_n^2] -2\mu E[\overline{Y}_n] + \mu^2 \\
&= \mu^2 + \frac{1}{n} - 2\mu^2 + \mu^2\\
&= \frac{1}{n}
\end{aligned}\]

Useful trick

For any \(Z\),

\(\Var{Z} = \Expect{Z^2} - \Expect{Z}^2\).

Therefore:

\(\Expect{Z^2} = \Var{Z} + \Expect{Z}^2\).
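The \(1/n\) estimation risk can be checked by Monte Carlo. This is a sketch under extra assumptions the slides do not make: the data are Gaussian, and `mu`, `n`, and `reps` are arbitrary illustrative values.

```r
set.seed(4)
mu <- 3; n <- 25; reps <- 20000   # hypothetical values
# Each column is one data set of size n with mean mu and variance 1
ybar <- colMeans(matrix(rnorm(n * reps, mean = mu, sd = 1), nrow = n))
est_risk <- mean((ybar - mu)^2)   # Monte Carlo estimate of E[(Ybar_n - mu)^2]
c(simulated = est_risk, theory = 1 / n)
```

The simulated value should sit close to \(1/n = 0.04\), whatever value of `mu` is used.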

## Predicting new Y’s

- Let \(\widehat{Y}=\overline{Y}_n\) be the sample mean.

- What is the *prediction risk* of \(\overline{Y}_n\)?

\[\begin{aligned}
R_n(\overline{Y}_n)
&= \E[(\overline{Y}_n-Y)^2]\\
&= \E[\overline{Y}_{n}^{2}] -2\E[\overline{Y}_n Y] + \E[Y^2] \\
&= \mu^2 + \frac{1}{n} - 2\mu^2 + \mu^2 + 1 \\
&= 1 + \frac{1}{n}
\end{aligned}\]

Tricks:

Used the variance thing again.

If \(X\) and \(Z\) are independent, then \(\Expect{XZ} = \Expect{X}\Expect{Z}\)
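Spelled out, the three expectations in the second line of the derivation are as follows, taking the new \(Y\) to be independent of the data (and hence of \(\overline{Y}_n\)):

\[\begin{aligned}
\E[\overline{Y}_n^2] &= \Var{\overline{Y}_n} + \E[\overline{Y}_n]^2 = \frac{1}{n} + \mu^2,\\
\E[\overline{Y}_n Y] &= \E[\overline{Y}_n]\,\E[Y] = \mu^2,\\
\E[Y^2] &= \Var{Y} + \E[Y]^2 = 1 + \mu^2.
\end{aligned}\]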

## Predicting new Y’s

What is the prediction risk of guessing \(Y=0\)?

You can probably guess that this is a stupid idea.

Let’s show why it’s stupid.

\[\begin{aligned}
R_n(0) &= \E[(0-Y)^2] = \E[Y^2] = \Var{Y} + \E[Y]^2 = 1 + \mu^2
\end{aligned}\]

## Predicting new Y’s

What is the prediction risk of guessing \(Y=\mu\)?

This is a great idea, but we don’t know \(\mu\).

Let’s see what happens anyway.

\[\begin{aligned}
R_n(\mu) &= \E[(Y-\mu)^2] = \Var{Y} = 1
\end{aligned}\]

## Risk relations

Prediction risk: \(R_n(\overline{Y}_n) = 1 + \frac{1}{n}\)

Estimation risk: \(E[(\overline{Y}_n - \mu)^2] = \frac{1}{n}\)

There is actually a nice interpretation here:

The common \(1/n\) term is \(\Var{\overline{Y}_n}\)

The extra term \(1\) in the prediction risk is the *irreducible error*, \(\Var{Y}\)

- \(Y\) is a random variable, and hence noisy.
- We can never eliminate its intrinsic variance.

- In other words, even if we knew \(\mu\), we could never get closer than \(1\), on average.
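The three prediction risks computed so far can be compared side by side in a simulation. The sketch assumes Gaussian data and arbitrary values for `mu`, `n`, and `reps`; only the mean and variance matter for the theory.

```r
set.seed(5)
mu <- 2; n <- 10; reps <- 50000   # hypothetical values
ybar  <- colMeans(matrix(rnorm(n * reps, mean = mu), nrow = n))  # one sample mean per replication
y_new <- rnorm(reps, mean = mu)                                  # an independent new Y per replication

risks <- c(
  zero        = mean((y_new - 0)^2),     # theory: 1 + mu^2
  oracle_mu   = mean((y_new - mu)^2),    # theory: 1
  sample_mean = mean((y_new - ybar)^2)   # theory: 1 + 1/n
)
rbind(simulated = risks, theory = c(1 + mu^2, 1, 1 + 1 / n))
```

The gap between the `sample_mean` and `oracle_mu` columns is the \(1/n\) cost of having to estimate \(\mu\); neither gets below the irreducible 1.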

Intuitively, \(\overline{Y}_n\) is the obvious thing to do.

But what about unintuitive things…

# Next time…

Trading bias and variance