class: center, middle, inverse, title-slide

.title[
# 03 The regression function
]
.author[
### STAT 406
]
.author[
### Daniel J. McDonald
]
.date[
### Last modified - 2022-09-19
]

---
class: middle, center
background-image: url("https://upload.wikimedia.org/wikipedia/commons/6/6d/Activemarker2.PNG")
background-size: cover

.secondary[.larger[Module]]

.huge-orange-number[1]

.secondary[.large[selecting models]]

---

## Mean squared error (MSE)

Last time...

.secondary[Ordinary Least Squares]

`$$\widehat\beta = \arg\min_\beta \sum_{i=1}^n ( y_i - x_i^\top \beta )^2.$$`

"Find the `\(\beta\)` which minimizes the sum of squared errors."

`$$\widehat\beta = \arg\min_\beta \frac{1}{n}\sum_{i=1}^n ( y_i - x_i^\top \beta )^2.$$`

"Find the `\(\beta\)` which minimizes the mean squared error."

--

.emphasis[
Let's look at the population version, and let's forget about the linear model.
]

Suppose we think that there is __some__ function which relates `\(y\)` and `\(x\)`. Let's call this function `\(f\)` for the moment.

How do we estimate `\(f\)`? What is `\(f\)`?

$$
\newcommand{\E}{E}
\newcommand{\Expect}[1]{\E\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]}
\newcommand{\given}{\ \vert\ }
\newcommand{\X}{\mathbf{X}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
$$

---

## Minimizing MSE

Let's try to minimize the __expected__ squared error (MSE).

.emphasis[
Claim: `\(\mu(X) = \Expect{Y\ \vert\ X}\)` minimizes MSE. That is, for any `\(r(X)\)`, `\(\Expect{(Y - \mu(X))^2} \leq \Expect{(Y-r(X))^2}\)`.
]

--

Proof of Claim:

`\begin{aligned}
\Expect{(Y-r(X))^2}
&= \Expect{(Y- \mu(X) + \mu(X) - r(X))^2}\\
&= \Expect{(Y- \mu(X))^2} + \Expect{(\mu(X) - r(X))^2} + 2\Expect{(Y- \mu(X))(\mu(X) - r(X))}\\
&= \Expect{(Y- \mu(X))^2} + \Expect{(\mu(X) - r(X))^2} + 2\Expect{(\mu(X) - r(X))\Expect{Y- \mu(X) \given X}}\\
&= \Expect{(Y- \mu(X))^2} + \Expect{(\mu(X) - r(X))^2} + 0\\
&\geq \Expect{(Y- \mu(X))^2}
\end{aligned}`

The cross term vanishes because `\(\Expect{Y - \mu(X) \given X} = \mu(X) - \mu(X) = 0\)`.

---

## The regression function

We call this solution:

`$$\mu(X) = \Expect{Y \ \vert\ X}$$`

the regression function.

If we __assume__ that `\(\mu(x) = \Expect{Y \ \vert\ X=x} = x^\top \beta\)`, then we get back exactly OLS.

--

But why should we assume `\(\mu(x) = x^\top \beta\)`?

---

## The regression function

In mathematics: `\(\mu(x) = \Expect{Y \ \vert\ X=x}\)`.

In words: __Regression is really about estimating the mean.__

1. If `\(Y\sim \textrm{N}(\mu,\ 1)\)`, our best guess for a __new__ `\(Y\)` is `\(\mu\)`.

2. For regression, we let the mean `\((\mu)\)` __depend__ on `\(X\)`.

3. Think of `\(Y\sim \textrm{N}(\mu(X),\ 1)\)`, then conditional on `\(X=x\)`, our best guess for a __new__ `\(Y\)` is `\(\mu(x)\)` [whatever this function `\(\mu\)` is]

---

## Anything strange?

For any two variables `\(Y\)` and `\(X\)`, we can __always__ write

`$$Y \ \vert\ X = \mu(X) + \eta(X)$$`

such that `\(\Expect{\eta(X)}=0\)`.

--

* Suppose `\(\mu(X)=\mu_0\)` (constant in `\(X\)`). Are `\(Y\)` and `\(X\)` independent?

--

* Suppose `\(Y\)` and `\(X\)` are independent. Is `\(\mu(X)=\mu_0\)`?

--

* For more practice on this, see the "Fun Worksheet on Theory" in Module 1 on Canvas.

* In this course, I do not expect you to be able to create this math, but understanding and explaining it .secondary.hand[is] important.

---
class: inverse, center, middle

# Making predictions

---

## What do we mean by good predictions?

We make observations and then attempt to "predict" new, unobserved data.

Sometimes this is the same as estimating the mean.

Mostly, we observe `\((y_1,x_1),\ldots,(y_n,x_n)\)`, and we want some way to predict `\(Y\)` from `\(X\)`.
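---

## Checking the claim by simulation

Before we formalize "good predictions" with loss functions, here is a small numerical check of the claim that the conditional mean `\(\mu(X) = \Expect{Y \given X}\)` is the best squared-error predictor. This is only a sketch: the particular `\(\mu\)`, the competitor `\(r\)`, and the sample size are arbitrary choices for illustration.

```r
set.seed(406)
n <- 1e5
x <- runif(n, -2, 2)
mu <- function(x) sin(2 * x) + x^2 / 4   # the true regression function E[Y | X = x]
y <- mu(x) + rnorm(n)                    # Y has conditional mean mu(x), variance 1

r <- function(x) 0.5 + 0.3 * x           # some other way of predicting Y from X

mean((y - mu(x))^2)  # close to 1, the conditional variance of Y
mean((y - r(x))^2)   # larger, as the claim guarantees
```

No choice of `\(r\)` can do better than `\(\mu\)` on average; a linear predictor matches it only if `\(\mu(x) = x^\top \beta\)` actually holds.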
---

## Evaluating predictions

Of course, both `\(Y\)` and `\(\widehat{Y}\)` are __random__

I want to know how well I can predict __on average__

Let `\(\widehat{f}\)` be some way of making predictions `\(\widehat{Y}\)` of `\(Y\)` using covariates `\(X\)`

In fact, suppose I observe a dataset `\(\{(y_1,x_1),\ldots,(y_n,x_n)\}\)`. Then I want to __choose__ some `\(\widehat{f}\)` using the data.

Is `\(\widehat{f}\)` good on average?

---

## Evaluating predictions

Choose some __loss function__ that measures prediction quality. We predict `\(y\)` with `\(\widehat{y}\)`.

--

Examples:

* __Squared-error:__ `\(\ell(y,\widehat{y}) = (y-\widehat{y})^2\)`

* __Absolute-error:__ `\(\ell(y,\widehat{y}) = |y-\widehat{y}|\)`

* __Zero-One:__ `\(\ell(y,\widehat{y}) = I(y\neq\widehat{y})=\begin{cases} 0 & y=\widehat{y}\\1 & \mbox{else}\end{cases}\)`

--

Can be generalized to `\(y\)` in arbitrary spaces.

---

## Expected test MSE

For __regression__ applications, we will use squared-error loss:

`\(R_n(\widehat{f}) = \Expect{(Y-\widehat{f}(X))^2}\)`

--

I'm giving this a name, `\(R_n\)`, for ease. (This notation differs from the textbook.)

This is the __expected test MSE__.

---

## Example: Estimating the mean

Suppose we know that we want to predict a quantity `\(Y\)`, where `\(\Expect{Y}= \mu \in \mathbb{R}\)` and `\(\Var{Y} = 1\)`.

Our data is `\(\{y_1,\ldots,y_n\}\)`.

We want to estimate `\(\mu\)`.

---

## Estimating the mean

* Let `\(\widehat{Y}=\overline{Y}_n\)` be the sample mean.
* We can ask about the __estimation risk__ (since we're estimating `\(\mu\)`):

.pull-left[
`\begin{aligned}
R_n(\overline{Y}_n; \mu) &= \E[(\overline{Y}_n-\mu)^2] \\
&= \E[\overline{Y}_n^2] - 2\mu\E[\overline{Y}_n] + \mu^2 \\
&= \mu^2 + \frac{1}{n} - 2\mu^2 + \mu^2\\
&= \frac{1}{n}
\end{aligned}`
]

--

.pull-right[
**Useful trick**

For any `\(Z\)`, `\(\Var{Z} = \Expect{Z^2} - \Expect{Z}^2\)`.

Therefore: `\(\Expect{Z^2} = \Var{Z} + \Expect{Z}^2\)`.
]

---

## Predicting new Y's

* Let `\(\widehat{Y}=\overline{Y}_n\)` be the sample mean.
* What is the __prediction risk__ of `\(\overline{Y}_n\)`?

.pull-left[
`\begin{aligned}
R_n(\overline{Y}_n) &= \E[(\overline{Y}_n-Y)^2]\\
&= \E[\overline{Y}_n^2] - 2\E[\overline{Y}_n Y] + \E[Y^2] \\
&= \mu^2 + \frac{1}{n} - 2\mu^2 + \mu^2 + 1 \\
&= 1 + \frac{1}{n}
\end{aligned}`
]

--

.pull-right[
**Tricks:**

* Used the variance trick again.

* If `\(X\)` and `\(Z\)` are independent, then `\(\Expect{XZ} = \Expect{X}\Expect{Z}\)`.
]

---

## Predicting new Y's

* What is the prediction risk of guessing `\(Y=0\)`?

* You can probably guess that this is a stupid idea.

* Let's show why it's stupid.

`\begin{aligned}
R_n(0) &= \E[(0-Y)^2] = 1 + \mu^2
\end{aligned}`

---

## Predicting new Y's

What is the prediction risk of guessing `\(Y=\mu\)`?

This is a great idea, but we don't know `\(\mu\)`.

Let's see what happens anyway.

`\begin{aligned}
R_n(\mu) &= \E[(Y-\mu)^2] = 1
\end{aligned}`

---

## Estimating the mean

Prediction risk: `\(R_n(\overline{Y}_n) = 1 + \frac{1}{n}\)`

Estimation risk: `\(R_n(\overline{Y}_n;\mu) = \frac{1}{n}\)`

There is actually a nice interpretation here:

1. The common `\(1/n\)` term is `\(\Var{\overline{Y}_n}\)`.
2. The extra factor of `\(1\)` in the prediction risk is __irreducible error__.
    * `\(Y\)` is a random variable, and hence noisy.
    * We can never eliminate its intrinsic variance.
    * In other words, even if we knew `\(\mu\)`, we could never get closer than `\(1\)`, on average.

A quick simulation on the next slide checks both of these risks.

Intuitively, `\(\overline{Y}_n\)` is the obvious thing to do. But what about unintuitive things...
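---

## Checking the risks by simulation

A small Monte Carlo sketch of the calculations above. The data here are drawn as `\(\textrm{N}(\mu,\ 1)\)`, and the particular values of `\(n\)`, `\(\mu\)`, and the number of replications are arbitrary choices for illustration.

```r
set.seed(406)
n <- 20    # sample size (arbitrary)
mu <- 2    # true mean (arbitrary)
B <- 1e5   # Monte Carlo replications

ybar <- replicate(B, mean(rnorm(n, mean = mu)))  # B independent sample means
ynew <- rnorm(B, mean = mu)                      # B independent new Y's

mean((ybar - mu)^2)    # estimation risk: about 1/n    = 0.05
mean((ybar - ynew)^2)  # prediction risk: about 1 + 1/n = 1.05
mean((0 - ynew)^2)     # guessing 0: about 1 + mu^2     = 5
```

The gap between the two `\(\overline{Y}_n\)` risks is the irreducible error: the extra `\(1\)` remains even if we knew `\(\mu\)` exactly.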
---
class: inverse, center, middle

# Next time...

Trading bias and variance