class: center, middle, inverse, title-slide

.title[
# 03 The regression function
]
.author[
### STAT 406
]
.author[
### Daniel J. McDonald
]
.date[
### Last modified - 2022-09-19
]

---
class: middle, center
background-image: url("https://upload.wikimedia.org/wikipedia/commons/6/6d/Activemarker2.PNG")
background-size: cover

.secondary[.larger[Module]]

.huge-orange-number[1]

.secondary[.large[selecting models]]

---

## Mean squared error (MSE)

Last time...

.secondary[Ordinary Least Squares]

`$$\widehat\beta = \arg\min_\beta \sum_{i=1}^n ( y_i - x_i^\top \beta )^2.$$`

"Find the `\(\beta\)` which minimizes the sum of squared errors."

`$$\widehat\beta = \arg\min_\beta \frac{1}{n}\sum_{i=1}^n ( y_i - x_i^\top \beta )^2.$$`

"Find the `\(\beta\)` which minimizes the mean squared error."

--

.emphasis[
Let's look at the population version, and let's forget about the linear model.
]

Suppose we think that there is __some__ function which relates `\(y\)` and `\(x\)`. Let's call this function `\(f\)` for the moment.

How do we estimate `\(f\)`? What is `\(f\)`?

$$
\newcommand{\E}{E}
\newcommand{\Expect}[1]{\E\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]}
\newcommand{\given}{\ \vert\ }
\newcommand{\X}{\mathbf{X}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
$$

---

## Minimizing MSE

Let's try to minimize the __expected__ squared error (MSE).

.emphasis[
Claim: `\(\mu(X) = \Expect{Y\ \vert\ X}\)` minimizes MSE. That is, for any `\(r(X)\)`, `\(\Expect{(Y - \mu(X))^2} \leq \Expect{(Y-r(X))^2}\)`.
]

--

Proof of Claim:

`\begin{aligned}
\Expect{(Y-r(X))^2}
&= \Expect{(Y- \mu(X) + \mu(X) - r(X))^2}\\
&= \Expect{(Y- \mu(X))^2} + \Expect{(\mu(X) - r(X))^2} + 2\Expect{(Y- \mu(X))(\mu(X) - r(X))}\\
&= \Expect{(Y- \mu(X))^2} + \Expect{(\mu(X) - r(X))^2} + 2\Expect{(\mu(X) - r(X))\Expect{Y- \mu(X) \given X}}\\
&= \Expect{(Y- \mu(X))^2} + \Expect{(\mu(X) - r(X))^2} + 0\\
&\geq \Expect{(Y- \mu(X))^2}
\end{aligned}`

The cross term vanishes because `\(\Expect{Y - \mu(X) \given X} = \mu(X) - \mu(X) = 0\)`.

---

## The regression function

We call this solution:

`$$\mu(X) = \Expect{Y \ \vert\ X}$$`

the regression function.

If we __assume__ that `\(\mu(x) = \Expect{Y \ \vert\ X=x} = x^\top \beta\)`, then we get back exactly OLS.

--

But why should we assume `\(\mu(x) = x^\top \beta\)`?

---

## The regression function

In mathematics: `\(\mu(x) = \Expect{Y \ \vert\ X=x}\)`.

In words: __Regression is really about estimating the mean.__

1. If `\(Y\sim \textrm{N}(\mu,\ 1)\)`, our best guess for a __new__ `\(Y\)` is `\(\mu\)`.

2. For regression, we let the mean `\((\mu)\)` __depend__ on `\(X\)`.

3. Think of `\(Y\sim \textrm{N}(\mu(X),\ 1)\)`, then conditional on `\(X=x\)`, our best guess for a __new__ `\(Y\)` is `\(\mu(x)\)` [whatever this function `\(\mu\)` is]

---

## Anything strange?

For any two variables `\(Y\)` and `\(X\)`, we can __always__ write

`$$Y \ \vert\ X = \mu(X) + \eta(X)$$`

such that `\(\Expect{\eta(X)}=0\)`.

--

* Suppose `\(\mu(X)=\mu_0\)` (constant in `\(X\)`). Are `\(Y\)` and `\(X\)` independent?

--

* Suppose `\(Y\)` and `\(X\)` are independent. Is `\(\mu(X)=\mu_0\)`?

--

* For more practice on this, see the "Fun Worksheet on Theory" in Module 1 on Canvas.

* In this course, I do not expect you to be able to create this math, but understanding and explaining it .secondary.hand[is] important.

---
class: inverse, center, middle

# Making predictions

---

## What do we mean by good predictions?

We make observations and then attempt to "predict" new, unobserved data.

Sometimes this is the same as estimating the mean.

Mostly, we observe `\((y_1,x_1),\ldots,(y_n,x_n)\)`, and we want some way to predict `\(Y\)` from `\(X\)`.
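---

## Checking the claim by simulation

Before we formalize "good predictions" with loss functions, here is a small numerical check of the claim that the conditional mean `\(\mu(X) = \Expect{Y \given X}\)` is the best squared-error predictor. This is only a sketch: the particular `\(\mu\)`, the competitor `\(r\)`, and the sample size are arbitrary choices for illustration.

```r
set.seed(406)
n <- 1e5
x <- runif(n, -2, 2)
mu <- function(x) sin(2 * x) + x^2 / 4   # the true regression function E[Y | X = x]
y <- mu(x) + rnorm(n)                    # Y has conditional mean mu(x), variance 1

r <- function(x) 0.5 + 0.3 * x           # some other way of predicting Y from X

mean((y - mu(x))^2)  # close to 1, the conditional variance of Y
mean((y - r(x))^2)   # larger, as the claim guarantees
```

No choice of `\(r\)` can do better than `\(\mu\)` on average; a linear predictor matches it only if `\(\mu(x) = x^\top \beta\)` actually holds.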
---

## Evaluating predictions

Of course, both `\(Y\)` and `\(\widehat{Y}\)` are __random__

I want to know how well I can predict __on average__

Let `\(\widehat{f}\)` be some way of making predictions `\(\widehat{Y}\)` of `\(Y\)` using covariates `\(X\)`

In fact, suppose I observe a dataset `\(\{(y_1,x_1),\ldots,(y_n,x_n)\}\)`. Then I want to __choose__ some `\(\widehat{f}\)` using the data.

Is `\(\widehat{f}\)` good on average?

---

## Evaluating predictions

Choose some __loss function__ that measures prediction quality. We predict `\(y\)` with `\(\widehat{y}\)`.

--

Examples:

* __Squared-error:__ `\(\ell(y,\widehat{y}) = (y-\widehat{y})^2\)`

* __Absolute-error:__ `\(\ell(y,\widehat{y}) = |y-\widehat{y}|\)`

* __Zero-One:__ `\(\ell(y,\widehat{y}) = I(y\neq\widehat{y})=\begin{cases} 0 & y=\widehat{y}\\1 & \mbox{else}\end{cases}\)`

--

Can be generalized to `\(y\)` in arbitrary spaces.

---

## Expected test MSE

For __regression__ applications, we will use squared-error loss:

`\(R_n(\widehat{f}) = \Expect{(Y-\widehat{f}(X))^2}\)`

--

I'm giving this a name, `\(R_n\)`, for ease. (This notation differs from the textbook.)

This is the __expected test MSE__.

---

## Example: Estimating the mean

Suppose we know that we want to predict a quantity `\(Y\)`, where `\(\Expect{Y}= \mu \in \mathbb{R}\)` and `\(\Var{Y} = 1\)`.

Our data is `\(\{y_1,\ldots,y_n\}\)`.

We want to estimate `\(\mu\)`.

---

## Estimating the mean

* Let `\(\widehat{Y}=\overline{Y}_n\)` be the sample mean.
* We can ask about the __estimation risk__ (since we're estimating `\(\mu\)`):

.pull-left[
`\begin{aligned}
R_n(\overline{Y}_n; \mu) &= \E[(\overline{Y}_n-\mu)^2] \\
&= \E[\overline{Y}_n^2] - 2\mu\E[\overline{Y}_n] + \mu^2 \\
&= \mu^2 + \frac{1}{n} - 2\mu^2 + \mu^2\\
&= \frac{1}{n}
\end{aligned}`
]

--

.pull-right[
**Useful trick**

For any `\(Z\)`, `\(\Var{Z} = \Expect{Z^2} - \Expect{Z}^2\)`.

Therefore: `\(\Expect{Z^2} = \Var{Z} + \Expect{Z}^2\)`.
]

---

## Predicting new Y's

* Let `\(\widehat{Y}=\overline{Y}_n\)` be the sample mean.
* What is the __prediction risk__ of `\(\overline{Y}_n\)`?

.pull-left[
`\begin{aligned}
R_n(\overline{Y}_n) &= \E[(\overline{Y}_n-Y)^2]\\
&= \E[\overline{Y}_n^2] - 2\E[\overline{Y}_n Y] + \E[Y^2] \\
&= \mu^2 + \frac{1}{n} - 2\mu^2 + \mu^2 + 1 \\
&= 1 + \frac{1}{n}
\end{aligned}`
]

--

.pull-right[
**Tricks:**

* Used the variance trick again.

* If `\(X\)` and `\(Z\)` are independent, then `\(\Expect{XZ} = \Expect{X}\Expect{Z}\)`.
]

---

## Predicting new Y's

* What is the prediction risk of guessing `\(Y=0\)`?

* You can probably guess that this is a stupid idea.

* Let's show why it's stupid.

`\begin{aligned}
R_n(0) &= \E[(0-Y)^2] = 1 + \mu^2
\end{aligned}`

---

## Predicting new Y's

What is the prediction risk of guessing `\(Y=\mu\)`?

This is a great idea, but we don't know `\(\mu\)`.

Let's see what happens anyway.

`\begin{aligned}
R_n(\mu) &= \E[(Y-\mu)^2] = 1
\end{aligned}`

---

## Estimating the mean

Prediction risk: `\(R_n(\overline{Y}_n) = 1 + \frac{1}{n}\)`

Estimation risk: `\(R_n(\overline{Y}_n;\mu) = \frac{1}{n}\)`

There is actually a nice interpretation here:

1. The common `\(1/n\)` term is `\(\Var{\overline{Y}_n}\)`.
2. The extra factor of `\(1\)` in the prediction risk is __irreducible error__.
    * `\(Y\)` is a random variable, and hence noisy.
    * We can never eliminate its intrinsic variance.
    * In other words, even if we knew `\(\mu\)`, we could never get closer than `\(1\)`, on average.

A quick simulation on the next slide checks both of these risks.

Intuitively, `\(\overline{Y}_n\)` is the obvious thing to do. But what about unintuitive things...
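---

## Checking the risks by simulation

A small Monte Carlo sketch of the calculations above. The data here are drawn as `\(\textrm{N}(\mu,\ 1)\)`, and the particular values of `\(n\)`, `\(\mu\)`, and the number of replications are arbitrary choices for illustration.

```r
set.seed(406)
n <- 20    # sample size (arbitrary)
mu <- 2    # true mean (arbitrary)
B <- 1e5   # Monte Carlo replications

ybar <- replicate(B, mean(rnorm(n, mean = mu)))  # B independent sample means
ynew <- rnorm(B, mean = mu)                      # B independent new Y's

mean((ybar - mu)^2)    # estimation risk: about 1/n    = 0.05
mean((ybar - ynew)^2)  # prediction risk: about 1 + 1/n = 1.05
mean((0 - ynew)^2)     # guessing 0: about 1 + mu^2     = 5
```

The gap between the two `\(\overline{Y}_n\)` risks is the irreducible error: the extra `\(1\)` remains even if we knew `\(\mu\)` exactly.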
---
class: inverse, center, middle

# Next time...

Trading bias and variance