class: center, middle, inverse, title-slide

.title[
# 13 GAMs and Trees
]
.author[
### STAT 406
]
.author[
### Daniel J. McDonald
]
.date[
### Last modified - 2022-10-03
]

---

## GAMs

Last time we discussed smoothing in multiple dimensions.

`$$\newcommand{\Expect}[1]{E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\R}{\mathbb{R}} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\X}{\mathbf{X}} \newcommand{\y}{\mathbf{y}}$$`

Here we introduce the concept of GAMs (__G__eneralized __A__dditive __M__odels)

The basic idea is to imagine that the response is the sum of some functions of the predictors:

`$$\Expect{Y \given X=x} = \beta_0 + f_1(x_{1})+\cdots+f_p(x_{p}).$$`

Note that OLS __is__ a GAM (take `\(f_j(x_{j})=\beta_j x_{j}\)`):

`$$\Expect{Y \given X=x} = \beta_0 + \beta_1 x_{1}+\cdots+\beta_p x_{p}.$$`

--

These work by estimating each `\(f_j\)` using basis expansions in predictor `\(j\)`

The algorithm for fitting these things is called "backfitting":

.emphasis[

1. Center `\(\y\)` and `\(\X\)`.
2. Hold `\(f_k\)` fixed for all `\(k\neq j\)`; compute the partial residuals `\(\y - \sum_{k\neq j} f_k(x_{k})\)` and smooth them against `\(x_{j}\)` (with your favorite smoother) to update `\(f_j\)`.
3. Repeat for `\(1\leq j\leq p\)`.
4. Repeat steps 2 and 3 until the estimated functions "stop moving" (iterate)
5. Return the results.

]

---

## Very small example

.pull-left[

```r
library(mgcv)
library(tibble)
set.seed(12345)
n <- 500
simple <- tibble(
  x1 = runif(n, 0, 2 * pi),
  x2 = runif(n),
  y = 5 + 2 * sin(x1) + 8 * sqrt(x2) + rnorm(n, sd = .25))
```

<img src="rmd_gfx/13-gams-trees/simple-data-1.svg" style="display: block; margin: auto;" />

]

--

.pull-right[

Smooth each coordinate independently

```r
ex_smooth <- gam(y ~ s(x1) + s(x2), data = simple)
# s(z) means "smooth" z (uses splines)
plot(ex_smooth, pages = 1, scale = 0, shade = TRUE,
     resid = TRUE, se = 2, las = 1)
```

]

---

## Wherefore GAMs?

.pull-left[

If `\(\Expect{Y \given X=x} = \beta_0 + f_1(x_{1})+\cdots+f_p(x_{p}),\)`

then `\(\textrm{MSE}(\hat f) = \frac{Cp}{n^{4/5}} + \sigma^2.\)`

* Exponent no longer depends on `\(p\)`. Converges faster. (if truth separates like this)

* You could also use the same methods to include "some" interactions like

`$$\begin{aligned}&\Expect{Y \given X=x}\\ &= \beta_0 + f_{12}(x_{1},\ x_{2})+f_3(x_3)+\cdots+f_p(x_{p}),\end{aligned}$$`

]

.pull-right[

Smooth two coordinates together

```r
ex_smooth2 <- gam(y ~ s(x1, x2), data = simple)
plot(ex_smooth2, scheme = 2, scale = 0, shade = TRUE,
     resid = TRUE, se = 2, las = 1)
```

<img src="rmd_gfx/13-gams-trees/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />

]

---

## Regression trees

Trees involve stratifying or segmenting the predictor space into a number of simple regions.

Trees are simple and useful for interpretation.

Basic trees are not great at prediction. Modern methods that use trees are much better (Module 4)

--

Regression trees estimate piece-wise constant functions

The slabs are axis-parallel rectangles `\(R_1,\ldots,R_K\)` based on `\(\X\)`

In each region, we average the `\(y_i\)`'s: `\(\hat\mu_1,\ldots,\hat\mu_K\)`

Minimize `\(\sum_{k=1}^K \sum_{i\,:\, x_i \in R_k} (y_i-\mu_k)^2\)` over `\(R_k,\mu_k\)` for `\(k\in \{1,\ldots,K\}\)`

--

This sounds more complicated than it is.

The minimization is performed __greedily__ (like forward stepwise regression).
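---

## A greedy split, by hand

To see the greedy idea concretely, here is a minimal sketch of a single split on a single predictor: try each candidate cut point, compute the total within-region sum of squares, and keep the best cut. The helper `best_split()` is purely illustrative (it is not how `tree()` works internally); real software searches over all predictors and then recurses on the two child regions.

```r
# One greedy split on a single predictor x (illustration only)
best_split <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (xs[-1] + xs[-length(xs)]) / 2        # midpoints between observed values
  rss <- sapply(cuts, function(cc) {
    left <- y[x <= cc]                          # region {x <= cut}
    right <- y[x > cc]                          # region {x > cut}
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(cut = cuts[which.min(rss)], rss = min(rss))
}
best_split(simple$x1, simple$y)                 # using the small example from before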
---

## Mobility data

```r
library(tree)     # tree(), prune.tree(), partition.tree()
library(maptree)  # draw.tree()
bigtree <- tree(Mobility ~ ., data = mob)
smalltree <- prune.tree(bigtree, k = .09)
draw.tree(smalltree, digits = 2)
```

<img src="rmd_gfx/13-gams-trees/small-tree-1.svg" style="display: block; margin: auto;" />

This is called the __dendrogram__

---

## Partition view

```r
mob$preds <- predict(smalltree)
par(mfrow = c(1, 2), mar = c(5, 3, 0, 0))
draw.tree(smalltree, digits = 2)
cols <- viridisLite::viridis(20, direction = -1)[cut(log(mob$Mobility), 20)]
plot(mob$Black, mob$Commute, pch = 19, cex = .4, bty = 'n', las = 1, col = cols,
     ylab = "Commute time", xlab = "% Black")
partition.tree(smalltree, add = TRUE, ordvars = c("Black", "Commute"))
```

<img src="rmd_gfx/13-gams-trees/partition-view-1.svg" style="display: block; margin: auto;" />

We predict all observations in a region with the same value.

`\(\bullet\)` The three regions correspond to the leaves of the tree.

---

<img src="rmd_gfx/13-gams-trees/big-tree-1.svg" style="display: block; margin: auto;" />

__Terminology__

We call each split or end point a node. Each terminal node is referred to as a leaf. The interior nodes lead to branches.

---

## Advantages and disadvantages of trees
* Trees are very easy to explain (much easier than even linear regression).

* Some people believe that decision trees mirror human decision-making.

* Trees can easily be displayed graphically no matter the dimension of the data.

* Trees can easily handle qualitative predictors without the need to create dummy variables.

* Trees aren't very good at prediction.

Full trees badly overfit, so we "prune" them using CV (a rough sketch of CV pruning is below).
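As a sketch only (this is not how the `k = .09` above was chosen), we could let cross-validation pick the tree size, reusing `bigtree` from the Mobility slide. `cv.tree()` in the __tree__ package cross-validates over the cost-complexity pruning sequence; we then prune to the size with the smallest CV deviance.

```r
set.seed(406)                                      # arbitrary seed, for reproducibility
cv_out <- cv.tree(bigtree, K = 10)                 # 10-fold CV over the pruning sequence
best_size <- cv_out$size[which.min(cv_out$dev)]    # tree size minimizing CV deviance
cv_tree <- prune.tree(bigtree, best = best_size)   # prune the full tree to that size
draw.tree(cv_tree, digits = 2)
```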
--

.hand[
We'll talk more about trees next module for Classification.
]

---

class: middle
background-image: url("gfx/proforhobo.png")
background-size: cover

.pull-left[
.hand[.secondary[.larger[Next time...]]]

.hand[.secondary[.larger[Module]]] .huge-orange-number[3]
]

.pull-right[.center[
.fourth-color[.hand[.large[classification: Professor or Hobo? You decide.]]]

.small[.fourth-color[
Source: proforhobo.com
]]
]]