22 Neural nets - estimation

Stat 406

Daniel J. McDonald

Last modified – 16 November 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \]

Neural Network terms again (T hidden layers, regression)

\[ \begin{aligned} A_{k}^{(1)} &= g\left(\sum_{j=1}^p w^{(1)}_{k,j} x_j\right)\\ A_{\ell}^{(t)} &= g\left(\sum_{k=1}^{K_{t-1}} w^{(t)}_{\ell,k} A_{k}^{(t-1)} \right)\\ \hat{Y} &= z_m = \sum_{\ell=1}^{K_T} \beta_{m,\ell} A_{\ell}^{(T)}\ \ (M = 1) \end{aligned} \]

  • \(B \in \R^{M\times K_T}\).
  • \(M=1\) for regression
  • \(\mathbf{W}_t \in \R^{K_t\times K_{t-1}}\), \(t=1,\ldots,T\), with \(K_0 = p\) (see the sketch below)
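To make these shapes concrete, here is a small base R sketch (ours, not part of the lecture code) that draws random weights with exactly these dimensions and pushes one input vector through the network. The widths in K, the logistic choice of g, and all object names are illustrative.

g <- function(u) 1 / (1 + exp(-u))      # logistic activation (one possible g)
p <- 4                                  # number of predictors
K <- c(6, 5, 3)                         # K_1, K_2, K_3 (T = 3 hidden layers)
M <- 1                                  # regression output
set.seed(1)
W <- list(matrix(rnorm(K[1] * p), K[1], p),        # W_1 is K_1 x p
          matrix(rnorm(K[2] * K[1]), K[2], K[1]),  # W_2 is K_2 x K_1
          matrix(rnorm(K[3] * K[2]), K[3], K[2]))  # W_3 is K_3 x K_2
B <- matrix(rnorm(M * K[3]), M, K[3])              # B is M x K_T
x <- rnorm(p)
A <- g(W[[1]] %*% x)                    # A^(1)
for (t in 2:3) A <- g(W[[t]] %*% A)     # A^(t) = g(W_t A^(t-1))
yhat <- drop(B %*% A)                   # hat{Y} = B A^(T)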

Training neural networks. First, choices

  • Choose the architecture: how many layers, units per layer, what connections?

  • Choose the loss: common choices (for each data point \(i\))

Regression
\(\hat{R}_i = \frac{1}{2}(y_i - \hat{y}_i)^2\) (the 1/2 just makes the derivative nice)
Classification
\(\hat{R}_i = \sum_{m=1}^M I(y_i = m)\log( 1 + \exp(-z_{im}))\)
  • Choose the activation function \(g\)
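As a concrete R sketch of these two per-observation losses (the function names and toy inputs are ours, not the lecture's; for classification, z_i is the length-M vector of scores for observation i):

sq_loss <- function(y_i, yhat_i) 0.5 * (y_i - yhat_i)^2  # regression loss
xent_loss <- function(z_i, y_i) {                        # classification loss
  # y_i is the observed class label in 1..M
  sum((seq_along(z_i) == y_i) * log(1 + exp(-z_i)))
}
sq_loss(1.3, 0.9)                  # a single regression observation
xent_loss(c(2, -1, 0.5), y_i = 1)  # M = 3 classes, true class is 1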

Training neural networks (intuition)

  • We need to estimate \(B\), \(\mathbf{W}_t\), \(t=1,\ldots,T\)

  • We want to minimize \(\hat{R} = \sum_{i=1}^n \hat{R}_i\) as a function of all this.

  • We use gradient descent, but in this dialect, we call it back propagation

Derivatives via the chain rule: computed by a forward and backward sweep

All the \(g(u)\)’s that get used have \(g'(u)\) “nice”.

If \(g\) is ReLU:

  • \(g(u) = u\,I(u>0)\)
  • \(g'(u) = I(u>0)\)
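In R, this activation and its derivative are one line each (our own two-liner, not lecture code):

relu  <- function(u) u * (u > 0)        # g(u) = u I(u > 0)
drelu <- function(u) as.numeric(u > 0)  # g'(u) = I(u > 0)
relu(c(-1, 0, 2))   # 0 0 2
drelu(c(-1, 0, 2))  # 0 0 1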

Once we have derivatives from backprop,

\[ \begin{align} \widetilde{B} &\leftarrow B - \gamma \frac{\partial \widehat{R}}{\partial B}\\ \widetilde{\mathbf{W}_t} &\leftarrow \mathbf{W}_t - \gamma \frac{\partial \widehat{R}}{\partial \mathbf{W}_t} \end{align} \]

Chain rule

We want \(\frac{\partial}{\partial B} \hat{R}_i\) and \(\frac{\partial}{\partial W_{t}}\hat{R}_i\) for all \(t\).

Regression: \(\hat{R}_i = \frac{1}{2}(y_i - \hat{y}_i)^2\)

\[\begin{aligned} \frac{\partial\hat{R}_i}{\partial B} &= -(y_i - \hat{y}_i)\frac{\partial \hat{y}_i}{\partial B} =\underbrace{-(y_i - \hat{y}_i)}_{-r_i} \mathbf{A}^{(T)}\\ \frac{\partial}{\partial \mathbf{W}_T} \hat{R}_i &= -(y_i - \hat{y}_i)\frac{\partial\hat{y}_i}{\partial \mathbf{W}_T} = -r_i \frac{\partial \hat{y}_i}{\partial \mathbf{A}^{(T)}} \frac{\partial \mathbf{A}^{(T)}}{\partial \mathbf{W}_T}\\ &= -\left(r_i B \odot g'(\mathbf{W}_T \mathbf{A}^{(T-1)}) \right) \left(\mathbf{A}^{(T-1)}\right)^\top\\ \frac{\partial}{\partial \mathbf{W}_{T-1}} \hat{R}_i &= -(y_i - \hat{y}_i)\frac{\partial\hat{y}_i}{\partial \mathbf{W}_{T-1}} = -r_i \frac{\partial \hat{y}_i}{\partial \mathbf{A}^{(T)}} \frac{\partial \mathbf{A}^{(T)}}{\partial \mathbf{W}_{T-1}}\\ &= -r_i \frac{\partial \hat{y}_i}{\partial \mathbf{A}^{(T)}} \frac{\partial \mathbf{A}^{(T)}}{\partial \mathbf{A}^{(T-1)}}\frac{\partial \mathbf{A}^{(T-1)}}{\partial \mathbf{W}_{T-1}}\\ \cdots &= \cdots \end{aligned}\]
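These formulas are easy to sanity-check numerically. The snippet below is our own check (not from the lecture) for a single observation and one hidden layer with logistic \(g\); it compares the analytic \(\partial \hat{R}_i/\partial B\) above against centered finite differences, and the analogous check for \(\mathbf{W}_1\) works the same way. All names and toy values are ours.

set.seed(406)
p <- 3; K1 <- 4
g  <- function(u) 1 / (1 + exp(-u))           # logistic activation
gp <- function(u) g(u) * (1 - g(u))           # g'
W1 <- matrix(rnorm(K1 * p), K1, p)
B  <- rnorm(K1)                               # M = 1, so B is a length-K1 vector
x  <- rnorm(p); y <- 0.5
Rhat <- function(W1, B) {                     # per-observation loss
  0.5 * (y - sum(B * g(W1 %*% x)))^2
}
A1 <- g(W1 %*% x)                             # A^(T), here T = 1
r  <- y - sum(B * A1)
grad_B  <- -r * A1                            # -r_i A^(T)
grad_W1 <- (-r * B * gp(W1 %*% x)) %*% t(x)   # -(r_i B o g'(W_1 x)) x^T
num_grad_B <- sapply(seq_along(B), function(j) {
  e <- numeric(K1); e[j] <- 1e-6
  (Rhat(W1, B + e) - Rhat(W1, B - e)) / 2e-6  # centered difference
})
max(abs(grad_B - num_grad_B))                 # effectively 0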

Mapping it out

Given current \(\mathbf{W}_t, B\), we want to get new, \(\widetilde{\mathbf{W}}_t,\ \widetilde B\) for \(t=1,\ldots,T\)

  • Squared error for regression, cross-entropy for classification

Feed forward

\[\mathbf{A}^{(0)} = \mathbf{X} \in \R^{n\times p}\]

Repeat, \(t= 1,\ldots, T\)

  1. \(\mathbf{Z}_{t} = \mathbf{A}^{(t-1)}\mathbf{W}_t \in \R^{n\times K_t}\)
  2. \(\mathbf{A}^{(t)} = g(\mathbf{Z}_{t})\) (component wise)
  3. \(\dot{\mathbf{A}}^{(t)} = g'(\mathbf{Z}_t)\)

\[\begin{cases} \hat{\mathbf{y}} =\mathbf{A}^{(T)} B \in \R^n \\ \hat{\Pi} = \left(1 + \exp\left(-\mathbf{A}^{(T)}\mathbf{B}\right)\right)^{-1} \in \R^{n \times M}\end{cases}\]

Back propagate

\[-r = \begin{cases} -\left(\mathbf{y} - \widehat{\mathbf{y}}\right) \\ -\left(1 - \widehat{\Pi}\right)[y]\end{cases}\]

\[ \begin{aligned} \frac{\partial}{\partial \mathbf{B}} \widehat{R} &= -\left(\mathbf{A}^{(T)}\right)^\top \mathbf{r}\\ -\boldsymbol{\Gamma} &\leftarrow -\mathbf{r}\\ \mathbf{W}_{T+1} &\leftarrow \mathbf{B} \end{aligned} \]

Repeat, \(t = T,...,1\),

  1. \(-\boldsymbol{\Gamma} \leftarrow -\left(\boldsymbol{\Gamma} \mathbf{W}_{t+1}^\top\right) \odot\dot{\mathbf{A}}^{(t)}\)
  2. \(\frac{\partial \widehat{R}}{\partial \mathbf{W}_t} = -\left(\mathbf{A}^{(t-1)}\right)^\top \boldsymbol{\Gamma}\)
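Putting the two sweeps together, the sketch below is our own base R illustration of this recipe (not the lecture's code): regression with ReLU activations, two hidden layers, weights stored so that \(\mathbf{Z}_t = \mathbf{A}^{(t-1)}\mathbf{W}_t\), and \(B\) kept as the last element of the weight list so it plays the role of \(\mathbf{W}_{T+1}\). It finishes with one gradient descent update with step size \(\gamma\); all names and sizes are illustrative.

g  <- function(u) u * (u > 0)            # ReLU
gp <- function(u) (u > 0) * 1            # its derivative
set.seed(406)
n <- 50; p <- 2; K <- c(8, 4)            # two hidden layers
X <- matrix(rnorm(n * p), n, p)
y <- sin(X[, 1]) + rnorm(n, 0, .1)
dims <- c(p, K, 1)                       # K_0 = p, K_1, ..., K_T, M = 1
W <- lapply(seq_len(length(dims) - 1), function(t)
  matrix(rnorm(dims[t] * dims[t + 1], sd = .5), dims[t], dims[t + 1]))
Tt <- length(K)                          # W[[Tt + 1]] plays the role of B
gamma <- 0.005                           # step size

# Feed forward
A <- vector("list", Tt + 1); Adot <- vector("list", Tt)
A[[1]] <- X                              # A^(0) = X
for (t in 1:Tt) {
  Z <- A[[t]] %*% W[[t]]                 # Z_t = A^(t-1) W_t
  A[[t + 1]] <- g(Z)
  Adot[[t]] <- gp(Z)
}
yhat <- drop(A[[Tt + 1]] %*% W[[Tt + 1]])

# Back propagate
negGamma <- matrix(-(y - yhat), ncol = 1)               # -r
grads <- vector("list", Tt + 1)
grads[[Tt + 1]] <- t(A[[Tt + 1]]) %*% negGamma          # dR/dB
for (t in Tt:1) {
  negGamma <- (negGamma %*% t(W[[t + 1]])) * Adot[[t]]  # -Gamma <- -(Gamma W_{t+1}^T) o Adot^(t)
  grads[[t]] <- t(A[[t]]) %*% negGamma                  # dR/dW_t = -(A^(t-1))^T Gamma
}

# One gradient descent update
W <- Map(function(w, gr) w - gamma * gr, W, grads)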

Deep nets

Some comments on adding layers:

  • It has been shown that one hidden layer is sufficient to approximate any bounded piecewise continuous function

  • However, this may take a huge number of hidden units (i.e. \(K_1 \gg 1\)).

  • This is what people mean when they say that NNets are “universal approximators”

  • By including multiple layers, we can have fewer hidden units per layer.

  • Also, we can encode (in)dependencies that can speed computations

  • We don’t have to connect everything the way we have been

Simple example

library(tidyverse) # tibble, map, bind_cols
library(neuralnet)
n <- 200
df <- tibble(
  x = seq(.05, 1, length = n),
  y = sin(1 / x) + rnorm(n, 0, .1) # Doppler function
)
testdata <- matrix(seq(.05, 1, length.out = 1e3), ncol = 1)
nn_out <- neuralnet(y ~ x, data = df, hidden = c(10, 5, 15), threshold = 0.01, rep = 3)
nn_preds <- map(1:3, ~ compute(nn_out, testdata, .x)$net.result)
yhat <- nn_preds |> bind_cols() |> rowMeans() # average over the 3 runs
# This code will reproduce the analysis, takes some time
set.seed(406406406)
n <- 200
df <- tibble(
  x = seq(.05, 1, length = n),
  y = sin(1 / x) + rnorm(n, 0, .1) # Doppler function
)
testx <- matrix(seq(.05, 1, length.out = 1e3), ncol = 1)
library(tidyverse) # tibble, map_dbl
library(neuralnet)
library(splines)
fstar <- sin(1 / testx)
spline_test_err <- function(k) {
  fit <- lm(y ~ bs(x, df = k), data = df)
  yhat <- predict(fit, newdata = tibble(x = testx))
  mean((yhat - fstar)^2)
}
Ks <- 1:15 * 10
SplineErr <- map_dbl(Ks, ~ spline_test_err(.x))

Jgrid <- c(5, 10, 15)
NNerr <- double(length(Jgrid)^3)
NNplot <- character(length(Jgrid)^3)
sweep <- 0
for (J1 in Jgrid) {
  for (J2 in Jgrid) {
    for (J3 in Jgrid) {
      sweep <- sweep + 1
      NNplot[sweep] <- paste(J1, J2, J3, sep = " ")
      nn_out <- neuralnet(y ~ x, df,
        hidden = c(J1, J2, J3),
        threshold = 0.01, rep = 3
      )
      nn_results <- sapply(1:3, function(x) {
        compute(nn_out, testx, x)$net.result
      })
      # Run them through the neural network
      Yhat <- rowMeans(nn_results)
      NNerr[sweep] <- mean((Yhat - fstar)^2)
    }
  }
}

bestK <- Ks[which.min(SplineErr)]
bestspline <- predict(lm(y ~ bs(x, bestK), data = df), newdata = tibble(x = testx))
besthidden <- as.numeric(unlist(strsplit(NNplot[which.min(NNerr)], " ")))
nn_out <- neuralnet(y ~ x, df, hidden = besthidden, threshold = 0.01, rep = 3)
nn_results <- sapply(1:3, function(x) compute(nn_out, testx, x)$net.result)
# Run them through the neural network
bestnn <- rowMeans(nn_results)
plotd <- data.frame(
  x = testx, spline = bestspline, nnet = bestnn, truth = fstar
)
save.image(file = "data/nnet-example.Rdata")
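For a quick look at the comparison, one option is the plotting code below (ours, not the lecture's; it assumes the objects above, including the tidyverse, are still in memory):

plotd |>
  pivot_longer(-x, names_to = "method", values_to = "fit") |>
  ggplot(aes(x, fit, colour = method)) +
  geom_line() +
  labs(y = "estimate of f(x)", colour = "")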

Different architectures

Next time…

Other considerations