Stat 406

Geoff Pleiss, Trevor Campbell

Last modified – 02 November 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \]

Neural networks are models for supervised learning

Linear combinations of features are passed through a non-linear transformation in successive layers

At the top layer, the resulting latent factors are fed into an algorithm for predictions

(Most commonly via least squares or logistic loss)

Neural networks have come about in 3 “waves”

The first was an attempt in the 1950s to model the mechanics of the human brain

It appeared the brain worked by

- taking atomic units known as neurons, which can be “on” or “off”
- putting them in networks

A neuron itself interprets the status of other neurons

There weren’t really computers, so we couldn’t estimate these things

After the development of parallel, distributed computation in the 1980s, this “artificial intelligence” view was diminished

And neural networks gained popularity

But with the growing popularity of SVMs and boosting/bagging in the late 1990s, neural networks again fell out of favor

This was due to many of the problems we’ll discuss (non-convexity being the main one)

State-of-the-art performance on various classification tasks has been accomplished via neural networks

Today, Neural Networks/Deep Learning are the hottest…

Suppose \(Y \in \mathbb{R}\) and we are trying to estimate the regression function \[\Expect{Y\given X} = f_*(X)\]

In Module 2, we discussed basis expansion.

We know \(f_*(x) =\sum_{k=1}^\infty \beta_k \phi_k(x)\) for some basis \(\phi_1,\phi_2,\ldots\)

Truncate this expansion at \(K\): \(f_*(x) \approx f_*^K(x) = \sum_{k=1}^K \beta_k \phi_k(x)\)

Estimate \(\beta_k\) with least squares
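A minimal sketch of the truncate-then-least-squares recipe above. Assumptions not in the slides: a cosine basis on \([0,1]\), noise-free toy data, and the helper names `cosine_basis`, `solve`, and `fit_basis_expansion`, which are made up for illustration.

```python
import math

def cosine_basis(x, K):
    """Return (phi_1(x), ..., phi_K(x)); phi_1 is the constant function.

    The cosine basis on [0, 1] is an assumption made for this sketch.
    """
    return [1.0 if k == 0 else math.sqrt(2) * math.cos(k * math.pi * x)
            for k in range(K)]

def solve(A, b):
    """Solve the square system A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def fit_basis_expansion(xs, ys, K):
    """Least-squares estimate of beta_1, ..., beta_K via the normal equations."""
    Phi = [cosine_basis(x, K) for x in xs]
    gram = [[sum(row[a] * row[c] for row in Phi) for c in range(K)] for a in range(K)]
    rhs = [sum(row[a] * y for row, y in zip(Phi, ys)) for a in range(K)]
    return solve(gram, rhs)

# Toy data generated from a function that lies exactly in the span of the first 3 basis functions
xs = [i / 49 for i in range(50)]
true_beta = [1.0, -2.0, 0.5]
ys = [sum(b_k * phi_k for b_k, phi_k in zip(true_beta, cosine_basis(x, 3))) for x in xs]
beta_hat = fit_basis_expansion(xs, ys, K=3)
```

Because the toy \(f_*\) is exactly a combination of the first three basis functions, `beta_hat` should match `true_beta` up to rounding. With a basis that "disagrees" with \(f_*\), a much larger \(K\) would be needed.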

The weaknesses of this approach are:

- The basis is fixed and independent of the data
- If \(p\) is large, then nonparametrics doesn’t work well at all (recall the Curse of Dimensionality)
- If the basis doesn’t “agree” with \(f_*\), then \(K\) will have to be large to capture the structure
- What if parts of \(f_*\) have substantially different structure? Say \(f_*(x)\) is really wiggly for \(x \in [-1,3]\) but smooth elsewhere

An alternative would be to have the data tell us what kind of basis to use (Module 5)

A single-layer neural network model is \[ \begin{aligned} f(x) &= \sum_{k=1}^K \beta_k h_k(x) \\ &= \sum_{k=1}^K \beta_k \ g(w_k^{\top}x)\\ &= \sum_{k=1}^K \beta_k \ A_k \end{aligned} \]

Compare: A nonparametric regression \[f(x) = \sum_{k=1}^K \beta_k {\phi_k(x)}\]

\[f(x) = \sum_{k=1}^{{K}} {\beta_k} {g( w_k^{\top}x)}\] The main components are

- The derived features \({A_k = g(w_k^{\top}x)}\) are called the hidden units or activations
- The function \(g\) is called the activation function (more on this later)
- The parameters \({\beta_k},{w_k}\) are estimated from the data for all \(k = 1,\ldots, K\).
- The number of hidden units \({K}\) is a tuning parameter
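The components listed above can be sketched as a forward pass. This only evaluates \(f(x)\) for given parameters; it does not estimate them. The logistic activation and the specific weights are assumptions chosen for illustration.

```python
import math

def logistic(u):
    """One possible activation function g (an assumption for this sketch)."""
    return 1.0 / (1.0 + math.exp(-u))

def hidden_units(x, W, g):
    """A_k = g(w_k^T x) for each row w_k of W (K rows, p columns)."""
    return [g(sum(w_kj * x_j for w_kj, x_j in zip(w_k, x))) for w_k in W]

def single_layer_nn(x, W, beta, g=logistic):
    """f(x) = sum_k beta_k * g(w_k^T x), with no biases."""
    A = hidden_units(x, W, g)
    return sum(b_k * a_k for b_k, a_k in zip(beta, A))

# Hypothetical parameters: p = 2 inputs, K = 3 hidden units
W = [[1.0, -1.0],
     [0.0, 0.0],
     [2.0, 1.0]]
beta = [0.5, 2.0, -1.0]
x = [1.0, 1.0]

A = hidden_units(x, W, logistic)  # the K derived features
f_x = single_layer_nn(x, W, beta)
```

Here the first two hidden units see \(w_k^\top x = 0\), so they both equal \(g(0) = 0.5\); the third sees \(w_3^\top x = 3\).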

\[f(x) = \beta_0 + \sum_{k=1}^{{K}} {\beta_k} {g(w_{k0} + w_k^{\top}x)}\]

- Could add \(\beta_0\) and \(w_{k0}\), called biases (I’m going to ignore them; each is just an intercept)

\[f(x) = \sum_{k=1}^{{K}} {\beta_k} {g(w_k^{\top}x)}\]

Notes (no biases):

\(\beta \in \R^K\)

\(w_k \in \R^p,\ k = 1,\ldots,K\)

\(\mathbf{W} \in \R^{K\times p}\)

\[ \begin{aligned} A_k^{(1)} &= g\left(\sum_{j=1}^p w^{(1)}_{k,j} x_j\right)\\ A_\ell^{(2)} &= g\left(\sum_{k=1}^{K_1} w^{(2)}_{\ell,k} A_k^{(1)} \right)\\ z_m &= \sum_{\ell=1}^{K_2} \beta_{m,\ell} A_\ell^{(2)}\\ f_m(x) &= \frac{1}{1 + \exp(-z_m)}\\ \end{aligned} \]

Predict class with largest probability \(\longrightarrow\ \widehat{Y} = \argmax_{m} f_m(x)\)

Notes:

\(B \in \R^{M\times K_2}\) (here \(M=10\)).

\(\mathbf{W}_2 \in \R^{K_2\times K_1}\)

\(\mathbf{W}_1 \in \R^{K_1\times p}\)
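A rough sketch of the two-layer forward pass and the argmax prediction rule above. Assumptions: ReLU for \(g\), a small toy problem with \(M = 3\) classes instead of \(M = 10\), and made-up weights. Classes are indexed from 0 in the code.

```python
import math

def relu(u):
    return max(0.0, u)

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

def layer(inputs, weights, g):
    """Apply g to each linear combination: one output per row of `weights`."""
    return [g(sum(w * a for w, a in zip(row, inputs))) for row in weights]

def two_layer_nn(x, W1, W2, B, g=relu):
    """Return (f_1(x), ..., f_M(x)) for a two-hidden-layer network without biases."""
    A1 = layer(x, W1, g)              # K_1 activations
    A2 = layer(A1, W2, g)             # K_2 activations
    z = layer(A2, B, lambda u: u)     # M linear scores z_m
    return [logistic(z_m) for z_m in z]

def predict_class(f):
    """Index m with the largest f_m(x)."""
    return max(range(len(f)), key=lambda m: f[m])

# Hypothetical weights: p = 2, K_1 = 2, K_2 = 2, M = 3
W1 = [[1.0, 0.0], [0.0, 1.0]]                  # K_1 x p
W2 = [[1.0, 1.0], [1.0, -1.0]]                 # K_2 x K_1
B = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]     # M x K_2
x = [2.0, 1.0]

f = two_layer_nn(x, W1, W2, B)
y_hat = predict_class(f)
```

For this input the scores are \(z = (3, 1, -4)\), so the first class (index 0) has the largest \(f_m(x)\).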

- The \(g\) function generates a feature map

We start with \(p\) covariates and we generate \(K\) features (1-layer)

Logistic / Least-squares with a polynomial transformation

\[ \begin{aligned} &\Phi(x) \\ & = (1, x_1, \ldots, x_p, x_1^2,\ldots,x_p^2,\ldots\\ & \quad \ldots x_1x_2, \ldots, x_{p-1}x_p) \\ & = (\phi_1(x),\ldots,\phi_{K_2}(x))\\ f(x) &= \sum_{k=1}^{K_2} \beta_k \phi_k(x) = \beta^\top \Phi(x) \end{aligned} \]
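A small sketch of the fixed polynomial feature map, assuming it stops at degree 2 (intercept, linear terms, squares, and pairwise products). The function names are hypothetical.

```python
from itertools import combinations

def poly2_features(x):
    """Phi(x) = (1, x_1..x_p, x_1^2..x_p^2, x_1 x_2, ..., x_{p-1} x_p)."""
    linear = list(x)
    squares = [x_j ** 2 for x_j in x]
    cross = [x_i * x_j for x_i, x_j in combinations(x, 2)]
    return [1.0] + linear + squares + cross

def linear_in_features(x, beta):
    """f(x) = beta^T Phi(x): linear in the features, not in x."""
    phi = poly2_features(x)
    return sum(b_k * phi_k for b_k, phi_k in zip(beta, phi))

x = [2.0, 3.0, 5.0]          # p = 3
phi = poly2_features(x)
n_features = len(phi)        # 1 + 2p + p(p-1)/2 under the degree-2 assumption
```

The map is chosen before seeing any data, so the number of features grows quadratically in \(p\) whether or not the extra terms help.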

Neural network

\[\begin{aligned} A_k &= g\left( \sum_{j=1}^p w_{kj}x_j\right) = g\left( w_{k}^{\top}x\right)\\ \Phi(x) &= (A_1,\ldots, A_K)^\top \in \mathbb{R}^{K}\\ f(x) &=\beta^{\top} \Phi(x)=\beta^\top A\\ &= \sum_{k=1}^K \beta_k g\left( \sum_{j=1}^p w_{kj}x_j\right)\end{aligned}\]

- If \(g(u) = u\) (or \(= 3u\)), then neural networks reduce to (massively underdetermined) ordinary least squares (try to show this)

- ReLU is the current fashion (used to be tanh or logistic)
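A numerical check of the two points above, as a sketch with made-up weights: with the identity activation the network output coincides with a plain linear function of \(x\), while ReLU gives something different.

```python
import math

def relu(u):
    return max(0.0, u)

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

def identity(u):
    return u

def single_layer_nn(x, W, beta, g):
    """f(x) = sum_k beta_k * g(w_k^T x), no biases."""
    return sum(b_k * g(sum(w_kj * x_j for w_kj, x_j in zip(w_k, x)))
               for b_k, w_k in zip(beta, W))

# Hypothetical parameters: p = 2, K = 3
W = [[1.0, -1.0], [0.5, 2.0], [-3.0, 1.0]]
beta = [2.0, -1.0, 1.0]
x = [1.5, -2.0]

# With g(u) = u: f(x) = sum_k beta_k w_k^T x = (W^T beta)^T x, a linear model in x
theta = [sum(b_k * w_k[j] for b_k, w_k in zip(beta, W)) for j in range(len(x))]
linear_fit = sum(t_j * x_j for t_j, x_j in zip(theta, x))

nn_identity = single_layer_nn(x, W, beta, identity)
nn_relu = single_layer_nn(x, W, beta, relu)
nn_tanh = single_layer_nn(x, W, beta, math.tanh)
nn_logistic = single_layer_nn(x, W, beta, logistic)
```

In the identity case, the \(Kp + K\) parameters in \(\mathbf{W}\) and \(\beta\) only enter through the \(p\)-vector \(\mathbf{W}^\top \beta\), which is one way to see why the problem is underdetermined.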

How do we estimate these monsters?

UBC Stat 406 - 2024