Stat 406
Geoff Pleiss, Trevor Campbell
Last modified – 13 November 2024
\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \]
Neural networks are models for supervised learning
Linear combinations of features are passed through a non-linear transformation in successive layers
At the top layer, the resulting latent factors are fed into an algorithm for predictions
(Most commonly via least squares or logistic loss)
Neural networks have come about in 3 “waves”
The first was an attempt in the 1950s to model the mechanics of the human brain
It appeared the brain worked by connecting neurons in networks, where each neuron interprets the status of the other neurons it is connected to
There weren’t really computers, so we couldn’t estimate these things
After the development of parallel, distributed computation in the 1980s, this “artificial intelligence” view was diminished
And neural networks gained popularity
But with the growing popularity of SVMs and boosting/bagging in the late 1990s, neural networks again fell out of favor
This was due to many of the problems we’ll discuss (non-convexity being the main one)
State-of-the-art performance on various classification tasks has been accomplished via neural networks
Today, Neural Networks/Deep Learning are the hottest…
Suppose \(Y \in \mathbb{R}\) and we are trying to estimate the regression function \[\Expect{Y\given X} = f_*(X)\]
In Module 2, we discussed basis expansion,
We know \(f_*(x) =\sum_{k=1}^\infty \beta_k \phi_k(x)\) for some basis \(\phi_1,\phi_2,\ldots\)
Truncate this expansion at \(K\): \(f_*^K(x) \approx \sum_{k=1}^K \beta_k \phi_k(x)\)
Estimate \(\beta_k\) with least squares (see the sketch below)
A weakness of this approach is that the basis is fixed ahead of time, independent of the data
An alternative would be to have the data tell us what kind of basis to use (Module 5)
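For concreteness, here is a minimal numpy sketch of the truncated fixed-basis approach above. The polynomial basis \(\phi_k(x) = x^k\) and the simulated data are illustrative assumptions, not choices from the lecture:

```python
import numpy as np

# Minimal sketch: truncated basis expansion with a simple polynomial basis
# phi_k(x) = x^k (illustrative choice; any fixed basis works the same way).
rng = np.random.default_rng(406)
n, K = 100, 5
x = np.sort(rng.uniform(-2, 2, size=n))
y = np.sin(2 * x) + rng.normal(scale=0.3, size=n)   # toy f_* plus noise

Phi = np.column_stack([x**k for k in range(1, K + 1)])  # n x K design matrix
beta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # least-squares estimate

f_hat = Phi @ beta_hat   # fitted values of the truncated expansion f_*^K
```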
A single layer neural network model is \[ \begin{aligned} f(x) &= \sum_{k=1}^K \beta_k h_k(x) \\ &= \sum_{k=1}^K \beta_k \ g(w_k^{\top}x)\\ &= \sum_{k=1}^K \beta_k \ A_k \end{aligned} \]
Compare: A nonparametric regression \[f(x) = \sum_{k=1}^K \beta_k {\phi_k(x)}\]
\[f(x) = \sum_{k=1}^{{K}} {\beta_k} {g( w_k^{\top}x)}\] The main components are the number of hidden units \(K\), the activation function \(g\), the weights \(w_k\), and the coefficients \(\beta_k\)
With intercept (bias) terms: \[f(x) = \beta_0 + \sum_{k=1}^{{K}} {\beta_k} {g(w_{k0} + w_k^{\top}x)}\]
\[f(x) = \sum_{k=1}^{{K}} {\beta_k} {g(w_k^{\top}x)}\]
Notes (no biases):
\(\beta \in \R^K\)
\(w_k \in \R^p,\ k = 1,\ldots,K\)
\(\mathbf{W} \in \R^{K\times p}\)
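A small numpy sketch of this single-layer model in matrix form, using the shapes just listed. Taking \(g\) to be the logistic sigmoid and using random weights are illustrative assumptions, not a prescription:

```python
import numpy as np

def sigmoid(z):
    # illustrative choice of activation g
    return 1.0 / (1.0 + np.exp(-z))

p, K = 10, 20
rng = np.random.default_rng(1)
W = rng.normal(size=(K, p))      # W in R^{K x p}, rows are w_k
beta = rng.normal(size=K)        # beta in R^K
x = rng.normal(size=p)           # a single input

A = sigmoid(W @ x)               # hidden units A_k = g(w_k^T x), shape (K,)
f_x = beta @ A                   # f(x) = sum_k beta_k A_k
```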
Some comments on adding layers:
It has been shown that one hidden layer is sufficient to approximate any bounded piecewise continuous function
However, this may take a huge number of hidden units (i.e. \(K_1 \gg 1\)).
This is what people mean when they say that NNets are “universal approximators”
By including multiple layers, we can have fewer hidden units per layer.
Also, we can encode (in)dependencies that can speed computations
We don’t have to connect everything the way we have been
\[ \begin{aligned} A_k^{(1)} &= g\left(\sum_{j=1}^p w^{(1)}_{k,j} x_j\right)\\ A_\ell^{(2)} &= g\left(\sum_{k=1}^{K_1} w^{(2)}_{\ell,k} A_k^{(1)} \right)\\ z_m &= \sum_{\ell=1}^{K_2} \beta_{m,\ell} A_\ell^{(2)}\\ f_m(x) &= \frac{1}{1 + \exp(-z_m)}\\ \end{aligned} \]
Predict class with largest probability \(\longrightarrow\ \widehat{Y} = \argmax_{m} f_m(x)\)
Notes:
\(B \in \R^{M\times K_2}\) (here \(M=10\)).
\(\mathbf{W}_2 \in \R^{K_2\times K_1}\)
\(\mathbf{W}_1 \in \R^{K_1\times p}\)
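A sketch of the forward pass for this two-hidden-layer classifier, matching the equations and shapes above with \(M = 10\). The sigmoid activation, the particular layer sizes, and the random weights are illustrative assumptions:

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))   # activation (sigmoid, for illustration)

p, K1, K2, M = 256, 64, 32, 10
rng = np.random.default_rng(2)
W1 = 0.1 * rng.normal(size=(K1, p))   # W_1 in R^{K1 x p}
W2 = 0.1 * rng.normal(size=(K2, K1))  # W_2 in R^{K2 x K1}
B  = 0.1 * rng.normal(size=(M, K2))   # B in R^{M x K2}

x = rng.normal(size=p)
A1 = g(W1 @ x)                    # first hidden layer, A^(1) in R^{K1}
A2 = g(W2 @ A1)                   # second hidden layer, A^(2) in R^{K2}
z  = B @ A2                       # z_m = sum_l B[m, l] * A2[l]
f  = 1.0 / (1.0 + np.exp(-z))     # f_m(x), one value per class
y_hat = int(np.argmax(f))         # predict the class with the largest f_m(x)
```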
We start with \(p\) covariates and we generate \(K\) features (1-layer)
Logistic / Least-squares with a polynomial transformation
\[ \begin{aligned} \Phi(x) &= (1, x_1, \ldots, x_p, x_1^2,\ldots,x_p^2,\ldots\\ & \quad \ldots, x_1x_2, \ldots, x_{p-1}x_p) \\ &= (\phi_1(x),\ldots,\phi_{K_2}(x))\\ f(x) &= \sum_{k=1}^{K_2} \beta_k \phi_k(x) = \beta^\top \Phi(x) \end{aligned} \]
Neural network
\[\begin{aligned} A_k &= g\left( \sum_{j=1}^p w_{kj}x_j\right) = g\left( w_{k}^{\top}x\right)\\ \Phi(x) &= (A_1,\ldots, A_K)^\top \in \mathbb{R}^{K}\\ f(x) &=\beta^{\top} \Phi(x)=\beta^\top A\\ &= \sum_{k=1}^K \beta_k g\left( \sum_{j=1}^p w_{kj}x_j\right)\end{aligned}\]
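To make the contrast concrete, here is a small numpy sketch of both feature maps. The pairwise-product polynomial basis and the sigmoid activation are illustrative choices, and the helper names (`poly_features`, `nn_features`) are hypothetical:

```python
import numpy as np
from itertools import combinations

def poly_features(x):
    # fixed feature map Phi(x): intercept, linear terms, squares, pairwise products
    pairs = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], x, x**2, np.array(pairs)])

def nn_features(x, W):
    # learned feature map Phi(x) = (A_1, ..., A_K); depends on the data through W
    return 1.0 / (1.0 + np.exp(-(W @ x)))

rng = np.random.default_rng(3)
p, K = 5, 8
x = rng.normal(size=p)
W = rng.normal(size=(K, p))

Phi_poly = poly_features(x)               # fixed basis, length grows with p
Phi_nn   = nn_features(x, W)              # K learned features

beta_poly = rng.normal(size=Phi_poly.shape[0])
beta_nn   = rng.normal(size=K)

f_poly = beta_poly @ Phi_poly             # f(x) = beta^T Phi(x), fixed basis
f_nn   = beta_nn @ Phi_nn                 # f(x) = beta^T A, learned basis
```

The key difference: the polynomial basis is chosen before seeing any data, while the neural network's features are parameterized by \(\mathbf{W}\), which is estimated from the data.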
How do we estimate these monsters?
UBC Stat 406 - 2024