21 Neural nets

Stat 406

Daniel J. McDonald

Last modified – 02 November 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \]

Overview

Neural networks are models for supervised learning

Linear combinations of features are passed through a non-linear transformation in successive layers

At the top layer, the resulting latent factors are fed into an algorithm for predictions

(Most commonly via least squares or logistic loss)

Background

Neural networks have come about in 3 “waves”

The first was an attempt in the 1950s to model the mechanics of the human brain

It appeared the brain worked by

  • taking atomic units known as neurons, which can be “on” or “off”
  • putting them in networks

A neuron itself interprets the status of other neurons

There weren’t really computers, so we couldn’t estimate these things

Background

After the development of parallel, distributed computation in the 1980s, this “artificial intelligence” view was diminished

And neural networks gained popularity

But, with the growing popularity of SVMs and boosting/bagging in the late 1990s, neural networks again fell out of favor

This was due to many of the problems we’ll discuss (non-convexity being the main one)

More recently, neural networks have again achieved state-of-the-art performance on a wide range of classification tasks

Today, Neural Networks/Deep Learning are the hottest…

High level overview

Recall nonparametric regression

Suppose \(Y \in \mathbb{R}\) and we are trying to estimate the regression function \[\Expect{Y\given X} = f_*(X)\]

In Module 2, we discussed basis expansion,

  1. We know \(f_*(x) = \sum_{k=1}^\infty \beta_k \phi_k(x)\) for some basis \(\phi_1,\phi_2,\ldots\)

  2. Truncate this expansion at \(K\): \(f_*(x) \approx f_*^K(x) = \sum_{k=1}^K \beta_k \phi_k(x)\)

  3. Estimate the \(\beta_k\) with least squares (a sketch of these steps follows below)
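
As a concrete illustration of these three steps, here is a minimal numpy sketch (not part of the original slides; the simulated data and the polynomial basis \(\phi_k(x) = x^{k-1}\) are assumptions chosen only for simplicity):

    import numpy as np

    rng = np.random.default_rng(406)
    n, K = 100, 6                                        # sample size, truncation level

    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(scale=0.2, size=n)    # stand-in for f_*(x) plus noise

    # Steps 1-2: fixed basis phi_k(x) = x^(k-1), truncated at K
    Phi = np.vander(x, N=K, increasing=True)             # n x K design matrix

    # Step 3: estimate the beta_k with least squares
    beta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)

    def f_hat(x0):
        return np.vander(np.atleast_1d(x0), N=K, increasing=True) @ beta_hat

The weaknesses listed next are already visible here: the columns of Phi are fixed before we see the data, so if this basis doesn't match \(f_*\) well, \(K\) has to grow.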

Recall nonparametric regression

The weaknesses of this approach are:

  • The basis is fixed and independent of the data
  • If \(p\) is large, then nonparametrics doesn’t work well at all (recall the Curse of Dimensionality)
  • If the basis doesn’t “agree” with \(f_*\), then \(K\) will have to be large to capture the structure
  • What if parts of \(f_*\) have substantially different structure? Say \(f_*(x)\) is really wiggly for \(x \in [-1,3]\) but smooth elsewhere

An alternative would be to have the data tell us what kind of basis to use (Module 5)

1-layer for Regression

A single layer neural network model is \[ \begin{aligned} f(x) &= \sum_{k=1}^K \beta_k h_k(x) \\ &= \sum_{k=1}^K \beta_k \, g(w_k^{\top}x)\\ &= \sum_{k=1}^K \beta_k \, A_k \end{aligned} \]

Compare: A nonparametric regression \[f(x) = \sum_{k=1}^K \beta_k {\phi_k(x)}\]

Terminology

\[f(x) = \sum_{k=1}^{{K}} {\beta_k} {g( w_k^{\top}x)}\] The main components are

  • The derived features \({A_k = g(w_k^{\top}x)}\) are called the hidden units or activations
  • The function \(g\) is called the activation function (more on this later)
  • The parameters \({\beta_k},{w_k}\) are estimated from the data for all \(k = 1,\ldots, K\).
  • The number of hidden units \({K}\) is a tuning parameter

\[f(x) = \beta_0 + \sum_{k=1}^{K} {\beta_k} {g(w_{k0} + w_k^{\top}x)}\]

  • We could add \(\beta_0\) and \(w_{k0}\), called biases (I’m going to ignore them; each is just an intercept)

Terminology

\[f(x) = \sum_{k=1}^{{K}} {\beta_k} {g(w_k^{\top}x)}\]

Notes (no biases):


\(\beta \in \R^K\)

\(w_k \in \R^p,\ k = 1,\ldots,K\)

\(\mathbf{W} \in \R^{K\times p}\)
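
A minimal numpy sketch of this single-layer map with the shapes above (the choice \(g = \tanh\), the dimensions, and the random parameter values are placeholders; in practice \(\beta\) and \(\mathbf{W}\) are estimated from data):

    import numpy as np

    p, K = 4, 10                        # number of inputs, number of hidden units
    rng = np.random.default_rng(1)

    W = rng.normal(size=(K, p))         # W in R^{K x p}; row k is w_k
    beta = rng.normal(size=K)           # beta in R^K
    g = np.tanh                         # one possible activation function

    def f(x):
        # f(x) = sum_k beta_k g(w_k^T x), no biases
        A = g(W @ x)                    # hidden units / activations, in R^K
        return beta @ A                 # scalar prediction

    print(f(rng.normal(size=p)))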

What about classification (10 classes, 2 layers)

\[ \begin{aligned} A_k^{(1)} &= g\left(\sum_{j=1}^p w^{(1)}_{k,j} x_j\right)\\ A_\ell^{(2)} &= g\left(\sum_{k=1}^{K_1} w^{(2)}_{\ell,k} A_k^{(1)} \right)\\ z_m &= \sum_{\ell=1}^{K_2} \beta_{m,\ell} A_\ell^{(2)}\\ f_m(x) &= \frac{1}{1 + \exp(-z_m)}\\ \end{aligned} \]

Predict class with largest probability \(\longrightarrow\ \widehat{Y} = \argmax_{m} f_m(x)\)

What about classification (10 classes, 2 layers)

Notes:

\(B \in \R^{M\times K_2}\) (here \(M=10\)).

\(\mathbf{W}_2 \in \R^{K_2\times K_1}\)

\(\mathbf{W}_1 \in \R^{K_1\times p}\)
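
A minimal numpy sketch of this two-layer classifier, following the four equations above; the layer sizes, the ReLU choice for \(g\), and the random weights are assumptions for illustration (everything would be estimated from data):

    import numpy as np

    p, K1, K2, M = 20, 32, 16, 10          # assumed sizes; M = 10 classes
    rng = np.random.default_rng(2)

    W1 = rng.normal(size=(K1, p))          # W_1 in R^{K1 x p}
    W2 = rng.normal(size=(K2, K1))         # W_2 in R^{K2 x K1}
    B = rng.normal(size=(M, K2))           # B in R^{M x K2}

    g = lambda u: np.maximum(u, 0)         # ReLU activation

    def predict(x):
        A1 = g(W1 @ x)                     # A^{(1)} in R^{K1}
        A2 = g(W2 @ A1)                    # A^{(2)} in R^{K2}
        z = B @ A2                         # z_m, m = 1, ..., M
        f = 1 / (1 + np.exp(-z))           # f_m(x) = sigmoid(z_m)
        return np.argmax(f)                # Y-hat = argmax_m f_m(x)

    print(predict(rng.normal(size=p)))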

Two observations

  1. The \(g\) function generates a feature map

We start with \(p\) covariates and we generate \(K\) features (1-layer)

Logistic / Least-squares with a polynomial transformation

\[ \begin{aligned} \Phi(x) &= (1,\ x_1, \ldots, x_p,\ x_1^2,\ldots,x_p^2,\ \ldots,\ x_1x_2, \ldots, x_{p-1}x_p) \\ &= (\phi_1(x),\ldots,\phi_{K_2}(x))\\ f(x) &= \sum_{k=1}^{K_2} \beta_k \phi_k(x) = \beta^\top \Phi(x) \end{aligned} \]

Neural network

\[\begin{aligned} A_k &= g\left( \sum_{j=1}^p w_{kj}x_j\right) = g\left( w_{k}^{\top}x\right)\\ \Phi(x) &= (A_1,\ldots, A_K)^\top \in \mathbb{R}^{K}\\ f(x) &=\beta^{\top} \Phi(x)=\beta^\top A\\ &= \sum_{k=1}^K \beta_k g\left( \sum_{j=1}^p w_{kj}x_j\right)\end{aligned}\]
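
To make the contrast concrete, here is a small sketch (an added illustration; p, K, and the random \(\mathbf{W}\) are arbitrary) comparing the two feature maps. The degree-2 polynomial map is fixed and its length \(1 + 2p + p(p-1)/2\) is dictated by \(p\), while the network produces however many features \(K\) we choose, with the \(w_k\) learned from data rather than fixed in advance.

    import numpy as np
    from itertools import combinations

    p, K = 10, 25
    rng = np.random.default_rng(3)
    x = rng.normal(size=p)

    # Fixed degree-2 polynomial feature map: intercept, linear, squares, interactions
    Phi_poly = np.concatenate((
        [1.0], x, x**2,
        [x[i] * x[j] for i, j in combinations(range(p), 2)],
    ))

    # Neural-network feature map: K derived features A_k = g(w_k^T x)
    W = rng.normal(size=(K, p))
    Phi_nn = np.tanh(W @ x)

    print(Phi_poly.size, Phi_nn.size)      # 66 vs 25 here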

Two observations

  2. If \(g(u) = u\) (or \(g(u) = 3u\)), then the neural network reduces to (massively underdetermined) ordinary least squares (try to show this)
  • ReLU is the current fashion for \(g\) (it used to be tanh or the logistic); see the sketch below
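
A small sketch of the common activation choices, plus a numerical check of the point in observation 2 (added for illustration; the dimensions and random parameters are arbitrary): with the identity activation, the single-layer model collapses to a plain linear function of \(x\), so its \(Kp + K\) parameters buy nothing beyond \(p\) linear coefficients.

    import numpy as np

    relu = lambda u: np.maximum(u, 0)
    logistic = lambda u: 1 / (1 + np.exp(-u))
    tanh = np.tanh
    identity = lambda u: u                  # g(u) = u

    p, K = 5, 50
    rng = np.random.default_rng(4)
    W = rng.normal(size=(K, p))
    beta = rng.normal(size=K)
    x = rng.normal(size=p)

    # With g(u) = u: f(x) = sum_k beta_k w_k^T x = (W^T beta)^T x,
    # a linear model with only p effective parameters.
    theta = W.T @ beta
    print(np.allclose(beta @ identity(W @ x), theta @ x))   # True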

Next time…

How do we estimate these monsters?