23 Neural nets - other considerations

Stat 406

Daniel J. McDonald

Last modified – 12 October 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \]

Estimation procedures (training)

Back-propagation

Advantages:

  • Its updates only depend on local information, in the sense that if objects in the hierarchical model are unrelated to each other, the updates aren’t affected

    (This helps in many ways, most notably in parallel architectures)

  • It doesn’t require second-derivative information

  • As the updates are only in terms of \(\hat{R}_i\), the algorithm can be run in either batch or online mode

Down sides:

  • It can be very slow

  • Need to choose the learning rate \(\gamma_t\) (see the sketch below)
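
As a reminder of what a single update looks like, here is a minimal sketch in R (hypothetical names, not the course's implementation): one gradient step on a generic objective with learning rate \(\gamma_t\).

# one gradient-descent update: new parameters = old parameters - gamma_t * gradient
gd_step <- function(theta, grad_fn, gamma_t) {
  theta - gamma_t * grad_fn(theta)
}
# e.g., for least squares with (hypothetical) design X and response y:
# grad_fn <- function(th) -2 * crossprod(X, y - X %*% th) / length(y)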

Other algorithms

There are many variations on the fitting algorithm

Stochastic gradient descent (SGD): discussed in the optimization lecture

The rest are variations that use lots of tricks

  • RMSprop
  • Adam
  • Adadelta
  • Adagrad
  • Adamax
  • Nadam
  • Ftrl

Regularizing neural networks

Because they have so many parameters, NNets can almost always achieve 0 training error, even with regularization.

Flavours:

  • a complexity penalization term \(\longrightarrow\) solve \(\min \hat{R} + \rho(\alpha,\beta)\)
  • early stopping on the back propagation algorithm used for fitting
Weight decay
This is like ridge regression, in that we penalize the squared Euclidean norm of the weights: \(\rho(\mathbf{W},\mathbf{B}) = \sum w_i^2 + \sum b_i^2\).
Weight elimination
This encourages more shrinking of small weights: \(\rho(\mathbf{W},\mathbf{B}) = \sum \frac{w_i^2}{1+w_i^2} + \sum \frac{b_i^2}{1 + b_i^2}\). Alternatively, use a Lasso-type (\(\ell_1\)) penalty.
Dropout
In each epoch, randomly choose \(z\%\) of the nodes and set the corresponding weights to zero.
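
As a concrete illustration (a sketch with hypothetical weight and bias vectors W and B), the two penalties and a dropout mask look like this in R:

W <- rnorm(100); B <- rnorm(10)                       # hypothetical weights and biases
weight_decay <- sum(W^2) + sum(B^2)                   # ridge-like penalty
weight_elim <- sum(W^2 / (1 + W^2)) + sum(B^2 / (1 + B^2))  # shrinks small weights more aggressively
z <- 0.2                                              # dropout proportion
keep <- rbinom(length(W), 1, 1 - z)                   # keep with probability 1 - z (applied to individual weights here for simplicity)
W_dropped <- W * keep                                 # roughly z% of the weights are zeroed this epoch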

Other common pitfalls

There are a few areas to watch out for

Nonconvexity:

The neural network optimization problem is non-convex.

This makes any numerical solution highly dependent on the initial values. These should be

  • chosen carefully, typically at random near 0 (DON’T use all 0s)
  • regenerated several times to check sensitivity

Scaling:
Be sure to standardize the covariates before training
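
A minimal sketch of both points above (standardize, then refit from several random starts), using the nnet package as a stand-in fitting routine; x and y are a hypothetical design matrix and response.

library(nnet)
x_std <- scale(x)                                  # centre and scale the covariates
fits <- lapply(1:5, function(i) {                  # several random initializations
  set.seed(i)
  nnet(x_std, y, size = 10, linout = TRUE, decay = 0.01,
       rang = 0.1, maxit = 500, trace = FALSE)     # rang: small random starting weights near 0
})
vapply(fits, function(f) f$value, double(1))       # compare the achieved (penalized) training criteria

One would then keep the fit with the smallest value, or average predictions across the restarts.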

Other common pitfalls

Number of hidden units:
It is generally better to have too many hidden units than too few (regularization can eliminate some).

Sifting the output:

  • Choose the solution that minimizes training error
  • Choose the solution that minimizes the penalized training error
  • Average the solutions across runs

Tuning parameters

There are many.

  • Regularization
  • Stopping criterion
  • Learning rate
  • Architecture
  • Dropout %
  • others…

These are hard to tune.

In practice, people might choose “some” with a validation set, and fix the rest largely arbitrarily

More often, people set them all arbitrarily
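
To make the “validation set” idea concrete, here is a hedged sketch that tunes only the weight-decay penalty on a single validation split, with x_std and y the hypothetical data from the earlier sketch; everything else is simply fixed.

library(nnet)
val_idx <- sample(nrow(x_std), nrow(x_std) %/% 5)           # hold out ~20% for validation
decays <- 10^seq(-4, 0, length.out = 9)                     # candidate weight-decay values
val_err <- sapply(decays, function(d) {
  fit <- nnet(x_std[-val_idx, , drop = FALSE], y[-val_idx], size = 10, linout = TRUE,
              decay = d, maxit = 500, trace = FALSE)
  mean((y[val_idx] - predict(fit, x_std[val_idx, , drop = FALSE]))^2)
})
decays[which.min(val_err)]                                  # the decay with the smallest validation error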

Thoughts on NNets

Off the top of my head, without lots of justification

🤬😡 Why don’t statisticians like them? 🤬😡

  • There is little theory (though this is increasing)
  • Statistical theory applies to the global minimum; here we only get a local minimum determined by the optimizer
  • Little understanding of when they work
  • In large part, NNets look like logistic regression + feature creation. We understand that well, and in many applications it performs just as well
  • Explosion of tuning parameters without a way to decide
  • Require massive datasets to work
  • Lots of examples where they perform exceedingly poorly

🔥🔥Why are they hot?🔥🔥

  • Perform exceptionally well on typical CS tasks (images, translation)
  • Take advantage of SOTA computing (parallel, GPUs)
  • Very good for multinomial logistic regression
  • An excellent example of “transfer learning”
  • They generate pretty pictures (the nets, pseudo-responses at hidden units)

Keras

Most people who do deep learning use Python \(+\) Keras \(+\) TensorFlow

It takes some work to get all this software up and running.

It is possible to do all of this in R using an interface to Keras.

I used to try to do a walk-through, but the interface is quite brittle.

If you want to explore, see the handout.
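
For reference, here is a minimal, hedged sketch of what the R interface looks like, assuming the keras package and a TensorFlow backend are installed; x_train and y_train are a hypothetical numeric matrix and response.

library(keras)
model <- keras_model_sequential() |>
  layer_dense(units = 32, activation = "relu", input_shape = ncol(x_train)) |>
  layer_dense(units = 1)                                   # one output unit for regression
model |> compile(optimizer = "adam", loss = "mse")
model |> fit(x_train, y_train, epochs = 20, batch_size = 32, validation_split = 0.2)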

Double descent and model complexity

Where does this U shape come from?

MSE = Squared Bias + Variance + Irreducible Noise

As we increase flexibility:

  • Squared bias goes down
  • Variance goes up
  • Eventually, \(\left|\partial\, \textrm{Variance}\right| > \left|\partial\, \textrm{Squared Bias}\right|\): the marginal increase in variance outpaces the marginal decrease in squared bias.

Goal: Choose amount of flexibility to balance these and minimize MSE.

Use CV or something to estimate MSE and decide how much flexibility.

Zero training error and model saturation

  • In Deep Learning, the recommendation is to “fit until you get zero training error”

  • Somehow, magically, this leads to a continued decrease in test error.

  • So, who cares about the Bias-Variance Trade off!!

Lesson:

The BV tradeoff is not wrong. 😢

This is a misunderstanding of black box algorithms and flexibility.

We don’t even need deep learning to illustrate this.

library(splines)
library(tidyverse) # for tibble, ggplot2, purrr, tidyr used below
# green, orange, blue are assumed to be hex colour strings defined in the course setup
set.seed(20221102)
n <- 20
df <- tibble(
  x = seq(-1.5 * pi, 1.5 * pi, length.out = n),
  y = sin(x) + runif(n, -0.5, 0.5)
)
g <- ggplot(df, aes(x, y)) + geom_point() + stat_function(fun = sin) + ylim(c(-2, 2))
g + stat_smooth(method = lm, formula = y ~ bs(x, df = 4), se = FALSE, color = green) + # too smooth
  stat_smooth(method = lm, formula = y ~ bs(x, df = 8), se = FALSE, color = orange) # looks good

xn <- seq(-1.5 * pi, 1.5 * pi, length.out = 1000)
# Spline by hand: X has 20 rows and 20 columns, so the fit interpolates the data
X <- bs(df$x, df = 20, intercept = TRUE)
Xn <- bs(xn, df = 20, intercept = TRUE)
S <- svd(X)
yhat <- Xn %*% S$v %*% diag(1 / S$d) %*% crossprod(S$u, df$y) # least-squares fit via the SVD
g + geom_line(data = tibble(x = xn, y = drop(yhat)), colour = orange) +
  ggtitle("20 degrees of freedom")

xn <- seq(-1.5 * pi, 1.5 * pi, length.out = 1000)
# Spline by hand: now X has 40 columns but only 20 rows, so the system is underdetermined
X <- bs(df$x, df = 40, intercept = TRUE)
Xn <- bs(xn, df = 40, intercept = TRUE)
S <- svd(X)
yhat <- Xn %*% S$v %*% diag(1 / S$d) %*% crossprod(S$u, df$y) # the SVD picks the minimum-norm solution
g + geom_line(data = tibble(x = xn, y = drop(yhat)), colour = orange) +
  ggtitle("40 degrees of freedom")

What happened?!

doffs <- 4:50
mse <- function(x, y) mean((x - y)^2)
get_errs <- function(doff) {
  X <- bs(df$x, df = doff, intercept = TRUE)
  Xn <- bs(xn, df = doff, intercept = TRUE)
  S <- svd(X)
  yh <- S$u %*% crossprod(S$u, df$y)                       # fitted values at the training points
  bhat <- S$v %*% diag(1 / S$d) %*% crossprod(S$u, df$y)   # (minimum-norm) coefficient estimate
  yhat <- Xn %*% bhat                                      # predictions on the fine grid
  nb <- sqrt(sum(bhat^2))                                  # Euclidean norm of the coefficients
  tibble(train = mse(df$y, yh), test = mse(yhat, sin(xn)), norm = nb)
}
errs <- map(doffs, get_errs) |>
  list_rbind() |> 
  mutate(`degrees of freedom` = doffs) |> 
  pivot_longer(train:test, values_to = "error")

What happened?!

ggplot(errs, aes(`degrees of freedom`, error, color = name)) +
  geom_line(linewidth = 2) + 
  coord_cartesian(ylim = c(0, .12)) +
  scale_x_log10() + 
  scale_colour_manual(values = c(blue, orange), name = "") +
  geom_vline(xintercept = 20) # n = 20: the interpolation threshold

What happened?!

best_test <- errs |> filter(name == "test")
min_norm <- best_test$norm[which.min(best_test$error)]
ggplot(best_test, aes(norm, error)) +
  geom_line(colour = blue, linewidth = 2) + ylab("test error") +
  geom_vline(xintercept = min_norm, colour = orange) +
  scale_y_log10() + scale_x_log10() + geom_vline(xintercept = 20)

Degrees of freedom and complexity

  • In low dimensions (where \(n \gg p\)), with linear smoothers, df and model complexity are roughly the same.

  • But this relationship breaks down in more complicated settings

  • We’ve already seen this:

library(glmnet)
out <- cv.glmnet(X, df$y, nfolds = n) # nfolds = n = 20, i.e. leave-one-out CV; X is the spline basis from above
with(
  out, 
  tibble(lambda = lambda, df = nzero, cv = cvm, cvup = cvup, cvlo = cvlo )
) |> 
  filter(df > 0) |>
  pivot_longer(lambda:df) |> 
  ggplot(aes(x = value)) +
  geom_errorbar(aes(ymax = cvup, ymin = cvlo)) +
  geom_point(aes(y = cv), colour = orange) +
  facet_wrap(~ name, strip.position = "bottom", scales = "free_x") +
  scale_y_log10() +
  scale_x_log10() + theme(axis.title.x = element_blank())

Infinite solutions

  • In Lasso, df is not really the right measure of complexity

  • Better is \(\lambda\) or the norm of the coefficients (these are basically the same)

  • So what happened with the Splines?

  • When df \(= 20\), there’s a unique solution that interpolates the data

  • When df \(> 20\), there are infinitely many solutions that interpolate the data.

Because we used the SVD to solve the system, we happened to pick one: the one that has the smallest \(\Vert\hat\beta\Vert_2\)

Recent work in deep learning suggests that SGD has a similar property: it tends to return the solution with the smallest norm (a form of implicit regularization).

If we measure complexity in terms of the norm of the weights, rather than by counting parameters, we don’t see double descent anymore.
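
A quick check of this claim (a sketch, assuming X is still the 40-column spline basis built above): add any null-space direction to the minimum-norm coefficients and you get a different solution with identical fitted values but a larger norm.

S_full <- svd(X, nv = ncol(X))                      # all right singular vectors, including the null space
bhat <- S_full$v[, 1:nrow(X)] %*% (crossprod(S_full$u, df$y) / S_full$d)  # minimum-norm interpolator
N <- S_full$v[, (nrow(X) + 1):ncol(X)]              # basis for the null space of X
b_alt <- bhat + N %*% rnorm(ncol(N))                # another solution that also interpolates
max(abs(X %*% b_alt - X %*% bhat))                  # ~0: identical fitted values
c(sqrt(sum(bhat^2)), sqrt(sum(b_alt^2)))            # bhat has the smaller Euclidean norm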

The lesson

  • Deep learning isn’t magic.

  • Zero training error with lots of parameters doesn’t mean good test error.

  • We still need the bias-variance tradeoff.

  • Its intuition still applies: more flexibility eventually leads to increased MSE.

  • But we need to be careful how we measure complexity.

Next time…

Module 5

unsupervised learning