Its updates depend only on local information, in the sense that if objects in the hierarchical model are unrelated to each other, their updates don’t affect one another

(This helps in many ways, most notably in parallel architectures)

It doesn’t require second-derivative information

As the updates are only in terms of \(\hat{R}_i\), the algorithm can be run in either batch or online mode

Downsides:

It can be very slow

Need to choose the learning rate \(\gamma_t\)
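To make the role of \(\gamma_t\) concrete, here is a minimal sketch of stochastic gradient updates on a toy least-squares problem (the data, the decay schedule \(\gamma_t = 0.1/\sqrt{t}\), and the number of steps are all hypothetical choices for illustration):

```python
import numpy as np

# toy problem: minimize R_hat(beta) = mean((y - X @ beta)^2)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=100)

beta = np.zeros(3)
for t in range(1, 2001):
    gamma_t = 0.1 / np.sqrt(t)           # decaying learning rate
    i = rng.integers(100)                # one observation at a time -> "online" mode
    grad = -2 * X[i] * (y[i] - X[i] @ beta)
    beta -= gamma_t * grad

print(np.round(beta, 1))
```

Using all \(n\) observations in `grad` instead of one gives the batch version; nothing else changes.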

Other algorithms

There are many variations on the fitting algorithm

Stochastic gradient descent (SGD): discussed in the optimization lecture

The rest are variations that use lots of tricks

RMSprop

Adam

Adadelta

Adagrad

Adamax

Nadam

Ftrl
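To give a flavour of the “tricks”, here is a sketch of one of them, the RMSprop update rule, on a hypothetical toy objective (the decay constant 0.9 and \(\varepsilon\) are conventional defaults, not values from these notes):

```python
import numpy as np

def rmsprop_step(w, grad, v, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSprop update: scale each coordinate's step by a running RMS of its gradients."""
    v = decay * v + (1 - decay) * grad**2    # running mean of squared gradients
    w = w - lr * grad / (np.sqrt(v) + eps)   # per-coordinate adaptive step size
    return w, v

# minimize f(w) = sum(w^2); its gradient is 2w
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(1000):
    w, v = rmsprop_step(w, 2 * w, v)
print(np.round(w, 3))
```

Adam, Adadelta, and the rest are variations on this same theme: keep running statistics of the gradients and use them to rescale the step.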

Regularizing neural networks

Because they have so many parameters, NNets can almost always achieve 0 training error, even with regularization.

Flavours:

a complexity penalization term \(\longrightarrow\) solve \(\min \hat{R} + \rho(\alpha,\beta)\)

early stopping on the back propagation algorithm used for fitting

Weight decay

This is like ridge regression in that we penalize the squared Euclidean norm of the weights \(\rho(\mathbf{W},\mathbf{B}) = \sum w_i^2 + \sum b_i^2\)
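In code, weight decay just adds this penalty (and its gradient) to the loss; a sketch with hypothetical weights and a hypothetical tuning constant `lam`:

```python
import numpy as np

def weight_decay_penalty(W, B, lam=0.01):
    """rho(W, B) = sum w_i^2 + sum b_i^2, scaled by a tuning constant lam."""
    return lam * (np.sum(W**2) + np.sum(B**2))

def weight_decay_grad(W, lam=0.01):
    """Gradient of the penalty w.r.t. W: pushes every weight toward 0."""
    return 2 * lam * W

W = np.array([[1.0, -2.0], [0.5, 0.0]])
B = np.array([0.1, -0.1])
print(weight_decay_penalty(W, B))  # 0.01 * (1 + 4 + 0.25 + 0 + 0.01 + 0.01)
```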

Weight elimination

This encourages more shrinking of small weights \(\rho(\mathbf{W},\mathbf{B}) = \sum \frac{w_i^2}{1+w_i^2} + \sum \frac{b_i^2}{1 + b_i^2}\) or Lasso-type
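The weight-elimination penalty is just as easy to write down; note that it flattens out for large \(|w|\), so large weights are penalized much less aggressively than under weight decay (a sketch):

```python
import numpy as np

def weight_elimination_penalty(W, B):
    """rho(W, B) = sum w^2/(1+w^2) + sum b^2/(1+b^2); each term is bounded by 1."""
    return np.sum(W**2 / (1 + W**2)) + np.sum(B**2 / (1 + B**2))

# small weights are penalized nearly quadratically...
print(weight_elimination_penalty(np.array([0.1]), np.array([0.0])))
# ...while large weights approach a constant penalty of 1
print(weight_elimination_penalty(np.array([10.0]), np.array([0.0])))
```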

Dropout

In each epoch, randomly choose \(z\%\) of the nodes and set their outputs (equivalently, their outgoing weights) to zero.
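A sketch of dropout as a random mask on hidden-unit activations; the rescaling by \(1/(1-z)\) (“inverted dropout”) is a common convention, assumed here, that keeps the expected activation unchanged:

```python
import numpy as np

def dropout(activations, z=0.5, rng=None, training=True):
    """Zero out a random z-fraction of units; rescale the rest so the expectation is unchanged."""
    if not training:            # at prediction time, use all units
        return activations
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= z   # keep each unit with probability 1 - z
    return activations * mask / (1 - z)

rng = np.random.default_rng(0)
h = np.ones(10_000)
out = dropout(h, z=0.3, rng=rng)
print(out.mean())  # approximately 1 on average, despite 30% of units being zeroed
```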

Other common pitfalls

There are a few areas to watch out for

Nonconvexity:

The neural network optimization problem is non-convex.

This makes any numerical solution highly dependent on the initial values. These should be

chosen carefully, typically random near 0 (DON’T use all 0)

regenerated several times to check sensitivity
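A sketch of the “small random, not all 0” initialization (layer sizes and the scale 0.01 are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
# small random values break symmetry; with W = 0, every hidden unit would
# compute the same thing and receive the same gradient, so the units would
# never differentiate during training
W1 = 0.01 * rng.normal(size=(10, 5))
print(W1.std())
```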

Scaling:
Be sure to standardize the covariates before training
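Standardizing with plain numpy; remember to reuse the training means and SDs on any test data rather than re-estimating them:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50, scale=10, size=(200, 3))  # hypothetical covariates

mu = X_train.mean(axis=0)
sd = X_train.std(axis=0)
X_std = (X_train - mu) / sd        # each column now has mean 0 and sd 1

print(np.round(X_std.mean(axis=0), 6), np.round(X_std.std(axis=0), 6))
```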

Number of hidden units:
It is generally better to have too many hidden units than too few (regularization can eliminate some).

Sifting the output:

Choose the solution that minimizes training error

Choose the solution that minimizes the penalized training error

Average the solutions across runs

Tuning parameters

There are many.

Regularization

Stopping criterion

learning rate

Architecture

Dropout %

others…

These are hard to tune.

In practice, people might choose “some” with a validation set, and fix the rest largely arbitrarily

More often, people set them all arbitrarily

Thoughts on NNets

Off the top of my head, without lots of justification

🤬😡 Why don’t statisticians like them? 🤬😡

There is little theory (though this is increasing)

Stat theory applies to global minima; here we only get local minima, determined by the optimizer

Little understanding of when they work

In large part, NNets look like logistic regression + feature creation. We understand that well, and in many applications, it performs as well

Explosion of tuning parameters without a way to decide

Require massive datasets to work

Lots of examples where they perform exceedingly poorly

🔥🔥Why are they hot?🔥🔥

Perform exceptionally well on typical CS tasks (images, translation)

Take advantage of SOTA computing (parallel, GPUs)

Very good for multinomial logistic regression

An excellent example of “transfer learning”

They generate pretty pictures (the nets, pseudo-responses at hidden units)

Keras

Most people who do deep learning use Python \(+\) Keras \(+\) Tensorflow

It takes some work to get all this software up and running.

Remember the bias-variance trade-off? It says that models perform well for an "intermediate level of flexibility". You've seen the picture of the U-shaped test error curve.

Goal: Choose amount of flexibility to balance these and minimize MSE.

Use CV or something to estimate MSE and decide how much flexibility.

In the past few years (particularly in the context of deep learning), people have noticed "double descent" -- when you continue to fit increasingly flexible models that interpolate the training data, the test error can start to DECREASE again!!