Its updates depend only on local information, in the sense that if objects in the hierarchical model are unrelated to each other, the updates don’t affect each other
(This helps in many ways, most notably in parallel architectures)
It doesn’t require second-derivative information
As the updates are only in terms of \(\hat{R}_i\), the algorithm can be run in either batch or online mode
Downsides:
It can be very slow
Need to choose the learning rate \(\gamma_t\)
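As a concrete illustration (not from the lecture), here is one such gradient-descent loop in numpy on made-up data; the \(\gamma_t = 0.1/\sqrt{t}\) schedule is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

w = rng.normal(scale=0.1, size=3)          # small random start, not all zeros
for t in range(1, 201):
    gamma_t = 0.1 / np.sqrt(t)             # learning-rate schedule we must choose
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the empirical risk R-hat
    w -= gamma_t * grad                    # batch update; online mode would use one row of X at a time
```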
Other algorithms
There are many variations on the fitting algorithm
Stochastic gradient descent (SGD): discussed in the optimization lecture
The rest are variations that use lots of tricks
RMSprop
Adam
Adadelta
Adagrad
Adamax
Nadam
Ftrl
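In Keras (tf.keras assumed), each of these is available as an optimizer object passed to `compile()`; the tiny architecture below is only a placeholder:

```python
from tensorflow import keras

# Placeholder architecture; the point is only the `optimizer=` argument.
model = keras.Sequential([
    keras.Input(shape=(5,)),
    keras.layers.Dense(10, activation="relu"),
    keras.layers.Dense(1),
])

# Any of the variants above can be swapped in: keras.optimizers.RMSprop,
# keras.optimizers.Adadelta, keras.optimizers.Nadam, keras.optimizers.Ftrl, ...
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
```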
Regularizing neural networks
Because they have so many parameters, NNets can almost always achieve 0 training error, even with regularization.
Flavours:
a complexity penalization term \(\longrightarrow\) solve \(\min \hat{R} + \rho(\mathbf{W},\mathbf{B})\)
early stopping on the back propagation algorithm used for fitting
Weight decay
This is like ridge regression in that we penalize the squared Euclidean norm of the weights \(\rho(\mathbf{W},\mathbf{B}) = \sum w_i^2 + \sum b_i^2\)
Weight elimination
This encourages more shrinking of small weights: \(\rho(\mathbf{W},\mathbf{B}) = \sum \frac{w_i^2}{1+w_i^2} + \sum \frac{b_i^2}{1 + b_i^2}\). A Lasso-type (absolute value) penalty is another option.
Dropout
In each epoch (or minibatch), randomly choose \(z\%\) of the nodes and set their outputs to zero, so the associated weights receive no update on that pass.
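A sketch of weight decay, dropout, and early stopping together in Keras (tf.keras assumed; the data, penalty strength \(\lambda\), and architecture are made up for illustration). The built-in \(\ell_2\) regularizer plays the role of weight decay; weight elimination has no built-in equivalent, so it is omitted:

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(1)
x_train = rng.normal(size=(200, 20))
y_train = rng.normal(size=200)

lam = 1e-3  # penalty strength (illustrative)
model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(lam),   # weight decay on the weights
                       bias_regularizer=keras.regularizers.l2(lam)),    # ... and on the biases
    keras.layers.Dropout(0.2),  # drop 20% of these nodes on each training pass
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Early stopping: halt training when the validation loss stops improving.
stop = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[stop], verbose=0)
```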
Other common pitfalls
There are a few areas to watch out for
Nonconvexity:
The neural network optimization problem is non-convex.
This makes any numerical solution highly dependent on the initial values. These should be:
chosen carefully, typically small random values near 0. DON’T use all 0s.
regenerated several times to check sensitivity
Scaling:
Be sure to standardize the covariates before training (a small sketch appears after this list)
Number of hidden units:
It is generally better to have too many hidden units than too few (regularization can eliminate some).
Sifting the output (across multiple random restarts):
Choose the solution that minimizes training error
Choose the solution that minimizes the penalized training error
Average the solutions across runs
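A small sketch of the scaling and averaging habits above (tf.keras assumed; the data and the `make_model` helper are made up for illustration):

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(2)
x_train, x_test = rng.normal(size=(200, 10)), rng.normal(size=(50, 10))
y_train = rng.normal(size=200)

# Scaling: standardize covariates using the training-set means/SDs only.
mu, sd = x_train.mean(axis=0), x_train.std(axis=0)
x_train_s, x_test_s = (x_train - mu) / sd, (x_test - mu) / sd

def make_model():
    # Keras's default initializers already give small random (non-zero) starting weights.
    m = keras.Sequential([keras.Input(shape=(10,)),
                          keras.layers.Dense(32, activation="relu"),
                          keras.layers.Dense(1)])
    m.compile(optimizer="adam", loss="mse")
    return m

# Refit from several random starts and average the predictions across runs.
preds = []
for _ in range(5):
    m = make_model()
    m.fit(x_train_s, y_train, epochs=20, verbose=0)
    preds.append(m.predict(x_test_s, verbose=0).ravel())
avg_pred = np.mean(preds, axis=0)
```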
Tuning parameters
There are many.
Regularization
Stopping criterion
learning rate
Architecture
Dropout %
others…
These are hard to tune.
In practice, people might choose “some” with a validation set, and fix the rest largely arbitrarily
More often, people set them all arbitrarily
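For instance, one might tune only the dropout % with a validation split and leave everything else fixed; a sketch on made-up data (all settings illustrative):

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(3)
x, y = rng.normal(size=(300, 10)), rng.normal(size=300)

best_rate, best_val = None, np.inf
for rate in [0.0, 0.2, 0.5]:              # tune the dropout %; everything else stays fixed
    model = keras.Sequential([
        keras.Input(shape=(10,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dropout(rate),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    hist = model.fit(x, y, validation_split=0.25, epochs=30, verbose=0)
    val = min(hist.history["val_loss"])   # best validation loss for this setting
    if val < best_val:
        best_rate, best_val = rate, val
```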
Thoughts on NNets
Off the top of my head, without lots of justification
🤬😡 Why don’t statisticians like them? 🤬😡
There is little theory (though this is increasing)
Stat theory applies to global minima; here we only get local minima, determined by the optimizer
Little understanding of when they work
In large part, NNets look like logistic regression + feature creation. We understand that well, and in many applications, it performs as well
Explosion of tuning parameters without a way to decide
Require massive datasets to work
Lots of examples where they perform exceedingly poorly
🔥🔥Why are they hot?🔥🔥
Perform exceptionally well on typical CS tasks (images, translation)
Take advantage of SOTA computing (parallel, GPUs)
Very good for multinomial logistic regression
An excellent example of “transfer learning”
They generate pretty pictures (the nets, pseudo-responses at hidden units)
Keras
Most people who do deep learning use Python \(+\) Keras \(+\) TensorFlow
It takes some work to get all this software up and running.
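Once it is running, a minimal fit looks roughly like this (a sketch with made-up data, not code from the lecture):

```python
import numpy as np
from tensorflow import keras

# Made-up binary classification data, just to have something to fit.
rng = np.random.default_rng(4)
x_train = rng.normal(size=(500, 8))
y_train = rng.integers(0, 2, size=500)

model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(16, activation="relu"),    # one hidden layer
    keras.layers.Dense(1, activation="sigmoid"),  # output is a probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, batch_size=32, verbose=0)
```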
Remember the bias-variance trade-off? It says that models perform well for an "intermediate level of flexibility". You've seen the picture of the U-shaped test error curve.
Goal: Choose amount of flexibility to balance these and minimize MSE.
Use CV (or similar) to estimate MSE and decide how much flexibility to use.
In the past few years (and particularly in the context of deep learning), people have noticed "double descent": if you continue to fit increasingly flexible models that interpolate the training data, the test error can start to DECREASE again!!
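A sketch of the classical workflow above: estimate test MSE by cross-validation at several flexibility levels (here, polynomial degree via scikit-learn) and pick the minimizer. The data and the range of degrees are made up:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
x = rng.uniform(-2, 2, size=(100, 1))
y = np.sin(2 * x[:, 0]) + rng.normal(scale=0.3, size=100)

# Estimated test MSE (5-fold CV) at each flexibility level.
cv_mse = {}
for degree in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()

best_degree = min(cv_mse, key=cv_mse.get)  # flexibility that minimizes estimated MSE
```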