00 Gradient descent

Stat 406

Geoff Pleiss, Trevor Campbell

Last modified – 21 October 2024

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \]

Motivation: maximum likelihood estimation as optimization

By the principle of maximum likelihood, we have that

\[ \begin{align*} \hat \beta &= \argmax_{\beta} \prod_{i=1}^n \P(Y_i \mid X_i) \\ &= \argmin_{\beta} \sum_{i=1}^n -\log\P(Y_i \mid X_i) \end{align*} \]

Under the model we use for logistic regression… \[ \begin{gathered} \P(Y=1 \mid X=x) = h(\beta^\top x), \qquad \P(Y=0 \mid X=x) = h(-\beta^\top x), \\ h(z) = \tfrac{1}{1+e^{-z}} \end{gathered} \]

… we can’t simply find the argmin with algebra.
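To make the objective concrete, here is a minimal sketch of the logistic negative log-likelihood as an R function. The names `nll`, `X`, `y`, `beta`, and the simulated data are illustrative assumptions, not part of the course code. It relies on the fact that both cases of the model can be written as \(h((2y-1)\beta^\top x)\), with `plogis` playing the role of \(h\).

```r
# Negative log-likelihood for logistic regression (illustrative sketch).
# X: n x p design matrix, y: 0/1 response vector, beta: length-p coefficients.
nll <- function(beta, X, y) {
  z <- drop(X %*% beta)          # linear predictor beta^T x_i for each row
  s <- 2 * y - 1                 # recode y in {0, 1} as s in {-1, +1}
  # P(Y_i = y_i | x_i) = h(s_i * z_i); plogis is the logistic function h
  -sum(plogis(s * z, log.p = TRUE))
}

# Small simulated data set (assumed purely for illustration)
set.seed(406)
n <- 50
X <- cbind(1, rnorm(n))
beta_true <- c(-1, 2)
y <- rbinom(n, 1, plogis(drop(X %*% beta_true)))

nll(c(0, 0), X, y)      # equals n * log(2): every probability is 1/2
nll(beta_true, X, y)
```

Setting the gradient of `nll` to zero gives equations that are nonlinear in `beta`, which is why a numerical method is needed.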

Gradient descent: the workhorse optimization algorithm

We’ll see “gradient descent” a few times:

fitting logistic regression
gradient boosting
neural networks

This seems like a good time to explain it.

So what is it and how does it work?

Very basic example

Suppose I want to minimize \(f(x)=(x-6)^2\) numerically.

I start at a point (say \(x_1=23\))

I want to “go” in the negative direction of the gradient.

The gradient (at \(x_1=23\)) is \(f'(23)=2(23-6)=34\).

Move the current value toward \(x_1 - 34\), but only part of the way.

\(x_2 = x_1 - \gamma \cdot 34\), for \(\gamma\) small.

In general, \(x_{n+1} = x_n -\gamma f'(x_n)\).

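niter <- 10                        # number of iterates to compute
gam <- 0.1                         # step size (gamma)
x <- double(niter)                 # storage for the iterates
x[1] <- 23                         # starting point x_1
grad <- function(x) 2 * (x - 6)    # f'(x) for f(x) = (x - 6)^2
for (i in 2:niter) x[i] <- x[i - 1] - gam * grad(x[i - 1])  # gradient step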

Why does this work?

Heuristic interpretation:

The gradient tells me the slope.
The negative gradient points downhill, toward the minimum.
Go that way, but not too far (or we’ll miss it).
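The "not too far" part can be seen by rerunning the basic example with different step sizes. This is a small sketch; the helper `gd` and the particular step sizes are assumptions chosen for illustration. For \(f(x)=(x-6)^2\), each update multiplies the error \(x_n - 6\) by \(1 - 2\gamma\), so the behaviour depends entirely on the size of that factor.

```r
# Run gradient descent on f(x) = (x - 6)^2 for a given fixed step size.
gd <- function(gam, x1 = 23, niter = 10) {
  grad <- function(x) 2 * (x - 6)
  x <- double(niter)
  x[1] <- x1
  for (i in 2:niter) x[i] <- x[i - 1] - gam * grad(x[i - 1])
  x
}

small <- gd(0.1)   # error shrinks by a factor of 0.8 each step
exact <- gd(0.5)   # lands on the minimizer x = 6 in one step
flip  <- gd(1.0)   # overshoots and bounces between 23 and -11 forever
big   <- gd(1.1)   # overshoots more each time and diverges

abs(small - 6)     # distance to the minimizer, decreasing
abs(big - 6)       # distance to the minimizer, increasing
```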

Why does this work?

More rigorous interpretation:

Taylor expansion \[ f(x) \approx f(x_0) + \nabla f(x_0)^{\top}(x-x_0) + \frac{1}{2}(x-x_0)^\top H(x_0) (x-x_0) \]
replace the Hessian \(H(x_0)\) with \(\gamma^{-1} I\)
minimize this quadratic approximation in \(x\): \[ 0\overset{\textrm{set}}{=}\nabla f(x_0) + \frac{1}{\gamma}(x-x_0) \Longrightarrow x = x_0 - \gamma \nabla f(x_0) \]
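As a quick numerical check of this argument, the sketch below builds the quadratic approximation for the basic example \(f(x)=(x-6)^2\) at \(x_0 = 23\) and minimizes it with `optimize()`. The names `q`, `x_surrogate`, and `x_gd_step`, and the search interval, are assumptions made for illustration.

```r
f    <- function(x) (x - 6)^2
grad <- function(x) 2 * (x - 6)
x0   <- 23
gam  <- 0.1

# Quadratic surrogate: Taylor expansion with the Hessian replaced by 1 / gam
q <- function(x) f(x0) + grad(x0) * (x - x0) + (x - x0)^2 / (2 * gam)

x_surrogate <- optimize(q, interval = c(-100, 100))$minimum  # numerical argmin
x_gd_step   <- x0 - gam * grad(x0)                            # closed-form step

c(x_surrogate, x_gd_step)   # agree up to optimize()'s tolerance
```

Seen this way, a smaller \(\gamma\) corresponds to assuming more curvature, which makes the surrogate's minimizer sit closer to \(x_0\).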

Visually

What \(\gamma\)? (more details than we have time for)

What to use for \(\gamma_n\)?

Fixed

Only works if \(\gamma\) is chosen well: too large and the iterates overshoot or diverge, too small and progress is very slow
Usually does not work

Decay on a schedule

\(\gamma_{n+1} = \frac{\gamma_n}{1+cn}\) or \(\gamma_{n} = \gamma_0 b^n\)
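The two schedules can be tabulated in a few lines. This is only a sketch: the values of `gamma0`, `c_rate`, and `b`, and the choice to start the recursion at \(\gamma_1 = \gamma_0\), are assumptions for illustration.

```r
niter  <- 6
gamma0 <- 0.5   # initial step size (assumed value)
c_rate <- 0.5   # decay constant c in the first schedule (assumed value)
b      <- 0.9   # decay base b in the second schedule (assumed value)

# Schedule 1: gamma_{n+1} = gamma_n / (1 + c * n), starting from gamma_1 = gamma0
gam_recursive <- double(niter)
gam_recursive[1] <- gamma0
for (n in 1:(niter - 1)) {
  gam_recursive[n + 1] <- gam_recursive[n] / (1 + c_rate * n)
}

# Schedule 2: gamma_n = gamma0 * b^n for n = 1, ..., niter
gam_geometric <- gamma0 * b^(1:niter)

rbind(gam_recursive, gam_geometric)
```

With these values the recursive schedule shrinks much faster than the geometric one, because its divisor \(1 + cn\) grows with \(n\).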

Exact line search

Tells you exactly how far to go.
At each iteration \(n\), solve \(\gamma_n = \arg\min_{s \geq 0} f( x^{(n)} - s \nabla f(x^{(n)}))\)
Usually can’t solve this.
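One special case where the line search can be solved exactly is a convex quadratic \(f(x) = \tfrac12 x^\top A x - b^\top x\). Writing \(g = \nabla f(x) = Ax - b\) and setting the derivative of \(f(x - s g)\) in \(s\) to zero gives \(s = g^\top g / (g^\top A g)\). The sketch below uses that formula; the matrix `A`, vector `b`, starting point, and stopping rule are all assumptions chosen for illustration.

```r
# Convex quadratic f(x) = 0.5 * x' A x - b' x (an assumed test problem)
A <- matrix(c(3, 1, 1, 2), 2, 2)   # symmetric positive definite
b <- c(1, -1)
f    <- function(x) 0.5 * sum(x * (A %*% x)) - sum(b * x)
grad <- function(x) drop(A %*% x - b)

x <- c(5, 5)
for (n in 1:50) {
  g <- grad(x)
  if (sqrt(sum(g^2)) < 1e-8) break           # stop once the gradient is ~ 0
  # Exact step for a quadratic: argmin over s of f(x - s * g)
  gam <- sum(g^2) / sum(g * (A %*% g))
  x <- x - gam * g
}

x_star <- solve(A, b)   # true minimizer, for comparison
rbind(x, x_star)
```

For a general \(f\) there is no such formula, and the one-dimensional problem would itself have to be solved numerically at every iteration.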