Stat 406
Geoff Pleiss, Trevor Campbell
Last modified – 21 October 2024
\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \]
By the principle of maximum likelihood, we have that
\[ \begin{align*} \hat \beta &= \argmax_{\beta} \prod_{i=1}^n \P(Y_i \mid X_i) \\ &= \argmin_{\beta} \sum_{i=1}^n -\log\P(Y_i \mid X_i) \end{align*} \]
Under the model we use for logistic regression… \[ \begin{gathered} \P(Y=1 \mid X=x) = h(\beta^\top x), \qquad \P(Y=0 \mid X=x) = h(-\beta^\top x), \\ h(z) = \tfrac{1}{1+e^{-z}} \end{gathered} \]
… we can’t simply find the argmin with algebra: the resulting score equations are nonlinear in \(\beta\), so there is no closed-form solution.
We’ll see “gradient descent” a few times in this course.
This seems like a good time to explain it.
So what is it, and how does it work?
Suppose I want to minimize \(f(x)=(x-6)^2\) numerically.
I start at a point (say \(x_1=23\))
I want to “go” in the negative direction of the gradient.
The gradient (at \(x_1=23\)) is \(f'(23)=2(23-6)=34\).
So I move my current value some of the way toward \(x_1 - 34\):
\(x_2 = x_1 - \gamma \cdot 34\), for \(\gamma\) small.
In general, \(x_{n+1} = x_n -\gamma f'(x_n)\).
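Here is a minimal R sketch of this update rule applied to \(f(x)=(x-6)^2\); the step size and number of iterations are arbitrary illustrative choices.
f_prime <- function(x) 2 * (x - 6)   # derivative of f(x) = (x - 6)^2
gamma <- 0.1                         # fixed step size (illustrative)
xn <- 23                             # starting value x_1
for (n in 1:50) {
  xn <- xn - gamma * f_prime(xn)     # x_{n+1} = x_n - gamma * f'(x_n)
}
xn                                   # approximately 6, the minimizer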
Heuristic interpretation:
The gradient tells me the slope.
The negative gradient points toward the minimum.
Go that way, but not too far (or we’ll overshoot it).
More rigorous interpretation:
Taylor expansion \[ f(x) \approx f(x_0) + \nabla f(x_0)^{\top}(x-x_0) + \frac{1}{2}(x-x_0)^\top H(x_0) (x-x_0) \]
replace \(H\) with \(\gamma^{-1} I\)
minimize this quadratic approximation in \(x\): \[ 0\overset{\textrm{set}}{=}\nabla f(x_0) + \frac{1}{\gamma}(x-x_0) \Longrightarrow x = x_0 - \gamma \nabla f(x_0) \]
What to use for \(\gamma_n\)?
Fixed
Decay on a schedule
\(\gamma_{n+1} = \frac{\gamma_n}{1+cn}\) or \(\gamma_{n} = \gamma_0 b^n\) (see the code sketch after this list)
Exact line search: at each iteration, choose the \(\gamma\) that minimizes \(f(x_n - \gamma f'(x_n))\) (accurate, but usually too expensive)
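For illustration, the two decay schedules could be computed like this in R (the constants \(\gamma_0\), \(c\), and \(b\) are placeholder values, not recommendations).
gamma0 <- 0.5; cc <- 0.1; b <- 0.9            # illustrative constants
gammas1 <- numeric(20)                        # schedule 1 (recursive): gamma_{n+1} = gamma_n / (1 + c * n)
gammas1[1] <- gamma0
for (n in 1:19) gammas1[n + 1] <- gammas1[n] / (1 + cc * n)
gammas2 <- gamma0 * b^(0:19)                  # schedule 2 (closed form): gamma_n = gamma0 * b^n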
[Figure: gradient descent iterates on \(f(x_1,x_2) = x_1^2 + 0.5x_2^2\) for different step sizes.]
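Presumably the panels compare step sizes on this quadratic. A sketch of how such trajectories can be computed (the starting point and the specific \(\gamma\) values are illustrative assumptions):
f_grad <- function(x) c(2 * x[1], x[2])               # gradient of f(x1, x2) = x1^2 + 0.5 * x2^2
descend <- function(gamma, x0 = c(1, 1), steps = 20) {
  path <- matrix(NA, steps + 1, 2)                    # store all iterates
  path[1, ] <- x0
  for (n in 1:steps) path[n + 1, ] <- path[n, ] - gamma * f_grad(path[n, ])
  path
}
descend(0.1)   # small gamma: converges steadily
descend(0.9)   # large gamma: the x1 coordinate oscillates around 0
descend(1.1)   # gamma too large: the x1 coordinate diverges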
When do we stop? For \(\epsilon>0\), small,
check any / all of \(|f'(x_n)| < \epsilon\), \(|x_n - x_{n-1}| < \epsilon\), or \(|f(x_n) - f(x_{n-1})| < \epsilon\).
If we are optimizing \(\argmin_\beta \sum_{i=1}^n -\log P_\beta(Y_i \mid X_i)\), then the derivative is also a sum over observations: \[ \sum_{i=1}^n \frac{\partial}{\partial \beta} \left[-\log P_\beta(Y_i \mid X_i) \right] \]
If \(n\) is really big, it may take a long time to compute this
So, just sample a subset of data \(\mathcal{M} \subset \{ (X_i, Y_i)\}_{i=1}^n\) and approximate: \[\sum_{i=1}^n \frac{\partial}{\partial \beta} \left[-\log P_\beta(Y_i \mid X_i) \right] \approx \frac{n}{\vert \mathcal M \vert}\sum_{i\in\mathcal{M}} \frac{\partial}{\partial \beta}\left[-\log P_\beta(Y_i \mid X_i) \right]\]
For SGD we need the objective (and hence its gradient) to decompose as an average over observations, so that a minibatch \(\mathcal{M}_j\) gives an estimate of the full gradient:
\[ \begin{aligned} f'(\beta) &= \frac{1}{n}\sum_{i=1}^n f'_i(\beta) \approx \frac{1}{|\mathcal{M}_j|}\sum_{i\in\mathcal{M}_j}f'_{i}(\beta) \end{aligned} \]
Instead of drawing each minibatch independently, it is usually better to shuffle the data, split it into disjoint minibatches \(\mathcal{M}_1, \mathcal{M}_2, \dots\), and cycle through them (one full pass through the data is an “epoch”), reshuffling between passes.
Gradient estimates are still marginally unbiased (why?)
This is the workhorse for neural network optimization
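A minimal sketch of minibatch SGD for this one-dimensional logistic-regression problem, assuming the data are in vectors x and y; the step size, batch size, and epoch count are illustrative choices, not the implementation used later in these slides.
sgd_logistic <- function(x, y, beta0 = 0, gamma = 0.1, epochs = 20, batch_size = 10) {
  n <- length(y)
  beta <- beta0
  for (epoch in 1:epochs) {
    idx <- sample.int(n)                                  # reshuffle the data each epoch
    batches <- split(idx, ceiling(seq_along(idx) / batch_size))
    for (batch in batches) {                              # cycle through disjoint minibatches
      px <- 1 / (1 + exp(-beta * x[batch]))               # h(beta * x_i) on the minibatch
      grad <- mean(-x[batch] * (y[batch] - px))           # minibatch estimate of the average gradient
      beta <- beta - gamma * grad
    }
  }
  beta
}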
Can we use the same stopping rules? For \(\epsilon>0\), small, can we check any / all of \(|f'(\beta_n)| < \epsilon\), \(|\beta_n - \beta_{n-1}| < \epsilon\), or \(|f(\beta_n) - f(\beta_{n-1})| < \epsilon\)?
None of these works reliably, because the minibatch gradient estimates are noisy. Knowing when to terminate SGD is hard.
\[ \begin{gathered} \P(Y=1 \mid X=x) = h(\beta^\top x), \qquad \P(Y=0 \mid X=x) = h(-\beta^\top x), \\ \\ h(z) = \tfrac{1}{1+e^{-z}} \end{gathered} \]
Under maximum likelihood
\[ \hat \beta = \argmin_{\beta} \underbrace{ \textstyle{\sum_{i=1}^n - \log( \P_\beta(Y_i=y_i \mid X_i=x_i) )} }_{:= -\ell(\beta)} \]
\[ \begin{align*} \P_\beta(Y_i=y_i \mid X_i=x_i) &= h\left( [-1]^{1 - y_i} \beta^\top x_i \right) \\ \\ -\ell(\beta) &= \sum_{i=1}^n -\log\left( \P_\beta(Y_i=y_i \mid X_i=x_i) \right) \\ &= \sum_{i=1}^n \log\left( 1 + \exp\left( [-1]^{y_i} \beta^\top x_i \right) \right) \\ \\ -\frac{\partial \ell}{\partial \beta} &= \sum_{i=1}^n x_i[-1]^{y_i} \frac{\exp\left( [-1]^{y_i} \beta^\top x_i \right)}{1 + \exp\left( [-1]^{y_i} \beta^\top x_i \right)} \end{align*} \]
logit <- function(z) 1 / (1 + exp(-z))       # inverse logit h(z) (assumed helper; defined so the code runs)
beta.mle <- function(x, y, beta0, gamma = 0.5, jmax = 50, eps = 1e-6) {
  beta <- double(jmax)                       # place to hold stuff (always preallocate space)
  beta[1] <- beta0                           # starting value
  for (j in 2:jmax) {                        # avoid possibly infinite while loops
    px <- logit(beta[j - 1] * x)             # fitted probabilities h(beta * x_i)
    grad <- mean(-x * (y - px))              # gradient of the average negative log-likelihood
    beta[j] <- beta[j - 1] - gamma * grad    # gradient descent update
    if (abs(grad) < eps || abs(beta[j] - beta[j - 1]) < eps) break  # stop when converged
  }
  beta[1:j]                                  # return the whole path of iterates
}
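The plotting code below uses objects too_big, too_small, and just_right, which were produced on an earlier slide. Here is a hypothetical sketch of how they could be created; the simulated data, the seed, and the three gamma values are all assumptions, not the original choices.
set.seed(406)                                          # illustrative seed
n <- 100
x <- rnorm(n)                                          # assumed covariate distribution
y <- rbinom(n, 1, 1 / (1 + exp(-2 * x)))               # responses from a logistic model with beta = 2
too_small  <- beta.mle(x, y, beta0 = 1, gamma = 0.01)  # intended to show a step size that is too small
just_right <- beta.mle(x, y, beta0 = 1, gamma = 1)     # intended to show a reasonable step size
too_big    <- beta.mle(x, y, beta0 = 1, gamma = 50)    # intended to show a step size that is too large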
negll <- function(beta) {                    # average negative log-likelihood, vectorized over beta
  -beta * mean(x * y) -                      # uses the data x and y from the global environment
    rowMeans(log(1 / (1 + exp(outer(beta, x)))))
}
# too_big, too_small, just_right: beta iterates from runs with different step sizes (see the sketch above);
# assumes the tidyverse (purrr, dplyr, tibble, ggplot2) is loaded
blah <- list_rbind(
  map(
    rlang::dots_list(too_big, too_small, just_right, .named = TRUE),
    as_tibble
  ),
  names_to = "gamma"
) |> mutate(negll = negll(value))
ggplot(blah, aes(value, negll)) +
  geom_point(aes(colour = gamma)) +
  facet_wrap(~gamma, ncol = 1) +
  stat_function(fun = negll, xlim = c(-2.5, 5)) +
  scale_y_log10() +
  xlab("beta") +
  ylab("negative log likelihood") +
  geom_vline(xintercept = tail(just_right, 1)) +
  scale_colour_brewer(palette = "Set1") +
  theme(legend.position = "none")
We can check our answer against R’s built-in glm():
Call:
glm(formula = y ~ x - 1, family = "binomial")
Coefficients:
Estimate Std. Error z value Pr(>|z|)
x 1.9174 0.4785 4.008 6.13e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.629 on 100 degrees of freedom
Residual deviance: 32.335 on 99 degrees of freedom
AIC: 34.335
Number of Fisher Scoring iterations: 7
UBC Stat 406 - 2024