16 Logistic regression

Stat 406

Daniel J. McDonald

Last modified – 25 October 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \]

Last time

  • We showed that with two classes, the Bayes’ classifier is

\[g_*(X) = \begin{cases} 1 & \textrm{ if } \frac{p_1(X)}{p_0(X)} > \frac{1-\pi}{\pi} \\ 0 & \textrm{ otherwise} \end{cases}\]

where \(p_1(X) = Pr(X \given Y=1)\) and \(p_0(X) = Pr(X \given Y=0)\)

  • We then looked at what happens if we assume \(Pr(X \given Y=y)\) is Normally distributed.

We then used this distribution and the class prior \(\pi\) to find the posterior \(Pr(Y=1 \given X=x)\).

Direct model

Instead, let’s directly model the posterior

\[ \begin{aligned} Pr(Y = 1 \given X=x) & = \frac{\exp\{\beta_0 + \beta^{\top}x\}}{1 + \exp\{\beta_0 + \beta^{\top}x\}} \\ Pr(Y = 0 \given X=x) & = \frac{1}{1 + \exp\{\beta_0 + \beta^{\top}x\}}=1-\frac{\exp\{\beta_0 + \beta^{\top}x\}}{1 + \exp\{\beta_0 + \beta^{\top}x\}} \end{aligned} \]

This is logistic regression.

Why this?

\[Pr(Y = 1 \given X=x) = \frac{\exp\{\beta_0 + \beta^{\top}x\}}{1 + \exp\{\beta_0 + \beta^{\top}x\}}\]

  • There are lots of ways to map \(\R \mapsto [0,1]\).

  • The “logistic” function \(z\mapsto (1 + \exp(-z))^{-1} = \exp(z) / (1+\exp(z)) =:h(z)\) is nice.

  • It’s symmetric: \(1 - h(z) = h(-z)\)

  • Has a nice derivative: \(h'(z) = \frac{\exp(z)}{(1 + \exp(z))^2} = h(z)(1-h(z))\). (A quick numerical check of these properties is sketched after this list.)

  • It’s the inverse of the “log-odds” (logit): \(\log(p / (1-p))\).
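A quick numerical check of the symmetry and derivative claims (a sketch; the function name h and the grid of \(z\) values are just for illustration):

h <- function(z) 1 / (1 + exp(-z))  # the logistic function
z <- seq(-4, 4, length.out = 9)     # arbitrary grid of inputs
all.equal(1 - h(z), h(-z))          # symmetry: 1 - h(z) = h(-z)
# derivative: compare a central finite difference to h(z) * (1 - h(z))
eps <- 1e-6
all.equal((h(z + eps) - h(z - eps)) / (2 * eps), h(z) * (1 - h(z)), tolerance = 1e-6)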

Another linear classifier

Like LDA, logistic regression is a linear classifier

The logit (i.e.: log odds) transformation gives a linear decision boundary \[\log\left( \frac{\P(Y = 1 \given X=x)}{\P(Y = 0 \given X=x) } \right) = \beta_0 + \beta^{\top} x\] The decision boundary is the hyperplane \(\{x : \beta_0 + \beta^{\top} x = 0\}\)

If the log-odds are below 0, classify as 0; if above 0, classify as 1.

Logistic regression is also easy in R

logistic <- glm(y ~ ., dat, family = "binomial")
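To turn the fit into class labels, threshold the predicted log-odds at 0 (equivalently, the predicted probability at 1/2). A minimal sketch, reusing the logistic object fit above:

yhat <- as.integer(predict(logistic, type = "link") > 0)       # log-odds scale
yhat <- as.integer(predict(logistic, type = "response") > 0.5) # same labels, probability scale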

Or we can use lasso or ridge regression or a GAM as before

library(glmnet) # provides cv.glmnet()
library(mgcv)   # provides gam(); the gam package's gam() would also work here
lasso_logit <- cv.glmnet(x, y, family = "binomial")
ridge_logit <- cv.glmnet(x, y, alpha = 0, family = "binomial")
gam_logit <- gam(y ~ s(x), data = dat, family = "binomial")

Baby example (continued from last time)

dat1 <- generate_lda_2d(100, Sigma = .5 * diag(2))
logit <- glm(y ~ ., dat1 |> mutate(y = y - 1), family = "binomial")
summary(logit)

Call:
glm(formula = y ~ ., family = "binomial", data = mutate(dat1, 
    y = y - 1))

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.6649     0.6281  -4.243 2.21e-05 ***
x1            2.5305     0.5995   4.221 2.43e-05 ***
x2            1.6610     0.4365   3.805 0.000142 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.469  on 99  degrees of freedom
Residual deviance:  68.681  on 97  degrees of freedom
AIC: 74.681

Number of Fisher Scoring iterations: 6
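As a rough sanity check (training error only, not a substitute for a test set or CV), we can compare the fitted probabilities to the 0/1 labels, which are dat1$y - 1 as in the fit above:

phat <- predict(logit, type = "response")       # fitted probabilities on the training data
table(predicted = as.integer(phat > 0.5), truth = dat1$y - 1)
mean(as.integer(phat > 0.5) != (dat1$y - 1))    # training misclassification rate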

Visualizing the classification boundary

Code
gr <- expand_grid(x1 = seq(-2.5, 3, length.out = 100), 
                  x2 = seq(-2.5, 3, length.out = 100))
pts <- predict(logit, gr)
g0 <- ggplot(dat1, aes(x1, x2)) +
  scale_shape_manual(values = c("0", "1"), guide = "none") +
  geom_raster(data = tibble(gr, disc = pts), aes(x1, x2, fill = disc)) +
  geom_point(aes(shape = as.factor(y)), size = 4) +
  coord_cartesian(c(-2.5, 3), c(-2.5, 3)) +
  scale_fill_steps2(n.breaks = 6, name = "log odds") 
g0

Calculation

While fitting logistic regression in R is a one-line call, it is not as easy to compute as OLS or LDA or QDA: there is no closed-form solution.

With two classes, the likelihood is easy to write down:

Write \(p_i(\beta) = \P(Y_i = 1 | X = x_i,\beta)\)

  • \(P(Y_i = y_i \given X = x_i, \beta) = p_i^{y_i}(1-p_i)^{1-y_i}\) (…Bernoulli distribution)

  • \(P(\mathbf{Y} \given \mathbf{X}, \beta) = \prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}\).

Calculation

Write \(p_i(\beta) = \P(Y_i = 1 | X = x_i,\beta)\)

\[ \begin{aligned} \ell(\beta) & = \log \left( \prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i} \right)\\ &=\sum_{i=1}^n \left( y_i\log(p_i(\beta)) + (1-y_i)\log(1-p_i(\beta))\right) \\ & = \sum_{i=1}^n \left( y_i\log(e^{\beta^{\top}x_i}/(1+e^{\beta^{\top}x_i})) - (1-y_i)\log(1+e^{\beta^{\top}x_i})\right) \\ & = \sum_{i=1}^n \left( y_i\beta^{\top}x_i -\log(1 + e^{\beta^{\top} x_i})\right) \end{aligned} \]

This gets optimized via Newton-Raphson updates, which for this model take the form of iteratively reweighted least squares (IRWLS).
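Concretely (a standard derivation, sketched here), the gradient and Hessian of \(\ell\) have closed forms, so each Newton step is a weighted least squares solve:

\[ \begin{aligned} \nabla \ell(\beta) &= \sum_{i=1}^n \left(y_i - p_i(\beta)\right) x_i = \X^{\top}(\y - \mathbf{p}) \\ \nabla^2 \ell(\beta) &= -\sum_{i=1}^n p_i(\beta)\left(1-p_i(\beta)\right) x_i x_i^{\top} = -\X^{\top}\mathbf{W}\X \\ \beta^{(t+1)} &= \beta^{(t)} + \left(\X^{\top}\mathbf{W}\X\right)^{-1}\X^{\top}(\y - \mathbf{p}) = \left(\X^{\top}\mathbf{W}\X\right)^{-1}\X^{\top}\mathbf{W}\mathbf{z} \end{aligned} \]

where \(\mathbf{p} = (p_1(\beta),\ldots,p_n(\beta))^{\top}\), \(\mathbf{W} = \mathrm{diag}\left(p_i(\beta)(1-p_i(\beta))\right)\), and \(\mathbf{z} = \X\beta^{(t)} + \mathbf{W}^{-1}(\y - \mathbf{p})\) is the working response that reappears in the code on the next slide.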

IRWLS for logistic regression (skip for now)

(This is preparation for Neural Networks.)

logit_irwls <- function(y, x, maxit = 100, tol = 1e-6) {
  p <- ncol(x)
  beta <- double(p) # initialize coefficients
  beta0 <- 0
  conv <- FALSE # hasn't converged
  iter <- 1 # first iteration
  while (!conv && (iter < maxit)) { # check loops
    iter <- iter + 1 # update first thing (so as not to forget)
    eta <- beta0 + x %*% beta
    mu <- 1 / (1 + exp(-eta))
    gp <- 1 / (mu * (1 - mu)) # inverse of derivative of logistic
    z <- eta + (y - mu) * gp # effective transformed response
    beta_new <- coef(lm(z ~ x, weights = 1 / gp)) # do Weighted Least Squares
    conv <- mean(abs(c(beta0, beta) - beta_new)) < tol # check if the betas are "moving"
    beta0 <- beta_new[1] # update betas
    beta <- beta_new[-1]
  }
  return(c(beta0, beta))
}
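A quick way to sanity-check this function (a sketch, assuming dat1 from the baby example above, with predictors x1, x2 and 0/1 labels dat1$y - 1); the two columns should match closely:

X <- as.matrix(dat1[, c("x1", "x2")])
y01 <- dat1$y - 1
cbind(
  irwls = logit_irwls(y01, X),
  glm = coef(glm(y01 ~ X, family = "binomial"))
)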

Comparing LDA and Logistic regression

Both decision boundaries are linear in \(x\):

  • LDA \(\longrightarrow \alpha_0 + \alpha_1^\top x\)
  • Logit \(\longrightarrow \beta_0 + \beta_1^\top x\).

But the parameters are estimated differently.

Comparing LDA and Logistic regression

Examine the joint distribution of \((X,\ Y)\) (not the posterior):

  • LDA: \(f(X_i,\ Y_i) = \underbrace{ f(X_i \given Y_i)}_{\textrm{Gaussian}}\underbrace{ f(Y_i)}_{\textrm{Bernoulli}}\)

  • Logistic Regression: \(f(X_i,Y_i) = \underbrace{ f(Y_i\given X_i)}_{\textrm{Logistic}}\underbrace{ f(X_i)}_{\textrm{Ignored}}\)

  • LDA estimates the joint distribution, while logistic regression estimates only the conditional (posterior) distribution, which is really all we need for classification.

  • So logistic requires fewer assumptions.

  • But if the two classes are perfectly separable, logistic regression breaks down: the MLE doesn't exist, because the coefficients diverge and many separating hyperplanes fit the training data equally well. (A toy illustration follows this list.)

  • LDA “works” even if the conditional isn’t normal, but works very poorly if any X is qualitative
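A toy illustration of the separation problem (hypothetical data, chosen only so the classes separate perfectly at \(x = 0\)):

x_toy <- c(-2, -1, 1, 2)
y_toy <- c(0, 0, 1, 1)  # the classes separate perfectly at x = 0
sep_fit <- glm(y_toy ~ x_toy, family = "binomial") # typically warns: fitted probabilities numerically 0 or 1
coef(sep_fit) # the slope estimate is huge and would keep growing with more iterations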

Comparing with QDA (2 classes)

  • Recall: this gives a “quadratic” decision boundary (it’s a curve).

  • If we have \(p\) columns in \(X\)

    • Logistic estimates \(p+1\) parameters
    • LDA estimates \(2p + p(p+1)/2 + 1\)
    • QDA estimates \(2p + p(p+1) + 1\)
  • If \(p=50\) (tallied in the snippet after this list),

    • Logistic: 51
    • LDA: 1376
    • QDA: 2651
  • QDA doesn’t get used much: there are better nonlinear versions with way fewer parameters
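These counts are easy to tally directly (a quick check of the arithmetic above):

p <- 50
c(
  logistic = p + 1,                  # coefficients plus intercept
  lda = 2 * p + p * (p + 1) / 2 + 1, # two means, one shared covariance, prior
  qda = 2 * p + p * (p + 1) + 1      # two means, two covariances, prior
)
# logistic: 51, lda: 1376, qda: 2651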

Bad parameter counting

I’ve motivated LDA as needing \(\Sigma\), \(\pi\) and \(\mu_0\), \(\mu_1\)

In fact, we don’t need all of this to get the decision boundary.

So the “degrees of freedom” is much lower if we only want the classes and not the probabilities.

The decision boundary only really depends on

  • \(\Sigma^{-1}(\mu_1-\mu_0)\)
  • \((\mu_1+\mu_0)\),
  • so appropriate algorithms estimate \(<2p\) parameters (a short derivation follows this list).
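Sketching the standard calculation (under the LDA model with shared \(\Sigma\) and \(\pi = \P(Y=1)\)), the log-odds are

\[ \log\left(\frac{\P(Y=1\given X=x)}{\P(Y=0\given X=x)}\right) = \log\left(\frac{\pi}{1-\pi}\right) + x^{\top}\Sigma^{-1}(\mu_1-\mu_0) - \frac{1}{2}(\mu_1+\mu_0)^{\top}\Sigma^{-1}(\mu_1-\mu_0), \]

so the boundary depends on \(x\) only through \(\Sigma^{-1}(\mu_1-\mu_0)\) (that is, \(p\) parameters) plus a single scalar intercept.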

Note again:

while logistic regression and LDA produce linear decision boundaries, they are not linear smoothers

AIC/BIC/Cp work if you use the likelihood correctly and count degrees-of-freedom correctly

Most people use either a test set or CV.

Next time:

Nonlinear classifiers