Stat 406
Daniel J. McDonald
Last modified – 25 October 2023
\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \]
\[g_*(X) = \begin{cases} 1 & \textrm{ if } \frac{p_1(X)}{p_0(X)} > \frac{1-\pi}{\pi} \\ 0 & \textrm{ otherwise} \end{cases}\]
where \(p_1(X) = Pr(X \given Y=1)\) and \(p_0(X) = Pr(X \given Y=0)\)
We then used this distribution and the class prior \(\pi\) to find the posterior \(Pr(Y=1 \given X=x)\).
Instead, let’s directly model the posterior
\[ \begin{aligned} Pr(Y = 1 \given X=x) & = \frac{\exp\{\beta_0 + \beta^{\top}x\}}{1 + \exp\{\beta_0 + \beta^{\top}x\}} \\ Pr(Y = 0 \given X=x) & = \frac{1}{1 + \exp\{\beta_0 + \beta^{\top}x\}} = 1-\frac{\exp\{\beta_0 + \beta^{\top}x\}}{1 + \exp\{\beta_0 + \beta^{\top}x\}} \end{aligned} \]
This is logistic regression.
\[Pr(Y = 1 \given X=x) = \frac{\exp\{\beta_0 + \beta^{\top}x\}}{1 + \exp\{\beta_0 + \beta^{\top}x\}}\]
There are lots of ways to map \(\R \to [0,1]\).
The “logistic” function \(z\mapsto (1 + \exp(-z))^{-1} = \exp(z) / (1+\exp(z)) =:h(z)\) is nice.
It’s symmetric: \(1 - h(z) = h(-z)\)
It has a nice derivative: \(h'(z) = \frac{\exp(z)}{(1 + \exp(z))^2} = h(z)(1-h(z))\).
It’s the inverse of the “log-odds” (logit): \(\log(p / (1-p))\).
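These properties are easy to check numerically. A minimal sketch in R (base R provides \(h\) as plogis()):

h <- function(z) 1 / (1 + exp(-z)) # the logistic function, same as plogis(z)
z <- 1.5; p <- 0.7
all.equal(1 - h(z), h(-z)) # symmetry
all.equal(h(z) * (1 - h(z)), (h(z + 1e-6) - h(z - 1e-6)) / 2e-6) # derivative, via central difference
all.equal(h(log(p / (1 - p))), p) # h inverts the logit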
Like LDA, logistic regression is a linear classifier
The logit (i.e.: log odds) transformation gives a linear decision boundary \[\log\left( \frac{\P(Y = 1 \given X=x)}{\P(Y = 0 \given X=x) } \right) = \beta_0 + \beta^{\top} x\] The decision boundary is the hyperplane \(\{x : \beta_0 + \beta^{\top} x = 0\}\)
If the log odds are above 0, classify as 1; otherwise, classify as 0.
Or we can use the lasso, ridge regression, or a GAM, as before.
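For instance, with the glmnet package the only change from penalized linear regression is the family argument. A sketch, assuming xmat is a numeric predictor matrix and yvec a 0/1 response (these names are placeholders):

library(glmnet)
lasso_logit <- cv.glmnet(xmat, yvec, family = "binomial") # lasso penalty (alpha = 1) by default
ridge_logit <- cv.glmnet(xmat, yvec, family = "binomial", alpha = 0) # ridge penalty
predict(lasso_logit, newx = xmat, type = "response", s = "lambda.min") # estimated probabilities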
dat1 <- generate_lda_2d(100, Sigma = .5 * diag(2))
logit <- glm(y ~ ., dat1 |> mutate(y = y - 1), family = "binomial")
summary(logit)
Call:
glm(formula = y ~ ., family = "binomial", data = mutate(dat1,
y = y - 1))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.6649 0.6281 -4.243 2.21e-05 ***
x1 2.5305 0.5995 4.221 2.43e-05 ***
x2 1.6610 0.4365 3.805 0.000142 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.469 on 99 degrees of freedom
Residual deviance: 68.681 on 97 degrees of freedom
AIC: 74.681
Number of Fisher Scoring iterations: 6
gr <- expand_grid(x1 = seq(-2.5, 3, length.out = 100),
x2 = seq(-2.5, 3, length.out = 100))
pts <- predict(logit, gr)
g0 <- ggplot(dat1, aes(x1, x2)) +
scale_shape_manual(values = c("0", "1"), guide = "none") +
geom_raster(data = tibble(gr, disc = pts), aes(x1, x2, fill = disc)) +
geom_point(aes(shape = as.factor(y)), size = 4) +
coord_cartesian(c(-2.5, 3), c(-2.5, 3)) +
scale_fill_steps2(n.breaks = 6, name = "log odds")
g0
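To get hard class labels instead of log odds, threshold the probabilities at 1/2 (equivalently, the log odds at 0). Continuing with logit and dat1 from above:

phat <- predict(logit, type = "response") # probabilities rather than log odds
yhat <- as.integer(phat > 0.5) # same as predict(logit) > 0
mean(yhat == (dat1$y - 1)) # training accuracy (y was shifted to 0/1 for the fit)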
While the R formula for logistic regression is straightforward, it’s not as easy to compute as OLS or LDA or QDA.
For two classes, logistic regression corresponds to maximizing a Bernoulli likelihood:
Write \(p_i(\beta) = \P(Y_i = 1 | X = x_i,\beta)\)
\(P(Y_i = y_i \given X = x_i, \beta) = p_i^{y_i}(1-p_i)^{1-y_i}\) (…Bernoulli distribution)
\(P(\mathbf{Y} \given \mathbf{X}, \beta) = \prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}\).
\[ \begin{aligned} \ell(\beta) & = \log \left( \prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i} \right)\\ &=\sum_{i=1}^n \left( y_i\log(p_i(\beta)) + (1-y_i)\log(1-p_i(\beta))\right) \\ & = \sum_{i=1}^n \left( y_i\log(e^{\beta^{\top}x_i}/(1+e^{\beta^{\top}x_i})) - (1-y_i)\log(1+e^{\beta^{\top}x_i})\right) \\ & = \sum_{i=1}^n \left( y_i\beta^{\top}x_i -\log(1 + e^{\beta^{\top} x_i})\right) \end{aligned} \]
This gets optimized via Newton-Raphson updates, which here take the form of iteratively reweighted least squares (IRWLS).
(This is preparation for Neural Networks.)
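Filling in the step between the log-likelihood and the algorithm: the score and Hessian of \(\ell\) (absorbing \(\beta_0\) into \(\beta\)) are \[ \begin{aligned} \nabla \ell(\beta) &= \sum_{i=1}^n (y_i - p_i(\beta))\, x_i = \X^{\top}(\y - p), \\ \nabla^2 \ell(\beta) &= -\sum_{i=1}^n p_i(\beta)(1-p_i(\beta))\, x_i x_i^{\top} = -\X^{\top}W\X, \end{aligned} \] with \(W = \textrm{diag}\left(p_i(1-p_i)\right)\). A Newton step is therefore \[\beta^{\textrm{new}} = \beta + (\X^{\top}W\X)^{-1}\X^{\top}(\y - p) = (\X^{\top}W\X)^{-1}\X^{\top}W z, \qquad z = \X\beta + W^{-1}(\y - p),\] i.e. a weighted least squares regression of the “working response” \(z\) on \(\X\), with \(W\) and \(z\) recomputed each iteration. That is what the function below implements.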
logit_irwls <- function(y, x, maxit = 100, tol = 1e-6) {
  p <- ncol(x)
  beta <- double(p) # initialize coefficients at zero
  beta0 <- 0
  conv <- FALSE # hasn't converged
  iter <- 1 # first iteration
  while (!conv && (iter < maxit)) { # limit the number of loops
    iter <- iter + 1 # update first thing (so as not to forget)
    eta <- drop(beta0 + x %*% beta) # linear predictor
    mu <- 1 / (1 + exp(-eta)) # fitted probabilities
    gp <- 1 / (mu * (1 - mu)) # derivative of the logit link at mu (reciprocal of h')
    z <- eta + (y - mu) * gp # working (transformed) response
    beta_new <- coef(lm(z ~ x, weights = 1 / gp)) # do weighted least squares
    conv <- mean(abs(c(beta0, beta) - beta_new)) < tol # check if the betas are still "moving"
    beta0 <- beta_new[1] # update betas
    beta <- beta_new[-1]
  }
  return(c(beta0, beta))
}
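A quick sanity check against glm(), reusing dat1 and logit from above (generate_lda_2d labels the classes 1/2, hence the shift):

xmat <- as.matrix(dat1[, c("x1", "x2")])
cbind(irwls = logit_irwls(dat1$y - 1, xmat), glm = coef(logit)) # columns should agree closely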
Both decision boundaries are linear in \(x\):
But the parameters are estimated differently.
Examine the joint distribution of \((X,\ Y)\) (not the posterior):
LDA: \(f(X_i,\ Y_i) = \underbrace{ f(X_i \given Y_i)}_{\textrm{Gaussian}}\underbrace{ f(Y_i)}_{\textrm{Bernoulli}}\)
Logistic Regression: \(f(X_i,Y_i) = \underbrace{ f(Y_i\given X_i)}_{\textrm{Logistic}}\underbrace{ f(X_i)}_{\textrm{Ignored}}\)
LDA estimates the joint distribution, while logistic regression estimates only the conditional (posterior) distribution. But the posterior is all we really need for classification.
So logistic requires fewer assumptions.
But if the two classes are perfectly separable, logistic regression breaks: the MLE is undefined, because the likelihood keeps growing as the coefficients diverge (see the sketch below).
LDA “works” even if the conditional distribution isn’t Gaussian, but it works very poorly if any column of X is qualitative.
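A minimal sketch of the separable failure mode, on made-up data:

xx <- c(-2, -1, 1, 2)
yy <- c(0, 0, 1, 1) # perfectly separated at xx = 0
coef(glm(yy ~ xx, family = "binomial")) # R warns that fitted probabilities are
# numerically 0 or 1; the slope estimate is enormous and grows with more iterations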
Recall QDA (a separate \(\Sigma_k\) for each class): this gives a “quadratic” decision boundary (it’s a curve).
If we have \(p\) columns in \(X\), LDA needs \(\pi\) (1 parameter), \(\mu_0\) and \(\mu_1\) (\(2p\)), and the shared \(\Sigma\) (\(p(p+1)/2\)), while QDA needs a separate covariance matrix per class (\(p(p+1)\) in total).
If \(p=50\), that’s 1376 parameters for LDA versus 2651 for QDA (tallied below).
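A one-line check of those counts:

p <- 50
c(lda = 1 + 2 * p + p * (p + 1) / 2, qda = 1 + 2 * p + p * (p + 1))
 lda  qda
1376 2651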
QDA doesn’t get used much: there are better nonlinear versions with way fewer parameters
I’ve motivated LDA as needing \(\Sigma\), \(\pi\) and \(\mu_0\), \(\mu_1\)
In fact, we don’t need all of this to get the decision boundary.
So the “degrees of freedom” are much lower if we only want the classes and not the probabilities.
The decision boundary only really depends on the vector \(\Sigma^{-1}(\mu_1-\mu_0)\) and a scalar intercept.
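To see why, write out the Gaussian log odds under the LDA model: \[\log\left(\frac{\P(Y=1\given X=x)}{\P(Y=0\given X=x)}\right) = x^{\top}\Sigma^{-1}(\mu_1-\mu_0) - \frac{1}{2}(\mu_1+\mu_0)^{\top}\Sigma^{-1}(\mu_1-\mu_0) + \log\left(\frac{\pi}{1-\pi}\right).\] Only the vector \(\Sigma^{-1}(\mu_1-\mu_0)\) (\(p\) numbers) and the scalar offset enter, so the boundary uses \(p+1\) quantities rather than the \(O(p^2)\) needed to estimate \(\Sigma\) itself.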
While logistic regression and LDA produce linear decision boundaries, they are not linear smoothers.
AIC/BIC/Cp work if you use the likelihood correctly and count degrees-of-freedom correctly
Most people use either a test set or CV.
Nonlinear classifiers
UBC Stat 406 - 2023