00 Evaluating classifiers

Stat 406

Daniel J. McDonald

Last modified – 16 October 2023


How do we measure accuracy?

So far — 0-1 loss: if we predict the correct class, we lose 0; otherwise we lose 1.

Asymmetric classification loss — if we predict the correct class, we lose 0; otherwise we lose an amount that depends on which error we made.

For example, consider facial recognition. The possible classes are “person OK”, “person has expired passport”, and “person is a known terrorist”.

  1. If classify OK, but was terrorist, lose 1,000,000
  2. If classify OK, but expired passport, lose 2
  3. If classify terrorist, but was OK, lose 100
  4. If classify terrorist, but was expired passport, lose 10
  5. etc.

This results in a 3x3 matrix of losses with 0 on the diagonal, for example:

        [,1]  [,2] [,3]
[1,]       0     2   30
[2,]      10     0  100
[3,] 1000000 50000    0
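
With an asymmetric loss, the best prediction need not be the most probable class. Here is a minimal sketch (not from the slides; the indexing convention L[true class, predicted class], the example probabilities, and the function name are assumptions) of predicting the class that minimizes expected loss:

L <- rbind(
  c(0,   2,   30),
  c(10,  0,   100),
  c(1e6, 5e4, 0)
)
# probs: n x 3 matrix of estimated class probabilities, one row per observation.
# Expected loss of predicting class k for row i is sum_j probs[i, j] * L[j, k].
predict_min_loss <- function(probs, L) {
  expected_loss <- probs %*% L          # n x 3 matrix of expected losses
  apply(expected_loss, 1, which.min)    # class with smallest expected loss
}
predict_min_loss(matrix(c(0.98, 0.01, 0.01), nrow = 1), L)

Even though “OK” is by far the most probable class here, the small chance of the catastrophic error makes the third class the prediction with the smallest expected loss.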

Deviance loss

Sometimes we output probabilities as well as class labels.

For example, logistic regression returns the probability that an observation is in class 1. \(P(Y_i = 1 \mid x_i) = 1 / (1 + \exp\{-x'_i \hat\beta\})\)

LDA and QDA produce probabilities as well. So do Neural Networks (typically)

(Trees “don’t”, and neither does KNN, though you could fake it using the class proportions in a leaf or a neighbourhood.)


  • Deviance loss for 2-class classification is \(-2\,\textrm{loglikelihood}(y, \hat{p}) = -2 \sum_{i}\left( y_i x'_i\hat{\beta} - \log\left(1 + e^{x'_i\hat{\beta}}\right) \right)\)

(Technically, the deviance is the difference between this and the log-likelihood of the saturated model, but people play fast and loose.)

  • Could also use cross entropy or Gini index.
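
As a quick sketch (the function name and the clamping constant are mine; `y` is a 0/1 label vector and `phat` the fitted probabilities, e.g. `dat$y` and `dat$phat` constructed in the next section), the deviance loss can be computed directly from the fitted probabilities:

# Per-observation deviance: negative twice the Bernoulli log-likelihood.
deviance_loss <- function(y, phat, eps = 1e-12) {
  phat <- pmin(pmax(phat, eps), 1 - eps) # clamp to avoid log(0)
  -2 * (y * log(phat) + (1 - y) * log(1 - phat))
}
# Summed over observations, this matches deviance(fit) for a binomial glm
# with a 0/1 response: sum(deviance_loss(dat$y, dat$phat))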

Calibration

Suppose we predict some probabilities for our data. How often do those events actually happen?

In principle, if we predict \(\hat{p}(x_i)=0.2\) for a bunch of observations \(i\), we’d like to see about 20% of them with \(y_i = 1\) and 80% with \(y_i = 0\). (In both the training set and the test set.)

The same goes for the other probabilities: if we say “20% chance of rain”, it should rain on about 20% of such days.

Of course, we never predict exactly \(\hat{p}(x_i)=0.2\), so let’s look at all observations with \(\hat{p}(x_i) \in [.15, .25]\).

library(tidyverse) # tibble, dplyr, ggplot2

n <- 250
dat <- tibble(
  x = seq(-5, 5, length.out = n),
  p = 1 / (1 + exp(-x)), # true P(Y = 1 | x), a logistic function of x
  y = rbinom(n, 1, p)    # simulate labels from the true probabilities
)
fit <- glm(y ~ x, family = binomial, data = dat)
dat$phat <- predict(fit, type = "response") # predicted probabilities
dat |>
  filter(phat > .15, phat < .25) |>
  summarize(target = .2, obs = mean(y))
# A tibble: 1 × 2
  target   obs
   <dbl> <dbl>
1    0.2 0.222

Calibration plot

# Calibration plot: bin the predicted probabilities, then compare the average
# prediction in each bin (x-axis) to the observed frequency of y = 1 (y-axis).
# `blue` and `orange` are colour strings defined in the course theme.
binary_calibration_plot <- function(y, phat, nbreaks = 10) {
  dat <- tibble(y = y, phat = phat) |>
    mutate(bins = cut_number(phat, n = nbreaks)) # bins with roughly equal counts
  # bin midpoints on the predicted-probability scale
  midpts <- quantile(dat$phat, seq(0, 1, length.out = nbreaks + 1), na.rm = TRUE)
  midpts <- midpts[-length(midpts)] + diff(midpts) / 2
  sum_dat <- dat |>
    group_by(bins) |>
    summarise(
      p = mean(y, na.rm = TRUE),   # observed frequency of y = 1 in the bin
      se = sqrt(p * (1 - p) / n()) # binomial standard error
    ) |>
    arrange(p)
  sum_dat$x <- midpts

  ggplot(sum_dat, aes(x = x)) +
    geom_errorbar(aes(ymin = pmax(p - 1.96 * se, 0), ymax = pmin(p + 1.96 * se, 1))) +
    geom_point(aes(y = p), colour = blue) +
    geom_abline(slope = 1, intercept = 0, colour = orange) + # perfect calibration
    ylab("observed frequency") +
    xlab("average predicted probability") +
    coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
    geom_rug(data = dat, aes(x = phat), sides = "b")
}

Amazingly well-calibrated

binary_calibration_plot(dat$y, dat$phat, 20L)

Less well-calibrated
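
One way to produce something like this (a sketch, not the data behind the original figure): distort the fitted probabilities to be overconfident by doubling them on the logit scale, then re-draw the plot. The calibration curve comes out flatter than the diagonal.

phat_over <- plogis(2 * qlogis(dat$phat)) # overconfident: logits doubled
binary_calibration_plot(dat$y, phat_over, 20L)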

True positive, false negative, sensitivity, specificity

True positive rate (TPR)
# correctly predicted positive / # actually positive (= 1 - FNR)
False negative rate (FNR)
# incorrectly predicted negative / # actually positive (= 1 - TPR); the Type II error rate
True negative rate (TNR)
# correctly predicted negative / # actually negative (= 1 - FPR)
False positive rate (FPR)
# incorrectly predicted positive / # actually negative (= 1 - TNR); the Type I error rate
Sensitivity
another name for the TPR; 1 - Type II error rate
Specificity
another name for the TNR; 1 - Type I error rate
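
As a sketch using the simulated `dat` from above and the default 0.5 cutoff (cutoffs are discussed next), these rates can be read off a confusion table:

yhat <- as.integer(dat$phat > 0.5)                   # predicted class at cutoff 0.5
confusion <- table(predicted = yhat, actual = dat$y) # assumes both classes appear
TPR <- confusion["1", "1"] / sum(confusion[, "1"])   # sensitivity
TNR <- confusion["0", "0"] / sum(confusion[, "0"])   # specificity
c(TPR = TPR, TNR = TNR, FPR = 1 - TNR, FNR = 1 - TPR)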

ROC and thresholds

ROC (Receiver Operating Characteristic) curve
TPR (sensitivity) vs. FPR (1 - specificity) as the threshold varies
AUC (Area Under the Curve)
The integral of the ROC curve; closer to 1 is better

So far, we’ve been thresholding at 0.5, though you shouldn’t always do that.

With unbalanced data (say 10% 0 and 90% 1), if you care equally about predicting both classes, you might want to choose a different cutoff (like in LDA).

To make the ROC curve, we compute the TPR and FPR as we vary the cutoff:

ROC curve

# Compute the ROC curve: sort observations by predicted score (most confident
# first), then accumulate true/false positives as the threshold decreases.
roc <- function(prediction, y) {
  op <- order(prediction, decreasing = TRUE)
  preds <- prediction[op]
  y <- y[op]
  noty <- 1 - y
  if (any(duplicated(preds))) { # tied scores must move across the threshold together
    y <- rev(tapply(y, preds, sum))
    noty <- rev(tapply(noty, preds, sum))
  }
  tibble(
    FPR = cumsum(noty) / sum(noty), # false positives among actual negatives
    TPR = cumsum(y) / sum(y)        # true positives among actual positives
  )
}

ggplot(roc(dat$phat, dat$y), aes(FPR, TPR)) +
  geom_step(colour = blue, linewidth = 2) +
  geom_abline(slope = 1, intercept = 0)
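
The AUC can then be approximated from the same output with the trapezoidal rule (a sketch with a made-up function name; packages such as pROC will compute this for you):

auc <- function(roc_tab) {
  fpr <- c(0, roc_tab$FPR) # prepend the (0, 0) corner of the curve
  tpr <- c(0, roc_tab$TPR)
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2) # trapezoid areas
}
auc(roc(dat$phat, dat$y))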

Other stuff