Stat 406
Geoff Pleiss, Trevor Campbell
Last modified – 16 October 2023
\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \]
So far: 0-1 loss. If we predict the correct class we lose 0; otherwise we lose 1.
Asymmetric classification loss: if we predict the correct class we lose 0; otherwise we lose an amount that depends on which mistake we made.
For example, consider facial recognition. The goal is to label each face as “person OK”, “person has an expired passport”, or “person is a known terrorist”.
This results in a 3×3 matrix of losses with 0 on the diagonal:
[,1] [,2] [,3]
[1,] 0 2 30
[2,] 10 0 100
[3,] 1000000 50000 0
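As a minimal sketch (assuming rows index the true class and columns the predicted class; the probabilities below are made up), we can encode this matrix in R and predict the class that minimizes expected loss:

L <- matrix(c(
  0,     2,    30,
  10,    0,   100,
  1e6,   5e4,   0
), nrow = 3, byrow = TRUE) # rows = true class, columns = predicted class
probs <- c(0.7, 0.25, 0.05) # hypothetical class probabilities for one observation
expected_loss <- as.vector(probs %*% L) # expected loss of each possible prediction
which.min(expected_loss) # predict the class with the smallest expected loss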
Sometimes we output probabilities as well as class labels.
For example, logistic regression returns the probability that an observation is in class 1. \(P(Y_i = 1 \given x_i) = 1 / (1 + \exp\{-x'_i \hat\beta\})\)
LDA and QDA produce probabilities as well, and so (typically) do neural networks.
(Trees “don’t”, and neither does KNN, though you could fake it.)
We can measure the quality of these probabilities with the deviance, \(-2\) times the log likelihood: \(-2\sum_{i=1}^n \left[ y_i \log \hat{p}(x_i) + (1 - y_i) \log(1 - \hat{p}(x_i)) \right]\).
(Technically, it’s the difference between this and the loss of the null model, but people play fast and loose.)
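As a sketch on simulated data (the sample size, coefficients, and the dat tibble below are all illustrative), glm() produces these probabilities via predict(..., type = "response"):

library(tidyverse) # tibble, dplyr, and ggplot2, used throughout

set.seed(406)
n <- 250
x <- rnorm(n)
y <- rbinom(n, 1, 1 / (1 + exp(-(1 + 2 * x)))) # data generated from a logistic model
fit <- glm(y ~ x, family = binomial)
phat <- predict(fit, type = "response") # estimated P(Y_i = 1 | x_i)
dat <- tibble(y = y, phat = phat) # reused in the calibration checks below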
Suppose we predict some probabilities for our data: how often do those events actually happen?
In principle, if we predict \(\hat{p}(x_i)=0.2\) for a bunch of observations \(i\), we’d like to see about 20% 1s and 80% 0s (in both the training set and the test set).
The same goes for the other probabilities: if we say “20% chance of rain”, it should rain on about 20% of such days.
Of course, we never predict exactly \(\hat{p}(x_i)=0.2\), so let’s look at \([.15, .25]\).
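A check along these lines (using the dat tibble from the sketch above; the numbers in the output below come from the course’s own data) produces output like the tibble that follows:

dat |>
  filter(phat > 0.15, phat < 0.25) |> # observations with phat near 0.2
  summarise(target = 0.2, obs = mean(y)) # observed frequency of 1s among them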
# A tibble: 1 × 2
target obs
<dbl> <dbl>
1 0.2 0.222
# Plot observed frequency against average predicted probability in
# nbreaks equal-count bins; a well-calibrated model hugs the 45-degree line.
binary_calibration_plot <- function(y, phat, nbreaks = 10) {
  dat <- tibble(y = y, phat = phat) |>
    mutate(bins = cut_number(phat, n = nbreaks)) # equal-count bins of phat
  # bin midpoints, halfway between consecutive quantiles of phat
  midpts <- quantile(dat$phat, seq(0, 1, length.out = nbreaks + 1), na.rm = TRUE)
  midpts <- midpts[-length(midpts)] + diff(midpts) / 2
  sum_dat <- dat |>
    group_by(bins) |>
    summarise(
      p = mean(y, na.rm = TRUE), # observed frequency of 1s in the bin
      se = sqrt(p * (1 - p) / n()) # binomial standard error
    )
  # bins come out in increasing order, so the midpoints line up with them
  sum_dat$x <- midpts
  ggplot(sum_dat, aes(x = x)) +
    geom_errorbar(aes(ymin = pmax(p - 1.96 * se, 0), ymax = pmin(p + 1.96 * se, 1))) +
    geom_point(aes(y = p), colour = blue) +
    geom_abline(slope = 1, intercept = 0, colour = orange) + # perfect calibration
    ylab("observed frequency") +
    xlab("average predicted probability") +
    coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
    geom_rug(data = dat, aes(x = phat), sides = "b") # distribution of phat
}
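For example (the hex codes below are stand-ins for the course theme colours blue and orange used inside the function):

blue <- "#0072B2"; orange <- "#E69F00" # assumed colour definitions
binary_calibration_plot(dat$y, dat$phat, nbreaks = 10)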
So far, we’ve been thresholding at 0.5, but you shouldn’t always do that.
With unbalanced data (say 10% 0s and 90% 1s), if you care equally about predicting both classes, you might want to choose a different cutoff (as in LDA).
To make the ROC curve, we look at our errors as we vary the cutoff.
# Trace out the ROC curve by sweeping the cutoff from 1 down to 0.
roc <- function(prediction, y) {
  op <- order(prediction, decreasing = TRUE) # sort by predicted probability
  preds <- prediction[op]
  y <- y[op]
  noty <- 1 - y
  if (any(duplicated(preds))) {
    # with ties, aggregate the counts at each distinct prediction;
    # rev() restores decreasing order (tapply sorts its groups increasing)
    y <- rev(tapply(y, preds, sum))
    noty <- rev(tapply(noty, preds, sum))
  }
  tibble(
    FPR = cumsum(noty) / sum(noty), # false positive rate at each cutoff
    TPR = cumsum(y) / sum(y) # true positive rate at each cutoff
  )
}
ggplot(roc(dat$phat, dat$y), aes(FPR, TPR)) +
geom_step(colour = blue, linewidth = 2) +
geom_abline(slope = 1, intercept = 0)
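A common one-number summary of this curve is the area under it (AUC); here is a sketch using the trapezoid rule on the output of roc():

auc <- function(roc_tbl) { # trapezoid-rule area under the (FPR, TPR) curve
  with(roc_tbl, sum(diff(c(0, FPR)) * (TPR + c(0, head(TPR, -1))) / 2))
}
auc(roc(dat$phat, dat$y)) # 0.5 = random ranking, 1 = perfect ranking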