14 Classification

Stat 406

Daniel J. McDonald

Last modified – 09 October 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \]

An Overview of Classification

  • A person arrives at an emergency room with a set of symptoms that could be 1 of 3 possible conditions. Which one is it?

  • An online banking service must be able to determine whether each transaction is fraudulent or not, using a customer’s location, past transaction history, etc.

  • Given a set of individuals’ sequenced DNA, can we determine whether various mutations are associated with different phenotypes?

These problems are not regression problems. They are classification problems.

The Set-up

It begins just like regression: suppose we have observations \[\{(x_1,y_1),\ldots,(x_n,y_n)\}\]

Again, we want to estimate a function that maps \(X\) to \(Y\), to predict as-yet-unobserved data.

(This function is known as a classifier)

The same constraints apply:

  • We want a classifier that predicts test data, not just the training data.

  • Often, this comes with the introduction of some bias to get lower variance and better predictions.

How do we measure quality?

Previously, in regression, we had \(y_i \in \mathbb{R}\) and used squared-error loss to measure accuracy: \((y - \hat{y})^2\).

Instead, let \(y \in \mathcal{K} = \{1,\ldots, K\}\)

(This is arbitrary; sometimes other labels, such as \(\{-1,1\}\), will be used)

We can always take “factors” like \(\{\textrm{cat},\textrm{dog}\}\) and convert them to integers, which is what we assume here.

We again make predictions \(\hat{y}=k\) based on the data

  • We get zero loss if we predict the right class
  • We lose \(\ell(k,k')\) if we predict \(k\) when the truth is \(k'\) (for \(k\neq k'\))

How do we measure quality?

Suppose you have a fever of 39°C. You get a rapid test on campus.

Loss    | Test +    | Test -
Are +   | 0         | Infect others
Are -   | Isolation | 0

How do we measure quality?

Suppose you have a fever of 39°C. You get a rapid test on campus.

Loss    | Test + | Test -
Are +   | 0      | 1
Are -   | 1      | 0

How do we measure quality?

We’re going to use \(g(x)\) to denote our classifier. It takes values in \(\mathcal{K}\).

How do we measure quality?

Again, we appeal to risk \[R_n(g) = E [\ell(Y,g(X))]\] If we use the law of total probability, this can be written \[R_n(g) = E_X \sum_{y=1}^K \ell(y,\; g(X)) Pr(Y = y \given X)\] We minimize this over a class of options \(\mathcal{G}\), to produce \[g_*(X) = \argmin_{g\in\mathcal{G}} E_X \sum_{y=1}^K \ell(y,g(X)) Pr(Y = y \given X)\]

How do we measure quality?

\(g_*\) is called the Bayes’ classifier for loss \(\ell\) in class \(\mathcal{G}\).

\(R_n(g_*)\) is called the Bayes’ limit or Bayes’ risk.

It’s the best we could hope to do in terms of \(\ell\) if we knew the distribution of the data.

But we don’t, so we’ll try to do our best to estimate \(g_*\).

Best classifier overall

(for now, we limit to 2 classes)

Once we make a specific choice for \(\ell\), we can find \(g_*\) exactly (pretending we know the distribution)

Because \(Y\) takes only a few values, zero-one loss is natural (but not the only option) \[\ell(y,\ g(x)) = \begin{cases}0 & y=g(x)\\1 & y\neq g(x) \end{cases} \Longrightarrow R_n(g) = \Expect{\ell(Y,\ g(X))} = Pr(g(X) \neq Y),\]

Best classifier overall

Loss    | Test + | Test -
Are +   | 0      | 1
Are -   | 1      | 0

Best classifier overall

This means we want to classify a new observation \((x_0,y_0)\) such that \(g(x_0) = y_0\) as often as possible

Under this loss, we have \[ \begin{aligned} g_*(X) &= \argmin_{g} Pr(g(X) \neq Y) \\ &= \argmin_{g} \left[ 1 - Pr(Y = g(x) | X=x)\right] \\ &= \argmax_{g} Pr(Y = g(x) | X=x ) \end{aligned} \]
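Under zero-one loss, the rule above says: predict the most probable class. A tiny numeric sketch (the probabilities here are made up for illustration, not from the slides):

```r
# Hypothetical conditional probabilities Pr(Y = k | X = x) at a single
# point x, for K = 3 classes (made-up numbers for illustration)
prob_y_given_x <- c(0.2, 0.5, 0.3)

# The Bayes classifier under 0-1 loss picks the most probable class...
g_star_at_x <- which.max(prob_y_given_x)

# ...and the conditional risk at x is the probability of being wrong
risk_at_x <- 1 - max(prob_y_given_x)

g_star_at_x # 2
risk_at_x   # 0.5
```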

Estimating \(g_*\)

Classifier approach 1 (empirical risk minimization):

  1. Choose some class of classifiers \(\mathcal{G}\).

  2. Find \(\argmin_{g\in\mathcal{G}} \sum_{i = 1}^n I(g(x_i) \neq y_i)\)
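A minimal sketch of this recipe, taking \(\mathcal{G}\) to be threshold classifiers \(g_t(x) = 1\{x > t\}\) on simulated data (both the class and the data-generating process are assumptions for illustration):

```r
set.seed(406)
n <- 100
x <- runif(n)
y <- rbinom(n, 1, ifelse(x > 0.5, 0.9, 0.1)) # class 1 likely when x > 0.5

# Empirical 0-1 risk of the threshold classifier g_t(x) = 1(x > t)
emp_risk <- function(t) mean((x > t) != y)

# The objective is a step function of t, so minimize by grid search
ts <- seq(0, 1, length.out = 101)
risks <- sapply(ts, emp_risk)
t_hat <- ts[which.min(risks)]
```

Even in this one-parameter class, the training error is piecewise constant in \(t\), which previews why empirical risk minimization is hard in general.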

Bayes’ Classifier and class densities (2 classes)

Using Bayes’ theorem, and recalling that \(f_*(X) = E[Y \given X]\)

\[\begin{aligned} f_*(X) & = E[Y \given X] = Pr(Y = 1 \given X) \\ &= \frac{Pr(X\given Y=1) Pr(Y=1)}{Pr(X)}\\ & =\frac{Pr(X\given Y = 1) Pr(Y = 1)}{\sum_{k \in \{0,1\}} Pr(X\given Y = k) Pr(Y = k)} \\ & = \frac{p_1(X) \pi}{ p_1(X)\pi + p_0(X)(1-\pi)}\end{aligned}\]

  • We call \(p_k(X)\) the class (conditional) densities

  • \(\pi\) is the marginal probability \(P(Y=1)\)

Bayes’ Classifier and class densities (2 classes)

The Bayes’ Classifier (best classifier for 0-1 loss) can be rewritten

\[g_*(X) = \begin{cases} 1 & \textrm{ if } \frac{p_1(X)}{p_0(X)} > \frac{1-\pi}{\pi} \\ 0 & \textrm{ otherwise} \end{cases}\]

Approach 2: estimate everything in the expression above.

  • We need to estimate \(p_0\), \(p_1\), and \(\pi\)
  • Easily extended to more than two classes
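A sketch of approach 2, assuming (purely for illustration) Gaussian class densities with known form; estimating the densities properly is next lecture’s topic:

```r
set.seed(406)
n <- 500
y <- rbinom(n, 1, 0.4)                     # true pi = Pr(Y = 1) = 0.4
x <- rnorm(n, mean = ifelse(y == 1, 2, 0)) # class densities: N(2,1) vs N(0,1)

# Estimate everything in the ratio rule above
pi_hat <- mean(y)
mu0_hat <- mean(x[y == 0])
mu1_hat <- mean(x[y == 1])

g_hat <- function(x0) {
  ratio <- dnorm(x0, mean = mu1_hat) / dnorm(x0, mean = mu0_hat) # p1 / p0
  as.integer(ratio > (1 - pi_hat) / pi_hat)
}

g_hat(c(-1, 1, 3)) # far-left points go to class 0, far-right to class 1
```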

An alternative easy classifier

Zero-one loss was natural, but let’s try something else.

Let’s try using squared error loss instead: \(\ell(y,\ f(x)) = (y - f(x))^2\)

Then, the Bayes’ Classifier (the function that minimizes the Bayes Risk) is \[g_*(x) = f_*(x) = E[ Y \given X = x] = Pr(Y = 1 \given X)\] (recall that \(f_* \in [0,1]\) is still the regression function)

In this case, our prediction will actually just be a probability. But a probability isn’t a class, so it’s a bit unsatisfying.

How do we get a class prediction?

Discretize the probability:

\[g(x) = \begin{cases}0 & f_*(x) < 1/2\\1 & \textrm{else}\end{cases}\]

Estimating \(g_*\)

Approach 3:

  1. Estimate \(f_*\) using any method we’ve learned so far.
  2. Predict 0 if \(\hat{f}(x)\) is less than 1/2, else predict 1.
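A minimal sketch of approach 3 on simulated data (the regression method is arbitrary; plain linear regression is used here, a crude choice revisited at the end of these slides):

```r
set.seed(406)
n <- 100
x <- runif(n)
y <- rbinom(n, 1, ifelse(x > 0.5, 0.9, 0.1))

# Step 1: estimate f_* by (here) linear regression of y on x
fhat <- lm(y ~ x)

# Step 2: threshold the fitted values at 1/2 to get class predictions
ghat <- as.integer(predict(fhat) >= 0.5)

mean(ghat != y) # training misclassification rate
```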

Claim: Classification is easier than regression

  1. Let \(\hat{f}\) be any estimate of \(f_*\)

  2. Let \(\widehat{g} (x) = \begin{cases}0 & \hat f(x) < 1/2\\1 & \textrm{else}\end{cases}\)

Proof by picture.

Claim: Classification is easier than regression

Code
library(tidyverse) # for tibble and ggplot2
set.seed(12345)
x <- 1:99 / 100
y <- rbinom(99, 1, 
            .25 + .5 * (x > .3 & x < .5) + 
              .6 * (x > .7))
dmat <- as.matrix(dist(x))
# Gaussian kernel smoother with bandwidth sigma: each fitted value is a
# normalized weighted average of the y's
ksm <- function(sigma) {
  gg <- dnorm(dmat, sd = sigma) 
  sweep(gg, 1, rowSums(gg), "/") %*% y
}
fstar <- ksm(.04)
# blue, orange, and green are colours defined in the course theme
gg <- tibble(x = x, fstar = fstar, y = y) %>%
  ggplot(aes(x)) +
  geom_point(aes(y = y), color = blue) +
  geom_line(aes(y = fstar), color = orange, linewidth = 2) +
  coord_cartesian(ylim = c(0, 1), xlim = c(0, 1)) +
  annotate("label", x = .75, y = .65, label = "f_star", size = 5)
gg

Claim: Classification is easier than regression

Code
gg + geom_hline(yintercept = .5, color = green)

Claim: Classification is easier than regression

Code
tib <- tibble(x = x, fstar = fstar, y = y)
ggplot(tib) +
  geom_vline(data = filter(tib, fstar > 0.5), aes(xintercept = x), alpha = .5, color = green) +
  annotate("label", x = .75, y = .65, label = "f_star", size = 5) + 
  geom_point(aes(x = x, y = y), color = blue) +
  geom_line(aes(x = x, y = fstar), color = orange, linewidth = 2) +
  coord_cartesian(ylim = c(0,1), xlim = c(0,1))

How to find a classifier

Why did we go through that math?

Each of these approaches suggests a way to find a classifier

  • Empirical risk minimization: Choose a set of classifiers \(\mathcal{G}\) and find \(g \in \mathcal{G}\) that minimizes some estimate of \(R_n(g)\)

(This can be quite challenging: unlike in regression, the 0-1 training error is a discontinuous, nonconvex function of the classifier)

  • Density estimation: Estimate \(\pi\) and \(p_k\)

  • Regression: Find an estimate \(\hat{f}\) of \(f_*\) and compare the predicted value to 1/2

Easiest classifier when \(y\in \{0,\ 1\}\):

(stupidest version of the third case…)

ghat <- round(predict(lm(y ~ ., data = trainingdata)))

Think about why this may not be very good. (At least 2 reasons I can think of.)

Next time:

Estimating the densities