A person arrives at an emergency room with a set of symptoms that could indicate 1 of 3 possible conditions. Which one is it?
An online banking service must be able to determine whether each transaction is fraudulent or not, using a customer’s location, past transaction history, etc.
Given a set of individuals’ sequenced DNA, can we determine whether various mutations are associated with different phenotypes?
These problems are not regression problems. They are classification problems.
Classification involves a categorical response variable (no notion of “order”/“distance”).
Setup
It begins just like regression: suppose we have observations \[\{(x_1,y_1),\ldots,(x_n,y_n)\}\]
Again, we want to estimate a function that maps \(X\) to \(Y\) to predict as-yet unobserved data.
(This function is known as a classifier)
The same constraints apply:
We want a classifier that predicts test data, not just the training data.
Often, this comes with the introduction of some bias to get lower variance and better predictions.
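Below is a minimal sketch of this setup in Python (the data and the thresholding rule are made up for illustration): observations \((x_i, y_i)\) with a categorical response, and a classifier \(g\) that maps feature values to predicted categories for new, unobserved data.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 8
x = rng.normal(size=n)                                # one numeric feature
y = np.where(x + rng.normal(scale=0.5, size=n) > 0,   # categorical response
             "dog", "cat")

def g(x_new):
    """A deliberately simple classifier: threshold the feature at zero."""
    return np.where(np.asarray(x_new) > 0, "dog", "cat")

print(list(zip(np.round(x, 2), y)))   # training observations (x_i, y_i)
print(g([-1.3, 0.4]))                 # predictions at as-yet unobserved x values
```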
How do we measure quality?
Before, in regression, we had \(y_i \in \mathbb{R}\) and used squared-error loss \((y - \hat{y})^2\) to measure accuracy.
Instead, let \(y \in \mathcal{K} = \{1,\ldots, K\}\)
(The labels are arbitrary; sometimes other encodings, such as \(\{-1,1\}\), are used.)
We will usually convert categories/“factors” (e.g. \(\{\textrm{cat},\textrm{dog}\}\)) to integers.
We again make predictions \(\hat{y}=k\) based on the data
We get zero loss if we predict the right class
We lose \(\ell(k,k')\) when the true class is \(k\) but we predict \(k'\neq k\) (see the sketch below)
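As a concrete sketch (the labels here are illustrative), categories can be coded as integers in \(\{0,\ldots,K-1\}\) and a general loss \(\ell(k,k')\) stored as a \(K\times K\) matrix with zeros on the diagonal:

```python
import numpy as np

labels = np.array(["cat", "dog", "dog", "cat"])       # categorical "factor"
classes, y = np.unique(labels, return_inverse=True)   # classes: ['cat' 'dog'], y: [0 1 1 0]

K = len(classes)
# ell(k, k') as a K x K matrix: zero on the diagonal (correct prediction);
# here the off-diagonal entries are all 1, but they could differ
loss = np.ones((K, K)) - np.eye(K)

y_hat = np.array([0, 0, 1, 1])   # predictions, as integer codes
print(loss[y, y_hat])            # per-observation losses: [0. 1. 0. 1.]
```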
How do we measure quality?
Example: You’re trying to build a fun widget to classify images of cats and dogs.
| Loss       | Predict Dog | Predict Cat |
|------------|-------------|-------------|
| Actual Dog | 0           | ?           |
| Actual Cat | ?           | 0           |
Use the zero-one loss (1 if wrong, 0 if right). Type of error doesn’t matter.
| Loss       | Predict Dog | Predict Cat |
|------------|-------------|-------------|
| Actual Dog | 0           | 1           |
| Actual Cat | 1           | 0           |
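A quick sketch of this in code (the labels and predictions are invented): with the zero-one loss, the average loss is just the misclassification rate.

```python
import numpy as np

y_true = np.array(["dog", "dog", "cat", "cat", "dog"])
y_pred = np.array(["dog", "cat", "cat", "dog", "dog"])

zero_one = (y_true != y_pred).astype(float)   # 1 if wrong, 0 if right; error type ignored
print(zero_one.mean())                        # misclassification rate: 0.4
```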
How do we measure quality?
Example: Suppose you have a fever of 39°C. You get a rapid test on campus.
| Loss  | Test +        | Test -             |
|-------|---------------|--------------------|
| Are + | 0             | ? (Infect others)  |
| Are - | ? (Isolation) | 0                  |
Use a weighted loss; type of error matters!
| Loss  | Test + | Test -  |
|-------|--------|---------|
| Are + | 0      | (LARGE) |
| Are - | 1      | 0       |
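A sketch of such a weighted loss (the value 100 standing in for “LARGE” is an arbitrary illustrative choice): a false negative, which risks infecting others, is penalized far more heavily than a false positive, which only costs isolation.

```python
import numpy as np

#                 test +   test -
loss = np.array([[0.0,     100.0],   # actually +  (false negative is very costly)
                 [1.0,       0.0]])  # actually -  (false positive costs isolation)

y_true = np.array([0, 0, 1, 1, 1])   # 0 = "+", 1 = "-"
y_pred = np.array([0, 1, 1, 0, 1])
print(loss[y_true, y_pred])          # per-observation losses: [0. 100. 0. 1. 0.]
print(loss[y_true, y_pred].mean())   # average weighted loss
```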
Note that one class is the “important” one: we sometimes call it the positive class. The two kinds of errors are then false positives and false negatives.
In practice, you have to design your loss (just like before) to reflect what you care about.
How do we measure quality?
We’re going to use \(g(x)\) to be our classifier. It takes values in \(\mathcal{K}\).
Consider the risk
\[R_n(g) = E [\ell(Y,g(X))]\]
Using iterated expectation (conditioning on \(X\)), this can be written
\[R_n(g) = E\left[\sum_{y=1}^K \ell(y, g(X)) Pr(Y = y \given X)\right]\]
We minimize this over a class of options \(\mathcal{G}\) to produce
\[g_* = \argmin_{g\in\mathcal{G}} E\left[\sum_{y=1}^K \ell(y,g(X)) Pr(Y = y \given X)\right]\]
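If \(\mathcal{G}\) contains all functions, the risk can be minimized pointwise in \(x\): choose the prediction \(k\) that minimizes \(\sum_{y=1}^K \ell(y,k)\, Pr(Y = y \given X = x)\). A minimal numeric sketch (the conditional probabilities here are made up for illustration):

```python
import numpy as np

K = 3
loss = np.ones((K, K)) - np.eye(K)   # zero-one loss: ell(y, k) = 1 if y != k else 0

# hypothetical conditional probabilities Pr(Y = y | X = x) at a single point x
p_y_given_x = np.array([0.2, 0.5, 0.3])

expected_loss = p_y_given_x @ loss   # entry k: sum_y ell(y, k) * Pr(Y = y | X = x)
g_star_x = expected_loss.argmin()    # the risk-minimizing prediction at x
print(expected_loss, g_star_x)       # [0.8 0.5 0.7] 1  (zero-one loss picks the most probable class)
```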
How do we measure quality?
\(g_*\) is called the Bayes’ classifier for loss \(\ell\) in class \(\mathcal{G}\).
\(R_n(g_*)\) is called the Bayes’ limit or Bayes’ risk.
It’s the best we could hope to do even if we knew the distribution of the data (recall irreducible error!)
But we don’t, so we’ll try to do our best to estimate \(g_*\).
Best classifier overall
Suppose we actually know the distribution of everything, and we’ve picked \(\ell\) to be the zero-one loss