A person arrives at an emergency room with a set of symptoms that could be one of three possible conditions. Which one is it?
An online banking service must be able to determine whether each transaction is fraudulent or not, using a customer’s location, past transaction history, etc.
Given a set of individuals' sequenced DNA, can we determine whether various mutations are associated with different phenotypes?
These problems are not regression problems. They are classification problems.
The Set-up
It begins just like regression: suppose we have observations \[\{(x_1,y_1),\ldots,(x_n,y_n)\}\]
Again, we want to estimate a function that maps \(X\) to \(Y\) in order to predict as-yet-unobserved data.
(This function is known as a classifier)
The same constraints apply:
We want a classifier that predicts test data, not just the training data.
Often, this comes with the introduction of some bias to get lower variance and better predictions.
How do we measure quality?
In regression, we had \(y_i \in \mathbb{R}\) and used squared-error loss to measure accuracy: \((y - \hat{y})^2\).
Instead, let \(y \in \mathcal{K} = \{1,\ldots, K\}\)
(This is arbitrary, sometimes other numbers, such as \(\{-1,1\}\) will be used)
We can always take “factors” like \(\{\textrm{cat},\textrm{dog}\}\) and convert them to integers, which is what we assume.
We again make predictions \(\hat{y}=k\) based on the data
We get zero loss if we predict the right class
We lose \(\ell(k,k') > 0\) when we predict \(k'\) but the truth is \(k\) (for \(k\neq k'\))
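Concretely (with hypothetical zero-one values), the losses \(\ell(k,k')\) can be stored as a \(K\times K\) matrix with zeros on the diagonal:

```python
# Hypothetical illustration: a K x K loss matrix, zero on the diagonal
# (no loss for a correct prediction), positive off the diagonal.
K = 3
loss = [[0 if truth == pred else 1 for pred in range(K)] for truth in range(K)]

def ell(truth, pred):
    """Loss l(k, k') incurred by predicting `pred` when the truth is `truth`."""
    return loss[truth][pred]
```

For example, `ell(2, 2)` is 0 (correct prediction) while `ell(0, 2)` is 1; replacing the off-diagonal entries with unequal numbers gives an asymmetric loss.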
How do we measure quality?
Suppose you have a fever of 39 °C. You get a rapid test on campus.

| Loss  | Test + | Test - |
|-------|--------|--------|
| Are + | 0 | Infect others |
| Are - | Isolation | 0 |
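With asymmetric losses like these, the best prediction given the probability of infection need not be the more probable label. A sketch with made-up costs (the numbers 10 and 1 are mine, chosen only to make "infect others" far worse than an unnecessary isolation):

```python
# Hypothetical costs: LOSS[(truth, prediction)]; a false negative
# ("infect others") is assumed far worse than a false positive ("isolation").
LOSS = {("+", "+"): 0.0, ("+", "-"): 10.0,   # false negative: infect others
        ("-", "+"): 1.0, ("-", "-"): 0.0}    # false positive: isolation

def best_call(p_infected):
    """Prediction minimizing expected loss, given Pr(truly infected)."""
    def expected_loss(pred):
        return p_infected * LOSS[("+", pred)] + (1 - p_infected) * LOSS[("-", pred)]
    return min(["+", "-"], key=expected_loss)

# With these costs, even a 20% chance of infection tips the call to "+":
# expected loss of "+" is 0.8, of "-" is 2.0.
print(best_call(0.2))  # -> "+"
```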
How do we measure quality?
Suppose you have a fever of 39 °C. You get a rapid test on campus.

| Loss  | Test + | Test - |
|-------|--------|--------|
| Are + | 0 | 1 |
| Are - | 1 | 0 |
How do we measure quality?
We’re going to use \(g(x)\) to denote our classifier. It takes values in \(\mathcal{K}\).
How do we measure quality?
Again, we appeal to risk \[R_n(g) = E [\ell(Y,g(X))]\] If we use the law of total expectation, conditioning on \(X\), this can be written \[R_n(g) = E_X \sum_{y=1}^K \ell(y,\; g(X)) Pr(Y = y \given X)\] We minimize this over a class of options \(\mathcal{G}\), to produce \[g_*(X) = \argmin_{g\in\mathcal{G}} E_X \sum_{y=1}^K \ell(y,g(X)) Pr(Y = y \given X)\]
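As a sketch of this pointwise minimization (the posterior vector and loss matrix below are made up for illustration), \(g_*(x)\) picks the class minimizing the posterior-weighted loss at each \(x\):

```python
def bayes_classifier(posterior, loss):
    """Return the class minimizing sum_y loss[y][g] * Pr(Y = y | X = x).

    posterior: list of Pr(Y = y | X = x) for y = 0..K-1 (assumed known here)
    loss:      K x K matrix with loss[y][g] = l(y, g)
    """
    K = len(posterior)
    def expected_loss(g):
        return sum(posterior[y] * loss[y][g] for y in range(K))
    return min(range(K), key=expected_loss)

# Under zero-one loss, the minimizer is simply the most probable class.
zero_one = [[0 if y == g else 1 for g in range(3)] for y in range(3)]
print(bayes_classifier([0.2, 0.5, 0.3], zero_one))  # -> 1
```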
How do we measure quality?
\(g_*\) is called the Bayes’ classifier for loss \(\ell\) in class \(\mathcal{G}\).
\(R_n(g_*)\) is called the Bayes’ limit or Bayes’ risk.
It’s the best we could hope to do in terms of \(\ell\) if we knew the distribution of the data.
But we don’t, so we’ll try to do our best to estimate \(g_*\).
Best classifier overall
(for now, we limit to 2 classes)
Once we make a specific choice for \(\ell\), we can find \(g_*\) exactly (pretending we know the distribution)
Because \(Y\) takes only a few values, zero-one loss is natural (but not the only option) \[\ell(y,\ g(x)) = \begin{cases}0 & y=g(x)\\1 & y\neq g(x) \end{cases} \Longrightarrow R_n(g) = E[\ell(Y,\ g(X))] = Pr(g(X) \neq Y).\]
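Since the risk under zero-one loss is the misclassification probability, its natural empirical counterpart is the error rate on a sample. A minimal sketch with hypothetical labels:

```python
def misclassification_rate(y_true, y_pred):
    """Empirical zero-one risk: the fraction of predictions that miss the truth."""
    return sum(yt != yp for yt, yp in zip(y_true, y_pred)) / len(y_true)

# One mistake out of four hypothetical observations.
print(misclassification_rate([1, 1, 2, 2], [1, 2, 2, 2]))  # -> 0.25
```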
Best classifier overall
| Loss  | Test + | Test - |
|-------|--------|--------|
| Are + | 0 | 1 |
| Are - | 1 | 0 |
Best classifier overall
This means we want to classify a new observation \((x_0,y_0)\) such that \(g(x_0) = y_0\) as often as possible.
Under this loss, we have \[
\begin{aligned}
g_*(X) &= \argmin_{g} Pr(g(X) \neq Y) \\
&= \argmin_{g} \left[ 1 - Pr(Y = g(x) \given X=x)\right] \\
&= \argmax_{g} Pr(Y = g(x) \given X=x)
\end{aligned}
\]
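A quick numerical check of the last step (the posterior values are hypothetical): minimizing \(1 - Pr(Y = g(x) \given X = x)\) and maximizing \(Pr(Y = g(x) \given X = x)\) select the same class:

```python
posterior = [0.1, 0.6, 0.3]  # hypothetical Pr(Y = y | X = x) for y = 0, 1, 2

g_min = min(range(3), key=lambda g: 1 - posterior[g])  # argmin of 1 - posterior
g_max = max(range(3), key=lambda g: posterior[g])      # argmax of the posterior
print(g_min, g_max)  # -> 1 1
```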
Approach 2: estimate everything in the expression above.
We need to estimate \(p_1\), \(p_2\), \(\pi\), \(1-\pi\), where \(p_k\) is the conditional density of \(X\) given \(Y = k\) and \(\pi = Pr(Y = 1)\); Bayes’ theorem then gives \(Pr(Y = 1 \given X = x) = \frac{\pi p_1(x)}{\pi p_1(x) + (1-\pi) p_2(x)}\).
Easily extended to more than two classes
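To make this concrete, here is a sketch under assumptions that are mine, not the notes’: one-dimensional \(x\), two classes labeled 1 and 2, and Gaussian class-conditional densities. We estimate \(\hat\pi\) and each class’s mean and variance, then predict the class with the larger estimated \(\pi_k \, p_k(x)\):

```python
import math

def fit_gaussian_classes(x, y):
    """Estimate pi = Pr(Y = 1) and a Gaussian density per class.

    Assumptions (for illustration only): 1-d x, labels in {1, 2},
    and Gaussian class-conditional densities.
    """
    params = {}
    for k in (1, 2):
        xk = [xi for xi, yi in zip(x, y) if yi == k]
        mu = sum(xk) / len(xk)
        var = sum((xi - mu) ** 2 for xi in xk) / len(xk)
        params[k] = (mu, var)
    pi = sum(yi == 1 for yi in y) / len(y)
    return pi, params

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def classify(x0, pi, params):
    """Predict the class with larger estimated prior-times-density."""
    score1 = pi * normal_pdf(x0, *params[1])
    score2 = (1 - pi) * normal_pdf(x0, *params[2])
    return 1 if score1 >= score2 else 2

# Hypothetical training data: class 1 clustered near 0, class 2 near 4.
x = [-0.5, 0.0, 0.5, 3.5, 4.0, 4.5]
y = [1, 1, 1, 2, 2, 2]
pi, params = fit_gaussian_classes(x, y)
print(classify(0.2, pi, params), classify(3.8, pi, params))  # -> 1 2
```

Comparing \(\hat\pi \hat p_1(x)\) with \((1-\hat\pi)\hat p_2(x)\) is equivalent to thresholding the Bayes-theorem posterior at \(1/2\), since both share the same denominator.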
An alternative easy classifier
Zero-One loss was natural, but try something else
Let’s try using squared error loss instead: \(\ell(y,\ f(x)) = (y - f(x))^2\)
Then the Bayes’ classifier (the function that minimizes the Bayes’ risk) is the regression function itself: coding \(Y \in \{0,1\}\), \[g_*(x) = f_*(x) = E[ Y \given X = x] = Pr(Y = 1 \given X = x)\] (recall that \(f_*(x) \in [0,1]\) is still the regression function)
In this case, our prediction is a probability rather than a class label. But a probability isn’t a class, so it’s a bit unsatisfying: we still need a rule to turn it into one.
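A natural fix (this helper and the \(1/2\) cutoff are the standard convention, sketched here rather than taken from the notes) is to threshold the estimated probability to recover a class label:

```python
def to_class(prob, cutoff=0.5):
    """Convert an estimated Pr(Y = 1 | X = x) into a label in {0, 1}.

    The default cutoff of 1/2 matches the zero-one-loss Bayes classifier;
    asymmetric losses would shift the cutoff.
    """
    return 1 if prob >= cutoff else 0

print([to_class(p) for p in [0.1, 0.49, 0.5, 0.9]])  # -> [0, 0, 1, 1]
```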