A person arrives at an emergency room with a set of symptoms that could be one of three possible conditions. Which one is it?

An online banking service must be able to determine whether each transaction is fraudulent or not, using a customer’s location, past transaction history, etc.

Given a set of individuals’ sequenced DNA, can we determine whether various mutations are associated with different phenotypes?

These problems are not regression problems. They are classification problems.

The Set-up

It begins just like regression: suppose we have observations \[\{(x_1,y_1),\ldots,(x_n,y_n)\}\]

Again, we want to estimate a function that maps \(X\) to \(Y\) to predict as yet unobserved data.

(This function is known as a classifier)

The same constraints apply:

We want a classifier that predicts test data, not just the training data.

Often, this comes with the introduction of some bias to get lower variance and better predictions.

How do we measure quality?

Before, in regression, we had \(y_i \in \mathbb{R}\) and used squared error loss to measure accuracy: \((y - \hat{y})^2\).

Instead, let \(y \in \mathcal{K} = \{1,\ldots, K\}\)

(This labeling is arbitrary; sometimes other values, such as \(\{-1,1\}\), are used.)

We can always take “factors”: \(\{\textrm{cat},\textrm{dog}\}\) and convert to integers, which is what we assume.
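As a small illustration of this conversion, the sketch below maps string labels to the integers \(1,\ldots,K\). The label values and the alphabetical ordering of the levels are assumptions made for the example; any consistent assignment would do.

```python
# Hypothetical raw labels ("factors"); the values are made up for illustration.
labels = ["cat", "dog", "dog", "cat", "dog"]

# Assign each distinct level an integer 1..K (alphabetical order is an arbitrary choice).
levels = sorted(set(labels))
level_to_int = {level: k for k, level in enumerate(levels, start=1)}

# Integer-coded responses y_i in {1, ..., K}.
y = [level_to_int[label] for label in labels]
K = len(levels)
print(level_to_int, y, K)
```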

We again make predictions \(\hat{y}=k\) based on the data

We get zero loss if we predict the right class

We lose \(\ell(k,k')\) when we predict \(k'\) but the true class is \(k \neq k'\)

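Suppose you have a fever of 39 °C. You get a rapid test on campus.

\[
\begin{array}{l|cc}
\textrm{Loss} & \textrm{Test +} & \textrm{Test -} \\ \hline
\textrm{Are +} & 0 & \textrm{Infect others} \\
\textrm{Are -} & \textrm{Isolation} & 0
\end{array}
\]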

Treating both kinds of mistake as equally costly gives:

\[
\begin{array}{l|cc}
\textrm{Loss} & \textrm{Test +} & \textrm{Test -} \\ \hline
\textrm{Are +} & 0 & 1 \\
\textrm{Are -} & 1 & 0
\end{array}
\]

We’re going to use \(g(x)\) to be our classifier. It takes values in \(\mathcal{K}\).

Again, we appeal to risk \[R_n(g) = E [\ell(Y,g(X))]\]

If we use the law of total probability, this can be written \[R_n(g) = E_X \sum_{y=1}^K \ell(y,\; g(X)) Pr(Y = y \given X)\]

We minimize this over a class of options \(\mathcal{G}\) to produce \[g_*(X) = \argmin_{g\in\mathcal{G}} E_X \sum_{y=1}^K \ell(y,g(X)) Pr(Y = y \given X)\]
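To make the sum inside the expectation concrete, here is a minimal sketch that evaluates \(\sum_y \ell(y,k) Pr(Y = y \given X = x)\) at a single \(x\) for each candidate prediction \(k\) and picks the smallest. The loss values and conditional probabilities are hypothetical numbers chosen for illustration, and the sketch assumes \(\mathcal{G}\) is unrestricted so that the minimization can be done one \(x\) at a time.

```python
# Hypothetical loss table: loss[y][k] is the loss for predicting k when the truth is y.
# Classes are coded 1 (positive) and 2 (negative); the costs are invented.
loss = {
    1: {1: 0.0, 2: 10.0},  # truly positive: missing it is costly
    2: {1: 1.0, 2: 0.0},   # truly negative: a false alarm is cheaper
}

# Hypothetical conditional probabilities Pr(Y = y | X = x) at one fixed x.
posterior = {1: 0.2, 2: 0.8}

def conditional_risk(k, loss, posterior):
    """Sum over y of loss(y, k) * Pr(Y = y | X = x)."""
    return sum(loss[y][k] * posterior[y] for y in posterior)

# Conditional risk of each possible prediction, and the prediction minimizing it.
risks = {k: conditional_risk(k, loss, posterior) for k in posterior}
best = min(risks, key=risks.get)
print(risks, best)
```

With these made-up costs, predicting class 1 has the smaller conditional risk even though class 2 is more probable at this \(x\), which is why the choice of \(\ell\) matters.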

\(g_*\) is called the Bayes’ classifier for loss \(\ell\) in class \(\mathcal{G}\).

\(R_n(g_*)\) is called the Bayes’ limit or Bayes’ risk.

It’s the best we could hope to do in terms of \(\ell\) if we knew the distribution of the data.

But we don’t, so we’ll try to do our best to estimate \(g_*\).

Best classifier overall

(for now, we limit to 2 classes)

Once we make a specific choice for \(\ell\), we can find \(g_*\) exactly (pretending we know the distribution)

Because \(Y\) takes only a few values, zero-one loss is natural (but not the only option) \[\ell(y,\ g(x)) = \begin{cases}0 & y=g(x)\\1 & y\neq g(x) \end{cases} \Longrightarrow R_n(g) = E[\ell(Y,\ g(X))] = Pr(g(X) \neq Y).\]
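Under zero-one loss the risk is a misclassification probability, so its natural sample analogue is the fraction of wrong predictions. A minimal sketch, with made-up labels and predictions standing in for real data:

```python
def zero_one_loss(y, y_hat):
    """Return 0 if the prediction matches the truth, 1 otherwise."""
    return 0 if y == y_hat else 1

# Hypothetical true classes and predictions from some classifier g.
y_true = [1, 2, 2, 1, 1, 2]
y_pred = [1, 2, 1, 1, 2, 2]

# Average zero-one loss = proportion of misclassified observations.
error_rate = sum(zero_one_loss(y, p) for y, p in zip(y_true, y_pred)) / len(y_true)
print(error_rate)
```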

\[
\begin{array}{l|cc}
\textrm{Loss} & \textrm{Test +} & \textrm{Test -} \\ \hline
\textrm{Are +} & 0 & 1 \\
\textrm{Are -} & 1 & 0
\end{array}
\]

This means we want to classify a new observation \((x_0,y_0)\) such that \(g(x_0) = y_0\) as often as possible

Under this loss, we have \[
\begin{aligned}
g_*(X) &= \argmin_{g} Pr(g(X) \neq Y) \\
&= \argmin_{g} \left[ 1 - Pr(Y = g(x) | X=x)\right] \\
&= \argmax_{g} Pr(Y = g(x) | X=x )
\end{aligned}
\]
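The derivation says that, under zero-one loss, the best rule picks the most probable class at each \(x\). The sketch below checks this by brute force on a tiny discrete example: it enumerates every possible classifier and compares the smallest risk with the risk of the "most probable class" rule. The joint distribution is invented for illustration.

```python
from itertools import product

# Hypothetical joint distribution: X takes values 0, 1, 2 and Y takes values 1, 2.
# joint[(x, y)] = Pr(X = x, Y = y); the numbers are invented and sum to 1.
joint = {
    (0, 1): 0.30, (0, 2): 0.10,
    (1, 1): 0.05, (1, 2): 0.25,
    (2, 1): 0.20, (2, 2): 0.10,
}
xs, ys = [0, 1, 2], [1, 2]

def risk(g):
    """Zero-one risk Pr(g(X) != Y) of a classifier given as a dict x -> class."""
    return sum(p for (x, y), p in joint.items() if g[x] != y)

# Rule from the derivation: at each x pick the class with the largest Pr(Y = y | X = x).
# Dividing by Pr(X = x) does not change the argmax, so joint probabilities suffice.
g_star = {x: max(ys, key=lambda y: joint[(x, y)]) for x in xs}

# Brute force: enumerate all 2^3 classifiers and find the smallest risk.
all_classifiers = [dict(zip(xs, choice)) for choice in product(ys, repeat=len(xs))]
best_risk = min(risk(g) for g in all_classifiers)
print(g_star, risk(g_star), best_risk)
```

Enumeration is only feasible because \(X\) takes three values here; the point is just that no classifier beats the pointwise argmax rule on this example.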

Approach 2: estimate everything in the expression above.

We need to estimate \(p_1\), \(p_2\), \(\pi\), \(1-\pi\)

Easily extended to more than two classes

An alternative easy classifier

Zero-one loss was natural, but let’s try something else

Let’s try using squared error loss instead: \(\ell(y,\ f(x)) = (y - f(x))^2\)

Then, with \(Y\) coded as 0 or 1, the Bayes’ classifier (the function that minimizes the risk) is \[g_*(x) = f_*(x) = E[ Y \given X = x] = Pr(Y = 1 \given X = x)\] (recall that \(f_*(x) \in [0,1]\) is still the regression function)

In this case, our “class” will actually just be a probability. But this isn’t a class, so it’s a bit unsatisfying.
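A quick numerical check of why squared error loss leads to a probability: for a response coded 0/1 with \(Pr(Y = 1 \given X = x) = p\), the constant \(c\) that minimizes \(E[(Y - c)^2 \given X = x]\) is \(p\) itself. The value of \(p\) and the search grid below are assumptions for the example.

```python
# Hypothetical conditional probability Pr(Y = 1 | X = x) at some fixed x.
p = 0.3

def expected_squared_error(c, p):
    """E[(Y - c)^2 | X = x] when Y is 1 with probability p and 0 otherwise."""
    return p * (1 - c) ** 2 + (1 - p) * (0 - c) ** 2

# Search a grid of candidate predictions c in [0, 1] with step 0.01.
grid = [i / 100 for i in range(101)]
c_best = min(grid, key=lambda c: expected_squared_error(c, p))
print(c_best)
```

The minimizer on the grid coincides with \(p\), a number in \([0,1]\) rather than a class label.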