Lecture 3: Introduction to Learning (Cont.), Classification, Logistic Regression
Learning Objectives
By the end of this lecture, you will be able to:
- Define the log-odds statistical model for binary classification, and justify its advantages
- Derive logistic regression through MLE and ERM
- Construct predictions for classifiers based on different notions of risk
Supervised Learning for Classification: Statistical Perspective
In the previous lecture, we developed a statistical perspective of the supervised learning procedure, and we worked through the steps of the learning procedure in a regression context.
Step | CS Perspective | Statistical Perspective | Example: Linear Regression |
---|---|---|---|
2 | Hypothesis Class | Statistical Model | \(\mathbb{E}[Y \mid X = x] = x^\top \beta\) |
3 | Training | Estimation | \(\hat{\beta}_\mathrm{MLE/OLS} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{Y}\) |
6 | Testing (Inference) | Prediction | \(\hat{Y}_\mathrm{new} = X_\mathrm{new}^\top \hat{\beta}\) |
In this lecture, we will work through the same steps, but this time for classification problems.
Classification versus regression:
- In regression, the response variable \(Y\) is continuous (i.e. \(\mathcal Y = \mathbb{R}\)).
- In binary classification, the response variable \(Y\) is boolean (i.e. \(\mathcal Y = \{0, 1\}\)), where
- \(Y = 1\) represents the “positive” class
- \(Y = 0\) represents the “negative” class
- (You might see \(Y = -1\) used to denote the negative class. It doesn’t make a huge difference, but we’ll stick with \(Y=0\) for mathematical simplicity.)
- The goal remains the same: learn a function \(\hat{f}: \mathbb{R}^p \to \{0, 1\}\) that accurately predicts the class label for new observations.
Let’s now derive a statistical model, estimation procedure, and prediction rule for binary classification!
Sneak peek: logistic regression
- We will ultimately derive logistic regression, a model you may recognize from previous courses.
Statistical Model: Linear Log-Odds
- Recall that a statistical model is a set of probability distributions.
- In classification, we are interested in families of conditional distributions, specified through \(P(Y=1 \mid X = x)\)
- Each such specification implicitly determines \(P(Y=0 \mid X = x)\) by the law of total probability, i.e. \(P(Y=0 \mid X = x) = 1 - P(Y=1 \mid X = x)\).
- Notation: For simplicity, let’s define:
- \(\pi_1(x) := P(Y = 1 \mid X = x)\) (probability of positive class)
- \(\pi_0(x) := P(Y = 0 \mid X = x)\) (probability of negative class)
- Note that \(\pi_0(x) = 1 - \pi_1(x)\).
Sneak peek: the logistic regression model
- You may remember from STAT 406 or CPSC 340 a set of distributions that look like: \[P(Y = 1 \mid X = x) = \pi_1(x) = \frac{\exp(x^\top \beta)}{1 + \exp(x^\top \beta)}\]
- \(\beta \in \mathbb{R}^p\) is a vector of parameters.
- This statistical model is known as logistic regression, and it is a good starting point for binary classification.
- We will now derive this model from first principles.
What type of distribution should we model?
- It’s hard to model the probabilities \(\pi_1(x)\) or \(\pi_0(x)\) directly.
- Consider \(\pi_1(x) = x^\top \beta\). What’s the problem with this?
- If \(x^\top \beta > 1\), then \(\pi_1(x) = P(Y=1|X=x) > 1\), which is not a valid probability.
- If \(x^\top \beta < 0\), then \(\pi_1(x) = P(Y=1|X=x) < 0\), which is also not a valid probability.
Modelling the log-odds
Instead of defining distributions through \(P(Y = 1 \mid X = x)\) directly, we will define them through the log-odds ratio.
The log-odds ratio is then: \[r(x) := \log\left(\frac{\pi_1(x)}{\pi_0(x)}\right) = \log\left(\frac{\pi_1(x)}{1 - \pi_1(x)}\right)\]
I claim that this ratio can take any real value, i.e. \(r(x) \in \mathbb{R}\).
\(\pi_1(x)\) | Odds Ratio | \(r(x)\) (Log-Odds Ratio) |
---|---|---|
\(\approx 1\) | \(\pi_1(x) / \pi_0(x) \to \infty\) | \(r(x) \to \infty\) |
\(\approx 0\) | \(\pi_1(x) / \pi_0(x) \to 0\) | \(r(x) \to -\infty\) |
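As a quick numerical example:
\[
\pi_1(x) = 0.75 \;\Rightarrow\; \frac{\pi_1(x)}{\pi_0(x)} = \frac{0.75}{0.25} = 3 \;\Rightarrow\; r(x) = \log 3 \approx 1.10,
\qquad
\pi_1(x) = 0.05 \;\Rightarrow\; \frac{\pi_1(x)}{\pi_0(x)} = \frac{1}{19} \;\Rightarrow\; r(x) = -\log 19 \approx -2.94.
\]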
The linear log-odds model
We’re now ready to define a statistical model for binary classification.
We will consider
\[\log\left(\frac{\pi_1(x)}{\pi_0(x)}\right) = x^\top\beta\]
Thus our statistical model is the following set of distributions:
\[\left\{ P(Y=1 \mid X) \: : \: P(Y=1 \mid X=x) = \frac{\exp(x^\top\beta)}{1 + \exp(x^\top\beta)}, \quad \beta \in \mathbb{R}^p \right\}\]
Start with \(\log\left(\frac{\pi_1(x)}{\pi_0(x)}\right) = x^\top\beta\)
By substituting \(\pi_0(x) = 1 - \pi_1(x)\) and then solving for \(\pi_1(x)\), we get:
\[\pi_1(x) = \frac{\exp(x^\top\beta)}{1 + \exp(x^\top\beta)}\]
Our statistical model is then the set of conditional distributions \(P(Y=1 \mid X=x)\) that can be written in the form of \(\pi_1\) above, one for each \(\beta \in \mathbb{R}^p\).
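For completeness, the algebra between these two displays is:
\[\begin{align*}
\log\left(\frac{\pi_1(x)}{1 - \pi_1(x)}\right) &= x^\top\beta \\
\frac{\pi_1(x)}{1 - \pi_1(x)} &= \exp(x^\top\beta) \\
\pi_1(x) &= \exp(x^\top\beta) - \pi_1(x)\exp(x^\top\beta) \\
\pi_1(x)\left(1 + \exp(x^\top\beta)\right) &= \exp(x^\top\beta) \\
\pi_1(x) &= \frac{\exp(x^\top\beta)}{1 + \exp(x^\top\beta)}.
\end{align*}\]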
The logistic function:
The function in the above equation is known as the logistic function or sigmoid function:
\[\frac{\exp(x^\top\beta)}{1 + \exp(x^\top\beta)} = \frac{1}{1 + \exp(-x^\top\beta)} := \sigma(x^\top \beta)\]
It has many useful properties:
- Range: \(\sigma(z) \in (0, 1)\) for all \(z \in \mathbb{R}\)
- Symmetric: \(\sigma(-z) = 1 - \sigma(z)\)
- Convenient derivative: \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\)
It can be seen as a “smooth approximation” to the 0-1 step function.
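As an aside (not part of the lecture code), here is a minimal NumPy sketch that checks these three properties numerically; the function name `sigmoid` and the grid of test points are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 101)

# Range: sigma(z) lies strictly between 0 and 1
assert np.all((sigmoid(z) > 0) & (sigmoid(z) < 1))

# Symmetry: sigma(-z) = 1 - sigma(z)
assert np.allclose(sigmoid(-z), 1 - sigmoid(z))

# Derivative: sigma'(z) = sigma(z) * (1 - sigma(z)), checked by central finite differences
eps = 1e-6
numeric_grad = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
assert np.allclose(numeric_grad, sigmoid(z) * (1 - sigmoid(z)))
```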
Prediction and Different Notions of Risk
Let’s discuss how we can make predictions using the logistic regression model.
Making predictions:
Given
- a \(\hat \beta\) estimate
- a new observation \(X_{\text{new}}\)
- a loss function \(L(Y, \hat{Y})\) that quantifies how “bad” a prediction is
recall that we make predictions by minimizing the expected loss:
\[\hat{Y}_{\text{new}} = \mathrm{argmin}_{\hat{y}} \mathbb{E}[L(Y, \hat{y}) \mid X_{\text{new}}, \hat \beta]\]
0/1 Loss:
- The most natural loss function for classification is the 0/1 loss:
\[L_{0/1}(Y, \hat{Y}) = \mathbb{I}(Y \neq \hat{Y}) = \begin{cases} 0 & \text{if } Y = \hat{Y} \\ 1 & \text{if } Y \neq \hat{Y} \end{cases}\]
- Under our logistic model, we have:
\[\begin{align*} \hat{Y}_{\text{new}} &= \mathrm{argmin}_{\hat{y}} \mathbb{E}[ \mathbb{I}(Y \neq \hat{y}) \mid X_{\text{new}}, \hat \beta] \\ &= \mathrm{argmin}_{\hat{y}} P(Y \neq \hat{y} \mid X_{\text{new}}, \hat \beta) \\ &= \mathrm{argmin}_{\hat{y}} \left[ \underbrace{P(Y = 1 \mid X_{\text{new}}, \hat \beta)}_{\sigma(X_\mathrm{new}^\top \hat \beta)} \mathbb{I}(\hat{y} = 0) + \underbrace{P(Y = 0 \mid X_{\text{new}}, \hat \beta)}_{1 - \sigma(X_\mathrm{new}^\top \hat \beta)} \mathbb{I}(\hat{y} = 1) \right] \\ &= \begin{cases} 1 & \text{if } \sigma(X_\mathrm{new}^\top \hat \beta) > 0.5 \\ 0 & \text{if } \sigma(X_\mathrm{new}^\top \hat \beta) \leq 0.5 \end{cases} \end{align*}\].
Decision boundary:
Note that
- \(\sigma(X_\mathrm{new}^\top \hat \beta) > 0.5\) when \(X_\mathrm{new}^\top \hat \beta > 0\)
- \(\sigma(X_\mathrm{new}^\top \hat \beta) \leq 0.5\) when \(X_\mathrm{new}^\top \hat \beta \leq 0\).
The decision boundary, defined by:
\[x^\top\beta = 0\]
is the hyperplane that separates the positively-classified \(x\) from the negatively-classified \(x\).
This decision boundary for logistic regression is linear in \(x\).
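To make this concrete, here is a minimal sketch of the resulting prediction rule (assuming an already-fitted coefficient vector `beta_hat`; the function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X_new, beta_hat):
    """Estimated P(Y = 1 | X = x) for each row of X_new."""
    return sigmoid(X_new @ beta_hat)

def predict_class(X_new, beta_hat):
    """Hard 0/1 prediction under the 0/1 loss:
    predict 1 iff x'beta_hat > 0, i.e. iff sigma(x'beta_hat) > 0.5."""
    return (X_new @ beta_hat > 0).astype(int)
```

A point lying exactly on the decision boundary \(x^\top \hat \beta = 0\) is assigned to class 0 here, matching the \(\leq 0.5\) case above.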
Other losses:
There are other losses that we could use to generate different prediction rules:
- Probabilistic loss: \(L_\mathrm{prob}(Y, \hat{Y}) = -Y \log \hat Y - (1 - Y) \log(1 - \hat Y)\), i.e. the negative log-probability that the prediction assigns to the observed label
- This loss produces “soft” predictions (e.g. \(\hat{Y}_{\text{new}} = 0.273\)) that give a probability estimate of the positive class.
- Asymmetric losses: \(L_\alpha(Y, \hat{Y}) = \alpha \mathbb{I}(Y = 1, \hat{Y} = 0) + (1-\alpha) \mathbb{I}(Y = 0, \hat{Y} = 1)\) for some \(\alpha \in (0, 1)\)
- This loss allows us to penalize false positives and false negatives differently, which can be useful in imbalanced datasets.
We will explore these losses, as well as metrics derived from these losses, in a homework assignment.
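As a small preview of the asymmetric loss (a hedged sketch, not the homework solution; `alpha` and the function name are illustrative), we can compute the expected loss of each hard prediction directly from the definition and pick the smaller one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_asymmetric(X_new, beta_hat, alpha=0.5):
    """Prediction minimizing the expected asymmetric loss L_alpha.

    From the definition of L_alpha:
      expected loss of predicting 0 is alpha * P(Y = 1 | x),
      expected loss of predicting 1 is (1 - alpha) * P(Y = 0 | x).
    """
    p1 = sigmoid(X_new @ beta_hat)       # model's P(Y = 1 | x)
    loss_if_0 = alpha * p1               # expected loss of predicting 0
    loss_if_1 = (1 - alpha) * (1 - p1)   # expected loss of predicting 1
    return (loss_if_1 < loss_if_0).astype(int)
```

With \(\alpha = 0.5\), this reduces to the 0/1-loss rule from earlier.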
Estimation
As with regression, we will derive estimators for the logistic regression parameters \(\beta\) through the principle of maximum likelihood estimation (MLE).
Maximum Likelihood Estimation (MLE)
Recall the MLE estimator is given by
\[\hat{\beta}_\mathrm{MLE} = \mathrm{argmax}_{\beta} \mathcal L(\beta) = \mathrm{argmax}_{\beta} \ell(\beta)\]
where \(\mathcal{L}(\beta)\) is the likelihood function and \(\ell(\beta)\) is the log-likelihood function.
Log likelihood of the linear log-odds model:
Given training data \(\mathcal{D} = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}\), recall that the likelihood/log-likelihood function is:
\[\begin{gather} \mathcal{L}(\beta) = \prod_{i=1}^n P(Y_i \mid X_i; \beta) \\ \ell(\beta) = \sum_{i=1}^n \log P(Y_i \mid X_i; \beta) \end{gather}\]
Plugging in our model, we have that:
\[ P(Y_i \mid X_i; \beta) = \begin{cases} \sigma(X_i^\top \beta) & \text{if } Y_i = 1 \\ 1 - \sigma(X_i^\top \beta) & \text{if } Y_i = 0 \end{cases} \]
We can write this equation more compactly as:
\[P(Y_i \mid X_i, \beta) = \sigma(X_i^\top \beta)^{Y_i} (1 - \sigma(X_i^\top \beta))^{1-Y_i}.\]
Thus the log likelihood function is:
\[ \ell(\beta) = \sum_{i=1}^n \left[ Y_i \log(\sigma(X_i^\top \beta)) + (1 - Y_i) \log(1 - \sigma(X_i^\top \beta)) \right] \]
Computing the MLE estimator:
\[ \hat \beta_\mathrm{MLE} = \mathrm{argmax}_{\beta} \sum_{i=1}^n \left[ Y_i \log(\sigma(X_i^\top \beta)) + (1 - Y_i) \log(1 - \sigma(X_i^\top \beta)) \right] \]
Unlike linear regression, we cannot compute the maximizer of this log-likelihood in closed form.
We can solve the optimization numerically using a technique called gradient descent, which we will cover in a future lecture.
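For concreteness, here is a minimal sketch of one such numerical approach, full-batch gradient ascent on \(\ell(\beta)\); the step size, iteration count, and simulated data are illustrative assumptions, not prescriptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_mle(X, y, lr=0.1, n_iter=5000):
    """Approximate the MLE by gradient ascent on the log-likelihood.

    Uses the standard gradient of the logistic log-likelihood,
    grad ell(beta) = X^T (y - sigma(X beta)).
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (y - sigmoid(X @ beta))  # gradient of ell at the current beta
        beta += lr * grad / n                 # ascent step on the average log-likelihood
    return beta

# Hypothetical usage on simulated data:
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = rng.binomial(1, sigmoid(X @ beta_true))
beta_hat = fit_logistic_mle(X, y)  # should land near beta_true
```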
Empirical Risk Minimization (ERM)
- What about ERM?
- With the 0/1 loss, the optimization problem becomes
\[\hat \beta_\mathrm{ERM} = \mathrm{argmin}_{\beta} \frac{1}{n} \sum_{i=1}^n \mathbb{I}(Y_i \neq \hat{Y}_i), \qquad \text{where } \mathbb{I}(Y_i \neq \hat{Y}_i) = \begin{cases} 1 & Y_i = 1, \; X_i^\top \beta \leq 0 \\ 1 & Y_i = 0, \; X_i^\top \beta > 0 \\ 0 & \mathrm{o.w.} \end{cases}\]
Unfortunately, this optimization is NP-hard (i.e. computationally intractable), and so we can’t even solve it numerically for large values of \(n\)!
Alternative loss functions can yield ERM problems that we can solve numerically. For example, the “probabilistic loss” \(L(Y, \hat{Y}) = - Y \log \hat Y - (1 - Y) \log (1 - \hat Y)\) leads to the same optimization problem as MLE.
Under this probabilistic loss, the prediction rule for logistic regression becomes:
\[\begin{align*} \hat{Y}_{\text{new}} &= \mathrm{argmin}_{\hat{y}} \mathbb{E}[ - Y \log \hat y - (1 - Y) \log (1 - \hat y) \mid X_{\text{new}}, \hat \beta] \\ &= \mathrm{argmin}_{\hat{y}} - \mathbb{E}[Y \mid X_{\text{new}}, \hat \beta] \log \hat y - \mathbb{E}[1 - Y \mid X_{\text{new}}, \hat \beta] \log (1 - \hat y) \\ &= \mathrm{argmin}_{\hat{y}} - \sigma(X_\mathrm{new}^\top \hat \beta) \log \hat y - (1 - \sigma(X_\mathrm{new}^\top \hat \beta)) \log (1 - \hat y) \\ \end{align*}\]
where the second line follows from linearity of expectation, and the third line follows from our logistic model (and recognizing that \(\mathbb{E}[Y \mid X_{\text{new}}, \hat \beta] = P(Y=1 \mid X_{\text{new}}, \hat \beta)\)).
Setting the derivative with respect to \(\hat y\) to zero and solving, we get:
\[ \frac{\sigma(X_\mathrm{new}^\top \hat \beta)}{\hat Y_\mathrm{new}} - \frac{1 - \sigma(X_\mathrm{new}^\top \hat \beta)}{1 - \hat Y_\mathrm{new}} = 0, \]
and thus \(\hat Y_\mathrm{new} = \sigma(X_\mathrm{new}^\top \hat \beta)\).
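Spelling out that last step:
\[\frac{\sigma(X_\mathrm{new}^\top \hat \beta)}{\hat Y_\mathrm{new}} = \frac{1 - \sigma(X_\mathrm{new}^\top \hat \beta)}{1 - \hat Y_\mathrm{new}} \;\Longrightarrow\; \sigma(X_\mathrm{new}^\top \hat \beta)\left(1 - \hat Y_\mathrm{new}\right) = \left(1 - \sigma(X_\mathrm{new}^\top \hat \beta)\right)\hat Y_\mathrm{new} \;\Longrightarrow\; \hat Y_\mathrm{new} = \sigma(X_\mathrm{new}^\top \hat \beta).\]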
With this prediction rule, the ERM optimization becomes:
\[\begin{align*} \hat \beta_\mathrm{ERM} &= \mathrm{argmin}_{\beta} \frac{1}{n} \sum_{i=1}^n L(Y_i, \hat Y_i) \\ &= \mathrm{argmin}_{\beta} \frac{1}{n} \sum_{i=1}^n - Y_i \log(\sigma(X_i^\top \beta)) - (1 - Y_i) \log(1 - \sigma(X_i^\top \beta)) \\ &= \mathrm{argmax}_{\beta} \frac{1}{n} \sum_{i=1}^n Y_i \log(\sigma(X_i^\top \beta)) + (1 - Y_i) \log(1 - \sigma(X_i^\top \beta)), \end{align*}\]
which is the same optimization problem as MLE!
Summary
This lecture extended the statistical framework from regression to classification:
Statistical Model: The log-odds model provides a principled way to model binary responses while ensuring probabilities stay in \([0, 1]\).
Prediction: The optimal prediction rule depends on the loss function.
Linear Decision Boundary: Under the \(0/1\) loss, the \(Y=0\) and \(Y=1\) predictions are separated by a hyperplane defined by the decision boundary \(X^\top \beta = 0\).
Estimation: The MLE solution cannot be computed analytically; it requires numerical methods.
In the next lecture, we will explore the last remaining steps of the learning procedure: model selection and evaluation.