Let’s use 2D and 3D visualizations to develop intuition about estimation methods.
Simulation Design
We use simulated data so that we know and control the true coefficients. To simplify the geometry, we will:
generate data from models with zero intercept
fit models without intercept so the objective is only to estimate the slopes
Warning
Here the true intercept is 0, so fitting a model without an intercept is appropriate. With real data, the intercept is unknown and should not be omitted.
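As a sketch of this setup (in Python with NumPy for illustration; the slopes 2 and -1 and the noise level are arbitrary choices, not the slides' dataset), we can generate data with a zero intercept and fit the slopes without an intercept term:

```python
import numpy as np

# Simulated data with true intercept 0; the true slopes (2 and -1) are
# chosen by us, so we can check the estimates against them.
rng = np.random.default_rng(42)
n = 100
X = rng.normal(size=(n, 2))              # two covariates, no intercept column
beta_true = np.array([2.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Fit without an intercept: least squares on X alone (no column of ones)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                          # close to [2, -1]
```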
Optimization
There are many methods to estimate regression coefficients. We learned that:
the Ordinary Least Squares (OLS) estimator gives values for the regression coefficients that minimize the sum of the squares of the residuals (aka RSS)
LASSO and Ridge also minimize the sum of the squares of the residuals, but subject to a constraint
Note
The RSS is a quadratic function of the coefficients. For an SLR (with zero intercept), it is just a parabola in \(\beta_1\). In general, we look for the coefficients that minimize the RSS.
OLS for SLR
Let’s start with a SLR without an intercept, so we only need to find the slope \(\beta_1\) that minimizes the RSS.
Move the slider to see how changing \(\beta_1\) changes both the fitted line and the RSS. The OLS estimate is the value of \(\beta_1\) that minimizes the RSS.
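The slider sweep can be mimicked numerically (a Python/NumPy sketch on simulated data, not the slides' dataset): evaluate the RSS on a grid of candidate slopes and compare the grid minimizer with the closed-form OLS slope for a no-intercept SLR.

```python
import numpy as np

# RSS parabola for a no-intercept SLR: the grid minimizer should match
# the closed-form OLS slope. Data simulated for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 3.0 * x + rng.normal(scale=0.5, size=50)   # true slope 3, zero intercept

def rss(b1):
    return np.sum((y - b1 * x) ** 2)

grid = np.linspace(0, 6, 601)                  # candidate slopes
rss_vals = np.array([rss(b) for b in grid])
b1_grid = grid[np.argmin(rss_vals)]

# Closed-form OLS slope without intercept: sum(x*y) / sum(x^2)
b1_ols = np.sum(x * y) / np.sum(x ** 2)
print(b1_grid, b1_ols)                         # grid minimizer ~ OLS slope
```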
OLS with 2 covariates: 3D plots
The LS regression line becomes a plane!
The LS plane is the one that minimizes the sum of squared vertical distances from the points to the plane, i.e., the RSS
The minimum RSS equals 70.77
Rotate the plot to view different angles. For simplicity, only the optimal plane is shown.
The RSS parabola becomes a surface!
The RSS now depends on 2 values: \(\beta_1\) and \(\beta_2\) (coefficients of 2 explanatory variables)
The red point corresponds to the minimum of the RSS 70.77: LS solution
The beta coefficients at the minimum (see red tag) are the slopes of the LS plane in previous slide
Compare the RSS away from the minimum with the minimal RSS
Move your cursor over the RSS surface to view RSS values for non-optimal βs. See red tag at the minimum.
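The same idea can be checked numerically (a Python/NumPy sketch on simulated data; the RSS value 70.77 belongs to the slides' dataset, not this one): the minimum of the RSS surface is the OLS solution of the normal equations, and any other pair of coefficients gives a larger RSS.

```python
import numpy as np

# With two covariates the RSS is a surface over (beta1, beta2); its
# minimum is the OLS solution of the normal equations.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=1.0, size=80)

def rss(b):
    return np.sum((y - X @ b) ** 2)

# OLS via the normal equations: (X'X) beta = X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
rss_min = rss(beta_ols)

# Moving away from the minimum in any direction increases the RSS
for shift in [np.array([0.5, 0.0]), np.array([0.0, -0.5]), np.array([0.3, 0.3])]:
    assert rss(beta_ols + shift) > rss_min
print(rss_min)
```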
Away from the minimum RSS
The RSS attains its minimum at the LS estimates and increases as you move away from them
Because the RSS is a quadratic function of the coefficients, many pairs \((\beta_1, \beta_2)\) give the same RSS, just as two symmetric values of \(\beta_1\) give the same RSS on a 2D parabola.
The red ellipses represent combinations of \((\beta_1, \beta_2)\) with equal RSS, aka contour curves.
See the contour lines projected onto the betas’ space
Move along the red curves (see red tag) with RSS larger than the minimum.
Compare the 3D plot with a projection on the space of the coefficients (2D projection):
the values of \(\beta_1\) and \(\beta_2\) at the red point give the minimum RSS; other values give a larger RSS
due to symmetry, many combinations of betas have equal RSS: the ellipses, each at a larger RSS than the minimum
Constrained minimization
What if the minimum RSS cannot be attained due to restrictions on betas?
If any values of \(\beta_1\) and \(\beta_2\) are allowed, the LS point (red center point) minimizes the RSS.
If restricted to the orange region, LS is infeasible, so the minimum RSS cannot be attained.
The best feasible choice will be on the boundary, where the first ellipse touches the region
RSS increases with larger ellipses (labels show RSS values). There’s a trade-off between feasibility and RSS value
The first ellipse touching the feasible zone will determine the best combination of \(\beta_1\) and \(\beta_2\) given the restrictions, the LASSO estimator. Solutions inside the zone have larger RSS.
As \(\lambda\) increases, the feasible region becomes smaller, shrinking the estimates and pushing them further away from the least squares (LS) solution (shown by the black arrow).
How much regularization?
As \(\lambda\) increases, the feasible region shrinks and RSS increases (note matching colors)
The level of regularization is chosen by cross-validation, e.g., lambda.1se shown in green
Summary: computing LASSO
Define a grid with values of \(\lambda\) (glmnet() sets one by default)
Each value of \(\lambda\) defines a feasible region (orange diamonds) of allowable coefficient values.
This region limits how large the coefficients can be, and the solution occurs where a contour ellipse touches the boundary.
Larger values of \(\lambda\) in the grid correspond to smaller feasible regions: more shrinkage, smaller coefficients
With more shrinkage, some coefficients become exactly zero, enabling variable selection.
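The constrained view of these steps can be sketched by brute force (Python/NumPy, simulated data; glmnet instead uses coordinate descent over the \(\lambda\) grid, so this is only for intuition): for a shrinking L1 budget \(t\), the diamond \(|\beta_1| + |\beta_2| \le t\), minimize the RSS over a grid of coefficient pairs.

```python
import numpy as np

# Brute-force LASSO intuition: smaller L1 budget t (the orange diamond)
# means more shrinkage and a larger best feasible RSS.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
y = X @ np.array([3.0, 0.5]) + rng.normal(scale=0.5, size=60)

# RSS(b) = y'y - 2 b'X'y + b'X'X b, evaluated on the whole grid at once
XtX, Xty, yty = X.T @ X, X.T @ y, y @ y
grid = np.linspace(-4, 4, 401)
B1, B2 = np.meshgrid(grid, grid)
R = (yty - 2 * (Xty[0] * B1 + Xty[1] * B2)
     + XtX[0, 0] * B1**2 + 2 * XtX[0, 1] * B1 * B2 + XtX[1, 1] * B2**2)

sols = {}
for t in [4.0, 2.0, 1.0]:                        # smaller t = more shrinkage
    feasible = np.abs(B1) + np.abs(B2) <= t      # the diamond
    masked = np.where(feasible, R, np.inf)
    i, j = np.unravel_index(np.argmin(masked), masked.shape)
    sols[t] = (B1[i, j], B2[i, j], masked[i, j])
    print(t, sols[t])                            # coefficients shrink with t
```

Each budget \(t\) corresponds to some value of \(\lambda\) in the penalized formulation: smaller \(t\) plays the role of larger \(\lambda\).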
Regularization in GLM
For Logistic and Poisson regression we
used a different estimation: the MLE (maximum likelihood estimator)
defined other measures of goodness of fit: deviance as an extension of the RSS
defined other measures to evaluate prediction: misclassification or sensitivity/specificity for Logistic regression
While penalized estimators also exist for these models, the optimization is not based on the RSS.
However, conceptually, the problem is similar: minimize a function of the fit subject to a constraint on the size of the coefficients.
In R, you can choose the level of penalization by setting a performance measure in the type.measure argument of cv.glmnet() (e.g., type.measure = "auc" for logistic regression).
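The same conceptual picture can be sketched for a GLM (Python/NumPy, simulated data; plain gradient descent with a ridge penalty, only to show the shrinkage effect, not glmnet's algorithm): the RSS is replaced by the deviance (negative log-likelihood), plus a penalty on the coefficients.

```python
import numpy as np

# Penalized logistic regression by gradient descent: a heavier penalty
# lambda shrinks the coefficients toward zero.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
p = 1 / (1 + np.exp(-(X @ np.array([2.0, -1.0]))))
y = rng.binomial(1, p)

def fit(lam, steps=2000, lr=0.1):
    b = np.zeros(2)
    for _ in range(steps):
        mu = 1 / (1 + np.exp(-(X @ b)))           # fitted probabilities
        grad = X.T @ (mu - y) / len(y) + lam * b  # penalized gradient
        b -= lr * grad
    return b

b_small = fit(lam=0.01)
b_large = fit(lam=1.0)
print(b_small, b_large)   # the heavier penalty shrinks both coefficients
```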