24 Principal components, introduction

Stat 406

Geoff Pleiss, Trevor Campbell

Last modified – 24 November 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \]

Unsupervised learning

In Machine Learning, rather than calling \(\y\) the response, people call it the supervisor

So unsupervised learning means learning without \(\y\)

The only data you get are the features \(\{x_1,\ldots,x_n\}\).

This type of analysis is more often exploratory

We’re not necessarily using this for prediction (but we could)

So now, we get \(\X\)

The two main activities are representation learning and clustering

Representation learning

Representation learning is the idea that performance of ML methods is highly dependent on the choice of representation

For this reason, much of ML is geared towards transforming the data into the relevant features and then using these as inputs

This idea is as old as statistics itself, really.

However, the idea is constantly revisited in a variety of fields and contexts

Commonly, these learned representations capture low-level information like overall shapes

It is possible to quantify this intuition for PCA at least

Goal
Transform \(\mathbf{X}\in \R^{n\times p}\) into \(\mathbf{Z} \in \R^{n \times ?}\)

The new dimension \(?\) can be bigger (feature creation) or smaller (dimension reduction) than \(p\)

You’ve done this already!

  • You added transformations as predictors in regression

  • You “expanded” \(\mathbf{X}\) using a basis \(\Phi\) (polynomials, splines, etc.)

  • You used Neural Nets to do a “feature map”

This is the same, just no \(Y\) around
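For example, a basis expansion already builds such a \(\mathbf{Z}\) from the features alone. A minimal sketch using the splines package (the feature x here is just illustrative):

# a hand-specified representation: Z is built from x alone, no response involved
library(splines)
x <- rnorm(100)
Z <- bs(x, df = 5) # 100 x 5 matrix of spline basis features
dim(Z)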

PCA

Principal components analysis (PCA) is an (unsupervised) dimension reduction technique

It solves various equivalent optimization problems

(Maximize variance, minimize \(\ell_2\) distortions, find closest subspace of a given rank, \(\ldots\))

At its core, we are finding linear combinations of the original (centered) covariates \[z_{ij} = \alpha_j^{\top} x_i\]

This is expressed via the SVD: \(\X = \U\D\V^{\top}\).

Important

We assume throughout that the columns of \(\X\) are centered: we replace \(\X\) by \(\X - \mathbf{1}\overline{x}^\top\), where \(\overline{x}\) is the vector of column means.

Then our new features are

\[\mathbf{Z} = \X \V = \U\D\]
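A quick numerical check of this identity, as a sketch on simulated centered data (Xsim and sv are illustrative names):

# check that XV and UD give the same scores (X centered)
Xsim <- scale(matrix(rnorm(100 * 4), 100, 4), scale = FALSE)
sv <- svd(Xsim)
max(abs(Xsim %*% sv$v - sv$u %*% diag(sv$d))) # ~ 0, up to floating point error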

Short SVD aside (reminder from Ridge Regression)

  • Any \(n\times p\) matrix can be decomposed into \(\mathbf{UDV}^\top\).

  • These have properties:

  1. \(\mathbf{U}^\top \mathbf{U} = \mathbf{I}_n\)
  2. \(\mathbf{V}^\top \mathbf{V} = \mathbf{I}_p\)
  3. \(\mathbf{D}\) is diagonal (0 off the diagonal)

Almost all the methods we'll talk about for representation learning use the SVD of some matrix.
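A small numerical check of these properties on a simulated matrix (A and sA are illustrative names):

# note: R's svd() returns the compact SVD, so when n > p,
# u is n x p and crossprod(u) is the p x p identity
A <- matrix(rnorm(20 * 2), 20, 2)
sA <- svd(A)
crossprod(sA$u) # approximately the identity
crossprod(sA$v) # approximately the identity
max(abs(A - sA$u %*% diag(sA$d) %*% t(sA$v))) # reconstruction error ~ 0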

Why?

  1. Given \(\X\), find a projection \(\mathbf{P}\) onto \(\R^M\) with \(M \leq p\) that minimizes the reconstruction error \[ \begin{aligned} \min_{\mathbf{P}} &\,\, \lVert \mathbf{X} - \mathbf{X}\mathbf{P} \rVert^2_F \,\,\, \textrm{(the sum of the squared elements)}\\ \textrm{subject to} &\,\, \textrm{rank}(\mathbf{P}) = M,\, \mathbf{P} = \mathbf{P}^\top,\, \mathbf{P} = \mathbf{P}^2 \end{aligned} \] The conditions ensure that \(\mathbf{P}\) is a projection matrix onto \(M\) dimensions.

  2. Maximize the variance explained by an orthogonal transformation \(\mathbf{A} \in \R^{p\times M}\) \[ \begin{aligned} \max_{\mathbf{A}} &\,\, \textrm{trace}\left(\frac{1}{n}\mathbf{A}^\top \X^\top \X \mathbf{A}\right)\\ \textrm{subject to} &\,\, \mathbf{A}^\top\mathbf{A} = \mathbf{I}_M \end{aligned} \]

  • In case one, the minimizer is \(\mathbf{P} = \mathbf{V}_M\mathbf{V}_M^\top\)
  • In case two, the maximizer is \(\mathbf{A} = \mathbf{V}_M\).
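A small numerical check of both claims, as a sketch on simulated centered data (Xs and s5 are illustrative names):

# X V_M V_M^T equals the rank-M truncated SVD U_M D_M V_M^T,
# and the captured variance is sum(d[1:M]^2) / n
set.seed(406)
Xs <- scale(matrix(rnorm(50 * 5), 50, 5), scale = FALSE)
s5 <- svd(Xs); M <- 2
P <- s5$v[, 1:M] %*% t(s5$v[, 1:M])
max(abs(Xs %*% P - s5$u[, 1:M] %*% diag(s5$d[1:M]) %*% t(s5$v[, 1:M]))) # ~ 0
sum(diag(crossprod(Xs %*% s5$v[, 1:M]))) / 50 # trace(A^T X^T X A) / n
sum(s5$d[1:M]^2) / 50                         # the same value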

Lower dimensional embeddings

Suppose we have predictors \(\x_1\) and \(\x_2\)

  • We preserve the structure of this data more faithfully by keeping \(\x_1\) and setting \(\x_2\) to zero than by doing the reverse

Lower dimensional embeddings

An important feature of the previous example is that \(\x_1\) and \(\x_2\) aren’t correlated

What if they are?

We lose a lot of structure by setting either \(\x_1\) or \(\x_2\) to zero

Lower dimensional embeddings

The only difference is that the first is a rotation of the second

PCA

If we knew how to rotate our data, then we could more easily retain the structure.

PCA gives us exactly this rotation

  1. Center (+scale?) the data matrix \(\X\)
  2. Compute the SVD of \(\X = \U\D \V^\top\) or \(\X\X^\top = \U\D^2\U^\top\) or \(\X^\top \X = \V\D^2 \V^\top\)
  3. Return \(\U_M\D_M\), where \(\D_M\) contains the largest \(M\) singular values of \(\X\)
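A minimal sketch of these three steps for a generic numeric matrix X (Xc, sv, and scores are illustrative names; prcomp(), used below, wraps the same computation):

# 1. center (and optionally scale) the columns
Xc <- scale(X, center = TRUE, scale = FALSE)
# 2. SVD of the centered matrix
sv <- svd(Xc)
# 3. keep the top-M scores U_M D_M (equivalently Xc %*% sv$v[, 1:M])
M <- 2
scores <- sv$u[, 1:M] %*% diag(sv$d[1:M], M)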

PCA

Code
s <- svd(X) # SVD of the (already centered) simulated data
# stack the original data, the rotated scores UD = XV, and the rank-1 projection
tib <- rbind(X, s$u %*% diag(s$d), s$u %*% diag(c(s$d[1], 0)))
tib <- tibble(
  x1 = tib[, 1], x2 = tib[, 2],
  name = rep(1:3, each = 20)
)
plotter <- function(set = 1, main = "original") {
  tib |>
    filter(name == set) |>
    ggplot(aes(x1, x2)) +
    geom_point(colour = blue) +
    coord_cartesian(c(-2, 2), c(-2, 2)) +
    theme(legend.title = element_blank(), legend.position = "bottom") +
    ggtitle(main)
}
cowplot::plot_grid(
  plotter() + labs(x = bquote(x[1]), y = bquote(x[2])),
  plotter(2, "rotated") +
    labs(x = bquote((UD)[1] == (XV)[1]), y = bquote((UD)[2] == (XV)[2])),
  plotter(3, "rotated and projected") +
    labs(x = bquote(U[1] ~ D[1] == (XV)[1]), y = bquote(U[2] ~ D[2] %==% 0)),
  nrow = 1
)

PCA on some pop music data

music <- Stat406::popmusic_train
str(music)
tibble [1,269 × 15] (S3: tbl_df/tbl/data.frame)
 $ artist          : Factor w/ 3 levels "Radiohead","Taylor Swift",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ danceability    : num [1:1269] 0.781 0.627 0.516 0.629 0.686 0.522 0.31 0.705 0.553 0.419 ...
 $ energy          : num [1:1269] 0.357 0.266 0.917 0.757 0.705 0.691 0.374 0.621 0.604 0.908 ...
 $ key             : int [1:1269] 0 9 11 1 9 2 6 2 1 9 ...
 $ loudness        : num [1:1269] -16.39 -15.43 -3.19 -8.37 -10.82 ...
 $ mode            : int [1:1269] 1 1 0 0 1 1 1 1 0 1 ...
 $ speechiness     : num [1:1269] 0.912 0.929 0.0827 0.0512 0.249 0.0307 0.0275 0.0334 0.0258 0.0651 ...
 $ acousticness    : num [1:1269] 0.717 0.796 0.0139 0.00384 0.832 0.00609 0.761 0.101 0.202 0.00048 ...
 $ instrumentalness: num [1:1269] 0.00 0.00 6.37e-06 7.45e-01 4.55e-06 0.00 2.46e-05 4.30e-06 0.00 0.00 ...
 $ liveness        : num [1:1269] 0.185 0.292 0.36 0.0864 0.134 0.249 0.11 0.147 0.125 0.815 ...
 $ valence         : num [1:1269] 0.645 0.646 0.635 0.623 0.919 0.651 0.16 0.395 0.186 0.472 ...
 $ tempo           : num [1:1269] 118.3 79.3 145.8 157 151.9 ...
 $ time_signature  : int [1:1269] 4 4 4 4 5 4 4 4 4 4 ...
 $ duration_ms     : int [1:1269] 107133 89648 217160 201853 180653 201106 285640 240773 302266 290520 ...
 $ explicit        : logi [1:1269] FALSE FALSE FALSE FALSE FALSE FALSE ...
X <- music |> select(danceability:energy, loudness, speechiness:valence)
pca <- prcomp(X, scale = TRUE) ## DON'T USE princomp()

PCA on some pop music data

proj_pca <- predict(pca)[,1:2] |>
  as_tibble() |>
  mutate(artist = music$artist)
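Here predict(pca) returns the scores \(\mathbf{Z} = \X\V\). One way to look at the first two, coloured by artist (a sketch assuming the same ggplot2 setup as the earlier chunks):

proj_pca |>
  ggplot(aes(PC1, PC2, colour = artist)) +
  geom_point() +
  scale_colour_brewer(palette = "Dark2") +
  theme(legend.position = "bottom")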

Things to look at

pca$rotation[, 1:2]
                         PC1          PC2
danceability     -0.02811038  0.577020027
energy           -0.56077454 -0.001137424
loudness         -0.53893087  0.085912854
speechiness       0.30125038  0.431188730
acousticness      0.51172138  0.082108326
instrumentalness  0.01374425 -0.370058813
liveness         -0.02282669 -0.194947054
valence          -0.20242211  0.540418541
pca$sdev
[1] 1.6233469 1.3466829 1.0690017 0.9510417 0.7638669 0.6737958 0.5495869
[8] 0.4054698
pca$sdev^2 / sum(pca$sdev^2)
[1] 0.32940690 0.22669435 0.14284558 0.11306004 0.07293658 0.05675010 0.03775572
[8] 0.02055072
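These proportions are often summarized in a scree-style plot. A minimal sketch of the cumulative proportion of variance explained:

tibble(
  component = seq_along(pca$sdev),
  pve = pca$sdev^2 / sum(pca$sdev^2),
  cumulative = cumsum(pve)
) |>
  ggplot(aes(component, cumulative)) +
  geom_point() +
  geom_line() +
  ylim(0, 1) +
  ylab("cumulative variance explained")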

Plotting the weights

Code
# reshape the first two loading vectors into long format for plotting
pca$rotation[, 1:2] |>
  as_tibble() |>
  mutate(feature = names(X)) |>
  pivot_longer(-feature) |>
  ggplot(aes(value, feature, fill = feature)) +
  facet_wrap(~name) +
  geom_col() +
  theme(legend.position = "none", axis.title = element_blank()) +
  scale_fill_brewer(palette = "Dark2") +
  geom_vline(xintercept = 0)

Next time…

When does PCA work?