Principal components analysis

Stat 550

Daniel J. McDonald

Last modified – 03 April 2024

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\mid} \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \renewcommand{\hat}{\widehat} \]

Representation learning

Representation learning is the idea that performance of ML methods is highly dependent on the choice of representation

For this reason, much of ML is geared towards transforming the data into the relevant features and then using these as inputs

This idea is really as old as statistics itself

However, the idea is constantly revisited in a variety of fields and contexts

Commonly, these learned representations capture low-level information like overall shapes

It is possible to quantify this intuition for PCA at least

Transform \(\mathbf{X}\in \R^{n\times p}\) into \(\mathbf{Z} \in \R^{n \times ?}\)

The new dimension ? can be bigger (feature creation) or smaller (dimension reduction) than \(p\)


Principal components analysis (PCA) is a dimension reduction technique

It solves various equivalent optimization problems

(Maximize variance, minimize \(\ell_2\) distortions, find closest subspace of a given rank, \(\ldots\))

At its core, we are finding linear combinations of the original (centered) data \[z_{ij} = \alpha_j^{\top} x_i\]

Lower dimensional embeddings

Suppose we have predictors \(\x_1\) and \(\x_2\) (columns / features / measurements)

  • We more faithfully preserve the structure of this data by keeping \(\x_1\) and setting \(\x_2\) to zero than the opposite

Lower dimensional embeddings

An important feature of the previous example is that \(\x_1\) and \(\x_2\) aren’t correlated

What if they are?

We lose a lot of structure by setting either \(\x_1\) or \(\x_2\) to zero

Lower dimensional embeddings

The only difference between the two examples is that the first is a rotation of the second
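The effect of a rotation can be checked numerically. The following is a minimal sketch on simulated data, not the data shown in the slides: the sample size, the standard deviations, the 45 degree angle, and the helper `frac_lost` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Uncorrelated features: x1 has a much larger spread than x2 (assumed values).
X = np.column_stack([rng.normal(0, 3.0, n), rng.normal(0, 0.5, n)])
X = X - X.mean(axis=0)

# Rotate every observation by 45 degrees so the two features become correlated.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X_rot = X @ R.T


def frac_lost(data, drop):
    """Fraction of the total sum of squares lost by zeroing column `drop`."""
    return np.sum(data[:, drop] ** 2) / np.sum(data ** 2)


# Uncorrelated case: dropping x2 loses very little.
lost_uncorrelated = frac_lost(X, drop=1)
# Rotated case: even the better of the two choices loses a large share.
lost_rotated = min(frac_lost(X_rot, 0), frac_lost(X_rot, 1))

print(f"uncorrelated, drop x2: {lost_uncorrelated:.3f}")
print(f"rotated, best single drop: {lost_rotated:.3f}")
```

The rotation leaves the total sum of squares unchanged, so the two cases carry the same structure; only the coordinate system decides how much is lost by zeroing one column.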


If we knew how to rotate our data, then we could more easily retain the structure.

PCA gives us exactly this rotation

  1. Center (+scale?) the data matrix \(\X\)
  2. Compute the SVD of \(\X = \U\D \V^\top\) (always exists)
  3. Return \(\U_M\D_M\), where \(\D_M\) is the diagonal matrix of the \(M\) largest singular values of \(\X\) and \(\U_M\) holds the corresponding \(M\) columns of \(\U\)
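The three steps above can be sketched in a few lines of NumPy. This is an illustrative sketch under assumptions, not the code used to produce the slides: the function name `pca_scores`, its `scale` argument, and the simulated input are hypothetical.

```python
import numpy as np


def pca_scores(X, M, scale=False):
    """Return the first M principal component scores, U_M D_M."""
    X = np.asarray(X, dtype=float)
    # Step 1: center each column (and optionally scale to unit variance).
    Xc = X - X.mean(axis=0)
    if scale:
        Xc = Xc / Xc.std(axis=0, ddof=1)
    # Step 2: thin SVD; NumPy returns singular values in decreasing order.
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Step 3: keep the first M columns of U, each scaled by its singular value.
    return U[:, :M] * d[:M]


# Simulated data with correlated columns (assumed, for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
Z = pca_scores(X, M=2)
print(Z.shape)
```

Whether to scale in step 1 depends on whether the columns are measured in comparable units; the sketch leaves that choice to the caller.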


PCA on some pop music data

# A tibble: 1,269 × 15
   artist      danceability energy   key loudness  mode speechiness acousticness
   <fct>              <dbl>  <dbl> <int>    <dbl> <int>       <dbl>        <dbl>
 1 Taylor Swi…        0.781  0.357     0   -16.4      1      0.912       0.717  
 2 Taylor Swi…        0.627  0.266     9   -15.4      1      0.929       0.796  
 3 Taylor Swi…        0.516  0.917    11    -3.19     0      0.0827      0.0139 
 4 Taylor Swi…        0.629  0.757     1    -8.37     0      0.0512      0.00384
 5 Taylor Swi…        0.686  0.705     9   -10.8      1      0.249       0.832  
 6 Taylor Swi…        0.522  0.691     2    -4.82     1      0.0307      0.00609
 7 Taylor Swi…        0.31   0.374     6    -8.46     1      0.0275      0.761  
 8 Taylor Swi…        0.705  0.621     2    -8.09     1      0.0334      0.101  
 9 Taylor Swi…        0.553  0.604     1    -5.30     0      0.0258      0.202  
10 Taylor Swi…        0.419  0.908     9    -5.16     1      0.0651      0.00048
# ℹ 1,259 more rows
# ℹ 7 more variables: instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <int>, duration_ms <int>, explicit <lgl>

PCA on some pop music data

  • 15 dimensions to 2
  • coloured by artist

Plotting the weights, \(\alpha_j,\ j=1,2\)

Mathematical details

Matrix decompositions

At its core, we are finding linear combinations of the original (centered) data \[z_{ij} = \alpha_j^{\top} x_i\]

This is expressed via the SVD: \(\X = \U\D\V^{\top}\).


We assume throughout that we have centered the data

Then our new features are

\[\mathbf{Z} = \X \V = \U\D\]
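The identity \(\mathbf{Z} = \X\V = \U\D\) follows from \(\V^\top\V = \mathbf{I}\), and it is easy to confirm numerically. A minimal sketch on simulated, centered data (the dimensions and variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)  # centered, as assumed throughout

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

Z_from_V = X @ V   # project the data onto the columns of V
Z_from_U = U * d   # same as U @ np.diag(d)

# The first column of V plays the role of alpha_1: z_{i1} = alpha_1' x_i.
alpha_1 = V[:, 0]
z_first = X @ alpha_1

print(np.allclose(Z_from_V, Z_from_U))
print(np.allclose(z_first, Z_from_V[:, 0]))
```

Both routes give the same scores, so the weights \(\alpha_j\) can be read off as the columns of \(\V\).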

Short SVD aside

  • Any \(n\times p\) matrix can be decomposed into \(\mathbf{UDV}^\top\).

  • The decomposition is computed numerically, much like a matrix inverse, with a routine such as svd()

  • The factors have the following properties:

  1. \(\mathbf{U}^\top \mathbf{U} = \mathbf{I}_n\)
  2. \(\mathbf{V}^\top \mathbf{V} = \mathbf{I}_p\)
  3. \(\mathbf{D}\) is diagonal (0 off the diagonal)
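These properties can be checked directly. The sketch below uses NumPy's `np.linalg.svd` on a small random matrix (the sizes are assumptions). Note that the dimensions stated above match the full SVD, where \(\mathbf{U}\) is \(n\times n\) and \(\mathbf{D}\) is \(n\times p\); many routines return the thin version by default, in which \(\mathbf{U}\) is \(n\times p\) and \(\mathbf{U}^\top\mathbf{U} = \mathbf{I}_p\) when \(n \geq p\).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 6, 3
X = rng.normal(size=(n, p))

# Full SVD: U is n x n, Vt is p x p, d holds the p singular values.
U, d, Vt = np.linalg.svd(X, full_matrices=True)
D = np.zeros((n, p))
D[:p, :p] = np.diag(d)  # rectangular "diagonal" matrix

u_orthogonal = np.allclose(U.T @ U, np.eye(n))
v_orthogonal = np.allclose(Vt @ Vt.T, np.eye(p))
reconstructs = np.allclose(U @ D @ Vt, X)

# Thin SVD: U is n x p, so U'U is the p x p identity instead.
U_thin, d_thin, Vt_thin = np.linalg.svd(X, full_matrices=False)
thin_orthonormal = np.allclose(U_thin.T @ U_thin, np.eye(p))

print(u_orthogonal, v_orthogonal, reconstructs, thin_orthonormal)
```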

Many methods for dimension reduction use the SVD of some matrix.


  1. Given \(\X\), find a projection \(\mathbf{P}\) onto an \(M\)-dimensional subspace, with \(M \leq p\), that minimizes the reconstruction error \[ \begin{aligned} \min_{\mathbf{P}} &\,\, \lVert \mathbf{X} - \mathbf{X}\mathbf{P} \rVert^2_F \,\,\, \textrm{(sum of the squared elements)}\\ \textrm{subject to} &\,\, \textrm{rank}(\mathbf{P}) = M,\, \mathbf{P} = \mathbf{P}^\top,\, \mathbf{P} = \mathbf{P}^2 \end{aligned} \] The conditions ensure that \(\mathbf{P}\) is an orthogonal projection matrix onto \(M\) dimensions.

  2. Maximize the variance explained by an orthogonal transformation \(\mathbf{A} \in \R^{p\times M}\) \[ \begin{aligned} \max_{\mathbf{A}} &\,\, \textrm{trace}\left(\frac{1}{n}\mathbf{A}^\top \X^\top \X \mathbf{A}\right)\\ \textrm{subject to} &\,\, \mathbf{A}^\top\mathbf{A} = \mathbf{I}_M \end{aligned} \]

  • In case one, the minimizer is \(\mathbf{P} = \mathbf{V}_M\mathbf{V}_M^\top\)
  • In case two, the maximizer is \(\mathbf{A} = \mathbf{V}_M\).
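Both solutions can be checked numerically. With \(\mathbf{P} = \mathbf{V}_M\mathbf{V}_M^\top\), the reconstruction error equals the sum of the squared discarded singular values, and with \(\mathbf{A} = \mathbf{V}_M\) the objective in case two equals the sum of the \(M\) largest squared singular values divided by \(n\). The sketch below, on simulated data with assumed sizes, also compares against one randomly chosen feasible competitor; this illustrates optimality and is not a proof.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, M = 200, 6, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
X = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V_M = Vt[:M].T          # first M right singular vectors, p x M
P = V_M @ V_M.T         # candidate projection matrix, p x p

# Case one: reconstruction error in squared Frobenius norm.
recon_error = np.linalg.norm(X - X @ P, "fro") ** 2
# Case two: variance explained by the transformation A = V_M.
var_explained = np.trace(V_M.T @ X.T @ X @ V_M) / n

# A random competitor with orthonormal columns (feasible for both problems).
Q, _ = np.linalg.qr(rng.normal(size=(p, M)))
recon_error_other = np.linalg.norm(X - X @ Q @ Q.T, "fro") ** 2
var_other = np.trace(Q.T @ X.T @ X @ Q) / n

print(np.isclose(recon_error, np.sum(d[M:] ** 2)))
print(np.isclose(var_explained, np.sum(d[:M] ** 2) / n))
print(recon_error <= recon_error_other, var_explained >= var_other)
```

The two problems are linked by \(\lVert \X - \X\mathbf{P} \rVert_F^2 = \lVert \X \rVert_F^2 - \textrm{trace}(\mathbf{A}^\top\X^\top\X\mathbf{A})\) when \(\mathbf{P} = \mathbf{A}\mathbf{A}^\top\), so minimizing one is the same as maximizing the other.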

