25 Principal components, the troubles

Stat 406

Daniel J. McDonald

Last modified – 23 November 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \]

PCA

If we knew how to rotate our data, then we could more easily retain the structure.

PCA gives us exactly this rotation

PCA works when the data can be represented (in a lower dimension) as lines (or planes, or hyperplanes).

So, in two dimensions:

PCA reduced

Here, we can capture a lot of the variation and underlying structure with just 1 dimension,

instead of the original 2 (the colouring is for visualizing).
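
A minimal sketch in R with simulated data (not the lecture's figure): two nearly collinear variables are summarized well by a single principal component.

set.seed(406)
n <- 100
x1 <- rnorm(n)
x2 <- 2 * x1 + rnorm(n, sd = 0.2)   # nearly collinear with x1
pc <- prcomp(cbind(x1, x2))
summary(pc)                          # PC1 explains nearly all the variance
z <- pc$x[, 1]                       # the 1-dimensional representation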

PCA bad

What about other data structures? Again in two dimensions


Here, we have failed miserably.

There is actually only 1 dimension to this data (imagine walking up the spiral going from purple to yellow).

However, when we write it as 1 PCA dimension, the points all get “mixed up”.

Explanation

PCA wants to minimize the distances from each point to its projection (equivalently, maximize the variance of the projections).

This means it slices through the data at the meatiest point, and then the next one, and so on.

If the data are curved, this is going to induce artifacts.

PCA also treats points as close if they are near each other in a Euclidean sense.

On the spiral, our intuition says that two points are close only if the distance between them is measured along the curve.

In other words, purple and blue are close, blue and yellow are not.
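
To make this concrete, here is a hypothetical spiral (an assumed construction, not the lecture's df_spiral) comparing the straight-line distance between its two ends with the distance travelled along the curve:

theta <- seq(0.5, 4 * pi, length.out = 200)
spiral <- cbind(x = theta * cos(theta), y = theta * sin(theta))
# straight-line (Euclidean) distance between the two endpoints
sqrt(sum((spiral[1, ] - spiral[200, ])^2))
# distance walking along the spiral, from one end to the other
sum(sqrt(rowSums(diff(spiral)^2)))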

Kernel PCA

Classical PCA comes from \(\X= \U\D\V^{\top}\), the SVD of the (centered) data

However, we can just as easily get it from the outer product \(\mathbf{K} = \X\X^{\top} = \U\D^2\U^{\top}\)

The intuition behind KPCA is that \(\mathbf{K}\) is an expansion into a kernel space, where \[\mathbf{K}_{i,i'} = k(x_i,\ x_{i'}) = \langle \phi(x_i),\ \phi(x_{i'}) \rangle\] for some (implicit) feature map \(\phi\).

We saw this trick before with feature expansion.

Procedure

  1. Specify a kernel function \(k\)
    many people use \(k(x,x') = \exp\left( -d(x, x')/\gamma\right)\) where \(d(x,x') = \norm{x-x'}_2^2\)
  2. Form \(\mathbf{K}_{i,i'} = k(x_i,x_{i'})\)
  3. Double center \(\mathbf{K} = \mathbf{PKP}\) where \(\mathbf{P} = \mathbf{I}_n - \mathbf{11}^\top / n\)
  4. Take eigendecomposition \(\mathbf{K} = \U\D^2\U^{\top}\)

The scores are still \(\mathbf{Z} = \U_M\D_M\)

Note

We don’t explicitly generate the feature map \(\longrightarrow\) there are NO loadings
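
The four steps above can be collected into a short function. This is a sketch assuming a Gaussian kernel; the function name and defaults are mine, not the lecture's.

kpca_scores <- function(X, gamma = 1, M = 2) {
  n <- nrow(X)
  K <- exp(-as.matrix(dist(X))^2 / gamma)   # steps 1-2: K[i, i'] = k(x_i, x_i')
  P <- diag(n) - matrix(1 / n, n, n)        # step 3: double centering, K <- PKP
  K <- P %*% K %*% P
  E <- eigen(K, symmetric = TRUE)           # step 4: K = U D^2 U'
  # scores Z = U_M D_M (the eigenvalues of K are the squared singular values)
  E$vectors[, 1:M, drop = FALSE] %*% diag(sqrt(pmax(E$values[1:M], 0)), nrow = M)
}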

An alternate view

To get the first PC in classical PCA, we solve \[\max_\alpha \Var{\X\alpha} \quad \textrm{ subject to } \quad \left|\left| \alpha \right|\right|_2^2 = 1\]

In the kernel setting we solve \[\max_{g \in \mathcal{H}_k} \Var{g(X)} \quad \textrm{ subject to } \quad\left|\left| g \right|\right|_{\mathcal{H}_k} = 1\]

Here \(\mathcal{H}_k\) is a function space determined by \(k(x,x')\).
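
For the classical problem, a standard Lagrangian argument (not spelled out on the slide) shows why the answer is the top eigenvector: with centered data and \(\widehat{\Sigma} = \X^\top\X / n\),

\[ \max_\alpha\ \alpha^\top \widehat{\Sigma}\, \alpha \quad \st \quad \alpha^\top\alpha = 1 \quad\Longrightarrow\quad \widehat{\Sigma}\,\alpha = \lambda\alpha, \]

so the maximizer is the eigenvector of \(\widehat{\Sigma}\) with the largest eigenvalue (the first right singular vector of \(\X\)), and the maximized variance is that eigenvalue.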

Example kernels

  • \(k(x,x') = x^\top x'\) gives back regular PCA
  • \(k(x,x') = (1+x^\top x')^d\) gives a function space which contains all \(d^{th}\)-order polynomials
  • \(k(x,x') = \exp(-\norm{x-x'}_2^2/\gamma)\) gives a function space spanned by the infinite Fourier basis

For more details see [ESL 14.5]
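
In R, these three kernels can be written as kernel-matrix constructors (a sketch; the polynomial and Gaussian versions reappear in the spiral code below):

lin_K   <- function(X) tcrossprod(as.matrix(X))                  # x^T x': regular PCA
poly_K  <- function(X, d = 2) (1 + tcrossprod(as.matrix(X)))^d   # (1 + x^T x')^d
gauss_K <- function(X, gamma = 1) exp(-as.matrix(dist(X))^2 / gamma)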

Kernel PCA on the spiral

Code
# requires tibble and dplyr (tibble(), bind_rows()); df_spiral (columns x, y, z),
# its classical PCA reduction df_spiral2, and the jittered y-values `jit`
# are created on earlier slides
library(tibble)
library(dplyr)
n <- nrow(df_spiral)
I_M <- diag(n) - tcrossprod(rep(1, n)) / n              # double-centering matrix P
kp <- (tcrossprod(as.matrix(df_spiral[, 1:2])) + 1)^2   # polynomial kernel, d = 2
Kp <- I_M %*% kp %*% I_M
Ep <- eigen(Kp, symmetric = TRUE)
polydf <- tibble(
  x = Ep$vectors[, 1] * sqrt(Ep$values[1]),   # first score, Z = U_1 D_1
  y = jit,
  z = df_spiral$z
)
kg <- exp(-as.matrix(dist(df_spiral[, 1:2]))^2 / 1)     # Gaussian kernel, gamma = 1
Kg <- I_M %*% kg %*% I_M
Eg <- eigen(Kg, symmetric = TRUE)
gaussdf <- tibble(
  x = Eg$vectors[, 1] * sqrt(Eg$values[1]),   # first score, Z = U_1 D_1
  y = jit,
  z = df_spiral$z
)
dfkern <- bind_rows(df_spiral, df_spiral2, polydf, gaussdf)
dfkern$method <- rep(c("data", "pca", "kpoly (d = 2)", "kgauss (gamma = 1)"), each = n)
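
The combined data frame is presumably faceted by method for plotting; a guess at the call (it is not shown in this extract):

library(ggplot2)
ggplot(dfkern, aes(x, y, colour = z)) +
  geom_point() +
  facet_wrap(~ method, scales = "free") +
  scale_colour_viridis_c()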

KPCA: summary

Kernel PCA seeks to generalize the notion of similarity using a kernel map

This can be interpreted as finding smooth, orthogonal directions in an RKHS

This can allow us to start picking up nonlinear (in the original feature space) aspects of our data

This new representation can be passed to a supervised method to form a semisupervised learner

This kernel is different from the one in kernel smoothing!!

  • Just like with PCA (and lots of other things), the way you measure distance is important
  • The choice of kernel is important
  • The embedding dimension must be chosen

Basic semi-supervised (see [ISLR 6.3.1])

  1. You get data \(\{(x_1,y_1),\ldots,(x_n,y_n)\}\).

  2. You do something unsupervised on \(\X\) to create new features (like PCA).

  3. You use the learned features to find a predictor \(\hat{f}\) (say, do OLS on them)

Quick example

# requires dplyr for select(); popmusic_train ships with the Stat406 package
library(dplyr)
music <- Stat406::popmusic_train
X <- music |> select(danceability:energy, loudness, speechiness:valence)
pca <- prcomp(X, scale = TRUE)   # center and scale the features, then rotate
Z <- predict(pca)[, 1:2]         # scores on the first two PCs
Zgrid <- expand.grid(
  Z1 = seq(min(Z[, 1]), max(Z[, 1]), len = 100L),
  Z2 = seq(min(Z[, 2]), max(Z[, 2]), len = 100L)
)
# 6-nearest-neighbour prediction of artist over the grid, using only the 2 learned features
out <- class::knn(Z, Zgrid, music$artist, k = 6)
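
Presumably out is then used to draw the KNN decision regions in the two-dimensional PC space; a guess at the visualization (not shown in this extract):

library(ggplot2)
ggplot() +
  geom_raster(
    data = data.frame(Zgrid, artist = out),
    aes(Z1, Z2, fill = artist), alpha = 0.5
  ) +
  geom_point(
    data = data.frame(Z1 = Z[, 1], Z2 = Z[, 2], artist = music$artist),
    aes(Z1, Z2, colour = artist)
  )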

Why?

Recall that for nonparametric regression, we must deal with the curse of dimensionality.

  • This did KNN in 2 dimensions instead of 8: less curse

  • However, the dimensions were learned independently of \(\y\): may not be helpful

  • There’s a bias-variance tradeoff here…

What is PCA / KPCA estimating?

Imagine that the population data lie near a low-dimensional manifold…

  • Think of the spiral. (1D manifold in 2D space)
  • Or a flat piece of paper (2D manifold in 3D space)

PCA / KPCA are estimating the manifold.

  • This works well under some conditions
  • And if it’s true, then we have reduced variance but added a bit of bias
  • And the downstream method may work better, because there are fewer predictors.

We just need the data to lie near the manifold; if they don’t, then we have introduced a lot of bias.

PCA estimates linear manifolds; KPCA estimates non-linear ones.

Next time…

More PCA vs. KPCA