26 PCA v KPCA

Stat 406

Geoff Pleiss, Trevor Campbell

Last modified – 24 November 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \]

PCA v KPCA

(We assume \(\X\) is already centered/scaled, \(n\) rows, \(p\) columns)

PCA:

  1. Decompose \(\X=\U\D\V^\top\) (SVD).
  2. Embed into \(M\leq p\) dimensions: \[\U_M \D_M = \X\V_M\]

The “embedding” is \(\U_M \D_M\).

(called the “Principal Components” or the “scores” or occasionally the “factors”)

The “loadings” or “weights” are \(\V_M\)
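
A minimal sketch (simulated data, not the lecture example) confirming that the two expressions for the embedding agree, since \(\V^\top\V = \mathbf{I}\) implies \(\X\V_M = \U_M\D_M\):

set.seed(406)
X0 <- scale(matrix(rnorm(100 * 5), 100, 5)) # simulated, centered/scaled data
s <- svd(X0) # X0 = U D V'
M <- 2
max(abs(s$u[, 1:M] %*% diag(s$d[1:M]) - X0 %*% s$v[, 1:M])) # ~ 0: U_M D_M = X V_M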

KPCA:

  1. Choose a kernel \(k(x_i, x_{i'})\). Create the \(n\times n\) matrix \(\mathbf{K}\) with \(\mathbf{K}_{ii'} = k(x_i, x_{i'})\).
  2. Double center: \(\mathbf{K} \leftarrow \mathbf{P}\mathbf{K}\mathbf{P}\), where \(\mathbf{P} = \mathbf{I}_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top\).
  3. Decompose \(\mathbf{K} = \U \D^2 \U^\top\) (eigendecomposition).
  4. Embed into \(M\leq n\) dimensions: \[\U_M \D_M\]

The “embedding” is \(\U_M \D_M\).

There are no “loadings”
(\(\not\exists\ \mathbf{B}\) such that \(\X\mathbf{B} = \U_M \D_M\))
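
A quick sanity check (again a sketch on simulated data): with the linear kernel \(k(x_i, x_{i'}) = x_i^\top x_{i'}\) we get \(\mathbf{K} = \X\X^\top = \U\D^2\U^\top\), so KPCA reproduces the PCA scores up to column signs.

set.seed(406)
X0 <- scale(matrix(rnorm(50 * 4), 50, 4)) # simulated, centered/scaled data
M <- 2
pca_emb <- X0 %*% svd(X0)$v[, 1:M]       # PCA embedding X V_M
K <- X0 %*% t(X0)                        # linear kernel; X0 is centered, so PKP = K
e <- eigen(K)                            # K = U D^2 U'
kpca_emb <- e$vectors[, 1:M] %*% diag(sqrt(e$values[1:M]))
max(abs(abs(pca_emb) - abs(kpca_emb)))   # ~ 0: same embedding up to sign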

Why is this the solution?

The “maximize variance” version of PCA:

\[\max_\alpha \Var{\X\alpha} \quad \textrm{ subject to } \quad \left|\left| \alpha \right|\right|_2^2 = 1\]

( \(\Var{\X\alpha} = \alpha^\top\X^\top\X\alpha\), up to the constant \(1/n\), since \(\X\) is centered )

This is equivalent to solving (Lagrangian):

\[\max_\alpha \alpha^\top\X^\top\X\alpha - \lambda\left|\left| \alpha \right|\right|_2^2\]

Take derivative wrt \(\alpha\) and set to 0:

\[0 = 2\X^\top\X\alpha - 2\lambda\alpha\]

This is an eigenvector equation: \(\X^\top\X\alpha = \lambda\alpha\), so \(\alpha\) must be an eigenvector of \(\X^\top\X\). The solution is \(\alpha=\V_1\), the leading eigenvector, and the maximum is \(\D_1^2\), the largest eigenvalue.
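
A numerical check of this claim (a sketch on simulated data; the random \(\alpha\)'s are just for illustration): \(\alpha = \V_1\) attains \(\D_1^2\), and random unit vectors never beat it.

set.seed(406)
X0 <- scale(matrix(rnorm(200 * 6), 200, 6))
s <- svd(X0)
v1 <- s$v[, 1]
c(attained = drop(t(v1) %*% crossprod(X0) %*% v1), d1_squared = s$d[1]^2) # equal
# random unit vectors never do better
alpha <- matrix(rnorm(6 * 1000), 6)
alpha <- sweep(alpha, 2, sqrt(colSums(alpha^2)), "/") # normalize each column
max(colSums((X0 %*% alpha)^2)) # <= d_1^2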

Example (not real unless there’s code)

library(tidyverse) # dplyr (select, mutate), tibble, and ggplot2 are used below
X <- Stat406::mobility |>
  select(Black:Married) |>
  as.matrix()
not_missing <- complete.cases(X)
X <- scale(X[not_missing, ], center = TRUE, scale = TRUE) # center and scale the columns
colors <- Stat406::mobility$Mobility[not_missing] # mobility values, used to color plots
M <- 2 # embedding dimension
P <- diag(nrow(X)) - 1 / nrow(X) # double-centering matrix I - (1/n) 11'

PCA: (all 3 are equivalent)

s <- svd(X) # use svd: X = U D V'
pca_loadings <- s$v[, 1:M]
pca_scores <- X %*% pca_loadings

s <- eigen(t(X) %*% X) # X'X = V D^2 V'
pca_loadings <- s$vectors[, 1:M]
pca_scores <- X %*% pca_loadings

s <- eigen(X %*% t(X)) # XX' = U D^2 U'
D <- sqrt(diag(s$values[1:M]))
U <- s$vectors[, 1:M]
pca_scores <- U %*% D
pca_loadings <- t(X) %*% U %*% solve(D) # recovers V_M (1 / D would invert elementwise)
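
As an extra sanity check (not part of the lecture code), prcomp() computes the same SVD, so its scores should match up to column signs:

pc <- prcomp(X, center = FALSE, scale. = FALSE) # X was already centered and scaled above
max(abs(abs(pc$x[, 1:M]) - abs(pca_scores)))    # ~ 0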

KPCA:

d <- 2 # polynomial degree
K <- P %*% (1 + X %*% t(X))^d %*% P # polynomial kernel, double centered
e <- eigen(K) # U D^2 U'
# (different from the PCA decomposition: this K is not XX')
U <- e$vectors[, 1:M]
D <- diag(sqrt(e$values[1:M]))
kpca_poly <- U %*% D

K <- P %*% tanh(1 + X %*% t(X)) %*% P # sigmoid kernel, double centered
e <- eigen(K) # U D^2 U'
U <- e$vectors[, 1:M]
D <- diag(sqrt(e$values[1:M]))
kpca_sigmoid <- U %*% D
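
The same recipe works for any kernel. For example, a Gaussian (RBF) kernel, sketched here with an arbitrary bandwidth choice (gamma is not from the lecture):

gamma <- 1 / ncol(X) # arbitrary bandwidth, for illustration only
K <- P %*% exp(-gamma * as.matrix(dist(X))^2) %*% P # Gaussian (RBF) kernel, double centered
e <- eigen(K) # U D^2 U'
U <- e$vectors[, 1:M]
D <- diag(sqrt(e$values[1:M]))
kpca_rbf <- U %*% D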

Plotting

PCA loadings

Showing the first 10 PCA loadings:

  • The first column contains the weights that form the first score
  • Each row corresponds to a variable in the original data
  • How much does that variable contribute to that score?

head(round(pca_loadings, 2), 10)
       [,1]  [,2]
 [1,]  0.25  0.07
 [2,]  0.13 -0.14
 [3,]  0.17 -0.34
 [4,]  0.18 -0.33
 [5,]  0.16 -0.34
 [6,] -0.24  0.11
 [7,] -0.04 -0.35
 [8,]  0.28  0.00
 [9,]  0.13 -0.14
[10,]  0.29  0.10
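
To look at the embeddings themselves, something like the following sketch works (the colors vector defined earlier holds the mobility values; the faceted layout is my own choice, not the original figure):

embeddings <- rbind(
  data.frame(x1 = pca_scores[, 1], x2 = pca_scores[, 2], method = "PCA"),
  data.frame(x1 = kpca_poly[, 1], x2 = kpca_poly[, 2], method = "KPCA (polynomial)"),
  data.frame(x1 = kpca_sigmoid[, 1], x2 = kpca_sigmoid[, 2], method = "KPCA (sigmoid)")
)
embeddings$mobility <- rep(colors, 3)
ggplot(embeddings, aes(x1, x2, colour = mobility)) +
  geom_point() +
  facet_wrap(~method, scales = "free") +
  scale_colour_viridis_c()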

KPCA, feature map version

p <- ncol(X)
width <- p * (p - 1) / 2 + p # = 630 quadratic features
Z <- matrix(NA, nrow(X), width)
k <- 0
for (i in 1:p) {
  for (j in i:p) {
    k <- k + 1
    Z[, k] <- X[, i] * X[, j] # all products x_i * x_j with i <= j
  }
}
wideX <- scale(cbind(X, Z))
s <- RSpectra::svds(wideX, 2) # the full svd would be super slow
fkpca_scores <- s$u %*% diag(s$d)

  • Unfortunately, we can’t easily compare this to kpca_poly to check that the results agree
  • It can also cause numerical issues
  • But the two should be the “same” (assuming I didn’t screw up…; the kernel identity behind this is sketched below)
  • This version would also give us loadings, though they would be weights on the polynomial features rather than the original variables
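
For reference, the exact feature map for the degree-2 polynomial kernel puts a \(\sqrt{2}\) weight on the linear and cross-product terms; the code above uses unweighted products and then re-scales the columns, which is part of why a direct comparison is awkward. A small sketch verifying the kernel identity itself (phi, x, z are illustrative names):

phi <- function(x) { # explicit feature map for k(x, z) = (1 + x'z)^2
  p <- length(x)
  cross <- outer(x, x)[upper.tri(diag(p))] # products x_i x_j with i < j
  c(1, sqrt(2) * x, x^2, sqrt(2) * cross)
}
x <- rnorm(5)
z <- rnorm(5)
c(feature_map = sum(phi(x) * phi(z)), kernel = (1 + sum(x * z))^2) # identical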

Other manifold learning methods

To name a few

  • Hessian maps
  • Laplacian eigenmaps
  • Classical Multidimensional Scaling
  • tSNE
  • UMAP
  • Locally linear embeddings
  • Diffusion maps
  • Local tangent space alignment
  • Isomap

Issues with nonlinear techniques

  1. Need to choose \(M\) (also with linear)
  2. Also other tuning parameters.
  3. These others can have huge effects
  4. The difference between the data lying on the manifold and the data lying near the manifold is important
elephant <- function(eye = TRUE) {
  # a parametric "elephant" curve: points lying exactly on a 1-D manifold in 2-D
  tib <- tibble(
    tt = -100:500 / 100,
    y = -(12 * cos(3 * tt) - 14 * cos(5 * tt) + 50 * sin(tt) + 18 * sin(2 * tt)),
    x = -30 * sin(tt) + 8 * sin(2 * tt) - 10 * sin(3 * tt) - 60 * cos(tt)
  )
  if (eye) tib <- add_row(tib, y = 20, x = 20)
  tib
}
ele <- elephant(FALSE)
noisy_ele <- ele |> # add noise: points lying *near* the manifold
  mutate(y = y + rnorm(n(), 0, 5), x = x + rnorm(n(), 0, 5))
ggplot(noisy_ele, aes(x, y, colour = tt)) +
  geom_point() +
  scale_color_viridis_c() +
  theme(legend.position = "none") +
  geom_path(data = ele, colour = "black", linewidth = 2)

Next time…

Clustering