26 PCA v KPCA

Stat 406

Daniel J. McDonald

Last modified – 24 November 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \]

PCA v KPCA

(We assume \(\X\) is already centered/scaled, \(n\) rows, \(p\) columns)

PCA:

  1. Decompose \(\X=\U\D\V^\top\) (SVD).
  2. Embed into \(M\leq p\) dimensions: \[\U_M \D_M = \X\V_M\]

The “embedding” is \(\U_M \D_M\).

(called the “Principal Components” or the “scores” or occasionally the “factors”)

The “loadings” or “weights” are \(\V_M\)
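
A quick numerical illustration of step 2 (a sketch on simulated data; Xsim below is made up for illustration):

# sketch: U_M D_M and X V_M are the same matrix
set.seed(406)
Xsim <- scale(matrix(rnorm(50 * 6), 50, 6)) # simulated, centered/scaled
s <- svd(Xsim)
M <- 2
max(abs(s$u[, 1:M] %*% diag(s$d[1:M]) - Xsim %*% s$v[, 1:M])) # ~ 0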

KPCA:

  1. Choose a kernel \(k(x_i, x_{i'})\). Create the \(n\times n\) matrix \(\mathbf{K}\) with entries \(\mathbf{K}_{ii'} = k(x_i, x_{i'})\).
  2. Double center: \(\mathbf{K} \leftarrow \mathbf{P}\mathbf{K}\mathbf{P}\), where \(\mathbf{P} = \mathbf{I}_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top\).
  3. Decompose \(\mathbf{K} = \U \D^2 \U^\top\) (eigendecomposition).
  4. Embed into \(M\leq n\) dimensions: \[\U_M \D_M\]

The “embedding” is \(\U_M \D_M\).

There are no “loadings”
(\(\not\exists\ \mathbf{B}\) such that \(\X\mathbf{B} = \U_M \D_M\))
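
One way to see the “no loadings” claim (a sketch on simulated data; Xsim, pc, and kpc are illustrative names): regressing PCA scores on the columns of \(\X\) gives a perfect fit, while regressing a kernel embedding on them does not.

# sketch: PCA scores are an exact linear function of X; a KPCA embedding is not
set.seed(1)
Xsim <- scale(matrix(rnorm(100 * 5), 100, 5))
s <- svd(Xsim)
pc <- Xsim %*% s$v[, 1:2]                 # PCA scores
P <- diag(nrow(Xsim)) - 1 / nrow(Xsim)    # double-centering matrix
K <- P %*% (1 + Xsim %*% t(Xsim))^2 %*% P # degree-2 polynomial kernel
e <- eigen(K)
kpc <- e$vectors[, 1:2] %*% diag(sqrt(e$values[1:2]))
sum(resid(lm(pc ~ Xsim))^2)  # essentially 0: some B with Xsim %*% B = scores
sum(resid(lm(kpc ~ Xsim))^2) # clearly positive: no such B for the embedding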

Why is this the solution?

The “maximize variance” version of PCA:

\[\max_\alpha \Var{\X\alpha} \quad \textrm{ subject to } \quad \left|\left| \alpha \right|\right|_2^2 = 1\]

( \(\Var{\X\alpha} = \alpha^\top\X^\top\X\alpha\) )

This is equivalent to solving (Lagrangian):

\[\max_\alpha \alpha^\top\X^\top\X\alpha - \lambda\left|\left| \alpha \right|\right|_2^2\]

Take derivative wrt \(\alpha\) and set to 0:

\[0 = 2\X^\top\X\alpha - 2\lambda\alpha\]

Rearranging gives \(\X^\top\X\alpha = \lambda\alpha\), an eigenproblem for \(\X^\top\X\). The solution is \(\alpha=\V_1\), the first eigenvector of \(\X^\top\X\) (equivalently, the first right singular vector of \(\X\)), and the maximum is \(\D_1^2\), the largest eigenvalue.
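
A quick numerical check of this claim (a sketch on simulated data; Xsim is made up): the first right singular vector attains \(\D_1^2\), and any other unit vector does worse.

# sketch: alpha = V_1 maximizes alpha' X'X alpha, with maximum D_1^2
set.seed(42)
Xsim <- scale(matrix(rnorm(200 * 4), 200, 4))
s <- svd(Xsim)
alpha <- s$v[, 1]
c(t(alpha) %*% crossprod(Xsim) %*% alpha, s$d[1]^2) # these two agree
beta <- rnorm(4)
beta <- beta / sqrt(sum(beta^2))                    # some other unit vector
t(beta) %*% crossprod(Xsim) %*% beta                # smaller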

Example (not real unless there’s code)

library(tidyverse) # for select(), tibble(), ggplot(), ...
X <- Stat406::mobility |>
  select(Black:Married) |>
  as.matrix()
not_missing <- complete.cases(X)
X <- scale(X[not_missing, ], center = TRUE, scale = TRUE)
colors <- Stat406::mobility$Mobility[not_missing]
M <- 2 # embedding dimension
P <- diag(nrow(X)) - 1 / nrow(X) # double-centering matrix (I - 11'/n)

PCA: (all 3 are equivalent)

s <- svd(X) # use svd
pca_loadings <- s$v[, 1:M]
pca_scores <- X %*% pca_loadings
s <- eigen(t(X) %*% X) # V D^2 V'
pca_loadings <- s$vectors[, 1:M]
pca_scores <- X %*% pca_loadings
s <- eigen(X %*% t(X)) # U D^2 U'
D <- diag(sqrt(s$values[1:M]))
U <- s$vectors[, 1:M]
pca_scores <- U %*% D
pca_loadings <- t(X) %*% U %*% solve(D) # = V_M, same as the other two versions
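
A quick check (reusing X and M from above) that the three routes give the same scores, up to the sign of each column:

# sketch: all three computations agree up to column-wise sign flips
s1 <- svd(X)
sc_svd <- X %*% s1$v[, 1:M]
s2 <- eigen(crossprod(X))  # t(X) %*% X
sc_eig <- X %*% s2$vectors[, 1:M]
s3 <- eigen(tcrossprod(X)) # X %*% t(X)
sc_gram <- s3$vectors[, 1:M] %*% diag(sqrt(s3$values[1:M]))
max(abs(abs(sc_svd) - abs(sc_eig)))  # ~ 0
max(abs(abs(sc_svd) - abs(sc_gram))) # ~ 0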

KPCA:

d <- 2
K <- P %*% (1 + X %*% t(X))^d %*% P # polynomial
e <- eigen(K) # U D^2 U'
# (different from the PCA one: this K is not XX')
U <- e$vectors[, 1:M]
D <- diag(sqrt(e$values[1:M]))
kpca_poly <- U %*% D
K <- P %*% tanh(1 + X %*% t(X)) %*% P # sigmoid kernel
e <- eigen(K) # U D^2 U'
# (different from the PCA one: this K is not XX')
U <- e$vectors[, 1:M]
D <- diag(sqrt(e$values[1:M]))
kpca_sigmoid <- U %*% D
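
A small sanity check (reusing kpca_poly from above): because \(\mathbf{K}\) was double centered, the embedding columns have mean zero, and they are orthogonal with squared norms equal to the eigenvalues.

# sketch: consequences of double centering and the eigendecomposition
colMeans(kpca_poly)  # both approximately 0
crossprod(kpca_poly) # diagonal D_M^2 (off-diagonals ~ 0)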

Plotting
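
A sketch of one way to look at the three embeddings side by side, colouring points by the Mobility values stored in colors (reuses objects computed above):

# sketch: plot the three 2-d embeddings, coloured by Mobility
bind_rows(
  tibble(x = pca_scores[, 1], y = pca_scores[, 2], method = "PCA"),
  tibble(x = kpca_poly[, 1], y = kpca_poly[, 2], method = "KPCA (polynomial)"),
  tibble(x = kpca_sigmoid[, 1], y = kpca_sigmoid[, 2], method = "KPCA (sigmoid)")
) |>
  mutate(mobility = rep(colors, 3)) |>
  ggplot(aes(x, y, colour = mobility)) +
  geom_point() +
  facet_wrap(~method, scales = "free") +
  scale_colour_viridis_c() +
  theme_bw()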

PCA loadings

Showing the first 10 rows of the PCA loadings:

  • The first column contains the weights that define the first score
  • Each row corresponds to a variable in the original data
  • How much does that variable contribute to that score?
head(round(pca_loadings, 2), 10)
       [,1]  [,2]
 [1,]  0.25  0.07
 [2,]  0.13 -0.14
 [3,]  0.17 -0.34
 [4,]  0.18 -0.33
 [5,]  0.16 -0.34
 [6,] -0.24  0.11
 [7,] -0.04 -0.35
 [8,]  0.28  0.00
 [9,]  0.13 -0.14
[10,]  0.29  0.10
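
To make that correspondence explicit, one could attach the original variable names to the loading rows (a small sketch; assumes the \(p \times M\) loadings computed above):

# sketch: label each loading row with its variable name
rownames(pca_loadings) <- colnames(X)
head(round(pca_loadings, 2), 10)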

KPCA, feature map version

p <- ncol(X)
width <- p * (p - 1) / 2 + p # = 630
Z <- matrix(NA, nrow(X), width)
k <- 0
for (i in 1:p) {
  for (j in i:p) {
    k <- k + 1
    Z[, k] <- X[, i] * X[, j]
  }
}
wideX <- scale(cbind(X, Z))
s <- RSpectra::svds(wideX, 2) # the whole svd would be super slow
fkpca_scores <- s$u %*% diag(s$d)
  • Unfortunately, the coordinates can’t easily be compared directly to check whether the result is the same (one rough check is sketched below)
  • The very wide matrix can also cause numerical issues
  • But the embeddings should be the “same” (assuming I didn’t screw up…)
  • This version would also give loadings, though they’d be weights on the polynomial features rather than the original variables
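
One rough comparison that doesn’t depend on rotations or sign flips of the coordinates (a sketch, reusing kpca_poly and fkpca_scores from above): correlate the pairwise distances of the two embeddings.

# sketch: compare kernel-trick and feature-map embeddings via pairwise
# distances, which are unaffected by rotations and sign flips
cor(as.vector(dist(kpca_poly)), as.vector(dist(fkpca_scores)))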

Other manifold learning methods

To name a few

  • Hessian maps
  • Laplacian eigenmaps
  • Classical Multidimensional Scaling
  • tSNE
  • UMAP
  • Locally linear embeddings
  • Diffusion maps
  • Local tangent space alignment
  • Isomap

Issues with nonlinear techniques

  1. Need to choose \(M\) (true for linear methods as well)
  2. There are also other tuning parameters (the kernel and its parameters, neighbourhood sizes, etc.).
  3. These tuning parameters can have huge effects on the result
  4. The difference between the data lying on the manifold and the data lying near the manifold is important
# parametric curve tracing an elephant outline (data "near a manifold")
elephant <- function(eye = TRUE) {
  tib <- tibble(
    tt = -100:500 / 100,
    y = -(12 * cos(3 * tt) - 14 * cos(5 * tt) + 50 * sin(tt) + 18 * sin(2 * tt)),
    x = -30 * sin(tt) + 8 * sin(2 * tt) - 10 * sin(3 * tt) - 60 * cos(tt)
  )
  if (eye) tib <- add_row(tib, y = 20, x = 20)
  tib
}
ele <- elephant(FALSE)
noisy_ele <- ele |>
  mutate(y = y + rnorm(n(), 0, 5), x = x + rnorm(n(), 0, 5))
ggplot(noisy_ele, aes(x, y, colour = tt)) +
  geom_point() +
  scale_color_viridis_c() +
  theme(legend.position = "none") +
  geom_path(data = ele, colour = "black", linewidth = 2)

Next time…

Clustering