20 Introduction
Unsupervised learning methods differ from the supervised ones we have studied so far in that there is no response variable. The objective is not related to prediction but rather to the identification of different possible structures that may be present in the data. For example, one may be interested in determining whether the observations are “grouped” in some way (clustering), or if the data can be efficiently represented using fewer variables or features (dimension reduction).
Many of these methods do not rely on any probabilistic model, and thus there may not be a clear target to be estimated or approximated. As a consequence, the conclusions that can be reached from this type of analysis are often of an exploratory nature.
20.1 Principal Components Analysis
Although principal components can be easily computed with the spectral decomposition of the covariance matrix of the data (using the function svd in R, for example), there are a few dedicated implementations in R, among them prcomp and princomp. The main difference between these two is which internal function is used to compute eigenvalues and eigenvectors: prcomp uses svd, while princomp uses the less preferred function eigen. Both princomp and prcomp return the matrix of loadings (eigenvectors), the scores (projections of the data on the basis of eigenvectors), and other auxiliary objects. They also include plot and summary methods.
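For reference, the following minimal sketch shows how prcomp is typically called and which pieces of its output correspond to the quantities above (the names dat and pcs are placeholders introduced here and are not used elsewhere in these notes):
# Sketch only: `dat` stands for any numeric data matrix or data frame
pcs <- prcomp(dat, scale. = TRUE) # centre and scale the data, then compute the PCs
pcs$rotation # loadings (eigenvectors), one column per component
head(pcs$x) # scores (projections of the data on the basis of eigenvectors)
summary(pcs) # standard deviations and proportion of variance explained
plot(pcs) # the corresponding screeplot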
Instead of reviewing those (which can be easily done individually), in these notes I will reproduce two of the examples used in class (the simple 2-dimensional one used to motivate the topic, and the more interesting 256-dimensional one using the digits data).
Finally, I will also show that principal components can be computed using an iterative algorithm (alternating regressions), which may be faster than factorizing the covariance matrix, particularly when one is only interested in a few principal components and the dimension of the data is large (but also look at the arguments nu and nv for the function svd in R).
20.2 Simple 2-dimensional example
We first read the data for the simple illustration of PC’s as best lower dimensional approximations.
x <- read.table("data/t8-5.dat", header = FALSE)
Note that the data has 5 explanatory variables. Here we only use two of them in order to be able to visualize the analysis more easily:
xx <- x[, c(2, 5)]
colnames(xx) <- c("Prof degree", "Median home value")
As discussed in class, we standardize the data to avoid a large difference in scales “hijacking” the principal components:
xx <- scale(xx, center = colMeans(xx), scale = TRUE)
We now define two auxiliary functions to compute Euclidean norms and squared Euclidean norms (less general but probably faster than R's base::norm):
norm2 <- function(a) sum(a^2)
norm <- function(a) sqrt(norm2(a))
We start by looking at the data with a scatter plot:
plot(xx, pch = 19, col = "black", cex = 2, xlim = c(-1.5, 4.5), ylim = c(-1.5, 4.5))
abline(h = 0, lwd = 2, col = "grey", lty = 2)
abline(v = 0, lwd = 2, col = "grey", lty = 2)
We now compute projections along the direction of the vector \(\mathbf{v} \propto (1, 0.05)^\top\). Recall from your linear algebra courses that the orthogonal projection of a point \(\mathbf{x}\) on the linear subspace spanned by \(\mathbf{v}\) (where \(\| \mathbf{v} \| = 1\)) is given by \(\pi_{\mathbf{v}} ( \mathbf{x} ) = \langle \mathbf{x}, \mathbf{v} \rangle \, \mathbf{v}\), which can also be written as \(\pi_{\mathbf{v}} ( \mathbf{x} ) = ( \mathbf{v} \, \mathbf{v}^\top) \mathbf{x}\). We first find the coordinates of the orthogonal projections of each observation along the subspace generated by \(\mathbf{v} = (1, 0.05)^\top\) (these are the scalars \(\langle \mathbf{x}_i, \mathbf{v} \rangle = \mathbf{x}_i^\top \mathbf{v}\) for each point \(\mathbf{x}_i\)):
a <- c(1, 0.05)
a <- a / norm(a)
# Find the projections (coordinates of the
# observations on this basis of size 1)
prs <- (xx %*% a)
We now compute the projections \(\pi_{\mathbf{v}} ( \mathbf{x}_i ) = \langle \mathbf{x}_i, \mathbf{v} \rangle \, \mathbf{v}\):
pr <- prs %*% a
and add them to the plot, with a few observations highlighted. The subspace is shown in red, and the orthogonal projections as solid red dots on that line:
# Plot the data
plot(xx, pch = 19, col = "black", cex = 2, xlim = c(-1.5, 4.5), ylim = c(-1.5, 4.5))
abline(h = 0, lwd = 2, col = "grey", lty = 2)
abline(v = 0, lwd = 2, col = "grey", lty = 2)
# Show the subspace on which we are projecting
abline(0, a[2] / a[1], lwd = 2, col = "red")
# Add the projections of the data on this subspace
points(pr[, 1], pr[, 2], pch = 19, cex = 1.5, col = "red")
# Highlight a few of them
ind <- c(26, 25, 48, 36)
pr2 <- pr[ind, ]
for (j in 1:length(ind)) {
  lines(c(xx[ind[j], 1], pr2[j, 1]), c(xx[ind[j], 2], pr2[j, 2]),
    col = "blue", lwd = 3.5, lty = 2
  )
}
We repeat the above but projecting on a different direction \(\mathbf{v} \propto (-1, 3)^\top\):
a <- c(-1, 3)
a <- a / norm(a)
# Find the projections (coordinates of the
# observations on this basis of size 1)
prs <- (xx %*% a)
# Find the orthogonal projections of each
# observation on this subspace of dimension 1
pr <- prs %*% a
# Plot the data
plot(xx, pch = 19, col = "black", cex = 2, xlim = c(-1.5, 4.5), ylim = c(-1.5, 4.5))
abline(h = 0, lwd = 2, col = "grey", lty = 2)
abline(v = 0, lwd = 2, col = "grey", lty = 2)
# Show the subspace on which we are projecting
abline(0, a[2] / a[1], lwd = 2, col = "red")
# Add the projections of the data on this subspace
points(pr[, 1], pr[, 2], pch = 19, cex = 1.5, col = "red")
# Highlight a few of them
ind <- c(26, 25, 48, 36)
pr2 <- pr[ind, ]
for (j in 1:length(ind)) {
  lines(c(xx[ind[j], 1], pr2[j, 1]), c(xx[ind[j], 2], pr2[j, 2]),
    col = "blue", lwd = 3.5, lty = 2
  )
}
We saw in class that the direction \(\mathbf{v}\) that results in orthogonal projections closest to the original data (in the sense of minimizing the mean (or sum) of the squared Euclidean norms of the residuals) is given by the “first” eigenvector of the covariance matrix of the data. This is the first principal component. Refer to the class slides and discussion for more details, and for the definition and properties of the other principal components.
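As a quick sanity check (this is only a sketch added here for illustration; the name a.opt is arbitrary and not used again), this optimal direction can be computed directly from the covariance matrix of the standardized data:
# First eigenvector of the covariance matrix of the standardized data
a.opt <- eigen(cov(xx))$vectors[, 1]
a.opt
# To overlay the corresponding subspace on a scatter plot of xx, one could repeat
# the plot() call above and then add: abline(0, a.opt[2] / a.opt[1], col = "darkgreen")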
20.3 Digits example
In this example we use principal components to explore the zip code data. In particular, we focus on images from a single digit (we use 3, but the reader is strongly encouraged to re-do this analysis for other digits to explore whether similar conclusions hold for them). We load the training data from the ElemStatLearn package in R, and extract the images that correspond to the digit 3. For more information use help(zip.train, package='ElemStatLearn').
data(zip.train, package = "ElemStatLearn")
a <- zip.train[zip.train[, 1] == 3, -1]
Define an auxiliary function to compute the squared Euclidean distance between two vectors (recall that we have already defined the function norm2 above):
dist <- function(a, b) norm2(a - b)
To display the images we adapt the following function for plotting a matrix, which was originally available at http://www.phaget4.org/R/image_matrix.html:
myImagePlot <- function(x) {
  min <- min(x)
  max <- max(x)
  ColorRamp <- grey(seq(1, 0, length = 256))
  ColorLevels <- seq(min, max, length = length(ColorRamp))
  # Reverse Y axis
  reverse <- nrow(x):1
  x <- x[reverse, ]
  image(1:ncol(x), 1:nrow(x), t(x),
    col = ColorRamp, xlab = "",
    ylab = "", axes = FALSE, zlim = c(min, max)
  )
}
Using this function, we plot 9 randomly chosen images from our data set:
set.seed(31)
sa <- sample(nrow(a), 9)
par(mai = c(1, 1, 1, 1) / 5, xaxs = "i", yaxs = "i")
par(mfrow = c(3, 3))
for (j in 1:9) myImagePlot(t(matrix(unlist(a[sa[j], ]), 16, 16)))
Next, we centre the observations in order to compute the eigenvectors and eigenvalues of the covariance matrix more efficiently. In fact, note that we do not need to even compute the covariance matrix and can use the SVD of the centred data.
ac <- scale(a, center = TRUE, scale = FALSE)
si.svd <- svd(ac)
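To see why this works, write the SVD of the centred data matrix as \(\mathbf{X}_c = \mathbf{U} \mathbf{D} \mathbf{V}^\top\). Then the sample covariance matrix equals \(\mathbf{X}_c^\top \mathbf{X}_c / (n-1) = \mathbf{V} \left( \mathbf{D}^2 / (n - 1) \right) \mathbf{V}^\top\), so the columns of \(\mathbf{V}\) are its eigenvectors. The following numerical check is only a sketch added here for illustration (the name e1.check is arbitrary):
# Compare the first eigenvector of cov(ac) with the first right singular vector
# of ac; they should agree up to an arbitrary sign change
e1.check <- eigen(cov(ac))$vectors[, 1]
max(abs(abs(e1.check) - abs(si.svd$v[, 1]))) # essentially zero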
Using the relationship between the eigenvectors of the covariance matrix and the SVD of the \(n \times p\) data matrix, we compute the coordinates of the centred data on their orthogonal projections along each of the first two principal directions (eigenvectors of the covariance matrix). Recall that the data are stored as rows of the matrix a:
v1 <- as.vector(ac %*% si.svd$v[, 1])
v2 <- as.vector(ac %*% si.svd$v[, 2])
As discussed in class, we identify 5 quantiles of each of these coordinates to use as our 2-dimensional grid:
qv1 <- quantile(v1, c(.05, .25, .5, .75, .95))
qv2 <- quantile(v2, c(.05, .25, .5, .75, .95))
We can visualize the grid of these 5 x 5 = 25 points over the scatter plot of all the 2-dimensional projections of the data (their coordinates on the principal components basis):
qv <- expand.grid(qv1, qv2)
plot(v1, v2, pch = 19, cex = 1, col = "grey")
points(qv[, 1], qv[, 2], pch = 19, cex = 1.5, col = "red")
We now find the points in our data set (images) with projections closest to each of the 5 x 5 = 25 points in the grid. Note that these distances between points in the principal subspace (which lives in the 256-dimensional space) can be computed in terms of their coordinates on the principal basis only, which are 2-dimensional points:
vs <- cbind(v1, v2)
cvs <- array(0, dim = dim(qv))
for (j in 1:dim(qv)[1]) cvs[j, ] <- vs[which.min(apply(vs, 1, dist, b = qv[j, ])), ]
We now add these points to our plot (we use color blue for them):
plot(v1, v2, pch = 19, cex = 1, col = "grey")
points(qv[, 1], qv[, 2], pch = 19, cex = 1.5, col = "red")
for (j in 1:dim(qv)[1]) points(cvs[j, 1], cvs[j, 2], pch = 19, col = "blue")
Using these “blue” coordinates, we construct the corresponding points in the 256-dimensional space:
app <- t(si.svd$v[, 1:2] %*% t(cvs))
and identify the images in our data set that are closest to these points:
repre <- matrix(0, dim(qv)[1], dim(app)[2])
for (j in 1:dim(qv)[1]) repre[j, ] <- ac[which.min(apply(ac, 1, dist, b = app[j, ])), ]
These are the actual images that are closest to the points in the array app above. Now add the column means and display these 25 images according to the points they represent in the red grid:
repre <- scale(repre, center = -colMeans(a), scale = FALSE)
repre par(mai = c(1, 1, 1, 1) / 5, xaxs = "i", yaxs = "i")
par(mfrow = c(5, 5))
for (j in 1:dim(repre)[1]) {
myImagePlot(t(matrix(unlist(repre[j, ]), 16, 16)))
}
Note how these images change when we “traverse” the 256-dimensional space along each of these 2 principal directions.
20.4 Alternating regression to compute principal components
For details see Appendix @ref(alt-pca).
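In brief (this is only a sketch of the idea; the appendix has the details), the first principal direction can be obtained from the best rank-one approximation of the centred data matrix \(\mathbf{X}\): we look for vectors \(\mathbf{a}\) (the direction) and \(\mathbf{b}\) (the coefficients, or scores) minimizing the sum of the squared entries of \(\mathbf{X} - \mathbf{b} \, \mathbf{a}^\top\). For a fixed \(\mathbf{a}\), the optimal \(\mathbf{b}\) is \(\mathbf{X} \mathbf{a} / \| \mathbf{a} \|^2\) (a least squares regression of each row of \(\mathbf{X}\) on \(\mathbf{a}\)), while for a fixed \(\mathbf{b}\) the optimal \(\mathbf{a}\) is \(\mathbf{X}^\top \mathbf{b} / \| \mathbf{b} \|^2\). Alternating these two steps, and normalizing \(\mathbf{a}\) along the way, generally converges to the first principal direction.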
A function implementing this method to compute the first principal component is:
alter.pca.k1 <- function(x, max.it = 500, eps = 1e-10) {
  n2 <- function(a) sum(a^2)
  p <- dim(x)[2]
  x <- scale(x, scale = FALSE)
  it <- 0
  old.a <- c(1, rep(0, p - 1))
  err <- 10 * eps
  while (((it <- it + 1) < max.it) & (abs(err) > eps)) {
    b <- as.vector(x %*% old.a) / n2(old.a)
    a <- as.vector(t(x) %*% b) / n2(b)
    a <- a / sqrt(n2(a))
    err <- sqrt(n2(a - old.a))
    old.a <- a
  }
  conv <- (it < max.it)
  return(list(a = a, b = b, conv = conv))
}
We use it on the digits data above to compute the first principal component (we also time it):
system.time(tmp <- alter.pca.k1(ac)$a)
#> user system elapsed
#> 0.058 0.007 0.065
and compare it with the one given by svd, which we also time. Note that the sign of the eigenvectors is arbitrary, so we adjust these vectors in order to have first elements with the same sign.
system.time(tmp2 <- svd(ac)$v[, 1])
#> user system elapsed
#> 0.183 0.002 0.183
tmp <- tmp * sign(tmp2[1] * tmp[1])
summary(abs(tmp - tmp2))
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 4.200e-16 1.195e-12 4.012e-12 7.272e-12 1.169e-11 3.524e-11
Note that both eigenvectors are essentially identical, and that the alternating regression method is typically faster than a full SVD decomposition of the covariance matrix. This difference in speed is more striking for problems in higher dimensions.
To further illustrate the potential gain in speed for larger dimensions, consider the following synthetic data set with n = 2000 observations and p = 1000, and compare the timing and the results (even when forcing svd to only compute a single component).
First generate the data set:
n <- 2000
p <- 1000
x <- matrix(rt(n * p, df = 2), n, p)
Compute the first eigenvector using alternating regression, and time it:
system.time(tmp <- alter.pca.k1(x))
#> user system elapsed
#> 0.143 0.013 0.157
a1 <- tmp$a
Compute the first eigenvector using svd, and time it:
system.time(e1 <- svd(cov(x))$u[, 1])
#> user system elapsed
#> 5.011 0.038 5.053
Asking svd to only compute one component does not seem to make the algorithm faster (the results are identical):
system.time(e1.1 <- svd(cov(x), nu = 1, nv = 1)$u[, 1])
#> user system elapsed
#> 4.974 0.024 4.999
summary(abs(e1 - e1.1))
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0 0 0 0 0 0
Finally, check that the first eigenvector computed with svd and with the alternating regression approach are practically identical:
a1 <- a1 * sign(e1[1] * a1[1])
summary(abs(e1 - a1))
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.000e+00 9.376e-18 2.076e-17 5.745e-17 4.107e-17 2.232e-14