27 K-means clustering

Stat 406

Daniel J. McDonald

Last modified – 30 November 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \]

Clustering

So far, we’ve looked at ways of reducing the dimension.

Either linearly or nonlinearly.

The goal is visualization/exploration, or possibly to create inputs for supervised learning.

Now we try to find groups or clusters in our data.

Think of clustering as classification without the labels.

K-means (ideally)

  1. Select a number of clusters \(K\).

  2. Let \(C_1,\ldots,C_K\) partition \(\{1,2,3,\ldots,n\}\) such that

    • All observations belong to some set \(C_k\).
    • No observation belongs to more than one set.
  3. Make within-cluster variation, \(W(C_k)\), as small as possible. \[\min_{C_1,\ldots,C_K} \sum_{k=1}^K W(C_k).\]

  4. Define \(W\) as \[W(C_k) = \frac{1}{2|C_k|} \sum_{i, i' \in C_k} \norm{x_i - x_{i'}}_2^2.\] That is, a (scaled) sum of the squared Euclidean distances between all pairs of cluster members.

To work, K-means needs a notion of distance to a center, and a notion of what the center is.

Why this formula?

Let \(\overline{x}_k = \frac{1}{|C_k|} \sum_{i\in C_k} x_i\) be the centroid (mean) of cluster \(k\). Then

\[ \begin{aligned} \sum_{k=1}^K W(C_k) &= \sum_{k=1}^K \frac{1}{2|C_k|} \sum_{i, i' \in C_k} \norm{x_i - x_{i'}}_2^2 = \sum_{k=1}^K \frac{1}{2|C_k|} \sum_{i\neq i' \in C_k} \norm{x_i - x_{i'}}_2^2 \\ &= \sum_{k=1}^K \frac{1}{2|C_k|} \sum_{i \neq i' \in C_k} \norm{x_i -\overline{x}_k + \overline{x}_k - x_{i'}}_2^2\\ &= \sum_{k=1}^K \frac{1}{2|C_k|} \left[\sum_{i \neq i' \in C_k} \left(\norm{x_i - \overline{x}_k}_2^2 + \norm{x_{i'} - \overline{x}_k}_2^2\right) + \sum_{i \neq i' \in C_k} 2 (x_i-\overline{x}_k)^\top(\overline{x}_k - x_{i'})\right]\\ &= \sum_{k=1}^K \frac{1}{2|C_k|} \left[2(|C_k|-1)\sum_{i \in C_k} \norm{x_i - \overline{x}_k}_2^2 + 2\sum_{i \in C_k} \norm{x_i - \overline{x}_k}_2^2 \right]\\ &= \sum_{k=1}^K \sum_{x \in C_k} \norm{x-\overline{x}_k}^2_2 \end{aligned} \]

If you wanted (equivalently) to minimize \(\sum_{k=1}^K \frac{1}{|C_k|} \sum_{x \in C_k} \norm{x-\overline{x}_k}^2_2\), then you’d use \(\sum_{k=1}^K \frac{1}{\binom{|C_k|}{2}} \sum_{i, i' \in C_k} \norm{x_i - x_{i'}}_2^2\).
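A quick numerical check of the identity above (a small sketch, not from the slides; it builds one toy cluster and compares the pairwise and centroid forms):

# verify: (1 / (2|C|)) * sum_{i,i'} ||x_i - x_i'||^2  equals  sum_i ||x_i - xbar||^2
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)              # one "cluster" of 10 points in R^2
pairwise <- sum(as.matrix(dist(x))^2) / (2 * nrow(x))
centered <- sum(scale(x, scale = FALSE)^2)    # column-centre, then sum of squares
all.equal(pairwise, centered)                 # TRUE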

K-means (in reality)

This is too computationally challenging (roughly \(K^n\) partitions!) \[\min_{C_1,\ldots,C_K} \sum_{k=1}^K W(C_k).\] So, we make a greedy approximation:

  1. Randomly assign observations to the \(K\) clusters
  2. Iterate the following:
    • For each cluster, compute the \(p\)-length vector of the means in that cluster.
    • Assign each observation to the cluster whose centroid is closest (in Euclidean distance).

Each iteration is guaranteed not to increase \(\sum_{k=1}^K W(C_k)\), so the procedure converges, though possibly to a local (rather than global) minimum.
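A bare-bones sketch of this iteration in R (the name lloyd is made up for illustration; it ignores the empty-cluster edge case, which stats::kmeans() handles properly, so use that in practice):

# Lloyd's algorithm: alternate between computing centroids and reassigning points
# X: an n x p numeric matrix
lloyd <- function(X, K, max_iter = 100) {
  cluster <- sample(rep_len(1:K, nrow(X)))                           # random initial assignment
  for (iter in 1:max_iter) {
    centers <- apply(X, 2, function(col) tapply(col, cluster, mean)) # K x p matrix of cluster means
    d2 <- as.matrix(dist(rbind(centers, X)))[-(1:K), 1:K]^2          # squared distance to each centroid
    new_cluster <- apply(d2, 1, which.min)                           # reassign to the nearest centroid
    if (all(new_cluster == cluster)) break                           # stop when assignments stabilize
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centers)
}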

Best practices

To fit K-means, you need to

  1. Pick \(K\) (inherent in the method)

  2. Convince yourself you have found a good solution (due to the randomized / greedy algorithm).

For 2., run K-means many times with different starting points. Pick the solution that has the smallest value for \[\sum_{k=1}^K W(C_k)\]
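In practice, stats::kmeans() does the restarting for you through its nstart argument. A minimal illustration (the data here are just toy noise so the example runs on its own):

set.seed(406)
X <- matrix(rnorm(200), ncol = 2)           # toy data, only for illustration
fit <- kmeans(X, centers = 4, nstart = 25)  # 25 random starts; the best one is kept
fit$tot.withinss                            # the objective, sum_k W(C_k), for the best start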

It turns out that 1. is difficult to do in a principled way.

Choosing the Number of Clusters

Why is it important?

  • It might make a big difference (concluding there are \(K = 2\) cancer sub-types versus \(K = 3\)).

  • One of the major goals of statistical learning is automatic inference. A good way of choosing \(K\) is certainly a part of this.

Withinness and betweenness

\[W(K) = \sum_{k=1}^K W(C_k) = \sum_{k=1}^K \sum_{x \in C_k} \norm{x-\overline{x}_k}^2_2,\]

Within-cluster variation measures how tightly grouped the clusters are.

Its opposite is Between-cluster variation: how spread apart are the clusters?

\[B(K) = \sum_{k=1}^K |C_k| \norm{\overline{x}_k - \overline{x} }_2^2,\]

where \(|C_k|\) is the number of points in \(C_k\), and \(\overline{x}\) is the grand mean of all observations.
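Both quantities are returned by stats::kmeans(), so you can read them off a fit directly (a small illustration on toy data):

fit <- kmeans(matrix(rnorm(100), ncol = 2), centers = 3, nstart = 10)
fit$tot.withinss   # W(K): total within-cluster sum of squares
fit$betweenss      # B(K): between-cluster sum of squares
fit$totss          # total sum of squares; equals W(K) + B(K)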

Plots of \(W\), \(B\), and \(B/K\) against \(K\) show the basic problem: \(W\) always decreases and \(B\) always increases as \(K\) grows, so neither one alone tells you which \(K\) to choose.

CH index

Want small \(W\), big \(B/K\)

\[\textrm{CH}(K) = \frac{B(K)/(K-1)}{W(K)/(n-K)}\]

To choose \(K\), pick some maximum number of clusters to be considered, \(K_{\max} = 20\), for example

\[\widehat K = \argmax_{K \in \{ 2,\ldots, K_{\max} \}} \textrm{CH}(K).\]

Note

  • CH is undefined for \(K = 1\).
  • The divisors \((K-1)\) and \((n-K)\) scale \(B\) and \(W\) appropriately.
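Given a kmeans fit, the index is one line of arithmetic. A minimal sketch (the function name ch_index is ours, not part of any package):

# CH(K) from a fitted kmeans object; n is the number of observations
ch_index <- function(fit, n) {
  K <- length(fit$size)
  (fit$betweenss / (K - 1)) / (fit$tot.withinss / (n - K))
}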

Dumb example

library(mvtnorm)
set.seed(406406406)
X1 <- rmvnorm(50, c(-1, 2), sigma = matrix(c(1, .5, .5, 1), 2))
X2 <- rmvnorm(40, c(2, -1), sigma = matrix(c(1.5, .5, .5, 1.5), 2))
X3 <- rmvnorm(40, c(4, 4))
clust_raw <- rbind(X1, X2, X3) # stack the three groups into one data matrix (used below)

Dumb example

  • We would choose \(K\) to maximize CH. The code below computes \(W\), \(B\), and the CH index for \(K = 2, \ldots, 40\).
library(tidyverse) # map(), map_dfc(), rename(), pivot_longer(), ggplot(), etc. used below
blue <- "steelblue" # stand-in for the colour defined in the course's plotting setup
K <- 2:40
N <- nrow(clust_raw)
all_clusters <- map(K, ~ kmeans(clust_raw, .x, nstart = 20))
all_assignments <- map_dfc(all_clusters, "cluster")
names(all_assignments) <- paste0("K = ", K)
summaries <- map_dfr(all_clusters, `[`, c("tot.withinss", "betweenss")) |>
  rename(W = tot.withinss, B = betweenss) |>
  mutate(
    K = K,
    `W / (N - K)` = W / (N - K),
    `B / K` = B / (K - 1), 
    `CH index` = `B / K` / `W / (N - K)`
  )
summaries |>
  pivot_longer(-K) |>
  ggplot(aes(K, value)) +
  geom_line(color = blue, linewidth = 2) +
  ylab("") +
  coord_cartesian(c(1, 20)) +
  facet_wrap(~name, ncol = 3, scales = "free_y")

Dumb example

  • \(K = 3\)
km <- kmeans(clust_raw, 3, nstart = 20)
names(km)
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
centers <- as_tibble(km$centers, .name_repair = "unique")
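One way to look at the result (a hedged sketch, not the slides’ own plotting code; it assumes clust_raw has two columns, as above):

# plot the data coloured by assigned cluster, with centroids marked by an X
pts <- tibble(x1 = clust_raw[, 1], x2 = clust_raw[, 2], cluster = factor(km$cluster))
ctrs <- tibble(x1 = km$centers[, 1], x2 = km$centers[, 2])
ggplot(pts, aes(x1, x2, colour = cluster)) +
  geom_point() +
  geom_point(data = ctrs, aes(x1, x2), inherit.aes = FALSE, shape = 4, size = 5)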

Next time…

Hierarchical clustering