class: center, middle, inverse, title-slide

.title[
# 27 K-means clustering
]
.author[
### STAT 406
]
.author[
### Daniel J. McDonald
]
.date[
### Last modified - 2022-12-01
]

---

## Clustering

So far, we've looked at ways of reducing the dimension, either linearly or nonlinearly.

* The goal is visualization/exploration, or possibly as an input to supervised learning.

Now we try to find groups or clusters in our data.

Think of __clustering__ as classification without the labels.

`$$\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\argmin}{\arg\min} \newcommand{\argmax}{\arg\max} \newcommand{\R}{\mathbb{R}} \newcommand{\P}{P} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \renewcommand{\hat}{\widehat} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\X}{\mathbf{X}} \newcommand{\y}{\mathbf{y}} \newcommand{\x}{\mathbf{x}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}}$$`

---

## K-means (ideally)

1. Select a number of clusters `\(K\)`.

2. Let `\(C_1,\ldots,C_K\)` partition `\(\{1,2,3,\ldots,n\}\)` such that
    - All observations belong to some set `\(C_k\)`.
    - No observation belongs to more than one set.

3. Make __within-cluster variation__, `\(W(C_k)\)`, as small as possible.
`$$\min_{C_1,\ldots,C_K} \sum_{k=1}^K W(C_k).$$`

4. Define `\(W\)` as
`$$W(C_k) = \frac{1}{2|C_k|} \sum_{i, i' \in C_k} \norm{x_i - x_{i'}}_2^2.$$`

That is, the pairwise squared Euclidean distances between all cluster members, summed and scaled by the cluster size.

--

To work, K-means needs a __distance to a center__ and a __notion of center__.

---
class: inverse

## Why this formula?

Let `\(\overline{x}_k = \frac{1}{|C_k|} \sum_{i\in C_k} x_i\)`

$$
`\begin{aligned}
\sum_{k=1}^K W(C_k) &= \sum_{k=1}^K \frac{1}{2|C_k|} \sum_{i, i' \in C_k} \norm{x_i - x_{i'}}_2^2 = \sum_{k=1}^K \frac{1}{2|C_k|} \sum_{i\neq i' \in C_k} \norm{x_i - x_{i'}}_2^2 \\
&= \sum_{k=1}^K \frac{1}{2|C_k|} \sum_{i \neq i' \in C_k} \norm{x_i -\overline{x}_k + \overline{x}_k - x_{i'}}_2^2\\
&= \sum_{k=1}^K \frac{1}{2|C_k|} \left[\sum_{i \neq i' \in C_k} \left(\norm{x_i - \overline{x}_k}_2^2 + \norm{x_{i'} - \overline{x}_k}_2^2\right) + \sum_{i \neq i' \in C_k} 2 (x_i-\overline{x}_k)^\top(\overline{x}_k - x_{i'})\right]\\
&= \sum_{k=1}^K \frac{1}{2|C_k|} \left[2(|C_k|-1)\sum_{i \in C_k} \norm{x_i - \overline{x}_k}_2^2 + 2\sum_{i \in C_k} \norm{x_i - \overline{x}_k}_2^2 \right]\\
&= \sum_{k=1}^K \sum_{x \in C_k} \norm{x-\overline{x}_k}^2_2
\end{aligned}`
$$

If you wanted (equivalently) to minimize `\(\sum_{k=1}^K \frac{1}{|C_k|} \sum_{x \in C_k} \norm{x-\overline{x}_k}^2_2\)`, then you'd use `\(\sum_{k=1}^K \frac{1}{\binom{|C_k|}{2}} \sum_{i, i' \in C_k} \norm{x_i - x_{i'}}_2^2\)`

---

## K-means (in reality)

It turns out
`$$\min_{C_1,\ldots,C_K} \sum_{k=1}^K W(C_k)$$`
is too challenging computationally (`\(K^n\)` partitions!). So, we make a greedy approximation:

.emphasis[
1. Randomly assign observations to the `\(K\)` clusters.

2. Iterate the following:
    - For each cluster, compute the `\(p\)`-length vector of the means in that cluster.
    - Assign each observation to the cluster whose centroid is closest (in Euclidean distance).
]

This procedure is guaranteed to decrease `\(\sum_{k=1}^K W(C_k)\)` at each step. But because it is greedy, it finds a local, rather than a global, optimum. (A minimal sketch of this iteration is on the next slide.)
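---

## The greedy iteration, in code

A minimal sketch of the two alternating steps above, assuming `X` is a hypothetical `\(n \times p\)` numeric matrix with observations in rows. This is an illustration only (the function name `my_kmeans` is made up); base R's `kmeans()` does all of this, and more, for you.

```r
my_kmeans <- function(X, K, max_iter = 50) {
  n <- nrow(X)
  cluster <- sample(rep_len(1:K, n)) # 1. randomly assign observations to the K clusters
  for (iter in 1:max_iter) {
    # 2a. the p-length vector of means for each cluster (a K x p matrix)
    centers <- apply(X, 2, function(col) tapply(col, cluster, mean))
    # 2b. squared Euclidean distance from every observation to every centroid
    d2 <- outer(rowSums(X^2), rowSums(centers^2), "+") - 2 * X %*% t(centers)
    new_cluster <- apply(d2, 1, which.min) # reassign to the closest centroid
    if (all(new_cluster == cluster)) break # no change: we've reached a local optimum
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centers)
}
```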
---

## Best practices

To fit K-means, you need to

1. Pick `\(K\)` (inherent in the method).

2. Convince yourself you have found a good solution (because the algorithm is randomized).

For 2., run K-means many times with different starting points. Pick the solution that has the smallest value of

`$$\sum_{k=1}^K W(C_k).$$`

(A sketch of this is on the next slide.)

It turns out that __1.__ is difficult to do in a principled way.
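---

## Best practices, in code

A minimal sketch of point __2.__, assuming a hypothetical `\(n \times p\)` data matrix `X` and an arbitrary choice `\(K = 3\)`. The `nstart` argument of `kmeans()` automates the restarts.

```r
# Run K-means from many random starts; keep the fit with the smallest
# total within-cluster variation (reported by kmeans() as tot.withinss).
best <- NULL
for (run in 1:25) {
  fit <- kmeans(X, centers = 3, nstart = 1)
  if (is.null(best) || fit$tot.withinss < best$tot.withinss) best <- fit
}

# Simpler: let kmeans() handle the restarts internally.
best <- kmeans(X, centers = 3, nstart = 25)
```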
---

## Choosing the Number of Clusters

Why is it important?

- It might make a big difference (concluding there are `\(K = 2\)` cancer sub-types versus `\(K = 3\)`).

- One of the major goals of statistical learning is automatic inference. A good way of choosing `\(K\)` is certainly a part of this.

`$$W(K) = \sum_{k=1}^K W(C_k) = \sum_{k=1}^K \sum_{x \in C_k} \norm{x-\overline{x}_k}^2_2.$$`

Within-cluster variation measures how __tightly grouped__ the clusters are.

--

Its opposite is __between-cluster variation__: how spread apart are the clusters?

`$$B(K) = \sum_{k=1}^K |C_k| \norm{\overline{x}_k - \overline{x} }_2^2,$$`

where `\(|C_k|\)` is the number of points in `\(C_k\)`, and `\(\overline{x}\)` is the grand mean.

.pull-left[.center[
`\(W\)` decreases as `\(K\)` increases
]]
.pull-right[.center[
`\(B\)` increases as `\(K\)` increases
]]
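---

## `\(W\)` and `\(B\)` from `kmeans()`

The quantities just defined are reported by `kmeans()` as `tot.withinss` and `betweenss`. A small sketch, again using a hypothetical data matrix `X`:

```r
km <- kmeans(X, centers = 3, nstart = 20)
W <- km$tot.withinss # sum_k sum_{x in C_k} ||x - xbar_k||^2
B <- km$betweenss    # sum_k |C_k| ||xbar_k - xbar||^2
all.equal(W + B, km$totss) # W(K) + B(K) is the total sum of squares, the same for every K
```

Because the total is fixed, decreasing `\(W\)` is the same as increasing `\(B\)`, and both tend to look better as `\(K\)` grows. So neither one alone tells us which `\(K\)` to pick.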
---

## CH index

.emphasis[
Want small `\(W\)` and big `\(B\)`.
]

--

__CH index__

`$$\textrm{CH}(K) = \frac{B(K)/(K-1)}{W(K)/(n-K)}$$`

To choose `\(K\)`, pick some maximum number of clusters to consider, say `\(K_{\max} = 20\)`, and take

`$$\hat K = \arg\max_{K \in \{ 2,\ldots, K_{\max} \}} \textrm{CH}(K).$$`

__Note:__ CH is undefined for `\(K = 1\)`.
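---

## Computing the CH index

A sketch of choosing `\(\hat K\)` by maximizing CH, built from the `kmeans()` pieces above. The helper name `ch_index` and the data matrix `X` are hypothetical; this is not the code behind the figures that follow.

```r
ch_index <- function(X, Kmax = 20, nstart = 20) {
  n <- nrow(X)
  sapply(2:Kmax, function(K) {
    km <- kmeans(X, centers = K, nstart = nstart)
    (km$betweenss / (K - 1)) / (km$tot.withinss / (n - K)) # B(K)/(K-1) over W(K)/(n-K)
  })
}
CH <- ch_index(X)
K_hat <- which.max(CH) + 1 # +1 because the grid starts at K = 2
```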
---

## Dumb example

```r
library(mvtnorm)
set.seed(406406406)
X1 <- rmvnorm(50, c(-1, 2), sigma = matrix(c(1, .5, .5, 1), 2))
X2 <- rmvnorm(40, c(2, -1), sigma = matrix(c(1.5, .5, .5, 1.5), 2))
```

<img src="rmd_gfx/27-kmeans/plotting-dumb-clusts-1.svg" style="display: block; margin: auto;" />

---

## Dumb example

* We would __maximize__ CH

<img src="rmd_gfx/27-kmeans/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />

---

## Dumb example

<img src="rmd_gfx/27-kmeans/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />

---

## Dumb example

* `\(K = 2\)`

```r
km <- kmeans(clust_raw, 2, nstart = 20)
names(km)
```

```
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
```

```r
centers <- as_tibble(km$centers, .name_repair = "unique")
```

<img src="rmd_gfx/27-kmeans/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" />

---
class: middle, center, inverse

# Next time...

Hierarchical clustering