class: center, middle, inverse, title-slide

.title[
# 27 K-means clustering
]
.author[
### STAT 406
]
.author[
### Daniel J. McDonald
]
.date[
### Last modified - 2022-12-01
]

---

## Clustering

So far, we've looked at ways of reducing the dimension, either linearly or nonlinearly.

* The goal is visualization/exploration, or possibly as an input to supervised learning.

Now we try to find groups or clusters in our data.

Think of __clustering__ as classification without the labels.

`$$\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\argmin}{\arg\min} \newcommand{\argmax}{\arg\max} \newcommand{\R}{\mathbb{R}} \newcommand{\P}{P} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \renewcommand{\hat}{\widehat} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\X}{\mathbf{X}} \newcommand{\y}{\mathbf{y}} \newcommand{\x}{\mathbf{x}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}}$$`

---

## K-means (ideally)

1. Select a number of clusters `\(K\)`.

2. Let `\(C_1,\ldots,C_K\)` partition `\(\{1,2,3,\ldots,n\}\)` such that
    - All observations belong to some set `\(C_k\)`.
    - No observation belongs to more than one set.

3. Make __within-cluster variation__, `\(W(C_k)\)`, as small as possible.
`$$\min_{C_1,\ldots,C_K} \sum_{k=1}^K W(C_k).$$`

4. Define `\(W\)` as
`$$W(C_k) = \frac{1}{2|C_k|} \sum_{i, i' \in C_k} \norm{x_i - x_{i'}}_2^2.$$`

That is, the pairwise squared Euclidean distances between all cluster members, summed and scaled by the cluster size.

--

To work, K-means needs a __distance to a center__ and a __notion of center__.

---
class: inverse

## Why this formula?

Let `\(\overline{x}_k = \frac{1}{|C_k|} \sum_{i\in C_k} x_i\)`

$$
`\begin{aligned}
\sum_{k=1}^K W(C_k) &= \sum_{k=1}^K \frac{1}{2|C_k|} \sum_{i, i' \in C_k} \norm{x_i - x_{i'}}_2^2 = \sum_{k=1}^K \frac{1}{2|C_k|} \sum_{i\neq i' \in C_k} \norm{x_i - x_{i'}}_2^2 \\
&= \sum_{k=1}^K \frac{1}{2|C_k|} \sum_{i \neq i' \in C_k} \norm{x_i -\overline{x}_k + \overline{x}_k - x_{i'}}_2^2\\
&= \sum_{k=1}^K \frac{1}{2|C_k|} \left[\sum_{i \neq i' \in C_k} \left(\norm{x_i - \overline{x}_k}_2^2 + \norm{x_{i'} - \overline{x}_k}_2^2\right) + \sum_{i \neq i' \in C_k} 2 (x_i-\overline{x}_k)^\top(\overline{x}_k - x_{i'})\right]\\
&= \sum_{k=1}^K \frac{1}{2|C_k|} \left[2(|C_k|-1)\sum_{i \in C_k} \norm{x_i - \overline{x}_k}_2^2 + 2\sum_{i \in C_k} \norm{x_i - \overline{x}_k}_2^2 \right]\\
&= \sum_{k=1}^K \sum_{x \in C_k} \norm{x-\overline{x}_k}^2_2
\end{aligned}`
$$

If you wanted (equivalently) to minimize `\(\sum_{k=1}^K \frac{1}{|C_k|} \sum_{x \in C_k} \norm{x-\overline{x}_k}^2_2\)`, then you'd use `\(\sum_{k=1}^K \frac{1}{\binom{|C_k|}{2}} \sum_{i, i' \in C_k} \norm{x_i - x_{i'}}_2^2\)`

---

## K-means (in reality)

It turns out
`$$\min_{C_1,\ldots,C_K} \sum_{k=1}^K W(C_k)$$`
is too challenging computationally (`\(K^n\)` partitions!). So, we make a greedy approximation:

.emphasis[
1. Randomly assign observations to the `\(K\)` clusters.

2. Iterate the following:
    - For each cluster, compute the `\(p\)`-length vector of the means in that cluster.
    - Assign each observation to the cluster whose centroid is closest (in Euclidean distance).
]

This procedure is guaranteed to decrease `\(\sum_{k=1}^K W(C_k)\)` at each step. But because it is greedy, it finds a local, rather than a global, optimum. (A minimal sketch of this iteration is on the next slide.)
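---

## The greedy iteration, in code

A minimal sketch of the two alternating steps above, assuming `X` is a hypothetical `\(n \times p\)` numeric matrix with observations in rows. This is an illustration only (the function name `my_kmeans` is made up); base R's `kmeans()` does all of this, and more, for you.

```r
my_kmeans <- function(X, K, max_iter = 50) {
  n <- nrow(X)
  cluster <- sample(rep_len(1:K, n)) # 1. randomly assign observations to the K clusters
  for (iter in 1:max_iter) {
    # 2a. the p-length vector of means for each cluster (a K x p matrix)
    centers <- apply(X, 2, function(col) tapply(col, cluster, mean))
    # 2b. squared Euclidean distance from every observation to every centroid
    d2 <- outer(rowSums(X^2), rowSums(centers^2), "+") - 2 * X %*% t(centers)
    new_cluster <- apply(d2, 1, which.min) # reassign to the closest centroid
    if (all(new_cluster == cluster)) break # no change: we've reached a local optimum
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centers)
}
```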
---

## Best practices

To fit K-means, you need to

1. Pick `\(K\)` (inherent in the method).

2. Convince yourself you have found a good solution (because the algorithm is randomized).

For 2., run K-means many times with different starting points. Pick the solution that has the smallest value of

`$$\sum_{k=1}^K W(C_k).$$`

(A sketch of this is on the next slide.)

It turns out that __1.__ is difficult to do in a principled way.
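---

## Best practices, in code

A minimal sketch of point __2.__, assuming a hypothetical `\(n \times p\)` data matrix `X` and an arbitrary choice `\(K = 3\)`. The `nstart` argument of `kmeans()` automates the restarts.

```r
# Run K-means from many random starts; keep the fit with the smallest
# total within-cluster variation (reported by kmeans() as tot.withinss).
best <- NULL
for (run in 1:25) {
  fit <- kmeans(X, centers = 3, nstart = 1)
  if (is.null(best) || fit$tot.withinss < best$tot.withinss) best <- fit
}

# Simpler: let kmeans() handle the restarts internally.
best <- kmeans(X, centers = 3, nstart = 25)
```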
---

## Choosing the Number of Clusters

Why is it important?

- It might make a big difference (concluding there are `\(K = 2\)` cancer sub-types versus `\(K = 3\)`).

- One of the major goals of statistical learning is automatic inference. A good way of choosing `\(K\)` is certainly a part of this.

`$$W(K) = \sum_{k=1}^K W(C_k) = \sum_{k=1}^K \sum_{x \in C_k} \norm{x-\overline{x}_k}^2_2.$$`

Within-cluster variation measures how __tightly grouped__ the clusters are.

--

Its opposite is __between-cluster variation__: how spread apart are the clusters?

`$$B(K) = \sum_{k=1}^K |C_k| \norm{\overline{x}_k - \overline{x} }_2^2,$$`

where `\(|C_k|\)` is the number of points in `\(C_k\)`, and `\(\overline{x}\)` is the grand mean.

.pull-left[.center[
`\(W\)` decreases as `\(K\)` increases
]]
.pull-right[.center[
`\(B\)` increases as `\(K\)` increases
]]
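---

## `\(W\)` and `\(B\)` from `kmeans()`

The quantities just defined are reported by `kmeans()` as `tot.withinss` and `betweenss`. A small sketch, again using a hypothetical data matrix `X`:

```r
km <- kmeans(X, centers = 3, nstart = 20)
W <- km$tot.withinss # sum_k sum_{x in C_k} ||x - xbar_k||^2
B <- km$betweenss    # sum_k |C_k| ||xbar_k - xbar||^2
all.equal(W + B, km$totss) # W(K) + B(K) is the total sum of squares, the same for every K
```

Because the total is fixed, decreasing `\(W\)` is the same as increasing `\(B\)`, and both tend to look better as `\(K\)` grows. So neither one alone tells us which `\(K\)` to pick.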
---

## CH index

.emphasis[
Want small `\(W\)` and big `\(B\)`.
]

--

__CH index__

`$$\textrm{CH}(K) = \frac{B(K)/(K-1)}{W(K)/(n-K)}$$`

To choose `\(K\)`, pick some maximum number of clusters to consider, say `\(K_{\max} = 20\)`, and take

`$$\hat K = \arg\max_{K \in \{ 2,\ldots, K_{\max} \}} \textrm{CH}(K).$$`

__Note:__ CH is undefined for `\(K = 1\)`.
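---

## Computing the CH index

A sketch of choosing `\(\hat K\)` by maximizing CH, built from the `kmeans()` pieces above. The helper name `ch_index` and the data matrix `X` are hypothetical; this is not the code behind the figures that follow.

```r
ch_index <- function(X, Kmax = 20, nstart = 20) {
  n <- nrow(X)
  sapply(2:Kmax, function(K) {
    km <- kmeans(X, centers = K, nstart = nstart)
    (km$betweenss / (K - 1)) / (km$tot.withinss / (n - K)) # B(K)/(K-1) over W(K)/(n-K)
  })
}
CH <- ch_index(X)
K_hat <- which.max(CH) + 1 # +1 because the grid starts at K = 2
```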
---

## Dumb example

```r
library(mvtnorm)
set.seed(406406406)
X1 <- rmvnorm(50, c(-1, 2), sigma = matrix(c(1, .5, .5, 1), 2))
X2 <- rmvnorm(40, c(2, -1), sigma = matrix(c(1.5, .5, .5, 1.5), 2))
```

<img src="rmd_gfx/27-kmeans/plotting-dumb-clusts-1.svg" style="display: block; margin: auto;" />

---

## Dumb example

* We would __maximize__ CH

<img src="rmd_gfx/27-kmeans/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />

---

## Dumb example

<img src="rmd_gfx/27-kmeans/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />

---

## Dumb example

* `\(K = 2\)`

```r
km <- kmeans(clust_raw, 2, nstart = 20)
names(km)
```

```
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
```

```r
centers <- as_tibble(km$centers, .name_repair = "unique")
```

<img src="rmd_gfx/27-kmeans/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" />

---
class: middle, center, inverse

# Next time...

Hierarchical clustering