Stat 406

Geoff Pleiss, Trevor Campbell

Last modified – 30 November 2023

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \]

K-means

It fits exactly \(K\) clusters.

Final clustering assignments depend on the chosen initial cluster centers.
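To see this initialization dependence concretely, here is a minimal Lloyd's-algorithm sketch; the 1-D data values and the two starting-center choices are made up for illustration:

```python
# Minimal Lloyd's-algorithm sketch on made-up 1-D data, showing that the
# final clusters depend on the initial centers (data and inits are assumptions).
def kmeans(data, centers, iters=20):
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster
        clusters = [[] for _ in centers]
        for x in data:
            nearest = min(range(len(centers)),
                          key=lambda j: (x - centers[j]) ** 2)
            clusters[nearest].append(x)
        # Update step: each center moves to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return clusters

data = [0, 4, 5, 9]
print(kmeans(data, centers=[0, 4]))  # [[0], [4, 5, 9]]
print(kmeans(data, centers=[4, 9]))  # [[0, 4, 5], [9]]
```

Both runs converge, but to different partitions of the same four points.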

Hierarchical clustering

No need to choose the number of clusters beforehand.

There is no random component (nor choice of starting point).

There is a catch: we need to choose a way to measure the distance between clusters, called the linkage.

- Given the linkage, hierarchical clustering produces a sequence of clustering assignments.
- At one end, all points are in their own cluster.
- At the other, all points are in one cluster.
- In the middle, there are nontrivial solutions.

- Given a set of data points, an agglomerative algorithm chooses a cluster sequence by combining the points into groups.
- We can also represent the sequence of clustering assignments as a dendrogram.
- Cutting the dendrogram horizontally partitions the data points into clusters.

Notation: Define \(x_1,\ldots, x_n\) to be the data

Let the dissimilarities be \(d_{ij}\) between each pair \(x_i, x_j\)

At any level, clustering assignments can be expressed by sets \(G = \{ i_1, i_2, \ldots, i_r\}\) giving the indices of points in this group. Define \(|G|\) to be the size of \(G\).

- Linkage
- The function \(d(G,H)\) that takes two groups \(G,\ H\) and returns the linkage distance between them.

- Start with each point in its own group
- Until there is only one cluster, repeatedly merge the two groups \(G,H\) that minimize \(d(G,H)\).
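The two steps above can be sketched directly; the toy 1-D points and the single-linkage choice in this snippet are illustrative assumptions, not part of the slides:

```python
# A minimal sketch of the agglomerative procedure above, run with single
# linkage on made-up 1-D points (both choices are illustrative assumptions).
import itertools

def single_linkage(G, H, d):
    # d(G, H): smallest dissimilarity between a point in G and one in H
    return min(d[i][j] for i in G for j in H)

def agglomerate(n, d, linkage):
    # Start with each point in its own group
    groups = [{i} for i in range(n)]
    merges = []
    # Until there is only one cluster, merge the pair minimizing d(G, H)
    while len(groups) > 1:
        a, b = min(itertools.combinations(range(len(groups)), 2),
                   key=lambda p: linkage(groups[p[0]], groups[p[1]], d))
        height = linkage(groups[a], groups[b], d)
        merged = groups[a] | groups[b]
        groups = [g for k, g in enumerate(groups) if k not in (a, b)] + [merged]
        merges.append((merged, height))
    return merges

# Toy data: four points on a line at positions 0, 1, 5, 6
pos = [0, 1, 5, 6]
d = [[abs(p - q) for q in pos] for p in pos]
for merged, height in agglomerate(4, d, single_linkage):
    print(sorted(merged), height)
```

The recorded merge heights are exactly the heights at which the dendrogram branches join.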

**Important**

\(d\) measures the distance between GROUPS.

In single linkage (a.k.a. nearest-neighbor linkage), the linkage distance between \(G,\ H\) is the smallest dissimilarity between two points in different groups: \[d_{\textrm{single}}(G,H) = \min_{i \in G, \, j \in H} d_{ij}.\]

In complete linkage (a.k.a. farthest-neighbor linkage), the linkage distance between \(G,H\) is the largest dissimilarity between two points in different groups: \[d_{\textrm{complete}}(G,H) = \max_{i \in G,\, j \in H} d_{ij}.\]

In average linkage, the linkage distance between \(G,H\) is the average dissimilarity over all pairs of points in different groups: \[d_{\textrm{average}}(G,H) = \frac{1}{|G| \cdot |H| }\sum_{i \in G, \,j \in H} d_{ij}.\]
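As a quick numeric check of the three formulas, on a made-up \(4\times 4\) dissimilarity matrix with two assumed groups:

```python
# Toy illustration of the three linkage formulas (hypothetical data).
# d[i][j] holds pairwise dissimilarities; G and H index two groups.
d = [[0, 2, 6, 10],
     [2, 0, 5, 9],
     [6, 5, 0, 4],
     [10, 9, 4, 0]]
G, H = [0, 1], [2, 3]

d_single   = min(d[i][j] for i in G for j in H)               # min over pairs
d_complete = max(d[i][j] for i in G for j in H)               # max over pairs
d_average  = sum(d[i][j] for i in G for j in H) / (len(G) * len(H))

print(d_single, d_complete, d_average)  # 5 10 7.5
```

The cross-group dissimilarities here are 6, 10, 5, and 9, so the three linkages pick out their minimum, maximum, and mean respectively.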

Single, complete, and average linkage share the following:

They all operate on the dissimilarities \(d_{ij}\).

This means that the points we are clustering can be quite general (number of mutations on a genome, polygons, faces, whatever).

Running agglomerative clustering with any of these linkages produces a dendrogram with no inversions

“No inversions” means that the linkage distance between merged clusters never decreases as we run the algorithm.

In other words, we can draw a proper dendrogram, where the height of a parent is always higher than the height of either daughter.

(We’ll return to this again shortly)

Centroid linkage is relatively new. We need \(x_i \in \mathbb{R}^p\).

\(\overline{x}_G\) and \(\overline{x}_H\) are group averages

\(d_{\textrm{centroid}}(G,H) = \norm{\overline{x}_G - \overline{x}_H}_2^2\)

Centroid linkage is

… quite intuitive

… nicely analogous to \(K\)-means.

… closely related to average linkage (and much, much faster)

However, it may introduce inversions.
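A concrete (made-up) three-point example of an inversion under centroid linkage, using squared Euclidean distances as in the centroid-linkage formula:

```python
# Made-up three-point example of an inversion under centroid linkage
# (squared Euclidean dissimilarities, per the centroid-linkage formula).
pts = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.8)]

def sqdist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

# Points 0 and 1 are the closest pair (4.0 < 4.24), so they merge first
h1 = sqdist(pts[0], pts[1])                       # first merge height: 4.0

# The centroid of {0, 1} is (1, 0); its distance to point 2 is SMALLER
centroid = tuple((u + v) / 2 for u, v in zip(pts[0], pts[1]))
h2 = sqdist(centroid, pts[2])                     # second merge height: 3.24

print(h2 < h1)  # True: the dendrogram heights go down -- an inversion
```

The second merge happens at a lower height than the first, so the dendrogram's parent sits below its daughter.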

- Single linkage
- 👎 chaining — a single pair of close points merges two clusters. \(\Rightarrow\) clusters can be too spread out, not compact
- Complete linkage
- 👎 crowding — a point can be closer to points in other clusters than to points in its own cluster. \(\Rightarrow\) clusters are compact, but not far enough apart.
- Average linkage
- tries to strike a balance between these two
- 👎 Unclear what properties the resulting clusters have when we cut an average linkage tree.
- 👎 Results change with a monotone increasing transformation of the dissimilarities
- Centroid linkage
- 👎 same monotonicity problem
- 👎 and inversions
- All linkages
- ❓ where do we cut?
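One common answer is to pick a height \(t\) and keep only the merges below it. For single linkage, cutting at \(t\) is equivalent to taking connected components of the graph with an edge whenever \(d_{ij} < t\). A stdlib-only sketch on made-up 1-D data (this is an illustrative shortcut, not a general dendrogram-cutting routine):

```python
# Cutting a single-linkage tree at height t == connected components of the
# graph with an edge whenever d_ij < t (made-up 1-D data, |x_i - x_j| as d).
def cut_single_linkage(points, t):
    d = lambda i, j: abs(points[i] - points[j])
    parent = list(range(len(points)))

    def find(i):  # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Union every pair whose dissimilarity is below the cut height
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if d(i, j) < t:
                parent[find(i)] = find(j)

    # Collect components into clusters of point indices
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

pts = [0, 1, 5, 6, 20]
print(cut_single_linkage(pts, 2))   # [[0, 1], [2, 3], [4]]
print(cut_single_linkage(pts, 5))   # [[0, 1, 2, 3], [4]]
```

Raising the cut height merges clusters; the isolated point at 20 stays a singleton at both heights.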

Note how all of these methods depend on the dissimilarity function.

We can use lots of dissimilarities besides Euclidean distance.

This is very important.

No more slides. All done.

FINAL EXAM!!

UBC Stat 406 - 2024