Stat 406
Geoff Pleiss, Trevor Campbell
Last modified – 30 November 2023
\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \]
K-means
It fits exactly \(K\) clusters.
Final clustering assignments depend on the chosen initial cluster centers.
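To illustrate this dependence on initialization, here is a minimal Python sketch (using scikit-learn's `KMeans` on a toy dataset invented for this example, not anything from the course); runs started from different random centers can converge to different local optima with different within-cluster sums of squares.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three loose, overlapping blobs in R^2 (made up for illustration)
rng = np.random.default_rng(406)
X = np.concatenate([
    rng.normal(loc=c, scale=1.2, size=(50, 2))
    for c in ([0, 0], [3, 0], [1.5, 2.5])
])

# One random start per run (n_init=1) so the dependence on the
# initial centers is visible across seeds.
for seed in range(4):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}: within-cluster SS = {km.inertia_:.2f}")
```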
Hierarchical clustering
No need to choose the number of clusters beforehand.
There is no random component (nor choice of starting point).
There is a catch: we need to choose a way to measure the distance between clusters, called the linkage.
Notation: Define \(x_1,\ldots, x_n\) to be the data
Let \(d_{ij}\) be the dissimilarity between each pair \(x_i, x_j\)
At any level, clustering assignments can be expressed by sets \(G = \{ i_1, i_2, \ldots, i_r\}\) giving the indices of points in this group. Define \(|G|\) to be the size of \(G\).
Important
\(d\) measures the distance between GROUPS.
In single linkage (a.k.a nearest-neighbor linkage), the linkage distance between \(G,\ H\) is the smallest dissimilarity between two points in different groups: \[d_{\textrm{single}}(G,H) = \min_{i \in G, \, j \in H} d_{ij}\]
In complete linkage (i.e. farthest-neighbor linkage), linkage distance between \(G,H\) is the largest dissimilarity between two points in different clusters: \[d_{\textrm{complete}}(G,H) = \max_{i \in G,\, j \in H} d_{ij}.\]
In average linkage, the linkage distance between \(G,H\) is the average dissimilarity over all points in different clusters: \[d_{\textrm{average}}(G,H) = \frac{1}{|G| \cdot |H| }\sum_{i \in G, \,j \in H} d_{ij}.\]
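To make these three definitions concrete, here is a small Python sketch (toy data and group indices are my own) that computes the single, complete, and average linkage distances between two groups \(G\) and \(H\) directly from the pairwise dissimilarities \(d_{ij}\).

```python
import numpy as np

# Toy data: six points in R^2 and their Euclidean dissimilarities d_ij
rng = np.random.default_rng(1)
x = rng.normal(size=(6, 2))
d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

G = [0, 1, 2]                      # indices of points in group G
H = [3, 4, 5]                      # indices of points in group H
between = d[np.ix_(G, H)]          # all d_ij with i in G, j in H

d_single   = between.min()         # smallest dissimilarity across groups
d_complete = between.max()         # largest dissimilarity across groups
d_average  = between.mean()        # average over all |G| * |H| pairs
print(d_single, d_complete, d_average)
```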
Single, complete, and average linkage share the following:
They all operate on the dissimilarities \(d_{ij}\).
This means that the points we are clustering can be quite general (number of mutations on a genome, polygons, faces, whatever).
Running agglomerative clustering with any of these linkages produces a dendrogram with no inversions
“No inversions” means that the linkage distance between merged clusters only increases as we run the algorithm.
In other words, we can draw a proper dendrogram, where the height of a parent is always higher than the height of either daughter.
(We’ll return to this shortly; a quick check is sketched below.)
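As that sketch (toy data made up here; using scipy's agglomerative clustering rather than anything course-specific), we can run each of these linkages and verify that the merge heights in the resulting dendrogram are non-decreasing, i.e. no inversions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
x = rng.normal(size=(20, 2))
d = pdist(x)                       # condensed vector of pairwise dissimilarities

for method in ("single", "complete", "average"):
    Z = linkage(d, method=method)  # each row: (cluster, cluster, height, size)
    heights = Z[:, 2]
    print(method, "heights non-decreasing:", bool(np.all(np.diff(heights) >= 0)))
```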
Centroid linkage is relatively new. We need \(x_i \in \mathbb{R}^p\).
\(\overline{x}_G\) and \(\overline{x}_H\) are group averages
\(d_{\textrm{centroid}}(G,H) = \norm{\overline{x}_G - \overline{x}_H}_2^2\)
Centroid linkage is
… quite intuitive
… nicely analogous to \(K\)-means.
… closely related to average linkage (and much, much faster)
However, it may introduce inversions.
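Here is a small sketch of such an inversion (a standard near-equilateral-triangle example, not from the slides; note that scipy's centroid linkage reports the distance between centroids rather than its square, but the inversion shows up either way).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Three points forming a near-equilateral triangle
x = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.5, 0.9]])

Z = linkage(x, method="centroid")  # centroid linkage works on raw coordinates
print(Z[:, 2])
# First merge: points 0 and 1 at height 1.0.
# Second merge: point 2 joins their centroid (0.5, 0) at height 0.9 < 1.0,
# so the merge heights decrease -- an inversion.
```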
Note how all the methods depend on the distance function
We can use many dissimilarities besides Euclidean distance (see the sketch below).
This is very important
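For example (a hedged sketch with made-up binary “mutation” vectors, echoing the genome example earlier), any precomputed dissimilarity, here Hamming distance, can be fed to the same agglomerative procedure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy binary "mutation present / absent" vectors (made up for illustration)
rng = np.random.default_rng(3)
genomes = rng.integers(0, 2, size=(10, 50))

d = pdist(genomes, metric="hamming")             # a non-Euclidean dissimilarity
Z = linkage(d, method="average")                 # average linkage on those d_ij
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)
```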
No more slides. All done.
FINAL EXAM!!