The term cluster appears throughout data analytics in different contexts. In the analysis of correlated data a cluster is a group of observations that belong together and group membership is known a priori. For example, a subsample that is drawn from a larger sampling unit creates a hierarchy of sampling units. The longitudinal observations collected on a subject over time form a cluster of subject-specific data. The data from different subjects might be independent while the longitudinal observations within a cluster (a subject) are correlated.
In unsupervised learning, a cluster is a group of observations that are somehow similar. Group membership is not known a priori and determining membership as well as the number of clusters is part of the analysis. Examples are
Students that are similar with respect to STEM achievement scores
Real estate properties that share similar property attributes
Online shoppers with similar browsing and purchase history
Cluster analysis seeks to find groups of data such that members within a group are similar to each other and members from different groups are dissimilar. It is an unsupervised method: there is no target variable; we are simply trying to find structure in the data. The number of clusters can be set a priori, for example in \(k\)-means clustering, or be determined as part of the analysis, as in hierarchical clustering.
Clustering Rows or Columns
Note that we are looking for groups of “data”, we did not specify whether clustering applies to finding similar observations or similar features. Usually it is the former, and clustering columns versus rows can be performed by simply transposing the data. From here on we assume that clustering is seeking similar groups of observations.
Scaling and Centering
Key to all clustering methods is some notion of similarity–or the opposite, dissimilarity–of data points. Measures of similarity (or dissimilarity) depend on a metric expressing distance. Squared Euclidean distance is a common choice, but other metrics such as the Manhattan (city-block) distance, correlation distance, or Gower’s distance are also important. Many distance measures depend on the units of measurement; variables with large values tend to dominate the distance calculations. It is highly recommended to scale data prior to cluster analysis to put features on equal footing.
Scaling is often not applied to binary variables, for example, variables that result from coding factors as a series of 0/1 variables.
25.2 \(K\)-Means Clustering
Introduction
\(K\)-means clustering is an intuitive method to cluster \(p\) numeric input variables. The value \(K\) is the number of clusters and is set a priori. If you perform a \(3\)-means analysis, the algorithm will assign all observations to one of three clusters. If you perform a \(100\)-means analysis, the algorithm will assign all observations to one of 100 clusters. Choosing an appropriate number of clusters for a data set relies on scree plots, similar to choosing the number of components in principal component analysis.
The \(K\)-means algorithm has the following properties:
The analysis leads to \(K\) clusters
Every observation belongs to exactly one cluster
No observation belongs to more than one cluster
Finding the optimal partitioning of \(n\) observations into \(K\) groups is a formidable computational problem; there are approximately \(K^n\) ways of partitioning the data. However, efficient algorithms exist to find at least a local solution to the global partitioning problem.
To introduce some notation for \(K\)-means clustering, let \(C_k\) denote the set of observations assigned to cluster \(k=1,\cdots,K\). The \(K\)-means properties imply that the clusters are exhaustive and mutually exclusive: \[
C_1 \cup C_2 \cup \cdots \cup C_K = \{1,\cdots,n\}, \qquad C_k \cap C_{k^\prime} = \emptyset \text{ for } k \neq k^\prime
\] The number of observations in cluster \(k\) is called its cardinality, denoted \(|C_k|\).
If squared Euclidean distance is the dissimilarity measure of choice, the distance between two data points is \[
d(\textbf{x}_i,\textbf{x}_j) = ||\textbf{x}_i - \textbf{x}_j||_2^2 = \sum_{m=1}^p \left ( x_{im} - x_{jm} \right )^2
\] The within-cluster variation in cluster \(k\) is the average dissimilarity of the observations in \(C_k\): \[
W(C_k) = \frac{1}{|C_k|} \sum_{i,j \in C_k} ||\textbf{x}_i - \textbf{x}_j||_2^2
\] Let \(\overline{\textbf{x}}_k = [\overline{x}_{1k},\cdots,\overline{x}_{pk}]\) be the vector of means of the inputs in the \(k\)th cluster. Finding the \(K\)-means solution requires finding the cluster allocation such that \[
\min_{C_1,\cdots, C_K} \left \{ \sum_{k=1}^K W(C_k) \right \} \Longleftrightarrow
\min_{C_1,\cdots, C_K} \left \{ \sum_{k=1}^K \sum_{i \in C_k} ||\textbf{x}_i - \overline{\textbf{x}}_k||_2 ^2 \right \}
\]
This states that the cluster assignment that minimizes the sum of the within-cluster dissimilarity is the same assignment that minimizes the distances of data points from the cluster centroid. This is how \(K\)-means clustering gets its name; the cluster centroids are computed as the mean of the observations assigned to the cluster.
The within-cluster sum of squares is the sum of the squared distances between the data points in a cluster and the cluster centroid. For cluster \(k\) this sum of squares is \[
\text{SSW}_k = \frac{1}{2} W(C_k) = \sum_{i \in C_k} ||\textbf{x}_i - \overline{\textbf{x}}_k||_2 ^2
\] This quantity is also called the inertia of the cluster. The average inertia, \[
\frac{1}{|C_k|} \text{SSW}_k
\] is called the distortion of cluster \(k\).
A (local) solution is found by iterating from an initial cluster assignment: given cluster centroids \(\overline{\textbf{x}}_k\), assign each observation to the cluster whose center is closest. Following the assignment, recompute the centers. Continue until the cluster assignment no longer changes. At the local solution no movement of a data point from one cluster to another will reduce the within-cluster sum of squares (Hartigan and Wong 1979).
The initial cluster assignment is done by either assigning observations randomly to the \(K\) clusters or by using \(K\) randomly chosen observations as the initial cluster centroids.
Because of this random element, and because the algorithm is not guaranteed to find a global solution, \(K\)-means is typically run with multiple random starts, and the best solution is reported.
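The following is a bare-bones sketch of this iteration in R, written here only to illustrate the logic; it assumes the data are in a numeric matrix X, does not handle empty clusters, and is no substitute for the optimized kmeans() function.
simple_kmeans <- function(X, K, max_iter = 100) {
  # initialize the centroids with K randomly chosen observations
  centers <- X[sample(nrow(X), K), , drop = FALSE]
  membership <- rep(0, nrow(X))
  for (iter in 1:max_iter) {
    # assignment step: move each observation to the cluster with the closest centroid
    d <- sapply(1:K, function(k) colSums((t(X) - centers[k, ])^2))
    new_membership <- max.col(-d)
    if (all(new_membership == membership)) break  # assignments unchanged: local solution
    membership <- new_membership
    # update step: recompute each centroid as the mean of its members
    for (k in 1:K) centers[k, ] <- colMeans(X[membership == k, , drop = FALSE])
  }
  list(cluster = membership, centers = centers)
}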
Example: \(K\)-Means for Iris Data
To show the basic calculations in \(K\)-means analysis, let’s first look at the familiar Iris data set. We have the luxury of knowing that the data set comprises three species, so a \(3\)-means analysis of the flower measurements should be interesting: does it recover the iris species?
The kmeans function in R performs the \(K\)-means analysis. By default, it uses the algorithm of Hartigan and Wong (1979) with a single random start for the initial cluster assignment. Set nstart= to a larger number to increase the number of random starts. Because there are four inputs, Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width, each observation and the centroids live in 4-dimensional space.
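The code producing the results discussed below is not reproduced here. A minimal version, using the object names iris_s and km that appear in later snippets (the random number seed is arbitrary), might look like this:
set.seed(1234)                         # seed chosen here for illustration
iris_s <- scale(iris[, 1:4])           # center and scale the four flower measurements
km <- kmeans(iris_s, centers = 3, nstart = 25)
km$totss                               # total sum of squares
km$tot.withinss                        # total within-cluster sum of squares
km$betweenss                           # between-cluster sum of squares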
The total sum of squares does not depend on the number of clusters. For \(K=3\), it splits into a within-cluster sum of squares of 138.8884 and a between-cluster sum of squares of 457.1116.
The within-cluster sum of squares is the sum of the squared Euclidean distances between the points in a cluster and the cluster centroid. We can validate this for any of the clusters.
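For example, for the first cluster (a sketch reusing the km and iris_s objects from above):
# recompute the within-cluster sum of squares for cluster 1 by hand
idx <- which(km$cluster == 1)
centroid <- colMeans(iris_s[idx, ])
sum(sweep(iris_s[idx, ], 2, centroid)^2)    # should match km$withinss[1]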
The distortions of the clusters are obtained by dividing the within-cluster sums of squares by the cluster sizes:
km$withinss / km$size
[1] 0.9470124 0.8318405 1.0095786
Figure 25.1 shows the cluster assignment in a bivariate plot of two of the flower measurements. The colored cluster symbols are overlaid with the species. The three clusters track the species fairly well, in particular I. setosa. The boundaries of the other two clusters align fairly well with species, but there is considerable overlap.
Figure 25.1: Results of 3-means clustering for Iris data. Clusters are identified through colors, species are identified with plotting symbols.
The separation of these clusters is probably better than Figure 25.1 suggests, because two dimensions (Petal.Width and Petal.Length) are not represented in the figure.
The confusion matrix between species and cluster assignment has an accuracy of 83.3333%.
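A sketch of that calculation; the accuracy assumes that each cluster is labeled with its majority species:
tab <- table(Species = iris$Species, Cluster = km$cluster)
tab
# proportion of observations that fall into the majority species of their cluster
sum(apply(tab, 2, max)) / sum(tab)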
Caution
\(K\)-means analysis is generally susceptible to outliers, as they contribute large distances. Also, \(K\)-means analysis is sensitive to perturbations of the data; when observations are added or deleted the results will change. Finally, \(K\)-means is affected by the curse of dimensionality.
Clustering Metrics
To choose the appropriate number of clusters in \(K\)-means clustering, we can apply various metrics that measure the tightness of the clusters and their separation. These metrics are plotted against the value of \(k\) in a scree plot. We do not look for a minimum of the criteria, but the “knee” or “elbow” where the increase/decrease of the metric abruptly changes.
The following criteria are commonly computed and plotted.
Inertia: this is the within-cluster sum of squares; it measures the tightness of the clusters. A low inertia does not necessarily mean that the clusters are well separated; it just means that the data points within the clusters are close to their centroid. The within-cluster sum of squares decreases as \(K\) increases; more clusters lead to less variability within the clusters. In the extreme case when \(K=n\), each observation is its own cluster and the within-cluster sum of squares is zero. That is why we do not look for a global minimum of these criteria.
Distortion: this is the average inertia within a cluster, obtained by dividing \(\text{SSW}_k\) by the cluster cardinality.
Silhouette score: measures how similar a data point is to its own cluster compared to other clusters. While inertia is based on distances of data points from their cluster center, the silhouette compares, for each data point, the average distance to members of its own cluster with the average distance to members of the nearest other cluster. The score ranges from \(-1\) to \(+1\); a high silhouette score means that we can easily tell the clusters apart–they are far from each other.
Inertia and silhouette measure different things: the former captures the tightness of the clusters, the latter how far apart (distinguishable) the clusters are. You can have a good (low) inertia but a bad (low) silhouette score if the clusters overlap or sit on top of each other.
Example: Silhouette Scores
You can calculate and/or visualize silhouette scores in R in several ways: using the silhouette function in the cluster library or the fviz_nbclust function in the factoextra package. fviz_nbclust supports additional metrics; for example, method="wss" produces a scree plot of the within-cluster sum of squares (inertia).
library(cluster)
set.seed(6345)

silhouette_score <- function(k){
  kmns <- kmeans(iris_s, centers = k, nstart = 50)
  ss <- silhouette(kmns$cluster, dist(iris_s))
  mean(ss[, 3])
}

k <- 2:10
plot(k, sapply(k, silhouette_score), type = 'b',
     xlab = 'Number of clusters', ylab = 'Average Silhouette Scores', bty = "l")
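For comparison, a short sketch of the fviz_nbclust() calls for the inertia scree plot and the silhouette scores:
library(factoextra)
# scree plot of the within-cluster sum of squares (inertia)
fviz_nbclust(iris_s, kmeans, method = "wss", k.max = 10)
# average silhouette width as a function of the number of clusters
fviz_nbclust(iris_s, kmeans, method = "silhouette", k.max = 10)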
The silhouette scores suggest \(k=2\) and the inertia scree plot \(k\) between 3 and 5.
Predicted Values
Although \(K\)-means is an unsupervised learning method, we can use it to predict the cluster of a new observation: calculate the distance of the new data point to the cluster centers and assign the observation to the cluster whose center is closest. The cluster centroids serve as the predicted values. You can write a function in R that accomplishes this.
If the data were centered and/or scaled in the \(K\)-means analysis, make sure that the same treatment is applied before calculating distances to the cluster centroids.
clusters <- function(x, centers) {
  # compute squared euclidean distance from
  # each sample to each cluster center
  tmp <- sapply(seq_len(nrow(x)), function(i)
            apply(centers, 1, function(v) sum((x[i, ] - v)^2)))
  max.col(-t(tmp))   # find index of min distance
}

# two new observations
newx <- data.frame("Sepal.Length" = c(4  , 6  ),
                   "Sepal.Width"  = c(2  , 3  ),
                   "Petal.Length" = c(1.5, 1.3),
                   "Petal.Width"  = c(0.3, 0.5))

# center and scales from training data
means  <- attr(iris_s, "scaled:center")
scales <- attr(iris_s, "scaled:scale")

pred_clus <- clusters((newx - means)/scales, km$centers)
pred_clus
[1] 1 3
# Using the cluster centers as the predicted values
km$centers[pred_clus,]
Combining \(K\)-Means and Principal Components
\(K\)-means analysis finds groups of observations that are similar to each other in the inputs as judged by a distance metric. Principal component analysis finds independent linear combinations of the inputs that explain substantial amounts of the variability. In the Iris example analyzed earlier, we used 4 input variables but plotted the cluster assignment for two of the variables, because visualization in more dimensions is difficult (Figure 25.1).
There are two ways to combine PCA and \(K\)-means:
PCA after \(K\)-means: Run a \(K\)-means analysis on \(p\) inputs, then calculate the first two principal components with the cluster assignment. This is a visualization technique for clusters in high-dimensional data. It does not rectify the curse of dimensionality issue from which \(K\)-means suffers as \(p\) gets larger. When applied to visualize data in 2 dimensions, this technique reduces \(p(p-1)/2\) scatterplots to a single biplot based on the first 2 components.
\(K\)-means after PCA: Use PCA to reduce \(p\) inputs to \(M < p\) principal components, then run a \(K\)-means analysis to find clusters in the components. This approach mitigates the curse of dimensionality.
Example: Airbnb properties in Asheville, NC
The following data are Airbnb listings for Asheville, NC. The data for this and other cities are available from http://insideairbnb.com/get-the-data/. We use six numeric variables for the properties.
library(duckdb)
con <- dbConnect(duckdb(), dbdir = "ads.ddb", read_only = TRUE)
airbnb <- dbGetQuery(con, "SELECT * FROM Asheville")
dbDisconnect(con)

airbnb2 <- na.omit(airbnb[, c("price",
                              "number_of_reviews",
                              "minimum_nights",
                              "reviews_per_month",
                              "availability_365",
                              "calculated_host_listings_count")])
Figure 25.2 shows the distribution of prices as a function of the number of reviews. Many properties have accumulated hundreds of reviews over time, and most are toward the lower end of the price scale.
Figure 25.2: Daily rental price versus number of reviews for Asheville Airbnb properties.
The property with a rental price of more than $10,000 per day is a 1-bedroom, 1-bath guest suite in the middle of Asheville. The rental has a 2-night minimum and over 200 reviews. We are excluding this observation as an outlier.
We now perform a principal component analysis after limiting the data to properties with a daily rental price of less than $2,000, followed by a \(K\)-means analysis of the component scores.
We use the first three principal components for the subsequent \(K\)-means analysis; they explain 70.383% of the variability in the data.
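The PCA code is not shown above; a minimal sketch, assuming the price filter is applied to airbnb2 before the rotation (object names match the later snippet), could look like this:
# restrict to properties renting for less than $2,000 per day, then compute the PCA
airbnb2 <- airbnb2[airbnb2$price < 2000, ]
pca_asheville <- prcomp(airbnb2, scale. = TRUE)
summary(pca_asheville)        # cumulative proportion of variance explained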
Based on the scree plot of the within-cluster sum of squares, \(K=5\) or \(K=6\) seems like a reasonable number of clusters; the silhouette plot suggests \(K=7\) instead (Figure 25.3). We compromise on \(K=6\).
km <- kmeans(pca_asheville$x[, 1:3], centers = 6, nstart = 25)

library(ggfortify)
autoplot(pca_asheville, data = airbnb2, color = km$cluster, size = 0.6,
         loadings.label = TRUE, loadings.label.size = 3, loadings = TRUE)
Not surprisingly, the reviews_per_month and the number_of_reviews are highly correlated. The six clusters separate pretty well. There is some overlap between black and green clusters, but the display is missing one of the principal components. The PCA rotation shows that PC1 is dominated by review-related attributes and PC2 by availability of the property and the number of listings that a host has in Asheville. PC3 has negative scores for pricey properties.
Before we leave \(K\)-means clustering, a word of caution. \(K\)-means clustering will always find \(K\) clusters, even if the data have no structure. The following code performs a \(3\)-means analysis on 100 observations of 5 inputs drawn randomly from a standard Gaussian distribution. The correlation analysis shows no (pairwise) relationship among the inputs.
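A sketch of such an analysis; the seed is arbitrary and the object name krand matches the snippet below:
set.seed(9753)                                    # seed chosen here for illustration
xrand <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)
round(cor(xrand), 3)                              # no appreciable pairwise correlations
krand <- kmeans(xrand, centers = 3, nstart = 25)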
To visualize, run a PCA and color the scores of the first two components with the cluster id. It appears that the algorithm found three somewhat distinct groups of observations. The cluster centroids are certainly quite different (Figure 25.4).
Figure 25.4: The first two principal components for \(3\)-means analysis on random data, \(p=5\). The diamonds are the cluster centroids.
There is a clue that something is amiss.
(krand$betweenss/krand$totss)*100
[1] 29.14485
The variability between the clusters accounts for only 29.145% of the variation in the data. If grouping explains differences between the data points, this percentage should be much higher.
25.3 Hierarchical Clustering
Introduction
In \(K\)-means clustering you specify \(K\), find the clusters, and examine the results. Metrics such as inertia, distortion, or the silhouette score are used to find an appropriate value for \(K\). Hierarchical clustering (HC) is a clustering technique where you do not specify the number of clusters in advance. Instead, the entire data set is organized between two extremes:
at the top, all observations belong to a single cluster
at the bottom, each observation is in a cluster by itself
If \(c\) denotes the number of clusters in hierarchical clustering, HC lets you choose any value \(1 \le c \le n\). Between the two extremes, \(c=1\) and \(c=n\), lie many configurations in which observations are combined into groups based on similarity measures and rules for combining groups, called linkage methods. The choice is typically made based on heuristics such as a visual inspection of the dendrogram, an upside-down tree display of the cluster arrangements (Figure 25.5). Algorithms exist that try to automate and optimize the determination of \(c\) based on criteria such as inertia.
Figure 25.5: Example of a dendrogram in hierarchical clustering
The Dendrogram
Hierarchical clustering is popular because of the dendrogram, an intuitive representation of structure in the data. A word of caution is in order, however: just like \(K\)-means clustering will find \(K\) clusters–whether they exist or not–hierarchical clustering will organize the observations hierarchically in the dendrogram–whether a hierarchy makes sense or not.
At the lowest level of the dendrogram are the leaves, each corresponding to an observation. As you move up the tree, leaves fuse into branches; observations fuse first into groups, and later observations or groups merge with other groups. A common mistake in interpreting dendrograms is to judge similarity by how close observations are to each other along the horizontal axis. Observations are more similar if they are fused lower on the tree; the further up the tree you go before branches merge, the more dissimilar are the members of those branches.
In Figure 25.6, observations 11 and 4 near the right edge of the tree appear “close” along the horizontal axis. Since they merge much higher up on the tree, these observations are more dissimilar from each other than, for example, observations 23 and 25, which merge early on. Based on where they merge, observation 11 is no more similar to #4 than it is to observations 23, 25, 10, and 15.
Figure 25.6: Example of a dendrogram in hierarchical clustering.
The name hierarchical clustering stems from the fact that clusters lower on the tree (near the bottom) are necessarily contained in clusters higher up on the tree (near the top), since clusters are formed by merging or splitting. This hierarchical arrangement can be unrealistic. James et al. (2021, 523) give the following example
Suppose you have data on men and women from three countries.
The best division into three groups might be by country.
The best division into two groups might be by gender.
The best division into three groups does not result from taking the two gender groups and splitting one of those.
There are two general approaches to construct a dendrogram.
The agglomerative (bottom-up) approach starts with \(c=n\) clusters at the bottom of the dendrogram, where each observation is a separate cluster, and merges observations and branches based on their similarity. The pair chosen for merging at any stage consists of the least dissimilar (most similar) groups.
The divisive (top-down) approach starts at the trunk (the top) of the tree with a single cluster that contains all observations. Moving down the tree, branches are split into clusters to produce groups with the largest between-group dissimilarity (least similar clusters).
Cutting the Dendrogram
It is best to interpret the dendrogram as a data summary, not as a confirmatory tool. Based on the dendrogram you can choose any number of clusters. The typical approach is called cutting the tree, whereby you choose a particular height on the dendrogram and draw a line across. Depending on where you draw the line you end up with a different number of clusters (Figure 25.7). The number of clusters corresponds to the number of vertical lines the cut intersects.
Figure 25.7: Dendrogram cut at different heights
Dissimilarity and Linkage
Before we can construct a dendrogram, we need to decide on two more things (besides whether the approach is top-down or bottom-up): a measure of dissimilarity and a rule that determines which groups are to be merged. These choices have a profound effect on the resulting dendrogram, more so than the choice between the top-down and bottom-up approach.
Dissimilarity measures
Let \(x_{ij}\) denote the measurements for \(i=1,\cdots, n\) observations on \(j=1,\cdots, p\) inputs (variables). As before, the vector of inputs for the \(i\)th observation is denoted \(\textbf{x}_i = [x_{i1},\cdots, x_{ip}]\).
The dissimilarity (distance) matrix \(\textbf{D}\) for the data is an \((n \times n)\) matrix with typical element \[
D(\textbf{x}_i,\textbf{x}_{i^\prime}) = \sum_{j=1}^p w_j \, d_j(x_{ij},x_{i^\prime j})
\] The \(w_j\) are weights associated with the \(j\)th attribute, with \(\sum_{j=1}^p w_j = 1\), and \(d_j(x_{ij},x_{i^\prime j})\) measures the dissimilarity between any two observations on the \(j\)th attribute.
A number of dissimilarity measures are used, depending on the type of variable and the application. For quantitative variables, the following are most popular:
Squared Euclidean distance \[d_j(x_{ij},x_{i^\prime j}) = (x_{ij} - x_{i^\prime j})^2\] A frequent choice, but it is sensitive to large distances due to squaring.
Euclidean distance \[
D(\textbf{x}_i,\textbf{x}_{i^\prime}) = \sqrt{ \sum_{j=1}^p (x_{ij} - x_{i^\prime j})^2}
\] This is the usual distance (\(L_2\)-norm) between two vectors \(\textbf{x}_i\) and \(\textbf{x}_{i^\prime}\). Taking the square root expresses the distance in the same units as the \(X\)s.
Absolute distance, also called “Manhattan” or city-block distance \[
d_j(x_{ij},x_{i^\prime j}) = |x_{ij} - x_{i^\prime j}|
\] Absolute distance is more robust to large differences compared to dissimilarity based on (squared) Euclidean distance.
Correlation-based distance \[
D(\textbf{x}_i,\textbf{x}_{i^\prime}) = 1 - \rho(\textbf{x}_i, \textbf{x}_{i^\prime})
\] where \(\rho(\textbf{x}_i, \textbf{x}_{i^\prime})\) is the correlation between the two observation vectors. Note that \(\rho(\textbf{x}_i, \textbf{x}_{i^\prime})\) does not measure the correlation between two variables across a set of \(n\) observations–that would be the familiar way to calculate and interpret a correlation coefficient. \(\rho(\textbf{x}_i, \textbf{x}_{i^\prime})\) is the correlation between two observations across the \(p\) attributes.
Example: Effect of Distance Metrics
We use a simple data set with observations on the shopping behavior of four imaginary shoppers. Frank, Betsy, Julian, and Lisa make purchases of 5 possible items. The values for the item attributes are the number of times the item was purchased.
First, let’s calculate various distance metrics. These are represented as matrices of distances between the data points. The function dist() returns the lower triangular matrix of pairwise distances.
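For example, assuming the purchase counts are in columns 2 through 6 of the data frame df, as in the daisy() calls further below:
dist(df[, 2:6], method = "euclidean")
dist(df[, 2:6], method = "manhattan")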
The dist function excludes the diagonal entries of the distance matrix by default; these are known to be zero. Because the input values are integers in this example, the city-block distances are also integers.
The correlation-based distances can be calculated with factoextra::get_dist(). This function adds methods for correlations based on Pearson (method="pearson"), Spearman (method="spearman") or Kendall (method="kendall"). The variables can be centered and scaled with stand=TRUE (stand=FALSE is the default).
The correlation-based dissimilarity is not equal to the correlation among the item purchases; it is one minus the correlation of the item purchases for each shopper.
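A sketch of this equivalence; get_dist() applied to the rows returns the same values as one minus the correlations between the (transposed) shopper profiles:
library(factoextra)
get_dist(df[, 2:6], method = "pearson")
# the same values, computed directly as one minus the correlations between shoppers
1 - cor(t(df[, 2:6]))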
The cluster::daisy() function can compute Euclidean, Manhattan, and Gower’s distance matrices. More on Gower’s distance after the example.
daisy(df[,2:6], metric="manhattan")
## Dissimilarities :
##   1  2  3
## 2 5       
## 3 7 10    
## 4 3  6  4
## 
## Metric :  manhattan 
## Number of objects : 4

daisy(df[,2:6], metric="euclidean")
## Dissimilarities :
##          1        2        3
## 2 3.000000                  
## 3 4.358899 5.291503         
## 4 1.732051 2.828427 3.162278
## 
## Metric :  euclidean 
## Number of objects : 4

daisy(df[,2:6], metric="gower")
## Dissimilarities :
##           1         2         3
## 2 0.5333333                    
## 3 0.5666667 0.9000000          
## 4 0.3500000 0.6833333 0.2166667
## 
## Metric :  mixed ;  Types = I, I, I, I, I 
## Number of objects : 4
Now let’s construct the dendrograms for the data based on the Euclidean and Pearson correlation distance matrices using the hclust function. The input to hclust is a distance (dissimilarity) matrix as produced by dist(), get_dist(), or daisy(). The actual values of the variables are no longer needed once the dissimilarities are calculated.
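A sketch of the two analyses, assuming the shopper names are in the first column of df and using the default complete linkage:
op <- par(mfrow = c(1, 2))
hc_euc <- hclust(dist(df[, 2:6], method = "euclidean"))
plot(hc_euc, labels = as.character(df[, 1]), main = "Euclidean distance", xlab = "", sub = "")
hc_cor <- hclust(get_dist(df[, 2:6], method = "pearson"))
plot(hc_cor, labels = as.character(df[, 1]), main = "Correlation-based distance", xlab = "", sub = "")
par(op)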
Choosing Euclidean distance groups together users who bought few items, because they appear as similar (close). Frank and Lisa bought 4 and 3 items, respectively. Betsy and Julian purchased 5 items. Choosing correlation-based distance groups users who bought items together. For example, Julian and Lisa bought items 2 and 5 together, Frank and Betsy purchased items 3 and 4 together.
The distance metrics discussed so far are not appropriate for categorical variables (nominal or ordinal) because differences between values are not defined. A four-star rating is not twice as much as a two-star rating and the “distance” between a one- and two-star rating is not the same as that between a four- and five-star rating.
Still, for an ordinal variable with \(M\) categories it is not uncommon to replace the label for category \(j\) with \[
\frac{j-1/2}{M}
\] and treat this as a quantitative score. With nominal variables it is common to assign a simple loss value depending on whether the values of two variables are the same (loss = 0) or different (loss=1).
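A one-line illustration of the ordinal replacement for a hypothetical five-star rating:
M <- 5                        # a five-star rating scale
stars <- c(1, 2, 4, 5, 3)     # hypothetical ordinal ratings
(stars - 0.5) / M             # quantitative scores: 0.1 0.3 0.7 0.9 0.5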
What should we do when the inputs are of mixed type, for example, \(x_1\) is continuous, \(x_2\) is binary, and \(x_3\) is nominal? Gower (1971) introduced a similarity metric to compute distances in this case, known as Gower’s distance. Suppose there are no missing values. Gower’s similarity coefficient is \[
S(\textbf{x}_i,\textbf{x}_{i^\prime}) = \frac{1}{p}\sum_{j=1}^p s_{ii^\prime j}
\] The \(s_{ii^\prime j}\) is the score between observations \(i\) and \(i^\prime\) for variable \(j\); the scores satisfy \(0 \leq s_{ii^\prime j} \leq 1\) and are calculated as follows:
qualitative attributes: \(s_{ii^\prime j} = 1\) if observations \(i\) and \(i^\prime\) have the same value for attribute \(j\), and \(s_{ii^\prime j} = 0\) otherwise (the complement of the 0/1 loss)
quantitative attributes: \[
s_{i i^\prime j} = 1 - \frac{|x_{ij} - x_{i^\prime j}|}{R_j}
\] where \(R_j\) is the range (max-min) for the \(j\)th variable. The Gower similarity coefficient has the following properties:
\(S(\textbf{x}_i, \textbf{x}_{i^\prime}) = 0 \Rightarrow\) records differ maximally
\(S(\textbf{x}_i, \textbf{x}_{i^\prime}) = 1 \Rightarrow\) records do not differ
For purposes of clustering, the dissimilarity measure based on Gower’s distance is \(1 - S(\textbf{x}_i, \textbf{x}_{i^\prime})\). This is implemented in the daisy function of the cluster package in R.
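A minimal sketch with hypothetical mixed-type inputs:
library(cluster)
# hypothetical mixed inputs: continuous, binary, and nominal
mixed <- data.frame(x1 = c(2.3, 4.1, 3.7, 1.9),
                    x2 = factor(c(0, 1, 1, 0)),
                    x3 = factor(c("a", "b", "a", "c")))
daisy(mixed, metric = "gower")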
Linkage methods
\(D(\textbf{x}_i,\textbf{x}_{i^\prime})\) measures the dissimilarity between two data points. In order to move up (or down) the tree in hierarchical clustering we also need to determine how to measure the similarity/dissimilarity between groups of points. Suppose \(G\) and \(H\) represent two clusters and \(D(G,H)\) is the dissimilarity between the two, some function of the dissimilarity of the points in the clusters. The decision rule that determines how to merge (or split) clusters is called linkage. The three most common linkage methods are
Single linkage: \(D(G,H) = \min D(\textbf{x}_i,\textbf{x}_{i^\prime}), i \in G, i^\prime \in H\)
Complete linkage: \(D(G,H) = \max D(\textbf{x}_i,\textbf{x}_{i^\prime}), i \in G, i^\prime \in H\)
Average linkage: \(D(G,H) = \text{ave} D(\textbf{x}_i,\textbf{x}_{i^\prime}), i \in G, i^\prime \in H\)
The agglomerative clustering algorithm merges the clusters with the smallest linkage value.
When clusters separate well, the choice of linkage is not that important. Otherwise, linkage can have substantial impact on the outcome of hierarchical clustering. Single linkage is known to cause chaining, combining observations that are linked by a series of close observations. Complete linkage tends to produce compact clusters, but observations can end up being closer to members of other clusters than to members of their own cluster. Average linkage is a compromise, but is not invariant to transformations of the dissimilarities. Centroid linkage uses distances between centroids of the clusters (Figure 25.8).
Figure 25.8: Some linkage types in hierarchical clustering.
Example: Hierarchical Cluster Analysis for Glaucoma Data.
What can we learn about the 98 subjects in the Glaucoma data set who had glaucomatous eyes by way of hierarchical cluster analysis? The following statements create a data frame from the DuckDB table, filter the glaucomatous cases, remove the Glaucoma column, and scale the remaining 62 variables.
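The statements themselves are not reproduced here; a sketch along the following lines, in which the table name and the coding of the Glaucoma column are assumptions, would accomplish the preparation:
library(duckdb)
con <- dbConnect(duckdb(), dbdir = "ads.ddb", read_only = TRUE)
glaucoma <- dbGetQuery(con, "SELECT * FROM glaucoma")     # table name assumed
dbDisconnect(con)

glauc <- glaucoma[glaucoma$Glaucoma == 1, ]               # coding of the class label assumed
glauc$Glaucoma <- NULL                                    # drop the class label
glauc <- scale(glauc)                                     # scale the remaining variables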
We first perform agglomerative clustering with correlation-based dissimilarity and complete linkage.
hc <- hclust(get_dist(glauc, method="pearson"), method="complete")
The merge matrix of the hclust output object describes the \(n-1\) steps in which observations and clusters were merged. Negative numbers refer to observations; positive numbers refer to clusters formed at that stage.
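For example, to display the first 25 merges:
head(hc$merge, n = 25)     # negative entries are observations, positive entries are clusters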
The first merge combines observations #34 and #41 into a group of two. The next merge combines #4 and #73 into another group of two. At the seventh merge, observation #76 is combined with the group created at the first merge. This cluster now contains observations [34, 41, 76]. The first time two groups are being merged is at step 23: the groups consisting of observations [58, 93] and [72, 90] are combined into a cluster of 4.
The height vector is a vector of \(n-1\) values of the height criterion; the actual values depend on the linkage method. For this analysis, the first 25 heights at which merges occurred are as follows:
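The values themselves are not reproduced here; they can be extracted from the height component of the hclust object:
round(hc$height[1:25], 4)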
The dendrogram for this analysis is plotted in Figure 25.9 along with the bounding boxes for 4 and 8 clusters. The cut for the larger number of clusters occurs lower on the tree.
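A sketch of the plotting commands; rect.hclust() draws the bounding boxes, with colors chosen here for illustration:
plot(hc, labels = FALSE, main = "", sub = "", xlab = "")
rect.hclust(hc, k = 4, border = "red")      # bounding boxes for four clusters
rect.hclust(hc, k = 8, border = "blue")     # bounding boxes for eight clusters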
Figure 25.9: Dendrogram for partial glaucoma data with correlation-based dissimilarity and complete linkage.
The sizes of the four clusters are as follows:
table(cutree(hc,k=4))
1 2 3 4
37 26 20 15
Changing the linkage method to single demonstrates the chaining effect on the dendrogram (Figure 25.10). Identifying a reasonable number of clusters is more difficult.
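A sketch of the single-linkage analysis:
hc_single <- hclust(get_dist(glauc, method = "pearson"), method = "single")
plot(hc_single, labels = FALSE, main = "", sub = "", xlab = "")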
Figure 25.10: Dendrogram for partial glaucoma data, single linkage.
If you create a dendrogram object from the hclust results, a number of plotting functions are available to visualize the dendrogram in interesting ways. For example:
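A minimal sketch, combining a base R plot of the dendrogram object with the fviz_dend() function from the factoextra package (one of many possibilities):
dend <- as.dendrogram(hc)
plot(dend, leaflab = "none")                 # base R plot of the dendrogram object

library(factoextra)
fviz_dend(hc, k = 4, type = "circular", show_labels = FALSE)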
Gower, J. C. 1971. “A General Coefficient of Similarity and Some of Its Properties.” Biometrics 27 (4): 857–71.
Hartigan, J. A., and M. A. Wong. 1979. “Algorithm AS 136: A k-Means Clustering Algorithm.” Journal of the Royal Statistical Society. Series C (Applied Statistics) 28 (1): 100–108.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R, 2nd ed. Springer. https://www.statlearning.com/.