25  Cluster Analysis

25.1 Introduction

The term cluster appears throughout data analytics in different contexts. In the analysis of correlated data a cluster is a group of observations that belong together and group membership is known a priori. For example, a subsample that is drawn from a larger sampling unit creates a hierarchy of sampling units. The longitudinal observations collected on a subject over time form a cluster of subject-specific data. The data from different subjects might be independent while the longitudinal observations within a cluster (a subject) are correlated.

In unsupervised learning, a cluster is a group of observations that are somehow similar. Group membership is not known a priori and determining membership as well as the number of clusters is part of the analysis. Examples are

  • Students that are similar with respect to STEM achievement scores
  • Real estate properties that share similar property attributes
  • Online shoppers with similar browsing and purchase history

Cluster analysis seeks to find groups of data such that members within a group are similar to each other and members from different groups are dissimilar. It is an unsupervised method: there is no target variable; we are simply trying to find structure in the data. The number of clusters can be set a priori, as in \(k\)-means clustering, or be determined as part of the analysis, as in hierarchical clustering.

Clustering Rows or Columns

Note that we are looking for groups of “data”; we did not specify whether clustering applies to finding similar observations or similar features. Usually it is the former, and clustering columns rather than rows can be performed by simply transposing the data. From here on we assume that clustering seeks similar groups of observations.

Scaling and Centering

Key to all clustering methods is some notion of similarity–or the opposite, dissimilarity–of data points. Measures of similarity (or dissimilarity) depend on a metric expressing distance. Squared Euclidean distance is a common choice, but other metrics such as the Manhattan (city-block) distance, correlation distance, or Gower’s distance are also important. Many distance measures depend on the units of measurement; variables with large values tend to dominate the distance calculations. It is highly recommended to scale data prior to cluster analysis to put features on equal footing.

Scaling is often not applied to binary variables, for example, variables that result from coding factors as a series of 0/1 variables.
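As a minimal sketch of what this looks like in R, the base scale() function centers and scales the columns of a numeric matrix and stores the centers and scales as attributes; these attributes are useful later when new observations must be placed on the same scale. The choice of the iris columns here is purely illustrative.

X <- scale(iris[,1:4])        # center to mean 0, scale to standard deviation 1
colMeans(X)                   # approximately 0 for every column
apply(X, 2, sd)               # exactly 1 for every column
attr(X, "scaled:center")      # original column means
attr(X, "scaled:scale")       # original column standard deviations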

25.2 \(K\)-Means Clustering

Introduction

\(K\)-means clustering is an intuitive method to cluster \(p\) numeric input variables. The value \(K\) is the number of clusters and is set a priori. If you perform a \(3\)-means analysis, the algorithm will assign all observations to one of three clusters. If you perform a \(100\)-means analysis, the algorithm will assign all observations to one of 100 clusters. Choosing the appropriate number of clusters for a data set uses scree plots similar to choosing the number of components in principal component analysis.

The \(K\)-means algorithm has the following properties:

  • The analysis leads to \(K\) clusters
  • Every observation belongs to exactly one cluster
  • No observation belongs to more than one cluster

Finding the optimal partitioning of \(n\) observations into \(K\) groups is a formidable computational problem; there are approximately \(K^n\) ways of partitioning the data. However, efficient algorithms exist to find at least a local solution to the global partitioning problem.

To introduce some notation for \(K\)-means clustering, let \(C_i\) denote the set of observations assigned to cluster \(i=1,\cdots,K\). The \(K\)-means properties imply that

  • \(C_1 \cup C_2 \cup \cdots \cup C_K = \{1,\cdots, n\}\)
  • \(C_i \cap C_j = \emptyset\) if \(i \neq j\)

The number of observations in cluster \(i\) is called its cardinality, denoted \(|C_i|\).

If squared Euclidean distance is the dissimilarity measure of choice, the distance between two data points is \[ d(\textbf{x}_i,\textbf{x}_j) = ||\textbf{x}_i - \textbf{x}_j||_2^2 = \sum_{m=1}^p \left ( x_{im} - x_{jm} \right )^2 \] The within-cluster variation in cluster \(k\) is the average dissimilarity of the observations in \(C_k\): \[ W(C_k) = \frac{1}{|C_k|} \sum_{i,j \in C_k} ||\textbf{x}_i - \textbf{x}_j||_2^2 \] Let \(\overline{\textbf{x}}_k = [\overline{x}_{1k},\cdots,\overline{x}_{pk}]\) be the vector of means of the inputs in the \(k\)th cluster. Finding the \(K\)-means solution requires finding the cluster allocation such that \[ \min_{C_1,\cdots, C_K} \left \{ \sum_{k=1}^K W(C_k) \right \} \Longleftrightarrow \min_{C_1,\cdots, C_K} \left \{ \sum_{k=1}^K \sum_{i \in C_k} ||\textbf{x}_i - \overline{\textbf{x}}_k||_2 ^2 \right \} \]

This states that the cluster assignment that minimizes the sum of the within-cluster dissimilarity is the same assignment that minimizes the distances of data points from the cluster centroid. This is how \(K\)-means clustering gets its name; the cluster centroids are computed as the mean of the observations assigned to the cluster.

The within-cluster sum of squares is the sum of the squared distances between the data points in a cluster and the cluster centroid. For cluster \(k\) this sum of squares is \[ \text{SSW}_k = \frac{1}{2} W(C_k) = \sum_{i \in C_k} ||\textbf{x}_i - \overline{\textbf{x}}_k||_2 ^2 \] This quantity is also called the inertia of the cluster. The average inertia, \[ \frac{1}{|C_k|} \text{SSW}_k \] is called the distortion of cluster \(k\).
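The equality \(\text{SSW}_k = \frac{1}{2}W(C_k)\) is easy to verify numerically. The following sketch treats all scaled iris rows as a single cluster; the choice of data is arbitrary and any subset of rows works the same way.

# numerical check of SSW_k = (1/2) W(C_k) for a single cluster
X   <- scale(iris[,1:4])
n   <- nrow(X)
W   <- sum(as.matrix(dist(X))^2) / n              # average pairwise dissimilarity
SSW <- sum(scale(X, center=TRUE, scale=FALSE)^2)  # squared distances to the centroid
c(W = W, twiceSSW = 2*SSW)                        # the two values agree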

A (local) solution is found by iterating from an initial cluster assignment: given cluster centroids \(\overline{\textbf{x}}_k\) assign each observation to the cluster whose center is closest. Following the assignment recompute the centers. Continue until the cluster assignment no longer changes. At the local solution no movement of a data point from one cluster to another will reduce the within-cluster sum of squares (Hartigan and Wong 1979).

The initial cluster assignment is done either by assigning observations randomly to the \(K\) clusters or by using \(K\) randomly chosen observations as the initial cluster centroids.

Because of this random element, and because the algorithm is not guaranteed to find a global solution, \(K\)-means is typically run with multiple random starts, and the best solution is reported.
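The assign-then-update iteration can be sketched in a few lines of base R. This is a bare-bones Lloyd-style loop, not the Hartigan-Wong algorithm that kmeans() uses by default; the function name and arguments are illustrative and empty clusters are not handled.

simple_kmeans <- function(X, K, iter_max=50) {
  centers <- X[sample(nrow(X), K), , drop=FALSE]    # K random observations as start
  cluster <- rep(0, nrow(X))
  for (it in seq_len(iter_max)) {
    # assignment step: squared Euclidean distance to each centroid
    d2 <- sapply(seq_len(K), function(k) colSums((t(X) - centers[k,])^2))
    new_cluster <- max.col(-d2)                     # index of the closest centroid
    if (all(new_cluster == cluster)) break          # assignments stable: local solution
    cluster <- new_cluster
    # update step: recompute centroids as the cluster means
    centers <- apply(X, 2, function(col) tapply(col, cluster, mean))
  }
  list(cluster=cluster, centers=centers)
}
# e.g., simple_kmeans(scale(iris[,1:4]), K=3)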

Example: \(K\)-Means for Iris Data

To show the basic calculations in \(K\)-means analysis, let’s first look at the familiar Iris data set. We have the luxury of knowing that the data set comprises three species, so a \(3\)-means analysis of the flower measurements should be interesting: does it recover the iris species?

The kmeans function in R performs the \(K\)-means analysis. By default, it uses the algorithm of Hartigan and Wong (1979) with a single random start for the initial cluster assignment. Set nstart= to a larger number to increase the number of random starts. Because there are four inputs, Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width, each observation and the centroids live in 4-dimensional space.

set.seed(1234)
iris_s <- scale(iris[,1:4])
km <- kmeans(iris_s,centers=3,nstart=50)

km$size
[1] 50 53 47
km$centers
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1  -1.01119138  0.85041372   -1.3006301  -1.2507035
2  -0.05005221 -0.88042696    0.3465767   0.2805873
3   1.13217737  0.08812645    0.9928284   1.0141287

The algorithm finds three clusters of sizes 50, 53, and 47. The centroid of the first cluster is at coordinate [-1.0112, 0.8504, -1.3006, -1.2507].

The breakdown of the dissimilarities, the squared distances, in the data set is as follows.

km$totss
## [1] 596
km$betweenss
## [1] 457.1116
km$tot.withinss
## [1] 138.8884
km$withinss
## [1] 47.35062 44.08754 47.45019
(km$tot.withinss/km$totss)*100
## [1] 23.30342

The total sum of squares (596) does not depend on the number of clusters. For \(K=3\), it splits into 138.8884 sum-of-squares units within the clusters and 457.1116 between the clusters.

The within-cluster sum of squares is the sum of the squared Euclidean distances between the points in a cluster and the cluster centroid. We can validate this for any of the clusters as follows

withinss <- function(x, center) {
    tmp <- sapply(seq_len(nrow(x)),function(i) sum((x[i,]-center)^2))
    return (sum(tmp))
}
for (i in 1:3) {
    print(withinss(iris_s[km$cluster==i,],km$centers[i,]))
}
[1] 47.35062
[1] 44.08754
[1] 47.45019

The distortions of the clusters are obtained by dividing the within-cluster sum of squares with the cluster sizes:

km$withinss / km$size
[1] 0.9470124 0.8318405 1.0095786

Figure 25.1 shows the cluster assignment in a bivariate plot of two of the flower measurements. The colored cluster symbols are overlaid with the species. The three clusters track the species fairly well, in particular I. setosa. The boundaries of the other two clusters align fairly well with species, but there is considerable overlap.

Figure 25.1: Results of 3-means clustering for Iris data. Clusters are identified through colors, species are identified with plotting symbols.

The separation of these clusters is probably better than Figure 25.1 suggests, because two dimensions (Petal.Width and Petal.Length) are not represented in the figure.

psym <- ifelse(iris$Species=="setosa", 1, 
               ifelse(iris$Species=="versicolor" ,2,3))
cm <- caret::confusionMatrix(as.factor(km$cluster),as.factor(psym))
round(cm$overall,4)
      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
        0.8333         0.7500         0.7639         0.8891         0.3333 
AccuracyPValue  McnemarPValue 
        0.0000            NaN 

The confusion matrix between species and cluster assignment has an accuracy of 83.3333%.

Caution

\(K\)-means analysis is generally susceptible to outliers, as they contribute large distances. Also, \(K\)-means analysis is sensitive to perturbations of the data; when observations are added or deleted the results will change. Finally, \(K\)-means is affected by the curse of dimensionality (see the section on the curse of dimensionality).

Clustering Metrics

To choose the appropriate number of clusters in \(K\)-means clustering, we can apply various metrics that measure the tightness of the clusters and their separation. These metrics are plotted against the value of \(k\) in a scree plot. We do not look for a minimum of the criteria but for the “knee” or “elbow” where the increase or decrease of the metric changes abruptly.

The following criteria are commonly computed and plotted.

  • Inertia: this is the within-cluster sum of squares; it measures the tightness of the clusters. It does not necessarily mean that clusters are well separated; it just means that the data points within the clusters are close to their centroid. The within-cluster sum of squares decreases as \(K\) increases; more clusters lead to less variability within the clusters. In the extreme case when \(K=n\), each observation is its own cluster and the within-cluster sum of squares is zero. That is why we do not look for a global minimum with these criteria.

  • Distortion: this is the average inertia within a cluster, obtained by dividing \(\text{SSW}_k\) by the cluster cardinality.

  • Silhouette score: measures how similar a data point is to its own cluster compared to other clusters. While inertia is based on distances of data points from their cluster centroid, the silhouette compares, for each point, the average distance to members of its own cluster with the average distance to members of the nearest other cluster. The score ranges from \(-1\) to \(+1\); a high silhouette score means that we can easily tell the clusters apart–they are far from each other.

Inertia and silhouette measure different things: the former captures the tightness of the clusters, the latter how far apart (distinguishable) the clusters are. You can have a good (low) inertia but a bad (low) silhouette score if the clusters overlap or sit on top of each other.
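A scree (elbow) plot of the inertia can also be produced directly from kmeans() by collecting tot.withinss over a range of \(K\). This is only a sketch, reusing the scaled iris data iris_s from the example above; the range of \(K\) and the seed are arbitrary.

set.seed(87)
inertia <- sapply(1:10, function(k) kmeans(iris_s, centers=k, nstart=25)$tot.withinss)
plot(1:10, inertia, type='b', bty="l",
     xlab='Number of clusters K', ylab='Inertia (total within-cluster SS)')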

Example: Silhouette Scores

You can calculate and/or visualize silhouette scores in R in several ways: using the silhouette function in the cluster library or the fviz_nbclust function in the factoextra package. fviz_nbclust supports additional metrics, for example method="wss" produces a scree plot of the within-cluster sum of squares (inertia).

library(cluster)
set.seed(6345)
silhouette_score <- function(k){
  kmns <- kmeans(iris_s, centers = k, nstart=50)
  ss <- silhouette(kmns$cluster, dist(iris_s))
  mean(ss[, 3])
}
k <- 2:10
plot(k, 
     sapply(k, silhouette_score),
     type='b',
     xlab='Number of clusters', 
     ylab='Average Silhouette Scores',bty="l")

library(factoextra)
fviz_nbclust(iris_s, kmeans, method='silhouette')

fviz_nbclust(iris_s, kmeans, method='wss')

The silhouette scores suggest \(k=2\), while the inertia scree plot suggests a \(k\) between 3 and 5.

Predicted Values

Although \(K\)-means is an unsupervised learning method, we can use it to predict the cluster of a new observation. Calculate the distance of the new point to the cluster centroids and assign the observation to the cluster whose centroid is closest. The cluster centroids serve as the predicted values. You can write a function in R that accomplishes this.

If the data were centered and/or scaled in the \(K\)-means analysis, make sure that the same treatment is applied before calculating distances to the cluster centroids.

clusters <- function(x, centers) {
  # compute squared euclidean distance from 
  # each sample to each cluster center
  tmp <- sapply(seq_len(nrow(x)),
                function(i) apply(centers, 1,
                                  function(v) sum((x[i, ]-v)^2)))
  max.col(-t(tmp))  # find index of min distance
}

# two new observations
newx = data.frame("Sepal.Length"=c(4  , 6  ),
                  "Sepal.Width" =c(2  , 3  ),
                  "Petal.Length"=c(1.5, 1.3),
                  "Petal.Width" =c(0.3, 0.5))

#center and scales from training data
means <- attr(iris_s,"scaled:center")
scales <- attr(iris_s,"scaled:scale")
 
pred_clus <- clusters((newx-means)/scales,km$centers)
pred_clus
[1] 1 3
# Using the cluster centers as the predicted values
km$centers[pred_clus,]
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1    -1.011191  0.85041372   -1.3006301   -1.250704
3     1.132177  0.08812645    0.9928284    1.014129

Combining \(K\)-Means and PCA

\(K\)-means analysis finds groups of observations that are similar to each other in the inputs as judged by a distance metric. Principal component analysis finds uncorrelated linear combinations of the inputs that explain substantial amounts of the variability in the data. In the Iris example analyzed earlier, we used 4 input variables but plotted the cluster assignment for only two of them, because visualization in more dimensions is difficult (Figure 25.1).

There are two ways to combine PCA and \(K\)-means:

  1. PCA after \(K\)-means: Run a \(K\)-means analysis on \(p\) inputs, then calculate the first two principal components and display the scores with the cluster assignment (see the sketch after this list). This is a visualization technique for clusters in high-dimensional data. It does not rectify the curse of dimensionality issue from which \(K\)-means suffers as \(p\) gets larger. When applied to visualize data in 2 dimensions, this technique reduces \(p(p-1)/2\) scatterplots to a single biplot based on the first 2 components.

  2. \(K\)-means after PCA: Use PCA to reduce \(p\) inputs to \(M < p\) principal components, then run a \(K\)-means analysis to find clusters in the component scores. This approach mitigates the curse of dimensionality.
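For the Iris analysis earlier, the first approach amounts to projecting the scaled data onto the first two principal components and coloring the scores by cluster membership. The following is only a sketch, reusing iris_s and km from that example; it mirrors the use of predict() on cluster centers shown later for the random-data example.

pca_iris <- prcomp(iris_s)
plot(pca_iris$x[,1:2], col=km$cluster, pch=19, bty="l")      # scores colored by cluster
points(predict(pca_iris, newdata=km$centers)[,1:2], pch=18, cex=2)  # projected centroids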

Example: Airbnb properties in Asheville, NC

The following example uses Airbnb data for Asheville, NC. The data for this and other cities are available from http://insideairbnb.com/get-the-data/. We use six numeric variables describing the properties.

library(duckdb)
con <- dbConnect(duckdb(),dbdir = "ads.ddb",read_only=TRUE)
airbnb <- dbGetQuery(con, "SELECT * FROM Asheville")

dbDisconnect(con)
    
airbnb2 <- na.omit(airbnb[,c("price",
    "number_of_reviews","minimum_nights",
    "reviews_per_month","availability_365",
    "calculated_host_listings_count")])

Figure 25.2 shows the distribution of prices as a function of the number of reviews. Many properties have accumulated hundreds of reviews over time, and most are toward the lower end of the price scale.

Figure 25.2: Distribution of prices as a function of the number of reviews.

The property with a rental price of more than $10,000 per day is a 1-bedroom, 1-bath guest suite in the middle of Asheville. The rental has a 2-night minimum and over 200 reviews. We are excluding this observation as an outlier.

We now perform a \(K\)-means analysis based on the first two principal components after limiting the data to properties with a daily rental price of less than $2,000.

airbnb2 <- airbnb2[airbnb2$price < 2000,]

pca_asheville <- prcomp(airbnb2,retx=TRUE,scale.=TRUE)
summary(pca_asheville)
Importance of components:
                          PC1    PC2    PC3    PC4    PC5    PC6
Standard deviation     1.3779 1.1772 0.9687 0.8906 0.8262 0.5488
Proportion of Variance 0.3164 0.2310 0.1564 0.1322 0.1138 0.0502
Cumulative Proportion  0.3164 0.5474 0.7038 0.8360 0.9498 1.0000

We use the first three principal components for the subsequent \(K\)-means analysis; they explain 70.383% of variability in the data.

Based on the scree plot of the within-cluster sum of squares, \(K=5\) or \(K=6\) seems like a reasonable number of clusters; the silhouette scores suggest \(K=7\) instead (Figure 25.3). We compromise on \(K=6\).

fviz_nbclust(pca_asheville$x[,1:3], kmeans, method='silhouette')
fviz_nbclust(pca_asheville$x[,1:3], kmeans, method='wss')

Figure 25.3: Silhouette score and inertia scree plot.

km <- kmeans(pca_asheville$x[,1:3],centers=6,nstart=25)

library(ggfortify)
autoplot(pca_asheville, 
         data=airbnb2, 
         color=km$cluster, 
         size=0.6,
         loadings.label=TRUE, 
         loadings.label.size = 3,
         loadings=TRUE)

Not surprisingly, the reviews_per_month and the number_of_reviews are highly correlated. The six clusters separate pretty well. There is some overlap between black and green clusters, but the display is missing one of the principal components. The PCA rotation shows that PC1 is dominated by review-related attributes and PC2 by availability of the property and the number of listings that a host has in Asheville. PC3 has negative scores for pricey properties.

pca_asheville$rotation[,1:3]
                                      PC1        PC2         PC3
price                           0.2902296  0.3294264 -0.62170726
number_of_reviews              -0.6172910  0.1445229  0.12113455
minimum_nights                  0.2131000 -0.4718076  0.53510059
reviews_per_month              -0.6282131  0.2176421  0.03637559
availability_365                0.1669516  0.5698854  0.42704350
calculated_host_listings_count  0.2584232  0.5252157  0.35886561

\(K\)-Means on Random Data

Before we leave \(K\)-means clustering, a word of caution. \(K\)-means clustering will always find \(K\) clusters, even if the data have no structure. The following code performs a \(3\)-means analysis on 100 observations of 5 inputs drawn randomly from a standard Gaussian distribution. The correlation analysis shows no (pairwise) relationship among the inputs.

set.seed(7654)
x <- matrix(rnorm(500,0,1),nrow=100,ncol=5)
round(cor(x),4)
        [,1]    [,2]    [,3]    [,4]    [,5]
[1,]  1.0000 -0.1803 -0.0074  0.0591  0.0151
[2,] -0.1803  1.0000 -0.1494 -0.0894  0.0121
[3,] -0.0074 -0.1494  1.0000  0.0103 -0.0346
[4,]  0.0591 -0.0894  0.0103  1.0000 -0.0004
[5,]  0.0151  0.0121 -0.0346 -0.0004  1.0000
krand <- kmeans(x,centers=3,nstart=20)
krand$center
        [,1]       [,2]        [,3]       [,4]        [,5]
1 -0.8585193  0.7523440 -0.47938437  0.2871084  0.05588092
2  0.4961648 -0.8522573  0.27922916  1.3559955 -0.03425393
3  0.2284897 -0.3133494  0.05111486 -0.6513347 -0.11558704
krand$withinss
[1]  99.74124  92.39653 149.99366

To visualize, run a PCA and color the scores of the first two components with the cluster id. It appears that the algorithm found three somewhat distinct groups of observations. The cluster centroids are certainly quite different (Figure 25.4).

pca <- prcomp(x,scale.=TRUE)
proj_means <- predict(pca,newdata=krand$centers)
Figure 25.4: The first two principal components for \(3\)-means analysis on random data, \(p=5\). The diamonds are the cluster centroids.

There is a clue that something is amiss.

(krand$betweenss/krand$totss)*100
[1] 29.14485

The variability between the clusters accounts for only 29.145% of the variation in the data. If grouping explains differences between the data points, this percentage should be much higher.

25.3 Hierarchical Clustering

Introduction

In \(K\)-means clustering you specify \(K\), find the clusters, and examine the results. Metrics such as inertia, distortion, or the silhouette score are used to find an appropriate value for \(K\). Hierarchical clustering (HC) is a clustering technique where you do not specify the number of clusters in advance. Instead, the entire data set is organized between two extremes:

  • at the top, all observations belong to a single cluster
  • at the bottom, each observation is in a cluster by itself

If \(c\) denotes the number of clusters in hierarchical clustering, HC lets you choose any value \(1 \le c \le n\). Between the two extremes, \(c=1\) and \(c=n\), lie many configurations in which observations are combined into groups based on similarity measures and on rules for combining groups, called linkage methods. The choice of \(c\) is typically made based on heuristics such as a visual inspection of the dendrogram, an upside-down tree display of the cluster arrangements (Figure 25.5). Algorithms exist that try to automate and optimize the determination of \(c\) based on criteria such as inertia.

The Dendrogram

Hierarchical clustering is popular because of the dendrogram, an intuitive representation of structure in the data. A word of caution is in order, however: just like \(K\)-means clustering will find \(K\) clusters–whether they exist or not–hierarchical clustering will organize the observations hierarchically in the dendrogram–whether a hierarchy makes sense or not.

At the lowest level of the dendrogram are the leaves, corresponding to individual observations. As you move up the tree, those merge into branches: observations fuse first into groups, and later observations or groups merge with other groups. A common mistake in interpreting dendrograms is to judge similarity by how close observations are on the horizontal axis when fused. Instead, observations are more similar if they are fused lower on the tree. The further up the tree you go before branches merge, the more dissimilar the members of those branches are.

In Figure 25.6, observations 11 and 4 near the right edge of the tree appear “close” along the horizontal axis. Since they merge much higher up on the tree, these observations are more dissimilar than, for example, observations 23 and 25, which merge early on. Based on where they merge, observation 11 is no more similar to #4 than it is to observations 23, 25, 10, and 15.

Figure 25.6: Example of a dendrogram in hierarchical clustering.

The name hierarchical clustering stems from the fact that clusters lower on the tree (near the bottom) are necessarily contained in clusters higher up on the tree (near the top), since clusters are formed by merging or splitting. This hierarchical arrangement can be unrealistic. James et al. (2021, 523) give the following example

  • Suppose you have data on men and women from three countries.
  • The best division into three groups might be by country.
  • The best division into two groups might be by gender.

The best division into three groups does not result from taking the two gender groups and splitting one of those.

There are two general approaches to construct a dendrogram.

  • The agglomerative (bottom-up) approach starts with \(c=n\) clusters at the bottom of the dendrogram, where each observation is a separate cluster, and merges observations and branches based on their similarity. The pair chosen for merging at any stage consists of the least dissimilar (most similar) groups.

  • The divisive (top-down) approach starts at the trunk (the top) of the tree with a single cluster that contains all observations. Moving down the tree, branches are split into clusters to produce groups with the largest between-group dissimilarity (least similar clusters).
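In R, agglomerative clustering is implemented in hclust() and cluster::agnes(), while divisive clustering is available through cluster::diana(). The following is a minimal sketch on the scaled iris measurements; the metric and linkage choices are illustrative.

library(cluster)
X <- scale(iris[,1:4])
hc_agg <- agnes(X, metric="euclidean", method="complete")  # bottom-up (agglomerative)
hc_div <- diana(X, metric="euclidean")                     # top-down (divisive)
par(mfrow=c(1,2))
pltree(hc_agg, cex=0.5, main="Agglomerative (agnes)")
pltree(hc_div, cex=0.5, main="Divisive (diana)")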

Cutting the Dendrogram

It is best to interpret the dendrogram as a data summary, not as a confirmatory tool. Based on the dendrogram you can choose any number of clusters. The typical approach is called cutting the tree, whereby you choose a particular height on the dendrogram and draw a line across. Depending on where you draw the line you end up with a different number of clusters (Figure 25.7). The number of clusters corresponds to the number of vertical lines the cut intersects.
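With an hclust object, cutting is done with cutree(), either by specifying the number of clusters directly or by specifying the height at which to cut. A small sketch, where the height value is purely illustrative:

hc <- hclust(dist(scale(iris[,1:4])))   # complete linkage by default
table(cutree(hc, k=3))                  # cut the tree into three clusters
table(cutree(hc, h=4))                  # cut at height 4 instead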

Dissimilarity and Linkage

Before we can construct a dendrogram, we need to decide on two more things (besides whether the approach is top-down or bottom-up): a measure of dissimilarity and a rule for deciding which groups are merged. These choices have a profound effect on the resulting dendrogram, more so than the choice between the top-down and bottom-up approaches.

Dissimilarity measures

Let \(x_{ij}\) denote the measurements for \(i=1,\cdots, n\) observations on \(j=1,\cdots, p\) inputs (variables). As before, the vector of inputs for the \(i\)th observation is denoted \(\textbf{x}_i = [x_{i1},\cdots, x_{ip}]\).

The dissimilarity (distance) matrix \(\textbf{D}\) for the data is an \((n \times n)\) matrix with typical element \[ D(\textbf{x}_i,\textbf{x}_{i^\prime}) = \sum_{j=1}^p w_j \, d_j(x_{ij},x_{i^\prime j}) \] The \(w_j\) are weights associated with the \(j\)th attribute, \(\sum_{j=1}^p w_j = 1\). \(d_j(x_{ij},x_{i^\prime j})\) measures the dissimilarity between any two observations for the \(j\)th attribute.
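A tiny sketch of this weighted combination for a single pair of observations, using squared differences as the attribute-level dissimilarities; the values and weights are made up for illustration.

x_i  <- c(1.2, 0.4, 3.0)      # observation i
x_ip <- c(0.8, 1.0, 2.5)      # observation i'
w    <- c(0.5, 0.3, 0.2)      # attribute weights, sum to 1
d_j  <- (x_i - x_ip)^2        # attribute-level dissimilarities
sum(w * d_j)                  # weighted overall dissimilarity D(x_i, x_i')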

A number of dissimilarity measures are used, depending on the type of variable and the application. For quantitative variables, the following are most popular:

  • Squared Euclidean distance \[d_j(x_{ij},x_{i^\prime j}) = (x_{ij} - x_{i^\prime j})^2\] A frequent choice, but it is sensitive to large distances due to squaring.

  • Euclidean distance \[ D(\textbf{x}_i,\textbf{x}_{i^\prime}) = \sqrt{ \sum_{j=1}^p (x_{ij} - x_{i^\prime j})^2} \] This is the usual distance (\(L_2\)-norm) between two vectors \(\textbf{x}_i\) and \(\textbf{x}_{i^\prime}\). Taking the square root expresses the distance in the same units as the \(X\)s.

  • Absolute distance, also called “Manhattan” or city-block distance \[ d_j(x_{ij},x_{i^\prime j}) = |x_{ij} - x_{i^\prime j}| \] Absolute distance is more robust to large differences compared to dissimilarity based on (squared) Euclidean distance.

  • Correlation-based distance \[ D(\textbf{x}_i,\textbf{x}_{i^\prime}) = 1- \rho(\textbf{x}_i,\textbf{x}_{i^\prime}) = 1- \frac{\sum_{j=1}^p (x_{ij}-\overline{x}_i) (x_{i^\prime j} - \overline{x}_{i^\prime})} {\sqrt{\sum_{j=1}^p (x_{ij}-\overline{x}_i)^2 \, \sum_{j=1}^p (x_{i^\prime j} - \overline{x}_{i^\prime})^2 }} \] with \(\overline{x}_i = \frac{1}{p} \sum_{j=1}^p x_{ij}\).

Note

Note that \(\rho(\textbf{x}_i, \textbf{x}_{i^\prime})\) does not measure the correlation between two variables across a set of \(n\) observations–that would be the familiar way to calculate and interpret a correlation coefficient. \(\rho(\textbf{x}_i, \textbf{x}_{i^\prime})\) is the correlation between two observations across \(p\) attributes.

Example: Effect of Distance Metrics

We use a simple data set with observations on the shopping behavior of four imaginary shoppers. Frank, Betsy, Julian, and Lisa make purchases of 5 possible items. The values for the item attributes are the number of times the item was purchased.

df <- data.frame(shopper=c("Frank","Betsy","Julian","Lisa"),
                 item1=c(0,1,0,0),
                 item2=c(0,0,4,1),
                 item3=c(1,1,0,0),
                 item4=c(1,3,0,1),
                 item5=c(2,0,1,1)
                 )
df
  shopper item1 item2 item3 item4 item5
1   Frank     0     0     1     1     2
2   Betsy     1     0     1     3     0
3  Julian     0     4     0     0     1
4    Lisa     0     1     0     1     1

First, let’s calculate various distance metrics. These are represented as matrices of distances between the data points. The function dist() returns the lower triangular part of the matrix of pairwise distances.

dist(df[,2:6],method="manhattan")
##    1  2  3
## 2  5      
## 3  7 10   
## 4  3  6  4
dist(df[,2:6],method="euclidean")
##          1        2        3
## 2 3.000000                  
## 3 4.358899 5.291503         
## 4 1.732051 2.828427 3.162278

The dist function excludes the diagonal entries of the distance matrix by default; these are known to be zero. Because the input values are integers in this example, the city-block distances are also integers.

The correlation-based distances can be calculated with factoextra::get_dist(). This function adds methods for correlations based on Pearson (method="pearson"), Spearman (method="spearman") or Kendall (method="kendall"). The variables can be centered and scaled with stand=TRUE (stand=FALSE is the default).

cor(t(df[,2:6]))
           [,1]       [,2]       [,3]      [,4]
[1,]  1.0000000  0.0000000 -0.3450328 0.3273268
[2,]  0.0000000  1.0000000 -0.5892557 0.0000000
[3,] -0.3450328 -0.5892557  1.0000000 0.5270463
[4,]  0.3273268  0.0000000  0.5270463 1.0000000
get_dist(df[,2:6],method="pearson",stand=FALSE)
          1         2         3
2 1.0000000                    
3 1.3450328 1.5892557          
4 0.6726732 1.0000000 0.4729537

The correlation-based dissimilarity is not equal to the correlation among the item purchases; it is one minus the correlation of the item purchases for each shopper.

The cluster::daisy() function can compute Euclidean, Manhattan, and Gower’s distance matrices. More on Gower’s distance after the example.

daisy(df[,2:6],metric="manhattan")
## Dissimilarities :
##    1  2  3
## 2  5      
## 3  7 10   
## 4  3  6  4
## 
## Metric :  manhattan 
## Number of objects : 4
daisy(df[,2:6],metric="euclidean")
## Dissimilarities :
##          1        2        3
## 2 3.000000                  
## 3 4.358899 5.291503         
## 4 1.732051 2.828427 3.162278
## 
## Metric :  euclidean 
## Number of objects : 4
daisy(df[,2:6],metric="gower")
## Dissimilarities :
##           1         2         3
## 2 0.5333333                    
## 3 0.5666667 0.9000000          
## 4 0.3500000 0.6833333 0.2166667
## 
## Metric :  mixed ;  Types = I, I, I, I, I 
## Number of objects : 4

Now let’s construct the dendrograms for the data based on the Euclidean and Pearson correlation distance matrices using the hclust function. The input to hclust is a distance (dissimilarity) matrix as produced by dist(), get_dist(), or daisy(). The actual values of the variables are no longer needed once the dissimilarities are calculated.

h1 <- hclust(dist(df[,2:6]))
h2 <- hclust(get_dist(df[,2:6],method="pearson"))

par(mfrow=c(1,2))
par(cex=0.7) 
par(mai=c(0.6,0.6,0.2,0.3))
plot(h1,labels=df[,1],sub="",xlab="Shoppers",main="Euclidean Dist.")
plot(h2,labels=df[,1],sub="",xlab="Shoppers",main="Correlation")

Choosing Euclidean distance groups together users who bought few items, because they appear as similar (close). Frank and Lisa bought 4 and 3 items, respectively; Betsy and Julian each purchased 5 items. Choosing correlation-based distance groups users who bought items together. For example, Julian and Lisa both bought items 2 and 5, while Frank and Betsy both purchased items 3 and 4.

The distance metrics discussed so far are not appropriate for categorical variables (nominal or ordinal) because differences between values are not defined. A four-star rating is not twice as much as a two-star rating and the “distance” between a one- and two-star rating is not the same as that between a four- and five-star rating.

Still, for an ordinal variable with \(M\) categories it is not uncommon to replace the label for category \(j\) with \[ \frac{j-1/2}{M} \] and treat this as a quantitative score. With nominal variables it is common to assign a simple loss value depending on whether the values of two observations agree (loss = 0) or differ (loss = 1).
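A short sketch of this recoding for a five-category star rating stored as an ordered factor (the variable is made up for illustration):

stars <- factor(c("1 star","3 stars","5 stars","4 stars"),
                levels=c("1 star","2 stars","3 stars","4 stars","5 stars"),
                ordered=TRUE)
M <- nlevels(stars)
(as.integer(stars) - 0.5) / M     # (j - 1/2)/M: 0.1 0.5 0.9 0.7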

What should we do when the inputs are of mixed type, for example, \(x_1\) is continuous, \(x_2\) is binary, and \(x_3\) is nominal? Gower (n.d.) introduced a similarity metric to compute distances in this case, known as Gower’s distance. Suppose there are no missing values. Gower’s similarity coefficient is \[ S(\textbf{x}_i,\textbf{x}_{i^\prime}) = \frac{1}{p}\sum_{j=1}^p s_{ii^\prime j} \] The \(s_{ii^\prime j}\) is the score between observations \(i\) and \(i^\prime\) for variable \(j\); the scores range \(0 \leq s_{ii^\prime j} \leq 1\) and are calculated as follows:

  • qualitative attributes: 0/1 loss

  • quantitative attributes: \[ s_{i i^\prime j} = 1 - \frac{|x_{ij} - x_{i^\prime j}|}{R_j} \] where \(R_j\) is the range (max-min) for the \(j\)th variable. The Gower similarity coefficient has the following properties:

  • \(0 \leq S(\textbf{x}_i, \textbf{x}_{i^\prime}) \leq 1\)

  • \(S(\textbf{x}_i, \textbf{x}_{i^\prime}) = 0 \Rightarrow\) records differ maximally

  • \(S(\textbf{x}_i, \textbf{x}_{i^\prime}) = 1 \Rightarrow\) records do not differ

For purposes of clustering, the dissimilarity measure based on Gower’s distance is \(1 - S(\textbf{x}_i, \textbf{x}_{i^\prime})\). This is implemented in the daisy function of the cluster package in R.
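As a check, the Gower dissimilarity between Frank and Betsy reported by daisy() in the shopper example (0.5333333) can be reproduced by hand. All five items are treated as interval-scaled, so each attribute score is one minus the absolute difference divided by the attribute’s range; this is only a sketch reusing the df data frame from above.

x_frank <- unlist(df[1, 2:6])
x_betsy <- unlist(df[2, 2:6])
R <- apply(df[,2:6], 2, function(v) max(v) - min(v))  # per-item ranges
mean(abs(x_frank - x_betsy) / R)   # Gower dissimilarity 1 - S = 0.5333333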

Linkage methods

\(D(\textbf{x}_i,\textbf{x}_{i^\prime})\) measures the dissimilarity between two data points. In order to move up (or down) the tree in hierarchical clustering we also need to determine how to measure the similarity/dissimilarity between groups of points. Suppose \(G\) and \(H\) represent two clusters and \(D(G,H)\) is the dissimilarity between the two, some function of the dissimilarities of the points in the clusters. The decision rule that determines how to merge (or split) clusters is called linkage. The three most common linkage methods are

  • Single linkage: \(D(G,H) = \min D(\textbf{x}_i,\textbf{x}_{i^\prime}), i \in G, i^\prime \in H\)

  • Complete linkage: \(D(G,H) = \max D(\textbf{x}_i,\textbf{x}_{i^\prime}), i \in G, i^\prime \in H\)

  • Average linkage: \(D(G,H) = \text{ave} D(\textbf{x}_i,\textbf{x}_{i^\prime}), i \in G, i^\prime \in H\)

The agglomerative clustering algorithm merges the clusters with the smallest linkage value.

When clusters separate well, the choice of linkage is not that important. Otherwise, linkage can have substantial impact on the outcome of hierarchical clustering. Single linkage is known to cause chaining, combining observations that are linked by a series of close observations. Complete linkage tends to produce compact clusters, but observations can end up being closer to members of other clusters than to members of their own cluster. Average linkage is a compromise, but is not invariant to transformations of the dissimilarities. Centroid linkage uses distances between centroids of the clusters (Figure 25.8).
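The linkage is selected through the method= argument of hclust(). As a quick sketch, running the shopper distance matrix from the earlier example through different linkages shows how the merge heights change:

d <- dist(df[,2:6])                     # Euclidean distances, 4 shoppers
hclust(d, method="single")$height       # merge heights under single linkage
hclust(d, method="complete")$height     # complete linkage
hclust(d, method="average")$height      # average linkage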

Example: Hierarchical Cluster Analysis for Glaucoma Data.

What can we learn about the 98 subjects in the Glaucoma data set who had glaucomatous eyes by way of hierarchical cluster analysis? The following statements create a data frame from the DuckDB table, filter the glaucomatous cases, remove the Glaucoma column, and scale the remaining 62 variables.

library(duckdb)
library(dplyr)
con <- dbConnect(duckdb(),dbdir = "ads.ddb",read_only=TRUE)
glauc <- dbGetQuery(con, "SELECT * FROM Glaucoma")

dbDisconnect(con)

glauc <- glauc %>% filter(Glaucoma==1) %>% 
    dplyr::select(-c(Glaucoma)) %>% 
    scale()

We first perform agglomerative clustering with correlation-based dissimilarity and complete linkage.

hc <- hclust(get_dist(glauc,method="pearson"),method="complete")

The merge matrix in the hclust output object describes the \(n-1\) steps in which the observations/clusters were merged. Negative numbers refer to observations; positive numbers refer to the cluster formed at that (earlier) step of the merge history.

hc$merge[1:25,]
      [,1] [,2]
 [1,]  -34  -41
 [2,]   -4  -73
 [3,]  -16  -95
 [4,]  -58  -93
 [5,]  -72  -90
 [6,]  -24  -70
 [7,]  -76    1
 [8,]   -6  -75
 [9,]  -51    2
[10,]  -43  -45
[11,]  -27  -55
[12,]   -9  -79
[13,]  -20  -65
[14,]  -19  -56
[15,]  -15  -71
[16,]  -29  -81
[17,]   -7   12
[18,]  -30  -57
[19,]  -17  -54
[20,]  -23  -25
[21,]  -80    7
[22,]  -61  -68
[23,]    4    5
[24,]    3   14
[25,]  -13  -60

The first merge combines observations #34 and #41 into a group of two. The next merge combines #4 and #73 into another group of two. At the seventh merge, observation #76 is combined with the group created at the first merge. This cluster now contains observations [34, 41, 76]. The first time two groups are being merged is at step 23: the groups consisting of observations [58, 93] and [72, 90] are combined into a cluster of 4.

The height vector is a vector of \(n-1\) values of the height criterion; the actual values depend on the linkage method. For this analysis, the first 25 heights at which merges occurred are as follows:

hc$height[1:25]
 [1] 0.1022034 0.1031823 0.1144513 0.1331928 0.1349688 0.1386951 0.1487625
 [8] 0.1510436 0.1540351 0.1547025 0.1657972 0.1679269 0.1770342 0.1806037
[15] 0.1838289 0.1868672 0.2097274 0.2276110 0.2467682 0.2477665 0.2543348
[22] 0.2682308 0.2684616 0.2737816 0.2779692

The dendrogram for this analysis is plotted in Figure 25.9 along with the bounding boxes for 4 and 8 clusters. The cut for the larger number of clusters occurs lower on the tree.

plot(hc, cex=0.5, main="")
rect.hclust(hc,k=4)
rect.hclust(hc,k=8)
Figure 25.9: Dendrogram for partial glaucoma data with correlation-based dissimilarity and complete linkage.

The sizes of the four clusters are as follows:

table(cutree(hc,k=4))

 1  2  3  4 
37 26 20 15 

Changing the linkage method to single demonstrates the chaining effect on the dendrogram (Figure 25.10). Identifying a reasonable number of clusters is more difficult.

hc_s <- hclust(get_dist(glauc,method="pearson"),method="single")
plot(hc_s, cex=0.5,main="")
rect.hclust(hc_s,k=4)
Figure 25.10: Dendrogram for partial glaucoma data, single linkage.

If you create a dendrogram object from the hclust results, a number of plotting functions are available to visualize the dendrogram in interesting ways. For example:

hc.dend <- as.dendrogram(hc) # create dendrogram object
plot(dendextend::color_branches(hc.dend,k=4),leaflab="none",horiz=TRUE)

factoextra::fviz_dend(hc.dend,k=4,horiz=TRUE,cex=0.4,palette="aaas",type="rectangle")

factoextra::fviz_dend(hc.dend,k=4,cex=0.4,palette="aaas",type="circular")

factoextra::fviz_dend(hc.dend,k=4,cex=0.4,palette="aaas",type="phylogenic")