Hierarchical Clustering: A Linkage-Based Approach to Data Clustering

Explore Hierarchical Clustering! 📊 Learn its types, linkage methods, advantages, disadvantages, when to use it, a Python code example, and more.

Introduction

Hierarchical clustering offers a structured way to group data points by progressively merging or dividing clusters. Unlike density-based or centroid-based methods, it relies on a linkage method to measure the similarity between clusters and to decide which clusters are merged or split. Let’s explore its types, working mechanism, advantages, and disadvantages.


Types of Hierarchical Clustering

Hierarchical clustering can be performed in two ways:

  1. Agglomerative Clustering (Bottom-Up Approach):
    • Initially, each data point is treated as its own cluster.
    • The most similar clusters are progressively merged until only one cluster, or the desired number of clusters, remains (see the sketch after this list).
  2. Divisive Clustering (Top-Down Approach):
    • Starts with all data points grouped into a single large cluster.
    • This cluster is recursively split into smaller clusters so that points within each resulting cluster become more similar to one another and more distinct from points in other clusters.
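
As a quick illustration of the bottom-up approach, here is a minimal sketch using scikit-learn’s AgglomerativeClustering on synthetic data; the dataset and parameter values are assumptions chosen only for demonstration.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three blobs (values chosen only for illustration)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=42)

# Bottom-up clustering: every point starts as its own cluster, and clusters
# are merged (here with Ward linkage) until n_clusters remain
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)

print(np.bincount(labels))  # number of points assigned to each cluster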

Linkage Methods in Hierarchical Clustering

The linkage method determines how the distance between clusters is computed during the merging or splitting process. Common methods include the following (a short comparison sketch appears after the list):

  1. Ward’s Method (Default in scikit-learn):
    • Merges the pair of clusters whose combination produces the smallest increase in total within-cluster variance (the sum of squared distances to each cluster’s centroid).
    • The increase in variance caused by a merge is treated as its “cost”, and the cheapest merge is chosen at every step.
  2. Average Linkage:
    • Defines the distance between two clusters as the average distance over all pairs of points, one from each cluster.
    • At each step, the pair of clusters with the smallest average pairwise distance is merged.
  3. Single Linkage:
    • Defines the distance between two clusters as the minimum distance between any two points from different clusters.
    • It can result in long, chain-like clusters (the “chaining” effect), especially for datasets with non-uniform density.
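
To see how the choice of linkage changes the resulting hierarchy, the sketch below builds a linkage matrix with each method and reports its cophenetic correlation, which measures how faithfully the hierarchy preserves the original pairwise distances (closer to 1 is better). The synthetic dataset is an assumption for demonstration.

from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist
from sklearn.datasets import make_blobs

# Synthetic data (assumed for illustration)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=42)
pairwise = pdist(X)  # condensed vector of pairwise distances

for method in ['ward', 'average', 'single']:
    Z = linkage(X, method=method)
    # Cophenetic correlation: agreement between the merge heights in the
    # hierarchy and the original pairwise distances
    c, _ = cophenet(Z, pairwise)
    print(f"{method:>7}: cophenetic correlation = {c:.3f}")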

How Hierarchical Clustering Works

Agglomerative Clustering Process:

  1. Initialize Clusters:
    • Treat each data point as an individual cluster.
  2. Calculate Similarity:
    • Use the chosen linkage method to compute the similarity between clusters.
  3. Merge Clusters:
    • Combine the two most similar clusters into one.
  4. Repeat:
    • Continue merging until the desired number of clusters is reached or all points belong to a single cluster (a from-scratch sketch of this loop follows).
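
The loop above can be written out directly. Below is a deliberately naive, from-scratch sketch of agglomerative clustering with single linkage on a tiny assumed dataset; the helper name single_linkage_distance is made up for illustration, and real libraries use far more efficient algorithms.

import numpy as np

def single_linkage_distance(c1, c2, X):
    # Minimum distance between any point in cluster c1 and any point in c2
    return min(np.linalg.norm(X[i] - X[j]) for i in c1 for j in c2)

def agglomerative(X, n_clusters):
    # Step 1: every point starts as its own cluster
    clusters = [[i] for i in range(len(X))]
    # Step 4: repeat until the desired number of clusters remains
    while len(clusters) > n_clusters:
        # Steps 2-3: find and merge the two closest clusters
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage_distance(clusters[a], clusters[b], X)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Tiny assumed dataset: two obvious groups
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
print(agglomerative(X, n_clusters=2))  # [[0, 1, 2], [3, 4]]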

Divisive Clustering Process:

  1. Start with One Cluster:
    • Treat all data points as a single cluster.
  2. Split Clusters:
    • Recursively divide clusters into smaller groups based on the chosen splitting criterion (see the bisecting sketch after this list).
  3. Continue Until Completion:
    • Stop when each point is its own cluster or when the desired number of clusters is reached.
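
Exact divisive clustering is rarely implemented directly, because evaluating every possible split is expensive. A common practical approximation, sketched below, is bisecting K-Means: repeatedly split the largest remaining cluster in two until the desired number of clusters is reached. The dataset and the rule of always splitting the largest cluster are assumptions for illustration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (assumed for illustration)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=42)

# Step 1: start with all points in a single cluster (stored as index arrays)
clusters = [np.arange(len(X))]
desired = 3

# Steps 2-3: repeatedly split the largest cluster until enough clusters exist
while len(clusters) < desired:
    largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
    idx = clusters.pop(largest)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
    clusters.append(idx[labels == 0])
    clusters.append(idx[labels == 1])

print([len(c) for c in clusters])  # sizes of the resulting clusters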

Choosing the Optimal Number of Clusters

A dendrogram is a tree diagram used in hierarchical clustering to help choose the number of clusters. Each merge is drawn as a link between two branches, and the height at which two clusters are joined indicates their dissimilarity: the higher the link, the less similar the merged clusters.

  • A common heuristic is to cut the dendrogram across the largest vertical gap between successive merge heights; the number of branches the cut crosses is the suggested number of clusters (a small sketch of this heuristic follows).
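
This heuristic can be read directly off the linkage matrix, whose third column stores the distance at which each merge happens. The sketch below finds the largest jump between consecutive merge distances and uses it to suggest a cluster count; it is a rule of thumb that assumes cluster separation shows up as a single large jump, and the dataset is synthetic for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Synthetic data (assumed for illustration)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=42)

Z = linkage(X, method='ward')
merge_distances = Z[:, 2]          # height of each successive merge
gaps = np.diff(merge_distances)    # jump between consecutive merges

# The largest jump suggests where the dendrogram should be cut
i = int(np.argmax(gaps))
suggested_k = len(X) - (i + 1)     # clusters remaining after merge i
threshold = (merge_distances[i] + merge_distances[i + 1]) / 2

print(f"Suggested number of clusters: {suggested_k}")
labels = fcluster(Z, threshold, criterion='distance')
print(f"fcluster with that threshold produces {len(np.unique(labels))} clusters")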

Advantages of Hierarchical Clustering

  • Easy Interpretation:
    • Dendrograms provide a clear visual representation of clustering and help identify the optimal number of clusters.
  • Relatively Robust to Outliers:
    • With linkages such as Ward’s or average, isolated outliers tend to join the hierarchy late and have limited influence on the main clusters; single linkage, by contrast, is more sensitive to them.

Disadvantages of Hierarchical Clustering

  • Scalability Issues:
    • Not suitable for large datasets: standard agglomerative algorithms need O(n²) memory for pairwise distances and typically O(n²)–O(n³) time (see the memory estimate after this list).
  • Parameter Dependency:
    • Requires specifying the number of clusters or using dendrogram analysis, which can be subjective.
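
To make the memory cost concrete, the snippet below estimates the size of the condensed pairwise distance matrix (stored as 64-bit floats) that standard agglomerative implementations work from; the sample sizes are arbitrary illustrations.

# Rough memory footprint of a condensed pairwise distance matrix (float64)
for n in (1_000, 10_000, 100_000):
    n_pairs = n * (n - 1) // 2
    gib = n_pairs * 8 / 2**30  # 8 bytes per float64
    print(f"n = {n:>7,}: {n_pairs:>14,} distances ≈ {gib:8.2f} GiB")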

When to Use Hierarchical Clustering?

Hierarchical clustering is ideal for:

  • Small to Medium-Sized Datasets:
    • Works best when the dataset size is manageable.
  • Data with a Natural Hierarchical Structure:
    • Suitable for cases where the data inherently exhibits a hierarchical relationship, such as biological taxonomies or organizational structures.
  • Exploratory Analysis:
    • Useful for understanding the structure of data before applying other clustering techniques.

Example Code

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.datasets import make_blobs

# Generate synthetic data
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=42)

# Perform hierarchical clustering
# Method options: 'single', 'complete', 'average', 'ward'
linkage_matrix = linkage(X, method='ward')

# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample index')
plt.ylabel('Distance')
plt.show()

# Assign cluster labels from the linkage matrix; with criterion='maxclust',
# the second argument (`t`) is interpreted as the maximum number of clusters
num_clusters = 3
clusters = fcluster(linkage_matrix, num_clusters, criterion='maxclust')

# Plot the clusters
plt.figure(figsize=(8, 6))
for cluster_label in np.unique(clusters):
    plt.scatter(X[clusters == cluster_label, 0], X[clusters == cluster_label, 1], label=f'Cluster {cluster_label}')
plt.title('Clusters Identified by Hierarchical Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

Final Thoughts

Hierarchical clustering provides a flexible and interpretable approach to data grouping. While its scalability limits its application to smaller datasets, its ability to reveal hierarchical relationships and work with different linkage methods makes it valuable for exploratory analysis.

As I explore clustering algorithms for various AI projects, I see that hierarchical clustering offers unique advantages that complement other techniques like K-Means and DBSCAN. For smaller datasets or problems requiring interpretability, it’s a strong contender.


Let’s Discuss!

Have specific datasets in mind for hierarchical clustering or want to see a practical implementation? Share your questions, and let’s explore together!
