Hierarchical Clustering: Building Dendrograms

Master hierarchical clustering from scratch. Learn agglomerative and divisive approaches, linkage criteria, dendrogram interpretation, cutting trees, and Python implementation with scipy and scikit-learn.

Hierarchical Clustering: Building Dendrograms

Hierarchical clustering builds a tree (dendrogram) of nested cluster relationships without requiring you to specify the number of clusters in advance. Agglomerative (bottom-up) clustering starts with each point as its own cluster and merges the two closest clusters at each step. Divisive (top-down) clustering starts with all points in one cluster and recursively splits. You choose the final number of clusters by cutting the dendrogram horizontally at any level — different heights reveal different granularities of the same hierarchical structure.

Introduction

K-Means forces a fundamental decision before clustering begins: specify k. This is inconvenient when you don’t know how many clusters exist, and limiting when clusters exist at multiple scales simultaneously — a dataset of news articles might cluster into 5 broad topics or 50 specific subtopics, and both views are valid.

Hierarchical clustering offers a fundamentally different philosophy: rather than partitioning data into exactly k clusters, it builds a complete tree of merges (or splits) from which any level of granularity can be read. A biologist grouping animal species can read the dendrogram at the level of genus, family, order, or class — each cut of the tree reveals a different but equally valid taxonomy. A marketing analyst can cut the customer dendrogram at 5 clusters for a broad strategy or 20 clusters for a precision campaign.

The result is a dendrogram — a visual tree diagram where the height of each merge reflects the dissimilarity between the groups being merged. Similar points merge early (low on the tree) at small heights. Dissimilar groups merge late (high on the tree) at large heights. The structure of the dendrogram itself is informative — long branches indicate well-separated groups; short branches indicate that nearby points merged at similar distances.

This article covers the complete theory and implementation of hierarchical clustering: the agglomerative algorithm, every major linkage criterion and when each fails, dendrogram interpretation, cutting strategies, practical Python implementation with scipy and scikit-learn, and a comparison with K-Means.

The Agglomerative Algorithm

Agglomerative hierarchical clustering builds the dendrogram bottom-up:

  1. Start: each of the n data points is its own cluster (n clusters total)
  2. Compute all pairwise distances between clusters
  3. Merge the two closest clusters into one
  4. Update the distance matrix (using the chosen linkage criterion)
  5. Repeat steps 3–4 until only one cluster remains

The result is a complete binary tree of n−1 merges.

Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from scipy.spatial.distance import pdist, squareform
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)


class AgglomerativeFromScratch:
    """
    Agglomerative hierarchical clustering implemented from scratch.
    Tracks every merge step for educational visualization.
    Supports single, complete, average, and Ward linkage.
    """

    def __init__(self, linkage='ward'):
        """
        Args:
            linkage: 'single', 'complete', 'average', or 'ward'
        """
        self.linkage_  = linkage
        self.merges_   = []   # List of (cluster_a, cluster_b, distance, size)
        self.labels_   = None

    def _cluster_distance(self, C_a, C_b, X, dist_matrix):
        """
        Compute distance between two clusters using the specified linkage.

        Single:   min distance between any pair of points across clusters
        Complete: max distance between any pair of points across clusters
        Average:  mean distance between all pairs (UPGMA)
        Ward:     increase in total WCSS from merging (variance-based)
        """
        a_pts = list(C_a)
        b_pts = list(C_b)

        if self.linkage_ == 'single':
            return min(dist_matrix[i, j] for i in a_pts for j in b_pts)

        elif self.linkage_ == 'complete':
            return max(dist_matrix[i, j] for i in a_pts for j in b_pts)

        elif self.linkage_ == 'average':
            total = sum(dist_matrix[i, j] for i in a_pts for j in b_pts)
            return total / (len(a_pts) * len(b_pts))

        elif self.linkage_ == 'ward':
            # Ward linkage: increase in WCSS from merging A and B
            n_a = len(a_pts)
            n_b = len(b_pts)
            mu_a = X[a_pts].mean(axis=0)
            mu_b = X[b_pts].mean(axis=0)
            # Ward distance formula
            return np.sqrt(n_a * n_b / (n_a + n_b)) * np.linalg.norm(mu_a - mu_b)

        else:
            raise ValueError(f"Unknown linkage: {self.linkage_}")

    def fit(self, X):
        """Build the dendrogram by repeatedly merging nearest clusters."""
        n = len(X)
        # Precompute pairwise distances
        dist_matrix = squareform(pdist(X, metric='euclidean'))

        # Each point starts as its own cluster (stored as a frozenset of indices)
        clusters    = {i: frozenset([i]) for i in range(n)}
        cluster_ids = list(range(n))
        self.merges_ = []

        while len(clusters) > 1:
            # Find the two closest clusters
            min_dist = np.inf
            merge_a  = None
            merge_b  = None

            ids = list(clusters.keys())
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    a, b = ids[i], ids[j]
                    d    = self._cluster_distance(
                        clusters[a], clusters[b], X, dist_matrix
                    )
                    if d < min_dist:
                        min_dist = d
                        merge_a  = a
                        merge_b  = b

            # Merge
            new_cluster = clusters[merge_a] | clusters[merge_b]
            new_id      = max(clusters.keys()) + 1
            size        = len(new_cluster)

            self.merges_.append((merge_a, merge_b, min_dist, size))

            # Update clusters
            del clusters[merge_a]
            del clusters[merge_b]
            clusters[new_id] = new_cluster

        return self


# Validate against scipy on a small dataset
print("=== Agglomerative From Scratch: Validation ===\n")

np.random.seed(42)
X_small = np.array([
    [1.0, 1.0], [1.5, 1.2], [2.0, 1.0],   # Cluster A
    [6.0, 5.0], [6.5, 5.5], [7.0, 5.0],   # Cluster B
    [3.5, 8.0], [4.0, 7.5],               # Cluster C
])
X_small_s = StandardScaler().fit_transform(X_small)

for link in ['single', 'complete', 'average', 'ward']:
    hc = AgglomerativeFromScratch(linkage=link)
    hc.fit(X_small_s)

    print(f"  Linkage: {link}")
    print(f"  {'Step':>5} | {'Merge A':>8} | {'Merge B':>8} | {'Distance':>10} | "
          f"{'New Size':>9}")
    print(f"  {'-'*48}")
    for step, (a, b, dist, size) in enumerate(hc.merges_, 1):
        print(f"  {step:>5} | {a:>8} | {b:>8} | {dist:>10.4f} | {size:>9}")
    print()

Linkage Criteria: The Critical Choice

The linkage criterion defines how the distance between two clusters is measured. This is the most important hyperparameter in hierarchical clustering — different linkages produce qualitatively different dendrograms and cluster shapes.

Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs, make_moons


def compare_linkage_criteria(X_s, y_true, title="Dataset", figsize=(20, 12)):
    """
    Side-by-side comparison of four linkage criteria on the same dataset.
    Shows both the dendrogram structure and the resulting cluster assignments
    when the tree is cut to the true number of clusters.
    """
    linkage_methods = ['single', 'complete', 'average', 'ward']
    linkage_labels  = [
        'Single Linkage\n(Min distance — "chaining")',
        'Complete Linkage\n(Max distance — "compactness")',
        'Average Linkage\n(Mean distance — UPGMA)',
        'Ward Linkage\n(Min variance increase)',
    ]

    k_true = len(np.unique(y_true))
    fig, axes = plt.subplots(2, 4, figsize=figsize)

    cluster_colors = ['#e74c3c', '#3498db', '#2ecc71', '#f39c12', '#9b59b6']

    for col, (method, label) in enumerate(zip(linkage_methods, linkage_labels)):
        # Row 0: Dendrogram
        ax_dend = axes[0, col]
        Z = linkage(X_s, method=method)
        dendrogram(
            Z, ax=ax_dend,
            truncate_mode='lastp', p=20,
            show_leaf_counts=True,
            leaf_rotation=90,
            leaf_font_size=7,
            color_threshold=Z[-k_true, 2],  # Color at the k_true-cluster cut
        )
        ax_dend.axhline(y=Z[-k_true, 2], color='coral', linestyle='--',
                         lw=1.5, alpha=0.8, label=f'Cut at k={k_true}')
        ax_dend.set_title(label, fontsize=9, fontweight='bold')
        ax_dend.set_ylabel('Distance' if col == 0 else '', fontsize=9)
        ax_dend.tick_params(axis='x', labelsize=7)
        ax_dend.legend(fontsize=7)

        # Row 1: Scatter plot of cluster assignments
        ax_scat = axes[1, col]
        labels  = fcluster(Z, t=k_true, criterion='maxclust') - 1  # 0-indexed

        from sklearn.metrics import adjusted_rand_score
        ari = adjusted_rand_score(y_true, labels)

        for cls in range(k_true):
            mask = labels == cls
            ax_scat.scatter(X_s[mask, 0], X_s[mask, 1],
                            c=cluster_colors[cls % len(cluster_colors)],
                            s=40, edgecolors='white', linewidth=0.4,
                            alpha=0.85)
        ax_scat.set_title(f'Clusters (ARI={ari:.3f})',
                           fontsize=9, fontweight='bold')
        if col == 0:
            ax_scat.set_xlabel('Feature 1', fontsize=8)
            ax_scat.set_ylabel('Feature 2', fontsize=8)
        ax_scat.grid(True, alpha=0.2)

    plt.suptitle(f'Four Linkage Criteria: {title}\n'
                 'Row 1: Dendrograms | Row 2: Cluster Assignments at k={}'.format(k_true),
                 fontsize=13, fontweight='bold', y=1.01)
    plt.tight_layout()
    plt.savefig(f'linkage_comparison_{title.lower().replace(" ", "_")}.png',
                dpi=150, bbox_inches='tight')
    plt.show()
    print(f"Saved linkage comparison plot.")


# Test on blobs (spherical clusters)
np.random.seed(42)
X_blobs, y_blobs = make_blobs(n_samples=150, centers=3, cluster_std=0.7,
                               random_state=42)
X_blobs_s = StandardScaler().fit_transform(X_blobs)
compare_linkage_criteria(X_blobs_s, y_blobs, title="Spherical Blobs (k=3)")

# Test on moons (non-spherical)
X_moons, y_moons = make_moons(n_samples=150, noise=0.1, random_state=42)
X_moons_s = StandardScaler().fit_transform(X_moons)
compare_linkage_criteria(X_moons_s, y_moons, title="Two Moons")

Linkage Criterion Properties

LinkageDistance definitionCluster shapeSensitivityBest for
Singlemin(d(a,b)) for a∈A, b∈BElongated chainsOutliers (chaining effect)Manifold/filament data
Completemax(d(a,b)) for a∈A, b∈BCompact, equal-sizeOutliers (forced into clusters)Compact, spherical clusters
Averagemean(d(a,b)) for a∈A, b∈BBalancedModerateGeneral purpose
Wardmin increase in WCSSCompact, sphericalOutliers break compactnessCompact clusters, default choice

Ward linkage is the default choice for most applications because it minimizes the total within-cluster variance at each merge — conceptually equivalent to K-Means’ objective but solved hierarchically. Single linkage should be used when clusters have elongated or manifold shapes but avoided otherwise due to the chaining effect.

Reading Dendrograms

A dendrogram encodes the complete clustering hierarchy. Understanding how to read it is essential for interpreting hierarchical clustering results.

Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster, cophenet
from scipy.spatial.distance import pdist
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris


def annotated_dendrogram(X_s, labels_true=None, names=None,
                           method='ward', title="Annotated Dendrogram",
                           figsize=(16, 8)):
    """
    Create a fully annotated dendrogram with educational callouts.

    Reading guide:
    - Height of merge: dissimilarity between the merging clusters
    - Long vertical line: the two groups are very different — good separator
    - Short vertical line: groups were similar — uncertain separation
    - Colored regions: distinct subtrees when cut at a given height
    - Leaves: individual data points (or small groups if truncated)
    """
    Z = linkage(X_s, method=method)

    # Cophenetic correlation: how well does the dendrogram preserve distances?
    c, coph_dists = cophenet(Z, pdist(X_s))

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize,
                                    gridspec_kw={'width_ratios': [2, 1]})

    # ── Left: Full annotated dendrogram ──────────────────────────
    # Suggest a cut level: where is there the longest gap between merges?
    heights = Z[:, 2]
    height_diffs = np.diff(heights)
    cut_idx  = np.argmax(height_diffs[-5:]) - 5  # Among the top merges
    cut_height = (heights[-2] + heights[-1]) / 2  # Default: 2 clusters

    # Find "best" cut: largest gap in top 8 merges
    top_8 = heights[-8:]
    gaps  = np.diff(top_8)
    best_gap_idx = np.argmax(gaps) + len(heights) - 8
    best_cut_height = (heights[best_gap_idx] + heights[best_gap_idx + 1]) / 2

    k_at_best_cut = len(np.unique(fcluster(Z, t=best_cut_height,
                                            criterion='distance')))

    dend_result = dendrogram(
        Z, ax=ax1,
        orientation='top',
        truncate_mode='lastp', p=30,
        show_leaf_counts=True,
        leaf_rotation=90,
        leaf_font_size=8,
        color_threshold=best_cut_height,
        above_threshold_color='gray',
        labels=names,
    )

    # Annotate the cut line
    ax1.axhline(y=best_cut_height, color='coral', linestyle='--', lw=2.5,
                label=f'Suggested cut → k={k_at_best_cut} clusters')
    ax1.set_title(f'{title}\n'
                  f'Cophenetic r = {c:.4f} | Method: {method} | '
                  f'Suggested k = {k_at_best_cut}',
                  fontsize=10, fontweight='bold')
    ax1.set_ylabel('Distance (merge height)', fontsize=11)
    ax1.legend(fontsize=9)

    # Annotate largest gap
    top_k = min(20, len(heights))
    ax1.annotate(
        f'Largest gap\n(cut here\nfor k={k_at_best_cut})',
        xy=(ax1.get_xlim()[1] * 0.6, best_cut_height),
        xytext=(ax1.get_xlim()[1] * 0.75, best_cut_height * 1.3),
        fontsize=8, color='coral',
        arrowprops=dict(arrowstyle='->', color='coral', lw=1.5)
    )

    # ── Right: Height histogram — where are the merges? ──────────
    ax2.barh(range(len(heights[-15:])), heights[-15:],
             color='steelblue', edgecolor='white', linewidth=0.5)
    ax2.axvline(x=best_cut_height, color='coral', linestyle='--', lw=2,
                label='Suggested cut')
    ax2.set_xlabel('Merge height (distance)', fontsize=10)
    ax2.set_ylabel('Merge step (recent)', fontsize=10)
    ax2.set_title('Last 15 Merge Heights\n(Large gap = natural separation)',
                  fontsize=9, fontweight='bold')
    ax2.legend(fontsize=8)
    ax2.invert_yaxis()

    plt.suptitle(f'Dendrogram Reading Guide: {title}',
                 fontsize=13, fontweight='bold')
    plt.tight_layout()
    plt.savefig('dendrogram_annotated.png', dpi=150, bbox_inches='tight')
    plt.show()
    print("Saved: dendrogram_annotated.png")

    print(f"\n  Cophenetic correlation: {c:.4f}")
    print(f"  (Range [0,1] — closer to 1 = dendrogram preserves true distances)")
    if c > 0.8:
        print(f"  → Strong: dendrogram is a faithful representation of the data")
    elif c > 0.6:
        print(f"  → Moderate: some distortion from hierarchical structure")
    else:
        print(f"  → Weak: dendrogram may not represent the data well")
    print(f"\n  Suggested cut: k = {k_at_best_cut} clusters")

    return Z, k_at_best_cut


iris = load_iris()
X_iris_s = StandardScaler().fit_transform(iris.data)
Z_iris, k_iris = annotated_dendrogram(
    X_iris_s, labels_true=iris.target,
    method='ward', title="Iris Dataset"
)

The Cophenetic Correlation

The cophenetic correlation coefficient measures how faithfully the dendrogram represents the true pairwise distances. For two points x_i and x_j, their cophenetic distance is the height at which they first merge in the dendrogram. A high correlation (>0.8) between original distances and cophenetic distances means the dendrogram is a good summary of the data’s distance structure.

Cutting the Dendrogram: Choosing k

Once the dendrogram is built, you choose the number of clusters by “cutting” the tree horizontally. Every horizontal cut through the dendrogram defines a unique clustering.

Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine


def cut_dendrogram_analysis(X_s, Z, dataset_name="Dataset",
                               k_range=range(2, 9)):
    """
    Analyze multiple cut heights to find the best k.

    Two cutting strategies:
    1. maxclust: cut to produce exactly k clusters
    2. distance: cut at a given height (natural approach)

    For each k, compute silhouette score to evaluate quality.
    The "largest gap" heuristic identifies the height where
    the dendrogram suggests the most natural cut.
    """
    # Strategy 1: Evaluate silhouette for k = 2 to k_max
    k_vals    = list(k_range)
    sil_scores = []

    for k in k_vals:
        labels = fcluster(Z, t=k, criterion='maxclust') - 1
        sil = silhouette_score(X_s, labels) if len(np.unique(labels)) > 1 else -1
        sil_scores.append(sil)

    best_k_sil = k_vals[np.argmax(sil_scores)]

    # Strategy 2: Largest gap heuristic
    heights = Z[:, 2]
    height_diffs = np.diff(heights)
    # Find largest gap in the last min(10, n-1) merges
    top_n = min(10, len(height_diffs))
    gap_idx = np.argmax(height_diffs[-top_n:]) + len(height_diffs) - top_n
    cut_height_gap = (heights[gap_idx] + heights[gap_idx + 1]) / 2
    k_gap = len(np.unique(fcluster(Z, t=cut_height_gap, criterion='distance')))

    print(f"=== Dendrogram Cut Analysis: {dataset_name} ===\n")
    print(f"  {'k':>4} | {'Silhouette':>12} | {'Cluster sizes'}")
    print(f"  {'-'*50}")

    for k, sil in zip(k_vals, sil_scores):
        labels = fcluster(Z, t=k, criterion='maxclust') - 1
        sizes  = sorted(np.bincount(labels).tolist(), reverse=True)
        flag   = " ← best" if k == best_k_sil else ""
        print(f"  {k:>4} | {sil:>12.4f} | {sizes}{flag}")

    print(f"\n  Best k by silhouette:      {best_k_sil}")
    print(f"  Best k by largest gap:     {k_gap}")

    # Visualize
    fig, axes = plt.subplots(1, 3, figsize=(16, 5))

    # Dendrogram with cuts shown
    ax = axes[0]
    dendrogram(Z, ax=ax, truncate_mode='lastp', p=20,
                color_threshold=heights[-best_k_sil], leaf_font_size=7)
    ax.axhline(y=heights[-best_k_sil], color='coral', linestyle='--', lw=2,
               label=f'Silhouette cut (k={best_k_sil})')
    ax.axhline(y=cut_height_gap, color='steelblue', linestyle=':', lw=2,
               label=f'Gap heuristic (k={k_gap})')
    ax.set_title('Dendrogram with Cut Lines', fontsize=10, fontweight='bold')
    ax.set_ylabel('Distance', fontsize=9)
    ax.legend(fontsize=8)

    # Silhouette vs k
    ax = axes[1]
    ax.plot(k_vals, sil_scores, 'o-', color='coral', lw=2.5, markersize=8)
    ax.axvline(x=best_k_sil, color='coral', linestyle='--', lw=2,
               label=f'Best k={best_k_sil}')
    ax.set_xlabel('k', fontsize=11); ax.set_ylabel('Silhouette', fontsize=11)
    ax.set_title('Silhouette Score vs k', fontsize=10, fontweight='bold')
    ax.legend(fontsize=9); ax.grid(True, alpha=0.3)

    # Cluster sizes at best k
    ax = axes[2]
    best_labels = fcluster(Z, t=best_k_sil, criterion='maxclust') - 1
    sizes = np.bincount(best_labels)
    colors = plt.cm.tab10(np.linspace(0, 0.9, best_k_sil))
    ax.bar(range(best_k_sil), sizes, color=colors, edgecolor='white')
    ax.set_xlabel('Cluster', fontsize=11); ax.set_ylabel('Size', fontsize=11)
    ax.set_xticks(range(best_k_sil))
    ax.set_xticklabels([f'C{i}' for i in range(best_k_sil)], fontsize=9)
    ax.set_title(f'Cluster Sizes at k={best_k_sil}', fontsize=10, fontweight='bold')
    ax.grid(True, alpha=0.3, axis='y')

    plt.suptitle(f'Dendrogram Cut Selection: {dataset_name}',
                 fontsize=13, fontweight='bold')
    plt.tight_layout()
    plt.savefig('dendrogram_cut_analysis.png', dpi=150)
    plt.show()
    print("Saved: dendrogram_cut_analysis.png")

    return best_k_sil


wine = load_wine()
X_w_s  = StandardScaler().fit_transform(wine.data)
Z_wine = linkage(X_w_s, method='ward')
best_k = cut_dendrogram_analysis(X_w_s, Z_wine, dataset_name="Wine Dataset")

Scikit-learn Implementation

For large datasets or pipeline integration, scikit-learn’s AgglomerativeClustering is the practical choice. It is faster than the from-scratch implementation and integrates with the Pipeline API.

Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.datasets import load_iris, load_wine, load_breast_cancer
from sklearn.pipeline import Pipeline


def sklearn_agglomerative_demo():
    """
    Demonstrate AgglomerativeClustering with all sklearn features:
    - All four linkage methods
    - Connectivity constraints (graph structure)
    - Integration with Pipeline
    - Comparison with K-Means
    """
    datasets = {
        'Iris (k=3)':    (load_iris().data,          load_iris().target),
        'Wine (k=3)':    (load_wine().data,           load_wine().target),
        'Cancer (k=2)':  (load_breast_cancer().data, load_breast_cancer().target),
    }

    linkage_methods = ['ward', 'complete', 'average', 'single']

    print("=== AgglomerativeClustering: Scikit-learn ===\n")

    for ds_name, (X, y_true) in datasets.items():
        k = len(np.unique(y_true))
        print(f"  {ds_name}:")
        print(f"  {'Linkage':<12} | {'ARI':>8} | {'Silhouette':>11} | Notes")
        print(f"  {'-'*50}")

        scaler  = StandardScaler()
        X_s     = scaler.fit_transform(X)

        best_ari = -1
        for method in linkage_methods:
            if method == 'ward':
                # Ward only works with Euclidean distance
                hc = AgglomerativeClustering(n_clusters=k, linkage=method)
            else:
                hc = AgglomerativeClustering(n_clusters=k, linkage=method,
                                              metric='euclidean')

            labels = hc.fit_predict(X_s)
            ari    = adjusted_rand_score(y_true, labels)
            sil    = silhouette_score(X_s, labels)
            flag   = " ← best" if ari > best_ari else ""
            if ari > best_ari:
                best_ari = ari
            note   = "Default, usually best" if method == 'ward' else ""
            print(f"  {method:<12} | {ari:>8.4f} | {sil:>11.4f} | {note}{flag}")
        print()


sklearn_agglomerative_demo()


def agglomerative_pipeline():
    """
    AgglomerativeClustering in a Pipeline — same as any other sklearn estimator.
    Note: fit_predict works but predict() is not available (hierarchical
    clustering doesn't generalize to new points without refitting).
    """
    from sklearn.datasets import make_blobs
    np.random.seed(42)
    X, y = make_blobs(300, centers=4, cluster_std=0.7, random_state=42)

    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('hc',     AgglomerativeClustering(n_clusters=4, linkage='ward')),
    ])

    labels = pipe.fit_predict(X)
    sil    = silhouette_score(
        pipe.named_steps['scaler'].transform(X), labels
    )
    ari    = adjusted_rand_score(y, labels)

    print(f"=== Agglomerative Pipeline ===\n")
    print(f"  Pipeline: StandardScaler → AgglomerativeClustering(ward, k=4)")
    print(f"  Silhouette: {sil:.4f}")
    print(f"  ARI:        {ari:.4f}")
    print(f"\n  Note: AgglomerativeClustering has no .predict() method.")
    print(f"  For new data, you must refit or use K-Means for large-scale deployment.")


agglomerative_pipeline()

Hierarchical Clustering vs K-Means

Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.datasets import make_blobs, make_moons, make_circles


def compare_kmeans_vs_hierarchical(figsize=(18, 12)):
    """
    Direct comparison on four datasets that challenge different aspects
    of each algorithm.
    """
    np.random.seed(42)

    datasets = {
        'Spherical blobs':
            make_blobs(200, centers=3, cluster_std=0.6, random_state=42),
        'Unequal sizes':
            (np.vstack([np.random.randn(20, 2)*0.3,
                        np.random.randn(180, 2)*0.5 + [3, 0]]),
             np.array([0]*20 + [1]*180)),
        'Non-spherical (moons)':
            make_moons(200, noise=0.1, random_state=42),
        'Concentric circles':
            make_circles(200, noise=0.05, factor=0.5, random_state=42),
    }

    fig, axes = plt.subplots(len(datasets), 3, figsize=figsize)
    colors = ['#e74c3c', '#3498db', '#2ecc71']

    for row, (ds_name, (X, y_true)) in enumerate(datasets.items()):
        X_s   = StandardScaler().fit_transform(X)
        k     = len(np.unique(y_true))
        scaler = StandardScaler()
        X_s   = scaler.fit_transform(X)

        methods = {
            'K-Means':
                KMeans(n_clusters=k, init='k-means++', n_init=10,
                       random_state=42).fit_predict(X_s),
            'Hierarchical (Ward)':
                AgglomerativeClustering(n_clusters=k, linkage='ward')
                    .fit_predict(X_s),
            'Hierarchical (Single)':
                AgglomerativeClustering(n_clusters=k, linkage='single')
                    .fit_predict(X_s),
        }

        for col, (method_name, labels) in enumerate(methods.items()):
            ax = axes[row, col]
            ari = adjusted_rand_score(y_true, labels)

            for cls_i, color in enumerate(colors[:k]):
                mask = labels == cls_i
                ax.scatter(X_s[mask, 0], X_s[mask, 1], c=color,
                           s=25, edgecolors='white', linewidth=0.3, alpha=0.85)

            ax.set_title(f'{method_name}\nARI = {ari:.3f}',
                         fontsize=9, fontweight='bold')
            ax.set_xticks([]); ax.set_yticks([])
            ax.grid(True, alpha=0.2)

            if col == 0:
                ax.set_ylabel(ds_name, fontsize=9, fontweight='bold')

    plt.suptitle('K-Means vs Hierarchical Clustering (Ward and Single Linkage)\n'
                 'ARI = 1.0 means perfect recovery of true clusters',
                 fontsize=13, fontweight='bold', y=1.01)
    plt.tight_layout()
    plt.savefig('kmeans_vs_hierarchical.png', dpi=150, bbox_inches='tight')
    plt.show()
    print("Saved: kmeans_vs_hierarchical.png")


compare_kmeans_vs_hierarchical()

Key Differences

PropertyK-MeansHierarchical (Ward)Hierarchical (Single)
Requires k in advanceYesNo (cut afterwards)No
ScalabilityO(n·k·T) — fastO(n² log n) — slowO(n² log n) — slow
Cluster shapeSpherical onlySpherical (Ward)Any shape (single)
DeterministicNo (random init)YesYes
New data predictionYes (predict())No (must refit)No
InterpretabilityCentroidsFull tree structureFull tree structure
MemoryO(n)O(n²) for distance matrixO(n²)

Use Hierarchical Clustering when:

  • You want to explore multiple levels of granularity without rerunning
  • The dendrogram structure itself is informative (biology, genealogy)
  • You have fewer than ~10,000 points (memory constraint)
  • You need deterministic results
  • Ward linkage is appropriate for your data (compact clusters)

Use K-Means when:

  • n > 10,000 (hierarchical is too slow)
  • You need to predict cluster membership for new data
  • Speed matters
  • Clusters are roughly spherical and equal-sized

Real-World Application: Gene Expression Clustering

Hierarchical clustering is the standard tool in bioinformatics for visualizing gene expression patterns — the dendrogram structure directly corresponds to functional gene families.

Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.preprocessing import StandardScaler


def gene_expression_clustering(n_genes=50, n_conditions=20, random_state=42):
    """
    Simulate a gene expression matrix and apply hierarchical clustering
    to discover gene groups with similar expression patterns.

    This is the standard "heatmap + dendrogram" visualization used
    throughout bioinformatics.
    """
    np.random.seed(random_state)

    # Simulate 5 gene families with different expression patterns
    n_per_family = n_genes // 5
    families = [
        np.random.randn(n_per_family, n_conditions) * 0.3 + [2, 0, -2, 0, 2, 0, -2, 0,
                        2, 0, -2, 0, 2, 0, -2, 0, 2, 0, -2, 0],  # Oscillating up
        np.random.randn(n_per_family, n_conditions) * 0.3 + np.linspace(-2, 2, n_conditions),  # Linear up
        np.random.randn(n_per_family, n_conditions) * 0.3 + np.linspace(2, -2, n_conditions),  # Linear down
        np.random.randn(n_per_family, n_conditions) * 0.3 + 2 * np.sin(
            np.linspace(0, 4*np.pi, n_conditions)),  # Sinusoidal
        np.random.randn(n_per_family, n_conditions) * 0.5,                    # Flat (noise)
    ]

    X_genes = np.vstack(families)
    true_families = np.repeat(np.arange(5), n_per_family)

    gene_names  = [f'Gene_{i:03d}' for i in range(n_genes)]
    cond_names  = [f'Cond_{i+1}' for i in range(n_conditions)]

    # Cluster genes (rows) using correlation distance
    X_scaled = StandardScaler().fit_transform(X_genes.T).T  # Standardize per gene
    Z_genes  = linkage(X_scaled, method='ward')

    # Cluster conditions (columns)
    Z_conds = linkage(X_scaled.T, method='ward')

    # Reorder for visualization
    from scipy.cluster.hierarchy import leaves_list
    gene_order = leaves_list(Z_genes)
    cond_order = leaves_list(Z_conds)

    fig = plt.figure(figsize=(16, 10))

    # Gene dendrogram (left)
    ax_gene_dend = fig.add_axes([0.05, 0.1, 0.12, 0.75])
    dend_gene = dendrogram(Z_genes, orientation='left',
                            no_labels=True, ax=ax_gene_dend,
                            color_threshold=Z_genes[-5, 2])
    ax_gene_dend.axis('off')

    # Condition dendrogram (top)
    ax_cond_dend = fig.add_axes([0.18, 0.87, 0.72, 0.1])
    dend_cond = dendrogram(Z_conds, orientation='top',
                            no_labels=True, ax=ax_cond_dend)
    ax_cond_dend.axis('off')

    # Heatmap (center)
    ax_heat = fig.add_axes([0.18, 0.1, 0.72, 0.75])
    X_reordered = X_scaled[gene_order, :][:, cond_order]
    im = ax_heat.imshow(X_reordered, aspect='auto', cmap='RdBu_r',
                         vmin=-2.5, vmax=2.5)
    ax_heat.set_xticks(range(n_conditions))
    ax_heat.set_xticklabels([cond_names[i] for i in cond_order],
                              rotation=90, fontsize=7)
    ax_heat.set_yticks(range(n_genes))
    ax_heat.set_yticklabels([gene_names[i] for i in gene_order], fontsize=6)
    ax_heat.set_title('Gene Expression Heatmap with Hierarchical Clustering\n'
                       '(Rows = genes, Columns = conditions, Color = expression level)',
                       fontsize=11, fontweight='bold', pad=60)

    # Color bar
    ax_cbar = fig.add_axes([0.92, 0.1, 0.02, 0.75])
    plt.colorbar(im, cax=ax_cbar, label='Standardized expression')

    plt.suptitle('Hierarchical Clustering of Gene Expression Data\n'
                 '(Reordered heatmap reveals gene families and co-expressed groups)',
                 fontsize=13, fontweight='bold', y=1.01)
    plt.savefig('gene_expression_clustering.png', dpi=150, bbox_inches='tight')
    plt.show()
    print("Saved: gene_expression_clustering.png")

    # Evaluate recovery of true gene families
    from sklearn.metrics import adjusted_rand_score
    from scipy.cluster.hierarchy import fcluster
    predicted = fcluster(Z_genes, t=5, criterion='maxclust') - 1
    ari = adjusted_rand_score(true_families, predicted)
    print(f"\n  Gene family recovery: ARI = {ari:.4f}")
    print(f"  (Hierarchical clustering recovers known functional groups)")


gene_expression_clustering()

Divisive Hierarchical Clustering

Agglomerative clustering works bottom-up, starting with n clusters and merging. Divisive hierarchical clustering works top-down: start with all points in one cluster and recursively split. The resulting dendrogram has the same structure — a binary tree — but is built from root to leaves rather than leaves to root.

Divisive clustering is less commonly implemented (scikit-learn does not include it), but it naturally surfaces the major splits first, which can be useful when the top-level divisions are the most meaningful.

Python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score


class DivisiveClustering:
    """
    Divisive hierarchical clustering (DIANA-like algorithm).

    Strategy:
    1. Start with all n points in one cluster
    2. Choose the cluster with the highest diameter (max pairwise distance)
    3. Split that cluster into 2 sub-clusters using K-Means (k=2)
    4. Repeat until each cluster has 1 point or max_clusters is reached

    Stores splits as a tree for dendrogram-like visualization.
    """

    def __init__(self, max_clusters=None, random_state=42):
        self.max_clusters   = max_clusters
        self.random_state   = random_state
        self.split_history_ = []   # (parent_id, child1_id, child2_id, split_score)
        self.cluster_map_   = {}   # id → set of point indices

    def _cluster_diameter(self, X, indices):
        """Maximum pairwise distance within a cluster."""
        if len(indices) < 2:
            return 0.0
        from scipy.spatial.distance import pdist
        pts = X[list(indices)]
        return pdist(pts, metric='euclidean').max()

    def fit(self, X):
        """Build divisive tree by recursively splitting the 'widest' cluster."""
        n   = len(X)
        max_k = self.max_clusters if self.max_clusters else n

        # Start: one cluster containing all points
        next_id = 0
        self.cluster_map_ = {next_id: frozenset(range(n))}
        next_id += 1
        self.split_history_ = []

        while len(self.cluster_map_) < max_k:
            # Find the cluster with the largest diameter
            best_cluster_id = None
            best_diameter   = -1
            for cid, indices in self.cluster_map_.items():
                if len(indices) < 2:
                    continue
                diam = self._cluster_diameter(X, indices)
                if diam > best_diameter:
                    best_diameter   = diam
                    best_cluster_id = cid

            if best_cluster_id is None:
                break  # All clusters have 1 point

            # Split into 2 using K-Means
            indices_list = list(self.cluster_map_[best_cluster_id])
            X_sub = X[indices_list]
            km = KMeans(n_clusters=2, init='k-means++', n_init=5,
                        random_state=self.random_state)
            sub_labels = km.fit_predict(X_sub)

            child_A = frozenset(indices_list[i] for i in range(len(indices_list))
                                if sub_labels[i] == 0)
            child_B = frozenset(indices_list[i] for i in range(len(indices_list))
                                if sub_labels[i] == 1)

            id_A, id_B = next_id, next_id + 1
            next_id += 2

            self.split_history_.append(
                (best_cluster_id, id_A, id_B, best_diameter)
            )

            del self.cluster_map_[best_cluster_id]
            self.cluster_map_[id_A] = child_A
            self.cluster_map_[id_B] = child_B

        # Build final labels
        self.labels_ = np.zeros(n, dtype=int)
        for cls_i, (cid, indices) in enumerate(self.cluster_map_.items()):
            for idx in indices:
                self.labels_[idx] = cls_i

        return self

    def get_labels_at_k(self, k):
        """
        Return cluster labels when the tree is cut to produce k clusters.
        Works by replaying the split history up to k-1 splits.
        """
        n = sum(len(v) for v in self.cluster_map_.values())
        # Re-run to k clusters
        divisive = DivisiveClustering(max_clusters=k,
                                       random_state=self.random_state)
        # Build temporary X mapping
        # Note: this requires re-fitting, which is a limitation of this
        # simple implementation
        return divisive


# Test divisive clustering
print("=== Divisive Clustering ===\n")
np.random.seed(42)
X_div, y_div = make_blobs(n_samples=150, centers=4, cluster_std=0.6,
                            random_state=42)
X_div_s = StandardScaler().fit_transform(X_div)

for k in [2, 3, 4, 5]:
    dc = DivisiveClustering(max_clusters=k, random_state=42)
    dc.fit(X_div_s)
    ari = adjusted_rand_score(y_div, dc.labels_)
    sizes = sorted(np.bincount(dc.labels_).tolist(), reverse=True)
    print(f"  k={k}: ARI={ari:.4f}, sizes={sizes}")

print(f"\n  Split history (diameter → cluster):")
dc_full = DivisiveClustering(max_clusters=5, random_state=42)
dc_full.fit(X_div_s)
for step, (parent, c1, c2, diam) in enumerate(dc_full.split_history_, 1):
    print(f"    Step {step}: Split cluster {parent} → [{c1}, {c2}] "
          f"(diameter={diam:.3f})")

Distance Metrics for Hierarchical Clustering

The linkage criterion determines how cluster distances are computed, but the underlying pairwise distance metric is equally important. Different metrics capture different notions of similarity.

Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score
from sklearn.datasets import load_wine


def compare_distance_metrics(dataset_name="Wine"):
    """
    Compare different distance metrics for hierarchical clustering.

    Euclidean: standard L2 distance (scale-sensitive)
    Manhattan: L1 distance (more robust to outliers)
    Cosine:    angle between vectors (scale-invariant, good for text/embeddings)
    Correlation: 1 - Pearson correlation (captures pattern similarity, not magnitude)
    """
    wine = load_wine()
    X, y = wine.data, wine.target
    k = 3

    # Note: Ward linkage only works with Euclidean distance.
    # Other metrics use complete or average linkage.
    configs = [
        ('euclidean', 'ward',     'Ward + Euclidean\n(default, scale-sensitive)'),
        ('euclidean', 'complete', 'Complete + Euclidean'),
        ('cityblock', 'complete', 'Complete + Manhattan\n(robust to outliers)'),
        ('cosine',    'complete', 'Complete + Cosine\n(direction, not magnitude)'),
        ('correlation','complete','Complete + Correlation\n(pattern similarity)'),
    ]

    print(f"=== Distance Metrics: {dataset_name} ===\n")
    print(f"  {'Config':<35} | {'ARI':>8} | {'Silhouette':>11}")
    print(f"  {'-'*58}")

    X_s = StandardScaler().fit_transform(X)

    for metric, method, desc in configs:
        Z = linkage(X_s, method=method, metric=metric)
        labels = fcluster(Z, t=k, criterion='maxclust') - 1
        ari = adjusted_rand_score(y, labels)
        from sklearn.metrics import silhouette_score
        sil = silhouette_score(X_s, labels)
        print(f"  {desc:<35} | {ari:>8.4f} | {sil:>11.4f}")

    print(f"\n  Recommendation: Always scale first.")
    print(f"  Correlation distance is powerful for gene expression data")
    print(f"  where relative patterns matter more than absolute magnitudes.")


compare_distance_metrics()

Connectivity Constraints: Structure-Aware Clustering

One of hierarchical clustering’s unique features in scikit-learn is the ability to enforce connectivity constraints — limiting which clusters can be merged to only those that are spatially adjacent according to a graph structure. This is powerful for image segmentation and spatial data where only adjacent regions should merge.

Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_swiss_roll


def connectivity_constrained_clustering(figsize=(14, 6)):
    """
    Demonstrate connectivity constraints in AgglomerativeClustering.

    Without constraints: any two clusters can merge regardless of spatial proximity.
    With constraints: only clusters connected by the k-NN graph can merge.

    This is critical for manifold data (like the Swiss Roll) where
    globally nearby points may be far apart along the manifold.
    """
    np.random.seed(42)
    n = 500
    X, t = make_swiss_roll(n_samples=n, noise=0.5, random_state=42)

    # Use only the 2D structure-revealing slice
    X_2d = X[:, [0, 2]]
    X_s  = StandardScaler().fit_transform(X_2d)

    k_clusters = 6

    # Without connectivity constraint
    hc_free = AgglomerativeClustering(n_clusters=k_clusters, linkage='ward')
    labels_free = hc_free.fit_predict(X_s)

    # With connectivity constraint (k-NN graph)
    connectivity = kneighbors_graph(X_s, n_neighbors=10, include_self=False)
    hc_connected = AgglomerativeClustering(
        n_clusters=k_clusters, linkage='ward',
        connectivity=connectivity
    )
    labels_connected = hc_connected.fit_predict(X_s)

    fig, axes = plt.subplots(1, 2, figsize=figsize)
    colors = plt.cm.tab10(np.linspace(0, 0.9, k_clusters))

    for ax, labels, title in [
        (axes[0], labels_free,
         f'Without Connectivity\n(Ward, k={k_clusters} — can merge non-adjacent)'),
        (axes[1], labels_connected,
         f'With k-NN Connectivity (k_nn=10)\n(Ward, k={k_clusters} — respects adjacency)'),
    ]:
        for cls, color in enumerate(colors):
            mask = labels == cls
            ax.scatter(X_s[mask, 0], X_s[mask, 1], c=[color],
                       s=25, edgecolors='white', linewidth=0.3, alpha=0.85)
        ax.set_title(title, fontsize=10, fontweight='bold')
        ax.set_xlabel('Feature 1', fontsize=9)
        ax.set_ylabel('Feature 2', fontsize=9)
        ax.grid(True, alpha=0.2)

    plt.suptitle('Connectivity Constraints: Structure-Aware Hierarchical Clustering\n'
                 'Constraints prevent non-adjacent clusters from merging',
                 fontsize=12, fontweight='bold')
    plt.tight_layout()
    plt.savefig('connectivity_constrained_clustering.png', dpi=150)
    plt.show()
    print("Saved: connectivity_constrained_clustering.png")

    print(f"\n  Connectivity-constrained clustering:")
    print(f"  - Prevents merging spatially non-adjacent clusters")
    print(f"  - Critical for image segmentation (merge only adjacent pixels)")
    print(f"  - Useful for geographic data (merge only neighboring regions)")
    print(f"  - k_neighbors controls the neighborhood size")
    print(f"    Larger k = looser constraint (more merges allowed)")
    print(f"    Smaller k = tighter constraint (only very nearby merges)")


connectivity_constrained_clustering()

Summary

Hierarchical clustering builds a complete binary tree of nested cluster relationships called a dendrogram, recording every merge from n singleton clusters (agglomerative) or every split from one global cluster (divisive) until the full hierarchy is assembled. Unlike K-Means, it requires no advance specification of k; the same dendrogram supports any level of granularity by cutting at different heights.

The dendrogram encodes dissimilarity as height: points that are similar merge early at small heights; dissimilar groups merge late at large heights. Long vertical lines indicate well-separated groups — natural breakpoints for cutting. The cophenetic correlation coefficient (ideally > 0.8) measures how faithfully the dendrogram preserves the true pairwise distances.

Linkage criteria are the most important hyperparameter: Ward linkage (minimizing WCSS increase) produces compact spherical clusters and is the default for most tabular data. Single linkage enables manifold-shaped clusters through its chaining property but is fragile to noise. Complete linkage creates compact, equal-size clusters but is sensitive to outliers. Average linkage balances both extremes.

Cutting the dendrogram converts the hierarchy back to a flat clustering. The largest-gap heuristic (cut where consecutive merge heights jump the most) identifies the most natural k. The silhouette score provides a complementary quality signal — evaluate it for k = 2 through k_max and pick the maximum.

Distance metrics interact with linkage criteria. Ward linkage requires Euclidean distance. Correlation distance (1 − Pearson r) captures pattern similarity regardless of magnitude — the standard choice for gene expression data. Cosine distance works well for text embeddings and other direction-sensitive data.

Connectivity constraints restrict which clusters can merge to those connected by a k-NN graph, enabling structure-aware clustering for image segmentation, spatial data, and manifold data where Euclidean distance misrepresents true similarity.

Computational limits: O(n²) memory and O(n² log n) time make hierarchical clustering impractical for n > 10,000. For large datasets, use K-Means for speed. Hierarchical clustering’s lack of a predict() method (cannot generalize to new data without refitting) further limits production deployment.

Share:
Subscribe
Notify of
0 Comments
Inline Feedbacks
View all comments

Discover More

Condition Variables: Thread Communication

Condition Variables: Thread Communication

Master C++ condition variables — learn how threads wait and notify each other, implement producer-consumer…

Setting Up VS Code for Data Science

Setting Up VS Code for Data Science

Learn how to set up VS Code for data science. Install essential extensions, configure Python…

Robotics Revolution: NVIDIA’s GR00T Brings Human-Like Reasoning to Bots

NVIDIA dropped Isaac GR00T on September 29, a foundation model infusing robots with “humanlike” reasoning,…

How Operating Systems Manage Network Connections

How Operating Systems Manage Network Connections

Discover how operating systems manage network connections, from establishing links to data transmission. Complete guide…

Understanding the Difference Between System Software and Application Software

Understanding the Difference Between System Software and Application Software

Learn the key differences between system software and application software, how they interact, and why…

Clustering Techniques: An Introduction to K-Means

Learn K-Means clustering, from basics to advanced variations. Master clustering techniques for better data analysis…

Click For More
0
Would love your thoughts, please comment.x
()
x