[Paper] Chamfer-Linkage for Hierarchical Agglomerative Clustering
Source: arXiv - 2602.10444v1
Overview
Hierarchical Agglomerative Clustering (HAC) is a workhorse for grouping data points, but its performance hinges on the linkage function that decides which clusters to merge. The paper introduces Chamfer‑linkage, a linkage based on the Chamfer distance (a staple in point‑cloud processing). The authors show that this simple change yields more reliable clusterings across a wide range of real‑world datasets while keeping the classic O(n²) runtime.
Key Contributions
- Chamfer‑linkage definition – adapts the Chamfer distance to measure inter‑cluster similarity (written out after this list), preserving fine‑grained geometric information that classic linkages discard.
- Theoretical guarantees – proves that HAC with Chamfer‑linkage can be implemented in the same quadratic time as traditional linkages, requiring no exotic data structures.
- Desirable representation properties – demonstrates that Chamfer‑linkage satisfies several “concept representation” axioms (e.g., monotonicity, stability under outliers) that many classic linkages violate.
- Extensive empirical evaluation – benchmarks Chamfer‑linkage against single‑, average‑, complete‑, and Ward’s linkages on dozens of datasets from image retrieval, text clustering, and network analysis, consistently achieving higher clustering quality (measured by silhouette score, adjusted Rand index, etc.).
- Drop‑in usability – provides a reference implementation compatible with popular Python libraries (scikit‑learn, scipy), making it trivial for practitioners to replace their existing linkage function.
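To make the definition concrete, here is the linkage written out as a sketch matching the description under Methodology below: the average nearest‑neighbor distance in each direction, summed.

```latex
% Chamfer linkage between clusters A and B: the average nearest-neighbor
% distance from A to B, plus the average nearest-neighbor distance from B to A.
\[
d_{\mathrm{Chamfer}}(A, B)
  = \frac{1}{|A|} \sum_{a \in A} \min_{b \in B} \lVert a - b \rVert
  + \frac{1}{|B|} \sum_{b \in B} \min_{a \in A} \lVert a - b \rVert
\]
```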
Methodology
- Chamfer distance for clusters – For two clusters A and B, compute the average nearest‑neighbor distance from points in A to B and vice‑versa, then sum the two averages (see the Python sketch after this list). This yields a symmetric, computationally cheap proxy for the true Earth‑Mover’s Distance.
- HAC algorithm – The standard agglomerative loop is retained: repeatedly merge the pair of clusters with the smallest Chamfer‑linkage distance. The authors show that by caching nearest‑neighbor sums, each merge can be updated in constant amortized time, leading to an overall O(n²) algorithm.
- Benchmark suite – The authors selected 30+ datasets spanning different modalities (image feature vectors, TF‑IDF text vectors, graph embeddings). For each dataset they ran HAC with five linkages, evaluated the dendrogram cut that maximizes the ground‑truth adjusted Rand index, and recorded runtime and memory usage.
- Statistical analysis – Paired t‑tests and Wilcoxon signed‑rank tests confirm that Chamfer‑linkage’s improvements are statistically significant across the board.
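To ground the first two bullets above, here is a minimal, unoptimized Python sketch: `chamfer_linkage` implements the symmetric average‑nearest‑neighbor distance exactly as described, and `chamfer_hac` plugs it into a naive agglomerative loop. Both names are illustrative, not taken from the paper's reference implementation, and the sketch omits the nearest‑neighbor‑sum caching that gives the paper its O(n²) bound.

```python
import numpy as np

def chamfer_linkage(A: np.ndarray, B: np.ndarray) -> float:
    """Chamfer linkage between two clusters (one point per row):
    mean nearest-neighbor distance from A to B, plus the same from B to A."""
    # Pairwise Euclidean distances: d[i, j] = ||A[i] - B[j]||.
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def chamfer_hac(X: np.ndarray, n_clusters: int) -> list[np.ndarray]:
    """Naive agglomerative loop: repeatedly merge the pair of clusters with
    the smallest Chamfer linkage. Recomputes every pair on each round, so it
    is far slower than the paper's cached O(n^2) algorithm."""
    clusters = [np.array([i]) for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = chamfer_linkage(X[clusters[i]], X[clusters[j]])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = np.concatenate([clusters[i], clusters[j]])
        del clusters[j]  # j > i, so clusters[i] is unaffected
    return clusters
```

Calling `chamfer_hac(X, n_clusters=3)` on an (n, d) feature matrix returns three index arrays partitioning the rows.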
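Likewise, the significance testing described above maps directly onto `scipy.stats`; a minimal sketch with placeholder per‑dataset scores (not the paper's numbers):

```python
from scipy import stats

# Adjusted Rand index per benchmark dataset for two linkages (placeholders).
ari_chamfer = [0.71, 0.64, 0.58, 0.80, 0.66]
ari_average = [0.65, 0.60, 0.55, 0.74, 0.63]

print(stats.ttest_rel(ari_chamfer, ari_average))  # paired t-test
print(stats.wilcoxon(ari_chamfer, ari_average))   # Wilcoxon signed-rank test
```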
Results & Findings
- Quality boost – Across the benchmark, Chamfer‑linkage improves the average adjusted Rand index by 7–12 % over average‑linkage and 5–9 % over Ward’s method.
- Robustness to noise – In synthetic experiments with injected outliers, Chamfer‑linkage’s dendrograms remain stable, whereas single‑linkage collapses into “chaining” artifacts.
- Runtime parity – The quadratic runtime matches that of classic linkages; on a 10,000‑point dataset, Chamfer‑linkage finishes in ~2.3 seconds vs. 2.1 seconds for average‑linkage, well within practical limits.
- Memory footprint – Only a modest increase (≈10 % extra) due to storing nearest‑neighbor sums, still comfortably fitting in standard workstation RAM for datasets up to 100,000 points.
Practical Implications
- Plug‑and‑play improvement – Developers can swap `linkage='average'` for `linkage='chamfer'` in existing scikit‑learn pipelines and immediately gain more faithful hierarchical groupings, especially for the high‑dimensional embeddings common in deep‑learning pipelines.
- Better downstream tasks – More accurate dendrogram cuts translate to higher‑quality label propagation, semi‑supervised learning, and anomaly detection, all of which rely on hierarchical structure.
- Computer‑vision pipelines – Since Chamfer distance is already used for point‑cloud alignment, integrating Chamfer‑linkage unifies clustering and shape‑matching components, simplifying codebases for 3‑D perception or LiDAR processing.
- Scalable to large‑scale data – The algorithm’s O(n²) bound is comparable to the best‑known HAC implementations; for truly massive datasets, developers can combine Chamfer‑linkage with standard approximation tricks (e.g., landmark‑based HAC, sketched below) without losing its core benefits.
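A hedged sketch of one such landmark trick, assuming the `chamfer_hac` helper from the Methodology sketch above is in scope: cluster a random subsample of landmarks hierarchically, then assign every remaining point to its nearest landmark's cluster.

```python
import numpy as np

def landmark_chamfer_hac(X: np.ndarray, n_clusters: int,
                         n_landmarks: int = 1000, seed: int = 0) -> np.ndarray:
    """Approximate Chamfer-linkage HAC: run the quadratic algorithm on a
    landmark subsample, then label each point by its nearest landmark's cluster."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n_landmarks, len(X)), replace=False)
    landmark_clusters = chamfer_hac(X[idx], n_clusters)  # sketch from above

    # Map each landmark (position within idx) to its cluster id.
    labels_of_landmark = np.empty(len(idx), dtype=int)
    for cid, members in enumerate(landmark_clusters):
        labels_of_landmark[members] = cid

    # Assign every point to the cluster of its nearest landmark.
    d = np.linalg.norm(X[:, None, :] - X[idx][None, :, :], axis=-1)
    return labels_of_landmark[d.argmin(axis=1)]
```

As written, the point‑to‑landmark distance matrix costs O(n·m) memory; batching or a spatial index would replace it in practice.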
Limitations & Future Work
- Quadratic scaling – While acceptable for many medium‑size problems, the method still struggles with millions of points; the authors suggest exploring hierarchical approximations or GPU‑accelerated nearest‑neighbor updates.
- Dependence on Euclidean geometry – Chamfer‑linkage assumes a metric space where nearest‑neighbor distances are meaningful; performance on highly sparse or categorical data may degrade.
- Parameter‑free but not universally optimal – The paper does not introduce tunable hyper‑parameters, yet certain domains (e.g., text with cosine similarity) might benefit from a weighted Chamfer variant.
- Theoretical depth – The current analysis focuses on representation properties; a formal approximation guarantee relative to a global clustering objective (e.g., minimizing within‑cluster variance) remains open.
Bottom line: Chamfer‑linkage offers a practical, higher‑quality alternative to classic HAC linkages with negligible engineering overhead—making it a compelling addition to any developer’s clustering toolbox.
Authors
- Kishen N Gowda
- Willem Fletcher
- MohammadHossein Bateni
- Laxman Dhulipala
- D Ellis Hershkowitz
- Rajesh Jayaram
- Jakub Łącki
Paper Information
- arXiv ID: 2602.10444v1
- Categories: cs.LG, cs.DC, cs.DS, cs.IR
- Published: February 11, 2026