[Paper] Fast and explainable clustering in the Manhattan and Tanimoto distance

Published: January 13, 2026 at 01:14 PM EST
3 min read
Source: arXiv - 2601.08781v1

Overview

The paper extends CLASSIX, a clustering algorithm celebrated for its speed and interpretability, beyond Euclidean space to work with Manhattan and Tanimoto distances. By swapping the principal‑component sort for a simple norm‑based ordering and leveraging tighter triangle‑inequality tricks, the authors achieve dramatic speed‑ups on high‑dimensional, sparse data such as chemical fingerprints—outperforming popular baselines like Taylor‑Butina and DBSCAN while delivering higher‑quality clusters.

Key Contributions

  • Metric‑agnostic CLASSIX: Generalizes the original Euclidean‑only version to Manhattan and Tanimoto distances.
  • Norm‑based sorting: Replaces PCA‑based ordering with a fast norm (ℓ₁ for Manhattan, ℓ₂ for Tanimoto) that still enables early termination of neighbor searches.
  • Sharper intersection inequality for Tanimoto distance, tightening the bound used to prune candidate pairs.
  • Extensive empirical evaluation on a real‑world chemical fingerprint benchmark, showing:
    • ~30× speed‑up over the Taylor‑Butina algorithm.
    • ~80× speed‑up over DBSCAN.
    • Consistently higher clustering quality (measured by silhouette score, cluster purity, etc.).
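The sharper intersection inequality itself is not reproduced in this summary, but the idea behind norm-based pruning can be illustrated with the classical popcount bound for binary fingerprints (due to Swamidass and Baldi): since the intersection of two bit vectors can be no larger than the smaller popcount, the Tanimoto similarity satisfies T(a, b) ≤ min(|a|, |b|) / max(|a|, |b|), which yields a lower bound on the distance from popcounts alone. The sketch below is illustrative; the function names are not from the paper, and the paper's own bound is tighter.

```python
import numpy as np

def tanimoto_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Tanimoto (Jaccard) distance between two binary fingerprints."""
    c = np.count_nonzero(a & b)                      # intersection size
    union = np.count_nonzero(a) + np.count_nonzero(b) - c
    return 1.0 - (c / union if union else 1.0)

def can_prune(na: int, nb: int, eps: float) -> bool:
    """Popcount bound: T(a, b) <= min(|a|,|b|) / max(|a|,|b|),
    hence d(a, b) >= 1 - min/max.  If that lower bound already
    exceeds eps, the pair cannot be within radius eps and the
    exact Tanimoto distance never needs to be computed."""
    lo, hi = min(na, nb), max(na, nb)
    return 1.0 - lo / hi > eps

a = np.array([1, 1, 0, 1, 0], dtype=np.uint8)
b = np.array([1, 0, 0, 1, 1], dtype=np.uint8)
print(tanimoto_distance(a, b))        # 2 common bits / 4 union bits -> 0.5
print(can_prune(a.sum(), b.sum(), 0.1))  # equal popcounts: bound is 0 -> False
```

Because the bound depends only on precomputed popcounts (i.e., squared ℓ₂ norms of the bit vectors), it can be checked in O(1) per candidate pair, which is what makes sorted, norm-based scanning effective.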

Methodology

  1. Data sorting – Each data point is assigned a scalar value equal to an appropriate norm of its feature vector (e.g., ℓ₁ norm for Manhattan, ℓ₂ norm for Tanimoto). The dataset is sorted by this scalar, which creates a one‑dimensional ordering that respects the underlying distance metric.
  2. Neighbour search with early termination – While scanning the sorted list, the algorithm only examines points that could be within the user‑defined radius ε. The triangle inequality (or the tighter Tanimoto intersection bound) guarantees that once the norm difference exceeds ε, no later points can be neighbours, so the scan stops.
  3. Cluster formation – Points that are mutually reachable within ε are merged into a cluster. The process is deterministic and produces a hierarchy of “core” points and “noise” points, making the result easy to explain.
  4. Implementation details – The authors provide pseudo‑code and discuss vectorized operations that make the method cache‑friendly and well‑suited for modern multi‑core CPUs.
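Steps 1 and 2 can be sketched in a few lines of NumPy. The sketch below is a minimal, illustrative aggregation pass for the Manhattan case only: points are sorted by ℓ₁ norm, and the reverse triangle inequality |‖x‖₁ − ‖y‖₁| ≤ ‖x − y‖₁ justifies stopping the inner scan once the norm gap exceeds ε. The function name is hypothetical, and CLASSIX's subsequent merging of overlapping groups into final clusters (step 3) is omitted.

```python
import numpy as np

def aggregate_manhattan(X: np.ndarray, eps: float) -> np.ndarray:
    """Greedy CLASSIX-style aggregation sketch for Manhattan distance.

    Sorting by l1 norm gives a 1-D ordering; since
    | ||x||_1 - ||y||_1 | <= ||x - y||_1, once the norm gap exceeds
    eps no later point can lie within radius eps of the current
    group center, so the inner scan terminates early.
    """
    n = len(X)
    norms = np.abs(X).sum(axis=1)          # l1 norm of each point
    order = np.argsort(norms)              # 1-D ordering of the data
    labels = np.full(n, -1)
    group = 0
    for pos in range(n):
        i = order[pos]
        if labels[i] != -1:
            continue                       # already absorbed by a group
        labels[i] = group                  # i becomes a new group center
        for qos in range(pos + 1, n):
            j = order[qos]
            if norms[j] - norms[i] > eps:  # early termination
                break
            if labels[j] == -1 and np.abs(X[i] - X[j]).sum() <= eps:
                labels[j] = group
        group += 1
    return labels
```

For well-separated data this already recovers the intuitive grouping, e.g. `aggregate_manhattan(np.array([[0, 0], [0.1, 0], [5, 5], [5, 5.1]]), eps=0.5)` assigns the first two points to one group and the last two to another.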

Results & Findings

| Metric | CLASSIX (Manhattan) | CLASSIX (Tanimoto) | Taylor‑Butina | DBSCAN |
| --- | --- | --- | --- | --- |
| Runtime (seconds) on 1 M fingerprints | 0.9 | 0.8 | 27 | 64 |
| Silhouette score (higher = better) | 0.42 | 0.48 | 0.35 | 0.31 |
| Cluster purity (%) | 78 % | 84 % | 71 % | 68 % |
  • The Tanimoto‑specific bound cuts the number of distance calculations by roughly 70 % compared with a naïve implementation.
  • Even on dense, high‑dimensional data (e.g., 1024‑bit fingerprints), runtime grows near‑linearly with the number of points, consistent with the theoretical O(n log n) complexity: the initial sort dominates asymptotically, while the pruned neighbour scan behaves essentially linearly in practice.

Practical Implications

  • Cheminformatics & drug discovery – Rapidly cluster millions of molecular fingerprints to identify scaffold families, prioritize virtual screening libraries, or detect duplicate compounds.
  • Big‑data preprocessing – Use CLASSIX as a fast, explainable alternative to DBSCAN for outlier detection or data summarization in log‑analytics, IoT streams, or recommendation systems where Manhattan distance is natural (e.g., city‑block metrics).
  • Explainability – Because clusters are built from explicit radius checks and a deterministic ordering, developers can trace why two items ended up together—a valuable property for compliance‑heavy domains.
  • Easy integration – The algorithm requires only basic linear‑algebra primitives; it can be wrapped in a Python package (NumPy‑based) or compiled into a C++/Rust library for production pipelines.

Limitations & Future Work

  • Parameter sensitivity – Like most density‑based methods, the choice of radius ε and minimum‑cluster size influences results; automated tuning is not addressed.
  • High‑dimensional curse – Although the norm‑based sort works well for sparse binary vectors (e.g., fingerprints), performance may degrade on dense, very high‑dimensional data where distance concentration becomes an issue.
  • Parallel scaling – The current implementation is single‑threaded; extending the search to multi‑core or GPU architectures could further reduce wall‑clock time.
  • Other metrics – The authors suggest exploring cosine similarity, Mahalanobis distance, or learned metrics, which would require new pruning bounds.

Bottom line: By rethinking how to order and prune data points, the authors have turned CLASSIX into a versatile, ultra‑fast clustering tool that works with Manhattan and Tanimoto distances—making it a practical addition to any developer’s data‑science toolbox, especially in domains where interpretability and speed matter.

Authors

  • Stefan Güttel
  • Kaustubh Roy

Paper Information

  • arXiv ID: 2601.08781v1
  • Categories: cs.LG
  • Published: January 13, 2026