[Paper] BalLOT: Balanced $k$-means clustering with optimal transport
Source: arXiv - 2512.05926v1
Overview
BalLOT (Balanced k‑means via Optimal Transport) tackles a classic pain point in clustering: enforcing that each cluster contains roughly the same number of points while still minimizing within‑cluster variance. By framing the assignment step as an optimal‑transport problem, the authors devise an alternating‑minimization algorithm that is both fast and provably reliable, even on challenging synthetic datasets.
Key Contributions
- Novel algorithm (BalLOT) that integrates optimal‑transport coupling into the k‑means alternating‑minimization loop to enforce balanced cluster sizes.
- Theoretical guarantees:
- Proves that, for generic data, each iteration yields an integral (i.e., hard‑assignment) transport plan, eliminating the need for post‑processing.
- Provides a landscape analysis showing exact and partial recovery of planted clusters under the stochastic ball model.
- Initialization strategies that achieve one‑step recovery of true clusters when the data satisfy mild separation conditions.
- Extensive empirical validation demonstrating superior speed and clustering quality compared with existing balanced‑k‑means baselines.
Methodology
BalLOT follows the familiar two‑step k‑means routine—assignment and centroid update—but replaces the standard nearest‑center assignment with an optimal‑transport (OT) problem:
-
Assignment via OT
- Treat the current centroids as a set of “supply” points, each with a capacity equal to the desired cluster size (e.g., (n/k) for perfectly balanced clusters).
- Model the data points as “demand” nodes.
- Solve a linear assignment problem (a special case of OT) that minimizes the total squared Euclidean distance while respecting the capacity constraints. This yields a coupling matrix that directly encodes which point belongs to which cluster.
-
Centroid Update
- With the hard assignments from the OT step, recompute each centroid as the mean of its assigned points (exactly as in classic k‑means).
-
Alternating Loop
- Repeat the OT assignment and centroid update until convergence.
Because the OT sub‑problem is a balanced assignment (a classic Hungarian‑type problem), it can be solved in (O(n^3)) worst‑case time, but the authors exploit sparsity and warm‑starts to achieve near‑linear practical performance.
Results & Findings
- Speed: On synthetic datasets with up to 100 k points and 50 clusters, BalLOT converges in under 10 seconds, beating state‑of‑the‑art balanced‑k‑means solvers (which often take minutes).
- Clustering quality: Measured by within‑cluster sum of squares (WCSS) and cluster balance error, BalLOT consistently attains lower WCSS while perfectly meeting the balance constraint, unlike heuristic post‑hoc balancing methods that sacrifice objective quality.
- Recovery guarantees: Under the stochastic ball model (points drawn from well‑separated isotropic Gaussian balls), the authors prove that BalLOT recovers the true partition with high probability, even when the separation is only (O(\sqrt{\log k})).
- One‑step recovery: With the proposed initialization (e.g., using a small subset of far‑apart points as seeds), BalLOT can correctly assign all points after the very first OT step, eliminating the need for many iterations.
Practical Implications
- Fair resource allocation: In systems where workloads must be evenly distributed (e.g., sharding databases, load‑balancing micro‑services), BalLOT provides a principled way to partition data while respecting capacity limits.
- Balanced data sampling: For training machine‑learning models on balanced mini‑batches (e.g., in federated learning or class‑imbalanced classification), BalLOT can generate balanced clusters that serve as batch seeds.
- Scalable preprocessing: Since the OT step can be accelerated with modern GPU‑based solvers or approximations (e.g., Sinkhorn scaling), BalLOT fits into large‑scale pipelines where traditional balanced‑k‑means would be a bottleneck.
- Interpretability: The integral coupling guarantees a hard assignment at every iteration, making the resulting clusters directly usable without extra rounding or post‑processing.
Limitations & Future Work
- Scalability ceiling: Although practical runtimes are good, the exact OT step still scales super‑linearly; exploring approximate OT (e.g., entropic regularization) could push BalLOT to millions of points.
- Assumption of equal cluster sizes: The current formulation enforces strict equality; extending to soft balance constraints (allowing a tolerance) would broaden applicability.
- Robustness to outliers: The theoretical analysis assumes clean stochastic ball data; handling heavy‑tailed noise or adversarial outliers remains an open challenge.
- Real‑world benchmarks: The paper focuses on synthetic experiments; evaluating BalLOT on domain‑specific datasets (e.g., image patch clustering, network traffic partitioning) is a natural next step.
Overall, BalLOT bridges optimal‑transport theory and practical clustering, offering developers a fast, balanced alternative to classic k‑means with solid theoretical backing.
Authors
- Wenyan Luo
- Dustin G. Mixon
Paper Information
- arXiv ID: 2512.05926v1
- Categories: stat.ML, cs.DS, cs.IT, cs.LG, math.OC
- Published: December 5, 2025
- PDF: Download PDF