[Paper] Only relative ranks matter in weight-clustered large language models
Source: arXiv - 2603.17917v1
Overview
The paper Only relative ranks matter in weight‑clustered large language models shows that, for massive language models, the exact numeric values of individual weights are far less important than the relative ordering (rank) of those weights. By clustering each weight matrix into a handful of shared values, the authors compress models like Llama 3.1‑8B and SmolLM2‑135M to just 16–64 distinct weight levels without any retraining, while preserving most of the original accuracy.
Key Contributions
- Weight‑clustering compression: Replaces every weight matrix with K centroids obtained via K‑means, reducing unique values to 16–64 per layer.
- Training‑free compression: Demonstrates that this aggressive quantization works out‑of‑the‑box, requiring no costly fine‑tuning.
- Fine‑tuning centroids only: Shows that updating just the cluster means (the centroids) recovers 30‑40 % of the remaining accuracy loss at negligible compute cost.
- Rank‑vs‑magnitude analysis: Systematically randomizes cluster means while keeping assignments fixed, revealing that scrambling rank order catastrophically harms perplexity, whereas preserving rank leaves performance almost unchanged.
- Layer‑wise drift study: Identifies scale drift (global scaling changes) as the main cause of collapse when many layers are perturbed together, and proposes a simple affine correction (w′ = aw + b, with a > 0) that preserves rank and mitigates drift.
- New perspective on robustness: Positions relative weight ranking as a core invariant for both compression and model stability, opening avenues for rank‑preserving regularization and diagnostics.
Methodology
- Weight clustering – For each linear layer, the authors run K‑means on the raw weight values and replace every entry with the nearest centroid. The number of centroids K is set to a small constant (16–64).
- Zero‑shot evaluation – The clustered model is evaluated on standard language‑model benchmarks (e.g., perplexity on WikiText‑103) without any additional training.
- Centroid fine‑tuning – Only the K centroid values are treated as learnable parameters and updated for a few epochs, leaving the assignment map untouched.
- Randomization experiments – two controlled perturbations of the centroid values, both constructed to keep global statistics (mean, variance) constant:
  - Rank‑preserving: Resample centroid values at random, then reorder the draws so each centroid keeps its original rank.
  - Rank‑scrambling: Randomly shuffle centroid values across rank positions, destroying the original ordering.
- Progressive layer replacement – Layers are swapped one‑by‑one from original to clustered, measuring how error accumulates and whether scale drift or rank distortion dominates.
- Affine correction – After clustering, an optional linear transform (scale a > 0, shift b) is applied to each layer to re‑align the overall distribution while preserving rank.
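The steps above can be sketched end‑to‑end in numpy. This is a minimal illustration under stated assumptions — the K‑means initialization, iteration count, resampling distribution, and affine‑fit procedure are our choices, not details from the paper:

```python
import numpy as np

def cluster_weights(W, K=32, iters=25):
    """Quantize a weight matrix to K shared values via 1-D K-means."""
    flat = W.ravel()
    # Initialize centroids at quantiles so they span the weight range.
    centroids = np.quantile(flat, np.linspace(0.0, 1.0, K))
    for _ in range(iters):
        assign = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(K):
            members = flat[assign == k]
            if members.size:  # leave empty clusters where they are
                centroids[k] = members.mean()
    assign = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, assign.reshape(W.shape)  # W_clustered = centroids[assign]

def randomize_centroids(centroids, preserve_rank, seed=0):
    """Resample centroid values with matched mean/std, with or without rank."""
    rng = np.random.default_rng(seed)
    new = rng.normal(centroids.mean(), centroids.std(), centroids.shape)
    if preserve_rank:
        out = np.empty_like(new)
        out[np.argsort(centroids)] = np.sort(new)  # sorted draws -> rank slots
        return out
    return new  # unsorted: original rank order destroyed

def affine_correct(W_orig, centroids, assign):
    """Least-squares fit of w' = a*w + b; clamping a > 0 keeps rank intact."""
    a, b = np.polyfit(centroids[assign].ravel(), W_orig.ravel(), 1)
    return max(a, 1e-8) * centroids + b  # corrected codebook
```

Replacing a layer's weights with `centroids[assign]` reproduces the clustered model; calling `randomize_centroids` with `preserve_rank=False` is the perturbation the paper shows to be catastrophic.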
Results & Findings
| Model | K (centroids) | Zero‑shot perplexity Δ | After centroid‑only fine‑tune Δ | Rank‑preserving shuffle Δ | Rank‑scrambling Δ |
|---|---|---|---|---|---|
| Llama 3.1‑8B | 32 | +3 % (near‑negligible) | recovers ≈30 % of the gap | ≈0 % (no impact) | ×10–100 increase (orders of magnitude) |
| SmolLM2‑135M | 16 | +5 % | recovers ≈35 % of the gap | ≈0 % | ×50–200 increase |
Δ denotes change in perplexity relative to the original uncompressed model.
- Compression works: Even with only 16 distinct weight levels, the models retain most of their predictive power.
- Centroid fine‑tuning is cheap: Updating just a few dozen numbers per layer (the centroids) yields a sizable boost, requiring far less GPU time than full‑model fine‑tuning.
- Rank matters: Destroying the ordering of clusters blows up perplexity, confirming that the model relies on which connections are stronger rather than their exact magnitudes.
- Scale drift: When many layers are simultaneously altered, the overall scale of weights drifts, leading to performance collapse. An affine correction that preserves positive scaling (a > 0) dramatically postpones this collapse.
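The "centroid fine‑tuning is cheap" finding follows directly from the chain rule: because every weight is a copy of its centroid, the gradient with respect to centroid k is simply the sum of per‑weight gradients over all positions assigned to k. A minimal sketch (function names are ours, not the paper's):

```python
import numpy as np

def centroid_gradient(dL_dW, assign, K):
    """Chain rule for W = centroids[assign]:
    dL/dc_k = sum of dL/dW_ij over all positions assigned to cluster k."""
    g = np.zeros(K)
    np.add.at(g, assign.ravel(), dL_dW.ravel())  # scatter-add per cluster
    return g

def sgd_step(centroids, dL_dW, assign, lr=1e-3):
    """Update only the K centroids; the assignment map stays frozen."""
    return centroids - lr * centroid_gradient(dL_dW, assign, len(centroids))
```

Only K numbers per matrix are updated, which is why this recovers part of the accuracy gap at negligible compute cost compared with full fine‑tuning.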
Practical Implications
- Disk‑space savings: Deploying LLMs on edge devices or in containerized services becomes feasible; an 8B‑parameter model can be stored with a 10‑fold reduction in weight size.
- Fast model shipping: Teams can share compressed checkpoints without re‑training, accelerating collaboration and reproducibility.
- Low‑cost fine‑tuning: Updating only centroids enables rapid domain adaptation (e.g., instruction‑following tweaks) on modest hardware.
- Robustness diagnostics: Monitoring rank preservation during quantization or pruning can serve as a sanity check—if rank order changes, expect severe degradation.
- Hardware‑friendly inference: Fewer unique weight values translate to better cache locality and potential for custom integer‑only kernels, improving latency on CPUs/GPUs.
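The rank‑preservation sanity check suggested above can be implemented as a Spearman correlation between a layer's original and compressed weights. The metric choice is our suggestion, not a procedure from the paper:

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors.
    A value near 1 means the relative ordering of weights survived."""
    rx = np.argsort(np.argsort(x))  # rank of each element
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

def check_rank_preserved(W_orig, W_quant, threshold=0.95):
    """Flag layers whose quantization disturbed the weight ordering."""
    return spearman(W_orig.ravel(), W_quant.ravel()) >= threshold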
Limitations & Future Work
- Scope of models: Experiments focus on two models (8 B and 135 M parameters). Scaling the approach to 100 B‑plus models may reveal new challenges (e.g., memory bandwidth for centroid look‑ups).
- Task diversity: Evaluation is limited to language modeling perplexity; downstream tasks (code generation, reasoning) could be more sensitive to rank distortions.
- Dynamic rank changes: The study treats rank as static; future work could explore rank‑aware training objectives that explicitly preserve ordering during quantization or pruning.
- Hardware integration: Implementing efficient centroid‑lookup kernels across different accelerators remains an engineering hurdle.
Bottom line: By reframing weight compression as a rank‑preserving problem, the authors provide a simple, training‑free pathway to shrink LLMs while keeping them functional—an insight that could reshape how developers package, ship, and fine‑tune massive language models.
Authors
- Borja Aizpurua
- Sukhbinder Singh
- Román Orús
Paper Information
- arXiv ID: 2603.17917v1
- Categories: cs.LG, cs.CL
- Published: March 18, 2026
- PDF: Download PDF