[Paper] ECO: Quantized Training without Full-Precision Master Weights
Source: arXiv - 2601.22101v1
Overview
The paper introduces ECO (Error‑Compensating Optimizer), a new training technique that completely removes the high‑precision “master‑weight” buffer traditionally required for quantized deep‑learning training. By feeding quantization error back into the optimizer’s momentum, ECO lets models—especially massive Sparse Mixture‑of‑Experts (SMoE) LLMs—train with dramatically lower memory footprints while preserving near‑full‑precision accuracy.
Key Contributions
- Master‑weight elimination: Shows how to update quantized parameters directly, cutting the extra full‑precision weight copy that can dominate GPU memory.
- Error‑feedback mechanism: Injects the quantization error after each step into the optimizer’s momentum, creating a self‑correcting loop without extra storage.
- Theoretical guarantee: Proves convergence to a constant‑radius neighborhood of the optimum under standard smoothness assumptions and a decaying learning rate, contrasting with naïve removal that can diverge.
- Broad empirical validation: Demonstrates ECO on a spectrum of models—from 30 M to 2.1 B parameters, including a 1 B Gemma‑3 and a 16 B DeepSeek‑MoE fine‑tune—using FP8 and INT4 quantization.
- Pareto‑front shift: Achieves near‑lossless validation loss while reducing static memory by roughly 2–3×, shifting the memory‑vs‑accuracy trade‑off frontier.
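To make the quantization error concrete, here is a minimal sketch of symmetric INT4 quantization; the weights and scale are made up for illustration, and the per‑element error computed at the end is the quantity ECO feeds back into momentum.

```python
import numpy as np

def quantize_int4(w, scale):
    """Symmetric INT4 quantization: round to integer levels in [-8, 7]."""
    q = np.clip(np.round(w / scale), -8, 7)
    return q * scale  # dequantized representation

# Hypothetical weights; the quantization error is what ECO feeds back.
w = np.array([0.31, -0.07, 0.52, -0.44])
scale = 0.1
w_q = quantize_int4(w, scale)
error = w - w_q  # lost precision, bounded by scale / 2 per element (inside the clip range)
print(np.abs(error).max() <= scale / 2)  # → True
```

A plain master‑weight pipeline would keep `w` around in full precision; ECO's premise is that only `w_q` and the error signal are needed.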
Methodology
- Quantized weight representation: After each optimizer step, weights are quantized (e.g., FP8 for pre‑training, INT4 for fine‑tuning).
- Error computation: The quantization error is the difference between the transient high‑precision update (computed but never stored) and its quantized result.
- Momentum injection: Instead of discarding this error, ECO adds it to the optimizer’s momentum term (e.g., Adam’s first‑moment estimate). This “error‑compensating” step ensures that the lost precision is gradually recovered in subsequent updates.
- No extra buffers: All operations happen in‑place on the quantized tensor and the existing optimizer state; no separate master‑weight copy is allocated.
- Learning‑rate schedule: A standard decaying schedule is used, which is essential for the convergence proof.
The approach works with any optimizer that maintains a momentum‑like state (SGD‑momentum, Adam, RMSProp, etc.), making it a drop‑in replacement for existing training pipelines.
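The steps above can be sketched with SGD‑momentum, the simplest momentum‑like state; the function below and the exact form of the feedback term (including its `1/lr` scaling) are illustrative assumptions, not the paper's precise update rule.

```python
import numpy as np

def quantize(w, scale):
    # Stand-in for a low-precision format: symmetric rounding to a grid.
    return np.round(w / scale) * scale

def eco_step(w_q, grad, momentum, lr=0.1, beta=0.9, scale=0.04):
    """One ECO-style update: no full-precision master copy is ever stored."""
    momentum = beta * momentum + grad       # standard momentum accumulation
    w_target = w_q - lr * momentum          # high-precision target (transient)
    w_new = quantize(w_target, scale)       # only the quantized weight persists
    error = w_target - w_new                # precision lost to quantization
    momentum = momentum - error / lr        # feed error back so later steps recover it
    return w_new, momentum
```

Because the compensation term lives inside the existing momentum buffer, no additional tensor is allocated, which is what the "no extra buffers" property above refers to.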
Results & Findings
| Model | Precision | Baseline loss increase (with master weights) | ECO loss increase (no master) | Memory reduction |
|---|---|---|---|---|
| 30 M Transformer (pre‑train) | FP8 | 0.12 % | 0.13 % | ~2× |
| 800 M Transformer | FP8 | 0.08 % | 0.09 % | ~2.2× |
| Gemma‑3 1 B | FP8 | 0.05 % | 0.06 % | ~2.5× |
| Sparse MoE 2.1 B | FP8 | 0.04 % | 0.05 % | ~3× |
| DeepSeek‑MoE 16 B (fine‑tune) | INT4 | 0.02 % | 0.03 % | ~2.8× |
- Accuracy: Across all experiments, ECO’s validation loss stays within 0.01–0.02 % of the master‑weight baseline—practically indistinguishable for most downstream tasks.
- Convergence: Training curves overlap almost perfectly, confirming the theoretical claim that ECO converges to the same neighborhood as full‑precision training.
- Memory vs. loss Pareto: Plotting static GPU memory against validation loss shows ECO’s curve dominating the baseline, meaning you can achieve the same loss with far less memory.
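A rough per‑parameter byte count suggests where the savings come from. The format choices below (FP8 weights, 8‑bit Adam moments) are assumptions for illustration, not the paper's exact configuration:

```python
def bytes_per_param(weight_b, master_b, opt_state_b):
    """Static training memory per parameter: weight + master copy + optimizer state."""
    return weight_b + master_b + opt_state_b

# Assumed scenario: FP8 weight (1 B), FP32 master copy (4 B), 8-bit Adam moments (2 B).
baseline = bytes_per_param(1, 4, 2)  # 7 bytes/param with master weights
eco = bytes_per_param(1, 0, 2)       # 3 bytes/param without
print(round(baseline / eco, 2))      # → 2.33
```

With FP32 optimizer states instead, the master copy is a smaller fraction of the total and the ratio shrinks, which is one reason the reported 2–3× figures depend on the rest of the training setup.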
Practical Implications
- Larger models on existing hardware: Developers can fit models that previously required multi‑GPU setups onto a single GPU or a smaller cluster, accelerating experimentation cycles.
- Cost savings: Reduced memory translates directly into lower cloud GPU costs, especially for long pre‑training runs of MoE models where optimizer states dominate memory usage.
- Simplified pipelines: Removing the master‑weight copy eliminates a source of bugs and bookkeeping; existing training scripts need only swap the optimizer for ECO.
- Edge‑AI and on‑device fine‑tuning: The ability to train with INT4 precision opens the door to on‑device adaptation of large language models without sacrificing accuracy.
- Future hardware alignment: As GPUs and TPUs add native low‑precision arithmetic (FP8, INT4), ECO’s error‑feedback loop can be implemented directly in hardware, further cutting latency.
Limitations & Future Work
- Learning‑rate dependence: The convergence proof assumes a decaying learning rate; aggressive constant learning rates may degrade the error‑feedback effect.
- Optimizer compatibility: While ECO works with momentum‑based optimizers, it has not been evaluated with methods whose state departs from a standard momentum buffer, such as Lion's sign‑based update or AdaFactor's factored second moments.
- Extreme quantization: The paper focuses on FP8 and INT4; pushing to binary or ternary quantization would likely require additional error‑compensation strategies.
- Dynamic memory profiling: The current analysis reports static memory savings; real‑world training pipelines with mixed‑precision kernels may exhibit different runtime memory behavior.
- Broader task coverage: Experiments are limited to language modeling and fine‑tuning; applying ECO to vision, speech, or reinforcement‑learning domains remains an open avenue.
ECO demonstrates that the full‑precision master‑weight copy is not a hard requirement for quantized LLM training, offering developers a practical path to training bigger models on less hardware. As native low‑precision hardware support matures, techniques like ECO can become a cornerstone of cost‑effective AI development.
Authors
- Mahdi Nikdan
- Amir Zandieh
- Dan Alistarh
- Vahab Mirrokni
Paper Information
- arXiv ID: 2601.22101v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: January 29, 2026