[Paper] When Less is More: 8-bit Quantization Improves Continual Learning in Large Language Models

Published: December 21, 2025 at 07:51 PM EST
4 min read

Source: arXiv - 2512.18934v1

Overview

The paper investigates a surprising phenomenon: lower‑precision quantization (especially INT8) can actually improve continual learning in large language models (LLMs). By systematically testing FP16, INT8, and INT4 precision together with various replay‑buffer sizes, the authors show that quantized models retain prior knowledge better and even outperform the FP16 baseline on later tasks such as code generation.

Key Contributions

  • Empirical study of precision vs. continual learning: Benchmarks FP16, INT8, and INT4 on a sequence of NLU, Math, and Code tasks, revealing a consistent performance inversion after the first task.
  • Quantization as implicit regularizer: Proposes that the noise introduced by low‑bit quantization mitigates catastrophic forgetting by preventing over‑fitting to new‑task gradients (see the sketch after this list).
  • Replay‑buffer efficiency analysis: Demonstrates that tiny replay buffers (as low as 0.1 % of the training data) dramatically boost retention across all precisions, with quantized models needing less replay than FP16 to achieve similar or better results.
  • Practical deployment guidelines: Recommends INT8 as the sweet spot for balancing inference speed, memory footprint, and continual‑learning stability; suggests buffer sizes per task type (1‑2 % for NLU, 5‑10 % for Math/Code).
  • Open‑source reproducibility: Provides full training scripts and evaluation pipelines at the linked GitHub repository.
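
The noise‑as‑regularizer argument is easiest to see in code. The sketch below shows symmetric per‑tensor INT8 fake quantization with a straight‑through estimator; the function name and the straight‑through detail are illustrative assumptions, not the authors' implementation.

```python
import torch

def fake_quantize_int8(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor INT8 fake quantization (illustrative sketch).

    Weights are rounded onto 255 levels in [-127, 127] * scale; the rounding
    error is the "quantization noise" the paper argues acts as a regularizer.
    """
    scale = w.abs().max().clamp(min=1e-8) / 127.0           # per-tensor scale
    w_int = torch.clamp(torch.round(w / scale), -127, 127)  # quantize
    w_hat = w_int * scale                                    # dequantize
    # Straight-through estimator: the forward pass sees quantized weights,
    # gradients flow back as if no rounding had happened.
    return w + (w_hat - w).detach()

# The injected noise is bounded by half a quantization step.
w = torch.randn(4096, 4096)
noise = fake_quantize_int8(w) - w
print(noise.abs().max() <= w.abs().max() / 127 / 2 + 1e-6)  # True
```

Applying a transform like this to the weights during fine‑tuning is one simple way to emulate the low‑bit setting; the paper's claim is that exactly this kind of bounded perturbation keeps new‑task gradients from overwriting earlier representations.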

Methodology

  1. Model & Tasks – The authors fine‑tune a pre‑trained LLM (≈2‑3 B parameters) sequentially on three downstream tasks:

    • Natural Language Understanding (NLU) – classification style.
    • Mathematics problem solving (Math).
    • Code generation (Code).
  2. Precision Settings – For each task order, the same model is run under three numeric formats:

    • FP16 (standard half‑precision).
    • INT8 (8‑bit symmetric quantization).
    • INT4 (4‑bit quantization).
  3. Replay Buffers – A small subset of previously seen examples is stored and mixed into the training data of the current task. Buffer sizes are varied from 0 % (no replay) up to 10 % of the original dataset; a minimal version of this loop is sketched after this list.

  4. Evaluation – After each task, the model is evaluated on:

    • Forward accuracy on the just‑learned task.
    • Retention accuracy on all earlier tasks.
  5. Analysis – The authors compare accuracy curves, compute the “plasticity‑retention trade‑off,” and run ablation experiments to isolate the effect of quantization noise.
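
To make steps 3–4 concrete, here is a minimal sketch of the sequential fine‑tune/replay/evaluate loop; `finetune`, `evaluate`, and the 1 % default buffer fraction are placeholders, not the authors' actual pipeline.

```python
import random

def continual_finetune(model, tasks, finetune, evaluate, buffer_frac=0.01):
    """Sequential fine-tuning with a small replay buffer (illustrative sketch).

    `tasks` maps task names ("NLU", "Math", "Code") to lists of training
    examples; `finetune(model, examples)` and `evaluate(model, task_name)`
    stand in for the paper's training and evaluation pipelines.
    """
    replay_buffer = []   # tiny sample of examples from all earlier tasks
    accuracy = {}        # accuracy[(trained_through, evaluated_on)]

    for i, (name, data) in enumerate(tasks.items()):
        # Step 3: mix current-task data with stored exemplars from earlier tasks.
        finetune(model, data + replay_buffer)

        # Keep a small fraction (default 1%) of this task for future replay.
        k = max(1, int(buffer_frac * len(data)))
        replay_buffer.extend(random.sample(data, k))

        # Step 4: forward accuracy on the task just learned,
        # retention accuracy on every earlier task.
        for prev in list(tasks)[: i + 1]:
            accuracy[(name, prev)] = evaluate(model, prev)

    return accuracy
```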

Results & Findings

| Precision | Initial NLU Accuracy | Final‑Task Forward Accuracy (Code) | NLU Retention after Math |
| --- | --- | --- | --- |
| FP16 | 74.44 % | 20 % | 45 % |
| INT8 | ~71 % | 35 % (≈ +15 % over FP16) | 65 % (≈ +20 % over FP16) |
| INT4 | ~68 % | 40 % (≈ +20 % over FP16) | 60 % |

  • Quantized models lag slightly on the first task (expected due to reduced capacity) but surpass FP16 by 8‑15 % on later tasks.
  • INT8 consistently offers the best balance: it retains most of the first‑task performance while delivering the largest gains on subsequent tasks.
  • Replay buffers as small as 0.1 % lift NLU retention from 45 % to 65 % across all precisions, confirming that even minimal rehearsal dramatically curbs forgetting.
  • Noise hypothesis: The stochastic rounding and quantization error act like a regularizer, smoothing gradient updates and preventing the model from catastrophically overwriting earlier representations. The per‑precision forgetting implied by the table is worked out just below.
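
Reading the table's retention column as forgetting gives a quick back‑of‑the‑envelope view; the plain "initial minus retained" gap used here is an interpretation of the table, not necessarily the paper's exact metric.

```python
# NLU accuracy after task 1, and NLU retention after training through Math,
# taken from the results table above (values in %).
initial_nlu = {"FP16": 74.44, "INT8": 71.0, "INT4": 68.0}
nlu_after_math = {"FP16": 45.0, "INT8": 65.0, "INT4": 60.0}

for precision in initial_nlu:
    forgetting = initial_nlu[precision] - nlu_after_math[precision]
    print(f"{precision}: forgot {forgetting:.1f} points of NLU accuracy")
# FP16 forgets ~29 points, INT8 only ~6, INT4 ~8 -- the inversion the paper reports.
```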

Practical Implications

  • Deploying LLMs in evolving environments (e.g., chatbots that learn new intents, code assistants that adapt to new APIs) can be done with INT8‑quantized models without sacrificing long‑term performance, and often while improving it.
  • Memory‑constrained edge devices benefit from the 4‑8× reduction in model size while still supporting continual updates.
  • Reduced replay overhead: Teams can store only a tiny fraction of historic data (or even synthetic exemplars) and still achieve strong retention, lowering storage costs and privacy concerns.
  • Training pipelines: Adding a quantization‑aware fine‑tuning step and a lightweight replay buffer is enough to reap the benefits—no need for complex regularization tricks or architectural changes.
  • Inference speed: INT8 inference is typically 2‑3× faster on modern GPUs/TPUs, meaning faster response times for services that continuously learn from user feedback; a minimal quantization example follows this list.
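
As a deployment‑side illustration of the INT8 recommendation, the snippet below applies PyTorch's post‑training dynamic INT8 quantization to the linear layers of a toy model. This is a generic CPU‑inference recipe, not the paper's quantization‑aware continual‑learning setup, but it shows how little code the INT8 serving step requires.

```python
import torch
import torch.nn as nn

# A stand-in transformer-style block; the paper's models are full LLMs.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
)

# Post-training dynamic INT8 quantization of the Linear layers (CPU inference).
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
with torch.inference_mode():
    y = qmodel(x)
print(y.shape)  # torch.Size([1, 1024])
```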

Limitations & Future Work

  • Scope of tasks: Experiments focus on three relatively homogeneous tasks (NLU, Math, Code). Generalization to vision‑language or multimodal streams remains untested.
  • Model scale: Results are shown on a 2‑3 B‑parameter LLM; it is unclear whether the same dynamics hold for much larger (≥30 B) models.
  • Quantization granularity: Only symmetric per‑tensor quantization is explored; mixed‑precision or per‑channel schemes could yield different trade‑offs.
  • Theoretical grounding: The “implicit regularization” hypothesis is supported empirically but lacks a formal analysis; future work could model the noise‑induced gradient dynamics.
  • Replay buffer generation: The study uses random sampling from original data; investigating synthetic or generative replay could further reduce storage needs.

Overall, the paper flips a long‑standing assumption—higher precision is always better—and offers a pragmatic recipe for building efficient, continually learning LLMs that are ready for real‑world deployment.

Authors

  • Michael S. Zhang
  • Rishi A. Ruia
  • Arnav Kewalram
  • Saathvik Dharmapuram
  • Utkarsh Sharma
  • Kevin Zhu

Paper Information

  • arXiv ID: 2512.18934v1
  • Categories: cs.LG, cs.AI
  • Published: December 22, 2025
  • PDF: Download PDF