[Paper] Quasar: Quantized Self-Speculative Acceleration for Rapid Inference via Memory-Efficient Verification

Published: March 1, 2026 at 10:02 PM EST
4 min read
Source: arXiv


Overview

The paper presents Quasar, a training‑free framework that speeds up large‑language‑model (LLM) inference by applying low‑bit quantization only to the verification step of speculative decoding. By cutting the memory traffic of verification in half while preserving the quality of the model’s logits, Quasar squeezes an extra ~28 % throughput gain on top of existing drafting techniques—an important win for anyone deploying LLMs at scale.

Key Contributions

  • Quantized verification: Introduces a novel, low‑bit (e.g., 4‑bit) quantization scheme applied exclusively to the verification pass, leaving the drafting model untouched.
  • Training‑free pipeline: No extra fine‑tuning or data‑centric retraining is required; the method works out‑of‑the‑box on any pre‑trained LLM.
  • Memory‑bandwidth reduction: Demonstrates that quantization halves the memory bandwidth demand of verification, the dominant bottleneck in speculative decoding.
  • Empirical validation: Shows on state‑of‑the‑art models (OpenPangu, Qwen‑3) that acceptance lengths remain on par with full‑precision verification while achieving a 1.28× end‑to‑end speedup.
  • Orthogonal to drafting: The approach can be stacked on top of any existing drafting strategy (self‑speculation, look‑ahead decoding, etc.) without modification.
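The core technique named above, symmetric per‑tensor low‑bit quantization, can be sketched in a few lines. This is a minimal illustration of the general scheme, not the paper's kernels; the function names are our own:

```python
import numpy as np

def quantize_symmetric(w, bits=4):
    """Symmetric per-tensor quantization: one scale factor for the whole
    tensor, mapping values to signed integers in [-(2^(bits-1)-1), 2^(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit
    scale = np.abs(w).max() / qmax      # single per-tensor scale
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Reconstruct an approximation of the original tensor.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_symmetric(w)
w_hat = dequantize(q, s)
```

Because one scale covers the whole tensor, the per‑element reconstruction error is bounded by half the scale, which is why relative logit ordering tends to survive.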

Methodology

  1. Speculative decoding recap – The inference pipeline is split into a fast draft pass (producing candidate tokens) and a slower verification pass (checking those candidates with the full target model).
  2. Targeted quantization – Quasar quantizes only the verification forward pass to a low bit‑width (typically 4‑bit) using a symmetric, per‑tensor scaling factor. This preserves the relative ordering of logits, which is crucial for the acceptance decision.
  3. Preserving logit fidelity – The authors compare two naive acceleration tricks: aggressive structural pruning (which destroys logit quality) vs. quantization. Experiments show quantization retains the original logit distribution with negligible drift.
  4. Integration flow – The draft model runs unchanged in FP16/FP32. After the draft produces a batch of candidate tokens, the verification model runs the same input in quantized mode, computes logits, and decides whether to accept or reject each token. No extra training or calibration data is needed.
  5. Implementation details – The quantized kernels are built on top of existing low‑bit inference libraries (e.g., bitsandbytes), and the authors expose a simple API that swaps the verification model at runtime.
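The integration flow above can be sketched with toy stand‑ins for the two models. This is a simplified greedy‑acceptance variant under assumed names (`quantize_symmetric`, `speculative_step` are hypothetical), not the authors' API; the point is that only the verification weights are quantized while drafting runs at full precision:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16

# Toy stand-in for the target model: row t holds next-token logits after token t.
W_target = rng.standard_normal((VOCAB, VOCAB)).astype(np.float32)

def quantize_symmetric(w, bits=4):
    """Fake-quantize to a symmetric per-tensor low-bit grid (dequantized floats)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

W_verify = quantize_symmetric(W_target)   # low-bit weights, verification only

def speculative_step(context, draft_tokens):
    """Greedy verification: accept draft tokens while the quantized
    verifier's argmax agrees; stop at the first mismatch."""
    accepted, tok = [], context
    for d in draft_tokens:
        if int(np.argmax(W_verify[tok])) == d:
            accepted.append(d)
            tok = d
        else:
            break
    return accepted

# Draft pass (unchanged, full precision): greedy rollout of 4 candidate tokens.
draft, tok = [], 0
for _ in range(4):
    tok = int(np.argmax(W_target[tok]))
    draft.append(tok)

accepted = speculative_step(0, draft)     # verified prefix of the draft
```

Real implementations verify all candidates in one batched forward pass and use probabilistic acceptance rather than argmax matching; the structure (full‑precision draft, quantized verify, accept a prefix) is the same.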

Results & Findings

| Model | Baseline (full‑precision SD) | Quasar (quantized verification) | Throughput ↑ | Acceptance length Δ |
|---|---|---|---|---|
| OpenPangu‑13B | 1.00× | 1.28× | +28 % | < 0.5 % drop |
| Qwen‑3‑7B | 1.00× | 1.27× | +27 % | < 0.4 % drop |
  • Memory traffic: Quantization cuts verification memory reads/writes by ~50 %, directly alleviating the bandwidth bottleneck.
  • Logit similarity: KL‑divergence between full‑precision and quantized logits stays below 0.001, confirming that acceptance decisions are virtually unchanged.
  • Compatibility: When combined with state‑of‑the‑art drafting methods (e.g., self‑speculation with look‑ahead), Quasar adds its speedup on top of the existing gains, confirming orthogonality.
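The logit‑similarity metric reported above can be reproduced in miniature. The sketch below perturbs a logit vector directly with fake quantization as a stand‑in for weight‑quantization error (the paper quantizes weights, not logits, so the numbers here are illustrative only):

```python
import numpy as np

def softmax(x):
    z = x - x.max()                     # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q):
    """KL(p || q) for two dense probability vectors."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def quantize_symmetric(x, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

rng = np.random.default_rng(1)
logits_fp = rng.standard_normal(32).astype(np.float32)
logits_q = quantize_symmetric(logits_fp)

kl = kl_divergence(softmax(logits_fp), softmax(logits_q))
```

A small KL between the two softmax distributions implies the acceptance decision, which depends only on relative token probabilities, is essentially unchanged.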

Practical Implications

  • Lower hardware costs: By reducing memory bandwidth demand, Quasar enables higher inference throughput on existing GPUs/TPUs without needing newer, faster memory subsystems.
  • Higher request concurrency: Cloud providers can serve more simultaneous LLM queries per GPU, translating to better utilization and lower per‑token cost.
  • Easy integration: Since no retraining is required, developers can drop Quasar into existing inference pipelines with a single configuration change.
  • Edge and on‑device scenarios: The reduced memory footprint makes speculative decoding viable on devices with limited bandwidth (e.g., mobile GPUs, inference accelerators).
  • Future‑proofing: As LLMs continue to grow, the verification step will become an even larger bottleneck; Quasar’s quantized verification offers a scalable, model‑agnostic mitigation path.

Limitations & Future Work

  • Quantization granularity: The current implementation uses uniform per‑tensor scaling; more sophisticated mixed‑precision or per‑channel schemes could push speedups further.
  • Hardware dependence: The reported gains assume GPUs with efficient low‑bit kernels; on older hardware the speedup may be modest.
  • Edge‑case accuracy: While acceptance length is largely unchanged, rare pathological prompts could see a slight degradation; a fallback to full‑precision verification is needed in safety‑critical applications.
  • Broader benchmarks: Experiments focus on two models; extending evaluation to encoder‑decoder architectures and multimodal LLMs is left for future work.

Quasar demonstrates that a targeted, training‑free quantization of the verification leg can break the “memory wall” that has capped speculative decoding performance, offering a practical, immediately deployable boost for developers building high‑throughput LLM services.

Authors

  • Guang Huang
  • Zeyi Wen

Paper Information

  • arXiv ID: 2603.01399v1
  • Categories: cs.DC, cs.LG
  • Published: March 2, 2026