[Paper] RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference

Published: March 18, 2026 at 12:16 PM EDT
5 min read

Source: arXiv - 2603.17891v1

Overview

The paper introduces RAMP (Reinforcement Adaptive Mixed Precision), a novel framework that automatically decides the optimal number of bits to use for each layer of a large language model (LLM) during post‑training quantization. By treating the bit‑width selection as a reinforcement‑learning problem, RAMP achieves higher accuracy at a lower memory footprint than existing uniform‑precision methods, making on‑device LLM inference more practical.

Key Contributions

  • Per‑layer mixed‑precision policy learned via an off‑policy Soft Actor‑Critic (SAC) algorithm, optimizing perplexity under a global bit‑budget.
  • Scale‑Folding pre‑conditioning that shifts activation outliers into the weight tensors, enabling stable sub‑4‑bit quantization.
  • Lightweight 11‑dimensional state representation (activation stats, weight characteristics, structural descriptors) that generalizes zero‑shot across model families and scales.
  • Quality‑prioritized reward design with asymmetric penalties and “budget cliffs” that speeds up convergence.
  • Empirical gains: on Llama‑2 7B, RAMP reaches 5.54 perplexity at 3.68 GB (≈3.65 effective bits), beating uniform 4‑bit AWQ and GPTQ in both size and quality.
  • Zero‑shot transfer: a policy trained on a single 7B model works out‑of‑the‑box for Llama‑2 13B and Mistral 7B, often surpassing policies trained per model.
  • HALO export pipeline that writes the mixed‑precision layout to the GGUF format, enabling kernel‑free inference on CPUs, GPUs, and edge devices while preserving ~99.5 % of FP16 commonsense reasoning performance.

Methodology

  1. State Construction – For each layer, RAMP extracts an 11‑dimensional embedding that captures:

    • Activation distribution statistics (mean, variance, outlier ratio)
    • Weight properties (norm, sparsity, dynamic range)
    • Structural descriptors (layer type, size, position in the network)
  2. Reinforcement Learning Loop

    • Agent: an off‑policy Soft Actor‑Critic (SAC) network proposes a bit‑width (e.g., 2‑8 bits) for every layer.
    • Environment: the quantization engine applies the proposed widths, runs a short forward pass on a validation set, and reports perplexity and memory usage.
    • Reward: combines a quality term (lower perplexity = higher reward) with a penalty that sharply increases once total memory exceeds the target budget (the “budget cliff”). The reward is asymmetric: small quality degradations are penalized more heavily than comparable memory savings, steering the policy toward accuracy‑first solutions.
  3. Scale‑Folding – Before quantization, per‑channel scaling factors are absorbed into the weight tensors, and the corresponding normalization layers are adjusted. This reduces extreme activation values that would otherwise cause large quantization errors in sub‑4‑bit regimes.

  4. Training & Transfer – The SAC agent is trained on a single model (Llama‑2 7B). Because the state representation abstracts away model‑specific parameters, the learned policy can be applied directly to other LLMs without retraining.

  5. Export – The final per‑layer bit assignments are serialized into the GGUF format via the HALO pipeline, so the quantized model runs on existing GGUF‑compatible runtimes across CPU, GPU, and edge back‑ends without custom kernels.
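As a concrete illustration of step 1, the sketch below assembles a per‑layer feature vector from the statistics named above. The paper's exact eleven features are not spelled out here, so this uses a smaller illustrative subset; the function name and feature choices are assumptions, not the paper's implementation.

```python
import numpy as np

def layer_state(weights, activations, layer_type_id, depth_frac, n_params):
    """Per-layer state vector in the spirit of RAMP's 11-dimensional
    embedding. Illustrative subset of plausible features, not the
    paper's exact definition."""
    w = weights.ravel()
    a = activations.ravel()
    # fraction of activations more than 3 std-devs from zero
    outlier_ratio = float(np.mean(np.abs(a) > 3 * a.std()))
    return np.array([
        a.mean(), a.var(), outlier_ratio,           # activation statistics
        np.linalg.norm(w), float(np.mean(w == 0)),  # weight norm, sparsity
        w.max() - w.min(),                          # dynamic range
        float(layer_type_id),                       # structural descriptors:
        depth_frac,                                 #   relative depth in network
        np.log(n_params),                           #   log parameter count
    ])
```

Because every entry is a cheap summary statistic rather than a raw weight, the vector is model‑agnostic, which is what makes the zero‑shot transfer in step 4 plausible.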
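The reward in step 2 can be sketched as follows. All constants (`ppl_ref`, `quality_weight`, `cliff_penalty`) are illustrative assumptions; only the shape follows the paper's description: quality losses are amplified relative to quality gains (the asymmetric penalty), and crossing the memory budget triggers a steep drop (the budget cliff).

```python
def ramp_reward(perplexity, mem_gb, budget_gb,
                ppl_ref=5.60, quality_weight=5.0, cliff_penalty=10.0):
    """Quality-prioritized reward sketch; constants are assumptions."""
    quality = ppl_ref - perplexity  # positive = better than reference
    # asymmetric: degradation below the reference is amplified
    reward = quality if quality >= 0 else quality_weight * quality
    # budget cliff: sharp, growing penalty once memory exceeds the target
    if mem_gb > budget_gb:
        reward -= cliff_penalty * (1.0 + (mem_gb - budget_gb))
    return reward
```

With a shape like this, the agent can never trade a large perplexity regression for a small memory saving, which is the "accuracy‑first" behavior the authors report.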
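Scale‑Folding (step 3) rests on a simple algebraic identity: per‑channel scales can move from the activations into the weights without changing the layer's output. A minimal numerical check, assuming a `[out, in]` weight layout (the scale heuristic here is an assumption, not the paper's):

```python
import numpy as np

def fold_scales(W, s):
    """Fold per-input-channel scales s into weight matrix W ([out, in]).
    If activations are divided by s upstream (e.g. by adjusting the
    preceding normalization), the matmul output is unchanged:
        (x / s) @ (W * s).T == x @ W.T
    The scaled weights absorb the activation outliers, which is what
    stabilizes sub-4-bit quantization."""
    return W * s  # broadcasts s over the input dimension

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))               # activations [batch, in]
W = rng.normal(size=(4, 8))               # weights [out, in]
s = np.abs(x).max(axis=0) + 1e-6          # illustrative per-channel scale

y_ref = x @ W.T
y_folded = (x / s) @ fold_scales(W, s).T  # identical output after folding
```

After folding, the extreme values live in `W * s`, where weight quantizers handle them far better than activation quantizers would.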

Results & Findings

| Model | Bit‑budget (GB) | Effective Bits | Perplexity | Baseline (Uniform 4‑bit AWQ) |
|---|---|---|---|---|
| Llama‑2 7B | 3.68 | 3.65 | 5.54 | 5.60 (3.90 GB) |
| Llama‑2 13B (zero‑shot) | ~7.2 | ~3.7 | ≈5.6 | 5.8 (uniform) |
| Mistral 7B (zero‑shot) | ~3.9 | ~3.6 | ≈5.5 | 5.7 (uniform) |
  • Size reduction: RAMP saves ~6 % of memory compared to the best uniform‑precision method.
  • Quality improvement: Perplexity drops 1‑3 % relative to baselines, translating to near‑FP16 reasoning performance (99.5 % retained).
  • Training efficiency: The reward design and scale‑folding enable convergence within a few hundred thousand environment steps, far fewer than naïve RL quantization attempts.
  • Generalization: A single policy works across different architectures and parameter counts, supporting the authors’ claim that quantization sensitivity is largely architectural rather than model‑specific.

Practical Implications

  • On‑device LLMs: Developers can now run 7‑13 B parameter models on edge devices (smartphones, embedded GPUs, micro‑servers) with memory budgets previously reserved for much smaller networks.
  • Deployment pipelines: The HALO → GGUF workflow integrates with existing model serving stacks (e.g., Hugging Face Transformers, llama.cpp), requiring only a one‑time RL policy inference to generate the mixed‑precision layout.
  • Cost savings: Smaller memory footprints reduce hardware costs, power consumption, and latency—critical for real‑time applications like voice assistants, on‑device summarization, or personalized recommendation engines.
  • Flexibility: Since the policy is lightweight, teams can experiment with different global budgets (e.g., “fit within 4 GB”) without re‑training the entire quantizer, simply re‑running the RL inference step.
  • Future‑proofing: As newer, larger LLMs emerge, the same RAMP policy can be applied (or fine‑tuned) to obtain mixed‑precision configurations, accelerating time‑to‑market for AI‑powered products.

Limitations & Future Work

  • Training overhead: While the RL policy converges relatively quickly, the initial off‑policy training still requires a full‑precision model and a validation set, which may be prohibitive for extremely large models (>30 B).
  • Hardware‑specific nuances: The current reward does not explicitly model hardware latency or energy; extending RAMP to optimize for those metrics could yield even more deployment‑ready configurations.
  • Outlier handling: Scale‑Folding mitigates activation outliers but may introduce numerical instability in certain normalization layers; further robustness checks are needed for diverse architectures.
  • Broader benchmarks: The paper focuses on perplexity and commonsense reasoning; evaluating on downstream tasks (e.g., code generation, translation) would clarify the trade‑offs in real‑world use cases.
  • Policy interpretability: Understanding why the policy assigns specific bit widths to particular layers could guide manual heuristics and improve trust in automated quantization pipelines.

Overall, RAMP pushes mixed‑precision quantization from a research curiosity toward a production‑ready tool that can democratize on‑device LLM inference.

Authors

  • Arpit Singh Gautam
  • Saurabh Jha

Paper Information

  • arXiv ID: 2603.17891v1
  • Categories: cs.LG, cs.AI
  • Published: March 18, 2026