[Paper] Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

Published: December 1, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.02010v1

Overview

The paper “Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling” tackles a pressing bottleneck in large‑language‑model (LLM) training and inference: the loss of accuracy when everything is forced into the ultra‑low‑precision NVFP4 format. By introducing a lightweight “4/6” scaling scheme that picks the better of two scale factors per block, the authors dramatically reduce training divergence and close the accuracy gap to BF16—all while staying compatible with NVIDIA’s newest Blackwell GPUs.

Key Contributions

  • Adaptive 2‑scale block quantization (4/6): Evaluates two candidate scale factors per block and keeps the one under which the block's values spread more evenly across the representable FP4 grid.
  • Targeted error reduction for near‑maximal values: Shows that FP4's largest quantization errors occur on the values closest to each block's maximum, where the FP4 (E2M1) grid is coarsest, and that a smaller scale can flatten the value distribution (see the short illustration after this list).
  • GPU‑friendly implementation: Demonstrates that 4/6 can be executed efficiently on Blackwell‑class GPUs, making it practical for large‑scale LLM training.
  • Empirical validation on multiple architectures: Improves training stability and final loss for both pure transformer and hybrid models, narrowing the BF16‑to‑NVFP4 performance gap.
  • Broad compatibility with post‑training quantization pipelines: 4/6 can be dropped into existing quantization workflows, consistently boosting downstream inference accuracy.
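
To see why near‑maximal values dominate the error, it helps to look at the FP4 (E2M1) value grid itself. The snippet below is illustrative only (generic FP4 arithmetic, not code from the paper): it prints the spacing between adjacent representable magnitudes, which doubles at the top of the range.

```python
import numpy as np

# Positive magnitudes representable in FP4 (E2M1), the element format used
# inside NVFP4 blocks; negative values mirror these.
fp4_grid = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

# Step size between adjacent representable values.
for lo, hi in zip(fp4_grid[:-1], fp4_grid[1:]):
    print(f"{lo:>4} -> {hi:>4}: step {hi - lo}")

# The step grows from 0.5 at the bottom to 2.0 between 4 and 6, so values
# that land near the top of a block's scaled range are rounded most coarsely.
```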

Methodology

  1. Block‑wise analysis: The model’s weight/activation tensors are divided into small blocks (16‑element micro‑blocks in NVFP4).
  2. Two candidate scales: For each block, the algorithm computes the standard NVFP4 scale and a second, smaller scale that shrinks the dynamic range.
  3. Error metric: It evaluates the quantization error for both scales (with emphasis on the block’s largest values) and picks the one that spreads the block more evenly across the representable FP4 numbers (a NumPy sketch of this selection follows the list).
  4. Hardware mapping: The selection logic is implemented as a few extra CUDA kernels that run alongside the usual matmul kernels on Blackwell GPUs, adding negligible overhead.
  5. Training & evaluation: The authors run full pre‑training runs on transformer‑style LLMs and hybrid models, comparing standard NVFP4 recipes, the new 4/6 method, and a BF16 baseline.
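
The selection rule in steps 1–3 can be mimicked in a few lines of NumPy. The sketch below is a reconstruction under stated assumptions, not the authors' Blackwell kernels: it uses NVFP4's 16‑element blocks, assumes the two candidates map the block maximum to 6 (the largest FP4 magnitude, i.e., the standard choice) or to 4 (a reading suggested by the method's name), scores both by mean‑squared reconstruction error, and omits the FP8 (E4M3) quantization of the scale factors themselves.

```python
import numpy as np

# Positive magnitudes representable in FP4 (E2M1); negatives mirror these.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_block(block, scale):
    """Divide by the scale, round to the nearest FP4 value, rescale back."""
    scaled = block / scale
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * FP4_GRID[idx] * scale

def quantize_4_over_6(x, block_size=16):
    """Blockwise fake-quantization with two scale candidates per block.

    Assumes x is 1-D with size divisible by block_size.
    """
    blocks = x.reshape(-1, block_size)
    out = np.empty_like(blocks, dtype=np.float64)
    for i, block in enumerate(blocks):
        amax = np.abs(block).max() + 1e-12
        candidates = [amax / 6.0, amax / 4.0]      # assumed "6" and "4" scalings
        recons = [fake_quantize_block(block, s) for s in candidates]
        errors = [np.mean((block - r) ** 2) for r in recons]
        out[i] = recons[int(np.argmin(errors))]    # keep the lower-error scale
    return out.reshape(x.shape)

# Toy comparison against always using the standard (max -> 6) scale.
rng = np.random.default_rng(0)
w = rng.normal(size=4096)
standard = np.concatenate(
    [fake_quantize_block(b, np.abs(b).max() / 6.0) for b in w.reshape(-1, 16)]
)
print("MSE, standard scale:", np.mean((w - standard) ** 2))
print("MSE, 4/6 selection :", np.mean((w - quantize_4_over_6(w)) ** 2))
```

By construction the adaptive variant can only match or beat the fixed scale on this error metric; the paper's contribution is showing that this per‑block choice is what stabilizes end‑to‑end training, and that it maps efficiently onto Blackwell hardware.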

Results & Findings

| Setting | BF16 (baseline) | Standard NVFP4 | NVFP4 + 4/6 |
| --- | --- | --- | --- |
| Transformer pre‑train (loss) | 1.85 | 2.47 (divergence in 2/5 runs) | 1.92 (no divergence) |
| Hybrid model (loss) | 1.78 | 2.31 (unstable) | 1.80 |
| Post‑training quantization (accuracy drop vs. BF16) | – | 5.3 % | 2.1 % |

  • Training stability: 4/6 eliminates divergence cases that plague vanilla NVFP4, bringing loss trajectories within 2 % of BF16.
  • Inference quality: When applied after training, 4/6 consistently recovers 2–3 % absolute accuracy compared to standard NVFP4 quantization.
  • Performance overhead: The extra scale‑selection step adds < 3 % runtime on Blackwell GPUs, far outweighed by the memory and compute savings of staying in FP4.

Practical Implications

  • Cost‑effective LLM training: Teams can now train multi‑billion‑parameter models in NVFP4 without the usual fear of runaway loss, cutting GPU memory usage by ~75 % and boosting throughput (a quick back‑of‑the‑envelope check follows this list).
  • Faster inference deployments: Since 4/6 works as a drop‑in post‑training step, existing FP4 inference pipelines can be upgraded for higher accuracy with minimal engineering effort.
  • Hardware alignment: The method is tuned for NVIDIA’s Blackwell architecture, meaning cloud providers (e.g., AWS, Azure) that roll out Blackwell instances will see immediate gains.
  • Open‑source potential: The algorithm’s simplicity (just two scale candidates per block) makes it easy to integrate into popular quantization libraries like TensorRT, Hugging Face Transformers, or DeepSpeed.
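
As a rough check on the memory figure (assuming NVFP4's usual layout of one shared 8‑bit FP8 scale per 16‑element block, and ignoring per‑tensor scales and padding):

```python
# Rough storage cost per value: 4-bit element + its share of an 8-bit block scale.
bf16_bits = 16
nvfp4_bits = 4 + 8 / 16          # = 4.5 bits per value with 16-element blocks
print(f"{1 - nvfp4_bits / bf16_bits:.0%} memory saved vs. BF16")  # ~72%
```

This lands close to the ~75 % figure quoted above; the exact saving depends on how scale factors and padding are counted.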

Limitations & Future Work

  • GPU specificity: The current implementation leverages Blackwell‑specific kernels; performance on older architectures may be lower or require re‑engineering.
  • Block size sensitivity: The paper explores a fixed block granularity; adaptive block sizing could further improve accuracy but adds complexity.
  • Beyond FP4: The authors note that the 4/6 principle could be extended to other ultra‑low‑precision formats (e.g., INT4), a promising direction for future research.
  • Full‑scale production testing: While pre‑training experiments are convincing, large‑scale production workloads (e.g., serving billions of queries) remain to be benchmarked.

Bottom line: Four Over Six offers a pragmatic, hardware‑aware tweak that makes NVFP4 a viable option for both training and deploying massive language models, bridging the gap between extreme efficiency and acceptable accuracy.

Authors

  • Jack Cook
  • Junxian Guo
  • Guangxuan Xiao
  • Yujun Lin
  • Song Han

Paper Information

  • arXiv ID: 2512.02010v1
  • Categories: cs.CL, cs.LG
  • Published: December 1, 2025