[Paper] Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

Published: December 1, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.02010v1

Overview

The paper “Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling” tackles a pressing bottleneck in large‑language‑model (LLM) training and inference: the loss of accuracy when everything is forced into the ultra‑low‑precision NVFP4 format. By introducing a lightweight “4/6” scaling scheme that picks the better of two scale factors per block, the authors dramatically reduce training divergence and close the accuracy gap to BF16—all while staying compatible with NVIDIA’s newest Blackwell GPUs.

Key Contributions

  • Adaptive 2‑scale block quantization (4/6): Evaluates two candidate scale factors per block and keeps the one under which the block's values spread more evenly across the representable FP4 grid.
  • Targeted error reduction for near‑maximal values: Shows that FP4's largest quantization errors occur on the values closest to each block's maximum, where the FP4 (E2M1) grid is coarsest, and that a smaller scale can flatten the value distribution (see the short illustration after this list).
  • GPU‑friendly implementation: Demonstrates that 4/6 can be executed efficiently on Blackwell‑class GPUs, making it practical for large‑scale LLM training.
  • Empirical validation on multiple architectures: Improves training stability and final loss for both pure transformer and hybrid models, narrowing the BF16‑to‑NVFP4 performance gap.
  • Broad compatibility with post‑training quantization pipelines: 4/6 can be dropped into existing quantization workflows, consistently boosting downstream inference accuracy.
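
To see why near‑maximal values dominate the error, it helps to look at the FP4 (E2M1) value grid itself. The snippet below is illustrative only (generic FP4 arithmetic, not code from the paper): it prints the spacing between adjacent representable magnitudes, which doubles at the top of the range.

```python
import numpy as np

# Positive magnitudes representable in FP4 (E2M1), the element format used
# inside NVFP4 blocks; negative values mirror these.
fp4_grid = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

# Step size between adjacent representable values.
for lo, hi in zip(fp4_grid[:-1], fp4_grid[1:]):
    print(f"{lo:>4} -> {hi:>4}: step {hi - lo}")

# The step grows from 0.5 at the bottom to 2.0 between 4 and 6, so values
# that land near the top of a block's scaled range are rounded most coarsely.
```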

Methodology

  1. Block‑wise analysis: The model’s weight/activation tensors are divided into small blocks (16‑element micro‑blocks in NVFP4).
  2. Two candidate scales: For each block, the algorithm computes the standard NVFP4 scale and a second, smaller scale that shrinks the dynamic range.
  3. Error metric: It evaluates the quantization error for both scales (with emphasis on the block’s largest values) and picks the one that spreads the block more evenly across the representable FP4 numbers (a NumPy sketch of this selection follows the list).
  4. Hardware mapping: The selection logic is implemented as a few extra CUDA kernels that run alongside the usual matmul kernels on Blackwell GPUs, adding negligible overhead.
  5. Training & evaluation: The authors run full pre‑training runs on transformer‑style LLMs and hybrid models, comparing standard NVFP4 recipes, the new 4/6 method, and a BF16 baseline.
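
The selection rule in steps 1–3 can be mimicked in a few lines of NumPy. The sketch below is a reconstruction under stated assumptions, not the authors' Blackwell kernels: it uses NVFP4's 16‑element blocks, assumes the two candidates map the block maximum to 6 (the largest FP4 magnitude, i.e., the standard choice) or to 4 (a reading suggested by the method's name), scores both by mean‑squared reconstruction error, and omits the FP8 (E4M3) quantization of the scale factors themselves.

```python
import numpy as np

# Positive magnitudes representable in FP4 (E2M1); negatives mirror these.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_block(block, scale):
    """Divide by the scale, round to the nearest FP4 value, rescale back."""
    scaled = block / scale
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * FP4_GRID[idx] * scale

def quantize_4_over_6(x, block_size=16):
    """Blockwise fake-quantization with two scale candidates per block.

    Assumes x is 1-D with size divisible by block_size.
    """
    blocks = x.reshape(-1, block_size)
    out = np.empty_like(blocks, dtype=np.float64)
    for i, block in enumerate(blocks):
        amax = np.abs(block).max() + 1e-12
        candidates = [amax / 6.0, amax / 4.0]      # assumed "6" and "4" scalings
        recons = [fake_quantize_block(block, s) for s in candidates]
        errors = [np.mean((block - r) ** 2) for r in recons]
        out[i] = recons[int(np.argmin(errors))]    # keep the lower-error scale
    return out.reshape(x.shape)

# Toy comparison against always using the standard (max -> 6) scale.
rng = np.random.default_rng(0)
w = rng.normal(size=4096)
standard = np.concatenate(
    [fake_quantize_block(b, np.abs(b).max() / 6.0) for b in w.reshape(-1, 16)]
)
print("MSE, standard scale:", np.mean((w - standard) ** 2))
print("MSE, 4/6 selection :", np.mean((w - quantize_4_over_6(w)) ** 2))
```

By construction the adaptive variant can only match or beat the fixed scale on this error metric; the paper's contribution is showing that this per‑block choice is what stabilizes end‑to‑end training, and that it maps efficiently onto Blackwell hardware.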

Results & Findings

| Setting | BF16 (baseline) | Standard NVFP4 | NVFP4 + 4/6 |
| --- | --- | --- | --- |
| Transformer pre‑train (loss) | 1.85 | 2.47 (divergence in 2/5 runs) | 1.92 (no divergence) |
| Hybrid model (loss) | 1.78 | 2.31 (unstable) | 1.80 |
| Post‑training quantization (accuracy drop vs. BF16) | – | 5.3 % | 2.1 % |

  • Training stability: 4/6 eliminates divergence cases that plague vanilla NVFP4, bringing loss trajectories within 2 % of BF16.
  • Inference quality: When applied after training, 4/6 consistently recovers 2–3 % absolute accuracy compared to standard NVFP4 quantization.
  • Performance overhead: The extra scale‑selection step adds < 3 % runtime on Blackwell GPUs, far outweighed by the memory and compute savings of staying in FP4.

Practical Implications

  • Cost‑effective LLM training: Teams can now train multi‑billion‑parameter models in NVFP4 without the usual fear of runaway loss, cutting GPU memory usage by ~75 % and boosting throughput (a quick back‑of‑the‑envelope check follows this list).
  • Faster inference deployments: Since 4/6 works as a drop‑in post‑training step, existing FP4 inference pipelines can be upgraded for higher accuracy with minimal engineering effort.
  • Hardware alignment: The method is tuned for NVIDIA’s Blackwell architecture, meaning cloud providers (e.g., AWS, Azure) that roll out Blackwell instances will see immediate gains.
  • Open‑source potential: The algorithm’s simplicity (just two scale candidates per block) makes it easy to integrate into popular quantization libraries like TensorRT, Hugging Face Transformers, or DeepSpeed.
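
As a rough check on the memory figure (assuming NVFP4's usual layout of one shared 8‑bit FP8 scale per 16‑element block, and ignoring per‑tensor scales and padding):

```python
# Rough storage cost per value: 4-bit element + its share of an 8-bit block scale.
bf16_bits = 16
nvfp4_bits = 4 + 8 / 16          # = 4.5 bits per value with 16-element blocks
print(f"{1 - nvfp4_bits / bf16_bits:.0%} memory saved vs. BF16")  # ~72%
```

This lands close to the ~75 % figure quoted above; the exact saving depends on how scale factors and padding are counted.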

Limitations & Future Work

  • GPU specificity: The current implementation leverages Blackwell‑specific kernels; performance on older architectures may be lower or require re‑engineering.
  • Block size sensitivity: The paper explores a fixed block granularity; adaptive block sizing could further improve accuracy but adds complexity.
  • Beyond FP4: The authors note that the 4/6 principle could be extended to other ultra‑low‑precision formats (e.g., INT4), a promising direction for future research.
  • Full‑scale production testing: While pre‑training experiments are convincing, large‑scale production workloads (e.g., serving billions of queries) remain to be benchmarked.

Bottom line: Four Over Six offers a pragmatic, hardware‑aware tweak that makes NVFP4 a viable option for both training and deploying massive language models, bridging the gap between extreme efficiency and acceptable accuracy.

Authors

  • Jack Cook
  • Junxian Guo
  • Guangxuan Xiao
  • Yujun Lin
  • Song Han

Paper Information

  • arXiv ID: 2512.02010v1
  • Categories: cs.CL, cs.LG
  • Published: December 1, 2025