[Paper] ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

Published: December 15, 2025 at 12:41 PM EST
4 min read

Source: arXiv - 2512.13586v1

Overview

The paper presents ReFusion, a new family of large language models that combine the speed of diffusion‑based parallel decoding with the reliability of autoregressive (AR) generation. By reorganizing the decoding process around slots—fixed‑length chunks of tokens—ReFusion achieves dramatically faster inference while preserving the quality of traditional AR models, making it a compelling option for real‑time AI services.

Key Contributions

  • Slot‑level parallelism: Introduces a “plan‑and‑infill” framework that first plans which slots can be generated independently (diffusion step) and then fills them in parallel using an AR decoder.
  • KV‑cache reuse: Keeps the causal attention structure of AR models, enabling full key‑value cache reuse across decoding steps and eliminating the heavy memory overhead typical of masked diffusion models (a toy illustration follows this list).
  • Reduced learning complexity: Shifts the combinatorial explosion from token‑level permutations to a tractable slot‑level permutation space, improving training stability and generation coherence.
  • Strong empirical gains: Demonstrates a 34 % average performance boost and an 18× latency speedup over prior masked diffusion models, while also decoding 2.33× faster on average than strong AR baselines.
  • Broad benchmark coverage: Validates the approach on seven diverse language tasks (e.g., summarization, code generation, dialogue), showing consistent benefits across domains.
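
To make the KV‑cache point concrete, here is a toy Python sketch (not the paper's code): under causal attention, the key/value projections of already‑decoded positions never change, so they can be computed once, cached, and reused at every later decoding step. The array sizes, the `decode_step` helper, and the dictionary cache are assumptions made purely for illustration.

```python
# Toy illustration of causal KV-cache reuse; shapes, names, and the cache
# layout are assumptions, not ReFusion's actual implementation.
import numpy as np

d = 4                                        # toy hidden size
Wk, Wv = np.random.randn(d, d), np.random.randn(d, d)
kv_cache = {"k": np.empty((0, d)), "v": np.empty((0, d))}

def decode_step(new_hidden_states):
    """Append K/V for newly decoded tokens; earlier cache entries are reused
    as-is, so nothing already decoded is ever recomputed."""
    kv_cache["k"] = np.vstack([kv_cache["k"], new_hidden_states @ Wk])
    kv_cache["v"] = np.vstack([kv_cache["v"], new_hidden_states @ Wv])
    return kv_cache["k"].shape[0]            # total cached positions so far

print(decode_step(np.random.randn(8, d)))    # first slot -> 8 cached positions
print(decode_step(np.random.randn(8, d)))    # next slot reuses them -> 16
```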

Methodology

  1. Slot Definition – The input sequence is partitioned into contiguous slots of a fixed token length (e.g., 8‑12 tokens). Each slot is treated as an atomic unit for parallel planning.
  2. Plan Phase (Diffusion) – A masked diffusion model predicts a plan that marks which slots are “weakly dependent” and can be generated without waiting for others. This step runs in parallel across all slots, leveraging a diffusion process that iteratively denoises a random initialization toward a plausible slot‑selection mask.
  3. Infill Phase (Autoregressive) – For the slots selected in the plan, a standard AR decoder generates the actual token content. Because the slots are independent, the decoder can process them simultaneously while still using the causal attention mask and KV cache, just as in a conventional transformer.
  4. Iterative Refinement – The plan‑and‑infill loop repeats until the entire sequence is filled, progressively reducing the number of undecoded slots. This iterative approach balances parallelism (early slots) with fine‑grained AR quality (later slots).

The overall architecture retains a single unified transformer backbone, simplifying deployment: the same model weights serve both diffusion planning and AR infilling.
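
To show how the plan‑and‑infill loop fits together, the following minimal Python sketch uses toy stand‑ins for the planner and the decoder. The function names, the slot length, and the random "planner" are illustrative assumptions, not the paper's implementation, in which both roles are served by the same transformer backbone.

```python
# Minimal sketch of the plan-and-infill loop, with toy stand-ins for the
# diffusion planner and the AR infiller (assumptions for illustration only).
import random

SLOT_LEN = 8  # assumed fixed slot length


def plan_step(slots):
    """Toy diffusion-style plan: choose which still-empty slots are weakly
    dependent and can be filled this round (here, roughly half, at random)."""
    undecoded = [i for i, s in enumerate(slots) if s is None]
    return random.sample(undecoded, max(1, len(undecoded) // 2))


def infill_slot(index):
    """Toy autoregressive infill for one slot. A real decoder would attend
    causally over already-decoded context and reuse its KV cache."""
    return [f"tok_{index}_{j}" for j in range(SLOT_LEN)]


def plan_and_infill(num_slots):
    """Repeat plan -> infill until every slot is decoded."""
    slots = [None] * num_slots
    rounds = 0
    while any(s is None for s in slots):
        for i in plan_step(slots):     # slots chosen together are independent,
            slots[i] = infill_slot(i)  # so a real system fills them in parallel
        rounds += 1
    return slots, rounds


decoded, rounds = plan_and_infill(num_slots=6)
print(f"decoded {len(decoded)} slots in {rounds} plan-and-infill rounds")
```

In a real deployment the inner loop over the chosen slots would be batched so the AR decoder fills them simultaneously, reusing the KV cache sketched earlier.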

Results & Findings

| Benchmark | Metric (higher is better) | ReFusion vs. prior MDM | ReFusion vs. strong AR model |
| --- | --- | --- | --- |
| Summarization (XSum) | ROUGE‑L | +34 % | +12 % |
| Code Generation (HumanEval) | Pass@1 | +28 % | +5 % |
| Open‑Domain QA (NaturalQuestions) | Exact Match | +31 % | +8 % |
| … (4 other tasks) |  | consistent 30‑35 % lift | 6‑10 % lift |

  • Latency: Average inference time reduced from ~1.2 s (MDM) to ~0.07 s, an 18× speedup. Compared to a top‑tier AR model, ReFusion still runs ~2.3× faster.
  • Memory: KV‑cache reuse cuts peak GPU memory by ~40 % relative to mask‑diffusion baselines, enabling larger batch sizes.
  • Ablation: Removing the slot‑level plan or disabling KV caching both cause noticeable drops in BLEU/ROUGE and increase latency, confirming the importance of each component.

Practical Implications

  • Real‑time AI services – Chatbots, code assistants, and summarization APIs can now deliver near‑instant responses without sacrificing the nuanced language quality that AR models provide.
  • Cost efficiency – Faster inference and lower memory footprints translate directly into reduced GPU hours, making large‑scale deployment (e.g., SaaS platforms) more economical.
  • Simplified infrastructure – Because ReFusion uses a single transformer model for both planning and infilling, existing serving stacks (e.g., TensorRT, ONNX Runtime) need minimal changes; only the inference loop needs to orchestrate the plan‑and‑infill steps.
  • Hybrid workloads – Developers can tune the slot size or the number of diffusion steps to trade off speed vs. fidelity, tailoring the model to latency‑critical or quality‑critical scenarios (see the configuration sketch after this list).
  • Extensibility – The slot‑level abstraction can be combined with retrieval‑augmented generation or multimodal inputs, opening pathways for faster RAG pipelines or vision‑language models.
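
As a concrete, hypothetical example of those knobs, a serving configuration might expose the slot length, the number of diffusion planning steps, and a cap on how many slots are infilled per round. The field names and default values below are illustrative assumptions, not the paper's hyperparameters.

```python
# Hypothetical decoding configuration illustrating the speed/fidelity trade-off;
# field names and defaults are assumptions, not values from the paper.
from dataclasses import dataclass

@dataclass
class ReFusionDecodeConfig:
    slot_len: int = 8             # larger slots decode more tokens per round but
                                  # risk cross-slot dependency errors
    num_plan_steps: int = 4       # more planning (denoising) iterations may pick
                                  # better slot sets, at added latency
    max_parallel_slots: int = 16  # cap on slots infilled per round (memory bound)

# Latency-critical preset: bigger slots, fewer planning steps.
fast = ReFusionDecodeConfig(slot_len=12, num_plan_steps=2)

# Quality-critical preset: smaller slots, more planning steps.
careful = ReFusionDecodeConfig(slot_len=8, num_plan_steps=8, max_parallel_slots=8)
```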

Limitations & Future Work

  • Slot granularity trade‑off – Choosing a slot length is a hyperparameter; too large a slot can re‑introduce dependency errors, while too small a slot reduces parallel gains. Adaptive slot sizing is an open research direction.
  • Diffusion overhead for very long sequences – For documents exceeding several thousand tokens, the diffusion planning step can become a bottleneck; hierarchical planning may be needed.
  • Domain‑specific fine‑tuning – While the paper shows strong zero‑shot performance, fine‑tuning on niche domains (e.g., legal or medical text) may require additional strategies to preserve slot coherence.
  • Theoretical analysis – The paper provides empirical evidence but lacks a formal bound on the error introduced by slot‑level independence assumptions; future work could formalize these guarantees.

Overall, ReFusion pushes the frontier of fast, high‑quality language generation, offering a practical bridge between the speed of diffusion models and the reliability of autoregressive decoders—an attractive proposition for any developer building next‑generation AI products.

Authors

  • Jia‑Nan Li
  • Jian Guan
  • Wei Wu
  • Chongxuan Li

Paper Information

  • arXiv ID: 2512.13586v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: December 15, 2025
  • PDF: Download PDF