[Paper] DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

Published: December 17, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.15713v1

Overview

The paper DiffusionVL shows that any strong autoregressive (AR) vision‑language model (VLM) can be converted into a diffusion‑based VLM with only a modest fine‑tuning step. By exploiting the decoding properties of diffusion models, notably parallel, block‑wise token prediction, the authors obtain a family of “diffusion VLMs” (dVLMs) that substantially outperform prior diffusion VLMs, rival state‑of‑the‑art AR models, and decode roughly twice as fast as vanilla diffusion decoding.

Key Contributions

  • Universal translation pipeline – A simple fine‑tuning recipe that converts any pretrained AR VLM (e.g., LLaVA, MiniGPT‑4) into a diffusion vision‑language model (dVLM); a minimal training‑objective sketch follows this list.
  • Performance boost with tiny data – Trains on < 5 % of the data used by prior diffusion VLMs yet delivers a 34 %–38 % relative gain on major multimodal benchmarks (MMMU‑Pro, MME).
  • Block‑decoding architecture – Introduces a block‑wise decoding scheme that enables arbitrary‑length output, KV‑cache reuse, and a ≈2× inference speedup compared to vanilla diffusion decoding.
  • Competitive with AR instruction‑tuning – Shows that a directly converted AR model can rival LLaVA‑style visual‑instruction tuning without any specialized multimodal instruction data.
  • Open‑source release – Code, models, and training scripts are publicly available, facilitating rapid adoption and further research.
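
To make the translation recipe concrete, the snippet below is a minimal sketch of one plausible fine‑tuning objective, assuming a masked‑token (“absorbing state”) discrete text‑diffusion formulation. All names (`diffusion_finetune_step`, `model`, `mask_id`) are illustrative assumptions rather than the paper's actual API, and the exact loss weighting may differ from the authors' implementation, which the Methodology section below describes as a diffusion reconstruction loss combined with cross‑entropy on the clean tokens.

```python
import torch
import torch.nn.functional as F

def diffusion_finetune_step(model, visual_feats, token_ids, mask_id, vocab_size):
    """One hypothetical fine-tuning step for an AR VLM converted to a dVLM.

    A random fraction of target tokens is replaced with a [MASK] token and the
    (now non-causal) decoder is trained to reconstruct the clean tokens,
    conditioned on visual features. Names and signatures are illustrative only.
    """
    B, L = token_ids.shape
    # Sample a per-sequence noise level t ~ U(0, 1) and mask that fraction of tokens.
    t = torch.rand(B, 1, device=token_ids.device)
    is_masked = torch.rand(B, L, device=token_ids.device) < t
    noisy_ids = torch.where(is_masked, torch.full_like(token_ids, mask_id), token_ids)

    # The diffusion decoder predicts every position in parallel (no causal mask).
    logits = model(noisy_ids, visual_feats)  # (B, L, vocab_size)

    # Cross-entropy on the clean tokens, counted only at masked positions; this
    # plays the role of the diffusion reconstruction loss for discrete tokens.
    ce = F.cross_entropy(
        logits.reshape(-1, vocab_size), token_ids.reshape(-1), reduction="none"
    ).view(B, L)
    return (ce * is_masked).sum() / is_masked.sum().clamp(min=1)
```

Because the decoder weights are initialized from the AR backbone, only this lightweight objective and a small multimodal dataset are needed, consistent with the article's claim that less than 5 % of prior diffusion‑VLM training data suffices.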

Methodology

  1. Start from an AR VLM – The authors take a powerful language backbone already trained on large text corpora (e.g., LLaMA) and optionally equipped with visual adapters.
  2. Swap the decoder – The AR token‑by‑token decoder is replaced with a diffusion decoder that predicts noisy token embeddings and gradually denoises them over a fixed number of diffusion steps.
  3. Fine‑tune on multimodal data – Using a modest multimodal dataset (≈ 5 % of what previous diffusion VLMs used), the model learns to align visual features with the diffusion language space. The loss combines standard diffusion reconstruction loss with cross‑entropy on the final clean tokens.
  4. Block‑decoding trick – Instead of generating one token per diffusion step, the model predicts blocks of tokens (e.g., 8‑16 tokens) in parallel. The KV cache from previously finalized blocks is reused, dramatically cutting the number of diffusion passes needed for long outputs (see the decoding sketch after this list).
  5. Inference pipeline – At test time, the model runs a small number of diffusion steps per block, producing fluent, high‑quality captions, answers, or instructions conditioned on an image.
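
As a companion to steps 4‑5, here is a minimal, hypothetical sketch of block‑wise diffusion decoding with reuse of the key/value states of already‑finalized blocks. The method name `model.decode_block`, the cache format, and the confidence‑based unmasking schedule are assumptions for illustration; the paper's actual block size, step count, and caching mechanics may differ.

```python
import math
import torch

@torch.no_grad()
def block_diffusion_decode(model, visual_feats, mask_id, eos_id,
                           block_size=8, num_steps=4, max_blocks=32):
    """Hypothetical block-wise diffusion decoding with KV-cache reuse.

    Each block starts fully masked and is denoised in a few parallel steps;
    finalized blocks are never recomputed -- only their cached key/value
    states are attended to -- so long outputs need far fewer diffusion passes.
    """
    device = visual_feats.device
    generated = torch.empty(1, 0, dtype=torch.long, device=device)
    kv_cache = []  # key/value states of finalized blocks (assumed model-side format)

    for _ in range(max_blocks):
        block = torch.full((1, block_size), mask_id, dtype=torch.long, device=device)
        for step in range(num_steps):
            still_masked = block.eq(mask_id)
            if not still_masked.any():
                break
            # Only the current block is recomputed; earlier blocks come from the cache.
            logits, block_kv = model.decode_block(block, visual_feats, past_kv=kv_cache)
            conf, pred = logits.softmax(-1).max(-1)
            # Unmask the most confident positions so the block finishes in num_steps.
            k = math.ceil(int(still_masked.sum()) / (num_steps - step))
            thresh = conf[still_masked].topk(k).values[-1]
            block = torch.where(still_masked & (conf >= thresh), pred, block)
        kv_cache.append(block_kv)           # reuse this block's KV states later on
        generated = torch.cat([generated, block], dim=1)
        if (block == eos_id).any():         # stop once an end-of-sequence token appears
            break
    return generated
```

The speedup has two sources: each diffusion pass denoises a whole block rather than a single token, and finished blocks are served from the cache instead of being re‑run, which is consistent with the roughly 2× gain over vanilla diffusion decoding reported below.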

Results & Findings

| Benchmark | Metric | DiffusionVL (ours) vs. prior diffusion VLM | DiffusionVL (ours) vs. AR‑style VLM |
| --- | --- | --- | --- |
| MMMU‑Pro (vision) | Accuracy ↑ | +34.4 % | Comparable |
| MME (cognitive) | Score ↑ | +37.5 % | Near state‑of‑the‑art |
| Inference latency | Time per token | 2× faster than vanilla diffusion | Similar to AR |

  • Paradigm shift works – Moving from AR to diffusion yields a clear quality jump even when the underlying language model stays the same.
  • Direct conversion is viable – Simply swapping decoders and fine‑tuning already gives results on par with models that undergo extensive visual‑instruction tuning.
  • Speed‑efficiency – Block‑decoding largely closes the latency gap with AR decoding while preserving diffusion’s robustness.

Practical Implications

  • Rapid prototyping of multimodal assistants – Teams can take an existing LLM (e.g., LLaMA‑2) and, with a few hours of fine‑tuning, obtain a diffusion‑based VLM that is more stable for open‑ended generation (e.g., fewer hallucinations, smoother token distribution).
  • Cost‑effective training – Because only a fraction of multimodal data is needed, startups and research labs can build competitive VLMs without the massive data pipelines that dominate current diffusion VLM research.
  • Scalable generation for long outputs – Block‑decoding makes diffusion practical for tasks like report generation, code explanation, or multi‑step reasoning where output length can be hundreds of tokens.
  • Better integration with generative vision models – Diffusion VLMs naturally align with diffusion image generators (e.g., Stable Diffusion), opening doors for tightly coupled “image‑to‑text‑to‑image” loops in creative applications.
  • Open‑source foundation – The released repository provides a plug‑and‑play conversion script, lowering the barrier for developers to experiment with diffusion decoding in their own multimodal pipelines.

Limitations & Future Work

  • Diffusion step budget – Although block‑decoding speeds things up, diffusion still requires multiple denoising steps per block, which can be a bottleneck on low‑power devices.
  • Dependence on a strong AR backbone – The quality ceiling is tied to the original AR model; converting a weak AR VLM will not magically produce a strong dVLM.
  • Limited modality scope – The current work focuses on vision‑language; extending the translation pipeline to audio, video, or 3‑D data remains an open challenge.
  • Evaluation on downstream tasks – While benchmark scores are impressive, real‑world user studies (e.g., chat assistants, code assistants) are needed to confirm the perceived quality gains.

Future research directions include adaptive diffusion schedules to further cut inference time, multi‑modal diffusion pipelines that jointly denoise visual and textual streams, and exploring curriculum fine‑tuning to reduce the data requirement even more.

Authors

  • Lunbin Zeng
  • Jingfeng Yao
  • Bencheng Liao
  • Hongyuan Tao
  • Wenyu Liu
  • Xinggang Wang

Paper Information

  • arXiv ID: 2512.15713v1
  • Categories: cs.CV
  • Published: December 17, 2025
  • PDF: Download PDF