[Paper] DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

Published: December 17, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.15713v1

Overview

The paper DiffusionVL shows that any strong autoregressive (AR) vision‑language model (VLM) can be converted into a diffusion‑based VLM with only a modest fine‑tuning step. By exploiting the decoding properties of diffusion models, notably parallel, block‑wise token prediction, the authors obtain a family of “diffusion VLMs” (dVLMs) that substantially outperform prior diffusion VLMs, rival state‑of‑the‑art AR models, and decode roughly twice as fast as vanilla diffusion decoding.

Key Contributions

  • Universal translation pipeline – A simple fine‑tuning recipe that converts any pretrained AR VLM (e.g., LLaVA, MiniGPT‑4) into a diffusion vision‑language model (dVLM); a minimal training‑objective sketch follows this list.
  • Performance boost with tiny data – Trains on < 5 % of the data used by prior diffusion VLMs yet delivers a 34 %–38 % relative gain on major multimodal benchmarks (MMMU‑Pro, MME).
  • Block‑decoding architecture – Introduces a block‑wise decoding scheme that enables arbitrary‑length output, KV‑cache reuse, and a ≈2× inference speedup compared to vanilla diffusion decoding.
  • Competitive with AR instruction‑tuning – Shows that a directly converted AR model can rival LLaVA‑style visual‑instruction tuning without any specialized multimodal instruction data.
  • Open‑source release – Code, models, and training scripts are publicly available, facilitating rapid adoption and further research.
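
To make the translation recipe concrete, the snippet below is a minimal sketch of one plausible fine‑tuning objective, assuming a masked‑token (“absorbing state”) discrete text‑diffusion formulation. All names (`diffusion_finetune_step`, `model`, `mask_id`) are illustrative assumptions rather than the paper's actual API, and the exact loss weighting may differ from the authors' implementation, which the Methodology section below describes as a diffusion reconstruction loss combined with cross‑entropy on the clean tokens.

```python
import torch
import torch.nn.functional as F

def diffusion_finetune_step(model, visual_feats, token_ids, mask_id, vocab_size):
    """One hypothetical fine-tuning step for an AR VLM converted to a dVLM.

    A random fraction of target tokens is replaced with a [MASK] token and the
    (now non-causal) decoder is trained to reconstruct the clean tokens,
    conditioned on visual features. Names and signatures are illustrative only.
    """
    B, L = token_ids.shape
    # Sample a per-sequence noise level t ~ U(0, 1) and mask that fraction of tokens.
    t = torch.rand(B, 1, device=token_ids.device)
    is_masked = torch.rand(B, L, device=token_ids.device) < t
    noisy_ids = torch.where(is_masked, torch.full_like(token_ids, mask_id), token_ids)

    # The diffusion decoder predicts every position in parallel (no causal mask).
    logits = model(noisy_ids, visual_feats)  # (B, L, vocab_size)

    # Cross-entropy on the clean tokens, counted only at masked positions; this
    # plays the role of the diffusion reconstruction loss for discrete tokens.
    ce = F.cross_entropy(
        logits.reshape(-1, vocab_size), token_ids.reshape(-1), reduction="none"
    ).view(B, L)
    return (ce * is_masked).sum() / is_masked.sum().clamp(min=1)
```

Because the decoder weights are initialized from the AR backbone, only this lightweight objective and a small multimodal dataset are needed, consistent with the article's claim that less than 5 % of prior diffusion‑VLM training data suffices.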

Methodology

  1. Start from an AR VLM – The authors take a powerful language backbone already trained on large text corpora (e.g., LLaMA) and optionally equipped with visual adapters.
  2. Swap the decoder – The AR token‑by‑token decoder is replaced with a diffusion decoder that predicts noisy token embeddings and gradually denoises them over a fixed number of diffusion steps.
  3. Fine‑tune on multimodal data – Using a modest multimodal dataset (≈ 5 % of what previous diffusion VLMs used), the model learns to align visual features with the diffusion language space. The loss combines standard diffusion reconstruction loss with cross‑entropy on the final clean tokens.
  4. Block‑decoding trick – Instead of generating one token per diffusion step, the model predicts blocks of tokens (e.g., 8‑16 tokens) in parallel. The KV cache from previously finalized blocks is reused, dramatically cutting the number of diffusion passes needed for long outputs (see the decoding sketch after this list).
  5. Inference pipeline – At test time, the model runs a small number of diffusion steps per block, producing fluent, high‑quality captions, answers, or instructions conditioned on an image.
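
As a companion to steps 4‑5, here is a minimal, hypothetical sketch of block‑wise diffusion decoding with reuse of the key/value states of already‑finalized blocks. The method name `model.decode_block`, the cache format, and the confidence‑based unmasking schedule are assumptions for illustration; the paper's actual block size, step count, and caching mechanics may differ.

```python
import math
import torch

@torch.no_grad()
def block_diffusion_decode(model, visual_feats, mask_id, eos_id,
                           block_size=8, num_steps=4, max_blocks=32):
    """Hypothetical block-wise diffusion decoding with KV-cache reuse.

    Each block starts fully masked and is denoised in a few parallel steps;
    finalized blocks are never recomputed -- only their cached key/value
    states are attended to -- so long outputs need far fewer diffusion passes.
    """
    device = visual_feats.device
    generated = torch.empty(1, 0, dtype=torch.long, device=device)
    kv_cache = []  # key/value states of finalized blocks (assumed model-side format)

    for _ in range(max_blocks):
        block = torch.full((1, block_size), mask_id, dtype=torch.long, device=device)
        for step in range(num_steps):
            still_masked = block.eq(mask_id)
            if not still_masked.any():
                break
            # Only the current block is recomputed; earlier blocks come from the cache.
            logits, block_kv = model.decode_block(block, visual_feats, past_kv=kv_cache)
            conf, pred = logits.softmax(-1).max(-1)
            # Unmask the most confident positions so the block finishes in num_steps.
            k = math.ceil(int(still_masked.sum()) / (num_steps - step))
            thresh = conf[still_masked].topk(k).values[-1]
            block = torch.where(still_masked & (conf >= thresh), pred, block)
        kv_cache.append(block_kv)           # reuse this block's KV states later on
        generated = torch.cat([generated, block], dim=1)
        if (block == eos_id).any():         # stop once an end-of-sequence token appears
            break
    return generated
```

The speedup has two sources: each diffusion pass denoises a whole block rather than a single token, and finished blocks are served from the cache instead of being re‑run, which is consistent with the roughly 2× gain over vanilla diffusion decoding reported below.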

Results & Findings

| Benchmark | Metric | DiffusionVL (ours) vs. prior diffusion VLM | DiffusionVL (ours) vs. AR‑style VLM |
| --- | --- | --- | --- |
| MMMU‑Pro (vision) | Accuracy ↑ | +34.4 % | Comparable |
| MME (cognitive) | Score ↑ | +37.5 % | Near state‑of‑the‑art |
| Inference latency | Time per token | 2× faster than vanilla diffusion | Similar to AR |

  • Paradigm shift works – Moving from AR to diffusion yields a clear quality jump even when the underlying language model stays the same.
  • Direct conversion is viable – Simply swapping decoders and fine‑tuning already gives results on par with models that undergo extensive visual‑instruction tuning.
  • Speed‑efficiency – Block‑decoding largely closes the latency gap with AR decoding while preserving diffusion’s robustness.

Practical Implications

  • Rapid prototyping of multimodal assistants – Teams can take an existing LLM (e.g., LLaMA‑2) and, with a few hours of fine‑tuning, obtain a diffusion‑based VLM that is more stable for open‑ended generation (e.g., fewer hallucinations, smoother token distribution).
  • Cost‑effective training – Because only a fraction of multimodal data is needed, startups and research labs can build competitive VLMs without the massive data pipelines that dominate current diffusion VLM research.
  • Scalable generation for long outputs – Block‑decoding makes diffusion practical for tasks like report generation, code explanation, or multi‑step reasoning where output length can be hundreds of tokens.
  • Better integration with generative vision models – Diffusion VLMs naturally align with diffusion image generators (e.g., Stable Diffusion), opening doors for tightly coupled “image‑to‑text‑to‑image” loops in creative applications.
  • Open‑source foundation – The released repository provides a plug‑and‑play conversion script, lowering the barrier for developers to experiment with diffusion decoding in their own multimodal pipelines.

Limitations & Future Work

  • Diffusion step budget – Although block‑decoding speeds things up, diffusion still requires multiple denoising steps per block, which can be a bottleneck on low‑power devices.
  • Dependence on a strong AR backbone – The quality ceiling is tied to the original AR model; converting a weak AR VLM will not magically produce a strong dVLM.
  • Limited modality scope – The current work focuses on vision‑language; extending the translation pipeline to audio, video, or 3‑D data remains an open challenge.
  • Evaluation on downstream tasks – While benchmark scores are impressive, real‑world user studies (e.g., chat assistants, code assistants) are needed to confirm the perceived quality gains.

Future research directions include adaptive diffusion schedules to further cut inference time, multi‑modal diffusion pipelines that jointly denoise visual and textual streams, and exploring curriculum fine‑tuning to reduce the data requirement even more.

Authors

  • Lunbin Zeng
  • Jingfeng Yao
  • Bencheng Liao
  • Hongyuan Tao
  • Wenyu Liu
  • Xinggang Wang

Paper Information

  • arXiv ID: 2512.15713v1
  • Categories: cs.CV
  • Published: December 17, 2025
  • PDF: Download PDF