[Paper] Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
Source: arXiv - 2603.06577v1
Overview
Omni‑Diffusion is the first any‑to‑any multimodal model that replaces the classic autoregressive backbone with a mask‑based discrete diffusion engine. By treating text, speech, and images as sequences of discrete tokens, the model learns a single joint distribution that can both understand and generate across modalities, opening a new architectural direction for multimodal AI.
Key Contributions
- Unified diffusion backbone: Introduces a mask‑based discrete diffusion model that jointly models tokens from text, audio, and vision, eliminating the need for separate encoders/decoders per modality.
- Any‑to‑any capability: Supports arbitrary combinations of input and output modalities (e.g., text → image, speech → text, image + text → speech, etc.) within a single model.
- Scalable tokenization scheme: Leverages state‑of‑the‑art tokenizers (e.g., BPE for text, VQ‑GAN for images, and neural codec for speech) to convert continuous signals into a shared discrete space.
- Competitive performance: Achieves state‑of‑the‑art or comparable results on a wide range of multimodal benchmarks, often surpassing larger autoregressive MLLMs while using fewer parameters.
- Open‑source release: Provides code, pretrained checkpoints, and a demo web UI, encouraging community adoption and further research.
Methodology
1. Tokenization – Each modality is first transformed into a sequence of discrete tokens:
- Text → byte‑pair‑encoding (BPE) tokens.
- Images → VQ‑GAN or similar vector‑quantized codebooks.
- Speech → neural audio codec (e.g., Encodec) tokens.
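A minimal sketch of how per-modality tokenizers could be merged into one shared discrete vocabulary by offsetting each codebook's ID range. The vocabulary sizes and offset scheme are illustrative assumptions, not the paper's actual configuration:

```python
# Hypothetical sketch: map per-modality token IDs into one shared
# vocabulary by offsetting each modality's codebook range.
# All sizes below are illustrative, not taken from the paper.

TEXT_VOCAB = 32_000    # e.g. a BPE vocabulary
IMAGE_VOCAB = 8_192    # e.g. a VQ-GAN codebook
SPEECH_VOCAB = 1_024   # e.g. a neural-codec codebook

OFFSETS = {
    "text": 0,
    "image": TEXT_VOCAB,
    "speech": TEXT_VOCAB + IMAGE_VOCAB,
}

def to_shared_ids(modality: str, token_ids: list[int]) -> list[int]:
    """Shift modality-local token IDs into the shared ID space."""
    off = OFFSETS[modality]
    return [off + t for t in token_ids]

def from_shared_ids(shared_ids: list[int]) -> list[tuple[str, int]]:
    """Recover (modality, local_id) pairs from shared IDs."""
    bounds = [
        ("text", 0, TEXT_VOCAB),
        ("image", TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB),
        ("speech", TEXT_VOCAB + IMAGE_VOCAB,
         TEXT_VOCAB + IMAGE_VOCAB + SPEECH_VOCAB),
    ]
    out = []
    for s in shared_ids:
        for name, lo, hi in bounds:
            if lo <= s < hi:
                out.append((name, s - lo))
                break
    return out
```

With a layout like this, one transformer embedding table covers all three modalities, which is what lets a single network model the joint token distribution.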
2. Mask‑based Discrete Diffusion –
- The model starts from a fully masked token sequence.
- At each diffusion step, a learned denoising network predicts the original token for a randomly selected subset of masked positions, gradually “unmasking” the sequence.
- The denoising network is a transformer that receives the partially observed token stream together with a step‑embedding that tells it how many diffusion steps remain.
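The unmasking loop described above can be sketched as follows. The `dummy_denoiser`, the `MASK` sentinel, and the confidence-based commit schedule are illustrative stand-ins for the paper's transformer and masking scheme, not its actual algorithm:

```python
MASK = -1  # sentinel for a masked position (assumption; real models
           # typically reserve a dedicated [MASK] token ID)

def dummy_denoiser(seq):
    """Deterministic stand-in for the transformer: returns a
    (token, confidence) guess for every position. A real denoiser
    conditions on the visible tokens plus a diffusion-step embedding."""
    return [(i % 10, ((i * 37) % 100) / 100.0) for i in range(len(seq))]

def unmask(seq, num_steps=4):
    """Iterative unmasking: each step commits the predictions the
    denoiser is most confident about, until nothing is masked.
    Committing several positions per step is what enables the
    parallel decoding the paper highlights."""
    seq = list(seq)
    for step in range(num_steps):
        masked = [i for i, t in enumerate(seq) if t == MASK]
        if not masked:
            break
        preds = dummy_denoiser(seq)
        # commit roughly an equal share of remaining masks each step
        k = max(1, len(masked) // (num_steps - step))
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:k]:
            seq[i] = preds[i][0]
    # safety pass: fill anything still masked
    preds = dummy_denoiser(seq)
    return [preds[i][0] if t == MASK else t for i, t in enumerate(seq)]
```

Because many positions are filled per denoising pass, the total number of network calls stays fixed at `num_steps` regardless of sequence length, unlike token-by-token autoregressive decoding.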
3. Joint Distribution Learning – Because all modalities share the same token vocabulary, the diffusion process learns a single joint probability distribution $p(\mathbf{t}_{\text{text}}, \mathbf{t}_{\text{image}}, \mathbf{t}_{\text{speech}})$.
4. Task Specification via Masks – To perform a specific task, the user masks the tokens corresponding to the desired output modality (or modalities) while keeping the input tokens visible. The diffusion process then fills in the missing tokens, generating the target modality.
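A toy illustration of this masking convention: inputs stay visible, and a fully masked span is appended where the output should appear. The `BOS` marker IDs and the `MASK` sentinel are hypothetical, not taken from the paper:

```python
MASK = -1  # sentinel mask value (assumption; a reserved ID in practice)

# hypothetical per-modality "begin" markers in the shared vocabulary
BOS = {"text": 100, "image": 101, "speech": 102}

def build_task_sequence(inputs: dict[str, list[int]],
                        output_modality: str,
                        output_length: int) -> tuple[list[int], slice]:
    """Concatenate visible input tokens and append a fully masked span
    for the desired output modality. Returns the sequence plus the
    slice the diffusion process should fill in."""
    seq = []
    for modality, tokens in inputs.items():
        seq.append(BOS[modality])   # input tokens stay visible
        seq.extend(tokens)
    seq.append(BOS[output_modality])
    start = len(seq)
    seq.extend([MASK] * output_length)  # diffusion fills this span
    return seq, slice(start, start + output_length)
```

For example, text → image generation would pass text tokens as the visible input and request a masked span the length of the image token grid; speech → text simply swaps which modality is masked.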
5. Training – The model is trained on a large, heterogeneous dataset containing paired text‑image, text‑speech, image‑speech, and triple‑modal examples. The loss is the standard cross‑entropy between predicted and ground‑truth tokens at each diffusion step.
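One training step can be sketched in plain Python: corrupt a sequence by masking a random fraction of positions, then score the model's predictions with cross-entropy at exactly those positions. `MASK_ID`, the masking ratio, and the helper names are assumptions for illustration, not the paper's implementation:

```python
import math
import random

MASK_ID = -1  # sentinel mask token (assumption; real models reserve an ID)

def corrupt(tokens, mask_ratio, rng):
    """Randomly mask a fraction of positions; training asks the model
    to recover the original tokens at exactly those positions."""
    k = max(1, int(mask_ratio * len(tokens)))
    positions = rng.sample(range(len(tokens)), k)
    corrupted = list(tokens)
    for i in positions:
        corrupted[i] = MASK_ID
    return corrupted, positions

def masked_cross_entropy(logits, targets, mask_positions):
    """Average cross-entropy over the masked positions only (plain-Python
    sketch; in practice this is a framework op like F.cross_entropy)."""
    total = 0.0
    for i in mask_positions:
        row = logits[i]
        m = max(row)  # max-subtraction for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[targets[i]]
    return total / len(mask_positions)
```

Sampling the mask ratio per example effectively trains the denoiser for every diffusion step at once, which is a common property of mask-based discrete diffusion objectives.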
Results & Findings
| Benchmark | Task | Omni‑Diffusion (metric change) | Comparison vs. prior autoregressive SOTA |
|---|---|---|---|
| COCO Captions | Image → Text | BLEU‑4 ↑ 1.2% | Comparable |
| MS‑COCO Image Generation | Text → Image | FID ↓ 4.5 | Better |
| Speech‑to‑Text (LibriSpeech) | Speech → Text | WER ↓ 3.1% | Slightly better |
| AudioCaps | Image + Text → Speech | MOS ↑ 0.15 | First reported result |
| Multi‑modal Retrieval (MME) | Mixed modalities | Recall@1 ↑ 2.8% | On par |
- Efficiency: Despite using a diffusion process (typically slower than autoregressive decoding), the mask‑based design allows parallel token prediction for large unmasked blocks, reducing inference latency by ~30 % compared with token‑by‑token generation.
- Parameter Economy: Omni‑Diffusion matches or exceeds performance of models with 2–3× more parameters, suggesting diffusion’s stronger inductive bias for multimodal alignment.
Practical Implications
- Unified API for developers – One model can serve as a “Swiss‑army knife” for multimodal applications: generate images from prompts, transcribe audio, create captions, or even synthesize speech from a combination of visual and textual cues, all via a single endpoint.
- Simplified deployment – Maintaining a single backbone reduces engineering overhead (no need to orchestrate separate vision, language, and audio models).
- Better data efficiency – The joint diffusion training leverages cross‑modal signals, meaning fewer labeled examples are required to achieve high performance on a new modality pair.
- Potential for on‑device use – The parallel unmasking step and modest parameter count make it feasible to run trimmed versions on edge devices for tasks like real‑time captioning or voice‑controlled UI generation.
- Creative tooling – Artists and content creators can experiment with “any‑to‑any” generation (e.g., feed a sketch and a spoken description to obtain a narrated illustration) without stitching together multiple models.
Limitations & Future Work
- Inference speed on long sequences – While parallel unmasking helps, diffusion still requires multiple denoising passes, which can be a bottleneck for very high‑resolution images or long audio clips.
- Tokenization artifacts – Discrete tokenizers can introduce quantization loss, especially for high‑fidelity audio; future work may explore hybrid continuous‑discrete diffusion.
- Dataset bias – The training data is dominated by English text and Western visual content, limiting performance on low‑resource languages or culturally specific imagery.
- Scalability to more modalities – Extending the framework to video, 3‑D point clouds, or sensor data will require larger token vocabularies and more sophisticated masking strategies.
The authors plan to address these points by optimizing the diffusion schedule, integrating learned tokenizers, and expanding the multimodal pre‑training corpus.
Authors
- Lijiang Li
- Zuwei Long
- Yunhang Shen
- Heting Gao
- Haoyu Cao
- Xing Sun
- Caifeng Shan
- Ran He
- Chaoyou Fu
Paper Information
- arXiv ID: 2603.06577v1
- Categories: cs.CV
- Published: March 6, 2026