[Paper] Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
Source: arXiv - 2603.06577v1
Overview
Omni‑Diffusion is the first any‑to‑any multimodal model that replaces the classic autoregressive backbone with a mask‑based discrete diffusion engine. By treating text, speech, and images as sequences of discrete tokens, the model learns a single joint distribution that can both understand and generate across modalities, opening a new architectural direction for multimodal AI.
Key Contributions
- Unified diffusion backbone: Introduces a mask‑based discrete diffusion model that jointly models tokens from text, audio, and vision, eliminating the need for separate encoders/decoders per modality.
- Any‑to‑any capability: Supports arbitrary combinations of input and output modalities (e.g., text → image, speech → text, image + text → speech, etc.) within a single model.
- Scalable tokenization scheme: Leverages state‑of‑the‑art tokenizers (e.g., BPE for text, VQ‑GAN for images, and neural codec for speech) to convert continuous signals into a shared discrete space.
- Competitive performance: Achieves state‑of‑the‑art or comparable results on a wide range of multimodal benchmarks, often surpassing larger autoregressive MLLMs while using fewer parameters.
- Open‑source release: Provides code, pretrained checkpoints, and a demo web UI, encouraging community adoption and further research.
Methodology
1. Tokenization – Each modality is first transformed into a sequence of discrete tokens:
- Text → byte‑pair‑encoding (BPE) tokens.
- Images → VQ‑GAN or similar vector‑quantized codebooks.
- Speech → neural audio codec (e.g., Encodec) tokens.
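A minimal sketch of how per-modality tokenizers could be merged into one shared discrete vocabulary by offsetting each codebook's ID range. The vocabulary sizes and offset scheme are illustrative assumptions, not the paper's actual configuration:

```python
# Hypothetical sketch: map per-modality token IDs into one shared
# vocabulary by offsetting each modality's codebook range.
# All sizes below are illustrative, not taken from the paper.

TEXT_VOCAB = 32_000    # e.g. a BPE vocabulary
IMAGE_VOCAB = 8_192    # e.g. a VQ-GAN codebook
SPEECH_VOCAB = 1_024   # e.g. a neural-codec codebook

OFFSETS = {
    "text": 0,
    "image": TEXT_VOCAB,
    "speech": TEXT_VOCAB + IMAGE_VOCAB,
}

def to_shared_ids(modality: str, token_ids: list[int]) -> list[int]:
    """Shift modality-local token IDs into the shared ID space."""
    off = OFFSETS[modality]
    return [off + t for t in token_ids]

def from_shared_ids(shared_ids: list[int]) -> list[tuple[str, int]]:
    """Recover (modality, local_id) pairs from shared IDs."""
    bounds = [
        ("text", 0, TEXT_VOCAB),
        ("image", TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB),
        ("speech", TEXT_VOCAB + IMAGE_VOCAB,
         TEXT_VOCAB + IMAGE_VOCAB + SPEECH_VOCAB),
    ]
    out = []
    for s in shared_ids:
        for name, lo, hi in bounds:
            if lo <= s < hi:
                out.append((name, s - lo))
                break
    return out
```

With a layout like this, one transformer embedding table covers all three modalities, which is what lets a single network model the joint token distribution.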
2. Mask‑based Discrete Diffusion –
- The model starts from a fully masked token sequence.
- At each diffusion step, a learned denoising network predicts the original token for a randomly selected subset of masked positions, gradually “unmasking” the sequence.
- The denoising network is a transformer that receives the partially observed token stream together with a step‑embedding that tells it how many diffusion steps remain.
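The unmasking loop described above can be sketched as follows. The `dummy_denoiser`, the `MASK` sentinel, and the confidence-based commit schedule are illustrative stand-ins for the paper's transformer and masking scheme, not its actual algorithm:

```python
MASK = -1  # sentinel for a masked position (assumption; real models
           # typically reserve a dedicated [MASK] token ID)

def dummy_denoiser(seq):
    """Deterministic stand-in for the transformer: returns a
    (token, confidence) guess for every position. A real denoiser
    conditions on the visible tokens plus a diffusion-step embedding."""
    return [(i % 10, ((i * 37) % 100) / 100.0) for i in range(len(seq))]

def unmask(seq, num_steps=4):
    """Iterative unmasking: each step commits the predictions the
    denoiser is most confident about, until nothing is masked.
    Committing several positions per step is what enables the
    parallel decoding the paper highlights."""
    seq = list(seq)
    for step in range(num_steps):
        masked = [i for i, t in enumerate(seq) if t == MASK]
        if not masked:
            break
        preds = dummy_denoiser(seq)
        # commit roughly an equal share of remaining masks each step
        k = max(1, len(masked) // (num_steps - step))
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:k]:
            seq[i] = preds[i][0]
    # safety pass: fill anything still masked
    preds = dummy_denoiser(seq)
    return [preds[i][0] if t == MASK else t for i, t in enumerate(seq)]
```

Because many positions are filled per denoising pass, the total number of network calls stays fixed at `num_steps` regardless of sequence length, unlike token-by-token autoregressive decoding.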
3. Joint Distribution Learning – Because all modalities share the same token vocabulary, the diffusion process learns a single joint probability distribution $p(\mathbf{t}_{\text{text}}, \mathbf{t}_{\text{image}}, \mathbf{t}_{\text{speech}})$.
4. Task Specification via Masks – To perform a specific task, the user masks the tokens corresponding to the desired output modality (or modalities) while keeping the input tokens visible. The diffusion process then fills in the missing tokens, generating the target modality.
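A toy illustration of this masking convention: inputs stay visible, and a fully masked span is appended where the output should appear. The `BOS` marker IDs and the `MASK` sentinel are hypothetical, not taken from the paper:

```python
MASK = -1  # sentinel mask value (assumption; a reserved ID in practice)

# hypothetical per-modality "begin" markers in the shared vocabulary
BOS = {"text": 100, "image": 101, "speech": 102}

def build_task_sequence(inputs: dict[str, list[int]],
                        output_modality: str,
                        output_length: int) -> tuple[list[int], slice]:
    """Concatenate visible input tokens and append a fully masked span
    for the desired output modality. Returns the sequence plus the
    slice the diffusion process should fill in."""
    seq = []
    for modality, tokens in inputs.items():
        seq.append(BOS[modality])   # input tokens stay visible
        seq.extend(tokens)
    seq.append(BOS[output_modality])
    start = len(seq)
    seq.extend([MASK] * output_length)  # diffusion fills this span
    return seq, slice(start, start + output_length)
```

For example, text → image generation would pass text tokens as the visible input and request a masked span the length of the image token grid; speech → text simply swaps which modality is masked.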
5. Training – The model is trained on a large, heterogeneous dataset containing paired text‑image, text‑speech, image‑speech, and triple‑modal examples. The loss is the standard cross‑entropy between predicted and ground‑truth tokens at each diffusion step.
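One training step can be sketched in plain Python: corrupt a sequence by masking a random fraction of positions, then score the model's predictions with cross-entropy at exactly those positions. `MASK_ID`, the masking ratio, and the helper names are assumptions for illustration, not the paper's implementation:

```python
import math
import random

MASK_ID = -1  # sentinel mask token (assumption; real models reserve an ID)

def corrupt(tokens, mask_ratio, rng):
    """Randomly mask a fraction of positions; training asks the model
    to recover the original tokens at exactly those positions."""
    k = max(1, int(mask_ratio * len(tokens)))
    positions = rng.sample(range(len(tokens)), k)
    corrupted = list(tokens)
    for i in positions:
        corrupted[i] = MASK_ID
    return corrupted, positions

def masked_cross_entropy(logits, targets, mask_positions):
    """Average cross-entropy over the masked positions only (plain-Python
    sketch; in practice this is a framework op like F.cross_entropy)."""
    total = 0.0
    for i in mask_positions:
        row = logits[i]
        m = max(row)  # max-subtraction for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[targets[i]]
    return total / len(mask_positions)
```

Sampling the mask ratio per example effectively trains the denoiser for every diffusion step at once, which is a common property of mask-based discrete diffusion objectives.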
Results & Findings
| Benchmark | Task | Omni‑Diffusion (metric change) | Comparison vs. prior autoregressive SOTA |
|---|---|---|---|
| COCO Captions | Image → Text | BLEU‑4 ↑ 1.2% | Comparable |
| MS‑COCO Image Generation | Text → Image | FID ↓ 4.5 | Better |
| Speech‑to‑Text (LibriSpeech) | Speech → Text | WER ↓ 3.1% | Slightly better |
| AudioCaps | Image + Text → Speech | MOS ↑ 0.15 | First reported result |
| Multi‑modal Retrieval (MME) | Mixed modalities | Recall@1 ↑ 2.8% | On par |
- Efficiency: Despite using a diffusion process (typically slower than autoregressive decoding), the mask‑based design allows parallel token prediction for large unmasked blocks, reducing inference latency by ~30 % compared with token‑by‑token generation.
- Parameter Economy: Omni‑Diffusion matches or exceeds performance of models with 2–3× more parameters, suggesting diffusion’s stronger inductive bias for multimodal alignment.
Practical Implications
- Unified API for developers – One model can serve as a “Swiss‑army knife” for multimodal applications: generate images from prompts, transcribe audio, create captions, or even synthesize speech from a combination of visual and textual cues, all via a single endpoint.
- Simplified deployment – Maintaining a single backbone reduces engineering overhead (no need to orchestrate separate vision, language, and audio models).
- Better data efficiency – The joint diffusion training leverages cross‑modal signals, meaning fewer labeled examples are required to achieve high performance on a new modality pair.
- Potential for on‑device use – The parallel unmasking step and modest parameter count make it feasible to run trimmed versions on edge devices for tasks like real‑time captioning or voice‑controlled UI generation.
- Creative tooling – Artists and content creators can experiment with “any‑to‑any” generation (e.g., feed a sketch and a spoken description to obtain a narrated illustration) without stitching together multiple models.
Limitations & Future Work
- Inference speed on long sequences – While parallel unmasking helps, diffusion still requires multiple denoising passes, which can be a bottleneck for very high‑resolution images or long audio clips.
- Tokenization artifacts – Discrete tokenizers can introduce quantization loss, especially for high‑fidelity audio; future work may explore hybrid continuous‑discrete diffusion.
- Dataset bias – The training data is dominated by English text and Western visual content, limiting performance on low‑resource languages or culturally specific imagery.
- Scalability to more modalities – Extending the framework to video, 3‑D point clouds, or sensor data will require larger token vocabularies and more sophisticated masking strategies.
The authors plan to address these points by optimizing the diffusion schedule, integrating learned tokenizers, and expanding the multimodal pre‑training corpus.
Authors
- Lijiang Li
- Zuwei Long
- Yunhang Shen
- Heting Gao
- Haoyu Cao
- Xing Sun
- Caifeng Shan
- Ran He
- Chaoyou Fu
Paper Information
- arXiv ID: 2603.06577v1
- Categories: cs.CV
- Published: March 6, 2026