[Paper] Bridging the Discrete-Continuous Gap: Unified Multimodal Generation via Coupled Manifold Discrete Absorbing Diffusion
Source: arXiv - 2601.04056v1
Overview
The paper tackles a long‑standing split in generative AI: autoregressive models dominate discrete data like text, while diffusion models excel at continuous data such as images. The authors introduce CoM‑DAD (Coupled Manifold Discrete Absorbing Diffusion), a unified probabilistic framework that simultaneously handles text and images by separating high‑level semantic planning (continuous diffusion) from low‑level token synthesis (discrete absorbing diffusion). This bridges the “discrete‑continuous gap” and opens the door to more stable, scalable multimodal generators.
Key Contributions
- Unified dual‑process formulation: Combines a continuous latent diffusion for semantic planning with a discrete absorbing diffusion for token‑level generation.
- Variable‑Rate Noise Schedule: Dynamically adjusts noise intensity during discrete diffusion, improving generation fidelity and training stability.
- Stochastic Mixed‑Modal Transport: Aligns text and image modalities without heavyweight contrastive dual‑encoders, using a lightweight stochastic transport operator.
- Hierarchical decoupling: Separates semantic “what to say/paint” from the actual token/patch synthesis, enabling bidirectional context (like MLMs) while preserving diffusion‑style quality.
- Empirical superiority: Demonstrates higher stability and better quality on standard text‑to‑image benchmarks compared with masked language models and conventional diffusion pipelines.
Methodology
1. Semantic Manifold Diffusion
- A continuous diffusion process runs in a latent space (e.g., CLIP‑style embeddings).
- It gradually denoises a random vector into a high‑level semantic representation that captures the joint meaning of the target text and image.
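To make this stage concrete, here is a minimal DDPM-style sketch of sampling a semantic latent from pure noise. The latent width, step count, noise schedule, and the toy MLP denoiser are illustrative assumptions, not the paper's actual latent-diffusion network.

```python
import torch
import torch.nn as nn

LATENT_DIM = 512   # CLIP-style embedding width (assumption)
T_CONT = 100       # number of continuous diffusion steps (assumption)

class SemanticDenoiser(nn.Module):
    """Toy noise-prediction MLP standing in for the paper's latent-diffusion network."""
    def __init__(self, dim=LATENT_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 1024), nn.SiLU(), nn.Linear(1024, dim)
        )

    def forward(self, z_t, t):
        # Condition on the normalized timestep via concatenation.
        t_feat = t.float().view(-1, 1) / T_CONT
        return self.net(torch.cat([z_t, t_feat], dim=-1))

@torch.no_grad()
def sample_semantic_latent(denoiser, batch=1):
    """DDPM-style ancestral sampling: Gaussian noise -> high-level semantic latent."""
    betas = torch.linspace(1e-4, 0.02, T_CONT)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    z = torch.randn(batch, LATENT_DIM)                   # start from pure noise
    for t in reversed(range(T_CONT)):
        eps = denoiser(z, torch.full((batch,), t))
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        z = (z - coef * eps) / torch.sqrt(alphas[t])     # reverse-step posterior mean
        if t > 0:
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)
    return z  # joint text-image semantic representation

z_sem = sample_semantic_latent(SemanticDenoiser())
```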
2. Discrete Absorbing Diffusion
- Tokens (words, image patches, or other discrete symbols) follow a Markov chain whose forward process progressively “absorbs” them into a masked state; generation runs this chain in reverse.
- At each step, a Variable‑Rate Noise Schedule injects noise proportional to the current semantic prior, allowing the model to focus on coarse semantics early and fine details later.
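The sketch below illustrates the absorbing (“mask”) mechanics: a forward step that absorbs tokens at a variable rate, and a reverse step that commits a fraction of masked positions. The confidence-based reveal rule is an assumption standing in for the paper's Variable-Rate Noise Schedule, whose exact form is not reproduced here.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # absorbing state (assumption: id 0 is reserved for [MASK])

def forward_absorb(tokens, mask_rate):
    """Forward process: each token independently falls into the absorbing
    state with probability `mask_rate` (the rate may vary per step)."""
    drop = torch.rand(tokens.shape) < mask_rate
    return torch.where(drop, torch.full_like(tokens, MASK_ID), tokens)

@torch.no_grad()
def reverse_unmask_step(logits, tokens, reveal_frac):
    """One reverse step: commit a fraction of the currently masked positions,
    most-confident first (a simple stand-in for the variable-rate reveal)."""
    probs = F.softmax(logits, dim=-1)              # (B, L, V)
    conf, pred = probs.max(dim=-1)                 # (B, L)
    masked = tokens == MASK_ID
    if not masked.any():
        return tokens
    conf = conf.masked_fill(~masked, -1.0)         # rank only masked slots
    k = max(1, int(reveal_frac * masked.sum().item()))
    idx = conf.flatten().topk(k).indices           # global top-k for simplicity
    out = tokens.clone().flatten()
    out[idx] = pred.flatten()[idx]
    return out.view_as(tokens)
```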
3. Coupling via Stochastic Mixed‑Modal Transport
- The continuous semantic latent conditions the discrete diffusion through a stochastic transport operator that maps semantic vectors onto token‑level probability distributions.
- This coupling is lightweight: it avoids training two large contrastive encoders and instead learns a shared transport matrix that is updated jointly with the diffusion networks.
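One possible reading of the transport operator is a small learned map from the semantic latent to a categorical prior over tokens at each position. The low-rank parameterization, vocabulary size, and sequence length below are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class StochasticTransport(nn.Module):
    """Lightweight latent -> token-distribution map (illustrative sketch)."""
    def __init__(self, latent_dim=512, vocab_size=16384, seq_len=256, rank=64):
        super().__init__()
        # Shared low-rank "transport matrix": latent -> per-position features -> logits.
        self.to_positions = nn.Linear(latent_dim, seq_len * rank)
        self.to_vocab = nn.Linear(rank, vocab_size)
        self.seq_len, self.rank = seq_len, rank

    def forward(self, z_sem, temperature=1.0):
        # (B, latent_dim) -> (B, seq_len, rank) -> (B, seq_len, vocab_size)
        h = self.to_positions(z_sem).view(-1, self.seq_len, self.rank)
        logits = self.to_vocab(h) / temperature
        # "Stochastic" coupling: a categorical prior over tokens at each slot,
        # used to condition the discrete absorbing diffusion.
        return torch.distributions.Categorical(logits=logits)

transport = StochasticTransport()
prior = transport(torch.randn(2, 512))
token_draft = prior.sample()   # (2, 256) token ids conditioning the discrete stage
```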
4. Training Loop
- The model is trained end‑to‑end using a variational lower‑bound on the joint likelihood of text and image tokens.
- Gradient‑based optimization jointly updates the continuous diffusion UNet, the discrete absorbing diffusion transformer, and the transport operator.
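The schematic training step below pairs a noise-prediction MSE for the continuous stage with a masked-token cross-entropy for the discrete stage; these are stand-ins for the paper's variational-bound terms, reusing the illustrative constants from the sketches above. `token_model` (the discrete absorbing-diffusion transformer) and its signature are assumptions.

```python
import torch
import torch.nn.functional as F

MASK_ID, T_CONT = 0, 100   # same illustrative constants as in the sketches above

def training_step(denoiser, transport, token_model, optimizer, batch):
    """One joint update of the continuous denoiser, the transport operator,
    and the discrete token model (hypothetical `token_model` signature)."""
    z0, tokens = batch["semantic_latent"], batch["tokens"]   # ground-truth pair

    # Continuous stage: predict the noise injected at a random timestep.
    t = torch.randint(0, T_CONT, (z0.size(0),))
    betas = torch.linspace(1e-4, 0.02, T_CONT)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].unsqueeze(-1)
    noise = torch.randn_like(z0)
    z_t = torch.sqrt(alpha_bar) * z0 + torch.sqrt(1.0 - alpha_bar) * noise
    loss_cont = F.mse_loss(denoiser(z_t, t), noise)

    # Discrete stage: absorb tokens at a random rate, then reconstruct them,
    # conditioned on the transport prior derived from the clean latent.
    rate = torch.rand(()).item()
    drop = torch.rand(tokens.shape) < rate
    noisy = torch.where(drop, torch.full_like(tokens, MASK_ID), tokens)
    logits = token_model(noisy, transport(z0).logits)        # (B, L, V)
    masked = noisy == MASK_ID
    loss_disc = F.cross_entropy(logits[masked], tokens[masked])

    loss = loss_cont + loss_disc          # stand-in for the joint variational bound
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```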
5. Inference
- Sample the semantic latent via continuous diffusion → feed the latent into the discrete diffusion → generate a sequence of tokens that can be decoded into both text and image (e.g., using a VQ‑GAN decoder for images).
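Chaining the pieces gives the end-to-end sampling sketch below. The step count, the text/image split of the token sequence, and the `decode_image` call are assumptions; it reuses `sample_semantic_latent` and `reverse_unmask_step` from the earlier sketches.

```python
import torch

MASK_ID = 0  # same absorbing-state id as above

@torch.no_grad()
def generate(denoiser, transport, token_model, decode_image, steps=30):
    # 1) Continuous stage: sample the joint semantic latent (sketch above).
    z_sem = sample_semantic_latent(denoiser)

    # 2) Map the latent to a token-level prior and start fully absorbed.
    prior_logits = transport(z_sem).logits                   # (1, L, V)
    tokens = torch.full(prior_logits.shape[:2], MASK_ID, dtype=torch.long)

    # 3) Discrete stage: iteratively reveal tokens, guided by the prior
    #    (reverse_unmask_step is the helper from the sketch above).
    for _ in range(steps):
        logits = token_model(tokens, prior_logits)
        tokens = reverse_unmask_step(logits, tokens, reveal_frac=1.0 / steps)

    # 4) Split and decode: first span as text ids, the rest through an
    #    image decoder (the 64-token split is purely illustrative).
    text_ids, image_ids = tokens[:, :64], tokens[:, 64:]
    return text_ids, decode_image(image_ids)
```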
Results & Findings
| Metric | Baseline (Masked LM) | Baseline (Diffusion‑only) | CoM‑DAD |
|---|---|---|---|
| FID (image quality) | 28.4 | 22.1 | 18.7 |
| BLEU‑4 (text relevance) | 24.3 | 19.8 | 27.5 |
| Training stability (gradient variance) | High variance, frequent divergence | Moderate variance | Low variance, smooth convergence |
| Sampling steps (lower is faster) | 12 (autoregressive) | 50 (diffusion) | 30 (dual‑process) |
- Higher fidelity: CoM‑DAD achieves a ~15 % reduction in FID over pure diffusion baselines, indicating sharper, more realistic images.
- Better text‑image alignment: BLEU‑4 improves by ~8 points over the diffusion‑only baseline, showing that the semantic manifold effectively guides token generation.
- Stability: The variable‑rate schedule eliminates the “mask collapse” seen in masked language models, leading to consistent training across random seeds.
Practical Implications
- Unified API for multimodal generation: Developers can call a single model to produce coherent text‑image pairs, simplifying pipelines for content creation, advertising, or UI mock‑up generation.
- Reduced infrastructure: By sharing a single latent diffusion backbone, teams can avoid maintaining separate autoregressive and diffusion services, cutting compute and storage costs.
- Fine‑grained control: The hierarchical design lets practitioners intervene at the semantic level (e.g., steer the latent with a prompt) without re‑training the entire token generator; a minimal sketch of such steering follows this list.
- Potential for other modalities: The transport mechanism is modality‑agnostic, so audio, video, or 3‑D data could be plugged into the same framework, enabling truly “one‑model‑fits‑all” generative systems.
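As one example of the semantic-level intervention mentioned above, here is a minimal sketch of nudging a sampled latent toward a prompt embedding, assuming the latent lives in a CLIP-like embedding space. The blending rule and the `encode_prompt` helper are hypothetical, not part of the paper's API.

```python
import torch
import torch.nn.functional as F

def steer_latent(z_sem, prompt_embedding, strength=0.3):
    """Blend a sampled semantic latent with a prompt direction on the unit
    sphere, then restore the original norm. Purely illustrative."""
    z = F.normalize(z_sem, dim=-1)
    p = F.normalize(prompt_embedding, dim=-1)
    steered = F.normalize((1.0 - strength) * z + strength * p, dim=-1)
    return steered * z_sem.norm(dim=-1, keepdim=True)

# Hypothetical usage (encode_prompt is not a real API):
# z_steered = steer_latent(z_sem, encode_prompt("a watercolor city skyline"))
# The steered latent then replaces z_sem in the inference sketch above.
```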
Limitations & Future Work
- Scalability to very large vocabularies: The discrete absorbing diffusion still scales linearly with token count; future work could explore hierarchical token vocabularies or sparsity tricks.
- Evaluation on diverse domains: Experiments focus on standard text‑to‑image datasets; broader benchmarks (e.g., medical imaging, code generation) are needed to confirm generality.
- Real‑time inference: Although faster than pure diffusion, the two‑stage sampling remains slower than pure autoregressive models; optimizing the transport step or distilling the pipeline could close this gap.
- Theoretical analysis: The paper provides empirical evidence of stability, but a deeper theoretical understanding of the variable‑rate schedule’s convergence properties would strengthen the framework.
Bottom line: CoM‑DAD offers a compelling blueprint for unifying discrete and continuous generative modeling, delivering higher quality multimodal outputs while simplifying the engineering stack—a development that could reshape how developers build AI‑powered creative tools.
Authors
- Yuanfeng Xu
- Yuhao Chen
- Liang Lin
- Guangrun Wang
Paper Information
- arXiv ID: 2601.04056v1
- Categories: cs.CL
- Published: January 7, 2026