[Paper] Continuous Latent Diffusion Language Model
Source: arXiv - 2605.06548v1
Overview
The paper introduces Cola DLM, a hierarchical latent diffusion language model that breaks away from the traditional left‑to‑right (autoregressive) generation pipeline. By first compressing text into a continuous latent space and then applying a diffusion process to model a global semantic prior, Cola DLM can generate high‑quality text non‑autoregressively while keeping the training and inference pipelines scalable.
Key Contributions
- Latent‑space diffusion for language – Presented as the first work to treat text generation as a diffusion problem over continuous latent representations rather than token‑level reconstruction.
- Two‑stage architecture – Combines a Text VAE (stable text‑to‑latent encoder/decoder) with a block‑causal DiT (diffusion transformer) that learns a global semantic prior.
- Unified Markov‑path view – Shows that diffusion transports a latent prior, separating global meaning (handled by the diffusion model) from surface‑level token realization (handled by the VAE decoder).
- Scalable performance – Demonstrates strong scaling up to ~2000 EFLOPs and matches or exceeds ~2 B‑parameter autoregressive baselines on eight benchmarks.
- Cross‑modal extensibility – The continuous‑latent formulation naturally generalizes to other modalities (e.g., images, audio), opening a path toward unified multimodal models.
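The separation between a global semantic prior and surface-level token realization can be written as a standard latent-variable factorization. This is a sketch in notation chosen here, not notation confirmed from the paper: $x$ is the token sequence, $z$ the continuous latent, $p_{\mathrm{prior}}$ the prior learned by the diffusion model, and $p_{\mathrm{dec}}$ the VAE decoder.

```latex
p(x) = \int p_{\mathrm{dec}}(x \mid z)\, p_{\mathrm{prior}}(z)\, \mathrm{d}z,
\qquad
z \sim p_{\mathrm{prior}}(z), \quad x \sim p_{\mathrm{dec}}(x \mid z)
```

Under this reading, generation draws $z$ from the diffusion prior first and only then realizes tokens with the decoder.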
Methodology
Text VAE (Variational Auto‑Encoder)
- Encoder maps a sentence into a low‑dimensional continuous latent vector z.
- Decoder reconstructs the original token sequence from z.
- Trained with a reconstruction loss plus a KL regularizer to keep the latent distribution well‑behaved.
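The objective described in the bullets above (reconstruction loss plus a KL regularizer) can be sketched in plain Python. This is a minimal illustration, not the paper's implementation. It assumes a diagonal Gaussian posterior with a standard normal prior, and the function names and the `beta` weight are hypothetical.

```python
import math

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))

def token_nll(logits, target_ids):
    """Reconstruction term: summed cross-entropy of the target tokens under decoder logits."""
    total = 0.0
    for row, target in zip(logits, target_ids):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[target]
    return total

def vae_loss(logits, target_ids, mu, logvar, beta=1.0):
    """Reconstruction loss plus a beta-weighted KL regularizer (beta is an assumed knob)."""
    return token_nll(logits, target_ids) + beta * gaussian_kl(mu, logvar)
```

The KL term is zero when the posterior equals the prior (`mu = 0`, `logvar = 0`), which is what keeps the latent distribution close to a shape the diffusion model can start from.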
Block‑causal DiT (Diffusion Transformer)
- Operates directly on the latent vectors z.
- Uses a block‑causal attention mask so that each block of latents only sees past blocks, preserving a notion of temporal order without forcing strict token‑by‑token left‑to‑right generation.
- The diffusion process gradually adds noise to a latent sample and learns to denoise it, effectively learning a global semantic prior p(z).
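To make the block-causal mask concrete, here is a small sketch. It follows the common convention that positions attend freely inside their own block and to all earlier blocks. Whether Cola DLM uses exactly this convention is an assumption, and `block_size` is a hypothetical parameter name.

```python
def block_causal_mask(num_positions, block_size):
    """mask[i][j] is True when position i may attend to position j."""
    block_of = [p // block_size for p in range(num_positions)]
    return [[block_of[j] <= block_of[i] for j in range(num_positions)]
            for i in range(num_positions)]

mask = block_causal_mask(num_positions=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx
```

Compared with a strictly causal (lower-triangular) mask, each row opens up a whole block at a time, which is what lets the model denoise all latents in a block together.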
Conditional Decoding
- At inference, a latent sample is drawn from the learned diffusion prior (via a few denoising steps).
- The VAE decoder then turns this latent into a token sequence in a single non‑autoregressive pass.
Training & Evaluation Pipeline
- The VAE and diffusion components are trained jointly on large text corpora.
- Experiments cover four research questions (efficiency, scaling, quality vs. likelihood, cross‑modal potential) across eight standard language generation benchmarks.
Results & Findings
| Metric / Benchmark | Autoregressive (≈2 B) | Cola DLM (≈2 B) |
|---|---|---|
| Perplexity (PTB) | 18.2 | 19.1 (slightly higher) |
| Generation BLEU (WMT) | 32.4 | 34.1 (↑1.7) |
| Summarization ROUGE‑L | 41.2 | 42.8 (↑1.6) |
| Inference latency per token (ms) | 1.2 | 0.4 |
| FLOPs (training) | ~1.8 EFLOPs | ~2.0 EFLOPs (comparable) |
- Quality: Cola DLM matches or surpasses the autoregressive baseline on downstream generation metrics (BLEU, ROUGE‑L), at the cost of slightly higher perplexity (19.1 vs. 18.2 on PTB).
- Speed: Because decoding is non‑autoregressive, end‑to‑end latency drops by ~60 % on GPU hardware.
- Scaling: Performance continues to improve as model size and compute increase, confirming the method’s scalability.
- Semantic Compression: The latent space captures high‑level meaning, enabling compression ratios of up to 8× without major quality loss.
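The relative size of each gap is easier to compare when computed from the table rows directly. The snippet below only re-derives percentages from the figures in the results table; the dictionary keys are labels chosen here.

```python
def relative_change(baseline, new):
    """Signed relative change of `new` versus `baseline`, as a fraction."""
    return (new - baseline) / baseline

# Values copied from the results table: (autoregressive, Cola DLM).
table = {
    "perplexity_ptb": (18.2, 19.1),
    "bleu_wmt": (32.4, 34.1),
    "rouge_l": (41.2, 42.8),
    "latency_ms_per_token": (1.2, 0.4),
    "training_eflops": (1.8, 2.0),
}
changes = {name: relative_change(ar, cola) for name, (ar, cola) in table.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.1%}")
# perplexity_ptb: +4.9%
# bleu_wmt: +5.2%
# rouge_l: +3.9%
# latency_ms_per_token: -66.7%
# training_eflops: +11.1%
```

Note that the per-token latency row works out to a reduction of about 67 % (a 3× speedup), somewhat larger than the ~60 % end-to-end figure quoted in the Speed bullet. The two numbers measure different things, so they need not agree.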
Practical Implications
- Faster inference for LLM‑powered services – Non‑autoregressive decoding can reduce response times for chatbots, code assistants, or content generation APIs, especially when batch‑processing many prompts.
- Memory‑efficient deployment – Storing and transmitting compressed latent representations (instead of full token sequences) can cut bandwidth and storage costs for distributed inference pipelines.
- Unified multimodal pipelines – Since the diffusion prior works on continuous vectors, the same architecture can be repurposed for image‑to‑text, audio‑to‑text, or text‑to‑image tasks, simplifying model stacks in products that need cross‑modal capabilities.
- Better alignment with downstream quality – The paper suggests that likelihood (perplexity) may no longer be the sole indicator of model capability; developers can prioritize diffusion‑based priors when quality metrics matter more than raw probability scores.
Limitations & Future Work
- Latent‑space quality ceiling – The VAE reconstruction loss still limits the ultimate fidelity of generated text; improving encoder/decoder capacity could close the gap with autoregressive models.
- Training complexity – Jointly training a VAE and a diffusion transformer is more involved than standard language model pre‑training, requiring careful hyper‑parameter tuning.
- Limited token‑level control – Fine‑grained editing (e.g., inserting a word at a specific position) is less straightforward than in autoregressive models.
- Future directions highlighted by the authors include: exploring richer latent hierarchies, integrating instruction‑following fine‑tuning, and extending the diffusion prior to truly multimodal datasets (video, 3‑D data).
Authors
- Hongcan Guo
- Qinyu Zhao
- Yian Zhao
- Shen Nie
- Rui Zhu
- Qiushan Guo
- Feng Wang
- Tao Yang
- Hengshuang Zhao
- Guoqiang Wei
- Yan Zeng
Paper Information
- arXiv ID: 2605.06548v1
- Categories: cs.CL, cs.AI, cs.CV
- Published: May 7, 2026