[Paper] Context Unrolling in Omni Models
Source: arXiv - 2604.21921v1
Overview
The paper introduces Omni, a single multimodal model trained from scratch on a wide spectrum of data: plain text, 2‑D images, video clips, 3‑D geometry, and even latent feature representations. By learning to handle all of these modalities together, Omni develops a novel “Context Unrolling” ability: it can internally reason across different representations before emitting a final output, leading to richer, more coherent predictions.
Key Contributions
- Unified multimodal training on five fundamentally different data types (text, image, video, 3‑D, latent features) without modality‑specific adapters.
- Context Unrolling mechanism that explicitly propagates information across modalities during inference, improving reasoning fidelity.
- State‑of‑the‑art results on both multimodal generation (e.g., text‑to‑image, video synthesis) and understanding benchmarks (e.g., video‑question answering, 3‑D shape retrieval).
- Demonstration of in‑context multimodal generation: a single query can trigger the model to output text, an image, a short video, or a 3‑D mesh depending on the prompt.
- Open‑source release of the model weights and training pipeline, enabling reproducibility and downstream fine‑tuning.
Methodology
Data Collection & Pre‑processing
- Curated large‑scale datasets for each modality (e.g., LAION‑5B for images, HowTo100M for video, ShapeNet for 3‑D).
- All inputs are projected into a shared token space using modality‑specific encoders (ViT for images, TimeSformer for video, PointNet++ for 3‑D, and a transformer for text).
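The shared token space can be pictured with a minimal sketch. The random projection matrices below are stand-ins for the paper's modality-specific encoders (ViT, TimeSformer, PointNet++, text transformer); the feature dimensions and token counts are illustrative assumptions, not values from the paper.

```python
import numpy as np

D_MODEL = 64  # dimensionality of the shared token space (illustrative)
rng = np.random.default_rng(0)

def project_tokens(features: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Map (num_tokens, feat_dim) modality features into the shared space."""
    return features @ proj  # -> (num_tokens, D_MODEL)

# Per-modality raw feature dims are illustrative stand-ins for encoder outputs.
feat_dims = {"text": 32, "image": 48, "video": 96, "mesh3d": 24}
projections = {m: rng.normal(size=(d, D_MODEL)) for m, d in feat_dims.items()}

# Fake inputs: 5 tokens per modality.
inputs = {m: rng.normal(size=(5, d)) for m, d in feat_dims.items()}

# Encode each modality, then concatenate into one joint token stream.
streams = [project_tokens(inputs[m], projections[m]) for m in feat_dims]
joint_tokens = np.concatenate(streams, axis=0)
print(joint_tokens.shape)  # (20, 64): 4 modalities x 5 tokens each
```

Once every modality lives in the same `D_MODEL`-dimensional space, a single transformer can attend across all of them without modality-specific adapters.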
Joint Transformer Backbone
- A single large transformer processes the concatenated token streams.
- Positional and modality embeddings let the model distinguish where each token comes from while still allowing cross‑modal attention.
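The embedding scheme can be sketched as follows. The sinusoidal positional encoding and the random per-modality vectors are our illustrative assumptions; the paper does not specify the exact parameterization.

```python
import numpy as np

D_MODEL = 8
MODALITIES = ["text", "image", "video", "mesh3d"]
rng = np.random.default_rng(1)
# Stand-in for a learned modality-embedding table.
modality_emb = {m: rng.normal(size=D_MODEL) for m in MODALITIES}

def add_embeddings(tokens: np.ndarray, modality: str) -> np.ndarray:
    """Add sinusoidal positions plus a per-modality vector to (n, D_MODEL) tokens."""
    n = tokens.shape[0]
    pos = np.arange(n)[:, None]
    i = np.arange(D_MODEL)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / D_MODEL)
    pe = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
    return tokens + pe + modality_emb[modality]

z = np.zeros((4, D_MODEL))
text_tok = add_embeddings(z, "text")
image_tok = add_embeddings(z, "image")
# Identical content, but the modality embedding keeps the streams distinguishable.
print(np.allclose(text_tok, image_tok))  # False
```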
Context Unrolling
- During forward passes, the model performs iterative cross‑modal attention cycles.
- After each cycle, intermediate representations are “unrolled” back into each modality’s decoder, letting the model refine its understanding before producing the final output.
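The unrolling loop described above can be sketched as iterated joint attention followed by a split back into per-modality streams. Single-head self-attention stands in for the paper's cross-modal attention cycle; the cycle count and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(tokens: np.ndarray) -> np.ndarray:
    """Single-head self-attention over the joint stream (stand-in for one cycle)."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    return softmax(scores) @ tokens

def context_unroll(streams: dict, num_cycles: int = 3) -> dict:
    """Mix all modalities jointly, then 'unroll' the refined tokens back
    into per-modality streams after each cycle."""
    names = list(streams)
    sizes = [streams[m].shape[0] for m in names]
    for _ in range(num_cycles):
        joint = np.concatenate([streams[m] for m in names], axis=0)
        mixed = cross_modal_attention(joint)
        parts = np.split(mixed, np.cumsum(sizes)[:-1], axis=0)
        streams = dict(zip(names, parts))
    return streams

streams = {m: rng.normal(size=(4, 16)) for m in ["text", "image", "video"]}
refined = context_unroll(streams)
print({m: v.shape for m, v in refined.items()})
```

Each modality's decoder would then consume its refined slice, which is how intermediate cross-modal evidence reaches the final output.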
Training Objective
- A mixture of contrastive losses (to align modalities) and generative losses (autoregressive text, diffusion‑based image/video, and implicit surface decoding for 3‑D).
- Curriculum learning gradually introduces more complex modalities, ensuring stable convergence.
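The contrastive half of the objective can be illustrated with a symmetric InfoNCE-style loss; the temperature, weighting, and loss shape here are generic assumptions rather than the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(3)

def info_nce(a: np.ndarray, b: np.ndarray, temperature: float = 0.1) -> float:
    """Symmetric InfoNCE-style contrastive loss aligning pairs a[i] <-> b[i]."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    labels = np.arange(len(a))
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-logp[labels, labels].mean())
    return 0.5 * (xent(logits) + xent(logits.T))

def total_loss(contrastive: float, generative: float, w: float = 0.5) -> float:
    """Weighted mixture of the alignment and generation objectives."""
    return w * contrastive + (1.0 - w) * generative

a = rng.normal(size=(8, 16))
aligned = info_nce(a, a)                          # perfectly paired embeddings
misaligned = info_nce(a, rng.normal(size=(8, 16)))
print(aligned < misaligned)  # True: alignment lowers the contrastive loss
```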
Inference
- Users supply a prompt with optional modality tags.
- The model runs a fixed number of unrolling steps, then decodes the requested output type(s).
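The inference flow above might look like the following sketch. The `[out:modality]` tag syntax and every function name here are hypothetical illustrations; the paper does not publish this API.

```python
import re

KNOWN_TAGS = {"text", "image", "video", "mesh3d"}

def parse_tags(prompt: str):
    """Split a prompt into clean text plus the set of requested output modalities."""
    tags = set(re.findall(r"\[out:(\w+)\]", prompt)) & KNOWN_TAGS
    clean = re.sub(r"\[out:\w+\]", "", prompt).strip()
    return clean, (tags or {"text"})  # default to text if no tag is given

def run_inference(prompt: str, unroll_steps: int = 4) -> dict:
    clean, tags = parse_tags(prompt)
    for _ in range(unroll_steps):  # fixed unrolling budget
        pass  # each cycle would refine the joint token state (omitted here)
    # Invoke only the requested stub decoders.
    return {m: f"<{m} decoded from: {clean}>" for m in tags}

out = run_inference("a red sports car [out:image] [out:video]")
print(sorted(out))  # ['image', 'video']
```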
Results & Findings
| Benchmark | Modality | Omni Score | Prior SOTA | Δ |
|---|---|---|---|---|
| COCO Caption (text‑image) | Text ↔ Image | 138.2 CIDEr | 132.5 | +5.7 |
| Kinetics‑600 Video QA | Video ↔ Text | 78.4% accuracy | 73.1% | +5.3% |
| ShapeNet Retrieval | 3‑D ↔ Text | 91.2% recall@1 | 86.8% | +4.4% |
| Text‑to‑Video Generation (MSVD) | Text → Video | FVD 210 | FVD 260 | −50 (lower FVD is better) |
| Multi‑modal In‑Context Generation | Mixed | Human evaluation 4.6/5 | 3.9/5 | +0.7 |
- Context Unrolling consistently improves cross‑modal alignment, especially on tasks that require reasoning over heterogeneous cues (e.g., answering a video question that references a 3‑D object shown in an image).
- The unified model matches or exceeds specialized models that were trained on a single modality, demonstrating that joint training does not sacrifice performance.
Practical Implications
- One‑stop AI service: Developers can expose a single API endpoint that handles text, image, video, and 3‑D generation or understanding, simplifying product architecture.
- Cross‑modal content creation: Content platforms can generate synchronized assets (e.g., a product description, a rendered image, a short demo video, and a 3‑D model) from a single prompt, cutting down on manual asset pipelines.
- Enhanced AR/VR pipelines: By feeding a textual scene description, Omni can output both the visual texture (image) and the spatial mesh (3‑D), accelerating prototyping for immersive experiences.
- Improved multimodal retrieval: Search engines can index and retrieve across modalities using a shared embedding space, enabling queries like “show me a video of a red sports car similar to this 3‑D model.”
- Reduced engineering overhead: Instead of maintaining separate models for each modality, teams can fine‑tune Omni on domain‑specific data (e.g., medical imaging + reports) with a single training run.
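The cross-modal retrieval scenario above reduces to nearest-neighbor search in the shared embedding space. A minimal sketch with random stand-in embeddings (the index contents and dimensions are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

def retrieve(query_emb: np.ndarray, index_embs: np.ndarray, top_k: int = 3):
    """Rank indexed items (any modality) by cosine similarity to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    idx = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    sims = idx @ q
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]

# Fake index of 6 items; images, videos, and meshes all live in the same space.
index_embs = rng.normal(size=(6, 32))
query = index_embs[2].copy()  # pretend a 3-D model embeds exactly like item 2
order, sims = retrieve(query, index_embs)
print(order[0], round(float(sims[0]), 3))  # 2 1.0
```

Because every modality shares one space, the same index answers text-to-video, image-to-mesh, or mesh-to-video queries without separate retrieval models.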
Limitations & Future Work
- Compute‑intensive training: Jointly training on five modalities required > 600 GPU‑days; smaller organizations may need to rely on the released checkpoints rather than training from scratch.
- Modal imbalance: The model still performs slightly worse on low‑resource modalities (e.g., 3‑D generation) when the training data is dominated by text and images.
- Real‑time constraints: The iterative unrolling steps add latency, making the current version less suitable for ultra‑low‑latency applications (e.g., live video chat).
- Future directions proposed by the authors include:
- Adaptive unrolling schedules that stop early when confidence is high, reducing inference time.
- Incorporating additional modalities such as audio and sensor streams.
- Exploring more efficient training regimes (e.g., mixture‑of‑experts) to lower the resource barrier.
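The proposed adaptive unrolling could take a form like the sketch below. The authors only suggest the direction; the stopping rule here (a small state change as a proxy for "confidence is high") is entirely our illustrative assumption.

```python
import numpy as np

def adaptive_unroll(step_fn, state, max_steps: int = 20, tol: float = 1e-2):
    """Run unrolling cycles but stop early once the state stops changing.
    The change-norm criterion is an illustrative confidence proxy."""
    for t in range(max_steps):
        new_state = step_fn(state)
        if np.linalg.norm(new_state - state) < tol:
            return new_state, t + 1  # converged early
        state = new_state
    return state, max_steps

# A contracting refinement step converges quickly, so unrolling stops early.
state, steps = adaptive_unroll(lambda s: 0.5 * s, np.ones(4))
print(steps)  # 8 (well under the max budget of 20)
```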
Omni’s ability to “think across senses” opens a new frontier for developers building truly multimodal AI products. By exposing a single, unified interface, it promises to streamline pipelines, cut costs, and enable richer user experiences across the web, mobile, and immersive platforms.
Authors
- Ceyuan Yang
- Zhijie Lin
- Yang Zhao
- Fei Xiao
- Hao He
- Qi Zhao
- Chaorui Deng
- Kunchang Li
- Zihan Ding
- Yuwei Guo
- Fuyun Wang
- Fangqi Zhu
- Xiaonan Nie
- Shenhan Zhu
- Shanchuan Lin
- Hongsheng Li
- Weilin Huang
- Guang Shi
- Haoqi Fan
Paper Information
- arXiv ID: 2604.21921v1
- Categories: cs.CV
- Published: April 23, 2026