[Paper] Context Unrolling in Omni Models
Source: arXiv - 2604.21921v1
Overview
The paper introduces Omni, a single multimodal model trained from scratch on a wide spectrum of data: plain text, 2‑D images, video clips, 3‑D geometry, and even latent feature representations. By learning to handle all of these modalities together, Omni develops a novel “Context Unrolling” ability: it can internally reason across different representations before emitting a final output, leading to richer, more coherent predictions.
Key Contributions
- Unified multimodal training on five fundamentally different data types (text, image, video, 3‑D, latent features) without modality‑specific adapters.
- Context Unrolling mechanism that explicitly propagates information across modalities during inference, improving reasoning fidelity.
- State‑of‑the‑art results on both multimodal generation (e.g., text‑to‑image, video synthesis) and understanding benchmarks (e.g., video‑question answering, 3‑D shape retrieval).
- Demonstration of in‑context multimodal generation: a single query can trigger the model to output text, an image, a short video, or a 3‑D mesh depending on the prompt.
- Open‑source release of the model weights and training pipeline, enabling reproducibility and downstream fine‑tuning.
Methodology
Data Collection & Pre‑processing
- Curated large‑scale datasets for each modality (e.g., LAION‑5B for images, HowTo100M for video, ShapeNet for 3‑D).
- All inputs are projected into a shared token space using modality‑specific encoders (ViT for images, TimeSformer for video, PointNet++ for 3‑D, and a transformer for text).
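The shared token space can be pictured with a minimal sketch. The random projection matrices below are stand-ins for the paper's modality-specific encoders (ViT, TimeSformer, PointNet++, text transformer); the feature dimensions and token counts are illustrative assumptions, not values from the paper.

```python
import numpy as np

D_MODEL = 64  # dimensionality of the shared token space (illustrative)
rng = np.random.default_rng(0)

def project_tokens(features: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Map (num_tokens, feat_dim) modality features into the shared space."""
    return features @ proj  # -> (num_tokens, D_MODEL)

# Per-modality raw feature dims are illustrative stand-ins for encoder outputs.
feat_dims = {"text": 32, "image": 48, "video": 96, "mesh3d": 24}
projections = {m: rng.normal(size=(d, D_MODEL)) for m, d in feat_dims.items()}

# Fake inputs: 5 tokens per modality.
inputs = {m: rng.normal(size=(5, d)) for m, d in feat_dims.items()}

# Encode each modality, then concatenate into one joint token stream.
streams = [project_tokens(inputs[m], projections[m]) for m in feat_dims]
joint_tokens = np.concatenate(streams, axis=0)
print(joint_tokens.shape)  # (20, 64): 4 modalities x 5 tokens each
```

Once every modality lives in the same `D_MODEL`-dimensional space, a single transformer can attend across all of them without modality-specific adapters.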
Joint Transformer Backbone
- A single large transformer processes the concatenated token streams.
- Positional and modality embeddings let the model distinguish where each token comes from while still allowing cross‑modal attention.
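The embedding scheme can be sketched as follows. The sinusoidal positional encoding and the random per-modality vectors are our illustrative assumptions; the paper does not specify the exact parameterization.

```python
import numpy as np

D_MODEL = 8
MODALITIES = ["text", "image", "video", "mesh3d"]
rng = np.random.default_rng(1)
# Stand-in for a learned modality-embedding table.
modality_emb = {m: rng.normal(size=D_MODEL) for m in MODALITIES}

def add_embeddings(tokens: np.ndarray, modality: str) -> np.ndarray:
    """Add sinusoidal positions plus a per-modality vector to (n, D_MODEL) tokens."""
    n = tokens.shape[0]
    pos = np.arange(n)[:, None]
    i = np.arange(D_MODEL)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / D_MODEL)
    pe = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
    return tokens + pe + modality_emb[modality]

z = np.zeros((4, D_MODEL))
text_tok = add_embeddings(z, "text")
image_tok = add_embeddings(z, "image")
# Identical content, but the modality embedding keeps the streams distinguishable.
print(np.allclose(text_tok, image_tok))  # False
```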
Context Unrolling
- During forward passes, the model performs iterative cross‑modal attention cycles.
- After each cycle, intermediate representations are “unrolled” back into each modality’s decoder, letting the model refine its understanding before producing the final output.
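The unrolling loop described above can be sketched as iterated joint attention followed by a split back into per-modality streams. Single-head self-attention stands in for the paper's cross-modal attention cycle; the cycle count and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(tokens: np.ndarray) -> np.ndarray:
    """Single-head self-attention over the joint stream (stand-in for one cycle)."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    return softmax(scores) @ tokens

def context_unroll(streams: dict, num_cycles: int = 3) -> dict:
    """Mix all modalities jointly, then 'unroll' the refined tokens back
    into per-modality streams after each cycle."""
    names = list(streams)
    sizes = [streams[m].shape[0] for m in names]
    for _ in range(num_cycles):
        joint = np.concatenate([streams[m] for m in names], axis=0)
        mixed = cross_modal_attention(joint)
        parts = np.split(mixed, np.cumsum(sizes)[:-1], axis=0)
        streams = dict(zip(names, parts))
    return streams

streams = {m: rng.normal(size=(4, 16)) for m in ["text", "image", "video"]}
refined = context_unroll(streams)
print({m: v.shape for m, v in refined.items()})
```

Each modality's decoder would then consume its refined slice, which is how intermediate cross-modal evidence reaches the final output.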
Training Objective
- A mixture of contrastive losses (to align modalities) and generative losses (autoregressive text, diffusion‑based image/video, and implicit surface decoding for 3‑D).
- Curriculum learning gradually introduces more complex modalities, ensuring stable convergence.
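The contrastive half of the objective can be illustrated with a symmetric InfoNCE-style loss; the temperature, weighting, and loss shape here are generic assumptions rather than the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(3)

def info_nce(a: np.ndarray, b: np.ndarray, temperature: float = 0.1) -> float:
    """Symmetric InfoNCE-style contrastive loss aligning pairs a[i] <-> b[i]."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    labels = np.arange(len(a))
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-logp[labels, labels].mean())
    return 0.5 * (xent(logits) + xent(logits.T))

def total_loss(contrastive: float, generative: float, w: float = 0.5) -> float:
    """Weighted mixture of the alignment and generation objectives."""
    return w * contrastive + (1.0 - w) * generative

a = rng.normal(size=(8, 16))
aligned = info_nce(a, a)                          # perfectly paired embeddings
misaligned = info_nce(a, rng.normal(size=(8, 16)))
print(aligned < misaligned)  # True: alignment lowers the contrastive loss
```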
Inference
- Users supply a prompt with optional modality tags.
- The model runs a fixed number of unrolling steps, then decodes the requested output type(s).
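The inference flow above might look like the following sketch. The `[out:modality]` tag syntax and every function name here are hypothetical illustrations; the paper does not publish this API.

```python
import re

KNOWN_TAGS = {"text", "image", "video", "mesh3d"}

def parse_tags(prompt: str):
    """Split a prompt into clean text plus the set of requested output modalities."""
    tags = set(re.findall(r"\[out:(\w+)\]", prompt)) & KNOWN_TAGS
    clean = re.sub(r"\[out:\w+\]", "", prompt).strip()
    return clean, (tags or {"text"})  # default to text if no tag is given

def run_inference(prompt: str, unroll_steps: int = 4) -> dict:
    clean, tags = parse_tags(prompt)
    for _ in range(unroll_steps):  # fixed unrolling budget
        pass  # each cycle would refine the joint token state (omitted here)
    # Invoke only the requested stub decoders.
    return {m: f"<{m} decoded from: {clean}>" for m in tags}

out = run_inference("a red sports car [out:image] [out:video]")
print(sorted(out))  # ['image', 'video']
```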
Results & Findings
| Benchmark | Modality | Omni Score | Prior SOTA | Δ |
|---|---|---|---|---|
| COCO Caption (text‑image) | Text ↔ Image | 138.2 CIDEr | 132.5 | +5.7 |
| Kinetics‑600 Video QA | Video ↔ Text | 78.4% accuracy | 73.1% | +5.3% |
| ShapeNet Retrieval | 3‑D ↔ Text | 91.2% recall@1 | 86.8% | +4.4% |
| Text‑to‑Video Generation (MSVD) | Text → Video | FVD 210 | FVD 260 | −50 (lower FVD is better) |
| Multi‑modal In‑Context Generation | Mixed | Human evaluation 4.6/5 | 3.9/5 | +0.7 |
- Context Unrolling consistently improves cross‑modal alignment, especially on tasks that require reasoning over heterogeneous cues (e.g., answering a video question that references a 3‑D object shown in an image).
- The unified model matches or exceeds specialized models that were trained on a single modality, demonstrating that joint training does not sacrifice performance.
Practical Implications
- One‑stop AI service: Developers can expose a single API endpoint that handles text, image, video, and 3‑D generation or understanding, simplifying product architecture.
- Cross‑modal content creation: Content platforms can generate synchronized assets (e.g., a product description, a rendered image, a short demo video, and a 3‑D model) from a single prompt, cutting down on manual asset pipelines.
- Enhanced AR/VR pipelines: By feeding a textual scene description, Omni can output both the visual texture (image) and the spatial mesh (3‑D), accelerating prototyping for immersive experiences.
- Improved multimodal retrieval: Search engines can index and retrieve across modalities using a shared embedding space, enabling queries like “show me a video of a red sports car similar to this 3‑D model.”
- Reduced engineering overhead: Instead of maintaining separate models for each modality, teams can fine‑tune Omni on domain‑specific data (e.g., medical imaging + reports) with a single training run.
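The cross-modal retrieval scenario above reduces to nearest-neighbor search in the shared embedding space. A minimal sketch with random stand-in embeddings (the index contents and dimensions are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

def retrieve(query_emb: np.ndarray, index_embs: np.ndarray, top_k: int = 3):
    """Rank indexed items (any modality) by cosine similarity to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    idx = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    sims = idx @ q
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]

# Fake index of 6 items; images, videos, and meshes all live in the same space.
index_embs = rng.normal(size=(6, 32))
query = index_embs[2].copy()  # pretend a 3-D model embeds exactly like item 2
order, sims = retrieve(query, index_embs)
print(order[0], round(float(sims[0]), 3))  # 2 1.0
```

Because every modality shares one space, the same index answers text-to-video, image-to-mesh, or mesh-to-video queries without separate retrieval models.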
Limitations & Future Work
- Compute‑intensive training: Jointly training on five modalities required > 600 GPU‑days; smaller organizations may need to rely on the released checkpoints rather than training from scratch.
- Modal imbalance: The model still performs slightly worse on low‑resource modalities (e.g., 3‑D generation) when the training data is dominated by text and images.
- Real‑time constraints: The iterative unrolling steps add latency, making the current version less suitable for ultra‑low‑latency applications (e.g., live video chat).
- Future directions proposed by the authors include:
- Adaptive unrolling schedules that stop early when confidence is high, reducing inference time.
- Incorporating additional modalities such as audio and sensor streams.
- Exploring more efficient training regimes (e.g., mixture‑of‑experts) to lower the resource barrier.
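The proposed adaptive unrolling could take a form like the sketch below. The authors only suggest the direction; the stopping rule here (a small state change as a proxy for "confidence is high") is entirely our illustrative assumption.

```python
import numpy as np

def adaptive_unroll(step_fn, state, max_steps: int = 20, tol: float = 1e-2):
    """Run unrolling cycles but stop early once the state stops changing.
    The change-norm criterion is an illustrative confidence proxy."""
    for t in range(max_steps):
        new_state = step_fn(state)
        if np.linalg.norm(new_state - state) < tol:
            return new_state, t + 1  # converged early
        state = new_state
    return state, max_steps

# A contracting refinement step converges quickly, so unrolling stops early.
state, steps = adaptive_unroll(lambda s: 0.5 * s, np.ones(4))
print(steps)  # 8 (well under the max budget of 20)
```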
Omni’s ability to “think across senses” opens a new frontier for developers building truly multimodal AI products. By exposing a single, unified interface, it promises to streamline pipelines, cut costs, and enable richer user experiences across the web, mobile, and immersive platforms.
Authors
- Ceyuan Yang
- Zhijie Lin
- Yang Zhao
- Fei Xiao
- Hao He
- Qi Zhao
- Chaorui Deng
- Kunchang Li
- Zihan Ding
- Yuwei Guo
- Fuyun Wang
- Fangqi Zhu
- Xiaonan Nie
- Shenhan Zhu
- Shanchuan Lin
- Hongsheng Li
- Weilin Huang
- Guang Shi
- Haoqi Fan
Paper Information
- arXiv ID: 2604.21921v1
- Categories: cs.CV
- Published: April 23, 2026