[Paper] Context Unrolling in Omni Models

Published: April 23, 2026 at 01:58 PM EDT
5 min read
Source: arXiv - 2604.21921v1

Overview

The paper introduces Omni, a single multimodal model that is trained from scratch on a wide spectrum of data—plain text, 2‑D images, video clips, 3‑D geometry, and even hidden feature representations. By learning to handle all these modalities together, Omni develops a novel “Context Unrolling” ability: it can internally reason across different representations before emitting a final output, leading to richer, more coherent predictions.

Key Contributions

  • Unified multimodal training on five fundamentally different data types (text, image, video, 3‑D, latent features) without modality‑specific adapters.
  • Context Unrolling mechanism that explicitly propagates information across modalities during inference, improving reasoning fidelity.
  • State‑of‑the‑art results on both multimodal generation (e.g., text‑to‑image, video synthesis) and understanding benchmarks (e.g., video‑question answering, 3‑D shape retrieval).
  • Demonstration of in‑context multimodal generation: a single query can trigger the model to output text, an image, a short video, or a 3‑D mesh depending on the prompt.
  • Open‑source release of the model weights and training pipeline, enabling reproducibility and downstream fine‑tuning.

Methodology

  1. Data Collection & Pre‑processing

    • Curated large‑scale datasets for each modality (e.g., LAION‑5B for images, HowTo100M for video, ShapeNet for 3‑D).
    • All inputs are projected into a shared token space using modality‑specific encoders (ViT for images, TimeSformer for video, PointNet++ for 3‑D, and a transformer for text).
  2. Joint Transformer Backbone

    • A single large transformer processes the concatenated token streams.
    • Positional and modality embeddings let the model distinguish where each token comes from while still allowing cross‑modal attention.
  3. Context Unrolling

    • During forward passes, the model performs iterative cross‑modal attention cycles.
    • After each cycle, intermediate representations are “unrolled” back into each modality’s decoder, letting the model refine its understanding before producing the final output.
  4. Training Objective

    • A mixture of contrastive losses (to align modalities) and generative losses (autoregressive text, diffusion‑based image/video, and implicit surface decoding for 3‑D).
    • Curriculum learning gradually introduces more complex modalities, ensuring stable convergence.
  5. Inference

    • Users supply a prompt with optional modality tags.
    • The model runs a fixed number of unrolling steps, then decodes the requested output type(s).
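The inference loop described in steps 3 and 5 can be sketched in a few lines. This is a minimal, illustrative numpy toy, not the paper's implementation: the token dimensions, stream sizes, fixed step count, and the slice-based "unroll" back to per-modality streams are all assumptions made for readability (the real model uses learned multi-head attention and full decoders).

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_modal_attention(tokens):
    """One attention cycle over the concatenated token stream.
    Single-head, no learned weights -- purely illustrative."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens

# Hypothetical per-modality token streams, already projected into a
# shared 8-dim token space by modality-specific encoders (step 1).
streams = {
    "text":  rng.normal(size=(4, 8)),
    "image": rng.normal(size=(6, 8)),
    "3d":    rng.normal(size=(5, 8)),
}
tokens = np.concatenate(list(streams.values()))

NUM_UNROLL_STEPS = 3  # fixed number of steps, as in step 5
for _ in range(NUM_UNROLL_STEPS):
    tokens = cross_modal_attention(tokens)  # cross-modal cycle (step 3)
    # "unroll": hand each modality's refined slice back to its stream,
    # standing in for the per-modality decoders
    offset = 0
    for name, s in streams.items():
        streams[name] = tokens[offset:offset + len(s)]
        offset += len(s)

print(tokens.shape)  # (15, 8): refined shared representation
```

After the final cycle, each refined stream would be handed to the requested decoder(s) to emit text, pixels, frames, or a mesh.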
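The mixed objective in step 4 pairs an alignment term with generative terms. Below is a hedged sketch of the contrastive half using a symmetric InfoNCE-style loss; the embedding sizes, the 0.07 temperature, the placeholder generative term, and the 0.5 weighting are illustrative assumptions, not values from the paper.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive loss aligning paired embeddings from two
    modalities; row i of `a` and row i of `b` are a matched pair."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    labels = np.arange(len(a))

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)           # stable softmax
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()            # diagonal = positives

    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(1)
text_emb = rng.normal(size=(8, 16))
image_emb = text_emb + 0.1 * rng.normal(size=(8, 16))  # roughly aligned pairs

align_loss = info_nce(text_emb, image_emb)
gen_loss = 1.23          # placeholder for the generative terms (AR text,
                         # diffusion image/video, implicit-surface 3-D)
total = align_loss + 0.5 * gen_loss  # hypothetical loss weighting
print(align_loss, total)
```

Because the pairs above are nearly aligned, the contrastive term comes out small; in training, this term pulls matched cross-modal pairs together while the generative terms teach each decoder.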

Results & Findings

| Benchmark | Modality | Omni Score | Prior SOTA | Δ |
|---|---|---|---|---|
| COCO Caption | Text ↔ Image | 138.2 CIDEr | 132.5 | +5.7 |
| Kinetics‑600 Video QA | Video ↔ Text | 78.4% accuracy | 73.1% | +5.3% |
| ShapeNet Retrieval | 3‑D ↔ Text | 91.2% recall@1 | 86.8% | +4.4% |
| Text‑to‑Video Generation (MS‑VD) | Text → Video | FVD 210 | 260 | −50 (lower is better) |
| Multi‑modal In‑Context Generation | Mixed | 4.6/5 (human eval) | 3.9/5 | +0.7 |
  • Context Unrolling consistently improves cross‑modal alignment, especially on tasks that require reasoning over heterogeneous cues (e.g., answering a video question that references a 3‑D object shown in an image).
  • The unified model matches or exceeds specialized models that were trained on a single modality, demonstrating that joint training does not sacrifice performance.

Practical Implications

  • One‑stop AI service: Developers can expose a single API endpoint that handles text, image, video, and 3‑D generation or understanding, simplifying product architecture.
  • Cross‑modal content creation: Content platforms can generate synchronized assets (e.g., a product description, a rendered image, a short demo video, and a 3‑D model) from a single prompt, cutting down on manual asset pipelines.
  • Enhanced AR/VR pipelines: By feeding a textual scene description, Omni can output both the visual texture (image) and the spatial mesh (3‑D), accelerating prototyping for immersive experiences.
  • Improved multimodal retrieval: Search engines can index and retrieve across modalities using a shared embedding space, enabling queries like “show me a video of a red sports car similar to this 3‑D model.”
  • Reduced engineering overhead: Instead of maintaining separate models for each modality, teams can fine‑tune Omni on domain‑specific data (e.g., medical imaging + reports) with a single training run.
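The retrieval idea above can be sketched with plain cosine similarity over a shared space. Everything here is hypothetical scaffolding: the asset names, the 32-dim embeddings, and `embed_query` (which reuses a seed so the query exactly matches one indexed asset) stand in for running real queries and assets through the shared encoder.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical index: assets of different modalities, all embedded
# into the same 32-dim space by a shared multimodal encoder.
index = {
    "demo.mp4 (video)":   rng.normal(size=32),
    "car.obj (3-D mesh)": rng.normal(size=32),
    "spec.txt (text)":    rng.normal(size=32),
}

def embed_query(seed):
    """Stand-in for the shared encoder; a real query would be text,
    an image, or a mesh run through the same model."""
    return np.random.default_rng(seed).normal(size=32)

def retrieve(query_vec, index, k=2):
    """Rank indexed assets by cosine similarity to the query."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(index, key=lambda name: cos(query_vec, index[name]),
                    reverse=True)
    return ranked[:k]

query = embed_query(2)  # same seed as the first asset -> identical vector
print(retrieve(query, index))  # "demo.mp4 (video)" ranks first
```

Because every modality lives in one space, the same `retrieve` call serves text-to-video, mesh-to-image, or any other cross-modal query without per-pair models.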

Limitations & Future Work

  • Compute‑intensive training: Jointly training on five modalities required > 600 GPU‑days; smaller organizations may need to rely on the released checkpoints rather than training from scratch.
  • Modality imbalance: The model still performs slightly worse on low‑resource modalities (e.g., 3‑D generation) because the training data is dominated by text and images.
  • Real‑time constraints: The iterative unrolling steps add latency, making the current version less suitable for ultra‑low‑latency applications (e.g., live video chat).
  • Future directions proposed by the authors include:
    • Adaptive unrolling schedules that stop early when confidence is high, reducing inference time.
    • Incorporating additional modalities such as audio and sensor streams.
    • Exploring more efficient training regimes (e.g., mixture‑of‑experts) to lower the resource barrier.

Omni’s ability to “think across senses” opens a new frontier for developers building truly multimodal AI products. By exposing a single, unified interface, it promises to streamline pipelines, cut costs, and enable richer user experiences across the web, mobile, and immersive platforms.

Authors

  • Ceyuan Yang
  • Zhijie Lin
  • Yang Zhao
  • Fei Xiao
  • Hao He
  • Qi Zhao
  • Chaorui Deng
  • Kunchang Li
  • Zihan Ding
  • Yuwei Guo
  • Fuyun Wang
  • Fangqi Zhu
  • Xiaonan Nie
  • Shenhan Zhu
  • Shanchuan Lin
  • Hongsheng Li
  • Weilin Huang
  • Guang Shi
  • Haoqi Fan

Paper Information

  • arXiv ID: 2604.21921v1
  • Categories: cs.CV
  • Published: April 23, 2026
