[Paper] Yume-1.5: A Text-Controlled Interactive World Generation Model
Source: arXiv - 2512.22096v1
Overview
The paper introduces Yume‑1.5, a new diffusion‑based framework that can generate an explorable 3D‑like world from a single image or a text prompt—and let users walk through it with keyboard controls in real time. By tackling the three biggest pain points of prior world‑generation models (huge model size, slow multi‑step inference, and lack of text‑driven event control), the authors deliver a system that feels responsive enough for interactive applications such as games, VR experiences, and rapid prototyping tools.
Key Contributions
- Unified long‑video generation pipeline that compresses the growing historical context and uses linear attention to keep memory and compute linear in sequence length.
- Real‑time streaming acceleration achieved through bidirectional attention distillation and an enhanced text‑embedding scheme, cutting inference latency from seconds to sub‑100 ms per frame.
- Text‑controlled world events that let a user describe dynamic changes (e.g., “a storm rolls in” or “a bridge collapses”) and have the model update the scene on the fly.
- Keyboard‑driven exploration interface that demonstrates seamless navigation across the generated world without needing external physics engines.
- Open‑source code release (supplementary material) enabling the community to reproduce and extend the system.
Methodology
Yume‑1.5 builds on diffusion models but restructures them for interactive use:
- Context Compression + Linear Attention – As the world expands, the model would normally need to keep the entire history of generated frames, which quickly overwhelms GPU memory. The authors introduce a lightweight compression module that summarizes past frames into a fixed-size latent, which is then fed into a linear-attention transformer that scales linearly with the number of frames instead of quadratically (see the first sketch after this list).
- Bidirectional Attention Distillation – During training, a heavyweight “teacher” model processes the full context with standard attention. A smaller “student” model learns to mimic the teacher’s outputs while only looking at a limited window, dramatically reducing runtime while preserving quality (see the distillation sketch below).
- Enhanced Text Embedding – Rather than a single prompt token, the system injects a hierarchy of text embeddings (global prompt + per-step event tokens) into the diffusion denoising steps, enabling fine-grained control over world dynamics (see the conditioning sketch below).
- Keyboard Navigation Loop – The generated frames are streamed to a lightweight renderer. User key presses are translated into latent-space offsets, which are fed back into the diffusion step to produce the next view, creating a smooth first-person walk through the world (see the loop sketch below).
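The context-compression and linear-attention combination can be illustrated with a short PyTorch sketch. The module below pools an arbitrary-length frame history into a fixed set of summary tokens and then applies kernel-based linear attention (here an ELU+1 feature map). The module name, the pooling mechanism, and the feature map are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompressedLinearAttention(nn.Module):
    """Minimal sketch: summarize past frames into a fixed-size latent, then
    attend over [compressed history + current frame] with linear attention.
    All names and design details here are illustrative, not from the paper."""

    def __init__(self, dim: int, num_summary_tokens: int = 16):
        super().__init__()
        # Learned queries that pool an arbitrary-length history into a fixed set of tokens.
        self.summary_queries = nn.Parameter(torch.randn(num_summary_tokens, dim))
        self.to_qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim)

    def compress(self, history: torch.Tensor) -> torch.Tensor:
        # history: (B, T_hist, D) -> (B, S, D); S is fixed regardless of T_hist.
        q = self.summary_queries.unsqueeze(0).expand(history.size(0), -1, -1)
        attn = torch.softmax(q @ history.transpose(1, 2) / history.size(-1) ** 0.5, dim=-1)
        return attn @ history

    def forward(self, current: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # Concatenate the fixed-size summary with the current frame's tokens.
        ctx = torch.cat([self.compress(history), current], dim=1)  # (B, S + T_cur, D)
        q, k, v = self.to_qkv(ctx).chunk(3, dim=-1)
        # Linear attention: phi(q) (phi(k)^T v) costs O(N * D^2) instead of O(N^2 * D).
        q, k = F.elu(q) + 1, F.elu(k) + 1
        kv = torch.einsum("bnd,bne->bde", k, v)
        z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
        out = torch.einsum("bnd,bde,bn->bne", q, kv, z)
        return self.out(out)[:, -current.size(1):]  # keep only current-frame tokens
```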
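Bidirectional attention distillation can be pictured as a standard teacher-student objective: the frozen full-context teacher supplies targets that the windowed student regresses onto. The training-step function below is a simplified, hypothetical version; the paper's actual loss and windowing scheme may differ.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, frames, window: int = 8):
    """Hedged sketch of attention distillation: the frozen teacher sees the
    full frame sequence with standard attention, the student sees only a
    recent window, and the student is trained to match the teacher's output
    on the shared frames. Function and argument names are hypothetical."""
    with torch.no_grad():
        teacher_out = teacher(frames)              # full bidirectional context
    student_out = student(frames[:, -window:])     # limited recent window only
    return F.mse_loss(student_out, teacher_out[:, -window:])
```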
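The hierarchical text conditioning can be approximated as concatenating scene-level and event-level token sequences before they reach the denoiser's cross-attention. The wrapper below is a hedged sketch; the text encoder, the projection, and the fusion strategy are all assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalTextConditioning(nn.Module):
    """Sketch of the 'global prompt + per-step event tokens' idea. How the
    embeddings actually enter the denoiser in Yume-1.5 is not specified here;
    simple concatenation for cross-attention is an assumption."""

    def __init__(self, text_encoder: nn.Module, dim: int):
        super().__init__()
        self.text_encoder = text_encoder   # any encoder mapping token ids -> (B, L, D)
        self.event_proj = nn.Linear(dim, dim)

    def forward(self, global_prompt_ids, event_prompt_ids=None):
        global_tokens = self.text_encoder(global_prompt_ids)                 # (B, Lg, D)
        if event_prompt_ids is None:
            return global_tokens
        event_tokens = self.event_proj(self.text_encoder(event_prompt_ids))  # (B, Le, D)
        # The denoiser cross-attends over both scene-level and event-level tokens.
        return torch.cat([global_tokens, event_tokens], dim=1)
```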
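Finally, the navigation loop amounts to translating each key press into a motion condition and running the few-step distilled denoiser once per frame. The loop below is purely illustrative: `model.denoise`, `model.decode`, `renderer.show`, and `read_key` are hypothetical interfaces standing in for the system's actual streaming pipeline, and the key-to-motion mapping is an assumption.

```python
import torch

# Illustrative mapping from keys to camera-motion offsets in latent space;
# the actual control encoding in Yume-1.5 may differ.
KEY_TO_MOTION = {
    "w": torch.tensor([0.0, 0.0, 1.0]),   # forward
    "s": torch.tensor([0.0, 0.0, -1.0]),  # backward
    "a": torch.tensor([-1.0, 0.0, 0.0]),  # strafe left
    "d": torch.tensor([1.0, 0.0, 0.0]),   # strafe right
}

def interactive_loop(model, renderer, read_key, init_latent, text_tokens, steps=4):
    """Sketch of the streaming loop: a key press becomes a motion condition,
    the few-step denoiser produces the next frame latent, and the decoded
    frame is rendered while the loop repeats."""
    latent = init_latent
    while True:
        key = read_key()                        # e.g. a blocking keyboard read
        if key == "q":
            break
        motion = KEY_TO_MOTION.get(key, torch.zeros(3))
        # Few-step denoising conditioned on the text tokens and motion offset.
        latent = model.denoise(latent, text_tokens, motion, num_steps=steps)
        renderer.show(model.decode(latent))     # stream the decoded frame to the viewer
```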
Results & Findings
- Latency: Average per‑frame generation time dropped from ~1.2 s (baseline diffusion) to ≈85 ms on an RTX 3090, meeting real‑time interaction thresholds.
- Quality: Human evaluation (Mean Opinion Score) showed a +0.6 improvement over prior long‑video diffusion baselines, especially in maintaining spatial coherence across frames.
- Text‑Event Responsiveness: When users issued dynamic commands (“add a river”, “night falls”), the model updated the scene within 2–3 frames, preserving continuity.
- Scalability: The compressed context allowed generation of worlds up to 30 seconds (≈900 frames) without OOM errors, a 4× increase over prior methods.
Practical Implications
- Game Prototyping – Designers can sketch a concept image or write a short description and instantly walk through a playable environment, accelerating level design cycles.
- VR/AR Content Creation – Real‑time generation means on‑device or cloud‑assisted experiences where the environment evolves based on voice or text commands, opening up adaptive storytelling.
- Simulation & Training – Industries such as robotics or autonomous driving can generate diverse, controllable virtual terrains on the fly for scenario testing.
- Creative Tools – Artists can iterate on world‑building by typing “add a medieval market” or “turn it into a cyberpunk night” and see immediate visual feedback, lowering the barrier to high‑fidelity world creation.
Limitations & Future Work
- Physical Realism – The current system focuses on visual plausibility; physical interactions such as collision and gravity are not simulated, limiting use in high‑fidelity game engines.
- Text Understanding Scope – Complex multi‑step instructions sometimes produce ambiguous results; richer language models could improve event parsing.
- Hardware Dependence – While latency is real‑time on high‑end GPUs, lower‑tier hardware still struggles; future work aims at further model pruning and quantization.
- Evaluation Metrics – The paper relies heavily on subjective scores; establishing standardized quantitative metrics for interactive world generation remains an open challenge.
Yume‑1.5 marks a significant step toward bridging generative AI and interactive media, offering developers a practical pathway to create dynamic, text‑driven worlds without the heavyweight infrastructure of traditional game pipelines.
Authors
- Xiaofeng Mao
- Zhen Li
- Chuanhao Li
- Xiaojie Xu
- Kaining Ying
- Tong He
- Jiangmiao Pang
- Yu Qiao
- Kaipeng Zhang
Paper Information
- arXiv ID: 2512.22096v1
- Categories: cs.CV
- Published: December 26, 2025