[Paper] Yume-1.5: A Text-Controlled Interactive World Generation Model
Source: arXiv - 2512.22096v1
Overview
The paper introduces Yume‑1.5, a new diffusion‑based framework that can generate an explorable 3D‑like world from a single image or a text prompt—and let users walk through it with keyboard controls in real time. By tackling the three biggest pain points of prior world‑generation models (huge model size, slow multi‑step inference, and lack of text‑driven event control), the authors deliver a system that feels responsive enough for interactive applications such as games, VR experiences, and rapid prototyping tools.
Key Contributions
- Unified long‑video generation pipeline that compresses the growing historical context and uses linear attention to keep memory and compute linear in sequence length.
- Real‑time streaming acceleration achieved through bidirectional attention distillation and an enhanced text‑embedding scheme, cutting inference latency from seconds to sub‑100 ms per frame.
- Text‑controlled world events that let a user describe dynamic changes (e.g., “a storm rolls in” or “a bridge collapses”) and have the model update the scene on the fly.
- Keyboard‑driven exploration interface that demonstrates seamless navigation across the generated world without needing external physics engines.
- Open‑source code release (supplementary material) enabling the community to reproduce and extend the system.
Methodology
Yume‑1.5 builds on diffusion models but restructures them for interactive use:
- Context Compression + Linear Attention – As the world expands, the model would normally need to keep the entire history of generated frames, which quickly overwhelms GPU memory. The authors introduce a lightweight compression module that summarizes past frames into a fixed-size latent, which is then fed into a linear-attention transformer that scales linearly with the number of frames instead of quadratically (see the first sketch after this list).
- Bidirectional Attention Distillation – During training, a heavyweight “teacher” model processes the full context with standard attention. A smaller “student” model learns to mimic the teacher’s outputs while only looking at a limited window, dramatically reducing runtime while preserving quality (see the distillation sketch below).
- Enhanced Text Embedding – Rather than a single prompt token, the system injects a hierarchy of text embeddings (global prompt + per-step event tokens) into the diffusion denoising steps, enabling fine-grained control over world dynamics (see the conditioning sketch below).
- Keyboard Navigation Loop – The generated frames are streamed to a lightweight renderer. User key presses are translated into latent-space offsets, which are fed back into the diffusion step to produce the next view, creating a smooth first-person walk through the world (see the loop sketch below).
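The context-compression and linear-attention combination can be illustrated with a short PyTorch sketch. The module below pools an arbitrary-length frame history into a fixed set of summary tokens and then applies kernel-based linear attention (here an ELU+1 feature map). The module name, the pooling mechanism, and the feature map are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompressedLinearAttention(nn.Module):
    """Minimal sketch: summarize past frames into a fixed-size latent, then
    attend over [compressed history + current frame] with linear attention.
    All names and design details here are illustrative, not from the paper."""

    def __init__(self, dim: int, num_summary_tokens: int = 16):
        super().__init__()
        # Learned queries that pool an arbitrary-length history into a fixed set of tokens.
        self.summary_queries = nn.Parameter(torch.randn(num_summary_tokens, dim))
        self.to_qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim)

    def compress(self, history: torch.Tensor) -> torch.Tensor:
        # history: (B, T_hist, D) -> (B, S, D); S is fixed regardless of T_hist.
        q = self.summary_queries.unsqueeze(0).expand(history.size(0), -1, -1)
        attn = torch.softmax(q @ history.transpose(1, 2) / history.size(-1) ** 0.5, dim=-1)
        return attn @ history

    def forward(self, current: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # Concatenate the fixed-size summary with the current frame's tokens.
        ctx = torch.cat([self.compress(history), current], dim=1)  # (B, S + T_cur, D)
        q, k, v = self.to_qkv(ctx).chunk(3, dim=-1)
        # Linear attention: phi(q) (phi(k)^T v) costs O(N * D^2) instead of O(N^2 * D).
        q, k = F.elu(q) + 1, F.elu(k) + 1
        kv = torch.einsum("bnd,bne->bde", k, v)
        z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
        out = torch.einsum("bnd,bde,bn->bne", q, kv, z)
        return self.out(out)[:, -current.size(1):]  # keep only current-frame tokens
```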
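Bidirectional attention distillation can be pictured as a standard teacher-student objective: the frozen full-context teacher supplies targets that the windowed student regresses onto. The training-step function below is a simplified, hypothetical version; the paper's actual loss and windowing scheme may differ.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, frames, window: int = 8):
    """Hedged sketch of attention distillation: the frozen teacher sees the
    full frame sequence with standard attention, the student sees only a
    recent window, and the student is trained to match the teacher's output
    on the shared frames. Function and argument names are hypothetical."""
    with torch.no_grad():
        teacher_out = teacher(frames)              # full bidirectional context
    student_out = student(frames[:, -window:])     # limited recent window only
    return F.mse_loss(student_out, teacher_out[:, -window:])
```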
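The hierarchical text conditioning can be approximated as concatenating scene-level and event-level token sequences before they reach the denoiser's cross-attention. The wrapper below is a hedged sketch; the text encoder, the projection, and the fusion strategy are all assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalTextConditioning(nn.Module):
    """Sketch of the 'global prompt + per-step event tokens' idea. How the
    embeddings actually enter the denoiser in Yume-1.5 is not specified here;
    simple concatenation for cross-attention is an assumption."""

    def __init__(self, text_encoder: nn.Module, dim: int):
        super().__init__()
        self.text_encoder = text_encoder   # any encoder mapping token ids -> (B, L, D)
        self.event_proj = nn.Linear(dim, dim)

    def forward(self, global_prompt_ids, event_prompt_ids=None):
        global_tokens = self.text_encoder(global_prompt_ids)                 # (B, Lg, D)
        if event_prompt_ids is None:
            return global_tokens
        event_tokens = self.event_proj(self.text_encoder(event_prompt_ids))  # (B, Le, D)
        # The denoiser cross-attends over both scene-level and event-level tokens.
        return torch.cat([global_tokens, event_tokens], dim=1)
```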
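Finally, the navigation loop amounts to translating each key press into a motion condition and running the few-step distilled denoiser once per frame. The loop below is purely illustrative: `model.denoise`, `model.decode`, `renderer.show`, and `read_key` are hypothetical interfaces standing in for the system's actual streaming pipeline, and the key-to-motion mapping is an assumption.

```python
import torch

# Illustrative mapping from keys to camera-motion offsets in latent space;
# the actual control encoding in Yume-1.5 may differ.
KEY_TO_MOTION = {
    "w": torch.tensor([0.0, 0.0, 1.0]),   # forward
    "s": torch.tensor([0.0, 0.0, -1.0]),  # backward
    "a": torch.tensor([-1.0, 0.0, 0.0]),  # strafe left
    "d": torch.tensor([1.0, 0.0, 0.0]),   # strafe right
}

def interactive_loop(model, renderer, read_key, init_latent, text_tokens, steps=4):
    """Sketch of the streaming loop: a key press becomes a motion condition,
    the few-step denoiser produces the next frame latent, and the decoded
    frame is rendered while the loop repeats."""
    latent = init_latent
    while True:
        key = read_key()                        # e.g. a blocking keyboard read
        if key == "q":
            break
        motion = KEY_TO_MOTION.get(key, torch.zeros(3))
        # Few-step denoising conditioned on the text tokens and motion offset.
        latent = model.denoise(latent, text_tokens, motion, num_steps=steps)
        renderer.show(model.decode(latent))     # stream the decoded frame to the viewer
```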
Results & Findings
- Latency: Average per‑frame generation time dropped from ~1.2 s (baseline diffusion) to ≈85 ms on an RTX 3090, meeting real‑time interaction thresholds.
- Quality: Human evaluation (Mean Opinion Score) showed a +0.6 improvement over prior long‑video diffusion baselines, especially in maintaining spatial coherence across frames.
- Text‑Event Responsiveness: When users issued dynamic commands (“add a river”, “night falls”), the model updated the scene within 2–3 frames, preserving continuity.
- Scalability: The compressed context allowed generation of worlds up to 30 seconds (≈900 frames) without OOM errors, a 4× increase over prior methods.
Practical Implications
- Game Prototyping – Designers can sketch a concept image or write a short description and instantly walk through a playable environment, accelerating level design cycles.
- VR/AR Content Creation – Real‑time generation means on‑device or cloud‑assisted experiences where the environment evolves based on voice or text commands, opening up adaptive storytelling.
- Simulation & Training – Industries such as robotics or autonomous driving can generate diverse, controllable virtual terrains on the fly for scenario testing.
- Creative Tools – Artists can iterate on world‑building by typing “add a medieval market” or “turn it into a cyberpunk night” and see immediate visual feedback, lowering the barrier to high‑fidelity world creation.
Limitations & Future Work
- Physical Realism – The current system focuses on visual plausibility; physical interactions such as collision and gravity are not simulated, limiting use in high‑fidelity game engines.
- Text Understanding Scope – Complex multi‑step instructions sometimes produce ambiguous results; richer language models could improve event parsing.
- Hardware Dependence – While latency is real‑time on high‑end GPUs, lower‑tier hardware still struggles; future work aims at further model pruning and quantization.
- Evaluation Metrics – The paper relies heavily on subjective scores; establishing standardized quantitative metrics for interactive world generation remains an open challenge.
Yume‑1.5 marks a significant step toward bridging generative AI and interactive media, offering developers a practical pathway to create dynamic, text‑driven worlds without the heavyweight infrastructure of traditional game pipelines.
Authors
- Xiaofeng Mao
- Zhen Li
- Chuanhao Li
- Xiaojie Xu
- Kaining Ying
- Tong He
- Jiangmiao Pang
- Yu Qiao
- Kaipeng Zhang
Paper Information
- arXiv ID: 2512.22096v1
- Categories: cs.CV
- Published: December 26, 2025