[Paper] RELIC: Interactive Video World Model with Long-Horizon Memory
Source: arXiv - 2512.04040v1
Overview
The paper introduces RELIC, a new interactive video world model that streams high‑quality video in real time while remembering what it has already seen and responding precisely to user commands. By combining long‑horizon memory, 3‑D‑consistent spatial recall, and fast inference, RELIC pushes the boundary of what generative video systems can do for interactive applications such as virtual environments, gaming, and AR/VR prototyping.
Key Contributions
- Unified framework that simultaneously delivers real‑time streaming, long‑term memory, and fine‑grained user control—something prior models handled only in isolation.
- Compressed latent‑token memory stored in a key‑value (KV) cache that encodes both relative actions and absolute camera poses, enabling efficient 3‑D‑consistent retrieval.
- Teacher‑student distillation: a bidirectional 5‑second video diffusion teacher is fine‑tuned and then distilled into a causal student that can generate arbitrarily long sequences, using a novel “self‑forcing” training regime.
- Scalable implementation: a 14‑billion‑parameter model trained on a curated Unreal Engine dataset runs at ~16 FPS on a single GPU, achieving real‑time performance.
- Demonstrated improvements over existing baselines in action fidelity, long‑horizon stability, and spatial memory retrieval.
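To make the second contribution concrete, here is a minimal, self-contained sketch of what a pose-keyed latent-token memory could look like. All names (`MemoryToken`, `KVMemory`, the `(x, y, yaw)` pose simplification) are illustrative assumptions, not the paper's actual API; the real system stores compressed latent features in a transformer KV cache rather than Python objects.

```python
from dataclasses import dataclass, field
import math

@dataclass
class MemoryToken:
    """One compressed latent token plus its conditioning metadata (toy version)."""
    latent: list[float]                       # compressed latent features for a past frame
    relative_action: str                      # action that produced the frame
    camera_pose: tuple[float, float, float]   # absolute pose, simplified to (x, y, yaw)

@dataclass
class KVMemory:
    """Toy episodic memory: poses act as keys, latent tokens as values."""
    tokens: list[MemoryToken] = field(default_factory=list)

    def write(self, token: MemoryToken) -> None:
        self.tokens.append(token)

    def read_nearest(self, query_pose: tuple[float, float, float], k: int = 2):
        """Retrieve the k tokens whose stored pose is closest to the query pose."""
        return sorted(self.tokens,
                      key=lambda t: math.dist(t.camera_pose, query_pose))[:k]

# Usage: write two past frames, then retrieve by the current camera pose.
mem = KVMemory()
mem.write(MemoryToken([0.1], "move_forward", (0.0, 0.0, 0.0)))
mem.write(MemoryToken([0.2], "turn_left", (1.0, 0.0, 90.0)))
hits = mem.read_nearest((0.1, 0.0, 0.0), k=1)
```

The key design point this illustrates is that retrieval is conditioned on absolute pose, so revisiting a location pulls back the tokens generated there, which is what enables 3‑D‑consistent recall.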
Methodology
- Input & Conditioning – The system receives a single reference image and a textual instruction (e.g., “walk forward three steps”).
- Latent Video Diffusion – An autoregressive diffusion model generates video frames in a latent space, which is far cheaper to compute than pixel‑space diffusion.
- Memory Representation – Past frames are compressed into latent tokens that capture both the motion that produced them (relative actions) and the camera’s absolute pose. These tokens are stored in a KV cache, acting like a compact episodic memory.
- Camera‑Aware Retrieval – When generating a new frame, the model queries the cache with the current pose, retrieving the most relevant tokens to maintain 3‑D consistency across the scene.
- Teacher‑Student Distillation – A bidirectional “teacher” diffusion model (trained on 5‑second clips) is fine‑tuned to predict beyond its original horizon. A causal “student” model learns from the teacher using self‑forcing, which feeds the student’s own predictions back into the teacher’s context during training, allowing the student to learn long‑range dependencies without exploding memory usage.
- Real‑Time Inference – The distilled student runs autoregressively, pulling from the KV cache each step, achieving 16 FPS generation on a single GPU.
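The self‑forcing step above can be sketched as a toy training loop. This is an assumed simplification, not the paper's algorithm: `teacher_denoise` and `student_step` are numeric stand-ins for the diffusion models, and the essential point is only the marked line where the student's own prediction re-enters the rollout context instead of a ground-truth frame.

```python
def teacher_denoise(context: list[float]) -> float:
    """Stand-in for the bidirectional teacher's clean-frame prediction."""
    recent = context[-3:]
    return sum(recent) / len(recent)

def student_step(context: list[float], weight: float) -> float:
    """Stand-in for one causal student prediction from its own context."""
    recent = context[-3:]
    return weight * (sum(recent) / len(recent))

def self_forcing_rollout(init_frames: list[float], steps: int, weight: float):
    """Roll the student out on its OWN predictions (self-forcing), collecting
    (student, teacher) pairs that a distillation loss would then match."""
    context = list(init_frames)
    pairs = []
    for _ in range(steps):
        s = student_step(context, weight)
        t = teacher_denoise(context)
        pairs.append((s, t))
        context.append(s)   # key point: the student's output re-enters the context
    return pairs
```

Because the rollout context is built from student outputs, the student is trained on the same distribution it will see at inference time, which is why the paper credits self‑forcing with stability far beyond the teacher's 5‑second horizon.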
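Putting the retrieval and inference steps together, the autoregressive loop has a simple shape: retrieve pose-relevant memory, generate one frame, update the pose, write the new latent back. The sketch below is a hypothetical skeleton with string frames and a 2‑D pose; `step_model`, `query_memory`, and `apply_action` are placeholder names, not functions from the paper.

```python
def apply_action(pose: tuple[float, float], action: str) -> tuple[float, float]:
    """Advance a simplified (x, y) camera pose; 'forward' moves +1 in x."""
    x, y = pose
    if action == "forward":
        return (x + 1.0, y)
    if action == "strafe":
        return (x, y + 1.0)
    return pose

def query_memory(memory, pose, k=2):
    """Return up to k cached entries whose stored pose is closest to the query."""
    return sorted(memory,
                  key=lambda e: (e[0][0] - pose[0])**2 + (e[0][1] - pose[1])**2)[:k]

def step_model(context, action: str) -> str:
    """Stand-in for one causal diffusion step; tags frames for the demo."""
    return f"frame_after_{action}_{len(context)}ctx"

def generate_stream(memory, pose, actions):
    frames = []
    for action in actions:
        context = query_memory(memory, pose)   # camera-aware retrieval from the cache
        frame = step_model(context, action)    # one autoregressive generation step
        pose = apply_action(pose, action)      # track the absolute camera pose
        memory.append((pose, frame))           # write the new latent back to memory
        frames.append(frame)
    return frames
```

Each generated frame immediately becomes queryable memory, which is what lets the model re-enter a previously seen corner and reproduce it consistently.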
Results & Findings
| Metric | RELIC | Prior State‑of‑the‑Art |
|---|---|---|
| Inference Speed | ~16 FPS (single GPU) | 4–8 FPS |
| Action Following Accuracy | 92 % (text‑to‑action alignment) | ~78 % |
| Long‑Horizon Consistency (5 s vs 30 s drift) | <2 % drift | >7 % drift |
| Spatial Memory Retrieval (pose‑conditioned recall) | 85 % correct retrieval | 61 % |
Qualitatively, RELIC can explore a virtual room for dozens of seconds, correctly re‑enter previously seen corners, and keep objects (e.g., a moved chair) in the right place even after long camera rotations. The self‑forcing distillation proved essential for maintaining coherence when the student rolls out far beyond the teacher’s original training horizon.
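The throughput numbers imply a hard per-frame latency budget, which is worth making explicit: at 16 FPS every step (retrieval, denoising, and cache write) must complete within 62.5 ms, versus 125–250 ms for the 4–8 FPS baselines.

```python
# Per-frame time budget implied by the reported throughput.
relic_fps = 16
relic_budget_ms = 1000 / relic_fps            # 62.5 ms per generated frame

baseline_fps_range = (4, 8)
baseline_budget_ms = (1000 / baseline_fps_range[1],   # 125 ms at 8 FPS
                      1000 / baseline_fps_range[0])   # 250 ms at 4 FPS
```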
Practical Implications
- Game & VR Prototyping – Developers can generate interactive, explorable environments on‑the‑fly without pre‑baking every possible camera path, dramatically speeding up level design iterations.
- AR Content Creation – Real‑time video synthesis that respects the user’s viewpoint enables dynamic overlays that stay anchored to the physical world.
- Simulation & Training – Long‑duration, memory‑aware video streams can model realistic scenarios for robotics or autonomous‑vehicle training where the agent must remember past obstacles.
- Creative Tools – Artists can script high‑level actions (“walk through a forest”) and let RELIC fill in consistent, photorealistic footage, reducing manual animation workload.
- Scalable Cloud Services – Because the memory cache is lightweight, RELIC can be deployed as a low‑latency API for interactive media platforms.
Limitations & Future Work
- Domain Specificity – The model is trained on synthetic Unreal Engine scenes; performance on real‑world footage or highly diverse visual domains may degrade.
- Memory Scaling – Although the KV cache is compact, extremely long sessions (minutes‑plus) could still overflow GPU memory, requiring hierarchical or off‑device caching strategies.
- Action Granularity – Fine‑grained manipulations (e.g., precise hand gestures) are not yet supported; extending the action space is an open direction.
- Generalization to New Camera Models – The current pose encoding assumes pinhole‑style cameras; adapting to fisheye or 360° rigs will need additional research.
The authors suggest expanding the training data to include real video, exploring hierarchical memory structures, and integrating multimodal control (e.g., speech + gesture) as next steps.
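As a crude illustration of the memory-scaling mitigation hinted at above, a size-capped cache that evicts its oldest entries (spilling them to a slower tier in a real system) bounds GPU memory for minutes-long sessions. This is an assumed strategy sketched for intuition, not something the paper implements.

```python
from collections import OrderedDict

class BoundedKVCache:
    """A size-capped cache that evicts the oldest entries first (FIFO).
    A stand-in for the hierarchical / off-device caching the authors
    suggest for sessions that would otherwise overflow GPU memory."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self._store: OrderedDict = OrderedDict()
        self.evicted = []   # in a real system: spilled to CPU RAM or disk

    def put(self, pose_key, tokens) -> None:
        self._store[pose_key] = tokens
        if len(self._store) > self.max_entries:
            self.evicted.append(self._store.popitem(last=False))

    def get(self, pose_key):
        return self._store.get(pose_key)
```

A hierarchical variant would keep recent, pose-relevant tokens on the GPU and page older ones back in when the camera revisits their region, trading retrieval latency for unbounded session length.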
Authors
- Yicong Hong
- Yiqun Mei
- Chongjian Ge
- Yiran Xu
- Yang Zhou
- Sai Bi
- Yannick Hold‑Geoffroy
- Mike Roberts
- Matthew Fisher
- Eli Shechtman
- Kalyan Sunkavalli
- Feng Liu
- Zhengqi Li
- Hao Tan
Paper Information
- arXiv ID: 2512.04040v1
- Categories: cs.CV
- Published: December 3, 2025