[Paper] Solaris: Building a Multiplayer Video World Model in Minecraft
Source: arXiv - 2602.22208v1
Overview
The paper presents Solaris, the first video world model that can generate coherent, multi‑view video streams for multiple agents interacting in a shared Minecraft environment. By building a dedicated data‑collection pipeline and a novel training regime, the authors demonstrate that it’s possible to model not just what a single player sees, but how several players’ perspectives evolve together over time—opening the door to richer simulations for games, robotics, and AI research.
Key Contributions
- Multiplayer data system: An automated pipeline that records synchronized video, actions, and world state from multiple agents playing Minecraft together, yielding 12.64 M frames.
- Evaluation suite for multiplayer dynamics: Benchmarks covering movement coordination, memory of past events, grounding of objects, collaborative building, and cross‑view consistency.
- Staged training pipeline: A progressive approach that starts with single‑player modeling and gradually introduces multi‑agent interactions using a mix of bidirectional, causal, and Self‑Forcing objectives.
- Checkpointed Self‑Forcing: A memory‑efficient variant that lets the model learn from a long‑horizon teacher without a proportional blow‑up in GPU memory.
- Open‑source release: The data collection framework, trained models, and evaluation code are publicly available, providing a foundation for future multi‑agent world‑model research.
Methodology
- Data Collection – The authors built a custom Minecraft server that spawns several bots, each with its own camera. The server logs each frame, the corresponding action (e.g., move, place block), and a global world snapshot at 20 Hz, guaranteeing perfect temporal alignment across agents.
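The tick‑synchronized logging idea can be sketched as a small data model: every server tick bundles one frame and action per agent together with the global world snapshot, so alignment across agents is by construction. All names here (`TickRecord`, `log_tick`, etc.) are illustrative, not the paper's actual pipeline code.

```python
from dataclasses import dataclass, field

TICK_RATE_HZ = 20  # logging frequency reported in the paper


@dataclass
class AgentRecord:
    agent_id: int
    frame: bytes   # encoded RGB frame from this agent's camera
    action: str    # e.g. "move_forward", "place_block"


@dataclass
class TickRecord:
    tick: int             # shared server tick -> perfect temporal alignment
    timestamp: float      # tick / TICK_RATE_HZ, in seconds
    world_snapshot: dict  # global block/entity state
    agents: list = field(default_factory=list)


def log_tick(tick, world_snapshot, per_agent_obs):
    """Bundle all agents' observations under a single server tick."""
    record = TickRecord(tick=tick, timestamp=tick / TICK_RATE_HZ,
                        world_snapshot=world_snapshot)
    for agent_id, (frame, action) in per_agent_obs.items():
        record.agents.append(AgentRecord(agent_id, frame, action))
    return record
```

Keying every record by the server tick, rather than per‑agent wall clocks, is what makes the multi‑view streams trivially alignable downstream.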
- Model Architecture – Solaris extends a video diffusion backbone with multiple conditioning streams:
- Agent‑specific action tokens (what each player does).
- Shared world memory that stores a compressed representation of past frames across all agents.
- A mix of bidirectional and causal attention layers, so information can flow both forward and backward in time during training while preserving a causal path for autoregressive generation.
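The conditioning streams above can be sketched as token assembly: per‑agent action embeddings are concatenated with a compressed shared memory of past frames before being fed to the diffusion backbone. The embedding lookup and mean‑pool compression below are stand‑ins for whatever the paper actually uses; shapes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # embedding width (illustrative)


def embed_actions(actions, vocab):
    """One conditioning token per agent action (hypothetical lookup table)."""
    table = {a: rng.standard_normal(D) for a in vocab}
    return np.stack([table[a] for a in actions])           # (n_agents, D)


def compress_history(past_latents, k=4):
    """Shared world memory: mean-pool past frame latents into k slots,
    a simple stand-in for the paper's compressed representation."""
    chunks = np.array_split(past_latents, k)
    return np.stack([c.mean(axis=0) for c in chunks])      # (k, D)


def build_conditioning(actions, vocab, past_latents):
    action_tokens = embed_actions(actions, vocab)          # agent-specific stream
    memory_tokens = compress_history(past_latents)         # shared stream
    return np.concatenate([action_tokens, memory_tokens])  # (n_agents + k, D)
```

The key design point is that the memory tokens are shared across all agents, which is what couples the per‑agent views into one consistent world.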
- Training Stages –
- Stage 1: Train on single‑player clips to learn basic physics and texture generation.
- Stage 2: Introduce paired agents, encouraging the model to predict one agent’s view given the other’s actions (causal conditioning).
- Stage 3: Apply Self‑Forcing, where the model’s own predictions are fed back as inputs for the next timestep, forcing it to maintain coherence.
- Stage 4: Checkpointed Self‑Forcing—instead of storing the full long‑horizon teacher trajectory, the system checkpoints intermediate states, drastically reducing memory while still providing a far‑looking supervisory signal.
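The Self‑Forcing rollout and its checkpointed variant can be sketched together: the model consumes its own predictions step by step, and instead of storing every intermediate state of the long‑horizon trajectory, only periodic checkpoints are kept, with intervening states recomputed on demand. This is a generic activation‑checkpointing sketch under stated assumptions, not the authors' implementation; `step_fn` stands in for one forward step of the world model.

```python
def rollout_self_forcing(step_fn, x0, horizon, ckpt_every=8):
    """Autoregressive rollout where each prediction is fed back as the
    next input (Self-Forcing). Only every `ckpt_every`-th state is kept,
    so memory scales as O(horizon / ckpt_every) instead of O(horizon)."""
    checkpoints = {0: x0}
    x = x0
    for t in range(1, horizon + 1):
        x = step_fn(x)              # model consumes its own prediction
        if t % ckpt_every == 0:
            checkpoints[t] = x      # periodic checkpoint
    return x, checkpoints


def recompute_segment(step_fn, checkpoints, t):
    """Recover the state at step t from the nearest earlier checkpoint,
    trading extra compute for the memory saved during the rollout."""
    t0 = max(c for c in checkpoints if c <= t)
    x = checkpoints[t0]
    for _ in range(t - t0):
        x = step_fn(x)
    return x
```

The recompute‑from‑checkpoint trade (more FLOPs, far less memory) is what lets the teacher horizon grow from 8 to 32 frames in the reported experiments.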
- Evaluation – The authors test Solaris on five axes (movement, memory, grounding, building, view consistency) using both quantitative metrics (e.g., PSNR, SSIM, action‑prediction accuracy) and human judgments.
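Of the quantitative metrics listed, PSNR is simple enough to show directly; the cross‑view averaging wrapper below is a plausible stand‑in for the paper's consistency metric, not its exact definition.

```python
import numpy as np


def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio between a predicted and ground-truth frame."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)


def cross_view_psnr(views_a, views_b):
    """Average PSNR over paired frames from two agents' overlapping views
    (illustrative stand-in for a cross-view consistency score)."""
    return float(np.mean([psnr(a, b) for a, b in zip(views_a, views_b)]))
```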
Results & Findings
- Solaris outperforms prior single‑agent video world models by 15‑20 % on cross‑view consistency metrics, indicating that it can keep multiple perspectives aligned over long horizons.
- In the building benchmark, the model correctly predicts collaborative structures 87 % of the time, compared to 62 % for the best baseline.
- Checkpointed Self‑Forcing reduces GPU memory usage by ~45 % while extending the teacher horizon from 8 to 32 frames, leading to smoother long‑term predictions.
- Human evaluators rated Solaris‑generated multiplayer videos as “more realistic” and “better coordinated” than those from competing models in 78 % of pairwise comparisons.
Practical Implications
- Game AI & Content Generation – Developers can use Solaris to prototype multi‑player scenarios, auto‑generate NPC behavior that reacts consistently across player viewpoints, or create dynamic cut‑scenes that adapt to multiple cameras.
- Robotics & Simulation – The framework can be adapted to simulate fleets of robots (e.g., warehouse drones) where each robot’s sensor feed must stay consistent with others, enabling safer policy testing before real‑world deployment.
- Virtual Collaboration Tools – In VR/AR meeting spaces, a Solaris‑style model could predict and render the shared environment from each participant’s perspective, reducing latency and bandwidth by sending only high‑level action updates.
- Research Platforms – By open‑sourcing the data pipeline, the community can now benchmark multi‑agent world models on a large, diverse dataset, accelerating progress in multi‑agent reinforcement learning and generative modeling.
Limitations & Future Work
- Domain Specificity – The system is tuned for Minecraft’s block‑based graphics; transferring to photorealistic or physics‑heavy environments may require substantial adaptation.
- Scalability of Agents – Experiments involve up to four agents; scaling to dozens or hundreds (e.g., massive multiplayer online games) could expose bottlenecks in synchronization and memory.
- Action Space Coverage – Only a subset of Minecraft actions (movement, block placement/removal) are modeled; richer interactions like combat or inventory management remain unexplored.
- Future Directions – The authors suggest extending Solaris to heterogeneous sensor modalities (audio, depth), integrating reinforcement learning for policy‑conditioned generation, and exploring hierarchical memory structures to handle larger agent populations.
Authors
- Georgy Savva
- Oscar Michel
- Daohan Lu
- Suppakit Waiwitlikhit
- Timothy Meehan
- Dhairya Mishra
- Srivats Poddar
- Jack Lu
- Saining Xie
Paper Information
- arXiv ID: 2602.22208v1
- Categories: cs.CV
- Published: February 25, 2026