[Paper] Solaris: Building a Multiplayer Video World Model in Minecraft
Source: arXiv - 2602.22208v1
Overview
The paper presents Solaris, the first video world model that can generate coherent, multi‑view video streams for multiple agents interacting in a shared Minecraft environment. By building a dedicated data‑collection pipeline and a novel training regime, the authors demonstrate that it’s possible to model not just what a single player sees, but how several players’ perspectives evolve together over time—opening the door to richer simulations for games, robotics, and AI research.
Key Contributions
- Multiplayer data system: An automated pipeline that records synchronized video, actions, and world state from multiple agents playing Minecraft together, yielding 12.64 M frames.
- Evaluation suite for multiplayer dynamics: Benchmarks covering movement coordination, memory of past events, grounding of objects, collaborative building, and cross‑view consistency.
- Staged training pipeline: A progressive approach that starts with single‑player modeling and gradually introduces multi‑agent interactions using a mix of bidirectional, causal, and Self‑Forcing objectives.
- Checkpointed Self‑Forcing: A memory‑efficient variant that lets the model learn from a long‑horizon teacher without a proportional blow‑up in GPU memory.
- Open‑source release: The data collection framework, trained models, and evaluation code are publicly available, providing a foundation for future multi‑agent world‑model research.
Methodology
- Data Collection – The authors built a custom Minecraft server that spawns several bots, each with its own camera. The server logs each frame, the corresponding action (e.g., move, place block), and a global world snapshot at 20 Hz, guaranteeing perfect temporal alignment across agents.
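The tick‑synchronized logging idea can be sketched as a small data model: every server tick bundles one frame and action per agent together with the global world snapshot, so alignment across agents is by construction. All names here (`TickRecord`, `log_tick`, etc.) are illustrative, not the paper's actual pipeline code.

```python
from dataclasses import dataclass, field

TICK_RATE_HZ = 20  # logging frequency reported in the paper


@dataclass
class AgentRecord:
    agent_id: int
    frame: bytes   # encoded RGB frame from this agent's camera
    action: str    # e.g. "move_forward", "place_block"


@dataclass
class TickRecord:
    tick: int             # shared server tick -> perfect temporal alignment
    timestamp: float      # tick / TICK_RATE_HZ, in seconds
    world_snapshot: dict  # global block/entity state
    agents: list = field(default_factory=list)


def log_tick(tick, world_snapshot, per_agent_obs):
    """Bundle all agents' observations under a single server tick."""
    record = TickRecord(tick=tick, timestamp=tick / TICK_RATE_HZ,
                        world_snapshot=world_snapshot)
    for agent_id, (frame, action) in per_agent_obs.items():
        record.agents.append(AgentRecord(agent_id, frame, action))
    return record
```

Keying every record by the server tick, rather than per‑agent wall clocks, is what makes the multi‑view streams trivially alignable downstream.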
- Model Architecture – Solaris extends a video diffusion backbone with multiple conditioning streams:
- Agent‑specific action tokens (what each player does).
- Shared world memory that stores a compressed representation of past frames across all agents.
- A mix of bidirectional and causal attention layers, so information can flow both forward and backward in time during training while preserving a causal path for autoregressive generation.
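The conditioning streams above can be sketched as token assembly: per‑agent action embeddings are concatenated with a compressed shared memory of past frames before being fed to the diffusion backbone. The embedding lookup and mean‑pool compression below are stand‑ins for whatever the paper actually uses; shapes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # embedding width (illustrative)


def embed_actions(actions, vocab):
    """One conditioning token per agent action (hypothetical lookup table)."""
    table = {a: rng.standard_normal(D) for a in vocab}
    return np.stack([table[a] for a in actions])           # (n_agents, D)


def compress_history(past_latents, k=4):
    """Shared world memory: mean-pool past frame latents into k slots,
    a simple stand-in for the paper's compressed representation."""
    chunks = np.array_split(past_latents, k)
    return np.stack([c.mean(axis=0) for c in chunks])      # (k, D)


def build_conditioning(actions, vocab, past_latents):
    action_tokens = embed_actions(actions, vocab)          # agent-specific stream
    memory_tokens = compress_history(past_latents)         # shared stream
    return np.concatenate([action_tokens, memory_tokens])  # (n_agents + k, D)
```

The key design point is that the memory tokens are shared across all agents, which is what couples the per‑agent views into one consistent world.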
- Training Stages –
- Stage 1: Train on single‑player clips to learn basic physics and texture generation.
- Stage 2: Introduce paired agents, encouraging the model to predict one agent’s view given the other’s actions (causal conditioning).
- Stage 3: Apply Self‑Forcing, where the model’s own predictions are fed back as inputs for the next timestep, forcing it to maintain coherence.
- Stage 4: Checkpointed Self‑Forcing—instead of storing the full long‑horizon teacher trajectory, the system checkpoints intermediate states, drastically reducing memory while still providing a far‑looking supervisory signal.
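The Self‑Forcing rollout and its checkpointed variant can be sketched together: the model consumes its own predictions step by step, and instead of storing every intermediate state of the long‑horizon trajectory, only periodic checkpoints are kept, with intervening states recomputed on demand. This is a generic activation‑checkpointing sketch under stated assumptions, not the authors' implementation; `step_fn` stands in for one forward step of the world model.

```python
def rollout_self_forcing(step_fn, x0, horizon, ckpt_every=8):
    """Autoregressive rollout where each prediction is fed back as the
    next input (Self-Forcing). Only every `ckpt_every`-th state is kept,
    so memory scales as O(horizon / ckpt_every) instead of O(horizon)."""
    checkpoints = {0: x0}
    x = x0
    for t in range(1, horizon + 1):
        x = step_fn(x)              # model consumes its own prediction
        if t % ckpt_every == 0:
            checkpoints[t] = x      # periodic checkpoint
    return x, checkpoints


def recompute_segment(step_fn, checkpoints, t):
    """Recover the state at step t from the nearest earlier checkpoint,
    trading extra compute for the memory saved during the rollout."""
    t0 = max(c for c in checkpoints if c <= t)
    x = checkpoints[t0]
    for _ in range(t - t0):
        x = step_fn(x)
    return x
```

The recompute‑from‑checkpoint trade (more FLOPs, far less memory) is what lets the teacher horizon grow from 8 to 32 frames in the reported experiments.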
- Evaluation – The authors test Solaris on five axes (movement, memory, grounding, building, view consistency) using both quantitative metrics (e.g., PSNR, SSIM, action‑prediction accuracy) and human judgments.
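Of the quantitative metrics listed, PSNR is simple enough to show directly; the cross‑view averaging wrapper below is a plausible stand‑in for the paper's consistency metric, not its exact definition.

```python
import numpy as np


def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio between a predicted and ground-truth frame."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)


def cross_view_psnr(views_a, views_b):
    """Average PSNR over paired frames from two agents' overlapping views
    (illustrative stand-in for a cross-view consistency score)."""
    return float(np.mean([psnr(a, b) for a, b in zip(views_a, views_b)]))
```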
Results & Findings
- Solaris outperforms prior single‑agent video world models by 15‑20 % on cross‑view consistency metrics, indicating that it can keep multiple perspectives aligned over long horizons.
- In the building benchmark, the model correctly predicts collaborative structures 87 % of the time, compared to 62 % for the best baseline.
- Checkpointed Self‑Forcing reduces GPU memory usage by ~45 % while extending the teacher horizon from 8 to 32 frames, leading to smoother long‑term predictions.
- Human evaluators rated Solaris‑generated multiplayer videos as “more realistic” and “better coordinated” than those from competing models in 78 % of pairwise comparisons.
Practical Implications
- Game AI & Content Generation – Developers can use Solaris to prototype multi‑player scenarios, auto‑generate NPC behavior that reacts consistently across player viewpoints, or create dynamic cut‑scenes that adapt to multiple cameras.
- Robotics & Simulation – The framework can be adapted to simulate fleets of robots (e.g., warehouse drones) where each robot’s sensor feed must stay consistent with others, enabling safer policy testing before real‑world deployment.
- Virtual Collaboration Tools – In VR/AR meeting spaces, a Solaris‑style model could predict and render the shared environment from each participant’s perspective, reducing latency and bandwidth by sending only high‑level action updates.
- Research Platforms – By open‑sourcing the data pipeline, the community can now benchmark multi‑agent world models on a large, diverse dataset, accelerating progress in multi‑agent reinforcement learning and generative modeling.
Limitations & Future Work
- Domain Specificity – The system is tuned for Minecraft’s block‑based graphics; transferring to photorealistic or physics‑heavy environments may require substantial adaptation.
- Scalability of Agents – Experiments involve up to four agents; scaling to dozens or hundreds (e.g., massive multiplayer online games) could expose bottlenecks in synchronization and memory.
- Action Space Coverage – Only a subset of Minecraft actions (movement, block placement/removal) are modeled; richer interactions like combat or inventory management remain unexplored.
- Future Directions – The authors suggest extending Solaris to heterogeneous sensor modalities (audio, depth), integrating reinforcement learning for policy‑conditioned generation, and exploring hierarchical memory structures to handle larger agent populations.
Authors
- Georgy Savva
- Oscar Michel
- Daohan Lu
- Suppakit Waiwitlikhit
- Timothy Meehan
- Dhairya Mishra
- Srivats Poddar
- Jack Lu
- Saining Xie
Paper Information
- arXiv ID: 2602.22208v1
- Categories: cs.CV
- Published: February 25, 2026