[Paper] Solaris: Building a Multiplayer Video World Model in Minecraft

Published: February 25, 2026 at 01:59 PM EST

Source: arXiv - 2602.22208v1

Overview

The paper presents Solaris, the first video world model that can generate coherent, multi‑view video streams for multiple agents interacting in a shared Minecraft environment. By building a dedicated data‑collection pipeline and a staged training regime, the authors show that a model can capture not just what a single player sees but also how several players’ perspectives evolve together over time, which opens the door to richer simulations for games, robotics, and AI research.

Key Contributions

  • Multiplayer data system: An automated pipeline that records synchronized video, actions, and world state from multiple agents playing Minecraft together, yielding 12.64 M frames.
  • Evaluation suite for multiplayer dynamics: Benchmarks covering movement coordination, memory of past events, grounding of objects, collaborative building, and cross‑view consistency.
  • Staged training pipeline: A progressive approach that starts with single‑player modeling and gradually introduces multi‑agent interactions using a mix of bidirectional, causal, and Self‑Forcing objectives.
  • Checkpointed Self‑Forcing: A memory‑efficient variant of Self‑Forcing that makes a longer‑horizon teacher affordable without a matching increase in GPU memory.
  • Open‑source release: The data collection framework, trained models, and evaluation code are publicly available, providing a foundation for future multi‑agent world‑model research.
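
As a rough illustration of what synchronized multi‑agent logging involves, the sketch below groups per‑agent records by server tick and keeps only the ticks for which every agent has a record. The Record fields, the agent names, and the align_by_tick helper are hypothetical; none of them are taken from the released framework.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Record:
    """One logged sample for one agent (hypothetical schema)."""
    tick: int        # server tick index shared by all agents
    agent: str       # agent identifier
    action: str      # action taken at this tick
    frame_path: str  # where the rendered frame was saved


def align_by_tick(records: Iterable[Record], agents: List[str]) -> Dict[int, Dict[str, Record]]:
    """Group records by tick and keep ticks where every agent has a record."""
    by_tick: Dict[int, Dict[str, Record]] = defaultdict(dict)
    for r in records:
        by_tick[r.tick][r.agent] = r
    wanted = set(agents)
    return {t: by_tick[t] for t in sorted(by_tick) if set(by_tick[t]) == wanted}


records = [
    Record(0, "alice", "move_forward", "alice/0.png"),
    Record(0, "bob", "place_block", "bob/0.png"),
    Record(1, "alice", "turn_left", "alice/1.png"),
    Record(1, "bob", "move_forward", "bob/1.png"),
    Record(2, "alice", "move_forward", "alice/2.png"),  # bob's tick 2 is missing
]
aligned = align_by_tick(records, ["alice", "bob"])
print(sorted(aligned))  # [0, 1]
```

Dropping incomplete ticks is only one possible policy; a real pipeline might instead interpolate or flag the gap.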

Methodology

  1. Data Collection – The authors built a custom Minecraft server that spawns several bots, each with its own camera. At 20 Hz, the server logs each agent’s frame, the corresponding action (e.g., move, place block), and a global world snapshot, which keeps the recordings temporally aligned across agents.
  2. Model Architecture – Solaris extends a video diffusion backbone with multiple conditioning streams:
    • Agent‑specific action tokens (what each player does).
    • Shared world memory that stores a compressed representation of past frames across all agents.
    • Temporal layers that combine bidirectional attention (information flows both forward and backward in time) with causal attention (each frame sees only the past), improving consistency.
  3. Training Stages
    • Stage 1: Train on single‑player clips to learn basic physics and texture generation.
    • Stage 2: Introduce paired agents, encouraging the model to predict one agent’s view given the other’s actions (causal conditioning).
    • Stage 3: Apply Self‑Forcing, where the model’s own predictions are fed back as inputs for the next timestep, forcing it to maintain coherence.
    • Stage 4: Apply Checkpointed Self‑Forcing. Instead of storing the full long‑horizon teacher trajectory, the system checkpoints intermediate states, sharply reducing memory while still providing a far‑looking supervisory signal.
  4. Evaluation – The authors test Solaris on five axes (movement, memory, grounding, building, view consistency) using both quantitative metrics (e.g., PSNR, SSIM, action‑prediction accuracy) and human judgments.
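
Stages 3 and 4 can be pictured with a toy example. The sketch below is not the paper's algorithm: it replaces the video model with a made‑up scalar step function and shows only the general checkpoint‑and‑recompute idea, in which a rollout feeds its own outputs back in, stores every k‑th state, and recomputes the states in between on demand. All names and constants are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple


def step(state: float, action: float) -> float:
    """Toy stand-in for one model step (assumed dynamics, not the real model)."""
    return 0.9 * state + action


def rollout_full(s0: float, actions: List[float]) -> List[float]:
    """Autoregressive rollout that stores every intermediate state."""
    states = [s0]
    for a in actions:
        states.append(step(states[-1], a))  # the model's own output is fed back in
    return states


def rollout_checkpointed(s0: float, actions: List[float], every: int) -> Tuple[Dict[int, float], float]:
    """Same rollout, but only every `every`-th state is kept in memory."""
    checkpoints = {0: s0}
    s = s0
    for t, a in enumerate(actions, start=1):
        s = step(s, a)
        if t % every == 0:
            checkpoints[t] = s
    return checkpoints, s


def recover(checkpoints: Dict[int, float], actions: List[float], t: int) -> float:
    """Recompute the state at time t from the nearest earlier checkpoint."""
    start = max(k for k in checkpoints if k <= t)
    s = checkpoints[start]
    for a in actions[start:t]:
        s = step(s, a)
    return s


actions = [0.1] * 32
full = rollout_full(1.0, actions)
checkpoints, final = rollout_checkpointed(1.0, actions, every=8)
print(len(full), len(checkpoints))  # 33 5
```

The trade is extra computation for less storage: recovering a state costs at most `every - 1` additional steps, while the number of stored states drops from 33 to 5 in this example.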

Results & Findings

  • Solaris outperforms prior single‑agent video world models by 15‑20 % on cross‑view consistency metrics, indicating that it can keep multiple perspectives aligned over long horizons.
  • In the building benchmark, the model correctly predicts collaborative structures 87 % of the time, compared to 62 % for the best baseline.
  • Checkpointed Self‑Forcing reduces GPU memory usage by ~45 % while extending the teacher horizon from 8 to 32 frames, leading to smoother long‑term predictions.
  • Human evaluators rated Solaris‑generated multiplayer videos as “more realistic” and “better coordinated” than those from competing models in 78 % of pairwise comparisons.
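
When reading figures like these, it helps to separate absolute gains (percentage points) from relative gains (percent of the baseline). The snippet below works through the building‑benchmark numbers reported above, 87 % versus 62 %. The gains helper is illustrative only.

```python
from typing import Tuple


def gains(new: float, old: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = new - old
    relative = 100.0 * (new - old) / old
    return absolute, relative


abs_pts, rel_pct = gains(87.0, 62.0)
print(f"{abs_pts:.0f} points absolute, {rel_pct:.1f} % relative")
# 25 points absolute, 40.3 % relative
```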

Practical Implications

  • Game AI & Content Generation – Developers can use Solaris to prototype multi‑player scenarios, auto‑generate NPC behavior that reacts consistently across player viewpoints, or create dynamic cut‑scenes that adapt to multiple cameras.
  • Robotics & Simulation – The framework can be adapted to simulate fleets of robots (e.g., warehouse drones) where each robot’s sensor feed must stay consistent with others, enabling safer policy testing before real‑world deployment.
  • Virtual Collaboration Tools – In VR/AR meeting spaces, a Solaris‑style model could predict and render the shared environment from each participant’s perspective, reducing latency and bandwidth by sending only high‑level action updates.
  • Research Platforms – By open‑sourcing the data pipeline, the community can now benchmark multi‑agent world models on a large, diverse dataset, accelerating progress in multi‑agent reinforcement learning and generative modeling.

Limitations & Future Work

  • Domain Specificity – The system is tuned for Minecraft’s block‑based graphics; transferring to photorealistic or physics‑heavy environments may require substantial adaptation.
  • Scalability of Agents – Experiments involve up to four agents; scaling to dozens or hundreds (e.g., massive multiplayer online games) could expose bottlenecks in synchronization and memory.
  • Action Space Coverage – Only a subset of Minecraft actions (movement, block placement/removal) is modeled; richer interactions like combat or inventory management remain unexplored.
  • Future Directions – The authors suggest extending Solaris to heterogeneous sensor modalities (audio, depth), integrating reinforcement learning for policy‑conditioned generation, and exploring hierarchical memory structures to handle larger agent populations.

Authors

  • Georgy Savva
  • Oscar Michel
  • Daohan Lu
  • Suppakit Waiwitlikhit
  • Timothy Meehan
  • Dhairya Mishra
  • Srivats Poddar
  • Jack Lu
  • Saining Xie

Paper Information

  • arXiv ID: 2602.22208v1
  • Categories: cs.CV
  • Published: February 25, 2026