[Paper] WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation

Published: March 17, 2026, 01:59 PM EDT
5 min read
Source: arXiv - 2603.16871v1

Overview

The paper introduces WorldCam, a novel framework that treats the camera pose (its 6‑DoF position and orientation) as the core language for interacting with AI‑generated 3D gaming worlds. By grounding user actions in precise geometric terms, WorldCam delivers far more controllable navigation and maintains visual consistency across long‑duration gameplay sessions, addressing two long‑standing pain points in generative game‑world research.

Key Contributions

  • Unified geometric representation: Camera pose is used as the single, continuous conditioning signal that links immediate player actions to the global 3D world.
  • Physics‑based action space: User inputs are mapped to Lie‑algebra vectors, yielding smooth, differentiable 6‑DoF camera motions.
  • Camera embedder: A dedicated module injects the pose information into a video diffusion transformer, ensuring that generated frames align perfectly with the intended viewpoint.
  • Spatial indexing via global poses: Past observations are retrieved based on their absolute camera coordinates, enabling the model to “remember” and faithfully revisit previously seen locations.
  • Large‑scale gameplay dataset: 3,000 minutes of real human play, annotated with camera trajectories and textual descriptions, released to the community for further research.
  • State‑of‑the‑art performance: Quantitative and qualitative gains in action controllability, long‑horizon visual fidelity, and 3D spatial consistency over existing interactive world models.

Methodology

  1. Action → Lie Algebra → Pose

    • Player inputs (e.g., joystick moves, mouse clicks) are first expressed as continuous velocity vectors in a physics‑based action space.
    • These vectors are embedded in the Lie algebra 𝔰𝔢(3), which naturally encodes both translation and rotation.
    • Exponential mapping converts the algebraic representation into a 6‑DoF camera pose (position + orientation) at each timestep.
  2. Camera Embedding into Diffusion Transformer

    • The computed pose is passed through a lightweight camera embedder that produces a positional token.
    • This token is concatenated with the usual textual and visual tokens before feeding the sequence into a video diffusion transformer (VDT).
    • The VDT then generates the next frame conditioned on the exact viewpoint, guaranteeing that the rendered scene matches the intended camera motion.
  3. Global Pose as Retrieval Index

    • All generated frames are stored alongside their absolute camera poses.
    • When the agent revisits a region, the system queries the memory using the current global pose, pulling the most relevant past observations.
    • Retrieved frames provide a geometric anchor, allowing the model to preserve textures, layout, and object placement across long navigation loops.
  4. Training & Evaluation

    • The model is trained on the newly collected dataset, optimizing a standard diffusion loss while also minimizing pose‑reconstruction error.
    • Benchmarks include action alignment metrics, Fréchet Video Distance (FVD) for visual quality, and a custom 3‑D consistency score based on pose‑aligned reprojection error.
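The action‑to‑pose step (1) above can be sketched in a few lines. The snippet below is an illustrative implementation of the 𝔰𝔢(3) exponential map via Rodrigues' formula plus the left Jacobian of SO(3), not the paper's code; the function name and the sample twist are assumptions.

```python
import numpy as np

def exp_se3(twist: np.ndarray) -> np.ndarray:
    """Map a twist (vx, vy, vz, wx, wy, wz) in se(3) to a 4x4 camera pose."""
    v, w = twist[:3], twist[3:]
    theta = np.linalg.norm(w)
    # Skew-symmetric matrix of the rotational part w
    W = np.array([[0.0, -w[2], w[1]],
                  [w[2], 0.0, -w[0]],
                  [-w[1], w[0], 0.0]])
    if theta < 1e-8:
        # Near-zero rotation: the map reduces to a pure translation
        R, V = np.eye(3), np.eye(3)
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (1.0 - A) / theta**2
        R = np.eye(3) + A * W + B * (W @ W)   # Rodrigues' rotation formula
        V = np.eye(3) + B * W + C * (W @ W)   # left Jacobian of SO(3)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v       # translation coupled to rotation via the Jacobian
    return T

# Example: one timestep of "move forward while yawing slightly"
pose = exp_se3(np.array([0.0, 0.0, 1.0, 0.0, 0.1, 0.0]))
```

Because the map is smooth in the twist, small changes in player input produce small changes in pose, which is what makes the action space differentiable.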
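Step 2's pose conditioning can be illustrated with a minimal sketch, assuming the camera embedder is a small MLP over the flattened pose matrix; the module name, dimensions, and token layout are assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn as nn

class CameraEmbedder(nn.Module):
    """Embed a 4x4 camera pose into a single conditioning token."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        # 12 inputs: the top 3x4 block (rotation + translation) of the pose
        self.mlp = nn.Sequential(nn.Linear(12, d_model), nn.SiLU(),
                                 nn.Linear(d_model, d_model))

    def forward(self, pose: torch.Tensor) -> torch.Tensor:
        # pose: (B, 4, 4) -> flatten top 3x4 block -> (B, 1, d_model)
        flat = pose[:, :3, :].reshape(pose.shape[0], -1)
        return self.mlp(flat).unsqueeze(1)

embedder = CameraEmbedder()
pose_tok = embedder(torch.eye(4).unsqueeze(0))   # (1, 1, 512)
text_toks = torch.randn(1, 8, 512)               # placeholder text/visual tokens
seq = torch.cat([text_toks, pose_tok], dim=1)    # sequence fed to the VDT
```

The diffusion transformer then attends over `seq`, so every generated frame is conditioned on the exact viewpoint token.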
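Step 3's spatial indexing can be sketched as a nearest‑neighbor lookup over stored camera positions. Everything here (the class name, plain Euclidean distance on translations, the choice of k) is a simplifying assumption; the paper's actual index may be more sophisticated.

```python
import numpy as np

class PoseMemory:
    """Store frames keyed by global camera position; retrieve by proximity."""
    def __init__(self):
        self.positions = []   # translation of each stored frame's pose
        self.frames = []      # the frame (or its latent) itself

    def store(self, pose: np.ndarray, frame) -> None:
        self.positions.append(pose[:3, 3])    # translation column of 4x4 pose
        self.frames.append(frame)

    def retrieve(self, pose: np.ndarray, k: int = 4):
        """Return the k stored frames whose camera positions are closest."""
        if not self.frames:
            return []
        dists = np.linalg.norm(np.stack(self.positions) - pose[:3, 3], axis=1)
        idx = np.argsort(dists)[:k]
        return [self.frames[i] for i in idx]

memory = PoseMemory()
memory.store(np.eye(4), "frame_at_origin")
far = np.eye(4); far[:3, 3] = [10.0, 0.0, 0.0]
memory.store(far, "frame_far_away")
nearby = np.eye(4); nearby[:3, 3] = [0.5, 0.0, 0.0]
anchors = memory.retrieve(nearby, k=1)   # -> ["frame_at_origin"]
```

Retrieved frames act as geometric anchors: when the camera loops back near a stored position, the generator conditions on what it produced there before.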

Results & Findings

| Metric | WorldCam | Prior Art (e.g., DreamFusion‑Game) |
| --- | --- | --- |
| Action controllability (° error) | 0.8° | 2.7° |
| Long‑horizon FVD (↓) | 112 | 219 |
| 3‑D spatial consistency (reprojection error) | 1.4 px | 3.9 px |
| User study (perceived realism) | 84 % prefer WorldCam | 61 % |

  • Tighter action alignment: The Lie‑algebra mapping yields sub‑degree orientation errors, making fine‑grained steering feel natural.
  • Consistent world reuse: When looping back to a previously visited spot, textures and object placements remain stable, eliminating the “pop‑in” artifacts common in earlier models.
  • Scalable generation: Despite the added pose conditioning, inference speed remains comparable to baseline VDTs (≈ 30 fps on a single RTX 4090), thanks to the lightweight embedder.

Practical Implications

  • Game prototyping: Designers can rapidly iterate on level layouts by simply steering a virtual camera; the model guarantees that the visual output stays coherent across the entire playthrough.
  • VR/AR content creation: Precise 6‑DoF control is essential for immersive experiences; WorldCam’s pose‑driven generation can produce on‑the‑fly environments that react accurately to head‑tracked motion.
  • Simulation & training: Autonomous‑vehicle or robotics simulators can benefit from a generative world that respects exact camera (or sensor) trajectories, improving realism for perception‑stack testing.
  • Tooling for developers: The released dataset and open‑source camera embedder make it straightforward to plug WorldCam into existing pipelines (e.g., Unity, Unreal) for real‑time world synthesis.

Limitations & Future Work

  • Static scene bias: The current training data emphasizes relatively static environments; dynamic objects (e.g., moving NPCs) are not yet handled robustly.
  • Memory scaling: Storing every frame with its global pose can become costly for ultra‑long sessions; the authors suggest hierarchical indexing as a next step.
  • Generalization to novel domains: While the model excels on the collected gameplay footage, transferring to entirely different genres (e.g., sci‑fi or open‑world RPGs) may require domain‑specific fine‑tuning.

The authors plan to extend WorldCam with temporal dynamics (action‑conditioned object motion) and hierarchical memory structures to keep memory footprints low while preserving long‑range consistency.


WorldCam demonstrates that treating the camera pose as a first‑class citizen bridges the gap between user intent and high‑fidelity 3D generation, opening the door to more controllable, immersive, and developer‑friendly AI‑driven game worlds.

Authors

  • Jisu Nam
  • Yicong Hong
  • Chun-Hao Paul Huang
  • Feng Liu
  • JoungBin Lee
  • Jiyoung Kim
  • Siyoon Jin
  • Yunsung Lee
  • Jaeyoon Jung
  • Suhwan Choi
  • Seungryong Kim
  • Yang Zhou

Paper Information

  • arXiv ID: 2603.16871v1
  • Categories: cs.CV
  • Published: March 17, 2026
  • PDF: Download PDF