[Paper] WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation

Published: March 17, 2026, 01:59 PM EDT
5 min read
Source: arXiv - 2603.16871v1

Overview

The paper introduces WorldCam, a novel framework that treats the camera pose (its 6‑DoF position and orientation) as the core language for interacting with AI‑generated 3D gaming worlds. By grounding user actions in precise geometric terms, WorldCam delivers far more controllable navigation and maintains visual consistency across long‑duration gameplay sessions, addressing two long‑standing pain points in generative game‑world research.

Key Contributions

  • Unified geometric representation: Camera pose is used as the single, continuous conditioning signal that links immediate player actions to the global 3D world.
  • Physics‑based action space: User inputs are mapped to Lie‑algebra vectors, yielding smooth, differentiable 6‑DoF camera motions.
  • Camera embedder: A dedicated module injects the pose information into a video diffusion transformer, ensuring that generated frames align perfectly with the intended viewpoint.
  • Spatial indexing via global poses: Past observations are retrieved based on their absolute camera coordinates, enabling the model to “remember” and faithfully revisit previously seen locations.
  • Large‑scale gameplay dataset: 3,000 minutes of real human play, annotated with camera trajectories and textual descriptions, released to the community for further research.
  • State‑of‑the‑art performance: Quantitative and qualitative gains in action controllability, long‑horizon visual fidelity, and 3D spatial consistency over existing interactive world models.

Methodology

  1. Action → Lie Algebra → Pose

    • Player inputs (e.g., joystick moves, mouse clicks) are first expressed as continuous velocity vectors in a physics‑based action space.
    • These vectors are embedded in the Lie algebra 𝔰𝔢(3), which naturally encodes both translation and rotation.
    • Exponential mapping converts the algebraic representation into a 6‑DoF camera pose (position + orientation) at each timestep.
  2. Camera Embedding into Diffusion Transformer

    • The computed pose is passed through a lightweight camera embedder that produces a positional token.
    • This token is concatenated with the usual textual and visual tokens before feeding the sequence into a video diffusion transformer (VDT).
    • The VDT then generates the next frame conditioned on the exact viewpoint, guaranteeing that the rendered scene matches the intended camera motion.
  3. Global Pose as Retrieval Index

    • All generated frames are stored alongside their absolute camera poses.
    • When the agent revisits a region, the system queries the memory using the current global pose, pulling the most relevant past observations.
    • Retrieved frames provide a geometric anchor, allowing the model to preserve textures, layout, and object placement across long navigation loops.
  4. Training & Evaluation

    • The model is trained on the newly collected dataset, optimizing a standard diffusion loss while also minimizing pose‑reconstruction error.
    • Benchmarks include action alignment metrics, Fréchet Video Distance (FVD) for visual quality, and a custom 3‑D consistency score based on pose‑aligned reprojection error.
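The action‑to‑pose step (1) above can be sketched in a few lines. The snippet below is an illustrative implementation of the 𝔰𝔢(3) exponential map via Rodrigues' formula plus the left Jacobian of SO(3), not the paper's code; the function name and the sample twist are assumptions.

```python
import numpy as np

def exp_se3(twist: np.ndarray) -> np.ndarray:
    """Map a twist (vx, vy, vz, wx, wy, wz) in se(3) to a 4x4 camera pose."""
    v, w = twist[:3], twist[3:]
    theta = np.linalg.norm(w)
    # Skew-symmetric matrix of the rotational part w
    W = np.array([[0.0, -w[2], w[1]],
                  [w[2], 0.0, -w[0]],
                  [-w[1], w[0], 0.0]])
    if theta < 1e-8:
        # Near-zero rotation: the map reduces to a pure translation
        R, V = np.eye(3), np.eye(3)
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (1.0 - A) / theta**2
        R = np.eye(3) + A * W + B * (W @ W)   # Rodrigues' rotation formula
        V = np.eye(3) + B * W + C * (W @ W)   # left Jacobian of SO(3)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v       # translation coupled to rotation via the Jacobian
    return T

# Example: one timestep of "move forward while yawing slightly"
pose = exp_se3(np.array([0.0, 0.0, 1.0, 0.0, 0.1, 0.0]))
```

Because the map is smooth in the twist, small changes in player input produce small changes in pose, which is what makes the action space differentiable.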
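Step 2's pose conditioning can be illustrated with a minimal sketch, assuming the camera embedder is a small MLP over the flattened pose matrix; the module name, dimensions, and token layout are assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn as nn

class CameraEmbedder(nn.Module):
    """Embed a 4x4 camera pose into a single conditioning token."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        # 12 inputs: the top 3x4 block (rotation + translation) of the pose
        self.mlp = nn.Sequential(nn.Linear(12, d_model), nn.SiLU(),
                                 nn.Linear(d_model, d_model))

    def forward(self, pose: torch.Tensor) -> torch.Tensor:
        # pose: (B, 4, 4) -> flatten top 3x4 block -> (B, 1, d_model)
        flat = pose[:, :3, :].reshape(pose.shape[0], -1)
        return self.mlp(flat).unsqueeze(1)

embedder = CameraEmbedder()
pose_tok = embedder(torch.eye(4).unsqueeze(0))   # (1, 1, 512)
text_toks = torch.randn(1, 8, 512)               # placeholder text/visual tokens
seq = torch.cat([text_toks, pose_tok], dim=1)    # sequence fed to the VDT
```

The diffusion transformer then attends over `seq`, so every generated frame is conditioned on the exact viewpoint token.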
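Step 3's spatial indexing can be sketched as a nearest‑neighbor lookup over stored camera positions. Everything here (the class name, plain Euclidean distance on translations, the choice of k) is a simplifying assumption; the paper's actual index may be more sophisticated.

```python
import numpy as np

class PoseMemory:
    """Store frames keyed by global camera position; retrieve by proximity."""
    def __init__(self):
        self.positions = []   # translation of each stored frame's pose
        self.frames = []      # the frame (or its latent) itself

    def store(self, pose: np.ndarray, frame) -> None:
        self.positions.append(pose[:3, 3])    # translation column of 4x4 pose
        self.frames.append(frame)

    def retrieve(self, pose: np.ndarray, k: int = 4):
        """Return the k stored frames whose camera positions are closest."""
        if not self.frames:
            return []
        dists = np.linalg.norm(np.stack(self.positions) - pose[:3, 3], axis=1)
        idx = np.argsort(dists)[:k]
        return [self.frames[i] for i in idx]

memory = PoseMemory()
memory.store(np.eye(4), "frame_at_origin")
far = np.eye(4); far[:3, 3] = [10.0, 0.0, 0.0]
memory.store(far, "frame_far_away")
nearby = np.eye(4); nearby[:3, 3] = [0.5, 0.0, 0.0]
anchors = memory.retrieve(nearby, k=1)   # -> ["frame_at_origin"]
```

Retrieved frames act as geometric anchors: when the camera loops back near a stored position, the generator conditions on what it produced there before.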

Results & Findings

| Metric | WorldCam | Prior Art (e.g., DreamFusion‑Game) |
| --- | --- | --- |
| Action controllability (° error) | 0.8° | 2.7° |
| Long‑horizon FVD (↓) | 112 | 219 |
| 3‑D spatial consistency (reprojection error) | 1.4 px | 3.9 px |
| User study (perceived realism) | 84 % prefer WorldCam | 61 % |

  • Tighter action alignment: The Lie‑algebra mapping yields sub‑degree orientation errors, making fine‑grained steering feel natural.
  • Consistent world reuse: When looping back to a previously visited spot, textures and object placements remain stable, eliminating the “pop‑in” artifacts common in earlier models.
  • Scalable generation: Despite the added pose conditioning, inference speed remains comparable to baseline VDTs (≈ 30 fps on a single RTX 4090), thanks to the lightweight embedder.

Practical Implications

  • Game prototyping: Designers can rapidly iterate on level layouts by simply steering a virtual camera; the model guarantees that the visual output stays coherent across the entire playthrough.
  • VR/AR content creation: Precise 6‑DoF control is essential for immersive experiences; WorldCam’s pose‑driven generation can produce on‑the‑fly environments that react accurately to head‑tracked motion.
  • Simulation & training: Autonomous‑vehicle or robotics simulators can benefit from a generative world that respects exact camera (or sensor) trajectories, improving realism for perception‑stack testing.
  • Tooling for developers: The released dataset and open‑source camera embedder make it straightforward to plug WorldCam into existing pipelines (e.g., Unity, Unreal) for real‑time world synthesis.

Limitations & Future Work

  • Static scene bias: The current training data emphasizes relatively static environments; dynamic objects (e.g., moving NPCs) are not yet handled robustly.
  • Memory scaling: Storing every frame with its global pose can become costly for ultra‑long sessions; the authors suggest hierarchical indexing as a next step.
  • Generalization to novel domains: While the model excels on the collected gameplay footage, transferring to entirely different genres (e.g., sci‑fi or open‑world RPGs) may require domain‑specific fine‑tuning.

The authors plan to extend WorldCam with temporal dynamics (action‑conditioned object motion) and hierarchical memory structures to keep memory footprints low while preserving long‑range consistency.


WorldCam demonstrates that treating the camera pose as a first‑class citizen bridges the gap between user intent and high‑fidelity 3D generation, opening the door to more controllable, immersive, and developer‑friendly AI‑driven game worlds.

Authors

  • Jisu Nam
  • Yicong Hong
  • Chun-Hao Paul Huang
  • Feng Liu
  • JoungBin Lee
  • Jiyoung Kim
  • Siyoon Jin
  • Yunsung Lee
  • Jaeyoon Jung
  • Suhwan Choi
  • Seungryong Kim
  • Yang Zhou

Paper Information

  • arXiv ID: 2603.16871v1
  • Categories: cs.CV
  • Published: March 17, 2026
  • PDF: Download PDF