[Paper] Hunyuan-GameCraft-2: Instruction-following Interactive Game World Model
Source: arXiv - 2511.23429v1
Overview
Hunyuan‑GameCraft‑2 pushes generative game‑world modeling beyond static scene synthesis by letting users steer video‑based game simulations with natural‑language instructions, keyboard input, or mouse input. By turning large corpora of unstructured text‑video pairs into causally aligned interactive data, the authors demonstrate a more flexible, lower‑cost way to create dynamic, player‑driven game content.
Key Contributions
- Instruction‑driven interaction: Replaces rigid keyboard‑only control schemes with free‑form language, mouse, and keyboard signals for richer gameplay manipulation.
- Automated interactive dataset pipeline: Converts massive text‑video corpora into causally aligned “interactive video” pairs without manual annotation.
- 14B MoE image‑to‑video foundation model: Extends a mixture‑of‑experts architecture with a text‑driven interaction injection module that controls camera motion, character actions, and environment dynamics.
- InterBench benchmark: A new evaluation suite focused on interaction quality, measuring responsiveness, temporal coherence, and causal grounding.
- Demonstrated free‑form actions: Shows the model can reliably execute commands like “open the door”, “draw a torch”, or “trigger an explosion” in generated game videos.
Methodology
- Interactive Video Definition – The authors formalize an “interactive video” as a sequence in which each frame is conditioned on a user instruction (text, key press, or mouse event) and the preceding visual context; a factorization capturing this idea is written out after the list.
- Data Construction – Starting from publicly available text‑video pairs (e.g., YouTube gameplay clips with subtitles), they run an automated pipeline (sketched in code after the list) that:
- Detects action cues in the text (verbs, objects).
- Aligns those cues with temporal segments in the video using off‑the‑shelf action localization models.
- Generates paired instruction‑video clips that are causally linked (the instruction directly causes the visual change).
- Model Architecture – A 14‑billion‑parameter Mixture‑of‑Experts (MoE) backbone processes a single keyframe image and a sequence of instruction tokens. A lightweight Interaction Injection Module injects the instruction embedding at multiple transformer layers (a minimal sketch follows the list), allowing fine‑grained control over:
- Camera motion (pan, zoom).
- Character behavior (movement, gestures).
- Environment dynamics (object state changes, particle effects).
- Training – The model is trained end‑to‑end on the automatically built interactive dataset using a combination of a video reconstruction loss, a temporal consistency loss, and a causal alignment loss that penalizes mismatches between the instruction and the resulting visual change (see the loss sketch after the list).
- Evaluation (InterBench) – The benchmark measures:
- Responsiveness (does the video reflect the instruction?).
- Temporal coherence (smooth transitions).
- Causal fidelity (no spurious actions).
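The “interactive video” definition above admits a compact autoregressive reading. The factorization below is our paraphrase of that definition, not an equation taken from the paper; v_0 denotes the conditioning keyframe, v_t the frame at step t, and a_t the user signal (text instruction, key press, or mouse event) at that step.

```latex
% Autoregressive factorization of an interactive video: every frame is
% generated from the keyframe, the frames produced so far, and the
% current user signal.
\[
  p\!\left(v_{1:T} \mid v_0, a_{1:T}\right)
  \;=\;
  \prod_{t=1}^{T} p\!\left(v_t \mid v_0,\, v_{<t},\, a_t\right)
\]
```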
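A minimal sketch of how the data‑construction steps could be wired together, assuming subtitles arrive as (timestamp, text) events and an off‑the‑shelf temporal action‑localization model is available behind a `localize` callable. The names `InteractiveClip`, `extract_action_cue`, and `build_interactive_pairs` are illustrative, not the authors’ code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple

@dataclass
class InteractiveClip:
    instruction: str   # action cue, e.g. "open the door"
    video_path: str
    start_s: float     # segment in which the cue's visual effect appears
    end_s: float

# Stand-in cue extractor: in practice this step would be a stronger NLP model
# that pulls verbs and their objects out of the subtitle text.
ACTION_VERBS = {"open", "draw", "trigger", "ignite", "jump", "attack"}

def extract_action_cue(text: str) -> Optional[str]:
    words = text.lower().split()
    for i, word in enumerate(words):
        if word in ACTION_VERBS:
            return " ".join(words[i:i + 3])  # verb plus a little context
    return None

def build_interactive_pairs(
    subtitle_events: Iterable[Tuple[float, str]],
    video_path: str,
    localize: Callable[[str, str, float], Optional[Tuple[float, float]]],
) -> List[InteractiveClip]:
    """Pair subtitle-derived action cues with the video segment they cause.
    `localize(video_path, cue, timestamp)` stands in for an off-the-shelf
    temporal action-localization model and returns (start_s, end_s) or None."""
    clips: List[InteractiveClip] = []
    for timestamp, text in subtitle_events:
        cue = extract_action_cue(text)
        if cue is None:
            continue
        segment = localize(video_path, cue, timestamp)
        if segment is None:
            continue  # no causal match found; drop the pair
        clips.append(InteractiveClip(cue, video_path, *segment))
    return clips
```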
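The Interaction Injection Module is described only at a high level, so the sketch below makes concrete assumptions: instruction tokens are fused into the video hidden states via cross‑attention with a zero‑initialized gate, and one such module sits at each injection layer. It illustrates the idea rather than reproducing the paper’s exact design.

```python
import torch
import torch.nn as nn

class InteractionInjection(nn.Module):
    """Fuses instruction embeddings (text / key / mouse tokens) into one
    transformer layer's video hidden states. A copy of this module would be
    placed at each injection point in the image-to-video backbone."""

    def __init__(self, hidden_dim: int, instr_dim: int, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(instr_dim, hidden_dim)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: starts as a no-op

    def forward(self, hidden: torch.Tensor, instr_tokens: torch.Tensor) -> torch.Tensor:
        # hidden:       (batch, video_tokens, hidden_dim) spatio-temporal tokens
        # instr_tokens: (batch, instr_len, instr_dim) encoded user signals
        instr = self.proj(instr_tokens)
        ctx, _ = self.cross_attn(query=hidden, key=instr, value=instr)
        return hidden + torch.tanh(self.gate) * ctx  # gated residual injection
```

In use, the backbone would apply such a module after selected transformer blocks, so the same mechanism covers camera motion, character behavior, and environment dynamics depending on what the instruction tokens encode.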
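The training objective is stated as a sum of three losses. The concrete forms below (MSE reconstruction, frame‑difference consistency, and a cosine‑similarity causal term) are assumptions made for illustration; the summary does not give the paper’s exact formulations.

```python
import torch
import torch.nn.functional as F

def interactive_training_loss(pred, target, instr_emb, change_emb,
                              w_rec=1.0, w_temp=0.1, w_causal=0.1):
    """Combined loss over a generated clip (illustrative stand-ins for the
    reconstruction, temporal-consistency, and causal-alignment terms).
    pred, target: (batch, frames, C, H, W) predicted / ground-truth frames
    instr_emb:    (batch, d) embedding of the instruction
    change_emb:   (batch, d) embedding of the observed visual change"""
    # Frame-wise reconstruction.
    rec = F.mse_loss(pred, target)

    # Temporal consistency: frame-to-frame differences should match the
    # ground-truth motion instead of flickering.
    pred_delta = pred[:, 1:] - pred[:, :-1]
    tgt_delta = target[:, 1:] - target[:, :-1]
    temp = F.mse_loss(pred_delta, tgt_delta)

    # Causal alignment: the instruction embedding should agree with the
    # embedding of the visual change it supposedly caused.
    causal = 1.0 - F.cosine_similarity(instr_emb, change_emb, dim=-1).mean()

    return w_rec * rec + w_temp * temp + w_causal * causal
```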
Results & Findings
- High instruction fidelity: On InterBench, Hunyuan‑GameCraft‑2 achieves a 78 % success rate in correctly executing free‑form commands, a ~20 % boost over the previous GameCraft baseline.
- Temporal smoothness: The model reduces flicker and abrupt motion artifacts, scoring 0.92 on a video‑smoothness metric (vs. 0.81 for prior work).
- Generalization to unseen verbs: Even when presented with novel actions (“ignite a lantern”), the system produces plausible visual outcomes, indicating strong semantic grounding.
- Low annotation overhead: The automated pipeline cuts human labeling cost by >90 %, enabling scaling to millions of interactive clips.
Practical Implications
- Rapid prototyping for indie developers – Teams can generate interactive gameplay footage from simple textual scripts, shortening the iteration loop for level design and narrative testing.
- Dynamic content generation in live services – MMOs or live‑ops games could use the model to spawn context‑aware events (e.g., “a sudden storm appears”) without hand‑crafted assets.
- AI‑assisted game testing – QA bots can issue natural‑language commands to verify that game mechanics respond correctly, automating regression testing.
- Educational and training simulators – Instruction‑driven video generation can create scenario‑based learning modules where learners dictate actions and see immediate visual feedback.
- Cross‑modal game UI – By supporting mouse and keyboard signals alongside text, developers can build hybrid control schemes (voice + mouse) for accessibility or VR/AR interfaces.
Limitations & Future Work
- Domain specificity – The training data is heavily biased toward typical 3rd‑person adventure or RPG footage; exotic genres (e.g., strategy, puzzle) may see degraded performance.
- Physical realism – While visually coherent, the model does not enforce physics constraints, leading to occasional impossible motions (e.g., floating objects).
- Scalability of real‑time inference – The 14B MoE model still requires substantial GPU memory, limiting on‑device deployment.
- Future directions suggested by the authors include expanding the dataset to cover more game genres, integrating a physics engine for constraint‑aware generation, and distilling the MoE into a lighter model for real‑time interactive applications.
Authors
- Junshu Tang
- Jiacheng Liu
- Jiaqi Li
- Longhuang Wu
- Haoyu Yang
- Penghao Zhao
- Siruis Gong
- Xiang Yuan
- Shuai Shao
- Qinglin Lu
Paper Information
- arXiv ID: 2511.23429v1
- Categories: cs.CV
- Published: November 28, 2025