[Paper] MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models

Published: June 11, 2026 at 12:02 PM EDT
Source: arXiv - 2606.13515v1

Overview

World-Action Models (WAMs) present a promising paradigm for robotic control via video prediction. However, current WAMs suffer from fundamental spatial bottlenecks: standard text inputs introduce referential ambiguity in cluttered scenes, while unstructured RGB predictions lack semantic grounding and remain biased by task-irrelevant backgrounds. To overcome these limitations, we introduce MaskWAM, an object-centric world-action model. By integrating masks as both explicit inputs and predictions within a unified Mixture of Transformers (MoT), MaskWAM improves policy generalization. This design provides two key benefits: (1) predicting future masks yields object-centric semantic supervision that suppresses visual noise, which improves even standard text-conditioned WAMs; and (2) coupling this predictive supervision with first-frame visual prompts, such as target object masks, establishes a precise spatial anchor that substantially reduces language ambiguity. Because WAMs are inherently vision-driven architectures, direct mask conditioning provides stronger guidance than text alone, which makes it a precise and robust way to specify unseen objects for manipulation. Evaluations on LIBERO, RoboTwin, and real-world tasks show that MaskWAM significantly outperforms baselines on both language-clear and language-ambiguous tasks.
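To make the idea of a first-frame mask prompt concrete, here is a minimal sketch in Python. It is not the paper's implementation: the function names `add_mask_prompt` and `mask_iou` are hypothetical, and stacking the mask as a fourth image channel is an assumption chosen for simplicity (the paper feeds masks through its MoT architecture, whose details are not given here). The second function shows one common way a predicted future mask could be compared with a reference mask.

```python
import numpy as np


def add_mask_prompt(first_frame: np.ndarray, target_mask: np.ndarray) -> np.ndarray:
    """Stack a binary target-object mask onto an RGB frame as a 4th channel.

    Hypothetical helper: channel stacking is an illustrative assumption,
    not the conditioning mechanism described in the paper.
    """
    if first_frame.ndim != 3 or first_frame.shape[2] != 3:
        raise ValueError("first_frame must have shape (H, W, 3)")
    if target_mask.shape != first_frame.shape[:2]:
        raise ValueError("target_mask must have shape (H, W)")
    rgb = first_frame.astype(np.float32) / 255.0            # scale to [0, 1]
    mask = (target_mask > 0).astype(np.float32)[..., None]  # (H, W, 1), values 0 or 1
    return np.concatenate([rgb, mask], axis=-1)             # (H, W, 4)


def mask_iou(pred_mask: np.ndarray, ref_mask: np.ndarray) -> float:
    """Intersection-over-union between two binary masks of equal shape."""
    pred = pred_mask > 0
    ref = ref_mask > 0
    union = np.logical_or(pred, ref).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, ref).sum() / union)
```

In this sketch the mask marks which pixels belong to the target object, so an instruction such as "pick up the cup" can be tied to one specific cup even when several are visible.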

Key Contributions

This paper is listed under the following arXiv categories:

  • cs.CV (Computer Vision and Pattern Recognition)
  • cs.LG (Machine Learning)
  • cs.RO (Robotics)

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

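The work targets robotic manipulation. According to the abstract, conditioning on target object masks reduces language ambiguity in cluttered scenes and supports manipulating unseen objects, and predicting future masks also improves standard text-conditioned WAMs.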

Authors

  • Hanyang Yu
  • Haitao Lin
  • Jingbo Zhang
  • Wenyao Zhang
  • Chenghao Gu
  • Heng Li
  • Ping Tan

Paper Information

  • arXiv ID: 2606.13515v1
  • Categories: cs.CV, cs.LG, cs.RO
  • Published: June 11, 2026