[Paper] MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models

Published: June 11, 2026 at 12:02 PM EDT

Source: arXiv - 2606.13515v1

Overview

World-Action Models (WAMs) present a promising paradigm for robotic control via video prediction. However, current WAMs suffer from fundamental spatial bottlenecks: standard text inputs introduce referential ambiguity in cluttered scenes, while unstructured RGB predictions lack semantic grounding and remain biased by task-irrelevant backgrounds.

To overcome these limitations, we introduce MaskWAM, an object-centric world-action model. By jointly integrating masks as both explicit inputs and predictions via a unified Mixture of Transformers (MoT), MaskWAM unlocks robust policy generalization. This design provides two key benefits: (1) predicting future masks yields object-centric semantic supervision that suppresses visual noise, significantly enhancing even standard text-conditioned WAMs; and (2) coupling this predictive supervision with first-frame visual prompts, such as target object masks, establishes a precise spatial anchor that substantially reduces language ambiguity. Crucially, as WAMs are inherently vision-driven architectures, direct mask conditioning yields substantially stronger guidance than text alone, establishing a precise and robust paradigm for manipulating unseen objects.

Evaluations on LIBERO, RoboTwin, and real-world tasks demonstrate that MaskWAM significantly outperforms baselines in both language-clear and language-ambiguous tasks.
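To make the two roles of masks concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify how masks are encoded or scored, so the function names `attach_mask_prompt` and `mask_iou`, the extra-channel encoding of the prompt, and the use of intersection-over-union to compare a predicted future mask with a reference mask are all illustrative assumptions.

```python
from typing import List

Frame = List[List[List[float]]]  # H x W x 3 RGB values (hypothetical layout)
Mask = List[List[int]]           # H x W binary mask, 1 = target object


def attach_mask_prompt(frame: Frame, mask: Mask) -> Frame:
    """Append a binary object mask to the first frame as a fourth channel.

    Assumption: this is only one simple way to feed a mask prompt to a
    vision model; the paper may encode the prompt differently.
    """
    if len(frame) != len(mask) or any(
        len(f_row) != len(m_row) for f_row, m_row in zip(frame, mask)
    ):
        raise ValueError("frame and mask must have the same height and width")
    return [
        [pixel + [float(m)] for pixel, m in zip(f_row, m_row)]
        for f_row, m_row in zip(frame, mask)
    ]


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union between two binary masks of equal size."""
    inter = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


# A 2 x 2 toy frame with the target object in the top-left pixel.
frame = [[[0.9, 0.1, 0.1], [0.2, 0.2, 0.2]],
         [[0.2, 0.2, 0.2], [0.2, 0.2, 0.2]]]
prompt_mask = [[1, 0], [0, 0]]
prompted = attach_mask_prompt(frame, prompt_mask)

# A hypothetical predicted future mask that overshoots by one pixel.
predicted_mask = [[1, 1], [0, 0]]
iou = mask_iou(predicted_mask, prompt_mask)
```

In this toy case the prompted frame carries the object location alongside the RGB values, and the IoU score gives one possible measure of how closely a predicted mask tracks the target object rather than the background.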

Key Contributions

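According to the abstract, the paper's main contributions are:

MaskWAM, an object-centric world-action model that uses masks as both explicit inputs and predictions within a unified Mixture of Transformers (MoT).
Future-mask prediction as object-centric semantic supervision, which suppresses visual noise and improves even standard text-conditioned WAMs.
First-frame visual prompts, such as target object masks, which act as a precise spatial anchor and reduce language ambiguity.
Evaluations on LIBERO, RoboTwin, and real-world tasks, covering both language-clear and language-ambiguous settings.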

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

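According to the abstract, conditioning directly on masks gives stronger guidance than text alone. This matters most in cluttered scenes, where a text instruction can be ambiguous about the target object, and when manipulating unseen objects.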

Authors

Hanyang Yu
Haitao Lin
Jingbo Zhang
Wenyao Zhang
Chenghao Gu
Heng Li
Ping Tan

Paper Information

arXiv ID: 2606.13515v1
Categories: cs.CV, cs.LG, cs.RO
Published: June 11, 2026
