[Paper] AHA-WAM:Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

Published: 3 days ago (June 8, 2026 at 01:55 PM EDT)

2 min read

Source: arXiv

Source: arXiv - 2606.09811v1

Overview

World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. Therefore, we propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains rolling key-value memory over past observations and exposes reusable layerwise latent context encoding long-horizon scene evolution, while a high-frequency action DiT executes short action chunks in closed loop by querying this context through layerwise joint attention. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to real-time execution state without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across 4 real-world tasks, while reaching 24.17 Hz closed-loop control with a 4.59x speedup over Fast-WAM.

Key Contributions

This paper presents research in the following areas:

cs.RO
cs.AI
cs.CV

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

This research contributes to the advancement of cs.RO.

Authors

Jisong Cai
Long Ling
Shiwei Chu
Zhongshan Liu
Jiayue Kang
Zhixuan Liang
Wenjie Xu
Yinan Mao
Weinan Zhang
Xiaokang Yang
Ru Ying
Ran Zheng
Yao Mu

Paper Information

arXiv ID: 2606.09811v1
Categories: cs.RO, cs.AI, cs.CV
Published: June 8, 2026
PDF: Download PDF

[Paper] AHA-WAM:Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

Overview

Key Contributions

Methodology

Practical Implications

Authors

Paper Information

Related posts

[Paper] Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

[Paper] DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?

[Paper] Illumination-Robust Camera-Based Heart-Rate Estimation for Physiological Sensing in Robots

[Paper] Atlas H&E-TME: Scalable AI-Based Tissue Profiling at Expert Pathologist-Level Accuracy

Overview

Key Contributions

Methodology

Practical Implications

Authors

Paper Information

Related posts

[Paper] Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

[Paper] DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?

[Paper] Illumination-Robust Camera-Based Heart-Rate Estimation for Physiological Sensing in Robots

[Paper] Atlas H&amp;E-TME: Scalable AI-Based Tissue Profiling at Expert Pathologist-Level Accuracy

[Paper] Atlas H&E-TME: Scalable AI-Based Tissue Profiling at Expert Pathologist-Level Accuracy