[Paper] NavWAM: A Navigation World Action Model for Goal-Conditioned Visual Navigation

Published: June 11, 2026 at 11:44 AM EDT

Source: arXiv - 2606.13494v1

Overview

Goal-conditioned visual navigation requires a robot to act under partial observability by anticipating how its motion will change the future egocentric view and whether that change brings it closer to the goal. Navigation world models provide such visual foresight, but they remain prediction modules that require an external planner to convert predicted futures into closed-loop control. We propose Navigation World Action Model (NavWAM), a diffusion-transformer policy that turns navigation world-model prediction into executable action by representing future observations, goal-progress values, and action chunks in a shared latent sequence. By learning future prediction jointly with the action and value targets that determine closed-loop behavior, NavWAM makes visual foresight directly usable for robot control. We build NavWAM through simulation pretraining and real-robot adaptation, and evaluate it on image-goal navigation against planning-based world models and a representative direct navigation policy. Across offline benchmarks and closed-loop real-robot deployment, NavWAM improves over planning-based world-model baselines in our evaluations while using the default policy mode without CEM-style action search. Project page: https://dachii-azm.github.io/navwam/
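For readers unfamiliar with the "CEM-style action search" that the abstract says NavWAM's default policy mode avoids, the following is a minimal sketch of a generic cross-entropy method (CEM) planner. It is not the planner used by the paper's baselines. The names `cem_plan` and `toy_cost`, the hyperparameters, and the toy point-robot dynamics are all assumptions made for illustration; a real navigation world model would score candidate action sequences by predicting future egocentric views rather than by summing velocities.

```python
import numpy as np


def cem_plan(cost_fn, horizon, action_dim, iters=10, pop=64, n_elite=8, seed=0):
    """Generic cross-entropy method over action sequences (illustrative only)."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate action sequences around the current distribution.
        noise = rng.standard_normal((pop, horizon, action_dim))
        samples = mean + std * noise
        # Score each candidate; lower cost is better.
        costs = np.array([cost_fn(s) for s in samples])
        # Refit the sampling distribution to the lowest-cost candidates.
        elite = samples[np.argsort(costs)[:n_elite]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean


def toy_cost(actions, start=np.zeros(2), goal=np.array([1.0, 2.0])):
    """Hypothetical stand-in for a world model: position integrates 2-D velocities."""
    final = start + actions.sum(axis=0)
    return float(np.sum((final - goal) ** 2))


plan = cem_plan(toy_cost, horizon=4, action_dim=2)
print("planned actions shape:", plan.shape)
print("cost of plan:", toy_cost(plan))
```

Every planning step in this sketch calls the cost function `pop * iters` times, which is the kind of repeated search a policy that emits action chunks directly would skip.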

Key Contributions

The paper is listed under the following arXiv categories:

cs.RO (Robotics)
cs.CV (Computer Vision and Pattern Recognition)

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

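This research contributes to robotics (cs.RO). According to the abstract, NavWAM lets a navigation world model drive closed-loop robot control directly, without an external planner or CEM-style action search in its default policy mode.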

Authors

Daichi Azuma
Taiki Miyanishi
Koya Sakamoto
Shuhei Kurita
Yaonan Zhu
Petr Khrapchenkov
Motoaki Kawanabe
Yusuke Iwasawa
Yutaka Matsuo

Paper Information

arXiv ID: 2606.13494v1
Categories: cs.RO, cs.CV
Published: June 11, 2026
