[Paper] iMaC: Translating Actions into Motion and Contact Images for Embodied World Models

Published: June 8, 2026 at 01:55 PM EDT

Source: arXiv - 2606.09813v1

Overview

Embodied world models have emerged as a pivotal paradigm for visual robotic decision-making and interactive environment simulation. However, conventional embodied frameworks rely on low-dimensional structured action vectors (e.g., joint angles and end-effector poses), which suffer from limited expressive capacity, poor generalization across diverse embodiments, and unnatural dynamic modeling of complex physical interactions.

To address these limitations, this paper proposes iMaC (Image as Action Control), a unified control paradigm that treats raw visual images as native action representations for embodied world models. Departing from traditional explicit kinematic action encoding, iMaC formulates continuous visual manipulation as image-based action tokens, which inherently encapsulate spatial motion intentions, interactive geometric constraints, and subtle physical dynamics. We construct a dual-branch embodied architecture consisting of an image-action encoder and a dynamic world predictor: the encoder compresses target-driven visual images into compact action embeddings, while the predictor learns environment transition rules conditioned on image actions to achieve high-fidelity future state prediction and closed-loop embodied control.

Extensive experiments are conducted on public embodied manipulation benchmarks and in real-world robotic scenarios. The results demonstrate that iMaC outperforms vector-based action control baselines in prediction accuracy, task success rate, and cross-scene generalization ability. Moreover, our image-action design eliminates the reliance on manually defined action spaces, enabling flexible and universal control for heterogeneous embodied agents. This work provides a visual-action perspective on embodied world models, offering a simple yet effective paradigm for scalable robotic perception and manipulation.
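To make the dual-branch idea above more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the names `encode_action_image`, `WorldPredictor`, and `rollout` are hypothetical, average pooling stands in for a learned image-action encoder, and a random linear residual update stands in for a learned world predictor. The only thing it is meant to show is the data flow the abstract describes: an action arrives as an image, is compressed into a compact embedding, and conditions the prediction of the next state.

```python
import random


def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def encode_action_image(image, patch=2):
    """Compress an H x W action image into a flat embedding.

    ASSUMPTION: average pooling over patch x patch blocks is a stand-in
    for a learned image-action encoder; the paper's encoder is not
    specified in the abstract.
    """
    h, w = len(image), len(image[0])
    embedding = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            block = [image[i][j]
                     for i in range(r, min(r + patch, h))
                     for j in range(c, min(c + patch, w))]
            embedding.append(sum(block) / len(block))
    return embedding


class WorldPredictor:
    """Toy transition model conditioned on an action embedding.

    ASSUMPTION: a fixed random linear residual update replaces the
    learned predictor; weights are untrained and purely illustrative.
    """

    def __init__(self, state_dim, action_dim, seed=0):
        rng = random.Random(seed)
        self.w_state = [[rng.uniform(-0.1, 0.1) for _ in range(state_dim)]
                        for _ in range(state_dim)]
        self.w_action = [[rng.uniform(-0.1, 0.1) for _ in range(action_dim)]
                         for _ in range(state_dim)]

    def step(self, state, action_embedding):
        # next_state = state + W_s @ state + W_a @ action_embedding
        return [s + dot(ws, state) + dot(wa, action_embedding)
                for s, ws, wa in zip(state, self.w_state, self.w_action)]


def rollout(predictor, state, action_images):
    """Predict a state trajectory from a sequence of action images."""
    states = [list(state)]
    for image in action_images:
        state = predictor.step(state, encode_action_image(image))
        states.append(state)
    return states


if __name__ == "__main__":
    # A 4 x 4 "action image"; with patch=2 it yields a 4-dim embedding.
    action_image = [[1, 1, 0, 0],
                    [1, 1, 0, 0],
                    [0, 0, 0, 0],
                    [0, 0, 2, 2]]
    predictor = WorldPredictor(state_dim=3, action_dim=4)
    trajectory = rollout(predictor, [0.1, 0.2, 0.3], [action_image] * 3)
    print(len(trajectory), "states predicted")
```

Because the predictor only ever sees the embedding, the same interface would accept action images from any embodiment without a hand-defined action vector, which is the property the abstract emphasizes. How such images are actually produced and tokenized is described in the full paper, not here.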

Key Contributions

The paper is listed under the following arXiv categories:

cs.RO (Robotics)
cs.CV (Computer Vision and Pattern Recognition)

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

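This research contributes to robotics (cs.RO). According to the abstract, the image-action design removes the reliance on manually defined action spaces, which the authors present as enabling flexible control across heterogeneous embodied agents.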

Authors

Zhenyu Wu
Xiuwei Xu
Yukun Zhou
Yifan Li
Qiuping Deng
Xiaofeng Wang
Zheng Zhu
Bingyao Yu
Ziwei Wang
Jiwen Lu
Haibin Yan

Paper Information

arXiv ID: 2606.09813v1
Categories: cs.RO, cs.CV
Published: June 8, 2026
