[Paper] DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model

Published: June 10, 2026 at 09:59 AM EDT

Source: arXiv - 2606.12105v1

Overview

Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at hundreds of hertz, vision evolves more slowly, and language stays constant across an episode. A synchronous VLA oversamples slow modalities, undersamples fast ones, and caps action generation at the lowest effective frequency. We hypothesize that decoupling temporal processing per modality, letting each update and retain information at its own sensor rate, yields stronger representations and more robust control. We present DAM-VLA, which maintains per-modality latent buffers refreshed at sensor rates and read continuously by the action head, integrating new high-frequency modalities through gated cross-attention that leaves the pretrained backbone intact. Across seven contact-rich real-world manipulation tasks, DAM-VLA more than doubles the average success rate of the strongest synchronous baseline (95.2% vs. 40.95%) while sustaining smooth, reactive 100 Hz control. Project website: https://intuitive-robots.github.io/DAM-VLA/
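The per-modality buffer idea in the abstract can be illustrated with a small single-threaded simulation. This is only a sketch under stated assumptions, not the authors' implementation: the `LatentBuffer` class, the `run_episode` function, the stand-in encoders, and the 500 Hz force and 10 Hz vision rates are all hypothetical. Only the 100 Hz control rate comes from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class LatentBuffer:
    """Stores the latest latent of one modality (hypothetical helper)."""
    period_ms: Optional[int]              # None: encode once, e.g. language
    encode: Callable[[int], List[float]]  # stand-in encoder: timestamp -> latent
    latent: Optional[List[float]] = None
    updates: int = 0

    def tick(self, t_ms: int) -> None:
        # Refresh only when this modality has a new sample due.
        if self.period_ms is None:
            due = self.latent is None
        else:
            due = t_ms % self.period_ms == 0
        if due:
            self.latent = self.encode(t_ms)
            self.updates += 1


def run_episode(duration_ms: int = 1000, control_period_ms: int = 10):
    # Assumed sensor rates, for illustration only.
    buffers: Dict[str, LatentBuffer] = {
        "force": LatentBuffer(2, lambda t: [float(t)]),     # 500 Hz (assumed)
        "vision": LatentBuffer(100, lambda t: [float(t)]),  # 10 Hz (assumed)
        "language": LatentBuffer(None, lambda t: [1.0]),    # constant per episode
    }
    actions: List[Tuple[int, Dict[str, float]]] = []
    for t_ms in range(duration_ms):
        for buf in buffers.values():
            buf.tick(t_ms)
        if t_ms % control_period_ms == 0:
            # The action head reads whatever each buffer currently holds,
            # without waiting for the slower modalities to refresh.
            snapshot = {name: buf.latent[0] for name, buf in buffers.items()}
            actions.append((t_ms, snapshot))
    return buffers, actions


buffers, actions = run_episode()
print({name: buf.updates for name, buf in buffers.items()}, len(actions))
```

In this toy setup each encoder runs only as often as its sensor produces data, while the control loop still emits an action every 10 ms from the most recent latents. A synchronous design would instead tie every encoder and the action head to one shared rate.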

Key Contributions

This paper is filed under the following arXiv categories:

cs.RO (Robotics)
cs.CV (Computer Vision and Pattern Recognition)
cs.LG (Machine Learning)

Methodology

Please refer to the full paper for detailed methodology.
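The abstract states that new high-frequency modalities are integrated through gated cross-attention that leaves the pretrained backbone intact. The exact gating used in the paper is not given here, so the following is a minimal sketch of one common pattern: a tanh gate initialized at zero on a residual cross-attention branch. The function names, the single-head attention without learned projections, and the gate form are all assumptions for illustration.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    # Scaled dot-product attention of one query over the new modality's tokens.
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def gated_cross_attention(hidden: Vector, keys: List[Vector],
                          values: List[Vector], gate: float) -> Vector:
    # Residual branch scaled by tanh(gate). With gate == 0 the branch
    # contributes nothing, so the backbone's hidden state passes through unchanged.
    attended = cross_attention(hidden, keys, values)
    g = math.tanh(gate)
    return [h + g * a for h, a in zip(hidden, attended)]


hidden = [1.0, -2.0]                   # backbone hidden state (toy values)
tokens = [[0.5, 0.5], [-1.0, 2.0]]     # latents of a new modality (toy values)
print(gated_cross_attention(hidden, tokens, tokens, gate=0.0))
print(gated_cross_attention(hidden, tokens, tokens, gate=1.0))
```

With the gate at zero the output equals the input hidden state, which is one way a new modality can be attached without disturbing pretrained behavior at the start of training. As the gate is learned away from zero, the new modality's contribution grows.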

Practical Implications

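This research contributes to robotics (cs.RO), in particular contact-rich manipulation, where the abstract reports smooth, reactive control sustained at 100 Hz.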

Authors

Pankhuri Vanjani
Zhuoyue Li
Jakub Suliga
Moritz Reuss
Gianluca Geraci
Xinkai Jiang
Rudolf Lioutikov

Paper Information

arXiv ID: 2606.12105v1
Categories: cs.RO, cs.CV, cs.LG
Published: June 10, 2026
