[Paper] MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models

Published: June 8, 2026 at 01:59 PM EDT

Source: arXiv - 2606.09827v1

Overview

Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived context, the hippocampal system to preserve episodic memory of past experience, and internal models to imagine possible future state evolution.

Inspired by these mechanisms, we propose MemoryVLA++, a full temporal modeling framework that equips VLA models with memory and imagination for robotic manipulation. A pretrained VLM encodes the current observation into perceptual and cognitive tokens, forming working memory. These tokens query a Perceptual-Cognitive Memory Bank to retrieve relevant historical context. The bank stores low-level details and high-level semantics from past interactions and is updated through redundancy-aware consolidation. A world model imagines future states in a denoising latent space, and the imagined latents are integrated under memory guidance to form full temporal-aware tokens. The resulting tokens condition a diffusion action expert to predict temporally consistent action sequences.

We conduct extensive experiments on 5 simulation benchmarks and 3 categories of real-robot tasks across 3 robots, covering general manipulation, long-horizon temporal tasks, robustness, and generalization. Our method achieves strong performance across Libero, SimplerEnv, Mikasa-Robo, Calvin, Libero-Plus, and diverse real-robot tasks, validating the effectiveness of full temporal modeling with memory and imagination. For example, on real robots, it achieves gains of +9%, +26%, and +28% on general, memory-dependent, and imagination-dependent tasks, respectively.

Project Page: https://shihao1895.github.io/MemoryVLA-PP-Web
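To make the memory-bank idea in the overview more concrete, here is a minimal sketch of a fixed-capacity token store with attention-style retrieval and a simple form of redundancy-aware consolidation. It is an illustration under assumptions, not the authors' implementation: the class name `MemoryBank`, the methods `retrieve` and `write`, the `capacity` parameter, and the rule of averaging the most similar pair of temporally adjacent entries are all hypothetical choices made for this example. The abstract only states that current tokens query the bank and that the bank is updated through redundancy-aware consolidation.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


class MemoryBank:
    """Hypothetical fixed-capacity token memory (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries: List[Vector] = []  # oldest entry first

    def retrieve(self, query: Vector) -> Vector:
        """Softmax-weighted average of stored entries for a single query."""
        if not self.entries:
            return list(query)  # nothing stored yet: fall back to the query
        scale = math.sqrt(len(query))
        scores = [
            sum(q * e for q, e in zip(query, entry)) / scale
            for entry in self.entries
        ]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        return [
            sum(w * entry[i] for w, entry in zip(weights, self.entries)) / total
            for i in range(len(query))
        ]

    def write(self, token: Vector) -> None:
        """Append a token, consolidating if the bank exceeds its capacity."""
        self.entries.append(list(token))
        if len(self.entries) > self.capacity:
            self._consolidate()

    def _consolidate(self) -> None:
        # Assumed rule: merge the most similar adjacent pair by averaging,
        # so near-duplicate neighbours collapse into one entry.
        sims = [
            cosine(self.entries[i], self.entries[i + 1])
            for i in range(len(self.entries) - 1)
        ]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(a + b) / 2 for a, b in zip(self.entries[i], self.entries[i + 1])]
        self.entries[i:i + 2] = [merged]


bank = MemoryBank(capacity=2)
for token in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]):
    bank.write(token)
context = bank.retrieve([0.0, 1.0])
```

In this toy run the first two tokens are nearly identical, so they are the pair that gets merged when the third write exceeds the capacity, while the distinct third token is kept as-is. A real system would operate on high-dimensional perceptual and cognitive tokens and would likely use learned attention rather than raw dot products.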

Key Contributions

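Drawing on the abstract, the paper's main contributions are:

MemoryVLA++, a temporal modeling framework that equips VLA models with both memory and imagination for robotic manipulation.
A Perceptual-Cognitive Memory Bank that stores low-level details and high-level semantics from past interactions and is updated through redundancy-aware consolidation.
A world model that imagines future states in a denoising latent space, with the imagined latents integrated under memory guidance.
An evaluation on 5 simulation benchmarks and 3 categories of real-robot tasks across 3 robots.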

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

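This research targets robotic manipulation (cs.RO), in particular long-horizon, temporally dependent tasks, where VLA models that rely mainly on the current observation struggle.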

Authors

Hao Shi
Weiye Li
Bin Xie
Yulin Wang
Renping Zhou
Tiancai Wang
Xiangyu Zhang
Ping Luo
Gao Huang

Paper Information

arXiv ID: 2606.09827v1
Categories: cs.RO, cs.CV
Published: June 8, 2026
