[Paper] MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Published: June 5, 2026 at 01:59 PM EDT

Source: arXiv - 2606.07512v1

Overview

Current Vision-Language Models (VLMs) struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves state-of-the-art (SOTA) results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5-point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM’s performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.
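
To make the memory structure more concrete, here is a minimal Python sketch of an incrementally built, tiered graph memory. It is an illustration under stated assumptions, not the authors' implementation: the class and field names, the fixed-size segment grouping, the string-concatenation "summaries", and the mapping of tiers to video, segment, and event are all hypothetical choices, since the abstract does not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One memory entry. Tier 0 is the base graph; higher tiers are more abstract."""
    node_id: str
    tier: int
    summary: str
    start_s: float
    end_s: float
    children: list = field(default_factory=list)  # ids of nodes one tier below


class HierarchicalGraphMemory:
    """Hypothetical three-tier memory: 2 = whole video, 1 = segments, 0 = events."""

    def __init__(self, segment_size=2):
        self.nodes = {}
        self.edges = []  # (source_id, relation, target_id) between tier-0 nodes
        self.segment_size = segment_size  # assumed grouping rule: N events per segment
        self.root = Node("video", 2, "", 0.0, 0.0)
        self.nodes[self.root.node_id] = self.root
        self._events = 0

    def add_event(self, summary, start_s, end_s, relations=()):
        """Stream in one perceived event and refresh the tiers above it."""
        event = Node(f"event-{self._events}", 0, summary, start_s, end_s)
        self.nodes[event.node_id] = event
        for relation, target_id in relations:
            self.edges.append((event.node_id, relation, target_id))

        # Open a new segment whenever the current one is full.
        if self._events % self.segment_size == 0:
            segment = Node(f"segment-{len(self.root.children)}", 1, "", start_s, end_s)
            self.nodes[segment.node_id] = segment
            self.root.children.append(segment.node_id)
        segment = self.nodes[self.root.children[-1]]
        segment.children.append(event.node_id)
        segment.end_s = end_s
        # Placeholder abstraction: a real system would summarise with a model.
        segment.summary = "; ".join(self.nodes[c].summary for c in segment.children)

        self._events += 1
        self.root.end_s = end_s
        self.root.summary = f"{len(self.root.children)} segment(s), {self._events} event(s)"
        return event.node_id

    def neighbours(self, node_id, relation=None):
        """Follow outgoing edges from a tier-0 node, optionally filtered by relation."""
        return [t for s, r, t in self.edges
                if s == node_id and (relation is None or r == relation)]


# Usage: stream three toy events, linking each to its assumed cause.
memory = HierarchicalGraphMemory(segment_size=2)
a = memory.add_event("chef chops onions", 0.0, 12.0)
b = memory.add_event("onions go into the pan", 12.0, 20.0, relations=[("caused_by", a)])
c = memory.add_event("pan starts to smoke", 20.0, 31.0, relations=[("caused_by", b)])
```

Keeping the upper tiers as short summaries with child pointers is what would let a reasoning model read a few abstract nodes first and descend only where needed, rather than ingesting every frame.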

Key Contributions

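The abstract highlights the following contributions:

MemDreamer, a plug-and-play framework that decouples perception from reasoning and recasts long-video understanding as agentic exploration.
A Hierarchical Graph Memory, built incrementally from the video stream, with three tiers of semantic abstraction over a foundational graph of spatiotemporal and causal relations.
Agentic tool-augmented retrieval, in which the reasoning model navigates hierarchies, searches nodes, and traverses logical edges through an Observation-Reason-Action loop.
State-of-the-art results on four mainstream benchmarks, within 3.7 points of human experts, using about 2% of the full-context window and gaining 12.5 points in absolute accuracy.
A statistical analysis reporting a strong positive linear correlation between a VLM's logic-reasoning performance and its long-video understanding performance.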

Methodology

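The abstract outlines two stages. In the perception stage, the video is streamed incrementally into the Hierarchical Graph Memory. In the reasoning stage, a separate reasoning model queries that memory with tools instead of attending over the full visual sequence. Please refer to the full paper for the detailed methodology.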
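
The Observation-Reason-Action loop named in the Overview can be sketched as follows. This is a toy illustration rather than the paper's system: the three tool names (`expand`, `search`, `follow`), the dictionary-based memory, and the `scripted_policy` function that stands in for the reasoning model are all assumptions made for this example.

```python
# Toy memory (hypothetical contents): summaries, hierarchy links, causal edges.
SUMMARIES = {
    "video": "cooking session",
    "segment-0": "preparation",
    "segment-1": "frying",
    "event-0": "chef chops onions",
    "event-1": "onions go into the pan",
    "event-2": "pan starts to smoke",
}
CHILDREN = {
    "video": ["segment-0", "segment-1"],
    "segment-0": ["event-0"],
    "segment-1": ["event-1", "event-2"],
}
EDGES = {
    "event-2": [("caused_by", "event-1")],
    "event-1": [("caused_by", "event-0")],
}


def expand(node_id):
    """Navigate the hierarchy: list the children of a node with their summaries."""
    return [(c, SUMMARIES[c]) for c in CHILDREN.get(node_id, [])]


def search(keyword):
    """Search nodes: ids whose summary contains the keyword."""
    return [n for n, s in SUMMARIES.items() if keyword.lower() in s.lower()]


def follow(node_id):
    """Traverse logical edges: (relation, target id, target summary) triples."""
    return [(rel, tgt, SUMMARIES[tgt]) for rel, tgt in EDGES.get(node_id, [])]


TOOLS = {"expand": expand, "search": search, "follow": follow}


def scripted_policy(question, observation, trace):
    """Stand-in for the reasoning model: search, follow one causal edge, answer."""
    if not trace:
        return {"tool": "search", "arg": "smoke"}
    last_tool, _, last_obs = trace[-1]
    if last_tool == "search" and last_obs:
        return {"tool": "follow", "arg": last_obs[0]}
    if last_tool == "follow" and last_obs:
        return {"tool": "answer", "arg": last_obs[0][2]}
    return {"tool": "answer", "arg": "unknown"}


def run_agent(question, tools, policy, max_steps=8):
    """Repeat observe -> reason -> act until the policy answers or steps run out."""
    observation = f"question: {question}"
    trace = []
    for _ in range(max_steps):
        action = policy(question, observation, trace)        # reason
        if action["tool"] == "answer":
            return action["arg"], trace
        observation = tools[action["tool"]](action["arg"])   # act, then observe
        trace.append((action["tool"], action["arg"], observation))
    return None, trace


answer, trace = run_agent("Why did the pan start to smoke?", TOOLS, scripted_policy)
```

In this sketch only the tool outputs enter the reasoning context, never raw frames, which mirrors the small context budget the abstract reports. The step cap is an assumed safeguard against a policy that never answers.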

Practical Implications

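According to the abstract, the reasoning model needs only about 2% of the context that full-video ingestion would require, while accuracy improves by 12.5 points. The framework is presented as plug-and-play, and the reported correlation between logic-reasoning and long-video scores is offered as evidence that gains in agentic capability carry over to multimodal comprehension.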

Authors

Cong Chen
Guo Gan
Kaixiang Ji
ChaoYang Zhang
Zhen Yang
Guangming Yao
Hao Chen
Jingdong Chen
Yi Yuan
Chunhua Shen

Paper Information

arXiv ID: 2606.07512v1
Categories: cs.CV, cs.AI, cs.CL
Published: June 5, 2026
