[Paper] Venus: An Efficient Edge Memory-and-Retrieval System for VLM-based Online Video Understanding
Source: arXiv - 2512.07344v1
Overview
The paper introduces Venus, a novel edge‑cloud system that lets devices understand live video streams with vision‑language models (VLMs) without overwhelming latency or bandwidth. By building the multimodal memory and performing keyframe retrieval on the edge device, Venus makes real‑time, VLM‑driven video analysis practical for on‑device applications such as smart cameras, AR glasses, and autonomous robots.
Key Contributions
- Edge‑centric memory architecture – builds and stores a hierarchical, multimodal memory of keyframes directly on the device, drastically cutting cloud round‑trips.
- Two‑stage processing pipeline – an ingestion stage that continuously segments and clusters video streams, and a querying stage that retrieves relevant frames with a progressive sampling algorithm.
- Threshold‑based progressive sampling – adaptively balances diversity of retrieved frames against compute cost, ensuring high reasoning accuracy while staying within latency budgets.
- Extensive performance evaluation – demonstrates 15×–131× lower end‑to‑end latency than prior cloud‑centric approaches, achieving sub‑second response times with comparable or better VLM inference quality.
Methodology
1. Ingestion Stage (Edge side)
- Scene segmentation splits the incoming video into logical shots using lightweight motion cues.
- Clustering groups similar frames within each shot; a representative keyframe is selected per cluster.
- Multimodal embedding: each keyframe is passed through a compact VLM encoder to obtain a joint visual‑text embedding.
- Hierarchical memory construction: embeddings are stored in a multi‑level index (e.g., per‑scene → per‑cluster) that enables fast lookup while keeping the memory footprint low; a minimal sketch of this stage follows below.
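This summary does not include reference code, so the following Python sketch is only an illustration of the ingestion flow described above under simple assumptions: a frame‑difference scene cut, greedy cosine‑similarity clustering within each scene, and a two‑level scene → cluster index. The `embed_frame` encoder and both thresholds are placeholders, not the authors' components or values.

```python
# Hypothetical ingestion sketch: motion-cue scene cuts, greedy clustering,
# and a two-level (scene -> cluster) keyframe memory. Thresholds and the
# encoder are illustrative placeholders, not values from the paper.
import numpy as np

SCENE_CUT_THRESHOLD = 0.35     # assumed: mean abs. pixel change that starts a new scene
CLUSTER_SIM_THRESHOLD = 0.85   # assumed: cosine similarity that merges a frame into a cluster


def embed_frame(frame: np.ndarray) -> np.ndarray:
    """Placeholder for a compact multimodal encoder (a CLIP-style model in practice)."""
    vec = frame.astype(np.float32).mean(axis=(0, 1))      # toy feature: mean color per channel
    return vec / (np.linalg.norm(vec) + 1e-8)


class HierarchicalMemory:
    """Two-level index: scenes hold clusters, each cluster keeps one representative keyframe."""

    def __init__(self):
        self.scenes = []            # each scene is a list of cluster dicts
        self._prev_frame = None

    def _is_scene_cut(self, frame: np.ndarray) -> bool:
        if self._prev_frame is None:
            return True
        change = np.abs(frame.astype(np.float32) - self._prev_frame).mean() / 255.0
        return change > SCENE_CUT_THRESHOLD

    def ingest(self, frame_id: int, frame: np.ndarray) -> None:
        if self._is_scene_cut(frame):
            self.scenes.append([])                       # open a new scene
        self._prev_frame = frame.astype(np.float32)

        emb = embed_frame(frame)
        clusters = self.scenes[-1]
        for cluster in clusters:                         # greedy assignment to an existing cluster
            if float(cluster["centroid"] @ emb) > CLUSTER_SIM_THRESHOLD:
                merged = cluster["centroid"] + emb       # update centroid with the new frame
                cluster["centroid"] = merged / np.linalg.norm(merged)
                return                                   # existing keyframe stays representative
        # No similar cluster: this frame becomes the representative keyframe of a new cluster.
        clusters.append({"centroid": emb, "keyframe_id": frame_id, "keyframe_emb": emb})
```

At query time the same two‑level layout supports coarse‑to‑fine lookup: match at the scene level first, then search clusters only within the matching scenes.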
2. Querying Stage (Edge retrieval, cloud inference)
- Incoming textual queries (e.g., “show me when a person enters the room”) are first matched against the edge memory using approximate nearest‑neighbor search.
- Progressive sampling: starting from a low‑cost threshold, the system samples increasingly diverse keyframes until a confidence or latency budget is met (a sketch of this sampling loop appears after the design note below).
- Selected frames are sent to the cloud VLM for full reasoning (captioning, detection, etc.). The final answer is returned to the edge device.
The design deliberately offloads only the lightweight tasks (segmentation, clustering, embedding, and retrieval) to the edge, keeping the heavy VLM inference in the cloud and invoking it only on a tiny, highly relevant subset of frames.
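Venus's exact sampling rule is not reproduced in this summary; the sketch below only illustrates the general pattern of the querying stage: candidates come back from an approximate nearest‑neighbor pass over the edge memory, and a threshold‑based loop keeps adding relevant‑but‑diverse keyframes until a frame budget or a confidence target is hit. The budget, thresholds, and the optional `confidence_fn` hook are assumptions for illustration.

```python
# Hypothetical query-time sketch: ANN retrieval over the edge memory plus
# threshold-based progressive sampling. Budgets, thresholds, and helper
# hooks are illustrative assumptions, not the paper's exact algorithm.
from typing import Callable, List, Optional, Tuple
import numpy as np


def progressive_sample(
    query_emb: np.ndarray,
    candidates: List[Tuple[int, np.ndarray]],     # (frame_id, embedding) pairs from ANN search
    max_frames: int = 8,                          # assumed latency budget, in frames
    diversity_threshold: float = 0.9,             # skip frames too similar to ones already picked
    confidence_fn: Optional[Callable[[List[int]], float]] = None,
    confidence_target: float = 0.8,
) -> List[int]:
    """Greedily add relevant-but-diverse keyframes until a budget or confidence target is hit."""
    # Rank candidates by similarity to the query (an ANN index would normally pre-rank these).
    ranked = sorted(candidates, key=lambda c: -float(query_emb @ c[1]))

    selected_ids: List[int] = []
    selected_embs: List[np.ndarray] = []
    for frame_id, emb in ranked:
        # Diversity check: drop frames nearly identical to an already selected one.
        if any(float(emb @ s) > diversity_threshold for s in selected_embs):
            continue
        selected_ids.append(frame_id)
        selected_embs.append(emb)

        if len(selected_ids) >= max_frames:
            break
        if confidence_fn is not None and confidence_fn(selected_ids) >= confidence_target:
            break

    # The selected frame ids would then be fetched from local storage and sent
    # to the cloud VLM for full reasoning (captioning, QA, detection, ...).
    return selected_ids
```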
Results & Findings
| Metric | Venus | Prior Art (cloud‑only) |
|---|---|---|
| End‑to‑end latency (average) | 0.8 s (real‑time) | 12 s – 100 s |
| Speedup factor | 15× – 131× | 1× (baseline) |
| Reasoning accuracy (e.g., video QA F1) | 0.78 | 0.75 |
| Memory footprint on edge (per hour of video) | ≈ 120 MB | N/A (cloud only) |
Key takeaways
- By pruning the frame set before VLM inference, Venus cuts network traffic by >90 % and reduces cloud compute load.
- The progressive sampling algorithm maintains or even improves answer quality because it deliberately selects diverse, information‑rich frames.
- The system scales to multiple concurrent streams on modest edge hardware (e.g., ARM Cortex‑A78 with 4 GB RAM).
Practical Implications
- Smart surveillance & IoT – cameras can locally filter out irrelevant footage, sending only the most informative clips for cloud analytics, saving bandwidth and storage costs.
- AR/VR headsets – real‑time scene understanding (object identification, activity detection) becomes feasible without draining battery or requiring constant high‑speed connectivity.
- Robotics & autonomous vehicles – edge memory enables rapid context retrieval (e.g., “last time we saw a pedestrian crossing”) while delegating complex reasoning to the cloud only when needed.
- Developer workflow – Venus provides a reusable SDK for edge devices to plug in any VLM encoder, making it straightforward to integrate into existing pipelines (e.g., TensorFlow Lite, ONNX Runtime).
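The SDK itself is not shown in this summary, so the interface below is hypothetical: it only sketches how a pluggable frame encoder (here an ONNX Runtime session) could sit behind a single `encode` method so that the memory layer stays encoder‑agnostic. Class names, method names, and the model's input/output layout are assumptions.

```python
# Hypothetical encoder-plugin sketch: the memory layer only needs a function
# from image -> embedding, so any runtime (ONNX Runtime, TensorFlow Lite, ...)
# can be wrapped behind the same interface. All names here are illustrative.
from typing import Protocol
import numpy as np


class FrameEncoder(Protocol):
    def encode(self, frame: np.ndarray) -> np.ndarray:
        """Return an L2-normalized embedding for one frame."""
        ...


class OnnxFrameEncoder:
    """Wraps an ONNX Runtime session as a FrameEncoder (model I/O layout is assumed)."""

    def __init__(self, model_path: str):
        import onnxruntime as ort                      # optional dependency
        self.session = ort.InferenceSession(model_path)
        self.input_name = self.session.get_inputs()[0].name

    def encode(self, frame: np.ndarray) -> np.ndarray:
        batch = frame[np.newaxis].astype(np.float32)   # assumed: model expects a float batch
        emb = self.session.run(None, {self.input_name: batch})[0][0]
        return emb / (np.linalg.norm(emb) + 1e-8)
```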
Limitations & Future Work
- Edge hardware constraints: the ingestion stage still assumes a modest GPU/NPU; ultra‑low‑power devices may need further model compression.
- Static memory granularity: current hierarchical indexing uses fixed scene/cluster levels; adaptive granularity could improve memory efficiency for highly dynamic streams.
- Privacy considerations: while less raw video is transmitted, embeddings may still leak sensitive information; future work could explore encrypted or differential‑privacy‑preserving embeddings.
- Generalization to other modalities: extending Venus to audio‑visual or sensor‑fusion streams is an open direction.
Overall, Venus demonstrates that thoughtful system design—splitting memory construction and retrieval to the edge—can unlock the power of large vision‑language models for real‑time video understanding in production environments.
Authors
- Shengyuan Ye
- Bei Ouyang
- Tianyi Qian
- Liekang Zeng
- Mu Yuan
- Xiaowen Chu
- Weijie Hong
- Xu Chen
Paper Information
- arXiv ID: 2512.07344v1
- Categories: cs.DC, cs.AI
- Published: December 8, 2025