[Paper] Venus: An Efficient Edge Memory-and-Retrieval System for VLM-based Online Video Understanding
Source: arXiv - 2512.07344v1
Overview
The paper introduces Venus, a novel edge‑cloud system that lets devices understand live video streams with vision‑language models (VLMs) without overwhelming latency or bandwidth. By building the multimodal memory and performing keyframe retrieval on the edge device, Venus makes real‑time, VLM‑driven video analysis practical for on‑device applications such as smart cameras, AR glasses, and autonomous robots.
Key Contributions
- Edge‑centric memory architecture – builds and stores a hierarchical, multimodal memory of keyframes directly on the device, drastically cutting cloud round‑trips.
- Two‑stage processing pipeline – an ingestion stage that continuously segments and clusters video streams, and a querying stage that retrieves relevant frames with a progressive sampling algorithm.
- Threshold‑based progressive sampling – adaptively balances diversity of retrieved frames against compute cost, ensuring high reasoning accuracy while staying within latency budgets.
- Extensive performance evaluation – demonstrates 15×–131× lower end‑to‑end latency than prior cloud‑centric approaches, achieving sub‑second response times with comparable or better VLM inference quality.
Methodology
1. Ingestion Stage (Edge side)
- Scene segmentation splits the incoming video into logical shots using lightweight motion cues.
- Clustering groups similar frames within each shot; a representative keyframe is selected per cluster.
- Multimodal embedding: each keyframe is passed through a compact VLM encoder to obtain a joint visual‑text embedding.
- Hierarchical memory construction: embeddings are stored in a multi‑level index (e.g., per‑scene → per‑cluster) that enables fast lookup while keeping the memory footprint low; a minimal sketch of this stage follows below.
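This summary does not include reference code, so the following Python sketch is only an illustration of the ingestion flow described above under simple assumptions: a frame‑difference scene cut, greedy cosine‑similarity clustering within each scene, and a two‑level scene → cluster index. The `embed_frame` encoder and both thresholds are placeholders, not the authors' components or values.

```python
# Hypothetical ingestion sketch: motion-cue scene cuts, greedy clustering,
# and a two-level (scene -> cluster) keyframe memory. Thresholds and the
# encoder are illustrative placeholders, not values from the paper.
import numpy as np

SCENE_CUT_THRESHOLD = 0.35     # assumed: mean abs. pixel change that starts a new scene
CLUSTER_SIM_THRESHOLD = 0.85   # assumed: cosine similarity that merges a frame into a cluster


def embed_frame(frame: np.ndarray) -> np.ndarray:
    """Placeholder for a compact multimodal encoder (a CLIP-style model in practice)."""
    vec = frame.astype(np.float32).mean(axis=(0, 1))      # toy feature: mean color per channel
    return vec / (np.linalg.norm(vec) + 1e-8)


class HierarchicalMemory:
    """Two-level index: scenes hold clusters, each cluster keeps one representative keyframe."""

    def __init__(self):
        self.scenes = []            # each scene is a list of cluster dicts
        self._prev_frame = None

    def _is_scene_cut(self, frame: np.ndarray) -> bool:
        if self._prev_frame is None:
            return True
        change = np.abs(frame.astype(np.float32) - self._prev_frame).mean() / 255.0
        return change > SCENE_CUT_THRESHOLD

    def ingest(self, frame_id: int, frame: np.ndarray) -> None:
        if self._is_scene_cut(frame):
            self.scenes.append([])                       # open a new scene
        self._prev_frame = frame.astype(np.float32)

        emb = embed_frame(frame)
        clusters = self.scenes[-1]
        for cluster in clusters:                         # greedy assignment to an existing cluster
            if float(cluster["centroid"] @ emb) > CLUSTER_SIM_THRESHOLD:
                merged = cluster["centroid"] + emb       # update centroid with the new frame
                cluster["centroid"] = merged / np.linalg.norm(merged)
                return                                   # existing keyframe stays representative
        # No similar cluster: this frame becomes the representative keyframe of a new cluster.
        clusters.append({"centroid": emb, "keyframe_id": frame_id, "keyframe_emb": emb})
```

At query time the same two‑level layout supports coarse‑to‑fine lookup: match at the scene level first, then search clusters only within the matching scenes.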
2. Querying Stage (Edge retrieval, cloud inference)
- Incoming textual queries (e.g., “show me when a person enters the room”) are first matched against the edge memory using approximate nearest‑neighbor search.
- Progressive sampling: starting from a low‑cost threshold, the system samples increasingly diverse keyframes until a confidence or latency budget is met (a sketch of this sampling loop appears after the design note below).
- Selected frames are sent to the cloud VLM for full reasoning (captioning, detection, etc.). The final answer is returned to the edge device.
The design deliberately offloads only the lightweight tasks (segmentation, clustering, embedding, and retrieval) to the edge, keeping the heavy VLM inference in the cloud and invoking it only on a tiny, highly relevant subset of frames.
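Venus's exact sampling rule is not reproduced in this summary; the sketch below only illustrates the general pattern of the querying stage: candidates come back from an approximate nearest‑neighbor pass over the edge memory, and a threshold‑based loop keeps adding relevant‑but‑diverse keyframes until a frame budget or a confidence target is hit. The budget, thresholds, and the optional `confidence_fn` hook are assumptions for illustration.

```python
# Hypothetical query-time sketch: ANN retrieval over the edge memory plus
# threshold-based progressive sampling. Budgets, thresholds, and helper
# hooks are illustrative assumptions, not the paper's exact algorithm.
from typing import Callable, List, Optional, Tuple
import numpy as np


def progressive_sample(
    query_emb: np.ndarray,
    candidates: List[Tuple[int, np.ndarray]],     # (frame_id, embedding) pairs from ANN search
    max_frames: int = 8,                          # assumed latency budget, in frames
    diversity_threshold: float = 0.9,             # skip frames too similar to ones already picked
    confidence_fn: Optional[Callable[[List[int]], float]] = None,
    confidence_target: float = 0.8,
) -> List[int]:
    """Greedily add relevant-but-diverse keyframes until a budget or confidence target is hit."""
    # Rank candidates by similarity to the query (an ANN index would normally pre-rank these).
    ranked = sorted(candidates, key=lambda c: -float(query_emb @ c[1]))

    selected_ids: List[int] = []
    selected_embs: List[np.ndarray] = []
    for frame_id, emb in ranked:
        # Diversity check: drop frames nearly identical to an already selected one.
        if any(float(emb @ s) > diversity_threshold for s in selected_embs):
            continue
        selected_ids.append(frame_id)
        selected_embs.append(emb)

        if len(selected_ids) >= max_frames:
            break
        if confidence_fn is not None and confidence_fn(selected_ids) >= confidence_target:
            break

    # The selected frame ids would then be fetched from local storage and sent
    # to the cloud VLM for full reasoning (captioning, QA, detection, ...).
    return selected_ids
```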
Results & Findings
| Metric | Venus | Prior Art (cloud‑only) |
|---|---|---|
| End‑to‑end latency (average) | 0.8 s (real‑time) | 12 s – 100 s |
| Speedup factor | 15× – 131× | 1× (baseline) |
| Reasoning accuracy (e.g., video QA F1) | 0.78 | 0.75 |
| Memory footprint on edge (per hour of video) | ≈ 120 MB | N/A (cloud only) |
Key takeaways
- By pruning the frame set before VLM inference, Venus cuts network traffic by >90 % and reduces cloud compute load.
- The progressive sampling algorithm maintains or even improves answer quality because it deliberately selects diverse, information‑rich frames.
- The system scales to multiple concurrent streams on modest edge hardware (e.g., ARM Cortex‑A78 with 4 GB RAM).
Practical Implications
- Smart surveillance & IoT – cameras can locally filter out irrelevant footage, sending only the most informative clips for cloud analytics, saving bandwidth and storage costs.
- AR/VR headsets – real‑time scene understanding (object identification, activity detection) becomes feasible without draining battery or requiring constant high‑speed connectivity.
- Robotics & autonomous vehicles – edge memory enables rapid context retrieval (e.g., “last time we saw a pedestrian crossing”) while delegating complex reasoning to the cloud only when needed.
- Developer workflow – Venus provides a reusable SDK for edge devices to plug in any VLM encoder, making it straightforward to integrate into existing pipelines (e.g., TensorFlow Lite, ONNX Runtime).
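The SDK itself is not shown in this summary, so the interface below is hypothetical: it only sketches how a pluggable frame encoder (here an ONNX Runtime session) could sit behind a single `encode` method so that the memory layer stays encoder‑agnostic. Class names, method names, and the model's input/output layout are assumptions.

```python
# Hypothetical encoder-plugin sketch: the memory layer only needs a function
# from image -> embedding, so any runtime (ONNX Runtime, TensorFlow Lite, ...)
# can be wrapped behind the same interface. All names here are illustrative.
from typing import Protocol
import numpy as np


class FrameEncoder(Protocol):
    def encode(self, frame: np.ndarray) -> np.ndarray:
        """Return an L2-normalized embedding for one frame."""
        ...


class OnnxFrameEncoder:
    """Wraps an ONNX Runtime session as a FrameEncoder (model I/O layout is assumed)."""

    def __init__(self, model_path: str):
        import onnxruntime as ort                      # optional dependency
        self.session = ort.InferenceSession(model_path)
        self.input_name = self.session.get_inputs()[0].name

    def encode(self, frame: np.ndarray) -> np.ndarray:
        batch = frame[np.newaxis].astype(np.float32)   # assumed: model expects a float batch
        emb = self.session.run(None, {self.input_name: batch})[0][0]
        return emb / (np.linalg.norm(emb) + 1e-8)
```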
Limitations & Future Work
- Edge hardware constraints: the ingestion stage still assumes a modest GPU/NPU; ultra‑low‑power devices may need further model compression.
- Static memory granularity: current hierarchical indexing uses fixed scene/cluster levels; adaptive granularity could improve memory efficiency for highly dynamic streams.
- Privacy considerations: while less raw video is transmitted, embeddings may still leak sensitive information; future work could explore encrypted or differential‑privacy‑preserving embeddings.
- Generalization to other modalities: extending Venus to audio‑visual or sensor‑fusion streams is an open direction.
Overall, Venus demonstrates that thoughtful system design—splitting memory construction and retrieval to the edge—can unlock the power of large vision‑language models for real‑time video understanding in production environments.
Authors
- Shengyuan Ye
- Bei Ouyang
- Tianyi Qian
- Liekang Zeng
- Mu Yuan
- Xiaowen Chu
- Weijie Hong
- Xu Chen
Paper Information
- arXiv ID: 2512.07344v1
- Categories: cs.DC, cs.AI
- Published: December 8, 2025