[Paper] CABLE: Cloud-Assisted Bandwidth-efficient LMM-based Encoding for V2X Systems

Published: June 17, 2026 at 12:35 PM EDT

Source: arXiv - 2606.19258v1

Overview

Cloud-hosted large multimodal models (LMMs) can provide strong open-vocabulary perception for Vehicle-to-Everything (V2X) systems, but naively transmitting full-resolution frames from edge to cloud causes severe communication overhead and high cloud-side prefill latency. We present CABLE, a cloud-assisted bandwidth-efficient LMM-based encoding framework for edge-cloud perception. CABLE propagates the previous cloud segmentation mask on the edge using ego-motion compensation, refines it with residual-motion cues, and consolidates disconnected regions via a corridor envelope to form a robust region of interest (ROI). Only ROI-masked images are uploaded, while the cloud segmentation output is fed back as the prior for the next frame, forming a mask-to-ROI-to-LMM feedback loop. Experiments on five datasets (nuScenes, WOD-ZB, Waymo, KITTI, and CADC) show consistent communication savings while largely preserving perception, achieving $73$–$87\%$ ROI pixel-coverage reduction with $5$–$8\times$ estimated LMM prefill speedup at a modest detection-quality trade-off relative to full-frame inference.
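To make the reported metric concrete, the dependency-free Python sketch below applies a binary ROI mask to a tiny grayscale image and computes the pixel-coverage reduction, read here as the fraction of pixels that fall outside the ROI. The function names, the nested-list image layout, and this reading of the metric are assumptions for illustration, not the paper's code or definitions.

```python
def roi_coverage(mask):
    """Fraction of pixels kept by a binary ROI mask (list of rows of 0/1)."""
    total = sum(len(row) for row in mask)
    kept = sum(1 for row in mask for v in row if v)
    return kept / total


def coverage_reduction(mask):
    """Fraction of pixels dropped, i.e. not uploaded as image content."""
    return 1.0 - roi_coverage(mask)


def apply_roi(image, mask, fill=0):
    """Replace every pixel outside the ROI with a constant fill value."""
    return [
        [px if keep else fill for px, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# Toy 4x5 frame: the ROI keeps 5 of 20 pixels.
image = [[10 * (y + 1) + x for x in range(5)] for y in range(4)]
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

masked = apply_roi(image, mask)
print(coverage_reduction(mask))  # 0.75
```

How a given coverage reduction translates into prefill speedup depends on the model and on the paper's estimation method, which this sketch does not attempt to reproduce.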

Key Contributions

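As summarized in the abstract, the paper contributes:

CABLE, a cloud-assisted, bandwidth-efficient LMM-based encoding framework for edge-cloud perception.
An edge-side ROI construction step that combines mask propagation, residual-motion refinement, and a corridor envelope.
A mask-to-ROI-to-LMM feedback loop in which each cloud segmentation output serves as the prior for the next frame.
An evaluation on five driving datasets (nuScenes, WOD-ZB, Waymo, KITTI, and CADC).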

Methodology

Please refer to the full paper for detailed methodology.
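As a rough illustration of the edge-side steps the abstract names (propagate the previous mask, refine it with motion cues, consolidate the result into one region), here is a simplified Python sketch. Everything in it is a stand-in chosen for brevity: ego-motion compensation is reduced to an integer pixel shift, residual motion to thresholded frame differencing, and the corridor envelope to a filled bounding box. All function names and parameters are hypothetical and are not taken from the paper.

```python
def shift_mask(mask, dx, dy):
    """Translate a binary mask by (dx, dy); pixels leaving the frame are dropped.
    Assumption: stands in for real ego-motion compensation."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x] and 0 <= y + dy < h and 0 <= x + dx < w:
                out[y + dy][x + dx] = 1
    return out


def residual_motion(prev_frame, frame, threshold):
    """Mark pixels whose intensity changed by more than `threshold`.
    Assumption: plain frame differencing, not the paper's motion cue."""
    return [
        [1 if abs(a - b) > threshold else 0 for a, b in zip(prev_row, row)]
        for prev_row, row in zip(prev_frame, frame)
    ]


def union(a, b):
    """Pixel-wise OR of two binary masks."""
    return [[1 if (p or q) else 0 for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def envelope(mask):
    """Fill the bounding box of all set pixels so separate blobs become one region.
    Assumption: a crude substitute for the corridor envelope."""
    h, w = len(mask), len(mask[0])
    pts = [(y, x) for y in range(h) for x in range(w) if mask[y][x]]
    if not pts:
        return [[0] * w for _ in range(h)]
    y0, y1 = min(p[0] for p in pts), max(p[0] for p in pts)
    x0, x1 = min(p[1] for p in pts), max(p[1] for p in pts)
    return [[1 if y0 <= y <= y1 and x0 <= x <= x1 else 0 for x in range(w)]
            for y in range(h)]


def build_roi(prev_mask, prev_frame, frame, ego_shift, threshold=10):
    """Propagate the previous mask, add moving pixels, then consolidate."""
    dx, dy = ego_shift
    propagated = shift_mask(prev_mask, dx, dy)
    refined = union(propagated, residual_motion(prev_frame, frame, threshold))
    return envelope(refined)


# Toy 4x6 example: one previously segmented pixel plus one newly moving pixel.
prev_mask = [[0] * 6 for _ in range(4)]
prev_mask[1][1] = 1
prev_frame = [[0] * 6 for _ in range(4)]
frame = [[0] * 6 for _ in range(4)]
frame[3][4] = 50
roi = build_roi(prev_mask, prev_frame, frame, ego_shift=(1, 0))
```

In the loop the abstract describes, the frame masked by `roi` would be uploaded, and the cloud's segmentation output would become `prev_mask` for the next frame.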

Practical Implications

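For V2X deployments that rely on cloud-hosted LMMs, uploading only ROI-masked images targets the two costs named in the abstract: edge-to-cloud communication overhead and cloud-side prefill latency. The reported savings come with a modest detection-quality trade-off relative to full-frame inference.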

Authors

Haohua Que
Zhipeng Bao
Qianyi Wu
Handong Yao

Paper Information

arXiv ID: 2606.19258v1
Categories: cs.CV, cs.RO
Published: June 17, 2026
