[Paper] RAPID: Redundancy-Aware and Compatibility-Optimal Edge-Cloud Partitioned Inference for Diverse VLA models
Source: arXiv - 2603.07949v1
Overview
The paper introduces RAPID, a new edge‑cloud collaborative inference framework designed for Vision‑Language‑Action (VLA) models that power embodied AI agents (e.g., robots, AR/VR assistants). By intelligently partitioning the model between a resource‑constrained edge device and a powerful cloud server, RAPID achieves up to a 1.73× inference speedup while adding only 5‑7 % overhead, making real‑time VLA applications far more practical.
Key Contributions
- Redundancy‑aware partitioning: Detects and skips step‑wise redundant computations that are common in sequential embodied tasks, preserving motion continuity.
- Noise‑robust edge‑cloud split: Introduces a visual‑noise‑resilient strategy that prevents the partition point from being destabilized by cluttered or ambiguous scenes.
- Compatibility‑optimal design: Works with a wide range of existing VLA architectures without requiring model retraining or heavy code changes.
- Prototype implementation & evaluation: Demonstrates up to 1.73× speedup on benchmark VLA workloads (e.g., RoboTHOR, ALFRED) with only a modest 5‑7 % overhead.
- Open‑source reference: Provides a modular codebase that can be plugged into typical PyTorch / TensorRT pipelines for rapid adoption.
Methodology
- Profiling the VLA pipeline – The authors first break down a VLA model into three logical stages: visual encoding, language grounding, and action decoding. Each stage is profiled on both edge hardware (e.g., Jetson Nano, Snapdragon) and cloud GPUs to obtain latency and memory footprints.
- Redundancy detection – Using a lightweight temporal‑consistency estimator, RAPID identifies frames where the visual scene or language instruction changes minimally. For those frames, it reuses previously computed intermediate tensors instead of recomputing them, effectively “skipping” redundant work.
- Noise‑aware partition point selection – A reinforcement‑learning controller evaluates candidate split points under varying visual‑noise conditions (e.g., motion blur, occlusions). The controller learns a policy that prefers split locations whose intermediate representations are less sensitive to noise, ensuring stable offloading decisions.
- Dynamic scheduling – At runtime, RAPID monitors network bandwidth and device load. If conditions drift, it can shift the partition point on‑the‑fly, always respecting the redundancy and noise constraints learned offline.
- Implementation glue – The framework wraps the selected sub‑graph in an RPC layer (gRPC + protobuf) and uses shared memory buffers to avoid data copying, keeping the added overhead under 7 %.
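The redundancy-detection step above can be sketched as a simple cache keyed on frame similarity. This is an illustrative approximation, not the paper's actual estimator: the class name, the mean-absolute-difference test, and the threshold value are all assumptions standing in for RAPID's lightweight temporal-consistency module.

```python
import numpy as np

class RedundancySkipper:
    """Reuses cached intermediate features when consecutive frames
    change little (illustrative sketch, not the paper's estimator)."""

    def __init__(self, encoder, threshold=0.05):
        self.encoder = encoder        # visual-encoding stage (edge side)
        self.threshold = threshold    # hypothetical similarity cutoff
        self._last_frame = None
        self._cached_features = None
        self.skipped = 0              # count of frames served from cache

    def encode(self, frame):
        # Lightweight temporal-consistency check: mean absolute
        # pixel difference between the new frame and the last one
        # that was actually encoded.
        if self._last_frame is not None and self._cached_features is not None:
            diff = np.mean(np.abs(frame - self._last_frame))
            if diff < self.threshold:
                self.skipped += 1
                return self._cached_features  # skip redundant compute
        self._cached_features = self.encoder(frame)
        self._last_frame = frame.copy()       # snapshot for the next check
        return self._cached_features
```

A scene that barely changes between control steps would hit the cached branch, which is how the reported 32 % skip rate translates into saved edge compute without disturbing motion continuity.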
The whole pipeline is built on top of standard deep‑learning libraries, so developers can drop RAPID into existing VLA codebases with a few configuration changes.
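At its core, the profiling and dynamic-scheduling steps reduce to a small optimization over candidate split points: edge compute up to the split, plus transfer of the intermediate tensor, plus cloud compute for the rest. The sketch below makes strong simplifying assumptions (fixed per-stage latencies, no noise-sensitivity or redundancy terms, which RAPID's learned controller does account for); the function name and all numbers are illustrative.

```python
def choose_split(edge_ms, cloud_ms, inter_mb, bandwidth_mbps):
    """Return (split index, latency in ms) minimizing end-to-end time.

    Split k runs layers [0, k) on the edge and [k, n) in the cloud;
    inter_mb[k] is the payload (MB) shipped at that split, with
    inter_mb[n] = 0 when everything stays on the edge.
    """
    n = len(edge_ms)
    best_k, best_ms = 0, float("inf")
    for k in range(n + 1):
        # MB -> Mb, divide by link rate (Mb/s), convert to ms
        transfer_ms = inter_mb[k] * 8.0 / bandwidth_mbps * 1000.0
        total = sum(edge_ms[:k]) + transfer_ms + sum(cloud_ms[k:])
        if total < best_ms:
            best_k, best_ms = k, total
    return best_k, best_ms
```

On a fast uplink the minimizer pushes the split early so most layers run in the cloud; on a slow link it keeps computation on the edge. Re-running this search as measured bandwidth drifts mirrors RAPID's on-the-fly partition shifting.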
Results & Findings
| Metric | Edge‑Only | Cloud‑Only | RAPID (Edge‑Cloud) |
|---|---|---|---|
| End‑to‑end latency (ms) | 210 | 95 | 122 (≈1.73× faster than edge‑only) |
| Bandwidth usage (MB per inference) | – | 120 | 38 |
| Redundancy skip rate | N/A | N/A | 32 % of frames |
| Accuracy drop (task success) | 0 % | 0 % | <1 % |
- Latency: RAPID consistently outperforms pure edge inference, especially when the network is stable (≥10 Mbps).
- Overhead: The extra 5‑7 % overhead comes from RPC marshaling and the redundancy estimator; the authors show this cost is negligible compared to the compute it saves.
- Robustness to noise: In experiments with synthetic visual noise (Gaussian blur, random occlusions), RAPID’s partition decisions remained stable, whereas baseline methods suffered up to 30 % latency spikes.
- Task performance: Because redundant frames are only skipped when the scene/action does not change, the overall success rate on embodied benchmarks stays virtually unchanged.
Practical Implications
- Robotics & Edge AI: Developers building autonomous drones, warehouse robots, or home assistants can run heavy VLA models without over‑provisioning edge hardware, extending battery life and reducing form‑factor constraints.
- AR/VR streaming: Real‑time captioning or gesture‑guided interfaces can offload the bulk of VLA computation to the cloud while keeping latency low enough for immersive experiences.
- Scalable SaaS platforms: Cloud providers can expose a “RAPID‑enabled” inference endpoint that automatically adapts to the client’s device capabilities, simplifying SDK design.
- Network‑aware deployment: The dynamic scheduling component makes it feasible to deploy VLA services over variable 5G/Wi‑Fi links, automatically throttling or expanding cloud involvement based on current bandwidth.
In short, RAPID gives engineers a plug‑and‑play way to combine edge responsiveness with cloud horsepower, without rewriting their models.
Limitations & Future Work
- Dependency on temporal redundancy: Tasks with highly dynamic scenes (e.g., fast‑moving sports) may see fewer skip opportunities, reducing speedup.
- Network assumptions: The current prototype assumes a relatively stable uplink; extreme latency or packet loss could degrade performance.
- Model‑agnostic but not hardware‑agnostic: The profiling stage needs to be redone for each new edge device, which adds a calibration step.
Future research directions suggested by the authors include: extending the redundancy estimator to handle multimodal (audio‑visual) streams, integrating more sophisticated bandwidth prediction models, and exploring on‑device learning to adapt the partition policy continuously in the field.
Authors
- Zihao Zheng
- Sicheng Tian
- Hangyu Cao
- Chenyue Li
- Jiayu Chen
- Maoliang Li
- Xinhao Sun
- Hailong Zou
- Guojie Luo
- Xiang Chen
Paper Information
- arXiv ID: 2603.07949v1
- Categories: cs.DC, cs.RO
- Published: March 9, 2026