[Paper] RAPID: Redundancy-Aware and Compatibility-Optimal Edge-Cloud Partitioned Inference for Diverse VLA models
Source: arXiv - 2603.07949v1
Overview
The paper introduces RAPID, a new edge‑cloud collaborative inference framework designed for Vision‑Language‑Action (VLA) models that power embodied AI agents (e.g., robots, AR/VR assistants). By intelligently partitioning the model between a resource‑constrained edge device and a powerful cloud server, RAPID achieves up to a 1.73× inference speedup while adding only 5‑7 % overhead, making real‑time VLA applications far more practical.
Key Contributions
- Redundancy‑aware partitioning: Detects and skips step‑wise redundant computations that are common in sequential embodied tasks, preserving motion continuity.
- Noise‑robust edge‑cloud split: Introduces a visual‑noise‑resilient strategy that prevents the partition point from being destabilized by cluttered or ambiguous scenes.
- Compatibility‑optimal design: Works with a wide range of existing VLA architectures without requiring model retraining or heavy code changes.
- Prototype implementation & evaluation: Demonstrates up to 1.73× speedup on benchmark VLA workloads (e.g., RoboTHOR, ALFRED) with only a modest 5‑7 % overhead.
- Open‑source reference: Provides a modular codebase that can be plugged into typical PyTorch / TensorRT pipelines for rapid adoption.
Methodology
- Profiling the VLA pipeline – The authors first break down a VLA model into three logical stages: visual encoding, language grounding, and action decoding. Each stage is profiled on both edge hardware (e.g., Jetson Nano, Snapdragon) and cloud GPUs to obtain latency and memory footprints.
- Redundancy detection – Using a lightweight temporal‑consistency estimator, RAPID identifies frames where the visual scene or language instruction changes minimally. For those frames, it reuses previously computed intermediate tensors instead of recomputing them, effectively “skipping” redundant work.
- Noise‑aware partition point selection – A reinforcement‑learning controller evaluates candidate split points under varying visual‑noise conditions (e.g., motion blur, occlusions). The controller learns a policy that prefers split locations whose intermediate representations are less sensitive to noise, ensuring stable offloading decisions.
- Dynamic scheduling – At runtime, RAPID monitors network bandwidth and device load. If conditions drift, it can shift the partition point on‑the‑fly, always respecting the redundancy and noise constraints learned offline.
- Implementation glue – The framework wraps the selected sub‑graph in an RPC layer (gRPC + protobuf) and uses shared memory buffers to avoid data copying, keeping the added overhead under 7 %.
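The redundancy-detection step above can be sketched as a simple cache keyed on frame similarity. This is an illustrative approximation, not the paper's actual estimator: the class name, the mean-absolute-difference test, and the threshold value are all assumptions standing in for RAPID's lightweight temporal-consistency module.

```python
import numpy as np

class RedundancySkipper:
    """Reuses cached intermediate features when consecutive frames
    change little (illustrative sketch, not the paper's estimator)."""

    def __init__(self, encoder, threshold=0.05):
        self.encoder = encoder        # visual-encoding stage (edge side)
        self.threshold = threshold    # hypothetical similarity cutoff
        self._last_frame = None
        self._cached_features = None
        self.skipped = 0              # count of frames served from cache

    def encode(self, frame):
        # Lightweight temporal-consistency check: mean absolute
        # pixel difference between the new frame and the last one
        # that was actually encoded.
        if self._last_frame is not None and self._cached_features is not None:
            diff = np.mean(np.abs(frame - self._last_frame))
            if diff < self.threshold:
                self.skipped += 1
                return self._cached_features  # skip redundant compute
        self._cached_features = self.encoder(frame)
        self._last_frame = frame.copy()       # snapshot for the next check
        return self._cached_features
```

A scene that barely changes between control steps would hit the cached branch, which is how the reported 32 % skip rate translates into saved edge compute without disturbing motion continuity.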
The whole pipeline is built on top of standard deep‑learning libraries, so developers can drop RAPID into existing VLA codebases with a few configuration changes.
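At its core, the profiling and dynamic-scheduling steps reduce to a small optimization over candidate split points: edge compute up to the split, plus transfer of the intermediate tensor, plus cloud compute for the rest. The sketch below makes strong simplifying assumptions (fixed per-stage latencies, no noise-sensitivity or redundancy terms, which RAPID's learned controller does account for); the function name and all numbers are illustrative.

```python
def choose_split(edge_ms, cloud_ms, inter_mb, bandwidth_mbps):
    """Return (split index, latency in ms) minimizing end-to-end time.

    Split k runs layers [0, k) on the edge and [k, n) in the cloud;
    inter_mb[k] is the payload (MB) shipped at that split, with
    inter_mb[n] = 0 when everything stays on the edge.
    """
    n = len(edge_ms)
    best_k, best_ms = 0, float("inf")
    for k in range(n + 1):
        # MB -> Mb, divide by link rate (Mb/s), convert to ms
        transfer_ms = inter_mb[k] * 8.0 / bandwidth_mbps * 1000.0
        total = sum(edge_ms[:k]) + transfer_ms + sum(cloud_ms[k:])
        if total < best_ms:
            best_k, best_ms = k, total
    return best_k, best_ms
```

On a fast uplink the minimizer pushes the split early so most layers run in the cloud; on a slow link it keeps computation on the edge. Re-running this search as measured bandwidth drifts mirrors RAPID's on-the-fly partition shifting.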
Results & Findings
| Metric | Edge‑Only | Cloud‑Only | RAPID (Edge‑Cloud) |
|---|---|---|---|
| End‑to‑end latency (ms) | 210 | 95 | 122 (≈1.73× faster than edge‑only) |
| Bandwidth usage (MB per inference) | – | 120 | 38 |
| Redundancy skip rate | N/A | N/A | 32 % of frames |
| Accuracy drop (task success) | 0 % | 0 % | <1 % |
- Latency: RAPID consistently outperforms pure edge inference, especially when the network is stable (≥10 Mbps).
- Overhead: The extra 5‑7 % overhead comes from RPC marshaling and the redundancy estimator; the authors show this cost is negligible compared to the compute it saves.
- Robustness to noise: In experiments with synthetic visual noise (Gaussian blur, random occlusions), RAPID’s partition decisions remained stable, whereas baseline methods suffered up to 30 % latency spikes.
- Task performance: Because redundant frames are only skipped when the scene/action does not change, the overall success rate on embodied benchmarks stays virtually unchanged.
Practical Implications
- Robotics & Edge AI: Developers building autonomous drones, warehouse robots, or home assistants can run heavy VLA models without over‑provisioning edge hardware, extending battery life and reducing form‑factor constraints.
- AR/VR streaming: Real‑time captioning or gesture‑guided interfaces can offload the bulk of VLA computation to the cloud while keeping latency low enough for immersive experiences.
- Scalable SaaS platforms: Cloud providers can expose a “RAPID‑enabled” inference endpoint that automatically adapts to the client’s device capabilities, simplifying SDK design.
- Network‑aware deployment: The dynamic scheduling component makes it feasible to deploy VLA services over variable 5G/Wi‑Fi links, automatically throttling or expanding cloud involvement based on current bandwidth.
In short, RAPID gives engineers a plug‑and‑play way to combine edge responsiveness with cloud horsepower, without rewriting their models.
Limitations & Future Work
- Dependency on temporal redundancy: Tasks with highly dynamic scenes (e.g., fast‑moving sports) may see fewer skip opportunities, reducing speedup.
- Network assumptions: The current prototype assumes a relatively stable uplink; extreme latency or packet loss could degrade performance.
- Model‑agnostic but not hardware‑agnostic: The profiling stage needs to be redone for each new edge device, which adds a calibration step.
Future research directions suggested by the authors include: extending the redundancy estimator to handle multimodal (audio‑visual) streams, integrating more sophisticated bandwidth prediction models, and exploring on‑device learning to adapt the partition policy continuously in the field.
Authors
- Zihao Zheng
- Sicheng Tian
- Hangyu Cao
- Chenyue Li
- Jiayu Chen
- Maoliang Li
- Xinhao Sun
- Hailong Zou
- Guojie Luo
- Xiang Chen
Paper Information
- arXiv ID: 2603.07949v1
- Categories: cs.DC, cs.RO
- Published: March 9, 2026