[Paper] VacuumVLA: Boosting VLA Capabilities via a Unified Suction and Gripping Tool for Complex Robotic Manipulation
Source: arXiv - 2511.21557v1
Overview
The paper introduces VacuumVLA, a low‑cost, plug‑and‑play end‑effector that fuses a classic two‑finger gripper with a vacuum suction module. By giving Vision‑Language‑Action (VLA) systems a second “hand” for picking, sticking, and wiping, the authors dramatically broaden the set of manipulation tasks that a single robot can handle—everything from lifting smooth glass panels to pulling open handle‑less drawers.
Key Contributions
- Hybrid hardware design: A compact, 3‑D‑printable module that mechanically integrates a parallel‑jaw gripper and a vacuum suction cup, with a single control interface.
- Dual‑mode operation: Supports exclusive (grip‑only or suction‑only) and synergistic (grip + suction simultaneously) manipulation without re‑tooling.
- Seamless VLA integration: Plugged into two state‑of‑the‑art VLA pipelines—DexVLA and Pi0—showing that the same vision‑language model can learn to select the appropriate modality on the fly.
- Open‑source release: Full CAD files, wiring schematics, and ROS‑compatible drivers are made publicly available, lowering the barrier for labs and startups.
- Empirical validation: Benchmarks on a set of 12 real‑world tasks (e.g., glass wiping, handle‑less drawer opening, thin‑sheet picking) demonstrate success rates up to 92 %—far beyond the ~30 % achievable with a plain two‑finger gripper.
Methodology
- Hardware integration – The authors mount a miniature vacuum pump and suction cup on the side of a standard parallel‑jaw gripper. A single microcontroller (Arduino Nano) reads a binary “mode” command from the VLA policy and actuates either the gripper motor, the suction pump, or both.
- Control abstraction – In the VLA software stack, the end‑effector is exposed as a single action primitive with three discrete sub‑actions: `GRIP`, `SUCTION`, and `GRIP+SUCTION`. This keeps the language model's action space unchanged while adding expressive power.
- Training & inference – The authors fine‑tune DexVLA and Pi0 on a mixed dataset of RGB‑D images, natural‑language task descriptions, and demonstration trajectories that include the new hybrid actions. No extra language tokens are required; the model learns to map phrases like "pick up the glass" to the `SUCTION` primitive.
- Evaluation protocol – Each task is run 20 times on a Franka Emika Panda robot. Success is defined as completing the high‑level goal (e.g., "wipe the surface") without human intervention. Baselines use the same VLA models but with a vanilla two‑finger gripper.
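The control abstraction can be sketched in a few lines. This is an illustrative mock, not the released driver: the paper only specifies that the policy emits one of three discrete sub‑actions and that a microcontroller reads a binary mode command; the names `HybridMode`, `encode_mode`, and `decode_mode`, and the bit layout, are assumptions for the sketch.

```python
from enum import Enum


class HybridMode(Enum):
    """Three discrete sub-actions of the single end-effector primitive.

    The bit layout (bit 0 = gripper, bit 1 = pump) is an assumption;
    the paper only names the three modes.
    """
    GRIP = 0b01          # actuate the parallel-jaw gripper only
    SUCTION = 0b10       # run the vacuum pump only
    GRIP_SUCTION = 0b11  # synergistic mode: gripper and pump together


def encode_mode(mode: HybridMode) -> bytes:
    """Pack a mode into the one-byte command sent to the microcontroller."""
    return bytes([mode.value])


def decode_mode(cmd: bytes) -> tuple[bool, bool]:
    """Firmware-side mirror: return (gripper_on, pump_on) actuation flags."""
    value = cmd[0]
    return bool(value & 0b01), bool(value & 0b10)
```

Keeping the policy's action space to this one primitive means the VLA model's output head is unchanged; only the downstream byte-level dispatch decides which actuators fire.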
Results & Findings
| Task Category | Success (Hybrid) | Success (Gripper‑Only) |
|---|---|---|
| Glass wiping | 92 % | 18 % |
| Thin‑sheet pick‑up | 88 % | 25 % |
| Handle‑less drawer pull | 85 % | 30 % |
| Mixed objects (grip + suction) | 90 % | 40 % |
- Mode selection is learned automatically – The VLA policies correctly choose suction for smooth, low‑mass items and grip for irregular shapes, even when the same textual command is used.
- Synergistic use improves stability – For heavy or partially porous objects, activating both grip and suction simultaneously raises the lift capacity by ~35 % compared with either mode alone.
- No noticeable latency – The added pump control adds < 50 ms overhead, well within real‑time constraints for VLA inference loops.
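The qualitative selection behavior reported above can be caricatured as a hand-written heuristic. This toy sketch is not the learned policy (which maps raw RGB‑D and language to actions end‑to‑end); the `ObjectEstimate` fields, the `heavy_threshold` parameter, and the decision rules are hypothetical, chosen only to mirror the reported tendencies.

```python
from dataclasses import dataclass


@dataclass
class ObjectEstimate:
    """Toy perceptual summary; the real policy sees raw RGB-D + language."""
    smooth: bool      # flat, non-porous surface suitable for suction
    mass_kg: float    # rough mass estimate
    graspable: bool   # geometry affords a parallel-jaw grasp


def select_mode(obj: ObjectEstimate, heavy_threshold: float = 1.0) -> str:
    """Heuristic stand-in for the behavior the paper reports the VLA learns:
    suction for smooth low-mass items, both modes together for heavy smooth
    objects where suction alone would slip, and grip for irregular shapes."""
    if obj.smooth and obj.mass_kg < heavy_threshold:
        return "SUCTION"
    if obj.smooth and obj.graspable:
        return "GRIP+SUCTION"  # synergistic mode raises lift capacity
    return "GRIP"
```

For example, a light glass pane would route to `SUCTION`, while a heavy glossy box that also affords a grasp would trigger the synergistic mode.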
Practical Implications
- Rapid prototyping – Robotics startups can 3‑D‑print the VacuumVLA module and retrofit existing arms, instantly expanding their product’s task repertoire without redesigning the entire manipulator.
- Warehouse & logistics – Vacuum‑assisted picking of glossy packages or thin cardboard sheets becomes feasible, reducing the need for multiple specialized end‑effectors on a single line.
- Service robots – Home assistants can now clean windows, wipe countertops, or open sleek cabinets that lack traditional handles—capabilities that were previously out of reach for VLA‑driven bots.
- Research acceleration – By releasing the hardware and ROS drivers, the authors enable the community to benchmark new VLA architectures on a richer set of manipulation primitives, fostering more robust, generalist policies.
Limitations & Future Work
- Suction power constraints – The current low‑cost pump struggles with heavy or highly porous items; scaling to industrial‑grade suction will require more robust hardware.
- Surface dependency – Suction effectiveness drops on textured or oily surfaces, suggesting a need for adaptive suction pads or hybrid tactile sensing.
- Learning sample efficiency – Although the hybrid actions are learned end‑to‑end, the authors note that additional demonstrations (≈ 10 % more) are needed to reach peak performance on the most complex tasks.
- Future directions – The team plans to explore dynamic mode switching mid‑trajectory (e.g., grip‑then‑suction) and to integrate force/torque feedback for safer contact‑rich operations.
VacuumVLA shows that a modest hardware tweak can unlock a whole new class of real‑world tasks for vision‑language‑driven robots, making general‑purpose manipulation a step closer to everyday deployment.
Authors
- Hui Zhou
- Siyuan Huang
- Minxing Li
- Hao Zhang
- Lue Fan
- Shaoshuai Shi
Paper Information
- arXiv ID: 2511.21557v1
- Categories: cs.RO, cs.AI
- Published: November 26, 2025