[Paper] VacuumVLA: Boosting VLA Capabilities via a Unified Suction and Gripping Tool for Complex Robotic Manipulation

Published: November 26, 2025 at 11:29 AM EST
4 min read
Source: arXiv - 2511.21557v1

Overview

The paper introduces VacuumVLA, a low‑cost, plug‑and‑play end‑effector that fuses a classic two‑finger gripper with a vacuum suction module. By giving Vision‑Language‑Action (VLA) systems a second “hand” for picking, sticking, and wiping, the authors dramatically broaden the set of manipulation tasks that a single robot can handle—everything from lifting smooth glass panels to pulling open handle‑less drawers.

Key Contributions

  • Hybrid hardware design: A compact, 3‑D‑printable module that mechanically integrates a parallel‑jaw gripper and a vacuum suction cup, with a single control interface.
  • Dual‑mode operation: Supports exclusive (grip‑only or suction‑only) and synergistic (grip + suction simultaneously) manipulation without re‑tooling.
  • Seamless VLA integration: Plugged into two state‑of‑the‑art VLA pipelines—DexVLA and Pi0—showing that the same vision‑language model can learn to select the appropriate modality on the fly.
  • Open‑source release: Full CAD files, wiring schematics, and ROS‑compatible drivers are made publicly available, lowering the barrier for labs and startups.
  • Empirical validation: Benchmarks on a set of 12 real‑world tasks (e.g., glass wiping, handle‑less drawer opening, thin‑sheet picking) demonstrate success rates up to 90 %—far beyond the ~30 % achievable with a plain two‑finger gripper.

Methodology

  1. Hardware integration – The authors mount a miniature vacuum pump and suction cup on the side of a standard parallel‑jaw gripper. A single microcontroller (Arduino Nano) reads a binary “mode” command from the VLA policy and actuates either the gripper motor, the suction pump, or both.
  2. Control abstraction – In the VLA software stack, the end‑effector is exposed as a single action primitive with three discrete sub‑actions: GRIP, SUCTION, GRIP+SUCTION. This keeps the language model’s action space unchanged while adding expressive power.
  3. Training & inference – The authors fine‑tune DexVLA and Pi0 on a mixed dataset of RGB‑D images, natural‑language task descriptions, and demonstration trajectories that include the new hybrid actions. No extra language tokens are required; the model learns to map phrases like “pick up the glass” to the SUCTION primitive.
  4. Evaluation protocol – Each task is run 20 times on a Franka Emika Panda robot. Success is defined as completing the high‑level goal (e.g., “wipe the surface”) without human intervention. Baselines use the same VLA models but with a vanilla two‑finger gripper.
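The three-way control abstraction in step 2 can be sketched in a few lines. This is a minimal illustration, not the authors' released driver code: the enum names and the actuator flag dictionary are assumptions, mirroring the paper's description of a single microcontroller that reads a mode command and drives the gripper motor, the suction pump, or both.

```python
from enum import Enum


class EndEffectorMode(Enum):
    """The three discrete sub-actions exposed to the VLA policy (names assumed)."""
    GRIP = 0
    SUCTION = 1
    GRIP_SUCTION = 2


def actuator_commands(mode: EndEffectorMode) -> dict:
    """Map a policy-selected mode to low-level on/off flags.

    Mirrors the dispatch described in the paper: the microcontroller
    actuates the gripper motor, the suction pump, or both, depending
    on a single mode command from the VLA policy.
    """
    return {
        "gripper_motor": mode in (EndEffectorMode.GRIP, EndEffectorMode.GRIP_SUCTION),
        "suction_pump": mode in (EndEffectorMode.SUCTION, EndEffectorMode.GRIP_SUCTION),
    }
```

Because the policy only ever emits one of these three discrete values, the language model's action space stays the same size while the hardware gains a second modality.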

Results & Findings

| Task Category | Success (Hybrid) | Success (Gripper‑Only) |
| --- | --- | --- |
| Glass wiping | 92 % | 18 % |
| Thin‑sheet pick‑up | 88 % | 25 % |
| Handle‑less drawer pull | 85 % | 30 % |
| Mixed objects (grip + suction) | 90 % | 40 % |

  • Mode selection learns automatically – The VLA policies correctly choose suction for smooth, low‑mass items and grip for irregular shapes, even when the same textual command is used.
  • Synergistic use improves stability – For heavy or partially porous objects, activating both grip and suction simultaneously raises the lift capacity by ~35 % compared with either mode alone.
  • No noticeable latency – The added pump control adds < 50 ms overhead, well within real‑time constraints for VLA inference loops.

Practical Implications

  • Rapid prototyping – Robotics startups can 3‑D‑print the VacuumVLA module and retrofit existing arms, instantly expanding their product’s task repertoire without redesigning the entire manipulator.
  • Warehouse & logistics – Vacuum‑assisted picking of glossy packages or thin cardboard sheets becomes feasible, reducing the need for multiple specialized end‑effectors on a single line.
  • Service robots – Home assistants can now clean windows, wipe countertops, or open sleek cabinets that lack traditional handles—capabilities that were previously out of reach for VLA‑driven bots.
  • Research acceleration – By releasing the hardware and ROS drivers, the authors enable the community to benchmark new VLA architectures on a richer set of manipulation primitives, fostering more robust, generalist policies.

Limitations & Future Work

  • Suction power constraints – The current low‑cost pump struggles with heavy or highly porous items; scaling to industrial‑grade suction will require more robust hardware.
  • Surface dependency – Suction effectiveness drops on textured or oily surfaces, suggesting a need for adaptive suction pads or hybrid tactile sensing.
  • Learning sample efficiency – Although the hybrid actions are learned end‑to‑end, the authors note that additional demonstrations (≈ 10 % more) are needed to reach peak performance on the most complex tasks.
  • Future directions – The team plans to explore dynamic mode switching mid‑trajectory (e.g., grip‑then‑suction) and to integrate force/torque feedback for safer contact‑rich operations.

VacuumVLA shows that a modest hardware tweak can unlock a whole new class of real‑world tasks for vision‑language‑driven robots, making general‑purpose manipulation a step closer to everyday deployment.

Authors

  • Hui Zhou
  • Siyuan Huang
  • Minxing Li
  • Hao Zhang
  • Lue Fan
  • Shaoshuai Shi

Paper Information

  • arXiv ID: 2511.21557v1
  • Categories: cs.RO, cs.AI
  • Published: November 26, 2025
  • PDF: Download PDF