[Paper] Zero-shot Interactive Perception

Published: February 20, 2026 at 12:30 PM EST
4 min read
Source: arXiv - 2602.18374v1

Overview

The paper introduces Zero‑Shot Interactive Perception (ZS‑IP), a framework that lets a robot reason about what to do without any task‑specific training. By combining a vision‑language model (VLM) with a set of “pushlines” – lightweight 2‑D visual cues that encode how a push will affect an object – the system can decide when to push, pull, or grasp to answer semantic queries (e.g., “where is the red cup?”) even when objects are hidden or occluded.

Key Contributions

  • Pushlines: A novel visual augmentation that encodes feasible pushing directions directly on the image, enabling the VLM to understand contact‑rich affordances beyond simple keypoints.
  • Enhanced Observation (EO) module: Merges conventional keypoints with pushlines, feeding richer context into the VLM for zero‑shot reasoning.
  • Memory‑guided action selection: A lightweight episodic memory that stores recent observations and actions, allowing the VLM to perform context‑aware semantic look‑ups.
  • Unified controller: Executes pushing, pulling, or grasping actions purely based on the VLM’s textual output, without a separate motion‑planning network.
  • Empirical validation on a 7‑DOF Franka Panda: Demonstrates superior performance over passive‑perception baselines (e.g., MOKA), especially on tasks that require pushing to uncover hidden objects, while preserving unrelated scene elements.
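
The paper doesn't include code for pushline generation, but the idea of deriving feasible 2‑D push cues from depth geometry can be sketched roughly as follows. This is a minimal illustration, assuming a metric depth image and a binary object mask; the function name, parameters, and the free‑space heuristic are hypothetical, not the authors' implementation:

```python
import numpy as np

def generate_pushlines(depth, mask, n_lines=8, length_px=10, clearance_m=0.02):
    """Sketch of pushline generation: for sampled points on an object's
    visible surface, keep push directions whose approach point lies in free
    space (background pixels deeper than the object, so the gripper can
    reach them). `depth` is an HxW metric depth image, `mask` a boolean
    object mask. Returns a list of ((start_x, start_y), (contact_x, contact_y))
    image-space segments, each representing a candidate push."""
    ys, xs = np.nonzero(mask)
    # Sample a handful of evenly spaced mask pixels as candidate contacts.
    idx = np.linspace(0, len(xs) - 1, num=min(n_lines, len(xs)), dtype=int)
    lines = []
    for i in idx:
        x, y = int(xs[i]), int(ys[i])
        obj_z = depth[y, x]
        # Candidate push directions: the 8-neighbourhood unit vectors.
        for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1),
                       (1, 1), (-1, 1), (1, -1), (-1, -1)]:
            px, py = x + dx * length_px, y + dy * length_px
            if not (0 <= px < depth.shape[1] and 0 <= py < depth.shape[0]):
                continue
            # A push from (px, py) toward (x, y) is kept only if the start
            # point is outside the object and clearly behind it in depth.
            if not mask[py, px] and depth[py, px] > obj_z + clearance_m:
                lines.append(((px, py), (x, y)))
    return lines
```

Each surviving segment can then be drawn onto the RGB image before it is handed to the VLM, which is what lets the model "see" where a push would make contact.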

Methodology

  1. Perception Front‑end – The robot captures an RGB‑D image of the workspace. Two sets of annotations are overlaid:

    • Keypoints (standard object landmarks) and
    • Pushlines – short line segments drawn on the image that indicate viable push directions for each visible surface. These are generated automatically from depth geometry and contact‑stability heuristics.
  2. Vision‑Language Model (VLM) – A pre‑trained VLM (e.g., CLIP‑based) receives the augmented image together with a natural‑language query (“Is the blue block behind the green box?”). Because the VLM has never seen the specific task, it relies on its broad visual knowledge plus the pushline cues to infer a plausible answer.

  3. Memory Module – After each interaction, the system logs the observation, the VLM’s textual response, and the executed action. When a new query arrives, the memory is consulted to provide context (e.g., “we already pushed the left side, so the object must be on the right”).

  4. Action Planner / Controller – The VLM’s textual decision (e.g., “push left‑center”) is parsed into a motion primitive (push, pull, or grasp). The controller translates this into joint trajectories for the Franka Panda, respecting safety constraints and collision avoidance.

  5. Iterative Loop – The robot repeats perception → VLM reasoning → memory lookup → action until the query is resolved or a timeout is reached.
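
The five steps above can be sketched as a single control loop. This is an illustrative skeleton only: `vlm`, `robot`, the `answer:` prefix convention, and the primitive names are stand‑ins I've assumed for clarity, not the paper's actual interfaces:

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    """One episodic-memory entry: what was seen, decided, and done."""
    observation: str
    vlm_answer: str
    action: str

@dataclass
class ZSIPLoop:
    """Perceive -> reason -> remember -> act, repeated until the query is
    resolved or the step budget runs out (hypothetical interface)."""
    vlm: callable            # (augmented_image, query, memory) -> text decision
    robot: callable          # (primitive,) -> executes push / pull / grasp
    memory: list = field(default_factory=list)
    max_steps: int = 10

    def parse_primitive(self, decision: str) -> str:
        # Map the VLM's free-text decision onto a motion primitive.
        for p in ("push", "pull", "grasp"):
            if p in decision.lower():
                return p
        return "push"  # conservative default

    def run(self, capture, query: str) -> str:
        for _ in range(self.max_steps):
            image = capture()  # RGB-D frame with keypoints + pushlines overlaid
            decision = self.vlm(image, query, self.memory)
            if decision.lower().startswith("answer:"):
                return decision  # query resolved, stop interacting
            primitive = self.parse_primitive(decision)
            self.robot(primitive)
            self.memory.append(Episode(image, decision, primitive))
        return "timeout"
```

Passing the memory back into the VLM on every step is what enables context like "we already pushed the left side, so the object must be on the right" without any task‑specific training.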

Results & Findings

| Metric | ZS‑IP (push) | MOKA (baseline) | Passive Vision |
| --- | --- | --- | --- |
| Success rate (object uncovered) | 87 % | 62 % | 48 % |
| Average pushes per query | 1.3 | 2.1 | 3.0 |
| Non‑target disturbance (objects moved unintentionally) | 4 % | 9 % | 12 % |
| Query latency (seconds) | 5.2 | 7.8 | 6.4 |

  • Pushlines dramatically improve pushing accuracy – the VLM can directly “see” where a push will make contact, leading to fewer wasted motions.
  • Memory guidance reduces redundant actions – the system rarely repeats the same push direction, cutting down on interaction steps.
  • Semantic correctness – In 93 % of cases the final answer matched ground‑truth object locations, showing that zero‑shot VLM reasoning combined with physical interaction can resolve occlusions reliably.

Practical Implications

  • Rapid prototyping for warehouse robots: Engineers can deploy a robot that understands high‑level commands (“bring me the red box”) without hand‑crafting perception pipelines for each new item.
  • Service robotics in homes/offices: Pushlines enable a robot to tidy up cluttered tables or shelves by nudging objects aside, a capability that’s hard to encode with static vision models.
  • Reduced data collection costs: Since ZS‑IP works zero‑shot, companies can avoid costly annotation campaigns for every new manipulation scenario.
  • Safety‑aware interaction: The memory module helps avoid unnecessary disturbance of delicate items, making the approach suitable for collaborative settings.

Limitations & Future Work

  • Reliance on depth quality: Pushline generation assumes reasonably clean depth data; noisy sensors can produce misleading push cues.
  • Scalability of memory: The current episodic memory is linear in stored steps; larger, longer‑horizon tasks may need more sophisticated retrieval (e.g., learned embeddings).
  • Action repertoire limited to push/pull/grasp: Extending to more complex primitives (e.g., sliding, rolling) will require richer augmentations.
  • Generalization to novel object categories: While zero‑shot, performance drops for objects whose visual features are far from the VLM’s pre‑training distribution; future work could incorporate few‑shot fine‑tuning or domain adaptation.

Authors

  • Venkatesh Sripada
  • Frank Guerin
  • Amir Ghalamzan

Paper Information

  • arXiv ID: 2602.18374v1
  • Categories: cs.RO, cs.AI
  • Published: February 20, 2026