[Paper] Zero-shot Interactive Perception

Published: February 20, 2026 at 12:30 PM EST
4 min read
Source: arXiv - 2602.18374v1

Overview

The paper introduces Zero‑Shot Interactive Perception (ZS‑IP), a framework that lets a robot reason about what to do without any task‑specific training. By combining a vision‑language model (VLM) with a set of “pushlines” – lightweight 2‑D visual cues that encode how a push will affect an object – the system can decide when to push, pull, or grasp to answer semantic queries (e.g., “where is the red cup?”) even when objects are hidden or occluded.

Key Contributions

  • Pushlines: A novel visual augmentation that encodes feasible pushing directions directly on the image, enabling the VLM to understand contact‑rich affordances beyond simple keypoints.
  • Enhanced Observation (EO) module: Merges conventional keypoints with pushlines, feeding richer context into the VLM for zero‑shot reasoning.
  • Memory‑guided action selection: A lightweight episodic memory that stores recent observations and actions, allowing the VLM to perform context‑aware semantic look‑ups.
  • Unified controller: Executes pushing, pulling, or grasping actions purely based on the VLM’s textual output, without a separate motion‑planning network.
  • Empirical validation on a 7‑DOF Franka Panda: Demonstrates superior performance over passive‑perception baselines (e.g., MOKA), especially on tasks that require pushing to uncover hidden objects, while preserving unrelated scene elements.
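
The paper doesn't include code for pushline generation, but the idea of deriving feasible 2‑D push cues from depth geometry can be sketched roughly as follows. This is a minimal illustration, assuming a metric depth image and a binary object mask; the function name, parameters, and the free‑space heuristic are hypothetical, not the authors' implementation:

```python
import numpy as np

def generate_pushlines(depth, mask, n_lines=8, length_px=10, clearance_m=0.02):
    """Sketch of pushline generation: for sampled points on an object's
    visible surface, keep push directions whose approach point lies in free
    space (background pixels deeper than the object, so the gripper can
    reach them). `depth` is an HxW metric depth image, `mask` a boolean
    object mask. Returns a list of ((start_x, start_y), (contact_x, contact_y))
    image-space segments, each representing a candidate push."""
    ys, xs = np.nonzero(mask)
    # Sample a handful of evenly spaced mask pixels as candidate contacts.
    idx = np.linspace(0, len(xs) - 1, num=min(n_lines, len(xs)), dtype=int)
    lines = []
    for i in idx:
        x, y = int(xs[i]), int(ys[i])
        obj_z = depth[y, x]
        # Candidate push directions: the 8-neighbourhood unit vectors.
        for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1),
                       (1, 1), (-1, 1), (1, -1), (-1, -1)]:
            px, py = x + dx * length_px, y + dy * length_px
            if not (0 <= px < depth.shape[1] and 0 <= py < depth.shape[0]):
                continue
            # A push from (px, py) toward (x, y) is kept only if the start
            # point is outside the object and clearly behind it in depth.
            if not mask[py, px] and depth[py, px] > obj_z + clearance_m:
                lines.append(((px, py), (x, y)))
    return lines
```

Each surviving segment can then be drawn onto the RGB image before it is handed to the VLM, which is what lets the model "see" where a push would make contact.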

Methodology

  1. Perception Front‑end – The robot captures an RGB‑D image of the workspace. Two sets of annotations are overlaid:

    • Keypoints (standard object landmarks) and
    • Pushlines – short line segments drawn on the image that indicate viable push directions for each visible surface. These are generated automatically from depth geometry and contact‑stability heuristics.
  2. Vision‑Language Model (VLM) – A pre‑trained VLM (e.g., CLIP‑based) receives the augmented image together with a natural‑language query (“Is the blue block behind the green box?”). Because the VLM has never seen the specific task, it relies on its broad visual knowledge plus the pushline cues to infer a plausible answer.

  3. Memory Module – After each interaction, the system logs the observation, the VLM’s textual response, and the executed action. When a new query arrives, the memory is consulted to provide context (e.g., “we already pushed the left side, so the object must be on the right”).

  4. Action Planner / Controller – The VLM’s textual decision (e.g., “push left‑center”) is parsed into a motion primitive (push, pull, or grasp). The controller translates this into joint trajectories for the Franka Panda, respecting safety constraints and collision avoidance.

  5. Iterative Loop – The robot repeats perception → VLM reasoning → memory lookup → action until the query is resolved or a timeout is reached.
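
The five steps above can be sketched as a single control loop. This is an illustrative skeleton only: `vlm`, `robot`, the `answer:` prefix convention, and the primitive names are stand‑ins I've assumed for clarity, not the paper's actual interfaces:

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    """One episodic-memory entry: what was seen, decided, and done."""
    observation: str
    vlm_answer: str
    action: str

@dataclass
class ZSIPLoop:
    """Perceive -> reason -> remember -> act, repeated until the query is
    resolved or the step budget runs out (hypothetical interface)."""
    vlm: callable            # (augmented_image, query, memory) -> text decision
    robot: callable          # (primitive,) -> executes push / pull / grasp
    memory: list = field(default_factory=list)
    max_steps: int = 10

    def parse_primitive(self, decision: str) -> str:
        # Map the VLM's free-text decision onto a motion primitive.
        for p in ("push", "pull", "grasp"):
            if p in decision.lower():
                return p
        return "push"  # conservative default

    def run(self, capture, query: str) -> str:
        for _ in range(self.max_steps):
            image = capture()  # RGB-D frame with keypoints + pushlines overlaid
            decision = self.vlm(image, query, self.memory)
            if decision.lower().startswith("answer:"):
                return decision  # query resolved, stop interacting
            primitive = self.parse_primitive(decision)
            self.robot(primitive)
            self.memory.append(Episode(image, decision, primitive))
        return "timeout"
```

Passing the memory back into the VLM on every step is what enables context like "we already pushed the left side, so the object must be on the right" without any task‑specific training.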

Results & Findings

| Metric | ZS‑IP (push) | MOKA (baseline) | Passive Vision |
| --- | --- | --- | --- |
| Success rate (object uncovered) | 87 % | 62 % | 48 % |
| Average pushes per query | 1.3 | 2.1 | 3.0 |
| Non‑target disturbance (objects moved unintentionally) | 4 % | 9 % | 12 % |
| Query latency (seconds) | 5.2 | 7.8 | 6.4 |

  • Pushlines dramatically improve pushing accuracy – the VLM can directly “see” where a push will make contact, leading to fewer wasted motions.
  • Memory guidance reduces redundant actions – the system rarely repeats the same push direction, cutting down on interaction steps.
  • Semantic correctness – In 93 % of cases the final answer matched ground‑truth object locations, showing that zero‑shot VLM reasoning combined with physical interaction can resolve occlusions reliably.

Practical Implications

  • Rapid prototyping for warehouse robots: Engineers can deploy a robot that understands high‑level commands (“bring me the red box”) without hand‑crafting perception pipelines for each new item.
  • Service robotics in homes/offices: Pushlines enable a robot to tidy up cluttered tables or shelves by nudging objects aside, a capability that’s hard to encode with static vision models.
  • Reduced data collection costs: Since ZS‑IP works zero‑shot, companies can avoid costly annotation campaigns for every new manipulation scenario.
  • Safety‑aware interaction: The memory module helps avoid unnecessary disturbance of delicate items, making the approach suitable for collaborative settings.

Limitations & Future Work

  • Reliance on depth quality: Pushline generation assumes reasonably clean depth data; noisy sensors can produce misleading push cues.
  • Scalability of memory: The current episodic memory is linear in stored steps; larger, longer‑horizon tasks may need more sophisticated retrieval (e.g., learned embeddings).
  • Action repertoire limited to push/pull/grasp: Extending to more complex primitives (e.g., sliding, rolling) will require richer augmentations.
  • Generalization to novel object categories: While zero‑shot, performance drops for objects whose visual features are far from the VLM’s pre‑training distribution; future work could incorporate few‑shot fine‑tuning or domain adaptation.

Authors

  • Venkatesh Sripada
  • Frank Guerin
  • Amir Ghalamzan

Paper Information

  • arXiv ID: 2602.18374v1
  • Categories: cs.RO, cs.AI
  • Published: February 20, 2026