[Paper] Learning Visual Affordance from Audio

Published: December 1, 2025 at 01:58 PM EST
3 min read

Source: arXiv - 2512.02005v1

Overview

The paper introduces Audio‑Visual Affordance Grounding (AV‑AG), a novel task that lets a model locate the exact region on an object where an interaction is happening—using only the sound of the action. By leveraging audio as a cue, the approach sidesteps the ambiguities of textual instructions and the occlusion problems that plague video‑based methods, opening a new avenue for real‑time, multimodal perception.

Key Contributions

  • New task definition: AV‑AG, which segments interaction regions from action sounds rather than text or video.
  • First‑of‑its‑kind dataset: Over 10 K object images paired with high‑quality action‑sound recordings and pixel‑level affordance masks, plus an unseen split for zero‑shot evaluation.
  • AVAGFormer model: A transformer‑based architecture featuring a semantic‑conditioned cross‑modal mixer and a dual‑head decoder that fuses audio and visual streams efficiently.
  • State‑of‑the‑art results: AVAGFormer outperforms strong baselines from related audio‑visual segmentation (AVS) and multimodal grounding tasks.
  • Open‑source release: Code, pretrained weights, and the dataset are publicly available, encouraging reproducibility and downstream research.

Methodology

  1. Data preprocessing – Audio clips are transformed into log‑mel spectrograms; images are resized and normalized (a preprocessing sketch follows this list).
  2. Feature extraction – Separate encoders (a CNN for images, a lightweight audio transformer for spectrograms) produce modality‑specific embeddings.
  3. Semantic‑conditioned cross‑modal mixer – The audio embedding generates a set of query vectors that attend to visual tokens, effectively “telling” the visual stream where to look based on the sound semantics.
  4. Dual‑head decoder
    • Mask head: predicts a binary affordance mask at the original image resolution.
    • Classification head: outputs a coarse affordance category (e.g., “grasp”, “cut”) to guide mask refinement.
  5. Training – A combination of binary cross‑entropy (mask) and cross‑entropy (category) losses, plus an auxiliary contrastive loss that aligns audio‑visual pairs, is optimized end‑to‑end.
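
As a concrete illustration of step 1, the audio and image front ends might look like the sketch below. The sample rate, mel‑bin count, image size, and file names are illustrative assumptions, not values taken from the paper.

```python
import torch
import torchaudio
import torchvision.transforms as T
from PIL import Image

SAMPLE_RATE = 16_000  # assumed sample rate of the action-sound clips

# Waveform -> log-mel spectrogram (step 1, audio branch)
audio_to_logmel = torch.nn.Sequential(
    torchaudio.transforms.MelSpectrogram(
        sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=160, n_mels=64
    ),
    torchaudio.transforms.AmplitudeToDB(),
)

# Resize + ImageNet normalization (step 1, image branch)
image_transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

waveform, sr = torchaudio.load("cutting.wav")    # hypothetical action-sound clip
if sr != SAMPLE_RATE:
    waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)
log_mel = audio_to_logmel(waveform.mean(dim=0))  # (n_mels, time)
image = image_transform(Image.open("mug.jpg").convert("RGB"))  # hypothetical object image, (3, 224, 224)
```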

The whole pipeline runs in a single forward pass, making it suitable for real‑time applications.
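
To make the fusion and decoding steps concrete, below is a minimal PyTorch‑style sketch: the audio embedding is projected into query vectors that cross‑attend to visual tokens (step 3), two heads emit a mask and a coarse category (step 4), and the losses are summed (step 5). Module names, dimensions, the single combined mask, and the InfoNCE‑style contrastive term are plausible assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticConditionedMixer(nn.Module):
    """Audio-derived queries cross-attend to visual tokens (assumed layout)."""
    def __init__(self, dim=256, num_queries=16, num_heads=8):
        super().__init__()
        self.num_queries, self.dim = num_queries, dim
        self.query_proj = nn.Linear(dim, num_queries * dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_emb, visual_tokens):
        # audio_emb: (B, dim); visual_tokens: (B, N, dim)
        queries = self.query_proj(audio_emb).view(-1, self.num_queries, self.dim)
        fused, _ = self.cross_attn(queries, visual_tokens, visual_tokens)
        return self.norm(fused + queries)                         # (B, Q, dim)

class DualHeadDecoder(nn.Module):
    """Mask head (per-pixel affordance) plus classification head (coarse category)."""
    def __init__(self, dim=256, num_classes=10):
        super().__init__()
        self.mask_embed = nn.Linear(dim, dim)
        self.cls_head = nn.Linear(dim, num_classes)

    def forward(self, queries, visual_feat_map):
        # queries: (B, Q, dim); visual_feat_map: (B, dim, H, W) low-res feature map
        mask_emb = self.mask_embed(queries)                       # (B, Q, dim)
        per_query = torch.einsum("bqd,bdhw->bqhw", mask_emb, visual_feat_map)
        mask_logits = per_query.mean(dim=1, keepdim=True)         # (B, 1, H, W)
        cls_logits = self.cls_head(queries.mean(dim=1))           # (B, num_classes)
        return mask_logits, cls_logits

def avag_loss(mask_logits, mask_gt, cls_logits, cls_gt, audio_emb, vis_emb, tau=0.07):
    """BCE on the mask, CE on the category, InfoNCE-style audio-visual alignment."""
    l_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_gt)  # mask_gt: float (B, 1, H, W)
    l_cls = F.cross_entropy(cls_logits, cls_gt)                        # cls_gt: long (B,)
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(vis_emb, dim=-1)
    sim = a @ v.t() / tau                                              # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)                 # paired clips sit on the diagonal
    l_con = F.cross_entropy(sim, targets)
    return l_mask + l_cls + l_con
```

In a full model the mask logits would still need to be upsampled to the input resolution (e.g., with F.interpolate) before the mask loss, and the relative weighting of the three terms is a detail the paper would specify; both are omitted here for brevity.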

Results & Findings

| Model | mIoU (Seen) | mIoU (Unseen) |
| --- | --- | --- |
| Baseline (AVS‑ResNet) | 42.3% | 35.1% |
| AVAGFormer (full) | 58.7% | 49.4% |
| AVAGFormer (no semantic mixer) | 53.2% | 44.0% |
| AVAGFormer (single‑head) | 55.1% | 46.3% |

  • Significant boost over existing audio‑visual segmentation baselines: roughly +16 mIoU points on the seen split (58.7% vs. 42.3%) and about +14 points on the unseen split.
  • The semantic‑conditioned mixer contributes the largest performance jump, confirming that audio semantics are crucial for precise grounding.
  • Zero‑shot results show the model can generalize to unseen object–sound pairs, thanks to the shared audio embedding space.
  • Ablation studies reveal that end‑to‑end training outperforms a two‑stage pipeline (feature extraction → mask prediction) by ~3–4 % mIoU.

Practical Implications

  • Robotics & HRI: Robots can infer where to interact with a tool or object simply by listening to a human’s action sound (e.g., “cutting” → locate the blade edge).
  • AR/VR interaction: Audio cues can trigger context‑aware overlays (highlighting a handle when a user says “grab”) without needing explicit hand tracking.
  • Assistive tech: Devices for visually impaired users could use ambient sounds to highlight actionable regions on nearby objects in a wearable display.
  • Smart manufacturing: Audio monitoring of assembly lines can automatically flag mis‑aligned parts by detecting mismatched affordance regions.
  • Content creation: Video editors could auto‑mask interaction zones for effects or subtitles based on the accompanying soundtrack, reducing manual rotoscoping.

Because the model runs in a single forward pass (~30 FPS on a modern GPU), integrating it into real‑time pipelines is feasible.
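
As a rough way to sanity‑check that kind of throughput on your own hardware, the snippet below times repeated forward passes of a tiny stand‑in network with AV‑AG‑shaped inputs; it only illustrates the measurement pattern, since actual latency depends on the real architecture and GPU.

```python
import time
import torch
import torch.nn as nn

class TinyStandIn(nn.Module):
    """Toy network with image + log-mel inputs and a mask-shaped output (not AVAGFormer)."""
    def __init__(self):
        super().__init__()
        self.vis = nn.Conv2d(3, 8, kernel_size=3, stride=4, padding=1)
        self.aud = nn.Linear(64 * 100, 8)
        self.head = nn.Conv2d(8, 1, kernel_size=1)

    def forward(self, image, log_mel):
        v = self.vis(image)                                 # (B, 8, 56, 56)
        a = self.aud(log_mel.flatten(1))[:, :, None, None]  # (B, 8, 1, 1)
        return self.head(v + a)                             # coarse mask logits

device = "cuda" if torch.cuda.is_available() else "cpu"
net = TinyStandIn().to(device).eval()
image = torch.randn(1, 3, 224, 224, device=device)
log_mel = torch.randn(1, 64, 100, device=device)             # (B, n_mels, frames)

with torch.no_grad():
    for _ in range(10):                                      # warm-up
        net(image, log_mel)
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(100):
        net(image, log_mel)
    if device == "cuda":
        torch.cuda.synchronize()
print(f"avg latency: {(time.perf_counter() - t0) / 100 * 1e3:.2f} ms per frame")
```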

Limitations & Future Work

  • Audio quality dependency: Noisy environments degrade performance; the current dataset assumes relatively clean recordings.
  • Limited affordance taxonomy: Only a handful of interaction types are covered; expanding to more fine‑grained actions (e.g., “twist”, “press”) is needed.
  • Static images only: Temporal dynamics (e.g., moving objects) are not modeled; extending AVAGFormer to video streams could capture evolving affordances.
  • Cross‑modal bias: The model may over‑rely on dominant audio cues, potentially ignoring subtle visual hints; future work could explore balanced attention mechanisms.

The authors plan to enrich the dataset with noisy, real‑world recordings, broaden the affordance label set, and experiment with multimodal transformers that jointly process video and audio.

Authors

  • Lidong Lu
  • Guo Chen
  • Zhu Wei
  • Yicheng Liu
  • Tong Lu

Paper Information

  • arXiv ID: 2512.02005v1
  • Categories: cs.CV
  • Published: December 1, 2025
  • PDF: https://arxiv.org/pdf/2512.02005v1