[Paper] VisualActBench: Can VLMs See and Act like a Human?

Published: December 10, 2025 at 01:36 PM EST
4 min read
Source: arXiv - 2512.09907v1

Overview

The paper introduces VisualActBench, a new benchmark that asks Vision‑Language Models (VLMs) not just to describe what they see, but to decide what to do in a visual scene. By pairing 1,074 real‑world videos with 3,733 human‑annotated actions, the authors create a testbed for “visual action reasoning” – the ability of an AI to proactively generate sensible, priority‑aware actions without any textual prompt.

Key Contributions

  • New task definition – “Visual Action Reasoning” that evaluates proactive decision‑making from pure visual input.
  • Large‑scale benchmark – VisualActBench with 1,074 videos covering four everyday scenarios (e.g., kitchen, office, outdoor, home assistance).
  • Rich annotation schema – each action is labeled with an Action Prioritization Level (APL) and a proactive vs. reactive tag, enabling fine‑grained assessment of human‑aligned reasoning (a sample record is sketched after this list).
  • Comprehensive evaluation – 29 state‑of‑the‑art VLMs (including GPT‑4o, LLaVA, and Gemini‑Vision) are benchmarked, revealing systematic gaps in proactive, high‑priority action generation.
  • Open resource – dataset, evaluation scripts, and a leaderboard are released to spur community progress on vision‑centric agents.
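
To make the annotation schema concrete, a single record might look like the sketch below. The field names, clip identifier, and action text are illustrative assumptions, not the benchmark's released format.

```python
# Illustrative annotation record (field names and values are assumptions,
# not the benchmark's released format).
annotation = {
    "video_id": "kitchen_0042",   # hypothetical clip identifier
    "action": "Turn off the stove before the pot boils over.",
    "apl": 5,                     # Action Prioritization Level: 1 (low urgency) .. 5 (critical)
    "type": "proactive",          # "proactive" (anticipatory) or "reactive" (response to an event)
}
```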

Methodology

  1. Scenario selection – Four realistic domains were chosen (e.g., cooking, office work, home maintenance, outdoor navigation) to capture diverse contextual cues.
  2. Video collection & preprocessing – Short clips (5–15 s) were sourced from public datasets and manually trimmed to focus on a single decision point.
  3. Human annotation – Crowdworkers watched each clip and wrote the most appropriate next action. They then assigned:
    • APL (1 = low urgency, 5 = critical) and
    • Type (Proactive – anticipatory, Reactive – response to an event).
  4. Model prompting – VLMs received only the raw video frames (or a short frame sequence) and were asked to output an action sentence. No textual prompt or task description was provided, mimicking a “see‑and‑act” scenario (a minimal frame‑only query is sketched after this list).
  5. Scoring – Generated actions are compared against the human reference using a hybrid metric:
    • Semantic similarity (BERTScore) for language fidelity, and
    • APL alignment (penalizing mismatched priority) plus a type match (proactive vs. reactive).
      The final score aggregates these components to reflect both correctness and human‑like decision quality (a minimal scoring sketch also follows this list).
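
For step 4, a frame‑only query could be issued as in the sketch below. It assumes the OpenAI Python SDK and pre‑sampled, base64‑encoded JPEG frames; the paper's actual evaluation harness, frame‑sampling strategy, and model endpoints are not specified here.

```python
# Minimal "see-and-act" query: only image frames, no textual task description.
# Assumes the OpenAI Python SDK; the paper's actual harness may differ.
import base64
from openai import OpenAI

client = OpenAI()

def ask_for_action(frame_paths: list[str], model: str = "gpt-4o") -> str:
    """Send sampled video frames with no text prompt and return the generated action sentence."""
    frames = []
    for path in frame_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode("utf-8")
        frames.append({"type": "image_url",
                       "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": frames}],  # frames only, no text part
    )
    return response.choices[0].message.content
```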
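
And for step 5, a minimal version of the hybrid scorer could be assembled as below. The 0.5/0.3/0.2 weights and the linear APL penalty are assumptions for illustration; the paper's exact aggregation formula is not reproduced here.

```python
# A minimal sketch of the hybrid metric described above; the weights and the
# linear APL penalty are illustrative assumptions, not the paper's exact formula.
from dataclasses import dataclass
from bert_score import score as bert_score_fn  # pip install bert-score

@dataclass
class Reference:
    text: str        # human-written reference action
    apl: int         # Action Prioritization Level, 1 (low urgency) .. 5 (critical)
    proactive: bool  # True = proactive, False = reactive

def hybrid_score(pred_text: str, pred_apl: int, pred_proactive: bool, ref: Reference) -> float:
    """Combine semantic similarity, APL alignment, and type match into a 0-100 score."""
    # Semantic similarity: BERTScore F1 between prediction and reference.
    _, _, f1 = bert_score_fn([pred_text], [ref.text], lang="en")
    semantic = float(f1.mean())

    # APL alignment: penalize priority mismatch, rescaled to [0, 1].
    apl_alignment = 1.0 - abs(pred_apl - ref.apl) / 4.0

    # Type match: proactive vs. reactive.
    type_match = 1.0 if pred_proactive == ref.proactive else 0.0

    # Assumed weighting; the paper's aggregation scheme may differ.
    return 100.0 * (0.5 * semantic + 0.3 * apl_alignment + 0.2 * type_match)
```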

Results & Findings

| Model | Overall Score (0‑100) | Proactive‑High‑APL Accuracy |
|---|---|---|
| GPT‑4o (vision) | 71.4 | 58 % |
| Gemini‑Vision | 64.2 | 45 % |
| LLaVA‑13B | 48.7 | 22 % |
| Other open‑source VLMs (average) | 39.1 | 15 % |
| Human baseline | 94.3 | 92 % |

  • Frontier models (GPT‑4o, Gemini‑Vision) outperform older open‑source VLMs but still lag far behind humans, especially on high‑priority proactive actions.
  • Most models default to reactive descriptions (“the person picks up a cup”) rather than anticipatory actions (“prepare a clean mug for the next drink”).
  • Errors often stem from contextual blind spots (e.g., missing temporal cues) and value insensitivity (ignoring urgency indicated by APL).

Practical Implications

  • Robotics & Assistive Devices – Deploying VLMs that can decide what to do next (e.g., a kitchen robot that anticipates a user’s need for a utensil) requires closing the proactive‑reasoning gap highlighted by VisualActBench.
  • Enterprise Automation – Vision‑centric agents for monitoring workspaces (e.g., safety compliance) could benefit from APL‑aware reasoning to prioritize alerts.
  • Human‑AI Collaboration Tools – UI assistants that suggest next steps based on a live video feed (e.g., remote support) need models that understand not just “what is happening” but “what should happen next.”
  • Benchmark‑Driven Development – VisualActBench provides a concrete target for fine‑tuning VLMs with reinforcement learning from human feedback (RLHF) that incorporates priority and proactivity signals (a toy reward‑shaping sketch follows this list).
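
One toy illustration of how priority and proactivity signals could be folded into a preference‑based fine‑tuning reward; the coefficients and functional form are assumptions, not taken from the paper.

```python
def shaped_reward(preference_score: float, apl_error: int, type_match: bool) -> float:
    """Toy reward shaping: start from a reward-model preference score in [0, 1],
    subtract a penalty for APL mismatch, and add a bonus for matching the
    proactive/reactive type. All coefficients are illustrative assumptions."""
    return preference_score - 0.1 * apl_error + (0.2 if type_match else -0.2)
```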

Limitations & Future Work

  • Dataset scope – While diverse, the four scenarios still cover a limited set of everyday tasks; rare or safety‑critical domains (e.g., medical procedures) are absent.
  • Annotation subjectivity – APL and proactive/reactive labels can vary across annotators; the authors report inter‑annotator agreement of ~0.78, leaving room for noise.
  • Model input constraints – Current evaluation uses short frame sequences; longer temporal context (e.g., multi‑second video streams) may be needed for richer reasoning.
  • Future directions – The authors suggest expanding to multimodal feedback (audio, tactile), integrating explicit value‑learning objectives, and exploring curriculum‑style training that gradually introduces higher‑priority actions.

VisualActBench shines a light on the next frontier for vision‑centric AI: moving from passive description to active, human‑aligned decision making. For developers building the next generation of intelligent agents, the benchmark offers both a diagnostic tool and a roadmap for the capabilities that still need to be built.

Authors

  • Daohan Zhang
  • Pai Liu
  • Xiaofei Zhou
  • Yuan Ge
  • Guangchen Lan
  • Jing Bi
  • Christopher Brinton
  • Ehsan Hoque
  • Jiebo Luo

Paper Information

  • arXiv ID: 2512.09907v1
  • Categories: cs.CV
  • Published: December 10, 2025