[Paper] VisualActBench: Can VLMs See and Act like a Human?
Source: arXiv - 2512.09907v1
Overview
The paper introduces VisualActBench, a new benchmark that asks Vision‑Language Models (VLMs) not just to describe what they see, but to decide what to do in a visual scene. By pairing 1,074 real‑world videos with 3,733 human‑annotated actions, the authors create a testbed for “visual action reasoning” – the ability of an AI to proactively generate sensible, priority‑aware actions without any textual prompt.
Key Contributions
- New task definition – “Visual Action Reasoning”, which evaluates proactive decision‑making from purely visual input.
- Large‑scale benchmark – VisualActBench with 1,074 videos covering four everyday scenarios (e.g., kitchen, office, outdoor, home assistance).
- Rich annotation schema – each action is labeled with an Action Prioritization Level (APL) and a proactive vs. reactive tag, enabling fine‑grained assessment of human‑aligned reasoning.
- Comprehensive evaluation – 29 state‑of‑the‑art VLMs (including GPT‑4o, Gemini‑Vision, and LLaVA) are benchmarked, revealing systematic gaps in proactive, high‑priority action generation.
- Open resource – dataset, evaluation scripts, and a leaderboard are released to spur community progress on vision‑centric agents.
Methodology
- Scenario selection – Four realistic domains were chosen (e.g., cooking, office work, home maintenance, outdoor navigation) to capture diverse contextual cues.
- Video collection & preprocessing – Short clips (5–15 s) were sourced from public datasets and manually trimmed to focus on a single decision point.
- Human annotation – Crowdworkers watched each clip and wrote the most appropriate next action, then assigned:
  - an APL (1 = low urgency, 5 = critical), and
  - a Type (Proactive – anticipatory, Reactive – response to an event).
- Model prompting – VLMs received only the raw video frames (or a short frame sequence) and were asked to output an action sentence. No textual prompt or task description was provided, mimicking a “see‑and‑act” scenario (a minimal sketch of this pipeline follows the list).
- Scoring – Generated actions are compared against the human reference using a hybrid metric:
  - Semantic similarity (BERTScore) for language fidelity,
  - APL alignment (penalizing mismatched priority), and
  - a Type match (proactive vs. reactive).
The final score aggregates these components to reflect both correctness and human‑like decision quality.
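The exact weighting of these components, and how an APL and type are assigned to a model's output, are not spelled out in this summary, so the block below is only a minimal, illustrative sketch of the evaluation pipeline under simple assumptions: frames are sampled with OpenCV, semantic similarity comes from the `bert-score` package, `vlm_generate_action` is a hypothetical stand‑in for the model under test (assumed to return an action sentence plus an APL and a proactive/reactive tag), and the 0.5/0.3/0.2 weights are placeholders rather than the authors' values.

```python
# Illustrative sketch only: not the paper's official evaluation code.
from dataclasses import dataclass

import cv2                                   # frame sampling
from bert_score import score as bert_score   # semantic similarity


@dataclass
class HumanAction:
    text: str        # reference action sentence written by an annotator
    apl: int         # Action Prioritization Level: 1 (low urgency) .. 5 (critical)
    proactive: bool  # True = anticipatory, False = reactive


def sample_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample a short frame sequence from a clip."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // num_frames, 1)
    frames = []
    for idx in range(0, max(total, 1), step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames[:num_frames]


def hybrid_score(pred_text: str, pred_apl: int, pred_proactive: bool,
                 ref: HumanAction,
                 w_sem: float = 0.5, w_apl: float = 0.3, w_type: float = 0.2) -> float:
    """Weighted mix of semantic similarity, APL alignment, and type match (0-100).
    The weights here are assumptions, not the benchmark's published values."""
    # Language fidelity: BERTScore F1 between generated and reference sentences.
    _, _, f1 = bert_score([pred_text], [ref.text], lang="en")
    semantic = float(f1.item())

    # Priority alignment: 1.0 for an exact APL match, decreasing with the gap.
    apl_align = 1.0 - abs(pred_apl - ref.apl) / 4.0

    # Proactive vs. reactive type match.
    type_match = 1.0 if pred_proactive == ref.proactive else 0.0

    return 100.0 * (w_sem * semantic + w_apl * apl_align + w_type * type_match)


# "See-and-act" evaluation of one clip: the model gets frames only, with no
# textual description of the scene or task.
# frames = sample_frames("clip_0001.mp4")
# pred_text, pred_apl, pred_proactive = vlm_generate_action(frames)  # hypothetical model call
# print(hybrid_score(pred_text, pred_apl, pred_proactive,
#                    ref=HumanAction("hand the user a clean mug", apl=2, proactive=True)))
```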
Results & Findings
| Model | Overall Score (0‑100) | Proactive‑High‑APL Accuracy |
|---|---|---|
| GPT‑4o (vision) | 71.4 | 58 % |
| Gemini‑Vision | 64.2 | 45 % |
| LLaVA‑13B | 48.7 | 22 % |
| Other open‑source VLMs (average) | 39.1 | 15 % |
| Human baseline | 94.3 | 92 % |
- Frontier models (GPT‑4o, Gemini‑Vision) outperform older open‑source VLMs but still lag far behind humans, especially on high‑priority proactive actions.
- Most models default to reactive descriptions (“the person picks up a cup”) rather than anticipatory actions (“prepare a clean mug for the next drink”).
- Errors often stem from contextual blind spots (e.g., missing temporal cues) and value insensitivity (ignoring urgency indicated by APL).
Practical Implications
- Robotics & Assistive Devices – Deploying VLMs that can decide what to do next (e.g., a kitchen robot that anticipates a user’s need for a utensil) requires closing the proactive‑reasoning gap highlighted by VisualActBench.
- Enterprise Automation – Vision‑centric agents for monitoring workspaces (e.g., safety compliance) could benefit from APL‑aware reasoning to prioritize alerts.
- Human‑AI Collaboration Tools – UI assistants that suggest next steps based on a live video feed (e.g., remote support) need models that understand not just “what is happening” but “what should happen next.”
- Benchmark‑Driven Development – VisualActBench provides a concrete target for fine‑tuning VLMs with reinforcement learning from human feedback (RLHF) that incorporates priority and proactivity signals.
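To make the last point concrete, a benchmark‑derived training signal could reuse the same APL and proactivity annotations as reward shaping. The sketch below is an assumed illustration, not the authors' training recipe; `similarity` stands for any text‑similarity score in [0, 1], for instance the BERTScore term from the evaluation sketch above.

```python
# Hypothetical reward shaping for benchmark-driven fine-tuning (not from the paper):
# weight correctness by the reference action's urgency and penalize staying
# reactive when an anticipatory action was expected.
def action_reward(similarity: float, ref_apl: int,
                  ref_proactive: bool, pred_proactive: bool) -> float:
    # Getting critical actions (APL 5) right matters more than low-urgency ones.
    priority_weight = ref_apl / 5.0
    reward = similarity * priority_weight

    if pred_proactive == ref_proactive:
        reward += 0.2   # bonus for matching the proactive/reactive type
    elif ref_proactive and not pred_proactive:
        reward -= 0.3   # reactive answer where anticipation was needed,
                        # the dominant failure mode reported in the results
    return reward
```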
Limitations & Future Work
- Dataset scope – While diverse, the four scenarios still cover a limited set of everyday tasks; rare or safety‑critical domains (e.g., medical procedures) are absent.
- Annotation subjectivity – APL and proactive/reactive labels can vary across annotators; the authors report inter‑annotator agreement of ~0.78, leaving room for noise.
- Model input constraints – Current evaluation uses short frame sequences; longer temporal context (e.g., multi‑second video streams) may be needed for richer reasoning.
- Future directions – The authors suggest expanding to multimodal feedback (audio, tactile), integrating explicit value‑learning objectives, and exploring curriculum‑style training that gradually introduces higher‑priority actions.
VisualActBench shines a light on the next frontier for vision‑centric AI: moving from passive description to active, human‑aligned decision making. For developers building the next generation of intelligent agents, the benchmark offers both a diagnostic tool and a roadmap for the capabilities that still need to be built.
Authors
- Daohan Zhang
- Pai Liu
- Xiaofei Zhou
- Yuan Ge
- Guangchen Lan
- Jing Bi
- Christopher Brinton
- Ehsan Hoque
- Jiebo Luo
Paper Information
- arXiv ID: 2512.09907v1
- Categories: cs.CV
- Published: December 10, 2025