[Paper] AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility
Source: arXiv - 2606.13608v1
Overview
Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse agent designs. The root problem is the lack of an open, agent-agnostic assessment interface. We advocate Agentified Agent Assessment (AAA), where evaluation is performed by judge agents and all participants interact through standardized protocols: A2A for task management and MCP for tool access. Conventional benchmarking defines two separate interfaces, one for the benchmark and one for the agent, while AAA only needs one; this yields a generic, unified framework that separates assessment logic from agent implementation and enables reproducible, interoperable, and multi-agent evaluation. We further introduce AgentBeats as a concrete realization of AAA: we identify five practical operation modes that make standardized assessment compatible with real-world constraints on openness, privacy, and reproducibility. To evaluate our design at scale, we conduct two studies: a five-month open competition that drew 298 judge agents across 12 categories together with 467 subject agents from independent participants, showing that AAA applies across a heterogeneous range of benchmarks; and a case study on coding agents that confirms agentified evaluation preserves fidelity with the public record while surfacing previously missing head-to-head results, yielding research insights about agent design. Combining a community-scale field study and a controlled coding case study, we verify that AAA delivers coverage, practicality, and fidelity across heterogeneous scenarios at scale. Together, AAA and AgentBeats offer a clear path toward open, standardized, and reproducible agent assessment.
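To make the single-interface idea concrete, here is a minimal sketch, not the AgentBeats implementation. It assumes a hypothetical `Agent` interface with one `handle` method and uses a plain Python call in place of an A2A exchange; it uses no real A2A or MCP library. The names `JudgeAgent`, `AdderSubject`, and `ConstantSubject` are invented for illustration. The point it shows is that the judge owns the tasks and the scoring, and reaches every subject through the same interface.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple


class Agent(Protocol):
    """Hypothetical stand-in for a standardized agent endpoint: message in, reply out."""

    def handle(self, message: str) -> str: ...


class AdderSubject:
    """Toy subject agent that answers prompts of the form 'a+b'."""

    def handle(self, message: str) -> str:
        left, right = message.split("+")
        return str(int(left) + int(right))


@dataclass
class ConstantSubject:
    """Toy subject agent with a different design: always returns the same reply."""

    reply: str

    def handle(self, message: str) -> str:
        return self.reply


@dataclass
class JudgeAgent:
    """Toy judge agent: holds (prompt, expected) tasks and scores any Agent."""

    tasks: List[Tuple[str, str]]

    def assess(self, subject: Agent) -> float:
        correct = 0
        for prompt, expected in self.tasks:
            try:
                reply = subject.handle(prompt)
            except Exception:
                # A failing subject scores zero on the task instead of aborting the run.
                reply = ""
            if reply.strip() == expected:
                correct += 1
        return correct / len(self.tasks)


judge = JudgeAgent(tasks=[("1+2", "3"), ("10+5", "15"), ("7+8", "15")])

if __name__ == "__main__":
    for subject in (AdderSubject(), ConstantSubject("15")):
        print(type(subject).__name__, judge.assess(subject))
```

Because the judge depends only on `handle`, a new subject design can be assessed without changing the judge, which is the separation of assessment logic from agent implementation that the abstract describes.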
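Key Contributions
As stated in the abstract, the paper:
- Proposes Agentified Agent Assessment (AAA), in which judge agents perform the evaluation and all participants interact through standardized protocols (A2A for task management, MCP for tool access).
- Introduces AgentBeats as a concrete realization of AAA, with five practical operation modes that make standardized assessment compatible with real-world constraints on openness, privacy, and reproducibility.
- Reports a five-month open competition that drew 298 judge agents across 12 categories and 467 subject agents from independent participants.
- Presents a case study on coding agents showing that agentified evaluation preserves fidelity with the public record while surfacing previously missing head-to-head results.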
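Methodology
The evaluation pairs a community-scale field study with a controlled case study on coding agents, which the authors use to check coverage, practicality, and fidelity. Refer to the full paper for the detailed methodology.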
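Practical Implications
Conventional benchmarking defines separate interfaces for the benchmark and the agent, whereas AAA needs only one. According to the authors, this separates assessment logic from agent implementation, so diverse agent designs can be compared without heavy per-benchmark integration.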
Authors
- Xiaoyuan Liu
- Jianhong Tu
- Yuqi Chen
- Siyuan Xie
- Sihan Ren
- Tianneng Shi
- Gal Gantar
- Evan Sandoval
- Donghyun Lee
- Daniel Miao
- Peter J. Gilbert
- Nick Hynes
- Mauro Staver
- Warren He
- David Marn
- Andrew Low
- Xi Zhang
- Elron Bandel
- Michal Shmueli-Scheuer
- Siva Reddy
- Alexandre Drouin
- Alexandre Lacoste
- Ramayya Krishnan
- Elham Tabassi
- Yu Su
- Victor Barres
- Chenguang Wang
- Wenbo Guo
- Dawn Song
Paper Information
- arXiv ID: 2606.13608v1
- Categories: cs.AI, cs.LG
- Published: June 11, 2026