[Paper] Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement

Published: June 4, 2026 at 05:24 AM EDT

Source: arXiv - 2606.05920v1

Overview

Existing code-generation benchmarks score a single mapping from a complete prompt to a one-shot output. Real web development rarely works that way: users seldom write a full spec at the start, and many requirements only become clear once they see an intermediate result and react to it. We present Asuka-Bench, a benchmark that pairs underspecified user intent with multi-round refinement, grounded in browser-rendered behavior. Each task is resolved through a closed loop: a Code Agent generates a web project, a UI Agent executes test cases on the deployed site, and a User LLM turns the evaluation outcomes into natural-language feedback for the next round. The benchmark comprises 50 web tasks with 784 evaluation criteria and 2402 expected outcomes. We benchmark 8 LLMs across 2 agent frameworks. The results separate models clearly: weighted Task Pass Rate varies by 38 percentage points, and models also differ substantially in their ability to repair projects from feedback. Asuka-Bench is also far from saturated: even the strongest model completes only 52% of projects after three rounds.
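
The closed loop described above can be illustrated with a minimal Python sketch. This is not the paper's harness: the names `run_refinement_loop`, `Outcome`, and the three toy agents are hypothetical stand-ins, and stopping early once every criterion passes is an assumption. Only the three roles (Code Agent, UI Agent, User LLM) and the three-round budget come from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Outcome:
    """Result of checking one evaluation criterion (hypothetical structure)."""
    criterion: str
    passed: bool


def run_refinement_loop(
    intent: str,
    code_agent: Callable[[str, Optional[str]], str],
    ui_agent: Callable[[str], List[Outcome]],
    user_llm: Callable[[List[Outcome]], str],
    max_rounds: int = 3,
) -> List[List[Outcome]]:
    """Run generate -> test -> feedback rounds and return per-round outcomes."""
    history: List[List[Outcome]] = []
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        project = code_agent(intent, feedback)   # Code Agent builds the project
        outcomes = ui_agent(project)             # UI Agent runs the test cases
        history.append(outcomes)
        if all(o.passed for o in outcomes):      # assumed early-stop rule
            break
        feedback = user_llm(outcomes)            # User LLM writes feedback
    return history


# Toy stand-ins so the sketch runs without any model or browser.
def toy_code_agent(intent: str, feedback: Optional[str]) -> str:
    features = {"login form"}
    if feedback and "dark mode" in feedback:
        features.add("dark mode")
    return ",".join(sorted(features))


def toy_ui_agent(project: str) -> List[Outcome]:
    built = set(project.split(","))
    return [Outcome(c, c in built) for c in ("login form", "dark mode")]


def toy_user_llm(outcomes: List[Outcome]) -> str:
    missing = [o.criterion for o in outcomes if not o.passed]
    return "Please also add: " + ", ".join(missing)


history = run_refinement_loop(
    "Build me a login page", toy_code_agent, toy_ui_agent, toy_user_llm
)
for round_no, outcomes in enumerate(history, start=1):
    print(round_no, [(o.criterion, o.passed) for o in outcomes])
```

In this toy run the underspecified intent never mentions dark mode, so the first round fails that criterion and the requirement only surfaces through feedback, which is the situation the benchmark is designed to probe.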

Key Contributions

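Based on the abstract, the paper contributes:

A benchmark, Asuka-Bench, of 50 web tasks with 784 evaluation criteria and 2402 expected outcomes, built around underspecified user intent and multi-round refinement.
A closed-loop evaluation in which a Code Agent generates a web project, a UI Agent executes test cases on the deployed site, and a User LLM turns the outcomes into natural-language feedback.
An evaluation of 8 LLMs across 2 agent frameworks, showing a 38-percentage-point spread in weighted Task Pass Rate and substantial differences in the ability to repair from feedback.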

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

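The work falls under software engineering (cs.SE). Per the abstract, current code agents remain far from reliable on underspecified, iterative web tasks: even the strongest model evaluated completes only 52% of projects after three rounds.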

Authors

Xin Wang
Liangtai Sun
Yaoming Zhu
Shuang Zhou
Jiaxing Liu
Fengjiao Chen
Lin Qiu
Xuezhi Cao
Xunliang Cai
Licheng Zhang
Zhendong Mao

Paper Information

arXiv ID: 2606.05920v1
Categories: cs.SE, cs.CL
Published: June 4, 2026
